SERAD

Searching for Evidence of Recombination in Alignments of DNA

Version 1.0 alpha, May 2000

© Copyright 2000, Dirk Husmeier, Biomathematics and Statistics Scotland (BioSS).

SERAD is a free software package for the detection of recombination in multiple DNA sequence alignments with hidden Markov models. The algorithm is described in the following paper:

The programs are written in MATLAB. I have made the source code readable and sufficiently documented so that others can use it as a teaching tool, to add and explore their own algorithms, and to combine existing algorithms in novel ways.

Note: Click here for a more recent software implementation of a related algorithm, which has recently been incorporated in the more user-friendly softeware package TOPALi.


Table of Contents


Legal Agreement

The SERAD software package is provided without guarantee of maintenance or support, and without warranty. The copyright holder is not liable for any damages which may result in any manner from the use of this software. Permission is granted to copy and use this software package for personal academic use.

Downloading and Installing the Software

The programs and documentation files are contained in a zipped tar file, called SERAD.tar.gz. To extract them under UNIX, give the following commands:

% gunzip SERAD.tar.gz
% tar xvf SERAD.tar

If you use WINDOWS, use a program like WINZIP. Decide on a name for the directory in which you want to keep the new files, and add this name to your MATLAB path. For example, if you use UNIX, put the following command into your .login file:

setenv MATLABPATH new_path

where new_path is the name of the directory into which you have copied the program files.


Running the Programs

To run the software, carry out the following two steps:

Data

The following data files are included in the package:

DataNeisseria.dat Ascii file with a 787 base-pair alignment of four strains of Neisseria
make_data.m MATLAB program for generating the synthetic data. This gives the three files listed below.
Data.mat Synthetic DNA sequence alignments
Data_true_states.mat True states of the synthetic DNA sequence alignments
Data_true_w.mat True branch lengths for the synthetic DNA sequence alignments

Your own data must be in file mydata.dat.


Overview of the Software Package and the Program Files

The programs are written in MATLAB 5, making use of the object-oriented elements of the language. Functions belonging to a given class are kept in a directory @Object_Name . There are currently three such classes. @Tree4 is the class of 4-species trees. @Phylo is the class of emission probabilities of the HMM, which is the set of all possible @Tree4-classes (note that for 4-species trees there are 3 different topologies). @EmissionModel is an alternative class of emission probabilities, simulating a fair and a loaded die. To find out about the methods of each class, give the command methods Object_Name. The backbone of the HMM has (unfortunately!) not been programmed as an object. Programs acting on the HMM backbone, like those carrying out the forward-backward algorithm and the Viterbi path, make extensive use of global variables (follow this link to find a complete list of all the global variables used). It would have been better to program the HMM backbone as an object, as this would make the code more modular and would make it easier to add new algorithms or to combine algorithms in novel ways. For this reason you will find some functions that require a global variable to be passed to it in the argument list. This is a first step towards rewriting part of the code, should I ever have the time to do it ... For finding out more about the functions, make use of the help command.

Future Work

The main shortcoming of the program package in its current version is that the branch-length optimization is slow, much slower than, for instance, with DNAML of the PHYLIP package. This has two reasons. First, I wrote the programs as a model exploration tool and was mainly interested in a clear and transparent program code. I have not made any attempts to optimize the programs with respect to execution time, and I assume a lot could be gained with a better vectorization of the loops. Second, the iterative optimization of the branch lengths follows a simple gradient ascent scheme. I did not find any significant improvements by using second-order methods . However, I think that a considerable speed-up can be achieved by incorporating DNAML of the PHYLIP package in my programs. The only modification required would be a small extension of DNAML so as to include a weighting scheme for the different sites in the alignment. I would be grateful for any comments if you know of a way how this can easily be accomplished.

Back to my homepage