ABBESS

Approximation of the Basic Bayesian Evidence for Sequence Segmentation

Version 1.0 alpha, March 2001

© Copyright 2001, Dirk Husmeier, Biomathematics and Statistics Scotland (BioSS).

ABBESS is a free software package for the approximation of the Bayesian evidence for a DNA sequence alignment segmentation. The programs are written in MATLAB. The software is provided without guarantee of maintenance or support, and without warranty. The copyright holder is not liable for any damages which may result in any manner from the use of this software.


Downloading and Installing the Software

The programs and documentation are distributed as two gzipped tar archives, ABBESS.tar.gz and Manual.tar.gz. To extract them under UNIX, give the following commands:

% gunzip ABBESS.tar.gz
% tar xvf ABBESS.tar
% gunzip Manual.tar.gz
% tar xvf Manual.tar

Under Windows, use a program such as WinZip. Choose a directory in which to keep the extracted files, and add this directory to your MATLAB path. For example, under UNIX with csh or tcsh, put the following command into your .login file:

setenv MATLABPATH new_path

where new_path is the name of the directory into which you have copied the program files. The MATLAB programs analyse output files from BAMBE, so you must install and compile BAMBE first.
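For users of a Bourne-type shell (sh, bash), the setenv line has a direct equivalent. The following is a minimal sketch only; the directory $HOME/abbess and the variable name ABBESS_DIR are hypothetical and not prescribed by the package.

```shell
# Hypothetical install location: replace with the directory holding the ABBESS .m files.
ABBESS_DIR="$HOME/abbess"

# MATLAB reads MATLABPATH at startup. Prepend the new directory so that any
# value already set elsewhere is kept.
if [ -n "${MATLABPATH:-}" ]; then
    MATLABPATH="$ABBESS_DIR:$MATLABPATH"
else
    MATLABPATH="$ABBESS_DIR"
fi
export MATLABPATH
```

With sh or bash, these lines would typically go into ~/.profile or ~/.bashrc instead of .login.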


Programs in the Software Package

The software package contains the following programs:

Further information about these functions and their different invocations can be obtained by typing help followed by the program name at the MATLAB prompt, for example help BayesSegmentationSelection.


Running the Programs

The MATLAB function BayesSegmentationSelection computes the Bayesian evidence for a segment of a DNA sequence alignment. The program reads in the file with the DNA sequence alignment (optional, but advisable) as well as the output, parameter, and summary files created with BAMBE. The default names are

DNA sequence file       dna.dat
BAMBE output file       run1.out
BAMBE parameter file    run1.par
BAMBE summary file      run1.summary

It is therefore advisable to put the following line into the BAMBE control file (otherwise you are prompted for the file names on every invocation of BayesSegmentationSelection):

file-root = run1

and then to call the BAMBE command summarize as follows:

summarize -n T run1.top > run1.summary

Here, T is an integer giving the number of topologies you want to skip (usually those of the equilibration phase).

BayesSegmentationSelection can deal with both versions of BAMBE, BAMBE-1 and BAMBE-2. It also detects automatically whether or not a molecular clock constraint has been applied. There are several possible input arguments:

  1. No input, e.g. BayesSegmentationSelection. You are prompted for the version of BAMBE, the names of the BAMBE MCMC files, and whether you want to neglect non-informative sites. If you do, you are also prompted for the name of the file with the DNA sequence alignment.
  2. One input, e.g. BayesSegmentationSelection(2). The argument gives the version of BAMBE and must be either 1 or 2. The default file names for the BAMBE MCMC files and the DNA sequence alignment file are used, and non-informative sites are discarded.
  3. Two inputs, e.g. BayesSegmentationSelection(2,1). The first argument gives the version of BAMBE, 2 in this case. The second argument tells the program what to do with the non-informative sites. If it is 1, as in the example above, non-informative sites are discarded; to this end, the DNA sequence alignment is read from the file dna.in. If it is 0, non-informative sites are included in the count (not advisable). The default file names are used.
  4. Three inputs, e.g. BayesSegmentationSelection(2,1,1). The same as option 3, except that the presence of a third input (whatever its value) makes the program prompt you for the names of the BAMBE MCMC files and, if you want to discard non-informative sites, also for the name of the DNA sequence file.

If you want to discard non-informative sites (which is advisable), the file with the DNA sequence alignment must satisfy certain format specifications, which are explained using the following example of a valid input file:

3 153
singa90
ATGAACAACCAACGAAAAAAGACGGCTCGACCGTCTTTCAATATGCTGAA
ACGCGCGAGAAACCGCGTGTCAACTGGTTCACAGTTGGCGAAGAGATTCT
CAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGTGATGGCTTTC
ATA
thail80
ATGAACAACCAACGGAAAAAGACGGGTAACCCGTCTTTCAATATGCTGAA
ACGCGCGAGAAACCGCGTGTCAACTGGTTCACAGCTGGCGAAAAGATTCT
CAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGTGATGGCTTTC
GTA
phili84
ATGAACAACCAACGGAAAAAGACGGGTCGACCGTCTTTCAATATGCTGAA
ACGCGCGAGAAACCGCGTGTCAACTGTTTCACAGTTGGCGAAGAGATTCT
CAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGTAATGGCTTTT
ATA

The first line specifies the number of taxa (3) and the length of the DNA sequences (153). The next line gives the name of the first species or strain (singa90). The DNA sequence follows on the subsequent lines. The only permissible symbols are A, C, G, T, and U; gaps are not allowed. The sequence lines must not contain blanks, and each must contain exactly 50 characters, except for the last line of a sequence, which holds the remainder. The name of the next species or strain (thail80) follows, and so on.
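The format rules above can be checked before starting MATLAB. The following shell function is a rough sketch of such a check, written with awk; it is not part of ABBESS, the name check_alignment is hypothetical, and the parser inside BayesSegmentationSelection may be more or less tolerant than this (for instance regarding trailing blank lines).

```shell
# check_alignment FILE
# Exit status 0 if FILE follows the layout described above, non-zero otherwise.
check_alignment() {
    awk '
    function fail(msg) {
        printf "line %d: %s\n", NR, msg > "/dev/stderr"
        bad = 1
        exit 1
    }
    NR == 1 {                          # header: number of taxa, sequence length
        if (NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/)
            fail("expected: <number of taxa> <sequence length>")
        ntaxa = $1 + 0; len = $2 + 0; need = 0; seen = 0
        next
    }
    need == 0 {                        # a species or strain name is expected
        if (seen == ntaxa) fail("more taxa than declared in the first line")
        if ($0 == "") fail("empty name line")
        seen++; need = len
        next
    }
    {                                  # a line of sequence symbols
        if ($0 !~ /^[ACGTU]+$/) fail("only A, C, G, T, U are allowed (no blanks, no gaps)")
        n = length($0)
        want = (need > 50) ? 50 : need # full lines hold 50 symbols, the last one the rest
        if (n != want) fail("expected " want " symbols, found " n)
        need -= n
    }
    END {
        if (bad) exit 1
        if (NR == 0 || seen != ntaxa || need != 0) {
            print "alignment is incomplete" > "/dev/stderr"
            exit 1
        }
    }' "$1"
}
```

A typical call would be check_alignment dna.in && echo "format looks valid"; on failure, the offending line number is reported on standard error.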


Test Datasets

The software has been tested on various synthetic and real-world DNA sequence alignments. These alignments are included in the file Examples.tar.gz. To extract this file under UNIX, give the following commands:

% gunzip Examples.tar.gz
% tar xvf Examples.tar

On a successful extraction, you should find the following directories:

synthetic_w1      Synthetic data, unit branch length w=0.1
synthetic_w075    Synthetic data, unit branch length w=0.075
synthetic_w05     Synthetic data, unit branch length w=0.05
synthetic_w025    Synthetic data, unit branch length w=0.025
synthetic_w01     Synthetic data, unit branch length w=0.01
neisseria         Neisseria
dengue            Dengue virus
hepatitisB        Hepatitis-B virus

Each directory contains the following files:
dna.dat           The complete DNA sequence alignment
infile            The BAMBE control file
job.m             MATLAB job to run BayesSegmentationSelection on the different segments
job_analyse.m     MATLAB job to combine the results obtained with job.m so as to get the evidence scores for the different segmentations
Subdirectories    The subdirectories contain segments of the total alignment

The subdirectories contain the following files:
dna.in            Segment of the DNA sequence alignment
infile            BAMBE control file, identical to the file in the parent directory


Applying the Software to the Test Datasets

To run the analysis, proceed as follows:
  1. Run the MCMC simulations in each subdirectory. If you have properly installed BAMBE with the required modification of the PATH variable (see BAMBE documentation), just give the command:

    bambe < infile

  2. Run the BAMBE command summarize in each subdirectory. On the synthetic data sets, give the command

    summarize -n 120 run1.top > run1.summary

    On the real-world sequence alignments, give the command

    summarize -n 240 run1.top > run1.summary

    The number after the -n option specifies the number of discarded configurations. It has been set such that the whole equilibration period and the first ten percent of the sampling period are discarded. (Discarding the beginning of the sampling period seems reasonable because the MOVE type of the proposal distributions is changed from GLOBAL to LOCAL; see infile.)

  3. Return to the parent directory, start MATLAB, and give the command

    >> job

    This command goes into each subdirectory and applies BayesSegmentationSelection to each segment of the alignment.

  4. Finally, give the command

    >> job_analyse

    (again at the MATLAB prompt). This combines the partial results obtained with the previous command to obtain the evidence score for each of the different candidate segmentations.
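
The four steps above can also be scripted from the parent directory. The following is a minimal sketch for a POSIX shell, not part of the package. The function names run_segments and run_matlab_jobs are hypothetical, the file names run1.top and run1.summary assume that infile sets file-root = run1, and the MATLAB startup options -nodisplay and -r are assumed to be supported by the installed MATLAB version.

```shell
# run_segments N [DIR]
# Steps 1 and 2: in every subdirectory of DIR (default: current directory) that
# contains an infile, run bambe and then summarize, skipping the first N topologies.
run_segments() {
    skip=$1
    parent=${2:-.}
    for dir in "$parent"/*/; do
        [ -f "${dir}infile" ] || continue      # ignore directories without a control file
        (
            cd "$dir" &&
            bambe < infile &&
            summarize -n "$skip" run1.top > run1.summary
        ) || { echo "failed in $dir" >&2; return 1; }
    done
}

# run_matlab_jobs [DIR]
# Steps 3 and 4: run job and job_analyse from DIR without an interactive session.
# Assumption: the local MATLAB accepts the -nodisplay and -r startup options.
run_matlab_jobs() {
    ( cd "${1:-.}" && matlab -nodisplay -r "job; job_analyse; exit" )
}
```

For a synthetic data set, this would be called as run_segments 120 && run_matlab_jobs; for the real-world alignments, 240 replaces 120, matching the summarize commands given in step 2.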


Last modified: March 2001