ABBESS is a free software package for the approximation of the Bayesian evidence for a DNA sequence alignment segmentation. The programs are written in MATLAB. The software is provided without guarantee of maintenance or support, and without warranty. The copyright holder is not liable for any damages which may result in any manner from the use of this software.
The package is distributed as two archives: ABBESS.tar.gz and Manual.tar.gz.
To extract them under UNIX,
give the following commands:
% gunzip ABBESS.tar.gz
% tar xvf ABBESS.tar
% gunzip Manual.tar.gz
% tar xvf Manual.tar
If you use WINDOWS,
use a program like WINZIP. Decide on a name for the
directory in which you want to keep the new files,
and add this name to your MATLAB path. For example,
if you use UNIX, put the following
command into your .login file:
setenv MATLABPATH new_path
where new_path
is the name of the directory into which you have copied
the program files.
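The setenv line above is csh/tcsh syntax. For users of a Bourne-style shell (sh, bash), a minimal equivalent sketch is given below. As above, new_path is a placeholder for your own directory; which startup file the lines belong in (for example ~/.profile) depends on your shell setup and is an assumption here.

```shell
# Bourne-shell (sh/bash) counterpart of the csh "setenv" line.
# "new_path" is a placeholder for the directory holding the ABBESS program files.
MATLABPATH=new_path
export MATLABPATH
```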
The MATLAB programs analyse output files from
BAMBE,
so it is necessary that you install and compile this software
package first.
Programs in the Software Package
The software package contains the following programs:
BayesSegmentationSelection
This is the main MATLAB function for the analysis, which calls the three functions below.
EntropyTopoFromBambe
Computes the contribution of the marginal distribution over tree topologies to the overall entropy.
SampleSizeEffectiveACF
Computes the effective size of an MCMC sample by correcting for the autocorrelation effect.
NumberOfInformativeSites
Computes the number of informative sites in the DNA sequence alignment. This requires the DNA sequence file to satisfy certain format specifications.
Further information about these functions and their different
invocations can be obtained by typing
help followed by the function name at the
MATLAB prompt, for example help BayesSegmentationSelection.
Running the Programs
The MATLAB function BayesSegmentationSelection
computes the Bayesian evidence for a segment of a DNA
sequence alignment.
The program reads in the
file with the DNA sequence alignment (optional, but advisable)
as well as
the output, parameter, and summary files
created with
BAMBE.
The default names are
| DNA sequence file | dna.dat |
| BAMBE output file | run1.out |
| BAMBE parameter file | run1.par |
| BAMBE summary file | run1.summary |
It is therefore advisable to put the
following line into the BAMBE control file
(otherwise you get prompted for the file names on
every invocation of BayesSegmentationSelection):
file-root = run1
and then to call the BAMBE command summarize
as follows:
summarize -n T run1.top > run1.summary
Here, T is an integer value that indicates
the number of topologies you want to skip (usually those
of the equilibration phase).
BayesSegmentationSelection can deal with both
versions of BAMBE, BAMBE-1 and BAMBE-2.
It also detects automatically whether or not a
molecular clock constraint has been applied.
BayesSegmentationSelection can be called in four ways:
BayesSegmentationSelection
No arguments. You are prompted for the version of BAMBE, the names of the BAMBE MCMC files, and whether you want to neglect non-informative sites. If you do, you are also prompted for the name of the file with the DNA sequence alignment.
BayesSegmentationSelection(2)
The argument indicates the version of BAMBE and must be either 1 or 2. The default file names for the BAMBE MCMC files and the DNA sequence alignment file are used, and non-informative sites are neglected.
BayesSegmentationSelection(2,1)
The first argument indicates the version of BAMBE, 2 in this case. The second argument tells the program what to do with the non-informative sites. If it is 1, as in this example, non-informative sites are discarded; to this end, the DNA sequence alignment is read from the file dna.in. If it is 0, non-informative sites are included in the count (not advisable). The default file names are used.
BayesSegmentationSelection(2,1,1)
The same as BayesSegmentationSelection(2,1), except that the presence of a third argument (whatever its value) makes the program prompt you for the names of the BAMBE MCMC files and, if you want to discard non-informative sites, also for the name of the DNA sequence file.
If you want to discard non-informative sites (which is advisable), the file with the DNA sequence alignment must satisfy certain format specifications, which are explained using the following example of a valid input file:
3 153
singa90
ATGAACAACCAACGAAAAAAGACGGCTCGACCGTCTTTCAATATGCTGAA
ACGCGCGAGAAACCGCGTGTCAACTGGTTCACAGTTGGCGAAGAGATTCT
CAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGTGATGGCTTTC
ATA
thail80
ATGAACAACCAACGGAAAAAGACGGGTAACCCGTCTTTCAATATGCTGAA
ACGCGCGAGAAACCGCGTGTCAACTGGTTCACAGCTGGCGAAAAGATTCT
CAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGTGATGGCTTTC
GTA
phili84
ATGAACAACCAACGGAAAAAGACGGGTCGACCGTCTTTCAATATGCTGAA
ACGCGCGAGAAACCGCGTGTCAACTGTTTCACAGTTGGCGAAGAGATTCT
CAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGTAATGGCTTTT
ATA
The first line specifies the number of taxa (3) and the length of the DNA sequences (153). The next line states the name of the first species or strain (singa90). The DNA sequence is given in the following lines. The only permissible symbols are A, C, G, T, and U; gaps are not allowed. There must be no blanks in the lines, and each line must contain exactly 50 characters, except for the last line of each sequence, which holds the remaining characters. After that line, the name of the next strain or species (thail80) is given, and so on.
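The format rules above can be checked mechanically before starting MATLAB. The following is a minimal sketch and not part of ABBESS: check_alignment is a hypothetical helper name, and two points are assumptions read off the example rather than confirmed behaviour of the software, namely that only upper-case symbols are accepted and that empty lines are not tolerated.

```shell
# check_alignment FILE
# Exit status 0 if FILE follows the layout described above, 1 otherwise.
# Hypothetical helper; assumes upper-case symbols and no empty lines.
check_alignment() {
    awk '
        BEGIN { bad = 0 }
        NR == 1 {                       # header: number of taxa, sequence length
            if (NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/) {
                print "line 1: expected <taxa> <length>"; bad = 1; exit
            }
            ntaxa = $1; len = $2; got = len; next
        }
        got == len {                    # previous sequence complete: this is a name
            name = $0; seen++; got = 0; next
        }
        {                               # sequence line
            want = (len - got < 50) ? len - got : 50
            if ($0 !~ /^[ACGTU]+$/) {
                print "line " NR ": invalid symbol or blank (" name ")"
                bad = 1; exit
            }
            if (length($0) != want) {
                print "line " NR ": expected " want " characters, found " length($0)
                bad = 1; exit
            }
            got += length($0)
        }
        END {
            if (!bad && (NR == 0 || seen != ntaxa || got != len)) {
                print "header and contents disagree (taxa found: " seen + 0 ")"
                bad = 1
            }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

A typical call would be check_alignment dna.dat && echo "format ok". On failure the function prints the first offending line number and returns a non-zero status.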
The example data sets are provided in the file Examples.tar.gz.
To extract this file under UNIX,
give the following commands:
% gunzip Examples.tar.gz
% tar xvf Examples.tar
On a successful extraction, you should find the following directories:
| synthetic_w1 | Synthetic data, unit branch length w=0.1 |
| synthetic_w075 | Synthetic data, unit branch length w=0.075 |
| synthetic_w05 | Synthetic data, unit branch length w=0.05 |
| synthetic_w025 | Synthetic data, unit branch length w=0.025 |
| synthetic_w1 | Synthetic data, unit branch length w=0.01 |
| neisseria | Neisseria |
| dengue | Dengue virus |
| hepatitisB | Hepatitis-B virus |
Each directory contains the following files:
| dna.dat | The complete DNA sequence alignment |
| infile | The BAMBE control file |
| job.m | MATLAB job to run BayesSegmentationSelection on the different segments |
| job_analyse.m | MATLAB job to combine the results obtained with job.m so as to get the evidence scores for the different segmentations |
| Subdirectories | The subdirectories contain segments of the total alignment |
The subdirectories contain the following files:
| dna.in | Segment of the DNA sequence alignment |
| infile | BAMBE control file, identical to the file in the parent directory |
To run an example, first run BAMBE and then summarize in each subdirectory. BAMBE is started with
bambe < infile
On the synthetic data sets, give the command
summarize -n 120 run1.top > run1.summary
On the real-world sequence alignments, give the command
summarize -n 240 run1.top > run1.summary
The number after the option -n
specifies the number of discarded configurations.
It has been set such that the whole equilibration period
and the first ten percent of the sampling period are
discarded. (Discarding the beginning of the sampling
period seems reasonable because the MOVE type
of the proposal distributions is changed from
GLOBAL to
LOCAL; see infile.)
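Once BAMBE and summarize have finished, start MATLAB in the example directory and type the following at the MATLAB prompt:
>> job
This command goes into each subdirectory
and applies BayesSegmentationSelection
to each segment of the alignment.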
Finally, type
>> job_analyse
(again at the MATLAB prompt). This combines the partial results obtained with the previous command to obtain the evidence score for each of the different candidate segmentations.