Jambe

Probabilistic Divergence Measures for Detecting Recombination with Markov Chain Monte Carlo

Version 1.0 alpha, September 2001

© Copyright 2001, Dirk Husmeier, Biomathematics and Statistics Scotland (BioSS).

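Input files
dna.dat, in, infile

Output files
resultsAllTopos.out, resultsStringToIntegerTranslator.out, results_histo.out, results_par.out, results_sample_size.out, results_ACT_Neff.out, results_kullback_mean.out, results_entropy.out, results_kullback_mode.out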

dna.dat
This file contains the DNA sequence alignment in the following format. The first line contains two integers: the first denotes the number of species, and the second the length of the alignment. For each species, a heading line gives the name of the species. This is followed by the actual sequence (of the specified length), which can be spread over any number of lines. For example, the file below contains an alignment of four species with a length of 100 nucleotides.

4 100
Species_1
GGAACTCAGTCCGTAACCAGCTAATTCTTCTATCAAGGTAGACTCCCGTC
TGGGTGGTTGAGCGCTGTCAAGCCAGCTGGCAATGCATTCAGGCGGCATC
Species_2
GGAACTCAATACGTAGTCAGCTAATGCTTCTATCAAGTTAGACTCCCGTC
TGGGTGATTGAGCGCTGTCAAGCCGACCGGCAATGCACGCACGCGGCATG
Species_3
GGAATTACATCTGTAGCCAGTCAATGCTTCGAACAAGTTTCACTGTTTTC
AAGTTGGTTGATCGCTTCCGAGCCGGCTTTCGACGCTTTCACTAAGCACC
Species_4
GGATCTCCGTTTGTAGCCAGTACATGCTTTCACCAACATACGCGTCTCCG
AAGTTGGTTGATCGCTTCCGAGCCGGCTTTCGACGCTTTCACTAAGCACC

Important: Do not change the name of this file!
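The format described above can be read with a few lines of code. The following is a minimal Python sketch, not part of JAMBE; the function name parse_alignment and the shortened two-species example are illustrative assumptions.

```python
def parse_alignment(text):
    """Parse the dna.dat layout: a header line 'n_species length', then for
    each species a name line followed by one or more sequence lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_species, length = (int(tok) for tok in lines[0].split())
    alignment = {}
    pos = 1
    for _ in range(n_species):
        name = lines[pos]
        pos += 1
        chunks = []
        # Collect sequence lines until the declared length is reached.
        while sum(len(c) for c in chunks) < length:
            chunks.append(lines[pos])
            pos += 1
        seq = "".join(chunks)
        if len(seq) != length:
            raise ValueError(f"{name}: expected {length} bases, got {len(seq)}")
        alignment[name] = seq
    return alignment


# Hypothetical miniature alignment: 2 species, 8 nucleotides each.
example = """2 8
Species_1
GGAA
CTCA
Species_2
GGAATTCA
"""
alignment = parse_alignment(example)
print(alignment)
```

Because the sequence length is known from the first line, the sketch does not need a separator between species; it simply reads lines until the declared number of bases has been collected.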


in
This file specifies the length of the moving window and the step size with which it is moved along the DNA sequence alignment.

Example:

1 500 10

A window of length 500 bases is moved along the alignment with a step size of 10 bases. Leave the first number unchanged.
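To illustrate what these two settings imply, here is a minimal Python sketch that lists the window positions for a given alignment length. It is not JAMBE's own code: the function names are hypothetical, and it assumes that the first window starts at the first column and that only complete windows are used, which this page does not state.

```python
def parse_window_spec(line):
    """Read the three integers of file 'in'; the first one is passed through unchanged."""
    first, window_length, step_size = (int(tok) for tok in line.split())
    return first, window_length, step_size


def window_ranges(alignment_length, window_length, step_size):
    """Return (start, end) column ranges, 0-based and end-exclusive.

    Assumption: the first window starts at column 0 and only full windows are kept.
    """
    last_start = alignment_length - window_length
    return [(s, s + window_length) for s in range(0, last_start + 1, step_size)]


_, window_length, step_size = parse_window_spec("1 500 10")
# Hypothetical alignment of 530 columns.
ranges = window_ranges(530, window_length, step_size)
print(ranges)
```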


infile
The standard BAMBE run control file, which defines the settings of the MCMC simulation. Important: Do not change the name of this file! Also, the following lines in this file must not be changed:

file-root=run1 # root name for output files

data-file=dna.in # file name

JAMBE can deal with Version 1.01 and Version 2.02 of BAMBE. (Other versions might work, but I have not tested them.) Here are standard forms of the BAMBE run control file infile:

infile, BAMBE Version 1.01
infile, BAMBE Version 2.02
infile, BAMBE, most recent version

Note that the most recent version might not be supported by JAMBE. Also, make sure that you understand the various options of these files and do not use the default values blindly! Click here to get the version that I used for the example.

JAMBE reads in a DNA sequence alignment from file dna.dat and writes out moving windows along the alignment to file dna.in. The subalignments in dna.in are read in by BAMBE, which writes its results to various files starting with the prefix run1. These files are read in by JAMBE for the computation of the entropy and the probabilistic divergence measures.


resultsAllTopos.out
This file contains a record of all the topology strings written out during the moving-window MCMC simulation, and is read in by subsequent analysis programs, like JambeAnalyseTopos.java.


resultsStringToIntegerTranslator.out
Program JambeAnalyseTopos.java translates the topology strings of file resultsAllTopos.out to integer numbers. This file, resultsStringToIntegerTranslator.out, contains a translation table. Here is an example for four taxa:

1 : (1,(2,(3,4)))
2 : (1,((2,4),3))
3 : (1,((2,3),4))
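If you want to use the translation table in your own scripts, it can be loaded as follows. This is a minimal Python sketch based only on the example above; the function name parse_translation_table is an illustrative assumption.

```python
def parse_translation_table(text):
    """Map each integer label to its topology string ('index : topology' per line)."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        index, topology = line.split(":", 1)
        table[int(index)] = topology.strip()
    return table


example = """1 : (1,(2,(3,4)))
2 : (1,((2,4),3))
3 : (1,((2,3),4))
"""
table = parse_translation_table(example)
print(table[3])
```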


results_histo.out
This file, generated by Java class JambeAnalyseTopos.java, contains the posterior probabilities of the topologies for the different window positions. Here is an example:

0.897 0.047 0.056
0.899 0.045 0.056
0.525 0.051 0.424
0.555 0.031 0.414
0.549 0.036 0.415
0.479 0.056 0.465

The rows represent different window positions. The difference between the centres of two adjacent windows is given by the step size, which can be looked up in results_par.out. The columns represent topologies. Here, we have three topologies, which are numbered from 1 to 3 as we go from left to right. To see which topologies these numbers represent, have a look at resultsStringToIntegerTranslator.out.
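The row and column layout described above can be processed as in the following minimal Python sketch, which reads the example and reports the most probable topology for each window. The function names are illustrative assumptions and are not part of JAMBE.

```python
def read_histogram(text):
    """One row per window position, one posterior probability per topology."""
    return [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


def most_probable_topologies(rows):
    """Return the 1-based column index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in rows]


example = """0.897 0.047 0.056
0.899 0.045 0.056
0.525 0.051 0.424
0.555 0.031 0.414
0.549 0.036 0.415
0.479 0.056 0.465
"""
rows = read_histogram(example)
best = most_probable_topologies(rows)
for window, (topology, row) in enumerate(zip(best, rows), start=1):
    print(window, topology, row[topology - 1])
```

In this example topology 1 remains the most probable in every window, but the probability of topology 3 rises sharply from the third row onwards. A change of this kind between windows is what the divergence measures are designed to quantify.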


results_par.out
Contains the window size (first integer) and the step size (second integer) of the moving window that is slid along the alignment. This is to remind you which options you have chosen. Example:

500
10

This indicates that a window of length 500 bases has been moved along the alignment in step sizes of 10 bases.


results_sample_size.out
Contains one integer representing the MCMC sample size. This value is needed for the statistical significance test, since the chi-squared statistic depends on it. This file is generated with the Java class BambeInfile.


results_ACT_Neff.out
In an MCMC simulation, successive configurations are usually highly correlated. Consequently, the effective sample size, Neff, is smaller than the total sample size, N. An approximate formula is Neff = N/(2 tau), where tau is the autocorrelation or relaxation time (i.e. the time after which the autocorrelation function has decayed to 1/e of its initial value).

This file, which is generated with AutoCorrelationTime.java, contains a line with five numbers, e.g.:

112 114 500 2.182 0.045

The first two numbers give the effective sample size, Neff, computed from the exact formula (112) and from the approximate formula shown above (114), respectively. The third number shows the total sample size, N. Finally, the last two numbers show the average relaxation time tau (2.182) and its standard error (0.045).
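As a quick check of the approximate formula, the following minimal Python sketch parses the example line and recomputes Neff = N/(2 tau). The function names and dictionary keys are illustrative assumptions; the exact formula used by AutoCorrelationTime.java is not reproduced here.

```python
def parse_act_line(line):
    """Split the five fields of results_ACT_Neff.out into named values."""
    neff_exact, neff_approx, n_total, tau, tau_err = line.split()
    return {
        "neff_exact": int(neff_exact),
        "neff_approx": int(neff_approx),
        "n_total": int(n_total),
        "tau": float(tau),
        "tau_err": float(tau_err),
    }


def approximate_neff(n_total, tau):
    """Approximate effective sample size, Neff = N / (2 tau)."""
    return n_total / (2.0 * tau)


fields = parse_act_line("112 114 500 2.182 0.045")
neff = approximate_neff(fields["n_total"], fields["tau"])
print(neff, int(neff))
```

With N = 500 and tau = 2.182 the formula gives about 114.6, which truncates to the value 114 shown in the second field.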


results_kullback_mean.out
Contains the Kullback-Leibler divergence between a local posterior distribution (conditional on the current window) and the mean posterior distribution (an average over all local posterior distributions). This is the first of two divergence measures discussed in Bioinformatics 17, Suppl. 1, S123-S131. The second divergence measure discussed in this paper has not yet been implemented in Java, but can be computed with the additional MATLAB program provided.
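To make the idea concrete, here is a minimal Python sketch that computes a Kullback-Leibler divergence between each local distribution and the mean distribution, using the example values from results_histo.out. It is an illustration only and rests on assumptions: the divergence is taken from the local to the mean distribution, sum_k P_local(k) ln(P_local(k) / P_mean(k)), and natural logarithms are used. The paper cited above gives the exact definition that JAMBE implements.

```python
import math


def mean_distribution(rows):
    """Average the local posterior distributions column by column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def kl_divergence(p, q):
    """sum_k p_k ln(p_k / q_k); terms with p_k == 0 contribute nothing."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


# Local posterior distributions, one row per window (example from results_histo.out).
rows = [
    [0.897, 0.047, 0.056],
    [0.899, 0.045, 0.056],
    [0.525, 0.051, 0.424],
    [0.555, 0.031, 0.414],
    [0.549, 0.036, 0.415],
    [0.479, 0.056, 0.465],
]
mean = mean_distribution(rows)
# Assumed direction: local distribution first, mean distribution second.
divergences = [kl_divergence(row, mean) for row in rows]
for window, d in enumerate(divergences, start=1):
    print(window, round(d, 4))
```

The divergence is zero when a local distribution coincides with the mean and grows as the local topology probabilities move away from it.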


results_entropy.out, results_kullback_mode.out
These are old, obsolete files that give you the entropy and the Kullback-Leibler divergence between a local posterior distribution (conditional on the current window) and the mode of the posterior distribution averaged over all local posterior distributions. However, both measures turned out to be rather unreliable for detecting recombination, and I no longer recommend using them.


Last modified: February 2002