Detecting Recombination
in DNA Sequence Alignments:
A Comparison between
Maximum Likelihood and
Markov Chain Monte Carlo
Dirk Husmeier
and
Grainne McGuire
Biomathematics and Statistics Scotland (BioSS)
SCRI, Dundee DD2 5DA, United Kingdom
March 2002
For a statistical method to detect recombination in
DNA sequence alignments, we compare two methods of
inference:
The Expectation Maximization (EM) algorithm, to optimize
the parameters in a maximum likelihood (ML) sense, and
Markov chain Monte Carlo (MCMC), whereby the parameters
are sampled in a Bayesian sense from the posterior distribution.
A simulation study suggests that while both methods
find the true mosaic structure with an accurate location of
the recombinant breakpoints, the MCMC approach gives a
better estimation
of the uncertainty of the prediction, as expressed
by the posterior probability of the tree topologies.
We explore a statistical method for detecting
recombination and gene conversion in DNA sequence alignments,
which is based on combining two probabilistic graphical
models:
(1) a taxon graph (phylogenetic tree) representing the
relationship between the taxa or genes,
and (2) a site graph (hidden Markov model)
representing interactions between different
sites in the DNA sequence alignments.
The objective of this study is to compare two methods of
inference:
The Expectation Maximization (EM) algorithm, to optimize
the parameters in a maximum likelihood (ML) sense, and
Markov chain Monte Carlo (MCMC), whereby the parameters
are sampled in a Bayesian sense from the posterior distribution.
-
Parameter estimation in a maximum likelihood sense with the
EM algorithm has been implemented in the software package
SERAD.
-
Sampling the parameters with MCMC from the posterior
distribution in a Bayesian sense has been implemented in
the software package
BARCE.
We simulated recombination in a
population of 4 taxa with the program
SIRENS, as described
here
(using a tree height of 0.2), and selected
the region between sites 400 and 800.
This gives us a 400-nucleotide alignment with one
recombination event, which corresponds to a
transition from topology 1 (for the 200 sites on the left)
to topology 3 (for the 200 sites on the right).
Click here for a reminder of the three
possible tree topologies.
We now take the same alignment without recombination,
that is, where the whole alignment has been generated
from topology 1, and randomly replace a certain
percentage of columns of the first alignment
by those of the second.
This is illustrated in the following figure:
The effect is that the probability for topology
3 in the right part of the alignment is smaller than 1,
and we want to test whether this is correctly predicted
by the detection method. We generated two alignments
with the MATLAB program
make_data.m.
In the first alignment,
seq10.phy,
10 percent of the sites in the second mosaic patch
(between sites 201 and 400) have been replaced,
hence the true probability for topology 3 is 0.9.
In the second alignment,
seq20.phy,
20 percent of the sites in the second mosaic patch
have been replaced,
giving a true probability for topology 3 of 0.8.
For optimizing the parameters of the model in a maximum likelihood
sense with the EM algorithm, we used the program
SERAD. For sampling the parameters in a Bayesian sense
with the MCMC algorithm, we used the program
BARCE. Further details are available
here.
The results are shown in the following figures:
We averaged the posterior probabilities over the
last 100 sites.
The predictions for the alignment
seq10.phy,
where 10 percent of the sites in the
recombinant region have been exchanged,
are:
| Average posterior probability for topology: |
1 | 2 | 3 |
| True |
0.100 | 0.000 | 0.900 |
| Predicted with ML |
0.000 | 0.000 | 1.000 |
| Predicted with MCMC |
0.098 | 0.000 | 0.902 |
The predictions for the alignment
seq20.phy,
where 20 percent of the sites in the
recombinant region have been exchanged,
are:
| Average posterior probability for topology: |
1 | 2 | 3 |
| True |
0.200 | 0.000 | 0.800 |
| Predicted with ML |
0.001 | 0.001 | 0.998 |
| Predicted with MCMC |
0.185 | 0.000 | 0.815 |
In a second simulation experiment, we studied the
effect of rate heterogeneity.
We simulated the evolution process along the
branches of a four-species
phylogenetic tree with the Kimura model,
but reduced the rate of
nucleotide substitution by a factor of 1/5 in the
centre region, between sites t=301 and t=600.
The results are shown in the following figure:
The maximum likelihood approach clearly over-fits and
erroneously predicts a recombinant region.
The Bayesian MCMC approach is only slightly affected in its
prediction of the posterior probabilities -
the graph of P(state_t=1|data) shows a small dent. This
does not lead to classifying any site in the alignment
as a topology other than state_t=1, though,
hence no recombination
is predicted, and over-fitting is avoided.
We have derived two methods of parameter inference
in an HMM-based recombination detection model.
SERAD optimizes the parameters in a maximum likelihood (ML)
sense with the EM algorithm.
BARCE samples the parameters from
the posterior distribution in a Bayesian sense by means of an
MCMC simulation. Our first simulation study suggests that
while both methods find the correct
mosaic structure of the sequence alignment,
ML is over-confident in its
prediction and gives a wrong estimation of the posterior
probabilities for the tree topologies.
This deficiency is redeemed with the Bayesian approach,
and is likely to be important in real-world applications,
where uncertainties in the predictions need to be
pointed out.
Our second simulation study suggests that the Bayesian
MCMC approach is more robust against over-fitting, and
that it is more reliable in distinguishing between recombination
and rate heterogeneity.
Back to my homepage.