Detecting Recombination in DNA Sequence Alignments:
A Comparison between
Maximum Likelihood and Markov Chain Monte Carlo

Dirk Husmeier and Grainne McGuire
Biomathematics and Statistics Scotland (BioSS)
SCRI, Dundee DD2 5DA, United Kingdom
March 2002

For a statistical method to detect recombination in DNA sequence alignments, we compare two methods of inference: The Expectation Maximization (EM) algorithm, to optimize the parameters in a maximum likelihood (ML) sense, and Markov chain Monte Carlo (MCMC), whereby the parameters are sampled in a Bayesian sense from the posterior distribution. A simulation study suggests that while both methods find the true mosaic structure with an accurate location of the recombinant breakpoints, the MCMC approach gives a better estimation of the uncertainty of the prediction, as expressed by the posterior probability of the tree topologies.

Introduction

We explore a statistical method for detecting recombination and gene conversion in DNA sequence alignments, which is based on combining two probabilistic graphical models: (1) a taxon graph (phylogenetic tree) representing the relationship between the taxa or genes, and (2) a site graph (hidden Markov model) representing interactions between different sites in the DNA sequence alignments. The objective of this study is to compare two methods of inference: The Expectation Maximization (EM) algorithm, to optimize the parameters in a maximum likelihood (ML) sense, and Markov chain Monte Carlo (MCMC), whereby the parameters are sampled in a Bayesian sense from the posterior distribution.

Software


Simulation A

We simulated recombination in a population of 4 taxa with the program SIRENS, as described here (using a tree height of 0.2), and selected the region between sites 400 and 800. This gives us a 400-nucleotide alignment with one recombination event, which corresponds to a transition from topology 1 (for the 200 sites on the left) to topology 3 (for the 200 sites on the right). Click here for a reminder of the three possible tree topologies. We now take the same alignment without recombination, that is, where the whole alignment has been generated from topology 1, and randomly replace a certain percentage of columns of the first alignment by those of the second. This is illustrated in the following figure:

The effect is that the probability for topology 3 in the right part of the alignment is smaller than 1, and we want to test whether this is correctly predicted by the detection method. We generated two alignments with the MATLAB program make_data.m. In the first alignment, seq10.phy, 10 percent of the sites in the second mosaic patch (between sites 201 and 400) have been replaced, hence the true probability for topology 3 is 0.9. In the second alignment, seq20.phy, 20 percent of the sites in the second mosaic patch have been replaced, giving a true probability for topology 3 of 0.8.

For optimizing the parameters of the model in a maximum likelihood sense with the EM algorithm, we used the program SERAD. For sampling the parameters in a Bayesian sense with the MCMC algorithm, we used the program BARCE. Further details are available here. The results are shown in the following figures:

We averaged the posterior probabilities over the last 100 sites. The predictions for the alignment seq10.phy, where 10 percent of the sites in the recombinant region have been exchanged, are:

Average posterior probability for topology: 123
True 0.1000.0000.900
Predicted with ML 0.0000.0001.000
Predicted with MCMC 0.0980.0000.902

The predictions for the alignment seq20.phy, where 20 percent of the sites in the recombinant region have been exchanged, are:

Average posterior probability for topology: 123
True 0.2000.0000.800
Predicted with ML 0.0010.0010.998
Predicted with MCMC 0.1850.0000.815

Simulation B

In a second simulation experiment, we studied the effect of rate heterogeneity. We simulated the evolution process along the branches of a four-species phylogenetic tree with the Kimura model, but reduced the rate of nucleotide substitution by a factor of 1/5 in the centre region, between sites t=301 and t=600.

The results are shown in the following figure:

The maximum likelihood approach clearly over-fits and erroneously predicts a recombinant region. The Bayesian MCMC approach is only slightly affected in its prediction of the posterior probabilities - the graph of P(state_t=1|data) shows a small dent. This does not lead to classifying any site in the alignment as a topology other than state_t=1, though, hence no recombination is predicted, and over-fitting is avoided.


Conclusion

We have derived two methods of parameter inference in an HMM-based recombination detection model. SERAD optimizes the parameters in a maximum likelihood (ML) sense with the EM algorithm. BARCE samples the parameters from the posterior distribution in a Bayesian sense by means of an MCMC simulation. Our first simulation study suggests that while both methods find the correct mosaic structure of the sequence alignment, ML is over-confident in its prediction and gives a wrong estimation of the posterior probabilities for the tree topologies. This deficiency is redeemed with the Bayesian approach, and is likely to be important in real-world applications, where uncertainties in the predictions need to be pointed out. Our second simulation study suggests that the Bayesian MCMC approach is more robust against over-fitting, and that it is more reliable in distinguishing between recombination and rate heterogeneity.


Back to my homepage.