Detecting Recombination in DNA Sequence Alignments:
A Comparison between BARCE and RECPARS

Dirk Husmeier and Grainne McGuire
Biomathematics and Statistics Scotland (BioSS)
SCRI, Dundee DD2 5DA, United Kingdom
March 2002

In this study we compare BARCE with RECPARS on a synthetic DNA sequence alignment, obtained by simulating evolution and recombination in a population of four taxa. The results clearly demonstrate the superiority of BARCE over RECPARS in locating recombinant breakpoints and identifying the nature of the mosaic structure of the alignment. However, while BARCE is only applicable to alignments of four taxa, RECPARS does not impose any restrictions on the number of taxa and therefore remains a powerful tool for larger sequence alignments. This suggests that in practical applications a combination of both methods may be useful.

Simulation

We simulated recombination in a DNA sequence alignment of four taxa with the program SIRENS; click here for a detailed description. The simulated process contains two independent recombination events, where the first event is more ancient and therefore more difficult to detect. We modelled two different mosaic structures. In mosaic structure A, both recombinant regions are 200 nucleotides long. In mosaic structure B, the more ancient recombination event covers a larger domain of 300 nucleotides, while the more recent event extends over a shorter region of only 100 nucleotides. The total length of the alignment is 1000 nucleotides in either case. The two mosaic structures are shown in the following figures:

For each mosaic structure, we repeated the simulation with three different tree heights, where the tree height is defined as half the sum of all the branch lengths between the two strains that are farthest apart. Note that as the tree height becomes smaller, the numbers of polymorphic and topology-defining sites decrease.

Tree height Mosaic structure Percentage of topology-defining sites
0.3
A
15.7%
0.3
B
16.4%
0.2
A
12.4%
0.2
B
12.9%
0.1
A
5.3%
0.1
B
5.5%

In the table above, the first column shows the tree height, the second column shows the mosaic structure, and the third column shows the percentage of topology-defining sites. Note that for a parsimony approach like RECPARS, inference is solely based on the topology-defining sites, and other sites are discarded. As we decrease the tree height, the percentage of topology-defining sites becomes smaller. This reduces the information content in the alignment and makes the detection of recombinant regions more difficult.

The objective of our simulation study is to test the performance of BARCE and RECPARS on different mosaic structures and for varying levels of difficulty of the detection problem. The following table contains hyperlinks to the sequence alignments (in sequential PHYLIP format) used in our study:

Mosaic structure A Tree height 0.1 Tree height 0.2 Tree height 0.3
Mosaic structure B Tree height 0.1 Tree height 0.2 Tree height 0.3


Applying RECPARS

For applying RECPARS we used the program written by Kim Fisker , which improves on Jotun Hein's original program. An exact specification of the parameters is given here.

Applying BARCE

We applied BARCE with the following default settings, which are described in more detail below.

Initialization
In principle, the initialization of the hidden states is unimportant since the Markov chain will forget its initial configuration and converge to the equilibrium distribution irrespective of its starting point. In practice, however, extreme starting values could slow down the mixing of the chain and result in a long burn-in, in which case the MCMC sampler may fail to converge towards the main support of the posterior distribution within the available simulation time. In general, it may therefore be advisable to start from an informative initialization, like the output of a prediction with RECPARS. However, in our simulations we found that the annealing scheme improves mixing very effectively and helps the Markov chain to lose its memory of the initialization very fast. Consequently, the initialization does not seem to be particularly important if annealing is used. The simulations reported here were all started from a uniform sequence of hidden states, where each hidden state was set to the global maximum likelihood topology - this is the topology one gets when applying standard maximum likelihood, as implemented, for instance, in PHYLIP, to the whole sequence alignment without correcting for recombination.

More details: Dependence on the initialization.

Simulation time
A first quick analysis with short equilibration and sampling periods of only 10,000 MCMC steps (which only takes a few minutes on a SUN Ultra 60) gave already a rather accurate prediction if annealing was used, but the MCMC trajectories might not yet have reached equilibrium. To ensure a proper convergence of the MCMC scheme, we increased both the equilibration and sampling periods by a factor of ten (which takes 40 minutes on a SUN Ultra 60).

More details: Dependence on the equilibration process.

Dependence on the prior
Two terms contribute to the posterior distribution: The prior, and the likelihood. While the first term is constant, the second term scales like N, the number of sites in the DNA sequence alignment. Consequently, for a reasonably long alignment, the weight of the likelihood is considerably higher than that of the prior, and the latter should therefore have only a marginal influence on the prediction. This was borne out in our simulations. Note that the
improvement of the Bayesian approach on maximum likelihood is not caused by the inclusion of explicit prior knowledge, but by the fact that the model parameters are sampled rather than optimized.

More details: Dependence on the prior


Performance criteria

The detection of recombination is basically a classification problem: Each site in the sequence alignment is assigned to one of the three possible tree topologies. For RECPARS, this is done directly. For BARCE, this is done by assigning a site to the mode of the posterior probability We use the following criteria to rate the performance of the methods:

Sensitivity
Percentage of correctly classified recombinant sites.

Specificity
Percentage of correctly classified non-recombinant sites.

As opposed to RECPARS, BARCE computes the site-dependent probability for the tree topologies. This contains more information than just a discrete class assignment, and allows the crispness or uncertainty of the classification to be estimated with the following criterion:

Relative entropy
Classification entropy divided by the maximum theoretically possible entropy. This is a value between 0 and 1, where 0 means no uncertainty at all, and 1 indicates complete uncertainty (equivalent to a random classifier).


Results

Detailed results: Mosaic structure A
Detailed results: Mosaic structure B

Overall performance
The figure below shows a plot of the sensitivity/specificity pairs, where the vertical axis represents the sensitivities, while the horizontal axis represents the specificities. Note that an optimal method that classifies all sites correctly has a score of 100/100, while a method that does not predict any recombination event has a score of 0/100. Method A is better than method B if it shows an improvement in both scores. The symbols in the figure represent the different methods compared:

Symbol Method
Cross BARCE
Square RECPARS, recombination cost/substitution cost=10
Circle RECPARS, recombination cost/substitution cost=3
Triangle RECPARS, recombination cost/substitution cost=1.5

The rows and columns in the figure refer to the different sequence alignments:

Left column Middle column Right column
Upper row Mosaic structure A, tree height 0.3 Mosaic structure A, tree height 0.2 Mosaic structure A, tree height 0.1
Bottom row Mosaic structure B, tree height 0.3 Mosaic structure B, tree height 0.2 Mosaic structure B, tree height 0.1

While the specificity scores are similar, BARCE shows a consistent and significant improvement on RECPARS in the sensitivity score. This is reflected in the detailed results, which suggest that even when RECPARS predicts the nature of the mosaic structure correctly (example), it is less accurate than BARCE in locating the breakpoints.


Conclusions

We have applied BARCE and RECPARS to various synthetic DNA sequence alignments, obtained by simulating recombination processes for different tree heights and different mosaic structures. When the tree height is sufficiently large (0.3, 0.2), both BARCE and RECPARS predict the true mosaic structure, but with two important differences. Firstly, RECPARS gives only an accurate prediction if the recombination and substitution costs have been set "appropriately". These parameters can not be inferred from the data, but rather have to be chosen in advance. It was suggested by Wiuf, Christensen, and Hein ( Molecular Biology and Evolution 18, 2001) that a ratio of the recombination and substitution costs of 1.5 works fine quite generally. However, this was not confirmed in our simulations, where for the largest tree height of 0.3 the predictions with this ratio were wrong ( example 1 , example 2 ), leading to a mosaic structure that is over-tessellated. Since BARCE infers all the parameters from the data, it does not suffer from this shortcoming. Secondly, even when RECPARS predicts the nature of the mosaic structure correctly (example), it is less accurate in locating the breakpoints than BARCE. This is a consequence of the fact that RECPARS uses only the topology-defining sites, and therefore discards a considerable proportion of sites in the DNA sequence alignment.

When the tree height is decreased to 0.1, neither RECPARS nor BARCE predict the mosaic structure of the alignment correctly. RECPARS finds only one recombinant region, which in one case is even badly misplaced. BARCE detects both recombinant regions and even locates them rather accurately, but misclassifies the topology change of one of them (example 1, example 2). This is most likely a consequence of the fact that for small tree heights, the number of mutations and, consequently, the number of polymorphic sites is small (see table above). Consequently, there is less information in the data and any inference is inevitably less accurate.

Comparing the performance of RECPARS and BARCE across all simulations, we found that BARCE gives a consistent and significant improvement on RECPARS in the accuracy of locating and classifying recombinant regions, as measured by the sensitivity and specificity scores discussed above. The price to pay for this is an increase in the computational costs: While a run of RECPARS takes only a few seconds, the MCMC simulations with BARCE typically took about 40 minutes (on a SUN Ultra 60). A more serious restriction of BARCE is its limitation to alignments of four sequences. This suggests that in practical applications both methods be, at best, combined: In a first, preliminary analysis, apply RECPARS to identify putative recombinant sequences and their approximate mosaic structures. In a second, subsequent step, apply BARCE to the tentative sets of four sequences that result from the previous step. This will allow a more accurate analysis of the mosaic structure of these alignments, and will resolve contradictions that are likely to arise from different (arbitrary) settings of the parsimony cost parameters of RECPARS.


Back to my homepage