For each mosaic structure, we repeated the simulation with three different tree heights, where the tree height is defined as half the sum of all the branch lengths between the two strains that are farthest apart. Note that as the tree height becomes smaller, the numbers of polymorphic and topology-defining sites decrease.
| Tree height | Mosaic structure | Percentage of topology-defining sites |
|---|---|---|
In the table above, the first column shows the tree height, the second column shows the mosaic structure, and the third column shows the percentage of topology-defining sites. Note that for a parsimony approach like RECPARS, inference is solely based on the topology-defining sites, and other sites are discarded. As we decrease the tree height, the percentage of topology-defining sites becomes smaller. This reduces the information content in the alignment and makes the detection of recombinant regions more difficult.
The objective of our simulation study is to test the performance of BARCE and RECPARS on different mosaic structures and for varying levels of difficulty of the detection problem. The following table contains hyperlinks to the sequence alignments (in sequential PHYLIP format) used in our study:
| Mosaic structure A | Tree height 0.1 | Tree height 0.2 | Tree height 0.3 |
|---|---|---|---|
| Mosaic structure B | Tree height 0.1 | Tree height 0.2 | Tree height 0.3 |
Initialization
In principle, the initialization of the hidden states is
unimportant since the Markov chain will forget its initial
configuration and converge to the equilibrium distribution irrespective of its starting point. In practice, however,
extreme starting values could slow down the mixing of the chain and result in a long burn-in, in which case the
MCMC sampler may fail to converge towards the main support of the posterior distribution within the available simulation time.
In general, it may therefore be advisable to start from
an informative initialization, like the output of a prediction
with
RECPARS.
However, in our simulations we found that the
annealing scheme
improves mixing very effectively and helps the
Markov chain to lose its memory of the initialization very fast.
Consequently, the initialization does not seem to be
particularly important if annealing is used.
The simulations reported here were all started from
a uniform sequence of hidden states, where each hidden state
was set to the global maximum likelihood topology -
this is the topology one gets when applying
standard maximum likelihood, as implemented, for instance, in
PHYLIP,
to the whole sequence alignment without
correcting for recombination.
More details: Dependence on the initialization.
Simulation time
A first quick analysis with short equilibration and
sampling periods of only 10,000 MCMC steps
(which only takes a few minutes on a SUN Ultra 60)
gave already a rather accurate prediction if
annealing was used,
but the MCMC trajectories might not yet have reached equilibrium.
To ensure a proper convergence of the MCMC scheme,
we increased both the equilibration and sampling
periods by a factor of
ten (which takes 40 minutes on a SUN Ultra 60).
More details: Dependence on the equilibration process.
Dependence on the prior
Two terms contribute to the posterior distribution:
The prior, and the likelihood. While the first term is
constant, the second term scales like N,
the number of sites in the DNA sequence alignment.
Consequently, for a reasonably long alignment,
the weight of the likelihood is considerably higher
than that of the prior, and the latter should therefore have
only a marginal influence on the prediction.
This was borne out in our simulations.
Note that the
improvement
of the Bayesian approach on maximum likelihood
is not caused by the
inclusion of explicit prior knowledge, but by the fact
that the model parameters are
sampled rather than optimized.
More details: Dependence on the prior
Sensitivity
Percentage of correctly classified recombinant sites.
Specificity
Percentage of correctly classified non-recombinant sites.
As opposed to RECPARS, BARCE computes the site-dependent probability for the tree topologies. This contains more information than just a discrete class assignment, and allows the crispness or uncertainty of the classification to be estimated with the following criterion:
Relative entropy
Classification entropy divided by the maximum
theoretically possible
entropy. This is a value between 0 and 1, where 0 means
no uncertainty at all, and 1 indicates complete uncertainty
(equivalent to a random classifier).
Overall performance
The
figure below
shows a plot of the
sensitivity/specificity
pairs,
where the vertical axis represents the
sensitivities,
while the horizontal axis represents the
specificities.
Note that an optimal method that classifies all sites correctly
has a score of 100/100, while
a method that does not predict any recombination event
has a score of 0/100.
Method A is better than method B if it shows an improvement
in both scores.
The symbols in the
figure
represent
the different methods compared:
| Symbol | Method |
|---|---|
| Cross | BARCE |
| Square | RECPARS, recombination cost/substitution cost=10 |
| Circle | RECPARS, recombination cost/substitution cost=3 |
| Triangle | RECPARS, recombination cost/substitution cost=1.5 |
The rows and columns in the figure refer to the different sequence alignments:
| Left column | Middle column | Right column | |
|---|---|---|---|
| Upper row | Mosaic structure A, tree height 0.3 | Mosaic structure A, tree height 0.2 | Mosaic structure A, tree height 0.1 |
| Bottom row | Mosaic structure B, tree height 0.3 | Mosaic structure B, tree height 0.2 | Mosaic structure B, tree height 0.1 |
While the specificity scores are similar, BARCE shows a consistent and significant improvement on RECPARS in the sensitivity score. This is reflected in the detailed results, which suggest that even when RECPARS predicts the nature of the mosaic structure correctly (example), it is less accurate than BARCE in locating the breakpoints.
When the tree height is decreased to 0.1, neither RECPARS nor BARCE predict the mosaic structure of the alignment correctly. RECPARS finds only one recombinant region, which in one case is even badly misplaced. BARCE detects both recombinant regions and even locates them rather accurately, but misclassifies the topology change of one of them (example 1, example 2). This is most likely a consequence of the fact that for small tree heights, the number of mutations and, consequently, the number of polymorphic sites is small (see table above). Consequently, there is less information in the data and any inference is inevitably less accurate.
Comparing the performance of RECPARS and BARCE across all simulations, we found that BARCE gives a consistent and significant improvement on RECPARS in the accuracy of locating and classifying recombinant regions, as measured by the sensitivity and specificity scores discussed above. The price to pay for this is an increase in the computational costs: While a run of RECPARS takes only a few seconds, the MCMC simulations with BARCE typically took about 40 minutes (on a SUN Ultra 60). A more serious restriction of BARCE is its limitation to alignments of four sequences. This suggests that in practical applications both methods be, at best, combined: In a first, preliminary analysis, apply RECPARS to identify putative recombinant sequences and their approximate mosaic structures. In a second, subsequent step, apply BARCE to the tentative sets of four sequences that result from the previous step. This will allow a more accurate analysis of the mosaic structure of these alignments, and will resolve contradictions that are likely to arise from different (arbitrary) settings of the parsimony cost parameters of RECPARS.
Back to my homepage