The effect is that the probability for topology
3 in the right part of the alignment is smaller than 1,
and we want to test if BARCE
can estimate this decreased
probability correctly. We generated two alignments
with the MATLAB program
make_data.m.
In the first alignment,
seq10.phy,
10 percent of the sites in the second mosaic patch
(between sites 201 and 400) have been replaced,
hence the true probability for topology 3 is 0.9.
In the second alignment,
seq20.phy,
20 percent of the sites in the second mosaic patch
have been replaced,
giving a true probability for topology 3 of 0.8.
We started BARCE
from two different initializations
of the hidden state sequences:
a uniform sequence of topology 2 (the "neutral" topology),
and a tessellated
mosaic structure, where topologies alternate and cover
blocks of 50 sites. The respective
mosaic.in
files look as follows:
Initialization 1
2
200
1 1
Initialization 2
8
50 100 150 200 250 300 350
0 1 2 0 1 2 0 2
Have a look at the
complete specifications
of the MCMC simulations.
You can find the results of the
simulations in the directories
swap10
(where 10 percent of the sites have been exchanged)
and swap20
(where 20 percent of the sites have been exchanged).
Each directory contains two subdirectories,
simu1 and simu2,
which contain the results of the two separate
MCMC simulations (starting from different
initializations).
To combine the two MCMC trajectories,
use the MATLAB program
PredictorEnsemble.m and
give the following commands at the MATLAB prompt:
cd swap10 OR cd swap20
cd simu1
load topol_prob.out;
P(:,:,1)=topol_prob;
load lpdfile.out;
L(:,1)=lpdfile;
cd ..
cd simu2
load topol_prob.out;
P(:,:,2)=topol_prob;
load lpdfile.out;
L(:,2)=lpdfile;
cd ..
P_topo_averaged=PredictorEnsemble(P,L)
seq20.phy,
where 20 percent of the sites in the
recombinant region have been exchanged.
The figures below show the posterior probabilities
for the three possible tree topologies
(vertical axis), plotted against the sites of the DNA
sequence alignment (horizontal axis).
Each Figure has three subfigures for the three
topologies 1 (top), 2 (middle) and 3 (bottom).
The first figure shows the results obtained from a single MCMC trajectory that was started from Initialization 1. The prediction is obviously suboptimal in that the recombination event is not detected at all.
The second figure shows the results obtained from a single MCMC trajectory that was started from Initialization 2. The prediction is also suboptimal. Although the recombinant region has been correctly identified, the predicted probability (about 1 rather than 0.8) is wrong.
By combining the two trajectories with the method described above, we get the following prediction:
This does not only correctly identify the recombinant region, but it also correctly estimates the uncertainty of the prediction. We averaged the posterior probability over the last 100 sites, and found that the predicted value was very close to the correct one:
| Average posterior probability for topology: | 1 | 2 | 3 |
|---|---|---|---|
| True | 0.20 | 0.0 | 0.80 |
| Predicted | 0.18 | 0.0 | 0.82 |
The prediction for the alignment
seq10.phy,
where 10 percent of the sites in the
recombinant region have been exchanged,
is in even better agreement with the
true probabilities:
| Average posterior probability for topology: | 1 | 2 | 3 |
|---|---|---|---|
| True | 0.100 | 0.0 | 0.900 |
| Predicted | 0.099 | 0.0 | 0.901 |
Note that in both cases a naive combination of the trajectories, where equal weights are assigned to them, leads to wrong predictions:
| Average posterior probability for topology: | 1 | 2 | 3 |
|---|---|---|---|
| Prediction from naive averaging | 0.50 | 0.0 | 0.50 |