Improving MCMC by Metropolis coupling of different trajectories:
Simulation study 2

Dirk Husmeier
Biomathematics and Statistics Scotland (BioSS)
SCRI, Dundee DD2 5DA, United Kingdom
March 2002

Data

We simulated recombination in a population of 4 taxa with the program SIRENS, as described here (using a tree height of 0.2), and selected the region between sites 400 and 800. This gives us a 400-nucleotide alignment with one recombination event, which corresponds to a transition from topology 1 (for the 200 sites on the left) to topology 3 (for the 200 sites on the right). We now take the same alignment without recombination, that is, where the whole alignment has been generated from topology 1, and randomly replace a certain percentage of columns of the first alignment by those of the second. This is illustrated in the following figure:

The effect is that the probability for topology 3 in the right part of the alignment is smaller than 1, and we want to test if BARCE can estimate this decreased probability correctly. We generated two alignments with the MATLAB program make_data.m. In the first alignment, seq10.phy, 10 percent of the sites in the second mosaic patch (between sites 201 and 400) have been replaced, hence the true probability for topology 3 is 0.9. In the second alignment, seq20.phy, 20 percent of the sites in the second mosaic patch have been replaced, giving a true probability for topology 3 of 0.8.

We started BARCE from two different initializations of the hidden state sequences: a uniform sequence of topology 2 (the "neutral" topology), and a tessellated mosaic structure, where topologies alternate and cover blocks of 50 sites. The respective mosaic.in files look as follows:

Initialization 1
2
200
1 1

Initialization 2
8
50 100 150 200 250 300 350
0 1 2 0 1 2 0 2

Have a look at the complete specifications of the MCMC simulations. You can find the results of the simulations in the directories swap10 (where 10 percent of the sites have been exchanged) and swap20 (where 20 percent of the sites have been exchanged). Each directory contains two subdirectories, simu1 and simu2, which contain the results of the two separate MCMC simulations (starting from different initializations). To combine the two MCMC trajectories, use the MATLAB program PredictorEnsemble.m and give the following commands at the MATLAB prompt:

cd swap10 OR cd swap20
cd simu1
load topol_prob.out;
P(:,:,1)=topol_prob;
load lpdfile.out;
L(:,1)=lpdfile;
cd ..
cd simu2
load topol_prob.out;
P(:,:,2)=topol_prob;
load lpdfile.out;
L(:,2)=lpdfile;
cd ..
P_topo_averaged=PredictorEnsemble(P,L)


Results

We first present the results obtained for the alignment seq20.phy, where 20 percent of the sites in the recombinant region have been exchanged. The figures below show the posterior probabilities for the three possible tree topologies (vertical axis), plotted against the sites of the DNA sequence alignment (horizontal axis). Each Figure has three subfigures for the three topologies 1 (top), 2 (middle) and 3 (bottom).

The first figure shows the results obtained from a single MCMC trajectory that was started from Initialization 1. The prediction is obviously suboptimal in that the recombination event is not detected at all.

The second figure shows the results obtained from a single MCMC trajectory that was started from Initialization 2. The prediction is also suboptimal. Although the recombinant region has been correctly identified, the predicted probability (about 1 rather than 0.8) is wrong.

By combining the two trajectories with the method described above, we get the following prediction:

This does not only correctly identify the recombinant region, but it also correctly estimates the uncertainty of the prediction. We averaged the posterior probability over the last 100 sites, and found that the predicted value was very close to the correct one:

Average posterior probability for topology: 123
True 0.200.00.80
Predicted 0.180.00.82

The prediction for the alignment seq10.phy, where 10 percent of the sites in the recombinant region have been exchanged, is in even better agreement with the true probabilities:

Average posterior probability for topology: 123
True 0.1000.00.900
Predicted 0.0990.00.901

Note that in both cases a naive combination of the trajectories, where equal weights are assigned to them, leads to wrong predictions:

Average posterior probability for topology: 123
Prediction from naive averaging 0.500.00.50


Back to the main page.