Sporadic recombination is a process by which certain bacteria and viruses exchange DNA/RNA subsequences, leading to so-called mosaic strains. The discovery of a surprisingly high frequency of such mosaic strains in HIV suggests that recombination between viral genomes can occur in vivo and generate new biologically active viruses. A phylogenetic analysis of various bacterial genera suggests that recombination is an important, and previously underestimated, source of genetic diversification, through which new strains with undesirable biological traits (such as resistance to multiple antibiotics) can arise.
In my talk, I will describe various statistical methods for the detection of recombination in DNA sequence alignments.
The first part will briefly recapitulate the statistical approach to phylogenetics. Based on an explicit model of nucleotide substitution, a phylogenetic tree can be interpreted as a probabilistic generative model. This allows the calculation of the likelihood of the observed DNA sequence alignment, which forms the basis for the detection methods covered in my talk.
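As a minimal sketch of how a tree acts as a generative model, the following computes the log-likelihood of an alignment by Felsenstein's pruning recursion. It assumes the Jukes-Cantor substitution model with a uniform root distribution, which is only one possible choice of nucleotide substitution model. The nested-list tree encoding and all function names are hypothetical, chosen for illustration, and gaps or ambiguity codes are not handled.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """Jukes-Cantor probability of base i becoming base j along a branch of
    length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, column):
    """Conditional likelihoods at a node: entry x is the probability of the
    leaf bases observed below the node, given that the node is in state x.
    A leaf is a taxon name (str); an internal node is a list of
    (child, branch_length) pairs. This encoding is an assumption."""
    if isinstance(node, str):
        return [1.0 if b == column[node] else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial(child, column)
        for x in range(4):
            result[x] *= sum(
                jc69_prob(BASES[x], BASES[y], t) * child_l[y] for y in range(4)
            )
    return result

def log_likelihood(tree, alignment):
    """Log-likelihood of an alignment (dict: taxon -> sequence), treating
    sites as independent and the root base as uniformly distributed."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for s in range(n_sites):
        column = {taxon: seq[s] for taxon, seq in alignment.items()}
        root = partial(tree, column)
        total += math.log(sum(0.25 * value for value in root))
    return total

# Hypothetical four-taxon example: ((A,B),(C,D)) with made-up branch lengths.
example_tree = [
    ([("A", 0.1), ("B", 0.1)], 0.05),
    ([("C", 0.1), ("D", 0.1)], 0.05),
]
example_alignment = {"A": "ACGT", "B": "ACGT", "C": "ACGA", "D": "ACTA"}
example_ll = log_likelihood(example_tree, example_alignment)
```

Comparing such likelihoods across candidate topologies, site by site or region by region, is the kind of quantity that likelihood-based detection methods build on.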
The second part will review several classical detection methods: maximum chi-squared (Maynard Smith, 1992), PLATO (Grassly and Holmes, 1997), RECPARS (Hein, 1993), and TOPAL (McGuire, Wright, and Prentice, 1997).
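To give a flavour of the simplest of these methods, here is a simplified sketch of the idea behind the maximum chi-squared test for a single pair of sequences. Every candidate breakpoint splits the alignment into a left and a right part, and a 2x2 chi-squared statistic measures whether the proportion of differing sites changes across the breakpoint. The function names are hypothetical. The sketch scans all sites and omits the restriction to informative sites and the permutation test for significance, so it is an illustration rather than a reproduction of the published method.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]].
    Returns 0.0 when a row or column total is zero."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def max_chi2(seq1, seq2):
    """Scan all breakpoints k (sites 0..k-1 on the left, k.. on the right)
    and return (best_k, best_statistic) for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diff = [x != y for x, y in zip(seq1, seq2)]
    n = len(diff)
    total_diff = sum(diff)
    best_k, best_val = None, 0.0
    left_diff = 0
    for k in range(1, n):
        left_diff += diff[k - 1]
        a = left_diff                # differing sites left of the breakpoint
        b = k - left_diff            # identical sites left of the breakpoint
        c = total_diff - left_diff   # differing sites right of the breakpoint
        d = (n - k) - c              # identical sites right of the breakpoint
        val = chi2_2x2(a, b, c, d)
        if best_k is None or val > best_val:
            best_k, best_val = k, val
    return best_k, best_val
```

In practice the maximum would be compared against a null distribution, for example one obtained by permuting the order of the sites, before a breakpoint is called.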
The third part will describe a subset approach for detecting recombination: a fixed-size window is moved along a given DNA sequence alignment. For every window position, the marginal posterior distribution over tree topologies is estimated by means of a Markov chain Monte Carlo simulation. Two probabilistic divergence measures derived from these distributions are then plotted along the alignment and used to identify recombinant regions.
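The window scan can be sketched as follows, under several assumptions. The MCMC step is not implemented: the code takes the topologies sampled in each window as given and turns them into a smoothed posterior estimate. The divergence shown, a symmetrised Kullback-Leibler divergence between adjacent windows, is one plausible choice and not necessarily one of the two measures used in the talk. All names and the pseudocount are hypothetical.

```python
import math
from collections import Counter

def window_starts(alignment_length, window_size, step):
    """Start positions of all fixed-size windows that fit in the alignment."""
    return list(range(0, alignment_length - window_size + 1, step))

def topology_posterior(sampled_topologies, all_topologies, pseudocount=0.5):
    """Estimate P(topology | window) from MCMC samples by relative frequency.
    The pseudocount keeps every probability positive so that the KL
    divergence below stays finite."""
    counts = Counter(sampled_topologies)
    total = len(sampled_topologies) + pseudocount * len(all_topologies)
    return [(counts[t] + pseudocount) / total for t in all_topologies]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); assumes q has no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetrised KL divergence between two topology posteriors."""
    return 0.5 * (kl(p, q) + kl(q, p))

def local_divergence(posteriors):
    """Divergence between each pair of adjacent windows; peaks suggest that
    the supported topology changes between the two windows."""
    return [symmetric_kl(p, q) for p, q in zip(posteriors, posteriors[1:])]
```

With three possible topologies for four taxa, a window sequence whose samples shift from mostly one topology to mostly another produces a peak in `local_divergence` at the window boundary where the shift happens.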
The fourth part will discuss the application of hidden Markov models (HMMs). This approach is based on the combination of two probabilistic graphical models: (1) a taxon graph (phylogenetic tree) representing the relationships between the taxa, and (2) a site graph (hidden Markov model) representing interactions between different sites in the sequence alignment. I will compare different parameter estimation techniques, and will discuss the results obtained on various synthetic and real-world DNA sequence alignments.
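The combination of the two graphs can be illustrated with a small decoding sketch. It assumes that the hidden state at each site is a tree topology, that the emission term is the log-likelihood of the alignment column under that topology (computed elsewhere, for instance by a pruning recursion), and that the chain stays in its current topology with a fixed probability and otherwise switches uniformly to another. The function name and the `stay_prob` parameter are hypothetical, and Viterbi decoding stands in here for whatever inference scheme is actually used.

```python
import math

def viterbi_topologies(site_loglik, stay_prob=0.9):
    """Most probable topology sequence along the alignment.
    site_loglik[t][k] is log P(column t | topology k). Requires at least two
    topologies and 0 < stay_prob < 1. The prior over the first state is
    uniform."""
    n_states = len(site_loglik[0])
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))
    log_prior = -math.log(n_states)

    score = [log_prior + ll for ll in site_loglik[0]]
    back = []  # back[t][k]: best predecessor of state k at site t + 1
    for column in site_loglik[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            best_j = max(
                range(n_states),
                key=lambda j: score[j] + (log_stay if j == k else log_switch),
            )
            trans = log_stay if best_j == k else log_switch
            new_score.append(score[best_j] + trans + column[k])
            pointers.append(best_j)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

A change of state along the decoded path marks a putative recombination breakpoint. Because switching is penalised, a single site that weakly favours another topology does not by itself trigger a breakpoint.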