Statistical Genomics and Bioinformatics

Automated Bayesian phylogenetic analysis of protein-coding DNA

Modern Bayesian methods for estimating phylogenetic trees, which describe the hierarchical relationships among species, rely on molecular sequence data, usually from protein-coding regions of DNA. Proteins are encoded in DNA as codons (nucleotide triplets), with each codon specifying one amino acid in the protein sequence. Such regions can be analysed using a codon position model, in which each of the three codon positions is modelled independently. Model selection can then be carried out separately for each codon position, allowing the relative rates of evolutionary change among the four nucleotide types (T, C, A, G) to differ between positions.
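To make the idea of codon positions concrete, the following minimal Python sketch splits in-frame coding sequences into their first, second and third codon positions and tallies the nucleotides at each. The function names and the toy alignment are illustrative assumptions, not part of TOPALi or of the Ovar-DRA data set.

```python
from collections import Counter


def split_by_codon_position(seq):
    """Return the three codon-position subsequences of an in-frame coding sequence."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    seq = seq.upper()
    # Every third site starting at offset 0, 1 and 2 gives positions 1, 2 and 3.
    return [seq[offset::3] for offset in range(3)]


def position_base_counts(alignment):
    """Count T, C, A and G at each codon position across all sequences."""
    counts = [Counter(), Counter(), Counter()]
    for seq in alignment.values():
        for pos, sub in enumerate(split_by_codon_position(seq)):
            # Gaps and ambiguity codes are ignored in this sketch.
            counts[pos].update(base for base in sub if base in "TCAG")
    return counts


# Hypothetical toy alignment (not real data).
toy_alignment = {
    "seq_a": "ATGGCCAAA",
    "seq_b": "ATGGCTAAG",
}

for pos, c in enumerate(position_base_counts(toy_alignment), start=1):
    print(f"codon position {pos}: {dict(c)}")
```

Each of the three site subsets produced this way can then be given its own substitution model, which is what allows model selection to be carried out per codon position.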

Previous approaches selected evolutionary models using a fixed, approximate phylogenetic tree. We have developed an improved method of model selection for molecular sequence data that jointly estimates the phylogenetic tree and the evolutionary model. The method is implemented in our TOPALi program which, for a given evolutionary model, then automatically runs the MrBayes Bayesian inference program to estimate the phylogenetic tree, using MCMC simulation settings taken from the main TOPALi analysis menu. This approach has been used to investigate the relationships among variants of a sheep immune system gene (Ovar-DRA); the model selected for codon position 2 differed from the model selected for positions 1 and 3.
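As a rough illustration of how a codon position model might be passed to MrBayes, the Python sketch below assembles a MrBayes command block with one partition per codon position and a separate `lset` line for each. It is not the block that TOPALi generates. The function name, the models assigned to each position, and the `ngen` and `samplefreq` values are all placeholder assumptions; in particular, the models shown are not those selected in the Ovar-DRA analysis.

```python
def mrbayes_block(n_sites, models, ngen=1_000_000, samplefreq=100):
    """Build a MrBayes block that partitions an alignment by codon position.

    n_sites -- alignment length in nucleotides (must be a multiple of 3)
    models  -- maps codon position (1, 2, 3) to a dict of lset options,
               e.g. {"nst": 6, "rates": "gamma"}
    """
    if n_sites % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")

    lines = ["begin mrbayes;"]
    # NEXUS character sets: every third site starting at site 1, 2 or 3.
    for pos in (1, 2, 3):
        lines.append(f"  charset pos{pos} = {pos}-{n_sites}\\3;")
    lines.append("  partition by_codon = 3: pos1, pos2, pos3;")
    lines.append("  set partition=by_codon;")
    # One substitution model per codon position.
    for pos in (1, 2, 3):
        opts = " ".join(f"{key}={value}" for key, value in models[pos].items())
        lines.append(f"  lset applyto=({pos}) {opts};")
    # Let each partition have its own parameters and overall rate.
    lines.append("  unlink statefreq=(all) revmat=(all) shape=(all);")
    lines.append("  prset applyto=(all) ratepr=variable;")
    lines.append(f"  mcmc ngen={ngen} samplefreq={samplefreq};")
    lines.append("end;")
    return "\n".join(lines)


# Placeholder models only: position 2 differs from positions 1 and 3.
example_models = {
    1: {"nst": 6, "rates": "gamma"},
    2: {"nst": 2, "rates": "gamma"},
    3: {"nst": 6, "rates": "gamma"},
}

print(mrbayes_block(300, example_models))
```

The resulting text would be appended to a NEXUS file containing the alignment. In TOPALi the MCMC settings come from the analysis menu rather than from hard-coded defaults like those used here.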

Phylogenetic tree showing that a new DRA variant, Ovar-DRA 0201, does not group with the other sheep Ovar-DRA (Ovar-DRA and Ovca-DRA) sequences, suggesting it plays a different role in the sheep immune system.

Further details from: Frank Wright

Article date 2011
