Statistical Bioinformatics

Statistical analysis of molecular sequence alignments

We have continued the development of the TOPALi software package for the statistical analysis of DNA and protein multiple sequence alignments, with particular emphasis on model selection and phylogenetic tree estimation for protein-coding DNA. Modern statistical methods for estimating phylogenetic trees from molecular sequence data perform better when the assumed model of evolution is well chosen for the data. We have therefore implemented an improved model selection protocol in TOPALi that greatly simplifies the procedure for biologists. The software automatically fits eighty-eight evolutionary models and displays the results graphically, with a suggested choice based on three statistical criteria. Unlike existing phylogenetic model selection approaches, we jointly estimate a tree for each model, which improves the estimates of the likelihood and of quantities derived from it, such as the Akaike and Bayesian information criteria (AIC, BIC).
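To illustrate how likelihood-based criteria such as AIC and BIC rank candidate models, here is a minimal Python sketch. It is not TOPALi code. The names `ModelFit`, `aic`, `bic` and `best_model`, as well as the log-likelihoods and parameter counts in `example_fits`, are hypothetical values chosen for illustration. The sketch assumes that each model has already been fitted together with its own tree, and that the parameter count includes the branch lengths.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelFit:
    """Result of fitting one evolutionary model (hypothetical container)."""
    name: str
    log_likelihood: float  # maximised log-likelihood, ln L
    n_params: int          # free parameters, including branch lengths


def aic(fit: ModelFit) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * fit.n_params - 2 * fit.log_likelihood


def bic(fit: ModelFit, n_sites: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better).

    Assumption: the sample size n is taken to be the alignment length.
    """
    return fit.n_params * math.log(n_sites) - 2 * fit.log_likelihood


def best_model(fits: list[ModelFit], n_sites: int, criterion: str) -> ModelFit:
    """Return the fit with the lowest score under 'aic' or 'bic'."""
    if criterion == "aic":
        return min(fits, key=aic)
    if criterion == "bic":
        return min(fits, key=lambda f: bic(f, n_sites))
    raise ValueError(f"unknown criterion: {criterion}")


# Made-up fits for a 10-taxon alignment (17 branch lengths in an unrooted tree).
example_fits = [
    ModelFit("JC", log_likelihood=-5250.0, n_params=17),
    ModelFit("HKY", log_likelihood=-5200.0, n_params=21),
    ModelFit("GTR+G", log_likelihood=-5198.0, n_params=26),
]

if __name__ == "__main__":
    for fit in example_fits:
        print(fit.name, round(aic(fit), 2), round(bic(fit, 1000), 2))
    print("Suggested by AIC:", best_model(example_fits, 1000, "aic").name)
    print("Suggested by BIC:", best_model(example_fits, 1000, "bic").name)
```

Both criteria penalise the number of parameters, so a richer model is preferred only when its gain in likelihood outweighs the added complexity. BIC applies the heavier penalty once the alignment has more than a few sites.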

TOPALi graphical display

The graphical display shows the magnitude of the parameter values and also the estimated tree, allowing the user to see the influence of model choice on the tree topology. The user can accept the choice or choose an alternative before proceeding to full tree estimation using Bayesian or Maximum Likelihood approaches.

Further details from: Frank Wright
