RESEARCH: Statistical Bioinformatics
Association mapping in inbreeding plant species
Association studies have been used to locate single genes for some diseases in humans. In plant studies, the analysis is complicated by issues of selection, self-fertilisation and especially population substructure. The latter varies from large scale structure, such as plants selected for different traits in different regions, to smaller scale differences in degrees of kinship. Different substructure models have been proposed, but when applied to experimental data they often identify different regions as containing the quantitative trait loci (QTLs) that control important traits such as yield.
Simulation studies enable methods to be compared in a population where QTL locations, population structure and pedigrees are all known. We have modelled a simulation study on a real collection of barley germplasm from five regions. Ten genotyped landraces from each region were taken as founders. One thousand generations of landraces were generated with a selfing probability of 0.92. This was followed by 24 generations of selection on drought tolerance, yield or heading date, depending on region, to generate cultivars.
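As a minimal sketch of the mixed selfing and outcrossing step described above (not the simulation actually used), the Python below advances a population one generation at a time at a single locus. The selfing probability of 0.92 and the ten founders come from the text; the population size, the number of generations, the single-locus genotypes, the assumption that founders are fully homozygous, and all function names are illustrative assumptions.

```python
import random


def next_generation(population, selfing_prob, rng):
    """Return a new generation of diploid genotypes (tuples of two alleles).

    With probability `selfing_prob` both gametes come from the same plant;
    otherwise the second gamete comes from a randomly chosen plant
    (which may, by chance, be the same one).
    """
    offspring = []
    for _ in range(len(population)):
        mother = rng.choice(population)
        if rng.random() < selfing_prob:
            father = mother
        else:
            father = rng.choice(population)
        # Each parent passes on one of its two alleles at random.
        offspring.append((rng.choice(mother), rng.choice(father)))
    return offspring


def heterozygosity(population):
    """Fraction of plants carrying two different alleles."""
    return sum(a != b for a, b in population) / len(population)


def simulate(n_founders=10, pop_size=200, generations=50,
             selfing_prob=0.92, seed=1):
    rng = random.Random(seed)
    # Assumption: each founder landrace is a homozygous line with its own allele.
    founders = [(i, i) for i in range(n_founders)]
    population = [rng.choice(founders) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population = next_generation(population, selfing_prob, rng)
        history.append(heterozygosity(population))
    return population, history
```

Calling `simulate()` returns the final population and the heterozygosity in each generation; a selection step on a simulated trait would be added after `next_generation` to mimic the cultivar phase.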
This simulated population is enabling us to compare methods for estimating kinship from various types of DNA marker data with the known pedigrees, and to explore the accuracy and limitations of models including population substructure and kinship to identify genuine QTLs. For example, we find that many falsely significant associations (high values of -log10 p) estimated when population structure is ignored vanish when the structure is included in the model.
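The effect of ignoring structure can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the models compared in the study: two hypothetical subpopulations differ in both marker allele frequency and trait mean, the marker has no effect on the trait, and the association is tested with and without subpopulation means removed. The p-value uses a normal approximation to the t statistic, and all names and parameter values are invented for illustration.

```python
import math
import random


def neg_log10_p(x, y, n_covariates=0):
    """-log10 p for the correlation of x and y (normal approximation to t)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    df = n - 2 - n_covariates
    t = r * math.sqrt(df / (1.0 - r * r))
    p = math.erfc(abs(t) / math.sqrt(2.0))  # two-sided tail probability
    return -math.log10(max(p, 1e-300))      # guard against log10(0)


def centre_within_groups(values, groups):
    """Subtract each group's mean, i.e. fit subpopulation as a fixed effect."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def simulate_structured(n_per_group=200, seed=1):
    """Marker and trait both differ between groups; marker has no true effect."""
    rng = random.Random(seed)
    genotype, trait, group = [], [], []
    for g, (allele_freq, trait_mean) in enumerate([(0.1, 0.0), (0.9, 2.0)]):
        for _ in range(n_per_group):
            # Inbred lines: genotype coded as 0 or 2 copies of the allele.
            genotype.append(2 if rng.random() < allele_freq else 0)
            trait.append(rng.gauss(trait_mean, 1.0))
            group.append(g)
    return genotype, trait, group


genotype, trait, group = simulate_structured()
naive = neg_log10_p(genotype, trait)
adjusted = neg_log10_p(centre_within_groups(genotype, group),
                       centre_within_groups(trait, group),
                       n_covariates=1)
```

Here `naive` is large although the marker is not causal, while `adjusted` is much smaller because the association is evaluated within subpopulations only. Kinship-based models extend the same idea to finer-scale relatedness.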
Assumptions about population structure have an important effect on the strength of evidence for associations of position on chromosome 1H with heading date of barley in a dry environment in Spain.
Statistical analysis of molecular sequence alignments using web services and cluster computing
A growing number of biological questions can be tackled by aligning homologous regions of DNA from different organisms or from related genes within the same organism. We have extended our TOPALi Java application to launch several statistical analyses of multiple alignment data from the user's desktop which run as “web services” on remote, powerful computer clusters, with monitoring of the remote job and results displayed locally. Some features of TOPALi v2 are described below.
Recombination breakpoint location estimation
DNA sequences can recombine during evolution. This can result in a recombinant sequence comprising regions, separated by recombination breakpoints, that have different evolutionary histories. Initial testing for breakpoints is crucial as many subsequent analyses assume no recombination. Our methods that use a parametric bootstrapping approach to assess statistical significance make optimal use of cluster computing resources.
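The general shape of a parametric bootstrap test can be sketched as follows. This is a generic illustration, not the breakpoint method implemented in TOPALi: the function names, the toy Gaussian null model and the test statistic are all assumptions. The relevant point is that each replicate is simulated and scored independently, so replicates can be distributed across cluster nodes.

```python
import random
import statistics


def parametric_bootstrap_p(observed_stat, simulate_null, statistic,
                           n_replicates=999, seed=0):
    """Estimate a p-value by simulating data sets from a fitted null model.

    `simulate_null(rng)` returns one simulated data set and `statistic(data)`
    returns its test statistic. Replicates are independent of one another.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_replicates):
        if statistic(simulate_null(rng)) >= observed_stat:
            exceed += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (exceed + 1) / (n_replicates + 1)


def example():
    """Toy use: test whether a sample mean differs from zero."""
    rng = random.Random(42)
    data = [rng.gauss(0.3, 1.0) for _ in range(30)]
    n = len(data)
    sd = statistics.stdev(data)  # null model fitted to the data: N(0, sd^2)
    observed = abs(statistics.fmean(data))
    return parametric_bootstrap_p(
        observed,
        simulate_null=lambda r: [r.gauss(0.0, sd) for _ in range(n)],
        statistic=lambda sample: abs(statistics.fmean(sample)),
    )
```

In a recombination setting, `simulate_null` would instead simulate alignments under a fitted model with no recombination, and `statistic` would be the breakpoint statistic recomputed on each simulated alignment.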
Model selection, tree and ancestral sequence estimation

Screenshot of our TOPALi software.
Model-based methods to construct phylogenetic trees require parameters in the evolutionary model to be optimised prior to tree estimation. TOPALi v2 has a model selection web service (ModelGenerator software) which ranks substitution models (88 models for proteins or 55 for DNA) according to statistical criteria.
Tree estimation web services include implementations of Maximum Likelihood (PhyML software) and Bayesian Inference (MrBayes software) methods. Ancestral sequences are predicted using a FASTML web service.
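Ranking substitution models by a statistical criterion can be sketched as below. The text does not say which criteria are used; the Akaike and Bayesian information criteria are assumed here as common choices, and the log-likelihoods and parameter counts in the usage note are hypothetical numbers, not ModelGenerator output.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_sites):
    """Bayesian information criterion, with alignment length as sample size."""
    return n_params * math.log(n_sites) - 2 * log_lik


def rank_models(fits, n_sites, criterion="aic"):
    """Return (model name, score) pairs sorted from best to worst.

    `fits` maps a model name to (maximised log-likelihood, free parameters).
    """
    scores = {}
    for name, (log_lik, n_params) in fits.items():
        if criterion == "aic":
            scores[name] = aic(log_lik, n_params)
        elif criterion == "bic":
            scores[name] = bic(log_lik, n_params, n_sites)
        else:
            raise ValueError("criterion must be 'aic' or 'bic'")
    return sorted(scores.items(), key=lambda item: item[1])
```

For example, with hypothetical fits `{"JC": (-5200.0, 0), "HKY": (-5100.0, 4), "GTR": (-5098.0, 8)}` on a 1000-site alignment, both criteria place HKY first: the small likelihood gain of GTR does not offset its four extra parameters.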
Positive selection analysis
TOPALi v2 has a “branch model” web service (using PAML software) to test for differences in evolutionary rates among groups of sequences (e.g. after a past gene duplication event) and also a “sites model” web service (also PAML) to test for sites evolving faster than expected under the neutral model, which may be functionally important.
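Tests of this kind are usually framed as a likelihood ratio test between a null model and a nested alternative that allows the extra rate variation. The sketch below is an assumption about how such a comparison could be scored, not a description of the web service: the function name is invented, the log-likelihoods in the usage note are hypothetical, and the degrees of freedom depend on which pair of models is compared.

```python
import math


def lrt_p_value(lnl_null, lnl_alt, df):
    """Likelihood ratio test using a chi-square reference distribution.

    Returns (test statistic, p-value). Only df = 1 and df = 2 are handled,
    because their chi-square tail probabilities have simple closed forms.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        p = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only handles df = 1 or df = 2")
    return stat, p
```

For instance, `lrt_p_value(-1000.0, -997.0, df=2)` gives a statistic of 6.0 and a p-value of about 0.05, so a gain of three log-likelihood units is borderline when the alternative has two extra parameters.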
Combining multiple laser scans of microarrays
The first stage in the analysis of microarray data is estimation from laser scans of the level of expression of each gene. Typically, data are only used from a single scan, although, if multiple scans are available, sampling error can be reduced by combining them: a functional regression problem. Maximum likelihood estimation fails, but many alternative estimators exist, one of which is to maximise the likelihood of a Gaussian structural regression model. We have found by simulation that, surprisingly, this estimator is efficient for our particular application, even though the distribution of gene expressions is severely skewed and hence far from Gaussian.
Measured responses from laser scans 2, 3 and 4 plotted against those from laser scan 1 for each gene in a murine macrophage experiment (Division of Pathway Medicine, University of Edinburgh).
Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge
There have been various attempts to reconstruct gene regulatory networks from microarray expression data in the past. However, owing to the limited number of independent experimental conditions and the noise inherent in the measurements, the results have been rather modest so far. For this reason it seems advisable to include biological prior knowledge, related, for instance, to transcription factor binding locations in promoter regions or partially known signalling pathways from the literature. We have developed a Bayesian approach to integrate expression data with multiple sources of prior knowledge, e.g. extracted from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database. We have evaluated the proposed scheme on the Raf signalling pathway, a cellular signalling network describing the interaction of 11 phosphorylated proteins and phospholipids in mammalian immune system cells, demonstrating the benefits of combining biological knowledge with gene expression data.
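The text does not give the form of the prior. One simple way to encode agreement between a candidate network and prior knowledge, assumed here purely for illustration, is an energy-style penalty: each source supplies a matrix of prior beliefs in [0, 1] about each directed edge, the energy is the total disagreement with the candidate graph, and each source has its own weight. All names below are hypothetical.

```python
def prior_energy(graph, prior_belief):
    """Total disagreement between a graph and one source of prior beliefs.

    `graph[i][j]` is 1 if there is an edge from node i to node j, else 0.
    `prior_belief[i][j]` is a belief in [0, 1] that this edge exists.
    Diagonal entries (self-loops) are ignored.
    """
    n = len(graph)
    return sum(abs(prior_belief[i][j] - graph[i][j])
               for i in range(n) for j in range(n) if i != j)


def log_prior_unnormalised(graph, prior_beliefs, weights):
    """Unnormalised log prior combining several sources of knowledge.

    Each source k contributes -weights[k] * energy_k, so a weight of zero
    switches that source off and larger weights trust it more.
    """
    return -sum(w * prior_energy(graph, belief)
                for belief, w in zip(prior_beliefs, weights))
```

In a Bayesian treatment this term would be added to the log marginal likelihood of the expression data when scoring a candidate network, and the weights could themselves be inferred rather than fixed.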
The currently accepted Raf signalling network, showing proteins (nodes), the presence of interactions (lines) and the direction of signal transduction (arrows).
Comparison of methods to predict the Raf regulatory network, with (DGE) and without (UGE) taking the edge directions into account. We determined the number of true positive interactions for a fixed number, five, of false positives. Bayesian networks (BN) and graphical Gaussian models (GGM) use information from the expression data. Biological knowledge from KEGG is used either in isolation (OnlyPrior) or using our new Bayesian integration scheme (BN&Prior).
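The evaluation criterion in this comparison can be sketched as follows, under the assumption that each method assigns a confidence score to every candidate edge. Edges are ranked by score and true positives are counted until the allowed number of false positives is exceeded. The function names and the handling of undirected edges (keeping the higher of the two directed scores) are illustrative assumptions, and ties in score are not treated specially.

```python
def true_positives_at_fp(scores, true_edges, max_false_positives=5):
    """Count true edges ranked above the (max_false_positives + 1)-th false edge.

    `scores` maps an edge to a confidence score; `true_edges` is the set of
    edges in the reference network.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    true_pos = false_pos = 0
    for edge in ranked:
        if edge in true_edges:
            true_pos += 1
        else:
            false_pos += 1
            if false_pos > max_false_positives:
                break
    return true_pos


def to_undirected(scores):
    """Collapse directed edge scores (a, b) to undirected ones, keeping the max."""
    collapsed = {}
    for (a, b), score in scores.items():
        key = frozenset((a, b))
        collapsed[key] = max(score, collapsed.get(key, float("-inf")))
    return collapsed
```

For the directed evaluation the keys are ordered pairs; for the undirected evaluation both the scores and the reference edges are first converted with `frozenset`, so that an edge predicted in the wrong direction still counts as a true positive.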