Statistical Bioinformatics

Sequential analysis of microarray experiments

Microarray and other similar high-throughput experiments can be costly. However, there is usually little prior information available to decide how many samples / arrays are needed to give the scientist sufficient information on the question of interest.

In order to reduce sample size and thus costs in microarray experiments, BioSS has investigated group-sequential / adaptive designs for microarray studies involving a comparison of two groups (e.g. treatment vs control). In such designs the total sample size is not chosen in advance; instead, the experiment is conducted in several stages. After each stage a decision is made as to whether more samples should be added or whether the experiment has already generated sufficient information.

Figure: Histograms of p-values after different stages of a sequentially designed microarray experiment. From these histograms we can estimate the sensitivity, i.e. the percentage of truly differential genes that have been discovered. A pronounced peak around 0 indicates high sensitivity. Based on the sensitivity we decide whether to add more samples or stop the experiment.

We studied stopping criteria based on the distribution of p-values across all genes, which can be visualised in a p-value histogram. A peak close to zero indicates that the experiment managed to detect many differentially expressed genes. Using mixture models it is possible to estimate the numbers of true positives / negatives (TP/TN) and false positives / negatives (FP/FN) from this histogram for a given p-value cutoff. These estimates can be used to define several possible measures of success of the experiment. We were particularly interested in sensitivity, i.e. the expected percentage of truly differential genes that have been detected. For example, one might choose to stop the experiment if sensitivity exceeds a specific threshold of, say, 80%, i.e. if the vast majority of differential genes have been detected.
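As a rough illustration of how such estimates could be obtained, the sketch below uses a very simple two-component mixture: null p-values are assumed uniform on [0, 1], and p-values above a tuning value `lam` are assumed to come almost entirely from non-differential genes. This is not necessarily the mixture model used in the BioSS work; the function names, the `lam` parameter and the default cutoff are all assumptions made for the example.

```python
def estimate_outcomes(pvalues, cutoff=0.05, lam=0.5):
    """Estimate TP/FP/TN/FN counts and sensitivity from per-gene p-values.

    Assumption: null p-values are uniform on [0, 1], and p-values above
    `lam` come almost entirely from non-differential genes.
    """
    m = len(pvalues)
    if m == 0:
        raise ValueError("need at least one p-value")

    # Weight of the null component, estimated from the flat right-hand
    # part of the p-value histogram.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / ((1.0 - lam) * m))
    n_null = pi0 * m        # estimated number of non-differential genes
    n_alt = m - n_null      # estimated number of differential genes

    # Genes declared differential at the chosen cutoff.
    called = sum(p <= cutoff for p in pvalues)

    # Under uniformity, a fraction `cutoff` of the null genes is called.
    fp = min(float(called), n_null * cutoff)
    tp = min(called - fp, n_alt)
    fn = n_alt - tp
    tn = n_null - fp

    # Sensitivity is undefined if no differential genes are estimated.
    sensitivity = tp / n_alt if n_alt > 0 else None
    return {"pi0": pi0, "TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "sensitivity": sensitivity}


def should_stop(pvalues, target=0.8, cutoff=0.05):
    """Stop once the estimated sensitivity reaches the target (e.g. 80%)."""
    sensitivity = estimate_outcomes(pvalues, cutoff)["sensitivity"]
    return sensitivity is not None and sensitivity >= target
```

After each stage, `should_stop` would be called on the current per-gene p-values; a return value of `False` suggests adding another batch of samples. Note that in this sketch an undefined sensitivity never triggers stopping, so a real design would also need a maximum number of stages.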

Microarrays hybridised at the same stage are likely to be more similar than those hybridised at different stages. We hence use a p-value combination approach to combine the data from the different experimental stages. This is a simple but very efficient approach developed for meta-analyses in which the results of different studies are to be combined.
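The text above does not say which combination rule was used. As one common choice from the meta-analysis literature, the sketch below applies the weighted inverse-normal (Stouffer) method to the stage-wise p-values of a single gene. It assumes one-sided p-values that were computed separately within each stage; the function name, the `weights` argument and the clipping constant `eps` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def combine_stage_pvalues(stage_pvalues, weights=None, eps=1e-12):
    """Combine one gene's stage-wise p-values by the inverse-normal method.

    Assumption: the p-values are one-sided and were computed
    independently within each experimental stage.
    """
    k = len(stage_pvalues)
    if k == 0:
        raise ValueError("need at least one stage")
    if weights is None:
        weights = [1.0] * k          # equal weight for every stage
    if len(weights) != k:
        raise ValueError("one weight per stage is required")

    std_normal = NormalDist()        # standard normal, mean 0 and sd 1
    z = 0.0
    for p, w in zip(stage_pvalues, weights):
        # Clip to avoid infinite z-scores at exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        z += w * std_normal.inv_cdf(1.0 - p)

    # Rescale so the combined statistic is standard normal under the null.
    z /= sqrt(sum(w * w for w in weights))
    return 1.0 - std_normal.cdf(z)
```

For example, `combine_stage_pvalues([0.05, 0.05])` gives a combined p-value of roughly 0.01. Because each stage contributes only its own p-value, systematic differences between hybridisation batches do not enter the comparison directly. The weights are typically fixed in advance, for instance in proportion to the square root of each stage's planned sample size.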

The most striking result of our research is that in this very high-dimensional situation the stopping decision does not bias the p-values obtained at later stages. Such bias is a fundamental problem in classical sequential designs, where only one or a few variables are being considered.

Further details from: Claus-Dieter Mayer
