Postgraduate Research & Training

New methods for analysing very high dimensional data

The data collected from biological samples has, over the past decade, become increasingly high dimensional, as it has become routine to study gene and protein expression and the concentration of metabolites in different body tissues, and to count many bacterial species. Behavioural and motivational studies also generate very many variables. Datasets with hundreds or thousands of variables are commonplace. More recently, research projects have started to collect more than one high-dimensional dataset from the same material. For example, a recent study on the variability of metabolic processes in human volunteers collected 19 sets of high-dimensional data. The plot shows the use of a multivariate method, co-inertia analysis, to look for associations between two of these: amino acid measurements made in blood plasma and in urine.

There are many statistical issues that arise when studying such data. A particular interest at the Rowett Institute is to examine correlations between different variables, and between different datasets. This is made difficult by the large amount of technical variation (noise) present in the data, and the high risk of false positive inferences. It is also important to try to distinguish between correlations that derive from unrelated responses to treatment interventions, and those that are due to common biological mechanisms and pathways.

This PhD project will develop some ideas for managing such inference, and test their applicability to data generated from Rowett studies in nutritional science. Depending on the interests of the student, it should be possible to take this research in various directions, either more theoretical or more applied.
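The false-positive risk described above can be illustrated with a small simulation. The sketch below is not part of the project: it assumes 50 samples and 200 variables of pure simulated noise, and it uses the Fisher z-transformation as an approximate significance threshold for a correlation coefficient. The helper name `critical_r` and all the numbers are illustrative assumptions.

```python
from statistics import NormalDist

import numpy as np


def critical_r(n_samples: int, alpha: float) -> float:
    """Approximate two-sided critical |r| via the Fisher z-transformation."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return float(np.tanh(z / np.sqrt(n_samples - 3)))


# Simulated data (an assumption): 50 samples, 200 mutually independent variables,
# so any "significant" correlation found here is a false positive.
rng = np.random.default_rng(1)
n_samples, n_vars = 50, 200
noise = rng.standard_normal((n_samples, n_vars))

# Correlate every variable with every other, keeping each pair once.
corr = np.corrcoef(noise, rowvar=False)
r = corr[np.triu_indices(n_vars, k=1)]
n_pairs = r.size

alpha = 0.05
naive_hits = int(np.sum(np.abs(r) > critical_r(n_samples, alpha)))
# Bonferroni correction: divide alpha by the number of tests performed.
bonferroni_hits = int(np.sum(np.abs(r) > critical_r(n_samples, alpha / n_pairs)))

print(f"{n_pairs} pairs tested, none truly correlated")
print(f"'significant' at alpha = {alpha}: {naive_hits}")
print(f"'significant' after Bonferroni correction: {bonferroni_hits}")
```

With an uncorrected threshold, roughly 5% of the pairs would be expected to pass by chance alone. Bonferroni is used here only because it is the simplest correction to write down; it is conservative, and it does not address the separate problem of correlations induced by a shared treatment effect.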

score plot diagram

Co-inertia analysis is a dimension reduction method similar to Principal Component Analysis. The score plot (top) shows the best matching of the 50 individual samples: arrows indicate where the spots would move in the urine dataset relative to the plasma data, and labels give details about the subject and dietary intervention. The loadings plot (bottom) shows how the individual amino acid variables contribute to the calculation of the scores.

loading plot diagram
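For readers unfamiliar with the method, the calculation behind score and loadings plots of this kind can be sketched in a few lines. This is a minimal version under simplifying assumptions (column-centred tables, equal sample weights, no variable scaling), not the analysis used to produce the plots above. The function name `coinertia` and the simulated "plasma" and "urine" tables are hypothetical stand-ins for the real data.

```python
import numpy as np


def coinertia(X, Y, n_axes=2):
    """Basic co-inertia analysis of two tables measured on the same samples."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must have one row per shared sample")
    n = X.shape[0]

    # Centre each variable, then form the cross-covariance between the tables.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cross_cov = Xc.T @ Yc / n

    # The SVD gives paired axes that maximise the covariance between the
    # projections of the two tables.
    U, s, Vt = np.linalg.svd(cross_cov, full_matrices=False)
    x_loadings = U[:, :n_axes]
    y_loadings = Vt[:n_axes].T

    # RV coefficient: overall similarity of the two tables, between 0 and 1.
    cov_x = Xc.T @ Xc / n
    cov_y = Yc.T @ Yc / n
    rv = np.sum(cross_cov**2) / np.sqrt(np.sum(cov_x**2) * np.sum(cov_y**2))

    return {
        "x_scores": Xc @ x_loadings,   # sample positions from the first table
        "y_scores": Yc @ y_loadings,   # sample positions from the second table
        "x_loadings": x_loadings,      # variable contributions, first table
        "y_loadings": y_loadings,      # variable contributions, second table
        "singular_values": s,
        "rv": float(rv),
    }


# Simulated example (an assumption): 50 samples whose two tables share
# two latent factors plus independent noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((50, 2))
plasma = latent @ rng.standard_normal((2, 20)) + 0.5 * rng.standard_normal((50, 20))
urine = latent @ rng.standard_normal((2, 15)) + 0.5 * rng.standard_normal((50, 15))

result = coinertia(plasma, urine)
print("RV coefficient:", round(result["rv"], 3))
```

In a score plot, each arrow would join a sample's row in `x_scores` to the same row in `y_scores`, so short arrows indicate samples on which the two tables agree. The loadings arrays play the role of the loadings plot, showing how much each variable contributes to each axis.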

This project will be supervised by Graham Horgan at BioSS Aberdeen.

For further details, contact Graham Horgan.
