A model-based approach to gene clustering with missing observation reconstruction in a Markov Random Field framework

Abstract
The different measurement techniques that interrogate biological systems provide means for monitoring the behaviour of virtually all cell components at different scales and from complementary angles. However, data generated in these experiments are difficult to interpret. A first difficulty arises from the high dimensionality and inherent noise of such data. Organizing them into meaningful groups is then highly desirable to improve our knowledge of biological mechanisms. A more accurate picture can be obtained when accounting for dependencies between the components (e.g. genes) under study. A second difficulty arises from the fact that biological experiments often produce missing values. When it is not ignored, the latter issue has been solved by imputing the expression matrix prior to applying traditional analysis methods. Although helpful, this practice can lead to unsound results. We propose in this paper a statistical methodology that integrates individual dependencies in a missing data framework. More explicitly, we present a clustering algorithm dealing with incomplete data in a Hidden Markov Random Field context. This tackles the missing value issue in a probabilistic framework and still allows us to reconstruct missing observations a posteriori without imposing any pre-processing of the data. Experiments on synthetic data validate the gain in using our method, and the analysis of real biological data illustrates its potential to extract biological knowledge. Availability: The SpaCEM3 software used in this study is available at http://mistis.inrialpes.fr/realisations.html Contact: matthieu@bioss.ac.uk or juliette.blanchet@inrialpes.fr.
Year
2009
Category
Refereed journal