Other Essays:-
Bayesian Networks for Analysing Gene Expression Data
August 2001
- Introduction
- Shortcomings of clustering
- The Bayesian network paradigm
- Correlation versus causality
- Practical implementation and limitations
- Experimental results
- References
Introduction
Molecular pathways consisting of interacting proteins underlie the major functions of living cells. A central goal of molecular biology is therefore to understand the regulatory mechanism that governs protein synthesis and activity. All the cells in an organism carry the same genomic data yet their protein makeup can be drastically different both temporally and spatially, due to regulation. Protein synthesis is regulated by control mechanisms at different stages:
- Transcription
- RNA splicing
- Translation
- Post-translational modifications
Oligonucleotide or cDNA microarrays allow us to focus on the first stage. While traditional methods in molecular biology could only report the expression levels of single genes, microarrays measure the abundance of thousands of mRNA targets simultaneously. This provides new rich data for understanding gene expression and regulation.
Shortcomings of clustering
The most commonly used computational method for analysing microarray gene expression data is clustering. Consider a microarray X_(ij) , whose rows X_(i.) correspond to genes and whose columns X_(.j) correspond to probes (tissue samples, experiments, etc.). Based on a correlation measure between the row vectors X_(i.) , genes are partitioned into clusters, using e.g. a hierarchical clustering algorithm.
Example (gene expression in yeast): Gene expression during the yeast cell cycle. The genes correspond to the rows, and the time points of each experiment are the columns. The ratio of induction/repression is shown for each gene such that the magnitude is indicated by the intensity of the colours displayed. The dendrogram on the left side of the figure shows the structure of the clusters. Figure copyrighted by Stanford University.
While clustering provides a compact summarisation of the data and might point to functional relationships between clustered genes (genes that are coexpressed might be coregulated), it suffers from the following shortcomings:
- Clustering is based on a global correlation measure. This obscures relationships that exist over only a subset of the data.
- Clustering fails to detect interactions between genes different from linear correlation.
- It is impossible to incorporate additional types of information, such as clinical data or experimental details.
Typical questions Bayesian networks can answer, but where clustering fails:
- Is the effect of a mutated gene on a target direct, or mediated by other genes?
- Which genes mediate the interactions within a cluster of genes or between clusters?
- What is the nature of the interaction between genes (e.g. does gene A inhibit gene B)?
The Bayesian network paradigm
Summary of the basic concept:
- Bayesian network: graphical display of dependence structure between multiple interacting quantities (here: expression levels of different genes).
- Probabilistic semantics: Fits the stochastic nature of both the biological processes and noisy experiments. Capable of handling noise and estimating the confidence in the different features of the network.
- Due to lack of data: Extract features that are pronounced in the data rather than a single model that explains the data.
Summary of the basic method:
- Directed acyclic graph (DAG): Graphical representation of the decomposition of a joint distribution into a product form. Nodes represent random variables. Arcs indicate conditional dependencies.
- Random variable X_i = measured expression level of gene i . Arcs = regulatory interactions between genes.
- Define the functional form of the conditional distributions (e.g. multinomial for discrete variables, linear Gaussian for continuous variables). Inference or learning consists of two tasks:
- Find the best network structure G.
- Given a network structure, find the best set of parameters for the conditional distributions.
"Best" is to be taken in a Bayesian way as the most probable structure/parameter vector given the data.
- Estimate uncertainty: Gene expression experiments typically give small data sets (many genes but only a few probes/experiments) which are not sufficiently informative to significantly determine that a single model is the "right" one. From a Bayesian perspective, the posterior probability over models is vague and not dominated by a single model. There are two ways to estimate uncertainty:
- Bayesian approach: Sample networks from the posterior distribution with Markov chain Monte Carlo (MCMC).
- Frequentist approach: Apply bootstrapping and repeat the optimisation algorithm several times on resampled data sets.
Example: Subnetwork for the yeast expression data, showing a local graphical map of the SVS1 yeast gene. Nodes represent genes, arcs represent conditional dependencies, and the width of an arc corresponds to the computed confidence level. The local map shows that CLN2 separates SVS1 from several other genes. For instance, the interaction between SRO4 and SVS1 is mediated by CLN2. While SRO4 might correlate with SVS1, conditional on CLN2 the expression levels of SRO4 and SVS1 are independent. Figure partly redrawn from Friedman et al.. For a presentation of the full network, click here.
Here is a more comprehensive introduction to Bayesian networks, written by Kevin P. Murphy.
Correlation versus causality
Ultimately, we are interested in the flow of causality: Does knocking out gene A lead to an increased expression of gene B? However, a Bayesian network is a model of dependencies between random variables, rather than causality . More than one graph can imply exactly the same set of independencies. Example: A-->B and A<--B are alternative ways of describing that A and B are not independent. Different graphs that imply the same set of independencies are called equivalent . The analysis of causality is hampered by the following fact:
- On the basis of observations (passive measurements) alone we cannot distinguish between equivalent graphs.
There are two ways to proceed in order to establish causal relations.
- Using observations alone: Analyse equivalence classes in the form of partially directed graphs (PDAG), where a directed edge A-->B denotes that all members of this equivalence class have this edge, whereas an undirected edge A--B denotes that some members of the class contain the edge A-->B, while others contain the edge A<--B. A directed edge A-->B is an indication, rather than evidence, that A might be a causal ancestor of B. Problem: For sparse data, like microarray data, the analysis is complicated by the fact that we do not have a single PDAG, but rather a (vague) posterior distribution over PDAGs.
- Using interventions , that is, setting the values of some variables using forces outside the causal model, e.g. gene knockout or over-expression. This modifies the score function (the posterior probability of the network conditional on the data) such that it is no longer structure equivalent : The score of equivalent graphs is no longer guaranteed to be the same, which helps us to determine the direction of causality. Effectively this reduces the set of networks that are structure equivalent to a subset of networks that are intervention equivalent .
Practical implementation and limitations
In a practical implementation of this scheme, one faces two problems: computational complexity and limited data .
Complexity
To reduce the computational complexity, the conditional probabilities are chosen of such a functional form that their parameters can be determined from closed form equations once the (complete) data and the network structure are known. (More precisely: The posterior distribution of these parameters can be calculated in closed form from some sufficient statistics of the data). This makes the parameter estimation process fast and allows the algorithm to focus on finding the optimal network structure (NP-hard problem). Two functional representations are commonly used:
- Multinomial model. Multinomial distribution for the possible states of a child variable given the state of the parents. This requires a discretisation of the data (typically using three categories of gene expression: under-expressed, baseline, over-expressed). Advantage: Can model nonlinear dependencies. Disadvantage: Discretisation causes loss of information. Results are sensitive to the discretisation thresholds. Pe'er et al. redeem the disadvantages of a fixed threshold by modelling the gene expression levels in different experiments as samples from a mixture of Gaussians and then discretising the data by assigning each expression level to the most likely mixture component.
- Linear Gaussian. Learn a linear regression model for the child variable given its parents. Advantage: Works with the actual data; no information loss caused by discretisation. Disadvantage: Can only model linear dependencies.
Learning the Bayesian network structure is an optimisation problem in the space of DAGs. The number of such graphs is superexponential in the number of variables and exhaustive search is intractable (NP-hard problem). Therefore, one needs to resort to heuristic, local (greedy) optimisation algorithms. Friedman et al. suggest the following sparse candidate algorithm:
- Identify a relatively small number of candidate parents for each gene based on simple local statistics (such as correlation).
- Restrict the search to networks in which only the candidate parents of a variable can be its parents.
- Disadvantage: Early choices can result in an overly restricted search space. To avoid this problem, a heuristic iterative algorithm is devised that adapts the candidate sets during the search.
Limited data
Typically, the number of genes (=random variables) is much larger than the number of gene expression measurements; e.g. the yeast data set contains 76 expression measurements for 800 genes. Consequence: one has a diffused posterior over a huge model space and cannot possibly list all the networks that are plausible given the data. The solution is as follows:
- Identify lower-complex properties of the network ("features").
- Estimate the posterior probability of these features given the data. Due to the lower complexity this estimation should be fairly robust even from a small set of sampled networks.
Examples of features:
- Markov relations. Markov neighbours are variables that are not separated by any other measured variable. These are parent-child relations (A-->B), where one gene regulates another, and spouse relations (A-->X<--B), where two genes coregulate a third. Markov relations indicate that two genes are related in some joint biological interaction. Note that in the structure A<--X-->B, A and B are not Markov related. Here, the interaction between A and B is mediated by X, that is, conditional on X, A and B are independent.
- Order relations. A is an ancestor of B for a given PDAG if the PDAG contains a path from A to B in which all edges are directed. This indicates a causal relation between A and B.
Features are binary indicator variables for Markov or order relations. While Markov relations are local properties of a network, order relations capture a global property.
Experimental results
I briefly summarise some experimental findings reported in Pe'er et al..
Pairwise relations
- Most Markov pair relations correspond to known biochemical or regulatory interactions, a shared common regulator, or a functional link.
- For a third (about 30%) of these "proof-of-principle" genes, the Pearson correlation was small.
- Conclusion: Due to the context specific nature in which they handle the data, Bayesian networks detect interactions that are missed by global correlation analysis.
Separator relations
- Separator triplets explain away dependencies, which can provide an enhanced insight into the underlying molecular architecture of pathways.
- Most separators were identified as known mediators of transcriptional responses. The genes they separate share functional roles and regulation patterns. This is consistent with the separator serving as a common regulator.
- Previously uncharacterised genes participating in some of the interactions could be assigned putative effector functions.
- The analysis can be generalised from Markov triplets to d-separation relations, where the mediating gene is not in a direct Markov relation with the two genes it separates.
- This identification of indirect relations provides support for the regulatory role of putative transcription factors and signalling molecules.
Subnetwork analysis
- Bayesian networks uncover intra-cluster structures: Groups of genes, which by correlation analysis alone are simply clustered together, can be organised in clear functional subnetworks.
- These subnetworks provide a much richer context for regulatory and functional analysis and assist us in understanding the roles of genes and in assigning them putative novel functions.
- This promises to identify regulatory, metabolic, and signalling components of molecular networks.
References
- Friedman, Linial, Nachman, Pe'er (2000):
Using Bayesian networks to analyze expression data
Journal of Computational Biology 7, pp. 601-620
- Pe'er, Regev, Elidan, Friedman (2001):
Inferring subnetworks from perturbed expression profiles
Bioinformatics 17, pp. S215-S224