Bayesian Networks for Analysing Gene Expression Data

Dirk Husmeier
Biomathematics & Statistics Scotland
SCRI, Invergowrie, Dundee, DD2 5DA
United Kingdom
August 2001

  1. Introduction
  2. Shortcomings of clustering
  3. The Bayesian network paradigm
  4. Correlation versus causality
  5. Practical implementation and limitations
  6. Experimental results
  7. References


Molecular pathways consisting of interacting proteins underlie the major functions of living cells. A central goal of molecular biology is therefore to understand the regulatory mechanism that governs protein synthesis and activity. All the cells in an organism carry the same genomic data yet their protein makeup can be drastically different both temporally and spatially, due to regulation. Protein synthesis is regulated by control mechanisms at different stages:

  1. Transcription
  2. RNA splicing
  3. Translation
  4. Post-translational modifications
Oligonucleotide or cDNA microarrays allow us to focus on the first stage. While traditional methods in molecular biology could only report the expression levels of single genes, microarrays measure the abundance of thousands of mRNA targets simultaneously. This provides new rich data for understanding gene expression and regulation.

Shortcomings of clustering

The most commonly used computational method for analysing microarray gene expression data is clustering. Consider a microarray X_(ij), whose rows X_(i.) correspond to genes and whose columns X_(.j) correspond to probes (tissue samples, experiments, etc.). Based on a correlation measure between the row vectors X_(i.), genes are partitioned into clusters, using e.g. a hierarchical clustering algorithm.

Example (gene expression in yeast):

Clustering gene expression data for the yeast cell cycle
Gene expression during the yeast cell cycle. The genes correspond to the rows, and the time points of each experiment are the columns. The ratio of induction/repression is shown for each gene such that the magnitude is indicated by the intensity of the colours displayed. The dendrogram on the left side of the figure shows the structure of the clusters. Figure copyrighted by Stanford University.

While clustering provides a compact summarisation of the data and might point to functional relationships between clustered genes (genes that are coexpressed might be coregulated), it suffers from the following shortcomings:

Typical questions Bayesian networks can answer, but where clustering fails:

The Bayesian network paradigm

Summary of the basic concept:

Summary of the basic method:


Local graphical map of the SVS1 yeast gene.
Subnetwork for the yeast expression data, showing a local graphical map of the SVS1 yeast gene. Nodes represent genes, arcs represent conditional dependencies, and the width of an arc corresponds to the computed confidence level. The local map shows that CLN2 separates SVS1 from several other genes. For instance, the interaction between SRO4 and SVS1 is mediated by CLN2. While SRO4 might correlate with SVS1, conditional on CLN2 the expression levels of SRO4 and SVS1 are independent. Figure partly redrawn from Friedman et al.. For a presentation of the full network, click here.

Here is a more comprehensive introduction to Bayesian networks, written by Kevin P. Murphy.

Correlation versus causality

Ultimately, we are interested in the flow of causality: Does knocking out gene A lead to an increased expression of gene B? However, a Bayesian network is a model of dependencies between random variables, rather than causality. More than one graph can imply exactly the same set of independencies. Example: A-->B and A<--B are alternative ways of describing that A and B are not independent. Different graphs that imply the same set of independencies are called equivalent. The analysis of causality is hampered by the following fact:

There are two ways to proceed in order to establish causal relations.

Practical implementation and limitations

In a practical implementation of this scheme, one faces two problems: computational complexity and limited data.

To reduce the computational complexity, the conditional probabilities are chosen of such a functional form that their parameters can be determined from closed form equations once the (complete) data and the network structure are known. (More precisely: The posterior distribution of these parameters can be calculated in closed form from some sufficient statistics of the data). This makes the parameter estimation process fast and allows the algorithm to focus on finding the optimal network structure (NP-hard problem). Two functional representations are commonly used:

Learning the Bayesian network structure is an optimisation problem in the space of DAGs. The number of such graphs is superexponential in the number of variables and exhaustive search is intractable (NP-hard problem). Therefore, one needs to resort to heuristic, local (greedy) optimisation algorithms. Friedman et al. suggest the following sparse candidate algorithm:
  1. Identify a relatively small number of candidate parents for each gene based on simple local statistics (such as correlation).
  2. Restrict the search to networks in which only the candidate parents of a variable can be its parents.
  3. Disadvantage: Early choices can result in an overly restricted search space. To avoid this problem, a heuristic iterative algorithm is devised that adapts the candidate sets during the search.

Limited data
Typically, the number of genes (=random variables) is much larger than the number of gene expression measurements; e.g. the yeast data set contains 76 expression measurements for 800 genes. Consequence: one has a diffused posterior over a huge model space and cannot possibly list all the networks that are plausible given the data. The solution is as follows:

  1. Identify lower-complex properties of the network ("features").
  2. Estimate the posterior probability of these features given the data. Due to the lower complexity this estimation should be fairly robust even from a small set of sampled networks.
Examples of features:
  1. Markov relations. Markov neighbours are variables that are not separated by any other measured variable. These are parent-child relations (A-->B), where one gene regulates another, and spouse relations (A-->X<--B), where two genes coregulate a third. Markov relations indicate that two genes are related in some joint biological interaction. Note that in the structure A<--X-->B, A and B are not Markov related. Here, the interaction between A and B is mediated by X, that is, conditional on X, A and B are independent.
  2. Order relations. A is an ancestor of B for a given PDAG if the PDAG contains a path from A to B in which all edges are directed. This indicates a causal relation between A and B.
Features are binary indicator variables for Markov or order relations. While Markov relations are local properties of a network, order relations capture a global property.

Experimental results

I briefly summarise some experimental findings reported in Peer et al..

Pairwise relations

Separator relations

Subnetwork analysis


Back to the main page.