Bayesian Networks for Analysing Gene Expression Data
Biomathematics & Statistics Scotland
Dundee, DD2 5DA
Shortcomings of clustering
The Bayesian network paradigm
Correlation versus causality
Practical implementation and limitations
Molecular pathways consisting of interacting proteins
underlie the major functions of living cells.
A central goal of molecular biology is therefore to
understand the regulatory mechanism that governs
protein synthesis and activity. All the cells in an
organism carry the same genomic data yet their
protein makeup can be drastically different both
temporally and spatially, due to regulation.
Protein synthesis is regulated by control
mechanisms at different stages:
Oligonucleotide or cDNA
allow us to focus on the first stage.
While traditional methods in molecular biology
could only report the expression levels of
single genes, microarrays
measure the abundance of thousands of mRNA
This provides new rich data for understanding
gene expression and regulation.
The most commonly used computational method for analysing
microarray gene expression data is clustering.
Consider a microarray
X_(i.) correspond to genes
and whose columns
correspond to probes (tissue samples,
experiments, etc.). Based on a correlation
measure between the row vectors
genes are partitioned into clusters, using e.g. a hierarchical
Example (gene expression in yeast):
Gene expression during the yeast cell cycle.
The genes correspond to the rows, and the time points of
each experiment are the columns. The
ratio of induction/repression is shown for each gene such that the
magnitude is indicated by the intensity of the colours displayed.
on the left side of the figure shows the structure of the clusters.
Figure copyrighted by
While clustering provides a compact summarisation of the data
and might point to functional relationships between clustered
genes (genes that are coexpressed might be coregulated),
it suffers from the following shortcomings:
Typical questions Bayesian networks can answer, but
where clustering fails:
Clustering is based on a global correlation measure.
This obscures relationships that exist over only a subset of
Clustering fails to detect interactions between genes
different from linear correlation.
It is impossible to incorporate additional types of
information, such as clinical data or experimental details.
Is the effect of a mutated gene on a target direct, or
mediated by other genes?
Which genes mediate the interactions within a cluster
of genes or between clusters?
What is the nature of the interaction between genes
(e.g. does gene A inhibit gene B)?
Summary of the basic concept:
Summary of the basic method:
Bayesian network: graphical display of dependence structure
between multiple interacting quantities (here: expression levels
of different genes).
Probabilistic semantics: Fits the stochastic nature of both
the biological processes and noisy experiments. Capable of
handling noise and estimating the confidence in the different
features of the network.
Due to lack of data:
Extract features that are pronounced in the data
rather than a single model that explains the data.
Directed acyclic graph (DAG):
Graphical representation of the decomposition of a
joint distribution into a product form.
Nodes represent random variables. Arcs indicate conditional
measured expression level of gene
i. Arcs = regulatory interactions between genes.
Define the functional form of the conditional
distributions (e.g. multinomial for discrete variables,
linear Gaussian for continuous variables).
Inference or learning consists of two tasks:
"Best" is to be taken in a Bayesian way as the most
probable structure/parameter vector given the data.
Find the best network structure G.
Given a network structure, find the best set of parameters
for the conditional distributions.
Gene expression experiments typically give small
(many genes but only a few probes/experiments)
which are not sufficiently informative to significantly
determine that a single model is the "right" one.
From a Bayesian perspective, the posterior probability
over models is vague and not dominated by a single
There are two ways to estimate uncertainty:
Sample networks from the
posterior distribution with Markov chain Monte Carlo (MCMC).
Apply bootstrapping and repeat the optimisation
algorithm several times on resampled data sets.
Subnetwork for the yeast expression data, showing
a local graphical map of the SVS1 yeast gene.
Nodes represent genes, arcs represent conditional dependencies,
and the width of an arc corresponds to the computed
The local map shows that CLN2 separates SVS1 from several
other genes. For instance, the interaction between
SRO4 and SVS1 is mediated by CLN2.
While SRO4 might correlate with SVS1,
conditional on CLN2 the expression levels of SRO4 and SVS1 are independent.
Figure partly redrawn from
Friedman et al..
For a presentation of the full network, click
Here is a more comprehensive
introduction to Bayesian networks, written by
Kevin P. Murphy.
Ultimately, we are interested in the flow of causality:
Does knocking out gene A lead to an increased expression
of gene B? However, a Bayesian network is a model of
dependencies between random variables, rather than
causality. More than one graph can imply exactly the
same set of independencies. Example: A-->B and A<--B
are alternative ways of describing that A and B are not
independent. Different graphs that imply the same set of
independencies are called equivalent. The analysis
of causality is hampered by the following fact:
There are two ways to proceed in order to establish
On the basis of observations (passive measurements)
alone we cannot distinguish
between equivalent graphs.
Using observations alone:
Analyse equivalence classes in the form of
partially directed graphs (PDAG),
where a directed edge A-->B denotes that all members
of this equivalence class have this edge, whereas an undirected
edge A--B denotes that some members of the class contain
the edge A-->B, while others contain the edge A<--B.
A directed edge A-->B is an indication, rather than evidence,
that A might be a causal ancestor of B.
Problem: For sparse data, like microarray data, the
analysis is complicated by the fact that we do not have
a single PDAG, but rather a (vague) posterior distribution
Using interventions, that is, setting the values of some
variables using forces outside the causal model, e.g.
gene knockout or over-expression.
This modifies the score function (the posterior probability
of the network conditional on the data) such that it is
no longer structure equivalent:
The score of equivalent graphs is no longer guaranteed to be
the same, which helps us to determine the direction of causality.
Effectively this reduces the set of networks that
are structure equivalent to a subset of networks that
are intervention equivalent.
In a practical implementation of this scheme, one faces
two problems: computational complexity and
To reduce the computational complexity, the conditional
probabilities are chosen of such a functional form
that their parameters can be determined from closed
form equations once the (complete) data and the network structure
are known. (More precisely: The posterior distribution of these
parameters can be calculated in closed form from some
sufficient statistics of the data). This makes the parameter estimation process
fast and allows the algorithm to focus on finding the optimal
network structure (NP-hard problem).
Two functional representations are commonly used:
Learning the Bayesian network structure is an optimisation
problem in the space of DAGs. The number of such graphs
is superexponential in the number of variables and exhaustive
search is intractable (NP-hard problem). Therefore,
one needs to resort to heuristic, local
(greedy) optimisation algorithms.
Friedman et al. suggest the
following sparse candidate algorithm:
Multinomial distribution for the possible states
of a child variable given the state of the parents.
This requires a discretisation of the data
(typically using three categories of gene expression:
under-expressed, baseline, over-expressed).
Advantage: Can model nonlinear dependencies.
Disadvantage: Discretisation causes loss of information.
Results are sensitive to the discretisation thresholds.
Pe'er et al. redeem the
disadvantages of a fixed threshold by modelling the
levels in different experiments as samples from a mixture
of Gaussians and then discretising the data by assigning
each expression level to the most likely mixture component.
Learn a linear regression model for the child
variable given its parents.
Advantage: Works with the actual data; no information
loss caused by discretisation.
Disadvantage: Can only model linear dependencies.
Identify a relatively small number of candidate parents
for each gene based on simple local statistics (such as correlation).
Restrict the search to networks in which only the candidate parents of a variable can be its parents.
Disadvantage: Early choices can result in an overly restricted
search space. To avoid this problem, a heuristic iterative
algorithm is devised that adapts the candidate sets during
Typically, the number of genes (=random variables)
is much larger than the number of gene expression
measurements; e.g. the yeast data set
76 expression measurements for
Consequence: one has a diffused posterior over a huge
model space and cannot possibly list all the networks
that are plausible given the data.
The solution is as follows:
Examples of features:
Identify lower-complex properties of the network ("features").
Estimate the posterior probability of these
features given the data.
Due to the lower complexity this estimation
should be fairly robust even from a small set of sampled
Features are binary indicator variables for
Markov or order relations.
While Markov relations are local properties of a network,
order relations capture a global property.
are variables that are not separated by any other
measured variable. These are parent-child relations
one gene regulates another, and spouse relations
(A-->X<--B), where two genes coregulate a third.
Markov relations indicate that two genes are related
in some joint biological interaction.
Note that in the structure A<--X-->B, A and B are not
Markov related. Here, the interaction between
A and B is
mediated by X, that is, conditional on X,
A and B are independent.
A is an ancestor of B for a given PDAG
if the PDAG
contains a path from A to B in which all
edges are directed. This indicates a causal relation
between A and B.
I briefly summarise some experimental findings reported in
Peer et al..
Most Markov pair relations
correspond to known biochemical or regulatory interactions,
a shared common regulator, or a functional link.
For a third (about 30%) of these "proof-of-principle" genes,
the Pearson correlation was small.
Conclusion: Due to the context specific nature
in which they handle the data,
Bayesian networks detect interactions that are
missed by global correlation analysis.
Separator triplets explain away dependencies,
provide an enhanced insight into the underlying
molecular architecture of pathways.
Most separators were identified as known mediators
of transcriptional responses. The genes they separate
share functional roles and regulation patterns.
This is consistent with the separator serving as a common
Previously uncharacterised genes participating
in some of the interactions could be assigned putative effector
The analysis can be generalised from Markov triplets
to d-separation relations, where the mediating gene is
not in a direct Markov relation with the two genes it
This identification of indirect relations provides
support for the regulatory role of putative transcription
factors and signalling molecules.
Bayesian networks uncover intra-cluster structures:
Groups of genes, which by correlation analysis alone
are simply clustered together, can be organised in
clear functional subnetworks.
These subnetworks provide a much richer context for regulatory
and functional analysis
and assist us in understanding the roles of genes
and in assigning them putative novel functions.
This promises to identify regulatory, metabolic, and
signalling components of molecular networks.
Back to the main page.