Bayesian Networks for Analysing Gene Expression Data
Dirk Husmeier
Biomathematics & Statistics Scotland
SCRI, Invergowrie,
Dundee, DD2 5DA
United Kingdom
August 2001

Introduction

Shortcomings of clustering

The Bayesian network paradigm

Correlation versus causality

Practical implementation and limitations

Experimental results

References
Molecular pathways consisting of interacting proteins
underlie the major functions of living cells.
A central goal of molecular biology is therefore to
understand the regulatory mechanism that governs
protein synthesis and activity. All the cells in an
organism carry the same genomic data yet their
protein makeup can be drastically different both
temporally and spatially, due to regulation.
Protein synthesis is regulated by control
mechanisms at different stages:

Transcription

RNA splicing

Translation

Post-translational modifications
Oligonucleotide or cDNA microarrays allow us to focus on the first stage.
While traditional methods in molecular biology
could only report the expression levels of
single genes, microarrays
measure the abundance of thousands of mRNA
targets simultaneously.
This provides new rich data for understanding
gene expression and regulation.
The most commonly used computational method for analysing
microarray gene expression data is clustering.
Consider a microarray data matrix X = (X_ij), whose rows X_i. correspond to genes and whose columns X_.j correspond to probes (tissue samples, experiments, etc.). Based on a correlation measure between the row vectors X_i., genes are partitioned into clusters, using e.g. a hierarchical clustering algorithm.
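As a minimal sketch of the correlation step: the toy expression matrix and the coupling between two genes below are synthetic illustrations; a real analysis would feed the resulting correlation matrix into a hierarchical clustering routine.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows X_i. = genes, columns X_.j = probes; 6 synthetic genes, 10 probes.
X = rng.normal(size=(6, 10))
X[1] = X[0] + 0.1 * rng.normal(size=10)   # make gene 1 track gene 0

# Pairwise Pearson correlations between the row vectors X_i.
C = np.corrcoef(X)

# A hierarchical (agglomerative) clustering algorithm would merge the most
# strongly correlated pair of genes first; here that is the pair (0, 1).
off_diag = np.abs(C) - np.eye(6)
i, j = divmod(int(np.argmax(off_diag)), 6)
print(i, j, round(C[i, j], 3))
```

Note that the distance used by clustering is global: it is computed over all probes at once, which is exactly the limitation discussed next.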
Example (gene expression in yeast):
Gene expression during the yeast cell cycle. Genes correspond to the rows, and the time points of each experiment are the columns. The induction/repression ratio is shown for each gene, with magnitude indicated by the intensity of the colours displayed. The dendrogram on the left side of the figure shows the structure of the clusters.
Figure copyrighted by
Stanford University.
While clustering provides a compact summarisation of the data
and might point to functional relationships between clustered
genes (genes that are coexpressed might be coregulated),
it suffers from the following shortcomings:

Clustering is based on a global correlation measure.
This obscures relationships that exist over only a subset of
the data.

Clustering fails to detect interactions between genes other than linear correlations.

It is impossible to incorporate additional types of
information, such as clinical data or experimental details.
Typical questions Bayesian networks can answer, but
where clustering fails:

Is the effect of a mutated gene on a target direct, or
mediated by other genes?

Which genes mediate the interactions within a cluster
of genes or between clusters?

What is the nature of the interaction between genes
(e.g. does gene A inhibit gene B)?
Summary of the basic concept:

Bayesian network: graphical display of dependence structure
between multiple interacting quantities (here: expression levels
of different genes).

Probabilistic semantics: Fits the stochastic nature of both
the biological processes and noisy experiments. Capable of
handling noise and estimating the confidence in the different
features of the network.

Due to lack of data:
Extract features that are pronounced in the data
rather than a single model that explains the data.
Summary of the basic method:

Directed acyclic graph (DAG):
Graphical representation of the decomposition of a
joint distribution into a product form.
Nodes represent random variables. Arcs indicate conditional
dependencies.

Random variable X_i = measured expression level of gene i. Arcs = regulatory interactions between genes.

Define the functional form of the conditional
distributions (e.g. multinomial for discrete variables,
linear Gaussian for continuous variables).
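The product-form decomposition referred to above can be written explicitly, where Pa(X_i) denotes the set of parents of node X_i in the DAG:

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

An arc from X_j to X_i thus means that X_j appears in the conditioning set Pa(X_i); a missing arc encodes a conditional independence.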
Inference or learning consists of two tasks:

Find the best network structure G.

Given a network structure, find the best set of parameters
for the conditional distributions.
"Best" is to be taken in a Bayesian way as the most
probable structure/parameter vector given the data.

Estimate uncertainty:
Gene expression experiments typically give small
data sets
(many genes but only a few probes/experiments)
which are not informative enough to identify a single model as the "right" one.
From a Bayesian perspective, the posterior probability
over models is vague and not dominated by a single
model.
There are two ways to estimate uncertainty:

Bayesian approach:
Sample networks from the
posterior distribution with Markov chain Monte Carlo (MCMC).

Frequentist approach:
Apply bootstrapping and repeat the optimisation
algorithm several times on resampled data sets.
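The bootstrap variant can be sketched as follows. The function learn_structure below is a hypothetical stand-in for a full structure-learning algorithm (it simply thresholds pairwise correlations), and the data are synthetic.

```python
import numpy as np

def learn_structure(X):
    # Hypothetical stand-in for a structure learner: declare an edge (i, j)
    # whenever the correlation of genes i and j exceeds a threshold.
    C = np.corrcoef(X)
    n = X.shape[0]
    return {(i, j) for i in range(n) for j in range(i + 1, n) if abs(C[i, j]) > 0.8}

def bootstrap_confidence(X, feature, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_boot):
        cols = rng.integers(0, X.shape[1], size=X.shape[1])  # resample the probes
        if feature in learn_structure(X[:, cols]):
            hits += 1
    return hits / n_boot   # fraction of replicates containing the feature

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 30))                # 4 synthetic genes, 30 probes
X[1] = X[0] + 0.2 * rng.normal(size=30)     # couple genes 0 and 1
print(bootstrap_confidence(X, (0, 1)))      # close to 1: a well-supported edge
```

The MCMC alternative replaces the resampling loop with samples of network structures drawn from the posterior, but the confidence estimate is read off in the same way: as the fraction of sampled networks containing the feature.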
Example:
Subnetwork for the yeast expression data, showing
a local graphical map of the SVS1 yeast gene.
Nodes represent genes, arcs represent conditional dependencies,
and the width of an arc corresponds to the computed
confidence level.
The local map shows that CLN2 separates SVS1 from several
other genes. For instance, the interaction between
SRO4 and SVS1 is mediated by CLN2.
While SRO4 might correlate with SVS1,
conditional on CLN2 the expression levels of SRO4 and SVS1 are independent.
Figure partly redrawn from
Friedman et al.
A more comprehensive introduction to Bayesian networks has been written by Kevin P. Murphy.
Ultimately, we are interested in the flow of causality:
Does knocking out gene A lead to an increased expression
of gene B? However, a Bayesian network is a model of
dependencies between random variables, rather than
causality. More than one graph can imply exactly the
same set of independencies. Example: A→B and A←B
are alternative ways of describing that A and B are not
independent. Different graphs that imply the same set of
independencies are called equivalent. The analysis
of causality is hampered by the following fact:

On the basis of observations (passive measurements)
alone we cannot distinguish
between equivalent graphs.
There are two ways to proceed in order to establish
causal relations.

Using observations alone:
Analyse equivalence classes in the form of
partially directed graphs (PDAGs), where a directed edge A→B denotes that all members of this equivalence class have this edge, whereas an undirected edge A–B denotes that some members of the class contain the edge A→B, while others contain the edge A←B. A directed edge A→B is an indication, rather than evidence, that A might be a causal ancestor of B.
Problem: For sparse data, like microarray data, the
analysis is complicated by the fact that we do not have
a single PDAG, but rather a (vague) posterior distribution
over PDAGs.

Using interventions, that is, setting the values of some
variables using forces outside the causal model, e.g.
gene knockout or overexpression.
This modifies the score function (the posterior probability of the network conditional on the data) such that it is no longer structure-equivalent: the scores of equivalent graphs are no longer guaranteed to be the same, which helps us to determine the direction of causality. Effectively, this reduces the set of networks that are structure-equivalent to a subset of networks that are intervention-equivalent.
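This can be illustrated with a toy intervention experiment. The Gaussian scoring below is a simplified stand-in for the full Bayesian score, and all numbers are synthetic: A is set externally (e.g. knockout vs. overexpression), and B responds to A.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# A is SET from outside the system by the intervention; B responds to A.
A = rng.choice([-1.0, 1.0], size=n)
B = 0.9 * A + 0.5 * rng.normal(size=n)

def gauss_loglik(x, mean):
    # Maximised Gaussian log-likelihood of the residuals x - mean.
    var = np.mean((x - mean) ** 2)
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1)

# Graph A->B: the intervened node A contributes no likelihood term,
# so the score reduces to log P(B | A), fitted by linear regression.
slope, intercept = np.polyfit(A, B, 1)
score_AtoB = gauss_loglik(B, slope * A + intercept)

# Graph B->A: A's conditional distribution is overridden by the
# intervention, so the score reduces to the marginal log P(B).
score_BtoA = gauss_loglik(B, B.mean())

print(score_AtoB > score_BtoA)   # True: the causal direction scores higher
```

With observational data alone the two scores would coincide; it is the intervention that breaks the tie between the equivalent graphs.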
In a practical implementation of this scheme, one faces
two problems: computational complexity and
limited data.
Complexity
To reduce the computational complexity, the conditional
probabilities are chosen of such a functional form
that their parameters can be determined from closed
form equations once the (complete) data and the network structure
are known. (More precisely: The posterior distribution of these
parameters can be calculated in closed form from some
sufficient statistics of the data). This makes the parameter estimation process
fast and allows the algorithm to focus on finding the optimal
network structure (an NP-hard problem).
Two functional representations are commonly used:

Multinomial model.
Multinomial distribution for the possible states
of a child variable given the state of the parents.
This requires a discretisation of the data
(typically using three categories of gene expression:
underexpressed, baseline, overexpressed).
Advantage: Can model nonlinear dependencies.
Disadvantage: Discretisation causes loss of information.
Results are sensitive to the discretisation thresholds.
Pe'er et al. remedy the
disadvantages of a fixed threshold by modelling the
gene expression
levels in different experiments as samples from a mixture
of Gaussians and then discretising the data by assigning
each expression level to the most likely mixture component.

Linear Gaussian.
Learn a linear regression model for the child
variable given its parents.
Advantage: Works with the actual data; no information
loss caused by discretisation.
Disadvantage: Can only model linear dependencies.
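The two representations can be contrasted on a toy parent/child pair. The thresholds and the simulated data below are hypothetical illustrations, not the published procedure (a mixture-of-Gaussians assignment, as in Pe'er et al., would replace the hard cut-offs).

```python
import numpy as np

rng = np.random.default_rng(0)
parent = rng.normal(size=200)                         # expression of the parent gene
child = 0.8 * parent + 0.3 * rng.normal(size=200)     # child depends linearly on parent

# Multinomial model: discretise into under-expressed / baseline / over-expressed
# using fixed, illustrative thresholds on the expression level.
def discretise(x, lo=-0.5, hi=0.5):
    return np.digitize(x, [lo, hi])       # 0 = under, 1 = baseline, 2 = over

counts = np.zeros((3, 3))
for p, c in zip(discretise(parent), discretise(child)):
    counts[p, c] += 1
cond = counts / counts.sum(axis=1, keepdims=True)     # P(child state | parent state)

# Linear Gaussian model: least-squares fit of child on parent, no discretisation.
slope = np.polyfit(parent, child, 1)[0]
print(cond.round(2))
print(round(slope, 2))                    # close to the true coefficient 0.8
```

The 3x3 conditional table can capture arbitrary (including non-monotonic) dependencies but throws away the within-bin information; the single regression coefficient keeps the raw data but can only express a linear effect.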
Learning the Bayesian network structure is an optimisation
problem in the space of DAGs. The number of such graphs
is super-exponential in the number of variables, and exhaustive
search is intractable (an NP-hard problem). Therefore,
one needs to resort to heuristic, local
(greedy) optimisation algorithms.
Friedman et al. suggest the
following sparse candidate algorithm:

Identify a relatively small number of candidate parents
for each gene based on simple local statistics (such as correlation).

Restrict the search to networks in which only the candidate parents of a variable can be its parents.

Disadvantage: Early choices can result in an overly restricted
search space. To avoid this problem, a heuristic iterative
algorithm is devised that adapts the candidate sets during
the search.
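The candidate-selection step can be sketched as follows; the gene coupling is synthetic, and absolute correlation stands in for the simple local statistic mentioned above.

```python
import numpy as np

def candidate_parents(X, k=2):
    # For each gene, keep only the k most correlated other genes as
    # admissible parents, shrinking the structure search space.
    C = np.abs(np.corrcoef(X))
    np.fill_diagonal(C, -1.0)             # a gene cannot be its own parent
    return {i: list(np.argsort(C[i])[::-1][:k]) for i in range(X.shape[0])}

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))              # 5 synthetic genes, 40 probes
X[2] = X[0] + 0.2 * rng.normal(size=40)   # gene 2 tracks gene 0
cands = candidate_parents(X, k=2)
print(cands[2])                           # gene 0 is among gene 2's candidates
```

The subsequent structure search then only considers networks whose parent sets are drawn from these candidate lists; the iterative refinement mentioned above would re-estimate the lists between search rounds.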
Limited data
Typically, the number of genes (= random variables) is much larger than the number of gene expression measurements; e.g. the yeast data set contains 76 expression measurements for 800 genes.
Consequence: one has a diffuse posterior over a huge
model space and cannot possibly list all the networks
that are plausible given the data.
The solution is as follows:

Identify lower-complexity properties of the network ("features").

Estimate the posterior probability of these
features given the data.
Due to the lower complexity this estimation
should be fairly robust even from a small set of sampled
networks.
Examples of features:

Markov relations.
Markov neighbours
are variables that are not separated by any other
measured variable. These are parent-child relations
(A→B), where one gene regulates another, and spouse relations
(A→X←B), where two genes co-regulate a third.
Markov relations indicate that two genes are related
in some joint biological interaction.
Note that in the structure A←X→B, A and B are not
Markov related. Here, the interaction between
A and B is
mediated by X, that is, conditional on X,
A and B are independent.

Order relations.
A is an ancestor of B for a given PDAG
if the PDAG
contains a path from A to B in which all
edges are directed. This indicates a causal relation
between A and B.
Features are binary indicator variables for
Markov or order relations.
While Markov relations are local properties of a network,
order relations capture a global property.
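Estimating the posterior probability of a feature then reduces to counting: given networks sampled from the posterior (or learned from bootstrap replicates), the confidence in a feature is the fraction of samples containing it. The edge sets below are illustrative, not real yeast results.

```python
# Four hypothetical sampled networks, each represented as a set of
# directed edges (parent, child).
sampled_networks = [
    {("SRO4", "CLN2"), ("CLN2", "SVS1")},
    {("SRO4", "CLN2"), ("CLN2", "SVS1")},
    {("CLN2", "SVS1"), ("SRO4", "SVS1")},
    {("SRO4", "CLN2"), ("CLN2", "SVS1")},
]

def feature_confidence(networks, edge):
    # Fraction of sampled networks in which the feature is present.
    return sum(edge in g for g in networks) / len(networks)

print(feature_confidence(sampled_networks, ("CLN2", "SVS1")))   # 1.0
print(feature_confidence(sampled_networks, ("SRO4", "SVS1")))   # 0.25
```

Because each feature is a simple binary indicator, its frequency stabilises with far fewer samples than would be needed to pin down a full network structure.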
I briefly summarise some experimental findings reported in
Pe'er et al.
Pairwise relations

Most Markov pair relations
correspond to known biochemical or regulatory interactions,
a shared common regulator, or a functional link.

For a third (about 30%) of these "proof-of-principle" genes,
the Pearson correlation was small.

Conclusion: Due to the context-specific way in which they handle the data,
Bayesian networks detect interactions that are
missed by global correlation analysis.
Separator relations

Separator triplets explain away dependencies,
which can
provide an enhanced insight into the underlying
molecular architecture of pathways.

Most separators were identified as known mediators
of transcriptional responses. The genes they separate
share functional roles and regulation patterns.
This is consistent with the separator serving as a common
regulator.

Previously uncharacterised genes participating
in some of the interactions could be assigned putative effector
functions.

The analysis can be generalised from Markov triplets
to dseparation relations, where the mediating gene is
not in a direct Markov relation with the two genes it
separates.

This identification of indirect relations provides
support for the regulatory role of putative transcription
factors and signalling molecules.
Subnetwork analysis

Bayesian networks uncover intracluster structures:
Groups of genes, which by correlation analysis alone
are simply clustered together, can be organised in
clear functional subnetworks.

These subnetworks provide a much richer context for regulatory
and functional analysis
and assist us in understanding the roles of genes
and in assigning them putative novel functions.

This promises to identify regulatory, metabolic, and
signalling components of molecular networks.