Systems Biology
(Biomathematics and Statistics Scotland, since 2002)
Inferring genetic networks from microarray gene expression data.
Bioinformatics
(Biomathematics and Statistics Scotland, since 1999)
I have recently moved
into bioinformatics, where
my main interests are the development of
statistical methods for analysing DNA sequences and
the application of machine learning
techniques in phylogenetics.
The objective of phylogenetics is
the reconstruction of the evolutionary history of species,
expressed in a so-called phylogenetic tree,
from a DNA sequence alignment.
Besides being of fundamental importance in itself -
aiming to estimate, for instance, the ancestry of the human race
or to derive the whole tree of life -
this methodology has recently become of immense practical
relevance in epidemiology (suggesting, e.g., cross-infection
between humans and apes in the emergence of AIDS) and
forensic science (e.g., proving that a dentist in Florida infected
several of his patients with HIV).
Evolution is driven by stochastic forces
that act on genomes, and phylogenetics
essentially tries to discern
significant similarities between diverged sequences amidst
a chaos of random mutation, natural selection, and genetic drift.
Faced with a poor signal-to-noise ratio, the most powerful
methods make use of probability theory.
I am currently working on a project to detect
sporadic recombination
in multiple DNA sequence alignments.
Conventional phylogenetic tree estimation methods assume
that all sites in a DNA multiple alignment have
the same evolutionary history.
This is a reasonable approach when applied to
DNA sequences obtained from most species.
However, this assumption is violated in certain
bacteria and viruses due to sporadic
recombination, which is a process that leads to
the transfer of DNA subsequences between
different strains.
The resulting mixing of the genetic material by the formation
of so-called mosaic sequences establishes
an important source of genetic variation
and constitutes a mechanism through which
many disease-causing bacteria may acquire resistance to
antibiotics.
While the detection of recombination is known to
be important in its own right for many medical applications
(HIV-1, for instance, shows a high recombination frequency, and
the existence of mosaic sequences
needs to be considered
during the design of a potential vaccine),
it is also a
crucial prerequisite for consistently inferring the
evolutionary history of a set of DNA sequences.
Click
here
to find out more about previous work done on this project.
Bayesian Machine Learning and Medical Applications
(Imperial College London, 1997-1999)
Recently there has been much interest in the use of
Bayesian methods in problems of machine learning and inference,
particularly in combination with powerful nonlinear function
approximators such as neural networks.
Since nonlinear models induce complicated probability densities,
approximations become necessary.
It is not clear how much of the elegance of the
Bayesian framework is lost in the presence of these
approximations.
In a recent empirical study I carried out an extensive evaluation
on a range of benchmark classification problems,
where the objective was to study the sensitivity of the Bayesian
scheme to changes in the prior distribution of the parameters and
hyperparameters, and to evaluate
the efficiency of the so-called automatic relevance determination
(ARD) method.
On the practical side, I am applying Bayesian neural networks
to predict the development and progression
of Kaposi's Sarcoma (KS).
This is a joint project between
the
Department of Electrical and Electronic Engineering,
Imperial College
and the Department of Genito-urinary Medicine,
St. Mary's Hospital.
Kaposi's sarcoma (KS) is a vascular tumour,
which is more common and often aggressive in patients
with underlying immunosuppression
(post-transplant KS and AIDS-associated KS).
The aim is to determine factors that influence the
variable progression rate of KS in HIV-infected individuals
by multi-variable analysis in order to define clinical end-points
and provide guidelines for better patient management.
To this end I apply the automatic relevance determination (ARD)
method for Bayesian neural networks as well as the
determination of vertices on receiver operating characteristic
(ROC) curves.
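
As a minimal sketch of how an ROC curve is traced out (an illustration only, not the code used in the project): the cases are sorted by predicted score, and each possible cut-off yields one point of true positive rate against false positive rate. The function name `roc_points` and the scores and outcomes below are made up for the example, and tied scores are not treated specially.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (false positive rate, true positive rate), one point per cut-off."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")   # highest score first
    labels = labels[order]
    tp = np.cumsum(labels)        # true positives when the top-k cases are flagged
    fp = np.cumsum(1 - labels)    # false positives for the same cut-off
    tpr = np.concatenate(([0.0], tp / labels.sum()))
    fpr = np.concatenate(([0.0], fp / (1 - labels).sum()))
    return fpr, tpr

# Made-up predicted scores and outcomes (1 = event observed), for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)

# Area under the curve by the trapezoidal rule.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
print(f"AUC = {auc:.3f}")
```

Each (fpr, tpr) pair corresponds to one choice of decision threshold, so a clinical operating point can be read off the curve once the relative cost of false positives and false negatives has been fixed.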
Neural Computation
(King's College London, 1994-1997)
My research during my PhD studies at
King's College London focused on time series prediction and the
estimation of conditional probability densities
with neural networks.
An overview of this work can be found in the
synopsis of my
recently published book:
Conventional applications of neural networks usually
predict a single value as a function of given inputs.
In forecasting, for example,
a standard objective is to predict the future value
of some entity of interest on the basis of a time
series of past measurements or observations.
Typical training schemes aim to minimise the sum of
squared deviations between predicted and actual
values (the 'targets'), by which, ideally, the network
learns the conditional mean of the target
given the input.
If the underlying conditional distribution is
Gaussian or at least unimodal,
this may be a satisfactory approach.
However, for a multimodal distribution, the
conditional
mean does not capture the relevant features of the
system, and the
prediction performance will, in general, be very poor.
This calls for a more powerful and sophisticated model,
which can learn the whole conditional probability distribution.
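
The failure of the conditional mean for a multimodal target can be seen in a small numerical sketch. The bimodal distribution below (modes at -1 and +1, noise level 0.1) is a made-up example, not one of the systems studied in the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal target: for a fixed input, y lies near -1 or +1
# with equal probability.
modes = rng.choice([-1.0, 1.0], size=10_000)
y = modes + 0.1 * rng.standard_normal(10_000)

# The constant prediction that minimises the sum of squared errors
# coincides (up to the grid spacing) with the sample mean.
candidates = np.linspace(-1.5, 1.5, 301)
sse = np.array([np.sum((y - c) ** 2) for c in candidates])
best = candidates[np.argmin(sse)]

# Fraction of observed targets lying within 0.3 of that prediction.
near_best = np.mean(np.abs(y - best) < 0.3)

print(f"least-squares prediction: {best:.2f}")
print(f"sample mean:              {y.mean():.2f}")
print(f"fraction of targets near the prediction: {near_best:.3f}")
```

The least-squares prediction falls between the two modes, in a region where almost no targets are observed, which is why a model of the full conditional density is needed in such cases.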
Chapter 1 demonstrates that
even for a deterministic system and
'benign' Gaussian observational noise,
the conditional distribution of a future observation,
conditional on a set of past observations, can
become strongly skewed and multimodal.
In Chapter 2, a general neural network structure
for modelling conditional probability densities
is derived, and it is shown that a universal
approximator for this extended task requires
at least two hidden layers.
A training scheme is developed from a
maximum likelihood
approach in Chapter 3, and the performance
of this method is demonstrated on
three stochastic time series in Chapters 4
and 5.
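
A maximum likelihood objective for a conditional density model can be sketched as follows. The sketch assumes that the network outputs the parameters of a Gaussian mixture for each input; this parameterisation, the function name `mixture_nll`, and the toy data are assumptions made for illustration and are not taken from the text above.

```python
import numpy as np

def mixture_nll(y, weights, means, sigmas):
    """Mean negative log-likelihood of targets under per-sample Gaussian mixtures.

    y has shape (N,); weights, means and sigmas have shape (N, K) and stand
    for the mixture parameters a network would output for each input
    (each row of weights sums to 1).
    """
    y = np.asarray(y, dtype=float)[:, None]
    weights, means, sigmas = (
        np.asarray(a, dtype=float) for a in (weights, means, sigmas)
    )
    log_comp = (
        np.log(weights)
        - 0.5 * np.log(2.0 * np.pi * sigmas ** 2)
        - 0.5 * ((y - means) / sigmas) ** 2
    )
    # Log-sum-exp over components, stabilised by subtracting the row maximum.
    m = log_comp.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return -log_p.mean()

# Toy bimodal targets (made up for the example).
rng = np.random.default_rng(1)
n = 5000
y = rng.choice([-1.0, 1.0], size=n) + 0.1 * rng.standard_normal(n)

ones = np.ones((n, 1))
# A single Gaussian matched to the overall mean and spread of the data.
nll_single = mixture_nll(y, ones, ones * y.mean(), ones * y.std())
# A two-component mixture centred on the two modes.
nll_mixture = mixture_nll(
    y,
    np.full((n, 2), 0.5),
    np.tile([-1.0, 1.0], (n, 1)),
    np.full((n, 2), 0.1),
)
print(f"single Gaussian NLL: {nll_single:.3f}")
print(f"two-component NLL:   {nll_mixture:.3f}")
```

In training, the mixture parameters would be functions of the input and the network weights, and the weights would be adjusted to minimise this quantity; here the parameters are fixed by hand only to show that the mixture assigns a much higher likelihood to bimodal data.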
Several extensions of this basic paradigm are studied
in the following chapters, aiming at both an
increased training speed and a better generalisation
performance.
Chapter 7 shows that
a straightforward application
of the Expectation
Maximisation (EM) algorithm does not lead to
any improvement in
the training scheme, but that in combination with the
random vector functional link (RVFL)
net approach,
reviewed in Chapter 6, the training
process can be accelerated by about two orders of magnitude.
An empirical corroboration for this 'speed-up'
can be found in Chapter 8.
Chapter 9 discusses a simple
Bayesian approach to network training,
where a conjugate prior distribution on the network
parameters naturally results in a penalty term
for regularisation.
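
The link between a prior and a penalty term can be made explicit in the simplest case. Assuming, purely for illustration, a zero-mean isotropic Gaussian prior on the weight vector $\mathbf{w}$ with precision $\alpha$ (the text above does not specify the form of the prior), the negative log posterior for data $D$ is the negative log-likelihood plus a quadratic weight-decay term:

```latex
p(\mathbf{w}) \propto \exp\!\left(-\frac{\alpha}{2}\,\|\mathbf{w}\|^{2}\right)
\quad\Longrightarrow\quad
-\ln p(\mathbf{w}\mid D)
  = -\ln p(D\mid\mathbf{w})
  + \frac{\alpha}{2}\,\|\mathbf{w}\|^{2}
  + \text{const}.
```

Maximising the posterior is therefore equivalent to minimising the usual error function plus a penalty whose strength is set by the hyperparameter $\alpha$.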
However, the hyperparameters still
need to be set by intuition or cross-validation,
so a logical extension is presented in
Chapters 10 and 11,
where the
Bayesian evidence scheme,
introduced to the neural network
community by MacKay for regularisation and model selection
in the simple case of Gaussian homoscedastic noise,
is generalised to arbitrary
conditional probability densities. The Hessian matrix of the
error function is calculated with an extended version of the
EM algorithm.
The resulting update equations for the hyperparameters
and the expression for the model evidence
are found to reduce to
MacKay's results in the above limit of Gaussian noise
and thus provide a consistent generalisation
of these earlier results.
An empirical test of the evidence-based regularisation scheme,
presented in Chapter 12, confirms that the problem of
overfitting can be considerably reduced,
and that the training process is stabilised with
respect to changes in the length of training time.
A further improvement of the generalisation
performance can be achieved by
employing network committees, for which two weighting
schemes - based on either the evidence or the
cross-validation performance - are derived
in Chapter 13.
Chapters 14 and 16 report the results
of extensive simulations on a synthetic and a
real-world problem,
where the intriguing observation is made that in
network committees, overfitting of the individual
models can be useful and may lead to better prediction results
than obtained with an ensemble of properly regularised networks.
An explanation for this curiosity can be given
in terms of a modified
bias-variance dilemma, as expounded in
Chapter 13.
The subject of Chapter 15 is the problem of feature
selection and the identification of irrelevant inputs.
To this end, the automatic relevance
determination (ARD) scheme of MacKay and Neal is adapted
to learning in committees of probability-predicting RVFL networks.
This method is applied in Chapter 16 to a
real-world benchmark problem, where the objective is
the prediction of housing prices
in the Boston metropolitan area on the basis of various
socio-economic explanatory variables.
The book concludes in Chapter 17
with a brief summary.
Theoretical Biophysics
(RUB Bochum, 1989-1991)
During my 'Diplomarbeit' (diploma thesis) at the
Department of Biophysics
of the University of Bochum (RUB) I worked
on molecular dynamics in proteins.
The objective was to numerically solve the
Hamiltonian equations of motion of a complex
biological system (hemoglobin and solvent)
and to compute the trajectories in the
high-dimensional phase space of all atomic coordinates
and momenta.
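
The numerical solution of Hamilton's equations can be illustrated on a toy system. The sketch below uses the velocity Verlet scheme, a standard integrator in molecular dynamics, on a one-dimensional harmonic oscillator; the choice of integrator, the toy potential, and all parameter values are assumptions for illustration and do not describe the actual hemoglobin simulation.

```python
import numpy as np

def velocity_verlet(q, p, force, mass, dt, n_steps):
    """Integrate dq/dt = p/m, dp/dt = F(q) and return the phase-space trajectory."""
    qs, ps = [q], [p]
    f = force(q)
    for _ in range(n_steps):
        p_half = p + 0.5 * dt * f      # half-step momentum update
        q = q + dt * p_half / mass     # full-step position update
        f = force(q)                   # force at the new position
        p = p_half + 0.5 * dt * f      # second half-step momentum update
        qs.append(q)
        ps.append(p)
    return np.array(qs), np.array(ps)

# Toy system: harmonic oscillator with unit mass and unit force constant.
k, mass = 1.0, 1.0
qs, ps = velocity_verlet(
    q=1.0, p=0.0, force=lambda q: -k * q, mass=mass, dt=0.01, n_steps=10_000
)

# The total energy should stay close to its initial value along the trajectory.
energy = ps ** 2 / (2.0 * mass) + 0.5 * k * qs ** 2
drift = np.max(np.abs(energy - energy[0]))
print(f"maximum energy deviation: {drift:.2e}")
```

For a protein in solvent the scalar q and p become vectors over all atomic coordinates and momenta, and the force is the negative gradient of a molecular force field, but the update steps have the same form.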
This allowed the simulation of the dynamic and
kinetic processes in hemoglobin
and the analysis
of their structural-functional relationship.
Of particular interest was an intramolecular
reaction, where a covalent bond
is formed between the N-epsilon of HisE7
and the Heme-Fe.
This blocks the active site
of the protein and disables its physiological
function as a transport molecule
for ligands (like oxygen).
By coupling the system to a heat bath and
applying the method of thermodynamic integration
I computed the thermodynamic quantities
(enthalpy and entropy) of this reaction,
which were found to be in reasonable agreement with
earlier temperature-jump and EPR experiments.
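
For reference, the textbook form of thermodynamic integration is sketched below. It assumes a Hamiltonian $\mathcal{H}(\lambda)$ that interpolates between the unbound state ($\lambda = 0$) and the bound state ($\lambda = 1$); the coupling parameter $\lambda$ and the route to the entropy via the temperature dependence are standard choices assumed here, not details stated above.

```latex
\Delta A = \int_{0}^{1}
  \left\langle \frac{\partial \mathcal{H}(\lambda)}{\partial \lambda} \right\rangle_{\lambda}
  \,\mathrm{d}\lambda ,
\qquad
\Delta S = -\left(\frac{\partial\, \Delta A}{\partial T}\right)_{V} ,
\qquad
\Delta U = \Delta A + T\,\Delta S .
```

Here $\langle\cdot\rangle_{\lambda}$ denotes a canonical ensemble average at fixed $\lambda$, estimated from the simulated trajectory. At constant pressure the analogous relations hold with the Gibbs free energy $\Delta G$ and the enthalpy $\Delta H = \Delta G + T\,\Delta S$.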
The results of this study contributed to
a deeper understanding
of the role of the entropy in the physiological function
of proteins (entropy-enthalpy compensation).
Last update: July 2002.