Christoforos

FIRST SUPERVISION

The following issues were discussed:

Coding up: We will extend the Bayesian Network Toolbox in Matlab. One routine will generate the ground Bayesian Network from the dataset and another will use the parameter tying option of BNT in accordance with the higher order dependency specifications given by the PRM.

Markov Equivalence: It can be convincingly argued that the notion of Markov Equivalence does not generally extend to the higher order description of PRMs. In specific, it is straightforward to show that two semantically indistinguishable PRMs can correspond to statistically distinct models. Experimental verification of this result can be found in the literature, though no explanation was offered by the authors for the discrepancy. Such discrepancies suggest that PRMs are not as natural a generalisation of Bayesian Networks as suggested by the authors.

Model: We will implement the model that appears in the paper by Segal et al
Genome-wide discovery of transcriptional modules from DNA sequence and gene expression, with the difference that the expression level variables will be binary rather than continuous. The nature of the inference task is to fuse two types of data and use all the information in a coherent way to perform unsupervised clustering. The clustering will be represented by a latent discrete variable.

Parameter Learning Algorithms: Since we have a latent variable, the likelihood term will be analytically intractable. The authors use a hard EM algorithm to perform the training. This algorithm is relatively ad hoc and potentially too sensitive to initialisation. We will compare it with MCMC methods. Alternatives we will try to consider are: simple MCMC, population MCMC and a 'mixed' MCMC algorithm with occasional forced global transitions to avoid getting trapped in an isolated low likelihood area of the search space.

Dependency Structure Learning: We will not attempt to learn the structure of the model, for lack of time but also since there are theoretical complications with structure learning in PRMs. We will however try to theoretically investigate these complications if there is time.

Learning the optimal number of clusters: The results can be safely expected to vary significantly with the number of clusters allowed. It is hence worthwhile to investigate what is the optimal number of clusters. First, we ought to investigate whether our model succeeds to recover the 'true' number of clusters when it can; a simple way to investigate this would be to generate synthetic data from three clusters and train a PRM with six clusters on this dataset. If not, maybe we can come up with a theoretical argument for the reason of this failure. We can then perform model selection to recover the 'true' number of clusters. Three possible alternatives could be examined: Bayesian Information Criterion scoring, Existence Uncertainty PRMs as introduced in 'Learning Probabilistic Models of Link Structure' by the Friedman group and finally a reversible jump MCMC on the existing model.

Tasks for this week:

Start encoding the model bottom-up, starting from the sequences (or top-down, if you refer to Figure 2 of the paper).

Try and learn motifs from synthetic sequences. Generate sequences as follows:

Training is supervised: for the training set, you know the class membership. You test the generalization performance on an independent test set.

Investigate how the performance depends on the following settings:

Now assume that the actual motifs were generated from a more complex dependency model, say a first-order Markov chain. How does this affect the prediction accuracy of the model?
Back to my homepage.