Phone: 0030 210 654 1951
28 May 2004
Problem
Identify co-regulated genes on the basis of their
promoter sequences.
Task
Embed known motifs in background sequences.
Vary the noise level of the motifs
(e.g., include motifs which deviate from the original
ones by a pre-specified Hamming distance).
Try to distinguish these sequences from pure background
sequences.
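The embedding step can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the helper names (`mutate`, `embed`, `hamming`), the motif string `TGACTCA`, the sequence length, and the uniform background are all assumptions made for the example.

```python
import random

ALPHABET = "ACGT"

def mutate(motif, distance, rng):
    """Return a copy of `motif` that differs from it at exactly `distance` positions."""
    positions = rng.sample(range(len(motif)), distance)
    letters = list(motif)
    for pos in positions:
        # Pick a nucleotide different from the original one at this position.
        letters[pos] = rng.choice([c for c in ALPHABET if c != motif[pos]])
    return "".join(letters)

def embed(motif, length, rng):
    """Plant `motif` at a random offset in a uniform background sequence."""
    background = [rng.choice(ALPHABET) for _ in range(length)]
    start = rng.randrange(length - len(motif) + 1)
    background[start:start + len(motif)] = motif
    return "".join(background), start

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical example: a motif instance at Hamming distance 2 from the original.
rng = random.Random(0)
noisy = mutate("TGACTCA", 2, rng)
sequence, start = embed(noisy, 50, rng)
```

Pure background sequences for the comparison can be drawn with the same background sampler, simply skipping the planting step.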
Seeding
Local greedy search is prone to getting stuck in
suboptimal local optima. Therefore, explore the effect
of seeding, that is, of an informed initialization.
To extract binding motifs, use MEME or Friedman's
hypergeometric approach.
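As a rough illustration of how a hypergeometric score can rank candidate seeds, the sketch below computes the upper-tail probability of seeing a k-mer at least as often as observed in a selected set of sequences. It is an assumption-laden simplification and not a reproduction of the published method: the function names, the choice of exact k-mers as candidates, and counting each sequence at most once are all choices made for this example.

```python
from math import comb

def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    total = comb(population, draws)
    upper = min(successes, draws)
    hits = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return hits / total

def kmer_enrichment(kmer, selected, others):
    """Tail probability of `kmer` occurring this often in `selected` by chance."""
    everything = selected + others
    containing = sum(kmer in seq for seq in everything)  # sequences with the k-mer
    observed = sum(kmer in seq for seq in selected)      # of those, in the selected set
    return hypergeom_tail(len(everything), containing, len(selected), observed)
```

A small value indicates that the k-mer is over-represented in the selected set, which makes it a plausible seed for the subsequent local search.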
Data
Here is a website with transcription factor
binding sites. It was used in the paper
Modeling Dependencies in Protein-DNA Binding Sites
by Barash, Elidan, Friedman and Kaplan.
21 May 2004
Here is the paper by Segal, Yelensky and Koller:
Genome-wide discovery of transcriptional modules from DNA sequence and gene expression.
Encode the model bottom-up, starting from the sequences
(or top-down, if you refer to Figure 2 of the paper).
Try to learn motifs from synthetic sequences.
Generate sequences as follows:
- Background distribution: uniform distribution over the four nucleotides.
- Motifs: PSSMs of different lengths and with different entropies.
- Classes: all sequences in a given class contain the same motif (more precisely: motifs of the same type, that is, sampled from the same PSSM).
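The generation scheme above could look roughly like this. It is a sketch under stated assumptions: a PSSM is represented as a list of rows holding the probabilities of A, C, G and T, each sequence contains exactly one motif instance at a random offset, and the two example matrices, the class size and the sequence length are made-up values.

```python
import random

ALPHABET = "ACGT"

def sample_motif(pssm, rng):
    """Draw one motif instance; each row of `pssm` holds P(A), P(C), P(G), P(T)."""
    return "".join(rng.choices(ALPHABET, weights=row)[0] for row in pssm)

def make_sequence(pssm, length, rng):
    """Uniform background of the given length with one sampled motif planted in it."""
    background = [rng.choice(ALPHABET) for _ in range(length)]
    start = rng.randrange(length - len(pssm) + 1)
    background[start:start + len(pssm)] = sample_motif(pssm, rng)
    return "".join(background)

def make_dataset(pssms, per_class, length, rng):
    """Return (sequence, class_label) pairs, one class per PSSM."""
    return [
        (make_sequence(pssm, length, rng), label)
        for label, pssm in enumerate(pssms)
        for _ in range(per_class)
    ]

# Hypothetical PSSMs of length 3; the probabilities are illustrative only.
pssm_a = [[0.85, 0.05, 0.05, 0.05], [0.05, 0.05, 0.85, 0.05], [0.05, 0.05, 0.05, 0.85]]
pssm_b = [[0.05, 0.85, 0.05, 0.05], [0.05, 0.05, 0.05, 0.85], [0.85, 0.05, 0.05, 0.05]]
rng = random.Random(0)
train = make_dataset([pssm_a, pssm_b], per_class=20, length=40, rng=rng)
```

Calling `make_dataset` a second time with the same PSSMs yields an independent test set drawn from the same classes.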
Training is supervised: for the training set, you know the
class membership. You test the generalization performance
on an independent test set.
Investigate how the performance depends on
the following settings:
- Number of training exemplars
- Motif and sequence lengths
- Entropy of the PSSM
- Seeding versus non-seeding
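To vary the entropy setting in a controlled way, one option is to derive the PSSM from a consensus string and a single sharpness parameter. The sketch below assumes entropy is measured in bits and averaged over positions; `peaked_pssm` and its parameter `p` are hypothetical helpers introduced only for this example.

```python
from math import log2

ALPHABET = "ACGT"

def column_entropy(column):
    """Shannon entropy (in bits) of one PSSM position; zero entries contribute nothing."""
    return -sum(p * log2(p) for p in column if p > 0)

def pssm_entropy(pssm):
    """Mean per-position entropy of the PSSM, in bits (0 = deterministic, 2 = uniform)."""
    return sum(column_entropy(column) for column in pssm) / len(pssm)

def peaked_pssm(consensus, p):
    """PSSM giving probability `p` to the consensus nucleotide and (1 - p) / 3 to each other."""
    return [[p if c == letter else (1 - p) / 3 for c in ALPHABET] for letter in consensus]
```

Sweeping `p` from 1.0 down to 0.25 moves the motif from fully conserved to indistinguishable from the uniform background.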
Now assume that the actual motifs were generated from
a more complex dependency model, say a first-order
Markov chain. How does this affect the prediction
accuracy of the model?
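A first-order Markov chain generator for motifs can be sketched as follows. The representation is an assumption made for this example: `initial` is the distribution of the first nucleotide, and `transitions` holds one table per subsequent position, mapping the previous nucleotide to a distribution over the next one. The `pair` table is a deliberately extreme, made-up dependency.

```python
import random

ALPHABET = "ACGT"

def sample_markov_motif(initial, transitions, rng):
    """Sample a motif in which each position depends on the preceding nucleotide."""
    motif = rng.choices(ALPHABET, weights=initial)[0]
    for table in transitions:
        # The distribution at this position is selected by the previous nucleotide.
        motif += rng.choices(ALPHABET, weights=table[motif[-1]])[0]
    return motif

# Hypothetical dependency: every nucleotide is followed by its complement.
pair = {"A": [0, 0, 0, 1], "C": [0, 0, 1, 0], "G": [0, 1, 0, 0], "T": [1, 0, 0, 0]}
rng = random.Random(0)
motifs = [sample_markov_motif([0.5, 0, 0, 0.5], [pair, pair], rng) for _ in range(10)]
```

In this toy setting only `ATA` and `TAT` are ever generated, whereas a PSSM fitted to these instances would treat positions as independent and assign equal probability to strings such as `AAA`, which never occur.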