FINDING OVER- AND UNDER-REPRESENTED WORDS IN DNA SEQUENCES

A DNA sequence can be represented by a long series of letters from the 4-letter alphabet {A,C,G,T}.
A word is a short sequence of letters from the series. For example, TGCTG is a 5-word (a word of length 5).
Finding over- and under-represented words in DNA sequences is an important problem in molecular biology since exceptional words may be involved in the stability of DNA as well as in mechanisms like recombination, replication and repair.
The concept of an exceptional word is based on a statistical comparison between the observed frequency of a word and the frequency expected under a probability model that reflects the composition of the DNA sequence in short words.
Here we describe statistical approaches to the identification of over- and under-represented words in DNA sequences.
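
As a rough illustration of the comparison described above, the following sketch counts the overlapping occurrences of a word and computes the count that would be expected if letters were drawn independently with the frequencies observed in the sequence. The independent-letter model is an assumption made here for simplicity and is not necessarily the model used by the authors; the function names and the toy sequence are hypothetical.

```python
from collections import Counter
from math import prod


def count_words(seq, k):
    """Count every overlapping k-word (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(word, seq):
    """Expected count of word, ASSUMING independent letters whose
    probabilities are the letter frequencies observed in seq."""
    n = len(seq)
    letter_freq = Counter(seq)
    # Probability of the word at one given position under the assumed model.
    p = prod(letter_freq[c] / n for c in word)
    # A word of length k can start at n - k + 1 positions.
    return (n - len(word) + 1) * p


# Toy sequence, made up for illustration only.
seq = "ACGTGCTGACGTGCTGTT"
observed = count_words(seq, 5)["TGCTG"]
expected = expected_count("TGCTG", seq)
```

A word whose observed count lies far above or below its expected count is a candidate exceptional word. Judging how far is "far" requires a measure of the variability of the count under the chosen model, which this sketch does not provide.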
Authors: S. Schbath and A. Bouvier, INRA
© INRA, Sept 1998