EXCEPTIONALITY OF A WORD AND ITS SUBWORDS
The exceptionality of a word under a given model may be related to the exceptionality of some of its subwords under the same model (see the comparison of exceptional words under two models); in this case, it results from a contamination phenomenon. In other cases, it reflects an additional constraint on the sequence.
The pyramidal display is a convenient way to study the exceptionality of a word and of its subwords simultaneously.
A word W of length h can be represented by a pyramid of h-2 stages; each stage corresponds to a word length. The top stage consists of a single square coloured according to the statistic of W. The stage beneath is made of two squares corresponding to the 2 subwords of length h-1 of W, and so on down to the bottom stage, made of h-2 squares associated with the subwords of length 3.
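The stage layout described above can be sketched in a few lines of Python. This is only an illustration of the structure: the function name `pyramid_stages` and the parameter `min_len` are hypothetical, and the colouring by statistic is left out.

```python
def pyramid_stages(word, min_len=3):
    """Return the stages of the pyramid of `word`, top stage first.

    Each stage is the list of subwords of one length, in order of
    position. Hypothetical helper, for illustration only.
    """
    h = len(word)
    if h < min_len:
        raise ValueError("word must have at least %d letters" % min_len)
    stages = []
    # Lengths run from h (the word itself) down to min_len (3-words).
    for length in range(h, min_len - 1, -1):
        # A word of length h has h - length + 1 subwords of a given length.
        subwords = [word[i:i + length] for i in range(h - length + 1)]
        stages.append(subwords)
    return stages


if __name__ == "__main__":
    # Print the pyramid of a 6-word: 4 stages, from 1 square to 4 squares.
    for stage in pyramid_stages("TCCGGA"):
        print(" ".join(stage))
```

For a 6-word this gives h-2 = 4 stages, the second of which holds the two 5-letter subwords.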
A pyramid can be made either under a single model (M1, for instance) or under maximal models for each stage (M1 for 3-words, M2 for 4-words, ...). Models with phase can be used.
For instance, let us look at the pyramids of some of the most under-represented 6-words (the first, the fifth and the sixth) in a sequence of E. coli (111,402 bases) under maximal models without phase.
Reading these pyramids from bottom to top, we can say the following:
In the pyramid of AGCGCT, this under-represented 6-word is composed mostly of non-exceptional subwords. The exceptionality really starts with the full 6 letters AGCGCT.
TCCGGA is a very interesting word. Its 2 subwords of length 5 (TCCGG and CCGGA) are over-represented in the sequence, but there seems to be a stronger constraint: they must "not" occur side by side, that is, overlapping so as to form TCCGGA.
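To make the TCCGGA case concrete, here is a minimal sketch of how an expected count can be estimated under the maximal model (order h-2 for a word of length h). It uses the usual plug-in estimate N(w1..w(h-1)) * N(w2..wh) / N(w2..w(h-1)), which is an assumption about the estimator and may differ from the exact statistic used for the pyramids. The function names and the toy sequence are hypothetical; the sequence is not E. coli data.

```python
def count_overlapping(seq, word):
    """Count occurrences of `word` in `seq`, overlapping ones included."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def expected_count_maximal(seq, word):
    """Plug-in expected count of `word` under the maximal Markov model.

    Assumed estimator: N(prefix) * N(suffix) / N(core), where prefix and
    suffix are the two subwords of length h-1 and core is the subword of
    length h-2 that they share.
    """
    if len(word) < 3:
        raise ValueError("word must have at least 3 letters")
    n_prefix = count_overlapping(seq, word[:-1])
    n_suffix = count_overlapping(seq, word[1:])
    n_core = count_overlapping(seq, word[1:-1])
    if n_core == 0:
        return 0.0
    return n_prefix * n_suffix / n_core


if __name__ == "__main__":
    # Toy sequence (made up): TCCGG and CCGGA both occur, never overlapping.
    toy = "TCCGGTTCCGGCCCGGAACCGGA"
    print("observed:", count_overlapping(toy, "TCCGGA"))
    print("expected:", expected_count_maximal(toy, "TCCGGA"))
```

In the toy sequence both 5-letter subwords are present while the 6-word itself never occurs, so the observed count falls below the expected one; the same pattern is what the TCCGGA pyramid shows.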
Finding words with unexpected frequencies in DNA sequences. 11.9.98