COMPARISON OF EXCEPTIONAL WORDS UNDER TWO MODELS
We want to identify exceptional 3-words in the complete genome of
phage Lambda, which is 48,502 bases long.
We applied models M0 and M1 in turn and compared the results;
the Gaussian approximation was used here.
Note that M1 takes into account a possible bias in 2-letter word
composition, whereas M0 takes into account only a bias in letter
composition.
In the figure, each of the 64 3-words is represented by a point whose
x-coordinate is the z-score of the word calculated under M0 and whose
y-coordinate is the z-score calculated under M1.
The x-coordinate therefore tells us whether the word count is well
predicted by the letter frequencies, whereas the y-coordinate tells
us whether it is well predicted by the 2-word frequencies.
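To make the two coordinates concrete, here is a minimal Python sketch of
how a count can be compared with its expectation under each model. It
is an illustration under stated assumptions, not the procedure used to
produce the results on this page: the function names are hypothetical,
the M1 expectation uses the usual plug-in estimate N(ab)N(bc)/N(b), and
the score divides by the square root of the expected count, which is a
crude stand-in for the variance of the Gaussian approximation.

```python
from collections import Counter
from math import sqrt


def count_words(seq, k):
    """Count overlapping occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_m0(word, seq):
    """Expected count of word when only letter frequencies are used (M0)."""
    letters = count_words(seq, 1)
    n = len(seq)
    prob = 1.0
    for ch in word:
        prob *= letters[ch] / n
    # Number of positions where the word can start, times its probability.
    return (n - len(word) + 1) * prob


def expected_m1(word, seq):
    """Expected count of a 3-word abc from 2-word counts (M1).

    Assumes the plug-in estimate N(ab) * N(bc) / N(b) and that the
    middle letter b occurs at least once in seq.
    """
    a, b, c = word
    letters = count_words(seq, 1)
    pairs = count_words(seq, 2)
    return pairs[a + b] * pairs[b + c] / letters[b]


def crude_z_scores(word, seq):
    """Return (score under M0, score under M1) for a 3-word.

    ASSUMPTION: sqrt(expected) replaces the true standard deviation, so
    these values only indicate the direction and rough size of the
    deviation; they are not the z-scores reported in the text.
    """
    observed = count_words(seq, 3)[word]
    e0 = expected_m0(word, seq)
    e1 = expected_m1(word, seq)
    return (observed - e0) / sqrt(e0), (observed - e1) / sqrt(e1)
```

Calling crude_z_scores("TGC", genome) on a sequence held in a string
named genome (a hypothetical variable) returns the pair that would be
plotted as the x and y coordinates of the point for TGC.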
Here are some examples of how to interpret such results:
-
AAA is the second most over-represented 3-word under M0 (z-score=15)
but is less exceptional under M1 (z-score=7). In other words, there
are many more occurrences of AAA than the count of A in the sequence
alone would predict. Once we take into account the count of AA in
the sequence, AAA is still over-represented but has lost part of its
exceptionality. This means that the sequence is probably AA-rich
(see the exceptional 2-words under M0).
-
The case of TGC is even clearer: when only the counts of A, C, G
and T are taken into account (M0), TGC is strongly over-represented
(z-score=12). Once the counts of TG and GC are taken into account
(M1), the frequency of TGC is completely expected. This means that TG
is very frequent (see the exceptional 2-words under M0) but the
proportion of TG followed by a C is "normal" (and/or GC is very
frequent but the proportion of GC preceded by a T is as expected).
The over-representation of TGC under M0 simply reveals a bias in the
frequency of TG and/or GC.
-
Conversely, CAA, TTG and TAT are not exceptional under M0 but are
under M1. This means that the frequency of TTG, for instance, is
correctly predicted by the frequencies of T and G but, as soon as we
take into account the fact that TT and TG occur 3346 and 3793 times
respectively in the sequence, TTG should appear more often than it
does. In other words, TT is rarely followed by a G, or TG is rarely
preceded by a T.
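The TTG example can be written out as a short worked expression. This
assumes the usual plug-in estimate of the expected count under M1,
which may differ in detail from the estimator behind the reported
z-scores; N(T), the number of T in the sequence, is not given in this
section and is left symbolic.

```latex
\widehat{E}_{M1}\bigl[N(\mathrm{TTG})\bigr]
  = \frac{N(\mathrm{TT})\, N(\mathrm{TG})}{N(\mathrm{T})}
  = \frac{3346 \times 3793}{N(\mathrm{T})}
  = \frac{12\,691\,378}{N(\mathrm{T})}
```

TTG is under-represented under M1 when its observed count falls
clearly below this value.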
It is therefore useful to study the count of a word under several
models rather than under a single one; doing so gives much more
precise information on the sequence vocabulary. The advantage of
using the maximal model is that it detects words that are exceptional
as a whole (CAG, CTG, TAT, TAG, TTG and CAA are such words).
Its disadvantage is that it masks words that are exceptional only
because some of their subwords are exceptional
(TGC and GTA are such words).
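The comparison of the two models can be summarised by a small helper
that sorts a word into one of four groups from its pair of z-scores.
This is only a sketch: the function name, the labels and the cut-off
of 3 in absolute value are assumptions chosen for illustration, since
no threshold is stated here.

```python
def classify(z_m0, z_m1, threshold=3.0):
    """Group a word by the models under which it is exceptional.

    ASSUMPTION: a word is called exceptional when |z| >= threshold;
    the default of 3.0 is illustrative, not taken from the text.
    """
    exceptional_m0 = abs(z_m0) >= threshold
    exceptional_m1 = abs(z_m1) >= threshold
    if exceptional_m0 and exceptional_m1:
        return "both"      # exceptional even given its 2-word content
    if exceptional_m0:
        return "M0 only"   # explained by exceptional subwords
    if exceptional_m1:
        return "M1 only"   # invisible at the letter level
    return "neither"


# AAA has z-scores 15 (M0) and 7 (M1) in the text.
print(classify(15, 7))  # prints "both"
```

Words in the "M0 only" group behave like TGC, and words in the
"M1 only" group behave like TTG.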
Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 13 of 21