COMPARISON OF EXCEPTIONAL WORDS UNDER TWO MODELS


We want to identify exceptional 3-words in the complete genome of phage Lambda, which is 48,502 bases long.

We applied models M0 and M1 in turn and compared the results, using the Gaussian approximation in both cases.

Note that M1 takes into account a possible bias in 2-letter word composition, whereas M0 takes into account only a bias in letter composition.
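To make the difference between the two models concrete, here is a minimal Python sketch of the usual plug-in estimates of the expected count of a 3-word: under M0 from the letter frequencies alone, and under M1 from the 2-word counts, N(ab) N(bc) / N(b). The function names and the toy sequence are illustrative assumptions, not part of the original analysis, and the sketch does not compute the variance needed for the z-scores.

```python
from collections import Counter


def count_words(seq, k):
    """Count overlapping occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_m0(word, seq):
    """Plug-in expected count under M0 (letter composition only)."""
    letters = count_words(seq, 1)
    n = len(seq)
    # Number of positions where the word can start.
    expected = n - len(word) + 1
    for letter in word:
        expected *= letters[letter] / n
    return expected


def expected_m1(word, seq):
    """Plug-in expected count of a 3-word abc under M1: N(ab) * N(bc) / N(b)."""
    a, b, c = word
    letters = count_words(seq, 1)
    pairs = count_words(seq, 2)
    if letters[b] == 0:
        return 0.0
    return pairs[a + b] * pairs[b + c] / letters[b]


# Toy sequence (hypothetical, far shorter than a real genome).
toy = "ACGTACGTACGT"
observed = count_words(toy, 3)["ACG"]
print(observed, expected_m0("ACG", toy), expected_m1("ACG", toy))
```

On this toy sequence the observed count of ACG is far above the M0 expectation but matches the M1 expectation, because the 2-words AC and CG already account for it.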


In the plot, each of the sixty-four 3-words is represented by a point whose x-coordinate is the z-score of the word calculated under M0 and whose y-coordinate is the z-score calculated under M1.
The x-coordinate therefore tells us whether the word count is well predicted by the letter frequencies, whereas the y-coordinate tells us whether it is well predicted by the 2-word frequencies.

Here are some examples of how to interpret such results:

It is therefore useful to study the count of a word under several models rather than a single one; together they give precise information on the vocabulary of the sequence. The advantage of the maximal model (M1 for 3-words) is that it detects words that are exceptional as a whole (CAG, CTG, TAT, TAG, TTG and CAA are such words). Its disadvantage is that it masks words that are exceptional only because some of their subwords are exceptional (TGC and GTA are such words).
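As a rough illustration of this reading of the two z-scores, the following Python sketch sorts a word into one of four groups according to whether each score is large in absolute value. The function name, the threshold of 3.0, the labels, and the example scores are all assumptions made for illustration; none of them comes from the Lambda analysis.

```python
def classify(z_m0, z_m1, threshold=3.0):
    """Label a word from its z-scores under M0 and M1.

    threshold is a hypothetical cutoff on |z|, not a value from the source.
    """
    big_m0 = abs(z_m0) >= threshold
    big_m1 = abs(z_m1) >= threshold
    if big_m0 and big_m1:
        return "exceptional under both models"
    if big_m1:
        return "exceptional under M1 only"
    if big_m0:
        # The 2-word composition explains the count: the word looks
        # exceptional only because of its subwords.
        return "exceptional under M0 only"
    return "unexceptional under both models"


# Hypothetical scores, for illustration only.
for name, z0, z1 in [("word_a", 8.2, 6.1), ("word_b", 5.4, 0.7), ("word_c", 0.3, -0.9)]:
    print(name, classify(z0, z1))
```

A word in the "exceptional under M0 only" group is the kind that the maximal model masks, while a word that stays exceptional under M1 is exceptional as a whole.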

Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 13 of 21