CHOICE OF THE APPROXIMATION

Theoretically, the Gaussian approximation is valid when the expected count of the word grows to infinity with the length n of the sequence, whereas the compound Poisson approximation is valid when the expected count remains of order O(1) as n tends to infinity. The latter condition implies that the length of the word is of order O(log n).
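As a rough illustration of why a bounded expected count forces the word length to grow like log n, consider the simplest case of independent, uniformly distributed letters on a 4-letter alphabet. This model is assumed here only for the sketch, and N(w), the count of a word w of length h, and the constant c are notation introduced for it:

```latex
\mathbb{E}[N(w)] = \frac{n-h+1}{4^{h}} \approx \frac{n}{4^{h}},
\qquad
\frac{n}{4^{h}} = c
\iff
h = \log_4 n - \log_4 c = \frac{\log n}{\log 4} + O(1).
```

Under a non-uniform or Markov model the constants change, but the logarithmic growth of h should be the same.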
In practice, we use the compound Poisson approximation only for rare words, that is, words with a very small expected count (approximately less than 1). On a 4-letter alphabet, there are 4^h words of length h, namely:
        4,096 words of length 6
       16,384 words of length 7
       65,536 words of length 8
      262,144 words of length 9
    1,048,576 words of length 10
    4,194,304 words of length 11,
so we can say that, approximately, in a sequence of length n, the rare words of length h are such that:
    if n = 10,000                        h > 6
    if n = 50,000 (~Lambda genome)       h > 7
    if n = 100,000                       h > 8
    if n = 1,000,000                     h > 10
    if n = 4,000,000 (~E. coli genome)   h > 11.
Of course, the boundary between rare and non-rare words is blurred. The rarer the word, the better the compound Poisson approximation.
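The rule of thumb above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' procedure: it assumes independent, uniformly distributed letters, so that a fixed word of length h has expected count (n - h + 1) / 4^h, and the function names `expected_count` and `shortest_rare_length` are hypothetical helpers introduced for the illustration.

```python
# Illustrative sketch only: assumes independent, uniformly distributed letters.
ALPHABET_SIZE = 4


def expected_count(n: int, h: int, alphabet_size: int = ALPHABET_SIZE) -> float:
    """Expected count of one fixed word of length h in a sequence of length n."""
    positions = max(n - h + 1, 0)  # number of possible start positions
    return positions / alphabet_size ** h


def shortest_rare_length(n: int, threshold: float = 1.0,
                         alphabet_size: int = ALPHABET_SIZE) -> int:
    """Smallest h whose expected count falls strictly below the threshold."""
    h = 1
    while expected_count(n, h, alphabet_size) >= threshold:
        h += 1
    return h


if __name__ == "__main__":
    for n in (10_000, 50_000, 100_000, 1_000_000, 4_000_000):
        h = shortest_rare_length(n)
        print(f"n = {n:>9,}: h = {h:>2}, expected count = {expected_count(n, h):.2f}")
```

With a strict threshold of 1, this sketch reports h = 10 for n = 1,000,000 and h = 11 for n = 4,000,000, with an expected count of about 0.95 in both cases. Such values lie in the blurred zone mentioned above, which is why the list in the text is more conservative for these two sequence lengths.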
Comparing the two approximations is worthwhile whenever the choice is unclear. For instance, in the Lambda genome, which has 48,502 bases, we compared, for each 7-word, the Gaussian statistic (the so-called z-score) with the compound Poisson statistic, using model M1.
We can see from the following figure that there are no under-represented words and that the most over-represented 7-words are almost the same under both approximations. However, the Gaussian statistic tends to over-estimate the over-representation.
For 8-words, this trend is pronounced, which indicates that the compound Poisson approximation is more suitable (not shown).
Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 11 of 21