COMPOUND POISSON APPROXIMATION OF THE COUNT

For rare words, a compound Poisson approximation is recommended instead of the Gaussian approximation. Rare words are generally long words, since a long word appears with a very small probability.
The difficulty comes from the possible overlaps between occurrences of a word. Since occurrences of a word W may overlap, they tend to occur in clumps. Each clump of W has a size, defined as the number of overlapping occurrences of W that compose it. For instance, in the sequence
there are only two clumps of TGCT. The first one is of size 2 and starts at position 7, the second one is of size 1 and starts at position 18.
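The notion of a clump can be illustrated with a minimal Python sketch. The sequence used here is a made-up example, not the one from the original text; it was chosen only so that TGCT again has a clump of size 2 starting at position 7 and a clump of size 1 starting at position 18. The function names `occurrences` and `clumps` are hypothetical.

```python
def occurrences(seq, word):
    """Return the 1-based start positions of all occurrences of word in seq,
    including overlapping ones."""
    n = len(word)
    return [i + 1 for i in range(len(seq) - n + 1) if seq[i:i + n] == word]


def clumps(seq, word):
    """Return a list of (start, size) pairs, one per clump of word in seq.

    Two consecutive occurrences belong to the same clump when the second
    one starts before the first one ends.
    """
    result = []
    previous = None
    for pos in occurrences(seq, word):
        if previous is not None and pos - previous < len(word):
            start, size = result[-1]
            result[-1] = (start, size + 1)   # extend the current clump
        else:
            result.append((pos, 1))          # open a new clump
        previous = pos
    return result


# Illustrative sequence (an assumption, not taken from the source).
EXAMPLE = "AACGAATGCTGCTAAGATGCTAAC"
print(occurrences(EXAMPLE, "TGCT"))  # [7, 10, 18]
print(clumps(EXAMPLE, "TGCT"))       # [(7, 2), (18, 1)]
```

The occurrences at positions 7 and 10 share the letter T at position 10, so they form a single clump of size 2, while the occurrence at position 18 stands alone.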
Since occurrences of W may overlap in the sequence, the Chen-Stein method cannot be applied to them directly. The key idea is to work with the clumps of W, which by definition do not overlap in the sequence.
Indeed, the count N(W) can be seen as the number of clumps weighted by the size of the clumps; more precisely, we write N(W) as
$$N(W) = \sum_{k>0} k \, N_k(W)$$
where the $N_k(W)$ are the numbers of clumps of size k (k > 0), which can be approximated by independent Poisson variables $Z_k$.
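The decomposition of the count by clump size can be checked numerically. The following is a minimal sketch, assuming a made-up sequence and a hypothetical helper `clump_size_counts`; it tabulates the number of clumps of each size k and verifies that the sizes, weighted by these numbers, add up to the total number of occurrences.

```python
from collections import Counter


def clump_size_counts(seq, word):
    """Return (counts, total), where counts[k] is the number of clumps of
    size k and total is the number of (possibly overlapping) occurrences."""
    n = len(word)
    starts = [i for i in range(len(seq) - n + 1) if seq.startswith(word, i)]
    sizes = []
    for j, s in enumerate(starts):
        if j > 0 and s - starts[j - 1] < n:
            sizes[-1] += 1      # overlaps the previous occurrence: same clump
        else:
            sizes.append(1)     # starts a new clump
    return Counter(sizes), len(starts)


# Illustrative sequence (an assumption, not taken from the source).
counts, total = clump_size_counts("AACGAATGCTGCTAAGATGCTAAC", "TGCT")
weighted = sum(k * n_k for k, n_k in counts.items())
print(dict(counts), total, weighted)
```

Here one clump of size 1 and one clump of size 2 give 1 × 1 + 2 × 1 = 3, which is the number of occurrences of TGCT in the sequence.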
Computing the parameters of these Poisson distributions requires combinatorial techniques (Schbath).
Finally, N(W) is approximated by $\sum_{k>0} k \, Z_k$, which is by definition a compound Poisson variable.
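Therefore, the criterion is the following, where n denotes the observed count of W:
If $P\left(\sum_{k>0} k \, Z_k \ge n\right)$ is close to 1, W is under-represented,
If $P\left(\sum_{k>0} k \, Z_k \ge n\right)$ is close to 0, W is over-represented.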
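The criterion can be sketched in Python under several assumptions. The quantity being tested is taken to be the upper-tail probability that the compound Poisson variable reaches the observed count. The Poisson parameters are supplied by the caller as a hypothetical dictionary `lambdas` mapping each clump size k to the parameter of the corresponding Poisson variable (their actual computation is not covered here). The threshold `alpha` is an arbitrary illustrative value for "close to 0" and "close to 1". The distribution is computed with the standard Panjer-type recursion for compound Poisson sums, which is one possible method and not necessarily the one used by the author.

```python
import math


def compound_poisson_pmf(lambdas, n_max):
    """Return [P(S = 0), ..., P(S = n_max)] for S = sum_k k * Z_k, where the
    Z_k are independent Poisson variables with parameters lambdas[k], k >= 1.

    Recursion: p_0 = exp(-sum_k lambda_k),
               p_n = (1/n) * sum_{k=1..n} k * lambda_k * p_{n-k}.
    """
    p = [0.0] * (n_max + 1)
    p[0] = math.exp(-sum(lambdas.values()))
    for n in range(1, n_max + 1):
        p[n] = sum(k * lam * p[n - k] for k, lam in lambdas.items() if k <= n) / n
    return p


def upper_tail(lambdas, n_obs):
    """Return P(S >= n_obs)."""
    if n_obs <= 0:
        return 1.0
    return max(0.0, 1.0 - sum(compound_poisson_pmf(lambdas, n_obs - 1)))


def classify(lambdas, n_obs, alpha=0.01):
    """Apply the criterion with an assumed threshold alpha."""
    tail = upper_tail(lambdas, n_obs)
    if tail < alpha:
        return "over-represented"
    if tail > 1.0 - alpha:
        return "under-represented"
    return "not exceptional"


# Hypothetical parameters: clumps of size 1 and 2 only.
example_lambdas = {1: 1.0, 2: 0.2}
for n_obs in (0, 1, 15):
    print(n_obs, upper_tail(example_lambdas, n_obs), classify(example_lambdas, n_obs))
```

With these made-up parameters the expected count is 1.0 + 2 × 0.2 = 1.4, so an observed count of 15 yields a tail probability close to 0 and an observed count of 0 yields a tail probability of 1.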
The compound Poisson approximation has been established only in the class of models Mm, m >= 0.
Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 8 of 21