COMPOUND POISSON APPROXIMATION OF THE COUNT

For rare words, a compound Poisson approximation is recommended rather than the Gaussian approximation. Rare words are generally long words, since a long word occurs with very small probability.

The difficulty comes from the possible overlaps between occurrences of a word. Because occurrences of a word W may overlap, they tend to appear in the sequence in clumps. Each clump of W has a size, defined as the number of overlapping occurrences of W that compose it. For instance, in the sequence

S = ATGGACTGCTGCTAGATTGCTTA

there are exactly two clumps of TGCT. The first one is of size 2 and starts at position 7; the second one is of size 1 and starts at position 18.
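The clump structure of this example can be checked with a short sketch. The function name find_clumps is hypothetical, and the sketch assumes that two occurrences belong to the same clump when their start positions differ by less than the length of W, that is, when they share at least one letter.

```python
def find_clumps(seq, word):
    """Return (start, size) for each clump of `word` in `seq` (1-based starts)."""
    w = len(word)
    # 1-based start positions of all occurrences, overlapping ones included.
    starts = [i + 1 for i in range(len(seq) - w + 1) if seq[i:i + w] == word]
    clumps = []  # each entry is [start of clump, number of occurrences in it]
    prev = None
    for s in starts:
        if prev is not None and s - prev < w:
            # This occurrence overlaps the previous one: same clump.
            clumps[-1][1] += 1
        else:
            # No overlap with the previous occurrence: a new clump begins.
            clumps.append([s, 1])
        prev = s
    return [tuple(c) for c in clumps]


S = "ATGGACTGCTGCTAGATTGCTTA"
print(find_clumps(S, "TGCT"))  # [(7, 2), (18, 1)]
```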

Since occurrences of W may overlap in the sequence, the Chen-Stein method cannot be applied directly to the occurrences. The key idea is to work with the clumps of W instead, which by definition do not overlap.

Indeed, the count N(W) can be obtained by weighting the number of clumps by their sizes; more precisely, we write N(W) as
\[ N(W) = \sum_{k \ge 1} k \, \tilde{N}_k(W) \]
where the $\tilde{N}_k(W)$'s are the numbers of clumps of size k (k > 0), which can be approximated by independent Poisson variables $Z_k$. Computing the parameters of these Poisson distributions requires combinatorial techniques (Schbath).
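As a small illustration of this decomposition, the sketch below tallies the clumps of TGCT in the example sequence by size and recovers the count as the sum over k of k times the number of clumps of size k. The helper clump_sizes is a hypothetical name, and the overlap rule (start positions closer than the word length) is the same assumption as in the example above.

```python
from collections import Counter


def clump_sizes(seq, word):
    """Return the sizes of the successive clumps of `word` in `seq`."""
    w = len(word)
    sizes, prev = [], None
    for i in range(len(seq) - w + 1):
        if seq[i:i + w] != word:
            continue
        if prev is not None and i - prev < w:
            sizes[-1] += 1      # overlaps the previous occurrence
        else:
            sizes.append(1)     # starts a new clump
        prev = i
    return sizes


S = "ATGGACTGCTGCTAGATTGCTTA"
clumps_of_size = Counter(clump_sizes(S, "TGCT"))  # k -> number of clumps of size k
count = sum(k * n_k for k, n_k in clumps_of_size.items())
print(dict(clumps_of_size), count)  # {2: 1, 1: 1} 3
```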

Finally, N(W) is approximated by $\sum_{k \ge 1} k \, Z_k$, which is by definition a compound Poisson variable.

Therefore, the criterion is the following:

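If $P\left( \sum_{k \ge 1} k \, Z_k \ge N^{obs}(W) \right)$ is close to 1, W is under-represented,
If $P\left( \sum_{k \ge 1} k \, Z_k \ge N^{obs}(W) \right)$ is close to 0, W is over-represented,
where $N^{obs}(W)$ denotes the observed count of W in the sequence.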
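A minimal sketch of how such a criterion could be evaluated numerically, assuming it is the upper-tail probability that the compound Poisson variable is at least the observed count. The function names and the values in lambdas are hypothetical placeholders: the actual Poisson parameters come from the combinatorial computation mentioned above, which is not reproduced here. The probabilities are obtained from the standard recursion n p(n) = sum over k of k lambda_k p(n - k), which follows from the generating function of a sum of independent weighted Poisson variables.

```python
import math


def compound_poisson_pmf(lambdas, n_max):
    """P(sum_k k*Z_k = n) for n = 0..n_max, Z_k ~ Poisson(lambdas[k-1]) independent."""
    p = [0.0] * (n_max + 1)
    p[0] = math.exp(-sum(lambdas))  # all Z_k equal to zero
    for n in range(1, n_max + 1):
        acc = 0.0
        for k in range(1, min(n, len(lambdas)) + 1):
            acc += k * lambdas[k - 1] * p[n - k]
        p[n] = acc / n
    return p


def upper_tail(lambdas, n_obs):
    """P(sum_k k*Z_k >= n_obs), computed as 1 - P(sum_k k*Z_k < n_obs)."""
    if n_obs <= 0:
        return 1.0
    p = compound_poisson_pmf(lambdas, n_obs - 1)
    return max(0.0, 1.0 - sum(p))


# Hypothetical parameters for clumps of size 1, 2 and 3 (illustration only).
lambdas = [1.5, 0.3, 0.05]
for n_obs in (0, 3, 12):
    print(n_obs, upper_tail(lambdas, n_obs))
```

Computing the tail as one minus a partial sum loses precision when the probability is extremely small, which is exactly the over-represented case; a careful implementation would sum the upper tail directly.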

The compound Poisson approximation has been established only in the class of models Mm, m >= 0.

Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 8 of 21