STATISTICAL METHODS - INTRODUCTION

To detect whether a word W appears with an unexpected frequency in a sequence, one needs to know the probability distribution of its count N(W).
N(W) is a sum of Bernoulli variables, each one being equal to 1 if W begins at a given position in the sequence, and 0 otherwise.
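To make the definition concrete, here is a minimal sketch that computes N(W) literally as a sum of 0/1 indicators, one per start position. The function names `indicator` and `count_occurrences` are illustrative and are not taken from the original text. Positions are 0-based and overlapping occurrences are all counted.

```python
def indicator(seq: str, word: str, i: int) -> int:
    """Return 1 if `word` begins at 0-based position i of `seq`, else 0."""
    return 1 if seq.startswith(word, i) else 0


def count_occurrences(seq: str, word: str) -> int:
    """N(W): the sum of the indicators over all admissible start positions."""
    last_start = len(seq) - len(word)
    # range() is empty when the word is longer than the sequence.
    return sum(indicator(seq, word, i) for i in range(last_start + 1))


if __name__ == "__main__":
    # "AA" begins at positions 0, 1 and 2 of "AAAA": overlaps are counted.
    print(count_occurrences("AAAA", "AA"))  # 3
```

Note that a plain `str.count` would not serve here, because it counts only non-overlapping occurrences.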
The difficulty comes from the dependence between these Bernoulli variables(*). If they were independent, the distribution of N(W) would be approximated either by a Gaussian distribution or by a Poisson distribution, depending on the asymptotic framework: Gaussian when the expected count tends to infinity, Poisson when it tends to a constant, as the length of the sequence grows to infinity. In this latter case, we say that W is a rare word.
In fact, the distribution of N(W) can be approximated by a Gaussian distribution if W is not a rare word, whereas the count of rare words can be approximated by a compound Poisson variable.
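The two regimes are distinguished by the expected count, which can be computed without handling the dependence: by linearity of expectation, E[N(W)] is the number of start positions times the probability that W begins at one position. The sketch below assumes, purely for illustration, a sequence of independent and identically distributed letters. That model, the function name `expected_count`, and the parameter `letter_probs` are assumptions and do not come from the text.

```python
from math import prod
from typing import Mapping


def expected_count(word: str, seq_len: int,
                   letter_probs: Mapping[str, float]) -> float:
    """E[N(W)] under an ASSUMED i.i.d. letter model (illustration only).

    Linearity of expectation holds even though the indicators are dependent,
    so only the per-position occurrence probability is needed.
    """
    n_positions = max(seq_len - len(word) + 1, 0)
    p_word = prod(letter_probs[c] for c in word)
    return n_positions * p_word


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    # 1000 start positions, each with probability 0.25**4.
    print(expected_count("ACGT", 1003, uniform))  # 3.90625
```

Only the mean is obtained this way. The variance and the shape of the distribution depend on how the word overlaps itself, which is why the dependence cannot be ignored.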
(*)Note: These variables are not independent as soon as the word has more than one letter. For example, a word W of length h that does not overlap itself cannot occur both at a position i and at any of the h-1 following positions. On the other hand, a word that overlaps itself can occur both at a position i and at position i+p, for some p with 0 < p < h.
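The shifts p at which a word can overlap itself are easy to enumerate: two occurrences can start p positions apart exactly when the last h-p letters of W equal its first h-p letters. The following is a minimal sketch; the function name `overlap_periods` is hypothetical and not from the original text.

```python
from typing import List


def overlap_periods(word: str) -> List[int]:
    """Shifts p, 0 < p < h, at which `word` can overlap a copy of itself."""
    h = len(word)
    # A copy shifted by p is compatible iff the suffix of length h-p
    # matches the prefix of length h-p.
    return [p for p in range(1, h) if word[p:] == word[:h - p]]


if __name__ == "__main__":
    print(overlap_periods("ACGT"))  # []  (no self-overlap)
    print(overlap_periods("ATAT"))  # [2]
    print(overlap_periods("AAA"))   # [1, 2]
```

An empty result means that occurrences of the word can never overlap, so an occurrence at position i excludes occurrences at the h-1 following positions.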
Finding words with unexpected frequencies in DNA sequences. 11.9.98