GAUSSIAN APPROXIMATION OF THE COUNT
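Under the model M1, for instance, the expected count of an h-letter word W = w_1 w_2 ... w_h in the {A,C,G,T} alphabet can be estimated by

$$\hat{N}_1(W) = \frac{N(w_1 w_2)\, N(w_2 w_3) \cdots N(w_{h-1} w_h)}{N(w_2)\, N(w_3) \cdots N(w_{h-1})},$$

where N(.) denotes the observed count of a word in the sequence.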
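As a minimal sketch of how this estimate can be computed, the Python function below assumes the usual plug-in form for a first-order Markov model: the product of the observed counts of the overlapping 2-letter words of W, divided by the product of the observed counts of its inner letters. The helper names `count_words` and `expected_count_m1` are illustrative and do not come from the source.

```python
from collections import Counter


def count_words(seq, k):
    """Overlapping counts of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count_m1(word, seq):
    """Plug-in estimate of the expected count of `word` under M1.

    Assumed form: N(w1w2) * ... * N(w_{h-1}w_h) / (N(w2) * ... * N(w_{h-1})),
    where N(.) are counts observed in `seq`.
    """
    if len(word) < 2:
        raise ValueError("word must have at least 2 letters")
    letters = count_words(seq, 1)
    pairs = count_words(seq, 2)
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= pairs[word[i:i + 2]]
    denominator = 1.0
    for letter in word[1:-1]:          # inner letters w2 .. w_{h-1}
        denominator *= letters[letter]
    # If an inner letter is absent, a 2-letter word is absent too, so return 0.
    return numerator / denominator if denominator > 0 else 0.0
```

For the toy sequence `"ACGTACGTAC"`, the counts are N(AC) = 3, N(CG) = 2 and N(C) = 3, so `expected_count_m1("ACG", "ACGTACGTAC")` evaluates 3 * 2 / 3 = 2.0. For a 2-letter word the estimate reduces to its observed count.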
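One can show that

$$\frac{N(W) - \hat{N}_1(W)}{\sqrt{n}} \xrightarrow{\;d\;} \mathcal{N}\big(0, \sigma_1^2(W)\big) \quad \text{as } n \to \infty,$$

where n is the length of the sequence, and a good estimate of the asymptotic variance is given by

$$n\,\hat{\sigma}_1^2(W) = \hat{N}_1(W) + 2 \sum_{d \in \mathcal{P}(W),\; d \le h-2} \hat{N}_1\big(W^{(d)}W\big) + \hat{N}_1(W)^2 \left[ \sum_{a} \frac{n(a+)^2}{N(a)} - \sum_{a,b} \frac{n(ab)^2}{N(ab)} + \frac{1 - 2\,n(w_1+)}{N(w_1)} \right],$$

where $\mathcal{P}(W)$ is the set of periods of W; $W^{(d)}W = w_1 \cdots w_d\, w_1 \cdots w_h$ is the concatenated word; n(.) denotes the count of words inside the word W (as opposed to N(.), the count in the sequence); and the notation "a+" stands for the letter a followed by any letter.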
Note that the variance takes into account the periodic structure of the word.
Therefore, a high positive value of the z-score defined by

$$U_1(W) = \frac{N(W) - \hat{N}_1(W)}{\sqrt{n\,\hat{\sigma}_1^2(W)}}$$

detects an over-represented word under M1, whereas a high negative value of the z-score detects an under-represented word under M1.
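The whole computation can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: the variance estimate is assumed to be the sum of the estimated expected count, twice the estimated expected counts of the self-overlapping concatenations of the word (one per period d <= h-2), and a correction term built from the letter and 2-letter counts inside the word. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_words(seq, k):
    """Overlapping counts of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_m1(word, letters, pairs):
    """Plug-in expected count under M1 from letter and 2-letter counts."""
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= pairs[word[i:i + 2]]
    denominator = 1.0
    for letter in word[1:-1]:
        denominator *= letters[letter]
    return numerator / denominator if denominator > 0 else 0.0


def periods(word):
    """All d in 1..h-1 such that the word overlaps itself after a shift of d."""
    h = len(word)
    return [d for d in range(1, h) if word[d:] == word[:h - d]]


def z_score_m1(word, seq):
    """z-score of `word` under M1, or None when it is undefined."""
    h = len(word)
    letters, pairs = count_words(seq, 1), count_words(seq, 2)
    observed = count_words(seq, h)[word]
    expected = expected_m1(word, letters, pairs)
    if expected == 0:                      # some 2-letter word of W is absent
        return None
    # Overlap term: concatenated words w_1..w_d followed by the whole word.
    overlap = sum(expected_m1(word[:d] + word, letters, pairs)
                  for d in periods(word) if d <= h - 2)
    inside_plus = Counter(word[:-1])       # n(a+): a followed by any letter
    inside_pairs = count_words(word, 2)    # n(ab): 2-letter words inside W
    bracket = (sum(c * c / letters[a] for a, c in inside_plus.items())
               - sum(c * c / pairs[ab] for ab, c in inside_pairs.items())
               + (1 - 2 * inside_plus[word[0]]) / letters[word[0]])
    # Assumed form of the estimate of n * sigma_1^2(W).
    variance = expected + 2 * overlap + expected ** 2 * bracket
    if variance <= 0:                      # degenerate, e.g. 2-letter words
        return None
    return (observed - expected) / sqrt(variance)
```

A positive return value suggests the word occurs more often than M1 predicts, a negative one less often. For 2-letter words the estimated expected count equals the observed count and the variance vanishes, so the sketch returns `None` rather than a score.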
The z-scores Um and Um_3 are obtained similarly for the general models Mm and Mm_3.
Note: Details of the derivation of these formulae can be found in Schbath (1995), Chapter 1.

Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 6 of 21