GAUSSIAN APPROXIMATION OF THE COUNT


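Under the model M1, for instance, the expected count of an h-letter word $W = w_1 w_2 \cdots w_h$ in the {A,C,G,T} alphabet can be estimated by

\[
\hat{N}_1(W) = \frac{N(w_1 w_2)\, N(w_2 w_3) \cdots N(w_{h-1} w_h)}{N(w_2)\, N(w_3) \cdots N(w_{h-1})} ,
\]

where N(.) denotes the observed count of a word in the sequence.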
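As an illustration, the estimate can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes the usual plug-in estimator for a first-order Markov model (product of the observed 2-letter counts along the word, divided by the product of the observed counts of its inner letters) and counts overlapping occurrences. The function names `count_words` and `expected_count_m1` are hypothetical.

```python
from collections import Counter


def count_words(seq, k):
    """Count overlapping occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of `word` under M1.

    Assumption: numerator = product of 2-letter counts along the word,
    denominator = product of the counts of the inner letters w_2..w_{h-1}.
    """
    if len(word) < 2:
        raise ValueError("word must have at least 2 letters")
    n1 = count_words(seq, 1)
    n2 = count_words(seq, 2)
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= n2[word[i:i + 2]]
    denominator = 1.0
    for letter in word[1:-1]:
        denominator *= n1[letter]
    # If an inner letter never occurs, the word cannot occur either.
    return numerator / denominator if denominator else 0.0
```

For a 2-letter word the denominator is an empty product, so the estimate reduces to the observed count of that word, as expected under a model fitted on 2-letter counts.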

One can show that

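\[
\frac{N(W) - \hat{N}_1(W)}{\sqrt{n}} \;\longrightarrow\; \mathcal{N}\bigl(0, \sigma_1^2(W)\bigr) \quad \text{in distribution as } n \to \infty ,
\]

where n is the sequence length and $\hat{N}_1(W)$ is the estimated expected count of W,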

and a good estimate of the asymptotic variance is given by
eqnarray29

where tex2html_wrap_inline87 is the concatenated word tex2html_wrap_inline89 ; n(.) denotes the count of words inside the word W and the notation "a+" stands for the letter a followed by any letter.

Note that the variance takes into account the periodic structure of the word.
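To make the notion of periodic structure concrete, the sketch below lists the periods of a word, that is, the shifts d at which the word overlaps itself. It assumes that "periodic structure" refers to these self-overlaps; the function name `periods` is hypothetical and is not taken from the original material.

```python
def periods(word):
    """Return every shift d (1 <= d < len(word)) at which word overlaps itself.

    d is a period when word[i] == word[i + d] for all valid i, i.e. when a
    second occurrence of the word can start d letters after the first.
    """
    h = len(word)
    return [d for d in range(1, h) if word[d:] == word[:h - d]]
```

A word such as ACGT has no period, so its occurrences cannot overlap, while ACACA can overlap itself at shifts 2 and 4. Overlapping occurrences are correlated, which is why such shifts matter for the variance of the count.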

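The z-score is therefore defined by

\[
U_1(W) = \frac{N(W) - \hat{N}_1(W)}{\sqrt{n\, \hat{\sigma}_1^2(W)}} ,
\]

where N(W) is the observed count of W, $\hat{N}_1(W)$ its estimated expected count, $\hat{\sigma}_1^2(W)$ the variance estimate above and n the sequence length. A large positive z-score detects a word that is over-represented under M1, whereas a large negative z-score detects a word that is under-represented under M1.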

z-scores Um and Um_3 are obtained similarly for general models Mm and Mm_3.

Note: Details of the derivation of these formulae can be found in Schbath (1995), Chapter 1.

Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 6 of 21