EXCEPTIONAL 2-LETTER WORDS

We study the counts of the sixteen 2-letter words (2-words) in the complete genome of phage Lambda, which is 48,502 bases long. We use the Gaussian approximation under model M0. The results are as follows:
-------------------------------------------------------
word | count | exp | var | z-score | rank
-------------------------------------------------------
TA | 2173 | 3049.028 | 1711.599 | -21.1747 | 1
AG | 2732 | 3260.130 | 1788.503 | -12.4881 | 2
GT | 2769 | 3168.162 | 1754.773 | -9.5288 | 3
AC | 2574 | 2889.303 | 1649.831 | -7.7626 | 4
CT | 2536 | 2807.795 | 1618.717 | -6.7555 | 5
GG | 3179 | 3387.512 | 1833.617 | -4.8694 | 6
CC | 2497 | 2660.707 | 1560.301 | -4.1444 | 7
TC | 2676 | 2807.795 | 1618.717 | -3.2758 | 8
GA | 3256 | 3260.130 | 1788.503 | -0.0977 | 9
CG | 3113 | 3002.195 | 1691.447 | 2.6942 | 10
AT | 3337 | 3049.028 | 1711.599 | 6.9606 | 11
CA | 3214 | 2889.303 | 1649.831 | 7.9939 | 12
TT | 3346 | 2963.015 | 1679.320 | 9.3458 | 13
AA | 3693 | 3137.539 | 1744.499 | 13.2990 | 14
GC | 3613 | 3002.195 | 1691.447 | 14.8516 | 15
TG | 3793 | 3168.162 | 1754.773 | 14.9162 | 16
The second column (count) gives the observed count of the word in the sequence, the third column (exp) gives the expected count under M0, and the fourth column (var) gives the variance of the difference (observed count - expected count). The fifth column is the associated z-score, that is, this difference divided by the square root of its variance. The sixteen 2-words are listed in order of increasing z-score, and the last column (rank) records that order.
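As a check on how the columns relate, the following minimal sketch recomputes the z-scores and ranks from the count, exp and var values copied from the table above. The names `ROWS` and `z_score` are illustrative only and are not part of the original analysis.

```python
from math import sqrt

# (word, observed count, expected count under M0, variance of count - exp),
# copied from the table above.
ROWS = [
    ("TA", 2173, 3049.028, 1711.599),
    ("AG", 2732, 3260.130, 1788.503),
    ("GT", 2769, 3168.162, 1754.773),
    ("AC", 2574, 2889.303, 1649.831),
    ("CT", 2536, 2807.795, 1618.717),
    ("GG", 3179, 3387.512, 1833.617),
    ("CC", 2497, 2660.707, 1560.301),
    ("TC", 2676, 2807.795, 1618.717),
    ("GA", 3256, 3260.130, 1788.503),
    ("CG", 3113, 3002.195, 1691.447),
    ("AT", 3337, 3049.028, 1711.599),
    ("CA", 3214, 2889.303, 1649.831),
    ("TT", 3346, 2963.015, 1679.320),
    ("AA", 3693, 3137.539, 1744.499),
    ("GC", 3613, 3002.195, 1691.447),
    ("TG", 3793, 3168.162, 1754.773),
]


def z_score(count, exp, var):
    """Standardized difference between observed and expected count."""
    return (count - exp) / sqrt(var)


scores = {word: z_score(count, exp, var) for word, count, exp, var in ROWS}

# Rank 1 is the most under-represented word, rank 16 the most over-represented.
ranked = sorted(scores, key=scores.get)

for rank, word in enumerate(ranked, start=1):
    print(f"{word} | {scores[word]:9.4f} | {rank}")
```

The sixteen observed counts sum to 48,501, one less than the sequence length, which is what overlapping 2-word counts along a linear sequence of 48,502 bases should give.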
TG, GC and AA are thus the most over-represented 2-words in the Lambda genome, given the frequencies of A, C, G and T, whereas TA, AG and GT are the most avoided.
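To illustrate what "given the frequencies of A, C, G and T" means, here is a rough sketch that counts overlapping 2-words and compares them with a first-order expectation for independent letters, (n - 1) * p(a) * p(b), with the letter frequencies estimated from the sequence itself. This formula is an assumption made for illustration; the exp column of the table may come from a slightly different estimator. The function names and the toy sequence are hypothetical.

```python
from collections import Counter


def count_2words(seq):
    """Count overlapping 2-letter words along a linear sequence."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def expected_2words_independent(seq, alphabet="ACGT"):
    """First-order expected 2-word counts if letters were independent.

    Assumption: letter probabilities are estimated by their observed
    frequencies, and each of the n - 1 positions contributes p(a) * p(b).
    """
    n = len(seq)
    letters = Counter(seq)
    return {
        a + b: (n - 1) * (letters[a] / n) * (letters[b] / n)
        for a in alphabet
        for b in alphabet
    }


# Toy sequence for demonstration only (not Lambda data).
toy = "ACGTTGCAATGC"
observed = count_2words(toy)
expected = expected_2words_independent(toy)

for word in sorted(expected):
    print(f"{word} | {observed[word]} | {expected[word]:.3f}")
```

Applied to a full genome, the same two functions would give the count column exactly and an approximation of the exp column. The variance needed for the z-score is not derived here.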
Finding words with unexpected frequencies in DNA sequences. 11.9.98