EXCEPTIONAL 2-LETTER WORDS


We study the counts of the sixteen 2-words in the complete genome of the phage Lambda, which is 48,502 bases long. We use the Gaussian approximation under the model M0. The results are as follows:

   -------------------------------------------------------
   word | count |   exp    |   var    | z-score  |  rank  
   -------------------------------------------------------
     TA |  2173 | 3049.028 | 1711.599 | -21.1747 |    1
     AG |  2732 | 3260.130 | 1788.503 | -12.4881 |    2
     GT |  2769 | 3168.162 | 1754.773 |  -9.5288 |    3
     AC |  2574 | 2889.303 | 1649.831 |  -7.7626 |    4
     CT |  2536 | 2807.795 | 1618.717 |  -6.7555 |    5
     GG |  3179 | 3387.512 | 1833.617 |  -4.8694 |    6
     CC |  2497 | 2660.707 | 1560.301 |  -4.1444 |    7
     TC |  2676 | 2807.795 | 1618.717 |  -3.2758 |    8
     GA |  3256 | 3260.130 | 1788.503 |  -0.0977 |    9
     CG |  3113 | 3002.195 | 1691.447 |   2.6942 |   10
     AT |  3337 | 3049.028 | 1711.599 |   6.9606 |   11
     CA |  3214 | 2889.303 | 1649.831 |   7.9939 |   12
     TT |  3346 | 2963.015 | 1679.320 |   9.3458 |   13
     AA |  3693 | 3137.539 | 1744.499 |  13.2990 |   14
     GC |  3613 | 3002.195 | 1691.447 |  14.8516 |   15
     TG |  3793 | 3168.162 | 1754.773 |  14.9162 |   16
   -------------------------------------------------------

The second column (count) gives the observed count of the word in the sequence, the third column (exp) gives the expected count under M0, and the fourth column (var) gives the variance of the difference (observed count - expected count). The fifth column is the associated z-score, that is, (count - exp) / sqrt(var). The sixteen 2-words are listed in order of increasing z-score, and the last column (rank) gives their position in that order.
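
As a quick check of how the columns relate, the z-score of a word can be recomputed from its count, expected count and variance. The following is a minimal Python sketch, not the program used to produce the table; the function name z_score and the choice of three sample rows are assumptions made for illustration, and the numbers are copied from the table.

```python
import math


def z_score(count, exp, var):
    """Standardized difference between observed and expected counts.

    Hypothetical helper: divides (count - exp) by the standard
    deviation sqrt(var), as the table's z-score column does.
    """
    return (count - exp) / math.sqrt(var)


# (count, exp, var) for three rows, copied from the table
rows = {
    "TA": (2173, 3049.028, 1711.599),
    "GA": (3256, 3260.130, 1788.503),
    "TG": (3793, 3168.162, 1754.773),
}

for word, (count, exp, var) in rows.items():
    print(f"{word}: z = {z_score(count, exp, var):.4f}")
```

Because exp and var are printed with only three decimals, the recomputed values may differ from the tabulated z-scores in the last digit.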

TG, GC and AA are thus the most over-represented 2-words in the Lambda genome, relative to the frequencies of A, C, G and T, whereas TA, AG and GT are the most avoided.
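
The same conclusion can be read off mechanically by sorting the words on their z-score. This is a small Python sketch under the assumption that the z-scores are simply copied from the table; the variable names (z, ranked, most_avoided, most_over) are illustrative only.

```python
# z-scores copied from the table, keyed by 2-word
z = {
    "TA": -21.1747, "AG": -12.4881, "GT": -9.5288, "AC": -7.7626,
    "CT": -6.7555, "GG": -4.8694, "CC": -4.1444, "TC": -3.2758,
    "GA": -0.0977, "CG": 2.6942, "AT": 6.9606, "CA": 7.9939,
    "TT": 9.3458, "AA": 13.2990, "GC": 14.8516, "TG": 14.9162,
}

# Sort the words by increasing z-score, as in the table's rank column
ranked = sorted(z, key=z.get)

most_avoided = ranked[:3]        # three lowest z-scores
most_over = ranked[-3:][::-1]    # three highest z-scores, largest first

print("most avoided:", most_avoided)
print("most over-represented:", most_over)
```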

Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 12 of 21