DNA SEQUENCES AND WORDS

Consider a DNA sequence as a long sequence of letters in the
4-letter alphabet {A, C, G, T},
for instance,
S = ATGGACTGCTGCTAGATTGCTTA ...
The count of a word W in the sequence, denoted by N(W), is the number of all occurrences of W in the sequence, overlapping or not. For instance, N(TGCT) = 3 in the sequence S: there are 3 occurrences of TGCT, starting at positions 7, 10 and 18.
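The counting rule above can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the function name `occurrences` is hypothetical, and S is taken to be the 23 letters shown in the example, ignoring whatever follows the "...". The search restarts one letter after each match so that overlapping occurrences are not missed.

```python
def occurrences(word, seq):
    """Return the 1-based start positions of every occurrence of word in seq.

    Overlapping occurrences are included: after each match the search
    resumes one letter further, not after the end of the match.
    """
    positions = []
    start = seq.find(word)          # 0-based index, or -1 if absent
    while start != -1:
        positions.append(start + 1)  # convert to 1-based position
        start = seq.find(word, start + 1)
    return positions


# Assumption: only the visible prefix of the example sequence S is used.
S = "ATGGACTGCTGCTAGATTGCTTA"
starts = occurrences("TGCT", S)
print(starts)       # [7, 10, 18]
print(len(starts))  # N(TGCT) = 3
```

Note that Python's built-in `str.count` counts non-overlapping matches only, so it would report 2 rather than 3 here.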
However, a coding DNA sequence is naturally read as a sequence of
consecutive 3-words, called codons. Each letter of the sequence is then
associated with an integer 1, 2 or 3, called its phase; for instance,
S = ATG|GAC|TGC|TGC|TAG|ATT|GCT|TA ...
    123|123|123|123|123|123|123|12
            -----         -----
                -----
We may then want to count only the occurrences of W on phase 1, that is, those whose last letter is on phase 1. For example, there are 2 occurrences of TGCT on phase 1 and 1 occurrence of TGCT on phase 3 in the above sequence.
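As a rough sketch of phase-specific counting, the Python snippet below tallies the occurrences of a word by the phase of its last letter. It assumes that the first letter of the sequence is on phase 1, as in the example, and uses only the visible 23 letters of S. The name `phase_counts` is hypothetical and chosen for illustration.

```python
from collections import Counter


def phase_counts(word, seq):
    """Count occurrences of word in seq, grouped by the phase of the last letter.

    Assumption: the first letter of seq is on phase 1, so the letter at
    1-based position p has phase ((p - 1) % 3) + 1.
    """
    counts = Counter()
    start = seq.find(word)               # 0-based start of the match
    while start != -1:
        last = start + len(word)         # 1-based position of the last letter
        counts[(last - 1) % 3 + 1] += 1
        start = seq.find(word, start + 1)  # allow overlapping matches
    return dict(counts)


# Assumption: only the visible prefix of the example sequence S is used.
S = "ATGGACTGCTGCTAGATTGCTTA"
print(phase_counts("TGCT", S))  # {1: 2, 3: 1}
```

The occurrences starting at positions 7 and 10 end at positions 10 and 13, both on phase 1, while the one starting at position 18 ends at position 21, on phase 3.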
Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 2 of 21