DNA SEQUENCES AND WORDS

Consider a DNA sequence as a long sequence of letters over the 4-letter alphabet {A, C, G, T}, for instance,

S = ATGGACTGCTGCTAGATTGCTTA ...

A word is by definition a short sequence of letters in the {A, C, G, T} alphabet. For instance,
W = TGCT
is a 4-word (word of length 4). Note that two occurrences of this word may overlap in the sequence.

The count of W in the sequence, denoted by N(W), is the number of all occurrences of W in the sequence, whether they overlap or not. For instance, N(TGCT) = 3 in the sequence S: there are 3 occurrences of TGCT, starting at positions 7, 10 and 18.
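
The count N(W) can be checked with a short Python sketch. It is an illustration only, not part of the original material: the function name occurrence_positions is hypothetical, and S holds only the 23 letters displayed above (the sequence continues beyond them).

```python
def occurrence_positions(sequence: str, word: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of word,
    overlapping occurrences included."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        # Move forward by one letter only, so overlapping occurrences are kept.
        start = sequence.find(word, start + 1)
    return positions


# Only the displayed prefix of S; the original sequence is longer.
S = "ATGGACTGCTGCTAGATTGCTTA"
positions = occurrence_positions(S, "TGCT")
count = len(positions)  # this is N(TGCT) on the displayed prefix
print(positions, count)  # [7, 10, 18] 3
```

Note that Python's built-in str.count counts non-overlapping occurrences only, so it would report 2 here instead of 3; this is why the sketch scans with str.find and advances one letter at a time.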

However, a coding DNA sequence is naturally read as a sequence of consecutive 3-words, called codons. Each letter of the sequence is then associated with an integer 1, 2 or 3, called its phase; for instance,

S     = ATG|GAC|TGC|TGC|TAG|ATT|GCT|TA ...
phase = 123|123|123|123|123|123|123|12
                -----         -----
                    -----

We may then want to count only the occurrences of W on phase 1, that is, those whose last letter is on phase 1. For example, in the above sequence there are 2 occurrences of TGCT on phase 1 (ending at positions 10 and 13) and 1 occurrence of TGCT on phase 3 (ending at position 21).
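
The phase of an occurrence can be computed in the same way. The following Python sketch is only illustrative: the names phase and count_by_phase are hypothetical, it assumes that the first letter of the sequence is on phase 1 (as in the layout above), and S again holds only the displayed letters.

```python
from collections import Counter


def phase(position: int) -> int:
    """Phase (1, 2 or 3) of the letter at a 1-based position,
    assuming the first letter of the sequence is on phase 1."""
    return (position - 1) % 3 + 1


def count_by_phase(sequence: str, word: str) -> Counter:
    """Count occurrences of word (overlaps included), grouped by the
    phase of the LAST letter of each occurrence."""
    counts = Counter()
    start = sequence.find(word)  # 0-based index of the first letter
    while start != -1:
        # 1-based position of the last letter: (start + 1) + len(word) - 1
        last_position = start + len(word)
        counts[phase(last_position)] += 1
        start = sequence.find(word, start + 1)
    return counts


# Only the displayed prefix of S; the original sequence is longer.
S = "ATGGACTGCTGCTAGATTGCTTA"
counts = count_by_phase(S, "TGCT")
print(counts[1], counts[2], counts[3])  # 2 0 1
```

The three occurrences end at positions 10, 13 and 21, which are on phases 1, 1 and 3, in agreement with the counts stated above.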

Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 2 of 21