COMPARISON OF EXCEPTIONAL WORDS ON TWO PHASES

Each of the 64 codons is associated with one of the 20 amino acids (or a stop signal) via the genetic code. The sequence of amino acids obtained from a coding DNA sequence (a gene) forms a protein. The reading frame is therefore important: an occurrence of a given word can have different meanings depending on its phase in the sequence. For instance, CAGG on phase 1 corresponds to the codon CAG (coding for glutamine) followed by a G, whereas CAGG on phase 3 corresponds to a C followed by the codon AGG (coding for arginine). It can therefore be of interest to study the exceptionality of a word with respect to a given phase (1, 2 or 3).
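As a rough illustration of what "phase" means in practice, the sketch below counts the occurrences of a word separately for each phase. It is not taken from any particular software: the function name `phased_counts` and the toy sequence are hypothetical, the sequence is assumed to be read in frame from its first letter, and the phase of an occurrence is assumed to be the codon position of its first letter (for a 4-letter word this is also the codon position of its last letter, so both conventions agree).

```python
def phased_counts(seq, word):
    """Count occurrences of `word` in `seq`, split by the codon
    position (1, 2 or 3) of the first letter of each occurrence.

    Assumption: seq[0] is the first letter of a codon.
    """
    counts = {1: 0, 2: 0, 3: 0}
    for i in range(len(seq) - len(word) + 1):
        if seq[i:i + len(word)] == word:
            counts[i % 3 + 1] += 1  # overlapping occurrences are counted
    return counts


# Toy sequence, made up for illustration only.
toy = "CAGGACAGG"
print(phased_counts(toy, "CAGG"))  # {1: 1, 2: 0, 3: 1}
```

In the toy sequence, the first CAGG is read as CAG|G (phase 1) and the second as C|AGG (phase 3).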

It is well known to biologists that there is a codon bias: the different codons coding for a given amino acid are not used with the same frequency in the genome, and each organism has its own codon bias. We therefore recommend using model M2_3 (or a periodic model of higher order) to take the codon bias into account.
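To make the idea of a phased model more concrete, here is a minimal sketch of how the transition probabilities of an order-2 model with three phases could be estimated from a coding sequence. It assumes that M2_3 means "probability of a letter given the two preceding letters, with one set of probabilities per codon position of the predicted letter", and it uses plain frequency ratios. The function name `estimate_phased_order2` is hypothetical, and this is not presented as the actual implementation behind the results on this page.

```python
from collections import Counter


def estimate_phased_order2(seq):
    """Estimate P_k(c | ab) for k = 1, 2, 3, where k is the codon
    position of the predicted letter c.

    Assumption: seq[0] is the first letter of a codon.
    Returns a dict mapping (k, a, b, c) to the estimated probability.
    """
    triples = Counter()   # (k, a, b, c) -> count
    contexts = Counter()  # (k, a, b)    -> count
    for i in range(2, len(seq)):
        k = i % 3 + 1
        a, b, c = seq[i - 2], seq[i - 1], seq[i]
        triples[(k, a, b, c)] += 1
        contexts[(k, a, b)] += 1
    return {key: n / contexts[key[:3]] for key, n in triples.items()}


# Toy sequence, made up for illustration only.
probs = estimate_phased_order2("ATGATGATC")
print(probs[(3, "A", "T", "G")])  # 2/3: after AT, position 3 is G twice and C once
```

Because the probabilities are estimated separately for each codon position, a preference for one synonymous codon over another is absorbed by the model instead of showing up as an exceptional word.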

The following figure compares the Gaussian statistics (z-scores) of all 4-words on phase 1 (x-axis) and on phase 3 (y-axis) in a long coding DNA sequence of E. coli under M2_3.

CAGG is exceptionally frequent on phase 1 (CAG|G), but on phase 3 (C|AGG) it occurs about as often as expected given the counts of phased 3-words.

GATC is under-represented on both phase 1 and phase 3; this may simply reveal an under-representation of GATC whatever its phase in the sequence.
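The phrase "expected given the counts of phased 3-words" can be sketched as follows. For a 4-word w1w2w3w4 on phase k, the usual plug-in estimate under an order-2 phased model combines the count of w1w2w3 and of w2w3 on phase k-1 with the count of w2w3w4 on phase k. The code below assumes that estimator and assumes that the phase of a word is the codon position of its last letter; the function names are hypothetical. The z-score itself also needs a variance term, which is not reproduced here.

```python
from collections import Counter


def phased_word_counts(seq, length):
    """Counts of all words of `length` letters, keyed by
    (word, codon position of the word's last letter).

    Assumption: seq[0] is the first letter of a codon.
    """
    counts = Counter()
    for end in range(length - 1, len(seq)):
        word = seq[end - length + 1:end + 1]
        counts[(word, end % 3 + 1)] += 1
    return counts


def expected_count_order2(seq, word, phase):
    """Plug-in expected count of a 4-letter word on `phase`:
    N_prev(w1w2w3) * N_phase(w2w3w4) / N_prev(w2w3),
    where prev is the phase just before `phase` (cyclically)."""
    if len(word) != 4:
        raise ValueError("this sketch only handles 4-letter words")
    prev = (phase - 2) % 3 + 1  # 1 -> 3, 2 -> 1, 3 -> 2
    n3 = phased_word_counts(seq, 3)
    n2 = phased_word_counts(seq, 2)
    denom = n2[(word[1:3], prev)]
    if denom == 0:
        return 0.0
    return n3[(word[:3], prev)] * n3[(word[1:], phase)] / denom


# Toy sequence, made up for illustration only.
print(expected_count_order2("CAGGACAGG", "CAGG", 1))  # 1.0
```

A word such as CAGG on phase 3 is unexceptional when its observed count is close to this expected value; CAGG on phase 1 is exceptional because its observed count is far above it.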

Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: 17 of 21