GLOSSARY


Base : a DNA sequence is a sequence of nucleotides, each characterized by one of the four bases Adenine, Cytosine, Guanine and Thymine, often coded by A, C, G, T.


Clump : a clump of a word W in a sequence is a maximal set of overlapping occurrences of W. By definition, two clumps of W cannot overlap in a sequence. The size of a clump is the number of occurrences of W that it contains.
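
As an illustration of the clump definition, here is a minimal Python sketch that groups the occurrences of a word into clumps. The function name find_clumps and the choice to represent a clump by the list of its 0-based start positions are assumptions made for this example, not part of the original page.

```python
def find_clumps(sequence, word):
    """Group the occurrences of `word` in `sequence` into clumps.

    Each clump is returned as the list of 0-based start positions of its
    occurrences (a hypothetical representation chosen for this sketch).
    """
    h = len(word)
    # Start positions of all occurrences, overlapping ones included.
    starts = [i for i in range(len(sequence) - h + 1)
              if sequence.startswith(word, i)]
    clumps = []
    for s in starts:
        # Two occurrences overlap when their start positions differ by less than h.
        if clumps and s < clumps[-1][-1] + h:
            clumps[-1].append(s)
        else:
            clumps.append([s])
    return clumps


clumps = find_clumps("ATATATGGATAT", "ATAT")
sizes = [len(c) for c in clumps]
```

Here ATAT occurs at positions 0, 2 and 8; the first two occurrences overlap and form a clump of size 2, while the third forms a clump of size 1.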


Coding DNA sequence : it is a DNA sequence that is read, in a particular reading frame, codon by codon, and translated into a sequence of amino acids. A gene is a coding DNA sequence whose corresponding amino acid sequence is exactly a protein.
A coding DNA sequence can be a part of a gene, or several genes concatenated in the same reading frame.


Codon : it is a 3-letter word that codes for an amino acid via the genetic code. Of the 64 possible codons, 61 code for only 20 amino acids and 3 code for the stop signal. Therefore, an amino acid can be coded by several codons.


Conjugate word : the conjugate of a word W is obtained by reversing W and replacing A by T, T by A, C by G and G by C. For instance, the conjugate of AGGCAC is GTGCCT.
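
The conjugate can be computed in a few lines of Python. This is only a sketch; the names COMPLEMENT and conjugate are chosen for the example and do not come from the original page.

```python
# Translation table for the base substitution A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def conjugate(word):
    """Return the conjugate of `word`: complement each base, then reverse."""
    return word.translate(COMPLEMENT)[::-1]


example = conjugate("AGGCAC")  # the example from the definition
```

Complementing and then reversing gives the same result as reversing and then complementing, so the order of the two steps does not matter.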


DNA sequence : it is a long molecule consisting of a succession of nucleotides; each nucleotide is characterized by one of the four bases Adenine, Cytosine, Guanine and Thymine. A DNA sequence is then generally represented by the sequence of bases A, C, G, T.


Model Mm : it is the m-order Markov chain model on the state space {A, C, G, T} with a homogeneous transition probability matrix. The probability of a letter at a given position in the sequence depends on the m previous letters. This model is exactly fitted to the counts of all the (m+1)-letter words.
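
To make the link between the model Mm and the (m+1)-letter word counts concrete, the following Python sketch estimates the transition probabilities from a sequence. It assumes the usual count-ratio estimator (count of the (m+1)-letter word divided by the number of (m+1)-letter words sharing its first m letters); the function name fit_markov and the dictionary layout are hypothetical.

```python
from collections import Counter, defaultdict


def fit_markov(sequence, m):
    """Estimate m-order transition probabilities from (m+1)-letter word counts.

    Returns a dict mapping each observed (m+1)-letter word w to the estimated
    probability of its last letter given its first m letters.
    Assumption: plain count ratios, with no smoothing of unseen words.
    """
    counts = Counter(sequence[i:i + m + 1] for i in range(len(sequence) - m))
    context_totals = defaultdict(int)
    for w, c in counts.items():
        context_totals[w[:m]] += c  # occurrences of the m-letter context as a prefix
    return {w: c / context_totals[w[:m]] for w, c in counts.items()}


probs = fit_markov("ACGACGACT", 1)
```

In this toy sequence the letter C is followed twice by G and once by T, so the estimated probability of G after C is 2/3 and that of T after C is 1/3.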


Model Mm_3 : it is an m-order Markov chain model on the state space {A, C, G, T} with 3-periodic transition probabilities. The probability of a letter at a given position and at a given phase in the sequence depends on the m previous letters. This model is exactly fitted to the counts of all the (m+1)-letter words on each of the three phases.


Maximal model : to study the count of an h-letter word, the maximal order of the Markov chain model is h-2. Indeed, an (h-1)-order Markov chain model exactly fits the counts of all the h-letter words. The maximal model is then the model of maximal order, namely M(h-2) or M(h-2)_3.


Palindrome : it is a word that is identical to its conjugate. For instance, ACTAGT is a palindrome.
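
A palindrome test follows directly from the definition. The sketch below is self-contained, so it repeats a small conjugate helper; both function names are chosen for the example only.

```python
def conjugate(word):
    """Reverse `word` and swap A<->T and C<->G."""
    return word.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def is_palindrome(word):
    """A word is a palindrome when it is identical to its conjugate."""
    return word == conjugate(word)


check = is_palindrome("ACTAGT")  # the example from the definition
```

Note that this is the biological notion of palindrome: ACTAGT does not read the same backwards letter by letter, but it is equal to its own conjugate.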


Period : the lag between two overlapping occurrences of a word is a period of this word. A word may have several periods. Periods that are not multiples of the smallest period are said to be principal. A period of a word is less than the word length.
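
The periods of a word can be listed by checking, for each lag p, whether the word shifted by p letters agrees with itself. The Python sketch below does this; the function names are hypothetical, and it assumes that the smallest period itself counts as principal (only its strict multiples are excluded).

```python
def periods(word):
    """Return all lags p (1 <= p < len(word)) at which two occurrences can overlap."""
    n = len(word)
    # p is a period when the last n-p letters equal the first n-p letters.
    return [p for p in range(1, n) if word[p:] == word[:n - p]]


def principal_periods(word):
    """Keep the smallest period and the periods that are not multiples of it.

    Assumption: the smallest period is itself treated as principal.
    """
    ps = periods(word)
    if not ps:
        return []
    smallest = ps[0]
    return [p for p in ps if p == smallest or p % smallest != 0]


all_periods = periods("ATATA")
main_periods = principal_periods("AACAA")
```

For ATATA the periods are 2 and 4, and only 2 is principal. For AACAA the periods are 3 and 4, and both are principal because 4 is not a multiple of 3. A word such as ACGT has no period at all.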


Periodic word : if two occurrences of a given word can overlap in a sequence, the word is said to be periodic. That is, the first letters of a periodic word are identical to its last letters (see the page "periodic structure of the word").


Phase : it is an integer, 1, 2 or 3, associated with each base of a coding DNA sequence. A base on phase 1 is followed by a base on phase 2, which is followed by a base on phase 3, which is followed by a base on phase 1, and so on. The phase of the first base of a coding DNA sequence is defined by the reading frame.


Phased word : in a coding DNA sequence, an occurrence of a given word is in one of the three possible positions according to the reading frame. A word is said to be on phase k if and only if its last letter is on phase k in the sequence. The count of a word is then the sum of the three counts of the word on phases 1, 2 and 3.
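
The phased counts can be obtained by recording the phase of the last letter of each occurrence. The sketch below assumes 0-based string indices and a parameter first_phase giving the phase of the first base; both the function name and the parameter are hypothetical.

```python
def phased_counts(sequence, word, first_phase=1):
    """Count the occurrences of `word` on phases 1, 2 and 3.

    The phase of an occurrence is the phase of its last letter.
    `first_phase` (assumed parameter) is the phase of the first base.
    """
    counts = {1: 0, 2: 0, 3: 0}
    h = len(word)
    for i in range(len(sequence) - h + 1):
        if sequence.startswith(word, i):
            last = i + h - 1  # 0-based index of the last letter
            phase = (last + first_phase - 1) % 3 + 1
            counts[phase] += 1
    return counts


by_phase = phased_counts("ATGATGATG", "ATG")
total = sum(by_phase.values())
```

In this example all three occurrences of ATG end on phase 3, and the total count is the sum of the three phased counts, as stated in the definition.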


Rare word : it is a word whose occurrence is considered as a rare event in the sequence, in the probabilistic sense: the expected number of occurrences remains bounded when the length of the sequence tends to infinity. In practice, a rare word is generally taken to be a word long enough to have a very small expected count in the sequence, at least less than 1 (see the page "choice of the approximation" for an approximate rule).


Reading frame : it determines how a coding DNA sequence has to be read, codon by codon, to form the sequence of amino acids.
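
Reading a sequence codon by codon can be sketched as follows. The function name split_codons and the convention that the frame is a 0-based offset (0, 1 or 2) of the first complete codon are assumptions made for this example.

```python
def split_codons(sequence, frame=0):
    """Split `sequence` into consecutive 3-letter codons.

    `frame` (assumed convention) is the 0-based index of the first base of
    the first codon; trailing bases that do not fill a codon are dropped.
    """
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


frame0 = split_codons("ATGGCCTAA")
frame1 = split_codons("ATGGCCTAA", frame=1)
```

The same sequence yields ATG, GCC, TAA in the first frame and TGG, CCT in the second, which is why the reading frame has to be fixed before a coding sequence can be interpreted.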


Statistic of a word : there are two kinds of statistics, depending on the statistical approximation of the word count. When the count is approximated by a Gaussian distribution, the (Gaussian) statistic is simply the z-score. When the count is approximated by a compound Poisson distribution, the (compound Poisson) statistic is the Gaussian quantile corresponding to the probability that an ad hoc compound Poisson variable is greater than the observed count. In both cases, the statistic depends on the order m of the model (Mm or Mm_3).


Total variation distance : the total variation distance between the distributions of two random variables X and Y defined on the same space E is the supremum, over all E-measurable subsets B, of $|P(X \in B) - P(Y \in B)|$.
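
When X and Y take values in a finite or countable space, the supremum over subsets B of the absolute difference between P(X in B) and P(Y in B) equals half the sum of the absolute differences between the point probabilities. The Python sketch below uses this identity; the function name and the representation of a distribution as a dict from values to probabilities are assumptions made for the example.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    `p` and `q` map each value to its probability (assumed representation);
    values missing from a dict are taken to have probability 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


d = total_variation({"A": 0.5, "C": 0.5},
                    {"A": 0.25, "C": 0.25, "G": 0.5})
```

In this example the supremum is reached for B = {G}, where the two probabilities differ by 0.5.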


Word : a word is a short sequence of letters in the {A, C, G, T} alphabet.


z-score : it is the normalized difference (observed count - expected count) calculated under a given model. The normalization proposed is such that the z-score is asymptotically Gaussian.


Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: Glossary.