GLOSSARY
Base :
a DNA sequence is a sequence of nucleotides, each characterized
by one of the four bases Adenine, Cytosine, Guanine
and Thymine, often coded by A, C,
G, T.
Clump :
a clump of a word W in a sequence is a maximal set of
overlapping occurrences of W. By definition, clumps of
W cannot overlap in a sequence. The size of a clump is the
number of occurrences of W contained in the clump.
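As an illustration of the definition above, here is a minimal Python sketch that groups the occurrences of a word into clumps. The function name `find_clumps` and the use of 0-based start positions are assumptions made for this example, not part of the original page.

```python
def find_clumps(sequence, word):
    """Return the clumps of `word` in `sequence`.

    Each clump is the list of 0-based start positions of the
    overlapping occurrences it contains (hypothetical helper).
    """
    starts = [i for i in range(len(sequence) - len(word) + 1)
              if sequence.startswith(word, i)]
    clumps = []
    for pos in starts:
        # An occurrence joins the current clump if it starts
        # before the previous occurrence ends.
        if clumps and pos < clumps[-1][-1] + len(word):
            clumps[-1].append(pos)
        else:
            clumps.append([pos])
    return clumps


clumps = find_clumps("GATATATCCATAG", "ATA")
sizes = [len(c) for c in clumps]  # size of each clump
```

The size of a clump is simply the length of its list of start positions.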
Coding DNA sequence :
it is a DNA sequence that is read, in a particular
reading frame, codon by codon,
and translated into a sequence of
amino acids. A gene is a coding DNA sequence whose
corresponding amino acid sequence is exactly a protein.
A coding DNA sequence can be a part of a gene, or concatenated
genes in the same reading frame.
Codon :
it is a 3-letter word that codes for an amino acid, or for the stop
signal, via the genetic code. Of the 64 possible codons, 61 code for
only 20 amino acids and 3 code for the stop signal. Therefore,
an amino acid can be coded by several codons.
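The counts quoted above can be checked with a short Python sketch. It assumes the standard genetic code, written as a 64-letter string over the base order T, C, A, G with `*` for the stop signal; the names `GENETIC_CODE` and `translate`, and the 0-based `frame` argument, are choices made for this example only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(coding_sequence, frame=0):
    """Read the sequence codon by codon, starting at offset `frame` (0, 1 or 2)."""
    codons = (coding_sequence[i:i + 3]
              for i in range(frame, len(coding_sequence) - 2, 3))
    return "".join(GENETIC_CODE[c] for c in codons)


n_stop = sum(aa == "*" for aa in GENETIC_CODE.values())
n_coding = len(GENETIC_CODE) - n_stop
n_amino_acids = len(set(GENETIC_CODE.values()) - {"*"})
```

Changing `frame` changes the codons that are read, which is why the reading frame has to be fixed before a coding sequence can be translated.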
Conjugate word :
the conjugate of a word W is obtained by reversing W
and replacing A by T, T by A, C by G and G by C. For instance,
the conjugate of AGGCAC is GTGCCT.
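A minimal Python sketch of this operation, assuming words contain only the letters A, C, G and T (the function name `conjugate` is chosen for this example):

```python
# Translation table swapping A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def conjugate(word):
    """Complement every base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


example = conjugate("AGGCAC")
```

Complementing and reversing commute, so the order of the two steps does not matter.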
DNA sequence :
it is a long molecule consisting of a succession of
nucleotides; each nucleotide is characterized by one of the
four bases Adenine, Cytosine, Guanine and Thymine. A DNA sequence
is then generally represented by the sequence of bases A,
C, G, T.
Model Mm :
it is the m-order Markov chain model on the state space
{A, C, G, T}
with a homogeneous transition probability matrix.
The probability of a letter at a given position in the sequence
depends on the m previous letters. This model is exactly fitted
to the counts of all the (m+1)-letter words.
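To illustrate how the model is tied to the (m+1)-letter word counts, here is a hedged Python sketch that estimates the transition probabilities of Mm from a sequence by dividing each (m+1)-letter word count by the total count of words sharing the same first m letters. The function name `fit_markov` and the dictionary output format are assumptions for this example.

```python
from collections import Counter, defaultdict


def fit_markov(sequence, m):
    """Estimate P(last letter | m previous letters) from (m+1)-letter word counts.

    Returns a dict mapping each observed (m+1)-letter word to the
    estimated probability of its last letter given its first m letters.
    """
    counts = Counter(sequence[i:i + m + 1]
                     for i in range(len(sequence) - m))
    context_totals = defaultdict(int)
    for word, n in counts.items():
        context_totals[word[:-1]] += n
    return {word: n / context_totals[word[:-1]]
            for word, n in counts.items()}


probs = fit_markov("ACACAG", 1)
```

For m = 0 the same function returns the observed letter frequencies.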
Model Mm_3 :
it is an m-order Markov chain model on the state space
{A, C, G, T}
with 3-periodic transition probabilities.
The probability of a letter at a given position and at a given phase
in the sequence depends on the m previous letters.
This model is exactly fitted to the counts of all the (m+1)-letter
words on each of the three phases.
Maximal model :
to study the count of an h-letter word, the maximal order
of the Markov chain model is h-2.
Indeed, an (h-1)-order Markov
chain model fits the counts of all the h-letter words exactly, so
the expected count of the word would equal its observed count.
The maximal model will then be the model of maximal order, namely
M(h-2) or M(h-2)_3.
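As a sketch of what the maximal model gives in practice, the Python code below computes the usual plug-in estimate of the expected count of an h-letter word under M(h-2): the count of its first h-1 letters times the count of its last h-1 letters, divided by the count of its h-2 central letters. This estimator is an assumption of the example (the glossary itself does not state it), as are the function names.

```python
def count(sequence, word):
    """Number of (possibly overlapping) occurrences of `word` in `sequence`."""
    return sum(sequence.startswith(word, i)
               for i in range(len(sequence) - len(word) + 1))


def expected_count_maximal(sequence, word):
    """Plug-in estimate of the expected count of `word` under M(h-2), h = len(word)."""
    if len(word) < 3:
        raise ValueError("this sketch only handles words of length 3 or more")
    centre = count(sequence, word[1:-1])
    if centre == 0:
        return 0.0
    return count(sequence, word[:-1]) * count(sequence, word[1:]) / centre


seq = "ACGTACGAACG"
expected = expected_count_maximal(seq, "ACG")
observed = count(seq, "ACG")
```

Comparing `observed` with `expected` is the starting point of the word statistics defined elsewhere in this glossary.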
Palindrome :
it is a word that is identical to its conjugate. For instance,
ACTAGT is a palindrome.
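The test can be written in a few lines of Python. This sketch assumes words over the letters A, C, G and T only; the name `is_palindrome` is chosen for the example.

```python
def is_palindrome(word):
    """True if the word equals its own conjugate (reverse complement)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return word == "".join(complement[base] for base in reversed(word))


check = is_palindrome("ACTAGT")
```

Note that a word of odd length can never pass this test, because its middle letter would have to be its own complement.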
Period :
the lag between two overlapping occurrences of a word is a period
of this word. A word may have several periods. Periods that are
not multiples of the smallest period are said to be
principal. A period of a word is less than the word length.
Periodic word :
if two occurrences of a given word can overlap in a
sequence, the word is said to be periodic. That is, the first
letters of a periodic word are identical to its last letters
(see the page "periodic structure of
the word").
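The periods of a word, and hence whether it is periodic, can be computed directly from the definitions above. In this minimal Python sketch, a lag p is a period when the word shifted by p letters agrees with itself on the overlap; the function names are chosen for the example.

```python
def periods(word):
    """All lags p (1 <= p < len(word)) at which two occurrences can overlap."""
    return [p for p in range(1, len(word)) if word[p:] == word[:-p]]


def is_periodic(word):
    """A word is periodic if it has at least one period."""
    return bool(periods(word))


example = periods("ATATA")
```

For "ATATA", the first three letters equal the last three (lag 2) and the first letter equals the last (lag 4).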
Phase :
it is an integer, 1, 2 or 3, associated with each base of a
coding DNA sequence.
A base on phase 1 is followed by a base on phase 2,
that is followed by a base on phase 3, that is followed by a base
on phase 1, and so on. The phase of the first base of a coding DNA
sequence is defined by the reading frame.
Phased word :
in a coding DNA sequence, an occurrence of a given
word is in one
of the three possible positions according to the reading frame.
A word is said to be on phase k if and only if its last letter
is on phase k in the sequence. The count of a word is then
the sum of the three counts of the word on phases 1, 2 and 3.
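The following Python sketch counts a word separately on each phase, using the convention above that the phase of an occurrence is the phase of its last letter. The function name `phased_counts` and the `first_phase` argument (the phase of the first base of the sequence) are assumptions for this example.

```python
def phased_counts(sequence, word, first_phase=1):
    """Count occurrences of `word` on phases 1, 2 and 3.

    `first_phase` is the phase (1, 2 or 3) of the first base of `sequence`.
    """
    counts = {1: 0, 2: 0, 3: 0}
    for i in range(len(sequence) - len(word) + 1):
        if sequence.startswith(word, i):
            last = i + len(word) - 1  # 0-based index of the last letter
            phase = (first_phase - 1 + last) % 3 + 1
            counts[phase] += 1
    return counts


by_phase = phased_counts("ATGATGATG", "TG")
total = sum(by_phase.values())  # overall count of the word
```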
Rare word :
it is a word whose occurrence is considered a rare event in the
sequence, in probabilistic terms: the expected number of occurrences
remains bounded when the length of the sequence tends to infinity.
In practice, a rare word
is generally taken to be a word long enough to have a very small
expected count in the sequence, at least less than 1
(see the page "choice of the approximation"
for an approximate rule).
Reading frame :
it determines how a coding DNA sequence has to be read,
codon by codon,
to form the sequence of amino acids.
Statistic of a word :
there are two kinds of statistics, depending on the
statistical approximation of the word count. When it is approximated
by a Gaussian distribution, the (Gaussian) statistic is simply
the z-score.
When the count is approximated by a compound Poisson
distribution, the (compound Poisson) statistic is the Gaussian
quantile corresponding to the probability that an ad-hoc compound
Poisson variable is greater than the observed count.
In both cases, the statistic depends on the order
m of the model
(Mm or Mm_3).
Total variation distance :
the total variation distance between the distributions of two
random variables X and Y defined on the same space E is the supremum over all
E-measurable subsets B of
$|P(X \in B) - P(Y \in B)|$.
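For distributions on a finite or countable space, the supremum in this definition equals half the sum of the absolute differences between the two probability functions. The Python sketch below uses that identity; representing each distribution as a dictionary from outcomes to probabilities is an assumption made for the example.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    `p` and `q` map outcomes to probabilities; missing outcomes
    are treated as having probability 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


p = {"A": 0.5, "C": 0.5}
q = {"A": 0.25, "C": 0.25, "G": 0.5}
distance = total_variation(p, q)
```

Here the supremum is reached for B = {G}, on which the two distributions differ by 0.5.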
Word :
a word is a short sequence of letters in the {A, C, G, T} alphabet.
z-score :
it is the normalized difference (observed count - expected count)
calculated under a given model. The normalization proposed is
such that the z-score is asymptotically Gaussian.
Finding words with unexpected frequencies in DNA sequences. 11.9.98 Page: Glossary.