Abstract
Background: Single Nucleotide Polymorphisms (SNPs) are widely used
molecular markers, and their use has increased massively since the inception of
Next Generation Sequencing (NGS) technologies, which allow detection of large
numbers of SNPs at low cost. However, both NGS data and their analysis are
error-prone, which can lead to the generation of false positive (FP) SNPs. We
explored the relationship between FP SNPs and seven factors involved in
mapping-based variant calling: quality of the reference sequence, read length,
choice of mapper and variant caller, mapping stringency and filtering of SNPs by
read mapping quality and read depth. This resulted in 576 possible factor level
combinations. We used error- and variant-free simulated reads to ensure that
every SNP found was indeed a false positive.
Results: The number of FP SNPs generated ranged from 0 to 36,621. All of the
experimental factors tested had statistically significant effects
on the number of FP SNPs generated and there was a considerable amount of
interaction between the different factors. Using a fragmented reference sequence
led to a dramatic increase in the number of FP SNPs generated, as did relaxed
read mapping and a lack of SNP filtering. The choice of reference assembler,
mapper and variant caller also significantly affected the outcome. The effect of
read length was more complex and suggests a possible interaction between
mapping specificity and the potential for contributing more false positives as read
length increases.
Conclusions: The choice of tools and parameters involved in variant calling can
have a dramatic effect on the number of FP SNPs produced, with particularly
poor combinations of software and/or parameter settings yielding tens of
thousands in this experiment. Between-factor interactions make simple
recommendations difficult for a SNP discovery pipeline, but the quality of the
reference sequence is clearly of paramount importance. Our findings are also a
stark reminder that it can be unwise to use the relaxed mismatch settings
provided as defaults by some read mappers when reads are being mapped to a
relatively unfinished reference sequence from e.g. a non-model organism in its
early stages of genomic exploration.
Year
2015
Category
Refereed journal