For microbial forensics, we need to compare two genomes and discover all differences between them. Once an initial set of differences is found, we need to determine whether they are all correct. Obviously, if the sequences are correct, then every difference represents a distinguishing genetic marker. But genome sequences are not 100% correct, and the accuracy varies from one nucleotide to the next. The most important factor in calculating the accuracy of a base is coverage.
When a genome is sequenced using the whole-genome shotgun (WGS) method, which today is standard, each base is sequenced independently seven or eight times, on average.
The level of sequencing redundancy is called coverage, and it can be explained simply as follows. To prepare the DNA for sequencing, total genomic DNA is sheared randomly (using sonication or other methods) and then size-selected to produce a genomic library of a certain size, for example 3 kbp. The library consists of a large set of 3 kbp inserts in a clone vector such as pUC18 (many standard and customized vectors have been designed, with preferences for different insert sizes). Sequences are prepared from these clones, and both ends of each insert are sequenced. A sequencing "read" in 2005 contains ~700-800 nucleotides (longer reads can be obtained by using different reaction mixtures), a number that is steadily getting larger as technology improves. When we say a genome was sequenced to 8x coverage, we mean that each nucleotide in the genome is, on average, contained in eight separate reads. If the original clone library represents a perfectly uniform random sample of the genome (which it never does, but the approximation is useful), then 8x coverage implies that over 99% of the nucleotides in the genome are contained in at least one read.4
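The "over 99%" figure follows from the idealized model in which reads are a uniform random sample of the genome, so the number of reads covering any given position is approximately Poisson distributed. A minimal sketch of that calculation (the function name is illustrative, not from any standard library):

```python
import math

def fraction_covered(coverage: float) -> float:
    """Fraction of genome positions contained in at least one read,
    under the idealized model where reads sample the genome uniformly
    at random (so per-base read counts are Poisson distributed).

    For mean coverage c, P(a position is in zero reads) = e^(-c).
    """
    return 1.0 - math.exp(-coverage)

for c in (1, 2, 4, 8):
    # At 8x, well over 99% of positions are covered at least once
    print(f"{c}x coverage: {fraction_covered(c):.4%} of bases in >= 1 read")
```

Real libraries are never perfectly uniform (cloning bias leaves systematic gaps), so actual coverage is somewhat worse than this model predicts.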
The length of a read is determined not only by the sequencing chemistry and technology, but also by the software that calls bases. The most commonly used basecalling programs are phred,5 TraceTuner, and KB (commercial products designed for Applied Biosystems 3700 capillary sequencers). Each of these programs converts the four-color signal (the chromatogram) generated by an automated sequencer into a series of bases, each of which has a quality value attached. These quality values are simply error probabilities converted to a more intuitive range, using the formula Q = -10 log10 P, where P is the probability that the basecall is in error. Thus if the probability of error is 1/1,000, then Q = 30. Every major sequencing center, and every genome project, uses these programs or similar ones, and thus has a quality value attached to every base.
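The quality-value formula and its inverse are simple enough to sketch directly (function names here are illustrative):

```python
import math

def phred_quality(p_error: float) -> float:
    """Convert a basecall error probability P to a quality value,
    using Q = -10 * log10(P)."""
    return -10.0 * math.log10(p_error)

def error_probability(q: float) -> float:
    """Invert the formula: quality value Q back to error probability."""
    return 10.0 ** (-q / 10.0)

# An error probability of 1/1,000 corresponds to Q = 30,
# and Q = 20 corresponds to a 1% chance the call is wrong.
print(phred_quality(0.001))
print(error_probability(20))
```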
To understand the effect this has on the accuracy of assembled DNA sequences (see the next section for more on assembly), consider the possible scenarios where coverage ranges from 1x to 8x. Assuming that each read containing a particular genome position is independent (again, not a perfect assumption, but a reasonable approximation), the probability of making an error in a basecall can be calculated by multiplying the error probabilities in each of the reads. Thus at 1x coverage, a base reported with a quality value of 20 has a 1% probability of being wrong. In contrast, if that same position has 8x coverage, if each of the eight reads has a quality value of 30, and if all the reads agree on the identity of the base, then the probability of error is only 10^-24. (This calculation assumes that the probabilities of error for each base in each read are independent, an admittedly over-simplistic assumption.) At this level of accuracy, we would expect to see no errors at all in any genome, even a large mammalian genome.
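The multiplication of per-read error probabilities can be sketched as follows (under the same independence assumption the text flags as over-simplistic; the function name is illustrative):

```python
def combined_error(qualities):
    """Probability that an agreed-upon basecall is wrong at a position
    covered by several reads, assuming all reads agree on the base and
    their errors are independent.

    Each quality value Q corresponds to an error probability 10^(-Q/10);
    the combined error is the product over all covering reads.
    """
    p = 1.0
    for q in qualities:
        p *= 10.0 ** (-q / 10.0)
    return p

print(combined_error([20]))      # 1x coverage at Q20: 1% error
print(combined_error([30] * 8))  # 8x coverage at Q30: about 10^-24
```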
Thus, using quality values and some straightforward assumptions, we can compute the accuracy of any base in a genome sequence. Using similar principles, we can compute the likelihood that a difference between two genomes is real. The statistics behind this computation were first explained in the study comparing the B. anthracis used in the October 2001 attacks to the reference Ames strain.3 That study also showed how to extend the statistical model to compute a confidence in a VNTR: if any of the "extra" bases in the longer of two VNTR regions is correct, then there is a genuine difference between the strains being compared.
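The VNTR logic described above inverts the usual question: the length difference is spurious only if every one of the extra basecalls is an error. A sketch, assuming independent errors as before (the function name is illustrative, not from Read et al.3):

```python
def vntr_confidence(extra_base_qualities):
    """Confidence that a VNTR length difference between two strains is
    real. The longer repeat is genuine if ANY of its extra basecalls is
    correct, so compute the probability that ALL of them are errors
    (product of 10^(-Q/10), assuming independence) and subtract from 1.
    """
    p_all_wrong = 1.0
    for q in extra_base_qualities:
        p_all_wrong *= 10.0 ** (-q / 10.0)
    return 1.0 - p_all_wrong

# Three extra bases, each called at only Q20, still give a
# one-in-a-million chance that the length difference is spurious.
print(vntr_confidence([20, 20, 20]))
```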
Note that for forensic purposes, if an SNP or VNTR appears to have an unacceptably high probability of error (using the calculations described above and in Read et al.3), then one can reduce the probability by retesting. As long as the original sample material is still available, a basecall can be reconfirmed by resequencing the region in question. Each additional sequence produces a basecall along with a probability of error, and this probability can be multiplied by the previous value to produce a new, usually lower likelihood of error. (If the resequencing gives a different answer, though, then the likelihood of error will increase rather than decrease.)
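The retesting rule, for the case where the new basecall agrees with the old one, is again a product of error probabilities (a sketch under the text's independence assumption; the disagreement case needs a fuller model and is not implemented here):

```python
def confirmed_error(p_values):
    """Combined error probability after repeated, AGREEING basecalls of
    the same position: for the call to be wrong, every independent test
    must be wrong, so multiply the per-test error probabilities.
    """
    p = 1.0
    for v in p_values:
        p *= v
    return p

# An original Q20 call (1% error) confirmed by a Q30 retest (0.1% error)
# drops the combined error probability to one in 100,000.
print(confirmed_error([0.01, 0.001]))
```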
The problem of basecalling accuracy is still an area of active research, and more precise estimates of accuracy may emerge as programs are refined further. For example, some recent SNP analyses have shown that if the quality values around a given nucleotide are uniformly high, then one can confidently revise a quality value upward.6 Perhaps the most direct way to improve accuracy is to treat every SNP as a potentially miscalled base and reanalyze the raw chromatogram data from each underlying read. AutoEditor, a system that does exactly this, was released in mid-2003 and was shown to correct 85% of the miscalled bases in a large collection of over 25 complete genomes.7 This dramatically reduces the potential set of SNPs in a genome sequence, and allows scientists to focus more effectively on genuine polymorphisms.