Software For Data Analysis

A discussion of the full range of bioinformatic tools available for sequence analysis, data mining, and genomics is beyond the scope of this chapter. Comments here will be limited to software for base calling in automated sequencers. In some respects, the discussion is necessarily brief, as the algorithms used by manufacturers of automated sequence analyzers are proprietary. However, Green and colleagues have developed base calling software, phred, that is nonproprietary, is well described in the literature, and is the software of choice for many genomics centers and clinical laboratories (80). This software is freely available to academic users.

Computer analysis is required to convert the image of the gel to a base sequence. This analysis has four components: lane tracking, lane profiling, trace processing, and base calling. First, the lanes are tracked; that is, a line (straight or curved) is fitted to the actual path of the DNA fragments so that peaks can be assigned to a given sample. With automated slab gel instrumentation, this can be far from a trivial task, because inhomogeneities in either the gel or the electric field can cause the path of the DNA fragments to deviate considerably from a straight line. The introduction of 96-well slab gels made this problem particularly difficult, as the lanes were so narrow and tightly spaced that it was often difficult to determine which trace went with which sample. However, the introduction of CE-based instrumentation has completely eliminated the lane-tracking problem. Second, after the boundaries of the lane have been identified, the thousands of signals from the detector are summed to produce a lane profile, or trace. Third, trace processing involves deconvolution and smoothing of the signals, noise reduction, and correction for differential dye effects on fragment mobility. The last step is base calling, in which the processed trace is converted into a sequence of bases.

The phred software uses a four-step procedure for calling bases. The first step is to localize, in each lane, the ideal position of the peaks. Although peak positions should in principle be predictable with a high degree of certainty, many factors can cause one or more peaks to exhibit aberrant mobility. One of these factors is the compression artifact, in which the sequencing fragments are imperfectly denatured and the 3' end of the single-stranded fragment folds back onto itself, giving a hairpin loop that has higher electrophoretic mobility than perfectly denatured fragments. Fragments with high G+C content tend to give more compressions than A+T-rich fragments. The use of dye-terminator chemistry seems to decrease (but not eliminate) compression artifacts, presumably as a result of steric hindrance to intramolecular base-pairing by the bulky dye moieties (81). Other factors that can cause deviation from ideal peak spacing include polymerase stops, typically caused by secondary structure of the template during the sequencing reaction, and noise introduced by variation in elongation or labeling resulting from degraded or impure reagents.

The second task that the phred software accomplishes is to identify all of the peaks in the electropherogram. Third, the existing peaks are mapped onto the idealized pattern. During this step, some peaks will not be assigned. Finally, any unassigned peaks that appear to represent a base (using a rules-based algorithm) are inserted into the sequence.
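The peak-identification and peak-mapping steps can be illustrated with a deliberately simplified sketch. This is not the phred algorithm, whose details are given in the original literature (80). The function names, the local-maximum peak test, the greedy nearest-neighbor matching, and the `tolerance` parameter are all assumptions made for illustration only.

```python
def find_peaks(trace, min_height=0.0):
    """Return indices of local maxima in a single-channel trace.

    Hypothetical helper: a point counts as a peak if it is higher than its
    left neighbor, at least as high as its right neighbor, and above
    min_height.
    """
    return [
        i
        for i in range(1, len(trace) - 1)
        if trace[i] > trace[i - 1]
        and trace[i] >= trace[i + 1]
        and trace[i] > min_height
    ]


def map_peaks_to_ideal(observed, ideal, tolerance):
    """Greedily assign observed peaks to idealized peak positions.

    For each ideal position, the nearest still-unused observed peak within
    `tolerance` samples is assigned to it. Observed peaks that match no
    ideal position are returned separately as unassigned.
    """
    assigned = {}
    used = set()
    for pos in ideal:
        candidates = [
            p for p in observed if p not in used and abs(p - pos) <= tolerance
        ]
        if candidates:
            best = min(candidates, key=lambda p: abs(p - pos))
            assigned[pos] = best
            used.add(best)
    unassigned = [p for p in observed if p not in used]
    return assigned, unassigned


# Toy trace with three peaks, at sample indices 2, 6, and 10.
trace = [0, 1, 5, 1, 0, 2, 7, 2, 0, 0, 3, 0]
observed_peaks = find_peaks(trace)

# Assumed idealized positions; the third lies too far from any observed peak.
assigned, unassigned = map_peaks_to_ideal(observed_peaks, [2, 6, 14], tolerance=2)
```

In this toy case the peak at index 10 is left unassigned. In the procedure described above, such a peak would then be examined by the rules-based step to decide whether it represents a base.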

The accuracy of the phred software was compared to the ABI base calling algorithm by comparing base calls from both pieces of software and deriving the error rates for each vs the known reference sequences. phred was found to be superior at all places in the electropherogram (short fragments and long) (80).

Clearly, the question of how to evaluate base calling accuracy is a crucial one. One method, used by Ewing et al. (80) to validate the accuracy of the phred software, is to compare the results from any given sequencing run to a reference sequence or a "finished sequence." The disadvantage of this method is that the accuracy measurements might not be available in real time. For example, in de novo genome sequencing, it might be some time between the availability of one or two sequencing runs covering a given area and the availability of a finished sequence, defined as a consensus DNA sequence with several-fold depth of coverage. In the clinical laboratory, the areas typically sequenced are well known and the question is whether an observed sequence variation is the result of a potential disease-causing mutation or a sequencing error. Comparison of a newly generated sequence from a patient to a reference sequence is not particularly helpful in making this distinction. Another potential method to estimate errors is to include one or more control sequences in the analysis. Certainly, this approach is virtually mandatory in the clinical laboratory, but it must be kept in mind that error profiles can differ between sample and controls. Green and colleagues have addressed this critical issue by incorporating direct estimates of the probability of base-calling errors into the phred software (82). A review of the statistical methods for accomplishing this task is beyond the scope of this chapter and the interested reader is referred to the original literature (80,82).
Briefly, four parameters from the trace files were found to be most effective for discriminating errors from correct base calls: peak spacing, the ratio of the amplitude of the largest uncalled peak to the smallest called peak in a window of seven bases, the ratio of the amplitude of the largest uncalled peak to the smallest called peak in a window of three bases, and the peak resolution. Each base is given a quality score, q, which is defined as $q = -10 \log_{10}(p)$,

where p is the estimated error probability for that particular base call. Thus, a base call with a 1 in 100 probability of being incorrect would have a quality score of 20, and a quality score of 40 would indicate a probability of error of 1 in 10,000. Although no consensus has appeared, it would probably be advisable to limit interpretation of clinical sequencing data to regions with quality scores greater than 30.
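The relationship between quality score and error probability can be made concrete with a short sketch. The function names below are hypothetical and are not part of phred. The default threshold of 30 follows the suggestion in the text, and the region-finding helper is only one possible way to apply such a cutoff.

```python
import math


def quality_from_error(p):
    """Convert an estimated error probability p into a quality score q."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def error_from_quality(q):
    """Invert the definition: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


def high_quality_regions(qualities, threshold=30):
    """Return half-open (start, end) index ranges in which every base has
    a quality score strictly greater than threshold."""
    regions = []
    start = None
    for i, q in enumerate(qualities):
        if q > threshold:
            if start is None:
                start = i  # a new high-quality run begins here
        elif start is not None:
            regions.append((start, i))  # the run ended at the previous base
            start = None
    if start is not None:
        regions.append((start, len(qualities)))
    return regions


q20 = quality_from_error(0.01)       # the 1-in-100 example from the text
p40 = error_from_quality(40)         # the quality-40 example from the text
regions = high_quality_regions([10, 35, 40, 30, 31, 32])
```

Because the cutoff is strict, a base with a score of exactly 30 ends a region, so the example list yields two separate runs.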

The error rate prediction of phred was validated by Richterich (83) using data from six different large-scale sequencing projects. Not only did he find that the error predictions agreed extraordinarily well with the error rates observed by comparing individual runs to the finished sequence, but, remarkably, the algorithm was insensitive to a number of procedural variations, including dye-terminator vs dye-primer chemistry, DNA preparation methods, types of fluorescent tag, and different sequencing enzymes.
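The idea behind this kind of validation can be sketched as follows: group base calls by their predicted quality score, count mismatches against the finished sequence in each group, and convert the observed error rate back into a quality score for comparison. This is a simplified illustration, not the procedure used in (83). It assumes the read and the finished sequence are already aligned position by position with substitutions only, and the function name and `bin_width` parameter are hypothetical.

```python
import math
from collections import defaultdict


def observed_quality_by_bin(calls, reference, qualities, bin_width=10):
    """Compare predicted quality with the observed error rate.

    calls, reference: equal-length, pre-aligned base strings (assumption).
    qualities: one integer predicted quality score per called base.
    Returns {bin_start: (n_bases, n_errors, observed_q)}, where observed_q
    is None when no errors were seen in that bin.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for call, ref, q in zip(calls, reference, qualities):
        bin_start = (q // bin_width) * bin_width
        totals[bin_start] += 1
        if call != ref:
            errors[bin_start] += 1

    result = {}
    for bin_start in sorted(totals):
        n = totals[bin_start]
        e = errors[bin_start]
        # Same definition as the quality score, applied to the observed rate.
        observed_q = -10.0 * math.log10(e / n) if e > 0 else None
        result[bin_start] = (n, e, observed_q)
    return result


# Toy data: ten bases all predicted at quality 20, one of them miscalled.
summary = observed_quality_by_bin("ACGTACGTAC", "ACGTACGTAA", [20] * 10)
```

With so few bases the observed value is only a rough estimate; a meaningful comparison requires many calls per bin, which is why large-scale project data are suited to this kind of validation.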
