Data quality

Sequence data can also contain errors. Current sequencing technology does not produce reads that are 100% correct all of the time, nor is the data of consistent quality throughout all parts of the read. The data from the beginnings and ends of reads are the most prone to error. Several factors contribute to this: towards the end of a read the signal is weaker, because the incorporation of ddNTPs during the sequencing reaction follows a geometric distribution, and because the longer fragments take longer to move through the gel they are more diffuse than shorter fragments and more difficult to call. Variable quality is not, however, restricted to the ends of reads; secondary structure effects or compressions of peaks in GC-rich regions can occasionally be poorly resolved in the middle of a sequence. As a result, the base-caller may not identify the correct base, or the correct number of bases, and this makes sequencing data 'noisy'.

Noisy data affects assembly programs when they try to find overlaps between fragments, as single-base differences can prevent otherwise identical sequences from being recognised as overlapping (true overlaps). Assembly programs therefore need to tolerate a low level of error in the raw input data. This is a delicate balance between identifying true overlaps in the presence of sequencing errors and not increasing the number of repeat-induced overlaps; the strategy works where the sequence divergence between repeat copies is greater than the sequencing error rate. Error-free sequence data would greatly reduce the complexity of the algorithms required to assemble a genome and, because fewer reads would be required, would make the cost of sequencing much lower. The level of error in sequence data is estimated to be approximately 3% (Hill et al., 2000).
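The trade-off can be made concrete with a small sketch. The Python fragment below is purely illustrative and is not taken from any real assembler: the example reads, the 20-base minimum overlap and the 5% mismatch tolerance are assumptions chosen for demonstration. It accepts a suffix-to-prefix overlap between two reads only when the fraction of mismatching bases stays below the tolerance, so a tolerance set just above the sequencing error rate can still reject overlaps between diverged repeat copies.

    def best_overlap(read_a, read_b, min_len=20, max_mismatch_rate=0.05):
        """Longest acceptable overlap of read_a's suffix with read_b's prefix.

        Returns (overlap_length, mismatches), or None if no overlap of at
        least min_len bases has a mismatch rate within the tolerance."""
        max_len = min(len(read_a), len(read_b))
        for length in range(max_len, min_len - 1, -1):
            suffix = read_a[-length:]
            prefix = read_b[:length]
            mismatches = sum(1 for a, b in zip(suffix, prefix) if a != b)
            if mismatches / length <= max_mismatch_rate:
                return length, mismatches
        return None

    # A single simulated sequencing error does not break the true 23-base overlap.
    left = "ACGTTGCATCGGATCCGTAGCTAGCTAAGGCC"
    right = "CGGATCCGTAGCTAGCTAAAGCCTTGACA"
    print(best_overlap(left, right))   # -> (23, 1)

Real assemblers use far more sophisticated and efficient overlap detection, but the same balance between tolerating sequencing errors and avoiding repeat-induced overlaps applies.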

To reflect the fact that data quality is not the same in all parts of a sequence read, Green and colleagues (Ewing et al., 1998; Ewing and Green, 1998) developed a system in which a quantitative measure of the quality of each base call is assigned. This method estimates the error probability for each base call and is implemented in the program Phred. Since its introduction in 1998, Phred quality values have become the industry standard, and it remains the most widely used method of base-call error estimation. Phred quality values are logarithmic and are defined as Q = -10 log10(P), where Q is the quality value and P the estimated error probability of the base call. Quality values range from 0 to 99, with higher values indicating higher quality. A Phred quality value of 10 corresponds to a 90% chance of accuracy, or a 1 in 10 chance that the base call is incorrect. The standard generally required is a Phred quality of 20 (Q20), which corresponds to a predicted error rate of 1%, or a 99% chance of the base call being correct. For the public human genome project, the aim was that the overall sequence quality conform to the Bermuda Standard of being accurate to at least 1 bp in 10,000, or Q40 (Bentley, 1996).
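The arithmetic behind these values can be checked with a few lines of code; the snippet below simply evaluates the published formula and its inverse (the function names are illustrative, and this is of course not Phred itself).

    import math

    def phred_quality(error_prob):
        """Quality value Q for a given base-call error probability P."""
        return -10 * math.log10(error_prob)

    def error_probability(quality):
        """Estimated error probability P for a given quality value Q."""
        return 10 ** (-quality / 10)

    print(phred_quality(0.1))     # 10.0 -> a 1 in 10 chance the call is wrong
    print(phred_quality(0.01))    # 20.0 -> the commonly required Q20 standard
    print(error_probability(40))  # 0.0001 -> 1 error in 10,000 bases (Q40)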

Many assembly programs can use base quality information directly in the assembly process. Those that do not can still use quality values to trim low-quality data from the reads automatically, for example by clipping a read at the point where the average quality value over a sliding window of fixed width (e.g. 50 bases) falls below a predetermined threshold. Applied Biosystems have also introduced a base-caller that assigns Phred-like quality values to bases (the KB base-caller), although details of the algorithm have not been released (http://docs.appliedbiosystems.com/pebiodocs/04362968.pdf).
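A minimal sketch of this sliding-window trimming idea is given below; the window width, quality threshold and function are illustrative assumptions rather than the behaviour of any particular package.

    def trim_read(bases, qualities, window=50, min_mean_quality=20):
        """Clip the read at the start of the first window whose mean quality
        drops below the threshold; returns the trimmed bases and qualities."""
        for start in range(len(bases) - window + 1):
            window_quals = qualities[start:start + window]
            if sum(window_quals) / window < min_mean_quality:
                return bases[:start], qualities[:start]
        return bases, qualities

    # Example with a short synthetic read and a 5-base window for readability.
    bases = "ACGTACGTACGTACGT"
    quals = [30, 32, 31, 33, 30, 29, 31, 30, 28, 12, 10, 9, 8, 7, 6, 5]
    print(trim_read(bases, quals, window=5, min_mean_quality=20))
    # -> ('ACGTACG', [30, 32, 31, 33, 30, 29, 31])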
