Can SAM sequencing aid SBS array short-read sequencing?

While the GS20 pyrosequencer typically generates more than 25 Mb of bases called with a Phred quality score of 20 or higher, as in the sequencing of the 0.58 Mb genome of M. genitalium (Margulies et al., 2005), the substantially lower accuracy of the short individual (feature) reads demands higher coverage for assembly of the entire genome. Homopolymer regions define one of the limits of the (semi-)quantitative pyrosequencing process (Ehn et al., 2004; Ronaghi, 2001); runs of up to at least seven nucleotides can be assessed accurately. However, alignment may require the insertion of additional "padding" bases into different copies of individual element reads during de novo sequence assembly.
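As a rough check on the depth these figures imply, the sketch below computes average fold coverage as total called bases divided by genome size. The function name is ours, chosen for illustration, and the inputs are the approximate values quoted above (the source says "more than 25 Mb", so the result is a lower bound).

```python
def fold_coverage(total_bases_mb: float, genome_size_mb: float) -> float:
    """Average sequencing depth: total called bases divided by genome length."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_mb / genome_size_mb


# Approximate figures quoted in the text: 25 Mb of Q20+ bases, 0.58 Mb genome.
depth = fold_coverage(25.0, 0.58)
print(f"average depth: about {depth:.0f}-fold")
```

On these numbers the run corresponds to roughly 43-fold average coverage of the M. genitalium genome.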

For the simulation of SAM sequencing we assumed perfect sequencing accuracy for every read of our mutant copies, across all coverage. While this does not account for the 99.4% raw base-read accuracy reported for actual pyrosequencing output on PicoTiterPlates (Margulies et al., 2005), our simulation is intended to explore the advantages of SAM sequencing in overcoming the regions of low sequence complexity and the homopolymer tracts that occur in eukaryote genomes. Given that our level of introduced mutation is 10%, introducing additional random errors such as insertions, deletions and homopolymer tract errors into our raw base reads at the corresponding rate of 0.6% would have little effect on the accuracy of SAM sequence reconstruction.

Church and colleagues have also reported extensive sequencing of prokaryote genomes using a PCR-colony sequencing method (Mitra et al., 2003) and a related sequencing-by-ligation approach (Shendure et al., 2005) with reads of 26 bp per amplicon. Of their ground-breaking resequencing of the entire ~3.3 Mb genome of an E. coli strain, they note that "despite 10 times coverage in terms of raw basepairs, only 91.4% of the genome had at least one time coverage", and further that "substantial fluctuations in coverage were observed due to the stochasticity of the RCA step of library construction." While their data indicate that the vast majority of the problem is due to insufficient formation of closed circles during library construction prior to RCA, we would suggest that some residual problems could be due to sequence biases, as well as to some "very difficult" sequence that larger library sizes and oversampling may not fully address.
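To put the quoted coverage figures in context, the sketch below uses the idealized Poisson (Lander-Waterman style) model, in which reads land uniformly at random and the fraction of positions covered at least once at average depth c is 1 - exp(-c). This model is our assumption for illustration, not an analysis from the cited study, and it ignores read-length edge effects; the function names are likewise our own.

```python
import math


def expected_fraction_covered(depth: float) -> float:
    """Fraction of positions covered at least once under uniform random sampling."""
    return 1.0 - math.exp(-depth)


def effective_depth(fraction_covered: float) -> float:
    """Uniform-sampling depth that would yield the observed covered fraction."""
    if not 0.0 <= fraction_covered < 1.0:
        raise ValueError("fraction_covered must be in [0, 1)")
    return -math.log(1.0 - fraction_covered)


# Figures quoted in the text: 10-fold raw coverage, 91.4% of the genome covered.
ideal = expected_fraction_covered(10.0)
effective = effective_depth(0.914)
print(f"ideal fraction covered at 10x: {ideal:.5f}")
print(f"uniform depth matching 91.4% coverage: {effective:.2f}x")
```

Under this idealized model, 10-fold coverage would leave fewer than 0.01% of positions unsampled, whereas 91.4% coverage is what uniform sampling would give at only about 2.5-fold depth. A gap of that size is consistent with the strongly non-uniform sampling described in the quotation.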
