Modeling with 25 bp long reads

Table 2 shows data from simulations of the same BRCA1 genomic region, but using only 25 bp reads. Ten mutant sequences were simulated: each base had a probability of 0.1 of mutating, in which case it was replaced by one of the other three possible bases. All 25-mers were determined for every sequence, that is, for the original sequence and for each of the 10 mutants. The 25-mers from each mutant were initially assembled separately, both from those of the other mutants and from those of the original sequence.
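The simulation setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce Table 2: the function names `mutate` and `kmers`, the random seed, and the 1,000-base random stand-in for the BRCA1 region are all hypothetical.

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Return a copy of seq in which each base mutates with probability `rate`.

    A mutated base is replaced by one of the other three bases, chosen
    uniformly at random (uniform choice is an assumption of this sketch).
    """
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def kmers(seq, k=25):
    """Return every overlapping k-mer of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for the target region: a random 1,000-base sequence.
original = "".join(rng.choice(BASES) for _ in range(1000))

# Ten mutants, each base mutating with probability 0.1, as in the text.
mutants = [mutate(original, 0.1, rng) for _ in range(10)]

# 25-mers are kept separately per sequence, mirroring the separate
# first-stage assembly of each mutant and of the original.
reads_per_sequence = [kmers(s, 25) for s in [original] + mutants]
```

A sequence of length n yields n - 24 overlapping 25-mers, so each of the eleven read sets here contains 976 reads.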

The mini-contigs assembled in stage I were then assembled together in a second stage using phrap. Each mini-contig gets an equal "vote" towards the resulting consensus, regardless of whether it came from a mutant sequence or from the original. This process successfully reconstructed the 100,000- and 200,000-base fragments, and elements of the 400,000-base fragment, with an error of 1-3 bases per 10,000. For example, 25-mers from the 400,000-base fragment were assembled into a single contig of length 399,959, which differed from the original in only 86 bases, an error proportion of approximately 0.0002. It is likely that even longer fragments could be reconstructed using more than 10 mutants. The ability to assemble from reads as short as 25 bp is remarkable, considering that the target is known to contain numerous mono- and di-nucleotide repeats of various lengths and is difficult to assemble using conventional assembly tools (without SAM), even from full-length reads (~500 bp).
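To make the equal-vote idea and the quoted error proportion concrete, here is a minimal Python sketch. It is not how phrap works internally: it assumes the sequences are already aligned and of equal length, and simply takes a per-column majority with every sequence weighted equally. The names `majority_consensus` and `error_proportion`, and the toy sequences, are hypothetical.

```python
from collections import Counter

def majority_consensus(aligned):
    """Per-column majority vote over pre-aligned, equal-length sequences.

    Every sequence counts once, whether it came from a mutant or from the
    original. Alignment is assumed to have been done already.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be non-empty in number and equal in length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def error_proportion(mismatches, length):
    """Fraction of consensus bases that differ from the original."""
    return mismatches / length

# Toy example: isolated mutations in individual sequences are outvoted.
toy = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGTAC"]
consensus = majority_consensus(toy)

# Figures quoted in the text: 86 differing bases in a 399,959-base contig.
rate = error_proportion(86, 399_959)
print(consensus, f"{rate:.6f}")
```

With the figures from the text, the proportion comes to about 0.000215, or roughly 2 bases per 10,000, which agrees with the stated range of 1-3 per 10,000.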
