Human and other genomic DNA sequences, each up to ~600 kb in length, were used. The chromosomal fragments analysed are listed in Table 1.
The study sequences were chosen to represent different genomic composition characteristics. They include human genomic regions (IHGSC, 2004) with numerous unique gene sequences interspersed with discrete regions of low-complexity repeats (HUM-BRCA1, -HLA), a gene-poor human genomic region with numerous mixed low-complexity repeats (HUM-subcentro), and a mosquito genomic region of strong base bias (AT-rich) containing numerous short and mixed homopolymer tracts (MOS1). For comparison, the entire 0.58 Mb bacterial genome of M. genitalium was also analysed. Some of the confounding sequence motifs identified within these test elements are shown in the third column of Table 1, while the highest number of contigs obtained on reassembly of the full 0.6 Mb fragments, using 10 mutants and 10-fold coverage, is shown in the last column. Importantly, these analyses suggest that SAM sequencing and SAM assembly allow these complex genomic elements to be assembled into either a single contig or a small number of contigs, with low sequence error, from just a few independent mutants.
These motifs illustrate problematic sequences that are known or expected to prevent sequence assembly from short-read data. They include homopolymers, regions of simple repeats, and strongly base-biased elements with multiple short homopolymer regions and other regions of sequence similarity. The motifs are not exhaustive; they are meant to represent some of the diverse sequences that would pose a significant challenge to conventional short-read sequencing technologies (Margulies et al., 2005; Shendure et al., 2005).
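As an illustration of how two of the motif classes named above (homopolymer tracts and AT-rich base bias) might be flagged in a test sequence, the following is a minimal Python sketch. It is not the procedure used to compile Table 1; the function names and the `min_len` threshold of 8 are assumptions chosen only for the example.

```python
import re


def homopolymer_runs(seq, min_len=8):
    """Return (start, base, length) for each single-base run of at least min_len.

    min_len is an illustrative threshold (assumption), not a value from the study.
    """
    seq = seq.upper()
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(seq)]


def at_fraction(seq):
    """Return the fraction of A/T among unambiguous bases (0.0 if there are none)."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        return 0.0
    return sum(b in "AT" for b in counted) / len(counted)


if __name__ == "__main__":
    # Toy sequence, not taken from any of the fragments in Table 1.
    example = "ACGT" + "A" * 10 + "CGT"
    print(homopolymer_runs(example))
    print(round(at_fraction(example), 3))
```

Applied in sliding windows along a fragment, such counts would give a rough picture of where homopolymer tracts and base-biased stretches cluster. Simple repeats with longer units would need a separate pattern.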