Fast Scalable Sequence Comparison Programs To Locate Unique Sequence

Next, we must quickly determine which portions of a bacterial or viral genome consensus gestalt (Fig. 15.5) are potentially unique compared to all other microbial genomes already sequenced. We were aware that suffix trees are the most efficient data structure for comparing two strings to determine matches. Stefan Kurtz's Vmatch programs had just been developed and provided the most scalable implementation of suffix trees to date. He generously provided new functions that allowed us to compare a target viral or bacterial genome against a 900+ Mb library of all publicly available microbial genomes in just a few minutes. This compares to the 2-4 days required for naïve approaches (e.g., BLAST, which could involve parsing enormous output files). Note that this approach masks out portions of the target genome that are definitely not unique (Fig. 15.6). Further effort is needed (described below) to determine the

agtaatcgt.ATCATTGTACCCACTTGAGAAGTTAGTAAC.TTTTTTCTATTATAATCTT
GTATCCGTAAGATACATTACTACACATAGGAATTCCCTGAT.GAGCAATGTTTAAATACA
TCTACATTTGGAT..TGATGTAGTTGCGTATTTCTCTACAATATTAATACCATTTTTGCA
ACTATTTATTTCTAGACCTTTTG.GATTAGTAATCTCAATAATTCTACGTCAATATTATC
AGATTCTATATATTCGAATATATCAAAGTCATTGATATTTTTATAATTGGTAGAAGACAA
TAATGACACCACAACATCAGTTTTGATATTCTTATTTTT.TTGGTAACGTATACATTTAA
TGAATTTTCATTACGTTCTACCAATGATTGTGCACTGCAGGCATCAAAAGTTTTACAACT
ATCATAAAGCATACTATCCTATCC

FIGURE 15.5 An analysis of a multiple sequence alignment of several pathogen target genomes yields a "consensus gestalt" view. Positions that do not agree in all of the input genomes are represented by a dot. Runs of conserved positions above a threshold size are shown in capital letters; runs below that size are in lower case.
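
The encoding described in the caption can be illustrated with a small Python sketch. It is not the authors' program: the function name `consensus_gestalt`, the `min_run` parameter, the choice to treat a run of exactly the threshold length as "above threshold", and the treatment of alignment gaps (`-`) as disagreement are all assumptions made for illustration.

```python
def consensus_gestalt(aligned, min_run=10):
    """Build a consensus-gestalt string from equal-length aligned sequences.

    Hypothetical helper: columns where all inputs agree keep their base,
    all other columns become '.'. Conserved runs of at least `min_run`
    bases are upper case, shorter runs lower case (threshold handling
    is an assumption, not taken from the source).
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    # Step 1: one symbol per alignment column (base if conserved, else '.').
    columns = []
    for col in zip(*aligned):
        bases = {b.upper() for b in col}
        if len(bases) == 1 and "-" not in bases:
            columns.append(next(iter(bases)))
        else:
            columns.append(".")  # disagreement or gap (assumption)

    # Step 2: set the case of each conserved run by its length.
    out = []
    i = 0
    while i < length:
        if columns[i] == ".":
            out.append(".")
            i += 1
            continue
        j = i
        while j < length and columns[j] != ".":
            j += 1
        run = "".join(columns[i:j])
        out.append(run if j - i >= min_run else run.lower())
        i = j
    return "".join(out)


if __name__ == "__main__":
    print(consensus_gestalt(["ACGTACGT", "ACGAACGT"], min_run=4))
```

With a threshold of 4, the two toy sequences above differ only at the fourth position, so the three conserved bases before it fall below the threshold and the four after it meet it.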

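The masking step described in the text (removing target regions that also occur in the library of sequenced genomes) can be sketched on a toy scale as follows. This sketch deliberately swaps in a plain k-mer set for the suffix-tree matching that Vmatch performs, so it shows only the idea of masking shared regions, not the scalable method. The function names, the `k` value, and the `N` mask character are hypothetical, and only the forward strand is checked.

```python
def build_kmer_index(library, k):
    """Collect every length-k substring of the library sequences in a set."""
    index = set()
    for seq in library:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index.add(seq[i:i + k])
    return index


def mask_shared(target, library, k=18, mask_char="N"):
    """Mask every target position covered by a k-mer also found in the library.

    Illustrative stand-in for suffix-tree matching: an exact k-mer lookup,
    forward strand only. Unmasked positions are only *potentially* unique.
    """
    index = build_kmer_index(library, k)
    target = target.upper()
    masked = [False] * len(target)
    for i in range(len(target) - k + 1):
        if target[i:i + k] in index:
            # Mark the whole matching window as not unique.
            for j in range(i, i + k):
                masked[j] = True
    return "".join(mask_char if m else b for b, m in zip(target, masked))


if __name__ == "__main__":
    print(mask_shared("AAACCCGGGTTT", ["TTCCCGGGAA"], k=6))
```

A set of k-mers grows with the size of the library, which is one reason a compact suffix-based index is the better fit for a library of 900+ Mb; the sketch is meant for reading, not for data at that scale.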
