New Generation Sequence Assembly Tools

The tools that are used to reconstruct the genomes of more complex organisms have made use of developments in the field of graph theory. A graph is a collection of edges and vertices. As we have seen, the greedy algorithm used an essentially overlap-layout-consensus method to assemble the genome, although in some programs these three steps are not always distinct or explicit. The overlap-layout-consensus method has largely been used for almost all of the assemblers written to cope with the assembly of complex mammalian genomes, such as Arachne (Batzoglou et al., 2002) and the Celera Assembler (Huson et al., 2001). Improvements to the overlap-layout-consensus method that allow them to be used for complex genomes are due to the use of graph theoretic methods. The formulation of the sequence assembly problem as a graph theoretic problem was initially proposed by Peltola et al. (1984), and restated more recently by Kececioglu and Myers (1995). The three steps were defined as overlap, where the reads which fit the overlap criteria are found; layout, where the layout between the reads that induces a layout consistent with the orientation of the reads; and consensus, where a multiple alignment of the consistent reads in the layout is performed.

These methods construct a graph at the layout stage from the overlaps that have been found. They regard the sequencing reads as nodes in the graph, which are connected by edges. In the context of sequence assembly, an edge represents an overlap between two reads. The layout algorithm therefore has to find the optimal sub-graph of true overlaps through the graph. During construction of the graph, redundant information is removed, by discarding reads that are fully contained in other reads and by removing transitive edges. For example if the paths x —y —z and x —z exist, then the information in the x —y—z path can be removed as it is redundant. The layout algorithm determines the relative location and orientation of reads with respect to each other. A Hamiltonian path through the graph visits each vertex exactly once although it does not necessarily pass through all of the edges. A contig is therefore a path in the graph. This is referred to as a Hamiltonian cycle, if the same start and end node are used, and computing this is well-known to be NP-complete.

There is a need therefore for the use of heuristics when determining the layout. Following the determination of the layout, a multiple sequence alignment of all of the overlapping reads in the contig is constructed as directed by the layout graph. The consensus sequence is constructed from the multiple alignment by majority voting, and this consensus infers the genome sequence. Some assemblers use quality values and therefore can take a more sophisticated approach to the consensus. Assembly tools constructed using this approach include the Celera Assembler and Arachne.

0 0

Post a comment