An alternative to the overlap-layout-consensus approach was undertaken by Pevzner et al. (2001) in the construction of their assembly tool, Euler. They make use of another graph theoretic approach to determine the layout that was first proposed for use in sequencing by hybridization (SBH). SBH is a technique that involves creating an array of all possible k-tuples, as it is not known in advance which subset will be useful, and then hybridizing labeled DNA from the organism of interest to the array. The reverse complement of the sequence represented by the k-tuples will be bound. Hybridization signals are detected and the sequence can be reconstructed by finding the DNA sequence whose base sequence corresponds to the k-tuples that were detected by hybridization. The problem of reconstructing the DNA sequence is represented in graph theory terms by constructing a graph on (k—1)-tuples, so that each of the edges has a length of 1. The probe sequences are the nodes in the graph. Two nodes are connected if the (k— 1) suffix of one node matches the (k— 1) prefix of another node and implicitly represent all of the k-tuples.
Similar to the overlap-layout-consensus method the problem of finding a path through the graph remains. However, instead of using a Hamiltonian path, Pevzner (1989) proposed applying an Eulerian path to the SBH. An Eulerian path passes along each edge in the graph once, but may pass through each node more than once. This problem is not NP-complete and the algorithms that exist to compute this are in linear time. The Eulerian path method, as applied to SBH, was limited largely due to the fact that there was a physical limit to size of DNA that could be sequenced by this technique as the array must contain all k-tuples, and to the limits of hybridization as there may be errors in the hybridization data due to non-specific binding. In addition, there may be more than one solution to an Eulerian path and repetitive DNA can lead to confusing tangles.
Idury and Waterman (1995) extended the use of Eulerian paths to a hybrid SBH/shotgun sequencing approach. They showed that shotgun data could be broken down into overlapping k-tuples, simulating a SBH experiment. As an array of probes no longer needs to be constructed, there is no size limitation to the DNA sequence which can be reconstructed by this method. The set of all k-tuples in the shotgun data corresponds to all of the k-tuples in the original DNA, and therefore solving the path through the graph is equivalent to assembling the DNA. As noted, shotgun sequencing data does contain errors and therefore these can lead to spurious edges which greatly complicate the graph. As the initial Idury/Waterman algorithm did not scale-up well, this work was extended by Pevzner et al. (2001) and was implemented as the assembly program Euler.
The implementation of Euler introduced a number of measures to address some of the practical issues with DNA sequencing with the potential to confuse the program. In a novel approach, they introduced an error correction stage as the first stage to correct for errors in the raw sequence data, to prevent errors from excessively tangling the graph. This step is usually the last step in the consensus generation of the more conventional assembly programs. They also combined the information in the original reads to help guide the assembly. This leads to the formulation of the Eulerian SuperPath problem; finding the Eulerian path consistent with all read-paths, a read-path being a path in the graph that corresponds to a sequence read.
Was this article helpful?