Genome sequence assembly is a complex topic about which many articles have been written; for the purposes of this chapter a brief summary will suffice. Assembly is the process of collecting all the individual sequence reads from a WGS project and reconstructing the correct sequence of the original DNA molecule(s). For a bacterial project, this typically involves assembling tens of thousands of reads into a circular chromosome and possibly additional plasmids. Prior to the sequencing of H. influenzae by WGS, many scientists were skeptical that an entire genome could be correctly assembled from individual reads. Once that project, which used the TIGR Assembler,10 demonstrated feasibility, it quickly became standard practice to assemble microbial genomes from WGS data. A few years later, WGS assembly was shown to be feasible for animal genomes with the assembly of the 130 Mbp Drosophila melanogaster genome,11 and soon thereafter the human genome.12
Although genome assemblers routinely put together bacterial-sized projects, they do not always accomplish this feat without errors. These errors are of particular concern for microbial forensics, as they may lead to mistakes in generating genetic markers. The leading genome assemblers today are designed to handle both small and large genomes, and they employ a variety of sophisticated techniques to avoid getting confused by repetitive sequences (the main cause of trouble when assembling a genome). Although the technology has been quite successful at scaling up to larger genomes, it still has not overcome the problem of assembly errors, in part because the raw data always contain errors (but also because nearly identical repeats can be very hard to assemble correctly).
FIGURE 15.2 A newly assembled contig maps to two distinct, noncontiguous locations on a reference genome. The correctness of the new contig can be determined by PCR across the juncture. (See color insert.)
By interpreting assemblies with care, however, one can nearly always avoid being misled by these errors. The errors generally fall into two broad categories: (1) mis-assemblies and (2) consensus errors. Let us assume for the sake of discussion that we are comparing a complete reference genome to an assembly of a related strain that has been sequenced to 8x coverage, and that the 8x assembly has produced 100 large pieces of DNA, called contigs. Suppose that one of these contigs appears to consist of two disjoint pieces of the reference genome that are joined together, as shown in Figure 15.2. Such a join will appear to be a significant difference between the genome in question and the reference genome, but it may equally be a mis-assembly. There is no good method for evaluating the probability that the assembler has mis-assembled a contig; instead, the best means of testing this difference is to use PCR across the junction in the newly assembled contig. If PCR confirms the result, then the assembly is good and the genetic difference is real. If it does not, the join is most likely an artifact of the assembler.
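The situation in Figure 15.2 can be flagged computationally before any PCR is run. The following is a minimal sketch, assuming the contig has already been aligned to the reference by some external tool and that the alignment is available as a list of segments. The `Segment` type, the `find_suspect_junctions` function, and the `max_gap` threshold are hypothetical names chosen for illustration. The sketch ignores strand and the circularity of the chromosome, and it only nominates junctions for PCR testing; it cannot say whether a join is a mis-assembly or a real rearrangement.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One aligned block of a contig (hypothetical record layout)."""
    contig_start: int  # 0-based start on the contig
    contig_end: int    # exclusive end on the contig
    ref_start: int     # 0-based start on the reference
    ref_end: int       # exclusive end on the reference


def find_suspect_junctions(segments: List[Segment], max_gap: int = 1000) -> List[int]:
    """Return contig coordinates where neighbouring aligned segments
    land more than max_gap bases apart on the reference."""
    # Walk the segments in the order they occur along the contig.
    ordered = sorted(segments, key=lambda s: s.contig_start)
    junctions = []
    for left, right in zip(ordered, ordered[1:]):
        # Distance on the reference between the end of one block
        # and the start of the next.
        ref_jump = abs(right.ref_start - left.ref_end)
        if ref_jump > max_gap:
            junctions.append(left.contig_end)
    return junctions


# A contig whose two halves map about 745 kbp apart on the reference.
chimeric = [
    Segment(0, 5000, 100_000, 105_000),
    Segment(5000, 9000, 850_000, 854_000),
]
print(find_suspect_junctions(chimeric))
```

Each coordinate returned is a candidate site for designing PCR primers on either side of the junction.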
The second type of assembly error is a consensus error. For each position in a contig, the assembler's output includes a consensus base that represents what the majority of the underlying reads show at that position. If the consensus base does not match the true genome sequence at a given position, then it is a consensus error. This can happen for several reasons, most commonly because of basecalling errors (see the section above). Assemblies are based on a multiple alignment of the individual reads; as pointed out above, when the coverage is deep, the probability of a basecalling error in the consensus is vanishingly small. In an incomplete assembly, however, the ends of each contig are typically covered by only one or two sequencing reads, and it is here that basecalling errors are most likely. Thus when comparing an assembly to a reference genome, differences that occur near the ends of contigs are likely to represent simple basecalling errors. These can be checked by additional sequencing.
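The majority-vote consensus described above, together with the per-position read depth, can be sketched as follows. This is an illustrative simplification, not how any particular assembler works: it assumes the reads have already been placed in a multiple alignment as equal-length strings, with a gap character marking positions a read does not cover, and it ignores quality values. The function names and the `min_depth` threshold are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def consensus_with_depth(aligned_reads: List[str], gap: str = "-") -> Tuple[str, List[int]]:
    """Return the majority-vote consensus and the read depth per column."""
    if not aligned_reads:
        return "", []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    depths = []
    for col in range(length):
        # Only reads that actually cover this column get a vote.
        bases = [read[col] for read in aligned_reads if read[col] != gap]
        depths.append(len(bases))
        if bases:
            consensus.append(Counter(bases).most_common(1)[0][0])
        else:
            consensus.append("N")  # no read covers this column
    return "".join(consensus), depths


def low_coverage_positions(depths: List[int], min_depth: int = 3) -> List[int]:
    """Return positions whose depth is below min_depth."""
    return [i for i, depth in enumerate(depths) if depth < min_depth]


reads = [
    "ACGTAC--",
    "ACGTACGT",
    "ACCTAC--",  # single basecalling error at position 2
    "--GTACGT",
]
consensus, depths = consensus_with_depth(reads)
print(consensus, depths, low_coverage_positions(depths))
```

In this toy alignment the lone error at position 2 is outvoted by three other reads, while the last two columns are covered by only two reads each, which is the situation found near contig ends where a discrepancy with the reference deserves additional sequencing.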
The other common type of consensus error is a polymorphism, which is much more interesting in forensic or genotyping studies. If the source DNA was not extracted from a clonal organism, but instead came from two or more individuals, then the reads may contain differences that represent polymorphisms in the population. In this case, the coverage at a given position might contain a mixture of two different bases, e.g., C and T. Such polymorphic differences can be identified even without a reference genome by looking through the genome assemblies for correlated mismatches; i.e., positions where more than two reads differ from the consensus and where those reads agree with one another.
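The search for correlated mismatches can be sketched in the same style. The example below assumes the same simplified input as before (equal-length aligned read strings with a gap character for uncovered positions) and treats "more than two reads" as a threshold of three; the function name and the `min_reads` parameter are hypothetical, and a real pipeline would also weigh base quality values.

```python
from collections import Counter
from typing import Dict, List


def correlated_mismatches(aligned_reads: List[str], min_reads: int = 3, gap: str = "-") -> Dict[int, str]:
    """Return {position: alternate base} for columns where at least
    min_reads reads disagree with the majority base and agree with
    one another."""
    candidates: Dict[int, str] = {}
    if not aligned_reads:
        return candidates
    for col in range(len(aligned_reads[0])):
        counts = Counter(read[col] for read in aligned_reads if read[col] != gap)
        ranked = counts.most_common(2)
        if len(ranked) < 2:
            continue  # every covering read agrees at this column
        # ranked[0] is the consensus base; ranked[1] is the most
        # frequent base among the reads that disagree with it.
        alt_base, alt_count = ranked[1]
        if alt_count >= min_reads:
            candidates[col] = alt_base
    return candidates


reads = [
    "ACGCA",
    "ACGCA",
    "ACGCA",
    "ACGCA",
    "AGGCA",  # isolated mismatch at position 1: likely a basecalling error
    "ACGTA",
    "ACGTA",
    "ACGTA",  # three reads agree on T at position 3
]
print(correlated_mismatches(reads))
```

The isolated mismatch at position 1 is ignored, whereas position 3, where three reads agree on T against a consensus of C, is reported as a candidate polymorphism.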