Clone Library

a directed shotgun sequencing approach described below. Automated sequencers sequenced vast numbers of small fragments. Finally, a computer examined all the random segments of sequence for overlaps and assembled them into contigs and, ultimately, into complete chromosomal sequences. Even with the help of an STS map, this method relies on colossal amounts of computing power to cope with the millions of separate short sequences. Such computers were not available until recently, and the success of the Celera method is largely due to the rapid increase in computer power over the last few years.
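The overlap-and-merge idea can be illustrated with a toy example. The sketch below is a minimal greedy merger written for illustration only; it is not the algorithm used by Celera or any real assembler. It assumes error-free reads, ignores the reverse strand and repeats, and the function names and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap until none remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return reads  # whatever is left are the contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTGA"])
print(contigs)
```

Every round compares all pairs of reads, so the work grows roughly with the square of the number of reads. This is one way to see why millions of reads demand so much computing power.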

Now that the human genome has been sequenced, the next major task is to identify all the genes and elucidate their functions. The claim that once we have sequenced ourselves we will understand all human disease is doubtful. We have known for years the complete gene sequences of several viruses, including HIV, yet no complete cure has emerged. Deducing the function of a protein given only the DNA sequence that encodes it is hazardous at best. Although DNA sequences are very useful, a great deal of experimental work must also be done to understand inherited defects.

Very large genomes may be broken up into large fragments which are cloned using YACs or BACs and then sequenced by the shotgun approach.


Assembling a Genome from Large Cloned Contigs

The first step in assembling a large genome from cloned contigs is to clone fragments that are as large as possible while still being manageable in vitro. The genome is broken up into large fragments that are cloned to give a library of overlapping pieces. These fragments are inserted into high-capacity vectors such as yeast artificial chromosomes (YACs) or bacterial artificial chromosomes (BACs), which may carry several hundred kb of DNA (see Ch. 22). Each fragment is then analyzed separately by shotgun sequencing, ideally giving a complete contiguous sequence for that clone. Overall, this approach yields a set of what are effectively large "cloned contigs". It may be used for eukaryotic genomes, which contain vastly more DNA than bacterial genomes, and it was the approach taken by the official government-sponsored Human Genome Project.

After sequencing the individual cloned fragments, the next problem is to identify the overlapping regions among the clones. As described above for closing the gaps left after shotgun sequencing, hybridization and PCR methods may be used to identify the overlapping fragments. Other methods include screening the cloned contigs for similarities in restriction profiles and repetitive elements.

However, comparing large numbers of clones by such methods is slow and tedious when tackling very large genomes. The human genome of 3 x 10^9 bp would give 10,000 cloned fragments of 300,000 bp (the maximum size for BAC/YAC inserts), even without the 6- to 8-fold redundancy necessary to ensure complete coverage. To sequence each of these large clones, approximately 600 reactions would have to be performed, assuming about 500 base pairs per reaction. If 80,000 clones were constructed (8-fold redundancy), about 48 million different sequences would have to be assembled into the complete genome.
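The arithmetic behind these figures can be checked in a few lines. The sketch below only restates the numbers quoted in the text; the variable names are chosen here for illustration.

```python
GENOME_BP = 3 * 10**9      # approximate size of the human genome
INSERT_BP = 300_000        # insert size per BAC/YAC clone, as quoted in the text
READ_BP = 500              # approximate bases obtained per sequencing reaction

clones_single_pass = GENOME_BP // INSERT_BP   # clones needed with no redundancy
reads_per_clone = INSERT_BP // READ_BP        # reactions to cover one clone once

library_clones = 80_000                       # library size quoted in the text
redundancy = library_clones / clones_single_pass
total_reads = library_clones * reads_per_clone

print(clones_single_pass, reads_per_clone, redundancy, total_reads)
```

Note that 600 reactions cover each clone only once. Any redundancy within a clone would multiply the total further.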

Using a map of sequence tagged sites greatly helps in the computations needed to assemble a genome from shotgun sequencing.

Assembling a Genome by Directed Shotgun Sequencing

For random shotgun sequencing of the human genome it would be necessary to sequence about 70 million small stretches of about 500 bp. This would give the necessary redundancy for 99.8% coverage of 3 x 10^9 bp. With 100 automatic sequencers each generating 1000 sequences per day, this could be done in 700 days, i.e. roughly two years.
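The throughput estimate follows directly from the quoted numbers, as the short calculation below shows. The fold-redundancy figure is derived here from those numbers and is not stated in the text; the variable names are illustrative.

```python
GENOME_BP = 3 * 10**9
READS = 70_000_000                 # number of short sequences required
READ_BP = 500                      # approximate length of each sequence
SEQUENCERS = 100
READS_PER_SEQUENCER_PER_DAY = 1000

total_bases = READS * READ_BP                  # raw bases sequenced
fold_redundancy = total_bases / GENOME_BP      # average times each base is read
days = READS / (SEQUENCERS * READS_PER_SEQUENCER_PER_DAY)
years = days / 365

print(f"{fold_redundancy:.1f}-fold redundancy, {days:.0f} days (~{years:.1f} years)")
```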

The critical issue is the assembly of these sequences into contigs and ultimately into complete chromosomes. The vast amount of computer time plus the uncertainties due to repetitive sequences make this approach prohibitive as it stands. However, if an STS map is used as a framework, then assembly becomes possible. In fact, this is the successful approach taken by Craig Venter of Celera Genomics to complete the human genome ahead of schedule.

Sixty million sequences were generated from a library of fragments averaging 2 kb inserted into a multicopy plasmid vector. Another ten million sequences were from another library of larger pieces (10 kb) in a different vector. The 10 kb library is especially important in dealing with repeated sequences, since most of these are around 5 kb in size (or smaller) and can be entirely contained within a 10 kb fragment. The 2 kb library would not contain the entire repetitive region, and correctly aligning the numerous clones for this region would be impossible. Using the end-sequences of the 10 kb fragments allows the assembly process to avoid making incorrect overlaps between two identical repetitive sequences that are actually in different locations (Fig. 24.22).
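The reasoning about insert size and repeats can be expressed as two simple checks. This is a sketch under stated assumptions and not the procedure actually used for the human genome: it assumes each end read is about 500 bp, and the function names and the 20% size tolerance are hypothetical.

```python
READ_BP = 500  # approximate read length quoted earlier in the text


def ends_in_unique_sequence(insert_bp, repeat_bp, read_bp=READ_BP):
    """True if an insert can contain the whole repeat and still leave
    both end reads in the unique sequence flanking it."""
    return insert_bp >= repeat_bp + 2 * read_bp


def placement_is_consistent(left_start, right_end, insert_bp, tolerance=0.2):
    """True if the two end reads of one clone map about one insert length apart.
    The tolerance is an assumed value for illustration."""
    span = right_end - left_start
    return abs(span - insert_bp) <= tolerance * insert_bp


print(ends_in_unique_sequence(10_000, 5_000))   # 10 kb insert, 5 kb repeat
print(ends_in_unique_sequence(2_000, 5_000))    # 2 kb insert, 5 kb repeat
print(placement_is_consistent(1_000, 11_200, 10_000))
print(placement_is_consistent(1_000, 250_000, 10_000))
```

The second check captures why end-sequences help. If joining two contigs through a repeat would put the two ends of a 10 kb clone hundreds of kb apart, the proposed overlap is rejected.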

Survey of the Human Genome

The sequence of the human genome still has a few gaps. These are mostly in the highly repetitive and highly condensed heterochromatin, which contains few coding

