Bioinformatics Tools

3.1. SEQUENCE DATABASES Sequence analysis is perhaps the most widely recognized component of the HGP and of bioinformatics. There are three major international databases of publicly available nucleotide sequence information. GenBank is maintained by the National Center for Biotechnology Information in the United States, EMBL is maintained by the European Bioinformatics Institute in the United Kingdom, and the DNA Data Bank of Japan is maintained in Japan. The Human Genome Browser Gateway (www.genome. is another useful site for genome sequence information. The databases contain identical sequence information; however, each site formats the information differently. Each website also offers tools for sequence and structure prediction on a submitted nucleotide sequence.

3.2. MAP VIEWER Map Viewer ( mapview/), available through the National Center for Biotechnology Information, is a tool for visualizing an organism's complete genome, integrated maps for each chromosome, and sequence data for a genomic region of interest. The data are presented graphically, with the available sequence data displayed alongside cytogenetic, genetic, physical, and radiation hybrid maps.

3.3. BASIC LOCAL ALIGNMENT SEARCH TOOL The BLAST (basic local alignment search tool; www.ncbi.nlm.nih.gov/BLAST/) search program is a well-known tool designed to identify nucleotide and protein sequences that are similar to a query. Its alignment algorithms compare experimentally derived sequences to one or more databases of either nucleotides or amino acids. Homology is inferred from a match of greater than 30% and implies that the sequences might be related by divergence from a common ancestor or might share common functional aspects. BLAST 2.0 is the currently available version of the program. Because it allows gaps (insertions and deletions) within an alignment, which lets it accommodate features such as introns in a DNA sequence, it is also referred to as gapped BLAST.
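The idea of a gapped local alignment and a percent match can be illustrated with a small sketch. This is not BLAST's own algorithm: BLAST uses a faster heuristic search, whereas the code below uses the exact Smith-Waterman dynamic-programming method that such heuristics approximate. The function names, the scoring values (match, mismatch, gap), and the example sequences are assumptions chosen for illustration only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below 0 (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both residues are identical."""
    if not aligned_a:
        return 0.0
    matches = sum(1 for p, q in zip(aligned_a, aligned_b) if p == q and p != "-")
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Hypothetical query and subject sequences.
    s, x, y = smith_waterman("ACGTTGCA", "ACGTGCA")
    print(s, x, y, percent_identity(x, y))
```

Under these assumed scores, a percent identity computed this way could be compared against a cutoff such as the 30% figure mentioned above, although real searches also weigh alignment length and statistical significance.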

3.4. EXPRESSED SEQUENCE TAGS DATABASE AND UNIGENE One aspect of the HGP is the generation of an expressed sequence tag (EST) database composed of large numbers of low-quality cDNA sequences. ESTs are short pieces of DNA sequence generated by sequencing the 3' and/or 5' ends of expressed genes. Although their quality is low, the sheer volume of ESTs produced makes them a good source for identifying new gene sequences. EST sequences are deposited into dbEST or UniGene. dbEST was created to organize, store, and provide access to the large volume of EST data that has accumulated. Redundancy in dbEST is common because many ESTs match the same gene. UniGene automatically partitions GenBank mRNA and EST sequences into a nonredundant set of gene-oriented clusters, each composed of sequences that share an identical 3' untranslated region (7). A variety of species are represented in the database: human, rat, mouse, cow, zebrafish, clawed frog, fruit fly, mosquito, wheat, rice, barley, maize, and cress.
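The clustering idea can be sketched in a few lines. The code below is a toy illustration only and is not UniGene's actual procedure: it simply groups ESTs whose final k bases are identical, as a stand-in for a shared 3' untranslated region. The function name, the value of k, and the example sequences are all hypothetical.

```python
from collections import defaultdict


def cluster_by_3prime_end(ests, k=12):
    """Group EST names whose last k bases are identical.

    ests: dict mapping an EST name to its sequence (5' to 3').
    Sequences shorter than k are keyed on their full length.
    Returns a list of clusters, each a sorted list of EST names,
    in order of first appearance.
    """
    clusters = defaultdict(list)
    for name, seq in ests.items():
        key = seq[-k:].upper()  # assumed proxy for the 3' end of the transcript
        clusters[key].append(name)
    return [sorted(names) for names in clusters.values()]


if __name__ == "__main__":
    # Hypothetical ESTs: est1 and est2 end in the same 12 bases.
    example = {
        "est1": "AAACCCGGGTTTACGTACGTACGT",
        "est2": "GGGTTTACGTACGTACGT",
        "est3": "TTTTTTTTTTTTCCCCCCCCCCCC",
    }
    print(cluster_by_3prime_end(example))
```

Collapsing redundant ESTs into one cluster per presumed gene in this way shows why a clustered set is much smaller than the raw collection of deposited sequences. A production system would need to tolerate sequencing errors rather than require exact matches.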

