The unique gene identifier

The need for a standard gene nomenclature, motivated by the desiderata above, has generated widespread international efforts. There are several candidate nomenclatures; here we summarize the most popular and useful of those used for microarray experiments. Optimally, a nomenclature should capture the polymorphisms, splice variants, and mutations of each gene as they appear in one species and in its orthologs in every other sequenced genome. In practice, microarray repositories use several nomenclatures to reference the measurements made in gene expression experiments. For a more comprehensive and current compendium of such nomenclatures, the reader is referred to issue no. 1 of Nucleic Acids Research of any recent year.

LocusLink Perhaps the most extensively curated and possibly the broadest nomenclature is maintained through the LocusLink database [145]. LocusLink identifiers are stable, in that they are designed not to change over time, and are genome independent. Other databases hold a much larger set of expressed sequence tags (see glossary, p. 277) or putative genes, but for many of these either the complete coding region or the function of the gene product is unknown. The curated nature of LocusLink is important because, unlike GenBank records, which are under the full editorial control of only those who submitted the sequence, the sequences and annotations in LocusLink are subject to an extended and distributed editorial process.

Broadly speaking, a LocusLink record comprises three components. The first is a stable unique identifier called the locus ID. The second is the reference sequence, obtained from the RefSeq[15] database, which in turn is seeded by a GenBank source coding sequence. This reference sequence remains provisional until the full length is obtained according to the editorial process. The third comprises all the annotations: the completeness indicator for the gene (indicating that the entire coding component of the gene is complete), accession numbers providing links to a protein record, links to OMIM,[16] links to other databases such as UniGene and dbSNP, and an extremely useful English-language summary that describes the gene. Of note, LocusLink is involved in an active collaboration with the Human Gene Nomenclature Committee (HGNC)[17] to provide an overall unifying international standard. Most recently, LocusLink has also begun to maintain links to functional annotation databases such as those supplied by Proteome[18] and to publicly maintained ontologies such as GO.
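The three components just described can be pictured as a simple record type. The following Python sketch is only an illustration under stated assumptions: the class name, field names, and the placeholder cross-reference value are hypothetical and do not reproduce the actual LocusLink schema. The locus ID 3939 (LDHA) is taken from the enzyme example later in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LocusRecord:
    """Hypothetical in-memory view of one LocusLink-style record."""

    # Component 1: the stable unique identifier (locus ID).
    locus_id: int
    # Component 2: the reference sequence, if one has been assigned.
    refseq_accession: Optional[str] = None
    # The reference sequence stays provisional until the full length is obtained.
    refseq_provisional: bool = True
    # Component 3: annotations.
    cds_complete: bool = False          # completeness indicator
    summary: str = ""                   # English-language description
    cross_refs: Dict[str, List[str]] = field(default_factory=dict)

    def add_cross_ref(self, database: str, identifier: str) -> None:
        """Record a link to another database (e.g., UniGene, OMIM, dbSNP)."""
        self.cross_refs.setdefault(database, []).append(identifier)


# Example: the locus ID is from the text; everything else is a placeholder.
record = LocusRecord(locus_id=3939, summary="lactate dehydrogenase A (LDHA)")
record.add_cross_ref("UniGene", "PLACEHOLDER_CLUSTER_ID")
print(record.locus_id, sorted(record.cross_refs))
```

Keeping the locus ID as the only required field mirrors the point made above: the identifier is the stable anchor, while the sequence and annotations attached to it may still be provisional or incomplete.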

It appears likely that LocusLink will continue to be the locus for much annotation of organismal genomes and the provider of stable identifiers for their genes.

UniGene Although LocusLink provides the most stable identifier for known genes, or at least for those for which at least one function is presumed, at this early stage of functional genomics the majority of genes have no known function.[19] For this reason, other related databases are useful. Chief among these may be the UniGene system at the NCBI [28]. UniGene attempts to be comprehensive: to catalog all the characterized genes as well as hundreds of thousands of uncharacterized and novel expressed sequence tags, while reducing the redundancy inherent in those individual catalogs. Each UniGene record contains a set of sequences with GenBank identifiers that have been clustered because they are presumed to correspond to a single specific gene. Because this clustering is done mostly automatically, and the clustering procedures have evolved over time and been augmented with more data, UniGene clusters, unlike LocusLink identifiers, are not guaranteed to be stable. The strength of UniGene is its breadth: it includes not only the GenBank accession numbers for the sequences in each cluster but also known alternative splicing variants. Currently, UniGene covers human, rat, cow, mouse, and zebrafish. Each UniGene record has a unique ID called a cluster ID. Because these identifiers are not stable, a cluster ID can be retired for a variety of reasons:

• The sequences that congregate to a cluster might be found to be contaminated.

• Two or more clusters may be joined to form a single cluster in which case the original cluster IDs are retired.

• A cluster may be found to be composed of more than one gene and therefore must be split into two or more clusters. Thus the original cluster ID has to be retired and new ones generated for the subclusters.

Because of this terminological instability, most reports using the UniGene nomenclature also report the "build" (i.e., version) number from which the cluster IDs were drawn.
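The three retirement cases, and the habit of quoting a build number, can be sketched in a few lines of Python. This is an illustrative sketch only: the cluster IDs, the build number, the retirement table, and the function names are all hypothetical, and UniGene's own bookkeeping is not described in the text.

```python
from typing import Dict, List

# Hypothetical retirement table: retired cluster ID -> successor cluster IDs.
#   []            : withdrawn (e.g., sequences found to be contaminated)
#   one successor : joined with other clusters into a single cluster
#   several       : split because the cluster contained more than one gene
RETIRED: Dict[str, List[str]] = {
    "Hs.1001": [],
    "Hs.1002": ["Hs.2000"],
    "Hs.1003": ["Hs.2000"],
    "Hs.1004": ["Hs.2001", "Hs.2002"],
}


def resolve(cluster_id: str, retired: Dict[str, List[str]]) -> List[str]:
    """Follow retirements until only current (non-retired) cluster IDs remain."""
    current: List[str] = []
    pending = [cluster_id]
    seen = set()  # guards against cycles in a malformed table
    while pending:
        cid = pending.pop(0)
        if cid in seen:
            continue
        seen.add(cid)
        if cid in retired:
            pending.extend(retired[cid])
        else:
            current.append(cid)
    return current


def format_reference(cluster_id: str, build: int) -> str:
    """Quote a cluster ID together with the build it was drawn from."""
    return f"{cluster_id} (UniGene build {build})"


print(resolve("Hs.1004", RETIRED))
print(format_reference("Hs.1004", build=150))
```

An empty result from `resolve` means the cluster was withdrawn with no successor, which is why a reported cluster ID is only interpretable alongside its build number.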

GenBank An overview such as this would be incomplete without mentioning GenBank, the public database of all known nucleotide and protein sequences from which many of the other nomenclatures are derived [22]. At the time of this writing, GenBank contains over 12 million annotated sequences and is growing at the rate of over 5 million sequences per year. GenBank sequences are grouped into divisions that roughly correspond to taxonomy (e.g., bacteria and viruses) and to specific projects, such as ESTs, genome sequencing, and sequence-tagged sites.

Each GenBank record contains taxonomic information regarding the organism from which the sequence was obtained, bibliographical references, and biological features found within the sequence. Each record has an accession number that is stable and unique; each entry is also assigned a unique GI number, which changes whenever the sequence is revised so that revisions may be tracked. Beyond this, very little additional annotation is provided, given the low-commitment nature of this database.
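The difference between the stable accession number and the revision-specific GI number can be illustrated with a small Python sketch. The accession string and GI values below are invented placeholders, and the class is a hypothetical illustration rather than a GenBank data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SequenceEntry:
    """Hypothetical model: one stable accession, one GI per sequence revision."""

    accession: str                                   # never changes
    gi_history: List[int] = field(default_factory=list)

    def revise(self, new_gi: int) -> None:
        """Register a revised sequence under a new GI number."""
        if new_gi in self.gi_history:
            raise ValueError("a GI number identifies exactly one revision")
        self.gi_history.append(new_gi)

    @property
    def current_gi(self) -> Optional[int]:
        """GI number of the latest revision, or None if none is registered."""
        return self.gi_history[-1] if self.gi_history else None


entry = SequenceEntry(accession="PLACEHOLDER_ACCESSION")  # placeholder value
entry.revise(101)   # original submission (invented GI)
entry.revise(102)   # revised sequence (invented GI)
print(entry.accession, entry.current_gi, entry.gi_history)
```

Citing the accession alone points at the record in whatever state it is in today; citing the GI number pins down exactly which revision of the sequence was used.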

GenBank is a member of the International Nucleotide Sequence Database Collaboration, together with the DNA Data Bank of Japan and the European Molecular Biology Laboratory. The collaboration defines common ontologies for taxonomy and features, allowing daily exchange and translation of data among these databases.

IMAGE The Integrated Molecular Analysis of Genomes and their Expression (IMAGE) consortium was formed to create, collect, and characterize cDNA libraries from a variety of different tissues and was initially spearheaded by the Washington University Genome Sequencing Center and Merck & Co. [118]. Several microarray databases refer directly to the IMAGE clone identifiers rather than to any one of the more precisely characterized sequences described above. Those IMAGE cDNA clones that have been sequenced are available in the dbEST database maintained at the NCBI.[20] Perhaps the best feature of the IMAGE clone identifiers is that they refer to physical instances of these clones on the master plates from which the clones are generated. In that sense, the clone identifiers are stable. However, because some of the clones obtained from these master plates have been known to suffer contamination or have not been sequence-validated, the sequence actually associated with an identifier is not equally stable.

The Institute for Genomic Research (TIGR) TIGR also maintains a fairly extensive list of nomenclatures for human and other organisms. TIGR has its own consensus curation and assembly process, which it uses to maintain databases for a number of organisms, including human, mouse, rat, zebrafish, rice, Arabidopsis, soy, tomato, maize, potato, cattle, and many others. The nomenclature specific to TIGR is the tentative consensus sequence, or TC. TC clusters are created by a process similar to that of UniGene, by assembling ESTs into clusters based on sequence. However, alternate splice forms are built into separate TC clusters, whereas UniGene keeps them in the same cluster. Many manufacturers of microarrays refer to TIGR accession numbers or TC identifiers in their descriptions of the probe sets or cDNA probes on their microarrays. TIGR itself uses TC identifiers to create a useful set of mappings of orthologs across all the species indexed in the TIGR database.

Enzyme nomenclature database The Nomenclature Committee of the International Union of Biochemistry and Molecular Biology maintains a catalog of known enzymes and their functions. Enzyme nomenclature is based on the reactions that are catalyzed, and not on the genes that encode the enzymes or on the protein structures of those enzymes. Thus, simple one-to-one translations between Enzyme Commission (EC) numbers and GenBank or LocusLink accession numbers are not possible. For example, lactate dehydrogenase (LDH), which is described by a single EC number, catalyzes the conversion of lactic acid to pyruvate. However, that enzyme is made of four subunits, each independently chosen from two types, LDHA (LocusLink 3939) and LDHB (LocusLink 3945). Note that these genes have different LocusLink entries in other species, whereas the enzyme function keeps the same EC accession number.

This nomenclature can be searched using the Expert Protein Analysis System (ExPASy) maintained by the Swiss Institute of Bioinformatics[21].
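The LDH example shows why enzyme entries and gene identifiers cannot be translated one-to-one. A minimal Python sketch of that relationship follows; the key `"LDH"` is a stand-in label for the enzyme's EC entry (the EC number itself is not reproduced here), the function names are hypothetical, and only the two LocusLink IDs come from the text.

```python
from typing import Dict, Set

# One enzyme function (keyed by a stand-in label, not a real EC number)
# mapped to the human loci that encode its subunits.
ENZYME_TO_LOCI: Dict[str, Set[int]] = {
    "LDH": {3939, 3945},  # LDHA and LDHB, as given in the text
}


def invert(mapping: Dict[str, Set[int]]) -> Dict[int, Set[str]]:
    """Build the reverse lookup: locus ID -> enzyme entries it contributes to."""
    inverted: Dict[int, Set[str]] = {}
    for enzyme, loci in mapping.items():
        for locus in loci:
            inverted.setdefault(locus, set()).add(enzyme)
    return inverted


def is_one_to_one(mapping: Dict[str, Set[int]]) -> bool:
    """True only if every enzyme has one locus and every locus one enzyme."""
    forward_ok = all(len(loci) == 1 for loci in mapping.values())
    backward_ok = all(len(enz) == 1 for enz in invert(mapping).values())
    return forward_ok and backward_ok


print(invert(ENZYME_TO_LOCI))
print(is_one_to_one(ENZYME_TO_LOCI))
```

Storing the relationship as sets in both directions makes the ambiguity explicit: a lookup from the enzyme entry returns every contributing locus rather than silently choosing one.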

Other identifiers Other identifiers used on microarrays include the Expressed Gene Anatomy Database identifiers[22] and the Munich Information Center for Protein Sequences Yeast Genome Database[23]. In addition, other identifiers of which a functional genomicist should be aware include Ensembl[24], the Genome Database[25], the Mouse Genome Database[26], the Saccharomyces Genome Database[27], and the Database of Transcribed Sequences[28].


[13] That is, they are kept as confidential proprietary information.

[19] And a larger majority only have a few of their functions known.
