The issuing of consistent names to genes is an apparently mundane chore. Unfortunately, at this time, it may matter more to the immediate progress of our functional genomics efforts than shared data models and ontologies do. Why might this be?
First, we want to make sure that we compare the behavior of the same genes and their properties across all experiments. When a microarray manufacturer reports that the nth element of its array interrogates the expression of a particular gene, the following questions should be asked: Which particular polymorphism(s) of that gene, which alternative splicing product(s), and which other gene(s) are being reported on?
Consequently, the nomenclature (i.e., naming scheme) used to identify gene expression measurements on microarrays has to be able to capture the increasingly polymorphic definition of genes. As we learn more about single nucleotide polymorphisms (SNPs), alternate splicing products, pseudogenes, frame-shift mutations, and other variations in just the primary structure of the genes and the proteins they produce, a nomenclature has to be able to distinguish between these various forms of the same gene and yet group them as belonging to the same coding gene. This goes beyond the UniGene effort which attempts to cluster known experimentally derived sequences into a cluster with a single consensus sequence, ideally corresponding to a single particular gene.
The task of finding a standardized and universal nomenclature for each gene is just the tip of the iceberg of a recurrent challenge in microarray technology: the task of linking a spot on a microarray to a specific gene or expressed sequence tag (EST). This is challenging in both mundane technical ways and in deep fundamental ways. The mundane challenge is that if one does not have the exact sequence for each microarray spot, one has, at the very least, to obtain the correct sequence that each spot on the microarray was engineered to represent. For instance, on a spotted cDNA microarray, one has to determine the gene sequence of the clone used for each spot. With an oligonucleotide microarray, one has to determine from the manufacturer what sequence each probe set was designed to correspond to. In cDNA microarrays, the probes are typically linked to an IMAGE (see the description of the IMAGE consortium on p. 245) sequence ID. These IDs are typically issued with the cDNA microarrays or with the probes that are spotted onto the microarray. Unfortunately, there has been a high error rate in many of the cDNA clones obtained from various manufacturers, so although the name mapping may be relatively straightforward, it may not actually represent what is being hybridized. Consequently, researchers may have to resequence a clone if they want to verify that the probe is assaying for the presence of the gene they think it is. The challenge is somewhat different with proprietary oligonucleotide microarrays, where the manufacturer's catalog states what gene each probe set has been engineered to hybridize to but the actual oligonucleotide sequences themselves are unknown [131]. This is problematic for several reasons.
• The gene sequence that the probe set may be engineered for may be a fusion gene, i.e., an abnormal composite of two different genes. Consequently, since we do not know which part of the fusion gene is actually being represented by the probe set, it is unclear whether that probe set measures just the presence of the fusion gene transcript or whether it can also measure the presence of either one of the constituent genes separately.
• Furthermore, at the time of this writing, some oligonucleotide probe sets are designed against large sections of the genome such as a cosmid (see glossary, p. 277). Knowing the genomic sequence does not provide very much information about which gene in particular is being expressed from that region of the genome. If the specific oligonucleotides were made public, then one could perform a BLAST search to find which specific gene in that region the oligonucleotides corresponded to.
• As the genome efforts in the human, mouse, and other species increase in accuracy and completeness, the mapping between the microarray oligonucleotides and the genes they correspond to is likely to improve. Specifically, as we gain more knowledge of polymorphic variants of genes and their alternate splicing products, the investigator will be able to verify which particular variant a probe set is measuring. However, in the absence of the oligonucleotide sequences themselves, researchers cannot take advantage of increased knowledge of the genomic sequences to provide a more accurate labeling of the intensities that they are measuring on the microarrays.
• A related problem occurs with oligonucleotide microarrays where sequences that were thought to be part of one gene at one time are found to be components of other genes as the UniGene (see p. 244) clustering becomes more refined. Practically, this means that in one generation of Affymetrix microarrays, a probe set is said to correspond to one gene, whereas in another generation it may correspond to two or more genes based on the new assignments of the particular GenBank subsequences to different genes. In the absence of knowledge of the exact oligonucleotide sequences, or in the absence of a version history of Affymetrix accession numbers, the analysis of oligonucleotide microarray data, particularly across generations of microarrays, becomes problematic.
• Occasionally, manufacturers of oligonucleotide sequences simply make mistakes in their design process and engineer oligonucleotides that do not interrogate the expression levels of the specified genes. The most recent and notorious example of such an error was in the design of Affymetrix oligonucleotide microarrays for murine systems, an error that went unreported for months. If these sequences had been made public, then the error would likely have been discovered much earlier.
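The cross-generation problem in the list above can be made concrete with a small comparison of two annotation releases. The following is only a minimal sketch: the function name `changed_assignments`, the probe set identifiers, and the gene labels are hypothetical placeholders, and it assumes each release is available as a simple mapping from probe set to the set of genes it is said to interrogate.

```python
def changed_assignments(old, new):
    """Return probe sets whose gene assignment differs between two releases.

    `old` and `new` map a probe set identifier to a set of gene labels.
    A probe set missing from a release is treated as mapping to no gene.
    """
    changes = {}
    for probe_set in sorted(set(old) | set(new)):
        before = frozenset(old.get(probe_set, ()))
        after = frozenset(new.get(probe_set, ()))
        if before != after:
            changes[probe_set] = (sorted(before), sorted(after))
    return changes


# Hypothetical annotation tables for two generations of the same array.
release_a = {
    "ps_001": {"GENE_A"},
    "ps_002": {"GENE_B"},
    "ps_003": {"GENE_C"},
}
release_b = {
    "ps_001": {"GENE_A"},
    "ps_002": {"GENE_B", "GENE_D"},  # cluster split: now maps to two genes
    # ps_003 no longer appears in the newer table
}

changes = changed_assignments(release_a, release_b)
for probe_set, (before, after) in changes.items():
    print(probe_set, before, "->", after)
```

Keeping each release's table and running a comparison of this kind is one way to approximate the version history whose absence the text describes.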
Pragmatically then, in the year 2002, some relatively laborious cross-indexing of terms across nomenclatures needs to be done, first to determine which gene(s) a particular spot on a microarray was engineered to assay, and second to compare results across different microarrays. In practice, this involves three steps. First, for each probe or probe set on a microarray, one looks up its accession number in a table provided by the microarray manufacturer, whether that be a tentative consensus (TC) number from The Institute for Genomic Research (TIGR), a GenBank accession number, a UniGene accession number, an IMAGE clone ID, or an identifier from some other accessioning scheme. Second, one finds the representative genetic sequence corresponding to that probe or probe set. Third, one runs a Basic Local Alignment Search Tool (BLAST) search against one of the standardized nomenclatures, such as LocusLink (see p. 243) if it is a gene with a known function, or UniGene if a more comprehensive search is desired. Because this lookup has to be done hundreds of times per analysis, owing to the large number of genes present, it requires one to write a script that submits the appropriate sequences to a BLAST program. The submission can be made over the Internet using the batch version of BLAST at the NCBI. For better performance, if one has sufficient storage space locally, the entire GenBank repository and the BLAST programs can be downloaded and run locally, using instructions at the NCBI. It is prudent to re-run these scripts on a regular basis, as the UniGene and LocusLink databases are frequently updated.
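The scripted lookup described above might begin along the following lines. This is a hedged sketch, not the authors' script: it only prepares FASTA batches and builds, without running, a command line for a locally installed BLAST. The command uses the options of the current BLAST+ `blastn` program (`-query`, `-db`, `-outfmt 6` for tabular output, `-out`), which postdates this text; the probe identifiers, sequences, file names, and the database name `local_db` are all hypothetical.

```python
import io


def to_fasta(records, line_width=60):
    """Format (identifier, sequence) pairs as FASTA text."""
    out = io.StringIO()
    for identifier, sequence in records:
        out.write(f">{identifier}\n")
        # Wrap each sequence at a fixed width, as is conventional for FASTA.
        for start in range(0, len(sequence), line_width):
            out.write(sequence[start:start + line_width] + "\n")
    return out.getvalue()


def batches(items, size):
    """Yield consecutive slices of `items` containing at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def blastn_command(query_path, database, output_path):
    """Build (but do not run) a local BLAST+ blastn command line."""
    return [
        "blastn",
        "-query", query_path,
        "-db", database,
        "-outfmt", "6",  # tabular output, one hit per line
        "-out", output_path,
    ]


# Hypothetical representative sequences, one per probe or probe set.
probe_sequences = [
    ("probe_1", "ACGT" * 20),
    ("probe_2", "GGCCTTAA" * 5),
    ("probe_3", "ATAT" * 10),
]

fasta_batches = [to_fasta(batch) for batch in batches(probe_sequences, 2)]
command = blastn_command("batch_0.fasta", "local_db", "batch_0.tsv")
print(" ".join(command))
```

Each batch would be written to its own file and the command passed to something like `subprocess.run`. Batching keeps any single submission small, which matters more for the Internet route than for a local installation.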
An alternative approach is to develop the infrastructure to track each microarray probe or probe set and automatically determine the latest meanings for each. We have developed such a system that automatically parses the latest tables converting GenBank to UniGene, and UniGene to LocusLink, as well as the content files for these databases. Using these tables, one can then connect each of these references to the original publicly available accession numbers provided by the microarray manufacturers. The advantages of this approach are that:
• One can immediately find official names and symbols for probes and probe sets, as well as known synonyms.
• One can take the opposite route and search for probes and probe sets by gene meaning or other characteristics. For instance, if one has prior information that a gene on human chromosome 4 is associated with a particular phenotype, one can optimize functional genomics analysis by restricting to only those genes known to be in the appropriate location. One could also restrict analysis by protein domain, such as restricting to only those genes containing a zinc finger or DNA binding domain.
• One can find and search by functional categorization, such as a term from GO or a disease from Online Mendelian Inheritance in Man (OMIM).
• Because UniGene is essentially formed by applying BLAST to GenBank entries, it spares one from performing those BLAST operations.
• The tables can be updated regularly and automatically.
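The table-chaining idea behind this approach can be sketched as a series of dictionary lookups. This is an illustration under stated assumptions, not the system described in the text: the four tables, every identifier in them, and the functions `annotate` and `probe_sets_on_chromosome` are hypothetical placeholders standing in for the parsed manufacturer, GenBank-to-UniGene, and UniGene-to-LocusLink tables.

```python
# Hypothetical lookup tables; real ones would be parsed from downloaded files.
probe_to_genbank = {"ps_001": "GB0001", "ps_002": "GB0002", "ps_003": "GB0003"}
genbank_to_unigene = {"GB0001": "Hs.1", "GB0002": "Hs.2", "GB0003": "Hs.2"}
unigene_to_locus = {"Hs.1": 101, "Hs.2": 102}
locus_info = {
    101: {"symbol": "GENE1", "chromosome": "4"},
    102: {"symbol": "GENE2", "chromosome": "7"},
}


def annotate(probe_set):
    """Follow probe set -> GenBank -> UniGene -> locus; None if any link is missing."""
    accession = probe_to_genbank.get(probe_set)
    cluster = genbank_to_unigene.get(accession)
    locus_id = unigene_to_locus.get(cluster)
    info = locus_info.get(locus_id)
    if info is None:
        return None
    return {
        "probe_set": probe_set,
        "genbank": accession,
        "unigene": cluster,
        "locus_id": locus_id,
        "symbol": info["symbol"],
        "chromosome": info["chromosome"],
    }


def probe_sets_on_chromosome(chromosome):
    """Take the opposite route: find probe sets by a property of their gene."""
    hits = []
    for probe_set in sorted(probe_to_genbank):
        annotation = annotate(probe_set)
        if annotation is not None and annotation["chromosome"] == chromosome:
            hits.append(probe_set)
    return hits


print(annotate("ps_002"))
print(probe_sets_on_chromosome("4"))
```

Because each step is a plain table lookup, refreshing the tables from the latest database releases updates every annotation without rerunning any sequence search.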
We maintain a website at http://www.unchip.org/ which uses these tables and allows the translation of a microarray-specific accession number (currently mostly Affymetrix) to one of the more globally used nomenclatures described below. A similar service is available for Affymetrix customers [141]. A commercial product offering similar but more broadly applicable functionality is called GeneSpider and may be available soon from Silicon Genetics.