Interpretation of microarray experiments is the task not only of the investigator performing the experiment but also of journal peer reviewers, the readers of those journals, and other interested parties. The scale and complexity of genomic data have challenged all involved to seek new methods of communication in order to take full advantage of its promise. An early realization was that print media is insufficient to allow full interpretation of these data-rich experiments. Most journals require that the primary data, as well as the associated experimental conditions and sample information, be made available electronically via the Web as part of the publication process. The provision of data alone has often proven insufficient to allow meaningful scientific evaluation, including reproduction of the authors' results. Without standards for reporting results, elements necessary to interpret the arrays might not be included in the online databases (10). For example, an author might publish only the raw image files without the final expression values or, conversely, publish expression values without the image files.

Recognizing that reporting standards were needed, a professional society, the Microarray and Gene Expression Data (MGED) Group, initiated such a project (102). Their initial work has produced a set of standards called the minimum information about a microarray experiment (MIAME) checklist (Table 3). The MIAME standards have been formulated to include the minimum information needed to interpret an array experiment and, as such, represent the information authors must provide upon publication in the scientific literature. There is evidence that these criteria are currently being implemented, although it remains to be seen if they will be widely adopted. The original MIAME criteria request a large amount of data per experiment without explicitly stating in what form it should be presented. In answer to formatting concerns, the MGED Group has followed up with additional recommendations called the Microarray and Gene Expression Markup Language (MAGE-ML) and Object Model (MAGE-OM).
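As a rough illustration of how a checklist of this kind might be enforced, the sketch below tests a submission record for the five section headings of Table 3. The dictionary layout, the key names, and the function `missing_miame_sections` are assumptions made for this example; they are not part of MIAME or MAGE-ML.

```python
# Section names follow the headings of the summarized MIAME checklist (Table 3).
# The key spellings and the dict-based record are hypothetical conventions.
MIAME_SECTIONS = (
    "experiment_design",
    "samples_extract_labeling",
    "hybridization_procedures",
    "measurement_data",
    "array_design",
)


def missing_miame_sections(submission):
    """Return the checklist sections that are absent or empty in a submission."""
    return [name for name in MIAME_SECTIONS if not submission.get(name)]


# Hypothetical submission that describes the design and the array
# but leaves the measurement data empty and omits two sections entirely.
submission = {
    "experiment_design": "normal vs diseased tissue, 12 hybridizations",
    "array_design": "commercial oligonucleotide array",
    "measurement_data": "",
}

print(missing_miame_sections(submission))
```

A check like this only confirms that each section is present; whether its content is adequate still requires a reviewer's judgment.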

It is not the microarray alone that requires standardization. The very feature measured by an array, the gene, is itself not a static concept. We discussed in Section 3.1 the need to link probe DNA sequence information to gene name, a process requiring standardized genomic databases. As gene sequencing began in earnest in the mid-1980s, it became clear that central clearinghouses of information were necessary. One such clearinghouse was created in 1988, when the United States Congress funded the National Center for Biotechnology Information (NCBI) as a division of the National Library of Medicine at the National Institutes of Health. A central task was to catalog and publish genetic information to allow researchers to speak a common language, referring to specific DNA sequences in a standardized way. Consider the challenge previously facing scientists when discussing alternate splices of the same gene, where

Table 3

Summarized MIAME Checklist

Experiment design

Type of experiment (e.g., normal vs diseased tissue, time course).

Experimental factors: the parameters or conditions tested (e.g., time, dose, or genetic variation).

Number of hybridizations performed in the experiment.

Reference used for the hybridizations.

Hybridization design: description of the comparisons made in each hybridization, whether to a standard reference sample or between experimental samples; an accompanying diagram or table might be useful.

Quality control steps (e.g., replicates or dye swaps).

URL of any supplemental websites or database accession numbers.

Samples used, extract preparation and labeling

Origin of the biological sample and its characteristics (e.g., gender, age, developmental stage, strain, or disease state).

Manipulation of biological samples and protocols used (e.g., growth conditions, treatments, separation techniques).

Protocol for preparing the hybridization extract (e.g., the RNA or DNA extraction and purification protocol).

Labeling protocol(s).

External controls (spikes).

Hybridization procedures and parameters

The protocol and conditions used during hybridization, blocking and washing.

Measurement data and specifications

The quantitations based on the images.

The set of quantitations from several arrays upon which the authors base their conclusions. Access to images of raw data is preferred but not required. Authors should make every effort to provide the following: scanning hardware and software used, image analysis software, measurements produced by the image analysis software and a description of which measurements were used in the analysis, image analysis output before data selection and transformation (spot quantitation matrices), and final gene expression data table(s).

Array design

General design, including the platform type (e.g., spotted glass array or in situ synthesized array), surface and coating specifications, and the availability of the array (the name or make of commercially available arrays).

For each feature (spot) on the array, its location on the array and the ID of its respective reporter (molecule present on each spot).

For each reporter, its type (e.g., cDNA or oligonucleotide) should be given with unambiguous characterization such as database reference and sequence.

For commercial arrays: manufacturer reference, including catalogue number and references to the manufacturer's website.

For noncommercial arrays, the following details should be provided: the source of the reporter molecules (for example, the cDNA or oligo collection used, with references), the method of reporter preparation, spotting protocols, and any additional treatment performed prior to hybridization.

a gene name does not describe a unique DNA sequence, but two or more sequences. The converse also occurred, where multiple gene names described very similar DNA sequences because a gene was sequenced and named by more than one researcher. To clarify these types of confusion, the NCBI initiated the GenBank database, to which investigators submit sequence and other information and, in turn, have that information associated with a unique GenBank accession number.

GenBank does not further evaluate the quality or veracity of genetic data, but rather serves as a repository. As GenBank sequence information grew, it became possible to perform automated analyses of the sequences. By evaluating GenBank sequences for overlap, it was possible to identify clusters of many entries that likely describe a single gene. In this way, a DNA sequence derived from several GenBank entries could be ascribed to a single UniGene cluster ID, again improving the ability of researchers to speak a common language.
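The idea of grouping entries by sequence overlap can be sketched as follows. This is a toy illustration only, not the procedure actually used to build UniGene: the overlap rule (an exact suffix-prefix match or containment of at least `min_len` bases), the accession labels, and the sequences are all assumptions made for the example.

```python
def overlaps(a, b, min_len):
    """True if b lies within a, or a suffix of a equals a prefix of b, over >= min_len bases."""
    if len(b) >= min_len and b in a:
        return True
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return True
    return False


def cluster_by_overlap(entries, min_len=8):
    """Group accession -> sequence entries into clusters linked by overlap (union-find)."""
    parent = {acc: acc for acc in entries}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accessions = list(entries)
    for i, a in enumerate(accessions):
        for b in accessions[i + 1:]:
            if overlaps(entries[a], entries[b], min_len) or overlaps(entries[b], entries[a], min_len):
                parent[find(a)] = find(b)

    clusters = {}
    for acc in accessions:
        clusters.setdefault(find(acc), []).append(acc)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical entries: ENTRY_A and ENTRY_B share the 10-base stretch GTACGTTAGC.
entries = {
    "ENTRY_A": "ATGGCGTACGTTAGC",
    "ENTRY_B": "GTACGTTAGCCTAGGA",
    "ENTRY_C": "TTTTCCCCAAAAGGGG",
}
print(cluster_by_overlap(entries))
```

Real sequence clustering must tolerate sequencing errors and compares far larger collections, so it relies on alignment tools rather than exact string matching; the sketch shows only how pairwise overlaps chain entries into a single cluster ID.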

With a common vocabulary in place for relating DNA sequences to naming and numbering systems, it becomes far simpler to discuss the function of individual genes and networks of genes. One of the benefits of standardizing databases such as GenBank is that automated analyses can be performed on vast quantities of data. In this manner, DNA sequences can be surveyed across species and potentially useful information from one species related to another. In the case of cellular metabolism in particular, it has been frequently noted that many genes are highly conserved across eukaryotic species. This realization has been used to link information from well-studied systems such as yeast to those where a given gene might not have been studied at all. In addition to automated information transfer such as this, there was a need to standardize species-specific information as it accrues, particularly with an eye toward automating analyses of genomic experiments. For example, a microarray analysis might describe hundreds of genes that differ between samples, a quantity of information that is difficult to assimilate. The problem might be reduced in complexity if genes could be grouped by function, so that a few groups of genes, rather than hundreds of single genes, might be considered for their biologic significance.

Standardization of genetic information at functional levels is the interest of groups such as the Gene Ontology Consortium (GOC) (8). The goal of the GOC is to create a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells accumulates and changes. To this end, they have systematically annotated thousands of named genes from multiple species under three broad headings: biologic process, molecular function, and cellular component. Each gene is assigned to GO terms that describe its function in a hierarchical manner. In this way, a given gene can be described in terms of a very specific function as well as increasingly broader functions. For example, a gene might be specifically described as being involved in DNA strand elongation while also being recognized as a part of the broader headings of DNA-dependent DNA replication, DNA replication, and, ultimately, DNA metabolism. Additionally, a given gene can be annotated under all three headings; for example, a receptor can be located in the cell membrane (cellular component), participate in cell growth (biologic process), and have signal transduction properties (molecular function). In the case of a specific metabolic pathway, however, the gene ontology terms are a poor substitute for expert knowledge.
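The hierarchical annotation described above can be sketched with a small lookup table. The term names come from the DNA strand elongation example in the text; the gene names (`geneA`, `geneB`, `geneC`), the `PARENTS` table, and both helper functions are hypothetical, and the real ontology is a much larger graph in which a term may have several parents.

```python
from collections import Counter

# Toy links from each term to its broader term(s), taken from the example above.
PARENTS = {
    "DNA strand elongation": ["DNA-dependent DNA replication"],
    "DNA-dependent DNA replication": ["DNA replication"],
    "DNA replication": ["DNA metabolism"],
    "DNA metabolism": [],
}


def with_broader_terms(term, parents=PARENTS):
    """Return the term followed by every broader term reachable from it."""
    seen, stack = [], [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen


def genes_per_term(annotations, parents=PARENTS):
    """Count how many genes fall under each term, including inherited broader terms."""
    counts = Counter()
    for term in annotations.values():
        counts.update(with_broader_terms(term, parents))
    return counts


# Hypothetical genes, each annotated at a different level of specificity.
annotations = {
    "geneA": "DNA strand elongation",
    "geneB": "DNA-dependent DNA replication",
    "geneC": "DNA replication",
}

print(with_broader_terms("DNA strand elongation"))
print(genes_per_term(annotations))
```

Counting at the broader levels is what allows a long list of differing genes to be summarized as a few functional groups: here all three hypothetical genes fall under DNA replication, although only one is annotated to DNA strand elongation.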

We have only mentioned a few agencies and databases that have contributed meaningfully to the analysis of genomic data.
