Microarray Study Design And Sources Of Error

We briefly mentioned measurement error in Section 3.2 with regard to fluorescence ratios; however, measurement error enters into the array experiment at multiple steps (29). Consider the following experiment for potential sources of error. Liver biopsies are performed on two different mouse populations: one with normal livers and the other with tumors. Messenger RNA is prepared from each of the biopsies. The RNA is labeled and then hybridized against a microarray, and the array is scanned. The use of replicate samples has provided considerable insight into the contribution of measurement error at each of these steps and into techniques for minimizing that error (50).

There are two general categories of replicate samples: biologic and technical. A technical replicate is the replication of any part of sample preparation or analysis occurring after the biopsy. Starting at the distal end of the experiment, we could replicate each probe on the array to measure the reproducibility of fluorescence between probes on the same array. These experiments have been done and show that the agreement between probes on an array is high, at 95-99% (29). Therefore, from the measurement error perspective, there is little value in having replicate probes on an array. Moving proximally in the experiment, investigators have examined the effect of splitting an aliquot of labeled cDNA or, even more proximally, preparing two separate mRNA samples from the same tissue sample before hybridizing to two different cDNA arrays. In these examples, the correlation between genes drops to 80% and 60%, respectively.

The statistics on reproducibility are important in that they both describe the limitations of the technology and suggest methods for improving overall reproducibility. Replicate experiments reveal that technical factors account for significant variation in the results. They also suggest the means for reducing this variation considerably. Either performing replicate arrays for the same sample and averaging the results or (more economically) pooling separately labeled specimens moves the results toward a more reproducible mean (50). Although technical replicates are not required to obtain meaningful results, they can be used to reduce error if needed.
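As a rough illustration of why averaging replicate arrays helps, the following Python sketch assumes that each array reading for a gene equals its true value plus independent Gaussian noise. The noise model, the function names, and the numbers are assumptions made for illustration and are not taken from the cited studies; pooling of labeled specimens is not modeled. Under this assumption, the spread of the mean of n replicates shrinks in proportion to 1/sqrt(n).

```python
import math
import random
import statistics


def standard_error_of_mean(sd, n_replicates):
    """Theoretical spread of the mean of n independent readings with noise sd."""
    return sd / math.sqrt(n_replicates)


def simulate_mean_spread(true_value, sd, n_replicates, n_trials=2000, seed=0):
    """Empirical spread of the replicate-averaged reading over many trials.

    Each trial draws n_replicates noisy readings of the same gene and
    averages them; the standard deviation of those averages is returned.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_trials):
        readings = [rng.gauss(true_value, sd) for _ in range(n_replicates)]
        means.append(statistics.fmean(readings))
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        theory = standard_error_of_mean(1.0, n)
        observed = simulate_mean_spread(0.0, 1.0, n)
        print(f"replicates={n}  theory={theory:.3f}  simulated={observed:.3f}")
```

In this simplified model, quadrupling the number of replicate arrays halves the noise in the averaged value, which is one way to weigh the cost of extra arrays against the gain in reproducibility.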

Biologic replicates are either repeat biopsies of the same mice or biopsies of different mice with similar tumors. Gene correlation can fall to 30% or lower when biologic replicates are considered. The 30% value includes all of the variations described by the technical components of sample processing and additional variation in the underlying biologic system. Although a reproducibility of 30% might seem low, we should recall that the majority of the genes being sampled likely have little to do with the biologic pathways of interest. A 30% correlation does not mean that each gene is 30% correlated, but, rather, that some are highly correlated and some not at all. High correlation would be expected for the small number of genes with direct relevance to the disease process of interest, with weaker correlations for genes indirectly related to that metabolic state. More exhaustive reviews of DNA microarray reliabilities are available (28,50,51).


Having developed expression arrays as a model genomics tool, we should consider the implications of using RNA as opposed to DNA or protein. The measurement of RNA is associated with a number of challenges, the most considerable of which is that RNA is rapidly degraded. The degradation is so rapid that high-quality RNA can only be obtained when tissue is processed immediately or fresh frozen within minutes of collection. Not only is RNA difficult to obtain but, as Brown and Botstein have suggested, protein (specifically protein activity) is the biologically active species, not RNA (52). Unlike RNA, many proteins are relatively stable and can be found in paraffin-embedded tissues. Just as RNA is measured in genomic terms, proteins and protein activities can be measured en masse via proteomic techniques that will be described in Section 8. The problem for current investigators is that the field of proteomics is far less advanced than expression genomics. Proteomic measurements are more complicated and less well standardized, and the researchers are less experienced.

The preference for using RNA is not merely a matter of convenience. The quantitative relationship between transcription and translation has been confirmed in many settings, including microarrays, arguing for the use of RNA expression profiles (53,54). For the measurement of any single mRNA species, however, other methods such as Northern blot are superior to microarrays (55). Similarly, the usual limitations of RNA measurement remain true in that correlations between RNA and protein levels are low in cases where proteins are secreted, rapidly degraded, or unusually stable. Cellular processes that are decoupled from translation, such as responses to micronutrients, drugs, or physiologic conditions, might be poorly represented by changes in RNA levels. Finally, RNA, like protein, is not uniformly degraded, and relative changes in RNA species might be artifacts of degradation.


Expression microarrays are the most widely used genomic assays; however, there is a wide variety of genomic and proteomic tools using RNA, DNA, and protein. Many of these techniques address specific shortcomings of the expression arrays or offer complementary information (56).

7.1. SAGE

A distinct weakness of expression microarrays is the use of fluorescence, an indirect measure of mRNA concentration. We have already discussed the challenges of using the fluorescence ratio as a measure of RNA in the cDNA microarray. In a following section, we will encounter similar standardization difficulties in the oligonucleotide platform. Serial analysis of gene expression (SAGE) is an RNA genomic technique that predates the expression array, in which mRNA copy number is quantified directly, overcoming a distinct shortcoming of the DNA microarray (57). The SAGE technique works on the principle that short oligonucleotides of 10 bases in length, called tags, are sufficient to uniquely identify any specific cDNA transcript. The principle is a statistical assertion based on the approximation that the total number of human genes is around 30,000-40,000, with 80,000-120,000 transcripts based on alternate splicing. A 10-bp tag, composed of the 4 bases A, T, C, and G, has a maximum of 4^10 (1,048,576) possible combinations, a number far greater than the estimated number of human transcripts. By this logic, it is statistically unlikely that any single 10-bp tag would occur in more than one transcript. Although we know that DNA sequences are not random and that related genes might often share sequence homology, such events are rare enough to allow SAGE to be useful.
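The arithmetic behind this argument can be checked with a short Python sketch. It assumes, as the text does for the sake of the estimate, that tags behave like uniformly random 10-base sequences; the function name and the random model are illustrative assumptions, and real transcript sequences are not random.

```python
BASES = "ACGT"
TAG_LENGTH = 10

# Number of distinct 10-base tags over a 4-letter alphabet: 4**10 = 1,048,576.
N_POSSIBLE_TAGS = len(BASES) ** TAG_LENGTH


def prob_tag_shared(n_transcripts, n_tags=N_POSSIBLE_TAGS):
    """Chance that a given transcript's tag is also carried by at least one
    of the other transcripts, if every tag were drawn uniformly at random."""
    return 1.0 - (1.0 - 1.0 / n_tags) ** (n_transcripts - 1)


if __name__ == "__main__":
    print(f"possible tags: {N_POSSIBLE_TAGS:,}")
    for n in (80_000, 120_000):
        print(f"{n:,} transcripts: P(tag shared) = {prob_tag_shared(n):.3f}")
```

Under this idealized model, the chance that a particular tag is shared with another transcript comes out at roughly 7% to 11% across the quoted range of 80,000 to 120,000 transcripts, so most tags would be expected to be unique, though not all.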

Messenger RNA is isolated from the cells of interest and double-stranded (ds) DNA is produced by reverse transcription, as seen in Fig. 7. The ds-cDNA is immobilized on streptavidin-coated magnetic beads and the cDNA cleaved with a restriction enzyme that cuts most transcripts at least once. Only the 3' end of the cDNA is retained by collecting the magnetic beads. By isolating the DNA fragment closest to the 3' end which is cut by a specific restriction enzyme, the 10-bp tag is standardized. In addition, the availability of genomic sequencing allows for the identification of the specific gene from the 10-bp tag. In the example shown in Fig. 7, the restriction enzyme recognizes the site CATG. For this reason, the sequence CATG will be included in all of the subsequent tags in the analysis, although it adds nothing to the specificity of the tag. It is therefore technically correct to say that the tags are 10-14 bp long, of which 10 bp are unique.

A linker is added to the cut end, and a second restriction enzyme, called a type II restriction enzyme, is added. The type II restriction enzyme has the desired property of cutting a DNA sequence at a specific number of basepairs downstream from the recognition site. In this case, the cut produces a 10-bp sequence, the above-described tag, from the original cDNA after accounting for the linker. The individual tags are randomly blunt-end ligated to form dimers called di-tags. Di-tags are PCR amplified in a manner that maintains the proportional representation of each tag's frequency in the original mRNA sample. After amplification, the linkers are removed and the di-tags are allowed to concatemerize, a process in which di-tags form long chains. The concatemerized di-tags are subcloned into a vector and their DNA is sequenced. The resultant sequence information is called a SAGE library and can be analyzed to identify and quantify each of the 10-bp tags.
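The tag definition described above can be sketched in Python as an in silico analogue: for each cDNA, find the CATG site closest to the 3' end and take the 10 bases immediately downstream. This is a simplified illustration and not a SAGE analysis pipeline; the function names are hypothetical, sequences are assumed to be written 5' to 3' on the sense strand, and linkers, di-tags, and sequencing errors are ignored.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # recognition site of the first restriction enzyme in the text
TAG_LENGTH = 10       # number of unique bases per tag


def extract_tag(cdna, anchor=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag downstream of the 3'-most anchor site, or None.

    None is returned when the sequence has no anchor site or when fewer
    than tag_length bases follow the 3'-most site.
    """
    cdna = cdna.upper()
    pos = cdna.rfind(anchor)  # rfind locates the site nearest the 3' end
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = cdna[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def count_tags(cdnas):
    """Count how often each tag occurs across a collection of cDNA sequences."""
    tags = (extract_tag(seq) for seq in cdnas)
    return Counter(tag for tag in tags if tag is not None)


def tags_per_10000(counts):
    """Rescale raw tag counts to copies per 10,000 tags."""
    total = sum(counts.values())
    return {tag: 10000 * n / total for tag, n in counts.items()}


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    library = [
        "GGCATGAAAAACCCCCTT",
        "GGCATGAAAAACCCCCTT",
        "CATGTTTTTTTTTTCATGACGTACGTACAA",
    ]
    counts = count_tags(library)
    for tag, value in tags_per_10000(counts).items():
        print(tag, counts[tag], round(value, 1))
```

Because every transcript is reduced to a tag at a defined position, counting identical tags gives a direct tally of each transcript, which can then be rescaled to a common total for comparison between libraries.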

Unlike expression arrays, SAGE directly quantifies each of the mRNA species that was present in the original specimen, expressed as the number of copies per 10,000 tags. Whereas expression array data require considerable effort to compare one study to another (a point we will return to later),

[Fig. 7. SAGE method. Steps shown: RNA isolated from sample; reverse transcribe; ds-DNA; link to magnetic beads; restriction enzyme digest; linker ligation; type II restriction enzyme digest.]
