Sources and examples of noise in the generic microarray experiment

Before delving into details of how one can distinguish chip-reported expression data that arise from a genuine measurement of mRNA levels in the presence of noise from reported data that come from noise alone, it is informative to review the ways in which noise could enter a typical microarray experiment. Noise can roughly be classified, by point of origin, into two categories:

• Intra-chip noise encompasses the previously discussed background effect on the chip surface, where localized physical and chemical changes in one probe feature diffuse into neighboring probe features. It also includes improper scanning technique, such as leaving one area of the chip brighter than another, and chip manufacturing defects, such as quantitative nonuniformity in laying probe sequences on the array surface (see figure 3.18).

• Inter-chip noise includes biological variation in the samples (e.g., note that even pure cell populations are unsynchronized in their cell cycle phase). Also included are protocol variations in the hybridization procedure, the environment, time effects, and again, possibly, inconsistencies in chip manufacture.

In contrast to inter-chip noise, intra-chip noise is often more subtle and not as easily discernible, because there is usually no obvious reference against which to compare the outcome of a single microarray experiment.

Next, we explore examples of noise in microarray data. Figure 3.13 below, courtesy of Todd Golub at the MIT/Whitehead Institute's Genome Center, displays the gene expression intensity-intensity plot from an experiment in which the same sample was hybridized onto two separate arrays of the same type. Had the two arrays produced exactly the same data, a plot of the gene-by-gene expression level of chip 1 versus chip 2 would have shown the data points lying exactly along the diagonal line y = x. Instead, we see a typical, and in fact qualitatively rather good, result for inter-microarray variation. This variation is a very typical manifestation of inter-chip noise. Note especially that the typical spread of data points away from the y = x line increases with decreasing expression intensity. As mentioned earlier, this is due to the fact that at lower gene expression levels the RNA amounts being measured are smaller, and therefore the effect of noise on these measurements is relatively more significant. Thus, at low expression levels there is greater variation between these two nominally identical assays.


Figure 3.13: Apparent differences in source of variance due to log scaling. On the left, a plot of the logarithm of all the gene expression levels measured on one microarray versus the logarithm of those measured on another microarray shows apparently increased variation at lower expression levels. When the expression values are plotted on a linear scale, as on the right, it appears that the higher expression levels have higher variance. The former scale emphasizes the decreased reproducibility at lower expression levels and the latter emphasizes that there are fewer measurements or genes at higher expression levels. Note that for well-definedness, the logarithmic scale plot excludes measurements that had a negative value.

Golub et al. [78] also plotted two lines, y = 2x and y = x/2, which mark the locations of a 2.0-fold increase and a 2.0-fold decrease in gene expression, respectively, relative to the y = x line of an idealized 1.0-fold increase (or decrease), i.e., zero variation. Observe that at lower expression intensities, a sizeable number of genes have expression values that fall outside this 2.0-fold envelope, even though these intensities came from essentially identical samples. A similar-looking pattern is seen in the following two figures. The second figure shows data from a repeat hybridization in which a sample was hybridized to one chip, washed off, and then rehybridized onto a second chip. In the third figure, RNA was extracted twice from a common sample and hybridized simultaneously onto two different chips from the same lot.
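The 2.0-fold envelope described above can be checked numerically. The following is a minimal sketch, not the procedure used by Golub et al.: the function names and the sample intensities are hypothetical, and pairs with non-positive values are skipped by assumption, mirroring the exclusion made for the logarithmic plot.

```python
def outside_fold_envelope(x, y, fold=2.0):
    """Return True if y lies outside the band x/fold <= y <= fold*x."""
    if x <= 0 or y <= 0:
        raise ValueError("fold change is only defined for positive intensities")
    ratio = y / x
    return ratio > fold or ratio < 1.0 / fold


def fraction_outside(pairs, fold=2.0):
    """Fraction of (chip 1, chip 2) pairs that fall outside the fold envelope.

    Pairs containing a non-positive value are skipped (an assumption).
    """
    usable = [(x, y) for x, y in pairs if x > 0 and y > 0]
    if not usable:
        return 0.0
    hits = sum(outside_fold_envelope(x, y, fold) for x, y in usable)
    return hits / len(usable)


# Hypothetical duplicate measurements, not values read from the figure.
low = [(20, 55), (35, 30), (50, 21), (40, 44)]
high = [(2000, 2300), (5000, 4600), (8000, 8100), (3000, 2500)]

print(fraction_outside(low))   # 0.5
print(fraction_outside(high))  # 0.0
```

With these made-up values, half of the low-intensity pairs fall outside the envelope while none of the high-intensity pairs do, which is the qualitative pattern the text describes.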

In all three scenarios, there is a similar spread of data points about the idealized line of zero variation, y = x, and the spread is more prominent at lower intensity levels. At this point, we draw attention to a common trompe l'oeil that derives from the logarithmic scaling commonly employed to illustrate the concordance (or lack thereof) between two microarray hybridization data sets. As shown in figure 3.13, scaling the data sets to a logarithmic scale reduces the apparent variance at the higher expression levels. However, as we will explore further in our analysis of noise in microarrays, it remains the case that a larger proportion of the poorly reproduced microarray results are obtained at the lower expression levels. The reason for the apparently wider scatter at higher expression levels in the linear-scale graph in figure 3.13 is that there are relatively fewer genes with high measured expression levels.
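A small numerical illustration shows how the choice of scale changes what looks like "more variance". This is only a sketch with hypothetical values: it compares the mean absolute difference (what a linear plot shows as distance from y = x) with the mean absolute log2 ratio (what a log-log plot shows). It illustrates the scale effect alone and does not model the point made above about there being fewer genes at high expression levels.

```python
import math


def spread_on_both_scales(pairs):
    """Return (mean |y - x|, mean |log2(y / x)|) for positive (x, y) pairs."""
    n = len(pairs)
    linear = sum(abs(y - x) for x, y in pairs) / n
    log = sum(abs(math.log2(y / x)) for x, y in pairs) / n
    return linear, log


# Hypothetical duplicate measurements: 2-fold disagreement at low intensity,
# about 10% disagreement at high intensity.
low = [(20, 40), (40, 20), (30, 60)]
high = [(2000, 2200), (4000, 3600), (8000, 8800)]

low_linear, low_log = spread_on_both_scales(low)
high_linear, high_log = spread_on_both_scales(high)

print(low_linear, low_log)
print(high_linear, high_log)
```

On the linear scale the high-intensity pairs sit farther from the diagonal in absolute terms, while on the log scale the low-intensity pairs are the ones that deviate most, even though the underlying numbers are the same.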

In figure 3.8, we illustrate another example of the significance of noise in oligonucleotide microarrays, here the Affymetrix GeneChip. We show measured expression data (Avg Diffs) from an experiment where the same RNA sample was hybridized onto separate oligochips for human skeletal muscle. Each gene's expression was plotted on one of three graphs according to its Affymetrix Abs Call (Absent, Marginal, Present). As we noted in section 3.1.2, Affymetrix Abs Calls are determined from a decision matrix which is a function of the specificity of probe set expression for each gene. Note that the decision rules for this matrix are not currently available to the public. The lack of robustness or reproducible stability in both the Abs Calls and the Avg Diffs is evident from these plots. That is, for the same RNA sample, both the expression level and the Abs Call will vary. Apropos the Abs Call instability, note that the total number of genes across the three experiments is well under the total number of probes for each GeneChip. While such observed variances are not unexpected, they should warn the investigator against depending too heavily upon a particular microarray-reported value, statistic, or call in his or her research program.

The noise example above is neither unexpected nor unusual within the gamut of microarray noise examples, nor is it particular to the Affymetrix technology. Similar manifestations of noise, though possibly from different sources, are just as consequential in other microarray technologies. For instance, consider data from robotically spotted Incyte GEM microarrays in figure 3.14, which shows the headers and two rows of expression data from an Incyte file. Of relevance to our discussion are the location index of a probe on the physical microarray and its corresponding difference in expression. Incyte GEM microarrays are used in a technique similar to the two-dye fluorescence technique for cDNA chips pioneered at Stanford University, so that each microarray is hybridized to two differently labeled sample probes.

GEMID     Location  DiffExpr  BalancedDiffExpr  P1Signal  P1S/B  P1Area%
022PAOVW  2196      1.2       1.2               4330      60.3   90
022PAQWV  783       -1.1      -1.2              4325      43.8   94

P2BalancedSignal  P2Signal  P2S/B  P2Area%  Probe1    P1Description      Probe2
3724              3616      70.5   90       123Z3996  cocsxplusminus.si  12363997
5001              4856      64.1   94       123Z3996  cocsxplusminus.si  12363997

P2Description  GeneID   PlateRow  PlateCol  PlateID   CloneID
CSX            -137302  C         11        021OAOOE  524442
CSX            -131764  E         5         021VA0MR  463651

CloneSource      AccessionNum  Locus    IncyteCloneID  PCRStatus  GeneName
IMAGEConsortium  AI325648.1    02:15.5  mm45a10        Passed     RibosomalproteinL7|IMAGE:524442|
IMAGEConsortium  AA027730.1             mil 5h10       Passed     PublicdomainESTj IMAGE:463651)

Figure 3.14: Incyte data file snippet. Headers and two rows of values from an Incyte data file. Incyte microarrays are constructed using a robotic spotting process. The spotting process involves two proximal spots for the two dyes, Cy3 and Cy5, one of which is the targeted clone and the other a control spot.

Since cDNA microarrays, unlike oligochips, allow two sample probes corresponding to different experiments to be hybridized simultaneously on the chip, the inter-experiment noise here may also be intra-chip rather than exclusively inter-chip, as we had seen with oligonucleotide microarrays. The two different colors or labels correspond to signals P1S and P2S, respectively. These signals are corrected for background intensity and its variation, B, by the image analysis software used with the Incyte microarrays. Background-corrected values are reported as P1S/B and P2S/B, which correspond to the background-corrected intensities for the two probes. The P1S and P2S area percentages correspond to the fraction of the area of potential hybridization that the image analysis software records as having undergone actual hybridization. In theory, if the appropriate control probe is chosen, say P2S, then this should correct for noise-induced changes in hybridization conditions across different areas of the microarray, such as localized temperature differences on the microarray surface.

The first thing to note from figure 3.15 is that we see a very similar distribution of intensities for an identical RNA sample (the "hybridization cocktail") that is hybridized onto two separate Incyte arrays, P and P'. One way to view the noise profile of a robotically spotted microarray is to plot the ratio P2S/P2S' against the signal P2S, as in figure 3.16. In the ideal situation, this ratio ought to be uniformly equal to 1.0. That is, data points should be distributed only along the horizontal line y = 1 for any signal intensity P2S. In fact, at lower signal strengths of P2S, there are a number of genes whose P2S/P2S' ratios are significantly above 2.0-fold, even though on average the number of these outliers decreases with increasing P2S, and most genes tend to have a fold of 1.0. This demonstrates once again that the expression variance at lower expression values is more pronounced.

Figure 3.15: Duplicate cDNA assays of a common extract. The same RNA extract or hybridization "cocktail" measured on two Incyte spotted arrays. The P2S signals (the second of two dyes) from each microarray are plotted for each gene.

Figure 3.16: Ratio of probe intensities as a function of expression level. The ratio of the P2S/P2S' signals (y-axis) from two robotically spotted microarrays plotted against the P2S signal. The same RNA extract was used in both microarrays. Although the ratio is close to 1.0 for many genes, it does deviate sporadically and widely in a few instances. Also, as the expression level decreases, the distribution of the ratio spreads increasingly away from 1.0.
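The ratio view of figure 3.16 can be summarized by grouping the P2S/P2S' ratios into intensity bins. The sketch below is illustrative only: the function name, the bin edges, and the signal values are hypothetical and are not taken from the Incyte data.

```python
import math
from statistics import median


def ratio_profile(p2s, p2s_prime, edges):
    """Summarize P2S/P2S' ratios within intensity bins of P2S.

    edges is an ascending list of bin boundaries. Returns a dict mapping
    (lo, hi) to (count, median ratio, largest |log2 ratio|). Pairs with a
    non-positive signal are skipped (an assumption).
    """
    profile = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        ratios = [a / b for a, b in zip(p2s, p2s_prime)
                  if lo <= a < hi and a > 0 and b > 0]
        if ratios:
            profile[(lo, hi)] = (
                len(ratios),
                median(ratios),
                max(abs(math.log2(r)) for r in ratios),
            )
    return profile


# Hypothetical P2S signals from two arrays hybridized with the same extract.
p2s = [100, 150, 300, 3000, 4000, 5000]
p2s_prime = [40, 160, 900, 2900, 4400, 5000]

profile = ratio_profile(p2s, p2s_prime, edges=[0, 1000, 10000])
for (lo, hi), (n, med, worst) in profile.items():
    print(f"P2S in [{lo}, {hi}): n={n}, median ratio={med:.3f}, "
          f"max |log2 ratio|={worst:.3f}")
```

A largest |log2 ratio| above 1.0 in a bin means at least one gene in that bin differs by more than 2.0-fold between the two arrays; in this made-up data that happens only in the low-intensity bin.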

Microarray-reported data may also vary with the spatial location of the probe on the physical array. To see this, we map every probe and its associated reported expression amount from the two-dimensional physical chip onto a one-dimensional array or line, and plot the expression level of each cell on the y-axis versus its location index on the x-axis. Figure 3.17 displays such a plot for experiments involving Incyte microarrays. A visually prominent feature in these graphs is the set of four spikes marking dramatic increases in the expression level at four probe cell locations on every chip. Furthermore, there appears to be a periodic pattern in reported expression level with respect to probe location on these microarrays. A careful review of the chip manufacturer's specifications reveals that there are indeed four control probe sectors (spiked controls of reference RNA transcripts) on each chip, which would account for these spikes. The first preanalysis task, then, is to remove the spiked controls' data from their respective data sets and to renormalize these data sets for background effects, thus producing the next plots. In figure 3.17 the fold differences appear to be much less variable than previously and less obviously influenced by the probe location on the microarray.
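The preanalysis step described above can be sketched as follows. Both choices here are assumptions made for illustration: the text does not say how the two-dimensional layout is mapped to a line (row-major order is used), nor how the data are renormalized (division by the median of the remaining cells is used as a simple stand-in). The grid values and control positions are hypothetical.

```python
from statistics import median


def linear_index(row, col, n_cols):
    """Row-major position of a probe cell on the line (assumed mapping)."""
    return row * n_cols + col


def drop_controls_and_rescale(values, control_indices):
    """Remove control cells, then divide the rest by their median.

    Median scaling is a stand-in; the source does not state the method.
    Returns a list of (location index, rescaled value) pairs.
    """
    controls = set(control_indices)
    kept = [(i, v) for i, v in enumerate(values) if i not in controls]
    centre = median([v for _, v in kept])
    return [(i, v / centre) for i, v in kept]


# Hypothetical 2 x 4 chip laid out in row-major order, with spiked controls
# at (row 0, col 2) and (row 1, col 1).
n_cols = 4
values = [10, 12, 500, 11, 9, 480, 10, 13]
controls = [linear_index(0, 2, n_cols), linear_index(1, 1, n_cols)]

cleaned = drop_controls_and_rescale(values, controls)
for index, value in cleaned:
    print(index, round(value, 3))
```

After the two control cells are dropped, the remaining values are centered on 1.0, so any residual trend against the location index is easier to see than when the control spikes dominate the y-axis.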
