Managing noise

Having established that current microarray technologies are bedeviled with noise, we now turn to the data that such systems generate. The most direct approach is to measure the variation of each gene across microarray measurements by replicating the microarray assay(s). In practice, only duplicate or triplicate hybridizations are performed because of cost. The goal is to obtain a subset of genes whose reported expression level is robust across replicates, despite the noise. As alluded to in section 2.9 and illustrated in figure 2.9, any such technique must inevitably trade off its false-negative rate against its false-positive rate. That is, in order to avoid following thousands of red herrings, as would be the case with a high false-positive rate, one has to be willing to discard many genes that might be genuinely involved in the biological process of interest, on the grounds that their reported expression is inconsistent or irreproducible across replicates with the current assaying technology. In this section we describe, as an example, one technique developed to identify sets of up- or downregulated genes that had a low false-positive rate for the specific data set under consideration. The broader question of what constitutes a significant change in expression level for a gene raises the question of appropriate (dis)similarity measures, which is addressed in sections 3.3 and 3.6.

Modeling the expression-level-dependent noise envelope

If, as mentioned in preceding chapters on reproducibility and replicates, the variance in reported gene expression level is a function of the expression level itself, then single-value thresholds (e.g., a 2.0- or 3.0-fold increase) applied to an entire expression data set, as is commonplace in the literature, are not sufficient for deciding which genes are significantly changed between samples or conditions. At low expression levels, such uniform thresholds will be associated with an increased false-positive rate, and at high expression levels with an increased false-negative rate. One way to address this problem is to develop fold thresholds that are themselves a function of the reported expression level rather than a fixed number. This requires at least a duplicate hybridization: the same hybridization cocktail is hybridized onto two microarrays. As far as possible, these are identical experiments in which the operating conditions, cell lines, culture media, incubation time, and so forth are controlled to be the same.
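The effect of a uniform fold threshold at low expression levels can be illustrated with a minimal sketch on synthetic data; it does not use the experiments described in this section. The sketch assumes additive Gaussian noise with a constant standard deviation, which is only one possible noise model, and the function names (`simulate_duplicates`, `false_positive_rate`) and all parameter values are hypothetical choices made for illustration.

```python
import random


def simulate_duplicates(n_genes=2000, noise_sd=200.0, seed=0):
    """Simulate duplicate hybridizations of the same sample.

    Assumption: both readings share one true level and differ only by
    additive Gaussian noise of constant standard deviation.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_genes):
        true_level = rng.uniform(100.0, 20000.0)
        # Clip at 1.0 so that fold ratios stay finite and positive.
        x = max(true_level + rng.gauss(0.0, noise_sd), 1.0)
        y = max(true_level + rng.gauss(0.0, noise_sd), 1.0)
        pairs.append((x, y))
    return pairs


def false_positive_rate(pairs, lo, hi, fold=2.0):
    """Fraction of genes with lo <= x < hi that a fixed fold threshold flags.

    The two readings come from the same sample, so every flagged gene
    is a false positive.
    """
    in_range = [(x, y) for x, y in pairs if lo <= x < hi]
    if not in_range:
        return 0.0
    flagged = sum(
        1 for x, y in in_range if y / x >= fold or y / x <= 1.0 / fold
    )
    return flagged / len(in_range)


pairs = simulate_duplicates()
print("false-positive rate, intensities below 1000:",
      false_positive_rate(pairs, 0.0, 1000.0))
print("false-positive rate, intensities from 10000:",
      false_positive_rate(pairs, 10000.0, 30000.0))
```

Under this assumed noise model, the fixed 2.0-fold threshold flags duplicate readings only in the low-intensity range, which is the behavior that motivates an expression-dependent threshold.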

As the expression values should be identical, all variance is theoretically attributable to nonbiological sources. Then, for each set of duplicates, an identity mask (ID mask) is calculated as in Tsien et al. [181]; the mask delimits the region of fold changes that are insignificant or attributable to noise alone. Two parameters are used to create each ID mask: the expression value range (a sliding window of expression intensities), and either the scale value or the number of standard deviations. These parameters determine the ID mask borders and can be adjusted for different trade-offs between sensitivity and specificity, depending on one's utility model as reflected in figure 2.9.

We illustrate the application of ID masks to six experiments (A, B, C, D, E, F) that were run in duplicate. Total RNA was isolated from the cell lines (MCF-7 human breast cancer cells and MG-63 human osteosarcoma) and hybridized onto Atlas Human cDNA Expression Arrays from Clontech (Clontech Laboratories, Inc., Palo Alto, CA). Each of these Atlas Arrays (Human 1.2 I, Human Cancer) is a nylon membrane on which approximately 1200 human cDNAs have been immobilized. Although this example uses a relatively low-density array technology, the methodology applies similarly to higher-density microarrays.

Two methods are then explored for creating ID masks. Method 1 relies on segmental calculation of standard deviations. A "data point" refers to an (x, y) pairing, in which x is the expression value of a gene g from the first hybridization, and y is the corresponding fold difference value (i.e., the ratio of the expression of the same gene g in the second hybridization to x). Using all data points in a given sliding window of expression values (e.g., from intensities 1001 to 2000), the standard deviation of the fold values is calculated. The average of the expression values within that intensity window is then paired with the average fold value within the same window plus the number of standard deviations specified by the experimenter. This new pair becomes a candidate "upper mask border" point. Similarly, a candidate "lower mask border" point is created by pairing the average expression value of that window with the average fold value minus the specified number of standard deviations. Each successive group of data points in each sliding window of expression values (e.g., all points from intensities 2001 to 3000, then all points from intensities 3001 to 4000, etc.) likewise gives rise to candidate mask border points. A line is then fitted by least squares linear regression to the set of (expression value, fold value) pairs comprising the candidate upper mask border points. This line defines the upper mask border. The lower mask border is computed in the same way from the candidate lower mask border points. If one of the derived mask borders fits poorly, as judged by its relationship to the original data points, the "reciprocal reflection" of the other (good-fitting) mask border can serve in its place. This simply means that each (x, y) point on the good-fitting (linear) border gives rise to a point (x, 1/y) on the reciprocal reflection border.

Figures 3.22 and 3.23 show ID masks delimited by one linear regression border; the other border was derived by taking the reciprocal values of that linear regression border. The region between these borders represents the "identity" region of insignificant fold differences (i.e., fold changes resulting from noise alone).
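The steps of method 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Tsien et al.: windows are taken as consecutive, non-overlapping bins of the form [k·window, (k+1)·window), the sample standard deviation is used, and the names `fit_line`, `method1_borders`, and `reciprocal_reflection` are hypothetical.

```python
import statistics


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two candidate border points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("candidate border points share one x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def method1_borders(data_points, window, n_sd):
    """Return (upper_line, lower_line) for (expression, fold) data points.

    Assumption: windows are consecutive bins [k*window, (k+1)*window).
    """
    bins = {}
    for x, fold in data_points:
        bins.setdefault(int(x // window), []).append((x, fold))

    upper, lower = [], []
    for key in sorted(bins):
        pts = bins[key]
        if len(pts) < 2:
            continue  # the sample SD is undefined for a single point
        mean_x = statistics.mean(x for x, _ in pts)
        folds = [fold for _, fold in pts]
        mean_fold = statistics.mean(folds)
        sd = statistics.stdev(folds)
        upper.append((mean_x, mean_fold + n_sd * sd))
        lower.append((mean_x, mean_fold - n_sd * sd))
    return fit_line(upper), fit_line(lower)


def reciprocal_reflection(line):
    """Map each (x, y) on a linear border to (x, 1/y)."""
    slope, intercept = line
    return lambda x: 1.0 / (slope * x + intercept)


# Tiny hand-made data set (not from the experiments in the text).
points = [(100, 0.5), (300, 1.5), (1100, 0.8), (1300, 1.2)]
upper_line, lower_line = method1_borders(points, window=1000, n_sd=2.0)
lower_from_upper = reciprocal_reflection(upper_line)
print("upper border at x=200:", upper_line[0] * 200 + upper_line[1])
print("fitted lower border at x=200:", lower_line[0] * 200 + lower_line[1])
print("reflected lower border at x=200:", lower_from_upper(200))
```

In this toy data set the fitted lower border drops below zero at low intensities, which a fold ratio cannot do; this is one situation in which the reciprocal reflection of the upper border might be the more sensible choice.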



Figure 3.22: Identity mask for experiment A. Method 2 with parameters 9000 for expression value sliding window size and scaling factor 0.975 resulted in the lowest percentage of original data points


Figure 3.23: Identity mask for experiment E. Method 1 with parameters 5000 for intensity window size and 3 SD resulted in the lowest percentage of original data points lying outside of the mask region (0.9%).

Method 2 for creating an ID mask derives its candidate mask border points from the maximal points in each sliding expression intensity window rather than from standard deviation calculations as in method 1. Specifically, among all data points in a given intensity window, the point with the greatest fold value is chosen. This is repeated for each successive window of expression values. These fold values can also be scaled before they are used in a linear regression to find the upper mask border. The lower mask border is derived analogously from the minimal fold values.

Once the ID mask has been derived, all original data points are checked for inclusion in or exclusion from the ID mask region, and the percentage of data points lying outside the ID mask region is recorded. We then automatically searched a large number of masks for those that provided the best performance. For both methods 1 and 2, sliding windows of 1000, 5000, and 9000 on the expression value axis were chosen for experimentation. Only when calculations were not possible with one of these window sizes (e.g., because of division by zero) was an alternative window size chosen. For method 1, the number of standard deviations (for calculating candidate mask border points) was set to 2.5 and 3.0. For method 2, the scaling factor was set to 0.975 and 1.0. Twelve candidate ID masks were thus created for each pair of experiments (2 methods × 3 intensity window sizes × 2 scale or SD factors). For each pair of experiments, the mask with the lowest percentage of original data points lying outside the mask region was selected. The results are shown in table 3.2.8.
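Method 2 and the mask selection step can be sketched in the same spirit. Again this is only an illustration under assumptions: windows are consecutive, non-overlapping bins; the text does not say how the minimal fold values are scaled, so the sketch divides them by the scaling factor so that a factor below 1 tightens both borders; and parameter combinations that cannot be fitted are simply skipped. Only the method 2 half of the search is shown, and all function names are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two candidate border points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("candidate border points share one x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def method2_borders(data_points, window, scale):
    """Borders from the extreme fold value in each intensity window."""
    bins = {}
    for x, fold in data_points:
        bins.setdefault(int(x // window), []).append((x, fold))
    upper, lower = [], []
    for key in sorted(bins):
        x_max, f_max = max(bins[key], key=lambda p: p[1])
        x_min, f_min = min(bins[key], key=lambda p: p[1])
        upper.append((x_max, f_max * scale))
        # Assumption: the lower border is scaled by division.
        lower.append((x_min, f_min / scale))
    return fit_line(upper), fit_line(lower)


def percent_outside(data_points, upper, lower, tol=1e-9):
    """Percentage of data points lying outside the mask region."""
    outside = 0
    for x, fold in data_points:
        top = upper[0] * x + upper[1]
        bottom = lower[0] * x + lower[1]
        # tol guards against rounding for points lying on a border.
        if fold > top + tol or fold < bottom - tol:
            outside += 1
    return 100.0 * outside / len(data_points)


def select_best_mask(data_points, windows=(1000, 5000, 9000),
                     scales=(0.975, 1.0)):
    """Return (window, scale, percent_outside) of the best method 2 mask."""
    best = None
    for window in windows:
        for scale in scales:
            try:
                upper, lower = method2_borders(data_points, window, scale)
            except ValueError:
                continue  # this window size cannot be fitted
            pct = percent_outside(data_points, upper, lower)
            if best is None or pct < best[2]:
                best = (window, scale, pct)
    return best


# Tiny hand-made data set (not from the experiments in the text).
points = [(100, 1.0), (200, 2.0), (300, 0.5),
          (1100, 1.0), (1200, 1.5), (1300, 0.8)]
print(select_best_mask(points))
```

Note that selecting purely on the lowest percentage of points outside the mask tends to favor wider masks, so the choice of candidate parameters still carries the sensitivity and specificity trade-off discussed earlier.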

The particular values obtained for these experiments are shown for illustrative purposes only. For any microarray technology adopted, even a simple analysis of this sort on duplicate experiments will provide much greater control over sensitivity and specificity than a single arbitrary fold-ratio threshold. For even greater accuracy, this kind of noise modeling must be done on a per-gene basis, as described in section 3.3.

