Preprocessing Filters and Normalization

We will be referring to the data from study S2 as tabulated in table 3.3.1. Since the number of replicate experiments in a typical microarray study does not normally exceed three, it is not easy to derive meaningful statistics for any one gene "latitudinally" across two or three repeat measurements; e.g., the mean reported intensity for Gene 9785 in state B is (250.5 + 193.3)/2 = 221.9. Descriptive statistics such as means or variances are generally informative of the underlying stochastic process in the system being investigated only when one has a large relevant data set to work on. Even though each microarray assay yields one noisy reading for each of the ~10^4 individual genes, the reported measurement for gene k will not inform us about the robustness of the measurement for gene j (j ≠ k) in the absence of at least one other whole replicate assay.

On the other hand, even prior to any transformation(s) that will render the distinct replicate data sets comparable to each other, we can obtain informative statistics for each chip experiment data set "longitudinally", or intra-array, across all genes 1 through N, such as the following:

• P1. The mean, standard deviation, minimum, and maximum of Avg Diff in each microarray experiment. For example, for chip A2 of study S2 we could find that the mean is 232.4, standard deviation 888.1, minimum -2000.7, and maximum 19792.8.

• P2. The mean and standard deviation of Avg Diff differences for all duplicate pairs.

For example, the average and standard deviation of (B2 - B1) for all 13,179 genes in study S2 may be -35.4 and 200.3, respectively. The negative mean tells us that, on average, a gene registers a lower expression level in experiment B2 than in B1. The standard deviation may be informally regarded as an inverse indicator of reproducibility of the replicate readings: the larger it is, the less reproducible the readings. Note that P2 is not strictly an intra-array calculation, since it compares two chips.

• P3. The distribution range of Avg Diff values on each chip.

• P4. The distribution of genes which have J (J ≤ 4) Present (P) Abs Calls. For instance, in study S2, 2547 genes have a P call in all four experiments, 4112 genes have Absent (A) calls throughout all four experiments, and 3792 genes have exactly three A calls across the four experiments.
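As a rough illustration of P1, P2, and P4, the statistics could be computed along the following lines. This is a minimal sketch, not the procedure used in study S2: the chip names follow the text, but every Avg Diff value and Abs Call below is invented toy data, the function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import Counter
from statistics import mean, stdev

# Invented toy data for four genes on four chips (NOT taken from table 3.3.1).
avg_diff = {
    "A1": [100.0, 200.0, 300.0, -50.0],
    "A2": [110.0, 190.0, 320.0, -40.0],
    "B1": [150.0, 80.0, 310.0, -20.0],
    "B2": [140.0, 60.0, 300.0, -60.0],
}
abs_call = {
    "A1": ["P", "P", "P", "A"],
    "A2": ["P", "P", "P", "A"],
    "B1": ["P", "A", "P", "A"],
    "B2": ["P", "A", "P", "A"],
}


def chip_summary(values):
    """P1: mean, sample standard deviation, minimum, and maximum for one chip."""
    return {
        "mean": mean(values),
        "sd": stdev(values),  # assumption: sample (n - 1) standard deviation
        "min": min(values),
        "max": max(values),
    }


def replicate_difference(second, first):
    """P2: mean and sample SD of the per-gene differences (second - first)."""
    diffs = [s - f for s, f in zip(second, first)]
    return mean(diffs), stdev(diffs)


def present_call_distribution(calls_by_chip):
    """P4: map J -> number of genes with exactly J Present (P) calls."""
    per_gene = zip(*calls_by_chip.values())  # one tuple of calls per gene
    return Counter(sum(call == "P" for call in calls) for calls in per_gene)


if __name__ == "__main__":
    for chip, values in avg_diff.items():
        print(chip, chip_summary(values))
    print("B2 - B1:", replicate_difference(avg_diff["B2"], avg_diff["B1"]))
    print("P-call distribution:", dict(present_call_distribution(abs_call)))
```

A negative mean from `replicate_difference(avg_diff["B2"], avg_diff["B1"])` indicates that, on average, readings are lower on B2 than on B1, which is the same reading of the sign as in the P2 example.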

Gross overall data statistics like the above can quickly be used to detect systematic inter-chip differences. For instance, if the genomicist in study S2 finds that for P2, the average replicate difference (B1 - B2) was 11714.4, then he or she might have good reason to suspect that chip B1 was scanned at a much higher ambient or background setting than was B2. Such statistics have been used as preliminary and very crude filters to reduce the number of potentially interesting candidate genes in some microarray studies. For instance, the genomicist in study S2 could decide (following P4) to consider only the 2547 genes which have P Abs Calls throughout all four experiments for more refined data analysis. Alternatively, the genomicist could accept only genes whose Avg Diffs in some or all of the four experiments are above a predetermined cut-off value. These kinds of filters have their weaknesses. In table 3.3.1, we clearly see that Gene 9784 would be missing from the list of potentially interesting candidates for study S2 following the all-P-Abs-Call filter, and yet from a cursory inspection of the raw data, and in the absence of further analysis, it seems that drug X triggers a significant reported expression decrease in Gene 9784 from its control levels.
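The two crude filters just described, and the weakness of the all-P filter, can be sketched as follows. This is only an illustration under stated assumptions: the gene identifiers, calls, Avg Diff values, and the cut-off of 100 are all hypothetical and are not taken from table 3.3.1, and "above the cut-off" is read here as strictly greater than.

```python
# Hypothetical per-gene data in chip order (A1, A2, B1, B2); values are invented.
calls_by_gene = {
    "gene_a": ["P", "P", "P", "P"],
    "gene_b": ["P", "P", "A", "A"],  # present in controls, absent after treatment
    "gene_c": ["A", "A", "A", "A"],
}
avg_diff_by_gene = {
    "gene_a": [500.0, 520.0, 480.0, 510.0],
    "gene_b": [900.0, 880.0, 30.0, 25.0],
    "gene_c": [10.0, -5.0, 12.0, 3.0],
}


def all_present_filter(calls):
    """Keep only genes with a Present (P) call in every experiment."""
    return [gene for gene, c in calls.items() if all(x == "P" for x in c)]


def cutoff_filter(values, cutoff, require_all=True):
    """Keep genes whose Avg Diff exceeds the cutoff in all (or any) experiments."""
    test = all if require_all else any
    return [gene for gene, v in values.items() if test(x > cutoff for x in v)]


if __name__ == "__main__":
    print(all_present_filter(calls_by_gene))
    print(cutoff_filter(avg_diff_by_gene, cutoff=100.0))
    print(cutoff_filter(avg_diff_by_gene, cutoff=100.0, require_all=False))
```

In this toy data, `gene_b` shows a large drop between the first two and the last two experiments, yet both the all-P filter and the strict cut-off filter discard it; only the "some experiments" variant (`require_all=False`) retains it. This mirrors the kind of loss the text describes for Gene 9784.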

In general, any gross preliminary filtering of the sort above leads to a loss of information. Loss of information is inevitable in typical microarray analyses and is not always a negative feature. After all, microarray data analyses are essentially reductionistic by nature due to the problems of irreproducibility as well as cost limitations in confirming candidate genes resulting from these analyses. The relevance of the loss is context-specific. It depends upon the sort of information that is lost and the scientific question being investigated. Gross statistics such as P1 and P3 have also been used to "scale" or normalize one chip data set against another so that transformed numerical indicators of expression intensity (Avg Diffs) are comparable in some probabilistic sense. This is the gist of question Q1.
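One simple way a P1-type statistic might be used to scale chips against each other is to multiply every Avg Diff on a chip by a constant so that the chip mean matches a chosen target. The sketch below shows that idea only; it is an assumed, generic global-mean scaling, not necessarily the normalization the text goes on to discuss, and the function name, target value, and readings are hypothetical.

```python
from statistics import mean


def scale_to_target_mean(values, target):
    """Rescale one chip's Avg Diff values so that their mean equals `target`."""
    chip_mean = mean(values)
    if chip_mean == 0:
        raise ValueError("chip mean is zero; a scaling factor is undefined")
    factor = target / chip_mean
    return [v * factor for v in values]


if __name__ == "__main__":
    # Invented readings for two chips; both are scaled to a common mean of 100.
    chips = {"B1": [100.0, 200.0, 300.0], "B2": [50.0, 100.0, 150.0]}
    scaled = {name: scale_to_target_mean(v, target=100.0) for name, v in chips.items()}
    for name, values in scaled.items():
        print(name, values, mean(values))
```

A single multiplicative factor leaves the ordering of genes within a chip unchanged when the factor is positive, and it only makes chip means agree; it does not by itself make the full distributions (P3) comparable.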
