Normalization

We shall use the term normalization informally to refer to the transformations that are applied to data sets from chip experiments to render them comparable to one another in a probabilistic or statistical sense, or to take quantitative account of assumptions about how these data were generated. Normalization of a collection of different data sets is typically carried out with respect to a reference data set. The reference set could be either data from one particular experiment or a postulated distribution of expression intensities. The following are three normalization methods in common practice:

• N1. Linear regression

• N2. Probabilistic distributions: first, second or k-th moments

• N3. Housekeeping genes, and spiked targets

The choice of transformations will depend upon assumptions that are made about the properties of individual data sets such as intensity distribution and the behavior of noise. This will subsequently influence the characterization of expression or fold change significance at later stages in the analysis.

Linear regression. Our discussion will first focus upon the assumptions correctable by linear transformations, N1. Referring to study S2, let $x_i$ and $y_i$ denote the reported expression intensity (Avg Diff) of Gene $i$ in the duplicate experiments B1 and B2 respectively, with $i = 1, \ldots, N$, where $N = 13{,}179$ is the total number of unique probes on the Mu11K chip set. Let B1 be the reference data set. For notational clarity, let $\mathbf{x}$ denote the vector of gene-ordered intensities in a chip experiment, $(x_1, \ldots, x_N) \in \mathbb{R}^N$, and let $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ be the mean of the gene expression intensities $x_i$ in the chip set B1.

Now consider for the moment a hypothetical situation where chip B2 has a systematic error in relation to its (reference) duplicate B1, such that the Avg Diff $x_i$ for every Gene $i$ in B1 is remeasured or reproduced in the B2 experiment as $y_i = \beta_1 x_i + \beta_0$ for some real constants $\beta_0$ and $\beta_1$. A global linear shift of this kind could arise, for instance, if chip B2 were scanned at a different uniform ambient brightness from chip B1. Physically, $\beta_1$ is the magnification or dilation factor and $\beta_0$ is the translation factor. A scatter plot of $x_i$ versus $y_i$ would then look like a discretized line of slope $\beta_1$ with its vertical or y-intercept at $\beta_0$, provided that the $x_i$ are not all equal to the same constant. Clearly, if chips B1 and B2 were ideal duplicates, then the points $(x_i, y_i)$ would be aligned on a line of slope one through the origin.

This graphical representation of our supposedly replicated raw data suggests an intuitive way to resolve the hypothetical problem. Knowing the values $\beta_1$ and $\beta_0$, with $\beta_1 \neq 0$, the transformation $y_i \mapsto y_i' = (y_i - \beta_0)/\beta_1$ will correct this systematic difference in B2 with respect to B1, so that a plot of $x_i$ versus $y_i'$ is a line of slope one through the origin (0, 0).
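As a minimal sketch of the correction just described, the following Python snippet simulates a global linear shift with known constants and then inverts it. The intensity values, the constants 25.0 and 2.0, and the function name `correct_linear_shift` are hypothetical and chosen only for illustration; they are not taken from study S2.

```python
def correct_linear_shift(y, beta0, beta1):
    """Undo a global linear shift: map each y_i to (y_i - beta0) / beta1.

    beta1 is the dilation factor and beta0 the translation factor.
    """
    if beta1 == 0:
        raise ValueError("beta1 must be nonzero for the shift to be invertible")
    return [(yi - beta0) / beta1 for yi in y]


# Hypothetical reference intensities (standing in for chip B1).
x = [120.0, 450.0, 80.0, 1000.0, 305.0]

# Assumed, known shift constants for the simulated duplicate (standing in for B2).
beta0, beta1 = 25.0, 2.0
y = [beta1 * xi + beta0 for xi in x]

# After correction, the points (x_i, y_i') lie on the line of slope one
# through the origin, i.e. y_i' equals x_i.
y_corrected = correct_linear_shift(y, beta0, beta1)
```

In this idealized, noise-free setting the corrected values reproduce the reference intensities exactly; with real duplicates the constants are unknown and must be estimated from the data.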

In the general situation, one usually has no a priori knowledge of the slope $\beta_1$ and y-intercept $\beta_0$ of the $x_i$ versus $y_i$ scatterplot. Instead, the main task in this case is to determine the values $\beta_0$ and $\beta_1$. Normally, one supposes that the data set B2 is most likely to differ systematically from B1 by a linear transformation of the kind just described, and assumes that, B1 and B2 being duplicates, their scatterplot should ideally approximate, in the sense of least squares, a line of slope one through the origin (0, 0). Using standard linear regression techniques, one computes $\beta_0$ and $\beta_1$ for the equation of the line which passes through the data $(x_i, y_i)$ and which minimizes the sum of squared vertical or y-errors between this line and the observed data points $(x_i, y_i)$. In other words, this line, $f(x) = \beta_1 x + \beta_0$ say, should minimize $\sum_{i=1}^{N} (y_i - f(x_i))^2$. Solving for $\beta_1$ and $\beta_0$ using simple calculus and linear algebra gives us the dilation and translation factors,