
where |M| denotes the determinant of the square matrix M. The normalization transformation then is y_i ↦ (y_i − b)/m, where m and b are the least-squares slope and intercept of the regression of B2 on B1. As an additional bonus, note that the mean of the normalized set

B2 and the mean of the reference set B1 are equalized, i.e., the normalized ȳ equals x̄. Verify this. Question: are their respective standard deviations the same?

Variations of N1 that have appeared in the literature include using a subset of genes for the regression calculation, for instance using only genes that have Present Abs Calls in both the B1 and B2 sets, or excluding data points (x_i, y_i) when either x_i or y_i lies below a threshold. Furthermore, if one supposes that the microarrays assay a large enough number of different genes, and that the subset of genes whose expression levels change significantly between conditions A and B is small compared to the total number of genes being measured, then one may also normalize pairs of experiments A_i, B_j which are not true biological replicates. A useful exercise: let us say that we have normalized data sets A1 and A2 via N1, each with respect to B1, to produce A1' and A2' separately. Is the linear regression of A1' versus A2' a line of slope one through (0,

0)? What happens when A1 and (A1 − A2), regarded as ordered vectors in R^N, are orthogonal to one another?
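As a concrete illustration, the N1 regression transform can be sketched in a few lines of pure Python. The data values below are invented toy Avg Diffs, not entries from the text's table, and the function name is ours; after the transform the mean of the normalized replicate coincides with the reference mean, as noted above.

```python
def regression_normalize(reference, replicate):
    """Fit replicate = m * reference + b by least squares, then map each
    replicate value y to (y - b) / m, the N1 transform described above."""
    n = len(reference)
    x_bar = sum(reference) / n
    y_bar = sum(replicate) / n
    sxx = sum((x - x_bar) ** 2 for x in reference)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(reference, replicate))
    m = sxy / sxx          # least-squares slope
    b = y_bar - m * x_bar  # least-squares intercept
    return [(y - b) / m for y in replicate]

b1 = [110.7, 250.0, 95.3, 300.2, 180.5]   # toy reference set B1
b2 = [141.2, 310.5, 120.8, 365.0, 228.1]  # toy replicate set B2
b2_norm = regression_normalize(b1, b2)
# The normalized replicate now has the same mean as the reference.
```

Because the fitted line passes through (x̄, ȳ), the transformed mean (ȳ − b)/m is exactly x̄, which is the equalization of means claimed above.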

Probabilistic distributions The second category of normalization schemes, N2, operates under the premise that the distribution of expression levels or Avg Diffs in each chip experiment, or more stringently in duplicate sets of experiments, should be identical. Consider again the data set B1 = {x_i}, i = 1, ..., N, from table 3.3.1. Several studies in the bioinformatics literature have applied the standard central limit theorem-type transformation x_i ↦ (x_i − x̄)/Δx on each experimental data set. Here Δx denotes the standard deviation of the B1 data set, so that the transformed data B1' will now have mean 0 and variance 1. Exercise: consider another data set B2 = {y_i}, i = 1, ..., N, that has been normalized similarly, y_i ↦ (y_i − ȳ)/Δy. Show that the linear regression of the transformed x values versus the transformed y values

is a line of slope r_xy (= Cov(x, y)/(Δx Δy)) through the origin (0, 0). Here Cov(·, ·) is the symmetric covariance function.
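The transform and the slope claimed in the exercise can be checked numerically. The sketch below uses invented toy data; after centering and scaling, the least-squares slope of one transformed set against the other reduces to the Pearson correlation r_xy of the raw data, and the fitted line passes through (0, 0).

```python
def zscore(data):
    """Centre a data set to mean 0 and scale it to variance 1."""
    n = len(data)
    mean = sum(data) / n
    sd = (sum((v - mean) ** 2 for v in data) / n) ** 0.5  # population SD
    return [(v - mean) / sd for v in data]

x = [110.7, 250.0, 95.3, 300.2, 180.5]   # toy B1 Avg Diffs
y = [141.2, 310.5, 120.8, 365.0, 228.1]  # toy B2 Avg Diffs
xs, ys = zscore(x), zscore(y)

# With both sets at mean 0 and variance 1, the least-squares slope of
# ys on xs is Cov(xs, ys) / Var(xs) = Cov(xs, ys), i.e., the Pearson
# correlation r_xy of the raw data, and the intercept vanishes.
slope = sum(a * b for a, b in zip(xs, ys)) / len(xs)
```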

Note that this approach of setting the first two moments of each data set to constants does not guarantee that all the transformed sets of intensities have the same distribution, unless the original data sets were gaussian-distributed to begin with, in which case they would be completely characterized by their first (mean) and second (variance) moments. As an aside, it may be useful to check whether an arbitrary data set is indeed gaussian-distributed. We can do this qualitatively with a quantile-quantile plot between the test data set and another data set of the same cardinality that is known beforehand to be gaussian-distributed. A gaussian-distributed data set can be generated using the pseudo-random number utility on most computing machines. These two data sets are first individually ordered by magnitude and then plotted pairwise; for example, if these ranked data sets of cardinality n looked like {s_i} and {t_i}, i = 1, ..., n, then we plot the pairs (s_i, t_i). If B1 is gaussian-distributed, then its quantile-quantile plot against a gaussian-distributed reference set will be linear.
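The quantile-quantile diagnostic just described can be sketched as follows, using the standard library's pseudo-random generator for the gaussian reference set. The data and the correlation shortcut (a numerical stand-in for eyeballing linearity of the plot) are illustrative assumptions.

```python
import random

def corr(pairs):
    """Pearson correlation of a list of (s, t) pairs."""
    n = len(pairs)
    sm = sum(s for s, _ in pairs) / n
    tm = sum(t for _, t in pairs) / n
    num = sum((s - sm) * (t - tm) for s, t in pairs)
    den = (sum((s - sm) ** 2 for s, _ in pairs)
           * sum((t - tm) ** 2 for _, t in pairs)) ** 0.5
    return num / den

def qq_pairs(test_data, seed=0):
    """Pair the magnitude-ordered test data with an equally sized,
    magnitude-ordered gaussian reference set, as described above."""
    rng = random.Random(seed)
    reference = sorted(rng.gauss(0.0, 1.0) for _ in range(len(test_data)))
    return list(zip(sorted(test_data), reference))

rng = random.Random(42)
gaussian_like = [rng.gauss(100.0, 15.0) for _ in range(500)]
pairs = qq_pairs(gaussian_like)
r = corr(pairs)  # near 1: the Q-Q plot of gaussian data is close to linear
```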

On equating the statistical moments of any two data sets, the reader should also note that even if two distributions share the same kth moments for all integral k, it does not follow that the two distributions are the same (or, in probabilistic parlance, almost surely equal). As a counterexample from Casella and Berger, the distributions with probability density functions f1(x) = (1/(x√(2π))) e^(−(log x)²/2) for x > 0, and f2(x) = f1(x)[1 + sin(2π log x)], have kth moments that agree for every integer k; yet we clearly see from their graphs that these are different distributions. Furthermore, Press et al. have observed that it is not uncommon to encounter real-life data sets that have finite means but arbitrarily large second moments. The reader ought to be aware of the existence of such pathological cases. These cases are not central to our discussion and the interested reader is referred to the cited references for details. Less stringent variations of this normalization scheme are also in common use; e.g., methods which equalize only the first moments (means) of intensity data sets, for instance y_i ↦ y_i · (x̄/ȳ) or y_i ↦ y_i − ȳ + x̄, where the mean of the data set {y_i}, i = 1, ..., n, is transformed to coincide with the mean of a reference set {x_i}, i = 1, ..., n, by a dilation or a translation, respectively.
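The moment-matching counterexample can be verified numerically. Under the substitution u = log x, the kth moment integral of f1 reduces to the integral of e^(ku) φ(u), with φ the standard normal density, and the perturbing sin term integrates to zero for every integer k. The quadrature sketch below assumes the reconstructed densities given above.

```python
import math

def kth_moment(k, perturbed=False, lo=-12.0, hi=16.0, steps=100_000):
    """kth moment of f1 (or of the perturbed f2) by trapezoidal quadrature.
    Substituting u = log x turns x^k f1(x) dx into e^{k u} phi(u) du, with
    phi the standard normal density; the f2 case multiplies the integrand
    by 1 + sin(2 pi u)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        w = math.exp(k * u - 0.5 * u * u) / math.sqrt(2.0 * math.pi)
        if perturbed:
            w *= 1.0 + math.sin(2.0 * math.pi * u)
        total += w if 0 < i < steps else 0.5 * w
    return total * h

for k in (1, 2, 3):
    m1 = kth_moment(k)
    m2 = kth_moment(k, perturbed=True)
    # Both match the closed form e^{k^2 / 2} even though f1 != f2.
    assert abs(m1 - math.exp(0.5 * k * k)) < 1e-6
    assert abs(m1 - m2) < 1e-6
```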

Housekeeping genes and spiked targets The N3 category of normalization devices works on the assumption that certain probes or targets have a known, constant hybridization behavior throughout all experimental conditions. For example, exact amounts of particular mRNAs, spiked into the target, have a previously known and deterministic effect on specific probes on the microarray under any condition. These special genes can function as housekeeping devices in any chip experiment. In this context, the normalization technique typically amounts to transformations on whole data sets that narrow or eliminate the statistical divergence in the expression intensity values or Avg Diffs of these housekeeping genes across experiments.
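A minimal sketch of an N3-style correction: rescale each chip so that its housekeeping signals match a fixed reference level. The gene identifiers, the single multiplicative factor, and the reference level are illustrative assumptions, not a published protocol.

```python
def housekeeping_scale(intensities, housekeeping_ids, reference_level):
    """Rescale one chip's intensities so that the mean signal of its
    housekeeping genes matches a fixed reference level."""
    hk_mean = (sum(intensities[g] for g in housekeeping_ids)
               / len(housekeeping_ids))
    factor = reference_level / hk_mean
    return {gene: value * factor for gene, value in intensities.items()}

# Toy chip: two commonly cited housekeeping genes plus one assayed gene.
chip = {"GAPDH": 800.0, "ACTB": 1000.0, "Gene9785": 141.2}
normed = housekeeping_scale(chip, ["GAPDH", "ACTB"], reference_level=1000.0)
# The housekeeping mean now equals the reference level.
```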

General assumptions and principles of normalization The normalization methods outlined above postulate a linear and systematic nature in microarray measurement errors, and assume that every target-probe complex behaves in a similar manner, i.e., that hybridization rates are equal and independent of the transcript sequence. More general normalization techniques, which are nonlinear or nonparametric, have been described [156, 182]. Alternative normalization techniques for microarray data sets in the literature have included the use of eigenvectors, a scatterplot smoother, normalizing by both sample and gene, and mapping the expression data into a real interval between 0 and 1.

Again, we emphasize that the choice of normalization procedure depends entirely upon the postulates, particularly the noise-related assumptions, that one makes for a set of microarray experiments. Recall that we originally visited the topic of normalization with the aim of resolving Q1, i.e., whether order relations (i.e., <, >, =) between inter-chip Avg Diffs for a gene, taken as is, imply the same comparison of the true expression levels of that gene across different chip experimental conditions. While normalization techniques like N1 attempt to correct global and deterministic measurement errors which have the form of a linear transformation, they do not resolve the stochastic component of the error, in other words, error or variation due purely to chance. Chance-type errors are often subsumed under the term noise and are almost always modeled as a stochastic process. Graphically, stochastic effects are reflected in the scatter pattern of data points about some line of regression, as one sees in figure 3.15. In accounting for noise, the genomicist will often make assumptions regarding the stochastic model. For instance, she or he could postulate that the repeat measurements are random variables drawn from intensity-dependent gaussian distributions whose means lie along the regression line and which possess a uniform standard deviation. Methods like N2 implicitly assume that each microarray data set is a set of samples drawn from a probabilistic distribution which is strongly characterized by its first k moments. This and other modeling assumptions about the noise can naturally be used to determine the significance of a change in the expression level of a gene, which we will discuss in an upcoming section on fold significance.
The reader is advised that even after applying any one of the normalization techniques above, it might not make sense to extrapolate order relations of the true expression levels across experiments or conditions from order relations on the numerical microarray data.

In view of this, one might begin to ask whether there is a weaker notion of inter-chip comparison. That is, while we cannot definitively claim that the true expression level for Gene 9785 under condition B2 is greater than under B1 simply because the Avg Diff value for Gene 9785 in B2 (141.2) is greater than in B1 (110.7), can we derive meaningful biological comparisons of expression levels from the data pair (110.7, 141.2) at all, or at least within some statistical or probabilistic framework?

Consider the scatterplot of the replicate data sets B1 and B2 with points (x, y), where x ∈ B1 and y ∈ B2. Assume that the replicate measurement for any gene is a random variable drawn from a gaussian distribution whose mean is expression-intensity dependent and which has a uniform standard deviation along the line of regression. In this situation, it is reasonable to expect the data points to be mainly clustered within a 1-standard-deviation envelope of the regression line. The outliers, or data points which lie outside this 1-SD envelope, may be interpreted as genes whose expression intensity report or Avg Diff reproduces poorly across replicate experimental conditions. So, even though an order relation between Avg Diff values does not imply a same-order relation between the corresponding true expression levels, the coordinates of a data point (x_i, y_i) can be an indicator of the reproducibility of a microarray assay, depending upon the postulated underlying noise distribution in the system.

Next, we consider the scatterplot for the data set pair (B1, A1), assume as before that measurement errors are gaussian-distributed about a regression line, and, as above, define the outliers as data points that lie outside the 1-SD envelope of the regression line. A naive interpretation of the outliers in the (B1, A1) plot which are not outliers in the (B1, B2) plot is that these are genes whose expression change in going between states A and B is less likely to be due to measurement error or noise. We can also naturally assign a magnitude to this expression change by counting the number of standard deviations by which an outlier in the (B1, A1) plot lies away from the regression line. This subject rightly belongs under the topic of gene expression change, or fold significance, which is explored in the next section.
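The 1-SD-envelope rule described in this passage can be sketched as follows; the toy data (a single poorly reproducing gene) and the use of the residual standard deviation as the envelope width are illustrative assumptions.

```python
import math

def sd_envelope_outliers(x, y, n_sd=1.0):
    """Indices of points whose vertical residual from the least-squares
    line of y on x exceeds n_sd residual standard deviations."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    m = (sum((a - xb) * (b - yb) for a, b in zip(x, y))
         / sum((a - xb) ** 2 for a in x))
    c = yb - m * xb
    residuals = [b - (m * a + c) for a, b in zip(x, y)]
    sd = math.sqrt(sum(r * r for r in residuals) / n)
    return [i for i, r in enumerate(residuals) if abs(r) > n_sd * sd]

x = list(range(10))          # toy reference Avg Diffs
y = [float(v) for v in x]    # replicate tracks the reference ...
y[5] += 10.0                 # ... except one poorly reproducing gene
outliers = sd_envelope_outliers(x, y)  # flags only the perturbed point
```

The same function applied to a (B1, A1) pair, with the returned residual counts, is the starting point for the fold-significance discussion of the next section.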