## Fold calculation and significance

The discussions in this section refer to study S2. A prototypical question is whether the expression level of Gene i, as represented by its Avg Diffs, differs significantly between states A and B. By significance one almost always means statistical significance: specifically, a difference that is unlikely to be due to measurement error or noise, where noise is modeled as a random process characterized by a postulated probability distribution that is in turn derived either from the replicate data or from some null hypothesis. The delicate point lies in deciding on a consistent and well-defined criterion for statistical significance and, more fundamentally, in choosing a null hypothesis.

As with the normalization procedure, diverse approaches exist, reflecting different assumptions and postulates about the initial condition of the biological system, the behavior of the target-probe complex, and the constitution and properties of noise. As detailed in [144] for a more general framework, these approaches typically start by computing from the empirical data a set of statistics, such as means, variances, and kth moments, which define a test distribution. One then chooses a suitable null distribution and calculates where the test statistics fall within it, i.e., compares the test and null distributions. If the test statistic falls in a probabilistically unlikely region of the null distribution, one may conclude that the null hypothesis is false for the data set, which in this context means that the data or test distribution is statistically different from the postulated noise or null distribution. As emphasized in [144], the reader should bear in mind that one can only disprove a null hypothesis, never prove it. That is, the fact that a data statistic falls within a probabilistically likely region of the null distribution does not imply that the fold is equivalent to the postulated noise effect. Sample size also matters. When the number of samples used for calculating a statistic is reasonably large, a difference of means that is less than the standard deviation may be significant, whereas when samples are sparse, a difference of means that is much larger than the standard deviation may not be significant. This fact is relevant because the generic microarray study typically uses only a small number of expression measurements per gene, corresponding to different experimental or replicate conditions, to calculate the statistic for that gene.
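The dependence on sample size can be made concrete with a small sketch. It assumes two groups of equal size n with a common standard deviation and uses the classical pooled two-sample t statistic; the function names and the numbers are illustrative only and are not taken from study S2. The critical values are the standard two-sided 5% table values of Student's t for 2 and 198 degrees of freedom.

```python
import math

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom
# (standard table values; only the two cases used below are listed).
T_CRIT = {2: 4.303, 198: 1.972}


def pooled_t(mean_diff, sd, n):
    """Two-sample t statistic for two groups of size n sharing one SD."""
    return mean_diff / (sd * math.sqrt(2.0 / n))


def is_significant(mean_diff, sd, n):
    """Compare |t| with the tabulated critical value for 2n - 2 df."""
    df = 2 * n - 2
    return abs(pooled_t(mean_diff, sd, n)) > T_CRIT[df]


# Many samples: the means differ by only half an SD, yet t is about 3.54.
large = is_significant(mean_diff=0.5, sd=1.0, n=100)

# Sparse samples: the means differ by three SDs, yet t is only 3.0.
sparse = is_significant(mean_diff=3.0, sd=1.0, n=2)

print(large, sparse)
```

With 100 samples per group the half-SD difference clears the critical value, while with two samples per group the three-SD difference does not.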

From study S2, suppose that the data sets $\{A_j\}_{j=1}^{M_A}$ and $\{B_j\}_{j=1}^{M_B}$ have been suitably normalized following any of the methods outlined in the last section. To avoid cluttering the notation, we will use the same symbols $A_j$, $B_j$ to denote the un-normalized and normalized data sets; the assignments should be clear from the context. Let $A_j^i$ denote the Avg Diff for Gene i as assayed by the microarray $A_j$. Define $B_j^i$ likewise. First, there are several ways to compute the fold statistic, e.g., the mean, of Gene i ($i = 1, 2, \ldots, N$):

• F3. More generally in F2: … where $\sigma$ is any permutation on the set $\{1, 2, \ldots, M\}$.

Some studies have alternatively chosen to compute the fold of Gene i from states A to B as … instead of … as above. To resolve the problem of non-positive-valued Avg Diffs, it is commonplace in the literature, before taking arithmetic ratios, to threshold the Avg Diff values in every chip data set at an arbitrarily chosen minimum positive number, e.g., by setting all Avg Diffs of less than 10.0 to 10.0. Alternatively, some studies translate all the intensity data in a chip data set so that the minimal translated Avg Diff in each set is positive. For example, if the minimum element in the A2 data set is -2000.7, then adding 2001.7 to every Avg Diff reading in A2 makes all the translated Avg Diffs positive-valued. Note that such solutions skew the intensity statistics of the microarray data, which is not always desirable.

For symmetry reasons, the logarithms of the folds, rather than the folds themselves, are averaged. Consider the un-normalized data for Gene 1: from A1 to B1 its intensity changed 2.0-fold, whereas from A2 to B2 it changed 0.5-fold, so that on average the intensity of Gene 1 should intuitively be unchanged, i.e., have a fold change of 1.0. However, the arithmetic average of 2.0 and 0.5 is 1.25, not 1.0. A logarithmic transformation of the individual folds solves this problem, as does the use of the geometric rather than the arithmetic mean of the folds: the logarithms of 2.0 and 0.5 are equal in magnitude and opposite in sign, so they average to zero, which corresponds to a fold of 1.0.
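The two fixes for non-positive Avg Diffs and the symmetry argument for logarithms can be sketched as follows. The function names, the base-2 logarithm, and the sample readings other than -2000.7 are assumptions made for illustration; the floor of 10.0 and the target minimum of 1.0 mirror the examples in the text.

```python
import math


def threshold(values, floor=10.0):
    """Raise every Avg Diff below `floor` to `floor`."""
    return [max(v, floor) for v in values]


def translate(values, new_min=1.0):
    """Shift all Avg Diffs so that the smallest one equals `new_min`."""
    shift = new_min - min(values)
    return [v + shift for v in values]


def mean_log2_fold(folds):
    """Average of the base-2 logarithms of the individual folds."""
    return sum(math.log2(f) for f in folds) / len(folds)


# Hypothetical chip readings; -2000.7 is the minimum used in the text.
a2 = [-2000.7, 35.2, 410.0]
print(threshold(a2))   # negative reading is raised to 10.0
print(translate(a2))   # every reading is shifted up by 2001.7

# Folds of 2.0 and 0.5 should cancel out on average.
folds = [2.0, 0.5]
arithmetic = sum(folds) / len(folds)      # 1.25
mean_log = mean_log2_fold(folds)          # 0.0
geometric = 2.0 ** mean_log               # 1.0
print(arithmetic, mean_log, geometric)
```

Averaging in log space and exponentiating back is the geometric mean, which is why the two remedies named in the text agree.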

As an exercise, the reader should verify that the order of taking logarithmic ratios in F2 and F3 does not change the resulting fold, $f_{\text{Gene } i}$. After computing the fold of all genes in the data set in any one of the preceding ways, one typically wants to know whether a fold average or a fold distribution is statistically different from a postulated null distribution that represents the effects of measurement error or noise. As already noted, it is essentially the choice of the null hypothesis that distinguishes the different methods for determining fold significance. Some studies associate the null hypothesis with an interval of nonsignificance, informally called a noise envelope. The distribution for the null hypothesis has also been calculated from permutations of the microarray data set, especially when the number of replicate data points is small. In general, replication improves the estimation of the null distribution. Below, we outline several common null distributions:

• G1. A null distribution for each Gene i whose mean is obtained by averaging latitudinally across duplicates.

• G2. A null distribution for all genes whose mean is obtained by averaging longitudinally across all N genes.

• G3. For each gene row, permute the intensity data, for instance by exchanging data between conditions A1 ↔ B2, etc., and then recalculate G1.
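The permutation idea in G3 can be illustrated with a generic label-permutation sketch for a single gene. This is an assumption-laden stand-in, not the formula of G1 or G3: the statistic (difference of mean log2 intensities), the exhaustive enumeration of relabelings, and the intensity values are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations
from statistics import fmean


def mean_log2_fold(a_values, b_values):
    """Mean log2 intensity in B minus mean log2 intensity in A."""
    log_a = [math.log2(a) for a in a_values]
    log_b = [math.log2(b) for b in b_values]
    return fmean(log_b) - fmean(log_a)


def permutation_null(a_values, b_values):
    """Statistic under every relabeling of the pooled data into A and B."""
    pooled = list(a_values) + list(b_values)
    indices = range(len(pooled))
    null = []
    for a_idx in combinations(indices, len(a_values)):
        chosen = set(a_idx)
        perm_a = [pooled[k] for k in a_idx]
        perm_b = [pooled[k] for k in indices if k not in chosen]
        null.append(mean_log2_fold(perm_a, perm_b))
    return null


def permutation_p_value(a_values, b_values):
    """Fraction of relabelings at least as extreme as the observed one."""
    observed = abs(mean_log2_fold(a_values, b_values))
    null = permutation_null(a_values, b_values)
    extreme = sum(1 for s in null if abs(s) >= observed - 1e-12)
    return extreme / len(null)


# Hypothetical Avg Diffs for one gene, three replicates per state.
a = [100.0, 120.0, 110.0]
b = [400.0, 380.0, 420.0]
print(len(permutation_null(a, b)), permutation_p_value(a, b))  # 20 0.1
```

Even though every B reading exceeds every A reading, three replicates per state allow only 20 relabelings, so the two-sided p-value cannot fall below 0.1. This illustrates why replication improves the estimation of the null distribution.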

A coarse, qualitative method may combine F1 and G2 to decide that a Gene i with $f_{\text{Gene } i} > \max(\mu_A, \mu_B)$ or $f_{\text{Gene } i} < -\max(\mu_A, \mu_B)$ is significantly different, foldwise, from the average fold statistic resulting from noise as calculated from duplicate conditions. Another approach is to rank the N genes by their $f_{\text{Gene } i}$ value from F2 and to decide that the top and bottom 5% of these ranked genes are significantly changed.
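Both coarse rules can be sketched in a few lines. The sketch assumes that a log fold has already been computed for every gene; the gene names, the fold values, and the single symmetric `envelope` parameter (standing in for the noise-derived bound) are hypothetical.

```python
def flag_by_envelope(folds, envelope):
    """Genes whose log fold lies outside the interval [-envelope, +envelope]."""
    return sorted(g for g, f in folds.items() if abs(f) > envelope)


def flag_by_rank(folds, fraction=0.05):
    """Bottom and top `fraction` of genes after ranking by log fold."""
    ranked = sorted(folds, key=folds.get)
    k = max(1, round(fraction * len(ranked)))
    return ranked[:k], ranked[-k:]


# Hypothetical log2 folds for 40 genes, evenly spread from -1.95 to +1.95.
folds = {f"gene{i:02d}": (i - 19.5) / 10.0 for i in range(40)}

down, up = flag_by_rank(folds)                  # 5% of 40 genes is 2 per tail
outside = flag_by_envelope(folds, envelope=1.8)
print(down, up, outside)
```

Note that the rank-based rule always flags a fixed fraction of the genes, whether or not any gene has truly changed, whereas the envelope rule may flag none.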

Equally diverse non-fold-centered methods exist for determining the significance of a change in expression. The intuitive idea behind all these approaches is to call a change statistically significant if the expression change between states A and B is maximal while the expression change within each state, i.e., within the replicate conditions $\{A_j\}_{j=1}^{M_A}$ and $\{B_j\}_{j=1}^{M_B}$, respectively, is minimal. Again, the choice of a null distribution representing the postulated noise distinguishes these methods. Because these methods are not fold-based, traditional parametric and nonparametric statistical tools for analyzing means or variances between data groups, such as the $\chi^2$, Student t, Mann-Whitney, and F tests and ANOVA, have been applied in the literature. These methods are reviewed comprehensively in [144], and the reader should be aware of the implicit assumptions and underlying null hypotheses of these standard tests before applying them. A drawback of these traditional approaches is that their conclusions are "asymptotic", i.e., they are statistically valid only when one has a large number of replicate data.

At the end of the normalization section 3.4, we briefly described a way to visualize graphically the determination of a significant difference in expression data between A1 and B1. To reiterate, on the scatterplot of the un-normalized data sets A1 versus B1, we compute and draw the linear regression line. If chip assay reproducibility is reasonably robust, and if we assume that the majority of genes do not undergo a dramatic change in expression level in going from state A1 to B1, then the data points should cluster close to this regression line, which should have slope one and pass through (0, 0). Our object of interest is the envelope that extends one standard error or standard deviation away from the regression line. This 1-SD envelope may be defined by a pair of lines that lie at a fixed horizontal distance to the left and right of the regression line $f(x) = \beta_1 x + \beta_0$.

The intuitive reasoning here is that a data point outside this envelope is a statistically unlikely event relative to a postulated stochastic distribution, usually Gaussian, of the data around the regression line due purely to noise (chance). Conversely, such an outlying data point represents a gene that has undergone a statistically significant expression change from A1 to B1.
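The regression envelope can be sketched with an ordinary least-squares fit. As a simplification, the sketch measures the envelope by vertical residuals rather than by the horizontal distance described above, and it estimates the spread as the residual standard deviation with n - 2 degrees of freedom; the function names, the width parameter `k`, and the ten data points are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def outside_envelope(xs, ys, k=1.0):
    """Indices of points lying more than k residual SDs from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sd = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > k * sd]


# Hypothetical Avg Diffs for ten genes on chips A1 and B1. Nine genes stay
# within one unit of the diagonal; the gene at index 4 rises from 50 to 80.
a1 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
b1 = [11, 19, 31, 39, 80, 59, 71, 79, 91, 99]

flagged = outside_envelope(a1, b1)
print(fit_line(a1, b1), flagged)
```

Only the gene at index 4 falls outside the 1-SD envelope here. Because the outlier itself inflates the residual SD and tilts the fitted line slightly, a plain least-squares envelope is conservative, which is one reason the choice of noise model matters.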