[Figure: two panels, "A typical clinical study" and "A typical genomic study"; the clinical panel involves variables numbering in the tens to hundreds.]

Figure 1.4: A major difference between classic clinical studies and microarray analyses. The high dimensionality of genomic data, in contrast to the relatively small number of samples typically obtained, results in a highly underdetermined system.

Initially, the low number of cases in functional genomic investigations may have been due to the high cost of the microarrays (on the order of several thousand dollars per microarray), but increasingly the scarcity of cases in a typical functional genomics study reflects the scarcity of appropriate biological samples. Because these experiments measure gene expression, a particular tissue has to be obtained under the right conditions. This is in contrast to studies of genomic DNA, for which most blood samples will suffice. Especially in human populations, suitable tissue samples are relatively rare.[5]

Yet even though there are only tens of cases, each case involves measurements of tens of thousands of variables, corresponding to the expression of the tens of thousands of genes measurable with microarray technology. The result of this large number of variables relative to the number of cases is a highly underdetermined system: we are making measurements of very high dimensionality (on the order of tens of thousands) while providing only a small number of cases with which to explore this high-dimensional space. Put another way, given the relatively small number of observations, there are many, many ways in which the measured variables could be interrelated mechanistically. Because of this high dimensionality and the underdetermined nature of these systems, standard biostatistical techniques do not hold up well, since many of the assumptions that underlie them are violated.[6]

We often offer the following analogy. To solve for one unknown (e.g., 4x = 5), only one equation is needed. To solve for two unknowns (e.g., x + y = 5 and x - y = 1), two equations are required. If we have tens of thousands of unknowns but only hundreds of equations, then there will be infinitely many valid solutions. This is the essence of an underdetermined system. In this context, we must use techniques that can maximally inform us of the relationships between the variables of interest (and identify which ones are of interest) despite the underdetermined nature of the data sets.
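The analogy above can be sketched numerically. In the following minimal NumPy illustration, three "equations" (samples) constrain ten "variables" (genes); the matrices are randomly generated stand-ins rather than real expression data, and the point is only that infinitely many coefficient vectors fit such data equally well:

```python
import numpy as np

# Hypothetical toy data: 3 samples (equations) but 10 genes (unknowns).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 10))   # design matrix: rows = samples, cols = variables
b = rng.normal(size=3)         # observed outcomes, one per sample

# Least squares picks one solution (the minimum-norm one) out of infinitely many.
x_min_norm, *_ = np.linalg.lstsq(A, b, rcond=None)

# Any vector in the null space of A can be added without changing the fit;
# with 3 equations and 10 unknowns there are 7 such unconstrained directions.
null_basis = np.linalg.svd(A)[2][3:]
x_alt = x_min_norm + 5.0 * null_basis[0]

print(np.allclose(A @ x_min_norm, b))  # True: this solution fits the data
print(np.allclose(A @ x_alt, b))       # True: a very different solution fits equally well
```

With only three observations, the data cannot distinguish between these (and infinitely many other) candidate relationships among the ten variables, which is exactly the predicament described above at the scale of tens of thousands of genes.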
High-dimensional data sets are well known to the "machine-learning" community of computer scientists from applications such as automated recognition of human faces, so it is not surprising that many of the techniques developed by that community have found their way into the functional genomics enterprise.

[1]For those of you who are unfamiliar with these terms, we define them in section 1.5 and in the glossary.

[2]Using the common shorthand title for the international public and private effort to determine the sequence of bases in the human genome.

[3]The characteristics of the various microarray technologies are addressed in chapter 3.

[4]The label applied often seems to be determined more by the training background of the labeler than by any fundamental characteristic of the analytic technique.

[5]See chapter 2 for a discussion of which tissues are appropriate for particular experiments.

[6]Although this holds true for most of the biostatistical techniques biologists will have learned in graduate school, in fairness there has been quite a lot of research by statisticians on the analysis of underdetermined systems of high dimensionality. Their work has just not found its way into mainstream biomedical study until recently.