Why do we need new techniques?

A first look at a typical genomic study might lead a quantitatively trained scientist, or even a biologically trained one, to ask a quite legitimate question: Why is this field not amenable to standard biostatistical techniques? After all, we are trying to understand the relationships among multiple variables and the mechanisms that those relationships reveal. Biostatistics has a long history of developing techniques for exactly this purpose: analyzing large studies with many cases and many variables. Such studies ask questions such as: What risk factors are associated with heart disease? Does smoking cause disease? What is the difference in survival between a group treated with one chemotherapeutic drug and a group treated with another? On the surface, these questions seem similar to many of those posed regarding genetic risk factors for acute and chronic disease.

Yet a review of the bioinformatics and functional genomics literature over the past 3 years reveals that most of the analyses have been performed using techniques borrowed from the computational sciences, and from the machine-learning community in particular. Why is this? There are several reasons, including academic parochialism, but perhaps the most substantive one is the essentially underdetermined nature of genomic data sets, as described below.

Figure 1.4 sketches the fundamental difference between a typical clinical study and a typical genomic study. A high-quality clinical study, such as the Nurses' Health Study [18] or the Framingham Heart Study [55], involves thousands to tens of thousands of cases, over which tens or even hundreds of variables are measured. In contrast, a typical genomic study has only tens or, exceptionally, hundreds of cases, but thousands of measured variables.
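To make the word "underdetermined" concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not an analysis from any study cited here: the data are synthetic random numbers, the study sizes (5000 cases by 50 variables, and 40 cases by 5000 variables) are chosen only to mimic the two shapes contrasted above, and the helper name `fit_pure_noise` is hypothetical. The outcome is generated independently of the predictors, so any apparent fit is spurious.

```python
import numpy as np

# Fixed seed so the illustration is repeatable; the data are synthetic.
rng = np.random.default_rng(0)


def fit_pure_noise(n_cases, n_variables):
    """Least-squares fit of a random outcome on random, unrelated predictors.

    Returns the numerical rank of the data matrix and the norm of the
    residual left after the fit.
    """
    X = rng.standard_normal((n_cases, n_variables))  # cases x variables
    y = rng.standard_normal(n_cases)                 # outcome unrelated to X
    coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    residual = float(np.linalg.norm(X @ coef - y))
    return rank, residual


# Clinical-study shape: many cases, few variables (sizes are assumptions).
clinical_rank, clinical_residual = fit_pure_noise(n_cases=5000, n_variables=50)

# Genomic-study shape: few cases, many variables (sizes are assumptions).
genomic_rank, genomic_residual = fit_pure_noise(n_cases=40, n_variables=5000)

print("clinical shape: rank", clinical_rank, "residual", clinical_residual)
print("genomic shape:  rank", genomic_rank, "residual", genomic_residual)
```

In the clinical shape, the 50 coefficients are constrained by 5000 cases, so the fit cannot explain an unrelated outcome and a large residual remains. In the genomic shape, the rank of the data matrix is capped by the 40 cases, leaving far more unknown coefficients than constraints: the noise is reproduced essentially exactly, and infinitely many other coefficient vectors would do so equally well. A perfect fit of this kind carries no evidence of a real relationship.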

