There is little doubt that one of the tremendous accomplishments of the Human Genome Project is that it has enabled a rigorous computational approach to many questions of interest to the biological and clinical community at large. However, the danger of a computational triumphalism is that it makes several dubious assumptions. The first is genetic reductionism. At an abstract level, most bioinformaticians understand that a particular physiology or pathophysiology is the product of the genetic program created by the genome of an organism and its interaction with the environment throughout its development and senescence. In practice, however, a computationally oriented investigator often assumes that all regulation can be inferred from DNA sequence, based solely on the syntax of its sequence elements. That is, it is assumed that one can predict whether a nucleotide change in a sequence will result in a different physiology. We refer to this as "sequence-level reductionism."
The second dubious assumption is the computability of complex biochemical phenomena. One of the most venerable branches of bioinformatics involves modeling molecular interactions such as the thermodynamics of protein folding, and protein-protein and protein-nucleic acid interactions. As yet, all the combined efforts and expertise of bioinformaticians have been unable to provide a thermodynamically sound folding pattern of a protein in the heterogeneous solvent environment of a cell for even as long as one microsecond. Furthermore, studies by computer scientists over the last 10 years [23,114] suggest that the protein-folding problem is "NP-hard." That is, the computational challenge belongs to a class of problems that are believed to be computationally intractable. It therefore seems overly ambitious to imagine that within the next decade we will be able to generate robust models that accurately predict the interactions of thousands or millions of heterogeneous molecules and the ways in which they modulate the transcription of RNA, the translation of messenger RNA (mRNA) into protein, and the subsequent functions of these proteins. We refer to this ambition as "interactional reductionism."
This is not to say that models have no useful role in molecular biology or bioinformatics. On the contrary, they are extremely useful for embodying what we currently believe we know about biological systems. The points at which the predictive capabilities of these models break down indicate where we should direct further research. Also, the educational value of such models should not be underestimated.
The final questionable assumption is the closed-world hypothesis. Both sequence-level reductionism and interactional reductionism are predicated upon the availability of a reliable, predictive, and complete mechanistic model. That is, if a fertilized ovum can follow the genetic program to create a full human being after 9 months, then surely a computer program should be able to follow the same genetic code to deterministically infer all the physiological events that are determined by the genetic code. Indeed, there have been several efforts, such as the E-cell effort of Tomita et al., which aimed to provide robust models of cellular function based on the known regulatory behavior of cellular systems. Although such models have important utility, our knowledge of all the pertinent parameters for these models appears grossly incomplete today. These parameters are required to describe intracellular processes, intercellular processes, and the unimaginably large repertory of possible environmental interactions with both sets of processes. This incompleteness, and the lack of knowledge of where the boundaries are between the complete and the incomplete, imply that these models will have behaviors that may diverge substantially and unpredictably from those that actually occur.
These caricatured positions of the traditional molecular biologists and the computational biologists are, of course, overdrawn. When prompted, most of these investigators will articulate the fullness of the complexities of the analytic tasks of functional genomics. In the conduct of their research or even in the discussions within their publications, these same investigators will nonetheless often retreat to the simplifications and assumptions described above. This may be because they take for granted that their colleagues and readers understand these simplifications, but such unstated assumptions can often misdirect novices in this discipline.
What we argue for, and hope that this book communicates, is the necessity for a rapid generate-and-test paradigm that cuts repeatedly across the disciplines of genetics, computational biology, and molecular biology. Operationally, this means that bioinformatics tools can be used to guide the investigations of an experimental biologist studying a particular biological system or disease process. But for even the smallest assumption, rather than relying on the statistical association or predicted behavior of a system, empirical evidence has to be developed to support it. It is only through this incremental accretion of evidence that the discipline of functional genomics can become a science.
Why microarrays?
Why focus on microarrays? After all, there are many other ways to impute function to genes. Using only genomic DNA easily obtainable from a peripheral blood sample or a buccal smear, genetic epidemiologists can conduct association studies using microsatellite markers or polymorphisms to associate prognoses, diagnoses, and even biological function with a particular gene [67,126]. And then there are the more conventional genetic techniques of transgenic and misexpression whole-organism models of the function of various genes. Even more recently, the feasibility studies for proteomic assays suggest that in the future we will directly be able to assess changes in protein concentration at the cellular level.
In contrast to linkage and association studies, microarray studies are designed in principle to measure directly the activity of the genes involved in a particular mechanism or system rather than their association with a particular biological or clinical feature. An association-linkage-genetic epidemiology study relies upon a long, indirect, probabilistic causal chain: that a change in DNA sequence results in a change in gene regulation or protein structure, resulting in a change in cellular physiology measurable as a change in a whole-organism profile (e.g., the human phenotype). For some changes in genomic sequence, particularly in the instances of multigenic regulation, the effects may be so small that no conceivable population study would be able to detect them. Also, the cost of screening the genome of sufficiently large populations to achieve adequate statistical power has prohibited all but the most focused association studies (although this is likely to change). Unlike large-scale proteomic assay systems in their current state of the art and engineering, gene microarrays are currently affordable and in many applications have acceptable reproducibility and accuracy.
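To make the point about statistical power concrete, the following is a minimal sketch of how the required number of subjects grows as the effect of a sequence change shrinks. It assumes a simple case-control design that compares the frequency of a variant between two equally sized groups, and it uses the standard normal-approximation sample-size formula for a two-sided test of two proportions. The function name `samples_per_group` and the carrier frequencies shown are hypothetical illustrations, not values taken from any study.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_group(p_cases, p_controls, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect a difference in
    variant frequency between cases and controls.

    Uses the normal-approximation formula for a two-sided test of two
    proportions with equal group sizes. All inputs are illustrative
    assumptions supplied by the caller.
    """
    if p_cases == p_controls:
        raise ValueError("frequencies must differ for a detectable effect")

    # Critical values for the chosen significance level and power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)

    # Spread of the difference under the null (pooled) and the alternative.
    p_bar = (p_cases + p_controls) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p_cases * (1 - p_cases) + p_controls * (1 - p_controls))

    n = ((z_alpha * null_sd + z_power * alt_sd) / (p_cases - p_controls)) ** 2
    return ceil(n)


if __name__ == "__main__":
    # Hypothetical carrier frequencies: controls fixed at 20%.
    for p_cases in (0.30, 0.25, 0.21):
        n = samples_per_group(p_cases, 0.20)
        print(f"cases {p_cases:.2f} vs controls 0.20: {n} subjects per group")
```

Under these assumptions, the required sample size scales roughly with the inverse square of the frequency difference, so a tenfold smaller difference calls for about a hundredfold more subjects. Lowering `alpha` to correct for the many markers tested in a genome-wide scan raises the requirement further.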
Another aspect of an integrative genomics
Notwithstanding these apparent advantages of expression microarray studies, as we discuss in sections 1.5.3 and 1.5.2, there are several kinds of information that we are missing by not including such measurements. Our decision to restrict the scope of this book to the exploration of functional genomics and genomic medicine from the perspective of microarray technology is then largely a pragmatic one. Expression microarrays are sufficiently well engineered and cost-effective to allow thousands of researchers to productively employ them to drive their investigations. If, in the future, as we expect, massively parallel measurements of individual proteins become cost-effective, large-scale, and highly reproducible, then we will certainly expand the analysis to address these methodologies. The same will be true when high-resolution (i.e., every kilobase) genome-wide scans of hundreds of individuals become economically feasible for most clinical research studies. Current estimates have these technologies available on the genomic and population scale within 5 years. A well-prepared genomic investigator will have prepared the pipeline to take advantage of all these measurement technologies.
From the computational perspective, measurements of any analyte, whether it be an inorganic constituent of serum, an RNA transcript, or a protein, are all simply point measurements of variables corresponding to the total state of the cell. Likewise, all clinical measurements (e.g., height, blood pressure) and history (e.g., age of menarche, time of cancer diagnosis) are point measurements corresponding to the total state of the organism (e.g., human). It is only in the important details of the quality and meaning of these measurements that they differ. This is said both with tongue planted firmly in cheek and in all seriousness.
And indeed, this is the other aspect of an integrative genomics: to include as many modes of data measurement as are available. Each mode reflects another aspect of cellular and organismal physiology, each with its own set of specificities and sensitivities with respect to a phenomenon or process of interest. The role of bioinformatics in an integrative genomics is to provide both the glue bringing together all kinds of genomic and phenotypic data and the means to extract knowledge (or at least high-yield hypotheses for subsequent testing) from them in an efficient, large-scale, and timely fashion. It is also the inspiration for the title of this book.
Even the most basic of the original conclusions stated in the publications heralding the completion of the draft of the human genome, namely the order of magnitude of the number of genes in the human genome, now seems to be in contention again.
That is, there would be insufficient numbers of individuals with the necessary constellation of phenotypes across the entire human population. At the same time we recognize that there are several diseases, such as sickle cell anemia, where the change of one base in the hemoglobin gene results in a severe and unsubtle disease phenotype.