Informatics and advances in enabling technology

Gene expression detection microarrays are notable not because they can uniquely measure gene expression; many technologies have allowed the quantitative or semiquantitative measurement of expression for well over two decades. What distinguishes gene expression detection microarrays is that they can measure tens of thousands of genes at a time, and it is this quantitative change in the scale of gene measurement that has led to a qualitative change in our ability to understand regulatory processes occurring at the cellular level. Figure 1.3 provides perhaps the best motivation for the application of information sciences to the functional genomics enterprise. Since DNA sequencing was invented 25 years ago, the number of gene sequences deposited in international repositories, such as GenBank, has grown exponentially, culminating with the entire human genome being sequenced in 2001. By contrast, the amount of knowledge about these genes (as measured by the proxy of the number of papers published in biomedicine) has also been growing exponentially, but at a much slower rate. As shown, the growth in the number of GenBank entries has fast outstripped the growth of MEDLINE. This divergence serves as a proxy for the large gap that has opened up between raw genomic data and our knowledge of the functioning of the genome. Moreover, GenBank entries are just a fraction of the various kinds of data, listed above, generated as part of our investigation of the genome. This volume of data must somehow be sifted and linked to the biological phenomena of interest. Doing so exhaustively, reliably, and reproducibly is a plausible strategy only with the application of algorithms implemented on computers. This has led to an unprecedented demand for investigators who know how to manipulate and analyze large data sets. These skills may come from education and training in computational physics, chemical engineering, operations research, or financial modeling, but once they are applied to the domain of functional genomics, they can be collectively described as belonging to the domain of bioinformatics.

Figure 1.3: Relative growth of MEDLINE and GenBank. The industrialization of genomic data acquisition has created a growing gap between knowledge (of which MEDLINE publications are a proxy) and the information we have gathered about the genome. Information science applied to this information (i.e., bioinformatics) is one of the pillars of an international strategy to overcome this knowledge gap. Cumulative growth of molecular biology and genetics literature (light gray) is compared here with DNA sequences (dark gray). Articles in the G5 (molecular biology and genetics) subset of MEDLINE are plotted alongside DNA sequence records in GenBank over the same time period (horizontal axis: year, 1975 to 1995). (Derived from Ermolaeva et al. [65].)

One reason the number of known sequences is growing so much faster is the development and use of many automated techniques, such as automated sequencers and shotgun sequencing methods. Until the recent advent of gene expression microarrays, we did not have a similar technique to automate the acquisition of knowledge about these genes' behavior in cellular physiology. The past 5 years have seen an incredible confluence of disparate technologies, such as robotics, fluorescence detection, photolithography, and the Human Genome Project, so that today biologists can use RNA expression microarray detection technologies to obtain near-comprehensive expression data for individual cells, tissues, or organs in various states. With currently available commercial tools, a single experiment using RNA expression detection microarrays can now provide systematic quantitative information on the expression of 60,000 unique RNAs within cells in any given state. Complementary DNA (cDNA) and oligonucleotide microarray technology can be used for more than determining the abundance of RNA transcripts. By virtue of their broad reach, these measurement platforms permit a large number of exhaustive comparisons of transcriptional activity: across different tissues in the same organism, across neighboring cells of different types in the same tissue, and across groups of patients with and without a particular disease or with two different diseases. These platforms can also be used to analyze complex systems, such as traits with multigenic origins or those linked to the environment. They can be used in time series to measure how a particular intervention may start a transcriptional program, i.e., change the expression of large numbers of genes in a reproducible pattern determined by inherent genetic regulatory networks. With sufficient data, they can even be used to provide insight into the underlying mechanisms of these genetic regulatory networks.

Nonetheless, the tools to extract knowledge from the data collected in all of these types of experiments are still in their infancy, and novel tools are still needed to sift through the enormous databases of simultaneous RNA expression to find the true nuggets of related function. The application of techniques of information science, computer science, and biostatistics to the challenge of knowledge acquisition from genomic data is commonly known as bioinformatics. This appellation applies to the quantitative and computational analysis of all forms of genomic data, including gene sequence, protein interactions, protein folding, and any observable or measurable phenomenon of interest to the biomedical researcher. The breadth of this commonly used definition of bioinformatics risks relegating it to the dustbin of labels too general to be useful, of which artificial intelligence, knowledge management, and systems analysis are only among the more recent. In this book our intent is to be specific enough about the bioinformatics techniques employed that the question of a suitably broad and yet specific definition of bioinformatics becomes moot.

Over the past 6 years, several approaches have been developed to analyze microarray-generated RNA expression data sets. The central hypothesis (or hope) of these methods is that, with improved techniques in bioinformatics, one can analyze larger data sets of measurements from RNA expression detection microarrays to discover the "true" biological functional pathways in gene regulation, and develop more definitive, sensitive, and specific diagnostic and prognostic characteristics of disease. However, this is only one of many important areas of bioinformatics addressed in this book. Particularly because we are still in the immediate aftermath of the Human Genome Project, many of the basic naming and data management practices of functional genomics remain in flux and are active areas of bioinformatics development. Although this activity is quite distinct from the analytic efforts touched on above, it currently occupies perhaps the largest share of the bioinformatics community, because its resolution is urgent and a sine qua non for the success of any of the analytic efforts. After all, if we cannot reliably name the same gene in identical fashion across experiments, if we cannot reliably retrieve expression data from all the microarray experiments of interest, and if we cannot readily access the meaning and function of genes determined by thousands of researchers, then the whole enterprise of functional genomics will be crippled, if not rendered intractable.

A discipline related to bioinformatics, clinical informatics, applies information science to various aspects of clinical care. Although clinical informatics is not addressed in this book in detail, in chapter 6 we describe many of the problems that have dogged clinical informaticians and that will confront bioinformaticians as they attempt to bring their basic science findings to clinical relevance.
