Missing the Forest for the Dendrograms, or One Aspect of Integrative Genomics

In the first 2 years of significant publications regarding the large-scale application of microarray technologies, numerous special-purpose or adapted machine-learning algorithms were described in the literature. Self-organizing maps [175], dendrograms [27,63,76], K-means clusters [101], support vector machines [33,72], neural networks [107], and several other methodologies (borrowed largely from the machine-learning community of computer science) have been employed. Most of these have worked reasonably well for the purposes described in the papers. It is one of the central contentions of this book, however, that the choice of a particular clustering or classification methodology is secondary to proper experimental design and to full knowledge of the properties and limitations of massively parallel expression analysis in general and of the specific microarray technologies employed. This contention does not constitute an overly fastidious approach to functional genomics; it reflects the insight gleaned from (often expensive) experience: our own mistakes, those of our colleagues, and the reported investigations. A technique-centric approach to bioinformatics has been evident in our own collaborations. Often, at the outset of a new investigation, our collaborators from both the computational community and the biological community immediately wish to address the questions of which clustering techniques are appropriate and which one is "the best." While we certainly recognize that the choice of clustering or classification technique is important (and we devote chapter 4 to this matter), we are firmly convinced that it is only one part of a well-designed pipeline that defines a successful exploration of biology and medicine using microarray technology. This pipeline is diagrammed in figure 1.5 on page 17. We refer to this functional genomics pipeline throughout the book. We discuss at length the practicalities of assembling this pipeline in subsequent chapters, but a few of its characteristics bear mentioning here:

[Figure 1.5 diagram: Interesting Patients / Interesting Animals / Interesting Cell Lines → Appropriate Tissue → Appropriate Conditions → Extract RNA → Hybridize Biochip → Scan Biochip → Functional Clustering → Assess Significance → Validate Biologically]

Figure 1.5: An archetypal functional genomics pipeline. Shown is a simplified view of a functional genomics pipeline solely involved in expression microarray experiments. Note the interdigitation of "wet" and "dry" components requiring close multidisciplinary collaboration and some creative consideration of the value of the individual contributions in this pipeline for a particular experiment and publication.

• Selection of the right tissue. Experiments in functional genomics require selection of the functionally relevant tissue or cell type. In certain experiments, such as those involving blood and solid cancers, the functionally relevant tissue is clear. In other analyses, the functionally relevant tissue is not so easily ascertained or acquired. For example, the clinical phenotype seen in type 2 diabetes mellitus, or insulin resistance, involves the coordinated physiological dysfunction of several organs and cell types, including liver, muscle, and fat cells. Schizophrenia involves a higher-order brain dysfunction, but brain cells are not easily accessible in humans. For some common diseases, such as hypertension, it is not clear what the functionally relevant tissue is. A successful pipeline involves collaboration with a source of tissue, such as a surgical team, a laboratory with biologically interesting animals, or a laboratory with cell lines of interest.

• Right conditions. Even if the appropriate tissue is selected from the organism of interest, the conditions under which the tissue is obtained (e.g., the number of hours post mortem) can determine whether or not the investigation is successful. An insulin-sensitive tissue such as skeletal muscle will have a different characteristic metabolic and expression profile depending on the glucose and insulin concentrations prior to the extraction of RNA. The time of day will influence the expression of genes in all tissues that have endogenous circadian rhythms or processes that can be entrained by physiological clocks. Awareness of these issues, and cooperation from the surgeon, pathologist, or technician responsible for obtaining the tissue, is therefore essential to the success of the functional genomics pipeline.

• Extracting RNA, hybridizing to the microarray, and scanning. Each of these steps in the "wet" component of a functional genomics pipeline is susceptible to operator error and is a potential source of poor or noisy measurements. The RNA extracted may be of poor quality, the hybridization conditions may vary (e.g., the room temperature), and the settings of the scanner that produces the digital image of the microarray may vary from one scan to another. Industrialization and standardization of this component have been the focus of the more successful and high-quality functional genomics efforts using expression microarrays.

• Functional clustering. This "dry" component of the pipeline is often thought to be what bioinformatics is about. Indeed, this stage, the algorithmic analysis of an expression-profiling study to detect biologically or clinically meaningful patterns or associations, may be the only point at which a bioinformatician is involved. We will argue throughout this book that a successful functional genomics pipeline involves the bioinformatician at every step.
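To make this "dry" step concrete, the following is a minimal sketch, in Python with SciPy and scikit-learn, of feeding one and the same expression matrix to two of the clustering methods mentioned earlier: a dendrogram-producing hierarchical clustering and K-means. The matrix dimensions, cluster count, and random data are arbitrary placeholders, not a recommendation or a description of any particular study.

```python
# A minimal sketch of the "functional clustering" step, assuming an
# expression matrix with genes as rows and samples as columns.
# The data here are random placeholders, not real microarray measurements.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expression = rng.normal(size=(500, 12))   # 500 genes x 12 samples (hypothetical)

# Hierarchical clustering (the dendrogram approach): average linkage on
# correlation distance between gene expression profiles.
tree = linkage(expression, method="average", metric="correlation")
hier_labels = fcluster(tree, t=10, criterion="maxclust")

# K-means clustering of the same genes into the same number of groups.
kmeans_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(expression)

# The two methods will usually partition the genes differently; the point of
# this chapter is that this choice matters less than the design of the
# pipeline that produced, and will validate, the data being clustered.
print("hierarchical cluster sizes:", np.bincount(hier_labels)[1:])
print("k-means cluster sizes:     ", np.bincount(kmeans_labels))
```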

• Computational validation. As will be elaborated in this book, there are many reasons to perform bioinformatics analyses on functional genomics data sets, and many methods can be used. One unique problem with these types of data sets is that they are "short and wide," meaning that many characteristics are measured on relatively few samples. For example, current microarrays offer the quantitation of up to 60,000 expressed sequence tags (ESTs) in any given sample, but current costs may limit a single experiment to 10 to 100 samples. Because of this problem, these data sets are essentially underdetermined, as described on page 10, meaning that there are many correct ways to mathematically describe the clusters and genetic regulatory networks contained within them. Thus, some computational validation is required immediately after the bioinformatics analysis so that computationally sound but biologically spurious or improbable hypotheses are screened out.

The principal motivation for screening out spurious or improbable hypotheses is the effort that follows. Each hypothesis that passes this step may need to be validated in a biological laboratory. Some biological laboratories may wish (and may have the resources) to pursue many hypotheses and can tolerate the eventual refutation of large numbers of false-positive hypotheses. Other biological laboratories may only be able to validate a few. Thus, a proper bioinformatics analysis includes a computational validation. An ideal computational validation does not merely provide a yes-or-no answer as to the potential validity of a hypothesis, but instead provides a continuum of validation, such as a receiver operating characteristic (ROC) curve. With such a curve, the biologist can select the desired operating point of sensitivity and specificity, that is, the acceptable balance of true and false positives and negatives (see sections 2.1.4 and 4.12.3).
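To make the notion of a continuum of validation concrete, the minimal sketch below computes an ROC curve with scikit-learn. The "truth" labels and per-hypothesis validation scores are synthetic stand-ins for whatever gold standard and scoring scheme a real analysis would use, and the 5% false-positive cutoff chosen at the end is purely illustrative.

```python
# A hypothetical sketch: turn per-hypothesis validation scores into an ROC
# curve from which a biologist can pick an operating point.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)

# Synthetic stand-ins: 1 = hypothesis later confirmed, 0 = refuted, plus a
# continuous "computational validation" score for each of 200 hypotheses.
truth = rng.integers(0, 2, size=200)
scores = truth * rng.normal(1.0, 1.0, size=200) + (1 - truth) * rng.normal(0.0, 1.0, size=200)

fpr, tpr, thresholds = roc_curve(truth, scores)
print("area under the ROC curve:", roc_auc_score(truth, scores))

# Example: a resource-limited laboratory might take the most sensitive
# threshold whose false-positive rate stays at or below 5%.
ok = fpr <= 0.05
print("threshold at <=5% FPR:", thresholds[ok][-1], "sensitivity:", tpr[ok][-1])
```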

• Biological validation. Most biological questions will not be answered using microarrays. Instead, the most likely outcome from a functional genomics analysis is the next biological question to ask. As hypotheses are generated from bioinformatics analyses, biological validation is crucial to verify these hypotheses. This verification may include, for instance, making sure a particular set of genes is truly expressed at the proper time and place as hypothesized, using conventional biological techniques such as Northern blotting and in situ hybridization.

• The multidisciplinary team. In most settings, all of these steps, from acquisition of source material, to microarray construction, to bioinformatics analysis, to biological verification, cannot be performed by a single group or laboratory. A successful functional genomics pipeline brings together resources and expertise from many disciplines and varied backgrounds. Two anecdotes serve to illustrate the value of this multidisciplinary approach.

We were in the process of analyzing a large set of microarray expression data obtained from skeletal muscle for colleagues interested in muscular dystrophy, a class of genetic diseases of muscle. They were gratified when our clustering analyses found the interactions between transcription factors and contractile proteins that they had discovered just months before using conventional molecular-biological techniques, as well as several new but plausible interactions. However, because the clustering analyses were exhaustive, they also identified several hormonal interactions that were not of primary interest to these neuromuscular specialists. Using annotation tools linking the microarray data to several national databases, it quickly became apparent that these hormonal interactions were thought to be exclusive to adipocytes (the cells constituting the principal component of fatty tissue), yet we had just found suggestive evidence to the contrary. The multidisciplinary nature of our effort allowed the formulation of well-posed questions directly related to the interests of the biological investigators and yet kept us open to important hypotheses generated from the data.

We are participating in a study of the functional genomics of the developing brain using mouse models. Using approximately two dozen microarray data sets produced by our collaborators, researchers in developmental biology, we had computed a list of approximately 100 genes that appeared to be involved in the development of a specific region of the brain. Our collaborators were in the process of selecting a subset of these for biological validation, but we were worried about the outcome of that validation: the data had been derived from entire portions of the brain, whereas the process the developmental biologists were interested in occurred in only a minute component of the brain. It seemed probable that many of the 100 genes were not specific to the processes we were studying. Given only the expression data from the microarrays, none of the bioinformatics techniques was able to further refine or hone the list of 100. Fortunately, the developmental biologists provided us with the following insight. They knew of one gene g that was expressed in the tiny area of the brain they were studying, and they had determined empirically that it was expressed in no other part of the brain. They suggested that we find all those genes in the list of 100 whose expression behaved most similarly to that of g. We found 10 such genes, and our collaborators went on to successfully validate 8 of them using the techniques of conventional hypothesis-driven molecular biology. If we had not drawn on the multidisciplinary capabilities of our team for that small but crucial biological insight, we would have been stuck with a large list of nonspecific genes of little relevance to the questions originally posed by the developmental biologists.
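The final step of this second anecdote, ranking the candidate list by how closely each gene's expression tracks the marker gene g, is simple to sketch. The code below uses Pearson correlation across the arrays; the gene names, expression values, and the cutoff of 10 genes are hypothetical placeholders rather than the data from the actual study.

```python
# A hedged sketch of the "genes that behave like gene g" query from the
# anecdote above: rank candidate genes by the correlation of their
# expression profiles with a known region-specific marker gene.
# All names and values here are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(2)
n_arrays = 24                                        # ~two dozen microarray data sets
candidates = [f"gene_{i:03d}" for i in range(100)]   # the list of ~100 candidate genes
profiles = rng.normal(size=(100, n_arrays))          # expression of each candidate across arrays
g_profile = rng.normal(size=n_arrays)                # empirically trusted marker gene g

def correlation_to_seed(profiles, seed):
    """Pearson correlation of each row of `profiles` with the seed profile."""
    p = (profiles - profiles.mean(axis=1, keepdims=True)) / profiles.std(axis=1, keepdims=True)
    s = (seed - seed.mean()) / seed.std()
    return (p * s).mean(axis=1)

r = correlation_to_seed(profiles, g_profile)
top = np.argsort(r)[::-1][:10]                       # the 10 genes tracking g most closely
for i in top:
    print(f"{candidates[i]}  r = {r[i]:+.2f}")
```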

When the initial design of the multidisciplinary functional genomics pipeline is given short shrift and the fundamental limitations of expression microarray technologies are misunderstood, the enterprise of functional genomics appears to approximate the "fishing expedition" that has been the oft-stated concern of traditional biologists regarding this nascent discipline. Consequently, even if a particular investigator participates in only a fraction of the pipeline, understanding the safe design of an entire functional genomics pipeline can maximize the yield of these experiments, or at the very least produce convincing and reproducible negative results. It is the intent of this book to point the way to investigations that provide such an understanding.

Because of the dramatically different backgrounds (at least at present) of the various contributors to the functional genomics pipeline, its social dynamics may be challenging, as described below. We pay attention to these dynamics because one aspect of an integrative genomics is the integration of disciplines and experts (the other aspect will be described before the end of this chapter).
