Discarding data and low-hanging fruit

After an investigator obtains his or her first microarray data sets, often the first question he or she will ask is: "Which of the observed up- or down-regulations of genes represent biologically significant changes in expression?" This and similar questions are addressed in detail in section 3.3. Here we provide context and motivation for an answer that biologists often find discomfiting: that very few genes are in fact significantly changed in expression in a way that is distinguishable from biological and measurement variation and noise.[4] We should emphasize at this point that there is an important distinction between mathematical or statistical significance and biological significance. The former has to do with analytically quantifying the difference between two or more sample sets of numbers, i.e., microarray expression data, for which there exist standard statistical tests such as Student's t or χ². These tests typically ignore the rich biological meaning and structure in these numerical representations of expression. The latter is a more complex matter, which we discuss here.
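As a minimal sketch of statistical (as opposed to biological) significance, the following applies Student's t test to replicate measurements of a single gene under two conditions; the expression values and replicate counts are invented for illustration.

# A minimal sketch of testing one gene for a statistically significant
# difference in expression between two conditions. The expression values
# below are invented for illustration only.
from scipy import stats

# Replicate expression measurements (arbitrary units) for a single gene.
control   = [102.0, 98.5, 110.2, 95.7]    # hypothetical control arrays
treatment = [131.4, 140.8, 125.3, 137.9]  # hypothetical treated arrays

# Student's t test compares the two sample means; it knows nothing about
# the biological role of the gene, only the numbers.
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

Note that such a test quantifies only whether the two sets of numbers plausibly share a mean; whether the difference matters biologically is the separate question discussed below.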

It is likely that there are thousands of genes whose small relative changes in expression level effect important biological outcomes. In the present state of microarray technology, and with the typically small number of replicate experiments for any given condition, these small changes cannot be reliably and reproducibly distinguished from noise. This may mean that even though we measure tens of thousands of genes in a microarray experiment, we obtain only hundreds of genes that we are reasonably convinced are involved in a particular biological system. On the other hand, not all statistically significant changes in gene expression lead to a significant change in the biological or physiological state of a system. This answer is most disconcerting to those biologist-researchers who have substantial experience investigating the expression of only a few genes at a time. Their investigations typically involve one or two highly likely candidates chosen on the basis of many prior investigations, for which multiple, carefully checked measurements of expression levels have been made. Consequently, these researchers have well-grounded ideas of what constitutes a biologically relevant or significant change in expression for these few genes. They do not have the benefit of this sort of knowledge for each of the thousands of genes represented on a microarray, but are nonetheless uncomfortable with discarding from further analysis thousands of genes that appear to have large numerical changes in expression.

In contrast, due to the large number of genes that can be measured in a single experiment, computer scientists and computational biologists are comfortable—perhaps too comfortable—with generating exhaustive lists of genes that are possibly involved or interacting in a particular biological process. This kind of exhaustive list generation has been quite common in publications of microarray experiments from 1997 until today. Unfortunately, such lists are not that helpful to the biologists who wish to determine which elements in these lists are worth pursuing, i.e., are biologically significant for their investigations. This is often the reason for questions about what constitutes a significant fold change, where significance can mean anything from "present in the tissue" or "associated with the system being studied" to "causative of the changes in the system being studied," or even simply "worthy of further study." Thus, an analyst for a typical functional genomics study will simply draw two boundaries: one covering increases in expression and the other covering decreases in expression, as shown in figure 2.10. Such an analyst will then declare that any gene found beyond these lines (the red points) is significant or relevant to the biological process being studied. Unfortunately, there is no single threshold number, or even function, that we can provide that would generate a particular number of candidate genes worth pursuing for any arbitrary experiment and experimental design. Fortunately, there are well-founded decision-analytic procedures that can be followed to arrive at the appropriate number of candidate genes for any given investigation. These are described in detail in section 3.2. The underlying motivation for picking a threshold is straightforward.
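The fixed-threshold rule just described reduces to a few lines of code. Below is a minimal sketch, with invented expression values and a placeholder twofold cutoff, that flags any gene whose expression ratio falls beyond either boundary line of figure 2.10.

import numpy as np

# Hypothetical expression values (arbitrary units) for five genes in two
# conditions; a pseudocount of 1 guards against division by zero.
baseline  = np.array([120.0, 85.0, 300.0, 40.0, 15.0])
condition = np.array([260.0, 80.0, 310.0, 18.0, 90.0])

fold = (condition + 1.0) / (baseline + 1.0)
threshold = 2.0  # a placeholder twofold boundary, as in figure 2.10

# A gene is flagged if it rises above the threshold or falls below its
# reciprocal -- the two boundary lines drawn by the analyst.
flagged = (fold >= threshold) | (fold <= 1.0 / threshold)
for i, (f, hit) in enumerate(zip(fold, flagged)):
    print(f"gene {i}: fold change {f:.2f}{' * flagged' if hit else ''}")

The choice of 2.0 here is arbitrary, which is precisely the problem the decision analysis below is meant to address.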

Figure 2.10: Typical uses of fixed fold thresholds. Reproducibility scatter plots. In each of the experiments, samples were hybridized to identical oligonucleotide arrays containing probes for 6800 human genes. The message abundance in arbitrary units is plotted. Left panel: A single biotinylated RNA target was divided in two and each half hybridized to two arrays. Fifty-nine genes (0.8%) judged "present" by the Affymetrix GeneChip software differed by more than twofold, and 0.3% differed by more than threefold. Middle panel: A single biotinylated target was hybridized to one array, the sample removed, and then rehybridized to a second array. Of the "present" genes, 2.6% and 0.4% differed by more than twofold and threefold, respectively. Right panel: A single total RNA sample was converted to biotinylated cRNA in two independent labeling reactions, and the cRNA then hybridized to two arrays. Of the "present" genes, 2.2% and 0.4% differed by twofold and threefold, respectively.

The first question one has to consider is the cost of a false positive. A false-positive gene is a gene which, in reality, is not biologically significant to the process under investigation, but which the analytic technique deems to be significant, statistically or biologically. That is, what is the cost involved in the follow-up procedure to confirm the function of a gene in a particular biological process? Whether this confirmation process, or biological validation, involves quantitative PCR, in situ hybridization, the generation of a transgenic mouse, a transient-transfection assay, or transfecting a cell line, its cost in time and money will be substantial. This cost thus limits the number of genes that can be investigated. The threshold that is then picked will be determined in part by the disutility or cost of the biological validation step for false positives. The tens of thousands of genes present on any microarray ensure that the number of false positives could potentially overwhelm the typical time and financial budgets of most (academic) research laboratories.

Of course, there is the second and converse question of the cost of a false negative from the same system. A false-negative gene is a gene which, in reality, is biologically significant to the process under investigation, but which the analytic technique deems not to be significant, statistically or biologically. There are likely to be several genes involved in any signaling pathway of interest. Not all genes in a pathway are equally amenable to biological validation. Furthermore, some of the chosen genes may be more suitable targets than others for diagnostic assays or therapeutic interventions. If the threshold picked for considering a gene to be significantly[5] changed in expression excludes one of these targets, then the cost of that false negative will be quite high, e.g., a missed scientific opportunity or lost opportunity for commercial development.

With this in mind, it becomes clearer how a threshold should be picked. First, the investigators should ask themselves how many false-positive and false-negative leads they can tolerate. Then they can conduct a series of replicate experiments as discussed in section 3.2 and determine where they will have to draw the thresholds to attain the required sensitivity and specificity.[6] Thus, if one is about to embark on a functional genomics investigation, an integral part of the experimental design must be the decision analysis diagrammed in figure 2.9. This requires that the cost to one or one's enterprise of missing the one or more genes that are likely to be involved in the genetic regulatory pathway of interest be made explicit. Similarly, the cost of having to follow up on false leads must be estimated. These two costs define the principal disutilities that one is trying to minimize. To complete this simple decision analysis, the probabilities corresponding to the sensitivities and specificities that are diagrammed will also have to be obtained. After this is complete, one will know whether it is possible to engage in a productive high-throughput functional genomics strategy, or whether one has to increase the sensitivity or specificity of the measurement techniques, or whether one needs to increase one's biological validation budget, or all three. Given the current level of reproducibility of expression measurements in microarrays, most investigators will choose a high level of specificity to reduce the false-positive rate. This will inevitably lead to a high false-negative rate. At this point the biologist will find that among the false negatives are genes that are known to have changed expression (but did not meet the significance threshold computed by the bioinformatician), which will lead naturally to the following worry: Many other genes of relevance to the biological system under investigation are being discarded from the analysis and follow-up. This worry is likely to be well-founded, but given the state of the measurement technology and the typical sample sizes employed, it cannot be easily remedied. It remains the case that the low-hanging fruit, while numbering only in the hundreds, are likely to shed new light on the processes studied. Already this represents several orders of magnitude more hypotheses to be tested as compared to the investigations of the pregenomic era. For those fruit higher up the tree, the above decision analysis suggests the need to wait for more accurate and cheaper microarray technologies.
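A toy version of this decision analysis can be written down directly. The sketch below assumes a set of candidate thresholds, each with a hypothetical sensitivity and specificity, together with assumed per-gene costs for validating a false lead and for missing a relevant gene; every number is an invented placeholder that an investigator would replace with his or her own estimates.

# A toy expected-cost calculation for picking a significance threshold.
# All numbers (costs, operating points, gene counts) are invented
# placeholders for illustration.

N_GENES = 10_000   # genes measured on the array
N_TRUE  = 200      # assumed number of truly involved genes
COST_FP = 2_000    # assumed dollars to validate one false lead
COST_FN = 50_000   # assumed cost of one missed relevant gene

# Hypothetical (sensitivity, specificity) operating points, one per
# candidate threshold; stricter thresholds appear lower in the list.
operating_points = [
    ("lenient",  0.90, 0.95),
    ("moderate", 0.75, 0.99),
    ("strict",   0.50, 0.999),
]

for name, sens, spec in operating_points:
    false_negatives = (1.0 - sens) * N_TRUE
    false_positives = (1.0 - spec) * (N_GENES - N_TRUE)
    cost = false_negatives * COST_FN + false_positives * COST_FP
    print(f"{name:>8}: ~{false_positives:.0f} false leads, "
          f"~{false_negatives:.0f} missed genes, expected cost ${cost:,.0f}")

Even this crude calculation makes the central trade-off visible: with thousands of truly uninvolved genes on the array, a small loss of specificity produces far more false leads than most validation budgets can absorb.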

Figure 2.9: A decision-analytic procedure for picking a threshold for selecting genes from a functional genomics experiment. Because of the large number of probes on current microarrays, it is all too easy to underestimate the cost and practical intractability caused by even a moderate false-positive rate. Investigators are advised to perform this simple decision analysis to determine what false-positive and false-negative rates they can afford before they proceed with any experiments.

It goes without saying that the preceding discussion is an oversimplified decision-analytic procedure. Attempting to assess the utilities of false-negative findings will be particularly difficult for most investigators, but the constraints of realistic budgets will be a driving factor in this analysis. However, the procedure we have described does have the merit of providing a rational basis for deciding on the number of genes one should seek to obtain from one's pipeline. Even a coarse decision model can be useful in avoiding unpleasant surprises after the initial excitement of obtaining a list of putatively interesting genes wears off. The reader who is interested in more sophisticated decision analyses is referred to the textbook by Weinstein et al. [187]. It should be noted that the raw specificities and sensitivities of the expression microarrays used in a functional genomics pipeline need not be the ones used in this decision analysis. As we demonstrate in sections 3.2.2 and 3.2.8, the application of appropriate noise models over repeated experiments can lead to improved sensitivity and specificity. The cost of such models lies typically in the increased number of microarray experiments required to develop them.
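The effect of replication underlying this improvement in sensitivity and specificity can be illustrated with a small simulation: averaging n independent replicate measurements shrinks the standard error of an expression estimate by a factor of the square root of n, which directly tightens the thresholds one can afford. The noise level and replicate counts below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
TRUE_LEVEL = 100.0  # a gene's true expression (arbitrary units)
NOISE_SD   = 20.0   # assumed measurement noise, invented for illustration

for n_reps in (1, 4, 16):
    # Simulate many experiments, each averaging n_reps replicate arrays.
    means = rng.normal(TRUE_LEVEL, NOISE_SD,
                       size=(100_000, n_reps)).mean(axis=1)
    print(f"{n_reps:>2} replicates: std of estimate = {means.std():.2f} "
          f"(theory: {NOISE_SD / np.sqrt(n_reps):.2f})")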

Dynamics

Perhaps the best glimpse of the interactions between the various genetic components, and in particular their causal dependencies, is obtained through analysis of the dynamics of the system. This is elaborated considerably in the section on dynamics (section 3.6.3, page 146). Suffice it to say here that understanding the trajectory over time of the various RNA species, rather than their absolute values at any given point in time, provides far more information about the operation of the underlying system. To return to our watch metaphor, observing how the gears turn inside the watch while we smoothly move the minute hand or the hour hand will give us much more information about how the gears move together than a few (unordered) snapshots in time of the different positions of these hands on the face of the watch. In most analyses to date, even when expression data are obtained as a time series, the information buried in the specific ordering and timing of these measurements is rarely exploited. For example, many of the clustering and classification methods employed (and introduced below) will generate the same answers even if the order of the measurements were shuffled, as the short demonstration below makes concrete. In fairness, the number of time points and the sampling intervals of many existing time-series expression data sets are minimal. They provide insufficient information for most of the standard armamentarium used to analyze time-oriented data. As the price of microarrays falls, the quantity and quality of time-series experiments will undoubtedly increase, at which point many of the techniques for analyzing trends and periodicity that have been developed in the statistics and signal-processing literature will become applicable. At that time, the temporal relationships of every expressed gene with every other gene measured at all time points will provide a qualitatively improved set of insights into the causal relationships in the genetic regulation of cellular physiology.
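This order-invariance is easy to demonstrate: Pearson correlation, the similarity measure underlying many published expression clusterings, is unchanged when the time points of every gene are permuted in the same way. The time series below are synthetic.

import numpy as np

rng = np.random.default_rng(1)
# Synthetic expression time series for 3 genes over 8 time points.
series = rng.normal(size=(3, 8))

perm = rng.permutation(8)   # one shuffle of the time axis
shuffled = series[:, perm]  # applied identically to every gene

# Pairwise Pearson correlations between genes -- the basis of many
# clusterings -- are identical before and after shuffling the time order.
print(np.allclose(np.corrcoef(series), np.corrcoef(shuffled)))  # True

Any method built on such order-blind similarity measures necessarily discards the temporal information discussed above.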

[1]For example, cellular proliferation and differentiation during development.

[2]All of these genes have been implicated in somatic growth and bone repair.

[3]This involves giving the subject a continuous intravenous infusion of insulin and glucose, while maintaining a relatively constant blood glucose concentration by varying the rate of the glucose infusion.

[4]This is particularly true given the noisy nature of most microarray-based expression measurement systems.

[5]For the discussion here, we define "significant" as "worthy of further study."

[6]Sensitivity can be described as 1 − (false negatives/all positives); specificity as 1 − (false positives/all negatives).
