Supervised versus unsupervised learning

In the introductory chapter, we discussed why the computational techniques applied in the analysis of gene expression are qualitatively different from those of traditional biostatistics: the data sets are of high dimensionality and yet the number of cases is relatively small. Consequently, the number of solutions that could explain the observed behavior is quite large. For this reason, the machine-learning community has recognized both the potential role for its techniques, which were specifically designed to explore high-dimensional spaces (such as those of voice or face recognition), and the enormous need to apply these techniques to genomic data sets. Accordingly, the first reaction of a computer scientist with a background in machine learning, on becoming aware of the new challenges created by genomic data sets, is to pull out the tools of the standard armamentarium of machine learning. He or she then begins to explore informally, and subsequently evaluates formally, the results obtained when these tools are applied to genomic data sets. We provide a framework for the armamentarium in the genomic data mining chapter (chapter 4, page 149), but here we provide only enough of an overview to discuss these tools with respect to experimental design.

Two useful broad categories of the techniques used by the machine-learning community are supervised learning techniques and unsupervised learning techniques. These are also commonly known as classification techniques and clustering techniques, respectively. The two are easily distinguished by the presence or absence of external labels on the cases. For example, a tissue must first be labeled as obtained from a case of acute myelogenous leukemia (AML) or acute lymphocytic leukemia (ALL) before a supervised learning technique can be applied to learn which combinations of variables predict or determine those labels. In an unsupervised learning task, such as finding those genes that are co-regulated across all the samples, the organization or clustering of the variables operates independently of any external labels. The kinds of variables (also known as features in the jargon of the machine-learning community) that characterize each case in a data set can be quite varied. Each case can include measures of clinical outcome, gene expression, gene sequence, drug exposure, proteomic measurements, or any other discrete or continuous variable believed to be of relevance to the case.
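The distinction can be made concrete with a minimal sketch. The data, the nearest-centroid classifier, and the use of Pearson correlation are all assumptions chosen for illustration, not methods prescribed in this chapter. The supervised step needs the AML/ALL labels; the unsupervised step never looks at them.

```python
from math import sqrt

# Hypothetical toy data: rows are samples, columns are genes g0..g2.
expression = [
    [9.0, 8.5, 1.0],
    [8.0, 8.0, 2.0],
    [2.0, 1.5, 1.5],
    [1.0, 2.0, 2.5],
]
# External labels: used only by the supervised step.
labels = ["AML", "AML", "ALL", "ALL"]


def centroids(rows, labs):
    """Supervised step: average the expression profile of each labeled class."""
    groups = {}
    for row, lab in zip(rows, labs):
        groups.setdefault(lab, []).append(row)
    return {lab: [sum(col) / len(col) for col in zip(*members)]
            for lab, members in groups.items()}


def classify(sample, cents):
    """Assign a new sample to the class whose centroid is nearest."""
    def dist(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(cents, key=lambda lab: dist(sample, cents[lab]))


def pearson(x, y):
    """Unsupervised step: correlation of two genes across samples (no labels)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


cents = centroids(expression, labels)
predicted = classify([8.5, 9.0, 1.5], cents)  # a new, unlabeled sample

genes = list(zip(*expression))      # one tuple of values per gene
r_01 = pearson(genes[0], genes[1])  # g0 and g1 rise and fall together
r_02 = pearson(genes[0], genes[2])  # g0 and g2 do not
```

In this toy case the new sample is assigned to the AML class, while the correlation step groups g0 with g1 without any reference to the labels.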

What kinds of questions are answered by the two types of machine learning? In supervised learning, the goal is typically to obtain a set of variables (e.g., expressed genes as measured on a microarray) on the basis of which one can reliably make the diagnosis of the patient, predict future outcome, predict future response to pharmacological intervention, or categorize that patient or tissue or animal as part of a class of interest. In unsupervised learning, the typical application is to find either a completely novel cluster of genes with a putative common (but previously unknown) function or, more commonly, a cluster of genes whose patterns of expression are similar to those of a gene already known to have an important, well-defined function (i.e., they fall into the same cluster as that gene). The goal there is to find more details about the mechanism by which the known gene works and to find other genes involved in that same mechanism, either to obtain a more complete view of a particular cellular physiology or, in the case of pharmacologically oriented research, to identify other possible therapeutic targets.

Although the distinct goals of supervised versus unsupervised machine-learning techniques may appear rather obvious, it is important to be aware of the implications for study design. For example, an analyst may be asked to find classifiers between two types of malignancy, as was done in the Golub et al. investigation of AML and ALL [78]. However, the lists of genes that reliably divide the two malignancies may have little to do with the actual pathophysiological causes of the two diseases and may not reflect any particularly close functional relationship among those genes. Why might this be? It is quite possible that small changes in the amounts of some gene products, such as transcriptional activators and genes such as p53, may cause large downstream changes in gene expression. That is, with only a subtle change, an important upstream gene may cause dramatic changes in expression in several pathways that are functionally only distantly related but are highly influenced by the same upstream gene. When a classification algorithm is applied directly to the gene expression levels, it will naturally identify those genes that change the most between the two or more states being classified. That is, a study design geared toward the application of a supervised learning technique may generate a useful artifact for classification, diagnosis, or even prognosis, but it will not necessarily lead to valuable insights into the biology underlying the classes obtained.
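The point about upstream genes can be illustrated with a small sketch. The gene names and expression values are hypothetical, and ranking by absolute difference in class means is only a crude stand-in for what a real classification algorithm does. The upstream regulator changes only slightly between the classes, whereas its downstream targets change a great deal.

```python
# Hypothetical mean expression per class; names and values are illustrative only.
mean_class_a = {
    "upstream_regulator": 5.0,
    "downstream_1": 2.0,
    "downstream_2": 3.0,
    "downstream_3": 10.0,
}
mean_class_b = {
    "upstream_regulator": 5.4,   # subtle change in the causal gene
    "downstream_1": 9.0,         # large changes in the genes it influences
    "downstream_2": 12.0,
    "downstream_3": 1.5,
}


def rank_by_difference(a, b):
    """Order genes by absolute difference in mean expression, largest first."""
    return sorted(a, key=lambda gene: abs(a[gene] - b[gene]), reverse=True)


ranking = rank_by_difference(mean_class_a, mean_class_b)
top_two = ranking[:2]  # the genes a difference-driven selection would favor
```

Here the downstream genes head the ranking and the upstream regulator comes last, even though in this constructed example it is the gene driving the difference between the classes.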

Let us consider the more general case where gene expression values are not the only data type. For example, as illustrated in figure 2.11 below, a given case can include several thousand gene expression measurements but also several hundred phenotypic measurements such as blood pressure, a laboratory value, or the response to a chemotherapeutic agent. Here again a clustering algorithm can be used to find those features that are most tightly coupled in the observed data. When designing an experiment that includes the various data types, it is worthwhile thinking ahead of time about whether some kinds of features are more likely to cluster together, separately from the genomic data. That is, after application of a clustering algorithm the data set may reveal relationships between the nongenomic variables that are much stronger and more significant than any of those that involve gene expression or sequence. While that is not necessarily a bad outcome, it will not help the investigator who is trying to understand the particular contribution of genetic regulation to the observed phenomenon. As an example, if one looks at the effect of thousands of drugs on several cancer cell lines, then it should not be surprising if these drug effects are most tightly clustered around groups of pharmaceutical agents that were derived from one another through combinatorial chemistry. Similarly, phenotypic features that are highly interdependent, such as height and weight, will cluster together. The strength of these obvious clusters will often dominate that of heterogeneous clusters containing phenotypic measurements as well as gene expression measurements. This suggests that feature reduction should be applied carefully, so that only nonredundant features and truly independent phenotypic measures are included for each case. We refer the reader interested in systematic approaches to feature reduction to the excellent text by Sholom Weiss and Nitin Indurkhya [188].
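As a minimal sketch of one simple form of feature reduction, the following drops any feature that is strongly correlated with a feature already kept. This greedy correlation filter, the threshold of 0.9, and the feature names and values are all assumptions made for illustration; it is not the systematic approach of the text cited above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of measurements."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_redundant(features, threshold=0.9):
    """Keep a feature only if its |correlation| with every kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical measurements for five cases.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [52.0, 58.0, 69.0, 80.0, 88.0],   # tracks height closely
    "gene_x": [3.1, 7.4, 2.2, 6.8, 4.0],           # unrelated to height
}
kept = drop_redundant(features)
```

With these values, weight is discarded as redundant with height while the expression measurement is retained. Which member of a redundant pair survives depends on the order of the features, a limitation of this greedy choice.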

Figure 2.11: Clustering more than only expression values. A full-fledged functional genomic study with clinical relevance will involve multiple and quite heterogeneous data types. The realities of clinical research and clinical care will ensure that there is a requirement for the handling of missing data.

2.2.2 Figure of merit: The elusive gold standard in functional genomics

Whereas the discussion above was motivated by the questions posed by our collaborators who are primarily biologists or clinicians, it parallels similar questions coming from those colleagues with a computer science background. One question is, how can we determine whether a particular methodology that we are applying is successful or not? Is one machine-learning algorithm better than another, or does one clustering method provide more robust clusters than another? In other words, how do we know how successful a particular functional genomics investigation is? What is the figure of merit that we are trying to obtain?

The figure of merit is probably most readily ascertained in the case of a known classification. Take, for instance, the task of classifying whether a particular tissue belongs to one of two tissue types using a supervised learning technique. Suppose also that a test sample is available. Then, various classification algorithms can be run against one another and their sensitivity and specificity compared. More generally, the receiver operating characteristic (ROC) curve of each of these classification algorithms can be determined. A good example of such a comparison is provided by Michael Brown et al. [33]. Several classification algorithms were employed, compared, and ranked based on a weighted measure of true and false positives and true and false negatives. The best performance was obtained with support vector machines. The data set used for this "bake-off" was the Saccharomyces cerevisiae data set from Stanford, comprising expression data for 2467 genes across 79 different hybridization experiments.[9] Because the classification is known in advance, the performance of these algorithms can be measured.

However, there are several limitations which should be recognized. The performance of a classification algorithm pertains only to the particular population or set of experiments on which it was originally tested. That is, a classification algorithm that works well on a yeast data set may not necessarily work as well, relative to other algorithms, at distinguishing two different types of leukemia or two different types of diabetes mellitus in humans. Even less dramatically, the same algorithm trained on one set of patients with leukemia may have a different performance relative to other classification algorithms on another set of leukemic patients drawn from a different population or obtained with a different ascertainment bias. For example, one set of patients may have been selected because of particularly refractory disease that caused them to be referred to a tertiary care hospital, and another set of patients may have been treated in community hospitals. It is possible that the underlying diseases of these populations are different, and therefore the possibility of underfitting or overfitting the underlying classes of disease in these populations will differ. The figure of merit in a classification experiment is, therefore, the classification performance of an algorithm or measure for that particular population. Its applicability or merit for use in even related populations is uncertain or problematic.
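To make these quantities concrete, the following is a minimal sketch of how sensitivity, specificity, and the points of an ROC curve can be computed from classifier scores on a test sample. The scores, the true classes, and the function names are hypothetical. Summarizing the curve by its area under the trapezoidal rule is one common choice, not the weighted measure used in the comparison cited above.

```python
def confusion(scores, truth, threshold):
    """Count outcomes when samples scoring >= threshold are called positive."""
    pairs = list(zip(scores, truth))
    tp = sum(1 for s, t in pairs if s >= threshold and t)
    fp = sum(1 for s, t in pairs if s >= threshold and not t)
    fn = sum(1 for s, t in pairs if s < threshold and t)
    tn = sum(1 for s, t in pairs if s < threshold and not t)
    return tp, fp, fn, tn


def sensitivity_specificity(scores, truth, threshold):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, fn, tn = confusion(scores, truth, threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, truth):
    """(1 - specificity, sensitivity) at each distinct threshold, starting from (0, 0)."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, truth, threshold)
        points.append((1.0 - spec, sens))
    return points


def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier scores on a test sample; True marks the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
truth = [True, True, False, True, False, False]

sens, spec = sensitivity_specificity(scores, truth, threshold=0.5)  # one operating point
area = area_under_curve(roc_points(scores, truth))                  # the whole curve
```

A single threshold yields one sensitivity/specificity pair, that is, one point on the curve. Sweeping the threshold traces the whole curve and allows algorithms to be compared when sensitivity and specificity are valued differently. As the text cautions, any such number describes performance only on the population from which the test sample was drawn.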

Figure of merit in a clustering experiment

What is the figure of merit for a clustering analysis? That is, how can we evaluate the degree to which the clusters of genes obtained are in fact correct or relevant? To anybody who has performed a clustering procedure on microarray expression data, it will be obvious that, once a cluster of genes has been obtained through any machine-learning algorithm, it is altogether too tempting to come up with a post hoc justification for why those genes may have fallen or risen together. This temptation is all too evident in the publications of the last 3 years, in which the clusters of genes reported by investigators are typically described as falling into one of three categories. The first is well-known published associations, either mechanistically verified or empirically known, that have been obtained through other, more conventional techniques. The second is associations that appear plausible to the authors, and presumably the reviewers, but have not been proven in the literature. The third is associations for which the authors are hard-pressed to find support. For the biological investigator who wishes to investigate further the fundamental biology of genes and their function, the last two categories constitute a fairly unsatisfactory basis on which to invest several months, if not years, in understanding the basis for the imputed associations or potential functional dependencies.

The challenge then for the functional genomist using microarray data is to come up with means to validate the clusters obtained. There are two levels of validation. The first is a statistical or methodological validation with techniques such as permutation or cross-validation, as described elsewhere in this book (section 4.12.1 and section 4.12.2). These techniques can be used to ensure, within a specified degree of certainty, that for the particular data set being studied, the clusters are not the result of serendipitous coordination of gene behavior and that the samples are not inadequate[10] to estimate reliably the coordinated behavior of gene expression. An alternative method is to estimate the probability of particular clusters. This methodology is still in its infancy and the relevant literature is only now beginning to be generated for functional genomics [15, 178]. As with other aspects of functional genomics, much larger data sets will allow more effective use of these methods.

Ultimately, however, we should remember that the clusters or classifications that we obtain through machine-learning techniques are only reflections of the measurements made in a particular system. If we are to make broad claims about the empirical or scientific truth of the relationships or classes that we infer from expression data, we will have to submit our analyses to the same kind of tests that have been developed for other experimental sciences with longer histories. Specifically, we need to at least come up with the microarray equivalent of Koch's postulates. In 1890, Robert Koch set out to develop criteria for judging whether a given bacterium was the cause of a given disease. The criteria are summarized in table 2.2.

Table 2.2: Koch's postulates

1. It (the suspected pathogen) should be present in every instance of the disease.

2. It should be isolated from the diseased host and grown in pure culture.

3. The specific disease must be reproduced when a pure culture of the agent is inoculated into a healthy susceptible host.

4. The same agent must be recoverable from the experimentally infected host.

It has been recognized that there are several problems with Koch's postulates, and there is a great deal of discussion and disagreement about them in the literature of the history of science [143]. However, the postulates are a good first approximation of the requirements for validation of functional genomics experiments. Analogs to Koch's postulates in the domain of microarrays may therefore be useful. These are listed in table 2.3. The criteria describe the tests for an inferred regulatory dependency obtained through microarray analysis.

Table 2.3: Functional genomics analogs to Koch's postulates

1. If gene A is found to be correlated with gene B then this relationship should be reproducible through northern blots and other quantitative expression measurement techniques [151].

2. Furthermore, in all pathological conditions in which the gene is thought to play a critical role, the predicted level of that gene should be found at the right time and in the right part of the tissue during the disease process.

3. If a gene thought to be involved in a pathway is underexpressed or overexpressed in a model system, then the process controlled by that pathway will be affected.

It is only with these biological tests that we can "rate" the performance of clustering algorithms; they are the only durable figure of merit. It is not surprising that very little of this kind of hypothesis testing has been reported in the literature. After all, each gene in a cluster implies a laborious sleuthing effort in which the bioinformatician and biologist must closely collaborate to verify the results using the analogs to Koch's postulates in table 2.3. If we are to avoid a functional genomics "meltdown," hinted at in the introductory chapter, then the interdisciplinary nature of the functional genomics pipeline (diagrammed in figure 1.3.1) will have to be substantively adopted. All gene associations or regulatory relationships suggested by bioinformatics techniques should undergo at least a minimal validation by the biologist. Conversely, experimental design without the involvement of the bioinformatician is likely to lead to wasteful, expensive, and unrewarding analyses.

[7]In other words, there are several distinct and statistically distinguishable processes.

[8]Well-known examples of these are the transcriptional factors such as sonic hedgehog (shh) which in some tissues and at some times are involved in cell proliferation and in others in cell differentiation processes.

[9]Although this paper is one of the better examples of a comparison of classification techniques, it does not compare the entire ROC curve but only a single point on the curve. Therefore, it remains unclear what the overall performance of these algorithms would be under circumstances in which specificity and sensitivity were valued differently. See page 201 for a brief discussion of using the ROC curve to evaluate a classification test.

[10]Insufficient numbers of biological samples, insufficiently strong effects, or sufficiently distinct biological processes, or any combination of these.
