Fig. 12. k-Means clustering: (A) iteration 1; (B) iteration 2.
into which the sample should be divided. Because the classes themselves are unknown in the setting of unsupervised learning, it is reasonable to expect that their number would also be unknown. For example, in attempting to describe new tumor subtypes, it might be unclear how many subtypes to expect.
9.6.4. Which Method to Choose? There is great interest in improving the performance of unsupervised learning algorithms and in refining specific aspects, such as error description. Work continues on existing hierarchical and partitioning methods, as well as on the development of novel techniques. In the absence of consensus, an extended discussion of the relative merits of individual applications is unlikely to be fruitful for this audience. Nonetheless, researchers and clinicians alike require an interim objective measure to assess the quality of the analysis. We suggest the following framework in the context of ongoing development. The study should clearly delineate the hypothesis being tested in terms of supervised learning, unsupervised learning, or other analysis. One quality of an unsupervised learning experiment is that, by definition, it is hypothesis generating and the classes described by the analysis are unknown a priori. Because the classes are unknown, the criterion by which to judge the quality of the algorithm should not be its ability to determine known classes. Better criteria include such features as reproducibility of clusters or measures of cluster strength. The hypotheses generated in an unsupervised analysis can then be verified by supervised analyses or other methods.
Within the context of a stated hypothesis, any specific study goals that might clarify the choice of one analytic approach should be noted. For example, if there is a strong a priori reason to consider only two subgroups in an unsupervised analysis, an author would have a reasonable argument to perform a partitioning analysis with k = 2. If there are no such specific requirements, then the choice of one method over another should be made explicit, or consideration should be given to the use of multiple methods. In the absence of other justification, hierarchical clustering should probably be viewed as the standard for unsupervised learning.
9.7. SUPERVISED LEARNING Supervised learning is the process of segregating samples into known classes based on collections of data called features. As in the unsupervised learning example, the features are genes. The machine learning techniques, along with user inputs, determine which genes should be used and how. Classes can represent a wide variety of biology, from phenotype to clinical behavior. Examples of supervised learning include using genomic data to differentiate known tumor types or to predict aggressive tumor behavior, response to drug therapy, or tendency to metastasize. Supervised learning algorithms have two components. The first is a training stage, in which a training dataset is used to determine the relationship between genes and outcome. The second is a validation stage, in which the hypothesized relationship is evaluated in independent samples not used to develop the model, also called the test dataset. In the validation step, for which the classes are known, the model is evaluated for its error rate in predicting known classes using the selected genes.
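The two stages can be made concrete with a toy sketch. The data, function names, and one-gene threshold rule below are invented for illustration and are not any of the chapter's actual methods; the point is only the separation between a training stage and a validation stage on independent samples.

```python
# Toy two-stage supervised learning on invented one-gene data.
# Each sample is (expression_value, class_label).

def train_threshold(train_set):
    """Training stage: place a cutoff midway between the two class means."""
    a = [x for x, label in train_set if label == "A"]
    b = [x for x, label in train_set if label == "B"]
    return (sum(a) / len(a) + sum(b) / len(b)) / 2

def predict(threshold, expression):
    """Assign class by which side of the cutoff the expression falls."""
    return "A" if expression < threshold else "B"

def error_rate(threshold, test_set):
    """Validation stage: fraction of known classes predicted incorrectly."""
    wrong = sum(predict(threshold, x) != label for x, label in test_set)
    return wrong / len(test_set)

train = [(1.0, "A"), (1.2, "A"), (3.0, "B"), (3.4, "B")]
test = [(0.9, "A"), (3.1, "B"), (2.3, "A")]  # independent samples
t = train_threshold(train)
print(error_rate(t, test))
```

Note that the error rate is measured only on samples that played no part in choosing the threshold; this is the distinction the validation stage exists to enforce.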
In the parlance of machine learning, the process of finding a relationship among features, genes, and class assignments (e.g., clinical outcomes) is termed "building a classifier." The classifier itself is any set of rules to establish the relationship. Classifiers, their rules, and their subtypes take many forms (93-98). In the simplest machine learning cases, a classifier predicts the class by relating responses to a series of binary yes/no features. For example, a training set of tumor specimens using a decision-tree classifier could establish a rule in which any female patient with an estrogen receptor-positive tumor would be assigned to the class "breast cancer." In more complicated examples, every gene on a microarray might contribute a vote toward tumor classification, such that the importance of any single gene might be difficult to interpret.
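The one-rule decision tree in the example above can be written out directly. The field names here are invented stand-ins for clinical annotations:

```python
def assign_class(patient):
    """A single decision-tree rule, mirroring the text's example:
    female + estrogen receptor-positive -> "breast cancer"."""
    if patient["sex"] == "female" and patient["er_positive"]:
        return "breast cancer"
    return "unclassified"

print(assign_class({"sex": "female", "er_positive": True}))   # breast cancer
print(assign_class({"sex": "male", "er_positive": False}))    # unclassified
```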
9.7.1. k Nearest Neighbors The partial list shown in Table 2 shows that there are far more supervised learning techniques than unsupervised learning techniques. We will discuss two examples that have been widely used and then consider the broader question of choosing an optimal technique. The k-nearest-neighbors technique is simple but useful, relying on the distance concepts previously discussed. A set of useful genes is selected from those provided by a training set of samples. The definition of what constitutes a useful gene is not explicit but, in general, implies genes that are differentially expressed in the different classes. Once these genes are selected, they
Table 2
Selected Supervised and Unsupervised Learning Algorithms

Supervised
    Linear discriminant analysis
    Classification trees (classification and regression trees)
    Neural networks (linear perceptron, learning vector quantization)
    Nearest neighbors (k nearest neighbors)
    Support vector machine
    Density based
    Regression (linear, linear and nonlinear)
    Other (boosting, bagging, naïve Bayesian, predictive modeling)

Unsupervised
    Hierarchical (agglomerative clustering [AGNES, agglomerative nesting], divisive clustering)
    k-Means
    Density based
    Model based

define the multidimensional gene space that will be used for the analysis, and each sample is mapped into its corresponding location. For our purposes, we select two genes defining a two-dimensional gene space that separates three known tumor classes A-C well, as seen in Fig. 13. Samples from the training set are represented by diamonds, with samples from similar classes segregating closely in gene space. To the extent that the expression of the genes defining gene space describes the identities of classes A, B, and C, a new sample of those classes should also map in the same vicinity. The k-nearest-neighbors algorithm assigns the class of a new sample by selecting its k nearest neighbors, k representing a predetermined experimental parameter, often 3 or higher. Setting k = 3 in the example, the hollow star would be assigned to class C because all three nearest neighbors are of class C. The solid star is of true class A, but two of its three nearest neighbors are of class B, suggesting that it would incorrectly be assigned to class B. In some formulations of the k-nearest-neighbors algorithm, the voting is not simple majority rule. For example, the nearest neighbor can be given the most weight, or the weighting can fall off with distance.
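A minimal sketch of the majority-vote formulation follows. The two-dimensional coordinates are invented stand-ins for the two selected genes, loosely mimicking the three clusters of Fig. 13:

```python
import math
from collections import Counter

def knn_classify(sample, training, k=3):
    """training: list of ((gene1, gene2), class_label).
    Assign the majority class among the k nearest training samples."""
    nearest = sorted(training, key=lambda t: math.dist(sample, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented training points forming three well-separated clusters.
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"),
            ((9, 1), "C"), ((9, 2), "C"), ((8, 1), "C")]

print(knn_classify((8.5, 1.5), training))  # lands among the class C samples
```

The distance-weighted variants mentioned above would replace the simple `Counter` vote with votes scaled by each neighbor's distance.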
9.7.2. Support Vector Machine An alternate classification approach, used by the support vector machine classifier, is to divide gene space into sectors that represent the classes. Again, the training set is used to select a set of genes that are useful for making class distinctions, and samples are mapped into gene space. Two dimensions are shown in Fig. 14 for simplicity's sake, although, in practice, two dimensions are rarely sufficient to separate complex classification problems. Higher-dimension models often have 50-100 genes or more and are more likely to find gene spaces that separate classes well. Borders around each class can be defined in two dimensions by a line, in three dimensions by a plane, and in higher dimensions by a hyperplane. Computationally, borders are defined by classifiers such as the support vector machine algorithm, which maximizes the distance between the samples and the border while minimizing the distance by which outliers fall into space defined by another class. Once defined, test samples can be mapped into gene space and assigned a class according to the sector in which they fall.
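Once a separating border has been found, assigning a test sample reduces to checking which side of the hyperplane it falls on. The sketch below illustrates only that final step; the weights and bias are invented for the example, whereas in practice they would be produced by an SVM solver fitted to the training set:

```python
def hyperplane_side(sample, weights, bias):
    """The sign of w·x + b determines the sector, and hence the class."""
    score = sum(w * x for w, x in zip(weights, sample)) + bias
    return "class A" if score > 0 else "class B"

# An invented two-gene border: the line gene1 + gene2 = 10.
w, b = (1.0, 1.0), -10.0
print(hyperplane_side((7.0, 6.0), w, b))  # above the line
print(hyperplane_side((2.0, 3.0), w, b))  # below the line
```

In higher dimensions the code is unchanged: `weights` simply grows to one entry per gene, which is the sense in which a line generalizes to a hyperplane.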
9.7.3. Which Method to Choose? Faced with the previous classification examples in conjunction with those enumerated in Table 2, the question arises as it did with unsupervised learning: Which method is preferred? In part, the answer is suggested by the work of Duda et al. in the form of the oddly titled "No Free Lunch" theorem (89). In essence, the theorem states that there is no single best classifier; rather, the performance of a classifier is a function of the question being asked and the data available to answer it. For example, if method A is optimal at classifying a tumor into histologic subtypes based on a given set of genes, there is no guarantee that method A will work to classify those same tumors by clinical behavior. Similarly, given a different set of genes on which to classify, there is no guarantee that method A will still outperform other classifiers. The performance of a classifier in a specific situation can only be learned through validation. That is not to say that different types of classifiers might not be preferred in a given situation. For example, decision trees function through a series of binary splits. Genomic expression data are continuous and might not be conveniently framed as a decision-tree problem. A number of considerations of this sort enter into the selection of an analytic approach, and the reader should consider further study in classification (99). For the reader of this text, however, no matter what algorithm is selected, the performance should be evaluated in validation.
9.7.4. Validation The validation process takes a variety of forms. Genes that have been suggested as interesting are often validated by use of conventional techniques. The author can return to the original clinical sample and verify increased RNA expression by Northern blot, real-time polymerase chain reaction (PCR), or other methods. In addition, investigators look for genes and proteins related to those suggested by the microarray analysis in order to demonstrate overall consistency of the hypothesis-generating experiments.
In addition to validation of the biologic findings, the authors will usually present at least one form of analytical validation related to the classification component of the study. The method of validation can take a number of forms depending on the study limitations. The most common study limitation currently in the field of genomics is that of sample size. An investigator's ability to validate results is significantly constrained by small sample size; however, a number of techniques have evolved for use even in the setting of very small studies.
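One widely used small-sample technique is leave-one-out cross-validation: each sample in turn is held out, the classifier is built on the remaining samples, and the held-out sample is predicted. The generic sketch below (function names invented) pairs it with a simple one-nearest-neighbor rule for illustration:

```python
import math

def nearest_neighbor_predict(train, sample):
    """1-nearest-neighbor: return the label of the closest training point."""
    return min(train, key=lambda t: math.dist(sample, t[0]))[1]

def leave_one_out_error(data):
    """Hold out each sample once; report the fraction misclassified."""
    errors = 0
    for i, (sample, label) in enumerate(data):
        train = data[:i] + data[i + 1:]          # all samples except the i-th
        if nearest_neighbor_predict(train, sample) != label:
            errors += 1
    return errors / len(data)

# Invented two-gene data; the (3, 3) sample sits between the clusters.
data = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((5, 6), "B"), ((3, 3), "B")]
print(leave_one_out_error(data))
```

Because every sample is scored by a model that never saw it, the resulting error rate is a less optimistic estimate than performance measured on the training data itself.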
Recalling that most microarray experiments contain relatively few samples, it is understandable and, in fact, efficient to use all of them in the process of building the classifier. This produces a dilemma when trying to validate the performance of the model, as the classifier will always overestimate its true performance in the data that were used to develop it compared to independent data. The reason for increased performance in the training set is partially the result of random chance in the