measurement of gene expression. Just as the measurement of any given gene will vary somewhat around its true value because of measurement error and random chance, the measured group mean of a gene will vary around the true mean. To the extent that random error artificially increases the strength of a gene's association with a given class assignment, a classifier will select that gene preferentially. To the extent that random error artificially decreases the strength of association with class assignment, a classifier will omit that gene. Because of random error, the classifier includes genes that look better than they actually are for predicting a class, and it omits some genes that might otherwise have been included. To evaluate the extent of this and other shortcomings, classifiers need to be validated.
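The selection bias described above can be seen in a small simulation. The sketch below (all sizes and thresholds are illustrative, not from the text) generates genes with no true class difference at all, then shows that the genes a naive filter would pick are exactly those whose random error made them look most associated:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 genes with NO true difference between the two classes:
# any apparent association is pure measurement error.
n_genes, n_per_class = 1000, 10
class_a = rng.normal(0, 1, size=(n_genes, n_per_class))
class_b = rng.normal(0, 1, size=(n_genes, n_per_class))

# Observed mean difference per gene (the true difference is 0 for all).
observed_diff = class_a.mean(axis=1) - class_b.mean(axis=1)

# A naive classifier that keeps the most "associated" genes selects
# exactly those whose random error happened to be largest.
best = np.argsort(np.abs(observed_diff))[-10:]
print(f"Top-10 apparent |difference|: {np.abs(observed_diff)[best].mean():.2f}")
print("True difference for every gene: 0.00")
```

The selected genes' apparent effect sizes are substantial even though every true effect is zero, which is why an error rate measured on the training data alone is optimistic.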
Of the techniques developed to overcome the limitations of small sample size, the most popular is probably the bootstrap. Bootstrapping, or sampling with replacement, is a method in which one or more samples are removed from the total sample set and duplicates of the remaining samples are substituted in their place, so that the overall sample size remains the same. For example, in a dataset containing samples A, B, and C, a bootstrap dataset might contain samples A, B, and B; sample C is withheld for validation purposes.
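The resampling step can be sketched in a few lines. This is a minimal illustration of sampling with replacement (the seed and sample names are arbitrary), not a prescription for any particular analysis package:

```python
import random

random.seed(1)  # for a reproducible illustration

samples = ["A", "B", "C"]

# Sampling with replacement: the bootstrap dataset has the same size as
# the original, but some samples appear more than once and others not at all.
boot = [random.choice(samples) for _ in samples]

# Samples never drawn ("out-of-bag" samples) can be withheld for validation.
out_of_bag = [s for s in samples if s not in boot]
print(boot, out_of_bag)
```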
Using the bootstrap sample set, the classifier is built and then validated on the samples that have been withheld. In this way, an error rate for the classifier can be calculated. The classifier made in this way is not exactly the same as the one that would have been made had the dataset not been modified, but it is very similar, and its error rate will be similar as well. The process is repeated many times: many bootstrap datasets are constructed, and an error rate is calculated for each. When the error rates are combined, we have an estimate of the error rate for the final classifier, which is itself built on the total sample dataset, not a bootstrap. Note that bootstrap validation estimates the error rate for classifiers of a given type in a given dataset; it is not the true error rate of the final classifier in an independent dataset. An alternative to the bootstrap would be to withhold a proportion of samples from the training set to use as the test set, although this is generally regarded as an inefficient use of data. In another common alternative approach, a classifier that has been built previously can be tested on a prospectively collected data source.
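The full build-and-validate loop can be sketched end to end. The example below is a toy, with a simple nearest-centroid classifier standing in for whatever method a study actually uses, and with made-up data sizes and effect sizes; the point is only the structure: resample, train on the bootstrap set, test on the withheld out-of-bag samples, and combine the error rates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression data: 40 samples x 5 genes, two classes separated by a
# real mean shift in each gene (all sizes here are illustrative).
n, p = 40, 5
y = np.repeat([0, 1], n // 2)
X = rng.normal(0, 1, size=(n, p)) + y[:, None] * 1.5

def nearest_centroid_error(X_train, y_train, X_test, y_test):
    """Build a nearest-centroid classifier and return its test error rate."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return (pred != y_test).mean()

# Repeat over many bootstrap datasets; validate each classifier on the
# out-of-bag samples it never saw, and combine the error rates.
errors = []
for _ in range(200):
    boot = rng.integers(0, n, size=n)          # sample indices, with replacement
    oob = np.setdiff1d(np.arange(n), boot)     # withheld samples
    if oob.size == 0 or len(set(y[boot])) < 2:
        continue  # skip degenerate resamples
    errors.append(nearest_centroid_error(X[boot], y[boot], X[oob], y[oob]))

print(f"Bootstrap estimate of error rate: {np.mean(errors):.3f}")
```

As the text notes, the number printed estimates how classifiers of this type perform in this dataset; it is not the error rate of the final classifier on truly independent data.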
9.8. LIMITATIONS AND SPECIAL CONSIDERATIONS OF SUPERVISED AND UNSUPERVISED LEARNING

We have already introduced the concern that distance measured in gene space has weaknesses as a foundation for both clustering and classification algorithms. Looking at the specific methods we present, we should again consider the implications of using distance. In all of the examples, we graphed genes 1 and 2 on the same linear 1-12 scale. In practice, however, genes can vary over vastly different expression ranges, and the distance algorithms will overweight those genes with higher expression. A gene that varies by a factor of 10, from 100 to 1000 for example, will be weighted the same as a gene that varies by only 10 percent, from 10,000 to 11,000, if a correction is not included in the analysis. Distance measures that correct for scale, error, or other considerations are available for use in all of the previously cited examples. In most cases, however, the selection of the optimal distance measure will not be obvious to the investigator a priori and, again, might best be determined in validation experiments.
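The scale problem in the text's 100-to-1000 versus 10,000-to-11,000 example can be made concrete. The sketch below computes the raw Euclidean distance between two samples and then applies one common correction, a log transform; this is just one illustrative choice, not the only (or necessarily the best) correction for a given dataset:

```python
import numpy as np

# Two samples measured on two genes. Gene 1 changes 10-fold (100 -> 1000);
# gene 2 changes only 10 percent (10,000 -> 11,000).
s1 = np.array([100.0, 10_000.0])
s2 = np.array([1000.0, 11_000.0])

# Raw Euclidean distance: each gene contributes a change of roughly 1000
# units, so the biologically dramatic 10-fold change carries no extra weight.
raw_contrib = np.abs(s1 - s2)
raw = np.linalg.norm(s1 - s2)

# On a log scale, fold changes rather than absolute changes drive distance,
# so gene 1 now dominates.
log_contrib = np.abs(np.log10(s1) - np.log10(s2))
log_dist = np.linalg.norm(np.log10(s1) - np.log10(s2))

print(f"raw distance: {raw:.0f}, per-gene contributions: {raw_contrib}")
print(f"log-scale distance: {log_dist:.2f}, per-gene contributions: {log_contrib}")
```

On the raw scale the two genes contribute 900 and 1000 units, nearly equally; on the log scale the 10-fold change contributes about 1.0 versus roughly 0.04 for the 10 percent change.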
Although the concept of distance remains at the center of much of supervised and unsupervised learning, there are many techniques listed in Table 2 that do not directly rely on multidimensional gene space. For the reader interested in further study, a variety of information sources expound on specific methods, including the computational applications for those wishing to perform genomics assays themselves (100,101). For those interested in evaluating the current literature, however, we see little value in a full development of classification techniques at the current time. This is particularly true because in the coming years we will likely see developments in the technology and, eventually, standards that are not readily apparent at this time. Accepting a lack of standard methods obliges us in some cases to view the machine learning component of some studies as a black box for the present time.
Ultimately, the black box of machine learning is no different from that of commonly used research methodologies and should not interfere with the ability to evaluate the overall quality of genomics research. For example, relatively few clinicians have training in linear regression, logistic regression, or Cox proportional hazards modeling, yet most have developed comfort with the results of studies reliant on these methods. By viewing genomics studies in the same modular fashion as those using regression techniques, we believe that the reader has sufficient information to evaluate the overall quality of the research regardless of the supervised learning methods used. The modules of a study or clinical application are exactly the section headings of this chapter, including study design, sample preparation, normalization, gene filtering, supervised/unsupervised learning, and validation.