Perhaps the best way to test a model, whether constructed in a supervised or unsupervised manner, is to evaluate that model on a new set of inputs. This is typically done by determining a priori which inputs will make up the training set (i.e., the input samples used to create the model) and which will make up the testing set (i.e., those used to test the model). It is important to determine whether measurable population characteristics are similar between these two sets. For instance, one would not want to train a system using samples from one type of tissue and then test it using samples from a different type of tissue.
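As a minimal sketch of such an a priori partition (the sample counts and class names here are hypothetical, not drawn from any particular data set), one can split the inputs at random and then verify that a measurable population characteristic, such as the class balance, is similar in the two sets:

```python
import random

def split_train_test(n_samples, test_fraction=0.3, seed=0):
    """Partition sample indices a priori into training and test sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # train, test

def class_proportions(labels, indices):
    """Fraction of each class within the chosen subset."""
    counts = {}
    for i in indices:
        counts[labels[i]] = counts.get(labels[i], 0) + 1
    return {c: n / len(indices) for c, n in counts.items()}

# 40 hypothetical samples: 20 AML, 20 ALL
labels = ["AML"] * 20 + ["ALL"] * 20
train_idx, test_idx = split_train_test(len(labels))

# Compare a population characteristic (here, class balance) across the sets
print("train:", class_proportions(labels, train_idx))
print("test: ", class_proportions(labels, test_idx))
```

If the proportions diverge badly, a stratified split (shuffling within each class) would be the usual remedy.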
This type of testing is crucial in that it prevents the generation of models that overfit the original input data. Overfitting is easiest to describe with an example. Imagine a thought experiment in which one takes a list of tyrosine kinases (e.g., insulin receptor, IGF-1 receptor) and phosphatases (e.g., PP1, PTP1B). Each name is written on a flashcard, along with one of the two category names. Now one shows the flashcards to a child repeatedly. When later quizzed with cards lacking the category names, the child may be able to assign the correct function simply by noticing that within the input set, whenever a name contained the word "receptor," the card referred to a tyrosine kinase. Needless to say, the "rule" this child generated is quite specific to the particular set of input samples and their representation on flashcards.
Because so many features are typically present in microarray data sets, and because there are so few categories into which the samples have to be classified, it becomes very easy for a machine-learning mechanism to generate large numbers of "rules" specifying how each feature can accurately select the proper category. However, the characteristics of these genes may be specific to the input sample set at hand, and those rules may not generalize to other input samples. Thus, it becomes important to test these rules using independent inputs.
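This many-features, few-samples problem can be made concrete with a small simulation (the sample and feature counts below are illustrative, not taken from any real experiment): even when every "expression value" is pure noise with no relation to the class labels, some features will separate the two classes perfectly by chance alone.

```python
import random

rng = random.Random(42)

n_samples, n_features = 10, 5000          # few samples, many features
labels = [0] * 5 + [1] * 5                # two disease categories
# Expression values drawn independently of the labels: no real signal
data = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for _ in range(n_samples)]

def separates_perfectly(feature_idx):
    """True if a single threshold on this feature splits the two classes."""
    class0 = [data[i][feature_idx] for i in range(n_samples) if labels[i] == 0]
    class1 = [data[i][feature_idx] for i in range(n_samples) if labels[i] == 1]
    return max(class0) < min(class1) or max(class1) < min(class0)

spurious = sum(separates_perfectly(j) for j in range(n_features))
print(f"{spurious} of {n_features} noise features classify the samples perfectly")
```

For 5-versus-5 samples, roughly 1 in 126 random features separates the classes perfectly, so dozens of spurious "classifier genes" are expected here; none of them would survive testing on independent inputs.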
At the risk of redundancy, we emphasize that finding test and training sets that are highly comparable is of paramount importance. In the discussion of the sources of noise in microarray experiments (section 3.2), we demonstrated that we could cluster a set of leukemia experiments into their original test and training sets because of systematic differences in the measurements between the two sets (figure 4.16). In the presence of such systematic biases, the accuracy of a clustering or classification technique on a test set may not be a meaningful measure of its performance.
Figure 4.16: Identifying differences between the test and training sets through clustering. The matrix incision tree algorithm was applied to the leukemia classification problem published by Golub et al. The algorithm correctly clustered 68 of the 72 cell lines (94%) in the data set, placing the acute myelogenous leukemia (AML) samples in branches (c) and (d), the acute lymphocytic leukemia (ALL) samples in branches (a) and (b), and misclassifying four ALL samples into branches (c) and (d). However, the matrix incision tree also revealed the distinction between the published training set (cases 1-38, italic, placed in branches (b) and (c)) and the test set (cases 39-72, placed in branches (a) and (d)) with 100% accuracy. It is unlikely that the biology of the AML and ALL cell lines in the test set was that different from that of the cell lines in the training set. It is much more likely that systematic changes in the hybridization conditions, different sample preparation, or the use of a different batch of microarrays were responsible for the clustering of test versus training sets.
Cross-validation testing

Cross-validation is a classic approach to testing the robustness of models generated by machine-learning algorithms. It builds on the idea of having both a testing and a training set, but is used when the total number of inputs is small. The technique works by repeatedly partitioning the available input into subsets for repeated trials. For each of these subsets, the algorithm or machine-learning technique is run and the output model is saved. After all the trials, the set of output models can be analyzed to determine those characteristics or features that are consistently represented across the trials. As an example, if in 10 trials gene A always accurately classifies a set of samples between two diseases, but gene B does so in only 6 of the 10 trials, then gene A may be viewed as the more robust classifier, in that it may be less influenced by the particularities of a single input set.
The traditional way this is performed is n-way leave-one-out cross-validation, where n is the number of input samples: n input subsets are created, each omitting a different one of the n samples, and the machine-learning technique is applied to each of these subsets. Cross-validation testing is easier to apply to supervised learning methods than to unsupervised ones, as it is difficult to define accurate metrics for judging the strength of the generated unsupervised models. This approach has been used in several published works, including [19, 33, 46, 60, 72, 78, 86, 90].
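A minimal sketch of n-way leave-one-out cross-validation follows; the two "genes," their expression values, and the midpoint-threshold classifier are all hypothetical, chosen only to show how fold-by-fold accuracy distinguishes a robust feature from a noisy one.

```python
import random

rng = random.Random(1)
n = 12
labels = [0] * 6 + [1] * 6

# Hypothetical expression values for two genes:
# gene A carries a real class difference; gene B is pure noise.
gene_a = [rng.gauss(0.0, 0.5) if y == 0 else rng.gauss(3.0, 0.5) for y in labels]
gene_b = [rng.gauss(0.0, 1.0) for _ in labels]

def classifies_fold(values, labels, held_out):
    """Fit a midpoint threshold on n-1 samples; test on the held-out sample."""
    train = [i for i in range(len(labels)) if i != held_out]
    group0 = [values[i] for i in train if labels[i] == 0]
    group1 = [values[i] for i in train if labels[i] == 1]
    mean0, mean1 = sum(group0) / len(group0), sum(group1) / len(group1)
    threshold = (mean0 + mean1) / 2
    predicted = 1 if (values[held_out] > threshold) == (mean1 > mean0) else 0
    return predicted == labels[held_out]

# n-way leave-one-out: count the folds each gene classifies correctly
correct_a = sum(classifies_fold(gene_a, labels, k) for k in range(n))
correct_b = sum(classifies_fold(gene_b, labels, k) for k in range(n))
print(f"gene A correct in {correct_a}/{n} folds; gene B in {correct_b}/{n}")
```

Gene A, whose class difference is real, should classify nearly every held-out sample correctly, while the noise gene hovers near chance; by the logic above, gene A is the feature to trust on independent inputs.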