Having described the microarray experiment from chip design to analysis, we can now practice the lessons of this chapter by examining two key microarray publications. The first of these, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring" by Golub et al., published in 1999, was one of the earliest in the field, yet remains current in terms of its methodology (6). The investigators selected acute leukemia as their model biological system, using only mononuclear cells obtained from bone marrow biopsies, thus avoiding many of the sample preparation concerns discussed in Section 4. The authors were provided with Affymetrix oligonucleotide arrays, a platform with which investigators at the Whitehead Institute's Center for Genomic Research have worked extensively. Because the array was commercially available, there is little discussion of its specific characteristics in this publication.
In the initial gene-filtering step, the authors selected a subset of informative genes: 1100 of a possible 6817 genes that correlated with the study parameter of interest, acute leukemia subtype. The authors then asked the supervised learning question: How can this subset of genes be used to differentiate the known acute leukemia subtypes, acute myeloid leukemia and acute lymphoblastic leukemia? They used a classifier called weighted voting, implemented by the software Genecluster, and selected a 50-gene model to make class predictions. Although we have not discussed the weighted voting classifier, we can still evaluate its performance in the validation process. The authors performed the two validation steps we discussed in Section 9.7: internal and external validation. The internal validation was the leave-one-out method, similar to the bootstrap, in which weighted voting classifiers were constructed for datasets that serially excluded a single sample. Each classifier was applied to the single sample that had been excluded from the data used to develop it, and a cumulative error rate was calculated. In this application, the weighted voting method correctly assigned 36 of 38 leukemias, making no errors and declining to make a call on the remaining 2 samples. Subsequently, all samples were used to develop a final weighted voting classifier, which was applied to an independent dataset of 34 samples, again with good performance: the classifier correctly assigned 29 samples, made no errors, and made no call on 5.
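The leave-one-out procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weighted voting classifier is not discussed in this chapter, so a simple nearest-centroid rule stands in for it, and the function names and the tiny expression matrix are hypothetical, not data from the study.

```python
from statistics import mean

def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (NOT weighted voting): assign `sample` to the
    class whose mean expression profile is closest in squared distance."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_errors(x, y, predict=nearest_centroid_predict):
    """Hold out each sample in turn, build the classifier on the rest,
    and count how many held-out samples are misclassified."""
    errors = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]   # all samples except sample i
        train_y = y[:i] + y[i + 1:]
        if predict(train_x, train_y, x[i]) != y[i]:
            errors += 1
    return errors

# Hypothetical toy data: 6 samples (rows) by 3 genes (columns).
expression = [
    [9.0, 1.0, 5.0],
    [8.5, 1.5, 5.2],
    [9.2, 0.8, 4.9],
    [2.0, 7.0, 5.1],
    [1.5, 7.5, 5.0],
    [2.2, 6.8, 4.8],
]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

errors = leave_one_out_errors(expression, labels)
print(f"{errors} errors in {len(labels)} held-out predictions")
```

Because `predict` is passed in as a parameter, any other classifier with the same three-argument form could be substituted without changing the validation loop. This sketch has no "no call" outcome; a classifier that abstains on low-confidence samples, as in the study, would need a third return value.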
The authors then asked the unsupervised learning question: If the subtypes of leukemia had not been known, could they have been discovered based on strong gene patterns in the data? In this analysis the authors used the full set of 6817 genes and the self-organizing maps (SOMs) partitioning method. As we discussed, partitioning methods require that the user specify a number of clusters a priori, in this case two. Also recall that unsupervised learning looks for patterns in the data without regard to a named class. Looking only at the data structure, with the specification that it be divided into two groups, the SOM method did in fact "discover" two tumor subtypes, called A1 and A2, corresponding almost perfectly to the known acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) subtypes. Once an unsupervised learning method has identified a data cluster, the group and the samples belonging to it can be named, A1 and A2 for example. Once a sample belongs to a named class, supervised learning can be performed so that future samples of that type can be identified. The investigators performed that experiment, building a new classifier (a supervised learning experiment) to identify A1 and A2. The new classifier was applied to the independent data, assigning each sample to class A1, class A2, or uncertain. In almost all cases, A1 status was assigned to ALL and A2 to AML. The example here is contrived in that the "correct" tumor class was previously known. However, the authors demonstrate that, had the AML-ALL distinction not been known, it could have been discovered using this method.
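The class-discovery step can be illustrated with a minimal partitioning sketch. Note the assumptions: the study used self-organizing maps, whereas the code below substitutes plain k-means with two clusters as a simpler partitioning method that likewise requires the number of clusters up front. The function names, the deterministic starting rule, and the toy data are all hypothetical.

```python
from collections import Counter

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def two_means(samples, iterations=20):
    """Partition samples into two clusters (k-means with k=2).
    Stand-in for the SOM used in the study, not the same algorithm."""
    # Deterministic start: the first sample and the sample farthest from it.
    first = samples[0]
    far = max(samples, key=lambda s: squared_distance(s, first))
    centroids = [list(first), list(far)]
    assignment = []
    for _ in range(iterations):
        # Assign every sample to its nearest centroid.
        new_assignment = [
            min((0, 1), key=lambda k: squared_distance(s, centroids[k]))
            for s in samples
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for k in (0, 1):
            members = [s for s, a in zip(samples, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical toy data: 6 samples by 3 genes; labels are NOT used for clustering.
expression = [
    [9.0, 1.0, 5.0],
    [8.5, 1.5, 5.2],
    [9.2, 0.8, 4.9],
    [2.0, 7.0, 5.1],
    [1.5, 7.5, 5.0],
    [2.2, 6.8, 4.8],
]
known = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

clusters = two_means(expression)
discovered = ["A1" if c == 0 else "A2" for c in clusters]

# Only after clustering do we compare discovered classes with known subtypes.
table = Counter(zip(discovered, known))
for (cluster_name, subtype), count in sorted(table.items()):
    print(cluster_name, subtype, count)
```

The known labels enter only in the final cross-tabulation, which mirrors the logic of the text: the clusters are found blind, named, and then checked against the established subtypes.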
The second publication examines a more concrete clinical problem, in which a single histologic tumor type seems to have subtypes based on heterogeneity of clinical behavior (16). The tumor of interest is diffuse large B-cell lymphoma (DLBCL), in which histology alone fails to identify subclasses, yet patients demonstrate markedly different survival. In 2000, Alizadeh et al. reported the cDNA microarray analysis of 96 lymphoma and normal tissue samples. As with the previous example, lymphoid cells avoid many of the difficulties associated with solid tumor samples. The investigators in that study developed their own cDNA array and describe its properties in their introduction. As described in Section 3.2, the cDNA assay requires a reference RNA, for which pooled lymphoma cell lines were used.
Lymphomas are divided into several types, of which DLBCL is one, and several lymphoma types were included in the analysis. Although the ability to describe known lymphoma subtypes might best be approached via supervised learning techniques, this was not the stated primary goal of the analysis. The authors wished to find previously unknown DLBCL subtypes, a goal that lends itself to unsupervised learning techniques. Alizadeh et al. applied hierarchical clustering to the entire sample set, which included a range of samples of known histology. Hierarchical clustering identified unique clusters for all known histologies and suggested two major clusters of DLBCL. The clusters within the DLBCL histology suggested that one histologic class might be divisible into two distinct gene expression classes. The authors discuss a number of genes that are differentially expressed between the classes, theorizing that one cluster represents a germinal-center-like B-cell tumor and the other an activated-B-cell-like tumor. Finally, the authors examined clinical outcomes according to the cluster assignments for the two DLBCL subtypes, demonstrating a clearly worse prognosis for the activated-B-cell-like tumor. We showed in the leukemia example that once a tumor subtype has been suggested by unsupervised learning, it can be named and supervised learning applied. The investigators chose not to pursue the supervised learning analysis in this case.
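For readers who want to see the mechanics, the following is a minimal sketch of agglomerative hierarchical clustering. The choices of Euclidean distance and average linkage are assumptions made for illustration, since the distance metric and linkage used in the study are not described here. The function names and the toy profiles are hypothetical.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def average_linkage(samples, cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(
        euclidean(samples[i], samples[j]) for i in cluster_a for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(samples, n_clusters):
    """Start with one cluster per sample and repeatedly merge the two
    closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                samples, clusters[pair[0]], clusters[pair[1]]
            ),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [sorted(c) for c in clusters]

# Hypothetical toy data: 6 tumor samples by 3 genes.
profiles = [
    [9.0, 1.0, 5.0],
    [8.5, 1.5, 5.2],
    [9.2, 0.8, 4.9],
    [2.0, 7.0, 5.1],
    [1.5, 7.5, 5.0],
    [2.2, 6.8, 4.8],
]

# Cutting the tree at two clusters yields the candidate subtypes.
subtypes = sorted(agglomerate(profiles, 2))
print(subtypes)
```

Unlike a partitioning method, the merge sequence itself does not depend on a preset number of clusters; `n_clusters` only determines where the resulting tree is cut. Sample indices in each returned cluster could then be compared with clinical outcomes, as the authors did for the two DLBCL groups.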
The previous two examples are typical of how microarray analyses are being applied to clinical problems. The methods gathered in this chapter are intended to give the reader a basic understanding of the relevant techniques at each stage of the analysis. We hope this overview is specific enough to be concrete, while acknowledging that an exhaustive review is in many cases not practical. By developing a framework for the generic genomics experiment, including an understanding of the assay, the analytic techniques, and the accompanying databases, the reader can view individual components with flexibility. The individual assays, analytic tools, and databases will evolve; however, the basic principles discussed here will likely prove more durable.