Unsupervised Learning

9.6.1. Introduction Commonly used unsupervised learning techniques fall broadly into two categories, hierarchical and partitioning, of which hierarchical clustering is the more widely used (90). The preference for hierarchical clustering has at least two sources, both rooted in a general challenge of machine learning: the computational resources needed to solve many machine learning problems exactly are not currently available. The gap is bridged by abandoning the search for an exact solution and accepting an approximate one instead. One approximation is to run a complex procedure iteratively until it converges near a single value. Alternatively, instead of evaluating all possibilities, the algorithm selects a random starting solution, which is then iteratively optimized. In these approaches, running additional iterations or choosing a different random seed might produce a different solution, although, in general, the approximations are close. Hierarchical clustering, or at least one version of it, avoids this uncertainty by providing a method that can be solved explicitly and does not rely on a random seed. A second, practical incentive is that software for hierarchical clustering is widely available.

9.6.2. Clustering and Dendrograms Distance measures suggested in Section 9.4 are the basis for hierarchical clustering. Recalling that a sample is uniquely identified in multidimensional gene space by a single point, it is computationally trivial to calculate a distance between any two samples. Samples separated by a smaller distance have expression profiles that are more similar than those separated by larger distances. In this way, some samples might lie in close proximity to one another, clustered, in other words, compared with other samples. The two broad types of hierarchical clustering are termed "agglomerative" and "divisive" (91). Divisive clustering is a top-down approach in which all samples start in a single group and are serially divided into smaller groups until each sample is its own group. Computationally, divisive clustering is difficult to solve explicitly for large numbers of samples. For this reason, agglomerative clustering is preferred. Agglomerative clustering is a bottom-up approach in which individual samples are serially grouped to build progressively larger clusters. In this method, each sample is paired with its nearest neighbor to build the first layer of clusters. The newly formed cluster is assigned its own point in gene space in the neighborhood of the two samples that comprise it. For example, a line in gene space can be drawn between the two samples and the cluster location assigned to the line's midpoint. Once new locations have been calculated for all pairwise clusters, a second round of distance measures between clusters can be calculated. The closest clusters are merged, and the process repeats until all samples have been agglomerated into one group.

Fig. 11. Dendrogram. (Figure appears in color in insert following p. 172.)
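The agglomerative procedure described above can be sketched in a few lines of Python. The Euclidean distance and the midpoint rule for cluster locations follow the example in the text; the function names and the toy data are, of course, illustrative only:

```python
import math

def euclidean(a, b):
    """Distance between two points in gene space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points):
    """Centroid-style agglomerative clustering: repeatedly merge the two
    closest clusters, replacing them with their midpoint, until one cluster
    remains. Returns the merge history as lists of member sample indices."""
    clusters = {i: (p, [i]) for i, p in enumerate(points)}  # id -> (location, members)
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        ids = list(clusters)
        # find the closest pair among the current clusters
        i, j = min(
            ((a, b) for n, a in enumerate(ids) for b in ids[n + 1:]),
            key=lambda pair: euclidean(clusters[pair[0]][0], clusters[pair[1]][0]),
        )
        pi, mi = clusters.pop(i)
        pj, mj = clusters.pop(j)
        # the new cluster sits at the midpoint of the line between the two
        midpoint = tuple((x + y) / 2 for x, y in zip(pi, pj))
        clusters[next_id] = (midpoint, mi + mj)
        merges.append(sorted(mi + mj))
        next_id += 1
    return merges

# Four "samples" in a two-gene space: two obvious pairs.
samples = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(agglomerate(samples))  # → [[2, 3], [0, 1], [0, 1, 2, 3]]
```

The nearest pair (samples 2 and 3) is merged first, then samples 0 and 1, and finally the two resulting clusters are joined, exactly the bottom-up layering the text describes.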

Hierarchical clustering results are displayed in the form of a dendrogram. Although there is some variation in dendrogram format, Fig. 11 displays many of the typical features. Columns in this example represent samples and rows represent genes. At the intersection of a row and column is a colored rectangle whose intensity is proportional to the gene expression in that sample. Depending on the software, a threshold gene expression level, such as the average expression, is often calculated so that color can convey additional information. Samples in which expression is above or below the threshold are displayed in different colors, with color intensity proportional to the distance from that threshold. In this example, red represents greater than average expression and blue represents less than average. The dendrogram represents the clustering of both samples and genes in separate tree structures, seen above and to the left of the grid, respectively. Samples (or genes) closely related in gene space are placed adjacent to each other in the tree and joined with shorter branches. In Fig. 11, samples 1-4 and genes 3, 4, 11, and 12 are very closely related, forming clusters with short branches. Less closely related samples (or genes) have longer branches and are located farther apart in the tree.
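The threshold-based coloring can be sketched as follows, assuming the threshold is the per-gene average expression as in the example; the color names and the function itself are illustrative, not part of any particular software package:

```python
def color_matrix(expression):
    """For each gene (row), color each sample relative to the row's average
    expression: the sign picks the color (above average = red, below = blue),
    and the magnitude of the difference gives the color intensity."""
    colored = []
    for row in expression:
        avg = sum(row) / len(row)
        colored.append([("red" if v >= avg else "blue", abs(v - avg)) for v in row])
    return colored

# One gene measured in four samples; the average expression is 5.0.
grid = color_matrix([[2.0, 4.0, 6.0, 8.0]])
print(grid[0])
# → [('blue', 3.0), ('blue', 1.0), ('red', 1.0), ('red', 3.0)]
```

The first sample is strongly blue (3.0 below average) and the last strongly red (3.0 above average), matching the display convention described above.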

Dendrograms are a convenient method for displaying microarray data from multiple samples and are a widely used tool for viewing experimental analyses. Although useful, they should be viewed with the following shortcomings in mind.

Once a sample has been assigned to a branch, it cannot be reassigned, meaning that any error propagates through subsequent branching decisions. Branch lengths, although proportional to distances between samples, are an ambiguous measurement on which to judge the strength of a cluster assignment. For these reasons, it can be difficult to judge where cluster boundaries occur or whether any significant clusters are present. Fig. 11 suggests two strong clusters, samples 1-4 and samples 5-11, as defined by the major branching of the tree. It is less clear whether these strong clusters should be further subdivided and where those divisions might be.

9.6.3. Partitioning Methods k-means clustering is one of the most commonly described partitioning methods (as illustrated in Fig. 12) and again relies on gene space and distance measures (92). The user first selects the number of classes into which the samples should be divided, two in the example. The algorithm selects a point in gene space, called a centroid, for each of the classes. The starting point for each centroid might be chosen at random or based on some feature of the data, depending on the exact method. In the first iteration of partitioning, each of the samples is assigned to the class of the centroid to which it is closest, as shown in Fig. 12A. Using all samples assigned to a given centroid, the geometric center of their positions is calculated, and the centroid is moved to that location. Samples are then reassigned to a centroid based on its new location, and the iterative process repeats until either the centroids' positions remain stable or other analytic parameters have been satisfied.
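A minimal sketch of this procedure in Python, with an explicit random seed to make the dependence on the starting configuration visible (the function and parameter names are illustrative, not from any particular package):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: pick k random starting centroids, assign each sample
    to its nearest centroid, move each centroid to the geometric center of
    its samples, and repeat until the assignments stop changing."""
    rng = random.Random(seed)            # the random seed the text warns about
    centroids = rng.sample(points, k)    # random starting locations
    assignment = None
    for _ in range(iters):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: centroids are stable
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                   # leave an empty cluster where it is
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assignment

samples = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.2, 4.9)]
labels = kmeans(samples, k=2)
# The two tight pairs land in the same class, whatever the label order.
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
# → True True True
```

Changing `seed` changes the starting centroids; on well-separated data the final partition is the same, but on ambiguous data it may not be, which is exactly the reproducibility concern raised in the next paragraph.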

Partitioning methods suffer from the need to use iterative approaches and random seeds (random starting locations for the centroids). Placing the centroids at different starting locations or running a different number of iterations could result in variations in the final cluster assignments. In addition, most procedures require that the user specify the number of classes in advance.
