## Self Organizing Maps

The use of self-organizing maps (SOM) in functional genomics was first described by Tamayo et al. . The underlying hypothesis in this technique is the standard functional genomics dogma articulated in section 2.2: genes with similar expression patterns are functionally similar. Thus, if one can find groups of genes that are all near each other, such a group is likely to be related functionally.

As typically used, this methodology starts by rescaling each gene into a standard distribution, e.g., with mean = 0 and variance = 1, though this is not required. Each measured gene is mapped into a k-dimensional point, such that each of the k dimensions corresponds to a different tissue sample or microarray. The ith coordinate axis represents the ith sample. Each gene is evaluated by its vector of measurements and is represented by a point in the map. Coordinates for these points therefore represent the expression levels in various experiments. This is easy to visualize for a gene measured in three experiments (see figure 4.3). The multidimensional space would be represented as a hypercube. Each side of the cube represents a different tissue sample or microarray. Each gene would be a multidimensional point within the cube space, with each coordinate equal to its measurement. Figure 4.3: Genes can be represented as points in a multidimensional space. Each sphere represents a gene measured in three tissue samples. Each of the expression levels is considered as a measure along a three-dimensional coordinate. One can imagine that genes with similar expression levels in all three samples may be grouped as clusters in the three-dimensional space. These clusters can be found using an unsupervised technique, such as self-organizing maps. The analysis starts by choosing a map or topology of nodes N (called centroids), typically arranged in a grid. Each node is in the same dimensional space as the genes, or objects being clustered. Each centroid starts at a random position in the k-dimensional space as follows:

Figure 4.3: Genes can be represented as points in a multidimensional space. Each sphere represents a gene measured in three tissue samples. Each of the expression levels is considered as a measure along a three-dimensional coordinate. One can imagine that genes with similar expression levels in all three samples may be grouped as clusters in the three-dimensional space. These clusters can be found using an unsupervised technique, such as self-organizing maps. The analysis starts by choosing a map or topology of nodes N (called centroids), typically arranged in a grid. Each node is in the same dimensional space as the genes, or objects being clustered. Each centroid starts at a random position in the k-dimensional space as follows:

define number_centroids as the number of centroids we will start with define number_axes as the number of samples or experiments for which we have data define vector as an array of floating point numbers, with size = number_axes make a new empty array of vectors called centroids, with size = number_centroids loop through centroids using the index i loop from 1 to number_axes, using index j find the minimum expression level across all the genes measured in sample j find the maximum expression level across all the genes measured in sample j set coordinate j for centroid i to be a random number between the minimum and maximum end loop end loop

The centroids are subsequently mapped to a position in this ^-dimensional space and correspond to the center of a cluster. The mapping process is iterative, meaning that the map is formed over several cycles. During each cycle, a random data point is chosen and all of the centroids are moved toward that point, as illustrated in figure 4.4. The distance each centroid is moved is typically a function of both the iteration number and the distance between the point and centroid, such that the distance moved decreases as more iterations have passed. The distance moved is also less for those centroids farther away from the randomly chosen point.

define number_centroids as the number of centroids we will start with define number_genes as the number of genes we are working with set continue_loop to true set iteration to zero loop while continue_loop set cumulative_moving to a zero vector calculate learning_rate, as a function of iteration pick a random gene G from the list of all genes we have loop through centroids using the index c calculate dissimilarity measure (i.e. distance) from centroid to G

calculate vector v by taking centroid c and subtracting G move centroid c along vector v as a function of learning_rate add the amount moved to cumulative_moving end loop add one to iteration set average_moved to cumulative_moving divided by number_centroids if average_moved is under a predetermined threshold set continue_loop to false end if end loop Figure 4.4: Principle of self-organized maps. Centroids start in an arbitrary topology; then, as the method progresses, each moves toward a randomly chosen gene during each iteration. After proceeding for enough time, each centroid will be in the middle of a cluster. (From Tamayo et al. .)

The algorithm continues until either fixed endpoints are reached, or after there is no movement of nodes above a threshold.

Figure 4.4: Principle of self-organized maps. Centroids start in an arbitrary topology; then, as the method progresses, each moves toward a randomly chosen gene during each iteration. After proceeding for enough time, each centroid will be in the middle of a cluster. (From Tamayo et al. .)

The algorithm continues until either fixed endpoints are reached, or after there is no movement of nodes above a threshold.

As most commonly used, self-organizing maps employ Euclidean distance as the dissimilarity measure. However, there is nothing that would prevent the use of the correlation coefficient or mutual information as the dissimilarity measure instead. Nonetheless, we describe this algorithm using the Euclidean distance because that is the way it is typically presented in the most well-known applications in functional genomics .

Figure 4.5 explains the conventional graphical representation of self-organizing maps. Each horizontal axis represents an ordering of the samples. In this case, samples were measured in time series.

Figure 4.5: Graphical convention for self-organizing maps. (Derived from Tamayo et al. .) 4.6.1 K-means clustering

The k-means clustering algorithm is very similar to the self-organizing map methodology. Centroids are initialized to randomly chosen genes, then each gene is mapped to its closest centroid. In an iterative manner, the centroids are then updated with the average expression pattern of the genes assigned to that cluster; then the genes are again mapped to their closest centroid. This continues until all genes remain mapped to the same centroids.

### Published examples

• Tamayo et al.  used self-organizing maps to functionally cluster genes into various patterned time courses in HL-60 cell macrophage differentiation.

• Toronen et al.  used a hierarchical self-organizing map to cluster yeast genes responsible for diauxic shift.

• Soukas et al.  used a k-means approach to cluster genes involved in leptin signaling.

• Tavazoie et al.  used k-means to determine clusters of genes based on function, for which 52 upstream regions in the genome were searched in an effort to find common regulatory sequences.

• Easy visualization in two dimensions of the range of variation seen in the data: Since the map will be spread across the multidimensional space of experiments, by displaying the expression patterns represented at each node, one can quickly see the variety of gene expression patterns. For example, if there was not much variety in the gene expression patterns, one would quickly find the representative gene expression patterns from the many nodes of the self-organizing map appearing the same.

• Computational requirements: Since only the centroids are moving in each iteration, the number of distances to be computed is linear with respect to the number of nodes. After the map is made, gene assignment to a cluster may require the computation of distances from each gene to each centroid. On the face of things, this is still much less computationally expensive than the initial precomputation of comprehensive pairwise dissimilarity measures required by the other techniques, suchas dendrograms and relevance networks. However, this gain in performance is tempered by the large (but finite) number of iterations that may be required to reach stability.

• Arbitrary map topology: There is nothing to indicate exactly how to create the initial map geometry. The k-means algorithm requires the number of nodes to be specified as a priori information. The self-organizing map technique also requires the initial geometry of the nodes to be specified.

• Misses negative correlations: Negative correlations include gene interactions such as those demonstrated by tumor suppressor genes. As an example, p53 suppresses the expression of several other genes. This means that the higher the expression of p53 seen, the less expression of other genes is expected. Negative correlations will be missed in this type of approach.

• Inuenced by spiking in gene expression: Without adequate filtering, spiking (isolated high or low expression in single samples) can prevent proper normalization, which impairs self-organizing map formation.

• Each experiment or coordinate is treated independently and with the same weight: each axis is treated as orthogonal to the others. This may not be the actual case, however. Consider the extreme example of a situation where two replicated measurements are made on three patients, making six microarray measurements. In eachof the three patients, the pair of replicated measurements should not differ. However, if each of the six measurements is treated as a separate, orthogonal axis, too much weight may be given to the replicates, which may have measurements close together. In other words, genes may appear closer simply because they have many similar measurements across the replicated experiments.

• Further analysis is needed to determine the actual boundaries of each cluster: After the map is made, all the genes in the multidimensional space may be assigned to a cluster by choosing the centroid closest to each gene. However, it is not immediately clear where the boundaries of each cluster lie. First, this is important in determining which genes lie within the cluster. Second, this is crucial because it is important to distinguish those genes that are far away from all centroids. If too many genes are too distant from all the clusters, it can be an indication that additional centroids need to be added at the at the start of analysis to gain coverage in that area of the multidimensional space. Thus, it is important to determine what the shape and size of the clusters are, but this requires prior assumptions (the space of the centroids can be round or oval, the diameter can be large or small, or equal among all the clusters, etc.) for which there may be little domain or expert knowledge.

• Membership of a gene may be limited to a single cluster: If a gene resembles the centroid pattern seen in two clusters, it may be assigned to one of the two clusters but not both, unless precautions are taken to scan for this situation.

• Missing data: If a single gene expression measurement is missing, it becomes difficult to determine where to place that gene in the multidimensional space. This positioning may be crucial to determining which centroid is closest to that gene.

• Stochastic technique: Since the initial conditions are determined randomly, the results may not be repeatable. Validation techniques can be performed by repeating the algorithm using several random starting positions, but it is not clear how to merge the resulting, differing maps.

Each measurement of that gene in each experiment corresponds to one element of that vector.

Note that this topology is selected by the investigator sometimes on the basis of prior knowledge (the number of expected natural clusters of the data) and sometimes empirically by iterating across several possible topologies and inspecting the results.