## Which Technique Should I Use?

Each of the techniques described above has its own set of advantages, disadvantages, and proper uses. It is useful at this point to revisit our initial list of potential hypotheses and determine which bioinformatics technique would be most helpful to apply in each case.

• What uncategorized genes have an expression pattern similar to these genes that are well characterized? One could iterate through the group of well-characterized genes and apply the supervised nearest neighbor technique to find genes with similar patterns. Similarity in this context could be defined using any of the established dissimilarity measures (starting with the correlation coefficient and Euclidean distance).

• How different is the pattern of expression of gene X from other genes? Starting with all possible gene-gene pairs, one can first comprehensively compute a dissimilarity measure across all the pairs (e.g., measuring the Euclidean distance between all pairs of genes). This set of measures will form a distribution, quite possibly a bell-shaped one. One can then find where the dissimilarity measures for gene X fall in this distribution. If, for instance, the dissimilarity measures for gene X lie in the top 5 percent of the overall distribution, it would suggest that gene X is more dissimilar from the rest than most other genes are.
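The procedure in the preceding item can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the profiles are made-up values, the names `euclidean`, `fraction_below`, and `position_in_distribution` are hypothetical, and summarizing gene X by the median of its distances is one possible choice, not one prescribed by the text.

```python
import math
import statistics
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def fraction_below(value, distribution):
    """Fraction of the distribution that is strictly smaller than value."""
    return sum(1 for d in distribution if d < value) / len(distribution)

def position_in_distribution(gene, profiles):
    """Locate a gene's median distance within the all-pairs distance distribution."""
    all_distances = [euclidean(profiles[g], profiles[h])
                     for g, h in combinations(profiles, 2)]
    own = [euclidean(profiles[gene], p)
           for name, p in profiles.items() if name != gene]
    # Assumption: the median summarizes "the dissimilarity measures for gene X".
    return fraction_below(statistics.median(own), all_distances)

# Hypothetical profiles across three samples (illustrative values only).
profiles = {
    "g1": [1, 1, 1],
    "g2": [1, 2, 1],
    "g3": [2, 1, 1],
    "g4": [1, 1, 2],
    "geneX": [9, 9, 9],
}

for gene in profiles:
    print(gene, position_in_distribution(gene, profiles))
```

A gene whose value is close to 1 sits in the far upper tail of the distribution. With real data the cutoff (for example 0.95) would have to be chosen and justified by the analyst.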

• What genes closely share a pattern of expression with gene X? One can use a supervised nearest neighbor approach to find those genes with expression patterns most similar to gene X.
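A nearest neighbor search of this kind might look like the following sketch. The gene names and expression values are invented for illustration, the helper names `pearson` and `nearest_neighbors` are hypothetical, and using 1 − r as the dissimilarity is an assumed choice among the established measures.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def nearest_neighbors(query, profiles, k=2):
    """Return the k gene names whose profiles are least dissimilar to the query.

    Assumption: dissimilarity is defined as 1 - r (Pearson correlation).
    """
    scored = [(1.0 - pearson(query, p), name) for name, p in profiles.items()]
    scored.sort()
    return [name for _, name in scored[:k]]

# Hypothetical expression profiles across five samples (illustrative values only).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}
gene_x = [10.0, 20.0, 30.0, 40.0, 50.0]

print(nearest_neighbors(gene_x, profiles, k=2))
```

The query does not have to be a measured gene. The same function accepts a constructed profile, such as high values in one set of samples and low values in another.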

• What category of function might gene X belong to? Here, one can create a dendrogram, then find gene X in the dendrogram. As one progresses up the tree starting from gene X, one finds those genes with expression patterns similar to gene X. Using a "guilt by association" approach, if most of the neighbors of gene X belong to a particular functional category, then one should hypothesize that gene X belongs to that category. Obviously, this approach is only useful when the functional categories are known for the majority of the genes. A simpler alternative to constructing a dendrogram is to compute the nearest neighbors of gene X and ascertain what their functions are.
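The simpler nearest neighbor variant of "guilt by association" could be sketched as a majority vote. Everything specific here is an assumption made for illustration: the gene names, the category labels, the expression values, the choice of Euclidean distance, and the hypothetical function name `predict_category`.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def predict_category(query, profiles, categories, k=3):
    """Suggest a functional category from the k nearest annotated genes.

    Returns the most common category among the neighbors and the fraction
    of neighbors that carry it.
    """
    ranked = sorted(profiles, key=lambda name: euclidean(query, profiles[name]))
    votes = Counter(categories[name] for name in ranked[:k])
    category, count = votes.most_common(1)[0]
    return category, count / k

# Hypothetical annotated genes (illustrative values and labels only).
profiles = {
    "rib1": [5, 5, 1],
    "rib2": [5, 4, 1],
    "rib3": [4, 5, 1],
    "kin1": [1, 1, 5],
    "kin2": [1, 2, 5],
}
categories = {
    "rib1": "ribosomal", "rib2": "ribosomal", "rib3": "ribosomal",
    "kin1": "kinase", "kin2": "kinase",
}
gene_x = [5, 5, 2]

print(predict_category(gene_x, profiles, categories, k=3))
```

The returned fraction gives a rough sense of how unanimous the neighbors are. A low fraction suggests the hypothesis about gene X's category is weak.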

• What are all the pairs of genes that closely share patterns of expression? Although constructing a dendrogram starts with the comprehensive pairwise computation of a dissimilarity measure, these pairs are not clearly displayed in the results. Instead, one can use relevance networks to find those pairs of genes that score as most similar under the chosen measure.

• Are there subtypes of disease X discernible by tissue gene expression? To answer this question, one can construct a dendrogram across the samples rather than across the genes. For instance, depending on the mixture of samples collected, one might find that half of the samples fall in one half of the dendrogram, and the other half of the samples fall in the other half. One can then proceed to determine if there are phenotypic differences between these branches. This was most dramatically demonstrated by Alizadeh et al., where the two major branches in the gene expression dendrogram were found to correspond to patients with significantly different mortality (see figure 4.13). Of course, if the sample mixture is not diverse enough, one might miss the true biological subsets and might instead overfit normal variance or noise in the measurements. This is a matter of experimental design addressed in chapter 2.

Figure 4.13: Subcategories of B-cell lymphoma determined by microarrays correspond clinically to duration of survival. On the left is a dendrogram constructed across the samples of B-cell lymphoma, using an unsupervised technique. The top branch essentially defines an even split between the categories GC B-like DLBCL and Activated B-like DLBCL, but this distinction was never before made clinically. On the right are Kaplan-Meier survival curves of the patients from whom the samples were obtained. Patients whose cancer matched the Activated B-like DLBCL gene expression profile had a significantly worse prognosis. (From Alizadeh et al.)

• What tissue is this sample tissue closest to? One can create a dendrogram of the tissues and determine where the unknown tissue lies. Alternatively, one can use a supervised nearest neighbor approach to determine the "closest" pattern of gene expression.

• What are all the different patterns of gene expression seen? Although dendrograms can provide a categorization and ordering of genes based on expression pattern, self-organizing maps can easily categorize and summarize the expression patterns present. Further analysis can be run on these clusters, including determining whether the genes fall in the same functional category, or searching for common sequences in the 5′ upstream region of each gene.

• Which genes have a pattern that may have been a result of the influence of gene X? Answering this question requires the proper biological experiment to be performed. Thus, this question is often asked when gene X is being over- or under-expressed in a particular sample. One can treat the controlled expression level of gene X as an environmental factor and then use a supervised nearest neighbor approach to find whether any of the genes correlate with the controlled expression level. However, the controlled expression level may not be quantifiable, but instead may be discrete (e.g., gene X is "absent" in a sample from a knockout animal, but is "present" in a sample from a wild-type or normal animal). In this case, one can create a hypothetical gene expression pattern matching this information (e.g., high expression levels in one set of samples, low expression levels in the other set), then use a supervised nearest neighbor approach to find genes with a similar pattern.

• What are all the gene-gene interactions present among these tissue samples? Relevance networks provide a network of gene-gene interactions based on dissimilarity measures.

• Which genes best differentiate two known groups of tissues? Because the two groups of tissues are known, this calls for a supervised approach. Classic methods can be used to address this question, such as the Student's t-test. Alternatively, one can construct a hypothetical gene expression pattern across the samples with the desired behavior (e.g., high expression levels in one tissue, low expression levels in the other tissue), then use a supervised nearest neighbor approach to find genes with a similar pattern. Newer approaches to answer this question involve the use of support vector machines [33, 49, 72]. It is important to note that simplifying or pruning the list of genes using a technique such as principal components analysis does not give one the genes that best split two sets. A classic example of this is shown in figure 4.14.

Figure 4.14: Feature reduction with principal components analysis. Two genes are measured in eight samples, four from one disease (marked by X) and four from another disease (marked by O). The expression measurements of gene A are represented on the x-axis and the expression measurements of gene B are represented on the y-axis. Note that the first principal component (i.e., the vector that captures the most variance) is parallel to the x-axis, which corresponds to gene A. However, the line that best splits the two diseases is described by its orthogonal vector along the y-axis, which corresponds to gene B. This means that although the variance seen in gene A best explains the overall variance seen in gene expression measurements in each disease and both together, the variance seen in gene B is best used to split the two diseases. Although this is easy to distinguish in this simplified two-dimensional case, it is not so easily visualized in a real-world, multidimensional data set.
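The situation in figure 4.14 can be reproduced numerically with a small sketch. The eight data points below are invented to mimic the figure's layout and are not the figure's actual values. The function names `first_principal_component` and `student_t` are hypothetical, and the closed-form eigenvector calculation applies only to this two-gene case.

```python
import math
import statistics

def first_principal_component(xs, ys):
    """Unit vector along the direction of greatest variance for 2-D data."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)                     # variance of gene A
    c = sum((y - my) ** 2 for y in ys) / (n - 1)                     # variance of gene B
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)   # covariance
    # Largest eigenvalue of the symmetric 2x2 covariance matrix [[a, b], [b, c]].
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = (lam - c, b) if a >= c else (b, lam - a)
    norm = math.hypot(vx, vy)
    if norm == 0:  # degenerate case: equal variances and no covariance
        return 1.0, 0.0
    return vx / norm, vy / norm

def student_t(group1, group2):
    """Two-sample Student's t statistic with pooled variance."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * statistics.variance(group1)
              + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    diff = statistics.mean(group1) - statistics.mean(group2)
    return diff / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Hypothetical measurements: gene A varies widely in both diseases,
# gene B varies little but differs between them.
gene_a_x, gene_b_x = [1.0, 4.0, 7.0, 10.0], [2.0, 2.2, 1.9, 2.1]   # disease X
gene_a_o, gene_b_o = [1.0, 4.0, 7.0, 10.0], [1.0, 0.8, 1.1, 0.9]   # disease O

pc1 = first_principal_component(gene_a_x + gene_a_o, gene_b_x + gene_b_o)
t_a = student_t(gene_a_x, gene_a_o)
t_b = student_t(gene_b_x, gene_b_o)
print("first principal component:", pc1)
print("t statistic, gene A:", t_a)
print("t statistic, gene B:", t_b)
```

In this constructed data the first principal component points almost entirely along gene A, yet gene A's t statistic is zero while gene B's is large. That is the distinction the figure illustrates: capturing variance is not the same as separating the groups.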

• Which gene-gene associations best differentiate these two groups of tissue samples? Answering this question requires a data set with sufficient measurements from the two types of tissues, and the analysis involves a combination of approaches. First, for each of the two tissue types, one can comprehensively compute a dissimilarity measure for all the gene-gene pairs. Then one can use a supervised approach to compare the dissimilarity measures for the two tissue types. For example, if the expression pattern correlation coefficient between gene A and gene B was 0.95 in breast cancer, but was 0.25 in lung cancer, one might make a biological hypothesis that there is a difference in this interaction between the two diseases. That is, we compare the measure-triangles for each tissue. The elements of the measure-triangle that are most different will suggest different relationships between the gene pairs corresponding to each of these dissimilar elements of the measure-triangle. Because the whole measure-triangle is used, the result is comprehensive.

• Which genes have a pattern similar to this organismal, phenotypic, or environmental factor? If one can treat the external factor as a hypothetical gene expression pattern, then one can apply the supervised nearest neighbor technique to determine genes most similar to that pattern.
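The comparison of measure-triangles between two tissue types, described above, might be sketched as follows. This is a minimal illustration under assumptions: the tissue data are invented, the correlation coefficient stands in for the measure, and `measure_triangle` and `rank_pair_differences` are hypothetical names.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def measure_triangle(profiles):
    """Correlation for every unordered gene pair (the upper triangle)."""
    return {(g, h): pearson(profiles[g], profiles[h])
            for g, h in combinations(sorted(profiles), 2)}

def rank_pair_differences(tissue1, tissue2):
    """Gene pairs sorted by how much their correlation differs between tissues."""
    t1, t2 = measure_triangle(tissue1), measure_triangle(tissue2)
    diffs = [(pair, t1[pair] - t2[pair]) for pair in t1]
    return sorted(diffs, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical expression data, five samples per tissue type (illustrative only).
tissue1 = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [1.1, 2.0, 3.2, 3.9, 5.1],
    "geneC": [3.0, 1.0, 2.0, 5.0, 4.0],
}
tissue2 = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [3.0, 1.0, 4.0, 1.0, 5.0],
    "geneC": [3.0, 1.0, 2.0, 5.0, 4.0],
}

ranked = rank_pair_differences(tissue1, tissue2)
for pair, diff in ranked:
    print(pair, round(diff, 3))
```

Pairs at the top of the ranking are candidates for a relationship that differs between the tissue types. Whether a given difference is larger than expected by chance would need a separate statistical assessment, which this sketch does not attempt.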