Finding Genes That Split Sets

As suggested by the machine-learning taxonomy above (section 4.4), nearest neighbor analysis has been used in both supervised and unsupervised learning methods. In a supervised application, the technique is traditionally used to find genes that match an externally specified pattern. These patterns may be "ground truths" in independently validated biological knowledge (e.g., a gene known to be involved in the process being studied) or empirical (e.g., a gene that is observed to be highly expressed in certain samples and expressed at a lower level in other samples). In an unsupervised manner, the technique is used to find clusters of genes that share similar expression patterns.

Let us start with the supervised method. We may have a specified gene of interest in our data sets. Alternatively, we may specify a desired or hypothetical gene expression pattern which may not exist in the data set. Such a pattern could be made based on the desired properties of a gene in particular samples or patients. For example, one may want to find a gene that was upregulated in patients with disease and downregulated in nondiseased patients.
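As a concrete (if simplified) illustration, such a hypothetical pattern can be encoded as a vector over the samples. The Python sketch below assumes hypothetical disease labels for eight samples; the labels, values, and variable names are illustrative only, not taken from any particular study.

import numpy as np

# Hypothetical labels for 8 samples: 1 = diseased, 0 = nondiseased.
sample_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Idealized query pattern Q: high in diseased samples, low in nondiseased
# samples. The exact values matter little if a correlation-based
# dissimilarity is used, since correlation ignores shifts and scaling.
query_pattern = np.where(sample_labels == 1, 1.0, -1.0)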

To get a list of genes related to either query pattern, we must again define what it means to have similar patterns. Although any dissimilarity measure can be used as the judge of whether expression patterns are similar, Euclidean distance and the correlation coefficient have been used traditionally. Iteration through the data set will quickly find and rank genes by the degree of similarity to the query pattern.

define number_genes as the number of genes we are working with
define query_pattern Q as a vector across all the samples, representing the
    gene with the expression pattern we are most interested in
make a new empty array called distances, with size = number_genes
loop through all the genes, with index G
    calculate dissimilarity measure (i.e., distance) between Q and G
    store the measure in the appropriate position in distances
end loop
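A minimal working version of this loop, assuming the expression measurements are held in a NumPy array with one row per gene and one column per sample (the function name and the two dissimilarity measures shown are illustrative choices, not prescribed by the pseudocode):

import numpy as np

def rank_genes_by_similarity(expression, query, metric="euclidean"):
    # expression: array of shape (number_genes, number_samples)
    # query: the query pattern Q, an array of shape (number_samples,)
    # Returns gene indices ordered from most to least similar to Q.
    if metric == "euclidean":
        distances = np.sqrt(((expression - query) ** 2).sum(axis=1))
    elif metric == "correlation":
        # 1 - Pearson correlation, so that smaller values mean "more similar."
        e = expression - expression.mean(axis=1, keepdims=True)
        q = query - query.mean()
        denom = np.sqrt((e ** 2).sum(axis=1) * (q ** 2).sum())
        distances = 1.0 - (e * q).sum(axis=1) / denom
    else:
        raise ValueError("metric must be 'euclidean' or 'correlation'")
    return np.argsort(distances)

Ranking against the hypothetical pattern built above is then a single call, for example rank_genes_by_similarity(expression, query_pattern, metric="correlation"); the genes at the top of the returned ordering are the candidates most similar to the query.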

Published examples

• Using a hypothetical pattern, Golub et al. [78] applied this technique to find the genes that best split samples of acute lymphocytic leukemia from samples of acute myelogenous leukemia.

• Ben-Dor et al. [19] used a nearest neighbor classifier to split normal colon samples from cancer samples in the data set of [4], as well as in other data sets.

Advantages

• Computational and memory requirements: When used in a supervised manner, nearest neighbor analysis is considered a "lazy classifier," in that the algorithm requires only that the data set be kept in memory. The genes "nearest" to the search pattern are found when required, minimizing unnecessary comparison operations. This is in contrast to other techniques, which may require an initial, expensive computation of all pairwise dissimilarity measures.

• Quickly finds the genes or features that most significantly split the labeled sets. These can be validated biologically, then developed into diagnostic tools, for example.

Disadvantages

• Only gross differences are found. In other words, genes that differ in absolute expression level are typically chosen. Differences in gene-gene interactions may be missed when the expression levels of the interacting genes themselves do not differ between the sets (see figure 4.6 and the sketch after this list).

[Figure 4.6 scatterplot; x-axis: expression measurement of gene A]

Figure 4.6: Genes can have a difference in interaction, but not in expression level. Scatterplot of gene A and gene B, measured in samples from disease 1 (open circles) and disease 2 (closed circles). Note that expression measurements from neither gene A nor gene B alone can be used to separate disease 1 from disease 2. However, the linear regression model of the gene expression levels in disease 2 differs from that in disease 1.

• The genes or features that best split two sets may not necessarily be the most significant or biologically causative. For instance, the genes selected may be the most obvious but distant "downstream" effects of other genes which are primarily responsible for the difference in states.

• Because many more features (i.e., genes) are measured compared to cases (i.e., samples or experiments), it is almost always possible to find genes that split samples into labeled sets. If one considers combinations of genes as features (e.g., with support vector machines), then one can find even more features that successfully split the sets. One has to be careful to look for simple models that split sets, even if they are not as accurate [30]. Otherwise, these models may lose biological relevance.
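To make the first disadvantage (figure 4.6) concrete, the simulation below is a hypothetical construction (not the data behind figure 4.6): it generates two genes whose individual expression distributions are essentially the same in both diseases, so neither gene alone splits the sets, while the regression of gene B on gene A differs sharply between the diseases.

import numpy as np

rng = np.random.default_rng(0)
n = 100  # hypothetical samples per disease

# Gene A has the same (uniform) expression distribution in both diseases.
a1 = rng.uniform(100, 1800, n)   # disease 1
a2 = rng.uniform(100, 1800, n)   # disease 2

# Gene B also has roughly the same marginal distribution in both diseases,
# but its relationship to gene A differs: positive slope in disease 1,
# negative slope in disease 2.
b1 = a1 + rng.normal(0, 50, n)
b2 = (1900 - a2) + rng.normal(0, 50, n)

# Neither gene alone separates the diseases ...
print(a1.mean(), a2.mean())   # gene A: similar means in both diseases
print(b1.mean(), b2.mean())   # gene B: similar means in both diseases

# ... but the fitted regression slopes of gene B on gene A have opposite signs.
print(np.polyfit(a1, b1, 1)[0])   # approximately +1
print(np.polyfit(a2, b2, 1)[0])   # approximately -1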
