Hierarchy of Bioinformatics Algorithms Available in Functional Genomics

Although there are only a few commonly used analytic techniques in functional genomics, it is important to remember that these techniques exist in a hierarchy of methods. One can model this taxonomy of techniques many different ways, as outlined in section 2.2; in this section, we chose to distinguish these methods as supervised or unsupervised [73]. An example of supervised learning is simply calculating the fold increase or decrease in expression of each gene in two microarray measurements. In this type of analysis, the particular context of each measurement is known, and the end result is a list of individual genes behaving differently in each of the different contexts. Another example of supervised learning is when it is algorithmically determined which genes are best able to determine an external labeling,[1] such as distinguishing one type of leukemia from another [78]. This type of approach is used in neural network construction [155], and decision tree construction [146], which can then be easily translated into diagnostic algorithms [33,98].

In unsupervised learning, the patterns inherent in the data itself are determined, without a priori labels or contexts. Relevance networks, dendrograms, and self-organizing maps analyze every possible pair of genes to determine whether a functional relationship exists between them. This is determined using a metric, such as Euclidean distance between the genes (when genes are positioned in a multidimensional space with each experiment as a separate axis), Pearson's correlation coefficient (assumes a linear model of control between the two genes), or mutual information (with no assumptions as to the model of control). The end result of this analysis is a ranked list of hypotheses of pairs of genes that work together. A taxonomy of techniques divided in this manner would look like this:

1. Unsupervised: Analysis looking for characterization of the components of the data set, without a priori input on cases or genes.

a. Feature determination: Determine genes with interesting properties, without specifically looking for a particular pattern determined a priori.

i. Principal components analysis: Determine genes explaining the majority of the variance in the data set.

b. Cluster determination: Determine groups of genes or samples with similar patterns of gene expression.

i. Nearest neighbor clustering: The number of clusters is decided first, the clusters are calculated, then each gene is assigned to a single cluster.

A. Self-organizing maps: Groups are found by iteratively moving each cluster centroid toward a randomly chosen data point; the center of each cluster is a hypothetical gene expression pattern.

B. K-means clustering: Similar to self-organizing maps, except there is no map structure for the centroids.

ii. Agglomerative clustering: Bottom up method, where clusters start as empty, then genes are successively added to the existing clusters.

A. Dendrograms: Groups are defined as sub-trees in a phylogenetic-type tree created by a comprehensive pairwise dissimilarity measure, such as the correlation coefficient.

B. Two-dimensional dendrograms: Both genes and samples are clustered separately.

iii. Divisive or partitional clustering: Top-down method, where large clusters are successively broken into smaller ones, until each subcluster contains only one gene.

A. Dendrograms.

B. Two-dimensional dendrograms.

c. Network determination: Determine graphs representing gene-gene or gene phenotype interactions.

i. Boolean networks: Gene expression measurements are binarized; nodes in the graph represent an entire expression state, and links represent transitions from one expression state to another.

ii. Bayesian networks: Nodes in the graph represent features (genes or phenotypic measures), and links between features are determined probabilistically.

iii. Relevance networks: Nodes in the graph represent features (genes or phenotypic measures); links between features are calculated with pairwise application of a dissimilarity measurement, and networks represent agglomerations of these pairwise associations. 2. Supervised: Analysis to determine ways to accurately split into or predict groups of samples or diseases.

a. Single feature or sample determination: Find genes or samples that match a particular a priori pattern i. Nearest neighbor: Find genes or samples closest to another pattern, using a dissimilarity measure, such as Euclidean distance or the correlation coefficient.

A. Comparison to hypothetical pattern: Find genes or samples closest in similarity to a hypothetical pattern, determined with a priori information.

B. Comparison to actual pattern: Find genes or samples closest in similarity to one of the genes or samples in the data set.

C. Comparison to clusters: First, remember the training set of data and labels; then, when presented with a test gene or sample, find the element of the training set that is closest and use its label.

ii. Student's t-test: Find genes where the expression measurements are statistically different between groups of samples b. Multiple-feature determination: Find combinations of genes (i.e., more than one) that match a particular a priori pattern.

i. Decision trees: Use the training set of genes or samples to construct a decision tree to help classify test samples or test genes.

ii. Support vector machines: First take the set of measured genes, then create a richer feature set with combinations of genes, then find a hyperplane that linearly separates groups of samples in this larger multidimensional space.

The hierarchy above is meant to serve as a way to categorize bioinformatics techniques, and the examples under each branch of this hierarchy are incomplete; newer methodologies in each branch are shown in section 4.10.

This hierarchy just touches upon one of the important goals of the bioinformatics arm of functional genomics. That is, to ascertain gene regulatory networks from gene expression data sets. Few of the algorithms listed above find their way into this endeavor for reasons that we address here.

One crude definition of a regulatory network is a set of genes G and a set of relations R that relates any two elements from G. The relation may specify either a positive or a negative interaction. For example (A, B, +) may indicate gene A interacts with gene B in a positive manner, such that the level of expression of gene B increases as the level of expression of gene A increases. Completely missing are alternative models other than these simple linear models of gene-gene interactions. Any element of translation into protein, the actual actions of proteins in DNA binding, any sense of temporality, and gene and protein degradation are missing. However, despite all that is missing, there is enough captured in this simple model to provide material to construct a biological hypothesis linking the two genes. [177]

In the above hierarchy, we do not consider a dendrogram to be a member of the network determination category of algorithms. Why is this? One typically constructs dendrograms from a gene expression data set to survey the types of expression patterns, and to determine the neighbors of genes to be classified. Although dendrograms are computed from a comprehensive pairwise calculation of gene-gene expression pattern similarity, as detailed below, these gene-gene links are not shown in the output. Besides the first gene-gene pair, each subsequent gene is added to the dendrogram when its expression pattern matches an expression pattern of a branch. These branch expression patterns represent the average expression pattern of the descendant genes, and may not match any one of these genes. Furthermore, gene-gene interactions with negative correlations are not represented in dendrograms, thereby eliminating a large class of biological regulatory behavior.

In the same vein, why do we not consider a self-organizing map as part of the network determination category of algorithms? Self-organizing maps are typically made to survey the types of expression patterns seen in a gene expression data set. Like dendrograms, self-organizing maps do not display gene-gene interactions; instead, these typically output a centroid expression pattern (essentially, an average expression pattern not matching anyone of the genes in the data set) and output the expression patterns of the genes surrounding the centroid. There is no representation of a gene-gene interaction, nor are negative interactions considered.

[1]For example, human expert-derived labels.

Was this article helpful?

0 0

Post a comment