## Phylogenetic Type Trees

A third methodology is phylogenetic-type[5] tree construction, otherwise known as dendrograms. Although there are several ways to construct this data structure, this technique typically first involves comprehensively comparing all genes against each other using a dissimilarity measure, such as the correlation coefficient. Eisen et al. [63] took expression levels at various time points and created a vector for each gene. They then compared all genes against each other and recorded the correlation coefficient for each pair of vectors, then constructed a phylogenetic-type tree with branch lengths proportional to the correlation coefficients.

Let us take this process one step at a time. First, enough memory must be allocated to store the comprehensive pairwise comparisons. If n genes are being compared with each other, there are ((n*n)/2) n/2 possible pairwise comparisons. Typically, when maximum efficiency is used, dissimilarity measures are stored as real values taking 4 bytes each. Clearly, as n increases, the necessary memory increases on the order of n2. Next, an iterative method must be used to perform the pairwise comparisons. Since associativity holds with the correlation coefficient and Euclidean distance, once gene A is compared with gene B, the reverse comparison does not need to be performed. There are four logical ways to order these comparisons, as shown in figure 4.7, and one needs to be chosen arbitrarily. Regardless of how we are calculating or storing the measures in this chapter, we refer to the comprehensive set of measures for a collection of variables or features as a measure-triangle.

Figure 4.7: Four possible ways to order comprehensive pairwise comparisons. We define the term measure-triangle as the comprehensive pairwise comparison between each of a set of features. The comparisons can be performed and stored in any of four different ways. Although there are many efficient ways to store the results of each comparison, e.g., using matrices, we purposely use a less efficient, but easier-to-understand method here. We store the result of each comparison in a hash table, using the two genes (sorted) as the key. That is, the dissimilarity measure can be retrieved from the hash table using the unique pair of genes.

define number_genes as the number of genes we are working with make a new empty two-dimensional array called measure-triangle, with size = number_genes by number_genes loop from 2 to number_genes, with index i loop from 1 to i - 1, with index j calculate dissimilarity measure (i.e., correlation coefficient) between gene[i] and gene[j] store measure in measure-triangle[i][j] end loop end loop

For each possible pair, the dissimilarity measure must be applied. Eisen et al. [63] essentially used the correlation coefficient in pairwise comparisons. Wen et al. [189] used Euclidean distance as a dissimilarity measure.

In addition to the measure-triangle used to store the dissimilarity measures, we need a data structure to hold the developing tree. A tree can be thought of as a set of internal nodes and external nodes (or leaves). Internal nodes serve as branchpoints linking together genes or subbranches. External nodes or leaves are exclusively genes. Because both genes and branches can be connected by branches, we will make that fact explicit by stating that they both can be branchable.

define number_genes as the number of genes define number_samples as the number of samples define an object called branchable, which contains

(1) an array of expression measurements of size number_samples

(2) a link called parent, to another object of type branchable

(3) a list called children, to other objects of type branchable define an object called branch-branch, which is derived from branchable define an object called branch-gene, which is derived from branchable make a new empty list called remaining-branchable-objects make a new empty list called used-branchable-objects loop through all the genes, with index g make an object of type branch-gene for each gene set the list of children for this object to null set the parent for this object to null add this object to the list remaining-branchable-objects end loop

We are ready to begin building the tree. Iterations continue until every node has been attached to the tree. The tree begins with a single branch with two leaves, connecting the two genes with the strongest connection (e.g., either the highest correlation coefficient or the shortest Euclidean distance). We will store the branches loop until the size of remaining-branchable-objects reaches zero find the strongest measure in measure-triangle determine the two branchable objects, called a and b, that were compared to make this measure make a new object of type branch-branch, called c add the branchable objects a and b as children of c set the parent of branchable objects a and b to c set the expression measurements of c to the average expression measurements of a and b remove objects a and b from remaining-branchable-objects, if they were in that list add objects a and b to used-branchable-objects remove all comparisons involving a and b in measure-triangle compare object c to all the objects in remaining-branchable

-objects and add those measures to measure-triangle if size of remaining-branchable-objects is zero set main-tree-trunk to c end if end loop

The two genes are removed from the measure-triangle, or specifically, all the measures involving any one of these two genes are removed from the measure-triangle.[6] In place of the two genes, a new entity is created representing the branch with the two genes. The gene expression pattern for this new branch is the average pattern of the two gene expression patterns being linked. Notably, the new gene expression pattern for the branch does not have to be the average pattern. In computing the distance between a gene and a branch, or a branch and a branch, one could use a minimum, maximum, or average dissimilarity measure. New measures need to be calculated comparing the branch against all the remaining genes, and these are stored back in the measure-triangle. At every iteration, the next strongest measure is found and the process repeats, removing two entities (genes or branches) and replacing them with a new one, until only one entity remains.

At this point, the construction of the dendrogram is complete. However, the results still need to be displayed. The constructed dendrogram is typically used to order and display genes with similar clusters. At every branch, the display ordering can be one of two possibilities: Order one branch before the other, or vice versa. This means for any tree with n branch points, there are 2n possible display orderings of the data. Other information needs to be used to determine a consistent way to order the data. For example, position along a chromosome or average expression level can be used to order genes. Since there are no loops present in dendrograms, we can use a depth-first search to output the results. The following pseudocode will traverse the dendrogram and output the leaves of the tree (i.e., the genes) in order by average expression level.

define a new procedure called print-tree, that takes a branchable object b as a parameter if this branchable object b is a branch-gene (i.e., and not a branch-branch)

print the name/label of b end if if the list of children of b is not empty move the left margin left 1 cm sort the list of children by expression level call procedure print-tree recursively on each child of b separately move the left margin back 1 cm end if end procedure start the printing by calling print-tree on main-tree-trunk

The result is a dendrogram, typically drawn with the graphical conventions shown in figure 4.8.

Each row represents a separate gene.

Figure 4.8: Graphical convention for dendrograms. 4.8.1 Two-dimensional dendrograms

Each row represents a separate gene.

Figure 4.8: Graphical convention for dendrograms. 4.8.1 Two-dimensional dendrograms

Using the technique above, one creates a dendrogram and the position of the branches suggests the presence of clusters. As mentioned above, the results still have to be displayed, and one can use a secondary key to order the data, suchas average expression level, chromosomal position, etc.

We have not yet answered how to order the data horizontally; /.e.,how should the experiments be ordered, rather than the genes? For ordered experiments, such as time series data collection, gene expression is typically shown from earliest to latest acquisition. However, when the samples have no ordering, a second dendrogram is typically constructed to cluster the samples. This was the approach used by Ross et al. [154] in clustering both various samples of cancer cell lines and the genes measured by microarray, as shown in figure 4.9. The algorithm to construct such a two-dimensional dendrogram is identical to constructing each of the dendrograms individually.

Figure 4.9: Two-dimensional dendrogram. Dendrograms are constructed for both the genes and samples individually, and are combined in the display. (Derived from Ross et al. [154].) Published examples There are many examples of dendrograms applied to gene expression measurements—too many to accurately list all of them. We will cite the earliest publications and the ones most cited.

Figure 4.9: Two-dimensional dendrogram. Dendrograms are constructed for both the genes and samples individually, and are combined in the display. (Derived from Ross et al. [154].) Published examples There are many examples of dendrograms applied to gene expression measurements—too many to accurately list all of them. We will cite the earliest publications and the ones most cited.

• Spellman et al. [167] constructed dendrograms from a yeast expression data set constructed under multiple conditions.

• Eisen et al. [63] described the application of dendrograms to multiple yeast expression data sets, 1998.

• Alizadeh et al. [3] applied this technique to discover two subgroups within samples from the single disease B-cell lymphoma, and showed a statistically significant difference in survival curves of the patients from whom the samples were obtained.

• Ewing et al. [66] applied this technique after linking EST measurements with gene expression measurements.

• Bittner et al. [26] classified genes measured in malignant melanoma.

• Ross et al. [154] classified gene expression measurements from 60 cancer cell lines.

• Category number and size: One can estimate the number or size of different expression patterns in the data set. For example, in viewing a self-organizing map, the user may not be sure there was enough coverage of the multidimensional space.

• Quickly visualize overall pattern similarity: One can quickly see whether most of the genes or samples in the data set are similar to or different from each other.

• Parallelizability: The most time-consuming aspect in the construction of dendrograms is the initial comprehensive precomputation of the dissimilarity measure across the set of features. This computation can be easily parallelized. The construction of the actual dendrogram requires the serial creation of branches, and is not trivial to parallelize.

• Display requirements: Optimally displaying dendrograms requires only a narrow column, in which gene names, symbols, and expression patterns (typically displayed in a color representation) can fit.

• Misses negative correlations: Negative correlations include gene interactions such as those demonstrated by tumor suppressor genes. As an example, p53 suppresses the expression of several other genes. This means that the higher the expression of p53 seen, the less expression of other genes is expected. Negative correlations need to be explicitly searched for.

• Each gene is connected to the overall structure by only one link: the genes in a dendrogram are stored as leaves of the tree. Clearly, each leaf can only be connected to the tree by one stem. However, multiple links from a gene may be of interest. For example, one may wish to distinguish between a transcription factor responsible for the regulation of 10 genes and a structural protein whose gene is controlled by a single transcription factor. Dendrograms do not allow for a variable number of links from each gene.

• All genes are clustered, regardless of connection strength: the algorithm continues until every gene is connected. By definition, certain connections may be weaker than others. It may be a disadvantage to see all genes connected, when particular links are much weaker (i.e., perhaps less likely to be validated) than others.

• Single structure: Although a dendrogram is a single structure, intuitively one can use a fixed distance from the root or a set of long branches to determine where to cut the branches in order to delineate the subgroups. This may or may not have a statistical basis.

• Single metric type: Dendrograms can only be constructed using a single dissimilarity measure. For example, it is not practical to construct a tree with genes connected by both Euclidean distance and correlation coefficient. Only one dissimilarity measure may be used at a time.

• Single data type: As stated at the start of this chapter, although we have many objects in functional genomics that may be clustered, it is only practical to cluster one data type at a time using dendrograms. For instance, genes may be clustered, and samples may be clustered. If both need to be clustered, this is performed as a two-dimensional construct using coupled two-way clustering [76]. Nonetheless, on occasion it does make sense to cluster genes and phenotypic measures along the same axis. A "systolic blood pressure" measurement might appear in one part of the dendrogram, while a "weight" may appear in a different part. It would be difficult to determine which of the genes in the immediate surrounding branches might be closest in profile to the phenotypic measure.

• It becomes difficult to follow the branching in a dendrogram, when the number of genes or features is high [93].

• As each new branch is added to the forming tree, the average gene profile representing a higher branch may not match any of the specific gene profiles within the subtree [162].

• These are not globally optimal trees: The algorithm described here falls under the category of "greedy algorithms" in computer science, as the best match is always used at every step, regardless of the later consequences it may have for forming the tree. Early "bad" decisions in the formation of branches cannot be corrected later.

• Entire vector is used: This means that for each gene, the entire set of samples are used in calculating the dissimilarity measurements. However, there may be very little variation in most of the samples, and high variation in only a few of the samples. Two genes might be scored as being similar, when they may actually differ greatly in one of the many samples.

• Computational requirements: This technique requires the comprehensive precomputation of the dissimilarity measure for all pairs of genes, which grows on the order of n2 where n is the number of genes being studied. This can be a time-consuming computation.

[5]They do not describe the distance between the genomes of organisms as in standard phylogenetic trees and therefore are referred to as "phylogenetic-type" trees.

[6]That is, all the elements of the hash table for which one of the two genes is a key.