## Dissimilarity and Similarity Measures

In time course studies such as study S1, it is natural to talk about how two genes Gk and G¡ are (dis)similar to one another with respect to their expression levels sampled in time. In the bioinformatics literature, a gene and its corresponding expression levels are often represented as a data point or a vector in k, where k is a positive integer (denoted by N). Each vector component or coordinate slot denotes the expression level of that gene under a different experimental condition with k conditions altogether. A measure of similarity or dissimilarity such as a metric is then defined on k which maps pairs of vectors, representing genes and their corresponding expression information, to real numbers. When the number of genes and the expression data set are very large, as in most micro array experiments, one often asks which genes or groups of genes behave similarly under these k different or replicate conditions. In an attempt to answer this question, one resorts to classification and clustering methods, and of particular importance prior to the analysis, one has to define what one means by the term similar. Similarity is characterized by the choice of a (dis)similarity measure, such as a metric.

As noted by Jain and Dubes [101], a cluster method takes as its input a finite data set of at least p (p > 2) elements together with a list of finitely many attributes of these data, or, alternatively, it may consist of a measure of similarity or dissimilarity between pairs of objects in . The output is an internal structure of , which is determined entirely by the input and not dependent on any external criterion. Following [103], even though there is no universal agreement in the field with regard to terminology, when the rules for determining the output are externally defined, the algorithm will be called a classification rather than a cluster method. For example, suppose that we have for our set X microarray data which are represented as points in 3 , i.e., equipped with the standard Euclidean distance, a metric. Then, a cluster method would ideally yield groups of genes which are physically close (in Euclidean distance) to each other, and hence behave similarly over three conditions, forming very tangible visual clusters when plotted. On the other hand, questions such as which genes appear in the first octant of 3 would fall under the classification category. The distinction between the terms clustering and classification is not purely of academic interest. Clustering algorithms generally have a higher order of complexity, and are therefore more computationally intensive, than classification methods. We begin the discussion of (dis)similarity measures with several formal and standard definitions. Let be any nonempty set.

Definition 1 A function ?1K - ^ i®-6^ is a metric if for all

• M1. A(x, y) = 0 if and only if x = y (definite).

• M3. d(x, y) d A (x, z) + A (z, y), for all (triangle inequality).

Some examples of common metrics are:

• Let ^ = € M. The classical Euclidean distance, r p—*» | „ p > i for any

. When p = 2, we have the preceding Euclidean distance.

• Suppose that is the set of real-valued continuous functions on the closed real interval [0, 1]. Let f u € A', and A(f,g) = maxx p^p*) g(x). Exercise: Confirm this.

• The distance between two random variables X and Y , A(X, Y) = H(X, Y) - H(X; Y) where H(X, Y) is the joint entropy of X, Y , and H(X; Y) P H(X) - H(X| Y) is the mutual information between X and Y , a symmetrical function. An overview of entropy and information-theoretic concepts is presented in subsection 3.6.2. We refer the interested reader to Cover and Thomas [54], for a detailed review of information theory.

The reader should be warned that there is a widespread misuse of the terminology metric in the current bioinformatics literature to include functions such as the linear (Pearson's) correlation coefficient and mutual information. It is a simple exercise to show that these functions do not satisfy the axioms of a metric. Next we define a weaker class of functions than the metric, a dissimilarity coefficient.

Definition 2 A dissimilarity coefficient is any function which satisfies two axioms:

A dissimilarity coefficient d is definite if it satisfies a third condition:

Any metric is a definite dissimilarity coefficient, and note that 1 r, where r denotes the linear (Pearson's) correlation coefficient is a definite dissimilarity coefficient. Exercise: Confirm these.