## Entropy and mutual information

Suppose that we have the expression data for a pair of genes $G_x$, $G_y$ across $n$ experimental conditions. In the most extreme cases, the expression of $G_x$ and $G_y$ may be completely dependent (if this dependence is linear, this implies that $r = \pm 1$) or completely independent (which implies that $r = 0$) across these $n$ conditions. Question: can we measure or quantify the dependence of the $G_x$ and $G_y$ levels in this case? Equivalently, how much would the expression data for $G_x$ inform us about the corresponding expression of $G_y$, and vice versa? Here we may appeal to the information-theoretic concepts that were formulated to solve the fundamental problem in communications: reproducing at the receiving end point, either exactly or approximately, a message sent from the transmitting end point. The messages transmitted and received correspond to the expression data of $G_x$ and $G_y$ in the biological scenario. Suppose that $X$ is a random variable with the set of events or outcomes $\{x_1, \dots, x_n\}$ and corresponding probabilities $P(X = x_i) = p_{x_i} \ge 0$. By definition, $\sum_i p_{x_i} = 1$. Following Shannon, we want to find a measure

$H(p_{x_1}, \dots, p_{x_n})$ of how much "choice" is involved in the selection of events, or of how uncertain we are of the outcome. Intuitively and reasonably, we will require that $H$ have the following three properties:

• E1. $H$ should be continuous in the $p_{x_i}$'s.

• E2. If all the $p_{x_i}$'s are equal, i.e., $p_{x_i} = 1/n$, then $H$ should be a monotonically increasing function of $n$. With equally probable events, there is more choice or uncertainty in the outcome when there are more possible events.

• E3. If a choice is broken down into two successive choices, the original $H$ should be the weighted sum of the individual values of the later $H$'s. For instance, let $n = 3$ with $p_{x_1} = 1/2$, $p_{x_2} = 1/3$, $p_{x_3} = 1/6$, and decompose the choice among the three outcomes into first choosing between $x_1$ and $\{x_2, x_3\}$ with probabilities $1/2, 1/2$, and then, in the latter case, choosing between $x_2$ and $x_3$ with probabilities $2/3, 1/3$. Then

$$H\left(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\right) = H\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{1}{2}\, H\left(\tfrac{2}{3}, \tfrac{1}{3}\right).$$

It can be shown that the only explicit form of $H$ satisfying E1 to E3 is

$$H(p_{x_1}, \dots, p_{x_n}) = -K \sum_{i=1}^{n} p_{x_i} \log p_{x_i},$$

where $K$ is a positive constant. We let $K = 1$. We shall call $H(X)$ the entropy of the random variable $X$. Consider the case of a Bernoulli random variable $X$, i.e., $n = 2$, with $p_{x_1} = p$ and $p_{x_2} = 1 - p$, where $0 \le p \le 1$. Then

$$H(X) = -p \log p - (1 - p) \log (1 - p).$$

Using a base-2 logarithm, $H$ plotted as a function of $p$ is concave, non-negative, and symmetrical, with a maximum of 1 at $p = 1/2$ (see figure 3.24).

Figure 3.24: Entropy plot
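These properties of the binary entropy curve are easy to check numerically; a minimal sketch (the function name `binary_entropy` is ours):

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a Bernoulli random variable with success probability p."""
    if p in (0.0, 1.0):
        return 0.0  # lim x->0 of x*log2(x) is 0, so a certain outcome carries no entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0, the maximum
print(binary_entropy(0.1))  # equals binary_entropy(0.9) by symmetry
```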

As noted by Shannon, $H$ has several interesting properties that substantiate it as a reasonable measure of choice or uncertainty.


H1. $H \ge 0$, with equality if and only if all but one $p_{x_i} = 0$ and the non-zero event has probability 1. Thus the measure of uncertainty vanishes exactly when we are entirely certain about the outcome.

H2. For any number of possible outcomes $n$, $H$ is a maximum ($\log n$) when all the $p_{x_i}$'s are equal, i.e., $p_{x_i} = 1/n$. Intuitively, this is the most uncertain scenario.

H3. Suppose that $Y$ is a random variable with the set of outcomes $\{y_1, \dots, y_m\}$ and corresponding probabilities $P(Y = y_j) = p_{y_j} \ge 0$, with $\sum_j p_{y_j} = 1$. Let $p_{x_i, y_j}$ be the probability of the joint occurrence of $X = x_i$ and $Y = y_j$. Define the joint entropy of $X$ and $Y$ to be

$$H(X, Y) = -\sum_{i,j} p_{x_i, y_j} \log p_{x_i, y_j},$$

and note that, by basic properties of marginals of joint distributions, $p_{x_i} = \sum_j p_{x_i, y_j}$ and $p_{y_j} = \sum_i p_{x_i, y_j}$. Exercise: Show that $H(X, Y) \le H(X) + H(Y)$, with equality if and only if $X$ and $Y$ are independent, i.e., $p_{x_i, y_j} = p_{x_i} p_{y_j}$. The uncertainty of a joint event is less than or equal to the sum of the individual uncertainties. Note that $H(X, X) = H(X)$.
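The subadditivity property in the exercise can be verified numerically on a small joint distribution; a sketch using toy probabilities of our choosing:

```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed toy joint distribution p(x, y) over 2 x 2 outcomes (rows: x, columns: y).
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]        # marginal of X: [0.5, 0.5]
p_y = [sum(col) for col in zip(*joint)]  # marginal of Y: [0.5, 0.5]
h_xy = entropy([p for row in joint for p in row])

# H(X, Y) <= H(X) + H(Y); the inequality is strict here because X and Y are dependent.
print(h_xy < entropy(p_x) + entropy(p_y))   # True
```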

H4. Any change toward equalizing the $p_{x_i}$'s increases $H$. Computationally, any averaging operation $p'_{x_i} = \sum_k a_{ik} p_{x_k}$, where $\sum_k a_{ik} = \sum_i a_{ik} = 1$ and $a_{ik} \ge 0$, increases $H$, except in the special case of a permutation, where for every $i$ all but one $a_{ik} = 0$ and the non-zero constant is 1, in which case $H$ is invariant.

H5. Recall the definition of conditional probability,

$$p_{x_i \mid y_j} = \frac{p_{x_i, y_j}}{p_{y_j}} = \frac{p_{x_i, y_j}}{\sum_i p_{x_i, y_j}}.$$

An easy consequence of this definition is Bayes' theorem,

$$p_{x_i \mid y_j} = \frac{p_{y_j \mid x_i}\, p_{x_i}}{p_{y_j}}.$$

The conditional entropy of $X$ given $Y = y_j$ is defined as

$$H(X \mid Y = y_j) = -\sum_i p_{x_i \mid y_j} \log p_{x_i \mid y_j},$$

whereas the conditional entropy of $X$ given $Y$ is defined by averaging the $H(X \mid Y = y_j)$ over $Y$:

$$H(X \mid Y) = \sum_j p_{y_j}\, H(X \mid Y = y_j).$$

This quantity measures how uncertain we are of $X$ on average when we know $Y$. The chain rule for entropy can then be shown. Exercise: Show that

$$H(X, Y) = H(X \mid Y) + H(Y).$$

Now, following the exercise in H3, we can conclude that

$$H(X) + H(Y) \ge H(X, Y) = H(X \mid Y) + H(Y),$$

and hence $H(X) \ge H(X \mid Y)$,

which says that, on average, our knowledge of $Y$ does not increase our uncertainty about $X$. The reader is encouraged to come up with a pair $X, Y$ such that $H(X \mid Y = y) > H(X)$ but $H(X \mid Y) \le H(X)$. When $X$ and $Y$ are independent, our uncertainty about $X$ is unchanged; otherwise it decreases.
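One concrete pair meeting this challenge, built from a hypothetical joint distribution of our choosing: learning the rarer outcome of $Y$ makes $X$ more uncertain, yet conditioning still helps on average, and the chain rule holds.

```python
import math

def entropy(probs):
    """Shannon entropy (base 2); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution: when Y = 0 (prob 3/4), X is always 0;
# when Y = 1 (prob 1/4), X is uniform on {0, 1}.
joint = {(0, 0): 0.75, (0, 1): 0.125, (1, 1): 0.125}
p_y = {0: 0.75, 1: 0.25}
p_x = {0: 0.875, 1: 0.125}

h_x = entropy(p_x.values())
h_x_given = {y: entropy([joint.get((x, y), 0.0) / p_y[y] for x in (0, 1)])
             for y in (0, 1)}

# Learning that Y = 1 increases our uncertainty about X for that outcome...
print(h_x_given[1] > h_x)    # True: 1 bit vs. ~0.544 bits
# ...but on average, H(X|Y) <= H(X).
h_x_given_y = sum(p_y[y] * h_x_given[y] for y in (0, 1))
print(h_x_given_y <= h_x)    # True: 0.25 <= ~0.544

# Chain rule: H(X, Y) = H(X|Y) + H(Y).
h_xy = entropy(joint.values())
print(abs(h_xy - (h_x_given_y + entropy(p_y.values()))) < 1e-12)  # True
```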

H6. The mutual information between $X$ and $Y$ is defined as

$$M(X, Y) = H(X) - H(X \mid Y).$$

Exercise: Show that $M(X, Y) \ge 0$ and $M(X, Y) = M(Y, X)$. Mutual information measures the average reduction in uncertainty about $X$ given that we have knowledge of $Y$, and vice versa by the exercise. When $X$ and $Y$ are independent, $M(X, Y) = 0$, i.e., there is no reduction in uncertainty. At the other extreme, when $X = Y$, $M(X, Y) = H(X)$, so that we have reduced all of the uncertainty, $H(X)$, about the system. Here we also see that $M$ is not a metric, since $M(X, X) \ne 0$. The distance between random variables $X$ and $Y$ is defined as

$$D(X, Y) = H(X, Y) - M(X, Y) = H(X \mid Y) + H(Y \mid X).$$
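Both quantities follow from the entropies already defined, via the identity $M(X, Y) = H(X) + H(Y) - H(X, Y)$; a sketch on an assumed toy joint distribution:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed toy joint distribution p(x, y) (rows: x, columns: y).
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
h_x, h_y = entropy(p_x), entropy(p_y)
h_xy = entropy([p for row in joint for p in row])

# M(X, Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y), symmetric in X and Y.
mi = h_x + h_y - h_xy
# D(X, Y) = H(X, Y) - M(X, Y): the uncertainty left over once the shared
# information is removed; unlike M, this behaves as a metric.
d = h_xy - mi

print(mi >= 0)        # True
print(round(mi, 3))   # 0.278 bits of shared information
```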

Exercise: Show that D satisfies the three axioms of a metric in an earlier section.

H7. The cross or Kullback-Leibler entropy between the probability distributions $p_x$ and $q_x$ over a common space of possible outcomes is defined as

$$H(p_x \,\|\, q_x) = \sum_i p_{x_i} \log \frac{p_{x_i}}{q_{x_i}}.$$
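A direct implementation of this definition (the helper name `kl_divergence` and the toy distributions are ours) also shows that, unlike the distance $D$ above, the Kullback-Leibler entropy is not symmetric:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes q[i] > 0 wherever p[i] > 0 (absolute continuity).
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [1/3, 1/3, 1/3]

print(kl_divergence(p, p))       # 0.0: a distribution is at no "distance" from itself
print(kl_divergence(p, q) >= 0)  # True (Gibbs' inequality)
print(kl_divergence(p, q) == kl_divergence(q, p))  # False: KL is not symmetric
```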

In order to apply the preceding tools in the context of microarray data analysis, suppose that we have the gene expression data for genes $G_x$, $G_y$ across $m$ different experimental conditions. Divide the ranges of the $G_x$, $G_y$ expression data into $n$ uniformly spaced intervals or "bins." Let $p_{x_i}$ be the probability that the expression for $G_x$ falls into the $i$th interval, and similarly for $G_y$. If, say, three out of the $m$ different experiment intensities for $G_x$ fell within the $k$th bin, then we let $p_{x_k} = 3/m$. A graphical example of this is shown in figure 3.25.

Figure 3.25: Graphical example of mutual information calculation. First, a scatterplot of the expression measurements of the two genes is created, and a grid is imposed. In this example, each expression measurement is quantized into four bins (one can think of these as "low," "low-medium," "high-medium," and "high," though any number and positioning of bins can be considered). The entropy for each gene is then calculated using the row and column sums, and the joint entropy is calculated from the grid.
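The binning procedure just described can be sketched as follows; the bin count, the uniform bin placement, and the toy expression values are our assumptions:

```python
import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def bin_index(value, lo, hi, n_bins):
    """Map a value into one of n_bins uniformly spaced bins over [lo, hi]."""
    i = int((value - lo) / (hi - lo) * n_bins)
    return min(i, n_bins - 1)  # the top edge closes the last bin

def mutual_information(xs, ys, n_bins=4):
    """Estimate M(X, Y) from two expression profiles by quantizing into bins.

    Assumes each profile spans a nonzero range of expression values.
    """
    m = len(xs)
    bx = [bin_index(v, min(xs), max(xs), n_bins) for v in xs]
    by = [bin_index(v, min(ys), max(ys), n_bins) for v in ys]
    h_x = entropy(c / m for c in Counter(bx).values())   # from column sums
    h_y = entropy(c / m for c in Counter(by).values())   # from row sums
    h_xy = entropy(c / m for c in Counter(zip(bx, by)).values())  # from the grid
    return h_x + h_y - h_xy

# Hypothetical expression profiles for two genes across 8 conditions.
gx = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6]
gy = [0.2, 0.8, 0.5, 0.6, 0.1, 0.9, 0.4, 0.7]  # roughly tracks gx
print(mutual_information(gx, gy) > 0)   # True
```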

### 3.6.3 Dynamics

Dynamics provides a powerful example of how the choice of the appropriate similarity measure can affect the analysis of a genomic system. We use the term dynamics to refer to the rate of change of gene expression over time, calculated as the first-order difference of the expression levels ($E_{t_2} - E_{t_1}, E_{t_3} - E_{t_2}, \dots$). This is distinct from the simple temporal pattern of gene expression ($E_{t_1}, E_{t_2}, E_{t_3}, \dots$), which we refer to as statics. The primary motivation for studying gene expression dynamics is that existing static techniques may not identify all the important relationships. Some genes may have associated dynamic behaviors but not associated static expression behaviors. A hypothetical example is shown in figure 3.26: Gene A codes for an enhancer protein that regulates the expression of Gene B; a high level of Gene A causes an upregulation of expression of Gene B. Since Gene B can be at many possible expression levels before being affected by Gene A, the enhancer-type relationship between the two genes cannot be noticed by simply examining the correlation of static expression patterns. Instead, one needs to examine the dynamics of gene expression, i.e., the way in which the expression level of Gene A leads to a change in Gene B, to detect the underlying dynamic relationship. We therefore hypothesized that using dynamics as our fundamental similarity measure has the potential to discover relationships between genes that are not detectable using static similarity measures.

Figure 3.26: Dynamic relationships between genes. (a) The expressed product of Gene A binds an enhancer region that increases transcription of Gene B. (b) Gene B's initial expression level before being affected by Gene A can vary throughout the experiment. As a result, measuring the correlation between the absolute levels of Genes A and B will not reveal the underlying enhancement relationship between the two. Instead, this can only be done by analyzing the expression dynamics, i.e., the change in expression level of Gene B in relation to the expression level of Gene A. (Derived from Reis et al.)

Reis et al. investigated the Saccharomyces cerevisiae mRNA expression data aggregated from several experiments reported by Eisen et al., in which the response of the yeast cells to several different stimuli is recorded. The data contain 79 data points in 10 time series measured under different experimental conditions. Of the over 6000 genes in the yeast genome, Eisen et al. included only the 2467 genes that had functional annotations; we analyzed the same subset of genes. Dynamics similarity measures were calculated as slopes of the change of expression over time. Slopes were calculated between each adjacent pair of expression data points, $E_{t_n}$ and $E_{t_{n+1}}$:


$$\text{Slope} = \frac{E_{t_{n+1}} - E_{t_n}}{t_{n+1} - t_n}$$

As slopes are only calculated between data points within the same time series, the 79 data points in 10 time series are reduced to only 69 slope measurements. The units of the slope measurements are normalized expression level units per minute. The authors found those pairs of genes that exceed a correlation coefficient of .78 based on a permutation analysis (see section 4.12.1). We compared the pairs of genes found to be correlated at that threshold or higher using both the standard static expression levels and the slope similarity measures. We found that 133 genes appeared in both the static and dynamic analyses, leaving 215 genes that were exclusive to the dynamics analysis. However, only about half of the 133 shared genes appear linked to the same genes in both analyses; most appear linked to other genes.
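The slope computation can be sketched as follows (the series values and time stamps are hypothetical); because slopes are never taken across series boundaries, a series of $k$ points yields $k - 1$ slopes:

```python
def expression_slopes(levels, times):
    """Slopes between each adjacent pair of points in one time series.

    Returns (level[n+1] - level[n]) / (time[n+1] - time[n]) for each
    adjacent pair, so a series of k points yields k - 1 slopes.
    """
    return [(levels[n + 1] - levels[n]) / (times[n + 1] - times[n])
            for n in range(len(levels) - 1)]

# Hypothetical data: 2 short series, mirroring how 79 points spread over
# 10 series reduce to 79 - 10 = 69 slope measurements.
series = [([1.0, 1.4, 1.1], [0, 10, 20]),
          ([0.5, 0.9], [0, 15])]
slopes = [s for levels, times in series
          for s in expression_slopes(levels, times)]
print(len(slopes))   # 3 = (3 - 1) + (2 - 1)
```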

Of many examples, figure 3.27 shows the distribution of slopes for RAD6 and MET18, which are highly correlated by the slope similarity measure but not by the static gene expression levels. RAD6 is a ubiquitin-conjugating enzyme concentrated in the nucleus that is essential to mediating the degradation of amino-end-rule-dependent protein substrates (21). MET18, also known as MMS19, is a protein concentrated in the nucleus that affects RNA polymerase II transcription. These are inversely related in their dynamics, with an R² of .791. It is not surprising that a gene responsible for protein degradation might have an inverse relationship to a gene responsible for the RNA transcription leading to protein synthesis. Again, the significance here is that the discovery of this relationship depended upon picking the right similarity measure. Beyond this one, there is undoubtedly a large unexplored universe of potentially useful measures.

Figure 3.27: Negative dynamic correlation. The distribution of slopes of MAD3 and EXM2 plotted one against another.