## Low entropy filter

The opposite problem can also occur with gene expression patterns. Genes can demonstrate spiking behavior, where low expression levels are seen in all samples except one. The single high measurement can dominate a pairwise analysis, such as one based on correlation coefficients (see figure 4.1).

An entropy filter can be used to remove genes that demonstrate spiking behavior or, in other words, whose measurements are not well distributed over their range of values. Entropy is a measure of the amount of disorder in a variable. If measurements of a variable are equally distributed across its dynamic range, then very little is known in advance about the expected value of the variable. Once the measurement is known, the amount of information about the variable increases greatly, meaning such a variable has high entropy. Compare this to a variable that can take only two values. A great deal of a priori information is already known about the expected value of this variable; thus, once the measurement is actually known, the increase in information is smaller. This indicates the variable has lower entropy (see figure 4.2).


Figure 4.2: Detecting spikes by calculating entropy. The top graph shows a hypothetical gene with "spiking" behavior; i.e., the gene expression is markedly higher in two samples, compared to the other samples. The bottom graph shows a second hypothetical gene with gene expression measurements that are more evenly distributed across the dynamic range. The distribution of the gene expression measurements is not just a characteristic of the gene; it is also connected to the samples in which expression was measured.

Entropy is calculated as

$$H(x) = -\sum_{i} p(x_i) \log_2 p(x_i)$$

where x is the variable whose entropy H is being calculated, log2 is the base 2 logarithm, and p(x_i) is the probability that a value of x falls within quantile i of that feature. For example, using 10 quantiles, a gene with the expression amounts 20, 22, 60, 80, and 90 would have deciles 7 units wide, with two values in the first decile and one each in the sixth, ninth, and tenth deciles, making H = 1.92. This pseudocode accomplishes the same.

    define number_bins as the number of bins to use in entropy calculations
    make a new empty list called post_filter
    loop through each gene G to be filtered
        find the minimum expression level for G
        find the maximum expression level for G
        set expression_range to maximum - minimum
        set interval to expression_range divided by number_bins
        make a new empty array called deciles, with size = number_bins
        set deciles[1] to minimum plus interval
        loop through the rest of deciles[2..number_bins] using the index i
            set deciles[i] to deciles[i-1] plus interval
        end loop
        make a new array called decile_count, with size = number_bins and every count set to zero
        loop through all the expression measurements for G
            set the flag bin_found to false
            loop through deciles[1..number_bins] using the index i
                if the expression measurement is <= deciles[i] and not bin_found
                    add one to decile_count[i]
                    set the flag bin_found to true
                end if
            end loop
        end loop
        set entropy to zero
        loop through decile_count[1..number_bins] using the index i
            if decile_count[i] is over zero
                set probability_in_bin to decile_count[i] divided by the number of expression measurements in G
                add to entropy (probability_in_bin times log-base-2(probability_in_bin))
            end if
        end loop
        set entropy to -1 times entropy
    end loop
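As a rough illustration, the same calculation can be sketched in Python. The function name `binned_entropy` and its `number_bins` parameter are assumptions made for this example, not part of the original pseudocode; the sketch places the first bin edge at the minimum plus one interval and treats upper bin edges as inclusive.

```python
import math

def binned_entropy(values, number_bins=10):
    """Shannon entropy (in bits) of values placed in equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all measurements identical: one occupied bin
    interval = (hi - lo) / number_bins
    counts = [0] * number_bins
    for v in values:
        # Upper bin edges are inclusive. The clamps keep the minimum in
        # the first bin and protect the last bin against rounding.
        index = math.ceil((v - lo) / interval) - 1
        counts[min(max(index, 0), number_bins - 1)] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

example = [20, 22, 60, 80, 90]
print(round(binned_entropy(example), 2))                 # 1.92
print(round(binned_entropy(example, number_bins=5), 2))  # 1.52
```

The first call reproduces the worked example with 10 bins; the second shows that the same five measurements give a different entropy when only 5 bins are used.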

After entropy is calculated for each gene or sample, a low entropy threshold (e.g., the lower 5th percentile) can be chosen and used to filter out genes that do not display a sufficient range of values. In using this technique, however, one needs to note that the calculated entropy depends directly on the number of quantiles used, and the choice of 10 quantiles (deciles) is arbitrary.
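The thresholding step might look like the following sketch, which assumes the entropies have already been computed and stored per gene. The function name `low_entropy_filter`, the nearest-rank definition of the percentile, and the gene names and entropy values are all hypothetical choices made for illustration.

```python
import math

def low_entropy_filter(entropy_by_gene, percentile=5.0):
    """Drop genes whose entropy is at or below the given lower percentile."""
    ranked = sorted(entropy_by_gene.values())
    # Nearest-rank percentile: the smallest entropy with at least
    # `percentile` percent of all entropies at or below it.
    rank = max(1, math.ceil(percentile * len(ranked) / 100.0))
    cutoff = ranked[rank - 1]
    kept = {g: h for g, h in entropy_by_gene.items() if h > cutoff}
    return kept, cutoff

# Hypothetical entropies for 20 genes (illustrative values only).
entropy_by_gene = {f"gene_{i:02d}": 0.1 * i for i in range(1, 21)}
kept, cutoff = low_entropy_filter(entropy_by_gene)
print(len(entropy_by_gene), len(kept))  # 20 19
```

With 20 genes, the lower 5th percentile corresponds to the single lowest entropy, so exactly one gene is removed here. Other percentile definitions (for example, interpolated ones) would move the cutoff slightly.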

As an alternative method, Heyer et al. proposed the use of the jackknife correlation coefficient to counter the spiking problem. The jackknife correlation coefficient is an alternative dissimilarity measure to the standard (Pearson's) correlation coefficient. To compute this measure for two genes measured in n samples, one computes n different correlation coefficients, each time with one of the samples removed. The jackknife correlation coefficient is then the minimum of these separate correlation coefficients. Use of this dissimilarity measure will effectively remove single outliers. However, it is not clear whether removing a single outlier is enough, especially when samples from many disparate tissues are used and several spikes are seen. It is also questionable whether removing spikes is even desirable, because when two genes demonstrate spiking behavior in the same samples, it may reflect something biologically interesting.
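A minimal sketch of the jackknife correlation as described here (the minimum of the n leave-one-out Pearson coefficients) is shown below. The helper names `pearson` and `jackknife_correlation` and the two hypothetical genes are assumptions for this example; it is not taken from Heyer et al.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either list is constant.
    return cov / math.sqrt(var_x * var_y)

def jackknife_correlation(x, y):
    """Minimum of the n Pearson correlations with one sample left out."""
    return min(
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    )

# Two hypothetical genes that disagree everywhere except a shared spike.
gene_a = [1, 2, 1, 2, 1, 100]
gene_b = [2, 1, 2, 1, 2, 90]
print(pearson(gene_a, gene_b))                # close to +1
print(jackknife_correlation(gene_a, gene_b))  # close to -1
```

The ordinary coefficient is driven almost entirely by the last sample, while the jackknife value reflects the anticorrelation seen once that sample is left out. With two or more spikes, no single removal exposes this, which is the limitation noted above.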

It is important to note that both variation filters and low entropy filters remove entire genes from downstream analysis. If the subsequent analysis is being used for hypothesis generation, this means that hypotheses containing the filtered genes will not be created or considered. Yet again, the cost-benefit analysis of section 2.1.4 should be considered.

## 4.5.3 Minimum expression level filter

As described elsewhere in this book, it is commonly believed that low gene expression measurements are less reliable than high gene expression measurements. Empirically, this means one's confidence interval around a measurement differs as a function of the expression level. If the low expression levels are particularly noisy, this can cause artifacts in the downstream clustering process, or worse, cause the creation of incorrect clusters. One solution is to ignore gene expression measurements that fall under a threshold value. Threshold minimum expression values can be set arbitrarily at a percentile of the distribution of expression measurements (e.g., at the lower 5th percentile), or determined empirically at an ideal point for clustering sensitivity and specificity. Here, we show pseudocode that removes measurements using the first of these methods.

    loop through each gene G to be filtered
        find the minimum expression level for G
        find the maximum expression level for G
        set expression_range to maximum - minimum
        set lowest_5th_percentile to minimum plus (expression_range times 0.05)
        loop through all the expression measurements for G
            if the expression measurement is under lowest_5th_percentile
                remove that measurement from G
            end if
        end loop
    end loop

Note that with this type of filter, we are removing gene expression measurements, not entire genes, from the downstream analysis. For example, when genes A and B are compared using a dissimilarity measure, each sample normally has two measurements associated with it: the expression level of gene A in that sample, and the expression level of gene B in that sample. By removing gene expression measurements using a minimum expression level filter, we are removing one or both of those gene expression measurements from the sample. Thus, when genes A and B are compared using the dissimilarity measure, there will be fewer samples used in calculating the measure.
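The effect on a pairwise comparison can be sketched as follows. This example assumes the threshold for each gene sits 5% of the way up that gene's range (its minimum plus 5% of the range) and marks removed measurements with `None`; the function names and the expression values are hypothetical.

```python
def mask_low_measurements(values, fraction=0.05):
    """Replace measurements in the lowest part of a gene's range with None."""
    lo, hi = min(values), max(values)
    threshold = lo + fraction * (hi - lo)
    return [v if v >= threshold else None for v in values]

def shared_samples(a, b):
    """Measurement pairs from samples where both genes kept a value."""
    return [(x, y) for x, y in zip(a, b) if x is not None and y is not None]

# Hypothetical expression levels for genes A and B in six samples.
gene_a = [3, 40, 55, 62, 71, 90]
gene_b = [12, 15, 48, 2, 66, 80]

masked_a = mask_low_measurements(gene_a)  # drops the first sample's value
masked_b = mask_low_measurements(gene_b)  # drops the fourth sample's value
pairs = shared_samples(masked_a, masked_b)
print(len(gene_a), len(pairs))  # 6 4
```

Each gene loses only one measurement, but because the losses fall in different samples, the dissimilarity measure for this pair would be computed from four samples rather than six.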