Data Reduction and Filtering

Clustering results can easily be biased when the assumptions behind the analysis are not met. Each dissimilarity measure has its own set of assumptions and requirements. For example, the correlation coefficient is reliable only when the two data sets are approximately normally distributed, and it is easily biased by outlying points (see figure 4.1). Another example: genes whose expression is constant across samples may still show variation due to measurement noise, and if these genes are not filtered out, normalization can amplify that noise until it looks like a true signal. Four common methods for filtering genes that fall into these degenerate cases are given here.
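To make the noise-amplification problem concrete, here is a minimal sketch in Python with NumPy. The data are simulated, and the names `zscore_rows` and `sd_threshold`, the threshold value, and the choice of a simple standard-deviation cutoff are all assumptions made for illustration; this is not necessarily one of the four methods described in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 12

# Hypothetical data: one gene with a real two-group difference,
# and one gene that is truly constant apart from measurement noise.
signal_gene = np.r_[np.full(6, 5.0), np.full(6, 9.0)] + rng.normal(0, 0.1, n_samples)
flat_gene = np.full(n_samples, 7.0) + rng.normal(0, 0.1, n_samples)
expr = np.vstack([signal_gene, flat_gene])  # rows = genes, columns = samples


def zscore_rows(m):
    """Center each row to mean 0 and scale it to standard deviation 1."""
    centered = m - m.mean(axis=1, keepdims=True)
    return centered / m.std(axis=1, ddof=1, keepdims=True)


raw_sd = expr.std(axis=1, ddof=1)
z_sd = zscore_rows(expr).std(axis=1, ddof=1)

# Before normalization the flat gene varies far less than the signal gene;
# after normalization both rows have the same spread, so pure noise is
# stretched to the same scale as the real signal.
print("raw sd per gene:       ", raw_sd)
print("normalized sd per gene:", z_sd)

# One possible guard (an assumption, with an arbitrary cutoff): drop genes
# whose raw standard deviation is below a threshold before normalizing.
sd_threshold = 0.5
keep = raw_sd > sd_threshold
filtered = expr[keep]
print("genes kept:", keep)
```

In practice the cutoff would have to be chosen from the noise level of the measurement platform rather than fixed in advance as it is here.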

Figure 4.1: Example of how a single point can distort the overall correlation. Left: the scatterplot shows a negative correlation. Right: the same scatterplot with one additional point. If the values of even a single point are extreme enough, the correlation coefficient can change substantially (though the variance of the correlation coefficient, if calculated, would also be higher). This is primarily because the correlation coefficient assumes normally distributed values.
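The effect shown in the figure can be reproduced with a few lines of Python and NumPy. This is only a sketch: the ten data points and the position of the added point are made-up values chosen for illustration, not the data behind figure 4.1.

```python
import numpy as np

# Ten hypothetical points with a clear negative trend.
x = np.arange(1.0, 11.0)
jitter = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 0.1])
y = 10.0 - x + jitter

# Pearson correlation of the original points.
r_without = np.corrcoef(x, y)[0, 1]

# Add a single extreme point far away from the rest of the data.
x_out = np.append(x, 50.0)
y_out = np.append(y, 50.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without the extra point: {r_without:.3f}")
print(f"r with the extra point:    {r_with:.3f}")
```

With these values the coefficient goes from strongly negative to strongly positive, even though ten of the eleven points are unchanged.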
