Before launching into a discussion of the various clustering algorithms available to bioinformaticians, one needs to carefully consider the motivations behind data mining. Data mining can be loosely defined as determining the relationships among the elements in a data set. A variety of objects in biology can be captured in functional genomics data sets, a few of which are listed here.
• Genes: The majority of bioinformatics analysis in functional genomics involves determining the relationships among genes. Genes can be clustered not only by gene expression but also by gene sequence, nucleotide composition, linkage, and chromosomal position. Genes can be clustered using measurements from normal physiological states as well as from abnormal or pathological states.
• Alternative splicing products: An increasing number of genes are now predicted to have alternative splicing products, or a varying combination of exons. Various alternatively spliced products could be clustered to assess their pattern of expression compared to each other.
• Tissues: Normal tissues can be clustered to assess their relative degrees of similarity or dissimilarity. Unknown tissues, such as pathological specimens, could then be compared with clusters of known tissues.
• Diseases: Tissues from single or multiple diseases can be clustered. What is currently thought to be one disease may in fact be distinguishable into two or more subtypes.
• Phenotype: Cell lines, organisms, or patients can be clustered based on a series of measurements on them, of which gene expression may be only one component.
• Patients: In medical informatics, patient clusters are distinguishable in many ways: geographical location or home, presence or absence of a disease, the quantitation of specific laboratory measurements, or details of a longitudinal course of medical care.
• Promoters: Transcription factors are known to bind these specific regions of DNA to cause increased or decreased expression of the controlled gene. Promoters could be clustered based on sequence similarity or association with functional genomics measurements.
• Environment: Environmental toxins and factors can be clustered by the pattern of gene expression, or the change in expression, seen after exposure.
As diagrammed in figure 2.11, all of these data types may end up in a large heterogeneous data table that can be used to advantage in the automated data-mining techniques described below. The lack of a priori limitations or assumptions about which data sources and data types will be of clinical or biological interest is precisely one of the bulwarks of the genomic approach to discovery.
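To make the idea of a heterogeneous data table concrete, the following is a minimal Python sketch, not the method used in this text. It assumes a small invented table in which each row is a sample and the columns mix expression values with a laboratory measurement. The sample names, column names, and numbers are all hypothetical, and the choice of z-score standardization followed by Euclidean distance is only one of many reasonable ways to compare rows whose columns have different units.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical heterogeneous table: each row is a sample, and the columns mix
# gene expression values with one laboratory measurement (all values invented).
table = {
    "normal_liver_1": {"geneA_expr": 8.1, "geneB_expr": 2.0, "lab_value": 0.9},
    "normal_liver_2": {"geneA_expr": 7.9, "geneB_expr": 2.2, "lab_value": 1.0},
    "tumor_1": {"geneA_expr": 3.0, "geneB_expr": 9.5, "lab_value": 4.2},
    "tumor_2": {"geneA_expr": 2.7, "geneB_expr": 9.9, "lab_value": 3.8},
}


def standardize(rows):
    """Convert every column to z-scores so columns with different units are comparable."""
    columns = sorted(next(iter(rows.values())))
    stats = {}
    for column in columns:
        values = [row[column] for row in rows.values()]
        # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
        stats[column] = (mean(values), pstdev(values) or 1.0)
    return {
        name: [(row[c] - stats[c][0]) / stats[c][1] for c in columns]
        for name, row in rows.items()
    }


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbor(vectors, name):
    """Return the name of the sample closest to `name` in standardized space."""
    others = (other for other in vectors if other != name)
    return min(others, key=lambda other: euclidean(vectors[name], vectors[other]))


vectors = standardize(table)
neighbors = {name: nearest_neighbor(vectors, name) for name in vectors}
for name, neighbor in neighbors.items():
    print(f"{name} is most similar to {neighbor}")
```

In this toy table, each sample's nearest neighbor is the other sample of the same kind, which is the sort of relationship a clustering algorithm would then group. Without the standardization step, columns measured on larger numeric scales would dominate the distance.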