Suppose that you have mastered normalization of microarray data sets, performed experiments to ensure reproducibility, acquired several large data sets, and completed your clustering or classification experiments. After all that preparation, you have what looks like a very nice set of results: the clusters of genes you found in one set of experiments have reproduced beautifully in another set of experiments designed to be identical. Furthermore, every time you slightly vary the conditions from those of the control or baseline, the sets of genes in the clusters change in reproducible ways. Moreover, the clusters appear compact and distinct, so you suspect that you have identified truly significant, functionally related genes.

If one steps back a little at this point, the meaning of all the effort invested to date becomes less obvious, because what one has actually generated is a list of microarray accession numbers. Even an experienced functional genomicist looking at this list of numbers will not have the vaguest idea of what it means. Some clusters contain only two or three genes, while others include hundreds. At this point, the functional genomicist has little alternative but to look up the gene corresponding to each microarray accession number, one by one, and then peruse the literature on each of the genes to determine:

• whether that gene has a known function, and if so, in what class (e.g., transcription factor, metabolic enzyme, structural protein, or a combination of these);

• whether the genes found clustered together have been described in the literature as functionally similar or related, share promoter motifs, or include a subset that act as transcription factors for the rest of the cluster;

• whether homologs or orthologs have been found to be functionally related in any known physiological or pathological state;

• whether the resultant genes are known to be associated with the experimental conditions tested.
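The first step of the rote lookup listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the annotation table, accession numbers, gene symbols, and function names below are hypothetical placeholders, not entries from any real database, and the literature review behind each annotation is exactly the part that no such table replaces.

```python
from collections import Counter

# Hypothetical annotation table: accession number -> (gene symbol, functional class).
# All entries are invented placeholders for illustration only.
ANNOTATIONS = {
    "ACC0001": ("GENE_A", "transcription factor"),
    "ACC0002": ("GENE_B", "structural protein"),
    "ACC0003": ("GENE_C", "metabolic enzyme"),
    "ACC0004": ("GENE_D", "transcription factor"),
}


def summarize_cluster(accessions, annotations=ANNOTATIONS):
    """Count functional classes in one cluster.

    Accessions missing from the table are tallied as 'unknown'.
    """
    counts = Counter()
    for acc in accessions:
        _symbol, func_class = annotations.get(acc, (None, "unknown"))
        counts[func_class] += 1
    return dict(counts)


def clusters_mixing(clusters, class_a, class_b, annotations=ANNOTATIONS):
    """Return names of clusters holding at least one gene of each class."""
    hits = []
    for name, accessions in clusters.items():
        summary = summarize_cluster(accessions, annotations)
        if summary.get(class_a, 0) > 0 and summary.get(class_b, 0) > 0:
            hits.append(name)
    return hits


# Hypothetical clustering result: cluster name -> list of accession numbers.
clusters = {
    "cluster_1": ["ACC0001", "ACC0002", "ACC9999"],
    "cluster_2": ["ACC0003"],
}

if __name__ == "__main__":
    for name, accessions in clusters.items():
        print(name, summarize_cluster(accessions))
    print(clusters_mixing(clusters, "transcription factor", "structural protein"))
```

A tally like this only answers the first question in the list (known function and class); whether co-clustered genes are related in the literature, in homologs, or under the tested conditions still requires the manual review described here.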

To the degree that the functional genomicist understands basic biology, this task is somewhat more tractable. A competent biologist will acquire in the course of his or her training a large framework of functional dependencies of different biological processes and the genetic machinery that underlies these dependencies. This kind of knowledge is unlikely to be present in any single journal article. Even a review article presumes a large shared body of biological knowledge that will be opaque to a researcher unfamiliar with a particular corner of the large space of biological knowledge. Without this rich context, many insights that the analyses might suggest will be missed. Consequently, a biologist familiar with the specific biological phenomenon being studied will be an invaluable collaborator in interpreting the meanings of the results from analysis of these massively parallel data sets. However, even a competent biologist armed with a very efficient search engine of the biomedical literature will find that determining whether the functional dependencies found in these clusters make any particular sense will take a very long time, often more than any other part of the analytic methodology of functional genomics.

Consequently, after having undertaken this laborious, rote lookup process several times, the following fantasy occurs to most genomicists: "Wouldn't it be nice if I could look at the clusters and automatically see that one of them contained both genes coding for known transcription factors and genes coding for structural proteins, and that under two pathological conditions, those structural proteins were only translated if the transcription factors in that same cluster were expressed at their highest levels. Also, six researchers have developed mouse models in which four of the genes in that cluster have been misexpressed, and those mouse models can be obtained by filling out a form on a specified website..."

Enticing as this fantasy is, it unfortunately does not represent the current state of the art. However, there are many early efforts to achieve this goal. Some skeptics will note that this is an "AI-complete problem" [7], in the sense that reaching the goal outlined in the above scenario would be equivalent to significant progress in developing machine intelligence that can reason in a commonsensical fashion about a wide range of heterogeneous knowledge. However, as we shall see from some of the early efforts below, even a much more modest set of goals might be quite useful to the functional genomicist. The technological solutions proposed to achieve these goals fall into three increasingly circumscribed tasks: development of bio-ontologies, development of common data models, and development of common nomenclatures.
