Bio-ontology projects

We now cover a few examples of successful bio-ontologies. For a more comprehensive and current compendium of such ontologies, the reader is referred to the first issue of Nucleic Acids Research in any recent year.

The Gene Ontology Consortium

The Gene Ontology (GO) Consortium is one of the more high-profile efforts to have developed in response to the pressing need for a unifying framework. As summarized by Ashburner et al. [13], the goal of the GO Consortium is to provide a controlled vocabulary organized as three independent ontologies:

1. A description of molecular function. This describes the biochemical activity of the entity, such as whether it is a transcription factor, a transporter, or an enzyme, without making any further commitment as to where the function occurs or whether it occurs as part of a larger biochemical process.

2. The cellular component within which gene products are located. This localizes a gene product's molecular function to a structure such as the ribosome, the nucleosome, or the mitochondrion. Localization can help determine whether a purported function could occur through direct physical interaction between gene products or as a result of an indirect mechanism.

3. The biological process implemented by the gene products. Biological process refers to a higher-order process, such as pyrimidine metabolism, protein translation, or signal transduction.

The goal is to allow queries across databases using GO terms, thereby linking biological information within and across species. The three ontologies are presented in figures 5.1, 5.2, and 5.3.
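The kind of cross-species query described above can be sketched in a few lines of Python. This is only a minimal illustration, not the GO Consortium's data model: the `Annotation` record, the helper names, and the flat in-memory index are assumptions made for the example, and the four sample records reuse gene-product and component labels legible in figure 5.2 rather than an actual GO annotation file.

```python
from collections import defaultdict
from typing import NamedTuple

# The three GO ontologies named in the text.
ASPECTS = ("molecular_function", "cellular_component", "biological_process")


class Annotation(NamedTuple):
    """Hypothetical annotation record: one gene product tagged with one term."""
    species: str
    gene_product: str
    aspect: str
    term: str


# Illustrative records only, built from labels that appear in figure 5.2.
ANNOTATIONS = [
    Annotation("Saccharomyces", "MCM2", "cellular_component", "pre-replicative complex"),
    Annotation("Drosophila", "Mcm2", "cellular_component", "pre-replicative complex"),
    Annotation("Saccharomyces", "ORC2", "cellular_component", "origin recognition complex"),
    Annotation("Drosophila", "Orc2", "cellular_component", "origin recognition complex"),
]


def build_index(annotations):
    """Group annotations by (aspect, term) so a term lookup is a dict access."""
    index = defaultdict(list)
    for ann in annotations:
        if ann.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {ann.aspect}")
        index[(ann.aspect, ann.term)].append(ann)
    return index


def gene_products_by_species(index, aspect, term):
    """Return {species: [gene products]} for every annotation to the given term."""
    result = defaultdict(list)
    for ann in index.get((aspect, term), []):
        result[ann.species].append(ann.gene_product)
    return dict(result)


if __name__ == "__main__":
    idx = build_index(ANNOTATIONS)
    print(gene_products_by_species(idx, "cellular_component", "pre-replicative complex"))
```

Because both organisms are annotated with the same controlled term, a single lookup links the yeast and fly gene products without any species-specific mapping.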

Figure 5.1: The Molecular Function Ontology of GO. (From Ashburner et al. [13].)

[Figure content not reproduced. Legible labels: cell; cytoplasm; nucleus; nucleolus; nuclear membrane; pre-replicative complex; replication fork; origin recognition complex; alpha DNA polymerase:primase complex; delta DNA polymerase; DNA replication factor A complex; DNA replication factor C complex; with gene products (e.g., MCM2, MCM3, CDC54/MCM4, CDC46/MCM5, MCM6, CDC47/MCM7, ORC2) from Saccharomyces and Drosophila.]

Figure 5.2: The Cellular Component Ontology of GO. (From Ashburner et al. [13].)

Figure 5.3: The Biological Process Ontology of GO. (From Ashburner et al. [13].)

These three ontologies are the result of a practical consensus on the kinds of distinctions that are currently important in functional genomics. If the GO Consortium were to populate this form of annotation across all genes of all sequenced organisms, the result would be an invaluable resource, even though the ontology makes so few commitments that it may not support several categories of useful inference. Examples of such inferences include causal transitivity, temporal ordering, and the assertion of disjoint subconcepts, among many others. Representations that might support these more elaborate inferences are discussed in section 5.1.2. There are, however, significant trade-offs in using ontologies with greater ontological commitment and expressivity (see section 5.2). Despite the limitations of the simple GO ontologies, early efforts to annotate organismal genomes using this infrastructure appear very promising, notably the GO annotations on the LocusLink site of the National Center for Biotechnology Information (NCBI).[1]
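To make one of these richer inferences concrete, the following sketch shows what a temporal-ordering inference might look like if a representation allowed a transitive "precedes" relation to be asserted. It is an illustration under stated assumptions, not part of GO: the relation name, the function names, and the use of cell-cycle phases within a single cycle are all chosen for the example.

```python
def transitive_closure(pairs):
    """Return every (a, b) such that a precedes b directly or through a chain."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # If a precedes b and b precedes d, then a precedes d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Asserted facts only: each phase directly precedes the next (assumed example).
PRECEDES = {("G1", "S"), ("S", "G2"), ("G2", "M")}


def precedes(earlier, later, asserted=PRECEDES):
    """True if the ordering is asserted or can be inferred by transitivity."""
    return (earlier, later) in transitive_closure(asserted)


if __name__ == "__main__":
    print(precedes("G1", "M"))   # inferred, never asserted directly
    print(precedes("M", "G1"))   # not derivable from the asserted facts
```

A plain controlled vocabulary can record the three asserted pairs as annotations, but it is the declared transitivity of the relation that lets a system derive the ordering of G1 and M.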


Kyoto Encyclopedia of Genes and Genomes

One of the more comprehensive attempts to describe molecular and functional biological knowledge for all known genes, down to the level of interacting molecules, is the Kyoto Encyclopedia of Genes and Genomes (KEGG).[2] Rather than generating annotations per gene, KEGG attempts to curate and elaborate on metabolic pathways, large molecular assemblies, and regulatory pathways. These larger abstractions are invaluable to biological researchers and allow rapid identification of how a gene fits into cellular physiology. The KEGG ontology has five components:

1. Pathways with a corresponding graphic pathway map

2. Ortholog groups, which represent highly conserved functional groups and are annotated with the genes that belong to them across species

3. Molecular catalogs providing functional annotation of proteins, RNA, and small molecules

4. Genome maps, which mirror much of the positional information at the NCBI

5. Gene catalogs, which have content similar to the Molecular Function Ontology of GO

The richness and utility of the KEGG ontology can be seen in figure 5.4, which provides a snapshot of one pathway maintained in its Pathways ontology. It documents our increasingly complex state of knowledge about the genetic regulation of programmed cell death (apoptosis).

Figure 5.4: The KEGG representation for apoptosis.
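The pathway-centered organization described above can be sketched as a small data structure. This is a hedged illustration, not KEGG's actual schema or API: the class names, the `map_id` values, and the placeholder pathway are hypothetical, and the apoptosis gene set lists a few well-known apoptosis regulators chosen for the example rather than copied from the KEGG map.

```python
from dataclasses import dataclass, field


@dataclass
class Pathway:
    """Hypothetical pathway record: a name, a graphic-map id, and member genes."""
    name: str
    map_id: str
    genes: set = field(default_factory=set)


@dataclass
class OrthologGroup:
    """Hypothetical ortholog group: one conserved function, one gene per species."""
    name: str
    members: dict = field(default_factory=dict)  # species -> gene symbol


def pathways_for_gene(gene, pathways):
    """Answer 'how does this gene fit into cellular physiology?' by pathway name."""
    return sorted(p.name for p in pathways if gene in p.genes)


def orthologs(gene, groups):
    """Return {species: gene} for the other members of the gene's ortholog groups."""
    found = {}
    for group in groups:
        if gene in group.members.values():
            found.update({s: g for s, g in group.members.items() if g != gene})
    return found


# Illustrative data only; identifiers are placeholders, not KEGG accessions.
PATHWAYS = [
    Pathway("apoptosis", "map-placeholder-1", {"CASP3", "CASP9", "BCL2"}),
    Pathway("placeholder pathway", "map-placeholder-2", {"geneX"}),
]
GROUPS = [
    OrthologGroup("caspase-3 (placeholder group)",
                  {"Homo sapiens": "CASP3", "Mus musculus": "Casp3"}),
]

if __name__ == "__main__":
    print(pathways_for_gene("CASP3", PATHWAYS))
    print(orthologs("CASP3", GROUPS))
```

Indexing by pathway rather than by gene is the design choice the text highlights: a gene's role is read off from the pathways that contain it, and ortholog groups carry that reading across species.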

5.1.2 Advanced knowledge representation systems for bio-ontology


The AI community has developed a set of tool kits for large and highly portable ontologies over the past two decades. It has not escaped that community's notice that the bio-ontologies developed to date do not exceed the representational capabilities of existing knowledge representation languages. In particular, languages such as Ontolingua [142] and the Ontology Mark-up Language/Conceptual Knowledge Mark-up Language (OML/CKML) [131] have been evaluated for their ability to represent biological knowledge, and specifically functional genomic relationships. Ontolingua is a spin-off of the Knowledge Interchange Format (KIF) and is a result of research funded by the Defense Advanced Research Projects Agency (DARPA) Knowledge Sharing Effort [137], which aimed to support knowledge reuse in ontology-building efforts, an aim shared by the bioinformatics community. The OML/CKML effort has been led by Washington State University. According to a report by McEntire et al. [131], the two representation languages have similar capabilities, although at this time OML/CKML has a syntax more compatible with other meta-data languages through its use of the eXtensible Mark-up Language (XML). The reader is referred to [131] for a description of the issues that had to be addressed when representing existing genomic knowledge bases.
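The syntactic point about XML can be illustrated with a toy example that writes the same assertion in two forms. Neither form is actual Ontolingua/KIF or OML/CKML syntax: the parenthesized string only gestures at a KIF-style notation, and the element and attribute names (`assertion`, `relation`, `subject`, `object`) are invented for this sketch.

```python
import xml.etree.ElementTree as ET


def to_sexpr(relation, subject, obj):
    """Render an assertion as a parenthesized expression (KIF-like in spirit only)."""
    return f"({relation} {subject} {obj})"


def to_xml(relation, subject, obj):
    """Render the same assertion as XML using made-up element names."""
    elem = ET.Element("assertion", {"relation": relation})
    ET.SubElement(elem, "subject").text = subject
    ET.SubElement(elem, "object").text = obj
    return ET.tostring(elem, encoding="unicode")


def from_xml(text):
    """Recover (relation, subject, object) with a generic XML parser."""
    elem = ET.fromstring(text)
    return elem.get("relation"), elem.find("subject").text, elem.find("object").text


if __name__ == "__main__":
    print(to_sexpr("part-of", "nucleolus", "nucleus"))
    xml_text = to_xml("part-of", "nucleolus", "nucleus")
    print(xml_text)
    print(from_xml(xml_text))
```

The XML form can be read back by any standard XML parser without a language-specific reader, which is the interoperability advantage the comparison refers to; the expressive content of the two forms is the same.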

Another knowledge representation system that has been useful in the biomedical domain is PROTÉGÉ, designed by Mark Musen at Stanford. In practice, none of these more advanced knowledge representations has had nearly the impact and usage of simple ontologies such as those of GO or Proteome.[3] Part of the reason is that the need for any kind of functional annotation is so dire that the capabilities of the advanced knowledge representations (e.g., automated inference) are not a priority, and they are therefore seen as unnecessary overhead when applied across tens of thousands of genes. It is also the case that, even though these advanced representations have significant expressive power, they remain fundamentally limited, as described below. We wish to emphasize this point because it is only through the implementation of bio-ontologies with greater expressive power that many of the properties desired for such ontologies can be achieved. The alternative is the creation of a multiplicity of special-purpose representations of limited expressivity that can be made interoperable only with significant effort and without guarantees of semantic or computational soundness.

[1] http://www.ncbi.nlm.nih.gov/LocusLink/.

[2] http://www.genome.ad.jp/kegg/kegg2.html.

[3] http://www.proteome.com/.
