Data Model Introduction

Data models are the logical structures used to represent a collection of entities and their underlying one-to-one, one-to-many, and many-to-many relationships. The main motivation for creating a data model is usually to implement it within a database management system, most often a relational database management system (e.g., Oracle, SQL Server, Sybase, or MySQL). As the relational calculus [48] is based on the first-order predicate calculus, it is possible, although unrewarding, to express a wide range of formal relationships within the data model. The reasons for the unwieldy nature of such an effort are articulated in the prior section. Presently, in the domain of bioinformatics and functional genomics, the most relevant heuristic distinction between bio-ontologies and data models is that bio-ontologies are usually developed to provide a shareable framework for describing attributes of genes, whereas data models are at this time mostly developed to store the measurements of genomic systems. For DNA sequences, this corresponds to the GenBank data model; for microarrays, it corresponds to a way of organizing the pertinent values obtained from a microarray hybridization experiment.
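To make the idea of entities and one-to-many relationships concrete, the following is a minimal sketch of a relational layout for expression measurements, using Python's built-in sqlite3 module. The table names, column names, and sample values are our own illustrative assumptions and do not correspond to any of the published data models discussed in this section.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE platform (
    platform_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
-- One platform has many samples (one-to-many).
CREATE TABLE sample (
    sample_id   INTEGER PRIMARY KEY,
    title       TEXT NOT NULL UNIQUE,
    platform_id INTEGER NOT NULL REFERENCES platform(platform_id)
);
-- One sample has many measurements (one-to-many).
CREATE TABLE measurement (
    sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
    gene      TEXT NOT NULL,
    value     REAL NOT NULL,
    PRIMARY KEY (sample_id, gene)
);
""")

# Made-up rows, inserted only so the query below has something to return.
conn.execute("INSERT INTO platform VALUES (1, 'spotted cDNA array')")
conn.execute("INSERT INTO sample VALUES (1, 'liver, fasted', 1)")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [(1, "geneA", 2.5), (1, "geneB", 0.4)],
)


def measurements_for(title):
    """Return (gene, value) pairs for the sample with the given title."""
    rows = conn.execute(
        "SELECT m.gene, m.value FROM measurement m "
        "JOIN sample s ON s.sample_id = m.sample_id "
        "WHERE s.title = ? ORDER BY m.gene",
        (title,),
    )
    return rows.fetchall()
```

Note that nothing in such a schema says what a stored value means; that depends on the measurement platform, which is the difficulty taken up in the rest of this section.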

Among the goals often stated for the development of standardized microarray data models is to allow researchers across multiple laboratories to take advantage of the precious RNA samples obtained from various sources and hybridized to microarrays in several, initially unrelated, experiments. Ideally, these researchers should be able to perform aggregate analyses across all these experiments, independent of the provenance or manufacture of each of the microarray systems. In this way, we may create a shared set of repositories of expression data that might reveal subtle gene interactions across a larger sample than would be observable within a single institution's data set.

Unfortunately, expression measurements are human artifacts rather than ground "truths" in biology.[7] A consequence of this fact is that there are many such artifacts. There are several microarray measurement systems in use, and many more under development, such that convergence upon a particular data model involves the arduous task of obtaining a sufficiently general representation of all the measurement types across different microarray systems, while at the same time providing sufficient specificity for each measurement system so as to avoid loss of detail of a particular study. Unfortunately, a large part of this specificity reflects differences in the measurement techniques of a particular microarray platform that result in nonlinear relationships with the results from other platforms. The implication of this is that even if the data models are shared across microarray platforms, the results encoded in these shared data models may not be directly comparable.

Let us be a little more explicit regarding the challenge of a standardized data model for microarray experiment results. Unlike DNA sequences, which are by nature digital and whose interpretation should be independent of the modality by which the sequence was acquired, expression data are analog signals, and their value and interpretation are very much dependent on the technology used to obtain that value. Consider just two of the technologies used to measure expression: photolithographically constructed oligonucleotide microarrays with several probes per gene hybridized against a single color-stained sample, and robotically spotted microarrays with one probe per gene hybridized against two samples, each colored with a different dye. In the first case, the expression level is reported as a function of a trimmed mean of intensities, each probing for a particular part of a gene, and the "average difference" between intensities of oligonucleotides that perfectly match against that gene part and oligonucleotides that have a central base mismatch with that same part. In the second case, the gene expression level is measured as a ratio of hybridization of one specific cDNA from one sample against the cDNA from another sample. Although there is some relationship between the two measurements, it is not clear or empirically determined what that relationship is. For example, two-dye microarray measurements are directly related to the second or control sample used, whereas single-dye oligonucleotide microarray measurements are in theory an absolute expression level.
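The difference between the two kinds of value can be illustrated with a short sketch. The functions below are deliberate simplifications written for this illustration, not any vendor's actual algorithm: the function names, the 10% trimming fraction, and the example intensities are all assumptions.

```python
def average_difference(pm, mm, trim=0.1):
    """Trimmed mean of perfect-match minus mismatch intensities.

    pm and mm are equal-length lists of intensities for one probe set.
    The trim fraction is an assumed parameter, not a published constant.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    diffs = sorted(p - m for p, m in zip(pm, mm))
    k = int(len(diffs) * trim)  # number of values dropped from each end
    kept = diffs[k:len(diffs) - k] if k else diffs
    return sum(kept) / len(kept)


def two_dye_ratio(sample, reference):
    """Ratio of the two channel intensities measured at a single spot."""
    if reference <= 0:
        raise ValueError("reference intensity must be positive")
    return sample / reference


# Made-up intensities: ten probe pairs for one gene, one of them an outlier.
pm = [110, 120, 130, 140, 150, 160, 170, 180, 190, 900]
mm = [10] * 10
single_dye_value = average_difference(pm, mm)

# Made-up channel intensities for one spot on a two-dye array.
two_dye_value = two_dye_ratio(400.0, 100.0)
```

The first value is expressed in intensity units and depends on the probes chosen; the second is dimensionless and depends on the reference sample. A shared field named "expression level" would therefore hold numbers that cannot be compared without further modeling.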

As a result, there are at least as many microarray expression data models available as there are microarray expression technologies and, in fact, considerably more. At the time of this writing, there has yet to be any consolidation of these different competing standards into a single one, although this is likely to happen in the near future. However, in the absence of such a consensus, we point out the most promising and successful of these efforts. Not all these efforts in data modeling are sufficiently general at the time of this writing to encompass all the current microarray gene profiling technologies. We will note this as we describe the current major candidates for standardized data models.

Gene Expression Omnibus (GEO) The NCBI has been developing the GEO data model to support the construction of a gene expression repository with the same inspiration and goals that were originally targeted for the GenBank repository for DNA sequence. The goal of GEO is to support spotted microarray technology, oligonucleotide microarray technology, hybridization filters, and Serial Analysis of Gene Expression (SAGE) data. The entities used in GEO are shown in figure 5.7 and can be accessed on-line.[8] Examination of this figure reveals that, currently, Affymetrix style files are poorly supported, whereas the two-dye spotted microarray data type is better supported.

Sample fields

• Experiment type: Single channel (absolute value) such as radioactive labeled filters, or dual channel types such as a red/green labeled microarray.

• Sample title: Try to choose a short title which will be specific to your sample - preferably greater than 8 and less than 20 characters. The title must be unique over all your previously accessioned GEO samples. You will be notified if such a name clash occurs when you try to submit, and then will be given a chance to change your sample title. Text is required in this field.

• Lot/batch: Any manufacturer defined lot and/or batch numbers for the platform used to derive this sample's data.

• mRNA source: It is best to use a short list of words to describe the source of the mRNA used to derive this sample's data. Do not list organisms in this field. Text is required in this field.

• Ch1 mRNA source: It is best to use a short list of words to describe the source of the mRNA used to derive this sample's data. Do not list organisms in this field. Text is required in this field.

• Ch2 mRNA source: It is best to use a short list of words to describe the source of the mRNA used to derive this sample's data. Do not list organisms in this field. Text is required in this field.

• PubMed id: PubMed id (PMID) references a publication which, perhaps, further describes your sample. A PubMed id is a numeric value which you may obtain from the PubMed record.

• Web link: World Wide Web link to reference a webpage which, perhaps, further describes your sample.

• Keywords: One or more terms, as a comma-delimited list, which you think might be useful terms in queries.

• Image file: Use the Browse button to the right to select a local file to be uploaded. The purpose of this image is to serve as a qualitative reference of your experiment. It is not meant to be image-analyzable. In order to be accepted into GEO, the image size must be less than 100 kilobytes, and must be in JPEG or GIF format.

• Total # tags: A whole, non-zero number is required. The reciprocal of this number is used for SAGE library normalization.

• Norm. factor: This number is meant to be multiplied to all of the background-corrected scalar values extracted from each platform feature referred to in the data table. A number is required.

• Protocol: Select the anchoring enzyme used in your SAGE protocol, e.g., NlaIII or Sau3A. Protocol selection in SAGE libraries replaces platform selection. A selection is required.

• Platform used: Submission of non-SAGE sample data requires explicit reference to a valid GEO platform accession number. Therefore, platform information must be submitted before sample data derived from that platform. A selection is required.

• Data file: Use the Browse button to the right to select a local file to be uploaded. This file should meet the validation criteria set for the platform type relevant to your sample. A content-bearing, valid data file is required.

Figure 5.7: A subset of the definition of sample in the GEO. At the time of this writing, the sample definition in the GEO does not appear to support the full complexity of oligonucleotide high-density microarrays.
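Two of the sample fields above describe simple arithmetic: the reciprocal of the total tag count scales SAGE tag counts, and the normalization factor multiplies background-corrected values. The following sketch shows that arithmetic only. The function names, tag strings, and numbers are our own illustrative assumptions, and this is not GEO's submission or validation code.

```python
def normalize_sage_counts(tag_counts, total_tags):
    """Scale raw SAGE tag counts by the reciprocal of the library size.

    tag_counts maps a tag sequence to its raw count; total_tags must be
    a whole, non-zero number, as the field description requires.
    """
    if not isinstance(total_tags, int) or total_tags <= 0:
        raise ValueError("total_tags must be a whole, non-zero number")
    scale = 1.0 / total_tags
    return {tag: count * scale for tag, count in tag_counts.items()}


def apply_norm_factor(values, norm_factor):
    """Multiply every background-corrected value by the normalization factor."""
    return [v * norm_factor for v in values]


# Made-up SAGE library of four tags in total.
normalized = normalize_sage_counts({"CATGAAAA": 1, "CATGCCCC": 3}, 4)

# Made-up background-corrected values and normalization factor.
scaled = apply_norm_factor([10.0, 20.0], 0.5)
```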

The Gene Expression Mark-Up Language (GEML) GEML syntax is based on XML. GEML was originally developed by Rosetta Inpharmatics, and subsequently, several other public and private institutions have become involved, such as Agilent Technologies and Nature America, Inc. Details on GEML can be found at http://www.geml.org/.

GEML starts with an XML format, as this is the most widely used meta-data syntax developed. XML is maintained by the World Wide Web Consortium to define general-purpose mark-up languages, and is now used widely as the interchange format for a variety of electronic commerce functions and scientific interchange efforts. There is a multiplicity of tools for generating, parsing, and displaying XML content, and consequently GEML leverages this existing effort. The goal of GEML is illustrated in figure 5.8 (taken from their website). At the top level are the various types of microarray measurement systems, and in the middle is the GEML lingua franca into which all the values from these measurement systems are translated. Below the GEML layer are shown several applications that work with microarray data as abstracted into a standardized format in GEML. This architecture is similar to many other three-tiered architectures where there is a common data model abstracting away the heterogeneity and complexity of the various databases that it accesses. In theory, a programmer for microarray analysis software would only have to understand the GEML Document Type Definition (DTD) to write applications that worked with the results of measurements on all microarray platforms.

Figure 5.8: The abstraction layers of the Gene Expression Mark-up Language (GEML).

The GEML specification takes advantage of the DTD file. A DTD is the actual specification of the formal data model that any XML message that hews to that DTD must follow. DTD specifications are the World Wide Web Consortium's prescribed mechanism for defining application-specific data models, and many industries (e.g., finance and imaging) are in the process of defining a consensus DTD for their own applications.[9] GEML specifies two DTDs: a pattern DTD, which describes the genes reported on a chip layout, and the profile DTD, which contains expression data, treatment, and hybridization information. That is, the latter DTD describes the experimental conditions and the resultant values. Figure 5.9 shows the pattern DTD that has been quoted verbatim from the GEML site. Examination of this DTD reveals that the encoding of the experimental results of single or two-dye stained samples hybridized against cDNA microarrays would be fairly straightforward, whereas encoding the 20 probe values that constitute a single probe set on an oligonucleotide microarray might be challenging and require some contortions in the way the data is squeezed into those existing attribute lists.
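As a rough illustration of what working against the pattern DTD might look like, the sketch below parses a small hand-written fragment with Python's standard xml.etree.ElementTree module. The element and attribute names (pattern, reporter, feature, position, gene, systematic_name, number) follow the pattern DTD of figure 5.9; the fragment itself, its values, and the function name are invented for this example and are not taken from a real GEML file.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; all values are made up.
PATTERN_XML = """
<pattern name="demo_pattern">
  <reporter name="r1" systematic_name="geneA">
    <feature number="1"><position x="0.1" y="0.2" units="mm"/></feature>
    <feature number="2"><position x="0.1" y="0.4" units="mm"/></feature>
    <gene primary_name="geneA" systematic_name="geneA"/>
  </reporter>
  <reporter name="r2" systematic_name="geneB">
    <feature number="3"><position x="0.3" y="0.2" units="mm"/></feature>
  </reporter>
</pattern>
"""


def features_by_reporter(xml_text):
    """Map each reporter's systematic_name to its list of feature numbers."""
    root = ET.fromstring(xml_text)
    result = {}
    for reporter in root.iter("reporter"):
        name = reporter.get("systematic_name")
        result[name] = [f.get("number") for f in reporter.findall("feature")]
    return result


layout = features_by_reporter(PATTERN_XML)
```

Because a reporter may contain several features, one conceivable way to represent an oligonucleotide probe set is as a single reporter with many features. That mapping is our assumption rather than something the GEML documentation prescribes, and it still leaves open where the perfect-match and mismatch values of each probe pair would go.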

<!ELEMENT project (pattern*, printing*, other*)>
<!-- Project = group of patterns and/or printings -->
<!ATTLIST project
    name     CDATA #IMPLIED
    id       CDATA #IMPLIED
    date     CDATA #IMPLIED
    by       CDATA #IMPLIED
    company  CDATA #IMPLIED>

<!ELEMENT pattern (reporter+, other*)>
<!-- Pattern = collection of one or more features -->
<!ATTLIST pattern
    name              CDATA #IMPLIED
    type_id           CDATA #IMPLIED
    species_database  CDATA #IMPLIED
    description       CDATA #IMPLIED
    access            CDATA #IMPLIED
    owner             CDATA #IMPLIED>

<!ELEMENT reporter (feature+, gene?, other*)>
<!-- Reporter = measures expression of a gene -->
<!ATTLIST reporter
    name              CDATA #REQUIRED
    systematic_name   CDATA #REQUIRED
    accession         CDATA #IMPLIED
    deletion          CDATA "false"
    control_type      CDATA "false"
    fail_type         CDATA #IMPLIED
    active_sequence   CDATA #IMPLIED
    linker_sequence   CDATA #IMPLIED
    primer1_sequence  CDATA #IMPLIED
    primer2_sequence  CDATA #IMPLIED
    start_coord       CDATA #IMPLIED
    mismatch_count    CDATA #IMPLIED>

<!ELEMENT feature (position, pen?, other*)>
<!-- Feature = location of a reporter for a gene -->
<!ATTLIST feature
    number             CDATA #IMPLIED
    ctrl_for_feat_num  CDATA #IMPLIED>

<!ELEMENT gene (accession*, alias*, other*)>
<!-- Gene = what the reporter is reporting on -->
<!ATTLIST gene
    primary_name     CDATA #REQUIRED
    systematic_name  CDATA #REQUIRED
    species          CDATA #IMPLIED
    chromosome       CDATA #IMPLIED
    map_position     CDATA #IMPLIED
    description      CDATA #IMPLIED>

<!ELEMENT accession (other*)>
<!ATTLIST accession
    database  CDATA #REQUIRED
    id        CDATA #REQUIRED>

<!ATTLIST alias
    name  CDATA #REQUIRED>

<!ELEMENT position (other*)>
<!ATTLIST position
    x      CDATA #REQUIRED
    y      CDATA #REQUIRED
    units  CDATA #REQUIRED>

<!ELEMENT pen (other*)>
<!ATTLIST pen
    x      CDATA #REQUIRED
    y      CDATA #REQUIRED
    units  CDATA #REQUIRED>

<!ELEMENT printing (chip+, other*)>
<!ATTLIST printing
    date             CDATA #IMPLIED
    printer          CDATA #IMPLIED
    type             CDATA #IMPLIED
    pattern_name     CDATA #IMPLIED
    run_description  CDATA #IMPLIED>

<!ELEMENT chip (other*)>
<!ATTLIST chip ...>

<!ELEMENT other ...>
<!ATTLIST other
    name   CDATA #REQUIRED
    value  CDATA #REQUIRED>

Figure 5.9: The GEML pattern document type description.

Microarray Gene Expression Database Group (MGED) MGED was developed in recognition of the need for standardization of microarray data models. It is an open discussion group that originally held its meetings in the United Kingdom, and subsequently has held two international meetings.[10] The most notable output of the MGED group, which is constituted of multiple private and public institutions, is the Microarray Mark-up Language (MAML). The MAML DTD is shown in figure 5.10. The MAML DTD is sufficiently comprehensive that even at the time of this writing, it appears to cover all the currently available microarray technologies without requiring each site implementing the MAML database to put their own data through too many contortions in order to adhere to the data model. In particular, the abstract way in which a composite element has been defined in this DTD allows for the representation of a wide range of physicochemical means for assessing gene expression.

<!ELEMENT creation_info (contact, software*, hardware*)>
<!ATTLIST creation_info
    date  DATE #REQUIRED>

<!ELEMENT contact (parameter*)>
<!ATTLIST contact
    last_name        CDATA #REQUIRED
    first_name       CDATA #IMPLIED
    middle_name      CDATA #IMPLIED
    lab              CDATA #IMPLIED
    department       CDATA #IMPLIED
    institution      CDATA #REQUIRED
    street           CDATA #IMPLIED
    city             CDATA #IMPLIED
    province_state   CDATA #IMPLIED
    country          CDATA #REQUIRED
    postal_zip_code  CDATA #IMPLIED
    phone            CDATA #REQUIRED
    fax              CDATA #IMPLIED
    email            CDATA #REQUIRED
    href             CDATA #IMPLIED>

<!ELEMENT array_platform (array_def | reference)>
<!ATTLIST array_platform
    id  ID #REQUIRED>

<!ELEMENT array_def (creation, description?, comment*, reference*, treatment*, parameter*, seq_feature*)>
<!ATTLIST array_def
    name                 CDATA #REQUIRED
    surface_type         (non-absorptive | absorptive | other) #REQUIRED
    other_surface_type   CDATA #IMPLIED
    reporter_type        (single-multimer | multiple-oligomer | other) #REQUIRED
    other_reporter_type  CDATA #IMPLIED
    model                CDATA #IMPLIED
    version              CDATA #IMPLIED
    href                 CDATA #IMPLIED>

<!ELEMENT hardware (contact?, description?, parameter*)>
<!ATTLIST hardware
    make   CDATA #REQUIRED
    model  CDATA #REQUIRED
    year   CDATA #IMPLIED
    href   CDATA #IMPLIED>

<!ELEMENT software (contact?, hardware*, description?, parameter*)>
<!ATTLIST software
    id                IDREF #REQUIRED
    name              CDATA #REQUIRED
    version           CDATA #IMPLIED
    year              CDATA #IMPLIED
    operating_system  CDATA #REQUIRED
    href              CDATA #IMPLIED>

<!ELEMENT seq_feature (bio_seq?, ref_bio_seq?, ref_clone?, gene?, reference?, treatment*, parameter*, coordinate?, description?, comment*)>
<!ATTLIST seq_feature
    id    ID    #REQUIRED
    name  CDATA #IMPLIED>

<!ELEMENT bio_seq (reference | CDATA)>
<!ELEMENT ref_bio_seq (reference | CDATA)>
<!ELEMENT gene (reference | CDATA)>
<!ELEMENT ref_clone (reference | CDATA)>
<!ELEMENT coordinate EMPTY>
<!ATTLIST coordinate
    horizontal  CDATA #IMPLIED
    vertical    CDATA #IMPLIED>

Figure 5.10: The MAML pattern document type description. Shown are three fragments of the MAML DTD. The first column demonstrates the generic description of a data set creator. The second column demonstrates the capability of MAML to support composite expression measurements and analyses of the sort found on the Affymetrix platform.

Attempting a consensus model Obviously, if there is such a wide variety of standard microarray models, there is no standard. Without a single standard, combining expression data across microarray experiments is arduous, if not impossible. Fortunately, the Object Management Group (OMG) appears to be creating a common ground where proponents of these various data models will be able to begin to reconcile their various proposals and create a consensus data model. Although this consensus data model has not yet been finalized, the OMG has already incorporated some of the proposals that we listed previously. As the OMG has successfully created agreed-upon standardized data models in many different application domains outside biology, there is some reason to hope that they will be similarly successful in this domain. For the latest information on the OMG efforts regarding a standardized microarray data model, we will refer the reader to the following website: http://www.geml.org/omg.htm.

In addition to the OMG efforts, there is the Interoperable Informatics Infrastructure Consortium (I3C), a group that is tasked to produce technical recommendations for the interchange of data with the goal of creating practicable solutions.[11] The membership of I3C comprises 60 companies and organizations committed to addressing the balkanization of data modeling efforts illustrated by table 5.2.

Table 5.2: List of freely available data models and databases for microarrays.

Data model or database            URL
Affymetrix Analysis Data Model    http://www.affymetrix.com/support/aadm/aadm.html
  (previously Genetic Analysis
  Technology Consortium)
Another Microarray Database       http://www.microarrays.org/software.html
ArrayDB                           http://www.genome.nhgri.nih.gov/arraydb
ArrayExpress                      http://www.ebi.ac.uk/arrayexpress/Design/design.html
ExpressDB                         http://www.twod.med.harvard.edu/ExpressDB
GeneX                             http://www.ncgr.org/research/genex
MicroArray Database               http://www.pompous.swmed.edu/exptbio/microarrays/mad
Stanford Microarray Database      http://www.genome-www4.stanford.edu/MicroArray/SMD

Again, it is interesting to note how much more difficult it has been to arrive at a consensus model for microarray expression measurements than it took to create a standard data model for genetic sequences in GenBank. The reason, as stated before, is that gene expression measurement is an analog measurement rather than the digital measurement of DNA sequence and therefore the measurements can only be understood with respect to a particular measurement system, a human artifact. The goal of the standardized microarray data model can only be realized once a common set of abstractions can be reliably found across all these different measurement technologies and artifacts.

[7]That is, this compares with the case of DNA sequence, where the way to represent the ordering of a particular sequence of nucleotides is, at least in theory, independent of the measurement system.

[8]http://www.ncbi.nlm.nih.gov/geo/info/fields.pgi.

[9]In the latest iteration of the XML specifications, the XML schema has superseded the DTD, although there is sufficient overlap to allow DTDs to be easily converted into XML schemas.

[10]http://www.mged.org/.
