Chapter 1: Introduction
Figure 1.1: Time-series data from anaerobic metabolism. The time courses of measured concentration of the small molecule inputs, adenosine monophosphate (AMP, a source of chemical energy to catalyze reactions) and citrate (a substrate), in the experiments, with the responses of the concentrations of phosphate (P, an inorganic ion) and of the substrates fructose-1,6-bisphosphate (F16BP), dihydroxyacetone phosphate (DHAP), fructose-6-phosphate (F6P), glucose-6-phosphate (G6P), and fructose-2,6-bisphosphate (F26BP). (Derived from Arkin et al.)
Figure 1.2: Glycolytic pathway reconstructed ab initio from time-series data. A, The two-dimensional projection of the correlation metric construction (CMC), defined by Arkin et al., for the time series shown in figure 1.1. Each point represents the time series of a given species. The closer two points are, the higher the correlation between the respective time series. Black (gray) lines indicate negative (positive) correlation between the respective species. Arrows indicate temporal ordering among species based on the lagged correlations between their time series. B, The predicted reaction pathway derived from the CMC diagram. Its correspondence to the known mechanism of glycolysis is high. (Derived from Arkin et al.)
Figure 1.3: Relative growth of MEDLINE and GenBank. The industrialization of genomic data acquisition has created a growing gap between knowledge—of which MEDLINE publications are a proxy—and the information we have gathered about the genome. Information science applied to this information (i.e., bioinformatics) is one of the pillars of an international strategy to overcome this knowledge gap. Cumulative growth of molecular biology and genetics literature (light gray) is compared here with DNA sequences (dark gray). Articles in the G5 (molecular biology and genetics) subset of MEDLINE are plotted alongside DNA sequence records in GenBank over the same time period. (Derived from Ermolaeva et al. .)
Figure 1.4: A major difference between classic clinical studies and microarray analyses. The high dimensionality of genomic data in contrast to the relatively small number of samples typically obtained results in a highly underdetermined system.
Figure 1.5: An archetypal functional genomics pipeline. Shown is a simplified view of a functional genomics pipeline solely involved in expression microarray experiments. Note the interdigitation of "wet" and "dry" components requiring close multidisciplinary collaboration and some creative consideration of the value of the individual contributions in this pipeline for a particular experiment and publication.
Figure 1.6: The functional genomics investigation as a funnel for traditional biological investigations. Broad questions and comprehensive data are the mix in which bioinformatics techniques are filtered to separate high-yield hypotheses or candidate genes from spurious findings and poor-quality hypotheses.
Figure 1.7: Flow of genetic information, from DNA to RNA to protein. This simplified diagram shows how the production of specific proteins is governed by the DNA sequence through the production of RNA. Many stimuli can activate the specific transcription of genes, and proteins can play a wide variety of roles within or outside cells. Note that even in this simplified model, it is obvious that since we are currently able to (nearly) comprehensively measure only gene expression levels, we are missing comprehensive measurements of protein modification, activity, transcriptional stimuli, and many other components of the state of a cell.
Figure 1.8: Genetic machinery of the circadian rhythm. Current molecular model of rhythm generation in Drosophila, from . The succession of events (A-F) occurs over the course of approximately 24 hours. A, CLOCK-BMAL heterodimers bind the per and tim promoters and activate mRNA expression from each locus; CLOCK-BMAL may also activate transcription of other circadian-regulated genes (not shown). B, per and tim mRNA are transported to the cytoplasm and translated into PER and TIM protein, respectively. C, Regulation of protein levels occurs by two mechanisms: DBT protein phosphorylates and destabilizes PER, and light destroys TIM. Light during the early subjective night can phase-delay the clock. Small "blobs" indicate degraded proteins. D, PER and TIM levels slowly accumulate during the early subjective night; TIM stabilizes PER and promotes nuclear transport. E, PER and TIM dimers enter the nucleus and inhibit CLOCK-BMAL-activated transcription. F, Protein turnover (combined with the lack of new PER and TIM synthesis) leads to derepression of per and tim mRNA expression; the cycle begins again (A). Light during the late subjective night can phase-advance the clock.
Chapter 2: Experimental Design
Figure 2.1: Three experiments within the multi-dimensional representation of experiment design space. Experiment design space defines all the possible stimuli or conditions to which a particular biological system could be subjected. Shown here is an experiment design space that is concerned with insulin signaling in different tissues, in different mouse "knockout" models, with different levels of insulin.
Figure 2.2: Expression space in which gene expression is loosely versus tightly coupled. If genes are tightly coupled in the expression space, they will tend to occupy a small subspace of expression space over any set of experiments. If, however, these genes are loosely coupled or even causally unrelated, then their expression levels will have little relation to one another and therefore tend to be scattered over a greater volume of the expression space.
Figure 2.3: Apparently random relationship in expression space between two genes.
Figure 2.4: Apparently linear relationship in expression space between two genes.
Figure 2.5: Apparently curvilinear relationship in expression space between two genes. This curvilinear relationship would be obscured if the expression space had been insufficiently exercised.
Figure 2.6: Noninvasive monitoring of mechanistic behavior. By observing the concerted action of the watch hands, several competing weak hypotheses about the underlying mechanism can be generated. The more observations made at different times, the smaller the number of possible hypotheses generated.
Figure 2.7: Decomposition of the watch. An invasive exploration of mechanism will reveal most of the components of that mechanism. In the process of the invasive investigation some of the components will be damaged. Also, the relationship of these components inside a working watch must be inferred and is not directly observed because of the invasive nature of this investigation.
Figure 2.8: An example of why discovery of pathways solely using gene expression measurements is difficult. At least three "pathways" are involved in the conversion of lactic acid to pyruvate. The highest enzyme level pathway involves the action of an enzyme called lactic acid dehydrogenase, designated EC 1.1.1.27. However, when one traverses to a lower-level pathway, one learns that the role of this enzyme is performed by five separate protein products, which appear in a binomial distribution. When one traverses lower, through the assembly pathway, one learns this distribution is present because of the various possible combinations of LDHA and LDHB, two individual protein subunits, of which four must be put together to make an active enzyme. Then, when one traverses even lower, through the genetic regulatory pathway, one learns that each of these subunits is regulated differently, and appears on different chromosomes. Not shown are the pathways involved in transcription, translation, movement of proteins to the appropriate locations, and movement of substrates into and out of the cells and organelles, and many other pathways.
Figure 2.10: Typical uses of fixed fold thresholds. Reproducibility scatter plots. In each of the experiments, samples were hybridized to identical oligonucleotide arrays containing probes for 6800 human genes. The message abundance in arbitrary units is plotted. Left panel: A single biotinylated RNA target was divided in two and each half hybridized to two arrays. Fifty-nine genes (0.8%) judged to be expressed "present" by the Affymetrix GeneChip software differed by more than twofold, and 0.3% differed by more than threefold. Middle panel: A single biotinylated target was hybridized to one array, the sample removed and then rehybridized to a second array. Of "present" genes 2.6% and 0.4% differed by more than twofold and threefold, respectively. Right panel: A single total RNA sample was converted to biotinylated cRNA in two independent labeling reactions, and the cRNA then hybridized to two arrays. Of the "present" genes 2.2% and 0.4% differed by twofold and threefold, respectively.
Figure 2.9: A decision analytic procedure for picking a threshold for selecting genes from a functional genomics experiment. Because of the large number of probes on current microarrays, it is all too easy to underestimate the cost and practical intractability caused by even a moderate false positive rate. Investigators are advised to perform this simple decision analysis in order to determine what false positive and false negative rate they can afford before they proceed with any experiments.
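As a rough illustration of the arithmetic behind such a decision analysis, the sketch below estimates how many false positives and false negatives to expect at a given threshold. The function name, the probe count, and the rates are hypothetical values chosen for illustration; they do not come from the figure.

```python
def expected_errors(n_probes, n_truly_changed, false_positive_rate, false_negative_rate):
    """Return (expected false positives, expected false negatives).

    All arguments are assumptions supplied by the investigator:
    the false positive rate applies to probes whose genes did not change,
    the false negative rate to probes whose genes did change.
    """
    n_unchanged = n_probes - n_truly_changed
    false_positives = n_unchanged * false_positive_rate
    false_negatives = n_truly_changed * false_negative_rate
    return false_positives, false_negatives


# Hypothetical array: 10,000 probes, 100 genes that truly changed.
fp, fn = expected_errors(10000, 100, 0.01, 0.2)
print(f"expected false positives: {fp:.0f}, expected false negatives: {fn:.0f}")
```

Even at a 1% false positive rate, the spurious hits in this hypothetical case roughly equal the number of genes that truly changed, which is the cost the caption warns about.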
Figure 2.11: Clustering more than only expression values. A full-fledged functional genomic study with clinical relevance will involve multiple and quite heterogeneous data types. The realities of clinical research and clinical care will ensure that there is a requirement for the handling of missing data.
Chapter 3: Microarray Measurements to Analyses
Figure 3.1: The nongenomic time scale response in aldosterone exposure. The response (in seconds) shown here is much faster than any known receptor-to-transcription response that steroid hormones are usually thought to act through. (Derived from Gamarra et al.)
Figure 3.2: Robotically spotted microarray hybridized to two samples, each stained with two colored dyes. An overview of procedures for preparing and analyzing cDNA microarrays and tumor tissue. Reference RNA and tumor RNA are labeled by reverse transcription with different fluorescent dyes (green for the reference cells and red for the tumor cells) and hybridized to a cDNA microarray containing robotically printed cDNA clones. The slides are scanned with a confocal laser scanning microscope, and color images are generated for each hybridization with RNA from the tumor and reference cells. Genes upregulated in the tumors appear red, whereas those with decreased expression appear green. Genes with similar levels of expression in the two samples appear yellow. Genes of interest are selected on the basis of the differences in the level of expression by known tumor classes (e.g., BRCA1 mutation-positive and BRCA2 mutation-positive). Bioinformatics analysis determines whether these differences in gene expression profiles are greater than would be expected by chance. (Derived from Hedenfalk et al.)
Figure 3.3: The photolithographic construction of microarrays. Synthesized high-density oligonucleotide microarray manufacturing with photolithography. Using selective masks, photolabile protecting groups are light-activated for DNA synthesis (1, 2); photoprotected DNA bases are added and coupled to the intended coordinates (3). This cycle is repeated (4) with the appropriate masks to allow for controlled parallel synthesis of oligonucleotide chains in all coordinates on the array (5). (Derived from Lipshutz et al.)
Figure 3.4: Pixel, probe cell, and Affymetrix scanned image. Schematic showing how images are composed of probe cells, which contain probes that appear as pixels. (Derived from Jain.)
Figure 3.5: Background noise on Affymetrix arrays. A, A diagram showing how background noise variance is obtained from the background cells. B, The calculation of the background noise. C, The definition of SDT and SRT and how they define "positive" and "negative" probe pairs. (Derived from .)
Figure 3.6: Nonspecific hybridization on an oligonucleotide microarray. The two rows of probe cells represent the probe set for a gene transcript. The PM probes are on the top row and the MM probes are on the bottom row. Even if there is a lot of specific hybridization to the PM oligonucleotides due to the presence of the targeted gene transcript, if there is significant nonspecific hybridization, then the amount of transcript cannot be estimated accurately. In this illustration, the number and intensity of dark probe cells on the bottom row is higher than that in the top row and therefore most software packages would report that the reported intensity for the gene is unreliable—an "Absent" Absolute Call in Affymetrix parlance.
Figure 3.7: The Affymetrix Absolute Call. The relationship between Absolute Call and expression level for a microarray. Far more Absent (A) calls than Present (P) calls are at lower expression levels, but significant numbers of Absent calls are found even at expression levels in the low thousands.
Figure 3.8: Graphical display of how Affymetrix Absolute Calls predict reproducibility. Correlation of expression levels from the same hybridization "cocktail" measured on two microarrays. Top: the correlation using only those genes for which an Absent call was reported by both microarrays. Bottom: only the Present called genes.
Figure 3.9: Expression measurements made in duplicate from the same RNA samples do not correlate well all the time. RNA samples from four human samples were placed on duplicate oligochips and the expression of 35,714 ESTs was measured. Each point represents an EST. The duplicate expression measurements are plotted here on a log-log scale (base 10). r = .69, .73, .73, .69. (From Butte et al.)
Figure 3.10: When expression measurements do not correlate well, fold differences correlate even more poorly. Fold differences of 35,714 ESTs were calculated between the six possible pairings of the four patients. Fold differences are expressed in base 10 logarithm, so that ESTs that did not change between models are plotted in the center of each graph. Fold differences from the duplicated measures are shown on the x- and y-axes. Even though the correlation coefficients were high between original and repeated expression values, the correlation coefficients were very low between original and repeated calculated fold differences. (From Butte et al.)
Figure 3.11: Overall correlation coefficients for each matching probe and probe set across two microarray technologies. Great variance is seen in reproducibility of measurements across probes and probe sets, with some probes showing correlation coefficients near 1.0, and some even showing a negative correlation (i.e., a gene which is reported as highly expressed in one technology has an opposite expression report in the other technology).
Figure 3.12: Pooling total RNA extracts for replicate experiments. Two strategies for making replicated measurements from samples, pooled and not pooled. N may not necessarily equal n.
Figure 3.13: Apparent differences in source of variance due to log scaling. On the left is the apparently increased variation at lower expression levels using a plot of the logarithm of all the gene expression levels measured in one microarray versus the logarithm of all of the genes measured on another microarray. When the expression values are plotted on a linear scale as on the right, then it appears that the higher expression levels have higher variance. The former scale emphasizes the decreased reproducibility at lower expression levels and the latter emphasizes that there are fewer measurements or genes at higher expression levels. Note that for well-definedness, the logarithmic scale plot excludes measurements that had a negative value.
Figure 3.14: Incyte data file snippet. Headers and two rows of values from an Incyte data file. Incyte microarrays are constructed using a robotic spotting process. The spotting process involves two proximal spots for the two dyes (Cy3 and Cy5), one of which is the targeted clone and the other a control spot.
Figure 3.15: Duplicate cDNA assays of a common extract. The same RNA extract or hybridization "cocktail" measured on two Incyte spotted arrays. The P2S signals (the second of two dyes) from each microarray are plotted for each gene.
Figure 3.16: Ratio of probe intensities as a function of expression level. The ratio of the P2S/P2S' signals (y-axis) from two robotically spotted microarrays plotted against the P2S signal. The same RNA extract was used in both microarrays. Although the ratio is close to 1.0 for many genes, it does deviate sporadically and widely in a few instances. Also, as the expression level decreases, the distribution of the ratio spreads increasingly away from 1.0.
Figure 3.17: Ratio of expression level as a function of position. The ratio of the two different dye intensities per gene for a single microarray plotted against the position in the array (defined as a single number that is computed from the x and y coordinates of the probe on the array). The "control islands" that contain "spiked" controls on the array were not included in this plot. Also, all expression values were corrected for background noise. No particular pattern is obvious on inspection.
Figure 3.18: Fourier analysis of spatial series on spotted microarray. The same data illustrated in figure 3.17 is subjected to a Fourier analysis to extract systematic periodic signals. As shown, the frequency with the largest power is 4. That is, there appears to be a periodicity of 4 per microarray based on the position of the probe on the microarray. The magnitude of this systematic variation dominates fold changes in gene expression of the magnitude reported elsewhere as noteworthy.
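A minimal sketch of how such a dominant spatial frequency could be extracted, assuming the dye ratios have already been ordered by array position. The data here are synthetic, the function names are hypothetical, and a naive discrete Fourier transform stands in for whatever implementation produced the figure.

```python
import cmath
import math


def power_spectrum(values):
    """Power at frequencies 0..n//2 of a mean-centered series (naive DFT)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        powers.append(abs(coeff) ** 2)
    return powers


def dominant_frequency(values):
    """Frequency (cycles per series) with the largest power, ignoring k = 0."""
    powers = power_spectrum(values)
    return max(range(1, len(powers)), key=lambda k: powers[k])


# Synthetic log-ratios: a strong component with 4 cycles across the array
# plus a weaker component with 11 cycles (both invented for illustration).
n = 64
ratios = [
    0.3 * math.sin(2 * math.pi * 4 * t / n)
    + 0.05 * math.cos(2 * math.pi * 11 * t / n)
    for t in range(n)
]
print(dominant_frequency(ratios))
```

On real data the position index would come from the x and y coordinates of each probe, as described for figure 3.17.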
Figure 3.19: Incyte GEM Microarray circa 1999. In this photograph of a GEM microarray, note that there are four quadrants which are each spotted with separate pins.
Figure 3.20: Problems in images acquired from microarrays. On the left: a contaminated D array from the Murine 6500 Affymetrix GeneChip set. Several particles are highlighted by arrows and are thought to be torn pieces of the chip cartridge septum, potentially resulting from repeatedly pipetting the target into the array. On the right top: local changes in intensity due to contaminants and scratches. (Derived from http://www.mediacy.com/arraypro.htm.) Right bottom: high magnification of a scanned image of a spotted microarray. Note the different sizes and shapes of the spots.
Figure 3.21: Estimate of the incidence of alternative splicing in genes. While most genes have only one spliced product, many have more than one, and this can affect the gene product detection efficacy and rate by microarrays. (Derived from Mironov et al.)
Figure 3.22: Identity mask for experiment A. Method 2 with parameters 9000 for expression value sliding window size and scaling factor 0.975 resulted in the lowest percentage of original data points lying outside of the mask region (0.7%).
Figure 3.23: Identity mask for experiment E. Method 1 with parameters 5000 for intensity window size and 3 SD resulted in the lowest percentage of original data points lying outside of the mask region (0.9%).
Figure 3.24: Entropy plot
Figure 3.25: Graphical example of mutual information calculation. First, a scatterplot of the expression measurements of the two genes is created, and a grid is imposed. In this example, each expression measurement is quantized into four bins (one can think of these as "low," "low-medium," "high-medium," and "high," though any number and positioning of bins can be considered). The entropy for each gene is then calculated using the row and column sums, and the joint entropy is calculated from the grid.
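The calculation in figure 3.25 can be sketched as follows, assuming equal-width bins and entropies measured in bits. The function names and the example expression values are hypothetical; mutual information is computed as the sum of the two marginal entropies minus the joint entropy.

```python
import math
from collections import Counter


def quantize(values, n_bins=4):
    """Assign each value to one of n_bins equal-width bins (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for constant genes
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy in bits of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y, n_bins=4):
    """H(X) + H(Y) - H(X, Y) after quantizing both genes onto the same grid size."""
    bx, by = quantize(x, n_bins), quantize(y, n_bins)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))


# Invented expression measurements for three genes across eight samples.
gene_a = [1, 2, 3, 4, 5, 6, 7, 8]
gene_b = [2, 4, 6, 8, 10, 12, 14, 16]   # rises with gene_a
gene_c = [5, 5, 5, 5, 5, 5, 5, 5]       # carries no information
print(mutual_information(gene_a, gene_b))
print(mutual_information(gene_a, gene_c))
```

The marginal entropies correspond to the row and column sums of the grid, and the joint entropy to the individual grid cells.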
Figure 3.26: Dynamic relationships between genes. (a) The expressed product of Gene A binds an enhancer region that increases transcription of Gene B. (b) Gene B's initial expression level before being affected by Gene A can vary throughout the experiment. As a result, measuring the correlation between the absolute levels of Genes A and B will not reveal the underlying enhancement relationship between the two. Instead, this can only be done by analyzing the expression dynamics—the change in expression level of Gene B in relation to the expression level of Gene A. (Derived from Reis et al. .)
Figure 3.27: Negative dynamic correlation. The distribution of slopes of MAD3 and EXM2 plotted one against another.
Chapter 4: Genomic Data-Mining Techniques
Figure 4.1: Example of how a single point can distort overall correlation. On the left: the scatterplot shows a negative correlation. The scatterplot on the right is identical, save for the additional point added. If the values of even a single point are high enough, the correlation coefficient can be altered (though the variance of the correlation coefficient, if calculated, would be higher). This is primarily because using the correlation coefficient assumes values distributed normally.
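The effect described in figure 4.1 is easy to reproduce. The sketch below uses invented values (not the data behind the figure) and a plain Pearson correlation coefficient to show a perfectly negative correlation turning strongly positive once a single extreme point is added.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Five invented measurements with a perfect negative relationship.
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
r_before = pearson(x, y)

# The same data plus one extreme point that is high in both genes.
r_after = pearson(x + [50], y + [50])

print(f"without outlier: {r_before:.3f}, with outlier: {r_after:.3f}")
```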
Figure 4.2: Detecting spikes by calculating entropy. The top graph shows a hypothetical gene with "spiking" behavior; i.e., the gene expression is markedly higher in two samples, compared to the other samples. The bottom graph shows a second hypothetical gene with gene expression measurements that are more distributed across the dynamic range. Distribution of the gene expression measurements is not just a characteristic of the gene; it is also connected to the samples in which expression was measured.
Figure 4.3: Genes can be represented as points in a multidimensional space. Each sphere represents a gene measured in three tissue samples. Each of the expression levels is considered as a measure along a three-dimensional coordinate. One can imagine that genes with similar expression levels in all three samples may be grouped as clusters in the three-dimensional space. These clusters can be found using an unsupervised technique, such as self-organizing maps.
Figure 4.4: Principle of self-organized maps. Centroids start in an arbitrary topology; then, as the method progresses, each moves toward a randomly chosen gene during each iteration. After proceeding for enough time, each centroid will be in the middle of a cluster. (From Tamayo et al. .)
Figure 4.5: Graphical convention for self-organizing maps. (Derived from Tamayo et al. .)
Figure 4.6: Genes can have a difference in interaction, but not in expression level. Scatterplot of gene A and gene B, measured in samples from disease 1 (open circles) and disease 2 (closed circles). Note that expression measurements from neither gene A nor gene B can be used to separate disease 1 from disease 2. However, the linear regression model of gene expression levels from disease 2 is different from disease 1.
Figure 4.7: Four possible ways to order comprehensive pairwise comparisons. We define the term measure-triangle as the comprehensive pairwise comparison between each of a set of features. The comparisons can be performed and stored in any of four different ways.
Figure 4.8: Graphical convention for dendrograms.
Figure 4.9: Two-dimensional dendrogram. Dendrograms are constructed for both the genes and samples individually, and are combined in the display. (Derived from Ross et al.)
Figure 4.10: Sample relevance network. One of 78 networks formed by aggregating mouse samples from six different experiments. Threshold was set to .99, meaning that the "thinnest" line seen represented a value of .99. The legend for this type of diagram is shown in figure 4.11. The hypothesis generated here is that RBBP4 expression correlates with GNS and MAX gene expression. The miniature histograms for each gene show no "spiking" in the expression measurements (see section 4.5.2). Each of these genes appears at a different chromosomal location. By including information from LocusLink (see section 5.5.1) and Gene Ontology (see section 5.1.1), we see that RBBP4 is known to control cell proliferation, and MAX is involved in oncogenesis.
Figure 4.11: Graphical convention for relevance networks. To optimize the hypothesis generation process, relevance network display software contains ties from microarray accession number to GenBank, UniGene, LocusLink, Gene Ontology, and other databases (see section 5.5 for details of this process).
Figure 4.12: Relevance network laid out as a circle. Relevance networks with too many genes may be rendered in circles that minimize overlapping nodes and edge crossings. However, these may be impossible to visualize.
Figure 4.13: Subcategories of B-cell lymphoma determined by microarrays correspond clinically to duration of survival. On the left is a dendrogram constructed across the samples of B-cell lymphoma, using an unsupervised technique. The top branch essentially defines an even split between the categories GC B-like DLBCL and Activated B-like DLBCL, but this distinction was never before made clinically. On the right are Kaplan-Meier survival curves of the patients from whom the samples were obtained. Patients whose cancer matched the Activated B-like DLBCL gene expression profile had a significantly worse prognosis. (From Alizadeh et al.)
Figure 4.14: Feature reduction with principal components analysis. Two genes are measured in eight samples, four from one disease (marked by X) and four from another disease (marked by O). The expression measurements of gene A are represented on the x-axis and the expression measurements of gene B are represented on the y-axis. Note that the first principal component (i.e., the vector that captures the most variance) is parallel to the x-axis, which corresponds to gene A. However, the line that best splits the two diseases is described by its orthogonal vector along the y-axis, which corresponds to gene B. This means that although the variance seen in gene A best explains the overall variance seen in gene expression measurements in each disease and both together, the variance seen in gene B is best used to split the two diseases. Although this is easy to distinguish in this simplified two-dimensional case, it is not so easily visualized in a real-world, multidimensional data set.
Figure 4.15: Distribution of in the original versus permuted gene expression data set. The distribution of calculated using an original gene expression data set is shown with solid circles. For each gene, expression measurements were independently randomly shuffled 100 times. The average distribution of is shown with error bars covering 2 SD. In this example, random permutation was unable to create an association with or < .85.
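The permutation procedure in figure 4.15 can be sketched as below. This is an illustration under stated assumptions: a squared Pearson correlation stands in for the association measure, which the caption does not name, and the gene values, function names, and random seed are invented for the example.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation (assumed association measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def permuted_scores(x, y, n_permutations=100, seed=0):
    """Association scores after independently shuffling y, n_permutations times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = list(y)
        rng.shuffle(shuffled)
        scores.append(r_squared(x, shuffled))
    return scores


# Two invented genes measured in 20 samples, strongly but not perfectly related.
gene_a = [float(i) for i in range(1, 21)]
gene_b = [2.0 * v + (1.0 if i % 2 else -1.0) for i, v in enumerate(gene_a)]

observed = r_squared(gene_a, gene_b)
null_scores = permuted_scores(gene_a, gene_b)
exceed = sum(score >= observed for score in null_scores)
print(f"observed: {observed:.3f}, permutations at or above it: {exceed} of {len(null_scores)}")
```

An observed score that no permutation reaches suggests the association is unlikely to be an artifact of the value distributions alone.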
Figure 4.16: Identifying differences between the test and training sets through clustering. The matrix incision tree algorithm from  was applied to the leukemia classification problem published by Golub et al. The algorithm correctly clustered 68 of the overall 72 cell lines (94%) in the data set, placing the acute myelogenous leukemia (AML) samples in branches (c) and (d), the acute lymphocytic leukemia (ALL) samples in branches (a) and (b), and misclassifying four ALL samples into branches (c) and (d). However, the matrix incision tree also successfully revealed the distinction between the published training set (cases 1-38, italic, placed in branches (b) and (c)) and the test set (cases 39-72, placed in branches (a) and (d)) with 100% accuracy. It is unlikely that the biology of the test set of the AML and ALL cell lines was that different from the cell lines in the training set. It is much more likely that systematic changes in the hybridization conditions, or different sample preparation, or the use of a different batch of microarrays were responsible for the clustering of test versus training sets.
Figure 4.17: Contingency table illustrating performance metrics for classification algorithms.
Figure 4.18: Sensitivity and specificity trade-offs are defined by the ROC curve. ROC curves A, B, and C describe the trade-offs in sensitivity and specificity for the same data set using three different classification algorithms. A is clearly the best algorithm because its area under the curve is the highest. That is, it provides a better set of sensitivities and specificities for all values of these. In contrast, B and C provide trade-offs inferior to A, and for some range of sensitivities, B provides better performance than C, and vice versa.
Figure 4.19: Example of a Bayesian network. This network describes the impact of initial_Condition and External_Action on Effect. This network is equivalent to the underlying joint probability distribution over the three variables, shown in table 4.1.
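The metrics behind figures 4.17 and 4.18 can be sketched as follows. The counts, scores, and labels are invented, the function names are hypothetical, and the ROC construction assumes no two samples share the same score.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity and specificity from the four cells of a contingency table."""
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, labels):
    """(false positive rate, sensitivity) pairs as the threshold is lowered.

    Assumes labels are 1 (positive) or 0 (negative) and scores are untied.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented contingency table and classifier scores.
sens, spec = sensitivity_specificity(tp=40, fp=10, fn=5, tn=45)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
print(f"sensitivity: {sens:.3f}, specificity: {spec:.3f}, AUC: {area:.3f}")
```

Comparing the area obtained for several classifiers on the same data set is one way to rank them, as the caption of figure 4.18 describes.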
Chapter 5: Bio-Ontologies, Data Models, Nomenclature
Figure 5.1: The Molecular Function Ontology of GO. (From Ashburner et al.)
Figure 5.2: The Cellular Component Ontology of GO. (From Ashburner et al.)
Figure 5.3: The Biological Process Ontology of GO. (From Ashburner et al.)
Figure 5.4: The KEGG representation for apoptosis.
Figure 5.5: The MeSH terminology hierarchy. A portion of the MeSH terminology hierarchy starting at the top level and showing the first level of detail at the disease subhierarchy. The entire hierarchy may be browsed at http://www.nlm.nih.gov/mesh/MBrowser.html.
Figure 5.6: Transforming a cluster of genes into an annotated structure automatically using MEDLINE. Summary of concept hierarchy matches with the MeSH hierarchy terms for genes described by Golub et al. that fell into the acute lymphoblastic leukemia (ALL) cluster.
Figure 5.7: A subset of the definition of sample in the GEO. At the time of this writing, the sample definition in the GEO does not appear to support the full complexity of oligonucleotide high-density microarrays.
Figure 5.8: The abstraction layers of the Gene Expression Mark-up Language (GEML).
Figure 5.9: The GEML pattern document type description.
Figure 5.10: The MAML pattern document type description. Shown are three fragments of the MAML DTD. The first column demonstrates the generic description of a data set creator. The second column demonstrates the capability of MAML to support composite expression measurements and analyses of the sort found on the Affymetrix platform.
Chapter 7: The Near Future
Figure 7.1: Lawsuits in the microarray field. With the number of lawsuits already in this area, it should be no surprise that companies in this industry are striving to determine new and unique intellectual property. (From Gibbs et al.)
Figure 7.2: The Nanogen microelectronic array.
Figure 7.3: Flexibility of on-the-fly ink-jet arrays. On the right is a comparison of morphology of ink-jet versus pin arrays. Ink-jet printing produces uniform and consistent features. An artifact of on-the-fly printing is formation of oblong spots at maximum print speed (full fires) that can be made into circles upon slowing down. This underscores the theoretical flexibility of this system.
Figure 7.4: The Illumina BeadArray.
Figure 7.5: The Serial Analysis of Gene Expression. (Derived from Madden et al.)
Figure 7.6: Poor measurement reproducibility across microarray generations. Two generations of Affymetrix microarrays were used to measure the gene expression of seven samples. Each of the 8075 common probe sets thus had seven measurements from both the older HuGeneFL and newer HG-U95A microarrays. The line represents the distribution of correlation coefficients of each probe set. Though the exact correlation coefficient may not be useful, it is notable that over 25% of the probe sets showed measurements that were zero or negatively correlated across generations. (Derived from Nimgaonkar et al.)
Figure 7.7: Measurement reproducibility of probe sets across microarray generations, by the number of probe pairs in common. A total of 8075 probe sets are deemed in common by Affymetrix between the older HuGeneFL and the newer HG-U95A microarrays. However, each probe set has a varying number of common probe pairs, from none to all 16. The bar graph represents the distribution of the 8075 probe sets, and the line represents the correlation coefficients of each set of probe sets across generations. For example, one finds a correlation coefficient of .77 when plotting all 6412 probe sets with one probe pair in common across microarray generations. (Derived from Nimgaonkar et al.)