Oligonucleotide microarrays

The second popular class of microarrays in use has been most notably developed and marketed by Affymetrix. Currently, over 1.5 × 10⁵ oligonucleotides of length 25 base pairs each, called 25-mers, can be placed on an array. These oligonucleotide chips, or oligochips, are constructed using a photolithographic masking technique similar to the process that is used in microelectronics and integrated circuits fabrication, first described by Stephen Fodor et al. in 1991 [70]. Currently, these commercially available microarrays are not produced individually, but instead are made in parallel. An entire wafer (containing between 40 and 400 microarrays) is constructed, tested, then broken apart to create the individual microarrays. At this time, commercially produced oligochips exist that are disease- as well as species-specific, such as rat neurobiology and yeast genome arrays, and custom microarrays can be ordered with a 4-week turnaround or less.

The manufacturing technique for an Affymetrix oligochip is markedly different from the more mechanical process of making robotically spotted arrays. Each wafer starts out as an empty glass slide. On this substrate, 25-mer probes are built base-by-base by placing single DNA bases on the glass and then on top of a preceding base. All 25-mer probes are constructed in parallel, with high precision, by selectively masking specific coordinates or locations on the glass surface and exposing the entire ensemble to ultraviolet light in between laying on the additional bases adenine (A), thymine (T), guanine (G), and cytosine (C) separately. Each applied photolithographic mask generates different areas of photodeprotection on the solid glass substrate. The combination of these masks with an intervening chemical coupling step allows the incorporation of additional nucleotides to existing strands only where desired. This entire process is known as light-directed oligonucleotide synthesis.
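The masking logic can be illustrated with a small sketch. It is not Affymetrix's actual mask-design procedure: the order in which the four bases are offered (A, C, G, T), the coordinate scheme, and the function names are all assumptions made for illustration. At each step the "mask" is simply the set of coordinates whose next required base is the one being coupled.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"  # assumed order in which bases are offered in each round

Coord = Tuple[int, int]


def synthesis_schedule(probes: Dict[Coord, str]) -> List[Tuple[str, List[Coord]]]:
    """Return (base, exposed coordinates) steps that build every probe in parallel."""
    for seq in probes.values():
        if any(b not in BASES for b in seq):
            raise ValueError("probe sequences must contain only A, C, G, T")
    built = {xy: 0 for xy in probes}  # bases already coupled at each coordinate
    steps = []
    while any(built[xy] < len(seq) for xy, seq in probes.items()):
        for base in BASES:
            # The mask exposes only the sites whose next required base is `base`.
            exposed = sorted(
                xy for xy, seq in probes.items()
                if built[xy] < len(seq) and seq[built[xy]] == base
            )
            if exposed:
                steps.append((base, exposed))
                for xy in exposed:
                    built[xy] += 1
    return steps


def replay(steps: List[Tuple[str, List[Coord]]]) -> Dict[Coord, str]:
    """Apply the steps in order and return the strands grown at each coordinate."""
    grown: Dict[Coord, str] = {}
    for base, exposed in steps:
        for xy in exposed:
            grown[xy] = grown.get(xy, "") + base
    return grown


if __name__ == "__main__":
    toy = {(0, 0): "ACG", (0, 1): "AAT", (1, 0): "GCA"}  # toy 3-mers, not real probes
    for base, exposed in synthesis_schedule(toy):
        print(base, exposed)
```

Replaying the schedule reproduces the requested sequences, which is the sense in which the masks plus the coupling step add nucleotides "only where desired."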

One advantage of the oligonucleotide microarray is that its higher density of probe pairs allows for more genes to be screened or assayed on a single chip, as compared to robotically spotted arrays. Consequently, there is less need for a priori restrictions on the number of genes that are to be scanned. A disadvantage is that the current technology only allows for at most one experiment to be run on a single chip at any one time. Thus, for example, one does not obtain meaningful data from placing the control sample and test probes on an oligonucleotide microarray simultaneously; instead these two samples are measured on two separate oligochips. This means, in turn, that one typically has to apply a suitable normalization transformation across separate microarray data sets (i.e., inter-array) at the subsequent data analysis stage in order to make meaningful comparisons of reported expression changes from a control to a test condition. Additionally, if a scientist is interested in studying a specific species for which no appropriate oligochip exists, then this technology is not presently available. Oligochips are not as easily customizable at the user's end as robotically spotted microarrays.
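As a rough illustration of what an inter-array normalization can look like, the sketch below rescales each array so that its mean intensity equals a common target. Global mean scaling is only one simple choice, used here as a stand-in; the text does not prescribe a particular transformation, and the function name and target value are hypothetical.

```python
from statistics import mean
from typing import List


def scale_to_target(intensities: List[float], target_mean: float) -> List[float]:
    """Multiply every intensity by one factor so the array mean equals target_mean."""
    current = mean(intensities)
    if current <= 0:
        raise ValueError("mean intensity must be positive to compute a scale factor")
    factor = target_mean / current
    return [x * factor for x in intensities]


if __name__ == "__main__":
    control_chip = [100.0, 200.0, 300.0]  # made-up intensities from one oligochip
    test_chip = [220.0, 440.0, 660.0]     # made-up intensities from a second oligochip
    print(scale_to_target(control_chip, 200.0))
    print(scale_to_target(test_chip, 200.0))
```

In this toy case the second chip is uniformly brighter, so after scaling the two arrays agree; real data generally need more careful treatment.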

Since each probe is limited to 25 base pairs in length, a question immediately arises as to how each gene can be screened uniquely using only 25 base pairs. For each gene that needs to be represented, or whose expression needs to be measured, on Affymetrix's oligochip, a set of sixteen to twenty 25-mers is chosen that uniquely represents that particular gene and would hybridize under the same general conditions.[3] The Affymetrix literature calls the sample probe that is to be interrogated by cDNA probes on its microarray the target. Every set of perfect match (PM) probes for an mRNA has a corresponding set of mismatch (MM) probes. An MM probe is constructed from the same nucleotide sequence as its PM probe partner, except that the middle (usually the 13th) base has been switched to produce a mismatch. For example, the following two 25-mers may be associated PM-MM probes to assay for the following sample probe:

On chip:
    PM: ATCGACTGATGCATGCATCCATCAT
    MM: ATCGACTGATGCCTGCATCCATCAT

Sample probe: TAGCTGACTACGTACGTAGGTAGTA
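The relationship among the three sequences can be checked mechanically. The sketch below assumes the example reads PM = ATCGACTGATGCATGCATCCATCAT, MM = ATCGACTGATGCCTGCATCCATCAT, and sample probe = TAGCTGACTACGTACGTAGGTAGTA, with the sample written position by position beneath the probes (so a plain complement is used, not a reverse complement). The function names are illustrative only.

```python
from typing import List

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Base-by-base Watson-Crick complement, keeping the written order."""
    return "".join(COMPLEMENT[b] for b in seq)


def mismatch_positions(pm: str, mm: str) -> List[int]:
    """1-based positions at which the PM and MM probes differ."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM probes must have the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(pm, mm)) if a != b]


# Sequences as assumed above (reconstructed from the example in the text).
PM = "ATCGACTGATGCATGCATCCATCAT"
MM = "ATCGACTGATGCCTGCATCCATCAT"
SAMPLE = "TAGCTGACTACGTACGTAGGTAGTA"

if __name__ == "__main__":
    print(len(PM), mismatch_positions(PM, MM))
    print(complement(PM) == SAMPLE, complement(MM) == SAMPLE)
```

Under these assumptions the probes differ only at the 13th position, and the sample probe pairs perfectly with the PM probe but not with the MM probe.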

The combination of a PM and its associated MM oligonucleotide probe is called a probe pair. There are two principal reasons for the use of MM probes. First, at low concentrations of the target or sample probe when the PM probes have already reached their lower limit of sensitivity, the MM probes display greater sensitivity to changes in concentration. Second, MM probes are thought to bind to nonspecific sequences at the same rate as the PM probes. Thus, MM probes serve as an internal control for background nonspecific hybridization (see section 3.1.2 for details). However, depending upon the total RNA sample, it could turn out that the PM probe is already highly specific and the MM probes are simply binding to differently-specific labeled subsequences in the sample. One has to be careful to distinguish between nonspecific hybridization and differently-specific hybridization with regard to the use of MM probe data.

For target preparation, sufficient amounts of sample probe are first synthesized by reverse-transcribing the total RNA using an oligo-dT primer containing a T7 polymerase site for 5' to 3' transcription. Amplification and labeling of the cDNA sample probe are achieved by carrying out an in vitro transcription reaction in the presence of biotinylated deoxynucleotide triphosphates (dNTP), resulting in the linear amplification of the cDNA population (approximately 30- to 100-fold). This linearity assumption becomes increasingly weak with decreasing quantities of total RNA and increasing number of amplification cycles. The biotin-labeled cRNA probe generated from the sample is then hybridized to the oligonucleotide arrays, followed by binding to a streptavidin-conjugated fluorescent marker. Laser excitation of the hybridized sample, confocal microscopy, and image acquisition by an optical scanner are then performed. This results in an image file in which each oligonucleotide species is represented by a small rectangular area (~50 µm²), called the probe cell, which is itself composed of several image pixels, each occupying an area from 3 to 24 µm² (figure 3.4). The image file is processed so that the intensity of each probe cell is reported as the 75th percentile of the intensities of all the pixels in a probe cell, excluding the pixels at the border of the cell. These probe cell intensities are stored in a .cel file, which therefore reports a measure of hybridization per contiguous oligonucleotide surface on the microarray. We describe the contents of a typical .cel file below.
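The probe cell summary described above can be sketched in a few lines. Two details are assumptions, since the text does not specify them: the border is taken to be one pixel wide, and the percentile is computed with the nearest-rank rule. The function name is hypothetical.

```python
import math
from typing import List


def probe_cell_intensity(pixels: List[List[float]], percentile: float = 75.0) -> float:
    """Nearest-rank percentile of the interior pixels of one probe cell.

    `pixels` is a rectangular grid of pixel intensities; the outermost
    one-pixel border (an assumed width) is excluded before summarizing.
    """
    if len(pixels) < 3 or len(pixels[0]) < 3:
        raise ValueError("need at least a 3 x 3 cell to have interior pixels")
    interior = sorted(v for row in pixels[1:-1] for v in row[1:-1])
    rank = max(1, math.ceil(percentile / 100.0 * len(interior)))
    return interior[rank - 1]


if __name__ == "__main__":
    cell = [
        [999, 999, 999, 999],
        [999, 10, 20, 999],
        [999, 30, 40, 999],
        [999, 999, 999, 999],
    ]
    print(probe_cell_intensity(cell))  # bright border pixels do not affect the result
```

Excluding the border keeps light that bleeds in from neighboring cells out of the summary, and a high percentile is less sensitive to a few dim pixels than a plain mean would be.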

Figure 3.3: The photolithographic construction of microarrays. Synthesized high-density oligonucleotide microarray manufacturing with photolithography. Using selective masks, photolabile protecting groups are light-activated for DNA synthesis (1, 2); photoprotected DNA bases are added and coupled to the intended coordinates (3). This cycle is repeated (4) with the appropriate masks to allow for controlled parallel synthesis of oligonucleotide chains in all coordinates on the array (5). (Derived from Lipshutz et al. [121].)

Figure 3.4: Pixel, probe cell, and Affymetrix scanned image. Schematic showing how images are composed of probe cells, which contain probes that appear as pixels. (Derived from Jain [102].)

Affymetrix microarrays and background intensity calculations

The Affymetrix GeneChip analysis protocol uses the term absolute analysis to describe the algorithm for determining whether transcripts represented on the probe array are detected and the intensity of expression. Briefly, for each microarray that has been hybridized with a prepared target, washed, and scanned, one obtains the following sequence of files as the Affymetrix software attempts to translate that particular chip experiment into numerical data representing detectable RNA intensity levels:

• A grey-scale .tiff image file of the physical microarray where a lighter or darker pixel indicates a respectively stronger or weaker hybridization of a cDNA fragment to the probes at the particular coordinate marked by the pixel

• A .dat text file containing coordinates and intensity levels for individual pixels

• A .cel text file with probe cell coordinates and intensity calculated as a trimmed average of pixel intensities; on average a probe cell is made up of 8 × 8 pixels

• A .chp text file with 11 columns that contain the statistics, calculated by the Affymetrix software, for sets of 16 to 20 probe cell pairs, PM and MM, which interrogate particular transcripts

An important consideration in all microarray experiments is the contribution of the intensity of the background effects on each microarray. Since this can vary from one array to another, Affymetrix has developed its own methodology for subtracting the background from the hybridization signals. Background effects refer to brightness that ends up in the reported measurement reading for a probe cell, even though these effects did not originate from the probe cell. They typically include intra-chip phenomena such as localized physical changes (e.g., temperature during hybridization) in a probe cell that diffuse into its neighboring probe cells, or nonuniformity in ambient brightness levels in localized regions on a microarray surface. Background effects are components of the more general phenomenon of noise, which we discuss in section 3.2.

The entire microarray surface is divided into 16 sectors. Within each sector an average statistic is calculated from the lowest 2% of probe cell intensity values. This is the background intensity for that sector, and this value will be subtracted from the average intensities of all image features in that sector. Consequently, the number of background probe cells (i.e., probe cells that are used to calculate background intensity) will depend upon the number of probe cells in the array. In theory, calculating background on a per-sector basis minimizes the effect of changes in the microenvironment across different parts of the array.
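A minimal sketch of per-sector background subtraction follows. It assumes the 16 sectors form a 4 × 4 grid and that the "average statistic" is a plain mean of the lowest 2% of cells (at least one cell per sector); neither detail is stated in the text, and the function name is hypothetical.

```python
from typing import List


def subtract_sector_background(
    cells: List[List[float]], sectors_per_side: int = 4, fraction: float = 0.02
) -> List[List[float]]:
    """Subtract, within each sector, the mean of its lowest `fraction` of cells."""
    n_rows, n_cols = len(cells), len(cells[0])
    corrected = [list(row) for row in cells]
    for si in range(sectors_per_side):
        rows = range(si * n_rows // sectors_per_side,
                     (si + 1) * n_rows // sectors_per_side)
        for sj in range(sectors_per_side):
            cols = range(sj * n_cols // sectors_per_side,
                         (sj + 1) * n_cols // sectors_per_side)
            values = sorted(cells[r][c] for r in rows for c in cols)
            if not values:
                continue  # array smaller than the sector grid
            k = max(1, int(len(values) * fraction))  # number of background cells
            background = sum(values[:k]) / k
            for r in rows:
                for c in cols:
                    corrected[r][c] = cells[r][c] - background
    return corrected


if __name__ == "__main__":
    # Toy 8 x 8 array with a brightness gradient from top-left to bottom-right.
    toy = [[50 + 8 * r + c for c in range(8)] for r in range(8)]
    for row in subtract_sector_background(toy):
        print(row)
```

Because each sector uses its own background estimate, the gradient across the toy array is largely removed, which is the intended benefit of working per sector.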

In addition to calculating average background intensity, background probe cells are also used to compute the effects of background noise variations on the reported measurements. As shown in figure 3.5, the mean intensity of these background probe cells is obtained and the distribution around the mean is calculated. In other words, it is assumed that all the background probe cells should have the same near-zero intensity, and the variation around this is considered noise. Intuitively, a wider distribution around the mean for the background probe cells implies a more pronounced noise component for all expression measurements in the probe array.

The calculation corresponding to this intuition is given in figure 3.5B. This is the formula that Affymetrix uses to calculate the significance of a difference in intensities between the PM probe cells and associated MM probe cells. Both the ratio and the difference between the PM and the MM probes, PM/MM and PM-MM, for each probe pair are computed. These values are then compared against two thresholds, respectively the statistical ratio threshold (SRT) and the statistical difference threshold (SDT) (see figure 3.5C), which are themselves functions of the background for that probe array. If both thresholds are exceeded in the negative direction, the probe pair is considered "negative," and if they are exceeded in the positive direction, the probe pair is "positive." These assignments of probe pairs as negative or positive are then incorporated into more aggregate measures described below.
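The positive/negative assignment can be sketched as follows. The reading of "exceeded in the negative direction" as MM - PM > SDT together with MM/PM > SRT is our interpretation, the label "neutral" for pairs that are neither is a hypothetical name, and the threshold values in the usage example are made up; the real SDT and SRT are derived from the array's background noise.

```python
def classify_probe_pair(pm: float, mm: float, sdt: float, srt: float) -> str:
    """Label one probe pair using a difference threshold and a ratio threshold."""
    if pm <= 0 or mm <= 0:
        raise ValueError("intensities must be positive to form a ratio")
    if pm - mm > sdt and pm / mm > srt:
        return "positive"   # both thresholds exceeded in favor of PM
    if mm - pm > sdt and mm / pm > srt:
        return "negative"   # both thresholds exceeded in favor of MM
    return "neutral"        # hypothetical label: neither condition is met


if __name__ == "__main__":
    SDT, SRT = 30.0, 1.5  # made-up thresholds for illustration only
    for pm, mm in [(200.0, 100.0), (100.0, 200.0), (110.0, 100.0), (1000.0, 900.0)]:
        print(pm, mm, classify_probe_pair(pm, mm, SDT, SRT))
```

Requiring both conditions means a pair with a large absolute difference but a small ratio (the last case above) is not counted as positive.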

Figure 3.5: Background noise on Affymetrix arrays. A, A diagram showing how background noise variance is obtained from the background cells. B, The calculation of the background noise. C, The definition of SDT and SRT and how they define "positive" and "negative" probe pairs. (Derived from [10].)

For each gene whose expression the microarray has been designed to measure, there are between 16 and 20 probe cells representing PM probes and the same number of cells representing their associated MM probes. Collectively, these 32 to 40 probe cells are known as a probe set. A .cel file contains all the probe cell intensities for all the probe sets represented on a microarray. The .cel file is used, in turn, to generate derived or aggregate statistics for each probe set (e.g., a measure of expression of a particular gene). These aggregate statistics are stored in a .chp file.

Theoretically, each Affymetrix probe set of 16 to 20 25-mers is designed to be uniquely representative of a particular gene or EST, and of no other known gene or EST. However, it is often the case that no probe set fulfilling these rules can be found for a particular EST, and the probe set must then be designed by relaxing the rules. In practice, nonspecific hybridization of individual 25-mers is more significant than in robotically spotted microarrays, and this is one form of noise (see section 3.2 for more coverage of noise). In addition, nonspecific hybridization with other RNA species may result in an entire probe set for a gene fluorescing with little or no noticeable contrast between the PM and MM, as shown in figure 3.6:

Figure 3.6: Nonspecific hybridization on an oligonucleotide microarray. The two rows of probe cells represent the probe set for a gene transcript. The PM probes are on the top row and the MM probes are on the bottom row. Even if there is a lot of specific hybridization to the PM oligonucleotides due to the presence of the targeted gene transcript, if there is significant nonspecific hybridization, then the amount of transcript cannot be estimated accurately. In this illustration, the number and intensity of dark probe cells on the bottom row is higher than that in the top row and therefore most software packages would report that the reported intensity for the gene is unreliable—an "Absent" Absolute Call in Affymetrix parlance.

Recall that in theory, an MM probe is designed to bind to nonspecific sequences at the same rate as its associated PM probe. Due to the low specificity of individual oligonucleotide probes as compared to the lengthy cDNA sequences used in most robotically spotted microarrays, Affymetrix's goal has been to use all 32 to 40 probe cells to increase the aggregate specificity. In order to meet this goal, Affymetrix has developed several measures across the PM and MM to increase specificity and generally improve the signal-to-noise ratio of the hybridization measurements. These derived values provide measures of quantified gene expression and the reliability of gene expression. These measures are summarized in table 3.2.

Table 3.2: Affymetrix aggregate or derived measures per probe set

Positive Fraction

The number of positive probe pairs divided by the number of pairs used. Pairs used are equal to the total number of probe pairs in the probe set minus the masked-out probe pairs. For a variety of reasons, the user may decide to mask out one or more probe pairs. Also, probe cells reporting intensities more than 3 SD above or below the mean probe cell intensity in the probe set are masked out for being outliers.

Positive-to-Negative Ratio

The ratio of positive probe pairs to negative probe pairs.

Log Average Ratio (Log Avg Ratio)

A number describing the hybridization performance of a probe set by determining the ratio of the PM to MM intensities for each probe pair, taking the logarithm of each of the resulting values, then averaging those across the probe set. This is a slight simplification as the reported Log Avg Ratio is also corrected to exclude the extremal outlier probe pairs in a probe set, i.e., pairs with the largest and smallest contrast in intensity. It indicates random cross- or nonspecific hybridization, with the higher values suggesting a higher likelihood that a transcript is detected.

Average Difference (Avg Diff)

The Avg Diff is a number calculated by taking the difference between the PM and MM of every probe pair and averaging the differences over the entire probe set. It corresponds to the absolute expression level of a transcript.

Absolute Call (Abs Call)

Has values A (Absent), M (Marginal), or P (Present) regarding the presence of a transcript and is determined from a decision matrix combining the Log Avg Ratio, Positive-to-Negative Ratio, and Positive Fraction.
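The numerical measures in table 3.2 can be sketched for a single probe set as below. This is a simplified illustration, not the Affymetrix implementation: it assumes masked-out pairs have already been removed, takes the Positive Fraction to be positive pairs divided by pairs used, uses base-10 logarithms (the base is an assumption), and omits the exclusion of extremal probe pairs. The Abs Call is left out because its decision matrix is empirical and chip-specific.

```python
import math
from typing import Dict, List, Tuple


def probe_set_measures(
    pairs: List[Tuple[float, float]], calls: List[str]
) -> Dict[str, float]:
    """Aggregate measures for one probe set.

    pairs: (PM, MM) intensities for the pairs used (masked pairs already removed).
    calls: per-pair labels, "positive", "negative", or anything else for neither.
    """
    if not pairs or len(pairs) != len(calls):
        raise ValueError("need one call per probe pair and at least one pair")
    n = len(pairs)
    n_pos = calls.count("positive")
    n_neg = calls.count("negative")
    return {
        "positive_fraction": n_pos / n,
        "pos_neg_ratio": n_pos / n_neg if n_neg else math.inf,
        "log_avg_ratio": sum(math.log10(pm / mm) for pm, mm in pairs) / n,
        "avg_diff": sum(pm - mm for pm, mm in pairs) / n,
    }


if __name__ == "__main__":
    toy_pairs = [(1000.0, 100.0), (100.0, 100.0), (100.0, 1000.0), (400.0, 100.0)]
    toy_calls = ["positive", "neutral", "negative", "positive"]
    for name, value in probe_set_measures(toy_pairs, toy_calls).items():
        print(name, value)
```

Note that a single pair with a bright MM cell (the third pair here) pulls the Avg Diff down sharply, while its effect on the Log Avg Ratio is symmetric with the first pair.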

The two most commonly quoted measures are the Average Difference (Avg Diff) and the Absolute Call (Abs Call). The Avg Diff is an aggregate measure of the difference between the PM and MM probe cell intensities per probe set. In the simplest case, a highly specific hybridization will "light up" all the PM probe cells and none of the MM probe cells. In this case, the Avg Diff would be a positive number which increases with the quantity of that particular RNA species present in the sample. In practice, the pattern of figure 3.6 is more common. The greater the intensity of the MM probe cells, the lower the Avg Diff. If the intensities of the MM probe cells exceed those of the PM probe cells, then a negative Avg Diff can be reported. Therefore, even if a gene is expressed highly in a particular biological sample, the Avg Diff could be negative if there is also a lot of differently-specific hybridization reported by the MM probe cells.

The various techniques in the literature that have been employed to handle negative values of Avg Diff in an analysis are all controversial. Most often, genes with negative Avg Diffs are simply omitted from the analysis because the common log transformation used for robotically spotted array data is not defined for negative numbers. This loss of information is obviously not desirable. Other approaches such as thresholding negative numbers to a positive constant create a systematic artifactual bias in the distribution of Avg Diff results in the data set that affects all subsequent analyses that depend on these quantities.
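The two approaches mentioned above can be compared on a toy list of Avg Diff values. The floor constant of 20 and the function names are hypothetical; the sketch only illustrates why each approach is problematic, and does not recommend either.

```python
import math
from typing import List


def log_omit(avg_diffs: List[float]) -> List[float]:
    """Drop non-positive Avg Diff values, then log-transform the rest."""
    return [math.log10(x) for x in avg_diffs if x > 0]


def log_floor(avg_diffs: List[float], floor: float = 20.0) -> List[float]:
    """Raise every value below `floor` (a hypothetical constant) to `floor`, then log."""
    return [math.log10(max(x, floor)) for x in avg_diffs]


if __name__ == "__main__":
    toy = [-150.0, -5.0, 10.0, 100.0, 1000.0]  # made-up Avg Diff values
    print(log_omit(toy))   # two genes disappear from the analysis
    print(log_floor(toy))  # three distinct genes collapse to one identical value
```

Omission loses genes outright, while flooring piles many different measurements onto a single value, which is the artifactual bias in the distribution that the text warns about.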

Regarding the Abs Call per probe set, Affymetrix has empirically developed a decision table for each probe set of 16 to 20 probe pairs that is used to determine whether there was an "Absent," "Marginal," or "Present" call based upon the Positive-to-Negative Ratio, Positive Fraction, and Log Average Ratio. Note that this decision table will have different values for these three parameters for each chip set. The need for such a decision table stems from the fact that the specificity of the different probe pairs will vary across probes and across samples. In theory, the Abs Call provides a qualitative measure of reliability of the reported expression level (see figure 3.7). The naming convention of the Abs Call values is unfortunate, as the report of an Absent Abs Call does not necessarily mean that the gene is not expressed in a sample; rather it signifies that the gene expression measurement on that microarray and for that experiment is not reliable, e.g., its signal was below background effects.

Figure 3.7: The Affymetrix Absolute Call. The relationship between Absolute Call and expression level for a microarray. Far more Absent (A) calls than Present (P) calls are at lower expression levels, but significant numbers of Absent calls are found even at expression levels in the low thousands. (Plot title: Distribution of A and P Calls by Expression Level; axis: Expression Level (AvgDiff).)

A general understanding of these aggregate measures will make the reader aware of some major implications for analysis. First, the aggregate probe set values may not be reliable for all purposes. For example, in contrasting reported measurements from separate arrays, the aggregate measures Log Avg Ratio and Avg Diff may not be directly comparable as is because of the differences in specificity and sensitivity of each probe set, which is not captured by the aggregate measures. In contrast, Affymetrix software exploits probe-specific knowledge within the probe set in developing the decision table so that analyses and computations occur at the level of the .cel file. As described above, these files report the intensity of the individual probe cells, though at the time of writing, the oligonucleotide sequences for each probe cell are not known. Since most microarray analyses reported in the literature mainly employ Avg Diff values and not individual probe cell intensities, the comparability and reproducibility of these published results are less than optimal.

Figure 3.8: Graphical display of how Affymetrix Absolute Calls predict reproducibility. Correlation of expression levels from the same hybridization "cocktail" measured on two microarrays. Top: the correlation using only those genes for which an Absent call was reported by both microarrays. Bottom: only the Present called genes. (Axis: log10 of the first AvgDiff.)

Another factor complicating the use of these aggregate measure values is that several settings, such as the noise threshold, are tunable by the operator of the Affymetrix scanner and software unit, so that the reported aggregate measures may not be comparable across different microarray data sets. Despite the preceding concerns regarding aggregate measure values, the standard in the literature has typically been to not publish the .cel files with individual probe measurements but instead to release the aggregate statistics, e.g., .chp files, with all the attendant problems of comparability arising from these.

At present and in sharp contrast to robotically spotted microarrays, the use of oligochips runs counter to the do-it-yourself ethos of many biologists for three primary reasons. First, except for large pharmaceutical companies, the construction of oligonucleotide microarrays occurs offsite and completely under the control and supervision of the manufacturers (essentially not peer-reviewed). Second, current oligochip manufacturers charge what the market will bear, and oligochips are typically more expensive than robotically spotted arrays. Third, the specific oligonucleotide sequences that are used by the manufacturers of oligonucleotides to interrogate an RNA transcript are proprietary information. It is likely that as the market for microarrays broadens, the second and third problems will resolve themselves. Regarding the first reason, in this genomic era in which we see the rapid industrialization of biological investigations, an ever-increasing fraction of the biologist's tools will become commodities available from vendors at diverse price-performance trade-offs. From that perspective, industrial fabrication of whole transcriptome microarrays will likely become common in the near future (see section 7.1).

[1]This schema is adapted from Leming Shi's website http://www.gene-chips.com/.

[2]http://www.cmgm.stanford.edu/pbrown/mguide/index.html.

[3]Early reports at the time of writing indicate the next generation of Affymetrix oligochips may use 11 probe pairs per probe set.
