For the remainder of the chapter, we return to the DNA microarray platforms developed in Section 3 and, in particular, the oligonucleotide platform. Although there are significant differences between the cDNA and oligonucleotide technologies, the underlying assay principles are the same, as highlighted in Table 1. The similarities extend to the analysis and interpretation phase of the assay, although, again, there are differences in implementation by platform. Given that this text is not intended to prepare the reader to perform analyses, but rather to serve as an introduction to the interpretation of microarray experiments, we will limit our further discussion to the oligonucleotide format only.
Microarray assays are generally employed for only a limited number of study designs. Relatively infrequently, a time series design is used in which a single subject or group of subjects is sampled at a variety of time-points or experimental conditions (such as increasing drug concentration). Gene expression is followed over time or concentration, either with respect to a specific outcome, such as differential response to stimulus, or to determine patterns of gene regulation. Such experiments have been useful in demonstrating the potential of microarrays for elucidating complex molecular networks, even when the function of all genes is unknown (80). Time series comprise a minority of DNA microarray experiments, many of which are directed at molecular pathway elucidation rather than clinical questions, and we will not consider them further.
The most common experimental designs sample subjects only once, not serially, and address one or both of the following questions: (1) What are the dominant patterns of gene expression in this sample, without regard to any specific outcome or phenotype? (2) What is the dominant pattern of gene expression with regard to a specific outcome or phenotype? An investigator interested in discovering previously unrecognized tumor subtypes is essentially asking Question 1, for example. In a different analysis, the investigator trying to define genes associated with aggressive cancer might ask Question 2. Finally, an increasingly common experimental design is the validation of results obtained in the process of answering Questions 1 and 2 (12,81,82). Before addressing the methods needed to execute these analyses, we will consider the final data processing and filtering steps.
9.1. NORMALIZATION The most common DNA microarray experiments attempt to make meaningful comparisons of gene expression patterns across samples. A meaningful pattern is one that relates to underlying biology; yet, patterns could emerge that relate either to experimental conditions or error as well. Proper experimental design and analysis are the tools needed to minimize the influence of error and experimental conditions. We have discussed elements of experimental design in Section 5 in this regard and now turn to the role of experimental conditions. The process of accommodating for experimental conditions in order to elucidate meaningful patterns in the microarray data has been given the name normalization and we will introduce its important features here.
Fig. 9. Variation in expression values by experimental conditions.
Microarray images contain a variety of perturbations that are solely related to experimental conditions and, if not corrected for, will result in uninformative patterns in the final gene expression analysis. Figure 9A-D shows examples of how gene expression values can be altered by differences in starting RNA concentration, variation in scanner calibration, and saturation effects, all without differences in sample biology. Each part shows the results of paired experiments in which RNA from the same source has been divided and hybridized on two separate chips. Each graph plots the range of expression values in experiment A (without regard to specific probes) against those in experiment B under different experimental conditions.
Figure 9A shows a perfect correlation of expression values across the range of expression on the chip—the ideal situation in which experimental conditions were exactly the same across the two assays. Fig. 9B shows the results in which the RNA aliquot was divided into unequal parts 1/3 (experiment A) and 2/3 (experiment B). Expression at every point in experiment B will be twice that of experiment A solely because of starting RNA concentration. When the investigator is aware of unequal RNA starting concentrations, the perturbation is easily corrected by dividing all expression values in experiment B by the appropriate constant—2 in this case. In Fig. 9C, we see the influence of systematic variation caused by the scanner calibration, in which every probe in experiment B has expression systematically increased by a constant. This effect is corrected by subtraction of that constant from all probes in experiment B. Finally, we see the influence of a saturation effect in Fig. 9D, in which small changes in experiment A correspond to larger expression changes in experiment B above a threshold brightness. Although more difficult to recognize, saturation effects can also be corrected by relatively simple computational methods. Although presented as isolated phenomena in this example, in actual practice all three effects are likely present in combination for any given two-array comparison.
The optimal normalization technique to address the effects shown in Fig. 9 is an area of active research, and most analytical software contains some form of normalization procedure (83). In the simplest case, a procedure called standardization can correct the effect in Fig. 9B by solving the equation Y = MX for the slope M. It is similarly trivial to use linear regression to solve the equation Y = MX + B for the slope and constant and normalize accordingly (useful to address combinations of the effects from Figs. 9B,C). Figure 9D requires more sophisticated modeling to address differential effects that occur across the range of expression. A family of regression techniques commonly known as LOWESS regression, or locally weighted polynomial regression, has been employed for this purpose (26). Another common approach is to apply probabilistic distributions to all probes in the experiment. For example, the mean expression value can be set to 0 and the standard deviation to 1 for all probes on an array and all arrays in an experiment. Each of these methods has its strengths and weaknesses; for example, the probabilistic method assumes a Normal distribution of expression values when, in fact, the distribution may not be known. In general, the relative strengths and weaknesses of normalization methods are beyond the scope of this review; however, we emphasize that some form of normalization should be a part of any current array analysis.
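The linear-regression correction just described can be sketched in a few lines; the function names and toy expression values below are illustrative assumptions, not part of any microarray package.

```python
# Sketch: normalizing array B against array A by linear regression,
# assuming paired expression values from a split-sample experiment.

def fit_line(x, y):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def normalize_to(a, b):
    """Rescale expression values in b onto the scale of a."""
    m, intercept = fit_line(a, b)          # b is approximately m*a + intercept
    return [(bi - intercept) / m for bi in b]

# Suppose array B saw twice the RNA plus a scanner offset of 10 units
# (the combined effects of Figs. 9B and 9C).
exp_a = [100.0, 200.0, 400.0, 800.0]
exp_b = [210.0, 410.0, 810.0, 1610.0]
print(normalize_to(exp_a, exp_b))          # recovers values near exp_a
```

Setting the intercept to zero reduces this to the standardization case of Fig. 9B, where only the slope M is estimated.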
The normalization techniques described thus far carry the underlying assumption that the probes with high (or low) expression in experiment A are the same as in experiment B. Normalization in this case serves only to recalibrate experimental conditions, not to alter the underlying biologic relations. This is a reasonable assumption when experiments A and B use RNA from the same source; however, most microarray experiments compare RNA from different sources. When RNA is isolated from different sources, we assume that the samples will have different RNA expression patterns that relate to differences in their biology. The most highly expressed RNA in sample A will differ from that in sample B, so that a normalization curve such as that seen in Fig. 9 is much more challenging to construct. When the probes that measure RNA expressed at the highest (or lowest) levels differ across samples, normalization reduces the very biologic variation that we are interested in studying. It might surprise the reader to realize that, of all the challenges posed by microarray technologies, it is this problem of normalization that remains among the most difficult in bringing the technology to clinical use (83).
Rank-invariant normalization is one of a number of techniques that have been suggested to better maintain biologically significant expression differences while performing adequate normalization (84). In rank-invariant normalization, all genes on each array are ranked by the level of their expression. In theory, many genes unrelated to the biologic process of interest will be expressed on each of the two arrays at a wide range of expression levels. For example, some housekeeping genes are expressed at consistently high levels in all cells. These genes might have different absolute expression levels according to the experimental conditions, yet their ranks should be similar across arrays. Rank-invariant normalization selects sets of genes with similar ranks across arrays, at high, medium, and low rank, for example, and performs local regression similar to the LOWESS method. In this way, only genes that are invariant across arrays provide information on the normalization scaling factors, yet all genes are normalized. Initial evidence suggests that this method is quite promising.
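The rank-invariant selection step might be sketched as follows; the fixed rank-shift tolerance is an illustrative assumption, and real implementations choose the invariant set and perform the subsequent local regression more carefully.

```python
# Sketch of the rank-invariant idea: keep only genes whose expression
# ranks differ little between two arrays; those genes then drive the
# normalization fit, yet all genes are normalized.

def ranks(values):
    """Rank of each value within its array (0 = lowest expression)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def rank_invariant_genes(a, b, max_rank_shift=1):
    """Indices of genes whose ranks agree across arrays a and b."""
    ra, rb = ranks(a), ranks(b)
    return [i for i in range(len(a))
            if abs(ra[i] - rb[i]) <= max_rank_shift]

a = [5.0, 50.0, 500.0, 20.0, 80.0]
b = [12.0, 105.0, 990.0, 300.0, 160.0]   # gene at index 3 shifts rank sharply
print(rank_invariant_genes(a, b))        # excludes the rank-shifted gene
```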
Other normalization techniques have been employed, both computational and experimental. A number of authors have tried spiking RNA species of known concentration into all samples across an experiment, or using housekeeping genes to normalize in a manner conceptually similar to the rank-invariant technique. In practice, these methods have not proved successful. In summary, normalization is a vital step in making meaningful comparisons across microarray experiments. There is a wide variety of techniques, and although some are perhaps more promising than others, no consensus has been reached on the optimal method (85).
9.2. HYPOTHESIS TESTING Normalized data are ready for hypothesis testing; however, the most familiar tool for this process often might not be appropriate for genomics applications. Classical biostatistics has evolved a methodology for addressing the challenges of hypothesis testing and study design, the basic principles of which are familiar to most readers. In brief, the method works as follows. An investigator wishes to demonstrate that a parameter A is greater than B, where the measurements of A and B are associated with a certain error. The investigator collects enough samples of A and B so that the measurements of error are smaller than the expected difference between them. To prove that A > B, the investigator states a hypothesis that he wishes to reject (usually a null hypothesis), such as A is equal to B, and an acceptable threshold for making an error in the conclusion. By convention, the threshold is usually a 0.05 chance of stating that the null is false when, in fact, the null is true. In other words, there is a 5% chance that A and B are equal when we say they are not. If we were to perform the experiment twice, we would increase the chance of making that same error to approximately 0.1 (0.05 + 0.05 = 0.1). In summary, to most efficiently prove that A > B using classical statistics, an investigator needs to collect many samples of A and B and test his hypothesis only once. This is the form in which most clinical trials are conducted, where a large cohort of patients is assembled to test one hypothesis, such as the effectiveness of a drug therapy.
Now consider the case of most genomics experiments in light of the previous example: they accrue small numbers of subjects, rarely more than a few dozen, while collecting data on thousands of genes. Each of the thousands of genes represents a potential hypothesis, a worst-case scenario by classical statistical methods. With few samples, individual genes cannot be measured reliably. With many hypotheses to test, the number of false positives will be large. For example, when 10,000 genes are analyzed at the 0.05 threshold, we would expect 500 false-positive results (10,000 x 0.05 = 500) in addition to any true positives. Reducing the threshold to 0.01 might reduce the problem of false positives; however, this comes at the cost of increasing false negatives. Although genomics data could be analyzed using classical biostatistics, in which several dozen samples were analyzed for a single gene on the array, this represents a highly inefficient study design. The previous examples are grossly simplified, as there are adaptations of common statistical methods that overcome some of the shortcomings described, although there is general consensus that efficient analysis of microarrays requires novel analytical techniques (86). Many of these methods are still under development, and unlike classical biostatistics, the conventions are not yet well established. Despite the ongoing evolution, an analytical framework with broad applicability has emerged and will be discussed in the following subsections.
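The multiple-testing arithmetic above can be made concrete. The Bonferroni adjustment shown here is one standard classical correction, included as an illustrative assumption rather than a method prescribed by the text.

```python
# Sketch of the multiple-testing problem at genome scale: expected
# false positives at a fixed threshold, and a Bonferroni-adjusted
# per-gene threshold that bounds the chance of any false positive.

n_genes = 10_000
alpha = 0.05

expected_false_positives = n_genes * alpha   # 500, as in the text
bonferroni_threshold = alpha / n_genes       # per-gene threshold of 0.000005

# Probability of at least one false positive among n independent tests
# at the unadjusted threshold (why 0.05 is untenable at genome scale):
p_any_error = 1 - (1 - alpha) ** n_genes

print(expected_false_positives, bonferroni_threshold, round(p_any_error, 6))
```

The Bonferroni threshold controls false positives at the cost of many false negatives, which is exactly the trade-off the text describes when lowering the threshold to 0.01.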
9.3. GENE FILTERING Thus far, we have focused on processing microarrays without regard to the specific genes on the arrays. A shorter list of useful genes statistically and functionally associated with the outcomes of interest might be preferable to the indiscriminate use of all genes on the array. Accordingly, most investigators try to eliminate genes that have no plausible association with the disease states of interest or that are likely to contribute error to the analysis. Attempts to do this systematically by manual review of genes, especially across multiple samples, are hindered by the sheer volume of data. As an initial step to decrease the chances of spurious results by data reduction, most analyses include a component of gene filtering. Gene filtering is the process of removing from further analytical consideration a large number of genes that are unlikely to contribute to the results or that have undesirable properties for the specific hypothesis at hand (87,88). As suggested in the previous subsection, no consensus exists for the optimal filtering of genes, but we can consider the most commonly used approaches. Although gene filtering might be useful, as with any step in the preprocessing of the data, it likely influences the results of the final analysis. In this regard, investigators and readers should take note of the filtering method and the stringency of the criteria used.
The usefulness of genes is often considered in the context of the entire sample set. Genes that do not vary across samples are generally not considered useful and can be removed from further analysis. Lack of variability can relate to the biology of the sample set, the quality of the probes, or other factors. When there is no variation, however, genes can only be thought of as contributing noise without signal. The variability of a gene can be described in a number of ways, such as the percentage of samples in which the gene exceeds a threshold expression value or the ratio of the standard deviation to the mean across samples. Eliminating the large number of genes with expression near zero often results in the greatest data reduction. Although gene variability across samples is a desirable analytic characteristic, a gene marked present in only one or a few samples out of many raises red flags. Although a gene expressed in only one sample can represent a biologically important finding, it often represents a false positive, which prudence suggests striking from further analysis.
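The distribution-based filters just described might be sketched as follows; the cutoffs (detection floor, minimum number of present calls, minimum coefficient of variation) are illustrative assumptions that an investigator would tune to the experiment at hand.

```python
# Sketch of two common gene filters: drop genes with low variability
# (SD/mean across samples) and genes detected in too few samples.
import statistics

def keep_gene(values, present_floor=100.0, min_present=2, min_cv=0.5):
    """True if a gene varies enough and is detected in enough samples."""
    present = sum(1 for v in values if v >= present_floor)
    if present < min_present:             # expressed in too few samples
        return False
    mean = statistics.mean(values)
    if mean == 0:
        return False
    cv = statistics.stdev(values) / mean  # coefficient of variation
    return cv >= min_cv

genes = {
    "flat":   [120.0, 125.0, 118.0, 122.0],  # present but invariant
    "single": [900.0, 5.0, 8.0, 3.0],        # one-sample spike: suspect
    "useful": [150.0, 900.0, 140.0, 860.0],  # present and variable
}
print({name: keep_gene(v) for name, v in genes.items()})
```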
A number of authors have proposed using replicate microarrays, in which the same sample is hybridized against two different arrays to detect the probes with the highest reliability, filtering out low-reliability probes (14). In addition to filtering based on gene distribution, strategies can be based on other gene properties, such as the underlying biology. Investigators can generate gene lists that relate to biologic processes, such as development or inflammation, and systematically include or exclude them from analysis.
9.4. GENE SPACE In place of probabilistic distributions, the concept fueling classical statistics and hypothesis testing, the current paradigm proving most useful in the analysis of microarrays is that of multidimensional space or, alternatively, gene space. An example shown in Fig. 10 serves to highlight the basic principle. Suppose that we perform 10 microarray experiments on 10 separate samples. Each microarray originally surveys 10,000 genes, but after processing and strict filtering, we are left with only two genes for the analysis. These two genes represent a two-dimensional space, shown in Fig. 10. Each microarray experiment, in turn, can be placed in that same two-dimensional space according to the expression levels of genes 1 and 2. In this manner and for this analysis, each microarray is represented fully by one point in this theoretical two-dimensional space. Extending the model to three-dimensional space, a third gene would define a discrete location in three dimensions. Although conceptually abstract, it is computationally trivial to extend the model indefinitely into nth-dimensional gene space, in which the expression levels of n genes define a unique point in gene space.
Consider some advantages of the gene space model. Many samples, each with thousands of data points (gene expression values), can be represented by a single point in space. In place of describing the interrelations of thousands of genes over multiple samples, the analysis can be framed in light of the distance of one array from another in gene space. Figure 10 illustrates this point, where the vectors (A) and (B) represent distances between samples and between groups of samples, respectively. Accordingly, vector (A) suggests a relatively short distance among samples in a group compared to a larger vector (B) between the two groups. Any two-sample comparison is reduced to a single distance measure no matter how many genes or dimensions of gene space are involved. Using the concept of distance, the problem of multiple hypothesis testing described in Section 9.2 is averted and we have the most useful current model for the analysis of microarray experiments.
Although a useful concept, distance measures have limitations in modeling biologic systems. First, any vector such as vector (A) from Fig. 10 has a value equal to [(Gene1,array1 - Gene1,array2)^2 + (Gene2,array1 - Gene2,array2)^2]^1/2, as described by the Pythagorean theorem. The straight line measuring the distance between two points is a measure known as Euclidean distance and is only one of many ways in which distance can be described. For example, we might choose to weight relative gene contributions differently or use other well-described distance measures (89). Additionally, there are cases in which distance might be ill-defined by any measure, such as for a dichotomous variable like gender or "on/off." Ultimately, the relevance of distance measures and the models that use them is reflected by their performance in biologic systems. As we will see, they often perform quite well, and when they do not, there are other model systems.
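The Euclidean distance between two arrays in gene space is a one-liner in code, and it extends unchanged to any number of genes; the two-gene arrays below are hypothetical examples.

```python
# Minimal sketch of Euclidean distance in n-dimensional gene space:
# each array is a point whose coordinates are its gene expression
# values, so any two-array comparison reduces to a single number.
import math

def euclidean(p, q):
    """Straight-line distance between two arrays in gene space."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

array1 = [3.0, 4.0]              # expression of gene 1 and gene 2
array2 = [0.0, 0.0]
print(euclidean(array1, array2))  # 5.0, the 3-4-5 right triangle
```

Weighting genes differently, as the text suggests, would amount to multiplying each squared difference by a per-gene weight before summing.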
Fig. 10. Two-dimensional gene space.

9.5. MACHINE LEARNING In Section 5, we noted that the two most common microarray experiments attempt to answer the following questions: (1) What is the dominant pattern of gene expression in this sample, without regard to any specific outcome or phenotype? (2) What is the dominant pattern of gene expression with regard to a specific outcome or phenotype? The search for patterns defined by multiple data points has been an uncommon problem in biomedical research, which traditionally addresses hypotheses singly by hypothesis testing. Researchers in other fields of science, such as artificial intelligence, however, have been more interested in problems of pattern recognition (90). The following subsections will develop some basic tools of artificial intelligence that have been adapted for gene expression arrays. Our focus will be on method application without rigorously detailing the computational aspects.
The technique of finding patterns in data without regard to a specific outcome (Question 1) has been called unsupervised learning or clustering. For example, given a group of tumors that appear similar by current diagnostic methods, are there any gene patterns that represent unrecognized biological differences? In this example, we are looking for gene patterns without regard to any outcome, such as tumor behavior or patient survival. Using the language of machine learning, neither the classes into which we wish to divide the samples nor the genes needed to make those divisions are known at the start of the analysis. Alternatively, techniques for addressing Question 2 have been called supervised learning or classification. Continuing the undifferentiated tumor example, when we look for genes in these samples that associate with a particular outcome, such as aggressive vs indolent behavior, the methods used are those of supervised learning. In supervised learning, the classes into which we wish to divide the data are known, but the genes used to accomplish the task are not.
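As a minimal sketch of the supervised case, a nearest-centroid rule assigns a new sample to the known class whose centroid lies closest in gene space; the class labels and two-gene expression values below are invented for illustration and do not represent any particular classifier from the literature.

```python
# Sketch of supervised learning in gene space: classes are known in
# advance, and a new sample is assigned to the class whose centroid
# (mean expression profile) is nearest by Euclidean distance.
import math

def centroid(points):
    """Mean expression profile of a group of samples."""
    return [sum(c) / len(points) for c in zip(*points)]

def classify(sample, classes):
    """Nearest-centroid assignment of a sample in gene space."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return min(classes, key=lambda name: dist(sample, centroid(classes[name])))

classes = {
    "aggressive": [[9.0, 1.0], [8.0, 2.0]],  # high gene 1, low gene 2
    "indolent":   [[1.0, 9.0], [2.0, 8.0]],
}
print(classify([7.5, 2.5], classes))         # "aggressive"
```

In the unsupervised case, by contrast, neither the labels nor the group memberships would be given; a clustering method would have to discover the two groups from the points alone.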