
Data Analysis

The data analysis and the experimental design need to be adapted to each other. Since careful analysis of microarray data yields an enormous amount of information about rapid pathogen adaptations to environmental changes, we summarize the important steps in this analysis here. The field is still evolving, and no single method can be described that will serve all purposes. Figure 1.1 gives a brief outline of a standard scheme for the analysis of such data.

Image acquisition is usually performed with a camera or a scanner and a frame grabber (ADC, analog-to-digital converter). Although images are regarded as raw data, different acquisition equipment and settings can have a significant effect on the data that are produced. Several methods are available for image quantification. Widely used, freely available tools are Spotfinder by TIGR (http://www.tigr.org/software/tm4/spotfinder.html) and ScanAlyze from Stanford (http://www.microarrays.org/software.html). Note that TIGR (The Institute for Genomic Research) also offers a valuable and constantly updated resource for microbial genomes, including almost all pathogen genomes sequenced to date that have been made publicly available. When a system-integrated database is used, signal intensities can be automatically assigned to the corresponding gene/EST. If no database is available, this must be done manually, e.g., using an MS Access database, which is available on most computers. Usually the spot location can be used as a link between data and description tables.
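The link between data and description tables via spot location can be sketched in a few lines of Python. This is only an illustration: the (block, row, column) key, the intensity values, and the gene names are hypothetical and do not come from any particular scanner or quantification tool.

```python
def annotate(intensities, descriptions):
    """Join signal intensities to gene identifiers by spot location."""
    return {
        spot: (descriptions.get(spot, "unannotated"), value)
        for spot, value in intensities.items()
    }

# Hypothetical quantification output: (block, row, column) -> signal intensity.
intensities = {(1, 1, 1): 1520.0, (1, 1, 2): 310.5, (1, 2, 1): 88.0}

# Hypothetical description table keyed by the same spot location.
descriptions = {(1, 1, 1): "geneA", (1, 1, 2): "geneB"}

annotated = annotate(intensities, descriptions)
```

Spots without an entry in the description table are kept and flagged rather than dropped, so that missing annotation stays visible in later steps.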

Fig. 1.1 Overview of microarray data analysis. Typical steps in gene array analysis are shown. Normalization, gene reduction, clustering, and detailed database analysis are necessary before a well-founded biological interpretation of the data is possible. vsn, variance stabilization; lowess, locally weighted regression scatter plot smoothing.

After quantification, arrays can only be compared once signal intensities have been normalized to reduce systematic variation as far as possible. Most normalization methods are based on the assumption that the data are normally distributed, so it is necessary to log-transform data prior to normalization. Widely used normalization methods are variance stabilization (vsn) [29] and locally weighted regression scatter plot smoothing (lowess) [30]. Different normalization methods and most of the following data analysis procedures are implemented in a software package called Bioconductor (http://www.bioconductor.org/, [31]), which is built on the R environment for statistical computing (http://www.R-project.org/, [32]).
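The log transformation followed by normalization can be illustrated with a deliberately simple sketch. It assumes a two-color array with positive, background-corrected intensities and uses global median centering of log2 ratios. This is a much simpler stand-in for vsn or lowess, not an implementation of either, and all intensity values are hypothetical.

```python
import math
from statistics import median

def log2_ratios(red, green):
    """Return log2(red / green) for each spot; intensities must be positive."""
    return [math.log2(r / g) for r, g in zip(red, green)]

def median_center(values):
    """Subtract the array-wide median so that the typical log ratio becomes 0."""
    m = median(values)
    return [v - m for v in values]

# Hypothetical background-corrected intensities for five spots on one array.
red = [200.0, 400.0, 800.0, 1600.0, 3200.0]
green = [100.0, 200.0, 400.0, 400.0, 1600.0]

normalized = median_center(log2_ratios(red, green))
```

Here four of the five spots share the same ratio, so after centering only the fourth spot keeps a non-zero log ratio. Median centering assumes that most genes do not change between the two channels; intensity-dependent methods such as lowess relax that assumption.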

After normalization, if there are only two samples to compare, abundances may be compared by a simple fold change or by statistical tests such as Student's t-test. However, since there is usually only a small number of replicates, there is a high risk of underestimating the variance and thus obtaining artificially low p-values. Randomly permuting the data and counting the number of genes that exceed a certain threshold can help to estimate the false positive rate.
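A minimal sketch of the permutation idea follows, assuming log2-scale data, two groups encoded as labels 0 and 1, and the absolute difference of group means as the test statistic. The data, the threshold of 2.0 (a four-fold change), and the function names are illustrative assumptions.

```python
import random
from statistics import mean

def count_exceeding(data, labels, threshold):
    """Count genes whose absolute mean difference between groups is >= threshold."""
    count = 0
    for row in data:
        group_a = [x for x, lab in zip(row, labels) if lab == 0]
        group_b = [x for x, lab in zip(row, labels) if lab == 1]
        if abs(mean(group_a) - mean(group_b)) >= threshold:
            count += 1
    return count

def estimate_false_positives(data, labels, threshold, n_perm=200, seed=0):
    """Average number of genes passing the threshold when labels are shuffled."""
    rng = random.Random(seed)
    shuffled = list(labels)
    total = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += count_exceeding(data, shuffled, threshold)
    return total / n_perm

# Hypothetical log2 intensities: rows are genes, columns are samples.
data = [
    [8.0, 8.1, 7.9, 11.0, 11.2, 10.9],  # clearly induced in the second group
    [6.0, 6.2, 5.9, 6.1, 6.0, 6.1],     # unchanged
    [9.5, 9.4, 9.6, 9.5, 9.6, 9.4],     # unchanged
]
labels = [0, 0, 0, 1, 1, 1]

observed = count_exceeding(data, labels, threshold=2.0)
expected_false = estimate_false_positives(data, labels, threshold=2.0)
```

Dividing `expected_false` by `observed` gives a rough estimate of the proportion of false positives among the selected genes. With only three replicates per group there are just 20 distinct label arrangements, so the estimate is coarse.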

When there are more than two experimental conditions, the signals resulting from a gene in the different test samples should be compared using relative signal intensities. Cluster analysis is then used to group the genes according to their behavior in the experiments. Some cluster algorithms lose power due to noisy or nonsignificant signals. Since genes which do not change during an experiment will not provide much information for differentiation, genes having a higher variance can be selected or - if the number of classes of genes is known - p-values can be used for gene selection.
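Selecting genes by variance before clustering can be sketched as follows. The gene names, the expression values, and the number of genes to keep are hypothetical; in practice the cutoff would be chosen from the data.

```python
from statistics import variance

def top_variance_genes(expression, n_keep):
    """Return the names of the n_keep genes with the largest sample variance."""
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:n_keep]

# Hypothetical relative signal intensities across four conditions.
expression = {
    "geneA": [1.0, 1.1, 0.9, 1.0],
    "geneB": [0.5, 2.5, 4.0, 6.0],
    "geneC": [3.0, 3.0, 3.1, 2.9],
    "geneD": [2.0, 0.5, 3.5, 1.0],
}

selected = top_variance_genes(expression, n_keep=2)
```

Genes such as geneA and geneC, which barely change across conditions, are discarded because they carry little information for separating groups.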

Unsupervised learning is used if classes or their labels are not known a priori. Easy-to-use array data analysis programs for class discovery on a Windows platform are provided by, for instance, Stanford and TIGR. Unsupervised algorithms such as hierarchical clustering, self-organizing maps (SOMs) [33], K-means, and principal component analysis (PCA) [34] are implemented in these tools. All of these algorithms are also implemented in Bioconductor. The most commonly used cluster methods in microarray data analysis are hierarchical clustering and K-means clustering. Hierarchical clustering and the graphic representation of the resulting trees give a good overview of distinct groups within a dataset, but the interpretation of larger groups may be complicated. K-means clustering requires the number of clusters (k) to be specified in advance. Each sample is first randomly assigned to one cluster. The distance between the center of each cluster (centroid) and each sample is then calculated, and the samples are iteratively reassigned to the nearest centroid. Each sample is assigned to exactly one centroid. To identify the optimal value of k, it is advisable to test several values. K-means clustering is a fast algorithm, suitable for large datasets, but can be affected by outliers.
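The K-means procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: squared Euclidean distance, a random initial assignment, and re-seeding of an empty cluster with a randomly chosen profile. The profiles are hypothetical, and for real data the implementations in the tools named above are preferable.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of profiles."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def k_means(points, k, max_iter=100, seed=0):
    """Cluster profiles into k groups; returns one cluster index per profile."""
    rng = random.Random(seed)
    # Random initial assignment of every profile to one cluster.
    assignment = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # An empty cluster is re-seeded with a randomly chosen profile.
            centroids.append(centroid(members) if members else list(rng.choice(points)))
        # Reassign every profile to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no profile changed cluster
        assignment = new_assignment
    return assignment

# Hypothetical expression profiles across three conditions.
points = [
    [0.0, 0.1, 0.2], [0.2, 0.0, 0.1], [0.1, 0.2, 0.0],
    [5.0, 5.1, 4.9], [5.2, 4.9, 5.0], [4.9, 5.0, 5.2],
]

labels = k_means(points, k=2)
```

Because the result depends on the random start, it is common to run the algorithm with several seeds and several values of k and to compare the resulting partitions.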

For a dataset with known class labels, supervised learning methods are superior to unsupervised methods. Based on a training set, supervised methods try to construct a classifier, which can then be used to identify the nature of an unknown case. Different algorithms are used for the analysis of microarray data. Decision trees, as described by Leo Breiman [35], construct a classifier based on hierarchically arranged separation rules. They produce models that can be interpreted quite easily by humans. However, to obtain a reasonable result with decision trees, it is imperative to reduce the number of genes first.
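To make the idea of a separation rule concrete, the sketch below searches for the single best rule of the form "expression of gene g above threshold t", that is, a one-level tree (sometimes called a decision stump). A full decision tree would apply such a search recursively and typically uses an impurity measure rather than the raw error count assumed here. The training data are hypothetical.

```python
def best_split(samples, labels):
    """Find the rule (gene index, threshold) that misclassifies the fewest samples.

    Returns (errors, gene, threshold, class_if_above), where class_if_above is
    the class predicted when expression of the gene exceeds the threshold.
    """
    best = None
    n_genes = len(samples[0])
    for gene in range(n_genes):
        values = sorted({s[gene] for s in samples})
        # Candidate thresholds are midpoints between neighbouring observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for class_if_above in (0, 1):
                errors = sum(
                    (class_if_above if s[gene] > threshold else 1 - class_if_above) != y
                    for s, y in zip(samples, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, gene, threshold, class_if_above)
    return best

# Hypothetical training set: rows are samples, columns are two genes.
samples = [
    [2.0, 7.0], [2.5, 3.0], [1.5, 6.5],
    [5.0, 3.5], [5.5, 7.5], [4.5, 2.5],
]
labels = [0, 0, 0, 1, 1, 1]

errors, gene, threshold, class_if_above = best_split(samples, labels)
```

In this toy set the first gene separates the two classes and the second does not. The search cost grows with the number of genes, which is one reason why gene reduction before tree construction is advisable.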

Neural networks are statistical models used for pattern recognition and classification. The network is composed of a large number of interconnected nodes. Neural networks learn from example data; they cannot be programmed to perform a specific task. The network finds out how to solve the problem by itself, so its operation can be unpredictable and thus hard to interpret. Support vector machines (SVMs) are classifiers which try to linearly separate data vectors of different classes by so-called hyperplanes in a high-dimensional space [36] (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Like neural networks, they do not generate gene lists; however, in most expression profiling studies the goal is the identification of differentially expressed genes. To identify discriminative genes, subsequent statistical methods or classification using the cluster number as class label may be used.

If two genes are found in the same cluster, this reveals similarities of expression profiles, but not the reasons for these similarities. Statistical significance does not always mean biological relevance. Coregulated genes may be controlled by the same regulatory pathway or share a specific promoter, but apparent coregulation may also be biological noise. Tools for direct sequence annotation and pathway analysis as discussed above can help to interpret and thus promote improved understanding of the biological function of the genes.

Finally, it should be borne in mind that expression analysis can only measure RNA abundances, but gene expression and protein abundance are regulated at many steps. These steps include transcriptional control, alternative splicing, transport and location control, and mRNA degradation control. All these control mechanisms can cause discrepancies between gene expression on the mRNA level and the protein level in response to perturbations. In addition, alternative splicing can

Tab. 1.1 Tools to analyze pathogen gene expression changes using microarrays.

The R-project for statistical computing    http://www.r-project.org/
Bioconductor home page                     http://www.bioconductor.org/
Useful Bioconductor packages
  Normalization                            vsn, marray, limma
  MA plot                                  limma, marray
  Principal component analysis             stats
  K-means clustering                       stats
  Hierarchical clustering                  cluster
  Self-organizing maps
  Regression trees
  Support vector machines
  Partitioning around medoids
Free Affymetrix annotation tool            NetAffx
Stanford tools
TIGR tools
