Multivariate Data Analysis

Multivariate Data Analysis (MVA) serves two purposes here. Principal Component Analysis (PCA) classifies sets of compounds and identifies the primary latent variables that summarize the data. Projection to Latent Structures by means of Partial Least Squares (PLS) identifies correlations between variables describing the properties of compounds and the biological effects of those compounds.

PCA is a projection method that represents a multivariate data matrix in a low-dimensional space, which makes it more straightforward to identify dominant patterns and major trends in the data. The relationships between compounds and data, and among the data variables, are uncovered. PCA forms linear combinations of the observations (compounds) and of the variables: the data matrix is summarized row-wise as scores (t_a) and column-wise as loadings (p_a), one pair for each component a. Directions in a scores plot correspond to directions in the loadings plot, so the dominant variables associated with a compound can be identified (Eriksson, 1999).
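The decomposition into scores and loadings can be sketched in a few lines. The following is a minimal illustration using the NIPALS iteration in pure Python; it is not the implementation of any particular package, the function names are hypothetical, and the 5 x 3 toy matrix is invented for demonstration. It assumes the requested number of components does not exceed the rank of the centered matrix.

```python
import math

def center_columns(X):
    """Subtract each column mean so every variable has mean zero."""
    n, k = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(k)]
    return [[row[j] - means[j] for j in range(k)] for row in X]

def nipals_pca(X, n_components, tol=1e-12, max_iter=1000):
    """Return (scores, loadings) for a column-centered matrix X (list of rows).

    scores[a] is the score vector t_a (one value per observation);
    loadings[a] is the unit-length loading vector p_a (one value per variable).
    """
    E = [row[:] for row in X]  # residual matrix, deflated after each component
    n, k = len(E), len(E[0])
    scores, loadings = [], []
    for _ in range(n_components):
        # Start from the residual column with the largest sum of squares.
        j0 = max(range(k), key=lambda j: sum(row[j] ** 2 for row in E))
        t = [row[j0] for row in E]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading: regress every column of E on the current score vector.
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]
            # Score: project every row of E onto the loading vector.
            t_new = [sum(E[i][j] * p[j] for j in range(k)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        # Deflate: remove the part of the data explained by this component.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        scores.append(t)
        loadings.append(p)
    return scores, loadings

# Hypothetical toy data: 5 compounds (rows) described by 3 variables (columns).
toy = center_columns([
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.7],
    [3.0, 6.2, 0.2],
    [4.0, 7.8, 0.9],
    [5.0, 10.1, 0.4],
])
T, P = nipals_pca(toy, n_components=2)
total_ss = sum(v * v for row in toy for v in row)
# Fraction of the variation in X explained by each component.
r2x = [sum(v * v for v in t) / total_ss for t in T]
print("explained variation per component:", r2x)
```

Each score vector locates the compounds along one component, and the matching loading vector shows how strongly each variable contributes to that component. In practice the variables are usually also scaled before the analysis; that step is omitted here for brevity.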

An example that illustrates the use of PCA is the characterization of a set of 130 aldehydes. Aldehydes are a reagent class commonly used in organic synthesis and routinely used in lead optimization. When selecting a set of reagents for an array synthesis, calculated properties are generally used to aid the array design process. For the set of 130 aldehydes we conducted a far more extensive characterization, including calculation of eighteen properties and measurement of nine physical properties. With twenty-seven variables it is difficult to use all the data to select the smaller subset of reagents needed for an array synthesis. Using SIMCA-P 9.0 (www.umetrics.com) we generated a PCA model with six components (R2X = 0.901, Q2 = 0.687). The plot of the first two scores (t1 and t2) showed a good spread of observations with no obvious clusters (Figure 1). Lipophilic aldehydes are located in the upper right quadrant and polar aldehydes are found in the upper left. The scores for each aldehyde can be used to select a representative set of the aldehydes when designing an array of compounds.
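One way to turn scores into a representative subset is a greedy max-min (maximum dissimilarity) selection in score space. The text does not state which selection procedure was used, so the sketch below is an assumed approach, not the authors' method; the function name and the toy score values are hypothetical.

```python
import math

def maxmin_select(scores, n_select):
    """Greedy max-min selection of observations in score space.

    scores: list of tuples, one per compound, e.g. (t1, t2).
    Returns the indices of the selected compounds in the order picked.
    """
    dim = len(scores[0])
    centroid = [sum(s[d] for s in scores) / len(scores) for d in range(dim)]
    # Seed with the compound closest to the centre of the score plot.
    chosen = [min(range(len(scores)), key=lambda i: math.dist(scores[i], centroid))]
    while len(chosen) < min(n_select, len(scores)):
        remaining = [i for i in range(len(scores)) if i not in chosen]
        # Pick the candidate whose nearest already-chosen neighbour is farthest away.
        nxt = max(
            remaining,
            key=lambda i: min(math.dist(scores[i], scores[c]) for c in chosen),
        )
        chosen.append(nxt)
    return chosen

# Hypothetical (t1, t2) scores: two tight pairs and one isolated compound.
toy_scores = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (-5.0, 5.0)]
print(maxmin_select(toy_scores, 3))
```

Because each pick maximizes the distance to the compounds already chosen, near-duplicates in score space are skipped and the selection spreads across the plot. Using more than two score columns extends the same idea to additional components.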

Figure 1. PCA t[1]/t[2] Scores Plot

These trends are reflected in the loadings plot (p1 and p2), where lipophilicity variables dominate the first component and polarity variables dominate the second (Figure 2). There is also an inverse correlation between solubility and the lipophilicity variables.

Figure 2. PCA p[1]/p[2] Loadings Plot (1st component: lipophilicity; 2nd component: polarity)

This example illustrates how PCA can be used to characterize compound sets. To analyze correlations between one data matrix (X) and a second data matrix (Y), the PLS method is needed. This approach is useful in many situations in drug discovery; one key area is the search for correlations between calculated or measured physical properties and biological responses. The following example from an early-phase drug discovery project illustrates the use of PLS.
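Before turning to that example, the mechanics of PLS can be sketched for the simplest case of a single response column (often called PLS1). This is a minimal pure-Python illustration of the NIPALS-style algorithm, not the implementation of any particular package; the function names and the toy numbers are hypothetical, and both X and y are assumed to be mean-centered already.

```python
import math

def pls1_fit(X, y, n_components):
    """Fit PLS1 to column-centered X (list of rows) and centered y (list).

    Returns one (w, p, q) triple per component: X weights, X loadings,
    and the y loading.
    """
    E = [row[:] for row in X]  # X residuals
    f = list(y)                # y residuals
    n, k = len(E), len(E[0])
    model = []
    for _ in range(n_components):
        # Weights: the direction in X with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # no covariance left between X and y residuals
        w = [v / norm for v in w]
        # Scores: projection of each observation onto the weight vector.
        t = [sum(E[i][j] * w[j] for j in range(k)) for i in range(n)]
        tt = sum(v * v for v in t)
        # Loadings: regress X residuals and y residuals on the scores.
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate both blocks before extracting the next component.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        model.append((w, p, q))
    return model

def pls1_predict(model, x):
    """Predict the centered response for one centered observation x."""
    x = list(x)
    y_hat = 0.0
    for w, p, q in model:
        t = sum(a * b for a, b in zip(x, w))
        y_hat += q * t
        x = [a - t * b for a, b in zip(x, p)]  # same deflation as in fitting
    return y_hat

# Hypothetical centered data: two property columns and one response.
X_toy = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y_toy = [2.0 * a - b for a, b in X_toy]
model = pls1_fit(X_toy, y_toy, n_components=2)
print([round(pls1_predict(model, x), 6) for x in X_toy])
```

Unlike PCA, each component is chosen for its covariance with the response rather than for the variation it explains in X alone, which is why PLS is suited to relating property data to biological data.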
