
Figure 1. Illustration of a tiered analysis system geared toward correctly partitioning the toxic (red) from the non-toxic in which the first phase has a high sensitivity (93%) and the second phase has high specificity (85%). The overall concordance of this approach is 86%, with an overall sensitivity of 80% and specificity of 89%.

exploration of chemical space. An additional motivation for biasing a model would be if the confirmatory assay were prohibitively expensive or were of low throughput, leading to the need for a model with few false positives.

As an example we will consider carcinogenicity modeling. The assay approved by the Food and Drug Administration (FDA) to evaluate the in vivo carcinogenicity of a compound is the two-year rodent assay with cohorts of male and female rats. In order to build a model which is not simply an exercise in statistics, a cost-benefit analysis could be conducted as follows: if the cost (either in real dollars or in opportunity cost due to halting a compound's development) of excluding a compound from an assay is higher than the cost of losing a compound in a later stage of development due to carcinogenicity, the model should have a high true negative rate. It would be this pool of compounds which would be carried forward and taken into the in vitro/in vivo assays at a later stage of development.

Tractable Toxicological Endpoints

Our ability to model a liability does not depend merely on the skill of the modeler or on the selection of the optimal algorithm, set of descriptors or subset of data for a given endpoint. More so than traditional quantitative structure-activity models, models in the realm of predictive toxicology depend acutely on the endpoints for which they are developed. For teratogenicity, for example, the phenotype characterized as teratogenic arises from an unknown number of biological mechanisms, and building a predictive model becomes extremely difficult. Even within an endpoint with a limited number of mechanisms, local models (e.g. the mutagenicity of thiophenes) can often be combined to give a higher overall success rate and better interpretability than a single, global model that attempts to explain an entire endpoint (e.g. mutagenicity).

Another factor which determines the tractability of a problem is the level of experimental detail available on the biological entities being studied. Modeling of phase I drug metabolism has been greatly impacted by the high-resolution X-ray crystal structures of the human cytochrome P450 2C9 and 3A4 enzymes, which allow researchers to carry out high-level ab initio quantum mechanical calculations in order to predict possible sites of oxidative metabolism upon docking ligands into the active site of the enzyme. As our understanding of this system increases, our ability to predict the inhibition of the enzyme by pharmaceuticals will also increase.

As such, the endpoints which have been successfully modeled are varied; the greatest number of, and most mature, models exist for mutagenicity, rodent carcinogenicity (both genotoxic and non-genotoxic) and occupational toxicities (eye, skin and respiratory irritation/sensitization and lachrymation). As the quantity of high quality, controlled data increases, the predictive ability of the models developed from the data and the insights provided by those models will also increase.

Strengths and Weaknesses of Various Methodologies

There are currently two general approaches used to develop models which assess a compound's potential toxicological liabilities: quantitative structure-activity relationship (QSAR) derived models and expert system (ES) derived knowledge bases. Hybrid approaches which employ QSAR methods to determine the component rules of ES databases have also been developed. Each of these methods has had varying degrees of success in modeling toxicological endpoints. Local models of activity are often best classified by QSAR derived models (e.g. mutagenicity of anilines). Global models of a toxicological endpoint are often best represented by an expert system (e.g. mutagenicity of the world drug index) due to the general nature of the rules (e.g. aromatic compounds with good leaving groups often form DNA adducts). Both of these approaches may also be combined so that several class-specific models create a larger, global model.

QSARs develop an abstract representation of a series of compounds and seek to extract the features (descriptors) which best correlate with an observed activity. Developing a virtual depiction of a molecule is accomplished by the calculation of molecular indices (e.g. counting the number of rotatable bonds, halogens or heavy atoms) and properties (e.g. LUMO energy, polarizability or logP). These descriptors are then correlated with the experimental data using multivariate statistical methods (e.g. multiple linear regression, principal component analysis or Shannon entropy methods) to keep only those descriptors with strong resolving power. QSAR models can be used to build structural models of theoretically ideal ligands (such as Comparative Molecular Field Analysis and pharmacophore models) given the current data. In addition, this class of model can give a quantitative estimate as to the likelihood of a novel compound having the activity of interest. It is readily apparent that QSARs benefit most from two things: a) a large pool of high quality data; and b) a single endpoint assay with few mechanisms of action. As the number of molecules modeled increases, the resolving power of the selected descriptors increases. Likewise, if the assay readout to be modeled is a result of multiple mechanisms, the likelihood of finding simple descriptors with a correlation to activity is low, as the signal from any single mechanism is low. Conversely, QSAR models perform extremely well when assays generate a large volume of data through a single mechanism of action, as most high and medium throughput assays are based on single mechanisms.
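The descriptor-filtering step described above can be illustrated with a minimal sketch. It assumes a simple univariate Pearson correlation filter rather than any of the multivariate methods named in the text, and the descriptor names, values, activities and the 0.8 threshold are all hypothetical toy choices.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant descriptor carries no resolving power
    return cov / math.sqrt(var_x * var_y)


def select_descriptors(descriptors, activity, min_abs_r):
    """Keep only descriptors whose |r| with activity meets the threshold."""
    kept = {}
    for name, values in descriptors.items():
        r = pearson(values, activity)
        if abs(r) >= min_abs_r:
            kept[name] = r
    return kept


# Hypothetical toy data: four compounds, three descriptors.
toy_descriptors = {
    "n_halogens": [0, 1, 2, 3],
    "n_rotatable_bonds": [2, 2, 2, 2],  # constant, so uninformative
    "logP": [1.0, 3.0, 2.0, 2.5],
}
toy_activity = [1.0, 2.1, 2.9, 4.2]

selected = select_descriptors(toy_descriptors, toy_activity, min_abs_r=0.8)
print(sorted(selected))
```

With these toy values only the halogen count survives the filter. A real workflow would also need to handle collinear descriptors and validate on held-out compounds.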

Expert systems are built upon a body of knowledge obtained from specialists and/or literature sources. This information, which can range from "rule of thumb" relationships to experimentally determined relationships (e.g. kinase-substrate pairs) or structure-activity data from compounds assayed using the same protocol, is collated, evaluated and then entered into a rules-based system which describes these cause-effect relationships (i.e. if substructure is aniline and metabolic activation present then genotoxic else nongenotoxic). Perhaps the best known such relationships in predictive toxicology are the Ashby-Tennant structural alerts for carcinogenicity, which have been the basis of numerous expert system databases to identify unwanted activity in compounds. Because each of the rules of an expert system forms a component of a decision tree, the collection of rules can be easily combined to model global endpoints. Expert systems alone do not provide percent likelihood estimates of an outcome occurring; they merely indicate that it is a possibility (i.e. binary relationship). In order to quantitatively predict an event, predicate logic methods must be applied, as in the DEREK system (discussed below).
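The if/then form of such rules can be sketched in a few lines. This is only an illustration of the rule structure, not how DEREK or any other product is implemented: real systems match substructures against a chemical graph, whereas here each compound is assumed to arrive with a pre-annotated set of feature labels, and the `Rule` class and rule list are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One structural alert: required features plus an activation condition."""
    name: str
    required_features: frozenset
    needs_metabolic_activation: bool


def fired_rules(features, metabolic_activation, rules):
    """Return the names of all rules whose conditions the compound meets."""
    fired = []
    for rule in rules:
        if not rule.required_features <= features:
            continue  # a required substructure label is missing
        if rule.needs_metabolic_activation and not metabolic_activation:
            continue  # the alert only applies with metabolic activation
        fired.append(rule.name)
    return fired


# Hypothetical rule base containing only the aniline example from the text.
RULES = [Rule("aniline alert", frozenset({"aniline"}), True)]

with_activation = fired_rules({"aniline", "halogen"}, True, RULES)
without_activation = fired_rules({"aniline", "halogen"}, False, RULES)
print(with_activation, without_activation)
```

The output is binary in the sense described above: a rule either fires or it does not, with no likelihood attached.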

The methods described above, QSARs and ESs, can be combined to give greater predictive ability or can each be utilized in the development of the other. A QSAR model can be developed using fragment-based descriptors (e.g. partitioning a molecule at its rotatable bonds) and then these fragments scored for predictive ability. This database of statistically significant fragments can then be used as the expert system. Such methods have been widely utilized by regulatory agencies and are routinely part of the safety assessment process. One such method, the MultiCASE algorithm (Klopman and Macina 1985), will be used extensively in this work to develop novel fragments for the variety of datasets discussed within.

The goal of this paper is not to introduce a novel modeling methodology for toxicology or to present yet another model for a given set of data, but to highlight what can be done with available technologies in order to boost the productivity and predictive accuracy of currently available independent vendor software (IVS) methods. To accomplish this, we will directly compare the performance of these differing methods across a diverse set of chemotypes using a controlled dataset consisting of results from the SOS Chromotest, a genotoxicity screening assay for mutagenicity. Compound libraries for secondary and aromatic amines, thiophenes and polycyclic aromatic compounds have been assayed at Bristol-Myers Squibb. QSAR models for many of these data sets have been developed and published previously and compared to DEREK, Topkat and MultiCASE (He, Jurs et al. 2003; Mattioni, Kauffman et al. 2003; Mosier, Jurs et al. 2003). We have used the MultiCASE algorithm to extract structural fragments with statistical resolving power and will discuss how this new knowledge could be reincorporated into the overall toxicology assessment paradigm to enhance its overall predictive ability.

Methods

Compound Selection

The data employed in this publication have been presented elsewhere and we will give only a description of the methods used to choose compounds and calculate descriptive relationships. All libraries were selected, purchased and assayed at Bristol-Myers Squibb. The compounds comprising these datasets of aromatic amines, thiophenes and polycyclic aromatic compounds were selected based upon their availability from Sigma-Aldrich, in pure form, and their drug-likeness, as profiled using an expanded Lipinski filter. Of the multitude of compounds which could conceivably qualify for this study, the most chemically diverse subset was selected. Chemical diversity was based upon clustering of the Kier-Hall electrotopological indices (Kier and Hall 1990), molecular weight, and various atom and ring counts. Compound selection was then further enhanced by the selection of specific groups to develop local structure-activity relationships (i.e. para versus ortho substitution about a ring).
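The idea of picking a chemically diverse subset from descriptor vectors can be sketched as follows. The text does not specify the clustering procedure that was used, so this example swaps in a different and simpler technique, greedy MaxMin selection, purely for illustration. The two-dimensional vectors are hypothetical stand-ins for scaled descriptors such as electrotopological indices, molecular weight and ring counts.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maxmin_pick(vectors, n_pick, seed_index=0):
    """Greedy MaxMin: repeatedly add the compound farthest from those picked."""
    picked = [seed_index]
    while len(picked) < min(n_pick, len(vectors)):
        best_index, best_distance = None, -1.0
        for i, vector in enumerate(vectors):
            if i in picked:
                continue
            # Distance to the closest already-picked compound.
            nearest = min(euclidean(vector, vectors[j]) for j in picked)
            if nearest > best_distance:
                best_index, best_distance = i, nearest
        picked.append(best_index)
    return picked


# Hypothetical, already-scaled descriptor vectors for five compounds.
toy_vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1), (10.0, 0.0)]

diverse_subset = maxmin_pick(toy_vectors, n_pick=3)
print(diverse_subset)
```

Near-duplicate compounds (indices 0 and 1, or 2 and 3) are never selected together in this toy case. Descriptors on different scales would need to be normalized first so that no single property dominates the distance.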

SOS Chromotest Assay

The SOS Chromotest was used in this paper as an alternative to the widely accepted bacterial reverse mutation assay, also known as the Ames test. The SOS test has been used extensively with many different chemical classes. A review of published genotoxicity data between 1982 and 1992 indicated that, for the 1776 compounds evaluated, the SOS Chromotest had 90% concordance with the Ames mutagenicity test (Quillardet and Hofnung 1993). The assay is a simple and rapid test for mutagenicity that requires only a few milligrams of compound. These benefits, along with the high concordance with the Ames test, make the SOS Chromotest a good tool for the purposes of this paper.

The SOS Chromotest assay is a colorimetric assay that measures induction of a lacZ reporter gene in response to DNA damage (Hofnung and Quillardet 1988). The SOS pathway plays a leading role in the E. coli response to damage of nuclear material (Sutton, Smith et al. 2000). SOS induction is used as an early monitor for DNA damage because this pathway is sensitive to a broad spectrum of genotoxic substances. E. coli were modified with a lacZ reporter gene fused to an SOS gene, sfiA, with the endogenous lac sequence deleted, so all β-galactosidase activity is dependent upon sfiA induction. The strain was made more sensitive to genotoxic substances by increasing cell envelope permeability (rfa mutation) and by eliminating the excision repair pathway (uvrA mutation). In addition, constitutive expression of alkaline phosphatase, an SOS-independent gene, was included as a control for cytotoxicity. In response to DNA damage the SOS repair genes are induced, resulting in production of β-galactosidase, the gene product of lacZ. The assay readout is the fold increase in gene induction as determined by measuring β-galactosidase activity using o-nitrophenyl-β-D-galactopyranoside (ONPG). For the activity ranges of each data set, both in the presence and absence of S9 activation, see Table 1.
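A fold-increase readout of this kind can be sketched as a ratio of treated to control signal. The sketch below rests on several assumptions: that IMax denotes the largest fold increase observed across the tested doses, that a compound is called active when IMax reaches the 1.5-fold cutoff mentioned later in the Model Construction section, and that the comparison is inclusive. It deliberately omits any correction using the alkaline phosphatase control, and all function names and signal values are hypothetical.

```python
def fold_induction(treated_signal, control_signal):
    """Fold increase of a treated well over the untreated control."""
    if control_signal <= 0:
        raise ValueError("control signal must be positive")
    return treated_signal / control_signal


def imax(dose_signals, control_signal):
    """Largest fold induction across all tested doses (assumed meaning of IMax)."""
    return max(fold_induction(s, control_signal) for s in dose_signals)


def is_active(imax_value, cutoff=1.5):
    """Classify as active when IMax meets the cutoff (inclusive by assumption)."""
    return imax_value >= cutoff


# Hypothetical absorbance readings for three doses and one control well.
toy_imax = imax([0.20, 0.50, 0.90], control_signal=0.25)
print(round(toy_imax, 2), is_active(toy_imax))
```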

| Data Set | IMax (-S9) Min | IMax (-S9) Max | IMax (+S9) Min | IMax (+S9) Max |
|---|---|---|---|---|
| Thiophenes | 0.88 | 8.01 | 0.9 | 6.08 |
| Polycyclic Aromatic Compounds | 0.84 | 7.29 | 0.91 | 9.37 |
| Secondary and Aromatic Amines | 0.81 | 11.66 | 0.85 | 10.57 |

Table 1. Chart of activity ranges for the different data sets discussed in the text. Activity is the fold increase in fluorescence over control.

Computational Assessment Tools - ADAPT

ADAPT is a neural network based QSAR modeling environment (Jurs, Hasan et al. 1983). Briefly, relationships are derived using a probabilistic neural network which evolves the optimal descriptor set to partition active from inactive compounds (i.e. mutagens from non-mutagens). The ADAPT software package is a product of the Jurs research group and can be compiled to run on many platforms. Model construction using ADAPT follows several distinct phases: structure geometry optimization and modeling, descriptor generation, training and prediction set formation, objective feature selection, model building and model validation. The exact procedure has been published in the works describing the specific models and will not be discussed here. The ADAPT system has been used to model a variety of toxicologically relevant endpoints.

Computational Assessment Tools - DEREK

DEREK is an expert system created specifically for toxicological endpoints and assessments (Greene, Judson et al. 1999). Originally developed and implemented by Schering Agrochemicals, DEREK is currently available on two platforms from two sources. The IRIX instance of DEREK is available from the Lhasa Group at Harvard University (Corey, Long et al. 1985). The PC version is available from Lhasa, Ltd. UK (Greene, Judson et al. 1999; Judson, Marchant et al. 2003) and has an expanded rule base relative to the IRIX version. As such, the PC version was used for all work presented here (versions 5.0 and 6.0). PC DEREK (referred to hereafter as DEREK) covers a broad range of endpoints, with its greatest strengths in carcinogenicity (both genotoxic and non-genotoxic), mutagenicity, skin irritation and sensitization. The rule set in DEREK is researched, generated and validated through a collaborative effort involving researchers from academia, industry and the staff of Lhasa. The SARs represented by these rules are encoded as substructures (with the ability to handle exceptions) which are subsequently located in a novel compound. Novel SARs can be incorporated by the user via an ISIS Draw input method.

Computational Assessment Tools - BMS DEREK

"BMS DEREK" is a heavily modified version of DEREK 5.0 for PC. The source code was ported to UNIX with the DEREK engine processing the compounds. BMS DEREK differs from DEREK 5.0 in that it contains thirteen new mutagenicity rules along with six carcinogenicity substructural alerts. The recategorized rules were found to give DEREK higher overall sensitivity with a minimal impact on concordance, as evaluated on a BMS validation set.

Computational Assessment Tools - MultiCASE

At its operational core, MultiCASE is a QSAR-derived expert system. The range of activity can be rescaled either automatically by the application or by the user. The descriptors generated are connected structural fragments which are from two to ten non-hydrogen atoms in length (referred to as biophores). Once a library of all possible biophores has been constructed, various statistical tests are applied to determine the resolving power of given biophores to correlate to an activity. These biophores are then assigned the average activity of all of the molecules in which they are contained (the CASE algorithm). This procedure is then recursively applied to each biophore to find additional biophores which serve to modify activities (activating and deactivating groups) with their effect being quantified in a manner similar to the above (the "Multi" in MultiCASE). MultiCASE then assesses a compound by giving it a score, calculated as the sum of all biophores present, with additional weight being applied for bioavailability.
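The core idea of assigning each fragment the average activity of the molecules that contain it can be sketched on a toy representation. This is not the MultiCASE algorithm: real biophores are connected subgraphs of a molecular structure and are filtered by statistical tests, whereas this sketch assumes each molecule is a simple linear sequence of atom symbols, enumerates contiguous runs of two to ten atoms, and applies no significance testing. All molecules and activities are hypothetical.

```python
from collections import defaultdict


def linear_fragments(atoms, min_len=2, max_len=10):
    """All distinct contiguous runs of min_len..max_len atoms in a sequence."""
    fragments = set()
    for start in range(len(atoms)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(atoms):
                break
            fragments.add(tuple(atoms[start:end]))
    return fragments


def fragment_mean_activity(molecules):
    """Map each fragment to (mean activity, count) over molecules containing it."""
    totals = defaultdict(lambda: [0.0, 0])
    for atoms, activity in molecules:
        # A set is used so each fragment counts once per molecule.
        for fragment in linear_fragments(atoms):
            totals[fragment][0] += activity
            totals[fragment][1] += 1
    return {frag: (total / n, n) for frag, (total, n) in totals.items()}


# Hypothetical molecules as (atom sequence, activity in arbitrary units).
toy_molecules = [
    (["C", "S", "C"], 40.0),
    (["C", "S", "N"], 20.0),
    (["C", "C", "O"], 10.0),
]

fragment_table = fragment_mean_activity(toy_molecules)
print(fragment_table[("C", "S")])
```

Here the C-S fragment occurs in two molecules and receives their mean activity. A fragment seen in only one or two molecules gets an equally confident-looking number, which is why the frequency of each fragment matters when judging whether to trust it.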

Statistical Definitions Used

See Table 2 for definitions of the terms used and the confusion matrix which relates the terms to each other. The overall agreement between a predictive method and the experimental data is given by the concordance:

$$\text{Concordance} = \frac{PA + NA}{PA + ND + PI + NA + PD + NI}$$

The ability of a predictive method to correctly assess the positive compounds is given by the sensitivity of a method:

$$\text{Sensitivity} = \frac{PA}{PA + ND + PI}$$

The ability of a predictive method to correctly assess the negative compounds is given by the specificity:

$$\text{Specificity} = \frac{NA}{NA + PD + NI}$$

The probability of an in silico method to correctly assess a compound by random chance is the frequency of success given the known distribution of positive and negative compounds in the data set. It is defined as:

$$\text{Random} = \frac{(PA + ND + PI)^2 + (NA + PD + NI)^2}{(PA + ND + PI + NA + PD + NI)^2}$$

Another method used to evaluate categorical classification models is the kappa (κ) statistic. In the notation used here, it is defined as:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

with $P_o$ equivalent to the concordance as defined above:

$$P_o = \frac{PA + NA}{PA + ND + PI + NA + PD + NI}$$

and $P_e$ as the random success rate weighted by the method's sensitivity and specificity:

$$P_e = \frac{PA\,(PA + ND + PI) + NA\,(NA + PD + NI)}{(PA + ND + PI + NA + PD + NI)^2}$$

κ can best be thought of as a concordance adjusted for the random success rate of a model. Models with values for κ above 0.5 are considered to have overall predictive ability when classifying compounds into bins (Agresti 1996).

| | Experimental Positives | Experimental Negatives |
|---|---|---|
| Computational Positives | Pos Agreement (PA) | Pos Deviation (PD) |
| Computational Negatives | Neg Deviation (ND) | Neg Agreement (NA) |
| Computational Indeterminate | Pos Indeterminate (PI) | Neg Indeterminate (NI) |

Table 2. Confusion Matrix for calculations in this text
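The statistics above can be computed directly from the six cells of the confusion matrix. The following is a minimal sketch that assumes indeterminate predictions count toward the experimental class totals, as the denominator of the expression for Pe implies, and that κ takes the usual (Po - Pe)/(1 - Pe) form. The function name and the example counts are hypothetical and are not taken from the data sets in this paper.

```python
def classification_stats(pos_agree, pos_dev, neg_dev, neg_agree,
                         pos_indet=0, neg_indet=0):
    """Concordance, sensitivity, specificity, random rate and kappa."""
    # Experimental class totals; indeterminates are included by assumption.
    positives = pos_agree + neg_dev + pos_indet
    negatives = neg_agree + pos_dev + neg_indet
    total = positives + negatives

    concordance = (pos_agree + neg_agree) / total
    sensitivity = pos_agree / positives
    specificity = neg_agree / negatives
    # Chance agreement from the class distribution alone.
    random_rate = (positives ** 2 + negatives ** 2) / total ** 2
    # Chance agreement weighted by the method's own hit rates.
    p_e = (pos_agree * positives + neg_agree * negatives) / total ** 2
    kappa = (concordance - p_e) / (1 - p_e)

    return {
        "concordance": concordance,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "random": random_rate,
        "kappa": kappa,
    }


# Hypothetical counts: PA=30, PD=10, ND=8, NA=50, PI=2, NI=0.
toy_stats = classification_stats(30, 10, 8, 50, pos_indet=2, neg_indet=0)
for name, value in toy_stats.items():
    print(f"{name}: {value:.3f}")
```

For these counts the concordance is 0.80 against a random rate of 0.52, which gives a κ of about 0.66.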

Model Construction

Results from ADAPT models which have been published prior to this work are reported here. Briefly, the compounds were taken into ADAPT after their structures were optimized using CORINA and MOPAC. The data were then partitioned into multiple TS and VS using a bagging algorithm. Multiple models were thus generated and the overall consensus model is the "majority rules" vote of the individual models. In this way all compounds were predicted at least once. The DEREK expert system evaluates compounds using substructure-based searches. Standard and BMS-modified DEREK were both used to predict the mutagenic activity of all compounds without any specific models being built using these data (the entire data set was treated as the PS). In order to make an equivalent comparison, both the MultiCASE A2I Salmonella mutagenicity module and a model trained in an equivalent fashion to the ADAPT models were used. Briefly, the data were partitioned exactly as described in the original ADAPT-based models (e.g. identical compound distributions were utilized for the TS and VS). The activities were then scaled automatically in MultiCASE so that the activity range would be between zero (inactive) and 99 (most active), with the active-inactive cutoff at twenty-nine units (corresponding to the experimental cutoff of 1.5-fold activity in the SOS Chromotest assay). Studies varying the cutoff value by ± ten percent showed a minimal impact on the sensitivity and concordance of the resulting models (unpublished data). The models thus generated were then each allowed to vote for the membership of a compound in a given class ("mutagenic" or "non-mutagenic").
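The "majority rules" consensus described above can be sketched as a simple vote count over the individual model predictions for one compound. How ties were resolved is not stated in the text, so this sketch assumes a tie is reported as indeterminate. The function name and labels are illustrative only.

```python
from collections import Counter


def consensus_vote(votes):
    """Return the majority label, or 'indeterminate' on a tie (assumed)."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    # A tie for first place gives no majority.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "indeterminate"
    return top_label


# Hypothetical predictions from three and from two individual models.
clear_call = consensus_vote(["mutagenic", "mutagenic", "non-mutagenic"])
tied_call = consensus_vote(["mutagenic", "non-mutagenic"])
print(clear_call, tied_call)
```

Using an odd number of models avoids ties in a two-class problem, which is one practical reason to choose the ensemble size with the vote in mind.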

Results

Tables outlining the various methods' predictive abilities are given (Tables 3-5). In all cases, the ADAPT QSAR consensus model (a majority rules voting of several individual models) had the best concordance/κ value, ranging from 95%/0.88 (thiophenes) to 72%/0.45 (aromatic and secondary amines). The MultiCASE Salmonella mutagenicity module, the A2I database, performed better than random for only the thiophene data set, with a concordance of 65%, but had a low κ value (0.39). In the remaining two data sets, it had the worst overall performance, with a concordance of 24% in the polycyclic aromatic compounds and a concordance of 32% in the aromatic and secondary amines. In two of the three data sets, BMS DEREK had the highest sensitivity, ranging from 100% (thiophenes) to 90% (secondary and aromatic amines), but always had a low κ value. The polycyclic aromatic compounds were not classified well by the expert systems approach, chiefly due to the lack of rules for heterocyclic compounds. The high sensitivity reported in the BMS DEREK models is offset by their having amongst the lowest specificity, ranging from 0% (thiophenes) to 35% (polycyclic aromatic compounds). Analysis of the specific performance of the expert system based approaches has been reported elsewhere (He, Jurs et al. 2003; Mattioni, Kauffman et al. 2003; Mosier, Jurs et al. 2003) and will only be touched upon here.

| | BMS DEREK | DEREK v5.0 | DEREK v6.0 | MultiCASE (A2I) | ADAPT Consensus | MultiCASE SOS |
|---|---|---|---|---|---|---|
| Concordance | 30% | 60% | 60% | 65% | 95% | 50% |
| Sensitivity | 100% | 17% | 17% | 33% | 83% | 83% |
| Specificity | 0% | 79% | 79% | 79% | 100% | 36% |
| κ | 0.2378 | 0.3135 | 0.3135 | 0.3864 | 0.8816 | 0.3324 |

Table 3. Thiophenes. SOS (40+/100-); Random 58% Concordance

| | BMS DEREK | DEREK v5.0 | DEREK v6.0 | MultiCASE (A2I) | ADAPT | MultiCASE SOS |
|---|---|---|---|---|---|---|
| Concordance | 37% | 58% | 55% | 24% | 81% | 77% |
| Sensitivity | 48% | 39% | 39% | 59% | 74% | 28% |
| Specificity | 35% | 62% | 58% | 17% | 84% | 87% |
| κ | 0.1520 | 0.2482 | 0.2318 | 0.1228 | 0.5197 | 0.4058 |

Table 4. Polycyclic Aromatic Compounds. SOS (46+/231-); Random 72% Concordance

| | BMS DEREK | DEREK v5.0 | DEREK v6.0 | MultiCASE (A2I) | ADAPT Consensus | MultiCASE SOS |
|---|---|---|---|---|---|---|
| Concordance | 32% | 61% | 61% | 32% | 72% | 69% |
| Sensitivity | 90% | 58% | 58% | 42% | 69% | 58% |
| Specificity | 17% | 62% | 62% | 30% | 74% | 72% |
| κ | 0.2044 | 0.3341 | 0.3341 | 0.1414 | 0.4452 | 0.4056 |

Table 5. Secondary and Aromatic Amines. SOS (69+/265-); Random 67% Concordance.

Thiophenes

The mutagenic potential of a thiophene is linked to formation of a stable epoxide. If the epoxide is stabilized at C2, then the likelihood of mutagenicity increases. Likewise, in systems which stabilize the formation of the S-oxide, nucleophilic attack at C2 is further stabilized leading to an increased possibility of reactivity with protein and/or DNA. For a more complete discussion of the mutagenicity of thiophenes, see (Mosier, Jurs et al. 2003).

The thiophene data set was the smallest of all the data sets, and was thus the only one for which a consensus model was not built (the results given are for a single model in ADAPT and the SOS-trained MultiCASE approach). The MultiCASE fragment-based approach derived four biophores, given in Table 6 and illustrated in Figure 2. As is evident from the table, there was very little representation in the PS of any of the biophores derived. Furthermore, while the overall positive accuracy of these biophores is quite high (69%), the biophore corresponding to the primary alcohol was derived from a small sample (N = 2) of marginal activity (29 CU, the cutoff between active and inactive). While the MultiCASE algorithm found this a significant biophore, the author would not utilize it in any further development of structural rules for thiophene (or any other) genotoxicity. Removal of this biophore from the model increases the positive accuracy to 91.7%. These four biophores do successfully select the genotoxic compounds from the PS, as is reflected in the very high sensitivity (83%), but do not do an acceptable job of partitioning out the non-genotoxic compounds (36% specificity). This leads to an overall concordance of 50% and a κ of 0.33, worse than random.

| KLN Fragment | Average Frequency in TS | Average Activity | Frequency in PS | Positive Accuracy |
|---|---|---|---|---|
| cH"-S -c > = | 14.0 | 35 CU | 4 | 75% |
| COH-c = | 9.0 | 35 CU | 3 | |
| c =c. - | 4.0 | 29 CU | 2 | 100% |
| OH -CH2- | 2.0 | 29 CU | 2 | 0% |

Table 6. Analysis of Statistically Significant Thiophene Fragments (single model). CU is the CASE Unit of activity.

98-03-3 636-72-6 19952-47-7
