Structure-based property models to support lead optimization

There has been significant activity in recent years toward developing structure-based predictive models of important biopharmaceutical properties for use in designing successful drug candidates (Ekins et al., 2000; Butina et al., 2002; Lombardo et al., 2003; Stouch et al., 2003; Wilson et al., 2003; van de Waterbeemd and Gifford, 2003; Beresford et al., 2004). Such models may be useful in the optimization campaign, helping to identify promising structural analogs that improve on the properties found to be most important in the context of the PB-PK model. As shown in Figure 6, the major ADME events can be considered in terms of the solute biopharmaceutical processes that determine them. These processes can in turn be related to the solute structural and/or physicochemical parameters that matter most for the specific property.

Figure 6. Relationship of biopharmaceutical processes and solute structural properties to ADME events. These dependencies form the basis for predictive structure-property relationships (SPR's) which can be used in the lead optimization program.

These dependencies then form the basis for developing predictive structure-property relationships (SPR's) which can be used in lead optimization.

In the next few sections we will review some characteristics of different SPR's in the context of specific lead optimization strategies. The intent is not to review the fairly extensive SPR literature comprehensively, but rather to consider general issues that should be addressed when implementing any given property model.


Solute solubility, as has been shown, is an important parameter in the oral absorption process. Generally, aqueous solubility is evaluated early in the lead selection process, and minimally acceptable limits are identified for taking a compound forward (Lipinski, 2000). However, purely aqueous buffer is not among the environments a drug will encounter in vivo. In the intestinal lumen especially, various solubilizing agents are present which, in many cases, may improve the effective solubility of a solute in vivo compared to that predicted from the purely aqueous point of view. Similarly, lipids and proteins in blood may increase the circulating solute concentration far beyond that expected for certain "insoluble" compounds. In spite of these caveats, aqueous solubility is an important reference point which needs to be considered, and it has been the subject of considerable modeling effort (Abraham and Le, 1999; Jorgensen and Duffy, 2000; Huuskonen, 2000; McElroy and Jurs, 2001; Gao et al., 2002).

Developing structure-based solubility models from a mechanistic perspective requires specific consideration of the solvation process at a molecular level. The forces involved in solubility are solute-solute, solvent-solvent and solute-solvent interactions (Kamlet et al., 1986). These are the same considerations that apply in solute partitioning between two immiscible liquid phases. To a first approximation, then, the aqueous solubility of a liquid solute can be modeled as partitioning of the solute between itself, or an organic solvent surrogate, and water. Consistent with this simple model, it has been possible to correlate aqueous solubility directly with octanol-water partition coefficients for liquid solutes (Hansch et al., 1968; Valvani et al., 1981).
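The kind of correlation described above can be sketched as a straight-line fit of log solubility against log P. This is a minimal illustration only, not the published equations of Hansch et al. or Valvani et al.: the function names are hypothetical, and the numbers are placeholders chosen to lie exactly on a line, not measured data.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder values for illustration only (NOT experimental data).
log_p = [1.0, 2.0, 3.0, 4.0]      # octanol-water log P of liquid solutes
log_s = [-0.5, -1.5, -2.5, -3.5]  # log aqueous solubility

slope, intercept = fit_line(log_p, log_s)


def predict_log_s(log_p_value):
    """Predict log solubility of a liquid solute from its log P."""
    return slope * log_p_value + intercept
```

A fit of this form applies only to liquid solutes; it contains no term for the crystal lattice forces that arise with solids.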

In the case of solids, solute-solute forces existing in the crystal lattice complicate the prediction of solubility. At present, no reliable methods are available for estimating the energetics of these solid-state, solute-solute interactions, which must be overcome for the solute to leave the crystal and dissolve in water. An alternative approach to predicting the solubility of solids is the use of empirical, correlative models derived from collections of experimental data. Numerous examples of such models have been published, differing generally in the composition of the training data set, the descriptor representation of the solutes, and the statistical method for establishing the correlation (Abraham and Le, 1999; Huuskonen, 2000; Katritzky et al., 2000; McFarland et al., 2001; McElroy and Jurs, 2001; Jorgensen and Duffy, 2002; Gao et al., 2002). Since these are empirical models, they are generally more accurate if the structure of the query molecule is similar to those in the training set. Thus these models strive to employ sufficient structural diversity to be useful in the entire chemistry space of interest to the medicinal chemist. Further, given the correlative nature of the model, its accuracy can be no better than the experimental accuracy of the training set data, which is generally assumed to be about 0.5 log units (Myrdal et al., 1995; Katritzky et al., 1998).

An additional complication in the development and implementation of structure-based models is the influence of the solute's ionization state on its apparent solubility at a given pH. Since many drugs are either weakly acidic or basic, this is an important consideration. The solubility of such substances depends upon the intrinsic solubility of the neutral species and the relative concentrations of neutral and charged species at the pH of interest (Yalkowsky, 1999). These concentrations depend upon the relative values of the pKa(s) of the solute and the pH of the medium. From a practical perspective this means that in training a model with such solutes, only the intrinsic solubility should be used. If the model is developed from apparent solubility, the relationship of pKa to the pH at which the data were measured must be taken into consideration, adding considerable complexity to the model. Further, if the model predicts a low intrinsic solubility, this may be somewhat misleading if the solute is expected to be highly charged at physiological pH. In such a situation, it is desirable to combine a pKa estimation algorithm with an intrinsic solubility model in order to predict apparent solubility at a relevant pH.
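The dependence of apparent solubility on pKa and pH can be illustrated with a short sketch. It assumes a solute with a single ionizable group and the standard Henderson-Hasselbalch relationship, and it ignores salt precipitation, aggregation and activity corrections. The function name and the example values are hypothetical and are not taken from the cited work.

```python
def apparent_solubility(intrinsic, pka, ph, kind):
    """Apparent (total) solubility of a monoprotic weak acid or base.

    intrinsic: solubility of the neutral species, in any concentration unit
    pka:       pKa of the single ionizable group (measured or estimated)
    ph:        pH of the medium
    kind:      "acid" or "base"
    """
    if kind == "acid":
        # Ratio of ionized to neutral species for a weak acid.
        ionized_to_neutral = 10 ** (ph - pka)
    elif kind == "base":
        # Ratio of ionized to neutral species for a weak base.
        ionized_to_neutral = 10 ** (pka - ph)
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    # Neutral species saturates at the intrinsic solubility;
    # the ionized fraction adds to the total dissolved amount.
    return intrinsic * (1 + ionized_to_neutral)


# Illustrative values only: a weak acid with pKa 4.4 in a pH 7.4 medium.
acid_at_ph_7_4 = apparent_solubility(1.0, 4.4, 7.4, "acid")
```

Under these assumptions, a weak acid three pH units above its pKa has an apparent solubility roughly a thousand times its intrinsic solubility, which is why a low predicted intrinsic solubility alone can be misleading.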

Finally, as with most statistical models, performance is measured by the model's prediction capability, and this depends on the chemical space on which the model was trained. With commercial packages, the user generally does not know on which chemical classes the model was trained. When large errors are observed between the predicted and experimental values, it is assumed that the model is not useful. This is one of the main reasons that local solubility models are developed. Here, a local model is one representing a relatively focused, homogeneous region of chemical space, while global models are developed over much greater structural diversity.

In using such models it is important to understand the limitations of the model in terms of applicability and accuracy in different chemistry and property spaces. Toward that end, a study was undertaken at Pharmacia to compare accuracy and bias in prediction for three commercially available solubility models and a model developed at Pharmacia, over different regions of chemistry space and three different solubility levels (Crimin et al., 2004). The four solubility models were: ACD Labs PhysChem v. 6.0, QikProp v. 2.1, Cerius2 and the in-house model (Gao et al., 2002). The in-house solubility model was trained on literature and in-house data. The training sets for the three commercial models were not known.

In the study, six data sets containing a total of 713 compounds were utilized. Four of the sets were local collections of data developed from internal discovery programs. They represent relatively homogeneous structural features generally arising from optimization around a biologically relevant template. One (data set 2) was a structurally diverse collection of Pharmacia data from a variety of discovery program teams and was used for continuous testing of the "in-house" solubility model (Gao et al., 2002). Finally, the last set, data set 5, contained 85% simple, monofunctional organic compounds and the remaining 15% drugs and pesticides.

The experimental solubility was compared to the predicted solubility obtained from each of the four solubility models. Figure 7 shows the root mean squared prediction error (RMSPE) for all six data sets with the four models. RMSPE represents the accuracy of the prediction model in these specific data sets and is assumed to reflect the similarity of the query structure to the structures in the model's training space. Only in the case of data sets 4 and 6 were all the compounds predicted by all four models. In the other cases, one or more of the models failed to predict solubility for some of the structures in the sets. From these results, the in-house model has the smallest RMSPE in three of the six data sets (1, 2, 3). This is due, in part, to the fact that the training set of this model includes compounds that are similar, but not identical, to structures in those data sets. Similarly, QikProp and Cerius2 perform comparably to the in-house model for data sets 1 and 3, respectively. None of the models performs particularly well with data set 6, suggesting this is an especially challenging region of chemistry space. The implication of these results is that, while all the models are accurate in specific applications, none of them performs well over all the structural diversity present in the test sets. From the perspective of prioritizing structures to synthesize based on predicted solubility properties, the existence of such biases should be taken into consideration.
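The RMSPE comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of Crimin et al.: the function name is hypothetical, bias is taken here to be the mean signed error (predicted minus observed), compounds a model could not predict are represented as None and skipped, and the numbers are placeholders, not data from the study.

```python
import math


def prediction_stats(observed, predicted):
    """Return (rmspe, bias, n_predicted) for paired log-solubility values.

    Entries where the model returned no prediction (None) are skipped,
    so n_predicted may be smaller than len(observed).
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    errors = [p - o for o, p in zip(observed, predicted) if p is not None]
    if not errors:
        raise ValueError("no predictions to evaluate")
    n = len(errors)
    rmspe = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n  # assumed definition: mean signed error
    return rmspe, bias, n


# Placeholder log-solubility values for illustration only.
observed = [-3.0, -4.0, -5.0, -2.0]
predicted = [-2.0, -5.0, None, -1.0]  # None: the model gave no prediction

rmspe, bias, n_predicted = prediction_stats(observed, predicted)
```

Reporting the number of compounds actually predicted alongside RMSPE matters when comparing models, since a model that declines to predict difficult structures is scored on a smaller, possibly easier subset.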

Figure 7. Root mean squared prediction error (RMSPE) for the six data sets with the four solubility models.
