Genet Program Evolvable Mach (2007) 8:413–432 DOI 10.1007/s10710-007-9040-z ORIGINAL PAPER
Genetic programming for computational pharmacokinetics in drug discovery and development
Francesco Archetti · Stefano Lanzeni · Enza Messina · Leonardo Vanneschi
Received: 15 November 2006 / Revised: 15 June 2007 / Published online: 8 October 2007
© Springer Science+Business Media, LLC 2007
Abstract The success of a drug treatment is strongly correlated with the ability of a molecule to reach its target in the patient's organism without inducing toxic effects. Moreover, the reduction of the cost and time associated with drug discovery and development is becoming a crucial requirement for the pharmaceutical industry. Computational methods allowing reliable predictions of the properties of newly synthesized compounds are therefore of the utmost relevance. In this paper we discuss the role of genetic programming in predictive pharmacokinetics, considering the estimation of the absorption, distribution, metabolism, excretion and toxicity processes (ADMET) that a drug undergoes in the patient's organism. We compare genetic programming with other well known machine learning techniques according to their ability to predict oral bioavailability (%F), median oral lethal dose (LD50) and plasma-protein binding levels (%PPB). Since these parameters respectively characterize the percentage of the initial drug dose that effectively reaches the systemic blood circulation, the harmful effects, and the distribution into the organism of a drug, they are essential for the selection of potentially good molecules. Our results suggest that genetic programming is a valuable technique for predicting pharmacokinetics parameters, both in terms of accuracy and of generalization ability.

F. Archetti · S. Lanzeni · E. Messina · L. Vanneschi
D.I.S.Co., Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy
e-mail: [email protected]

S. Lanzeni
e-mail: [email protected]

E. Messina
e-mail: [email protected]

F. Archetti
Consorzio Milano Ricerche, Milan, Italy
e-mail: [email protected]; [email protected]
Keywords Computational pharmacokinetics · Drug discovery · Genetic programming
1 Introduction

Because of technical and scientific advances in combinatorial chemistry, high throughput screening (HTS), genomics, metabolomics and proteomics, pharmaceutical research is currently changing. In the traditional drug discovery process, once a target protein has been identified and validated, the search for lead compounds begins with the design of a structural molecular fragment (scaffold) with therapeutic potency. Libraries of millions of chemical compounds built on the identified active fragment are then tested and ranked according to their specific biological activities. After these assays, some candidate drugs are selected from the library for more specific functional screenings (see Fig. 1a). Although proteomics research has generated an impressive number of previously unknown target structures, their biological validation is still a hazardous task which, as indeed has happened recently, can lead to failures in drug development projects. It is interesting to remark that, both in 1991 [1] and in 2000 [2], a considerable fraction of attritions (i.e. failures of compound development) in pharmacological development was generated at the level of pharmacokinetics and toxicology (see Fig. 1b). In fact, good drugs must not only show good target binding, but must also follow a proper route into the human body without causing toxic effects. More precisely, orally administered drugs undergo the route depicted in Fig. 2. First of all they have to be absorbed through the gut wall and then enter the hepatic circulation
Fig. 1 (a) The process of drug discovery from target protein identification to candidate drugs: the identification and validation of the target are followed by lead discovery and optimization. (b) Reasons for failure in drug development in 1991 and 2000: clinical safety (black), efficacy (red), formulation (green), bioavailability (blue), commercial (yellow), toxicology (gray), cost of goods (purple) and others (white), color version available online only
Fig. 2 Anatomical view of ADMET processes. From the stomach, drugs are digested and pass, via gut wall absorption, into the portal vein. Here they enter the liver, where metabolism tries to demolish them. The fraction of the initial dose that exits the liver represents the drug's oral bioavailability
in the portal vein. Carried by the blood flux and possibly bound to plasma proteins, molecules arrive in the liver, where biochemical processes that try to demolish them take place. Only the fraction of the initially administered drug that exits the liver and enters the systemic blood circulation will be able to produce a therapeutic effect. Although the development of assays evaluating pharmacokinetics parameters has resulted in a reduction of the attrition rate related to these properties, failures still place an unacceptable burden on the budget of pharmaceutical companies. Thus, it is necessary to characterize in depth the behavior of pharmacological molecules in terms of absorption, distribution, metabolism, excretion and toxicity processes, collectively referred to as ADMET [3]. Automatic assessment of drug discovery by means of computer simulations is becoming a reality, and the development of computational tools applicable to ADMET profiling, also enabling the management of large and heterogeneous databases, would be of the utmost relevance [4, 5]. The availability of reliable pharmacokinetics prediction tools would reduce the risk of late-stage research failures and would decrease the number of experiments and laboratory animals used in pharmacological research, by optimizing the screening assays. Furthermore, predictive ADMET models would be of critical relevance for preventing Adverse Drug Reactions (ADRs) like those involved in the recent Lipobay-Baycol (cerivastatin) toxicity case [6], which can be very dangerous for patients. The potential of predictive modeling in terms of ADR prediction is one more reason why computational ADMET can be considered a hot research topic in medicine. In this paper, we empirically show that genetic programming (GP) [7] is a promising and valuable tool for predicting the values of oral bioavailability, median oral lethal dose and plasma-protein binding levels.
Human oral bioavailability (indicated with %F from now on) is the parameter that measures the percentage of the initial drug dose that effectively reaches the systemic blood circulation after passing through the liver. This parameter is particularly relevant, both because oral administration is usually the preferred way of supplying drugs to patients and because it is a representative measure of the quantity of active principle that can effectively exert its therapeutic effect. Oral bioavailability is determined by two key ADMET processes: absorption and metabolism. The Median Lethal Dose (indicated with LD50 from now on) is one of the parameters measuring the toxicity of a given compound.
More precisely, LD50 refers to the amount of compound required to kill 50% of the test organisms (cavies). It is usually expressed as the number of milligrams of drug per kilogram of body mass of the model organism (mg/kg). Depending on the specific organism (usually rat, mouse, dog, monkey or rabbit) and on the precise route of administration (generally intravenous, subcutaneous, intraperitoneal or oral) chosen in the experimental design, it is possible to define a spectrum of LD50 protocols. In our work we consider only the Median Lethal Dose measured in rats with the compound orally supplied, which is the most widely used protocol. Plasma-protein binding level (indicated with %PPB from now on) corresponds to the percentage of the initial drug dose that reaches blood circulation and binds to the proteins of plasma [8]. This measure is fundamental for good pharmacokinetics, both because blood circulation is the major vehicle of drug distribution into the human body and because only free (unbound) drugs can permeate membranes and reach their targets. Pharmacological molecules bind a variety of proteins in the blood, including albumin, α1-acid glycoproteins, lipoproteins and α-, β- and γ-globulins, and each of these proteins contributes to determining the binding level. In this paper we define %PPB as the binding level between the drug and all the proteins contained in the plasma, without distinguishing the single specific drug–protein interactions.

This paper is structured as follows: Sec. 2 describes the methods reported in the literature for ADMET in silico predictions. In Sec. 3 we present the experimental settings used in our experiments, defining all the parameters needed for a complete reproduction of the results. Section 4 shows the results obtained in the prediction of oral bioavailability (%F), median oral lethal dose (LD50) and plasma-protein binding levels (%PPB). Finally, Sec. 5 presents the concluding remarks and hints for future investigation.
2 State of the art and related work

Pharmacokinetics prediction tools are based on two approaches: molecular modeling, which uses intensive protein-structure calculations, and data modeling. Methods based on data modeling are widely reported in the literature, and all belong to the category of Quantitative Structure–Activity Relationship (QSAR) models [9]. The goal of such models is to define a quantitative relationship between the structure of a molecule and its biological activity. To obtain a Quantitative Structure–Activity Relationship model, it is necessary to collect a training set of drugs for which the biological activity parameters are known. A wide spectrum of features, called molecular descriptors, is calculated from the structural representation of each compound. Descriptors are computed from a graph-based representation of molecules, in which labeled nodes refer to atoms and labeled edges represent bonds between atoms. Two main categories of features are generally used in QSAR procedures: 2D chemical descriptors, based on a two-dimensional representation of compounds, and 3D chemical descriptors, whose computation is time-consuming since they are calculated from a three-dimensional molecular structure. For a detailed discussion of molecular descriptors, see for instance [10]. After the calculation of the molecular features, it is natural to try to formalize a
mathematical relationship between them and the biological target parameter, applying statistical or machine learning procedures. The ability to derive a good-performance QSAR model with considerable generalization power mainly depends on the learning algorithm, on the availability of relevant and accurate datasets and on the choice of a significant molecular representation. Unfortunately there exists, especially in the public sector, a relatively small number of datasets of desirable quality, even if a few noteworthy exceptions exist [11]. Yoshida and Topliss [12] trained a QSAR model, based on 'fuzzy adaptive least squares', to classify drugs into one of four predefined bioavailability ranges, using as structural input the presence/absence of typical functional groups most likely to be involved in metabolic reactions. Genetic Algorithms were used in Ref. [13] to select the best molecular descriptors, and Self Organizing Maps (SOMs) were used to place each molecule in a bioavailability class. Fröhlich et al. recently experimented with kernel methods (SVMs) for the problem of bioavailability prediction, basing their approach on the estimation of similarity between molecules with similar biological behavior [14]. Various kinds of multivariate and Partial Least Squares (PLS) regression, also coupled with recursive partitioning [15], have been used to estimate oral bioavailability. Drug toxicity prediction is a very difficult and challenging task: because of the complexity of the possible interactions between the organism and the pharmacological molecule, similar compounds may exhibit different toxic behaviors. A benchmark of mathematical methods, including recursive partitioning (RP) and Partial Least Squares regression (PLS) among others, has been tested by Feng et al. [16].
Multivariate linear regression and artificial neural networks have been applied to build a model of LD50 in fathead minnow (Pimephales promelas) for 397 organic chemicals [17]. For plasma-protein binding, a sigmoidal relationship between the percentage of drug bound and log D at pH 7.4 has been proposed [18], even if considerable scatter of the data around the predicted trend has been detected. A technique called Genetic Function Approximation (GFA) considering 107 molecular descriptors has been proposed in Ref. [19]. In the same work, the authors describe a QSAR study based on 80 compounds and 12 descriptors, showing a correlation coefficient equal to 0.91 in predicting human serum-albumin (HSA) binding. In 2002, Kratochwil et al. [20] proposed a generic model to predict drug association constants for HSA binding, using a pharmacophoric-similarity concept and Partial Least Squares, built on a dataset composed of 138 compounds. As Zupan and Gasteiger point out in Ref. [21], artificial neural networks are ubiquitously used for the estimation of pharmacokinetics parameters. Neural networks are also widely used in the existing commercial packages, which are usually developed by software vendors traditionally involved in the field of molecular modeling. The existing commercially available computational tools for predicting the toxic potential of drugs are analyzed in Ref. [22]. Among the leaders in this field, Accelrys Inc. [23] and Pharma Algorithms Inc. [24] essentially provide black-box mathematical models and/or data mining tools that can be used to build new predictors. Simulation Plus Inc. [25] adopts a slightly different strategy by using bagged artificial neural networks for ADMET parameter estimation, subsequently used for simulating the absorption process.
GP has not been extensively used in pharmacokinetics until now. Two noteworthy exceptions are given by Refs. [26] (where GP is used to classify molecules in terms of their bioavailability) and [27] (where GP is used to quantitatively predict the fraction of drug absorbed).

3 Experimental protocol settings

In this section, we first describe the procedures used for dataset collection and preparation, and then we focus on a description of the various evolutionary and non-evolutionary machine learning techniques we have employed.

3.1 Dataset collection and preparation

We have obtained a set of molecular structures and the corresponding %F, LD50 and %PPB values using the same data as in Ref. [12] and a public database of Food and Drug Administration (FDA) approved drugs and drug-like compounds [11]. Chemical structures are all expressed as SMILES code (Simplified Molecular Input Line Entry Specification), i.e. strings coding the 2D molecular structure of a compound in an extremely concise form. The resulting libraries of molecules contained 260 (respectively 234 and 662) molecules with measured %F (respectively LD50 and %PPB) values. SMILES strings belonging to the %F dataset have been used to compute 241 two-dimensional molecular descriptors using ADMET Predictor (a software package produced by Simulation Plus Inc. [25]). The features for molecules with known values of LD50 and %PPB have instead been calculated using the on-line DRAGON software [28], which returned 626 two-dimensional molecular descriptors. Thus, data have been gathered in matrices composed of 260 (respectively 234 and 662) rows and 242 (respectively 627 and 627) columns. Each row is a vector of molecular descriptor values identifying a drug; each column represents a molecular descriptor, except the last one, which contains the known values of %F (respectively LD50 and %PPB).
These three datasets, including the SMILES structures, can be downloaded from our laboratory web page: www.life.disco.unimib.it/Ricerca.html. Training and test sets have been obtained by randomly splitting each dataset: 70% of the molecules have been randomly selected with uniform probability and inserted into the training set, while the remaining 30% form the test set.

3.2 Genetic programming configurations

The four GP versions proposed in this paper are briefly described below.

3.2.1 "Canonic" (or standard) GP

The first GP setting that we present (which we call canonic or standard GP, indicated as stdGP) is a deliberately simple version of standard tree-based GP [7].
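The 70%/30% uniform random split described above can be sketched as follows. This is a minimal illustration, not the authors' code; function and variable names (and the fixed seed) are ours:

```python
import random

def train_test_split(rows, train_frac=0.7, seed=42):
    """Randomly split a dataset (a list of descriptor rows) into
    training and test sets, following the 70%/30% protocol."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)  # uniform random permutation of row indices
    cut = int(round(train_frac * len(rows)))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

# e.g. the 260-molecule %F dataset splits into 182 training
# and 78 test molecules
data = [[float(j)] for j in range(260)]
train, test = train_test_split(data)
```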
In particular, we used a parameter setting and sets of functions and terminal symbols as similar as possible to the ones originally adopted in Ref. [7] for symbolic regression problems. Each molecular feature has been represented as a floating point number. Potential solutions (GP individuals) have been built by means of the set of functions F = {+, *, −, %} (where % is the protected division, i.e. it returns 1 if the denominator is zero), and the set of terminals T composed of n floating point variables (where n is the number of columns in the training sets, i.e. the number of molecular descriptors of each compound). The fitness of each individual has been defined as the root mean squared error (RMSE) computed on the training set. For example, given an individual k producing the LD50 prediction LD50_i^{(PRE)} on the ith molecule of the training set, we define the fitness of k (RMSE_k) as:

RMSE_k = f(k) = \sqrt{\frac{\sum_{i=1}^{m_F} \left(\mathrm{LD50}_i - \mathrm{LD50}_i^{(PRE)}\right)^2}{m_F}}    (1)

where m_F is the number of molecules (lines) in the training set and LD50_i is the correct value of the LD50 for the molecule represented by data in the ith line of the training set. Afterwards, for each individual k, we evaluated the RMSE and the correlation coefficient (CC) on the test set, and we used them for comparing GP results with those of the other employed methods. With CC we indicate the statistical correlation coefficient between the results returned by each GP candidate solution and the expected outputs for all the molecules in the test set [27]. The parameters used in our experiments (for stdGP and also for all the other GP variants presented below) are reported in Table 1. By elitism here we mean the unchanged copy of the best individual in the population at each generation.
The parameters which are not reported in this table (like the set of terminal symbols and the fitness function) change from one GP version to another, and they are described separately in each of the following paragraphs. For stdGP, the fitness function is given by Eq. 1 and the set of terminal symbols that we have used is a set of n floating point variables, where n is the number of columns of our dataset.
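The fitness of Eq. 1 and the protected division used in the function set can be sketched as follows (a minimal Python sketch; function names are ours):

```python
import math

def rmse_fitness(predicted, target):
    """Fitness of a GP individual (Eq. 1): root mean squared error
    between its predictions and the measured values (e.g. LD50)
    over the m_F molecules of the training set."""
    assert len(predicted) == len(target)
    m_f = len(target)
    return math.sqrt(sum((t - p) ** 2 for p, t in zip(predicted, target)) / m_f)

def pdiv(a, b):
    """Protected division from F = {+, *, -, %}:
    returns 1 when the denominator is zero."""
    return a / b if b != 0 else 1.0
```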
3.2.2 LS2-GP

The second version of GP differs from the previous one in the fitness function employed. In this case, the fitness is obtained by executing two steps: (1) application of linear scaling to the RMSE (a detailed introduction of this method can be found in Ref. [29]); (2) computation of a weighted average between the RMSE with linear scaling and the CC between the expected outputs and the results returned by the GP candidate solution on the training set. Linear scaling consists in calculating the slope and intercept of the formula coded by the GP individual. Given that LD50_i^{(PRE)} is the output of the GP individual k on the ith line of the training set, a linear regression on the target values LD50 can be performed using the equations:
Table 1 Experimental settings used for the various GP implementations

Objective                        Evolve a quantitative predictive model
Function set                     { ·, +, %, − }
Population size                  500 individuals
Initialization                   Ramped half-and-half
Selection                        Tournament of size 10
Maximum tree depth               10
Subtree crossover rate [18]      0.95
Subtree mutation rate [18]       0.1
Maximum number of generations    100
Algorithm                        Generational tree-based GP with elitism

b = \frac{\sum_{i=1}^{m_F} \left[\left(\mathrm{LD50}_i - \overline{\mathrm{LD50}}\right)\left(\mathrm{LD50}_i^{(PRE)} - \overline{\mathrm{LD50}^{(PRE)}}\right)\right]}{\sum_{i=1}^{m_F} \left(\mathrm{LD50}_i^{(PRE)} - \overline{\mathrm{LD50}^{(PRE)}}\right)^2}    (2)

a = \overline{\mathrm{LD50}} - b\,\overline{\mathrm{LD50}^{(PRE)}}

where m_F is the number of lines in the training set (fitness cases), and \overline{\mathrm{LD50}^{(PRE)}} and \overline{\mathrm{LD50}} denote the average output and the average target value respectively. These expressions calculate the slope and intercept of a set of outputs LD50^{(PRE)} such that the sum of the squared errors between LD50 and a + b LD50^{(PRE)} is minimized. After this, any error measure can be calculated on the scaled formula a + b LD50^{(PRE)}, for instance the RMSE:

RMSE_k\left(\mathrm{LD50},\, a + b\,\mathrm{LD50}^{(PRE)}\right) = \sqrt{\frac{\sum_{i=1}^{m_F} \left(a + b\,\mathrm{LD50}_i^{(PRE)} - \mathrm{LD50}_i\right)^2}{m_F}}    (3)

If a is different from 0 and b is different from 1, the procedure outlined above is guaranteed to reduce the RMSE for any formula LD50^{(PRE)} [29]. The cost of calculating the slope and intercept is linear in the size of the training set. By efficiently calculating the slope and intercept for each individual, the burden of searching for these two constants is removed from the GP run. GP is then free to search for the expression whose shape is most similar to that of the target function. The efficacy of linear scaling in GP for many symbolic regression problems has been widely demonstrated in Ref. [29]. Weights equal to 0.4 and 0.6 have been assigned to the RMSE (with linear scaling) and to the CC respectively, after both have been normalized into the range [0, 1], thus giving slightly higher importance to the CC. These values have been chosen empirically through a simple experimentation phase (whose results are not shown here), in which we tested many different weights and found that 0.4 and 0.6 are the values that maximize the GP generalization ability on our dataset. The idea behind this weighted sum is that optimizing only the RMSE on the training set may lead to overfitting and thus to a poor generalization power of GP
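The slope/intercept computation of Eqs. 2 and the scaled RMSE of Eq. 3 can be sketched in Python (a minimal sketch under the definitions above; function names are ours):

```python
import math

def linear_scaling(pred, target):
    """Slope b and intercept a (Eqs. 2) minimizing the squared error
    between target and a + b*pred (Keijzer-style linear scaling)."""
    m = len(target)
    mean_p = sum(pred) / m
    mean_t = sum(target) / m
    num = sum((t - mean_t) * (p - mean_p) for p, t in zip(pred, target))
    den = sum((p - mean_p) ** 2 for p in pred)
    b = num / den if den != 0 else 0.0
    a = mean_t - b * mean_p
    return a, b

def scaled_rmse(pred, target):
    """RMSE of the scaled formula a + b*pred (Eq. 3)."""
    a, b = linear_scaling(pred, target)
    m = len(target)
    return math.sqrt(sum((a + b * p - t) ** 2 for p, t in zip(pred, target)) / m)
```

Note that the scaled error can never exceed the unscaled one, which is why scaling relieves GP of the search for these two constants.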
solutions (i.e. bad results on the test set). If we optimize more than one criterion, GP more likely returns individuals which are good for all the criteria, even though not optimal for any of them. Furthermore, the CC between outputs and targets is a very important measure of result accuracy and thus deserves to be used as an optimization criterion. This GP version will be called "Linear Scaling with 2 criteria" GP, or LS2-GP, from now on.

3.2.3 LS2-C-GP

The third GP version presented in this paper is similar to LS2-GP, with the only difference that a set of ephemeral random constants (ERCs) is added to the set of terminal symbols used to code GP expressions. These ERCs are generated uniformly at random from the range [m, M], where m and M are the minimum and the maximum target values on the training set respectively. In the experiments presented in this paper, a number of ERCs equal to the number of floating point variables has been used. This choice has been empirically confirmed to be suitable by a set of GP runs in which different numbers of ERCs extracted from different ranges have been used (the results of these experiments are not shown here). This version of GP will be called "LS2 with Constants" GP and indicated as LS2-C-GP.

3.2.4 DF-GP

In the fourth version of GP the fitness function dynamically changes during the evolution. In particular, the evolution starts with the CC between outputs and targets on the training set used as the only optimization criterion. When at least 10% of the individuals in the population reach a CC value greater than or equal to 0.6, the fitness function changes: it becomes equal to the RMSE for all those individuals with a CC greater than or equal to 0.6, while all the other individuals are artificially assigned a very bad fitness value.
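The two-criteria fitness of LS2-GP can be sketched as follows. The paper does not spell out how the two normalized criteria are combined into a single value to be minimized, so the use of (1 − CC) below is one plausible reading, and function names are ours:

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient between outputs and targets."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def ls2_fitness(scaled_rmse_norm, cc, w_rmse=0.4, w_cc=0.6):
    """Two-criteria fitness (to be minimized): weighted average of the
    normalized linear-scaled RMSE and of (1 - CC), so that a high CC
    lowers the fitness.  Weights 0.4/0.6 are the values chosen
    empirically by the authors."""
    return w_rmse * scaled_rmse_norm + w_cc * (1.0 - cc)
```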
In other words, the fitness of individual k becomes:

f(k) = \begin{cases} \text{bad fitness} & \text{if } CC_k < 0.6 \\ RMSE_k & \text{otherwise} \end{cases}    (4)

In this way, selection operates as a pruning algorithm, giving a chance to survive for mating only to those individuals whose correlation on the training set is greater than or equal to 0.6. The idea behind this method is that the search space is too large for GP to perform efficiently; furthermore, we hypothesize that individuals with a good, although not optimal, CC between outputs and targets on the training set may have a greater generalization ability and thus should take part in the evolutionary process. Some experiments (whose results are not shown here) have empirically confirmed that the threshold value 0.6 for the CC is large enough to avoid underfitting and small enough to reduce overfitting. Of course, this value has been tested and has
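The dynamic fitness of Eq. 4 and its activation condition can be sketched as follows (the sentinel "bad fitness" value and the function names are our own choices, not specified in the text):

```python
BAD_FITNESS = 1e10  # sentinel "very bad" fitness value (our choice)

def df_fitness(rmse_k, cc_k, threshold=0.6):
    """Dynamic fitness of Eq. 4: once the switch has occurred,
    individuals below the CC threshold get an artificially bad
    (i.e. very large, since fitness is minimized) value."""
    return rmse_k if cc_k >= threshold else BAD_FITNESS

def phase_switched(population_ccs, threshold=0.6, frac=0.10):
    """True when at least `frac` of the population reaches
    CC >= threshold on the training set."""
    hits = sum(1 for cc in population_ccs if cc >= threshold)
    return hits >= frac * len(population_ccs)
```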
proved suitable for the particular datasets used in this paper and can by no means be interpreted as a general threshold. Nevertheless, the experiments that we have executed to obtain this value are very simple and, if we wished to evolve new expressions on new data, we could easily replicate them. This GP version has been called Dynamic Fitness GP and will be indicated as DF-GP from now on.

3.3 Non-evolutionary methods and feature selection algorithms

Various non-evolutionary regression methods were used for comparison with the GP results. They are described below in a deliberately synthetic way, since they are well known and well established ML techniques. Furthermore, they have also been used after a pre-processing phase in which two well known feature selection algorithms have been employed; these are briefly described in the next paragraph. For more details on these methods and algorithms and their use, the reader is referred to the respective references quoted below.

3.3.1 Feature selection procedures

We adopted two attribute selection heuristics: correlation based feature selection (CorrFS) and principal component based feature selection (PCFS) [30]. The central hypothesis in CorrFS is that good feature sets contain features highly correlated with the target, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. The algorithm couples the evaluation formula with an appropriate correlation measure and a search strategy. PCFS reduces the dimensionality of the attribute space by transforming the original features into a new set of variables called principal components (PCs). The latter are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables [31]. The Weka [32] implementations of the two described feature selection procedures have been adopted for the experiments.
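The feature evaluation formula underlying CorrFS is, in Hall's correlation-based feature selection, a "merit" score that rewards high feature-target correlation and penalizes feature-feature redundancy. A sketch of that heuristic (the formula is from Hall's CFS work, not reproduced in this paper; names are ours):

```python
import math

def cfs_merit(feat_target_corrs, feat_feat_corr_mean):
    """CFS 'merit' of a subset of k features:
        merit = k * r_cf / sqrt(k + k*(k-1) * r_ff)
    where r_cf is the mean |feature-target| correlation and r_ff the
    mean feature-feature correlation.  High merit = features highly
    correlated with the target, yet uncorrelated with each other."""
    k = len(feat_target_corrs)
    r_cf = sum(abs(r) for r in feat_target_corrs) / k
    r_ff = feat_feat_corr_mean
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)
```

A search strategy (e.g. greedy forward selection) then looks for the subset maximizing this merit.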
3.3.2 Linear and least square regression

We used the Akaike information criterion (AIC) for model selection [33], which has the objective of estimating the Kullback-Leibler information between two densities, corresponding to the fitted model and the true model. The M5 criterion is used for further attribute selection [33]. The least square regression model is based on the robust regression and outlier detection algorithm described in Ref. [34], which searches for the most plausible linear relationship between outputs and targets.

3.3.3 Artificial neural networks

The multilayer perceptron [35] implementation included in the Weka software distribution [32] was adopted. We used the Backpropagation algorithm [35] with a
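For least-squares fits, a common form of the AIC trades residual error against model size; a sketch (this particular least-squares form is a standard one, not taken from the paper):

```python
import math

def aic(rss, n, k):
    """Akaike information criterion for a least-squares fit:
    n * ln(RSS/n) + 2k, with residual sum of squares `rss`,
    n observations and k fitted parameters.  Lower is better:
    a richer model must reduce RSS enough to pay for its
    extra parameters."""
    return n * math.log(rss / n) + 2 * k
```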
learning rate equal to 0.3. All the neurons had a sigmoidal activation function. All the other parameters have been set to the default values proposed by the Weka implementation. A momentum of 0.1, progressively decreasing to 0.0001, has been used to escape local minima on the error surface.
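The effect of momentum on a backpropagation weight update can be sketched generically as follows. This is not Weka's internal code; the decay factor in the schedule below is our assumption, only the learning rate (0.3), the initial momentum (0.1) and the floor (0.0001) come from the settings above:

```python
def sgd_momentum_step(w, grad, velocity, lr=0.3, momentum=0.1):
    """One gradient-descent weight update with momentum: the velocity
    accumulates a decaying trace of past gradients, helping the search
    roll over small local minima on the error surface."""
    new_v = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v

def decay_momentum(m, factor=0.9, floor=0.0001):
    """Momentum progressively decreasing toward the 0.0001 floor
    (the multiplicative factor is our assumption)."""
    return max(floor, m * factor)
```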
3.3.4 Support vector machines regression

The Smola and Schölkopf sequential minimal optimization algorithm [36] was adopted for training a support vector regression with polynomial kernels. The Weka implementation [32] of this ML method has been used. More precisely, we have built two models, using polynomial kernels of first and second degree respectively.
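The polynomial kernel itself is simple to state; a sketch (the additive constant `coef0` is a common convention, not specified in the paper):

```python
def poly_kernel(x, y, degree=1, coef0=1.0):
    """Polynomial kernel K(x, y) = (x . y + coef0)^degree, used here
    with degree 1 and degree 2 as in the two SVM regression models."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree
```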
4 Experimental results

Here we discuss the experimental results obtained using the various versions of GP and the other ML techniques described above for the quantitative prediction of %F, LD50 and %PPB.
4.1 Oral bioavailability (%F)

All the results that we have obtained with the non-evolutionary ML methods presented above are shown in Table 2. In particular, the upper part of the table (part (a)) shows the results obtained when no feature selection strategy has been employed (data from our dataset have been used as input with no filtering or pre-processing); the middle part (part (b)) reports the results when PCFS has been used; finally, the lower part of Table 2 (part (c)) shows the results obtained using CorrFS. In these last two cases, only the data returned by the respective feature selection technique have been used as input to the various ML methods. The first column of this table identifies the various ML techniques, the second reports the RMSE of the trained model on the test set for each ML technique, and the third shows the value of the CC between outputs and targets on the test set. Let us concentrate, first of all, on part (a) of Table 2, reporting the RMSE and CC on the test set for all the presented non-evolutionary ML techniques without feature selection. The best RMSE is returned by SVM regression with a first degree polynomial kernel, while the best CC has been found by the multilayer perceptron. Now, let us focus on part (b) of Table 2. Here, we report the results of the same techniques using PCFS. The table shows that the use of this feature selection technique generally helps improve the performance of the methods. Nevertheless, for some of them the performance improvement appears to be only marginal. The best RMSE is returned by linear regression, while the best correlation coefficient is found by SVM regression with a second degree polynomial kernel.
Table 2 Experimental comparison between different non-evolutionary ML techniques for oral bioavailability predictions

Method                                             RMSE on test set   Correlation coefficient

(a) No feature selection
Linear regression                                  48.1049            0.1699
Least square regression                            37.2211            0.2022
Multilayer perceptron                              51.280             0.2880
SVM regression, first degree polynomial kernel     34.804             0.2666
SVM regression, second degree polynomial kernel    44.323             0.2597

(b) Principal component based feature selection (PCFS)
Linear regression                                  30.5568            0.1911
Least square regression                            40.4503            0.1165
Multilayer perceptron                              48.9771            0.1945
SVM regression, first degree polynomial kernel     36.1850            0.1306
SVM regression, second degree polynomial kernel    42.3377            0.2184

(c) Correlation based feature selection (CorrFS)
Linear regression                                  27.5212            0.3141
Least square regression                            31.7826            0.1296
Multilayer perceptron                              32.5782            0.2308
SVM regression, first degree polynomial kernel     28.8875            0.2855
SVM regression, second degree polynomial kernel    29.7152            0.2787
Finally, part (c) of Table 2 shows the results of the same ML methods using CorrFS. This selection strategy eliminates some noisy features while maintains the ones that are correlated with bioavailability. As the table shows, this strategy enhances the performances of all the methods employed remarkably better than PCFS. The best results, both for RMSE and CC are returned by linear regression. In Table 3, values of RMSE and CC obtained on the test set by all the considered GP versions are shown. In particular, we report the RMSE and CC of both the individual with the best RMSE value and of the one with the best CC value on the test set contained in all the populations at termination over 20 independent runs1 (we recall that, of course, the test set has not been used to calculate fitness at all: only data in the training set have been used to guide the evolution). Comparing the results shown in this table with the ones in Table 2, we remark that all the GP versions outperform the other ML methods both for RMSE and CC, except for stdGP which is outperformed by SVM and linear regression with CorrFS. The version that has returned the best solution is LS2-C-GP. Comparing the results returned by LS2-CGP with the ones of the non-evolutionary methods, we can remark that LS2-C-GP 1
¹ We have not reported statistics such as the average best fitness and the fitness standard deviation on the test set over the 20 independent runs in this paper, because we want to compare the best results obtained by GP with the best ones obtained by the other non-evolutionary ML methods. For %F some of these statistics are reported in Ref. [27], where we can see that the results are fairly "stable" over the 20 runs, i.e. the variance of the best fitness found in the different runs is "small". The same also holds for LD50 and %PPB.
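CorrFS proper (Hall [30]) scores whole feature subsets by a merit measure; as a rough, simplified sketch of the underlying filtering idea (keeping only descriptors correlated with the target), assuming a univariate filter with an illustrative threshold of our own choosing:

```python
import numpy as np

def corrfs(X, y, threshold=0.1):
    """Simplified correlation-based filter: keep the columns of X whose
    absolute Pearson correlation with the target y exceeds `threshold`.
    The threshold is an illustrative choice, not the rule used by CorrFS."""
    selected = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.std(col) == 0:            # constant feature: correlation undefined
            continue
        r = np.corrcoef(col, y)[0, 1]
        if abs(r) > threshold:
            selected.append(j)
    return selected

# Toy example with a hypothetical descriptor matrix: feature 0 tracks the
# target, feature 1 is constant, feature 2 is pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=50)
X = np.column_stack([y + 0.1 * rng.normal(size=50),   # informative
                     np.ones(50),                     # constant
                     rng.normal(size=50)])            # noise
print(corrfs(X, y))
```

On such data the informative column is retained and the constant one is dropped; the noisy column survives only if its sample correlation happens to exceed the threshold.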
Table 3 Experimental results of the different GP versions in predicting oral bioavailability (no feature selection)

GP version    Best RMSE individual        Best CC individual
              RMSE        CC              RMSE        CC
stdGP         30.1276     0.1661          33.8376     0.1907
LS2-GP        26.5909     0.3735          26.5909     0.3735
LS2-C-GP      26.0126     0.4245          26.0126     0.4245
DF-GP         27.3591     0.3304          27.3591     0.3304

These results concern the individuals with the best RMSE and the best CC values in all the populations over 20 independent runs.
has found a better RMSE and a remarkably higher CC value. We hypothesize that this is due to two main reasons: first of all, using two criteria to evolve solutions on the training set allows us to generate solutions that are "good" on both criteria but optimal on neither of them. In this way, we prevent the evolutionary process from generating solutions that fit the training set too well on one single criterion, which could lead to overfitting. In doing so, we also use the CC as an optimization criterion, which is an important measure of result accuracy. Secondly, the use of ERCs may help to assess the relative relevance of the features in the proposed solutions. As the result reported on the last line of Table 3 shows, even though DF-GP keeps in the population individuals that have a correlation value greater than or equal to 0.6 on the training set, it is still unable to find a correlation better than 0.3304 on the test set. In general, for all the results shown in this paper (both those shown until now and those that will be shown below), the results on the training set are remarkably better than the ones that we have been able to obtain on the test set. We hypothesize that this is due to the complexity of the problems that we are trying to solve and to the low correlation among the different lines in the datasets. We finally remark that we have deliberately avoided using a feature selection technique before applying the GP strategies. GP automatically performs a feature selection by keeping in the population individuals that use a subset of the features. Our experimental results demonstrate that this "automatic" feature selection performed by GP is competitive with two of the best known feature selection algorithms (PCFS and CorrFS). Furthermore, it does not have to be executed before running the regression algorithm, as is the case for all the other non-evolutionary ML methods employed here.
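The two evaluation criteria discussed throughout these tables can be made concrete with a short sketch (the function names are ours, not the paper's):

```python
import numpy as np

def rmse(pred, target):
    # Root mean squared error between predicted and measured values.
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def cc(pred, target):
    # Pearson correlation coefficient between predicted and measured values.
    return float(np.corrcoef(pred, target)[0, 1])

# A model with a constant offset: poor RMSE, yet CC ~ 1. This is one reason
# the best-RMSE individual and the best-CC individual can differ.
target = np.array([10.0, 20.0, 30.0, 40.0])
pred = target + 15.0
print(rmse(pred, target))   # 15.0
print(cc(pred, target))     # ~ 1.0 despite the offset
```

Because the two measures penalize different kinds of error, optimizing both on the training set (as LS2-GP and LS2-C-GP do) discourages models that excel on one criterion only.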
4.2 Rat median oral lethal dose (LD50)

Here we present the experimental results that we have obtained for the prediction of LD50 values. Table 4 presents the results of the non-evolutionary techniques. The organization of this table is analogous to that of Table 2. In particular, part (a) of Table 4 reports the RMSE and CC obtained on the test set by all the
Table 4 Experimental comparison between different non-evolutionary ML techniques for rat median oral lethal dose predictions

Method                                            RMSE on test set    Correlation coefficient

(a) No feature selection
Linear regression                                 9876.5079           0.0557
Least square regression                           66450.2168          0.1976
Multi layer perceptron                            2412.0279           –0.0133
SVM regression—first degree polynomial kernel     18533.3505          –0.1723
SVM regression—second degree polynomial kernel    4815.3241           0.1108

(b) Principal component based feature selection (PCFS)
Linear regression                                 2703.6638           –0.1424
Least square regression                           3149.4284           –0.1514
Multi layer perceptron                            3836.1868           0.2003
SVM regression—first degree polynomial kernel     2523.5642           0.0494
SVM regression—second degree polynomial kernel    3229.7823           –0.1048

(c) Correlation based feature selection (CorrFS)
Linear regression                                 2170.4631           0.2799
Least square regression                           2753.6281           –0.0358
Multi layer perceptron                            3204.0369           0.1041
SVM regression—first degree polynomial kernel     2241.0289           0.3900
SVM regression—second degree polynomial kernel    2504.2721           0.1150
non-evolutionary ML techniques when no feature selection procedure is applied. The best RMSE result is returned by the multilayer perceptron, while the best CC has been found by least square regression. The results returned by the same techniques using PCFS are reported in part (b) of Table 4. These results suggest that the use of this feature selection technique generally helps to improve the performance of all methods. In this case the best RMSE result is returned by SVM regression with first degree polynomial kernel, while the best CC is found by the multilayer perceptron. Results obtained by using CorrFS are reported in part (c) of Table 4. In this case, we can observe a further improvement of all the results: the best RMSE is returned by linear regression, while the best CC is found by the SVM regression with first degree polynomial kernel. In Table 5 we report the RMSE and CC obtained on the test set by all the GP versions introduced above, without any explicit feature selection method applied to the data. As in the case of the oral bioavailability predictions, to give a complete picture of the GP capabilities, we show the performances of both the individual with the best RMSE value and the individual with the best CC value contained in all the populations at termination over 20 independent runs. Once again, the GP approaches seem promising; in fact, they outperform the other ML methods both for RMSE and CC. The GP version that has returned the best RMSE is LS2-GP, while the best CC has been found by stdGP. Also in this case, we remark that we have explicitly used no feature selection and GP has performed it by itself. In fact,
Table 5 Experimental results of the different GP versions in predicting rat oral median lethal dose

GP version    Best RMSE individual        Best CC individual
              RMSE         CC             RMSE         CC
stdGP         1776.7700    0.0625         1882.8600    0.3132
LS2-GP        1753.5800    0.2806         1753.5800    0.2806
LS2-C-GP      1779.1300    0.2245         1800.0700    0.2488
DF-GP         1815.0700    0.1834         1879.7600    0.2017

These results concern the individuals with the best RMSE and with the best CC values in all the populations over 20 independent runs.
the best individuals returned by each of the GP versions contain only a very restricted subset of all the possible variable symbols.
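This "implicit" feature selection can be checked directly by inspecting which variable symbols occur in a best-of-run tree. A minimal sketch, assuming a nested-tuple tree representation and variable names of our own invention:

```python
# A GP individual is modeled here as a nested tuple ("op", child, child)
# whose leaves are either numeric constants (ERCs) or variable names such
# as "x3". Counting the distinct variables actually referenced shows the
# implicit feature selection: the tree uses only a small subset of them.
def variables_used(tree):
    if isinstance(tree, str):
        return {tree}            # a variable symbol
    if isinstance(tree, (int, float)):
        return set()             # an ephemeral random constant
    used = set()
    for child in tree[1:]:       # tree[0] is the operator name
        used |= variables_used(child)
    return used

# Hypothetical best-of-run individual: (x3 * x17) + (x3 - 0.42)
best = ("+", ("*", "x3", "x17"), ("-", "x3", 0.42))
print(sorted(variables_used(best)))   # ['x17', 'x3']
```

However many descriptors are offered in the terminal set, only the symbols that actually appear in the tree influence its output, which is precisely the wrapper-like behavior discussed here.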
4.3 Plasma-protein binding levels

Here we present the experimental results that we have obtained for the prediction of %PPB. Once again, we first present the results of the non-evolutionary methods with and without feature selection; they are reported in Table 6, which has the same organization as Tables 2 and 4.

Table 6 Experimental comparison between different non-evolutionary ML techniques for plasma-protein binding predictions

Method                                            RMSE on test set    Correlation coefficient

(a) No feature selection
Linear regression                                 39.8402             0.3802
Least square regression                           56.2947             0.0323
Multi layer perceptron                            32.6068             0.1398
SVM regression—first degree polynomial kernel     31.4344             0.5056
SVM regression—second degree polynomial kernel    53.0044             0.2616

(b) Principal component based feature selection (PCFS)
Linear regression                                 27.2677             0.5324
Least square regression                           36.4912             0.1844
Multi layer perceptron                            56.9232             0.251
SVM regression—first degree polynomial kernel     29.1552             0.4993
SVM regression—second degree polynomial kernel    30.8581             0.5116

(c) Correlation based feature selection (CorrFS)
Linear regression                                 26.7496             0.5512
Least square regression                           33.7744             0.4711
Multi layer perceptron                            31.7581             0.4379
SVM regression—first degree polynomial kernel     27.2962             0.5586
SVM regression—second degree polynomial kernel    28.3638             0.5188

In part (a) of Table 6 the RMSE and CC results
obtained without applying any feature selection procedure are reported. Both the best RMSE and CC results are returned by the SVM regression with first degree polynomial kernel. The results of the same techniques using PCFS are reported in part (b) of Table 6. The use of this feature selection procedure improves the performance of all the methods, with the exception of the multilayer perceptron, which has returned a considerably higher RMSE. This time, both the best RMSE and CC results are returned by linear regression. Results obtained using CorrFS are shown in part (c) of Table 6. As in the case of %F and LD50, this feature selection algorithm improves the performance of all the non-evolutionary ML techniques more than PCFS does. As for LD50, the best RMSE result is returned by linear regression, and the best CC is found by the SVM regression with first degree polynomial kernel. Table 7 reports the results of GP for %PPB prediction. This time (as we comment in more detail below), GP is outperformed by linear regression and SVM with first degree polynomial kernel. For this reason, we have decided to also report in this paper the results obtained by GP after an explicit application of the PCFS and CorrFS feature selection techniques. By this we mean that we have applied feature selection to the data in a pre-processing phase (as for all the other ML methods) and we have considered as fitness cases only the values of the data returned by the feature selection algorithm. In this case, the number of variables that we have inserted in the terminal set for building GP individuals is equal to the number of features that have been selected (thus, a much smaller number of variables is present in the terminal set this time, compared to the case where no feature selection is applied). For LS2-C-GP, in any case, we have always considered a number of ERCs equal to the number of variables employed.

Table 7 Experimental results of the different GP versions in predicting plasma-protein binding

GP version    Best RMSE individual        Best CC individual
              RMSE        CC              RMSE        CC

(a) No feature selection
stdGP         34.6617     0.1425          36.9151     0.2283
LS2-GP        31.8281     0.3191          31.8281     0.3191
LS2-C-GP      32.4555     0.2210          32.5728     0.2529
DF-GP         32.1279     0.2881          32.2450     0.2949

(b) Principal component based feature selection (PCFS)
stdGP         34.9778     0.1971          37.8160     0.2646
LS2-GP        31.0209     0.3353          31.2919     0.3592
LS2-C-GP      31.2285     0.3292          31.5918     0.3344
DF-GP         31.0119     0.3614          31.0119     0.3614

(c) Correlation based feature selection (CorrFS)
stdGP         33.9380     0.2310          36.3262     0.3703
LS2-GP        30.8059     0.3943          30.8059     0.3943
LS2-C-GP      31.2334     0.3524          31.2712     0.3595
DF-GP         31.0393     0.3702          31.0556     0.3833

These results concern the individuals with the best RMSE and the best CC values in all the populations over 20 independent runs.

Similarly to the organization of Tables 2, 4 and 6, Table 7 is structured as follows: part (a) shows the results with no feature selection, part (b) reports the results obtained using PCFS and part (c) shows the results obtained employing CorrFS. First of all, let us analyse part (a) of Table 7. Once again, for each GP version, results of both the individual with the best RMSE value and the individual with the best CC value contained in all the populations at termination over 20 independent runs are reported. The GP version that has returned the best solutions in this case, both in terms of RMSE and CC, is LS2-GP. Nevertheless, this time these results are worse than those returned by linear regression and SVM (see Table 6). In part (b) of Table 7, we report the results of the four GP approaches with PCFS. The application of this feature selection does not remarkably enhance performance. The same holds for CorrFS (see part (c) of Table 7). From these results, we deduce that, at least for the datasets that we have studied, two well known feature selection methods like PCFS and CorrFS do not help GP to improve solution quality. We could informally say that GP seems to "prefer" to select features "on its own", rather than considering a set of features filtered by another algorithm. If we consider that the explicit application of a feature selection algorithm is a time- and resource-consuming task, and that each time it has to be applied the particular feature selection strategy must be chosen among a wide variety of different algorithms from the literature (see for instance [30, 31]), the fact that GP "implicitly" performs feature selection by itself can be considered a noteworthy advantage of GP compared to all the other ML techniques that we have used.
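The PCFS pre-processing step described above can be sketched with a plain PCA projection (a minimal numpy sketch of the idea; the actual algorithm and the rule for choosing the number of components in the paper may differ):

```python
import numpy as np

def pca_transform(X, n_components):
    """Project the descriptor matrix onto its first principal components.
    The transformed columns would then replace the original variables in
    the GP terminal set, one terminal per retained component."""
    Xc = X - X.mean(axis=0)                    # center each descriptor
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T            # scores on the top components

# Hypothetical dataset: 100 molecules, 50 descriptors, reduced to 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
Z = pca_transform(X, n_components=10)
print(Z.shape)   # (100, 10)
```

With this pre-processing the GP terminal set shrinks from one symbol per descriptor to one symbol per retained component, which is exactly the smaller terminal set mentioned above.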
5 Discussion and future work

An empirical study based on various versions of genetic programming (GP) and other well known machine learning techniques to predict oral bioavailability (%F), median oral lethal dose (LD50) and plasma-protein binding (%PPB) levels of drugs has been presented in this paper. The availability of good prediction tools for pharmacokinetic parameters like the ones studied in this paper is critical for optimizing the efficiency of therapies, maximizing medical success rates and minimizing toxic effects (like, for instance, the ones presented in Ref. [6]). Moreover, the development of computational ADMET predictive tools meets the recent encouragement of the European Union contained in the REACH [37] proposal, whose aim is to improve the protection of human health through the better and earlier identification of the toxicity of chemical substances. In our work the technique that has returned the best results in the estimation of %F and LD50 was GP, while for %PPB GP has been outperformed by linear regression and SVM with first degree polynomial kernel. We have tried to investigate the reasons why GP is outperformed by linear regression on the %PPB dataset by studying two indicators of the hardness of fitness landscapes that we have
developed in the last few years, namely fitness clouds [38] and the negative slope coefficient [39]. This study is presented in Ref. [40] and, according to the results that we have obtained, the fitness landscape generated by the %PPB dataset seems to be a very GP-hard one, contrary to what happens for %F and LD50. This only partially explains why GP is outperformed by linear regression on the %PPB approximation, and further investigation is needed to assess GP parameters for optimizing its performance on this dataset. Nevertheless, in our opinion, GP can be considered one of the most promising techniques for this kind of task. Feature selection procedures, which generally improve the performance of non-evolutionary machine learning techniques significantly, are not relevant for improving GP prediction capabilities for %PPB, nor for %F and LD50 (even though the results obtained by GP with feature selection on %F and LD50 have not been shown here). We hypothesize that the reason for this is that GP intrinsically acts as a wrapper, automatically selecting and evaluating a specific subset of features to be kept inside the population. In some sense, we could say that feature selection is implicit in GP, while for the other machine learning techniques it has to be executed explicitly in a pre-processing phase. Furthermore, analyzing our results, we have remarked that the individual characterized by the best RMSE is often different from the one characterized by the best CC, both on the training and test sets. For this reason, we confirm an observation that has recently been made in Ref. [27]: using more (possibly uncorrelated) criteria for evaluating regression models is often beneficial. In fact, in our experiments the best results on the test set have been returned by the GP versions that use both the RMSE and the CC as optimization criteria on the training set.
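A simple way of combining the two criteria into one fitness to be minimized, in the spirit of the weighted average used in this work, can be sketched as follows (the weight, the use of 1 − CC, and the tolerance parameter are illustrative assumptions of ours, and the tolerance sketches the "error-tolerant" threshold mentioned as future work):

```python
import numpy as np

def fitness(pred, target, w=0.5, tolerance=0.0):
    """Weighted average of RMSE and (1 - CC), to be minimized.
    Residuals smaller than `tolerance` count as zero, modeling a
    measurement-tolerance band around each training value."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    resid = np.abs(pred - target)
    resid = np.where(resid <= tolerance, 0.0, resid)   # error-tolerant band
    rmse = np.sqrt(np.mean(resid ** 2))
    cc = np.corrcoef(pred, target)[0, 1]
    return w * rmse + (1.0 - w) * (1.0 - cc)

target = [1.0, 2.0, 3.0, 4.0]
print(fitness(target, target))                           # perfect model: ~ 0
print(fitness([1.1, 2.1, 3.1, 4.1], target, tolerance=0.2))  # within tolerance: ~ 0
```

Asymmetric over- and under-estimation penalties, the other extension mentioned as future work, would amount to weighting positive and negative residuals differently before the RMSE is taken.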
Nevertheless, in this paper we evaluate individuals by simply performing a weighted average between RMSE and CC. Future work will focus on testing more sophisticated multi-objective learning algorithms, possibly based on the concept of a Pareto front and using more than two criteria in the training phase. We believe that this new GP version may return some very interesting results, which could concretely contribute to producing better medicines more efficiently and effectively. Other developments of GP methods are suggested by the analysis of the available data: experimental ADMET measurements return average values associated with specific confidence intervals. This has to be taken into account in our future activity, possibly by considering an "error-tolerant" threshold for calculating fitness on the training set, and also by assigning a different relevance to over-estimation and under-estimation errors. Finally, in order to characterize more deeply the ability of GP in the ADMET in silico arena, we are planning to test our methodologies on other datasets, for example for the prediction of blood–brain barrier permeability or cytochrome P450 interactions.

Acknowledgments During the development of this research, we presented our initial results at the 8th Annual Conference on Genetic and Evolutionary Computation, GECCO 2006 [27]. We are grateful for the discussion and suggestions by conference attendees and reviewers, who honored our work with the Best Paper Award for the "Biological Applications" conference track.
References

1. Kennedy, T.: Managing the drug discovery/development interface. Drug Discov. Today 2, 436–444 (1997)
2. Kola, I., Landis, J.: Can the pharmaceutical industry reduce attrition rates? Nat. Rev. Drug Discov. 3, 711–716 (2004)
3. Eddershaw, J.P., Beresford, A.P., Bayliss, M.K.: ADME/PK as part of a rational approach to drug discovery. Drug Discov. Today 9, 409–414 (2000)
4. van de Waterbeemd, H., Gifford, E.: ADMET in silico modeling: towards prediction paradise? Nat. Rev. Drug Discov. 2, 192–204 (2003)
5. Norinder, U., Bergstrom, C.A.S.: Prediction of ADMET properties. ChemMedChem 1, 920–937 (2006)
6. Tuffs, A.: Bayer faces shake up after Lipobay withdrawn. Br. Med. J. 323(7317), 828 (2001)
7. Koza, J.R.: Genetic Programming. The MIT Press, Cambridge, MA (1992)
8. Berezhkovskiy, L.M.: Determination of drug binding to plasma proteins using competitive equilibrium binding to dextran-coated charcoal. J. Pharmacokinet. Pharmacodyn. 33(5), 920–937 (2006)
9. van de Waterbeemd, H., Rose, S.: In: Wermuth, L.G. (ed.) The Practice of Medicinal Chemistry, 2nd edn., pp. 1367–1385. Academic Press (2003)
10. Todeschini, R., Consonni, V.: Handbook of Molecular Descriptors. Wiley–VCH, Weinheim (2000)
11. Wishart, D.S., Knox, C., Guo, A.C., Shrivastava, S., Hassanali, M., Stothard, P., Chang, Z., Woolsey, J.: DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucl. Acids Res. 34, doi:10.1093/nar/gkj067 (2006)
12. Yoshida, F., Topliss, J.G.: QSAR model for drug human oral bioavailability. J. Med. Chem. 43, 2575–2585 (2000)
13. Pintore, M., van de Waterbeemd, H., Piclin, N., Chrétien, J.R.: Prediction of oral bioavailability by adaptive fuzzy partitioning. Eur. J. Med. Chem. 38(4), 427–431 (2003)
14. Fröhlich, H., Wegner, J., Sieker, F., Zell, A.: Kernel functions for attributed molecular graphs—a new similarity based approach to ADME prediction in classification and regression. QSAR and Combinatorial Sci. 38(4), 427–431 (2003)
15. Andrews, C.W., Bennett, L., Yu, L.X.: Predicting human oral bioavailability of a compound: development of a novel quantitative structure–bioavailability relationship. Pharmacol. Res. 17, 639–644 (2000)
16. Feng, J., Lurati, L., Ouyang, H., Robinson, T., Wang, Y., Yuan, S., Young, S.S.: Predictive toxicology: benchmarking molecular descriptors and statistical methods. J. Chem. Inform. Comput. Sci. 43, 1463–1470 (2003)
17. Martin, T.M., Young, D.M.: Prediction of the acute toxicity (96-h LC50) of organic compounds to the fathead minnow (Pimephales promelas) using a group contribution method. Chem. Res. Toxicol. 14(10), 1378–1385 (2001)
18. van de Waterbeemd, H., Smith, D.A., Jones, B.C.: Lipophilicity in PK design: methyl, ethyl, futile. J. Comput. Aided Mol. Design 15, 273–286 (2001)
19. Colmenarejo, G., Alvarez-Pedraglio, A., Lavandera, J.L.: Chemoinformatic models to predict binding affinities to human serum albumin. J. Med. Chem. 44, 4370–4378 (2001)
20. Kratochwil, N.A., Huber, W., Muller, F., Kansy, F., Gerber, P.: Predicting plasma protein binding of drugs: a new approach. Biochem. Pharmacol. 64, 1355–1374 (2002)
21. Zupan, J., Gasteiger, J.: Neural Networks in Chemistry and Drug Design: An Introduction, 2nd edn. Wiley (1999)
22. Greene, N.: Computer systems for the prediction of toxicity: an update. Adv. Drug Deliv. Rev. 54, 417–431 (2002)
23. Accelrys Inc.: The world leader in cheminformatics for drug development (2006). See www.accelrys.com
24. Pharma Algorithms Inc.: A company active in the field of ADMET predictions (2006). See www.apalgorithms.com
25. Simulation Plus Inc.: A company that uses both statistical methods and differential equation based simulations for ADME parameter estimation (2006). See www.simulationsplus.com
26. Langdon, W.B., Barrett, S.J.: Genetic programming in data mining for drug discovery. Evolutionary Computing in Data Mining, pp. 211–235 (2004)
27. Archetti, F., Lanzeni, S., Messina, E., Vanneschi, L.: Genetic programming for human oral bioavailability of drugs. In: Cattolico, M. (ed.) Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 255–262. Seattle, Washington, USA (2006)
28. Tetko, I.V., Gasteiger, J., Todeschini, R., Mauri, A., Livingstone, D., Ertl, P., Palyulin, V.A., Radchenko, E.V., Zefirov, N.S., Makarenko, A.S., Tanchuk, V.Y., Prokopenko, V.V.: Virtual computational chemistry laboratory—design and description. J. Comput. Aided Mol. Design 19, 453–463 (2005). See www.vcclab.org
29. Keijzer, M.: Improving symbolic regression with interval arithmetic and linear scaling. In: Ryan, C., Soule, T., Keijzer, M., Tsang, E., Poli, R., Costa, E. (eds.) Genetic Programming, Proceedings of the 6th European Conference, EuroGP 2003, vol. 2610 of LNCS, pp. 71–83. Springer, Berlin, Heidelberg, New York (2003)
30. Hall, M.A.: Correlation-based feature selection for machine learning. Ph.D. thesis, Waikato University, Department of Computer Science, Hamilton, NZ (1998)
31. Jolliffe, I.T.: Principal Component Analysis, 2nd edn. Springer Series in Statistics (1999)
32. Weka: A multi-task machine learning software developed by Waikato University (2006). See www.cs.waikato.ac.nz/ml/weka/
33. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory, Akadémiai Kiadó (1973)
34. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. Wiley, New York (1987)
35. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, London (1999)
36. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Technical Report NC2-TR-1998-030, NeuroCOLT2 (1999)
37. REACH: Registration, Evaluation and Authorisation of Chemicals (2006). http://ec.europa.eu/environment/chemicals/reach/reach_intro.htm
38. Vanneschi, L., Clergue, M., Collard, P., Tomassini, M., Vérel, S.: Fitness clouds and problem hardness in genetic programming. In: Deb, K., et al. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference, GECCO'04, vol. 3103 of Lecture Notes in Computer Science, pp. 690–701. Springer, Berlin, Heidelberg, New York (2004)
39. Vanneschi, L., Tomassini, M., Collard, P., Vérel, S.: Negative slope coefficient: a measure to characterize genetic programming fitness landscapes. In: Collet, P., et al. (eds.) Genetic Programming, 9th European Conference, EuroGP 2006, Lecture Notes in Computer Science, LNCS 3905, pp. 178–189. Springer, Berlin, Heidelberg, New York (2006)
40. Vanneschi, L.: Investigating problem hardness of real life applications. In: Riolo, R., et al. (eds.) Genetic Programming Theory and Practice. To appear (2007)