Environ Sci Pollut Res DOI 10.1007/s11356-017-9690-1
RESEARCH ARTICLE
Predictive models for identifying the binding activity of structurally diverse chemicals to human pregnane X receptor Cen Yin 1 & Xianhai Yang 2 & Mengbi Wei 1 & Huihui Liu 1
Received: 12 February 2017 / Accepted: 30 June 2017 © Springer-Verlag GmbH Germany 2017
Abstract Toxic chemicals that enter the human body undergo a series of metabolism, transport, and excretion processes, in which metabolizing enzymes regulated by the pregnane X receptor (PXR) play key roles. However, some chemicals in the environment can activate or antagonize the human pregnane X receptor, thereby disturbing normal physiological systems. In this study, based on a large dataset of 2724 structurally diverse chemicals, we developed qualitative classification models by the k-nearest neighbor method. Moreover, the logarithms of the 20 and 50% effective concentrations (log EC20 and log EC50) were used to establish quantitative structure-activity relationship (QSAR) models. For the classification models, two descriptors were enough to establish acceptable models, with sensitivity, specificity, and accuracy all larger than 0.7, highlighting a high classification performance. For the two QSAR models, the correlation coefficient (R2) of 0.702–0.749 and the cross-validation and external validation coefficient (Q2) of 0.643–0.712 indicated that the models complied with the criteria proposed in previous studies, i.e., R2 > 0.6 and Q2 > 0.5. The small root mean square error (RMSE) of 0.254–0.414 and the good consistency between observed and predicted values demonstrated satisfactory goodness of fit, robustness, and predictive ability of the developed QSAR models. Additionally, the applicability domains were characterized by the Euclidean distance-based approach and the Williams plot, and the results indicated that the current models have a wide applicability domain, which notably includes a few classes of environmental contaminants that were not covered by previous models.
Responsible editor: Philippe Garrigues
Introduction
Electronic supplementary material The online version of this article (doi:10.1007/s11356-017-9690-1) contains supplementary material, which is available to authorized users.
* Xianhai Yang [email protected]
* Huihui Liu [email protected]
1 Jiangsu Key Laboratory of Chemical Pollution Control and Resources Reuse, School of Environmental and Biological Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu Province 210094, China
2 Ministry of Environmental Protection, Nanjing Institute of Environmental Sciences, Jiang-Wang-Miao Street, Nanjing 210042, China
Keywords Pregnane X receptor (PXR) . Classification model . Quantitative structure-activity relationship (QSAR) . Logarithm of 20% effective concentration (log EC20) . Logarithm of 50% effective concentration (log EC50)
To date, about 145,299 commercially used chemicals have been preregistered under the European REACH regulation (Registration, Evaluation, Authorization and Restriction of Chemicals; last updated May 10, 2016) (REACH 2016). Many of them can be released into the environment through a variety of routes. Very often, these chemicals enter the human body through inhalation, ingestion, skin exposure, and food chain transfer, thereby causing adverse health effects such as carcinogenesis, immunotoxicity, neurotoxicity, and reproductive toxicity (Tijani et al. 2016; Vrijheid et al. 2016). After entering the body, almost all of these chemicals undergo redox reactions, conjugation, and eventually excretion. During these biotransformation processes, the chemicals may be converted into non-toxic small molecules or even more toxic compounds
(Letcher et al. 2002). Enzyme subfamilies, acting as crucial catalysts, are responsible for the metabolism, deactivation, and transport of environmental chemicals in humans. However, some chemicals can interfere with enzyme expression and/or normal physiological function, thus altering the detoxification pathways, reducing metabolic efficacy, and increasing accumulation. Pregnane X receptor (PXR) is a member of the nuclear receptor superfamily that regulates the expression of metabolizing enzymes such as cytochrome P450, transporters, and multidrug resistance proteins, which are involved in the metabolism, transport, and excretion of toxic chemicals (Kliewer et al. 2002). Inappropriate activation or antagonism of human PXR (hPXR) can disturb normal physiological systems. A diverse array of environmental chemicals, such as pharmaceuticals, polychlorinated biphenyls, brominated flame retardants, plasticizers, phthalates, and pesticides, among others (Chen et al. 2014; Kojima et al. 2011; Lille-Langøy et al. 2015; Sui et al. 2012), can activate hPXR, causing chemical-chemical interactions or detrimental physiological effects. In turn, their own metabolism and clearance may also be affected by hPXR feedback, feed-forward mechanisms, or salvage pathways, which contributes to the complex defense strategy of organisms in response to threats by multiple xenobiotics. Currently, five high-resolution crystal structures of hPXR are available in the Protein Data Bank. Moreover, hPXR possesses a bulky, flexible, and hydrophobic ligand-binding pocket that can accommodate various small molecules. Depending on its size and shape, a specific ligand can adopt multiple binding modes or orientations within the pocket, showing remarkable variability (Chen et al. 2014; Watkins et al. 2001).
Therefore, unlike other nuclear receptors that are highly selective for ligands with specific structural features, hPXR is a promiscuous protein that acts as a sensor for a variety of endogenous and xenobiotic compounds, which further complicates the theoretical modeling and prediction of the interaction. Because hPXR plays a key role as a sensor in the response of organisms to environmental chemicals, identification of hPXR agonists would provide important information for evaluating the health risk of chemicals. Much effort is needed to explore appropriate methods for predicting a broad spectrum of hPXR activators and/or non-activators. In view of the large and ever-increasing number of chemicals in the environment, it is not practical to test each chemical experimentally, because such methods are usually laborious, time-consuming, expensive, and equipment dependent. Alternatively, computational modeling is becoming an important tool for the rapid and cost-effective prediction of biological activities. To date, there have been a few successes with structure-based modeling approaches to predict hPXR activators or non-activators (Jacobs 2004; Khandelwal et al. 2008; Rao et al. 2012; Ung et al. 2007; Wang et al. 2006). Among them, quantitative structure-activity relationship (QSAR) models can provide a reliable classification when the biological data are binary in nature, and they might also serve to capture the essential structural and chemical features of PXR activators (Abdulhameed et al. 2016; Dybdahl et al. 2012). Thus, the QSAR model is more implicit and thereby requires a more thorough investigation and rigorous validation. One limitation of previous computational work is that the data used to build the models were collected from different literature sources and represented results from different experimental groups using different assay formats, reporter genes, etc., and were therefore of inconsistent quality for model development. The lack of a large set of consistent hPXR data has restricted QSAR models to a relatively small universe of molecules compared to the known environmental chemicals. In this study, we collected the most frequently used EC50 data, obtained by the competitive binding assay, from the currently published literature. Following the Organization for Economic Cooperation and Development (OECD) guidelines on the development and validation of QSAR models (OECD 2007), qualitative classification models and quantitative QSAR models were developed to predict the hPXR binding affinity of various chemicals. Then, the Euclidean distance-based approach and the Williams plot were used to characterize the applicability domain of the established models. Additionally, a thorough mechanism interpretation was performed to identify the key molecular parameters governing the binding affinity of chemicals to hPXR.
Materials and methods
Dataset
Experimental data on the competing potency of various chemicals toward hPXR were collected from 42 previous publications, the Binding Database (http://www.bindingdb.org), and PubChem BioAssay records AID 463086 and AID 720659, the latter from a public repository of experimental screening data for millions of compounds across various biological targets (Table S1). The concentration (EC20 or EC50) of a tested chemical at 20 or 50% activation of rifampicin binding to hPXR was employed to scale the binding potency. One limitation of the collected data was that only a small fraction of the sources reported quantitative EC20 or EC50 values; much of the work reported values only as greater or less than a cutoff value. Therefore, there are currently no widely available large, diverse, continuous datasets for developing quantitative QSAR models. Additionally, the data for a given chemical may come from different publications and may have been obtained by different authors or laboratories; thus, we used
the average of the EC20 or EC50 values for the same chemical from different sources. Very often, the EC20 or EC50 data were expressed as greater than a cutoff value, e.g., 30 or 100 μM. In this case, the chemicals were classified as inactive. Chemicals with a quantitative EC50 >100 μM were also classified as inactive. When different sources disagreed on whether a chemical was active or inactive, the majority classification was adopted. Additionally, chemicals containing metal ions and chemicals composed of separate moieties were deleted, because the software cannot optimize these types of chemicals. Overall, a total of 2724 chemicals, including 748 active chemicals (defined as A) and 1976 inactive chemicals (defined as I), were used to develop qualitative classification models, which judge whether a chemical is active or inactive (Table S1). Of these, 147 EC20 (Table S2) and 263 EC50 (Table S3) data points were used to develop quantitative QSAR models, which can screen the key molecular descriptors governing the binding affinity. In the modeling, the active and inactive chemical datasets were each randomly divided into a training set and a validation set at a ratio of 3:1. The training set was used to develop the models, while the validation set was used to test the predictive ability of the models built on the training set. The names, CAS numbers, and corresponding experimental EC20 and EC50 data of the chemicals are listed in Tables S1, S2, and S3 of the Supporting Materials.
Calculation of the molecular descriptors
DRAGON descriptors, which characterize the structural diversity of the compounds, were used for the model development. Before the molecular descriptors were calculated, the molecular structures of the model compounds were preliminarily optimized with the minimize energy method (at the minimum RMS gradient of 0.001) contained in ChemBio3D Ultra (version 12.0) (Schnur et al. 1991).
Then, the molecular structures of model compounds were further optimized by employing MOPAC 2012 software (Keywords: PM6 eps = 78.6, CHARGE = 0, EF GNORM = 0.01, POLAR MULLIK SHIFT = 80). Based on the optimized geometric structures from MOPAC, 4885 DRAGON descriptors were calculated by employing the DRAGON software (version 6.0) (Talete 2012). With the DRAGON descriptors, the default methods in the Dragon 6 software were used to preliminarily select descriptors. In this step, the DRAGON descriptors were excluded if they (1) were constant or near-constant, (2) had missing values for all the molecules in the dataset, and (3) had a high pair wise correlation (one of any two descriptors with a correlation greater than 0.95). Details for the exclusion rules were listed in the homepage of the DRAGON software (http://www.talete.mi.it/help/dragon_help/index.html). As a
result of the prereduction procedure, a final set of 1852 DRAGON descriptors was retained for all the chemicals to develop the classification models, while 1510 and 1497 DRAGON descriptors were retained for the active chemicals with EC20 and EC50 values, respectively, to develop the quantitative QSAR models.
Model development and evaluation
Classification model
The binary classification models qualitatively judge whether a chemical can bind to hPXR. The model was built with in-house software using the k-nearest neighbor (k-NN) method (Kovarich et al. 2011; Papa et al. 2013), which is based on the similarity of chemicals. The similarity was defined by the Euclidean distance between the descriptor vectors, expressed as

$$D_E(x, \mu) = \sqrt{(x - \mu)^{T} (x - \mu)} \qquad (1)$$

where μ is the average of descriptor x. The descriptor data were autoscaled for the k-NN analysis, in which each chemical was assigned to the class (active or inactive) of its k most similar chemicals. The predictive ability of the model was checked for k values of 1, 3, 5, and 7. The predictive accuracy (Q) was used as the parameter to select satisfactory models and variables. Moreover, the sensitivity (Se) and specificity (Sp) were calculated to assess the quality of the models, by the following formulae

$$Q = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \qquad (2)$$

$$Se = \frac{TP}{TP + FN} \times 100\% \qquad (3)$$

$$Sp = \frac{TN}{TN + FP} \times 100\% \qquad (4)$$
where TP (true positive) and TN (true negative) are the numbers of chemicals correctly classified as active and inactive, respectively. FN (false negative) is the number of active chemicals classified as inactive, and FP (false positive) is the number of inactive chemicals classified as active.
Quantitative model
Stepwise multiple linear regression (MLR) analysis was employed to select variables and develop the QSAR models by using the SPSS software (SPSS 19.0), with the log EC20 or log EC50 values as the dependent variable and the molecular descriptors as the predictor variables. In this case, we selected a significant model, which had the lowest number of molecular descriptors, the
maximum value of the adjusted determination coefficient (R2adj), the minimum value of the root mean squared error (RMSE), and a variance inflation factor (VIF) of less than 10 for each predictor variable. Additionally, the models should comply with the QUIK rule, i.e., KXX (inter-correlation of the selected descriptors) < KXY (the correlation of the X block with Y), where X is the matrix of selected molecular descriptors and Y is the response variable vector. The developed models were evaluated according to the OECD guidelines (OECD 2007). The statistical parameters R2tra and R2ext (the squared correlation coefficients between observed and fitted values in the training set and validation set), RMSEtra and RMSEext (the root mean squared errors in the training set and validation set), Q2LOO (leave-one-out cross-validated coefficient), Q2BOOT (bootstrap method, 1/5, 5000 iterations), and Q2ext (external explained variance) were calculated to assess the goodness of fit, robustness, and predictive ability of the developed models.
Applicability domain
The applicability domain of the models was characterized by the Euclidean distance-based method or the Williams plot. Details of the evaluation method were presented in our previous study (Liu et al. 2016).
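As a rough illustration of the classification workflow described under "Classification model" (autoscaling, Euclidean-distance k-NN voting, and the Q, Se, and Sp statistics of Eqs. 2–4), the following Python sketch may be helpful. It is not the in-house software used in this study: the function names, the use of NumPy, the choice to autoscale with training-set statistics, and the toy descriptor values are all assumptions made purely for illustration.

```python
import numpy as np

def autoscale(X_train, X_test):
    """Center and scale each descriptor using training-set statistics (an assumption)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant descriptors
    return (X_train - mean) / std, (X_test - mean) / std

def knn_predict(X_train, y_train, X_test, k=3):
    """Assign each query chemical the majority class of its k nearest training chemicals."""
    preds = []
    for x in X_test:
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distance
        nearest = np.argsort(dist, kind="stable")[:k]
        votes = y_train[nearest].sum()  # labels: 1 = active, 0 = inactive
        preds.append(1 if 2 * votes > k else 0)
    return np.array(preds)

def classification_metrics(y_true, y_pred):
    """Accuracy Q, sensitivity Se, and specificity Sp in percent (Eqs. 2-4)."""
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    return {
        "Q": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "Se": 100.0 * tp / (tp + fn),
        "Sp": 100.0 * tn / (tn + fp),
    }

# Hypothetical two-descriptor toy data (columns loosely named after NRS and ChiA_Dz(i)).
X_train = np.array([[0, 0.10], [1, 0.20], [0, 0.15],
                    [3, 0.05], [4, 0.02], [3, 0.04]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0, 0.12], [4, 0.03], [1, 0.18], [3, 0.06]], dtype=float)
y_test = np.array([0, 1, 0, 1])

Xs_train, Xs_test = autoscale(X_train, X_test)
y_pred = knn_predict(Xs_train, y_train, Xs_test, k=3)
metrics = classification_metrics(y_test, y_pred)
```

In practice, k would be varied over 1, 3, 5, and 7 as in the text, and the descriptor pair giving the best Q, Se, and Sp would be retained.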
Results and discussion
Development and validation of classification models
In the present study, we established qualitative classification models using the 2724 chemicals to judge whether or not a chemical is active. According to the reference standards proposed by Golbraikh et al. (2003), the cross-validated coefficient (Q2) and the correlation coefficient (R2) should be larger than 0.5 and 0.6, respectively. We therefore adopted the stricter value of 0.7 as the minimum threshold for accuracy, sensitivity, and specificity. In this case, two descriptors were enough to establish an acceptable model. On the basis of accuracy, sensitivity, and specificity, 13 satisfactory classification models were screened from the population of classification models (Table S4), and the descriptors involved in the models are listed in Table S5. Considering the interpretability of the descriptors, only one model, with NRS and ChiA_Dz(i) as the most relevant variables, is discussed in this text. The statistical results indicated that the model had a sensitivity of 73.8%, a specificity of 72.5%, and a classification accuracy of 72.8% for the training set, highlighting a high classification performance of the model. For the validation set, the model provided a sensitivity, specificity, and accuracy of 74.9, 71.1, and 72.1%, respectively, implying a good external predictive ability. Therefore, the developed model can be used for a rough judgment of the PXR-binding activity of untested chemicals that fall within the applicability domain of the model.
Development and validation of QSAR models
In the dataset of 2724 chemicals, there were 147 chemicals with EC20 values (Table S2) and 263 chemicals with EC50 values (Table S3). We used these quantitative data to develop QSAR models, which can screen the key molecular descriptors governing the binding affinity. The obtained optimum QSAR models, using log EC20 and log EC50 values as the endpoints, are expressed as
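$$\begin{aligned}
\log \mathrm{EC}_{20} = {} & -4.01 + 6.37\,(\text{ChiA\_RG}) - 0.0359\,(\text{CATS2D\_08\_AL}) + 0.0188\,(\text{ATSC6p}) + 0.142\,(\text{GATS7s}) \\
& - 0.101\,(\text{F07[C-S]}) - 0.353\,(\text{NNRS}) + 2.21\,(\text{MEcc}) + 0.197\,(\text{F03[N-O]}) + 4.66\,(\text{R1u+}) \\
& - 1.48\,(\text{H7p}) + 0.451\,(\text{B09[N-O]}) + 0.00797\,(\text{RDF065u})
\end{aligned} \qquad (5)$$

$$\begin{aligned}
\log \mathrm{EC}_{50} = {} & -2.86 - 0.154\,(\text{F10[F-F]}) + 1.12\,(\text{DLS\_01}) - 1.10\,(\text{nRNHR}) + 0.0839\,(\text{N\%}) - 0.354\,(\text{F08[N-Cl]}) \\
& + 0.281\,(\text{F10[N-O]}) - 1.23\,(\text{C-015}) - 0.831\,(\text{CATS2D\_06\_DP}) + 0.571\,(\text{R6s+}) + 0.737\,(\text{Mor27u}) \\
& - 0.243\,(\text{DLS\_05}) + 0.255\,(\text{CATS2D\_06\_DD}) + 0.450\,(\text{nR12}) + 0.903\,(\text{H0u}) - 1.58\,(\text{nCH2RX}) \\
& - 0.0390\,(\text{CATS2D\_08\_AL}) - 0.678\,(\text{B05[O-Cl]}) - 0.499\,(\text{CATS2D\_06\_AP}) - 0.485\,(\text{N-070}) \\
& - 0.290\,(\text{Mor16m}) - 0.101\,(\text{H-053})
\end{aligned} \qquad (6)$$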
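To make the use of Eq. (5) concrete, the short Python sketch below evaluates the log EC20 model for a set of descriptor values. The coefficients are transcribed from Eq. (5); the function name and the example descriptor values are hypothetical and serve only as an illustration. Real inputs would be DRAGON descriptors calculated as described in the methods, and predictions are only meaningful for chemicals inside the applicability domain.

```python
# Intercept and coefficients transcribed from Eq. (5) (log EC20 model).
EQ5_INTERCEPT = -4.01
EQ5_COEFFS = {
    "ChiA_RG": 6.37,
    "CATS2D_08_AL": -0.0359,
    "ATSC6p": 0.0188,
    "GATS7s": 0.142,
    "F07[C-S]": -0.101,
    "NNRS": -0.353,
    "MEcc": 2.21,
    "F03[N-O]": 0.197,
    "R1u+": 4.66,
    "H7p": -1.48,
    "B09[N-O]": 0.451,
    "RDF065u": 0.00797,
}

def predict_log_ec20(descriptors):
    """Return log EC20 from a dict of the 12 descriptor values used in Eq. (5)."""
    missing = set(EQ5_COEFFS) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptor values: {sorted(missing)}")
    return EQ5_INTERCEPT + sum(
        coeff * descriptors[name] for name, coeff in EQ5_COEFFS.items()
    )

# Hypothetical descriptor values: everything zero except one normalized ring-system count.
example = {name: 0.0 for name in EQ5_COEFFS}
example["NNRS"] = 1.0
log_ec20 = predict_log_ec20(example)  # intercept plus the NNRS contribution
```

Because NNRS has a negative coefficient, raising it lowers the predicted log EC20, which corresponds to a higher predicted binding activity.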
Statistical parameters of the two developed QSAR models were calculated. According to the criteria proposed by Golbraikh et al. (2003), i.e., Q2 > 0.5 and R2 > 0.6, the models had acceptable statistical parameters, with R2tra, Q2LOO, and Q2BOOT of 0.735–0.749, 0.643–0.669, and 0.718–0.754 in the training set, respectively. Furthermore, the validation set also had satisfactory statistical results, with R2ext of 0.702–0.732 and Q2ext of 0.695–0.721, which showed the high predictive ability of the QSAR models. The RMSE values for the log EC20 model were 0.254 and 0.284 for the training set and validation set, respectively, while those for the log EC50 model were 0.419 and 0.414. The greater root mean squared error of the log EC50 model may result from the larger number of data points (198 versus 110) and the broader literature resources (44 versus 3)
Fig. 2 Applicability domain for the developed classification model characterized by the Euclidean distance-based approach (axes: NRS and ChiA_Dz(i); color scale: Euclidean distance; training set and validation set shown)
Applicability domain
For the classification model, we focused only on the model with NRS and ChiA_Dz(i) as the most relevant variables, and its applicability domain was assessed by the Euclidean distance-based approach (Fig. 2). As shown, among the 2724 chemicals, 2723 chemicals in the training and validation sets were located in the acceptable domain, highlighting the good representativeness of the training set for the model. Only one chemical (cadmium oxide), which had the largest ChiA_Dz(i) value, was located outside the acceptable domain and was identified as an outlier. Moreover, this inactive chemical was misclassified as active by the developed model (Table S1). The incorrect classification may be due to the limited number of chemicals with a similar structure (metallic oxide) in the dataset, or the experimental values from the literature may not have been accurate (Gramatica 2007). For the QSAR model of log EC20, the applicability domain was characterized by the Williams plot (Fig. 3a). As shown, five chemicals (decamethylcyclopentasiloxane, rifampicin, SR12813, anilofos, triforine) in the training set and one chemical (BDE-209) in the validation set had hi > h* (h* = 0.35) and |δ| < 3 (Fig. 3a), implying that these six compounds have a great influence on the model. In particular, rifampicin and SR12813 are the optimal ligands of PXR and are often used as the references in the relative/competitive binding assay; thus, their activity plays a key role in the determination of the effective concentrations of test chemicals. Regarding the standardized residual (δ), although the δ value of triadimefon lay on the warning line |δ| = 3, no other chemicals were identified as outliers. For the QSAR model of log EC50, only the Williams plot was used to assess its applicability domain. As shown in Fig. 3b, all the chemicals were in the area of |δ| < 3, which indicated that there were no outliers for this model. However, ten chemicals (sulfamethoxazole, diazepines 9c, okadaic acid, C2BA-13, pretilachlor, metolachlor, cephradine, 5β-androstan-3α-ol, SR12813, butafenacil) in the training set and six chemicals (doxycycline, C2BA-10, sulfisoxazole, n-butyl-p-aminobenzoate, rifapentine, glimepiride) in the validation set had hi > h* (h* = 0.33) and |δ| < 3 (Fig. 3b), implying that these compounds were very influential on the model and can stabilize the QSAR model and make it more precise. Overall, the models developed in this study have a wide applicability domain, which includes organochlorine pesticides, acetanilide pesticides, triazole chemicals, flame retardants, antibiotics, metallic oxides, phenolic compounds, endogenous and exogenous steroidal chemicals, emerging drugs and their derivatives, and so on.
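The leverage values (hi) and warning leverage (h*) used in the Williams plots above can be sketched in Python as follows. This is a minimal illustration, not the procedure of Liu et al. (2016): it assumes the commonly used definitions hi = xi^T (X^T X)^-1 xi with an intercept column and h* = 3(p + 1)/n, where p is the number of descriptors and n is the number of training chemicals, and it assumes that the 110 and 198 data points mentioned in the text are the training-set sizes. The function names and the random data are hypothetical.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage of each query chemical relative to the training descriptor matrix."""
    Xt = np.column_stack([np.ones(len(X_train)), X_train])  # add intercept column
    Xq = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.pinv(Xt.T @ Xt)
    # h_i = x_i^T (X^T X)^-1 x_i, evaluated row by row
    return np.einsum("ij,jk,ik->i", Xq, xtx_inv, Xq)

def warning_leverage(n_train, n_descriptors):
    """Warning leverage h* = 3(p + 1)/n (assumed definition)."""
    return 3.0 * (n_descriptors + 1) / n_train

def outside_domain(h, std_residuals, h_star, residual_limit=3.0):
    """Flag chemicals with h > h* (influential) or |standardized residual| > limit (outliers)."""
    return (h > h_star) | (np.abs(std_residuals) > residual_limit)

# Under the stated assumptions, the thresholds quoted in the text are reproduced.
h_star_ec20 = round(warning_leverage(110, 12), 2)  # 0.35
h_star_ec50 = round(warning_leverage(198, 21), 2)  # 0.33

# Hypothetical random descriptor matrix: 20 chemicals, 3 descriptors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
h_demo = leverages(X_demo, X_demo)
```

With these definitions, the training-set leverages sum to p + 1, so h* is three times the average leverage.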
Fig. 1 Plot of the predicted versus observed log EC20 values (a) and log EC50 values (b) for the training set and validation set
than in the log EC20 model. Additionally, both models complied with the QUIK rule, i.e., KXX < KXY (0.337 < 0.359 for the log EC20 model and 0.271 < 0.290 for the log EC50 model), suggesting good performance of the developed models. A plot of the experimentally observed data versus the model-predicted data is shown in Fig. 1, in which the good consistency between the observed and predicted log EC20 and log EC50 values demonstrates the good predictive ability of the developed QSAR models. The screened descriptors involved in the models are listed in Tables S6 and S7. As shown, the VIF values for all the predictor variables are less than 10, indicating that there is no serious multicollinearity among the variables.
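For readers who wish to reproduce checks of this kind, the VIF and the leave-one-out Q2 can be computed as in the following Python sketch. The statistics in this study were obtained with other software, so this is only an illustration under standard textbook definitions: VIF_j = 1/(1 - R2_j), where R2_j comes from regressing descriptor j on the remaining descriptors, and Q2LOO = 1 - PRESS/SS_total, with PRESS obtained from the hat-matrix shortcut for ordinary least squares. The function names and toy data are hypothetical.

```python
import numpy as np

def _r2(X, y):
    """R2 of an ordinary least-squares fit of y on X (intercept included)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def vif(X):
    """Variance inflation factor of each descriptor (column of X)."""
    return np.array([
        1.0 / (1.0 - _r2(np.delete(X, j, axis=1), X[:, j]))
        for j in range(X.shape[1])
    ])

def q2_loo(X, y):
    """Leave-one-out cross-validated Q2 for an MLR model via the hat matrix."""
    A = np.column_stack([np.ones(len(y)), X])
    hat = A @ np.linalg.pinv(A.T @ A) @ A.T
    resid = y - hat @ y
    press = np.sum((resid / (1.0 - np.diag(hat))) ** 2)  # LOO prediction errors
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Toy data: two uncorrelated descriptors and an exactly linear response.
X_orth = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y_lin = 2.0 + 3.0 * X_orth[:, 0] - 1.0 * X_orth[:, 1]

# Toy data: two nearly collinear descriptors, which should give large VIF values.
x1 = np.arange(6, dtype=float)
X_coll = np.column_stack([x1, x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])])
```

A descriptor set would pass the criterion used in the text when every entry of `vif(X)` is below 10.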
Fig. 3 Plot of standardized residuals versus leverages (h) for the developed QSAR models of log EC20 values (a) and log EC50 values (b). The horizontal short-dotted lines represent ±3 standardized residuals, and the vertical short-dotted line represents the warning leverage value (h* = 0.35 in a and h* = 0.33 in b)
Mechanism interpretation
Within the population of classification models, two variables were sufficient to explain the binding activity in each model (Tables S4 and S5), and the 13 screened models included 14 different descriptors. Overall, these descriptors carried two kinds of information, i.e., information on ring systems and the average Randic-like index. In this text, we discuss the classification model with NRS and ChiA_Dz(i) as descriptors (Fig. 2), in which NRS is the number of ring systems and ChiA_Dz(i) is the average Randic-like index from the Barysz matrix weighted by ionization potential. The relationship between the percentage of active chemicals and NRS is plotted in Fig. S1. In the whole dataset, active chemicals accounted for 27.5%. When NRS ≤ 1, the percentage of active chemicals was less than 27.5%, whereas when NRS > 1, the percentage of active chemicals was more than 27.5%, which implied a positive contribution of ring systems to the binding activity of chemicals to PXR. A plot of ChiA_Dz(i) values versus the activity classes of chemicals is shown in Fig. S2. As shown, the activity of chemicals with smaller ChiA_Dz(i) values was ambiguous, while chemicals with larger ChiA_Dz(i) values were prone to be inactive. In the QSAR model of log EC20, four descriptors, i.e., CATS2D_08_AL, F07[C-S], NNRS, and H7p, had negative coefficients (Table S6), indicating that an increase in these descriptor values results in a decrease in EC20 values and an increase in the binding activity of chemicals to PXR. In particular, NNRS, which encodes the normalized number of ring systems, was screened, and its positive contribution to the activity of chemicals agreed well with the conclusion from the classification model. F07[C-S] represents the frequency of C-S at topological distance 7, and its negative sign indicated that chemicals with C-S were more inclined to be bound by PXR.
The other eight descriptors, i.e., ChiA_RG, ATSC6p, GATS7s, MEcc, F03[N-O], R1u+, B09[N-O], and RDF065u, had positive coefficients (Table S6). Among them, ChiA_RG is the average Randic-like index from the reciprocal squared geometrical matrix, and its negative contribution to the activity of
chemicals was consistent with the findings from the classification model. In addition, F03[N-O] and B09[N-O] characterize the frequency of N-O at topological distance 3 and the presence/absence of N-O at topological distance 9, respectively. The positive coefficients of these two descriptors implied that the presence of N-O fragments in chemicals was unfavorable for the binding activity to PXR. In the QSAR model of log EC50, three 2D atom pair descriptors, F10[F-F], F08[N-Cl], and B05[O-Cl], carried information on F-F, N-Cl, and O-Cl, respectively; four functional group descriptors (nRNHR, C-015, nCH2RX, and N-070) related to the molecular fragments R-NH-R, =CH2, R-CH2-X, and Ar-NH-Al, respectively (Table S7). Their negative coefficients indicated that chemicals with these structural/functional groups were prone to bind to PXR. Besides, another six descriptors, CATS2D_06_DP, DLS_05, CATS2D_08_AL, H-053, Mor16m, and CATS2D_06_AP, also contributed to the decrease of log EC50 values (Table S7). However, the other eight descriptors, DLS_01, F10[N-O], N%, R6s+, Mor27u, CATS2D_06_DD, nR12, and H0u, contributed to the increase of log EC50 values and thereby to a decrease in the binding activity of chemicals to PXR. Among them, F10[N-O] represents the frequency of N-O at topological distance 10, and its positive coefficient again showed the adverse effect of the presence of N-O in chemicals on their activity, as found in the log EC20 model. It is worth noting that descriptors related to the drug-like score were screened in both the classification model and the log EC50 QSAR model, but their contribution was not clear, which may result from the high selectivity of PXR for drug-like ligands.
Model comparison
Over the past decade, the binding affinity of chemicals to PXR has received wide attention, but most studies focused on drugs and their derivatives, and the most used approach was the qualitative classification model.
A comparison of the statistical performance of the current models with that of previous models is presented in Table 1. Among the ten previous models, four were developed by using the Bayesian model, and two
5
Obtained from the web based software model Quantum chemical descriptors DS descriptors, VolSurf descriptors, molecular fingerprints Quantum chemical descriptors, structural fingerprints DRAGON descriptors
Support vector machine, k-nearest neighbor, artificial neural networks Molecular docking
k-Nearest neighbor
Bayesian model
Bayesian model
73
MOE, VolSurf, Parasurf, CATS
Classification model using the program C5.0 (version 2.05)
2
9
98
42
8
Leadscope® Predictive Data Miner
748 activators, 1976 non-activators
180 activators, 1650 non-activators
299 activators, 332 non-activators 304 activators, 192 non-activators
222 activators, 140 non-activators
397 activators, 239 non-activators
299 activators, 332 non-activators
98 activators, 79 non-activators
74 or 117
Topological variables, fingerprints ECFP_6 and FCFP_6 Quantum chemical descriptors
Bayesian model
A total of 115
195 activators, 142 non-activators
128 activators, 77 non-activators
Number of data points
8
Fingerprints FCFP_6 descriptors
86
VolSurf descriptors
Bayesian model
83
Obtained from an in-house program
k-Nearest neighbor, probabilistic neural network, support vector machines Recursive partitioning, random forest, support vector machine
Number of descriptors
Descriptors
Various environmental chemicals
Various environmental chemicals
Environmental chemicals and drug compounds Structurally diverse compounds
Various chemicals
Drug-like molecules
Various chemicals
Drugs
Steroids
Various chemicals
Various chemicals
Application domain
Documentation of the classification models for the prediction of hPXR activators and non-activators
Algorithm
Table 1
Se = 61.5–91.0% Sp = 78.2–96.1% Q = 70.5–92.7% Se = 75.0–76.0% Sp = 78.0–83.0% Q = 78.0–82.0% Se = 70.4–75.4% Sp = 70.2–73.0% Q = 70.8–72.8%
Se = 82.3% Sp = 84.6% Q = 83.5% Se = 91.6% Sp = 88.0% Q = 90.1% Se = 80.0–100% Sp = 63.6–90.9% Q = 75.4–96.4% –
Se = 77.8–80.7% Sp = 62.4–71.4% Q = 72.5–74.9% Se = 82.7–99.0% Sp = 62.0–88.6% Q = 73.5–94.4% Se = 84.1% Sp = 69.1% Q = 73.2% –
Parameters of training set
Se = 61.0–92.7% Sp = 46.2–69.2% Q = 61.1–85.2% Se = 61.0% Sp = 87.0% Q = 77.0% Se = 70.1–77.5% Sp = 70.0–73.7% Q = 70.2–74.0%
Se = 35.4–58.5% Sp = 82.5–90.2% Q = 60.0–69.0% Se = 57.9% Sp = 83.5% Q = 69.6% Se = 94.4–97.4% Sp = 64.3–90.9% Q = 86.0–96.0% Se = 87.5–93.8% Sp = 85.7–91.4% Q = 86.7–92.8% –
This study
Abdulhameed et al. (2016)
Shi et al. (2015)
Chen et al. (2014)
Rao et al. (2012)
Matter et al. (2012)
Dybdahl et al. (2012)
Pan et al. (2011)
Ekins et al. (2009)
Khandelwal et al. (2008)
Ung et al. (2007)
– Se = 64.6–68.3% Sp = 61.9–66.7% Q = 63.5–66.9% –
Ref.
Parameters of validation set
Environ Sci Pollut Res
models were established by employing the k-nearest neighbor method, as in our study. Molecular fingerprints and VolSurf descriptors were the most frequently used descriptor types; no previous model used DRAGON descriptors, which were used in this study. In our models, two descriptors were sufficient to give sensitivity, specificity, and accuracy larger than 0.7; in the previous models, by contrast, tens or even more than 100 descriptors were used, yet the statistical parameters still had values less than 0.7 in the training or validation sets. Additionally, the current models covered a larger number of data points (2724 versus 115–1830), and the applicability domain extended to a few classes of environmental contaminants that were not included in the previous models.

Comparison of our quantitative QSAR models with the previous models is presented in Table 2. As shown, EC50 was the most commonly used endpoint, and no QSAR model had been reported for EC20 or other endpoints. Matter et al. (2012) used the genetic algorithm + regression tree approach to develop a QSAR model with 38 descriptors for 273 chemicals and obtained satisfactory statistical parameters, with R2ext and R2tra of 0.865 and 0.774, respectively, but their applicability domain was limited to drug-like molecules. Chen et al. (2014) employed partial least squares regression to establish a QSAR model for pEC50; unfortunately, the model needed 63 descriptors for a total of 47 data points to achieve acceptable results, with R2ext and R2tra of 0.774 and 0.716, respectively. Overall, our models had clear advantages: a larger number of data points, a smaller number of variables, and satisfactory values of R2 and Q2. In addition, this study followed the complete sequence of model development and validation steps rather than merely establishing a simple relationship, and it is the first paper to explore a model for the EC20 endpoint.

Table 2 Documentation of the QSAR models for predicting the hPXR activators
Reference | AD | R2ext | Q2 | R2tra | n | d | Descriptors | Endpoints | Algorithm
This study | Various environmental chemicals | 0.732 | 0.669 | 0.749 | 147 | 12 | DRAGON descriptors | log EC20 | Stepwise multiple linear regression
This study | Various environmental chemicals | 0.702 | 0.643 | 0.735 | 263 | 21 | DRAGON descriptors | log EC50 | Stepwise multiple linear regression
Chen et al. (2014) | Environmental chemicals and drug compounds | 0.774 | – | 0.716 | 47 | 63 | Electrostatic + steric + hydrophobic descriptors | pEC50 | Partial least squares regression
Matter et al. (2012) | Drug-like molecules | 0.865 | 0.292 | 0.774 | 273 | 38 | MOE, VolSurf and ParaSurf descriptors | pEC50 | Genetic algorithm + regression tree approach
d the number of descriptors, n the number of data points

Conclusions

Owing to the key role of PXR in the metabolism and clearance of xenobiotic substances, identifying hPXR agonists provides important information for evaluating the health risk of chemicals. Following the OECD guidelines, this study developed qualitative classification models and quantitative QSAR models for predicting the binding activity of compounds to human PXR. For the classification models, two kinds of descriptors, i.e., the information on ring systems and the average Randic-like index, were the most relevant variables; for the QSAR models, 12 and 21 descriptors were selected as the main factors governing the log EC20 and log EC50 values, respectively. A mechanistic interpretation of the models was also provided. Compared with previous models, the current models cover a larger dataset, use a smaller number of variables, and achieve satisfactory statistical parameters. In terms of goodness of fit, robustness, and predictability, the current models are comparable with or slightly better than the previous ones. Moreover, the applicability domain has been extended to a few classes of environmental compounds, such as organochlorine pesticides, acetanilide pesticides, triazole chemicals, flame retardants, antibiotics, metallic oxides, phenolic compounds, and exogenous steroidal chemicals.

Acknowledgements The study was supported by the Natural Science Foundation of Jiangsu Province (No. BK20150771) and the National Natural Science Foundation of China (Nos. 21507038, 21507061, and 41671489).

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

Transparency document The transparency document associated with this article can be found in the online version.
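As a supplementary illustration of the comparison in Table 2, the number of data points available per descriptor can be computed directly from the n and d values quoted for each model (the values 147 and 263 for this study are read from the table in the order log EC20, log EC50). The sketch is only an arithmetic check of that comparison; the dictionary keys and the function name are hypothetical labels, not part of the original work.

```python
# (n data points, d descriptors) for each regression model compared in Table 2.
models = {
    "This study, log EC20": (147, 12),
    "This study, log EC50": (263, 21),
    "Chen et al. (2014), pEC50": (47, 63),
    "Matter et al. (2012), pEC50": (273, 38),
}


def points_per_descriptor(n, d):
    """Hypothetical helper: data points available per model descriptor."""
    return n / d


ratios = {name: points_per_descriptor(n, d) for name, (n, d) in models.items()}

for name, ratio in ratios.items():
    # A ratio below 1 means the model has more descriptors than data points.
    note = " (more descriptors than data points)" if ratio < 1 else ""
    print(f"{name}: {ratio:.2f} data points per descriptor{note}")
```

Both models from this study have roughly 12 data points per descriptor, whereas the 63-descriptor model built on 47 data points has fewer than one, which is the imbalance the discussion points out.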
References

AbdulHameed MDM, Ippolito DL, Wallqvist A (2016) Predicting rat and human pregnane X receptor activators using Bayesian classification models. Chem Res Toxicol 29:1729–1740
Chen S, He NH, Chen WS, Sun FJ, Li LQ, Deng R, Hu Y (2014) Molecular insights into the promiscuous interaction of human pregnane X receptor (hPXR) with diverse environmental chemicals and drug compounds. Chemosphere 96:138–145
Dybdahl M, Nikolov NG, Wedebye EB, Jonsdottir SO, Niemela JR (2012) QSAR model for human pregnane X receptor (PXR) binding: screening of environmental chemicals and correlations with genotoxicity, endocrine disruption and teratogenicity. Toxicol Appl Pharm 262:301–309
Ekins S, Kortagere S, Iyer M, Reschly EJ, Lill MA, Redinbo MR, Krasowski MD (2009) Challenges predicting ligand-receptor interactions of promiscuous proteins: the nuclear receptor PXR. PLoS Comput Biol 5:e1000594
Golbraikh A, Shen M, Xiao ZY, Xiao YD, Lee KH, Tropsha A (2003) Rational selection of training and test sets for the development of validated QSAR models. J Comput Aid Mol Des 17:241–253
Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26:694–701
Jacobs MN (2004) In silico tools to aid risk assessment of endocrine disrupting chemicals. Toxicology 205:43–53
Khandelwal A, Krasowski MD, Reschly EJ, Sinz MW, Swaan PW, Ekins S (2008) Machine learning methods and docking for predicting human pregnane X receptor activation. Chem Res Toxicol 21:1457–1467
Kliewer SA, Goodwin B, Willson TM (2002) The nuclear pregnane X receptor: a key regulator of xenobiotic metabolism. Endocr Rev 23:687–702
Kojima H, Sata F, Takeuchi S, Sueyoshi T, Nagai T (2011) Comparative study of human and mouse pregnane X receptor agonistic activity in 200 pesticides using in vitro reporter gene assays. Toxicology 280:77–87
Kovarich S, Papa E, Gramatica P (2011) QSAR classification models for the prediction of endocrine disrupting activity of brominated flame retardants. J Hazard Mater 190:106–112
Letcher RJ, Lemmen JG, van der Burg B, Brouwer A, Bergman A, Giesy JP, van den Berg M (2002) In vitro antiestrogenic effects of aryl methyl sulfone metabolites of polychlorinated biphenyls and 2,2-bis(4-chlorophenyl)-1,1-dichloroethene on 17 beta-estradiol-induced gene expression in several bioassay systems. Toxicol Sci 69:362–372
Lille-Langøy R, Goldstone JV, Rusten M, Milnes MR, Male R, Stegeman JJ, Blumberg B, Goksoyr A (2015) Environmental contaminants activate human and polar bear (Ursus maritimus) pregnane X receptors (PXR, NR1I2) differently. Toxicol Appl Pharm 284:54–64
Liu HH, Yang XH, Lu R (2016) Development of classification model and QSAR model for predicting binding affinity of endocrine disrupting chemicals to human sex hormone-binding globulin. Chemosphere 156:1–7
Matter H, Anger LT, Giegerich C, Gussregen S, Hessler G, Baringhaus KH (2012) Development of in silico filters to predict activation of the pregnane X receptor (PXR) by structurally diverse drug-like molecules. Bioorgan Med Chem 20:5352–5365
OECD (2007) Guidance document on the validation of (quantitative) structure-activity relationships [(Q)SAR] models. Organisation for Economic Co-operation and Development, Paris, France
Pan YM, Li LH, Kim G, Ekins S, Wang HB, Swaan PW (2011) Identification and validation of novel human pregnane X receptor activators among prescribed drugs via ligand-based virtual screening. Drug Metab Dispos 39:337–344
Papa E, Kovarich S, Gramatica P (2013) QSAR prediction of the competitive interaction of emerging halogenated pollutants with human transthyretin. SAR QSAR Environ Res 24:333–349
Rao HB, Wang YY, Zeng XY, Wang XX, Liu Y, Yin JJ, He H, Zhu F, Li ZR (2012) In silico identification of human pregnane X receptor activators from molecular descriptors by machine learning approaches. Chemometr Intell Lab 118:271–279
REACH. Registration, evaluation, authorization and restriction of chemicals. http://echa.Europa.Eu/information-on-chemicals/preregistered-substances. Last updated 10 May 2016
Schnur DM, Grieshaber MV, Bowen JP (1991) Development of an internal searching algorithm for parameterization of the MM2/MM3 force fields. J Comput Chem 12:844–849
Shi HL, Tian S, Li YY, Li D, Yu HD, Zhen XC, Hou TJ (2015) Absorption, distribution, metabolism, excretion, and toxicity evaluation in drug discovery. 14. Prediction of human pregnane X receptor activators by using naive Bayesian classification technique. Chem Res Toxicol 28:116–125
Sui YP, Ai N, Park SH, Rios-Pilier J, Perkins JT, Welsh WJ, Zhou CC (2012) Bisphenol A and its analogues activate human pregnane X receptor. Environ Health Persp 120:399–405
Talete srl (2012) Dragon (software for molecular descriptor calculation) version 6.0.
Tijani JO, Fatoba OO, Babajide OO, Petrik LF (2016) Pharmaceuticals, endocrine disruptors, personal care products, nanomaterials and perfluorinated pollutants: a review. Environ Chem Lett 14:27–49
Ung CY, Li H, Yap CW, Chen YZ (2007) In silico prediction of pregnane X receptor activators by machine learning approaches. Mol Pharmacol 71:158–168
Vrijheid M, Casas M, Gascon M, Valvi D, Nieuwenhuijsen M (2016) Environmental pollutants and child health-a review of recent concerns. Int J Hyg Envir Heal 219:331–342
Wang CY, Li CW, Chen JD, Welsh WJ (2006) Structural model reveals key interactions in the assembly of the pregnane X receptor/corepressor complex. Mol Pharmacol 69:1513–1517
Watkins RE, Wisely GB, Moore LB, Collins JL, Lambert MH, Williams SP, Willson TM, Kliewer SA, Redinbo MR (2001) The human nuclear xenobiotic receptor PXR: structural determinants of directed promiscuity. Science 292:2329–2333