Random Forest Approach to QSPR Study of Fluorescence Properties Combining Quantum Chemical Descriptors and Solvent Conditions


Journal of Fluorescence https://doi.org/10.1007/s10895-018-2233-4

ORIGINAL ARTICLE

Chia-Hsiu Chen 1 & Kenichi Tanaka 1 & Kimito Funatsu 1

Received: 1 February 2018 / Accepted: 11 April 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
A quantitative structure–property relationship (QSPR) study was performed on the absorption and emission wavelengths of 413 fluorescent dyes in different solvent conditions. The dyes included chromophore derivatives of cyanine, xanthene, coumarin, pyrene, naphthalene, anthracene, and others, with wavelengths ranging from 250 nm to 800 nm. An ensemble method, random forest (RF), was employed to construct nonlinear prediction models, which were compared with linear partial least squares and nonlinear support vector machine regression models. Quantum chemical descriptors derived from density functional theory calculations and solvent information were also used in constructing the models. The best prediction results were obtained from the RF model, with squared correlation coefficients R²pred of 0.940 and 0.905 for λabs and λem, respectively. The descriptors used in the models are discussed in detail by comparing their RF feature importance.

Keywords QSPR . Fluorescence properties . Random forest . Quantum chemical calculation

* Kimito Funatsu
[email protected]–tokyo.ac.jp

1 Department of Chemical Systems Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

Introduction

Organic fluorescent dyes, together with advances in photodetection technologies, have received substantial attention for fluorescent probes and labels with high sensitivity, rapid response time, and good selectivity. Consequently, these dyes have attracted increasing interest in the corresponding research communities and have undergone extensive development in chemistry, biology, and environmental science. Many fluorescent probes exist for the detection of ions [1] and specific compounds [2], and even for investigating cellular events in real time via fluorescence microscopy [3]. As their applications broaden, the properties and functions of organic fluorescent dyes have become increasingly important. For such applications, it is important to select the proper absorption wavelength (λabs) and emission wavelength (λem). However, most fluorescent probes have been developed empirically rather than rationally [3, 4]. For economic reasons, it would be ideal to be able to predict, by various methods, whether the compound to be synthesized meets the required absorption and emission wavelengths (λabs/λem). Computer-aided prediction of the desired properties may be the most cost-effective and rapid approach whenever interdependent requirements and a large number of parameters make the experimental design for exploring new, efficient fluorescent probes unacceptably complex.

The traditional computational approaches used to predict absorption and fluorescence wavelengths (λabs/λem) involve quantum-chemical methods such as density functional methods [5], ab initio methods [6], and semi-empirical methods [7]. Accurate estimation of fluorescence wavelengths from theoretical calculations is still a challenging task, since high levels of theory are necessary to account for both the dynamical and non-dynamical parts of electron correlation, which is very difficult for large substituted fluorescent compounds. In addition, the wavelengths of some dyes calculated by DFT have been found to be poor without properly corrected hybrid functionals [8].

Alternatively, statistical methods are a promising approach for predicting molecular properties from structures. QSPRs are models in which the structures and characteristics of molecules are correlated with their experimental behavior using various mathematical regression algorithms. More recent examples are models for the prediction of toxicity [9],


photovoltaic performance [10], photoconversion efficiency [11], and density functional theory energies [12].

Recently, many successful attempts have been made to study the photochemical properties of various systems using the QSPR approach [13–18]. J Xu et al. predicted the absorption maxima of second-order nonlinear optical (NLO) chromophores using stepwise multilinear regression analysis (MLRA) [13]. C Nantasenamat et al. estimated the absorption and emission maxima of green fluorescent protein (GFP) chromophores using an artificial neural network (ANN) [14]. J Shi et al. performed a QSPR study of the fluorescence wavelengths of fluorescence probes using the heuristic method and radial basis function (RBF) neural networks [15]. Li et al. developed successful linear and nonlinear QSPR models to predict the absorption wavelengths of boronic acid based fluorescent biosensors [16]. J Xu et al. modeled the relative fluorescence intensity ratio of an Eu(III) complex in different solvents and discussed the interpretation of the descriptors in the correlation [17]. A Beheshti et al. correlated fluorescence properties with solvents in combination with quantum chemical calculations and revealed inadequacies of this approach [18].

It is important to know the limitations of current studies in order to improve the prediction of photophysical properties. The significant limitations of current QSPR studies are as follows:

1. Lack of variety and number of dyes
Machine learning algorithms require large data sets to perform well. However, previous works focused on only one chromophore type or used data sets of fewer than 100 structures. Models built on such small data sets may generalize poorly in further applications.

2. Lack of concern for the solvent effect
The results of previous reports indicated that some nonlinear relationships exist between the descriptors and the photophysical properties (λabs/λem).
However, these anomalies may be related to the solvent effect. A variety of environmental factors affect fluorescence characteristics, including interactions between the fluorescent dye and the surrounding solvent molecules. The effects of these factors vary widely from one fluorescent dye to another, but environmental variables can heavily influence the absorption and emission spectra. Moreover, shifts in the absorption and emission wavelengths can be induced by the nature or composition of the solvent [19]. These shifts, often called "solvatochromic shifts," are experimental evidence of changes in solvation energy. That is, the solvent effect plays an important role in the process of fluorescence emission. However, J Xu et al. [17] did not report predicting fluorescence characteristics in different solvents based on the chemical structure of the fluorescent dye. Even though A Beheshti et al. [18] provided a QSPR with the solvent effect, they focused on only one dye.

3. Chemical interpretation
Previous reports showed good performance using artificial neural networks; nevertheless, for such black-box models it is usually hard to explain in simple terms why a prediction was made. Conversely, linear models provide simple regression rules for chemical interpretation but predicted the properties poorly in previous works.

Each machine learning method used in QSPR studies, such as partial least-squares regression, support vector machines, or artificial neural networks, has its specific advantages, weaknesses, and practical constraints. Although to a lesser degree than data curation, it is important to select the most suitable descriptors and QSPR methodology for photophysical property prediction. The random forest (RF) [20] could be one of the most effective solutions to QSPR tasks, but this method has not yet been widely used. The RF methodology is attractive because every forest represents a nonlinear consensus model derived from a large number of single decision trees. Moreover, RF has the following important advantages: (i) RF models are quite resistant to overfitting. (ii) RF does not require complicated and time-consuming variable selection. (iii) Compounds with different mechanisms of action can be analyzed within a single training set. In addition, there is no need for descriptor pre-selection or a separate cross-validation; the method has its own reliable procedure for estimating model quality and internal predictive ability.

In this work, the aims of the investigation were:

1. prediction of both absorption wavelengths (λabs) and emission wavelengths (λem) for a diverse set of 413 fluorescent probes under different solvent conditions;
2. development of robust, predictive, and interpretable models on the basis of Dragon 7 and quantum mechanical descriptors by RF;
3. comparison of RF models with partial least squares (PLS) regression and support vector machine regression (SVR) models developed using descriptors selected by RF.

We modeled the fluorescence characteristics taking the solvent effects into account, to overcome the shortcomings of current QSPR studies. The established QSPR models can be used to predict fluorescence properties (λabs/λem) from the structure of the fluorescent dye alone. Meanwhile, the important features related to the wavelengths selected by the RF models provide valuable information to guide the synthesis of fluorescent probes and sensors.


Methods

Data Selection / Selection of Training and Test Data

Collecting physicochemical data is the first step in establishing a QSPR model, since reliable data are required to build reliable predictive models. The experimental maximum absorption (λabs) and emission (λem) wavelengths of a large set of 413 dyes were collected from a database [21] and from the literature [22–24], which included naphthalene, anthracene, fluorene, and pyrene solvatochromic dyes in several solvents. A data set containing 413 dyes, with 473 samples of the 413 dyes in different solvent conditions, was used in this study. These dyes include chromophore derivatives of cyanine, xanthene, coumarin, pyrene, naphthalene, anthracene, and others. A complete list of the chromophore derivatives (dye types) is given in Table 1. Within each chromophore derivative, the data set was randomly divided into two subsets: a training dataset of 334 dyes and a test dataset of 79 dyes. For the QSPR with solvent effect correlation, a training dataset of 392 samples and a test dataset of 81 samples were used. The training set was used to adjust the parameters and construct the QSPR models, and the test dataset was used to evaluate their prediction ability. Two-dimensional (2D) structures were processed with standardizer software that canonicalized the structures, added hydrogens, and performed aromatic form conversions [25]. 3D structures were optimized, and the geometries of the minimum-energy conformations were obtained using the MMFF94 force field with KNIME nodes [26].

Calculation of Descriptors

For the mathematical treatment of molecules, compounds were described using molecular descriptors [27], which are numbers encoding the presence of particular structural features, fragments, or chemical properties. Two types of descriptors were calculated:

(a) 2143 Dragon 7 molecular descriptors covering 0-dimensional to 2-dimensional molecular information. 3-dimensional descriptors were not considered because they did not lead to an improvement of predictions [28].
(b) 25 quantum mechanical properties calculated with Gaussian 09 [29]. The geometries of the molecules were optimized with the B3LYP density functional method [30] using the 6-31G* basis set, followed by frequency calculations to verify true energy minima.

The calculated quantum chemical (QC) descriptors include:

1. 2 atomic force descriptors: maximum force on molecules (FMAX), root mean square force (FRMS);
2. 4 energy descriptors: Highest Occupied Molecular Orbital (HOMO), Lowest Unoccupied Molecular Orbital (LUMO), HOMO-LUMO gap, thermal energy (TE);
3. 6 charge information descriptors: minimum of negative charge (MNQ) and positive charge (MPQ), sum of negative charge (SNQ) and positive charge (SPQ), average of negative charge (ANQ) and positive charge (APQ);
4. 13 polarity-related descriptors: dipole moment (DP), exact polarizability (EP(xx), EP(xy), EP(yy), EP(xz), EP(yz), EP(zz)), and approximate polarizability (AP(xx), AP(xy), AP(yy), AP(xz), AP(yz), AP(zz)).

Table 1 Chromophore derivatives (dye type) and number of dyes in data set

Dye type          Number of dyes    Dye type         Number of dyes
Acridine          10                Luminogren       2
Anthracene        9                 Naphthalene      8
Benzene           20                Perylene         13
Benzothiazole     3                 Phenoxazine      11
Benzoxadiazole    4                 Phenyloxazole    8
Benzoxazole       13                Porphyrine       10
BODIPY            16                Pyrene           12
Coumarin          50                Quinoline        2
Cyanine           123               Xanthene         79
Fluorene          2                 Others           18

Nearest Neighbor Imputation

Nearest neighbor (NN) imputation is an efficient method for missing value imputation. NN replaces each missing value with a value obtained from related cases in the whole dataset [31]. Moreover, NN imputation has several characteristics:

1. Values imputed by NN are actually occurring values, not calculated values.
2. NN makes use of auxiliary information provided by the existing values, which can preserve the original data structure.
3. NN does not require explicit models to relate the data, and is therefore less prone to model misspecification.
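The imputation idea described above can be illustrated with a short sketch. It is not the authors' implementation: the function name `nn_impute`, the use of a single nearest donor, and the Euclidean distance over the columns observed in both rows are all assumptions made here for illustration, since the text does not specify these details.

```python
import math

def nn_impute(rows):
    """Fill None entries with the value of the nearest row that has that entry.

    Assumption: distance is the root mean squared difference over the columns
    observed in BOTH rows; the paper does not state its distance measure.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            best, best_dist = None, math.inf
            for k, donor in enumerate(rows):
                if k == i or donor[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, donor)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                if dist < best_dist:
                    best, best_dist = donor[j], dist
            if best is not None:
                # The imputed value is an actually occurring value, copied from the donor.
                filled[i][j] = best
    return filled

# Hypothetical toy table: columns are two descriptors and a HOMO-LUMO gap.
table = [
    [1.0, 2.0, 3.1],
    [1.1, 2.1, None],   # gap missing; this row is closest to the first row
    [5.0, 6.0, 2.0],
]
imputed = nn_impute(table)
```

Because the missing entry is copied from an existing row, the imputed gap is a value that actually occurs in the data set, which matches the first characteristic listed above.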

Random Forest

Random forest (RF) is a versatile machine learning algorithm that can perform classification, regression, and even multioutput tasks, and is capable of fitting complex data sets. RF is an ensemble of single fully grown decision trees built by the Classification and Regression Trees (CART) algorithm [32]. Every tree is a logical construction that can be represented as a set of "if ... then ..." criteria. An RF algorithm recursively tries to find common criteria for objects from the same class, using randomly selected descriptors. Each tree in RF is grown as follows:

1. A bootstrap sample, which serves as the training set for the current tree, is drawn from the whole training set of N compounds. Compounds that are not in the current training set are placed in the out-of-bag (OOB) set, which is used to obtain a running unbiased estimate of the model error and variable importance;
2. The best split among m randomly selected descriptors, taken from the whole set of M descriptors, is chosen at each node based on the impurity measure;
3. Each tree is grown fully, without pruning.

RF possesses its own reliable statistical characteristics, which can be used for self-validation and model selection. The determination coefficients for the training set (R²) and the out-of-bag set (R²oob) are the two main characteristics of the model. The major criterion for estimating the internal predictive ability of RF models and for model selection is the value of R²oob.
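The first two tree-growing steps can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the software used in the study: the helper names (`bootstrap_split`, `mse`, `best_split`), the toy data, and the choice m = 1 are hypothetical, and the mean squared error is taken as the regression impurity.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample of indices; indices never drawn form the OOB set."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob

def mse(values):
    """Mean squared deviation from the mean: the regression impurity of a node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(X, y, rows, candidate_features):
    """Return (feature, threshold, impurity_decrease) of the best candidate split."""
    parent = mse([y[i] for i in rows])
    best = (None, None, 0.0)
    for f in candidate_features:
        for t in sorted({X[i][f] for i in rows}):
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            if not left or not right:
                continue
            weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(rows)
            if parent - weighted > best[2]:
                best = (f, t, parent - weighted)
    return best

rng = random.Random(0)
# Hypothetical toy data: 2 descriptors per compound, wavelength-like target in nm.
X = [[0.1, 5.0], [0.2, 4.0], [0.3, 6.0], [0.8, 5.5], [0.9, 4.5], [1.0, 5.2]]
y = [400.0, 410.0, 405.0, 600.0, 610.0, 605.0]

in_bag, oob = bootstrap_split(len(X), rng)          # step 1: bootstrap and OOB sets
m = 1                                               # descriptors tried per node (M = 2 here)
candidates = rng.sample(range(len(X[0])), m)        # step 2: random descriptor subset
feature, threshold, decrease = best_split(X, y, in_bag, candidates)

# Deterministic check on the full toy set with both descriptors as candidates.
full_feature, full_threshold, full_decrease = best_split(X, y, list(range(len(X))), [0, 1])
```

On the full toy set the first descriptor separates the two wavelength clusters, so it yields the largest impurity decrease. Summing such decreases per descriptor over all nodes and trees is the usual basis for impurity-based feature importance.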

Feature Selection by RF

In a decision tree, every node is a split condition on a single feature, chosen so that similar response values end up in the same set. The optimal split condition is measured by impurity. For classification, the measure is typically Gini impurity or information gain (entropy). For regression trees, the mean squared error is used. The decrease in impurity is computed when a tree is trained. RF calculates the impurity decrease for each feature and ranks the features accordingly. Thus, RF provides a straightforward method for feature selection by evaluating the feature importance.

Partial Least Squares Regression

PLS is a popular and powerful computational method that expresses a dependent target variable in terms of linear combinations of the descriptors, commonly known as principal components [33]. The PLS method is used to establish relationships between the dependent variable in the y vector and the descriptors in the X matrix. By extracting latent variables that are correlated with y while accounting for a large amount of the variation in X, PLS reduces the dimension of the predictor variables. In other words, PLS maximizes the covariance between X and y. In detail, X and y are decomposed into score vectors (t and u), loading vectors (p and q), and residual error matrices (E and F):

\[ X = t p^{T} + E \tag{1} \]

and

\[ y = u q^{T} + F \tag{2} \]

Table 2 The coefficient of determination and root mean square error value for the different models of case study 1

         Training dataset                               Test dataset
         Absorption (λabs)      Emission (λem)          Absorption (λabs)      Emission (λem)
Model    R²       RMSE (nm)     R²       RMSE (nm)      R²pred   RMSE (nm)     R²pred   RMSE (nm)
RF       0.974    21.5          0.967    21.7           0.915    36.8          0.867    40.7
PLS      0.901    39.1          0.888    40.4           0.825    52.6          0.818    47.7
SVR      0.973    21.7          0.950    26.9           0.904    39.1          0.860    41.8

Fig. 1 Experimental values versus calculated values of λabs and λem by RF

Support Vector Machine Regression

A support vector machine (SVM) is one of the most powerful machine learning models and is capable of performing linear or nonlinear classification and regression [34]. SVMs are particularly useful for small- or medium-sized data sets. Support vector regression is formulated as an optimization problem [35]:

\[ \min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \left( \xi_i + \xi_i^{*} \right) \tag{3} \]

subject to

\[ y_i - \left( w^{T} \Phi(x_i) + b \right) \le \varepsilon + \xi_i \tag{4} \]

\[ \left( w^{T} \Phi(x_i) + b \right) - y_i \le \varepsilon + \xi_i^{*} \tag{5} \]

\[ \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \ldots, l \tag{6} \]

where l denotes the number of samples, b is the bias term, Φ(x_i) is the vector of the i-th sample mapped into a higher-dimensional space by the kernel function, ξ_i represents the upper training error, and ξ_i* is the lower training error, both measured relative to the ε-insensitive tube around w^TΦ(x_i) + b. Three parameters determine the SVR quality: the error cost C, the width of the tube ε, and the kernel function. The basic idea in SVR is to map the data set into a high-dimensional feature space via kernel functions, which perform a nonlinear mapping between the input space and the feature space. A popular kernel function in machine learning is the Gaussian (RBF) kernel:

\[ k(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}} \right) \tag{7} \]

In Eq. 7, σ² denotes the width of the Gaussian kernel.
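As a numerical illustration of the ε-insensitive constraints and the Gaussian kernel, the following sketch evaluates the slack variables for a single prediction and the kernel value for two points. The function names and example numbers are hypothetical. The conversion between σ² and the γ hyperparameter assumes the common convention k = exp(−γ‖x_i − x_j‖²), which the text itself does not state.

```python
import math

def gaussian_kernel(xi, xj, sigma2):
    """k(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma2)); sigma2 is the kernel width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma2))

def slack(y_true, y_pred, epsilon):
    """Return (xi, xi_star): the smallest slacks satisfying the two tube constraints."""
    upper = max(0.0, (y_true - y_pred) - epsilon)   # prediction below the tube
    lower = max(0.0, (y_pred - y_true) - epsilon)   # prediction above the tube
    return upper, lower

def gamma_to_sigma2(gamma):
    """Assumed convention k = exp(-gamma * ||xi - xj||^2), hence sigma2 = 1 / (2 * gamma)."""
    return 1.0 / (2.0 * gamma)

k_same = gaussian_kernel([1.0, 2.0], [1.0, 2.0], sigma2=0.5)   # identical points
k_far = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma2=0.5)    # squared distance 25
inside = slack(500.0, 500.5, epsilon=1.0)    # error within the tube: no penalty
outside = slack(500.0, 495.0, epsilon=1.0)   # under-prediction by 5 nm
sigma2_example = gamma_to_sigma2(2 ** -10)
```

Errors smaller than ε produce zero slack and therefore do not contribute to the objective, which is why the tube width acts as a tolerance on the wavelength error.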

Results and Discussion

Case study 1: Modeling of Fluorescence Properties without Quantum Chemical Descriptors (413 dyes with Dragon 7 Descriptors)

To obtain RF models, the following settings were chosen: 100 estimators in the RF model and 2143 Dragon 7 descriptors. The coefficients of determination (R²) and RMSE values are listed in Table 2. For absorption, a well-fitted RF model based on Dragon 7 descriptors was obtained, with R² = 0.974, R²oob = 0.801, and RMSE = 21.5 nm for the training dataset. This model also demonstrates good predictive ability for the external test dataset (R²pred = 0.915, RMSE = 36.8 nm), as shown in Fig. 1. For emission wavelength, the RF model gave R² = 0.967, R²oob = 0.765, and RMSE = 21.7 nm for the training dataset, and R²pred = 0.867 and RMSE = 40.7 nm for the test dataset. Emission wavelength is relatively hard to predict because of the complex relaxation process from the excited state.

RFs are often used in feature selection because tree-based strategies naturally rank features by how much they improve the purity of the nodes. The RF model thus provides a simple way to assess feature importance. By selecting the variables with high RF feature importance, the pool of descriptors was reduced to 300 for constructing the PLS and SVR models.

Fig. 2 Experimental values versus calculated values of λabs and λem by PLS.
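The R² and RMSE statistics reported for the models can be computed as in the sketch below. It assumes that R² is the coefficient of determination 1 − SS_res/SS_tot; the wavelength values are hypothetical and are not taken from the data set.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error, in the same unit as the inputs (nm here)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical experimental and predicted wavelengths in nm.
observed = [400.0, 500.0, 600.0, 700.0]
predicted = [410.0, 490.0, 610.0, 690.0]

r2 = r_squared(observed, predicted)
error_nm = rmse(observed, predicted)
```

Applying the same two functions to training, out-of-bag, and external test predictions would give R², R²oob, and R²pred respectively, under the assumption stated above.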


Fig. 3 Experimental values versus calculated values of λabs and λem by SVR

The linear models were developed using PLS. For each selected model, the 5-fold cross-validation statistical parameter (Q²) was calculated. For absorption wavelength, the PLS model with 9 components gave R² = 0.888, Q² = 0.596, and RMSE = 41.80 nm. The prediction results for the test dataset were R²pred = 0.825 and RMSE = 52.6 nm. For emission wavelength, the PLS model with 11 components (Q² = 0.545) has an RMSE of 40.4 nm for the training dataset and 47.7 nm for the test dataset. The R² of the training set is 0.888, and R²pred of the test set is 0.818. Fig. 2 shows the predicted versus experimental λabs and λem for the training and test datasets.

SVR was performed to develop a nonlinear model of absorption and emission wavelength based on the same descriptors as the PLS models. To obtain better results, the parameters that influence the performance of SVR were optimized by grid search with 5-fold cross-validation. The hyperparameters (C = 8, ε = 2^−6, γ = 2^−10) with the highest Q² = 0.783 were chosen as the optimal condition for the absorption SVR model. The results of the models are shown in Fig. 3. The R² of the training dataset is 0.973, and R²pred of the test dataset is 0.904. The SVR model has an RMSE of 21.7 nm for the training dataset and 39.1 nm for the test dataset. The emission SVR model, on the other hand, was constructed using C = 4, ε = 2^−14, γ = 2^−10, the hyperparameters with the highest Q² = 0.708. The R² of the training set is 0.950, and R²pred of the test set is 0.860. The SVR model has an RMSE of 26.9 nm for the training dataset and 41.8 nm for the test dataset.

From the results of the RF, PLS, and SVR models, we see that RF performed comparatively better than linear PLS and nonlinear SVR. Moreover, RF provides an ensemble of nonlinear relationships, which differs from SVR. This may result from the complex phenomena of absorption and emission.

Case study 2: Modeling of Fluorescence Properties with Quantum Chemical Descriptors (413 dyes with Dragon 7 Descriptors + QC Calculated Descriptors)

To the best of our knowledge, the absorption and emission wavelengths are highly related to some QC parameters [36]. In order to improve the performance of the predictions, we prepared 25 QC descriptors calculated with Gaussian 09. Thus, the RF models were constructed with 2143 Dragon 7 descriptors and 25 QC descriptors. However, the geometry optimization of complicated structures such as cyanine dyes is not an easy task. Specifically, there were about 89 dyes whose QC parameters could not be calculated, which led to missing data in the data set. We used NN imputation to fill the missing values with values obtained from related cases in the whole data set. Compared with the case study 1 results using only Dragon 7 descriptors, the RF models benefited from the QC descriptors. The improved results are listed in Table 3 and shown in Fig. 4. The training results for absorption and emission wavelengths were R² = 0.973, R²oob = 0.815, RMSE = 21.5 nm and R² = 0.969, R²oob = 0.800, RMSE = 21.3 nm, respectively. The two well-fitted models also displayed good prediction on the test dataset (R²pred = 0.933 and RMSE = 32.6 nm for absorption wavelength; R²pred = 0.904 and RMSE = 34.8 nm for emission wavelength). Both absorption and emission predictions are slightly improved over case study 1. This reveals that the use of QC descriptors such as the HOMO-LUMO gap, which is theoretically related to fluorescence properties, can effectively improve model performance.

Table 3 The coefficient of determination and root mean square error value for the different models of case study 2

         Training dataset                               Test dataset
         Absorption (λabs)      Emission (λem)          Absorption (λabs)      Emission (λem)
Model    R²       RMSE (nm)     R²       RMSE (nm)      R²pred   RMSE (nm)     R²pred   RMSE (nm)
RF       0.973    21.5          0.969    21.3           0.933    32.6          0.904    34.8
PLS      0.890    41.8          0.826    50.3           0.800    56.5          0.774    53.1
SVR      0.964    25.2          0.937    30.2           0.914    37.0          0.836    45.2

Fig. 4 Experimental values versus calculated values of λabs and λem by RF

We also selected 300 descriptors, including 7 QC descriptors, with high feature importance in RF to construct PLS and SVR models. The PLS models with 13 components for absorption wavelength and 6 components for emission wavelength were chosen as the optimal conditions. The PLS model of absorption wavelength has R² = 0.964 and RMSE = 39.1 nm for the training dataset, and R²pred = 0.800 and RMSE = 56.5 nm for the test dataset. For emission wavelength, the model has R² = 0.826 for the training dataset and R²pred = 0.774 for the test dataset; the RMSE was 50.3 nm for the training dataset and 53.1 nm for the test dataset. The results of the model are shown in Fig. 5.

Fig. 5 Experimental values versus calculated values of λabs and λem by PLS


Fig. 6 Experimental values versus calculated values of λabs and λem by SVR

For the nonlinear model of absorption wavelength, the SVR model (C = 4, ε = 2^−13, γ = 2^−9) gave R² = 0.964, Q² = 0.753, and RMSE = 25.2 nm. This SVR model predicts the test dataset with R²pred = 0.914 and RMSE = 37.0 nm. To predict emission wavelength, the SVR model (C = 4, ε = 2^−6, γ = 2^−11) was prepared, with R² = 0.937 and Q² = 0.709; it has an RMSE of 30.2 nm for the training dataset, and R²pred = 0.836 and RMSE = 45.2 nm for the test dataset. The results of the model are shown in Fig. 6.

To our surprise, QC descriptors had no significant influence on the predictive ability of the PLS and SVR models. The PLS models show worse prediction of both absorption and emission wavelengths. The SVR models, in turn, predict absorption better, which is directly related to the HOMO-LUMO gap, whereas the complicated relaxation process of fluorescence emission results in poor SVR prediction of emission. In contrast, the ensemble learning method, RF, performs better by combining the predictions of 100 estimators. This observation shows that RF is able to predict fluorescence properties for a large variety of dye types.

Case study 3: Modeling of Fluorescence Properties Considering the Solvent Effects

From the results of case studies 1 and 2, we see that the ensemble learning method is able to learn the complex nonlinear relationship between chemical descriptors and fluorescence properties. The lower predictability of emission wavelength may be related to the solvent effect. That is, the solvent effect plays an important role in fluorescence properties, especially for emission. Unlike PLS and SVR models, a major advantage of decision tree models and random forests is that they are able to operate on both continuous and categorical variables directly. Accordingly, it is possible to handle solvent information simply by supplying the solvent species. We compare 3 different models: (1) RF1 with 2143 descriptors (Dragon 7), (2) RF2 with 2168 descriptors (Dragon 7 + QC), (3) RF3 with 2168 descriptors + 1 solvent species. The results of the three models are shown in Table 4 and Figs. 7, 8 and 9. The training dataset contains several dyes in different solvents to improve the prediction with solvent effects.

Table 4 The coefficient of determination and root mean square error value for the different models of case study 3

         Training dataset                               Test dataset
         Absorption (λabs)      Emission (λem)          Absorption (λabs)      Emission (λem)
Model    R²       RMSE (nm)     R²       RMSE (nm)      R²pred   RMSE (nm)     R²pred   RMSE (nm)
RF1      0.975    20.9          0.965    22.2           0.921    35.7          0.872    39.8
RF2      0.978    19.5          0.967    21.8           0.939    31.2          0.901    34.9
RF3      0.979    18.9          0.971    20.2           0.940    31.0          0.905    34.2

Fig. 7 Experimental values versus calculated values of λabs and λem by RF1

Fig. 8 Experimental values versus calculated values of λabs and λem by RF2

The training results for absorption wavelength were R² = 0.975, R²oob = 0.831 for RF1; R² = 0.978, R²oob = 0.846 for RF2; and R² = 0.979, R²oob = 0.845 for RF3. The RF1 model (R²pred = 0.921 and RMSE = 35.7 nm) and the RF2 model (R²pred = 0.939 and RMSE = 31.2 nm), both without solvent information, displayed prediction results on the test dataset for absorption wavelength similar to those of case study 1 and case study 2. Unlike RF1 and RF2, the RF3 model takes the solvent effect into account. The prediction result of RF3, with R²pred = 0.940 and RMSE = 31.0 nm, is nearly the same as the RF2 result. This suggests that absorption shows less solvent-dependent behavior.

In contrast, for emission wavelength the three RF models, RF1 (R² = 0.965, R²oob = 0.793), RF2 (R² = 0.967, R²oob = 0.807) and RF3 (R² = 0.971, R²oob = 0.805), behave differently from the absorption case. Because the solvent effect refers to a strong dependence of emission spectra on the solvent polarity, the RF3 model with solvent correlation (R²pred = 0.905, RMSE = 34.2 nm) achieved slightly higher predictive accuracy than the RF1 model (R²pred = 0.872, RMSE = 39.8 nm) and the RF2 model (R²pred = 0.901, RMSE = 34.9 nm). This good statistical quality is consistent with the solvent effect, and the CARTs in the RF model correlate well with solvent species.
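One way to supply the solvent species as a single extra input, as in RF3, is to map each solvent name to an integer code and append it to the descriptor vector. The sketch below is only an illustration: the text does not say how the solvent species was encoded, and the function names, solvent names, and descriptor values are hypothetical.

```python
def encode_solvents(solvent_names):
    """Map each distinct solvent name to an integer code, in order of first appearance."""
    codes = {}
    for name in solvent_names:
        if name not in codes:
            codes[name] = len(codes)
    return codes

def append_solvent_column(descriptor_rows, solvent_names, codes):
    """Return new rows with the solvent code appended as one extra feature."""
    return [row + [codes[s]] for row, s in zip(descriptor_rows, solvent_names)]

# Hypothetical samples: the same dye (identical descriptors) measured in two solvents,
# plus a second dye measured in one solvent.
descriptors = [[0.12, 3.4], [0.12, 3.4], [0.50, 2.9]]
solvents = ["ethanol", "water", "ethanol"]

codes = encode_solvents(solvents)
samples = append_solvent_column(descriptors, solvents, codes)
```

With this layout, the first two samples differ only in the solvent column, so any difference in their measured wavelengths can only be captured by splits on that column. Note that integer codes impose an arbitrary order on the solvents; a tree can still isolate individual solvents through repeated threshold splits.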

Interpretation of Descriptors

The feature importance provided by RF can help identify the subsets of input variables that are most or least relevant to the regression problem and can suggest possible feature selections from the training dataset. We discuss the top 20 important descriptors in the RF3 models because of their high predictive accuracy for absorption and emission wavelengths.


Fig. 9 Experimental values versus calculated values of λabs and λem by RF3

Table 5 shows the 20 descriptors and their feature importance for the absorption wavelength. Four QC descriptors were selected by the RF algorithm. The HOMO-LUMO gap has the highest importance, and the LUMO is the third most important descriptor, as a result of the excitation process between HOMO and LUMO. The polarizabilities (AP(xx), EP(xx)) are also significant in the RF3 model. This reflects the fact that the polarizability of dyes has a strong effect on absorption wavelengths. Dyes with a large conjugation area, such as cyanine dyes, have a large van der Waals surface area (P_VSA_e_2, P_VSA_p_3, P_VSA_v_3). Although topological descriptors are hard to connect to properties in an interpretable way, 7 descriptors (SM10_EA(ri), SpDiam_EA(ri), SM09_EA(ri), SM08_EA(ri), Eig01_EA(ri), SpMax_EA(ri), SM07_EA(ri)) correlated with the resonance integral can be connected to the resonance phenomenon of absorption. Other descriptors are related to the complexity of the compounds, such as a large number of double bonds and symmetry. The more complex the structure of the dye, the larger the descriptor value. Additionally, there is a large number of cyanine dyes in the dataset.

Table 5 Top 20 descriptors selected by RF with high feature importance

Absorption model                              Emission model
Selected descriptors    Feature importance    Selected descriptors    Feature importance
gap                     0.249231              gap                     0.315981
AP(xx)                  0.058344              AP(xx)                  0.12856
LUMO                    0.035043              Chi0_EA(dm)             0.033721
SM10_EA(ri)             0.029883              Chi1_EA(dm)             0.026918
P_VSA_e_2               0.029355              F01[C-N]                0.01628
P_VSA_p_3               0.024075              P_VSA_ppp_L             0.013386
SpDiam_EA(ri)           0.022651              CATS2D_06_LL            0.013322
SM09_EA(ri)             0.018227              P_VSA_s_4               0.011327
SM08_EA(ri)             0.017608              SpDiam_AEA(ed)          0.010218
Eig01_EA(ri)            0.01643               SpMin5_Bh(m)            0.009238
P_VSA_v_3               0.015178              SpMin6_Bh(e)            0.00905
SM08_EA(ed)             0.014707              EP(xx)                  0.008636
D/Dtr09                 0.014307              IAC                     0.007606
EP(xx)                  0.012781              SpDiam_EA(ri)           0.006793
CATS2D_02_AL            0.012779              Eig01_EA(ed)            0.006083
SpMax_EA(ri)            0.010847              CATS2D_00_LL            0.006052
SdsCH                   0.010701              LUMO                    0.005758
SM07_EA(ri)             0.009531              SM15_AEA(ed)            0.005531
Eig02_EA                0.008063              Solvent                 0.005368
Chi0_EA(dm)             0.007983              ATSC3i                  0.005303

For the emission wavelength, the 20 important descriptors also contain 4 QC descriptors. The most important HOMO-LUMO gap, the van der Waals surface areas (P_VSA_ppp_L, P_VSA_s_4), and the topological descriptors can be explained in the same way as for absorption wavelength. Because the ground and excited states of a dye have different polarities, a change in polarity leads to different stabilization energies of the ground and excited states. Thus, the emission wavelength shows a strong solvent effect, called "solvatochromism." Interestingly, the high importance of polarizability (AP(xx), EP(xx)) and of dipole-moment-correlated topological descriptors (Chi0_EA(dm), Chi1_EA(dm)) supports the solvatochromism phenomenon. Solvent species has a high influence among 2165 descriptors but is still not the main factor for emission wavelength, because only 7 solvatochromic dyes were in the dataset. The structural descriptors such as F01[C-N], CATS2D_06_LL, IAC and CATS2D_00_LL reflect the structural features of cyanine dyes.

Unfortunately, although it is possible to determine the degree of importance of descriptors for an RF model, it is difficult to establish what kind of influence the descriptors have on the final result. In other words, it is difficult to explain how the descriptors directly influence the predicted properties. This is a direct consequence of the nonlinear character of the RF models.
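Ranking descriptors by importance, as done for Table 5, amounts to sorting the importance values and keeping the top entries. The sketch below uses a subset of the emission-model values from Table 5; the function name `rank_features` is hypothetical, and the same idea with a larger cutoff would correspond to the 300-descriptor selection used for the PLS and SVR models.

```python
def rank_features(importances, top_n):
    """Sort descriptors by decreasing importance and keep the first top_n."""
    ordered = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_n]

# Importance values copied from the emission column of Table 5 (a subset only).
emission_importance = {
    "Solvent": 0.005368,
    "gap": 0.315981,
    "LUMO": 0.005758,
    "AP(xx)": 0.12856,
    "Chi0_EA(dm)": 0.033721,
    "EP(xx)": 0.008636,
}

top3 = rank_features(emission_importance, top_n=3)
top3_names = [name for name, _ in top3]
top3_share = sum(value for _, value in top3)
```

The ranking shows which descriptors the forest relied on most, but, as noted above, it does not reveal the direction or shape of each descriptor's influence on the predicted wavelength.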


Conclusion


The present study demonstrates that RF models can be used for the prediction of fluorescence properties. The results showed that the RF models built with the quantum chemical descriptors had good predictive ability and agreed well with the experimental values; the best RF models gave R2pred values of 0.940 and 0.905 for λabs and λem, respectively. Therefore, it can be concluded that (1) DFT-calculated descriptors are useful tools for predicting the properties of fluorescent dyes, (2) RF models describe the complex phenomena of fluorescent dyes more accurately than PLS and SVR models, (3) RF is able to handle both categorical and continuous variables, which allows the solvent effect to be taken into account, and (4) the proposed model can identify, and provide some insight into, the descriptors related to the fluorescence properties. The QSPR approach is a promising tool that provides a quick and cost-effective means of predicting both the fluorescence absorption and emission wavelengths.
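Point (3) can be sketched as follows. This section does not state how the Solvent descriptor was encoded, so the integer coding below is an assumption for illustration only, and the records, descriptor values and function names are hypothetical. The idea is that a tree-based model splits on thresholds, so a solvent label mapped to an integer code can sit in the same feature row as continuous descriptors.

```python
# Hypothetical records: two continuous descriptors plus a solvent label.
records = [
    {"gap": 3.10, "LUMO": -2.05, "solvent": "water"},
    {"gap": 2.45, "LUMO": -2.60, "solvent": "ethanol"},
    {"gap": 2.45, "LUMO": -2.60, "solvent": "water"},
    {"gap": 4.20, "LUMO": -1.30, "solvent": "cyclohexane"},
]


def build_solvent_codes(rows):
    """Map each distinct solvent name to an integer code (alphabetical order)."""
    names = sorted({row["solvent"] for row in rows})
    return {name: code for code, name in enumerate(names)}


def to_feature_row(row, codes):
    """Return [gap, LUMO, solvent_code] as floats for a tree-based regressor."""
    return [float(row["gap"]), float(row["LUMO"]), float(codes[row["solvent"]])]


solvent_codes = build_solvent_codes(records)
X = [to_feature_row(row, solvent_codes) for row in records]

for row in X:
    print(row)
```

The second and third records describe the same dye in two solvents: their continuous descriptors are identical and only the solvent code differs, which is the information a tree can split on to separate the two measurements.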

References

1. Carter KP, Young AM, Palmer AE (2014) Fluorescent Sensors for Measuring Metal Ions in Living Systems. Chem Rev 114:4564–4601. https://doi.org/10.1021/cr400546e
2. Yue Y, Huo F, Yin C et al (2015) A new "donor-two-acceptor" red emission fluorescent probe for highly selective and sensitive detection of cyanide in living cells. Sensors Actuators B Chem 212:451–456. https://doi.org/10.1016/j.snb.2015.02.074
3. Guo Z, Park S, Yoon J, Shin I (2014) Recent progress in the development of near-infrared fluorescent probes for bioimaging applications. Chem Soc Rev 43:16–29. https://doi.org/10.1039/C3CS60271K
4. Basabe-Desmonts L, Reinhoudt DN, Crego-Calama M (2007) Design of fluorescent materials for chemical sensing. Chem Soc Rev 36:993–1017. https://doi.org/10.1039/B609548H
5. Guillaumont D, Nakamura S (2000) Calculation of the absorption wavelength of dyes using time-dependent density-functional theory (TD-DFT). Dyes Pigments 46:85–92. https://doi.org/10.1016/S0143-7208(00)00030-9
6. Åstrand P-O, Ramanujam PS, Hvilsted S et al (2000) Ab Initio Calculation of the Electronic Spectrum of Azobenzene Dyes and Its Impact on the Design of Optical Data Storage Materials. J Am Chem Soc 122:3482–3487. https://doi.org/10.1021/ja993154r
7. De la Fuente JR, Cañete A, Saitz C, Jullian C (2002) Photoreduction of 3-Phenylquinoxalin-2-ones by Amines: Transient-Absorption and Semiempirical Quantum-Chemical Studies. J Phys Chem A 106:7113–7120. https://doi.org/10.1021/jp014317c
8. Jacquemin D, Perpète EA, Scuseria GE et al (2008) TD-DFT Performance for the Visible Absorption Spectra of Organic Dyes: Conventional versus Long-Range Hybrids. J Chem Theory Comput 4:123–135. https://doi.org/10.1021/ct700187z
9. Zhao Y, Zhao J, Huang Y et al (2014) Toxicity of ionic liquids: Database and prediction via quantitative structure–activity relationship method. J Hazard Mater 278:320–329. https://doi.org/10.1016/j.jhazmat.2014.06.018
10. Venkatraman V, Alsberg BK (2015) A quantitative structure-property relationship study of the photovoltaic performance of phenothiazine dyes. Dyes Pigments 114:69–77. https://doi.org/10.1016/j.dyepig.2014.10.026
11. Kar S, Sizochenko N, Ahmed L et al (2016) Quantitative structure-property relationship model leading to virtual screening of fullerene derivatives: Exploring structural attributes critical for photoconversion efficiency of polymer solar cell acceptors. Nano Energy 26:677–691. https://doi.org/10.1016/j.nanoen.2016.06.011
12. Pereira F, Xiao K, Latino DARS et al (2017) Machine learning methods to predict density functional theory B3LYP energies of HOMO and LUMO Orbitals. J Chem Inf Model 57:11–21. https://doi.org/10.1021/acs.jcim.6b00340
13. Xu J, Zheng Z, Chen B, Zhang Q (2006) A linear QSPR model for prediction of maximum absorption wavelength of second-order NLO chromophores. QSAR Comb Sci 25:372–379. https://doi.org/10.1002/qsar.200530143
14. Nantasenamat C, Isarankura-Na-Ayudhya C, Tansila N et al (2007) Prediction of GFP spectral properties using artificial neural network. J Comput Chem 28:1275–1289. https://doi.org/10.1002/jcc.20656
15. Shi J, Luan F, Zhang H et al (2006) QSPR study of fluorescence wavelengths (λex/λem) based on the heuristic method and radial basis function neural networks. QSAR Comb Sci 25:147–155. https://doi.org/10.1002/qsar.200510142
16. Li M, Ni N, Wang B, Zhang Y (2008) Modeling the excitation wavelengths (λex) of boronic acids. J Mol Model 14:441–449. https://doi.org/10.1007/s00894-008-0293-0
17. Xu J, Xiong Q, Chen B et al (2008) Modeling the relative fluorescence intensity ratio of Eu(III) complex in different solvents based on QSPR method. J Fluoresc 19:203. https://doi.org/10.1007/s10895-008-0403-5
18. Beheshti A, Riahi S, Ganjali MR, Norouzi P (2012) Highlighting and trying to overcome a serious drawback with qspr studies; data collection in different experimental conditions (mixed-QSPR). J Comput Chem 33:732–747. https://doi.org/10.1002/jcc.22892
19. Marini A, Muñoz-Losa A, Biancardi A, Mennucci B (2010) What is Solvatochromism? J Phys Chem B 114:17128–17135. https://doi.org/10.1021/jp1097487
20. Breiman L (2001) Random Forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
21. fluorophores.org. http://www.fluorophores.tugraz.at/. Accessed 1 May 2007
22. Weber G, Farris FJ (1979) Synthesis and spectral properties of a hydrophobic fluorescent probe: 6-propionyl-2-(dimethylamino)naphthalene. Biochemistry 18:3075–3078. https://doi.org/10.1021/bi00581a025
23. Kucherak OA, Didier P, Mély Y, Klymchenko AS (2010) Fluorene Analogues of Prodan with Superior Fluorescence Brightness and Solvatochromism. J Phys Chem Lett 1:616–620. https://doi.org/10.1021/jz9003685
24. Lu Z, Lord SJ, Wang H et al (2006) Long-wavelength analogue of PRODAN: synthesis and properties of Anthradan, a fluorophore with a 2,6-Donor−Acceptor Anthracene Structure. J Org Chem 71:9651–9657. https://doi.org/10.1021/jo0616660
25. ChemAxon (2017) Marvin 17.28.0
26. Berthold MR, Cebron N, Dill F et al (2009) KNIME - the Konstanz Information Miner: Version 2.0 and Beyond. SIGKDD Explor Newsl 11:26–31. https://doi.org/10.1145/1656274.1656280
27. Karelson M (2000) Molecular descriptors in QSAR/QSPR
28. Kode - Chemoinformatics (2016) Dragon version 7.0.4
29. Frisch MJ, Trucks GW, Schlegel HB et al (2016) Gaussian 09 Revision A.02
30. Becke AD (1993) A new mixing of Hartree–Fock and local density-functional theories. J Chem Phys 98:1372–1377. https://doi.org/10.1063/1.464304
31. Batista GE, Monard MC et al (2002) A Study of K-Nearest Neighbour as an Imputation Method. HIS 87:48
32. Breiman L (1984) Classification and regression trees. Routledge, New York
33. Lorber A, Wangen LE, Kowalski BR (1987) A theoretical foundation for the PLS algorithm. J Chemom 1:19–31. https://doi.org/10.1002/cem.1180010105
34. Drucker H, Burges CJC, Kaufman L et al (1997) Support vector regression machines. In: Advances in neural information processing systems. pp 155–161
35. Basak D, Pal S, Patranabis DC (2007) Support vector regression. Neural Inf Process Rev 11:203–224
36. Sharnoff M (1971) Photophysics of aromatic molecules. J Lumin. https://doi.org/10.1016/0022-2313(71)90011-1
