SCIENCE CHINA Chemistry • ARTICLES •
July 2011 Vol.54 No.7: 1064–1071 doi: 10.1007/s11426-011-4299-6
A new quantitative structure–retention relationship model for predicting chromatographic retention time of oligonucleotides
ZHAO Wei1, LIANG GuiZhao1*, CHEN YuZhen2 & YANG Li1
1 Key Laboratory of Biorheological Science and Technology (Chongqing University), Ministry of Education, Bioengineering College, Chongqing University, Chongqing 400044, China
2 Department of Mathematics, Henan Institute of Science and Technology, Xinxiang 453003, China
Received August 15, 2010; accepted January 4, 2011

An integrated approach is proposed for predicting the chromatographic retention time of oligonucleotides with quantitative structure–retention relationship (QSRR) models. First, the primary base sequences of oligonucleotides are translated into vectors based on scores of generalized base properties (SGBP), which summarize physicochemical, quantum chemical, topological and spatial structural properties; the sequence data are then transformed into a uniform matrix by auto cross covariance (ACC). ACC accounts for the interactions between bases a given distance apart in an oligonucleotide sequence, so the method takes neighboring effects into account. Next, a genetic algorithm is used to select the variables related to the chromatographic retention behavior of oligonucleotides. Finally, a support vector machine is used to build QSRR models that predict chromatographic retention behavior. The whole dataset is divided into pairs of training and test sets in different proportions; QSRR models built with more than 26 training samples are found to have adequate external predictive power and to represent accurately the relationship between sequence and structural features and retention times. The results indicate that the SGBP–ACC approach is a useful structural representation method for QSRR of oligonucleotides owing to its rich structural information, ease of use and high characterization ability. Moreover, the method can be further applied to predict the chromatographic retention behavior of oligonucleotides.

oligonucleotide, quantitative structure-retention relationship, scores of generalized base properties, auto cross covariance, genetic algorithm, support vector machine
*Corresponding author (email: [email protected])
© Science China Press and Springer-Verlag Berlin Heidelberg 2011
chem.scichina.com www.springerlink.com

1 Introduction
Synthetic oligonucleotides are used as probes to determine the structure of DNAs or RNAs and are widely used in molecular biology applications, including gene chips, electrophoresis, and fluorescence in situ hybridization. Among the many separation methods available, ion-pair reversed-phase high performance liquid chromatography (HPLC) is quick, efficient and highly selective [1]. Large-scale purification requires suitable experimental conditions and parameters; however, condition optimization through experiments is costly and time-consuming [2]. For this reason, quantitative structure–retention relationship (QSRR) models have become an important tool for choosing separation conditions. Predicting oligonucleotide retention with QSRR models has become an important issue in chromatographic research and practice, and it has been shown [2] that QSRR models can be used to select and optimize HPLC parameters effectively. The main idea of QSRR is to predict retention data from molecular structures. In 1977, quantitative structure–bioactivity relationships were applied to the chromatographic retention analysis of a series of solutes [3–5], and interest in QSRR has grown steadily since. Recently, comprehensive reviews on QSRR have been published by Héberger [6], Put and Heyden [7] and Kaliszan [8].

At present, QSRR research focuses mainly on organic molecules, but there have been some reports on QSRR of peptides and proteins. Put et al. [9] used 1,726 molecular descriptors to develop QSRR models of 90 peptides based on the UVE–PLS algorithm. Baczek et al. [10] adopted three structural descriptors (log SumAA, log VDWVol and clog P) to construct a QSRR model of peptides using multiple linear regression. Bodzioch et al. [11] developed a QSRR model by means of experimental log SumAA descriptors to predict the retention time of peptides. Ladiwala et al. [12] established a QSRR model using a support vector machine (SVM) algorithm and a set of molecular descriptors based on protein crystal structure, primary sequence information and solvent-accessible protein surface area.

Only a few reports involve QSRR of DNAs. Kohlbacher et al. [13] used sequence and secondary structure information to characterize oligonucleotides of 15 to 48 bases, and constructed a QSRR model using SVM to predict chromatographic retention time. Gilar et al. [2] established a mathematical model for retention time prediction of oligonucleotides using sequence and length information; the model was also used to select the optimal initial gradient strength for fast HPLC purification of synthetic oligonucleotides.

The main concerns of QSRR are molecular structural representation and modeling methodology. Molecular descriptors often involve physicochemical [14], quantum chemical [15], topological [16, 17], geometric [16], and connectivity index parameters [18, 19]. Common modeling methods include multiple linear regression, artificial neural networks, partial least squares, random forests, and nearest shrunken centroids [6–8].
In this paper, we propose a combined method to predict and analyze the chromatographic retention behavior of oligonucleotides. The key step is a new multi-scale parameterization approach that combines scores of generalized base properties with auto cross covariances. The variables selected by a genetic algorithm are then used as inputs for SVM modeling, and the modeling performance is validated by leave-one-out cross validation and further evaluated by prediction experiments using models obtained from different training samples. The results show that the QSRR models thus obtained offer a new way to optimize experimental conditions for the purification of oligonucleotides, and that this combined method can be further used in QSRR research.
2 Methods and materials
2.1 Nucleotide sequence parameterization
DNAs and RNAs are linear macromolecules made up of five-membered deoxyribose or ribose rings that carry the bases adenine (A), cytosine (C), guanine (G), and thymine (T) or uracil (U) and are linked by phosphodiester bonds. Different positions and properties of the bases lead to different biological functions. Herein, a total of 1,209 property descriptors were collected to describe the structural diversity of DNA or RNA sequences (Supporting Information Table S1) [20]. Because the selected molecular descriptors may be highly correlated with each other, redundant parameters whose correlation coefficient with another parameter exceeded 0.90 in magnitude were removed, leaving 41 independent parameters. Principal component analysis (PCA) was then employed to find linear combinations of descriptors that capture the variation between the different bases [21]. The first four principal components (PCs) account for 99.9998% of the variance; that is, the first four PC scores explain almost all the information in the original data matrix (5 × 41), which can therefore be replaced by the matrix of four PC scores (5 × 4). Here, the four score vectors (Table 1) are tentatively called scores of generalized base properties (SGBP) [20].

DNA sequences are read in the 5′ to 3′ direction, and each base in the sequence is described by four SGBP variables. A sequence of n bases is thus characterized by a vector of 4 × n values, so oligonucleotide sequences of different lengths yield descriptor strings of unequal length.

Table 1 Four principal component solution scores for 41 selected base properties

Bases                               SGBP1      SGBP2      SGBP3      SGBP4
A                                   3.9505     4.0764     1.1507     1.2426
C                                   4.3677     1.0541     1.5173     3.2084
G                                   2.7552     4.8467     1.1540     1.4321
T                                   0.4217     0.8763     3.3983     4.0915
U                                   1.9163     1.1601     4.9190     1.7917
Eigenvalues                         11.5312    10.8331    10.1758    8.4599
Variance explained (%)              28.1249    26.4223    24.8189    20.6337
Cumulative variance explained (%)   28.1249    54.5472    79.3661    99.9998

Constructing a model requires a fixed number of inputs; therefore, the numerical arrays of unequal lengths were transformed into uniform matrices by an ACC transformation [22], a simple and convenient pre-processing method. ACC considers the interactions of bases at different sites with minimal information loss. Auto cross covariances between two scales, a and b, are calculated according to eq. (1).

$$\mathrm{ACC}_{a,b,l} = \sum_{i=1}^{n-l} \frac{Z_{a,i}\, Z_{b,i+l}}{n-l} \qquad (l = 1, 2, 3, \ldots, L) \tag{1}$$

Indices a and b denote the scales (1, 2, 3 and 4), n is the number of bases in a sequence, index i is the base position (1, 2, ···, n), l is the distance (lag) between one base and its neighbor, L is the maximum value of l, which must be less than the length of the shortest sequence in the dataset, and Z is the SGBP score. A total of 4² × l variables (16 per lag) were obtained by ACC.
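As an illustration of eq. (1), the following is a minimal Python sketch of the SGBP encoding followed by the ACC transformation. It is not the authors' SGBP_ACC program: the function names `encode` and `acc` are hypothetical, the score dictionary simply copies the values printed in Table 1 (with the unlabeled row assumed to belong to U), and no centering or scaling beyond eq. (1) is applied.

```python
from itertools import product

# SGBP scores per base, copied as printed in Table 1 (assumption: the
# unlabeled fifth row of the table belongs to U).
SGBP = {
    "A": (3.9505, 4.0764, 1.1507, 1.2426),
    "C": (4.3677, 1.0541, 1.5173, 3.2084),
    "G": (2.7552, 4.8467, 1.1540, 1.4321),
    "T": (0.4217, 0.8763, 3.3983, 4.0915),
    "U": (1.9163, 1.1601, 4.9190, 1.7917),
}

def encode(sequence, scores=SGBP):
    """Translate a 5'->3' base string into a list of per-base score tuples."""
    return [scores[base] for base in sequence.upper()]

def acc(sequence, max_lag, scores=SGBP):
    """Return the ACC vector of eq. (1): one value per (a, b, l) triple."""
    z = encode(sequence, scores)
    n = len(z)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    n_scales = len(z[0])
    features = []
    for lag in range(1, max_lag + 1):
        for a, b in product(range(n_scales), repeat=2):
            total = sum(z[i][a] * z[i + lag][b] for i in range(n - lag))
            features.append(total / (n - lag))
    return features

short = acc("GTAGCAGCAGCCAGAC", max_lag=9)        # a 16-mer
longer = acc("GTAGCAGCAGCCAGAC" * 2, max_lag=9)   # hypothetical 32-mer
print(len(short), len(longer))                    # both 4 * 4 * 9 = 144
```

Sequences of different lengths map to vectors of the same length, which is the property the model-building step relies on.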
2.2 Oligonucleotide data and their representation
Thirty-nine oligonucleotides and their retention indices were obtained from ref. [2] (Supporting Information Table S2). The experimental values, expressed as acetonitrile (%), were measured on a 50 × 4.6 mm XTerra MS C18, 2.5 μm column at a flow rate of 1 mL/min and 60 °C. Mobile phase A was acetonitrile–0.1 M triethylamine acetate (TEAA), pH 7 (5:95, volume ratio); mobile phase B was acetonitrile–0.1 M TEAA, pH 7 (15:85, volume ratio). The gradient started at 5% acetonitrile with a slope of 0.25% acetonitrile/min (mL). The 39 oligonucleotides contained from 16 to 60 bases, and the variable-length descriptors produced by the SGBP representation were transformed by ACC into vectors of uniform length.
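Because the gradient is linear, the programmed acetonitrile content and the elapsed gradient time are interchangeable. As a worked illustration only (the symbols φ and t are introduced here, and the relation assumes no correction for system dwell volume or column dead time, neither of which is stated above):

$$\varphi(t) = 5 + 0.25\,t \quad\Longleftrightarrow\quad t = \frac{\varphi - 5}{0.25}$$

where φ is the programmed acetonitrile content (%) and t is the time in minutes since the start of the gradient. For instance, a programmed composition of 10% acetonitrile is reached after (10 − 5)/0.25 = 20 min.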
2.3 Variable selection
Variable selection was performed using a genetic algorithm (GA) [23], a useful variable selection tool that mimics natural selection. The principle of natural selection is that individuals with high fitness under given environmental conditions prevail in the next generation, whereas those with low fitness do not survive selection. The best individuals are reproduced by crossover, together with random mutations of the chromosomes of the surviving members. In GA, a chromosome corresponds to a set of variables, and its fitness to the internal predictive ability of the partial least squares model derived from them. The internal predictive performance of the model is expressed as the leave-one-out cross validation (LOOCV) squared multiple correlation coefficient (Q²cv).
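The selection loop described above can be sketched as follows. This is a generic GA written for illustration, not the Matlab routine used in this work: the operators (tournament selection, single-point crossover, bit-flip mutation, elitism), the parameter defaults and the function name `genetic_select` are all assumptions, and the toy fitness function stands in for the LOOCV Q²cv of a partial least squares model.

```python
import random

def genetic_select(n_vars, fitness, pop_size=20, generations=30,
                   p_cross=0.8, p_mut=0.02, seed=0):
    """Evolve boolean masks over n_vars variables; return (best_mask, best_fitness)."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_vars)]
                  for _ in range(pop_size)]

    def tournament(scored):
        # Pick the fitter of two randomly drawn chromosomes.
        first, second = rng.sample(scored, 2)
        return first[1] if first[0] >= second[0] else second[1]

    best_score, best_mask = float("-inf"), None
    for generation in range(generations + 1):
        scored = [(fitness(mask), mask) for mask in population]
        gen_score, gen_mask = max(scored, key=lambda pair: pair[0])
        if gen_score > best_score:
            best_score, best_mask = gen_score, list(gen_mask)
        if generation == generations:
            break
        children = [list(best_mask)]          # elitism: keep the best so far
        while len(children) < pop_size:
            mother, father = tournament(scored), tournament(scored)
            child = list(mother)
            if rng.random() < p_cross:        # single-point crossover
                cut = rng.randrange(1, n_vars)
                child = mother[:cut] + father[cut:]
            # Bit-flip mutation toggles a variable in or out of the subset.
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        population = children
    return best_mask, best_score

# Hypothetical toy fitness: reward three "informative" variables and
# penalise subset size. In the paper the fitness is a LOOCV Q2 value.
informative = {0, 3, 7}

def toy_fitness(mask):
    hits = sum(1 for i in informative if mask[i])
    return hits - 0.1 * sum(mask)

mask, score = genetic_select(12, toy_fitness)
print([i for i, keep in enumerate(mask) if keep], round(score, 2))
```

Keeping the best chromosome in every generation means the reported fitness can never decrease as the number of generations grows, which makes runs easier to compare.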
2.4 SVM modeling
The SVM [24] training process always seeks a globally optimal solution, so it can handle a large number of features. Based on the theory of structural risk minimization, the optimal hyperplane minimizes the generalization error rather than the empirical (training) error. Several parameters need to be set during the SVM training phase. SVM with a radial basis function kernel was used in all of our experiments, and three parameters, C, γ and ε, were tuned by a grid search, using the leave-one-out cross validation [25] Q²cv as the criterion. The best parameters were then used to train on the whole training set and generate the final model.
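The tuning procedure can be sketched as a LOOCV loop wrapped in a grid search. The sketch below is stdlib-only and deliberately does not implement an SVM: `rbf_smoother` is a stand-in regressor (an RBF-kernel weighted mean) with a single width parameter, so only that one parameter is scanned. The base-2 logarithmic grid, the definition Q² = 1 − PRESS/SS and the toy data are assumptions made for illustration; with a real SVM library the same loop would scan C, γ and ε together.

```python
import math

def rbf_smoother(train_x, train_y, query, gamma):
    """Stand-in regressor (NOT an SVM): RBF-kernel weighted mean of training y."""
    weights = [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, query)))
               for x in train_x]
    total = sum(weights)
    if total == 0.0:                  # every weight underflowed to zero
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total

def loocv_q2(xs, ys, gamma):
    """Leave each sample out once, predict it, and return (Q2, MSE)."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        preds.append(rbf_smoother(train_x, train_y, xs[i], gamma))
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total, press / len(ys)

def grid_search(xs, ys, exponents=range(-10, 11)):
    """Scan gamma = 2**k and keep the value with the highest LOOCV Q2."""
    scored = [(loocv_q2(xs, ys, 2.0 ** k)[0], 2.0 ** k) for k in exponents]
    return max(scored)

# Hypothetical toy data: one descriptor, response roughly linear in it.
xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
ys = [0.1, 1.2, 1.9, 3.1, 3.9, 5.2]
best_q2, best_gamma = grid_search(xs, ys)
print(best_q2, best_gamma)
```

Selecting parameters on the cross-validated score rather than on the fit to the training data is what guards against choosing an over-fitted parameter combination.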
2.5 Software used
A total of 1,209 parameters for the five bases were calculated with the Dragon program (http://www.disat.unimib.it/chm/), Chem3D 2005, and the GITMHDV program, which was written in True BASIC by our group. The SGBP_ACC program was written in C. SVM was implemented with Libsvm-2.89 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). GA and PCA were carried out in Matlab R2007b.
3 Results and discussion
3.1 Modeling results
Thirty-nine oligonucleotide sequences were encoded with the SGBP–ACC program. Different values of l yield different variables and hence different modeling performance. In this study, several values of l were tested in order to achieve the best characterization of the oligonucleotides and satisfactory modeling results. The maximum possible l is less than the length of the shortest sequence (16 bases) among the 39 oligonucleotides studied (Supporting Information Table S2), so the step lengths were set to 3, 6, 9, 12 and 15, giving 48, 96, 144, 192 and 240 variables per oligonucleotide, respectively. The performance was not satisfactory when these original variables were used as SVM inputs (Supporting Information Table S3); therefore, GA was applied to the original data set to eliminate redundant autocorrelation variables and improve their descriptive power. GA selected 16, 20, 38, 56 and 82 variables for step lengths of 3, 6, 9, 12 and 15, respectively. The selected variables were then used as inputs for SVM modeling, with the SVM parameters optimized by grid search. The modeling results are shown in Table 2. The highest Q²cv (0.875) and the lowest LOOCV mean squared error (MSEcv = 0.398) were obtained at step length l = 9 with C = 256 and the two other tuned parameters equal to 0.00098 and 0.00781. Under these conditions, the self-consistency test gave a squared multiple correlation coefficient (R²cum) of 0.972 and a mean squared error (MSEcum) of 0.143, indicating that the model has good fitting ability.

Table 2 The modeling results using the variables selected by GA for different step lengths (l) a)

l     Number of variables   Self-consistency test   LOOCV
                            R²cum     MSEcum        Q²cv      MSEcv
3     16                    0.828     0.452         0.738     0.650
6     20                    0.934     0.364         0.849     0.530
9     38                    0.972     0.143         0.875     0.398
12    56                    0.972     0.153         0.839     0.463
15    82                    0.989     0.116         0.841     0.497

a) R²cum, the square of the multiple correlation coefficient obtained by self-consistency tests; MSEcum, the mean squared error obtained by self-consistency tests; Q²cv, the square of the multiple correlation coefficient obtained by LOOCV; MSEcv, the mean squared error obtained by LOOCV.

The whole dataset was randomly divided into a training set and a test set in different proportions (9:30, 13:26, 20:19, 26:13, and 30:9). In this way, we (1) determined the minimum number of samples required for training and (2) validated the predictive ability of the trained models. With the step length fixed at l = 9, modeling was repeated 30 times using training samples of different proportions. The results (Table 3) show that cross validation and external validation both improve as the number of training samples increases, and that more than 26 training samples are sufficient to construct a QSRR model with relatively good predictive ability.
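The partitioning protocol can be sketched as below. The sketch assumes that "repeated 30 times" means 30 independent random partitions per training-set size; the function name `repeated_splits` and the fixed seed are hypothetical, and the model fitting that would follow each split is omitted.

```python
import random

def repeated_splits(n_samples, n_train, repeats=30, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random partitions."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(repeats):
        train = sorted(rng.sample(indices, n_train))
        chosen = set(train)
        test = [i for i in indices if i not in chosen]
        yield train, test

for n_train in (9, 13, 20, 26, 30):      # training sizes used in the text
    splits = list(repeated_splits(39, n_train))
    train, test = splits[0]
    print(n_train, len(test), len(splits))
```

Averaging the validation statistics over the repeated partitions reduces the influence of any single lucky or unlucky split on the estimated minimum training-set size.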
3.2 Model evaluation
The good predictive performance of our model may be ascribed to two key contributors: the new combined representation and the combined modeling technique. Here, we further compare and evaluate the predictive power of this method.

Our representation technique includes two indispensable steps, the SGBP vectors and the ACC transformation. Based on the principle that the chromatographic retention behavior of oligonucleotides is influenced by their sequences and higher-order structural features, we used PCA to summarize 1,209 0D, 1D, 2D and 3D features of the five bases, and thereby obtained SGBP to represent the sequence and structural features of oligonucleotides. The SGBP vectors comprise 4 PCs, which extract more than 99.99% of the information in the 1,209 parameters; therefore, they can be used to characterize DNAs or RNAs. The loadings reflect the relative contribution of each variable to the four SGBP vectors (Table 4). The loading analysis shows that, as a multi-scale parameterization technique for DNAs or RNAs, SGBP vectors can represent physicochemical, quantum chemical, topological, and spatial structural features. As the oligonucleotides used in the study have different lengths, ACC was used to transform the variables to a uniform length. Most importantly, ACC with large values of l can account for the neighboring effects of bases that are far apart in a sequence (Figure 1). The comparison of the five models suggests that ACC vectors computed with different l values capture useful features of the training oligonucleotides used in the prediction experiments. In contrast, an l value that is too small may lose useful information, and one that is too large may introduce redundant noise instead of improving the prediction power of the model.

Our modeling technique includes variable selection, a correlation method, and model validation. In QSRR studies, redundant variables should be eliminated in order to enhance the robustness and predictive capability of the models obtained, especially when the number of variables is very large. The second key step in the present method is therefore to use GA to select the variables used as inputs for the SVM models that predict the chromatographic retention of the oligonucleotides studied. The results show that GA can effectively eliminate the noise variables.
Table 3 The modeling results using training oligonucleotide samples of different sizes with step length l = 9 a)

No.   Data size          Self-consistency test   LOOCV               External validation
      training   test    R²cum     MSEcum        Q²cv      MSEcv     Q²ext     MSEext
1     9          30      0.501     1.538         0.379     1.783     0.333     1.863
2     13         26      0.721     0.819         0.589     1.263     0.493     1.593
3     20         19      0.853     0.735         0.753     0.983     0.689     1.273
4     26         13      0.932     0.503         0.832     0.687     0.802     0.753
5     30         9       0.956     0.393         0.846     0.539     0.826     0.698

a) Q²ext, the square of the multiple correlation coefficient obtained by external validation; MSEext, the mean squared error obtained by external validation.
Table 4 Loading from PCA of the descriptor matrix (5×41) for bases a)

No.   PC1     PC2     PC3     PC4     Base properties
1     0.007   0.120   0.283   0.056   average molecular weight
2     0.240   0.028   0.169   0.065   sum of atomic van der Waals volumes (scaled on carbon atom)
3     0.102   0.256   0.036   0.135   sum of Kier–Hall electrotopological states
4     0.105   0.046   0.289   0.017   mean atomic polarizability (scaled on carbon atom)
5     0.211   0.104   0.091   0.184   mean electrotopological state
6     0.215   0.097   0.189   0.006   number of multiple bonds
7     0.090   0.132   0.265   0.023   aromatic ratio
8     0.203   0.208   0.053   0.054   first Zagreb index by valence vertex degrees
9     0.156   0.078   0.145   0.228   reciprocal hyper-detour index
10    0.091   0.168   0.161   0.199   E-state topological parameter
11    0.092   0.087   0.279   0.055   2-path Kier alpha-modified shape index
12    0.064   0.048   0.273   0.142   Kier flexibility index
13    0.097   0.118   0.240   0.137   Kier benzene-likeliness index
14    0.056   0.285   0.075   0.057   sum of topological distances between N and O
15    0.253   0.109   0.111   0.025   structural information content (neighborhood symmetry of 1-order)
16    0.014   0.160   0.166   0.228   lag 3 (weighted by atomic van der Waals volumes) of Moran autocorrelation
17    0.058   0.226   0.055   0.212   lag 6 (weighted by atomic van der Waals volumes) of Moran autocorrelation
18    0.262   0.108   0.003   0.098   lag 2 (weighted by atomic polarizabilities) of Moran autocorrelation
19    0.123   0.039   0.064   0.301   lag 4 (weighted by atomic polarizabilities) of Moran autocorrelation
20    0.127   0.106   0.242   0.107   lowest eigenvalue n. 1 of Burden matrix (weighted by atomic polarizabilities)
21    0.155   0.193   0.106   0.156   radial distribution function-3.0 (weighted by atomic masses)
22    0.102   0.056   0.180   0.247   radial distribution function-3.0 (weighted by atomic polarizabilities)
23    0.015   0.085   0.052   0.325   signal 21 (unweighted) of 3D-MoRSE
24    0.185   0.024   0.059   0.258   signal 22 (unweighted) of 3D-MoRSE
25    0.023   0.249   0.005   0.195   signal 29 (unweighted) of 3D-MoRSE
26    0.108   0.260   0.084   0.085   signal 27 (weighted by atomic masses) of 3D-MoRSE
27    0.165   0.215   0.106   0.092   signal 18 (weighted by atomic van der Waals volumes) of 3D-MoRSE
28    0.080   0.126   0.201   0.202   signal 16 (weighted by atomic Sanderson electronegativities) of 3D-MoRSE
29    0.289   0.050   0.024   0.020   3rd component symmetry directional WHIM index (weighted by atomic masses)
30    0.069   0.199   0.221   0.049   1st component shape directional WHIM index (weighted by atomic van der Waals volumes)
31    0.241   0.065   0.164   0.040   K global shape index (weighted by atomic electrotopological states)
32    0.078   0.276   0.071   0.079   R maximal autocorrelation of lag 5 (weighted by atomic van der Waals volumes)
33    0.212   0.054   0.083   0.212   number of urea derivatives
34    0.229   0.145   0.032   0.136   number of acceptor atoms for H-bonds (N O F)
35    0.130   0.217   0.114   0.139   Moriguchi octanol–water partition coefficient (logP)
36    0.105   0.135   0.062   0.274   the seventh weight molecular holographic distance vector
37    0.167   0.133   0.151   0.173   HOMO
38    0.218   0.183   0.090   0.026   total energy
39    0.144   0.242   0.087   0.078   electronic energy
40    0.030   0.086   0.258   0.166   dipole moment
41    0.257   0.122   0.051   0.077   torsion energy

a) Relatively large loadings are represented by bold font.
Figure 1 Schematic illustration of a sequence with four SGBP descriptors for every base. Red circles indicate residues. Each residue is represented by four values denoted iSj, where i is the residue number and j is the principal component number. The lines connecting the descriptors indicate how auto cross covariances between the 1st and 2nd descriptors would be calculated.

The third key step is the use of SVM to construct the QSRR models. SVM can not only handle problems such as nonlinearity, high dimensionality and local minima, but also avoid over-training. The SVM models thus obtained reflect the complicated relationship between the ACC variables and the responses, and thus led to satisfactory results. Furthermore, we used three test methods (resubstitution tests, cross validation tests and external validation tests) to ensure the validity of the model obtained.

Kohlbacher et al. [13] constructed an SVM model to predict the chromatographic retention time of oligonucleotides; however, in addition to sequence information, this model needed secondary structural information to represent the structures of oligonucleotides, which is inconvenient and depends on the reliability of the secondary structure prediction methods. Gilar et al. [2] used sequence and length information to establish a mathematical model for predicting the retention time of oligonucleotides; however, the performance of the model was not satisfactory at relatively low temperatures owing to inadequate structural representation. In comparison, although our model may produce large errors for sequences with complicated secondary structures, it captures the relationship between the features and retention time without requiring a separate secondary structure prediction. Moreover, our representation method is derived from various parameters related to the functions of nucleic acids and is easily applied using our program SGBPACC, written in C++.
3.3 Analysis and application of QSRR models
Investigating the sequence and structural features of the samples with large errors is necessary for exploring molecular retention behavior. The error distribution of the model with l = 9 (Figure 2) shows that most of the errors are less than 0.150, and that relatively large errors arise only for the 7th, 34th, 35th and 36th samples. In general, nucleic acids have relatively small molecular mass, and some of the hydrophobic groups of their primary structures have a direct impact on their functions; thus, the hydrophobicity of nucleic acids can be analyzed from their sequence features [26]. The retention index in reversed-phase chromatography usually increases with increasing hydrophobicity [27]. The separation of oligonucleotides in a reversed-phase HPLC system can be challenging owing to the different hydrophobicity of the A, G, C, and T bases. In fact, oligonucleotides with the same length but different sequences may have different retention indices.

Figure 2 Error distribution. Thirty-nine samples were used to construct the SVM-based model by LOOCV using the variables selected by GA. The step length of ACC was 9. The errors for the 7th, 34th, 35th, and 36th samples are all larger than 0.150.

Earlier published data [28] showed that the hydrophobicity contribution to oligonucleotide retention increases in the order C < G < A < T. This trend was also observed in our samples. For example, the first (GTAGCAGCAGCCAGAC) and second (GTCTGGCTGCTGCTAC) samples each contain 16 bases, including 5 G and 5 C. However, the first sample contains fewer T bases than the second (one versus five), which may be responsible for its lower experimental retention value. There is only a slight difference between the predicted and experimental retention times for both samples; therefore, we can conclude that the SGBP–ACC method may effectively represent the structural features of oligonucleotides, and that the model thus obtained can accurately express the influence of sequence composition on the predicted retention times.

If only the primary sequence were relevant, oligonucleotides with the same length and the same proportions of A, G, T, and C should have identical or similar retention times. However, this was not the case. The reason may be that, in addition to the primary sequence composition, secondary structural features also contribute to the retention time of oligonucleotides. Complicated secondary structures may influence the retention of oligonucleotides by changing their hydrophobicity. For example, Dickman found that additional non-canonical B-DNA structures (Holliday junctions) in an oligonucleotide can increase its retention time in chromatography [29]. Here, the effect of secondary structures on chromatographic retention time was studied. Two software packages, RNAstructure 4.6 [30] and GeneBee [31], were employed to predict the secondary structures of the 39 oligonucleotides (see the Supporting Information).
The results indicated that the secondary structures of the 34th, 35th and 36th samples are more complicated than those of the other samples (Supporting Information Table S2); it can therefore be inferred that the complicated secondary structures of these three samples may lead to the large prediction errors of our model.

Further, we tested the prediction ability of our model for sequences with the same length and the same ratio of bases but different positions of the A, C, G, and T bases. The prediction errors for the 26th and 34th sequences were relatively small and large, respectively (Supporting Information Table S2), so we used these two sequences as templates to design eight sequences (Supporting Information Table S4) and predicted them with the model derived from the 39 training samples with an ACC step length of 9. The eight sequences have the same ratio of bases as the 26th and 34th sequences but different positions of A, C, G, and T. The model predicted different retention times for the different sequences, meaning that it is able to distinguish between them. We also found a relatively large error for the 7th sequence (Supporting Information Table S4). To investigate the reason for this, the secondary structures of the designed sequences were predicted using RNAstructure 4.6 [30] and GeneBee [31], because no experimental data are available for these sequences. The 7th sequence (Supporting Information Table S4) has no obvious secondary structure as predicted by GeneBee [31], and the differences remain an open question.

Our model has specific applications in the purification of oligonucleotides. First, it can be used to predict the chromatographic retention time of a new analyte. Second, it can be used to select the optimal initial gradient strength for fast chromatographic purification of synthetic oligonucleotides, as discussed in ref. [2]. Third, it can be used to compare quantitatively the separation properties of individual types of chromatographic columns. Fourth, it can be used to evaluate properties such as lipophilicity and dissociation constants, and to estimate relative bioactivities within sets of drugs and other xenobiotics as well as the material properties of members of a family of chemicals [8]. Fifth, it can express the relationship between the structure and retention time of oligonucleotides and thereby give insight into the molecular mechanism of separation operating under certain experimental conditions. For example, our model can indicate whether the secondary structure of a sequence is complicated. As discussed above, our QSRR model may produce large errors for sequences with complicated secondary structures. When a large number of sequences are predicted by the model and a few of them produce large errors, it can be inferred that the secondary structures of those sequences may be complicated.
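The sequence design experiment described above (same length and base ratio, different base positions) can be mimicked with a short sketch. The helper names `composition` and `shuffled_variants` are hypothetical, random shuffling is only one possible way to rearrange bases, and the first sample is used as a stand-in template because the 26th and 34th sequences are listed only in the Supporting Information.

```python
import random
from collections import Counter

def composition(sequence):
    """Count each base in a sequence."""
    return Counter(sequence)

def shuffled_variants(template, count, seed=0):
    """Return `count` distinct rearrangements of `template`.

    Every variant keeps the length and base composition of the template.
    Assumes the template admits at least `count` distinct rearrangements.
    """
    rng = random.Random(seed)
    variants = set()
    while len(variants) < count:
        bases = list(template)
        rng.shuffle(bases)
        candidate = "".join(bases)
        if candidate != template:
            variants.add(candidate)
    return sorted(variants)

first = "GTAGCAGCAGCCAGAC"     # first sample quoted in the text
second = "GTCTGGCTGCTGCTAC"    # second sample quoted in the text
print(composition(first)["T"], composition(second)["T"])   # 1 and 5 T bases

for variant in shuffled_variants(first, count=8):
    assert composition(variant) == composition(first)
    print(variant)
```

Feeding such variants to a trained model and comparing the predictions isolates the effect of base order from the effect of base composition.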
4 Conclusions
We developed a new combined approach, SGBP-ACC-GA-SVM, to predict chromatographic retention times of oligonucleotides. The sequence–structure–function (retention) relationship of oligonucleotides was analyzed with QSRR models, which could be used to choose mobile phase strength and to study chromatographic retention behavior. The performance of the resulting model is generally satisfactory, but the large prediction errors for some samples cannot be ignored. Therefore, we intend to add information closely related to the retention behavior, particularly secondary structural information, or to delete redundant information in the representation procedure in order to optimize our method and further reduce the prediction errors.

This work was supported by the National Natural Science Foundation of China (10901169), National 111 Programme of Introducing Talents of Discipline to Universities (0507111106), Innovation Ability Training Foundation of Chongqing University (CDCX008) and Innovative Group Program for Graduates of Chongqing University, Science and Innovation Fund (200711C1A0010260).

1 Huber CG. Micropellicular stationary phases for high-performance liquid chromatography of double-stranded DNA. J Chromatogr A, 1998, 806: 1–28
2 Gilar M, Fountain KJ, Budman Y, Neue UD, Yardley KR, Rainville PD, Russell RJ, Gebler JC. Ion-pair reversed-phase high-performance liquid chromatography analysis of oligonucleotides: retention prediction. J Chromatogr A, 2002, 958: 167–182
3 Kaliszan R, Fork H. The relationship between the RM values and the connectivity indices for pyrazine carbothioamide derivatives. Chromatographia, 1977, 10: 346–355
4 Kaliszan R. Correlation between the retention indices and the connectivity indices of alcohols and methyl esters with complex cyclic structure. Chromatographia, 1977, 10: 529–540
5 Michotte Y, Massart DL. Molecular connectivity and retention indexes. J Pharm Sci, 1977, 66: 1630–1632
6 Héberger K. Quantitative structure-(chromatographic) retention relationships. J Chromatogr A, 2007, 1158: 273–305
7 Put R, Heyden YV. Review on modelling aspects in reversed-phase liquid chromatographic quantitative structure–retention relationships. Anal Chim Acta, 2007, 602: 164–172
8 Kaliszan R. QSRR: Quantitative structure-(chromatographic) retention relationships. Chem Rev, 2007, 107: 3212–3246
9 Put R, Daszykowski M, Baczek T, Vander Heyden Y. Retention prediction of peptides based on uninformative variable elimination by partial least squares. J Proteome Res, 2006, 5: 1618–1625
10 Baczek T, Wiczling P, Marszall M, Heyden YV, Kaliszan R. Prediction of peptide retention at different HPLC conditions from multiple linear regression models. J Proteome Res, 2005, 4: 555–563
11 Bodzioch K, Baczek T, Kaliszan R, Vander Heyden Y. The molecular descriptor log SumAA and its alternatives in QSRR models to predict the retention of peptides. J Pharm Biomed Anal, 2009, 50: 563–569
12 Ladiwala A, Xia F, Luo Q, Breneman CN, Cramer SM. Investigation of protein retention and selectivity in HIC systems using quantitative structure retention relationship models. Biotechnol Bioeng, 2006, 93: 836–850
13 Kohlbacher O, Quinten S, Sturm M, Mayr B, Huber C. Structure–activity relationships in chromatography: Retention prediction of oligonucleotides with support vector regression. Angew Chem Int Ed, 2006, 45: 7009–7012
14 Harju M, Andersson PL, Haglund P, Tysklind M. Multivariate physicochemical characterisation and quantitative structure–property relationship modelling of polybrominated diphenyl ethers. Chemosphere, 2002, 47: 375–384
15 Bucinski A, Wnuk M, Goryński K, Giza A, Kochańczyk J, Nowaczyk A, Bączek T, Nasal A. Artificial neural networks analysis used to evaluate the molecular interactions between selected drugs and human α1-acid glycoprotein. J Pharm Biomed Anal, 2009, 50: 591–596
Zhao W, et al.
16
17 18
19
20
21 22
23
24
Sci China Chem
Can H, Dimoglo A, Kovalishyn V. Application of artificial neural networks for the prediction of sulfur polycyclic aromatic compounds retention indices. J Mol Struct (THEOCHEM), 2005, 723: 183–188 Yang C, Zhong C. Chirality factors and their application to QSAR studies of chiral molecules. QSAR Comb Sci, 2005, 24: 1047–1055 Rybolt TR, Janeksela VE, Hooper DN, Thomas HE, Carrington NA, Williamson EJ. Predicting second gas–solid virial coefficients using calculated molecular properties on various carbon surfaces. J Colloid Interface Sci, 2004, 272: 35–45 Skrbic B, Onjia A. Prediction of the Lee retention indices of polycyclic aromatic hydrocarbons by artificial neural network. J Chromatogr A, 2006, 1108: 279–284 Liang GZ, Li ZL. Scores of generalized base properties for quantitative sequence-activity modelings for E.coli promoters based on support vector machine. J Mol Graph Model, 2007, 26: 269–281 Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall, 2002 Nystrǒm A, Andersson PM, Lundstedt T. Multivariate data analysis of topographically modified -melanotropin analogues using auto and cross auto covariances. Quant Struct-Act Relat, 2000, 19: 264–269 Leardi R, Lupianez A. Genetic algorithms applied to feature selection in PLS regression: How and when to use them. Chemolab, 1998, 41: 195–207 Vapnik V. Statistical Learning Theory. NewYork: Wiley-Interscience,
July (2011) Vol.54 No.7
25
26 27
28
29
30
31
1071
1998 Chou KC, Shen HB. Cell-PLoc: A package of Web servers for predicting subcellular localization of proteins in various organism. Nat Protoc, 2008, 3: 153–162 Li WJ, Wu JJ. The construction of RNA secondary structure prediction system. Progr Biochem Biophys, 1996, 23: 449–453 Zou HF, Zhang YK, Hong MF, Lu PC. Retention behavior of small peptides in reversed-phase high-performance liquid chromatography. Chin J Chromatogr, 1991, 9: 257–262 Huber CG, Oefner PJ, Bonn GK. High-performance liquid chromatographic separation of detritylated oligonucleotides on highly crosslinked poly-(styrene-divinylbenzene) particles. J Chromatogr, 1992, 599: 113–118 Dickman JM. Effects of sequence and structure in the separation of nucleic acids using ion pair reverse phase liquid chromatography. J Chromatogr A, 2005, 1076: 83–89 Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA, 2004, 101: 7287–7292 Brodsky LI, Ivanov VV, Kalai dzidis YL, Leontovich AM, Nikolaev VK, Feranchuk SI, Drachev VA. GeneBee-NET: internet-based server for analyzing biopolymers structure. Biochemistry, 1995, 60: 923–928