Near-infrared spectroscopy and chemometric modelling for rapid diagnosis of kidney disease
Mengli Fan1, Xiuwei Liu1, Xiaoming Yu2, Xiaoyu Cui1, Wensheng Cai1 & Xueguang Shao1,3,4,5*
1 Research Center for Analytical Sciences, College of Chemistry, Nankai University, Tianjin 300071, China
2 Laboratory of Clinic, People’s Hospital of Gaomi City, Gaomi 261000, China
3 Tianjin Key Laboratory of Biosensing and Molecular Recognition, Tianjin 300071, China
4 State Key Laboratory of Medicinal Chemical Biology, Tianjin 300071, China
5 Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin 300071, China
Received March 3, 2016; accepted June 19, 2016; published online October 26, 2016
Rapid diagnosis is important for efficient treatment in clinical medicine. This study aimed to develop a method for rapid and reliable diagnosis using near-infrared (NIR) spectra of human serum samples with the help of chemometric modelling. The NIR spectra of sera from 48 healthy individuals and 16 patients with suspected kidney disease were analyzed. Discrete wavelet transform (DWT) and variable selection were adopted to extract the useful information from the spectra. Principal component analysis (PCA), linear discriminant analysis (LDA) and partial least squares discriminant analysis (PLSDA) were used for discrimination of the samples. Classification of the two-class sera was obtained using LDA and PLSDA with the help of DWT and variable selection. DWT-LDA gave recognition rates of 93.8% and 83.3% for the validation samples of the two classes, and DWT-PLSDA gave recognition rates of 100%. The results demonstrate that the tiny differences between the spectra of the sera can be extracted effectively using DWT and variable selection, and that these differences can be used to discriminate the sera of healthy individuals from those of possible patients. NIR spectroscopy combined with chemometrics may be a potential technique for fast diagnosis of kidney disease.
Keywords: near-infrared spectroscopy, discrimination, serum, kidney disease, chemometrics
Citation: Fan M, Liu X, Yu X, Cui X, Cai W, Shao X. Near-infrared spectroscopy and chemometric modelling for rapid diagnosis of kidney disease. Sci China Chem, 2016, doi: 10.1007/s11426-016-0092-6
Fan et al. Sci China Chem October (2016) Vol.59 No.10
1 Introduction
Kidney disease is a common clinical condition with an increasing incidence, and it can lead to kidney failure and to complications such as cardiovascular disease [1,2]. Patients may experience no symptoms in the early stages of kidney disease, so early treatment is often missed and mortality increases. To prevent these adverse outcomes, it is essential to develop methods for effective prediction and early diagnosis of kidney disease.
Laboratory serum biomarkers have proved to be a common and efficient clinical tool for diagnosing diseases. Among them, serum creatinine, urea and uric acid are the most common biomarkers, and they reflect pathologic changes in the kidney. Techniques have therefore been developed for analyzing these biomarkers to investigate the variations in sera. Conventional methods include the Jaffe method, high performance liquid chromatography (HPLC), the enzymatic method and hyphenated techniques of gas chromatography (GC) or liquid chromatography (LC) with isotope dilution mass spectrometry (GC-, LC-IDMS) [9,10]. With advances in the synthesis of novel materials, especially nanomaterials, preconcentration techniques were adopted in sample preparation for chromatography and mass spectrometry (MS) to lower the detection limit [11,12]. The use of differently labeled 13C analogues and microwave-assisted enzymatic hydrolysis decreased the interference in the determination of serum creatinine and shortened the analysis time compared with classical MS-based methods [13,14]. However, these methods are still complicated, time-consuming and labor-intensive. Rapid and cost-effective diagnosis of kidney disease is desirable for routine operation.
Near-infrared (NIR) spectroscopy, as a rapid analytical technique, has been widely used in the food industry, the pharmaceutical industry and agriculture [15–17]. In the past decades, the technique has attracted considerable attention in disease diagnosis, such as the diagnosis of chronic fatigue syndrome, infections of human immunodeficiency virus type-1 and endometrial carcinoma [19,20]. However, NIR spectroscopy generally has low selectivity and sensitivity because the absorptions are weak and the peaks are broad. Chemometric methods are therefore commonly needed in NIR spectroscopic discriminant and quantitative analysis.
Chemometric modelling techniques combined with NIR spectroscopy have been used in variety discrimination. For example, k-nearest neighbours (KNN) and linear discriminant analysis (LDA) were adopted to discriminate purple sweet potato, white sweet potato and their adulterated samples. The results demonstrated that the three types of sweet potato samples can be satisfactorily classified. The principal component accumulation (PCAcc) and principal component discriminant transformation methods were employed to classify different Chinese patent medicines and provided satisfactory identification rates for the samples in both the calibration and validation sets [22,23]. In disease diagnosis, principal component analysis (PCA) and soft modelling of class analogy (SIMCA) were used for discrimination of influenza virus-infected nasal fluids. The samples were identified with a correct classification rate of 96.7%. Meanwhile, PCA and SIMCA were successfully used for diagnosis of chronic fatigue syndrome with NIR spectroscopy. Moreover, discrimination of colorectal cancer was studied by the partial least squares discriminant analysis (PLSDA) method. The results indicated that NIR spectroscopy combined with chemometric analysis is a reliable diagnostic method for cancers.
In this work, discrimination of sera from healthy individuals and patients with suspected kidney disease was studied using NIR spectroscopy with the aid of PCA, LDA and PLSDA. Discrimination models were established to distinguish the two classes of sera. In addition, discrete wavelet transform (DWT) was used to remove both the background and the noise in the spectra, and variable (wavenumber) selection was adopted to select the informative wavenumbers. The feasibility of the proposed method for fast diagnosis of kidney disease was demonstrated.
2 Materials and methods
2.1
Human serum samples were supplied by the People’s Hospital of Gaomi (Gaomi, China). The study was approved by the Ethics Committee of the People’s Hospital of Gaomi. The patients were informed and their consent was obtained prior to the start of this study. Serum samples were collected from patients of the clinical laboratory, and the values of the routine blood test were measured with a fully automatic biochemical analyzer (Beckmann, USA). When the blood test shows normal concentrations of creatinine (40–120 μmol L−1), urea (3.2–7.0 mmol L−1) and uric acid (142–420 μmol L−1), the sample is considered “normal” and denoted as class A. The mean values (standard deviations) of creatinine, urea and uric acid in the samples of class A are 59.4 (11.7) μmol L−1, 4.7 (1.0) mmol L−1 and 255.6 (69.0) μmol L−1, respectively. The samples whose concentrations of serum creatinine, urea and uric acid are higher than the reference limits are considered sera from patients with suspected kidney disease and denoted as class B. The mean values (standard deviations) of creatinine, urea and uric acid in the samples of class B are 435.1 (320.6) μmol L−1, 24.7 (9.4) mmol L−1 and 521.7 (113.7) μmol L−1, respectively.
2.2 Spectral measurement
All the spectra were measured with an FT-NIR spectrometer (ThermoFisher, USA) at 37 °C. Each spectrum is composed of 2205 data points from 4000 to 12500 cm−1 with a digitization interval of 3.855 cm−1. Because the useful information is concentrated in the wavenumber range of 5600–8000 cm−1, the 624 data points from 5600 to 8000 cm−1 were used in the calculations. Figure 1 displays the measured raw spectra of the serum samples. In the figure, the capital letters A and B denote the samples of class A and class B, respectively. In total, 48 class A and 16 class B samples were measured. Most of the spectra are similar and highly overlapped, because human sera are mixtures of water, proteins, fats, sugars, inorganic salts, etc.
The difference in the concentrations of creatinine, urea and uric acid between the sera does not produce an obvious change in the spectra. Thus, it is difficult to discriminate the human sera directly from the raw spectra.
Figure 1 Measured near-infrared spectra of the two-class sera. A and B denote the spectra of the sera from healthy individuals and patients with suspected kidney disease, respectively (color online).
For building the discriminant models, 32 class A spectra and 10 class B spectra were used as the calibration set, and the remaining spectra (16 class A and 6 class B) were used as the validation set to test the efficiency of the models. The Kennard-Stone (KS) algorithm was used to determine the two sets. Furthermore, to correct the background and noise in the spectra, the commonly used standard normal variate (SNV) transformation and Savitzky-Golay (SG) smoothing were adopted [27,28]. The former was used to correct the variant background and the latter to remove the noise in the spectra. Alternatively, discrete wavelet transform (DWT) was used as a spectral preprocessing technique, which is discussed in detail in section 3.2. Note that when DWT is used, SNV and SG are not necessary because DWT can remove the background and the noise simultaneously.
2.3 Chemometric methods
2.3.1 Principal component analysis
Principal component analysis (PCA) is an effective data mining technique and has been the basic method for classification or discrimination analysis. It compresses the information of the original data into a few new variables named principal components (PCs), which are linear combinations of the original variables. The PCs are sorted in descending order of the variance they explain, so the first few PCs explain most of the variability of the data. The information contained in the first two or three PCs is generally used for inspection of the classification. In this study, PCA is used to reduce the dimension of the data, and the first two PCs are employed to inspect the classification.
2.3.2 Linear discriminant analysis
Linear discriminant analysis (LDA) is one of the most popular techniques for data classification and dimensionality reduction [30,31]. The method finds discriminant functions that maximize the variance among categories and minimize the variance within each category. In this study, LDA is used as a supervised pattern recognition method for discrimination of the serum samples. Because the number of variables in the spectra is much larger than the number of samples, the within-class covariance matrix is singular and LDA is invalid if the raw data are used directly. Therefore, PCA is employed to reduce the dimension of the NIR spectra, and the scores of the PCs are used as the input data of LDA.
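The PCA-then-LDA procedure of sections 2.3.1 and 2.3.2 can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the synthetic data, the function names `pca_fit` and `fisher_lda`, the choice of five PCs and the midpoint decision threshold are all assumptions made for the example.

```python
import numpy as np

def pca_fit(X, n_pc):
    """Return the column mean and the first n_pc loading vectors of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_pc].T

def fisher_lda(T, labels):
    """Two-class Fisher discriminant on PC scores T (labels are 0 or 1)."""
    T0, T1 = T[labels == 0], T[labels == 1]
    m0, m1 = T0.mean(axis=0), T1.mean(axis=0)
    # Pooled within-class scatter; well conditioned once the data are
    # compressed to a few PC scores.
    Sw = (T0 - m0).T @ (T0 - m0) + (T1 - m1).T @ (T1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    threshold = 0.5 * (m0 + m1) @ w   # midpoint of the projected class means
    return w, threshold

# Synthetic stand-in for spectra (assumption): 42 samples x 624 variables.
rng = np.random.default_rng(0)
labels = np.array([0] * 32 + [1] * 10)
X = rng.normal(scale=0.05, size=(42, 624))
X[labels == 1, 100:120] += 0.5        # small class-dependent band

mean, loadings = pca_fit(X, n_pc=5)
scores = (X - mean) @ loadings        # PC scores used as LDA input
w, threshold = fisher_lda(scores, labels)
predicted = (scores @ w > threshold).astype(int)
accuracy = (predicted == labels).mean()
```

Feeding PC scores rather than raw spectra into the discriminant keeps the scatter matrix small and invertible, which is the reason the text gives for the PCA step.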
2.3.3 Partial least squares discriminant analysis
Partial least squares discriminant analysis (PLSDA) consists of a PLS regression model in which the response variable is replaced by a set of dummy variables describing the categories as reference values [32,33]. In this study, the samples of class A are assigned a dummy value of 1.0, and those of class B are assigned −1.0. Thus, the criterion for the discrimination is zero: samples whose predicted value is above zero are assigned to class A and those below zero to class B.
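The dummy-variable coding and the zero threshold can be illustrated with a small PLS1 model fitted by the NIPALS algorithm. The sketch below is an assumption-laden example, not the authors' implementation: the data are synthetic, three latent variables are used for brevity, and the function names `pls1_fit` and `plsda_predict` are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit a PLS1 model with NIPALS; return (coefficients, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector
        t = Xr @ w                      # latent-variable scores
        tt = t @ t
        p = Xr.T @ t / tt               # X loadings
        c = yr @ t / tt                 # y loading
        Xr = Xr - np.outer(t, p)        # deflate X
        yr = yr - c * t                 # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def plsda_predict(X, model):
    """Predict the dummy response and apply the zero criterion."""
    coef, x_mean, y_mean = model
    y_hat = (X - x_mean) @ coef + y_mean
    # Above zero -> class A, below zero -> class B.
    return y_hat, np.where(y_hat > 0, "A", "B")

# Synthetic calibration set (assumption): 32 class A and 10 class B samples.
rng = np.random.default_rng(1)
y = np.array([1.0] * 32 + [-1.0] * 10)   # dummy values +1.0 (A) and -1.0 (B)
X = rng.normal(scale=0.1, size=(42, 100))
X[y < 0, 40:50] += 1.0                   # class B differs in a few variables

model = pls1_fit(X, y, n_lv=3)
y_hat, classes = plsda_predict(X, model)
```

With symmetric dummy values the decision boundary sits at zero regardless of how many samples each class contributes, which is why no separate threshold has to be tuned.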
3 Results and discussion
3.1 Discrimination using PCA and LDA
PCA was first employed to investigate the classification of the sera. Figure 2(a) shows the distribution of the calibration set of the two-class sera in PC1-PC2 space. The first two PCs explain 86.5% of the variance, indicating that the majority of the spectral information is included in the first two PCs. However, the samples of classes A and B overlap in the PC space and cannot be discriminated. Therefore, LDA was used to further investigate the classification. The scores in the linear discriminant (LD) subspace are calculated from the scores of the calibration samples in the subspace of the first sixteen PCs, which explain 99.9% of the variance. Figure 2(b) shows the distribution of the samples in LD1-LD2 space. The samples of classes A and B are almost separated, but there is still obvious overlap. To build the discriminant models for the two classes, confidence ellipses were calculated at the 95% confidence level. When a prediction sample is located inside the ellipse of a class, the sample is assigned to that class. Clearly, the two ellipses overlap seriously, indicating that the two classes cannot be classified and identified.
Figure 2 Distribution of the calibration samples of the two-class sera in PC1-PC2 space (a) and in LD1-LD2 space (b) (color online).
3.2 Spectral preprocessing and variable selection
To further improve the classification, wavelet transform (WT) was used to extract the useful information from the spectra. WT has been successfully applied to chemical signal processing. Its most commonly used property is the decomposition of a signal into components of different frequency or resolution. In this study, discrete wavelet transform (DWT) is employed and the commonly used multiresolution signal decomposition (MRSD) algorithm is adopted for the calculation. A spectrum can be decomposed into components of different scales [D1, D2,…, DJ, CJ], where both C and D are known as wavelet coefficients but are named discrete approximation and discrete detail, respectively, and J is the decomposition level. In the calculation, different wavelet filters, such as Daubechies, Coiflet and Symmlet, were investigated. No significant difference was found between the results. Therefore, the Daubechies wavelet with vanishing moment 4, i.e., “db4”, was used. Moreover, according to the MRSD algorithm, the value of J should not exceed log2 N (N is the length of the input data), and a larger value is better for extracting the features in the spectra. For N = 624, log2 N ≈ 9.3, so the maximal level, 9, was used. According to the theory of WT and MRSD, the components from D1 to C9 carry information from high to low frequency. For analytical signals like NIR spectra, the low and high frequency components generally correspond to background and noise, respectively, whereas the middle frequency components correspond to the useful information [34,36]. Therefore, unlike SNV transformation and SG smoothing, which each address only one of the two, DWT can eliminate background and noise simultaneously. In this work, D5, D6 and D7 were selected for reconstructing the spectra. Figure 3 shows the reconstructed spectra. The variant background and the noise are removed, but the spectra of the two classes of samples still overlap.
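The decomposition into [D1, …, D9, C9] and the reconstruction from D5, D6 and D7 can be sketched in plain NumPy as below. Several choices are assumptions of this example and are not stated in the text: the signal is reflect-padded to a power-of-two length (1024 for N = 624), the filters are applied with periodic wrap-around, and the spectrum is synthetic. The names `filter_matrices` and `dwt_bandpass` are hypothetical.

```python
import numpy as np

# Daubechies low-pass filter with 4 vanishing moments ("db4", 8 taps).
DB4 = np.array([0.23037781330885523, 0.7148465705525415, 0.6308807679295904,
                -0.02798376941698385, -0.18703481171888114, 0.030841381835986965,
                0.032883011666982945, -0.010597401784997278])

def filter_matrices(n, h=DB4):
    """Periodized analysis matrices (low-pass H, high-pass G) for even n."""
    g = h[::-1].copy()
    g[1::2] *= -1                        # quadrature mirror high-pass filter
    H, G = np.zeros((n // 2, n)), np.zeros((n // 2, n))
    for i in range(n // 2):
        for k in range(len(h)):
            H[i, (2 * i + k) % n] += h[k]
            G[i, (2 * i + k) % n] += g[k]
    return H, G

def dwt_bandpass(x, level=9, keep=(5, 6, 7)):
    """Decompose x to the given level and rebuild it from the kept details.

    Entries of keep are detail indices j (for D_j); 0 stands for C_level.
    """
    n = len(x)
    size = 2 ** max(level, int(np.ceil(np.log2(n))))
    approx = np.pad(x, (0, size - n), mode="reflect")   # assumed padding
    details, mats = [], []
    for _ in range(level):
        H, G = filter_matrices(len(approx))
        details.append(G @ approx)       # D_j: detail at scale j
        approx = H @ approx              # C_j: approximation at scale j
        mats.append((H, G))
    rebuilt = approx if 0 in keep else np.zeros_like(approx)
    for j in range(level, 0, -1):        # inverse transform, coarse to fine
        H, G = mats[j - 1]
        d = details[j - 1] if j in keep else np.zeros_like(details[j - 1])
        rebuilt = H.T @ rebuilt + G.T @ d
    return rebuilt[:n]

# Synthetic spectrum (assumption): sloping background + band + noise.
rng = np.random.default_rng(2)
wavenumber = np.linspace(5600, 8000, 624)
spectrum = (0.001 * (wavenumber - 5600)
            + np.exp(-((wavenumber - 6900) / 150) ** 2)
            + rng.normal(scale=0.01, size=624))
filtered = dwt_bandpass(spectrum)        # keeps only D5, D6 and D7
```

In practice a wavelet library such as PyWavelets (`pywt.wavedec` and `pywt.waverec`) would normally be used; its boundary handling differs from the periodic scheme assumed here, so the numerical values near the edges would not match.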
To further extract the useful information from the spectra, variable selection is used to select the informative wavenumbers. This has been proved to be very important for optimizing the models for both quantitative and discrimination analysis of NIR spectra. Monte Carlo-uninformative variable elimination (MC-UVE) [38,39], which was developed in our previous studies, is employed in this study for variable selection. The method builds a large number of models with randomly selected calibration samples and then uses the coefficients of these models to evaluate each variable by the stability of its coefficient. The wavenumbers with high stability are considered the informative ones. In the calculation, the variables are ranked by stability from the highest to the lowest, and the stability of the Nj-th variable is set as the cutoff value. The variables whose stability is less than the cutoff are eliminated. The variables selected by MC-UVE are shown in Figure 3 as short vertical bars. Most of the selected variables are concentrated in five narrow wavenumber intervals around 5800, 6300, 6500, 7600 and 7800 cm−1. The wavenumbers around 5800, 7600 and 7800 cm−1 can be assigned to the first overtones of C–H stretching modes, and the absorption bands around 6300 and 6500 cm−1 can be ascribed to the first overtones of N–H vibrations of the creatinine, urea and uric acid molecules. To determine the number of retained variables (Nj), the variation of the root mean squared error (RMSE) obtained by cross-validation of the calibration set with Nj is investigated. Figure 4 shows the RMSE obtained with Nj from 20 to 200 in steps of 20. The results of 30 repeated runs were used, and the mean RMSE values and their standard error (σ) are plotted in Figure 4. At the beginning, both the mean value and the standard error are large; as Nj increases, both decrease sharply. The mean RMSE reaches a minimum when Nj is 100 and then increases gradually. Therefore, Nj=100 is used for further study.
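The MC-UVE ranking described above can be sketched as follows. The text does not give the stability formula, so the sketch assumes the usual UVE definition, the mean of a coefficient over the Monte Carlo models divided by its standard deviation. The data, the number of runs, the sampling fraction and the names `pls1_coef` and `mc_uve` are illustrative assumptions.

```python
import numpy as np

def pls1_coef(X, y, n_lv):
    """Regression coefficients of a PLS1 model fitted with NIPALS."""
    Xr, yr = X - X.mean(axis=0), y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w
        p, c = Xr.T @ t / (t @ t), yr @ t / (t @ t)
        Xr, yr = Xr - np.outer(t, p), yr - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def mc_uve(X, y, n_lv=3, n_runs=200, frac=0.7, n_keep=5, seed=0):
    """Keep the n_keep variables whose coefficients are most stable."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = np.empty((n_runs, X.shape[1]))
    for r in range(n_runs):
        # Each model uses a random subset of the calibration samples.
        idx = rng.choice(n, size=int(frac * n), replace=False)
        coefs[r] = pls1_coef(X[idx], y[idx], n_lv)
    # Assumed stability measure: |mean| / standard deviation per variable.
    stability = np.abs(coefs.mean(axis=0)) / coefs.std(axis=0)
    keep = np.argsort(stability)[::-1][:n_keep]   # ranked high to low
    return np.sort(keep), stability

# Synthetic check (assumption): only variables 0, 1 and 2 carry information.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(scale=0.05, size=100)
selected, stability = mc_uve(X, y)
```

Here `n_keep` plays the role of Nj: the stability of the Nj-th ranked variable acts as the cutoff, and everything below it is eliminated.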
Figure 3 Spectra reconstructed from D5, D6 and D7, and the variables selected by the MC-UVE method (short wine-colored vertical bars) (color online).
Figure 4 Variation of RMSEs with the number of selected variables. Standard deviation of 30 runs is plotted as error bars crossing the mean value.
3.3 Discrimination models
To investigate the classification obtained with the variables selected from the calibration spectra after DWT, PCA is performed first. Figure 5(a) shows the distribution of the calibration samples in PC1-PC2 space. The two classes still overlap. Therefore, LDA is applied to the first sixteen PCs for further investigation. As shown in Figure 5(b), the samples of classes A and B are well separated in LD1-LD2 space except for two samples of class A. The confidence ellipses (at the 95% confidence level) for the two classes are also plotted in the figure and can be taken as the models for identifying the prediction samples. There is no overlap between the two ellipses, although they are close to each other. To validate the efficiency of the method for discriminating prediction samples, identification was performed using the spectra of the validation set. Only the selected variables in the spectra preprocessed by DWT were used in the calculation. The predicted results are plotted in Figure 5(b) as stars, and they are summarized in the first row of Table 1 according to the location of the samples in the LD1-LD2 space of Figure 5(b). The recognition rates for the two classes are 93.8% and 83.3%, respectively. This demonstrates that satisfactory identification can be obtained when the background is removed and the informative wavenumbers are used. To further investigate the feasibility of NIR spectroscopy and chemometric modelling for the problem, PLSDA was employed for classification of the serum samples. In the calculation, the selected variables from the DWT-preprocessed spectra were used, and the number of latent variables was set to nine according to the prediction error obtained in Monte Carlo cross validation (MCCV).
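The use of a 95% confidence ellipse as a class model in LD1-LD2 space can be sketched as a Mahalanobis distance test. The text does not state how the ellipses were constructed, so the sketch assumes bivariate normal scores and a chi-square cutoff with two degrees of freedom. The scores are synthetic, and `fit_ellipse` and `inside` are hypothetical names.

```python
import math
import numpy as np

def fit_ellipse(scores, level=0.95):
    """Class model: mean, inverse covariance and squared-distance cutoff."""
    mean = scores.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))
    # The chi-square quantile with 2 degrees of freedom has the closed
    # form -2 ln(1 - level).
    cutoff = -2.0 * math.log(1.0 - level)
    return mean, cov_inv, cutoff

def inside(point, model):
    """True if the point lies inside the confidence ellipse of the class."""
    mean, cov_inv, cutoff = model
    d = np.asarray(point, dtype=float) - mean
    return float(d @ cov_inv @ d) <= cutoff   # squared Mahalanobis distance

# Synthetic LD scores (assumption): class A near (0, 0), class B near (6, 0).
rng = np.random.default_rng(4)
scores_a = rng.normal(loc=(0.0, 0.0), size=(32, 2))
scores_b = rng.normal(loc=(6.0, 0.0), size=(10, 2))
model_a = fit_ellipse(scores_a)
model_b = fit_ellipse(scores_b)

in_a = inside((0.0, 0.0), model_a)    # centre of class A
out_a = inside((6.0, 0.0), model_a)   # centre of class B tested against A
```

A validation sample is assigned to a class when it falls inside that class's ellipse; a sample inside neither ellipse, or inside both, cannot be identified unambiguously.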
Figure 6(a) shows the results of the calibration samples used for building the PLSDA models of the two classes, and Figure 6(b) shows the prediction results of the validation set. The prediction results are summarized in the second row of Table 1. The calibration samples of class A and class B are satisfactorily classified. Only one calibration sample in class B is not correctly classified, and the recognition rate for the validation set is 100%. This indicates that acceptable classification can be achieved by the DWT-PLSDA model. Compared with DWT-LDA, a better result was obtained. The reason may be that PLSDA is a supervised method: the learning from the calibration spectra extracts more information about the calibration samples.
Figure 6 Distribution of the predictions for the calibration (a) and validation (b) samples by PLSDA after DWT and variable selection (color online).
Table 1 Discrimination results (recognition rates, %) of LDA and PLSDA models for the 64 serum samples using the spectral data after DWT and variable selection
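The reported recognition rates can be checked against the sample counts with a short calculation. The validation set sizes follow from the text (48 − 32 = 16 class A and 16 − 10 = 6 class B samples); the numbers of correctly assigned samples are not stated and are inferred here as the only counts that reproduce the reported percentages.

```python
def recognition_rate(n_correct, n_total):
    """Percentage of samples assigned to their true class, one decimal."""
    return round(100.0 * n_correct / n_total, 1)

# Validation set sizes derived from the text.
n_val = {"A": 48 - 32, "B": 16 - 10}

# Inferred counts of correct assignments (assumption, not given in the text).
lda_correct = {"A": 15, "B": 5}
plsda_correct = {"A": 16, "B": 6}

lda_rates = {c: recognition_rate(lda_correct[c], n_val[c]) for c in n_val}
plsda_rates = {c: recognition_rate(plsda_correct[c], n_val[c]) for c in n_val}
print(lda_rates, plsda_rates)
```

With only 6 class B validation samples, a single misassignment changes the class B rate by about 17 percentage points, so the difference between 83.3% and 100% corresponds to one sample.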
4 Conclusions
Discrimination of sera from healthy individuals and patients with suspected kidney disease was studied using NIR spectroscopy and chemometric modelling. DWT was adopted to remove the variant background in the spectra, and variable selection was employed to select the informative wavenumbers. Satisfactory classification can be achieved using LDA and PLSDA with the help of DWT and variable selection. NIR spectroscopy and chemometric modelling may provide a fast and cost-effective diagnostic method for kidney disease.
Acknowledgments This study was supported by the National Natural Science Foundation of China (21475068) and MOE Innovation Team (IRT13022) of China.
Conflict of interest The authors declare that they have no conflict of interest.
Ethical approval All procedures performed in studies involving human participants were in accordance with the ethical standards of the Ethics Committee of the People’s Hospital of Gaomi and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Figure 5 Distribution of the calibration samples of the two-class sera in PC1-PC2 space (a) and in LD1-LD2 space (b) after DWT and variable selection. The stars denote the validation samples (color online).
1 Levey AS, Eckardt KU, Tsukamoto Y, Levin A, Coresh J, Rossert J, De Zeeuw D, Hostetter TH, Lameire N, Eknoyan G. Kidney Int, 2005, 67: 2089–2100
2 Coresh J, Selvin E, Stevens LA, Manzi J, Kusek JW, Eggers P, van Lente F, Levey AS. J Am Med Assoc, 2007, 298: 2038–2047
3 Levey AS, Coresh J. Lancet, 2012, 379: 165–180
4 Radzikowska E, Jaguś P, Skoczylas A, Sobiecka M, Chorostowska-Wynimko J, Wiatr E, Kuś J, Roszkowski-Śliż K. Pol Arch Med Wewn, 2013, 123: 533–537
5 Yang ZX, Liang Y, Li C, Xi WQ, Zhong RQ. Rheumatol Int, 2012, 32: 2715–2723
6 Blass KG, Thibert RJ, Lam LK. Zeitschrift Fur Klinische Chemie Und Klinische Biochemie, 1974, 12: 336–343
7 Dai XH, Fang X, Zhang CM, Xu RF, Xu B. J Chromatogr B, 2007, 857: 287–295
8 Lorentz K, Berndt W. Anal Biochem, 1967, 18: 58–63
9 Kulik W, Oosterveld MJS, Kok RM, Meer K. J Chromatogr B, 2003, 791: 399–405
10 Harlan R, Clarke W, Bussolo JMD, Kozak M, Straseski J, Meany DL. Clin Chim Acta, 2010, 411: 1728–1734
11 Mundaca-Uribe R, Bustos-Ramírez F, Zaror-Zaror C, Aranda-Bustos M, Neira-Hinojosa J, Pena-Farfal C. Sensor Actuat B Chem, 2014, 195: 58–62
12 Kalhor H, Alizadeh N. Anal Bioanal Chem, 2013, 405: 5333–5339
13 Fernandez-Fernandez M, Rodríguez-Gonzalez P, Alvarez MEA, Rodríguez F, Menendez FVA, Alonso JIG. Anal Chem, 2015, 87: 3755–3763
14 Fernández-Fernández M, González-Antuña A, Rodríguez-González P, Álvarez MEA, Álvarez FV, Alonso JIG. Clin Chim Acta, 2014, 431: 96–102
15 Lee S, Choi H, Cha K, Kim MK, Kim JS, Youn CH, Lee SH, Chung H. Bull Korean Chem Soc, 2012, 33: 4267–4270
16 Gowen AA, Marini F, Tsuchisaka Y, Luca SD, Bevilacqua M, O’Donnell C, Downey G, Tsenkova R. Talanta, 2015, 131: 609–618
17 Tan C, Li ML, Qin X. Anal Bioanal Chem, 2007, 389: 667–674
18 Sakudo A, Kuratsune H, Kato YH, Ikuta K. Clin Chim Acta, 2012,