Opt Quant Electron DOI 10.1007/s11082-014-9944-9
Identification of GMOs by terahertz spectroscopy and ALAP–SVM Jianjun Liu · Zhi Li · Fangrong Hu · Tao Chen · Yong Du · Haitao Xin
Received: 16 December 2013 / Accepted: 6 May 2014 © Springer Science+Business Media New York 2014
Abstract An approach for identifying terahertz (THz) spectra of genetically modified organisms (GMOs), based on the active learning affinity propagation (ALAP) clustering algorithm combined with a support vector machine (SVM), is proposed in this paper, and THz transmittance spectra of some typical genetically modified (GM) cotton samples are investigated to demonstrate its feasibility. Firstly, principal component analysis is applied to extract features from the spectral data. Secondly, instead of the original spectral data, the feature signals are fed into the ALAP–SVM pattern recognition model, in which an improved active learning AP algorithm (ALAP) is applied to the SVM. The experimental results show that THz spectroscopy combined with ALAP–SVM can be effectively used to identify different GM cottons. The proposed approach provides a new, effective method for the detection and identification of different GMOs using THz spectroscopy.
Keywords
GMOs · Terahertz · SVM · Affinity propagation · Cotton
J. Liu · Z. Li (B)
School of Mechano-Electronic Engineering, Xidian University, Xi’an 710071, Shaanxi, People’s Republic of China
e-mail: [email protected]
Z. Li
Guilin University of Aerospace Technology, Guilin 541004, Guangxi, People’s Republic of China
F. Hu · T. Chen
School of Electronic Engineering and Automation, Guilin University of Electronic Technology, Guilin 541004, Guangxi, People’s Republic of China
Y. Du
School of Information Engineering College, Jimei University, Fujian 361005, Xiamen, People’s Republic of China
Y. Du · H. Xin
Xiamen University, Fujian 361005, Xiamen, People’s Republic of China
J. Liu et al.
1 Introduction
Nowadays, with the popularization of genetically modified (GM) technology, GM products are increasing in the global market. Nevertheless, the potential environmental, ethical and religious impacts of genetically modified organisms (GMOs) are unknown. GM products are severely restricted in most regions of the world, especially in China, because the majority of people are afraid of consuming them. Hence, it is necessary to find effective methods for the detection of GM products, which is one of the most crucial issues of food safety and quality (Moreira and Scarminio 2013). Several methods for identifying GM products are reviewed in Alishahi et al. (2010). On the whole, PCR and ELISA are the two most common methods for clearly distinguishing GM products (Nakamura et al. 2013; Xie et al. 2008; Milcamps et al. 2009; Fiehn et al. 2000; Margarit and Reggiardo 2006; Zhu et al. 2010), and DNA-based methods for recognizing GM products offer sufficient confidence and reliability (Ahmed 2002; Liu et al. 2004; Small 2006; Raamsdonk et al. 2007). However, these methods are costly, tedious, time-consuming and destructive, and, what is more, they cannot detect the changes caused by gene implantation. In contrast, terahertz (THz) spectroscopy is a fast, non-destructive, low-cost method that requires no sample preparation to identify GMOs.
However, the application of THz spectroscopy in the genetic field, and especially to GM foods, is still new. Most studies have focused on non-transgenic molecules, and numerous researchers have proposed methods for the identification and classification of different molecules. Zhang (2007) proposed a measurement technology to identify biomolecules using THz time-domain spectroscopy. Chen et al. (2013) studied the identification of biomolecules by THz spectroscopy and fuzzy pattern recognition. Markelz et al. (2000) reported the first use of pulsed THz spectroscopy to examine the low-frequency collective vibrational modes of calf thymus DNA, bovine serum albumin and collagen in 0.006–2.00 THz, and found that these three biomolecules have characteristic absorption in the THz frequency range. Walther et al. (2003) investigated the THz spectra of glucose, fructose and sucrose in polycrystalline and amorphous forms using THz time-domain spectroscopy. Upadhya et al. (2004) measured the THz absorption spectra of a range of monosaccharide molecules together with different disaccharides in the frequency region between 0.1 and 3.0 THz using a time-resolved THz spectroscopy system. Chen et al. (2013) proposed an approach for measuring multicomponents in tablets based on THz spectroscopy.
In summary, it is worth mentioning that phenotypical changes are the best indicators of changes in the genotypic structure (Munck et al. 2004). Thus, the basis of this technology is that it can identify phenotypical changes caused by genotypic changes, which ultimately bring about changes in molecular bonds such as C–H, C–N and C–O. Therefore, by applying THz spectroscopy it would be possible to evaluate specific gene expression on the basis of the phenotypical changes, and hence to use THz spectroscopy to identify GM products.
2 Experimental samples and apparatus
2.1 Samples
Eight different GM cotton varieties were supplied by the Institute of Cotton Research of the Chinese Academy of Agricultural Sciences (ICRCAAS), Shandong Luyi Cotton Industry Technology Company Limited (SLCIT Co., Ltd.) and Shandong Xinqiu Seed Technology Company Limited (SXST Co., Ltd.). The seeds of the different transgenic cottons are crushed
Table 1 Genetically modified cotton material

| Cotton seed | Provider | Geometry | Thickness (mm) | Diameter (cm) | Label | Train set | Test set |
|---|---|---|---|---|---|---|---|
| Lumianyan 28 | SLCIT Co., Ltd. | sheet | 1.5 | 1.2 | 1 | 17 | 10 |
| Lumianyan 29 | SLCIT Co., Ltd. | sheet | 1.5 | 1.2 | 2 | 10 | 7 |
| Lumianyan 36 | SLCIT Co., Ltd. | sheet | 1.5 | 1.2 | 3 | 10 | 5 |
| Xinqiu k638 | SXST Co., Ltd. | sheet | 1.5 | 1.2 | 4 | 14 | 7 |
| Xinqiu 107 | SXST Co., Ltd. | sheet | 1.5 | 1.2 | 5 | 26 | 15 |
| New Luzhong 6 | ICRCAAS | sheet | 1.5 | 1.2 | 6 | 18 | 10 |
| Zhongmian 28 | ICRCAAS | sheet | 1.5 | 1.2 | 7 | 17 | 8 |
| Yinmian 8 | ICRCAAS | sheet | 1.5 | 1.2 | 8 | 25 | 18 |
Fig. 1 Schematic diagram of THz-TDS spectrometer
into powder after being peeled and dried, and the powder is then pressed into a sheet. Details of the selected transgenic cottons are shown in Table 1.
2.2 Experimental apparatus
The time-resolved spectroscopy detection system used in this experiment is the same as that described in Chen et al. (2013). The experimental setup is shown schematically in Fig. 1, where the center wavelength of the laser is 800 nm. To ensure the accuracy of the experiment, nitrogen is injected into the system until the internal relative humidity falls below 2 %. The indoor relative humidity is 25 % and the temperature is 292 K.
2.3 THz absorption peak of GM cotton seeds
Figure 2 shows the absorption peaks of the eight different GM cotton seeds. Visually, the THz spectra of the eight GM cotton seeds are very similar (apart from a few individual samples). Therefore, it is difficult to identify the sample types directly from their spectral features, so we adopt pattern recognition methods to identify the THz spectra of these samples (Zhang and Liu 2002; Yin et al. 2007; Chen et al. 2012). Here, an active learning affinity propagation (ALAP)–support vector machine (SVM) pattern recognition method is introduced, in which ALAP is applied to optimize the training set and the SVM is used to build the classification model.
Fig. 2 The THz absorption peak of different GM cotton seeds
3 Theory and algorithm
3.1 Affinity propagation algorithm
Affinity propagation (AP) is a clustering algorithm first proposed by Frey and Dueck in 2007 (Frey and Dueck 2007). The algorithm is fast and effective, and it has been applied to clustering face images, discovering “gene exons” and searching for optimal routes. AP can not only find representative examples (exemplars) of a sample set but also partition the original data set. The input of the AP algorithm is a measure of similarity between pairs of sample points: AP takes the similarity matrix $S_{N \times N}$ between $N$ data points as input, where the similarity $s(i, k)$ indicates how well the data point with index $k$ is suited to be the exemplar for data point $i$. When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): for points $x_i$ and $x_k$, $s(i, k) = -\|x_i - x_k\|^2$. All data points in the original sample set are potential exemplars, so the self-similarities $s(k, k)$ are all initialized to the same value. Indeed, the method described here can be applied when the optimization criterion is much more general. Two important kinds of messages, the “responsibility” $r(i, k)$ and the “availability” $a(i, k)$, are defined in the clustering process, and they are updated through a competition mechanism. The update rules are:

$$r(i, k) = s(i, k) - \max_{k' \ \mathrm{s.t.}\ k' \neq k} \{a(i, k') + s(i, k')\}$$

$$r(k, k) = s(k, k) - \max_{k' \ \mathrm{s.t.}\ k' \neq k} \{a(k, k') + s(k, k')\}$$

$$a(i, k) = \min\Big\{0,\ r(k, k) + \sum_{i' \ \mathrm{s.t.}\ i' \notin \{i, k\}} \max\{0, r(i', k)\}\Big\}$$

$$a(k, k) = \sum_{i' \ \mathrm{s.t.}\ i' \neq k} \max\{0, r(i', k)\}$$
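The update rules above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in this paper: the function name `affinity_propagation`, the damping factor and the fixed iteration count are assumptions added here (damping is a common way to avoid oscillations), and the median preference in the usage lines is likewise just one common choice.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, n_iter=200):
    """Run AP message passing on an N x N similarity matrix S.

    S[k, k] holds the preference of point k. Returns, for every point,
    the index of the exemplar it chooses. `damping` and `n_iter` are
    illustrative assumptions, not values taken from the paper.
    """
    N = S.shape[0]
    R = np.zeros((N, N))  # responsibilities r(i, k)
    A = np.zeros((N, N))  # availabilities a(i, k)
    rows = np.arange(N)
    for _ in range(n_iter):
        # r(i, k) = s(i, k) - max_{k' != k} [a(i, k') + s(i, k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]          # largest value in each row
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)         # second largest value in each row
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i, k) = min(0, r(k, k) + sum_{i' not in {i, k}} max(0, r(i', k)))
        # a(k, k) = sum_{i' != k} max(0, r(i', k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())   # keep r(k, k) unclipped
        col = Rp.sum(axis=0)
        A_new = col[None, :] - Rp            # drop point i's own contribution
        diag = A_new.diagonal().copy()       # this is a(k, k)
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new
    return (A + R).argmax(axis=1)

# Usage on six 1-D points forming two well-separated groups.
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
S = -(x[:, None] - x[None, :]) ** 2          # s(i, k) = -||x_i - x_k||^2
np.fill_diagonal(S, np.median(S))            # same preference for all points
print(affinity_propagation(S))
```

Each point is assigned to the candidate $k$ that maximizes $a(i, k) + r(i, k)$ after the final iteration.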
The responsibility $r(i, k)$ reflects the accumulated evidence for how well-suited point $k$ is to serve as the exemplar for point $i$, taking into account other potential exemplars for point $i$. The availability $a(i, k)$ reflects the accumulated evidence for how appropriate it would be for point $i$ to choose point $k$ as its exemplar, taking into account the support from other points that point $k$ should be an exemplar. According to the above update rules, information is exchanged between the sample points until the energy function reaches a minimum.
3.2 Active learning affinity propagation
With the AP clustering algorithm, the labels of the samples added to the train set are predicted by the current classifier instead of being assigned by manual annotation. Therefore, any prediction errors will accumulate in the subsequent iterations. If the sample points with the clearest classification results (far from the hyperplane) are selected for learning by the current classifier, the probability of introducing erroneous labels is minimal. However, such samples carry very little information. When they are added to the train set, their effect on the classification model is minute, while the computational burden increases. Conversely, if the sample points that contain a large amount of information (near the hyperplane) are selected, the probability of a prediction error is high. So the key is to find an appropriate balance between the information content of the sample points and the accuracy of the predicted labels (Rong et al. 2011). To solve this problem, an ALAP approach is proposed in this paper. Firstly, AP clustering is used to cluster the unlabeled samples. Secondly, the regional center of each cluster and the distribution of the samples in each class are obtained. From these, the most favorable sample points are selected and added to the training set.
The clustering centers found by AP carry important information about the prior distribution of the data, and the boundary points of the clusters (the sample points nearest to the hyperplane) are very significant for the performance of the classifier. To facilitate the subsequent description, the clustering center $z_r$ and the cluster boundary point $z_b$ are defined on the basis of these two elements as follows:

$$z_r = \arg\max_{i \in U} \big(r(i, i) + a(i, i)\big) \qquad (1)$$

$$z_b = \arg\min_{k \in U} \big(s(i, k)\big) \qquad (2)$$
where $i$ is a clustering center of the AP algorithm, $k$ is the point with minimum similarity to $i$, and $U$ denotes the library of unlabeled samples. The core idea of the ALAP algorithm is to add all clustering centers and boundary points to the original train set as training data for the SVM. In this way, the algorithm not only actively learns the most informative samples but also labels a part of the sample points that carry a certain amount of information. Let L and U denote the libraries of labeled and unlabeled samples, respectively: L is the training set, and U contains all unlabeled samples at the beginning of the iteration.
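As a rough illustration of Eqs. (1) and (2), the sketch below picks one center and one boundary point per cluster. It assumes that the cluster assignment has already been produced by AP, so that each center is the point chosen as exemplar by its members, and it restricts the minimum in Eq. (2) to the members of that cluster; both are our reading of the definitions, and the function name `select_centers_and_boundaries` is hypothetical.

```python
import numpy as np

def select_centers_and_boundaries(S, exemplars):
    """Return one (center, boundary) index pair per AP cluster.

    S         : N x N similarity matrix, s(i, k) = -||x_i - x_k||^2
    exemplars : length-N integer array; exemplars[i] is the index of the
                cluster center chosen by point i (assumed to come from AP)
    """
    picks = []
    for center in np.unique(exemplars):
        members = np.flatnonzero(exemplars == center)
        # Eq. (2): the member with the smallest similarity to its center
        boundary = members[np.argmin(S[center, members])]
        picks.append((int(center), int(boundary)))
    return picks

# Usage: two clusters on a line, with centers at indices 1 and 4.
x = np.array([0.0, 1.0, 3.0, 10.0, 11.0, 15.0])
S = -(x[:, None] - x[None, :]) ** 2
exemplars = np.array([1, 1, 1, 4, 4, 4])
print(select_centers_and_boundaries(S, exemplars))  # [(1, 2), (4, 5)]
```

The centers and boundary points returned here are the candidates that would be added to the train set.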
Fig. 3 The flow chart of ALAP–SVM (training phase: experimental data set, initial train set L, unlabeled samples U, AP precluster, active choice M, update train set U = U − M and L = L + M, SVM classifier, termination condition S; test phase: test data set T, best SVM classifier Ω, test output)
3.3 ALAP–SVM algorithm process
To sum up, the ALAP algorithm is described as follows:
Input: the original train set L (at least one sample per class), the candidate samples U, the number of sample points m, the number of categories Num = C, and the termination condition S, which is a fixed number of iterations.
Output: the optimized train set T and the final SVM model Ω.
(a) Initialize the SVM classifier with the original train set L (selected manually);
(b) Check whether there are unlabeled sample points in U. If yes, cluster the unlabeled sample points with the AP algorithm using the specified number of classes, and then select the m most valuable sample points M using formulas (1) and (2); otherwise go to step (f);
(c) Classify and label the sample points M using the SVM;
(d) Update the train set L: L = L + M, and the unlabeled sample points U: U = U − M;
(e) Retrain the classifier using the updated L;
(f) Check whether the SVM classifier has reached the termination condition S after the k-th training. If yes, end the training; otherwise return to step (b).
The flow chart is shown in Fig. 3.
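The control flow of steps (a) to (f) can be sketched as follows. This is a simplified illustration under explicit assumptions: `NearestMeanClassifier` is a stand-in used only to keep the example self-contained (it is not an SVM), `select` is a hypothetical callable representing AP clustering followed by the selection with formulas (1) and (2), and `max_iter` plays the role of the termination condition S.

```python
import numpy as np

class NearestMeanClassifier:
    """Stand-in for the SVM (assumption): predicts the class with the nearest mean."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

def alap_loop(X_L, y_L, X_U, select, make_clf, max_iter=20):
    """Grow the train set L with classifier-labeled samples M taken from U."""
    clf = make_clf().fit(X_L, y_L)                   # (a) initial classifier
    for _ in range(max_iter):                        # (f) fixed iteration budget S
        if len(X_U) == 0:                            # (b) no unlabeled samples left
            break
        M = np.asarray(select(X_U), dtype=int)       # (b) indices of chosen samples
        y_M = clf.predict(X_U[M])                    # (c) label M with the classifier
        X_L = np.vstack([X_L, X_U[M]])               # (d) L = L + M
        y_L = np.concatenate([y_L, y_M])
        X_U = np.delete(X_U, M, axis=0)              # (d) U = U - M
        clf = make_clf().fit(X_L, y_L)               # (e) retrain on the updated L
    return clf, X_L, y_L

# Usage with a trivial selector that takes the first two unlabeled samples.
X_L = np.array([[0.0, 0.0], [10.0, 10.0]])
y_L = np.array([0, 1])
X_U = np.array([[0.5, 0.2], [9.5, 9.9], [0.1, 0.3], [10.2, 9.7]])
clf, X_L2, y_L2 = alap_loop(X_L, y_L, X_U,
                            select=lambda X: np.arange(min(2, len(X))),
                            make_clf=NearestMeanClassifier)
print(len(X_L2), y_L2)
```

Passing the classifier factory and the selector as arguments keeps the loop independent of the particular classifier and selection rule.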
4 Results and discussion
4.1 Feature extraction and selection
The THz spectra of the GM cotton samples between 0.2 and 1.2 THz are selected as the features for identification in this paper. To reduce the amount of data and save time, principal component analysis (PCA) is performed to extract features from the original spectral data.
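For readers unfamiliar with PCA feature extraction, a minimal sketch is given below. It assumes the spectra are stored as rows of a matrix (one sample per row, one frequency point per column) and uses a singular value decomposition of the mean-centered data; the function name `pca_scores` and the random demo data are illustrative only and do not reproduce the measurements of this paper.

```python
import numpy as np

def pca_scores(X, n_components=3):
    """Project mean-centered rows of X onto the leading principal components.

    Returns the score matrix (samples x n_components) and the fraction of
    variance explained by each retained component.
    """
    Xc = X - X.mean(axis=0)                       # center every frequency point
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)           # variance ratio per component
    scores = Xc @ Vt[:n_components].T             # PC1, PC2, PC3 scores
    return scores, explained[:n_components]

# Usage on synthetic data standing in for 20 spectra with 50 frequency points.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
scores, ratio = pca_scores(X, n_components=3)
print(scores.shape, ratio)
```

The explained-variance ratios are the quantities plotted in a scree plot, and the score columns are the coordinates used in score scatter plots.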
Fig. 4 The scree plot of the number of component
Figure 4 gives the PCA scree plot of the GM cottons between 0.2 and 1.2 THz. It can be seen from Fig. 4 that the first three principal components together account for most of the variance in the data set, and these three factors describe most of the spectral variation related to origin. Figure 5 shows the score scatter plots of PC1 versus PC2 and PC1 versus PC3 for the different GM cottons after PCA. As seen from Fig. 5, using the first two or three principal components of the GM cotton samples as features, the eight different samples can be separated from each other. Therefore, PCA can effectively extract the THz spectral features of the samples, and it groups similar THz spectra well.
4.2 Identification of genetically modified cottons by using ALAP–SVM
To verify the effectiveness and robustness of the proposed ALAP–SVM classification model, the initial parameters of ALAP–SVM are given in Table 2. The total number of samples is 217, of which 136 are used for training and the other 81 for identification. The proposed method is compared with three other methods: random selection learning, entropy-based active learning and uncertainty learning. It can be seen from Fig. 6 that, as the number of labeled samples added by active learning increases, the classification accuracy changes correspondingly. Because the proposed ALAP–SVM algorithm considers the impact of the prior distribution of the data on the classifier, the clustering centers play an important role in classifier performance when the number of initial training samples is small, and fewer erroneous labels are introduced. As seen from Fig. 6, in the beginning the classification performance of the other three algorithms declines to different degrees. The random SVM selects the samples to be labeled at random, so its classification performance is not stable and its classification accuracy is slightly lower than that of the other three algorithms. It can be seen from Fig. 6 that the classification performance and accuracy of the proposed ALAP–SVM are better than those of the other three algorithms for the same original train samples.
Fig. 5 Scattered scores plot PCA1 versus PCA2 (a), PCA1 versus PCA3 (b)

Table 2 Initial parameters of ALAP–SVM

| Train set L | Unlabeled sample U | Most valuable M | Categories C | Termination condition S | Test set | Sampling time (ps) |
|---|---|---|---|---|---|---|
| 136 | 81 | 5 | 8 | 20 | 81 | 0.04064 |
The accuracy of the classifier after different numbers of iterations is shown in Table 3. It can be seen from Table 3 that the proposed ALAP algorithm converges faster than the other algorithms, reaching the same accuracy with fewer iterations. It can
Fig. 6 The relationship between correct classification rate and the number of actively added unlabeled samples
Table 3 The relationship between correct classification rate and iteration number

| Iteration number | Random SVM | Entropy SVM | Uncertainty SVM | ALAP SVM |
|---|---|---|---|---|
| 1 | 101/136 | 101/136 | 101/136 | 101/136 |
| 2 | 93/136 | 97/136 | 95/136 | 103/136 |
| 3 | 88/136 | 95/136 | 93/136 | 106/136 |
| 4 | 90/136 | 93/136 | 98/136 | 109/136 |
| 5 | 95/136 | 98/136 | 101/136 | 113/136 |
| 6 | 103/136 | 105/136 | 107/136 | 118/136 |
| 7 | 104/136 | 107/136 | 110/136 | 121/136 |
| 8 | 112/136 | 114/136 | 115/136 | 123/136 |
| 9 | 119/136 | 121/136 | 119/136 | 125/136 |
| 10 | 123/136 | 124/136 | 125/136 | 126/136 |
| 11 | 127/136 | 128/136 | 128/136 | 127/136 |
| 12 | 128/136 | 129/136 | 129/136 | 127/136 |
| 13 | 129/136 | 129/136 | 131/136 | 130/136 |
| 14 | 129/136 | 130/136 | 131/136 | 133/136 |
| 15 | 129/136 | 130/136 | 131/136 | 133/136 |
not only save a large number of labeled samples but also effectively improve the accuracy and robustness of the algorithm. Figure 7 compares the actual and predicted classifications of the test set for the four methods. It can be seen from Fig. 7 that four samples are misjudged by the random SVM and three samples by the entropy SVM and the uncertainty SVM, but only one sample is misjudged by the proposed method (the misjudged samples are marked by arrows in Fig. 7; the meanings of the labels are given in Table 1). The results show that the proposed method discriminates better than the other three methods.
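The fractions in Table 3 can be turned into percentages with a short calculation, shown here for the ALAP SVM column. The helper name `to_percent` is illustrative; the counts are copied from Table 3.

```python
def to_percent(correct, total):
    """Convert a count of correctly classified samples into a percentage."""
    return 100.0 * correct / total

# ALAP SVM column of Table 3, iterations 1 to 15 (out of 136 samples)
alap_correct = [101, 103, 106, 109, 113, 118, 121, 123,
                125, 126, 127, 127, 130, 133, 133]
rates = [to_percent(c, 136) for c in alap_correct]
final_rate = round(rates[-1], 3)
print(final_rate)  # 97.794
```

The last value, 133/136, corresponds to a correct classification rate of about 97.794 %.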
Fig. 7 The comparison chart of actual and prediction classification of test sets
5 Conclusions
In this paper, a model is established for recognizing genetically modified cotton using ALAP–SVM with THz spectroscopy. The results show that the model has a recognition rate of 97.794 %. The proposed approach improves the recognition accuracy of THz spectra of GM cotton. The proposed method is fast, simple and nondestructive for the detection of GM products, and it lays the foundation for applying a qualitative analysis model to actual sample detection. It can be widely applied in the field of food security, and it can guide the detection of different GM products.
Acknowledgments This research is partly supported by the National Natural Science Foundation of China (No. 61265005), the Natural Science Foundation of Fujian (No. 2013J01246), the foundation from Guangxi Experiment Center of Information Science, Guilin University of Electronic Technology (No. 20130101), and the program for innovation research team of Guilin University of Electronic Technology.
References
Ahmed, F.E.: Detection of genetically modified organisms in foods. Trends Biotechnol. 20(5), 215–223 (2002)
Alishahi, A., Farahmand, H., Prieto, N., Cozzolino, D.: Identification of transgenic foods using NIR spectroscopy: a review. Spectrochim. Acta A Mol. Biomol. Spectrosc. 75(1), 1–7 (2010)
Chen, T., Li, Z., Mo, W.: Identification of terahertz absorption spectra of explosives based on fuzzy pattern recognition. Chin. J. Instrum. 33, 2480–2486 (2012)
Chen, T., Li, Z., Mo, W.: Identification of biomolecules by terahertz spectroscopy and fuzzy pattern recognition. Spectrochim. Acta A Mol. Biomol. Spectrosc. 106, 48–53 (2013)
Chen, T., Li, Z., Mo, W.: Simultaneous quantitative determination of multicomponents in tablets based on terahertz time-domain spectroscopy. Spectrosc. Spectr. Anal. 33, 1220–1225 (2013)
Fiehn, O., Kopka, J., Trethewey, N., et al.: Identification of uncommon plant metabolites based on calculation of elemental compositions using gas chromatography and quadrupole mass spectrometry. Anal. Chem. 72(15), 3573–3580 (2000)
Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007)
Liu, Y., Lyon, B.G., Windham, W.R., Lyon, C.E., Savage, E.M.: Prediction of physical, color, and sensory characteristics of broiler breasts by visible/near infrared reflectance spectroscopy. Poult. Sci. 83, 1467–1474 (2004)
Margarit, E., Reggiardo, M.I., et al.: Detection of BT transgenic maize in foodstuffs. Food Res. Int. 39(2), 250–255 (2006)
Markelz, A.G., Roitberg, A., Heilweil, E.J.: Pulsed terahertz spectroscopy of DNA, bovine serum albumin and collagen between 0.1 and 2.0 THz. Chem. Phys. Lett. 320(1–2), 42–48 (2000)
Milcamps, A., Rabe, S., Cade, R., et al.: Validity assessment of the detection method of maize event Bt10 through investigation of its molecular structure. J. Agric. Food Chem. 57(8), 3156–3163 (2009)
Moreira, I., Scarminio, I.S.: Chemometric discrimination of genetically modified Coffea arabica cultivars using spectroscopic and chromatographic fingerprints. Talanta 107, 245–254 (2013)
Munck, L., Moller, B., Jacobsen, S., Sondergaard, S.: Near infrared spectra indicate specific mutant endosperm genes and reveal a new mechanism for substituting starch with (1→3, 1→4)-β-glucan in barley. J. Cereal Sci. 40, 213–222 (2004)
Nakamura, K., Akiyama, H., Kawano, N., et al.: Evaluation of real-time PCR detection methods for detecting rice products contaminated by rice genetically modified with a CpTI-KDEL-T-nos transgenic construct. Food Chem. 141(3), 2618–2624 (2013)
Raamsdonk, L.W.D., Holst, C., Baeten, V., Berben, G., Boix, A., Jong, J.: New developments in the detection and identification of processed animal proteins in feeds. Anim. Feed Sci. Technol. 133(1–2), 63–83 (2007)
Rong, C.H.E.N., Yong-Feng, C.A.O., Hong, S.U.N.: Multi-class image classification with active learning and semi-supervised learning. Acta Autom. Sinic. 37, 954–962 (2011)
Small, G.W.: Chemometrics and near-infrared spectroscopy: avoiding the pitfalls. Trends Anal. Chem. 25(11), 1057–1066 (2006)
Upadhya, P.C., Shen, Y.C., Davies, A.G., Linfield, E.H.: Far-infrared vibrational modes of polycrystalline saccharides. Vib. Spectrosc. 35(1–2), 139–143 (2004)
Walther, M., Fischer, B.M., Jepsen, P.U.: Noncovalent intermolecular forces in polycrystalline and amorphous saccharides in the far infrared. Chem. Phys. 288(2–3), 261–268 (2003)
Xie, L.-J., Ying, Y.-B., Ying, T.-J., et al.: Application of Vis/NIR diffuse reflectance spectroscopy to the detection and identification of transgenic tomato leaf. Spectrosc. Spectr. Anal. 28, 1062–1066 (2008)
Yin, X.X., Ng, B.W.-H., Fischer, B.M., Ferguson, B., Abbott, D.: Support vector machine applications in terahertz pulsed signals feature sets. IEEE Sens. 7(12), 1597–1608 (2007)
Zhang, X.R., Liu, F.: A pattern classification method based on GA and SVM. In: 6th International Conference on Signal Processing, vol. 1, pp. 110–113 (2002)
Zhang, T.J.: Research on measurement technology of bio-molecules based on terahertz time-domain spectroscopy. Zhejiang Univ. Hangzhou 23, 45–71 (2007)
Zhu, D., Liu, J.F., Tang, Y.B., et al.: A reusable DNA biosensor for the detection of genetically modified organism using magnetic bead-based electrochemiluminescence. Sens. Actuators B 149(1), 221–225 (2010)