Anal Bioanal Chem (2007) 387:1739–1748 DOI 10.1007/s00216-006-0851-1
ORIGINAL PAPER
Fourier transform infrared (FT-IR) spectroscopy in bacteriology: towards a reference method for bacteria discrimination Ornella Preisner & João Almeida Lopes & Raquel Guiomar & Jorge Machado & José C. Menezes
Received: 10 July 2006 / Revised: 8 September 2006 / Accepted: 8 September 2006 / Published online: 4 November 2006 © Springer-Verlag 2006
Abstract Rapid and reliable discrimination among clinically relevant pathogenic organisms is a crucial task in microbiology. Microorganism resistance to antimicrobial agents increases prevalence of infections. The possibility of Fourier transform infrared (FT-IR) spectroscopy to assess the overall molecular composition of microbial cells in a non-destructive manner is reflected in the specific spectral fingerprints highly typical for different microorganisms. With the objective of using FT-IR spectroscopy for discrimination between diverse microbial species and strains on a routine basis, a wide range of chemometrics techniques need to be applied. Still a major issue in using FT-IR for successful bacteria characterization is the method for spectra pre-processing. We analyzed different spectra pre-processing methods and their impact on the reduction of spectral variability and on the increase of robustness of chemometrics models. Different types of the Enterococcus faecium bacterial strain were classified according to
chromosomal DNA restriction patterns produced by pulsed-field gel electrophoresis (PFGE). Samples were collected from human patients. Collected FT-IR spectra were used to verify if the same classification was obtained. In order to further optimize bacteria classification we investigated whether a selected combination of the most discriminative spectral regions could improve results. Two different variable selection methods (genetic algorithms (GAs) and bootstrapping) were investigated and their relative merit for bacteria classification is reported by comparing with results obtained using the entire spectra. Discriminant partial least-squares (Di-PLS) models based on corrected spectra showed improved predictive ability up to 40% when compared to equivalent models using the entire spectral range. The uncertainty in estimating scores was reduced by about 50% when compared to models with all wavelengths. Spectral ranges with relevant chemical information for Enterococcus faecium bacteria discrimination were outlined.
O. Preisner (*) : J. C. Menezes Centre for Biological and Chemical Engineering, IST, Technical University of Lisbon, Av. Rovisco Pais, 1049-001 Lisbon, Portugal e-mail:
[email protected]
Keywords PFGE typing . FT-IR spectroscopy . Partial least-squares (PLS) . Genetic algorithm (GA) . PLS-bootstrapping
J. A. Lopes REQUIMTE, Department of Chemistry-Physics, Faculty of Pharmacy, University of Porto, Rua Anibal Cunha 164, 4050-047 Porto, Portugal
Introduction
R. Guiomar : J. Machado Enterobactereaceae Unity, Bacteriology Center, National Health Institute Ricardo Jorge, Avenida Padre Cruz, 1649–016 Lisbon, Portugal
In recent years, the effective treatment of various infectious diseases has become more difficult because of the progressive spread of bacterial resistance to commonly used antimicrobial agents. Precise epidemiologic investigation requires an assessment of relatedness between individuals with similar infections in order to determine whether person-to-person spread has occurred. This implies a
continuing need for rapid, accurate, and reliable methods that allow discrimination among a wide range of clinically relevant microbial pathogens down to the species, subspecies, and strain level. In most clinical microbiological laboratories, molecular diagnostics have been successfully used to identify and characterize epidemic microorganisms [1, 2]. However, although these techniques are highly specific, sensitive, and helpful in providing a more complete view of microbial systematics, it is still difficult to adapt them for use in large-scale studies on a routine basis. The methods can generally be time consuming and labor intensive, thereby hindering the effective treatment of microbial infections [3]. Vibrational spectroscopic techniques, namely Fourier transform infrared (FT-IR) and Raman spectroscopy, have been used since the 1980s as complementary methods for bacteria differentiation owing to their rapid ‘fingerprinting’ capabilities and the molecular information that they can provide [4–6]. These techniques present several advantages in the microbiological classification and identification field: they are fast (requiring virtually no sample processing), non-destructive, general, multi-purpose (e.g., detection, enumeration, classification, identification), discriminating at different taxonomic levels (serotype, strain, species, or genus) [7]. The use of infrared spectroscopy in this context was reported by Riddle and co-workers [8] and Norris [9] in the late 1950s. Development and cost reduction of modern spectrometers (improvement of spectrometers specifications like the development of interferometric IR spectroscopy) and of multivariate statistical methods boosted the number of applications of vibrational spectroscopy in the microbiology field. FT-IR has been the most used technique for bacterial classification and identification. 
FT-IR spectroscopy allows the discrimination of intact microbial cells in a non-destructive manner, producing an infrared (IR) spectrum composed of many different vibrational modes of all cellular components, e.g., nucleic acids, proteins, and membrane and cell wall components. Therefore, the IR spectra can be considered as highly specific fingerprints that enable accurate microbial identification [5, 7]. Several papers published in recent years have reported the use of different sampling techniques such as transmittance, diffuse reflectance, attenuated total reflectance (ATR) [10, 11]. Raman spectroscopy has also been used to differentiate between bacterial cells at different taxonomic levels [12, 13]. FT-IR microspectroscopy and confocal Raman have also been used in recent years for more specific microbiological [14–16] and clinical applications [17]. Vibrational spectroscopy generates large amounts of data, thus requiring appropriate multivariate statistical methods (chemometrics methods) depending on the specific
objective. Previous studies on microbial identification applied both unsupervised and supervised techniques for representation of information from hyper-spectral data, namely principal component analysis (PCA), factor analysis (FA), linear discriminant analysis (LDA), hierarchical cluster analysis (HCA), partial least-squares regression (PLS), and artificial neural networks (ANNs) among others [18–20]. Although levels of almost 100% correct bacteria classification or discrimination have been demonstrated [7], reports of successful field implementation of vibrational spectroscopy for routine analysis are almost non-existent. This is due to a lack of robustness of the classification/discrimination algorithms. In general it is not enough to assess the long-term accuracy of the developed method: it is also important to assess the model’s precision. Both the systematic error (inaccuracy) and the random error (imprecision) should be minimized in order to develop a robust model. Errors are normally estimated with RMSECV (root mean square error of cross-validation: internal validation error) or RMSEP (root mean square error of prediction: external validation error). Even when the experimental sampling and spectroscopic procedure are well established there are sources of variability that must be kept to a minimum. Calibration transfer has been extensively analyzed and is known to be a problem [21]. Physical effects caused by light scattering, baseline drift, and random noise have an impact on the performance of the microbial characterization process. These effects are typically compensated for by adequate spectra pre-processing (e.g., spectra normalization, filtering, applying derivatives, standard normal variate, multiplicative scatter correction) [22]. A criterion for wavelength (or wavenumber) selection is normally helpful to prevent parts of the spectra that are non-informative for the specific objective from being included in the chemometric model.
Typical FT-IR spectra can have more than 3,600 data points (e.g., spectra from 4,000 to 400 cm−1 with 1 cm−1 resolution). Some are redundant and many more contain no useful information for the specific purpose. Using all wavelengths generally gives rise to less robust predictions. Several methods exist for efficient wavelength selection in this context (e.g., evolutionary methods, genetic algorithms, and bootstrapping) [23, 24]. The main purpose of this study was to assess the improvement that careful spectra pre-processing and variable selection bring to the accuracy and robustness of an FT-IR-based classification model at the serotype level. Different pre-processing methods and two variable-selection methods, namely genetic algorithms (GAs) and PLS-bootstrapping, were investigated and their relative merits for microorganism classification are reported together with results obtained using the entire spectra. The results are compared with the pulsed-field gel electrophoresis (PFGE) typing method.
Theory
Chemometrics models

Partial least-squares (PLS) is a class of regression models based on the calculation of latent variables or factors [25]. In PLS these variables are calculated in order to maximize the covariance between the scores of an independent block (X) and the scores of a dependent block (Y). In this paper the prediction of a multivariate array (Y) from a two-dimensional matrix (X) will be considered. The multivariate array Y contains only zeros and ones and is used to code classes. For each sample the vector length is equal to the number of classes, and the vector contains the value 1 for the class corresponding to that sample (the remaining entries are zeros). This special type of PLS model is called a discriminant PLS model (Di-PLS), since the goal is to discriminate between classes [26, 27]. Principal component analysis (PCA) is a widely used chemometric technique for data-mining, classification, feature extraction, and process supervision purposes [27]. It was applied here to analyze FT-IR spectroscopy data in terms of discriminative performance.

Table 1 Genetic algorithm parameters adopted for wavelength selection in PLS models

Parameter                   Value
Population size             20
Probability of mutation     0.005
Probability of crossover    1.000
Number of iterations        100
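The settings in Table 1 can be made concrete with a small sketch. The code below is an illustration only, not the implementation used in the study (which relied on the PLS Toolbox): it assumes binary genes that switch windows of adjacent wavenumbers on or off, binary tournament selection, single-point crossover, bit-flip mutation, and elitism. The function `run_ga`, the toy fitness, and the number of windows (68) are hypothetical; in the study the fitness would be a cross-validated PLS error.

```python
import numpy as np

def run_ga(fitness, n_genes, pop_size=20, p_mut=0.005, p_cross=1.0,
           n_iter=100, seed=0):
    """Minimise fitness(mask) over boolean masks; defaults follow Table 1."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_genes)) < 0.5       # random initial population
    scores = np.array([fitness(ind) for ind in pop])

    def tournament():
        # Binary tournament: the fitter of two random individuals is selected.
        a, b = rng.integers(pop_size, size=2)
        return pop[a] if scores[a] <= scores[b] else pop[b]

    for _ in range(n_iter):
        children = [pop[np.argmin(scores)].copy()]    # elitism (an assumption)
        while len(children) < pop_size:
            child, mate = tournament().copy(), tournament()
            if n_genes > 1 and rng.random() < p_cross:
                cut = rng.integers(1, n_genes)        # single-point crossover
                child[cut:] = mate[cut:]
            flip = rng.random(n_genes) < p_mut        # bit-flip mutation
            children.append(np.where(flip, ~child, child))
        pop = np.array(children)
        scores = np.array([fitness(ind) for ind in pop])
    best = np.argmin(scores)
    return pop[best], scores[best]

# Toy fitness (hypothetical): distance to a hidden set of informative windows.
target = np.zeros(68, dtype=bool)
target[[20, 21, 40]] = True
toy_fitness = lambda mask: int(np.sum(mask != target))

best_mask, best_score = run_ga(toy_fitness, n_genes=68)
```

Coding each gene as a window rather than a single wavenumber keeps the search space small, which is one plausible reading of why adjacent variables are grouped to limit overfitting.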
Genetic algorithms (GAs)

Genetic algorithms (GAs) are a class of evolutionary optimization methods that can be used as a search strategy in large multivariate problems for which there are many possible solutions [28, 29]. The first step is to create an initial population (array), consisting of a predefined number of individuals (rows) and variables (columns). The next step in the GA is inspired by principles of genetics and natural selection. The individuals are selected for the next generation through the process of fitness assessment, crossover, and mutation. In order to evaluate the robustness of the model proposed by each individual, the GA uses a fitness function. Selection of individuals for the next generation is typically accomplished by a selection rule. The tournament selection rule is one of the most commonly adopted [29]. To reduce the risk of overfitting, adjacent variables were grouped into windows containing a minimum of 50 wavenumber channels. Table 1 summarizes the GA parameters used in this study.

Partial least-squares bootstrap method (PLS-bootstrap)

The bootstrap is a computational technique used for assigning measures of accuracy to statistical estimates. It is a non-parametric resampling technique that can be used to estimate statistical parameters of some population, such as the mean and standard deviation. Bootstrap is based on a repetitive estimation of some statistic by changing the population individuals used to estimate that statistic. The uncertainty in estimating these statistics depends on the number of times the bootstrapping is applied. It is common to use between 1,000 and 10,000 bootstrap replicates. Here, bootstrap was applied in conjunction with PLS and PCA. In the former case, the objective is to assess the statistical significance of model coefficients (PLS regression coefficients). The standard deviation of each coefficient is estimated and used to verify the statistical significance of each regression coefficient through the confidence interval $\beta_i = b_i \pm t_{\alpha/2,df}\, s_\beta$, where $\beta_i$ is the true coefficient, $b_i$ is its average value, $t_{\alpha/2,df}$ is the critical value of Student's t-distribution for significance level $\alpha$ and appropriate degrees of freedom, and $s_\beta$ is the PLS coefficient sample standard deviation estimated with bootstrapping. In our study, the PLS-bootstrap algorithm is applied as a variable selection technique, as previously proposed by Lazraq et al. [30]. A PLS regression coefficient is considered to be non-significant if zero is within the 95% confidence limits for that coefficient. Wavelengths yielding non-significant coefficients are removed from calibration, and the model is rebuilt. The process is repeated until all remaining wavelengths are considered to be significant.

Principal component analysis-bootstrap method (PCA-bootstrap)

To investigate whether selected pre-processing and variable selection methods yielded more robust classification models, a resampling strategy associated with PCA was selected. The basis for this methodology is to assess the uncertainty in principal components (or scores) using bootstrapping. The bootstrap method coupled to the leave-one-out concept yields different estimates for each sample score (in a cross-validation sense).
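Returning briefly to the PLS-bootstrap rule described earlier in this section, the significance test and the iterative removal of wavelengths can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: a single 0/1 class column is regressed with a textbook PLS1 (NIPALS) routine, the degrees of freedom are taken as the number of bootstraps minus one, and the function names (`pls1_coefficients`, `bootstrap_significant`, `pls_bootstrap_select`) and the synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats

def pls1_coefficients(X, y, n_comp):
    """Regression coefficients of a PLS1 (NIPALS) model on mean-centred data."""
    Xk, yk = X - X.mean(axis=0), y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)                 # weight vector
        t = Xk @ w                             # scores
        p = Xk.T @ t / (t @ t)                 # X loadings
        c = (yk @ t) / (t @ t)                 # y loading
        Xk, yk = Xk - np.outer(t, p), yk - c * t   # deflation
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def bootstrap_significant(X, y, n_comp, n_boot=200, alpha=0.05, seed=0):
    """True for variables whose confidence interval excludes zero."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(n, size=n)          # resample samples with replacement
        coefs[b] = pls1_coefficients(X[idx], y[idx], n_comp)
    b_mean, s_beta = coefs.mean(axis=0), coefs.std(axis=0, ddof=1)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n_boot - 1)   # df is an assumption
    return np.abs(b_mean) > t_crit * s_beta

def pls_bootstrap_select(X, y, n_comp, **kwargs):
    """Drop non-significant variables and refit until all remaining are significant."""
    keep = np.arange(X.shape[1])
    while keep.size > n_comp:
        sig = bootstrap_significant(X[:, keep], y, n_comp, **kwargs)
        if sig.all() or not sig.any():
            break
        keep = keep[sig]
    return keep

# Synthetic example: only variables 0 and 1 carry class information.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 6))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
selected = pls_bootstrap_select(X, y, n_comp=2)
```

The example uses 200 bootstraps only to keep the run short; the text above suggests 1,000 to 10,000 in practice.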
Standard deviation for each sample score can then be used to assess uncertainty. For example, in a two-dimensional score map, each sample is represented by a cloud of points (where the cloud area represents uncertainty). For classification purposes, this representation makes it easier to visualize the existing clusters than single score estimates do. Consider a spectra data set X with dimensions N (samples) times J (wavenumbers). The following steps describe this procedure:

Perform adequate pre-processing steps in X
Build a PCA model with X (obtain loadings P)
For each sample n (x_n) in dataset X do (n = 1 : N)
    Remove x_n from dataset X (obtain X_n)
    For each bootstrap b do (b = 1 : B)
        Select randomly with resampling N − 1 samples from X_n (obtain X_nb)
        Mean center X_nb
        Build a PCA model with X_nb (obtain loadings P_nb)
        Rotate P_nb so that they match the reference loadings P (obtain PR_nb)
        Apply mean centering factors to x_n
        Project sample n in the PCA model using PR_nb (obtain scores T_nb)
    End
End

Results are a series of B score vectors (T_nb) for each sample n in the entire spectra dataset. Standard deviation can be obtained for each dimension in the score vectors. The procedure for rotating loadings is required, since different PCA models (made with different bootstrapped sets) yield rotated loadings (rotational freedom of the PCA model). To be able to compare the scores from different models the corresponding loadings must be rotated. The rotation is done so that the loadings of each bootstrapped model fit the loadings of the reference model. Consider that the goal is to match the loadings of the b-th bootstrapped model made without sample n ($P_{nb}$) to the reference loadings ($P$). Reference loadings can be obtained using a PCA model with all samples. The procedure used to perform the rotation is (superscript T denotes the transpose):

Perform a singular value decomposition on $P_{nb}^{T} P$: $U S V^{T} = \mathrm{svd}(P_{nb}^{T} P)$
Obtain rotation matrix $Q_{nb}$: $Q_{nb} = U V^{T}$
Obtain rotated loadings $P^{R}_{nb}$: $P^{R}_{nb} = P_{nb} Q_{nb}$

Rotated loadings $P^{R}_{nb}$ will fit the reference loadings ($P$) as closely as possible. Differences between $P^{R}_{nb}$ and the reference $P$ will give rise to differences in score estimates, which are required to assess uncertainty. The effectiveness of the rotation towards the reference model was assessed through a measure of similarity. This measure provides an indication of the average similarity between the reference model (loadings obtained for a model with all samples) and the rotated models (models obtained by cross-validation followed by the illustrated rotation). The proposed similarity measure was obtained by subtracting the reference loadings from the rotated loadings. The difference was squared and divided by the squared reference loadings (yielding a captured variance). The relative differences are summed over all wavelengths. Because at each cross-validation step a new set of rotated loadings is generated, an average is computed for the similarity measure. This measure can be obtained for each loading.

Experimental

Strains

We studied a total of 31 vancomycin-teicoplanin resistant Enterococcus faecium (VRE) isolates from a Portuguese hospital during the year 2005. Twenty-eight strains were isolated from feces, one from a catheter, one from a perianal swab, and one from urine.

Molecular typing (PFGE) and phenotypic typing (FT-IR)
For the molecular typing the strains were cultivated on blood agar at 37 °C for 24 h. The DNA was prepared and enclosed in agarose plugs using a previously described method [31]. Restriction digestion of chromosomal DNA was performed with SmaI for 18 h at 30 °C. The electrophoresis was performed in a CHEF Mapper XA Pulse Field Electrophoresis System (BioRad). The Lambda Ladder PFG Marker (BioLabs) was used as a molecular size marker. The pulse time ramped from 1 to 35 s over 28 h at 11 °C and 6 V/cm. The results were interpreted using the bio-informatics program BioNumerics, version 3.5.

For the phenotypic typing the strains were cultured in tryptose soya agar (TSA) for 24 h at 37 °C. Aliquots of 35 μL of the bacterial suspensions were evenly applied onto each well in a plate. Prior to analysis the samples were oven dried at 44 °C for 40 min. Samples were run in replicate and analyzed by FT-IR spectroscopy using a TENSOR spectrometer (Bruker Optik GmbH) in transmittance mode. Spectra were collected over the wavenumber range of 4,000 cm−1 to 600 cm−1 under the control of a personal computer using OPUS 5.0 software. Spectra (see Fig. 2) were displayed in terms of absorbance as a function of the wavenumber (cm−1).

[Fig. 1 image: dendrogram with PFGE similarity scale (85–100%); Dice (Opt: 5.00%) (Tol 2.0%–2.0%) (H>0.0% S>0.0%) [0.0%–100.0%]. Isolate numbers and PFGE types as extracted: 25, 32, 2, 5, 6, 11, 18, 20, 22, 29, 31, 36: sma1; 40: sma1D; 1, 3: sma1; 7: sma1A; 21: sma1C; 38: sma1G; 30: sma1E; 4: sma1B; 8, 9, 10, 17, 33: sma2; 16: sma2A; 27: sma2B; 19: sma2C]

Fig. 1 Dendrogram and PFGE patterns of SmaI- and SmaII-digested chromosomal DNA of Enterococcus faecium isolates. The dendrogram was constructed with BioNumerics software, with 5% optimization, by using the UPGMA algorithm and Dice similarity coefficients. The clusters (Sma1 and Sma2) contain isolates with similarity coefficients greater than 84%

Data analysis and calibration development

The quality of each spectrum was evaluated using a quality test in the OPUS 5.0 software. All calculations were carried out using Matlab version 6.5 release 13 (MathWorks, Natick, MA) and the PLS Toolbox version 3.5 for Matlab (Eigenvector Research, Manson, WA). Savitzky-Golay derivatives, multiplicative scatter correction (MSC), extended multiplicative scatter correction (EMSC), and standard normal variate (SNV) were applied to the spectra and compared. Discriminant partial least-squares (Di-PLS) was used to develop models for bacteria discrimination [32]. Two algorithms for variable selection were investigated: genetic algorithm (GA) and PLS-bootstrap. The optimal number of latent variables (LV) to include in the model was determined through the contiguous-block-out cross-validation method (block size equal to 4 samples). This repetitive procedure consists of setting aside 4 samples of the calibration set at a time, developing a calibration model without the excluded samples, and predicting the identity of the excluded samples using that calibration model. The performance of the developed models was evaluated according to their predictive capability (bias), assessed as the root mean square error of cross-validation (RMSECV). Uncertainty in estimating the principal component scores of the spectra was also statistically assessed (variance).
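The contiguous-block cross-validation and the RMSECV figure of merit can be sketched as follows. This is a hedged illustration rather than the PLS Toolbox routine used in the study: the function names are hypothetical, the data are synthetic, and an ordinary least-squares model stands in for the Di-PLS model purely to keep the example short.

```python
import numpy as np

def contiguous_block_rmsecv(X, y, fit_predict, block_size=4):
    """RMSECV with consecutive blocks of `block_size` samples left out in turn."""
    n = X.shape[0]
    residuals = np.empty(n)
    for start in range(0, n, block_size):
        test = np.arange(start, min(start + block_size, n))
        train = np.setdiff1d(np.arange(n), test)
        y_hat = fit_predict(X[train], y[train], X[test])   # model never sees the block
        residuals[test] = y[test] - y_hat
    return np.sqrt(np.mean(residuals ** 2))

def least_squares_fit_predict(X_train, y_train, X_test):
    """Stand-in model (ordinary least squares), NOT the Di-PLS model of the paper."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    coef, *_ = np.linalg.lstsq(X_train - x_mean, y_train - y_mean, rcond=None)
    return (X_test - x_mean) @ coef + y_mean

# Synthetic data: 28 hypothetical isolates, two replicates stored consecutively.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=28)
y = np.repeat(labels, 2).astype(float)        # 0/1 class code per spectrum
X = rng.normal(size=(56, 5))
X[:, 0] = y + 0.1 * rng.normal(size=56)       # one informative variable
rmsecv = contiguous_block_rmsecv(X, y, least_squares_fit_predict)
```

Because replicates sit next to each other, a block of 4 removes both spectra of two isolates at once, so no isolate contributes to both the training and the test set in any split.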
Results and discussion

Samples

Thirty-one isolates of Enterococcus faecium were classified according to chromosomal DNA restriction patterns produced by PFGE. The PFGE typing technique of Enterococcus faecium involves the digestion of the chromosomal DNA with restriction endonucleases (SmaI and SmaII). The restriction fragments are resolved into patterns of discrete bands that, when compared with one another, determine the relatedness of isolates (Fig. 1). Of the 31 isolates analyzed, 24 were identified as E. faecium SmaI type and seven were identified as E. faecium SmaII type. To evaluate the performance of FT-IR in discriminating between the strains under study, a total of 62 spectra were acquired from two replicate test portions of each bacterial strain (replicates were prepared independently and measured on different days). Figure 2 depicts the raw FT-IR spectra of the 62 samples of E. faecium.

Pre-processing methods evaluation

With the purpose of optimizing the microbial discrimination, we analyzed different spectra pre-processing methods and assessed their impact on spectral variability and model complexity reduction (e.g., number of latent variables), and
Fig. 2 Raw FT-IR spectra collected for the 31 Enterococcus faecium samples and their replicates
Table 2 Description of the pre-processing methods evaluated for Enterococcus faecium discrimination (SmaI and SmaII types) by FT-IR spectroscopy

Pre-processing   Category                                       Description
MSC              Scatter correction                             Multiplicative scatter/signal correction
EMSC             Scatter correction                             Extended multiplicative scatter correction
SNV              Scatter correction                             Standard normal variate
S1               Derivatization filtering                       Savitzky-Golay, frame size 9 points, polynomial of 2nd degree, 1st derivative
S2               Derivatization filtering                       Savitzky-Golay, frame size 15 points, polynomial of 2nd degree, 2nd derivative
SNV_S1           Derivatization filtering, scatter correction   Savitzky-Golay, frame size 9 points, polynomial of 2nd degree, 1st derivative, standard normal variate
SNV_S2           Derivatization filtering, scatter correction   Savitzky-Golay, frame size 15 points, polynomial of 2nd degree, 2nd derivative, standard normal variate
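As an illustration of the combined treatments in Table 2, the sketch below applies the S1 filter settings (9-point window, 2nd-degree polynomial, 1st derivative) followed by SNV, using SciPy's `savgol_filter`. The order of the two steps, the function names, and the synthetic spectra are assumptions; the derivative is taken with respect to the point index, not cm−1.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def savgol_derivative(spectra, window, deriv, polyorder=2):
    """Savitzky-Golay smoothing derivative along the wavenumber axis."""
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=deriv, axis=1)

def snv_s1(spectra):
    # Assumed order: derivative first, then SNV (Table 2 lists them in this order).
    return snv(savgol_derivative(spectra, window=9, deriv=1))

# Synthetic spectra: one band with varying intensity plus a baseline offset.
rng = np.random.default_rng(0)
wavenumbers = np.linspace(4000, 600, 341)
band = np.exp(-0.5 * ((wavenumbers - 1080) / 40) ** 2)
spectra = (rng.uniform(0.5, 1.5, (4, 1)) * band
           + rng.uniform(0.0, 0.3, (4, 1))
           + 0.001 * rng.normal(size=(4, 341)))
processed = snv_s1(spectra)
```

The first derivative removes the additive baseline offset, and SNV then removes the remaining multiplicative intensity differences between spectra; for SNV_S2 the same functions would be called with `window=15` and `deriv=2`.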
influence on model robustness. Since some of the methods used in spectra pre-processing can lead to information removal, several methods and their combinations were tested (Table 2), such as first- and second-order spectral derivatives (Savitzky-Golay filtering method), multiplicative scatter correction (MSC), extended multiplicative scatter correction (EMSC), and standard normal variate (SNV). Di-PLS was used to develop the calibration model for bacteria discrimination, which regresses the PFGE classification data onto the spectral data while maximizing the squared covariance between them [26]. Models were developed based on the mean-centered pre-processed spectral data and the autoscaled PFGE classification data. Outlier detection was performed based on the leverage values, Q-residuals, and Studentized y-residuals [33]. From the original set of 62 samples, 6 were identified as outliers based on the abovementioned criteria. Therefore, only 56 samples were included for model development and variable selection.

Two different variable selection methods (GA and PLS-bootstrap) were applied to eliminate non-informative wavenumbers and thus obtain potentially more robust models. The optimal number of latent variables (LV) to include in the calibration model was determined by cross-validation (CV). We adopted the contiguous-block-out method as the internal validation strategy (block size equal to 4 samples). The contiguous block method was adopted because replicates (there is one replicate for each sample) were placed consecutively in the dataset. Therefore the cross-validation procedure, by excluding four consecutive samples each time, prevents the same sample being used in the training and testing sets. The performance of developed models was evaluated according to their predictive performance, assessed as the root mean square error of cross-validation (RMSECV). Table 3 describes model performance and details with the PLS algorithm, for both the original spectral variables and the variable sets obtained after variable selection with
Table 3 Cross-validation results for original spectral variables and variable sets obtained after variable selection

              Total range     Genetic algorithm                     PLS-bootstrap
Model code    RMSECV   LV     RMSECV   LV   Number of variables     RMSECV   LV   Number of variables
MSC           0.1748   8      0.0656   5    100                     0.1093   11   396
EMSC          0.1736   8      0.1072   6    200                     0.1043   9    492
SNV           0.1748   8      0.0657   6    100                     0.1057   10   421
S1            0.1422   4      0.1604   3    50                      0.1372   3    1,315
S2            0.1362   2      0.1426   2    50                      0.1338   2    1,151
SNV_S1        0.1064   6      0.0672   3    50                      0.0567   6    366
SNV_S2        0.0807   4      0.0677   4    100                     0.0700   2    723

The performance of developed models was evaluated according to their predictive capabilities assessed by RMSECV (see code keys in Table 2). The optimal number of latent variables (LV) was evaluated according to the predictive performance of models. Number of variables = number of wavenumbers selected from the entire spectrum after variable selection (selected spectral region)
GA and PLS-bootstrap for different spectra pre-processing treatments (Table 2). In general, both variable selection methods improved the predictive performance of the models. The application of
Fig. 3 Partial least-squares score plots for pre-processed FT-IR spectra (model SNV_S1) of Enterococcus faecium strains classified according to their chromosomal DNA restriction patterns produced by PFGE. Circles represent SmaI type, squares represent SmaII type: a using all wavenumbers; b using GA-selected wavenumbers; c using PLS-bootstrap-selected wavenumbers
Fig. 4 Pre-processed FT-IR spectra with superimposed spectral regions used for PLS analysis: a all wavenumbers (4,000–400 cm−1); b GA-selected wavenumbers; c PLS-bootstrap-selected wavenumbers
Table 4 Captured variance and measure of similarity between rotated and reference models

Component    Captured variance (%)    Similarity (%)
1            37.6                     99.7
2            17.0                     99.6
3            14.1                     99.6
4            8.0                      97.6
5            7.1                      98.9
6            5.0                      98.5
7            2.9                      96.7
8            2.1                      94.1
9            1.2                      82.3
10           1.0                      70.8

Rotated models were obtained by leave-one-block-out cross-validation (block size equal to 4 samples). Values are relative squared differences between rotated and reference loadings for the first 10 components (values in %).
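The rotation and the similarity values reported in Table 4 can be illustrated with a short sketch. It is one possible reading of the description in the "Theory" section, not the authors' code: the similarity is assumed to be 100 × (1 − sum of squared loading differences / sum of squared reference loadings) per component, averaged over the cross-validation blocks, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def pca_loadings(X, n_comp):
    """Loadings (J x n_comp) of a PCA on mean-centred X, computed by SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:n_comp].T

def rotate_to_reference(P_nb, P_ref):
    """Orthogonal Procrustes rotation: Q = U V^T from svd(P_nb^T P_ref)."""
    U, _, Vt = np.linalg.svd(P_nb.T @ P_ref)
    return P_nb @ (U @ Vt)

def similarity_percent(P_rot, P_ref):
    """Assumed definition: 100 * (1 - relative squared difference) per component."""
    rel = np.sum((P_rot - P_ref) ** 2, axis=0) / np.sum(P_ref ** 2, axis=0)
    return 100.0 * (1.0 - rel)

# Synthetic data with three dominant directions (56 samples, 30 variables).
rng = np.random.default_rng(0)
latent = rng.normal(size=(56, 3)) * np.array([5.0, 3.0, 1.5])
X = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(56, 30))

P_ref = pca_loadings(X, 3)                      # reference model, all samples
sims = []
for start in range(0, 56, 4):                   # leave one block of 4 out
    X_cv = np.delete(X, np.arange(start, start + 4), axis=0)
    P_rot = rotate_to_reference(pca_loadings(X_cv, 3), P_ref)
    sims.append(similarity_percent(P_rot, P_ref))
mean_similarity = np.mean(sims, axis=0)         # one value per component
```

The rotation absorbs sign flips and component swaps between models, so only genuine changes in the loadings lower the similarity.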
PLS-bootstrap led to a significant decrease in RMSECV for all models, while the GA approach had an impact on RMSECV (decrease for MSC, EMSC, SNV, SNV_S1, and SNV_S2; small increase for S1 and S2). The SNV_S1 model presented the lowest value of RMSECV for all spectral range, as well as for each variable selection method. Its performance together with selected variables by both GA and PLS-bootstrap methods are presented in Fig. 3. By viewing 3D score plots of the first three latent variables, it is possible to observe that selection of certain wavenumbers has a remarkable impact on isolates’ clustering at the level of PFGE typing. In comparison with a score plot of all the spectral range, two non-overlapping clusters corresponding to known strain types under investigation are clearly separated in the score plots while using only selected variables (Fig. 3b and c). The plots in Fig. 4 present the variables that were commonly selected from all 100 independent GA and PLS-bootstrap runs, superimposed on the pre-processed mean spectrum of the whole dataset. In general, the GA technique selected less variables from the initial dataset than PLS-bootstrap. In the SNV_S1 model, the GA approach selected a subset of 50 variables that revealed a spectral region of 1,110–1,058 cm−1, whereas the PLS-bootstrap method selected a bigger subset of 366 variables that were assigned to wider spectral region of predominantly 1,500–850 cm−1. The results showed that the region selected by both methods (1,110–1060 cm−1) is rich in bands assigned to P=O symmetric stretching. Further studies are required to investigate the differences between the two groups from the biochemical point of view and if these can be correlated with the identified discriminant FTIR spectral regions. The performance of the rotation method was analyzed for this dataset. Table 4 contains the captured variance for a PCA model including all samples, all wavelengths, and SNV_S1 processing type. 
The first and second components represent more than half of the captured variance. The performance of the rotation procedure described before was also assessed using the similarity measure proposed in the “Theory” section. PCA models with 10 components were tested using the leave-one-block-out strategy (leaving 4 samples out) as described before (models including all
samples and wavelengths). A similarity measure result was obtained for each loading (averaging the differences between loadings for each cross-validation step) as shown in Table 4. It can be seen that the rotation procedure is very effective for the first components. More than 95% similarity is obtained for components 1 to 7. The last components have higher rotation errors, as expected. For example, the error for component 10 is around 30%. This can be explained by the fact that the last components retain less variance and are subject to larger variations in the cross-validation procedure. Therefore the rotation method was validated at least for the first components. In order to assess the impact of the applied pre-processing and variable selection methods on the models’ robustness and to give an impression of the uncertainty in our estimation procedure, we used the resampling strategy associated with PCA. The obtained PCA models for the entire spectral range as well as for selected variables are presented in Fig. 5. Two-component models were selected. It can be observed that confidence limits generally shrink as the number of variables in the model is reduced (Fig. 5b and c). Comparing the GA-variable-selected model with the model using all wavelengths, the relative standard deviation of the first PCA score fell to about 54% of its all-wavelength value. A similar value was found for the PLS-bootstrap-variable-selected model. Table 5 summarizes the relative standard deviation obtained for the first two components. These results show that for closely related samples and problems of difficult discrimination, a variable selection method not only improves the bias of regression models, but can also improve discrimination based on unsupervised models like PCA (SIMCA) or hierarchical
Fig. 5 PCA score plots for the first and second principal components of FT-IR spectroscopy data. Spectra were pre-processed with SNV_S1 processing and mean centered before PCA. Bootstrap was used to assess uncertainty on scores estimation (100 bootstraps per sample). Sample scores were calculated using a leave-one-out strategy as explained in the “Theory” section. Ellipses corresponding to 95% confidence limits are represented for each sample. Circles represent SmaI type, squares represent SmaII type: a all wavelengths; b GA-selected wavelengths; c PLS-bootstrap-selected wavelengths
Table 5 PCA scores average relative uncertainty over 56 samples (percentage values relative to the mean when all wavelengths are considered are in parentheses)
                                      Score 1        Score 2
All wavelengths                       0.25 (100)     0.35 (100)
Selected wavelengths (GA)             0.14 (54.1)    0.16 (47.0)
Selected wavelengths (PLS-bootstrap)  0.14 (55.7)    0.36 (102.6)
cluster analysis, for example. Decreasing the uncertainty in the estimated scores (for PCA models) makes misclassifications less probable.
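The resampling idea discussed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic data, the use of an orthogonal Procrustes rotation to align bootstrap loadings with the reference loadings, and the normalization of the score standard deviation by the root-mean-square reference score are all assumptions made for illustration. The leave-one-out score calculation described in the “Theory” section is not reproduced here.

```python
import numpy as np


def pca_loadings(X, n_comp):
    """Return the column mean and the first n_comp loadings (variables x n_comp)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_comp].T


def procrustes_rotate(P, P_ref):
    """Rotate P with the orthogonal matrix R that minimizes ||P @ R - P_ref||_F."""
    u, _, vt = np.linalg.svd(P.T @ P_ref)
    return P @ (u @ vt)


def bootstrap_score_uncertainty(X, n_comp=2, n_boot=100, seed=0):
    """Average relative standard deviation of PCA scores over bootstrap models.

    Assumption: the spread of each sample's score across bootstrap models is
    divided by the RMS of the reference scores for that component.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mean_ref, P_ref = pca_loadings(X, n_comp)
    T_ref = (X - mean_ref) @ P_ref

    T_boot = np.empty((n_boot, n, n_comp))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample rows with replacement
        mean_b, P_b = pca_loadings(X[idx], n_comp)
        P_b = procrustes_rotate(P_b, P_ref)         # undo sign flips and rotations
        T_boot[b] = (X - mean_b) @ P_b              # scores of the original samples

    sd = T_boot.std(axis=0, ddof=1)                 # per sample, per component
    scale = np.sqrt((T_ref ** 2).mean(axis=0))      # RMS reference score
    return sd.mean(axis=0) / scale


# Synthetic stand-in for 56 spectra of two types (hypothetical data).
rng = np.random.default_rng(1)
n, p = 56, 200
labels = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
informative = np.arange(20)                         # hypothetical "selected" variables
X[:, informative] += 3.0 * labels[:, None]

print("all variables     :", bootstrap_score_uncertainty(X))
print("selected variables:", bootstrap_score_uncertainty(X[:, informative]))
```

Running the function once on the full matrix and once on a column subset gives two vectors of relative uncertainties that can be compared in the spirit of Table 5. Whether the subset lowers the uncertainty depends on the data and on which variables are retained.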
Conclusions Finding appropriate spectra pre-processing strategies to compensate for experimental and instrumental variability is critical to the successful identification of microbial pathogens by FT-IR. The aim of our study was to analyze and optimize different spectra pre-processing methods and to assess their impact on the reduction of spectral variability and on the robustness of chemometrics models. To further optimize bacteria classification, we assessed the impact of selecting combinations of the most discriminative spectral regions. Our results clearly demonstrate that variable selection by genetic algorithms (GAs) and by bootstrapping (PLS-bootstrap) improved predictive ability relative to PLS calibration models developed without variable selection. There is no significant difference between the performances of the calibration models obtained with the two variable-selection algorithms; in both cases, the predictive ability of discriminant PLS models based on corrected spectra improved by up to 40% compared with equivalent models using the entire spectral range. The selection of specific wavenumbers also improved the estimation of scores in principal component analysis models: the uncertainty in estimating scores was reduced by about 50% compared with models using all wavelengths. These results also indicate that unsupervised models will perform better when careful variable selection is used. The outputs of the variable selection methods additionally reveal the most informative spectral ranges for microbial classification. Further investigation is required to determine whether these variables are common to all bacterial pathogens or whether they vary between species, subspecies and strains of different microorganisms.
Acknowledgements O.P. gratefully acknowledges financial support from the Portuguese Foundation for Science and Technology (research grant no. SFRH/BD/15218/2004).