European Radiology https://doi.org/10.1007/s00330-018-5343-0
MOLECULAR IMAGING
Robustness versus disease differentiation when varying parameter settings in radiomics features: application to nasopharyngeal PET/CT
Wenbing Lv 1, Qingyu Yuan 2, Quanshi Wang 2, Jianhua Ma 1, Jun Jiang 1, Wei Yang 1, Qianjin Feng 1, Wufan Chen 1, Arman Rahmim 3,4, Lijun Lu 1
Received: 4 January 2018 / Revised: 14 January 2018 / Accepted: 17 January 2018
© European Society of Radiology 2018
Abstract
Objectives To investigate the impact of the parameter settings used to generate radiomics features on their robustness and on disease differentiation (nasopharyngeal carcinoma (NPC) versus chronic nasopharyngitis (CN) in FDG PET/CT imaging).
Methods We studied 106 patients (69/37 NPC/CN, pathology confirmed), and extracted 57 radiomics features under different parameter settings. Robustness was assessed by the intra-class correlation coefficient (ICC). Logistic regression with leave-one-out cross validation was used to generate classification probabilities, and diagnostic performance was assessed by the area under the receiver operating characteristic curve (AUC).
Results Varying averaging strategies and symmetry, 4/26 GLCM features showed a poor range of pairwise ICCs of 0.02–0.98, while depicting good AUCs of 0.82–0.91. Varying distances, 5/26 GLCM features showed ICCs of 0.82–0.99 while corresponding AUCs were 0.52–0.91. 6/13 GLRLM features showed both high AUC (0.81–0.89) and high ICC (0.85–0.99) with respect to averaging strategies. 7/13 GLSZM features showed AUCs of 0.81–0.90 while having ICCs of 0.01–0.99 under different neighbourhoods. 2/5 NGTDM features showed AUCs of 0.81–0.85 while having ICCs of 0.19–0.89 for different window sizes. When differentiating a subset of NPC (stages I–II) from CN, both SumEntropy and SZLGE achieved significantly higher AUCs than metabolically active tumour volume (AUC: 0.91 vs. 0.72, p < 0.01).
Conclusions Radiomics features depicting poor absolute-scale robustness with respect to parameter settings can still lead to good diagnostic performance. As such, robustness of radiomics features should not be overemphasized as a criterion for removing features when assessing clinical tasks. For differentiating NPC from CN, some radiomics features (e.g. SumEntropy, SZLGE, LGZE) outperformed conventional metrics.
Key Points
• Poor robustness did not necessarily translate into poor differentiation performance.
• Absolute-scale robustness of radiomics features should not be overemphasized.
• Radiomics features SumEntropy, SZLGE and LGZE outperformed conventional metrics.
Keywords Fluorodeoxyglucose F18 . Positron emission tomography computed tomography . Nasopharyngeal carcinoma . Nasopharyngitis . Radiomics
Electronic supplementary material The online version of this article (https://doi.org/10.1007/s00330-018-5343-0) contains supplementary material, which is available to authorized users.
* Jianhua Ma [email protected]
* Lijun Lu [email protected]
1 School of Biomedical Engineering and Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, 1023 Shatai Road, Baiyun District, Guangzhou, Guangdong 510515, China
2 Nanfang PET Center, Nanfang Hospital, Southern Medical University, 1023 Shatai Road, Baiyun District, Guangzhou, Guangdong 510515, China
3 Department of Radiology, Johns Hopkins University, 601 N. Caroline St, Baltimore, MD 21287, USA
4 Department of Electrical and Computer Engineering, Johns Hopkins University, 3101 Wyman Park Drive, Baltimore, MD 21218, USA
Abbreviations
AUC Area under the ROC curve
CN Chronic nasopharyngitis
18F-FDG 2-[18F]-fluoro-2-deoxy-D-glucose
GLCM Grey level co-occurrence matrix
GLRLM Grey level run length matrix
GLSZM Grey level size zone matrix
ICC Intra-class correlation coefficient
LOOCV Leave-one-out cross validation
MATV Metabolically active tumour volume
NGTDM Neighbourhood grey tone difference matrix
NPC Nasopharyngeal carcinoma
ROC Receiver operating characteristic
SUV Standardized uptake value
TLG Total lesion glycolysis
Introduction
FDG PET/CT has been established as a powerful technique for the diagnosis of nasopharyngeal carcinoma [1–4]. However, the intense physiological uptake of FDG in normal brain [3–5] hampers the diagnosis of nasopharyngeal carcinoma. Furthermore, FDG is not tumour-specific: it shows increased uptake in both malignant and benign masses, resulting in relatively poor specificity for differentiating tumour from inflammatory tissue (chronic nasopharyngitis, CN) [6–8].
Conventional metrics (such as SUVmax/mean/peak, MATV and TLG) have been widely adopted in routine clinical PET/CT oncology [9]. Such routine analysis does not quantify intra-tumour heterogeneity, even though the heterogeneity of malignant tumours is well documented and may provide additional information on tumour phenotype compared to conventional metrics [10]. Radiomics features (which quantify heterogeneity and texture in the shape and uptake of tumours) [11–13] can be used to develop models for clinical staging [14], assessment of treatment response [15, 16], discrimination of phenotype (histological subtypes) [17, 18] and prediction of outcome [19–21] for several types of tumours in oncology [22–26]. Radiomics analysis is increasingly developed as a powerful tool towards precision (personalized) medicine [27] and the field is evolving rapidly [12, 13, 21, 28–30].
At the same time, radiomics features are affected by many factors, such as PET image acquisition [31], reconstruction [32], post-smoothing [33], tumour delineation [34] and greyscale discretization [35, 36]. Although these studies have investigated the robustness of radiomics features to different image acquisition and processing parameters, the method used to compute the radiomics matrices/features is itself subject to variability, and there is great need for standardization of parameter settings [37], including averaging strategy, symmetry, distance, neighbourhood and window size. To the best of our knowledge, the robustness of radiomics features has not been thoroughly evaluated given variations in such parameter settings, which we have pursued in the present work.
The purpose of this study was to investigate the impact of variations in the parameter settings used in the generation of feature matrices. We focus on the task of differentiating NPC from CN in FDG PET/CT images with pathology as the gold standard. We provide a brief background next as further motivation for these efforts.
Background
The grey level co-occurrence matrix (GLCM) [38, 39] is generated by counting the occurrences of two voxels with intensities i and j separated by distance D in direction θ. The element P(i, j) of the GLCM is given by:
$$P(i, j) = \#\{\, I(x, y, z) = i,\ I(k, l, m) = j \mid D, \theta \,\} \tag{1}$$
where # represents the number of occurrences of the two voxels, I represents the voxel intensity, and (x, y, z) and (k, l, m) are the coordinates (positions) of two different voxels. The matrix is therefore a function of the direction θ and the distance D between the neighbouring voxels. More elaboration is provided in Online Supplemental Appendix A.
Many efforts describe using one voxel as the distance and taking the average of feature values over 13 directions in 3D volumes of interest [32, 40–42]. Alternative strategies were considered for the computation of GLCM in other efforts [13, 43–46]. Some works investigated the influence of different distances on the performance of radiomics features [43–45]. Hatt et al. compared averaging features from 13 matrices, each built from one of the 13 directions, with computing features from a single matrix obtained by including all 13 directions simultaneously [46]. Aerts et al. constructed asymmetrical GLCMs to derive the texture features [13].
Other matrices also involve different parameter settings. The grey level run length matrix (GLRLM) counts the number of runs of collinearly adjacent voxels having the same grey level in 13 directions. Aerts et al. used 13 GLRLMs followed by averaging the value calculated separately in each matrix [13]. By comparison, Vallieres et al. used only one GLRLM by simultaneously considering all 13 directions [26]. The grey level size zone matrix (GLSZM) describes the number of zones of a certain size having the same intensity within N-connected neighbourhoods. The neighbourhood grey tone difference matrix (NGTDM) characterizes the difference between a centre voxel and its neighbours within a certain window size. Most studies use 3 pixels as the window size, while Yu et al. used 7 pixels to distinguish tumour from normal tissue [47]. Thus, averaging strategies for GLRLM, neighbourhood extent in GLSZM, and window size in NGTDM are also important parameters to consider.
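To make the GLCM parameters concrete, the following is a minimal sketch (not the authors' MATLAB implementation) of how Eq. 1 could be evaluated for one direction, with or without symmetry, and how the 13 directions could either be merged into one matrix or kept separate. The function names, the `[z][y][x]` nested-list layout and the use of 0 to mark voxels outside the VOI are assumptions made for illustration.

```python
from itertools import product

# 13 unique 3D directions: the half of the 26 neighbour offsets whose first
# non-zero component is positive (the other 13 are their reverses).
DIRECTIONS = [d for d in product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]


def glcm(volume, n_levels, direction, distance=1, symmetric=True):
    """Count co-occurrences P[i][j] (Eq. 1) for one direction.

    `volume` is a nested list indexed [z][y][x] holding discretized grey
    levels 1..n_levels; 0 marks voxels outside the VOI (assumed convention).
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    dz, dy, dx = (distance * c for c in direction)
    P = [[0] * n_levels for _ in range(n_levels)]
    for z, y, x in product(range(nz), range(ny), range(nx)):
        zk, yl, xm = z + dz, y + dy, x + dx
        if not (0 <= zk < nz and 0 <= yl < ny and 0 <= xm < nx):
            continue
        i, j = volume[z][y][x], volume[zk][yl][xm]
        if i == 0 or j == 0:
            continue
        P[i - 1][j - 1] += 1
        if symmetric:  # 'S': also count the pair as seen from the reverse direction
            P[j - 1][i - 1] += 1
    return P


def merged_glcm(volume, n_levels, **kwargs):
    """'1' strategy: accumulate all 13 directions into a single matrix."""
    total = [[0] * n_levels for _ in range(n_levels)]
    for d in DIRECTIONS:
        P = glcm(volume, n_levels, d, **kwargs)
        for i, j in product(range(n_levels), repeat=2):
            total[i][j] += P[i][j]
    return total


def per_direction_glcms(volume, n_levels, **kwargs):
    """'13' strategy: one matrix per direction; feature values are averaged afterwards."""
    return [glcm(volume, n_levels, d, **kwargs) for d in DIRECTIONS]


toy = [[[1, 2], [2, 1]], [[1, 1], [2, 2]]]  # 2x2x2 toy VOI with two grey levels
asym = glcm(toy, 2, (0, 0, 1), symmetric=False)
sym = glcm(toy, 2, (0, 0, 1), symmetric=True)
```

A feature is then computed either once on `merged_glcm(...)` or separately on each matrix from `per_direction_glcms(...)` and averaged; the `distance` argument corresponds to D in Eq. 1.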
Table 1 Demographics of the patients in the study
                        NPC          CN
Number of patients      69           37
Age (years)
  Mean                  46.1±12.0    45.8±11.7
  Range                 15–67        22–65
Gender
  Male                  61           30
  Female                8            7
Materials and methods
This retrospective study was approved by the Institutional Review Board, and written informed consent was waived.
Patients and FDG PET/CT protocol
One hundred and six patients (mean age: 46.1±11.8 years, range 15–67; 91 males, 15 females) were retrospectively enrolled. Sixty-nine patients were primarily diagnosed by histopathology with nonkeratinizing undifferentiated carcinoma, including 22 cases with stages I–II and 47 cases with stages III–IV, staged according to the AJCC criteria (8th edition) using the TNM classification system. Thirty-seven patients were diagnosed with CN. The demographics of the patient population are listed in Table 1.
All patients fasted for at least 6 h prior to tracer injection. Imaging was performed 62 min (58±5 min, range: 52–67 min) after intravenous injection of 306–468 MBq (8.27–12.65 mCi) of 18F-FDG (~150 µCi/kg of body weight), and whole-body PET/CT scanning was performed on a Siemens Biograph-128 mCT scanner at the Nanfang Hospital, in compliance with the SNMMI procedure guidelines [48]. PET images were reconstructed using a standard OSEM algorithm with three iterations and 21 subsets. PET image voxel size was 4.07×4.07×5 mm³ and matrix size was 200×200. The CT scans (80 mA, 120 kVp) were used for attenuation correction [49]. CT voxel size was 0.98×0.98×3 mm³, and matrix size was 512×512. DICOM format PET and CT images were exported from the console, and the body-weight-based SUVs were calculated according to:
$$\mathrm{SUV\ (g/mL)} = \frac{\text{tissue activity (Bq/mL)}}{\text{injected dose (Bq)} / \text{body weight (g)}} \tag{2}$$
where the tissue activity was decay-corrected to account for the time elapsed between injection and acquisition. Then, SUV images were interpolated to the same resolution as the CT images for registration/fusion purposes, and SUV images and CT images were exported as NIfTI format files.
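As a worked illustration of Eq. 2, the sketch below computes a body-weight SUV with decay correction. It is an assumption-laden example, not the scanner's or the authors' procedure: the function name and arguments are hypothetical, and the correction is applied by decaying the injected dose to scan time, which is arithmetically equivalent to correcting the tissue activity back to injection time.

```python
import math

F18_HALF_LIFE_MIN = 109.77  # physical half-life of fluorine-18, in minutes


def suv_bw(tissue_bq_per_ml, injected_bq, weight_g, minutes_since_injection,
           half_life_min=F18_HALF_LIFE_MIN):
    """Body-weight SUV (g/mL) following Eq. 2.

    The injected dose is decayed to the acquisition time (illustrative choice).
    """
    decay = math.exp(-math.log(2) * minutes_since_injection / half_life_min)
    dose_at_scan = injected_bq * decay
    return tissue_bq_per_ml / (dose_at_scan / weight_g)


# Hypothetical numbers: 370 MBq injected, 70 kg patient, scan at 60 min.
example_suv = suv_bw(5000.0, 370e6, 70000.0, 60.0)
```

Without the decay term the SUV would be underestimated, since the dose remaining at scan time is smaller than the injected dose.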
Table 1 (continued)
NPC stage               Number of patients
  Stages I–II           22
  Stages III–IV         47
NPC nasopharyngeal carcinoma, which was confirmed by histopathology as nonkeratinizing undifferentiated carcinoma; CN chronic nasopharyngitis

Image preprocessing
For each patient, NIfTI format SUV images were fused to CT images in ITK-SNAP software, and horizontal, coronal and sagittal views were displayed for visualization using a soft tissue window [40, 300]. Delineation of primary tumours was performed independently by two expert physicians, who were blinded to the histological results. The volume of interest (VOI) was generated based on the consensus reached by the two expert physicians, resulting in metabolically active tumour volumes (MATV) with a distribution of 7.51±9.81 cm³ [range: 3.39–48.8 cm³]. Five conventional metrics were then extracted from each segmented tumour: SUVmax, SUVmean, SUVpeak, MATV and total lesion glycolysis (TLG) (Online Supplemental Appendix A). The SUVs of each VOI were then discretized with a constant bin size B = 0.1, as follows [35]:
$$\mathrm{SUV}_{Dis}(x) = \left\lfloor \frac{\mathrm{SUV}(x)}{B} \right\rfloor - \min\left( \left\lfloor \frac{\mathrm{SUV}(x)}{B} \right\rfloor \right) + 1 \tag{3}$$
where SUV(x) is the SUV of voxel x and SUV_Dis(x) is the resampled value of voxel x. The discretization step is necessary to generate the matrices, whose size (defined by the maximum SUV_Dis(x)) strongly affects computation. It also reduces image noise and yields a constant intensity resolution, so that textural features from different patients are comparable.

Matrix construction
The GLCM, GLRLM, GLSZM and NGTDM were all constructed from each three-dimensional tumour. Three parameters were investigated for the construction of GLCM:
1) Symmetry: The asymmetrical GLCM ('A') counts the voxel pairs (i, j) along a given direction θ only (as shown in Fig. 1a), whereas the symmetrical GLCM ('S') also counts the voxel pairs in the reverse direction (−θ).
2) Averaging strategy: (i) A given GLCM feature value was obtained by averaging 13 individual feature values, each derived from one specific direction (referred to as '13'). (ii) Alternatively, the GLCM feature value was derived by considering all 13 directions simultaneously (noted as '1'), thus arriving at a single matrix and requiring no subsequent averaging step. Thus, four strategies (1S, 1A, 13S and 13A) were investigated (Fig. 1a).
3) Distance: One to ten voxel distances (Dx, x = 1, 2, 3, …, 10) were investigated (Fig. 1b).
GLRLM was acquired by considering the same averaging strategies as GLCM (noted as M13 and M1). Three different neighbourhoods (Fig. 1c) were used (i.e. Nx, x = 6, 18 and 26 voxels) in GLSZM. Five different window sizes (i.e. Wx, x = 3, 5, 7, 9 and 11 voxels) were considered in NGTDM (Fig. 1d).

Fig. 1 Illustration of 18F-FDG PET/CT imaging of patients with nasopharyngeal carcinoma (NPC) or chronic nasopharyngitis (CN), and the corresponding segmented lesions. The parameter settings of (a) averaging strategy and symmetry (1S, 1A, 13S and 13A), and (b) distance (D1, D2, D3, …, D10) for the construction of the grey level co-occurrence matrix (GLCM). (c) The three different neighbourhoods (N6, N18 and N26) used in the generation of the grey level size zone matrix (GLSZM). (d) The window sizes (W3, W5, W7, W9 and W11) for the construction of the neighbourhood grey tone difference matrix (NGTDM). Robustness was analysed by the intra-class correlation coefficient (ICC) and diagnostic performance by the area under the ROC curve (AUC)

Feature extraction
Feature definitions are listed in Online Supplemental Appendix A. Briefly, after the construction of matrices with different parameters, 26, 13, 13 and five features were extracted from GLCM, GLRLM, GLSZM and NGTDM, respectively. The MATLAB code has been made publicly available at: https://github.com/WenbingLv/NPC-radiomics.

Statistical analysis
In order to analyse the robustness of feature values calculated with different parameters, the intra-class correlation coefficient (ICC) [35] was adopted:
$$\mathrm{ICC} = \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS} + \mathrm{WMS}} \tag{4}$$
where BMS and WMS were the between-subjects and within-subjects mean squares, obtained via Kruskal-Wallis one-way ANOVA. ICC ranges from 0 to 1; the higher the ICC, the more robust the feature, and an ICC of 1 indicates perfect robustness (i.e. identical feature values).
Logistic regression with leave-one-out cross validation (LOOCV) [50] was used to compute the classification probability of each feature with different parameter settings. Subsequently, to analyse the diagnostic performance of each feature, the receiver operating characteristic (ROC) curve was obtained by varying the threshold of classification probability, and the area under the ROC curve (AUC) was used to quantify the diagnostic performance, with AUC ranging from 0.5 (random guess) to 1 (perfect classification). The statistical significance of differences between AUCs was tested via DeLong's method [51] using the MedCalc software package (Ver. 9.5, MedCalc Software), and the level of significance was set at p < 0.05. A flowchart outlining the study is shown in Fig. 1.
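The pairwise ICC of Eq. 4 can be sketched as follows for one feature computed under two parameter settings across the same patients. This is an illustrative sketch only: it uses ordinary one-way ANOVA mean squares (an assumption; the text names a Kruskal-Wallis procedure whose details are not given here), and the function name is hypothetical.

```python
def pairwise_icc(a, b):
    """ICC of Eq. 4 for two measurements per subject.

    a[s] and b[s] are the values of one feature for subject s under two
    parameter settings. Mean squares follow ordinary one-way ANOVA with
    subjects as groups (assumption made for this sketch).
    Raises ZeroDivisionError if every value is identical (BMS + WMS == 0).
    """
    n = len(a)
    subject_means = [(x + y) / 2 for x, y in zip(a, b)]
    grand_mean = sum(subject_means) / n
    # Between-subjects mean square: k = 2 measurements per subject, n - 1 d.o.f.
    bms = 2 * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subjects mean square: n * (k - 1) = n d.o.f.
    wms = sum((x - m) ** 2 + (y - m) ** 2
              for x, y, m in zip(a, b, subject_means)) / n
    return (bms - wms) / (bms + wms)


setting_1 = [1.0, 2.0, 3.0, 4.0]
setting_2 = [x + 10.0 for x in setting_1]  # same ordering, shifted scale
icc_identical = pairwise_icc(setting_1, setting_1)
icc_shifted = pairwise_icc(setting_1, setting_2)
```

In this toy case a constant shift between settings leaves the patient ordering untouched but drives the ICC far below 1, because the ICC measures agreement in absolute values. With this estimator the value can even fall below 0 when WMS exceeds BMS.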
Fig. 2 (a) Pairwise intra-class correlation coefficient (ICC) for each grey level co-occurrence matrix (GLCM) feature extracted from four combination strategies (1S, 1A, 13S and 13A), noted as ICC_1S-13S, ICC_1A-13A, ICC_1S-1A and ICC_13S-13A. (b) Areas under the ROC curve (AUCs) of 26 GLCM-based features extracted from four combination strategies (1S, 1A, 13S and 13A), noted as AUC_1S-1A-13S-13A. Symbols and whiskers represent the median and range, respectively. (Red indicates AUC ≥ 0.8; grey indicates AUC < 0.8 but with small variability across parameters; blue indicates AUC with large variability across parameters. Colours are used for clarity)
Results
Robustness and diagnostic performance of GLCM features for different strategies
Fifteen features showed a narrow range of AUC values among the four combination strategies (1S, 1A, 13S and 13A), and eight features depicted ICC values of nearly 1, as shown in Fig. 2b. Four features (IDMN, MaxPossibility, SumEntropy and Entropy) showed a poor range of pairwise ICCs of 0.02–0.98, while depicting good AUC values of 0.82–0.91. Thus, the effect of combination strategies on the diagnostic performance of these 15 features was negligible despite the fact that some features were not robust (Fig. 2a).

Impact of distance on robustness and diagnostic performance of GLCM features
Multi-scale texture information can be captured by using different distances to extract GLCM features, as shown in Fig. 3a. Five features (namely IDN, IDMN, DiffVar, Dissimilarity and Contrast_GLCM) showed ICCs of 0.82–0.99 while corresponding AUCs were 0.52–0.91, and nine features showed AUCs of 0.79–0.90 while corresponding ICCs were 0.01–0.99 (Fig. 3b). This means that robust features do not necessarily lead to good diagnostic performance, and non-robust features do not necessarily result in poor diagnostic performance.

The robustness and diagnostic performance of GLRLM features
Six features showed both high AUC (0.81–0.89) and high ICC (0.85–0.99) (Fig. 4). As such, the absolute quantitative scale as well as the AUC performance were robust with respect to the M1 versus M13 strategies. Meanwhile, AUC varied widely amongst different features (from 0.51 to 0.89), indicating that the diagnostic performance of GLRLM features was mainly dependent on the definition of the features themselves.
The robustness and diagnostic performance of GLSZM features
Overall, seven features showed AUCs of 0.81–0.90 while having ICCs of 0.01–0.99, and three features showed variable ICC and wide-ranging AUC. Only two features (namely HGZE and LGZE) showed ICCs of nearly 1 and AUCs of nearly 0.9 (Fig. 5a, b). Thus, for most features, robustness was affected by the choice of neighbourhood while diagnostic performance was not.
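Why the neighbourhood definition (N6, N18, N26) changes the absolute GLSZM values can be seen on a toy volume. The sketch below is an illustrative flood-fill zone counter written under assumptions (nested-list volume indexed `[z][y][x]`, 0 meaning outside the VOI, hypothetical function names); it is not the authors' implementation.

```python
from itertools import product


def offsets(connectivity):
    """Neighbour offsets for 6-, 18- or 26-connectivity in 3D."""
    max_nonzero = {6: 1, 18: 2, 26: 3}[connectivity]
    return [d for d in product((-1, 0, 1), repeat=3)
            if 0 < sum(c != 0 for c in d) <= max_nonzero]


def zone_sizes(volume, connectivity):
    """Sizes of connected same-level zones, keyed by grey level."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    neighbours = offsets(connectivity)
    seen, zones = set(), {}
    for start in product(range(nz), range(ny), range(nx)):
        level = volume[start[0]][start[1]][start[2]]
        if level == 0 or start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first flood fill over voxels of the same level
            z, y, x = stack.pop()
            size += 1
            for dz, dy, dx in neighbours:
                n = (z + dz, y + dy, x + dx)
                if (0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                        and n not in seen
                        and volume[n[0]][n[1]][n[2]] == level):
                    seen.add(n)
                    stack.append(n)
        zones.setdefault(level, []).append(size)
    return zones


checkerboard = [[[1, 2], [2, 1]]]  # one slice, equal levels touch only diagonally
zones_n6 = zone_sizes(checkerboard, 6)
zones_n18 = zone_sizes(checkerboard, 18)
```

On this checkerboard slice, 6-connectivity yields four single-voxel zones, whereas 18-connectivity joins the diagonal voxels into two zones of size two. The GLSZM, which histograms zones by grey level and size, therefore differs between the two settings even though the image is unchanged.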
The robustness and diagnostic performance of NGTDM features
NGTDM features were not robust to different window sizes (Fig. 5c, d). Only two features (namely Complexity and Coarseness) showed AUCs of 0.81–0.85, while having ICCs of 0.19–0.89 for different window sizes.

Fig. 3 (a) Pairwise intra-class correlation coefficients (ICCs) of 26 grey level co-occurrence matrix (GLCM)-based features between one voxel distance (D1) and each of the other nine distances (D2~10), i.e. ICC_D1-D2~10. Orange indicates high ICC values across different distances (narrow whiskers). (b) Areas under the ROC curve (AUCs) of 26 GLCM-based features for different distances (Dx, x = 1, 2, 3, …, 10), i.e. AUC_Dx, as extracted from the 1S combination strategy. Symbols and whiskers represent the median and range, respectively
Comparison with conventional metrics
Table 2 lists the AUC, sensitivity and specificity of six radiomics features (SumAverage1, SumAverage2, SumSquVar, SumEntropy, SZLGE and LGZE) and five conventional metrics. For the differentiation between all NPC cases (stages I–IV) and CN cases, LGZE showed a higher AUC than SUVmax (0.93 vs. 0.90), though the difference was not statistically significant (see Online Supplemental Appendix B, Tables S1, S2, for an elaboration of these comparisons). At the same time, LGZE depicted a significantly higher AUC compared to MATV (0.93 vs. 0.86, p = 0.03). For the differentiation between a subset of NPC cases (stages I–II) and CN cases, both SumEntropy and SZLGE achieved AUCs of 0.91, which were significantly higher (p = 0.04) than that of SUVmax (AUC of 0.88), and even more significantly higher (p < 0.01) than that of MATV (AUC of 0.72). Figure 6 shows the ROC curves of several features from Table 2.

Fig. 4 (a) Intra-class correlation coefficient (ICC) of grey level run length matrix (GLRLM)-based features between strategies M1 and M13 (ICC_M1-M13). (b) Areas under the ROC curve (AUCs) of GLRLM-based features for strategies M1 and M13 (AUC_M1-M13). Symbols and whiskers represent the median and range, respectively
Fig. 5 (a) Pairwise intra-class correlation coefficients (ICCs) of grey level size zone matrix (GLSZM)-based features between N26 and N6 (ICC_N26-N6) and between N26 and N18 (ICC_N26-N18). (b) Areas under the ROC curve (AUCs) of GLSZM-based features for Nx (x = 6, 18 and 26), i.e. AUC_Nx. (c) Pairwise ICCs of neighbourhood grey tone difference matrix (NGTDM)-based features between W3 and W5~11 (ICC_W3-W5~11). (d) AUCs of NGTDM-based features for Wx (x = 3, 5, 7, 9, and 11), i.e. AUC_Wx. Symbols and whiskers represent the median and range, respectively

Table 2 Diagnostic performance comparison between conventional metrics and radiomics features
                        Stages I–IV               Stages I–II
                        AUC       Se      Sp      AUC        Se      Sp
Conventional metrics
  SUVmax                0.90      0.84    0.89    0.88^a     0.82    0.89
  SUVpeak               0.90      0.84    0.89    0.87       0.86    0.84
  SUVmean               0.88      0.77    0.94    0.87       0.77    0.94
  MATV                  0.86^b    0.81    0.89    0.72^c     0.73    0.84
  TLG                   0.88      0.84    0.94    0.82       0.77    0.94
Radiomics features
  SumAverage1           0.91      0.86    0.83    0.89       0.95    0.75
  SumAverage2           0.91      0.88    0.82    0.90       0.85    0.92
  SumSquVar             0.90      0.75    0.92    0.90       0.85    0.92
  SumEntropy            0.91      0.85    0.89    0.91^ac    0.86    0.89
  SZLGE                 0.90      0.88    0.83    0.91^ac    0.95    0.80
  LGZE                  0.93^b    0.95    0.85    0.88       0.91    0.87
Detailed comparisons of various metrics are shown in the Online Supplemental Appendix B
Se sensitivity, Sp specificity
Stages I–IV: All NPC cases (N = 69), from stage I to stage IV
Stages I–II: A subset of NPC cases (N = 22), namely those with stage I and stage II, were included
a, b, c: Matching superscript letters (one on a conventional metric, one on a radiomics feature) indicate a significant difference between the AUC of conventional SUVmax or MATV and the AUC of a radiomics feature
a p = 0.04
b p = 0.03
c p < 0.01

Fig. 6 ROC curves of radiomics features and conventional metrics for the differentiation of nasopharyngeal carcinoma (NPC) with (a) stages I–IV and (b) stages I–II from chronic nasopharyngitis (CN)

Discussion
The present study assessed the impact of the parameter settings used in the generation of radiomics features on their robustness and disease differentiation in nasopharyngeal PET/CT (NPC vs. CN). The results demonstrated that poor absolute-scale robustness of radiomics features did not necessarily translate into poor disease differentiation.
Previous studies on robustness and/or diagnostic performance of radiomics features have primarily focused on factors that were external to the very definition and generation of radiomics features (i.e. different image acquisition or processing [31–36, 46, 52–55]). Our present study focused on an internal factor, namely variations in parameter settings for the construction of radiomics feature matrices. We analysed the link between the robustness (pairwise ICC) and diagnostic performance (AUC) of features. Features that were robust to parameter variations commonly resulted in robust but not necessarily high diagnostic performance. This was observed repeatedly: e.g. eight GLCM-based features showed high robustness (ICCs nearly 1) to different parameters and had robust diagnostic performance, while the diagnostic accuracy ranged from 0.79 to 0.91. Nine GLRLM-based features showed high robustness (ICC > 0.8) to averaging strategies, while the AUC ranged from 0.51 to 0.89.
We also observed that the range of ICC values was not directly related to the range of AUCs. As shown in Fig. 5, features with high and narrow-range AUCs could have high (HGZE, LGZE) or low ICCs (LZHGE, LZLGE, GLN, Complexity and Coarseness). In fact, some features have high ICC values but low AUC
values, as in Fig. 2 (ClusterPro), Fig. 3 (IDN, IDMN, DiffVar, Dissimilarity, Contrast_GLCM) and Fig. 4 (SRE). Non-robustness of features (in the absolute quantitative sense) did not necessarily imply poor diagnostic performance. There are also some features that have low ICC but high AUC: see Fig. 2 (MaxPossibility, Entropy) and Fig. 3 (SumAverage1, SumAverage2, SumSquVar). This is because changes in absolute values of features (low ICC) do not necessarily alter their relative ordering significantly, thus the diagnostic performance can remain high. This was also in accordance with the observation of Hatt et al. [34], in a study of oesophageal carcinoma, that although tumour segmentation and partial volume correction (PVC) led to variations in feature quantitation, the variations did not necessarily affect the prediction of response to therapy. Overall, the clinical usefulness of features varied widely with the definition of features themselves, and was partially affected by different parameter settings. This suggests that the choice of clinically useful features should be based on the specific task (i.e. diagnosis, prognosis, etc.). An important implication of this is that if a radiomics feature is not robust, this does not mean that the feature will necessarily perform poorly for diagnostic performance. In other words, robustness of radiomics features in the absolute-scale should not be over-emphasized, since radiomics features that are non-robust in absolute-scale (ICC) may still be robust in the AUC scale, and AUC performance is determined by the relative ordering of feature values. Conversely, radiomics features that are robust in the absolute-scale may still result in low AUC values. Nonetheless, the absolute-scale is also a factor of consideration, since multi-centre clinical trials and standards might require a fixed threshold. 
In other words, even if rankings (relative distribution of values) are preserved between different centres and scanners, one may still need consistent (absolute scale) values to make clinical decision-making recommendations. At the same time, it is plausible that future research might pave the way for convenient scanner/site-specific adjustment of recommended absolute thresholds.
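The argument that AUC depends only on the relative ordering of feature values can be checked with a small sketch. The example below computes AUC directly from raw feature values via the pairwise-comparison (Mann-Whitney) formulation, which is a simplification of the LOOCV logistic-regression pipeline used in the study. The feature values and the monotone transform are made up for illustration.

```python
def auc(positives, negatives):
    """Probability that a random positive exceeds a random negative (ties count 1/2)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


def rescale(x):
    """A strictly increasing transform standing in for a change of parameter setting."""
    return x ** 3 + 5.0


# Hypothetical feature values for NPC (positive) and CN (negative) lesions.
npc = [2.1, 3.4, 2.9, 4.0]
cn = [1.2, 2.5, 1.9]

auc_original = auc(npc, cn)
auc_rescaled = auc([rescale(v) for v in npc], [rescale(v) for v in cn])
```

Every value changes under `rescale`, so an absolute-agreement measure such as the ICC would drop, yet both AUCs are identical because no pair of lesions swaps order. A fixed decision threshold, by contrast, would have to be re-derived for the rescaled values, which is the multi-centre concern raised above.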
Limitations
Our study had some limitations. First, 47 of 69 NPC patients had stages III–IV, which made differentiation of NPC from CN relatively easy. As a result, the diagnostic performance of radiomics features was improved only to a limited degree compared to conventional metrics. In fact, false-positives (caused by CN) are commonly found in the diagnosis of NPC with stages I–II, because primary tumours with stages I–II have volumes similar to those of CN [56]. This was confirmed in our present study: conventional metrics (especially MATV) had a clearly poorer diagnostic performance (AUC = 0.72) than radiomics metrics (AUC = 0.88–0.91) for differentiation of NPC (stages I–II) from CN. Second, the discretization bin size was fixed at 0.1 SUV, but the maximum SUV varied among patients; thus, different patients may produce matrices of different sizes (though this may be a strength of this approach). Third, the present work does not assess reproducibility with respect to a number of reconstruction and processing steps prior to feature generation (e.g. inter-observer variability of radiomics features with respect to segmentation). Future studies should examine the effect of SUV discretization and processing steps, combined with varying parameter settings, on the robustness and clinical task performance of radiomics features in larger patient cohorts.
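The second limitation follows directly from fixed-bin-width discretization, as in Eq. 3. A minimal sketch is given below, assuming the floor-based form of the formula; the function name and the SUV values are hypothetical.

```python
import math


def discretize(suvs, bin_width=0.1):
    """Fixed-bin-width resampling: floor(SUV/B) - min(floor(SUV/B)) + 1.

    Note: values lying exactly on a bin edge can be shifted by one bin
    because of floating-point division; real code may need a tolerance.
    """
    bins = [math.floor(s / bin_width) for s in suvs]
    lowest = min(bins)
    return [b - lowest + 1 for b in bins]


# Two hypothetical lesions with different SUV ranges.
patient_a = discretize([2.55, 2.61, 3.08, 9.97])
patient_b = discretize([2.55, 4.95])
levels_a, levels_b = max(patient_a), max(patient_b)
```

With the same bin width, the lesion with the wider SUV range ends up with more grey levels and therefore a larger texture matrix, while the intensity resolution (0.1 SUV per level) stays the same for both patients.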
Conclusions
Robustness and diagnostic performance of most GLRLM-based features were seen to be relatively independent of feature matrix parameter settings. Importantly, we showed that while a number of features (GLCM, GLSZM and NGTDM based) may be variable (in absolute scale) with respect to parameter variations, they can still perform very robustly with respect to AUC (diagnostic performance). This has implications for approaches to pre-selection of radiomics features, in that absolute-scale robustness of radiomics features should not be overemphasized. For the specific diagnostic task of differentiating nasopharyngeal carcinoma from chronic nasopharyngitis, a number of radiomics features (e.g. SumEntropy, SZLGE and LGZE) outperformed conventional metrics.

Funding This work was supported by the National Natural Science Foundation of China under grants 61628105, 81501541, U1708261, 61471188, 81271641, the National key research and development program under grant 2016YFC0104003, the Natural Science Foundation of Guangdong Province under grants 2016A030313577, and the Program of Pearl River Young Talents of Science and Technology in Guangzhou under grant 201610010011.
Compliance with ethical standards
Guarantor The scientific guarantor of this publication is Dr. Lijun Lu.
Conflict of interest The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.
Statistics and biometry No complex statistical methods were necessary for this paper.
Informed consent Written informed consent was waived by the Institutional Review Board.
Ethical approval Institutional Review Board approval was obtained.
Methodology
• retrospective
• diagnostic study
• performed at one institution

References
1. Liu FY, Lin CY, Chang JT et al (2007) 18F-FDG PET can replace conventional work-up in primary M staging of nonkeratinizing nasopharyngeal carcinoma. J Nucl Med 48:1614–1619
2. O'Donnell HE, Plowman PN, Khaira MK, Alusi G (2008) PET scanning and Gamma Knife radiosurgery in the early diagnosis and salvage "cure" of locally recurrent nasopharyngeal carcinoma. Br J Radiol 81:e26–e30
3. Ng SH, Chan SC, Yen TC et al (2009) Staging of untreated nasopharyngeal carcinoma with PET/CT: comparison with conventional imaging work-up. Eur J Nucl Med Mol Imaging 36:12–22
4. Wu H, Wang Q, Wang M, Zhen X, Zhou W, Li H (2011) Preliminary study of 11C-choline PET/CT for T staging of locally advanced nasopharyngeal carcinoma: comparison with 18F-FDG PET/CT. J Nucl Med 52:341–346
5. King AD, Ma BB, Yau YY et al (2008) The impact of 18F-FDG PET/CT on assessment of nasopharyngeal carcinoma at diagnosis. Br J Radiol 81:291–298
6. Strauss LG (1996) Fluorine-18 deoxyglucose and false-positive results: a major problem in the diagnostics of oncological patients. Eur J Nucl Med 23:1409–1415
7. van Waarde A, Cobben DC, Suurmeijer AJ et al (2004) Selectivity of 18F-FLT and 18F-FDG for differentiating tumor from inflammation in a rodent model. J Nucl Med 45:695–700
8. Hustinx R, Smith RJ, Benard F et al (1999) Dual time point fluorine-18 fluorodeoxyglucose positron emission tomography: a potential method to differentiate malignancy from inflammation and normal tissue in the head and neck. Eur J Nucl Med 26:1345–1348
9. Wahl RL (2008) Principles and practice of PET and PET/CT. Lippincott Williams & Wilkins, Philadelphia
10. Gerlinger M, Rowan AJ, Horswell S et al (2012) Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. New Engl J Med 366:883–892
11. Parekh V, Jacobs MA (2016) Radiomics: a new application from established techniques. Expert Rev Precis Med Drug Dev 1:207–226
12. Lambin P, Rios-Velazquez E, Leijenaar R et al (2012) Radiomics: Extracting more information from medical images using advanced feature analysis. Eur J Cancer 48:441–446
13. Aerts HJ, Velazquez ER, Leijenaar RT et al (2014) Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 5:4006
14. Mu W, Chen Z, Liang Y et al (2015) Staging of cervical cancer based on tumor heterogeneity characterized by texture features on 18F-FDG PET images. Phys Med Biol 60:5123–5139
15. Yip SS, Coroller TP, Sanford NN, Mamon H, Aerts HJ, Berbeco RI (2016) Relationship between the temporal changes in positron-emission-tomography-imaging-based textural features and pathologic response and survival in esophageal cancer patients. Front Oncol 6:72
16. Coroller TP, Agrawal V, Narayan V et al (2016) Radiomic phenotype features predict pathological response in non-small cell lung cancer. Radiother Oncol 119:480–486
17. Wu W, Parmar C, Grossmann P et al (2016) Exploratory study to identify radiomics classifiers for lung cancer histology. Front Oncol 6:71
18. Soussan M, Orlhac F, Boubaya M et al (2014) Relationship between tumor heterogeneity measured on FDG-PET/CT and pathological prognostic factors in invasive breast cancer. PLoS One 9:e94017
19. Lovinfosse P, Janvary ZL, Coucke P et al (2016) FDG PET/CT texture analysis for predicting the outcome of lung cancer treated by stereotactic body radiation therapy. Eur J Nucl Med Mol Imaging 43:1453–1460
20. Tixier F, Hatt M, Valla C et al (2014) Visual versus quantitative assessment of intratumor 18F-FDG PET uptake heterogeneity: prognostic value in non-small cell lung cancer. J Nucl Med 55:1235–1241
21. El Naqa I, Grigsby P, Apte A et al (2009) Exploring feature-based approaches in PET images for predicting cancer treatment outcomes. Pattern Recognit 42:1162–1171
22. Win T, Miles KA, Janes SM et al (2013) Tumor heterogeneity and permeability as measured on the CT component of PET/CT predict survival in patients with non-small cell lung cancer. Clin Cancer Res 19:3591–3599
23. Cheng NM, Fang YH, Lee LY et al (2015) Zone-size nonuniformity of 18F-FDG PET regional textural features predicts survival in patients with oropharyngeal cancer. Eur J Nucl Med Mol Imaging 42:419–428
24. Tixier F, Groves AM, Goh V et al (2014) Correlation of intra-tumor 18F-FDG uptake heterogeneity indices with perfusion CT derived parameters in colorectal cancer. PLoS One 9:e99567
25. Soufi M, Kamali-Asl A, Geramifar P, Rahmim A (2017) A Novel Framework for Automated Segmentation and Labeling of Homogeneous Versus Heterogeneous Lung Tumors in [F-18]FDG-PET Imaging. Mol Imaging Biol 19:456–468
26. Vallieres M, Freeman CR, Skamene SR, El Naqa I (2015) A radiomics model from joint FDG-PET and MRI texture features for the prediction of lung metastases in soft-tissue sarcomas of the extremities. Phys Med Biol 60:5471–5496
27. Lambin P, Leijenaar R, Deist TM et al (2017) Radiomics: the bridge between medical imaging and personalized medicine. Nat Rev Clin Oncol 14:749–762
28. Aerts HJ (2016) The Potential of Radiomic-Based Phenotyping in Precision Medicine: A Review. JAMA Oncol 2:1636–1642
29. Gillies RJ, Kinahan PE, Hricak H (2016) Radiomics: Images Are More than Pictures, They Are Data. Radiology 278:563–577
30. Yip SS, Aerts HJ (2016) Applications and limitations of radiomics. Phys Med Biol 61:R150–R166
31. Galavis PE, Hollensen C, Jallow N, Paliwal B, Jeraj R (2010) Variability of textural features in FDG PET images due to different acquisition modes and reconstruction parameters. Acta Oncol 49:1012–1016
32. van Velden FHP, Kramer GM, Frings V et al (2016) Repeatability of radiomic features in non-small-cell lung cancer [18F]FDG-PET/CT studies: Impact of reconstruction and delineation. Mol Imaging Biol 18:788–795
33. Doumou G, Siddique M, Tsoumpas C, Goh V, Cook GJ (2015) The precision of textural analysis in 18F-FDG-PET scans of oesophageal cancer. Eur Radiol 25:2805–2812
34. Hatt M, Tixier F, Cheze LRC, Pradier O, Visvikis D (2013) Robustness of intratumour 18F-FDG PET uptake heterogeneity quantification for therapy response prediction in oesophageal carcinoma. Eur J Nucl Med Mol Imaging 40:1662–1671
35. Leijenaar RT, Nalbantov G, Carvalho S et al (2015) The effect of SUV discretization in quantitative FDG-PET Radiomics: the need for standardized methodology in tumor texture analysis. Sci Rep 5:11075
36. Lu L, Lv W, Jiang J et al (2016) Robustness of radiomic features in [11C]choline and [18F]FDG PET/CT imaging of nasopharyngeal carcinoma: impact of segmentation and discretization. Mol Imaging Biol 18:935–945
37. Hatt M, Tixier F, Pierce L, Kinahan PE, Le Rest CC, Visvikis D (2017) Characterization of PET/CT images using texture analysis: the past, the present... any future? Eur J Nucl Med Mol Imaging 44:151–165
38. Haralick RM, Shanmugam K, Dinstein I (1973) Textural features for image classification. IEEE Trans Syst Man Cyb. SMC-3:610–621
39. Soh L, Tsatsoulis C (1999) Texture Analysis of SAR Sea Ice Imagery Using Gray Level Co-Occurrence Matrices. IEEE T Geosci Remote 37:780–795
40. Metser U, Jhaveri KS, Murphy G, Halankar J (2015) Multiparameteric PET-MR assessment of response to neoadjuvant chemoradiotherapy in locally advanced rectal cancer: PET, MR, PET-MR and tumor texture analysis: A pilot study. Adv Mol Imaging 5:49–60
41. Roy A, Warbey V, Ferner R, O’Doherty M, Marsden P (2012) Feature based differentiation of benign, malignant and atypical neurofibroma in FDG-PET scans. J Nucl Med 53:2256
42. Rahmim A, Salimpour Y, Jain S et al (2016) Application of texture analysis to DAT SPECT imaging: Relationship to clinical assessments. Neuroimage Clin 12:e1–e9
43. Gelzinis A, Verikas A, Bacauskiene M (2007) Increasing the discrimination power of the co-occurrence matrix-based features. Pattern Recogn 40:2367–2372
44. Rahmim A, Salimpour Y, Blinder S, Klyuzhin I, Sossi V (2016) Optimized haralick texture quantification to track Parkinson’s disease progression from DAT SPECT images. J Nucl Med 57:428
45. Nanni L, Brahnam S, Ghidoni S, Menegatti E, Barrier T (2013) Different approaches for extracting information from the co-occurrence matrix. PLoS One 8:e83554
46. Hatt M, Majdoub M, Vallieres M et al (2015) 18F-FDG PET uptake characterization through texture analysis: investigating the complementary nature of heterogeneity and functional tumor volume in a multi-cancer site patient cohort. J Nucl Med 56:38–44
47. Yu H, Caldwell C, Mah K, Mozeg D (2009) Coregistered FDG PET/CT-based textural characterization of head and neck cancer for radiation treatment planning. IEEE Trans Med Imaging 28:374–383
48. Delbeke D, Coleman RE, Guiberteau MJ et al (2006) Procedure guideline for tumor imaging with 18F-FDG PET/CT 1.0. J Nucl Med 47:885–895
49. Jiang J, Wu H, Huang M et al (2015) Variability of Gross Tumor Volume in Nasopharyngeal Carcinoma Using 11C-Choline and 18F-FDG PET/CT. PLoS One 10:e131801
50. Stone M (1974) Cross-validatory choice and assessment of statistical predictions. J R Stat Soc. 36:111–147
51. DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44:837–845
52. Shiri I, Rahmim A, Ghaffarian P, Geramifar P, Abdollahi H, Bitarafan-Rajabi A (2017) The impact of image reconstruction settings on 18F-FDG PET radiomic features: multi-scanner phantom and patient studies. Eur Radiol 27:4498–4509
53. Bailly C, Bodet-Milin C, Couespel S et al (2016) Revisiting the robustness of PET-based textural features in the context of multicentric trials. PLoS One 11:e159984
54. Orlhac F, Boughdad S, Nioche C, Alberini JL, Soussan M, Buvat I (2017) An original approach to deal with multi-center variability of PET textural features. J Nucl Med 58:506
55. Lin C, Bradshaw T, Perk T, Harmon S, Liu G, Jeraj R (2015) Repeatability of [18F]-NaF PET imaging biomarkers for bone lesions: A multicenter study. Med Phys 42:3587
56. Busson P (2013) Nasopharyngeal carcinoma keys for translational medicine and biology. Landes Bioscience and Springer Science+Business Media, Austin