Environ Monit Assess (2010) 163:539–544 DOI 10.1007/s10661-009-0856-2
Water security assessment in Haihe River Basin using principal component analysis based on Kendall τ
Huiqun Ma · Ling Liu · Tao Chen
Received: 14 May 2008 / Accepted: 10 March 2009 / Published online: 9 April 2009 © Springer Science + Business Media B.V. 2009
Abstract
Kendall τ has a more reasonable theoretical background than Pearson correlation and can be applied more widely. Instead of using the widely adopted Pearson correlation or its extensions, as a large number of principal component analysis (PCA) applications do, we introduce Kendall τ into the PCA method. PCA is a well-known statistical data analysis algorithm that aims to extract features from high-dimensional data. It is designed to reduce the number of variables to a small number of indices while attempting to preserve the relationships present in the original data. This paper applies PCA based on Kendall τ to the water security assessment of the Haihe River Basin.

Keywords Water security · PCA · Kendall τ · Haihe River Basin
H. Ma (corresponding author) · L. Liu · T. Chen
State Key Laboratory of Hydrology–Water Resources and Hydraulic Engineering, Hohai University, Nanjing, 210098, China
e-mail: [email protected]

Introduction

The Haihe River Basin, which is the political, economic, and cultural center of China, is a very important part of North China. It contains two important cities, Beijing and Tianjin. However, it also faces a water resources crisis. Water shortage has resulted in serious ecosystem degradation, such as the drying up of river systems, the shrinking of lakes and wetlands, and a decrease of inflow into the Bohai Sea.

Principal component analysis (PCA) is a well-known statistical data analysis algorithm that aims to extract features from high-dimensional data (Jolliffe 1986). PCA is a powerful pattern-recognition technique that attempts to explain the variance of a large dataset of intercorrelated variables with a smaller set of independent variables (Hopke 1985). It is a multivariate statistical technique used to identify the important components or factors that explain most of the variance in a system. It is designed to reduce the number of variables to a small number of indices while attempting to preserve the relationships present in the original data.

PCA can be employed on a dataset to compare the compositional patterns between the examined water systems and to identify the factors that influence each one (Simeonov et al. 2003). Ouyang (2005) applied PCA and principal factor analysis techniques to evaluate the effectiveness of the surface water quality-monitoring network in a river, where the evaluated variables are monitoring stations. Uttam et al. (2008) used PCA to identify electrical conductivity. Bernard et al. (2004) considered PCA using coefficients of linear regression an appropriate tool for water quality evaluation and management and applied it to a tropical lake system.
Theoretical considerations

In mathematical terms, PCA involves five major steps: (1) standardize the measurements to ensure that they all have equal weight in the analysis; (2) calculate the covariance matrix; (3) find the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ and the corresponding eigenvectors $a_1, a_2, \ldots, a_p$; (4) discard any components that account for only a small proportion of the variation in the dataset; and (5) develop the factor loading matrix and perform a varimax rotation on it to infer the principal parameters (Ouyang et al. 2006).

Normalization of the original evaluating matrix

Suppose there are $m$ evaluating indicators and $n$ evaluating years. These form the original indicator value matrix $X = (x_{ij})_{m \times n}$ (Zou et al. 2006):

$$
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\cdots & \cdots & \cdots & \cdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}.
$$

This matrix is normalized to give

$$
R = (r_{ij})_{m \times n}
$$

where $r_{ij}$ is the normalized value of the $i$-th indicator in the $j$-th evaluating year and $r_{ij} \in [0, 1]$. For indicators for which a larger value is better:

$$
r_{ij} = \frac{x_{ij} - \min_j\{x_{ij}\}}{\max_j\{x_{ij}\} - \min_j\{x_{ij}\}}.
$$

For indicators for which a smaller value is better:

$$
r_{ij} = \frac{\max_j\{x_{ij}\} - x_{ij}}{\max_j\{x_{ij}\} - \min_j\{x_{ij}\}}.
$$

Kendall τ

Kendall τ is widely discussed in the statistics literature (Kendall and Gibbons 1990; Cliff 1996). Given a set of items $N = \{1, 2, \ldots, n\}$, a ranking $\tau$ with respect to $N$ is a permutation of all elements of $N$ that represents a user's preference on these items. For each $i \in N$, $\tau(i)$ denotes the position of the element $i$ in $\tau$, and for any two elements $i, j \in N$, $\tau(i) < \tau(j)$ implies that $i$ is ranked higher than $j$ by $\tau$. The Kendall τ distance between two rankings $\tau_1$, $\tau_2$ is defined as

$$
D(\tau_1, \tau_2) = \bigl|\{(i, j) : i < j,\ (\tau_1(i) < \tau_1(j) \wedge \tau_2(i) > \tau_2(j)) \vee (\tau_1(i) > \tau_1(j) \wedge \tau_2(i) < \tau_2(j))\}\bigr|.
$$

Usually, the Kendall τ distance is normalized by dividing it by the maximum possible distance. Similarly, we can denote by $C(\tau_1, \tau_2)$ the concordance between the two rankings, that is,

$$
C(\tau_1, \tau_2) = \bigl|\{(i, j) : i < j,\ (\tau_1(i) < \tau_1(j) \wedge \tau_2(i) < \tau_2(j)) \vee (\tau_1(i) > \tau_1(j) \wedge \tau_2(i) > \tau_2(j))\}\bigr|
$$

(Yao et al. 2006). In contrast to the Kendall τ distance, the Kendall τ correlation coefficient is calculated from the difference between the number of concordant pairs and the number of discordant pairs in the two rankings:

$$
K_a(\tau_1, \tau_2) = \frac{C(\tau_1, \tau_2) - D(\tau_1, \tau_2)}{n(n - 1)/2}.
$$

Instead of using the widely adopted Pearson correlation or its extensions, as a large number of PCA applications do, we use Kendall τ in the assessment of water security.

Assessment

PCA is designed to transform the original variables into new, uncorrelated variables (axes), called the principal components (PCs), which are linear combinations of the original variables. The new axes lie along the directions of maximum variance. PCA provides an objective way of finding indices of this type so that the variation in the data can be accounted for as concisely as possible (Sarbu and Pop 2005). The PCs provide information on the most meaningful parameters, which describe the whole dataset while affording data reduction with minimum loss of original information (Helena et al. 2000). The first $k$ PCs can be expressed as:

$$
\begin{cases}
F_1 = a_{11} R_1 + a_{21} R_2 + \cdots + a_{p1} R_p \\
F_2 = a_{12} R_1 + a_{22} R_2 + \cdots + a_{p2} R_p \\
\cdots \\
F_k = a_{1k} R_1 + a_{2k} R_2 + \cdots + a_{pk} R_p
\end{cases}
$$

where $F$ is the component score, $a$ is the component loading, $R$ is the measured value of the variable, $k$ is the component number, and $p$ is the total number of variables (Shrestha and Kazama 2007). Taking the contribution percentage of each eigenvalue, $\beta_i = \lambda_i / \sum_j \lambda_j$, $i = 1, 2, \ldots, k$, as the weight coefficient gives $F = \beta_1 F_1 + \beta_2 F_2 + \cdots + \beta_k F_k$. $F$ is the sample assessment value, which can be used to rank the samples (Yue et al. 2007).
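The normalization and Kendall τ steps described in this section can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the treatment of a constant indicator and of tied pairs is an assumption, and the "larger is better" and "smaller is better" directions chosen for C9 and C10 are assumed for the example because the paper does not list the direction of each index. The input values are taken from Table 1.

```python
from itertools import combinations


def min_max_normalize(values, larger_is_better=True):
    """Scale one indicator's values across the samples to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant indicator: the formulas are undefined when max equals
        # min, so returning zeros here is an assumption of this sketch.
        return [0.0] * len(values)
    if larger_is_better:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]


def kendall_tau_a(x, y):
    """Return (C - D) / (n(n-1)/2) for two equally long value lists."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) are counted in neither total.
    return (concordant - discordant) / (n * (n - 1) / 2)


def kendall_matrix(rows):
    """Pairwise Kendall tau between indicators (one row per indicator)."""
    return [[kendall_tau_a(a, b) for b in rows] for a in rows]


# Table 1 values of C9 and C10 for the four parts of the basin.
c9 = [598.131, 344.260, 244.091, 234.471]
c10 = [79.954, 93.688, 135.757, 81.183]

# Assumed directions (hypothetical): C9 larger is better, C10 smaller is better.
r9 = min_max_normalize(c9, larger_is_better=True)
r10 = min_max_normalize(c10, larger_is_better=False)

tau = kendall_matrix([r9, r10])
print(tau[0][1])
```

For these two indices, four of the six sample pairs are concordant and two are discordant, so the off-diagonal entry is (4 − 2)/6, about 0.33. The resulting matrix would then take the place of the Pearson correlation matrix in the eigen-decomposition step.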
Research areas

The Haihe River Basin, located between 35° and 43° N latitude and 112° and 120° E longitude, faces the Bohai Sea on the east and the Yellow River on the south; on the west it is adjacent to the Yunzhong and Taiyue Mountains, and on the north it is bounded by the Mongolian Plateau. It covers an area of 318,000 km², of which mountains and plateaus comprise 189,000 km² (60%) and plains 129,000 km² (40%). The main characteristics of the Haihe River Basin include limited water resources, unequal rainfall distribution over time and space, and frequent drought years; overall, it is considered an area of water resource shortage. It is one of the areas in China that lack water resources, and its water environment has deteriorated seriously in recent years. From 1956 to 1998, the total volume of water resources was 37.2 billion cubic meters, and the water resource per capita was 305 m³. This was only 1/7 of the average in China and 1/24 of the world average, and it was far below the international water shortage criterion of 1,000 m³ per capita. Meanwhile, the average water per hectare was 3,375 m³, 1/8 of the national average.

The Haihe River Basin is an important industrial and high-tech production base, holding an important strategic position in national economic and social development. In 2000, the total population in the area was 126 million, 10% of the nation's population. Its urban population was 36.47 million (according to recorded vital statistics), and the rate of urbanization reached 28.9%. The GDP within the area was 1,121.1 billion RMB Yuan, about 15% of the national GDP, with an average GDP per head of 8,890 RMB Yuan, which was above the national average. Water security in North China has become a major issue for sustainable development in China (Xia et al. 2006).

The Haihe River Basin can be divided into four parts: the Luan River and eastern Hebei coastal area, the northern area of Haihe River, the southern area of Haihe River, and the Tuhai and Majia River. This paper selects 25 indices of the Haihe River Basin and uses the PCA method to assess its water security. The connotation of water security is defined from an analysis of the relationship between environmental change and security issues, considering not only the state of water resources but also the related environmental, ecological, social, political, and economic factors. The chosen indices should therefore, on the one hand, characterize ecosystems, human society, the economy, and other aspects and, on the other hand, reflect the actual situation and the availability of data. The indices values are shown in Table 1.

To show the difference between Pearson's correlation coefficient and the Kendall τ coefficient, this paper plots the relationship between two of the indices, which is shown in Fig. 1. The relationship between C11 and C12 is nonlinear, yet the two indices are very strongly, indeed completely, correlated. Consequently, Pearson's correlation coefficient, which measures linear association, is not appropriate here. Kendall τ can be used in both linear and nonlinear situations, so this paper uses it to describe the relationships between the indices.

The eigenvalues are listed in Table 2. Four significant factors were extracted by PCA, which together explain 92% of the total variation. The first factor (PC1) accounts for 47%, the second (PC2) 25%, the third (PC3) 15%, and the fourth (PC4) 6% (Table 2).
Table 1 Indices values of water security in Haihe River Basin

| Indices | Luan River and eastern Hebei coastal area | Northern area of Haihe River | Southern area of Haihe River | Tuhai and Majia River |
| --- | --- | --- | --- | --- |
| Average GDP per head (10⁴ ¥/person) C1 | 9,808.164 | 13,411.459 | 8,081.756 | 7,208.274 |
| Growth rate of GDP (%) C2 | 9.500 | 10.501 | 8.387 | 10.399 |
| Rate of urbanization (%) C3 | 33.697 | 50.687 | 32.802 | 26.436 |
| Population density (person/km²) C4 | 193.810 | 310.466 | 490.789 | 508.280 |
| Engel coefficient (%) C5 | 0.441 | 0.389 | 0.406 | 0.404 |
| Land area per head (ha/person) C6 | 0.093 | 0.096 | 0.085 | 0.111 |
| Water quantity of farmland irrigation (m³/ha) C7 | 5,808.000 | 4,288.000 | 3,760.000 | 3,872.000 |
| Drought index (/) C8 | 2.300 | 2.600 | 2.000 | 1.500 |
| Water resource per capita (m³/person) C9 | 598.131 | 344.260 | 244.091 | 234.471 |
| Rate of groundwater exploitation (%) C10 | 79.954 | 93.688 | 135.757 | 81.183 |
| Utilization ratio of water resources (%) C11 | 61.614 | 94.815 | 119.211 | 168.193 |
| Non-point source pollutants per ha (t/km²) C12 | 21.729 | 22.697 | 40.681 | 55.943 |
| Point source pollutants to the river per hectare (t/km²) C13 | 5.410 | 6.704 | 8.584 | 11.462 |
| Waste water discharge per 10,000 Yuan industry gross production (t) C14 | 27.140 | 19.170 | 19.990 | 32.200 |
| Well water quality rate of surface water supply (%) C15 | 19.330 | 17.660 | 37.890 | 13.100 |
| Well water quality rate of ground water supply (%) C16 | 28.490 | 30.780 | 21.690 | 22.760 |
| River comparative connectivity indices (/) C17 | 0.250 | 0.236 | 0.122 | 0.000 |
| Minimum ecological water demand assurance rate (%) C18 | 3.993 | 1.936 | 2.332 | 4.111 |
| Ratio of waste water to runoff (%) C19 | 6.216 | 27.639 | 21.726 | 30.440 |
| Ratio of soil loss area (%) C20 | 41.053 | 33.017 | 36.204 | 35.027 |
| Atrophia ratio of everglade (%) C21 | 0.000 | 88.972 | 75.061 | 100.000 |
| Ratio of dried-up riverway (%) C22 | 41.367 | 76.104 | 71.701 | 23.194 |
| Ratio of nature reserve area (%) C23 | 3.947 | 6.344 | 4.323 | 6.460 |
| Proportion of environmental protection devotion in GDP (%) C24 | 1.681 | 2.792 | 0.908 | 1.454 |
| Rate of treated sewage (%) C25 | 26.237 | 39.348 | 15.723 | 0.000 |

The values came from the Haihe River Basin planning report.

Fig. 1 Relationship between nonpoint source pollutants per hectare (C12) and utilization ratio of water resources (C11)
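The relationship between C11 and C12 in Fig. 1 can be checked directly from the Table 1 values. The sketch below, with hypothetical helper names and only the Python standard library, computes Pearson's coefficient and compares the rank orders of the two indices; it is a rough illustration rather than the procedure used in the paper. When the two rank orders are identical, every pair of samples is concordant, so the Kendall τ coefficient equals 1.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank positions (1 = smallest); this sketch assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


# Table 1 values for the four parts of the basin.
c11 = [61.614, 94.815, 119.211, 168.193]  # utilization ratio of water resources
c12 = [21.729, 22.697, 40.681, 55.943]    # non-point source pollutants

r = pearson(c11, c12)
same_order = ranks(c11) == ranks(c12)
print(f"Pearson r = {r:.3f}, identical rank order: {same_order}")
```

With these four samples the rank orders coincide while Pearson's coefficient stays below 1 (roughly 0.96), which is the gap between linear and rank-based association that the text describes.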
Table 2 Eigenvalues of the correlation matrix

| Component | Eigenvalue | Variance contribution ratio | Cumulative variance contribution ratio |
| --- | --- | --- | --- |
| 1 | 11.64 | 0.47 | 0.47 |
| 2 | 6.23 | 0.25 | 0.71 |
| 3 | 3.72 | 0.15 | 0.86 |
| 4 | 1.41 | 0.06 | 0.92 |
| 5 | 1.18 | 0.05 | 0.97 |
| 6 | 0.82 | 0.03 | 1.00 |
| 7 | 0.00 | 0.00 | 1.00 |
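The ratio columns of Table 2 follow from the eigenvalues alone: each variance contribution ratio is an eigenvalue divided by the sum of all eigenvalues, and the cumulative ratio is the running total. The short sketch below reproduces that arithmetic. The 0.90 retention threshold is an illustrative assumption, since the paper does not state the criterion used to keep four components.

```python
from itertools import accumulate

# Eigenvalues as listed in Table 2.
eigenvalues = [11.64, 6.23, 3.72, 1.41, 1.18, 0.82, 0.00]

total = sum(eigenvalues)
ratios = [lam / total for lam in eigenvalues]
cumulative = list(accumulate(ratios))

# Assumed retention rule (not from the paper): keep the smallest number
# of components whose cumulative ratio reaches the threshold.
threshold = 0.90
n_components = next(
    k for k, c in enumerate(cumulative, start=1) if c >= threshold
)

for k, (lam, ratio, cum) in enumerate(zip(eigenvalues, ratios, cumulative), start=1):
    print(f"{k}: eigenvalue={lam:5.2f}  ratio={ratio:.2f}  cumulative={cum:.2f}")
print("components retained:", n_components)
```

The eigenvalues sum to 25, the number of indices, as expected for a correlation-type matrix with unit diagonal, and under the assumed threshold four components are retained.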
The varimax rotated factor matrix is shown in Table 3. From the eigenvectors obtained in the PCA, the first component (PC1), $F_1$, can be given as:

$$
\begin{aligned}
F_1 = {} & 0.24x_1 + 0.01x_2 + 0.24x_3 + 0.28x_4 - 0.08x_5 - 0.11x_6 - 0.18x_7 - 0.24x_8 + 0.28x_9 \\
& + 0.06x_{10} + 0.28x_{11} + 0.28x_{12} + 0.28x_{13} + 0.15x_{14} + 0.11x_{15} + 0.14x_{16} - 0.28x_{17} \\
& - 0.15x_{18} + 0.20x_{19} - 0.08x_{20} + 0.20x_{21} - 0.15x_{22} - 0.20x_{23} + 0.14x_{24} + 0.24x_{25}
\end{aligned}
$$

where $x_i$ is the normalized value of index C$i$ and the coefficients are the elements of the first eigenvector. Similarly, $F_2$, $F_3$, and $F_4$ can be expressed in terms of $x$. The assessment values of the four parts are then 0.6971 for the Luan River and eastern Hebei coastal area, 1.3129 for the northern area of Haihe River, 0.3571 for the southern area of Haihe River, and −0.5881 for the Tuhai and Majia River. The water security condition is therefore best in the northern area of Haihe River, followed by the Luan River and eastern Hebei coastal area and the southern area of Haihe River, and worst in the Tuhai and Majia River part.

Table 3 Eigenvectors

| Indices | Eigenvector 1 | Eigenvector 2 | Eigenvector 3 | Eigenvector 4 |
| --- | --- | --- | --- | --- |
| C1 | 0.24 | 0.18 | 0.05 | −0.27 |
| C2 | 0.01 | 0.31 | −0.21 | 0.01 |
| C3 | 0.24 | 0.18 | 0.05 | −0.27 |
| C4 | 0.28 | −0.05 | −0.08 | 0.13 |
| C5 | −0.08 | 0.32 | 0.10 | 0.35 |
| C6 | −0.11 | 0.24 | −0.31 | −0.25 |
| C7 | −0.18 | −0.04 | 0.30 | −0.40 |
| C8 | −0.24 | −0.18 | −0.05 | 0.27 |
| C9 | 0.28 | −0.05 | −0.08 | 0.13 |
| C10 | 0.06 | −0.03 | −0.41 | 0.14 |
| C11 | 0.28 | −0.05 | −0.08 | 0.13 |
| C12 | 0.28 | −0.05 | −0.08 | 0.13 |
| C13 | 0.28 | −0.05 | −0.08 | 0.13 |
| C14 | 0.15 | 0.19 | 0.35 | 0.07 |
| C15 | 0.11 | −0.24 | 0.31 | 0.25 |
| C16 | 0.14 | 0.27 | −0.18 | 0.00 |
| C17 | −0.28 | 0.05 | 0.08 | −0.13 |
| C18 | −0.15 | −0.19 | −0.35 | −0.07 |
| C19 | 0.20 | −0.25 | 0.01 | −0.10 |
| C20 | −0.08 | 0.32 | 0.10 | 0.35 |
| C21 | 0.20 | −0.25 | 0.01 | −0.10 |
| C22 | −0.15 | −0.19 | −0.35 | −0.07 |
| C23 | −0.20 | 0.25 | −0.01 | 0.10 |
| C24 | 0.14 | 0.27 | −0.18 | 0.00 |
| C25 | 0.24 | 0.18 | 0.05 | −0.27 |

Conclusions

Kendall τ has a more reasonable theoretical background than Pearson correlation. It can be applied more widely because the Kendall τ coefficient is a nonparametric statistic used to measure the degree of correspondence when both variables are measured at the ordinal level, whereas the Pearson correlation coefficient is a parametric statistic and, when distributions are not normal, may be less useful than nonparametric correlation methods. This paper introduces Kendall τ into the PCA method, applies it to the Haihe River Basin, and obtains a comparatively good result. According to the assessment values, the government can identify the main water security problems in the Tuhai and Majia River part and take measures to resolve them.

Acknowledgements The authors sincerely thank Su Xiaohui of FINCM Co., Ltd. and Li Yan of Heze College for their advice on the text revision.
References

Bernard, P., Antoine, L., & Bernard, L. (2004). Principal component analysis: An appropriate tool for water quality evaluation and management—application to a tropical lake system. Ecological Modelling, 178, 295–311. doi:10.1016/j.ecolmodel.2004.03.007.
Cliff, N. (1996). Ordinal methods for behavioral data analysis. New Jersey: Erlbaum.
Helena, B., Pardo, R., Vega, M., et al. (2000). Temporal evolution of groundwater composition in an alluvial aquifer (Pisuerga River, Spain) by principal component analysis. Water Research, 34, 807–816.
Hopke, P. K. (1985). Receptor modeling in environmental chemistry. USA: Wiley.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer.
Kendall, M., & Gibbons, J. D. (1990). Rank correlation methods (5th ed.). Oxford: Oxford University Press.
Ouyang, Y. (2005). Evaluation of river water quality monitoring stations by principal component analysis. Water Research, 39, 2621–2635. doi:10.1016/j.watres.2005.04.024.
Ouyang, Y., Nkedi-Kizza, P., Wu, Q. T., et al. (2006). Assessment of seasonal variations in surface water quality. Water Research, 40, 3800–3810. doi:10.1016/j.watres.2006.08.030.
Sarbu, C., & Pop, H. F. (2005). Principal component analysis versus fuzzy principal component analysis. A case study: The quality of Danube water (1985–1996). Talanta, 65, 1215–1220.
Shrestha, S., & Kazama, F. (2007). Assessment of surface water quality using multivariate statistical techniques: A case study of the Fuji river basin, Japan. Environmental Modelling & Software, 22, 464–475.
Simeonov, V., Stratis, J. A., Samara, C., et al. (2003). Assessment of the surface water quality in Northern Greece. Water Research, 37, 4119–4124. doi:10.1016/S0043-1354(03)00398-1.
Uttam, K. M., Warrington, D. N., Bhardwaj, A. K., et al. (2008). Evaluating impact of irrigation water quality on a calcareous clay soil using principal component analysis. Geoderma, 144, 189–197. doi:10.1016/j.geoderma.2007.11.014.
Xia, J., Feng, H. L., Zhan, C. S., et al. (2006). Determination of a reasonable percentage for ecological water-use in the Haihe River Basin, China. Soil Science Society of China, 16, 33–42.
Yao, Y., Zhu, S. F., & Chen, X. M. (2006). Collaborative filtering algorithms based on Kendall correlation in recommender systems. Wuhan University Journal of Natural Sciences, 11, 1086–1090.
Yue, T. L., Peng, B. Z., Yuan, Y. L., et al. (2007). Modeling of aroma quality evaluation of cider based on principal component analysis. Transactions of the CSAE, 23, 223–227 (in Chinese with English abstract).
Zou, Z. H., Yun, Y., & Sun, J. N. (2006). Entropy method for determination of weight of evaluating indicators in fuzzy synthetic evaluation for water quality assessment. Journal of Environmental Sciences, 18, 1020–1023. doi:10.1016/S1001-0742(06)60032-6.