Computational Management Science
https://doi.org/10.1007/s10287-018-0325-x
ORIGINAL PAPER

Big data analytics: an aid to detection of non-technical losses in power utilities

Giovanni Micheli1 · Emiliano Soda2 · Maria Teresa Vespucci1 · Marco Gobbi2 · Alessandro Bertani2

Received: 31 October 2017 / Accepted: 12 June 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018
Abstract
The great amount of data collected by the Advanced Metering Infrastructure can help electric utilities to detect energy theft, a phenomenon that costs over $25 billion per year worldwide. To address this challenge, this paper describes a new approach to non-technical loss analysis in power utilities, based on a variant of P2P computing that allows identifying frauds even in the absence of total reachability of smart meters. Specifically, the proposed approach compares the data recorded by the smart meters and by the collector in the same neighborhood area and detects fraudulent customers through the application of a Multiple Linear Regression model. Using real utility data, the regression model has been compared with other data mining techniques, such as SVM, neural networks and logistic regression, in order to validate the proposed approach. The empirical results show that the Multiple Linear Regression model can efficiently identify energy thieves even in areas with meter-reachability problems.

Keywords Energy theft detection · Meters reachability · Multiple linear regression · Data mining
Giovanni Micheli (corresponding author): [email protected]
Emiliano Soda: [email protected]
Maria Teresa Vespucci: [email protected]
Marco Gobbi: [email protected]
Alessandro Bertani: [email protected]

1 Department of Management, Information and Production Engineering, University of Bergamo, Bergamo, Italy
2 CESI, Milan, Italy
1 Introduction

In recent years, through the development of information systems and communication technology, many countries have been modernizing their aging power systems into smart grids. Smart grids have emerged as a new approach to efficiently deliver reliable, economic and sustainable electricity services. The adoption of this new technology allows: (1) better aligning the supply of energy with demand, (2) increasing customers' control over household consumption, (3) obtaining a better integration of renewable energy sources (Alejandro et al. 2014). Smart grids are based on a new electrical infrastructure, called AMI (Advanced Metering Infrastructure), a hierarchical structure consisting of different networks communicating with each other. These networks can be described as follows (Jang et al. 2014):

• Home Area Network, a local network of household appliances and meters;
• Neighborhood Area Network, a network of meters located in the same zone and connected to a digital device, called collector. While meters measure users' individual energy consumption, the collector records the total energy consumed by the serviced area;
• Wide Area Network, a network of collectors connected to the utility control center.

In the AMI the old mechanical meters have been replaced by smart meters, i.e. digital devices that enable two-way communications between utilities and energy customers. The main characteristics of these devices are (Alahakoon and Yu 2013):

• real-time or near real-time capture of consumption;
• remote and local reading;
• remote controllability;
• possibility of linking to other commodity supplies;
• recording of measurements every 15 min, leading to a 3000-fold increase in the total amount of available readings (IBM 2012).
These digital devices allow collecting additional data that can provide great advantages to electrical utilities. More specifically, one of the biggest opportunities for electrical companies is the detection of the so-called non-technical losses, i.e. the energy losses due to customers' fraudulent behaviors. Energy theft is a very serious concern in traditional power systems: it is reported that electrical utilities around the world lose over $25 billion every year due to this problem (Jang et al. 2014). Moreover, energy theft has become a critical issue in both developed and developing countries. In the United States alone utilities incur losses of over $6 billion each year (McDaniel and McLaughlin 2009), while the highest theft rates are recorded in developing countries. As shown in Fig. 1, India has one of the highest rates, with close to 30% of the power in the country lost to theft and estimated losses of $4.5 billion (Ministry of Power 2013). In the former Soviet Union almost 50% of electricity is acquired via theft (Fehrenbacher 2013). Brazil has a theft rate of 15%, with estimated losses of $5 billion (Federal Court of Audit 2007). In order to assist electrical utilities in the identification of energy frauds, in this paper we introduce a new approach to non-technical loss detection, which has been tested on a real utility database. Specifically, in Sect. 2 we review the techniques
Fig. 1 Electricity theft rates worldwide. (Reproduced with permission from Fehrenbacher 2013)
proposed in the literature to identify fraudulent customers. The model we propose for energy theft detection, i.e. a Multiple Linear Regression model based on a variant of the P2P computing approach, is introduced in Sect. 3. In Sect. 4 we present the results of the numerical tests, as well as the data set used and the evaluation criteria. Conclusions are drawn in Sect. 5.
2 Current methods for energy theft detection

In recent years several data mining techniques have been applied to the identification of energy frauds, including statistical methods, Decision Trees, Support Vector Machines, Artificial Neural Networks, clustering methods and ARMA models. Support Vector Machines (SVM) are widely used in the literature (Nagi et al. 2008, 2011; Depuru et al. 2011, 2012, 2013) for the classification of customers' load profiles. For example, Nagi et al. (2008) combine Support Vector Machines with a Genetic Algorithm to detect more accurately users' abnormal behaviors, which are highly correlated with energy frauds. Depuru et al. (2011) present the approximate energy consumption patterns of several customers involved in theft: using historical data, a database with both honest and fraudulent customers' energy consumption is used to train the SVM classifier and to identify suspicious users. Nagi et al. (2011) use an SVM-based model combined with a Fuzzy Inference System, in the form of fuzzy IF–THEN rules, in order to improve the detection rate by including human knowledge and expertise in the model. Depuru et al. (2012) implement a data encoding technique to reduce the complexity of the instantaneous energy consumption data. After this pre-processing, the data are fed to an SVM model, which classifies users into
three groups: honest, illegal and suspicious customers. In order to reduce the processing time, Depuru et al. (2013) investigate the possibility of implementing some High Performance Computing algorithms for the detection of illegal consumers through Support Vector Machines. Besides the SVM model, other techniques applied in the literature include clustering, Artificial Neural Networks and ARMA models. For example, dos Angelos et al. (2011) propose a two-step computational technique for the classification of electricity consumption profiles: in the first step a C-means-based fuzzy clustering is performed, in order to aggregate users into groups of customers with similar consumption profiles; the second step performs a fuzzy classification using a fuzzy membership matrix and the Euclidean distance to the cluster centers. Then, through the normalization of the distance measures, a unitary index score is computed for each customer, where the highest scores are associated with potential thieves and users with irregular consumption patterns. Muniz et al. (2009) aim at improving the accuracy in the detection of abnormalities among low-tension consumers by creating an intelligent system based on Artificial Neural Networks. More specifically, the proposed technique consists of two neural network ensembles. First, a set of five neural networks is used for filtering the database, selecting the irregular and normal consumers that best characterize these two classes. The resulting filtered database is then employed for training the Classification Module, which is composed of a committee of five neural networks that, after the system has been fully trained, indicates whether a customer is normal or irregular. The Auto Regressive Moving Average (ARMA) models represent another detection technique proposed in the literature (Mashima and Cardenas 2012).
In this parametric approach, for each user, a first set of data is used to fit the best ARMA model, which is then employed to forecast the consumptions in the test set. By comparing the forecasts with the real consumptions against a threshold computed on historical data, the customers are classified as honest or fraudulent. All the above-mentioned methods are based on procedures consisting of the following phases: data acquisition, pre-processing, feature extraction, model training, classification of new instances, post-processing and suspicious customer list generation.

A different approach to non-technical loss analysis is represented by the Peer to Peer (P2P) computing method, an expression that denotes three different algorithms proposed by Salinas et al. (2012, 2013) to identify fraudulent customers that tamper with their meters so that only a fraction γ, 0 < γ < 1, of the real consumption is recorded. This approach is based on processing a data set collected over a given time horizon by n + 1 meters in a Neighborhood Area Network, where n is the cardinality of the set J of users located in that area: a meter located at each user j, 1 ≤ j ≤ n, measures the user's individual consumption, and a meter located at the collector measures the overall electricity supply to all customers in the area. The time horizon is discretized in T periods of equal length (usually 15 min): let c_t, 1 ≤ t ≤ T, denote the overall electricity supply in period t in the Neighborhood Area Network, recorded by the meter at the collector, and let u_{t,j} denote the electricity consumption of user j, 1 ≤ j ≤ n, in period t, recorded by the meter associated with that user. If there are no fraudulent users in the area, the measured consumption u_{t,j} equals the real consumption e_{t,j} for all j and t; therefore in all time periods the measured supply c_t equals the sum of the measured consumptions of all customers
$$\sum_{j=1}^{n} u_{t,j} = \sum_{j=1}^{n} e_{t,j} = c_t, \quad 1 \le t \le T.$$

If user j tampers with his own meter so that only the fraction γ_j, 0 < γ_j < 1, of the real consumption e_{t,j} is measured, then u_{t,j} = γ_j e_{t,j} for all t. Therefore in areas with fraudulent users it holds that

$$\sum_{j=1}^{n} u_{t,j} < \sum_{j=1}^{n} e_{t,j} = \sum_{j=1}^{n} \frac{1}{\gamma_j}\, u_{t,j} = c_t, \quad 1 \le t \le T.$$

Salinas et al. propose to detect the fraudulent users in the area by computing the values of the honesty coefficients $k_j = \frac{1}{\gamma_j}$ of users j, 1 ≤ j ≤ n, defined by the linear system

$$\begin{cases} \sum_{j=1}^{n} u_{1,j}\, k_j = c_1 \\ \qquad \vdots \\ \sum_{j=1}^{n} u_{n,j}\, k_j = c_n \end{cases} \tag{2.1}$$

The system (2.1) of n linear equations is obtained by selecting a subset I of n time periods out of the T ≫ n available and by considering, for the selected periods i ∈ I, the energy conservation equations $\sum_{j=1}^{n} e_{i,j} = c_i$, where $e_{i,j} = \frac{1}{\gamma_j} u_{i,j}$ and $k_j = \frac{1}{\gamma_j}$. For the linear system to have a unique solution, Salinas et al. suggest choosing n time periods with different values of c_i, 1 ≤ i ≤ n, arguing that equations with equal (or almost equal) right-hand sides are more likely to be linearly dependent. A user j with k_j = 1 is honest; a user j with k_j > 1 is an energy thief, as the consumption recorded by his meter is less than his real consumption.

For the main detection techniques proposed in the literature, Table 1 reports the performance indices provided by their authors. The detection rate, i.e. the ratio of detected fraudulent customers to the number of fraudulent customers in the area, is the most important index in non-technical loss identification and is reported by all authors. The false positive rate, i.e. the percentage of honest users classified as energy thieves by the model, is provided by only two references. For most of the detection schemes a rather low detection rate (approximately 60–70%) is reported. Only a few methods are claimed by their authors to identify more than 90% of energy frauds. In particular, a 100% detection rate is claimed for P2P computing: indeed, the honesty coefficients that solve system (2.1) allow identifying all illegal users that proportionally steal energy, i.e. that tamper with a smart meter to make it record a fraction of the real energy consumption. A further advantage of the P2P computing approach is that it identifies all occurring frauds by simply comparing meter readings, without invading consumers' privacy, whereas many detection techniques, to perform well, require collecting users' private information (therefore violating customers' privacy).
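The deterministic scheme (2.1) can be sketched numerically as follows. This is a minimal illustration on a small synthetic Neighborhood Area Network (all consumptions and tampering fractions are made up; NumPy is assumed), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 96                          # 5 users, 96 fifteen-minute periods (one day)

e = rng.uniform(0.1, 2.0, (T, n))     # synthetic real consumptions e_{t,j} (kWh)
gamma = np.array([1.0, 1.0, 0.5, 1.0, 0.8])   # users 2 and 4 tamper their meters
u = e * gamma                         # recorded consumptions u_{t,j} = gamma_j * e_{t,j}
c = e.sum(axis=1)                     # collector readings c_t

# Pick n periods with well-separated collector readings, as Salinas et al.
# suggest, and solve the square system (2.1): sum_j u_{i,j} k_j = c_i
idx = np.argsort(c)[np.linspace(0, T - 1, n).astype(int)]
k = np.linalg.solve(u[idx], c[idx])

print(np.round(k, 3))                 # k_j = 1/gamma_j: thieves have k_j > 1
```

Since here the data satisfy u_{t,j} = γ_j e_{t,j} exactly, the recovered coefficients are exactly k_j = 1/γ_j, flagging users 2 and 4.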
For all these reasons P2P computing may be considered the best fraud identification approach. However, in
Table 1 Data mining techniques: detection rate and false positives as reported in the literature (n.a.: not available)

| Technique | References | Detection rate (%) | False positives (%) |
|---|---|---|---|
| Genetic SVM | Nagi et al. (2008) | 62 | n.a. |
| SVM | Depuru et al. (2011) | 98.4 | n.a. |
| SVM and fuzzy inference system | Nagi et al. (2011) | 72 | 13.57 |
| Data encoding and SVM | Depuru et al. (2012) | 76–92 | n.a. |
| SVM and high performance computing | Depuru et al. (2013) | 92 | n.a. |
| Fuzzy clustering and classification | dos Angelos et al. (2011) | 74.5 | n.a. |
| Neural networks | Muniz et al. (2009) | 24.9–62 | n.a. |
| ARMA models | Mashima and Cardenas (2012) | 62 | 4.2 |
| P2P computing | Salinas et al. (2012, 2013) | 100 | n.a. |
the literature this approach has only been tested on simulated data, never on a real utility dataset. Indeed, as will be explained in detail in Sect. 3, the P2P computing method originally proposed by Salinas et al. presents some restrictions that prevent its implementation in a real context: in order to identify real energy frauds, the method must be modified, as shown in the following section.
3 The new proposed methodology

3.1 A multiple linear regression model

We propose a stochastic version of the method by Salinas et al., based on the Multiple Linear Regression model

$$\sum_{j=1}^{n} k_j u_{t,j} + \varepsilon_t = c_t, \quad 1 \le t \le T.$$

The vector $k = [k_j] \in \mathbb{R}^n$ of honesty coefficients is computed as the least squares solution of the over-determined linear system $Uk = c$, where $U = [u_{t,j}] \in \mathbb{R}^{T \times n}$ and $c = [c_t] \in \mathbb{R}^T$, i.e. as the vector k that minimizes the squared Euclidean norm of the residual vector $\varepsilon = [\varepsilon_t] \in \mathbb{R}^T$, defined as $\varepsilon = c - Uk$. The component ε_t of vector ε is the random error at time period t, assumed to have the following properties:

• $E(\varepsilon_t) = 0$;
• $Var(\varepsilon_t) = \sigma^2$;
• $Cov(\varepsilon_{t_1}, \varepsilon_{t_2}) = 0$ for $t_1 \ne t_2$.
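The least squares fit over all T periods can be sketched as follows; the data are synthetic (consumption distribution, noise level and the two thieving users are made up for illustration), with NumPy assumed:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 7392, 50                       # 77 days x 96 readings, 50 users (Sect. 4 sizes)
e = rng.gamma(2.0, 0.25, (T, n))      # synthetic real consumptions e_{t,j}
gamma = np.ones(n); gamma[[7, 31]] = 0.6    # two hypothetical proportional thieves
U = e * gamma                         # design matrix of meter readings
c = e.sum(axis=1) + rng.normal(0, 0.05, T)  # collector readings with small noise

# Least squares solution of the over-determined system U k = c,
# i.e. k_hat = (U'U)^{-1} U'c, using all T periods instead of only n
k_hat, *_ = np.linalg.lstsq(U, c, rcond=None)

print(np.flatnonzero(k_hat > 1.1))    # users whose estimated k_j exceeds 1
```

With T much larger than n, the estimates concentrate tightly around the true coefficients, so the two tampered meters (k_j = 1/0.6 ≈ 1.67) stand out clearly.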
While in Salinas et al. only n time periods, out of the T ≫ n available, are used to compute the honesty coefficients k_j, 1 ≤ j ≤ n, in our approach all available data are used, so that no information is neglected. Moreover, the vector $\hat{k} = (U^T U)^{-1} U^T c$, which solves the system of n linear equations $U^T U \hat{k} = U^T c$, is the best estimator of k.

The possibility of conducting statistical tests on the honesty coefficients is a main advantage of the proposed model over the technique described by Salinas et al., which only allows classifying users as either honest or fraudulent. The stochastic model, instead, allows associating a confidence level with the classification of any user: for any suspicious customer the t test of $H_0: k_j = 1$ against $H_1: k_j > 1$ can be performed. The statistic for testing H_0 against H_1 is

$$T_j = \frac{\hat{k}_j - 1}{\sqrt{\widehat{Var}(\hat{k}_j)}},$$

where $\hat{k}_j$ is the estimator of the honesty coefficient k_j and $\widehat{Var}(\hat{k}_j)$ is the estimate of the estimator's variance, computed as the (j, j)-th element of the matrix $\hat{\sigma}^2 (U^T U)^{-1}$, where

$$\hat{\sigma}^2 = \frac{1}{T - n} (c - U\hat{k})^T (c - U\hat{k})$$

is the estimate of the error variance. Under the null hypothesis T_j has a t-distribution with T − n degrees of freedom; therefore for any user j we can compute the p-value

$$p\text{-}value_j = P(t_{T-n} \ge T_j),$$

i.e. the probability for the t-distribution with T − n degrees of freedom to take values greater than or equal to the statistic value T_j. Successive inspections, aimed at reducing the number of false positives, may be focused on the users with the lowest p-values.

One condition for a proper use of the proposed approach is the absence of multicollinearity in the dataset, i.e. of near-linear relationships among the independent variables. High correlation of the predictor variables in a Multiple Linear Regression model can distort the results, inflating the standard errors and providing inaccurate estimates of the regression coefficients. Indeed, an exact linear relationship among the independent variables makes the matrix $U^T U$ singular, preventing the computation of the regression coefficients. When the relationship is not exact, the coefficients can still be computed, but their estimates may have very low accuracy, as the near-singularity distorts the results. In our application the presence of users with similar energy consumptions in the Neighborhood Area Network might be thought to
cause multicollinearity; however, the high frequency of measure collection hinders its occurrence. Indeed, similar users present analogies in total daily and weekly consumptions and show load profiles with peaks and valleys in similar parts of the day: these similarities can be caught through a daily, or at most hourly, discretization of the time horizon, but are not easily observable with a 15-min discretization, as meter readings are closely related to specific, individual users' habits and behaviors. For these reasons multicollinearity should not represent a problem in this application. These theoretical considerations have been empirically confirmed by applying to the dataset employed in the analysis the main multicollinearity detection schemes, such as the evaluation of the variance inflation factor (VIF) and the analysis of the eigen-structure of the correlation matrix $U^T U$. However, for datasets affected by multicollinearity the proposed method can be modified into a Ridge Regression model, which provides a more reliable estimate of the regression coefficients by adding a suitably chosen value λ to the diagonal elements of the correlation matrix, i.e.

$$\hat{k} = (U^T U + \lambda I)^{-1} U^T c.$$

We refer the reader to El-Dereny and Rashwan (2011) for a more complete analysis of multicollinearity and Ridge Regression models.

3.2 Not total reachability

A further case in which the equalities

$$\sum_{j=1}^{n} u_{t,j} = \sum_{j=1}^{n} e_{t,j} = c_t, \quad 1 \le t \le T \tag{3.1}$$
do not hold is when user j disconnects his own meter, so that information is no longer transmitted to the utility control center, causing a "reachability" problem. This type of fraud, although easier to detect, prevents the use of the method described in the previous section, since relations (3.1) no longer hold. In order to overcome the reachability problem we have introduced in the model the parameter α, representing the percentage of reachable meters in the considered time horizon. Parameter α can be computed as the ratio between the number of meters without transmission problems and the number of users located in the neighborhood area; the quantity 1 − α therefore represents the percentage of digital devices with difficulties in transmitting data to the utility control center. We have introduced the model

$$\sum_{j \in R} k_j u_{t,j} + \varepsilon_t = \mu c_t, \quad 1 \le t \le T,$$
where R ⊂ J denotes the subset, with cardinality |R| = α·n, of reachable meters. The model regressors are the readings of the reachable meters; the dependent variable is the fraction μ, 0 < μ < 1, of the collector reading: indeed, in this case the independent
variables cannot explain the total zonal energy consumption, since some consumptions cannot be recorded due to the reachability problems. When all customers in the neighborhood area have similar characteristics and n is large, the value μ = α may be chosen, as in this case αc_t is a good approximation of the sum of the reachable meters' real consumption at time t. Otherwise, μ can be defined by considering historical data, in order to determine, for each no longer reachable user, the weight of his consumption on the total zonal consumption. Let $\dot{U}$ denote the matrix obtained by considering only the columns of matrix U associated with the reachable meters: the j-th column of $\dot{U}$ contains the data recorded by the reachable meter j (1 ≤ j ≤ αn), while the t-th row contains the data collected by all reachable meters at time t (1 ≤ t ≤ T). In matrix notation, the model is formulated as

$$\dot{U} k + \varepsilon = \mu c.$$
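A sketch of this reachability variant with μ = α follows; the data and parameter values are invented for illustration (the last (1 − α)·n meters are assumed unreachable, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, alpha = 3000, 40, 0.8          # 80% of the meters transmit correctly
e = rng.gamma(2.0, 0.25, (T, n))     # synthetic real consumptions
gamma = np.ones(n); gamma[5] = 0.5   # hypothetical thief: user 5 halves his readings
U = e * gamma
c = e.sum(axis=1)                    # collector readings

# Suppose the last (1 - alpha)*n meters stopped transmitting: keep only the
# columns of the reachable meters (the matrix U_dot of Sect. 3.2)
R = np.arange(int(alpha * n))
U_dot = U[:, R]

# Homogeneous customers and large n: choose mu = alpha, so that alpha*c_t
# approximates the real consumption of the reachable meters at time t
mu = alpha
k_hat, *_ = np.linalg.lstsq(U_dot, mu * c, rcond=None)

print(R[k_hat > 1.2])                # the reachable thief is flagged
```

The choice μ = α only rescales the target, so honest coefficients stay near 1 while the tampered meter's coefficient is roughly doubled, which makes a simple threshold sufficient in this sketch.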
4 Empirical analysis

In this section we present an empirical study carried out on a real utility database to assess the performance of the Multiple Linear Regression model with respect to the other data mining techniques. All models have been implemented using the software R on an ASUS laptop with Windows 10, a 3 GHz Intel Core i7-5500U processor and 4 GB of RAM.

4.1 The available data

The numerical study has been carried out with reference to a database of a utility with 7000 customers divided into 48 Neighborhood Area Networks (NANs) of different sizes: 38 NANs of small dimension (10 NANs with 20 users each, 11 with 50 users each, 9 with 100 users each and 8 with 150 users each) and 10 NANs with a high concentration of users (5 with 250 users each, 3 with 500 users each and 2 with 700 users each). The time horizon comprises 77 days, from 18 July 2016 to 2 October 2016, with 96 daily measurements (one every 15 min), yielding 7392 measurements for every smart meter, as well as for the collector. All Neighborhood Area Networks presented a total reachability of smart meters (α = 100%). The variation of performance with respect to the value of α, the fraction of reachable smart meters, has been investigated by the following procedure: a level α of meters reachability is selected and in every Neighborhood Area Network α·n users are randomly chosen, whose meters are considered as "reachable"; the regression model is then applied in every Neighborhood Area Network, taking into account the collector's reading and the records of the "reachable" meters. In order to obtain more reliable information, for every level of α and for every neighborhood area 20 different problems have been generated by performing 20 random selections of the meters to be considered "reachable". Confidence intervals for the models' performance have then been computed.
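The 99% confidence intervals obtained from the 20 repetitions follow the usual Student-t construction; a minimal sketch with stand-in sensitivity values (the numbers are invented, NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Stand-in for the 20 sensitivity values (in %) obtained from the 20 random
# draws of "reachable" meters in one Neighborhood Area Network
sens = rng.normal(90.0, 2.0, 20)

mean = sens.mean()
# 99% confidence half-width: t_{0.995, 19} * s / sqrt(20)
half = stats.t.ppf(0.995, df=19) * sens.std(ddof=1) / np.sqrt(20)
print(f"sensitivity: {mean:.2f} ± {half:.2f} %")
```

The entries of Table 2 below are intervals of exactly this "mean ± half-width" form.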
Fig. 2 Confusion matrix
In order to apply the other data mining techniques for the classification of customers' load profiles, we have aggregated all users into a unique dataset, which has been divided into a training and a test set. Specifically, we have built the training set by randomly selecting 70% of the customers and have used these data to train the classifiers. After the models have been fully trained, we have tested their performance in the classification of users' load profiles in the test set. In order to obtain more reliable information, we have repeated the division into training and test set 100 times; in this way confidence intervals for the models' performance have been computed for these techniques as well.

4.2 Evaluation criteria

The performance of a classifier can be evaluated through the computation of three indices: accuracy, sensitivity and specificity. These indices can be explained by introducing the confusion matrix, i.e. the matrix that shows all the different results in a classification problem. Specifically, in our two-class problem the confusion matrix has two rows and two columns and presents the structure shown in Fig. 2. Four possible outcomes exist:

• True Positives (TP), the fraudulent users detected by the classifier;
• False Positives (FP), the honest customers denoted by the model as energy thieves;
• True Negatives (TN), the honest customers rightly identified by the model;
• False Negatives (FN), the fraudulent users not detected by the classifier.

The three indices are defined as follows:

1. Accuracy = (TP + TN)/n: percentage of right classifications made by the model, where n is the total number of users classified.
2. Sensitivity = TP/(TP + FN), or detection rate: ability of the model to detect the frauds.
3. Specificity = TN/(TN + FP): ability of the model to correctly classify the honest users.
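The three indices can be computed directly from the confusion-matrix counts; a small self-contained sketch with made-up labels (1 = fraudulent, 0 = honest, NumPy assumed):

```python
import numpy as np

def scores(y_true, y_pred):
    """Accuracy, sensitivity and specificity from binary labels (1 = fraudulent)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))   # frauds detected
    tn = np.sum((y_true == 0) & (y_pred == 0))   # honest users correctly cleared
    fp = np.sum((y_true == 0) & (y_pred == 1))   # honest users flagged
    fn = np.sum((y_true == 1) & (y_pred == 0))   # frauds missed
    return ((tp + tn) / y_true.size,             # accuracy
            tp / (tp + fn),                      # sensitivity (detection rate)
            tn / (tn + fp))                      # specificity

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])
print(scores(y_true, y_pred))   # accuracy 0.75, sensitivity ~0.667, specificity 0.8
```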
In this problem the two types of errors present different costs: while the cost of the false negatives is very high, because it is equal to the energy stolen and hence not paid
by customers, the cost of the false positives can be considered much lower, since it is equal to the inspection costs. For this reason, in the evaluation of the models more importance was given to sensitivity than to specificity.

4.3 Numerical results

The three indices (accuracy, sensitivity and specificity) have been computed on the available datasets and the results are shown in Table 2. As can be observed, the model is a perfect detector in a full information context (i.e. when α = 100%): the proposed technique correctly classifies all instances, the three indices being equal to 100%. With lower rates α of meters reachability the performance obviously decreases; however, for α = 90% the model still shows excellent performance, with sensitivities of about 85–90%, always greater than the specificities. The Multiple Linear Regression is a good approach even for α = 80%, detecting more than 80% of fraudulent customers. From Table 2 it can also be observed that the neighborhood size has a negative impact on the performance: the best results are reached in the smallest zones at every level of α. For instance, when α = 80%, while in zones with 250 or more customers sensitivities and specificities are close to 80% and 70% respectively, the smallest neighborhood areas show much better performance, with sensitivities close to 90% and specificities between 75 and 80%. For this reason, while 80% can be considered the lowest rate of meters reachability for the application of the proposed technique in the wider areas, the zones with at most 150 users allow the implementation of this approach even for smaller values of α, down to α = 70%. Indeed, as shown in the last panel of Table 2, when the reachability rate α is equal to 70% the classifier still shows a good accuracy in the smallest neighborhood areas, with specificities higher than 70% and sensitivities of about 85%.

Through the computation of the p-values it is possible to further improve the performance of the model, reducing the false positive rate. In fact, as shown in Fig. 3a, true positives and false positives present different p-values: while the true positives are largely concentrated in the class with the lowest p-values, the false positives show the opposite behavior, their frequency increasing with the p-value. For example, while the first group, characterized by null p-values, contains almost 70% of the total true positives and less than 15% of the false positives, the last class, the one with the highest p-values, contains less than 10% of the total true positives and about 30% of the false positives. Thus, directing the inspections to the suspicious users with null p-values allows detecting almost 70% of the frauds while reducing by 85% the number of false positives inspected. As can be noted in Fig. 3b, if the electrical company chooses to investigate users with p-values lower than or equal to 1%, the inspections can lead to the detection of more than 80% of the frauds, reducing by 60% the number of false positives. Therefore, the p-value represents an important index for defining the priority of the inspections: directing the controls to the suspicious users with the lowest p-values optimizes the results of the inspections, maximizing the detection rate and limiting the number of false positives.
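The t test of Sect. 3.1 and the resulting p-value ranking can be sketched as follows; the data are again synthetic (one hypothetical thief among 20 users; NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
T, n = 2000, 20                      # readings per meter, users in the NAN
e = rng.gamma(2.0, 0.25, (T, n))     # synthetic real consumptions
gamma = np.ones(n); gamma[3] = 0.7   # user 3 under-records 30% of his consumption
U = e * gamma                        # recorded readings
c = e.sum(axis=1) + rng.normal(0, 0.1, T)   # collector readings with small noise

k_hat, *_ = np.linalg.lstsq(U, c, rcond=None)

# sigma^2 estimate and Var(k_hat_j) from the (j, j) elements of sigma^2 (U'U)^-1
resid = c - U @ k_hat
sigma2 = resid @ resid / (T - n)
var_k = sigma2 * np.diag(np.linalg.inv(U.T @ U))

t_stat = (k_hat - 1) / np.sqrt(var_k)        # test H0: k_j = 1 vs H1: k_j > 1
p_values = stats.t.sf(t_stat, df=T - n)      # P(t_{T-n} >= T_j)

priority = np.argsort(p_values)              # inspect lowest p-values first
print(priority[:3])                          # the thief tops the inspection list
```

Honest users receive p-values spread over (0, 1), while the tampered meter's p-value collapses toward zero, which is exactly what makes the p-value a useful inspection-priority index.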
Table 2 Confidence intervals at level 99% for the performances of the proposed technique

| α (%) | n | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| 100 | Any size | 100 | 100 | 100 |
| 95 | 20 | 89.32 ± 1.68 | 93.63 ± 2.86 | 88.94 ± 1.78 |
| 95 | 50 | 90.18 ± 1.04 | 93.19 ± 2.53 | 89.77 ± 1.12 |
| 95 | 100 | 87.81 ± 0.84 | 92.42 ± 2.04 | 87.30 ± 0.91 |
| 95 | 150 | 88.88 ± 0.67 | 91.51 ± 1.84 | 88.45 ± 0.72 |
| 95 | 250 | 86.21 ± 0.57 | 91.49 ± 1.37 | 85.48 ± 0.62 |
| 95 | 500 | 83.20 ± 0.44 | 88.72 ± 1.11 | 82.38 ± 0.48 |
| 95 | 700 | 80.74 ± 0.39 | 87.94 ± 0.98 | 79.72 ± 0.43 |
| 90 | 20 | 88.42 ± 1.82 | 92.78 ± 2.51 | 87.50 ± 1.94 |
| 90 | 50 | 85.06 ± 1.09 | 90.95 ± 3.00 | 84.35 ± 1.44 |
| 90 | 100 | 82.93 ± 1.00 | 90.38 ± 2.36 | 81.90 ± 1.08 |
| 90 | 150 | 82.15 ± 0.84 | 89.28 ± 1.97 | 81.20 ± 0.91 |
| 90 | 250 | 79.74 ± 0.69 | 88.42 ± 1.65 | 78.42 ± 0.75 |
| 90 | 500 | 77.76 ± 0.50 | 86.75 ± 1.21 | 76.48 ± 0.55 |
| 90 | 700 | 75.55 ± 0.44 | 85.46 ± 1.09 | 74.08 ± 0.48 |
| 85 | 20 | 83.39 ± 2.12 | 92.66 ± 2.72 | 82.04 ± 2.27 |
| 85 | 50 | 82.48 ± 1.45 | 90.66 ± 2.40 | 81.77 ± 1.56 |
| 85 | 100 | 81.00 ± 1.07 | 85.33 ± 4.21 | 79.53 ± 1.18 |
| 85 | 150 | 79.38 ± 0.91 | 86.12 ± 2.34 | 78.42 ± 0.99 |
| 85 | 250 | 77.62 ± 0.73 | 85.70 ± 1.89 | 76.46 ± 0.79 |
| 85 | 500 | 74.36 ± 0.55 | 83.94 ± 1.40 | 72.98 ± 0.59 |
| 85 | 700 | 72.96 ± 0.47 | 83.48 ± 1.14 | 71.30 ± 0.51 |
| 80 | 20 | 82.13 ± 2.27 | 88.63 ± 3.40 | 81.09 ± 2.39 |
| 80 | 50 | 78.90 ± 1.59 | 89.12 ± 3.04 | 77.43 ± 1.72 |
| 80 | 100 | 76.74 ± 1.20 | 87.15 ± 2.75 | 75.22 ± 1.31 |
| 80 | 150 | 76.86 ± 0.98 | 87.18 ± 2.26 | 75.45 ± 1.07 |
| 80 | 250 | 74.65 ± 0.79 | 84.23 ± 2.00 | 73.28 ± 0.85 |
| 80 | 500 | 72.87 ± 0.44 | 82.36 ± 1.12 | 71.52 ± 1.12 |
| 80 | 700 | 70.77 ± 0.50 | 82.12 ± 1.40 | 69.2 ± 0.54 |
| 70 | 20 | 76.59 ± 1.79 | 87.52 ± 3.38 | 74.93 ± 1.99 |
| 70 | 50 | 75.52 ± 1.80 | 86.32 ± 3.04 | 74.38 ± 1.93 |
| 70 | 100 | 74.31 ± 1.33 | 85.92 ± 2.94 | 72.83 ± 1.43 |
| 70 | 150 | 70.83 ± 1.21 | 84.32 ± 2.62 | 70.82 ± 1.20 |
In order to validate our approach, the other data mining techniques commonly used in classification problems have been applied to the available data, leading to the results shown in Table 3. We refer the reader to Micheli (2016) for a detailed description of the implementation of these data mining techniques on the 7000-customer dataset. It can be noted that SVM, Logistic Regression and Neural Networks reach the best accuracy on the available data and show a good balance between sensitivity and
Fig. 3 Distribution of true and false positives, depending on p-values. a Percentage of true and false positives per p-value class, b cumulative distribution of true and false positives

Table 3 Confidence intervals at level 99% for the performances of data mining techniques on available data

| Technique | Accuracy (%) | Sensitivity (%) | Specificity (%) | Average time (min) |
|---|---|---|---|---|
| Decision trees | 72.22 ± 2.15 | 50.50 ± 7.42 | 78.76 ± 6.03 | 8.44 |
| Logistic regression | 72.28 ± 2.88 | 75.54 ± 3.53 | 71.68 ± 4.01 | 8.03 |
| Neural networks | 74.01 ± 1.55 | 70.79 ± 2.37 | 75.57 ± 1.71 | 8.70 |
| SVM | 73.94 ± 2.47 | 77.89 ± 3.21 | 72.87 ± 3.25 | 7.91 |
| Clustering (DBSCAN) | 69.41 ± 1.62 | 64.44 ± 2.85 | 71.45 ± 3.88 | 8.54 |
specificity; on the contrary, Decision Trees present a very high specificity but an extremely poor sensitivity. It can also be observed that the SVM technique presents all three indices greater than Logistic Regression, so this last method can be considered a dominated model. Neural Networks, instead, show a greater specificity and a lower sensitivity than SVM. As previously mentioned, given the high cost of false negatives, the SVM technique can be considered the best of these models, preferable to Neural Networks for its higher sensitivity. Table 3 also compares the data mining techniques in terms of the average computing time required for the classification of the 7000 customers. As can be observed, the techniques are very similar from the computational point of view, as they all classify the users in about 8 min; with 7.91 min of average execution time, the SVM is the fastest model. By comparing the performance of the SVM with the Multiple Linear Regression approach, we can conclude that the proposed technique is the better detector. In fact, in a full information context, i.e. when α = 100%, the regression model is a perfect classifier, since all three indices are equal to 100% in all the neighborhood areas. Also in scenarios with reachability problems, and specifically down to α = 80% in the wider areas or α = 70% in the smallest zones, the proposed model shows better performance, especially in sensitivity, which is always greater than 80%. Even from
the computational point of view, the Multiple Linear Regression model is the best approach. Indeed, the average time required to run the regression model on a Neighborhood Area Network is 4.6 s, so the classification of the 7000 customers located in 48 different Neighborhood Area Networks requires 3.68 min, approximately half the executing time of the data mining techniques.
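For reference, the three indices reported in Table 3 can be computed from a confusion matrix, and the ± half-widths correspond to normal-approximation confidence intervals at level 99% over repeated runs. The following sketch illustrates both computations; the confusion-matrix counts and the list of accuracy values are invented for illustration and are not the paper's data:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # fraction of fraudulent users correctly flagged
    specificity = tn / (tn + fp)   # fraction of honest users correctly cleared
    return accuracy, sensitivity, specificity

def confidence_interval_99(values):
    """Mean and half-width of a normal-approximation 99% CI (z = 2.576)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    half_width = 2.576 * math.sqrt(var / n)               # standard error of the mean
    return mean, half_width

# Hypothetical single run on 7000 customers
acc, sens, spec = classification_metrics(tp=1400, fp=900, tn=4300, fn=400)

# Hypothetical accuracies from repeated runs of one technique
mean_acc, hw = confidence_interval_99([0.72, 0.74, 0.73])
```

A false negative (an undetected thief) is a missed revenue recovery, which is why the discussion above weights sensitivity more heavily than specificity when ranking the techniques.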
5 Conclusions

In this paper we have presented a variant of the P2P computing approach that can be applied also in scenarios characterized by incomplete knowledge of meter readings. The proposed technique allows conducting statistical tests on customers to provide more insights to electric utilities. The method has been tested on real electrical data, showing good performance and some important advantages over the other data mining techniques:
• it has a better detection rate;
• it reaches good performance even in the analysis of a small number of users, unlike the other data mining techniques, which require a great amount of observations to perform well;
• similarly, it needs a smaller number of records to perform well;
• since it does not require the creation of features, it has lower computational costs;
• it gives more information about users: it does not simply classify customers as honest or fraudulent, but also provides evidence about the amount of energy stolen (information contained in the coefficients k_j).
The data used in the numerical experiments have been measured by smart meters that meet the conditions defined by the CEI EN 50470-3 standard, with 1% accuracy. The impact of measurement accuracy on fraudulent customer identification will be investigated in future work.
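To make the regression idea behind the coefficients k_j concrete, the following toy sketch (entirely synthetic data; it omits the paper's treatment of technical losses, partial meter reachability and the statistical tests) regresses the collector's total for one Neighborhood Area Network on the readings reported by its smart meters. An honest customer's coefficient is close to 1, while a meter reporting only a fraction of the true consumption yields a coefficient well above 1:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 96, 5                                     # 96 reading intervals, 5 customers
true_cons = rng.uniform(0.5, 3.0, size=(T, n))   # actual consumption per interval

# Customer 2 tampers with its meter and reports only half of what it uses
reported = true_cons.copy()
reported[:, 2] *= 0.5

# The collector measures the real total energy consumed in the neighborhood area
collector_total = true_cons.sum(axis=1)

# Multiple Linear Regression: collector_total ≈ sum_j k_j * reported_j
k, *_ = np.linalg.lstsq(reported, collector_total, rcond=None)

# A coefficient well above 1 flags systematic under-reporting; the 1.5
# threshold here is illustrative, not the paper's test
suspects = [j for j, kj in enumerate(k) if kj > 1.5]
```

In this noise-free example the fitted coefficients are exactly 1 for the honest customers and 2 for customer 2, whose meter halves every reading; in practice the estimated k_j carry noise, which is why the paper tests them statistically rather than thresholding them directly.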
References

Alahakoon D, Yu X (2013) Advanced analytics for harnessing the power of smart meter big data. In: Proceedings of the 2013 IEEE international workshop on intelligent energy systems (IWIES), Vienna, pp 40–45
Alejandro L, Blair C, Bloodgood L, Khan M, Lawless M, Meehan D, … Tsuji K (2014) Global market for smart electricity meters: government policies driving strong growth. U.S. International Trade Commission, Office of Industries working paper, Washington DC
Depuru S, Wang L, Devabhaktuni V (2011) Support vector machine based data classification for detection of electricity theft. In: 2011 IEEE/PES power systems conference, pp 1–8
Depuru S, Wang L, Devabhaktuni V (2012) Enhanced encoding technique for identifying abnormal energy usage pattern. In: IEEE North American power symposium, pp 1–6
Depuru S, Wang L, Devabhaktuni V, Green RC (2013) High performance computing for detection of electricity theft. Int J Electr Power Energy Syst 47:21–30
dos Angelos E, Saavedra O, Cortes O, de Souza A (2011) Detection and identification of abnormalities in customer consumptions in power distribution systems. IEEE Trans Power 26(4):2436–2442
El-Dereny M, Rashwan NI (2011) Solving multicollinearity problem using ridge regression models. Int J Contemp Math Sci 6(12):585–600
Federal Court of Audit (2007) Operational audit report held in national agency of electrical energy. Tech. Rep. No. TC 025.619/2007-2, Brazil
Fehrenbacher K (2013) A startup emerges to use wireless mesh and the cloud to fight energy theft. Gigaom. https://gigaom.com/2013/01/21/a-startup-emerges-to-use-wireless-mesh-and-the-cloud-to-fight-energy-theft/. Accessed 15 Nov 2016
IBM (2012) Managing big data for smart grids. http://www-935.ibm.com/services/multimedia/Managing_big_data_for_smart_grids_and_smart_meters.pdf. Accessed 15 Nov 2016
Jang R, Lu R, Wang Y, Luo J, Shen C, Shen XS (2014) Energy-theft detection issues for advanced metering infrastructure in smart grid. Tsinghua Sci Technol 19(2):105–120
Mashima D, Cardenas A (2012) Evaluating electricity theft detectors in smart grid networks. In: Research in attacks, intrusions, and defenses. Springer, pp 210–229
McDaniel P, McLaughlin S (2009) Security and privacy. IEEE Secur Priv 7(3):75–77
Micheli G (2016) Big data analytics: individuazione delle perdite non tecniche nelle reti elettriche [Big data analytics: detection of non-technical losses in electric networks]. Master's thesis, Università degli Studi di Bergamo, Bergamo
Ministry of Power (2013) Overview of power distribution. India. http://www.powermin.nic.in. Accessed 15 Nov 2016
Muniz C, Figueiredo K, Vellasco M, Chavez G, Pacheco M (2009) Irregularity detection on low tension electric installations by neural network ensembles. In: IEEE international joint conference on neural networks, pp 2176–2182
Nagi J, Yap K, Tiong S, Ahmed S, Mohammad A (2008) Detection of abnormalities and electricity theft using genetic support vector machines. In: TENCON 2008 IEEE region 10 conference, pp 1–6
Nagi J, Yap KS, Tiong SK, Ahmed S, Nagi F (2011) Improving SVM-based nontechnical loss detection in power utility using the fuzzy inference system. IEEE Trans Power Deliv 26(2):1284–1285
Salinas S, Li M, Li P (2012) Privacy-preserving energy theft detection in smart grids. In: 9th annual IEEE communications society conference on sensor, mesh and ad hoc communications and networks (SECON), pp 605–613
Salinas S, Li M, Li P (2013) Privacy-preserving energy theft detection in smart grids: a P2P computing approach. J Sel Areas Commun 31(9):257–267