PROCEEDINGS OF 7th INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION AND IMAGE ANALYSIS: NEW INFORMATION TECHNOLOGIES ST. PETERSBURG, RUSSIAN FEDERATION, OCTOBER 18–23, 2004
Some Modifications of the Algorithm for Construction of Classification Trees A. O. Skomorokhova, V. N. Kutinskya, and M. T. Slepovb a
Obninsk State Technical University of Atomic Energy, Obninsk, Kaluga oblast, Russia b Novovoronezhskaya Nuclear Power Plant e-mail:
[email protected],
[email protected],
[email protected] http://www.data2money.com
Abstract—Some modifications of the algorithm for constructing classification trees that are helpful in the processing of noise spectra in the technical diagnostics of the nuclear power plant are presented. DOI: 10.1134/S1054661808010161
1. SYSTEM FOR VIBRATION MONITORING OF THE NOVOVORONEZHSKAYA NUCLEAR POWER PLANT The system for vibrational monitoring of the Novovoronezhskaya nuclear power plant (NVNPP) has been developed by Siemens and was put in operation in 1992. The system provides extended facilities for measurement and preliminary processing of vibration signals. However, the system has no functions for making decisions in the course of diagnostics except for the simplest functions (plotting of the trends of individual spectral components). In this regard, it is expedient to develop systematic, algorithmic, and software tools for computerization of the intellectual data processing and support of diagnostics using the existing system for vibration monitoring. The spectra of vibration signals that are obtained from sensors mounted on various elements of equipment serve as the primary information for the diagnostic system. Figure 1 demonstrates the positions of the sensors at the reactor housing and the equipment of six coolant loops (steam generators and main circulation pump (MCP)). The application of the methods of pattern recognition in processing of vibration spectra is considered in [1–3].
(ii) a relatively easy interpretation of the obtained solution as a set of simple classification rules, (iii) easiness of the simultaneous analysis of both numerical and factor (qualitative) variables, (iv) simultaneous construction of classification rules and the search for informative features, (v) application of well-developed procedures for optimization of the complexity and predictive ability of the solutions obtained, (vi) the possibility of effective processing of large (with respect to the number of measurements and the number of predictors) data arrays. The CTs are efficiently used to solve the following two main classes of problems: construction of decision rules for classification, diagnostics, and prediction and search for hidden regularities and singularities in large data arrays. The results on the CT construction and application can be used directly and as a priori information (inforSteam generator 4
Main coolant pump 3
Main coolant pump 4
2. SEARCH FOR REGULARITIES IN DATABASE The methods based on construction of classification trees (CTs) are intensively used in the solution of the general classification problem [4]. CTs exhibit the following advantages:
Steam generator 3
Steam generator 2 Main coolant pump 2
Steam generator 5
Main coolant pump 5 Main coolant pump 1
(i) calculation simplicity and a relatively high efficiency of the algorithms for CT construction, Main coolant pump 6
Received March 31, 2005
Steam generator 1
Fig. 1. Positions of sensors.
ISSN 1054-6618, Pattern Recognition and Image Analysis, 2008, Vol. 18, No. 1, pp. 132–138. © Pleiades Publishing, Ltd., 2008.
SOME MODIFICATIONS OF THE ALGORITHM FOR CONSTRUCTION
j
The interpretation of deviance is similar to that of entropy. However, note that factor nij is used instead of factor pij . The meaning of this difference is as follows. We assume that several candidates for thresholds bk, bl , and bm that can be used for the vertex partition are determined at a certain stage of the tree construction. We also assume that the partition with respect to each of these thresholds leads to a pure vertex. Then, in accordance with the measure of deviance, we choose the threshold that yields a pure vertex with the maximum number of objects of the given class. Consider vertex i whose partitioning yields two daughter vertices j and k. If the corresponding values of deviance are Di , Dj , and Dk, then the decrease in the deviance is written as Di – D j – Dk .
(1)
The task is to choose such a partition variant (split) of the parent vertex that maximizes this decrease. After the split of a certain vertex, the prospects for the split of the resulting daughter vertices are estimated. If a certain vertex contains the objects of a single class only (pure vertex) or the number of objects of another class is no greater than a certain threshold level ε, this vertex is classified as the terminal and is not involved in further partitioning. In the opposite case, the partitioning is recursively repeated until all of the new vertices are classified as terminal. The CART algorithm is most widely spread in comparison with alternative algorithms for the CT construction (C4.5, NeId, CHAID, etc.). The method of CTs constructed using the CART algorithm is commonly accepted as the most efficient and popular approach to solution of problems on searching for hidden nontrivial information in databases and data mining problems. However, this algorithm for the CT construction cannot take into account specific features of particular problems, which is predominantly related to the properties of the algorithm. Below, we demonstrate that this can lead to complexities related to interpretation of the PATTERN RECOGNITION AND IMAGE ANALYSIS
C1
S 2 ( f 1 ) S 2 ( f 2 ) … S 2 ( f 400 )
C2
S n ( f 1 ) S n ( f 2 ) … S n ( f 400 )
…
ij log p ij .
S 1 ( f 1 ) S 1 ( f 2 ) … S 1 ( f 400 ) …
∑n
We represent the available sets of spectra as a matrix S and a classification vector C:
…
Di = –2
4. LOCAL PEAK COUNT
…
3. CT CONSTRUCTION: D CRITERIA In this work, we consider binary CTs, whose construction algorithm is based on the recursive separation using the classification and regression trees (CART). The deviance, which characterizes the vertex purity, serves as a criterion for the vertex partition into daughter vertices. For a pure vertex (a vertex that contains only objects of a single class), the deviance equals zero. In general, the deviance for vertex i is given by
results with regard to certain knowledge and the uniqueness and optimality of the resulting solutions. In this work, we present several modifications of the CART algorithm based on the long-term analysis and processing of the vibration control data on the NVNPP first circuit and the application of CTs as the main instrument in this work.
…
mative features, general character of the decision function, and specific data subsamples) in the application of alternative classification methods.
133
.
Cn
The rows of matrix S correspond to individual spectra estimated using 400 equidistant frequencies in the range 0–50 Hz. The columns of matrix S correspond to the power spectral densities (PSDs) at a fixed frequency estimated at various instants. In terms of pattern recognition, we consider a set of n samples (points) in the 400-dimensional space of features. The classification vector C contains the numbers of classes for each spectrum. Evidently, vibration is inherent in the working equipment. The natural wear and the possible anomalies lead to a variation in the original vibration characteristics of the system, so that different effects cause different variations in the vibration characteristics. Normally, such variations are related to appearance and disappearance of peaks (amplitude variations, frequency shifts, etc.). Thus, the variations in the vibration characteristics can be analyzed in terms of the presence and absence of peaks. In this regard, it is expedient that the algorithm chooses the feature corresponding to the peak rather than the first feature (default operation of the CART algorithm) when the CTs are constructed in the case of several features (frequencies) with the same value of the D criterion. The degree of conformity of a certain frequency to the peak is determined using the number of local maxima observed at this frequency in all spectra of initial data. Thus, the modification of the original algorithm of the CT construction lies in the introduction of an additional criterion based on the degree of conformity of a certain frequency to the peak and the application of this criterion in the selection of the best partition variant with the same value of the D criterion. Naturally, a local maximum is defined as a spectral point whose amplitude is higher than the neighboring amplitudes:
Vol. 18
S ( f i – 1 ) < S ( f i ) > S ( f i + 1 ) ⇒ P ( f i ) = 1. No. 1
2008
134
SKOMOROKHOV et al. PSD 14
D
120 80 40 0 60 40 20 0
12 10
Peaks
8 6 0
100
200 300 Frequency index
400
4 2 0
100
200
300 400 Frequency index
Fig. 2. Plots of D criterion and the number of local maxima.
Fig. 3. Selection of partition corresponding to the peak.
The presence of the local maximum at frequency fi is represented using logical variable P. A peak at a certain frequency can be present or absent in a spectrum. To determine the degree of conformity of the spectrum at a certain frequency to the peak, we count the number of local maxima over the entire sample:
In the conventional CT construction (D criterion), frequency 19 (left vertical line in Fig. 3) would be selected. This frequency does not correspond to the peak with respect to any class, which impedes the interpretation. The application of criterion (3) yields frequency 157, which corresponds to a developed peak for each class. Note that both features yield the same classification quality D = 0. To extract the peaks in the spectra, one can also employ the second derivative of the spectra. Then, the spectral peaks correspond to minima in the second derivative. To calculate the second derivative of a spectrum, we use the parabolic smoothing with the least-squares method. For the running frequency window
n
N( f i) =
∑ P ( f ).
(2)
i
j=1
The optimal frequency is determined using the maximum N among the frequencies with the zero value of the D criterion: f opt = max { N ( f i ) D ( f i ) = 0 }. i
(3)
We consider the application of this approach using an example of the problem of recognition of axial and transverse vibrations of the steam generator in one of the six coolant loops. The conventional CT constructed to solve this problem shows that the presented classes can be correctly classified using the signal at only one frequency, which corresponds to the CT consisting of only two terminal vertices. However, a more detailed analysis of the resulting tree shows that the solution is not unique and that there exist several frequencies for which the partitioning leads to a result of the same quality. This is illustrated by the upper plot in Fig. 2: several frequency regions correspond to the same value of the D criterion (D = 0). Then, for each frequency with D = 0, we estimate its conformity to the peak using formula (2). Using the resulting values (lower plot in Fig. 2), we choose the most optimal frequency in accordance with additional criterion (3). Figure 3 shows the median spectra calculated with respect to each of the two classes using the following formula: S med ( f i ) = med { S ( f i ) ∈ ω i }, i
i = 1, 2.
f i – k, f i – ( k – 1 ), …, f i, …, f i + ( k – 1 ), f i + k with the length 2k + 1, we estimate the parabola coefficients using the least-squares method: 2
S = a0 + a1 f + a2 f . The following expression is used as the smoothed estimate for the second derivative: ∂ S --------2 = 2a 2 . ∂f 2
Both criteria for the selection of the spectral peaks can be used simultaneously and make it possible to choose the most easily interpreted frequencies among the features with equal classification efficiencies. 5. FISHER CRITERION Equal values of the D criterion can correspond to features of different quality. Figures 4 and 5 demonstrate distributions (schematic diagrams) for two features fi and fm for the problem of classification with respect to two classes. Both features exhibit zero value of the D criterion (the samples for different classes do not overlap) and allow the correct classification of points of the learning sample. Note that the quality of
PATTERN RECOGNITION AND IMAGE ANALYSIS
Vol. 18
No. 1
2008
SOME MODIFICATIONS OF THE ALGORITHM FOR CONSTRUCTION
the second feature (Fig. 5) is higher, since the difference of the mean values with respect to classes is significantly greater than the spread inside classes. It is commonly accepted that such a characteristic of features is described using the Fisher criterion [5], which is one-dimensional in the case under study: (mx – my) ------------------------, 2 2 sx + sy
25 20 15 10
2
5
(4)
0 1
where m are the centers of classes and s are variations for classes. The following procedure is used for the CT construction to take into account the above fact. To determine each split, we employ the subset of features with the minimum value of the D criterion and choose the best feature with respect to the maximum of the Fisher criterion (4). Using this approach to solution of the recognition problem for the signals from sensors mounted on the steam generators and MCP, we obtain the following result. Figure 6 demonstrates the values of the D criterion for the first partition (upper plot) of the constructed CT and the corresponding values of the Fisher criterion (lower plot). It is seen in the upper plot that the partition with respect to several frequencies corresponds to the criterion value D = 0. Note that the values of the Fisher criterion (4) for these frequencies can differ several times. Thus, the application of the D criterion yields the partition with respect to frequency component 4, for which the classification quality is not the best. The application of the additional (Fisher) criterion leads to the selection of frequency component 198, which corresponds to the peak of the MCP rotation frequency (24.625 Hz).
20 18 16 14 12 10 8 1
2
Fig. 5. Distributions at frequency fm.
Peaks
D
300 200 100 0 0.4 0.2 0 0
The next problem that needs to be analyzed in detail is the optimization of the classification models obtained using the CT method. Such an optimization is necessitated by the nature of the CT construction algorithm, in particular, the criterion for estimating the quality of the partition variants at the tree vertices. The problem is as follows. The constructed CT represents a series of sequential partitions each of which is the best partition for the given node. However, in the estimation of the quality of each specific partition, the algorithm takes into account the purity of only the daughter vertices (see formula (1)) and does not estimate the effect of this partition on the quality of the subsequent partitions of the resulting daughter vertices. In other words, the selection of the best partition does not take into account the quality of the entire CT. Thus, the CT constructed using the best splits can be inferior to the CT with lower quality partitions at the nodes. Note that the optimal CT normally exhibits a smaller number of terminal vertices
2
Fig. 4. Distributions at frequency fl .
6. RANDOM SEARCH WITH ADAPTATION
PATTERN RECOGNITION AND IMAGE ANALYSIS
135
100
200 300 Frequency index
400
Fig. 6. Plots of D criterion and Fisher criterion.
(and, hence, more effective classification rules) and a smaller classification error on the learning sample. A variant to solve this problem can be the linear search for all of the possible CTs resulting from all of the possible partitions with respect to all of the possible nodal vertices. Obviously, such determination of the optimal tree is inapplicable due to enormous number of variants that need to be analyzed. Note that the CART algorithm performs the linear search for all of the possible partitions with respect to one vertex or, in other words, the linear search for the classification subtrees. We propose another approach to this problem. Instead of constructing CT with regard to all of the
Vol. 18
No. 1
2008
136
SKOMOROKHOV et al. V386 < 1.37673
Probability 0.00265
0.00255 V4 < 9.47276
V67 < 5.14347 2
V5 < 7.02825
1 1
1
2
0.00245 0
100
200
300 400 Frequency index
Fig. 7. Conventional classification tree.
Fig. 8. Results of the estimation of feature informativeness.
available features, we propose the CT construction using the most informative features. The features can preliminary be assessed using the random search with adaptation (RSA). The feature informativeness is estimated in terms of probability, when the most informative features correspond to the highest probabilities, and the estimation procedure is as follows. (i) First, equal probabilities (equal values of informativeness) are assigned to all of the features. This corresponds to a uniform probability distribution. The sum of the probabilities equals unity. (ii) In accordance with the current probability distribution, n features are randomly chosen. A CT is constructed using this features and the classification error for this tree on the learning sample is estimated. This step is repeated r times. Quantities n and r are the algorithm parameters. (iii) The best and worst CTs are chosen based on the classification error for r trials. The probabilities of the features that are used to construct the best CT are increased by dP (are stimulated), where 0 < dP < 1 is an algorithm parameter. The probabilities of the features that are used to construct the worst CT are decreased by the same value dP (are punished). (iv) The second and third steps are repeated R times (R is an algorithm parameter). (v) The most informative features are chosen using the resulting probability distribution in which the highest probabilities correspond to these features. Thus, the complete optimization procedure for the CT construction is as follows. (i) The informativeness of features is estimated using the RSA method and a set of the most informative features is determined. (ii) A CT is constructed only using this set of features. By way of example that demonstrates the proposed procedure, we consider the classification of signals
from sensors mounted on steam generators (SGs) that are connected to various turbogenerators (TGs). The problem is as follows. Two 220-MW TGs are available at the NVNPP third and fourth units. Note that SG1– SG3 and SG4–SG6 work for TG11 and TG12, respectively. In the case of the normal operation of the unit, the TG loads are different predominantly due to different qualities of vacuum in the capacitors. This power imbalance can differently affect the vibrations of SGs working for different TGs. Thus, the problem lies in searching for such differences and in the vibration analysis with regard to these differences. The search is realized using CTs. Figure 7 demonstrates the maximum-size CT constructed for the problem under study using the CART algorithm in the absence of additional criteria. The size of this tree is five vertices, and the error on the learning sample is 0%. After the first split, the lefthand vertex contains 153 objects of class 1 (signals from the sensors on SG1–SG3) and three objects of class 2 (signals from the sensors on SG4–SG6). In the right-hand vertex, we have nine objects of class 1 and 152 objects of class 2. Using the RSA method, we choose the group of the most informative features. The resulting probability distribution for the parameters n = 5, r = 20, and R = 10000 is presented in Fig. 8. The horizontal line in Fig. 8 shows the threshold with which we choose two regions of the most informative features in the ranges of the 10th and 160th frequency components. The following frequency components are included into the set of the most informative features: 5–12 and 153–172. This result makes it possible to draw the first important conclusion: the 386th feature, which is the most informative in accordance with the conventional CT (Fig. 7) is not contained in the above group. This additionally proves a weak relation between the qualities of individual splits and the CT quality in general.
PATTERN RECOGNITION AND IMAGE ANALYSIS
Vol. 18
No. 1
2008
SOME MODIFICATIONS OF THE ALGORITHM FOR CONSTRUCTION
137
V5 < 9.28808
PSD 12 10 8
V7 < 7.2669 1
6 4 2 0
1
0
100
200
2
300 400 Frequency index
Fig. 9. Median spectra for the signals of two classes.
Fig. 10. Optimized classification tree.
Figure 9 shows the median spectra for the two classes under study. We clearly see two frequency ranges with the most significant difference between the aforementioned classes. Figure 10 demonstrates the CT constructed using the set of informative features. Thus, at the same zero classification error on the learning sample, the size of the optimized CT is three vertices. The decrease in the number of terminal vertices leads, first, to a decrease in the number of logical rules needed for the classification of spectra from various groups of sensors and, second, to significant simplification of these rules. Another important feature of the result obtained is the presence of the terminal vertex after the first split. This makes it possible to identify about 80% of signals from sensors of class 1 using a single feature (the PSD value at the fifth frequency component). The significance of this factor is especially high in the presence of gaps in experimental data. The quality of a certain partition variant can be characterized using the length of branches that connect the parent and daughter vertices. In particular, the analysis of the graph of the conventional CT (Fig. 7) can clearly illustrate this problem. Choosing the V386 feature as the best partition variant, the algorithm for the CT construction ignores the fact that this leads to a series of extremely bad partitions. Note that the number of partitions is reduced and the total CT quality increases when the V5 feature with the lower D criterion is chosen as the first split. This result also proves that, in comparison with the conventional CT, the optimized CT employs the features with respect to which the above classes are spatially separated to a greater extent.
take into account the specific character of each problem. Sometimes, this can lead to a scenario in which the resulting solutions are difficult to interpret or the solutions are unnecessarily complicated. We assume that such problems can be solved using additional quality criteria. In this work, we consider several criteria whose efficiency is demonstrated using specific examples of the analysis of vibration spectra obtained on the system of the NVNPP vibration monitoring.
7. CONCLUSIONS The method of CTs constructed using the CART algorithm remains one of the most efficient means for the comprehensive data analysis. However, it cannot PATTERN RECOGNITION AND IMAGE ANALYSIS
REFERENCES 1. A. O. Skomorokhov and M. T. Slepov, “Data Verification in the System of Vibration Diagnostics at the Novovoronezhskaya Nuclear Power Plant,” Yad. Energetika, No. 1 (1999). 2. A. O. Skomorokhov and M. T. Slepov, “Pattern Recognition in APL with Application to Reactor Diagnostics,” APL Quote Quad 29 (2), 92–103 (1998). 3. A. O. Skomorokhov and V. Kutinsky, “Classification Trees in APL: Implementation and Application,” APL Quote Quad 31 (2), 97–109 (2001). 4. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees (Wadsworth, Belmont, 1984). 5. R. Duda and P. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973; Mir, Moscow, 1976). Aleksandr O. Skomorokhov. Born 1951. Graduated from the Faculty of Radio Physics, Nizhni Novgorod State University in 1973. Received candidate’s degree in 1984. Associate Professor of the Obninsk State Technical University of Atomic Energy. Scientific interests: development and application of modern methods for data analysis, technical diagnostics, and matrix programming languages. Author of 30 papers. Member of SIGAPL and SIGKDD of ACM, BCS, and the Russian Association of Artificial Intelligence.
Vol. 18
No. 1
2008
138
SKOMOROKHOV et al. Vladimir N. Kutinsky. Born 1975. Graduated from the Obninsk Institute of Atomic Power Engineering in 1998. Senior Lecturer of the Obninsk State Technical University of Atomic Energy. Author of three papers. Member of the Russian Association of Artificial Intelligence. Winner of the Obninsk Stipend for Students, PhD Students, and Young Lecturers.
Mikhail T. Slepov. Born 1966. Graduated from the Obninsk Institute of Atomic Power Engineering in 1992. Received candidate’s degree in 1999. Head of the Laboratory of Technical Diagnostics at the Novovoronezhskaya Nuclear Power Plant. Scientific interests: vibration analysis, signal processing, spectra. Author of seven papers. Member of Advisory Committee of the Rosenergoatom Concern.
PATTERN RECOGNITION AND IMAGE ANALYSIS
Vol. 18
No. 1
2008