Neural Comput & Applic (2012) 21:1477–1489 DOI 10.1007/s00521-012-0887-4
ORIGINAL ARTICLE
Using artificial neural networks to enhance CART

William A. Young II · Gary R. Weckman · Vijaya Hari · Harry S. Whiting II · Andrew P. Snow
Received: 5 August 2009 / Accepted: 6 February 2012 / Published online: 21 February 2012
© Springer-Verlag London Limited 2012
Abstract Accuracy is a critical factor in predictive modeling. A predictive model such as a decision tree must be accurate to draw conclusions about the system being modeled. This research aims at analyzing and improving the performance of classification and regression trees (CART), a decision tree algorithm, by evaluating and deriving a new methodology based on the performance of the real-world data sets that were studied. This paper introduces a new approach to tree induction that improves the efficiency of the CART algorithm by combining its existing functionality with artificial neural networks (ANNs). The tree induction algorithm utilizes trained ANNs to generate new, synthetic data, which have been shown to improve the overall accuracy of the decision tree model when actual training samples are limited. In this paper, traditional decision trees developed by the standard CART methodology are compared with enhanced decision trees that utilize the ANN's synthetic data generation, or CART+. This research demonstrates the improved accuracies that can be obtained with CART+, which can ultimately improve the knowledge that can be extracted by researchers about a system being modeled.
W. A. Young II (✉)
Department of Management Systems, College of Business, Ohio University, Copeland Hall, Athens, OH 45701-2979, USA
e-mail: [email protected]

G. R. Weckman · V. Hari · H. S. Whiting II
Department of Industrial and Systems Engineering, Russ College of Engineering and Technology, Ohio University, Stocker Center, Athens, OH 45701-2979, USA

A. P. Snow
The J. W. McClure School of Information and Telecommunication Systems, Ohio University, Lindley Hall, Athens, OH 45701-2979, USA
Keywords Decision trees · Classification and regression trees · CART · Artificial neural networks · Knowledge extraction
1 Introduction

The human quest for predicting future trends and patterns in business-related scenarios has created the need to collect and analyze raw data in order to develop mathematical models that can help researchers understand and manage systems better [1]. Knowledge discovery in data sets (KDD) is a process of identifying useful and comprehensible patterns in data [2, 3]. Data mining, one of the steps of KDD, is a widely used tool in competitive business environments. For example, data mining is used to reduce overall cost and enhance profits in business sectors [4]. In addition to business applications, data mining algorithms have been adopted in engineering, medicine, sports, and environmental systems. Within these sectors, the most commonly used algorithms include multivariate regression, decision trees, decision rules, artificial neural networks (ANNs), clustering, and Bayesian models [5]. Decision trees have a major benefit over other modeling approaches such as ANNs or multiple-regression models in that they can incorporate missing values directly in the induction algorithm. In this case, no additional data preprocessing techniques, such as single or multiple imputation, are required to generate a decision tree. Data mining methods are usually descriptive or predictive. Descriptive data mining discovers patterns and makes the extracted patterns understandable, while predictive data mining builds models to predict specific outcomes.
Decision trees are predictive tools that can be used to model either classification or continuous outcomes. Several decision trees have been used in practice, such as CHAID (Chi-Squared Automatic Interaction Detector), QUEST (Quick, Unbiased, Efficient Statistical Trees), and CART (Classification and Regression Trees) [6–9]. CHAID was developed to build non-binary decision trees using only categorical variables [10]. QUEST is considered a binary decision tree because its methodology is restricted to a single categorical target variable [11]. Both CHAID and QUEST are more computationally expensive and require longer run times in comparison with other tree induction algorithms, especially when implemented with large databases [9]. CART is capable of overcoming these drawbacks and is the area of interest for this research [12]. CART is a non-parametric decision tree algorithm, which is unlike most tree-induction methods. In addition, CART is designed for either categorical (classification) or continuous (regression) decision variables or input attributes. Besides being an efficient tree induction method for large data sets, it has been used as a feature selection methodology [13]. For example, CART has been well received in medical diagnosis due to its dependable accuracies [14] and relative ease of use to recommend treatment methods quickly [15].

There are many examples of researchers using CART in academic and applied research. For example, Waheed et al. [16] used the classification method to study the effect of water stress, presence of weeds, and nitrogen rates on crop growth. The decision rules for this agricultural system helped decision makers to select suitable irrigation conditions correctly and to improve overall yields depending on the season. In another example, CART was used as a decision tool by the Bell Atlantic Telephone Company [17]. This company used the set of rules created by CART to determine the type of technician that was needed to work on a particular customer problem. The use of this method was simple, cost effective, and flexible, which helped save the company time and money by more efficiently allocating its resources.

Regression trees can use continuous variables to predict a certain response. There are many examples found in academic research using CART for this type of prediction. For example, variables such as crime rate, tax, and pollution were used to predict housing closing prices [18]. CART has been used for other real-world applications such as defining garment sizes for army soldiers [19]. By analyzing the decision tree, a wider range of body shapes could be fitted with fewer sizes, which reduced inventory costs and led to a more efficient production plan for the manufacturer. Another example of a real-world application is seen with the Donnelly plant in Gallatin, Tennessee [20].
A mechanical problem caused delays in production until the problems were repaired. The decision tree model was used to troubleshoot the process parameters that were causing delays and was able to identify the problems that resulted in low efficiencies. Because of CART's predictive accuracy and transparency, it is used in many other fields of study including stock trading [21], manufacturing [22], medical diagnosis [23], meteorology [24], physics [7], economics [25], education [26], e-commerce [26], ecology [15], analytical chemistry [14], and real estate mortgages [12].

As with all decision tree algorithms, CART's major limitation is that the amount of data reaching the terminal nodes diminishes as the depth of the tree grows. Even though the most sensitive attributes are used at the beginning of the tree to predict a certain outcome, the terminal nodes may lead to misclassifications because the final decision tree splits are based on limited sample sizes. This problem is illustrated in Fig. 1, where the rectangular boxes indicate nodes of the decision tree and the circles represent the number of samples remaining after a binary split has been made. As noted in the figure, one resulting terminal node had seven samples remaining to determine the final split of the tree. However, the "true" distribution of the relationships for predicting a certain outcome is unknown because there is a limited number of samples remaining.

In real-world applications, the primary concerns with decision trees are accuracy, clarity, and comprehensibility. For example, being able to provide accurate and understandable information may help a credit card company make a better decision about the acceptance or denial of an application [28]. Likewise, a health professional can achieve a greater ability to provide appropriate treatment to patients. Despite the limitation of diminishing sample sizes when determining the splits of a decision tree, decision trees can be easy to interpret and provide clear insight into the decision process.
Fig. 1 Knowledge extraction from a decision tree (adapted from [27])
The primary research objective was to develop a methodology to enhance CART that overcomes the deficiency of limited data at terminal nodes in order to increase the tree's accuracy. To address this problem specifically, the proposed methodology uses ANNs to reduce the ill effects of splitting upon limited data. ANNs are often used as the computational backbone for many decision support systems. For example, they have been used in atmospheric science [29], psychopathology [30], image recognition of crops [31], cryptography [32], adaptive controlled driving machines [33], bioinformatics [34], and aircraft and diesel engine monitoring systems [35]. Although ANNs often outperform other conventional machine learning techniques, they are often called "black boxes" because of their lack of transparency [27–37]. In other words, it is very difficult to determine the estimating relationships that an ANN is using to derive its prediction. In contrast, decision trees are very transparent, and a decision maker can follow the logic used to produce an estimate.

Craven and Shavlik developed a unique tree-induction algorithm called TREPAN [38]. This algorithm uses ANNs to create additional synthetic data when limited data are available to create a split at a given node. In their initial work, the "Oracle" process proved to be a novel approach to making the decision tree robust to poor or limited sample sizes. However, TREPAN has the limitation of only being a classification tree. For this research, TREPAN's Oracle process was used in conjunction with CART's existing algorithm to make continuous regression trees more robust to limited data. The enhanced version of CART that is presented in this paper will be referred to as CART+ from this point forward.

The organization of this paper is as follows. Section 2 provides a discussion of the methods and issues involved in the development of CART+. Section 3 details the methodology and describes the various data sets used for discussion of the method's accuracy. Section 4 compares the results obtained from using CART and CART+. Finally, Sect. 5 concludes and provides discussion of the CART+ methodology.
2 Technical overview

This section gives an overview of the methodologies used to create CART+. The three main topics discussed in this section are CART, ANNs, and TREPAN.

2.1 CART

CART is a non-parametric binary recursive partitioning technique developed by Breiman et al. [18]. In comparison with other tree induction techniques such as CHAID, CART uses binary splits, which means that a particular node splits into two subsets, or child nodes. The CART methodology has three main stages. In the first stage, a complex tree with many terminal nodes is grown. Though this maximal tree will fit the training data perfectly, its predictive ability can be low due to over-fitting. In the second stage, a process called pruning removes groups of nodes using cross-validation and a cost-complexity factor [14]. In the final stage, the predictive error is used as the criterion to select an optimal tree. The following sections provide an overview of each of these stages of the CART tree-induction methodology.

2.1.1 Tree growing

The process of tree building begins by splitting the root node into two child nodes. CART computes the best split by considering all probable splits for each independent variable. The best split is the one that maximizes the decrease in impurity between the parent node and the two child nodes:

\Delta(s, t) = i(t) - [p_L\, i(t_L) + p_R\, i(t_R)]    (1)

where s is the split of the independent variable, t is the parent node, i(t) is the impurity of node t, p_L and p_R are the proportions of data objects going to the left and right child nodes, and i(t_L) and i(t_R) are the impurities of the left and right child nodes, respectively [14]. For a classification tree, the impurity measure i(t) is computed using criteria such as the Gini index, the entropy index, or the twoing rule, which determine the best split. For a regression tree, where the response variable is numerical, the least-squares deviation estimates the node impurity:

R(t) = \frac{1}{N_w(t)} \sum_i w_i f_i \left(y_i - \bar{y}(t)\right)^2    (2)

where N_w(t) is the weighted number of objects in node t, w_i is the value of the weighting variable for object i, f_i is the frequency variable, y_i is the value of the dependent (response) variable, and \bar{y}(t) is the mean of the values of the objects in node t [9]. Using the best split equation, a maximal tree is grown until the initial sample data are exhausted. The recursion stops when all of the objects in a terminal node belong to the same class (i.e., the node is homogeneous) or when the number of objects in a terminal node reaches a predefined value.
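To make the split search concrete, the following minimal sketch (Python with numpy; an illustration, not the authors' implementation) scans one independent variable for the threshold that maximizes the impurity decrease of Eq. 1, using an unweighted form of the least-squares impurity of Eq. 2:

```python
import numpy as np

def ls_impurity(y):
    """Least-squares node impurity: mean squared deviation from the node mean
    (an unweighted form of Eq. 2)."""
    return np.mean((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(x, y):
    """Return the threshold on one attribute that maximizes Delta(s, t) of Eq. 1."""
    best_gain, best_thresh = -np.inf, None
    parent = ls_impurity(y)
    for thresh in np.unique(x)[:-1]:        # candidate splits between observed values
        left, right = y[x <= thresh], y[x > thresh]
        p_l, p_r = len(left) / len(y), len(right) / len(y)
        gain = parent - (p_l * ls_impurity(left) + p_r * ls_impurity(right))
        if gain > best_gain:
            best_gain, best_thresh = gain, thresh
    return best_thresh, best_gain
```

CART repeats this scan over every independent variable and splits on the variable-threshold pair with the largest impurity decrease.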
2.1.2 Tree pruning

Pruning is then applied to the over-fit tree grown in the initial stage of CART. Pruning develops an optimal tree by shedding off branches of the large, over-fit tree. The pruning procedure develops a sequence of smaller trees and computes the cost-complexity for each tree; based on the cost-complexity parameter, the procedure can determine the optimal tree with high accuracy. The cost-complexity measure R_\alpha is a linear combination of the cost associated with the tree and its complexity:

R_\alpha(T) = R(T) + \alpha |\tilde{T}|, \quad \alpha = \frac{R_\alpha(T) - R(T)}{|\tilde{T}|}    (3)

where R(T) is the re-substitution error estimate, |\tilde{T}| is the number of terminal nodes of the tree, which determines the complexity of the tree, and \alpha is the cost-complexity parameter associated with the tree. For a classification tree, the re-substitution estimate R(d) is the misclassification error, that is, the proportion of cases misclassified by the classifier d:

R(d) = \frac{1}{N} \sum_{n=1}^{N} X\left(d(x_n) \neq j_n\right)    (4)

where X is the indicator function, which equals 1 if the statement d(x_n) \neq j_n is true and 0 if it is false, and d(x) is the classifier. For a regression tree, the re-substitution error is given by the expected squared error:

R(T) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - d(x_i)\right)^2    (5)

where (x_i, y_i) is the learning sample and d is the numerical predictor. The value of the complexity parameter in pruning usually lies between 0 and 1. The pruning procedure develops a group of trees using different values of the complexity parameter, giving trees of different sizes. According to Breiman et al., among a group of trees of different sizes, for a given value of \alpha only one tree of smaller size has the highest accuracy; in other words, that tree minimizes the squared error if it is a regression tree or the misclassification error if it is a classification tree [14].

2.1.3 Optimal tree selection

The optimal tree is the one that has the smallest prediction error for new samples. Prediction error is measured using independent testing sets or cross-validation (CV). The independent test set approach divides the original data set into a training set and a testing set and is usually used when the data set is large. When the data set is not large enough to be split into training and testing data, V-fold cross-validation is used, where the data set is separated into
V subsets. Each subset consists of a different group of data, with the same number of data objects in each group. During each run of cross-validation, one subset is used as the test data and the remaining (V−1) subsets are used as the training data. Cross-validation is repeated V times, each time with different subsets of training and test data, thus developing V different trees. Among the V different trees, the simplest tree that has the lowest independent test set error rate or cross-validation error rate (CV error) is selected as the optimal tree. However, the tree at this stage is considered unstable, as the results predicted from the decision tree can change rapidly with slight changes in the training data set [39]. Therefore, the 1-SE rule, known as the one-standard-error rule, is used to avoid instability. The 1-SE rule builds trees of different sizes and complexity by varying the importance of the variables that drive the drop in the error rate and the instability. From this group of trees, the smallest tree whose CV error falls within one standard error (1-SE) of the minimal CV error is selected as the optimal tree. This tree is less complex than the tree with the lowest CV error while offering comparable accuracy. The selection of the optimal tree concludes the final stage of CART, and predictions are made by following the tree from the root node to a terminal node and reading the value in that terminal node [14].
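The pruning-and-selection loop can be sketched with scikit-learn's cost-complexity pruning API, used here as a stand-in for the CART Pro software employed in this paper (X and y are assumed to be an existing regression data set):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# X, y: training data for one of the five data sets (assumed to exist)
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

results = []
for alpha in path.ccp_alphas:                 # candidate values of the parameter in Eq. 3
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    mse = -cross_val_score(tree, X, y, cv=10, scoring="neg_mean_squared_error")
    results.append((alpha, mse.mean(), mse.std() / np.sqrt(len(mse))))
alphas, cv_means, cv_ses = map(np.array, zip(*results))

# 1-SE rule: among trees within one standard error of the minimal CV error,
# keep the simplest one (the largest alpha yields the smallest tree).
best = cv_means.argmin()
alpha_1se = alphas[cv_means <= cv_means[best] + cv_ses[best]].max()
optimal_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha_1se).fit(X, y)
```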
2.2 Artificial neural networks

Artificial neural networks (ANNs) are an information processing technique inspired by the human biological nervous system. An ANN is composed of a large number of highly interconnected processing elements called neurons, which work in parallel to form a complex interpretation system that can be used to solve a wide variety of problems [40]. The majority of ANNs feature hidden layers, which receive signals from prior layers and pass the signal on to the next layer of the network, which could be another hidden layer or the output layer. Thus, after a neuron processes the signal, it is passed in a feed-forward direction until it reaches the output of the ANN. The signals that are processed are altered by the transfer functions in the neurons as well as by the weighted values that interconnect the neurons. ANNs develop an abstract representation of the information presented to the algorithm during the learning stage. The true power of ANNs is their ability to derive meaningful relationships from a complex, non-linear sample space. However, the universal acceptance of ANNs is hindered by the difficulty of deciphering and understanding the relationships that an ANN uses to formulate an output. For this reason, ANNs are often thought of as "black-box" models. ANNs are stochastic in nature and work similarly
to the human brain, with the capacity to learn from experience, or in this case, historic data. The learning process of an ANN is known as training. Training begins by initializing the interconnecting weights between nodes with small random values. Once the network's architecture has been determined, the learning phase adjusts the internal weights of the network to minimize error. If adequately trained, ANNs are capable of being efficient modeling tools due to their ability to adapt to unseen or noisy data. Thus, ANNs have a great capacity to generalize systems and predict outcomes whenever new samples are presented [41–43].
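As a concrete counterpart to this description, a feed-forward network of the kind used later as the Oracle can be trained in a few lines. This is a sketch using scikit-learn's MLPRegressor rather than the NeuroSolutions package used by the authors, with X_train and y_train assumed to exist:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_train, y_train: observed samples for one data set (assumed to exist)
ann = make_pipeline(
    StandardScaler(),                          # scale inputs before training
    MLPRegressor(hidden_layer_sizes=(10, 5),   # two hidden layers, as in Table 2
                 activation="tanh",            # smooth transfer function in each neuron
                 max_iter=5000, random_state=0),
)
ann.fit(X_train, y_train)                      # adjust weights to minimize error
```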
2.3 TREPAN

The tree-forming process of TREPAN is similar to that of traditional decision tree algorithms. However, TREPAN makes certain that a node has a predetermined minimal amount of training samples before partitioning. TREPAN obtains samples from two sources to create a decision tree. The original training samples are the first source, while new samples, generated by TREPAN from the distributions of the available training samples, are the second source. Though the new samples are generated randomly, they are subject to the constraint that they must reach the node being expanded. In all cases, TREPAN queries the trained ANN to assign class labels to the training samples [38, 44].

At each node of the tree, TREPAN maintains (1) a subset of training samples, (2) a set of query instances, and (3) a group of constraints [38]. The reserved set of training samples consists of only those samples that reach the node of the tree. The query instances are used in combination with the training instances to select the splitting test for a node or to assign class labels to the samples in a terminal node. The constraint set represents the restrictions imposed on the instances, which must be satisfied in order to reach the node. The data gathered are used to derive a set of query instances for a newly produced node [28].

TREPAN interacts with the ANN entirely by means of membership queries [45]. A membership query is a question posed to the Oracle, consisting of an instance from the group of instances. When presented with a membership query, the Oracle returns the class label for the respective instance. Since the target is the function represented by the network, the network itself plays the role of the Oracle, and membership queries are answered by classifying the instances with a label using the network. In addition to the network's training instances, TREPAN derives membership queries for other instances in the same way. As the depth of the tree increases, the training instances available at a node decrease, and splitting the data with limited training samples results in low accuracy. To avoid this, TREPAN utilizes the Oracle to derive and answer membership queries at the node, ensuring that the node has at least an agreeable amount of instances before splitting the training samples further or assigning class labels to the instances.
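A membership query can be mimicked in a few lines: filter candidate instances by the node's constraint set, then let the trained network label the survivors. This is a sketch, not TREPAN itself; `ann` is a fitted model such as the pipeline above, and the constraint triple shown is hypothetical:

```python
import numpy as np

def satisfies_constraints(x, constraints):
    """True if instance x reaches the node being expanded, i.e., it passes every
    (feature_index, threshold, go_left) test on the path from the root."""
    return all((x[j] <= t) == go_left for j, t, go_left in constraints)

def membership_query(ann, X_query):
    """The trained network plays the Oracle: it answers membership queries by
    labeling each query instance with its prediction."""
    return ann.predict(np.asarray(X_query))

# Hypothetical node: instances must satisfy feature 0 <= 1.51 to reach it.
constraints = [(0, 1.51, True)]
X_candidates = np.random.default_rng(0).uniform(0.0, 3.0, size=(200, 5))
X_query = np.array([x for x in X_candidates if satisfies_constraints(x, constraints)])
y_query = membership_query(ann, X_query)       # ann: fitted network (assumed)
```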
3 Methodology

This section describes the structure of the methodology, which consists of four phases and is summarized in Fig. 2.

3.1 Data sets introduction

To test the improvements made to CART, two well-known data sets were selected from the UCI data repository [46] (i.e., Body Fat [47] and Boston Housing [18]). The other three data sets come from studies investigating ecological [27, 48], crude oil [49], and wireless communications [50, 51] problems. These data sets were chosen because they represent real-world continuous-value problems. The general characteristics of the data sets are summarized in Table 1.

The Outages data set was based on simulated data related to a single wireless network infrastructure building (WIB), which is subscribed to by 100,000 customers per year. Besides the simulated data, related data were collected from ARMIS, an FCC database. Research has been performed to explore and emphasize the availability, reliability, maintainability, and survivability (ARMS) of wireless networks due to the significant increase in the usage of wireless communication. An outage, or loss of service, results from a service disruption for a certain amount of time. The reliability of a wireless network depends upon the number, size, and time period of outages; reliability increases as outages decrease [50]. Variables of the data set include the mean time to failure (MTTF) and mean time to restore (MTR) of the elements of the WIB. This data set is used throughout the paper to illustrate the method for creating CART+ enhanced decision trees.

The Saginaw Bay data set consists of real-world data collected from Saginaw Bay, located in Michigan. The data include physical and chemical variables such as water temperature, level of water clarity, light attenuation, total suspended solids, total phosphorus, soluble reactive phosphorus, nitrate, ammonia, silica, chloride, particulate organic carbon, and dissolved organic carbon. These variables are used to estimate the content of phytoplankton biomass, which is a source of the chlorophyll level [27, 48]. The level of chlorophyll in the bay area is used to analyze the algae growth, water quality, toxicity, and other ecological conditions of Saginaw Bay.
Fig. 2 CART+ methodology flow chart (Phase 1: identify the optimal tree for the Body Fat, Outages, Boston Housing, Corrosion, and Saginaw Bay regression problems using the CART Pro software, with tree size and accuracy as parameters; Phase 2: create neural network models and identify the best model; Phase 3: create the enhanced decision tree, CART+, again with tree size and accuracy as parameters; Phase 4: compare the prediction accuracy of CART and CART+ and draw conclusions)

Table 1 Summary of regression-type data sets

Database name  | No. of attributes | No. of samples | Dependent attribute
Outages        | 12 | 300 | Reportable outages of wireless infrastructure
Saginaw Bay    | 13 | 976 | Chlorophyll level of Saginaw Bay
Body Fat       | 14 | 150 | Body fat percentage
Corrosion      | 11 | 95  | Corrosion inhibition rate of crude oil
Boston Housing | 13 | 506 | Median value of house in Boston

Body Fat is a continuous problem domain used to predict the percentage of body fat of a person [47]. The data set consists of body weight and circumference measurements such as height, neck, chest, abdomen, hip, thigh, knee, age, biceps, forearm, and wrist.

The Corrosion data set consists of data related to the analysis of 15 Venezuelan crude oils [49]. The data are used to study the effect of crude oil in inhibiting CO2 corrosion. Attributes such as API gravity, sulfur, total nitrogen, total acid number, saturates, aromatics, resins, asphaltenes, vanadium, nickel, and percentage of crude oil are the variables used in the analysis of the crude oil. The American Petroleum Institute (API) scale is the scale used by the petroleum industry to grade crude oils [27].
Finally, the Boston Housing data set, originally investigated by Harrison and Rubinfeld, was chosen as the final data set. This data set was initially used to investigate how various factors affected the cost of homes in the Boston area [18]. The data comprise 14 variables, including crime rate, average number of rooms, tax rate, and percentage of lower-status population.

3.2 Phase 1: benchmark

The first stage of this research was to develop decision trees for the five data sets using the CART Pro software [52]. For all five data sets, 70% of the data were used for training, while 30% were held out for testing [53]. Various methods were used to create the baseline decision tree, namely the minimum cost tree and the tree within one standard error of the minimum cost (the 1-SE rule described in Sect. 2.1.3). To determine the predictive accuracy of the trees created, the relative error rate was calculated [54].

Experimentation is needed to determine an optimal decision tree. For this research, 210 experiments were run across the five data sets. The parameters varied include the minimum number of cases in a parent node, the minimum number of cases in a terminal node, the maximum number of nodes, and the maximum depth [52]. For example, the parent node minimum cases varied from 15 to 30 and the terminal node minimum cases were incremented from 5 to 10, while the maximum number of nodes was set to 50 and the depth was set to 10 throughout the entire experiment. After the results from the experiment were recorded, the next step was to identify the optimal tree with the best prediction accuracy.

3.3 Phase 2: using ANNs as an Oracle

The aim of this phase was to obtain the best ANN, through an experimentation process, for each of the five data sets. Determining the best ANN is important because it serves as the CART+ Oracle, which is discussed in later sections. NeuroSolutions 5.07, a commercial Excel add-in made by NeuroDimensions Inc. [55], was used to develop the ANNs for each of the data sets. The generalized feed-forward and multilayer perceptron ANN architecture types were chosen for the experiment, where the number of hidden layers was varied from 1 to 5 and the number of processing elements was varied from 3 to 20. Once the extensive experimentation was complete, the results for all of the ANNs generated were compared, and the ANN that produced the most accurate testing results was considered the best network. Table 2 summarizes the best performing network architecture for each of the data sets used in this experiment. For example, the network architecture for the Body Fat database was the only one found with a single hidden layer.
In other words, only one layer of connected neurons was between the input and output layers; the remaining architectures included two hidden layers between the input and output layers of the network. For these structures, the table also indicates how many neurons were found through the experimental process. Finally, two ANN types were consistently found to produce the best results: the generalized feed-forward (GFF) and the multilayer perceptron (MLP) neural networks.

3.4 Phase 3: CART+

To develop the CART+ decision tree, the ANN Oracle developed in Phase 2 is used to generate additional artificial samples when attempting to determine decision criteria based on limited samples. The goal of the Oracle is to provide the information learned from the initial ANN training with observed samples in order to enhance the optimal (baseline) decision tree by increasing the accuracy of terminal node splits. The entire procedural flow of generating a CART+ model is shown in Fig. 3. The remainder of this section describes each of the five steps in detail, using the Outages data set as a case study. For more information regarding the Outages data set, readers are referred to the original research by Weckman et al. [51].

3.4.1 Step 1

As shown in Fig. 3, the development of a CART+ model begins by selecting the optimal tree found in Phase 1 of the methodology. In this initial step, the goal is to find the nodes that were split with less data than the predetermined minimum sample size. For this investigation, the minimum sample size for partitioning in the CART+ algorithm was set to 50. Figure 4 depicts the optimal tree found experimentally for the Outages data set. The figure also identifies nine nodes initially partitioned with less than the minimal partitioning sample size of 50; these nodes are indicated in the figure with the letters A through I. It should be noted that the nodes identified in this step are not all terminal nodes (i.e., nodes A, B, and D). For example, in Fig. 4, there are four terminal nodes below A and B. These terminal nodes all have less than the ideal sample size; however, their parent nodes also have less than 50 samples. Therefore, the nodes nearest the root node with less than the minimal sample size should be selected in this step.
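Step 1 can be automated when the tree lives in software. The sketch below uses scikit-learn's tree internals as a stand-in for CART Pro (`cart` is assumed to be a fitted DecisionTreeRegressor, e.g., the optimal tree from the Sect. 2.1.3 sketch); it walks the tree from the root and flags the highest nodes whose sample count falls below the 50-sample threshold, skipping their descendants just as the text prescribes:

```python
def undersampled_subtree_roots(cart, min_samples=50):
    """Return the nodes nearest the root whose training-sample count is below
    min_samples; descendants of a flagged node are not visited."""
    t = cart.tree_
    flagged, stack = [], [0]                  # node 0 is the root
    while stack:
        node = stack.pop()
        if t.n_node_samples[node] < min_samples:
            flagged.append(node)              # flag and do not descend further
        elif t.children_left[node] != -1:     # -1 marks a leaf in sklearn trees
            stack += [t.children_left[node], t.children_right[node]]
    return flagged

# nodes = undersampled_subtree_roots(optimal_tree, min_samples=50)
```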
Table 2 Artificial neural network configuration

Database name  | Network type | No. of hidden layers | No. of neurons in 1st hidden layer | No. of neurons in 2nd hidden layer
Body Fat       | GFF | 1 | 5  | N/A
Corrosion      | MLP | 2 | 10 | 5
Outages        | MLP | 2 | 10 | 5
Saginaw Bay    | GFF | 2 | 10 | 5
Boston Housing | GFF | 2 | 5  | 5

Fig. 3 CART+ procedure (flow chart: Step 1, take the optimal decision tree from CART Pro and identify the occurrence of nodes with sample < 50; Step 2, take the data source and build a histogram based on its discrete distribution; Step 3, create random numbers and use them as production data in the neural networks; Step 4, repeat the above steps for all input variables; Step 5, create new data and develop a new decision tree using CART Pro)
Fig. 4 Optimal tree for Outages (each node shows its splitting rule on MTTFBSCBS, MTTFBS, or MTRBS together with the node STD, average, weight, and sample count N; the nine nodes partitioned with fewer than 50 samples are labeled A through I, e.g., node A with N = 29 splitting on MTTFBS <= 1.72 and node B with N = 31 splitting on MTTFBS <= 1.92)
3.4.2 Step 2

After identifying the nine nodes that were initially partitioned by the CART algorithm with less than 50 samples, the parent node of each is identified in order to determine the splitting value, or data source. In terms of the Outages tree, the parent node for nodes A and B is marked as Node 2 in Fig. 5. This particular node contained 60 observed samples that were initially used by the CART algorithm, and it was split using the input variable MTTFBSCBS (mean time to failure for BSC-BS links) at a value of 1.51. Based on the 60 samples in Node 2, a histogram is made for the MTTFBSCBS input attribute and is plotted in Fig. 6.
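The histogram of Step 2 is simply the empirical distribution of the splitting attribute at the data-source node; for example (a numpy sketch, with stand-in values for the 60 observations):

```python
import numpy as np

rng = np.random.default_rng(0)
x_node = rng.uniform(1.0, 2.1, size=60)        # stand-in for the 60 MTTFBSCBS values

counts, edges = np.histogram(x_node, bins=10)  # frequency per bin (cf. Fig. 6)
probs = counts / counts.sum()                  # empirical bin probabilities
cumulative = np.cumsum(probs)                  # cumulative % curve
```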
3.4.3 Step 3

In this step, random artificial samples are created based on the data source's average value, probability, and frequency. The number of random samples to be generated in this step is determined by the user. However, based on previous efforts, the number of samples generated in this step should be 50–250 [56]. In addition to this heuristic, the user-defined minimum sample size should also be considered. In other words, the number of generated
Fig. 5 Parent node or data source for nodes A and B (Node 2: MTTFBSCBS <= 1.51, STD = 11.022, Avg = 52.986, N = 60, with child nodes A and B and terminal nodes 1–4)
samples should be relatively large in comparison, and the "best" value can ultimately be determined through a robust trial-and-error experimentation procedure. With that being said, 100 random synthetic samples were generated for the Outages example presented in this paper's case study.

3.4.4 Step 4

The process of creating random artificial sample instances is repeated for all of the identified data sources found in Step 2.

3.4.5 Step 5

The synthetic samples randomly created in Steps 3 and 4 are fed into the ANN, which acts as the decision tree Oracle. Once an output is estimated for the artificial samples, additional tests are performed to ensure that reasonable values were produced. In the Outages data set, the dependent variable is REPORTABLE OUTAGES. Once synthetic samples are generated and output values are predicted by the model, the output values are individually inspected to ensure that they lie within the dependent variable's minimum and maximum observed range. Any artificial sample whose predicted output falls outside the observed range is discarded and is not used to determine a splitting criterion for a node identified in Step 1 of the methodology. After this initial check of output feasibility, the remaining data samples are randomized and then sorted based on the splitting criterion. For node A, which is shown in Fig. 7, the splitting criterion had been found to be MTTFBSCBS less than or equal to 1.51. As indicated by the figure, node A contained 29 original values but needed 21 more to meet the minimum splitting sample size of 50. Thus, the pool of artificially created data is checked to determine whether prior node splits would place each sample in the node in question.
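Steps 3 through 5 as just described can be condensed into a short sketch (Python/numpy; `ann` is the fitted Oracle network and the function names are illustrative, not the authors' implementation): synthetic attribute values are drawn from each attribute's empirical histogram, the Oracle predicts an output for every synthetic sample, and samples whose predictions fall outside the observed range of the dependent variable are discarded:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_histogram(x, n, bins=10):
    """Step 3 (sketch): draw n synthetic values for one attribute from its
    empirical (discrete) distribution at the data-source node."""
    counts, edges = np.histogram(x, bins=bins)
    k = rng.choice(len(counts), size=n, p=counts / counts.sum())
    return rng.uniform(edges[k], edges[k + 1])   # uniform within each chosen bin

def synthesize(X_node, ann, n=100, y_lo=0.0, y_hi=1.0):
    """Steps 3-5 (sketch): build synthetic samples attribute by attribute
    (Step 4), query the ANN Oracle (Step 5), and keep only samples whose
    predicted output lies within the observed range [y_lo, y_hi]."""
    X_new = np.column_stack([sample_from_histogram(X_node[:, j], n)
                             for j in range(X_node.shape[1])])
    y_new = ann.predict(X_new)
    keep = (y_new >= y_lo) & (y_new <= y_hi)
    return X_new[keep], y_new[keep]

# The surviving samples are appended to the under-sampled node's data until the
# minimum of 50 is reached, and the CART growing step of Sect. 2.1.1 is re-run.
```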
Fig. 6 Histogram for MTTFBSCBS (frequency and cumulative % across bins from 1.1 to 2.0)
After these checks are made for node A, 21 values were extracted from the pool of artificial samples and added to node A. Using the resulting 50 samples, a new decision tree is built with the original CART growing algorithm described in Phase 1 of the methodology. The enhanced node found for node A is shown in Fig. 7, where the new splitting criterion was found to be an MTTFBS (mean time to failure for BS links) value of 1.38, which had previously been determined to be 1.51. The process (i.e., Steps 1 through 5) illustrated for the node shown in Fig. 7 is repeated for all nine nodes shown in Fig. 4. Once all of the steps are completed for the identified nodes, the nodes from the original "optimal" tree are replaced with the nodes created from the randomly generated synthetic data. The results of this analysis are summarized in a later section.

3.5 Phase 4

Phase 4 compares the optimal decision tree found using the unmodified CART algorithm with the CART+ tree. The predictive accuracy of the learning samples and the test samples is estimated using the mean square error and the relative error rate; for each data set used in the study, both measures were calculated for each tree type.

3.5.1 Performance measures

The predictive accuracy of the learning sample is estimated with reference to the accuracy of the test sample [53]. The tree that performs equally well with both the test and learning samples is considered to be the best model. The mean square error (MSE) measures the amount by which the predicted value of the target variable differs from the observed value, as shown in Eq. 6:

\mathrm{MSE} = \frac{1}{N} \sum_{(x_i, y_i) \in z} \left(y_i - d(x_i)\right)^2    (6)

where y_i is the actual value of the target variable and d(x_i) is the predicted value of the target variable. The relative error is also a performance metric; it is the absolute percentage difference between the actual and predicted values:

\text{Relative Error} = \frac{\left|\text{Actual value} - \text{Predicted value}\right|}{\text{Actual value}}    (7)
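Both measures are straightforward to compute; a small sketch follows (with the per-sample relative error of Eq. 7 averaged over samples here):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean square error, Eq. 6."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def relative_error(y_true, y_pred):
    """Absolute percentage difference of Eq. 7, averaged over samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print(mse([40.0, 52.0], [39.0, 54.0]))             # 2.5
print(relative_error([40.0, 52.0], [39.0, 54.0]))  # ~0.032
```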
3.6 Testing

For the development of CART+, a series of experiments was conducted on the Outages data set. After performing an experiment to find the optimal CART model, CART+ was developed. This section compares the results of both methods using the observed training samples as well as the testing samples that were held out independently for the Outages data set. The optimal tree for Outages was obtained using the one-standard-error method and consisted of 13 enhanced nodes with a relative error equal to 0.07. Table 3 shows the comparison between the optimal and enhanced decision trees. It can be observed from the table that nodes A, D, E, F, G, and H of the enhanced tree showed improvements in accuracy over the optimal tree; however, node B of the enhanced tree could not reduce the MSE and node C could not reduce the relative error. In addition, node H showed the largest reduction in mean square error, 53.5%, while node F showed the largest reduction in relative error, 26.6%, compared to the other nodes. As shown in Table 3, the accuracy for the Outages database improved for both the training and testing data sets. In particular, a reduction of 6.9% in MSE and 9.7% in relative error was achieved on the training data set. Likewise, improvements of 2.6% and 5.0% were obtained in MSE and relative error, respectively, on the independent testing set.

Fig. 7 Node A of the enhanced decision tree (the 50-sample node now splits on MTTFBS <= 1.38 into two new terminal nodes with N = 13 and N = 37)

4 Results
The proposed methodology was applied to the other data sets presented in this research; however, to keep this paper brief and to describe the methodology clearly using the Outages data set, the node-by-node results for the remaining data sets were omitted. This section therefore describes the overall accuracy results that were obtained by applying the CART+ methodology to the real-world data sets selected for this research.

To test the developed methodology of the enhanced decision tree, CART+, five problem domains were investigated. Since TREPAN has already been proven to be an acceptable methodology for categorical (classification) problems, CART+ was only tested on data sets with continuous (regression) dependent variables. In particular, the data sets included in this research were Body Fat [47], Boston Housing [46], Saginaw Bay [27, 48], Corrosion [49], and Wireless Outages [51].

A summary of the testing results for the change in MSE is shown in Table 4. The results demonstrate that, with the exception of the Corrosion database, CART+ developed a more accurate model on the independent testing set for all of the real-world data sets. Among the four improved data sets, the biggest improvement was observed with the Boston Housing data set, with a reduction in MSE of 29.3%. CART+ could not improve the accuracy over CART in the experiment with the Corrosion database, and no major difference in tree size was found between the two decision trees tested. This could be attributed to the fact that the distribution of the data was not continuous in the Corrosion database. Further analysis will be needed to make inferences about this inconsistent performance.

Similarly, CART+ reduced the relative error in all of the data sets except Corrosion, as shown in Table 5. For this particular metric, Body Fat showed the greatest improvement, a 29.7% reduction in relative error, compared to all other data sets used in this study. This suggests that the CART+ algorithm has a greater likelihood of improving data sets with evenly distributed samples; the performance of CART+ appears to be highly dependent on the distribution of the data in the nodes with limited sample sizes.
Table 3 Comparison between optimal and enhanced decision trees

Nodes or data sets | CART MSE | CART+ MSE | Delta MSE (%) | CART relative error | CART+ relative error | Delta relative error (%)
A     | 118.80 | 101.94 | -14.2 | 0.15 | 0.14 | -9.3
B     | 41.37  | 44.73  | +8.1  | 0.11 | 0.11 | -2.5
C     | 12.46  | 10.18  | -18.3 | 0.06 | 0.07 | +10.9
D     | 24.55  | 23.52  | -4.2  | 0.08 | 0.08 | -1.1
E     | 11.58  | 11.16  | -3.6  | 0.07 | 0.06 | -11.5
F     | 8.48   | 4.80   | -43.4 | 0.09 | 0.06 | -26.6
G     | 11.27  | 7.49   | -33.5 | 0.09 | 0.07 | -17.1
H     | 11.50  | 5.35   | -53.5 | 0.06 | 0.04 | -25.2
I     | 15.69  | 10.70  | -31.8 | 0.09 | 0.08 | -15.3
Train | 20.63  | 19.21  | -6.9  | 0.08 | 0.07 | -9.7
Test  | 25.56  | 24.89  | -2.6  | 0.10 | 0.09 | -5.0
Table 4 Delta mean square error for five data sets

Body Fat | Outages | Corrosion | Boston Housing | Saginaw Bay
-7.4%    | -6.9%   | +1.9%     | -29.3%         | -13.7%
Table 5 Delta relative error for five data sets (%)

Body Fat | Outages | Corrosion | Boston Housing | Saginaw Bay
-29.7    | -9.7    | +3.3      | -5.8           | -1.1
5 Conclusion

Decision trees and ANNs have been used in a variety of business and engineering fields. In particular, CART has been used by many practitioners because it accommodates both categorical and continuous dependent variables. Unlike CART, ANNs have the reputation of being black-box models because their estimating properties are often difficult to interpret. Decision trees, in turn, have a limitation when inadequate sample sizes are used to create a partition in a growing tree. The primary objective of this research was to integrate the strengths of both methodologies to overcome a weakness in each. More specifically, the primary goal of this research was to address CART's limitation of using insufficient sample sizes for determining partitions in the data, and thereby to increase its overall accuracy.

A series of experiments was run on five regression problem domains to analyze the impact of CART+. Initially, decision trees were built using the CART Pro software, which builds a tree with user-defined inputs for all of the parameters and data sets. The best neural network model for each database was generated with NeuroSolutions by varying parameters such as the weights and the network architecture. The best models obtained from the ANN experiments were used to generate random artificial samples. The artificial samples generated by the ANN Oracle were then added to the nodes of the tree developed by CART that had fewer samples than the minimum sample size of 50. The newly added data were then used to grow the tree further, thus implementing the proposed CART+ algorithm.

To create the CART+ method, the CART algorithm was extended by incorporating ANNs to create synthetic data when a limited number of samples was available to determine a splitting criterion. This problem stems from the reduction in the data available to the partitioning algorithm as the tree is grown; decision splits determined with limited sample sizes may cause reduced accuracy. The trees developed using the CART and CART+ algorithms were compared for accuracy, and the CART+ algorithm outperformed CART in terms of accuracy for most of the databases. CART+ provides the ability to produce more data by integrating the power of ANNs with the CART decision tree algorithm. This allows the decision tree to grow further with improved predictive accuracy.

5.1 Future research

One of the important accomplishments of this research was to address a limitation of the existing CART method, namely that the amount of data diminishes as the tree grows from the root to the terminal nodes. Although CART+ has shown good initial results, there are enhancements that can be
made in this area of research. For this research, ANNs were used as an Oracle to create additional data, and the decision trees were grown further with those data. However, more advanced techniques for creating the artificial samples more intelligently may lead to increased accuracy. A method that can appropriately analyze the correlation between the independent variables, and provide a solution when the distribution of the data for the variables is discontinuous, could be incorporated. Different distribution-fitting techniques could also be employed in order to improve accuracy. This line of work is currently hindered by the lack of a computerized software implementation: the method, and therefore all of the results, were derived manually using several software packages. If a computerized application were written, researchers could not only reduce the time needed to create a CART+ model, but also experiment with various techniques and strategies that could further improve upon the existing methodology.
References

1. Hsu C-H, Wang M-JJ (2005) Using decision tree-based data mining to establish a sizing system for the manufacture of garments. Int J Adv Manuf Technol 26(5):669–674
2. Witten IH, Frank E (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, Burlington, MA
3. Kriegel H-P, Borgwardt KM, Kröger P, Pryakhin A, Schubert M, Zimek A (2007) Future trends in data mining. Data Min Knowl Discov 15(1):87–97
4. Lavrac N (2004) Introduction: lessons learned from data mining applications and collaborative problem solving. Mach Learn 57:13–34
5. Kusiak A, Smith M (2007) Data mining in design of products and production systems. Annu Rev Control 31:147–156
6. Brand E, Gerritsen R (1998) Decision trees. DBMS, Data Mining Solutions Supplement. http://www.dbmsmag.com/9807m05.html
7. Apte C, Weiss S (1997) Data mining with decision trees and decision rules. Futur Gener Comput Syst 13:197–210
8. Li Y (2006) Predicting materials properties and behavior using classification and regression trees. Mater Sci Eng A 433:261–268
9. Classification and regression trees (C&RT). http://www.statsoft.com/textbook/stcart.html
10. Wilkinson L (1992) Tree structured data analysis: AID, CHAID and CART. Department of Statistics, Northwestern University, Evanston, IL
11. Shih Y-S (2005) QUEST classification tree. http://www.stat.wisc.edu/~loh/quest.html
12. Feldman D, Gross S (2005) Mortgage default: classification trees analysis. J Real Estate Fin Econ 30(4):369–396
13. Questier F, Put R, Coomans D, Walczak B, Vander Heyden Y (2005) The use of CART and multivariate regression trees for supervised and unsupervised feature selection. Chemom Intell Lab Syst 76:45–54
14. Caetano S, Aires-de-Sousa J, Daszykowski M, Vander Heyden Y (2005) Prediction of enantioselectivity using chirality codes and classification and regression trees. Anal Chim Acta 544:315–326
15. Andryashin A (2005) Financial applications of classification and regression trees. MS thesis, Center of Applied Statistics and Economics, Humboldt University, Berlin
16. Waheed T, Bonnell RB, Prasher SO, Paulet E (2006) Measuring performance in precision agriculture: CART—a decision tree approach. Agric Water Manage 84:173–185
17. Provost F, Danyluk A (1999) Problem definition, data cleaning, and evaluation: a classifier learning case study. Informatica 23:123–136
18. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Monterey, CA
19. Hsu C-H, Wang M-JJ (2005) Using decision tree-based data mining to establish a sizing system for the manufacture of garments. Int J Adv Manuf Technol 26:669–674
20. Evans B, Fisher D (1994) Overcoming process delays with decision tree induction. IEEE Expert 9:60–66
21. Wu M-C, Lin S-Y, Lin C-H (2006) An effective application of decision tree to stock trading. Expert Syst Appl 31:270–274
22. Yang B-S, Lin DS, Tan ACC (2005) VIBEX: an expert system for vibration fault diagnosis of rotating machinery using decision tree and decision table. Expert Syst Appl 28:735–742
23. McSherry D (1999) Strategic induction of decision trees. Knowl Based Syst 12:269–275
24. Changnon D, Ritsche M, Elyea K, Shelton S, Schramm K (2000) Integrating climate forecasts and natural gas supply information into a natural gas purchasing decision. Meteorol Appl 7:211–216
25. Moon TH, Sohn SY (2005) Intelligent approach for effective management of governmental funds for small and medium enterprises. Expert Syst Appl 29:566–572
26. Hsia T-C, Shie A-J, Chen L-C (2008) Course planning of extension education to meet market demand by using data mining techniques—an example of Chinkuo Technology University in Taiwan. Expert Syst Appl 34:596–602
27. Rangwala MH (2006) Empirical investigation of decision tree extraction from neural networks. MS thesis, Department of Manufacturing and Systems Engineering, Ohio University, Athens, OH
28. Michie D (1989) Problems of computer-aided concept formation. In: Quinlan JR (ed) Applications of expert systems, vol 2. Addison-Wesley, Wokingham, UK, pp 310–333
29. Gardner MW, Dorling SR (1998) Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmos Environ 32:2627–2636
30. Aakerlund L, Hemmingsen R (1998) Neural networks as models of psychopathology. Biol Psychiatry 43(7):471–482
31. Yang CC, Prasher SO, Landry JA, Ramaswamy HS, Tommasi D (2000) Application of artificial neural networks in image recognition and classification of crop and weeds. Can Agric Eng 42:147–152
32. Laskari EC, Meletiou GC, Tasoulis DK, Vrahatis MN (2006) Studying the performance of artificial neural networks on problems related to cryptography. Nonlinear Anal Real World Appl 7:937–942
33. Madani K. Industrial and real world applications of artificial neural networks, illusion or reality? In: Braz J et al (eds) Informatics in control, automation and robotics I, pp 11–26
34. Allex CF, Shavlik JW, Blattner FR (1999) Neural network input representations that produce accurate consensus sequences from DNA fragment assemblies. Bioinformatics 15(9):723–728
35. Smith L. An introduction to neural networks. Centre for Cognitive and Computational Neuroscience, University of Stirling. http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html
36. Krawiec K, Slowinski R, Szczesniak I (1998) Pedagogical method for extraction of symbolic knowledge from neural networks. In: Polkowski L, Skowron A (eds) Proceedings of the first international conference on rough sets and current trends in computing (RSCTC '98). Springer, London, UK
37. Bhagat PM (2005) Pattern recognition in industry. Elsevier. ISBN 0-08-044538-1
38. Craven MW (1996) Extracting comprehensible models from trained neural networks. PhD thesis, Computer Science Department, University of Wisconsin, Madison, WI
39. Dwyer K, Holte R. Decision tree instability and active learning. Department of Computing Science, University of Alberta, Edmonton, AB, Canada
40. Stergiou C, Siganos D. Neural networks. http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
41. Artificial neural networks. http://www.makhfi.com/tutorial/introduction.htm
42. Nicholls JG, Martin AR, Wallace BG, Fuchs PA (2001) From neuron to brain, 4th edn. Sinauer Associates, Sunderland, MA. ISBN 0878934391
43. Artificial neural networks. https://www.dacs.dtic.mil/techs/neural/neural2.php
44. Chen F (2004) Learning accurate and understandable rules from SVM classifiers. MS thesis, School of Computer Science, Simon Fraser University
45. Angluin D (1988) Queries and concept learning. Mach Learn 2:319–342
46. Murphy PM (1995) UCI repository of machine learning databases—a machine-readable data repository. Department of Information and Computer Science, University of California, Irvine. Anonymous FTP from ics.uci.edu, directory pub/machine-learning-databases
47. Heinz G, Peterson LJ, Johnson RW, Kerk CJ (2003) Exploring relationships in body dimensions. J Stat Educ 11(2). http://www.amstat.org/publications/jse
48. Millie DF, Weckman GR, Pigg RJ, Teste PA, Dyble J, Litaker RW, Carrick HJ, Fahnenstiel GL (2006) Modeling phytoplankton abundance in Saginaw Bay, Lake Huron: using artificial neural networks to discern functional influence of environmental variables and relevance to a Great Lakes observing system. J Phycol 42:336–349
49. Hernandez S, Nesic S, Weckman G, Ghai V (2006) Use of artificial neural networks for predicting crude oil effect on CO2 corrosion of carbon steels. Corrosion (accepted for publication, April 2006)
50. Rastogi P (2005) Assessing wireless network dependability using neural networks. MS thesis, School of Communication, Ohio University, Athens, OH
51. Weckman G, Snow A, Rastogi P, Rangwala M (2008) Assessing wireless network dependability through knowledge extraction via decision trees. IARIA ICONS 2008
52. CART Pro 6.0. http://www.salford-systems.com
53. Classification and regression trees. http://www.statsoft.com/textbook/stathome.html
54. Maindonald J, Braun J (2003) Data analysis and graphics using R: an example-based approach. Cambridge University Press, Cambridge, UK
55. NeuroDimensions, NeuroSolutions v5.07. http://www.nd.com
56. Rangwala M (2006) Empirical investigation of decision tree extraction from neural networks. MS thesis, Department of Industrial and Systems Engineering, Russ College of Engineering and Technology, Ohio University, Athens, OH