Knowl Inf Syst (2012) 31:405–432 DOI 10.1007/s10115-011-0409-1 REGULAR PAPER
A domain-specific decision support system for knowledge discovery using association and text mining Dnyanesh Rajpathak · Rahul Chougule · Pulak Bandyopadhyay
Received: 13 January 2010 / Revised: 17 February 2011 / Accepted: 16 April 2011 / Published online: 18 May 2011 © Springer-Verlag London Limited 2011
Abstract We propose a novel association and text mining system for knowledge discovery (ASTEK) from the warranty and service data in the automotive domain. The complex architecture of modern vehicles makes fault diagnosis and isolation a non-trivial task. The association mining isolates anomaly cases from millions of service and claims records; ASTEK has shown 86% accuracy in correctly identifying the anomaly cases. The text mining subscribes to the diagnosis and prognosis (D&P) ontology, which provides the necessary domain-specific knowledge. The root causes associated with the anomaly cases are identified by discovering frequent symptoms associated with part failures, along with the repair actions used to fix those failures. The best-practice knowledge is then disseminated to the dealers involved in the anomaly cases. ASTEK has been implemented as a prototype in the service and quality department of GM, and its performance has been validated in a real-life setup. On average, the analysis time is reduced from a few weeks to a few minutes, which is a significant improvement in a real-life industrial setting.

Keywords Data mining · Association mining · Text mining · Knowledge synthesis · Decision support for fault diagnosis · Ontology engineering
D. Rajpathak (B) · R. Chougule · P. Bandyopadhyay
Diagnosis and Prognosis Group, India Science Lab, General Motors Global Research and Development, GM Technical Centre India Pvt. Ltd., Creator Building, International Technology Park, Whitefield, Bangalore 560066, Karnataka, India
e-mail: [email protected]
R. Chougule, e-mail: [email protected]
P. Bandyopadhyay, e-mail: [email protected]
1 Introduction

In recent years, there has been a rapid increase in the complexity of automobile technology. Specifically, the electronic and software content embedded in the vehicle architecture has driven significant changes in the area of automotive service and repair [4,18]. This growth has been driven by government regulations imposed by countries such as the USA and the UK: (a) to achieve clean air by limiting auto emissions to 0.41 g of hydrocarbons, 3.4 g of carbon monoxide, and 0.4 g of nitrogen oxides per mile, and (b) to provide improved fuel economy. Due to the ever-growing competitive market, there is a continuous demand to reduce repair and warranty costs by achieving a first-time fix through effective root-cause diagnosis of faults. As a result, it becomes crucial to continuously monitor the functional health of a vehicle and take appropriate corrective actions in case of malfunctions [36]. In the automotive domain, the process of fault diagnosis is classified into on-board diagnosis and off-board diagnosis [38]. The on-board diagnosis takes advantage of the recent electronics and software systems that are embedded in modern automobiles [17], along with the data generated by a variety of sensors. These sensors are set at specific thresholds to monitor the performance of critical engineering parameters of various systems and sub-systems, for example, engine temperature, emission, crankshaft speed, and engine speed, among others. Diagnostic trouble codes (DTCs) are generated if these systems violate the threshold limits to which they are set, and such DTCs are stored in the on-board computer of a vehicle. When a customer visits a dealer to report a fault, the DTCs are extracted to get an insight into the affected systems so that appropriate repair actions can be taken. These repair actions are recorded in the form of labor codes (LCs). Also, a service record called technician verbatim is maintained, consisting of textual information about the diagnosis episode.
Typically, a technician verbatim consists of a fault description, for example "coolant fan always on", that is observed in association with a specific part, for example "radiator", and the corrective actions, for example "replace cooling fan relay", that are performed to fix the fault. This type of diagnosis, which is performed at the service bay for fault isolation, is referred to as off-board diagnosis. Given the overwhelming amount of data generated by the multiple sensors embedded in a vehicle, the task of root-cause investigation to isolate the faults becomes a non-trivial activity. This also restricts a service technician's ability to diagnose the faults correctly in a short time. To alleviate these problems, several fault diagnosis approaches, such as model-based, history-based, and knowledge-based, have been proposed in the literature [34–36]. The model-based approach constructs a physical model to represent the structure of the target system to assist technicians in performing fault diagnosis. The history-based approach analyzes historical data captured in faulty and normal conditions to identify solution patterns that can be used for the case at hand. Finally, the knowledge-based approach embeds the domain-specific knowledge in a knowledge-based system in the form of ontologies [15] and rules or problem-solving methods [13,29]. There are millions of repair claims that are submitted to General Motors (GM) from the dealer network. From this large volume of claims, the service and quality department first needs to distinguish between the successful repair cases and the anomalies, and then identify the root causes that are associated with the anomaly cases. To achieve these objectives, in this paper, we propose a domain-specific Association and Text mining system for Knowledge discovery (ASTEK) that combines both the history-based and knowledge-based approaches.
Three types of anomaly cases frequently appear in the diagnosis data; they are described below:
1. High time to repair: in anomalies of this type, technicians take more time to diagnose the faults than a pre-assigned time window;
2. Expensive repairs: in anomalies of this type, technicians fix the faults by performing expensive repair actions when, in fact, cost-optimal repair actions could also fix them. For instance, an experienced technician may fix the fault "HVAC not functioning properly" by "replacing cooling fan relay", but an inexperienced technician may end up "replacing the cooling fan", causing an expensive repair;
3. Unnecessary repairs: in anomalies of this type, improper repairs are used to fix the observed faults, which causes repeat visits.
In contrast, the successful repair cases are the ones in which the faults are fixed time- and cost-optimally. In theory, service guidelines are available in the form of service manuals, which can be used by technicians to choose the correct set of LCs to fix the observed DTCs. However, no such associations are mentioned explicitly in the service manuals that would enable technicians to correctly select LCs to fix the observed DTCs. To this end, our objective is to establish the {DTC-LC} patterns by mining the field data and then distinguish between the successful and the unnecessary repair cases. Text mining plays a crucial role in analyzing the technician verbatim associated with the field repair cases to identify frequent part failures, the associated symptoms, and the repair actions used to fix the symptoms. However, discovery of correct knowledge from the technician verbatim is a challenging activity because it contains several types of noise: (i) abbreviated text entries: in many cases, the parts, actions, and symptoms are recorded using abbreviations, e.g., "loose fr door chked repair performed", and it is important to establish the correct meaning of these abbreviations to perform the analysis; (ii) incomplete text entries: in some cases the technician verbatim contains incomplete repair information, which makes it difficult to derive precise knowledge. For example, in the technician verbatim "relearned TCM system", only the action "relearned" that is used to repair the part "TCM system" has been recorded, but no information is provided about the symptoms; (iii) term ambiguity: the same term may have two or more meanings in a document; for example, the term "TPS" can refer to "tire pressure sensor" or "tank pressure sensor", and it is necessary to establish the correct meaning of a term by taking the context into account. The main contributions of our work are described below:

1. We provide a principled approach to integrate the association mining and text mining techniques such that they can be used to discover the best-practice knowledge from the diagnosis data;
2. Instead of using the Apriori algorithm, our approach uses constraint-based association mining (where constraints are specified in the form of rules) to successfully identify the patterns and infrequent anomalies in the data;
3. Our text clustering algorithm proposes a novel way to cluster the data by using frequently co-occurring term phrases (of two or more terms) as cluster labels. This provides a more meaningful description of the documents residing in the same cluster, thereby facilitating discovery of the best-practice knowledge;
4. Finally, ASTEK is a practical system that has been successfully implemented in a real-world setting to discover the best-practice knowledge. It has demonstrated reductions in both time (from three to four weeks to a few minutes) and cost in the existing business process.
In the next section, we present the state of the art in related areas. In Sect. 3, we describe the architecture of ASTEK. The association mining and text mining components of ASTEK
are described in Sects. 4 and 5, respectively. In Sect. 6, the performance of our system when implemented in GM is discussed. Finally, in Sect. 7, we conclude the paper by reiterating our main contributions.

2 State of the art

Collaboration and sharing of knowledge between different stakeholders has become a critical function for cost-effective business operations [23]. This is particularly important in experience-based domains like fault diagnosis, where the shared knowledge can be used to establish best practices through aggregate-level data analysis. In ASTEK, novel association mining [1] and text mining [17] algorithms are developed and fused together to facilitate knowledge discovery from the field data. Recently, the principles of knowledge management have been applied in the areas of fault diagnosis and best-practice knowledge identification. Wang [37] integrated cognitive task analysis (CTA), hierarchical clustering, and ontologies for a PC troubleshooting knowledge management system. A domain-specific ontology has been used in [20] to enable the discovery of fault diagnosis knowledge. [31] developed a hybrid reasoning architecture for integrated fault diagnosis and health maintenance of fleet vehicles using dynamic case-based reasoning. From the literature, it appears that the systems reported in [20,28,37] make use of knowledge represented in a structured way (that is, either in the form of ontologies or a case base), but they do not address how new knowledge can be discovered from large historical data to improve the fault diagnosis and best-practice knowledge discovery process. In contrast, our aim is to discover the best-practice knowledge from the historical data and then apply the newly discovered knowledge to fix the anomalies (cf. Sect. 1) for improved fault diagnosis. Association mining is one of the most common techniques in data mining and knowledge discovery [8].
It involves extracting interesting correlations, frequent patterns, associations, or causal structures among sets of items present in a database [1,22]. As reported by [33], different association measures, such as support, confidence, conviction, lift, coverage, correlation, and odds ratio, are used to measure the strength of newly identified associations between dispersed entities recorded in databases. A few applications of association mining techniques have been reported in the area of fault diagnosis, for example [6,25]. The Apriori algorithm has been used in these applications to mine frequent IF-THEN rules to perform fault diagnosis. The limitation of these approaches is that they do not discover infrequent associations (or anomalies) from the data. To alleviate this problem, constraint-based association mining has been used in the proposed work. The incorporation of domain-specific knowledge (constraints) helps to reduce the noise, facilitating identification of meaningful associations between data points. This type of incorporation of domain-specific knowledge is missing in the existing work [6,25]. Text mining techniques are used to derive high-quality knowledge from structured or unstructured text data. Some of the common applications of text mining involve text classification, semantic extraction, sentiment analysis, text clustering, taxonomy generation, document summarization, and so on. Several text mining systems have been reported in the literature: OTTO [5], GATE [10], SAS [11], [14], [21], [28], and Attensity¹, which perform semantic extraction and text clustering tasks. In some cases, tools like Attensity and SAS extract semantics in the form of terms, events, and named entities to hierarchically cluster the documents. In contrast with this, our aim is to extract the domain-specific terms,

¹ http://www.attensity.com.
Fig. 1 The integrated architecture of the ASTEK system
such as parts, actions, and symptoms, and the combinations thereof, along with the relationships that exist between them, to perform more meaningful document clustering. The notion of clustering in Attensity is realized in terms of generating clusters of related terms, while the aim of our text clustering algorithm is to use frequently co-occurring terms, for example {Part_i Symptom_j} or {Symptom_j Action_k}, for document clustering to facilitate knowledge discovery for fault diagnosis. GATE provides a principled approach for text engineering to annotate structured as well as unstructured documents, but limited support has been provided for document clustering. OTTO uses text mining to learn a target ontology from text documents such that the newly learned ontology can be used to categorize text documents by using supervised and unsupervised learning. Again, limited support is provided by the OTTO framework to perform the task of document clustering in such a way that the best-practice knowledge can be discovered. In [14], text mining has been used to identify, represent, and process the emotional connotations of text by using two ontologies: one designed to represent word dependency relations within a sentence, and the other to represent emotions. In [21], a novel approach has been proposed to exploit the semantic relationships existing between two terms to calculate the closeness between them for a refined vector space representation of text documents. Finally, a novel method has been proposed in [28] to measure the similarity between two short text snippets using probabilistic topics. The main aim of this system is to extract the relations of the non-common terms appearing in two text snippets by using several third-party topics.
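As a toy illustration of three of the measures named above, the following sketch computes support, confidence, and lift over a handful of made-up transactions; the codes are invented for illustration, not real field data:

```python
# Toy transactions: each is the set of codes observed in one repair visit.
transactions = [
    {"P0455", "L123"}, {"P0455", "L123"}, {"P0455", "L999"},
    {"P0300", "L123"}, {"P0455", "L123"}, {"P0300", "L777"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Estimated P(consequent | antecedent) over the transaction database."""
    return support(antecedent | consequent, db) / support(antecedent, db)

def lift(antecedent, consequent, db):
    """Confidence normalized by the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, db) / support(consequent, db)

s = support({"P0455", "L123"}, transactions)       # 3/6 = 0.5
c = confidence({"P0455"}, {"L123"}, transactions)  # 3/4 = 0.75
```

A lift above 1 indicates that the labor code co-occurs with the DTC more often than its baseline frequency would predict.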
3 ASTEK architecture

As shown in Fig. 1, at a high level, the ASTEK architecture comprises association mining and text mining components. The Symptoms database consists of DTCs along with the date
on which the DTCs are recorded. The Claims database consists of the LCs used by the technicians to fix the DTCs and the dates on which the specific repairs are performed. It also contains the technician verbatim collected from different dealers. The data preprocessing first retrieves the DTCs and the LCs that are collected from different dealers, and then it makes use of the rule-base, which consists of a set of domain-specific rules as well as temporal constraints used to identify appropriate temporal patterns between DTCs and LCs. The pattern identification module establishes frequent {DTC-LC} patterns that are associated with the same make and model of a vehicle and that are observed within a short time period. The infrequent anomaly cases are also discovered from the {DTC-LC} patterns. Newly discovered patterns are validated by the subject matter expert (SME) in the association validation module to assess their correctness. Finally, the technician verbatim associated with the anomaly cases are retrieved for their investigation. As a part of text mining, the following steps are deployed to discover the knowledge from the technician verbatim extracted by the association mining: document annotation, semantic extraction, term-based clustering, and subject matter expertise-based recommendation. The document annotation subscribes to the D&P ontology and automatically annotates the parts, symptoms, and actions that are recorded in each technician verbatim. The semantic extractor extracts the annotated information in different combinations as follows: {Parts}, {Actions}, {Symptoms}, {Part_i Symptom_j}, {Symptom_j Action_k}, and {Part_i Symptom_j Action_k}. It also identifies the cases of technician verbatim in which the repair actions performed in the field are not the correct ones.
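A minimal sketch of how a semantic extractor of this kind might assemble annotated tokens into the listed combinations; the token tuples and tag names below are assumptions for illustration, not ASTEK's actual data model:

```python
# Annotated tokens from one technician verbatim, as (token, tag) pairs.
# Tags follow the D&P ontology's top-level classes (Part / Symptom / Action).
annotated = [
    ("coolant fan", "Part"), ("always on", "Symptom"),
    ("cooling fan relay", "Part"), ("replace", "Action"),
]

def extract_combinations(tokens):
    """Group annotations by tag, then build {Part Symptom}, {Symptom Action},
    and {Part Symptom Action} combinations as Cartesian products."""
    by_tag = {"Part": [], "Symptom": [], "Action": []}
    for token, tag in tokens:
        by_tag[tag].append(token)
    combos = {
        "Part-Symptom": [(p, s) for p in by_tag["Part"]
                         for s in by_tag["Symptom"]],
        "Symptom-Action": [(s, a) for s in by_tag["Symptom"]
                           for a in by_tag["Action"]],
        "Part-Symptom-Action": [(p, s, a) for p in by_tag["Part"]
                                for s in by_tag["Symptom"]
                                for a in by_tag["Action"]],
    }
    return by_tag, combos

tags, combos = extract_combinations(annotated)
```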
The term-based clustering deploys a novel algorithm to cluster the technician verbatim in such a way that it helps to highlight the most frequent repair actions that are used to fix the symptoms associated with frequent part failures. The best-practice repair actions identified by the algorithm are presented to SMEs for verification.
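The clustering idea, using frequently co-occurring term pairs as cluster labels, can be sketched roughly as follows; this is a strong simplification of the actual algorithm, and the terms are invented examples:

```python
from collections import defaultdict
from itertools import combinations

# Each document is the set of ontology terms annotated in one verbatim.
docs = [
    {"radiator", "leak", "replace"},
    {"radiator", "leak", "seal"},
    {"cooling fan relay", "always on", "replace"},
    {"radiator", "leak", "replace"},
]

def cluster_by_term_pairs(documents, min_docs=2):
    """Label clusters with term pairs that co-occur in at least `min_docs`
    documents; each cluster holds the indices of the matching documents."""
    clusters = defaultdict(list)
    for idx, terms in enumerate(documents):
        for pair in combinations(sorted(terms), 2):
            clusters[pair].append(idx)
    return {label: ids for label, ids in clusters.items()
            if len(ids) >= min_docs}

clusters = cluster_by_term_pairs(docs)
# ("leak", "radiator") labels a cluster containing documents 0, 1, and 3.
```

Because a document contains several term pairs, it can belong to more than one labeled cluster, which matches the idea of describing documents by all of their frequent phrase labels.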
4 Association mining

As described earlier, the Symptoms and the Claims databases consist of millions of DTCs and LCs that are collected from heterogeneous dealers during different time intervals. Usually, some time elapses between when the repair actions are performed at the service bay and when this information is recorded in the databases; therefore, a high degree of overlap is observed between the LCs that are used to fix the DTCs. To correctly identify the LCs that are used to fix the DTCs, along with their relative frequency, it is necessary to identify and extract the temporal information associated with these data points. Also, the abnormal {DTC-LC} patterns must be highlighted as anomaly cases. In the following sections, we describe how the association mining algorithm achieves these tasks.

4.1 Data preprocessing

The data preprocessing selects a subset of the data from the millions of claims that are recorded in the Symptoms and the Claims databases. The selected data points are cleaned to remove the noise to facilitate identification of crucial rules and patterns. The following steps are used to select the appropriate data.
Step 1. The subset of the data is selected by applying filters that take into account the vehicle make, model, model year, build date, and the claims date.
Step 2. Then, the appropriate DTC source is selected, such as either the dealer diagnostics or OnStar.² The dealer diagnostics provides the data collected at the dealer's end, whereas OnStar is a service network that uses wireless communication to provide diagnostic and other services to a driver at the push of a button, to handle situations like crash, fender bender, stolen vehicle, locked out, and so on.

Step 3. At this step, the algorithm uses temporal constraints to restrict the scope of the selected data such that only the data residing within a specific time window is considered, to avoid unnecessary overlap. In other words, multiple DTCs are uploaded from the various sources over a period of time, and therefore temporal patterns in the form of rules are developed to identify the most likely set of LCs that are associated with the DTCs. One such example of a rule is described below: "the events observed for the same vehicle in the Symptoms data are related to the events observed in the Claims data if the lag between "Repair Date" and "DTC Read Date" is within ±x days, e.g., x = 5 days."

The data preprocessing then makes use of the rule-base (cf. Fig. 1), which consists of domain-specific rules and constraints that are used to define precise data merging criteria as well as to eliminate the noise in the knowledge discovery process. The noise gets introduced into the data mainly because, when a vehicle visits a dealer to report faults, the technicians tend to perform some miscellaneous repair/service actions, such as courtesy transportation and tire pressure testing, along with the targeted repair actions that are used to fix the actual DTCs set in a vehicle. Because these miscellaneous repair actions are not directly associated with the DTCs, they add noise to the data.
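The ±x-day lag rule in Step 3 can be sketched as a simple temporal join between symptom and claim records; the (vin, code, date) schema and the example values are hypothetical:

```python
from datetime import date

def related(dtc_read_date, repair_date, window_days=5):
    """A claim relates to a symptom record when the lag between
    "Repair Date" and "DTC Read Date" is within +/- window_days."""
    return abs((repair_date - dtc_read_date).days) <= window_days

def link_events(symptoms, claims, window_days=5):
    """Pair DTC events with claim events for the same VIN inside the window.
    Both inputs are lists of (vin, code, date) tuples (hypothetical schema)."""
    pairs = []
    for vin_s, dtc, d_read in symptoms:
        for vin_c, lc, d_repair in claims:
            if vin_s == vin_c and related(d_read, d_repair, window_days):
                pairs.append((vin_s, dtc, lc))
    return pairs

symptoms = [("VIN1", "P0455", date(2011, 3, 1))]
claims = [("VIN1", "L123", date(2011, 3, 4)),
          ("VIN1", "L999", date(2011, 3, 20))]
links = link_events(symptoms, claims)  # only L123 is within the 5-day window
```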
The existing version of ASTEK consists of about 90 rules; one such constraint rule states: "If the DTCs are related to "Powertrain", then consider the repair actions that are members of System{Engine, Transmission, Fuel Systems, Electrical}". Having applied the temporal constraints and the data preprocessing rules, the associations between the DTCs and the LCs can now be established. Below, we show how these associations are established in our approach. Let

V_i = {V_1, V_2, ..., V_n}    (1)

represent all the cases of vehicles with warranty claims, that is, with repair actions, where 1 ≤ i ≤ n;

S_i = {S_1, S_2, ..., S_n}    (2)

represent the sets of symptoms corresponding to the cases in (1);

L_i = {L_1, L_2, ..., L_n}    (3)

represent the sets of repair actions corresponding to the sets of symptoms in (2);

S_n = {DTC_1, DTC_2, ..., DTC_p}    (4)

represent the diagnostic trouble codes in the set S_n;

L_n = {LC_1, LC_2, ..., LC_q}    (5)

represent the labor codes in the set L_n.

² www.onstar.com.
In other words, the set L_n consists of multiple LCs and the set S_n consists of multiple DTCs, and it is a non-trivial effort to discover the labor codes that can be used to fix a given set of symptoms so as to avoid high time to repair, expensive repairs, and unnecessary repairs.

4.2 Pattern identification

At any given time, there are thousands of vehicles on the road, and it is crucial to find whether any specific {DTC-LC} patterns appear more frequently than the baseline. The pattern identification correctly identifies the {DTC-LC} patterns that are hidden in the millions of claims submitted from the field. At the same time, it also identifies the anomaly cases, which are infrequent in the newly identified {DTC-LC} patterns and hence difficult to discover. In many cases, our algorithm generates a large number of {DTC-LC} patterns, which makes it difficult for the users to comprehend them. In the literature [1,22], several techniques have been proposed to reduce the number of associations, such as statistical constraints/thresholds on support, confidence, and other measures. The notion of support sets a specific threshold limit, and the associations that exceed the threshold are considered the valid ones. However, in our domain it is difficult to use support as the statistical measure due to the infrequent nature of anomalies: the use of a support threshold may result in missing the anomalies due to their low frequency count in the aggregate data. For this reason, the pattern identification uses the preprocessing constraints in the form of rules, described earlier, to discover the anomalies. Furthermore, it also makes use of the notion of confidence (described in formula (6)) to establish the relevance between DTCs and LCs. The value of confidence is the probability of observing a particular LC for given DTCs.
This probability lies in the range 0–1, where 1 states that a specific LC is used for all occurrences of the given DTCs:

Confidence(LC_1, DTC_1, DTC_2) = Prob(LC_1 | DTC_1, DTC_2) = N(LC_1, DTC_1, DTC_2) / N(DTC_1, DTC_2)    (6)

where N(LC_1, DTC_1, DTC_2) is the total number of cases from V_i (1) involving the labor code LC_1 and the diagnostic trouble codes DTC_1 and DTC_2, and N(DTC_1, DTC_2) is the total number of cases from V_i involving the diagnostic trouble codes DTC_1 and DTC_2.

4.3 Association validation

The association validation step is used to evaluate whether the {DTC-LC} patterns established by the algorithm are in fact valid. At the same time, it is also crucial to analyze the distribution of repair cost and repair time associated with all valid associations. The associations and patterns established by our algorithm, along with their distributions, are presented to the SMEs to validate their correctness. The anomaly cases are also presented to the SMEs as the first level of knowledge discovery. It is crucial to bring human intervention into the loop for systems implemented in real-life industry, mainly because the data is often incomplete in nature, and before making any specific claims it is necessary to validate the results with SMEs to gain more confidence in the solution developed by the system. The incorrect results highlighted by the SMEs are removed from the aggregate result set. For example, as shown in Fig. 2, the DTC named "P0455" is fixed by "replacing powertrain control module", and this repair is highlighted as improper by an SME. Similarly, the successful repair cases, i.e., the repair procedures with minimum time and cost, are also identified.
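The confidence measure of formula (6) can be computed directly from case counts; a minimal sketch under an illustrative case schema, where each case pairs its labor codes with its DTCs:

```python
def confidence(cases, lc, dtcs):
    """Confidence(lc | dtcs) = N(lc, dtcs) / N(dtcs), following formula (6).
    Each case is a (labor_codes, dtc_codes) pair of sets (illustrative schema)."""
    n_dtcs = sum(dtcs <= case_dtcs for _, case_dtcs in cases)
    n_both = sum(dtcs <= case_dtcs and lc in case_lcs
                 for case_lcs, case_dtcs in cases)
    return n_both / n_dtcs if n_dtcs else 0.0

# Four hypothetical vehicle cases from V_i.
cases = [
    ({"LC1"}, {"DTC1", "DTC2"}),
    ({"LC1"}, {"DTC1", "DTC2"}),
    ({"LC2"}, {"DTC1", "DTC2"}),
    ({"LC1"}, {"DTC1"}),
]
# LC1 is used in 2 of the 3 cases exhibiting both DTC1 and DTC2.
```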
Fig. 2 The knowledge capturing mechanism of ASTEK to capture anomaly cases highlighted by domain experts
4.4 Technician verbatim extractor

The technician verbatim extractor is used to extract the technician verbatim associated with the anomaly cases to identify the root causes of the anomalies. At the same time, the technician verbatim associated with the successful repair cases (that is, time- and cost-optimal repairs) are also extracted as training cases such that the best-practice knowledge can be derived from them.
5 Ontology-based text mining

Text mining is used to discover knowledge from the technician verbatim associated with both the anomaly and the successful cases. The main aim is to identify the repair actions frequently used to fix the symptoms, from both anomaly and successful cases.

5.1 Document annotation

The document annotation [2,9,26] attaches meta-information to the parts, symptoms, and actions recorded in the technician verbatim, based on a predefined classification scheme, that is, the diagnosis & prognosis (D&P) ontology, for their interpretation. Here, the main research challenge is to handle the different types of noise (cf. Sect. 1) observed in the technician verbatim in order to identify parts, symptoms, and actions. The document annotation has two main benefits: 1. It helps SMEs to focus their attention on the information that is relevant
Fig. 3 Different phases involved in the (semi-) automatic term extraction for ontology construction
for their purpose; and 2. The annotation provides a specific context to the data [2] such that it has a shared interpretation.

5.1.1 The diagnosis and prognosis ontology

Several data sources are used to construct the D&P ontology, such as service manuals, manufacturing verbatim, standard handbooks, online records and catalogs, and technician verbatim. The extracted data is formalized by using the Resource Description Framework Schema (RDFS)³ [3], enabling data storage, exchange, and machine readability in different application domains. (³ RDFS is a World Wide Web Consortium specification for a metadata model, available at http://www.w3.org/TR/rdf-schema/.) In our approach, the D&P ontology is constructed by using a (semi-) automatic ontology construction tool [7], which extracts relevant terms from the domain of discourse. Figure 3 shows the overall process flow used in the (semi-) automatic concept extraction tool to acquire the necessary domain-specific concepts. Due to space limitations, here we only provide an overview of the tool; a more detailed description can be found in [7]. The pre-processing step takes the data coming from heterogeneous data sources and uses the standard document tags, such as keywords, section titles, and standard domain terms, to annotate the textual data. The algorithm makes use of the term-based frequency function [30] to identify the most frequent domain-specific terms in the dataset, which are extracted as a bag-of-terms. Having generated the bag-of-terms, each term from the bag-of-terms is organized into a hierarchy such that a specific term is organized as the subclass of its corresponding abstract term. For this, the processing algorithm uses seed words, which provide the most abstract representation of the domain-specific terms. For instance, the seed word representing
the term "door" can be used to organize more specific terms, such as "front_door" and "back_door". Similarly, the seed word at the second level, say "front_door", can be used to organize corresponding terms such as "right_front_door" or "left_front_door", and so on. While organizing the domain-specific terms, our algorithm also disambiguates between terms that are written using inconsistent vocabulary, for example, "shock absorber, front", "shock frt absorber", and "brake absorber". The most specific engineering nomenclature, say "front shock absorber", is retained as the appropriate name. In the post-processing step, closely related terms, for example, machineResource, handlingResource, and officeResource, are merged into a single term, say businessResource. Having merged the related terms, the term hierarchy is then converted into the RDF syntax to make it machine readable. At this stage, we also enrich the definition of each term by defining the necessary attributes, such as the slot value type (that is, string, number, enumerated, Boolean) and the slot cardinality (minimum or maximum). Finally, the relationships between different terms are defined to provide more domain-specific information about how the different terms are related to each other. The knowledge extraction process is iterated regularly whenever new data is available in order to augment the knowledge base of the D&P ontology. At a coarse-grained level, the D&P ontology is a structure of the form: D&P_onto := (C, C_subclass, R_{Ci→Cj}, I). The class set C represents the top-level concepts, such as Part, Action, PartLocation, Symptom, and LaborCode in our domain. The instances, I, formalize the real-world objects of the domain-specific concepts; for example, the term "frontDoor" is an instance of the concept Door. Furthermore, each concept in the D&P ontology is associated with a "base word" that formalizes the most appropriate domain-specific reference of the concept.
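The seed-word organization described above can be approximated by a small substring-driven hierarchy builder; this is only an illustrative sketch, and the actual (semi-) automatic tool [7] is considerably more sophisticated:

```python
def build_hierarchy(seed, terms):
    """Attach every term containing `seed` as a subclass of it, then recurse,
    so e.g. 'right_front_door' nests under 'front_door' under 'door'."""
    children = [t for t in terms if seed in t and t != seed]
    # Direct subclasses are the children not contained in a sibling term.
    direct = [c for c in children
              if not any(other != c and other in c for other in children)]
    return {c: build_hierarchy(c, children) for c in direct}

terms = ["front_door", "back_door", "right_front_door", "left_front_door"]
tree = build_hierarchy("door", terms)
# tree: front_door and back_door sit under door; the right/left variants
# nest one level deeper, under front_door.
```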
The class PartLocation captures the precise location of each part in the vehicle architecture such that, when there are multiple parts of the same type, it helps the technician to precisely identify which part is affected. The class Action, on the other hand, represents the different types of repairs that can be used to fix the symptoms. The class Symptom allows us to formalize the subjective evidence of faults associated with a part. This class is further specialized into two subclasses, DTCRelatedSymptom and non-DTCRelatedSymptom. The class DTCRelatedSymptom is used to formalize the symptoms that are represented in terms of DTCs, for example P0455, whereas the class non-DTCRelatedSymptom is used to formalize the symptoms that are associated with a system or a subsystem whose faulty conditions need to be monitored by observing audio-visual signatures, for example, leakage, crack, among others. The class–subclass hierarchy C_subclass allows us to organize the classes and sub-classes from our domain systematically. Here, the top-level classes represent the abstract domain-specific concepts, whereas the sub-classes represent more specific representations of the top-level concepts. For instance, while "engine control module" is represented as a sub-class, its corresponding top-level system, "control module", is represented as a top-level class. It is important to remember that a top-level class can have one or more subclasses associated with it. Finally, the binary relationships of the form R_{Ci→Cj} formalize how two classes are associated with each other. For example, the relationship "ActionPerformedOnPart(Action, Part)" formalizes the fact that there must be a specific action that can be used to fix a specific Part. Some of the main relationships in the D&P ontology are PartHasALocation (Part, PartLocation), ActionPerformedOnPart (Action, Part), SymptomAssociatedWithPart (Symptom, Part), ActionRectifiesSymptom (Action, Symptom), and ActionHasLaborCode (Action, LaborCode).
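The binary relationships listed above can be represented as typed triples and queried by their domain and range classes; a minimal sketch using the relation names from the text:

```python
# Relations of the D&P ontology as (relation, domain_class, range_class) triples.
RELATIONS = [
    ("PartHasALocation", "Part", "PartLocation"),
    ("ActionPerformedOnPart", "Action", "Part"),
    ("SymptomAssociatedWithPart", "Symptom", "Part"),
    ("ActionRectifiesSymptom", "Action", "Symptom"),
    ("ActionHasLaborCode", "Action", "LaborCode"),
]

def relations_between(domain_cls, range_cls):
    """Return the names of the relations linking two top-level classes."""
    return [name for name, d, r in RELATIONS
            if d == domain_cls and r == range_cls]
```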
123
416
D. Rajpathak et al.
The existing version of the D&P ontology consists of about 1,400 parts, 400 actions, and 387 symptoms.

5.1.2 Annotation of technician verbatim

The document annotation algorithm performs the following steps to annotate the critical information in each technician verbatim: tokenization, stop word deletion, word stemming, and lexical matching. The lexical matching is a novel algorithm used to handle the different types of noise observed in the data (cf. Sect. 1). Initially, each technician verbatim is split into sentences and then tokenized by using white space and the other delimiters that appear in a technician verbatim: –, _, !, ., ?, :, ;, and ∼. Typically, the text documents contain several non-descriptive terms, referred to as stop words [24], which do not add value to the analysis, and, therefore, the stop words are deleted. This helps to reduce the noise in a document and in turn minimizes the document dimensions. In many cases, a domain-specific symptom phrase (for example, "operating as designed") contains a stop word (for example, "as"); to avoid deleting terms that are part of symptom phrases, each symptom phrase is first checked and the stop words that are part of symptom phrases are ignored. Having deleted the stop words, the morphological variants of action tokens (for example, replacing, checking) and symptom tokens (for example, leaking, leaked) are stemmed into their base forms (for example, replace, check, and leak). The stemmed symptom and action tokens, along with the part tokens, are matched with the corresponding concepts from the D&P ontology by using a full string matching algorithm, and the matched tokens are annotated. In some cases, term tokens are written using inconsistent vocabulary, such as "PC Module", "PCM", or "Powertrain Control Module", and to disambiguate correctly between such linguistic variations a more sophisticated version of the term matching algorithm is used.
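The annotation steps above (tokenization, protected stop word deletion, stemming, lexical matching) can be sketched as follows. The word lists and the crude suffix-stripping stemmer are toy stand-ins, not GM's lexicon or the actual stemmer used.

```python
import re

# Illustrative annotation pipeline: tokenize, drop stop words (protecting
# known symptom phrases), stem a few morphological variants, and match
# tokens against ontology terms. All word lists here are toy examples.
STOP_WORDS = {"the", "a", "as", "is", "was"}
SYMPTOM_PHRASES = {("operating", "as", "designed")}
ONTOLOGY_TERMS = {"battery": "Part", "replace": "Action", "leak": "Symptom"}

def stem(token):
    # Crude suffix stripping standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix):
            base = token[: -len(suffix)]
            for candidate in (base, base + "e"):   # "replac" -> "replace"
                if candidate in ONTOLOGY_TERMS:
                    return candidate
    return token

def annotate(verbatim):
    tokens = [t for t in re.split(r"[\s\-_!.?:;~]+", verbatim.lower()) if t]
    # Protect stop words that occur inside known symptom phrases.
    protected = set()
    for phrase in SYMPTOM_PHRASES:
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i : i + n]) == phrase:
                protected.update(range(i, i + n))
    kept = [t for i, t in enumerate(tokens)
            if t not in STOP_WORDS or i in protected]
    stemmed = [stem(t) for t in kept]
    return [(t, ONTOLOGY_TERMS[t]) for t in stemmed if t in ONTOLOGY_TERMS]
```

For example, "Replaced the leaking battery" yields the annotations replace/Action, leak/Symptom, and battery/Part.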
The algorithm takes each term to be disambiguated and extracts the base word associated with each term in the D&P ontology. The base word associated with each term is checked to see whether the same base word is used for the different terms. The common base word is used as the disambiguated meaning of the terms written using inconsistent vocabulary. The aggregate frequency of the selected base word over the data is calculated to make sure that it represents the intended meaning in a number of cases above a specific threshold. In many other cases, the same term token represents multiple meanings in a document; for instance, the token "TPS" may represent a "tire pressure sensor" or a "tank pressure sensor", and it is crucial to establish the appropriate meaning of the term by taking into account the context in which it appears. To handle such cases, in each sentence the term "Termi" with multiple possible meanings is identified, and a word window of three terms is set on either side of "Termi" to construct pairs of the form {Termi Symptomj} and {Termi Actionk} from the symptoms and actions that appear in the word window. For each pair, the conditional probabilities P(Termi | Symptomj) and P(Termi | Actionk) are calculated to determine in how many cases Termi appears with the common set of Symptomj and Actionk in the corpus. The meaning of Termi with the highest conditional probability of co-occurrence with Symptomj and Actionk is selected as the correct meaning in the given context. The Symptom database consists of over 6,000 DTCs, and in the worst-case scenario the algorithm needs to perform 6,000 iterations to match a DTC that appears in a technician verbatim with the DTCs in the database. To reduce the number of iterations, our algorithm makes use of a heuristic that identifies the DTC-related symptoms, say "P0411". One such heuristic rule is described below:
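The context-based disambiguation can be sketched as scoring each candidate expansion of an ambiguous token by how often it co-occurs with the symptoms and actions observed in the surrounding window. The naive-Bayes-style score below and the corpus statistics are toy illustrations of the co-occurrence idea, not the paper's exact probability computation.

```python
from collections import Counter

# Sketch of context-based disambiguation for an ambiguous token such as
# "TPS": score each candidate expansion by the co-occurrence counts of the
# symptoms/actions seen in its word window, and pick the best candidate.
def disambiguate(context_terms, candidate_stats):
    """context_terms: symptoms/actions near the ambiguous token.
    candidate_stats: {candidate: Counter of co-occurring symptoms/actions}.
    Returns the candidate with the highest co-occurrence score."""
    best, best_score = None, -1.0
    for candidate, counts in candidate_stats.items():
        total = sum(counts.values()) or 1
        score = 1.0
        for term in context_terms:
            # Product of per-term co-occurrence rates (naive-Bayes style).
            score *= counts.get(term, 0) / total
        if score > best_score:
            best, best_score = candidate, score
    return best
```

For instance, "TPS" appearing near "leak" would resolve to "tank pressure sensor" when that expansion dominates the "leak" co-occurrence counts.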
A domain-specific decision support system for knowledge discovery
Fig. 4 Example showing annotated technician verbatim highlighting the parts, actions, and symptoms recorded in each technician verbatim
"if the first character of a token is a letter between 'a' and 'z' and it is followed by numerals between '0' and '9', and the length of the token is five characters, then annotate the token as a diagnostic trouble code". Figure 4 shows a set of technician verbatim in which the crucial terms appearing in the data, such as parts, symptoms, and actions, are annotated automatically.

5.2 Semantic extractor

The semantic extractor is used (a) to extract the annotated information in different combinations and (b) to identify the cases of service repairs in the field that do not match the claimed labor codes. Initially, the annotated information from each technician verbatim is extracted in different combinations: {Parti}, {Symptomj}, {Actionk}, {Parti Symptomj}, {Symptomj Actionk}, or {Parti Symptomj Actionk}. These combinations can be used as labels to cluster the technician verbatim, ensuring that no information that can be discovered from the different clusters is overlooked. In many cases, the repair actions performed to fix specific symptoms do not match the claimed labor codes. For example, a symptom like "dead battery" can be fixed by the repair action "battery recharging", but the labor code associated with the repair action "battery replacement" has been found to be claimed. Such expensive labor code claims cost our business additional money. To identify such and similar mismatches, pairs of the form ({Part1 Action1 }, . . ., {Parti Actionk })Tech-Verbatim are extracted from each technician verbatim, and they are organized in a sequence that reflects the order in which the repair actions were performed by the technician to fix the faults. For example, "check battery voltage" • "recharge a battery" • "recheck battery voltage".
Each labor code is associated with a labor code description that describes the repair action used, along with the part fixed, when that labor code is claimed, for example "Battery–Replacement". The information associated with the labor code description is extracted as the tuple {Parti Actionk }LaborCodeDescription . The tuple {Parti Actionk }LaborCodeDescription is then matched with the pairs ({Part1 Action1 }, . . ., {Parti Actionk })Tech-Verbatim , and the cases of technician verbatim are highlighted in which {Parti Actionk }LaborCodeDescription ∉ ({Part1 Action1 }, . . ., {Parti Actionj })Tech-Verbatim .
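The labor-code consistency check amounts to a membership test between the {Part, Action} pair from the labor code description and the pairs extracted from the verbatim. A minimal sketch, with hypothetical claim data:

```python
# Sketch of the labor-code consistency check: flag claims whose
# {part, action} pair (from the labor code description) never appears
# among the {part, action} pairs extracted from the technician verbatim.
def find_mismatched_claims(claims):
    """claims: iterable of (labor_code_pair, verbatim_pairs), where
    labor_code_pair is the (part, action) tuple from the labor code
    description and verbatim_pairs is the ordered list of (part, action)
    pairs extracted from the technician verbatim."""
    flagged = []
    for labor_code_pair, verbatim_pairs in claims:
        if labor_code_pair not in verbatim_pairs:
            flagged.append((labor_code_pair, verbatim_pairs))
    return flagged
```

A claim whose description says (battery, replace) while the verbatim records only checking and recharging would be flagged for investigation.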
5.3 Term-based clustering

The term-based clustering algorithm constructs clusters of technician verbatim, taking as input the annotated terms extracted by the semantic extractor and the corpus of technician verbatim. For this, our approach constructs the following three types of clusters:

1. Part-based cluster: the part-based clusters are constructed by using the part terms extracted by the semantic extractor, {Parti } ∈ {P1 , P2 , . . ., Pi }, as the cluster labels, so that clusters of technician verbatim {d1 , d2 , . . ., dn } can be constructed, where each cluster, say dn , contains the records of a specific part.
2. Part-Symptom cluster: at this stage the part-based clusters are further divided into sub-clusters (Sub-Clus1 , Sub-Clus2 , . . ., Sub-Clusm ) such that each sub-cluster contains the technician verbatim corresponding to a specific part from {Parti } and a symptom associated with it, for example Sub-Clusm{battery dead} . Different sub-clusters are constructed when the same part has multiple symptoms associated with it, for example ({Sub-Clus11{Battery Dead} , Sub-Clus12{Battery Leak} , . . ., Sub-Clus1m{Battery Inop} }).
3. Part-Symptom-Action cluster: the part-symptom clusters are in turn divided into sub-clusters (Sub-Clus11 , Sub-Clus21 , . . ., Sub-Clusk1 ) such that each cluster contains the technician verbatim recording a specific part, a symptom associated with it, and the repair action used to fix that symptom. If the same symptom is fixed by multiple repair actions, then different sub-clusters are constructed: ({Sub-Sub-Clus11{Battery Dead Recharge} , Sub-Sub-Clus12{Battery Dead Replace} , . . ., Sub-Sub-Cluskn{Battery Dead Diagnosis} }).
At a high level of abstraction, the term-based clustering algorithm consists of the following steps, applied to each technician verbatim:

Step 1. Determine the sentence boundaries to split each technician verbatim into separate sentences.

Step 2. Construct the part-based cluster. While there are more sentences and parts in {Parti } ∈ {P1 , P2 , . . ., Pi }, match Pi to each technician verbatim and collect the indices of the technician verbatim with a record of Pi to cluster them together.

Step 3. Construct the part-symptom cluster. Select Pi from {Parti } and identify Pi in a technician verbatim. Fix a word window of four terms on either side of Pi and check the following two conditions:
3a. If a single symptom appears in the word window, then a pair of the form {Pi Sj } is constructed. The frequency of {Pi Sj } in the corpus is calculated to check whether it is greater than the minimum frequency threshold, and the indices of the technician verbatim that contain a record of the valid pair are assigned to a cluster.
3b. If multiple symptoms appear in the word window on either side of Pi , then the distance of each symptom from Pi is calculated and the closest symptom is selected first to construct the pair {Pi Sj }. The frequency of {Pi Sj } is calculated to see whether its occurrence in the corpus is greater than the minimum frequency threshold, and the technician verbatim that contain a record of a valid pair are clustered together. If two symptoms are at an equal distance from Pi , then both symptoms are used to construct pairs.
Step 4. Construct the part-symptom-action cluster. First determine the focal term, that is, the symptom Sj that is a member of the pair {Pi Sj } in each technician verbatim, and then set a word window of four terms on either side of Sj . Consider the following two conditions to construct the triples:
4a. If there is a single action within the word window, then a triple of the form {Pi Sj Ak } is constructed. The frequency of the newly constructed triple is calculated to make sure that it is greater than the minimum frequency threshold. The indices of all technician verbatim that contain a record of a valid triple are assigned to the cluster.
4b. If multiple actions appear in the word window on either side of Sj , then after calculating the distance of each action from Sj , the action closest to Sj is selected to construct the triple {Pi Sj Ak }. The frequency of {Pi Sj Ak } is calculated to check whether its occurrence in the corpus is greater than the minimum frequency threshold, and the indices of all technician verbatim that contain a record of {Pi Sj Ak } are selected to construct a cluster. If two actions are at the same distance from Sj , then both actions are used to construct triples.

Below, we describe the different steps involved in our clustering algorithm in further detail.

5.3.1 Determination of a sentence boundary

Typically, multiple parts, symptoms, and actions are recorded in each technician verbatim, and it is necessary to determine which symptom is associated with which part and which repair action is used to fix which symptom. For this, the term-based clustering algorithm first splits each technician verbatim into separate sentences, such that a part, a symptom, and an action appearing in the same sentence are deemed to show a higher degree of association than ones written in different sentences.
In natural language processing [32], sentence boundary determination is the problem of deciding where a sentence begins and ends. Several algorithms reported in the literature [27] make use of a lexicon with part-of-speech probabilities to determine the sentence boundaries. In our approach, a period is used as the sentence delimiter to determine the sentence boundary. However, determining a sentence boundary by using the period as a delimiter is a non-trivial task because a period can also denote an abbreviation, a decimal point, or an ellipsis in text. To this end, we propose the rules described below, which are used in our algorithm to decide the sentence boundary:
Rule 1. If a term token is concatenated with a period that is followed by a white space and the first character of the succeeding term after the white space is a capital letter, for example "door. Fixed …", then the period is considered a valid sentence boundary.
Rule 2. If a term token is concatenated with a period, then it is checked to see whether it is an abbreviation of a domain-specific concept; if a valid abbreviation is followed by a white space and the first character of the succeeding term is a capital letter, for example "brkn. Fixed…", then the period is treated as a valid sentence boundary.
Rule 3. If a valid abbreviation is concatenated with a period and is surrounded by phrases on either side, e.g., "the door is brkn. so it is fixed", and the first character of the phrase following the abbreviation is not a capital letter, then the period is not considered a valid sentence boundary.
Rule 4. If a period is concatenated with integers on either side, without any white space in between, for example "0.5 olh is claimed", then the period is not treated as a valid sentence boundary.
Rule 5. 5a. If a letter is concatenated with a period that is followed by another letter without any white space, and the second letter is concatenated with a second period, for example "i.e.", then the first period is not considered a valid sentence boundary. 5b. If a letter is concatenated with a period, which is followed by another letter concatenated with a period without any white space, and there are no characters after the second period, then the second period is considered a valid sentence boundary, for example "we have to meet at 5 p.m." (end of sentence).
These rules are not exhaustive, but they allow us to determine the sentence boundaries in about 94% of the cases in our domain. The rules are modified to handle other punctuation marks, such as question marks, exclamations, and so on.

5.3.2 Part-based clusters

Having applied the sentence boundary detection algorithm, in each split sentence of a technician verbatim, tvj , first the focal term, that is, the part Pi , is identified, where Pi ∈ Part, tvj ∈ TV, TV = {tv1 , tv2 , . . ., tvj }, and Part ∈ {P1 , P2 , . . ., Pi }. The algorithm makes use of the term frequency function [29] with a minimum frequency threshold of two to check the validity of each Pi recorded in a technician verbatim. The main reason the minimum frequency threshold is set to two is that in our domain three types of technician verbatim are recorded: customer verbatim, correction verbatim, and causal verbatim; if Pi is recorded in only a single verbatim and has no symptoms within the word window set on either side of it, then the algorithm does not have the necessary confidence to consider Pi a valid part failure.
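The period-based boundary rules of Sect. 5.3.1 can be sketched as a simplified splitter. This is a toy reading of Rules 1–5 (capital-letter boundary, decimal points, "i.e."-style abbreviations), not the production implementation, and it covers only a subset of the stated rules.

```python
import re

# Simplified sentence splitter in the spirit of Rules 1-5: a period ends a
# sentence when followed by whitespace and a capital letter (Rules 1-3),
# but not inside decimals like "0.5" (Rule 4) or abbreviations like
# "i.e." (Rule 5a). A trailing fragment becomes the final sentence.
def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"\.", text):
        i = m.start()
        # Rule 4: a period flanked by digits is a decimal point.
        if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
            continue
        # Rule 5a: "x.y." style abbreviation -- the first period is internal.
        if i + 2 < len(text) and text[i + 1].isalpha() and text[i + 2] == ".":
            continue
        # Rules 1-3: boundary only if whitespace + capital letter follow.
        rest = text[i + 1 :]
        if rest[:1] == " " and rest[1:2].isupper():
            sentences.append(text[start : i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

For example, "replaced door. Checked i.e. verified battery. Ok now" splits into three sentences, while the period in "0.5 olh" is left intact.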
Having verified the validity of Pi , the two most similar clusters are merged, that is, C{Pi} ∪ C{Pj} = C{PiPj} [12]. The process is iterated until one large cluster is constructed, which consists of the set of technician verbatim with a record of Pi . The distance between two clusters, C{Pi} and C{Pj} , is calculated by using average linkage [30] before they are merged. A similar procedure is used to construct the clusters of technician verbatim with the records of the remaining parts from {P1 , P2 , . . ., Pi }.

5.3.3 Part-Symptom clusters

The part-symptom clustering algorithm takes as input the clusters constructed by the part-based algorithm and the pairs {Part Symptom}sem-extract extracted by the semantic extractor as the cluster labels, and the part-based clusters are then further subdivided. In each sentence of a technician verbatim, first the focal term, Pi , is identified, and then a word window of four terms is set on either side of Pi . Pairs of the form {Pi Sj } are constructed by taking into account the symptoms Sj that appear in the word window. Each newly constructed pair {Pi Sj } is then matched with the label pair {Part Symptom}sem-extract to make sure that {Pi Sj } is a valid pair. If the pair {Pi Sj } is not a member of {Part Symptom}sem-extract , then instead of directly using {Pi Sj } to construct clusters, its frequency is calculated to see whether its occurrence in the corpus is greater than the minimum frequency threshold. The indices of the technician verbatim that contain a record of a valid {Pi Sj } are assigned to the same cluster.
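The average-linkage merging used for the part-based clusters (Sect. 5.3.2) can be sketched as a standard agglomerative loop. The items and distance function below are toy stand-ins for technician verbatim and a real document distance.

```python
# Sketch of agglomerative merging with average linkage: repeatedly merge
# the two closest clusters, where cluster distance is the mean pairwise
# distance between their members. Items and dist() are toy stand-ins.
def average_linkage(c1, c2, dist):
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def merge_until_one(items, dist):
    clusters = [[x] for x in items]
    history = []                          # record each merge for inspection
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]], dist),
        )
        history.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters[0], history
```

With items 1, 2, and 10 under absolute difference, the singletons {1} and {2} merge first, as average linkage dictates.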
The algorithm also deals with the case in which more than one symptom appears in the word window. First, all possible pairs of the form ({P1 S1 }, . . ., {Pi Sj }) are constructed, and then the distance of each symptom (which is a member of a pair) from the focal term, Pi , is calculated. The symptom closest to Pi is used to construct the candidate pair {Pi Sk }, and its frequency is calculated. A valid pair whose frequency is greater than the minimum frequency threshold is used to construct the cluster. Finally, if two symptoms are at an equal distance on either side of Pi , then the pairs ({Pi Sm }, {Pi Sn }) are constructed by considering both symptoms. The same process is repeated to construct the clusters of technician verbatim by using the pairs ({Pi Sm }, {Pi Sn }).
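The word-window pairing used in the part-symptom step can be sketched as follows; the corpus frequency-threshold check is omitted for brevity, and the tokens are toy data.

```python
# Sketch of part-symptom pairing: around each occurrence of the focal
# part, look in a four-term window on either side for symptom terms and
# keep the closest one (both on a distance tie, as in Step 3b).
def build_part_symptom_pairs(sentence_tokens, part, symptoms, window=4):
    pairs = []
    for i, tok in enumerate(sentence_tokens):
        if tok != part:
            continue
        in_window = [
            (abs(j - i), sentence_tokens[j])
            for j in range(max(0, i - window),
                           min(len(sentence_tokens), i + window + 1))
            if j != i and sentence_tokens[j] in symptoms
        ]
        if not in_window:
            continue
        closest = min(d for d, _ in in_window)
        pairs.extend((part, s) for d, s in in_window if d == closest)
    return pairs
```

Given the tokens "found battery dead and corroded", the symptom "dead" is closer to "battery" than "corroded", so only the pair {battery, dead} is produced.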
5.3.4 Part-Symptom-Action clusters

The part-symptom-action clustering algorithm takes as input the triples {Part Symptom Action}sem-extract extracted by the semantic extractor and the clusters constructed by the part-symptom clustering. As in the part-symptom clustering algorithm, first the focal term, the symptom Sj , is identified in each sentence of a technician verbatim, and a word window of four terms is set on either side of Sj . The algorithm then identifies the actions Ak that appear within the word window; if there is a single action in the window, a single triple of the form {Pi Sj Ak } is constructed. Each newly constructed triple is checked to see whether it matches one of the triples {Part Symptom Action}sem-extract extracted by the semantic extractor, and its frequency is calculated to see whether it is greater than the minimum frequency threshold. The indices of all technician verbatim that contain a record of a valid triple are assigned to the same cluster. When multiple actions appear in the word window of Sj , all the actions appearing in the window are used to construct the triples ({P1 S1 A1 }, . . ., {Pi Sj Aj }), and the distance of each action from Sj is calculated. The triple containing the action closest to Sj is selected, and it is matched against the triples {Part Symptom Action}sem-extract extracted by the semantic extractor to check its validity. The frequency of each valid triple is then calculated to see whether it is greater than the minimum frequency threshold, and the valid triples are used to cluster the technician verbatim. Figure 5 shows the clusters of technician verbatim constructed by using the part-symptom-action algorithm. In this example, the part-based clusters of technician verbatim are constructed at the top level for parts such as the engine control module, the Powertrain control module, and so on.
Then, the part-symptom clusters are constructed, in which for each part, say "powertrain control module", the symptoms associated with it, such as "light on", "failure", "p0601", and so forth, are considered for clustering the technician verbatim. Finally, the part-symptom-action clusters are constructed, in which for each part, say "Powertrain control module", and each symptom associated with it, say "light on", the clusters of the repair actions, such as "replace", "reprogram", and "setup", are used to collect the corresponding technician verbatim. From each cluster of technician verbatim, our algorithm then automatically discovers the best-practice knowledge: the parts that appear frequently in the field data are identified, along with the frequent symptoms associated with each part. Finally, the frequently used repair actions for fixing each symptom are also identified. This newly discovered knowledge is represented in the form of Pareto graphs, as shown in Fig. 6, to provide the necessary decision support to the SMEs.
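The part-symptom-action step and the frequency discovery behind the Pareto graphs can be sketched together. The sentences, action set, and threshold below are toy values, not the production configuration.

```python
from collections import Counter

# Sketch: around each occurrence of the focal symptom, pick the closest
# action(s) in a four-term window to form {part, symptom, action} triples,
# keep those above a minimum corpus frequency, and rank them for a
# Pareto-style report.
def discover_triples(sentences, part, symptom, actions, window=4, min_freq=2):
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != symptom:
                continue
            near = [
                (abs(j - i), tokens[j])
                for j in range(max(0, i - window),
                               min(len(tokens), i + window + 1))
                if j != i and tokens[j] in actions
            ]
            if near:
                closest = min(d for d, _ in near)
                for d, act in near:
                    if d == closest:
                        counts[(part, symptom, act)] += 1
    return [(t, c) for t, c in counts.most_common() if c >= min_freq]
```

With two "battery dead recharge" records and one "battery dead replace" record, only the {battery, dead, recharge} triple clears a frequency threshold of two.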
Fig. 5 The example showing the clusters of technician verbatim constructed by using the part-symptom-action clustering algorithm
As shown in Fig. 7, further knowledge is discovered from each cluster, say {battery dead charge}. As shown in the top two Pareto graphs of Fig. 7, the tool first identifies the distribution of the vehicle makes and build dates associated with the specific claim cases that are members of the cluster, say {battery dead charge}. Then, the next two graphs show the distribution of the cases associated with each cluster, say {battery dead charge}, over the vehicle model and the vehicle build month. These two distributions help the SMEs to quickly realize whether specific types of failures are more frequent in the field for the set of vehicles built within a specific build month. Finally, the last two graphs show the distribution of the cases associated with a cluster over the manufacturing plants in which the vehicles were built, along with the dealers involved in the claim cases. The manufacturing plant information is crucial for the SMEs in realizing whether specific types of failures associated with a part, say the "brake", originate more frequently from a specific manufacturing plant. This information can then be disseminated to the relevant processes that are involved in and are causing the failures, so that the necessary corrective
Fig. 6 The frequent parts, symptoms, and actions in the selected data set
Fig. 7 The example showing how new knowledge is discovered from the cluster of technician verbatim associated with {Pi Sj Ak } triple
actions can be taken right at the manufacturing plant level. Finally, the dealer distribution information gives the SMEs quick insight into the dealers involved in the specific cases, along with their performance over a period of time.
5.4 Subject matter expertise based recommendation

The newly constructed clusters are presented to the SMEs for final auditing. The SMEs validate whether the repair actions identified by the algorithm to fix specific symptoms are indeed the correct ones. The repair actions with minimum cost and time are automatically highlighted by the tool, and they are saved as the best-practice knowledge. This knowledge can be used as a baseline against which to compare the repair actions used in the anomaly cases.
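The automatic highlighting of the minimum-cost, minimum-time repair amounts to a simple selection over the validated repair records. The field names below ('action', 'cost', 'hours') are hypothetical.

```python
# Pick the best-practice repair among SME-validated actions: lowest cost
# first, ties broken by labor hours. Field names are hypothetical.
def best_practice(validated_repairs):
    return min(validated_repairs, key=lambda r: (r["cost"], r["hours"]))
```

For the dead-battery example, recharging would be highlighted over replacement as the cheaper, faster fix.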
6 System performance

Our system has been implemented as a prototype in the service and quality division of GM. The main aim of this exercise was to evaluate the tool's performance to see whether:

A. The tool successfully identified the repair actions that did not match the claimed labor codes among the millions of repair claim records;
B. The document annotation (cf. Sect. 5.1) correctly disambiguated between the part phrases written using inconsistent vocabulary after deploying the D&P ontology and the disambiguation mechanism, and whether the relevant technician verbatim clustered correctly after disambiguation;
C. The tool successfully handled a growing number of data points to establish {DTC-LC} associations and to cluster the corresponding technician verbatim.
All the experiments were performed on Microsoft Windows XP Professional with 3.5 GB of memory and an Intel Core 2 Duo T9300 processor at 2.50 GHz.

6.1 Ability to discover the best-practice knowledge

We selected data points covering three years' worth of claims data associated with different vehicle makes and models. Here, our main objective was to see how correctly ASTEK identified the dealers that were performing cost-effective repairs and the dealers that were claiming a higher cost to fix the same fault. For this, we chose the repair cases associated with two-year-old batteries. From the 2.5 million data points, the association mining tool took only one minute and 20 seconds to identify 2,338 cases in which different types of battery repair operations were performed for the symptom "dead battery". The data preprocessing and pattern identification correctly pointed out that in 2,100 of the 2,338 cases, the dealers did not replace any part and simply repaired the dead battery by performing the "battery recharge" repair action. The tool also highlighted that in these cases no other labor hours were claimed and, therefore, they were cost-optimal repairs. More importantly, the tool isolated 238 cases in which, for the same symptom, the labor code associated with the "battery replacement" repair action was claimed. This was an inconsistent repair, and it was therefore necessary to investigate these cases further. For this, the semantic extractor (cf. Sect. 5.2) retrieved the {Part Action} pairs from each technician verbatim associated with the 238 cases. Further text mining of these claims quickly revealed that, in 86% of the 238 cases, the dealers had indeed performed repair actions such as battery recharging, battery jump starting, and so on, which the SMEs validated as inconsistent with the claimed labor codes in the given context.
[Fig. 8 bar chart: ten clusters of technician verbatim with sizes 175, 165, 154, 130, 110, 97, 87, 84, 78, and 55; chart title "Cluster of technician verbatim before using D&P ontology"; y-axis "Total number of technician verbatim in a cluster"]
Fig. 8 The clusters of technician verbatim constructed before using the D&P ontology and the disambiguation mechanism
The system showed its value by correctly identifying the anomaly cases from the 2.5 million claim records within minutes and with a high level of accuracy. Before our system was implemented, this analysis was performed manually, and the same analysis took about three to four weeks for the given number of data points. Also, in many cases, critical associations were overlooked because it was difficult for the SMEs to identify the anomaly cases manually. The same analysis can now be performed within a few minutes without missing a single claim record with an unusual repair pattern. This type of decision support helped the SMEs significantly in correctly identifying the anomaly cases. The repair practices involved in the anomaly cases were corrected in order to reduce their occurrence in the vehicle fleet population, thereby ensuring a high level of customer satisfaction.

6.2 Ability to handle linguistic variations of the terms

The precise annotation of part phrases written using inconsistent vocabulary in the technician verbatim (for example, "Powertrain Control Module", "PCM", or "PC Module") was a non-trivial task. The main aim of this evaluation was to see how correctly the document annotation disambiguated between inconsistent part phrases, such that the disambiguated part phrase could be used to cluster the relevant technician verbatim. For this experiment, the tool randomly selected 1,135 technician verbatim from the claims data, associated with different vehicle makes and models. Initially, we clustered the technician verbatim without using the D&P ontology and the disambiguation mechanism. As can be seen in Fig. 8, in the first run our system did not handle the different linguistic variations and, as a result, the term-based clustering algorithm ended up constructing heterogeneous clusters of technician verbatim, one for each part variation.
This was wrong because the technician verbatim with records of the same part name, written only with inconsistent vocabulary, ended up in different clusters. Ten different clusters of technician verbatim were constructed with the records of the following parts: Fuel, EBC Module, Electronic Brake Control Module, FTP Sensor, EBCM, PCM, BCM, Body Control Module, Powertrain Control Module, and Fuel Tank Pressure Sensor. To alleviate this problem, the D&P ontology along with the disambiguation mechanism was deployed, and the experiments were performed on the same dataset to see the
[Fig. 9 bar chart: clusters for Body Control Module, Powertrain Control Module, Fuel Tank Pressure Sensor, and Electronic Brake Control Module; chart title "Cluster of technician verbatim after using D&P ontology"; y-axis "Total number of technician verbatim in a cluster"]
Fig. 9 The clusters of technician verbatim constructed after using the D&P ontology and the disambiguation mechanism
improvement that was achieved. The new version of the tool showed a significant improvement in the way it disambiguated linguistic variations. For instance, the part phrases written using inconsistent vocabulary were correctly merged into the same sets, as follows: {EBC Module; Electronic Brake Control Module; EBCM}, {BCM; Body Control Module}, {PCM; Powertrain Control Module}, and {FTP Sensor; Fuel Tank Pressure Sensor}. As shown in Fig. 9, the disambiguated part phrases were used to cluster the technician verbatim successfully. The inspection of the newly constructed clusters revealed that the technician verbatim with records of part phrases written using inconsistent vocabulary were successfully assigned to the same cluster. This type of precise clustering helped the SMEs to discover knowledge from the respective clusters more effectively. Also, note that the technician verbatim associated with the phrase "Fuel" were not considered for clustering because this phrase was identified as noise in the data. We performed a second set of experiments to quantify the value added by the D&P ontology and the disambiguation mechanism (cf. Sect. 5.1.2) to our analysis. Three new data sets consisting of different part abbreviations were used as the surface queries, set 1: (EBCM; BCM; PCM; FTPS), set 2: (BCM; FR DOOR; TPS), set 3: (PCM; ECU; ICM), where TPS stands for tire pressure sensor and ICM for injection control module, to see in how many cases the technician verbatim in which the abbreviations were used to represent only the part were correctly retrieved. The precision and recall measures were used to evaluate the performance of our algorithm by using the formulae shown in (7) and (8), respectively:

Precision = NRR / TVR    (7)
Recall = NRR / TVRR    (8)

where NRR is the number of relevant retrieved technician verbatim, TVR is the total number of retrieved technician verbatim, and TVRR is the number of relevant technician verbatim in the database. Table 1 summarizes the results of our experiment before and after deploying the D&P ontology and the disambiguation mechanism. As can be seen in Table 1, after using the D&P ontology and the disambiguation mechanism there was an improvement in both precision and recall. The recall improved significantly, but still could not achieve 100%
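Formulae (7) and (8) are the standard retrieval measures; the sketch below reproduces one row of Table 1 (EBCM after deploying the ontology) as a numeric check.

```python
# Precision (7) and recall (8): n_rr is the number of relevant retrieved
# verbatim, tv_r the total retrieved, tv_rr the total relevant in the
# database.
def precision(n_rr, tv_r):
    return n_rr / tv_r

def recall(n_rr, tv_rr):
    return n_rr / tv_rr
```

For the EBCM row after deployment (TVR = 129, NRR = 93, TVRR = 125), this gives precision 0.72 and recall 0.74, matching Table 1.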
Table 1 The performance of the document annotation results

Performance of ASTEK before deploying the D&P ontology and the disambiguation mechanism:

Query     TVR   NRR   TVRR   Precision   Recall
EBCM       70    45    125     0.64       0.36
BCM        67    38     87     0.56       0.43
PCM        72    31     95     0.43       0.32
FTPS       18     8     39     0.44       0.20
BCM        73    42    118     0.57       0.35
FR DOOR    87    51    132     0.58       0.38
TPS        63    38     76     0.60       0.50
PCM       124    78    136     0.62       0.57
ECU       110    82    127     0.74       0.64
ICM        82    63     97     0.76       0.64

Performance of ASTEK after deploying the D&P ontology and the disambiguation mechanism:

Query     TVR   NRR   TVRR   Precision   Recall
EBCM      129    93    125     0.72       0.74
BCM        90    57     87     0.63       0.65
PCM        97    52     95     0.53       0.54
FTPS       39    27     39     0.69       0.69
BCM        98    66    118     0.67       0.55
FR DOOR   123    84    132     0.68       0.63
TPS        69    42     76     0.60       0.55
PCM       134    85    136     0.63       0.62
ECU       123    92    127     0.74       0.72
ICM        89    68     97     0.76       0.70
completeness. The main reason was that the existing version of ASTEK did not take into account cases in which two term phrases were concatenated with each other, for instance "EBCMfaulty". In theory it was possible to handle such cases, but based on the SME recommendation we decided not to, as doing so added to the computation time. Moreover, the level of accuracy demonstrated by ASTEK without handling such cases was good enough for the purpose of the analysis.
6.3 Scalability
Scalability is one of the most important criteria that determine how gracefully a system handles a growing amount of data without failing dramatically when implemented in real life. We performed a set of experiments with data sets of different sizes to measure the time taken by ASTEK to correctly identify the frequently used LCs to fix the given DTCs, and also the time taken to cluster the technician verbatim associated with the newly discovered {DTC-LC} patterns. Table 2 summarizes the results of these experiments.
7 Comparison with related work
As discussed in Sect. 2, several association mining tools have been reported in the literature—see [6,21,25,31,37]. In comparison with our association mining approach, in some instances
Table 2 The summary of scalability experiment results

Claim selection   Number of    Time to identify frequent   Technician verbatim for   Time taken to cluster the
period            records      LCs for given DTC (s)       {DTC-LC} patterns         technician verbatim (s)
One week             54,635     5                           123                        3
One month           216,078     8                           311                        5
Three months        618,143    12                           574                       11
One year          2,592,893    21                         1,482                       17
[21,31,37], the domain-specific knowledge has not been used to establish the associations between dispersed data points, which makes it difficult for the users to realize how the data points are associated with each other. In some other cases, [6] and [25], the Apriori algorithm has been used to discover If-Then rules to perform the diagnosis. The algorithms proposed in [6] and [25] are capable of discovering only frequently appearing rules, but in our domain the anomaly cases are infrequent, and it would be difficult to find such cases by using the Apriori algorithm because of the minimum support threshold used in this algorithm. In comparison, our association mining approach (cf. Sect. 4.2) uses constraint-based rules that not only allow us to establish the associations between DTCs and LCs, but also help to identify the infrequent anomaly cases. Furthermore, it highlights several implicit details, such as the time and cost distributions associated with the newly identified cases, the parts changed in each case, and the dealers involved in such cases.
Several text mining systems, such as OTTO [5], GATE [10], SAS [11], [21], and Attensity[6], have been proposed in the literature that perform the semantic extraction and text clustering tasks. In some of them, [10,11] and Attensity, the semantics in the form of terms, events, and named entities is extracted from the document corpus, but only limited support is provided to cluster the documents for discovering the best-practice knowledge. In our approach (cf. Sect. 5.2), the semantics is extracted in multiple combinations, {Part_i}, {Symptom_j}, {Action_k}, {Part_i Symptom_j}, or {Part_i Symptom_j Action_k}, which provides refined cluster labels that can be used for document clustering.
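To make the idea of multi-term cluster labels concrete, the following sketch (with invented verbatim and extracted triples, not the ASTEK implementation) contrasts a single-term label with a co-occurring {Part, Symptom} label:

```python
from collections import defaultdict

# Hypothetical annotated technician verbatim: (part, symptom, action, text)
# tuples, as a semantic-extractor stage might produce them.
verbatim = [
    ("EBCM", "intermittent fault", "replaced module", "cust states abs light on"),
    ("EBCM", "intermittent fault", "reflashed module", "ebcm faulty, abs lamp lit"),
    ("EBCM", "corroded connector", "cleaned connector", "found connector corroded"),
]

# A single-term label ("EBCM") lumps all three records into one cluster;
# a co-occurring {Part, Symptom} label separates the two failure modes.
single, cooccur = defaultdict(list), defaultdict(list)
for part, symptom, action, text in verbatim:
    single[part].append(text)
    cooccur[(part, symptom)].append(text)

print(len(single))   # 1 cluster
print(len(cooccur))  # 2 clusters
```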
In Attensity and SAS, a single term is used as a cluster label to cluster the documents, whereas in our approach frequently co-occurring terms are used to cluster the technician verbatim, which enables more meaningful discovery of the knowledge from the documents residing in the same cluster. GATE provides only limited support for clustering documents, and therefore it cannot be used as a tool to discover the best-practice knowledge. While in our approach the terms, say the parts, symptoms, and actions, that are written using inconsistent vocabulary are successfully disambiguated, only limited support for this type of term disambiguation has been shown by [5,10] and [21]. In ASTEK, the newly discovered knowledge is presented to the SMEs in the form of Pareto graphs, which helps them to identify the best-practice knowledge quickly by highlighting the causes of the anomalies; no such provision is made in the existing tools. In SAS, the frequently appearing terms are used by the expectation maximization and the hierarchical algorithms for document clustering [19]. These algorithms exhibit two main limitations: (1) the clusters are not always meaningful, and (2) the users need to identify the theme associated with each cluster to discover the knowledge. OTTO uses a text mining technique to learn the target ontology from text documents such that this ontology
[6] http://www.attensity.com.
can be used to categorize the text documents by using supervised and unsupervised learning. In comparison with OTTO, our system not only categorizes the text documents, but also derives the best-practice knowledge from the clusters.
8 Conclusion
We proposed a novel system named ASTEK, which combines the association mining and text mining techniques to discover the best-practice knowledge from the service and warranty data in the automotive domain. The association mining technique incorporated domain-specific knowledge to correctly establish the associations between the symptoms and the repair actions from the millions of warranty and claims records. It also correctly identified infrequent anomaly cases that were present in the symptom and repair action associations. As shown in Sects. 6.1 and 6.3, before implementing our system the SMEs were on average taking three to four weeks to analyze the field data, but the same analysis can now be performed within a few minutes without missing a single anomaly case. From the business point of view, the reduction from a few weeks to a few minutes resulted in significant time as well as cost savings. The text mining was used to identify the root causes associated with the anomaly cases and also to learn the best-practice knowledge from the successful repair cases. The system successfully handled several types of noise (cf. Sect. 1) observed in the data while annotating the parts, symptoms, and actions in each technician verbatim. The annotated information was extracted by the semantic extractor in multiple combinations, which provided refined cluster labels for document clustering. Also, in our approach, the notion of frequently co-occurring terms was used to cluster the documents instead of a single term as a cluster label, which allowed us to derive more meaningful best-practice knowledge from each document cluster.
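As a small illustration of why mining based on a minimum support threshold (as in the Apriori-style approaches discussed in Sect. 7) would miss the infrequent anomaly cases that ASTEK targets, consider this sketch; the DTC and labour codes and the 10% threshold are invented for the example, and the anomaly filter is deliberately simplified relative to our constraint-based rules:

```python
from collections import Counter

# Toy claims: each record pairs the DTC observed with the labour code (LC)
# applied (hypothetical codes, for illustration only).
claims = (
    [("DTC-P0420", "LC-CAT-REPLACE")] * 50   # frequent, legitimate pattern
    + [("DTC-P0420", "LC-SENSOR-CLEAN")] * 3  # infrequent anomaly pattern
)

support = Counter(claims)
n = len(claims)

# Apriori-style mining keeps only patterns above a minimum support threshold,
# so the rare anomaly pattern is pruned away.
min_support = 0.10
frequent = {p for p, c in support.items() if c / n >= min_support}

# A constraint-based view instead surfaces the rare {DTC-LC} pairings.
anomalies = {p for p in support if p not in frequent}

print(frequent)   # the 50-record pattern only
print(anomalies)  # the 3-record anomaly, which Apriori would discard
```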
ASTEK has been evaluated in a real-world setup consisting of millions of symptoms and claims records at any given instance. As discussed in Sect. 6.1, the semantic extractor achieved 86% accuracy in correctly identifying the anomaly cases, which resulted in a significant cost-saving opportunity. Also, as discussed in Sect. 6.2, after using the D&P ontology and the disambiguation mechanism the performance of document annotation improved significantly in correctly disambiguating the terms written using inconsistent vocabulary. We ran multiple experiments: in the first two data sets the precision improved by 11% on average, whereas in the last data set it improved by 4% on average. Similarly, the recall improved on average by 45%, 16%, and 10%, respectively, in the three data sets used for the experiments. Finally, the performance of ASTEK was also evaluated on data sets of different sizes, ranging from a few thousand to more than two million claims records. As discussed in Sect. 6.3, the association mining took only 21 s to establish the {DTC-LC} associations, whereas the text mining took only 17 s to construct the clusters of technician verbatim associated with these patterns. In summary, our system provides the necessary collaborative environment to discover the best-practice knowledge from millions of data points. This newly discovered knowledge can be shared between the different stakeholders to improve their performance and, as a result, help maintain a high level of customer satisfaction.
Acknowledgments The authors would sincerely like to thank Ravikumar Karumanchi and Halasya Siva Subramania (General Motors R&D, India) for their useful comments on the initial drafts of this paper, and Marty Case (General Motors, USA) for providing useful domain-specific knowledge that helped us to formalize and improve the performance of the D&P ontology.
References
1. Agarwal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD conference, Washington DC, USA, pp 207–216
2. Agosti M, Ferro N (2005) Annotations as context for searching documents. In: Crestani F, Ruthven I (eds) Proceedings of the 5th international conference on conceptions of library and information science—context: nature, impact and role. Lecture Notes in Computer Science, Springer, Heidelberg, Germany, pp 155–170
3. Beckett D (ed) (2004) RDF/XML syntax specification (revised). W3C Recommendation. http://www.w3.org/TR/rdf-syntax-grammar/
4. Benedettini O, Baines TS, Lightfoot HW, Greenough RM (2009) State-of-the-art in integrated vehicle health management. J Aer Eng 223(2):157–170
5. Bloehdorn S, Cimiano P, Hotho A, Staab S (2005) An ontology-based framework for text mining. LDV Forum 20(1):87–112
6. Buddhakulsomsiri J, Zakarian A (2009) Sequential pattern mining algorithm for automotive warranty data. Comput Ind Eng 57(1):137–147
7. Chougule R, Chakrabarty S (2009) Application of ontology guided search for improved equipment diagnosis in a vehicle assembly plant. In: Proceedings of the fifth annual IEEE conference on automation science and engineering (IEEE CASE 2009). IEEE Press, Bangalore, India, pp 90–95
8. Cios KJ, Pedrycz W, Swiniarski RW (1998) Data mining methods for knowledge discovery. Kluwer, Norwell
9. Corcho O (2006) Ontology based document annotation: trends and open research problems. Int J Metadata Semant Ontol 1(1):47–57
10. Cunningham H (2002) GATE, a general architecture for text engineering. Comput Humanit 36:223–254
11. Davi A, Haughton D, Nasr N, Shah G, Skaletsky M, Spack R (2005) A review of two text-mining packages: SAS TextMiner and WordStat. Am Stat 59(1):89–103
12. Dean PM (ed) (1995) Molecular similarity in drug design. Blackie Academic & Professional, London, pp 111–137
13. Fensel D, Straatman R (1998) The essence of problem-solving methods: making assumptions to gain efficiency. Int J Human-Comput Stud 48:181–215
14. Francisco V, Gervas P, Peinado F (2010) Ontological reasoning for improving the treatment of emotions in text. Knowl Inf Syst 25:421–443
15. Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acq 5(2):199–220
16. Gusikhin O, Rychtyckyj N, Filev D (2007) Intelligent systems in the automotive industry: applications and trends. Knowl Inf Syst 12(2):147–168
17. Hearst MA (1999) Untangling text data mining. University of Maryland, College Park, pp 3–10
18. Janasak KM, Beshears RR (2007) Diagnostics to prognostics—a product technology evolution. In: Proceedings of the 2007 reliability and maintainability symposium (RAMS'07), Orlando, Florida, USA
19. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
20. Jing Y, Choi Y, Xiong Y, Han K, Shin S, Lee Y (2007) A knowledge acquisition and management system for fault diagnosis and maintenance of equipments. In: Proceedings of the 6th WSEAS international conference on applied computer science, Hangzhou, China, pp 296–300
21. Jing L, Ng KM, Huang JZ (2009) Knowledge-based vector space model for text clustering. Knowl Inf Syst 25(1):35–55
22. Kotsiantis S, Kanellopoulos D (2006) Association rules mining: a recent overview. Int Trans Comput Sci Eng 32(1):71–82
23. Kuehnast J, Hengeveld W (2009) Enterprise application integration (white paper). T-Systems Enterprise Services GmbH, Berlin
24. Luhn HP (1960) Keyword in context index for technical literature (KWIC index). Am Doc 11:288–295
25. Li J-q, Niu C-l, Liu J-z, Zhang L-y (2006) Research and application of data mining in power plant process control and optimization. Lec Notes Comp Sci 3930:149–158
26. Ovsiannikov IA, Arbib MA, McNeill TH (1999) Annotation technology. Int J Human-Comput Stud 50(4):329–362
27. Palmer DD, Hearst MA (1994) Adaptive sentence boundary disambiguation. Report No. UCB/CSD 94/797
28. Quan X, Liu G, Lu Z, Ni X, Wenyin L (2010) Short text similarity based on probabilistic topics. Knowl Inf Syst 25:473–491
29. Rajpathak D, Motta E, Zdrahal Z, Roy R (2006) A generic library of problem solving methods for scheduling applications. IEEE Trans Knowl Data Eng 18(6):815–828
30. Salton G, McGill MJ (1983) Introduction to modern information retrieval. McGraw-Hill, New York
31. Saxena A, Wu B, Vachtsevanos G (2005) Integrated diagnosis and prognosis architecture for fleet vehicles using dynamic case based reasoning. In: Proceedings of the IEEE Autotestcon, pp 96–102
32. Stevenson M, Gaizauskas R (2000) Experiments on sentence boundary detection. In: Proceedings of the 6th conference on applied natural language processing, Seattle, USA, pp 84–89
33. Tan P, Kumar V, Srivastava J (2002) Selecting the right interestingness measure for association patterns. In: Proceedings of the SIGKDD'02 conference, Edmonton, Alberta, Canada, pp 32–41
34. Venkatasubramanian V, Rengaswamy R, Yin K, Kavuri S (2003) A review of process fault detection and diagnosis part I: quantitative model-based methods. Comput Chem Eng 27:293–311
35. Venkatasubramanian V, Rengaswamy R, Kavuri S (2003) A review of process fault detection and diagnosis part II: qualitative models and search strategies. Comput Chem Eng 27:313–326
36. Venkatasubramanian V, Rengaswamy R, Yin K, Kavuri S (2003) A review of process fault detection and diagnosis part III: process history based methods. Comput Chem Eng 27:327–346
37. Wang S, Hsu S (2004) A Web-based CBR knowledge management system for PC troubleshooting. Int J Adv Manuf Tech 23(7–8):532–540
38. Williams Z (2006) Benefits of IVHM: an analytical approach. In: Proceedings of the aerospace conference, Big Sky, Montana, USA
Author Biographies
Dnyanesh Rajpathak is a senior researcher in the Diagnosis and Prognosis Group at General Motors' India Science Laboratory. He received a Bachelor of Engineering in Production Engineering from the University of Pune in 1996, a Master of Science in Advanced Manufacturing Management and Technology from the University of Surrey, UK, in 1998, and a Ph.D. in Artificial Intelligence from the Open University, UK, in 2004. Dnyanesh's research interests are in the areas of data mining, text mining, integrated vehicle health management, next-generation automotive engineering and design, knowledge-based engineering, and ontology engineering. Dnyanesh has been awarded the 2010 Charles L. McCuen Award (GM R&D) and the 2011 General Motors President Award for his work in the area of data mining and text mining.
Rahul Chougule is a senior researcher in the Diagnosis and Prognosis Group at General Motors' India Science Laboratory. He received a Bachelor of Engineering and a Master of Engineering in Mechanical and Production Engineering in 1996 and 2000, respectively, from Shivaji University, and a Ph.D. in Mechanical Engineering from the Indian Institute of Technology, Bombay, in 2005. Rahul's research interests are in the areas of knowledge systems, case-based reasoning, multiple criteria decision making, and their applications to manufacturing and operations management. Rahul has been awarded the 2010 Charles L. McCuen Award (GM R&D) and the 2011 General Motors President Award for his work in the area of data mining.
Pulak Bandyopadhyay is a Technical Fellow and a Senior Manager in IVHM and Cyber Security, Global R&D division of General Motors in Warren, US. Pulak received his Ph.D. from the University of Wisconsin, Madison, in the area of Production Engineering. He is a senior member of the Society of Manufacturing Engineers and has served on the board of directors of the North American Manufacturing Research Institute. Pulak's research interests include math-based modeling of processes and systems, agile/reconfigurable system design, data mining technologies for diagnosis and prognosis, and real-time monitoring and control. Pulak has won several external and internal GM awards, including the SME Outstanding Young Manufacturing Engineer award, the R&D 100 top innovation award, and GM's most prestigious "Boss Kettering" award three times.