Journal of Intelligent Information Systems, 22:1, 7–22, 2004 c 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Sequential Association Rule Mining with Time Lags SHERRI K. HARMS
[email protected] Department of Computer Science and Information Systems, University of Nebraska at Kearney, NE 68849, USA JITENDER S. DEOGUN
[email protected] Department of Computer Science and Computer Engineering, University of Nebraska at Lincoln, NE 68588, USA Received November 12, 2002; Revised February 11, 2003; Accepted April 10, 2003 Editor: Mohand-Sa¨ıd Hacid
Abstract. This paper presents MOWCATL, an efficient method for mining frequent association rules from multiple sequential data sets. Our goal is to find patterns in one or more sequences that precede the occurrence of patterns in other sequences. Recent work has highlighted the importance of using constraints to focus the mining process on the association rules relevant to the user. To refine the data mining process, this approach introduces the use of separate antecedent and consequent inclusion constraints, in addition to the traditional frequency and support constraints in sequential data mining. Moreover, separate antecedent and consequent maximum window widths are used to specify the antecedent and consequent patterns that are separated by either a maximal width time lag or a fixed width time lag. Multiple time series drought risk management data are used to show that our approach can be effectively employed in real-life problems. This approach is compared to existing methods to show how they complement each other to discover associations in the drought risk management domain. The experimental results validate the superior performance of our method for efficiently finding relationships between global climatic episodes and local drought conditions. Both the maximal and fixed width time lags are shown to be useful when finding interesting associations. Keywords: sequential rule discovery, time lag, knowledge discovery, drought risk management
1.
Introduction
The emergence of remote sensing, scientific simulation, telescope scanning, and other survey technologies has dramatically enhanced our capabilities to collect spatio-temporal data. However, the explosive growth in data makes the management, analysis, and use of data difficult and expensive. In decision support applications with spatio-temporal data, it is important to study the temporal relationships between events that influence the decision. Unlike transactional data, there are no inherently defined boundaries around the events that might be of interest to inference. Because multiple spatio-temporal data sets contain volumes of data, and often there is a delay between the occurrence of an event and its influence on the other events, finding interesting patterns can be difficult. Thus, there is a need to automate the temporal inference problem, using methods that are able to perform well with volumes of data. These methods not only must be able to compute inference, but also must find the interesting events involved, and define the boundaries around them.
8 1.1.
HARMS AND DEOGUN
Related work
Several different approaches have been investigated for episodal sequential data mining, for example, Bettini et al. (1998) and Mannila et al. (1997). These algorithms use all frequent episodes, within one (long) sequence. The entire set of association rules is produced and significance criterion such as the J-measure for rule ranking are used to determine the valuable rules. Recent work has highlighted the importance of using constraints to focus the mining process on the relevant items. Transactional association rules with item constraints were investigated in Srikant et al. (1997). Similarly, Feng et al. (1999) uses templates as constraints for inter-transaction associations. The GSP algorithm (Srikant and Agrawal, 1996) was the first to consider minimum and maximum gaps, as well as time windows within transactions with the same identifier. Ng et al. (1998) presented an algorithm for extracting all frequent associations matching a rich class of constraints for transactional data. This was improved on in Cong and Liu (2002). An approach that uses temporal constraints on transactional sequences was presented in Zaki (2000). This work considers a sequence to be frequent of it occurs within many transactions. One recent method that uses constraints for discovering episodal associations, is GenREAR (Harms et al., 2001b). This approach improves on previous methods by combining inclusion constraints and closure principles with a sliding window approach on multiple event sequences to find frequent closed episodes. It then generates representative episodal association rules from those episodes. This method is based on the premises that (1) only a subset of the collection of frequent episodes are interesting, and (2) a condensed representation of frequent episodes can be extracted more efficiently than the entire collection. However, the method should maintain the ability to quickly generate the entire collection of frequent patterns. This allowed for finding interesting patterns that exhibited relatively low frequency in the data sets. This approach was shown to be useful when extracting patterns relating to drought weather conditions from global data in Harms et al. (2001b). 1.2.
Contribution
This paper presents a temporal data mining approach that searches for periodic occurrences of events of interest that regularly occur prior to other events of interest. This approach was first introduced in Harms et al. (2002). Knowledge of this type of lagged relationship, can enable proactive decision making governing the inferred data. Many problem domains have a need for this type of analysis, including stock market analysis, risk management, fire management in national forests, understanding telecommunication system failures, forecasting, and more. For example, fire management decision makers can make better decisions for prescribed burns and other suppression techniques when looking at current climatic information, if they know how these climatic factors influence fire risk. This in turn will allow for the protection of adjoining private property, better compliance with air quality/smoke regulations, and reduced impacts on natural communities. The approach described in this paper, called Minimal Occurrences With Constraints and Time Lags, (MOWCATL), finds relationships within multiple data sets by combining
SEQUENTIAL ASSOCIATION RULE MINING WITH TIME LAGS
9
association rules with frequent episodes from Mannila et al. (1997) with event constraints from Harms et al. (2001b). This paper contributes two major enhancements over previous methods: (1) separate antecedent and consequent inclusion constraints, along with separate antecedent and consequent maximum window widths are used to specify the antecedent and consequent patterns that are of interest, and (2) time lags between the occurrence of the antecedent and the occurrence of the consequent are used to find natural delays embedded within episodal relationships. The proposed approach is well suited for data mining problems with the following structural foundations: (1) the problem has several temporal sequences each with interesting groupings of events that occur close together, even if the groupings occur relatively infrequently over the entire sequence, and (2) within the sequences, there are periodic occurrences when the signature of one or more sequence has a delayed present in other sequences, even when the multiple sequences are not globally correlated. 1.3.
Overview
The rest of this paper is organized as follows. Section 2 provides the background information for the MOWCATL approach, which is described in Section 3. For comparison, Section 4 describes the previously developed Representative Episodal Association Rules (REAR) methodology. Section 5 explains the drought risk management problem, which is used as a sample problem to demonstrate the application of these methodologies to a complex domain in Section 6. This research is summarized in Section 7. 2.
Events and episodes
Time series data in continuous domains is inherently inexact, due to the unavoidable imprecision of measuring devices and clocking strategies, and natural occurrences (Goldin, 1995). Additionally, many time series display seasonal variation. We are interested in patterns that represent deviations from the normal seasonal variation. Given this focus on deviations from the norm, it is necessary to remove the noise and the seasonal variation, so that interesting patterns can be detected. Typically, the time series is normalized, as in Goldin (1995) or Tan et al. (2001). The data is then discretized by using a partitioning algorithm and a suitable similarity measure. This segments the data into partitions that have similar characteristics or data that are within a given interval. Each partition identifier is considered as a single event type, and the set of partition labels as the set of events ξ . Different partitioning methods, similarity measures, and interval sizes produce different discretized versions of the same dataset. When multiple sequences are used, each data set is normalized and discretized independently. The time granularity is then converted to a single (finest) granularity before the discovery algorithms are applied to the combined sequences (Bettini et al., 1998). The version of the time series is called an event sequence. Formally, an event sequence is a triple (t B , t D , S) where t B is the beginning time, t D is the ending time, and S is a finite, time-ordered sequence of events (Mannila et al., 1995, 1997). That is, S = (et B , et B+1 p , et B+2 p , . . . (et B+d p = et D )), where p is the step size between events, d is the total
10
Figure 1.
HARMS AND DEOGUN
Example multiple event sequences with episodes.
number of steps in the time interval from [t B , t D ], and D = B + d p. Each eti is a member of a set of events ξ , and ti ≤ ti+1 for all i = B, . . . , D − 1 p. A sequence of events S includes events from a single set of events ξ . Example. Consider the event sequences of 1-month and 3-month Standardized Precipitation Index (SPI) values for Clay Center, Nebraska from January to December 1998 shown in figure 1(a). SPI values show rainfall deviation from normal for a given location at a given time (McGee et al., 1995). For this application, a interval of 1 month was used, and the data was partitioned into 7 partitions: A. Extremely Dry (SPIvalue ≤ −2.0), B. Severely Dry (−2.0 < SPIvalue ≤ −1.5), C. Moderately Dry (−1.50 < SPIvalue ≤ −0.5), D. Normal (−0.5 < SPIvalue < 0.5), E. Moderately Wet (0.5 ≤ SPIvalue < 1.5), F. Severely Wet (1.5 ≤ SPIvalue < 2.0), and G. Extremely Wet (SPIvalue ≥ 2.0). The starting time is t B = 1 (January), the ending time is t D = 12 (December), the interval size p = 1, and the number of intervals d = 12. The set of events for the top sequence (3-month SPI values) is ξ3 = {3A, 3B, 3C, 3D, 3E, 3F, 3G}, and the set of events for the bottom sequence (1-month SPI values) is ξ1 = {1A, 1B, 1C, 1D, 1E, 1F, 1G}, where the
SEQUENTIAL ASSOCIATION RULE MINING WITH TIME LAGS
11
letters indicate the 7 partitions and the number indicates which SPI sequence (1-month or 3-month). An episode in an event sequence is a partial order defined on a set of events. It occurs in a sequence if events are consistent with the given order, within a given time bound. Formally, an episode P is a pair (V, type), where V is a collection of events and type specifies the type of episode (Mannila et al., 1997). An episode is of type parallel if no order is specified and of type serial if the events of the episode have a fixed order. An episode is injective if no event type occurs more than once in the episode. To target the specific subset of episodes of interest to the user, only the episodes that contain events from a user specified inclusion constraint set are used. Episodes built from the inclusion constraint set, referred to as target episodes, are self-contained. That is, the support of any episode with k-events can be found by joining its two generating subsequences of length k − 1 from within the same class. The MOWCATL approach is based on identifying minimal occurrences of episodes along with their time intervals. Definition 1. Given an episode Q and an event sequence S, the window w = [ts , te ] is a minimal occurrence of Q in S, if: (1) Q occurs in the window w, (2) Q does not occur in any proper subwindow of w, and (3) the width of window w is less than the maximum window width parameter, win. In this definition, timestamp ts records the starting time of the occurrence of the episode, and te records its ending time, and ts ≤ te . The width of window w equals te − ts + 1. Notice that the minimal occurrence window widths are not constant for a given episode, but simply the minimal amount of elapsed time between the start of the episode occurrence and the end of the episode occurrence. The support of an episode Q is the number of minimal occurrences of Q. An episode Q is considered frequent if its support conforms to the given minimum support threshold min sup. Example. In figure 1(b), two episodes with their minimal occurrences are shown. The top (dark) episode is a parallel episode (3D, 3F) with minimal occurrences in windows [6,7] and [9,11], for a support of 2. The bottom (light) episode is a serial episode (1C, 1D) with minimal occurrences in windows [1,2], [5,6], and [9,10], for a support of 3. If the bottom episode had been a parallel episode, two additional minimal occurrences would have occurred at windows [4,5] and [8,9]. 3.
The minimal occurrences with constraints and time lags (MOWCATL) method
The MOWCATL method shown below, finds minimal occurrences of episodes and relationships between them as in MINEPI algorithm (Mannila et al., 1997). However, the MOWCATL approach has additional mechanisms for: (1) constraining the search space during the discovery process, (2) allowing a time lag between the antecedent and consequent of a discovered rule, and (3) working with episodes from across multiple sequences. As shown
12
HARMS AND DEOGUN
in the algorithm below, these new mechanisms are integral to the MOWCATL method. The rules generated from this method provide a unique way of analyzing sequential data that is not possible with any existing method, such as MINEPI or Gen-REAR. The method’s focus is on finding episodal rules where the antecedent episode occurs within a given maximum window width wina , the consequent episode occurs within a given maximum window width winc , and the start of the consequent follows the start of the antecedent within a given time lag. This allows it to easily find rules such as “if A and B occur within 3 months, then within 2 months they will be followed by C and D occurring together within 4 months.” 1. MOWCATL Algorithm: 2. Generate Antecedent Target Episodes of length 1 (ATE1,B ); 3. Generate Consequent Target Episodes of length 1 (CTE1,B ); 4. Input sequence S, record occurrences of ATE1,B and CTE1,B episodes; 5. Keep episodes from ATE1,B and CTE1,B that meet min sup; 6. k = 1; 7. while(ATEk,B = ∅) do 8. Generate Antecedent Target Episodes ATEk+1,B from ATEk,B ; 9. For each episode 10. record minimal occurrences of length ≤ wina timestamps; 11. Keep episodes from ATEk+1,B that meet min sup; 12. k++; 13. Repeat or execute in parallel, steps 6–12 for consequent episodes, 14. using CTEk+1,B and winc instead of ATEk+1,B and wina , respectively; 15. Generate combination episodes CEB from ATEB × CTEB ; 16. For each combination 17. record occurrences with lag between antecedent start and consequent start; 18. return supported episode rules in CEB that meet the min conf threshold; 19. End pseudo-code. The MOWCATL algorithm requires a single database pass. As in the MINEPI algorithm, the starting and ending times of the minimal occurrences of each frequent episode is stored. MOWCATL starts by finding the target episodes that contain single events from the inclusion constraint set. Episodes that do not meet the minimum support threshold are pruned. Larger target episodes are built from the smaller target episodes by joining episodes with overlapping minimal occurrences, which occur within the specified maximum window width. In the candidate generation phase, the minimal occurrences of a candidate episode Q are computed as a temporal join of the minimal occurrences of two subepisodes of Q. This procedure finishes when there are no more candidates to look through. The sequence S can be a combination of multiple sequences S1 , S2 , . . . , Sk . An episode can contain events from each of the k sequences. For serial episodes, it is important to know the order of occurrence. To overcome this problem, MOWCATL defines combination event types that are a combination of event types (each from a different sequence) that occur together during the same time stamp. When counting events that occur within a window, each combination event is counted as a single event, along with the other single events from
SEQUENTIAL ASSOCIATION RULE MINING WITH TIME LAGS
13
each sequence. This treats the combination of events from multiple sequences that occur at the same timestamp differently than when the events occur either before or after the other events in an episode. These are added to the ATE1,B and CTE1,B sets in steps 2 and 3. After the frequent episodes are found for the antecedent and the consequent independently, the frequent episodes are combined to form an episode rule. Definition 2. An episode rule r is defined as an expression χ [wina ] ⇒lag ψ[winc ], where χ and ψ are the constrained antecedent and consequent episodes, respectively. The maximum window width for antecedent episodes is wina , and the maximum window width for consequent episodes is winc . The lag is the measure of the delay in time between the occurrences of the antecedent and the respective occurrences of the consequent. In the MOWCATL algorithm, the time lag constraint can be either a fixed time lag constraint or a maximum time lag constraint. When using the maximal time lag constraint, the minimal occurrences of each frequent antecedent episode χ are joined with the minimal occurrences of each frequent consequent episode ψ, as long as the starting time of the minimal occurrence for ψ is after the starting time of the minimal occurrence of χ and no later than the ((starting time of χ ) + lag). The occurrence of χ must end before the occurrence of ψ ends. The number of events in episodes χ and ψ may differ. The informal interpretation of the rule is that if episode χ has a minimal occurrence in the interval [ts , te ], with te − ts + 1 ≤ wina , and ψ has a minimal occurrence in the interval [tr , td ], with td − tr + 1 ≤ winc , and tr is in the range [ts+1 , ts+lag ], and te < td , then the rule r has a minimal occurrence in the interval [ts , td ]. For serial episodes, the starting time of the consequent must be greater than or equal to the ending time of the antecedent, and must be less than or equal to the starting time of the antecedent plus the time lag. Also, the consequence ending time must be greater than the ending time of the antecedent. For parallel episodes, the starting time of the consequent must follow the starting time of the antecedent and can differ at most by the time lag. The order of the events in the parallel episodes is not important, but basically can be used to see if the events in one episode occur “close” to the events in the other episode. As an alternate to the maximal time lag constraint, the MOWCATL algorithm allows for a fixed time lag constraint. With a fixed time lag, the antecedent and the consequent episodes are separated with the exact number of time stamps specified by the lag parameter. It may be used to monitor parameters that occur at a fixed number of timestamps prior to the consequent. In general, with the same parameters, using a fixed time lag generates few rules as compared with using a maximal time lag for a given dataset. The confidence of an episode rule r = χ [wina ] ⇒lag ψ[winc ] in a sequence S with given windows wina , winc , and lag is the conditional probability that ψ occurs, given that χ occurs, under the time constraints specified by the rule. The support of the rule is the number of times the rule holds in the database. Interesting measures such as the J-measure can still be used to insure the validity of the generated rules. Example. Figure 1(b) shows an example parallel episode rule, for the event sequences given in figure 1(a). In this example, the antecedent is the 1-month SPI sequence, shown on the bottom (light colored), and the consequent is the 3-month sequence, shown on
14
HARMS AND DEOGUN
Table 1.
Sample MOWCATL episodes, minimal occurrences, and rules.
Episode/Rule
Minimal occurrences
Support
1C
1-1, 5-5, 9-9
1D
2-2, 4-4, 6-6, 8-8, 10-10
5
1E
3-3, 11-11
2
Confidence
3
3D
2-2, 3-3, 4-4, 5-5, 6-6, 11-11, 12-12
7
3F
7-7, 8-8, 9-9
3
1C, 1D
1-2, 4-5, 5-6, 8-9, 9-10
5
1C, 1E
1-3, 9-11
2
1D, 1E
2-3, 3-4, 10-11
3
3D, 3F
6-7, 9-11
2
1C, 1D, 1E
1-3, 3-5, 9-11
3
1C, 1D ⇒lag=1 3D, 3F
(5-6, 6-7), (8-9, 9-11)
2
1C, 1D ⇒lag=1 3D
(1-2, 2-2), (4-5, 5-5), (5-6, 6-6)
3
.60
1D, 1E ⇒lag=1 3D
(2-3, 3-3), (3-4, 4-4), (10-11, 11-11)
3
1.00
1C ⇒lag=1 3D
(1-1, 2-2), (5-5, 6-6)
2
.67 .60
.40
1D ⇒lag=1 3D
(2-2, 3-3), (4-4, 5-5), (10-10, 11-11)
3
1D ⇒lag=1 3F
(6-6, 7-7), (8-8, 9-9)
2
.40
1E ⇒lag=1 3D
(3-3, 4-4), (11-11, 12-12)
2
1.00
the top (dark colored). The antecedent episode of a sample parallel rule is 1C, 1D, with minimal occurrences shown on the bottom and the consequent episode is 3D, 3F, with minimal occurrences shown on the top. For this example, the parameters are wina = 3, min sup = 2, winc = 3, and lag = 1. The support of the antecedent is 5, the support of the consequent is 2, and the rule 1C, 1D ⇒lag=1 3D, 3F has a support of 2, and a confidence of 2/5. Table 1 shows the episodes and parallel rules generated from this event sequence, using wina = 3, winc = 3, lag = 1, min sup = 2, with the SPI1 sequence as the antecedent and the SPI3 sequence as the consequent.
4.
The Gen-REAR algorithm
Previously, the Gen-REAR method was presented in Harms et al. (2001b). The basic difference between this approach and the MOWCATL approach is that Gen-REAR uses a sliding window approach, and does not allow for a delay in time embedded within the relationship. Gen-REAR, defines a window on an event sequence S as an event subsequence W = {et j , . . . , etk }, where t B ≤ t j , and tk ≤ t D + 1 as in the WINEPI algorithm (Mannila et al., 1995; Mannila et al., 1997). The width of the window W is width(W ) = tk − t j . The set of all windows W on S, with width(W ) = win is denoted as W(S, win). The window width is pre-specified. The frequency of an episode is defined as the fraction of windows in which the episode occurs. Given an event sequence S, and a window width win, the frequency of
SEQUENTIAL ASSOCIATION RULE MINING WITH TIME LAGS
15
an episode P of a given type in S is: fr(P, S, win) =
|w ∈ W(S, win) : P occurs in w| |W(S, win)|
Given a frequency threshold min fr, P is frequent if fr(P, S, win) ≥ min fr. Closure of an episode set X , denoted by closure(X ), is the smallest closed episode set containing X and is equal to the intersection of all frequent episode sets containing X . Gen-REAR generates frequent closed target episodes with respect to a given set of Boolean target constraints B, an event sequence S, a window width win, an episode type, a minimum frequency min fr, and a window step size p. The set of frequent closed episodes, FCE, is used to generate the representative episodal association rules (REAR) that cover the entire set of association rules (Kryszkiewicz, 1998). The cover of a rule r : X ⇒ Y , denoted by C(r ), is the set of association rules that can be generated from r . That is, C(r : X ⇒ Y ) = {X ∪ U ⇒ V | U, V ⊆ Y, U ∩ V = ∅, and V = ∅}. An important property of the cover operator stated in (Kryszkiewicz, 1998) is that if an association rule r has support s and confidence c, then every rule r ∈ C(r ) has support at least s and confidence at least c. Using the cover operator, C(r : X ⇒ Y ), a set of representative episodal association rules with minimum support s and minimum confidence c, REAR(s, c), is defined as follows: REAR(s, c) = {r ∈ AR(s, c) | ∃r ∈ AR(s, c), r = r and r ∈ C(r )}. That is, a set of representative episodal association rules is a least set of episodal association rules that cover all episodal association rules and from which all association rules can be generated. Clearly, AR(s, c) = {C(r ) | r ∈ REAR(s, c)}. Gen-REAR given in Harms et al. (2001b), generates REAR for a given set of frequent closed episodes FCE with respect to a minimal confidence c. Using these techniques on multiple time series while constraining the episodes to a userspecified target set, finds relationships that occur across the sequences. Once the set of representative association rules is found, the user may formulate queries about the association rules that are covered (or represented) by a certain rule of interest for given support and confidence values. These techniques can be employed in many problem domains, including drought risk management. 5.
Drought risk management—An application
Managing risk is an important task in many domains, including Drought Risk Management. Even though droughts occur infrequently and are difficult to detect, they are a normal feature of climate and their occurrence is inevitable (Wilhite, 2002). Drought affects virtually all regions of the world and results in significant economic, social, and environmental impacts. The Federal Emergency Management Agency estimates the annual losses because of drought in the United States at $6–8 billion, which is more than any other natural hazard. For example, losses due to drought in the U.S. were $44 billion in 1980 and $56 billion in 1993, while the loses due to one of the worst flooding disasters of Hurricane Andrew in 1992 were $32.4 billion (Ross and Lott, 2000). The United States does not have a comprehensive drought monitoring system or a national drought policy that emphasizes risk management by promoting the development of drought plans at all levels of government.
16
HARMS AND DEOGUN
Congress recently enacted the Agricultural Risk Protection Act of 2000 to encourage the United Stated Department of Agriculture (USDA) Risk Management Association (RMA) and farmers to be more proactive in managing drought risks. Drought risk management involves both expanding the ability to provide better early warning systems and creating more awareness of water management practices that can reduce demand for water. The National Drought Mitigation Center (NDMC) has made tremendous progress in raising awareness of drought and the ability of government and individuals to manage the risk associated with drought through the use of the Drought Monitor (U.S. Drought Monitor, 2002) and extension outreach services (Goddard et al., 2003). The drought monitor is developed by the NDMC in cooperation with the USDA and National Oceanic and Atmospheric Administration (NOAA). The drought monitor for the week of June 2, 2002 is shown in figure 2. The 2002 drought in the western U.S. lead to several forest fires and severe agricultural losses. Given the complexity of drought, where the impacts from a drought can accumulate gradually over time and vary widely across many sectors, a well-designed decision support system is critical to the effective management of drought response efforts. This work is part of a Digital Government project that is developing and integrating new information technologies for improved government services in the USDA Risk Management Agency (RMA) and the Natural Resources Conservation Service. An advanced Geospatial Decision Support System (GDSS) is being developed to improve the quality and accessibility of drought related data for drought risk management (Goddard et al., 2003). Spatio-temporal
Figure 2.
United States Drought Monitor June 4, 2002.
SEQUENTIAL ASSOCIATION RULE MINING WITH TIME LAGS
17
knowledge discovery techniques are being integrated into the GDSS using a combination of data mining techniques applied to geospatial time-series data, using the MOWCATL and Gen-REAR approaches. This paper compares these approaches, validates their efficiency and investigates the influence of the parameters on the number of rules generated. 6.
Experimental results and analysis
Experiments were designed to find relationships between drought episodes at the automated weather station in Clay Center, NE, and other climatic episodes, from 1950–1999. There is a network of automated weather stations in Nebraska that can serve as long-term reference sites to search for key patterns and link to climatic events. Historical and current climatology datasets were used, including (1) Standardized Precipitation Index (SPI) data from the National Drought Mitigation Center (NDMC), (2) Palmer Drought Severity Index (PDSI) from the National Climatic Data Center (NCDC), (3) North Atlantic Oscillation Index (NAO) from the Climatic Research Unit at the University of East Anglia, UK, (4) Pacific Ocean Southern Oscillation Index (SOI) and Multivariate ENSO Index (MEI) available from NOAA’s Climate Prediction Center, and (5) Pacific/North American (PNA) Index and Pacific Decadal Oscillation (PDO) Index available from the Joint Institute for the Study of the Atmosphere and Ocean. The data for the climatic indices are grouped into seven categories, i.e. extremely dry, severely dry, moderately dry, near normal, moderately wet, severely wet, and extremely wet. In this study, the 3-month, 6-month, 9-month, 12-month SPI, and the PDSI values are grouped into the same seven categories to show the precipitation intensity relative to normal precipitation for a given location and a given month. The SOI, MEI, NAO, PDO, and PNA categories are based on the standard deviation from the normal and the negative values are considered to show the dry periods. After normalizing and discretizing each dataset using the seven categories above, experiments were performed to find whether the methods discover interesting rules from the sequences, and whether the methods are robust. Several window widths, minimal frequency values, minimal confidence values, and time lag values for both parallel and serial episodes were used. Droughts (the three dry categories in each data source) were specified as the target episodes. The global climatic indices (SOI, MEI, NAO, PDO, and PNA) were used as the antecedent data sets, and the local precipitation indices (SPI3, SPI6, SPI9, SPI12, and PDSI) as the consequent data sets. The experiments were ran on a DELL Optiplex GX240 2.0 GHz PC with 256 MB main memory, under the Windows 2000 operating system. Algorithms were coded in C++. Episodes with time lags from MOWCATL are useful to the drought risk management problem when trying to predict future local drought risk considering the current and past global weather conditions. Table 2 represent performance statistics for finding frequent drought episodes with various support thresholds using the MOWCATL algorithm. MOWCATL performs extremely well when finding the drought episodes. At a minimum support of .020 for parallel episodes, the algorithm needs less than one second to look through the 455 candidate drought episodes to find the 229 frequent drought episodes. As shown, the number of frequent episodes decreases rapidly as the support threshold increases.
18
HARMS AND DEOGUN
Table 2. Performance characteristics for parallel and serial drought episodes and rules with MOWCATL and Gen-REAR, Clay Center, NE drought monitoring database, wina = 4 months, winc = 3 months, and maximal lag = 2 months, and min conf = 25%. Parallel
Serial
Min. support
Total cand.
Freq. episodes
Distinct rules
Total time(s)
0.007
2064
1595
460
16
0.008
1510
1104
210
0.010
1214
820
98
0.015
669
392
24
0.020
455
229
7
0.040
188
56
0.02
5723
0.04 0.06
Total cand.
Frequent episodes
Distinct rules
Total time(s)
16574
1799
957
92
3
11790
1077
350
39
2
8887
700
121
34
1
5322
307
16
31
1
3509
162
7
31
0
1
2352
31
0
30
3056
952
17
1052
613
122
3
10757
324
3
683
781
221
39
1
3467
107
2
145
0.08
496
111
25
1
2206
48
0
108
0.10
314
58
5
0
1338
30
0
56
MOWCATL
Gen-REAR
Gen-REAR episodes are useful to the drought risk management problem when considering events that occur together. The lower part of Table 2 represents performance statistics for finding frequent closed drought episodes with various frequency thresholds using the GenREAR algorithm. As shown, the number of frequent closed episodes decreases rapidly as the frequency threshold increases as expected from the nature of drought. Also, at lower minimum frequency levels, the Gen-REAR algorithm needs more time than the MOWCATL algorithm to find the frequent episodes. Table 2 also shows the number of distinct rules generated for these algorithms. As shown, the number of rules between global climatic drought episodes and local drought at Clay Center, NE decreases rapidly as the frequency and support levels increase. In fact, there were only seven parallel drought rules at a 25% confidence level for a support threshold of 0.020 using the MOWCATL algorithm. As shown in Table 2, the MOWCATL algorithm efficiently finds the relationships in these datasets. For serial rules, MOWCATL needs approximately 30 seconds to generate the combination event types (which is a fairly expensive join operation) before the supported rules can be generated. To speed up this process, the frequent single events that are in the inclusion constraint set were found, and then the combination events were built from this smaller set of possible events. Although not shown, the number of frequent episodal rules decreases rapidly as the confidence threshold increases as expected from the nature of drought. However, when using a maximal time lag with serial episodes, even with 80% confidence, MOWCATL
SEQUENTIAL ASSOCIATION RULE MINING WITH TIME LAGS
Figure 3.
19
Comparison of MOWCATL and REAR serial rules.
still generates 9 rules. This once again demonstrates that MOWCATL is able to capture the natural lag in time between the oceanic dry conditions and the local drought conditions in the Clay Center, NE data sets. Examples of how the window widths influence the results are shown in figure 3. The MOWCATL algorithm finds a significant number of patterns and relationships for all window widths specified. In general, wider window widths wina , winc , produce more patterns and relationships, but with less significant meaning. With a 2 month lag in time, the MOWCATL algorithm discovers 98 parallel drought episodal rules and 121 serial drought episodal rules, using wina = 3 and winc = 3. MOWCATL discovers more relationships at similar window widths than the Gen-REAR approach. In fact, Gen-REAR only finds 3 significant serial rules when using a window width of 4 months. The least number of serial rules generated by the MOWCATL approach with a two month time lag, is 29 serial rules, when using one month antecedent and consequent window widths. These examples indicate that there is a natural delay in time after global climatic drought indicators are recognized, before local drought conditions occur. This is encouraging, since knowing this time difference will allow drought risk management experts time to plan for the expected local drought conditions. Finding the appropriate time lag is an iterative process. Examples of how the time lag influences the results are shown in figure 4. The MOWCATL algorithm finds a significant number of relationships for all time lags specified. In general, wider maximum time lags produce more relationships, but again with less significant meaning. With a 2 month maximum time lag, the MOWCATL algorithm discovers 98 parallel drought episodal rules and 121 serial drought episodal rules, using wina = 4 and winc = 3. With a fixed time lag,
20
HARMS AND DEOGUN
Figure 4.
Influence of time lag on rules generated.
36 parallel drought episodal rules are discovered and 66 serial drought episodal rules are discovered. With a 5 month maximum time lag, MOWCATL discovers 734 parallel drought episodal rules and 618 serial drought episodal rules, whereas with a fixed time lag, 41 parallel drought episodal rules and 101 serial drought episodal rules are discovered. Remember, the fixed time lag requires the consequent episodal occurrences to follow the occurrences of the antecedent episode by the exact time lag. Surprisingly, several significant rules (>20) for each fixed time lag were generated. The fact that MOWCATL found more serial rules than parallel rules when using a fixed time lag was also surprising. These results are encouraging, since knowing the exact time delay between the antecedent and consequent will enable better preparedness for an upcoming drought. No comparison is made to the Gen-REAR algorithm, since it does allow for time lags. Clearly, the results produced by these methods need to be coupled with human interpretation of the rules and an interactive approach to allow for iterative changes in the exploration process. Although not shown, the J-measure was used to verify the interestingness of the rules generated. Using these methods, the drought episodes and relationships are provided quickly and without the distractions of the other non-drought data. These are then provided to the drought risk management expert for human interpretation. Similarly, these methods can be employed in other applications. 7.
Conclusion
This paper presents the MOWCATL approach for generating episodal association rules in multiple data sets. This approach was compared with the Gen-REAR approach. This paper showed how MOWCATL and Gen-REAR complement each other to address complex reallife problems such as drought risk management. As demonstrated by the experiments, these
SEQUENTIAL ASSOCIATION RULE MINING WITH TIME LAGS
21
methods efficiently find relationships between climatic episodes and droughts by using constraints, time lags, closures and representative episodal association rules. Other problem domains could also benefit from this approach, especially when there are groupings of events that occur close together in time, but occur relatively infrequently over the entire dataset, with possible time delays between the occurrences. The analysis techniques developed in this work facilitate the evaluation of the temporal associations between episodes of events and the incorporation of this knowledge into decision support systems. For future work, these methods will be expanded to consider the spatial extent of the relationships. Additionally, these approaches are being incorporated into the advanced geospatial decision support system for drought risk management mentioned above.
Acknowledgments We would like to thank Tsegaye Tadesse and the National Drought Mitigation Center at the University of Nebraska at Lincoln for preparing the data sets for our empirical study. We also thank them for their discussions about the drought risk management problem. This research was supported in part by NSF Digital Government Grant No. EIA-0091530 and NSF EPSCOR, Grant No. EPS-0091900.
References Bettini, C., Wang, X.S., and Jajodia, S. (1998). Discovering Temporal Relationships with Multiple Granularities in Time Sequences. IEEE Transactions on Knowledge and Data Engineering, 10(2), 222–237. Cong, G. and Liu, B. (2002). Speed-up Iterative Frequent Itemset Mining with Constraint Changes. In Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi, Japan. Feng, L., Lu, H., Yu, J.X., and Han, J. (1999). Mining Inter-Transaction Associations with Templates. In Proceedings of the 1999 International Conference on Information and Knowledge Management [CIKM’99], Kansas City, Missouri, USA. Goddard, S., Harms, S.K., Reichenbach, S.E., Tadesse, T., and Waltman, W.J. (2003). Geospatial Decision Support for Drought Risk Management. Communications of the ACM, 46(1), 35–37. Goldin, D.Q. and Kanellakis, P.C. (1995). On Similarity Queries for Time-Series Data: Constraint Specification and Implementation. In Proceedings of the 1995 International Conference on the Principles and Practice of Constraint Programming (pp. 137–153). Marseilles, France. Harms, S.K., Deogun, J., Saquer, J., and Tadesse, T. (2001). Discovering Representative Episodal Association Rules from Event Sequences Using Frequent Closed Episode Sets and Event Constraints. In Proceedings of the 2001 IEEE International Conference on Data Mining (pp. 603–606). San Jose, California, USA. Harms, S.K., Deogun, J., and Tadesse, T. (2002). Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences. In Proceedings of the 2002 International Symposium on Methodologies for Intelligent Systems (pp. 432–441). Lyon, France. Kryszkiewicz, M. (1998). Fast Discovery of Representative Association Rules. In Lecture Notes in Artificial Intelligence, Vol. 1424 (pp. 214–221). Proceedings of RSCTC 98, Springer-Verlag. Mannila, H., Toivonen, H., and Verkamo, A.I. (1995). Discovering Frequent Episodes in Sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining [KDD 95] (pp. 210–215). Montreal, Canada. Mannila, H., Toivonen, H., and Verkamo, A.I. (1997). Discovery of Frequent Episodes in Event Sequences. Technical report, Department of Computer Science, University of Helsinki, Finland. Report C-1997-15.
22
HARMS AND DEOGUN
McGee, T.B., Doeskin, N.J., and Kliest, J. (1995). Drought Monitoring with Multiple Time Scales. In Proceedings of the 9th Conference on Applied Climatology (pp. 233–236). Boston, MA. Ng, R., Lakshmanan, L.S., Han, J., and Pang, A. (1998). Exploratory Mining and Pruning Optimizations of Constrained Associations Rules. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA. Ross, T. and Lott, N. (2000). A Climatology of Recent Extreme Weather and Climate Events. Technical report 2000-02, National Climatic Data Center, US Dept. of Commerce. Srikant, R. and Agrawal, R. (1996). Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the Fifth International Conference on Extending Database Technology. Srikant, R., Vu, Q., and Agrawal, R. (1997). Mining Association Rules with Item Constraints. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining [KDD 97] (pp. 67–73). Tan, P., Potter, C., Steinbach, M., Klooster, S., Kumar, V., and Torregrosa, A. (2001). Finding Spatio-Temporal Patterns in Earth Science Data. In KDD-2001 Workshop on Temporal Data Mining, San Francisco, CA. U.S. Drought Monitor. (2002). Hosted and Maintained by the National Drought Mitigation Center. http://enso.unl.edu/monitor/. Wilhite, D.A. (2000). Drought as a Natural Hazard: Concepts and Definitions. In D.A. Wilhite (Ed.), Drought Volume II: A Global Assessment, Routledge Hazards and Disaster Series (pp. 3–8). New York: Routledge Publishers. Zaki, M. (2000). Sequence mining in Categorical Domains: Incorporating Constraints. In Proceedings of the Ninth International Conference on Information and Knowledge Management [CIKM2000] (pp. 422–429). Washington DC, USA.