J Supercomput DOI 10.1007/s11227-016-1714-y
Combining association rule mining and network analysis for pharmacosurveillance
Eugene Belyi1 · Philippe J. Giabbanelli3 · Indravadan Patel1 · Naga Harish Balabhadrapathruni1 · Aymen Ben Abdallah1 · Wedyan Hameed1 · Vijay K. Mago2
© Springer Science+Business Media New York 2016
Abstract Retailers routinely use association mining to investigate trends in the use of their products. In the medical world, association mining is mostly used to identify associations between symptoms and diseases, or between drugs and adverse events. In comparison, there is a relative paucity of work that focuses on relationships between drugs exclusively. In this work, we use the Medical expenditure panel survey to examine relationships between drugs in the United States. In addition to examining the rules generated by association mining, we introduce the notion of a target drug network and demonstrate via different drugs that it can offer additional medical insight.
Corresponding author: Vijay K. Mago
1 Department of Computer Science, Troy University, University Ave, Troy, AL 36082, USA
2 Department of Computer Science, Lakehead University, 955 Oliver Road, Thunder Bay, ON P7B 5E1, Canada
3 Department of Computer Science, Northern Illinois University, 1425 W. Lincoln Highway, DeKalb, IL 60115, USA
For example, we were able to find drugs that are commonly taken together despite containing the same active compound. Future work can expand on the concept of target drug network, for example, by annotating the networks with the compounds and intended uses of each drug, to yield additional insight for pharmacosurveillance as well as pharmaceutical companies.
Keywords Association mining · Market basket analysis · Pharmacosurveillance
1 Introduction
Several industries, such as banking and finance, have long aimed to increase sales revenue by marketing and advertising products relative to other products. This is known as cross-selling [32,36], and it is a customer relationship strategy that uses product purchase data to estimate how customers would respond to a campaign for other products [4]. One of the techniques to perform cross-selling is association rule mining, which extracts associations or co-occurrences from transactional databases [20]. In other words, the goal is to generate association rules given a database consisting of items purchased by customers in single visits (i.e., ‘transactions’ or ‘basket data’) [2]. In comparison to its widespread use in retail, there is much less research on the use of association mining of purchase data for medical insight. In particular, most of the research using association mining for medical purposes has focused on clinical repositories, seeking to identify associations between symptoms, health conditions, and diseases (e.g., alcohol consumption relates to depression) [51], with less being done purely on data recording drug purchases. Post-marketing drug data (i.e., data on which drugs were purchased by whom) have typically been used in two ways. First, such data have been used to find (unexpected) adverse drug reactions (ADRs). An adverse reaction is defined by the World Health Organization as an unintended and noxious response to a drug used at doses normal in man [58]. This is more specific than a ‘side effect’, which is often taken to encompass all unintended responses, and it also has (at least suspected) causality, whereas ‘adverse events’ only report the co-occurrence of an undesirable experience and drug intake. A variety of techniques known as signal detection in pharmacovigilance have been developed and applied [37].
In addition, association rule mining has recently been used to identify ADRs based on problems reported by patients in online health communities [60]. Second, such data have been used to analyze the kind of treatment that the population receives and detect whether there are subgroups that are on inappropriate treatment regimens [6]. Consequently, there is a relative paucity of work devoted to association mining purely on medication expenditure data. One such study was undertaken by Chen and colleagues, who used the National Health Insurance Research Database in Taiwan to identify which drugs were used together with antacids (a widely used drug against stomach acid) [14].
Our study seeks to contribute to the literature on data mining applications in healthcare by borrowing techniques from the retail industry to mine medical expenditure in the U.S. This is of particular interest to identify trends in U.S. healthcare. For example, finding that drugs used to treat entirely different conditions are often bought together is informative of common co-morbidities in the population. Furthermore, finding that a set of drugs related to a condition is frequently purchased together could help partially automate the ‘bundling’ of medication for common treatments, which is informative to a range of actors involved in the pharmacy distribution process [33]. While networks have been used abundantly in health [25,27,28] and the process of association rule mining is well established, less work has been devoted to analyzing relationships between drug purchases as a network. Thus, this paper introduces the concept of ‘target drug network’ to derive insight from medical expenditure data for healthcare. Specifically, we represent the rules produced by association mining as a network, thus using an innovative combination of data mining and network analysis in the context of pharmaceutical data.
The rules used in our study are extracted from the 2011 Medical Expenditure Panel Survey (MEPS; see footnote 1), which collects healthcare data from a nationally representative sample of the civilian non-institutionalized population in the United States. Demographic information on the MEPS 2011 data is provided in Fig. 1. The survey is specifically designed to “provide analysts with the data they need to support policy-relevant research on health care expenses, utilization, insurance coverage, and access in the United States” [17]. This survey has been very frequently used to assess the healthcare costs associated with specific conditions in the United States, such as allergic rhinitis [41], attention-deficit/hyperactivity disorder (ADHD) [13], or multiple chronic conditions [45].
Fig. 1 Demographics in the MEPS 2011 dataset used for our study
While the methods of our case study are tailored to the characteristics of the MEPS, readers in the biomedical community may need to replace some of our methods by alternatives that are more suitable to other national datasets on medical expenditure. Consequently, Sect. 2 provides a brief outline of the methods used in this paper and reviews other options as well as their advantages and potential drawbacks depending on a dataset’s characteristics. Then, we detail our methods in Sect. 3 and we use them in Sect. 4 to exemplify what kind of medical information is offered by a data science approach to medical expenditure data. Finally, the limitations and implications of this study are discussed in Sect. 5.
Footnote 1: Data files and codebooks can be downloaded from http://meps.ahrq.gov/mepsweb/data_stats/download_data_files.jsp. Statistical summaries of the data can be accessed via http://meps.ahrq.gov/mepsweb/data_stats/quick_tables_search.jsp?component=1&subcomponent=0 and selecting Year 2011.
2 Related work
To derive medical insight from large amounts of medical expenditure data, we start by performing a market basket analysis (MBA). This commonly used approach in marketing research analyzes the products that customers tend to buy together, and outputs a series of association rules. The specific algorithm chosen to generate these rules is Apriori; for a complete description of the algorithm and its variants we refer the reader to [3], while a recent example of its application to medical data is provided by [34]. After generating the rules, we represent and analyze them as a network to find which drugs co-occurred, and we analyze the implications of these findings in light of the conditions that the drugs are normally prescribed/used for. As will be discussed, the Apriori algorithm was chosen because of the characteristics of the MEPS used in this paper. However, members of the biomedical community interested in our approach may use medical datasets with different characteristics. Consequently, this section starts by detailing what a market basket analysis is, then reviews different algorithms to perform it, and concludes with other analyses that could be of interest on medical expenditure data.
2.1 Principles of market basket analysis
A medical expenditure dataset consists of a set of transactions, where each transaction stands for a drug being purchased at a specific time by a patient. A market basket is a set of items purchased together; intuitively, it is what the patient has in the ‘basket’ when leaving the pharmacy. The N-market basket refers to the N items bought together most frequently. Companies in sectors such as insurance or banking that seek to implement cross-selling can then examine whether a customer’s profile matches a known market basket, and suggest new items for purchase accordingly. For additional information about market basket analysis, we refer the reader to [15].
A market basket can be represented with association rules. In general, association rules are defined as follows: given two non-overlapping sets of items X and Y, an association rule in the form of X ⇒ Y indicates a purchase pattern such that a customer who purchases X is likely to purchase Y [16]. The structure and metrics of association rules are detailed in Sect. 3 as part of our methodology. Many algorithms can be used to extract association rules. Unlike several other branches of data mining or statistics, association mining is not hypothesis-driven. In other words, few or no prior assumptions are made about which products will correlate: it is exploratory in nature. The advantage is that it can uncover previously unknown relationships between drugs by ‘letting the data speak for itself’, but the disadvantage is the risk of running into a combinatorial explosion by navigating a massive search space. Consequently, there are many possible algorithms to pick from when seeking to perform association mining, and understanding which one will behave best on a given medical expenditure dataset is essential.
2.2 Association mining on pharmaceutical data
The Apriori algorithm introduced in [3] is a commonly used algorithm for association mining. It operates over transactional databases, where each transaction is viewed as an itemset. In the Apriori algorithm, a breadth-first search is performed to identify and count candidate itemsets. The algorithm has two parameters, which (i) set the minimum number of occurrences for an itemset to be considered as a candidate, and (ii) impose a maximum length on an itemset. Since the algorithm generates many candidates from the data, the list of candidates needs to be pruned. The Apriori algorithm prunes based on how frequently the itemsets occur [3]. The advantages of Apriori include providing an exact solution that is easy to understand, and identifying many patterns. However, these are also drawbacks: there may be too many patterns, which makes the analysis difficult, and the exact solution comes at a very high computational cost. Consequently, this solution is used in banking and insurance for datasets that have few items or are sparse [12]. Given that our dataset is sparse, Apriori is the solution used in this work. To facilitate replication of this work, and since our main contribution consists of combining association rule mining and network analysis for pharmaceutical data, we use a simple version of Apriori commonly accessible through software such as R. To replicate our work on very large databases, we would recommend using recent versions of the Apriori algorithm that come with a significant speed-up [54], as well as considering the use of graphics processing units [19,62].
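The breadth-first search and frequency-based pruning described above can be illustrated with a minimal Python sketch. It is a didactic version and not the implementation used in this study (which relies on R); the function name `apriori`, its parameters, and the placeholder drug names are assumptions made for illustration, and the minimum occurrence is expressed here as a relative support.

```python
from itertools import combinations


def apriori(transactions, min_support, max_len=3):
    """Return {itemset: support} for every frequent itemset (illustrative sketch)."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item seen in the data.
    candidates = [frozenset([i]) for i in sorted({i for b in baskets for i in b})]
    frequent = {}
    k = 1
    while candidates and k <= max_len:
        # Count the baskets that contain each candidate itemset.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        joined = {a | b for a, b in combinations(list(level), 2) if len(a | b) == k + 1}
        # Prune step: keep a candidate only if all of its k-subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent


# Hypothetical toy baskets (placeholder names, not MEPS data).
toy_baskets = [
    {"DRUG_A", "DRUG_B", "DRUG_C"},
    {"DRUG_A", "DRUG_B"},
    {"DRUG_A", "DRUG_C"},
    {"DRUG_B", "DRUG_C"},
    {"DRUG_A", "DRUG_B", "DRUG_C"},
]
frequent_sets = apriori(toy_baskets, min_support=0.5)
```

With a minimum support of 0.5, the three single items and the three pairs are retained, while the triple (present in 2 of 5 baskets) is counted once and then discarded. The maximum length bounds how deep the search goes, which is one way to contain the combinatorial cost mentioned above.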
Cavique reviewed a set of alternative algorithms and introduced the Similis algorithm as follows [12]: “The input data is transformed into a graph-based structure and then the maximum-weighted clique problem is solved using a meta-heuristic approach in order to find the most frequent itemsets.” This can be a solution of interest when working with big data in healthcare, as the heuristic makes it more feasible to process a larger number of transactions when obtaining quasi-most-frequent market baskets is acceptable. In addition, Similis requires fewer input parameters than Apriori and commonly used variants, which simplifies the process of setting up experiments.
Parallel to algorithmic research on association mining, researchers have been inspired by progress in complex network analysis and started applying it to the market basket problem by representing the data as a network. In 2009, Raeder and Chawla created a product network by representing purchase relationships between products. Then, they defined a new measure of utility (for a human analyst) for a community of products, and employed community detection algorithms to reveal key relationships [53]. Consequently, alternatives to association rule mining algorithms also include community detection algorithms [26], which are plentiful. One recent example of a network approach is provided by Kim and colleagues [38]. They constructed two different kinds of networks: the typical Market-Based Network, in which association rules are mined and represented as a network, and a co-purchased product network (CPN). To construct a CPN, first a customer–product bipartite network is created; intuitively, customers are on one side, products on the other, and customers are connected to the products that they bought. Then, a link is added between products u and v if customers have bought both u and v. The link’s value is set as the frequency of the co-purchase. One drawback of this approach is that the CPN to analyze can be particularly dense compared to the typical Market-Based Network. Indeed, one case study found that it was more than 20 times as dense [38]. Links thus have to be removed, and the criteria to do so (e.g., remove links whose value is less than x) introduce certain parameter(s) to the problem. Furthermore, generating a co-purchase product network requires transactional data that distinctly associates all given transactions with customers, while market basket networks only require transaction data that contain sets of items for each transaction. In the context of healthcare data, it may thus be easier to obtain data to build market basket networks, in addition to leading to an immediately sparser representation. Nonetheless, if there are supporting data and there is expert knowledge to trim the network, then a CPN presents an interesting option to utilize advances in complex networks.
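The CPN construction described above (bipartite customer–product network, projection onto products, then trimming) can be sketched in a few lines of Python. This is a simplified illustration and not the procedure of [38]: the function name `co_purchase_network`, the `min_weight` parameter, and the choice to measure co-purchase frequency as the number of customers who bought both products are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_purchase_network(purchases, min_weight=1):
    """Build weighted product-product links from (customer, product) pairs."""
    # Bipartite step: connect each customer to the products they bought.
    bought = defaultdict(set)
    for customer, product in purchases:
        bought[customer].add(product)
    # Projection step: link products u and v once per customer who bought both.
    weights = Counter()
    for products in bought.values():
        for u, v in combinations(sorted(products), 2):
            weights[(u, v)] += 1
    # Trimming step: remove links whose value is below the chosen threshold.
    return {edge: w for edge, w in weights.items() if w >= min_weight}


# Hypothetical purchases (placeholder names, not MEPS data).
purchases = [
    ("patient_1", "DRUG_A"), ("patient_1", "DRUG_B"),
    ("patient_2", "DRUG_A"), ("patient_2", "DRUG_B"), ("patient_2", "DRUG_C"),
    ("patient_3", "DRUG_C"),
]
network = co_purchase_network(purchases, min_weight=2)
```

Without trimming, three links exist in this toy example; with `min_weight=2`, only the link between DRUG_A and DRUG_B remains. The threshold is exactly the extra parameter that the text identifies as a cost of the CPN approach.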
2.3 Other approaches on mining pharmaceutical data
Cluster analysis techniques (e.g., k-means clustering, time series clustering) could be used to identify groups of similar drugs. These techniques are sometimes used with association mining on the same dataset, but it should be noted that they differ technically as well as in the type of questions that they seek to answer [23]. Association mining is used to find product co-purchasing trends, such as the likelihood of one item being purchased before or after another item. In contrast, time series clustering serves to illustrate the relationships between various groups of items. For a survey of time series clustering, we refer the reader to [43]. One example of cluster analysis found many interesting sets of complementary parts that are used to make different types of products, which was particularly informative for marketing as it allowed new sales strategies to be developed [56].
3 Methods
3.1 Association rules
The set of all drugs is denoted by $I = \{i_1, i_2, \ldots, i_n\}$. A transaction is denoted by $T = \{i_1, i_2, \ldots, i_m\}$ where $T \subseteq I$. An association rule is represented as $\{i_p, i_q\} \Rightarrow \{i_k\}$. This can be read as “if a user buys items $i_p$ and $i_q$, then the user will likely buy item $i_k$”. For example, {Amoxicillin, Augmentin} ⇒ {Diazepam} implies that if a patient is prescribed Amoxicillin and Augmentin, then they are also likely to be prescribed Diazepam. When stating an association rule {A} ⇒ {B}, there are two important metrics: the support, which is the relative frequency given by $P(\{A, B\})$, and the confidence, which is the conditional probability $P(B \mid A)$. The set of all rules that we mine from pharmaceutical transactional data using the Apriori algorithm thus comes with support and confidence values. To illustrate this in the context of pharmaceutical products, consider a relationship in which Codeine and Amoxicillin are prescribed concurrently. This is given by the rule in Eq. 1, where the symbol ⇒ denotes an increased likelihood of the purchase of one product given the purchase of another. In the presented case, the relationship shows a likelihood of Amoxicillin being prescribed if Codeine is purchased. The support of 20 % shows that Codeine and Amoxicillin are prescribed together in 20 % of all the transactions analyzed. The confidence of 80 % shows that 80 % of patients prescribed Codeine were also prescribed Amoxicillin. Typically, association rules are viewed as meaningful when they satisfy both a minimum support threshold and a minimum confidence threshold. It should be noted that, while support and confidence are commonly used metrics, there exists a large number of other metrics that could be used to generate the rules; this is discussed in Sect. 5.

$$\text{Codeine} \Rightarrow \text{Amoxicillin}, \qquad \text{Support} = 20\,\%, \qquad \text{Confidence} = 80\,\% \tag{1}$$

The number of possible association rules depends on the total number of items. For instance, if we have two items, then the number of possible association rules is 2; and if we have 3 items, the number of possible association rules is 12. In general, the total number of association rules R is exponential in the number of items n, as formalized in Eq. 2:

$$R = 3^n - 2^{n+1} + 1 \tag{2}$$

Indeed, $3^2 - 2^3 + 1 = 2$ and $3^3 - 2^4 + 1 = 12$. Because the number of possible rules is exponential, rules must be pruned: a minimum support value is used as threshold, and analysts typically choose the rules having highest support and confidence [12].
3.2 Network analysis
As introduced in Sect. 2.2, association rules can also be viewed as a network. In this paper, we introduce the target network of medication X, defined as the network composed of three elements, visually organized from the center to the periphery:
• The target medication X,
• the support and confidence of each rule containing X, referred to in this study as a relationship node, where the node’s size stands for the support while the color stands for the confidence,
• the other drugs involved in the rules containing X.
In target networks, all target medications have incoming arrows from relationship nodes, and all relationship nodes have incoming arrows from the other drugs in rules with the target medication. As an example, the target network of Glucophage is provided in Fig. 2, which will be analyzed in the next section.
Fig. 2 Target network of Glucophage
Producing target networks allows us to apply complex network analysis to medical expenditure data. Specifically, analyzing links in a target network can show which item is most present in transactions with the targeted item, regardless of what other items are in the transaction. Furthermore, the importance of each drug with respect to the target can be quantified. In the context of complex networks, nodes with many incoming links are referred to as authorities, and nodes with many outgoing links are referred to as hubs. We will use the HITS algorithm [39] to analyze target networks to interpret which drugs function as hubs
and which are authorities. Scores for hubs and authorities can be computed to rank nodes by their respective scores. In the context of this work, nodes which have a higher hub score in graphs are present in more association rules for a given pharmaceutical product than others. A high hub score indicates that a product is more likely to be prescribed with a specific product than other products present in the association rules. It should be noted that several works have previously built pharmacological networks. Many of them focused on adverse drug reactions, rather than co-purchases. For example, Cami and colleagues introduced predictive pharmacosafety networks (PPNs) [11] whose links are based on drug–adverse reactions relationships, while Xue and colleagues built a network of approved and withdrawn drugs where links are based on similarity between drugs at a chemical level [59]. The other typical use of a network methodology is devoted to drug repositioning, where the goal is to support the process of drug development using information on already FDA approved drugs. A recent review is offered in [63] while examples are offered by the work of Lee and colleagues, where the network consists of tripartite relationships between drugs, diseases, and proteins [42], or the work of Gregori-Puigjane and Mestres where the network is made of molecules sharing common drugs [30].
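To make the target network and the hub scores of Sect. 3.2 more concrete, the following Python sketch builds the directed edges (other drug → relationship node → target) and runs a plain power-iteration version of HITS. It is only an illustration under stated assumptions: the analysis in this paper was carried out in R, the function names `target_network` and `hits`, the `rule_<index>` node labels, and the toy rules with placeholder drug names are hypothetical, and support and confidence (node size and color) are omitted for brevity.

```python
from math import sqrt


def target_network(rules, target):
    """Return directed edges of the target network for rules given as (lhs, rhs)."""
    edges = []
    for index, (lhs, rhs) in enumerate(rules):
        drugs = set(lhs) | set(rhs)
        if target not in drugs:
            continue  # only rules containing the target belong to its network
        relationship = f"rule_{index}"  # one relationship node per rule
        edges.append((relationship, target))
        edges.extend((drug, relationship) for drug in sorted(drugs - {target}))
    return edges


def hits(edges, iterations=50):
    """Generic HITS by power iteration; returns (hub, authority) score dicts."""
    nodes = {node for edge in edges for node in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority score: sum of hub scores of the nodes pointing to it.
        auth = dict.fromkeys(nodes, 0.0)
        for source, destination in edges:
            auth[destination] += hub[source]
        # Hub score: sum of authority scores of the nodes it points to.
        hub = dict.fromkeys(nodes, 0.0)
        for source, destination in edges:
            hub[source] += auth[destination]
        # Normalize both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for node in scores:
                scores[node] /= norm
    return hub, auth


# Hypothetical rules (placeholder names, not mined from MEPS).
toy_rules = [
    ({"DRUG_A", "DRUG_B"}, {"TARGET"}),
    ({"DRUG_A", "DRUG_C"}, {"TARGET"}),
    ({"DRUG_A"}, {"TARGET"}),
    ({"DRUG_B"}, {"DRUG_C"}),  # does not involve the target, so it is skipped
]
edges = target_network(toy_rules, "TARGET")
hub, auth = hits(edges)
top_hub = max(hub, key=hub.get)
```

In this toy network DRUG_A takes part in every rule with the target and obtains the highest hub score, which mirrors the interpretation given above. Relationship nodes also receive scores in this generic formulation, so a ranking of drugs should be restricted to the drug nodes.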
4 Experiments
The pharmaceutical data used in this work are the 2011 MEPS [18], which contain records of the prescription of medications by doctors. Each record (or tuple) has a unique patient ID, a medication name and form (e.g., pill, capsule, lancet), and various other patient or medication attributes. The data were pre-processed in Python to create transactional data containing baskets, where each basket is the set of medications prescribed to a patient. Then, the analysis was conducted in R using the arules [31] and sna [10] libraries to generate association rules and perform network analysis, respectively. This section divides the analysis into two parts. First, we report general results about the association rules (e.g., sensitivity analysis, number of drugs involved in the rules, support and confidence). Second, we demonstrate the usefulness of modeling association rules as a network for pharmaceutical data, focusing on three medications and analyzing their relationships with other drugs both from the association rules and from the target networks.
4.1 General characteristics of the association rules
The number and descriptiveness of the rules generated by the Apriori algorithm are dependent on the input parameters of support and confidence. Since these rules are used to interpret relationships between itemsets, we first conducted a sensitivity analysis on the support and confidence threshold values used by the Apriori algorithm. Figure 3 shows how many rules are generated for various thresholds of support and confidence. Outlying points in the plot have been pruned to better illustrate the general effect of changing the parameters. As shown in Fig. 3, when the support and confidence are low, the number of rules generated is high. Conversely, when the support and confidence are high, the number of rules is low.
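The two steps described above (grouping prescription records into one basket per patient, then counting rules over a grid of thresholds) can be sketched as follows. This is a hedged illustration rather than the pipeline used in this study: the record layout `(patient_id, medication)`, the function names, and the toy data are assumptions, rules are restricted to single-item right-hand sides, and itemsets are counted by brute force, which is only viable for very small examples.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_baskets(records):
    """Group (patient_id, medication) records into one basket per patient."""
    baskets = defaultdict(set)
    for patient_id, medication in records:
        baskets[patient_id].add(medication)
    return list(baskets.values())


def count_rules(baskets, min_support, min_confidence, max_len=3):
    """Count rules lhs => {rhs} that meet both thresholds (brute force)."""
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        items = sorted(basket)
        for size in range(1, max_len + 1):
            counts.update(combinations(items, size))
    total = 0
    for itemset, count in counts.items():
        if len(itemset) < 2 or count / n < min_support:
            continue
        for rhs in itemset:
            lhs = tuple(item for item in itemset if item != rhs)
            # Confidence = support(lhs and rhs) / support(lhs).
            if count / counts[lhs] >= min_confidence:
                total += 1
    return total


# Hypothetical records (placeholder names, not MEPS data).
records = [
    (1, "DRUG_A"), (1, "DRUG_B"), (1, "DRUG_C"),
    (2, "DRUG_A"), (2, "DRUG_B"),
    (3, "DRUG_A"), (3, "DRUG_C"),
    (4, "DRUG_B"), (4, "DRUG_C"),
    (5, "DRUG_A"), (5, "DRUG_B"), (5, "DRUG_C"),
]
baskets = build_baskets(records)
grid = {(s, c): count_rules(baskets, s, c)
        for s in (0.3, 0.5) for c in (0.6, 0.7, 0.8)}
```

On this toy data the count falls from 9 rules at the lowest thresholds to 0 at a confidence threshold of 0.8, reproducing in miniature the trend that the sensitivity analysis reports: stricter thresholds yield fewer rules.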
Fig. 3 Sensitivity analysis of the total number of rules that can be generated from the MEPS dataset, based on different values of support and confidence
To perform rule mining using the Apriori algorithm, a minimum support and minimum confidence must be specified. While the support and confidence of a rule are variables resulting from applying the Apriori algorithm, the minimum support and minimum confidence are user-defined constants that specify how the algorithm should run. Minimum values need to be low enough so that the algorithm does not over-prune rules that may be informative, while being high enough to avoid generating trivial rules and having to perform additional pruning. Choosing them is thus a matter of trade-off, which further takes into account the balance between the two minimums. Since the transactional data used in this work are sparse, values of minimum support and minimum confidence in the first quantile were found to generate a satisfactory number of rules. This can be seen by comparing the values in Fig. 3 with the parameter values. The input parameters of minimum support and minimum confidence are given in Eq. 3. They are expressed as proportions (i.e., 0.04 % and 40 %), consistent with the minima reported in Table 1.

$$\text{Minimum Support} = 0.0004, \qquad \text{Minimum Confidence} = 0.4 \tag{3}$$
The overall procedure to generate rules was implemented with the arules library for association rules in the R programming language. A total of 534 rules were generated. Most (392) involve three drugs while a few involve two (87) or four (55) drugs. Table 1 provides a statistical summary of the rules generated. Sorting the rules makes it possible to see the most relevant ones first. We sorted the rules by confidence and report the top 5 rules in Table 2. Rules can also be pruned by either minimum or maximum number of drugs involved. In Table 3, we report the rules with the highest confidence whose left-hand side involves 3 drugs.

Table 1 Statistics of the distributions for the number of drugs, support, and confidence
| Statistics | Number of drugs in rules | Support | Confidence |
|---|---|---|---|
| Min | 2.00 | 0.000404 | 0.400 |
| 1st Qu | 3.00 | 0.000404 | 0.462 |
| Median | 3.00 | 0.000472 | 0.545 |
| Mean | 2.94 | 0.000657 | 0.573 |
| 3rd Qu | 3.00 | 0.000674 | 0.667 |
| Max | 4.00 | 0.008426 | 1.000 |

Table 2 Top five rules by confidence
| Left-hand side of the rule (lhs) | Right-hand side of the rule (rhs) | Support | Confidence | Lift |
|---|---|---|---|---|
| {LISINOPRIL, METFORMIN} | ⇒ {SOFTCLIX} | 0.00040442 | 1 | 24.645 |
| {HUMULINR (vial), LIPITOR} | ⇒ {SOFTCLIX} | 0.00040442 | 1 | 24.645 |
| {LANOXIN, LESCOL} | ⇒ {FUROSEMIDE} | 0.00040442 | 1 | 29.437 |
| {LANOXIN, NOVOLININSULIN} | ⇒ {SOFTCLIX} | 0.00040442 | 1 | 24.645 |
| {AVANDIA, COUMADIN} | ⇒ {SOFTCLIX} | 0.00047183 | 1 | 24.645 |

Table 3 Top five rules by confidence involving 3 drugs
| Left-hand side of the rule (lhs) | Right-hand side of the rule (rhs) | Support | Confidence | Lift |
|---|---|---|---|---|
| {ACCUPRIL, FUROSEMIDE, INSULINNOVOLIN} | ⇒ {SOFTCLIX} | 0.00040442 | 1.00000 | 24.645 |
| {ACCUPRIL, INSULINNOVOLIN, LIPITOR} | ⇒ {SOFTCLIX} | 0.00040442 | 1.00000 | 24.645 |
| {FUROSEMIDE, INSULINNOVOLIN, LIPITOR} | ⇒ {SOFTCLIX} | 0.00053923 | 0.88889 | 21.906 |
| {ATROVENT, PREDNISONE, THEOPHYLLINE} | ⇒ {ALBUTEROL} | 0.00040442 | 0.85714 | 18.646 |
| {ALBUTEROL, COMBIVENT, FLOVENT} | ⇒ {PREDNISONE} | 0.00040442 | 0.85714 | 25.433 |

4.2 Modeling association rules as a network for pharmaceutical data
While the previous section reported results about the rules in general, this section will focus on demonstrating the usefulness of our approach to interpreting association rules as a network for pharmaceutical data. Three specific medications will serve as guiding examples: Glucophage, Amoxicillin, and Albuterol. These three medications were chosen for the high prevalence and variety of the conditions they seek to treat (diabetes, infections, asthma) in addition to being among the top 20 most frequent items (Fig. 4). Glucophage is a medication which contains Metformin Hydrochloride and is used to control high blood sugar in individuals with type II diabetes [35]. Amoxicillin is a penicillin antibiotic used to treat bacterial infections [48]. Albuterol is an asthma medication used to open air pathways [22]. These conditions are all relatively common in the American population. Indeed, 22 million Americans were diagnosed with type II diabetes in 2014 (footnote 2), 22 million Americans had asthma in 2013 (footnote 3), and bacterial infections are highly prevalent since they constitute one of the main vectors of infectious diseases (others being viruses, parasites and fungi). In the remainder of this section, we analyze the association rules for each of the three drugs in turn.
Fig. 4 Top 20 most frequent items in dataset
4.2.1 Analysis for Glucophage
The typical way to analyze association rules is to select the top 5 by confidence having Glucophage on the right-hand side of the rule. This is summarized in Table 4. Many drugs (Avandia, Glyburide, Amaryl) and also a device (Softclix) are related to diabetes, which is expected given that it is the condition that Glucophage treats.
Footnote 2: Up-to-date National Diabetes Statistics are maintained by the CDC at http://www.cdc.gov/diabetes/statistics/prev/national/figpersons.htm.
Footnote 3: Up-to-date National Asthma Prevalence is maintained by the CDC at http://www.cdc.gov/asthma/most_recent_data.htm.
Table 4 Most relevant rules when Glucophage is set as right-hand side
| Left-hand side of the rule (lhs) | Support | Confidence | Lift |
|---|---|---|---|
| {AVANDIA, GLYBURIDE} | 0.0008762470 | 0.8666667 | 28.50968 |
| {AVANDIA, GLYBURIDE, SOFTCLIX} | 0.0004044217 | 0.7500000 | 24.67184 |
| {PROZAC, SOFTCLIX} | 0.0007414397 | 0.6875000 | 22.61585 |
| {AMARYL, HYDROCHLOROTHIAZIDE} | 0.0004044217 | 0.6666667 | 21.93052 |
| {AVANDIA, ZESTRIL} | 0.0004044217 | 0.6666667 | 21.93052 |
Other drugs are particularly interesting in that they do not target diabetes. Indeed, Prozac is an anti-depressant [49] while Hydrochlorothiazide and Zestril are used to lower high blood pressure or to treat congestive heart failure. This raises the question of whether the conditions treated by these medications are consequences or causes of diabetes, or share a root cause with it. Thus, investigating association rules from medication data can serve as a means to search for co-morbidities in the population. In the case of heart failure and diabetes, there is physiological evidence relating both to insulin resistance. Indeed, insulin resistance (the root of type-2 diabetes) causes the liver to export lipoproteins [50], leading to a deposit of fatty acids in artery walls [5], thus contributing to coronary artery disease and hence heart failure [47]. Similarly, there is epidemiological evidence linking diabetes and depression [21,24,25].
Producing target networks allows other interesting medical information to emerge. The target network for Glucophage (Fig. 2, used as an example in Sect. 3.2) shows that a product known as Softclix is present in many of the rules. Softclix is a brand of lancets, which are used to collect a sample of blood to check sugar levels [1]. It is thus not surprising that a product to control blood sugar levels is often purchased with a device to measure sugar levels. However, this points out that patients likely need both of these specific pharmaceutical products. This is of interest not only to retailers, who can ask their patients whether they already have their Softclix when getting their Glucophage, but also for population health by identifying patterns which may help patients simplify and safely use complex prescription regimens. Analyzing the hubs in the target network of Glucophage shows that Softclix is the highest scoring hub, followed by Glyburide (Fig. 5). Both Glyburide and Glucophage are oral medications for type 2 diabetes. Glucophage (being Metformin) is normally used as initial therapy for type 2 diabetes, and Glyburide (being a sulfonylurea) can be used as a second-line agent for dual therapy [40]. The high support for {GLYBURIDE, SOFTCLIX} ⇒ {GLUCOPHAGE} implies that these products are highly associated in the treatment recommendation for type 2 diabetes mellitus, but the low confidence shows that these products are not necessarily prescribed at the same time. This illustrates that a network-based approach can serve as a tool to quickly identify common treatments used in conjunction with the target drug.
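The rule selection that opens this analysis (keep the rules with the target on the right-hand side, then rank by confidence) can be sketched in Python as follows. The rule values are copied from Tables 2 and 4; the `Rule` type and the `top_rules` helper are hypothetical names introduced for illustration, and the sketch is not the R code used in this study. It also relies on the standard definition of lift as the confidence divided by the support of the right-hand side, which the tables do not spell out.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    lhs: tuple
    rhs: str
    support: float
    confidence: float
    lift: float


# Values copied from Table 4 (rhs GLUCOPHAGE) and two rows of Table 2.
rules = [
    Rule(("AVANDIA", "GLYBURIDE"), "GLUCOPHAGE", 0.0008762470, 0.8666667, 28.50968),
    Rule(("AVANDIA", "GLYBURIDE", "SOFTCLIX"), "GLUCOPHAGE", 0.0004044217, 0.7500000, 24.67184),
    Rule(("PROZAC", "SOFTCLIX"), "GLUCOPHAGE", 0.0007414397, 0.6875000, 22.61585),
    Rule(("AMARYL", "HYDROCHLOROTHIAZIDE"), "GLUCOPHAGE", 0.0004044217, 0.6666667, 21.93052),
    Rule(("AVANDIA", "ZESTRIL"), "GLUCOPHAGE", 0.0004044217, 0.6666667, 21.93052),
    Rule(("LISINOPRIL", "METFORMIN"), "SOFTCLIX", 0.00040442, 1.0, 24.645),
    Rule(("LANOXIN", "LESCOL"), "FUROSEMIDE", 0.00040442, 1.0, 29.437),
]


def top_rules(rules, target, k=5):
    """Keep rules whose right-hand side is the target, ranked by confidence."""
    matching = [rule for rule in rules if rule.rhs == target]
    return sorted(matching, key=lambda rule: rule.confidence, reverse=True)[:k]


top = top_rules(rules, "GLUCOPHAGE")
# Since lift = confidence / P(rhs), confidence / lift recovers P(rhs),
# which should be the same for every rule sharing a right-hand side.
implied_rhs_support = [rule.confidence / rule.lift for rule in top]
```

Applied to the Table 4 rows, the ratio of confidence to lift is about 0.030 for every rule, which suggests that the reported lifts are internally consistent and that Glucophage appears in roughly 3 % of the baskets.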
Fig. 5 Highest scoring hub of Glucophage
Table 5 Most relevant rules when Amoxicillin is set as right-hand side

| Left-hand side of the rule (lhs) | Support | Confidence | Lift |
| --- | --- | --- | --- |
| {ROBITUSSIN CF} | 0.00020221 | 1.0 | 13.661 |
| {SILPHEN} | 0.00020221 | 1.0 | 13.661 |
| {GNP PEDIATRIC FRUIT, TRIPLE ANTIBIOTIC} | 0.00020221 | 1.0 | 13.661 |
| {ATROVENT, GUAIFENESIN, PREDNISONE} | 0.00020221 | 1.0 | 13.661 |
| {METHYLPREDNISOLONE, PREMARIN} | 0.00026961 | 0.8 | 10.929 |
4.2.2 Analysis for Amoxicillin
Similar to the analysis for Glucophage, we started by generating the top five rules by confidence for Amoxicillin. Results in Table 5 include Robitussin CF and Silphen, both of which contain a cough suppressant known as Dextromethorphan [57]. Similarly, Guaifenesin, an expectorant, is also used to relieve cough symptoms. These observations suggest that Amoxicillin is commonly prescribed for bacterial infections with cough symptoms. Analyzing association rules can thus shed light on how broad-spectrum antibiotics are typically used by the population. This is useful to pharmaceutical retailers, who may suggest over-the-counter versions of cough suppressants to individuals picking up Amoxicillin prescriptions. It is also useful for pharmacosurveillance when a broad-spectrum antibiotic is expected to be commonly used for one array of conditions but ends up being used for another.
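The values reported in Table 5 can be checked for internal consistency. The sketch below assumes the standard definition lift = confidence / P(rhs), which this section does not state explicitly, and recovers the share of transactions containing Amoxicillin that each row implies. The helper name `implied_rhs_support` is hypothetical; the (confidence, lift) pairs are copied from Table 5.

```python
# (confidence, lift) pairs copied from the five rows of Table 5.
table5 = [
    (1.0, 13.661),
    (1.0, 13.661),
    (1.0, 13.661),
    (1.0, 13.661),
    (0.8, 10.929),
]


def implied_rhs_support(confidence, lift):
    """Under lift = confidence / P(rhs), return the implied P(rhs)."""
    return confidence / lift


implied = [implied_rhs_support(c, l) for c, l in table5]
for value in implied:
    print(round(value, 4))
```

Under that assumption, every row implies that roughly 7.3% of transactions contain Amoxicillin, so the reported confidence and lift values agree with one another.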
Fig. 6 Target network of Amoxicillin
The target network for Amoxicillin is shown in Fig. 6. As in all target networks, intermediate nodes model the support by size (larger means higher support) and confidence by color (darker means higher confidence). This visualization readily shows that there is an inverse relationship between support and confidence for the rules. In addition, it can be observed that several drugs are used by multiple rules. Performing a network analysis for hubs reveals that Augmentin is the main hub for Amoxicillin (Fig. 7). Augmentin is a medication which itself contains the antibiotic Amoxicillin along with clavulanic acid. The clavulanic acid is used to maintain the effectiveness of the antibiotic in the presence of drug-resistant bacteria [29]. It may thus not be necessary for individuals to purchase both drugs to the extent seen in the data.

4.2.3 Analysis for Albuterol
The top five rules by confidence for Albuterol are listed in Table 6. Each of the medications in this table is associated with asthma treatment and prevention. Since Albuterol is also an asthma medication, the main association rules offer little additional insight. In that light, the network analysis in Fig. 8 is more interesting. It shows that Prednisone is the highest scoring hub. Prednisone is a steroid-based anti-inflammatory medication that is used to treat allergic and breathing disorders. Since Albuterol is an asthma medication used to open air pathways, the prescription of Prednisone alongside Albuterol suggests that the asthma is allergy related [22]. Therefore, the network analysis provides contextualization for the target drug.
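The search for hubs can be illustrated with a short sketch. It assumes that hub scores are computed in the spirit of Kleinberg's hubs and authorities [39], and that each drug on a left-hand side points to a node representing every rule in which it appears; the construction described in Sect. 3.2 and the software used by the authors may differ. The function `hub_scores` and the placeholder rules are hypothetical.

```python
from math import sqrt


def hub_scores(rules, iterations=50):
    """Power iteration for hub scores on a drug -> rule-node graph.

    `rules` is a list of sets; each set holds the drugs on the left-hand
    side of one rule whose right-hand side is the target drug.
    """
    drugs = sorted({d for lhs in rules for d in lhs})
    hub = {d: 1.0 for d in drugs}
    for _ in range(iterations):
        # Authority of a rule node: sum of hub scores of the drugs pointing to it.
        authority = [sum(hub[d] for d in lhs) for lhs in rules]
        # Hub of a drug: sum of authority scores of the rule nodes it points to.
        new_hub = {d: 0.0 for d in drugs}
        for lhs, score in zip(rules, authority):
            for d in lhs:
                new_hub[d] += score
        # Normalise so that the scores stay bounded across iterations.
        norm = sqrt(sum(v * v for v in new_hub.values()))
        hub = {d: v / norm for d, v in new_hub.items()}
    return hub


# Hypothetical left-hand sides of rules sharing one target drug.
toy_rules = [
    {"DRUG_A", "DRUG_B"},
    {"DRUG_A", "DRUG_C"},
    {"DRUG_A"},
    {"DRUG_C", "DRUG_D"},
]

hubs = hub_scores(toy_rules)
ranking = sorted(hubs, key=hubs.get, reverse=True)
print(ranking)
```

A drug that participates in many rules, and in rules shared with other well-connected drugs, receives a high hub score, which matches the intuition behind the highest scoring hubs shown in the figures.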
Fig. 7 Highest scoring hub of Amoxicillin

Table 6 Most relevant rules when Albuterol is set as right-hand side

| Left-hand side of the rule (lhs) | Support | Confidence | Lift |
| --- | --- | --- | --- |
| {ATROVENT, PREDNISONE, THEOPHYLLINE} | 0.00040442 | 0.85714 | 18.646 |
| {AZMACORT INHALER} | 0.00047183 | 0.77778 | 16.920 |
| {AZMACORT, PREDNISONE} | 0.00047183 | 0.77778 | 16.920 |
| {FLOVENT, SINGULAIR, SINGULAIR (unit-of-use)} | 0.00047183 | 0.77778 | 16.920 |
| {AMOXICILLIN, PULMICORT} | 0.00040442 | 0.75000 | 16.315 |

5 Discussion
Association mining is a common practice to support cross-selling for industries such as banking and insurance. In a medical context, it has mostly been used on clinical
repositories to identify associations between symptoms, health conditions, and diseases. When medical expenditure was included as an object of analysis, the goal was most commonly to perform pharmacovigilance by detecting relationships between medicines and adverse events, or to find out whether subsets of the population were exposed to an inappropriate treatment regimen. In comparison, little has been done to apply association mining solely to medication expenditure data in the United States. In this paper, we thus aimed to contribute to the literature on data science in healthcare by performing association mining of the 2011 MEPS. In addition, we introduced the representation of association rules as networks for target drugs, and demonstrated through three cases that it could complement the medical insight provided by looking at the most relevant rules, as is common practice.
Fig. 8 Highest scoring hub of Albuterol
Finding which items are likely to be prescribed alongside each other to a patient provides valuable information for pharmaceutical companies as well as the pharmacosurveillance system and public health. For example, in public health terms, the joint prescription of medications for depression, heart failure and diabetes contributes to providing evidence about co-morbidities. In pharmacosurveillance, finding that two drugs having the same purpose and active compounds were prescribed together raises questions as to the usefulness of this common practice. Finally, for pharmaceutical companies, finding that those buying one specific diabetes drug were very likely to be interested in a device measuring blood sugar can be used for cross-selling of the two specific products.

One of the limitations of this work comes from the specific dataset (MEPS) that we used. As noted in [17], the MEPS dataset excludes individuals in institutions (e.g., nursing homes) even if they do contribute to medical expenditure. In addition, research suggests that households have under-reported events, leading to lower expenditure estimates in MEPS than in the national health expenditure accounts [17]. Consequently, there may have been associations other than those mentioned. Another limitation concerns the use of our approach for other datasets, since our algorithms were chosen specifically for the MEPS dataset. We do not claim that our approach can be used identically for other datasets on medical expenditure. Rather, we reviewed alternative algorithms that may be employed depending on the characteristics of one’s medical expenditure dataset. For example, larger datasets may require heuristics which are more computationally feasible than the exact solution used in this paper. Additionally, the availability of datasets with additional parameters could improve the accuracy of the generated target networks.
If a transactional dataset included geographical locations, for example, one could methodically prune the rules generated for the target network by relevant location. This pruning could conceivably be performed
over other parameters which may be provided with transactional data, such as patients’ biometric [44], medical, or financial data.

Using medical data as input to algorithms raises questions of privacy and legal issues, and in many cases one may have to design systems based on human knowledge much more than on recorded data [7,46]. Performing data mining on medical data is no exception [61], even when mining drug purchases. However, there can be misconceptions about where the issues lie. When stating that “a few (US) state legislatures have passed laws to prohibit or limit the use of data mining for marketing purposes” [52], it is key to understand what ‘data mining’ precisely entails in that context. The key problem is the collection of the quantity, name, and dose of a prescribed drug together with the prescribing physician’s name. This can indeed be used to find which physicians are quick to prescribe new drugs (such that they can be targeted for new molecules on the market), which already prescribe a set of drugs (allowing companies to leverage an existing base), etc. In contrast, the use of prescription drug mining in this paper does not include physicians. Indeed, we are only interested in discovering how drugs are actually used by the American population, which could also inform pharmaceutical companies but not for selecting physicians.

There are several possible extensions to this work. Our representation of association rules between drugs as a network was only used to search for hubs. Numerous other metrics from the field of complex networks could be used to further investigate associations. Similarly, the association rules {A} ⇒ {B} that we generated were only based on support (P({A, B})) and confidence (P(B|A)). Support and confidence are identical on 6 of the 8 mathematical properties for measures of interesting association rules given by Tan et al. [55]. There exists a plethora of alternative measures, including:

– Conviction, which was developed by Brin et al. [9] as an alternative to confidence. It is defined by
\[
\max\left( \frac{P(A)\,P(\bar{B})}{P(A,\bar{B})},\ \frac{P(B)\,P(\bar{A})}{P(B,\bar{A})} \right)
\]
– J-measure, which, according to Blanchard et al., is “the most commonly used information-theoretic measure within the context of association rules” [8]. It is defined by
\[
\max\left( P(A,B)\log\frac{P(B|A)}{P(B)} + P(A,\bar{B})\log\frac{P(\bar{B}|A)}{P(\bar{B})},\ P(A,B)\log\frac{P(A|B)}{P(A)} + P(\bar{A},B)\log\frac{P(\bar{A}|B)}{P(\bar{A})} \right)
\]

Networks could thus be generated based on measures that have different mathematical properties, or low pairwise correlation. It would then be particularly interesting to relate the kind of pharmaceutical and epidemiological insight that the network provides to the metrics that are used to build it. In addition, data could be linked to each drug from medical databases such that they are annotated with the conditions that they are normally prescribed for as well as their active compounds. This would make it possible to further mine the network for drugs that may be redundant. Finally, one could be interested in having not only the time but also the geographical location of the prescription. This would make it possible to track how newer
drugs are adopted across the country and potentially change formerly established associations.
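The two alternative measures listed above, conviction and the J-measure, can be computed directly from P(A), P(B) and P(A, B). The following is a minimal sketch under stated assumptions: the function names are hypothetical, the natural logarithm is used (the base is not specified in this section and only rescales the J-measure), and terms with a zero joint probability are treated as contributing zero.

```python
from math import inf, isclose, log


def conviction(p_a, p_b, p_ab):
    """Symmetric conviction; expects 0 < p_a, p_b < 1 and p_ab = P(A, B)."""
    def one_way(p_x, p_y):
        # P(X) * P(not Y) / P(X, not Y); infinite when X never occurs without Y.
        p_x_not_y = p_x - p_ab
        return inf if p_x_not_y <= 0 else p_x * (1 - p_y) / p_x_not_y
    return max(one_way(p_a, p_b), one_way(p_b, p_a))


def j_measure(p_a, p_b, p_ab):
    """Symmetric J-measure; expects 0 < p_a, p_b < 1 and p_ab = P(A, B)."""
    def term(joint, p_given, p_marginal):
        # joint * log((joint / p_given) / p_marginal), with 0 * log(0) taken as 0.
        if joint <= 0:
            return 0.0
        return joint * log((joint / p_given) / p_marginal)

    def one_way(p_x, p_y):
        # P(X,Y) log(P(Y|X)/P(Y)) + P(X,not Y) log(P(not Y|X)/P(not Y))
        return term(p_ab, p_x, p_y) + term(p_x - p_ab, p_x, 1 - p_y)

    return max(one_way(p_a, p_b), one_way(p_b, p_a))


# Independent items: P(A, B) = P(A) * P(B).
conv_indep = conviction(0.5, 0.4, 0.2)
j_indep = j_measure(0.5, 0.4, 0.2)

# A never occurs without B: P(A, B) = P(A).
conv_impl = conviction(0.2, 0.5, 0.2)
j_impl = j_measure(0.2, 0.5, 0.2)
print(conv_indep, j_indep, conv_impl, j_impl)
```

For independent items, conviction equals 1 and the J-measure equals 0, whereas a rule that is never violated has infinite conviction. Both measures therefore separate cases that support and confidence alone would rank similarly.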
6 Conclusion
In this article, we have contributed to the use of association mining in investigating relationships between drugs in the United States. Furthermore, we introduced the notion of target drug network, and demonstrated via three different drugs that it could supplement the information provided by looking solely at association rules. By pulling records from medical databases on the compounds and intended uses of the drugs, future research may seek to annotate the target drug network and use additional metrics from complex networks to further explore practices in drug prescriptions.

Acknowledgments EB, IP, NHB, ABA and WH would like to thank the Department of Computer Science, Troy University for providing physical infrastructure. PJG is grateful to the Department of Computer Science, Northern Illinois University for research support. VM expresses his gratitude to the Department of Computer Science, Lakehead University for research support.

Compliance with ethical standards
Conflict of interest There are no competing interests. This study is based on public access data provided by the Medical Expenditure Panel Survey. The interpretation and conclusions of the results are those of the researchers only.
References
1. Accu-Check (2015) Accu-check softclix lancing device. https://www.accu-chek.com/us/lancing-devices/softclix.html. Accessed 5 May 2015
2. Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data, SIGMOD ’93, pp 207–216
3. Agrawal R, Srikant R, et al. (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference very large data bases, VLDB, vol 1215, pp 487–499
4. Akçura MT, Srinivasan K (2005) Research note: customer intimacy and cross-selling strategy. Manag Sci 51(6):1007–1012
5. Bastien M, Poirier P, Lemieux I, Despres J (2014) Overview of epidemiology and contribution of obesity to cardiovascular disease. Prog Cardiovasc Dis 56(4):369–381
6. Bereznicki BJ, Peterson GM, Jackson SL, Walters EH, Fitzmaurice KD, Gee PR (2008) Data-mining of medication records to improve asthma management. MJA 189(1):21–25
7. Bhatia A, Mago V, Singh R (2014) Use of soft computing techniques in medical decision making: A survey. In: Proceedings of the 2014 international conference on advances in computing, communications and informatics (ICACCI), pp 1131–1137
8. Blanchard J, Guillet F, Gras R, Briand H (2005) Using information-theoretic measures to assess association rule interestingness. In: Proceedings of the fifth IEEE international conference on Data mining ICDM 2005. IEEE Computer Society Press, Los Alamitos, pp 66–73
9. Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. ACM SIGMOD Rec 26(2):255–264
10. Butts CT (2014) sna package. https://cran.r-project.org/web/packages/sna/index.html. Accessed 20 Apr 2015
11. Cami A, Arnold A, Manzi S, Reis B (2011) Predicting adverse drug events using pharmacological network models. Sci Trans Med 3(114):114–127
12. Cavique L (2004) Graph-based structures for the market baskets analysis. Inv Op 24(2):233–46
13. Chan E, Zhan C, Homer CJ (2002) Health care use and costs for children with attention-deficit/hyperactivity disorder. Arch Pediatr Adolesc Med 156:504–511
14. Chen T, Chou L, Hwang S (2003) Application of a data-mining technique to analyze coprescription patterns for antacids in taiwan. Clin Ther 25(9):2453–2463
15. Cheng Y, Tang K, Shen R, Hu Y (2005) Market basket analysis in a multiple store environment. Decis Support Syst 40(2):339–354
16. Cios K, Swiniarski R, Pedrycz W, Kurgan L (2007) Unsupervised learning: association rules. In: Kecman V (ed) Data mining. Springer, US, pp 289–306
17. Cohen JW, Cohen SB, Banthin JS (2009) The medical expenditure panel survey: a national information resource to support healthcare cost research and inform policy and practice. Med Care 47(1):S44–S50
18. Data MHS (2011) Meps hc-059a: 2011 prescribed medicines file. http://meps.ahrq.gov/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-059A. Accessed 20 Apr 2015
19. Djenouri Y, Bendjoudi A, Mehdi M, Nouali-Taboudjemat N, Habbas Z (2015) Gpu-based bees swarm optimization for association rules mining. J Supercomput 71(4):1318–1344
20. Doddi S, Marathe A, Ravi SS, Torney DC (2001) Discovery of association rules in medical data. Inform Health Soc Care 26(1):25–33
21. Drasic L, Giabbanelli P (2015) Exploring the interactions between physical well-being, and obesity. Can J Diabetes 39:S12–S13
22. Food and drug administration (2008) Draft guidance on albuterol sulfate. http://www.accessdata.fda.gov/drugsatfda_docs/label/2008/050575s037550597s044050725s025050726s019lbl.pdf. Accessed 20 Apr 2015
23. Fu H (2008) Cluster analysis and association analysis for the same data. In: Proceedings of the 7th WSEAS international conference on artificial intelligence, knowledge engineering and data bases (AIKED’08), pp 576–581
24. Giabbanelli P, Crutzen R (2014) Creating groups with similar expected behavioural response in randomized controlled trials: a fuzzy cognitive map approach. BMC Med Res Methodol 14(1):130
25. Giabbanelli P, Jackson P, Finegood D (2014) Modelling the joint effect of social determinants and peers on obesity among canadian adults. Theor Simul Complex Soc Syst 52:145–160
26. Giabbanelli P, Peters J (2011) Complex networks and epidemics. Tech Sci Inform 30:181–212
27. Giabbanelli PJ (2013) A novel framework for complex networks and chronic diseases. Springer, UK, pp 207–215
28. Giabbanelli PJ, Crutzen R (2013) An agent-based social network model of binge drinking among dutch adults. J Artif Soc Soc Simul 16(2):10
29. GlaxoSmithKline: augmentin (amoxicillin/clavulanate potassium) prescribing information (2008). http://www.accessdata.fda.gov/drugsatfda_docs/label/2008/050575s037550597s044050725s025050726s019lbl.pdf. Accessed 20 Apr 2015
30. Gregori-Puigjane E, Mestres J (2008) A ligand-based approach to mining the chemogenomic space of drugs. Comb Chem High Throughput Screen 11:669–676
31. Hahsler M, Buchta C, Gruen B, Hornik K, Borgelt C (2015) arules package. http://cran.r-project.org/web/packages/arules/arules.pdf. Accessed 20 Apr 2015
32. Harrison T, Ansell J (2002) Customer retention in the insurance industry: using survival analysis to predict cross-selling opportunities. J Financ Serv Mark 6(3):229–239
33. Hauser DC, Young DA, Braitman LE (2010) Adapting the bundles approach to reduce medication errors in pharmacy practice. J Clin Outcomes Manag 17(3):125–131
34. Ilayaraja M, Meyyappan T (2013) Mining medical data to identify frequent diseases using apriori algorithm. In: Proceedings of the 2013 international conference on pattern recognition, informatics and mobile engineering
35. Inzucchi SE, Lipska KJ, Mayo H, Bailey CJ, McGuire DK (2014) Metformin in patients with type 2 diabetes and kidney disease: a systematic review. JAMA 312(24):2668–2675
36. Jarrar YF, Neely A (2002) Cross-selling in the financial sector: Customer profitability is key. J Target Meas Anal Mark 10(3):282–296
37. Jin H, Chen J, Kelman C, He H, McAullay D, O’Keefe CM (2006) Mining unexpected associations for signalling potential adverse drug reactions from administrative health databases. In: Proceedings of the 2006 Pacific-Asia conference on knowledge discovery and data mining, pp 867–876
38. Kim HK, Kim JK, Chen QY (2012) A product network analysis for extending the market basket analysis. Expert Syst Appl 39(8):7403–7410
39. Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM (JACM) 46(5):604–632
40. Lahiri SW (2012) Management of type 2 diabetes: what is the next step after metformin? Clin Diabetes 30(2):72–75
41. Law AW, Reed SD, Sundy JS, Schulman KA (2003) Direct costs of allergic rhinitis in the united states: estimates from the 1996 medical expenditure panel survey. J Allergy Clin Immunol 111:296–300
42. Lee HS, Bae T, Lee JH, Kim DG, Oh YS, Jang Y, Kim JT, Lee JJ, Innocenti A, Supuran CT, Chen L, Rho K, Kim S (2012) Rational drug repositioning guided by an integrated pharmacological network of protein, disease and drug. BMC Syst Biol 6(1):1–10
43. Liao TW (2005) Clustering of time series data–a survey. Pattern Recognit 38(11):1857–1874
44. Liu C, Mago VK (2012) Cross disciplinary biometric systems. Springer, Berlin
45. Machlin SR, Soni A (2013) Health care expenditures for adults with multiple treated chronic conditions: estimates from the medical expenditure panel survey, 2009. Prev Chronic Dis 10:120–172
46. Mago VK, Woolrych R, Sixsmith A (2012) Understanding fall events in long term care using fuzzy cognitive map. Gerontechnology 11(2):343
47. Mayer-Davis E, D’Agostino R, Karter A, Haffner S, Rewers M, Saad M, Bergman R (1998) Intensity and amount of physical activity in relation to insulin sensitivity: the insulin resistance atherosclerosis study. J Am Med Assoc 279(9):669–674
48. MedicinePlus (2010) Amoxicillin. http://www.nlm.nih.gov/medlineplus/druginfo/meds/a685001.html. Accessed 28 Apr 2015
49. Medscape reference (2015) WebMD.: prozac, sarafem (fluoxetine) dosing, indications, interactions, adverse effects, and more. http://reference.medscape.com/drug/prozac-sarafem-fluoxetine-342955. Accessed 28 Apr 2015
50. Mottillo S, Filion K, Genest J, Joseph L, Pilote L, Poirier P, Rinfret S, Schiffrin E, Eisenberg M (2010) The metabolic syndrome and cardiovascular risk: a systematic review and meta-analysis. J Am Coll Cardiol 56(14):1130–1132
51. Mullins IM, Siadaty MS, Lyman J, Scully K, Garrett CT, Miller WG, Muller R, Robson B, Apte C, Weiss S, Rigoutsos I, Platt D, Cohen S, Knaus WA (2006) Data mining and clinical data repositories: insights from a 667,000 patient data set. Comput Biol Med 36:1351–1377
52. Orentlicher D (2010) Prescription data mining and the protection of patients’ interests. J Law Med Ethics 38(1):74–84
53. Raeder T, Chawla N (2009) Modeling a store’s product space as a social network. In: Proceedings of the 2009 international conference on advances in social network analysis and mining, pp 164–169
54. Soysal O, Gupta E, Donepudi H (2015) A sparse memory allocation data structure for sequential and parallel association rule mining. J Supercomput 72(2):347–370
55. Tan PN, Kumar V, Srivastava J (2002) Selecting the right interestingness measure for association patterns. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02. ACM, USA, pp 32–41
56. Tan SC, San Lau JP (2014) Time series clustering: a superior alternative for market basket analysis. In: Proceedings of the first international conference on advanced data and information engineering (DaEng-2013). Springer, Berlin, pp 241–248
57. U.S. national library of medicine: Dextromethorphan (2011). http://www.nlm.nih.gov/medlineplus/druginfo/meds/a682492.html. Accessed 20 Apr 2015
58. World Health Organization: international drug monitoring: the role of national centres. Tech Report Ser 498 (1972)
59. Xue M, Zhang S, Cai C, Yu X, Shan L, Liu X, Zhang W, Li H (2013) Predicting the drug safety for traditional chinese medicine through a comparative analysis of withdrawn drugs using pharmacological network. Evid Based Complement Altern Med 2013:1–11
60. Yang H, Yang CC (2015) Using health-consumer-contributed data to detect adverse drug reactions by association mining with temporal analysis. ACM Trans Intell Syst Technol 6(4):1–55 (27)
61. Yoo I, Alafaireet P, Marinov M, Pena-Hernandez K, Gopidi R, Chang JF, Hua L (2012) Data mining in healthcare and biomedicine: a survey of the literature. J Med Syst 36:2431–2448
62. Zhang F, Zhang Y, Bakos J (2013) Accelerating frequent itemset mining on graphics processing units. J Supercomput 66(1):94–117
63. Zhu C, Wu C, Jegga AG (2015) Network biology methods for drug repositioning. In: Sakharkar KR, Sakharkar MK, Chandra R (eds) Post-Genomic Approaches in Drug and Vaccine Development. River Publishers, Aalborg, pp 115–132