Behaviormetrika, DOI 10.1007/s41237-016-0013-5. Invited Paper

Methods for computational causal discovery in biomedicine

Sisi Ma (1) · Alexander Statnikov (2)

Received: 29 August 2016 / Accepted: 23 December 2016. © The Behaviormetric Society 2017
Abstract: With the development of high-throughput technology over the past twenty years, it has become easier and cheaper to simultaneously measure tens of thousands of molecules in biological systems. A major challenge is how to extract knowledge from these high-dimensional datasets and infer the underlying mechanisms of the system. In this review, we discuss several topics related to causal discovery from biomedical data, including causal structural learning from observational and experimental data, estimation of causal effects, and the use of causal information for predictive modeling.

Keywords: Causal learning · Causal inference · Active learning · Feature selection
Communicated by Shohei Shimizu.

Corresponding author: Sisi Ma ([email protected]); Alexander Statnikov ([email protected])
1 University of Minnesota, Minneapolis, USA
2 New York University School of Medicine, New York, USA

1 Introduction

Depicting the mechanisms governing a system, i.e. causality, has been the pursuit of all sciences. Theories and computational methods for identifying causality from data, both experimental and observational, have made substantial progress over the past fifty years. However, relatively few studies have applied these methods to causal discovery in biomedical systems. This gap is partially due to the fact that researchers in the biomedical community are unfamiliar with the causal learning literature; similarly, researchers in the causal discovery community are unaware of some of the practical challenges presented by biomedical data. This review aims to bridge this gap by introducing the basic theories and some important algorithms from causal discovery, and by discussing the practical considerations that arise when applying these methods to data obtained from biomedical systems.

In empirical sciences, the final arbiter of causal relations is the randomized controlled experiment, and causal effects quantify the outcome of such an experiment. Therefore, causality is often operationalized as causal effects. However, conducting experiments is costly and sometimes infeasible or unethical, especially in biomedicine. Given a set of causal assumptions, one can instead rely on observational data to infer causality to some extent (Maathuis and Nandy 2015). Early algorithms for causal learning rest on assumptions that are generally violated by data collected in the biomedical domain. More recent algorithms relax some of these assumptions and are thus more suitable for causal structural learning from biomedical data. In Sect. 2, we introduce several algorithms that learn causal structure from observational data. We describe some of these algorithms in detail and emphasize their assumptions.

The causal structure of a system cannot, in general, be fully resolved from observational data. For example, it is impossible to distinguish the cause from the effect in a system consisting of two correlated binary variables. Active learning refers to a class of algorithms that learn the causal structure from both observational and experimental data. The active learning framework is especially useful in the biomedical domain, where experimental data are costly to obtain.
One important application of the active learning framework is interactively recommending the proper experiment(s) to perform according to the current knowledge of the causal system in question. Active learning algorithms are discussed in Sect. 3. In Sect. 4, we introduce the estimation of causal effects from observational data, with or without knowledge of the causal structure of the underlying system. Causal effects are the effects of a hypothetical manipulation. The estimation of causal effects is critical for empirical problems in biomedicine, such as identifying the most promising drugs for a disease or pinpointing the most effective public health interventions for an epidemic. After describing the methods for causal structural learning and the estimation of causal effects, we touch upon the contribution of causal knowledge to predictive modeling in Sect. 5. With the rapid advancement of data mining, pattern recognition, and machine learning, predictive modeling has been adopted in biomedicine to build diagnostic and prognostic models for diseases and disorders. The variables, or features, used to construct diagnostic and prognostic models are generally obtained by feature selection methods that aim to optimize predictive performance. We argue that feature selection based on causal knowledge is more beneficial, especially under distribution shift. The authors would like to note that this review is intended as a general overview of computational causal analytics methods and some of their applications in the field of biomedicine. Neither the list of methods nor the examples of applications are comprehensive. For different perspectives on the topic, we recommend the following review articles to interested readers: Friedman et al. (1999), Schadt
(2009), De Smet and Marchal (2010) and Lagani et al. (2016). Several important challenges regarding the application of computational causal discovery in biomedicine are not discussed in detail in this review; we provide some references for the reader below. First, it is sometimes difficult to determine the quality of causal models constructed from data due to the lack of proper gold standards. To address this issue, experimental validations of the reconstructed network should be conducted; examples include Schadt et al. (2005), Olsen et al. (2014) and Hill et al. (2016). A related challenge is the evaluation and benchmarking of various computational causal discovery methods for biomedical data. More specific questions include, but are not limited to: What is the best type of real biological data for evaluation and benchmarking? How can realistic simulated networks be generated for method evaluation and benchmarking? What metrics should be used for assessment? These questions are discussed in more detail in Stolovitzky et al. (2007), Marbach et al. (2009) and Cantone et al. (2009). Another common problem is that certain types of biological data are not suitable for answering causal questions. For example, using aggregate gene expression data from thousands to millions of cells to reconstruct intracellular molecular pathways and networks is inappropriate, since the cells are unlikely to be synchronized and the aggregated gene expression does not capture the actual intracellular processes (Spirtes et al. 2000; Sachs et al. 2013). As hinted above, multiple factors influence the performance of a particular computational causal discovery method, such as the nature of the data (e.g. the data quality, the data type, and the sample size) and the metric used for quality assessment.
Therefore, in this review, we refrain from recommending one algorithm over another and note that the choice of computational causal discovery method should be determined by the characteristics of the data and the research question. In the rest of this review, we use uppercase letters (e.g. X, Y) to represent both variables and their corresponding vertices in the causal graph underlying the data generating processes of these variables. We use bold uppercase letters (e.g. Z) to represent sets of variables or vertices. We use lowercase letters (e.g. x, y) to represent the values of variables. We will comment on the notation where exceptions to the above occur.
2 Causal structural learning from observational data

Observational data are measurements or values obtained on a set of variables without any interference or manipulation applied to the system in question. As opposed to experimental data, which are collected when manipulations are performed, observational data are abundant and cheaper to obtain. As shown in Fig. 1, the underlying process that generates the observational data can be represented by a graph (for the moment, let us consider DAGs, directed acyclic graphs, in which all edges are directed and no loops are present), where each vertex represents a variable and each edge represents a direct causal relationship. X → Y denotes that X is a direct cause of Y, and Y a direct effect of X. The causal graph determines the joint distribution of all the variables, from which the observational data are drawn.
Fig. 1 Relationships among the causal graph, the joint distribution, and the data. The solid gray arrow represents the data generation process, whereas the dashed gray arrow represents causal structural learning from data.
The goal of causal structural learning is to reverse-engineer the data generation process, i.e. to uncover or reconstruct the causal graph from the observational data (the dashed arrow in Fig. 1). The key link between the observational data and the causal graph is the joint distribution, which can be inferred from observational data in the large-sample limit. Let D denote the p-dimensional (i.e. p variables are measured) observational dataset. The joint distribution function is denoted p(X1, X2, ..., Xp). The joint distribution gives the probability that a data instance (x1, x2, ..., xp) falls in a particular range of the space jointly defined by the variables X1, X2, ..., Xp. Naively, for multinomial data, the joint distribution p(X1, X2, ..., Xp) can be represented explicitly for all possible instantiations of X1, ..., Xp, which amounts to 2^p instantiations for p binary variables. This representation is very expensive and gives no explicit information regarding the causal relations among the variables. To construct a more causally relevant representation, one critical insight is that each variable Xi is causally determined by only a small subset of variables Pa_i (see footnote 1), not by all the variables observed. Therefore, the joint distribution can be represented as a product of factors: intuitively, one only needs the instantiation of the variables in Pa_i to instantiate Xi. The variables Pa_i with this property are the Markovian parents of the variable Xi. Due to the property of the Markovian parents and the chain rule, the joint distribution can be decomposed as p(X1, X2, ..., Xp) = Π_{i=1}^{p} p(Xi | Pa_i), where p(Xi | Pa_i) is the probability of Xi conditioned on its Markovian parents Pa_i.
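As a toy illustration of this factorization, the following sketch (three binary variables in a hypothetical chain X1 → X2 → X3; all CPT values are made up for illustration) computes the joint as the product of the local conditionals p(Xi | Pa_i) and checks that the result normalizes:

```python
# Markov factorization p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2) for a
# hypothetical binary chain X1 -> X2 -> X3; all CPT values are illustrative.
parents = {1: (), 2: (1,), 3: (2,)}   # Markovian parents Pa_i of each X_i

# cpt[i] maps (value of X_i, values of Pa_i) -> conditional probability
cpt = {
    1: {(0, ()): 0.6, (1, ()): 0.4},
    2: {(0, (0,)): 0.8, (1, (0,)): 0.2, (0, (1,)): 0.3, (1, (1,)): 0.7},
    3: {(0, (0,)): 0.9, (1, (0,)): 0.1, (0, (1,)): 0.5, (1, (1,)): 0.5},
}

def joint(x):
    """p(x1, x2, x3) as the product of the local conditionals p(xi | pa_i)."""
    prob = 1.0
    for i, xi in enumerate(x, start=1):
        pa_vals = tuple(x[j - 1] for j in parents[i])
        prob *= cpt[i][(xi, pa_vals)]
    return prob

# The factorized joint must sum to one over all 2^3 instantiations.
total = sum(joint((a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

The factored representation stores 2 + 4 + 4 conditional probabilities instead of the 2^3 entries of the full table, and the gap widens rapidly with p.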
The Markovian parents of Xi, obtained by examining the joint distribution, bear resemblance to the direct causes in the data generating process described by the DAG. The condition that connects the Markovian parents to the direct causes is the Markov condition.

Definition 1 (Markov condition): The joint probability distribution p(V) over variables V satisfies the Markov condition for a DAG G⟨V, E⟩ if and only if, for every W ∈ V, W is conditionally independent of all variables in V excluding the descendants of W, given the parents of W.

Footnote 1: Here we assume that all variables in the system are observed (see Definition 6), and we consider the smallest Pa_i possible.

Definition 2 (Conditional independence): Let V be a set of variables and p be the joint distribution over V. For all X ⊆ V, Y ⊆ V, Z ⊆ V, X and Y are conditionally independent given Z if p(X | Y, Z) = p(X | Z) whenever p(Z) > 0. In other words, the variables in Y provide no additional information about X once Z is available; the variables in Z "screen off" the variables in Y from the variables in X.

The Markov condition establishes the relationship between the graphical structure of the causal graph and the joint distribution, whereas the faithfulness condition ensures that all and only the independence relations in the joint distribution are entailed by the causal graph.

Definition 3 (Faithfulness): If all and only the conditional independence relations that are true in the joint distribution p(V) are entailed by the Markov condition applied to a DAG G⟨V, E⟩, then p(V) and G are faithful to one another.

If the faithfulness and Markov conditions are both satisfied, the causal minimality condition is also satisfied. Causal minimality states that no proper subgraph of G over the variable set V satisfies the Markov condition, i.e. the graph G is the simplest causal explanation for the joint distribution p(V).

2.1 Causal structural learning with strong assumptions

We first consider algorithms that make strong assumptions about the data generating process. The algorithms discussed in this section assume causal sufficiency (Definition 6), the Markov condition, and faithfulness, and only consider certain families of joint distributions. We also restrict the discussion to acyclic graphs.

Search-and-score methods. One natural way to infer the causal graph underlying the data is the Bayesian approach.
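The screening-off property of Definition 2 can be verified exactly on a toy binary chain X1 → X2 → X3 (hypothetical CPT values): the parent X2 screens X1 off from X3, so p(X3 | X1, X2) = p(X3 | X2) holds term by term when both conditionals are computed directly from the joint by marginalization:

```python
# Exact check that X3 is independent of X1 given X2 in a hypothetical binary
# chain X1 -> X2 -> X3 (illustrative CPT values): p(x3 | x1, x2) = p(x3 | x2).
from itertools import product

p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}
p_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}

def joint(x1, x2, x3):
    """Markov factorization of the chain."""
    return p_x1[x1] * p_x2_given_x1[(x2, x1)] * p_x3_given_x2[(x3, x2)]

def p_x3_given(x3, x2, x1=None):
    """Conditional of X3 computed directly from the joint by marginalization."""
    if x1 is None:   # p(x3 | x2): marginalize X1 out of numerator and denominator
        num = sum(joint(a, x2, x3) for a in (0, 1))
        den = sum(joint(a, x2, c) for a, c in product((0, 1), repeat=2))
    else:            # p(x3 | x1, x2)
        num = joint(x1, x2, x3)
        den = sum(joint(x1, x2, c) for c in (0, 1))
    return num / den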
Intuitively, consider all possible causal graphs over the set of observed variable V, the graph(s) Gi hV; Ei i that maximizes the posterior probability of Gi given the observed data D, PðGi jDÞ, most likely represents the true data generation process. For an illustrative example, see Cooper and Herskovits (1992). Notice that, two graphs Gi and Gj could be equally likely given an observational dataset; in this case, Gi and Gj belong to the same Markov equivalence class. In general, computing PðGi jDÞ for all Gi is computationally infeasible, since the number of possible DAGs is super-exponential to the number of vertices2. The search-and-score methods are therefore proposed to simplify the search for the optimal Gi by: (1) computing a heuristic score for Gi instead of directly computing PðGi jDÞ; and (2) using a heuristic search strategy to explore the space of Gi efficiently. It is worth noting that 2
Let ajVj denote the number of all possible DAGs over |V| vertices. ajVj could be obtained through the PjVj kðjVjkÞ ajVjk (Robinson 1978). following recursion: ajVj ¼ k¼1 ð1Þkþ1 jVj k 2
123
Behaviormetrika
methods in this class generally assumes multinomial distribution (i.e. the heuristic score is designed for multinomial data), with a few exceptions such as Geiger and Heckerman (1994), Heckerman and Geiger (2013) and John and Langley (1995) which assumes multivariate Gaussian distribution. Earlier search-and-score methods include Buntine (1991), Heckerman et al. (1995) and Cooper and Herskovits (1992). A more recent contribution, the GES (Greedy Equivalence Search) algorithm (Chickering 2002) recognized the Markov equivalence of certain causal graphs given the data and introduced greedy search algorithm to search over the space of all possible Markov equivalence classes (represented by complete partially directed graph, CPDAG) rather than the space of all DAGs. The algorithm has been proven to be sound in the limit of large samples, i.e. able to recover the data generating process up to the Markov equivalence class. The algorithm constitutes of a greedy edge addition phase and a greedy edge deletion phase maximizing the BDeu score (likelihood-equivalent Bayesian Dirichlet score with uniformed prior, see Heckerman et al. (1995) and Buntine (1991)). One advantage of the search-and-score method is that certain types of domain knowledge, such as the absence or presence of causal relation between two variables determined by prior experiments, can be easily incorporated by assigning different prior probabilities to different causal structures (Heckerman et al. 1995). This is especially important in biomedicine where the causal system is generally complex and evidence from multiple sources regarding different aspects of the system is available. For example, to infer gene regulatory network, one can use the gene expression data as the primary data for network reconstruction. Protein-DNA interaction, pathway database, and evolutionary information can be used as priors. 
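Two small sketches tie the points above together. The first implements the Robinson (1978) recursion from footnote 2, which shows how quickly the DAG space grows and why heuristic search is needed. The second is a brute-force stand-in for the search step: with only three variables, every DAG can still be enumerated and scored; here a Gaussian BIC score is used in place of the discrete BDeu score, and all simulated coefficients are illustrative.

```python
# (1) Robinson's recursion for the number of labeled DAGs; (2) exhaustive
# search-and-score on three variables with a Gaussian BIC (a stand-in for
# BDeu, which targets discrete data). Simulated chain: X0 -> X1 -> X2.
import itertools
from functools import lru_cache
from math import comb
import numpy as np

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n vertices (Robinson 1978), with a_0 = 1."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

def is_acyclic(edges, p):
    """Kahn's algorithm: True if the directed edge list is acyclic."""
    indeg = {i: 0 for i in range(p)}
    for _, j in edges:
        indeg[j] += 1
    queue = [i for i in indeg if indeg[i] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for i, j in edges:
            if i == u:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return seen == p

def bic_score(data, edges):
    """Gaussian BIC of a DAG: one linear regression per vertex on its parents."""
    n, p = data.shape
    score = 0.0
    for child in range(p):
        pa = [i for i, j in edges if j == child]
        X = np.column_stack([data[:, pa], np.ones(n)]) if pa else np.ones((n, 1))
        beta, *_ = np.linalg.lstsq(X, data[:, child], rcond=None)
        resid = data[:, child] - X @ beta
        var = resid @ resid / n
        score += -0.5 * n * (np.log(2 * np.pi * var) + 1)
        score -= 0.5 * (X.shape[1] + 1) * np.log(n)   # penalize parameters
    return score

# Simulate the chain X0 -> X1 -> X2 and exhaustively score all DAGs.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 1.5 * x0 + rng.normal(size=n)
x2 = -2.0 * x1 + rng.normal(size=n)
data = np.column_stack([x0, x1, x2])

candidates = [(i, j) for i in range(3) for j in range(3) if i != j]
dags = [es for r in range(len(candidates) + 1)
        for es in itertools.combinations(candidates, r)
        if is_acyclic(es, 3)]
best = max(dags, key=lambda es: bic_score(data, es))
```

The highest-scoring graph is one of the three members of the chain's Markov equivalence class (all three attain the same likelihood and penalty), illustrating why score-based search recovers the structure only up to Markov equivalence.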
This strategy has been shown to increase reconstruction accuracy in multiple studies, such as Tamada et al. (2003), Imoto et al. (2004), Feelders and Van der Gaag (2006), Werhli and Husmeier (2007) and Isci et al. (2014). Similarly, the search space of possible causal structures can be constrained with existing biological knowledge to make genome-wide network reconstruction feasible (Li et al. 2005; Ott et al. 2004). In addition, experimental data can be easily incorporated to refine the causal model constructed from observational data. We will discuss this point in more detail in Sect. 4.

Conditional independence constraint-based methods. As mentioned earlier, with a few exceptions, existing implementations of search-and-score algorithms generally require discrete data so that the Bayesian score capturing the posterior probability of the causal model given the data can be computed efficiently. However, most of the variables measured in the biomedical domain are continuous, and data discretization for Bayesian network reconstruction, or data discretization in general, remains an active area of research (Uusitalo 2007; Ramírez-Gallego et al. 2016). Instead of explicitly modeling the joint distribution of the data or the distributions of individual variables like the Bayesian methods, the constraint-based methods model the relationships among the variables defined by conditional independence (see Definition 2). The constraint-based methods are more flexible in the sense that there are ample conditional independence tests to choose from to accommodate different data distributions. The link between the
conditional independence relations (calculated from the data) and the underlying causal graph, more specifically the graphical concept of d-separation, is established through the probabilistic implication of d-separation (see footnote 3).

Definition 4 (d-separation): A path p is d-separated by a set of vertices W if and only if (1) p contains a chain X → Y → Z or a fork X ← Y → Z such that Y ∈ W, or (2) p contains a collider Y (i.e. X → Y ← Z) such that Y ∉ W and no descendant of Y is in W.

Definition 5 (Probabilistic implication of d-separation): If X and Y are d-separated by Z in a DAG G, then X ⊥ Y | Z in every distribution compatible with G; if X and Y are not d-separated by Z, then X and Y are dependent given Z in at least one distribution compatible with G. The symbol ⊥ represents statistical independence.

The SGS algorithm (Spirtes et al. 1990) converts the conditional independence relations in the observational data into d-separation statements and reconstructs the causal graph using the probabilistic implication of d-separation. Specifically, the first phase of the SGS starts with a complete unoriented graph over all variables measured in the observational data. The edge between a pair of variables X, Y is deleted if there exists a set S ⊆ V \ {X, Y} that renders X and Y independent, i.e. X ⊥ Y | S. The first phase produces an unoriented graph, also referred to as the causal skeleton. The second phase of the algorithm is edge orientation. The three rules for edge orientation are: (1) for a set of three variables with the structure X–Y–Z (and no edge between X and Z), orient the edges as X → Y ← Z (referred to as a v-structure) if and only if no subset of V \ {X, Z} containing Y renders X and Z independent; (2) if X → Y–Z, X and Z are not adjacent, and there is no arrowhead at Y, then orient Y–Z as Y → Z; (3) if there is a directed path from X to Y and an edge between X and Y, then orient X–Y as X → Y.
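Orientation rule (1) can be sketched in a few lines, assuming the skeleton and the separating sets found during skeleton discovery are already available. The toy inputs below (for the hypothetical collider graph X0 → X1 ← X2) are illustrative:

```python
# Sketch of orientation rule (1): orient X - Y - Z as the v-structure
# X -> Y <- Z whenever X and Z are nonadjacent and Y lies in no recorded
# separating set of X and Z.

def orient_colliders(skeleton, sepsets):
    """skeleton: set of frozenset edges; sepsets: {frozenset({x, z}): set}."""
    directed = set()
    nodes = {v for e in skeleton for v in e}
    for y in nodes:
        neighbors = [v for v in nodes if frozenset({v, y}) in skeleton]
        for i, x in enumerate(neighbors):
            for z in neighbors[i + 1:]:
                if frozenset({x, z}) in skeleton:
                    continue  # the rule applies only to nonadjacent X, Z
                if y not in sepsets.get(frozenset({x, z}), set()):
                    directed.add((x, y))   # X -> Y
                    directed.add((z, y))   # Z -> Y
    return directed

# X0 and X2 were separated by the empty set, so X1 must be a collider.
skeleton = {frozenset({0, 1}), frozenset({1, 2})}
sepsets = {frozenset({0, 2}): set()}
```

Calling `orient_colliders(skeleton, sepsets)` returns the two directed edges into the collider, `{(0, 1), (2, 1)}`.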
It is easy to see that the first phase of the SGS is computationally intensive and infeasible for larger systems, since for each pair of variables the conditional independence test has to be conducted on all possible subsets of the remaining variables. The PC algorithm, a modification of the SGS, reduces the number of subsets that need to be considered (Spirtes et al. 2000). The PC algorithm first examines the univariate association between pairs of variables and deletes all edges between pairs of variables that are independent. Then, the PC considers conditioning sets containing a single variable for the remaining pairs of variables. One critical point is that, when examining the conditional dependence of a pair of variables X and Y, only conditioning sets drawn from the variables adjacent to X (or, symmetrically, to Y) need to be considered, due to the probabilistic implication of d-separation. Conditioning only on the neighbors of X and Y significantly reduces the number of conditional independence tests compared to the SGS. Conditioning sets of increasing cardinality are then considered in a stepwise fashion, until the cardinality of the conditioning set exceeds the largest vertex degree in the graph. The PC algorithm is computationally efficient for sparse graphs.
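The skeleton phase just described can be sketched as follows, assuming linear Gaussian data so that a Fisher-z partial-correlation test can serve as the conditional independence test; the chain X0 → X1 → X2 and all coefficients are illustrative:

```python
# Sketch of the PC skeleton phase with level-wise conditioning restricted to
# current neighbors, using a Fisher-z partial-correlation test (suitable for
# linear Gaussian data). Simulated from the illustrative chain X0 -> X1 -> X2.
import itertools
from math import erf, log, sqrt
import numpy as np

def fisher_z_indep(data, i, j, cond, alpha=0.01):
    """True if Xi is judged independent of Xj given X_cond (Fisher z test)."""
    n = data.shape[0]
    def residual(k):
        y = data[:, k]
        if not cond:
            return y - y.mean()
        X = np.column_stack([data[:, list(cond)], np.ones(n)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta
    ri, rj = residual(i), residual(j)
    r = ri @ rj / sqrt((ri @ ri) * (rj @ rj))   # partial correlation
    r = max(min(r, 0.999999), -0.999999)
    z = 0.5 * log((1 + r) / (1 - r))            # Fisher's z-transform
    stat = sqrt(n - len(cond) - 3) * abs(z)
    p_value = 2 * (1 - 0.5 * (1 + erf(stat / sqrt(2))))
    return p_value > alpha

def pc_skeleton(data, alpha=0.01):
    p = data.shape[1]
    adj = {i: set(range(p)) - {i} for i in range(p)}
    for level in range(p - 1):                  # size of the conditioning set
        for i, j in itertools.combinations(range(p), 2):
            if j not in adj[i]:
                continue
            # condition only on current neighbors of i (excluding j)
            for S in itertools.combinations(sorted(adj[i] - {j}), level):
                if fisher_z_indep(data, i, j, S, alpha):
                    adj[i].discard(j)
                    adj[j].discard(i)
                    break
    return {(i, j) for i in range(p) for j in adj[i] if i < j}

rng = np.random.default_rng(1)
n = 5000
x0 = rng.normal(size=n)
x1 = 1.5 * x0 + rng.normal(size=n)
x2 = -2.0 * x1 + rng.normal(size=n)
skeleton = pc_skeleton(np.column_stack([x0, x1, x2]))
```

The edge X0–X2 is removed once the test conditions on X1, leaving the skeleton of the chain; a real implementation would also record the separating sets for the orientation phase.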
Footnote 3: Such a link can also be established through the Markov condition, but d-separation is more intuitive and useful, as pointed out in Chapter 3 of Spirtes et al. (2000).
As mentioned above, ample conditional independence tests have been developed to suit different types of data; this is desirable for biomedical and clinical data, where causal relationships are reconstructed from a mixture of nominal, ordinal, and continuous data. For multivariate Gaussian variables, partial correlation is generally used as the conditional independence test (Fisher 1924; Baba et al. 2004). For binomial or multinomial data, the G² test (Sokal and Rohlf 1981) and the χ² test (Pearson 1900) are generally used. Various non-parametric conditional independence tests have also been developed (Su and White 2008; Zhang et al. 2012; Angrist et al. 2011; Su and White 2007; Ramsey 2014; Zhang et al. 2011; Sun et al. 2007). In practice, the significance of the conditional independence test depends on the choice of the α value, i.e. the rate of falsely rejecting a true conditional independence in an individual test. The α value can be treated as a tuning parameter to adjust the sensitivity-specificity trade-off globally. A method for controlling the global false discovery rate of the PC algorithm has also been developed (Li and Wang 2009). Various modifications of the PC algorithm have been developed to better suit causal structure learning from biomedical data. The LPC (low-order PC; Wang et al. 2010) was designed to increase the accuracy of causal structural learning when a small number of samples is available relative to the number of variables (the large-p, small-N problem), a common scenario in biomedical data. The LPC algorithm only tests the conditional independence between pairs of variables conditioning on variable sets of low cardinality, since high-order conditional independence tests are hard to estimate from a small number of noisy samples. A modification of the PC algorithm that combines prior knowledge and multiple data sources by adjusting the α level has been proposed (Tan et al.
2011, 2008) and shown to improve network reconstruction in practice, although more principled approaches remain to be discovered. Moreover, hybrid methods have been proposed in which constraint-based algorithms are used to learn the causal skeleton (owing to their efficiency) and score-based methods are then applied to refine the network structure (Friedman et al. 1999; Brown et al. 2004; Tsamardinos et al. 2006). The hybrid methods are useful in biomedicine, where networks often consist of thousands of variables.

Different from the algorithms introduced above, the HITON-PC algorithm is a constraint-based local causal discovery algorithm (Aliferis et al. 2003). The HITON-PC algorithm takes a dataset D and a variable of interest (target variable) T as input, and returns a variable set containing the parents and children of T. The algorithm first ranks all variables V in dataset D by their association with T, in descending order, and the variables are admitted into the CandidatePC set one by one. After each variable is admitted, for every variable X ∈ CandidatePC, check whether there exists a variable set S ⊆ CandidatePC \ {X} that renders X independent of T (i.e. X ⊥ T | S); if so, remove X from CandidatePC. The algorithm terminates when all variables have been examined and returns the final CandidatePC set as the parents-and-children set of T (for a trace of the algorithm, see Aliferis et al. 2010). Note that the HITON-PC algorithm does not distinguish the parents from the children. To learn the global causal structure, one can apply the HITON-PC
algorithm to every variable in dataset D to obtain the local causal structures, and then stitch the local networks together. This strategy is termed the local-to-global framework (Aliferis et al. 2010). Edge orientation can then be conducted on the global network as in the PC algorithm. Learning each local causal structure is an independent process and thus easily parallelized, so local-to-global learning scales to the reconstruction of large networks consisting of hundreds to thousands of variables. HITON-PC has shown promising results for network reconstruction in multiple biomedical applications (Duda et al. 2005; Statnikov et al. 2010; Ma et al. 2014).

2.2 Causal structural learning with relaxed assumptions

Assumptions such as the Markov condition and faithfulness generally do not hold for observational data obtained from complex systems, such as the molecular pathway underlying a disease or the financial and social factors determining the outcomes of a healthcare policy. In this subsection, we discuss a number of algorithms designed to relax different assumptions for causal structural learning from observational data. Applications of these methods to biomedical data are rare, potentially because the methods are much more complex and less familiar to researchers in biomedicine.

Causal structural learning in the presence of latent variables and selection bias. Latent variables and selection bias are common in biomedical data. Latent variables are unmeasured variables that contribute to the data generating process. The presence of latent variables, i.e. the failure to measure all variables relevant to the data generating process, violates causal sufficiency.

Definition 6 (Causal sufficiency): A variable set V is causally sufficient if and only if every common cause of any two or more variables in V is in V, or has the same value for all samples.

Selection bias refers to sampling bias induced by selection variables.
The selection variables are unmeasured variables that determine whether or not a particular data point is included in the data sample. Selection bias can lead to the discovery of causal effects that are not present (Pearl 2009). For example, when studying the effect of two different treatments on the outcome of a disease, omitting the selection variable disease severity, which influences the choice of treatment, may lead to discovering a spurious causal relation between treatment and disease outcome (Simpson 1951). Given a faithful DAG on variable set V, if a set U ⊆ V is not observed, and some variables in U are selection variables or latent variables, faithfulness no longer holds; there might not exist any DAG that is faithful to the joint distribution over the observed variables. Due to the violation of the Markov condition and faithfulness, applying the methods discussed in the previous subsection to partially observed causal systems often results in errors in the inferred causal structure. One of the first constraint-based algorithms that does not assume causal sufficiency is the FCI algorithm. The FCI algorithm is a modification of the PC
algorithm, originally designed to handle latent variables (Spirtes et al. 2000), and was later extended for causal structural learning in the presence of both latent variables and selection bias (Spirtes et al. 1995). The FCI algorithm is computationally infeasible for large graphs. Modifications of the FCI algorithm, such as the Anytime FCI (Spirtes 2001) and especially the RFCI (Colombo et al. 2012), have greatly increased its computational efficiency. The output of Anytime FCI and RFCI is correct but sometimes not complete. Unlike the FCI, most recent constraint-based algorithms that allow latent variables and selection variables describe the causal structure using the Maximal Ancestral Graph (MAG) rather than a modified version of the DAG, since the presence of hidden variables and selection bias requires a new representation of the causal relationships. In a MAG, the absence of an edge represents conditional independence between a pair of variables. A MAG is maximal in the sense that no edge can be added without changing the independence model. A directed edge X → Y indicates that X is an ancestor of Y, or the presence of selection; an undirected edge X–Y indicates that X is an ancestor of Y and Y is an ancestor of X. Similar to d-separation for DAGs, m-separation relates the graph structure of a MAG to the conditional independence relationships in the joint distribution (Richardson and Spirtes 2002).

The linear latent cyclic (LLC) models proposed in Scheines et al. (2010) and Hyttinen et al. (2010, 2012, 2013), stemming from structural equation models, are also specifically designed to accommodate latent variables. LLC models represent the causal relationships among variables through a set of linear structural equations, where each variable is modeled as a linear combination of its parents.
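For intuition, here is a minimal sketch in this spirit, restricted for simplicity to an acyclic system with independent noise (illustrative coefficients): with effects matrix B, the model X = BX + e implies X = (I − B)⁻¹ e, so the observed covariance is fully determined by B and the noise covariance.

```python
# Implied covariance of a linear SEM X = B X + e for the chain X1 -> X2 -> X3.
# With A = (I - B)^{-1}, X = A e and Cov(X) = A Cov(e) A^T. Off-diagonal
# entries of Cov(e) would signal latent confounding; here the noise terms are
# independent with unit variance, and all coefficients are illustrative.
import numpy as np

B = np.array([[0.0, 0.0, 0.0],
              [1.5, 0.0, 0.0],    # X1 -> X2 with coefficient 1.5
              [0.0, -2.0, 0.0]])  # X2 -> X3 with coefficient -2.0
C_e = np.eye(3)                   # independent unit-variance noise

A = np.linalg.inv(np.eye(3) - B)  # the "mixing" form X = A e
Sigma = A @ C_e @ A.T             # implied covariance of the observed X
```

Reading the entries off Sigma: Var(X1) = 1, Var(X2) = 1.5² + 1 = 3.25, and Cov(X2, X3) = −2 × 3.25 = −6.5, matching hand calculation on the chain.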
Learning the causal structure from observational data in the LLC framework is equivalent to solving for the coefficients of a set of linear equations. More specifically, let X = {X1, ..., Xp} be the set of observed variables. The structural equations X = BX + e represent the causal relationships among the variables in X. The entry b_ji of the matrix B (the effects matrix) is the linear coefficient denoting the direct causal contribution of variable Xi to Xj; b_ji is zero when there is no edge from Xi to Xj. The vector e represents noise. Off-diagonal entries in the covariance matrix of e indicate the presence of latent variables, and potential directed cycles in the graph put constraints on the eigenvalues of the matrix B. One additional benefit of LLC models is that experimental data can be easily incorporated as further constraints on the matrix B; we will discuss this in more detail in the following section.

Causal structural learning without assuming Gaussianity. The original formulation of structural equation models assumes Gaussian distributions and only utilizes first- and second-order moment information from the data to derive causal structures. This assumption results in equivalent models and leaves certain causal structures unidentified; for example, the conventional structural equation model cannot orient the edge between two correlated variables. It has been shown that, under certain conditions, the causal direction between pairs of variables can be determined by assuming non-Gaussian noise and examining higher-order moments (Dodge and Rousson 2001; Shimizu and Kano 2008). Based on this idea, the LiNGAM algorithm (Shimizu et al. 2006)
extends the conventional structural equation model by assuming a non-Gaussian error term. The structural equation system can be written as X = BX + e and rearranged as X = Ae, where A = (I − B)⁻¹. A is estimated through independent component analysis (ICA) and then permuted and normalized such that the resulting B matrix is lower triangular, which ensures that the corresponding graph is a DAG. A recent update of LiNGAM, DirectLiNGAM, estimates the causal relations by iteratively identifying exogenous variables, thus avoiding the ICA step; this modification results in better scalability (Shimizu et al. 2011). An extension of LiNGAM that accommodates hidden variables was recently introduced (Shimizu and Bollen 2014).

Causal structural learning in the presence of multiplicity. Another typical violation of faithfulness in biological systems is the presence of multiplicity. In high-throughput expression data, it is quite common to find that one gene is correlated with hundreds to thousands of other genes, even though its direct upstream modulators and direct downstream effects are relatively few. Multiplicity, or target information equivalency, is one key contributor to this problem. Target information equivalency refers to the phenomenon in which two variables contain exactly the same information regarding a target variable of interest (Lemeire et al. 2006).

Definition 7 (Target information equivalency; Lemeire 2007): Two subsets of variables X and Y are target information equivalent with respect to T if and only if X is dependent on T, Y is dependent on T, X ⊥ T | Y, and Y ⊥ T | X.

Variables that contain the same information regarding T belong to the same information equivalence class. The variables in Fig. 2A that have the same color belong to the same information equivalence class. Consider the variables X1 and X2 in Fig. 2: they contain the same information regarding T. Therefore, one cannot distinguish the parent of T, X1, from the grandparent, X2, using observational data.
In the presence of information equivalency, applying an algorithm that assumes
Fig. 2 Local causal pathway discovery in the presence of target information equivalence
faithfulness, such as HITON-PC, can result in both false positives and false negatives. The output of HITON-PC depends on the sequence in which the variables in the same information equivalence class are considered. As shown in Fig. 2b, HITON-PC could output any set of five variables drawn from the five information equivalence classes. The TIE* and iTIE* algorithms were specifically designed to handle information equivalence (Statnikov and Aliferis 2010; Statnikov et al. 2013). These algorithms output all variables that are information equivalent with the local causal pathway of T and organize them into information equivalence classes (Fig. 2c). TIE* and iTIE* are suitable for the identification of biomarkers of complex diseases and disorders where multiplicity is expected (Statnikov and Aliferis 2010; Alekseyenko et al. 2011; Karstoft et al. 2015; Lytkin et al. 2011).
3 Active learning and causal-modeling-guided experimental design

As discussed in the previous section, causal structure learning using observational data can only discover causal relationships up to a statistical equivalence class. To resolve the statistical equivalence uncovered from the observational data, one can conduct manipulations or experiments on certain variables, observe the changes in the joint distribution, and refine the causal structure. We define a manipulation or experiment on variable X as do(X). The effect of the manipulation on the system is assessed by comparing the joint distribution before the manipulation, i.e. p(V), with that after the manipulation, i.e. p(V | do(X)) (Pearl 2009). The effects of do(X) on the causal relations in the original system depend on the nature of the manipulation of X. One common type of manipulation is randomized experimentation (Fisher 1936, 1950), for instance, randomly assigning patients to two different pain management programs. This type of manipulation breaks the causal relations between the variable being manipulated and its direct causes. In the pain management example, the random assignment to the programs eliminates the effects of factors that may influence the choice of pain management, such as educational background, financial considerations, and pain management history. Similarly, another common manipulation in biomedicine is to set the value of a given variable to a constant; for example, a gene knock-out experiment sets the gene's expression to the basal level. This type of manipulation effectively disconnects the manipulated variable from its parents; a knock-out experiment on gene A eliminates the regulatory effect of transcription factor B on gene A. A third type of manipulation involves introducing a distribution shift to a given variable, such as over-expressing a particular protein.
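The contrast between these manipulation types can be simulated on a toy linear SEM of the gene-regulation example (coefficients and noise levels are arbitrary choices for illustration):

```python
import random
import statistics

def cov(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def simulate(n, manipulate=None, seed=0):
    """Toy linear SEM: transcription factor TF -> gene A -> phenotype."""
    rng = random.Random(seed)
    tf_v, gene_v, phen_v = [], [], []
    for _ in range(n):
        tf = rng.gauss(0, 1)
        gene = 0.9 * tf + rng.gauss(0, 0.3)
        if manipulate == "knock_out":
            gene = 0.0           # do(A = basal level): generating equation replaced
        elif manipulate == "over_express":
            gene = gene + 2.0    # distribution shift: parents still act on the gene
        phen = 0.7 * gene + rng.gauss(0, 0.3)
        tf_v.append(tf); gene_v.append(gene); phen_v.append(phen)
    return tf_v, gene_v, phen_v

n = 20000
tf_o, gene_o, _ = simulate(n)
tf_k, gene_k, _ = simulate(n, "knock_out")
tf_e, gene_e, _ = simulate(n, "over_express")
print(round(cov(tf_o, gene_o), 2))  # observational: TF covaries with the gene
print(round(cov(tf_k, gene_k), 2))  # knock-out severs the TF -> gene relation
print(round(cov(tf_e, gene_e), 2))  # mean shift leaves the relation intact
```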
This type of manipulation leaves the causal relationships between the manipulated variable and its parents intact. It is possible to resolve causal relationships with experimentation alone; however, this strategy is very costly and time consuming, and in the health sciences in particular, certain experiments or manipulations are unethical or infeasible. On the other hand, since the development of high-throughput technologies, it has become
increasingly cheaper to obtain observational data in biomedicine, and as a result, a large amount of publicly available observational data has accumulated. Algorithms that combine the information in observational and experimental data can best take advantage of this situation, resulting in cost-effective causal structure learning. These algorithms are referred to as "active learning algorithms". Active learning algorithms aim to utilize information from both observational and experimental data to fully resolve the causal structure of a system while minimizing the number of experiments to be conducted. As shown in Fig. 3, active learning algorithms generally have two main components. The first component is a structural learning method that learns causal structure from observational data; any of the algorithms introduced in Sect. 2 can be used. The second component is an experimentation recommendation method, which recommends variables for experimentation according to some heuristic, e.g. the currently discovered causal structure, the feasibility and cost of certain experiments, and the importance of variables according to some knowledge base. After obtaining the experimental data resulting from manipulating the recommended variable, changes in the joint distribution or in the conditional independence relationships are assessed to update the causal graph. This process continues until the causal structure is fully resolved or there is no experiment left to conduct.
Fig. 3 Simplified framework for causal-model-guided experiment design. (1) Causal discovery algorithm is applied for structural learning from observational data to obtain a draft of the causal graph. (2) Variable is selected for experimentation according to the current draft of the causal graph. (3) Experiment is conducted and experimental data are collected. (4) Experimental data are used to refine the draft of the causal graph. (5) The draft of the causal graph is updated. Steps (2)–(5) are repeated until the causal structure is completely resolved or there is no available experiment to be conducted
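The loop in Fig. 3 can be written as a small skeleton in which the structure learner, recommendation heuristic, and experiment are pluggable callables; everything below (the names and the toy edge-orienting "experiment") is hypothetical scaffolding, not a specific published algorithm:

```python
def active_causal_learning(obs_data, learn, recommend, run_experiment, refine,
                           max_experiments=10):
    """Generic sketch of the loop in Fig. 3; the callables are placeholders."""
    graph = learn(obs_data)               # (1) draft graph from observational data
    for _ in range(max_experiments):
        var = recommend(graph)            # (2) pick a variable to manipulate
        if var is None:                   # resolved, or nothing feasible remains
            break
        exp_data = run_experiment(var)    # (3) run the experiment, collect data
        graph = refine(graph, var, exp_data)  # (4)-(5) update the draft graph
    return graph

# toy demonstration: each "experiment" orients one currently undirected edge
graph = {"directed": set(), "undirected": {("A", "B"), ("B", "C")}}
learn = lambda data: graph
recommend = lambda g: next(iter(g["undirected"]))[0] if g["undirected"] else None
run_experiment = lambda v: {"manipulated": v}

def refine(g, var, exp_data):
    edge = next(e for e in g["undirected"] if var in e)
    g["undirected"].discard(edge)
    g["directed"].add(edge)               # pretend the experiment oriented it
    return g

final = active_causal_learning(None, learn, recommend, run_experiment, refine)
print(sorted(final["directed"]))
```

Real systems differ in which learner fills `learn` and which utility function fills `recommend`, which is exactly the axis along which the algorithms below vary.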
3.1 Global algorithms

Most of the active learning methods for causal discovery are designed for discovering the global causal structure, i.e. the structure of the entire causal graph. Two of the earliest studies on this topic focus on the experimentation recommendation method (Tong and Koller 2001; Murphy 2001). In both studies, the causal model is represented as a Bayesian network. The basic idea underlying both studies is to first compute a posterior probability distribution over the causal structures given the observational data, and then, for each possible manipulation, compute an expected loss function/utility given the current structure of the causal graph. This strategy is computationally infeasible, since the number of possible DAGs grows super-exponentially with the number of variables, and computing the expected loss function/utility over all possible outcomes is O(2^|V|) assuming binary outcomes. To mitigate this, Tong and Koller (2001) defined the loss function to specifically characterize edge orientation, and Murphy (2001) proposed to utilize MCMC and importance sampling. The Biolearn method (Pe'er et al. 2001; Sachs et al. 2005) is a more recent contribution that uses the search-and-score idea and specifically considers biological networks. This method is more scalable. Similar to GES (Chickering 2002), Biolearn uses the fact that the posterior probability of the global causal structure given the data can be decomposed into the local contributions of individual variables. Due to this decomposability, when determining which variable to select for manipulation, one only needs to consider the local contributions of the variables in question. Moreover, heuristics such as greedy hill-climbing were used to limit the search space of possible graphs, further improving computational efficiency. Methods due to Meganck et al.
(2006) and He and Geng (2008) use a constraint-based method, the PC algorithm, as the structural learning method. For selecting the variable to experiment on, Meganck et al. (2006) use utility functions that take into account the potential gain (the number of edges oriented directly by the experiment and the potential number of edges oriented by propagation) and the cost (of both conducting the experiment and measuring the outcome) of a given experiment. He and Geng (2008) utilize the fact that edge orientation can be done locally within each chain component of the graph to improve computational efficiency; they choose the variable that minimizes the maximum number of DAGs in the chain component after an experiment, or that maximizes the entropy of the chain component. The scalability of this method depends on the size of the chain components. The models discussed so far assume causal sufficiency and no feedback loops. In contrast, the linear latent cyclic (LLC) models (Scheines et al. 2010; Hyttinen et al. 2010, 2012) mentioned in Sect. 2.2 relax these constraints by representing the causal relationships among variables through a set of linear structural equations. The LLC algorithm discovers causal relationships by estimating the coefficients of the effects matrix. Each experiment introduces additional constraints, so that the coefficients of the effects matrix can be determined from experimental data. The current effects matrix, which represents the most up-to-date causal structure, can be used to determine the optimal variable for manipulation.
3.2 Local algorithms

Some of the algorithms mentioned in Sect. 3.1 do not scale well when applied to causal systems consisting of a large number of variables. Also, most of the algorithms mentioned above assume faithfulness. As discussed in Sect. 2.2, multiplicity or target information equivalence is prevalent in most complex biological systems, which results in violations of the faithfulness condition. The local algorithm ODLP addresses both problems. ODLP (Statnikov et al. 2015) first discovers the local causal pathways consistent with the data by applying the TIE* or iTIE* algorithm (Fig. 2c). Then, experiments are conducted to resolve the local causal pathway. Notice that, unlike all the algorithms discussed in Sect. 3.1, structural information among the variables is not available after structural learning from observational data, due to multiplicity. The best choice for the first manipulation, if possible, is to manipulate the target, thus distinguishing the effects of the target from the other variables. Then, effect variables can be manipulated to distinguish direct effects from indirect effects: any effect variable that changes due to the manipulation of another effect variable is an indirect effect. The variables that are not affected by the manipulation of the target could be causes or passengers (variables that correlate with the target T but are neither causes nor effects). If manipulating one of the non-effect variables does not result in any change in the target variable T, the variable is determined to be a passenger variable, and any variable that changes due to the manipulation of a passenger variable is also a passenger variable. On the other hand, if manipulating one of the non-effect variables results in changes in the target variable, the variable is determined to be a cause. If manipulating a cause variable does not result in changes in any other cause variable, this cause variable is a direct cause.
In the worst case, the number of experiments needed to fully resolve the local causal network is the number of variables in the local causal pathway consistent with the data, plus one (for manipulating the target). The authors of this report have conducted comprehensive benchmark studies comparing the performance of the ODLP algorithm with some of the other algorithms discussed in Sect. 3.1 for local causal network discovery. ODLP outperforms the other algorithms in terms of structural discovery accuracy and scalability, with good experimental efficiency, on both simulated and real biological data (Statnikov et al. 2015; Ma et al. 2016).
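The classification logic of the preceding paragraphs can be sketched against a hypothetical "manipulation oracle" that reports, for each manipulated variable, the set of variables that change in response (in reality this comes from experiments; the passenger-propagation shortcut that saves experiments is omitted here for brevity):

```python
def classify_local_pathway(target, candidates, changes):
    """Label each candidate variable given a manipulation oracle:
    changes[v] is the set of variables that change when v is manipulated
    (in practice, the outcome of actual experiments)."""
    labels = {}
    effects = changes[target] & set(candidates)   # manipulate the target first
    for v in effects:
        downstream = any(v in changes[u] for u in effects if u != v)
        labels[v] = "indirect effect" if downstream else "direct effect"
    for v in candidates:                          # remaining: causes or passengers
        if v not in effects:
            labels[v] = "cause" if target in changes[v] else "passenger"
    causes = [v for v, lab in labels.items() if lab == "cause"]
    for v in causes:                              # a direct cause changes no other cause
        others = any(u in changes[v] for u in causes if u != v)
        labels[v] = "indirect cause" if others else "direct cause"
    return labels

# toy ground truth: C2 -> C1 -> T -> E1 -> E2, plus a passenger P
changes = {
    "T": {"E1", "E2"},
    "C1": {"T", "E1", "E2"}, "C2": {"C1", "T", "E1", "E2"},
    "E1": {"E2"}, "E2": set(), "P": set(),
}
labels = classify_local_pathway("T", ["C1", "C2", "E1", "E2", "P"], changes)
print(labels)
```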
4 Estimating the effects of a manipulation

In the previous sections, we presented tools for causal structure learning. The causal structure of a system, generally represented by the causal graph, captures qualitative information about the causal relationships among variables. However, this information is insufficient to answer the quantitative questions that are of interest in the empirical sciences, such as identifying the most effective treatments for a particular disease or pinpointing the most beneficial public health interventions for preventing an epidemic. This type of problem involves estimating the effects of a manipulation.
The effects of an intervention cannot be estimated in the absence of a causal model. As Pearl puts it, "[predictions about how systems would respond to hypothetical interventions] rest on—and, in fact, define—causal relationships". In this section, we address the problem of estimating the effect of the specific type of manipulation in which the manipulated variable X is forced to take a particular value x, denoted do(X = x). The effect of manipulating X on Y can be defined as E(Y | do(X = x1)) − E(Y | do(X = x2)), where x1 and x2 are two distinct values assigned to X by the manipulation.

4.1 Representing causal structure with structural equation models

As discussed above, the structural equation model (SEM) is one way to represent the causal relationships among variables. SEMs are particularly convenient for estimating the effects of a manipulation, since the quantitative relationships among the variables are represented explicitly in the structural equations. Here we use the classical definition of the structural equation model, assuming no hidden variables or loops and Gaussian errors. Briefly, each variable Xi in the causal system is determined by the function Xi = fi(Pa(Xi), εi), where Pa(Xi) is the set containing all the parents of Xi and the εi are jointly independent error terms. We emphasize that the function fi captures the asymmetric causal relationship between Xi and its parents. A manipulation of Xi (i.e. do(Xi = xi)) replaces the generating function fi with the constant xi, effectively disconnecting Xi from all its parents. Figure 4 shows a toy example of the causal relationships among genetics (X1), exercise (X2), diet (X3), and BMI (X4) before (Fig. 4a) and after (Fig. 4b) manipulation of diet.
4.2 Estimating effects of manipulation with SEM

We define the problem of estimating the effects of manipulation as follows: given a causal system with known causal structure, estimate the effects of a hypothetical manipulation of variable X from data observed from the unmanipulated
Fig. 4 Causal relationships among genetics, exercise, diet and BMI before (a) and after (b) manipulation on diet.
causal system. We use the example in Fig. 4 to illustrate this process. Assuming the relationships among the variables are linear and the causal structure is known, we describe the quantitative causal relationships with the equations in Fig. 4a, where cij is the coefficient characterizing the contribution of variable Xi to Xj. Note that manipulating the diet variable X3 disconnects X3 from its parent X2 but leaves all other relationships intact, as illustrated in Fig. 4b. Therefore, the effect of do(X3 = x3) is represented by c34, the amount of change in BMI (X4) resulting from one unit of change in diet (X3). All that is left is to estimate c34 from data obtained from the causal structure before the manipulation (Fig. 4a). This can be achieved by a linear regression of BMI (X4) on diet (X3) and exercise (X2) as the independent variables, i.e. X4 = c34·X3 + c24·X2. It is critical to include exercise (X2) in the linear regression, since exercise (X2) is correlated with diet (X3) before the manipulation. The coefficient obtained by regressing BMI (X4) on diet (X3) without exercise (X2) is c23·c24 + c34, which incorrectly includes the contribution of exercise (X2). On the other hand, whether or not genetics (X1) is included in the linear regression does not affect the estimate, given a large enough sample, since genetics is independent of diet (X3). The back-door criterion (Pearl 2009) describes the set of variables that must be conditioned on when estimating the effect of manipulating X on Y.

Definition 8 Back-door criterion: A set of variables Z satisfies the back-door criterion relative to an ordered pair of variables (X, Y) in a DAG G if: (1) no node in Z is a descendant of X; and (2) Z blocks every path between X and Y that contains an arrow into X.

A naive algorithm for finding the back-door variables from the causal graph requires enumerating all paths between X and Y, which becomes intractable in dense graphs.
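The adjustment argument can be checked with a small simulation of the Fig. 4 model (the path coefficients are assumed values; OLS is computed from the normal equations on centered data):

```python
import random
import statistics

def cov(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

random.seed(1)
n = 20000
c23, c24, c34, c14 = 0.6, 0.5, 0.8, 0.4                     # assumed coefficients
x1 = [random.gauss(0, 1) for _ in range(n)]                  # genetics
x2 = [random.gauss(0, 1) for _ in range(n)]                  # exercise
x3 = [c23 * b + random.gauss(0, 0.8) for b in x2]            # diet (variance 1)
x4 = [c14 * a + c24 * b + c34 * c + random.gauss(0, 0.5)
      for a, b, c in zip(x1, x2, x3)]                        # BMI

# unadjusted: regressing X4 on X3 alone picks up the path X3 <- X2 -> X4
unadjusted = cov(x3, x4) / cov(x3, x3)                       # ~ c34 + c23*c24 = 1.1

# adjusted: regress X4 on X3 and X2 (normal equations on centered data)
s33, s22, s23 = cov(x3, x3), cov(x2, x2), cov(x2, x3)
s34, s24 = cov(x3, x4), cov(x2, x4)
adjusted = (s34 * s22 - s24 * s23) / (s33 * s22 - s23 ** 2)  # ~ c34 = 0.8

print(round(unadjusted, 2), round(adjusted, 2))
```

Only the adjusted regression recovers the manipulation effect c34; the unadjusted slope absorbs the back-door contribution of exercise.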
Instead of conditioning on the back-door variables, conditioning on the parents of X also yields a correct estimate (Nandy et al. 2014; Maathuis and Nandy 2015). The above method for estimating causal effects is suitable when the complete causal structure underlying a system is available, which is rarely the case in practice. Therefore, one often has to estimate causal effects based on causal structures learned from data; and, as mentioned above, if only observational data are available, the causal structure can only be resolved up to a statistical equivalence class. The IDA algorithm (Maathuis et al. 2009, 2010) is specifically designed to estimate the effect of a manipulation in this situation. Briefly, a causal discovery algorithm (such as PC) is applied to the observational data to discover the causal relationships up to the Markov equivalence class. Then, all possible DAGs in the Markov equivalence class are enumerated, and the effect of a manipulation do(X = x) on Y is estimated for each possible DAG using the method discussed in the previous paragraph. The set containing the estimated effects from each DAG is used as the estimated manipulation effect. For a detailed discussion of the computational efficiency of the method, see Maathuis et al. (2009), Maathuis and Nandy (2015). A more recent method by Hyttinen et al. (2015) takes a different approach to causal effect estimation. The method represents the graph constraints (d-separation
and d-connection) inferred from the data using propositional logic. The Markov equivalence class derived from the data is implicitly represented in the propositional logic. This representation bridges causal structural learning and causal inference, and both the structural learning problem and the inference problem can be solved with constraint-solving techniques such as Boolean satisfiability solvers. In the health sciences, it is common practice to estimate the effects of a potential manipulation on a patient population using a causal model constructed from patient data in another population. For example, drug effects on the general population are often predicted from causal models built on clinical trial data. The differences between the two populations may result in inaccurate estimates. This problem is a special type of model generalizability problem, termed transportability. Some recent studies have outlined the conditions under which causal effects learned in experimental studies, like clinical trials, can be transferred to a new population where only observational data are available (Pearl and Bareinboim 2014; Bareinboim et al. 2013). The transportability problem is closely related to the rationale underlying causal feature selection, discussed in Sect. 5.
5 Causal structure learning and feature selection

In the past twenty years, breakthroughs in high-throughput technologies, such as gene expression microarrays, mass spectrometry, SNP arrays, tissue arrays, and single-cell arrays, have allowed multi-modal molecular data to be measured at the genome scale. Similarly, the modernization of the health care system, especially the electronic health record (EHR), has resulted in the accumulation of a vast amount of structured and unstructured (e.g. clinical notes) clinical data. The biomedical community has since adopted the predictive modeling approach to identify clinical markers and biomarkers that are indicative of particular clinical outcomes. The core problem underlying the development of a diagnostic assay is feature selection, i.e. selecting a small subset of biomarkers from thousands of markers that results in optimal predictability for the disease in question. In general, there is no trivial solution to the feature selection problem. Let us define the feature selection problem as selecting a subset of features S from the available features F of size N according to some criterion (e.g. maximizing predictive performance). The number of possible feature sets is 2^N, so an exhaustive exploration of all possible subsets is infeasible. This is especially true for data from the healthcare domain, where the number of features is sometimes a couple of orders of magnitude larger than the number of samples. An encouraging success story is PAM50, an assay that is extensively used in the clinical setting for determining the molecular subtype of breast cancer. The PAM50 assay consists of fifty genes that were selected out of thousands of genes (Sørlie et al. 2001; Van't Veer 2002). The reduction in the number of genes that need to be profiled results in a great improvement in cost efficiency for molecular subtyping.
Despite the major success of PAM50, deriving stable, robust, and clinically relevant feature sets for diagnosis and prognosis remains a challenging task. In this section,
we will introduce feature selection methods and discuss the benefits of causal feature selection methods.

5.1 Purely predictive feature selection

Myriad feature selection methods have been developed. The aim of purely predictive feature selection strategies is to select a subset of features that optimizes a loss function characterizing some measure of predictivity (e.g. AUC, sensitivity, or specificity). Univariate feature selection is the simplest and most straightforward: features are ranked according to their univariate association with the target, and the top-ranked features (above a chosen threshold, e.g. p ≤ 0.05) are selected. Univariate association-based methods are computationally efficient and generally yield adequate predictive performance; however, the features selected by univariate methods are often redundant, as indicated by high correlations among the selected features. This is especially true for biomedical data, such as gene expression data. Moreover, univariate association fails to capture interactions among features. This is likely a disadvantage for modeling biomedical data, since in general the interactions of molecules, pathways, physiological processes, pathological processes, and environmental factors co-determine the disease outcome. Various multivariate feature selection strategies have been developed to reduce feature redundancy and capture interactions among features. For comprehensive reviews of purely predictive feature selection methods, see Guyon and Elisseeff (2003, 2006), Liu and Motoda (2007).

5.2 Benefits of using causal feature selection

To overcome some of the drawbacks of non-causal feature selection methods, several researchers have suggested incorporating causal structural information into the feature selection process. Intuitively, the Markov boundary (MB) of the variable of interest T is a good candidate feature set for predicting T.
The MB of T consists of the direct causes, direct effects, and direct causes of direct effects of T; all other features are rendered independent of T given the features in the MB of T (Guyon et al. 2007; Aliferis et al. 2003). In practice, the number of features in MB(T) is generally small, much smaller than the set of features that are univariately correlated with the target T. Therefore, reducing model complexity and overfitting is one of the main benefits of using MB(T) as the selected feature set. Moreover, considering causal relations among features when selecting features for predictive modeling makes it possible to predict the outcome of interest under distribution change. Usually, predictive modeling assumes that the training data and testing data are sampled from the same distribution (the i.i.d. condition). This condition is often violated in biomedical data. Figure 5 shows a toy example (the examples in this section are inspired by Guyon et al. (2007) and cast in the context of biomedicine). The causal structure underlying the data is depicted in Fig. 5a: cancer causes changes in biomarker 1 and biomarker 2. A predictive model was trained on patient data obtained in hospital 1 for cancer diagnosis (Fig. 5b). When applying the model to classify the disease state in hospital 2, where more patients
Fig. 5 Effects of distribution change on the performance of a predictive model. a Causal model of the data. b Training data. c Testing data. Solid line represents the decision boundary learned on the training data; this boundary becomes suboptimal under distribution change, as shown in c. Dashed line represents the decision boundary taking the distribution change into account
have this specific type of cancer, the original decision boundary becomes suboptimal (Fig. 5c). The decision boundary estimated from the training data (solid black line) corresponds to where P(cancer = 1 | biomarker) = P(cancer = 0 | biomarker). From Bayes' rule, we have P(cancer | biomarker) ∝ P(cancer)·P(biomarker | cancer). P(cancer) in the training data (hospital 1) is different from P(cancer) in the testing data (hospital 2). This results in a different P(cancer | biomarker) in the testing data than in the training data; therefore, the decision boundary obtained from hospital 1 data is suboptimal for hospital 2 patients. The remedy to this situation is causal information. The key realization is that the biomarkers are effects of cancer, so P(biomarker | cancer) remains the same even when P(cancer) changes across patient populations. Instead of estimating P(cancer | biomarker) directly from the training set, one can estimate P(biomarker | cancer) and compute P(cancer | biomarker) using P(cancer)·P(biomarker | cancer), where P(cancer) can be estimated via an iterative process in the testing data. Furthermore, manipulation of certain variables, which can result in distribution change, can affect the generalizability of the predictive model as well. Consider building a predictive model for lung cancer using stress and smoking as the features, in individuals who have not attempted to quit smoking. Assume the underlying causal structure among these variables is: stress → smoking → lung cancer. Some purely predictive feature selection methods (especially the univariate methods) would select both stress and smoking, since both of these variables are correlated with lung cancer. If we apply this predictive model to a group of people who are randomly assigned to participate in a smoking cessation program, the original predictive model will perform badly, because the random assignment is a manipulation of smoking, which renders stress and lung cancer independent.
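The smoking example can be simulated with an assumed linear toy model (all coefficients are illustrative): under randomized assignment, the stress-cancer correlation that a univariate selector exploited disappears, while the direct cause, smoking, remains predictive.

```python
import random
import statistics

def corr(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def simulate(n, randomize_smoking=False, seed=5):
    """stress -> smoking -> lung cancer, as continuous scores (illustrative)."""
    rng = random.Random(seed)
    stress, smoking, cancer = [], [], []
    for _ in range(n):
        st = rng.gauss(0, 1)
        sm = rng.gauss(0, 1) if randomize_smoking else 0.8 * st + rng.gauss(0, 0.6)
        ca = 0.7 * sm + rng.gauss(0, 0.5)
        stress.append(st); smoking.append(sm); cancer.append(ca)
    return stress, smoking, cancer

st0, sm0, ca0 = simulate(20000)                          # observational population
st1, sm1, ca1 = simulate(20000, randomize_smoking=True)  # cessation-program trial
print(round(corr(st0, ca0), 2), round(corr(sm0, ca0), 2))  # both look predictive
print(round(corr(st1, ca1), 2), round(corr(sm1, ca1), 2))  # stress no longer is
```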
In this particular example, selecting the direct cause, smoking, to build the predictive model results in a model that is robust to the manipulation. The "causation and prediction challenge" at the World Congress on Computational Intelligence addressed some of the problems mentioned above in more detail. The
proceedings of the challenge, published in the JMLR Workshop and Conference Proceedings series, are available for interested readers.

5.3 Causal feature selection methods and their applications

If the main concern for the predictive model is overfitting due to redundancy among the selected features, and assuming the absence of multiplicity, one can use the MB of the target variable as the feature set. Various causal structure learning algorithms can be used to identify the MB. Technically, one could even use a global causal structure learning method such as PC or GES to identify MB(T); however, this is unnecessary and computationally inefficient, and it requires many samples to identify the causal structure correctly. In contrast, local causal discovery methods such as HITON-PC are ideal for feature selection. In theory, MB(T) is a better feature set than PC(T) in terms of prediction, since MB(T) contains all the information regarding T that is available in the entire feature set. In practice, the difference in predictive performance between MB(T) and PC(T) is generally not significant, and identifying PC(T) is computationally cheaper. An extensive benchmark study has shown that, compared to purely predictive feature selection methods, causal feature selection methods achieve similar or better predictive performance with feature sets that are a couple of orders of magnitude smaller (Aliferis et al. 2010a, b). A smaller yet informative feature set is attractive for translating computational models of disease into diagnostic and prognostic assays. On the other hand, in domains where multiplicity is prevalent, it is commonly observed that multiple feature sets can lead to very similar predictive performance. We argue that this phenomenon (termed the Rashomon effect by Breiman 2001) is due to the information equivalence mentioned in the previous section. As noted in Lagani et al.
(2016), one benefit of discovering all equally predictive subsets is that the cost of obtaining some of the variables might be lower than that of others; using the equally predictive variable set that incurs the lowest cost yet provides the same predictive performance is preferable in practice. Algorithms such as TIE* and iTIE* (Statnikov and Aliferis 2010; Statnikov et al. 2013) can be used for discovering all the subsets that are information equivalent with respect to the target, as discussed before. A more recent contribution, the SES method, is a constraint-based heuristic method that identifies equivalences among predictor sets (Lagani et al. 2016; Tsamardinos et al. 2012). In contrast to what is discussed in the two paragraphs above, if one intends to build a predictive model that is robust to distribution shift and manipulation, the direct causes and direct effects of the target variable need to be distinguished.⁴ To distinguish direct causes from direct effects, one can start by identifying MB(T) and then try to orient the edges in the local causal neighborhood of T. In general, the local causal neighborhood cannot be correctly oriented by only examining the dependencies among T and MB(T) in observational data. Consider a graph where A and B are two direct causes of T, and C is the common cause of A and B.
⁴ Here we are talking about the general case. It is easy to recognize that certain manipulations, i.e. manipulations that do not affect P(T | predictors), would not change the predictive performance of the model. A more detailed discussion can be found in Tillman and Spirtes (2008).
MB(T) = {A, B}. If one only examines the dependency relationships among A, B, and T, it is not possible to correctly identify the triple ⟨A, T, B⟩ as a V-structure, because the correlation between A and B cannot be attributed to their common cause C. Several algorithms have been proposed that use the dependencies in the extended local neighborhood (i.e. the local neighborhood of the local neighborhood) to orient the edges in the local neighborhood of T, including MBFS (Ramsey 2006), LocalGraph, and PCD-by-PCD (Yin et al. 2008). All these algorithms are sound and computationally efficient. In particular, the PCD-by-PCD algorithm can orient the local neighborhood completely⁵ by progressively expanding the local neighborhood while examining the dependencies among a small set of variables.
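This failure can be seen numerically in a linear version of the example (coefficients are assumed values): within MB(T) = {A, B}, A and B are strongly correlated, so the marginal-independence premise of the V-structure test fails; bringing in C from the extended neighborhood recovers A ⊥ B | C.

```python
import random
import statistics

def cov(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def corr(a, b):
    return cov(a, b) / (cov(a, a) * cov(b, b)) ** 0.5

def residuals(y, x):
    b = cov(x, y) / cov(x, x)
    my, mx = statistics.fmean(y), statistics.fmean(x)
    return [yi - my - b * (xi - mx) for yi, xi in zip(y, x)]

def partial_corr(a, b, given):
    return corr(residuals(a, given), residuals(b, given))

random.seed(6)
n = 20000
xc = [random.gauss(0, 1) for _ in range(n)]        # C: common cause of A and B
xa = [0.8 * c + random.gauss(0, 0.6) for c in xc]  # A, direct cause of T
xb = [0.8 * c + random.gauss(0, 0.6) for c in xc]  # B, direct cause of T
xt = [0.6 * a + 0.6 * b + random.gauss(0, 0.5)
      for a, b in zip(xa, xb)]                     # target T (for context)

print(round(corr(xa, xb), 2))              # A, B dependent: V-structure test fails
print(round(partial_corr(xa, xb, xc), 2))  # with C: A _||_ B | C emerges
```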
6 Summary and future directions

In summary, this review provided an overview of some typical causal discovery problems encountered in biomedicine, and we discussed the theory underlying selected computational tools for solving these problems. With the accumulation of large amounts of publicly available high-throughput data from the biomedical domain, computational causal discovery and analytics methods can assist researchers in developing an efficient, integrative, and comprehensive understanding of biological processes and enable systematic knowledge discovery. Here, we point out some future directions for applied computational causal discovery in biomedicine. First, since it is common in the biomedical domain to encounter the "big P, small N" problem, we think it is important to harness the local learning approach to make more robust inferences. Another area of interest is network integration; more specifically, how to integrate networks learned from different data sources that potentially have slightly different sets of measured variables. This is important for identifying the consensus and conflict in the knowledge about a particular research topic. Some recent work on this topic includes Danks (2002), Tillman and Eberhardt (2014), and Triantafillou and Tsamardinos (2015). Lastly, we believe one potential way to enhance the quality and interpretability of causal discovery from clinical data is to introduce multilevel or nested models to capture the hierarchical structure of the data. Therefore, systematic and comprehensive benchmark studies of multilevel models are critical to evaluate the efficacy of these models under different conditions. We want to close by emphasizing that causal structures identified from observational data should always be interpreted with caution and ideally followed up with validation experiments.
Moreover, the transportability of the causal model should be examined closely when a causal model learned from one population is applied to another, both for the estimation of causal effects and for predictive modeling.

Acknowledgements The authors are grateful to Roshan Tourani for helpful comments on the manuscript.
⁵ There still could be unoriented edges in the local causal neighborhood, since certain causal structures cannot be discovered from observational data. By ''completely'' we mean that the orientation of the local neighborhood produced by PCD-by-PCD is the same as that obtained from a global causal discovery algorithm.
References

Alekseyenko AV, Lytkin NI, Ai J, Ding B, Padyukov L, Aliferis CF, Statnikov A (2011) Causal graph-based analysis of genome-wide association data in rheumatoid arthritis. Biol Direct 6(1):1
Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD (2010) Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: algorithms and empirical evaluation. J Mach Learn Res 11:171–234
Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD (2010) Local causal and Markov blanket induction for causal discovery and feature selection for classification part II: analysis and extensions. J Mach Learn Res 11:235–284
Aliferis CF, Tsamardinos I, Statnikov A (2003) HITON: a novel Markov blanket algorithm for optimal variable selection. In: AMIA Annual Symposium Proceedings. Am Med Inform Assoc 2003:21
Angrist JD, Kuersteiner GM (2011) Causal effects of monetary shocks: semiparametric conditional independence tests with a multinomial propensity score. Rev Econ Stat 93(3):725–747
Baba K, Shibata R, Sibuya M (2004) Partial correlation and conditional correlation as measures of conditional independence. Australian New Zealand J Stat 46(4):657–664
Bareinboim E, Pearl J (2013) A general algorithm for deciding transportability of experimental results. J Causal Infer 1(1):107–134
Breiman L (2001) Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci 16(3):199–231
Brown LE, Tsamardinos I, Aliferis CF (2004) A novel algorithm for scalable and accurate Bayesian network learning. Medinfo 11(Pt 1):711–715
Buntine W (1991) Theory refinement on Bayesian networks. In: Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp 52–60
Cantone I, Marucci L, Iorio F, Ricci MA, Belcastro V, Bansal M, Santini S, di Bernardo M, di Bernardo D, Cosma MP (2009) A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell 137(1):172–181
Chickering DM (2002) Learning equivalence classes of Bayesian-network structures. J Mach Learn Res 2:445–498
Chickering DM (2002) Optimal structure identification with greedy search. J Mach Learn Res 3:507–554
Colombo D, Maathuis MH, Kalisch M, Richardson TS (2012) Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann Stat 294–321
Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9(4):309–347
Danks D (2002) Learning the causal structure of overlapping variable sets. In: International Conference on Discovery Science. Springer, pp 178–191
De Smet R, Marchal K (2010) Advantages and limitations of current network inference methods. Nature Rev Microbiol 8(10):717–729
Dodge Y, Rousson V (2001) On asymmetric properties of the correlation coefficient in the regression setting. Am Stat 55(1):51–54
Duda S, Aliferis C, Miller R, Statnikov A, Johnson K (2005) Extracting drug-drug interaction articles from MEDLINE to improve the content of drug databases. In: AMIA Annual Symposium Proceedings. Am Med Inform Assoc 2005:216
Feelders A, Van der Gaag LC (2006) Learning Bayesian network parameters under order constraints. Int J Approx Reason 42(1):37–53
Fisher RA (1924) The distribution of the partial correlation coefficient. Metron 3:329–332
Fisher RA (1936) Design of experiments. Br Med J 1(3923):554
Fisher RA (1950) Statistical methods for research workers. Biological monographs and manuals No. V. Oliver and Boyd, Edinburgh and London
Friedman N, Nachman I, Pe'er D (1999) Learning Bayesian network structure from massive datasets: the sparse candidate. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp 206–215
Geiger D, Heckerman D (1994) Learning Gaussian networks. In: Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp 235–243
Guyon I, Aliferis C, Elisseeff A (2007) Causal feature selection. In: Computational methods of feature selection, pp 63–86
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Guyon I, Elisseeff A (2006) An introduction to feature extraction. In: Feature extraction. Springer, pp 1–25
He YB, Geng Z (2008) Active learning of causal networks with intervention experiments and optimal designs. J Mach Learn Res 9:2523–2547
Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20(3):197–243
Heckerman D, Geiger D (2013) Learning Bayesian networks: a unification for discrete and Gaussian domains. arXiv preprint arXiv:1302.4957
Hill SM, Heiser LM, Cokelaer T, Unger M, Nesser NK, Carlin DE, Zhang Y, Sokolov A, Paull EO, Wong CK et al (2016) Inferring causal molecular networks: empirical assessment through a community-based effort. Nature Methods 13(4):310–318
Hyttinen A, Eberhardt F, Hoyer PO (2010) Causal discovery for linear cyclic models with latent variables. In: European Workshop on Probabilistic Graphical Models, p 153
Hyttinen A, Eberhardt F, Hoyer PO (2012) Learning linear cyclic causal models with latent variables. J Mach Learn Res 13:3387–3439
Hyttinen A, Eberhardt F, Järvisalo M (2015) Do-calculus when the true graph is unknown. In: Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence
Hyttinen A, Hoyer PO, Eberhardt F, Järvisalo M (2013) Discovering cyclic causal models with latent variables: a general SAT-based procedure. arXiv preprint arXiv:1309.6836
Imoto S, Higuchi T, Goto T, Tashiro K, Kuhara S, Miyano S (2004) Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. J Bioinform Comput Biol 2(1):77–98
Isci S, Dogan H, Ozturk C, Otu HH (2014) Bayesian network prior: network analysis of biological data using external knowledge. Bioinformatics 30(6):860–867
JMLR workshop and conference proceedings: volume 3. http://www.jmlr.org/proceedings/papers/v3/. Accessed 23 Nov 2016
John GH, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp 338–345
Karstoft KI, Galatzer-Levy IR, Statnikov A, Li Z, Shalev AY (2015) Bridging a translational gap: using machine learning to improve the prediction of PTSD. BMC Psychiatry 15(1):1
Lagani V, Athineou G, Farcomeni A, Tsagris M, Tsamardinos I (2016) Feature selection with the R package MXM: discovering statistically-equivalent feature subsets. arXiv preprint arXiv:1611.03227
Lagani V, Triantafillou S, Ball G, Tegnér J, Tsamardinos I (2016) Probabilistic computational causal discovery for systems biology. In: Uncertainty in biology. Springer, pp 33–73
Lemeire J (2007) Learning causal models of multivariate systems and the value of it for the performance modeling of computer programs. ASP/VUBPRESS/UPA
Lemeire J, Maes S, Meganck S, Dirkx E (2006) The representation and learning of equivalent information in causal models. Technical Report IRIS-TR-0099, Vrije Universiteit Brussel
Li H, Lu L, Manly KF, Chesler EJ, Bao L, Wang J, Zhou M, Williams RW, Cui Y (2005) Inferring gene transcriptional modulatory relations: a genetical genomics approach. Human Mol Genet 14(9):1119–1125
Li J, Wang ZJ (2009) Controlling the false discovery rate of the association/causality structure learned with the PC algorithm. J Mach Learn Res 10:475–514
Liu H, Motoda H (2007) Computational methods of feature selection. CRC Press
Lytkin NI, McVoy L, Weitkamp JH, Aliferis CF, Statnikov A (2011) Expanding the understanding of biases in development of clinical-grade molecular signatures: a case study in acute respiratory viral infections. PLoS One 6(6):e20662
Ma S, Kemmeren P, Aliferis CF, Statnikov A (2016) An evaluation of active learning causal discovery methods for reverse-engineering local causal pathways of gene regulation. Sci Rep 6
Ma S, Kemmeren P, Gresham D, Statnikov A (2014) De-novo learning of genome-scale regulatory networks in S. cerevisiae. PLoS One 9(9):e106479
Maathuis MH, Colombo D, Kalisch M, Bühlmann P (2010) Predicting causal effects in large-scale systems from observational data. Nature Methods 7(4):247–248
Maathuis MH, Kalisch M, Bühlmann P (2009) Estimating high-dimensional intervention effects from observational data. Ann Stat 37(6A):3133–3164
Maathuis MH, Nandy P (2015) A review of some recent advances in causal inference. arXiv preprint arXiv:1506.07669
Marbach D, Schaffter T, Mattiussi C, Floreano D (2009) Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J Comput Biol 16(2):229–239
Meganck S, Leray P, Manderick B (2006) Learning causal Bayesian networks from observations and experiments: a decision theoretic approach. In: International Conference on Modeling Decisions for Artificial Intelligence. Springer, pp 58–69
Murphy KP (2001) Active learning of causal Bayes net structure. Tech rep, UC Berkeley
Nandy P, Maathuis MH, Richardson TS (2014) Estimating the effect of joint interventions from observational data in sparse high-dimensional settings. arXiv preprint arXiv:1407.2451
Olsen C, Fleming K, Prendergast N, Rubio R, Emmert-Streib F, Bontempi G, Haibe-Kains B, Quackenbush J (2014) Inference and validation of predictive gene networks from biomedical literature and gene expression data. Genomics 103(5):329–336
Ott S, Imoto S, Miyano S (2004) Finding optimal models for small gene networks. In: Pacific Symposium on Biocomputing 9:557–567
Pearl J, Bareinboim E (2014) External validity: from do-calculus to transportability across populations. Stat Sci 29(4):579–595
Pearl J (2009) Causality. Cambridge University Press
Pearson K (1900) X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50(302):157–175
Pe'er D, Regev A, Elidan G, Friedman N (2001) Inferring subnetworks from perturbed expression profiles. Bioinformatics 17(suppl 1):S215–S224
Ramsey J (2006) A PC-style Markov blanket search for high dimensional datasets. Technical Report CMU-PHIL-177
Ramsey JD (2014) A scalable conditional independence test for nonlinear, non-Gaussian data. arXiv preprint arXiv:1401.5031
Ramírez-Gallego S, García S, Mouriño-Talín H, Martínez-Rego D, Bolón-Canedo V, Alonso-Betanzos A, Benítez JM, Herrera F (2016) Data discretization: taxonomy and big data challenge. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 6(1):5–21
Richardson T, Spirtes P (2002) Ancestral graph Markov models. Ann Stat 962–1030
Robinson R (1978) Counting labeled acyclic digraphs. In: Harary F (ed) New directions in the theory of graphs. Academic Press, New York and London, pp 239–273
Sachs K, Perez O, Pe'er D, Lauffenburger DA, Nolan GP (2005) Causal protein-signaling networks derived from multiparameter single-cell data. Science 308(5721):523–529
Sachs K, Itani S, Fitzgerald J, Schoeberl B, Nolan G, Tomlin C (2013) Single timepoint models of dynamic systems. Interface Focus 3(4):20130019
Schadt EE (2009) Molecular networks as sensors and drivers of common human diseases. Nature 461(7261):218–223
Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, GuhaThakurta D, Sieberts SK, Monks S, Reitman M, Zhang C et al (2005) An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genet 37(7):710–717
Scheines R, Eberhardt F, Hoyer PO (2010) Combining experiments to discover linear cyclic models with latent variables. Tech rep, CMU, Pittsburgh, US
Shimizu S, Bollen K (2014) Bayesian estimation of causal direction in acyclic structural equation models with individual-specific confounder variables and non-Gaussian distributions. J Mach Learn Res 15(1):2629–2652
Shimizu S, Kano Y (2008) Use of non-normality in structural equation modeling: application to direction of causation. J Stat Plann Infer 138(11):3483–3491
Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A (2006) A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res 7:2003–2030
Shimizu S, Inazumi T, Sogawa Y, Hyvärinen A, Kawahara Y, Washio T, Hoyer PO, Bollen K (2011) DirectLiNGAM: a direct method for learning a linear non-Gaussian structural equation model. J Mach Learn Res 12:1225–1248
Simpson EH (1951) The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society Series B (Methodological) pp 238–241
Sokal RR, Rohlf FJ (1981) Biometry: the principles and practice of statistics in biological research. Freeman, New York
Spirtes P, Glymour C, Scheines R, Kauffman S, Aimale V, Wimberly F (2000) Constructing Bayesian network models of gene expression networks from microarray data. Tech rep, CMU
Spirtes P (2001) An anytime algorithm for causal inference. In: AISTATS
Spirtes P, Glymour CN, Scheines R (2000) Causation, prediction, and search. MIT Press
Spirtes P, Glymour CN, Scheines R (1990) Causality from probability. In: Conference Proceedings: Advanced Computing for the Social Sciences, Williamsburg
Spirtes P, Meek C, Richardson T (1995) Causal inference in the presence of latent variables and selection bias. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp 499–506
Statnikov A, Lytkin NI, McVoy L, Weitkamp JH, Aliferis CF (2010) Using gene expression profiles from peripheral blood to identify asymptomatic responses to acute respiratory viral infections. BMC Res Notes 3(1):264
Statnikov A, McVoy L, Lytkin N, Aliferis CF (2010) Improving development of the molecular signature for diagnosis of acute respiratory viral infections. Cell Host Microbe 7(2):100
Statnikov A, Aliferis CF (2010) Analysis and computational dissection of molecular signature multiplicity. PLoS Comput Biol 6(5):e1000790
Statnikov A, Lytkin NI, Lemeire J, Aliferis CF (2013) Algorithms for discovery of multiple Markov boundaries. J Mach Learn Res 14:499–566
Statnikov A, Ma S, Henaff M, Lytkin N, Efstathiadis E, Peskin ER, Aliferis CF (2015) Ultra-scalable and efficient methods for hybrid observational and experimental local causal pathway discovery. J Mach Learn Res
Stolovitzky G, Monroe D, Califano A (2007) Dialogue on reverse-engineering assessment and methods: the DREAM of high-throughput pathway inference. Ann New York Acad Sci 1115:1
Su L, White H (2007) A consistent characteristic function-based test for conditional independence. J Econ 141(2):807–834
Su L, White H (2008) A nonparametric Hellinger metric test for conditional independence. Econ Theory 24(4):829–864
Sun X, Janzing D, Schölkopf B, Fukumizu K (2007) A kernel-based causal learning algorithm. In: Proceedings of the 24th International Conference on Machine Learning. ACM, pp 855–862
Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, Van De Rijn M, Jeffrey SS et al (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci 98(19):10869–10874
Tamada Y, Kim S, Bannai H, Imoto S, Tashiro K, Kuhara S, Miyano S (2003) Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics 19(suppl 2):ii227–ii236
Tan M, Alshalalfa M, Alhajj R, Polat F (2011) Influence of prior knowledge in constraint-based learning of gene regulatory networks. IEEE/ACM Trans Comput Biol Bioinform 8(1):130–142. doi:10.1109/TCBB.2009.58
Tan M, AlShalalfa M, Alhajj R, Polat F (2008) Combining multiple types of biological data in constraint-based learning of gene regulatory networks. In: IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB'08). IEEE, pp 90–97
Tillman ER, Eberhardt F (2014) Learning causal structure from multiple datasets with similar variable sets. Behaviormetrika 41(1):41–64
Tillman RE, Spirtes P (2008) When causality matters for prediction: investigating the practical tradeoffs. In: Proceedings of the 2008 International Conference on Causality: Objectives and Assessment, COA'08, pp 137–146. JMLR.org. http://dl.acm.org/citation.cfm?id=2996801.2996811
Tong S, Koller D (2001) Active learning for structure in Bayesian networks. In: International Joint Conference on Artificial Intelligence, vol 17, pp 863–869. Lawrence Erlbaum Associates Ltd
Triantafillou S, Tsamardinos I (2015) Constraint-based causal discovery from multiple interventions over overlapping variable sets. J Mach Learn Res 16:2147–2205
Tsamardinos I, Brown LE, Aliferis CF (2006) The max-min hill-climbing Bayesian network structure learning algorithm. Mach Learn 65(1):31–78
Tsamardinos I, Lagani V, Pappas D (2012) Discovering multiple, equivalent biomarker signatures. In: 7th Conference of the Hellenic Society for Computational Biology and Bioinformatics (HSCBB12), Heraklion
Uusitalo L (2007) Advantages and challenges of Bayesian networks in environmental modelling. Ecol Model 203(3):312–318
Van't Veer LJ, Dai H, Van De Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871):530–536
Wang M, Benedito VA, Zhao PX, Udvardi M (2010) Inferring large-scale gene regulatory networks using a low-order constraint-based algorithm. Mol BioSyst 6(6):988–998
Werhli AV, Husmeier D (2007) Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol 6(1)
Yin J, Zhou Y, Wang C, He P, Zheng C, Geng Z (2008) Partial orientation and local structural learning of causal networks for prediction. In: WCCI Causation and Prediction Challenge, pp 93–105
Zhang K, Peters J, Janzing D, Schölkopf B (2011) Kernel-based conditional independence test and application in causal discovery. In: Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11). AUAI Press, Corvallis, Oregon, pp 804–813
Zhang K, Peters J, Janzing D, Schölkopf B (2012) Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775