Health Serv Outcomes Res Method DOI 10.1007/s10742-014-0136-7
Extending coarsened exact matching to multiple cohorts: an application to longitudinal well-being program evaluation within an employer population

J. A. Sidney · C. Coberley · J. E. Pope · A. Wells
Received: 17 January 2014 / Revised: 19 December 2014 / Accepted: 24 December 2014
© Springer Science+Business Media New York 2015
Abstract Research to date within the field of well-being program evaluation has considered the study population to have either been given a treatment or not, and has assumed that matching will yield an unbiased, efficient estimate of the treatment causal effect. As well-being intervention programs become more sophisticated and diverse in their offerings, so too must the methods for assessing program effect. The objective of this research was to extend the traditional binary cohort assignment in quasi-experimental program evaluation in order to quantify the differential effects of a multi-tiered well-being improvement program administered over a 3-year period at a large employer. Data collected over this 3-year period included well-being assessments and medical claims from 17,669 employees and spouses. These individuals were assigned to different cohorts based on intervention program intensity and matched utilizing coarsened exact matching. The matching process removed 85 %, on average, of detectable bias across all comparison cohorts. A weighted generalized linear model, using the coarsened exact matching derived weights, was estimated to quantify the net (difference-in-difference) causal effect of the well-being intervention program. The results showed an average increase in overall well-being of 1.48 points in the High Intensity cohort and 1.32 points in the Mild Intensity cohort, whereas the non-intervened cohort evidenced only a 0.57 point increase in overall well-being. The methodology reported here provides an expanded and robust approach to matching across different cohorts for the purpose of program evaluation.

Keywords Coarsened exact matching · Treatment effect estimation · Well-being improvement · Quasi-experimental program evaluation
1 Introduction

The objective of this research was to extend the traditional binary cohort assignment in quasi-experimental program evaluation in order to determine whether varying levels of
J. A. Sidney (&) C. Coberley J. E. Pope A. Wells Center for Health Research, Healthways, Inc., 701 Cool Springs Boulevard, Franklin, TN 37067, USA e-mail:
[email protected]
program engagement yielded differing levels of well-being improvement over a 3-year period at a large employer. Well-being improvement programs are designed to provide solutions for individuals across the spectrum of health care needs for the purpose of minimizing current and projected health care costs while leading to higher individual productivity, engagement and overall well-being. Improving well-being leads to lower healthcare costs and utilization, better performance on the job, enhanced productivity, lower turnover, and fewer days of unplanned absence. One cross-sectional study has shown that a one point difference in well-being was associated with a 2 % difference in healthcare utilization and a 1 % difference in healthcare costs for individuals (Harrison et al. 2012). Subsequent research shows that the strength of this effect may differ depending on the well-being and demographic make-up of the population.

Research to date within the field of well-being program evaluation has considered the study population to have either been given a treatment or not, and has assumed that matching will yield an unbiased, efficient estimate of the treatment causal effect (Rubin 2010; Wells et al. 2012a). This binary cohort assignment is not in alignment, though, with the intent of well-being improvement programs, which seek to identify, intervene on and measure a diverse set of individuals based on a multitude of factors such as readiness to change behavior, current demand on health care resources and projected likelihood of well-being decline. The evolution of programs designed to help individuals manage their health and well-being has accelerated over the last decade, yet outcomes measurement methodologies have been slow to advance (Goetzel et al. 2008; Mattke et al. 2013).
In other disciplines of health care related research, in particular pharmaceutical research, methodologies for measuring outcomes from quasi-experimental designs have evolved to account for the varied nature in which a set of treatments may affect different individuals within a population (Ho et al. 2006; Yu et al. 2013; Rubin 2007; Stuart and Rubin 2007). The current study builds upon this research, leveraging a relatively new matching methodology, with the intent of showing researchers, administrators, employers and individual participants of well-being improvement programs how to more comprehensively assess the value of such programs.

Finding a non-enrolled population that is comparable to the enrolled population is difficult due to the inherent selection bias in quasi-experimental studies of opt-in lifestyle coaching programs (Cortes et al. 2008; Larzelere et al. 2004). As a further complication, bias is present within the enrolled population due to varying levels of intervention intensity. There is a need to account for multiple levels of intensity, and other studies have sought to expand matching methodologies for such purposes (Rassen et al. 2013; Wang et al. 2013). To address these biases, this study utilized coarsened exact matching (CEM) to simultaneously match across multiple cohorts of individuals who were engaged at varying levels in a well-being improvement program. The purpose of this study is to expand the current CEM methodology into a multiple cohort framework, with the outcome of well-being change as the means by which to illustrate the advancement. We also present results from application of binary CEM on the same outcome to highlight the different results obtained from the two approaches.
2 Data

In this 3-year retrospective analysis (January 2010 through December 2012), responses to a well-being assessment (WBA) survey as well as administrative health care claims were collected from employees, spouses and dependents (collectively, members) of a large US
employer. The WBA was offered to all members of the employer. Based on the survey responses and administrative claims, the well-being program administrator was able to identify members for individualized well-being improvement coaching sessions. These sessions were designed to review current chronic condition status, emotional health, healthy behaviors, physical state, life evaluation and biometric risks; inform and educate members of long-term health consequences associated with their risks; and offer effective guidance in the form of motivation as well as action plans to move the members to a higher state of well-being. In the case of members with chronic conditions, the sessions were augmented with additional content specific to their chronic condition state, including specific steps to follow for proper disease management, adhering to prescribed medication and closing gaps in care.

For members who chose to participate in coaching sessions, an enrollment was created; other members who completed the WBA and were identified to have multiple well-being risks, and thus qualified for enrollment into the program, but did not participate, were deemed non-enrolled. The presence or absence of enrollment in any one of the three analysis years distinguished treatment (participating) and comparison (non-participating) group members. This 'once treated always treated' philosophy was employed to reduce selection bias, and thus contaminated error terms, from mixing members across the evaluated groups. Among the total population present over the 3-year study period, 53 % completed at least one assessment during the 3-year window of this study. Of this subset of members, 66 % completed at least two WBAs (n = 17,669), with 65 % (n = 11,446) having enrolled in a well-being improvement coaching program. The cohort construction section below contains the specifications for member inclusion in the analysis.
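The 'once treated always treated' rule described above can be sketched as follows (a minimal stand-in; the function name and input shape are hypothetical, not from the program's actual system):

```python
def assign_group(enrolled_by_year):
    """Apply the 'once treated always treated' rule.

    enrolled_by_year: dict mapping analysis year -> True if the member
    held a coaching enrollment in that year (hypothetical input shape).
    """
    # Enrollment in any one of the three analysis years places the member
    # in the treatment (participating) group; otherwise comparison.
    return "treatment" if any(enrolled_by_year.values()) else "comparison"
```

Because group membership is fixed by any single enrollment, a member who enrolls in 2011 but lapses in 2012 still remains in the treatment group for the full study window.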
Factor analysis was applied to the WBA responses in order to identify latent constructs not directly assessed in the survey while consolidating the number of questions that could be utilized in the matching and regression analyses. For some members, however, the latent factors could not be constructed due to incomplete WBAs. The occurrence of missing responses for one or more WBA questions was observed in the data because members were permitted to skip questions, whether intentionally or not. The presence of missing responses has been noted before (Prochaska et al. 2012) and similar to prior research, the approach taken here utilized Markov chain Monte Carlo (MCMC) to impute the latent factors as opposed to individual responses for respondents evidenced to have missing values (Yuan and Yang 2009; Schafer 1997). Results of the distribution of response values with and without MCMC, in terms of the stability and validity of the MCMC method, are discussed in the Results section.
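The paper imputed latent factors with MCMC (ten iterations, averaged); the crude stand-in below only illustrates the "impute m times and combine" step, drawing missing entries from the observed distribution rather than from an MCMC posterior, and is not the authors' method:

```python
import random
import statistics

def impute_factor(values, m=10, seed=42):
    """Crude stand-in for multiple imputation of one latent factor.

    values: list of factor scores with None for members whose factor
    could not be constructed. Missing entries are drawn m times from
    the observed distribution and the draws are averaged, mirroring
    only the combine step of the MCMC procedure used in the paper.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    completed = []
    for v in values:
        if v is None:
            draws = [rng.choice(observed) for _ in range(m)]
            completed.append(statistics.mean(draws))
        else:
            completed.append(v)  # observed values are never altered
    return completed
```

Observed values pass through unchanged; only the missing entries receive the averaged imputation, which is why the matching and regression inputs remain complete.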
3 Methodology

3.1 Coarsened exact matching

Coarsened exact matching is an innovative, bias-reducing matching method that can be employed within a quasi-experimental design for the purpose of comparing an outcome between two cohorts over time (Iacus et al. 2009, 2011). Compared with the commonly applied propensity score matching methodology (Brandt et al. 2010; Allen-Ramey et al. 2003; Perkins et al. 2000; Shepardson et al. 1999), CEM has been found to yield estimates of the causal effect with the lowest variance and bias for any sample size (Wells et al. 2012a, b; King et al. 2011b). The increased efficiency and lower bias properties of CEM are attributed to exact matching across common strata defined by variables that explain
variance in the decision to participate or not in the program. Effectively, CEM enables more comparable evaluation of treatment and comparison groups by creating proportionality among the factors contributing to the outcome of interest. The CEM methodology allows for theoretical extension into multiple cohorts (Iacus et al. 2011), yet to the authors' knowledge there has been no publication of CEM utilized in such fashion or for well-being program evaluation.

To extend the application of CEM beyond two study cohorts, the data must first be organized into one treatment cohort and multiple comparison cohorts defined by mutually exclusive criteria. The cohorts are exactly matched across strata that have been defined by a stratification logic consisting of discrete variables. The matching process, similar to the two-cohort design, maintains strata common to all cohorts and removes the unique strata (Plesca et al. 2008; Hade and Erinn 2012; Rassen et al. 2011). With the matched strata identified, weights are calculated for each stratum based on the relative proportion of members in the evaluated cohorts. The weights then allow for an unbiased estimate of the causal effect associated with the treatment, whether within the nonparametric structure of CEM or in a weighted least squares multivariate regression model (King and Zeng 2007).

Prior to matching, a stratification logic must be established. The variables used to stratify vary from population to population, though in all cases the variables should be selected based on a priori assumptions about the sources of bias present among the selected cohorts as well as factors explaining variability in the baseline value of the outcome. The stratification logic requires discrete representations, referred to as bins, of the evaluated variables, such as age represented in two categories based on a threshold of 40, in order to create a combination of the variables into which a member is uniquely assigned.
These unique combinations of categorized variables are collectively referred to as strata. Transforming continuous variables into discrete bins can be achieved using a number of analytical techniques, including quartile breakpoints, Scott's binning algorithm, the pretty breakpoints function found within the programming language R, and a novel method utilizing kernel density estimates. The kernel density estimate is constructed by summation of normal kernels at each observation of a given variable. The peaks among the plotted estimate imply a higher density around that particular value; inflection points on this plotted estimate are equivalent to naturally occurring breakpoints within the data. These inflection points were utilized to define the bin thresholds for each continuous matching variable. The kernel density method allows for unequal bin size within the variables, which is in contrast to the previously mentioned methods that seek to find equivalent proportions of individuals across the categorized levels. The imbalance, or relative distribution of individuals across strata and cohorts, is measured pre- and post-matching.

There are three primary objectives in any given application of CEM. One is to identify imbalance between treatment and comparison cohorts within a population. This is achieved by selecting the stratification logic that yields a sufficiently large pre-matching level of imbalance. A second objective is reduction of identified imbalance to a negligible level following matching. Lastly, an objective is to maximize retention of treatment cohort members. It is often not possible to maximize all of these objectives. For example, adding more variables or bins within the matching process will identify more pre-matching imbalance but will result in observations being pruned from the analysis.
Balancing these objectives creates a spectrum ranging from perfect statistical similarity of multiple cohorts (at the expense of cohort sample size) to generalizability of results in a population (at the expense of undetectable imbalance between cohorts).
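The kernel-density binning idea above can be sketched in pure Python. This is a stand-in, not the authors' code: the bandwidth and grid resolution are illustrative assumptions, and valleys (local minima) of the estimated density are used as the "naturally occurring" breakpoints:

```python
import math

def kde(xs, bandwidth):
    """Kernel density estimate: a sum of normal kernels, one per observation."""
    norm = len(xs) * bandwidth * math.sqrt(2 * math.pi)
    return lambda x: sum(
        math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs
    ) / norm

def density_breakpoints(xs, bandwidth, grid_n=201):
    """Bin thresholds at valleys (local minima) of the density estimate."""
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * i / (grid_n - 1) for i in range(grid_n)]
    f = kde(xs, bandwidth)
    d = [f(g) for g in grid]
    # A grid point is a breakpoint if the density dips below both neighbors.
    return [grid[i] for i in range(1, grid_n - 1) if d[i - 1] > d[i] < d[i + 1]]

def to_bin(x, breakpoints):
    """Assign a continuous value to a discrete bin index given the thresholds."""
    return sum(1 for b in breakpoints if x >= b)
```

Because the thresholds track where the data thin out, the resulting bins can be of unequal width, unlike quartile-style methods that equalize counts across bins.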
The L1 statistic was developed to quantitatively assess the extent to which a given stratification logic accomplishes these objectives (Iacus et al. 2009). This statistic ranges from 0 to 1, where 0 is recorded for a stratification logic that achieves perfect balance (equal proportion of treatment and comparison cohort members in each stratum) and 1 reflects a logic that results in a perfectly imbalanced set of members (treatment and comparison members are mutually exclusive in each stratum). The L1 is computed according to (1) below:

L1(T, C) = (1/2) * sum_{s=1}^{k} |T_s − C_s|,  for all s in (1, …, k)   (1)

where T_s represents the relative frequency of treatment observations (members) for a given stratum, C_s denotes the relative frequency of the comparison observations for a given stratum, s denotes each unique stratum, and k is the total number of unique strata created. A separate statistic entitled Least Common Support (LCS) is also evaluated in CEM applications to assess the efficiency of created strata. The LCS indicates the percentage of strata created with at least one observation from each cohort present therein (Iacus et al. 2009).

A simple example of the L1 calculation is as follows: assume a stratification logic that creates three strata across 15 treatment and 17 comparison observations. Stratum 1 is populated with 3 treatment and 8 comparison observations (relative frequencies of 0.20 and 0.47). Stratum 2 is populated with 5 treatment and 5 comparison observations (relative frequencies of 0.33 and 0.29). Last, Stratum 3 is populated with 7 treatment and 4 comparison observations (relative frequencies of 0.47 and 0.24). Per Eq. (1), the absolute differences calculated from these relative frequencies are 0.271, 0.039, and 0.231 for Strata 1, 2, and 3. The sum of these absolute differences is 0.541, and multiplying by 0.5 yields a calculated L1 of 0.271.
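The worked example above can be checked with a small script (a sketch; per-stratum observation counts are passed as plain dicts):

```python
def l1_statistic(treat_counts, comp_counts):
    """Compute the L1 imbalance measure of Eq. (1) from stratum counts."""
    t_total = sum(treat_counts.values())
    c_total = sum(comp_counts.values())
    strata = set(treat_counts) | set(comp_counts)
    # Half the sum of absolute differences in relative frequencies.
    return 0.5 * sum(
        abs(treat_counts.get(s, 0) / t_total - comp_counts.get(s, 0) / c_total)
        for s in strata
    )

# The three-stratum example: 15 treatment vs. 17 comparison observations.
print(round(l1_statistic({1: 3, 2: 5, 3: 7}, {1: 8, 2: 5, 3: 4}), 3))  # → 0.271
```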
Once the common strata, S, are established across the treatment cohort, T, and all comparison cohorts, C, the weights are calculated using Eq. (2a). To simplify the equations, the weights (w) are calculated for each comparison cohort as:

w_s = (m_C * m_T^s) / (m_T * m_C^s),  for all s in (S_T ∩ S_C1 ∩ … ∩ S_Cn)   (2a)

sum_{i=1}^{a} w_i = a   (2b)

(1/a) * sum_{i=1}^{a} w_i = 1   (2c)

where s, again, denotes each unique stratum, m_C is the total number of post-matching comparison observations, m_T is the total number of post-matching treatment observations, m_T^s is the total number of post-matching treatment observations within a given stratum, m_C^s is the total number of post-matching comparison observations within a given stratum, i indexes members by cohort, and a is the total number of observations within a unique cohort with w ≠ 0. The mathematical proof of how these weights reduce bias can be found in Appendix A of Iacus et al. (2008). The weights w_s are applied to the comparison cohorts; the treatment cohort receives a weight of 1 regardless of observation distribution across common strata. Furthermore, all observations not present in a common stratum receive a weight of 0. Equation (2b) illustrates that the sum of the weights, w_i, for a unique cohort must equal the total number of observations of that cohort; by extension, the average of the
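The stratum-weight calculation of Eq. (2a), restricted to the common strata, can be sketched as follows (a dict-based stand-in, not the authors' implementation):

```python
def cem_weights(treat_counts, comp_counts):
    """Per-stratum weights for one comparison cohort (Eq. 2a).

    treat_counts / comp_counts map stratum -> number of observations.
    Only strata populated by both cohorts (the common strata) are matched;
    observations outside them implicitly receive weight 0.
    """
    common = set(treat_counts) & set(comp_counts)
    m_t = sum(treat_counts[s] for s in common)  # matched treatment observations
    m_c = sum(comp_counts[s] for s in common)   # matched comparison observations
    return {
        s: (m_c * treat_counts[s]) / (m_t * comp_counts[s])
        for s in common
    }

w = cem_weights({1: 3, 2: 5, 3: 7}, {1: 8, 2: 5, 3: 4})
# Property (2b): weighting the comparison observations sums back to m_C = 17,
# so the average weight per matched comparison observation is 1 (Eq. 2c).
print(round(sum(w[s] * n for s, n in {1: 8, 2: 5, 3: 4}.items()), 6))  # → 17.0
```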
weights, by cohort, must equal 1 as shown in Eq. (2c). Equations (2b) and (2c) are necessary to demonstrate the efficiency of w and to ensure variability across cohorts is not incorrectly calculated. To confirm the CEM-generated weights have removed detected bias in the population, the L1 is recalculated based on the resulting distribution of matched members and is presented as L1(T*, C*). Bias removal is indicated by a result where L1(T*, C*) ≤ L1(T, C) (Iacus et al. 2008). If a sufficient degree of bias has been removed, the weights can be used in descriptive statistics and regression models to determine the causal effect of the treatment (Iacus et al. 2009, 2011).

3.2 Outcome of interest and cohort construction

In this study, the evaluated outcome was whether an intervention program designed to improve individual well-being over a 3-year period was effective. The outcome measure was the individual well-being score (IWBS), which is calculated based on responses to the WBA and is designed to measure all aspects or domains of a person's life including emotional health, physical health,^1 work environment, healthy behaviors, basic access, and life evaluation (Evers et al. 2012). This score has been found to be highly correlated with high healthcare costs, short-term disability days, and workplace productivity (Sears et al. 2013; Shi et al. 2012, 2013; Gandy et al. 2013, 2014; Prochaska et al. 2011, 2012; Merrill et al. 2012, 2013). Research has shown that individuals who increased their IWBS by one point were, on average, 1.7 % less likely to incur an ER visit and 1.0 % less likely to incur any health care costs (Harrison et al. 2012). The IWBS ranges from 0 to 100, where 100 represents perfect well-being and 0 indicates the complete absence of well-being. Missing or incomplete responses used to construct the IWBS were not imputed.
A requisite percentage of the IWBS component questions must be completed in order to calculate the IWBS; the IWBS for a given individual is null if the threshold is not satisfied (Prochaska et al. 2012, 2008). The unit of analysis for evaluating change in IWBS was members who completed a WBA in 2010 (pre-intervention base year) and in 2012 (study year). The specific formulation of the outcome was the difference, between the evaluated cohorts, in IWBS change from base year to study year, or simply, the difference-in-difference of IWBS.

To assess the significance, directionality and magnitude of IWBS change, members were allocated to three cohorts based upon their exposure to the well-being improvement program: high program intensity, mild program intensity, and those never enrolled in the program. The High and Mild Intensity cohorts were defined as either above or below the average annual number of attempted telephonic outreaches (equal to 5). Attempted outreaches were used as a proxy metric for a more general intent-to-treat specification of the cohorts. Table 1 shows that in 2010, members with more attempted outreaches had higher medical utilization and lower well-being, indicating that such members required more outreaches to improve well-being. This finding further supports the hypothesis being tested here that bias is present across and within enrollment types. While the use of multi-cohort CEM is intended to minimize this bias, in addition to the bias stemming from non-enrolled members, it is possible that residual selection bias remains. The theory of CEM specifically, and of matching more generally, would hold that such residual bias is normally distributed across members, strata and cohorts, such that a difference-in-difference calculation
^1 Information pertaining to diagnosed diseases and conditions was based on administrative claims as opposed to stated survey responses.
Table 1 Means of factors, matching variables, and regression covariates

| Covariate | Prime, pre (n = 6,267) | Mild Intensity, pre (n = 5,179) | Comparison, pre (n = 6,223) | Prime, post (n = 5,864) | Mild Intensity, post (n = 4,218) | Comparison, post (n = 4,919) |
|---|---|---|---|---|---|---|
| Age^a | 47.48 | 45.86 | 44.70 | 47.66 | 47.55 | 47.48 |
| Gender (percent female)^a | 0.42 | 0.44 | 0.44 | 0.42 | 0.42 | 0.42 |
| Health inactivity factor 2010 | 1.05 | 0.81 | 0.59 | 1.04 | 0.95 | 0.74 |
| Biometric factor 2010^a,b | 0.88 | 0.79 | 0.75 | 0.87 | 0.86 | 0.86 |
| Physical problems factor 2010^a,c | 0.29 | 0.24 | 0.17 | 0.29 | 0.29 | 0.27 |
| Self-perception ladder factor 2010 | 7.71 | 7.93 | 8.00 | 7.72 | 7.82 | 7.88 |
| Positive emotion factor 2010 | 0.84 | 0.85 | 0.87 | 0.84 | 0.83 | 0.86 |
| Afford basic needs factor 2010 | 0.93 | 0.94 | 0.94 | 0.93 | 0.93 | 0.93 |
| Negative emotion factor 2010 | 1.04 | 1.07 | 1.08 | 1.05 | 1.06 | 1.07 |
| Healthy habits factor 2010 | 2.07 | 2.26 | 2.29 | 2.07 | 2.22 | 2.29 |
| Work satisfaction factor 2010 | 0.92 | 0.92 | 0.93 | 0.92 | 0.92 | 0.94 |
| Safe residence factor 2010 | 0.90 | 0.91 | 0.92 | 0.91 | 0.91 | 0.92 |
| Access healthy food and drinks factor 2010 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| Work environment factor 2010 | 0.76 | 0.77 | 0.77 | 0.76 | 0.77 | 0.76 |
| Doctor visits factor 2010 | 0.90 | 0.90 | 0.88 | 0.90 | 0.90 | 0.89 |
| Diabetes | 0.16 | 0.06 | 0.03 | 0.16 | 0.12 | 0.12 |
| COPD | 0.02 | 0.01 | 0.00 | 0.02 | 0.02 | 0.02 |
| Asthma | 0.11 | 0.08 | 0.04 | 0.10 | 0.11 | 0.09 |
| CAD | 0.07 | 0.02 | 0.01 | 0.07 | 0.05 | 0.05 |
| Chronic back pain | 0.15 | 0.11 | 0.06 | 0.15 | 0.16 | 0.17 |
| Months after disease trigger date (MAT) | 20.24 | 14.78 | 6.49 | 19.73 | 19.78 | 18.24 |
| Count of diseases^a | 0.84 | 0.53 | 0.26 | 0.82 | 0.79 | 0.78 |
| ER visits (2010) | 0.38 | 0.28 | 0.26 | 0.36 | 0.32 | 0.40 |
| Unique IP stays (2010) | 0.10 | 0.06 | 0.06 | 0.10 | 0.09 | 0.10 |
| Out patient visits (2010) | 4.02 | 3.23 | 2.42 | 3.98 | 3.83 | 3.40 |
| Unique prescriptions filled (2010) | 3.32 | 2.51 | 2.01 | 3.27 | 3.10 | 2.97 |
| Attempted calls (2010) | 7.61 | 3.77 | 4.19 | 7.71 | 3.74 | 5.24 |
| Attempted calls (2011) | 7.08 | 2.57 | 2.43 | 7.15 | 2.90 | 3.80 |
| Attempted calls (2012) | 8.55 | 2.60 | 1.93 | 8.43 | 3.37 | 5.58 |
| IWBS 2010 | 76.35 | 79.12 | 80.38 | 76.51 | 77.76 | 78.59 |
| IWBS 2012 | 77.90 | 80.48 | 81.73 | 77.99 | 79.46 | 79.40 |
| Delta IWBS (2012–2010) | 1.55 | 1.36 | 1.34 | 1.47 | 1.70 | 0.81 |
| Health care costs PMPM 2010 | $425.17 | $295.53 | $200.25 | $423.05 | $407.56 | $343.65 |
| Health care costs PMPM 2012 | $526.21 | $348.91 | $271.11 | $516.43 | $446.55 | $399.26 |

^a Indicates that the covariate was utilized in the stratification logic of CEM
^b As noted earlier, the biometric factor is an index of questions indicating whether a member had diabetes, high blood pressure (over 139/89) and/or high cholesterol (>239 mg/dL), in addition to the natural logarithm of their BMI (for scaling reasons)
^c The physical problems factor is an index of questions concerning whether a member had any of the following issues: knee or leg pain, neck or back pain, physical pain, recurring pain, count of other conditions not listed, and if health problems prevented the member from performing normal activities
of mean values by strata is a robust means by which to derive the overall causal effect (Iacus et al. 2009, 2011).

The three cohorts were comprised of members between the ages of 18 and 64, all of whom were eligible to enroll in the well-being improvement program. Enrollment in the program was defined as participating for at least 3 months. Members with only 1 or 2 months of identified participation, or who opted out of the program prior to its completion, were removed from the analysis as they represented a contaminated treatment group (refer to Fig. 1). Enrolled members with five or more attempted outreaches on average were placed in the High Intensity cohort, while enrolled members with four or fewer were assigned to the Mild Intensity cohort. The non-enrolled cohort was comprised of members who never enrolled in the well-being improvement program but had at least one attempted outreach. The singular attempted outreach criterion was an important consideration as it indicated the member was deemed eligible to benefit from the program but either opted not to participate or was unable to be reached. By requiring the three cohorts to have at least one outreach attempt, the non-enrolled cohort was made more comparable to the enrolled
cohorts prior to matching. Figure 1 illustrates the population breakdown into the final cohort construction. A standardized comparison of the three cohorts can be found in Table 2, with the High Intensity cohort treated as the Prime cohort, the Mild Intensity cohort treated as the first comparison cohort, and the non-enrolled cohort as the second comparison cohort.

To satisfy the second purpose of this study, the results of a traditional binary cohort CEM analysis are also reported. The composition of the non-enrolled comparison cohort remained unchanged from the multiple cohort method. The Prime treated cohort, within the binary methodology, is the aggregation of the two enrolled and treated cohorts, regardless of program intensity. Unless otherwise stated, the binary cohort CEM analysis utilized the identical stratification and sampling logic found in the multiple cohort analysis. This decision allowed for measurable differences in the reported results to be attributed to the different methodologies, not variable inclusion.

3.3 Stratification logic

To assess and reduce bias across the cohorts, eight base year variables were utilized to create the stratification logic: age, gender, count of diseases (from administrative claims), the physical problems factor, the biometric factor, and three dummy variables indicating whether a member received an outreach in 2010, 2011, and/or 2012. The physical problems and biometric factors were statistically constructed using factor analysis with an orthogonal rotation on responses to WBA questions. Both of these factors were calculated as the weighted sum^2 of specific questions. The physical problems factor was calculated from questions concerning whether a member had any of the following issues: knee or leg pain, neck or back pain, physical pain, recurring pain, count of other conditions not listed, and if health problems prevented the member from performing normal activities.
The biometric factor was calculated from questions indicating whether a member had diabetes, high blood pressure (over 139/89) and/or high cholesterol (>239 mg/dL), in addition to the natural logarithm of their BMI (for scaling reasons). Given construction of the weighted factors, a linear relationship existed between factor value and the well-being risk of a member.

The continuous variables were transformed into discrete bins utilizing the kernel density estimate. Five bins were constructed for age: 18–26.9, 27–34.9, 35–50.9, 51–59.9, and 60–64.9. The variable count of diseases was specified with three bins: 0 claims-identified diseases, 1 disease, and 2 or more diseases. The bins for both factor variables were also constructed at three levels: 0, 0.01–0.4, and >0.4 for the physical problems factor, and 0–0.8, 0.81–1.1, and >1.1 for the biometric factor. Combination of the bins of these variables formed the stratification logic to which Eqs. (1) and (2a) were applied.

3.4 Generalized linear model specification

This study employed a weighted generalized linear model (GLM) to estimate the intervention effect. GLM allows for flexibility in estimating the effect of covariates with a limited dependent distribution, assuming a non-linear difference in each coefficient of the categorical level (McCullagh and Nelder 1989). This is an important advantage since GLM allows the incremental nature of the intensity of the program to be estimated, holding all
^2 Each variable was weighted by its relative factor loading compared to the loadings of the other variables comprised within the latent factor (threshold for variable inclusion in a factor was an absolute loading ≥0.30).
Fig. 1 Study population waterfall analysis. WBA denotes a well-being assessment. IWBS stands for individual well-being score
else constant. Furthermore, the data for several of the covariates in the model were responses to WBA questions, the range of which was restricted to the response options provided; a traditional linear model is not restricted in its prediction of the responses.

Table 2 Standardized difference^a of factors, matching variables, and regression covariates

| Covariate | Mild Intensity, pre | Comparison, pre | Mild Intensity, post | Comparison, post |
|---|---|---|---|---|
| Age^b | 0.1533 | 0.2727 | 0.0112 | 0.0172 |
| Gender (percent female)^b | 0.0321 | 0.0404 | 0.0000 | 0.0000 |
| Health inactivity factor 2010 | 0.0992 | 0.2073 | 0.0344 | 0.1325 |
| Biometric factor 2010^b | 0.3388 | 0.5284 | 0.0341 | 0.0376 |
| Physical problems factor 2010^b | 0.1620 | 0.4123 | 0.0023 | 0.0679 |
| Self-perception ladder factor 2010 | 0.1727 | 0.2360 | 0.0828 | 0.1283 |
| Positive emotion factor 2010 | 0.0339 | 0.0923 | 0.0242 | 0.0677 |
| Afford basic needs factor 2010 | 0.0351 | 0.0324 | 0.0216 | 0.0169 |
| Negative emotion factor 2010 | 0.0946 | 0.1241 | 0.0584 | 0.0963 |
| Healthy habits factor 2010 | 0.1861 | 0.2138 | 0.1404 | 0.2030 |
| Work satisfaction factor 2010 | 0.0042 | 0.0647 | 0.0011 | 0.0870 |
| Safe residence factor 2010 | 0.0477 | 0.0834 | 0.0154 | 0.0758 |
| Access healthy food and drinks factor 2010 | 0.0477 | 0.0360 | 0.0424 | 0.0071 |
| Work environment factor 2010 | 0.0326 | 0.0294 | 0.0188 | 0.0077 |
| Doctor visits factor 2010 | 0.0134 | 0.0759 | 0.0074 | 0.0470 |
| Diabetes | 0.3479 | 0.4645 | 0.1024 | 0.1238 |
| COPD | 0.0771 | 0.1258 | 0.0179 | 0.0436 |
| Asthma | 0.0891 | 0.2804 | 0.0043 | 0.0364 |
| CAD | 0.2272 | 0.3057 | 0.0897 | 0.0836 |
| Chronic back pain | 0.1187 | 0.3004 | 0.0299 | 0.0564 |
| Months after disease trigger date (MAT) | 0.2447 | 0.7009 | 0.0021 | 0.0658 |
| Count of diseases^b | 0.3134 | 0.6453 | 0.0291 | 0.0426 |
| ER visits (2010) | 0.0786 | 0.1001 | 0.0356 | 0.0256 |
| Unique IP stays (2010) | 0.1040 | 0.1095 | 0.0257 | 0.0024 |
| Out patient visits (2010) | 0.1390 | 0.3154 | 0.0244 | 0.1027 |
| Unique prescriptions filled (2010) | 0.3069 | 0.5322 | 0.0595 | 0.1083 |
| Attempted calls (2010) | 0.9587 | 0.8092 | 1.0228 | 0.5877 |
| Attempted calls (2011) | 1.1966 | 1.2439 | 1.1535 | 0.8302 |
| Attempted calls (2012) | 1.2202 | 1.2456 | 1.0515 | 0.4642 |
| IWBS 2010 | 0.2162 | 0.3217 | 0.0958 | 0.1624 |
| IWBS 2012 | 0.1966 | 0.2952 | 0.1115 | 0.1046 |
| Delta IWBS (2012–2010) | 0.0156 | 0.0175 | 0.0184 | 0.0557 |
| Health care costs PMPM 2010 | 0.1403 | 0.2771 | 0.0144 | 0.0840 |
| Health care costs PMPM 2012 | 0.1391 | 0.2091 | 0.0540 | 0.0941 |

^a The standardized difference, (x̄1 − x̄2) / sqrt((s1² + s2²)/2), is calculated with respect to the Prime cohort
^b Indicates that the covariate was utilized in the stratification logic of CEM
Finally, GLM relaxes the assumption of constant variance across all members present in the analysis (McCullagh and Nelder 1989). A GLM can be expressed in the following set of equations:

g(μ) = x′β   (3a)

var(y) = ψ V(μ) / w   (3b)

μ = x′β   (3c)
where y is the dependent variable, x is a vector of covariates, and β denotes a vector of parameters. Equation (3b) illustrates that variability in the outcome measure y depends on μ through a variance function, V, and the scalar ψ/w, where ψ is the dispersion parameter and w the vector of weights. Last, Eq. (3c) is the identity link function utilized by the model (Jackman). Relying on Eqs. (2b) and (2c), the weights can be applied to the distribution of x without biasing the outcome measure, y. The estimated distribution with the applied weights is a function of the vector of maximum likelihood estimates, β, thus allowing an unbiased estimate of ŷ. This study estimated the equations in (3a), (3b) and (3c) with y = ΔIWBS and x = f(Program Utilization, WBA Risk Factors, Demographic Information, Healthcare Utilization, Disease Indicators), while assuming a normal distribution from the exponential family. To control for residual variability in the outcome not directly accounted for by the match, we included additional variables in the regression model. Lastly, random sampling was employed to further mitigate any undetectable bias that may be a function of a subset of members, whether individually or in a particular combination (King et al. 2011a). The cohorts were sampled with replacement at a 1:1:1 ratio 100 times; the causal effects were based on the average over these iterations.

4 Results

4.1 Factor analysis and Monte Carlo simulation

Based on the final 17,669 study participants, factor analysis with orthogonal varimax rotation allowed for consolidation of responses to WBA questions into 13 latent factors. The identification of latent factors allowed for the measurement and utilization of unobserved variables present within the survey. Utilizing the factors in lieu of the individual questions removed more of the potential measurement error inherent in survey data (Bound et al. 2000). Furthermore, the latent factors allowed for more efficient model specification.
Utilizing all individual responses in model specification could lead to redundancies, as the WBA questions were designed to uncover latent factors (Sears et al. 2014). On the other hand, evaluating only a subset of responses could lead to an incomplete view of the latent factors, since each question contributes to a factor. For these reasons, the unobserved variables proved necessary to both the stratification logic and the regression specification. Questions were uniquely loaded onto a latent factor if their absolute factor loading exceeded 0.30. Four of the 13 factors (health inactivity, biometric, physical problems, and self-perception) were utilized in the regression model due to their relatively high correlation with the outcome of interest. The health inactivity factor comprised questions related to poor health and its effect on daily activities. The biometric factor included weight, blood pressure, and glucose. The physical problems factor indicated how much physical pain a member suffered from, as well as the specific locations of the pain. Last, the self-perception factor assessed the quality of life and health of a member based on a discrete scale ranging from 0 to 10. Some of the factors could not be constructed for all members because those members had completed an insufficient number of the requisite questions. On average, the percentage of incomplete non-workplace related factors across the population was 6.5 %. Of the two workplace related factors, 34.2 % were incomplete. Ten iterations of MCMC simulation were performed on the 13 factors, which yielded a relative efficiency of approximately 1 for all factors. Full results of the simulation are available upon request. The means of the imputed factor values across the ten iterations were calculated and utilized in the matching and regression models.

4.2 Matching

The matching results are listed in Table 3 along with the matched strata results found in Table 4. The Mild Intensity cohort evidenced an 82 % reduction in detectable bias while the comparison cohort had an 88 % reduction in detectable bias. However, the post-matching L1 indicated residual variability and bias following application of the matching weights; as a result, a regression model was employed to model the remaining variability in the matched cohorts. Given the stratification logic, it is important to note that the Prime cohort remained relatively intact, with just 6.2 % of observations on average removed from further analysis. Furthermore, approximately 80 % of the non-Prime cohort members remained in the study post matching. The binary cohort method resulted in a 99 % reduction of detectable bias. This reduction was achieved with minimal removal of observations across both cohorts. However, the multiple cohort method detected more bias (0.957) than the binary cohort method (0.499).
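The multivariate L1 statistic reported above can be illustrated with a short computation. The sketch below is a minimal pure-Python version of the Iacus, King, and Porro definition, L1 = ½ Σ_s |f_s − g_s|, where f and g are the two cohorts' relative frequencies over the cross-tabulated coarsened strata; the bin edges and toy records here are hypothetical and are not the study's stratification logic.

```python
from collections import Counter

def l1_imbalance(cohort_a, cohort_b, bin_edges):
    """L1 multivariate imbalance between two cohorts of covariate tuples.

    bin_edges maps each covariate index to its sorted coarsening cut
    points. L1 = 0 indicates perfect balance; L1 = 1 indicates complete
    separation of the two coarsened distributions.
    """
    def coarsen(row):
        # Replace each covariate value with the index of its coarsened bin.
        return tuple(sum(v >= edge for edge in bin_edges[j])
                     for j, v in enumerate(row))

    f = Counter(coarsen(r) for r in cohort_a)
    g = Counter(coarsen(r) for r in cohort_b)
    strata = set(f) | set(g)
    return 0.5 * sum(abs(f[s] / len(cohort_a) - g[s] / len(cohort_b))
                     for s in strata)

# Hypothetical records: (age, risk factor score).
treated = [(25, 1.0), (40, 2.0), (60, 3.0)]
control = [(26, 1.1), (55, 2.1), (61, 3.1)]
edges = {0: [30, 50], 1: [1.5, 2.5]}
print(l1_imbalance(treated, treated, edges))  # 0.0
print(l1_imbalance(treated, control, edges))  # ≈ 0.333
```

Coarsening makes the statistic tolerant of small within-bin differences, which is why post-matching L1 values can approach zero without exact covariate equality.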
Examination of the stratification variables illustrates how CEM created cohorts and weight vectors whose distributions were more similar to the Prime cohort than the unweighted distributions. The matched distributions for variables included in the stratification logic reported the most similar first-moment values, with the non-stratification variable distributions also evidencing a greater degree of comparability post matching. For example, the biometric factor, a stratification variable, showed a pre-matching standardized difference of 0.34 in the Mild Intensity cohort and 0.53 in the comparison cohort. After matching, the standardized difference fell to 0.03 and 0.04 for the respective cohorts. Another example of the reduced difference in a stratification variable's distribution was the count of diseases. The Mild Intensity and comparison cohorts recorded standardized differences of 0.31 and 0.65 prior to the match, with post-match differences of 0.03 and 0.04. Reduced standardized differences after matching were also found in non-stratification variables such as the diabetes prevalence rate, months after disease identification date, and health care costs. Recall that the Prime cohort received a weight of one; thus any difference between the weighted and unweighted Prime cohort means was solely a function of the observations removed during the matching process.

Table 3 Coarsened exact matching imbalance results

                            Pre-matching         Post-matching        Average members dropped from matching
Cohort                      L1^a    LCS (%)^b    L1^a    LCS (%)^b    Prime    Prime (%)    Cohort     Cohort (%)
Mild Intensity              0.370   17.7         0.066   100          386.3    6.20         1,066.8    20.6
Comparison                  0.587   17.7         0.072   100          386.3    6.20         1,134.2    18.2
Binary cohort comparison    0.499   24.8         0.005   100          395.4    3.45         440.2      7.07

^a Lower values of L1 indicate better balance between groups, with L1 = 0 indicating perfect balance and L1 = 1 indicating perfect imbalance and mutual exclusivity
^b Least Common Support (LCS) assesses the efficiency of the created strata. The LCS indicates the percentage of strata created with at least one observation from each cohort present therein

Table 4 Matched strata summary results

Cohort           Mean N    Std dev    Median N    Minimum N    Maximum N    Sum N    Mean weight    Mean ratio^a
Prime            16.3      26.8       6           1            219          5,864    1              N/A
Mild Intensity   11.7      17.7       6           1            138          4,218    1.23           1.70
Comparison       13.7      32.1       4           1            346          4,919    2.25           2.69

After dropping unmatched strata, 360 common strata were present
^a Mean ratio = the mean ratio of Prime cohort members to the respective cohort members across all matched strata

Table 5 shows the mean and standard deviation of the average treatment effect on the treated (ATT) calculated prior to regression analysis across 100 random samples. The ATTs are similar to the difference of the Mild Intensity and comparison cohort mean IWBS change relative to the Prime cohort shown in Table 2. Note that prior to matching, both the Mild Intensity and comparison cohorts showed a similar change in IWBS score, 1.36 and 1.34 respectively. With respect to IWBS change and following application of the matching weights, as illustrated in Table 2, the Mild Intensity cohort outperformed the Prime cohort while the comparison cohort largely underperformed it. However, as noted previously, not all variability was removed during the matching process. The implication is that the algebraic ATT cannot be definitively declared the most efficient and unbiased estimate of the program's effect.

Table 5 Algebraic calculation of the ATT

Cohort           ATT (std dev)^a
Mild Intensity   0.202 (0.086)^b
Comparison       -0.68 (0.178)^b

ATT = average treatment effect on the treated
^a Pre-regression CEM ATT calculation (i.e., calculated at the strata level, post matching, in which the difference-in-difference values were multiplied by the respective number of Prime cohort members and then divided by the total number of Prime members); standard deviations are calculated from the range of ATTs estimated across 100 random samples
^b The resulting ATTs were multiplied by -1 to maintain a consistent interpretation of the directionality of the varying calculations of the ATT

4.3 Regression

In this study, we employed a second stage to the CEM process in which a weighted GLM was estimated on the matched cohorts. We pursued this two-stage type of CEM methodology to explain residual variability in the outcome measure post matching that may be due
to the stratification logic not resulting in exact matches. Results of the estimated regression model are listed in Table 6. The estimated coefficients for the Mild Intensity and comparison cohorts were similar to the pre-regression, or algebraic, ATT values: the Mild Intensity cohort showed a statistically significant increase in IWBS of 0.38 compared to the Prime cohort, while the comparison cohort decreased in IWBS by 0.30 relative to the Prime cohort. The binary cohort application of CEM showed the comparison cohort to have decreased in IWBS by 0.20 relative to the all-in treated cohort (results not shown); this result was statistically insignificant. These findings, however, are restricted to the marginal effect of program intensity, holding all other variables constant.

Table 6 Generalized linear model regression results

Variable                         Coefficient^c    Std dev    p value
Mild Intensity cohort^a          0.38             0.0867     0.0011
Comparison cohort^a              -0.30            0.1749     0.1181
ER visits 2010                   0.02             0.0557     0.6696
Biometric factor                 3.08             0.3455     0.0001
Health inactivity factor         0.46             0.0295     0.0001
Physical problems factor         2.30             0.2434     0.0001
Self-perception ladder factor    -1.87            0.0514     0.0001
First program trigger: 2010^b    -1.60            0.1621     0.0001
First program trigger: 2011^b    -1.99            0.1756     0.0001
First program trigger: 2012^b    -3.25            0.3330     0.0001
Inpatient stays 2010             -0.47            0.2972     0.1424
Intercept                        15.01            0.5507     0.0001
MAT                              0.00             0.0041     0.4167
Unique prescriptions 2010        -0.33            0.0294     0.0001
Outpatient visits 2010           -0.04            0.0142     0.0245
Scale                            11.59            0.0503     0.0001
Age                              0.01             0.0058     0.227
Asthma                           -0.80            0.2612     0.0111
CAD                              -0.88            0.3084     0.0154
Chronic back pain                -1.71            0.2395     0.0001
COPD                             -0.81            0.6247     0.22
Diabetes                         -0.32            0.2552     0.2427

Wald statistic (w.r.t. cohort variable) = 1.0633; k = 20; p value ≈ 1; AIC = 105,107; log likelihood = -52,531
^a Reference group is Prime
^b Reference group is opt-in members
^c The reported coefficients are the mean coefficient across 100 random samples. The reported standard deviation is the standard deviation of the 100 coefficients

Evaluating the effect of program intensity with all other variables at their mean values, hereafter referred to as the average predicted value (or, more formally, the total differential of the estimated regression equation), demonstrated that the Prime cohort had a relatively higher improvement in IWBS than the Mild Intensity cohort and an even larger improvement than the comparison cohort (see Table 7). While the binary cohort method utilized two cohorts to perform the match, the average predicted values are reported by the intensity of the intervention to facilitate an easier comparison of results. The pre-regression CEM ATT values were of similar sign and magnitude, as well as within the confidence intervals, of the regression-derived predicted values (Table 7). These results were important as they demonstrated that, even after controlling for many other factors within a parametric model, the ATT values were not materially different from the pre-regression values. Last, the confidence intervals around the predicted values for the Prime and Mild Intensity cohorts intersected, indicating no statistical difference between the two results (α = 0.05). The comparison group's confidence interval, however, was outside the intervals of the two intervention groups.

Table 7 Predicted change in IWBS from the generalized linear model

Cohort           Predicted^a multiple cohort IWBS (95 % CI)    Predicted^a binary cohort IWBS (95 % CI)
Prime            1.48 (0.84–2.12)                              1.61 (1.60–1.62)
Mild Intensity   1.32 (1.28–1.36)                              1.18 (1.17–1.19)
Comparison       0.57 (0.52–0.62)                              0.91 (0.90–0.92)

^a The predicted values are calculated as the average of weight-adjusted covariate values for each individual multiplied by the respective estimated coefficient values

Beyond the effect of program enrollment and intensity, the regression results showed a significant relationship between the time a member first received an active outreach for participation in the program and IWBS change. The program trigger variable, which denoted the year in which a member was first identified as being program eligible due to a high-risk profile, demonstrated greater IWBS improvement in members who opted in without direct outreach than in those who had direct outreach prior to enrollment. Furthermore, members who first had a direct outreach in 2012 reported a significantly greater decrease in IWBS over time than members who had their first outreach in 2011 or 2010.
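As a concrete (and purely illustrative) sketch of the second-stage estimation, the snippet below fits a weighted least squares line; for the Gaussian family with identity link in Eqs. (3a)–(3c), weighted least squares yields the same maximum likelihood estimates as the weighted GLM, and the fitted coefficients can then be evaluated at covariate values to form predicted outcomes. The data, weights, and single cohort indicator are invented for illustration; the study's model includes the full covariate set of Table 6.

```python
def wls_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x.

    Equivalent to a Gaussian GLM with identity link in which the
    weights w enter through the variance term psi / w (Eq. 3b).
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
    return ybar - b1 * xbar, b1

# Hypothetical data: x is a cohort indicator (1 = intervened) and w
# holds CEM-style weights that up-weight the comparison members.
x = [0, 0, 1, 1]
y = [0.5, 0.7, 1.3, 1.5]
w = [2.25, 2.25, 1.0, 1.0]
b0, b1 = wls_fit(x, y, w)
# Average predicted value: coefficients evaluated at chosen covariate
# values, analogous to the Table 7 predictions at weighted means.
print(round(b0 + b1, 2))  # predicted outcome for the intervened cohort
```

With a single binary regressor, the weighted slope reduces to the difference in weighted group means, which makes the equivalence between the cohort coefficient and a weighted mean comparison easy to verify by hand.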
5 Discussion

The objective of this research was to extend the traditional binary cohort assignment in quasi-experimental program evaluation in order to quantify the differential effects of a multi-tiered well-being improvement program. Specifically, CEM and a robust GLM were applied to the WBA responses of more than 17,000 members collected over a consecutive 3 year period. Compared to current binary matching methodologies, the multiple cohort matching methodology accounted for and removed bias across three different cohorts. Moreover, the GLM further reduced residual bias and inefficiency from the match, yielding an unbiased and efficient estimate of the causal effect of the evaluated well-being improvement program.

5.1 Matching

The purpose of this study was to extend CEM from the traditional binary cohort application to a multiple cohort application. Overall, the matching results indicated sufficient bias and variability among the groups prior to matching, thus supporting delineation of the intervened cohort into two sub-cohorts. Moreover, the post-matching L1 showed a demonstrable reduction in observable bias and variability across the set of member-level attributes specified in the stratification logic. There are few studies in which the L1 has been reported. In three such studies, the post-match L1 was reported as approximately 0 (King et al. 2011a; Pawa et al. 2013; Wells et al. 2012b). As with any CEM analysis, though, the groups could have been matched more exactly, and the resulting L1 reduced to near zero, if the stratification logic had been more refined in terms of variable binning and additional variable inclusion. The tradeoff to consider with further refinement of the stratification logic is the extent to which the remaining subset of matched members is representative of
the original population, such that the observed causal effect can be applied to all treated members to derive a total estimate of program effectiveness. The questions of how many post-match members, and what resultant multivariate distribution, constitute representativeness of the original population are beyond the scope of this study. For the members successfully matched, the CEM weights were applied to the Mild Intensity and comparison cohorts. These weights adjusted the distribution of members in the two cohorts across multiple attributes in order to achieve balance with the Prime cohort, which on average reported higher levels of disease prevalence, hospital utilization, and well-being risks. The adjusted distributions of the Mild Intensity and comparison cohorts resulted in more comparable disease prevalence rates, medical utilization, and well-being risks relative to the Prime cohort. Despite the adjustment for this imbalance, the Mild Intensity cohort demonstrated a greater increase in IWBS score compared to the Prime cohort after the weights were applied. The comparison group demonstrated a lower rate of IWBS improvement compared to the intervened cohorts, thus reflecting the differential effectiveness of the well-being improvement program.

5.2 Regression

Typically, in an outcome study that utilizes CEM, the effect of the program, as represented by the cohort coefficient, would be sufficient to assess program efficacy. In this study, the estimated coefficients of the evaluated cohorts were directionally comparable to the pre-regression CEM ATT values. In both methods, the Mild Intensity cohort displayed more improvement in the IWBS over time while the comparison cohort demonstrated less improvement relative to the Prime cohort.
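The pre-regression CEM ATT referred to throughout this section follows the recipe in Table 5, footnote a: within each matched stratum, take the difference in mean outcomes, then weight by that stratum's share of Prime members. Below is a minimal sketch with hypothetical strata; the other-minus-Prime sign convention used here is only one of the directionality choices the table footnotes describe.

```python
def cem_att(strata):
    """Strata-level CEM ATT: mean outcome differences weighted by the
    number of Prime (reference) members per stratum, divided by the
    total number of Prime members (Table 5, footnote a)."""
    total_prime = sum(len(prime) for prime, _ in strata)
    att = 0.0
    for prime, other in strata:
        diff = sum(other) / len(other) - sum(prime) / len(prime)
        att += diff * len(prime) / total_prime
    return att

# Hypothetical matched strata of IWBS changes: (Prime members, other cohort).
strata = [
    ([1.0, 2.0], [1.5, 1.7]),    # diff = 1.6 - 1.5 = 0.1, 2 Prime members
    ([0.0, 1.0, 2.0], [1.2]),    # diff = 1.2 - 1.0 = 0.2, 3 Prime members
]
print(round(cem_att(strata), 2))  # 0.1 * 2/5 + 0.2 * 3/5 = 0.16
```

Because the weights are proportional to Prime membership per stratum, the estimate is anchored to the treated population, which is what makes it an ATT rather than an average effect over all matched members.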
Interestingly, despite the higher level of clinical and well-being risk among the Prime cohort, which could be indicative of lower well-being in the future, the Prime and comparison cohorts were found to have statistically similar well-being change on a marginal, or coefficient, basis. However, when the change in well-being was evaluated with all modeled covariates at their mean values, the results showed a lower level of well-being change in the comparison cohort. The conclusion, then, is that the cohort coefficients are necessary but not sufficient, and the calculated predicted values, which account for the full array of explanatory variables, are the most robust estimates of the program causal effect. Utilizing the predicted values to assess program effect did not diminish the insights the cohort coefficient contributed to this study. The Mild Intensity cohort coefficient indicated that members within this cohort reported greater well-being improvement over time relative to their matched Prime cohort peers. This result was statistically significant and observed in both the pre-regression CEM ATT and the descriptive tables. Given the less severe clinical and well-being risk profile among Mild Intensity members, both for individual risks and in the aggregate, it was expected that a well-being improvement program would have a greater incremental effect at reducing risk compared to the Prime members. However, this incremental effect, as reflected by the cohort coefficient, represented only a marginal contribution of the program. Other explanatory variables, such as the first triggered variable, provided additional insights into program efficacy and showed the value of accounting for multiple attributes simultaneously when evaluating the program.
As stated in the results section, the first triggered variable captured the year when the first direct outreach for program participation occurred; the reference level comprised members who opted into the program prior to receiving a direct outreach. We anticipated that the self-motivated reference group members would have a statistically significant
greater increase in IWBS change compared to others. The results confirmed this hypothesis, but an additional and fascinating insight emerged from the result: longer exposure to the program yielded greater improvement in well-being. Members who had their first outreach in 2012 showed significantly lower IWBS improvement relative to the opt-in members. Moreover, members who had an additional year of exposure outperformed the 2012 members by approximately 63 %, while the 2010 members outperformed the 2012 subset by 102 %. Collectively, these results yielded an important distinction for program evaluation: program intensity was necessary to account for when assessing the outcome, but temporal program exposure was of equal importance.

In addition to program exposure, the baseline level of member well-being risk was necessary to account for in the regression equation. Examining the coefficients attributed to the base year (2010) well-being factors, one could conclude the directionality is incorrect despite statistical significance. Greater values of the biometric, health inactivity, and physical problems factors are interpreted as riskier or unhealthier behaviors. Members who were riskier in the base year had more incentive to improve their condition, hence the significant positive coefficients. Furthermore, members who were already less risky and healthier at the outset could not mathematically improve much more, given the scale of the WBA questions. Thus, the coefficients captured the riskier members who improved during the study while the healthier members maintained their high levels of well-being.

The predicted values allowed for a more complete view of program efficacy compared to the sole contribution of program intensity. As described in the results, the predicted values of the estimated regression equation with respect to the Prime and Mild Intensity cohorts overlapped within a 95 % confidence interval, while the comparison cohort did not.
With this observation, we conclude that there was no statistical difference by intensity of the program following matching and weighted generalized linear estimation. It is important to note that prior to matching, relative bias was detected between the Prime and Mild Intensity cohorts (L1 equal to 0.37). With the matching weights applied to the Mild Intensity cohort, the intensity effect approached zero. Despite this result, active participation in the program, regardless of intensity, produced significantly greater improvement in the IWBS compared to no program participation. A similar finding was reported in a study examining the efficacy of a disease management program administered in Germany (Hamar et al. 2010). A well-structured program should be tailored to the risks and characteristics of the members at varying intensity levels such that the treatment effect of the program is positive and statistically similar for all intervened members. This conclusion was supported by the pre-regression CEM ATT. While the pre-regression CEM ATT indicated the Mild Intensity cohort improved in IWBS by 0.2 points more than the Prime cohort, the difference fell within the regression-based confidence intervals of the respective cohorts.

5.3 Comparisons of multiple cohort and binary cohort CEM

A secondary purpose of this study was to compare applications of CEM based on the assignment of members to two versus three cohorts. To ensure a robust comparison, the sampling rules, stratification logic, and regression models were consistent between both applications. Any differences, therefore, can be attributed to the expansion of the CEM methodology into multiple cohorts. Examination of the matching results indicates that the binary cohort method on average removed fewer members than the multiple cohort method (835.6 and 2,587.3, respectively). However, the multiple cohort method removed more members in an attempt
to mitigate a higher level of detectable bias than was found in the binary cohort method. By combining the High and Mild Intensity intervention cohorts into one Prime cohort, the binary cohort method ignored the intra-intervention bias. With respect to bias between the intervention and comparison cohorts, the binary cohort method proved more efficient at mitigating the detectable bias. The inclusion of more than 1,700 additional members, post-match, in the binary cohort method altered the predicted value of well-being improvement generated within the weighted GLM. Both the Prime and Mild Intensity cohorts showed a nearly identical change in well-being improvement, though the change was toward higher well-being in the Prime cohort and lower well-being in the Mild cohort. The Comparison cohort evidenced nearly a 63 % increase in well-being improvement relative to the change computed within the multiple cohort method. Accordingly, on a difference-in-difference basis, the increased well-being improvement of the Comparison cohort results in a lower estimate of program effectiveness. If value is defined as the difference in well-being improvement of an enrolled cohort compared to a non-enrolled cohort, then the redistribution of Mild Intensity and High Intensity members into a single cohort dilutes the program effect within the binary cohort method.

As is standard in peer-reviewed research, we caveat these findings to the particular population, time period, and program under investigation; other applications under differing conditions may yield different findings. Moreover, it is important to note that by invoking quasi-experimental methodologies such as CEM, one has already assumed a willingness to accept some level of unmeasured and unknown sources of bias underlying individual decisions to be in the treated or comparison cohorts.
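The dilution argument can be checked arithmetically from the predicted values in Table 7: on a difference-in-difference basis, the binary cohort method attributes a smaller effect to the most intensively treated members than the multiple cohort method does. A small illustration (cohort labels shortened):

```python
# Predicted IWBS changes taken from Table 7.
multiple_cohort = {"Prime": 1.48, "Mild": 1.32, "Comparison": 0.57}
binary_cohort = {"Prime": 1.61, "Mild": 1.18, "Comparison": 0.91}

def program_effect(predicted, cohort):
    # Difference-in-difference: enrolled-cohort improvement minus
    # comparison-cohort improvement.
    return round(predicted[cohort] - predicted["Comparison"], 2)

print(program_effect(multiple_cohort, "Prime"))  # 0.91
print(program_effect(binary_cohort, "Prime"))    # 0.7, the diluted estimate
```

The same comparison for the Mild Intensity cohort (0.75 versus 0.27) shows the dilution even more strongly, since the binary method's higher comparison-cohort improvement is subtracted in both cases.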
6 Conclusion

This research sought to extend the traditional binary cohort assignment in quasi-experimental program evaluation in order to quantify the differential effects of a multi-tiered well-being improvement program administered over a 3 year period in a large employer. The methodology reported here provides an expanded and robust approach to matching on different cohorts for the purpose of program evaluation that could not be achieved with a binary matching methodology. Furthermore, the methodology incorporates an advanced econometric model to further mitigate bias and unexplained variability that could not be accounted for in the matching process. Current and future applications of the methodology presented here will hopefully empower providers and purchasers of well-being improvement solutions to more accurately ascertain the quantitative and monetary relationship between program participation, intensity of participation, and duration of participation with improved well-being and future expectations of health care costs.

Acknowledgments We thank Gary King (Harvard University) and Patrick Lam (Harvard University) for valuable comments and guidance on the coarsened exact matching methodology.
References

Allen-Ramey, F.C., Doung, P.T., Goodman, D.C., Saijan, S.G., Nelsen, L.M., Santanello, N.C., Markson, L.E.: Treatment effectiveness of inhaled corticosteroids and leukotriene modifiers for patients with asthma: an analysis from managed care data. Allergy Asthma Proc. 24(1), 43–51 (2003)
Bound, J., Brown, C., Mathiowetz, N.: Measurement error in survey data. Population Studies Center, University of Michigan. http://www.psc.isr.umich.edu/pubs/pdf/rr00-450.pdf (2000)
Brandt, S., Gale, S., Tager, I.B.: Estimated effect of asthma case management using propensity score methods. Am. J. Manag. Care 16(4), 257–264 (2010)
Cortes, C., Mohri, M., Riley, M., Rostamizadeh, A.: Sample selection bias correction theory. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) Algorithmic Learning Theory, vol. 5254, pp. 38–53. Springer, Berlin. http://www.springerlink.com/index/10.1007/978-3-540-87987-9_8 (2008)
Evers, K.E., Prochaska, J.O., Castle, P.H., Johnson, J.L., Prochaska, J.M., Harrison, P.L., Rula, E.Y., Coberley, C., Pope, J.E.: Development of an individual well-being scores assessment. Psychol. Well-Being: Theory Res. Pract. 2(2), 1–9 (2012). doi:10.1186/2211-1522-2-2
Gandy, W.M., Coberley, C., Pope, J.E., Rula, E.Y.: Well-being and employee health—how employees' well-being scores interact with demographic factors to influence risk of hospitalization or an emergency room visit. Popul. Health Manag. (2013). doi:10.1089/pop.2012.0120
Gandy, W.M., Coberley, C., Pope, J.E., Wells, A., Rula, E.Y.: Comparing the contributions of well-being and disease status to employee productivity. J. Occup. Environ. Med. 56(3), 252–257 (2014). doi:10.1097/JOM.0000000000000109
Goetzel, R.Z., Roemer, E.C., Liss-Levinson, R.C., Samoly, D.K.: Workplace Health Promotion: Policy Recommendations that Encourage Employers to Support Health Improvement Programs for their Workers. Partnership for Prevention, Washington, DC (2008)
Hade, E.M.: Propensity score adjustment in multiple group observational studies: comparing matching and alternative methods. Ohio State University. https://etd.ohiolink.edu/ (2012)
Hamar, B., Wells, A., Gandy, W., Haaf, A., Coberley, C., Pope, J.E., Rula, E.Y.: The impact of a proactive chronic care management program on hospital admission rates in a German health insurance society. Popul. Health Manag. 13(6), 339–345 (2010). doi:10.1089/pop.2010.0032
Harrison, P.L., Pope, J.E., Coberley, C.R., Rula, E.Y.: Evaluation of the relationship between individual well-being and future health care utilization and cost. Popul. Health Manag. (2012). doi:10.1089/pop.2011.0089
Ho, D.E., Imai, K., King, G., Stuart, E.A.: Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Anal. 15(3), 199–236 (2006). doi:10.1093/pan/mpl013
Iacus, S.M., King, G., Porro, G.: Matching for causal inference without balance checking. http://ssrn.com/abstract=1152391 (2008)
Iacus, S.M., King, G., Porro, G.: cem: software for coarsened exact matching. J. Stat. Softw. 30(9) (2009). http://www.jstatsoft.org/v30/i09
Iacus, S.M., King, G., Porro, G.: Causal inference without balance checking: coarsened exact matching. http://j.mp/iUUwyH (2011)
Jackman, S.: Generalized linear models. Stanford University. http://jackman.stanford.edu/papers/glm.pdf. Accessed 21 Nov 2013
King, G., Nielsen, R., Coberley, C., Pope, J.E., Wells, A.: Comparative effectiveness of matching methods for causal inference. http://j.mp/jCpWmk (2011a)
King, G., Nielsen, R., Coberley, C., Pope, J.E., Wells, A.: Avoiding randomization failure in program evaluation, with application to the Medicare Health Support program. Popul. Health Manag. 14(S1), S-11–S-22 (2011b). doi:10.1089/pop.2010.0074
King, G., Zeng, L.: When can history be our guide? The pitfalls of counterfactual inference. Int. Stud. Q. 51(1), 183–210 (2007). doi:10.1111/j.1468-2478.2007.00445.x
Larzelere, R.E., Kuhn, B.R., Johnson, B.: The intervention selection bias: an underrecognized confound in intervention research. Psychol. Bull. 130(2), 289–303 (2004). doi:10.1037/0033-2909.130.2.289
Mattke, S., Liu, H., Caloyeras, J.P., Huang, C.Y., Van Busum, K.R., Khodyakov, D., Shier, V.: Workplace wellness programs study. Congressional report. Health & Human Services. http://aspe.hhs.gov/hsp/13/WorkplaceWellness/rpt_wellness.cfm (2013)
McCullagh, P., Nelder, J.A.: Generalized Linear Models. Chapman and Hall, London (1989)
Merrill, R.M., Aldana, S.G., Pope, J.E., Anderson, D.R., Coberley, C.R., Grossmeier, J.J., Whitmer, R.W., HERO Research Study Subcommittee: Self-rated job performance and absenteeism according to employee engagement, health behaviors, and physical health. J. Occup. Environ. Med. 55(1), 10–18 (2013). doi:10.1097/JOM.0b013e31827b73af
Merrill, R.M., Aldana, S.G., Pope, J.E., Anderson, D.R., Coberley, C.R., Whitmer, R.W., HERO Research Study Subcommittee: Presenteeism according to healthy behaviors, physical health, and work environment. Popul. Health Manag. 15(5), 293–301 (2012). doi:10.1089/pop.2012.0003
Pawa, D., Firestone, R., Ratchasi, S., Dowling, O., Jittakoat, Y., Duke, A., Mundy, G.: Reducing HIV risk among transgender women in Thailand: a quasi-experimental evaluation of the Sisters program. PLoS One 8(10), e77113 (2013). doi:10.1371/journal.pone.0077113
Perkins, S.M., Tu, W., Underhill, M.G., Zhou, X.H., Murray, M.D.: The use of propensity scores in pharmacoepidemiologic research. Pharmacoepidemiol. Drug Saf. 9(2), 93–101 (2000)
Plesca, M., Smith, J.: Evaluating multi-treatment programs: theory and evidence from the U.S. Job Training Partnership Act experiment. In: Dustmann, C., Fitzenberger, B., Machin, S. (eds.) The Economics of Education and Training, pp. 293–330. Physica-Verlag, Heidelberg. http://www.springerlink.com/index/10.1007/978-3-7908-2022-5_13 (2008)
Prochaska, J.O., Evers, K.E., Johnson, J.L., Castle, P.H., Prochaska, J.M., Sears, L.E., Rula, E.Y., Pope, J.E.: The well-being assessment for productivity: a well-being approach to presenteeism. J. Occup. Environ. Med. 53(7), 735 (2011)
Prochaska, J.O., Evers, K.E., Castle, P.H., Johnson, J.L., Prochaska, J.M., Rula, E.Y., Coberley, C., Pope, J.E.: Enhancing multiple domains of well-being by decreasing multiple health risk behaviors: a randomized clinical trial. Popul. Health Manag. (2012). doi:10.1089/pop.2011.0060
Prochaska, J.J., Velicer, W.F., Nigg, C.R., Prochaska, J.O.: Methods of quantifying change in multiple risk factor interventions. Prev. Med. 46(3), 260–265 (2008). doi:10.1016/j.ypmed.2007.07.035
Rassen, J.A., Shelat, A.A., Franklin, J.M., Glynn, R.J., Solomon, D.H., Schneeweiss, S.: Matching by propensity score in cohort studies with three treatment groups. Epidemiology 24(3), 401–409 (2013). doi:10.1097/EDE.0b013e318289dedf
Rassen, J.A., Solomon, D.H., Glynn, R.J., Schneeweiss, S.: Simultaneously assessing intended and unintended treatment effects of multiple treatment options: a pragmatic 'matrix design' for comparative effectiveness research. Pharmacoepidemiol. Drug Saf. 20(7), 675–683 (2011). doi:10.1002/pds.2121
Rubin, D.B.: The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat. Med. 26(1), 20–36 (2007). doi:10.1002/sim.2739
Rubin, D.B.: On the limitations of comparative effectiveness research. Stat. Med. 29(19), 1991–1995 (2010). doi:10.1002/sim.3960
Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman & Hall, London (1997)
Sears, L.E., Shi, Y., Coberley, C.R., Pope, J.E.: Overall well-being as a predictor of health care, productivity, and retention outcomes in a large employer. Popul. Health Manag. (2013). doi:10.1089/pop.2012.0114
Sears, L.E., Agrawal, S., Sidney, J.A., Castle, P.H., Rula, E.Y., Coberley, C.R., Witters, D., Pope, J.E., Harter, J.K.: The well-being 5: development and validation of a diagnostic instrument to improve population well-being. Popul. Health Manag. 17, 357–365 (2014)
Shepardson, L.B., Youngner, S.J., Speroff, T., Rosenthal, G.E.: Increased risk of death in patients with do-not-resuscitate orders. Med. Care 37(8), 727–737 (1999)
Shi, Y., Lindsay, E., Coberley, C.R., Pope, J.E.: The association between modifiable well-being risks and productivity: a longitudinal study in pooled employer sample. J. Occup. Environ. Med. 55(4), 353–364 (2013). doi:10.1097/JOM.0b013e3182851923
Shi, Y., Sears, L.E., Coberley, C.R., Pope, J.E.: Classification of individual well-being scores for the determination of adverse health and productivity outcomes in employee populations. Popul. Health Manag. (2012). doi:10.1089/pop.2012.0039
Stuart, E.A., Rubin, D.B.: Matching with multiple control groups with adjustment for group differences. J. Educ. Behav. Stat. 33(3), 279–306 (2007). doi:10.3102/1076998607306078
Wang, Y., Cai, H., Li, C., Jiang, Z., Wang, L., Song, J., Xia, J.: Optimal caliper width for propensity score matching of three treatment groups: a Monte Carlo study. PLoS One 8(12), e81045 (2013). doi:10.1371/journal.pone.0081045
Wells, A.R., Hamar, B., Bradley, C., Gandy, W.M., Harrison, P.L., Sidney, J.A., Coberley, C.R., Rula, E.Y., Pope, J.E.: Exploring robust methods for evaluating treatment and comparison groups in chronic care management programs. Popul. Health Manag. (2012a). doi:10.1089/pop.2011.0104
Wells, A.R., Hamar, B., Bradley, C., Gandy, W.M., Harrison, P.L., Sidney, J.A., Coberley, C.R., Rula, E.Y., Pope, J.E.: Exploring robust methods for evaluating treatment and comparison groups in chronic care management programs. Popul. Health Manag. (2012b). doi:10.1089/pop.2011.0104
Yu, C., Legg, J., Liu, B.: Estimating multiple treatment effects using two-phase semiparametric regression estimators. Electron. J. Stat. 7, 2737–2761 (2013). doi:10.1214/13-EJS856
Yuan, Y.C.: Multiple imputation for missing values: concepts and new development—revised 2009. SAS Institute Inc. http://support.sas.com/rnd/app/papers/multipleimputation.pdf (2009)