Purpose. Data from international educational assessments conducted in many countries are mostly analyzed using item response theory. The assumption t...

22 downloads
3162 Views
585KB Size

No documents

Fitting an item response theory model with random item effects across groups by a variational approximation method Frank Rijmen · Minjeong Jeon

Published online: 19 July 2012 © Springer Science+Business Media, LLC 2012

Abstract Purpose. Data from international educational assessments conducted in many countries are mostly analyzed using item response theory. The assumption that all items behave the same in all countries is often not tenable. The variability of item parameters across countries can be taken into account by assuming that the item parameters are random effects (De Jong et al. in J. Consum. Res. 34:260–278, 2007; De Jong and Steenkamp in Psychometrika 75:3–32, 2010). However, the complex latent structure of such a model, with latent variables both at the item and the person level, renders maximum likelihood estimation computationally challenging. We describe a variational estimation technique that consists of approximating the likelihood function by a computationally tractable lower bound. Methods. A mean field approximation to the posterior distribution of the latent variables was used. The update equations were derived for the specific case of discrete random effects and implemented in a Maximization Maximization algorithm (Neal and Hinton in M.I. Jordan (Ed.) Learning in Graphical Models, Kluwer Academic, Dordrecht, pp. 355–368, 1998). Parameter recovery was investigated in a simulation study. The method was also applied to the Progress in International Reading Study of 2006. Results. The model parameters were recovered well under all conditions of the simulation study. In the application, the estimated variances of the random item effects showed a high positive correlation with traditional measures for the lack of item invariance across groups. Conclusions. The mean field approximation and variational methods in general offer a computationally tractable alternative to exact maximum likelihood estimation. Keywords Item response theory · Random item effects · Variational techniques · Mean field approximation

F. Rijmen () Educational Testing Service, Rosedale Road, Princeton, NJ 08541, USA e-mail: [email protected] M. Jeon Graduate School of Education, UC Berkeley, Berkeley, CA, USA

648

Ann Oper Res (2013) 206:647–662

1 Introduction International large-scale educational assessment surveys such as the Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), and the Progress in International Reading Study (PIRLS) are important sources of information for comparing educational outcomes across countries. These assessments can also be used to investigate whether background variables such as socioeconomic status and gender are associated with achievement levels, and to which degree these relations are similar across countries. Implementing the same assessment in many countries that exhibit a wide variation in languages, school curricula, urbanization, labor market, and many other important aspects of societal organization, poses serious challenges to the development of a common standard. In practice, the comparability of the assessment across the participating countries is safeguarded by a careful development and review of the test material, field testing, and an evaluation of the psychometric properties of the items both within and across countries. However, even after a careful development of the test material and a rigorous evaluation of its psychometric properties, the test may not operate in exactly the same way in all the participating countries. A moderate variation of item parameters can be expected, reflecting subtle differences in nuances due to item translation and adaptation, differences in school curricula, and many other aspects in which countries differ. Item response theory models with random item effects have been proposed to account for such a moderate fluctuation in item parameters across countries (De Jong et al. 2007, 2008; De Jong and Steenkamp 2010). In these models, the item parameters of a specific country are assumed to be realizations of random variables that are defined over the population of countries. The associated variance is an indication of how much variation in item parameters exists across countries. Before presenting the model, we emphasize upfront that the incorporation of random item effects in an IRT model can never replace a careful test development stage nor a careful investigation of psychometric properties. Insofar as an item functions completely differently in one or several countries, it is unrealistic to assume a common distribution of item parameters for all countries.

2 The random item effect IRT model 2.1 Model formulation A random item effect IRT model can be formulated as follows (De Jong et al. 2007, 2008; De Jong and Steenkamp 2010). Let yi(k)j denote the binary scored response of test taker i from country k on the j th item of the assessment, where i = 1, . . . , nk , k = 1, . . . , K, and j = 1, . . . , J . The link function g(·) relates the conditional probability of a correct answer given a person’s ability θi(k) to a function of country-specific item parameters αkj and βkj , g(πi(k)j ) = αkj θi(k) + βkj ,

(1)

where πi(k)j = Pr(yi(k)j = 1|θi(k) , αkj , βkj ). Typical choices for g(·) are the logit and probit link functions (Fahrmeir and Tutz 2001). The item response model of Eq. (1) is a so-called two parameter model. In a one parameter model, only an item location parameter βkj is included for each item, whereas a three parameter model incorporates an additional parameter for each item to account for guessing (Lord and Novick 1968). The ability θi(k) is assumed to be a random variable, θi(k) ∼ p(θi(k) ; γ k ), where γ k is the parameter vector that characterizes the distribution of θi(k) in country k. In addition, the random item effect IRT model also

Ann Oper Res (2013) 206:647–662

649

treats the country-specific item parameters as random variables, with a distribution over the population of countries. Let δ kj denote the vector of item parameters for item j in country k. It is assumed that each δ kj is a random sample from a multivariate distribution p(δ kj ; ξ j ) that is characterized by an item-specific parameter vector ξ j . Country-specific item parameters pertaining to different items are assumed to be statistically independent. IRT models with random item effects take an intermediate position between models in which parameters are assumed to be constant across countries and models in which item parameters are estimated separately for each country as fixed effects. Specifically, the random item effect IRT model allows for country-specific item effects, but rather then incorporating separate item parameters for each country, these effects are treated as random variables with a common distribution defined over the population of countries. The variance of an item parameter indicates how much variation there is across countries with respect to that item. If the variance is zero, the item effect is constant across countries. If the variance is zero for all items, a traditional IRT model with no random item effects is obtained. 2.2 Computational challenge The joint probability of all observed responses and random variables for country k is Pr(yi(k)j |θi(k) , δ kj ) p(δ kj ; ξ j ) p(θi(k) ; γ k ) , (2) Pr(yk , θ k , δ k ; ψ) = i(k) j

j

i(k)

where yk is the concatenated vector consisting of all observed responses for country k; θ k and δ k are the vectors containing all random person and random item parameters for country k, respectively; and all (fixed) model parameters are collected into the vector ψ = (ξ 1 , . . . , ξ J , γ 1 , . . . , γ K ) . The likelihood function of the random item effect IRT model is obtained by integrating Eq. (2) over the random variables for each country, and taking the product over countries, ··· ··· Pr(yk , θ k , δ k ; ψ) dθnk (k) . . . dθ1(k) dδ kJ . . . dδ k1 , L(y; ψ) = k

δ k1

δ kJ

θ1(k)

θnk (k)

(3) where y is the concatenated vector consisting of all observed responses. In contrast to linear models, the multiple integral in Eq. (3) has no closed form solution, and numerical integration is called for. A high-dimensional latent structure as such does not necessarily imply that numerical integration over the latent variables is computationally prohibitive. Often, the conditional independence relations implied by the model assumptions can be exploited such that numerical integration over a high-dimensional latent space can be carried out as a sequence of integrations in lower-dimensional subspaces of latent variables. Graphical model theory turns out to be quite useful in this respect because there is a correspondence between the conditional independence relations of a statistical model and properties of the graph associated with the model. For the use of graphical models in the context of psychometric models, the reader is referred to Humphreys and Titterington (2003), Rijmen et al. (2008), and Rijmen (2009, 2010). For the random item effect IRT model, each subspace consists of all random item effects plus one random person effect. As a consequence, maximum likelihood estimation for a random item effect IRT model is computationally very intensive. As an alternative to maximum likelihood estimation, De Jong et al. (2007) and De Jong and Steenkamp (2010) proposed a Bayesian framework and relied on Monte Carlo techniques. However, the computation time remained high (about 24 hours for a single dataset

650

Ann Oper Res (2013) 206:647–662

in De Jong and Steenkamp 2010). Therefore, it is worthwhile to investigate alternative estimation methods. These methods can either serve as alternatives to the existing Bayesian methods, or provide initial values to shorten computation time for the Bayesian methods.

3 Variational expectation maximization algorithm 3.1 Traditional EM algorithm The EM algorithm (Dempster et al. 1977) is a powerful algorithm to obtain parameter estimates for models that contain latent variables. The algorithm alternates between two steps until convergence. In the Expectation-step (E-step) of iteration m, the expectation of the complete data log likelihood is computed. The complete data consist of both the observed data y and the latent variables or otherwise missing data z. The expectation is taken with respect to the posterior distribution of the missing data z, given the observed data y, and parameter estimates ψ (m−1) obtained from the previous iteration,

Q ψ; ψ (m−1) = Eψ (m−1) log Lc (y, z; ψ) y , (4) where Lc (y, z; ψ) is the complete data likelihood. In the Maximization-step (M-step), Q(ψ; ψ (m−1) ) is maximized with respect to the parameters ψ, leading to new parameter estimates ψ (m) . 3.2 An alternative look at the EM algorithm Neal and Hinton (1998) provided an alternative interpretation of the EM algorithm. They showed that the E-step can be conceptualized as a maximization step as well. Relying on Jensen’s inequality, a lower bound l(y; ψ) to the log likelihood l(y; ψ) can be obtained as l(y; ψ) ≡ log p(y, z; ψ) dz z p(y, z; ψ) dz = log f (z; μ) f (z; μ) z p(y, z; ψ) = log Ef f (z; μ) p(y, z; ψ) ≥ Ef log f (z; μ) p(y, z; ψ) (5) dz ≡ l(y; ψ, μ), = f (z; μ) log f (z; μ) z where Ef denotes that the expectation is taken over the missing data z with distribution f (z; μ) characterized by parameter vector μ. It is important to realize that f (z; μ) is fundamentally different from p(z; ψ) and p(z|y; ψ), which are the model-based prior and posterior distributions of the latent variables z. In contrast, f (z; μ) is called a variational distribution. It is not a part of the statistical model, and characterized by a vector of parameters μ that are independent of the model parameters ψ. If the unobserved variables z are discrete variables, the integral in Eq. (5) becomes a summation. Since continuous latent variables are approximated by discrete latent variables later, we assume that the latent variables z are discrete from now. Thus, f (z; μ) now denotes

Ann Oper Res (2013) 206:647–662

651

a discrete probability distribution, and the discrete counterparts of p(z; ψ) and p(z|y; ψ) are denoted by Pr(z; ψ) and Pr(z|y; ψ). The difference between l(y; ψ) and l(y; ψ, μ) amounts to

f (z; μ) KL f (z; μ) Pr(z|y; ψ) = , (6) f (z; μ) log Pr(z|y; ψ) z which is the Kullback-Leibler divergence between the probability distributions f (z; μ) and Pr(z|y; ψ) (Kullback and Leibler 1951). The Kullback-Leibler divergence is always nonnegative. It equals zero when f (z; μ) = Pr(z|y; ψ). Hence,

(7) l(y; ψ) = l(y; ψ, μ) + KL f (z; μ) Pr(z|y; ψ) . The EM algorithm can now be generalized by framing it as a Maximization Maximization algorithm (MM algorithm). The first maximization step replaces the E-step of the traditional EM algorithm. In iteration m, instead of computing Pr(z|y; ψ (m−1) ), the posterior probabilities of the missing data given the observed data and provisional parameter estimates ψ (m−1) , l(y; ψ, μ) is maximized with respect to the parameters μ characterizing f (z; μ), with the model parameters kept fixed at their current values ψ (m−1) . In the second M-step, l(y; ψ, μ) is maximized with respect to the model parameters ψ , and the parameters characterizing f (z; μ) are kept at their updated values μ(m) . The EM algorithm is a special case of the MM algorithm (Neal and Hinton 1998). If the distribution of f (z; μ) is left completely unspecified (except for z f (z; μ) = 1, f (z; μ) being a probability distribution), the lower bound is optimized in the first M-step by letting f (z; μ) be Pr(z|y; ψ (m−1) ). The step therefore reduces to the E-step of the traditional EM algorithm. The second M-step is equivalent to the M-step of the EM algorithm as well when f (z; μ) is left completely unspecified, since the lower bound l(y; ψ, μ) with f (z; μ) equal to Pr(z|y; ψ (m−1) ) becomes Pr(y, z; ψ) . (8) Pr z y; ψ (m−1) log l(y; ψ, μ) ≡ Pr(z|y; ψ (m−1) ) z The denominator in Eq. (8) is not a function of ψ and hence the maximum of the lower bound corresponds to the maximum of the expectation of the complete data log likelihood Q(ψ; ψ (m−1) ) as defined in Eq. (4). The quality of the MM-algorithm will evidently depend on the choice of f (z; μ). Ideally, f (z; μ) is chosen such that it resembles the true posterior distribution of the missing data Pr(z|y; ψ) while keeping the first M-step computationally tractable. How to choose f (z; μ) is still an open area of research. In this regard, Humphreys and Titterington (2003) present a good overview of the literature and suggest some alternatives. In this paper, whose primary objective is to introduce the variational method for the random item effect IRT model, a very simple approximation of the true posterior is chosen. 3.3 Mean-field approximation for the random item effect IRT model In the standard mean-field approximation (Peterson and Anderson 1987), the posterior of the latent variables is approximated by the independence model, where the latent variables are assumed to be independent given the observed data, fv (zv ; μ). (9) f (z; μ) = v

In the context of the random item effect IRT model, the complete data consist of the observed responses y, the vector of unobserved (missing) random person effects

652

Ann Oper Res (2013) 206:647–662

θ = (θ 1 , . . . , θ k , . . . , θ K ) , and the vector of unobserved random item parameters δ = (δ 1 , . . . , δ k , . . . , δ K ) . Hence, z = (θ , δ ) . Assuming conditional independence of the random item parameters across countries, the complete data likelihood of the model is then Pr(yi(k)j |θi(k) , δ kj ) Pr(δ kj ; ξ j ) Pr(θi(k) ; γ k ) , (10) Lc (y, z; ψ) = k

i(k) j

j

i(k)

where θi(k) and δ kj are now explicitly assumed to be discrete, as opposed to Eq. (3). In the E-step of iteration m, the expectation of the complete data log likelihood is computed, given the observed data y,

Q ψ; ψ (m−1) = Eψ (m−1) log Lc (y, z; ψ) y

= Eψ (m−1) log Pr(θi(k) ; γ k ) yk k

+ +

i(k)

k

i(k)

k

j

Eψ (m−1) log Pr(yi(k)j |θi(k) , δ kj ) yk

j

Eψ (m−1) log Pr(δ kj ; ξ j ) yk .

(11)

Due to the country-specific random item effects, the data from test takers in the same country are not independent. These within-country dependencies lead to an insurmountable computational burden when evaluating the expectations in Eq. (11). In the MM algorithm, the exact posterior distribution Pr(z|y; ψ) is approximated by a variational distribution f (z; μ). Since the latent variables pertaining to different countries are independent, Pr(z|y; ψ) = k Pr(zk |yk ; ψ), the mean-field approximation of Eq. (9) can be applied separately for each country as Pr(zk |yk ; ψ) = Pr(δ k , θ k |yk ; ψ) ≈ f (δ kj ; μδkj ) f (θi(k) ; μθ

i(k)

j

) ,

(12)

i(k)

where μδkj and μθ are the parameters that characterize the variational probability distrii(k) butions for the latent item and latent person variables δ kj and θi(k) , respectively. 3.4 The MM algorithm for the random item effect IRT model Because the major difference between the MM and the EM algorithm resides in the first step, only the first M-step is discussed in detail in this section. Let d = 1, . . . , D denote the categories for the latent item variables δ kj , and let t = 1, . . . , T denote the categories of the latent person variables θi(k) . A multinomial distribution is assumed for each θi(k) and for each δ kj , with mean vectors μθi(k) = (μ1θi(k) , . . . , μTθi(k) ) and μδkj = (μ1δkj , . . . , μD δ kj ) , respectively. The usual constraints for the multinomial distribution T d that t=1 μtθi(k) = 1 and D d=1 μδ kj = 1 apply, and can be satisfied, for example, by letting D−1 d T −1 t T D μθi(k) = 1 − t=1 μθi(k) and μδkj = 1 − d=1 μδkj . The mean vectors μδkj and μθi(k) are the vectors of variational parameters towards which the variational lower-bound l(y; ψ, μ) to the log likelihood function l(y; ψ) is maximized in the first M-step of the MM algorithm. This comes down to solving, for all i and j within each country k, Pr(yk , zk ; ψ) ∂ Ef log = 0, (13a) ∂μθi(k) f (zk ; μ)

Ann Oper Res (2013) 206:647–662

and

653

∂ Pr(yk , zk ; ψ) Ef log = 0. ∂μδkj f (zk ; μ)

(13b)

Now, let Ak = Ef [log Pr(yk , zk ; ψ)] and Bk = Ef [log f ((zk ; μ))], which are defined for the random item effect IRT model as log Pr(θi(k) ; γ k ) + log Pr(yi(k)j |θi(k) , δ kj ) + log Pr(δ kj ; ξ j ) Ak = E f i(k)

=

+

j

= t; γ k )

t

i(k)

+

j

i(k)

μtθi (k) log Pr(θi(k)

i(k)

j

j

d

t

μtθi (k) μdδkj log Pr(yi(k)j |θi(k) = t, δ kj = d)

d

μdδkj log Pr(δ kj = d; ξ j ),

and Bk = Ef

(14a)

log f (θi(k) ; μθi(k) ) +

=

i(k)

log f (δ kj ; μδkj )

j

i(k)

μtθi(k)

log μtθi(k)

+

t

j

μdδkj log μdδkj .

(14b)

d

The partial derivatives of Ak and Bk with respect to μtθi(k) , t = 1, . . . , T −1, are, respectively, ∂Ak = log Pr(θi(k) = t; γ k ) + μdδkj log Pr(yi(k)j |θi(k) = t, δ kj = d) t ∂μθi(k) j d d − log Pr(θi(k) = T ; γ k ) + μδkj log Pr(yi(k)j |θi(k) = T , δ kj = d) , j

d

(15a) and ∂Bk = log μtθi(k) − log μTθi(k) . ∂μtθi(k)

(15b)

T −1 t μθi(k) . The second terms stem from the restriction μTθi(k) = 1 − t=1 Setting the partial derivatives of l(y; ψ, μ) with respect to μtθi(k) , t = 1, . . . , T −1, i(k) = 1, . . . , nk , and k = 1, . . . , K, to zero, the following sets of T − 1 equations are obtained, log

μtθi(k) μTθi(k)

where ηθt i(k) = log Pr(θi(k) = t; γ k ) +

= ηθt i(k) − ηθTi(k) ,

j

d

μdδkj log Pr(yi(k)j |θi(k) = t, δ kj = d).

(16)

654

Ann Oper Res (2013) 206:647–662

The left-hand side of Eq. (16) corresponds to the baseline-category logits function with category T as baseline category. The solutions for μtθi(k) are found by applying its inverse, exp(ηθt i(k) ) μtθi(k) = T . t t=1 exp(ηθi(k) )

(17)

Analogously to Eqs. (14a), (14b) to (17), the solutions for μdδkj , d = 1, . . . , D−1; k = 1, . . . , K; j = 1, . . . , J , can be obtained, log where ηδdkj = log Pr(δ kj = d; ξ j ) +

μdδkj μD δ kj

= ηδdkj − ηδDkj ,

i(k)

(18)

μtθi(k) log Pr(yi(k)j |θi(k) = t, δ kj = d),

t

and thus exp(ηδdkj ) μdδkj = D . d d=1 exp(ηδ kj )

(19)

Because the variational parameters μθi(k) of the latent person variables θi(k) appear in the solutions for the variational parameters μδkj for the latent item parameters δ kj , and vice versa, a closed form solution is not readily obtained. A conditional optimization scheme can be implemented instead for Eqs. (17) and (19). In the second M-step of the MM algorithm, the lower-bound l(y; ψ, μ) is maximized with respect to the model parameters ψ . The model parameters do not appear in Bk , as defined in Eq. (14b), so that this second M-step reduces to solving ∂ Ak = 0, (20) ∂ψ k with Ak defined in Eq. (14a). This step is analogous to the M-step of the traditional EM algorithm, and therefore is not discussed in further detail. It can be carried out using a Newton-Raphson algorithm for those parameters for which no closed form solution exists. Particularly useful in this regard is the volume by Fahrmeir and Tutz (2001), who present a very clear and complete overview of generalized linear models with random effects and extensions thereof, including an extensive discussion of parameter estimation. As most other IRT models, random item effect IRT models can be framed as generalized linear models or extensions thereof (Rijmen et al. 2003).

4 Simulation study The variational method with a mean-field approximation outlined in the previous section was evaluated in a simulation study. The one parameter model for binary outcomes was used as the IRT model, so there was only one random effect for each item, δ kj = βkj . Data were generated from a model with discrete random effects. Several factors were manipulated orthogonally. For each resulting condition, 100 datasets were generated. Each dataset contained the simulated item responses of 500 persons to 20 items for each of a number of countries. In order to evaluate the bias of the variational method, all datasets were analyzed under the true model.

Ann Oper Res (2013) 206:647–662

655

4.1 Design The first factor that was manipulated was the number of countries: 10 versus 20. The number of locations (or classes) for each of the random item effects was the second factor in the simulation design. For the condition with two locations, the proportion of countries in each class was 0.5 for every item, regardless of the number of countries. For the condition with three classes, the proportions amounted to 0.4, 0.3, and 0.3 for every item in the conditions with 10 countries, and to 0.35, 0.35, and 0.3 for every item in the condition with 20 countries. The third factor in the design was the difference between the item locations of the random effect associated with each item: 0.5, 1, or 1.5. Specifically, first 20 baseline item locations were generated from −2 to 2, with intervals of 0.2, and excluding an item location of zero. These baseline item locations were the item locations for the first class of each of the 20 random item effects. Then, the item locations of the second class of each of the random item effects were defined by adding 0.5 (or respectively 1 or 1.5, depending on the condition) to the baseline item locations of every odd item, and subtracting 0.5 (or 1 and 1.5, respectively) to the baseline item locations of every even item. For the condition with three classes for the random item effects, the random item effects of the third classes were defined by subtracting 0.5 (or respectively 1 or 1.5, depending on the condition) to the baseline item locations of every odd item, and adding 0.5 (or 1 and 1.5, respectively) to the baseline item locations of every even item. Under all conditions, datasets were generated with four levels for the person random effects, with locations and weights of (−2, −1, 1, 2) and (0.3, 0.2, 0.2, 0.3) for every country, respectively. In the models that were fitted to the data, an item random effect was included for each item. The locations of the random person effects were assumed to be constant across countries, but their weights were estimated separately for each country. In order to identify the model, the location of the first class of the random person effects was set to zero. Thus, for example, for the condition with 10 countries and three item locations for each random item effect, the number of estimated parameters amounted to 133: three locations for the person random effect, 30 weight parameters for the random person effects (three weight parameters for each of 10 countries), 40 weight parameters for the random item effects (two weight parameters for each of 20 items), and 60 item locations (three item locations for each of 20 items). The convergence criterion that was used was a maximum of absolute differences between parameter estimate smaller than 0.0001. Depending on the condition, the estimation of the parameters for a single dataset took from about 15 minutes (10 countries, two latent classes per item) to about 40 minutes (20 countries, three latent classes per item). 4.2 Results For models with discrete random effects, equivalent solutions can be obtained by a permutation of the classes. As a consequence, for some datasets the first class of the random person effect consisted of the best performing subgroup of persons, whereas for other datasets the first person class corresponded to a lower performing subgroup. A similar phenomenon was observed for the random item effects. This is illustrated in Fig. 1. In Fig. 1a, the estimated item locations for the first item are presented for the condition with 10 countries, a difference of 1 between the item locations of the random item effects, and three item locations for each random effect. Estimates pertaining to the same dataset are connected. Many different patterns of lines can be distinguished, due to a permutation of the classes of both the person and item random effect over datasets. In order to cancel the effect of the permutation

656

Ann Oper Res (2013) 206:647–662

Fig. 1 Location estimates for the random effect of the first item (10 countries, 3 classes for each random item effect, difference of 1 between item classes). Estimates obtained from the same dataset are connected. (a) original solutions, (b) permutation of the classes of the person random effect undone, (c) permutation of the classes of both person and item random effect undone

of the classes of the person random effect, the mean of the locations of the person random effect was subtracted from all item locations. The transformed item locations are shown in Fig. 1b. Figure 1b shows six bundles of lines, corresponding to the six possible permutations of the three classes of the random item effect. The fact that six different bundles can be distinguished means that the different classes of the item random effect are well separated, and hence that classes can be unambiguously relabeled. The item locations for the relabeled classes are shown in Fig. 1c. The classes for the first item random effect now are labeled in the same way across all 100 simulated datasets as is apparent from the appearance of a single bundle of item location estimates. A separation of the different classes of the item locations was observed as well for the other conditions of the simulation study, and was also observed for the person random effects. For all datasets, the classes of both the person and item random effects were relabeled such that the classes were ordered in the same way as the classes of the data generating models. Figures 2 to 5 present the means of the estimates for the different sets of model parameters. The results for the locations of the person random effects are presented in Fig. 2. In each panel, the average of the estimated location parameters over all 100 datasets is plotted against the true parameter values. For all datasets, the mean value of the locations of the person random effect was subtracted from all estimates to align them with the generating values. All averages are on the dashed 45° line, indicating that there was no bias in the estimation of the location parameters of the person random effects. The results for the weights of the person random effects are presented in Fig. 3. In each panel, the average of the estimated weights over all 100 datasets is plotted first for the first class for all K countries, then for the second class for all countries, and so on. All means are very close to the dashed lines representing the true values, indicating that there was no bias in the estimation of the class weights of the person random effects. Figure 4 presents the results for the weights of the item random effects. In each panel, the means are ordered by item: first all estimated weights for the first item, then all weights for the second item, and so on. All means are very close to the dashed lines representing the

Ann Oper Res (2013) 206:647–662

657

Fig. 2 Means of the estimated locations for the person random effects versus the true locations. dif is the difference between the item locations of the latent classes for each item. D is the number of latent classes for each of the item random effects. K is the number of countries

true values of the weights, indicating that there was no bias in the estimation of the class weights of the item random effects. Finally, Fig. 5 presents the results for the item locations of the item random effects. In each panel, the means of the estimated item locations are ordered by item: first all estimated locations for the first item, then all locations for the second item, and so on. All means are very close to the dashed lines representing the true values of the item locations, indicating that there was no bias in the estimation of the item locations of the item random effects. To conclude, all parameters were recovered well under a variety of conditions.

5 Application: progress in international reading study PIRLS is an internationally comparative reading assessment that is carried out every five years since its inception in 2001. The study measures trends in children’s reading literacy achievement (http://timss.bc.edu/pirls2006/about.html). In 2006, there were 40 participating countries, totaling more than 200,000 participants. For Belgium and Canada, separate samples were included for different regions or provinces, resulting in a total of 45 groups (countries and regions or provinces). The 2006 assessment consisted of ten reading materials, each of which was accompanied by a set of 11 to 14 items, with a total of 125 items.

658

Ann Oper Res (2013) 206:647–662

Fig. 3 Means of the estimated weights for the person random effects. dif is the difference between the item locations of the latent classes for each item. D is the number of latent classes for each of the item random effects. K is the number of countries

Each participant received a single booklet consisting of two reading materials. For the application, we used the data for the first booklet, consisting of 26 items. The number of participants in each of the 45 groups ranged from 221 to 982, with a total of 14,392 participants. Responses were dichotomized (0 vs. 1 or higher), and omitted and not reached items were scored as incorrect (for a discussion of different treatments of omitted and not reached items in large-scale educational surveys, see for example Rose et al. 2010). A model with four discrete categories for the person latent variable and two categories for each item latent variable was fitted to the data. To reduce the number of parameters somewhat, the weights of the person latent variable were assumed to be the same across groups. Instead, a main effect was included for each group to capture overall differences in reading proficiency across countries. The main effect for the first group was set to zero to identify the model. The model contained 128 parameters: three location and three weight parameters for the person latent variable, 44 group main effects, 26 weight parameters for the item latent variables, and 52 location parameters for the item latent variables. For each item, the model-based standard deviation of the random effect was calculated from the estimated item parameters. The standard deviations ranged from 0.26 to 0.59. The highest standard deviation was observed for the fifth item of the second test material. Without having access to the test questions, it is hard to assess which factors contributed to the lack of item invariance for this item. Inspection of the posterior probabilities did not reveal a correspondence between class membership of countries for this item and overall performance on the PIRLS assessment.

Ann Oper Res (2013) 206:647–662

659

Fig. 4 Means of the estimated weights for the item random effects. dif is the difference between the item locations of the latent classes for each item. D is the number of latent classes for each of the item random effects. K is the number of countries

In addition, the lack of item invariance across countries was assessed with the logistic regression approach (Swaminathan and Rogers 1990). Under this approach, two models were fitted: a logistic regression model in which the scores on an item are predicted by the total score, and a model in which the group effects and interaction effects between groups and total score are included as well. Under the null hypotheses that there is item invariance, all the group effects and interaction terms are zero. Hence, the likelihood-ratio statistic comparing the two models can be used to assess the lack of item invariance. The likelihood-ratio statistic was computed for each item in turn. The correlation between the likelihood-ratio statistic and the standard deviation of the random item effect computed across items amounted to 0.86 (p < 0.001). Hence, the model-based standard deviations can be considered to give a good indication of the degree of (or lack of) item invariance across groups. Note that the logistic regression procedure uses the total score as an observed covariate. Therefore, unlike the random item effects approach, it cannot be used for incomplete data collection designs which are most commonly used in large-scale educational assessments.

6 Discussion Large-scale international assessments face the challenge of ensuring that the scale has the same meaning across countries. A threat to the comparability across different countries is the lack of invariance of item parameters.

660

Ann Oper Res (2013) 206:647–662

Fig. 5 Means of the estimated locations for the item random effects. dif is the difference between the item locations of the latent classes for each item. D is the number of latent classes for each of the item random effects. K is the number of countries

Even with the high standards currently in place for item development and notwithstanding rigorous psychometric analyses, it may be more realistic to allow for some degree of variation in item parameters across countries. The random item effect IRT model (De Jong et al. 2007; De Jong and Steenkamp 2010) is such a model. It allows for variation in item parameters by treating item parameters as random variables with a distribution defined over the population of countries. The variance of the random item parameters across countries gives an indication of the degree to which item parameters are lacking item invariance across countries. Such a model may be especially useful in the present context, since careful item development and translation of test material add to the plausibility that assuming a common distribution across countries for item parameters is a reasonable approach. Due to the nonlinearity of the random item effect IRT model and the crossed structure of its random effects, with random effects both at the item and at the person side, maximum likelihood estimation involves numerical integration over high dimensional spaces, resulting in a computational burden when more than a few items are used. Sampling based techniques such as Monte Carlo EM or Markov Chain Monte Carlo methods also can become computationally very intensive with increasing dimensionality in this case. A variational method was presented as an alternative estimation method. In the variational method, the log likelihood is approximated by a computationally tractable lower bound. The lower bound is a function of both model parameters and auxiliary (variational) parameters. The variational parameters characterize a distribution that is an approximation to the posterior distribution of the latent variables. The difference between the log likeli-

Ann Oper Res (2013) 206:647–662

661

hood function and its lower bound equals the Kullback-Leibler divergence between the posterior distribution and its variational approximation. Parameters are estimated using a Maximization Maximization (MM) algorithm. In the first Maximization step, the lower bound is maximized with respect to the variational parameters while keeping the model parameters constant. In the second step, the lower bound is maximized with respect to the model parameters, with the variational parameters kept constant. The MM algorithm contains the EM algorithm as a special case when the variational distribution equals the exact posterior distribution of the latent variables. The variational method is actually a class of methods in that many different choices can be made with respect to the variational distribution. Ideally, the variational distribution is chosen such that it closely resembles the true posterior, while keeping the evaluation of the resulting lower bound to the log likelihood function computationally tractable. In this paper, we presented the variational method using the mean-field approximation in which the joint true posterior is approximated by the independence model where the latent variables are assumed to be independent given the observed data. The variational method with a meanfield approximation to the posterior distributions of the random effects was evaluated in a simulation study. For discrete random effects, no bias was found for any of the model parameters under a variety of conditions. The variational approximation can be developed along the same lines for Gaussian random effects by using (adaptive) Gaussian quadrature to numerically integrate over the random effects (Bock and Aitkin 1981; Pinheiro and Bates 1995). For nonadaptive quadrature, the update equations for the variational parameters will be similar to Eqs. (17) and (19), as the nodes and weights of the nonadaptive quadrature are treated as the known locations and weights of discrete latent variables. For adaptive quadrature, further modifications are needed as the quadrature points become subject-specific and the model can no longer be treated as a discrete latent variable model with known locations and weights. Further research will address the development of variational approximation techniques for Gaussian random effects. In this paper, we have focused on an IRT model with random effects for item parameters to account for differences across countries. However, the variational estimation method is not limited to this specific model, but can be applied with minor modifications to other psychometric models with random item effects such as, for example, the hierarchical IRT model of Janssen et al. (2000); see also Glas and van der Linden (2001).

References Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: an application of the EM algorithm. Psychometrika, 46, 443–459. De Jong, M. G., & Steenkamp, J.-B. E. M. (2010). Finite mixture multilevel multidimensional ordinal IRT models for large scale cross-cultural research. Psychometrika, 75, 3–32. De Jong, M. G., Steenkamp, J.-B. E. M., & Fox, J.-P. (2007). Relaxing measurement invariance in crossnational consumer research using a hierarchical IRT model. Journal of Consumer Research, 34, 260– 278. De Jong, M. G., Steenkamp, J.-B. E. M., Fox, J.-P., & Baumgartner, H. (2008). Using item response theory to measure extreme response style in marketing research: a global investigation. Journal of Marketing Research, 45, 104–115. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B, 39, 1–38. Fahrmeir, L., & Tutz, G. (2001). Multivariate statistical modelling based on generalized linear models (2nd ed.). New York: Springer.

662

Ann Oper Res (2013) 206:647–662

Glas, C. A. W., & van der Linden, W. J. (2001). Modelling variability in item parameters in item response models (Tech. Rep.). Enschede, University of Twente. Humphreys, K., & Titterington, D. M. (2003). Variational approximations for categorical causal models with latent variables. Psychometrika, 68, 391–412. Janssen, R., Tuerlinckx, F., Meulders, M., & De Boeck, P. (2000). A hierarchical IRT model for criterionreferenced measurement. Journal of Educational and Behavioral Statistics, 25, 285–306. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 86, 22–79. Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading: Addison-Wesley. Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Dordrecht: Kluwer Academic. Peterson, C., & Anderson, J. R. (1987). A mean-field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019. Pinheiro, P. C., & Bates, D. M. (1995). Approximations to the log-likelihood function in the nonlinear mixedeffects model. Journal of Computational and Graphical Statistics, 4, 12–35. Rijmen, F. (2009). An efficient EM algorithm for multidimensional IRT models: full information maximum likelihood estimation in limited time (ETS Research Report, RR-09-03). Rijmen, F. (2010). Formal relations and an empirical comparison between the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47, 361–372. Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8, 185–205. Rijmen, F., Vansteelandt, K., & De Boeck, P. (2008). Latent class models for diary method data: parameter estimation by local computations. Psychometrika, 73, 167–182. Rose, N., von Davier, M., & Xu, X. (2010). Modeling nonignorable missing data with item response theory (IRT) (ETS Research Report, RR-10-11). Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.