PSYCHOMETRIKA--VOL. 65, NO. 4, 437-456 DECEMBER 2000
A TEST-THEORETIC APPROACH TO OBSERVED-SCORE EQUATING

WIM J. VAN DER LINDEN
UNIVERSITY OF TWENTE
Abstract

Observed-score equating using the marginal distributions of two tests is not necessarily the universally best approach it has been claimed to be. On the other hand, equating using the conditional distributions given the ability level of the examinee is theoretically ideal. Possible ways of dealing with the requirement of known ability are discussed, including such methods as conditional observed-score equating at point estimates or posterior expected conditional equating. The methods are generalized to the problem of observed-score equating with a multivariate ability structure underlying the scores.

Key words: observed-score equating, equipercentile method, equating criteria, multidimensionality.
1. Introduction

Though important elsewhere, the problem of test-score equating has virtually been nonexistent in the history of intelligence testing. The reason is the wide reliance on the practice of normalizing observed-score distributions, after David Wechsler introduced this idea for the Wechsler-Bellevue Intelligence Scale in 1939 (DuBois, 1970, p. 126). Nevertheless, the practice of transforming observed scores on different versions of a test to a common normal distribution involves an implicit relation between these scores on which the standard method of equating in all other areas of testing is based. Let X be the observed score of a random examinee on an old version of a test and Y the score of a random examinee on a new version of the same test. Both examinees are drawn from a common population. Throughout this paper, the same notation will be used to denote the tests themselves. In addition, let Φ(·) be the distribution function of the common normal distribution to which X and Y are transformed. Normalization of X and Y amounts to the following score transformations:

e(x) = Φ^{-1}(F_X(x)),   (1)

e(y) = Φ^{-1}(F_Y(y)).   (2)
For the original scores Y and X, these transformations imply

x = e(y) = F_X^{-1}(F_Y(y)).   (3)
The same relation between Y and X could have been established not by normalizing their distributions for a common population, but by calculating their original distribution functions and applying the transformation in (3) directly to y. This practice is known as the method of equipercentile equating for an equivalent-groups design. It is important to note that different versions of an intelligence test typically meet the same content specifications and are designed to measure the same psychological construct.

This article is based on the author's Presidential Address given on July 7, 2000 at the 65th Annual Meeting of the Psychometric Society held at the University of British Columbia, Vancouver, Canada. The author is most indebted to Wim M.M. Tielen for his computational assistance and Cees A.W. Glas for his comments on a draft of this paper. Requests for reprints should be sent to W.J. van der Linden, Department of Educational Measurement and Data Analysis, University of Twente, P.O. Box 217, 7500 AE Enschede, THE NETHERLANDS. E-Mail: w.j.vanderlinden@edte.utwente.nl

Different
conditions exist in other areas of testing. The first example is large-scale educational testing, where test forms are generally released after testing. In this area, it is therefore necessary to assemble multiple versions of tests for administration at different sessions. Though these versions are assembled to be parallel, their score distributions are not always perfectly so, and some form of post hoc equating is used to prevent differences in scoring. Note that test versions for different sessions are usually assembled to identical content specifications but, unlike psychological tests, not necessarily designed to measure a single construct. The second example is the problem of "vertical equating". The first to address this problem might have been George Rasch in his study in the 1950s of the development of reading ability in a cohort of students that was regularly tested for over 15 years. The necessity to have equivalent scores from different versions of the tests for different age groups led Rasch to the proposal of his logistic item response model (Rasch, 1960). Observe that in this application these versions were written to different content specifications but that each version was intended to measure the same construct. The third type of equating exists when tests are written to different content specifications and also known to measure different constructs. A well-known example is college entrance testing in the USA. Though the two major tests used in entrance selection, the ACT Assessment Program and the Scholastic Assessment Test (SAT), are different instruments, several colleges accept scores from either instrument, and score conversion tables are established regularly to support them in making their admission decisions. A recent set of tables is given in Dorans (1999). Current interest in accountability in education has given rise to the exploration of even more challenging types of equating. 
For example, in countries with both local and national educational assessments, several local administrations are involved in attempts to link their results to scales established in national assessments. A study of this type is reported in Williams, Billaud, Davis, Thissen, and Sanford (1995). Likewise, at the international level, several attempts have been made to compare results from national assessments to those reported in international assessments. One of the first studies to address this problem was Pashley and Philips (1993). These newer types of linking problems are particularly challenging because the different assessments are not only based on sets of test items written to different specifications and intended to measure different domains of knowledge and skills, but also have to deal with crucial differences in curricula between countries and school districts. Moreover, these assessments have sampling designs typically chosen to yield accurate estimates of population distributions rather than of individual scores. Recently, three different committees studied the technical problems involved in these linking problems and published reports that discouraged this type of equating because of serious technical problems (Cizek, Kenney, Kolen, Peters, & van der Linden, 1999; Feuer, Holland, Green, Bertenthal, & Hemphill, 1999; Koretz, Bertenthal, & Green, 1999). The next two sections in this paper explore the assumptions and criteria of observed-score equating. We then introduce a new class of methods of observed-score equating that do meet the criteria, unlike the current methods, which are all based on the marginal observed-score distributions. Generalizations of these methods to the case of observed scores with an underlying multidimensional ability structure are presented.
Our main intention is a more rigorous satisfaction of equating assumptions, not to relax these assumptions to accommodate the more challenging linking problems above, as in Linn (1993) and Mislevy (1992). Nevertheless, the results may lead to improvements in those types of linking as well.

2. Equipercentile Method

For two arbitrary score distributions, the method of equipercentile equating is demonstrated graphically in Figure 1. Tests X and Y have the same number of items. The observed score on test X equivalent to a score of 29 on test Y is found by going from y = 29 to the cumulative proportion associated with this score and then to the quantile associated with this proportion for test X (the discreteness of test scores is ignored). These steps constitute the transformation in (3).
FIGURE 1. Graphical illustration of equipercentile transformation. (Distribution functions of Test X and Test Y, plotted against test score.)
The standard motivation for the transformation in (3) in introductory texts is the thought experiment of a population of examinees taking both test X and Y. Because the population is common, both tests share the order of the examinees in the population, and scores corresponding with the same percentile in the distributions of X and Y are thus equivalent. Hence, the name equipercentile equating. A plot of the transformation in (3) is known as a quantile or Q-Q plot in statistics. These plots are generally used to assess graphically the differences between two empirical distributions or the fit of an empirical to an expected distribution (Wilk & Gnanadesikan, 1968). Differences between distributions are reflected by deviations of the plot from the identity line. The term quantile equating for the method considered here would thus have been more in line with the mainstream statistical literature. It would also have acknowledged the fact that the method does not necessarily involve the use of percentiles. The method of equipercentile equating is not based on any distributional assumptions. It seems therefore universally applicable. However, applying the method does involve several practical problems, such as the need for large samples or strong smoothing methods to get satisfactory estimates of the distribution functions, possible indeterminacy of the transformation due to guessing on test items, discreteness of number-correct scores or differences in test length, difficulties in realizing a proper equating design, etc. Detailed descriptions of strategies to resolve such problems are given in Holland and Rubin (1982) and Kolen and Brennan (1995).
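As an illustration, the transformation in (3) can be sketched directly from two samples of observed scores. This is a minimal sketch only: the function name and the use of linear interpolation in place of smoothing are our own choices, and the discreteness issues just mentioned are ignored.

```python
import numpy as np

def equipercentile(y, scores_x, scores_y):
    """Equate score y on test Y to the scale of test X via
    x = F_X^{-1}(F_Y(y)), using the empirical distribution functions
    of two score samples; linear interpolation between order
    statistics stands in for smoothing."""
    scores_x = np.sort(np.asarray(scores_x, dtype=float))
    scores_y = np.sort(np.asarray(scores_y, dtype=float))
    # F_Y(y): proportion of Y scores at or below y
    p = np.searchsorted(scores_y, y, side="right") / len(scores_y)
    # F_X^{-1}(p): the corresponding quantile of the X scores
    return float(np.quantile(scores_x, min(p, 1.0)))
```

With large samples this approaches the population transformation; with the small samples common in practice, the smoothing methods cited above would replace the raw quantiles.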
2.1. IRT Observed-Score Equating

The theoretical results in this paper assume observed-score equating under the fit of a (dichotomous) item response theory (IRT) model. Both the cases of unidimensional and multidimensional ability are considered. For all results only a few nonparametric assumptions on the response functions are needed. In fact, we assume only that monotonicity of the response function in the examinee's ability, θ, and conditional independence between responses hold. Under these assumptions the observed-score distributions are known to be generalized binomial. For a test of n items, the probability function of the generalized binomial is defined by
the generating function

∑_{x=0}^{n} f_{X|θ}(x) t^x = ∏_{i=1}^{n} [Q_i(θ) + t P_i(θ)],   (4)
where P_i(θ) is the response function for item i and Q_i(θ) ≡ 1 − P_i(θ). The probability function can easily be calculated using a recursive relation in Lord and Wingersky (1984). As

f_X(x) = ∫ f_{X|θ}(x) f_Θ(θ) dθ,   (5)
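The recursion of Lord and Wingersky (1984) and the marginalization in (5) can be sketched as follows. The quadrature grid and weights used for the integral are an illustrative assumption; any quadrature rule could be substituted.

```python
import numpy as np

def lord_wingersky(p):
    """Conditional observed-score distribution f_{X|theta} of the
    generalized binomial, via the Lord-Wingersky recursion;
    p[i] = P_i(theta) is the success probability on item i."""
    f = np.array([1.0])                 # distribution after zero items
    for pi in p:
        g = np.zeros(len(f) + 1)
        g[:-1] += f * (1.0 - pi)        # item answered incorrectly
        g[1:] += f * pi                 # item answered correctly
        f = g
    return f                            # f[x] = Pr{X = x | theta}

def marginal_distribution(p_of_theta, thetas, weights):
    """Marginal f_X of (5), with the integral over f_Theta replaced
    by quadrature points `thetas` with weights `weights`."""
    w = np.asarray(weights, dtype=float)
    fx = sum(wi * lord_wingersky(p_of_theta(t)) for t, wi in zip(thetas, w))
    return fx / w.sum()
```

For equal success probabilities the recursion reduces to the ordinary binomial, which provides a convenient check on an implementation.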
a possible observed-score equating method is to calculate the distributions of X and Y for a population with distribution f_Θ(θ) and use their distribution functions to calculate the transformation in (3). This method was presented in Lord (1980, sec. 13.7), whereas applications are discussed in Glas (1992), Lord and Wingersky (1984), and Zeng and Kolen (1995). Some of the results are illustrated with data from empirical examples. The responses in these examples fitted either a unidimensional three-parameter logistic (3-PL) model

P_i(θ) = Pr{U_i = 1} = c_i + (1 − c_i) exp[a_i(θ − b_i)] / (1 + exp[a_i(θ − b_i)]),   (6)
where U_i is the response variable for item i, with U_i = 1 for a correct and U_i = 0 for an incorrect response, θ ∈ R is the ability of the examinee, and a_i ∈ (0, ∞), b_i ∈ R, and c_i ∈ [0, 1) are the discrimination, difficulty, and guessing parameter for item i, respectively, or a two-dimensional generalization of the two-parameter logistic (2-PL) model

P_i(θ) = Pr{U_i = 1} = exp(a_{1i}θ_1 + a_{2i}θ_2 + d_i) / (1 + exp(a_{1i}θ_1 + a_{2i}θ_2 + d_i)),   (7)
where d_i ∈ R is the easiness parameter and a_{1i} ∈ (0, ∞) and a_{2i} ∈ (0, ∞) are the "loadings" of item i on the two ability dimensions.

3. Equating Criteria

The fact that the transformation in (3) does not involve any distributional assumptions does not make it automatically appropriate for test-score equating. Additional criteria for successful equating are derived from the principles of measurement and the need to be fair to all examinees. Lord (1980, sec. 13.5) formulates four fundamental criteria of equating. We will review these criteria only briefly; more complete discussions of these criteria, as well as of some weaker versions thereof, are given in Harris and Crouse (1993) and Yen (1983). In addition, we introduce two new requirements for observed-score equating that are derived from basic assumptions in test theory.

3.1. Measurement of Same Variable

The criterion of X and Y measuring the same empirical variable is obvious. Equating scores on different variables would miss the rationale underlying the method of equipercentile equating: For two tests measuring different variables, the assumption that they rank the same population of examinees identically, and hence that equal quantiles in the two score distributions represent equivalent scores, is obviously wrong. The method of equipercentile equating has no built-in protection against application to tests measuring different variables. In fact, it does not even guarantee that each test measures a single variable. However, IRT observed-score equating does provide statistical tests of unidimensionality. Complications due to test scores with an underlying multidimensional ability structure are discussed later in this paper.

3.2. Equity
For an equating transformation to be fair to each examinee, taking test Y with additional equating of his/her score to test X must be an event identical to taking test X directly. This criterion was coined the equity criterion by Lord, and formalized as the requirement that
F_{X|θ}(x) = F_{e(Y)|θ}(y)   (8)
hold for each value of θ. The criterion of equity is the same as Morris' (1982) notion of strong equating. An important result derived by Lord is that two tests cannot meet the criterion in (8) unless they are item-by-item parallel, in which case no equating is needed at all (Lord, 1980, Theorem 13.3.1).
3.3. Invariance Across Populations

The criterion of invariance of equating transformations across populations is also motivated by concerns about fairness. If an examinee would get a different equated score for the same performance in another population, this score would suffer from unjust dependence on the performances of other examinees. The question whether equipercentile equating meets the criterion of invariance across populations seems an empirical issue. More importantly, however, the concept of a population itself is unclear. If both tests are administered at different times, the groups of examinees may differ systematically and the claim of sampling from a common population may be difficult to maintain. To resolve this issue, Braun and Holland (1982) introduced the notion of a synthetic population. Let P_X and P_Y be the populations of examinees involved in the administration of tests X and Y. The populations have size n_X and n_Y, respectively. A synthetic population is defined as

P = (n_X P_X + n_Y P_Y) / (n_X + n_Y),   (9)

and the equipercentile transformation for P is the identical combination of the transformations in (3) for P_X and P_Y. However, it is hard to find a satisfactory interpretation for the notion of a synthetic population. It seems to be motivated more by the necessity to compromise between P_X and P_Y than to model a sampling design in an actual equating study.
3.4. Symmetry

The requirement of the equating transformation being symmetric in X and Y is obvious. Equipercentile equating satisfies this criterion. If X and Y are interchanged, the transformation in (3) becomes identical to its inverse
e(x) = F_Y^{-1}(F_X(x)).   (10)
3.5. Two Other Fundamental Problems

3.5.1. Multilevel Structure of Test Scores

The transformation in (3) matches the marginal distributions of X and Y. However, a basic assumption in test theory is that observed scores have a multilevel structure, with a distribution of X for each examinee p ∈ P and a second-level distribution across P of the true score or latent parameter that describes these individual distributions. It even seems safe to claim that the classical test theory model introduced by Spearman (1904) was the first multilevel model ever.
If test scores have a multilevel structure, why then match marginal distributions? How could a method based on this practice ever satisfy the criterion of equity formulated entirely at the individual level?
3.5.2. Criterion for Ordering Examinees

The usual motivation of the method of equipercentile equating has a weak spot: the lack of a formal criterion to rank the examinees in P. Should we consider them ordered by their classical true score (or, equivalently, by their values for the IRT ability parameter, θ)? Or by the individual distributions X | p? However, for either choice there are no "quantiles" for the examinees in the marginal distribution of X, a key assumption on which the method of equipercentile equating is based. Equipercentile equating even assumes P to be identically ordered by X and Y. How could this identity be established?
3.6. Conclusion

The method of equipercentile equating does not satisfy most of the above criteria. In addition, it struggles with the need to meet the distributional structure of the test scores assumed throughout test theory and is not based on an explicit criterion for ordering examinees. The question arises whether another transformation could do a better job of meeting these criteria. The answer is in the affirmative, provided we are prepared to drop an important assumption implicit in all current equating. The following thought experiment explains why. Suppose we have two different examinees with the same score Y = y. Current methods of equating give these examinees the same equated score e(y). This rule seems sound if all we know about them is their observed score y. However, suppose we are now informed of the examinees' ability levels and one examinee appears to be more able than the other. Would we still give the same equated score e(y) to both examinees? Probably not. For example, we may find that score y is quite low for the more able examinee and that, in order to allow for this rare event and achieve more equitable equating, his/her equated score should be higher than the one for the less able examinee. That is, we should equate conditional on the ability level of the examinees. Of course, we do not know the ability levels of the examinees. But we may have information about them, for example, in their response vectors if these fit a model for which the observed score is not sufficient for the ability parameter. Or we may find another approach that enables us to do better than the compromise between the conditional transformations offered by the marginal transformation in (3). The remainder of this paper is based on these ideas.

4. Conditional Equating (Unidimensional Case)

The proposed transformation is the equipercentile transformation in (3) defined on the conditional distributions of X and Y given the same value of θ:
e(y; θ) = F_{X|θ}^{-1}(F_{Y|θ}(y)).   (11)
Because we have one transformation at each possible value of θ, equating is now based on a set of transformations,

Γ ≡ {e(y; θ) = F_{X|θ}^{-1}(F_{Y|θ}(y)) : θ ∈ R},

instead of a single transformation.
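A member of this set of conditional transformations can be sketched from the response probabilities of the two tests at a common θ. The generalized-binomial distributions are built with the recursion cited earlier; the linear interpolation over the discrete score scale is our own simplification.

```python
import numpy as np

def gen_binomial(p):
    """Generalized binomial (Lord-Wingersky recursion)."""
    f = np.array([1.0])
    for pi in p:
        f = np.append(f * (1 - pi), 0.0) + np.append(0.0, f * pi)
    return f

def conditional_equate(y, p_x, p_y):
    """e(y; theta) = F_{X|theta}^{-1}(F_{Y|theta}(y)), equation (11);
    p_x and p_y hold the response probabilities P_i(theta) on the
    two tests at the same theta."""
    Fx = np.cumsum(gen_binomial(p_x))   # F_{X|theta}
    Fy = np.cumsum(gen_binomial(p_y))   # F_{Y|theta}
    # smallest x with F_{X|theta}(x) >= F_{Y|theta}(y), interpolated
    return float(np.interp(Fy[int(y)], Fx, np.arange(len(Fx))))
```

When the two tests have identical response probabilities at θ, the transformation reduces to the identity, as the equity criterion requires; an easier test X maps y upward.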
Proposition 1. For a population of examinees P and tests X and Y with concurrent fit to a dichotomous unidimensional response model with monotonic response functions and conditional independence between the responses, Γ has the following properties:

i. identical distributions of e(Y) and X for each p ∈ P (equity);
ii. symmetry in X and Y for each p ∈ P;
iii. population invariance within P;
iv. conditional distributions {X_p : p ∈ P} and {Y_p : p ∈ P} that are identically ordered.

In addition,

v. X and Y measure the same variable.
Proof. (i) For each p ∈ P there is a unique value of θ, and for each θ the transformation in (11) matches the conditional distributions of e(Y) and X. (ii) The inverse of F_{X|θ}^{-1} F_{Y|θ}(y) is F_{Y|θ}^{-1} F_{X|θ}(y), which is (11) for the equating of X to Y. (iii) The conditional formulation of (11) implies independence from the distribution of Θ over P. As a consequence, Γ holds for any subpopulation of P. (iv) The fact that the conditional distributions {X_p : p ∈ P} and {Y_p : p ∈ P} are identically ordered within P is proved separately in Proposition 2 below. (v) Finally, because tests X and Y have concurrent fit to a unidimensional response model, they measure the same variable. □

We are now able to give a definition of the population for which the equating holds:
Definition 1. P is the population of examinees for which X and Y show concurrent fit to the response model of choice.

This definition involves a clear empirical criterion. It is also flexible. The population now includes any past or future examinees whose responses to the items in X and Y can be shown to fit the model. Most importantly, however, the definition does not involve any need to sample examinees randomly from a population. Because equating is conditional on ability, the only randomness is in the examinees' responses to the items in X and Y. Also, unlike equipercentile equating based on marginal score distributions, there is no need to realize an equivalent-groups design. The definition allows us to drop the distinction between an examinee p ∈ P and a value of θ. For convenience, this will be done in the remainder of the paper.
4.1. Identically Ordered Test Scores

The proposed criterion for ordering examinees in P is stochastic order in the families of distributions {X | θ} and {Y | θ}. A family of distributions {X | θ} is stochastically ordered if their distribution functions have no point in common, that is, if F_{X|θ}(x) decreases in θ for each value of x (see Lehmann, 1986, p. 84; for convenience, only the case of strict stochastic order is considered here). Figure 2 gives examples of two sets of conditional distributions of X for a "population" of three examinees that is and is not stochastically ordered. Observe that this criterion is a fundamental choice because it implies the other possible criteria discussed earlier: If P is ordered by {X | θ}, it is also ordered by θ and by E(X | θ) (Lehmann, 1986, p. 116, Exercise 5). The last quantity is the true score from classical test theory. For these and other properties of stochastic order in test theory, see Junker and Sijtsma (2000) and van der Linden (1998a).
Proposition 2. Under the conditions in Proposition 1, {X | θ} and {Y | θ} are ordered identically in θ.

Proof. The proposition follows immediately from Theorem 2 in Grayson (1988), which states that under the given conditions X and Y have monotone likelihood ratio with respect to θ. The property of monotone likelihood ratio implies that both {X | θ} and {Y | θ} are stochastically ordered in θ. Because θ is a common parameter, the property of identical order follows. □
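Proposition 2 can be illustrated numerically: under the 3-PL model in (6), the conditional distribution functions of the number-correct score are nonincreasing in θ at every score. The item parameters below are made-up values chosen only for the illustration.

```python
import numpy as np

def p3pl(theta, a, b, c):
    """Three-parameter logistic response function, equation (6)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def cdf_given_theta(theta, items):
    """Conditional distribution function F_{X|theta} of the
    number-correct score (generalized binomial recursion)."""
    f = np.array([1.0])
    for a, b, c in items:
        p = p3pl(theta, a, b, c)
        f = np.append(f * (1 - p), 0.0) + np.append(0.0, f * p)
    return np.cumsum(f)

# For every score x, F_{X|theta}(x) should be nonincreasing in theta,
# i.e., the family {X | theta} is stochastically ordered.
items = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.15), (1.5, 0.7, 0.25)]
cdfs = [cdf_given_theta(t, items) for t in np.linspace(-3, 3, 25)]
ordered = all(np.all(cdfs[k] >= cdfs[k + 1] - 1e-12)
              for k in range(len(cdfs) - 1))
```

The check passes for any monotone response functions, in line with Grayson's (1988) result; it is a sanity check, not a proof.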
FIGURE 2. Graphical illustration of observed-score distributions that are (a) and are not (b) stochastically ordered. (Each panel plots the distribution functions for three examinees, p = 1, 2, 3, against observed score X.)
A formal definition of a variable in measurement theory is that of an ordered set of objects, introduced by Campbell (1928). This definition differs from the criterion of a common ability parameter used in the proof of part (v) of Proposition 1. However, the definition is implied by (iv). In fact, under the assumptions in this proposition, the two definitions are equivalent. Historically, Campbell's definition has led to the notion of a quantitative variable as an ordered set with a metric, and from there to representational measurement theory (Suppes & Zinnes, 1963). This connection can be used to treat the problem of test-score equating as one of finding a transformation that maps one metric for a variable onto another. This route will not be further pursued here.
4.2. Empirical Example

An empirical example for two 25-item subtests from the Law School Admission Test (LSAT) is given in Figure 3. Both subtests were taken from the same pool of items calibrated under the unidimensional model in (6) and were of the same difficulty. For these two tests, conditional observed-score distributions were generated using (4). These distributions were then used to calculate the transformations in (11) for θ = −3.00(.25)3.00. The figure shows that large differences in equated scores are possible, particularly for large differences between θ values. Also, equated scores tend to increase in θ. The same strong trend is present in all examples below. Equated scores for the same value of y are not ordered in θ, though. This finding, demonstrated by the fact that some of the equating curves in the plots in this paper cross, was somewhat unexpected in view of all the order assumptions made before. It is therefore documented in the following proposition:
Proposition 3. Under the assumptions in Proposition 1, the transformations in Γ are not ordered in θ for each y; that is, θ_1 < θ_2 does not imply any relation between x_1 = e(y; θ_1) and x_2 = e(y; θ_2).

The proposition addresses the behavior of e(y; θ) = F_{X|θ}^{-1}(F_{Y|θ}(y)) as a function of θ.
Let F_{Y|θ}(y) = p. Both F_{Y|θ}(y) and F_{X|θ}(x) decrease in θ, but a decrease in θ also entails an increase in p and hence in F_{X|θ}^{-1}(p) (because it is a quantile function). As we have no closed-form expressions for F_{Y|θ}(y) and F_{X|θ}^{-1}(p), further analysis of possible conditions under which Γ is ordered in θ is seriously impeded.

FIGURE 3. Conditional equating transformations e(y; θ) for two subtests from the LSAT. (Equated score x against y, for θ = −3.0 through θ = 3.0.)

5. Alternative Observed-Score Equating Methods

In a practical application of conditional equating, a choice has to be made from the set of conditional transformations Γ while θ is unknown. The following suggestions are made:
5.1. Estimated Conditional Equating

A straightforward approach is to use a point estimate of θ and choose the transformation from Γ at this estimate. Empirical examples for a set of response vectors on test Y are given in Figure 4. One response vector was generated for each θ value in Figure 3, and the posterior distribution of θ was calculated using a noninformative prior. The conditional equipercentile transformations at the means of these posteriors (EAP estimates) were plotted. The general shape of this plot is the same as in Figure 3, with a slightly smaller variance between the curves due to the use of a Bayesian estimator. Of course, variation due to error in the EAP estimates is not visible in this figure. This variation decreases with the length of the test, the degree to which its items are optimal at the θ value of the examinee, as well as the information in the prior distribution.
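The EAP step in this approach can be sketched as follows, with the posterior computed on a discrete grid under a flat prior. The grid and prior are illustrative assumptions, not the computational setup used for the figures.

```python
import numpy as np

def eap_estimate(responses, items, grid=np.linspace(-4, 4, 81)):
    """EAP estimate of theta for a response vector under the 3-PL
    model in (6), using a flat (noninformative) prior on a grid;
    items is a list of (a, b, c) parameter triples."""
    def p3pl(theta, a, b, c):
        return c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    like = np.ones_like(grid)
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(grid, a, b, c)
        like *= p if u == 1 else (1 - p)
    post = like / like.sum()          # flat prior: posterior ∝ likelihood
    return float(np.sum(grid * post))  # posterior mean (EAP)
```

The resulting estimate would then be plugged into the conditional transformation e(y; θ̂) of (11).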
5.2. Posterior Expected Conditional Equating

Instead of choosing a single transformation from Γ for a point estimate of θ, we could average Γ over the posterior distribution of θ for the examinee. The equating transformation for an examinee with response vector (u_1, …, u_n) on test Y then becomes

e(y) ≡ ∫ F_{X|θ}^{-1} F_{Y|θ}(y) f_{θ|u_1,…,u_n}(θ) dθ,   (12)
where f_{θ|u_1,…,u_n}(θ) is the posterior density of θ. Figure 5 shows the transformations in (12) for the same set of response vectors as Figure 4.
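A sketch of the posterior expected transformation in (12), with the integral replaced by a sum over a grid-based posterior; the grid and the interpolation over the discrete score scale are our own simplifications.

```python
import numpy as np

def posterior_expected_equate(y, post, grid, px_of_theta, py_of_theta):
    """Average the conditional transform e(y; theta) of (11) over a
    posterior for theta given on a grid; px_of_theta(t) and
    py_of_theta(t) return the response probabilities of the two
    tests at ability t."""
    def gen_binomial(p):
        f = np.array([1.0])
        for pi in p:
            f = np.append(f * (1 - pi), 0.0) + np.append(0.0, f * pi)
        return f
    e = 0.0
    for theta, w in zip(grid, post):
        Fy = np.cumsum(gen_binomial(py_of_theta(theta)))
        Fx = np.cumsum(gen_binomial(px_of_theta(theta)))
        e += w * np.interp(Fy[int(y)], Fx, np.arange(len(Fx)))
    return float(e / np.sum(post))
```

For two tests with identical response probabilities at every θ, the average of identity transformations is again the identity, which gives a quick correctness check.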
FIGURE 4. Conditional equating transformations at ability estimates for two subtests from the LSAT.
FIGURE 5. Posterior expected conditional equating transformations for two subtests from the LSAT.
5.3. Weighted Average Conditional Equating

If, for practical reasons, only one transformation from Γ can be chosen for all examinees in a group, the choice of a weighted combination

e(y) ≡ ∫ F_{X|θ}^{-1} F_{Y|θ}(y) w(θ) dθ / ∫ w(θ) dθ,   (13)

with w(θ) as weight for θ, seems obvious. A possible choice for w(θ) is the relative frequency of the examinees at θ:

e(y) ≡ ∫ F_{X|θ}^{-1} F_{Y|θ}(y) f_Θ(θ) dθ.   (14)
In Figure 6, the transformation in (14) is plotted along with the marginal equipercentile transformation in (3). Both plots are for a standard normal density for f_Θ(θ). It is interesting to compare (14) with the combination of (3) and (5). In (14), we first equate for fixed values of θ and then average over the distribution of θ, whereas in marginal equating we first average over the distribution of θ and then equate using this average. Note that both transformations in Figure 6 are population dependent. The purpose of this figure is only to show that, even for the case of a single transformation for a population of examinees, reasonable alternatives to (3) are available.
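The contrast between (14) and the combination of (3) with (5) can be sketched as two small functions: one equates conditionally and then averages, the other marginalizes the distributions first and then equates. The function names and quadrature are our own.

```python
import numpy as np

def gen_binomial(p):
    """Generalized binomial via the Lord-Wingersky recursion."""
    f = np.array([1.0])
    for pi in p:
        f = np.append(f * (1 - pi), 0.0) + np.append(0.0, f * pi)
    return f

def equate_then_average(y, px, py, thetas, w):
    """(14): equate conditionally at each theta, then average
    over the weights w (e.g., a discretized f_Theta)."""
    vals = []
    for t in thetas:
        Fx = np.cumsum(gen_binomial(px(t)))
        Fy = np.cumsum(gen_binomial(py(t)))
        vals.append(np.interp(Fy[int(y)], Fx, np.arange(len(Fx))))
    return float(np.average(vals, weights=w))

def average_then_equate(y, px, py, thetas, w):
    """(3) combined with (5): marginalize first, then equate."""
    w = np.asarray(w, dtype=float)
    fx = sum(wi * gen_binomial(px(t)) for t, wi in zip(thetas, w)) / w.sum()
    fy = sum(wi * gen_binomial(py(t)) for t, wi in zip(thetas, w)) / w.sum()
    return float(np.interp(np.cumsum(fy)[int(y)],
                           np.cumsum(fx), np.arange(len(fx))))
```

The two routes agree trivially when the tests are identical; for differing tests they generally produce different transformations, which is exactly the point made by Figure 6.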
5.4. Assembling Tests to Match Conditional Score Distributions

Another approach is to assemble test Y to have conditional observed-score distributions identical to those on X. This approach would be possible for testing programs in which new versions of tests are assembled from a pool of items calibrated under a response model but scores are still reported on an observed-score scale. This practice exists in most programs with scoring scales established before they began using IRT to analyze their items and assemble tests.
FIGURE 6. Marginal equating transformation (a) and weighted average of conditional equating transformations (b) for two subtests from the LSAT.
The following proposition, derived in van der Linden and Luecht (1998, Proposition 1), shows a set of conditions on response functions under which tests X and Y have identical conditional observed-score distributions:
Proposition 4. If P_i(θ) and P_j(θ) are the response functions of item i = 1, …, n in test X and item j = 1, …, n in test Y, respectively, the conditional distributions of X and Y given θ are identical if

∑_{i=1}^{n} P_i^r(θ) = ∑_{j=1}^{n} P_j^r(θ)   for r = 1, …, n.   (15)
Though the proposition formulates a set of n conditions on the two tests, the authors also show that the impact of the sums of higher-order powers quickly vanishes, and in practice only the sums for r = 1, …, R with a low value of R are needed to equate two tests (R = 3 or 4, say). Note that the conditions in (15) are linear in the items. Thus, test Y can be assembled using a 0-1 linear programming (LP) model for test assembly that maximizes the fit of the sums ∑_{j=1}^{n} P_j^r(θ) for test Y to the sums ∑_{i=1}^{n} P_i^r(θ) for a given test X at a series of θ values (van der Linden, 1998b). The approach is illustrated for the same subtest X from the LSAT as in the previous examples. Test Y was now assembled from a pool of 728 items to have response functions maximally approaching the conditions in (15) at θ = −1.5, −1.0, 0.0, 1.0, 1.5. Four tests were assembled, one each for R = 1, 2, 3, and 4. For these tests the conditional equating transformations in (11) were calculated. Figure 7 displays the transformations for the same values of θ as in the previous plots. The criterion for test Y to have conditional observed-score distributions identical to those on test X is coincidence of all conditional equating transformations with the identity line. This criterion was not met satisfactorily for R = 1 and 2 but quite well for R = 3 and 4.
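The assembly idea can be sketched as follows. The paper uses a 0-1 LP model; the greedy selection here is entirely our own stand-in, intended only to show how candidate items are scored against the power-sum conditions in (15). Items are represented as response functions θ → probability.

```python
import numpy as np

def assemble_matching_test(pool, target, n, thetas, R=3):
    """Greedy sketch of assembling a test from `pool` so that the
    power sums sum_j P_j^r(theta) match those of `target` at the
    theta values in `thetas`, for r = 1, ..., R (condition (15))."""
    def power_sums(items):
        P = np.array([[it(t) for t in thetas] for it in items])
        return np.array([P ** r for r in range(1, R + 1)]).sum(axis=1)
    goal = power_sums(target)
    chosen, remaining = [], list(range(len(pool)))
    for _ in range(n):
        k = len(chosen) + 1
        # pick the item whose addition brings the partial power sums
        # closest to the proportionally scaled goal
        best = min(remaining, key=lambda j: np.abs(
            power_sums([pool[i] for i in chosen] + [pool[j]])
            - goal * k / n).sum())
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

A real application would replace the greedy loop with the 0-1 LP model of van der Linden (1998b), which optimizes the fit over all items simultaneously.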
FIGURE 7A. Conditional equating transformations for tests assembled from an LSAT item pool to match conditions on response functions for a given test (R = 1).
FIGURE 7B. Conditional equating transformations for tests assembled from an LSAT item pool to match conditions on response functions for a given test (R = 2).
FIGURE 7C. Conditional equating transformations for tests assembled from an LSAT item pool to match conditions on response functions for a given test (R = 3).

FIGURE 7D. Conditional equating transformations for tests assembled from an LSAT item pool to match conditions on response functions for a given test (R = 4).

5.5. Discussion

The first three suggestions need further evaluation to show their practicality. It seems appropriate to evaluate these forms of equating against the set of conditional transformations F in (11). The choice of F as the "true" equating transformations is motivated by the fact that it meets all reasonable criteria of equating (Proposition 1). Up to now, the statistical evaluation of equipercentile observed-score equating has been restricted to calculating its (asymptotic) standard error (e.g., Liou & Cheng, 1995; Lord, 1982). However, the differences for the same two tests between equating based on the marginal distributions in Figure 5 and those based on the conditional distributions in Figure 3 reveal that equating bias should also be a serious concern.

The last suggestion is of a different nature. A natural area of application is computerized adaptive testing (CAT). In this application, a calibrated item pool is always available, and the only thing needed to have CAT administrations with mutually equated observed scores is to impose a few of the conditions in (15) on the item-selection process at the ability estimate. An empirical example of this application is given in van der Linden (in press).
6. Conditional Equating (Multidimensional Case)

So far, the treatment of observed-score equating has relied critically on the assumption of a single variable underlying both tests. However, multidimensionality is one of the most frequent reasons for lack of fit of a response model, in particular in the more challenging type of test-score equating discussed in the introductory section. The question arises how to proceed when θ is multidimensional. The response model used in the empirical examples below is the two-dimensional model in (7). Note that this model has a monotonic response surface: The probability of a correct response increases in θ1 for a fixed value of θ2 and in θ2 for a fixed value of θ1. The following result holds:
Proposition 5. Under the multidimensional response model in (7), the family of conditional observed-score distributions {X | θ1, θ2} is stochastically ordered in θ1 for a fixed value of θ2, and reversely.

Proof. For a fixed value of θ1 or θ2, the model in (7) is a 2-PL unidimensional model with a rescaled difficulty parameter. Thus, Proposition 2 applies. []
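Proposition 5 is easy to check numerically: the conditional observed-score distribution follows from the Lord and Wingersky (1984) recursion, and for fixed θ2 its distribution function must lie weakly lower (shift right) as θ1 increases. A sketch under a two-dimensional 2-PL surface with hypothetical item parameters, not the AAP pool:

```python
import math

def prob(a1, a2, b, t1, t2):
    """Two-dimensional 2-PL response surface, monotone in t1 and in t2."""
    return 1.0 / (1.0 + math.exp(-(a1 * t1 + a2 * t2 - b)))

def observed_score_dist(items, t1, t2):
    """Lord-Wingersky recursion for the conditional number-correct distribution."""
    dist = [1.0]
    for a1, a2, b in items:
        p = prob(a1, a2, b, t1, t2)
        new = [0.0] * (len(dist) + 1)
        for x, pr in enumerate(dist):
            new[x] += pr * (1.0 - p)      # item answered incorrectly
            new[x + 1] += pr * p          # item answered correctly
        dist = new
    return dist

def cdf(dist):
    out, c = [], 0.0
    for p in dist:
        c += p
        out.append(c)
    return out

# hypothetical 10-item test
items = [(1.0 + 0.1 * i, 0.5, 0.2 * i - 1.0) for i in range(10)]

# for fixed t2, the cdf at the higher t1 lies (weakly) below the cdf at
# the lower t1 at every score point: stochastic order in t1
t2 = 0.0
for lo, hi in [(-2.0, -1.0), (-1.0, 0.0), (0.0, 1.0)]:
    F_lo = cdf(observed_score_dist(items, lo, t2))
    F_hi = cdf(observed_score_dist(items, hi, t2))
    assert all(h <= l + 1e-12 for l, h in zip(F_lo, F_hi))
print("stochastic order in theta1 confirmed for fixed theta2")
```

The same check with the roles of θ1 and θ2 exchanged verifies the "and reversely" part of the proposition.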
This property, which generalizes immediately to higher-dimensional models, plays a role in the three different cases of multidimensionality discussed below. These cases arise because the abilities found to be measured by the items may have to be considered as intentional or as nuisance abilities (see van der Linden, 1996).
6.1. Case 1: All Abilities Intentional

Scoring a test by the number of items correct is inconsistent with an interest in multiple abilities. Also, observed-score equating seems to be impossible. From Proposition 5 it follows that the observed-score distributions are ordered by each separate ability, but always conditional on fixed values for all other abilities. It is impossible to define a ranking of P based on all intentional abilities simultaneously, let alone an identical ranking for two tests. A special case arises if tests X and Y have a simple structure with respect to the underlying abilities, that is, consist of subtests with items loading highly on one ability but hardly on the others. In this case, it is possible to equate observed scores at the subtest level because, as shown in the next section, conditional distributions identically ordered in an intentional ability parameter are available. However, if equated subtest scores are summed, or otherwise aggregated at the test level, this property is lost.
6.2. Case 2: Single Intentional Ability

This case arises, for instance, if a test is designed to measure a single ability but actually happens also to measure one or more abilities in which no interest exists. Using number-correct scoring in this case is also not encouraged but sometimes unavoidable, particularly when there is one accidental nuisance ability to which only a few items are sensitive. Let θ1 be the parameter for the intentional ability and θ2 the parameter for the nuisance ability. It is suggested to base observed-score equating on the conditional distributions of X and Y given θ1. These distributions are defined by
f_{X|θ1}(x) = ∫ f_{X|θ1,θ2}(x) f_{Θ1,Θ2}(θ1, θ2)/f_{Θ1}(θ1) dθ2.    (16)

F now has elements

e(y; θ1) = F_{X|θ1}^{-1}(F_{Y|θ1}(y)).    (17)
Marginalizing over nuisance parameters in distribution functions is unusual. Justification from a sampling perspective is impossible, whereas in a Bayesian approach marginalization is allowed only for a posterior distribution. However, the only thing needed for observed-score equating is the availability of families of distributions on both tests that are identically ordered in the ability parameter of interest. The following proposition, introduced in van der Linden and Vos (1996), establishes this feature for (16).
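Computationally, (16) is just an average of conditional score distributions over a quadrature grid for θ2 given θ1, and (17) an equipercentile mapping between the two resulting distribution functions. A sketch follows, with a crude five-point grid standing in for the integral, a discrete (uncontinuized) stand-in for the equipercentile map, and all item parameters hypothetical:

```python
import math

def prob(a1, a2, b, t1, t2):
    """Two-dimensional 2-PL response surface as in the text's model (7)."""
    return 1.0 / (1.0 + math.exp(-(a1 * t1 + a2 * t2 - b)))

def score_dist(items, t1, t2):
    """Lord-Wingersky recursion for the conditional number-correct distribution."""
    dist = [1.0]
    for a1, a2, b in items:
        p = prob(a1, a2, b, t1, t2)
        new = [0.0] * (len(dist) + 1)
        for x, pr in enumerate(dist):
            new[x] += pr * (1.0 - p)
            new[x + 1] += pr * p
        dist = new
    return dist

def dist_given_t1(items, t1, rho=0.5):
    """Eq. (16): average the conditional distributions over a normal
    quadrature grid for theta2 | theta1 (bivariate normal abilities)."""
    mean, sd = rho * t1, math.sqrt(1.0 - rho ** 2)
    grid = [mean + sd * z for z in (-2.0, -1.0, 0.0, 1.0, 2.0)]
    w = [0.05, 0.25, 0.4, 0.25, 0.05]           # crude normal weights
    out = [0.0] * (len(items) + 1)
    for t2, wt in zip(grid, w):
        for x, p in enumerate(score_dist(items, t1, t2)):
            out[x] += wt * p
    return out

def equate(dist_x, dist_y):
    """Eq. (17): discrete equipercentile map y -> x between score distributions."""
    Fx = [sum(dist_x[: i + 1]) for i in range(len(dist_x))]
    Fy = [sum(dist_y[: i + 1]) for i in range(len(dist_y))]
    return [next(x for x, f in enumerate(Fx) if f >= fy - 1e-12) for fy in Fy]

test_x = [(1.2, 0.3, -0.5 + 0.1 * i) for i in range(10)]  # hypothetical items
test_y = [(1.0, 0.4, -0.3 + 0.1 * i) for i in range(10)]
for t1 in (-1.0, 0.0, 1.0):
    print(t1, equate(dist_given_t1(test_x, t1), dist_given_t1(test_y, t1)))
```

An operational implementation would use a finer quadrature and the usual continuization of the discrete score distributions before inverting F.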
Proposition 6. For a population of examinees P and tests X and Y with concurrent fit to the multidimensional model in (7), the families of conditional distributions {X | θ1} and {Y | θ1} are identically ordered in θ1 if {Θ2 | θ1} is stochastically ordered in θ1.

Proof. The proposition follows from Proposition 5 above and Lemma 10 in van der Linden (1998a). []
If {Θ2 | θ1} is stochastically ordered in θ1, high values of θ2 tend to go with high values of θ1. This property is necessary to maintain stochastic order. If high values of θ1 went with low values of θ2, the two abilities would have a compensatory impact on X and Y, and the order in the distributions of X and Y given θ1 would be lost. Observe that the condition in this proposition is a condition on P, not on the response model. The wish to equate observed scores in the presence of nuisance abilities thus involves a price. An empirical example of the conditional equating transformations in (17) is given in Figure 8. The transformations were calculated for two 25-item tests taken from a mathematics item pool from the ACT Assessment Program (AAP). The pool had 176 items, all fitting the two-dimensional model in (7). Unlike the previous unidimensional examples, the tests were now selected to have minimum (test X) and maximum (test Y) difficulty among the items in the pool. This difference may explain the much larger differences in equated scores between more and less able examinees. The estimated, weighted, and posterior expected conditional equating transformations for the same two tests are shown in Figures 9 through 11. The estimated and posterior expected transformations show much less variation across ability levels than the conditional transformations in Figure 8. This shrinkage is due to the combination of Bayesian methods and a test length that is generally short for estimation in a multidimensional response model.

FIGURE 8. Conditional equating transformations for two tests from the AAP given θ1 (intentional ability).
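Both directions of this argument can be checked numerically: with a positive correlation between θ2 and θ1 the marginalized distribution functions given θ1 stay ordered, whereas a strongly negative correlation lets the nuisance ability compensate and breaks the ordering. A sketch under a bivariate normal ability distribution, with hypothetical item parameters rather than the AAP pool:

```python
import math

def prob(a1, a2, b, t1, t2):
    """Two-dimensional 2-PL response surface as in (7)."""
    return 1.0 / (1.0 + math.exp(-(a1 * t1 + a2 * t2 - b)))

def score_dist(items, t1, t2):
    """Lord-Wingersky recursion for the conditional number-correct distribution."""
    dist = [1.0]
    for a1, a2, b in items:
        p = prob(a1, a2, b, t1, t2)
        new = [0.0] * (len(dist) + 1)
        for x, pr in enumerate(dist):
            new[x] += pr * (1.0 - p)
            new[x + 1] += pr * p
        dist = new
    return dist

def cdf_given_t1(items, t1, rho):
    """Distribution function of X given theta1, marginalized over
    theta2 | theta1 ~ N(rho * theta1, 1 - rho^2) on a crude 5-point grid."""
    sd = math.sqrt(1.0 - rho ** 2)
    grid = [(rho * t1 + sd * z, w) for z, w in
            [(-2, .05), (-1, .25), (0, .4), (1, .25), (2, .05)]]
    pmf = [0.0] * (len(items) + 1)
    for t2, w in grid:
        for x, p in enumerate(score_dist(items, t1, t2)):
            pmf[x] += w * p
    out, c = [], 0.0
    for p in pmf:
        c += p
        out.append(c)
    return out

# items that load weakly on theta1 and strongly on the nuisance theta2
items = [(0.3, 1.5, 0.2 * i - 1.0) for i in range(10)]

# positive dependence of theta2 on theta1: the order in theta1 is maintained
F0, F1 = cdf_given_t1(items, 0.0, 0.7), cdf_given_t1(items, 1.0, 0.7)
assert all(h <= l + 1e-12 for l, h in zip(F0, F1))

# strong negative dependence: the nuisance ability compensates and the
# increasing order in theta1 is lost
G0, G1 = cdf_given_t1(items, 0.0, -0.9), cdf_given_t1(items, 1.0, -0.9)
assert any(h > l + 1e-9 for l, h in zip(G0, G1))
print("order kept for rho = .7, broken for rho = -.9")
```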
6.3. Case 3: Composite Ability

We now assume that the use of observed scores in the presence of multiple abilities is motivated by an interest in a composite. The two-dimensional case is discussed. The appropriate approach seems to be to model the composite explicitly as a linear combination of θ1 and θ2. This linear combination can then be considered as our intentional ability, and the previous case applies. The following change of variables is needed:

ζ1 = λθ1 + (1 − λ)θ2,  0 < λ < 1,    (18)

ζ2 = θ2.    (19)
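For 0 < λ < 1 the change of variables in (18)-(19) is invertible, with θ1 = (ζ1 − (1 − λ)ζ2)/λ; this inverse is exactly what is substituted in the next equation. A minimal sketch (the value of λ and the ability draws are assumed for illustration, not taken from the AAP example):

```python
import random

LAM = 0.5   # weight of theta1 in the composite, 0 < lambda < 1

def to_composite(t1, t2, lam=LAM):
    """Eq. (18)-(19): map (theta1, theta2) to (zeta1, zeta2)."""
    return lam * t1 + (1.0 - lam) * t2, t2

def from_composite(z1, z2, lam=LAM):
    """Inverse map: recover (theta1, theta2) from (zeta1, zeta2)."""
    return (z1 - (1.0 - lam) * z2) / lam, z2

random.seed(7)
for _ in range(1000):
    t1, t2 = random.gauss(0, 1), random.gauss(0, 1)
    z1, z2 = to_composite(t1, t2)
    r1, r2 = from_composite(z1, z2)
    assert abs(r1 - t1) < 1e-12 and abs(r2 - t2) < 1e-12
print("change of variables is invertible for 0 < lambda < 1")
```

At λ = 1 the composite reduces to θ1 and Case 2 is recovered; at λ = 0 the map is no longer invertible in θ1, which is why (18) requires λ strictly between 0 and 1.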
FIGURE 9. Conditional equating transformations at estimates of θ1 (intentional ability) for two tests from the AAP.

FIGURE 10. Posterior expected conditional equating transformations given θ1 for two tests from the AAP.

FIGURE 11. Marginal equating transformation (a) and weighted average of conditional equating transformations (b) given θ1 for two tests from the AAP.
Equating is based on the conditional distributions of X and Y given ζ1. The distribution of X given ζ1 and ζ2 is defined by

f_{X|ζ1,ζ2}(x) = f_{X|θ1,θ2}(x; (ζ1 − (1 − λ)ζ2)λ^{-1}, ζ2).    (20)

F has elements

e(y; ζ1) = F_{X|ζ1}^{-1}(F_{Y|ζ1}(y)).    (21)
We now have to make sure that the distributions in (16) display the required stochastic order. This result is established by the following proposition:
Proposition 7. For a population of examinees P and tests X and Y with concurrent fit to the multidimensional model in (7), the families of distributions {X | ζ1} and {Y | ζ1} are identically ordered in ζ1 if {Θ2 | θ1} is stochastically ordered in θ1.

Proof. For 0 < λ < 1, ζ1 is a monotonic function of θ1. Also, ζ2 = θ2. Thus, {Z2 | ζ1} is stochastically ordered in ζ1. As F_{X|θ1,θ2}(x) is monotonically increasing in θ2 for fixed θ1, it follows from (20) that F_{X|ζ1,ζ2}(x) has this property in ζ1 for fixed ζ2. Now

F_{X|ζ1}(x) = ∫ F_{X|ζ1,ζ2}(x) f_{Z2|ζ1}(ζ2) dζ2.    (22)

Because {Z2 | ζ1} is stochastically ordered and F_{X|ζ1,ζ2}(x) monotonically increasing in ζ1, F_{X|ζ1}(x) is also increasing in ζ1 (Lehmann, 1986, p. 116, Exercise 5). Likewise, F_{Y|ζ1}(y) increases in ζ1. Since ζ1 is a common parameter, the required result follows. []

Figure 12 shows the conditional observed-score distributions on test Y from the previous tests from the AAP pool given ζ1 = −2.0(.25)2.0 for λ = .5. The distributions were calculated using a standard bivariate normal distribution of (θ1, θ2), which has the required property of stochastic order of {Θ2 | θ1} in θ1. The distributions possess the feature claimed in Proposition 7.

FIGURE 12. Graphical illustration of stochastic order among conditional observed-score distributions on a test from the AAP given ζ1 (composite ability) for λ = .5.

7. Concluding Comment

This paper addresses the current practice of equipercentile equating using marginal observed-score distributions. Though other forms of observed-score equating are practiced, most of them are special cases of equipercentile equating obtained when special conditions on the marginal score distributions hold. For example, another popular form of equating is linear equating (Kolen & Brennan, 1995, chap. 2). Linear equating follows if the marginal distributions of X and Y are members of a location-scale family. The recommendation to replace marginal by conditional equating thus also applies to these special cases. However, assumptions that hold easily for marginal distributions of test scores may be difficult to satisfy for conditional distributions. For example, linear equating on conditional distributions is always incorrect because these distributions are generalized binomial and not determined by a location and scale parameter.

References

Braun, H.I., & Holland, P.W. (1982). Observed score test equating: A mathematical analysis of some ETS equating procedures. In P.W. Holland & D.B. Rubin (Eds.), Test equating (pp. 9-49). New York, NY: Academic Press.
Campbell, N.R. (1928). An account of the principles of measurement and calculation. London: Longmans, Green & Co.
Cizek, G.J., Kenney, P.A., Kolen, M.J., Peters, C.W., & van der Linden, W.J. (1999). The feasibility of linking scores on the proposed Voluntary National Test and the National Assessment of Educational Progress [Final report]. Washington, DC: National Assessment Governing Board.
Dorans, N.J. (1999). Correspondences between ACT and SAT I scores (College Board Rep. No. 99-1). New York, NY: College Entrance Examination Board.
DuBois, P.H. (1970). A history of psychological testing. Boston, MA: Allyn & Bacon.
Feuer, M.J., Holland, P.W., Green, B.F., Bertenthal, M.W., & Hemphill, F.C. (Eds.). (1999). Uncommon measures: Equivalence and linkage among educational tests. Washington, DC: National Academy Press.
Glas, C.A.W. (1992). A Rasch model with a multivariate distribution of ability. In M. Wilson (Ed.), Objective measurement: Theory into practice (Vol. 1, pp. 236-260). Norwood, NJ: Ablex.
Grayson, D.A. (1988). Two-group classification in latent trait theory: Scores with monotone likelihood ratio. Psychometrika, 53, 383-392.
Harris, D.B., & Crouse, J.D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6, 195-240.
Holland, P.W., & Rubin, D.B. (Eds.). (1982). Test equating. New York, NY: Academic Press.
Junker, B.W., & Sijtsma, K. (2000). Latent and manifest monotonicity in item response models. Applied Psychological Measurement, 24, 65-81.
Kolen, M.J., & Brennan, R.L. (1995). Test equating: Methods and practices. New York, NY: Springer-Verlag.
Koretz, D.M., Bertenthal, M.W., & Green, B.F. (Eds.). (1999). Embedded questions: The pursuit of a common measure in uncommon tests. Washington, DC: National Academy Press.
Lehmann, E.L. (1986). Testing statistical hypotheses (2nd ed.). New York, NY: Wiley & Sons.
Linn, R.L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6, 83-102.
Liou, M., & Cheng, P.E. (1995). Asymptotic standard error of equipercentile equating. Journal of Educational and Behavioral Statistics, 20, 119-136.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F.M. (1982). The standard error of equipercentile equating. Journal of Educational Statistics, 7, 165-174.
Lord, F.M., & Wingersky, M.S. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings". Applied Psychological Measurement, 8, 452-461.
Mislevy, R.J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: Educational Testing Service.
Morris, C.N. (1982). On the foundations of test equating. In P.W. Holland & D.B. Rubin (Eds.), Test equating (pp. 169-191). New York, NY: Academic Press.
Pashley, P.J., & Phillips, G.W. (1993). Towards world-class standards: A research study linking international and national assessments. Princeton, NJ: Educational Testing Service, Center for Educational Progress.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Suppes, P., & Zinnes, J.L. (1963). Basic measurement theory. In R.D. Luce, R.R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 1-76). New York, NY: Wiley & Sons.
van der Linden, W.J. (1996). Assembling tests for the measurement of multiple abilities. Applied Psychological Measurement, 20, 373-388.
van der Linden, W.J. (1998a). Stochastic order in dichotomous item response models for fixed, adaptive, and multidimensional tests. Psychometrika, 63, 211-226.
van der Linden, W.J. (1998b). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211.
van der Linden, W.J. (in press). Adaptive testing with equated number-correct scoring. Applied Psychological Measurement, 25.
van der Linden, W.J., & Luecht, R.M. (1998). Observed-score equating as a test assembly problem. Psychometrika, 63, 401-418.
van der Linden, W.J., & Vos, H.J. (1996). A compensatory approach to optimal selection with mastery scores. Psychometrika, 61, 155-172.
Wilk, M.B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55, 1-17.
Williams, V., Billaud, L., Davis, D., Thissen, D., & Sanford, E. (1995). Projecting to the NAEP scale: Results from the North Carolina end-of-grade testing program (Technical Rep. No. 34). Chapel Hill, NC: University of North Carolina, National Institute of Statistical Sciences.
Yen, W. (1983). Tau-equivalence and equipercentile equating. Psychometrika, 48, 353-369.
Zeng, L., & Kolen, M.J. (1995). An alternative approach for IRT observed-score equating of number-correct scores. Applied Psychological Measurement, 19, 231-240.