PSYCHOMETRIKA--VOL. 68, NO. 1, 123-149
MARCH 2003

CLASSICAL TEST THEORY AS A FIRST-ORDER ITEM RESPONSE THEORY: APPLICATION TO TRUE-SCORE PREDICTION FROM A POSSIBLY NONPARALLEL TEST

PAUL W. HOLLAND

EDUCATIONAL TESTING SERVICE

MACHTELD HOSKENS

CTB-MCGRAW HILL
We give an account of Classical Test Theory (CTT) in terms of the more fundamental ideas of Item Response Theory (IRT). This approach views classical test theory as a very general version of IRT, and the commonly used IRT models as detailed elaborations of CTT for special purposes. We then use this approach to CTT to derive some general results regarding the prediction of the true-score of a test from an observed score on that test as well as from an observed score on a different test. This leads us to a new view of linking tests that were not developed to be linked to each other. In addition we propose true-score prediction analogues of the Dorans and Holland measures of the population sensitivity of test linking functions. We illustrate the accuracy of the first-order theory using simulated data from the Rasch model, and illustrate the effect of population differences using a set of real data.

Key words: test theory, true scores, best linear predictors, test linking, nonparallel tests, simulation, Rasch Model.
1. Introduction, Basic Notation and the General IRT Model

This paper has two tasks. The first is to give an account of classical test theory (CTT) that shows how it can be viewed as a mean and variance (i.e., first-order) approximation to a very general version of item response theory (IRT). This connects CTT more closely to IRT and provides simplified ways of making calculations relevant to IRT models using the easier mean and variance calculations of CTT. It is often hard to see the structure of a full IRT model through the forest of item response functions, their numerous parameters, prior ability distributions, complex estimation techniques, the computation of plausible values drawn from posterior distributions of ability, etcetera. This is not a call for the return to the simpler ideas of CTT, but rather the suggestion to use them when they can give insight into the more complex IRT calculations. Our second task is to show that this approach can bear fruit. We show how it gives insight into the problem of predicting (a) the true score of a given test, that is, "direct true-score prediction", and (b) the true scores of tests that are not necessarily parallel to the given test, that is, "indirect true-score prediction". The rest of this paper is organized as follows. In the remainder of this section we develop our notation and discuss the basic assumptions that underlie the general IRT model we assume throughout the rest of this paper. Section 2 is devoted to showing how classical test theory can be derived from our general IRT model, the main result being Theorem 4. In section 3 we give definitions of what we call direct and indirect true-score prediction in terms of our general IRT model, and introduce the idea of replacing the posterior distribution of a true-score by the best

[Footnote: This research is collaborative in every respect and the order of authorship is alphabetical. It was begun when both authors were on the faculty of the Graduate School of Education at the University of California, Berkeley. We would like to thank Neil Dorans, Skip Livingston and two anonymous referees for many suggestions that have greatly improved this paper. Requests for reprints should be sent to Paul W. Holland, Educational Testing Service, Rosedale Road 12-T, Princeton NJ 08541. E-Mail: [email protected]]

[0033-3123/2003-1/2001-0919-A $00.75/0 © 2003 The Psychometric Society]
linear predictor (BLP) of the true-score from the manifest data. We also discuss the relationship between the posterior variance of the true-score and the average prediction error of the BLP of the true-score. Section 4 applies the results of sections 2 and 3 to two related problems. In section 5 we examine some real and simulated data to show how these ideas work out in the case of the Rasch model. Finally, section 6 contains discussion and suggestions for future research.

Basic Notation. Let X and Y denote the raw test information collected using two testing instruments, which we also call X and Y. For us, X and Y denote two random vectors, each realization of which is associated with a single examinee. Underlying it all is a population P of examinees, which will not play a major role in our analysis. Instead, G will denote a subgroup or subpopulation defining variable, so that G = g indicates membership in a particular subpopulation of P denoted by g. The subpopulations defined by G will be ubiquitous throughout our analysis, while P will lie quietly in the background. To fix ideas, G could denote gender, so that the possible values of G would be G = male or G = female. In almost all of our analyses, we will be conditioning on the event that G = g, and will denote this by "|G". This paper is closely related to the work discussed in Dorans and Holland (2000) in that it is partly concerned with the effects of subgroup membership on various aspects of linking the scores on different tests. This is why we have kept the subgroup membership function, G, in the notation. It was our original intent to cover only the simple case of "fixed-length, nonadaptive tests." At this point we are not sure if our development is sufficiently rich to include adaptive tests or other cases where there are missing item responses of certain types. This is a problem for future consideration.
Once we have test data, we need to score it, so we assume that there are two real-valued "scoring functions", sx(·) for X and sy(·) for Y, with the resulting scores denoted by capital letters, that is, Sx = sx(X) and Sy = sy(Y). In practice, sx might be "number-right", "formula scores" or some weighted combination of item-scores. The same holds for sy. We regard the definition of the scoring functions as external to our analyses. The only property of them that we assume is that for each vector, X or Y, of test performance data, there is a unique score assigned to it by the scoring functions. In our notation, Sx and Sy denote random variables over P that give the scores obtained from the tests X and Y, respectively, for an examinee randomly sampled from P.

The IRT Assumptions. We have specified a notation for two tests, their test data and their scores. Now we bring unobservables or latent variables into the picture. As usual, we let θx and θy denote two latent or inherently unobservable variables that "govern" or "lie behind" the tests, X and Y. The thetas are sometimes viewed as what the two tests "measure", and they need not be the same thing in our analysis. Over the population P, we presume that sampling an examinee at random from P induces a joint distribution of the variables: X, Y, Sx, Sy, θx, θy, and G. We use this joint distribution to define the distributions of these variables as well as their means, variances, and covariances. Thus, this joint distribution on P gives meaning to formulas such as (1) through (4), below. We use P{} to denote the probability function for these random variables. The IRT model is defined in terms of P{}. In order to proceed, we make four very general assumptions about the IRT model. We list and name them first, and then discuss each one in turn.

1-DIM: θx and θy are real numbers (not vectors of real numbers).

NO DIF: For any G,

    P{X, Y | θx, θy, G} = P{X, Y | θx, θy}.                        (1)

COND IND:

    P{X, Y | θx, θy} = P{X | θx, θy} P{Y | θx, θy}.                (2)

SIMP:

    P{X | θx, θy} = P{X | θx}, and P{Y | θx, θy} = P{Y | θy}.      (3)
Assumption 1. 1-DIM. Initially, the latent variables, θx and θy, are abstract quantities with no assumed numerical properties. We clarify this by assuming the thetas are real numbers rather than vectors or abstract categories. The assumption 1-DIM is restrictive and eliminates all multidimensional IRT models for each of the tests, but it is widely assumed in practice, so we use it in this analysis. We have not explored the extent to which 1-DIM can be relaxed in the results we report below, but we recognize that this is a task for further research. The other three assumptions are often implicitly assumed to operationalize what it means for X and Y to "measure" θx and θy. We believe it is useful to make them explicit.
Assumption 2. NO DIF. This assumption is intended to apply for "any" G, which we will always interpret as any function on P that involves (a) only observable data and (b) that is not determined by the observed test data in either X or Y, but might involve some other test data as well as examinee characteristics. The observable and "nonreflexive" nature of the "legitimate" G's are important restrictions that need to be kept in mind when applying our results. We will not mention it again, but it is tacitly assumed that whenever we use the phrase "for any G", we actually mean "for any legitimate G". The assumption in (1) is that θx and θy are the only things that affect the performance recorded in X and Y. The NO DIF assumption means "no differential item functioning" in the very general sense that for given θx and θy values, group membership in the groups indicated by G has no additional influence on the performance of an examinee on these tests. Because X and Y will usually contain item-level responses, this use of the term "NO DIF" is compatible with other uses of "DIF" in the literature. The NO DIF assumption is an unstated part of many IRT analyses. Within certain IRT models, it can be tested in various ways. We do not consider testing it here, but assume NO DIF and use it in many of our calculations. We have not examined what changes in our analysis would take place if we modified this assumption to allow DIF. We point out that the roles of X and Y cannot be reversed with those of θx and θy in the NO DIF assumption. Later on we will consider the reversed or "posterior" probability, P{θx, θy | X, Y, G}, and here the effect of G can rarely be ignored in the way that it is in (1).
Assumption 3. COND IND. Mathematically this assumption states that given θx and θy, X and Y are "conditionally independent" of each other. It means that information from test X is useless for predicting performance on test Y given the two theta values for an examinee. Usually this assumption is stated in terms of "local independence" of test items within a given test once the theta values are given, but we use this version of the assumption because we never look within X or Y beyond the scores, Sx and Sy.

Assumption 4. SIMP. This assumption is related to COND IND, because it involves conditional independence as well. The first part assumes that X is independent of θy given θx, that is, that θx is "specific" to X. The second part assumes that θy is "specific" to Y in the same sense. Relative to θx and θy, the SIMP assumption asserts that the test data, X and Y, exhibit "simple structure" in the sense often used in factor analysis. For some, it helps to think of the thetas as what the observed test data "measure" and that the three assumptions, NO DIF, COND IND and SIMP, merely follow from what it ought to mean for a test to measure something. For us, these assumptions together define what it means for the thetas to "govern" or "lie behind" the observed test data. Because Sx and Sy are functions of X and Y, respectively, they may be substituted for X and Y in (1) through (4) and the resulting formulas hold as well.
The combined effect of NO DIF, COND IND and SIMP is the following basic formula that we state as a theorem to identify it. We note that this result does not depend on the dimensionality assumption, 1-DIM.

Theorem 1. Under assumptions NO DIF, COND IND and SIMP, the conditional distribution of X and Y given θx, θy and G simplifies as follows:

    P{X, Y | θx, θy, G} = P{X | θx} P{Y | θy}.                     (4)
Equation (4) is often implicit in the particular forms of likelihood functions and other important elements of IRT models applied to testing problems. These four assumptions are made time and again in the application of IRT to testing problems. Throughout this analysis, we will avoid making any additional "functional form" assumptions (i.e., Rasch model, 3PL, Partial Credit, Graded Response, etc.) that are the usual fare of IRT applications. The one exception, which we make a bit further along, is a very mild restriction on the functional form of the IRT model that is satisfied by every IRT model in common use. We will show later that Classical Test Theory can be viewed as a "mean and variance" approximation to this very general class of IRT models. In Appendices A and B we summarize two other mathematical results that we also need for the derivations in this paper: one on using conditioning to calculate first and second moments and the other on best linear predictors. The results in these two appendices are well known and, along with the IRT assumptions that we have just stated, are the only tools we use here.
Reparameterizing the Thetas in Terms of True Scores. The abstract nature of θx and θy makes them somewhat difficult to discuss. We wish to avoid this and now introduce the "true-score" reparameterization of the thetas. This reparameterization makes it easier to think about what the latent variables are and will lead us to connect the general IRT model described above to Classical Test Theory. We define the "true scores", τx and τy, in the usual way by:

    τx = τx(θx) = E(Sx | θx),                                      (5)

and

    τy = τy(θy) = E(Sy | θy).                                      (6)

We note that due to the NO DIF assumption we also have

    τx = E(Sx | θx, G), and τy = E(Sy | θy, G),                    (7)

for any choice of G. The functions, τx = τx(θx) and τy = τy(θy), reparameterize the abstract latent quantities, θx and θy, into new latent quantities that are in the range of the values assigned by the scoring functions Sx = sx(X) and Sy = sy(Y). Thus, the τ's are equivalent one-dimensional reparameterizations of the θ's and have units (i.e., X- or Y-score points) that are, in some ways, more understandable than the "logits" or "probits" of the theta scales. In special cases, the functions τx = τx(θx) and τy = τy(θy) are called the "test characteristic functions" of X and Y, respectively. In order for this reparameterization from the θ's to the τ's to be useful we need to make one further assumption about the IRT model.
Assumption 5. CSI. The functions, τx(θx) and τy(θy), in (5) and (6) are continuous and strictly increasing functions of θx and θy.
The CSI assumption allows τx and τy to be reparameterizations of θx and θy with no loss of information between the θ's and τ's. The CSI condition always holds for the scoring functions and the IRT models that are widely used in practice. CSI is the "mild restriction" on the functional form of the IRT models that was mentioned earlier. In our development of CTT, we will reduce the joint distribution of X, Y, Sx, Sy, θx, θy, and G to the joint distribution of Sx, Sy, τx, τy, and G. For example, the equations (1) through (4) may be replaced, without change, by the same formulas where θx and θy are replaced by τx and τy, and X and Y are replaced by Sx and Sy. In what follows we will assume that we have reparameterized the latent quantities into the corresponding "true scores", that is, the τ's, and will ignore the θ's in the rest of this paper.

2. Classical Test Theory as a First- and Second-Moment Approximation to Item Response Theory

In this section we show how to relate the general IRT model discussed in section 1 to classical test theory (CTT). As announced, we do this by showing in some detail that CTT gives an approximation to the more detailed results of IRT modeling that is accurate up to the first and second moments of the score distributions. CTT is a "first-order" theory because it is primarily concerned with means and variances. As such it applies widely to any IRT model satisfying our basic assumptions, that is, to all of the models in routine use.
The Basics of CTT. In CTT, the data are reduced to the scores, Sx, Sy, and G, and the IRT model is reduced to the true-scores, τx and τy, and their distribution over the relevant subpopulations of P. In the course of our development, we repeatedly use the assumptions NO DIF, COND IND and SIMP.

Definition 1. The "error term". The most basic equation of CTT is the formula:

    Sx = τx + ex.                                                  (8)

Equation (8) will automatically hold in our development because we define the "error term", ex, by

    ex = Sx − τx.                                                  (9)

We begin our analysis with an examination of the conditional mean and variance of Sx given τx and G. Because of the NO DIF assumption we can drop the conditioning on G, so we examine the moments of Sx given its true score, τx. By definition we have

    E(Sx | τx) = τx,                                               (10)

and we define the conditional variance of Sx to be

    Var(Sx | τx) = σ²_Sx(τx).                                      (11)
Once we have these two conditional moments of Sx, we can study the corresponding moments of ex, both conditionally given the true score, τx, and marginally, where τx is averaged out. The basic results are summarized in Theorem 2.
Theorem 2.

(a) E(ex | τx) = 0,                                                (12)

so that

    E(ex | G) = 0, for any G.                                      (13)

In addition,

(b) Var(ex | τx) = σ²_Sx(τx) = Var(Sx | τx)                        (14)

(note that (14) shows that σ_Sx(τx) is the conditional standard error of measurement of Sx), and

(c) σ²_ex|G = Var(ex | G) = E[σ²_Sx(τx) | G] = E[Var(Sx | τx) | G].    (15)

Proof. We outline the proof of Theorem 2 to show how the definitions and assumptions we have made work together. Part (a) follows from:

    E(ex | τx) = E(Sx − τx | τx) = E(Sx | τx) − E(τx | τx) = τx − τx = 0.

Finally, (c) follows from

    Var(ex | τx) = Var(Sx − τx | τx) = Var(Sx | τx) = σ²_Sx(τx)

and the fact that

    Var(ex | G) = E[Var(ex | G, τx) | G] + Var[E(ex | G, τx) | G]
                = E[σ²_Sx(τx) | G] + Var[0 | G] = E[σ²_Sx(τx) | G].    □
In this derivation, we used Theorem A, parts (b) and (c), from Appendix A as well as the NO DIF assumption. Theorem 2 shows that in this first-order IRT, the conditional mean of the error given the true score is constant and equal to 0 for any value of τx, but that, in general, the conditional variance of the error term given the true-score is not constant. Equation (14) shows that the conditional variance (given τx) of the error term, ex, and of the observed score, Sx, are the same. In addition, it is a truism that Var(τx | τx) = 0, so that we have

    Var(Sx | τx) = Var(τx | τx) + Var(ex | τx).                    (16)

However, the important formula relating the "observed score variance" to the sum of the "true-score variance" and the "error variance" is actually a statement about the marginal (given G) variances of Sx, τx, and ex. Theorem 3 summarizes the basic results for the marginal variances and covariances of Sx, τx, and ex.
Theorem 3. The error term and true score are uncorrelated, that is,

(a) Cov(ex, τx | G) = 0,

from which it follows that

(b) Cov(ex, Sx | G) = Cov(ex, ex | G) = Var(ex | G) = σ²_ex|G,

(c) Cov(Sx, τx | G) = Cov(τx, τx | G) = Var(τx | G) = σ²_τx|G,

and

(d) σ²_Sx|G = σ²_ex|G + σ²_τx|G.                                   (17)
Proof. Part (a) is the key result and the rest follows from it. Part (a) follows from

    Cov(ex, τx | G) = E(ex τx | G) − E(ex | G) E(τx | G)
                    = E[E(ex τx | τx, G) | G] − 0
                    = E[τx E(ex | τx, G) | G] = E[τx · 0 | G] = 0.    □
In CTT, the "unconditional" standard error of measurement is σ_ex|G, and, from (14), the "conditional" standard error of measurement is σ_ex(τx) = σ_Sx(τx). These are not the same things; the former is a "summary average" of the latter, as we see in (15). Using (17) we may define the usual CTT form of the "reliability" as the ratio of (marginal) true-score variance to the (marginal) total variance, except that because we condition on G, it is a "conditional reliability" that depends, in general, on the subpopulation defined by G. Reliability is, as usual,

    ρ²_Sx|G = σ²_τx|G / σ²_Sx|G = (σ²_Sx|G − σ²_ex|G) / σ²_Sx|G = 1 − σ²_ex|G / σ²_Sx|G.    (18)
In the following development we will see that this formula for the reliability of Sx, ρ²_Sx|G, plays its usual role in our version of CTT.
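To make (18) concrete, here is a hedged sketch under an assumed Rasch model with 20 hypothetical item difficulties (the setup and numbers are ours, for illustration only). It follows the model-based route the paper describes later: Var(S | θ) is the sum of the item Bernoulli variances p(1 − p), the error variance is its average over the ability distribution as in (15), and (18) then gives the reliability.

```python
import math
import random
import statistics

# Assumed Rasch model with hypothetical item difficulties (illustration).
random.seed(1)
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5] * 4  # 20 items

def p_correct(theta, b):
    # Rasch item response function
    return 1.0 / (1.0 + math.exp(-(theta - b)))

scores, cond_vars = [], []
for _ in range(20_000):
    theta = random.gauss(0, 1)
    ps = [p_correct(theta, b) for b in difficulties]
    scores.append(sum(random.random() < p for p in ps))
    cond_vars.append(sum(p * (1 - p) for p in ps))  # Var(S | theta)

var_e = statistics.fmean(cond_vars)   # E[Var(S | theta) | G], as in (15)
var_s = statistics.pvariance(scores)  # total observed-score variance
reliability = 1 - var_e / var_s       # equation (18)
print(round(reliability, 2))
```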
A first-order item response theory. The first-order IRT that we will discuss involves only the joint distribution of (Sx, Sy, τx, τy) conditional on G up to its first and second moments. All of the other details of this distribution are suppressed in this first-order theory. Theorem 4 gives all of the first and second moments that are relevant to any IRT model that involves two tests, X and Y, satisfying the five IRT assumptions defined in section 1.

Theorem 4. If X and Y are two tests satisfying the five IRT assumptions (1-DIM, NO DIF, COND IND, SIMP and CSI) then (a) the mean vector of (Sx, Sy, τx, τy) given G is:

    E[(Sx, Sy, τx, τy) | G] = (μ_Sx|G, μ_Sy|G, μ_Sx|G, μ_Sy|G),    (19)

and (b) the variances, correlations and covariances of (Sx, Sy, τx, τy) given G are given in the following table, where the variances are on the diagonal, the covariances are above the diagonal and the correlations are below it:

Variances, covariances and correlations of (Sx, Sy, τx, τy) given G

          Sx                         Sy                      τx              τy
    Sx    σ²_Sx|G                    σ_τxτy|G                σ²_τx|G         σ_τxτy|G
    Sy    ρ_Sx|G ρ_Sy|G ρ_τxτy|G     σ²_Sy|G                 σ_τxτy|G        σ²_τy|G
    τx    ρ_Sx|G                     ρ_Sy|G ρ_τxτy|G         σ²_τx|G         σ_τxτy|G
    τy    ρ_Sx|G ρ_τxτy|G            ρ_Sy|G                  ρ_τxτy|G        σ²_τy|G

The argument for the means is easy, that is,

    μ_Sx|G = E(Sx | G) = E[E(Sx | G, τx) | G] = E[τx | G] = μ_τx|G,
and

    μ_Sy|G = E(Sy | G) = E[E(Sy | G, τy) | G] = E[τy | G] = μ_τy|G.

Thus the conditional mean vector of (Sx, Sy, τx, τy), E[(Sx, Sy, τx, τy) | G], is

    (μ_τx|G, μ_τy|G, μ_τx|G, μ_τy|G) = (μ_Sx|G, μ_Sy|G, μ_Sx|G, μ_Sy|G).

We also show how two of the covariance-matrix expressions are derived to illustrate our analysis. The covariance between Sx and Sy is a good case in point.

    Cov(Sx, Sy | G) = E[Cov(Sx, Sy | G, τx, τy) | G] + Cov[E(Sx | G, τx, τy), E(Sy | G, τx, τy) | G]
                    = E[0 | G] + Cov[τx, τy | G] = σ_τxτy|G.

In this derivation we used Theorem A, part (d), in Appendix A and all three of the IRT assumptions, NO DIF, COND IND and SIMP. The corresponding correlation computation is

    Correl(Sx, Sy | G) = σ_SxSy|G / (σ_Sx|G σ_Sy|G) = σ_τxτy|G / (σ_Sx|G σ_Sy|G)
                       = (σ_τxτy|G / (σ_τx|G σ_τy|G)) (σ_τx|G / σ_Sx|G)(σ_τy|G / σ_Sy|G)
                       = ρ_Sx|G ρ_Sy|G ρ_τxτy|G,

using the definition of the reliabilities given earlier. Another interesting case is the covariance between Sy and τx.

    Cov(τx, Sy | G) = E[Cov(τx, Sy | G, τx, τy) | G] + Cov[E(τx | G, τx, τy), E(Sy | G, τx, τy) | G]
                    = E[0 | G] + Cov[τx, τy | G] = σ_τxτy|G.

In this derivation we also made use of the fact that a random variable given itself (i.e., τx given τx) is a constant and has zero covariance with any other variable. We hope this is enough detail to clarify how to calculate the entries in the covariance matrix of Theorem 4.    □

Theorem 4 summarizes all the means, correlations and covariances we need to compute all of the quantities of interest to us in our first-order IRT. We first want to illustrate how Theorem 4 can be used to give all of the usual results of CTT. We do this by considering three examples: the disattenuation formula, the interpretation of reliability as the correlation between parallel tests, and the Spearman-Brown formula for predicting the reliability of a whole test from half tests.

Disattenuation. This is the relationship between the correlation of the two observed scores, Sx and Sy, and the correlation of the two true-scores, τx and τy. From the covariance matrix in Theorem 4 we can read off

    ρ_SxSy|G = ρ_Sx|G ρ_Sy|G ρ_τxτy|G,                             (20)

and the usual disattenuation formula is easy to derive from (20), that is,

    ρ_τxτy|G = ρ_SxSy|G / (ρ_Sx|G ρ_Sy|G).                         (21)
Reliability and Parallel Tests. Suppose Sx and Sy are such that their true scores are perfectly correlated, that is, congeneric, so that

    ρ_τxτy|G = 1,                                                  (22)

and furthermore, suppose that Sx and Sy are equally reliable in the sense that

    ρ²_Sx|G = ρ²_Sy|G.                                             (23)

Then (21) is easily rearranged to show that

    ρ²_Sx|G = ρ_SxSy|G.                                            (24)
Thus, the reliability of Sx is the correlation between Sx and a "parallel" (i.e., congeneric and equally reliable) measure, Sy.
The Spearman-Brown Formula. In this case the score for the "whole test" is the sum, Sx + Sy, where X and Y are congeneric and of equal reliability, that is, (22) and (23) above. In addition we assume that X and Y are given equal weight in the sense that their standard deviations are equal, that is,

    σ_Sx = σ_Sy.                                                   (25)

When (23) holds, (25) is equivalent to the assumption of equal errors of measurement, σ_ex = σ_ey. When (22), (23) and (25) are satisfied then the familiar Spearman-Brown formula holds:

    ρ²_(Sx+Sy)|G = 2 ρ²_Sx|G / (1 + ρ²_Sx|G).                      (26)
Estimating error variances. One of the great triumphs of early psychometric theory was the discovery of ways to estimate the "parallel forms" reliability of a test from a single administration of the test rather than the administration of two tests. An early approach used the "split-halves" method to correlate the scores on "parallel half-forms" of the test and then used the Spearman-Brown formula to "step up" the correlation between the half-forms to that of parallel full-forms. This can be interpreted as an attempt to implement (24) directly. However, from the IRT perspective taken here, the natural approach to estimating the reliability of a test is through model-based estimates of the error variance, Var(ex | G), using the relationship σ²_ex|G = E[σ²_Sx(τx) | G]. Once an estimate of σ²_ex|G is in hand, it can be combined with the sample estimate of the total score variance and equation (18) to estimate ρ²_Sx|G without any direct reference to parallel forms of the test. There are various routes one might take to estimate the error variance. A simple one, which we use in our example section, is to use a specific IRT model to estimate the function σ²_Sx(θx) = Var(Sx | θx). This variance function will depend on the form of the IRT model assumed and the form of the scoring function. In our case we used the sum of the individual item conditional variances. In addition, the distribution of θx for the subgroup defined by G will need to be estimated. Existing computer programs for IRT analyses can provide both of these estimates. The "error-term variance", σ²_ex|G, is then computed by averaging the estimate of σ²_Sx(θx) over the estimated distribution of θx.

3. Direct and Indirect Prediction of True Scores from Observed Scores

Many constituencies wish to link the scores on tests that were never developed to be linked to other tests.
For example, one might link the scores of a State's standardized assessment to the NAEP scale in order to give wider interpretability to the State's testing results, or link a state's high-school exit exam to the College Board's SAT I M and V scores to avoid taking the real SAT I. One of us (PWH) chaired a National Research Council (NRC) committee that made recommendations about the feasibility of linking tests in this more general setting (Feuer, Holland, Green,
Bertenthal, & Hemphill, 1999). It was partly this experience that led to the research reported here. The NRC committee's findings were pessimistic, but the committee also concluded that quantitative knowledge was lacking about the tradeoffs that linking different types of tests would entail. We hope that this paper and related research, such as Dorans and Holland (2000), will help to clarify some of the technical considerations that must be faced in providing such information.
Direct True-Score Prediction. If we start with an examinee's score on X, Sx, and his or her group membership as indexed by the value of G, and we then want to make an inference about the unobserved true-score, τx, for that examinee, the standard way to proceed is to report the posterior probability distribution of τx, given by

    P{τx | Sx, G} = (P{Sx | τx} / P{Sx | G}) P{τx | G}.            (27)

In (27) we have invoked the NO DIF assumption, which simplifies the numerator of the ratio from P{Sx | τx, G} to P{Sx | τx}. The posterior distribution in (27) summarizes what is known about the latent true-score given the observed test performance (summarized by Sx) and whatever else we know about the examinee (summarized here by the examinee's value of G). We note that the score, Sx, can be replaced by the entire set of test data, X, in (27), but for our purposes here we will reduce all the test data to the scores. We will call this summarizing of what is known about the latent true-score of a test given the observed score on that same test, and perhaps other information such as G, "direct true-score prediction", to reflect the "direct" connection between a latent true-score and its corresponding observed-scores. Later we will consider "indirect true-score prediction". The posterior distribution, P{τx | Sx, G} in (27), can be very complex and hard to calculate. In addition, in the form specified by (27) it gives little insight into its detailed structure. Therefore, it is sometimes useful to summarize the full posterior distribution by its first and second moments, that is, by

    E(τx | Sx, G)  and  Var(τx | Sx, G).                           (28)
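As a toy illustration of (27) and the moments in (28), consider a hypothetical 10-item test with a binomial P{S | τ} and an assumed uniform discrete prior P{τ | G} over a grid of true-score values (the model and prior are ours, chosen for simplicity, not the paper's).

```python
import math

# Hypothetical 10-item test; binomial likelihood, uniform discrete prior.
n = 10
grid = [i / 10 for i in range(1, 10)]  # tau / n on 0.1 .. 0.9
prior = [1.0 / len(grid)] * len(grid)

def likelihood(s, p):
    return math.comb(n, s) * p ** s * (1 - p) ** (n - s)

def posterior(s):
    # P{tau | S, G} proportional to P{S | tau} P{tau | G}, as in (27)
    w = [likelihood(s, p) * pr for p, pr in zip(grid, prior)]
    z = sum(w)
    return [wi / z for wi in w]

post = posterior(7)  # an examinee with observed score S = 7
post_mean = sum(pi * n * p for pi, p in zip(post, grid))  # E(tau | S, G)
post_var = sum(pi * (n * p - post_mean) ** 2
               for pi, p in zip(post, grid))              # Var(tau | S, G)
print(round(post_mean, 2), round(post_var, 2))
```

The posterior mean shrinks the observed 7 toward the prior; its square-rooted variance plays the role of the prediction error discussed next.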
The posterior mean in (28) is a prediction of τx based on the information that is available from the examinee. The posterior variance (or better yet, its square root, the posterior standard deviation) in (28) is a measure of the error of this prediction. In terms of true-score prediction, an "inference" about an examinee is a prediction of his or her value of τx along with a measure of the prediction error. Some authors call these considerations "true-score estimation", but we follow Holland (1990) and call it "true-score prediction". The posterior mean and variance in (28) can also be difficult to calculate even though they are simplified summaries of the full posterior distribution in (27). They can be approximated by using the best linear predictor (BLP) of τx from Sx and G and the average prediction error of the BLP, which is more carefully described in Appendix B. We denote the BLP of τx from Sx and G by:

    L(τx | Sx, G) = α_G + β_G Sx,                                  (29)

where α_G and β_G may be calculated using Theorem B in Appendix B. We use the notation L(τx | Sx, G) in order to make the best linear predictor appear formally like the conditional mean that it approximates. Using the results of Appendix B we see that

    L(τx | Sx, G) = μ_τx|G + ρ_τxSx|G (σ_τx|G / σ_Sx|G)(Sx − μ_Sx|G),    (30)
and then, using the formulas in Theorem 4, Equation (30) becomes

    L(τx | Sx, G) = μ_Sx|G + ρ²_Sx|G (Sx − μ_Sx|G).                (31)
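Equation (31) is trivial to compute; the sketch below (with hypothetical numbers) shows its characteristic shrinkage of the observed score toward the group mean by a factor equal to the reliability.

```python
# Equation (31) in code; numbers below are hypothetical illustrations.
def blp_true_score(s_x, group_mean, reliability):
    """L(tau | S, G) = mu_Sx|G + rho^2_Sx|G * (S - mu_Sx|G)."""
    return group_mean + reliability * (s_x - group_mean)

# An observed 80 in a group with mean 60 on a test with reliability .85:
# 60 + 0.85 * (80 - 60) = 77.0
print(blp_true_score(80, group_mean=60, reliability=0.85))  # 77.0
```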
Thus, the BLP of τx from Sx and G is just the Kelley (1923) true-score estimation formula. Lord and Novick (1968, pp. 64-65) give some of the same analysis in their discussion of "the linear minimum mean squared error regression functions" and their relation to the Kelley formula for estimating true scores as a function of observed scores. Prediction also includes measures of the prediction error. When we use the posterior mean to "predict" τx then the posterior variance or its square root can be used for this purpose. When the posterior variance is too complicated to compute it can sometimes be roughly summarized by its average value over the conditional distribution of Sx given G, that is, by

    E[Var(τx | Sx, G) | G].                                        (32)
Even the calculation in (32) may be difficult to make but we may always use the results of Theorem B of Appendix B, part (e2), to show that the average prediction error of the BLP provides an upper bound on (32) that is easy to calculate and that is close to (32) when the posterior mean, E(rxISx, G), is close to being linear in Sx. Specifically we have:
E[Var(τ_X | S_X, G) | G] = σ̄²_{τ_X|S_X,G} - Var[E(τ_X | S_X, G) - L(τ_X | S_X, G) | G]
                         = σ²_{τ_X|G}(1 - ρ²_{τ_X S_X|G}) - σ²_{τ_X|G} k²_{DG},  (33)

where σ̄²_{τ_X|S_X,G} denotes the average prediction error of the BLP and

k²_{DG} = Var[ (E(τ_X | S_X, G) - L(τ_X | S_X, G)) / σ_{τ_X|G} | G ].  (34)
We can use Theorem 4 to simplify (33) to

E[Var(τ_X | S_X, G) | G] = σ²_{τ_X|G}(1 - ρ²_{S_X|G}) - σ²_{τ_X|G} k²_{DG},  (35)

or

E[Var(τ_X | S_X, G) | G] = σ²_{S_X|G} ρ²_{S_X|G}(1 - ρ²_{S_X|G}) - σ²_{τ_X|G} k²_{DG}.  (36)
If we use the BLP to predict τ_X, then the average prediction error, σ̄²_{τ_X|S_X,G}, is the proper measure of the average uncertainty of this prediction. The first terms of the right-hand sides of (35) or (36) give us easily calculated measures of this uncertainty in the BLP's prediction of τ_X. We will use these ideas later in section 4.
Conditioning on nontest information. Before we leave direct true-score prediction, we comment on a mathematical detail that has some important consequences. From Theorem A of Appendix A, part (b), it follows that

E{E(τ_X | S_X, G) | G} = E(τ_X | G),  (37)

and furthermore that

E{E(τ_X | S_X) | G} = E(τ_X | G)  (38)

does not hold, in general. The relevance of this last statement is that conditioning on both the test data, S_X, and the nontest data, G, is necessary if it is desired that the system of predictions of τ_X, E(τ_X | S_X, G), produce the average value of τ_X over the group defined by G when the predictions are themselves averaged over the distribution of observed scores in the group defined by G; that is the content of (37). If G is left out of the prediction, as it is in E(τ_X | S_X), then its average over the distribution of observed scores in the group defined by G will not, in general, equal the average true-score, E(τ_X | G). The less reliable the test and the more strongly associated G is with test performance, the larger will be the discrepancy between the average of E(τ_X | S_X) over the group determined by G and E(τ_X | G). This is discussed extensively in Mislevy et al. (1992), and is one of the reasons for the use of "conditioning" in the National Assessment of Educational Progress. The "expected a posteriori" estimate of theta, the EAP (Bock & Mislevy, 1982), is an example of such a prediction of a latent variable that does not include nontest data, that is, G, in the conditioning. Wainer et al. (2001) consider the prediction of what we would call τ_X from both S_X and S_Y, using, in our notation, L(τ_X | S_X, S_Y), the best linear predictor of τ_X from S_X and S_Y. This is also an example of not including nontest information in the prediction of τ_X. Wainer and his colleagues call this "scoring" rather than predicting, and perhaps that is a good distinction to make. When scoring a test, rather than predicting τ_X from available information, it is usually thought proper to exclude anything about the test taker from the conditioning other than his or her test data. The result will be that averaging the "scores" over subgroups of examinees will produce discrepancies between this average and the average value of τ_X over the subgroups of examinees of interest. On the other hand, averaging predictions that do include the relevant nontest information will not exhibit such discrepancies. From this we deduce that what we are calling "predicting" is not the same thing as "scoring" a test, no matter how similar they might seem to be.
Indirect True-Score Prediction. Suppose now that we are interested in the true-score, τ_X, but what is available to us is the observed-score from a different test, S_Y, as well as the information from G. This is the setting we are in when talk turns to "linking" tests that are not assumed to be parallel. Thus, the test information is only indirectly related to the true score that is the target of our prediction, and for this reason we call this the problem of "indirect true-score prediction". Following the previous discussion, we are naturally led to consider the posterior distribution, P{τ_X | S_Y, G}. There is a little more complexity to this conditional distribution than what we see in (27). We have

P{τ_X | S_Y, G} = P{S_Y | τ_X, G} P{τ_X | G} / P{S_Y | G}.  (39)

The numerator of the ratio can be reexpressed as

P{S_Y | τ_X, G} = E[ P{S_Y | τ_Y} | τ_X, G ].  (40)

In (40), we use both NO DIF and SIMP to rid the inner conditional probability of its dependence on both G and τ_X. The inner conditional probability in (40) is the usual likelihood function for S_Y, while the outer expectation is over the conditional distribution of τ_Y given both τ_X and G. Comparing (27) with (39) and (40), we see that indirect true-score prediction has a new place for dependence on G to emerge: the conditional distribution of τ_Y given τ_X and G, which is used in (40) to average over the likelihood function of S_Y, which does not depend on G. In our real data example we will show that this does occur. Again, the posterior distribution in (39) can be complicated, and in some cases it may be useful to reduce it to its posterior mean and variance:
E(τ_X | S_Y, G)  and  Var(τ_X | S_Y, G).  (41)
In turn, we can approximate the elements of (41) by the corresponding BLP, L(τ_X | S_Y, G), and its average prediction error, σ̄²_{τ_X|S_Y,G}.
Using Theorem B of Appendix B, the formula for the BLP L(τ_X | S_Y, G) is

L(τ_X | S_Y, G) = μ_{τ_X|G} + ρ_{τ_X S_Y|G} (σ_{τ_X|G} / σ_{S_Y|G}) (S_Y - μ_{S_Y|G}),  (42)
and applying Theorem 4 to this it reduces to

L(τ_X | S_Y, G) = μ_{S_X|G} + ρ_{S_Y|G} ρ_{τ_X τ_Y|G} (σ_{τ_X|G} / σ_{S_Y|G}) (S_Y - μ_{S_Y|G}),  (43)

or

L(τ_X | S_Y, G) = μ_{S_X|G} + ρ_{τ_X τ_Y|G} ρ_{S_Y|G} ρ_{S_X|G} (σ_{S_X|G} / σ_{S_Y|G}) (S_Y - μ_{S_Y|G}),  (44)

or

L(τ_X | S_Y, G) = μ_{S_X|G} + ρ_{S_X S_Y|G} (σ_{S_X|G} / σ_{S_Y|G}) (S_Y - μ_{S_Y|G}).  (45)
Equation (45) shows that the BLP L(τ_X | S_Y, G) is, in fact, the population linear regression function of S_X on S_Y, given G, that is, L(S_X | S_Y, G). At first, we were surprised by this result, but on reflection it seems intuitively plausible, or perhaps even obvious. Applying the average prediction error of the BLP to approximate the average posterior variance, using the results of Appendix B again, we get:

E[Var(τ_X | S_Y, G) | G] = σ̄²_{τ_X|S_Y,G} - Var[E(τ_X | S_Y, G) - L(τ_X | S_Y, G) | G]
                         = σ²_{τ_X|G}(1 - ρ²_{τ_X S_Y|G}) - σ²_{τ_X|G} k²_{IG},  (46)

where

k²_{IG} = Var[ (E(τ_X | S_Y, G) - L(τ_X | S_Y, G)) / σ_{τ_X|G} | G ].  (47)

Again, applying the results of Theorem 4 to (46) we get:

E[Var(τ_X | S_Y, G) | G] = σ²_{τ_X|G}(1 - ρ²_{S_Y|G} ρ²_{τ_X τ_Y|G}) - σ²_{τ_X|G} k²_{IG},  (48)
or

E[Var(τ_X | S_Y, G) | G] = σ²_{τ_X|G}(1 - ρ²_{S_X S_Y|G} / ρ²_{S_X|G}) - σ²_{τ_X|G} k²_{IG},  (49)

or

E[Var(τ_X | S_Y, G) | G] = σ²_{S_X|G}(ρ²_{S_X|G} - ρ²_{S_X S_Y|G}) - σ²_{τ_X|G} k²_{IG}.  (50)
Note that the leading term of (50) is smaller than the average residual variance in linear regression, because the prediction error for true scores is smaller than for their observed scores. Again, the leading terms of (49) and (50) give the average prediction errors of this "indirect" BLP of τ_X. We will use them again in section 4. Using these results, we see that the practice of using linear regression to predict one observed test score from another, as in Pashley and Phillips (1993), has a clear meaning in terms of indirect true-score prediction as defined here: namely, it is the same as the BLP L(τ_X | S_Y, G). The coefficients of the BLP all depend, in general, on G, however, and, in the use of the BLP to "project" the scores of S_Y onto the scale of S_X, care must be given to including as "predictors" of S_X the main effect of G and the interactions of G with the test score S_Y. The precision measure of the BLP given by (50) does not come out of the usual regression analysis programs and involves the reliability of S_X.
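The fact behind (45), that Cov(τ_X, S_Y | G) = Cov(S_X, S_Y | G) because the error of measurement in X is uncorrelated with S_Y, can be checked numerically. Below is a sketch with hypothetical true-score and error distributions; none of the numbers come from the paper:

```python
# Check that the slope of the BLP of tau_X from S_Y equals the slope of the
# linear regression of S_X on S_Y, which is the content of (45).
# The true-score and error distributions here are hypothetical.
import random

random.seed(1)
n = 100_000
tx_list, sx_list, sy_list = [], [], []
for _ in range(n):
    tau_x = random.gauss(0.0, 1.0)
    tau_y = 0.7 * tau_x + random.gauss(0.0, 0.5)    # a correlated true score
    tx_list.append(tau_x)
    sx_list.append(tau_x + random.gauss(0.0, 0.6))  # S_X = tau_X + error
    sy_list.append(tau_y + random.gauss(0.0, 0.6))  # S_Y = tau_Y + error

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

slope_blp = cov(tx_list, sy_list) / cov(sy_list, sy_list)  # slope of L(tau_X | S_Y)
slope_ols = cov(sx_list, sy_list) / cov(sy_list, sy_list)  # slope of L(S_X | S_Y)
print(abs(slope_blp - slope_ols) < 0.02)                   # nearly identical slopes
```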
4. Further Uses of the First-Order IRT

In this section, we will examine two further applications of the material developed in sections 2 and 3. First, we consider the increase in prediction error that arises as we move from direct to indirect prediction. Second, we develop analogues to the measures of the population dependence of equating functions introduced by Dorans and Holland (2000).

The price of indirection. We can use our results to get a measure of the increase in average prediction error that occurs when we predict the true-score of S_X from scores on a test that need not be parallel or closely related to X. We propose to use the ratio of the square roots of the average prediction errors (either the average posterior variances given in (35) and (49) or their leading terms, the prediction errors of the BLPs). This gives us the "prediction error inflation factor" given by

H = ( E[Var(τ_X | S_Y, G) | G] / E[Var(τ_X | S_X, G) | G] )^{1/2}.  (51)

H is a measure of the amount by which the average prediction error of the prediction of τ_X is inflated when we use S_Y rather than S_X to make the prediction. Indirect prediction is always worse, and H is a measure of how much worse on average. Using the square root puts the inflation factor into units that are similar to "percent increase in the standard deviation". In section 5 we examine how much the k²-factors matter in a real application.
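With the k²-terms set to 0, H reduces to a ratio involving only the two squared correlations, since σ²_{τ_X|G} cancels. A small sketch; the inputs are illustrative:

```python
# Inflation factor H of (51) from the leading terms of (35) and (49),
# with the k^2 corrections set to 0. Inputs are illustrative.
import math

def inflation_factor(rho2_direct, rho2_indirect):
    """H = sqrt[(1 - rho^2_{tau_X S_Y|G}) / (1 - rho^2_{S_X|G})]."""
    return math.sqrt((1.0 - rho2_indirect) / (1.0 - rho2_direct))

# e.g., reliability rho^2_{S_X|G} = .88 and squared correlation
# rho^2_{tau_X S_Y|G} = .60 between tau_X and the other test's score
print(round(inflation_factor(0.88, 0.60), 2))
```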
Analogues of measures of the population dependence of linking functions. Dorans and Holland (2000) propose measures of the influence of subgroup membership on observed-score linking functions. There are natural analogues of that work to true-score prediction. Since our aim here is to use the BLPs as approximations to the posterior means, we will concentrate on the BLPs in this discussion. In the Dorans and Holland approach, the linking functions computed on each subpopulation are compared to the linking function that is computed on the whole population. In our situation, this corresponds to comparing

Direct Prediction: L(τ_X | S_X, G = g) and L(τ_X | S_X),  (52)

Indirect Prediction: L(τ_X | S_Y, G = g) and L(τ_X | S_Y).  (53)
The analogues of the Dorans and Holland RMSD measure are the predicted difference functions given by:

Direct Prediction:

PDD(s) = ( Σ_g [L(τ_X | S_X = s, G = g) - L(τ_X | S_X = s)]² w_g )^{1/2} / σ_{τ_X},  (54)

Indirect Prediction:

PDI(s) = ( Σ_g [L(τ_X | S_Y = s, G = g) - L(τ_X | S_Y = s)]² w_g )^{1/2} / σ_{τ_X},  (55)

where w_g is the proportion of the whole population that is in the subpopulation denoted by G = g. These measures show the average amount, in true-score standard deviation units, by which the subpopulations indicated by G affect the BLP for each value of the conditioning score.
Dorans and Holland also propose single-number summaries of the RMSD functions. The analogues here are the Expected Predicted Difference values given by:

Direct Prediction:

EPDD = ( Σ_{s,g} [L(τ_X | S_X = s, G = g) - L(τ_X | S_X = s)]² P(S_X = s | G = g) w_g )^{1/2} / σ_{τ_X},  (56)

or

Indirect Prediction:

EPDI = ( Σ_{s,g} [L(τ_X | S_Y = s, G = g) - L(τ_X | S_Y = s)]² P(S_Y = s | G = g) w_g )^{1/2} / σ_{τ_X}.  (57)
However, the numerator of (56) can be expressed as the square root of the following quantity,

E{[L(τ_X | S_X, G) - L(τ_X | S_X)]²} = Var[L(τ_X | S_X, G) - L(τ_X | S_X)].
The equality of the second moment and the variance follows from the equality of the unconditional expectations of the BLPs of τ_X. Similar expressions hold for indirect prediction. Hence we obtain the following alternative representations of the EPD values in (56) and (57),

Direct Prediction:

EPDD = SD[L(τ_X | S_X, G) - L(τ_X | S_X)] / σ_{τ_X},  (58)

and Indirect Prediction:

EPDI = SD[L(τ_X | S_Y, G) - L(τ_X | S_Y)] / σ_{τ_X}.  (59)
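The equivalence of the sum form (56) and the SD representation (58) can be illustrated numerically by fitting the sample analogues of the BLPs by least squares, within each group and overall; the two-group data below are artificial:

```python
# Toy check that the EPD computed from the sum in (56) matches the SD
# representation in (58); the two agree because the BLPs have equal
# unconditional means. Data are artificial.
import math
import random

random.seed(7)
data = []                                   # tuples (group, s, tau)
for g, shift in enumerate([0.0, 0.8]):
    for _ in range(5000):
        tau = random.gauss(shift, 1.0)
        data.append((g, tau + random.gauss(0.0, 0.7), tau))

def ols(pairs):
    """Least-squares line of tau on s: the sample analogue of the BLP."""
    n = len(pairs)
    ms = sum(s for s, _ in pairs) / n
    mt = sum(t for _, t in pairs) / n
    b = sum((s - ms) * (t - mt) for s, t in pairs) / sum((s - ms) ** 2 for s, _ in pairs)
    return lambda s, a=mt - b * ms, b=b: a + b * s

blp_all = ols([(s, t) for _, s, t in data])
blp_grp = {g: ols([(s, t) for gg, s, t in data if gg == g]) for g in (0, 1)}

diffs = [blp_grp[g](s) - blp_all(s) for g, s, _ in data]   # L(tau|S,G) - L(tau|S)
taus = [t for _, _, t in data]
m_tau = sum(taus) / len(taus)
sd_tau = math.sqrt(sum((t - m_tau) ** 2 for t in taus) / len(taus))

# Form (56): average the squared differences over the empirical (s, g) distribution.
epd_sum = math.sqrt(sum(d * d for d in diffs) / len(diffs)) / sd_tau
# Form (58): the standard deviation of the same differences.
m_d = sum(diffs) / len(diffs)
epd_sd = math.sqrt(sum((d - m_d) ** 2 for d in diffs) / len(diffs)) / sd_tau
print(abs(epd_sum - epd_sd) < 1e-9)
```

Averaging over the pooled sample weights each score by its within-group frequency times the group proportion, which is exactly the P(S = s | G = g) w_g weighting in (56).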
The measures for indirect prediction are more analogous to those of Dorans and Holland than are those for direct prediction, because they involve linking Y to the true-score scale of X. We have not investigated the utility of finding analogues to the "parallel linear" linking functions used in Dorans and Holland to simplify their measures. In the next section we evaluate these measures for a real data example.

5. Examples Using Real and Simulated Data

In this and the next section we report some preliminary results using the ideas developed in this paper. These results make use of simulated data from a simple IRT model as well as an example using real data.

The simulation study. In order to see how well the first-order analysis approximates the posterior mean and variance in a real IRT model, we carried out a small simulation study using the software ConQuest (Wu, Adams, & Wilson, 1997). In the study, the two tests, called X and Y, had forty items each, except in Cases 5 and 6 explained below in Table 1. In all of our analyses, the Y-scores are being linked to the τ_X-scale. Because our interest was in the accuracy of the first-order theory, we did not investigate the effects of multiple groups of examinees in the simulation, so there is no "G" in this part of our study. All of the item responses were simulated using a one-dimensional Rasch model with b parameters that we varied to mimic several interesting differences between X and Y. There were four basic sets of b parameters, called "spread", "spread low", "spread high" and "peaked", respectively. In all four conditions the b's were on the logit scale.
TABLE 1.
The sets of item parameters describing each pair of test conditions used in the simulation study

Case   Test X            Test Y
1      Spread            Spread
2      Peaked            Peaked
3      Peaked            Spread
4      Spread low        Spread high
5      Spread, 20 items  Spread, 20 items
6      Spread, 40 items  Spread, 20 items
In the "spread" condition the b's were randomly sampled from the uniform distribution on [-1.75, 1.75]. For the "spread low" condition they were also sampled from this uniform distribution and then had 0.25 logits subtracted from all of them. For the "spread high" condition the b's were again randomly drawn from the uniform distribution on [-1.75, 1.75] and then had 0.75 logits added to all of them. For the "peaked" condition the b's were randomly drawn from the Normal distribution with mean 0 and standard deviation 0.1 logits. With these definitions, we created six "cases" or pairs of tests, X and Y, with the item parameters described according to Table 1. These six cases were chosen to mimic the following conditions that might arise in linking two tests.

Case 1: The two tests have difficulty parameters spread out over similar large ranges of values.
Case 2: The two tests have difficulty parameters concentrated in about the same small range of values.
Case 3: The two tests have similar average difficulty, but Y has a wide range of difficulty parameters and X has a narrow range of difficulty parameters.
Case 4: The two tests both have wide ranges of difficulty parameters, but they have very different average difficulty, with X being the easier test.
Case 5: The same as Case 1, but with both tests half as long, and therefore both X and Y are less reliable than in Case 1.
Case 6: The same as Case 1, except that X is twice as long as Y, so that a less reliable test is being linked to the scale of a more reliable test.

The thetas for the two tests were assumed to be distributed as bivariate normal with means 0, standard deviations 1, and with one of two correlations between θ_X and θ_Y, either ρ = .8 or ρ = .5. In each simulation we used N = 2000 simulated examinees, that is, "simulees".
The structure of the simulation consisted of 12 = 2 × 6 combinations of a choice of one of the two correlations for the bivariate ability distribution of (θ_X, θ_Y) and a choice of one of the six sets of pairs of item parameters as specified in Table 1. In each simulation a sample of 2000 simulees with (θ_X, θ_Y)-values was generated from the selected bivariate ability distribution, and values of their dichotomous item responses from X and Y were then simulated using the selected value of (θ_X, θ_Y) and the pair of Rasch models with item parameters indicated by the appropriate case in Table 1. The raw scores on X and Y were taken to be the number-right scores on each test, i.e.,
S_X = s_X(X) = Σ_j X_j,  and  S_Y = s_Y(Y) = Σ_j Y_j.  (60)
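The data-generating step just described can be sketched as follows; the seed and the uniform difficulty draws mirror the "spread" condition but are illustrative, and the sketch stands in for the ConQuest-based machinery actually used in the study:

```python
# Sketch of one simulation condition: bivariate normal (theta_X, theta_Y)
# with correlation rho, Rasch item responses, and number-right scores (60).
import math
import random

random.seed(0)
rho = 0.8
bx = [random.uniform(-1.75, 1.75) for _ in range(40)]   # "spread" difficulties, test X
by = [random.uniform(-1.75, 1.75) for _ in range(40)]   # "spread" difficulties, test Y

def rasch_response(theta, b):
    p = 1.0 / (1.0 + math.exp(-(theta - b)))            # Rasch IRF, logit = theta - b
    return 1 if random.random() < p else 0

sx_scores, sy_scores = [], []
for _ in range(2000):                                   # 2000 simulees
    theta_x = random.gauss(0.0, 1.0)
    # theta_Y given theta_X is normal with mean rho*theta_X and sd sqrt(1 - rho^2)
    theta_y = rho * theta_x + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
    sx_scores.append(sum(rasch_response(theta_x, b) for b in bx))
    sy_scores.append(sum(rasch_response(theta_y, b) for b in by))

print(len(sx_scores), min(sx_scores), max(sx_scores))   # number-right scores in [0, 40]
```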
When necessary, we transformed all the theta values to the corresponding true-score scale using the transformations in (5) and (6) that now take the form
τ_X = Σ_j P_{jX}(θ_X),  and  τ_Y = Σ_j P_{jY}(θ_Y),  (61)

where the item response functions in (61) are of the Rasch type, i.e., they are given by

logit[P_{jX}(θ_X)] = θ_X - b_{jX},  and  logit[P_{jY}(θ_Y)] = θ_Y - b_{jY}.  (62)
In the transformations defined by (61) and (62), the true b's were used rather than estimates of them based on the sample data. Thus, in this study the theta-to-true-score transformation was the correct "population" transformation rather than an approximation estimated from sample data. An important part of our simulation was to obtain estimates of the posterior means and variances, that is, E(τ_X | S_X) and Var(τ_X | S_X) for direct, and E(τ_X | S_Y) and Var(τ_X | S_Y) for indirect, prediction. The program ConQuest can produce plausible values (i.e., sample draws) from the posterior distributions of θ_X given S_X, so we exploited this facility in our simulation. In calculating the posterior distributions we again used the population values for the b's and the appropriate normal distributions for the priors. When we were concerned with direct prediction we drew from the posterior distribution P{θ_X | S_X}, and when we were concerned with indirect prediction we first drew from the posterior distribution P{θ_Y | S_Y} and then used the conditional distribution P{θ_X | θ_Y, S_Y} = P{θ_X | θ_Y} to make the final draws from the posterior distribution P{θ_X | S_Y} (Gelman, Carlin, Stern, & Rubin, 1995). For each simulee we generated 100 plausible values from the conditional distribution of θ_X given either S_X (for direct prediction) or S_Y (for indirect prediction). We then used the true-score transformation for X in (61) to transform the "theta" plausible values to "true-score" or tau-plausible-values. Simulees were grouped on the basis of S_X or of S_Y, and then means and variances were computed for all of the true-score plausible values represented by each of these groups of simulees with identical values of S_X or S_Y. For each condition of the simulation design we generated 10 replicate data sets and averaged the results. The means and SDs across the 10 replications are given in Tables 2, 3 and 4.
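For direct prediction, the plausible-value step can be sketched as follows. In the Rasch model the number-right score is sufficient for θ, so the posterior P{θ_X | S_X} can be evaluated on a grid from the likelihood of any response pattern with that score, combined with the N(0, 1) prior, and the draws can then be mapped to the true-score scale via (61). The difficulties and the response pattern below are illustrative, and the grid posterior stands in for ConQuest:

```python
# Grid sketch of drawing plausible values of tau_X given S_X in a Rasch model.
# Item difficulties and the response pattern are illustrative.
import math
import random

random.seed(3)
b = [random.uniform(-1.75, 1.75) for _ in range(40)]   # 40 "spread" difficulties
x = [1] * 25 + [0] * 15                                # a pattern with number-right 25

grid = [-4.0 + 0.02 * i for i in range(401)]           # theta grid on [-4, 4]

def loglik(theta):
    ll = 0.0
    for xi, bi in zip(x, b):
        p = 1.0 / (1.0 + math.exp(-(theta - bi)))
        ll += math.log(p) if xi else math.log(1.0 - p)
    return ll

post = [math.exp(loglik(t) - t * t / 2.0) for t in grid]   # likelihood times N(0,1) prior
total = sum(post)
post = [w / total for w in post]

def true_score(theta):                                  # tau_X(theta), as in (61)
    return sum(1.0 / (1.0 + math.exp(-(theta - bi))) for bi in b)

pv = [true_score(random.choices(grid, weights=post)[0]) for _ in range(100)]
mean_pv = sum(pv) / len(pv)                             # estimate of E(tau_X | S_X = 25)
print(0.0 < mean_pv < 40.0)
```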
These means and variances of the plausible values of τ_X then formed our estimates of the posterior means and variances for direct and indirect prediction, to be compared to the values obtained from the first-order best-linear-predictor theory outlined in section 3. In order to implement our first-order IRT analysis we needed estimates of the tests' reliabilities, and we used the approach outlined in section 2. We operationalized the error variance as the integration specified by:

σ²_{eX} = E[σ²_{eX}(θ_X)] = Σ_j ∫_{-∞}^{∞} P_{jX}(θ_X)[1 - P_{jX}(θ_X)] φ(θ_X) dθ_X.  (63)

In (63), φ(·) denotes the standard normal density function, and we used the true b-values in the IRFs within the integral rather than estimated values. The integration was carried out numerically. Table 2 shows the resulting reliabilities for the five different sets of item parameters used in our study, averaged across the various conditions in which they appeared, along with the standard deviations.

TABLE 2.
Average reliability values for the five sets of item parameters

Pattern of item difficulties   Average reliability (standard deviation)
Spread                         .88 (.005)
Peaked                         .89 (.002)
Spread low                     .88 (.003)
Spread high                    .87 (.002)
Spread, 20 items               .79 (.008)

From Table 2 it is evident that the only factor that strongly affects the reliability of the tests used in the simulation is the number of items in the test, that is, 20 versus 40. These reliability values indicate that the tests used in our study were not unrealistic in terms of the usual measures of test reliability.

6. Simulation and Real Data Results
6.1. How Good an Approximation to the Posterior Expectation is the Posterior BLP?

We answer this question in two ways, first by using the overall measures of the discrepancy between the posterior means and the BLP given by k²_{DG} and k²_{IG}. Table 3 shows the values of 1000 times the k²-factors for the various conditions of the simulation design. The values in Table 3 are means across the 10 replications of each simulation condition. All of these values are very small, indicating that the BLP is a good approximation to the posterior means for the cases covered in our simulation. In addition, it suggests that in many cases the k²-factors can be ignored in the computation of H. The square root of each k²-factor is a percent of a standard deviation of the true-score distribution. This root-mean-square difference between the best linear predictor and the conditional expectation ranges from 3 to 5% for direct and from 7 to 10% for indirect prediction. Examining Table 3 more closely, in the case of direct prediction, cases 1, 4 and 6 are essentially the same, and the values of k²_{DG} reflect this. Similarly, for direct prediction, cases 2 and 3 are also the same and have identical values of k²_{DG}. Case 5 is the only case of direct prediction that involves a 20-item test, and its value of k²_{DG} is the largest. The values of k²_{DG} are all much smaller than the corresponding values of k²_{IG}. For indirect prediction the biggest differences in k²_{IG} are between the two values of the theta correlation, ρ. The differences in k²_{IG} between the six cases are much smaller than the differences due to the theta correlation. We interpret this as saying that the details of the differences in the item parameters and the number of items are not as important as the lack of parallelism indicated by the different thetas for the two tests. In the case of ρ = 0.5, the tests are measuring very different things.
Case 4 is interesting in that it is the only one in which the two tests are differentially targeted for the underlying population: X is a bit too easy for the population and Y is a bit too hard for it. This is roughly what can happen in vertical equating studies. We note that the values of k²_{IG} are the biggest for this case under both theta correlations, and that they get larger as ρ increases. At first we thought this was an error in the simulation, but we have convinced ourselves that it is not. We then did three more versions of case 4 where ρ was .90, .99, and 1.00, and the values of 1000·k²_{IG} were 13.2, 14.8 and 14.9, respectively. Figure 2 shows both the conditional expectation and the best linear predictor for Case 4 with ρ = 0.80, and we see that there is substantial curvilinearity in the conditional expectation. The "bending" gets stronger the more correlated the two tests become, and is related to the difference in the levels of difficulty of the two tests, and not to the lack of a perfect correlation between the abilities they measure. For comparison, we also did case 1 (X spread and Y spread) for ρ = 0.99, and instead of going up, 1000·k²_{IG} went down to 1.8 from 6.0 for ρ = 0.80.

TABLE 3.
Values of 1000 times k² for each of the conditions in the simulation study. Rows are sorted by the value of k² for the case of direct prediction. (Values in parentheses are standard errors based on the 10 replications of each simulation condition.)

Case  Test X       Test Y       Direct     Indirect ρ = 0.8  Indirect ρ = 0.5
5     Spread 20    Spread 20    2.6 (0.7)  5.7 (1.1)         7.8 (1.7)
6     Spread 40    Spread 20    1.6 (0.5)  6.2 (1.8)         9.0 (2.9)
1     Spread       Spread       1.5 (0.4)  6.0 (1.3)         9.9 (3.8)
4     Spread low   Spread high  1.5 (0.3)  14.1 (3.4)        11.5 (2.1)
2     Peaked       Peaked       1.2 (0.3)  5.5 (1.4)         9.7 (2.0)
3     Peaked       Spread       1.2 (0.4)  5.2 (1.7)         10.0 (2.7)

While the values of the k²-factors indicate that, for these examples at least, the linear approximation to the posterior mean by the BLP is quite good, as our second approach to investigating how well the BLP approximates the posterior mean we include two graphs in Figures 1 and 2, below. These graphs show what the posterior means and the BLP look like as functions of the conditioning score value. We only show the graphs that correspond to the largest and smallest values of the k²-factors in our original study design. The BLP is an approximation to the conditional expectation, and Theorem A part (b) and Theorem B part (d) in the appendix show that, averaged over the score distribution, both L(τ_X | S_X) and E(τ_X | S_X) have the same value (this is also true for L(τ_X | S_Y) and E(τ_X | S_Y)). Hence there can be no constant bias between the approximation and the target; they must cross, as we see in Figures 1 and 2. Our conclusion is that the BLP is a remarkably good linear approximation to the posterior mean in the IRT model studied here. This suggests the need for future analyses along these lines for more complicated IRT models.
How good is H as an approximation to the loss of precision that arises through the use of indirect prediction? First of all, there were, to two decimal places, virtually no differences between the values of H computed using formula (51) and ones where the value of k² is set to 0. This is not surprising in light of the small values of k² that emerged in our study. It suggests that analyses of H that ignore k² probably give useful results in many situations. This is a useful topic for future research. Figures 3 and 4 give typical examples of the posterior variances for direct and indirect prediction. Both show a curvilinear relationship between the posterior variance and the conditioning test score. This curvilinear relationship is predicted from the simple beta-binomial model (Gelman et al., 1995, p. 477), which is a special case of the Rasch IRT models used here. Comparing the two graphs we see that the posterior standard deviations for direct prediction are smaller than those for indirect prediction, which is exactly what the inflation factor H is attempting to measure. To see how well H does this, we computed the ratios of the posterior standard deviations at each conditioning score point, indirect divided by direct, and then averaged the results over the score points. If H is to be a useful average measure, it needs to reflect the average amount by which the posterior standard deviation is increased when we move from direct to indirect prediction. When the two tests are not of the same length (i.e., case 6), some means of connecting their score values needs to be worked out to form these ratios of posterior standard deviations. We simply scored the two tests by their percents correct and pooled values of the standard deviations associated with the neighboring scores of the test with the larger number of score values. We used the formula

Target ratio = ( Σ_s [Var(τ_X | S_Y = s) / Var(τ_X | S_X = s)] P{S_X = s} )^{1/2}  (64)

to form the target ratios. The values of the target ratios and the values of H are given in Table 4. Table 4 shows that the H-values, while usually smaller than the target values, are quite close and give exactly the same general type of information about the effect of indirect prediction relative to direct prediction for the 12 conditions of the simulation study. We regard this finding as clear support for further work on the BLP tools we have developed here.
How does indirect prediction increase the imprecision of the prediction of the true score of the target test, X? Using the H-values, we get a clear picture as to what happens when we link a test that measures a different construct in a different way to a target test in terms of degrading
FIGURE 1.
Panel (a): Plot of the posterior mean, E(τ_X | S_Y), and the posterior best linear predictor, L(τ_X | S_Y), for Case 3, X peaked, Y spread, with ρ = 0.8. (Best fit.) Panel (b): The difference E(τ_X | S_Y) - L(τ_X | S_Y) plotted against raw score.
FIGURE 2.
Panel (a): Plot of the posterior mean, E(τ_X | S_Y), and the posterior best linear predictor, L(τ_X | S_Y), for Case 4, X spread low, Y spread high, with ρ = 0.8. (Worst fit.) Panel (b): The difference E(τ_X | S_Y) - L(τ_X | S_Y) plotted against raw score.
FIGURE 3.
Comparison of conditional standard deviations and their averages with the averages predicted by the BLP, for both direct and indirect prediction, for Case 1, X spread, Y spread, ρ = 0.8.
FIGURE 4.
Comparison of conditional standard deviations and their averages with the averages predicted by the BLP, for both direct and indirect prediction, for Case 2, X peaked, Y peaked, ρ = 0.8.
TABLE 4.
Average values of H and of the average ratios of the target posterior standard deviations for the several simulation conditions in the study. (Standard deviations across the 10 replications are in parentheses.)

Case  Test X      Test Y       H, ρ = .8   H, ρ = .5   Mean Ratio, ρ = .8  Mean Ratio, ρ = .5
5     Spread 20   Spread 20    1.49 (.04)  1.95 (.05)  1.51 (.04)          1.99 (.02)
1     Spread      Spread       1.86 (.03)  2.53 (.03)  1.89 (.03)          2.58 (.04)
6     Spread 40   Spread 20    1.99 (.04)  2.60 (.05)  2.04 (.03)          2.64 (.08)
4     Spread Low  Spread High  1.92 (.03)  2.55 (.03)  1.77 (.03)          2.48 (.03)
2     Peaked      Peaked       2.02 (.02)  2.70 (.02)  2.07 (.03)          2.81 (.07)
3     Peaked      Spread       2.03 (.02)  2.71 (.03)  2.03 (.03)          2.77 (.06)
the accuracy of the prediction of the target true-score. For example, in this study the posterior standard deviation is inflated by factors ranging from 49% to 171%, depending on the simulation condition. As the correlation between the constructs lessens, the inflation factor increases. Interestingly, the smallest inflation factors arise for the case of the least reliable tests, case 5. This may be because the least reliable tests have poorer direct prediction to begin with and thus the least to lose from using indirect test data to predict their true scores. This finding is worth more investigation than we have reported here.
6.2. A Real Data Example

We also examined an example using test data from an administration of a 5th grade Science assessment in two states in 1998. This assessment involved both a multiple choice test (MC) with 29 items and a performance task (PT) that was followed by nine questions asking the students to record their observations and explain them. The nine questions from the performance task were scored dichotomously using expert judgement. We will use these two different testing formats as the two tests in our study, and use the PT scores to indirectly predict the true-scores on the MC test. Data were available for 1202 girls and 1096 boys, and we will use gender as the subgroup-defining variable, G. Table 5 shows some raw score (number right) means. According to the raw score means, the two tests, MC versus PT, reverse the order of the two groups: boys are better on MC (by 1.6% of a standard deviation) and girls are better on PT (by 9.9% of a standard deviation). The Dorans and Holland (2000) measure, REMSD, which measures the average boy-girl difference between parallel-linear equating functions linking these two tests, is 5.6%, which is of moderate size compared to the examples in their paper. We used ConQuest to estimate IRT models for these data. In particular, we wanted to estimate two different thetas, one for the MC test and another for the performance task questions. In addition, we wanted to obtain separate bivariate Normal ability distributions for boys and girls.
TABLE 5.
Means of number right scores for the MC and PT for All students and separately for Boys and Girls. (Standard deviations in parentheses)

                             Multiple choice    Performance Task
  All                        16.17 (4.96)       4.73 (2.03)
  Girls                      16.13              4.83
  Boys                       16.21              4.63
  Difference                 -0.08              0.20
  Standardized difference*   -1.6%              9.9%

*Difference divided by the standard deviation for All, expressed as a percent.
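The "Standardized difference" row of Table 5 follows directly from the footnote's definition; a quick arithmetic check (our own sketch, not code from the paper):

```python
# Reproduce the "Standardized difference" row of Table 5: the girls-minus-boys
# difference divided by the standard deviation for All, expressed as a percent.
mc_girls, mc_boys, mc_sd_all = 16.13, 16.21, 4.96  # Multiple choice
pt_girls, pt_boys, pt_sd_all = 4.83, 4.63, 2.03    # Performance task

def std_diff_pct(girls, boys, sd_all):
    """Standardized (girls - boys) difference as a percent of the All-group SD."""
    return 100.0 * (girls - boys) / sd_all

print(round(std_diff_pct(mc_girls, mc_boys, mc_sd_all), 1))  # -1.6 (boys ahead on MC)
print(round(std_diff_pct(pt_girls, pt_boys, pt_sd_all), 1))  # 9.9 (girls ahead on PT)
```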
TABLE 6.
IRT results for the Science Assessment Test data

                                         Anchored at Boys           Anchored at Girls
                                         Girls    Boys      All     Girls    Boys
  Reliability (MC)                       .74      .77       .75     .73      .77
  Reliability (PT)                       .56      .61       .58     .55      .60
  Square root of E[Var(τ_X|S_X, G)|G]    2.11     2.14      2.13    2.13     2.16
  Square root of E[Var(τ_X|S_Y, G)|G]    3.41     3.75      3.57    3.38     3.74
  k²_G                                   .002     .002      .003    .003     .002
  k̃²_G                                   .006     .007      .005    .012     .004
  H (setting k² = 0)                     1.62     1.76      1.67    1.60     1.74
  H                                      1.62     1.76      1.68    1.60     1.74

Note: "Anchored at Boys" ("at Girls") means the item parameters were anchored at the values estimated for boys (girls).
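The inflation factor H was defined earlier in the paper; on our reading of Table 6, with k² set to 0, H is the ratio of the indirect to the direct root-mean posterior variance (both shown above on the square-root scale). Treating that reading as an assumption, the table entries are internally consistent up to rounding:

```python
# Consistency check on Table 6 (our reading, not code from the paper):
# with k^2 = 0, H should equal the ratio of the indirect to the direct
# root-mean posterior variance. Columns follow the table's order.
direct   = [2.11, 2.14, 2.13, 2.13, 2.16]  # sqrt of E[Var(tau_X | S_X, G) | G]
indirect = [3.41, 3.75, 3.57, 3.38, 3.74]  # sqrt of E[Var(tau_X | S_Y, G) | G]
H_k0     = [1.62, 1.76, 1.67, 1.60, 1.74]  # H with k^2 = 0

for d, i, h in zip(direct, indirect, H_k0):
    # the table entries were rounded for display, so allow a small tolerance
    assert abs(i / d - h) < 0.02, (i / d, h)
```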
In all cases we fit Rasch models for the items and bivariate Normal distributions for the thetas. We did not want to add an investigation of DIF to this study, so we estimated common item parameters for both genders. We did this in two ways. First, we estimated a model for the girls only, "anchored" the item parameters for boys at the values obtained for girls, and then estimated the boys' bivariate theta distribution subject to this constraint. Second, we reversed the process and anchored the item parameters at the values estimated for the boys. The results for both approaches are given separately and show minor differences. There was some evidence that the item parameters were slightly different for the two groups, but we do not think these differences are large enough to affect the conclusions we reached in this example.

Table 6 summarizes various quantities of interest when the IRT analyses are performed separately for boys and girls and for the total group. We see very little difference between the two methods of anchoring the items, so we will comment on it no further. The reliabilities of both tests, MC and PT, are slightly higher for boys than for girls. However, the average posterior variances (shown here on the square-root scale) show the opposite trend, with the girls having slightly smaller average posterior variances for either direct or indirect prediction. The values of k² are all small and have virtually no effect on the inflation factor, H. H is larger for boys than for girls, indicating that the indirect prediction of the true-score for MC from the PT scores inflates the prediction error more for the boys than it does for the girls. The values of EPDD and EPDI are .023 (2.3%) and .038 (3.8%), respectively. These values indicate that subgroup differences have bigger effects on indirect prediction than they have on direct prediction.
The values of the EPDD and EPDI measures are smaller than the Dorans-Holland REMSD value of 5.6% given earlier. Since the connection between the two calculations is mostly by analogy, there is no reason for their values to be equal.
7. Discussion

The general IRT model we have developed here reproduces the main results of CTT in considerable detail, including CTT models that involve more than one test. In addition, we see that using the concept of Best Linear Predictor provides us with a version of CTT that does not need to assume that the conditional means are linear or that the conditional variances are constant, assumptions which do not really hold for test data. Furthermore, our simulation study suggests that the BLPs are useful alternatives to the posterior means of the true scores, at least for the simple model we have examined.
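As a concrete reminder of what the BLP computes (its moment formulas are collected in Theorem B of Appendix B), here is a small numeric sketch with a made-up discrete joint distribution; it checks that the moment formulas reproduce the directly computed average squared prediction error:

```python
# Best linear predictor L(U|V) = alpha + beta*V from the moments of a toy
# discrete joint distribution p[(u, v)]; the numbers are invented for
# illustration. The average squared prediction error should equal
# Var(U) * (1 - rho^2), as in Theorem B of Appendix B.
p = {(0, 0): 0.1, (1, 0): 0.2, (2, 1): 0.3, (3, 1): 0.4}

E = lambda f: sum(f(u, v) * pr for (u, v), pr in p.items())
mu_u, mu_v = E(lambda u, v: u), E(lambda u, v: v)
var_u = E(lambda u, v: (u - mu_u) ** 2)
var_v = E(lambda u, v: (v - mu_v) ** 2)
cov_uv = E(lambda u, v: (u - mu_u) * (v - mu_v))

beta = cov_uv / var_v                 # slope: rho * sigma_U / sigma_V
alpha = mu_u - beta * mu_v            # intercept
rho2 = cov_uv ** 2 / (var_u * var_v)
pi2 = var_u * (1 - rho2)              # average squared prediction error

mse = E(lambda u, v: (u - alpha - beta * v) ** 2)
assert abs(mse - pi2) < 1e-9          # Theorem B, part (c)
assert abs(E(lambda u, v: alpha + beta * v) - mu_u) < 1e-9  # part (d)
```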
In addition to these general considerations, this approach allows us to successfully distinguish between direct and indirect true-score prediction in a simple but useful way. For example, it allows us to compute an index of the loss of information that accompanies the linking of nonparallel tests, using true-score prediction as the criterion. This is expressed in our index, H, which can be computed from quantities that are usually available in test-linking studies. In addition, we can generalize the Kelley formula for predicting a true score from an observed score to predicting the true score from a nonparallel test. Our analysis shows that linear regression gives an appropriate linking function (viewed as a BLP approximation to the posterior expectation) but that the proper residual standard deviation is not that given by the usual regression results. This justifies the use of multiple regression to link tests in studies such as Pashley and Phillips (1993) and Williams et al. (1995).

The oft-stated assertion that "regression is not equating" immediately comes to mind when we talk about linking in the manner that we have in this paper. We think that our approach is a useful starting point, but it is not directly about test equating per se. For example, the symmetry requirement of equating cannot hold for true-score prediction as we have defined it here.

There are several topics that this research suggests might be worth future investigation. First, it seems useful to investigate the improvements that could be garnered by use of the best quadratic predictor rather than the best linear predictor. The departures from linearity shown in Figure 2 suggest that a quadratic term will fit most of the departure from linearity that the conditional expectation exhibits there.
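To make the direct/indirect contrast concrete, here is a minimal sketch of the two predictors: Kelley's formula for direct prediction, and the regression-slope BLP for indirect prediction from a nonparallel test. The function names and toy moments are ours, not the paper's; in practice the moments would come from the observed score distributions.

```python
# A minimal sketch of direct (Kelley) and indirect true-score prediction
# using best linear predictors; names and numbers are illustrative only.
def kelley(x, mean_x, reliability_x):
    """Direct prediction: BLP of the true score of test X from the observed
    score on X. This is the classical Kelley formula, shrinking the observed
    score toward the group mean by the reliability."""
    return reliability_x * x + (1 - reliability_x) * mean_x

def indirect_blp(y, mean_x, mean_y, cov_xy, var_y):
    """Indirect prediction: BLP of the true score of X from the observed score
    on a (possibly nonparallel) test Y. Because measurement errors are
    uncorrelated with Y, Cov(tau_X, S_Y) = Cov(S_X, S_Y), so the slope is the
    ordinary regression slope of X on Y."""
    slope = cov_xy / var_y
    return mean_x + slope * (y - mean_y)

# Toy numbers: a test with mean 16 and reliability .75
print(kelley(20.0, 16.0, 0.75))  # 19.0 -- the observed 20 shrinks toward 16
```

Note that while the indirect BLP coincides with the regression line of X on Y, the appropriate residual standard deviation for true-score prediction is not the usual regression residual standard deviation, as discussed above.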
However, since we have only looked at the simplest IRT model, there is also considerable work to do to investigate the value of these ideas in more complex models for which the total score is not a sufficient statistic. In addition, instead of the constant average variance formula used in computing H, it may be worthwhile to find a quadratic version of it using the beta-binomial as a starting point. Another possible use for the best linear predictor and the other quantities is to provide "targets" for the convergence of complex estimation procedures such as those exploiting Markov Chain Monte Carlo methods. It is possible that having an easily computed target quantity like the BLP available could indicate when the samples from the posterior distributions have converged to reasonable values.

Appendix A: Some Facts about Conditional Distributions, Means, Variances, and Covariances

We will use U, V and W to denote random variables defined over P so that we can state these results more generally than with the specific notation we developed in section 1 for testing applications. We will also, whenever possible, state results conditionally given a subpopulation, defined in terms of G, as in section 1. We will use the notation E(U|V, G) to denote the conditional expectation, or mean, of U given the values of V and G, and the notation Var(U|V, G) for the corresponding conditional variance. Finally, Cov(U, V|W, G) denotes the conditional covariance of U and V given W and G. We use P{U = u|V, G} to denote the conditional probability distribution of U given V and G.
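This notation can be made concrete with a toy discrete joint distribution (numbers invented for illustration, and the conditioning on G suppressed); the sketch below computes the conditional moments and verifies the iterated-expectation and "within plus between" identities stated in Theorem A below:

```python
# Conditional moments on a toy discrete joint distribution of (U, V);
# p[(u, v)] is the joint probability. Numbers are invented for illustration.
p = {(0, 0): 0.2, (1, 0): 0.3, (0, 1): 0.1, (1, 1): 0.4}

def marginal_v(v):
    return sum(pr for (u, vv), pr in p.items() if vv == v)

def cond_mean_u(v):  # E(U | V = v)
    return sum(u * pr for (u, vv), pr in p.items() if vv == v) / marginal_v(v)

def cond_var_u(v):   # Var(U | V = v)
    m = cond_mean_u(v)
    return sum((u - m) ** 2 * pr for (u, vv), pr in p.items() if vv == v) / marginal_v(v)

mean_u = sum(u * pr for (u, _), pr in p.items())
var_u = sum((u - mean_u) ** 2 * pr for (u, _), pr in p.items())

# E(U) = E[E(U|V)]: average the conditional mean over the distribution of V
iterated = sum(cond_mean_u(v) * marginal_v(v) for v in (0, 1))
# Var(U) = E[Var(U|V)] + Var[E(U|V)]: "within plus between"
within = sum(cond_var_u(v) * marginal_v(v) for v in (0, 1))
between = sum((cond_mean_u(v) - mean_u) ** 2 * marginal_v(v) for v in (0, 1))

assert abs(mean_u - iterated) < 1e-9
assert abs(var_u - (within + between)) < 1e-9
```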
Theorem A. If U, V and W denote random variables for which all of the following conditional means, variances and covariances are well-defined, then the following relationships always hold:
(a) P{U = u|G} = E[P{U = u|V, G}|G], where the outer expectation averages P{U = u|V, G} over the conditional distribution of V given G; this is repeated in (b) through (d).
(b) E(U|G) = E[E(U|V, G)|G]; that is, the appropriate mean of a conditional expectation is a (less) conditional expectation.
(c) Var(U|G) = E[Var(U|V, G)|G] + Var[E(U|V, G)|G]; that is, "the mean of the conditional variance plus the variance of the conditional mean", which is the basis of all "within plus between" decompositions of a variance.
(d) Cov(U, V|G) = E[Cov(U, V|W, G)|G] + Cov[E(U|W, G), E(V|W, G)|G]; that is, this is a generalization of part (c) to covariances.

Appendix B: Best Linear Predictors

In this paper we use the idea of a "best linear predictor" (BLP) of one random variable by another as a way around doing the more difficult computations of conditional means and variances. Suppose U and V are two random variables; then the best linear predictor of U from V is denoted by L(U|V) = α + βV, where α and β are chosen to minimize E(U − α − βV)². Hence a best linear predictor is "best" only in the sense of minimizing the average prediction error. The BLP may also be put into a form that is similar to the "everything is conditional on G" form used in Theorem A, by having α_G and β_G chosen to minimize the quantity E[(U − α_G − β_G V)²|G]. In this case we will denote the BLP by L(U|V, G). The value of the minimized E[(U − α_G − β_G V)²|G] is called the "average prediction error", and it may be expressed as the average, over V given G, of the "conditional prediction error", E[(U − α_G − β_G V)²|V, G]. Below we use a somewhat evocative, but nonstandard, notation for these prediction error measures:

π²_{U|V,G}(V) = E[(U − L(U|V, G))²|V, G],

and

π²_{U|V,G} = E{π²_{U|V,G}(V)|G} = E{E[(U − L(U|V, G))²|V, G]|G} = E[(U − L(U|V, G))²|G].

Thus, π²_{U|V,G}(V) measures how poorly L(U|V, G) predicts U for a given value of V, while π²_{U|V,G} is an average of this prediction error measure over the distribution of V given G. In this sense, π²_{U|V,G}(V) is analogous to the conditional variance of U given V and G, while π²_{U|V,G} is analogous to the (conditional given G) mean of the conditional variance of U given V and G. It is important to point out that when the conditional expectation function E(U|V, G) is linear in V, it is identical to the BLP L(U|V, G), because the conditional expectation is the best predictor of any form, linear or nonlinear.
In this case, rCU21V,G (V) is the conditional variance of U given V and G and rCu2iv' G is the (conditional given G) mean of the conditional variance of U given V and G. Theorem B summarizes some well-known and easily derived facts about the BLR Theorem B. IfL(U[V, G) = O~G+ fiGV is the BLP of U from V, in the sense of minimizing 2 a, have E[(U - C~G -- fiGV)2IG], then c~a, fig and the average squared prediction error, rCuiv, these values: (a) O~G = I~tUIG -- flGI~tVIG, OUIG (b) fiG = PUVIGovlo, (C) rCU21V ,G = CrU21G(1-- pU2VlG)" In addition, the BLP has these relationships to the conditional moments of U given V. (d) E[L(UIV, G)IG] = E[UIG] that is, the mean of the best linear predictor is the mean of the predicted variable, U. This parallels Theorem A, part (b).
(e1) π²_{U|V,G} = E[Var(U|V, G)|G] + E[(E(U|V, G) − L(U|V, G))²|G]
             = E[Var(U|V, G)|G] + Var[E(U|V, G) − L(U|V, G)|G], or
</gr-replace>
(e2) E[Var(U|V, G)|G] = π²_{U|V,G} − Var[E(U|V, G) − L(U|V, G)|G].

Part (e1) parallels Theorem A part (c) in that it is like a "between and within" variance decomposition. Part (e2) shows how the mean of the conditional variance of U can be expressed in terms of the average squared prediction error and the variance of the difference between the BLP and the corresponding conditional expectation. The quantities given in Theorem B, parts (a) to (c), are exactly the same as the formulas for the intercept, slope, and residual variance that hold when the conditional distribution of U given V has a linear conditional expectation function and constant conditional variance. These conditions hold, for example, when U and V have a joint bivariate normal distribution. However, the BLP is useful even when the conditional mean function is not linear or when the conditional variance function is not constant, as is the case in most of the IRT applications we consider here. Parts (d), (e1) and (e2) of Theorem B show the connection between the best linear predictor and the conditional mean and variance of the joint distribution of U and V. These last three results motivate our notation L(U|V, G), which mimics the conditional expectation notation E(U|V, G).

References

Bock, R.D., & Mislevy, R.J. (1982). Adaptive EAP estimation in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.
Dorans, N., & Holland, P.W. (2000). Population invariance and the equatability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37, 281-306.
Feuer, M.J., Holland, P.W., Green, B.F., Bertenthal, M.W., & Hemphill, F.C. (1999). Uncommon measures. Washington, DC: National Academy Press.
Gelman, A., Carlin, J.B., Stern, H.S., & Rubin, D.B. (1995). Bayesian data analysis. London: Chapman and Hall.
Holland, P.W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55, 577-601.
Kelley, T.L.
(1923). Statistical methods. New York, NY: Macmillan.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Mislevy, R.J., Beaton, A.E., Kaplan, B., & Sheehan, K.M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133-161.
Pashley, P.J., & Phillips, G.W. (1993). Toward world-class standards: A research study linking national and international assessments. Center for Educational Progress. Princeton, NJ: Educational Testing Service.
Wainer, H., et al. (2001). Augmented scores: "Borrowing strength" to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 343-387). Mahwah, NJ: Erlbaum.
Williams, V., et al. (1995). Projecting to the NAEP scale: Results from the North Carolina End-of-Grade testing program (Tech. Rep. #34). Chapel Hill, NC: National Institute of Statistical Sciences, University of North Carolina, Chapel Hill.
Wu, M., Adams, R., & Wilson, M. (1997). ConQuest [Computer program]. Melbourne, Australia: Australian Council for Educational Research.

Manuscript received 24 JUL 2001
Final version received 3 JUN 2002