Contemporary School Psychology https://doi.org/10.1007/s40688-018-0182-1
Theoretically-Consistent Cognitive Ability Test Development and Score Interpretation

A. Alexander Beaujean (1) & Nicholas F. Benson (2)

© California Association of School Psychologists 2018
Abstract
Clinical cognitive ability assessment—and its corollary, score interpretation—are in a state of disarray. Many current instruments are designed to provide a bevy of scores to appeal to a variety of school psychologists. These scores are not all grounded in the attribute’s theory or developed from sound measurement or psychometric theory. Thus, for a given instrument, there can be substantial variation between school psychologists when interpreting scores from the same instrument. This is contrary to the very purpose of psychological assessment. As a contrast, we provide a sketch of theoretically driven test development and score interpretation. In addition, we provide examples of how this could be implemented using two theories of intelligence (Spearman’s two-factor and Cattell and Horn’s Gf-Gc) and measurement theory about the nature of psychological test scores. While different from what is often implemented by school psychologists, it is consistent with the guiding principles of evidence-based psychological assessment.

Keywords: Test construction · Two-factor theory · Gf-Gc theory · Intelligence · Score interpretation
Clinical cognitive ability assessment—and its corollary, score interpretation—are in a state of disarray. Part of the reason for this situation is that test publishers want to make the instruments as commercially appealing as possible (Frazier and Youngstrom 2007). Thus, they provide numerous scores—irrespective of whether they are theoretically defensible—with the hope of appealing to the widest range of clinicians. In doing so, however, they either remove the scores from any theoretical grounding or have to rely on multiple, perhaps incompatible, theories to interpret all the scores. In this article, we describe some problems with the present state of clinical cognitive ability assessment development and score interpretation. Then, we describe some general principles of theory-driven test development and score interpretation. Finally, we provide two examples of how these principles could be implemented in the clinical assessment of cognitive ability.
* A. Alexander Beaujean, [email protected]

1 Department of Psychology & Neuroscience, Baylor University, One Bear Place #97334, Waco, TX 76798-7334, USA
2 Department of Educational Psychology, Baylor University, One Bear Place #97301, Waco, TX 76798, USA
Instrument Development

To see the disarray of modern clinical cognitive ability assessment instruments, one need look no further than the latest edition of the Wechsler Intelligence Scale for Children (WISC-V; Wechsler 2014). It provides 35 different scores (1 global score, 13 domain-specific scores, and 21 subtests)—all of which the publisher recommends interpreting in some form. The same can be said for the latest editions of other popular tests—although some do not have as many scores to interpret. For example, the second edition of the Kaufman Assessment Battery for Children (KABC-II; Kaufman and Kaufman 2004) provides 36 scores to interpret (3 global scores, 5 index scores, 1 delayed recall index for a planned comparison, 9 core subtests, 9 supplemental subtests, and 9 planned clinical comparison scores). Similarly, the fourth edition of the Woodcock-Johnson Cognitive (WJ IV; Schrank et al. 2014) provides 16–20 scores to interpret (3 composite scores, 7 Cattell-Horn-Carroll factors [plus extended versions of 3 scores], and 6 narrow ability and other clinical clusters [plus an extended version of 1 score]).

The problem with offering such a bevy of scores to interpret is that there is no single psychometric or attribute theory that can support all their interpretations. From a psychometric standpoint, it is extremely difficult to measure both something general and something specific within the same instrument:
inherently unidimensional … test information cannot be decomposed to produce useful multidimensional score profiles—no matter how well intentioned or which psychometric model is used to extract the information. (Luecht et al. 2006, p. 7)

The reason for this is that there is a finite amount of variance in a set of test scores. Thus, if most of that variance can be explained by a single aggregate score (i.e., it is essentially unidimensional), then there is very little variance left over that is unique to any subscores. Likewise, if most of the variance can be explained by multiple unique scores (i.e., it is essentially multidimensional), then there is little that a single aggregate score could contribute beyond what the unique scores already provide. Proffering that clinicians can interpret both aggregate scores and specific subscores from the same instrument with similar levels of psychometric heft is a strong assumption that requires a level of data not provided by most test publishers (e.g., adequate unique, reliable variance for each of the scores; evidence that each score has incremental validity in predicting a criterion).

From the standpoint of attribute theory, a test cannot adhere to a single theory when the number of scores it provides is greater than the number of attributes posited by that theory. Instead, this situation requires reference to multiple theories, which may not even be compatible. For example, the KABC-II was developed to assess attributes from both Luria's (1973) processing model and the Cattell-Horn-Carroll model (CHC; Schneider and McGrew 2012), but largely uses the same subtests to assess both sets of attributes. Moreover, the publishers never provide a justification for why the same subtests can be interpreted as measuring separate attributes. Thus, some have argued that the scores’ meanings are “ambiguous” and suggest that examiners use other instruments whose scores do a better job of representing attributes (Braden and Ouzts 2005).

Another example is the WISC-V. There is a long history of critiquing the absence of a cohesive rationale for the WISC (e.g., Littell 1960)—that is, a lack of clear indication of the attributes the instrument is designed to measure. The fifth incarnation of the test—developed, in substantial part, to maintain a historical connection to its predecessors—continues this tradition. To be sure, the WISC-V manual (Wechsler 2014) references numerous theories related to cognition, neuropsychology, and intelligence, as well as noncognitive attributes (Wechsler 1950). Despite arguing that the instrument’s scores measure so many attributes, however, the publisher neither provides a clear statement about the specific attributes the instrument was designed to measure nor any justification that the panoply of scores available for interpretation measure specific attributes.

One could argue that the KABC-II and WISC-V are anomalies. Most modern cognitive tests were developed using the CHC taxonomy (Keith and Reynolds 2010), so they must be on much
stronger theoretical footing since CHC is derived from the factor-analytic research and cognitive theories of Raymond Cattell, John Horn, and John Carroll (McGrew 2009). While Cattell and Horn proffered a theory of cognitive ability (Gf-Gc), as did Carroll (three-stratum), the cores of these theories are incompatible, which makes any integration difficult. For example, Carroll’s theory can be thought of as an extension of Spearman’s two-factor theory (Carroll 1996). Thus, Carroll was emphatic that general intelligence (g) was the primary—although not the only—driving force behind all cognitive performance. Cattell and Horn were just as emphatic about g (e.g., Horn 1985), only they eschewed it in favor of primarily focusing on fluid reasoning (Gf), crystallized/acculturation knowledge (Gc), and a few other abilities. Incorporating both perspectives under a single framework produces effete statements such as “CHC theory incorporates … g, but users are encouraged to ignore it if they do not believe that theoretical g has merit, particularly in applied clinical assessment contexts” (Schneider and McGrew 2012, p. 111). Being ambivalent about a component so essential to both theories does neither any justice.
Score Interpretation

The chaos within cognitive test construction may seem like academic minutia and not of much import to practicing school psychologists. On the contrary, this situation is directly related to school psychologists’ day-to-day work because it spills over into their interpretation of the test scores. Again, take the WISC-V as an example. When discussing score interpretation, the test publishers recommend interpreting the aggregate score (i.e., Full Scale IQ [FSIQ]), all possible index scores, index score differences, subtest differences, and a host of qualitative item response analyses (Wechsler 2014). Such a variety of interpretations would be anathema to any single theory of cognitive ability, so the WISC-V publishers wrote that “many interpretation strategies, methods, and procedures” are needed to interpret all these scores (p. 149). Such nebulous interpretation directives have resulted in a variety of interpretational approaches, some of which are incompatible. These range from using only the FSIQ (Canivez and Watkins 2016), to examining patterns of ipsative score differences (Flanagan and Alfonso 2017), to qualitative process-oriented interpretation (Raiford 2017), to an eclectic approach that uses different theoretical perspectives to interpret a panoply of different scores (Kaufman et al. 2016). Moreover, the WISC-V publishers imply that the scores can be useful in assessing attributes other than those in the domain of cognitive ability, such as diagnosing attention disorders, learning disorders, and autism spectrum disorder (Wechsler 2014, pp. 123–147, 180–186)—a notion echoed by others (e.g., Courville et al. 2016).
Disagreement about score interpretation is not novel to current cognitive tests—in fact, it has a very long history (Kamphaus et al. 2012). This seemingly ubiquitous disagreement about what scores to interpret puts a large burden on school psychologists. How are they supposed to know best practices for interpreting an instrument’s scores when presented with conflicting, incompatible recommendations? Even if there were some theoretical rationale and empirical data to support all the suggested score interpretations, examination of all possible scores (or score differences) is labor-intensive and inefficient (Groth-Marnat 1999). Thus, school psychologists wind up differing in their score interpretations of the same test (Pfeiffer et al. 2000), which can be confusing for students, parents, and school personnel. This is all contrary to principles of evidence-based psychological assessment (Hunsley and Mash 2007).
Theory-Driven Test Development

Much has been written about constructing psychological tests (e.g., Downing 2006; Kingston et al. 2013; Petri et al. 2015). We need not rehash this information here except to note that theory plays the most crucial role in constructing tests—not just measurement/psychometric theory, but also the theory of the attribute (Jackson and Maraun 1996). This may sound obvious, but Sijtsma (2012) noted:

The greatest problem of psychological measurement is the frequent absence of well-developed attribute theories. Instead, items are often constructed guided by best guesses on the basis of whatever theory is available, but also based on intuition (what seems to be reasonable?), tradition (how were similar tests constructed?), and conformity (what do colleagues do or think?). Unfortunately, the role of attribute theory is often underestimated, and test … construction seen as engineering and sets of items as useful measurement instruments. (p. 790)

What does it mean to ground a test in an attribute’s theory? This is a multifaceted issue that goes well beyond the scope of this article (Borsboom et al. 2009; Kane 2013). At its core, however, is the idea that the numerals used for test scores should represent what is known about the underlying attributes (Mari et al. 2015). This idea has largely been ignored in psychological measurement (Michell 1999).

The first task in creating a test is to specify the attributes to assess (Borsboom et al. 2004; Bringmann and Eronen 2016). Using Ludwig Wittgenstein’s terminology, assessing the attributes requires the existence of a grammar (i.e., a set of rules) for what constitutes the attributes, part of which includes measurement practices (Maraun 1998a, b). Although the grammar itself
is not empirical, it should be based on empirical research and describe the attributes technically and thoroughly. Unfortunately, there is limited understanding of relevant attributes in most areas of psychological assessment. Consequently, the grammar of these attributes tends to be messy. Thus, we make do with relatively weak notions that merely describe how attributes correlate with other attributes or with relevant outcome variables (e.g., educational attainment, occupational complexity). While these theories can serve as stepping stones, valid measurement will never be possible without a strong, empirically based understanding of the targeted attributes.

As an aside, it is common in school psychology to assume that attributes are quantitative and follow a normal distribution. There is, however, no empirical proof to support these assumptions for most attributes. One could argue that test scores often follow a distribution very similar to a normal distribution, but this is largely under the test developers’ control (Horn 1963). As long as the influence of attribute-irrelevant factors (e.g., test-wiseness, guessing) is minimized, raw score distributions are a function of the test items’ difficulties as well as the inter-item correlations. Vary these and the score distribution will change as well (Jensen 1993).

Second, the attributes that are to be measured need to be operationalized, which simply means there needs to be a description of the operations needed to measure the attributes. Operationalization includes delineation of a domain of behaviors that are typical of the attribute as well as creation of stimuli to elicit reactions in order to gain information about respondents’ levels of the attribute. The stronger the attribute theory, the better the operationalization (Sijtsma 2013). Weak theories produce weak operationalizations with a lot of subjectivity and arbitrariness. This leads to variability across tests purporting to measure the same attributes as well as diluted test scores that are non-negligibly influenced by multiple attributes (Finkelstein 2005).

A common response to weak theories of an attribute is to rely on complex psychometric models. Finding that a model “fits” test data, however, cannot make up for a weak attribute theory. Statistical models can be used to describe a given dataset, but for every model that fits there are multiple other equivalent models, as well as non-equivalent models, that fit the data just as well (Tomarken and Waller 2003). Thus, if a test was developed from an operationalization of a weak attribute theory, then the psychometric model—even though it may fit the test data well—cannot provide strong support for any theory about the attribute.
Theory-Driven Test Scores

After creating a test, the raw scores need to be made interpretable (i.e., scaling). This can be done in a variety of ways (e.g.,
Kline 2000), and should be based on a strong understanding of the attributes (Schneider 2013). The scores available from many current cognitive ability instruments are mixtures (Horn 1989). That is, the scores represent a collection of different attributes in no particular proportion. For example, all index scores on the WISC-V are calculated by summing across multiple subtests and then determining how different the summed score is from the average score in the norming sample. Thus, these scores are not assessing a single attribute, but a mixture of a panoply of attributes (Wechsler 2014, pp. 7–12). Moreover, the subtests that comprise some of these scores have varied across different WISC editions, making scores with the same name have different meanings across editions (Beaujean and Sheng 2014).

To be fair, the calculation of the WISC-V index scores does map onto David Wechsler’s notion of intelligence:

What we measure with tests is not what tests measure—not information, not spatial perception, not reasoning ability. These are only means to an end. What intelligence tests measure, what we hope they measure, is something much more important: the capacity of an individual to understand the world about him and his resourcefulness to cope with its challenges (Wechsler 1975, p. 139).

This understanding of measurement runs counter to most modern perspectives of measurement outside of psychology (Michell 2007), but it does allow the WISC-V to maintain a historical connection to previous versions of the instrument.

To have scores that represent measures of the attributes, the attributes need to be set out ahead of time and the rules for measuring the attributes need to be followed. This notion is somewhat lost with many modern intelligence tests. A major contributor to this situation is that the attributes are so ill-defined that their grammars preclude scientific measurement (Maraun 1998a, b). Thus, we wind up with scores representing some mixture of a variety of attributes. Even if the test publishers report using CHC or some other taxonomy, the lack of well-defined attributes results in scores with the same names not actually representing the same attributes (e.g., Floyd et al. 2005; Horn 1989). For example, the WJ IV assesses an attribute called Comprehension-Knowledge (CK) and the WISC-V assesses an attribute called Verbal Comprehension (VC). CK is defined as being the same as the Gc attribute in the CHC taxonomy, so its measurement follows this definition (McGrew et al. 2014). The measurement of VC follows the “Wechsler tradition” since it uses the same subtests (although with different items) as those
on the original WISC (Wechsler 2014).1 This, in and of itself, may not be problematic. CK and VC could be distinct attributes and so would have separate grammars and measurement practices. But this is not the case. School psychologists are told that the CK and VC attributes are one and the same (e.g., Flanagan and Alfonso 2017; Flanagan et al. 2013). How can this be correct, though? They were created using different grammars, so they cannot be equivalent attributes—even if those attributes are correlated (Krause 2012).

1 Unfortunately, David Wechsler never actually defined the attribute he was measuring with the “verbal” subtests on his instruments. Instead, it appears he included them because he wanted to have a cognitive ability test that was different from those that were already in existence circa the 1920s: “My usual examination of subjects included, in addition to a short interview, administration of the Stanford-Binet or Yerkes Point Scale, and nearly always one or more of the available performance tests. It then occurred to me that an intelligence scale, combining verbal and nonverbal tests, would be a useful addition to the psychometrist’s armamentarium” (Wechsler 1981, p. 83).

A second problem with having ill-defined attributes is that test publishers conflate measuring an attribute with information about an instrument’s structural evidence (Grégoire 2013). A very common practice for publishers is to include a section in the instrument manuals providing structural evidence (usually via factor analysis) that subtests organize into components that reflect the scores provided to interpret. This is different from justifying that the instrument’s scores are measures of certain attributes. For example, the factor-analytic evidence provided in the WISC-V technical manual indicates that the subtests fit well with a higher-order model consisting of five lower-order factors (VC, Fluid/Abstract Reasoning, Visual Spatial Processing, Working Memory, and Processing Speed) and one super-ordinate general factor (Full Scale). This is not the same as providing evidence that WISC-V scores represent these abilities. This would require showing that the WISC-V publishers a priori selected the attributes they wanted to measure and followed the rules for measuring those attributes.

To see the difference, examine the Processing Speed Index (PSI) score in the WISC-V. Interpretation seems straightforward because the two core subtests (Coding and Symbol Search) have relatively strong loadings on a latent variable named Processing Speed (PS). However, the subtests’ loadings on PS represent only the common variance between these subtests. Functionally, Coding and Symbol Search are relatively homogeneous; they are both speed tests, require processing simple visual stimuli, and require a visual-motor response (Grégoire 2013). Is this really what is meant by PS? This issue could be resolved if PS had a technical definition and the WISC-V publishers described their measurement practices. But this is not done. The WISC-V publishers provide no definition of PS or justification that their measurement practice coincides with the grammar of PS. Thus, we cannot rule out that the PS factor is nothing more than an artifact of
the subtests included; if this is the case, then it would render the PSI an invalid measure of PS.

A third—and perhaps most pernicious—problem of having ill-defined attributes is that it allows the creation of test scores without end. From the perspective of selling the instrument, this is nice because having multiple scores means that the instrument can appeal to a variety of school psychologists. From a theoretical perspective, this is nonsensical. How can an instrument have more scores than there are attributes? For example, to measure g as Spearman (1927) defined it in his two-factor theory, an instrument would have a single score to interpret that is based on what is common among a group of diverse tasks. Yet, to our knowledge, no test that purports to measure g actually follows Spearman’s rules for measuring it.2 Instead, they create a variety of aggregate scores by summing various subtests or subscores. Such scores may have practical value, but none of them is in any way a measure of g.

2 A possible exception is the WJ IV, which uses principal component analysis-derived weights for the calculation of the General Intellectual Ability score. Although the results are “truly enough not identical with ‘g’ [they] are usually at any rate very good approximations to it” (Spearman 1946, p. 121).

Multidimensional Tests

When an intelligence theory is multidimensional (e.g., Gf-Gc), it is more difficult to create scores than it is for a unidimensional theory (e.g., two-factor). Luecht et al. (2006) argued that for multidimensional tests to provide useful scores, publishers need to set out to measure each of the multiple attributes well from the start of the design process (i.e., measure independent clusters of attributes; McDonald 1999). In the realm of intelligence, this means publishers should not attempt to create instruments that concurrently measure some unitary attribute (e.g., a general attribute) and then try to spread out the same information across multiple scores of more specific attributes. This strategy results in creating less reliable scores of the specific attributes. The reason is that most of the common variance among the specific attributes’ tests is also in common with the general attribute—meaning there is little unique variance left over for the specific attributes (a numeric sketch of this point follows at the end of this subsection). This has been seen in the multiple factor analyses of cognitive ability measures created to measure a panoply of attributes (e.g., Canivez and Watkins 2016). As an alternative, Luecht et al. (2006) recommended creating multiple unidimensional tests, each of which is carefully engineered to measure specific attributes. This can produce essentially independent clusters, which reduces some of the psychometric problems of measuring multiple attributes with a single instrument. Created well, a series of unidimensional factor models (as opposed to a more complex multidimensional model) may even be adequate to represent the instrument’s attributes.
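To see the variance bookkeeping behind this point, here is a minimal numeric sketch in R (the language we used for the analyses reported later). The loadings are invented for illustration and do not come from any published battery: for three subtests that load on both a general and a specific factor, most of the reliable variance in their unit-weighted composite reflects the general factor, leaving little that is unique to the specific ability.

```r
# Illustrative bifactor-style loadings for three standardized subtests that
# form one "specific ability" composite (values invented for illustration)
gen  <- c(.70, .65, .75)    # loadings on the general factor
spec <- c(.30, .35, .25)    # loadings on the specific factor
err  <- 1 - gen^2 - spec^2  # unique (error) variance of each subtest

# Variance of the unit-weighted composite (general, specific, and error
# sources are mutually uncorrelated)
comp_var <- sum(gen)^2 + sum(spec)^2 + sum(err)

# Proportion of composite variance attributable to each source:
# roughly .68 general, .13 specific, .19 error
round(c(general  = sum(gen)^2  / comp_var,
        specific = sum(spec)^2 / comp_var,
        error    = sum(err)    / comp_var), 2)
```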
Score Interpretation

Scores from psychological tests do not possess the properties of many instruments from physical science, such as a true zero point or additivity of units.

The plain fact is that test scores…can only represent at best an ordinal scale… and the most that any scores, even from the best-made tests, can actually permit us to do is merely to rank individuals on whatever amalgam of latent variables… are responsible for the total variance in the scores (Jensen 1993, pp. 141–142).

Thus, at best, scores from cognitive ability tests should only be thought of as a rank-order correlate of the attributes the test was designed to measure.

One consequence of the ordinal nature of test scores is that their units have heterogeneous order (i.e., there are heterogeneous differences between the units’ degrees; Michell 2012). In other words, the meaning of the test’s unit differs across the span of the test’s scores. While a little oversimplified, percentile ranks exemplify the idea of heterogeneous orders. Say some test’s scores are normally distributed and on an IQ scale (mean of 100, SD of 15). The difference between scores at the 51st percentile and the 50th percentile is < 1 IQ point. The difference between scores at the 99th percentile and the 98th percentile is > 4 IQ points (these values are verified in the short sketch below). The difference in percentile units is the same in both situations (1 percentile), but the meaning of the percentile unit difference differs depending on where on the test score distribution the difference is examined.

A second consequence of having ordinal scores is that the values do not retain the same meaning across dissimilar phenomena, even if they are placed on the same scale. Thus, statements such as “Marvin’s fluid reasoning skills are twice as developed as his arithmetic skills” are meaningless.

A third consequence is that employing ipsative analyses of the test scores should be done with great caution, if not abandoned altogether. Since the units are not homogeneous, it is difficult to argue that examinees have relative strengths and weaknesses on different attributes based on test scores alone. For example—again using test scores on an IQ scale—say Gladys had a reading comprehension score of 85 and a math calculation score of 115. Although there is a 30-point difference between the scores, it does not directly imply that her reading comprehension skills are much less developed (i.e., a relative weakness) than her math skills (i.e., a relative strength). The test’s “units” are not homogeneous, so differences between the scores do not have consistent meaning throughout the score distribution. Because both scores are relatively close to their respective means, it could be that the score differences imply minimal functional skill differences on the underlying attributes.
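The percentile illustration above can be checked directly. The short R sketch below computes the IQ-point differences between adjacent percentile ranks under a normal distribution with a mean of 100 and an SD of 15; the function name is ours, for illustration only.

```r
# IQ score at a given percentile rank under a normal distribution
# with mean 100 and SD 15
iq_at <- function(pctile) qnorm(pctile, mean = 100, sd = 15)

iq_at(.51) - iq_at(.50)  # ~0.38 IQ points (< 1)
iq_at(.99) - iq_at(.98)  # ~4.1 IQ points  (> 4)
```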
The ordinal nature of test scores does not mean test scores are useless. Ordinal information can be useful in scientific inference and clinical decision making (e.g., Grove and Vrieze 2013; Michell 2011). What it does mean is that the scores only derive meaning in the context of some clearly defined reference. Thus, psychological test scores can only be meaningfully interpreted in terms of where they stand in reference to something, whether it be some criterion (e.g., pass/fail, licensure) or the distribution of scores in some reference population. Returning to the example with Gladys, we could say that her reading comprehension score is lower than the reference group’s mean and her math calculation score is higher than the reference group’s mean. Likewise, depending on a school’s response-to-intervention implementation, we may say Gladys’ reading performance warrants further investigation for a Tier 2 intervention, but her math performance does not. Both interpretations would be consistent with the ordinal nature of these scores. Without more information (e.g., the difficulty of text she is able to comprehend, the type of arithmetic problem she can consistently solve), little more could be inferred about Gladys’ abilities.
Method

In the field of intelligence, there are several theories encompassing several attributes. We chose to focus on two prominent ones: the two-factor theory (Spearman 1927) and Gf-Gc theory (Cattell 1987). We chose them because they have a long history in the field of cognitive ability, have technical definitions for the attributes of interest, and provide alternative approaches to test development. We are in no way arguing that either one represents the “best” theory of cognitive ability, or that they should necessarily guide clinical test development; only that they are specified well enough to develop tests to measure their attributes.
Attributes of Interest

g

Briefly, the two-factor theory was created by Charles Spearman as an outgrowth of his attempt to measure g. The two-factor theory posits that performance on any intellectual task is primarily determined by two factors: g and some specific ability, s, related to the particular task. The exact influence of g or s depends on the particular task. Spearman’s (1927) definition of g was the eduction (i.e., induction) of relations and correlates—in other words, the ability to identify relations among objects, comprehend their implications, and then draw inferences to novel content. Moreover, he wrote

this general factor g, like all measurements anywhere, is primarily not any concrete thing but only a value or magnitude. Further, that which this magnitude measures has not been defined by declaring what it is like, but only by pointing out where it can be found. It consists in just that constituent—whatever it may be—which is common to all the abilities inter-connected by the tetrad equation. (pp. 75–76)

Spearman’s definition makes the grammar regarding g clear. It is what is in common among multiple tests that meet the tetrad differences criterion (for an explanation of the tetrad differences criterion, see Thomson 1927). For our purposes, it is sufficient that there are at least four tests and that they assess substantially different cognitive abilities. Measuring g requires finding what is in common among the tests, which is tantamount to conducting a factor analysis with a single factor.

Gf and Gc

Although somewhat unintentionally, Spearman found that some tests were not good measures of g; those tests were of “retentivity of dispositions” (Spearman 1927, Chapter 16), by which he meant non-contextual memory (e.g., memorizing digits, recalling sounds, general information). Raymond Cattell (1943) later picked up on this difference in developing his Gf-Gc theory, which was further developed and promulgated by his student, John Horn (Horn and Blankson 2012). According to Cattell (1943), Gf is very similar to Spearman’s conceptualization of g. It is “a purely general ability to discriminate and perceive relations between any fundaments, new or old… It is responsible for the intercorrelations, or general factor, found among children's tests and among the speeded or adaptation-requiring tests of adults” (p. 178). It is “measured by tests that require inductive, deductive, conjunctive, and disjunctive reasoning to understand relations among stimuli, comprehend implications, and draw inferences” (Horn 1991, p. 214). Gc comprises the breadth-of-knowledge abilities (e.g., retained general information, scholastic knowledge) that are similar to the “retentivity” abilities that Spearman said were not good indicators of g (Horn and McArdle 2007). It “consists of discriminatory habits long established in a particular field, originally through the operation of fluid ability, but not [sic] longer requiring insightful perception for their successful operation” (Cattell 1943, p. 178). It is “measured by tests that indicate the breadth and depth of the knowledge of the dominant culture” (Horn 1991, p. 214). Although the development of Gc is partially due to the “investment” of Gf throughout various culture-specific learning processes, the two attributes are independent—like all other broad abilities in Gf-Gc theory.3 This independence is important because Cattell and Horn were adamant that there is no super-ordinate attribute influencing Gf, Gc, or any other broad abilities. Both Gf and Gc are measured using factor analysis (Horn 1991). Gf and Gc are often represented as higher-order factors, but a factor’s order is arbitrary and largely dependent on the amount of redundancy in the observed variables (Cattell 1987). If tests are selected to be relatively independent except for the influence of Gf or Gc, then these attributes could just as well be represented as first-order factors.

3 There are other noted abilities contained within Gf-Gc theory (Horn and Blankson 2012), but Gf and Gc are believed to make the most important contributions to intellectual functioning.
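Both grammars ultimately reduce to a factor-analytic operation. As a minimal sketch (not the syntax used for the analyses reported below), the R code here builds a correlation matrix implied by a single common factor, using the g loadings that appear later in Table 1, verifies that the tetrad differences criterion holds, and then extracts the common factor; the same step with two correlated factors would apply to Gf and Gc.

```r
# Correlations among four diverse tests implied by one common factor,
# using the g loadings from the top portion of Table 1
g_load <- c(.67, .68, .77, .56)
Rg     <- tcrossprod(g_load)  # outer product gives the implied correlations
diag(Rg) <- 1

# Tetrad differences such as r13*r24 - r14*r23 are zero here by construction,
# which is what Spearman's criterion requires (approximately) of real data
Rg[1, 3] * Rg[2, 4] - Rg[1, 4] * Rg[2, 3]
Rg[1, 2] * Rg[3, 4] - Rg[1, 3] * Rg[2, 4]

# "Finding what is in common" among the tests: a one-factor model
factanal(covmat = Rg, factors = 1, n.obs = 500)
```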
Creating Test Scores

Spearman (1931), like Horn (1991), argued that attributes needed to be derived from what is in common among tests, not from the summation of test scores. Thus, g, Gc, and Gf should be measured in such a way that they represent only the variance that the constituent tests have in common, statistically independent of other sources of variance (e.g., other attributes). Instead of measuring these attributes by administering tests, we used information from previously published studies on the attributes. Specifically, we used factor loadings from tests chosen to represent the attributes. We set the reliability for all tests to be .85. We chose this value because it is the halfway point of the average reliabilities of the WISC-V subtests (Wechsler 2014, p. 57).

In measuring g, Spearman wrote that “any test will do just as well as any other, provided only that its correlation with g is equally high” (Spearman 1927, p. 197). Thus, we selected four tests representing distinct ability domains that Spearman (1939) noted had a strong relation to g. We used the factor loadings Spearman (1939) provided, which are given in the top portion of Table 1. For Gf and Gc, we selected tests from studies on these attributes authored by Cattell and Horn (Cattell 1963; Horn and Cattell 1966). Generally, the loadings were lower for these tests than those from the two-factor model, so we included five subtests for each attribute. The factor loadings are given in the bottom portion of Table 1.

For each model (i.e., two-factor and Gf-Gc), we created two types of scores: factor score estimates and aggregate scores. We then examined scores from both approaches using the subtest values in Table 2.

Factor Score Estimates

For simplicity, we calculated the weights needed to estimate a factor score from subtest scores using the regression method (Thurstone 1935). Unfortunately, factor scores have the problem of being indeterminate (i.e., there are multiple ways of scoring individuals on a given set of factor loadings). When the degree of indeterminacy is small, however, the different sets of factor scores will be highly correlated, indicating that individuals’ rank order will be very similar across different sets of factor scores.
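For readers who want to see the mechanics, the sketch below computes the regression-method weights, the validity (determinacy) coefficient, and the minimum possible correlation between competing factor scores for the two-factor (g) model, using the loadings and uniquenesses given in Table 1. It is a simplified stand-in for, not a reproduction of, the syntax we posted on the Open Science Framework.

```r
# Regression-method factor score weights and determinacy for the g model.
# Uniqueness = specific loading^2 + error loading^2 (Table 1 values)
lambda <- c(.67, .68, .77, .56)              # g loadings
uniq   <- c(.53, .51, .37, .64)^2 + .53^2    # unique variances
Sigma  <- tcrossprod(lambda) + diag(uniq)    # model-implied correlation matrix

w        <- solve(Sigma, lambda)   # regression weights (Thurstone 1935)
validity <- sqrt(sum(lambda * w))  # correlation with the true factor, ~.88
min_r    <- 2 * validity^2 - 1     # minimum correlation between competing
                                   # factor scores, ~.56
round(c(validity = validity, min_r = min_r), 2)
```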
Table 1  Loadings used in factor models representing two-factor and Gf-Gc theories

Two-factor model
Subtest               g     Specific  Error
Abstraction           0.67  0.53      0.53
Verbal reasoning      0.68  0.51      0.53
Space reasoning       0.77  0.37      0.53
Numerical reasoning   0.56  0.64      0.53

Gf-Gc model
Subtest               Gf    Gc    Specific  Error
Spatial               0.32  0.00  0.79      0.53
Classification        0.63  0.00  0.57      0.53
Matrices              0.50  0.00  0.69      0.53
Topology              0.51  0.00  0.68      0.53
Induction             0.55  0.00  0.65      0.53
Verbal                0.00  0.46  0.71      0.53
Reasoning             0.00  0.50  0.69      0.53
Number                0.00  0.59  0.61      0.53
Series                0.00  0.43  0.73      0.53
Mechanical knowledge  0.00  0.48  0.70      0.53

Note. Gf-Gc factor correlation: 0.40. Two-factor subtests and loadings come from Spearman (1939). Gf-Gc subtests and loadings come from Cattell (1963) and Horn and Cattell (1966). Reliability for all subtests was set to .85.
Table 2  Example subtest scores

Two-factor model
Subtest               Score
Abstraction           5
Verbal reasoning      5
Space reasoning       7
Numerical reasoning   4

Gf-Gc model
Subtest               Score
Spatial               8
Classification        10
Matrices              9
Topology              12
Induction             11
Verbal                6
Reasoning             7
Number                5
Series                8
Mechanical knowledge  6

Note. All scores are from a distribution with a mean of 10 and standard deviation of 3.
Grice (2001) recommended some methods for examining factor score performance: the minimum possible correlation between competing factor scores, validity, univocality, and correlational accuracy.4 The minimum possible correlation between competing factor scores is the smallest value possible between different factor scores from the same model. Validity is the extent to which the factor score estimates correlate with their true factor scores. Squaring the validity value provides the reliability of the factor score estimates (i.e., how much of the variance in the factor score estimates is due to the true factor). Univocality means that the factor score measures the attribute of interest exclusively and is evaluated by examining correlations among the factor score estimates and the true factor scores of the other factors. Finally, correlational accuracy is the extent to which the correlations among the factor score estimates match the correlations among the factors themselves. Obviously, the last two can only be applied to scores from the Gf-Gc model since it provides more than one attribute and score.

4 Readers interested in information on how to calculate these statistics can consult Grice (2001).

Aggregate Scores

We created aggregate scores in two ways. The first was a model-based composite score in which we summed the appropriate subtest scores (Grace and Bollen 2008). This allowed us to examine the relations among the aggregate scores and the latent attributes they were designed to measure. The second involved simulating subtest data from the factor models to represent a normative sample (Beaujean 2018). Specifically, we simulated 1000 observations for each model from a multivariate normal distribution, setting the subtests’ means to 10 and standard deviations to 3. Then, we created aggregate scores by summing across the appropriate subtests. To follow typical test construction procedures, we normalized the summed scores and created a norm-referenced score (mean of 100 and standard deviation of 15) for each possible summed subtest value.
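A condensed version of the normative-sample simulation for the two-factor model might look like the sketch below. It uses MASS::mvrnorm and a rank-based normalization as simplified stand-ins for our actual procedure (the full syntax is on the Open Science Framework), and the seed is arbitrary.

```r
library(MASS)  # for mvrnorm()

set.seed(1)                                 # arbitrary seed for this sketch
lambda <- c(.67, .68, .77, .56)             # Table 1 g loadings
R      <- tcrossprod(lambda); diag(R) <- 1  # implied subtest correlations
Sigma  <- R * 9                             # rescale to subtest SDs of 3

# Simulate a normative sample of 1000 examinees (subtest means of 10)
norms <- mvrnorm(n = 1000, mu = rep(10, 4), Sigma = Sigma)

# Aggregate score: sum the subtests, then normalize and place the sums on
# an IQ-type metric (mean 100, SD 15)
sums <- rowSums(norms)
z    <- qnorm(rank(sums) / (length(sums) + 1))  # rank-based normalization
iq   <- 100 + 15 * z
summary(iq)
```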
Data Analysis

We conducted all analyses in R (R Development Core Team 2017). Syntax is available on the Open Science Framework at osf.io/7b2nk.
Results

Two-Factor Model

Factor Score Determinacy

The minimum possible correlation between competing factor scores was .56 and the validity coefficient was .88, meaning that 77% of the variance in the g factor score estimate was due to g.
Aggregate vs. Factor Score Estimate

The correlation between the aggregate score and the factor was .87, meaning 76% of the variance in the aggregate score is due to g. Approximately 12% of the variance is due to reliable, specific subtest influences; the rest is due to error.

Turning to the example data in the top part of Table 2, the factor score estimate for the individual is −1.67, which is approximately 75 (95% CI 68–92) on an IQ scale. This indicates the examinee’s general cognitive ability is quite a bit below average, although the 68–92 confidence interval indicates that we really cannot tell how far below average without supplemental information. A somewhat similar interpretation can be made for the aggregate score, which is 69 (95% CI 64–89).5 This score would have to be interpreted with more caution, however, since it is a mixture score and likely represents more than just the examinee’s level of general cognitive ability.

5 We calculated the reliability of the aggregate scores using the Guttman-Cronbach α.
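To make the conversions concrete, the sketch below applies regression-method weights (computed as in the earlier sketch) to the two-factor subtest scores in Table 2 and places the results on an IQ metric. Because of rounding in the Table 1 loadings and the simulation-based norming, it reproduces the reported values only approximately.

```r
# Approximate reproduction of the two-factor example scores
lambda <- c(.67, .68, .77, .56)
uniq   <- c(.53, .51, .37, .64)^2 + .53^2
Sigma  <- tcrossprod(lambda) + diag(uniq)
w      <- solve(Sigma, lambda)      # regression weights

obs <- c(5, 5, 7, 4)                # Table 2 two-factor subtest scores
z   <- (obs - 10) / 3               # standardize (subtest mean 10, SD 3)

fs <- sum(w * z)                    # factor score estimate, ~ -1.67
100 + 15 * fs                       # ~75 on an IQ scale

# Aggregate score: standardize the simple sum of the subtest z scores
100 + 15 * sum(z) / sqrt(sum(Sigma))  # ~69 on an IQ scale
```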
Gf-Gc Model

Factor Score Determinacy

The minimum possible correlation between competing factor scores for Gf was .34 and for Gc was .28. The validity coefficients were .82 and .80 for Gf and Gc, respectively, meaning that approximately 67% of the variance in the Gf score is due to Gf and 64% of the variance in the Gc score is due to Gc. The univocality correlations ranged from .42 to .43, and the correlation between the Gf and Gc factor score estimates was .53.

Aggregate vs. Factor Score Estimates

The correlation between the Gf aggregate score and factor score estimate was .79, meaning approximately 62% of the variance in the aggregate score is due to Gf. Approximately 23% of the aggregate scores’ variance was due to subtest specificity and the rest was due to error. The correlation between the Gc aggregate score and the factor score estimate was also .79, meaning 62% of the variance is due to Gc. Approximately 24% of the variance was due to subtest specificity and the rest was due to error.

Turning to the example data in the bottom part of Table 2, the factor score estimates for Gf and Gc are −0.14 and −1.47, respectively, which are approximately 98 (95% CI 85–112) and 78 (95% CI 72–100) on an IQ scale. This indicates the examinee’s fluid reasoning ability is typical for the norm group. The Gc score of 78 indicates the examinee’s breadth/depth of knowledge is somewhat below what is typical for the norm group, although the confidence interval of 72–100 indicates that we do not have a terribly accurate estimate of how far below. Because the confidence intervals are so large and overlap each other, it would make no sense to say the
examinee’s culturally acquired knowledge is any differently developed than the examinee’s reasoning ability. Thus, any ipsative score comparisons or clinical profile analysis would not be warranted. A somewhat similar interpretation can be made of the aggregate scores. The fluid reasoning aggregate score is 98 (95% CI 84–113) and the crystallized knowledge aggregate score is 72 (95% CI 68–97). As with the g scores, however, these scores have to be interpreted with caution since they are both mixture scores and likely represent more than just Gf and Gc.
Conclusions and Clinical Implications

In this article, we argued that cognitive ability instruments need to be grounded in theory. That is, theory—both measurement/psychometric theory and attribute theory—should guide all aspects of test development, ranging from item creation/selection to score development and interpretation. This idea is not new. The AERA/APA/NCME (2014) Standards state in the first sentence of the first chapter that “Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (p. 11, emphasis added). Yet, many modern intelligence instruments are not based on a strong theory of the attributes. While many modern instruments’ manuals give a nod to CHC (Keith and Reynolds 2010), there are problems with this since Carroll’s and Cattell/Horn’s theories had some central components that were incompatible with each other (e.g., the existence of g). Moreover, a majority of these instruments provide so many scores that they wind up producing values that are inconsistent with the attributes or interpretations that are inconsistent with the scores’ properties.

We provided two examples of what theory-driven test construction and score interpretation could look like: one from Spearman’s two-factor theory and the other from Cattell and Horn’s Gf-Gc theory. We do not necessarily want to argue that either of these theories represents the “best” way to understand cognitive ability and go about its measurement. We chose them because the attributes within the theories are technically defined, which allowed us to generate scores and highlight the major ideas of this theory-driven process. Likewise, we do not necessarily argue that factor scores are the best way to estimate examinees’ levels on the attributes. While factor scores are part of the grammar of g, Gf, and Gc, they are not without their problems.

Despite these caveats, there are some clear take-away messages. First, instruments should have only as many scores as are consistent with the attributes the test is designed to measure. For the two-factor theory, there was one score (representing g); for Gf-Gc theory, there were only two scores (representing Gf and Gc). Other theories (or even a more intricate
specification of the Gf-Gc theory) might require more scores, but there should never be more scores to interpret than there are attributes being measured. If the theory is multidimensional in nature, this will require a more complex test design than for tests measuring a single attribute. This is because attributes should be measured as independently as possible from other attributes (Luecht et al. 2006; McDonald 1999). Depending on how well the subtests reflect the attributes, this could result in tests requiring many subtests to measure an attribute. In our examples, even though the Gf and Gc scores comprised more subtests than the g score, they were less reliable and more indeterminate. This is because the factor loadings tended to be lower for Gf and Gc than for g. Thus, to evaluate examinees’ Gf or Gc abilities accurately, even more subtests would need to be added to the 10 already in the instrument.

Second, the creation of the scores should follow the theory and grammar of the attribute. Spearman and Cattell/Horn provided technical definitions of their attributes, so any scores purporting to represent these attributes should reflect this and not just sum across a variety of attributes (i.e., create mixtures). To see why this is important, examine the difference between the g factor score estimate and the aggregate score for the data in Table 2. Because the aggregate score method treats the subtest scores equally, the Space Reasoning score of 7 has just as much weight as the Numerical Reasoning score of 4. The factor score estimate, however, puts more weight on the Space Reasoning score. Thus, the factor score estimate (75) is higher than the aggregate score (69). A similar result occurred with the Gf-Gc scores. The Gf factor score estimate is more similar to the aggregate Gf score than the Gc factor score estimate is to the aggregate Gc score. Part of the reason for this is that the Gf subtest scores are closer to the mean, but another part is that the subtests are only moderate reflections of the attributes. Thus, when there are extreme scores (as is the case with the Gc subtests), there is more of a regression-to-the-mean effect for factor score estimates than for aggregate scores.

Third, in addition to attribute theory, score creation and interpretation should follow measurement and psychometric theory. Although we assumed the attributes of g, Gf, and Gc were quantitative, it is doubtful that current test development practices are able to develop scores that are more than ordinal representations of the attributes. Consequently, score interpretation should be consistent with this state of affairs. The interpretations we provided were consistent with having ordinal data, but they were minimal; more could possibly be done. To date, little attention has been paid to this in school psychology, or psychology in general (Krause 2013). Thus, this is an area that needs further development in order to maximize the usefulness of scores from cognitive ability instruments.
Compliance with Ethical Standards

Ethical Approval: This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interest: The authors declare that they have no conflict of interest.
References American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA/APA/NCME]. (2014). Standards for educational and psychological testing (4th ed.). Washington, DC: Authors. Beaujean, A. A. (2018). Simulating data for clinical research: a tutorial. The Journal of Psychoeducational Assessment, 36, 7–20. https://doi.org/10.1177/0734282917690302. Beaujean, A. A., & Sheng, Y. (2014). Assessing the Flynn effect in the Wechsler scales. Journal of Individual Differences, 35, 63–78. https://doi.org/10.1027/1614-0001/a000128. Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071. https://doi.org/10.1037/0033-295X.111.4.1061. Borsboom, D., Cramer, A. O. J., Kievit, R. A., Scholten, A. Z., & Franić, S. (2009). The end of construct validity. In R. W. Lissitz (Ed.), The concept of validity: revisions, new directions, and applications (pp. 135–170). Charlotte: Information Age Publishing. Braden, J. P., & Ouzts, S. M. (2005). Review of the Kaufman assessment battery for children, second edition. In B. S. Plake & J. C. Impara (Eds.), The sixteenth mental measurements yearbook (2nd ed., pp. 517–520). Lincoln: Buros Institute of Mental Measurements. Bringmann, L. F., & Eronen, M. I. (2016). Heating up the measurement debate: what psychologists can learn from the history of physics. Theory & Psychology, 26, 27–43. https://doi.org/10.1177/ 0959354315617253. Canivez, G. L., & Watkins, M. W. (2016). Review of the Wechsler intelligence scale for children-fifth edition: critique, commentary, and independent analyses. In A. S. Kaufman, S. E. Raiford, & D. L. Coalson (Eds.), Intelligent testing with the WISC-V (pp. 683–702). Hoboken: Wiley. Carroll, J. B. (1996). A three-stratum theory of intelligence: Spearman’s contribution. In I. Dennis & P. Tapsfield (Eds.), Human abilities: their nature and measurement (pp. 1–17). Mahwah: Erlbaum. Cattell, R. B. (1943). The measurement of adult intelligence. Psychological Bulletin, 40, 153–193. https://doi.org/10.1037/ h0059973. Cattell, R. B. (1963). Theory of fluid and crystallized intelligence: a critical experiment. Journal of Educational Psychology, 54, 1–22. https://doi.org/10.1037/h0046743. Cattell, R. B. (1987). Intelligence: its structure, growth, and action. New York: Elsevier. Courville, T., Coalson, D. L., Kaufman, A. S., & Raiford, S. E. (2016). Does WISC-V scatter matter? In A. S. Kaufman, S. E. Raiford, & D. L. Coalson (Eds.), Intelligent testing with the WISC-V (pp. 209– 228). Hoboken: Wiley. Downing, S. M. (2006). Twelve steps for effective test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of testing (pp. 3– 25). Mahwah: Lawrence Erlbaum. Finkelstein, L. (2005). Problems of measurement in soft systems. Measurement, 38, 267–274. https://doi.org/10.1016/j. measurement.2005.09.002. Flanagan, D. P., & Alfonso, V. C. (2017). Essentials of WISC-V assessment (2nd ed.). Hoboken: Wiley.
Flanagan, D. P., Ortiz, S. O., & Alfonso, V. C. (2013). Essentials of crossbattery assessment (3rd ed.). Hoboken: Wiley. Floyd, R. G., Bergeron, R., McCormack, A. C., Anderson, J. L., & Hargrove-Owens, G. L. (2005). Are Cattell-Horn-Carroll (CHC) broad ability composite scores exchangeable across batteries? School Psychology Review, 34, 329–357. Frazier, T. W., & Youngstrom, E. A. (2007). Historical increase in the number of factors measured by commercial tests of cognitive ability: are we overfactoring? Intelligence, 35, 169–182. https://doi.org/10. 1016/j.intell.2006.07.002. Grace, J. B., & Bollen, K. A. (2008). Representing general theoretical concepts in structural equation models: the role of composite variables. Environmental and Ecological Statistics, 15, 191–213. https:// doi.org/10.1007/s10651-007-0047-7. Grégoire, J. (2013). Measuring components of intelligence: mission impossible? Journal of Psychoeducational Assessment, 31, 138–147. https://doi.org/10.1177/0734282913478034. Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6, 430–450. https://doi.org/10.1037/1082989X.6.4.430. Groth-Marnat, G. (1999). Financial efficacy of clinical assessment: rational guidelines and issues for future research. Journal of Clinical Psychology, 55, 813–824. Grove, W. M., & Vrieze, S. I. (2013). The clinical versus mechanical prediction controversy. In K. F. Geisinger, B. A. Bracken, J. F. Carlson, J. I. C. Hansen, N. R. Kuncel, S. P. Reise, & M. C. Rodriguez (Eds.), APA handbook of testing and assessment in psychology, Vol. 2: Testing and assessment in clinical and counseling psychology (pp. 51–62). Washington, DC: American Psychological Association. Hale, J. B., Fiorello, C. A., Kavanagh, J. A., Hoeppner, J.-A. B., & Gaither, R. A. (2001). WISC-III predictors of academic achievement for children with learning disabilities: are global and factor scores comparable? School Psychology Quarterly, 16, 31–55. https://doi.org/10.1521/scpq.16.1.31.19158. Horn, J. L. (1963). Equations representing combinations of components in scoring psychological variables. Acta Psychologica, 21, 184–217. https://doi.org/10.1016/0001-6918(63)90048-9. Horn, J. L. (1985). Remodeling old models of intelligence. In B. B. Wolman (Ed.), Handbook of intelligence (pp. 267–300). New York: Wiley. Horn, J. L. (1989). Models of intelligence. In R. L. Linn (Ed.), Intelligence, measurement, theory and public policy (pp. 29–73). Urbana: University of Illinois Press. Horn, J. L. (1991). Measurement of intellectual capabilities: a review of theory. In K. S. McGrew, J. K. Werder, & R. W. Woodcock (Eds.), Woodcock-Johnson psycho-educational battery-revised technical manual (pp. 197–232). Chicago: Riverside. Horn, J. L., & Blankson, A. N. (2012). Foundations for better understanding of cognitive abilities. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: theories, tests, and issues (3rd ed., pp. 73–98). New York: Guilford Press. Horn, J. L., & Cattell, R. B. (1966). Refinement and test of the theory of fluid and crystallized intelligence. Journal of Educational Psychology, 57, 253–270. https://doi.org/10.1037/h0023816. Horn, J. L., & McArdle, J. J. (2007). Understanding human intelligence since Spearman. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: historical developments and future directions (pp. 205–247). Mahwah: Erlbaum. Hunsley, J., & Mash, E. J. (2007). Evidence-based assessment. Annual Review of Clinical Psychology, 3, 29–51. 
https://doi.org/10.1146/ annurev.clinpsy.3.022806.091419. Jackson, J. S. H., & Maraun, M. (1996). The conceptual validity of empirical scale construction: the case of the sensation seeking scale. Personality and Individual Differences, 21, 103–110. https://doi. org/10.1016/0191-8869(95)00217-0.
Jensen, A. R. (1993). Psychometric g and achievement. In B. R. Gifford (Ed.), Policy perspectives on educational testing (pp. 117–227). New York: Kluwer Academic Publishers. Jensen, A. R. (2002). Galton’s legacy to research on intelligence. Journal of Biosocial Science, 34, 145–172. https://doi.org/10.1017/s0021932002001451. Kamphaus, R. W., Winsor, A. P., Rowe, E. W., & Kim, S. (2012). A history of intelligence test interpretation. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment (3rd ed., pp. 56–70). New York: Guilford. Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73. https://doi.org/10.1111/jedm.12000. Kaufman, A. S., & Kaufman, N. L. (2004). Kaufman assessment battery for children-second edition. Circle Pines: American Guidance Service. Kaufman, A. S., Raiford, S. E., & Coalson, D. L. (2016). Intelligent testing with the WISC-V. Hoboken: Wiley. Keith, T. Z., & Reynolds, M. R. (2010). Cattell-Horn-Carroll abilities and cognitive tests: what we’ve learned from 20 years of research. Psychology in the Schools, 47, 635–650. https://doi.org/10.1002/pits.20496. Kingston, N. M., Scheuring, S. T., & Kramer, L. B. (2013). Test development strategies. In K. F. Geisinger, B. A. Bracken, J. F. Carlson, J. I. C. Hansen, N. R. Kuncel, S. P. Reise, & M. C. Rodriguez (Eds.), APA handbook of testing and assessment in psychology, Vol. 1: test theory and testing and assessment in industrial and organizational psychology (pp. 165–184). Washington, DC: American Psychological Association. Kline, P. (2000). The handbook of psychological testing (2nd ed.). London: Routledge. Krause, M. S. (2012). Measurement validity is fundamentally a matter of definition, not correlation. Review of General Psychology, 16, 391–400. https://doi.org/10.1037/a0027701. Krause, M. S. (2013). The data analytic implications of human psychology’s dimensions being ordinally scaled. Review of General Psychology, 17, 318–325. https://doi.org/10.1037/a0032292. Littell, W. M. (1960). The Wechsler intelligence scale for children: review of a decade of research. Psychological Bulletin, 57, 132–156. https://doi.org/10.1037/h0044513. Luecht, R. M., Gierl, M. J., Tan, X., & Huff, K. (2006). Scalability and the development of useful diagnostic scales. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA. Luria, A. R. (1973). The working brain: an introduction to neuropsychology. New York: Basic Books. Maraun, M. D. (1998a). Measurement as a normative practice: implications of Wittgenstein’s philosophy for measurement in psychology. Theory & Psychology, 8, 435–461. https://doi.org/10.1177/0959354398084001. Maraun, M. D. (1998b). The nexus misconceived: Wittgenstein made silly. Theory & Psychology, 8, 489–501. https://doi.org/10.1177/0959354398084004. Mari, L., Carbone, P., & Petri, D. (2015). Fundamentals of hard and soft measurement. In A. Ferrero, D. Petri, P. Carbone & M. Catelani (Eds.), Modern measurements: Fundamentals and applications (pp. 203–262). Hoboken, NJ: Wiley-IEEE Press. McDonald, R. P. (1999). Test theory: a unified treatment. Mahwah: Erlbaum. McGrew, K. S. (2009). CHC theory and the human cognitive abilities project: standing on the shoulders of the giants of psychometric intelligence research. Intelligence, 37, 1–10. https://doi.org/10.1016/j.intell.2008.08.004. McGrew, K. S., LaForte, E. M., & Schrank, F. A. (2014). Woodcock-Johnson IV technical manual. Rolling Meadows: Riverside.
Michell, J. (1999). Measurement in psychology: critical history of a methodological concept. New York: Cambridge University Press. Michell, J. (2007). Measurement. In S. P. Turner & M. W. Risjord (Eds.), Philosophy of anthropology and sociology (pp. 71–119). Amsterdam: North Holland. Michell, J. (2011). Qualitative research meets the ghost of Pythagoras. Theory & Psychology, 21, 241–259. https://doi.org/10.1177/ 0959354310391351. Michell, J. (2012). Alfred Binet and the concept of heterogeneous orders. Frontiers in Psychology, 3(261), 1–8. https://doi.org/10.3389/fpsyg. 2012.00261. Petri, D., Mari, L., & Carbone, P. (2015). A structured methodology for measurement development. IEEE Transactions on Instrumentation and Measurement, 64, 2367–2379. https://doi.org/10.1109/TIM. 2015.2399023. Pfeiffer, S. I., Reddy, L. A., Kletzel, J. E., Schmelzer, E. R., & Boyer, L. M. (2000). The practitioner’s view of IQ testing and profile analysis. School Psychology Quarterly, 15, 376–385. https://doi.org/10.1037/ h0088795. R Development Core Team. (2017). R: a language and environment for statistical computing (version 3.3.3) [computer program]. Vienna: R Foundation for Statistical Computing. Raiford, S. E. (2017). Essentials of WISC-V integrated assessment. Hoboken: Wiley. Schneider, W. J. (2013). What if we took our models seriously? Estimating latent scores in individuals. Journal of Psychoeducational Assessment, 31, 186–201. https://doi.org/10. 1177/0734282913478046. Schneider, W. J., & McGrew, K. S. (2012). The Cattell-Horn-Carroll model of intelligence. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment (3rd ed., pp. 99–144). New York: Guilford. Schrank, F. A., McGrew, K. S., & Mather, N. (2014). Woodcock-Johnson IV tests of cognitive abilities. Rolling Meadows: Riverside. Sijtsma, K. (2012). Psychological measurement between physics and statistics. Theory & Psychology, 22, 786–809. https://doi.org/10. 1177/0959354312454353. Sijtsma, K. (2013). Theory development as a precursor for test validity. In R. E. Millsap, L. A. van der Ark, D. M. Bolt, & C. M. Woods (Eds.), New developments in quantitative psychology: presentations from the 77th annual psychometric society meeting (pp. 267–274). New York: Springer. Spearman, C. E. (1927). The abilities of man: their nature and measurement. New York: Blackburn Press. Spearman, C. E. (1931). Our need of some science in place of the word ‘intelligence’. Journal of Educational Psychology, 22, 401–410. https://doi.org/10.1037/h0070599. Spearman, C. E. (1939). Thurstone’s work re-worked. Journal of Educational Psychology, 30, 1–16. https://doi.org/10.1037/ h0061267. Spearman, C. E. (1946). Theory of general factor. British Journal of Psychology, 36, 117–131. https://doi.org/10.1111/j.2044-8295. 1946.tb01114.x. Thomson, G. H. (1927). The tetrad-difference criterion. British Journal of Psychology. General Section, 17, 235–255. https://doi.org/10.1111/ j.2044-8295.1927.tb00426.x. Thurstone, L. L. (1935). The vectors of mind: multiple-factor analysis for the isolation of primary traits. Chicago: University of Chicago Press. Tomarken, A. J., & Waller, N. G. (2003). Potential problems with Bwell fitting^ models. Journal of Abnormal Psychology, 112, 578–598. https://doi.org/10.1037/0021-843X.112.4.578. Wechsler, D. (1950). Cognitive, conative, and non-intellective intelligence. American Psychologist, 5, 78–83. https://doi.org/10.1037/ h0063112.
Wechsler, D. (1975). Intelligence defined and undefined: a relativistic appraisal. American Psychologist, 30, 135–139. https://doi.org/10.1037/h0076868. Wechsler, D. (1981). The psychometric tradition: developing the Wechsler adult intelligence scale. Contemporary Educational Psychology, 6, 82–85. https://doi.org/10.1016/0361-476X(81)90035-7. Wechsler, D. (2014). Wechsler intelligence scale for children-fifth edition administration and scoring manual. Bloomington: NCS Pearson.
A. Alexander Beaujean, Ph.D., ABAP is an associate professor in the Department of Psychology and Neuroscience at Baylor University. His research interests include psychological assessment and measurement, individual differences, and quantitative methods. Nicholas F. Benson is Associate Professor of School Psychology in the Department of Educational Psychology at Baylor University. His research interests focus broadly on psychological and educational assessment, with emphasis on examining the validity of interpretations and uses of test scores.