© 2006 Journal of Occupational Rehabilitation, Vol. 16, No. 1, March 2006. DOI: 10.1007/s10926-005-9014-z
Measurement Qualities of a Self-Report and Therapist-Scored Functional Capacity Instrument Based on the Dictionary of Occupational Titles

Craig A. Velozo,1,2,6 Bongsam Choi,3 Sheryl Eckberg Zylstra,4 and Rochelle Santopoalo5

Published online: 17 May 2006
Studies provide convincing arguments to support the development of functional capacity instruments based on the Dictionary of Occupational Titles (DOT). The purpose of this study is to investigate the item-level measurement qualities of a newly developed DOT-based functional capacity instrument for clients undergoing rehabilitation treatment for back pain. Client and therapist ratings were collected on 124 clients from 27 rehabilitation sites using the newly developed Occupational Rehabilitation Data Base (ORDB) functional capacity instrument. Rasch analysis was used to investigate: (1) unidimensionality, (2) hierarchical item difficulty continuum, (3) rater severity, and (4) person–item match. Overall, the functional capacity scale of the ORDB showed good measurement qualities. All items except the Handling item fit the Rasch measurement model. Because of its high fit statistics and its loading on factors independent of the remainder of the items, the Handling item was removed from further analyses. Separate client-rated and therapist-rated instruments retained good item-level psychometrics. While client and therapist items showed similar item-difficulty hierarchical structures, clients had a tendency to be more severe in their ratings, and the correlation between client and therapist ratings was relatively low, 0.32. These findings suggest that Handling items should not be included in a DOT-based measure for clients with back pain. While the above psychometric study supports using client or therapist ratings as independent instruments, the lack of concordance between these ratings requires further investigation.

KEY WORDS: low back pain; psychometrics; work capacity evaluation.
1 Rehabilitation Outcomes Research Center, Department of Veterans Affairs Medical Center, Gainesville, Florida.
2 Department of Occupational Therapy, College of Public Health and Health Professions, University of Florida, Gainesville, Florida.
3 Rehabilitation Science Doctoral Program, College of Public Health and Health Professions, University of Florida, Gainesville, Florida.
4 Student Support Services, Chehalis, Washington.
5 Global Halloween Alliance Corporation, Evanston, Illinois.
6 Correspondence should be directed to Craig Velozo, Department of Occupational Therapy, College of Public Health and Health Professions, University of Florida, P.O. Box 100164, Gainesville, Florida, 32610-0164; e-mail: [email protected].
© 2006 Springer Science+Business Media, Inc.
INTRODUCTION

With an almost endless array of assessments designed to evaluate the potential to work, there is little consensus on how to measure the functional capacity of workers with injuries (1–4). Functional capacity instruments range from those which assess isolated parts of the body, such as lumbar strength, to those which assess performance involving several body parts working together, as in isometric or isoinertial lifting (4,5). While designed to approximate the capacity to accomplish work tasks, many of these instruments fail to reflect the multitude of activities that are involved in a typical work situation. The limitations of functional capacity instruments have prompted several authors to call for more direct measures of functional status (4,6–9). Abdel-Moty et al. (1) provide convincing arguments to support the development of job-specific residual functional capacity (RFC) measurement based on the Dictionary of Occupational Titles (DOT). These authors and others note that impairment assessments, normally done by physicians, require a significant conceptual leap for translation into functional limitations (10–12). In contrast, DOT classifications represent a common mechanism for translating work capacity and functional restoration assessments into functional status and functional outcomes for clients (12). Standardized methodology using DOT-based assessments generally has been limited to proprietary work evaluation systems (13), whose proprietary nature leaves little general access to reliability and validity support. Fishbain et al. were the first to develop an instrument that measures RFC based on DOT classifications (10,12). Their instrument consists of 17 DOT items (Standing, Walking, Sitting, Lifting, Carrying, Pushing, Pulling, Climbing, Balancing, Stooping, Kneeling, Crouching, Crawling, Reaching, Handling, Fingering, and Feeling) that are scored on a dichotomous (pass/fail) scale.
The authors present descriptive statistics, including the percentage of 67 patients with low back pain passing each item, and the results of a factor analysis indicating that DOT classifications fall into four factors (mobility/strength, pushing/pulling, tolerance, and manual dexterity) (10). These findings led the authors to state, “. . . any DOT-RFC battery should contain tests addressing each of these four factors; and RFC batteries not designed to address aspects of these four factors will not tap all the potential functional deficits.” While Fishbain et al. (10,12) have shown that a DOT-based instrument demonstrates good psychometric properties, interpretations such as the above may be challenged from a clinical perspective (i.e., the relevance of assessing manual dexterity in clients with back pain). Item Response Theory (IRT) methodologies, such as Rasch analysis, may clarify the validity of Fishbain et al.’s interpretations. Kopec (14), in a review of back-specific questionnaires, encouraged using Rasch and IRT approaches in evaluating and developing functional outcome measures for persons with back pain. This methodology has been increasingly used to develop and validate a number of functional outcome scales in rehabilitation (15–24). In contrast to classical test theory, IRT focuses on the psychometric properties of the items of an instrument instead of the instrument as a whole. Through this methodology, one can estimate the probability that a respondent will endorse an item (e.g., choose “yes” or “no”) or select a particular rating of that item (e.g., choose “no difficulty”, “some difficulty”, or “a lot of difficulty”). IRT places item difficulty and person ability on the same linear continuum, thus “connecting” an individual’s response on items to their level of ability (or disability) (25). For example,
a person with severe back pain would likely endorse relatively easy items such as “able to sit for 10 min” while a person with mild back pain would likely endorse more challenging items such as “lifting 25 pounds from floor to waist”. This methodology can eventually lead to calibrating large item banks that can be used to selectively administer questions to respondents through computerized adaptive testing (26–29). The purpose of this study is to investigate the item-level measurement qualities of a DOT-based functional capacity scale. Rasch analysis was used to investigate the following instrument qualities: (1) unidimensionality, (2) a hierarchical item difficulty continuum, (3) rater severity, and (4) person–item match (30,31). Unidimensionality refers to measuring a single, dominant construct, even when multiple attributes are measured (32,33). This characteristic is a necessary precursor to combining the items to obtain a total score for an assessment (34). A hierarchical item difficulty continuum refers to the structure in which items of a unidimensional construct progress from easy to difficult in a hypothesized fashion. The empirically derived item difficulties produced by Rasch analysis can serve as a means of validating the structure of the measurement instrument. For example, for individuals with back pain, we would hypothesize that Handling items would be easier than Lifting items. Rasch analysis item calibrations, in addition to verifying or challenging the above hypothesis, can detect rater severity. Rater severity refers to a rater’s tendency to give higher or lower scores relative to other raters. The dramatic influence that raters have on test scores has been demonstrated by a number of researchers (35–40). That is, different raters rate the same construct, but show different rating severities (i.e., patient self-ratings of functional capacity may be consistently lower or higher than therapist ratings).
Of particular interest in this study is the relative severity of client self-reports versus therapist-generated ratings. Finally, Rasch analysis places both person ability and item difficulties on the same linear continuum. This aspect of the analysis will reveal ceiling and floor effects or “gaps” where item difficulty calibrations do not match person ability calibrations. In addition to revealing these effects, the person–item match can provide insight into ways to ameliorate these limitations in instrument design, i.e., generating more challenging items to remove ceiling effects and easier items to remove floor effects.
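The core Rasch relationship described above, in which endorsement probability depends only on the gap between person ability and item difficulty on a shared logit continuum, can be sketched in a few lines of Python. The calibrations below are hypothetical values chosen for illustration; they are not estimates from the ORDB data.

```python
import math

def rasch_probability(person_ability, item_difficulty):
    """Probability that a person endorses (passes) a dichotomous item
    under the Rasch model: P = exp(B - D) / (1 + exp(B - D)),
    where B is person ability and D is item difficulty, in logits."""
    return 1.0 / (1.0 + math.exp(-(person_ability - item_difficulty)))

# Hypothetical calibrations, for illustration only:
easy_item = -2.0   # e.g., "able to sit for 10 min"
hard_item = 2.0    # e.g., "lifting 25 pounds from floor to waist"

severe_pain = -1.5  # low-ability (more disabled) person
mild_pain = 2.5     # high-ability (less disabled) person

print(round(rasch_probability(severe_pain, easy_item), 2))  # well above 0.5
print(round(rasch_probability(severe_pain, hard_item), 2))  # near zero
print(round(rasch_probability(mild_pain, hard_item), 2))    # well above 0.5
```

When ability equals difficulty the probability is exactly 0.5, which is why item calibrations and person measures can be read off the same scale.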
METHODS

Instrumentation

Our research team developed a DOT-based functional capacity instrument, similar to that of Fishbain et al. (10,12), as a part of an Occupational Rehabilitation Data Base (ORDB) instrument (41). The ORDB consists of several scales that are designed to measure critical variables relevant to return to work in work-related rehabilitation clinics. One of the scales, the functional capacity scale, measures 10 DOT job factors: standing, walking/running, sitting, lifting, carrying, pushing/pulling, climbing, stooping/crouching/kneeling, reaching, and handling/fingering. As with Fishbain et al.’s DOT-RFC, the above factors were derived from the job factors listed in the DOT (10,42,43). It should be noted that not all DOT classifications were used and some of the classifications were combined (i.e., pushing/pulling, stooping/crouching/kneeling, handling/fingering). All items were scored on a four-point scale: 1-severely impaired, 2-moderately impaired, 3-mildly impaired, and 4-not impaired.
If a DOT-related area was not a focus of treatment, the factor was not scored. The ORDB consists of separate therapist and client rater forms. Both include definitions and “key aspects” of each of the 10 DOT job factors and descriptions of each of the four rating categories. An example of a DOT job factor is “Standing: Remaining on one’s feet in an upright position at a work station without moving about (key aspect: duration).” An example of a rating category is “Mildly impaired: demonstrates at least 75–99% of the physical demands needed for job.” Both therapists and clients were asked to rate each item relative to the demands of the client’s previous job; if the client was not planning to return to that job, the items were to be rated based on an anticipated job.
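Because the ORDB items use a four-category rating scale, the analysis reported below relies on the Andrich rating scale model rather than the dichotomous form. A minimal sketch of its category probabilities follows; the step calibrations (tau values) are hypothetical, since the published analysis does not report them.

```python
import math

def rating_scale_probs(ability, item_difficulty, thresholds):
    """Andrich rating-scale model: probability of each rating category.
    `thresholds` are step calibrations (tau_k) shared across items.
    Categories 0..len(thresholds) here map onto the ORDB's ratings
    1 (severely impaired) .. 4 (not impaired)."""
    numerators = [0.0]  # category 0 has an empty cumulative sum
    total = 0.0
    for tau in thresholds:
        total += ability - item_difficulty - tau
        numerators.append(total)
    exps = [math.exp(n) for n in numerators]
    denom = sum(exps)
    return [e / denom for e in exps]

# Hypothetical, roughly symmetric step calibrations (not published values):
taus = [-1.5, 0.0, 1.5]
probs = rating_scale_probs(ability=1.0, item_difficulty=0.0, thresholds=taus)
print([round(p, 2) for p in probs])  # four category probabilities summing to 1
```

As ability rises relative to item difficulty, the most probable category shifts from "severely impaired" toward "not impaired", which is the mechanism that lets polytomous ratings be placed on the same logit continuum as the persons.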
Subjects

This study is part of a larger project to develop the ORDB functional capacity instrument and is based on data collected as part of the second clinical field test of the ORDB. Clinical sites for this field test were recruited through informational articles presented in occupational therapy and physical therapy trade publications and via word of mouth. The criterion for participation was that the rehabilitation facility incorporated return-to-work issues into its treatment program (e.g., work hardening and work conditioning). Prior to the field testing, representatives from 28 rehabilitation facilities were educated on the theoretical concepts underlying the instrument and trained on procedures for data collection in a 1-day workshop. Following a 3-month data collection period, usable data was received from 27 sites. Data from one site was considered unusable due to the degree of missing information. Subjects consisted of all appropriate clients presenting to the participating field sites between August 22, 1994 and November 11, 1994. While data was received on a total of 230 clients, only data from clients with low back pain (n = 124) were used for the present study. Sixty-six percent (82/124) of the clients participating in the study were males and 34% (42/124) were females, with an average age of 38.1 years (range 21.0–65.0 years). Eighty-seven percent (104/120) were currently not working while 13% (16/120) were working (missing data on four clients), and 80% (97/121) were receiving workers’ compensation while 20% (24/121) were not (missing data on three clients). Seventy-one percent (81/114) completed the work rehabilitation program and 29% (33/114) did not (missing data on 10 clients).
Data Analysis

Admission scores were analyzed with the Winsteps Rasch analysis computer program, using the rating scale model (44,45). The Rasch or one-parameter model is the most robust of the IRT models relative to sample size, with a recent Monte Carlo simulation study showing stable item calibrations and accurate fit statistics obtained with as few as 100 subjects for a 10-item test (46). The Winsteps program provides goodness-of-fit statistics for each test item and each subject. These fit statistics were utilized to identify items that did not fit the Rasch model criterion of unidimensionality. In this study, items or person responses that presented an infit or outfit mean square (MnSq) ≥ 1.4 and a standardized fit statistic (ZSTD) greater than 2.0 were considered to misfit, an indication that the item or person was responding
erratically relative to other items (33,47–49). The erratic characteristic indicated by misfit suggests that: (a) the item might be measuring a different construct or (b) the item needs further clarification or revision to fit the Rasch model. Several studies have shown that unidimensionality cannot be determined solely by fit statistics (50,51). Therefore, factor analysis with oblique rotation was performed to identify underlying constructs that may be represented by the instrument items (52). Rasch analysis generates a log-odds unit (logit) scale to present person ability and item difficulty measures. Logits are logarithmic transformations of item and person data into interval scales. Item difficulty is based on the ratio of the probability of success over the probability of failure on an item (if items are dichotomously rated) or success over failure on an item at a particular rating (if items have a rating scale). The analysis places items along a continuum of “easy” to “difficult.” This item difficulty hierarchy (“placing” items on a continuum of easy to hard) provides a means to investigate the construct validity of the instrument. For example, logically, it would be expected that for clients with low back pain, stooping and lifting would be more difficult than reaching and handling (53). This should be reflected in stooping and lifting items having higher item-difficulty calibrations than reaching and handling items. Furthermore, the availability of both therapist ratings and client self-report ratings within the same instrument allows us to investigate whether one type of rater (e.g., therapist) is more “severe” or more “lenient” than another type of rater (e.g., client). That is, a similar item, for example, Lifting as rated by a therapist, may show a higher calibration (i.e., be more difficult) than Lifting as rated by a client.
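The log-odds transformation described above can be sketched directly; the pass rates below are illustrative values, not data from the study.

```python
import math

def logit_from_proportion(p_success):
    """Log-odds (logit) transform: ln(p / (1 - p)).
    An item passed by half the sample sits at 0 logits; the sign is
    flipped below so that harder items get higher difficulty values."""
    return math.log(p_success / (1.0 - p_success))

# Illustrative pass rates (hypothetical, not from the study):
for p in (0.9, 0.5, 0.2):
    difficulty = -logit_from_proportion(p)  # lower pass rate -> harder item
    print(f"pass rate {p:.0%}: difficulty {difficulty:+.2f} logits")
```

The transform is what makes the resulting scale interval-level: equal logit differences correspond to equal changes in log-odds everywhere along the continuum, unlike raw percentage-correct scores.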
If raters appear to be rating the same construct but show different severities, statistical methods may be used to equate the responses (39). The determination of how well the ORDB items measure the sample under study is based on the principle that Rasch analysis “places” person measures and item calibrations along the same linear continuum. Persons reporting higher functional capacity receive higher measures and persons reporting lower functional capacity receive lower measures. In a similar fashion, items that are of greater challenge receive higher calibrations and items that are of less challenge receive lower calibrations. By plotting measures of persons and items on the same linear continuum, one can determine how well the items of an instrument capture the functional capacity ability expressed by the sample. Ceiling effects will be revealed if persons show higher measures than are represented by items. Floor effects will be revealed if persons show lower measures than are represented by items. In addition, gaps across the central portions of the scale indicate that items are not differentiating persons in that range. In addition to the above analyses, Winsteps presents a number of more general psychometric statistics that provide information on the quality of the instrument. Person separation reliability, analogous to Cronbach’s alpha, provides an indication of the internal consistency of the instrument. The separation ratio (SR) provides an indication of the number of statistically significant strata into which the sample is divided [SR = (4Gp + 1)/3, where Gp = person separation index] (54). The appearance of significant strata in the sample permits classifying each stratum into meaningful categories (e.g., “low”, “medium”, and “high” ability clients). Finally, point-measure correlation coefficients, showing the correlation of each item to the entire assessment, are also produced by the Winsteps program (44). While there are parallels to the above statistics in traditional statistical programs, the advantage of using the Winsteps-derived statistics is their robustness to missing data.
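The strata formula can be checked against the person separation index of 2.43 reported in the Results. Note that the G entering the formula is the separation index (the true person standard deviation expressed in calibration-error units), not the reliability coefficient itself.

```python
def strata(separation_index):
    """Number of statistically distinct ability strata,
    SR = (4G + 1)/3, where G is the person separation index
    (true person SD in calibration-error units)."""
    return (4 * separation_index + 1) / 3

# Separation index reported for the combined therapist-client ratings:
print(round(strata(2.43), 2))  # -> 3.57 distinct levels of person ability
```

This reproduces the 3.57 strata reported below for the combined analysis, i.e., the instrument can statistically distinguish roughly three to four levels of functional capacity in this sample.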
RESULTS

As an initial means to determine the unidimensionality of the ORDB functional capacity instrument, Rasch fit statistics were examined using admission data only. Using the criterion of misfit as MnSq > 1.4 and ZSTD > 2.0, one item misfit: Handling as rated by the client. Furthermore, Handling rated by the client and Handling rated by the therapist both showed low point-measure correlation coefficients, 0.30 and 0.23, respectively, indicating that these two items were not correlating well with the rest of the items of the instrument. To further examine the unidimensionality of the ORDB functional capacity instrument, a rotated factor analysis was performed on therapist-rated and client-rated admission data only. Table I shows that items from the instrument load on five factors, all with eigenvalues greater than 1. Using a criterion of .46 as a “significant” loading (55), all therapist-rated items except Handling load on factor 1. The therapist Reaching and Handling items also load on factor 5. For client ratings, all items except Sitting, Reaching, and Handling load on factor 2. There is no obvious pattern for the cross-loadings on factors 2 and 4: client ratings for Standing, Walking, Lifting, Carrying, Pushing, Climbing, and Stooping load on factor 2, and client ratings for Standing, Walking, Sitting, Climbing, and Reaching load on factor 4. Finally, the Handling and Reaching items for clients load on factor 3. Since the Handling items for both therapist and client ratings were the only items to load on a separate factor (factors 5 and 3, respectively), Handling items rated by the therapist and client were removed from further analyses. Table II presents item measures, error, infit/outfit statistics, and point-measure correlation coefficients for the 18 remaining DOT items (9 items rated by therapist, 9 items rated by client). All items except the Sitting item rated by therapists show acceptable infit and outfit statistics (MnSq < 1.4 and ZSTD < 2.0).
Table I. Factor Loadings for Client (C) and Therapist (T) Admission Items

Item            Factor 1   Factor 2   Factor 3   Factor 4   Factor 5
Standing (T)      0.643      0.277     −0.176     −0.466      0.379
Walking (T)       0.765      0.228     −0.022     −0.402      0.378
Sitting (T)       0.499      0.111      0.222     −0.442      0.327
Lifting (T)       0.904      0.232      0.092     −0.169      0.128
Carrying (T)      0.915      0.230      0.021     −0.123      0.162
Pushing (T)       0.811      0.145      0.009     −0.248      0.230
Climbing (T)      0.695      0.075      0.022     −0.289      0.453
Stooping (T)      0.853      0.138      0.014     −0.130      0.243
Reaching (T)      0.612     −0.003      0.089     −0.157      0.664
Handling (T)      0.211     −0.041      0.107      0.005      0.856
Standing (C)      0.225      0.618      0.012     −0.639      0.265
Walking (C)       0.186      0.480      0.013     −0.740      0.333
Sitting (C)       0.202      0.341      0.314     −0.729     −0.064
Lifting (C)       0.183      0.904      0.185     −0.288     −0.057
Carrying (C)      0.192      0.888      0.213     −0.357     −0.066
Pushing (C)       0.179      0.800      0.281     −0.441      0.018
Climbing (C)      0.144      0.471      0.215     −0.774     −0.110
Stooping (C)      0.179      0.810      0.155     −0.384      0.088
Reaching (C)      0.140      0.421      0.778     −0.480      0.018
Handling (C)      0.020      0.211      0.855     −0.122      0.168

In addition, all 18 items show point-measure correlations above 0.45. Overall, the combined therapist–client ratings showed good internal-consistency psychometrics and adequately separated the sample under study. Person separation reliability (analogous to Cronbach’s alpha) was 0.86. The person separation index (person standard deviation in calibration-error units) was 2.43, defining 3.57 statistically distinct levels of person ability (strata with centers three calibration errors apart). In addition to providing item fit and score correlation statistics, Table II shows the item difficulty hierarchy and the relationship of this hierarchy across therapist–client ratings. The easiest items (items most likely to be endorsed with a high rating, at the bottom of Table II) are Reaching rated by the therapist and by the client, and Sitting rated by the therapist. The most difficult items (items least likely to be endorsed with a high rating, at the top of Table II) are Stooping, Lifting, and Carrying, all rated by the client. In addition to providing an overall indication of item difficulty, since ratings are included from both the therapist and the client, Table II also provides an indication of the relative severity of ratings across therapist and client raters. For all paired items (e.g., Stooping rated by client and Stooping rated by therapist), clients have a tendency to rate the item as more difficult than therapists do. For example, Stooping as rated by the client is at 0.99 ± 0.13 logits while Stooping as rated by the therapist is at 0.27 ± 0.12 logits. While client–therapist calibrations for Lifting and Carrying differ by only 0.29–0.32 logits, the rater differences for the remainder of the items range from 0.60 to 1.16 logits. While the above analyses are based on the combined scoring of the instrument by both therapists and clients, a critical question is whether the instrument retains its internal consistency and person separation qualities when rated by therapists alone or by clients alone.
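The rater-severity differences described above can be recomputed directly from the item calibrations in Table II:

```python
# Item calibrations (logits) from Table II: client (C) vs. therapist (T).
calibrations = {
    "Stooping":        {"C": 0.99, "T": 0.27},
    "Lifting":         {"C": 0.75, "T": 0.43},
    "Carrying":        {"C": 0.61, "T": 0.32},
    "Pushing/Pulling": {"C": 0.55, "T": -0.05},
    "Standing":        {"C": 0.51, "T": -0.33},
    "Sitting":         {"C": 0.27, "T": -0.89},
    "Walking":         {"C": 0.26, "T": -0.44},
    "Climbing":        {"C": 0.20, "T": -0.71},
    "Reaching":        {"C": -0.91, "T": -1.81},
}

# A positive difference means clients rated the item as more difficult.
for item, m in calibrations.items():
    diff = m["C"] - m["T"]
    print(f"{item:<16} client - therapist = {diff:+.2f} logits")
```

Every difference is positive, with Lifting (0.32) and Carrying (0.29) the smallest and Sitting (1.16) the largest, matching the 0.29–0.32 and 0.60–1.16 logit ranges quoted in the text.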
Separate Rasch analyses of therapist and client ratings were performed. Person separation reliability was 0.85 for therapist ratings and 0.82 for client ratings (as compared to 0.86 for the combined therapist–client ratings). In terms of strata, the present sample is separated into 3.53 strata for therapist ratings and 3.21 strata for client ratings (as compared to 3.57 strata for the combined client–therapist analysis). Figure 1 shows the item–person match for client ratings (left) and the item–person match for therapist ratings (right).7 The Xs on the left side of each “map” show the distribution of the sample in terms of “functional capacity ability” as measured by the ORDB instrument. Xs at the bottom of the scale represent individuals of low ability and Xs at the top of the scale represent individuals of high ability. Both maps show a relatively normal distribution of patient abilities, ranging between −3.5 and 5.0 logits for therapist ratings and between −5.0 and 5.0 logits for client ratings. In spite of the similarities in person ability distributions, the correlation between client-generated person measures and therapist-generated person measures was weak at 0.32 (p < 0.05). Item difficulty calibrations also match person ability measures fairly well for both the client ratings and the therapist ratings. The average item difficulty was 0.89 ± 1.49 logits higher than average person ability for client ratings and only 0.02 ± 1.60 logits lower than the average person ability for therapist ratings (compare the “M” to the left of each vertical line, which represents the average person measure, to the “M” to the right of each vertical line, which represents the average item measure). The person ability distribution for both maps also shows no obvious floor or ceiling effects for the instrument; for the client ratings, one client is in the

7 In order to compare item difficulties across the client- and therapist-rated analyses, each analysis was anchored at the average person ability measure of 0.
Table II. Fit Statistics at Admission on Nine Items Rated by Therapist (T) and Client (C), Handling Item Excluded

Item                  Measure   Error   Infit   Infit   Outfit   Outfit   Score
                      (logits)          MnSq    ZSTD    MnSq     ZSTD     correlation
Stooping (C)            0.99     0.13    1.04    0.30    0.97    −2.00     0.59
Lifting (C)             0.75     0.13    1.05    0.40    1.02     0.20     0.58
Carrying (C)            0.61     0.13    0.87   −1.20    0.86    −1.10     0.59
Pushing/Pulling (C)     0.55     0.12    0.88   −1.00    0.87    −1.10     0.59
Standing (C)            0.51     0.12    1.05    0.40    1.01     0.10     0.60
Lifting (T)             0.43     0.12    0.78   −2.00    0.78    −2.00     0.67
Carrying (T)            0.32     0.12    0.78   −2.00    0.77    −2.10     0.65
Stooping (T)            0.27     0.12    0.83   −1.60    0.90    −0.80     0.61
Sitting (C)             0.27     0.12    1.15    1.20    1.17     1.30     0.50
Walking (C)             0.26     0.12    0.97   −0.30    0.97    −0.20     0.55
Climbing (C)            0.20     0.12    1.07    0.60    1.08     0.70     0.54
Pushing/Pulling (T)    −0.05     0.12    0.97   −0.30    1.00     0.00     0.57
Standing (T)           −0.33     0.12    1.08    0.70    1.04     0.40     0.64
Walking (T)            −0.44     0.12    0.76   −2.30    0.76    −2.20     0.68
Climbing (T)           −0.71     0.12    1.09    0.80    1.08     0.60     0.58
Sitting (T)            −0.89     0.12    1.40    3.00    1.38     2.80     0.51
Reaching (C)           −0.91     0.12    1.13    1.10    1.11     0.90     0.47
Reaching (T)           −1.81     0.14    1.12    0.90    1.07     0.50     0.45
Fig. 1. Person–item match for client self-report ratings (left) and therapist ratings (right), excluding the Handling item, for the ORDB functional capacity instrument. The graph represents the relationship between person ability measures and item difficulty measures in logits. Each “X” on the left side of each map represents one subject, with the Xs at the bottom of the scale representing individuals of low ability and Xs at the top of the scale representing individuals of high ability. The items of the instrument at their average measure are listed to the right of each map, with the easiest items at the bottom of the map and the hardest items at the top of the map. “M” to the left/right of the vertical lines represents the average person measure and average item measure, respectively. “S” and “T” to the left/right of the vertical lines represent 1 standard deviation and 2 standard deviations, respectively. Each analysis is anchored at the average person ability measure in order to compare item difficulties across the client- and therapist-rated analyses.
ceiling and one client is in the floor, and for the therapist ratings, three clients are in the ceiling and no clients are in the floor. That is, the most difficult items of the instrument and the easiest items of the instrument appear to be separating persons of high and low ability, respectively. Figure 1 demonstrates similarities and differences in client self-report versus therapist ratings. In general, the item difficulty order is similar across the two ratings. That is, Lifting, Carrying, and Stooping are the most difficult items for both client self-report and therapist ratings. Reaching represents the easiest item across both sets of raters. The “middle-level” difficulty items, Climbing, Walking, and Sitting, are essentially at the same calibration for the client self-report while at slightly different calibrations for the therapist ratings. On average, items are about one logit more difficult when rated by the client (M = 1.0 ± 0.69 for clients and M = 0 ± 0.94 for the therapists). As noted earlier, the therapist–client difference in item calibrations is less pronounced for the Lifting and Carrying items.
DISCUSSION

Overall, the functional capacity scale of the ORDB showed exceptional measurement qualities as determined by its support of the IRT assumptions of unidimensionality, a hierarchical item continuum, and item–person match. The combined Rasch analysis of both client-rated and therapist-rated items revealed that the Handling item as rated by the client showed high fit statistics. Furthermore, factor analysis showed that for both clients and therapists the Handling item loaded on a separate factor (the third factor for client ratings and the fifth for therapist ratings). With the Handling item removed, the combined analysis showed good item-level psychometrics (item fit, item difficulty hierarchy, person separation reliability, and item–person match). Separate client self-report and therapist-rated instruments retained good item-level psychometrics. The client-rated and therapist-rated instruments did show differences. With the exception of the Handling item, therapist-rated items had a tendency to load on a single factor, while client-rated items loaded on three factors. Furthermore, while client and therapist items showed similar item-difficulty hierarchical structures, clients had a tendency to be more severe than therapists in their ratings. Finally, while the psychometrics of client and therapist ratings are similar, the correlation between client-generated and therapist-generated measures was low. One of the most critical aspects of instrument development is the quality of the items used to measure a construct. Fishbain et al. (10,12) proposed using DOT classifications to measure residual functional capacity (RFC) for individuals with chronic low back pain. Our item-level analysis, like their traditional analysis, supports such a measure with one exception: whether or not to include Handling items as part of such a measure. Both Fishbain et al.’s findings and our findings show that manual dexterity items (Handling, Fingering, and Feeling) load on factors separate from the remainder of the items.
Fishbain et al.’s (10) conclusion was that “. . . any DOT-RFC battery should contain tests addressing each of these four factors [mobility/strength, pushing/pulling, tolerance and manual dexterity] and RFC batteries not designed to address aspects of these four factors will not tap all the potential functional deficits” (p. 878). Our item-analysis results challenge such a conclusion. In the present study, not only did the Handling item load on a factor separate from all other factors, this item showed high fit statistics using Rasch analysis. Fit statistics provide an indication of observed variance over expected variance. The Handling item (as rated by
clients) in the present study showed a MnSq infit/outfit of 1.81/1.85, indicating that this item was showing 81–85% more variance than expected. In addition, the measure correlations of this item for both therapist and client ratings were the lowest among all items of the instrument (0.23–0.30). The fit and measure correlation statistics from the present study indicate that there is little relationship between a client’s overall ability on the functional capacity measure and their scoring on the Handling item. High item infit suggests that there is an unusual pattern of scores for the item when attempting to measure persons at ability levels similar to the item’s. Since the Handling items are among the easiest items on the instrument, this means that for individuals of low overall ability, there is an inconsistent pattern in rating this easy item (i.e., individuals with low ability may show high or low ratings on the Handling item). High outfit suggests that there is an unusual pattern of scores for the item when attempting to measure persons at ability levels distant from the item’s. Again, since the Handling item is among the easiest of the items, this suggests that persons of high ability may show either high or low ratings on this item. Items fit the Rasch model based on the probability of their responses. For example, an individual with back pain who is having difficulty Walking (an item of “average” difficulty) would be expected to have even more difficulty Lifting and Carrying (items more difficult than Walking). Similarly, an individual with back pain who is capable of Stooping, Lifting, and Carrying (the most challenging items) would be expected to be even more capable of Walking (an easier item). This pattern of responses is not demonstrated by the Handling item. That is, if an individual with back pain is having difficulty with Handling (an easy item), we cannot predict whether he/she will have difficulty with Stooping, Lifting, and Carrying (difficult items).
Also, if an individual with back pain is capable of Stooping, Lifting, and Carrying (difficult items), we cannot predict whether he or she will have difficulty with Handling (an easy item). The above findings suggest that Handling should not be part of a functional capacity battery for individuals with back pain.

A surprising finding in the present study was that the ORDB instrument appeared to retain its item-level measurement qualities whether it was rated as a self-report by clients or as an assessment by therapists. Person separation reliability and sample separation strata were virtually identical for client and therapist ratings. Average person ability measures derived from therapist ratings matched the average item difficulty better than did those derived from client ratings, though this may be sample dependent (i.e., with a sample of more able clients, client self-reports would likely show a better match to the ORDB item difficulties).

While client self-reports and therapist ratings showed similar item-level psychometrics, there was an apparent difference in rating severity. That is, clients had a tendency to rate themselves more severely (item difficulties were higher) relative to therapist ratings. While there are no studies directly comparing client self-reports to therapist ratings, several studies of self-report versus proxy report of physical functioning have produced mixed findings. In a study of hip fracture patients, Magaziner et al. (56) found that proxies tended to report more disability than clients. In contrast, Grootendorst et al. (57), in a study of the Ontario Health Survey with the general population, showed that, relative to self-report, proxies had a tendency to under-report the burden of morbidity for questions that address emotion and pain. Several reasons can be postulated for clients having a tendency to be more severe than therapists in reports of functional deficits.
Due to pain, clients with back pain may have a greater tendency to view functional activities as more challenging than when
viewed by someone not experiencing the pain. There may also be psychosocial reasons for clients to be severe in their self-report ratings (e.g., to appear more disabled because of pending injury-related litigation). Rasch analysis of rater severity may provide an innovative method for future studies comparing self-reports to therapist ratings.

In addition to severity differences, client- and therapist-generated measures showed a low correlation, which is of concern. It raises two questions: why is there a discrepancy between client and therapist measures, and if such a discrepancy exists, which measure is more valid, the client-generated or the therapist-generated measure? The discrepancy may result from differing perspectives. The therapist's rating is likely to be based largely on the clinical assessment of the client, while the client's rating is likely to be based largely on self-assessment of functioning in everyday life. One may hypothesize that client self-reports are less objective because of the adoption of a "sick role" and the potential influence of secondary gains (e.g., compensation for disability). Future studies should include functional performance measures to clarify the validity of client- and therapist-generated measures.

In summary, the above study supports the findings of Fishbain et al. (1,10,12) in demonstrating that DOT classifications can be used to develop a functional capacity instrument with good measurement properties. Item-level statistical analyses, such as Rasch analysis, provide findings that challenge those obtained through traditional statistics (i.e., they challenge the inclusion of Handling items for measuring function in clients with back pain).
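The rater-severity approach mentioned above comes from many-facet Rasch analysis, in which a severity parameter for each rater enters the model alongside person ability and item difficulty. Below is a minimal sketch for the dichotomous case (the ORDB itself uses a rating scale); all logit values are hypothetical:

```python
import math

def p_positive(ability, difficulty, severity):
    """Many-facet Rasch model (dichotomous case): probability of a
    positive rating given person ability, item difficulty, and rater
    severity, all expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty - severity)))

# Hypothetical values: the same client and item, scored by a lenient
# rater (severity -0.5) and a severe rater (severity +0.5).
ability, difficulty = 1.0, 0.0
lenient = p_positive(ability, difficulty, -0.5)
severe = p_positive(ability, difficulty, +0.5)
# The severe rater yields a lower expected rating for the same client,
# which is how calibrated severities can adjust person measures.
```

Estimating such severities jointly for clients (as self-raters) and therapists would allow the severity difference reported above to be quantified on the same logit scale as the item difficulties.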
Finally, Rasch analysis of the ORDB functional capacity instrument demonstrated that client self-reports and therapist-based ratings were psychometrically sound, though there was an apparent difference in rating severity and a low correlation between client and therapist ratings.

REFERENCES

1. Abdel-Moty E, Fishbain DA, Khalil TM, Sadek S, Cutler R, Steel-Rosomoff R, Rosomoff HL. Functional capacity and residual functional capacity and their utility in measuring work capacity. Clin J Pain 1993; 9: 168–173.
2. Gross DP. Measurement properties of performance-based assessment of functional capacity. J Occup Rehabil 2004; 14(3): 165–174.
3. King PM, Tuckwell N, Barett TE. A critical review of functional capacity evaluations. Phys Ther 1998; 8: 852–866.
4. Velozo CA. Work evaluations: Critique of the state of art of functional assessment of work. Am J Occup Ther 1993; 47: 203–209.
5. Mayer TG. Chapter 7: Assessment of lumbar function. In Disability Symposium. Clin Orthop Rel Res 1987; 221: 99–109.
6. Cutler RB, Fishbain DA, Steele-Rosomoff R, Rosomoff HL. Relationships between functional capacity measures and baseline psychological measures in chronic pain patients. Spine 2003; 1(4): 249–258.
7. Matheson LN, Isernhagen SJ, Hart DL. Relationships among lifting ability, grip force, and return to work. Phys Ther 2002; 82(3): 249–256.
8. Mooney V. Functional capacity testing: Its role in assessing and treating back pain. Pain Manage 1990; 3(2): 107–113.
9. Abdel-Moty E, Khalil T, Sadek S, Dilsen E, Fishbain DA, Steel-Rosomoff R, Rosomoff HL. Functional capacity assessment: A test battery and its use in rehabilitation. In: Mital A, ed. Advances in industrial ergonomics & safety I. New York: Taylor & Francis, 1992, pp. 1171–1177.
10. Fishbain DA, Abdel-Moty E, Cutler R, Khalil TM, Sadek S, Steel-Rosomoff R, Rosomoff HL. Measuring residual functional capacity in chronic low back pain patients based on the Dictionary of Occupational Titles. Spine 1994; 19(8): 872–880.
11. Hadler NM. Disabling backache in France, Switzerland and the Netherlands: Contrasting sociopolitical constraints on clinical judgment. J Occup Med 1989; 31: 823–831.
12. Fishbain DA, Cutler RB, Rosomoff HL, Khalil T, Abdel-Moty E, Steele-Rosomoff R. Validity of the Dictionary of Occupational Titles residual functional capacity battery. Clin J Pain 1999; 15: 102–110.
13. Manual for Valpar Component Work Sample 1, 9, 11. Tucson, Arizona: Valpar International Corporation, 1980.
14. Kopec JA. Measuring functional outcomes in persons with back pain. Spine 2000; 25: 3110–3114.
15. Doble SE, Fisk JD, Fisher AG, Ritvo PG, Murray TJ. Functional competence of community-dwelling persons with multiple sclerosis using the Assessment of Motor and Process Skills. Arch Phys Med Rehabil 1994; 75: 843–851.
16. Haley SM, Andres PL, Coster WJ, Kosinski M, Ni P, Jette AM. Short-form activity measure for post-acute care. Arch Phys Med Rehabil 2004; 85(4): 649–660.
17. Haley SM, McHorney CA, Ware JE, Jr. Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): I. Unidimensionality and reproducibility of the Rasch item scale. J Clin Epidemiol 1994; 47: 671–684.
18. Haley SM, McHorney CA, Ware JE, Jr. Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. J Clin Epidemiol 1997; 50(4): 451–461.
19. Heinemann AW, Linacre JM, Wright BD, Hamilton BB, Granger C. Relationships between impairment and physical disability as measured by the Functional Independence Measure. Arch Phys Med Rehabil 1993; 74: 566–573.
20. White L, Velozo CA. The use of Rasch measurement to improve the Oswestry classification scheme. Arch Phys Med Rehabil 2002; 83: 822–831.
21. Silverstein B, Fisher WP, Kilgore KM, Harley JP, Harvey RF. Applying psychometric criteria to functional assessment in medical rehabilitation: II. Defining interval measures. Arch Phys Med Rehabil 1991; 73: 507–518.
22. Silverstein B, Kilgore KM, Fisher WP, Harley JP, Harvey RF. Applying psychometric criteria to functional assessment in medical rehabilitation: I. Exploring unidimensionality. Arch Phys Med Rehabil 1991; 72(9): 631–637.
23. Whiteneck GG, Charlifue SW, Gerhart KA, Overholser JD, Richardson GN. Quantifying handicap: A new measure of long-term rehabilitation outcomes. Arch Phys Med Rehabil 1992; 73: 519–526.
24. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care 2000; 38(9, Suppl II): II-28–II-42.
25. Velozo CA, Peterson E. Using Rasch analysis to develop meaningful fear of falling measures for community dwelling elderly. Am J Phys Med Rehabil 2001; 80(9): 662–673.
26. McHorney CA. Generic health measurement: Past accomplishments and a measurement paradigm for the 21st century. Ann Intern Med 1997; 127: 743–750.
27. McHorney CA, Haley SM, Ware JE, Jr. Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. J Clin Epidemiol 1997; 50(4): 451–461.
28. McHorney CA. Health status assessment methods for adults: Past accomplishments and future challenges. Annu Rev Public Health 1999; 20: 309–335.
29. Weiss DJ. Adaptive testing by computer. J Consult Clin Psychol 1985; 3: 774–789.
30. Velozo CA, Magalhaes L, Pan A, Leiter P. Differences in functional scale discrimination at admission and discharge: Rasch analysis of the Level of Rehabilitation Scale-III (LORS-III). Arch Phys Med Rehabil 1995; 76(8): 705–712.
31. McHorney CA, Haley SM, Ware JE, Jr. Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): I. Unidimensionality and reproducibility of the Rasch item scale. J Clin Epidemiol 1994; 47(6): 671–684.
32. Thurstone LL. Measurement of social attitudes. J Abnorm Soc Psychol 1931; 26: 249–269.
33. Wright BD, Linacre JM. Observations are always ordinal; measurements, however, must be interval. Arch Phys Med Rehabil 1989; 70: 857–860.
34. Merbitz C, Morris J, Grip JC. Ordinal scales and foundations of misinference. Arch Phys Med Rehabil 1989; 70: 308–332.
35. Engelhard G. Monitoring raters in performance assessments. In: Tindal G, Haladyna T, eds. Large-scale assessment programs for ALL students: Development, implementation, and analysis. Mahwah, NJ: Erlbaum, 2002, pp. 261–287.
36. Fisher WP. Measurement-related problems in functional assessment. Am J Occup Ther 1993; 47(4): 331–337.
37. Kielhofner G, Mallinson T, Forsyth K, Lai J. Psychometric properties of the second version of the Occupational Performance History Interview (OPHI-II). Am J Occup Ther 2001; 55(3): 260–267.
38. Teresi JA, Kleinman M, Ocepek-Welikson K. Modern psychometric methods for detection of differential item functioning: Application to cognitive assessment measures. Stat Med 2000; 19: 1651–1683.
39. Lunz ME, Stahl JA. The effect of rater severity on person ability measure: A Rasch model analysis. Am J Occup Ther 1993; 47(4): 311–317.
40. Wilson M, Wang W. Complex composites: Issues that arise in combining different modes of assessment. In Special Issue: Polytomous item response theory. Appl Psychol Measure 1995; 19: 51–71.
41. Velozo CA, Santopoalo R. Training manual: Occupational Rehabilitation Data Base manual. Chicago: University of Illinois at Chicago, 1994.
42. U.S. Department of Labor, Employment and Training Administration. Selected characteristics of occupations defined in the Dictionary of Occupational Titles. Washington, DC: U.S. Government Printing Office, 1981.
43. U.S. Department of Labor, Employment and Training Administration. Dictionary of Occupational Titles, 4th edn., Supplement. Washington, DC: U.S. Government Printing Office, 1986.
44. Linacre JM. WINSTEPS Rasch measurement computer program. Chicago: Winsteps.com, 2005.
45. Wright BD, Masters GN. Rating scale analysis. Chicago: MESA Press, 1982.
46. Wang WC, Chen CT. Item parameter recovery, standard error estimates, and fit statistics of the WINSTEPS program for the family of Rasch models. Educ Psychol Measure 2005; 65: 376–404.
47. Avery LM, Russell DL, Raina PS, Walter SD, Rosenbaum PL. Rasch analysis of the Gross Motor Function Measure: Validating the assumptions of the Rasch model to create an interval-level measure. Arch Phys Med Rehabil 2003; 84: 697–705.
48. Velozo CA, Peterson EW. Developing meaningful fear of falling measures for community dwelling elderly. Am J Phys Med Rehabil 2001; 80: 662–673.
49. Davidson M, Keating JL, Eyres S. A low back pain version of the SF-36 Physical Functioning Scale. Spine 2004; 29(5): 586–594.
50. Linacre JM. Detecting multidimensionality: Which residual data-type works best? J Outcome Meas 1998; 2: 266–283.
51. Smith EV. Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. J Appl Meas 2002; 3: 205–231.
52. Preacher KJ, MacCallum RC. Repairing Tom Swift's electric factor analysis machine. Understand Stat 2003; 2(1): 13–32.
53. White LJ, Velozo CA. The use of Rasch measurement to improve the Oswestry classification scheme. Arch Phys Med Rehabil 2002; 83: 822–831.
54. Mallinson T, Stelmack J, Velozo C. A comparison of the separation ratio and coefficient α in the creation of minimum item sets. Med Care 2004; 42(1, Suppl): I-17–I-24.
55. Norman GR, Streiner DL. Biostatistics: The bare essentials. St. Louis: Mosby Yearbook Inc., 1994.
56. Magaziner J, Zimmerman SI, Gruber-Baldini AL, Hebel JR, Fox KM. Proxy reporting in five areas of functional status: Comparison with self-reports and observations of performance. Am J Epidemiol 1997; 146(5): 418–428.
57. Grootendorst PV, Feeny DH, Furlong W. Does it matter whom and how you ask? Inter- and intra-rater agreement in the Ontario Health Survey. J Clin Epidemiol 1997; 50(2): 127–135.