ZDM https://doi.org/10.1007/s11858-018-0939-z
ORIGINAL ARTICLE
Measuring who counts: gender and mathematics assessment

Gilah C. Leder¹ · Helen J. Forgasz¹

¹ Monash University, Clayton, VIC, Australia
Correspondence: Gilah C. Leder, [email protected]; Helen J. Forgasz, [email protected]

Accepted: 30 April 2018
© FIZ Karlsruhe 2018
Abstract  Assessment in mathematics is assumed to provide credible and important information about what students know and can do. In this paper we focus on large scale tests and question whether mathematics assessment is essentially gender neutral. We consider aspects of test validity and discuss issues of terminology related to gender and mathematics. In particular, we highlight examples of the ways that test content, response processes, and interactions with other variables such as socioeconomic status and beliefs/attitudes can distort students’ results and affect the interpretation of achievement outcomes. Where appropriate, we highlight findings from our own research.

Keywords  Assessment · Gender · Equity · Test validity · Large scale tests
1 Introduction

In his preface to the review and applications of the national testing regimes in 30 European countries, Figel (2009) argued that “reliable information on pupil performance is key to the successful implementation of targeted education policies and it is not surprising that in the past two decades national tests have emerged as an important tool for providing a measure of educational achievement” (p. 3). More broadly, Postlethwaite and Kellaghan (2009) noted: “Many nations have now established national assessment mechanisms with the aim of monitoring and evaluating the quality of their education systems across several time points” (p. 9). Mathematics or numeracy (mathematical literacy) is invariably included among large scale testing programs.

The power and impact on the larger population of mathematics and performance in this domain were eloquently summarised some years ago by Reisman and Kauffman (1980) as follows:

In our culture, mathematics is considered to be a powerful tool. Successful performance in mathematics carries with it positive connotations. Being “good in
math” is “being bright”, and being bright in mathematics is associated with control, mastery, quick understanding, [and] leadership. Unsuccessful mathematics achievement implies the opposite of these positive connotations. (1980, p. 36)

Proficiency in mathematics, it is also widely argued, is an important literacy requirement for the twenty-first century (e.g., Kearney, 2016; Steen, 2004). This view is certainly captured, forcefully, by the simple question posed by Australia’s Chief Scientist: “What if we lived in a world without mathematics? …Take away numbers, and you take away commerce, farming, medicine, music, architecture, cartography, cooking, sport… and every other activity we’ve invented since 3000 BC” (Finkel, 2017, p. 3).

Reliance on large scale national tests as measures of students’ achievement has not abated since the reassuring words of Figel (2009) and Postlethwaite and Kellaghan (2009) were written almost a decade ago. The extensive penetration of large scale international surveys such as the Programme for International Student Assessment [PISA] and the Trends in International Mathematics and Science Study [TIMSS] is further testimony to the widespread preoccupation with measuring students’ achievement. With respect to PISA, “(a)pproximately 540,000 students completed the assessment in 2015, representing about 29 million 15-year-olds in the schools of the 72 participating countries and economies” (Organisation for Economic Co-operation and Development [OECD], 2016, p. 3). A large number of countries also participated in TIMSS 2015: “In 2015, 57
countries and 7 benchmarking entities (regional jurisdictions of countries such as states or provinces) participated in TIMSS. In total, more than 580,000 students participated in TIMSS 2015” (International Association for the Evaluation of Educational Achievement [IEA], 2015, p. 1). Data from PISA and TIMSS are popularly used to gauge and compare the progress of nations, their curricula, and their teaching approaches.

In addition to the broad reports of students’ performance in these large scale international surveys, much information is also provided about the contexts—environmental and personal—of the students who participate in the tests. Findings from these large scale tests can therefore also be used to identify groups of students whose performance is lower than might be expected. Kearney (2016), among others, pointed to the European Commission’s concerns that students’ underachievement in mathematics and science seemed “strongly connected to socio-economic status, immigration background and gender” (p. 12). The importance of a focus on gender in mathematics performance is captured by OECD (2009) with its listing of three core reasons for studying gender differences in mathematics achievement: “(1) to understand the source of any inequalities; (2) to improve average performance; and (3) to improve our understanding of how students learn” (p. 8).

Despite the focus in this article on large scale tests, we readily acknowledge and accept that assessment is much broader than that offered by instruments which deliver a gross measure of group performance and, when administered, cannot offer fine-tuned descriptors of individuals’ levels of attainment. For this, different instruments are needed. These invariably have their own strengths and weaknesses, often comprise items and demands with characteristics in common with those found in large scale tests, and thus also warrant within-group scrutiny, for example, for possible differences linked to gender, social class, or region of schooling. However, it is reports of large scale test data, and not of smaller, carefully targeted tests, that are highlighted in the popular media and capture widespread attention. As shown by Forgasz and Leder (2011), such media reports often—inadvertently or perhaps to comply with space constraints—fail to present the complexities of within-group variations in performance, in turn perpetuating pre-conceived stereotypes and beliefs held by some readers, and by students themselves, about issues such as assumed gender differences in mathematics achievement.
1.1 Our aims for this article

A test, whether educational or psychological, can be judged appropriate for some purposes but quite inadequate for others. What criteria, common to both large scale and more locally devised instruments of more limited scope, could and
should be used to determine the adequacy and cogency of a test aimed at providing a measure of students’ performance? As foreshadowed above and discussed in more detail below, gender differences in mathematics learning have attracted sustained attention from researchers, practitioners, and the wider community. In this article we aim to focus on several issues which may confound the measurement of females’ and males’ performance in mathematics. In particular, we structure our examination around the notion of validity and the debate about the interpretation and scope of this construct. In the next section we discuss what is meant by gender differences, summarise the terminological evolutions associated with the term gender, and provide a brief overview of the literature concerned with gender differences in mathematics learning. We then turn to the interpretation of what is being assessed in some large scale testing regimes, and follow this with an examination of aspects of test validity.
2 A brief overview of previous research on gender differences in mathematics learning

Throughout the literature review that follows we refer to gender difference(s), unless the term sex difference(s) is specifically used by an author whose work we quote. Those versed in research on gender and mathematics education are well aware that the terms sex differences and gender differences are both readily found in the literature. Originally, the term sex differences was used uniquely and consistently. In more recent times, sex has more commonly been used to denote biologically-based differences. The use of gender evolved following debates on whether all differences could or should be attributed to biology alone. Gender, as a term, was consequently often used to describe differences between males and females that are not attributable to biology.

Recently the sufficiency of the male–female binary distinction has been further challenged and more nuanced refinements have been recommended. This is exemplified by the decision of the influential American Educational Research Association to introduce an expanded list of gender identity categories for use by members when they joined or renewed their membership. The organization advocated “a two-step approach to collecting data on gender: the first being the collecting of data on the biological sex assigned at birth, and the second asking members how they describe their gender” (Levine, 2016). How, or whether, these new categorizations will impinge on research on gender/sex differences in mathematics learning remains to be seen. Meanwhile we have adhered to the use of the term gender to indicate that, typically in the work we have considered, the focus is on differences between males and females that are not attributable to biology.
Mathematics has historically been considered to be a male domain, that is, more suitable for males than for females. Henrion (1997, p. xxxiv) observed that:

Throughout history there has been a recurrent belief that at some fundamental level women were just no good at mathematics. First it was argued that their brains were too small, later that it would compromise their reproductive capacities, still later that their hormones were not compatible with mathematical development.

Initiated by the seminal work of Fennema and her colleagues in the 1970s (e.g., Fennema, 1974; Fennema & Sherman, 1976, 1977), gender issues in mathematics education—achievement, participation, and attitudes/affect—began to attract research attention which has continued unabated until the present. Fairly consistent findings were soon evident. On average, females’ mathematics achievements were found to be lower than males’ (e.g., Fennema, 1974; Ethington & Wolfle, 1984; Robitaille & Travers, 1992), although the gender gaps and their directions were not always consistent across countries (e.g., Hanna, Kundiger, & Larouche, 1990). Quite early it was identified that in some mathematics content domains the gender gap was reversed (e.g., Moss, 1982; Smith & Walker, 1988), and that question type (e.g., Bolger & Kellaghan, 1990) and test setting (e.g., Benbow, 1992; Kimball, 1989) were factors affecting the direction of the gender gaps identified. Achievement differences favouring males were generally found to be greater among the highest achieving group of students (e.g., Benbow & Stanley, 1980) and on problem-solving tasks (e.g., Fennema, Carpenter, Jacobs, Franke, & Levi, 1998).

Many of the findings from the late twentieth century research described above have continued to be built upon. For example, with respect to the gender gap in mathematics achievement, in their meta-analysis of data from six large national studies conducted between 1960 and 1992, Hedges and Nowell (1995) reported small gender differences in mathematics achievement in favour of males, with effect sizes ranging from d = 0.03 to d = 0.26. They further reported that proportionally more males than females were consistently found among the highest performing students. Based on a more recent meta-analysis of the US National Assessment of Educational Progress [NAEP], Reilly, Neumann, and Andrews (2015) reported that “small mean sex differences favoring males were observed in science and mathematics performance, making claims of their absence premature” (p. 655). In that study, d = 0.10 was used as the threshold for considering a difference in mean scores to be “practically meaningful” (p. 650). Once again an over-representation of males was found among the high mathematics achievers, with the difference in favour of males described as moderate for students in grades 4 and 8 but increasing to a ratio of 2.13 males for every female in grade 12. While discussing such differences, it also needs to be noted that the gender differences in mathematics achievement reported are generally small compared with much larger within-sex variations.
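As a point of reference for these figures (our gloss; the cited meta-analyses report the standardised mean difference d, and common variants such as Hedges' g differ only by a small-sample correction), d is computed as

$$ d = \frac{\bar{X}_{M} - \bar{X}_{F}}{s_{\mathrm{pooled}}}, \qquad s_{\mathrm{pooled}} = \sqrt{\frac{(n_M - 1)s_M^2 + (n_F - 1)s_F^2}{n_M + n_F - 2}}, $$

so the Reilly et al. threshold of d = 0.10 marks group means one tenth of a pooled standard deviation apart, small relative to the spread of scores within either sex.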
Lower female participation in higher level mathematics courses internationally was also identified by early researchers (e.g., Schildkamp-Kundiger, 1982) and continues to be documented (OECD, 2009, 2013a). Females continue to be under-represented in school level enrolments in the most challenging mathematics subjects when there is choice and mathematics is no longer compulsory (for example, in Australia, Barrington & Brown, 2014). As a consequence, females remain in the minority in Engineering and other STEM-related fields (e.g., Hill, Corbett, & St. Rose, 2010; OECD, 2009). On various affective/attitudinal measures related to mathematics and to themselves as learners of the subject, males’ views were, and continue to be, found to be more “functional” (leading to future success) than females’ (e.g., Hyde, Fennema, Ryan, Frost, & Hopp, 1990; OECD, 2013a).

The existence of gender differences in mathematics learning is intermittently challenged by the findings of some researchers. Yet the persistence of small gender differences in favour of males continues to be reinforced by results obtained from large scale studies. Drawing on data from within and beyond the USA, Hyde and Mertz (2009) explored three major questions: “(1) Do gender differences in mathematics performance exist in the general population? (2) Do gender differences exist among the mathematically talented? (3) Do females exist who possess profound mathematical talent?” (Hyde & Mertz, 2009, p. 8801). Their conclusion to each of these questions was “yes”, but with some qualifications. As reported in other research, Hyde and Mertz (2009) also found more males than females among the highest scoring students. It is worth mentioning that in media reports, with their emphasis on total group performance, the often more acute gender differences in favour of males among the highest achieving students tend to remain hidden.

Mathematical literacy was a particular focus in the 2003 and 2012 Programme for International Student Assessment [PISA] surveys. In both years, gender differences were found in mean mathematics performance scores. In 2003, the mean performance of males on the PISA mathematics assessment component was higher than for females in 28 of the 29 participating OECD countries. For 26 of these same countries, the PISA 2012 data still revealed gender differences in favour of males. Among the 65 countries participating in PISA 2012, males outperformed females in 38, and females outperformed males in five (OECD, 2013b). On the 2012 PISA survey the difference in the mean score in favour of males was 11 points. In only six countries did the gap in favour of males exceed the equivalent of half a year of schooling.
It is noteworthy, however, that the gender gap favouring males was greatest among the highest-achieving students. Still focusing on the Organisation for Economic Co-operation and Development [OECD] participating countries, Contini, Di Tommaso, and Mendolia (2017) noted that gender differences in the STEM (Science, Technology, Engineering and Mathematics) disciplines were found in most OECD countries, but that mathematics was the only subject where girls’ performance was consistently lower than that of boys. They further reported that their inspection of PISA data generated by Italian students replicated “the general findings on gender inequalities in math test scores observed in US data and in particular, that the math gender gap starts at an early age, is larger among well performing than among low performing children and widens as children grow older” (p. 33).

Subtle but persistent gender differences in the Trends in International Mathematics and Science Study [TIMSS] have also been reported (Mullis, Martin, Foy, & Hooper, 2016). At the grade 4 level, in both the 2011 and 2015 TIMSS assessments, boys had a higher mean achievement score in 11 countries, compared with two countries in which girls had the higher mean score. But it should also be noted that at the grade 4 level there was no significant difference in the mean performance of girls and boys in 16 countries. At the grade 8 level few mean differences were found in the 2015 testing. Out of the 39 participating countries, no mean difference in the performance of girls and boys was found in 26 countries. The mean score for girls was higher in seven countries, and higher for boys in six countries.
These findings showed little difference from those reported in TIMSS 2011.

Recent (American) SAT mathematics test data for college-bound seniors (SAT, 2016) revealed a small but continuing gender gap in the performance of males and females: the mean score for males was 524; for females, 494. While approximately equal numbers of males and females scored in the highest range of the critical reading component of the test (39,271 males and 38,933 females), a different picture emerged for the mathematics test data. On that component, approximately 72,000 males compared with 45,000 females scored in the highest range.

Data from the numeracy component of Australia’s National Assessment Program—Literacy and Numeracy [NAPLAN] test are also worth noting. The recently released 2016 results (ACARA, 2016) revealed, for example, that at the Year 3 level a slightly higher proportion of girls (96.0%) performed at or above the national minimum level compared with boys (95.1%). Yet there was a higher proportion of boys (17.1%) than girls (12.7%) whose score placed them in the highest category available. Similarly, for students at the Year 9 level, a slightly higher proportion of girls (95.7%) than boys (94.7%) were deemed to have performed at or above the national minimum level. But at that year level too, a higher proportion of boys (9.7%) than girls (6.6%) recorded a score that placed them in the highest category. When mean scores on NAPLAN numeracy are examined, it can be seen that there has been a persistent pattern, over time, of males outperforming females at each grade level (see Fig. 1).

Fig. 1 Mean scores by gender, NAPLAN numeracy (2008–2016)
In summary: to a large extent the findings highlighted in the early phase of research on gender differences in mathematics learning continue to be replicated and reinforced in more recent studies.
3 What is being assessed in (large scale) mathematics tests?

There is some confusion in terminology, and some misinterpretation by stakeholders and the popular media, of what exactly is being assessed by the internationally administered TIMSS (mathematics) and PISA (mathematical literacy) tests, and the Australian national NAPLAN (numeracy) test. According to the National Assessment Program (NAP) (2016), the NAPLAN numeracy test is closely aligned with the Australian Curriculum: Mathematics (see Australian Curriculum, Assessment and Reporting Authority [ACARA], 2017) and assesses “the proficiency strands of understanding, fluency, problem-solving and reasoning across the three content strands of mathematics: number and algebra; measurement and geometry; and statistics and probability”. In other words, NAPLAN “numeracy” is a test of mathematics achievement as articulated in the Australian Curriculum: Mathematics. Furthermore, this test is administered to all students in the target grade levels (3, 5, 7, and 9) and can provide diagnostic information about the progress of individual students against national performance data and standards.

Similarly, TIMSS (mathematics) is a test of mathematical knowledge, and international comparisons of achievement are frequently made. According to Mullis, Martin, and Loveless (2016), “the TIMSS Curriculum Model has three aspects: the intended curriculum, the implemented curriculum, and the attained curriculum…. In summary, the model considers what is expected, taught, and assessed” (p. 23) in the mathematics curricula at Grades 4 and 8. For this test, only a carefully and statistically justified sample of students attempt the test items. For example, in Australia about 5% of students in the targeted school year or age level are involved in the tests (Fullarton, 2010).

On the other hand, PISA assesses 15-year-old students’ mathematical literacy¹ capabilities, and not their mathematics achievements. Again, only a sample of the country’s 15-year-old students is assessed, and not all PISA participants are necessarily studying mathematics at school. The PISA definition of mathematical literacy is focussed on how effectively students can use and interpret mathematics in
¹ It should be noted that the term mathematical literacy is not used universally. Numeracy, quantitative literacy, and mathemacy are among the range of terms used—for further detail, see Vacher (2014).
different contexts. This “includes reasoning mathematically and using mathematical concepts, procedures, facts and tools to describe, explain and predict phenomena” (OECD, 2013b, p. 25) in real world settings. An important element captured by mathematical literacy is an individual’s ability to make sound and constructive decisions in daily life. At the same time, mathematical skills are needed to answer PISA items. While we are cognisant of the differences between mathematics and mathematical literacy, in this article we have drawn on important and frequently cited work in which mathematics achievement has been (in our view, incorrectly) described as associated with PISA mathematical literacy data.

In the next section we discuss test validity (particularly relevant in high stakes, large scale testing programs) and pertinent definitions of relevant dimensions of validity, as well as exploring how limitations in validity may contribute to observed gender differences in performance.
4 Tests and their validity

“Unfortunately”, wrote Newton and Shaw (2016), “there is no widespread professional consensus over the meaning of the word ‘validity’ as it pertains to educational and psychological testing” (p. 178). In a commonly used textbook about research methods in education and psychology, Mertens (1998) stated: the “conventional definition of the validity of an instrument is the extent to which it measures what it was intended to measure” (pp. 291–292). While Kline (1998) and Wiersma and Jurs (2009) offered a similar definition in their texts, the one favoured by Miller, Linn, and Gronlund (2009) was nuanced somewhat differently: “Validity is the adequacy and appropriateness of the interpretations and uses of measurement results” (p. 70).

Contradictory views about the complexity of the concept of validity are captured anecdotally but evocatively through quotations attributed by Newton and Shaw (2016, p. 178) “to two prominent validity scholars”. One apparently considered the concept of validity to be so simple that “even a bright 8-year-old could grasp the general idea.... Yet according to another, her undergraduate class on validity is ‘the most challenging class of the semester’ and the hardest ‘for students to understand’” (cited by Newton and Shaw, 2016, p. 178).

A comprehensive overview of the diverse yet overlapping ways in which the term validity is operationally defined is undoubtedly beyond the scope of this article. Summaries of the theoretical and terminological stances taken over time by those in the field abound. Early such examples include Ebel (1961), Messick (1995), and Crocker (1997). The special issue of Assessment in Education: Principles,
Policy, and Practice (2016, 23(2)) offers extensive, contemporary accounts of the various aspects of the debate and stances taken by researchers and practitioners. It is generally accepted that there is a need to consider several sources of evidence. These are:

• Content validity: the extent to which the test actually measures the content area it is intended to measure. Subsumed are item validity and sampling validity, referring respectively to the relevance and coverage of the test items for measuring the content area being considered.

• Construct validity: whether the test actually measures the concept or attributes it is supposed to measure. “In other words, construct validity reflects the degree to which a test measures an intended hypothetical construct… [c]onstructs are nonobservable traits… [and] underlie the variables that researchers measure” (Gay, Mills, & Airasian, 2009, p. 157).

• Criterion validity (sometimes divided into concurrent and predictive validity): the extent to which the test score can predict some current or future behaviour. While concurrent validity describes how well the results of the new test compare with the outcome of a previously administered but relevant test or against another appropriate measure, predictive validity describes how well the test can predict the future performance or behaviour of the person being tested. (A minimal computational sketch of this idea follows the list.)
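In practice, a criterion validity coefficient is commonly operationalised as the correlation between test scores and a criterion measure. The sketch below is ours, with invented illustrative numbers; real validation studies use far larger samples and apply corrections (e.g., for range restriction and unreliability).

```python
import statistics


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: scores on a new mathematics test, and the criterion
# (e.g., grades in a subsequent mathematics course) for the same students.
test_scores = [52, 61, 70, 48, 66, 75, 58, 63]
later_grades = [55, 64, 72, 50, 60, 78, 57, 65]

# Predictive validity coefficient: how well the test forecasts the criterion.
print(round(pearson_r(test_scores, later_grades), 2))
```

The same computation, with a concurrently administered and already validated test as the second variable, would yield a concurrent validity coefficient.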
The issue of the relevance of considering consequential validity is more controversial. As described by Mertens (1998), “Messick include(d) appraisal of the social consequences of testing, including an awareness of the intended and unintended outcomes of test interpretation and use, as an aspect of validity” (p. 292). Many researchers (e.g., Zumbo & Hubley, 2016; Nichols & Berliner, 2007) have accepted this unified position, with the latter using the term the 4Cs to describe content, construct, criterion, and consequential validity. Others (e.g., Kane, 2016; Popham, 1997; Scriven, 2002) disagreed. Popham (1997) insisted that it is a mistake “to tie the social consequences into a validity framework. Such a wedding of related but distinctive concepts will not be symbiotic, it will be septic” (p. 13). While also rejecting Messick’s unified validity definition, Scriven (2002) was less forceful: “Although Messick wants to move to what he calls unified validity, he takes this to include both of what are, I suggest, properly called validity and utility” (p. 259). To conclude, it is useful to cite the approach adopted to this aspect of validity in readily available texts, for example Gay et al. (2009):

Consequential validity, as the name suggests, is concerned with the consequences that occur from tests…
All tests have intended purposes…, and in general, the intended purposes are valid and appropriate. There are, however, some testing instances that produce (usually unintended) negative or harmful consequences to the test takers. Consequential validity, then, is the extent to which an instrument creates harmful effects for the user. (p. 157) As can be seen below, we consider it important when discussing gender differences in mathematics learning to examine not only test content, response processing demands, and possible links to other variables, but also the notion of consequential validity.
4.1 Test content and gender differences in mathematics learning

As pointed out earlier in this article, large scale tests such as PISA and TIMSS provide not only measures of students’ performance on the test items but also offer much contextual information: about the home, school, and broader environment in which learning occurs and about students’ attitudes, beliefs, and longer term aspirations. Yet carefully contextualized presentations of the huge amount of data generated by large surveys are often disregarded or overlooked in discussions of national and international performance data. Concerns about the inevitably segmented, rather than more integrated and nuanced, reporting of the masses of information provided by these tests are captured well by Berliner (2011) in his critique of large scale testing. He, in common with those responsible for the collation of the large pool of materials produced, pointed to limitations in the reporting and evaluation of data generated by the measures of student achievement. For example, what is given as a student’s achievement score in mathematics is inevitably influenced, at least in part, by previous exposure to the content on which they are tested, and by the scope allowed by the boundaries of the test to show the depth of the knowledge they have acquired. External influences, local expertise, and individual teacher or pupil preferences can mould or change what students essentially experience and value. Relevant contextual data collected alongside the tests, and details of the participants’ social class and the likely associated advantages or disadvantages, are important factors that may influence test results, but are often minimized or ignored when reporting or interpreting the outcomes of tests. The constraints just listed can apply not only to large scale tests but also to smaller, locally designed, and supposedly strategically targeted instruments.

We provide only a few examples to illustrate how the content of a test or task can influence apparent gender differences in performance. Group data for TIMSS 2015 were reported in an earlier section of the paper. Provocative,
nuanced differences emerge when the data are reported by content domain. At the grade 4 level, boys performed better than girls on number items in 21 countries (see Mullis et al., 2016), while the mean score for girls was higher than for boys in seven countries. For geometric shapes and measures, the mean score for boys was higher than for girls in 14 countries but higher for girls than boys in nine countries. For data display, girls outperformed boys in 13 countries, and boys did better than girls in two countries. Inconsistencies in gender differences in performance by content domain were also found for students in eighth grade. In number, on average, boys did better than girls in 17 countries, while girls did better than boys in four countries. In contrast, on algebra domain items, girls did better than boys in 21 countries, and boys did not outperform girls in any country. Girls also did better than boys on geometry items in eight countries, compared with two countries where boys outperformed girls on items in this domain. Finally, for data and chance, boys outperformed girls in six countries, and girls outperformed boys in seven countries.

Group findings for PISA were also referenced earlier. A more nuanced appraisal of mathematics assessment data provides further insights. In OECD (2014), data are presented inter alia in terms of the four content subscales: change and relationships, space and shape, quantity, and uncertainty and data. Mean differences in the scores of boys and girls across the OECD countries, it was reported, ranged from 15 points in favour of boys on the space and shape subscale to nine points in favour of boys on the uncertainty and data subscale. Within-country differences varied considerably, however. On the quantity subscale, for example, differences ranged from 31 points in favour of boys to 19 points in favour of girls. Such varying patterns of gender differences across the performance of large groups of students on the different scales “highlight the difficulties in designing educational policies that promote gender equity” (OECD, 2009, p. 22).

Apparent gender differences in achievement are also reported in locally designed tests administered to smaller groups of students. Using a sample of over 500 students aged 5 to 11, Zohar and Gershikov (2008) found that the context used to describe a problem affected performance: “in neutral contexts, the scores of boys and girls are similar. In (stereotypically) boys’ contexts, however, boys score significantly higher than girls. In (stereotypically) girls’ contexts, a significant interaction is found between age and gender” (p. 677). Thus, they reported, girls in particular were affected by the context of a problem. As noted by Tarampi, Heydari, and Hegarty (2016), the research literature is replete with reports of males outperforming females on spatial tasks. Their own research, they reported, “confirmed sex differences favoring males in two psychometric measures of spatial perspective taking: the spatial-orientation test and the road-map test”
(p. 1515). These gender differences in performance disappeared, however, when the task was reframed and modified to include a human figure.

To summarise, these examples collectively confirm that the content of a test can indeed affect whether or not gender differences in performance are identified and recorded. This raises questions about the content validity of the test.
4.2 Response processes and gender differences in mathematics learning

Data based on results from the Victorian (Australia) Year 12 examinations—the final year of secondary school—for the years 1994–1999 revealed that the format of assessment is another factor that may confound the determination of students’ mathematics achievements. Although the source on which these conclusions are based is somewhat dated, the unique circumstances which enabled performance on the different tasks to be gathered, and the still prevalent usage of the different assessment approaches described, warrant continued attention to these findings.

Over the 1994–1999 period, there were three subjects offered to students in Year 12 mathematics: Further Mathematics (the least difficult option); Mathematical Methods (with more challenging content); and Specialist Mathematics (the most challenging of the three subjects). For each of the three mathematics subjects, enrolled students completed three quite different Common Assessment Tasks [CATs] during the year. CAT 1 was an investigative project or challenging problem completed during school time and at home, over an extended period of several weeks. CAT 2 was a strictly timed examination comprising multiple-choice and short-answer questions. CAT 3 was also a strictly timed examination, but the problems required extended answers. Compared to the (then) very innovative CAT 1, the format adopted for CATs 2 and 3 was consistent with traditional timed examinations.

Capitalising on the introduction of the varied examination tasks in the mathematics subjects, Cox, Leder, and Forgasz (2004) and Leder (2015) examined the mean performance of boys and girls on the different CAT formats in the three mathematics subjects. For the period 1994–1999, differences in the mean performance of boys and girls on the mathematics tests varied, depending on which component of the tests and which of the three mathematics subjects was being considered. For example, as a group, girls did better than boys in CAT 1 each year for all three mathematics subjects. In Further Mathematics, the most popular and least demanding mathematics subject, on average girls also did better than boys on CATs 2 and 3. However, in Mathematical Methods, the subject which was, and still is, a prerequisite for many tertiary courses, boys did better than girls on CATs 2 and 3 with their more
traditional format and time restrictions. Collectively the data provide another example of the way that the form of assessment administered can influence students’ performance, and whether males or females appear to be better at mathematics.

Further evidence of the importance of the response format required in tests of mathematics achievement is furnished by Dowling and Burke’s (2012) reference to the 2009 General Certificate of Secondary Education [GCSE] examinations in the United Kingdom as the first occasion in a decade on which boys performed better than girls in an external examination. “This reversal coincided with a change in the form of the examination” (p. 94), they noted. Also relevant is work reported by Gallagher et al. (2000). They devised a number of experiments to explore whether success in solving an item differed depending on the required response format (free response or multiple choice), on the item’s cognitive demand, and on its mathematics content domain. The smallest gender differences in performance were observed for items relying for their solution on verbal skills and storing or retrieving information. The largest differences were observed for “items requiring the creation of a spatial representation and the retention and manipulation of a given representation” (Gallagher et al., 2000, p. 188).

The type of technology used in assessment regimes has also been found to affect the extent of gender differences in mathematics achievement. Over the nine years 2002–2010, there was a transition from the use of graphics calculators to computer algebra system (CAS) calculators in the intermediate level mathematics subject offered as a Year 12 subject in the Victorian Certificate of Education [VCE]. During these years, there were two parallel courses, Mathematical Methods and Mathematical Methods (CAS), which covered similar content in their common areas of study. Schools could choose which of the two courses would be offered to their students. In Mathematical Methods students used an approved graphics calculator, while in Mathematical Methods (CAS) they used an approved computer algebra system (CAS) calculator or software (VCAA, 2013). Forgasz and Tan (2010) examined the performance data by gender for the three assessment tasks (one school-based, and two examinations) in the two parallel courses for the period 2002–2008. The data revealed that males’ and females’ performances were different “at the very highest level of achievement (A+), with boys outperforming girls in both subjects and with the gender gap in favour of males appearing to be greater in Mathematical Methods CAS” (p. 33). They found that the pattern was similar “when the percentages of males and females achieving the three top grades A+, A, and B+ were combined, with one exception” (p. 33); females slightly outperformed males for the school-based task in Mathematical Methods, but males outperformed females for Mathematical Methods CAS.
Findings such as those described above undoubtedly illustrate that a test’s content and its response format demands can impact the perception of a group’s mathematical “ability” as well as the validity of the (mathematics) scores of individual students in these groups.
4.3 Relations with other variables

Authors of many of the OECD publications referred to in this article have indicated, either directly or indirectly, that affective descriptors such as attitudes, beliefs, emotions, confidence, and anxiety are all considered possible facilitators or inhibitors of performance on mathematics or mathematical literacy tests. These same factors are among those persistently attracting research attention from those concerned with possible gender differences in mathematics learning. To what extent, then, do the repeated reports of gender differences in mathematics participation and performance contribute to these gender differences in affective factors? Or, alternately, to what extent do the differences in affect contribute to reported differences in mathematics achievement?

With a small sample of college-based psychology students, Spencer, Steele, and Quinn (1999) demonstrated that stereotype threat (the risk of being judged in terms of, or confirming, negative stereotypes about one’s group) affected women’s mathematics performance level. A gender gap favouring males was found under stereotype threat, and the gap was eliminated when stereotype threat was lowered. In a later meta-analysis of published studies on stereotype threat among school-aged students under 18, Flore and Wicherts (2015) found an overall, significant effect size (−0.22). The authors included four moderator variables—test difficulty, presence of boys, gender equality within countries, and control group type used—none of which was significant. They identified the presence of publication bias (publication only of studies with significant findings) and cautioned that they wanted “to avoid the unjustifiable generalization that stereotype threat… generally leads to lower math grades and women leaving the STEM field” (Flore & Wicherts, 2015, p. 41).

Based on achievement results published in a metropolitan newspaper (score, name of student, and school attended) in each of the years 2007–2009, Forgasz and Hill (2013) reported the effects that gender, school type (a measure of socioeconomic background), school learning setting (single-sex or co-educational), and geographic location had on the mathematics performance of the highest achieving students (top 2%) in the three Grade 12 Victorian Certificate of Education (VCE) mathematics subjects described earlier in the article. They found:

• a very clear pattern of male dominance amongst the highest achievers (top 2%) in all three mathematics subjects;
the dominance was highest in Specialist Mathematics (the most challenging of the three subjects);

• that students in single-sex schools (mainly fee-paying, non-government schools, a socioeconomic [SES] bias), particularly boys’ schools, were over-represented amongst the highest achievers in all three subjects in each of the 3 years;

• that the SES factor was also evident when achievements were examined by school type attended (government: lowest SES; Catholic: middle SES; and Independent: highest SES). For example, while only 14% of students attended independent schools in 2009, over 50% of the top 2% of achievers in the three subjects over the 3 years attended these schools;

• that students attending metropolitan schools outperformed their peers from non-metropolitan schools.

Forgasz and Hill (2013) concluded that the findings reinforced the impact of SES on educational outcomes, and subsequently on occupational opportunities. Since the results had been published in the media and were accompanied by articles highlighting some of the generalizable findings, they also commented on the potential this had to shape public opinion and reinforce stereotypes. The messages conveyed might include that boys are better than girls at mathematics, that fee-paying schools outperform government schools, that single-sex schooling is superior to co-education, and that it is advantageous to attend schools in the metropolitan area.

Using the popular social media site Facebook to recruit participants, Forgasz, Leder, and Tan (2014) explored the general public’s views about aspects of mathematics learning. Beliefs about boys’ and girls’ proficiency in mathematics and the perceived suitability of certain STEM-related occupations for boys and girls were among the issues examined. Almost 800 respondents, from many different countries, completed the online survey. There were nine countries—Canada, China, Egypt, India, Israel, Singapore, UAE, UK, and Australia—with at least 30 responses from each. In each of the nine countries a majority of those who completed the survey indicated that they considered studying mathematics to be important for all students, irrespective of gender. They also believed that mathematics should continue to be studied even when it was no longer compulsory. However, Forgasz et al. (2014) also reported that: “the consistency in the direction of the findings in support of the traditional male stereotype provides strong evidence that gendered perceptions of mathematics and related careers persist in many parts of the world” (p. 386).
5 Gender differences in mathematics learning and the consequences of testing

Mathematics is typically considered to be important and a potential gatekeeper to educational and career choices. For example, in Australia, clearly prescribed personal numeracy (mathematical literacy) goals were recently set for graduating teachers:

Prior to graduation, all initial teacher education students are expected to sit the Literacy and Numeracy Test for Initial Teacher Education Students (the test) to demonstrate that they have been assessed as being in the top 30 per cent of the adult population for personal literacy and numeracy…. In 2015 all education ministers [in every Australian state and territory] agreed that from 1 July 2016 the test would be used as the means to demonstrate students have achieved this standard. (Department of Education and Training, 2017, paras 1–2)

While the level of mathematics background for entry into teacher education courses may vary from institution to institution, a clearly specified level of personal numeracy (mathematical literacy) proficiency is now required before a student can commence teaching as a fully qualified professional.

In previous sections we shared evidence of subtle differences in the mathematics performance of males and females, with males more frequently found to outperform females. We indicated that such gender differences could vary in direction depending on the mathematics content domain. Response formats of the assessment instrument administered can also affect the direction of gender differences. We provided evidence of subtle but hauntingly enduring stereotyping of mathematics as a male domain among at least some sections of the communities in many countries. Of concern are the extent and pervasive continuity over time of these patterns of gender differences generally advantaging males. What might be the impact of such stereotyping, and of the reiteration of instances of lower achievement, on the expectations of girls as they engage in mathematics, and on those people who evaluate the mathematical work produced by girls and compare it with that of boys?

The forces that drive human behaviour are undoubtedly complex and multifaceted. Competing motives that can be constructively considered are captured, for example, by the expectancy-value theory of motivation. In brief, within this framework behaviour is hypothesized to be a function of the expectancies an individual has of reaching a particular goal and the value assigned by that individual to that goal (see, for example, Atkinson, 1964, for early and seminal work on this construct).
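A minimal formal sketch (our gloss of Atkinson's classic risk-taking model, rather than a formulation given in the work cited above): the tendency $T_s$ to approach an achievement task is modelled as the product of the motive for success $M_s$, the subjectively expected probability of success $P_s$, and the incentive value of success $I_s$, with incentive assumed inversely related to expectancy:

$$ T_s = M_s \times P_s \times I_s, \qquad I_s = 1 - P_s. $$

On this account, repeated messages that lower girls' expectancies of success in mathematics can depress the tendency to engage, even where the value attached to mathematical success is unchanged.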
We presented evidence that in our society the importance of mathematics continues to be reinforced. At the same time, we illustrated how assessment in mathematics can produce inequities. While these inequities appear to persist, despite intermittent efforts in some countries to redress them, or while they remain insufficiently recognized as important, the cycle of self-fulfilling gender-linked prophecies is unlikely to be broken.
References

Atkinson, J. W. (1964). An introduction to motivation. Oxford: Van Nostrand.

Australian Curriculum, Assessment and Reporting Authority [ACARA] (2017). Australian Curriculum: Mathematics. http://www.australiancurriculum.edu.au/mathematics/curriculum/f-10?layout=1.

Australian Curriculum, Assessment and Reporting Authority [ACARA] (2016). National Assessment Program—Literacy and Numeracy Achievement in Reading, Writing, Language Conventions and Numeracy: National Report for 2016. http://www.nap.edu.au/docs/default-source/default-document-library/2016-naplan-national-report.pdf.

Barrington, F., & Brown, T. (2014). AMSI monitoring of participation in Year 12 mathematics. Gazette of the Australian Mathematical Society, 41(4), 221–226.

Benbow, C. P. (1992). Academic achievement in mathematics and science of students between ages 13 and 23: Are there differences among students in the top one percent of mathematical ability? Journal of Educational Psychology, 84(1), 51–61.

Benbow, C. P., & Stanley, J. (1980). Sex differences in mathematical ability: Fact or artifact? Science, 210, 1262–1264.

Berliner, D. C. (2011). The context for interpreting the PISA results in the U.S.A.: Negativism, chauvinism, misunderstanding, and the potential to distort the educational systems of nations. In M. A. Pereyra, H. G. Kotthoff & R. Cowen (Eds.), PISA under examination. Changing knowledge, changing tests, and changing schools (pp. 77–96). Rotterdam: Sense Publishers.

Bolger, N., & Kellaghan, T. (1990). Method of measurement and gender differences in scholastic achievement. Journal of Educational Measurement, 27(2), 165–174.

Contini, D., Di Tommaso, M. L., & Mendolia, S. (2017). The gender gap in mathematics achievement: Evidence from Italian data. Economics of Education Review, 58, 32–42.

Cox, P. J., Leder, G. C., & Forgasz, H. J. (2004). The Victorian Certificate of Education—mathematics, science and gender. Australian Journal of Education, 48(1), 27–46.

Crocker, L. (1997). Editorial: The great validity debate. Educational Measurement: Issues and Practice, 16(2), 4.

Department of Education and Training. (2017). Literacy and Numeracy Test for Initial Teacher Education Students. Canberra: Australian Government. https://www.education.gov.au/literacy-and-numeracy-test-initial-teacher-education-students.

Dowling, P., & Burke, J. (2012). Shall we do politics or learn some maths today? Representing and interrogating social inequality. In H. Forgasz & F. Rivera (Eds.), Towards equity in mathematics education. Gender, culture, and diversity (pp. 87–103). Berlin: Springer-Verlag.

Ebel, R. L. (1961). Must all tests be valid? American Psychologist, 16(10), 640–647. https://doi.org/10.1037/h0045478.
Ethington, C. A., & Wolfle, L. M. (1984). Sex differences in a causal model of mathematics achievement. Journal for Research in Mathematics Education, 15(5), 361–377.

Fennema, E. (1974). Mathematics learning and the sexes: A review. Journal for Research in Mathematics Education, 5(3), 126–139.

Fennema, E., Carpenter, T. P., Jacobs, V. R., Franke, M. I., & Levi, L. W. (1998). A longitudinal study of gender differences in young children’s mathematical thinking. Educational Researcher, 27(5), 6–11.

Fennema, E., & Sherman, J. (1977). Sex-related differences in mathematics achievement, spatial visualization and affective factors. American Educational Research Journal, 14, 51–71.

Fennema, E., & Sherman, J. A. (1976). Fennema-Sherman Mathematics Attitude Scales: Instruments designed to measure attitudes toward the learning of mathematics by females and males. Journal for Research in Mathematics Education, 7, 324–326.

Figel, J. (2009). Preface. National testing of pupils in Europe: Objectives, organisation and use of results. Brussels: Education, Audiovisual and Culture Executive Agency-Eurydice. http://eacea.ec.europa.eu/education/eurydice/documents/thematic_reports/109EN.pdf.

Finkel, A. (2017). Measuring up. Mathematics Education Research Group of Australasia 40th Anniversary Conference … Opening Address. http://www.chiefscientist.gov.au/wp-content/uploads/MERGA-speech.pdf.

Flore, P. C., & Wicherts, J. M. (2015). Does stereotype threat influence performance of girls in stereotyped domains? A meta-analysis. Journal of School Psychology, 53, 25–44.

Forgasz, H., & Hill, J. (2013). Gender, school settings, and high achievers in mathematics. International Journal of Science and Mathematics Education, 11(2), 481–499.

Forgasz, H., Leder, G., & Tan, H. (2014). Public views on the gendering of mathematics and related careers: International comparisons. Educational Studies in Mathematics, 87(3), 369–388.

Forgasz, H., & Tan, H. (2010). Does CAS use disadvantage girls in VCE mathematics? Australian Senior Mathematics Journal, 24(1), 25–36.

Forgasz, H. J., & Leder, G. C. (2011). Equity and quality of mathematics education: Research and media portrayals. In B. Atweh, M. Graven, W. Secada & P. Valero (Eds.), Mapping equity and quality in mathematics education (pp. 205–222). Dordrecht: Springer.

Fullarton, S. (2010). Mathematics learning: What TIMSS and PISA can tell us about what counts for all Australian students. https://research.acer.edu.au/cgi/viewcontent.cgi?article=1087&context=research_conference.

Gallagher, A. M., De Lisi, R., Holst, P. C., McGillicuddy-De Lisi, A. V., Morely, M., & Cahalan, C. (2000). Gender differences in advanced mathematical problem solving. Journal of Experimental Child Psychology, 75(3), 165–190. https://doi.org/10.1006/jecp.1999.2532.

Gay, L. R., Mills, G. E., & Airasian, P. (2009). Educational research. Competencies for analysis and applications (9th edn.). Upper Saddle River, NJ: Pearson Education, Inc.

Hanna, G., Kündiger, E., & Larouche, C. (1990). Mathematical achievement of grade 12 girls in fifteen countries. In L. Burton (Ed.), Gender and mathematics: An international perspective (pp. 87–97). London: Cassell.

Hedges, L. V., & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers of high-scoring individuals. Science, 269, 41–45. https://doi.org/10.1126/science.7604277.

Henrion, C. (1997). Women in mathematics. The addition of difference. Bloomington and Indianapolis: Indiana University Press.

Hill, C., Corbett, C., & St. Rose, A. (2010). Why so few? Women in science, technology, engineering, and mathematics. Washington, DC: AAUW. http://www.aauw.org/learn/research/upload/whysofew.pdf.
Hyde, J. S., Fennema, E., Ryan, M., Frost, L. A., & Hopp, C. (1990). Gender comparisons of mathematics attitudes and affect: A meta-analysis. Psychology of Women Quarterly, 14, 299–324.

Hyde, J. S., & Mertz, J. E. (2009). Gender, culture, and mathematics performance. Proceedings of the National Academy of Sciences, 106(22), 8801–8807. https://doi.org/10.1073/pnas.0901265106.

International Association for the Evaluation of Educational Achievement [IEA]. (2015). About TIMSS 2015. http://timss2015.org/wp-content/uploads/filebase/full%20pdfs/T15-About-TIMSS-2015.pdf.

Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy and Practice, 23(2), 198–211.

Kearney, C. (2016). Efforts to increase students’ interest in pursuing mathematics, science and technology studies and careers. National measures taken by 30 countries—2015 report. Brussels: European Schoolnet. http://www.dzs.cz/file/3669/kearney2016-nationalmeasures-30-countries-2015-report-28002-29-pdf/.

Kimball, M. M. (1989). A new perspective on women’s math achievement. Psychological Bulletin, 105(2), 198–214.

Kline, P. (1998). The new psychometrics: Science, psychology and measurement. London: Routledge.

Leder, G. C. (2015). Mathematics for all? The case for and against national testing. In S. J. Cho (Ed.), The proceedings of the 12th International Congress on Mathematical Education (pp. 189–208). Springer. https://doi.org/10.1007/978-3-319-12688-3_14.

Levine, F. J. (2016). From the desk of the Executive Director—AERA to further refine gender demographic categories. http://www.aera.net/Newsroom/AERA-Highlights-E-newsletter/AERA-Highlights-May-2016/From-the-Desk-of-the-Executive-Director-AERA-to-Further-Refine-Gender-Demographic-Categories.

Mertens, D. M. (1998). Research methods in education and psychology. Integrating diversity with quantitative & qualitative approaches. Thousand Oaks: Sage Publications, Inc.

Messick, S. (1995). Validity and psychological assessment. American Psychologist, 50(9), 741–749.

Miller, M. D., Linn, R., & Gronlund, N. (2009). Measurement and evaluation in teaching. Upper Saddle River: Merrill.

Moss, J. D. (1982). Towards equality: Progress in mathematics in Australian secondary schools (Occasional paper No. 16). Hawthorn: Australian Council for Educational Research.

Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2016). TIMSS 2015 international results in mathematics. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/timss2015/international-results/.

Mullis, I. V. S., Martin, M. O., & Loveless, T. (2016). 20 years of TIMSS. International trends in mathematics and science achievement, curriculum, and instruction. http://timssandpirls.bc.edu/timss2015/international-results/timss2015/wp-content/uploads/2016/T15-20-years-of-TIMSS.pdf.

National Assessment Program (NAP). (2016). Numeracy. https://www.nap.edu.au/naplan/numeracy.

Newton, P. E., & Shaw, S. D. (2016). Disagreement over the best way to use the word ‘validity’ and options for reaching consensus. Assessment in Education: Principles, Policy and Practice, 23(2), 178–197.

Nichols, S., & Berliner, D. (2007). Collateral damage. How high stakes testing corrupts America’s schools. Cambridge: Harvard Education Press.

OECD. (2009). Equally prepared for life? How 15-year-old boys and girls perform in school. OECD Publishing. https://www.oecd.org/pisa/pisaproducts/42843625.pdf.

OECD. (2013a). PISA 2012 results: Ready to learn: Students’ engagement, drive and self-beliefs (Volume III). PISA, OECD Publishing. https://www.oecd.org/pisa/keyfindings/PISA-2012-results-volume-III.pdf.

OECD. (2013b). PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy. Paris: OECD Publishing. https://doi.org/10.1787/9789264190511-en.

OECD. (2014). PISA 2012 results: What students know and can do—Student performance in mathematics, reading and science (Volume I, revised edition, February 2014). PISA, OECD Publishing. https://doi.org/10.1787/9789264201118-en.

OECD. (2016). PISA 2015. Results in focus. https://www.oecd.org/pisa/pisa-2015-results-in-focus.pdf.

Popham, W. J. (1997). Consequential validity: Right concern—wrong concept. Educational Measurement: Issues and Practice, 16, 9–13.

Postlethwaite, T. N., & Kellaghan, T. (2009). National assessment of educational achievement. Paris: UNESCO, International Institute for Educational Planning. http://www.iiep.unesco.org/fileadmin/user_upload/Info_Services_Publications/pdf/2009/EdPol9.pdf.

Reilly, D., Neumann, D. L., & Andrews, G. (2015). Sex differences in mathematics and science achievement: A meta-analysis of National Assessment of Educational Progress assessments. Journal of Educational Psychology, 107(3), 645–662.

Reisman, F. K., & Kauffman, S. H. (1980). Teaching mathematics to children with special needs. Columbus: Merrill.

Robitaille, D., & Travers, K. (1992). International studies of achievement in mathematics. In D. Grouws (Ed.), Handbook of research on mathematics education (pp. 687–709). New York: Macmillan Publishing Company.

SAT (2016). College-bound seniors. Total group profile report. https://secure-media.collegeboard.org/digitalServices/pdf/sat/total-group-2016.pdf.

Schildkamp-Kundiger, E. (Ed.). (1982). International review on gender and mathematics. Columbus, OH: ERIC Clearinghouse for Science, Mathematics and Environmental Education. [ERIC Document No. 222326].

Scriven, M. (2002). Assessing six assumptions in assessment. In H. I. Braun, D. N. Jackson & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 255–275). Mahwah, NJ: Lawrence Erlbaum Associates.

Smith, S. E., & Walker, W. J. (1988). Sex differences on New York State Regents examinations: Support for the differential course-taking hypothesis. Journal for Research in Mathematics Education, 19(1), 81–85.

Spencer, S. J., Steele, C. M., & Quinn, D. M. (1999). Stereotype threat and women’s math performance. Journal of Experimental Social Psychology, 35, 4–28.

Steen, L. A. (2004). Achieving quantitative literacy: An urgent challenge for higher education (No. 62). Washington, DC: The Mathematical Association of America Incorporated.

Tarampi, M. R., Heydari, N., & Hegarty, M. (2016). A tale of two types of perspective taking: Sex differences in spatial ability. Psychological Science, 27(11), 1507–1516. https://doi.org/10.1177/0956797616667459.

Vacher, H. L. (2014). Looking at the multiple meanings of numeracy, quantitative literacy, and quantitative reasoning. Numeracy. https://doi.org/10.5038/1936-4660.7.2.1.

Victorian Curriculum and Assessment Authority [VCAA]. (2013). Mathematics 2006–2009. Frequently asked questions. http://www.vcaa.vic.edu.au/Pages/vce/studies/mathematics/mathsfaqs.aspx.

Wiersma, W., & Jurs, S. G. (2009). Research methods in education. An introduction (9th edn.). Boston: Pearson.

Zohar, A., & Gershikov, A. (2008). Gender and performance in mathematical tasks: Does the context make a difference? International Journal of Science and Mathematics Education, 6(4), 677–693.

Zumbo, B. D., & Hubley, A. M. (2016). Bringing consequences and side effects of testing and assessment to the foreground. Assessment in Education: Principles, Policy and Practice, 23(2), 299–303.