Accred Qual Assur DOI 10.1007/s00769-015-1162-z
DISCUSSION FORUM
Homogeneity studies and ISO Guide 35:2006

Stephen L. R. Ellison
LGC Limited, Teddington, Middlesex, UK
[email protected]
Received: 8 January 2015 / Accepted: 3 August 2015
© Springer-Verlag Berlin Heidelberg 2015
Abstract  Homogeneity assessment of reference materials is discussed with particular reference to the recommendations given in ISO Guide 35. It is suggested that Guide 35 could be improved by the addition of a wider range of experimental designs, and by using statistical power calculations to improve the recommendations for design of experiments for homogeneity assessment. It is further shown that the present recommended allowance for possible inhomogeneity uncertainty in the absence of a significant between-bottle effect does not have a sound statistical basis and is considerably smaller than an accurate upper 95 % confidence limit for the between-unit term. Alternatives are considered, including the use of a more accurate (and usually much larger) calculation of possible between-unit variance, post hoc power calculation, and the simple use of the calculated between-unit variance coupled with increased emphasis on effective experimentation to limit the risk of underestimation. Recommendations for improved guidance are made.

Keywords  Reference material certification · Homogeneity studies · ISO Guide 35
Papers published in this section do not necessarily reflect the opinion of the Editors, the Editorial Board and the Publisher. A critical and constructive debate in the Discussion Forum or a Letter to the Editor is strongly encouraged!
Introduction

ISO Guide 34 [1] requires the assessment of homogeneity of reference materials. Homogeneity testing is accordingly discussed in ISO Guide 35 [2] (note that references to Guide 35 in this paper refer solely to ISO Guide 35:2006). In addition, Guide 35 describes a generic uncertainty budget for reference material (RM) certification in the form

$u_{\mathrm{CRM}} = \sqrt{u_{\mathrm{char}}^2 + u_{\mathrm{bb}}^2 + u_{\mathrm{lts}}^2 + u_{\mathrm{sts}}^2} \qquad (1)$

where u_CRM denotes the standard uncertainty associated with the certified value, u_char the uncertainty arising from characterisation of the material, u_bb the uncertainty associated with between-unit variation in certified value arising from inhomogeneity, u_lts an allowance, expressed in the form of a standard uncertainty, for possible changes arising over the long-term storage of the material, and u_sts a similar allowance for possible instability during transport and short-term storage in the user's laboratory (note that "unit" here refers to a packaged unit—for example a bottle or vial—of a reference material, and not a measurement unit). The term u_bb is intended to allow for differences in (true) mean value between individual RM units, and Guide 35 recommends that it be derived from the results of a homogeneity study. Under Guide 35, u_bb is usually, but not always, based on the between-unit or "between-bottle" standard deviation found in the study. In this paper, the recommendations of Guide 35 for the design of homogeneity studies and the use of the resulting data to obtain u_bb are discussed and some improvements proposed.
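The combination in Eq. (1) is a simple root sum of squares; as a minimal illustrative sketch (the numerical values below are hypothetical, not taken from any Guide):

```python
import math

def u_crm(u_char, u_bb, u_lts, u_sts):
    """Combined standard uncertainty for the certified value, Eq. (1)."""
    return math.sqrt(u_char ** 2 + u_bb ** 2 + u_lts ** 2 + u_sts ** 2)

# Hypothetical standard uncertainties for a certified value:
print(u_crm(0.010, 0.004, 0.005, 0.002))  # -> approximately 0.012
```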
Homogeneity measurements

Experimental designs for homogeneity testing

Guide 35 (Clause 7.5) asserts that "Measurements in a homogeneity study should be carried out under repeatability conditions…". Although accurate, this simple statement can easily be read as excluding some important experimental strategies that can improve homogeneity studies. Many practical situations require more observations than can be carried out under stable repeatability conditions, for example when a comparatively small number of measurements can be carried out in the available run time, where a single multi-station instrument has a limited number of stations, or where extended run times show progressive drift. Under such circumstances, use of strategies such as nested and randomised block designs, which include run effects in the statistical model as well as permitting shorter runs, allows accurate comparison of the between-unit term with the repeatability standard deviation and unbiased estimation of both repeatability and between-unit variances without adverse impact from run-to-run effects. This arises because each set of observations is made under repeatability conditions, even though the study as a whole is not. Typical designs of these kinds are illustrated schematically in Fig. 1. Data analysis for all three can use classical analysis of variance assuming that between-unit effects are random. It is common, and sensible, to assume additionally that between-run effects are random, but for balanced designs, this does not change the outcome for the between-unit term. Restricted maximum likelihood (REML) estimates are also well-established and appropriate methods of estimating variance components for more complex experiments, and are strongly recommended for this purpose [3].

For randomised block designs, additional runs reduce the residual degrees of freedom compared to the simple one-way layout with the same total number of observations; this is rarely a problem for homogeneity assessment. For nested designs, however, additional runs reduce the degrees of freedom for the between-unit term that is the principal subject of the experiment. For every additional run added in a nested design, therefore, it is necessary to run at least one additional unit in order to retain a required minimum number of degrees of freedom for the between-unit term. For balance in a nested design, which simplifies some subsequent calculations, this will sometimes require addition of p units, p being the number of runs. For example, consider an initial simple design using 12 units to give 11 degrees of freedom for the between-unit term. If it
proved necessary to replace this with a nested design of three runs, the experiment would require an additional two units (14 units) to obtain 11 degrees of freedom for the between-unit term. This would, however, include two runs of five units and one of four; the design is unbalanced. A balanced design with five units in every run would need 15 units. Note, however, that a randomised block design with three runs could use 12 units whilst providing the desired 11 degrees of freedom for the between-group term.

Randomisation (considered further below) is essential for good experimental practice. In nested designs, it is especially important that units are not allocated to analytical runs in order of bottling; randomised allocation to runs is at least as important as randomisation within the run. Analysis of production trends remains advisable. The principal difference for randomised block and two-level nested designs is that run trends are best evaluated by inspection of model residuals—effectively subtracting the run means before plotting the residuals in unit filling or production order to avoid interference from the run effects. For objective tests, it is also possible (and recommended where available) to include a bottling trend in the statistical model and test objectively for it. This is best achieved using mixed effects modelling and can only realistically be carried out using software.

It is also valid to obtain more than one observation per unit in each run in a blocked design, but the analysis becomes more complex. For example, run/unit interactions should be examined and shown to be negligible before proceeding, and the corresponding classical ANOVA table should then be constructed for main effects only to obtain the required main effect variances. References such as Searle [3] or Snedecor and Cochran [4] describe the principles. It is, however, more convenient to apply specialist software to obtain the individual variance components in such designs. Current practice in statistics is to apply restricted maximum likelihood modelling with run, unit and residual terms treated as random effects; this deals properly with missing values and modest imbalance, as well as adjusting automatically for negative variance estimates.

Randomisation

Guide 35 does not currently recommend randomisation unambiguously; instead, it recommends more generally (in Section 7.5) that "The measurements should be carried out in such a way that a trend (drift) in the measurements can be separated from a trend in the batch of samples" and adds that "This can be achieved by measuring the replicates of the samples used in the homogeneity study in a randomized order". This latter statement offers randomisation as an option.
Fig. 1 Alternative designs for homogeneity studies. The plot shows schematic results for three balanced designs. (a) Simple one-way nested design (replicates within units), in which all observations are obtained in a single run and replicates are observed in fully randomised order. (b) Randomised block design, in which every RM unit is observed once in every run, with the order randomised separately in each run. (c) Two-level nested design (replicates nested within RM units, RM units nested within runs), in which RM unit numbers are randomly allocated to runs and replicates are also randomised within runs.
Section 7.5 immediately goes on to suggest and describe an alternative, systematic allocation of observations (considered further below). No examples are given of randomised designs, here or in the subsequent Appendix B. Although the particular layout suggested is reasonably effective in separating steady analytical within-run drift from bottling trends, the overall effect of the illustration is to place emphasis on systematic allocation of units over randomisation. It is also worth noting that whether randomised or carefully chosen systematic allocation is used, run trends can substantially increase the within-unit variance; it would be useful in any practical guidance to point this out and to note that alternative strategies (including blocked designs) can reduce the adverse effects of trends by reducing run lengths.

Systematic allocation of observations in a homogeneity study

The example of systematic allocation of observations in Guide 35, for a study with ten RM units and three replicates, reads as follows:

Replicate No. 1: 1–3–5–7–9–2–4–6–8–10
Replicate No. 2: 10–9–8–7–6–5–4–3–2–1
Replicate No. 3: 2–4–6–8–10–1–3–5–7–9
For this arrangement, the correlation between bottling order (represented by unit number) and analytical run order is small, but nonetheless slightly negative, leading to a small bias in filling order trend in the presence of a simple linear analytical trend. The design relies on the fact that in the presence of an analytical run trend, the within-unit variance increases much faster than the apparent gradient induced by the small correlation. The design is accordingly effective in avoiding incorrect conclusions on bottling trend when the run trend is linear; in fact, simulations indicate that the chance of incorrect interpretation as a bottling trend is substantially reduced, owing to the systematic increase in within-unit variance caused by the regular alternation in the design. Simulations of some alternative simple scenarios (for example, a systematic drift starting after 20 observations) confirmed this, though to a lesser extent as the run trend becomes more closely correlated with bottling order. The principal disadvantage is that, as with randomisation, measurement trends will, unless correctable, produce an increase in apparent within-unit variance, reducing the power of the experiment to detect inhomogeneity. It is also important to remember that this particular design relies on a specifically chosen combination of alternations and sequence inversions that nearly cancel a
simple linear trend in a three-replicate design. Although similar strategies are often effective for a linear or near-linear trend, each different design will generally require an individually chosen sequence and, unless the sequence is already well established and previously studied, any new choice of systematic allocation will require some degree of validation (by, for example, simulation or theoretical study) to demonstrate that it achieves its objectives. This becomes more difficult if the possibility of non-monotonic drift is considered; it is nearly always possible to find some conceivable systematic drift that compromises any given systematic allocation. By contrast, two-level nested or randomised block designs greatly reduce the impact of linear trends on the variance by shortening run time considerably, require minimal additional planning to adapt to different replicate or unit numbers and are sufficiently well established and characterised to require no additional validation. It follows that if a simple linear trend over a lengthy analytical run is a likely risk, dividing the study into several shorter runs is likely to be a more generally effective strategy than either systematic or simple random run order. It would therefore be useful if blocked and two-level nested designs were at least provided, and perhaps recommended, in future international guidance supporting RM production.
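To make the allocation step concrete, the following minimal sketch (the unit and run counts are illustrative, not prescriptive) generates a randomised block layout of the kind shown in Fig. 1b, with a separate randomisation in each run:

```python
import random

units = list(range(1, 13))  # 12 RM units, numbered in filling order
runs = 3

random.seed(1)  # fixed seed so the layout is reproducible
for run in range(1, runs + 1):
    order = units[:]        # every unit is observed once in every run
    random.shuffle(order)   # fresh random measurement order per run
    print(f"Run {run}: {order}")
```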
Evaluating a homogeneity study

Guide 35 treats inhomogeneity as a term in the uncertainty for the certified value. The term is derived from the homogeneity study. The present analysis in Guide 35 (Clause 7.7) assumes a simple one-way layout. The between-unit term is extracted in the usual way; given within- and between-group mean squares MS_w and MS_B, respectively, the between-unit (or "between-bottle") variance s_bb² is given by

$s_{bb}^2 = \frac{MS_B - MS_w}{n} \qquad (2)$

where n is the number of replicates per unit. s_bb is usually set to zero if the between-unit mean square is smaller than the within-unit mean square, and s_bb is used as the uncertainty associated with homogeneity if it is sufficiently large. Alternative action is suggested by Guide 35 if s_bb is small; this is discussed further below.

More complex designs naturally require somewhat more complex data analysis. For the two-level nested design in which units are nested within runs, the classical ANOVA table provides three mean squares: MS_w, the within-unit or residual mean square, and MS_unit and MS_run, the between-unit and between-run mean squares, respectively. The simplest analysis of the between-unit variance is almost identical to that for the simple single-run design:

$s_{bb}^2 = \frac{MS_{\mathrm{unit}} - MS_w}{n} \qquad (3)$

where n is the number of observations on each unit. Note that the run mean square is not needed for this calculation. The between-run variance s_run² can, however, be similarly calculated as

$s_{\mathrm{run}}^2 = \frac{MS_{\mathrm{run}} - MS_{\mathrm{unit}}}{np} \qquad (4)$

where p is the number of units per run.

For the blocked design where only one observation is obtained on each unit in each run, analysis typically employs two-way analysis of variance without replicates. The mean squares again relate to run, unit and residual terms. However, the between-unit term becomes

$s_{bb}^2 = \frac{MS_{\mathrm{unit}} - MS_w}{nq} \qquad (5)$
where q is the number of runs and n (n = 1 for the simplest randomised block design) is the number of observations per unit per run.

It is worth stating here that in a blocked experiment, it is not valid to subtract or divide by run means and then proceed by simple one-way ANOVA; this fails to adjust correctly for the reduction in degrees of freedom incurred by controlling for run means. However, if run means are controlled for by subtraction or division, a one-way ANOVA table can be used provided that the residual degrees of freedom are reduced by the number of runs and the residual mean square is recalculated accordingly. Broadly, the same is true for two-level nested designs, although in that case it is the between-group mean square and not the residual mean square that needs to be adjusted. Usually, however, it is just as simple to analyse the data properly, using appropriate software, according to a two-way model.

Finally, in connection with the data analysis, it is noted above that restricted maximum likelihood (REML) estimation is a well-established, appropriate and widely recommended method of estimating variance components for more complex experiments [3]. REML also copes well with modest imbalance, such as that caused by missing values arising from instrumental or other errors. Explicit reference to REML, and at least permission for its use, in future guidance would raise awareness of an important and useful statistical technique that has seen surprisingly little use to date.
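As an illustration of the calculations above, the following sketch computes the classical mean squares and the Eq. (2) between-unit variance for a balanced one-way layout; the data and the REML formulation in the closing comment are assumptions for illustration, not part of Guide 35:

```python
import numpy as np

def one_way_components(y):
    """Balanced one-way layout: y has shape (k units, n replicates).
    Returns MS_w, MS_B and the Eq. (2) between-unit variance."""
    k, n = y.shape
    ms_w = y.var(axis=1, ddof=1).mean()    # within-unit (residual) mean square
    ms_b = n * y.mean(axis=1).var(ddof=1)  # between-unit mean square
    s2_bb = max((ms_b - ms_w) / n, 0.0)    # Eq. (2), truncated at zero
    return ms_w, ms_b, s2_bb

# Simulated homogeneous material, 10 units x 2 replicates:
rng = np.random.default_rng(42)
y = 100.0 + rng.normal(0.0, 1.0, size=(10, 2))
print(one_way_components(y))

# For the nested or blocked designs, an equivalent REML fit with run and
# unit as random effects could use statsmodels (a long-format DataFrame
# df with columns 'value', 'run' and 'unit' is assumed):
#   import statsmodels.formula.api as smf
#   m = smf.mixedlm("value ~ 1", df, groups="run",
#                   vc_formula={"unit": "0 + C(unit)"}).fit(reml=True)
```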
Allowance when the between-unit term is not significant

Guidance in Guide 35:2006

Guide 35 notes that "It is not always feasible to perform a homogeneity study with a measurement method that is sufficiently repeatable. In those cases, an alternative approach may be necessary that attempts to estimate the maximum effect". Guide 35 itself does not make it entirely clear when this consideration applies. Ancillary references cited in Guide 35, particularly Ref. [3], suggest that this consideration applies whenever the between-unit term is smaller than a suggested minimum value for u_bb. Guide 35 goes on to give a method for estimating the minimum uncertainty associated with a between-unit effect. The relevant text, from Guide 35 Section 7.9, is:

"A discussion about various approaches to obtain an uncertainty estimate that accounts for insufficient repeatability of the measurement method other than the result of Eq. (2) is given in Ref. [5]. The influence of the repeatability standard deviation on s_bb can be accounted for using

$u_{bb}^{*} = \sqrt{\frac{MS_{\mathrm{within}}}{n}}\ \sqrt[4]{\frac{2}{\nu_{MS_{\mathrm{within}}}}} \qquad (6)$

where MS_within is equal to the repeatability variance of the measurements used in the between-bottle homogeneity study. This expression is based on the consideration that a confidence interval can be developed for s_bb, and that the half-width of the 95 % confidence interval, converted to a standard uncertainty, can be taken as a measure of the impact of the repeatability of the method on the estimate of s_bb."

(Note that the citation and equation numbers in the quotation have been amended to correspond with the numbering in this review.)

It is clear from this extract that the intention of Guide 35 and Ref. [5] is to include an allowance based on the 95 % confidence interval for the between-unit term. The derivation of Eq. (6) is given in [5], which indicates that the equation is intended to estimate the "maximum between-bottle variability that could be masked by method variation". Guide 35 itself implies that the equation is based on the half-width of a 95 % confidence interval "converted to a standard uncertainty". It will be seen that this is not an accurate description.

The derivation proceeds by considering the standard uncertainty in the within-group mean square, derived from the Chi-squared distribution for MS_w, and adding this
uncertainty to the between-group mean square (assumed approximately equal to the within-group mean square) before assessing the effect on the calculated between-group standard deviation by substituting the amended between-group mean square for MS_b. Unfortunately, there are substantial flaws in the derivation. Details are given in the Appendix; the main points are as follows. First, the standard uncertainty for the within-group term is not the uncertainty for the quantity of interest. The quantity of interest is the between-group component of variance. Thus, Eq. (6) here describes the effect of a standard uncertainty of a quantity which is at best of indirect interest in estimating the upper bound of the between-group term. Second, both the within- and between-group mean squares have associated variances, and focussing only on the within-group mean square is certain to underestimate the uncertainty in the between-group standard deviation. Third, the calculation ignores the distribution of the estimate, which is asymmetric. Finally, the stated intent in [5] and the statement in ISO Guide 35 both refer misleadingly to the half-width of a 95 % confidence interval "converted to" a standard uncertainty, whereas the derivation does not consider a 95 % interval and Eq. (6) instead provides an estimated standard uncertainty for the between-unit standard deviation directly.

Alternative calculations for uncertainty associated with inhomogeneity

How important these rather fundamental issues are in practice depends on whether the numerical value of the calculated term is either approximately correct or at least conservative. Some insight into this can be obtained by comparing the Guide 35 estimate to the correctly estimated confidence interval for the between-group variance estimated from a one-way ANOVA. A reliable recommendation for this confidence interval is given in detail by Burdick and Graybill [6]. The recommended approximate 1 − 2α interval for a one-way layout is
$\left[\frac{S_b^2 - S_w^2 - \sqrt{V_L}}{n},\ \frac{S_b^2 - S_w^2 + \sqrt{V_U}}{n}\right] \qquad (7)$

where S_b² and S_w² are the between- and within-group mean squares, n is the number of replicates, and

$V_L = G_b^2 S_b^4 + H_w^2 S_w^4 + G_{bw} S_b^2 S_w^2, \qquad V_U = H_b^2 S_b^4 + G_w^2 S_w^4 + H_{bw} S_b^2 S_w^2$

$G_l = 1 - \frac{1}{F_{\alpha;\,\nu_l,\,\infty}} \quad (l = b, w), \qquad H_l = \frac{1}{F_{1-\alpha;\,\nu_l,\,\infty}} - 1 \quad (l = b, w)$

$G_{bw} = \frac{\left(F_{\alpha;\,\nu_b,\,\nu_w} - 1\right)^2 - G_b^2\,F_{\alpha;\,\nu_b,\,\nu_w}^2 - H_w^2}{F_{\alpha;\,\nu_b,\,\nu_w}}, \qquad H_{bw} = \frac{\left(1 - F_{1-\alpha;\,\nu_b,\,\nu_w}\right)^2 - H_b^2\,F_{1-\alpha;\,\nu_b,\,\nu_w}^2 - G_w^2}{F_{1-\alpha;\,\nu_b,\,\nu_w}}$

and $F_{\alpha;\,\nu_1,\,\nu_2}$ denotes the upper one-tailed critical value for F with upper tail area α. For brevity, this interval is referred to as the Burdick–Graybill (BG) interval for the remainder of this discussion.
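A minimal sketch of the one-sided BG upper limit, with the Guide 35 Eq. (6) allowance alongside for comparison, might read as follows (the function names are mine and scipy is assumed; infinite-denominator F critical values are obtained via the chi-squared distribution):

```python
from scipy.stats import chi2, f

def bg_upper_limit(ms_b, ms_w, k, n, alpha=0.05):
    """One-sided upper (1 - alpha) BG confidence limit for the
    between-group variance component, balanced one-way layout."""
    nu_b, nu_w = k - 1, k * (n - 1)
    # F_{a; nu, inf} = chi2_{a; nu} / nu (upper tail area a)
    H_b = 1.0 / (chi2.isf(1.0 - alpha, nu_b) / nu_b) - 1.0
    G_w = 1.0 - 1.0 / (chi2.isf(alpha, nu_w) / nu_w)
    F_1a = f.isf(1.0 - alpha, nu_b, nu_w)   # upper tail area 1 - alpha
    H_bw = ((1.0 - F_1a) ** 2 - H_b ** 2 * F_1a ** 2 - G_w ** 2) / F_1a
    V_U = H_b ** 2 * ms_b ** 2 + G_w ** 2 * ms_w ** 2 + H_bw * ms_b * ms_w
    return (ms_b - ms_w + V_U ** 0.5) / n

def guide35_allowance_var(ms_w, k, n):
    """Guide 35 Eq. (6) allowance, squared to give a variance."""
    nu_w = k * (n - 1)
    return (ms_w / n) * (2.0 / nu_w) ** 0.5

# Example: MS_b = MS_w = 1 for a ten-unit, two-replicate study:
print(bg_upper_limit(1.0, 1.0, 10, 2))        # approx 0.87
print(guide35_allowance_var(1.0, 10, 2))      # approx 0.22
```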
For verification, this interval was tested by generating 10 000 simulated data sets for each of four combinations of within-group:between-group variance ratio and data set size. In all cases, the fraction of intervals containing the (known) between-group variance was, as expected, very close to 95 %.

The upper limit of the one-sided 95 % confidence interval for s_bb² is compared with some alternative calculations for u_bb² in Fig. 2, as a function of the observed F (= MS_b/MS_w). Both plots assume ten groups; Fig. 2a assumes n = 2, and Fig. 2b assumes n = 3. Both plots use the same vertical scale, allowing direct comparison. The alternative calculations include:

(i) The within-group variance s_w² (= MS_w), a simple order-of-magnitude allowance.
(ii) s_w²/n, a less conservative order-of-magnitude allowance suggested on largely heuristic grounds [7] by a study group convened in 1999–2000 to consider the impact of the then newly issued ISO Guide to the expression of uncertainty in measurement ("the GUM") [8] (originally issued in 1993) on the then current EU guidance for RM production [9].
(iii) The estimated between-group variance s_bb² calculated as in Eq. (2), with truncation at zero. This is perhaps the most defensible estimate, given that it is the best estimate of the quantity of interest and the GUM requests "best estimates" of both measurand value and uncertainty.
(iv) The value σ²_{b; 0.8, 0.05} of the between-group variance σ_b² that would be detected with a statistical power of 0.8 (80 %) given a within-group variance σ_w² equal to 1.0 and a significance level of 0.05 (95 % confidence) for the F-test for significant between-group variance. A power of 0.8, or 80 %, is commonly chosen for experimental designs in many sectors, balancing good power with some economy. This is a comparatively conservative estimate based, essentially, on post hoc power considerations; it directly answers the question "what size of between-group effect could reasonably have been detected with this experiment if the true within-group variance were equal to s_w²?". It has a clear theoretical basis (from power calculations), but, like any post hoc power calculation, it relies on a value (for s_w²) taken from the data and also depends on the power chosen; a sketch of this calculation is given below.
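Item (iv) can be computed directly for a balanced design. For a random-effects one-way layout, the observed F ratio under the alternative is distributed as (1 + n σ_b²/σ_w²) F(ν_b, ν_w), so the detectable variance ratio has a closed form; a sketch (my function name, scipy assumed):

```python
from scipy.stats import f

def detectable_between_variance(k, n, power=0.8, alpha=0.05, s2w=1.0):
    """Smallest between-group variance the one-way F-test detects
    with the stated power, for within-group variance s2w."""
    nu_b, nu_w = k - 1, k * (n - 1)
    f_crit = f.isf(alpha, nu_b, nu_w)
    # power = P(F > f_crit / (1 + n*lam)) with F ~ F(nu_b, nu_w);
    # solve for the variance ratio lam:
    q = f.ppf(1.0 - power, nu_b, nu_w)   # lower (1 - power) quantile
    lam = (f_crit / q - 1.0) / n
    return lam * s2w

print(detectable_between_variance(10, 2))  # ten groups, duplicates
print(detectable_between_variance(10, 3))  # ten groups, triplicates
```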
Fig. 2 Calculated allowances for the between-unit variance s_bb² for different methods (see text), plotted against the calculated F value, for experiments using (a) two replicates per unit and (b) three replicates. The variance allowance is shown as a multiple of the estimated within-unit variance s_w². The vertical dashed line corresponds to the critical value of F for a test for significant inhomogeneity. Note that although all the allowances are intended as possible values for u_bb in Eq. (1), they differ appreciably in principle of calculation; see the text for further discussion.

There are two important factors to consider in reviewing all of these possible allowances. First, all of them, including the Guide 35 allowance and the BG 95 % upper limit, provide
a variance (or standard deviation) that could be included directly in a calculation of a reference material uncertainty in order to allow for between-unit inhomogeneity. That is, each would directly replace u_bb in Eq. (1). In that sense, they can and should be compared directly with one another. Secondly, however, the different procedures have differing underlying intent. Some, like (iii) above, provide some estimate of the true variance; some, such as (ii) above, provide only an order-of-magnitude allowance, and others set a statistically based upper limit for the between-unit variance. In particular, the BG interval shown is the upper 95 % confidence limit for the true between-unit variance, which is expected to be substantially more conservative than most others. This difference in intent explains some of the large differences in value in Fig. 2. Nonetheless, the comparison in Fig. 2 is striking. The Guide 35 allowance is by far the smallest of the allowances shown, especially for n = 3. It is very much smaller than half of the BG upper
limit for most of the range shown, so clearly not close to the half-width of a reliable 95 % interval even for much of the range F < 1. It is also small in absolute terms: 0.22 s_w² for n = 2 and 0.11 s_w² for n = 3. It is usually hard to reduce other uncertainties to less than 20 % of the available within-run variance, making it unlikely that the Guide 35 allowance would often contribute significantly to the certified value uncertainty unless the method used for the homogeneity study was much less precise than the method used for characterisation.

It follows from the foregoing discussion that (i) the current recommendation in Guide 35 is not a valid estimate of the likely, or the possible maximum, between-bottle variance and (ii) that it is not usefully conservative compared to other proposed calculations for u_bb. It is therefore difficult to recommend retaining the present recommendation, and it is appropriate to consider alternative approaches.

Of the deliberately conservative possibilities suggested above, two provide upper bounds with some defensible statistical basis: the one-sided BG interval and the post hoc power calculation. The one-sided 95 % BG interval is a good estimate of the upper 95 % confidence limit for the true between-bottle variance. However, it has the substantial disadvantage of being considerably larger than the estimated between-bottle variance itself, typically by a factor of four or more even above the 95 % critical value for F, at which most would argue that it is entirely defensible to use s_bb² itself. Only an extremely conservative RM producer would adopt such an extreme allowance. It would also not be sensible to adopt a strategy that changed from the 95 % limit back to the point estimate as the critical value was passed; aside from inconsistency of approach, that would result in the allowance decreasing sharply just as the evidence for a between-bottle term became statistically significant.

The post hoc power calculation is generally conservative and appealing from the point of view of answering a useful question: what could reasonably have been detected as significant (and, by implication, what could reasonably have gone undetected)? It also allows adoption of a conservative, but fixed, value until exceeded by an increasingly strongly significant between-unit variance estimate. However, it, too, exceeds s_bb² for some way above the critical value for F, especially for n = 2. It is also not clear that the bound should be set at 80 % power; although commonly used, a conservative view might choose 95 % power, resulting in appreciably higher estimates. Both of the statistically based conservative allowances, therefore, seem unnecessarily conservative.

The two order-of-magnitude estimates based on the within-group variance behave similarly to the post hoc power calculation; s_w² is more conservative, especially at
high n; s_w²/n, unsurprisingly, gives intermediate behaviour between s_w² and the Guide 35:2006 suggestion. Coincidentally, for this experiment size, s_w² falls close to the value of s_bb² at F_crit for n = 2, while s_w²/n falls close to that for n = 3. Neither coincidence is a strong argument for using either allowance.

Consistent use of the between-unit variance

The remaining suggestion above is to use s_bb² itself, irrespective of statistical significance. This is automatically defensible as being the best available estimate of the variance of interest. It is not noticeably less conservative than the current Guide 35 allowance when the homogeneity test uses the characterisation method, or one equally precise, and it has a very good theoretical basis. It is also directly available from most sensible homogeneity experiments using classical ANOVA or by other computational methods. It is, however, harder to defend against the risk of underestimation of σ_bb² when a comparatively imprecise method is used for homogeneity testing. Both this risk and possible solutions to it are therefore of interest.

First, is there really a severe risk of consistently underestimating inhomogeneity by use of an imprecise test method? Obviously, any homogeneity test can underestimate, and the relatively small studies discussed here will frequently fail to detect inhomogeneity (using an F-test) on the scale of σ_bb. But note that imprecise methods will also overestimate inhomogeneity far more than precise methods. For σ_bb = 0, and taking the usual practice of setting negative variance estimates to zero, s_bb calculated from Eq. (2) is positively biased. For a ten-unit experiment with two replicates (a "10 × 2 design") on a homogeneous material, the bias in between-unit variance is about +0.13 σ_w², corresponding to u_bb of approximately 0.36 σ_w. Further, though imprecise methods will more often give zero s_bb for small positive σ_bb than will more precise methods, they will also frequently give homogeneity uncertainties that are much larger. For the 10 × 2 design applied to a homogeneous material, the upper 95 % quantile is at about 0.7 σ_w. This risk is a powerful incentive to use more precise methods—perhaps more so than the allowance suggested in Guide 35:2006, which is always nonzero but which has a lower risk of such high uncertainties.

Second, the solution to the problem of imprecise homogeneity test methods should not be to enforce an essentially arbitrary lower limit to u_bb. Doing so effectively penalises all homogeneity studies, not just imprecise studies. But the intention of Guide 35 is to recommend good practice for producing reference materials that are demonstrably fit for purpose, not to penalise poor practice. If a less precise method is needed for homogeneity
testing—whether for technical or economic reasons—good practice simply requires that the experiment be planned in such a way as to detect important inhomogeneity after taking account of the available precision. This can be done, for example, by requiring that the experiment has a combination of sufficient precision and sufficient replication to give a reasonably good estimate of the between-unit standard deviation. One approach to this is to apply an appropriate power calculation to determine the required experiment size. Power calculations for one-way ANOVA are available in at least two free software applications [10, 11], as well as most commercial statistics packages, and if necessary can be implemented in spreadsheet applications. Further, choice of experiment size based on desired power is in any case good practice in design of experiments. There are, however, disadvantages to power calculations. The feasibility of implementing a power calculation basis in future RM certification guidance is therefore considered further below.
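The bias figures quoted above for the 10 × 2 design are straightforward to check by simulation; a minimal sketch, assuming a homogeneous material with σ_w = 1:

```python
import numpy as np

rng = np.random.default_rng(7)
k, n, n_sim = 10, 2, 20000
s2_bb = np.empty(n_sim)
for i in range(n_sim):
    y = rng.standard_normal((k, n))          # sigma_bb = 0, sigma_w = 1
    ms_w = y.var(axis=1, ddof=1).mean()      # within-unit mean square
    ms_b = n * y.mean(axis=1).var(ddof=1)    # between-unit mean square
    s2_bb[i] = max((ms_b - ms_w) / n, 0.0)   # Eq. (2), truncated at zero

print(s2_bb.mean())                          # bias: roughly 0.13 sigma_w^2
print(np.quantile(np.sqrt(s2_bb), 0.95))     # upper 95 % quantile: roughly 0.7
```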
On the use of statistical power calculations for homogeneity experiments

Introduction to power analysis

Statistical significance tests are commonly used to decide whether experimental observations—summarised as a test statistic—could reasonably have occurred by chance given some initial hypothesis (the "null hypothesis") or whether the observations are sufficiently improbable to reject that hypothesis in favour of another, the "alternate hypothesis". "Sufficient improbability" is codified in the familiar "significance level" α chosen for the test (often 0.05, for 95 % confidence). This controls the probability of incorrectly rejecting the null hypothesis if it is true, but says nothing about the probability of correctly rejecting it if it is false. The probability of correctly rejecting the null hypothesis is the statistical power of the test. A good experimenter will design an experiment so that it has a good chance of making correct decisions about both acceptance and rejection of the null hypothesis.

The simplest way in which power considerations can inform guidance is simply to recommend that experiments be designed to give a particular minimum test power. For example, one could require designs that give at least 80 % power to detect inhomogeneity equivalent to one within-unit standard deviation when using a significance test at 95 % confidence. This is straightforward and allows considerable flexibility. An example will help to illustrate this. Assume that a reference material producer has decided that
inhomogeneity up to 2 % (as a between-bottle relative standard deviation) is acceptable without compromising the certified uncertainty beyond use. A homogeneity test method is available with a precision of 2 %, again as relative standard deviation. This corresponds to a between-unit/within-unit variance ratio of 1. For the homogeneity test, 15 units of the material are available, and a simple one-way layout is planned, with sufficient homogeneity to be decided on the basis of a simple F-test at 95 % confidence. The producer wants to be reasonably confident of detecting inhomogeneity over 2 % RSD, so chooses a test power of 80 %. Providing this information to a simple power calculator such as G*Power [11] returns a minimum value of 2.2 for n; two replicates are nearly sufficient (actually giving a test power of about 70 %), and obtaining at least 80 % power with a balanced design requires three replicates per unit. Now assume that a much cheaper test method becomes available, but with a precision of 3 % RSD. This corresponds to a variance ratio of 4/9. To achieve the same test power now requires at least four replicates per unit, approximately doubling the number of observations.

Two things should be clear from this simple example. First, statistical power calculations provide an objective tool for deciding experiment size to achieve a desired power for the experiment. Second, the relatively sharp increase in total number of observations for a comparatively small change in test method precision suggests that large changes in precision would lead very quickly to very large experiments. Similarly, unmanageable experiment sizes will occur if the experimenter sets unrealistically optimistic test power requirements for the experiment.

Disadvantages of power calculations in reference material certification

Although the idea of specifying test power in RM standardisation is at first sight attractive, there are some important drawbacks to the use of power calculation. First, like any other statistical test, power calculations make assumptions. Unlike ANOVA itself, which makes no assumptions about the distribution of true group means under the null hypothesis (because the null assumes a zero variance for the between-group term), a power calculation requires assumptions about the distribution of groups under the alternate hypothesis. In the illustrations so far, it has been assumed that the alternate hypothesis is characterised by a normal distribution of true group means. This is not the only possible case. A more serious problem for RM users is the presence of rare, but extreme, units. Experiments designed to optimise power for a normal assumption can be very poor at detecting rare extreme cases in an otherwise homogeneous material. Any guidance on
particular test power would therefore have to either provide guidance on the distributions to be considered, together with the corresponding calculations, or choose some compromise distribution and perhaps provide tabulated values of n and k to achieve the desired test power (this approach has actually been used in the social sciences; see Ref. [12] for a textbook that provides exactly this type of tabulation). Second, current minimum recommendations for homogeneity studies deliver very low power, largely because they are intended to detect only very gross inhomogeneity that might remain if a planned homogenisation procedure failed. Incautious application of power calculations might substantially increase the cost of homogeneity (and other) studies with only modest gain in quality of reference materials. Third, a more intractable difficulty is that power calculations relate to the power of a pass/fail test. If the intent is, as is very often the case, to obtain a reasonable estimate of s_bb rather than to carry out a test for statistical significance, a power calculation will not usually be appropriate at all.

Although it is likely to be useful to encourage experimenters to examine the power of their experiments and plan accordingly, perhaps the most effective use of statistical power calculations in guidance for reference material production is to ensure that different choices of experiment and test method give approximately equivalent power to some recommended design or designs. For example, Guide 35 currently suggests a minimum homogeneity test size of ten units analysed in duplicate by the most precise test method available. It is then straightforward to suggest, for example, optimising only n, or permitting only an increase in k while optimising n, to allow experimenters to adjust experiment size upward to allow for less precise test methods (an assumed starting ratio of between-group to within-group variance would, of course, have to be stated). Since Guide 35 already specifies minimum values for k, this offers an appealing solution to the problem of controlling poor test methods in homogeneity tests.
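As an illustration, the power of the simple F-test for a balanced design can be computed directly from the random-effects distribution of the F ratio, allowing candidate designs to be compared for approximately equivalent power. The following sketch uses a random-effects formulation, so values may differ slightly from the fixed-effects calculation used by G*Power and quoted in the worked example above:

```python
from scipy.stats import f

def homogeneity_test_power(k, n, variance_ratio, alpha=0.05):
    """Power of the one-way F-test for k units, n replicates and a
    true between/within variance ratio (random-effects formulation)."""
    nu_b, nu_w = k - 1, k * (n - 1)
    f_crit = f.isf(alpha, nu_b, nu_w)
    # Under the alternative, MS_b/MS_w ~ (1 + n*ratio) * F(nu_b, nu_w)
    return f.sf(f_crit / (1.0 + n * variance_ratio), nu_b, nu_w)

# Worked example above: 15 units, 2 % method RSD (variance ratio 1);
# roughly 0.7 for n = 2 and about 0.9 for n = 3 on this formulation:
for n in (2, 3):
    print(n, round(homogeneity_test_power(15, n, 1.0), 2))
# Cheaper 3 % RSD method (ratio (2/3)**2): more replicates are needed,
# with the power passing 80 % at around four to five replicates:
for n in (3, 4, 5):
    print(n, round(homogeneity_test_power(15, n, 4.0 / 9.0), 2))
```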
Conclusions

Based on the discussion above, the following recommendations are made for improving homogeneity testing guidance in Guide 35:

• Add a modest range of more sophisticated experimental designs, including blocked and nested designs, with some guidance on their advantages and associated data analysis.
• Remove the current calculation of the allowance when the between-group term is small, and instead use s_bb itself irrespective of the value found.
• Include recommendations for effective homogeneity tests that limit the risk of significantly underestimating s_bb.
• Add reference to the use of statistical power calculations to compare proposed homogeneity study designs, especially in recommendations for increased experiment size when a low-precision test method is used for economic reasons.
• Provide explicit permission for, or guidance on, the use of restricted maximum likelihood estimation in the analysis of homogeneity test data.
Acknowledgments Preparation of this paper was supported by the UK National Measurement System.
Appendix: Derivation of Guide 35 Eq. (6): critical review

Guide 35 provides the equation

$u_{bb}^{*} = \sqrt{\frac{MS_w}{n}}\ \sqrt[4]{\frac{2}{\nu_{MS_{\mathrm{within}}}}}$

as an allowance for between-bottle variation when the between-group variance in a one-way ANOVA is small. The derivation (from Ref. [5]) proceeds as follows. In common with Ref. [5], a one-way analysis of variance applied to k groups of n observations is assumed. Lines are separated and numbered to facilitate comment.

1. It is assumed that the between-bottle standard deviation s_bb is calculated from
2. $s_{bb} = \sqrt{(MS_b - MS_w)/n}$, where MS_b and MS_w are the between- and within-group mean squares from a one-way analysis of variance and n is the number of replicates per group.
3. The intent of the derivation is to evaluate the impact of (the uncertainty of) MS_w on s_bb.
4. The uncertainty of MS_w can be expressed as
5. $u(MS_w) = \sqrt{2 MS_w^2 / \nu_{MS_w}}$, where ν_MSw is the number of degrees of freedom for the within-group mean square (equal to k(n − 1)).
6. This is asserted to follow from the variance of a variance, given in [5] as 2s⁴/ν.
7. Using standard uncertainties, the maximum value of MS_b is given as MS_w + u(MS_w) in the case where MS_b = MS_w.
8. A "corrected" (sic) MS_b, MS_b*, is given as
   $MS_b^{*} = MS_w + \sqrt{\frac{2 MS_w^2}{\nu_{MS_w}}}$
9. Substituting MS_b* for MS_b in the calculation for s_bb (line 2 above) gives
   $u_{bb}^{*} = \sqrt{\frac{MS_w}{n}}\ \sqrt[4]{\frac{2}{\nu_{MS_w}}}$

This is Eq. (6) of Guide 35. Taking the development line by line:

1, 2. This is the usual calculation for the between-group standard deviation σ_b, with the usual procedure being truncation at zero.

3. (a) The question of interest is "what range of values of the true between-group variance σ_b² are consistent with the observed mean squares?" and not the more restricted question "how much could uncertainty in MS_w affect the estimate of s_bb?". (b) Focussing solely on uncertainties in MS_w fails to consider the fact that the between-group term MS_b also follows its own, independent, Chi-squared distribution with k − 1 degrees of freedom and that the uncertainty associated with the difference should take both sources of variation into account.

4–6. The development appears (correctly) to assume a chi-squared distribution with ν_MSw degrees of freedom for MS_w. However, the variance of a sample variance s² is not 2s⁴/ν, but 2σ⁴/ν, where σ² is the population variance. Substituting s² for σ² leads to biased estimates of the sampling variance, though the effect is modest for large degrees of freedom.

7. (a) The statement implicitly assumes MS_w = MS_b. This is a very restrictive assumption. It is possible that the intended assumption was MS_w ≥ MS_b, which would at least guarantee small or negative estimates of σ_b². It is, however, worth noting that s_b² < MS_w for any MS_b less than (n + 1)MS_w. Since n ≥ 2 for any balanced one-way ANOVA, it follows that MS_b could be quite different from MS_w while still falling into the domain of interest for this problem. (b) Much more fundamentally, since MS_b and MS_w are independent and have different degrees of freedom, the uncertainty in MS_w has little or nothing to do with the uncertainty in MS_b.

8. It is unclear why MS_b should be corrected upwards by a standard uncertainty in MS_w. The standard uncertainty does not provide the full extent of possible or even probable deviation; the amount added is the uncertainty in a different quantity; the distribution is asymmetric, and (as noted above) both MS_b and MS_w have uncertainties and both need to be taken into account in estimating the uncertainty of their difference.
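To put numbers on the discrepancy discussed above, the following self-contained sketch evaluates Eq. (6) (as a variance) against the one-sided BG upper 95 % limit at the point MS_b = MS_w assumed in the derivation, for a 10 × 2 design with σ_w² = 1 (scipy assumed):

```python
from scipy.stats import chi2, f

k, n, alpha = 10, 2, 0.05
nu_b, nu_w = k - 1, k * (n - 1)
ms_b = ms_w = 1.0            # the MS_b = MS_w case assumed in line 7

# Guide 35 Eq. (6) allowance, squared to give a variance:
u_bb_star_sq = (ms_w / n) * (2.0 / nu_w) ** 0.5

# BG one-sided upper 95 % limit for the between-group variance:
H_b = 1.0 / (chi2.isf(1.0 - alpha, nu_b) / nu_b) - 1.0
G_w = 1.0 - 1.0 / (chi2.isf(alpha, nu_w) / nu_w)
F_1a = f.isf(1.0 - alpha, nu_b, nu_w)
H_bw = ((1.0 - F_1a) ** 2 - H_b ** 2 * F_1a ** 2 - G_w ** 2) / F_1a
V_U = H_b ** 2 * ms_b ** 2 + G_w ** 2 * ms_w ** 2 + H_bw * ms_b * ms_w
bg_upper = (ms_b - ms_w + V_U ** 0.5) / n

print(u_bb_star_sq, bg_upper)  # approx 0.22 versus approx 0.87
```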
References

1. ISO Guide 34:2009 (2009) General requirements for the competence of reference material producers. International Organization for Standardization, Geneva
2. ISO Guide 35:2006 (2006) Reference materials—general and statistical principles for certification. International Organization for Standardization, Geneva
3. Searle SR, Casella G, McCulloch CE (2006) Variance components. Wiley, Hoboken
4. Snedecor G, Cochran W (1989) Statistical methods, 8th edn. Iowa State University Press, Ames. ISBN 0813815614
5. Linsinger TPJ, Pauwels J, Van Der Veen AMH, Schimmel H, Lamberty A (2001) Homogeneity and stability of reference materials. Accred Qual Assur 6:20–25
6. Burdick RK, Graybill FA (1992) Confidence intervals on variance components. Marcel Dekker, New York. ISBN 0-8247-8644-0
7. Ellison SLR, Burke S, Walker RF, Heydorn K, Månsson M, Pauwels J, Wegscheider W, Nijenhuis B (2001) Accred Qual Assur 6:274–277
8. ISO/IEC Guide 98-3:2008 (2008) Uncertainty of measurement—Part 3: Guide to the expression of uncertainty in measurement. International Organization for Standardization, Geneva
9. European Commission: DG XII-5-C (SMT Programme) (1997) Guidelines for the certification of reference materials. Document BCR/01/97, European Commission, Brussels
10. R Development Core Team (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org
11. Faul F, Erdfelder E, Lang AG, Buchner A (2007) G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods 39(2):175–191 (G*Power is available from http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/)
12. Bausell RB, Li YF (2002) Power analysis for experimental research. Cambridge University Press, Cambridge