Accred Qual Assur DOI 10.1007/s00769-012-0894-2
GENERAL PAPER
Causes of error in analytical chemistry: results of a web-based survey of proficiency testing participants
Stephen L. R. Ellison · William A. Hardcastle
Received: 12 December 2011 / Accepted: 27 March 2012
© Springer-Verlag 2012
Abstract Results of a voluntary-response survey of respondent-identified causes of unacceptable results in nine proficiency testing (PT) schemes are reported. The PT schemes were predominantly environment and food analysis schemes. 111 respondents reported 230 identified causes of error. Sample preparation (16 % of causes reported), equipment failures (13 %), 'human error' (13 %) and calibration (10 %) were the top four general causes of poor analytical results. Among sample preparation errors, sample extraction or recovery problems were the most important causes reported. Most calibration errors were related to errors in calculation and dilution, not to the availability or quality of calibration materials. No failures were attributed to defects in commercial software; software-related problems were largely associated with user input errors. Corrective actions were generally specific to the particular problem identified. Review of all reported causes indicated that about 44 % could be attributed to simple operator errors.

Keywords Survey · Proficiency testing · Causes of error
This material was presented in part at the joint EUROLAB, EA and Eurachem workshop ‘Accreditation—a tool to develop competence’ (BAM Berlin, 2007).
Electronic supplementary material The online version of this article (doi:10.1007/s00769-012-0894-2) contains supplementary material, which is available to authorized users.

S. L. R. Ellison (✉) · W. A. Hardcastle
LGC Limited, Queens Road, Teddington, Middlesex TW11 0LY, UK
e-mail: [email protected]
Introduction

An indication of the more common causes of error is clearly of interest in formulating measurement infrastructure programmes. An obvious place to seek such information is the results of proficiency testing schemes. Proficiency testing (PT) as described in the IUPAC [1] and ISO [2] protocols is an effective means of monitoring the quality of measurement results in laboratories, and participation in PT has been shown to result in improvement of laboratory performance over time in a variety of sectors [3–6]. PT has also been used to assess and compare the interlaboratory performance of analytical methods [7–13], and a recent report has reviewed the use of PT for that purpose [14]. Retrospective analysis of PT data has also been used to assess the impact or efficacy of quality management activities, for example the impact of analytical quality control [15], accreditation [16–18] or measurement uncertainty evaluation [19]. Yet despite the well-documented expectations that laboratories should investigate the causes of poor scores and institute appropriate corrective action (for example in the accreditation standard ISO/IEC 17025 [20]), and the evidence that good PT providers collect, review and use methodological information to assist feedback to users (see, for example, [21] and related events), there seem to be few published summaries of specific causes of poor scores following investigation within the laboratories concerned. There are examples from clinical chemistry: Steindel et al. [22] surveyed causes of PT failures in clinical chemistry and blood gas analysis, referencing three earlier studies in clinical laboratories; Jenny and Jackson-Tarentino [23] investigated causes of unsatisfactory performance in clinical laboratories; and Hertzberg et al. [24], with references cited therein, briefly identified principal causes of problems in external quality assurance (EQA) programmes for genetic testing. But recent examples from general analytical chemistry are at best hard to find.

There are many possible reasons for this. Post-study investigation is not the responsibility of the provider, and laboratories need not report back to the scheme organisers; data collection and analysis are time-consuming; the burden on participants of providing detailed information with every report at the point of reporting measurement results may be high, discouraging detailed questionnaires or reducing response; post-study investigation takes time and is not complete at the time of reporting results; and PT providers have no obligation to report such summaries outside the scheme community (indeed, confidentiality obligations may make it difficult to do so). Some of these barriers can be overcome by conducting a survey set up independently of the scheme's normal operation. The cost of data collection and summary can be reduced considerably by taking advantage of electronic survey methods, which were becoming widely available at the time of this study. Confidentiality barriers to publication can be reduced by requesting voluntary response with permission for summary publication.

Our laboratory accordingly conducted a pilot voluntary-response web-based survey to collect information on causes of error in PT during 2004–2005. Interest expressed in the full data following presentation of highlights at conferences has encouraged us to place the complete results in the open literature. In this paper, therefore, we report the full results of the study. We describe the survey methodology, followed by summary data on the principal findings. The detailed response counts are provided in Appendix 1; charts of these, together with all supplementary textual responses received (unattributed), are provided as electronic supplementary material (ESM) for further review if desired.
Experimental

The study used a web-based survey service provided by a commercial marketing survey company (Key Survey, Forbes Business Center, Braintree, MA; http://www.keysurvey.com/). The service was selected following testing to ensure that the relevant questionnaire structure, support for optional questions and variety of response types (categorical, scale, free text, etc.) were supported. The questionnaire was trialled before use among analytical staff at our laboratory; comments were incorporated into a final questionnaire, which was launched in October 2004 and kept open until April 2005.

Statistical analysis and graphs presented in this paper were prepared using R version 2.13.0 [25], following transfer of raw data using the data extraction tools provided by the survey service provider (above). The statistical significance of differences within a set of responses was checked using chi-squared tests; differences were considered to be attributable to chance unless the test was significant at the 95 % level of confidence or above. The results of the chi-squared tests are included in Appendix 1 and Appendix 2 (ESM). Since some groups of responses included low or zero counts, the chi-squared tests were additionally checked by simulation [26] using 10^4 replicates per test; although some of the larger p values altered appreciably, no test results changed in significance at the 95, 99 or 99.9 % levels of confidence.
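As an illustrative sketch only (the study's actual analysis scripts are not reproduced here), an equivalent check can be run in R on, for example, the Question 3 category counts tabulated in Appendix 1:

    # Question 3 category counts from Appendix 1 (230 reported causes in all)
    counts <- c(16, 9, 11, 4, 19, 36, 22, 19, 29, 6, 19, 29, 4, 7)

    # Asymptotic chi-squared test of equal expected frequencies across categories
    chisq.test(counts)

    # Monte Carlo check used where counts are low or zero: the p value is
    # estimated from B tables simulated under the null hypothesis
    chisq.test(counts, simulate.p.value = TRUE, B = 1e4)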
Study methodology

General methodology

A key issue in any study of performance issues is the quantity of useful data collected. In general, most participants perform at an acceptable level for their scheme (that is, 90–95 % of participants in an established scheme typically obtain 'acceptable' scores). Because relatively few 'poor' results are found in any one round, there is rarely sufficient information to correlate poor performance with the individual quality measures in place. This traditional model of performance study also requires laboratories to provide very detailed data even when they perform well. The present study avoided some of these problems by using software to request only the information specific to the causes of laboratories' most recent quality issues and the efficacy of the particular remedial action taken. Since most laboratories have experienced and remedied at least one PT problem (albeit not necessarily recently), most participants can in principle provide useful information. Further, laboratories need provide much less information per return, minimising the burden of questionnaire completion. This approach was expected to improve the likelihood of useful returns, and to allow a relatively simple assessment of the frequency of different problems and the effectiveness of remedies.

Questionnaire preparation and distribution

The questionnaire (given in full in Appendix 1) was intended for Web dissemination and rapid completion. It included a minimum of respondent identification and classification information, relying on the PT scheme identifier for analytical sector. Two principal questions, on the general cause of errors and on the general nature of corrective action, were included. Depending on the responses to these, respondents were presented with follow-up questions to provide additional detail on the nature of the problem, the particular corrective action and the efficacy of each corrective action. In addition, respondents were asked about the principal indication of the quality problem described. As far as possible, questions were kept to 'tick-box' responses to allow effective summary and analysis.

PT providers were informed of the questionnaire location and provided with a URL containing a short identifier for the specific scheme, to allow the general origin of responses to be tracked. Providers emailed invitations to their participants in October–November 2004. A subsequent invitation was sent in March 2005 (the timing reflects the various scheme organisers' scheme timetables). The data reported on here therefore cover responses obtained between October 2004 and April 2005.

Respondents

111 responses were logged during the study. The most frequent scheme identifiers were (with respondent counts) ANDANTE (Unichim, Italy): 12; Brewing Analytes Proficiency Scheme (LGC, UK): 11; Distillery Analytes Proficiency Scheme (LGC, UK): 20; Drinking-SO3 (SYKE, Finland): 17; FAPAS (FERA, UK): 23; and Intercal (EPA, Ireland): 10. These schemes are principally food and environment schemes. Country of origin information was not requested from participants. From the scheme ID information, 60 % of responses were associated with UK schemes. However, the scheme ID is only an indication of likely origin; PT schemes are often international. The non-UK contingent may therefore be significantly larger.
Results

The number of responses for each restricted-entry question (those with tick-box responses) is given with the questionnaire in Appendix 1. Summary charts for all questions are given in Appendix 2 (ESM), which also includes the text entered against an entry of 'other'. Appendix 3 (ESM) provides a full listing of the text entered for open questions. The responses fall into four main groups: identification of quality issues (Question 2); causes of error (Question 3 and follow-up Questions 4–16); corrective action (Question 17 and follow-up Questions 18–28); and open responses on other quality issues (the remaining questions). The responses in these categories are discussed in turn below, with a brief indication of the major issues arising in each category. Question numbers are included here to facilitate reference to the tabulated data and charts in Appendix 1 and Appendix 2 (ESM), respectively. The major issues are considered in the Discussion section following the presentation of results.
Identification of quality issues (Question 2)

The initial question related to the class of event which led to identification of a possible error. The predominant events relate to PT failures, as might be expected given that this study relied on requests for information on quality issues relating to poor PT scores. A relatively high proportion of participants reported that 'several' poor scores led to the identification of a problem. Though this might indicate a series of poor scores in separate rounds, it might also be attributable to the fact that many PT schemes provide more than one score for a single round, either because several samples are sent or because multiple analytes are reported. The relative frequencies of the remaining responses, which were significantly different, may be more representative of typical quality issue identification, as there is no reason that their frequencies should be influenced by participation in PT, or biased by the nature of the study. Perhaps the most interesting observation is that customer complaints figure in only 9 cases, compared to 23 for internal QC monitoring, 20 for internal audit and 14 for external audit. The implication is that laboratory QC and QA procedures are more effective than customer complaints for identifying quality issues.

Causes of error

Figure 1 shows the responses obtained for the question 'Which of the following categories best cover the problem?' (Question 3). This is the principal question the study was designed to answer. In all, 230 responses were logged by the 111 respondents, indicating that errors often had more than one assignable cause among the response categories provided. Overall, the differences in response counts for each category were very strongly significant.
[Fig. 1: bar chart of the number of instances reported (scale 0–40) for each problem category: Sample preparation; Human error; Equipment failure or servicing problem; Calibration; Reporting problem; Calculation error; Selection of measurement method; Test material problem; Sample transport and storage; Primary sampling; Laboratory environment; PT provider problem; Sample tracking; Other problem category.]

Fig. 1 The chart shows the number of responses for each problem area
Sample preparation, Equipment failure and Human error are the top three categories in terms of response, together accounting for 41 % of the responses. Calibration errors, method selection, calculation error, reporting issues and test material problems form a natural intermediate group; the remaining responses (sample transport and storage, sample tracking, primary sampling, laboratory environment and the half-dozen 'other' problems) form a low-response group comprising 41 responses, or 18 % of the total. PT provider problems were the joint least important cause of poor scores in PT, which PT providers should find reassuring.

In terms of relative importance in practice, it is important to note that ranking within, and to an extent between, these groups is not entirely safe. For example, while the sample preparation response count can safely be taken to indicate that sample preparation is more important in practice than any of the intermediate group, the same cannot be said with confidence for, say, the relative importance of human error and calibration problems. Thus, the overall picture is a good indication of relative importance at the extremes, but the exact ordering may differ in the general laboratory population. One important caveat is the low importance attributed to sampling. Because this study depended on poor performance in PT, and PT schemes provide homogenised samples, sampling is unlikely to appear as an issue. It would accordingly be unsafe to conclude from these results that sampling is not a problem area for analytical chemistry.

Following the summary question above, respondents were invited to provide more details on the general problem area they had identified. The complete list of questions and the response counts are given in Appendix 1, in the order of the questions. The responses are summarised briefly below in descending order of response count in Fig. 1, with details given for the most important categories.

Sample preparation (Question 9)

Sample preparation (54 responses from 36 respondents) showed strongly significant differences among possible causes. The two most prominent problems were sample dilution to volume (16 responses) and extraction/recovery problems (11 responses). Digestion and clean-up were less important, at 5 and 6 responses, respectively.

Equipment failure (Question 12)

Equipment failure showed one dominant source of problems: actual failure of the equipment, as opposed to servicing problems, set-up, etc. This response was the most frequent single cause of a problem in the study (19 responses).
Human error (Question 15)

Human error showed significant differences among response totals for the different sub-categories, with 'lack of training or experience' and transcription errors the two leading issues identified. Note that this category did not provide specifically for reporting on basic procedural errors attributable elsewhere (e.g. in sample preparation), and respondents consequently did not generally report such issues additionally under 'human error'.

Calibration problems (Question 10)

An approximately equal distribution of responses across the three highest-scoring issues (lack of reference material availability, incorrect calibration procedure and insufficient calibration range; 6–8 responses each), with lower counts for the other categories, meant that the differences in response in this category did not reach statistical significance. Four responses for 'reference material defective' among a total of 111 respondents might, however, be considered worrying for RM providers.

Method selection (Question 8)

Method selection showed a more or less even distribution across all available responses. 'Inadequate precision' was marginally in the lead, but the differences are not statistically significant.

Calculation error (Question 11)

The majority of the modest number of problems in this area were incorrectly entered values and spreadsheet problems caused by incorrectly entered formulae. The few 'other' problems in this category generally reinforced this conclusion. The quality of commercial software does not appear to be an issue.

Reporting problems (Question 14)

Strongly significant differences between detailed response counts allow the general conclusion that reporting in the wrong units (either correctly, but not as required by the customer, or incorrectly, as in quoting 'ppm' instead of 'ppb') is a leading cause of reporting problems, at least in PT, with transcription errors the other important contributor.

Test material problems (Question 4)

The responses were (in descending order) insufficient sample, unsuspected interference or matrix effect, and (half the
next largest response) material outside the method scope; the differences were, however, not statistically significant.

The remaining issues are in the 'low' response group. Few showed statistically significant differences within the groups; the exception was primary sampling, for which the absence of responses for 'cross-contamination in sampling' and 'incorrect sampling device' led to apparently significant differences in response counts. The leading responses in this group (12 in total) were 'unrepresentative sample', lack of a sampling protocol and sample stabilisation problems.

Reviewing the individual causes of error within categories, a small number stand out as particularly common. Equipment failure is the most common single problem reported (8 % of responses); dilution to volume in sample preparation is the next (7 %). Use of incorrect units in reporting, extraction problems, and transcription errors in data entry and reporting each account for ca. 5 % of reported causes.

Corrective action

The study sought information on the corrective actions used, their efficacy and the means of monitoring efficacy. The following paragraphs summarise the main findings. The general responses are summarised in the supplementary information (Appendix 2 (ESM), Q17ff); detailed responses on specific actions are listed, as received, in the supplementary information as Appendix 3 (ESM).

Corrective actions reported

The response counts for the various corrective actions are summarised in Fig. 2. The differences between response counts are very strongly significant. The twofold difference between the two leading actions (additional training and implementation of new procedures) and the next largest (revalidation, etc.) is also strongly significant. Ranking thereafter is increasingly unsafe as counts drop.

Detailed actions are listed in Appendix 3 (ESM). Reading these in detail provides an interesting insight into the wide variation of actions, but few corrective actions appear more than once or twice in any section. The 32 detailed responses for changes in procedure ranged from specific method amendments for particular sample types, through updated SOPs applicable to all samples, to general changes in training procedures (this last appearing under 'procedure change' rather than 'additional training'; allocation of corrective action to the categories in Fig. 2 is clearly somewhat subjective). Among the 28 detailed responses given for staff training, most showed a specific in-house intervention by more senior or more experienced staff.
[Fig. 2: bar chart of the number of instances reported (scale 0–50) for each corrective action: Additional staff training; New procedures implemented; New equipment obtained; Change to method documentation; Analytical method revalidated; Additional equipment calibration; Change of analytical method; Change of reference material source; Other.]

Fig. 2 The chart shows the number of responses for each corrective action
The two main conclusions are that the corrective actions listed were very specific to the particular issue identified and that (for training) external training courses were relatively infrequent (two of 28 responses).

One further conclusion emerges from the number of responses in this category. The 111 respondents indicated a total of 217 corrective actions; since each response concerned a single quality incident or issue, this can be taken to indicate that, on average, about two different types of intervention were used to resolve each issue.

Efficacy of corrective action

Respondents were asked to rate the effectiveness of each corrective action on the scale 'Ineffective/Moderately effective/Fully effective'. There were apparent slight differences in response counts, but statistical checks showed no significant differences in efficacy between different interventions. It is therefore not appropriate to single out particular interventions as more or less effective. Overall, only 54 % of responses indicated that the specific intervention was 'fully effective'; 36 % were 'moderately effective' and the remaining 10 % were marked 'ineffective'. Since this applies very generally across all corrective actions, it is most usefully interpreted as supporting the general need for multifactor intervention, that is, action on more than one front.

Assessing efficacy of corrective action

In this study, further PT results were the most commonly reported method of assessing the effectiveness of corrective action, followed by internal audit, internal QC monitoring and judgement, and (less frequently) management review and external audit. The apparent reliance on PT results for confirming efficacy of corrective action is clearly likely to be influenced by the study population. The relatively high use of internal audit and QC monitoring is to be expected if corrective action is to be assessed quickly; the relatively low incidence of management review and external audit as
confirmatory actions may also follow from the need for early monitoring. As for corrective action itself, the number of responses here (209) is much larger than the number of respondents, indicating that respondents often use more than one method to assess efficacy.

Other issues

Respondents were invited to provide further comment on the specific quality issue they were describing, on the survey itself and on analytical quality in general.

Further comment on the quality issue described (Q29)

Fourteen responses were logged and are given, as entered, in Appendix 2 (Q29) (ESM). Most were unique. Three indicated practical use of PT scheme information: one indicated that the scheme was at least sometimes used to test possible alternative methodologies; one showed that the laboratory had reviewed longer-term trends in its PT data; and another had carried out trend analysis for other participants to benchmark its own performance. These respondents, at least, are therefore using their PT schemes for substantially more than occasional quality checks.

Further comment relevant to the study

Four responses were made. Two commented on the survey itself: one noted that multiple responses would be possible from a single laboratory; another suggested that the questions might make a useful failure-investigation checklist. The remaining two comments were (i) a comment on a laboratory's problems with a particular scheme (interestingly, lamenting the lack of time to 'do the scheme justice', indicating a desire to score well in PT) and (ii) a comment on the difficulty presented by the lack of a published standard method.

General analytical quality issues

Five responses were received: two supported the need for regular internal QC, one noted a need for greater availability of suitable reference materials for food analysis and one supported the use of traceable ('true') assigned values for PT schemes. The fifth was a specific comment for the scheme organiser.
Discussion

The prevalence of 'dilution to volume' problems in sample preparation, coupled with the four responses for weighing difficulty, shows convincingly that failures in basic laboratory procedures are a relatively serious problem. This is apparently true even for laboratories sufficiently experienced to involve themselves in PT. Overall, basic errors in dilution and weighing accounted for about a third of sample preparation errors; extraction and digestion combined accounted for another third.

'Human error' was the second most important reported cause of poor PT scores in this survey, being listed as a contributory cause by 29 respondents. However, because of the general nature of the question, this category does not include all of the assignable causes of poor scores that could be attributed to human error. For example, the 'calculation error' category includes issues such as incorrect formulae entered in spreadsheets, which is clearly a human, rather than software, error. We therefore reviewed the available responses and categorised those which could reasonably be considered examples of human error during measurement in the laboratory. The responses included in this set are listed in Table 1. We then counted the respondents who had recorded one or more of these responses. A total of 49 respondents (44 % of the total) had reported one or more of these contributory causes. No other general category was reported by so many respondents; it follows that simple human errors of all kinds represented the most common cause of unacceptable PT scores in this survey, affecting 44 % of respondents.

These findings are broadly consistent with many aspects of previous studies in the clinical sector. Steindel et al. [22] identified instrument problems, dilution errors, transcription errors in reporting and incorrect calibration, together with 'general or environmental' problems, as responsible for about 67 % of reason codes in a study of 2544 unacceptable results. Importantly, these authors noted that reporting errors can often be associated with the scheme requirements, which may differ from the laboratory's customer requirements for routine reports. Jenny and Jackson-Tarentino [23] similarly noted instrument failure and transcription error onto PT report forms as leading causes of problems in 106 error reports (from a total of 41 212 PT test events in toxicology, indicating a very low error rate); they additionally noted that calibration drift was cited by laboratories as a common cause of error.

It is useful to consider, here, how such errors might be avoided. Clearly, staff training must play some part, and the responses for corrective actions show clearly that additional training was the most frequently cited corrective action. The need for revalidation of methods also points to a well-established principle of good analytical measurement: the use of properly validated measurement methods. But one avenue that is not well reflected here is the use of internal quality control (QC), as documented, for example, by IUPAC for analytical chemistry [27]. Some respondents
in this study had identified problems via QC failures even though the study was largely aimed at those experiencing PT problems, and it is noteworthy that more problems were identified by internal QC and internal audit than by customer feedback. In principle, however, effective QC using relevant QC materials should catch the majority of automated calculation errors, extraction problems, calibration errors and method selection problems before results leave the laboratory. This may account for the two additional remarks on general analytical quality issues noting the need for improved QC. There is, therefore, perhaps a case for reiterating the need for effective internal QC as a means of reducing the incidence of poor PT results. Of course, problems found via a poor PT score must, almost by definition, be those issues that are not detected by internal QC, and this study cannot determine the fraction of problems that are caught by internal QC measures. However, it would be of interest in future studies to include additional questions on the level of QC in place and to allow 'improved QC measures' as a specific category of corrective action, to allow further exploration of the need for improvements in internal QC.

It is also useful to consider the possible causes that were not reported by many respondents. Among the most widely discussed issues in metrology in chemistry is the establishment of traceability of chemical measurement results to appropriate chemical reference values. It might therefore be expected that this would appear as a major cause of error in PT schemes. However, although 'calibration' problems affected 22 respondents, closer investigation shows that only eight of these were caused by unavailability of a reference material, four were attributed to a defective reference material and two to 'uncalibrated conditions of measurement'. Note, here, that the question did not distinguish between calibration materials and reference materials used for QC or validation, so it is not certain that the twelve reference material issues were all calibration material issues. All other responses in this category related to procedural or test material issues. Absence of traceability (in the restricted sense of lack of a complete calibration chain) was therefore a comparatively rare cause of poor results in this survey population. Of course, this does not mean that continued attention to traceability is unimportant; clearly, proper attention to traceability is a large part of the reason for the large number of acceptable results in PT. Further, improving traceability infrastructure may be necessary to improve the dispersion of acceptable results, if that were desired. But the finding does suggest that effort to improve the availability of reference materials, or changes to the way in which such materials are value-assigned, may have limited effect on the incidence of clearly unacceptable results in food and environmental PT with current performance criteria.

Another commonly cited concern is the reliability of software. Software errors were reported by respondents, but none arose from defects in commercial software; rather, all software-related errors arose from errors in user input.

Turning to the utility of web-based survey methods for the collection of data on causes of error, it is clear from this study that quite detailed information can be obtained with a comparatively low burden on respondents. Since initial presentation of summary data is also automated, initial review also takes comparatively little time. We note, however, that as in any voluntary-response study it is hard to control for selection bias. For example, it is conceivable that some particular classes of quality failure might be under-reported, either because of the nature of the population (sampling errors are unlikely in a PT population, as noted above) or for other reasons such as a fear of adverse legal or commercial consequences. The results of any voluntary survey therefore need to be treated as indicative and interpreted with caution unless additional measures are taken to monitor such sources of bias. Nonetheless, the present study has provided an interesting perspective on at least perceived causes of error, and we accordingly conclude that web-based survey methods provide a useful additional tool for analytical quality improvement.
Table 1 Responses classed as human error (question and response wordings abbreviated)

  Question                      Responses
  Problem category              Human error
  Sample tracking               Sample misidentified on receipt; Results attributed to incorrect sample
  Calibration                   Incorrect calibration procedure
  Calculation error             Spreadsheet user error; Pocket calculator error; Arithmetic error in hand calculation; Incorrect values entered in calculation; Software incorrectly applied
  Equipment failure/servicing   Instrument wrongly set up
  Reporting problem             Value correct but not in required units; Wrong units reported; Transcription/typographical error; Interpretation incorrect
  Human error                   Lack of training or experience; Transcription error; Instrument reading error; Reporting instructions not followed; Arithmetic error; Interpretation error; Other human error
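As a minimal sketch of the tally described in the Discussion (using invented example data, not the survey responses, and an abbreviated stand-in for the Table 1 set), counting respondents who recorded one or more of the responses classed as human error could be done along the following lines in R:

    # Invented example data: one row per (respondent, response) pair
    responses <- data.frame(
      respondent = c(1, 1, 2, 3, 3, 4),
      response   = c("Transcription error", "Equipment failure",
                     "Reference material unavailable", "Wrong units reported",
                     "Clean-up", "Extraction/recovery")
    )

    # Abbreviated stand-in for the full set of responses listed in Table 1
    human_error_set <- c("Lack of training or experience", "Transcription error",
                         "Incorrect values entered in calculation",
                         "Wrong units reported")

    # Respondents recording at least one response classed as human error
    flagged <- unique(responses$respondent[responses$response %in% human_error_set])
    length(flagged)                                                # count of respondents
    100 * length(flagged) / length(unique(responses$respondent))   # as a percentage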
Conclusions

A web-based data collection methodology has been used successfully to collect 111 responses, including 230 individual causes of error, from participants in nine PT schemes and from two additional sources of contacts. The web-based format simplified data entry for respondents, and the focus on recent unacceptable scores minimised the burden on the target population. Thus, although we note that the question list used here could usefully include additional exploration of internal quality control implementation, a web-based follow-up survey appears to be a useful tool for improving analytical quality.

Turning to the results of this survey, the respondents were principally in environment and food analysis schemes. Most respondents had, as might be expected given the survey population, identified quality issues as a result of a poor PT score. Internal audit and QC monitoring were the most important additional indicators (39 % in total); external audit and customer complaints were less common (13 % and 8 %, respectively).

Causes reported as Sample preparation, Equipment failure and Human error were the top three cited causes of poor analytical results by a significant margin, accounting for 41 % of all reported problems. Within sample preparation, errors in dilution to volume and extraction problems were the most commonly reported problems. Actual failures of equipment were the principal cause in the 'equipment failure' category and, indeed, the most frequently reported problem in any category. Specific details under 'human error' listed lack of training and experience as the most important cause, with transcription problems second. In general, simple mistakes such as incorrect transcription, miscalculation, formula errors in spreadsheets, incorrect dilutions, etc. were implicated by 40 % of responses. Failures in traceability infrastructure and flaws in commercial software were rarely or never implicated.

Corrective actions were generally specific to the particular problem identified. The majority included additional staff training and/or implementation of new procedures. New equipment, revalidation and changes to documentation were about equally common, with changes in methodology or reference material source less common. No single method of corrective action proved particularly effective; on average, 54 % of individual actions were considered 'fully effective'.

Acknowledgments Preparation of this paper was supported by the UK National Measurement System Chemical and Biological Metrology Programme. The authors are additionally grateful for the assistance of the organisers of the FAPAS PT scheme (York, UK), the ANDANTE PT scheme (Unichim, Italy), the Intercal scheme (EPA, Ireland), SYKE (Finland), rconcept (Germany) and PT scheme staff at LGC Standards (BAPS, DAPS, CONTEST, SODAS), all of whom kindly arranged to email requests directly to scheme participants, and to Professor J Firth for advice during the study.
Appendix 1: Questions and response counts

Order of presentation of questions

All questions are shown in the list below, including two (Q1 and Q33) which were completed automatically from the originating URL. Questions appear in the order presented to the respondent. The indented list under each question gives the possible responses, selected by check-box. Where a question is followed by the symbol => Qn, question n was the next question presented (=> without a number indicates fall-through to the next question in the list). Where a response (list item) is followed by => Qn, question n was presented subsequently only if the respondent gave that particular response. This is particularly important for Questions 3 and 17: follow-up Questions 4–16 (for Q3) and 18–26 (for Q17) were asked only for positive responses in Q3 and Q17, so respondents were not asked for responses on irrelevant follow-up questions.

The questions

Questions are numbered, with possible responses listed beneath. The count for each response shows the number of responses and, in parentheses, the percentage of responses to the particular question. Totals for each question show the total number of responses to the question and, for the cause-of-error questions (Question 3 and its follow-up questions), the question total as a percentage of all causes of error reported (230). Response counts marked 'ESM' indicate that text responses are listed in Appendices 2 and 3 (electronic supplementary material). The result of a chi-squared test for significant differences in response counts is also included (as a p value), together with a visual indication of significance; additional details of the statistical analysis are given in the Experimental section. The indication scale used is: p > 0.05: no indication; 0.05 > p > 0.01: '*'; 0.01 > p > 0.001: '**'; p < 0.001: '***'. There were no p values in the range 0.05–0.10, which is sometimes considered 'marginally significant'.
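For reference, equivalent star coding can be produced with base R's symnum() utility; a brief sketch using arbitrary example p values (not the study results):

    p <- c(0.35, 0.025, 0.0012, 0.0005)
    symnum(p, corr = FALSE, na = FALSE,
           cutpoints = c(0, 0.001, 0.01, 0.05, 1),
           symbols   = c("***", "**", "*", " "))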
Q1. Response number [automatically completed from the originating URL]: 110

Q2. What alerted you to this measurement quality issue?
  A single poor score in a PT scheme: 44 (29.0 %)
  Several poor PT scores: 37 (24.3 %)
  Customer complaint: 9 (5.9 %)
  Internal QC problem: 23 (15.1 %)
  Internal audit or inspection: 20 (13.2 %)
  External/third party audit or inspection (e.g. assessor visit): 14 (9.2 %)
  Other: 5 (3.3 %)
  Total: 152
  Chi-squared test p value and significance: <0.001 ***

Q3. Which of the following categories best cover the problem? (check all that apply)
  Test material problem => Q4: 16 (7.0 %)
  Primary sampling => Q5: 9 (3.9 %)
  Sample transport and storage => Q6: 11 (4.8 %)
  Sample tracking (labelling, chain of custody) => Q7: 4 (1.7 %)
  Selection of measurement method => Q8: 19 (8.3 %)
  Sample preparation (weighing, drying, extraction, digestion, clean-up, dilution, etc.) => Q9: 36 (15.7 %)
  Calibration => Q10: 22 (9.6 %)
  Calculation error => Q11: 19 (8.3 %)
  Equipment failure or servicing problem => Q12: 29 (12.6 %)
  Laboratory environment (climate control, cross-contamination etc.) => Q13: 6 (2.6 %)
  Reporting problem (format, units, detection, interpretation) => Q14: 19 (8.3 %)
  Human error => Q15: 29 (12.6 %)
  Proficiency scheme provider problem => Q16: 4 (1.7 %)
  Other problem category: 7 (3.0 %)
  Total: 230 (100 %)
  Chi-squared test p value and significance: <0.001 ***

Q4. Test material problem: which of the following were contributory causes? (check all that apply) => Q17
  Test material outside method scope: 4 (16.7 %)
  Unsuspected matrix or other interference present: 8 (33.3 %)
  Insufficient sample available: 9 (37.5 %)
  Other: 3 (12.5 %)
  Total: 24 (10.4 %)
  Chi-squared test p value and significance: 0.23

Q5. Primary sampling: which of the following were contributory causes? (check all that apply) => Q17
  Unrepresentative sample: 5 (38.5 %)
  Failed to follow sampling protocol, or no protocol in place: 3 (23.1 %)
  Sample stabilisation problem: 4 (30.8 %)
  Cross-contamination during sampling: 0 (0 %)
  Incorrect container: 1 (7.7 %)
  Incorrect sampling device: 0 (0 %)
  Other: 0 (0 %)
  Total: 13 (5.6 %)
  Chi-squared test p value and significance: 0.025 *

Q6. Sample transport and storage: which of the following were contributory causes? (check all that apply) => Q17
  Transport: conditions inappropriate: 5 (23.8 %)
  Transport: container inappropriate: 2 (9.5 %)
  Transport: time too long: 3 (14.3 %)
  Storage: time too long: 2 (9.5 %)
  Storage: temperature incorrect: 5 (23.8 %)
  Storage: container leakage: 4 (19.1 %)
  Other: 0 (0 %)
  Total: 21 (9.1 %)
  Chi-squared test p value and significance: 0.35

Q7. Sample tracking problem (labelling, chain of custody): which of the following were contributory causes? (check all that apply) => Q17
  Sample mislabelled prior to receipt: 1 (20.0 %)
  Sample misidentified on receipt: 1 (20.0 %)
  Results attributed to incorrect sample (samples mixed up): 3 (60.0 %)
  Other: 0 (0 %)
  Total: 5 (2.2 %)
  Chi-squared test p value and significance: 0.28

Q8. Selection of measurement method: which of the following were contributory causes? (check all that apply) => Q17
  Material or application outside stated scope of chosen method: 4 (12.5 %)
  Inadequate detection limit: 5 (15.6 %)
  Inadequate precision available from method: 7 (21.9 %)
  Method has unacceptably large error range: 5 (15.6 %)
  Differences in performance between standard methods: 4 (12.5 %)
  Measurement cost too high: 2 (6.3 %)
  Other: 5 (15.6 %)
  Total: 32 (13.9 %)
  Chi-squared test p value and significance: 0.81

Q9. Sample preparation (weighing, drying, extraction, digestion, clean-up, dilution etc.): which of the following were contributory causes? (check all that apply) => Q17
  Weighing: 4 (7.4 %)
  Drying: 2 (3.7 %)
  Milling/grinding: 2 (3.7 %)
  Extraction/recovery: 11 (20.4 %)
  Sample digestion: 5 (9.3 %)
  Clean-up: 6 (11.1 %)
  Sample dilution to volume: 16 (29.6 %)
  Other: 8 (14.8 %)
  Total: 54 (23.5 %)
  Chi-squared test p value and significance: 0.0012 **

Q10. Calibration: which of the following were contributory causes? (check all that apply) => Q17
  Uncalibrated timing, temperature or other conditions of measurement: 2 (6.1 %)
  Reference material unavailable: 8 (24.2 %)
  Reference material defective: 4 (12.1 %)
  Incorrect calibration procedure (e.g. single point vs. line, weighted/unweighted, incorrect curve shape): 8 (24.2 %)
  Insufficient calibration range: 6 (18.2 %)
  Inadequate matrix match: 2 (6.1 %)
  Other calibration problem: 3 (9.1 %)
  Total: 33 (14.3 %)
  Chi-squared test p value and significance: 0.19

Q11. Calculation error (excluding calibration procedure): which of the following were contributory causes? (check all that apply) => Q17
  Commercial instrument software error: 0 (0 %)
  Spreadsheet problem caused by spreadsheet software: 1 (3.3 %)
  Spreadsheet problem caused by user-entered formula: 7 (23.3 %)
  Pocket calculator error: 3 (10.0 %)
  Arithmetic error in hand calculation: 4 (13.3 %)
  Incorrect values entered in calculation: 10 (33.3 %)
  Software incorrectly applied: 1 (3.3 %)
  Other calculation error: 4 (13.3 %)
  Total: 30 (13.0 %)
  Chi-squared test p value and significance: 0.003 **

Q12. Equipment failure or servicing problem: which of the following were contributory causes? (check all that apply) => Q17
  Instrument wrongly set up (e.g. GC column temperature): 2 (6.5 %)
  Equipment outside service interval: 2 (6.5 %)
  Equipment failure: 19 (61.3 %)
  Recent servicing altered readings/performance: 5 (16.1 %)
  Other: 3 (9.7 %)
  Total: 31 (13.5 %)
  Chi-squared test p value and significance: <0.001 ***

Q13. Laboratory environment (climate control, cross-contamination etc.): which of the following were contributory causes? (check all that apply) => Q17
  Ambient conditions inadequately controlled: 3 (50.0 %)
  Cross-contamination within laboratory: 3 (50.0 %)
  Other: 0 (0 %)
  Total: 6 (2.6 %)
  Chi-squared test p value and significance: 0.22

Q14. Reporting problem (format, units, detection, interpretation): which of the following were contributory causes? (check all that apply) => Q17
  Value reported correctly but not in customer's/PT provider's units (e.g. reported 86 mg/100 ml instead of 860 mg/l): 12 (29.3 %)
  Units incorrectly stated (e.g. reported 26 ppm instead of 26 ppb): 7 (17.1 %)
  Transcription/typographical error in value: 11 (26.8 %)
  Reported as 'not detected' or 'less than xx' when numerical value requested: 4 (9.8 %)
  Measurement result correct but interpretation incorrect: 3 (7.3 %)
  Measurement uncertainty statement did not meet customer/PT provider requirement: 3 (7.3 %)
  Other: 1 (2.4 %)
  Total: 41 (17.8 %)
  Chi-squared test p value and significance: 0.005 **

Q15. Human error: which of the following were contributory causes? (check all that apply) => Q17
  Lack of training or experience: 13 (26.5 %)
  Transcription error: 12 (24.5 %)
  Instrument reading error: 4 (8.2 %)
  Reporting instructions misunderstood/incorrectly followed: 7 (14.3 %)
  Arithmetic error: 5 (10.2 %)
  Interpretation error: 4 (8.2 %)
  Other human error: 4 (8.2 %)
  Total: 49 (21.3 %)
  Chi-squared test p value and significance: 0.041 *

Q16. Proficiency scheme provider problem: which of the following were contributory causes? (check all that apply) => Q17
  Sample incorrectly packed/transported: 1 (20.0 %)
  Labelling error: 0 (0 %)
  Results attributed to wrong laboratory or sample: 1 (20.0 %)
  Assigned value incorrect: 0 (0 %)
  Treatment of detection limit data incorrect: 1 (20.0 %)
  Distribution of data poor: 0 (0 %)
  Scores did not take account of differences between standard methods or different instruments: 2 (40.0 %)
  Other: 0 (0 %)
  Total: 5 (2 %)
  Chi-squared test p value and significance: 0.52

Q17. Indicate the corrective action taken (free-text details: ESM)
  Additional staff training => Q18: 48 (22.1 %)
  Analytical method revalidated => Q19: 22 (10.1 %)
  Change to method documentation => Q20: 22 (10.1 %)
  New procedures implemented => Q21: 47 (21.7 %)
  Additional equipment calibration => Q22: 19 (8.8 %)
  Change of reference material source => Q23: 9 (4.2 %)
  Change of analytical method => Q24: 15 (6.9 %)
  New equipment obtained => Q25: 22 (10.1 %)
  Other => Q26: 13 (6.0 %)
  Total: 217
  Chi-squared test p value and significance: <0.001 ***

Q18. Additional staff training: if you wish, please describe briefly what was done. => Q27. Responses: ESM

Q19. Analytical method revalidated: if you wish, please add brief details or comment on any relevant results from the revalidation exercise. => Q27. Responses: ESM

Q20. Change to documentation: if you wish, please describe briefly what was done. => Q27. Responses: ESM

Q21. New procedures implemented: if you wish, please describe briefly what was done. => Q27. Responses: ESM

Q22. Additional equipment calibration: if you wish, please add brief details of what was done. => Q27. Responses: ESM

Q23. Change of reference material source: if you wish, please identify the old and new reference materials. => Q27. Responses: ESM

Q24. Change of analytical method: if you wish, please describe briefly what changes were made. => Q27. Responses: ESM

Q25. New equipment obtained: if you wish, please give brief details of the type of equipment, any significant differences between the new and old equipment and/or comment on the reasons for the choice. => Q27. Responses: ESM

Q26. You indicated that 'other' action was taken: if you wish, please give further details. => Q27. Responses: ESM
Q27. Where taken, how effective was the corrective action? (mark any that apply)
Each action taken allowed a response on the scale Ineffective/Moderately effective/Fully effective (shown here as 0, M, F, respectively).

  Corrective action                                                   0    M    F
  Additional staff training                                           5   19   36
  Analytical method revalidated                                       2   12   13
  Change to method documentation                                      3   10   17
  New procedures implemented                                          2   19   30
  Additional equipment calibration                                    4    7   12
  Change of reference material source                                 4    6    8
  Change of analytical method                                         3    8   15
  New equipment obtained                                              2   10   14
  'Other' action (mark only if you have specified this previously)    2    6    4

  Chi-squared test p value and significance (two-way table, test for differences in proportion among responses): 0.80
Q28. Indicate how the effectiveness of the corrective action taken was assessed
  Internal audit: 40 (19.1 %)
  External audit: 12 (5.7 %)
  Further PT results: 62 (29.7 %)
  Monitoring internal QC materials: 38 (18.2 %)
  Management review: 19 (9.1 %)
  Analysts' judgement: 31 (14.8 %)
  Other: 7 (3.4 %)
  Total: 209
  Chi-squared test p value and significance: <0.001 ***

Q29. If you wish, please add any further comments you consider relevant to the quality issue you have described in the preceding questions. Responses: ESM

Q30. If you wish, please add any further comments you consider relevant to this survey. Responses: ESM

Q31. Please add any comments you wish about analytical quality issues in general. Responses: ESM

Q32. If you would like to receive a copy or notification of the final report (electronic copy only), please check 'Yes' and provide an email address for notification. Responses: N/A

Q33. Scheme ID [deduced from originating URL; see Appendix 2 (ESM) for breakdown]. Responses: N/A
References

1. Thompson M, Ellison SLR, Wood R (2006) The international harmonized protocol for the proficiency testing of analytical chemistry laboratories (IUPAC Technical Report). Pure Appl Chem 78:145–196
2. ISO/IEC 17043:2010 Conformity assessment – general requirements for proficiency testing. International Organization for Standardization, Geneva (2010)
3. Gaunt W, Whetton M (2009) Accred Qual Assur 14:449–454
4. Whetton M, Finch H (2009) Accred Qual Assur 14:445–448
5. Tholen DW (2002) Accred Qual Assur 7:146–152
6. Taylor RN, Fulford KM (1981) J Clin Microbiol 13:356–368
7. Shinton NK, England JM, Kennedy DA (1982) J Clin Pathol 35:1095–1102
8. Dawson DW, Fish DI, Frew ID, Roome T, Tilston I (1987) J Clin Pathol 40:393–397
9. Boley N (1998) Accred Qual Assur 3:459–461
10. Poller L (1989) J Clin Pathol 42:1–3
11. Lowthian P, Thompson M, Wood R (1996) Analyst 121:977–982
12. Thompson M, Owen L, Wilkinson K, Wood R, Damant A (2002) Analyst 127:1666–1668
13. Thompson M, Owen L, Wilkinson K, Wood R, Damant A (2004) Meat Sci 68:631–634
14. Analytical Methods Committee (2010) Accred Qual Assur 15:73–79
15. Thompson M, Lowthian P (1993) Analyst 118:1495–1500
16. Morris A, Macey D (2004) Accred Qual Assur 9:52–54
17. Thompson M, Mathieson K, Owen L, Damant AP, Wood R (2009) Accred Qual Assur 14:67–71
18. King B, Boley N, Kannan G (1998) Accred Qual Assur 4:280–291
19. Ellison SLR, Matheson K (2008) Accred Qual Assur 13:231–238
20. ISO/IEC 17025:2005 General requirements for the competence of testing and calibration laboratories. International Organization for Standardization, Geneva
21. Belli M, Brookman B, de la Calle B, James V, Koch M, Majcen N, Menditto A, Noblett T, Perissi R, van Putten K, Robouch P, Slapokas T, Taylor P, Tholen D, Thomas A, Tylee B (2009) Accred Qual Assur 14:507–512
22. Steindel SJ, Howanitz PJ, Renner SW (1996) Arch Pathol Lab Med 120:1094–1101
23. Jenny RW, Jackson-Tarentino KY (2000) Clin Chem 46:89–99
24. Hertzberg MS, Mammen J, McCraw A, Nair SC, Srivastava A (2006) Haemophilia 12(Suppl 3):61–67
25. R Development Core Team (2011) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org/
26. Hope ACA (1968) J R Stat Soc B 30:582–598
27. Thompson M, Wood R (1995) Pure Appl Chem 67:649–666