“Degrees of equivalence” for chemical measurement capabilities: primary pH

Accreditation and Quality Assurance, Sep 2014

The key comparison (KC) studies of the Consultative Committee for Amount of Substance—Metrology in Chemistry help ensure the reliability of chemical and biochemical measurements relevant to international trade and environmental-, health-, and safety-related decision making. The traditional final evaluation of each measurement result reported by a KC participant is a “degree of equivalence” (DEq) that quantitatively specifies how consistent each individual result is relative to a reference value. Recognizing the impossibility of conducting separate KCs for all important analytes in all important sample matrices at all important analyte levels, emphasis is now shifting to documenting broadly applicable critical or “core” measurement competencies elicited through a series of studies. To better accomplish the necessary synthesis of results, data analysis and display tools must be developed for objectively and quantitatively combining individual DEqs. The information detailed in the 11 KCs of primary method pH measurements publically available as of 2013 provides an excellent “best case” prototype for such analysis. We here propose tools that enable documenting the expected primary pH measurement performance of individual participants between pH 1 and pH 11 and from 15 °C to 37 °C. These tools may prove useful for other areas where the uncertainty of measurement is a predictable function of the measured quantity, such as the stable gases. That results for relatively simple measurement processes can be combined using relatively simple analysis and display methods does not ensure that similarly meaningful summaries can be devised for less well understood and controlled systems, but it provides the incentive to attempt to do so.

https://link.springer.com/content/pdf/10.1007%2Fs00769-014-1076-1.pdf

David L. Duewer, Kenneth W. Pratt, Chainarong Cherdchu, Nongluck Tangpaisarnkul, Akiharu Hioki, Masaki Ohata, Petra Spitzer, Michal Mariassy, Leos Vyskocil

A. Hioki, M. Ohata: National Metrology Institute of Japan (NMIJ), 3-9 Tsukuba Central, 1-1-1, Umezono, Tsukuba, Ibaraki 305-8563, Japan
C. Cherdchu, N. Tangpaisarnkul: National Institute of Metrology (Thailand) (NIMT), 3/4-5 Moo 3, Klong 5, Klong Luang, Pathum Thani 12120, Thailand
D. L. Duewer (corresponding author), K. W. Pratt: National Institute of Standards and Technology (NIST), 100 Bureau Drive, Gaithersburg, MD 20899-8390, USA
M. Mariassy, L. Vyskocil: Slovensky metrologicky ustav (SMU), Karloveska 63, 842 55 Bratislava, Slovakia
P. Spitzer: Physikalisch-Technische Bundesanstalt (PTB), Bundesallee 100, 38116 Brunswick, Germany

Abbreviations and symbols

APMP: Asia Pacific Metrology Programme
CCQM: Consultative Committee for Amount of Substance – Metrology in Chemistry
CIPM: Comité International des Poids et Mesures
DEq: Degrees of equivalence
DL: DerSimonian-Laird
EAWG: Electrochemical Analysis Working Group
ESM: Electronic supplementary material
EUROMET: European Collaboration in Measurement Standards
GD: Graybill-Deal
MAD: Median absolute deviation of a set of values from their median value
MAX: Maximum value of a set of values
MEDIAN: Median value of a set of values
N(μ, σ²): Normal (Gaussian) distribution having mean μ and standard deviation σ
PTILE(p, d_MC): The p percentile of the set of all d_MC values
Σ: Summation of a series of values
∪: Union of two or more sets of values
d: DEq for a single reported result for a specific NMI for a specific buffer
PBMC estimate of d
D: Combination of available d over temperature for a specific NMI for a specific buffer
Combination of available d over temperature and buffers for a specific NMI
k95: Coverage factor providing a 95 % level of confidence coverage interval
MC: Subscript designating a relationship to PBMC analysis
n: Number of x in a given set
N_MC: Number of PBMC samplings of a complete set of data
N: Number of temperature-specific d available for estimating D or the number of buffer-specific D available for estimating the combination over temperature and buffers
p: Probability expressed as a percentage (i.e., on the range 0 to 100)
pa0: The acidity function at zero added chloride
ρ: Correlation between two quantities
s: Standard deviation
Graybill-Deal weighted standard deviation (also called “external consistency”)
S: Subscript designating a successor KC
Subscript designating a particular result in a series of evaluation temperatures
Number of evaluation temperatures for a given buffer
u_GD: Standard uncertainty estimated as a GD weighted standard deviation
u_MAD: Standard uncertainty estimated from the MAD
Standard uncertainty estimated from s and u
u: Standard uncertainty
ū: Pooled value of a set of u; i.e., the square root of the mean of the squared u values
U95: One-half of a 95 % level of confidence symmetric coverage interval
Lower bound of a 95 % level of confidence asymmetric coverage interval
Upper bound of a 95 % level of confidence asymmetric coverage interval
V_KC: Reference value for the root KC
V_R: Reference value estimated from anchor participant results in the root KC
V_S: Reference value estimated from anchor participant results in successor KCs
x: Reported value
x_DL: DerSimonian-Laird weighted mean of a set of x
x_GD: Graybill-Deal weighted mean of a set of x
x_mean: Arithmetic mean of a set of x
x_median: Median of a set of x
x_adj: Value reported in a successor study re-centered onto the reference value of a given earlier study

The Comité International des Poids et Mesures (CIPM) is responsible for the conduct of international key comparison (KC) studies that enable national metrology institutes (NMIs) and related organizations to document measurement capabilities relevant to international trade and environmental-, health-, and safety-related decision making. The technical supplement to the 1999 Mutual Recognition Arrangement (CIPM MRA) [1] establishes the process by which NMIs demonstrate the degree of equivalence (DEq) of national measurement standards.
The CIPM MRA states that (1) KCs lead to reference values, (2) a key comparison reference value (KCRV) is expected to be a good indicator of an international system of units (SI) value, (3) DEqs refer to the degree to which a national measurement standard is consistent with the KCRV, and (4) DEqs for measurement standards are expressed quantitatively by the deviation from the KCRV and the uncertainty of this deviation at a 95 % level of confidence.

The Working Groups of the Consultative Committee for Amount of Substance – Metrology in Chemistry (CCQM) are responsible for selecting and overseeing the operation of KCs that address chemical (and biochemical) measurements. Few such measurements directly realize an SI unit: a mole of one chemical analyte may have no physicochemical properties in common with a mole of another beyond containing the same number of entities. Further, with a few exceptions such as atmospheric ozone [2], the higher order chemically related measurements made by an NMI do not reflect national measurement standards but rather the organization's measurement capabilities at a given time. However, until recently most CCQM-sponsored KCs have attempted to keep as closely as possible to the philosophy of the CIPM MRA as described above by estimating a separate DEq for each reported result in each KC.

Recognizing the impossibility of conducting separate KCs for all important chemically related analytes in all important sample matrices (and the ever-increasing resource burdens placed on the world's NMIs by attempting to address even a tiny subset of these measurands), several of the Working Groups within the CCQM are now using KCs to evaluate a series of critical or “core” measurement competencies.

Table 1 pH-related key comparisons (column headings: results reported as; number of results used in KCRV or RV; original estimators)
While continuing to provide DEqs for the results reported in individual KCs, the overall assessment of an NMI's measurement capabilities may require combining DEqs for several different measurands that may be estimated in different KCs and at separate times.

The KCs conducted by the CCQM Electrochemical Analysis Working Group (EAWG) and two regional metrology organizations (RMOs) on primary pH-related measurements are an excellent, and prescient, model for such studies. Initiated in 1999, to date results are publicly available for 11 KCs involving five buffer systems, with all but one of these systems characterized at 15 °C, 25 °C, and 37 °C (see Table 1). While individual NMIs routinely, if informally, assess their primary pH measurement capabilities by qualitative comparison of the various DEqs for different temperatures and buffers, no formal mechanism currently exists for quantitatively summarizing such results.

We here propose quantitative data analysis methods for combining individual DEqs from multiple KCs to estimate an NMI's measurement capabilities for particular measurement areas. We will show that the various primary pH measurements can be combined to document the expected measurement performance for primary pH measurements from pH 1 to pH 11 and from 15 °C to 37 °C. These data analysis methods represent a first step in the development of tools for assessing NMI measurement capabilities from less coherent evidence.

The data used in this study are the results of primary method pH measurements as provided in the published Final Reports [3–13] of the KCs listed in Table 1.
All of the primary pH measurement data given in these reports are listed in Tables S1.a to S5.a of the electronic supplementary material (ESM), with the exception of values that (1) were identified in the KC's final report as technically flawed and as such were excluded from the reference value (RV) estimation process for that KC and (2) are not the most recent primary pH measurement in that buffer system for the NMI that submitted the excluded result. Table 2 lists the number of DEq estimates available for each NMI for each buffer system. As the focus of this report is the process of combining results rather than particular outcomes for these data, each NMI is designated by a single-letter alphabetical code.

The 11 KCs considered include five “root” comparisons of pH measurements made in different buffer systems: CCQM-K9 (phosphate), CCQM-K17 (phthalate), CCQM-K18 (carbonate), CCQM-K19 (borate), and CCQM-K20 (tetroxalate). These root KCs were activities of the EAWG. The remaining studies, formally differentiated as “Subsequent KCs” and “Regional KCs” but here referred to as “successor” KCs, are each linked to one or another of the roots through the use of in-common measurement protocols and qualitatively similar buffer solutions. The four successor studies CCQM-K9.1, -K9.2, -K18.1, and -K19.1 were activities of the EAWG; the integer part of the label designates the root KC and the decimal designates the temporal order of the successor KC relative to its root. The APMP.QM-K9 and EUROMET.QM-K17 (also termed EUROMET Project 696) KCs were activities of the Asia Pacific Metrology Programme and the European Collaboration in Measurement Standards RMOs, respectively, both in collaboration with the EAWG.
All of the successor studies were designed to enable additional NMIs to demonstrate newly acquired pH measurement capabilities and/or to allow participants in earlier studies to document improved capabilities.

Table 2 Participation history: number of DEq estimates for each NMI in each buffer system. Each participating NMI is identified by a single alphanumeric character unique to that NMI.

The KCs examined in this study, all with completion dates ranging from 1999 to 2010, constitute the initial cycle of primary pH KCs. The recently completed CCQM-K91 (phthalate) [14] is the first KC of the second cycle and is not included in this study. CCQM-K91 and the other pH studies currently in progress or planned are designed as fresh root comparisons rather than maintaining linkages to the earlier studies.

Primary method pH measurements

All of the data considered here are the primary pH measurements reported by KC participants for a buffer solution prepared and distributed by the coordinator of each KC. The direct result of the primary measurement itself is pa0, the acidity function at zero added chloride. Depending on the KC design, pa0 determinations were made at one or more specified temperatures.

The metrological basis for the primary measurement of pH is discussed in detail elsewhere [15–17]. In essence, the pa0 is a function of the potential of a specified type of electrochemical cell, commonly referred to as the Harned cell. The pH is obtained from pa0 by adding a constant term, defined by the Bates-Guggenheim convention, specific for a given buffer and temperature [15, 18]. Since the value of this term is invariant among the participants of each KC, all measurement-specific factors that affect the pa0 affect the corresponding pH values (as well as any KCRV calculated from them) to the same extent. The uncertainty [15] of the Bates-Guggenheim convention is excluded from the reported uncertainties for the pH KCs.
This exclusion avoids inflating the reported uncertainties for the pH KCs and ensures that the reported uncertainties relate to the measurement capabilities per se of the participants.

Measurements for the carbonate, borate, and tetroxalate buffer KCs are recorded in the Final Reports as the reported pa0 values. Measurement results for some of the phosphate and phthalate buffer system KCs were recorded as pH values. We consider the recorded values for all of these KCs as being of the same kind: primary pH. Note that primary pH is a procedurally defined kind-of-quantity [19]. Since primary pH cannot be determined except through the measurement process itself, the KCRV for a primary pH KC must be estimated from the measurement results even though the study materials are prepared quantitatively from materials of established composition. This is in contrast to some chemical systems (such as synthetic gas mixtures and organic and inorganic calibration solutions) where materials can be prepared to have well-defined compositions that, with suitable verification, provide KCRVs that are independent of results reported by the study's participants.

All calculations used in this study were performed in a spreadsheet environment using a modern desktop computer. Purpose-built programs in languages native to this environment were used to automate repetitive computations. Versions of these tools are available on request from the corresponding author.

Results and discussion

National standard degrees of equivalence as currently estimated

As defined by the CIPM MRA, the DEq, d, for a particular KC result is estimated as

$$d = x - V_{KC} \tag{1}$$

where x is the reported value and V_KC is the KCRV, taken to be a close realization of an SI value as assigned by the sponsoring Working Group and approved by the Consultative Committee. Using formal variance propagation, the uncertainty associated with d should be estimated as [20]

$$u(d) = \sqrt{u^2(x) + u^2(V_{KC}) - 2\,\rho(x, V_{KC})\,u(x)\,u(V_{KC})} \tag{2}$$

where u(x) is the standard uncertainty associated with x, u(V_KC) is the standard uncertainty of V_KC, and ρ(x, V_KC) is the correlation between the reported value and the KCRV. Within at least the CCQM, except when the KCRV has been assigned using the Graybill-Deal estimator [21, 22], the ρ(x, V_KC) term has generally been ignored, effectively asserting that ρ(x, V_KC) = 0.

Since the MRA requires that uncertainties are to be specified at the 95 % level of confidence, standard uncertainties must usually be estimated from reported expanded uncertainties

$$u(x) = U_{95}(x) / k_{95} \tag{3}$$

where k95 is the coverage factor expected to yield an expanded uncertainty such that the interval x ± k95·u(x) includes the true value with a 95 % level of confidence. The desired 95 % level of confidence expanded uncertainty on d, U95(d), is likewise typically estimated as

$$U_{95}(d) = k_{95}\,u(d) \tag{4}$$

Again, within at least the CCQM, k95 has generally been asserted to be 2 regardless of how the various quantities are actually estimated.

Measurement capability degrees of equivalence for a given buffer

Given N individual d ± U95(d) estimates for a particular NMI and assuming that they are independently drawn from a relatively normal distribution, a combined “measurement capability” DEq, D ± U95(D), for that NMI can be estimated from the mean of the d, the standard deviation of the d, and the pooled U95(d),

$$D = \frac{1}{N}\sum_{i=1}^{N} d_i, \qquad U_{95}(D) = 2\sqrt{\bar{u}^2(d) + s^2(d)}, \qquad \bar{u}(d) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{U_{95}(d_i)}{2}\right)^2} \tag{5}$$

where i indexes over the individual estimates. The U95(D) estimated in this manner can be considered as conservatively large since the among-temperature variability, estimated from the standard deviation of the d_i, includes contributions from the within-temperature variability, estimated as the pooled U95(d_i)/2. However, these U95(D) will always be at least as large as the expected within-temperature U95(d_i) and will closely approach 2·s(d) as between-temperature differences become dominant. Note that u(D) is not scaled by √N since D ± U95(D) is intended to be characteristic of individual measurement processes rather than any estimate of the central tendency of N processes.

The variance propagation results for all five buffer systems are listed in Tables S1.b to S5.b of the ESM, along with the d and u(d) recalculated from the reported results as listed in Tables S1.a to S5.a.

Of course, that the d ± U95(d) can be mathematically combined does not address the question as to whether combining them is chemically reasonable. Figure 1 displays the d ± U95(d) for all NMIs that reported primary pH results in the CCQM-K9, -K9.1, -K9.2, and APMP.QM-K9 studies of the phosphate buffer system along with the combined D ± U95(D).

Fig. 1 Dot-and-bar plot of degrees of equivalence estimated by variance propagation for all participants in CCQM-K9, -K9.1, -K9.2, or APMP.QM-K9 who reported primary method pH results. The vertical axis displays degrees of equivalence, D ± U95(D) and d ± U95(d). The horizontal axis is used to separate the NMIs. The filled circles and thick vertical lines represent the combined D ± U95(D) for each NMI as estimated from Eq. 5. The NMIs are sorted in order of increasing D within each KC; the KC is identified above the results for the participant with the lowest-valued D within that KC. The open symbols and thin vertical lines represent d ± U95(d) for measurements made at 15 °C (diamond), 25 °C (triangle), and 37 °C (square) as specified in the KC Final Reports. The thick horizontal line represents zero bias; the thin horizontal lines are visual guides.
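The variance-propagation estimates of this section can be sketched in a few lines of Python. This is a minimal illustration, not the spreadsheet tools used for the study: the function names and the numerical values are hypothetical, k95 = 2 and ρ = 0 are the conventional assumptions described in the text, and the combined estimate assumes that Eq. 5 takes U95(D) as k95 times the root-sum-of-squares of the pooled u(d) and the standard deviation of the d.

```python
import math
from statistics import mean, stdev


def degree_of_equivalence(x, U95_x, v_kc, u_v_kc, k95=2.0, rho=0.0):
    """Return (d, U95(d)) for one reported result.

    rho is the correlation between x and the KCRV; it defaults to 0,
    the value conventionally asserted within the CCQM.
    """
    u_x = U95_x / k95                      # standard uncertainty from the expanded one
    d = x - v_kc                           # deviation from the reference value
    u_d = math.sqrt(u_x**2 + u_v_kc**2 - 2.0 * rho * u_x * u_v_kc)
    return d, k95 * u_d


def combine_over_temperatures(d_values, U95_values, k95=2.0):
    """Return (D, U95(D)) from N temperature-specific d and U95(d).

    Assumed form: u(D) = sqrt(pooled u(d)^2 + s(d)^2), not divided by sqrt(N).
    """
    n = len(d_values)
    if n < 2:
        raise ValueError("at least two d values are needed for a standard deviation")
    D = mean(d_values)
    s = stdev(d_values)                    # among-temperature standard deviation
    u_pooled = math.sqrt(sum((U / k95) ** 2 for U in U95_values) / n)
    return D, k95 * math.sqrt(u_pooled**2 + s**2)


# Hypothetical values for one NMI at 15 °C, 25 °C, and 37 °C
D, U95_D = combine_over_temperatures([0.001, 0.002, 0.003], [0.004, 0.004, 0.004])
print(f"D = {D:.4f}, U95(D) = {U95_D:.4f}")
```

Because u(D) is not divided by √N, the combined interval describes what a single future measurement from that NMI might be expected to show rather than the precision of the average.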
These d ± U95(d) estimates are taken directly from the Final Reports or calculated using the data and formulae provided in those reports. The coherence of the d ± U95(d) over the three temperatures for nearly all of the NMIs suggests that combining the individual estimates is reasonable. If the validity of the combination is accepted, then the D ± U95(D) provides a snapshot of the NMI's phosphate buffer primary pH measurement capabilities from 15 °C to 37 °C.

Revisiting the estimation of degrees of equivalence

Since estimating D ± U95(D) is outside the scope of the CIPM MRA's measurement standard paradigm, the question arises whether even more informative estimates could be achieved using data analysis approaches that do more than just propagate reported summary estimates.

Key comparison reference value, V_KC

While many location estimators have been proposed for evaluating a KCRV, and recent guidance has been provided for choosing and calculating ones appropriate to particular circumstances [23], all of the KCs considered here have used either the median when there was significant between-result variance, s_b², or the Graybill-Deal weighted mean [21], x_GD, when s_b² was considered insignificant. The x_GD is defined as

$$x_{GD} = \sum_{i=1}^{n} \frac{x_i}{u^2(x_i)} \Bigg/ \sum_{i=1}^{n} \frac{1}{u^2(x_i)} \tag{6}$$

where i indexes over all the accepted results in a KC and n is the number of such results. Three of the root KCs (CCQM-K9, -K17, and -K20) used x_GD as their KCRV estimate for all temperatures studied.

It is now better appreciated that use of x_GD is justified only in the unusual case where s_b² is truly zero and all of the u(x) are credible. For situations where s_b² is appreciable but the x follow an approximately unimodal symmetric distribution and the u(x) are at least plausible, the DerSimonian-Laird (DL) [24] weighted mean, x_DL, is more appropriate [25]. Commonly used in clinical meta-analysis, x_DL is identical to x_GD when s_b² is zero but approaches the arithmetic mean, x_mean, as s_b² becomes large relative to the u(x_i).
The x_DL is defined as

$$x_{DL} = \sum_{i=1}^{n} \frac{x_i}{u^2(x_i) + s_b^2} \Bigg/ \sum_{i=1}^{n} \frac{1}{u^2(x_i) + s_b^2}, \qquad s_b^2 = \mathrm{MAX}\left(0,\ \frac{\sum_{i=1}^{n} w_i (x_i - x_{GD})^2 - (n - 1)}{\sum_{i=1}^{n} w_i - \sum_{i=1}^{n} w_i^2 \big/ \sum_{i=1}^{n} w_i}\right), \qquad w_i = \frac{1}{u^2(x_i)} \tag{7}$$

where MAX is the function “return the largest value of the arguments.” Since x_DL asymptotically approaches x_mean, it is as sensitive as x_mean itself to the presence of discordant results and is only appropriately used after any and all such results have been identified, reviewed by the submitting NMI, and excluded if a cause for the discordance is identified.

Due to what was considered appreciable s_b², the CCQM-K18 and -K19 studies used the median of the accepted x, x_median, to estimate the KCRV at each temperature studied. While appropriate for any distribution and robust to minority populations of discordant values, x_median is not a very efficient estimate of location (that is, it is more variable than x_mean when applied to normally distributed data) and does not make use of any information provided by the u(x) even when they are quite informative [26].

Standard uncertainty of the key comparison reference value, u(V_KC)

In the studies that used x_GD as the KCRV, the standard uncertainty usually associated with x_GD, u(x_GD), was regarded as too small for use as the u(V_KC). Instead, a weighted standard deviation estimated using the same inverse-variance weighting used to define x_GD was used to provide estimates that take non-zero s_b into account. While sometimes referred to as the “external consistency” uncertainty [3, 7, 27], this estimate is more simply termed the Graybill-Deal weighted standard deviation and is defined as

$$u_{GD}(x_{GD}) = \sqrt{\sum_{i=1}^{n} \frac{(x_i - x_{GD})^2}{(n - 1)\,u^2(x_i)} \Bigg/ \sum_{i=1}^{n} \frac{1}{u^2(x_i)}} \tag{9}$$

While providing more chemically reasonable u(V_KC) for these studies than does u(x_GD), this approach does not address the x_GD's bias towards x that have very small u(x).
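The two weighted means discussed above can be compared numerically with a short NumPy sketch. The function names are hypothetical, the between-result variance uses the standard DerSimonian-Laird moment estimator, and the Graybill-Deal weighted standard deviation is assumed to have the usual “external consistency” form; none of this reproduces the authors' spreadsheet code.

```python
import numpy as np


def graybill_deal(x, u):
    """Inverse-variance weighted mean and its weighted standard deviation."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    if x.size < 2:
        raise ValueError("at least two results are required")
    w = 1.0 / u**2
    x_gd = np.sum(w * x) / np.sum(w)
    # external-consistency form: weighted scatter about x_gd, scaled by (n - 1)
    u_gd = np.sqrt(np.sum(w * (x - x_gd) ** 2) / ((x.size - 1) * np.sum(w)))
    return float(x_gd), float(u_gd)


def dersimonian_laird(x, u):
    """DerSimonian-Laird weighted mean and between-result variance s_b^2."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    if x.size < 2:
        raise ValueError("at least two results are required")
    w = 1.0 / u**2
    x_gd = np.sum(w * x) / np.sum(w)
    q = np.sum(w * (x - x_gd) ** 2)                  # weighted sum of squares
    denom = np.sum(w) - np.sum(w**2) / np.sum(w)
    s2b = max(0.0, (q - (x.size - 1)) / denom)       # truncated at zero
    w_dl = 1.0 / (u**2 + s2b)
    return float(np.sum(w_dl * x) / np.sum(w_dl)), float(s2b)


# Hypothetical results: the third value has a very small stated uncertainty
x = [6.862, 6.864, 6.870]
u = [0.002, 0.002, 0.0005]
print(graybill_deal(x, u)[0], dersimonian_laird(x, u)[0])
```

With these invented values the Graybill-Deal mean is pulled strongly towards the result with the smallest stated uncertainty, while the DerSimonian-Laird mean lies closer to the arithmetic mean, which is the behaviour the text describes.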
The two studies that estimated the KCRV values as the x_median used a scaled version of the robust median absolute deviation from the median (MAD) dispersion estimate to estimate u(V_KC):

$$u_{MAD}(V_{KC}) = \frac{1.858\,\mathrm{MAD}(x)}{\sqrt{n - 1}}, \qquad \mathrm{MAD}(x) = \mathrm{MEDIAN}\left(\left|x_i - x_{median}\right|\right) \tag{10}$$

where MEDIAN is the function “find the median value of the specified list of values” and the scaling factor of 1.858/√(n − 1) adjusts the estimate to (1) have approximately the same coverage as a standard deviation for normally distributed data, (2) compensate for the lower efficiency of x_median relative to x_mean, and (3) compensate for the relatively small n. While robust to the inclusion of discordant values, the MAD is inefficient compared to the standard deviation when applied to normally distributed data.

While various approaches for estimating uncertainties for weighted means have been proposed that provide more efficient coverage intervals [28, 29], the original estimate associated with x_DL is [24]

$$u(x_{DL}) = \sqrt{1 \Bigg/ \sum_{i=1}^{n} \frac{1}{u^2(x_i) + s_b^2}} \tag{11}$$

Linkages between studies

The CIPM MRA does not specify how results from successor KCs are to be linked to those of a root KC; however, it does mandate [1] that “The results of the RMO key comparisons are linked to key comparison reference values established by CIPM key comparisons by the common participation of some institutes in both CIPM and RMO comparisons. The uncertainty with which comparison data are propagated depends on the number of institutes taking part in both comparisons and on the quality of the results reported by these institutes.” The CCQM has chosen to link successor and RMO KCs using the same general methods.
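The median-based KCRV with its scaled-MAD uncertainty, and the original uncertainty estimate for the DerSimonian-Laird mean, both described earlier in this section, are simple enough to sketch directly. The function names are hypothetical, and the between-result variance s_b² is taken here as an input that has already been estimated.

```python
import numpy as np


def median_kcrv(x):
    """Median reference value and its scaled-MAD standard uncertainty."""
    x = np.asarray(x, dtype=float)
    if x.size < 2:
        raise ValueError("at least two results are required")
    x_med = np.median(x)
    mad = np.median(np.abs(x - x_med))               # median absolute deviation
    u_mad = 1.858 * mad / np.sqrt(x.size - 1)        # scaling factor from the text
    return float(x_med), float(u_mad)


def u_dersimonian_laird(u, s2b):
    """Standard uncertainty of the DL mean for a given between-result variance."""
    u = np.asarray(u, dtype=float)
    return float(np.sqrt(1.0 / np.sum(1.0 / (u**2 + s2b))))


# Hypothetical values; the last one is discordant but barely moves the median
print(median_kcrv([10.010, 10.012, 10.013, 10.015, 10.040]))
```

The median and MAD ignore the stated uncertainties entirely, which is the source of both their robustness and their inefficiency.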
When a successor or RMO KC uses materials and methods that are sufficiently similar to those used in a root, as is the case for the primary pH studies considered here, the studies can be directly linked through results provided by one or more “anchor” NMIs who successfully participated in a prior KC. For example, results in the successor CCQM-K9.1 are linked to the KCRV of the root CCQM-K9 through results provided by one anchor who made full sets of measurements in both studies, CCQM-K9.2 is linked to CCQM-K9 through the results of two such anchors, and APMP.QM-K9 is linked through results of one anchor from CCQM-K9, one from CCQM-K9.1, and one from CCQM-K9.2. The linkages for all of the pH studies considered here are detailed in Tables S1.a to S5.a of the ESM.

To date, degrees of equivalence for participants in a successor pH KC have been estimated using a “national standard” paradigm assuming that DEq are unchanging over time and samples:

$$d = (x - V_S) + (V_R - V_{KC}), \qquad U_{95}(d) = k_{95}\,u(d), \qquad u(d) = \sqrt{u^2(x) + u^2(V_{KC}) + u^2(V_R) + u^2(V_S) - 2\,\rho(V_R, V_{KC})\,u(V_R)\,u(V_{KC})} \tag{12}$$

where V_R is a reference value estimated from the results of the anchor participants in previous studies, u(V_R) is its estimated standard uncertainty, V_S is a reference value estimated from the results of the anchor participants in the successor KC, u(V_S) is its estimated standard uncertainty, and ρ(V_R, V_KC) is the correlation between the prior studies' reference values and the KCRV. Although V_R has (nearly) always been estimated from a subset of the participants in the root KC, none of the other quantities are estimated from the same data sets and so are not expected to be strongly correlated.
As with the d ± U95(d) estimated for the participants in the root KC, ρ(V_R, V_KC) has typically been ignored and k95 asserted to be 2.

In the successor studies involving two or more anchor participants, V_R and V_S have been estimated from x_mean; the standard deviation, s(x); and the pooled uncertainty of the anchor participants' results, ū(x). The V_S and its standard uncertainty, u(V_S), are readily estimated:

$$V_S = \frac{1}{n}\sum_{j=1}^{n} x_{S,j}, \qquad u(V_S) = \sqrt{\frac{\bar{u}^2(x_S) + s^2(x_S)}{n}} \tag{13}$$

where j indexes over the anchors, n is the number of anchors, x_S are the results for the anchors in the successor KC, and u(x_S) are the standard uncertainties for the anchor values. When all anchors successfully participated in the same prior KC, the estimation process for the prior reference value, V_R, is analogous to the above with the x_S replaced by x_R. However, when some of the anchors participated in different studies (as in APMP.QM-K9), the national standard paradigm re-centers all of the anchor values to have the value they “should” have had:

$$x_{adj} = x_R - d_R, \qquad u(x_{adj}) = u(d_R) \tag{14}$$

where x_adj designates a re-centered value, x_R is the value in the most recent KC that the anchor successfully participated in, d_R is the DEq in that KC, and u(d_R) is its standard uncertainty. The uncertainty associated with x_R, u(x_R), is not included in the calculation of u(x_adj) since it is already included in u(d_R).

The measurement capability paradigm suggests a much simpler calculation. If a participant's result does not reflect the fixed bias of a national standard, successful participation in a prior KC implies only that all anchor participants are expected to routinely realize “true” values within their assessed uncertainties.
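A short sketch may make the national-standard linkage concrete. It rests on two assumptions about the forms of the equations: that the anchor reference value has standard uncertainty sqrt((pooled u² + s²)/n), and that the linked DEq is d = (x − V_S) + (V_R − V_KC) with the four variances added and a single correlation term between V_R and V_KC. The function names and numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def anchor_reference(x_anchor, u_anchor):
    """Mean of the anchor results and its assumed standard uncertainty."""
    n = len(x_anchor)
    if n < 2:
        raise ValueError("this estimate needs two or more anchor participants")
    v = mean(x_anchor)
    s2 = stdev(x_anchor) ** 2                         # between-anchor variance
    u2_pooled = sum(u**2 for u in u_anchor) / n       # mean of squared u
    return v, math.sqrt((u2_pooled + s2) / n)


def linked_deq(x, u_x, v_s, u_v_s, v_r, u_v_r, v_kc, u_v_kc, k95=2.0, rho=0.0):
    """DEq of a successor-KC result linked to the root KCRV through anchors.

    rho is the correlation between V_R and V_KC, conventionally ignored (0).
    """
    d = (x - v_s) + (v_r - v_kc)
    u_d = math.sqrt(u_x**2 + u_v_kc**2 + u_v_r**2 + u_v_s**2
                    - 2.0 * rho * u_v_r * u_v_kc)
    return d, k95 * u_d


# Hypothetical two-anchor successor study
v_s, u_v_s = anchor_reference([6.866, 6.868], [0.001, 0.001])
v_r, u_v_r = anchor_reference([6.863, 6.865], [0.001, 0.001])
print(linked_deq(6.870, 0.002, v_s, u_v_s, v_r, u_v_r, v_kc=6.863, u_v_kc=0.001))
```

Each link in the chain adds a variance term, so the U95(d) of a successor participant is necessarily larger than that of a root participant with the same u(x).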
The DEq for the non-anchor participants in the successor KC is thus independent of results in the root KC:

$$d = x - V_S, \qquad u(d) = \sqrt{u^2(x) + u^2(V_S)}, \qquad U_{95}(d) = k_{95}\,u(d) \tag{15}$$

When there is only one anchor participant, the k95 expansion factor in Eqs. 12 and 15 must be assigned by expert judgment.

Reference value estimators

When there is more than one anchor participant in a successor KC, using Eq. 13, i.e., estimating V_S as x_mean, does not make efficient use of the information provided in the reported u(x). As in the estimation of the KCRV, estimating V_S as x_DL (Eq. 7) and u(V_S) as u(x_DL) (Eq. 11) makes more complete use of the available information. Further, use of the same estimators for the V_KC and V_S provides a philosophically consistent approach to the analysis of the successor KCs.

Leave-one-out reference values

Estimating a KCRV using all accepted results can be considered to provide the closest realization of an SI unit that can be estimated using a consensus process. However, using that KCRV to estimate the d ± U95(d) for an x ± U95(x) used in the determination of the KCRV may result in non-negligible values for the often-ignored ρ(x, V_KC) term in Eq. 2. This can be avoided by estimating each d ± U95(d) relative to a reference value that is independent of the associated x ± U95(x). At the cost of additional calculations and an increase of √(n/(n − 1)) in the estimated uncertainty, the same estimator used for V_KC can provide individual reference values for the d ± U95(d) for each x ± U95(x) using all of the accepted results except itself.

This “leave-one-out” (LOO) approach is a routine tool for assessing the predictive utility of regression models [30]. LOO is a particularly useful tool for identifying the influence of particular values on consensus summaries and the consequences of such inclusion on the other values [31].

When the measurement capability linkage of Eq. 15 is used, the LOO-estimated DEq for participants in a root KC does not impact the DEq estimated for participants in successor studies since these are linked only to the KCRV of the root and the measurements made by the anchor participants in the successor KC itself. In any case, eliminating the potential distortion from ignoring non-zero ρ(x, V_KC) places the U95(d) estimates for root and successor KC participants on more equal footing.

Use of corrected and imperfect results

It can happen that an NMI recognizes computational oversights only after the results of a KC have been revealed. While the DEq for such an NMI must be estimated from the originally reported results, when the error results from miscalculation the WG may choose to use a transparently corrected result in determining the KCRV. In CCQM-K9, the NMI who reported the errant result had to demonstrate its capability in a successor KC. In these circumstances, an issue arises when the NMI is an anchor in a later successor: which result should be used as the link? The approach used by the EAWG has been to link to the result from the successor KC. However, since the KCRV of the root KC is based in part on that NMI's corrected result, linkage through the corrected result shortens the linkage chain for later participants without further compromise. As this shortening does not benefit the anchor participant but impacts only those NMIs that are linked through that anchor, measurement capability DEq should be based on the most direct valid linkage.

Occasionally, too, results are reported that are valid in their own right but that are excluded from formal inclusion in the KC and so cannot be used to estimate a national standard DEq. Such exclusions include but are not limited to measurements made at not quite the KC's design conditions and values submitted without an accompanying uncertainty budget.
Given that the proposed process for combining results is already well outside the scope of the CIPM MRA's paradigm, it seems reasonable to try to make use of such data after conservative adjustment. For example, (1) measurements made at an off-target temperature could be interpolated to the target if the approximate temperature dependence of the measurements can be estimated, or (2) missing uncertainties could be estimated as the worst case of previously supplied complete data, assuming that sufficient such data were available. While it would be inappropriate to base critical decisions primarily on resurrected data, ignoring available information is inefficient.

Parametric Bootstrap Monte Carlo analysis

The DEq uncertainty estimates detailed above generally follow the conventional propagation rules, with the exception that degrees of freedom and known correlation issues are routinely ignored. Given the relatively small number of data available for estimating a VKC or VS, the assumption that k95 = 2 provides a coverage interval with about a 95 % level of confidence about the true value is difficult to justify. And, while the correlation between a given location estimate and a datum used in its estimation can be determined, the functional relationship can be fairly complex. Parametric Bootstrap Monte Carlo (PBMC) analysis is one approach that provides a relatively simple and convenient method for estimating coverage intervals directly from just the reported data. Assuming that all of the reported x ± U95(x) credibly specify N(x, (U95(x)/2)²) normal kernel distributions, empirical posterior distributions for all d values estimated from Eqs. 1 or 15 can be estimated by (1) repetitively sampling all of the input values within their distributions, providing one PBMC sample per reported result for each set, (2) estimating VKC and VS for each of the PBMC sets, and (3) estimating and storing the d (call them dMC) for all of the resampled results in each set.
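The three PBMC steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names dl_mean and pbmc_d, the example values, and the random seed are hypothetical; the reference value is the textbook DerSimonian-Laird weighted mean, which may differ in detail from the xDL of Eq. 7; and only a single root KC is handled, with no successor linkage. Setting loo=True evaluates each resampled result against a reference value estimated from all of the other resampled results.

```python
import random


def dl_mean(x, u):
    """Textbook DerSimonian-Laird weighted mean (assumed form; may differ from Eq. 7)."""
    n = len(x)
    if n < 2:
        return x[0]
    w = [1.0 / ui ** 2 for ui in u]
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw           # inverse-variance mean
    q = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))     # Cochran's Q
    denom = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (n - 1)) / denom)                   # between-result variance
    ws = [1.0 / (ui ** 2 + tau2) for ui in u]
    return sum(wi * xi for wi, xi in zip(ws, x)) / sum(ws)


def pbmc_d(x, U95, n_mc=1000, loo=False, seed=0):
    """Return one list of d_MC values per reported result."""
    rng = random.Random(seed)
    u = [U / 2.0 for U in U95]            # N(x, (U95/2)^2) kernels
    n = len(x)
    d_mc = [[] for _ in range(n)]
    for _ in range(n_mc):
        # Step 1: draw one resampled value per reported result.
        xs = [rng.gauss(xi, ui) for xi, ui in zip(x, u)]
        # Step 2: estimate the reference value for this resampled set.
        v_all = None if loo else dl_mean(xs, u)
        for i in range(n):
            if loo:
                v = dl_mean(xs[:i] + xs[i + 1:], u[:i] + u[i + 1:])
            else:
                v = v_all
            # Step 3: store the resampled degree of equivalence.
            d_mc[i].append(xs[i] - v)
    return d_mc


# Hypothetical pH results and expanded uncertainties, for illustration only.
x_example = [6.8650, 6.8655, 6.8642, 6.8661]
U95_example = [0.002, 0.003, 0.002, 0.004]
d_mc_example = pbmc_d(x_example, U95_example, n_mc=200, loo=True)
```

Each d_mc_example[i] then holds the stored dMC for one result and can be summarized by percentiles of that list.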
This methodology is closely related to the methods described in [32] and to empirical Bayesian analysis [33]. While not particularly efficient in terms of computer resources, PBMC can be readily implemented in any computational environment that supports user definition of programs for the evaluation of specialized functions (e.g., xDL) and for the storage of intermediate results. Since spreadsheets can provide a familiar working environment that simplifies the definition and maintenance of the linkages between root and successor KCs, PBMC analysis within a spreadsheet environment can be quite efficient in terms of analysts' resources when appropriate care is taken in their design.

Assuming that a suitably large number of PBMC samplings, NMC, are available, d can be estimated from the empirical 50th percentile of the stored PBMC results:

$d = \mathrm{PTILE}(50, d_{MC}) = \mathrm{MEDIAN}(d_{MC})$

where PTILE is the function that returns the p-th percentile of the specified values and for p = 50 is identical to the median. Credible uncertainty intervals about d can be estimated in the same manner, with the 95 % level of confidence interval estimated from the 2.5 and 97.5 percentiles: PTILE(2.5,dMC) and PTILE(97.5,dMC). If the ratio (d - PTILE(2.5,dMC))/(PTILE(97.5,dMC) - d) is about 1, then the usual symmetric 95 % confidence interval on d can be estimated as

$U_{95}(d) = \left(\mathrm{PTILE}(97.5, d_{MC}) - \mathrm{PTILE}(2.5, d_{MC})\right)/2$

However, if the ratio is far from 1 then the interval can either be reported as asymmetric, or as the larger of the two half-intervals,

$U_{95}(d) = \mathrm{MAX}\left(d - \mathrm{PTILE}(2.5, d_{MC}),\ \mathrm{PTILE}(97.5, d_{MC}) - d\right)$

Asymmetric intervals are the narrowest intervals that provide the stated coverage; however, the familiar symmetric form may be more convenient for use in further calculations. While the symmetric estimates of Eq. 19 are conservative, they increasingly over-estimate the length of the interval as asymmetry increases. Using the same sets of PBMC dMC used to estimate the d in Eq.
16, the measurement capability DEq that combines results for all temperatures in a given buffer, the D for a given NMI of Eq. 5, can be estimated as

$D = \mathrm{PTILE}\left(50, \bigcup_{t=1}^{T} d_{MC,t}\right)$

where t indexes over the temperatures, T is the number of temperatures, dMCt are the PBMC values for the given temperature, ∪dMCt is the union of all the PBMC dMC for a given NMI, and the number of dMC is the same for all temperatures. The U95(D) can be estimated using the same approaches and decision criteria detailed in Eqs. 17 to 19:

$U_{95}(D) = \mathrm{MAX}\left(D - \mathrm{PTILE}\left(2.5, \bigcup_{t=1}^{T} d_{MC,t}\right),\ \mathrm{PTILE}\left(97.5, \bigcup_{t=1}^{T} d_{MC,t}\right) - D\right)$

Figure 2 displays the PBMC-estimated d ± U95(d) and D ± U95(D) for all NMIs that provided results for primary pH measurements in phosphate buffer. All of the expanded uncertainties are estimated conservatively as the maximum of the two half-intervals. At graphical resolution, the differences between the national standard estimates of Fig. 1 and the measurement capability estimates of Fig. 2 are quite small.

Fig. 2 Dot-and-bar plot of PBMC estimated degrees of equivalence for the CCQM-K9, -K9.1, -K9.2, and APMP.QM-K9 participants. The graphical format is identical to that of Fig. 1

Figure 3 provides a high-resolution comparison between the DEq as reported in the Final Reports and those estimated using PBMC and the several estimation and linkage modifications proposed above. All of the pH differences are small, with none larger than 0.003 and most less than 0.001, but the pattern of changes attributable to specific modifications may be of interest. Figure 3a visualizes the differences in d, U95(d), D, and U95(D) attributable to the PBMC estimation method itself. The d are essentially unaffected; the D are mostly unaffected except for those NMIs where the distribution of the combined dMC is not well described as symmetric unimodal. For these NMIs, the PBMC-estimated median dMC is somewhat closer to the ideal zero D than the arithmetic average. The PBMC-estimated U95(d) for the CCQM-K9 participants are somewhat smaller than the reported values.
The PBMC-estimated U95(d) for the participants in successor studies are either unchanged or somewhat larger, depending on which KC is considered. The U95(D) are essentially unaffected, again except for the NMIs where the combined dMC distribution is significantly asymmetric. Figure 3b depicts the changes attributable to the use of xDL for the reference values in CCQM-K9, -K9.2, and APMP.QM-K9. None of the d and D are changed by more than about 0.0005. The U95(d) and U95(D) are on average very slightly smaller than the values provided in the reports or estimated from them. Figure 3c depicts the change resulting from linking CCQM-K9.2 to the KCRV using the corrected value reported in CCQM-K9 by one of the anchor NMIs rather than that NMI's official DEq estimated in CCQM-K9.1. The change only affects the APMP.QM-K9 participants. Figure 3d depicts the change resulting from using LOO evaluation for the CCQM-K9 participants, where the d and D become on average about 0.0002 farther from zero and the U95(d) and U95(D) become uniformly about 0.0003 larger. These small changes have virtually no effect on the DEq estimated for the participants in the successor KCs. Figure 3e depicts the change in linkage from the national standard paradigm of Eq. 12 to the measurement capability paradigm of Eq. 15. The d and D for the participants in the successor KCs are changed by up to 0.002, reflecting the elimination of the VR bias-correction, with a small majority of the DEq becoming closer to the ideal zero. The U95(d) and U95(D) for these NMIs rather uniformly become about 0.0005 shorter, reflecting the elimination of the u(VR) uncertainty component.

Fig. 3 Differences between the degrees of equivalence and their expanded uncertainties for the CCQM-K9, -K9.1, -K9.2, or APMP.QM-K9 participants as reported and as estimated using the proposed modified approaches.
The panels display differences due to the use of (a) the PBMC estimation process, (b) the DerSimonian-Laird weighted mean to estimate all reference values, (c) linking to a corrected value in CCQM-K9 rather than to its replacement in CCQM-K9.1, (d) leave-one-out evaluation of CCQM-K9 results, (e) measurement capability paradigm linkage, and (f) the combination of all the proposed modifications. For all panels, the horizontal axis displays differences in absolute d and D; the vertical axis displays differences in U95(d) and U95(D). Negative values along either axis indicate that the reported values are further from the ideal zero than those estimated using the proposed modification. Small open symbols represent temperature-specific differences in d and U95(d); large solid symbols represent differences in the estimated D and U95(D); circles represent estimates from CCQM-K9, triangles CCQM-K9.1, diamonds CCQM-K9.2, and squares APMP.QM-K9. The bars on all symbols represent 95 % level of confidence intervals on the PBMC estimates, based on 9 sets of 1000 random draws.

Figure 3f depicts the combined effect of all of the proposed modifications. The great majority of the observed changes are attributable to use of (1) the measurement capability paradigm, (2) PBMC analysis, (3) xDL as the estimator for the reference values in both the root and successor KCs, and (4) LOO analysis of the DEq for participants in the root KC. Note that each of these modifications can have very different effects on the participants in the root and in the successor KCs, and the magnitude of the changes observed with the CCQM-K9, -K9.1, -K9.2, and APMP.QM-K9 studies may not predict their relative impact on other measurement systems. The PBMC results for all five buffer systems are listed in Tables S1.c to S5.c of the ESM.
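The percentile summaries described in this section, and the pooling of dMC over temperatures used for D, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the names ptile, summarize, and pool are hypothetical, and ptile assumes linear interpolation between order statistics because the text does not state which percentile convention PTILE uses. Deciding whether the ratio of the two half-intervals is "about 1" is left to the analyst, so both the symmetric and the conservative maximum-half-interval estimates are returned.

```python
import math


def ptile(p, values):
    """p-th percentile by linear interpolation between order statistics (assumed convention)."""
    s = sorted(values)
    rank = p * (len(s) - 1) / 100.0
    lo = math.floor(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (rank - lo) * (s[hi] - s[lo])


def summarize(d_mc):
    """Median and 95 % interval summaries for one set of stored d_MC values."""
    d = ptile(50, d_mc)                 # median estimate of d (or D)
    lower = d - ptile(2.5, d_mc)        # lower half-interval
    upper = ptile(97.5, d_mc) - d       # upper half-interval
    return {
        "d": d,
        "lower": lower,
        "upper": upper,
        "U95_sym": (lower + upper) / 2.0,   # symmetric form
        "U95_max": max(lower, upper),       # conservative form
    }


def pool(d_mc_by_temperature):
    """Union of the d_MC lists for all temperatures reported by one NMI."""
    return [v for d_mc in d_mc_by_temperature for v in d_mc]
```

Calling summarize(pool([...])) on equally sized per-temperature lists gives the buffer-level D and its interval; the equal sizes matter because the union weights each temperature by its number of stored values.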
Measurement capability degrees of equivalence for all buffers

While each buffer system has its unique attributes, the d ± U95(d) estimates for most NMIs in other buffers where measurements were reported at 15 °C, 25 °C, and 37 °C are about as self-consistent as they are in the phosphate buffer discussed above. Given that all D ± U95(D) within-buffer estimates appear to make chemical sense, it remains to explore how results can be combined across the buffers and whether such combinations are chemically informative.

To meaningfully combine across the buffer systems, the magnitude and distributions of the quantities combined must be similar. Figure 4 displays the standard deviation, s(x), the DerSimonian-Laird between-NMI component of variation, sb, and the pooled (see Eq. 7) measurement uncertainties, u(x), estimated from the accepted results in the five root KCs. The u(x) are strikingly similar for all five buffers, indicating that the participating NMIs regarded the measurement processes as being of similar complexity. However, the reported measurement uncertainties do not fully account for the observed between-NMI variability in any of the buffer systems. The magnitude of the unexplained between-NMI variability is about the same and rather small in four of the buffers. Only in the carbonate system investigated in CCQM-K18 and -K18.1 is the unexplained variability significant, and it can be entirely attributed to a reproducible offset in the measurement results reported by two NMIs. While not yet completely understood, this offset is believed to be related to the procedures used to account for slow loss of CO2 from the buffer into the hydrogen flow in the Harned cell. The carbonate buffer KCs are also unique in that, owing to the time required for measurement at each temperature, the KC protocol only involved measurements at 25 °C. It is plausible that primary pH measurements in this system may not be comparable to those in the other four buffers.
However, the variability of the DEq in the carbonate system is not so much greater than that in the others as to preclude attempting to combine them with those for the other buffers and evaluating the resulting combined values for chemical plausibility.

Fig. 4 Uncertainty components for the pH-related measurement results reported in the CCQM-K9, -K17, -K18, -K19, and -K20 key comparisons. The horizontal axis displays the KCRVs as estimated from the 15 °C, 25 °C, and 37 °C results accepted for use in estimating the KCRV. The vertical axis displays estimates of variability for these results. The open triangles represent the standard deviations, s, for the reported x in each of these root KCs; the dashed horizontal line their pooled value. The asterisks represent the pooled uncertainty, u, of the reported u(x); the thick horizontal line their pooled value. The solid circles represent the DerSimonian-Laird estimate of between-NMI variability, sb; the thin horizontal line their pooled value. The horizontal and vertical lines represent PBMC-estimated 95 % coverage intervals, based on 9 sets of 1000 random draws.

The number of temperatures evaluated in the EAWG's pH KCs does differ; further, KC participants do not always report results for all of the temperatures included in the KC design. To provide an all-buffer DEq summary for each NMI, this potential imbalance in the number of temperature-specific d ± U95(d) available in different buffers requires modification of the single-buffer approaches for combining DEq. This is trivial for the propagation approach, requiring only that the d ± U95(d) in Eq. 5 be replaced by the buffer-specific summary D ± U95(D), where i now indexes over the buffers and N is the number of buffer systems for which the NMI provided results. Estimating the all-buffer summary is only a bit more complicated for the PBMC approach of Eq. 16. Using the same sets of PBMC dMC used to estimate d, U95(d), D, and U95(D), the all-buffer summary is estimated as

$\mathrm{PTILE}\left(50, \bigcup_{t} d_{MC,t}\right)$

where t now indexes over all temperatures in all buffers. The expanded uncertainty of the all-buffer summary can be estimated using the analogous modifications to Eq. 21, again using the decision criteria discussed for Eqs. 17 to 19. To ensure that each of the five buffer systems has equal influence on the all-buffer estimates, the total number of dMC should be the same for all buffers, e.g., for each 1000 PBMC dMC values generated for each of the three results reported in the phosphate buffer system there should be 3000 dMC for the carbonate buffer's single result. While just a bookkeeping detail, having balanced numbers of dMC is necessary for the PBMC process to yield equal-weighted estimates.

Figure 5 displays the variance propagation and PBMC-generated all-buffer estimates for all NMIs reporting any primary pH result in any of the pH KCs listed in Table 1, with the U95(D) and the all-buffer expanded uncertainties conservatively estimated as the maximum half-interval. Figure 5 uses the same dot-and-bar format used in Fig. 1, but with the thin lines representing the buffer-specific D ± U95(D) rather than the within-buffer temperature-specific d ± U95(d). At graphical resolution, the two methods provide very similar estimates; numeric values of the estimates are listed in Table S6 of the ESM. Figure S6 displays the PBMC results using symmetric and asymmetric intervals for U95(D) and for the all-buffer expanded uncertainty. The D ± U95(D) for the carbonate buffer do not appear to be systematically different from those of the other buffer systems. For the large majority of NMIs, the DEq in different buffers are quite coherent. The reproducible and relatively large offset for the NMI coded as T has been previously noted and identified as the result of using a somewhat different electrochemical cell design than that used by most other NMIs.

Fig. 5 Dot-and-bar plots of degrees of equivalence for all NMIs that reported primary method pH results in any of the 11 KCs listed in Table 1. (a) Variance propagation estimates, (b) PBMC estimates. The graphical format is similar to that of Fig. 1 with the exception that the large solid circles and thick bars represent the all-temperature, all-buffer summaries, and the smaller symbols and thin lines represent the all-temperature D ± U95(D) buffer-specific summaries. The smaller solid circles represent results for phosphate buffer (CCQM-K9, -K9.1, -K9.2, APMP.QM-K9), crosses (×) phthalate buffer (CCQM-K17, EUROMET-K17), solid triangles carbonate buffer (CCQM-K18, -K18.1), plus signs (+) borate buffer (CCQM-K19, -K19.1), and solid diamonds tetroxalate buffer (CCQM-K20)

The very similar values of the temperature-specific d ± U95(d) for the primary pH measurement results reported by most KC participants in each of the five buffer systems suggest that combining them into buffer-specific D ± U95(D) summaries provides chemically useful information, at least for the measurements made over the range of temperatures evaluated in that buffer. Likewise, the very similar values for the buffer-specific D ± U95(D) for most NMIs suggest that combining them into buffer-independent all-buffer summaries may usefully summarize the primary pH measurement capabilities of the KC participants, at least for the five buffer systems and the 15 °C to 37 °C temperature range considered in this study. While not essential to reaching the above conclusions, we propose a number of modifications to the methods usually used for CIPM MRA degrees of equivalence that may contribute to providing more representative estimates.
The most significant of these are use of (1) measurement capability linkages between root and successor KCs, (2) Monte Carlo (PBMC and others) methods for evaluating the consequences of different distributional assumptions on the estimation of credible coverage intervals, (3) comparison of leave-one-out (LOO) degrees of equivalence estimates with those using the traditional approach to evaluate the influence of correlation, and (4) a modified dot-and-bar graphic for displaying summary estimates such as the buffer-specific D ± U95(D) and the all-buffer summaries. The primary pH measurement results provided by the NMI participants in these pH-related KCs were chosen for study for a number of reasons, but chief among them is the remarkable agreement among the participant results over all of the solutions and evaluation temperatures thus far studied by the EAWG. If the degrees of equivalence for these measurements could not have been meaningfully combined, it would be highly unlikely that the results for less well understood and controlled measurement systems could be meaningfully combined. That the primary pH results can be combined using relatively simple analysis and display methods thus does not ensure that similarly meaningful summaries can be devised for other measurement systems, but it provides the incentive to attempt to do so.

Acknowledgments

We thank all participants in CCQM-K9, -K9.1, -K9.2, -K17, -K18, -K18.1, -K19, -K19.1, -K20, APMP.QM-K9, and EUROMET.QM-K17 for their thoughtful contributions to the design of the studies and evaluation of the results and for their meticulous measurements. DLD thanks Katherine Sharpless and Katrice Lippa for their assistance and advice in preparing this report and the anonymous reviewers for their careful corrections and insightful suggestions.
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.



David L. Duewer, Kenneth W. Pratt, Chainarong Cherdchu, Nongluck Tangpaisarnkul, Akiharu Hioki, Masaki Ohata, Petra Spitzer, Michal Máriássy, Leoš Vyskočil. “Degrees of equivalence” for chemical measurement capabilities: primary pH, Accreditation and Quality Assurance, 2014, 329-342, DOI: 10.1007/s00769-014-1076-1