Analysis of Exposure–Biomarker Relationships with the Johnson SBB Distribution

Annals of Occupational Hygiene, Aug 2007

Application of the Johnson bivariate SB distribution, or alternatively the SBB distribution, is presented here as a tool for the analysis of concentration data and in particular for characterizing the relationship between exposures and biomarkers. Methods for fitting the marginal SB distributions are enhanced by maximizing the Shapiro–Wilk W statistic. The subsequent goodness of fit for the SBB distribution is evaluated with a multivariate Z statistic. Median regression results are extended here with methods for calculating the mean and standard deviation of the conditional array distributions. Application of these methods to the evaluation of the relationship between exposure to airborne bromopropane and the biomarker of serum bromide concentration suggests that the SBB distribution may be useful in stratifying workers by exposure based on using a biomarker. A comparison with the usual two-parameter log-normal approach shows that in some cases the SBB distribution may offer advantages.

A PDF file should load here. If you do not see its contents the file may be temporarily unavailable at the journal website or you do not have a PDF plug-in installed and enabled in your browser.

Alternatively, you can download the file locally and open with any standalone PDF reader:

https://annhyg.oxfordjournals.org/content/51/6/533.full.pdf

Analysis of Exposure–Biomarker Relationships with the Johnson SBB Distribution

MICHAEL R. FLYNN 0 0 Department of Environmental Sciences and Engineering, School of Public Health, University of North Carolina , Chapel Hill, NC 27599-7431 , USA Application of the Johnson bivariate SB distribution, or alternatively the SBB distribution, is presented here as a tool for the analysis of concentration data and in particular for characterizing the relationship between exposures and biomarkers. Methods for fitting the marginal SB distributions are enhanced by maximizing the Shapiro-Wilk W statistic. The subsequent goodness of fit for the SBB distribution is evaluated with a multivariate Z statistic. Median regression results are extended here with methods for calculating the mean and standard deviation of the conditional array distributions. Application of these methods to the evaluation of the relationship between exposure to airborne bromopropane and the biomarker of serum bromide concentration suggests that the SBB distribution may be useful in stratifying workers by exposure based on using a biomarker. A comparison with the usual two-parameter log-normal approach shows that in some cases the SBB distribution may offer advantages. - INTRODUCTION Deciphering the role of exposure to an environmental contaminant in the cause of human disease is a complex problem requiring tools from a variety of scientific disciplines. Occupational epidemiology studies often look for statistical associations between the relative risks of disease in groups of workers stratified by surrogate exposure metrics. In the best of circumstances these surrogates include measurements such as breathing-zone concentration. The misclassification that arises from bias and imprecision in the exposure assessment has been studied, and although methods exist to minimize such effects, they represent a significant limitation in assessing the causality (or association) of such exposures and disease. In an oft-cited paper, Hill (1965) discusses factors involved in establishing a supportive argument for the causality of a given exposure to a specific disease. Among the items discussed temporality, plausibility (e.g. a mechanistic model) and a doseresponse relationship (biological gradient) hold pre-eminent positions. The exposure must precede the disease for *Author to whom correspondence should be addressed. Tel: 919-966-3473; fax: 919-966-7911; e-mail: a causal relationship to exist. Such a relationship is strengthened by both a plausible model describing the disease process and a repeatable doseresponse effect. Research focusing on the relationship between exposure, the absorbed dose, metabolic transformations and disease mechanisms has the potential to improve our understanding of the causality of such exposures by contributing to each of these criteria. Current research makes use of biomarkers in a chain of sequential processes leading from exposure to disease. Figure 1 gives a simple overview of this process indicating biomarkers of both exposure and effect. Thus, exposure to an airborne contaminant may be characterized by its concentration in air, and upon inhalation, the substance (or metabolite) may appear in the blood, exhaled breath or urine as a biomarker of exposure. Subsequently, this chemical or its metabolite may lead to an important biological outcome (gene mutation, enzyme modification, etc.) which can be measured as a biomarker of effect. This effect may represent a risk factor for the disease in question, and thus, a mechanistic model of causality may be postulated and tested. A useful biomarker of exposure will have good specificity for the contaminant of concern, the greater the correlation between the biomarker and the actual exposure, the more useful the biomarker will be as Fig. 1. A sequential model of the exposuredisease process with biomarkers of exposure and effect. a potential assessment tool. Questions arise as to which biomarkers may be best suited to assess exposure and/or risk based on a variety of factors. Individual variability and genetic susceptibility play confounding roles and the potential for the biomarkers to be affected by other exposures, and other substances lead to serious questions about specificity. A fundamental question therefore is, does a potential biomarker of exposure correlate with exposure or is the relationship so variable that little differentiation of exposure is possible. Analogous questions exist for all subsequent biomarkers in the causal chain. This paper illustrates an application of a statistical methodology to characterize the relationship between exposure to an airborne contaminant (bromopropane) and a biomarker of exposure (serum bromine concentration). The methodology is based on regression analysis using the Johnson system of bivariate normal distributions (Johnson, 1949b); and the data set is from Toraason et al. (2006). STATISTICAL CHARACTERIZATION OF EXPOSURES AND BIOMARKERS When considering the mathematical relationship between exposure and a given biomarker, it is natural to employ a regression methodology seeking to establish the biomarker as a function of the exposure. We believe that the exposure produces the biomarker and that a functional relationship exists, albeit a complex and perhaps highly variable one. While it is possible to control exposures and other covariates in animal studies and thus elucidate more mechanistic relationships, in human studies the random nature of both exposure and biomarker leads us to consider a more probabilistic approach and in particular whether certain bivariate normal distributions may be suitable for characterizing such relationships. The significance of this is that when a transformation to the bivariate normal distribution is possible, powerful methods exist for making inferences about exposure and its variability, given a measurement of the biomarker and vice versa. This in turn enhances quantitative differentiation of individuals in epidemiological and related health studies with respect to exposure and dose. Just as measurements of exposure, biomarkers are often concentrations (but in biological fluids) or values that can be expressed as dimensionless proportions (e.g. adducts per number of nucleotides). All concentrations, and many types of biomarkers, thus can be expressed as dimensionless, continuous, bounded variates. It has been proven mathematically that if a dimensionally consistent functional relationship exists between a set of variables, then it will exist in the form of dimensionless power products (Buckingham Pi Theorem, see, e.g. Vignaux and Scott, 1999). From a theoretical perspective these facts and observations lead us to favor concentrations in dimensionless form and tend to bias our selection of probability distributions to those for bounded variates. The two most common choices are the Beta distribution of the Pearson family or the SB distribution in the Johnson family. Generally, exposures to airborne contaminants are expressed as time-weighted average concentrations in parts per million (p.p.m.) for gases and vapors or in milligrams per cubic meter for aerosols. Either is immediately expressed as a dimensionless proportion by dividing by the theoretical maximum concentration, i.e. one million p.p.m., or for aerosols, the bulk density of the material. More realistic maximum values may be used as well, e.g. saturation vapor pressure concentration (see e.g. Flynn, 2004b). In reality there may be better values for the maximum and minimum concentrations that either can be estimated from the sample data or selected based on other considerations. Background levels of concentration may exist that effectively define a minimum value, and maximums are inevitably far below the theoretical values based on physical or chemical principles. Similarly for biomarkers expressed as units of mass per volume of blood, urine, exhaled air, etc., or as counts per some maximum value; all may be converted to dimensionless proportions by appropriate selection or estimation of maximum and minimum values. Analysis of concentration data frequently employs the log-normal distribution in recognition of the fact that the measurements tend to be positively skewed and that the transform may approximately normalize the variance. The distribution is appealing in its relative simplicity and the transformation to normality. However, the log-normal distribution is limited in its ability to cover the skewnesskurtosis plane relative to either the Beta or SB distribution, and it is an unbounded distribution. The Johnson system of univariate probability distributions In a seminal work Johnson (1949a) developed a family of probability distributions based on a series of transformations (translations) to normality. Three distributions were described: the SU, SB and SL. The SL distribution is the usual log-normal distribution (both two- and three-parameter forms), the SU is an unbounded distribution based on an inverse hyperbolic sine transform and the SB is a bounded distribution that has been called the four-parameter log-normal distribution. Each of the three distributions in the Johnson family employs a transformation of the original variable to a new variable that has a normal distribution. Probability distributions are characterized by their moments or functions of the moments. The mean, variance, coefficient of skewness (b1) and coefficient of kurtosis (b2), have been used to describe the location, scale and shape of such distributions. Traditionally, a plot of skewness and kurtosis has been used to differentiate the various distributions based on their shape. Figure 2 is such a plot for the Johnson family of probability distributions. The inverted b2 scale is traditional, but counter to what is expected. The Johnson system divides the skewnesskurtosis plane into three regions by two curves. The first region is the area above the line defined by the relationship b2 5 1 b1, which is an impossibility for any probability distribution. The second region is between this line and the curve that defines the lognormal distribution; this region is covered by the SB distribution. Parametric equations for this lognormal line as it is sometimes called can be found in Johnson and Kotz (1970). The final region below the lognormal curve is covered by the SU distribution. Thus, based on the shape of the distribution as defined by the coefficients of skewness and kurtosis, one may select one of the three distributions by using this plot. For our purposes here the SU distribution is not relevant and it will not be discussed further. While the log-normal distribution is reasonably well Fig. 2. The skewnesskurtosis plane for the Johnson system of distributions, the sample data for Facility A serum bromine and bromopropane exposure are also plotted. known, a brief summary for the less popular SB distribution follows. The reader is referred to Johnson and Kotz (1970) for a more detailed discussion of the Johnson family of distributions and the preceding material. Consider the transform y on the random variable of concentration c: e 5 the minimum value of c e k 5 the maximum value of c: Within our context, c is an exposure measurement, e.g. a time-weighted average concentration, or a given biomarker. The concentration c is distributed according to the SB distribution if the variable z defined as is a unit normal variable. If the extremes are known a priori, maximum likelihood estimates for gamma (c), and delta (d) are f is the sample mean and sf 5 When the extremes are not known a priori, the fitting process is more difficult, particularly when all four parameters are to be estimated. Johnson and Kitchen (1971) outline a method-ofmoments fitting procedure based on calculating sample coefficients of skewness and kurtosis and using a lookup table for estimating gamma and delta. The range (k) and minimum (e) are then estimated directly as: e 5 lc where f 5 ln 1 y y ; An alternative to the look-up table, for estimating gamma and delta from the coefficients of skewness and kurtosis, is a graph presented in Bowman and Serbin (1981). This plot is similar to Fig. 2, but with curves of constant values for gamma and delta overlaid on it, it permits easy estimation of gamma and delta, given the sample values of b1 and b2. The SBB distribution Subsequent to the development of the univariate system, Johnson (1949b) described a family of bivariate distributions based on the transformations to normality. Median regression equations were provided for 16 possible pairs of variates based on fitting the marginal distributions. Thus, the bivariate lognormal (SLL) and bivariate SB distributions (SBB) were described, as well as mixed bivariate distributions (e.g. log-normal and SB). Convenient expressions for the conditional array distributions were also given based on the theory of bivariate normal distributions [see also Kotz et al. (2000) for a good summary of these distributions]. The joint probability for the bivariate SBB distribution is pz1; z2 5 h2ppffi1ffiffiffiffiffiffiffiffiffiqffiffiffi2ffi i 1exp 6 where q is the correlation coefficient between z1 and z2 and The conditional array distribution of y2 given y1 will be an SB distribution defined by c2 replaced by d2 replaced by d2 1 The mean, variance, skewness and kurtosis of this array distribution can be recovered by convergent infinite sums or interpolation from tabulated values (Flynn, 2005). The median regression for the bivariate SB (median of y2 5 y~2 given y1) is h 5 expqc1 c2=d2 The major problem in using the SBB distribution is fitting the marginal SB distributions. Once the marginal distributions have been fit, calculation of the correlation coefficient between f1 and f2 results in specification of the bivariate distribution. Note that once the marginal distributions have been fit, the slope and intercept from a least-squares linear regression of f2 against f1 provide the values for theta and phi in equations (10) and (11) since f2 5 /f1 lnh: This can be conveniently obtained from a spreadsheet or statistical package along with the correlation coefficient q. APPLICATIONBIOMARKERS OF EXPOSURE To explore application of the Johnson system of bivariate distributions, and the SBB distribution in particular, to the problem of modeling exposures and biomarkers, a relevant data set was extracted from the literature (Toraason et al., 2006) and analyzed with the methods described above. The data consist of measurements of personal exposure (airborne concentrations) to 1-bromopropane; biomarkers of exposure in the form of serum and urine concentrations of bromine; and biomarkers of effect, including comet tail moments and dispersion coefficients measured in blood leukocytes. The data were collected on workers at two different facilities (A and B) involved in the manufacture of foam rubber cushions for furniture. Bromopropane is a solvent in spraying and gluing operations at both facilities. Biomarker measurements were taken at the beginning and end of the week, and exposure data were collected during the week. Table 1 presents the summary statistics for the data employed here: only end of the week biomarker data were used. The illustration here will focus on the relationship between exposure and serum bromine. Figure 3 is Exposure, all data Exposure, Facility A Exposure, Facility B Serum, all data Serum, Facility A Serum, Facility B a plot of this data, which clearly shows two different relationships for Facility A and B. In particular, all of Facility B measurements of serum bromine are below those of Facility A, even when the exposures are similar. There may be additional sources of exposure to bromine for individuals in Facility A, due to brominated compounds in the drinking water or air or perhaps increased dermal exposure to the adhesives in Facility A relative to Facility B. The exposure data collected did not include dermal assessments and studies indicate that this is a potential route for exposure to bromopropane as well (Rozman and Doull, 2002). It has also been established that biological monitoring for the actual compound 1-bromopropane provides better correlation with exposure than does biological bromine, due to non-occupational exposure sources in the environment. The apparent preference for bromine as a biomarker in these types of studies appears due to cost and convenience considerations (Hanley et al., 2006). In addition to the effect for facility, Fig. 3 shows clearly the non-linearity of the relationship between exposure and serum bromine for Facility A. This suggests the potential for transformations, and in particular the Johnson system of bivariate distributions. Due to the small sample size for Facility B, only Facility A data are considered further. The sample coefficients of skewness and kurtosis are readily calculated using k statistics (see, e.g. Keeping, 1995) for exposure and serum bromine data and are shown in Table 1 and plotted in Fig. 2 as well. The results for both variables are well above the log-normal curve, and in the region covered by the SB distribution, suggesting the potential for a bivariate SBB fit to the data. The first task is therefore to fit the marginal distributions of exposure and serum bromine to individual SB distributions. Fitting the marginal SB distributions Recent papers (Flynn, 2004a, 2006) describe some of the various methods for fitting Johnson SB distriFig. 3. Serum bromine versus airborne bromopropane, untransformed data for both Facilities A and B. butions including percentile, quantile and methodof-moments techniques. Maximum likelihood methods have been explored as well (Tsionas, 2001). When all four parameters are estimated, no matter which fitting method is employed, erroneous results and/or failures of the algorithms are possible. One common problem is that even when good estimates for gamma and delta are obtained, the estimated minimum may be above the sample value and/or the estimated maximum may be below that of the sample. To overcome these difficulties an additional fitting procedure is employed here based on the work of Royston (1982, 1993), who suggested maximizing the W statistic in the ShapiroWilk test as a method to fit the three-parameter log-normal distribution. The W statistic is essentially an R2 value between the sample order statistics and the expected order normal statistics. The latter are functions of sample size and can be obtained either through approximate calculations (Royston, 1993) or from more accurate tables (Shapiro and Wilk, 1965). The W statistic is between 0 and 1, and the higher it is, the better the sample data fit a normal distribution. The approach is modified here for the SB distribution to obtain fits when other methods failed or to optimize results obtained with the other fitting methods. This procedure involves estimating gamma and delta (either graphically or with one of the methods noted above) and then using a non-linear constrained optimization algorithm (the Excel spreadsheet Solver function) to maximize the W statistic based on the f values. Epsilon and lambda are adjusted by the algorithm, subject to the specified constraints, until the W statistic is maximized. The specified constraints are as follows: (i) epsilon lies between zero and the sample minimum and (ii) lambda is greater than the sample maximum but less than some absolute maximum (e.g. the saturation concentration of bromopropane). Once the values of lambda and epsilon are obtained, gamma and delta are estimated with equation (3). This procedure proved very useful especially in cases where other methods failed. Table 2 presents the results of the marginal (univariate) SB fits. Although the skewnesskurtosis plot is a useful fitting tool, the sample values b1 and b2 can be unduly influenced by outliers and they have inherent limitations. In addition, the relative simplicity of the two-parameter log-normal distribution makes it reasonable to enquire about the adequacy of Exposure, 38 0.1876 308.13 1.635 0.371 0.97 0.38 Facility A Serum, 38 1.68 65.0 1.782 0.528 0.94 0.08 Facility A a bivariate log-normal distribution. Figure 4 is a plot of the logarithms of serum bromine versus bromopropane exposure. While a linear regression is possible for this data, the plot indicates a remaining nonlinearity that would lead to bias in the relationship, particularly at the ends of the data ranges. This is consistent with results in the next subsection rejecting the bivariate log-normal distribution for this data. Assessing the bivariate fits As Johnson (1949b) noted, the marginal distributions of the bivariate SBB distribution are SB distributions; however, the converse need not be true. Thus, a procedure for assessing the goodness of fit for the bivariate distribution is employed here using the p-variate Zp statistic described by Surucu (2006). For bivariate distributions p 5 2 and Zj The individual Zj values are algebraic functions of the sample order statistics, and the common variance V is obtained through simulation. The Zp statistic is distributed asymptotically as chi-square with p degrees of freedom. Surucu (2006) provides simulation results for V as a function of sample sizes from 10 to 100. For Facility A data, the Z2 statistic for the SBB distribution is 1.45, with a critical chi-square value of 4.605. The SBB distribution is judged a good fit at the 10% level. The Z2 statistic for the bivariate lognormal data is 13.7 and is rejected as a potential model. Based on these results the median regression (equation 12) was done using the Excel Data Tools statistical package. Table 3 summarizes the results. The correlation coefficient is 0.91 and the values for theta and phi are 0.572 and 0.639, respectively. Figure 5 shows a plot of the bivariate SB median regression results with the regression equation noted on the graph. An improved linearization relative to the log-normal plot in Fig. 4 is evident. Fig. 4. Bivariate log-normal plot of Facility A, serum bromine versus bromopropane exposure, showing a residual non-linearity. Fig. 5. Bivariate SBplot of the transformed serum bromine biomarker versus the transformed airborne exposure measurements of 1-bromopropane for Facility A. Array distributions At this point it is reasonable to explore the bivariate regression as a tool to stratify workers by exposure based on the serum bromine levels. Thus, within Facility A we ask what is the distribution of exposure for a given blood serum level of bromine. This is obtained from the conditional array distribution of y1 given y2 (an SB distribution) by using equation (8) with subscripts interchanged. With the conditional values of gamma and delta thus obtained for the array distributions, it is possible to calculate the mean and variance of the concentrations. Table 4 presents these results for several different conditional values of blood serum bromine and Fig. 6 is a plot of the exposure mean (squares) 1 SD (light solid lines) overlaid on the raw data. Also included in Fig. 6 are the conditional mean exposures calculated using the bivariate log-normal distribution (triangles and heavier solid line). As noted above this distribution was rejected based on the results of the Z statistic test. The calculation for the first row in Table 4 is illustrated here. The mean and standard deviation for the exposure (c1), given a bromine blood serum level of 5 mg dl 1 (c2), is obtained from the conditional array distribution defined by the following equations: Table 3. Median regression results f2 (serum bromine) versus f1 (bromopropane exposure) Variable Facility A Mean y1 Standard deviation of y1 Median exposure (p.p.m.) 7.39 Mean exposure (p.p.m.) 11.43 Serum bromine (mg dl 1) Standard deviation of exposure (p.p.m.) d1 j c2 5 d11 d1 j c2 5 0:3712:412 5 0:895: These conditional values for c1 and d1 are input to a computer code (Flynn, 2005) that solves the equations given in the Appendix to calculate the mean and standard deviation for y1. The exposure (c1) values are recovered using the values for e1 (0.1876 p.p.m.) and k1 (308.13 p.p.m.), which do not change for the array distributions. CONCLUDING REMARKS The primary goal of this work is to present a methodology for the analysis of concentration data based on the Johnson family of bivariate distributions and to illustrate its application to the exposurebiomarker problem. In many cases of interest, exposure concentrations and biomarker data will be positively skewed with standardized third and fourth sample moments (i.e. b1, b2) above the log-normal line in the skewnesskurtosis plane. Under these conditions, the Johnson SB distribution is a potential candidate to fit the marginal distributions, and the question of a bivariate SBB fit can be explored subsequently. If a bivariate fit can be obtained in the Johnson system of distributions, then convenient equations exist within the median regression format to specify the conditional or array distributions. In many cases differentiation of exposure or dose is done based on mean values and recourse to further calculation or tabulated values is needed. Table 5 gives an overall summary of the process described here with specific focus on the SBB distribution, although any one of the 16 possible Johnson bivariate distributions would follow an analogous approach. Although the goodness of fit is always an important consideration in selecting a probability model for describing data, other issues may also be relevant. In particular, the extent to which the distributions are consistent with the fundamental physical principles Table 5. Overview of methodology for modeling exposurebiomarker data Plot biomarker versus exposure data to assess initial need for transformation Calculate sample coefficients of skewness and kurtosis and plot on the skewnesskurtosis plane; select candidate distributions: SB, SL or SU Fit each marginal distribution (assume both are SB here) Estimate c and d from a look-up table or graph Use equations (4) and (5) to calculate initial estimates of e and k If better fits are needed, optimize W statistic to get final values for e and k Calculate final values for c and d with equation (3) Using final values transform data to either f or z values by equations (1)(3) Evaluate the bivariate distribution using either f or z values, using equation (13) Perform median regressions, e.g. equation (12), to get h, u and q Calculate array distributions as needed [see, e.g. equation (14)] Calculate or use tables for means and standard deviations as needed of the process. In exploring the relationship between exposures and biomarkers, there is a desire to move beyond statistical association and to mechanistic models and causality, interfacing especially with physiologically based pharmacokinetic models. N B 5 2pd X exp sin2n 1pdccosech2n 2n2p2d2 cos 2npcd ; The mean of y is (Johnson, 1949a) where / 5 A B and w 5 CD n2=2d2 l29 5 l d The variance of the y is where l92 is the second moment of y about the origin, and The partial derivatives are 1 XN n exp 5 2pd2 X 2n 1exp cos2n 1pdccosech2n Application of the chain rule to the expression (A7) and simplification yields the following explicit expression for the variance of y: 2n2p2d2 sin 2npcd : REFERENCES Bowman KO, Serbin CA. (1981) Explicit approximate solutions for SB. Commun Stat Sim Comput A; B10: 115. Flynn MR. (2004a) The 4 parameter lognormal (Sb) model of human exposure. Ann Occup Hyg; 48: 61722. Flynn MR. (2004b) The beta distributiona physically consistent model for human exposure to airborne contaminants. Stoch Environ Res Risk Assess; 18: 30608. Flynn MR. (2005) On the moments of the 4-parameter lognormal distribution. Commun Stat Theor Methods; 34: 74551. Flynn MR. (2006) Fitting human exposure data with the Johnson SB distribution. J Exp Anal Environ Epid; 16: 5662. Hanley KW, Petersen M, Curwin BD, Sanderson WT. (2006) Urinary bromide and breathing zone concentrations of 1-bromopropane from workers exposed to flexible foam spray Adhesives. Ann Occup Hyg; 50: 599607. Hill AB. (1965) The environment and disease: association or causation? Proc R Soc Med; 58: 295300. Johnson NL. (1949a) Systems of frequency curves generated by method of translation. Biometrika; 36: 14976. A10 A11


This is a preview of a remote PDF: https://annhyg.oxfordjournals.org/content/51/6/533.full.pdf

MICHAEL R. FLYNN. Analysis of Exposure–Biomarker Relationships with the Johnson SBB Distribution, Annals of Occupational Hygiene, 2007, 533-541, DOI: 10.1093/annhyg/mem033