Bayesian correction for covariate measurement error: a frequentist evaluation and comparison with regression calibration

Bayesian approaches for handling covariate measurement error are well established, and yet arguably are still relatively little used by researchers. For some this is likely due to unfamiliarity or disagreement with the Bayesian inferential paradigm. For others a contributory factor is the inability of standard statistical packages to perform such Bayesian analyses. In this paper we first give an overview of the Bayesian approach to handling covariate measurement error, and contrast it with regression calibration (RC), arguably the most commonly adopted approach. We then argue why the Bayesian approach has a number of statistical advantages compared to RC, and demonstrate that implementing the Bayesian approach is usually quite feasible for the analyst. Next we describe the closely related maximum likelihood and multiple imputation approaches, and explain why we believe the Bayesian approach to generally be preferable. We then empirically compare the frequentist properties of RC and the Bayesian approach through simulation studies. The flexibility of the Bayesian approach to handle both measurement error and missing data is then illustrated through an analysis of data from the Third National Health and Nutrition Examination Survey.


Introduction
Many epidemiological studies are affected by measurement error in one or more of the covariates of interest. It is well known that error in covariates results in biased estimates of true covariate(s)-outcome associations and in a loss of power to detect such associations. 1 In this paper, we focus on correcting for the effects of measurement error in continuous covariates in three models which are commonly used in epidemiological analysis; linear regression models for continuous outcomes, logistic regression models for binary outcomes, and Cox proportional hazards models for survival or time to event outcomes.
Throughout most of the paper, we will focus on the situation in which there is one main exposure of interest, which is subject to measurement error, and one or more other covariates to be adjusted for, which are assumed to be measured without error. The variable which is measured with error could equally be one of the confounders, and indeed the approaches we describe also extend to the more general case of multiple covariates measured with error. While exposure measurement error is commonly prioritized, measurement error in confounders is also a serious and highly prevalent issue, and causes estimates of exposure effects to only be partially adjusted for the poorly measured confounder. Error in continuous variables can take a number of forms. The most simple and most commonly assumed form is the classical measurement error model, under which the measured exposure is equal to the true exposure plus an independent random error term. Under this model, the measured exposure is an unbiased measure of the true exposure. The error terms are assumed to have zero mean and, typically, constant variance.
To make corrections for the effects of covariate measurement error in regression models requires some information about the relationship between the true exposure and the measured exposure, i.e. regarding the parameters of a measurement error model. One way of gaining information about the error model is to use a validation study within the main study sample, in which the true exposure is observed alongside the measured exposure. It is often not feasible or even possible, however, to obtain a validation sample and a more common alternative is to obtain one or more replicate observations of the exposure for a subset of individuals within the main study sample. We refer to this as a replication study. In this paper, we focus on replication studies. Many methods have been described for correcting for the effects of measurement error in regression models. 1 The most widely used correction method is regression calibration (RC), which is popular due to its simplicity and applicability in different types of regression models. In RC, the true exposure, which is unobserved in the main study sample, is replaced when fitting the outcome regression model by the expected value of the true exposure, conditional on the measured exposure and the other error-free covariates for each individual. RC gives consistent estimates of the true associations between the explanatory variables and the outcome in a linear regression model, and approximately consistent estimates in non-linear models, including logistic regression models 2,3 and Cox proportional hazards models. 4 RC has some drawbacks, however. First, for non-linear models estimates can have moderately large biases even when the sample size is large, particularly if the effect size (odds ratio or hazard ratio) is large. 5 Second, RC does not automatically accommodate uncertainty in the parameters indexing the measurement process. Measures of uncertainty require use of approximate methods (the 'delta method' approach), bootstrapping methods, which are computationally intensive, or estimating equation methods, whose validity relies on asymptotic conditions and are complex to implement in practice. Third, extending the basic RC approach to more complex situations, such as when the outcome model is assumed to depend on non-linear functions of the true covariate, 6 or when the measurement error model is more complex (e.g. heteroscedastic error 7 ), is not trivial.
The Bayesian approach has been often advocated as a natural route to accommodating sources of uncertainty, including measurement error, misclassification, and missing data. Early papers include those by Richardson and Gilks,8,9 who described a Bayesian approach to handling measurement error. By taking a Bayesian approach to handle covariate measurement error, uncertainty in the parameters indexing the measurement process is automatically accommodated. Like the method of maximum likelihood (ML), the posterior distributions involved typically involve intractable likelihoods, but this difficulty is obviated by Markov Chain Monte Carlo (MCMC) methods, which are now implemented in a number of standard and Bayesian-specific packages. A further strength is that these software packages allow one to define and fit quite complex user defined Bayesian models, meaning that there is great flexibility in adapting the modelling assumptions to the situation at hand. Lastly, and in contrast to methods such as ML or multiple imputation (MI), whose inferences typically rely on various large sample assumptions (e.g. to handle nuisance parameters or in deriving simple imputation combination rules), Bayesian methods do not. In the setting of covariate measurement error, estimators which allow for the error typically have skewed sampling distributions, and this is automatically accommodated in a Bayesian approach, since the entire posterior distribution is simulated.
Despite excellent book length treatments of covariate measurement error methods, 1 including one specifically focusing on Bayesian methods, 10 in our view it nevertheless continues to be underused by the epidemiological and clinical research communities. This may be for a number of reasons, but principal among them may be the apparent need to move from a frequentist to a Bayesian inferential approach, and the fact that standard statistical packages have (with exceptions) not enabled such Bayesian models to be fitted. To the first of these reasons, as has been noted by others (e.g. Little 11 ), Bayes procedures often have good frequentist properties, and indeed in small samples can have better frequentist properties than ML methods. As such, one may be able to use a Bayesian method without necessarily adopting the Bayesian inferential paradigm. To this end, we present simulation results to examine the frequentist properties of the Bayesian approach to covariate measurement error, using certain default priors. To the second reason, major steps forward have been made over the last 25 years in terms of accessible MCMC software, such that software and computational power are usually not a hinderance to using a Bayesian approach. Moreover, we make all of our code available online to facilitate increased use of the Bayesian approach.
In Section 2, we begin by describing the assumed setup and notation for the covariate measurement error problem. Next, in Section 3, we review the RC approach. In Section 4, we describe the Bayesian approach, both in terms of modelling choices and statistical properties, and its practical implementation. We contrast the Bayesian approach with ML and MI in Section 5. In Section 6, we evaluate the frequentist properties of RC and Bayesian analysis in a series of simulation studies of the most common outcome model types. In Section 7, we present results of illustrative analyses using data from the Third National Health and Nutrition Examination Survey (NHANES III). We conclude in Section 8 with a discussion.

Setup and notation
In this section, we describe the general setup used for the remainder of the paper.

Outcome model
We assume data are available for an i.i.d. sample of n individuals. For individual i, we let Y i , X i , and Z i , respectively, denote the outcome, true covariate which is subject to measurement error and error-free covariates. We consider three types of outcome models for Y i (i) linear, (ii) logistic, and (iii) Cox proportional hazards regression. We assume that the outcome model includes only main effects of X i and Z i . For a linear regression outcome model, we thus assume that where i $ Nð0, 2 Þ is an independent normally distributed residual error. For a logistic regression outcome model, we assume that Lastly, in the case of a censored time to event outcome, the outcome Y i ¼ ðT i , D i Þ where T i denotes the observed event or censoring time and D i denotes the event indicator. We then assume a Cox proportional hazards outcome model, such that the hazard given X i and Z i is given by where h 0 ðtÞ denotes the baseline hazard function. In the standard frequentist analysis based on the Cox proportional hazards model, the baseline hazard is left unspecified, and inferences about the hazard ratio parameters ð X , Z Þ are made via a partial likelihood. 12

Measurement error model
We assume that for each study individual, an error-prone measurement W i1 is available, rather than the covariate of interest X i . We assume a classical error model where EðU i1 jX i Þ ¼ 0. We also assume that the errors U i1 are independent of all other random variables. This implies that the error is non-differential with respect to the outcome Y i . To permit estimation of the measurement error variance, we assume the existence of an internal replication substudy. This means that for a randomly selected group of individuals a second error-prone measurement W i2 ¼ X i þ U i2 is obtained, where the error U i2 is assumed independent of U i1 . We let W i denote the vector of error-prone measurements on individual i, and let N i denote its length, which is two for those individuals in the replication sub-study and one for those not. In the following, we specify further assumptions as required by RC and Bayesian methods.

Regression calibration
In the simplest version of RC, the outcome model is fitted as usual, with the unobserved X i replaced by an estimate of EðX i jW i1 , Z i Þ. 13 Typically, the latter conditional expectation is assumed to be linear in W i1 and Z i , and it can be estimated by linearly regressing W i2 on W i1 and Z i in the individuals from the internal replication sub-study.
We note that this version of RC does not rely on an assumption that the two errors U i1 and U i2 have the same variance.
If one is willing to make additional assumptions, a somewhat more efficient version of RC can be used, in which X i is replaced by an estimate of EðX i jW i , Z i Þ, and the parameters involved in the latter are estimated using all study individuals. A common assumption is to assume that XjZ Á and that the measurement errors U i1 and U i2 are normally distributed with mean zero and common variance 2 U . The parameters can be estimated by ML as a random-intercepts mixed model for the W i conditional on Z i . It then follows from standard properties of the multivariate normal distribution that X i jW i , Z i is normally distributed, with where W i denotes the mean of individual i's N i error-prone measurements.
As noted in the introduction, RC gives consistent parameter estimates in the case of a linear outcome model. For logistic and Cox outcome models, RC is approximately consistent. Armstrong first gave justification for RC in generalized linear models under the assumption that VarðX i jW i Þ is 'small', using the delta method. 2 This condition will be satisfied when the measurement error variance is 'small'. Rosner et al. 3 later justified its use in logistic regression under the assumptions that the outcome is rare and that X i jW i is normal. Subsequently, Kuha 14 showed that RC could be justified as an approximate method for logistic regression provided that 2 X VarðX i jW i Þ is small, without the rare outcome assumption. This condition holds when X is small or the measurement error variance is small. For a Cox proportional hazards outcome model, RC can be justified when the event rate is low or the measurement error variance is small. 4,15 For valid inferences, the estimation of the parameters involved in EðX i jW i , Z i Þ should be allowed for. One approach is to use bootstrapping. Alternatively, it is possible to construct sandwich variance estimators by stacking the estimating equations used in the two stages. 1 One drawback with this approach is that the resulting Wald type symmetric confidence intervals do not reflect the asymmetric sampling distribution of the RC estimator, which may lead to confidence interval coverage which deviates from the nominal level.

Bayesian approach
In this section, we describe the key elements of a Bayesian analysis of the covariate measurement error problem.

Model specification
First, we specify a joint parametric model for ðY i , X i , W i1 , W i2 jZ i Þ. We condition on the fully observed Z i , thereby avoiding the need to model its distribution. Assuming that the measurement error is non-differential with respect to both Y i and Z i , this joint model can be decomposed as The first component is the outcome model, which contains regression parameters of primary interest ¼ ð 0 , X , Z Þ in the case of linear or logistic regression and ¼ ð X , Z Þ in the case of Cox regression, and possibly additional parameters (e.g. a residual variance in the case of a linear regression outcome model). The second component is the measurement model, and as described previously the simplest assumption is that the error-prone measurements W ij follow a classical error model, with independent normally distributed errors U ij $ Nð0, 2 U Þ. The final component specifies a model for the unobserved covariate X i , conditional on Z i , with a default choice being a normal linear regression model. We return later to questions of robustness and to model extensions to relax such distributional assumptions. In the case of a Cox proportional hazards outcome model, the outcome Y i has two components ðT i , D i Þ, and the additional parameters, , denote the baseline hazard function H 0 ðtÞ.

Prior specification
In the Bayesian approach, we must specify priors for the model parameters. The first 'Bayesian' analyses made use of flat or constant priors, based on the notion that these represent a priori ignorance regarding the value(s) of the model parameter(s). 16 The key issue with such priors is that while a flat prior expresses ignorance on one scale, a transformation of the parameter implies a non-flat prior on the transformed parameter. The latter half of the 20th century witnessed the growth of the subjective Bayesian approach, in which the analyst carefully choose the priors to represent their beliefs about the model parameters in advance of seeing the data. Arguably the majority of Bayesian analyses which are now performed by researchers make use of so called non-informative or reference priors. 17 Such priors do not (and cannot) represent total ignorance about the model parameters, but can be viewed as default priors that one might use when subjective prior information is either not available, or one does not want to use such information in the analysis. The intention of such priors is usually that they have minimal impact on inferences.
For the joint model in equation (4), prior independence is typically assumed for the parameters in the three submodels. For the outcome model regression coefficients and the coefficients in , a common default prior is a very diffuse normal prior centred at zero. For the variance parameters, the conjugate inverse Gamma distribution has traditionally been advocated. In the context of adjustment for covariate measurement error, Gustafson has proposed using a Gað0:5, 0:5Þ prior for the precision (reciprocal of variance) parameters. 10 This prior equates to the likelihood that would be obtained from one observation, with the best guess for the precision of one.
Bayesian analysis of the Cox model requires specification of a prior for the baseline cumulative hazard process H 0 ðtÞ in addition to priors for the regression coefficients and the other sub-model parameters. The prior distribution for the baseline cumulative hazard process H 0 ðtÞ is assumed to be independent of the other priors, including that for . Here, we use a Gamma process prior for H 0 ðtÞ as described by Kalbfleisch 18 and Sinha et al., 19 denoted H 0 ðtÞ $ GPðcH Ã 0 , cÞ, where H Ã 0 ðtÞ is a prior guess at the mean and c is a parameter which represents the confidence in that guess, with small values of c corresponding to a diffuse prior. We let t ð1Þ 5 t ð2Þ 5 Á Á Á 5 t ðn Ã Þ denote the ordered observed event times. Under the assumption that the hazard is degenerate at 0 except at the observed event times T i where D i ¼ 1 it follows from the Gamma process prior for H 0 ðtÞ that the increments in the cumulative baseline hazard from time t ð j Þ to time t ð jþ1Þ (j ¼ 1, . . . , n Ã À 1) have independent Gamma distributions; dH 0 ðt ð j Þ Þ $ GammaðcðH Ã ðt ð jþ1Þ Þ À H Ã ðt ð j Þ Þ, cÞ. In the later application of this approach, we use H Ã ðt ð jþ1Þ Þ À H Ã ðt ð j Þ ¼ rðt ð jþ1Þ À t ð j Þ Þ where r is a guess at the event rate per unit time. It can be shown 18,19 that under the Gamma process prior for the cumulative hazard the likelihood for ð, H 0 ðtÞ, cÞ tends to the partial likelihood in the limit as c tends to 0, and that it tends to the full likelihood with H 0 ðtÞ ¼ H Ã as c tends to infinity.

Posterior inference and simulation
Given specification of the model and priors, Bayesian inference is then based on the posterior distributions of the model parameters. For the purposes of point estimation, the posterior mean is commonly used. To form a 95% credible interval for a particular parameter, we take the 2.5% and 97.5% centiles of the posterior distribution. An advantage of this in the present context of adjustment for covariate measurement error is that asymmetry in the posterior distribution, which typically occurs when adjusting for covariate measurement error, is automatically accounted for in credible intervals.
Except for very specific choices of model and prior, in general the posteriors are not available analytically. Instead, we can utilize MCMC methods to simulate draws from the posteriors distributions (see e.g. part III of Gelman et al. 20 ). The most common approach is the method of Gibbs sampling, in which taking each parameter in turn, a new value is drawn from its full conditional distribution given all other quantities. Often these conditional distributions do not belong to standard parametric families, necessitating the use of more sophisticated sampling techniques (see Robert 17 and Gelman et al. 20 for further details). However, these are implemented in the software packages we describe in the following, such that the analyst need not generally concern themselves with the details.

Frequentist properties
Under certain regularity conditions, as the sample size tends to infinity, the choice of prior has no impact on the posterior distribution, since the latter is then dominated by the likelihood function. Consequently, Bayes estimators and uncertainty intervals enjoy the same large sample properties as maximum likelihood methods: the Bayes posterior mean estimator is consistent, asymptotically normal, and efficient. 20 . In reality of course, all samples are finite, and the choice of prior can sometimes have a material affect on inferences.
Importantly, however, in small samples or sparse data situations, Bayesian methods can have better frequentist properties than ML procedures, particularly if sensible priors are adopted. 21

Software
The explosion of Bayesian data analyses being performed over the last few decades is largely thanks to both the MCMC methods developed and their implementation in accessible software. Chief among these is the WinBUGS software package, developed in the 1990s. 22 It allows the user to define, using a simple language syntax or graphical interface, the model and priors. MCMC methods are then automatically chosen by the package, depending on the specified model and priors. One can then run the MCMC sampler, and after a sufficient number of burn-in iterations, draws can be saved as draws from the respective posterior distributions. More recently, new packages have been developed, with developments in various directions. These include the OpenBUGS project (www.openbugs.net) and Stan (mc-stan.org). In the simulations described later, we make use of the Just Another Gibbs Sampler (JAGS) program, 23 whose model language is very similar to the BUGS language used by WinBUGS, and which can be easily called from R.
5 Maximum likelihood and multiple imputation 5.1 Maximum likelihood ML estimation and inference is based on the likelihood function, but unlike the Bayesian approach, does not involve specification of prior distributions for parameters. ML methods enjoy many favourable frequentist properties -assuming correct model specification the ML estimator is consistent, asymptotically normal, and efficient. In the specific context of adjustment for covariate measurement error, a drawback of ML is that the likelihood function typically involves intractable integrals, 1 such that numerical methods such as numerical integration are required in order to obtain estimates. The same obstacle is overcome in the Bayesian approach through the use of MCMC methods. A further drawback of ML in the present context is that in small samples inference based on symmetric Wald based confidence intervals may perform poorly due to the lack of regularity of the likelihood function. Software to fit user defined models which allow for covariate measurement error is also somewhat limited. Lastly, the absence of prior distributions prevents the incorporation of external information which may sometimes be available regarding the measurement process.

Multiple imputation
MI has become an extremely popular approach for handling missing data and has also been advocated as an approach for handling measurement error, in which X i is multiply imputed. [24][25][26][27] There is a very close connection between MI and 'direct' Bayesian inference. In its originally devised form, MI is based on repeatedly drawing imputations of missing values from their posterior distribution based on a Bayesian imputation model. The analysis model of interest is then fitted, typically using ML, to each of the complete datasets. The resulting estimates and standard errors are then pooled using rules developed by Rubin. 28 MI can most directly be viewed as an approximation to a full Bayesian analysis, 29 although its frequentist properties can of course also be evaluated. 30 From the Bayesian perspective, application of MI and Rubin's rules can be viewed as a particular route to performing a Bayesian analysis, in which one effectively assumes that the posterior distributions for the parameters are normally distributed.
As described by Carpenter and Kenward 29 in the context of missing data, there are a number of settings where use of MI may be preferable to a direct Bayesian analysis. However, in the context of covariate measurement error we argue that the advantages of a direct Bayesian approach far outweigh the disadvantages, relative to MI. First, when only replicate error-prone measurements are available, standard software for performing MI cannot be applied, since X i is missing for all individuals. Second, standard parametric imputation models which might be used to impute X i in general may not be compatible with the assumed outcome model. 31 This will in particular occur when the outcome model is itself non-linear, or the imprecisely measured covariate is assumed to have a non-linear effect on the outcome. Third, when allowance is made for covariate measurement error, as noted earlier, the posterior distributions for the outcome model parameters are typically skewed in small to moderate sample sizes, such that symmetric credible/confidence intervals constructed using Rubin's rules may perform poorly, either from a subjective Bayesian perspective or in a frequentist evaluation. Lastly, we note that software for performing Bayesian inference will typically also permit saving of the imputed values of X i as a by-product, such that if the analyst really wants imputed datasets they can still be obtained.

Simulations
In this section, we present simulation results for the cases of a linear, logistic, and Cox proportional hazards outcome model, comparing the popular RC approach with the Bayesian approach. We adopt the standard frequentist type simulation setup, in which datasets are repeatedly generated using fixed population parameter values.

Linear regression
We first present simulation results for a simple linear regression outcome model. Datasets of size n ¼ 1000 were simulated, with covariates X i and Z i drawn from a bivariate normal distribution with means 0, variances 1, and covariance 0.25. Continuous outcomes Y i were generated from the linear regression model given in equation (1), with 0 ¼ 0 and X ¼ Z ¼ 1. The normally distributed residual variance 2 was chosen in order to given R 2 ¼ 0:1, 0:5, 0:9. Each individual had an error-prone measurement A random subset of 10% of individuals had a second error-prone measurement, with the same error variance 2 U . This variance was chosen such that the unconditional reliability ¼ 2 X =ð 2 X þ 2 U Þ took values 0:5, 0:7, 0:9, corresponding to low, moderate, and high reliability.
We first estimated the outcome model parameters using RC, by fitting a random-intercepts model for the errorprone measurements, with Z i entering as a fixed effect, as described in Section 3. Next, we fitted a Bayesian model, calling JAGS from R using the rjags package. We adopted non-informative priors for all model parameters, following those proposed by Gustafson. 10 Specifically, we assumed independent normal priors for 0 , X , Z , 0 , Z , with mean 0 and variance 10,000, and inverse gamma IGð0:5, 0:5Þ priors for each of the variance parameters. As discussed by Gustafson, the latter prior can be thought of as being equivalent to a best guess for the variance of one, coming from a single observation. We ran five parallel chains with 1000 burn-in iterations and 5000 main iterations. If the Rubin-Gelman convergence statistic Rhat was greater than 1.05 for any of 0 , X , Z , we extended the chains until this was met. Table 1 shows the simulation results, with 1000 simulations per scenario. For Bayes, convergence to stationarity (as defined previously) was achieved in all simulations, although additional iterations were sometimes required to achieve this, particularly when reliability was 0.5. RC had slight upward bias for X when the reliability of the error-prone measurements was 0.5, and was unbiased for the higher reliability values. The Bayes mean estimator was upwardly biased for reliability equal to 0.5 and R 2 ¼ 0:1 and R 2 ¼ 0:5. Inspection of the estimates showed that the sampling distribution of the Bayes mean estimator had greater skew than the RC estimator, with the larger estimated values inducing the upward bias. For reliability of 0.5 and 0.7, and R 2 ¼ 0:9, the Bayes estimators had lower empirical standard deviation (SD) than the RC estimator and consequently had lower mean squared error in these scenarios. For the other scenarios, the RC and Bayes estimators performed similarly. Lastly, the 95% Bayesian credible intervals had frequentist coverage close to 95%. We emphasize that the performance of the Bayesian estimator depends on the choice of priors. In particular, the use of more informative priors for the measurement error variance and the variance for XjZ could be used to reduce the bias and variability of the Bayesian estimators. To investigate this, we performed a further set of simulations, performing the Bayesian analysis with three different sets of priors. The first (Bayes 1) used the same priors as described previously. The second (Bayes 2) used the same priors except that the prior for 2 XjZ was replaced by a Beta(3, 1) prior for the conditional (on Z) reliability jZ ¼ 2 XjZ =ð 2 XjZ þ 2 U Þ. This prior corresponds to assuming that the (conditional) reliability is unlikely to be small (% 0:015 chance of being less than 0.25), and it can be expected to stabilize estimates of this parameter (and therefore also the outcome model parameter estimates). The third (Bayes 3) instead used a Betað10, 10ð1 À Ã jZ Þ= Ã jZ Þ prior for jZ , where Ã jZ denotes the true value of the parameter for each scenario. The mean of this prior is Ã jZ , and so this corresponds to using a prior that is centred at the correct value of this parameter. In order to allow the different priors to have a larger influence, the sample size was reduced to 250 subjects, and only 25 subjects had a second error-prone replicate measurement W i2 . Table 2 shows the results. When the (unconditional) reliability was 0.5, RC showed some upward bias. This is due to occasionally the estimated conditional reliability being very small, leading to a large estimate of X . Bayes 1 performed worse than RC with this reliability, except for R 2 ¼ 0:9, where Bayes 1 had little bias and was much less variable. This can be attributed to the fact that the Bayes approach essentially imputes the unobserved X using all other variables, and when R 2 is large, the outcome Y enables X to be imputed with little variability. Bayes 2 had similar bias to RC when the reliability was 0.5 and R 2 was 0.1 or 0.5, but had much lower variability. As we would expect Bayes 3 had even lower bias than Bayes 1 and Bayes 2 (as well as RC), and was less variable. Results for reliability equal to 0.7 were qualitatively similar, again demonstrating that the use of stronger priors reduced bias and variability. For reliability of 0.9, the three Bayes estimators performed very similarly, but had slight upward bias (in contrast to RC, which had little bias).

Logistic regression
Next, we performed simulations with a logistic regression outcome model. The covariates X i and Z i , and errorprone measurements were generated as described previously. The binary outcome Y i was then generated according to a logistic regression model (2). The intercept 0 was chosen so that PðY i ¼ 1Þ ¼ 0:2 approximately. We performed simulations with log odds ratios X ¼ 0:1, 0:5, 2, representing small, moderate, and large effects of X i . As before, we set Z ¼ X .
RC was implemented as described previously. For the Bayesian approach, we again used independent normal priors for each of 0 , X , and Z . For 0 we used, as before, the non-informative prior 0 $ Nð0, 10000Þ. For X and Z , we adopted the Nð0, 1:38Þ prior suggested by Hamra et al. 32 As described by Hamra et al., the use of such a mildly informative prior can help in terms of stabilizing estimates. This prior corresponds to assuming, a priori, that we are 95% sure that the odds ratios expð X Þ and expð Z Þ lie between 0.1 and 10, an assumption that is arguably generally reasonable in most epidemiology studies (provided the predictor has been suitably standardized). For the parameters and the variance parameters, we assumed the same priors as in the case of the first set of linear regression simulations. Table 3 shows the results of the simulations. For X ¼ 0:1, RC and Bayes performed very similarly, both being essentially unbiased and having similar empirical SD. For X ¼ 0:5 and reliability of 0.5, while RC is unbiased, Bayes showed some upward bias and was more variable than RC. For reliability ratios of 0.7 and 0.9, the performance was similar for both. For X ¼ 2 and reliability of 0.5, RC showed downward bias, consistent with the known properties of RC for logistic regression, in that bias is larger for large covariate effects. In contrast, Bayes showed only a slight downward bias. The bias of RC reduced, again as expected, as the reliability was increased to 0.7 and then 0.9, although some downward bias remained even for the latter case. In contrast, Bayes estimates were essentially unbiased. Lastly, the Bayesian 95% credible intervals had approximately 95% coverage across all scenarios.

Cox proportional hazards regression
Lastly, we performed simulations for time-to-event data based on a Cox proportional hazard model. The covariates X i and Z i were generated as described for linear regression. Event times T i were generated according to the Weibull hazard model hðtjX i , Z i Þ ¼ lt À1 e X X i þ Z Z i . We used ¼ 2, and was chosen so that approximately 10% of individuals had an event time before the end of follow-up which was fixed at time 10 (e.g. 10 years); the remaining individuals were censored at time 10 (D i ¼ 0). We performed simulations with log hazard ratios X ¼ 0:1, 0:5, 2 and as before we set Z ¼ X . RC was performed as described previously. For the Bayesian approach, we used independent normal priors for X and Z , and we chose the Nð0, 1:38Þ prior as was used in the logistic regression simulations, corresponding here to an assumption that we are 95% sure that the hazard ratios expð X Þ and expð Z Þ lie between 0.1 and 10. For the parameters and the variance parameters, we assumed the same priors as in the case of linear and logistic regression. As outlined in Section 4.2, we assumed a process prior for the baseline cumulative hazard which implies Gamma priors for the increments in the hazards; dH 0 ðt ð j Þ Þ $ GammaðcðH Ã ðt ð jþ1Þ Þ À H Ã ðt ð j Þ ÞÞ, cÞ. We used c ¼ 0.001, representing low confidence in the prior mean of the Gamma Process, H Ã 0 ðtÞ. We used H Ã ðt ð jþ1Þ Þ À H Ã ðt ð j Þ Þ ¼ rðt ð jþ1Þ À t ð j Þ Þ with r ¼ 0.01, since the data were simulated so that 10% of individuals have the event during 10 time units of follow-up. The analysis requires the data to be specified using a counting process format, and this is illustrated in example code given online. Due to the higher computational burden of fitting the Cox model, only three parallel chains were used, and 100 (rather than 1000) simulations were performed for each scenario. Table 4 shows the simulation results. For X ¼ 0:1 and X ¼ 0:5, RC and Bayes performed very similarly across all three reliability values. For X ¼ 2 and reliability of 0.5, RC showed bias toward the null, in line with previous simulation evidence. 33 This bias was reduced as the reliability increased, although there was still downward bias with reliability of 0.9. In contrast, the Bayes estimator was much less biased. The Bayesian credible intervals had approximately 95% coverage across all nine scenarios. To explore the impact of an increased proportion of individuals failing, an additional simulation study was conducted in which 50% of individuals were observed to fail (and with a smaller sample size of n ¼ 100, each of whom had two error-prone replicates). The results (not shown) were qualitatively very similar to those in Table 4.

Illustrative example
To illustrate the potential flexibility and advantages of the Bayesian approach, we consider data from the NHANES III. NHANES III was a survey conducted in the US between 1988 and 1994 in 33,994 individuals aged two months and older. We consider a model relating known risk factors for cardiovascular disease (CVD) measured at the original survey to subsequent hazard of CVD. Mortality status at the end of 2011 is available through linkage to the US National Death Index, with cause of death classified using the ICD-10 system. Here, we consider illustrative analyses of the data available on those aged 60 years and above at the time of the original survey. Fitting Cox models to large datasets is very slow using JAGS, particularly for large datasets. We therefore considered inference for a Weibull regression model for hazard for death due to CVD, with age, sex, smoking status, diabetes status, and systolic blood pressure (SBP) at the time of the survey as covariates hðtÞ ¼ rt rÀ1 expð 0 þ 1 sbp i þ 2 sex i þ 3 age i þ 4 smoker i þ 5 diabetes i Þ where r is a shape parameter, 1 , . . . , 5 are log hazard ratios, and t is time since the original survey for each individual. We take the first SBP measurement taken at the original survey, sbp i1 , to be an error-prone measurement of each individual's underlying SBP, subject to classical error. An approximate 5% subset of individuals was (non-randomly) selected to participate in a second examination, during which SBP was again measured. This second exam took place on average 17.5 days after the first exam. We assume this second measurement of SBP, sbp i2 , to be an independent error-prone measurement of each individual's underlying SBP.
After deleting seven individuals who were missing diabetes status, data were available on 6519 individuals. Of these, by the end of 2011 1469 (22.5%) had died due to CVD, 3641 had died from other causes, and 1409 were still alive. In the Weibull model for hazard for CVD, we treat deaths from causes other than CVD as censorings; 5033 (77.2%) had an SBP measurement available from the first examination. An SBP measurement at the second examination was available in 401 (6.2%) of individuals. Unfortunately, smoking status was only recorded in 3433 (52.3%) of individuals. The analysis thus required handling of both the measurement error in the SBP measurements and the substantial missingness in the smoking and SBP variables.

Naive analyses
We first fitted the Weibull regression using sbp i1 (ignoring measurement error) to the 2667 complete cases, whose estimates are shown in the first column of Table 5. Strong evidence was found for independent associations between each of the covariates and hazard of death due to CVD, with associations in the expected directions. To check that a Weibull assumption was appropriate, we additionally fitted a Cox proportional hazards model with the same covariates. The estimates of the log hazard ratios between the two models were very similar, suggesting a Weibull assumption is reasonable here. Secondly, we performed a global test of the proportional hazards assumption using the Schoenfeld residuals following fitting the Cox model, which gave p ¼ 0.08, indicating no evidence to reject an assumption of proportional hazards. Through fitting a logistic regression model for the missingness indicator of the smoking variable, we found evidence that smoking was more likely to be missing for females, older individuals, diabetics, and those individuals with longer follow-up times. The latter finding suggests that the complete case analysis (CCA) may be biased. Next, we fitted the same Weibull model to the complete cases, again ignoring measurement error, using the Bayesian approach. We assumed an exponential prior for the shape parameter r with parameter 0.001. Rather than placing a prior on the log hazard ratios, we placed independent Nð0, 10 6 Þ priors on À k =r, since this leads to improved MCMC mixing. 34 Five independent chains were used, with 5000 burn-in iterations and 5000 main sample iterations. The estimates are 95% credible intervals are shown in Table 5. In line with theory, due to the large sample size, the Bayesian estimates and intervals were almost identical to those from ML.

Regression calibration
We then applied RC to the complete cases. To do this, we fitted a linear mixed model to the available SBP measurements, with a random effect for individual and fixed effects of sex, age, smoking, and diabetes. We also included a fixed effect to allow for a systematic shift in mean between the first and second exams. The resulting predicted true SBP values at exam one were then used as a covariate to fit the Weibull regression model. We used 2000 non-parametric bootstrap samples to obtain percentile 95% confidence intervals for the estimates, in order to take into account the two stage estimation process. Based on the mixed model fit to the complete cases, the estimated reliability (conditional on the error-free covariates) was 0.75. Adjusting for measurement error using RC led to the estimated log hazard ratio for SBP increasing, from 0.085 to 0.115, as expected by approximately 4/3 (1 divided by the reliability 0.75). Estimates for the other covariates did not materially change.
In order to apply RC to the full dataset, we contemplated use of its use in combination with MI. This is problematic, however. First, one could use MI to impute the missing smoking, sbp i1 and sbp i2 values. For example, one could apply the full conditional specification MI approach, imputing the smoking variable using logistic regression and the SBP variables using linear regression models. In these models, one must include the error-free covariates, plus the outcome. In the case of a time to event outcome modelled using a proportional hazards model, an approximately compatible imputation model for covariates includes the event indicator and an estimate of the cumulative hazard function. 35 Having generated the imputed datasets, RC could then be applied to each imputed datasets. However, in order to apply Rubin's rules, one requires valid within imputation estimates. As described in Section 3, for RC these can only be obtained by bootstrapping or by programming large sample theory estimating equation variance estimators. While both could in principle be programmed, they are not entirely straightforward to implement, and so we do not pursue MI in combination with RC.

Bayesian analyses adjusting for covariate measurement error
Next, we modified the Bayesian CCA to accommodate measurement error. We assumed that each individual's true underlying SBP around the time at which the first measurement was obtained, sbp i , was normally distributed conditional on smoking, sex, age, and diabetes, with Nð0, 10 4 Þ priors on the regression coefficients and a Gað0:5, 0:5Þ prior on the precision parameter. For the first SBP measurement, sbp i1 , we assumed with U i1 $ Nð0, 2 U Þ. For the second SBP measurement, sbp i2 , we assumed where is a parameter to allow for a systematic shift in mean between the two exams, and U i2 $ Nð0, 2 U Þ. The errors U i1 and U i2 are assumed to be independent. For À2 U a Gað0:5, 0:5Þ prior was assumed, and for a Nð0, 10 4 Þ prior was assumed. The posterior mean and credible intervals are shown in Table 5, under 'Bayes adj. CCA'. The results were very similar to those based on RC CCA, except that the credible interval for SBP was slightly narrower.
A strength of the Bayesian approach is its flexibility to simultaneously handle missing data and measurement error. To accommodate missingness in the smoking and SBP variables under a missing at random assumption, we assumed a model for the distribution of smoking, conditional on the fully observed error-free covariates sex, age, and diabetes. In our analysis, we assumed a logistic model for this conditional distribution with independent mean zero normal priors for the regression coefficients, each with variance 10,000. A major advantage of the Bayesian approach here is that the missing smoking and underlying SBP values are imputed by the Gibbs sampler, using the conditional distributions implied by a single well-specified joint model for the data. The posterior means were somewhat different to the RC and Bayes complete case analyses, and as expected the credible intervals were narrower, due to the inclusion of observed data from 3852 individuals. The changes in coefficient estimates may be indicative of bias in the complete case analyses.

Discussion
In this paper, we have empirically compared the frequentist properties of RC and Bayesian approaches to handling covariate measurement error. Our simulations demonstrate that for what might be considered a fairly typical epidemiological study setup, the methods often perform very similarly. As such, we believe that the Bayesian approach for measurement error adjustment may be as useful for the frequentist as for the Bayesian statistician. When the reliability of error-prone measurements was low, the Bayes estimator performed somewhat worse than RC. However, for larger effect sizes, RC was biased for logistic and Cox regression, while the Bayes estimator showed much less bias. A critical point to bear in mind is that there are infinitely many Bayes estimators, corresponding to the different choices of prior distributions -use of different priors could lead to, depending on the true data generating mechanism, better or worse performance. While some analysts dislike the Bayesian approach because of the requirement to specify priors, they give the analyst the opportunity to exploit external information about model parameters, potentially leading to more precise estimates. We have highlighted the fact that Bayesian estimators enjoy the same large sample frequentist properties as the method of ML, and also described the relationships between these approaches and the popular MI approach. Software for MI cannot be directly applied to handle covariate measurement error when replication data are available. Moreover, even when validation data are available, the covariate imputation models included in MI implementations may not be compatible with the analyst's outcome model. 31 A further strength of the Bayesian approach is that uncertainty intervals automatically allow for the skewness typically found in covariate measurement error adjusted estimators.
As has been noted by many authors before, a key strength of the Bayesian approach is its flexibility to handle more complicated models and data structures. As we have demonstrated in Section 7, the Bayesian approach can readily accommodate both covariate measurement error and missing data. Moreover, more complex measurement error models can in principle be used, for example, to allow for heteroscedastic error, systematically biased measurements, or more flexible modelling of the true covariate's distribution. 1 The flexibility of the Bayesian approach also lends itself to the problem of adjusting for covariate measurement error when the true covariate is assumed to have a complex non-linear association with the outcome. 36 In this paper, we have focused on the setting whereby internal replication data are available; the Bayesian approach readily handles the situation where validation data are instead available.
Nonetheless, the Bayesian approach has a number of drawbacks. As a fully parametric approach, a natural concern is sensitivity of inferences to distributional assumptions, particularly those about the unobserved true covariate and measurement errors. In this regard, the Bayesian approach can utilize more flexible model specifications, for example, by modelling the unobserved true covariate using a normal mixture model. 37 An important practical issue is that although the software available for fitting complex analyst defined models using the Bayesian approach has seen dramatic developments over the last 25 years, 22 fitting certain models (e.g. Cox proportional hazards models) can still take tremendously long. Although this concern will be progressively mitigated by increasing computational power, it is arguably still a material drawback. Further research and effort are therefore warranted to develop software implementations of the Bayesian approach which mitigate this.