Adjustment for the measurement error in evaluating biomarker performances at baseline for future survival outcomes: Time-dependent receiver operating characteristic curve within a joint modelling framework

The performance of a biomarker is defined by how well the biomarker is capable to distinguish between healthy and diseased individuals. This assessment is usually based on the baseline value of the biomarker; the value at the earliest time point of the patient follow-up, and quantified by ROC (receiver operating characteristic) curve analysis. However, the observed baseline value is often subjected to measurement error due to imperfect laboratory conditions and limited machine precision. Failing to adjust for measurement error may underestimate the true performance of the biomarker, and in a direct comparison, useful biomarkers could be overlooked. We develop a novel approach to account for measurement error when calculating the performance of the baseline biomarker value for future survival outcomes. We adopt a joint longitudinal and survival data modelling formulation and use the available longitudinally repeated values of the biomarker to make adjustment of the measurement error in time-dependent ROC curve analysis. Our simulation study shows that the proposed measurement error-adjusted estimator is more efficient for evaluating the performance of the biomarker than estimators ignoring the measurement error. The proposed method is illustrated using Mayo Clinic primary biliary cirrhosis (PBC) study.


Introduction
Due to current trends in medical practice towards personalised medicine, biomarkers have grown in importance in clinical studies. More and more studies are conducted to discover biomarkers that can accurately signal a clinical endpoint, e.g. measures of liver function such as prothrombin index as indicators of liver fibrosis, 1 and in clinical practice, rapid tests of biomarkers hold the promise of prompt diagnosis of diseases for an improved outcome, e.g. sepsis. 2 In this article, we refer the term "biomarker" to a single biomarker such as prothrombin index or to a composite risk score. A good biomarker can help identify patients who will have an early clinical benefit from a treatment or effectively guide the choice of therapeutic decisions, improving patients survival. However, due to imperfect laboratory conditions such as operator error, contamination, variable storage conditions, and limited machine precision, biomarkers are often subjected to substantial error in studies. 3 Failing to adjust for such measurement error may hinder the explanatory power of the biomarker, and in a direct comparison, useful biomarkers could be overlooked due to measurement error. 4,5 The performance of a biomarker is based on how well the biomarker is capable of discriminating between individuals who experience the disease onset (cases) from individuals who do not (controls). It is usually quantified by receiver operating characteristics (ROC) curve analysis, a well-established methodology in medical diagnostic research. 6 The area under the ROC curve (AUC) is an effective way to summarise the discriminative capability of the biomarker. AUC takes values from 0 to 1, and a biomarker with high AUC is considered better. A single biomarker value at baseline is mainly used in this assessment. Baseline time is an important time horizon in practice, as it is considered as the earliest time point of the patient follow-up time and provides the time base to assess the disease progression. However, individuals who are free of disease at baseline may develop the disease later in the follow up, and therefore, the assumption of fixed disease status over time may not be appropriate when evaluating the biomarker performance. Hence, incorporating the time dimension in ROC curve analysis has recently been actively researched, enabling better clinical guideline in medical decision based on biomarkers. The time-dependent ROC curve is usually derived from risk regression models such as Cox proportional hazards model as they naturally account for censored failure times. This ROC curve estimates the performance of baseline biomarker at future time points. For example, in a breast cancer study, time-dependent ROC curve was used to assess whether the patients are free from subclinical disease if the clinical disease does not emerge by two years of screening. 7 It has also been used to assess the predictive ability of the gene expression signatures in detecting early tumour response among metastatic colorectal cancer patients. 8 Lu et al. 9 identified a robust prognostic biomarker for tumour recurrence among lung cancer patients using time-dependent ROC curve analysis by estimating the AUC of the 51-gene expression signature at 60 and 100 months of follow-up. Using time-dependent AUC, Chen et al. 10 made direct comparison of five recently recognised serum biomarkers and identified those can be recommended for use in clinical practice to surveillance of cirrhosis for hepatocellular carcinoma patients. A comprehensive review of current time-dependent ROC curve analysis approaches is provided by Kamarudin et al. 11 However, Faraggi, 12 Reiser 13 and others concerned about ignoring the measurement error of biomarker values in ROC methodology, and showed that the effect can be substantial on the decision as to the diagnostic effectiveness of the biomarker.
As discussed by Henderson et al. 14 and many others, a framework such as joint longitudinal and failure-time outcome modelling is capable of avoiding biases not only due to informative missingness in biomarker measurement schedule, but also due to measurement error. In a joint model, both longitudinally repeated biomarker and censored failure-time processes is modelled simultaneously. This novel modelling framework has rapidly been developed in the past decade (see Gould et al. 15 and Tsiatis and Davidian 16 for comprehensive reviews of the model). Many have adopted or extended this framework to investigate the association between the biomarker and the hazard of failure (e.g. 17 ), or to derive risk predictions (e.g. Proust-Lima and Taylor, 18 Garre et al. 19 ). However, adopting the joint models for estimation of diagnostic effectiveness of a biomarker has been limited. Kolamunnage-Dona and Williamson 20 used joint modelling framework to evaluate time-dependent discriminative capability of a biomarker within the ROC curve analysis. In other studies, ROC curve has been used to evaluate the accuracy of the predicted survival probabilities from the joint model (e.g. Rizopoulos 21 ). Henderson et al. 22 has parameterised the underling association between longitudinal biomarker and failure-time processes by individual-level deviation of the longitudinal profile from the population mean, but R 2 like statistic is used to quantify the predictive accuracy of a biomarker for failure rather than the ROC curve.
According to our review, 11 measurement error of the biomarker has been ignored in all current timedependent ROC curve approaches. And, to our knowledge, joint modelling framework has not been adopted to make adjustment for measurement error when evaluating the performance of the biomarker in ROC curve analysis. As its main contribution, this article provides a new development of time-dependent ROC curve to evaluate the performance of baseline biomarker correcting for the measurement error. We propose to utilise a joint model to link the baseline biomarker and failure-time process, and use the individual-level deviation of the biomarker from the population mean to develop an estimator to evaluate the time-dependent ROC curve. In health research, often biomarkers are recorded longitudinally as patients are followed up over time, and we use available longitudinal measurements of the biomarker to make adjustment of the measurement error in our proposed approach. By incorporating the longitudinally repeated biomarker measurements, we make the most efficient use of the data available.

General notation
Let T i be the true failure time (e.g. time to death or time to disease onset) for the ith individual. Let d i ¼ IðT i C i Þ be the indicator of the failure, taking values 1 if the failure is occurred at time T i , and 0 if it is not occurred, so censored at time C i . We observe the failure-time process fX i ; d i g where X i ¼ minðT i ; C i ) defines the observed failure-time for i ¼ 1; . . . ; n individuals in the study dataset. Let y i ¼ y i t ij ð Þ ; È j ¼ 1; . . . ; m i g be a set of all available biomarker measurements recorded at times t ij ; j ¼ 1; . . . ; m i for the ith individual. y i0 is the biomarker measurement of the ith individual at baseline level (observed at baseline time t i1 ¼ 0).

Estimation of measurement erroradjusted estimator of baseline biomarker
Firstly, we formulate the joint model. A joint model is usually consisted of two submodels; a submodel for longitudinal measurements of the biomarker y i and a submodel for failure-time fX i ,d i g. The two components are linked together through some shared parameters. Longitudinal data are typically modelled by linear mixed effect models, while the failure-times assume various choice of modelling approaches through shared latent effects. 15 In general, a Gaussian linear model is assumed for longitudinal data, and proportional hazards is assumed for failure-times. Through the joint model, we can link the true (without measurement error) biomarker trajectory and the hazard of the failure for each individual. We follow joint modelling approach proposed by Henderson et al., 14 but as we only consider that the baseline value of the biomarker is predictive of the failure, we formulate the joint model by where U 0i and U 1i are individual-level random intercept and random slope respectively, and they reflect the true difference between longitudinal profile of each individual from the population mean. In particular, U 0i reflect the true deviation of the biomarker value from the population at baseline time. Therefore, throughU 0i , the proposed submodel for failure-time links the risk of failure directly on the true scale of the biomarker at baseline for each individual. We assume ðU 0i ; U 1i Þ follows a bivariate normal distribution with mean 0 and . In the joint model, the measurement error process is accounted for by e ij in longitudinal data submodel. The measurement error process is non-differential and can be defined by a classical additive measurement error model y ij ¼ y Ã ij þ e ij where y ij is the error-prone measure of y Ã ij . We assume e ij follows a Gaussian distribution with mean zero and variance r 2 e . The above longitudinal data submodel assumes that in the absence of measurement error, the biomarker follows a perfectly linear trajectory. In failure-time submodel, k 0i t ð Þ is an unspecified baseline hazard, and c estimates the level of association between baseline biomarker and hazard for the failure.
We can estimate the model by maximising the joint likelihood of the observed data via the Expectation-Maximization (EM) algorithm. 5,11 The EM algorithm involves taking expectations with respect to the unobserved random effects U 1i and U 0i , and it iterates between two steps (E and M) until convergence is achieved. For the proposed joint model, E-step determines expected values E½U 0i conditional on observed joint outcome fy i ; X i ; d i g. M-step maximises the complete data log-likelihood by U 0i replaced by corresponding expectation. The EM algorithm provides the best linear unbiased estimates of the individualspecific deviations U 0i .
Secondly, based on the estimated values, we can compute measurement error-adjusted estimator based on the linear predictor of the failure-time submodel bŷ whereÛ 0i is the estimated true deviation of the biomarker value from the population mean at baseline for the ith individual, andĉ is the estimated association parameter between baseline biomarker and hazard for failure. Note that, expðĉÞ is the hazard ratio associated with a unit increase in the value of biomarker at baseline with respect to the population mean. In our simulation study (Simulation investigations section), we will extensively explore the validity ofM i as the measurement error-adjusted estimator within time-dependent ROC curve analysis.

Estimation of the time-dependent ROC curve at future time horizons
We need to define the cases and controls at time future Þ be at-risk indicator for each individual defining the riskset at a t h . Then, dichotomise the riskset at time t h into two mutually exclusive groups: cases (experienced failure at timet h ) and controls (survived failure beyond time t h ). At any t h , each diseased individual (i.e. d i ¼ 1) plays a role as control for an early time t h < T i but then play the role of case when t h ¼ T i . In this case, the failuretime is represented through the counting process N t h ð Þ ¼ I T i t h ð Þ, and the corresponding increment is defined by dN t h ð Þ ¼ N t h ð Þ À N t h À ð Þ in terms of the failure time T i alone. This is the incident/dynamic failure-times proposed by Heagerty and Zheng 23 for the estimation of time-dependent ROC curve, and is the version adopted by most methodologists.
Finally, we can assess the discriminatory potential of the measurement error-adjusted estimatorM i at time t h conditional on a threshold value c. Following the standard ROC curve methodology,M i ! c determines the test positive (disease presence) and test negative (disease absent) ifM i < c. The sensitivity and specificity of the error-adjusted baseline estimator at t h can then be defined by where c 2 ðÀ1; þ1Þ. Sensitivity c; t h ð Þ estimates the expected fraction of individuals withM i ! c among those who experience the failure at t h , while specificity c; t h ð Þ estimates the expected fraction of individuals withM i < c among those who survived failure beyond t h . To estimate the two conditional probabilities (conditional on incident/dynamic failure-times), we can use the proportional hazards properties of the joint likelihood function related to the failure-time submodel in (1). Xu and O'Quigley 24 proposed estimating the proportion of variation in a covariate that is explained by failure times. They estimated the distribution of the covariate conditional on failure at a time t based on the weights p k t ð Þ from the Cox proportional hazards model. The same approach was later used by Heagerty and Zheng 23 to estimate the time-dependent sensitivities and specificities as defined as (3). Following them, for a given threshold value c, we estimate the sensitivity (or true positive fraction, TPF) at t h by is the total weight for the riskset individuals, and Ið:Þ is an indicator. We can calculate the specificity (or 1À false positive fraction, FPF) empirically by where R 0 k t h ð Þ is the set of failure-free individuals in the riskset at time t h and P k R 0 k t h ð Þ is the size of that control-set.
Bansal and Heagerty 25 have also used the same incident/dynamic failure-times definition when there exists time-specific cases of interest at a particular time t h . However, we apply this definition to estimate the discriminative capability at any time t h > 0 with no such prior information. Note that the proportional hazard assumption does not require any case to exist at t h to estimate the above sensitivity; it will force the FPF equal to zero and specificity equal to one. Thus, although there is no case (had a failure) exists at t h (which usually happens in practice), sensitivity can still be estimated at t h . Once the above sensitivity and specificity are computed at t h , the corresponding timedependent ROC curve and AUC at time t h for all c 2 ðÀ1; þ1Þ can be computed by kernel (density) smoothing which follows closely the details of the original data. 26 When there is no specific time t h of interest, but restricted to a fixed follow-up period 0; s ð Þ, a global summary of the AUC can be provided by a survival concordance index (C-index). 27 In the proposed approach,M i is computed from joint model estimates, which is then used as the input to ROC curve analysis; hence, the 95% confidence intervals (CIs) of the AUC, sensitivity and specificity are estimated by the bootstrap sampling with replacement 28 to account for uncertainty due to the two estimation processes. The previously suggested timedependent ROC models for censored failure-times also used bootstrap approaches to estimate the corresponding CIs. 11,20,23,27 The software to implement the proposed joint model has been developed in R language, and will be available as part of the current joineR package. 29 The risksetROC package in R can be used to estimate the corresponding incident/dynamic ROC curve, 23 and we have modified corresponding R functions to implement the proposed ROC curve. The R codes are available from the authors on request.

Simulation investigations
We have conducted three simulation investigations to demonstrate whether the proposed approach is an appropriate framework for estimating the timedependent ROC curve. The details of data simulation and investigations are given in the supplementary file. Firstly, we explored the accuracy of estimation of association parameter c from the joint model which is crucial for estimating the correct ROC curve from the proposed approach. For comparison with c, we also fitted Cox proportional hazard model including the observed baseline biomarker value as a covariate, and also the estimated random intercept terms from the linear mixed effect (LME) model ðÛ 0 Þ lme (usually referred as two-stage model 17 ), see more details of these models in supplementary file. Figure 1 presents the bias for estimated association from the proposed joint, observed and ðÛ 0 Þ lme models for 30% censoring. The corresponding numerical values together with mean square error (MSE) and 95% coverage probabilities are given in supplemental table S1, and S2 and S3 present these for 50% and 70% censoring respectively. All estimates were obtained from 500 bootstrap samples with replacement. We observe that the joint model provides the most accurate estimation of the association with smaller biases, lower MSE and coverage probabilities closer to 95% across all settings. Both observed and ðÛ 0 Þ lme approaches underestimate the level of association to a great extent (high bias) when the true association is fairly strong and measurement error is high. The underestimation of the association from the model including observed baseline biomarker value is anticipated, as this approach assumes the biomarker value is measured without error. Although the ðÛ 0 Þ lme approach is computationally simpler than the joint model, due to the two individual regressions it could lead to bias estimation for conditional effects such as the association between the two outcomes. Our observations are also consistent with the previously published simulation study results from various joint model specifications [e.g. 17,20 ]. The proposed joint model estimates the association fairly close to the true value with lower bias even when the measurement error is high; indicating that the proposed model makes the proper adjustment of measurement error when estimating the underlying association at the baseline level, and strengthening the case of using the model for estimating association between biomarker and risk of failure at baseline.
Secondly, we evaluated how the proposed measurement error-adjusted estimatorM modifies the ROC curve on a given association c. The C-index ofM for a fixed follow-up period of (0, 2) was computed for varying level of association and compared with the true C-index (based on the true biomarker value at baseline). To compare with the proposed estimator, we used the observed baseline value, as the observed baseline value is generally used in ROC curve analyses. And due to computational simplicity of ðÛ 0 Þ lme , we also considered it as a potential estimator for the time-dependent ROC curve analysis, but ðÛ 0 Þ lme has not been previously used as an estimator of its own for the time-dependent ROC curve. 11 Figure 2 presents the estimated C-Indexes for 30% censoring for various strengths of association. The corresponding numerical values together with bias, MSE and 95% coverage probabilities are given in supplemental table S4, and S5 and S6 present these for 50% and 70% censoring respectively. When there is no association between the baseline biomarker and failure, the C-index is estimated from the proposed estimatorM fairly close to the null value of 0.5 (indicating that biomarker shows no discriminatory potential) across all settings of c and censoring %s, see supplemental tables. As strength of the association is stronger (c moves towards 1.0), the estimated C-index fromM is also increased by acceptable margins. We can observe from Figure 2 that the point estimate of the C-Index from bothM and ðÛ 0 Þ lme are fairly similar, especially when the association is weak to moderate, and ðÛ 0 Þ lme point estimate is better as compared to the observed value. However, ðÛ 0 Þ lme substantially under-coverage the 95% confidence intervals as compared to that fromM, especially when the association is moderate to strong; see Supplementary tables S4, S5, and S6. The coverage issue associated with ðÛ 0 Þ lme is also observed in association estimation (supplementary tables S1, S2, and S3). Therefore, we can expect the same extent of under-coverage for other ROC curve summaries, implying ðÛ 0 Þ lme is failed as an estimator for the time-dependent ROC curve analysis. M provides the most accurate C-index estimation with smaller biases, and also with lower MSE and higher 95% coverage probabilities across all settings of c and censoring %s.
Finally, the accuracy of the time-dependent ROC curve was further evaluated by comparing the estimated AUCðt h Þ at future time points t h with the true AUCðt h Þ. We also computed the sensitivityðt h Þ and specificityðt h Þ at optimal thresholds for the proposedM, and compared with the observed baseline value estimates. Table 1 presents the bias for estimated AUCðt h Þ at t ¼ t h ¼ 1; 2; 3; 4 for c ¼ 1 and 30% censoring and estimated sensitivityðt h Þ and specificityðt h Þ. Supplemental table S7 to S21 present bias, MSE and 95% coverage probabilities of estimates for all values of c and censoring rates. The estimated AUCðt h Þ from the proposed measurement error-adjustedM is more accurate, with lower biases and MSE as compared to the observed baseline value. We observe thatM is failed to archive the nominal coverage when the measurement error is considerably high at early time points; however, such high measurement error is rarely observed in current clinical data due to precision of latest machinery and better laboratory regulations. This investigation proves that the proposed methodology effectively corrected for a moderate measurement error when calculating performance of the baseline biomarker at future time points. As expected, AUCðt h Þ decreases as t h increases because discriminatory potential of the biomarker becomes weaker as departing from baseline. 23

Application: Mayo Clinic primary biliary cirrhosis (PBC) study
We apply the proposed approach to the data from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984. PBC is a fatal, but rare liver disease. If PBC is not treated, and reaches an advanced stage, it can lead to several major complications, including death. The trial randomised 312 patients between D-penicillamine (n ¼ 158) for the treatment of PBC and placebo (n ¼ 154). 30 Among the 312 patients randomised, 125 died during the follow-up. Although the study established that D-penicillamine is not effective for the treatment of PBC, the data have been used to develop clinical prediction models, and has been widely analysed using joint modelling methods. [31][32][33][34] Patients with PBC typically have abnormalities in several blood tests; hence, during the study followup several biomarkers associated with liver function were serially recorded for these patients. In this article, we considered three biomarkers: serum bilirubin (measured in units of mg/dl), serum albumin (mg/ dl), and prothrombin time (seconds) with the aim of assessing the performance of each biomarker at the baseline level for patient survival. The available longitudinal measurements of each biomarker were used to correct for measurement error. As the proposed modelling framework assumes Gaussian random effects and errors, the bilirubin measurements were log-transformed and the prothrombin time were transformed by (0.1Âprothrombin time) À4 as suggested by Box-Cox transformation. Albumin did not require transformation. A linear trajectory is assumed for each biomarker and the residual plots did not indicate any deviations from the linear form; see supplementary file for diagnostic plots. Table 2 shows the estimated time-dependent AUC, sensitivity and specificity at times t h ¼Year 1, Year 5 and Year 10 together with estimated measurement error and association parameter for both the measurement error-adjusted and observed baseline biomarker. The 95% confidence intervals (CI) were computed from 500 bootstrap samples with replacement. The measurement error-adjusted estimator is associated with a considerably high time-dependent AUC(t h ) than the observed baseline biomarker at all t h , and this is observed across all three biomarkers. The level of association between the baseline biomarker and risk of death is also substantially underestimated as the measurement error is ignored. Among the three biomarkers, once corrected for the measurement error, the highest time-dependent AUC (t h ) was achieved for serum bilirubin, which means that among the 3 biomarkers, serum bilirubin is best for the earliest diagnosis of PBC. The ROC(t h ) curves at three discrete time points and timedependant AUC(t h ) over continuous time for serum bilirubin are shown in Figure 3. Table 1. Time-dependent AUC (Standard Error, SE) and Bias at t h for the measurement-error adjusted and observed biomarker when c ¼ 1 and 30% censoring for varying measurement error variance. Sensitivity (SE) and specificity (SE) are estimated at the corresponding optimal threshold.

Discussion
The focus of this article was to develop a novel methodology for evaluating time dependent performance of the baseline biomarker correcting for measurement error. We proposed a novel utility of the joint modelling framework within the theory of timedependent ROC curve analysis by developing a more efficient estimator that links the risk of failure and baseline biomarker. The baseline is an important time point as the biomarker value at baseline can serve as the earliest indicator of a potential future adverse clinical event (e.g. death). We have shown from our simulation investigations that measurement error could cause a severe bias in estimating the association between the baseline biomarker and risk of failure event. Although, this has been investigated in joint modelling literature in relation to various specifications of the model, this study was the first to show that observed baseline value could severely underestimate the true discriminative capability of the biomarker as estimated by AUC. Our simulation investigations proved that the proposed methodology effectively corrects for a moderate measurement error when calculating the performance of the baseline biomarker over time. A similar joint model specification was suggested by Crowther et al. 35 to predict survival for new patients. In their model, association was defined on the current biomarker value rather than the individual-level Table 2. Time-dependent AUC, sensitivity and specificity (at the corresponding optimal threshold) at t h for the measurement-error adjusted and observed baseline biomarkers.  deviation, and restricted cubic splines were used to define the longitudinal biomarker while failure-time assumes a parametric distribution. This level of complexity is necessary to model highly nonlinear biomarker trajectories over time, and to capture complex baseline hazards when predicting the future survival probabilities. However, our aim was to quantify the true discriminant capability of the baseline biomarker at future time points, and a more classical modelling and estimation framework has been proven sufficient from our thorough simulation study. To facilitate the use of the methods in practice, software is written in R language (which is a free software environment). The proposed approach can be implemented with a relatively low computational burden; for example, in our application dataset with 312 patients, the proposed joint model for each biomarker took under 1 minutes to converge on a standard desktop computer, and the time-dependent AUCs were derived in few seconds.
More recently, quantities such as proportion of information gain (PIG) have been proposed to measure the importance of a biomarker. Li and Qu 36 adjusted for the measurement error in calculating the PIG for continuous, binary and failure-time outcomes. However, our focus in this article was to account for the measurement error of a more familiar and well established quantity among the medical research community. We proposed a computationally simple approach to estimate the true timedependent ROC curve for a baseline biomarker subjected to measurement error. Although information from longitudinally repeated measurements is required for the proposed approach in addition to the single biomarker measurement at baseline, often in clinical studies, longitudinal measurements are recorded alongside the main study as secondary outcomes, e.g. to monitor the progression of a disease. Therefore, the prospects of utilising the proposed framework to detect the true performance of biomarkers is quite substantial.
The proposed ROC curve approach can be extended to incorporate multiple biomarkers by utilising multivariate joint models (e.g. Hickey et al. 34 ). In our application, we evaluated the measurement error-corrected performance of three biomarkers in separation for the survival of PBC patients. It may be of interest to assess the performance in a combination of biomarkers, as in many diseases it is unlikely that a single biomarker will ever be more effective due to complexity of the disease (e.g. Aerts et al. 37 ).

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: AK was supported by Malaysia government PhD studentship Majlis Amanah Rakyat Malaysia (MARA) during 2014-2018. This work was also partly supported by the Medical Research Council [grant number MR/M013227/1].

Supplemental material
Supplemental material for this article is available online.