An augmented likelihood approach for the Cox proportional hazards model with interval-censored auxiliary and validated outcome data—with application to the Hispanic Community Health Study/Study of Latinos

In large epidemiologic studies, it is typical for an inexpensive, non-invasive procedure to be used to record disease status during regular follow-up visits, with less frequent assessment by a gold standard test. Inexpensive outcome measures like self-reported disease status are practical to obtain, but can be error-prone. Association analysis reliant on error-prone outcomes may lead to biased results; however, restricting analyses to only data from the less frequently observed error-free outcome could be inefficient. We have developed an augmented likelihood that incorporates data from both error-prone outcomes and a gold standard assessment. We conduct a numerical study to show how we can improve statistical efficiency by using the proposed method over standard approaches for interval-censored survival data that do not leverage auxiliary data. We extend this method for the complex survey design setting so that it can be applied in our motivating data example. Our method is applied to data from the Hispanic Community Health Study/Study of Latinos to assess the association between energy and protein intake and the risk of incident diabetes. In our application, we demonstrate how our method can be used in combination with regression calibration to additionally address the covariate measurement error in self-reported diet.

S1 R Code with sample data analysis

We now provide R code that illustrates how to apply the proposed method and the standard, no auxiliary data method to a simulated data set. The simulated data mimic the complex survey design of HCHS/SOL, which includes unequal probability sampling, stratification, and clustering. To mimic the measurement error of dietary factors in HCHS/SOL, we simulate an error-prone covariate ($X^*$) and assume that we additionally have 2 error-free continuous covariates, such as body mass index and age ($Z_1$ and $Z_2$). The auxiliary data outcome variable is recorded at 8 time points, while the gold standard outcome variable is recorded at year 4. The sensitivity and specificity of the error-prone, auxiliary data outcome are estimated by comparing the auxiliary data outcome indicator to the gold standard outcome indicator at year 4. This data set is provided on GitHub at https://github.com/lboe23/AugmentedLikelihood with the file name SampleData.RData.
We begin by loading the packages needed for this analysis as well as the functions required to calculate our log-likelihood and gradient. These functions are available on the GitHub site above in the file titled "PROPOSED AUGMENTEDLIKELIHOOD FUNCTIONS.R." This file contains a few functions needed for analysis, including the following: (1) log_like_proposed(), which calculates the log-likelihood for the proposed method, and (2) gradient_proposed(), which calculates the gradient/estimating function for the proposed method. Both of these functions require a specification of the function purpose, where the options are "SUM" or "INDIVIDUAL": "SUM" returns the sum of the log-likelihood or gradient contributions, respectively, while "INDIVIDUAL" returns a vector/matrix of each person's contributions. We also use the Rcpp function cmat(), available in the file "RcppFunc.cpp" from GitHub. Additionally, we use the Rcpp functions dmat() and getrids() developed by Gu et al. (2015) that can be found in the icensmis package on CRAN or on GitHub at https://github.com/XiangdongGu/icensmis/blob/master/src/dataproc.cpp. Below is code that loads all of the required functions:
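The following is a minimal sketch of this loading step, assuming the two source files named above have been downloaded from the repositories into the working directory and that the survey and Rcpp packages are installed.

library(survey)   # survey design objects and design-based variance estimation
library(Rcpp)     # to compile the C++ helper functions

# Functions for the proposed log-likelihood and gradient (from GitHub)
source("PROPOSED AUGMENTEDLIKELIHOOD FUNCTIONS.R")

# C++ helpers: cmat() from this repository; dmat() and getrids() from icensmis
sourceCpp("RcppFunc.cpp")
sourceCpp("dataproc.cpp")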
We will allow for a proportion of the gold (reference) standard event status variables to be missing, and we fix this missingness rate to be 20% for this analysis.

prop_m <- 0.20

Now, we load in the sample simulated data. The data we input are in wide form, with one row per subject and each auxiliary data event indicator in a separate column.

load(file = paste0('SampleData.RData'))
N <- dim(samp)[1]

Now we estimate the sensitivity (Se) and specificity (Sp) values for the auxiliary data.
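A sketch of this estimation step, comparing the year-4 auxiliary indicator to the gold standard indicator; the column names aux_vis4 and GS_vis4 are assumptions for illustration, and the actual names in SampleData.RData may differ.

# Merge the year-4 auxiliary indicator with the gold standard indicator
visit4 <- merge(samp, GS_data, by = "ID")
Se_hat <- mean(visit4$aux_vis4[visit4$GS_vis4 == 1] == 1)  # P(Y* = 1 | true positive)
Sp_hat <- mean(visit4$aux_vis4[visit4$GS_vis4 == 0] == 0)  # P(Y* = 0 | true negative)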
set.seed(2548)
mcar <- runif(N, 0, 1)
GS_data$GS_vis4_mis <- ifelse(mcar < prop_m, NA, GS_data$GS_vis4)

Next, we create two datasets using the merge() function: datafinal, which is the data in long form with one row per visit, and datafinal_GS, which has one row per person for the standard, no auxiliary data analysis.
datafinal <- merge(datafinal_1, GS_data, by = "ID")
datafinal_GS <- merge(samp_long, GS_data[, c("ID", "GS_vis4_mis")], by = "ID")

The last step with the data is to ensure that we are creating a unique data set with only N rows, which saves only the variables that will be necessary for our analysis.
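A minimal sketch of this step; the list of retained variables is an assumption for illustration.

# Keep one row per subject and only the analysis variables (names assumed)
keep_vars <- c("ID", "xstar", "z1", "z2", "GS_vis4_mis")
datafinal_unique <- datafinal[!duplicated(datafinal$ID), keep_vars]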
Now, we create survey designs using the survey package. We will create a survey design for the entire cohort, which we will then properly subset according to the models that we wish to fit. We start by creating the main survey design, samp_design_reg_complete. We then subset this to create the SOLNAS design, used to fit the calibration model.

samp_design_reg_complete <- svydesign(id = ~BGid, strata = ~strat, weights = ~bghhsub_s2, data = IC_data)
samp.solnas.design <- subset(samp_design_reg_complete, solnas == T)

Next, we fit the calibration model. To do so, we use functions from the survey package, which require us to specify our subset design. We then create our predicted values (xhat) from regression calibration using the predict() function; a sketch of these steps is given below. Note that we add the predicted values to our datafinal data set, as well as include them in the survey design. Once our estimated covariate, xhat, is included in the survey design, we subset the design further for fitting the standard, no auxiliary data analysis model, which excludes subjects missing the gold standard variable. We create a third subsetted survey design, samp_design_reg_sandwich, which will be used to calculate the proposed sandwich variance for the gold standard approach. This is necessary because a few participants included in the calibration model are not in the outcome model due to missing data, so we need a survey design that considers all participants for the case of non-nested subsets.

Now, we write down the formula for our outcome model, including the error-prone auxiliary data outcome and three covariates. Recall that we used regression calibration to correct for error in one covariate, so our model includes the predicted value xhat and two precisely recorded covariates, z1 and z2. We also create the formula for the "naive" model fit to the error-prone covariate.

formula <- result ~ xhat + z1 + z2
formula_naive <- result ~ xstar + z1 + z2

We now make sure our data are ordered properly before we begin calculating the sum of the likelihood components. Next, we calculate the D matrix and C matrix for our log-likelihood using the Rcpp functions, and we assign J, the number of auxiliary data visit times.
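A sketch of the calibration and design-subsetting steps; the name x_bio for the reference measure available in the SOLNAS subset, and any other object names not appearing above, are assumptions for illustration.

# Fit the calibration model on the SOLNAS subset design
calib_fit <- svyglm(x_bio ~ xstar + z1 + z2, design = samp.solnas.design)

# Predicted (calibrated) covariate for every cohort member
datafinal$xhat <- as.numeric(predict(calib_fit, newdata = datafinal))
samp_design_reg_complete <- update(samp_design_reg_complete,
    xhat = as.numeric(predict(calib_fit, newdata = IC_data)))

# Subset designs: one excluding subjects missing the gold standard, and one
# keeping everyone who enters either the calibration or the outcome model
samp_design_reg <- subset(samp_design_reg_complete, !is.na(GS_vis4_mis))
samp_design_reg_sandwich <- subset(samp_design_reg_complete,
    solnas == TRUE | !is.na(GS_vis4_mis))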
In these next steps, we create our covariate matrix with one column per covariate (Xmat). We also create a design matrix for the naive model, Xmat_naive, which includes the error-prone exposure xstar in place of xhat. Initially, Xmat and Xmat_naive will be in long form, with one row per visit. We also assign nbeta (the number of covariates/regression parameters) and uid, a unique indicator for each person's ID. Finally, we redefine our covariate matrix Xmat and our design matrix for the naive model, Xmat_naive, to have just one row per subject; a sketch of these steps follows.
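A minimal sketch of these steps, using a plain-R duplicated() call in place of the authors' Rcpp helper getrids(); the column names are assumptions.

# Long-form covariate matrices (one row per visit)
Xmat <- as.matrix(datafinal[, c("xhat", "z1", "z2")])
Xmat_naive <- as.matrix(datafinal[, c("xstar", "z1", "z2")])
nbeta <- ncol(Xmat)

# Row index of each subject's first record (plain-R stand-in for getrids())
uid <- which(!duplicated(datafinal$ID))

# One row per subject
Xmat <- Xmat[uid, , drop = FALSE]
Xmat_naive <- Xmat_naive[uid, , drop = FALSE]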
Now, we create a unique vector (GSdelta) with one row per person, indicating whether each person had the gold (reference) standard indicator available or not. This will be used to calculate the proposed log-likelihood contribution for each subject based on whether GSdelta = NA, 0, or 1. Additionally, we create the vector of observation times at which the gold (reference) standard is recorded, called GSVis. Lastly, we create a vector of 1's called noweights, which will be used to fit the proposed estimator in the unweighted analysis.

GSdelta <- datafinal[uid, "GS_vis4_mis"]
GSVis <- rep(4, N)
noweights <- rep(1, N)

We now finalize the data set for the standard, no auxiliary data analysis using the unique IDs only, such that the data have N rows. This is the final dataset for the standard, no auxiliary data analysis, which omits anyone who is missing the gold standard. Now, we create starting values for our survival parameters. First, to avoid maximization problems due to the ordered constraint of the survival parameters, we re-parameterize these in terms of a log-log transformation of the survival function for $S_2$, and a change in the log-log of the survival function for all other parameters $S_3, \ldots, S_{J+1}$. We consider initial values of 0.1 for our survival parameters, then transform these based on this re-parameterization. We also define lower and upper bounds for our survival parameters: the lower bound is $-\infty$ for the first re-parameterized survival parameter and 0 for the subsequent $J-1$ terms, while the upper bound is $\infty$ for all terms. Finally, we create a vector parmi consisting of a starting value for beta and starting values for the re-parameterized survival parameters; one possible implementation is sketched below.
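The following sketch constructs the re-parameterized starting values and bounds and then maximizes the weighted log-likelihood with optim(). The equal interval probabilities of 0.1 and the optim() call are assumptions; the exact argument lists of log_like_proposed() and gradient_proposed() should be taken from the GitHub file.

J <- 8  # number of auxiliary data visit times in the sample data

# Assumed starting values: equal interval probabilities of 0.1, giving the
# survival values S_2, ..., S_{J+1}
S_init <- 1 - cumsum(rep(0.1, J))
loglog_S <- log(-log(S_init))
# alpha_1 = log(-log(S_2)); alpha_j = successive differences (positive)
parm_surv <- c(loglog_S[1], diff(loglog_S))

lower <- c(rep(-Inf, nbeta), -Inf, rep(0, J - 1))
upper <- rep(Inf, nbeta + J)
parmi <- c(rep(0, nbeta), parm_surv)

proposed_fit_data_weight <- optim(
  parmi, fn = log_like_proposed, gr = gradient_proposed,
  method = "L-BFGS-B", lower = lower, upper = upper,
  control = list(fnscale = -1))
# NOTE: the data arguments required by log_like_proposed() and
# gradient_proposed() (e.g., the D and C matrices, Xmat, GSdelta, the weights,
# and purpose = "SUM") would be passed through optim's '...' and are omitted
# here; see the GitHub file for the true signatures.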
We then combine these blocks into one A matrix and invert it, as required for the sandwich variance computation; a sketch of this assembly follows.
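A sketch of the block assembly for the proposed method, whose stack has four components (sensitivity, specificity, calibration model, outcome model). All object names below are placeholders for previously computed derivative blocks and estimating-function contributions; zero blocks fill the positions whose cross-derivatives vanish, as described in Section S2.3.

# Assumed blocks: A.Se, A.Sp (scalars), A.calib, A.psi (diagonal blocks), and
# A.psi.Se, A.psi.Sp, A.psi.delta (non-zero bottom-row blocks)
p.c <- ncol(A.calib); p.k <- ncol(A.psi)
A <- rbind(cbind(A.Se, 0,    t(rep(0, p.c)), t(rep(0, p.k))),
           cbind(0,    A.Sp, t(rep(0, p.c)), t(rep(0, p.k))),
           cbind(rep(0, p.c), rep(0, p.c), A.calib, matrix(0, p.c, p.k)),
           cbind(A.psi.Se, A.psi.Sp, A.psi.delta, A.psi))
A.inv <- solve(A)

# One row of stacked estimating-function contributions per participant; rows
# are zero for participants outside a given component's subset
estfun.all <- cbind(estfun.Se, estfun.Sp, estfun.calib, estfun.outcome)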
Finally, we obtain design-based standard errors using the variance approach of Binder (1983) and functions from the survey package by providing the influence function and survey design to vcov(svytotal()) (Lumley, 2011). A similar procedure was implemented in Boe et al. (2022).
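Continuing the sketch above, the influence functions are the stacked estimating-function contributions rotated by the inverted A matrix. N_All (the total number of participants contributing to any component) and the choice of design are assumptions; sand1 is the object whose diagonal supplies the proposed method's standard errors below.

# Influence functions for the proposed method
infl <- as.matrix(estfun.all) %*% t(A.inv) / N_All

# Design-based variance via Binder (1983)
sand1 <- vcov(svytotal(infl, samp_design_reg_complete))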
Next, we need to compute the proposed sandwich variance estimate for the gold standard, interval-censored approach. Since auxiliary data were not used in this analysis, we only stack estimating equations for the calibration and outcome models, not for the sensitivity and specificity. We follow a similar process as we did for the proposed model, starting by saving our estimated sample size for the gold standard approach, then forming our estimating equation contributions and A matrix. Note that we need to consider both the sample size used to fit the standard model and the sample size for all participants in either the calibration or outcome model. This is important for this example, where the calibration subset is a non-nested subset of the subset used to fit the standard model.
We begin by saving these two sample sizes, then constructing the four quadrants of our A matrix and combining them, as before.
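A sketch mirroring the earlier assembly but with only two components; again all names are placeholders, and the estimating-function rows are padded with zeros for participants who appear in only one of the two non-nested subsets.

# Two sample sizes (the solnas column name is an assumption)
n_GS <- sum(!is.na(GSdelta))
N_GoldStd_All <- sum(!is.na(GSdelta) | datafinal[uid, "solnas"])

# Two-component stack: calibration model and outcome model
A.IC <- rbind(cbind(A.calib, matrix(0, nrow(A.calib), ncol(A.psi.IC))),
              cbind(A.cross.IC, A.psi.IC))
A.inv.IC <- solve(A.IC)

# Estimating-function rows for all N_GoldStd_All participants, zero-padded
estfun.all.IC <- cbind(estfun.calib.padded, estfun.outcome.padded)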

# Compute influence functions
infl.IC <- as.matrix(estfun.all.IC) %*% t(A.inv.IC) / N_GoldStd_All

# Compute the sandwich using vcov(svytotal())
sand2 <- vcov(svytotal(infl.IC, samp_design_reg_sandwich))

We now save the estimated parameters, including the estimated regression coefficients and corresponding standard error estimates.

beta_proposed <- proposed_fit_data_weight$par[1]
beta_standard_IC <- proposed_fit_data_weight_GS$coefficients[2]
se_proposed_final <- sqrt(diag(sand1))[7]
se_standard_final <- sqrt(diag(sand2))[6]

Now, we can use our estimated regression coefficients and their corresponding standard errors to construct a table with estimated hazard ratios (HR) and 95% confidence intervals (CI) associated with a 20% increase in consumption. We will also include relative efficiency calculations in our table, computed as the ratio of the estimated variance of the standard, no auxiliary data approach estimator to the estimated variance of the proposed method estimator, i.e., $\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Standard}})/\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Proposed}})$.
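A sketch of these summary calculations. The log(1.2) scaling assumes the dietary exposure enters the model on the log scale, so that exp{beta * log(1.2)} is the hazard ratio for a 20% increase in consumption; this convention is an assumption here.

# HR and 95% CI for a 20% increase in consumption (log-transformed exposure)
scale20 <- log(1.2)
HR_prop <- exp(beta_proposed * scale20)
CI_prop <- exp((beta_proposed + c(-1.96, 1.96) * se_proposed_final) * scale20)
HR_std  <- exp(beta_standard_IC * scale20)
CI_std  <- exp((beta_standard_IC + c(-1.96, 1.96) * se_standard_final) * scale20)

# Relative efficiency: Var(standard) / Var(proposed)
RE <- se_standard_final^2 / se_proposed_final^2

results <- data.frame(Method = c("Proposed", "No auxiliary data"),
                      HR = c(HR_prop, HR_std),
                      Lower = c(CI_prop[1], CI_std[1]),
                      Upper = c(CI_prop[2], CI_std[2]),
                      RE = c(RE, 1))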

S2.1 Derivation of equation (1) of the main manuscript

In this section, we derive equation (1) of the main manuscript under the assumption that the $n_i$ error-prone outcomes $Y^*_{ij}$ are conditionally independent given the true disease status and event time $T_i$. This equation is derived using a similar approach to that outlined by Balasubramanian and Lagakos (2003). We begin by writing down the joint probability of the observed data for subject $i$, as introduced in the first paragraph of Section 2.1 of the main manuscript. Next, this probability is broken down in terms of the individual auxiliary data outcome variables and their corresponding visit times. We then apply the conditional independence assumption, which allows the joint probability of the auxiliary outcomes to factor as $P(Y^*_i \mid T_i, T^*_i, \Delta_i, V_i) = \prod_{l=1}^{n_i} P(Y^*_{il} \mid T_i, T^*_{il}, \Delta_i, V_i)$. Lastly, we adopt the logic of Balasubramanian and Lagakos (2003), which shows that for a prespecified visit schedule at which the auxiliary data are recorded, the visit-time probabilities factor out of the likelihood; these two probabilities would also drop out of the likelihood if they did not depend on the regression parameters of interest ($\beta$). From here, it is straightforward to see how we arrive at equation (1), the joint probability of the observed data for subject $i$.

S2.2 Details of the $C_{ij}$ term in the likelihood
In Section 2.1 of the main text, we introduce the likelihood contributions for all individuals in terms of $C_{ij} = \left[\prod_{l=1}^{n_i} P(Y^*_{il} \mid \tau_{j-1} < T_i \le \tau_j, T^*_{il}, \Delta_i, V_i)\right]$, which is simply a function of the sensitivity ($Se$) and specificity ($Sp$) of the auxiliary data. Recall that we assume constant and known sensitivity and specificity, defined as $Se = P(Y^*_{il} = 1 \mid \tau_{j-1} < T_i \le \tau_j, T^*_{il} \ge \tau_j)$ and $Sp = P(Y^*_{il} = 0 \mid \tau_{j-1} < T_i \le \tau_j, T^*_{il} \le \tau_{j-1})$. It is then straightforward to see that $1 - Se = P(Y^*_{il} = 0 \mid \tau_{j-1} < T_i \le \tau_j, T^*_{il} \ge \tau_j)$ and $1 - Sp = P(Y^*_{il} = 1 \mid \tau_{j-1} < T_i \le \tau_j, T^*_{il} \le \tau_{j-1})$. In this section, we provide the general form of the $C_{ij}$ terms for the case of no missed visits. The formula introduced below could be modified to allow for missed visits by summing up all terms $\theta_j = P(\tau_{j-1} < T_i \le \tau_j)$ across the $(\tau_{j-1}, \tau_j]$ defining a subject's observational interval. For the case of $J$ total visits for all $N$ subjects, the $C_{ij}$ terms take the following form:
$$C_{ij} = \prod_{l:\, T^*_{il} \le \tau_{j-1}} Sp^{\,1 - Y^*_{il}} (1 - Sp)^{\,Y^*_{il}} \prod_{l:\, T^*_{il} \ge \tau_j} Se^{\,Y^*_{il}} (1 - Se)^{\,1 - Y^*_{il}}.$$
As an example, consider subject $i = 1$, who has observed auxiliary data vector $Y^*_i = [0, 0, 0, 1]$ corresponding to annual visit time vector $T^*_i = [1, 2, 3, 4]$. Suppose the visit times among all $N$ subjects are also $[\tau_0 = 0, \tau_1 = 1, \tau_2 = 2, \tau_3 = 3, \tau_4 = 4, \tau_5 = \infty]$. Then, for subject $i$ with $n_i = J = 4$ and $j = 1$, all four visits fall after the failure interval, so $C_{11} = (1 - Se)^3\, Se$. Following a similar procedure for $j = 2, \ldots, 5$, we see that for subject 1, $C_{12} = Sp\,(1 - Se)^2\, Se$, $C_{13} = Sp^2 (1 - Se)\, Se$, $C_{14} = Sp^3\, Se$, and $C_{15} = Sp^3 (1 - Sp)$.
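To make the example concrete, here is a small self-contained R sketch (not part of the authors' code) that computes $C_{ij}$ for a subject with no missed visits and reproduces the five values above:

# C_ij for one subject: ystar = auxiliary indicators, tstar = visit times,
# taus = (tau_0, ..., tau_{J+1}), j indexes the failure interval (tau_{j-1}, tau_j]
Cij <- function(ystar, tstar, taus, j, Se, Sp) {
  pre  <- tstar <= taus[j]       # visits at or before tau_{j-1} (pre-failure)
  post <- tstar >= taus[j + 1]   # visits at or after tau_j (post-failure)
  prod(Sp^(1 - ystar[pre]) * (1 - Sp)^ystar[pre]) *
    prod(Se^ystar[post] * (1 - Se)^(1 - ystar[post]))
}

ystar <- c(0, 0, 0, 1)
tstar <- 1:4
taus  <- c(0, 1, 2, 3, 4, Inf)
sapply(1:5, function(j) Cij(ystar, tstar, taus, j, Se = 0.8, Sp = 0.9))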

S2.3 Sandwich variance estimator
For a data structure similar to the HCHS/SOL study, a variance estimator is needed that incorporates the survey design as well as the extra uncertainty added by the estimation of the unknown exposure and of the sensitivity and specificity of the auxiliary data. In the main text, we propose a sandwich variance estimator to accommodate this structure. The proposed sandwich variance estimator may be obtained by stacking the estimating equations for the sensitivity, specificity, calibration model, and proposed outcome model (Boos and Stefanski, 2013). We follow a similar approach to that outlined by Boe et al. (2022) and consider the stacked vector $\phi = (Se, Sp, \delta, \psi)$, where $\psi = [\beta, S]$. The vector $\phi$ includes the sensitivity and specificity of the auxiliary data as well as the parameters from the calibration and outcome models. We assume that $Se$ and $Sp$ are both of length 1, while $\delta$ has length $j$ and $\psi$ has length $k$.
Suppose we will estimate sensitivity and specificity at baseline by comparing measures of the auxiliary data, $Y^*_{i0}$, to the corresponding baseline gold standard, $Y_{i0}$, where a value of 1 indicates a positive and 0 indicates a negative. We assume a simple Bernoulli log-likelihood for the estimation of the sensitivity and specificity parameters, such that $l_{i,Se} = Y^*_{i0}\log(Se) + (1 - Y^*_{i0})\log(1 - Se)$ for subjects with $Y_{i0} = 1$, and analogously $l_{i,Sp} = (1 - Y^*_{i0})\log(Sp) + Y^*_{i0}\log(1 - Sp)$ for subjects with $Y_{i0} = 0$. Let $l_C(\delta)$ be the log-likelihood for the calibration model. Recall that in Section 2.2 of the main manuscript, we define $l_\pi(S, \beta) = l_\pi(\psi) = \sum_{i=1}^N \tilde{l}_i(\psi)$ as the log-likelihood function for the proposed method.
We note that in the case of the HCHS/SOL data, different subsets of participants contribute to the estimation of the different parameters. For example, $n_{Se}$ participants contribute to the estimation of baseline sensitivity, $n_{Sp}$ participants contribute to the estimation of baseline specificity, $n_C$ calibration subset participants contribute to the calibration model likelihood, and $N$ participants contribute to the proposed outcome method likelihood. These subsets overlap to some degree, and $N_{ALL}$ is defined as the total number of participants contributing to at least one of the four estimating equations. Any estimating equation contribution for a participant who is not in the subset used to fit the corresponding model is set equal to 0.
We now define $U_i(\phi)$ as the $(j + k + 2)$-dimensional vector of stacked estimating equations formed for $\phi$, which can be broken down into the estimating equations for sensitivity, $U_{i1}(\phi)$, specificity, $U_{i2}(\phi)$, the calibration model, $U_{i3}(\phi)$, and the outcome model, $U_{i4}(\phi)$. For this example, $U_{i1}(\phi) = \partial l_{i,Se}/\partial Se$ and $U_{i2}(\phi) = \partial l_{i,Sp}/\partial Sp$, while $U_{i3}(\phi)$ and $U_{i4}(\phi)$ are the vectors of first derivatives of the log-likelihood functions $l_{iC}(\delta)$ and $\tilde{l}_i(\psi)$, respectively, with respect to the parameters being estimated. We may write
$$U_i(\phi) = \left( U_{i1}(\phi),\; U_{i2}(\phi),\; U_{i3}(\phi)',\; U_{i4}(\phi)' \right)'. \tag{S1}$$
Then, following Binder (1983) and Lumley and Scott (2017), the sandwich estimator for the variance of $\hat\phi$ that incorporates the survey design as well as the extra uncertainty from the estimated exposure, sensitivity, and specificity has the form
$$\widehat{\mathrm{Var}}(\hat\phi) = A(\hat\phi)^{-1}\, \widehat{\mathrm{Var}}\!\left[ \sum_{i=1}^{N_{ALL}} U_i(\hat\phi) \right] \left[ A(\hat\phi)^{-1} \right]',$$
where $A(\hat\phi)$ is the matrix of derivatives of the stacked estimating equations defined below. As illustrated by Boe et al. (2022), the most straightforward way to compute the proposed sandwich variance is to reformulate the variance in terms of the influence functions $A(\hat\phi)^{-1} \tilde{U}(\hat\phi)$, where $\tilde{U}(\hat\phi) = \left( U_1(\hat\phi), \ldots, U_{N_{ALL}}(\hat\phi) \right)'$ is the matrix of transposed estimating equation contributions. Finally, the $[j + k + 2] \times [j + k + 2]$ matrix $A(\hat\phi)$ takes the block form
$$A(\hat\phi) = -\sum_{i=1}^{N_{ALL}} \begin{pmatrix} \frac{\partial U_{i1}}{\partial Se} & 0 & 0 & 0 \\ 0 & \frac{\partial U_{i2}}{\partial Sp} & 0 & 0 \\ 0 & 0 & \frac{\partial U_{i3}}{\partial \delta'} & 0 \\ \frac{\partial U_{i4}}{\partial Se} & \frac{\partial U_{i4}}{\partial Sp} & \frac{\partial U_{i4}}{\partial \delta'} & \frac{\partial U_{i4}}{\partial \psi'} \end{pmatrix}\Bigg|_{\phi = \hat\phi}.$$
We note that the diagonal elements of $A(\hat\phi)$ correspond to the second-order partial derivatives of the four log-likelihood functions. The elements of the bottom row of the matrix are all non-zero, since the estimated sensitivity, specificity, and exposure $\hat{X}_i$ in the proposed outcome model rely on the $Se$, $Sp$, and $\delta$ parameters from the stage 1-3 models, respectively.
All remaining elements of the matrix $A(\hat\phi)$ are 0 because the parameters with respect to which the derivatives are taken are not involved in the corresponding estimating equations.
Finally, we note that for the gold-standard, interval-censored approach that does not involve the auxiliary data, the stacked estimating equation used to form the sandwich variance needs to account for the survey design and the uncertainty added by the estimated exposure.
Thus, the proposed sandwich variance estimator for this method involves two components: the calibration model and outcome model estimating equations. For this two-stage setup, we refer the reader to section 3 of Boe et al. (2022), which specifically outlines the structure of the sandwich variance estimator for the complex survey design setting where the stage 1 model is a linear regression calibration model and the stage 2 model is a familiar generalized linear outcome model.

S3 Regularity conditions for asymptotic normality
We now provide sufficient regularity conditions for the proposed estimator to be asymptotically normal and achieve a $\sqrt{N}$-convergence rate. We assume the following throughout: (1) the survival function follows the Cox proportional hazards model, $S(t) = S_0(t)^{\exp(x'\beta)}$; (2) the auxiliary outcome status ($Y^*_i$) is observed at $n_i$ follow-up times $T^*_i = (T^*_{i1}, \ldots, T^*_{in_i})$ that are a subset of $J + 1$ possible observation times satisfying $0 = \tau_0 < \tau_1 < \tau_2 < \ldots < \tau_J < \tau_{J+1} = \infty$; (3) $\tau_{V_i}$ is the gold standard assessment time for individual $i$, where $\tau_{V_i} \in \{\tau_0, \tau_1, \tau_2, \ldots, \tau_J\}$ and $\Delta_i$ is the corresponding gold standard event status variable; (4) the binary variable $M_i \in \{0, 1\}$ indicates whether $\Delta_i$ is missing; and (5) $X_i$ is the $p$-dimensional true covariate vector of interest with corresponding error-prone vector $X^*_i$, while $Z_i$ is the additionally observed $q$-dimensional vector of error-free covariates, where all covariates are random variables with finite variance. In the main text, we define $\theta_j = P(\tau_{j-1} < T_i \le \tau_j)$ and $S_j = \sum_{h=j}^{J+1} \theta_h = P(T > \tau_{j-1})$ for $j = 1, \ldots, J + 1$. Additionally, we require that $1 = S_1 > S_2 > \ldots > S_{J+1} > 0$, ensuring that $0 < \theta_j < 1$ for $j = 1, \ldots, J$.

S3.1 Proposed estimator for random sample
First, we consider the case where the data are assumed to be a simple random sample from the population and the covariates of interest are recorded precisely (i.e., error-free). The log-likelihood function $l(\psi) = l(T_i, \Delta_i, Y^*_i, T^*_i, M_i, X_i; \psi)$ is defined in Section 2.1, equation (2.4), of the main text, where $\psi = [\beta, S]$ and $S = (S_1, S_2, \ldots, S_{J+1})'$. Recall that we define the score function as $U_i(\psi) = \partial l_i(\psi)/\partial \psi$. The proposed estimator in this setting, $\hat\psi$, is found by solving the score equation $\sum_{i=1}^N U_i(\psi) = 0$. Let $\psi_0$ be the true vector of regression parameters of interest that solves $E\left[\partial l(\psi)/\partial \psi\right] = E\left[\sum_{i=1}^N U_i(\psi)\right] = 0$. Now, further assume that the log-likelihood $l(\psi)$ is twice continuously differentiable with respect to $\psi$, such that there exists an invertible Hessian matrix (Foutz, 1977). We additionally assume the regularity conditions given by Foutz (1977) for establishing consistency and uniqueness of our estimator. Then, appealing to standard maximum likelihood estimation (MLE) theory, with probability going to one as $N \to \infty$, $\hat\psi$ is a unique solution to the likelihood equations that is consistent for $\psi_0$ and asymptotically normal (Boos and Stefanski, 2013). Specifically, one has
$$\sqrt{N}\left(\hat\psi - \psi_0\right) \xrightarrow{d} N\left(0, V(\psi_0)\right),$$
where $V(\psi_0) = I(\psi_0)^{-1}$ is the inverse of the Fisher information matrix.

S3.2 Proposed estimator for complex survey design
We now state the additional regularity conditions needed for the proposed method estimator to accommodate data from a complex survey sampling design by using a weighted log-likelihood function and a sandwich-form variance estimator to address within-cluster correlation and stratification.
Recall from Section 2.2 that a sample of $N$ subjects is drawn from a population of size $N_{POP}$, resulting in sampling probabilities $\pi_i$. Following Lumley and Scott (2017), we assume that $N \to \infty$ and $N/N_{POP} \to p \in (0, 1)$. We additionally assume that $\pi_i$ is known for the subjects in the sample and is bounded away from 0. The weighted log-likelihood of the main text is $l_\pi(S, \beta) = l_\pi(\psi) = \sum_{i=1}^N \frac{1}{\pi_i} l_i(\psi)$, and the proposed estimator $\hat\psi_\pi$ may be found by solving the weighted score equation $\sum_{i=1}^N \tilde{U}_i(\psi_\pi) = \sum_{i=1}^N \frac{1}{\pi_i} U_i(\psi_\pi) = 0$. Following the same logic applied in the random sample case, and since $\hat\psi_\pi$ is also a maximum likelihood estimator, we have
$$\sqrt{N}\left(\hat\psi_\pi - \psi_{0\pi}\right) \xrightarrow{d} N\left(0, V(\psi_{0\pi})\right),$$
where $V(\psi_{0\pi})$ can be approximated using the implicit differentiation method of Binder (1983).
To use this variance estimation approach, assume that $\sum_{i=1}^N \tilde{U}_i(\psi_\pi)$ is suitably smooth such that $\hat\psi_\pi$ can be implicitly defined as a function of $\sum_{i=1}^N \tilde{U}_i(\psi_\pi)$. Further assume that the derivative matrix of $\tilde{U}_i$ is full rank, invertible, and a continuous function of $\psi_\pi$. Then, we can use a Taylor expansion of $\tilde{U}_i$ at $\hat\psi_\pi = \psi_{0\pi}$ to arrive at the following estimator for the asymptotic variance of $\hat\psi_\pi$:
$$\widehat{V}[\hat\psi_\pi] \approx \left( -\sum_{i=1}^N \frac{\partial \tilde{U}_i(\hat\psi_\pi)}{\partial \psi_\pi} \right)^{-1} \widehat{\mathrm{Var}}\!\left[ \sum_{i=1}^N \tilde{U}_i(\hat\psi_\pi) \right] \left( -\sum_{i=1}^N \frac{\partial \tilde{U}_i(\hat\psi_\pi)}{\partial \psi_\pi} \right)^{-1\prime}.$$
Details and theoretical justification are provided in Binder (1983).

S3.3 Proposed estimator for complex survey design and regression calibration
First, assume that the error-free covariates $X_i$ and $Z_i$ are available for all sampled individuals, such that our log-likelihood takes the form $l_{\pi,X,Z}(\psi) = \sum_{i=1}^N \frac{1}{\pi_i}\, l(T_i, \Delta_i, Y^*_i, T^*_i, M_i, X_i, Z_i; \psi)$, where $\psi = [\beta, S]$, $\beta = (\beta_X, \beta_Z)'$, and $S = (S_1, S_2, \ldots, S_{J+1})'$. Adopting the arguments from Sections S3.1 and S3.2, the weighted proposed estimator $\hat\psi_{\pi,X,Z}$, found by solving the weighted score equation $\sum_{i=1}^N \tilde{U}_i(\psi_{\pi,X,Z}) = \sum_{i=1}^N \frac{1}{\pi_i} U_i(\psi_{\pi,X,Z}) = 0$, can also be shown to be consistent for the true parameter $\psi_{0\pi}$ and asymptotically normal. These arguments only apply to settings in which the true covariate $X_i$ is used in the proposed method rather than an estimate $\hat{X}_i$.
We will now make similar arguments of consistency and asymptotic normality for a version of our log-likelihood that incorporates regression calibration. Begin by assuming that $n_C/N \to p \in (0, 1)$, where $n_C$ is the number of subjects in the calibration subset. Recall that we assume the classical measurement error model $X^*_i = X_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2_{\epsilon_i})$, as introduced in Section 2.3.1 of the main manuscript. Assume also that the following linear calibration model from the main manuscript holds: $X_i = \delta_{(0)} + \delta_{(1)} X^*_i + \delta_{(2)} Z_i + W_i$, where the random measurement error term $W_i \sim N(0, \sigma^2_{W_i})$. Define $\delta = (\delta_{(0)}, \delta_{(1)}, \delta_{(2)})$. Then, since the vector of estimated nuisance parameters $\hat\delta$ is a linear regression estimator, we can appeal to standard MLE theory to establish that it is consistent for the true vector of parameters $\delta_0$ and asymptotically normal. To apply regression calibration, the first moment $\hat{X}_i = E(X_i \mid \delta, X^*_i, Z_i)$ is imputed in place of $X_i$ in our outcome model. To establish asymptotic normality of our regression calibration estimator, we first assume $\delta$ is known and work with the weighted log-likelihood $l_{\pi,X^*,Z}(\psi) = \sum_{i=1}^N \frac{1}{\pi_i}\, l(T_i, \Delta_i, Y^*_i, T^*_i, M_i, \hat{X}_i, Z_i; \psi)$, where $\psi = [\beta, S]$, $\beta = (\beta_X, \beta_Z)'$, and $S = (S_1, S_2, \ldots, S_{J+1})'$. We assume that the distributions of the variables are such that when $X_i$ is replaced by $\hat{X}_i$, the log-likelihood $l_{\pi,X^*,Z}(\psi)$ remains continuously differentiable with respect to $\psi$ and the Hessian matrix is still invertible. As before, we solve the weighted score equation $\sum_{i=1}^N \tilde{U}_i(\psi_{\pi,X^*,Z}) = \sum_{i=1}^N \frac{1}{\pi_i} U_i(\psi_{\pi,X^*,Z}) = 0$ to obtain our weighted proposed estimator, $\hat\psi_{\pi,X^*,Z}$. Under the regularity conditions described previously, $\hat\psi_{\pi,X^*,Z}$ will be a unique, consistent solution for $\psi^*_\pi$, the parameter value solving $E\left[\sum_{i=1}^N \tilde{U}_i(\psi_{\pi,X^*,Z})\right] = 0$.
In general, $\psi^*_\pi$ is not the same as $\psi_{0\pi}$, as the parameter estimates from regression calibration are viewed as an approximation (Buonaccorsi, 2010). Using the techniques of Boos and Stefanski (2013) once again, we can verify the asymptotic normality of our estimator $\hat\psi_{\pi,X^*,Z}$, i.e.,
$$\sqrt{N}\left(\hat\psi_{\pi,X^*,Z} - \psi^*_\pi\right) \xrightarrow{d} N\left(0, V(\psi^*_\pi)\right). \tag{S13}$$
The result in equation (S13) relies on the nuisance parameter vector $\delta$ from the error model being known. Under the additional usual regularity assumptions for linear regression that guarantee a consistent and asymptotically normal estimator $\hat\delta$, the result for our calibration estimator still holds when this plug-in estimator for $\delta$ is used, by appealing to Theorem 5.31 of Van der Vaart (2000). The variance $V(\psi^*_\pi)$ is estimated using the sandwich variance approach described in the main text and outlined in detail for a structure similar to the HCHS/SOL data in Section S2.3. When regression calibration is applied to the proposed method to adjust for covariate error, our estimator is only approximate, but it has been empirically shown to have minimal bias and good coverage probability when the true regression parameter is modest in size and the event of interest under study is rare.

S4 Supplemental details for HCHS/SOL data example
We adopted the same exclusion criteria used in the ongoing clinical investigation that seeks to understand the relationship between several dietary intake variables and the risk of chronic diseases in the HCHS/SOL cohort. We excluded any participants who reported diabetes or unknown diabetes status at baseline ($N = 3428$), had missing covariate data ($N = 373$), or had no auxiliary follow-up ($N = 551$), resulting in 12,317 eligible participants. To mimic the planned analysis of the clinical investigation, for the 351 subjects in the data who reported a positive diabetes status after one or more missed annual follow-up calls, we imputed the event time as the midpoint of the missed follow-up times. Most subjects with a missed call subsequently reported that no diabetes diagnosis had occurred since the last completed call, and so a negative disease status was imputed for the prior missed follow-up calls. The proposed method is applied to a subset of 8,200 HCHS/SOL cohort participants, including all eligible SOLNAS subset participants ($N = 420$) and HCHS/SOL participants from primary sampling units (PSUs) with 4 or fewer members ($N = 282$). The remaining subset members were selected by taking a random sample of 7,498 participants who had not yet been selected.
Both the calibration and outcome models accounted for the HCHS/SOL survey design in the analysis, thus incorporating the variability of the survey design into both models. To account for the survey design in the calibration model, we subset the full cohort HCHS/SOL survey design to the participants in the SOLNAS substudy. We used this same subset function to properly subset the full cohort design to the group of 8,200 participants used to fit the proposed outcome model and to the 5,922 participants with the reference standard at visit 2 used in the no auxiliary data model. We note that special consideration is required in the choice of the strata when using data from a complex survey design (Boe et al., 2022). For the HCHS/SOL example, we specified the strata as the cross-classification of the calibration subset (SOLNAS substudy) indicator with the strata from the HCHS/SOL design.

Table S1: Simulation results are shown for exponential failure times assuming the Cox proportional hazards model with $X \sim \mathrm{Gamma}(0.2, 1)$ and $\beta = \log(1.5)$ for (1) the grouped time survival approach that uses the true outcome data from all periodic visits and (2)

Table S2: Simulation results are shown for exponential failure times assuming the Cox proportional hazards model with $X \sim \mathrm{Gamma}(0.2, 1)$, $\beta = \log(1.5)$, and values of $Se = 0.90$ and $Sp = 0.80$ for the auxiliary data. The median percent (%) bias, median standard errors (ASE), empirical median absolute deviation (MAD), and coverage probabilities (CP) are given for 1000 simulated data sets for the proposed method and the standard interval-censored approach that does not incorporate auxiliary data. Footnotes: (1) $MR$ = average probability that the gold standard indicator $\Delta$ is missing at year 4; (2) $CR$ = average censoring rate for the latent true event time at the end of study; (3) $N$ = sample size for the proposed approach; if $MR > 0.0$, the sample size for the no auxiliary data approach is smaller because of missingness in the gold standard indicator $\Delta$; (4) $RE$ = median relative efficiency, calculated as the median of the ratio of the estimated variance of the standard, no auxiliary data approach estimator to the estimated variance of the proposed method estimator, i.e., $\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Standard}})/\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Proposed}})$.

Table S3: Simulation results are shown for exponential failure times assuming the Cox proportional hazards model with $X \sim \mathrm{Gamma}(0.2, 1)$ and $\beta = \log(1.5)$, where the values of $Se$ and $Sp$ incorporated in the analysis differ from the true values. The median percent (%) bias, median standard errors (ASE), empirical median absolute deviation (MAD), and coverage probabilities (CP) are given for 1000 simulated data sets for the proposed method and the standard interval-censored approach that does not incorporate auxiliary data. Here, $N = 1000$, $MR = 0.4$, $CR = 0.5$, and the true $Se = 0.80$ and $Sp = 0.90$ for the auxiliary data.

Footnotes to Table S3 (columns grouped under "Proposed" and "No Auxiliary Data"): (1) $Se$ = value of sensitivity incorporated into the analysis for the proposed method, while the true $Se = 0.80$; (2) $Sp$ = value of specificity incorporated into the analysis for the proposed method, while the true $Sp = 0.90$; (3) $RE$ = median relative efficiency, calculated as the median of the ratio of the estimated variance of the standard, no auxiliary data approach estimator to the estimated variance of the proposed method estimator, i.e., $\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Standard}})/\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Proposed}})$.

Table S4: Simulation results are shown for exponential failure times assuming the Cox proportional hazards model with $X \sim \mathrm{Gamma}(0.2, 1)$ and $\beta = \log(3)$. The median percent (%) bias, median standard errors (ASE), empirical median absolute deviation (MAD), and coverage probabilities (CP) are given for 1000 simulated data sets for the proposed method and the standard interval-censored approach that does not incorporate auxiliary data. Here, $Se = 0.80$ and $Sp = 0.90$ for the auxiliary data.

Table S5: Simulation results are shown for exponential failure times assuming the Cox proportional hazards model with $X \sim \mathrm{Gamma}(0.2, 1)$ and $\beta = \log(1.5)$ for 2 and 3 visit times at which auxiliary data are recorded. The median percent (%) bias, median standard errors (ASE), empirical median absolute deviation (MAD), and coverage probabilities (CP) are given for 1000 simulated data sets for the proposed method and the standard interval-censored approach that does not incorporate auxiliary data. Here, $N = 1000$, $MR = 0.4$, $CR = 0.5$, and $Se = 0.80$ and $Sp = 0.90$ for the auxiliary data.

Footnotes to Table S5 (columns grouped under "Proposed" and "No Auxiliary Data"): $RE$ = median relative efficiency, calculated as the median of the ratio of the estimated variance of the standard, no auxiliary data approach estimator to the estimated variance of the proposed method estimator, i.e., $\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Standard}})/\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Proposed}})$.

Table S6: Simulation results are shown for data simulated from a complex survey with exponential failure times assuming the Cox proportional hazards model with $X \sim \mathrm{Normal}(\mathrm{shape}_s + \omega_{gs}, \mathrm{scale}_s + \rho_{gs})$ for an individual in block group $g$ and stratum $s$, and $\beta = \log(1.5)$. The median percent (%) bias, median standard errors (ASE), median absolute deviation (MAD), and coverage probabilities (CP) are given for 1000 simulated data sets for the weighted proposed estimator and the weighted interval-censored approach that does not incorporate auxiliary data, when both use a sandwich-form variance estimator to address within-cluster correlation. Here, $Se = 0.80$ and $Sp = 0.90$ for the auxiliary data.

[Table S6 columns: $MR$, $CR$, $N$; % Bias, ASE, MAD, and CP for each of the "Proposed" and "No Auxiliary Data" approaches; and $RE$. Table entries not reproduced here.]

(1) Each model is adjusted for potential confounders including age, body mass index (BMI), sex, Hispanic/Latino background, language preference, education, income, and smoking status.
(2) $RE$ = relative efficiency, calculated as the ratio of the estimated variance of the standard, no auxiliary data approach estimator to the estimated variance of the proposed method estimator, i.e., $\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Standard}})/\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{Proposed}})$.