Semiparametric estimation of the proportional rates model for recurrent events data with missing event category

Proportional rates models are frequently used for the analysis of recurrent event data with multiple event categories. When some of the event categories are missing, a conventional approach is to either exclude the missing data for a complete-case analysis or employ a parametric model for the missing event type. It is well known that the complete-case analysis is inconsistent when the missingness depends on covariates, and the parametric approach may incur bias when the model is misspecified. In this paper, we aim to provide a more robust approach using a rate proportion method for the imputation of missing event types. We show that the log-odds of the event type can be written as a semiparametric generalized linear model, facilitating a theoretically justified estimation framework. Comprehensive simulation studies were conducted demonstrating the improved performance of the semiparametric method over parametric procedures. Multiple types of Pseudomonas aeruginosa infections of young cystic fibrosis patients were analyzed to demonstrate the feasibility of our proposed approach.

shows the simulation result. One can see the misspecification of the rate model (1) that misses the critical covariateW i affects little the performance of our proposed estimator. The estimates are approximately unbiased and the approximation of the asymptotic variance performed well under such misspecification. The coverage probability is around the 0.95 nominal level, even when the censoring rate is high, and the empirical rejection rate is close to the 0.05 type-I error probability.
We also purposely use a misspecified weighted estimating equations method that ignores W i . We intend to show that the our estimator is more efficient than the weighted estimating equations method when the model is misspecified. The simulation results are shown in Table  S2. The same as the Table 1 in our main paper, we report the bias, empirical standard error of the estimator (σ 1 ), and mean square error ratios comparing to the estimators assuming no missing event category and using weighted estimating equations. Table S2 shows that, when the model is misspecified by the weighted estimating equations method, the estimatorβ w 1 remains approximately unbiased when the sample size n is large. However, the variation is relatively large, compared to our estimator, especially when the censoring rate is high. Our estimator can improve the efficiency up to 20% when the censoring rate is at 40%.

Additional Simulations for Time-Varying Covariate Models
In this section, we present the simulation results under a model with a time-varying covariate to show that our estimation method can also handle time-varying covariates. We generated the data by the following intensity functions whereW i is a binary random variable from a Bernoulli distribution with probability 0.5, and I(·) is an indicator function that indicates the time-varying effect ofW i on the intensity. All of the parameters and variables were set up the same as the scenario 2 in the simulation section of our main manuscript except β 3 , β 4 , andW i . We let β 3 = log(0.8) and β 4 = 0. The negative value of β 3 suggests thatW i decreases the intensity of type 1 events when t Table S1: Simulation results of parameter estimation under a misspecified model  is larger than or equal to 2. We fit the data by rate functions (2) is considered as a time-varying covariate model since the covariate W ij I(t ≥ 2) changes by time. Table S3 shows the simulation result based on 500 repeated samples under sample sizes n = 200, 400. We reported the same metrics in Table S1 for β 1 and β 3 , where β 3 is the coefficient of the time-varying covariate. Form Table S3, one can see that our estimator behave similarly when the covariate is time-varying. Both point and variance estimation perform well, resulting around 0.95 coverage probability. The simulation result demonstrates our estimator can work under a time-varying covariate model.

Assessment of Missing at Random Assumption
The missing at random (MAR) assumption (Little and Rubin, 2002) can be assessed by exploring whether the likelihood of missingness depends covariates. One can assume the probability of having a missing category follows a logistic regression model where z i (t) = (1, t, N i· (t−), Z i ) and is a column vector of corresponding coefficients of z i (t). To assess the MAR assumption in the Cystic Fibrosis Foundation Patient Registry (CFFPR) data, we include all of the covariates in Table 3 of the main paper for Z i and fit the logistic regression model above. Table S4 shows the coefficient estimation in odds ratio, its 95% confidence interval (CI), and Wald-type test p-value. The result indicates that the missingness is significantly associated with age, number of previous events, diagnostic method, and medication use. The overall testing for = 0 is significant (p-value<0.001), which indicates the missingness depends on the covariates.

Sensitivity Analysis when Missing at Random Assumption is Violated
The missing at random (MAR) assumption stands if κ j (t|Z ij ) = 0 in model (5) of the main paper. To check the robustness of our analysis for the Cystic Fibrosis Foundation   Patient Registry (CFFPR) data, we assigned different values of κ j (t|Z ij ) in the model (5) and examined whether the coefficient estimation of the proportional rates model is sensitive to different values of κ j (t|Z ij ). Here, we assigned constant values for κ 1 (t|Z ij ) and κ 2 (t|Z ij ) that are independent of Z ij . Figures S1 and S2 show the estimation results under different values of κ 1 and κ 2 when randomly choosing 10% of the patient's data for the sensitivity analysis. One can see the rate ratio estimation of the sex effect of female relative to male in nonmucoid PA infection in the solid lines of Figure S1 varies little under different combinations of κ 1 and κ 2 . Our proposed estimates are shown in a triangle when κ 1 = κ 2 = 0. Although the 95% confidence intervals are not shown, they also have little difference based on the proposed variance estimation. A similar result can be seen in Figure S2, where the point estimation are shown for the effect of the heterozygous relative to homozygous F508del genotype in nonmucoid PA infection. The result shows that our proposed estimators for the CFFPR data are quite robust to various values we considered for κ 1 and κ 2 .