Simultaneous inference procedures for the comparison of multiple characteristics of two survival functions

Survival time is the primary endpoint of many randomized controlled trials, and a treatment effect is typically quantified by the hazard ratio under the assumption of proportional hazards. Awareness is increasing that in many settings this assumption is a priori violated, for example due to a delayed onset of drug effect. In these cases, the interpretation of the hazard ratio estimate is ambiguous, and statistical inference for alternative parameters that quantify a treatment effect is warranted. We consider differences or ratios of milestone survival probabilities or quantiles, differences in restricted mean survival times, and an average hazard ratio to be of interest. Typically, more than one such parameter needs to be reported to assess possible treatment benefits, and in confirmatory trials the corresponding inferential procedures need to be adjusted for multiplicity. A simple Bonferroni adjustment may be too conservative because the different parameters of interest typically show considerable correlation. Hence simultaneous inference procedures that take this correlation into account are warranted. Using the counting process representation of the mentioned parameters, we show that their estimates are asymptotically multivariate normal and we provide an estimate of their covariance matrix. We propose corresponding parametric multiple testing procedures and simultaneous confidence intervals. Also, the logrank test may be included in the framework. Finite sample type I error rate and power are studied by simulation. The methods are illustrated with an example from oncology. A software implementation is provided in the R package nph.


Introduction
The aim of survival analysis in a randomized clinical trial is typically to show a benefit of an experimental treatment over a control treatment. Under the frequent model assumption of proportional hazards, the relative treatment benefit can be quantified by a single parameter, the hazard ratio, which describes the shift of the survival function under treatment compared to the survival function under control at every time-point.
When aiming to establish superiority of a new treatment over control in terms of survival in a randomized controlled clinical trial, the null hypothesis of equal survival functions is typically tested by a logrank test, or an equivalent test based on the Cox model, and a hazard ratio estimate is reported to quantify the treatment effect. Under the assumption of proportional hazards, this approach is efficient in terms of power and allows for unambiguous conclusions on superiority, because one single parameter, the hazard ratio, is sufficient to describe the shift between the survival functions at every time-point.
Recently, the assumption of proportional hazards has been found to be inadequate or at least questionable in relevant clinical settings [1,2,3,4,5]. In particular, a delayed onset of treatment effect in immuno-oncology drugs has been discussed as an important source of non-proportional hazards and, moreover, as a setting in which the traditional logrank test loses power and the effect quantification using the hazard ratio as a single summary measure may be flawed. Other sources of non-proportional hazards that have been discussed in the literature are, e.g., a heterogeneous patient population with population subgroups that respond differently to treatment, such as long-term survivors, modified efficacy after disease progression, or the need for rescue medication or treatment switching [5,6,7].
Several testing procedures have been proposed to remedy the potential loss of power of the logrank test in these settings. In particular, the use of weighted logrank tests has been proposed, putting more weight on event times with a more pronounced anticipated effect [8,9]. If the pattern of non-proportional hazards is not well known beforehand, a maximum combination (MaxCombo) test of differently weighted logrank tests has been shown to be a robustly powerful method [10,11,12,5,13,14]. However, in the absence of further assumptions, a significant result for these tests, at first, only implies that there exists a time point for which the hazard function under treatment is less than the hazard function under control. Whether this result translates into a relevant (or in fact into any [9]) benefit in terms of the survival functions needs to be assessed by estimates for parameters that appropriately quantify the differences between survival functions [6,15].
The interpretation of the parameter estimate of the Cox model can be challenging under non-proportional hazards. In particular, the limiting value of the Cox model hazard ratio estimate will depend not only on the true hazard functions but also on study design parameters such as the recruitment rate, study duration and censoring distribution that affect the timing of observed events. Alternative effect measures, proposed for use under non-proportional hazards, include the difference in restricted mean survival times [16,17], average hazard ratios defined via predefined weighting functions [18,19,20], differences in x-year milestone survival probabilities, or differences in quantiles of the survival distribution. Several authors have argued that a single such parameter may not be sufficient and that differences in survival curves should instead be assessed by a set of summary statistics [4,6,21].
In this paper, we propose a simultaneous inference framework for a set of multiple parameters, which may include differences in survival probabilities, differences in log survival probabilities, differences in complementary log-log (cloglog) transformed survival probabilities, differences in quantiles of the survival functions, differences in log-transformed quantiles, an average hazard ratio and the difference in restricted mean survival times. The logrank test, albeit being a non-parametric test, may be included, too. For completeness, the Cox model hazard ratio may be included under the assumption of proportional hazards.
We present multiple testing procedures and simultaneous confidence intervals for pre-specified sets of parameters, such that the family-wise type I error rate is controlled and multiple parameter estimates can be interpreted simultaneously in a confirmatory manner. The multiple testing adjustment is based on the counting process representation of survival function estimates, by which we show that the considered estimates are asymptotically multivariate normal and by which we derive an estimate of their asymptotic covariance matrix.
The paper is structured as follows. In Section 2, we propose a general framework for the multivariate normal approximation of multiple survival parameters and show the application to particular estimates. In Section 3, we describe multiple testing procedures and simultaneous confidence intervals. In Section 4, we perform a simulation study to assess the operating characteristics of the proposed methods in terms of type I error rate and power under different scenarios. In Section 5, we illustrate the proposed methods in a worked example based on survival curves reported by Burtness et al. [22] (KEYNOTE-048) for the comparison of pembrolizumab versus cetuximab in treating recurrent or metastatic squamous cell carcinoma. All proposed methods are implemented in the R package nph [5] (see Section 6). We conclude with a discussion.

Multivariate normal approximation for multiple survival parameter estimates
In this section we present a general framework for a multivariate normal approximation and the estimation of the covariance matrix for multiple parameters derived from survival functions. We subsequently apply the framework to different parameter estimates that are commonly used to quantify the difference between two survival curves.

General framework based on martingale representation
We consider a control and a treatment group, indexed by i = 0, 1, including n_i independent subjects with observations of possibly censored survival times. Denote by D_i the set of observed event times in group i. We assume that the censoring times are stochastically independent of the true event times.
For further notation, let S_i(t) = P(T_i > t) be the survival function in group i, with T_i a random event time from the respective group, and let λ_i(t) denote the corresponding hazard function. Further, for a given data sample, let N_i(t) be the number of observed events in the time interval [0, t] and Y_i(t) the number at risk at time t in group i. We denote by Λ̂_i(t) = Σ_{s∈D_i, s≤t} dN_i(s)/Y_i(s) the Nelson-Aalen estimator of the cumulative hazard, by Ŝ_i(t) = exp{−Λ̂_i(t)} the Nelson-Aalen estimator of S_i(t) and by Ŝ⁻_i(t) = exp{−Σ_{s∈D_i, s<t} dN_i(s)/Y_i(s)} the left-continuous version of Ŝ_i(t). Define q̂_i(γ) = min{t : Ŝ_i(t) ≤ 1 − γ} as an estimate of the γ-quantile of the survival distribution in group i.
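As a concrete illustration of this notation, the estimators Λ̂_i, Ŝ_i and q̂_i(γ) can be computed in a few lines. The following Python sketch is a minimal illustration (not the implementation in the nph package), assuming right-censored data given as observation times with event indicators:

```python
import numpy as np

def nelson_aalen(times, events, t_grid):
    """Nelson-Aalen cumulative hazard, cumhaz(t) = sum_{s<=t} dN(s)/Y(s),
    and the survival estimate exp(-cumhaz(t)), evaluated at t_grid."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    cum_haz = []
    for t in t_grid:
        ch = 0.0
        for s in event_times[event_times <= t]:
            dN = np.sum(events & (times == s))  # observed events at time s
            Y = np.sum(times >= s)              # number at risk just before s
            ch += dN / Y
        cum_haz.append(ch)
    cum_haz = np.array(cum_haz)
    return cum_haz, np.exp(-cum_haz)

def quantile_estimate(times, events, gamma):
    """Quantile estimate min{t : S_hat(t) <= 1 - gamma} over observed event times."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    _, surv = nelson_aalen(times, events, event_times)
    below = np.where(surv <= 1 - gamma)[0]
    return event_times[below[0]] if below.size else np.inf
```

For example, with observation times (1, 2, 3, 4) and events at 1, 2 and 4, the estimator gives Λ̂(1) = 1/4 and Λ̂(2) = 1/4 + 1/3.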

Further, we denote by M_i(t) = N_i(t) − ∫_0^t Y_i(s) λ_i(s) ds the difference between the observed and expected number of events up to time t. M_i(t) is a martingale process that is key to the counting process representation of estimates in survival analysis [23,8].
As a general framework, we aim to quantify the difference between the survival functions under control and treatment by a set of parameters θ_1, …, θ_m. We assume that the true parameters are of the form θ_k = θ_k,1 − θ_k,0, where θ_k,i is a function of the true survival function and possibly of the true censoring function in group i.
We further assume that the difference between the corresponding estimates θ̂_k,i and the true parameter values θ_k,i can be approximated, up to an asymptotically negligible residual term, by a stochastic integral such that

θ̂_k,i − θ_k,i ≈ a_k,i ∫_0^{t_k} H_k,i(s) dM_i(s),    (1)

where the a_k,i are constant parameters for which consistent estimates â_k,i exist and the H_k,i are predictable processes with respect to the martingale process M_i for which consistent estimates Ĥ_k,i exist. The interval [0, t_k] is the time interval within which data is used to calculate θ̂_k,i.
Finally, we assume that the asymptotic stability condition holds [24]: there exists a function ρ_i with values in (0, 1) such that for n_i → ∞, sup_{0 < s ≤ max(t_1,…,t_m)} |Y_i(s)/n_i − ρ_i(s)| → 0 almost surely. Then, by the multivariate martingale central limit theorem, the vector (θ̂_1,i − θ_1,i, …, θ̂_m,i − θ_m,i) asymptotically follows a multivariate normal distribution with mean zero and variances and covariances given by [8]

cov(θ̂_k,i, θ̂_l,i) ≈ a_k,i a_l,i ∫_0^{min(t_k, t_l)} H_k,i(s) H_l,i(s) Y_i(s) λ_i(s) ds.    (2)

A consistent estimator Σ̂_i for the covariance matrix of (θ̂_1,i, …, θ̂_m,i) is obtained by replacing a_k,i by â_k,i, H_k,i by Ĥ_k,i and Y_i(s) λ_i(s) ds by dN_i(s) in (2). We assume continuous survival distributions, such that two events occur at the same time with probability zero. However, in actual applications event times are not measured precisely but are typically rounded, e.g. to full days, such that tied event times may occur. To account for ties, the increments of the variance estimate are adjusted at times s with more than one tied event, see e.g. Klein and Moeschberger [25].

Application to specific parameters
To quantify between-group differences, we consider a range of parameters: differences in survival probabilities, differences in log survival probabilities, differences in cloglog-transformed survival probabilities, differences in quantiles of the survival times, differences in log-transformed quantiles, an average hazard ratio and the difference in restricted mean survival times. We also consider the Cox model score test (logrank test) statistic and, under the proportional hazards assumption, the hazard ratio corresponding to the Cox model. For all these parameters, estimators can be constructed that satisfy the assumptions laid out in Section 2.1. In particular, the estimators can be written in the form (1), as detailed below. Hence, their joint distribution can be approximated by a multivariate normal distribution, with covariances estimated by equation (2). Based on this approximation, multiple hypothesis tests can be constructed. The specific terms needed to calculate variance and covariance estimates according to (2) are summarized in Table 1.

Survival probabilities
The estimated difference in survival functions at time t is θ̂_j = Ŝ_1(t) − Ŝ_0(t). The asymptotic properties and the martingale representation for survival function estimates are well established [23,8]. In particular,

Λ̂_i(t) − Λ_i(t) = ∫_0^t (1/Y_i(s)) dM_i(s),    (4)

and by first order approximation [8],

Ŝ_i(t) − S_i(t) ≈ −S_i(t) (Λ̂_i(t) − Λ_i(t)).    (5)

Hence the representation in terms of equation (1) is Ŝ_i(t) − S_i(t) ≈ −S_i(t) ∫_0^t (1/Y_i(s)) dM_i(s). Alternatively, the ratio of survival probabilities at time t may be of interest, which may be included in the proposed framework on the log scale in terms of θ̂_j = log Ŝ_1(t) − log Ŝ_0(t). Since log S_i(t) = −Λ_i(t) and by equation (4), the representation in terms of equation (1) is log Ŝ_i(t) − log S_i(t) ≈ −∫_0^t (1/Y_i(s)) dM_i(s). Estimates and confidence intervals for the log-ratio of survival probabilities may be back-transformed to obtain the respective quantities for the ratio S_1(t)/S_0(t) on the original scale.
A further common transformation is the complementary log-log (cloglog) of the estimated survival probability, such that the parameter θ̂_j = log(−log Ŝ_1(t)) − log(−log Ŝ_0(t)) may be included, with the martingale representation log(−log Ŝ_i(t)) − log(−log S_i(t)) ≈ −(1/log S_i(t)) ∫_0^t (1/Y_i(s)) dM_i(s). Here, the transformed parameter exp(θ̂_j) may be of interest. This is an estimate of the ratio of cumulative hazards, Λ_1(t)/Λ_0(t) (which under proportional hazards corresponds to the hazard ratio).

Quantiles
For the difference in γ-quantiles, θ̂_j = q̂_1(γ) − q̂_0(γ), the representation in terms of equation (1) is q̂_i(γ) − q_i(γ) ≈ −(1/λ_i(q_i(γ))) ∫_0^{q_i(γ)} (1/Y_i(s)) dM_i(s). Here a consistent estimate λ̂_i(q̂_i(γ)) of the hazard at the respective quantile is required. We use a relatively simple approach and estimate λ_i(q_i(γ)) under a locally constant hazard approximation as follows: let e_i be the total number of observed events in group i, and define, for a positive finite constant bandwidth K, the boundaries of a time interval around q̂_i(γ) that contains at least K√e_i events. Here, the maximum and minimum of an empty set are defined as −∞ and ∞, respectively.
The hazard within the interval is estimated as the ratio of the number of events in the interval to the sum of all observed times within it.
With an increasing number of events, the interval becomes narrower while the absolute number of events within the interval gets larger. Under the assumption of a continuous hazard function, and by consistency of q̂_i(γ), the resulting estimate is consistent. In the actual calculations we used K = 2. Alternatively, the local hazard may be estimated by kernel density estimation [28].
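The locally constant hazard estimate can be sketched as follows. This is a simplified Python illustration, not the nph implementation; as a simplifying assumption (the exact boundary definition is only given implicitly in the text), the window is spanned by the ⌈K√e⌉ observed event times closest to the estimated quantile:

```python
import numpy as np

def local_hazard(times, events, q, K=2.0):
    """Locally constant hazard estimate near a quantile estimate q:
    number of events in a window around q, divided by the total time at
    risk spent inside the window.  Simplifying assumption: the window is
    spanned by the ceil(K*sqrt(e)) event times closest to q."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    ev = np.sort(times[events])
    e = len(ev)
    k = min(e, max(2, int(np.ceil(K * np.sqrt(e)))))
    window = ev[np.argsort(np.abs(ev - q))[:k]]
    lo, hi = window.min(), window.max()
    d = np.sum(events & (times >= lo) & (times <= hi))   # events in window
    exposure = np.sum(np.clip(times, lo, hi) - lo)       # time at risk in window
    return d / exposure
```

Because the window width shrinks at rate proportional to 1/√e while the event count inside it grows, the estimate is consistent for a continuous hazard, mirroring the argument in the text.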
When the ratio of quantiles is of interest, the difference on the log scale may be defined as the parameter of interest, such that θ̂_j = log q̂_1(γ) − log q̂_0(γ). By application of the delta method to the original approximation we obtain the required representation as log q̂_i(γ) − log q_i(γ) ≈ −(1/(q_i(γ) λ_i(q_i(γ)))) ∫_0^{q_i(γ)} (1/Y_i(s)) dM_i(s).

Average hazard ratio
A general class of average hazard ratios can be defined as

θ = ∫_0^L W(s) dΛ_1(s) / ∫_0^L W(s) dΛ_0(s),

where L is a predefined time point and W(s), s ≥ 0, is a non-negative, monotonically decreasing weight function with values in [0, 1] [18]. We consider the average hazard ratio with estimated weight function Ŵ(s) = Ŝ⁻_0(s) Ŝ⁻_1(s) and its corresponding estimate

θ̂ = ∫_0^L Ŵ(s) dΛ̂_1(s) / ∫_0^L Ŵ(s) dΛ̂_0(s).

Here the left-continuous estimator of the survival function is used to obtain a predictable weight function, which is required in the eventual application of the martingale central limit theorem. Note that both Ŝ and Ŝ⁻ are uniformly consistent estimates of S, and with sufficient sample size their numeric difference is negligible.
The average hazard ratio based on the considered weight function is identical to P(T_1∧L > T_0∧L) / P(T_1∧L < T_0∧L) and can be interpreted as a non-parametric effect measure similar to a Mann-Whitney statistic. Unlike the Cox model hazard ratio estimate, the limiting value of this average hazard ratio estimate does not depend on the censoring distribution (also see simulation scenarios 2 and 3 in Section 4).
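The win-probability form of this average hazard ratio can be checked by simulation. The sketch below is illustrative only; note that which group appears in the numerator is a matter of convention, and the orientation used here (treatment-first events in the numerator) is chosen so that, under proportional hazards, the value equals the treatment-versus-control hazard ratio:

```python
import numpy as np

def ahr_montecarlo(sample_trt, sample_ctr, L, n=200000, seed=1):
    """Monte-Carlo evaluation of the ratio of win probabilities of the
    L-truncated event times.  sample_trt/sample_ctr are callables drawing
    n event times given a numpy Generator (illustrative interface)."""
    rng = np.random.default_rng(seed)
    t1 = np.minimum(sample_trt(rng, n), L)  # truncated treatment times
    t0 = np.minimum(sample_ctr(rng, n), L)  # truncated control times
    return np.mean(t1 < t0) / np.mean(t1 > t0)
```

For example, with exponential event times with hazards 0.5 (treatment) and 1.0 (control) and a late truncation point, the ratio is close to 0.5, the proportional-hazards hazard ratio.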
To embed the average hazard ratio in the proposed framework we use the log-transformed estimate θ̂_j = log ∫_0^L Ŵ(s) dΛ̂_1(s) − log ∫_0^L Ŵ(s) dΛ̂_0(s). The contribution of group i to the estimate of the log average hazard ratio immediately fits into the proposed framework, as ∫_0^L Ŵ(s) dΛ̂_i(s) − ∫_0^L W(s) dΛ_i(s) ≈ ∫_0^L (W(s)/Y_i(s)) dM_i(s). By application of the delta method, the representation according to (1) for the log-transformed term is log ∫_0^L Ŵ(s) dΛ̂_i(s) − log ∫_0^L W(s) dΛ_i(s) ≈ (1/∫_0^L W(s) dΛ_i(s)) ∫_0^L (W(s)/Y_i(s)) dM_i(s). Note that the statistic with estimated weights Ŵ is asymptotically equivalent to the corresponding statistic with the true (unknown) weights W. This follows from the following arguments: by the martingale central limit theorem, the difference between both expressions, ∫_0^L ((Ŵ(s) − W(s))/Y_i(s)) dM_i(s), converges in distribution to a normal distribution with mean zero, and since Ŝ⁻_i and, consequently, Ŵ(s) are uniformly strongly consistent estimators, the variance of ∫_0^L ((Ŵ(s) − W(s))/Y_i(s)) dM_i(s) converges to 0. Hence, the asymptotic arguments established in Section 2.1 are not affected by using the estimated weights Ŵ instead of the true (unknown) weights W.

Restricted mean survival time (RMST)
The RMST in group i up to a pre-specified time point L is μ_i = ∫_0^L S_i(t) dt. Let D_i be the number of unique event times t_i,1 < … < t_i,D_i ≤ L in group i that are less than or equal to L. Further define t_i,0 = 0, t_i,D_i+1 = L and Δt_i,j = t_i,j+1 − t_i,j. The corresponding estimate of the RMST is μ̂_i = Σ_{j=0}^{D_i} Ŝ_i(t_i,j) Δt_i,j, and the estimated RMST difference between the two groups is θ̂_j = μ̂_1 − μ̂_0. μ̂_i may be represented in terms of equation (1) [29,30]. First note that μ̂_i − μ_i ≈ Σ_{j=0}^{D_i} (Ŝ_i(t_i,j) − S_i(t_i,j)) Δt_i,j, where we use that the error of the integral approximation is of order 1/n. By replacing Ŝ_i(t_i,j) − S_i(t_i,j) by equation (5) and by changing the order of integration and summation, the required form is obtained as μ̂_i − μ_i ≈ −∫_0^L (∫_s^L S_i(t) dt) (1/Y_i(s)) dM_i(s).

Logrank test
The logrank test for the null hypothesis H_0: λ_0(s) = λ_1(s), for all 0 ≤ s ≤ L, may be of interest also in non-proportional hazard settings. The usual logrank test is asymptotically equivalent to the Cox model score test for the null hypothesis β = 0, where β is the Cox model log hazard ratio. The tests are equivalent up to the variance estimate for the score test statistic. The score test may be directly included in the proposed framework to adjust for multiple testing, as shown below.
For subject j = 1, …, n_i in group i ∈ {0, 1}, let y_ij(t) ∈ {0, 1} indicate whether the subject is at risk at time t, and let N_ij(t) ∈ {0, 1} be the number of events of the subject in the time interval [0, t]. In a Cox model comparing two treatment groups up to a time point L, the score function (i.e. the derivative of the log partial likelihood) is

U(β) = Σ_i Σ_j ∫_0^L ( 1{i=1} − Y_1(s) e^β / (Y_0(s) + Y_1(s) e^β) ) dN_ij(s).

It can be shown [31] that under the proportional hazards assumption (i.e. λ_1(t) = λ_0(t) exp(β) for all t ≥ 0), and for β = β_0 being the true parameter value, dN_ij may be replaced by dM_ij, such that

U(β_0) = Σ_i Σ_j ∫_0^L ( 1{i=1} − Y_1(s) e^{β_0} / (Y_0(s) + Y_1(s) e^{β_0}) ) dM_ij(s),

which can be rewritten as

U(β_0) = ∫_0^L (Y_0(s) Y_1(s) / (Y_0(s) + Y_1(s) e^{β_0})) ( dM_1(s)/Y_1(s) − e^{β_0} dM_0(s)/Y_0(s) ).    (8)

Note that under the null hypothesis H_0: λ_0(s) = λ_1(s), for all 0 ≤ s ≤ L, the proportional hazards assumption holds.
Further note that rejection of β = 0 entails rejection of λ_0(s) = λ_1(s) at least for some s. The reverse is not necessarily true, which results in low power of the logrank test under crossing hazards. The test statistic for the score test of the null hypothesis β = β_0 is U(β_0) as defined in (8). Hence, the Cox model score test for the null hypothesis β = 0 can be included in the framework of Section 2.1 by defining a parameter estimate θ̂_j = U(0) and by setting H_k,i(s) = Y_0(s)Y_1(s)/(Y_0(s) + Y_1(s)) and a_k,i = 1 in equation (1). Note that the assumption of uncorrelated contributions from both groups is satisfied for the score test, even though both terms in equation (8) contain the at-risk processes of both groups, because we assume that the probability of equal event times is 0 (and ties only occur due to rounding of observed event times). Under this assumption, Theorems 2.5 of §2 and 2.4 of §4 of Fleming and Harrington show that the covariance of statistics of the type assumed in equation (1) is 0.
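For concreteness, the score statistic U(0) and its hypergeometric variance can be sketched as follows. This is a minimal Python illustration for continuous (untied) data, omitting the tie-correction factor and not the nph implementation:

```python
import numpy as np

def logrank_score(times0, events0, times1, events1, L=np.inf):
    """Cox score (logrank) statistic U(0) and its variance up to time L:
    at each event time, observed minus expected events in group 1, with
    the hypergeometric variance for untied data."""
    t = np.concatenate([times0, times1]).astype(float)
    e = np.concatenate([events0, events1]).astype(bool)
    g = np.concatenate([np.zeros(len(times0)), np.ones(len(times1))]).astype(bool)
    U, V = 0.0, 0.0
    for s in np.unique(t[e & (t <= L)]):
        at_risk = t >= s
        Y0, Y1 = np.sum(at_risk & ~g), np.sum(at_risk & g)
        d = np.sum(e & (t == s))                # total events at s
        d1 = np.sum(e & (t == s) & g)           # events at s in group 1
        U += d1 - d * Y1 / (Y0 + Y1)            # observed minus expected
        V += d * Y0 * Y1 / (Y0 + Y1) ** 2       # hypergeometric variance (no ties)
    return U, V
```

The standardized statistic U/√V is then the usual (one-sided) logrank z-statistic.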

Cox model hazard ratio
The Cox model hazard ratio can be included in the proposed framework of simultaneous inference under the assumption of proportional hazards. First note that the estimate of the log hazard ratio, β̂, is the solution of U(β̂) = 0. Next, by standard asymptotic results, β̂ − β_0 ≈ −H^{−1} U(β_0), where β_0 denotes the true log hazard ratio [32]. Here, U(β_0) can be decomposed into contributions from the two groups according to (8). The Hessian H = dU/dβ can be estimated by

Ĥ = −∫_0^L ( Y_0(s) Y_1(s) e^{β̂} / (Y_0(s) + Y_1(s) e^{β̂})² ) d(N_0 + N_1)(s).

Therefore, the Cox model hazard ratio can be included in the framework of Section 2.1 by setting θ̂_j = β̂, with the group-wise contributions of (8) scaled by −Ĥ^{−1}. The theory of Section 2.1 applies to the hazard ratio under the assumption of proportional hazards. The operating characteristics of simultaneous inference including the hazard ratio under non-proportional hazards are explored in the Supplemental material.

Resampling based covariance matrix estimate
As an alternative method, the covariance matrix for a vector of parameter estimates θ̂ = (θ̂_1, …, θ̂_m) may be estimated by a resampling approach similar to the perturbation methods described by Zhao et al. [30,33] and Parzen et al. [34], which can be regarded as a parametric bootstrap.
Based on the martingale central limit theorem and equation (2), the process Λ̂_i(t) − Λ_i(t) = ∫_0^t (1/Y_i(s)) dM_i(s), t ≥ 0, can be approximated by a Gaussian process with independent increments and covariance function cov(Λ̂_i(t) − Λ_i(t), Λ̂_i(t′) − Λ_i(t′)) given by (2), with min(t, t′) as the upper integration limit. Accordingly, the distribution of this process can be approximated by the distribution of the process Σ_{j=1}^{n_i} Z_j ∫_0^t (1/Y_i(s)) dN_ij(s), t ≥ 0, where Z_j, j = 1, …, n_i, are independent standard normal random variables.
A perturbation sample of Λ̂_i − Λ_i is defined as P*_i(t) = Σ_{j=1}^{n_i} z_j ∫_0^t (1/Y_i(s)) dN_ij(s), t ≥ 0, where z_j, j = 1, …, n_i, are realisations of independent standard normal random variables, and a perturbation of Λ̂_i is defined as Λ̂*_i = Λ̂_i + P*_i. As with equation (2), in case of ties the increments of the perturbation process are adjusted accordingly. A vector of parameter estimates can be regarded as a function θ̂(Λ̂_0, Λ̂_1) of the estimated cumulative hazard functions. In the proposed perturbation approach, the distribution of θ̂(Λ̂_0, Λ̂_1) given the true cumulative hazard functions Λ_0 and Λ_1 is approximated by the distribution of θ̂(Λ̂*_0, Λ̂*_1) given Λ̂_0 and Λ̂_1.
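The perturbation idea can be sketched in a few lines of Python. As a simplification, the sketch below draws one standard normal per distinct event time, with increment standard deviation √dN(s)/Y(s); for untied data this is equivalent in distribution to the per-subject multiplier construction described above. It is an illustration only, not the nph implementation:

```python
import numpy as np

def perturb_covariance(times, events, stat_fns, B=2000, seed=2):
    """Perturbation (parametric bootstrap) estimate of the covariance of
    statistics that are functions of the cumulative hazard.  Each replicate
    adds to the estimated cumulative hazard a zero-mean Gaussian process with
    independent increments of variance dN(s)/Y(s)^2 at each event time."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    ev = np.unique(times[events])
    dN = np.array([np.sum(events & (times == s)) for s in ev])
    Y = np.array([np.sum(times >= s) for s in ev])
    base_inc = dN / Y                       # Nelson-Aalen increments
    rng = np.random.default_rng(seed)
    out = np.empty((B, len(stat_fns)))
    for b in range(B):
        z = rng.standard_normal(len(ev))
        lam_star = np.cumsum(base_inc + z * np.sqrt(dN) / Y)  # perturbed Lambda*
        out[b] = [f(ev, lam_star) for f in stat_fns]
    return np.cov(out, rowvar=False)
```

Each `stat_fn` maps the perturbed cumulative hazard (evaluated at the event times) to a parameter estimate; the empirical covariance over replicates then approximates the covariance matrix of the estimates.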

Maximum-type hypothesis tests for multivariate normal statistics
We aim to test hypotheses with respect to parameters θ = (θ_1, …, θ_m), as defined in the previous sections, with strong control of the family-wise type I error rate. We consider one-sided elementary null hypotheses H_k, k = 1, …, m, of no benefit in the respective parameter, and the global intersection null hypothesis of no difference in any of the parameters, H_0 = ∩_{k=1}^m H_k. In superiority trials, depending on whether the parameter of interest is a ratio or a difference between treatment and control, the null value of θ_k is set to 1 or 0, respectively. The considered hypothesis tests can also be defined for two-sided hypotheses, but for the purpose of showing superiority of a treatment over a control we regard one-sided tests as more relevant.
Multiple hypothesis tests can be constructed based on the multivariate normal approximation of the estimates θ̂ = (θ̂_1, …, θ̂_m), see e.g. [35]. Correlations between the estimates can be high, hence the multiplicity correction based on the approximate joint distribution can be moderate compared to methods that control the family-wise type I error rate without taking the correlation structure into account, such as the Bonferroni correction. We consider maximum-type tests. Other test statistics, such as sum or chi-squared type statistics, may also be used within the multivariate normal framework. Consider a vector of m parameter estimates θ̂ that approximately follows N_m(θ, Σ). Let V be the diagonal matrix with the diagonal entries of Σ, and define the vector of elementary test statistics T = V^{−1/2}(θ̂ − θ_0), where θ_0 contains the null values of the parameters. Under the global null hypothesis, T approximately follows N_m(0, R), with correlation matrix R = V^{−1/2} Σ V^{−1/2}. A single-step maximum-type test for the elementary null hypotheses H_k, k ∈ {1, …, m}, can then be defined via the multiplicity-adjusted p-values p_k = P{max(Z) ≥ T_k}, where Z ∼ N_m(0, R).
The single-step test can be improved by applying the closed testing procedure [36], where each intersection null hypothesis ∩_{k∈I} H_k, I ⊆ {1, …, m}, is tested by a maximum-type test for intersection hypotheses. The elementary hypothesis H_k is then rejected by the closed test if all intersection hypotheses that include H_k (i.e. all ∩_{k∈I} H_k with I ⊆ {1, …, m} and k ∈ I) are rejected by the corresponding maximum-type test. A multiplicity-adjusted p-value is given by the maximum of the p-values of these intersection hypothesis tests.
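Both the single-step and the closed-testing adjusted p-values can be computed directly from the multivariate normal approximation. The following is an illustrative sketch using scipy (not the nph implementation), taking the observed standardized statistics and the estimated correlation matrix as inputs:

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def maxtype_adjusted_p(t_stats, R):
    """Single-step adjusted p-values: p_k = P(max_j Z_j >= T_k) for
    Z ~ N_m(0, R), evaluated at the observed statistics T."""
    t_stats = np.asarray(t_stats, dtype=float)
    m = len(t_stats)
    mvn = multivariate_normal(mean=np.zeros(m), cov=R)
    return np.array([1 - mvn.cdf(np.full(m, t)) for t in t_stats])

def closed_test_adjusted_p(t_stats, R):
    """Closed-testing adjusted p-values: for each k, the maximum of the
    max-type p-values over all intersection hypotheses containing k."""
    t_stats = np.asarray(t_stats, dtype=float)
    R = np.asarray(R, dtype=float)
    m = len(t_stats)
    adj = np.zeros(m)
    for r in range(1, m + 1):
        for I in combinations(range(m), r):
            idx = list(I)
            sub = multivariate_normal(mean=np.zeros(r), cov=R[np.ix_(idx, idx)])
            t_max = t_stats[idx].max()
            p = 1 - sub.cdf(np.full(r, t_max))   # max-type p for this intersection
            for i in idx:
                adj[i] = max(adj[i], p)
    return adj
```

Because the elementary statistics are typically highly correlated, these adjusted p-values are smaller than Bonferroni-adjusted ones, and the closed test improves on the single-step test for every statistic except the largest, for which the two coincide.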

Simultaneous confidence intervals
Lower confidence bounds with simultaneous coverage probability 1 − α that correspond to the inversion of a one-sided single-step multiple testing procedure are given by θ̂_j − V_j^{1/2} q_α, where q_α is defined by P(max(Z) ≥ q_α) = α. The boundaries of two-sided confidence intervals with simultaneous coverage probability 1 − α that correspond to the inversion of a two-sided single-step test are defined as θ̂_j ± V_j^{1/2} q′_α, where q′_α is defined by P(max_j |Z_j| ≥ q′_α) = α. In an actual analysis, the closed testing procedure may be preferred over the single-step procedure as it is more powerful. It is possible to construct confidence regions that correspond to the inversion of the closed testing procedure. However, such regions would typically be of complex shape and not easily interpreted, and the projections of such a region onto univariate confidence intervals would typically not retain the advantage of the closed test over the single-step test [37].
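The critical value q_α can be obtained by numerically inverting P(max(Z) ≥ q) = α under the multivariate normal approximation; a minimal sketch using scipy:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import multivariate_normal

def simultaneous_quantile(R, alpha=0.025):
    """One-sided simultaneous critical value: the q with
    P(max_j Z_j >= q) = alpha for Z ~ N_m(0, R), found by root search."""
    R = np.asarray(R, dtype=float)
    m = R.shape[0]
    mvn = multivariate_normal(mean=np.zeros(m), cov=R)
    return brentq(lambda q: (1 - mvn.cdf(np.full(m, q))) - alpha, 0.0, 8.0)
```

For two independent statistics and α = 0.05 this gives q ≈ 1.955, between the unadjusted 1.645 and the Bonferroni value 1.960; with positively correlated statistics the critical value moves further below the Bonferroni value.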

Simulation study
In six different simulation scenarios and for different parameter sets, we studied the operating characteristics of the proposed multivariate inference methods and compared them with the Bonferroni-Holm multiplicity adjustment and a single logrank test.
We considered seven parameter sets. In parameter set 1, we included the average hazard ratio and the restricted mean survival time as an example of joint inference using two alternative summary measures of a possible treatment benefit.
In parameter set 2, we included the between-group differences in survival probabilities after one, two and three years, as well as the 25% quantiles and the restricted mean survival time. This parameter set was intended to illustrate an analysis where one summary measure (RMST) is combined with several statistics that allow a point-wise description of the survival differences. Parameter set 3 was defined similarly, albeit replacing survival probabilities and quantiles by their log-transformations. This set was included to study a possible effect of the log-transformation on the finite sample properties of the distributional approximation. In a similar spirit, parameter set 4 included cloglog-transformed survival probabilities combined with the average hazard ratio.
Parameter sets 5 to 7 were designed to study simultaneous inference for the logrank test combined with several quantifying parameters. Parameter sets 5 and 6 included the logrank test and differences in untransformed or cloglog-transformed survival probabilities, respectively. In parameter set 7 we combined the logrank test with estimates for the average hazard ratio and the restricted mean survival time.
In all scenarios, a cut-off value of three years was set for the restricted mean survival time, for (average) hazard ratios and for the calculation of the logrank test.
Table 2 shows a summary of the considered parameter sets and the true values of the respective parameters in the simulation scenarios. Further simulation scenarios, including also the Cox model hazard ratio (despite its limitations), are considered in the Supplemental material.
To assess the coverage probabilities of confidence intervals, simulations were performed for parameter sets 1 to 4 (see Table 2) with sample sizes 200, 500 or 1000 per group.Adjusted and unadjusted two-sided confidence intervals with nominal (simultaneous) coverage probability of 95% were calculated.The asymptotic covariance matrix estimate and the perturbation based estimate, both with adjustment for ties, were used.
To assess the type I error rate and power of hypothesis tests, simulations were performed for parameter sets 5 to 7.

Simulation scenarios
In all scenarios we compared two groups with an equal number of patients per group of 200, 500 or 1000. Recruitment was assumed to be uniform over one to two years (depending on the scenario) and the maximum follow-up time was 3.5 years in all scenarios. Furthermore, we applied random censoring according to an exponential distribution with rate −log(0.9), such that, given no other events occur, on average 10% of subjects are censored within one year and 19% within two years. Simulated survival times were rounded to full days to reflect the degree of precision and the occurrence of ties observed in actual trials. All simulations were repeated 50,000 times.
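The stated censoring fractions follow directly from the exponential distribution function; a quick arithmetic check:

```python
import numpy as np

# Exponential censoring with rate -log(0.9):
rate = -np.log(0.9)
p_1yr = 1 - np.exp(-rate * 1.0)  # P(censored within 1 year) = 1 - 0.9   = 0.10
p_2yr = 1 - np.exp(-rate * 2.0)  # P(censored within 2 years) = 1 - 0.81 = 0.19
```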
The scenarios are described in detail below, and the resulting survival functions, hazard functions and hazard ratios as function of time are shown in Figure 1.

Scenario 1, Delayed onset of treatment effect:
In Scenario 1 we sampled data for the treatment group from a lognormal(0.8, 0.8²) distribution and data for the control group from a lognormal(log(exp(0.8) − 0.5), (log(exp(0.8) − 0.5))²) distribution. The distributions were chosen to resemble a setting with delayed onset of the treatment effect. During the initial phase, the control group has slightly better survival, which may occur if the treatment effect is observed only after a certain duration of treatment, but the potential risk of side effects increases immediately after treatment start. The recruitment phase in the simulation was 1.5 years.
Scenario 2, Crossing survival curves, fast recruitment: In Scenario 2 we used Weibull(2, 1.8) and Weibull(3.5, 0.8) distributions (where the parameters refer to scale and shape) for the treatment and the control group, respectively. The resulting distributions, as illustrated in Figure 1, show a pronounced crossing of the survival functions and the hazard functions. The assumed duration of the recruitment phase was one year.
Scenario 3, Crossing survival curves, slow recruitment: Scenario 3 was identical to scenario 2, except that the duration of the recruitment phase was two years, which results in a modified censoring pattern.
Scenario 4, Proportional hazards: In Scenario 4, data are sampled from Exp(0.5) and Exp(0.5 · 0.65) distributions, meeting the proportional hazards assumption with a hazard ratio of 0.65, with a recruitment phase of one year.
Scenario 5, Cure fraction: In scenario 5 we assumed that 30% of patients belong to a subpopulation of strong treatment responders in whom the treatment leads to complete cure and we further assumed that in the control group, patients are switched to active treatment after disease progression with 70% probability.
Transitions to progression or death were assumed to be governed by independent processes with constant rates.
Hazard rates for death were 0.69 per year before progression and 2.77 per year after progression, and the progression rate was 1.39 per year. (This corresponds to median event times for the three processes of 12, 3 and 6 months, respectively.) The recruitment phase was 1.5 years. Data simulation for scenarios 5 and 6 was performed using the R package nph [5].
Scenario 6, Rescue medication: In the final scenario we assumed that a rescue medication is applied to all patients in both groups after disease progression. The assumed hazard rates per year for death were 0.35 under treatment, 0.69 under control and 0.83 under rescue medication (corresponding to median times of 24, 12 and 10 months). Progression rates were 0.52 under treatment and 0.92 under control (corresponding to median times of 16 and 9 months). The recruitment phase was 1.5 years.

Simulation results
The empirical coverage probabilities of simultaneous confidence intervals, as well as of unadjusted, univariate confidence intervals, were in general close to the nominal 95%, for both the asymptotic and the perturbation covariance matrix estimate. In some scenarios, coverage probabilities deviated by about one percentage point from the nominal level when the smallest studied sample size of 200 subjects per group was used. Results for the asymptotic covariance estimate are shown in Figures 2 and 3; results for the perturbation based covariance estimate are shown in the supplemental material in Figures S1 and S2. The univariate coverage of multiplicity-adjusted intervals was greater than 95%, but the intervals were less conservative than simple Bonferroni-adjusted intervals. Similar results, in general close to nominal coverage, were observed for parameter sets including the Cox model hazard ratio, see supplemental Figures S3 and S4.
Intervals based on the asymptotic and on the perturbation covariance matrix estimate performed largely similarly, with some exceptions: In scenario 2 (crossing hazards), the simultaneous coverage for the parameter set containing the average hazard ratio and the RMST difference was slightly liberal even with large sample size when using the asymptotic covariance estimate. With the perturbation approach, the coverage was conservative with small sample size and almost perfectly matched the nominal level with large sample size. Intervals for the difference in cloglog-transformed survival were overly conservative with the perturbation approach applied to small sample sizes, in particular for early time-points. These intervals had coverage close to the nominal level with the asymptotic covariance estimate.
The type I error rate of one-sided unadjusted hypothesis tests for the studied parameters was well controlled at the 2.5% level, with the exception of tests for untransformed survival probabilities. However, it is well known that the normal approximation for untransformed survival probabilities may not be entirely appropriate with small sample sizes. In contrast, tests for cloglog-transformed survival did control the type I error rate. See column 'T1E unadj' in Tables 3 and 4.
Without adjustment, the type I error rate for the global null hypothesis of no difference in any included parameter was in the order of 6.5% to 7.5% in the studied scenarios. Multiplicity adjustment using the proposed multivariate normal approximation resulted in type I error rates close to the nominal 2.5%. Some inflation was still observed for parameter sets containing untransformed survival probabilities, which likely results from the inaccuracy even of the univariate approximation for these parameters (see column 'T1E adj' in Tables 3 and 4). Adjustment using the Bonferroni-Holm test resulted in strictly conservative tests and adjusted significance levels below those of the multivariate normal-based closed test (see columns 'T1E adj' and 'T1E Holm' in the result tables).
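For reference, the Bonferroni-Holm comparator used above is the standard step-down scheme; a minimal Python sketch of the adjusted p-values:

```python
def holm_adjust(pvals):
    """Bonferroni-Holm step-down adjusted p-values (one value per input,
    returned in the original order)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # multiply the (rank+1)-th smallest p-value by (m - rank)
        # and enforce monotonicity of the adjusted values
        running = max(running, (m - rank) * pvals[i])
        adj[i] = min(1.0, running)
    return adj
```

For example, `holm_adjust([0.01, 0.04])` yields `[0.02, 0.04]`.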
The power of the multivariate normal-adjusted tests for the global null hypothesis was on average 4.0 percentage points below the power of the corresponding unadjusted tests. For comparison, Bonferroni-adjusted tests were on average 7.7 percentage points less powerful than the unadjusted tests (see the power columns and the rows with parameter label 'Any' in Tables 3 and 4). Similarly, the power for elementary hypothesis tests was on average 4.4 percentage points lower with the multivariate normal-based closed test compared to unadjusted univariate tests. The Bonferroni-Holm procedure resulted on average in 7.2 percentage points lower power compared to unadjusted tests.
Thus, averaged across the studied settings, the proposed testing procedure avoids almost half of the power loss that would occur with the simpler Bonferroni-Holm approach.
Comparing the approach of testing multiple parameters with a single Cox model score test, in scenarios with strong non-proportionality of the hazard functions (scenarios 1, 2 and 3) the hypothesis test for a difference in 3-year survival or 3-year cloglog-transformed survival had similar (scenario 1) or larger (scenarios 2 and 3) power than the score test.
When including the differences in 1-year, 2-year and 3-year survival (or cloglog-transformed survival) and the score test in one parameter set and adjusting for multiple testing, the power to show a difference in at least one considered parameter was similar (scenario 1) or considerably larger (scenarios 2 and 3) than the power of a single unadjusted score test. Further, the power to show a difference in at least one parameter under multiplicity adjustment was similar to the unadjusted power of the most powerful univariate comparison (either 3-year survival or the score test).
This implies, first, that under severely non-proportional hazards, testing for differences at a well-chosen milestone time-point can be more powerful than the score test. Second, testing several milestone time-points and adjusting for multiplicity will often be a better choice than selecting one time-point in advance to avoid multiple testing, as the multiplicity adjustment with the proposed method will mostly maintain the power of the most powerful univariate comparison.
When comparing the score test for the Cox model hazard ratio with the tests for the two other summary measures (average hazard ratio and RMST difference), the score test performed best by far in the three scenarios with strongly non-proportional hazards.
In the scenarios with proportional hazards (scenario 4) or moderate non-proportionality (scenarios 5 and 6), the score test was more powerful, in the order of 10 percentage points, than the best milestone comparison of survival. Also, the unadjusted power of the score test was larger than the adjusted power to reject at least one of the considered null hypotheses by two to five percentage points. Tests for the average hazard ratio and the RMST difference had almost identical power to the score test in scenarios 4 and 6, and when combining all three tests, the adjusted power to reject at least one null hypothesis was also at almost the same value as the power of the unadjusted tests. In scenario 6, the score test and the test for the RMST difference both had a power of approximately 77%, whereas the test for the average hazard ratio was at 71.6%. Still, the power of the better tests was almost retained in the adjusted power to reject at least one null hypothesis, with a value of 75%.
Taken together, the simulation results show that in terms of power to show at least some difference, the score test (or logrank test) is in most settings a robust choice. However, to increase robustness or to aid in the interpretation of the pattern of differences between survival curves, the test may be complemented by tests for further parameters and, when applying the proposed multiplicity adjustment, the adjusted overall power to find some difference will typically remain at a level very similar to the power of the best included test.

Data example
As an illustrative example we considered comparisons from the study "Pembrolizumab alone or with chemotherapy versus cetuximab with chemotherapy for recurrent or metastatic squamous cell carcinoma of the head and neck (KEYNOTE-048): a randomised, open-label, phase 3 study" by Burtness et al. [22]. The survival curves shown in this study exhibit obvious properties of non-proportional hazards. For the comparison of pembrolizumab alone versus cetuximab with chemotherapy, the survival curves were crossing, with better survival under cetuximab with chemotherapy in the first eight months and subsequently better survival under pembrolizumab.
In the comparison of pembrolizumab with chemotherapy versus cetuximab with chemotherapy, survival curves were almost equal in both groups for the first eight months with subsequent separation of the two curves, showing a benefit for the pembrolizumab group.
We reconstructed a data set for the comparison of overall survival in the full population under pembrolizumab alone versus cetuximab with chemotherapy, based on the numbers at risk and numbers of censored event times, which are given for every five-month interval in Figure 2D of this publication [22]. In the reconstructed data set, event times and censoring times, respectively, were spread equally within each five-month interval. The estimated survival functions for the reconstructed data are shown in Figure 4. The data set contains 301 subjects with 237 events in the pembrolizumab alone group and 300 subjects with 264 events in the cetuximab with chemotherapy group. The overall median and maximum follow-up times are 0.96 years and 3.96 years.
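One plausible reading of "spread equally within each five-month interval" is to place the reconstructed times at equally spaced interior points of the interval; a hypothetical Python helper (our naming, not part of the reconstruction code) illustrating this rule:

```python
def spread_times(lower, upper, n):
    """Place n reconstructed event (or censoring) times at equally
    spaced interior points of the interval [lower, upper)."""
    step = (upper - lower) / (n + 1)
    return [lower + step * (i + 1) for i in range(n)]
```

For example, four events in the interval from 0 to 5 months would be placed via `spread_times(0, 5, 4)`, giving `[1.0, 2.0, 3.0, 4.0]`.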
In the study by Burtness et al. [22], confirmatory tests were performed for the Cox model hazard ratio between groups. As discussed, under non-proportional hazards the expected value of the Cox model hazard ratio depends on trial characteristics such as the length of the recruitment and follow-up periods, and other parameters may be more appropriate to quantify the treatment effect, in particular under crossing survival curves (see, e.g., Magirr and Burman [9]). Accordingly, the survival functions were further characterised in Burtness et al. by reporting 0.5-year survival, 1-year survival and median survival times; however, no inference for the between-group differences of these parameters and, consequently, no adjustment for simultaneous inference was provided. For our example, we assume there is some information from previous studies that the survival curves will likely show late separation, and that a standard analysis may hence be affected by non-proportional hazards.
We consider two alternative analysis approaches. The first approach is focused on establishing a difference between the survival curves based on a small set of parameters; the second aims at a characterisation of the differences by confidence intervals for a larger set of parameters.
In the first approach, two primary null hypotheses, for the overall difference in survival functions and for the difference in two-year survival probabilities, are defined and tested, respectively, with the Cox model score test and the Wald test for survival differences. Multiplicity adjustment at the simultaneous one-sided significance level of 0.025 is applied, using the closed test based on the multivariate normal distribution with the asymptotic covariance estimate. This analysis is intended to show at least some benefit of pembrolizumab (group 1) over cetuximab (group 0). Two tests are combined to compensate for a possible lack of power of the score test relative to the potentially large difference in survival at a late milestone time-point. The corresponding null hypotheses are H01 : λ1(s) ≥ λ0(s) for all 0 ≤ s ≤ 3.5 and H02 : S1(2) ≤ S0(2).
The resulting unadjusted one-sided p-values for H01 and H02 are p1 = 0.0100 and p2 = 0.0053. The corresponding multiplicity-adjusted one-sided p-values are p1,adj = 0.0100 and p2,adj = 0.0082. Hence, both null hypotheses are rejected at the one-sided family-wise significance level of 0.025. Furthermore, the adjustment did not change the p-value of the score test and increased the p-value of the test for the 2-year survival difference only by a small amount. In this example, the estimated correlation between the two test statistics was 0.87, and this large correlation entails the very modest adjustment of the p-values.
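The effect of the correlation on the adjustment can be illustrated with a single-step maximum-statistic calculation under a bivariate normal approximation (a simplified stand-in for the closed test actually used; the function and its numerical integration are our sketch):

```python
from statistics import NormalDist

N = NormalDist()

def max_test_adjusted_p(p, rho, steps=20000, lo=-8.0):
    """One-sided adjusted p-value P(max(Z1, Z2) >= z) for two standard
    normal test statistics with correlation rho, where z = Phi^{-1}(1 - p).
    P(Z1 < z, Z2 < z) is computed by trapezoidal integration of
    pdf(x) * Phi((z - rho*x) / sqrt(1 - rho^2)) over x in [lo, z]."""
    z = N.inv_cdf(1.0 - p)
    s = (1.0 - rho * rho) ** 0.5
    h = (z - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0  # trapezoid endpoint weights
        total += w * N.pdf(x) * N.cdf((z - rho * x) / s)
    return 1.0 - total * h
```

With rho = 0 this reduces to the Sidak value 1 - (1 - p)^2; with the correlation of 0.87 estimated above and p = 0.0053, the result lies strictly between 0.0053 and the Bonferroni bound of 0.0106, consistent with the modest adjustment reported.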
In the second example analysis approach, the difference in survival curves is characterised by a parameter set that includes the differences in 0.5-year survival, 1-year survival, 2-year survival, median survival times and 3.5-year RMST. Simultaneous two-sided 95% confidence intervals for these parameters are calculated using the multivariate normal adjustment with the asymptotic covariance estimate.
The results of the second analysis are shown in Table 5. In this example, the width of the Bonferroni-adjusted confidence intervals is 1.064 times the width of the multivariate normal-adjusted intervals.
The analysis shows that regarding 0.5-year survival, lower efficacy of pembrolizumab alone versus cetuximab with chemotherapy cannot be ruled out, with a difference of up to 13 percentage points at the adjusted 95% confidence level. At 1 year, the relation has reversed, with point estimates for the survival probabilities and for the median, which is close to one year in both groups, favoring pembrolizumab. However, uncertainty remains at this time-point, reflected in confidence intervals that cover the possibility of no between-group difference in 1-year survival and in the median survival times. Only at longer time spans do the point estimates and confidence intervals for 2-year survival and the 3.5-year RMST difference support the conclusion of a larger benefit under pembrolizumab.

Software implementation
The proposed methods were implemented in the R [38] function nphparams(). This function was added to the previously published R library nph [5], which provides functions to simulate and analyse survival data under non-proportional hazards.
The example data set of Section 5 was also added to the nph package under the name pembro. The following R code may be applied to reproduce the exemplary data analysis:
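The code listing itself did not survive reproduction here; below is a minimal sketch of what such a call could look like. The package, function and data set names are taken from the text above, but the argument names are our assumptions, not the package's documented interface (consult ?nphparams):

```r
library(nph)   # provides nphparams() and the pembro example data
data(pembro)   # reconstructed KEYNOTE-048 data set

## Assumed call signature -- see ?nphparams for the actual arguments:
## nphparams(time = pembro$time, event = pembro$event, group = pembro$group)
```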

Discussion
In the absence of the proportional hazards assumption, the exact definition of a survival benefit under treatment versus control is ambiguous. Essentially, two distribution functions need to be compared, and different aspects of these distributions may receive different emphasis depending on personal preferences or circumstances. For example, a survey by Shafrin et al. [39] among melanoma patients and lung cancer patients and their treating physicians found that, compared to their physicians, patients on average preferred a larger chance of increased long-term survival and in exchange were more willing to accept an increased short-term mortality risk.
Consequently, to formally establish a benefit of treatment over control in a clinical trial under non-proportional hazards, more than one parameter for the difference in survival functions needs to be considered. Our aim was to provide a formal inference framework that includes a wide range of suitable parameters and that allows for an efficient parametric multiplicity adjustment to control the type I error rate of hypothesis tests and the simultaneous coverage of confidence intervals.
All considered parameter estimates essentially are functions of the observed event process and as such can be combined in a joint counting process framework that establishes their asymptotic multivariate normal distribution. Simultaneous inference based on this distributional approximation results in more powerful procedures than adjustments such as the Bonferroni-Holm method, which do not take into account the underlying distribution. In particular when parameters are highly correlated, as is, e.g., the case for combinations such as 3-year survival and RMST in our simulation scenarios, only moderate multiplicity adjustment is required and the proposed methods result in little loss of efficiency compared to unadjusted univariate analyses.
Of note, in the absence of proportional hazards the Cox model hazard ratio is not robust to design characteristics, as it depends on the timing of events and hence can be affected, e.g., by the recruitment rate and the length of follow-up. Thus the traditional hazard ratio is of limited use to quantify differences in survival curves under non-proportional hazards. In particular with crossing survival curves (or, more generally, crossing hazard curves), the hazard ratio estimate should be interpreted with care. The logrank test, or equivalently the Cox model score test, however, is calculated under a global null hypothesis of equal hazard functions and therefore is a viable approach to establish that there is at least some difference between two survival functions.
As an alternative to the proposed multidimensional parametric approach, conclusions could also be drawn from an overall inspection of the observed survival functions, and simultaneous inference could be based on confidence bands with simultaneous coverage [34,40]. This would correspond to a fully non-parametric approach.
Such an approach may be suitable to inform the treatment decision of an individual patient; however, it is not suitable to define success criteria regarding efficacy in a clinical trial or when evaluating treatment strategies in clinical practice. To interpret and communicate the effects of drugs with a complex pattern of efficacy over time, a set of quantifying parameters seems to be a good compromise between a single parameter, such as the hazard ratio, and a completely non-parametric approach of regarding the overall survival curves.
The multiple testing procedures described in Section 3 may be extended towards more complex methods.
A serial gate-keeping procedure [41] may be used to first show a difference between treatment and control by testing an intersection hypothesis for a small set of parameters for which high power is expected, and, in case of success, to assess the survival differences in detail with respect to a larger set of relevant parameters. E.g., in the example of Section 5, the intersection hypothesis test comprising the logrank test and the test for 2-year survival could be used to establish some difference and act as a gate-keeper for the subsequent analysis of the larger parameter set. Of note, the rejection of the gate-keeping intersection hypothesis in the first step does not automatically imply the rejection of the corresponding elementary null hypotheses of the first set, because these are not included in the closed testing procedure that corresponds to the gate-keeping approach. To define more complex testing procedures with several levels, parametric graphical multiple testing procedures [42,43] could be defined using the estimated covariance matrix.
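The two-step logic can be sketched schematically as follows (our simplification: the second-stage p-values are assumed to be already multiplicity-adjusted within their own family, e.g., by the parametric procedure):

```python
def serial_gatekeeper(p_gate, p_second_adj, alpha=0.025):
    """Serial gate-keeping: the second family of hypotheses is only
    tested if the gate-keeping (intersection) hypothesis is rejected
    at level alpha; the full level is then passed on to the second
    stage. p_second_adj must already be multiplicity-adjusted."""
    if p_gate > alpha:
        # gate not passed: no second-family hypothesis may be rejected
        return False, [False] * len(p_second_adj)
    return True, [p <= alpha for p in p_second_adj]
```

For example, `serial_gatekeeper(0.008, [0.01, 0.2])` passes the gate and rejects only the first second-stage hypothesis.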
The first step in our example is similar in spirit to the combined test proposed by Royston and Parmar [44]. They suggested performing a maximum combination test that includes the logrank test and RMST differences at several time-points. Royston and Parmar use permutation (or an approximation thereof) to calculate p-values. However, their test can equally be performed within the asymptotic multivariate normal framework we presented, and may then be supplemented by simultaneous confidence intervals for the included RMST differences.
Though not covered in the present work, the simultaneous inference framework may be extended to include stratified analyses. One way to do so would be to estimate the parameters of interest and their covariance matrix for each stratum separately and then to calculate a weighted average of the per-stratum estimates and the corresponding covariance matrix. Weights could correspond to stratum size; however, further investigation into the ideal choice of weights may be warranted.
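The stratified combination described above can be sketched as follows (a sketch under the assumption of independent strata, so that the covariance of the weighted mean is the weighted sum of per-stratum covariances with squared weights; function and variable names are ours):

```python
def pooled_estimate(estimates, covs, weights):
    """Combine per-stratum parameter vectors and covariance matrices
    into a weighted-average estimate and its covariance, assuming
    independent strata.
    estimates: list of length-m parameter vectors, one per stratum
    covs:      list of m x m covariance matrices, one per stratum
    weights:   stratum weights summing to one (e.g., relative sizes)"""
    m = len(estimates[0])
    theta = [sum(w * est[k] for w, est in zip(weights, estimates))
             for k in range(m)]
    cov = [[sum(w * w * c[i][j] for w, c in zip(weights, covs))
            for j in range(m)] for i in range(m)]
    return theta, cov
```

For two equally weighted strata with estimates (1, 2) and (3, 4) and diagonal covariances, the pooled estimate is (2, 3) with variances 0.25 * 1 + 0.25 * 2 = 0.75 per component.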
As a further extension, weighted logrank statistics may be included in the inference framework. One could also consider performing interim analyses to allow for early stopping [14,45] and adaptations such as sample size reassessment [46] and modification of the set of parameters, e.g., adding milestone analyses at later time-points. Depending on which type of data is considered for the adaptations [47,46], appropriate adaptive tests have to be implemented.
In summary, simultaneous inference for a predefined set of survival parameters allows for a robust assessment of treatment efficacy under non-proportional hazards.The required multiplicity adjustments can be performed efficiently based on their asymptotic joint normal distribution.

Figure 1 :
Figure 1: Simulation scenarios. Left column: true survival functions for the treatment and control group. Vertical dashed lines indicate the 1-, 2- and 3-year time-points. The horizontal line indicates the 25% quantile. Middle column: true hazard rates (per year) as a function of time for the treatment and control group. Right column: hazard ratio between treatment and control as a function of time. Dotted and dash-dotted lines indicate the large-sample limit of the Cox model hazard ratio estimate and the true average hazard ratio as defined in Section 2.2, both calculated with a cut-off at three years.

Simulations were performed for each parameter set (Table 2) with sample size 200 per group, under the alternative and under the null hypothesis of identical survival curves. In the latter case, data for both groups were sampled from the distribution of the control group. For each parameter set (θ1, . . ., θm), the elementary null hypotheses H1 : θ1 = 0, . . ., Hm : θm = 0 versus one-sided alternatives were considered, as well as the global null hypothesis ∩_{i=1}^{m} Hi. The null hypotheses were tested with the closed testing procedure described in Section 3 and, for comparison, with unadjusted and with Bonferroni-Holm adjusted tests, where the elementary p-values were computed based on univariate normal approximations. The nominal (family-wise) one-sided significance level was set to 0.025. To calculate the critical values, the asymptotic covariance matrix estimate with adjustment for ties was used (see Section 2.1). Simulation results using the resampling-based covariance estimate (see Section 2.3) are given in the Supplemental Material.

Table 1 :
Summary of the expressions required for the variance-covariance estimation according to equation (2) for the considered parameters. The formal definitions of the restricted mean survival time estimate μ̂i and the Cox model log hazard ratio estimate β̂ are given in Section 2.2. In the last column, L is the time-point up to which the (average) hazard ratio, RMST difference or logrank test is calculated. The first column lists the parameter of interest θk.

Table 2 :
Parameter sets in the simulation study. All parameters considered in the simulation are listed and their true values are shown for scenarios 1 to 6 of Section 4.1. For log-scaled parameters of Section 2.2, back-transformed values, i.e. ratios, are shown in the table. The expected value of the Cox model hazard ratio estimate is shown for comparison. For the score test, the expected contribution of an individual to the summed score statistic is shown. Cross-marks (x) indicate which parameter is included in which set.

Table 3 :
Type I error rate and power observed in 50,000 simulation runs for scenarios 1-3.

Table 4 :
Type I error rate and power observed in 50,000 simulation runs for scenarios 4-6. Column headers: Scenario, Set, Parameter, T1E unadj, T1E adj, T1E Holm, Pow unadj, Pow adj, Pow Holm.

Table 5 :
Analysis of example data based on Figure 2D from KEYNOTE-048. Column headers: Parameter, Pembro., Cetux., Difference (with 95% confidence intervals).