Methods for non-proportional hazards in clinical trials: A systematic review

For the analysis of time-to-event data, frequently used methods such as the log-rank test or the Cox proportional hazards model rely on the proportional hazards assumption, which is often debatable. Although a wide range of parametric and non-parametric methods for non-proportional hazards has been proposed, there is no consensus on the best approaches. To close this gap, we conducted a systematic literature search to identify statistical methods and software appropriate under non-proportional hazards. Our literature search identified 907 abstracts, out of which we included 211 articles, mostly methodological ones. Review articles and applications were identified less frequently. The articles discuss effect measures, effect estimation and regression approaches, hypothesis tests, and sample size calculation approaches, which are often tailored to specific non-proportional hazards situations. Using a unified notation, we provide an overview of the available methods. Furthermore, we derive some guidance from the identified articles.


Introduction
In clinical studies with time-to-event outcomes, it is commonly assumed that the hazard functions of the treatment groups are proportional. However, several scenarios can lead to non-proportional hazards (NPH). Figures 1a and 1b illustrate the hazard ratio of a delayed and a diminishing treatment effect, respectively. A delayed treatment effect in the experimental arm can also lead to crossing hazards (see Figure 1c) if the comparator is an active treatment with an immediate response, as is often the case in trials of immuno-oncology drugs. Other scenarios of crossing hazards arise when the treatment effect is non-homogeneous across subgroups, e.g. if the treatment is harmful in a subgroup but beneficial in its complement [1]. NPH can also occur in settings with long-term survivors in one treatment arm or if there is treatment switching to another arm after disease progression on the original arm. Under proportional hazards (PH), comparisons of hazard ratios or cumulative hazard ratios lead to equivalent conclusions, whereas under NPH these results may vary substantially. Standard statistical tests for the comparison of time-to-event outcomes between groups, such as the log-rank test or tests based on Cox regression models, are not optimal for detecting relevant differences under NPH. Additionally, the hazard ratio estimate of the standard Cox regression model, a commonly used effect measure, is neither robust nor meaningful under NPH [2]. In contrast to PH, the interpretation of estimates of a specific effect measure, such as the hazard ratio or the cumulative hazard ratio, depends on the follow-up considered for evaluation in the presence of NPH. Well-established methods for time-to-event data are available when the PH assumption holds. However, there is no consensus on best practices under NPH. Moreover, approaches to deal with NPH are not globally optimal but depend on the specific NPH scenario.
A variety of parametric and non-parametric methods for treatment effect estimation and hypothesis testing in NPH settings have been proposed. We aim to identify statistical methods and, if available, the corresponding software that are suitable under NPH. In contrast to other overview articles that focus on specific disease areas (e.g., oncology [1]), NPH patterns (e.g., treatment switching [3]), or specific methods (e.g., statistical testing [4,5]), the scope of this literature review is broader and based on a systematic approach to identifying relevant literature.
The remainder of this paper is organized as follows. In Section 2, we show the relevance of scenarios with NPH by investigating reconstructed data from clinical trials. In Section 3, we describe the literature search and data extraction and summarize the quantitative results of the review. The identified approaches are presented in a common notation, which can be found in Section 4, where we focus on NPH for the treatment indicator. We categorize and discuss approaches to estimate and model treatment or covariate effects under NPH in Section 5. Testing and sample size calculation approaches under NPH are discussed in Section 6. We compare the flexibility of the methods presented in Sections 5 and 6 on theoretical grounds and highlight results of published comparison studies where available. Finally, we summarize and discuss the findings in Section 7. Appendix A provides more detailed information on the literature search and data extraction. The Online Supplement provides more detailed information on the estimation and testing approaches identified as appropriate for NPH.

Motivating Examples
Borghaei et al. [6] report a phase 3 trial comparing the effect of nivolumab (exp.) versus docetaxel (control) in nonsquamous non-small-cell lung cancer with regard to overall survival. For the purpose of illustration, we consider here progression-free survival (PFS), a secondary endpoint. Using the webplotdigitizer [7] and the method described in [8], we reconstructed individual patient data for each treatment group. The estimated KM curves are crossing, indicating a non-constant, crossing-hazards treatment effect. This is further investigated in Figure 2b. The blue line shows the estimated (time-dependent) hazard ratio, hazard of the nivolumab group vs. hazard of the docetaxel group, which is obtained by smoothing the increments of the cumulative hazard rates computed via the Nelson-Aalen estimator. Smoothing was done via kernel-based methods with a global bandwidth, as implemented in the R package muhaz. The estimated curve of the hazard ratio indicates an early toxic effect of nivolumab as compared to docetaxel. After approximately 4 months, however, a period with a beneficial treatment effect for nivolumab follows. The crossing hazards result in crossing PFS curves approximately two months after the hazard ratio starts to suggest better performance of nivolumab. Borghaei et al. [6] suspect that a delayed effect of the nivolumab treatment, typical of immunotherapy, causes this pattern. The time-invariant estimate of the hazard ratio under the PH assumption is indicated by the solid black line in Figure 2b. While the estimate of the constant hazard ratio is close to 1, indicating no treatment effect, the estimates of the time-varying hazard ratio and the Kaplan-Meier curves suggest otherwise and provide additional insights. As a second example, we provide re-estimates of a phase 3 clinical trial reported by Robert et al. [9]. In this trial, previously untreated patients with metastatic melanoma without a BRAF mutation were treated with either nivolumab (exp.) or dacarbazine, i.e. chemotherapy (comparator).
Analogously to the previous example, Figure 2c shows the KM estimates of the survival function for the endpoint overall survival from the reconstructed data. During follow-up, 51 out of 210 patients in the nivolumab group and 97 out of 208 patients in the control group died (reconstructed data). The estimates of the survival function suggest a delayed treatment effect of nivolumab. Initially, the estimated curve of the hazard ratio, hazard of the nivolumab group vs. hazard of the dacarbazine group, in Figure 2d is close to 1 but declines for most of the time, indicating a beneficial treatment effect from around 1 month onwards. The estimate of the hazard ratio under the PH assumption also indicates a beneficial treatment effect but, again, differs substantially from the time-varying estimate.
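The analysis pipeline used in both examples (Nelson-Aalen increments, kernel smoothing, ratio of the smoothed hazards) can be sketched in a few lines. The following Python sketch is a simplified stand-in for the muhaz-based analysis; the event times, censoring indicators, Epanechnikov kernel, and bandwidth are illustrative assumptions, not the trial data.

```python
# Sketch: time-varying hazard ratio from kernel-smoothed Nelson-Aalen increments.
# Illustrative stand-in for the muhaz-style analysis; data and bandwidth are made up.

def nelson_aalen_increments(times, events):
    """Return (event_time, d/Y) pairs: increments of the Nelson-Aalen estimator."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    times = [times[i] for i in order]
    events = [events[i] for i in order]
    n, incs, i = len(times), [], 0
    while i < n:
        t = times[i]
        at_risk = n - i          # everyone with observed time >= t is at risk
        d = 0
        while i < n and times[i] == t:
            d += events[i]       # count events (not censorings) at this time
            i += 1
        if d > 0:
            incs.append((t, d / at_risk))
    return incs

def smoothed_hazard(incs, t, bandwidth):
    """Epanechnikov-kernel smoothing of the Nelson-Aalen increments."""
    h = 0.0
    for ti, dlam in incs:
        u = (t - ti) / bandwidth
        if abs(u) < 1:
            h += 0.75 * (1 - u * u) * dlam / bandwidth
    return h

# hypothetical observed times (months); event=1, censored=0
grp1 = ([1, 2, 3, 5, 8, 12, 15, 18], [1, 1, 1, 0, 1, 0, 1, 0])
grp0 = ([2, 4, 6, 7, 9, 10, 11, 14], [1, 1, 1, 1, 1, 0, 1, 0])

inc1 = nelson_aalen_increments(*grp1)
inc0 = nelson_aalen_increments(*grp0)
for t in (4, 8, 12):
    h1, h0 = smoothed_hazard(inc1, t, 6.0), smoothed_hazard(inc0, t, 6.0)
    print(f"t={t:2d}  HR(t) ~ {h1 / h0:.2f}")
```

In muhaz, boundary correction and bandwidth selection are handled more carefully; this sketch only illustrates the mechanics of turning increments d/Y into a smooth hazard estimate per stratum.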

Systematic literature search and study selection
We performed a comprehensive literature search using two electronic databases, MEDLINE and EMBASE, on March 15th, 2022. Details on the literature search and the data extraction are provided in Appendix A.1. In total, 907 articles were identified and screened for eligibility. After the abstract screening and retrieval of full texts, a total of 411 articles were assessed for eligibility. In total, 200 articles (49%) were excluded. The most frequent reason for exclusion was that the articles neither developed nor applied any NPH method. The final analysis included 211 publications; see the PRISMA flow chart in Figure 3. The complete list of included articles is available in Table S5 of the Online Supplement. Figure 4 shows the publication years of the included articles. More than 70% of the articles were published in 2010 or later and only a few before 2000. However, it has to be considered that the total number of published articles has grown over the last years [11]. The vast majority of articles (>80%) introduce statistical methods for NPH; reviews and applications were less frequent (10%). With respect to the methods introduced in the identified articles, we distinguished whether articles include methods for estimation of time-varying covariate/treatment effects and/or hypothesis testing. To further characterize the articles, we additionally introduce categories for articles focusing on estimation and/or testing methods. These categories are displayed in Table 1. Note that these categories are non-exclusive. The categories summarize the core contribution of the methods discussed in the corresponding articles. An explanation of the categories is given in Sections 5 and 6. The allocation of each paper to at least one of the categories according to Table 1 can be found in the Online Supplement, see Table S2.
A categorization of estimation approaches into more detailed categories discussed in Section 5 can also be found in the supplement, see Table S3. In Tables S2 and S3, we attempted to identify the main NPH contribution for each paper, sometimes ignoring possible extensions that might have been mentioned. However, papers often cover multiple topics resulting in papers being classified into more than one category. Table S4 gives details regarding the null hypothesis considered in the proposed testing approaches.
In total, 113 out of 211 (53%) articles identified in the review include methods for estimation only, 72 out of 211 (34%) involve hypothesis testing methods only, and 26 out of 211 (12%) involve both estimation and hypothesis testing methods.
Most methods for estimating time-varying covariate/treatment effects are based on time-varying coefficients for the hazard rates. Log-rank test approaches are the most frequent hypothesis test methods that we identified in our literature review (Table 1). Methods for trials including an interim analysis are considered in 12% of the articles. The literature review identified articles covering different aspects of survival analysis in NPH settings. We identified articles proposing new test statistics for testing whether survival differs between two treatment groups, as well as articles proposing new effect measures or regression models for quantifying the treatment effect in settings violating the common PH assumption. In 72 out of 211 (34%) articles, freely accessible software is provided. Another 13% of the articles provide the code for the methods upon request. Software was considered freely available if code is provided, e.g., as an R package, as code snippets given in the text, or as freely accessible code (e.g., in the supplement of the article or an online repository). Additionally, publicly available code or code snippets for commercial software are also categorized as freely available, although the software needed to run the code is not free. Code snippets published in the articles sometimes implement only specific features or are used to deepen the understanding of the methods. Usually, however, the code snippets are intended to enable users to apply the proposed methods. Simulation studies are reported in 158 (75%) articles. This is more pronounced in articles considering testing procedures, where 86 out of 98 (88%) papers provide simulation studies. For 91 out of 139 (65%) papers that focus on estimation procedures, simulation studies are provided.

Notation and summary effect measures
Before we proceed with describing the method categories according to Table 1, we introduce the notation that is used throughout this paper and the supplement. We also define the identified treatment effect measures.

Notation
The number of subjects included in a trial is denoted by N. Z is a treatment indicator with Z = 0 indicating the control (placebo or comparator) and Z = 1 the experimental treatment arm. The outcome of interest is the time to event T, whereas C denotes the censoring time.
The event indicator is denoted by δ = I(T ≤ C). The maximum follow-up time of the trial is denoted by t̄, and a specific time point during the follow-up is denoted by t*. The distinct ordered event times are indicated by t(i), i.e. 0 < t(1) < t(2) < ..., where at each time t(i) at least one event occurred. Note that we define t(0) = 0. Covariates or factors other than the treatment indicator are denoted by x. The regression coefficients for the treatment indicator and the covariates are denoted by γ and β, respectively. Note that we use γ(t) to denote a time-dependent treatment effect. Additionally, we will use indexing of γ if more than one parameter is required to specify the treatment effect. With λ^(Z)(t) we denote the hazard rate of treatment group Z with covariates x at time t, i.e. lim_{Δ→0} P(t ≤ T < t + Δ | T ≥ t, Z, x)/Δ, and with λ_0(t) the baseline hazard rate. The cumulative hazard rate Λ^(Z)(t) equals ∫_0^t λ^(Z)(u) du. The survival function of treatment group Z with covariates x, P(T > t | Z, x), is denoted by S^(Z)(t). Note that the potential dependence of λ^(Z), Λ^(Z), and S^(Z) on covariates x is suppressed in the notation. The at-risk indicator Y_i(t) denotes whether patient i is uncensored and event-free at time t, Y_i(t) = 1, or not, Y_i(t) = 0. The number of patients at risk at time t is denoted by Y(t) = Σ_i Y_i(t). In general, we use '(Z)' in the superscript to denote group-specific quantities. Table S1 of the Supplement gives an overview of the notation and the quantities defined in Section 4.2.

Effect measures
The treatment effect can be quantified, e.g., by the difference or ratio of the survival functions at a chosen landmark time t* ≤ t̄, i.e. S^(1)(t*) − S^(0)(t*) or S^(1)(t*)/S^(0)(t*) [12]. Equivalent conclusions can be obtained from the cumulative HR (cHR) at time t*, cHR(t*) = Λ^(1)(t*)/Λ^(0)(t*). Alternatively, the τth quantiles of T^(0) and T^(1) may be compared, i.e. by taking differences or ratios of t_τ^(Z) = (S^(Z))^{-1}(1 − τ), where (S^(Z))^{-1} denotes the inverse of the survival function of treatment group Z [12]. The above treatment effect measures are cumulative in the sense that they compare survival functions or cumulative hazard rates. Instantaneous differences between the treatment and the control group can be investigated by the instantaneous hazard ratio at time t*, HR(t*) = λ^(1)(t*)/λ^(0)(t*). However, HR(t*) cannot necessarily be interpreted as the current (causal) effect of the treatment, as the populations of survivors in the two treatment groups may differ with respect to unmeasured characteristics or unadjusted covariates. Moreover, conclusions based on a single time point t* (or quantile τ) may not be meaningful, as analyses at individual time points (quantiles) are non-informative for time points (quantiles) before or after the chosen t* (τ), where the treatment effect may differ substantially or even point in the opposite direction. The dynamics of differences between survival time distributions over the course of time may be investigated by calculating the above effect measures over a suitable grid of time points. Given that the hazard rate is often the pivot of modeling approaches, HR(t) is a common choice. However, it may be difficult to assess the overall effectiveness of the treatment from an examination of the effect measure over time. This is particularly relevant for HR(t). For illustration, assume a crossing hazards scenario as depicted by the blue dashed line in Figure 1c.
The trajectory of HR(t) alone does not provide information on whether and when the survival curves cross. For that, the trajectory of the baseline hazard rate in the same time period is also required. This illustrates that "extreme" values of HR(t) in a certain time period do not tell whether these extreme differences between treated and untreated individuals result in relevant differences between the survival time distributions. A summary effect measure considers the entire survival curve (within the interval [0, t*], t* ≤ t̄) and is thus of average nature. An attempt to summarize the treatment effect in a single number is the average HR, for example aHR(t*) = ∫_0^{t*} [λ^(1)(t)/(λ^(0)(t) + λ^(1)(t))] dG(t) / ∫_0^{t*} [λ^(0)(t)/(λ^(0)(t) + λ^(1)(t))] dG(t), where G(t) is a given event time distribution corresponding to a weighting function (see, for example, [13], [14], and [15]). Note that competing definitions of the aHR using different weighting functions and various estimation techniques exist. The restricted mean survival time is the average survival time within the time period [0, t*], t* ≤ t̄, i.e. RMST(t*) = ∫_0^{t*} S(t) dt. An effect measure can again be constructed by the difference RMST^(1)(t*) − RMST^(0)(t*) or the ratio RMST^(1)(t*)/RMST^(0)(t*) between the RMST of the treatment and the control group [12]. In an NPH setting, summary effect measures do not cover the dynamics of treatment efficacy and hence do not necessarily deliver an adequate picture of the nature of the treatment effect over the course of time. See [16] for a discussion and a potential remedy that relies on calculating more than one summary effect measure over varying time ranges. For a comprehensive description of the survival distribution, the group-specific quantities λ^(Z), Λ^(Z), or S^(Z) are required. These can be estimated by statistical models or by stratifying non-parametric estimation approaches by Z.
Stratification reduces the need for assumptions, such as the proportional hazards (PH) assumption, across different subgroups, as the estimation within each stratum relies solely on the information from that specific subgroup. Effect measures unconditional of covariates might require additional steps in estimation. As the covariates typically enter the survival function in a non-linear fashion, it will, in general, not be sufficient to plug the mean covariate values into the conditional survival function, since S^(Z)(t | E[x]) ≠ E[S^(Z)(t | x)], where the expectation is taken with respect to the covariates, due to Jensen's inequality. The interested reader is referred to [17] and Chapter 10 of [18]. This may also hold for the hazard rate λ^(Z)(t). Even if covariates entered the hazard rate linearly, the expected covariate values given survival up to t in each treatment group would be necessary to obtain the respective quantity unconditional of covariates.
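To make the cumulative effect measures above concrete, the following Python sketch computes the landmark survival difference S^(1)(t*) − S^(0)(t*) and the RMST difference from stratified Kaplan-Meier step functions. The event and censoring times are hypothetical, not taken from any of the cited trials.

```python
# Sketch: landmark survival difference and RMST difference from stratified
# Kaplan-Meier step functions; all data below are hypothetical.

def km_curve(times, events):
    """Kaplan-Meier estimate as a list of (event_time, S(t)) steps."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    times = [times[i] for i in order]
    events = [events[i] for i in order]
    n, s, curve, i = len(times), 1.0, [(0.0, 1.0)], 0
    while i < n:
        t, at_risk, d = times[i], n - i, 0
        while i < n and times[i] == t:
            d += events[i]
            i += 1
        if d > 0:
            s *= 1 - d / at_risk   # KM product-limit step
            curve.append((t, s))
    return curve

def surv_at(curve, t):
    """Evaluate the right-continuous step function S(t)."""
    s = 1.0
    for ti, si in curve:
        if ti <= t:
            s = si
        else:
            break
    return s

def rmst(curve, t_star):
    """Restricted mean survival time: area under S(t) on [0, t_star]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for ti, si in curve:
        if ti > t_star:
            break
        area += prev_s * (ti - prev_t)
        prev_t, prev_s = ti, si
    area += prev_s * (t_star - prev_t)
    return area

trt = km_curve([3, 5, 6, 9, 12, 14], [1, 1, 0, 1, 0, 1])   # event=1, censored=0
ctl = km_curve([1, 2, 4, 7, 8, 10], [1, 1, 1, 1, 0, 1])
t_star = 10
print("S(1)(t*) - S(0)(t*) =", round(surv_at(trt, t_star) - surv_at(ctl, t_star), 3))
print("RMST diff =", round(rmst(trt, t_star) - rmst(ctl, t_star), 3))
```

Because the KM estimate is a step function, the RMST integral reduces to a sum of rectangle areas, which is what `rmst` computes.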

Estimation approaches for NPH treatment effects
This section describes the categories of identified estimation approaches for NPH treatment/covariate effects. The first column of Table 2 shows the main categories as introduced in Table 1. Some categories are divided into sub-categories given in the second column of Table 2. The third column of Table 2 provides a brief description of the methods. References to the Supplement are given in the first two columns of Table 2, where a more detailed overview can be found. In the corresponding sections of the Appendix, references to Table S3 are given. In Table S3 of the Online Supplement, each paper that was considered in this literature review is allocated to one or more sub-categories according to Table 2, indicating the paper's main contribution to addressing NPH. Table S3 (column K) also provides information on whether the corresponding paper took a Bayesian estimation approach. The degree of detail and information given is hierarchical: Table 2 gives an overview, the referenced sections in the Online Supplement provide more detailed explanations including model formulas and a discussion of the literature, whereas Table S3 in combination with the respective papers (and the references therein) provides full information on the specific approaches to cope with NPH that we detected in the literature. The 4th and 5th columns of Table 2 give a simplified impression of how flexible the corresponding approach is with respect to patterns of the hazard rate λ^(0)(t) and time-varying NPH patterns, respectively. Note that these statements focus on the treatment groups or, more generally, on the covariates for which NPH are modeled. The assumptions on the remaining covariate effects might be strict without this being mentioned in Table 2. Hence, for many approaches, the flexibility of λ^(0)(t) is determined by the flexibility of the baseline hazard rate λ_0(t).
The statements on the flexibility of HR(t) can be understood as the flexibility of λ^(1)(t) in contrast to λ^(0)(t), and this determines the applicability of the method to arbitrarily complex NPH scenarios. If there are no assumptions on the trajectories of λ^(0)(t) and HR(t), this is specified as "none" in Table 2. We define the assumptions to be "strict" if the corresponding methods tend to be restricted to monotonically increasing or decreasing trajectories for HR(t) and to a single change in slope for λ^(0)(t). The statements "few" and "medium" indicate something in between. This categorization is closely related to a non- ("none"), semi- ("few"), and fully parametric ("medium" to "strict") handling of λ^(0)(t) and HR(t), where non-parametric approaches do not impose assumptions on the corresponding trajectories but parametric approaches tend to be limited in flexibility. Note that the 4th and 5th columns of Table 2 are statements about the flexibility of each method in accommodating varying scenarios of hazard rate functions (column 4) and time-varying HRs (column 5). Column 5 in particular describes the capability of the methods to cope with varying NPH scenarios. However, this is not a statement as to whether estimates of the hazard rates and HR(t) are easy to obtain within the framework of the corresponding approach. The Kaplan-Meier approach is an example where it is not possible to compute hazard rates, and hence the time-varying HR(t) across two strata, without further smoothing approaches. The Kaplan-Meier (KM) estimate of the survival function and the Nelson-Aalen (NA) estimate of the cumulative hazard function are non-parametric estimators. Both do not impose any modeling assumption on the hazard rate. Hence, a stratified approach across treatment groups is suitable for any trajectory of HR(t).
The estimates of S^(Z)(t) and Λ^(Z)(t) might be processed into estimates of (summary) effect measures such as aHR(t*), cHR(t*), or differences/ratios of the RMST^(Z)(t*). In quantile regression, either the τth quantile or its logarithm is assumed to have a linear relationship with the covariates and the treatment indicator, e.g. ln{t_τ} = x^T β(τ) + Zγ(τ). Estimation procedures are either based on a generalized KM estimator or on martingale-based estimating equations. Treatment efficacy at the τth survival quantile can be evaluated by γ(τ). This process can be iterated over a grid of quantiles τ ∈ (0, 1) in order to get a complete picture of the potentially time-varying treatment effect over the course of time through γ(τ).
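As a simplified stand-in for the regression-based estimate, survival quantiles can also be read off stratified KM curves over a grid of τ, with the log-ratio of group-specific quantiles playing the role of γ(τ). The step curves below are hypothetical (time, S(t)) pairs, not fitted to any data.

```python
# Sketch: comparing survival quantiles t_tau^(Z) over a grid of tau, read off
# hypothetical stratified Kaplan-Meier step functions.
import math

trt_curve = [(0, 1.0), (3, 0.85), (6, 0.70), (10, 0.45), (16, 0.20)]
ctl_curve = [(0, 1.0), (2, 0.80), (4, 0.60), (7, 0.40), (12, 0.15)]

def km_quantile(curve, tau):
    """Smallest step time t with S(t) <= 1 - tau, or None if never reached."""
    for t, s in curve:
        if s <= 1 - tau:
            return t
    return None

# log-ratio of group-specific quantiles over a grid of tau
for tau in (0.25, 0.5, 0.75):
    t1, t0 = km_quantile(trt_curve, tau), km_quantile(ctl_curve, tau)
    print(f"tau={tau}: t_tau^(1)={t1}, t_tau^(0)={t0}, log-ratio={math.log(t1 / t0):.2f}")
```

Note that for large τ the quantile may not be reached within follow-up (the function then returns None), which is one practical limitation of quantile-based effect measures under heavy censoring.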
The stratified Cox model relaxes the PH assumption by stratifying the baseline hazard rate along the treatment indicator, i.e. λ^(Z)(t) = exp{x^T β} λ_0^(Z)(t). In case of the non-parametric Breslow estimate of the baseline hazard rates, no structure is placed on the baseline hazards. The model is suitable for any trajectory of HR(t). However, the PH assumption still holds for the remaining covariates. An estimate of the cHR(t) may be utilized to evaluate the treatment effect. The models in the next category differ from the models of the time-varying coefficient category in the sense that they do not formulate a non-constant HR(t) via a non-constant coefficient for the treatment effect γ(t). Instead, a specific function is imposed on the hazard rate such that NPH arise. The approaches in this category differ from fully parametric approaches in that they have semi- or non-parametric components and do not fully specify the distribution of the survival time by assumption. This category mainly encompasses the Yang and Prentice model. In the Yang and Prentice model, the hazard rate of the experimental group is a specific modification of the baseline hazard rate λ_0(t): a time-dependent weight function puts all the weight on the first parameter γ_1 at the beginning; this weight is reduced over time, and the weight for the second parameter γ_2 is increased analogously. Consequently, the hazard ratio is a time-weighted average of the short-term and the long-term hazard ratio, exp{γ_1} and exp{γ_2}, respectively. For the change point model, the hazard rate equals λ^(Z)(t) = exp{x^T β + Zγ(t)} λ_0(t). The time-dependent treatment coefficient is piecewise constant, γ(t) = γ_1 + γ_2 I{t ≥ t*}, where I is the indicator function, which equals one if the statement in the brackets is true and 0 otherwise. Note that the treatment coefficient is based on a time-covariate interaction and is not restricted to factors. The change point t* is pre-specified, and more than one change point can be included; black-box approaches for determining change points also exist.
Along with γ(t), the HR(t) is also piecewise constant and hence rather restricted. The model might be estimated via the partial likelihood, which places no modeling assumption on the baseline hazard. Parametric choices for the baseline hazard are also possible. For time-varying coefficients modeled via flexible basis functions, the hazard rate again equals λ^(Z)(t) = exp{x^T β + Zγ(t)} λ_0(t). The time-dependent treatment coefficient differs along with the chosen complexity, and so do the trajectories of HR(t) that can be represented by the model. The weighted partial likelihood can be utilized to estimate a representative value for a time-varying treatment effect. This is usually the average hazard ratio aHR(t*). The weights have to be chosen and are not necessarily suitable for every trajectory of HR(t). Note that other approaches to estimate the aHR(t*) exist, e.g. via KM. In Aalen's additive hazard model, the hazard rate is equal to λ^(Z)(t) = λ_0(t) + x^T β(t) + Zγ(t). Estimation is based on the increments of the martingale process, with least-squares estimation at the distinct event times. In Aalen's additive hazard model there is no smoothness assumption on λ_0(t) and γ(t), unlike, e.g., when γ(t) is modeled via fractional polynomials or B-splines. AFTs assume a specific distribution for T, which typically leads to NPH, the Weibull distribution being a prominent exception. AFTs formulate a treatment- and covariate-specific location parameter; GAMLSS extend this to shape and scale parameters. First hitting time models formulate a health process. The event occurs once the health process reaches a certain threshold (typically 0). Distributional assumptions on the health process determine the distribution of the survival time, which typically has NPH. If the health process is assumed to be a Wiener process, the distribution of the survival time is inverse Gaussian.
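A minimal numerical sketch of the change point specification: with a constant baseline hazard (a simplifying assumption made here for tractability) and γ(t) = γ_1 + γ_2 I{t ≥ t*}, both HR(t) and the implied survival functions have closed forms. All parameter values below are hypothetical.

```python
# Sketch: piecewise-constant gamma(t) (change point model) with a constant
# baseline hazard; a delayed-effect scenario with hypothetical parameters.
import math

lam0 = 0.10                          # baseline hazard (control), per month
g1, g2, t_change = 0.0, -0.7, 4.0    # gamma(t) = g1 + g2 * I{t >= t_change}

def gamma(t):
    """Piecewise-constant treatment coefficient."""
    return g1 + (g2 if t >= t_change else 0.0)

def hazard(t, z):
    """lambda^(Z)(t) = exp{Z * gamma(t)} * lam0 (no further covariates here)."""
    return lam0 * math.exp(z * gamma(t))

def survival(t, z):
    """S(t) = exp(-cumulative hazard), closed form for piecewise-constant hazard."""
    early = lam0 * math.exp(z * g1) * min(t, t_change)
    late = lam0 * math.exp(z * (g1 + g2)) * max(0.0, t - t_change)
    return math.exp(-(early + late))

for t in (2, 4, 8, 12):
    hr = hazard(t, 1) / hazard(t, 0)
    print(f"t={t:2d}: HR(t)={hr:.2f}, S1={survival(t, 1):.3f}, S0={survival(t, 0):.3f}")
```

The hazard ratio jumps from exp{γ_1} = 1 to exp{γ_1 + γ_2} at the change point, reproducing the delayed-effect pattern of Figure 1a in its simplest form.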
We encountered a couple of machine learning approaches in the context of NPH scenarios, such as trees (for example, for finding the number and time points of a change point model) and forests, model averaging, k-nearest neighbours (to determine weights for weighted KM estimates), kernel-smoothing-based approaches, and neural networks; implementations are available, e.g., in the R packages trtf, BART, and mboost.
Other approaches (S.2.10) form a collective category of methods that did not fit properly into any of the other categories. Among these methods are the rank-preserving structural failure time model, concordance regression, the accelerated hazard model, the semi-parametric proportional likelihood ratio model, a Bayesian dependent Dirichlet process to model the time-to-event distribution, and others.
Approaches with no or few assumptions on λ^(0)(t) and HR(t) are suited for NPH scenarios that are more complex than those depicted in Figure 1a (delayed treatment effect), 1b (diminishing treatment effect), and 1c (crossing hazards). Such methods might also be utilized in the absence of knowledge about the trajectories of the HR and the hazard rates, or as a validity check of assumptions on these trajectories. Stratification along the treatment indicator is a general tool to relax assumptions of the underlying estimation procedure. A stratified KM estimation approach along the treatment and placebo (comparator) groups is suitable for any two possible trajectories of S^(0)(t) and S^(1)(t), or λ^(0)(t) and HR(t), respectively. If further (continuous) covariates are present, the stratified Cox model might be utilized, where stratification along the treatment indicator is suitable for any trajectory of HR(t). Time-varying coefficients offer arbitrary flexibility, depending on the complexity one allows for γ(t) in λ^(Z)(t) = exp{x^T β + Zγ(t)} λ_0(t). High flexibility of the treatment effect can in particular be achieved if γ(t) is modeled via (penalized) splines or via Aalen's additive model. For the Royston-Parmar and the conditional transformation model, the basis functions in time can interact with the treatment indicator (or other covariates) in case of unknown or highly variable NPH scenarios. Machine learning procedures such as trees and forests, as well as k-nearest neighbours or kernel approaches, also offer high flexibility regarding λ^(0)(t) and HR(t).
Procedures with limited flexibility regarding HR(t) could be used if the trajectory is known or assumed, or if more flexible procedures suggest the appropriateness of more restrictive models. Less flexible models will typically be easier to analyze, as relatively few model parameters suffice to describe the trajectory of the treatment effect. Additionally, the processing of model quantities, for example hazard rates, into effect measures, such as the difference in RMSTs or the aHR, might more often be analytically tractable than for the more flexible methods. Hence, results obtained from less flexible procedures might be easier to communicate. Furthermore, procedures with limited flexibility might avoid over-fitting the data, which can result in more efficient estimates if the chosen procedure is well suited for the data at hand. The short- and long-term HR model introduced by [19] is a suitable choice among the less flexible methods. The model moves the HR from an initial value HR(0) to HR(∞) in a monotone fashion [20]. Hence, the model is suitable if PH or monotonically increasing/decreasing HRs are assumed. The Yang and Prentice model is in particular suited for delayed (Figure 1a) and diminishing treatment effects (Figure 1b) as well as crossing hazards (Figure 1c) [20].
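The monotone HR trajectory of the short- and long-term HR model can be illustrated numerically. The sketch below uses one common parameterization of the Yang-Prentice hazard ratio, interpolating between exp{γ_1} at t = 0 and exp{γ_2} as t → ∞ via the baseline survival function; the exponential baseline and all parameter values are illustrative assumptions, not the authors' specification.

```python
# Sketch: hazard-ratio trajectory of the short-/long-term HR (Yang-Prentice)
# model under one common parameterization, with a hypothetical exponential baseline.
import math

lam0 = 0.10                               # hypothetical baseline hazard
g1, g2 = math.log(1.4), math.log(0.5)     # short-term harmful, long-term beneficial

def yp_hr(t):
    """HR(t) = th1*th2 / (th2*S0(t) + th1*(1 - S0(t)));
    equals th1 at t=0 and tends to th2 as S0(t) -> 0."""
    th1, th2 = math.exp(g1), math.exp(g2)
    s0 = math.exp(-lam0 * t)              # baseline survival (exponential assumption)
    return th1 * th2 / (th2 * s0 + th1 * (1.0 - s0))

for t in (0, 5, 15, 40):
    print(f"t={t:2d}: HR(t)={yp_hr(t):.2f}")
```

With these (made-up) parameters the HR declines monotonically from 1.4 towards 0.5 and crosses 1 along the way, matching the crossing-hazards capability noted in the text.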
For the change point model, a delayed or diminishing treatment effect as well as crossing hazards can be modeled by a single change point. More complex NPH scenarios might be accommodated by multiple change points, where [2] provide a tree-based method to determine the number and positions of change points. Furthermore, the accelerated failure time (AFT) model and its generalizations, which also include covariates in the scale and shape parameters, are a suitable choice. Note that the restrictions imposed on HR(t) differ widely within the mentioned methods; for example, an AFT model is in general far more restrictive than a regression model for location, scale, and shape parameters. Typically, effect parameters in a fully parametric model cannot be interpreted directly, but most summary effect measures can be computed from the model.
The assumption of a homogeneous population or a homogeneous treatment effect can be dropped by utilizing frailty models. Both scenarios will typically lead to NPH on the population level, i.e. irrespective of individual, unobservable characteristics. Individual heterogeneity might even lead to crossing hazards on the population level if the treatment effect is beneficial but diminishing on the conditional level. This is caused by a catch-up process at later times of highly frail individuals from the treatment group who tend to survive longer due to the beneficial treatment [21, p. 252]. It also highlights that a population HR(t) above one does not necessarily mean that the treatment has a detrimental effect. Empirical comparisons of NPH regression and estimation methods with simulated or real data without introducing new methodology have been rare in our literature review. Indeed, most papers provided simulation studies. However, giving recommendations on NPH methods based on these simulation studies is difficult for two reasons. Firstly, the simulation scenarios and the procedures subject to investigation differ across the papers. Hence, an aggregated result is hard, if not impossible, to obtain from the existing simulation studies. Secondly, the simulation scenarios could have been chosen in order to demonstrate superiority of the new method [22]. Based on the frequency of the methods in our literature review, time-varying coefficients for the hazard rates are the most typical choice for incorporating NPH covariate/treatment effects. Within that category, the aHR(t*) via the weighted partial likelihood (12 papers) and time-varying coefficients via (cubic and B-) splines (12 papers), followed by change point models (11 papers), are the most frequently studied methods. The second largest group is made up of parametric approaches.
AFTs and generalized additive models for location, scale and shape (GAMLSS) comprise the largest sub-group (18 papers) within the parametric approaches, followed by NPH approaches via the piecewise exponential model (9 papers). A review and simulation study on the aHR(t*) can be found in [15]. Rauch et al. [15] investigated the performance of estimates of the aHR(t*) based either on KM curves, as discussed in Section S.2.1, or on partial likelihood-based fitting procedures, as discussed in Section S.2.3.5. Five different simulation scenarios, with PH, strictly increasing, or strictly decreasing HR(t), and an administrative as well as an administrative combined with a random censoring scheme, were investigated. The authors find few differences between the two estimation procedures when the shape parameter of the weighting function is 1 for the KM-based estimate. Inappropriately chosen weights might, however, inflate standard errors and introduce substantial bias. Different choices of weights might even result in opposite inference. Consequently, Rauch et al. [15] suggest carefully checking the estimated survival curves to judge the plausibility of the estimates. From a theoretical point of view, [15] also note that the partial likelihood estimate of the aHR(t*) can incorporate further covariates, which is not possible for KM-based estimates of the aHR(t*) apart from a stratified analysis. The KM-based estimate of the aHR(t*), however, fulfills the independent increment property, so that group-sequential and adaptive designs for tests relating to KM-based estimates of the aHR(t*) can be formulated. A comparison of a (reduced rank) time-varying coefficient, gamma frailty, relaxed Burr, and a cure-rate model on real-world breast cancer data was conducted by [23]. The authors emphasize interpretational differences across those models that might highlight different features of the data.
In that sense, the time-varying coefficient model makes the nature of the covariate effect visible, whereas it cannot shed light on individual heterogeneity as the frailty model does. They conclude that the specific research question should guide the model choice.

Hypothesis tests for equality of survival curves
For the design and analysis of randomized controlled trials with time-to-event outcomes, hypothesis tests for the equality of the survival curves of experimental and control treatment (H0: S^(1)(t) = S^(0)(t) ∀ t ≥ 0) are of central importance. The equality of the survival functions is also implied by the null hypothesis formulated in terms of the hazard rates, H0: λ^(1)(t) = λ^(0)(t) ∀ t ≥ 0, or in terms of the hazard ratio, H0: HR(t) = 1 ∀ t ≥ 0. Note that conclusions for times beyond the maximum follow-up time should be avoided. Moreover, null hypotheses based on summary effect measures, e.g. H0: aHR(t*) = 1, can also be considered as valid tests of the equality hypothesis, as rejection of a more specific null hypothesis implies rejection of equality. A "stronger" null hypothesis, stating that survival in the experimental treatment is less than or equal to survival in the control arm, H0: S^(1)(t) ≤ S^(0)(t) ∀ t ≥ 0, is validly addressed by such tests under the PH assumption but not generally under NPH [24]. Table S4 gives an overview of the null hypotheses used in the articles focusing on hypothesis tests. We classified whether the null hypothesis was defined as equality of survival, less than or equal survival in the experimental arm, or whether it was an average-based hypothesis. Under the assumption of PH, the log-rank test is the standard procedure. However, if the PH assumption does not hold, power is reduced and the alternative hypothesis cannot necessarily be interpreted as a treatment benefit. Moreover, rejecting the null hypothesis H0: S^(0)(t) = S^(1)(t) ∀ t ≥ 0 in settings with NPH only means that there is a non-zero treatment effect at least in some time interval. For situations in which the PH assumption may not hold, alternative hypothesis tests and sample size calculation approaches have been proposed, which we identified in the literature review.
In our literature review, we identified three categories of hypothesis tests for the above-mentioned null hypotheses in NPH scenarios: log-rank tests, Kaplan-Meier-based tests, and combination tests. Table 3 gives an overview of these different types of tests. Additionally, Table S2 shows into which categories the identified articles fall. Table S4 provides an overview of whether the identified articles consider approaches for sample size calculation.
Augments the log-rank test with weights w(t_(i)) to emphasize observations according to their point in time. Null and alternative hypotheses are identical to those of the standard log-rank test if the weights are positive.
Software: survival, nph (R); LIFETEST (SAS). Modestly weighted log-rank test: A robust variation of the weighted log-rank test. The weights are chosen such that the test will not wrongly infer superiority of the treatment group on the basis of locally favorable hazards alone. Null and alternative hypotheses are identical to those of the standard log-rank test.
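A minimal self-contained sketch of the weighted log-rank statistic described above, assuming no-tie subtleties beyond the usual convention that events precede censorings at the same time; `weighted_logrank` and `fh01` are hypothetical names, and the Fleming-Harrington G(0,1) weight shown is one common choice among many.

```python
import math

def weighted_logrank(times, events, group, weight):
    """Two-group weighted log-rank z statistic. `weight` maps the pooled
    Kaplan-Meier estimate just before an event time to a weight w(t),
    e.g. Fleming-Harrington G(rho, gamma): S(t-)^rho * (1 - S(t-))^gamma."""
    data = sorted(zip(times, events, group))
    n = len(data)
    num, var, s_pool = 0.0, 0.0, 1.0
    at_risk, at_risk1 = n, sum(g for _, _, g in data)
    i = 0
    while i < n:
        t = data[i][0]
        d = d1 = 0
        j = i
        while j < n and data[j][0] == t:   # collect ties at time t
            d += data[j][1]
            d1 += data[j][1] * data[j][2]
            j += 1
        if d > 0 and at_risk > 1:
            w = weight(s_pool)
            exp1 = d * at_risk1 / at_risk           # expected events, group 1
            num += w * (d1 - exp1)
            var += (w ** 2) * d * (at_risk1 / at_risk) \
                   * (1 - at_risk1 / at_risk) * (at_risk - d) / (at_risk - 1)
            s_pool *= 1 - d / at_risk               # pooled KM update
        # remove everyone (events and censored) observed at time t
        at_risk -= (j - i)
        at_risk1 -= sum(data[k][2] for k in range(i, j))
        i = j
    return num / math.sqrt(var)

# Fleming-Harrington G(0,1) weights up-weight late differences.
fh01 = lambda s: 1 - s
```

With `weight=lambda s: 1.0` the statistic reduces to the standard log-rank test; negative values indicate fewer events than expected in group 1.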

Kaplan-Meier tests
Tests based on the weighted sum of the differences of the KM estimates of the survival curves.

Test by Sooriyarachchi and Whitehead
Test for differences in survival curves based on the log odds ratio of the probability of surviving past a given time point t*. The null hypothesis is the equality of the two survival curves. ¹ See the corresponding sections in the Appendix for the method descriptions.
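The quantity underlying this test can be sketched as follows; the function computes only the point estimate of the log odds ratio from two survival probabilities (which would in practice come from KM estimates at t*), while the variance and test construction of Sooriyarachchi and Whitehead are not reproduced here. The function name is illustrative.

```python
import math

def log_odds_ratio_survival(s_exp, s_ctrl):
    """Log odds ratio of surviving past t*, experimental versus control,
    from survival probabilities estimated e.g. via Kaplan-Meier at t*.
    Positive values favour the experimental arm."""
    odds_exp = s_exp / (1 - s_exp)
    odds_ctrl = s_ctrl / (1 - s_ctrl)
    return math.log(odds_exp / odds_ctrl)
```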
With prior knowledge of the NPH pattern, weighted log-rank tests can consider certain time periods to be more relevant than others. Kaplan-Meier-based tests are especially appealing to practitioners due to their intuitive interpretation. Combination tests select a test statistic from a small set of prespecified test statistics based on the data and are therefore useful without any prior knowledge of the NPH pattern. Our literature review also identified articles reviewing and comparing hypothesis testing methods under different NPH settings. For instance, Yang [25] applies different tests, including weighted log-rank tests, combination tests, and Wald tests based on estimators of the average hazard ratio or the RMST, to different randomized controlled trials to illustrate their virtually negligible loss of power in reasonably proportional situations, and emphasizes their substantial gain in power over the standard log-rank test in NPH situations. Many new tests are tailored to specific NPH situations. Therefore, Yang [25] favors the adaptively weighted log-rank test due to its overall trade-off. In the comparison study of Dormuth et al. [4], in which data sets of oncology trials were reconstructed, the log-rank permutation test proposed by Ditzhaus and Friedrich [26] detected the most differences between treatment groups. These results align with those of other articles investigating omnibus tests, e.g. Gorfine et al. [27] and Royston and Parmar [5]. If there is uncertainty regarding the underlying survival time distributions, a more recent article by Dormuth et al. [28] recommends the use of omnibus tests for comparisons between groups. Li et al. [29], Callegaro and Spiessens [30], Royston and Parmar [5], and Lin et al. [31] performed simulation studies comparing different test statistics in settings with NPH.
Li et al. [29] applied, amongst others, tests of the log-rank test family, Kaplan-Meier-based tests, and combination tests to situations of crossing survival curves at early, middle, and late times. They concluded that the adaptive Neyman's smooth test [32] and the two-stage procedure of Qiu and Sheng [33] have higher power in the considered NPH settings, provide acceptable power under PH, and keep the type I error rate close to the nominal level. Therefore, Li et al. [29] recommend the use of these tests as "the most stable and feasible approaches for a variety of situations and censoring rates". The comparison study of Callegaro and Spiessens [30] involves, among others, weighted log-rank tests with weights of the Fleming-Harrington family, max-combo tests, and the likelihood ratio test for the treatment effect in a Cox model with time-varying coefficients. Callegaro and Spiessens [30] consider the latter to often be more powerful than the weighted log-rank tests. Lin et al. [31] compare tests of the classes of weighted log-rank, Kaplan-Meier, and combination tests. Their comparison study did not identify a single test outperforming the others in all considered scenarios, e.g. delayed treatment onset, diminishing effects, crossing hazards, proportional hazards, and delayed effects with converging tails. The comparison study suggests the max-combo test as a robust test across different NPH patterns when there is no prior knowledge of the pattern. The review of Mukhopadhyay et al. [34] compared the log-rank test to the max-combo test in immuno-oncology trials identified through a systematic literature review. The authors concluded that the max-combo test is a "pragmatic alternative" to the log-rank test under NPH.
The simulations of Royston and Parmar [5] suggest that the modified versatile weighted log-rank test, an unpublished modification of the versatile weighted log-rank test [35] with Stata code available on request from Royston, performs best in terms of power under NPH (early, late, or near-PH treatment effect) when the type of treatment effect is not known in advance.
Over the last 20 years, publications on log-rank tests have appeared steadily. Research on combination tests, Kaplan-Meier-based tests, or other approaches has been comparatively rare. In the last three years, however, more research on these testing categories, including permutation approaches, e.g. [26,36], has been conducted.

Conclusions and discussion
We conducted a systematic literature review of effect estimation and testing methods that are able to cope with NPH in time-to-event analysis. Review articles focusing on different methods for NPH have been published previously. These reviews mostly focus on a quantitative comparison for specific NPH scenarios [29], on a specific method class [15], or on a qualitative overview of available methods for specific NPH situations or disease areas, e.g. oncology [1]. We conducted a systematic literature search for methodological approaches for any NPH scenario and any model class, not restricted to a specific disease area. Therefore, our review gives a comprehensive overview of the methods proposed and applicable in NPH settings. In total, our literature review includes 211 articles for final analysis. Of those articles, 113 focus on effect estimation, e.g. regression methods, 72 on testing, and 26 on both. In the effect estimation and testing literature, we identified categories to group articles according to their approach to the NPH situation. With respect to effect estimation, the categories are Kaplan-Meier-based estimation approaches, stratified Cox model, time-varying coefficients for the hazard rates, transformation models with time-covariate interaction, short- and long-term HR, joint models, frailty models, parametric models, machine learning approaches, and others. With respect to testing, the categories are log-rank tests, Kaplan-Meier tests, combination tests, and other tests. We have also broken down some of the categories into smaller sub-categories and assigned each paper to at least one of them. We provide brief explanations of the categories and a more detailed discussion in Appendix A, where we also refer to the specific papers within the corresponding categories.
The most common approaches to tackle NPH in effect estimation are time-varying coefficients for the hazard rates (47 papers) and parametric approaches that assume a distribution for the survival time (38 papers), such as GAMLSS models. The most common testing approaches for NPH are variations of the log-rank test (63 papers). We extracted and documented the software (R and SAS) utilized in the papers under review. In addition, well-known software for the individual testing and estimation categories was added by the authors. For a more complete overview of available R packages for time-to-event analysis, see [37]. For the literature review, we excluded standard methods such as the stratified Cox model, unless the baseline hazards were stratified by the treatment indicator. Consequently, our review may have missed certain innovative proposals in this area. In addition, we excluded methods that utilize external time-varying covariates, which might lead to NPH over the course of time, e.g. PK/PD models. Further, our search terms focused on terms related to NPH, which may not be a common term in other areas utilizing these methods. For the review articles considered in this review, we manually added all investigated methods to the list of articles. Nevertheless, some of those may have later been discarded due to our inclusion or exclusion criteria, see Figure 3. Consequently, some of the considered review articles may investigate methods which have not been discussed in this review. A broad range of different methods is available for both treatment effect estimation and hypothesis testing. However, there is no consensus on the best approaches under NPH. Most papers reported simulation studies (158 of 211 papers). Nevertheless, the NPH scenarios and the methods under comparison differ, making it difficult to aggregate and compare results across evaluations.
Moreover, the NPH scenario and the competitors to newly introduced methodology might have been chosen in order to demonstrate superiority of the newcomer [22]. Only a few review articles comparing different methods through simulation studies (considered to be objective) have been identified by our review. In particular for effect estimation methodology, independent comparison studies, including neutral comparison studies covering different NPH scenarios and a broad range of methods, are not available. Review articles on testing procedures cover a broad range of different NPH settings and provide guidance for the choice of the test, which, however, can differ from one comparison study to another. These reviews offer some guidance on, for example, the permutation test by Ditzhaus and Friedrich [26] and the adaptively weighted log-rank test [20] for specific NPH scenarios. Because the hypothesis tests examined are not consistent across the comparison studies, it is difficult to make a general recommendation for the use of a specific hypothesis test. Methods might be chosen based on theoretical considerations. In the absence of strong prior knowledge on the treatment/covariate effect, time-dependent treatment coefficients for the hazard rates might be modeled flexibly via a treatment-spline interaction, where the corresponding basis functions are constructed on time. In the case of strong prior knowledge, more restrictive models might be preferred, such as a (single) change point model for a delayed treatment effect (Figure 1a). Moreover, different summary effect measures have been proposed which offer an alternative to the hazard ratio. The constant hazard ratio estimated by a Cox PH model is commonly used for time-to-event analysis but might be misleading under NPH, as the hazard ratio is time-dependent in this case. Alternatives include, for example, the average hazard ratio and the ratio of RMSTs.
Both depend on the choice of a pre-specified time interval, which is restricted by the maximum follow-up time. Additionally, their usefulness depends on the pattern of the treatment effect. For instance, the difference in RMST between treatment groups is not useful for delayed treatment effects [38]. Summary effect measures can be calculated based on Kaplan-Meier curves. For more complex data, e.g. with multiple continuous covariates, other methods presented in Section 5 can be used to model the survival curves. Depending on the choice of the estimation approach, however, it might be difficult if not impossible to obtain specific summary effect measures. We identified a variety of NPH approaches for both effect estimation and testing procedures. Despite the lack of a one-size-fits-all approach, one should consider discarding well-known and commonly used methods that may not be appropriate in NPH situations. Adhering to invalid assumptions might lead to less reliable conclusions than choosing a non-optimal NPH approach for the data at hand. To fill the gap in comparisons of methods for NPH, our further assessment will explore the advantages and disadvantages of a selection of the identified methods under a wide range of NPH assumptions, see https://www.encepp.eu/encepp/viewResource.htm?id=49976.

Funding sources/sponsors
This work has received funding from the European Medicines Agency (Re-opening of competition EMA/2020/46/TDA/L3.02 (Lot 3)). "This document expresses the opinion of the authors of the paper, and may not be understood or quoted as being made on behalf of or reflecting the position of the European Medicines Agency or one of its committees or working parties."

A Appendix
A.1 Literature search and data extraction

A.1.1 Systematic literature search and study selection

We performed a comprehensive literature search using the electronic databases MEDLINE and EMBASE to identify relevant publications. PubMed was used to search MEDLINE. EMBASE covers both MEDLINE and EMBASE; we therefore excluded MEDLINE records from our EMBASE search. Table A2 provides details of the literature search. The search consists of two parts. The first part includes topic-related search terms such as "non-proportional hazards" in various spellings (main search in MEDLINE and EMBASE; rows 1 to 9 of Table A2). In the second part of the search, we used broader pre-specified topic-related terms (e.g. "crossing hazard*", "delayed effect", "treatment switch"). This part of the search was restricted to statistical journals using the Web of Science category 'Statistics & Probability' (additional search in MEDLINE; rows 10 to 19 of Table A2). Reference lists of included publications and relevant reviews were checked manually to identify any additional relevant articles not captured by the search. We included methodological research articles, reviews, and clinical investigations with time-to-event endpoints where methods for non-proportional hazards were applied. More specifically, the inclusion and exclusion criteria specified in Table A1 were used. Excluded were, for example, applications of conventional methods for NPH, such as stratified Cox regression or the use of time-dependent covariates in proportional hazards models [39], and methods for competing risks or recurrent event analyses.

A.1.2 Data extraction
Two reviewers independently screened the search results for relevant articles. The reviewers screened titles and abstracts and excluded papers that clearly did not meet the inclusion criteria. During the abstract screening, an article was included for full-text screening if at least one of the two reviewers included the abstract. For all selected abstracts, we obtained the full texts, and two reviewers reviewed these according to the pre-specified inclusion criteria (e.g. statistical methods, simulations, time-to-event endpoints, non-proportionality assumption). From the full texts, data on basic characteristics of the investigation, on the proposed methods, and, if applicable, on simulation studies conducted were collected by two independent reviewers. A third reviewer resolved all disagreements on the inclusion of the full texts and on the extracted data. Characteristics of the proposed methods include whether the methods use frequentist or Bayesian, parametric or non-parametric approaches, and whether the methods adjust for covariates. We distinguished between articles including methods for treatment effect estimation, hypothesis testing, and manuscripts covering both. We also extracted the availability of software for the proposed method and differentiated whether the software was freely available (e.g. a GitHub repository, an R package, or code snippets in the text), available (e.g. upon request from the authors), or not available or mentioned in the text. Additionally, the proposed methods were classified into one of the predefined categories: • Log-rank approaches, e.g.
weighted log-rank test

Table A2 (excerpt): search strategy

1 "non-proportional hazard" OR "nonproportional hazards" OR "nonproportional hazard" OR "nonproportional hazards" OR "non proportional hazard" OR "non proportional hazards" (main search in MEDLINE and EMBASE without any journal restrictions)
2 non-proportionality OR nonproportionality OR "non proportionality"
3 (NPH OR NPHs OR "non PH" OR "non PHs" OR non-PH OR non-PHs) AND (assumption OR assumptions)
4 "proportional hazard" OR "proportional hazards" OR "hazard assumption" OR "hazard assumptions"
The non-parametric KM estimate of the survival curve is defined as Ŝ^(Z)(t) = ∏_{t_(i) ≤ t} (1 − d^(Z)_(i)/Y^(Z)(t_(i))), where d^(Z)_(i) is the number of events of treatment group Z observed at t_(i), Y^(Z)(t_(i)) is the number of subjects at risk just before t_(i), and the product iterates over the ordered distinct event times t_(i). Alternatively, the NA estimate of the cumulative hazard function, i.e. Λ^(Z)(t) = Σ_{t_(i) ≤ t} d^(Z)_(i)/Y^(Z)(t_(i)), can be utilized. The treatment-stratified KM estimates do not impose any modelling assumptions on HR(t). Thus, any trajectory of HR(t) can be accommodated. The KM (NA) estimators result in step functions, however, so that the estimation of HR(t) requires additional smoothing techniques. The aHR(t*) can be re-formulated as an expression of the survival functions, where the estimates might be obtained via KM [15]. A doubly-weighted Nelson-Aalen estimator that accounts for dependent censoring and treatment-specific covariate distributions was introduced by [40] via inverse probability of treatment weighting and inverse probability of censoring weighting. The authors suggest processing the treatment-specific cumulative hazard estimators into cumulative treatment effect measures such as the cHR, the relative risk, and differences in the RMST. Quantiles and t-year survival rates can be obtained by standard statistical software that computes KM survival curves, such as the R package survival. The R package survRM2 performs two-sample comparisons with the RMST and also includes a function to perform an ANCOVA-type covariate adjustment. The SAS procedure LIFETEST can be utilized to obtain non-parametric estimates of the survival function and to carry out tests of homogeneity across strata or of the association of numeric covariates with survival. The R package AHR provides a software solution for the aHR.
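The KM and NA estimators above can be written compactly. The following self-contained sketch (illustrative function names, no variance estimation or smoothing; events are processed before censorings at tied times) returns the step-function values at the distinct event times:

```python
from collections import Counter

def kaplan_meier(times, events):
    """Kaplan-Meier estimate: at each ordered distinct event time t_(i),
    multiply by (1 - d_(i) / Y(t_(i))), with d_(i) events and Y(t_(i))
    subjects at risk. Returns (event_times, survival_probabilities)."""
    d = Counter(t for t, e in zip(times, events) if e == 1)
    removed = Counter(times)            # events and censorings per time
    s, at_risk = 1.0, len(times)
    out_t, out_s = [], []
    for t in sorted(removed):
        if t in d:
            s *= 1 - d[t] / at_risk
            out_t.append(t)
            out_s.append(s)
        at_risk -= removed[t]
    return out_t, out_s

def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard at the last time."""
    d = Counter(t for t, e in zip(times, events) if e == 1)
    removed = Counter(times)
    cum, at_risk = 0.0, len(times)
    for t in sorted(removed):
        cum += d.get(t, 0) / at_risk
        at_risk -= removed[t]
    return cum
```

Step functions like these are the raw material for the smoothing, aHR(t*), and RMST computations discussed in this section.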

S.2.1.1 Pseudo values
Pseudo-values are usually calculated from KM-based estimates, for example of RMST(t*) or S(t*). The basic idea of the pseudo-value approach is to compute a quantity of interest at a given time t*, e.g. RMST(t*) or Ŝ(t*), and to re-compute that measure with the i-th observation dropped. Then, for example, N Ŝ(t*) − (N − 1) Ŝ^(−i)(t*), where Ŝ^(−i) denotes the KM estimate of the survival function with the i-th observation dropped, is the so-called pseudo-value and the i-th entry of a new dependent variable. In this example, this value is an estimate of the survival function for the i-th individual. Hence, the interrelationship between the survival function and the cumulative hazard rate, S^(Z)(t) = exp(−Λ^(Z)(t)), can be used to transform the pseudo-value into an estimate of the log cumulative hazard rate Λ^(Z)(t) by applying a log-log transformation to the pseudo-values. The pseudo-value procedure reduces time-to-event data, which are usually defined by two variables (time, event indicator), to a single continuous variable and allows modelling by common regression models, i.e. linear models with the treatment indicator being a covariate, with possibly transformed pseudo-values [41]. Application of the pseudo-value procedure to Ŝ(t*), where the pseudo-value N Ŝ(t*) − (N − 1) Ŝ^(−i)(t*) is analysed by the linear regression model x_i^T β + Z_i γ, i = 1, . . . , N, gives estimates of S^(1)(t*) − S^(0)(t*) through γ, assuming that x is held constant. A weighted pseudo-value approach to estimate survival probabilities at time t* was introduced by [42]. They used the log-log transformation of the pseudo-values and gave estimates for cHR(t*). The weights were proposed to overcome missing information on group membership in stem cell transplantation patients, where early death stops the search for a suitable stem cell donor. Yang et al.
[43] utilized the pseudo-value approach for the RMST: subject to investigation was the RMST given survival up to a landmark t* with (chosen) follow-up time t* + w, cRMST(t*, w), i.e. the expected lifetime within (t*, t* + w) given survival up to t*. Estimates of the RMST were obtained via KM. Then, in a leave-one-out step, the cRMST without the i-th observation, cRMST^(−i)(t*, w), was computed. The individual differences N* cRMST(t*, w) − (N* − 1) cRMST^(−i)(t*, w), with N* being the sample size conditional on survival up to t*, were then used as the dependent variable in a linear regression model x_i^T β + Z_i γ. If multiple landmark time points are considered in a single analysis, covariate-time interactions might be utilized to obtain time-dependent covariate effects, e.g. Z_i γ(t*), with γ(t*) = γ_1 + t* γ_2 + t*² γ_3, for the treatment component. The (time-dependent) treatment efficacy of such an approach can be evaluated via γ(t*), which estimates cRMST^(1)(t*, w) − cRMST^(0)(t*, w). Pseudo-values were also considered by [44] and [45]. We did not encounter further effect measures in our literature review. The R packages pseudo and prodlim as well as the SAS and R functions in [46] are available for computing pseudo-values. Pseudo-values for RMST(t) [but not for S(t)] can be calculated by PROC RMST in SAS.
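The leave-one-out construction for S(t*) can be sketched as follows; `km_at` and `pseudo_values` are hypothetical names, `km_at` is a simplified KM evaluation (events before censorings at tied times), and the subsequent linear-model step on the pseudo-values is omitted:

```python
def km_at(times, events, t_star):
    """Kaplan-Meier survival probability at t_star; at tied times,
    events are processed before censorings (the usual convention)."""
    pairs = sorted(zip(times, events), key=lambda p: (p[0], -p[1]))
    s, at_risk = 1.0, len(pairs)
    for t, e in pairs:
        if t > t_star:
            break
        if e == 1:
            s *= 1 - 1 / at_risk
        at_risk -= 1
    return s

def pseudo_values(times, events, t_star):
    """Pseudo-value for subject i: N*S_hat(t*) - (N-1)*S_hat^(-i)(t*),
    where S_hat^(-i) drops the i-th observation."""
    n = len(times)
    full = km_at(times, events, t_star)
    pv = []
    for i in range(n):
        loo = km_at(times[:i] + times[i + 1:],
                    events[:i] + events[i + 1:], t_star)
        pv.append(n * full - (n - 1) * loo)
    return pv
```

A useful check: without censoring, the pseudo-value for subject i reduces to the indicator of surviving past t*.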

S.2.1.2 Quantile regression
Quantile regression is an approach to systematically compare the estimated survival quantiles t^(Z)_τ = (S^(Z))^(−1)(1 − τ), Z = 0, 1, for τ ∈ (0, 1). A quantile regression approach has been established based on the generalized KM estimator [47] or, alternatively, on martingale-based estimating equations [48]. In the latter case, the logarithm of the τ-th survival quantile ln{t_τ} given covariate information is assumed to have a linear relationship with the covariates, i.e. ln{t_τ | x, Z} = x^T β(τ) + Z γ(τ). The estimated treatment effect exp(γ̂(τ)) expresses the ratio t^(1)_τ / t^(0)_τ in the case of equality in x. A value larger than 1 of this ratio indicates a beneficial treatment effect at the respective quantile. See [49] for an application evaluating the long-term benefit of immunotherapy. Xue et al. [50] developed a leave-one-out cross-validation approach for quantile regression. An implementation of quantile regression including methods applicable to data with censored observations is available in the R package quantreg. The SAS procedure QUANTLIFE provides an implementation of quantile regression based on generalizations of the KM and NA estimators.
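Reading a survival quantile off a KM step function, and forming the unadjusted quantile ratio that exp(γ̂(τ)) generalizes to the covariate-adjusted case, can be sketched as follows (illustrative helper names; this is not the estimating-equation approach of [48]):

```python
def survival_quantile(times, surv, tau):
    """tau-th survival quantile read off a KM step function: the smallest
    event time t with S(t) <= 1 - tau (times ascending, surv the KM
    values). Returns None if the quantile is not reached in follow-up."""
    for t, s in zip(times, surv):
        if s <= 1 - tau:
            return t
    return None

def quantile_ratio(q1, q0):
    """Unadjusted analogue of exp(gamma(tau)): ratio of the treatment-
    and control-group quantiles; values > 1 favour the treatment arm."""
    return q1 / q0
```

Returning None when the curve never drops below 1 − τ mirrors the practical restriction that late quantiles are not identifiable within the observed follow-up.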

S.2.2 Stratified Cox model
The stratified Cox model is a well-known model that relaxes the PH assumption for the stratification variable(s). We excluded the basic stratified Cox model, where stratification by variables other than the treatment indicator is sufficient, from our literature review. The semiparametric stratified Cox model, in particular, does not impose modelling restrictions on the trajectory of HR(t) along the stratified variables. The hazard rate is defined as λ^(Z)(t) = exp(x^T β) λ_0^(Z)(t), with a treatment-specific baseline hazard λ_0^(Z). Table S3 column B shows papers that belong to the category 'stratified Cox model'. Again, estimation of HR(t) requires additional smoothing techniques due to the step-function nature of the Breslow estimate. An estimate of cHR(t*) was suggested by [51] and [52]. The R package survival and the SAS procedure PHREG include the stratified Cox model.
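The estimation idea can be sketched via the stratified log partial likelihood: each stratum contributes its own risk sets (and hence keeps its own, unspecified baseline hazard) while β is shared across strata. The sketch below assumes a scalar covariate and no tied event times; the function name is illustrative.

```python
import math

def stratified_log_partial_lik(strata, beta):
    """Sum of Cox log partial likelihoods over strata. `strata` maps a
    stratum label to a list of (time, event, x) tuples with scalar
    covariate x; risk sets never cross stratum boundaries."""
    ll = 0.0
    for obs in strata.values():
        obs = sorted(obs)                      # ascending event times
        for i, (t, e, x) in enumerate(obs):
            if e:                              # events contribute terms
                risk = sum(math.exp(beta * xj) for (_, _, xj) in obs[i:])
                ll += beta * x - math.log(risk)
    return ll
```

Maximizing this function over β (e.g. by Newton's method) yields the stratified Cox estimate; with the treatment indicator as the stratification variable, the two baseline hazards remain completely unrestricted.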

S.2.3 Time varying coefficients for the hazard rates
Time-varying coefficients for the hazard rates γ(t) can be utilized to incorporate, for example, a late or diluting treatment effect causing NPH. The hazard rate could be of the form λ^(Z)(t) = exp(Z γ(t) + x^T β) λ_0(t). Note that the time-varying treatment effect γ(t) of those models can essentially be viewed as an interaction term of (some function of) time and the treatment indicator. The specific models differ in how those interactions are constructed. The approaches to time-varying coefficients, or time-covariate interactions respectively, are plentiful and encompass change point models, fractional polynomials, spline-based approaches, other functional time-covariate interactions, and the additive model. The admissible trajectory of HR(t) = exp(γ(t)) depends on the complexity of γ(t). The aHR, or weighted Cox regression, which will be discussed in Section S.2.3.5 below, can be seen as a less complex summary measure in a time-varying coefficient setting. Columns of Table S3 that belong to the broader category 'time-varying coefficients' of Table S2 start with a 'C'. Common approaches to fit the models are the maximization of the partial likelihood (e.g. [53], [54], or [55]) or of the full likelihood, possibly via GLM routines (e.g. [56], [57], [58], or [59]), or Bayesian approaches (see Table S3 column K). Column K of Table S3 indicates whether the corresponding paper took a Bayesian or a frequentist approach, so that the interested reader may consult the corresponding papers for details on specific fitting procedures.

S.2.3.1 Change point of the treatment coefficient
A late treatment effect, for example, might be modeled by γ(t) = γ I(t ≥ t*), where I denotes the indicator function, and a diluting treatment effect through γ(t) = γ_1 + γ_2 I(t ≥ t*), with some (chosen) threshold value t* > 0 and γ_2 > 0. Johannes [60] suggested a grid search to find the location of the change point t* associated with the highest partial likelihood value. A generalized version of the single change point model introduces a piecewise constant regression coefficient γ_k for each of K chosen disjoint time intervals, i.e. γ(t) = Σ_{k=1}^{K} γ_k I_k(t), where I_k(t) is the indicator function which equals 1 if t lies in the k-th interval and 0 otherwise. A tree-based method to find multiple change points was suggested by [2]; a change point is chosen such that it maximizes the score test statistic. This process is iterated over the new time partitions resulting from the former step. To avoid over-fitting, [2] also considered stopping criteria and a pruning mechanism. Papers allocated to the category 'change point for time-varying effect' can be found in Table S3 column C.1.
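The grid search idea can be sketched as follows: profile the Cox partial likelihood for γ(t) = γ I(t ≥ t*) over a grid of candidate change points and coefficient values. This is a coarse illustrative sketch with hypothetical function names; in practice γ would be maximized by Newton-type methods for each candidate t*, and ties would be handled properly.

```python
import math

def log_partial_lik(times, events, z, gamma, t_star):
    """Cox log partial likelihood for gamma(t) = gamma * I(t >= t_star),
    single binary covariate z, no tied event times assumed."""
    data = sorted(zip(times, events, z))
    ll = 0.0
    for i, (t, e, zi) in enumerate(data):
        if not e:
            continue
        g = gamma if t >= t_star else 0.0       # effect only after t*
        risk = sum(math.exp(g * zj) for (_, _, zj) in data[i:])
        ll += g * zi - math.log(risk)
    return ll

def grid_search_changepoint(times, events, z, cand_tstar, cand_gamma):
    """Profile grid search: for each candidate change point, maximise
    the partial likelihood over gamma; return (t_star, gamma, loglik)."""
    best = (None, None, -math.inf)
    for ts in cand_tstar:
        for g in cand_gamma:
            ll = log_partial_lik(times, events, z, g, ts)
            if ll > best[2]:
                best = (ts, g, ll)
    return best
```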

S.2.3.2 Time-varying coefficients with continuous paths

A smooth path of γ(t) can be obtained by choosing a continuous function (in t).
For example, γ(t) = γ_0 + t γ_1 + t² γ_2 [61]. Gustafson [62] considered an additive hazard model where γ(t) = (1 − t/t*) γ_0 + (t/t*) γ_1 for t < t* and γ(t) = γ_1 otherwise. Sauerbrei et al. [55] chose γ(t) = γ_0 + log(t) γ_1 as default and provided an algorithm for model selection: firstly, start with an initial model, possibly incorporating functional forms and interactions of the covariates. Secondly, re-run the model on a restricted time interval (0, t*) and, if necessary, add relevant covariates from this step to the model for the entire time interval. Thirdly, for each covariate, include time-varying coefficients on the entire time interval and add them to the final model if they provide a significant improvement. More flexible functions γ(t), obtained through fractional polynomials in time (see Section S.2.3.3), might be chosen if the enhanced model provides a significantly better fit as indicated by a deviance test. Papers with some functional covariate-time interaction that do not belong to one of the more specific time-varying coefficient categories above or below can be found in column C.2 of Table S3.

S.2.3.4 Splines
A non-parametric approach for time-varying coefficients can be incorporated through the use of splines. A popular choice are restricted cubic splines. Splines, in general, require the selection of knots that slice the time scale into disjoint intervals. In the case of restricted cubic splines, local cubic polynomials are fitted [65]. The term "restricted" refers to the linear handling of the tails, i.e. from 0 to the first knot and from the last knot to infinity the function is linear. We denote the corresponding time-varying treatment effect by γ(t) = γ_0 + γ_1 t + Σ_d B_d(t) γ_{d+1}, where B_d denotes the corresponding basis function and the sum iterates over all basis functions. The number of basis functions depends on the number of knots. Applications can be found in [66] for an excess hazard model, in [56] for a network meta-analysis, and in [64]. The spline approach can be extended to arbitrary choices of basis function. Let γ(t) = Σ_d B_d(t) γ_d. The B_d might, for example, be B-spline basis functions of arbitrarily high degree and number of knots. The degree and the number of knots determine the number of basis functions and parameters, respectively, for the time-varying treatment coefficient. Overfitting can be avoided by introducing penalties on the flexibility. See, for example, [58], who exploited Poisson likelihood regression, i.e. piecewise exponential hazards (see below), and penalized B-splines for survival model building. Similarly, [54] considered penalized B-splines and truncated splines. Argyropoulos and Unruh [57] also exploited the Poisson regression approach and included time-varying coefficients and multiple time scales via penalized cubic and thin plate splines. [67] considered time-varying coefficients via B-splines in an excess hazard model. A spline-based time-varying coefficient in a Cox model setting is available via the R package dynsurv.
The R package polspline contains the function hare, which utilizes linear splines to model the baseline hazard and covariate effects. Column C.4 of Table S3 shows papers that utilized spline approaches to model time-varying coefficients. Note that in none of the aforementioned models is the time-varying coefficient approach restricted to the treatment effect (or binary variables); it applies to covariates more generally.
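The construction of a restricted cubic spline basis for γ(t) can be sketched directly. The following hypothetical snippet uses Harrell's parameterization (cubic between the knots, linear beyond the boundary knots); function names and parameters are ours, not from any cited package.

```python
def rcs_basis(t, knots):
    """Restricted cubic spline basis terms (Harrell's parameterization):
    for K knots there are K - 2 nonlinear basis terms, each linear in
    the tails beyond the boundary knots."""
    k = sorted(knots)
    K = len(k)

    def pos(x):  # truncated cubic (x)_+^3
        return max(x, 0.0) ** 3

    terms = []
    for j in range(K - 2):
        b = (pos(t - k[j])
             - pos(t - k[K - 2]) * (k[K - 1] - k[j]) / (k[K - 1] - k[K - 2])
             + pos(t - k[K - 1]) * (k[K - 2] - k[j]) / (k[K - 1] - k[K - 2]))
        terms.append(b)
    return terms

def gamma_t(t, g0, g1, g_spline, knots):
    """Time-varying coefficient gamma(t) = g0 + g1*t + sum_d B_d(t) g_{d+1}."""
    return g0 + g1 * t + sum(b * g for b, g in zip(rcs_basis(t, knots), g_spline))
```

The restriction shows up numerically: before the first knot every basis term is zero, and beyond the last knot each term grows linearly, so γ(t) is linear in both tails.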

S.2.3.5 Average hazard ratio and summary effect measures obtained by the (partial) likelihood
The aHR can be calculated as a summary measure over the follow-up period t̄ or some t* ≤ t̄. This summary measure has already been mentioned in Section S.2.1 based on KM estimates of the survival function. Numerous estimation approaches for aHR(t*) based on the weighted partial likelihood have been proposed in the literature. Consider the (true) model λ^(Z)(t) = exp(Zγ(t)) λ_0(t). The idea is to estimate λ^(Z)(t) = exp(Zγ) λ_0(t) instead and to interpret γ or exp(γ) as an average treatment effect by introducing weights in the partial likelihood (see, for example, [68], [69], or [14]). Results from standard Cox regression depend on the censoring distribution, and weighted Cox regressions with inverse probability of censoring weights have been proposed. Schemper [14] accordingly recommended choosing the weighting function G(t) = S(t)/C(t), where S(t) denotes the event time distribution of the complete sample and C(t) denotes the corresponding censoring distribution. For a review of differing definitions of the aHR as well as different weighting functions see [14]. Such approaches are not restricted to an estimate of aHR(t*), however. Lin and León [70] obtained adjustment factors based on weights from the log-rank test, where exp(γ) is the maximum treatment effect over the course of time. An aHR estimator for the Yang and Prentice model (discussed in Section S.2.6) was proposed by [71]. Papers that approached the aHR or similar summary effect measures in a semi-parametric fashion through the weighted partial or pseudo-likelihood can be found in column C.5 (Table S3). Note that we classified non-KM-based estimations of the aHR, or weighted Cox regression, as time-varying coefficients in Table S2. Further note that papers which based the estimation of aHR(t*) on KM estimates are not subsumed in column C.5 but in column A of Table S3.
The R package coxphw and the SAS macro WCM provide weighted estimation of Cox regression, which might be utilized to estimate aHR(t*) with appropriate weights.

S.2.3.6 Additive models
Aalen's additive model also considers time-varying coefficients and NPH [21]. The additive model exploits the martingale representation M(t) of the counting process, where, as dt → 0, dN_i(t) = 1 denotes that the event occurred in the infinitesimally small interval following t, and dN_i(t) = 0 refers to no event in that interval. The hazard rate is assumed to be additive in the predictors, i.e. λ^(Z)(t) = x^T β(t) + Zγ(t). This leads to a twist in the interplay of covariates on absolute and relative differences of the hazard rates as compared to the multiplicative hazard model: keeping everything else constant, the treatment effect in an absolute sense is λ^(1)(t) − λ^(0)(t) = γ(t), which is independent of the level of the other covariates, whereas the relative treatment effect HR(t) = λ^(1)(t)/λ^(0)(t) = 1 + γ(t)/(x^T β(t)) does depend on the level of the remaining covariates. If, for example, λ^(Z)(t) = λ_0(t) exp(x^T β(t) + Zγ(t)) instead, it is the other way around. Instead of optimizing the (partial) likelihood, least squares methodology is utilized. The individual squared error contribution at t equals (dN_i(t) − λ^(Z)(t)dt)², where λ^(Z)(t)dt is subject to estimation. Note that the estimates of λ^(Z)(t)dt are not restricted to the interval [0, 1] and thus might not be interpretable as probabilities. The increments λ^(Z)(t)dt will typically be estimated poorly. Estimates of Λ^(Z)(t) can be obtained through Σ_{i≥1: t_(i) ≤ t} λ^(Z)(t_(i))dt, where the summation over the increments achieves stability in the estimates (Chapter 4.2.1 in [21]). Applications of the additive model can be found in [72] and [73]. Dunson and Herring [74] placed a model selection prior on an additive-multiplicative survival model and restricted the additive part to ensure non-negative hazards. Martinussen and Pipper [75] developed an odds-

S.2.4 Transformation models with time-covariate interaction
In this section we place the Royston-Parmar model as well as the conditional transformation model (CTM). Both approaches have in common that time- and covariate-dependent model quantities are modelled via splines and spline-by-covariate interactions. Both approaches can also be motivated as generalizations of chosen parametric models. The Royston-Parmar model and the CTM are very similar for appropriately chosen reference functions.
The starting point of the Royston-Parmar model is a transformation of the survival function, denoted by g(S^(Z)(t)). Initially, the function is assumed to be linear in the covariates and the treatment indicator, i.e. g(S^(Z)(t)) = g(S_0(t)) + x^T β + Zγ. The point of reference is either the Weibull distribution and PH, or the log-logistic distribution and proportional odds. We focus on the Weibull case. Then, g(S^(Z)(t)) = ln{−ln{S^(Z)(t)}} = ln{Λ_0(t)} + x^T β + Zγ. Note that the Weibull AFT arises if ln{Λ_0(t)} is linear in ln{t}. To allow for more flexibility, ln{Λ_0(t)} is replaced by a restricted (natural) cubic spline function of ln{t}, s(t; ω) = Σ_d B_d(ln{t}) ω_d, where ω is a parameter vector; the accelerated failure time interpretation is then lost. In the log-logistic case, the same path is followed in analogy, except that the log cumulative hazard function is replaced by the log cumulative odds function, that is, the log-odds of an event occurring in the interval (0, t). NPH (or non-proportional odds) can be incorporated by an interaction of the natural cubic splines and the treatment indicator, i.e. s(t; ω) + s(t, Z; γ), with s(t, Z; γ) = Σ_d Z B_d(ln{t}) γ_d, where γ is a vector of parameters. Further covariates might be included in s(t, Z, x; ω) in the same fashion to allow for more non-proportional effects. For the sake of comparison with the conditional transformation model (CTM), we call the functions s(·) transformation functions.
For more detailed information about the Royston-Parmar model see [59]. An implementation of the Royston-Parmar model is available through the R package flexsurv [76]. The CTM formulates S^(Z)(t) as 1 − G(Σ_d h_d(t, x, Z)), where G is a chosen cumulative distribution function (cdf). If G is the minimum extreme value distribution, the Cox model is a special case for appropriately chosen (or estimated) transformation functions h_d. In general, the transformation functions h_d consist of interactions of time and covariates as well as parameters which are subject to estimation. The transformation function for the treatment indicator could, for example, be h(t, Z) = Σ_d B_d(t) Z γ_d. Möst et al. [77] suggested interactions of penalized B-splines in time t and B-splines or linear basis functions of covariates. Consider the case where the reference cdf of the CTM is that of the standard minimum extreme value distribution, G(·) = 1 − exp{−exp{·}}, and the Royston-Parmar model has the Weibull model as reference, as described above. Moreover, let all covariates be included in the NPH manner as illustrated above for the Royston-Parmar model and the CTM. Then, the main difference between the Royston-Parmar model and the CTM as reported here is the choice of basis functions, natural cubic splines vs. penalized B-splines as proposed by [77], and the time scale on which the basis functions are computed, ln{t} vs. t. Papers that focus on the Royston-Parmar model or the CTM can be found in column F of Table S3. Note that all papers in that column focus on the Royston-Parmar model except for [77], which is about the CTM.

S.2.5 Joint models
A NPH model can also be obtained by jointly modeling a longitudinal and a survival outcome. Certain time- and treatment-dependent components of the longitudinal model might, for example, be the covariate input of the survival model. Let η(t, Z) = γ_L g(t, Z) be the conditional expectation of the longitudinal outcome at t (in the absence of further covariates), with g(t, Z) being a chosen function of treatment and time and γ_L the corresponding coefficient vector in the longitudinal model. Then, the hazard might be modelled as λ^(Z)(t) = exp(x^T β + Zγ + η(t, Z)γ_2) λ_0(t). In this scenario, the treatment has a direct constant impact on the HR(t) through γ as well as an indirect time-varying impact through η(t, Z)γ_2. See [78] for a model of that kind that also includes further covariates and random effects in the longitudinal component. Articles focusing on joint models can be found in column D of Table S3. Xu et al. [78] also discuss the posterior distribution of the aHR(t*) in a Bayesian setting. The package and macro JM provide a software solution for joint modeling of longitudinal and time-to-event data in R and SAS, respectively. The SAS macro is based upon NLMIXED, MIXED, GLIMMIX, and LIFEREG.

S.2.6 Short- and long-term HR
The methods of Section S.2.3 accounting for a time-varying treatment effect can essentially be considered as time-covariate interactions. Alternatively, parametric functions of the HR(t) can be assumed, where the time-varying effect results from interactions of parameters with time-dependent baseline measures such as Λ_0(t) or S_0(t). The models in this sub-section differ from fully parametric approaches in that they have non- or semi-parametric components of the baseline measures. We gathered those models in columns F.1 and F.2 in Table S3. Such a model was introduced by [19] and further studied by [79]. A potentially non-proportional treatment effect can be incorporated by the hazard function λ^(Z)(t) = [exp(Zγ_1) exp(Zγ_2) / (exp(Zγ_2) S_0(t) + exp(Zγ_1) F_0(t))] λ_0(t), with F_0(t) = 1 − S_0(t). The model is also termed the short- and long-term HR model: here, exp{γ_1} is the short-term and exp{γ_2} the long-term hazard ratio of the treatment variable. This model includes strictly increasing and decreasing HR scenarios as well as PH if γ_1 = γ_2. Crossing survival curves as well as no initial treatment effect are also sub-models of the Yang and Prentice model [19]. The Yang and Prentice model readily delivers an estimate of HR(t). An estimator of aHR(t*) has also been established [25]. A generalization of the PH model has also been considered through λ^(Z)(t) = λ_0(t) exp( x^T(β_1 + β_2) + Z(γ_1 + γ_2) + [exp(x^T β_2 + Zγ_2) − 1] log[Λ_0(t)] ), where β_2 and γ_2 are additional model parameters. The PH model is obtained by β_2 = γ_2 = 0. The corresponding survival function equals S^(Z)(t) = exp(−exp(x^T β_1 + Zγ_1) Λ_0(t)^{exp(x^T β_2 + Zγ_2)}). Crossing survival curves are also possible. The baseline hazard function can, for example, be estimated via splines [80]. Papers that focus on the Yang and Prentice model as well as the model introduced by [80] can be found in Table S3, column E.
R software packages for the Yang and Prentice model are available: YPPE, which estimates the baseline quantities via a piecewise exponential distribution [81]; YPBP, which estimates the baseline distribution via Bernstein polynomials [82]; and YPmodel.
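The interpolation between short- and long-term hazard ratios can be illustrated numerically. The sketch below assumes the Yang-Prentice hazard-ratio form HR(t) = 1/(exp(−γ_1) S_0(t) + exp(−γ_2)(1 − S_0(t))); the function name is ours.

```python
import math

def yp_hazard_ratio(s0, gamma1, gamma2):
    """Yang-Prentice hazard ratio expressed as a function of the baseline
    survival s0 = S_0(t): it moves from exp(gamma1) (t -> 0, s0 -> 1)
    to exp(gamma2) (t -> infinity, s0 -> 0)."""
    return 1.0 / (math.exp(-gamma1) * s0 + math.exp(-gamma2) * (1.0 - s0))
```

With γ_1 = γ_2 the expression collapses to a constant exp(γ_1), i.e. the PH sub-model; with γ_1 > 0 > γ_2 the hazard ratio crosses 1 as S_0(t) decreases.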

S.2.7 Frailty models
Frailty can introduce NPH on the population level even if the PH assumption holds on the individual level given the unobservable characteristics. Frailty, or unobserved heterogeneity, might be induced by unmeasured or unmeasurable covariates. Highly frail individuals are prone to "early" events due to a high individual or conditional hazard, and vice versa. We denote by population hazard the hazard rate with the individual frailty "integrated out" in this paragraph. Assume an extreme case with lower population hazard for the treatment group early on but higher population hazard at later stages, as compared to the control group. This is not necessarily a sign of a harmful long-term treatment effect. Instead, it could be an indicator of successful treatment, as it might be caused by long-term survival of highly frail individuals in the treatment group. The most straightforward frailty model can be expressed as λ^(Z)(t|U) = U exp(x^T β + Zγ) λ_0(t), with the frailty random variable U ≥ 0. The corresponding population hazard is λ^(Z)(t) = E[U | T ≥ t] exp(x^T β + Zγ) λ_0(t), where the expectation is taken with respect to the frailty density of survivors. Analytical expressions for λ^(Z)(t) are available if f_U(u) is, e.g., the gamma density or another density of the power variance family (see, for example, Chapter 6.2.3 in [21]), among others. See Aalen et al. [21] or Wienke [83] for a thorough discussion of individual frailty. A tutorial on frailty models that also discusses the population hazard ratio can be found in [84]. We took a very broad perspective of frailty in this work and also put, for example, cure-rate models in this category, which can be motivated by a binary frailty model [83]. Unobserved heterogeneity has, for example, been considered through semi-parametric transformation models [85], spatially correlated frailty models [86], time-varying frailty models [87], random delayed treatment effect two-point cure rate models [88], as well as responder–non-responder models [89]. Papers taking a frailty perspective are marked in column G of Table S3.
The SAS procedures phreg and nlmixed, the R packages survival, coxme, frailtyEM, and frailtypack provide frailty models.

S.2.8 Fully parametric models
For the sake of discussion we separate parametric approaches into four classes: piecewise exponential, accelerated failure time (AFT), first hitting time (FHT) and GLMs.

S.2.8.1 Piecewise exponential hazards
The stratified piecewise exponential model assumes a constant hazard rate λ^(Z)(t) = λ^(Z)_k within the specified interval t ∈ [t_{k−1}, t_k), with k = 1, ..., K, t_k < t_{k+1}, t_0 = 0 and t_K = ∞. Then, the HR(t) is piecewise constant. Further covariates might be included in the usual PH manner, i.e. λ^(Z)(t) = λ^(Z)_k exp(x^T β^(Z)) for t in the k-th time interval, where (elements of) β^(1) might be equal to (their counterparts in) β^(0). The stratified piecewise exponential model approaches the stratified semi-parametric Cox model if the borders of the time intervals are set by the distinct event times. Hagar et al. [90] discussed a Bayesian non-proportional multiresolution hazard model. The hazard is piecewise constant and treatment specific. Across partitions, the hazard parameters might be correlated, which can be regarded as a smoothing attempt. A second smoothing approach is imposed through the merging of adjacent time intervals if mortality patterns are statistically similar. Constant or time-varying covariate effects might also be added to the piecewise exponential model. Papers that model time-varying covariate effects via the piecewise exponential model can be found in Table S3, column H.1. With respect to fitting procedures, the Poisson regression model (with an adequate offset) and its software implementations can be exploited (see, e.g., [91], [57] or [58]). The R packages eha and pch contain the piecewise exponential model. An R software implementation of the multi-resolution hazard model is available with the package MRH.
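Within one stratum, the piecewise exponential MLE reduces to events divided by person-time per interval (the same quantities a Poisson GLM with log person-time offset would use). A minimal sketch, with a function name of our choosing:

```python
def piecewise_exponential_mle(times, events, cuts):
    """Hazard MLE per interval: number of events divided by person-time
    at risk. `cuts` are the interior cut points; the intervals are
    [0, c1), [c1, c2), ..., [ck, infinity)."""
    bounds = [0.0] + list(cuts) + [float("inf")]
    hazards = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # person-time contributed to [lo, hi) by subjects still at risk
        exposure = sum(min(t, hi) - lo for t in times if t > lo)
        d = sum(1 for t, e in zip(times, events) if e and lo <= t < hi)
        hazards.append(d / exposure if exposure > 0 else float("nan"))
    return hazards
```

Fitting the model separately per treatment group yields a piecewise constant HR(t) as the ratio of the two hazard vectors.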

S.2.8.2 AFT and generalized additive models for location scale and shape
The AFT model assumes a distribution π with parameter vector θ^(Z) for the survival time T. The parameters in θ^(Z) depend (partially) on group membership Z and possibly on further covariates. In particular, log(T) = x^T β + Zγ + σε, where the distributional assumption about the error term ε determines π. Frequent choices of π are the Weibull, log-logistic, and log-normal distributions. AFT models usually result in NPH, with the Weibull being a prominent exception. AFT models can be extended by also modelling scale and shape parameters, σ in the above example, via link functions and covariates. In this generalized case, even the Weibull model yields NPH [92]. An extension to interval-censored data and gamma frailty can be found in [93]. Umbrella distributions like the generalized gamma [94] and the generalized F distribution [95], which include "standard" distributions as special cases, have also been modelled in the AFT context and its generalizations. Extensions which consider all parameters as functions of covariates were also discussed in [96]. An approach to dimension reduction was introduced by [97] with a focus on genomic data, utilizing the continuum power regression (CPR) framework. The CPR step is supposed to obtain a dimension reduction in the covariates and includes OLS, partial least squares, and principal components regression as special cases. Censoring is accounted for by adding the mean residual lifetime to censored observations in the CPR step. The K "remaining" components are then the covariate input of a generalized F or a semiparametric AFT model. Delayed treatment effects [98] and cure rate models [99] were also considered for the Weibull model. Papers with a focus on the AFT model or generalizations thereof are marked in column H.2 of Table S3.
A software implementation of the generalized gamma and generalized F distributions and other (user-defined) distributions is available via the R package flexsurv [76], where more than one parameter might depend on covariates. The R package brms provides Bayesian parametric survival models. The R package spBayesSurv offers spatial as well as non-spatial Bayesian (generalized) AFT models (and others). The R package mpr can be used to fit, for example, Weibull models where both parameters depend on covariates. The R package gamlss.cens is an add-on package to GAMLSS for the purpose of fitting censored versions of an existing GAMLSS family distribution. The SAS procedure lifereg also offers an implementation of AFT models.
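The PH-exception for the Weibull can be checked numerically: under an AFT effect (a rescaled time axis), the Weibull yields a constant hazard ratio, while the log-logistic yields a time-varying one. The helper names below are ours; the hazard formulas are the standard ones for the two distributions.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard (shape/scale) * (t/scale)^(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def loglogistic_hazard(t, shape, scale):
    """Log-logistic hazard, written via u = (t/scale)^shape."""
    u = (t / scale) ** shape
    return (shape / t) * u / (1.0 + u)

def hr(hazard, t, scale0, scale1, shape):
    """Hazard ratio treatment vs control when the treatment acts by
    rescaling the time axis (AFT effect on the scale parameter)."""
    return hazard(t, shape, scale1) / hazard(t, shape, scale0)
```

For the Weibull, hr(...) equals (scale0/scale1)^shape for every t (PH); for the log-logistic, the hazard ratio starts away from 1 and converges towards 1, a diminishing-effect NPH pattern.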

S.2.8.3 First hitting time
First hitting time (FHT) models approach the survival distribution via an underlying unobservable health process H(t). An observable component of the process is the event or, more generally, a transition into another state. Typically, the event is defined to happen when H(t) ≤ 0 for the first time. The Wiener process is commonly chosen for H(t), which leads to T being inverse Gaussian distributed. The parameters of the health process might be a function of covariates and possibly of random effects [100]. Race and Pennell [101] added random effects utilizing the Dirichlet process to model subject-specific initial state and drift of the Wiener process. Yu et al. [102] modeled the drift parameter via cubic B-splines. He et al. [103] considered a deterministic decay path where random shocks might cause the event "prematurely" in a hip fracture setting. FHT models are marked in column H.3, Table S3.
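For the Wiener-process case, the survival function of T is that of an inverse Gaussian distribution and can be evaluated in closed form. The sketch below uses the standard inverse Gaussian cdf in its mean/shape parameterization; linking mu and lam to the initial health level, drift, and diffusion of H(t), and then to covariates, is left out, and the function names are ours.

```python
import math

def norm_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def fht_survival(t, mu, lam):
    """Survival function S(t) = 1 - F(t) of the inverse Gaussian
    first-hitting-time distribution with mean mu and shape lam."""
    a = math.sqrt(lam / t)
    cdf = (norm_cdf(a * (t / mu - 1.0))
           + math.exp(2.0 * lam / mu) * norm_cdf(-a * (t / mu + 1.0)))
    return 1.0 - cdf
```

In a regression version, mu (and possibly lam) would be modelled as functions of covariates, which generally induces non-proportional hazards between groups.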

S.2.8.4 GLMs & other parametric approaches
If the event indicator is regarded as the target variable, a Poisson GLM with log-link and an adequate offset is equivalent to the piecewise exponential hazard likelihood. Hence, Poisson GLMs have been utilized to fit the piecewise exponential survival model, or as an approximation to a general survival likelihood by letting the time intervals for the distinct hazard parameters become small [57]. We mentioned this already in the discussion about time-varying coefficients. We did not classify those papers into the GLM category, as they essentially modeled the common survival likelihood via GLM routines; hence, we feel those papers are more adequately categorized into other classes, time-varying coefficients for example. Others, however, took a more genuine GLM approach. Among them is [104], who exploited the longitudinal nature of a multiple sclerosis data set to model transitions in the disability progression. Motivated by oncology studies with NPH, where events are observed through periodic screenings, [105] considered discrete time. Consequently, GLMs at the distinct time points are suggested, including only individuals at risk at the corresponding follow-up time t_(j), i.e. those with Y_i(t_(j)) = 1. GLMs might be incorporated into other methods, trees for example (see Section S.2.9). Other parametric approaches including GLMs can be found in column H.4, Table S3. GLM routines from standard statistical software can be used, for example, the SAS procedure genmod.
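The data preparation behind such discrete-time GLMs is a person-period expansion: one record per subject and per grid time at which the subject is still at risk. A hypothetical sketch (the grid convention and the column names are illustrative, not from [105]):

```python
def person_period(times, events, grid):
    """Expand (time, event) survival data into one record per subject
    and per grid time t_j at which the subject is still at risk
    (Y_i(t_j) = 1); `event` marks the interval in which the event occurs."""
    rows = []
    for i, (t, e) in enumerate(zip(times, events)):
        for j, tj in enumerate(grid):
            if t < tj:
                break  # no longer at risk at t_j
            nxt = grid[j + 1] if j + 1 < len(grid) else float("inf")
            rows.append({"id": i, "t": tj,
                         "event": int(bool(e) and tj <= t < nxt)})
    return rows
```

A logistic (or complementary log-log) GLM on `event` with time and treatment effects per record then yields a discrete-time hazard model; the binary event indicator per interval is exactly the conditional event probability the GLM targets.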

S.2.9 Machine learning approaches
Wey et al. [106] suggested an average survival model which is a weighted sum of parametric, semi-parametric, and non-parametric models. The weights in the suggested model are obtained through the minimization of a loss function. Distinct treatment group survival curves are obtained by "averaging out" the remaining covariates or confounders, respectively. Inference regarding the treatment effect is then drawn from the difference in RMST of the treatment and the control group. Lowsky et al. [107] discussed a K-nearest neighbor approach for estimating survival curves via weighted Kaplan-Meier. The K nearest neighbors of a new observation are determined by its distance to the covariate vectors in the data set. Then, a weighted KM curve is estimated. The weights for each observation are reciprocal to the distance to the new covariate vector and affect the number of deaths as well as the size of the risk set at each event time. This procedure could, for example, be stratified by the treatment indicator in order to quantify the treatment effect given the remaining covariates. The former two approaches can be found in column I.4, Table S3. Alternative approaches include trees and survival forests. Papers concerning trees and forests are marked in column I.1, Table S3. The tree-based method by [2] to find change points for time-varying coefficients has already been discussed above. A Bayesian additive regression tree (BART) for survival data with NPH was considered by [108]. The likelihood is specified via probit regression at the distinct event times. The probability of an event, conditional on no previous events, at a given event time is derived via the BART. The BART is an ensemble of trees where the splitting criteria at the internal nodes within each tree are set by the event time and covariates. For a given covariate-time input, the function values at the terminal nodes are summed over the trees.
This estimate is used as the input of the standard normal cdf to obtain the probability of an event. Soft BART (SBART) was considered for interval-censored data by [109]. The SBART extends the BART with a smooth regression function and a better ability to remove irrelevant predictors. Survival forests as well as an improved splitting criterion were discussed in [110]. A survival forest for a joint model was developed by [111]. Neural networks for censored survival data were discussed by [112] and [113]. Neural networks are marked in column I.2, Table S3. Kernel smoothing based approaches can be found in [114] regarding the hazard ratio and in [115] with respect to differences in survival rates. Kernel smoothing based approaches are marked in column I.3, Table S3. The R package trtf contains transformation trees and forests that can be utilized for time-to-event analysis. The SBART for discrete time-to-event analysis is implemented in the R package BART. The R package mboost includes a gradient boosting algorithm for right-censored data.
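The weighted Kaplan-Meier estimator underlying the K-nearest-neighbor approach can be sketched generically: each subject carries a weight (e.g. reciprocal distance to the new covariate vector), and the weights enter both the event counts and the risk sets. The implementation below is our own minimal illustration, not the code of [107].

```python
def weighted_km(times, events, weights):
    """Kaplan-Meier estimate in which each subject carries a weight:
    weights enter both the number of events and the size of the risk
    set at each event time. Returns a list of (time, survival) pairs."""
    data = sorted(zip(times, events, weights))
    surv, s = [], 1.0
    at_risk = sum(weights)  # weighted risk set size
    i = 0
    while i < len(data):
        t = data[i][0]
        d = r = 0.0
        while i < len(data) and data[i][0] == t:
            _, e, w = data[i]
            d += w * e   # weighted events at t
            r += w       # weight leaving the risk set after t
            i += 1
        if at_risk > 0 and d > 0:
            s *= 1.0 - d / at_risk
        surv.append((t, s))
        at_risk -= r
    return surv
```

With all weights equal, this reduces to the ordinary Kaplan-Meier estimator, since the common weight cancels in every factor d/at_risk.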

S.2.10 Other approaches
Some papers were difficult to categorize into one of the previous sections. Among the papers in this section is [3], which discussed inverse probability of censoring weighting (IPCW) approaches, the structural nested model (SNM), and the rank preserving structural failure time model (RPSFTM) in the context of treatment switching. Similarly to the SNM, the RPSFTM assumes that the treatment decreases or increases the survival time by a factor γ > 0. More precisely, T(1) = γ T(0), and γ > 1 indicates a beneficial treatment effect. With that factor, counterfactual survival times are computed, i.e. u_i = (time i spent in the control group) + (time i spent in the treatment group)/γ. The parameter γ itself is found via a grid search, where the optimization criterion is a test statistic that compares the estimated survival curves of the placebo and the treatment group, where for the latter the counterfactual survival times are utilized. The value of γ that makes the two groups most alike in the sense of the test statistic is the point estimate. The R package rpsftm provides a software solution for the RPSFTM. The IPCW approach, where the weights might be utilized to compute KM curves or weighted partial likelihood estimates, attempts to account for informative censoring. Further proposals are a semi-parametric proportional likelihood ratio model [116] and concordance regression, where, brought into the two-sample setting, the likelihood is based upon P(T(1) > T(0)) [117]. A Bayesian non-parametric dependent Dirichlet process for modeling the time-to-event distribution was studied by [118]. Chen and Wang [119] applied an accelerated hazard model, where the hazard equals λ^(Z)(t) = λ_0(t exp{Zγ + x^T β}), and the baseline hazard is a smoothed non-parametric estimate. The authors suggest that the accelerated hazard model might be a good choice if hazard rates are similar after the start of follow-up but diverge due to different ageing processes.
In column J, Table S3, we marked approaches that did not properly fit into other categories.
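The RPSFTM estimation logic can be sketched compactly. The snippet below is a hypothetical illustration: it uses a simple difference-in-means criterion in place of the rank or log-rank test statistic an actual analysis would use, and the function names are ours.

```python
def counterfactual_time(t_off, t_on, gamma):
    """Counterfactual (treatment-free) time u = t_off + t_on / gamma."""
    return t_off + t_on / gamma

def rpsftm_grid(control, t_off, t_on, grid):
    """Pick the gamma that makes the counterfactual times of the
    treatment arm most similar to the control arm; here similarity is
    measured by the absolute difference in means (a test statistic
    would be used in practice)."""
    def crit(g):
        u = [counterfactual_time(a, b, g) for a, b in zip(t_off, t_on)]
        return abs(sum(u) / len(u) - sum(control) / len(control))
    return min(grid, key=crit)
```

The grid value minimizing the criterion, i.e. the γ under which the two "untreated" samples look most alike, is taken as the point estimate.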

S.3 Hypothesis tests for equality of survival curves

S.3.1 Log-rank tests
The standard log-rank test statistic [120,121] is the most widely used statistical test to compare the overall survival of two groups. The weighted log-rank test statistic is defined as LR = Σ_{i=1}^{D} w(t_(i)) (d_{i1} − d_i n_{i1}/n_i) / sqrt( Σ_{i=1}^{D} w(t_(i))² d_i (n_{i1}/n_i)(1 − n_{i1}/n_i)(n_i − d_i)/(n_i − 1) ), and the standard log-rank test uses the weight function w(t_(i)) = 1. The number of distinct event times of the pooled sample is denoted by D, d_i is the number of events at t_(i), and d_{i1} refers to the number of events at t_(i) in the experimental treatment group; n_i and n_{i1} denote the corresponding numbers at risk just before t_(i) in the pooled sample and in the experimental group [39].
As the shape of the survival curve influences the power of the test, multiple proposals for weight functions w(t) sensitive to particular NPH patterns are available [5]. Royston and Parmar (2020) [5] give an overview of the weight functions used for different NPH patterns, e.g. the Fleming-Harrington weight functions w(t_(i)) = Ŝ(t_(i)−)^ρ (1 − Ŝ(t_(i)−))^η, denoted G^{ρ,η}, of which G^{1,0} and G^{0,1} are sensitive to early and late effects, respectively. Note that t− denotes the time just before time t. For testing that survival in the treatment group is stochastically less than or equal to survival in the control arm (H_0: S_1(t) ≤ S_0(t) ∀ t ≥ 0), Magirr and Burman (2019) [122] propose modestly weighted log-rank tests. This is a variation of the weighted log-rank test, which under arbitrary weights does not control the risk of concluding that a new treatment is more efficacious than standard care when it is uniformly inferior in terms of the survival function. Treatment may be uniformly inferior in terms of the survival function, but still there may be time points at which the treatment has a favorable hazard. The log-rank test works at the level of the hazard function, so if enough weight is put on the possibly small time interval with a favorable hazard, the treatment is declared significantly better than the control, even though this local benefit does not translate into any survival benefit. The modestly weighted test is constructed in such a way that this fallacy cannot happen. The weights are set to w(t_(i)) = 1/max{Ŝ(t_(i)−), Ŝ(t*)}, with Ŝ(t*) denoting the KM estimate at a certain time t* based on the pooled data from both treatment arms. The choice of t* is a trade-off: a bigger value of t* results in lower weights for early events, which is useful for delayed effects; on the other hand, too large a t* will lead to unnecessarily high weights w(t_(i)) for late events. Sample size formulas for different weighted log-rank tests are given by, e.g., Yung and Liu (2020) [123] and their R package npsurvSS.
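The weighted log-rank z-statistic can be computed from the pooled sample in a few lines. The following self-contained sketch (function names ours) uses the standard hypergeometric variance term and passes the pooled left-continuous KM estimate Ŝ(t−) to the weight function, so Fleming-Harrington or modest weights can be plugged in:

```python
import math

def weighted_logrank(times, events, group, weight=lambda s_minus: 1.0):
    """Weighted log-rank z-statistic comparing group 1 vs group 0.
    `weight` is a function of the pooled KM estimate just before t,
    e.g. Fleming-Harrington G(0,1): lambda s: 1.0 - s."""
    order = sorted(set(t for t, e in zip(times, events) if e))
    num = var = 0.0
    s_minus = 1.0  # pooled KM estimate S(t-)
    for t in order:
        n_i = sum(1 for tt in times if tt >= t)
        n_i1 = sum(1 for tt, g in zip(times, group) if tt >= t and g == 1)
        d_i = sum(1 for tt, e in zip(times, events) if e and tt == t)
        d_i1 = sum(1 for tt, e, g in zip(times, events, group)
                   if e and tt == t and g == 1)
        w = weight(s_minus)
        num += w * (d_i1 - d_i * n_i1 / n_i)  # observed minus expected
        if n_i > 1:
            var += (w ** 2 * d_i * (n_i1 / n_i) * (1 - n_i1 / n_i)
                    * (n_i - d_i) / (n_i - 1))
        s_minus *= 1.0 - d_i / n_i  # update pooled KM after t
    return num / math.sqrt(var) if var > 0 else 0.0
```

The modestly weighted test of [122] would correspond to weight=lambda s: 1.0 / max(s, s_star) for a pre-chosen pooled KM value s_star = Ŝ(t*).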
Wei and Wu (2020) [124], Wu and Wei (2022) [88], and Ye and Yu (2018) [125] provide R code for their sample size formulas derived from weighted log-rank tests for cancer immunotherapy trials with delayed treatment effects. Various authors also investigated the use of weighted log-rank tests in group-sequential trial designs. Group-sequential trial designs plan interim analyses at pre-specified time points. The interim analysis can lead to early efficacy or futility stopping because of either sufficiently convincing results or a further investigation not being justifiable. The use of weighted log-rank test statistics in group-sequential designs can lead to a misspecified covariance of the test statistic due to an incorrectly estimated information fraction. The information fraction is in general not proportional to the number of interim events for weighted log-rank tests; e.g., late events, on which the weight function usually places more weight in delayed treatment effect settings, might be unavailable during interim analyses. Interim analyses are argued to be only sensible if a fixed time horizon for the final (primary) analysis is specified and if sufficient information up to the time horizon is available for the interim analysis [126]. Brummel and Gillen (2014) [127] focus on monitoring the weighted log-rank test statistic in group-sequential designs where information growth is nonlinear and propose using a constrained boundaries approach to maintain the planned operating characteristics of a group-sequential design. Hasegawa (2016) [99] proposes a semiparametric information fraction for group-sequential designs with delayed treatment effects. Kundu and Sarkar (2021) [128] focus on the deviation of information fractions in weighted log-rank tests from that of the standard log-rank test and propose a decomposition of effects on information fractions to provide a reasonable and practically feasible range of information fractions to work with. Li et al. (2021) [129] propose a group-sequential design based on the piecewise log-rank test, Zhang and Pulkstenis (2016) [130] provide closed-form solutions for power and sample size calculation for group-sequential designs, and Magirr and Jiménez (2022) [131] give practical guidance for the use of modestly weighted log-rank tests in group-sequential trials.

S.3.2 Kaplan-Meier-based tests
The unweighted Kaplan-Meier test corresponds to a test of the difference between two RMSTs, the RMST describing the mean event-free survival time up to a pre-defined time point t* [4]. Since equality of the survival curves implies equal RMSTs, one can also test for a difference in RMST between the treatment groups. The null hypothesis can be tested using a Wald-type statistic M_RMST(t*), i.e. the estimated RMST difference divided by its estimated standard error. Tests based on the RMST do not rely on the PH assumption, but they are also not specifically designed to detect crossing survival curves [4]. In Horiguchi et al. (2018) [133], the time point t* is selected data-dependently: a set of potential times t* ∈ {t*_1, . . . , t*_K} with a fixed number K is pre-specified, and the null hypothesis of no difference between the two event time distributions is tested against the two-sided alternative ∆(t*_k) ≠ 0 for some k using the test statistic M_RMST2 = max_{k=1,...,K} |M_RMST(t*_k)|. The distribution under the null hypothesis is obtained using a wild bootstrap procedure. The approach of Horiguchi et al. (2018) [133] is available in the R package survRM2adapt. Sun et al. (2018) [135] compare different tests for cure rate survival data and show in a simulation study that Kaplan-Meier-based tests (the RMST test and the weighted Kaplan-Meier test) perform best among the considered tests, which include the log-rank test and the Wilcoxon rank test. Rauch et al. [15] compare two test statistics for the average hazard ratio (aHR), testing the null hypothesis H0: aHR ≥ 1, with the standard log-rank test in settings with different underlying event time and censoring distributions. The two test statistics for the average hazard ratio differ with respect to the independent increments property. The comparison showed an advantage of the average hazard ratio tests in terms of power in NPH settings.
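As a concrete illustration of the RMST, the following sketch estimates it as the area under the Kaplan-Meier step function up to the horizon t*; the function names are ours, and the variance estimation required for the Wald statistic is omitted for brevity.

```python
import numpy as np

def kaplan_meier(times, events):
    """Pooled Kaplan-Meier estimate: distinct event times and survival probabilities."""
    t_uniq = np.unique(times[events == 1])
    at_risk = np.array([(times >= t).sum() for t in t_uniq])
    d = np.array([((times == t) & (events == 1)).sum() for t in t_uniq])
    surv = np.cumprod(1.0 - d / at_risk)
    return t_uniq, surv

def rmst(times, events, tau):
    """RMST up to tau: area under the Kaplan-Meier step function."""
    t, s = kaplan_meier(times, events)
    keep = t < tau                                # steps strictly before the horizon
    grid = np.concatenate(([0.0], t[keep], [tau]))
    height = np.concatenate(([1.0], s[keep]))     # S(t) is left-continuous in steps
    return float(np.sum(np.diff(grid) * height))
```

The RMST test then divides the difference of two such group-wise estimates by its estimated standard error; with no censoring and tau beyond the last event, the estimate reduces to the sample mean of the event times truncated at tau.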
The window mean survival time proposed by Paukner and Chappell [136] keeps the interpretability of the RMST and of unweighted log-rank tests while improving the power to detect differences in survival curves under NPH caused by late crossing or diverging curves. The difference in window mean survival time between the two treatment groups is the area between the two survival curves from t = t*_1 to t = t*_2, with 0 ≤ t*_1 < t*_2 ≤ t. The null hypothesis is that the difference in window mean survival time is zero, and the test statistic is the ratio of the estimated difference in window mean survival time to its estimated standard error. The simulation study of Paukner and Chappell [136] showed that the test of the window mean survival time has higher power than the weighted log-rank test if the PH assumption holds. Sample size formulas based on the RMST test are provided, e.g., by Tang [137] and Royston and Parmar [138]. Yung and Liu [123] provide the R package npsurvSS for sample size and power calculations based on Kaplan-Meier-based tests. Brückner and Brannath (2017) developed group-sequential designs for the average hazard ratio and proved that the sequential tests based on the average hazard ratio are asymptotically multivariate normal with the independent increments property. An approach using the weighted Kaplan-Meier test for the calculation of stage-wise p-values in adaptive survival trials, which allows the use of discrete surrogate information at the interim analysis while controlling the type I error rate, was proposed by Brückner et al. [139]. Sample size re-estimation using Kaplan-Meier-based tests is investigated by Wang [140].
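In formulas (with notation adapted here rather than taken verbatim from [136]), the window mean survival time difference and the resulting test statistic are

\[
\widehat{\Delta}_{\mathrm{WMST}} \;=\; \int_{t_1^*}^{t_2^*} \bigl\{\hat{S}_1(t) - \hat{S}_0(t)\bigr\}\,dt,
\qquad
Z \;=\; \frac{\widehat{\Delta}_{\mathrm{WMST}}}{\widehat{\mathrm{se}}\bigl(\widehat{\Delta}_{\mathrm{WMST}}\bigr)},
\]

and setting \(t_1^* = 0\) recovers the RMST difference up to \(t_2^*\).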

S.3.3 Combination tests
Combination tests combine tests within a class or across classes of tests. The rationale for combining tests is the difficulty of predicting the existence and severity of NPH caused by, e.g., delayed treatment effects; combination tests allow covering various scenarios. The maximum combination (max combo) test is one example and is defined as the maximum of several weighted log-rank test statistics. Using Fleming-Harrington weighted log-rank statistics, denoted Z_{G(ρ,γ)}, the max combo test is defined as Z_max = max_{k=1,...,K} Z_{G(ρ_k,γ_k)}, where Z_{G(ρ_k,γ_k)} denotes one of K different weighted log-rank test statistics. However, the max combo test is not restricted to the Fleming-Harrington weight function in the log-rank tests. The p-value of the maximum combination test can be calculated based on the multivariate normal distribution. Ghosh et al. [141] developed group-sequential designs using two (modestly) weighted log-rank tests for the max combo test statistic. Ristl et al. [24] investigated different sources of non-proportionality, e.g. delayed treatment effect, disease progression, predictive biomarker subgroups, and treatment switch after progression, and their effect on the power of weighted log-rank tests and maximum combination tests; they provide the R package nph to perform the statistical tests and to simulate survival data. Sample size procedures for maximum combination tests are available, e.g., in Tang (2021). The use of combination tests, specifically the max combo test, in group-sequential trial designs was investigated, e.g., by Li et al. [142], Wang et al. [143], and Prior [144]. Li et al. [142] obtained the group-sequential boundaries and the empirical power by simulation procedures for delayed treatment effects, whereas the approach of Wang et al. [143], with the R package GSMC available on GitHub, is simulation-free.
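A minimal sketch of Fleming-Harrington weighted log-rank statistics and the resulting max combo statistic may make the definition concrete; all function names are ours, ties are handled with the standard hypergeometric variance, and the multivariate-normal p-value computation is omitted.

```python
import numpy as np

def fh_weighted_logrank(times, events, group, rho, gamma):
    """Fleming-Harrington G(rho, gamma) weighted log-rank Z statistic.

    times, events, group: 1-d arrays; events is 1 for an event, 0 for
    censoring; group is 1 for the experimental arm, 0 for control.
    """
    U, V, surv_prev = 0.0, 0.0, 1.0  # surv_prev: pooled KM just before t
    for t in np.unique(times[events == 1]):
        at_risk = times >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((times == t) & (events == 1)).sum()
        d1 = ((times == t) & (events == 1) & (group == 1)).sum()
        w = surv_prev**rho * (1.0 - surv_prev)**gamma   # FH weight at t-
        U += w * (d1 - n1 * d / n)                      # observed - expected
        if n > 1:
            V += w**2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv_prev *= 1.0 - d / n                        # update pooled KM
    return U / np.sqrt(V)

def max_combo(times, events, group,
              params=((0, 0), (0, 1), (1, 0), (1, 1))):
    """Maximum of |Z| over a set of FH(rho, gamma) weighted log-rank tests."""
    return max(abs(fh_weighted_logrank(times, events, group, r, g))
               for r, g in params)
```

With (ρ, γ) = (0, 0) the statistic reduces to the standard log-rank test; γ > 0 up-weights late differences, as is suitable for delayed treatment effects.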
Prior [144] investigated the use of different weighting functions in the maximum combination across the time points, allowing flexibility with respect to the accrued data. Combination tests are not restricted to weighted log-rank tests but can also involve other classes of tests: e.g., Royston and Parmar [145] combine the Cox test with a test of the RMST difference, obtaining the p-value of the combination test by selecting the smallest p-value of the single tests. The Cox test is based on the difference in log partial likelihoods of the Cox PH model with the binary treatment indicator as the only covariate and is essentially equivalent to the standard log-rank test [145]. León et al. [146] combine weighted log-rank tests with the RMST test. Chi and Tsai [147] propose the combination of weighted log-rank tests with weighted Kaplan-Meier tests. Zhang et al. [53] propose a Cauchy combination test of multiple single change-point (CauchyCP) regression models.

S.3.4 Other tests
Besides the Kaplan-Meier-based tests, the log-rank tests, and the combination approaches, we also identified articles proposing hypothesis testing methods that do not fit into any of these classes. For instance, Gorfine et al. [27] proposed a test for K groups based on sample-space partitions, which is implemented in the R package KONPsurv.
A modification of the Kolmogorov-Smirnov test was proposed by Fleming et al. [148], who compared their proposal with the standard log-rank test and the Wilcoxon rank test for censored observations [149] and showed higher power under NPH. Sooriyarachchi and Whitehead (1998) propose a binary method for testing whether the survival curves of two treatment groups are equal. This approach requires a discretization of time, with the time intervals chosen to include equal numbers of events. The effect measure for the treatment difference is the log odds ratio of the probabilities of surviving past time point t* in the two treatment groups. The test statistic is derived from the log-likelihood of the log odds ratio and the nuisance parameters, i.e. the survival probabilities. Permutation procedures can be used to obtain the distribution of a test statistic under the null hypothesis of equal survival functions; for combination approaches, permutation procedures were suggested, e.g., by Brendel et al. [36], who provide the R package GFDsurv for their proposed approach. For the application of newly proposed methods, providing the corresponding software is an advantage. In some settings, numerical aspects are important in the development of such software. For most existing test statistics, Riemann integration is used. However, under complex NPH patterns involving high-dimensional numerical integration, this approach might not be feasible. Therefore, Tang [137] proposes a sample size and power calculation method for log-rank tests and RMST tests via product integration and provides sample SAS code as online supplementary material.
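The permutation idea mentioned above can be sketched generically: re-randomize the group labels, recompute the test statistic, and report the proportion of permuted statistics at least as extreme as the observed one. The sketch below is ours and works for any two-sample statistic; it is an illustration of the general principle, not the specific procedure of [36].

```python
import numpy as np

def permutation_pvalue(stat, times, events, group, n_perm=2000, seed=1):
    """Two-sided permutation p-value for an arbitrary two-sample statistic.

    stat(times, events, group) -> float; the group labels are permuted,
    which is valid under the null hypothesis of equal survival functions.
    """
    rng = np.random.default_rng(seed)
    observed = abs(stat(times, events, group))
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(group)       # relabel patients at random
        if abs(stat(times, events, perm)) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)       # add-one correction
```

Any of the statistics discussed in this section (weighted log-rank, RMST difference, or a combination statistic) can be plugged in as `stat`; the add-one correction guarantees a valid, strictly positive p-value.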

S.4 Classification of methods proposed in selected articles
[Classification table of the selected articles by proposed method; not reproduced here.]

S.5 Classification of estimation methods

[Classification table of estimation methods; not reproduced here.]

S.6 Classification of articles including hypothesis tests