Why restricted mean survival time methods are especially useful for non-inferiority trials


Matteo Quartagno, Tim P Morris and Ian R White

Our attention was recently captured by the paper from Freidlin et al. [1], in which the authors investigate the conditions under which testing non-inferiority with time-to-event data leads to higher power when the margin is defined as a difference in restricted mean survival time (DRMST) than when it is defined as a hazard ratio (HR).
We agree with the authors that there is no magic in DRMST, and that it is important to clarify when, and why, one method is advantageous over the other. The authors have addressed the when, by providing a simulation study that indicates that DRMST is more powerful with low event rates, limited follow-up and large non-inferiority margins. This is a welcome addition to previous simulation studies that had either suggested a generalised power advantage of DRMST [2] or shown differences but without investigating them further [3]; however, the issue of clarifying why such differences arise remains.
There are a few reasons why using HR may be advantageous over DRMST. First, DRMST discards data on follow-up after the time horizon t, as one of the scenarios in Freidlin et al. [1] was designed to show. Second, it is compromised by loss to follow-up before t. More generally, when estimated non-parametrically, DRMST is less efficient than HR, which is generally estimated through semi- or fully parametric models under the proportional hazards assumption. It is therefore not surprising to see an advantage in terms of power for HR in certain settings. However, in several scenarios, and in particular with large non-inferiority margins and low event rates, conclusions are reversed so that DRMST has a power advantage, and the reasons for this phenomenon are not well understood.
We believe the concept of non-inferiority frontiers, which we introduced in a recent paper [4], helps to explain the why. The fundamental reason for the difference in power between DRMST and HR methods is that the null hypotheses are not the same, even if we make the non-inferiority margins match, as was done in Freidlin et al. [1] and Weir and Trinquart [3]. This is because the null hypotheses are actually curves in space or, as we called them, frontiers, rather than single points, which are simply used as assumptions for the purpose of designing a frequentist trial. Figure 1 gives graphical intuition for this point. Figure 1(a) shows the non-inferiority frontiers corresponding to DRMST- and HR-based tests in a simulation scenario similar to the first in Freidlin et al. [1] The dashed line represents the line of treatment equality, the hollow dot represents the expected control event rate, and the cross is the corresponding frontier point, that is, the non-inferiority margin if the expected point were correct. The turquoise (HR) frontier passes closer to the expected point than the navy (DRMST) frontier, and hence requires a larger sample size to conclude non-inferiority. A similar phenomenon happens with binary outcomes where, for low event rates, the frontier corresponding to a risk ratio margin passes closer to the expected point than one based on a risk difference margin, and hence implies larger sample sizes.
A larger event rate or a smaller margin changes the graph, as shown in Figure 1(b) and (c), respectively, so that the different frontiers are much more similar near the expected point and the other differences we listed above give HR the edge over DRMST. Estimating DRMST by fitting a Cox model could eliminate the remaining differences in favour of HR in these settings, making DRMST always at least as powerful as HR. Nevertheless, this should not be taken to mean that all non-inferiority trials should be designed using DRMST.
Since different population-level summary measures imply different null hypotheses, we believe the choice should be driven initially by clinical considerations and later tempered by statistical considerations. This is well recognised in certain areas: for example, for vaccine non-inferiority trials, even though, for low infection risk, defining the non-inferiority margin as a risk difference would give much greater power, the margin is usually defined as a ratio, because a relative population summary is more meaningful (and transportable) in situations where baseline risk varies. The rest of this letter details the methods we used to produce the figure.

MRC Clinical Trials Unit, Institute for Clinical Trials and Methodology, University College London, London, UK

Details
Let $\lambda_1$ and $\lambda_0$ be the unknown event rates in the experimental and control arms, assumed constant. Let $\lambda_e$ be the expected event rate in the sample size calculation, assumed the same in both arms. Let $\lambda_f$ be the event rate in the experimental arm used to specify the non-inferiority margin: that is, if $\lambda_0 = \lambda_e$, then non-inferiority means $\lambda_1 < \lambda_f$. Thus, on the HR scale, the non-inferiority margin is $\lambda_f / \lambda_e$. Allowing the control arm event rate to be unknown, this margin implies non-inferiority if $\lambda_1 / \lambda_0 < \lambda_f / \lambda_e$. The boundary of this inequality, $\lambda_1 / \lambda_0 = \lambda_f / \lambda_e$, relates the unknown $\lambda_1$ and $\lambda_0$ and is the non-inferiority frontier on the constant-HR scale.
For constant event rate $\lambda$ and fixed horizon $t$, the restricted mean survival time (RMST) to time $t$ is $r(\lambda) = (1 - e^{-\lambda t}) / \lambda$. The non-inferiority frontier on this scale is $r(\lambda_1) - r(\lambda_0) = r(\lambda_f) - r(\lambda_e)$, where $r(\lambda_f) - r(\lambda_e)$ is the non-inferiority margin on the DRMST scale.
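For readers who want the intermediate step, the RMST expression can be recovered by integrating the survival function up to the horizon. The sketch below assumes exponential survival, $S(u) = e^{-\lambda u}$, which is what a constant event rate implies; the symbol $S$ is introduced here only for illustration.

```latex
\[
r(\lambda) \;=\; \int_0^{t} S(u)\,du
          \;=\; \int_0^{t} e^{-\lambda u}\,du
          \;=\; \left[-\frac{e^{-\lambda u}}{\lambda}\right]_0^{t}
          \;=\; \frac{1 - e^{-\lambda t}}{\lambda},
          \qquad \lambda > 0 .
\]
% Limiting case: r(\lambda) \to t as \lambda \to 0 (no events before the horizon).
% r is strictly decreasing in \lambda, so the frontier equation
% r(\lambda_1) - r(\lambda_0) = r(\lambda_f) - r(\lambda_e)
% has at most one solution \lambda_1 for each \lambda_0.
```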
We draw the non-inferiority frontier for three settings. As a base case to illustrate the settings where DRMST shows benefit, we specify moderate event fractions and a large margin by $\lambda_e = 0.5$, $\lambda_f = 0.75$ and $t = 1$. This implies the non-inferiority margin is 1.5 on the HR scale and $-0.083$ on the DRMST scale. We plot the non-inferiority frontiers for $\lambda_0$ ranging from 0 to 1, with values of $\lambda_1$ found iteratively in each case. To illustrate the setting with large event fractions, we change $t$ to 3. To illustrate the setting with a small margin, we change $\lambda_f$ to 0.55. To graphically explain the greater distance of the DRMST frontier, we additionally show a possible confidence region for the control and experimental rates from a hypothetical trial where the observed control rate matches exactly the expected one and the results are on the border of significance on the HR scale, that is, where the confidence region just touches the HR frontier; the region does not reach the DRMST frontier; hence, non-inferiority could be concluded on the DRMST scale, but not using HR.
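The base-case margins and frontier points described above can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not the code used for the figure: the function names (`rmst`, `hr_frontier`, `drmst_frontier`) are hypothetical, and bisection is one possible choice for the iterative step, which the text does not specify.

```python
import math


def rmst(lam, t):
    """RMST to horizon t under a constant event rate lam (exponential survival)."""
    if lam == 0.0:
        return t  # limit of (1 - exp(-lam * t)) / lam as lam -> 0
    return -math.expm1(-lam * t) / lam


def hr_frontier(lam0, lam_e, lam_f):
    """Experimental rate on the constant-HR frontier: lam1 / lam0 = lam_f / lam_e."""
    return lam0 * lam_f / lam_e


def drmst_frontier(lam0, lam_e, lam_f, t, tol=1e-12):
    """Experimental rate on the DRMST frontier, found by bisection (an assumed method).

    Solves rmst(lam1) - rmst(lam0) = rmst(lam_f) - rmst(lam_e) for lam1.
    """
    target = rmst(lam0, t) + rmst(lam_f, t) - rmst(lam_e, t)
    if not 0.0 < target <= t:
        raise ValueError("no finite non-negative rate lies on the frontier")
    # rmst is strictly decreasing in lam, so bracket the root and bisect.
    lo, hi = 0.0, 1.0
    while rmst(hi, t) > target:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rmst(mid, t) > target:
            lo = mid  # RMST still too large: the rate must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Base case from the text: lam_e = 0.5, lam_f = 0.75, t = 1.
lam_e, lam_f, t = 0.5, 0.75, 1.0
hr_margin = lam_f / lam_e                       # 1.5 on the HR scale
drmst_margin = rmst(lam_f, t) - rmst(lam_e, t)  # about -0.083 on the DRMST scale

# Frontier points for control rates on a grid from 0 to 1.
grid = [i / 10 for i in range(11)]
frontiers = [
    (lam0, hr_frontier(lam0, lam_e, lam_f), drmst_frontier(lam0, lam_e, lam_f, t))
    for lam0 in grid
]

if __name__ == "__main__":
    print(f"HR margin: {hr_margin:.3f}, DRMST margin: {drmst_margin:.3f}")
    for lam0, lam1_hr, lam1_drmst in frontiers:
        print(f"lam0={lam0:.1f}  HR frontier={lam1_hr:.3f}  DRMST frontier={lam1_drmst:.3f}")
```

Both frontiers pass through the point ($\lambda_e$, $\lambda_f$) by construction; they separate as $\lambda_0$ moves away from $\lambda_e$. Changing `t` to 3 or `lam_f` to 0.55 gives the other two settings described in the text.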

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the Medical Research Council (MC_UU_12023/29).

Figure 1 (caption, partial). The circle represents a possible confidence region for the joint distribution of control and experimental event rates for a trial for which the estimated control rate matches the expected one and the estimated active rate is as in the alternative hypothesis.