Frequentist rules for regulatory approval of subgroups in phase III trials: A fresh look at an old problem

Background: The number of Phase III trials that include a biomarker in design and analysis has increased due to interest in personalised medicine. For genetic mutations and other predictive biomarkers, the trial sample comprises two subgroups, one of which, say B+, is known or suspected to achieve a larger treatment effect than the other, B−. Despite treatment effect heterogeneity, trials often draw patients from both subgroups, since the lower responding B− subgroup may also gain benefit from the intervention. In this case, regulators/commissioners must decide what constitutes sufficient evidence to approve the drug in the B− population.

Methods and Results: Assuming trial analysis can be completed using generalised linear models, we define and evaluate three frequentist decision rules for approval. For rule one, the significance of the average treatment effect in B− should exceed a pre-defined minimum value, say $Z_{B-} > L$. For rule two, the data from the low-responding group B− should increase statistical significance. For rule three, the subgroup-treatment interaction should be non-significant, using a type I error chosen to ensure that the estimated difference between the two subgroup effects is acceptable. Rules are evaluated based on conditional power, given that there is an overall significant treatment effect. We show how different rules perform according to the distribution of patients across the two subgroups and when analyses include additional (stratification) covariates, thereby conferring correlation between subgroup effects.

Conclusions: When additional conditions are required for approval of a new treatment in a lower response subgroup, easily applied rules based on minimum effect sizes and relaxed interaction tests are available. Choice of rule is influenced by the proportion of patients sampled from the two subgroups but less so by the correlation between subgroup effects.

contemporary trials are often related to one or more genetic mutations, gene expression or a function of several genetic markers and may be dichotomous, ordinal or continuous. Despite the loss of information, for practical reasons they are often dichotomised. In this paper, we focus on biomarkers that are dichotomous (naturally or by design), and prior to commencing a confirmatory Phase III trial, are expected to be predictive.
In our context, we define sub-populations of patients as either biomarker positive (B+) or negative (B−). A common situation is that the treatment efficacy is a priori assumed to be better (or at least as good) in B+ compared to B− patients. For example:
1. a drug treatment may have been developed to target a genetic disorder defining group B+,
2. there may be an untested clinical hypothesis of better efficacy in B+,
3. empirical data (biological or early clinical) may indicate higher efficacy in B+.
Even if a treatment has been developed to target the biomarker of interest, some of the efficacy in B+ may obtain in B−. Depending on the situation, one may expect no or minimal efficacy in B−, that a large proportion of the efficacy in B+ is retained, or that efficacy in B− is difficult to predict even if the treatment is known to be efficacious in B+. The effect in B− may arise, for example, if the biomarker is intrinsically continuous, so that treatment efficacy varies continuously across its levels. Dichotomisation may then lead to positive, but smaller efficacy in B−. Alternatively, some patients in B− may have an unknown genetic defect acting on the same pharmacological pathway as the B+ genetic mutation, which is the target of the intervention. Moreover, few biomarker tests have 100% sensitivity and specificity in practice, leading to diffusion of efficacy from patients incorrectly classified as B+ or B−. As a result, biomarker level will interact with the treatment, with a higher treatment effect in B+ patients than B− patients.
Although a new treatment may be most effective in the higher responding B+ subgroup, trials often draw patients from both subgroups, since the lower responding B− subgroup may also gain benefit from the intervention. In this case, a problem emerges when deciding what constitutes sufficient evidence to approve the drug in the B− population. European guidelines state that "Confirmatory trials should reflect the target population to be treated" so that trials will sample from a target population. However, if the target population is heterogeneous, ensuring that the treatment effect is sufficiently large in a lower responding subgroup may still be warranted. Moreover, in order to improve efficiency of a trial, the higher responding subgroup may be oversampled (enriched sample), resulting in under-representation of lower responders relative to the target population. In such a case, regulators and sponsors are concerned that automatic approval for lower responders based on obtaining an overall significant effect could result in harm, for example if side effects outweigh potential benefit in this subgroup. If the low response B− subgroup makes up a very large proportion of the population, there may also be substantial cost to healthcare providers, but not the predicted benefits. Therefore, regulators and sponsors may wish to impose additional conditions for approval of the treatment in this subgroup. Current regulatory guidelines acknowledge the importance of heterogeneity in decision making and encourage subgroup analysis in confirmatory trials. 3 However, the guidance does not describe specific rules for approval of subgroups when heterogeneity exists.
There is a large literature on subgroup analysis in phase III trials. Much of this literature concerns post hoc exploration of a moderate to large number of subgroups using interaction tests, with issues such as data dredging and multiplicity well documented. 4 This study differs in that we are concerned with the situation where there is an overall significant effect, two pre-defined subgroups known to differ in treatment effect, and optimal rules for treatment approval in the lower responding subgroup are required.
Where hypothesis tests are applied to multiple subgroups, it is important to control the family-wise error rate (FWER). 5 For two subpopulations B+ and B−, there are different multiple testing procedures that control the FWER. 6 In most applications, formal testing focuses on the full population (F) and B+ using either a hierarchical approach (F followed by B+, or B+ followed by F) or by splitting type I error between parallel tests; testing of B− is rarely included in applications for regulatory approval, as power is considered to be limited. 7,8 In this study, we concentrate on the situation where the intervention has statistically significant efficacy in F, with significance in the B+ group assumed to follow due to higher efficacy in this subpopulation; conditions for approval in B− are then developed and assessed. A strategy that conditions on significance in B+ rather than F is closely related mathematically and results are expected to be very similar.
Although the study is motivated by trials of drugs targeting specific genetic mutations (see Gonzalez-Martin et al. 9 for a recent example), other examples of trials with similar structures define subgroups according to age (adults and children), 10 mild and severe disease, 11 early and late stage cancers, 12 as well as other biomarkers. 13 Proposed methods should apply to any trial including two subgroups with known or suspected treatment effect inequality.
We provide a brief literature review of methods and practice in this context in section 2 before describing proposed rules for exponential family models in section 3. Conditional power of the rules is explored in section 4 and applied retrospectively to two published phase III trials in section 5, before briefly discussing implications for future trial design. A discussion completes the paper (section 6).

Existing literature
A review of FDA drug approvals with required biomarker testing found that biomarker negative patients were simply excluded from the majority of trials. 14 Since exclusion was often not based on clinical evidence, these patients could be denied potential benefit from novel treatments. Moreover, provided that there is a sound biological basis for some benefit in biomarker negative patients, including them may also confirm the clinical utility of the biomarker itself.
Heterogeneity within a target population was also recognised in updated EMA guidance on the investigation of subgroups in confirmatory clinical trials, published in January 2019. 3 Whilst the guidance suggests that restriction of a trial population to a sub-population is justified if there are safety concerns or an anticipated lack of efficacy, it also calls for additional trials including the full breadth of the population to provide the best evidence of effect modifiers. Inclusion of the biomarker negative subgroup was highlighted as important, though it can create difficulties in analysis if the treatment effect is small or there are only a small number of such patients, resulting in low power to detect a significant treatment effect. Despite this, patients in this subgroup may still benefit from treatment, and are therefore harmed if the result is discarded for non-significance.
A 2016 review by Ondra et al. 15 found that two concepts underpin current methods for assessing subgroup effects, influence and interaction. An 'influence' condition sets a threshold that must be met by the treatment estimate of the subgroup of interest, whilst an interaction test sets a difference between treatment effects for two (or more) subgroups, in effect requiring that effects are sufficiently close. These methods may be used for approval of a treatment or as conditions which must be met for the subpopulation to be included in the next stage of analysis (adaptive designs). For example, Stallard et al. 16 compared different strategies for choosing which hypotheses to test in the second stage of analysis (either the full population or a subgroup, or both), which used either an influence or interaction test approach. Similarly, Matsui and Crowley 17 proposed a sequential design where in the first stage of their analysis they use superiority and futility boundaries to decide which populations go forward for further analysis. This preserves statistical power for detecting various profiles of treatment effects across the subgroups, and allows the biomarker negative population to be tested again if they do not cross the futility boundary.
Despite the development of different adaptive designs, interaction tests appear to be the main method used to assess subgroup heterogeneity. In our (unpublished) targeted systematic review of large clinical trials that carried out subgroup analyses in the New England Journal of Medicine, we found that approximately two thirds used interaction tests to decide whether there was significant treatment effect heterogeneity. Almost all other articles summarised within-subgroup effects and used significance tests with 5% type I error.
Although most of the literature rests on the frequentist paradigm, a Bayesian approach could also be considered. 18 By specifying a two-dimensional prior for efficacy in B+ and B−, one can explicitly borrow information from one subpopulation when evaluating the other. This prior should reflect the clinical plausibility of a range of differential treatment effects between the two subpopulations.
This study was partly motivated by the design of the APEX trial, which compared betrixaban with standard dose enoxaparin in medically ill patients at risk of venous thrombosis. The investigators carried out sequential analyses, the first on a subgroup defined by the biomarker D-dimer, the second on a subgroup defined by a combination of D-dimer level and age, and the third on the full population. If any result was negative, then subsequent tests were treated as exploratory. The first subgroup analysis was just above the pre-defined threshold for statistical significance of 5% (p = 0.054), so that the subsequent subgroup analysis (elevated D-dimer level or age ≥ 75) and full population analysis had to be treated as exploratory, although hypothesis test statistics were 'significant' at p = 0.03 and p = 0.006, respectively. Clearly, such an analysis may have substantial implications for approval of the experimental treatment. A more traditional analysis plan would be to consider the full population first followed by the subgroups; however, conditions for approval in subgroups are less well established.
In our context, we may accept some heterogeneity between subgroups, provided that there is sufficient benefit in the B− subgroup. The issue is in choosing an acceptable difference between the treatment effect in the two subgroups, or equivalently, choosing a relaxed (higher) significance level for the interaction test. On the other hand, decision rules that focus on influence rely solely on the data in the B− subgroup, but require us to pre-specify a minimum bound for the acceptance threshold. In order to avoid the need to specify either a minimum treatment effect or a more relaxed interaction level, a decision rule that does not require additional parameters may also be attractive.
The question of how to deal with approval in a limited sub-population has important implications for maximising the patients who could benefit, which is particularly important for conditions where there are few treatment options. There remains uncertainty about how to address this issue, and how different decision rules perform according to issues such as prevalence of the high responder subgroup in the population and in the trial. We outline a simple strategy to choose appropriate and efficient decision rules in a frequentist framework.

Generalised linear model and subgroup notation
In practice, phase III clinical trials that have a biomarker-treatment interaction are analysed using linear, generalised linear or survival regression models. We restrict attention in this paper to the wide range of trial outcomes that have Normal, Binomial or Poisson distributions and review the general framework here, defining estimands of interest, estimators and statistics.
Generalised linear models that describe different treatment effects in the two subgroups have a linear predictor of the form

$$\eta_i = \beta_0 + \beta_1 T_i + \beta_2 S_i + \beta_3 T_i S_i + \beta_4^{\top} X_i \qquad (1)$$

where, for patient $i$, $i = 1, \ldots, N$, $T_i = 0, 1$ for control and experimental treatments, $S_i = 0, 1$ for subgroups B− (biomarker negative) and B+ (biomarker positive), and $X_i$ is a vector of baseline covariates, usually minimisation or stratification factors, included to increase precision of the treatment effect estimate or to adjust for chance imbalance. We assume that there are $N$ patients in the trial overall, $n = N/2$ in each trial arm, and that $pn$ patients in each treatment arm are drawn from sub-population B+, while the remaining $(1 - p)n$ are drawn from sub-population B−.
For the exponential family of distributions, the expected response and the linear predictor are connected through the link function $g(\gamma_i) = \eta_i$. For trial outcomes that have Normal, Binomial or Poisson distributions, canonical link functions are the identity, logit and log functions, respectively.
We define the estimands of interest in the two sub-populations as the treatment effects $\mu_{B+} = \beta_1 + \beta_3$ for the biomarker positive subgroup and $\mu_{B-} = \beta_1$ for the biomarker negative subgroup. Without loss of generality, positive values of $\mu_{B+}$ and $\mu_{B-}$ indicate that the treatment is beneficial. For Normal response variables, these are mean treatment effects in the two subgroups, for Binomial responses they are log odds-ratios and for Poisson responses they are log rate-ratios. Estimators of these estimands can be obtained using maximum likelihood as $\hat\mu_{B+} = \hat\beta_1 + \hat\beta_3$ and $\hat\mu_{B-} = \hat\beta_1$. We can also estimate the approximate maximum likelihood (co-)variance components from the information matrix, so that for generalised linear models

$$\mathrm{Var}(\hat\mu_{B+}) = \mathrm{Var}(\hat\beta_1) + \mathrm{Var}(\hat\beta_3) + 2\,\mathrm{Cov}(\hat\beta_1, \hat\beta_3), \quad \mathrm{Var}(\hat\mu_{B-}) = \mathrm{Var}(\hat\beta_1), \quad \mathrm{Cov}(\hat\mu_{B+}, \hat\mu_{B-}) = \mathrm{Var}(\hat\beta_1) + \mathrm{Cov}(\hat\beta_1, \hat\beta_3).$$

For inference for each group separately, we define Z-statistics $Z_{B+} = \hat\mu_{B+}/\hat\sigma_{B+}$ and $Z_{B-} = \hat\mu_{B-}/\hat\sigma_{B-}$, where $\hat\sigma_{B+}$ and $\hat\sigma_{B-}$ are estimates of the standard errors of the estimands taken from the information matrix.
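As a minimal sketch of the definitions above, the subgroup estimates, standard errors, Z-statistics and their correlation can be computed directly from the fitted treatment and interaction coefficients. The function name and all numerical inputs are hypothetical and chosen only for illustration; in practice the coefficients and (co-)variances would come from whatever software fitted the model.

```python
from math import sqrt

def subgroup_effects(beta1, beta3, var1, var3, cov13):
    """Subgroup estimates, SEs, Z-statistics and correlation (illustrative helper).

    beta1, beta3: estimated treatment and treatment-by-subgroup coefficients.
    var1, var3, cov13: their estimated variances and covariance.
    """
    mu_neg = beta1                           # effect in B-
    mu_pos = beta1 + beta3                   # effect in B+
    se_neg = sqrt(var1)
    se_pos = sqrt(var1 + var3 + 2.0 * cov13)
    cov = var1 + cov13                       # Cov(mu_pos_hat, mu_neg_hat)
    rho = cov / (se_pos * se_neg)
    return {
        "z_pos": mu_pos / se_pos,
        "z_neg": mu_neg / se_neg,
        "se_pos": se_pos,
        "se_neg": se_neg,
        "rho": rho,
    }

# Hypothetical unadjusted analysis, where Cov(beta1, beta3) = -Var(beta1)
res = subgroup_effects(beta1=0.2, beta3=0.3, var1=0.04, var3=0.08, cov13=-0.04)
print(res)
```

With these made-up inputs the covariance between the two subgroup estimates is zero, which is the unadjusted case discussed in the next subsection.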

Likelihood assuming no correlation between $\mu_{B+}$ and $\mu_{B-}$
In order to gain insight into the contribution of $p$, the proportion of trial patients drawn from the high-responding subgroup B+, it is useful to consider the case where the trial analysis is not adjusted for baseline factors, so that there is zero correlation between $\hat\mu_{B+}$ and $\hat\mu_{B-}$. In this case we can write, approximately and independently, $\hat\mu_{B+} \sim N(\mu_{B+}, \sigma^2_{B+})$ and $\hat\mu_{B-} \sim N(\mu_{B-}, \sigma^2_{B-})$. Specifically, for the response $Y_i$, $i = 1, \ldots, N$, with canonical link and assuming 1:1 randomisation between treatment arms, the variance components can be approximated by: where $\theta_{jk}$ is the probability of an event in treatment arm $j$ and subgroup $k$.
We note that in all three cases the variance includes the term $1/p$ or $1/(1 - p)$, which will facilitate investigation of the influence of the distribution of the sub-populations in the trial.
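As a hedged illustration of where the $1/p$ and $1/(1-p)$ terms come from, the standard large-sample approximations for the B+ subgroup can be sketched as follows. These are textbook delta-method results written in this section's notation and are not necessarily the exact expressions used in the original display; $\sigma^2$ (a common outcome variance) and $\lambda_{j+}$ (the expected count per patient in arm $j$) are symbols assumed here for illustration. The B− expressions follow by replacing $pn$ with $(1-p)n$ and the subscript $+$ with $-$.

```latex
\begin{align*}
\text{Normal (identity link):}\quad
  & \sigma^2_{B+} \approx \frac{2\sigma^2}{pn} \\
\text{Binomial (logit link):}\quad
  & \sigma^2_{B+} \approx \frac{1}{pn}\sum_{j=0}^{1}\frac{1}{\theta_{j+}\,(1-\theta_{j+})} \\
\text{Poisson (log link):}\quad
  & \sigma^2_{B+} \approx \frac{1}{pn}\sum_{j=0}^{1}\frac{1}{\lambda_{j+}}
\end{align*}
```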

Making inferences in the full population in the general case.
We write the full population treatment effect as

$$\mu_F = p\,\mu_{B+} + (1 - p)\,\mu_{B-}.$$

That is, the estimand for the full population is a weighted average of the subgroup specific estimands $\mu_{B+}$ and $\mu_{B-}$, with weights given by the proportion of the trial sample drawn from each sub-population, $p$ and $1 - p$. Note that, for $\mu_F$ to be directly interpretable, these proportions should hold in the target population, otherwise some translation is required.
Since $Z_{B+} = \hat\mu_{B+}/\hat\sigma_{B+}$ and $Z_{B-} = \hat\mu_{B-}/\hat\sigma_{B-}$, we have

$$\hat\mu_F = p\,\hat\sigma_{B+} Z_{B+} + (1 - p)\,\hat\sigma_{B-} Z_{B-}.$$

If $\rho$ is the correlation between the estimands, the Z-statistic for the full population is

$$Z_F = \frac{p\,\hat\sigma_{B+} Z_{B+} + (1 - p)\,\hat\sigma_{B-} Z_{B-}}{\sqrt{p^2\hat\sigma^2_{B+} + (1 - p)^2\hat\sigma^2_{B-} + 2p(1 - p)\rho\,\hat\sigma_{B+}\hat\sigma_{B-}}}.$$

If there are no additional covariates in the model ($\beta_4 = 0$), then $\mathrm{Cov}(\hat\beta_1, \hat\beta_3) = -\mathrm{Var}(\hat\beta_1)$, so that $\mathrm{Cov}(\hat\beta_1, \hat\beta_1 + \hat\beta_3) = 0$ and the correlation $\rho$ is zero.
Alternatively, if $\beta_4 \neq 0$ then $\mathrm{Cov}(\hat\beta_1, \hat\beta_1 + \hat\beta_3) \neq 0$, and the correlation $\rho$ describes the association between estimated effects due to adjustment for covariates. In general, the correlation induced by covariance adjustment is expected to be small.
As an aside, we note that the correlation between the statistics $Z_{B+}$ and $Z_{B-}$ is also equal to $\rho$.
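The weighted-average construction of the full-population statistic can be sketched in a few lines of Python. The function name and the input values are illustrative assumptions; the formula is the variance of a weighted sum of two correlated estimates, as described above.

```python
from math import sqrt

def full_population_z(z_pos, z_neg, se_pos, se_neg, p, rho=0.0):
    """Combine subgroup Z-statistics into the full-population statistic Z_F.

    p is the proportion of the trial sample drawn from B+ and rho is the
    correlation between the two subgroup estimates.
    """
    # Weighted-average estimate: p * mu_hat_pos + (1 - p) * mu_hat_neg
    mu_f = p * se_pos * z_pos + (1.0 - p) * se_neg * z_neg
    # Variance of the weighted average, allowing for correlation rho
    var_f = ((p * se_pos) ** 2 + ((1.0 - p) * se_neg) ** 2
             + 2.0 * p * (1.0 - p) * rho * se_pos * se_neg)
    return mu_f / sqrt(var_f)

# Hypothetical values: Z_B+ = 2, Z_B- = 1, equal standard errors, p = 0.5
print(round(full_population_z(2.0, 1.0, 1.0, 1.0, 0.5), 4))  # 2.1213
```

In this equal-weight, uncorrelated case the result reduces to $(Z_{B+} + Z_{B-})/\sqrt{2}$.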

Proposed rules for approval of the drug in B−
3.3.1 Sequential testing and conditional power.
We define and evaluate three proposed rules for approval in the lower response population B− conditional on significance in the full population. Recall that we expect the treatment to be as effective or less effective in the B− population, but nevertheless it may be sufficiently effective to warrant approval. In this situation, evaluation of B− is only worthwhile if a significant effect has been established in the full population. Thus, we adopt a sequential testing strategy, first evaluating treatment in the full population and, conditional on a significant result, evaluating the treatment in B−. The conditional power of a decision rule is a natural method for assessing its value in this context; for decision rule $R_n$, conditional power is defined as

$$P(R_n \mid Z_F > 1.96) = \frac{P(R_n,\; Z_F > 1.96)}{P(Z_F > 1.96)}. \qquad (4)$$

(As an aside, for binary data where analysis adjustment for baseline covariates is not required, this conditional probability can be calculated in closed form.) The denominator does not depend on the form of any proposed rule and is given by the observed statistic in the full population

$$P(Z_F > 1.96) = 1 - \Phi\!\left(1.96 - \frac{\mu_F}{\sigma_F}\right), \qquad (5)$$

where $\sigma_F$ is the standard error of $\hat\mu_F$ and $\Phi$ is the standard normal distribution function. Note that the right-hand side is a function of three location parameters $\mu_{B+}$, $\mu_{B-}$ and $p$, and three variance parameters $\sigma_{B+}$, $\sigma_{B-}$ and $\rho$ (see equations (2) and (3)), and conditional on these the standard normal deviate can be obtained from any statistical software.
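The denominator of the conditional power, that is, the unconditional probability of a significant result in the full population, can be sketched with the Python standard library. This is an illustration under the assumption that $Z_F$ is normally distributed with unit variance and mean $\mu_F/\sigma_F$; the function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def overall_power(mu_pos, mu_neg, se_pos, se_neg, p, rho=0.0, crit=1.96):
    """P(Z_F > crit) for assumed subgroup effects and standard errors."""
    mu_f = p * mu_pos + (1.0 - p) * mu_neg
    se_f = sqrt((p * se_pos) ** 2 + ((1.0 - p) * se_neg) ** 2
                + 2.0 * p * (1.0 - p) * rho * se_pos * se_neg)
    # Z_F is treated as Normal(mu_f / se_f, 1)
    return 1.0 - NormalDist().cdf(crit - mu_f / se_f)

# Hypothetical scenario: mu_B+/sigma_B+ = 2, mu_B-/sigma_B- = 1, p = 0.5, rho = 0
print(round(overall_power(2.0, 1.0, 1.0, 1.0, 0.5), 3))
```

The same six inputs (two effects, two standard errors, $p$ and $\rho$) drive the numerators of all three rules, which is why they are kept as explicit arguments.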
We now consider conditional power of three classes of decision rule for approval of the drug in B−, summarised in Table 1.
Rules 1 and 2 are different types of influence rule, whilst Rule 3 is an interaction test. The algebra for calculating the conditional power of each rule is given in full in Appendix 1.
3.4 Rule 1: the statistic $Z_{B-}$ exceeds a pre-defined threshold $L$
For the treatment to be acceptable in the B− subgroup, the significance of the average treatment effect should exceed a pre-defined minimum value, say $Z_{B-} > L$. Given prior estimates of $\mu_{B-}$ and its standard deviation, we could calculate the sample size to ensure this threshold is achieved with a given probability. As an absolute minimum, the treatment effect and associated statistic should be positive ($L > 0$, assuming without loss of generality that positive effects signify treatment benefit), although such a mild condition is unlikely to be acceptable unless the drug has negligible adverse effects and little cost. Alternatively, setting $L > 1.96$ is a strict condition requiring a significant treatment effect in the B− subgroup at traditionally accepted levels. This is tantamount to repeating the trial in the B− sub-population and will not be feasible if either this sub-population is small or difficult to recruit from, or the treatment effect is modest. Despite this, for serious diseases with no alternative effective treatments, a smaller expected treatment effect may be sufficient to outweigh any concerns regarding safety and cost; in this case a value $L \in (0, 1.96)$ should be pre-defined. In general, we may accept an intermediate value for $L$.
For rule 1, conditional power is given by the expression

$$P(Z_{B-} > L \mid Z_F > 1.96) = \frac{P(Z_{B-} > L,\; Z_F > 1.96)}{P(Z_F > 1.96)},$$

where the denominator is defined in equation (5) and can be obtained from standard statistical software.
For the numerator $P(Z_{B-} > L,\; Z_F > 1.96)$, writing $X = 1.96 - Z_F$ and $Y = L - Z_{B-}$, we show in Appendix 1 that $(X, Y)$ has a bivariate normal distribution. Again, the numerator of the conditional power is a function of three location and three variance parameters $\mu_{B+}$, $\mu_{B-}$, $p$, $\sigma_{B+}$, $\sigma_{B-}$, $\rho$, through $\mu_F$ and $\sigma_F$. Conditional on these, the numerator is $P(X \le 0,\; Y \le 0)$ and can be obtained from any statistical software.
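A minimal numerical sketch of the Rule 1 calculation is given below, using only the Python standard library. It is derived directly from the definitions in this section rather than from Appendix 1: $Z_F$ and $Z_{B-}$ are treated as bivariate normal with unit variances, and their correlation is worked out from the weighted-average form of $\hat\mu_F$. The helper `upper_orthant` (Simpson integration of a bivariate normal tail probability) and all function names and inputs are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def upper_orthant(a, b, r, steps=2000):
    """P(U > a, V > b) for standard bivariate normal (U, V), correlation |r| < 1.

    Integrates pdf(u) * P(V > b | U = u) over u > a with Simpson's rule.
    """
    lo, hi = max(a, -9.0), 9.0
    if lo >= hi:
        return 0.0
    s = sqrt(1.0 - r * r)
    f = lambda u: _N.pdf(u) * (1.0 - _N.cdf((b - r * u) / s))
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def rule1_conditional_power(mu_pos, mu_neg, se_pos, se_neg, p, L,
                            rho=0.0, crit=1.96):
    """P(Z_B- > L | Z_F > crit) under assumed effects and standard errors."""
    se_f = sqrt((p * se_pos) ** 2 + ((1.0 - p) * se_neg) ** 2
                + 2.0 * p * (1.0 - p) * rho * se_pos * se_neg)
    mean_f = (p * mu_pos + (1.0 - p) * mu_neg) / se_f    # E[Z_F]
    mean_neg = mu_neg / se_neg                           # E[Z_B-]
    r = (p * rho * se_pos + (1.0 - p) * se_neg) / se_f   # Corr(Z_F, Z_B-)
    joint = upper_orthant(crit - mean_f, L - mean_neg, r)
    return joint / (1.0 - _N.cdf(crit - mean_f))

# Hypothetical scenario: effects of 2 and 1 standard errors, p = 0.5, L = 1
print(rule1_conditional_power(2.0, 1.0, 1.0, 1.0, 0.5, L=1.0))
```

The sketch requires $0 < p \le 1$ with the correlation strictly below one; a library routine for the bivariate normal distribution function could replace the hand-written integration where one is available.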

Rule 2: The B− data should increase statistical significance
For interventions with a low adverse event profile, approval may be acceptable provided that the data in B− are not in conflict with those in B+. More formally, we might approve in B− provided that the data increase statistical significance, that is, on condition that $Z_F > Z_{B+}$. For rule 2 the conditional power has denominator defined in equation (5) and numerator given by $P(Z_F > Z_{B+},\; Z_F > 1.96)$.
Making the transformations $X = 1.96 - Z_F$ and $Y = Z_{B+} - Z_F$, we show in Appendix 1 that $X$ and $Y$ have a bivariate normal distribution. For the conditional power numerator, we calculate $P(X \le 0,\; Y \le 0)$ using standard statistical software, conditional on subgroup-specific parameters.
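The Rule 2 numerator can be sketched in the same way. The code below is an illustration derived from the definitions in this section, not a transcription of Appendix 1: it works with the difference $D = Z_F - Z_{B+}$, standardises it, and evaluates a bivariate normal tail probability by Simpson integration. All names and inputs are hypothetical, and the sketch assumes $p < 1$ so that $D$ is not degenerate.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def upper_orthant(a, b, r, steps=2000):
    """P(U > a, V > b) for standard bivariate normal (U, V), correlation |r| < 1."""
    lo, hi = max(a, -9.0), 9.0
    if lo >= hi:
        return 0.0
    s = sqrt(1.0 - r * r)
    f = lambda u: _N.pdf(u) * (1.0 - _N.cdf((b - r * u) / s))
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def rule2_conditional_power(mu_pos, mu_neg, se_pos, se_neg, p,
                            rho=0.0, crit=1.96):
    """P(Z_F > Z_B+ | Z_F > crit) under assumed effects and standard errors."""
    se_f = sqrt((p * se_pos) ** 2 + ((1.0 - p) * se_neg) ** 2
                + 2.0 * p * (1.0 - p) * rho * se_pos * se_neg)
    mean_f = (p * mu_pos + (1.0 - p) * mu_neg) / se_f    # E[Z_F]
    mean_pos = mu_pos / se_pos                           # E[Z_B+]
    c = (p * se_pos + (1.0 - p) * rho * se_neg) / se_f   # Corr(Z_F, Z_B+)
    sd_d = sqrt(2.0 - 2.0 * c)                           # SD of D = Z_F - Z_B+
    r = (1.0 - c) / sd_d                                 # Corr(Z_F, D)
    b = -(mean_f - mean_pos) / sd_d                      # D > 0 on the standardised scale
    joint = upper_orthant(crit - mean_f, b, r)
    return joint / (1.0 - _N.cdf(crit - mean_f))

# Hypothetical scenario: effects of 2 and 1 standard errors, p = 0.5
print(rule2_conditional_power(2.0, 1.0, 1.0, 1.0, 0.5))
```

No threshold argument appears in the signature, reflecting the fact that Rule 2 needs no additional tuning parameter.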

Rule 3: No significant subgroup-treatment interaction at the $\alpha_I$ level
When a range of subgroup effects are explored (often post hoc), it is customary to perform interaction tests to identify specific subgroups for which the treatment appears particularly effective/ineffective for further investigation. From our targeted systematic review of literature, the type I error rate $\alpha_I$ is almost invariably set to 5%, with no adjustment for multiplicity. Our objective here is quite different; specifically we use $\alpha_I$ as a measure of how confident we are that the two subgroups have different treatment effects, in order to decide whether approval in B− is warranted. In this case, we might choose a value for $\alpha_I$ that is greater than 5%, depending on our knowledge of the variation of the treatment effects and the number of trial participants in each of the four subgroup-treatment combinations.
For this rule, the numerator is $P\big((\hat\mu_{B+} - \hat\mu_{B-})/SD(\hat\mu_{B+} - \hat\mu_{B-}) < z_{\alpha_I/2},\; Z_F > 1.96\big)$, where $z_{\alpha_I/2}$ is the $100(1 - \alpha_I)\%$ quantile from the standard normal distribution and $SD$ is the standard error of the difference in treatment effects in the two subgroups. We make the transformations $X = 1.96 - Z_F$ and $Y = \dfrac{\hat\mu_{B+} - \hat\mu_{B-}}{SD(\hat\mu_{B+} - \hat\mu_{B-})} - z_{\alpha_I/2}$. Then we show in Appendix 1 that $X$ and $Y$ have a bivariate normal distribution. Again, the numerator for conditional power is $P(X \le 0,\; Y \le 0)$, which we obtain from standard statistical software, conditional on subgroup-specific parameters.

Figure 1 illustrates the proposed rules for the (hypothetical) case of independent normally distributed estimands for the two subgroups (on the scale of analysis, e.g. linear, log, logistic). The statistics $Z_{B+}$ and $Z_{B-}$ have a bivariate normal distribution with mean (2, 1) and variance diag(1), represented by the unlabelled contours. The condition that the hypothesis test is significant in the full population is represented by the volume under the $(Z_{B+}, Z_{B-})$ joint density that lies above and to the right of the red line. The volumes under the $(Z_{B+}, Z_{B-})$ joint density that lie above the purple, blue and green lines represent Rule 1 for the cases where $L = 1.96$ (treatment effect in B− is significant), $L = 1$ (treatment effect in B− is one standard error) and $L = 0$ (treatment effect is positive), respectively. The volume that lies above the orange line represents Rule 2 (B− results add to the overall significance) and the volume that lies above the pink line represents Rule 3 (interaction not significant at the one-sided 10% significance level). Conditional power for a certain rule, e.g. $P(Z_{B-} > 1 \mid Z_F > 1.96)$, is the proportion of the total probability above the red curve ($Z_F > 1.96$) that also lies above the boundary defined by the rule (e.g. above the blue line, so that $Z_{B-} > 1$).

Illustration of proposed rules
In general, the conditional power for each of these three rules is given by the proportion of the density of $(X, Y)$ that is consistent with the condition $Z_F > 1.96$. As a specific example, consider Rule 1. Figure 2 shows the conditional power for Rule 1 for the case where estimands for B+ and B− are independent ($\rho = 0$), participants are drawn in equal numbers from the two sub-populations ($p = (1 - p) = 0.5$) and thresholds for approval are set at $L = 0, 1, 1.96$.
Because $Z_{B+}$ enters into the power calculation only through $X = 1.96 - Z_F$, it is independent of the approval threshold $L$, so that the Y-axis does not change position for different values of $L$. In contrast, as $L$ increases, the joint density of $(X, Y)$ shifts down the Y-axis and the proportion above zero decreases, thus decreasing the conditional power as expected. As the two estimands are assumed independent, the contours are circular. From the covariance matrix for Rule 1, $X$ and $Y$ will be positively correlated if the two estimands $\mu_{B+}$ and $\mu_{B-}$ are ($\rho > 0$) and vice versa.
Similar patterns can be found for Rules 2 and 3.

Comparison of conditional power for the proposed rules
The conditional power for all three rules depends on the relative treatment effects in the two subgroups, which is driven by the biological mechanisms of the treatment. Beyond this, we explore how the proportions sampled from each sub-population, $p$ and $1 - p$, and the correlation between estimands, $\rho$, affect the power of the proposed rules.
4.2 Comparison of conditional power for proposed rules when $\sigma_{B+} = \sigma_{B-}$, $p = 0.5$ and $\rho = 0$
The top row of Figure 3 shows the relationship between the subgroup-specific statistics for treatment effect and conditional power in the simple case of equal standard errors, equal numbers in the two subgroups and independent estimands. For any value of $\mu_{B-}/\sigma_{B-}$, power decreases as $\mu_{B+}/\sigma_{B+}$ increases, due to conditioning on observed $Z_F > 1.96$; that is, $P(Z_F > 1.96)$ increases as $\mu_{B+}/\sigma_{B+}$ increases, thereby increasing the denominator in equation (4). We note that the contour lines curve for Rules 1 and 2, but for Rule 3, which is based on the interaction test, they are straight. The linear contours for Rule 3 do not hold if the estimands for the subgroups are non-independent, nor will they hold if data are binary or counts; see Appendix 1 for further details. For this illustration, we set $L = 1$ for Rule 1 and the one-sided significance of the interaction to 0.1 for Rule 3. Changing these thresholds will result in a shift in the contour plots, whilst Rule 2 does not require specification of an additional parameter and is fixed. It is possible to closely align the three rules by choosing appropriate values for $L$ and the interaction type I error if necessary.
4.3 Effect of correlation between estimands when $\sigma_{B+} = \sigma_{B-}$, $p = 0.5$ and $\rho = 0.5$
Including additional covariates $X_i$ in the trial analysis in equation (1) induces correlation between the treatment effect estimands (and therefore the statistics) for the two subgroups. The bottom row of Figure 3 shows how the conditional power changes when there is moderate/strong correlation ($\rho = 0.5$) between the treatment effect estimates, compared with no correlation in the top row. In all cases, the contours are closer together. Conditional power for Rule 1 changes only slightly since it relies largely on the absolute size of $\mu_{B-}/\sigma_{B-}$, whilst Rules 2 and 3 rely on both groups to a greater extent.
We note here that, if the two estimands are independent ($\rho = 0$) and arise from normally distributed data with common sampling variance in the two subgroups, then conditional and unconditional power for Rule 3 are identical, since $X = 1.96 - Z_F$ and $Y = \dfrac{\hat\mu_{B+} - \hat\mu_{B-}}{SD(\hat\mu_{B+} - \hat\mu_{B-})} - z_{\alpha_I/2}$ in Rule 3 will also be independent. A brief proof is given in Appendix 1. However, for non-normal data or normal data with differential variance in the two subgroups or correlated $\hat\mu_{B+}$ and $\hat\mu_{B-}$, conditional and unconditional power for the interaction test will not be identical.
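The Rule 3 calculation, and the equality of conditional and unconditional power noted above, can be checked numerically with a short sketch. As with the earlier rules, this is an illustration derived from the definitions in this section rather than from Appendix 1. The interaction threshold is parameterised here by a one-sided level (`alpha_one_sided`), which is an assumption made to match the one-sided 10% level used in the figures; all names and inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def upper_orthant(a, b, r, steps=2000):
    """P(U > a, V > b) for standard bivariate normal (U, V), correlation |r| < 1."""
    lo, hi = max(a, -9.0), 9.0
    if lo >= hi:
        return 0.0
    s = sqrt(1.0 - r * r)
    f = lambda u: _N.pdf(u) * (1.0 - _N.cdf((b - r * u) / s))
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def rule3_powers(mu_pos, mu_neg, se_pos, se_neg, p, rho=0.0,
                 alpha_one_sided=0.10, crit=1.96):
    """Return (conditional, unconditional) power of the non-significant-interaction rule."""
    se_f = sqrt((p * se_pos) ** 2 + ((1.0 - p) * se_neg) ** 2
                + 2.0 * p * (1.0 - p) * rho * se_pos * se_neg)
    sd_diff = sqrt(se_pos ** 2 + se_neg ** 2 - 2.0 * rho * se_pos * se_neg)
    mean_f = (p * mu_pos + (1.0 - p) * mu_neg) / se_f    # E[Z_F]
    mean_w = (mu_pos - mu_neg) / sd_diff                 # E[interaction statistic W]
    z = _N.inv_cdf(1.0 - alpha_one_sided)                # interaction threshold
    # Corr(W, Z_F), from Cov(mu_pos_hat - mu_neg_hat, mu_F_hat)
    r = (p * se_pos ** 2 - (1.0 - p) * se_neg ** 2
         + (1.0 - 2.0 * p) * rho * se_pos * se_neg) / (sd_diff * se_f)
    # W < z is the event V > mean_w - z for V = -(W - mean_w), with Corr(V, Z_F) = -r
    joint = upper_orthant(crit - mean_f, mean_w - z, -r)
    conditional = joint / (1.0 - _N.cdf(crit - mean_f))
    unconditional = _N.cdf(z - mean_w)
    return conditional, unconditional

# Common sampling variance with p = 0.2: se_pos^2 = 1/0.2 and se_neg^2 = 1/0.8
print(rule3_powers(2.0, 1.0, sqrt(5.0), sqrt(1.25), 0.2))
# Equal standard errors but unequal sampling proportions: the two values differ
print(rule3_powers(2.0, 1.0, 1.0, 1.0, 0.8))
```

In the first call the interaction statistic and $Z_F$ are uncorrelated, so the two returned values agree up to numerical error; in the second they are positively correlated and the conditional power falls below the unconditional power.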

Effect of relative subgroup size when sampling variance is homoscedastic across subgroups and $\rho = 0.5$
We illustrate the influence of subgroup size on conditional power for the case where sampling variance is homoscedastic across biomarker subgroups. This will occur for normally distributed outcomes with the same sampling variance $\sigma^2$ in each subgroup, but does not necessarily hold for other distributions. Figure 4 shows how conditional power for each rule varies if either 20% or 80% of patients come from population B+. The proportion of patients sampled from the two sub-populations has a greater impact on conditional power than correlation between the two treatment effects.
If the trial sample contains a high proportion drawn from the B+ population (bottom row of Figure 4), then Rule 1 (with $L = 1$) has lower power for small values of $\mu_{B+}/\sigma_{B+}$. This arises because we are conditioning on significance in the full population, and the B− patients contribute only a small amount to the overall analysis. Conversely, because B− patients contribute a large amount to the overall analysis in the top row of Figure 4, overall significance only occurs if there is good power that the observed $Z_{B-}$ exceeds $L = 1$. Similar effects are observed for Rules 2 and 3, in that the overall-significance condition induces higher conditional power for approving treatment in B− at lower values of $\mu_{B+}/\sigma_{B+}$.
To provide further insight into the effect of relative sample size of the two subgroups, we plot conditional power against $\mu_{B+}/\sigma_{B+}$ for three different sampling proportions, 20:80, 50:50 and 80:20, with $\rho = 0$, in Figure 5. This shows that, if the proportion of patients from B+ is low (20%) or equal (50%), Rule 1 (with $L = 1$) has highest conditional power across the plausible range of values of $\mu_{B+}/\sigma_{B+}$. Conversely, if the proportion of patients from B+ is high (80%), Rule 3 has uniformly highest conditional power. Rule 2 never has highest power in these scenarios, but we note that results are dependent on the chosen values of $L$ for Rule 1 and the type I error for the interaction test, whilst Rule 2 has the advantage of not requiring additional parameters.

Illustrative applications
In order to illustrate how these rules might be used in practice, we retrospectively apply them to two completed phase III trials: a small cardiac surgery trial of 352 patients and equal size subgroups, 19 and a much larger stroke trial (n = 7513) which evaluated betrixaban. 13

AMAZE trial in cardiac surgery
AMAZE was a cardiac surgical trial in patients with atrial fibrillation (rapid/irregular heart rhythm). 19 This multicentre RCT randomised 352 patients 1:1 to a technique called ablation in addition to planned surgery, or to planned surgery alone (control arm). The primary outcome was return to sinus rhythm at one year post-surgery (binary). Although not part of the intervention, at the discretion of the operating surgeon 150/352 (p ¼42.6%) randomised patients had a section of the heart, called the left atrial appendage (LAA), removed during the procedure. This may be considered a more extensive procedure, conferring higher probability of a positive outcome. We define subgroups by whether the LAA was removed Bþ or left intact BÀ. Selected results from the AMAZE trial are shown in Table 2.
We estimate conditional power for approval of ablation in B−, assuming that the actual trial results were our estimates of group-specific effects before the trial started. For Rule 1, there would be 74% power to observe a treatment effect in B− if the threshold was set at L = 1. Rule 2 is not met in AMAZE since $Z_F < Z_{B+}$; assuming the observed trial results were "true", conditional power was only 38%. Finally, the interaction Rule 3 was also just met if we set the significance level of the two-sided interaction test to 10% (interaction test p = 0.119). Conditional power based on observed results was 37% for Rule 3; had this been specified during design of the trial, the sample size could have been increased to raise confidence that Rule 3 would be met.

APEX trial in patients at high risk of stroke
Recall that one of the trials motivating this study compared treatment with the anticoagulant betrixaban with standard treatment of enoxaparin amongst hospitalised medically ill patients. 13 The primary outcome was a composite of clinical events caused by blood clotting (deep vein thrombosis, nonfatal pulmonary embolism or death from thromboembolism) up to day 42 post-randomisation.
The planned analysis took a sequential testing approach but, rather than starting with the full trial population, the order of testing began with a subgroup with a high chance of treatment response (but smaller treatment effect), followed by testing in two other pre-specified, progressively inclusive cohorts as follows:
1. Compare treatment arms in patients with elevated D-dimer level (for illustration, this can be considered B−).
2. Compare treatment arms in patients with elevated D-dimer or age ≥ 75 (for illustration, this can be considered extended B−).
3. Compare treatment arms in all enrolled patients (full population, n = 7513).
If any test was negative, all subsequent tests were reported as exploratory. We provide selected results from the original trial publication in Table 3.
As the table shows, the first analysis, in B− patients, was not significant at the traditional threshold (p = 0.054), so subsequent analyses were treated as exploratory, even though the experimental treatment effect was greater in the B+ subgroup and in the overall analysis. Adopting a sequential strategy, conditional on a significant effect of betrixaban in the full population, our proposed rules for the elevated D-dimer (B−) subgroup would result in the following recommendations:
1. Rule 1: B− would be approved if a one standard error threshold (L = 1) was defined a priori as a clinically acceptable treatment effect. Assuming the trial result was "true", as for the previous example, the conditional power of this rule was 91%. If a significant effect in this subgroup was necessary ($L \ge 1.96$), as suggested by the original trial analysis, then Rule 1 was not met.
2. Rule 2 was not met, because the B− data resulted in a decrease in the Z-statistic for the full population compared to B+. This arose because B+ patients had a much higher treatment effect despite the lower event rate. The conditional power of this rule was only 34%.
3. Rule 3 was also not met, because there was a significant interaction between subgroup and treatment (one-sided p = 0.0184). The conditional power was 43%.
In summary, using our proposed sequential testing procedure, betrixaban would be recommended for treatment in elevated D-dimer patients only if we were prepared to accept a lower treatment effect than in non-elevated D-dimer patients, and if the required threshold for the corresponding Z-statistic was at most 1.85 standard errors ($Z_{B-} > 1.85$).
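The way the three rules are applied to observed test statistics can be summarised in a few lines of code. The sketch below is illustrative only: the function name, the argument names and the Z-values in the usage example are hypothetical and are not taken from Table 2 or Table 3. It assumes that Rule 3 is assessed by comparing the interaction Z-statistic with the two-sided critical value $z_{\alpha_I/2}$.

```python
from statistics import NormalDist


def apply_rules(z_full, z_pos, z_neg, z_interaction, L=1.0, alpha_interaction=0.10):
    """Report which rules would support approval in B-, given observed Z-statistics.

    Every rule is conditional on a significant overall effect (z_full > 1.96).
    """
    if z_full <= 1.96:
        return {"overall": False, "rule1": False, "rule2": False, "rule3": False}
    z_crit = NormalDist().inv_cdf(1 - alpha_interaction / 2)
    return {
        "overall": True,
        "rule1": z_neg > L,               # minimum evidence of effect in B-
        "rule2": z_full > z_pos,          # B- data increase overall significance
        "rule3": z_interaction < z_crit,  # interaction not significant at alpha_I
    }


# Hypothetical (made-up) statistics, for illustration only
print(apply_rules(z_full=3.0, z_pos=3.4, z_neg=1.5, z_interaction=1.9))
print(apply_rules(z_full=3.0, z_pos=3.4, z_neg=1.5, z_interaction=1.9, L=1.96))
```

With these made-up inputs only Rule 1 with L = 1 is satisfied, and raising the threshold to L = 1.96 removes that approval as well, which shows how strongly the decision depends on the pre-specified parameters.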

Implications for trial design
In this context, investigators define and document decision rules for the primary trial analysis during the design stage. Our proposal is that, should a separate decision on approval of a subgroup be required, a decision rule should be agreed in discussion with regulators or other appropriate decision makers during the design phase. Our evaluation of three potential rules illustrates how to investigate the efficiency of different rules, although parameter inputs will be specific to each trial and will depend on available information around potential efficacy. Although our rules rely on Z-statistics for hypothesis tests, it is more usual to work with potential treatment effects and their standard deviations when designing a trial. Empirical estimates of variation in the primary outcome are typically available, particularly for the control arm of the trial. This may be a standard deviation for a continuous outcome, or the baseline risk of an event for patients receiving the current best treatment. Given these estimates, the sample size required for an overall significant treatment effect, the proposed sampling proportion, and the Rule 1 threshold L can be decided to ensure that the treatment effect in subgroup B− lies above a minimum treatment effect. In a similar way, the Rule 3 significance level can be chosen to ensure there is sufficient power to find an interaction if the B− estimate is much lower than that for B+.
The stages of design are as follows:
1. Using initial estimates of design parameters, including the sampling proportion p and correlation $\rho$, calculate the power of the test for the expected value of the treatment effect in the full trial population.
2. Choose a decision rule for recommending treatment in B− based on considerations of clinically important treatment effects, safety and biological mechanisms.
3. Given the sampling proportion p and the expected treatment effect sizes in the two sub-populations, calculate the power of your preferred rule, conditional on a significant overall test.
4. Calculate the power of the sequential testing strategy as the product of the unconditional power from step 1 and the conditional power from step 3.
In practice the final power calculations will require an iterative process between calculation and elicitation of expert clinical knowledge of treatment effects and associated variance components, finalised in discussion with regulators or other decision makers.
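Steps 1 and 4 of the design process reduce to simple arithmetic once the conditional power of the chosen rule is available. The sketch below shows that arithmetic under the assumption that the overall test is $Z_F > 1.96$ and that $Z_F$ is normally distributed with unit variance. The function names and the numerical inputs (including the conditional power of 0.74) are hypothetical placeholders, not values recommended by the paper.

```python
from statistics import NormalDist


def overall_power(mu_full, se_full, z_crit=1.96):
    """Step 1: P(Z_F > z_crit) when Z_F ~ N(mu_full / se_full, 1)."""
    return 1.0 - NormalDist().cdf(z_crit - mu_full / se_full)


def sequential_power(mu_full, se_full, conditional_power):
    """Step 4: power of the sequential strategy = overall power * conditional power."""
    return overall_power(mu_full, se_full) * conditional_power


# Hypothetical inputs: expected overall effect 0.3 with standard error 0.1,
# and a conditional power of 0.74 for the chosen rule (from step 3).
p_overall = overall_power(mu_full=0.3, se_full=0.1)
p_sequential = sequential_power(mu_full=0.3, se_full=0.1, conditional_power=0.74)
print(f"overall power    = {p_overall:.3f}")
print(f"sequential power = {p_sequential:.3f}")
```

Because the sequential power can never exceed the overall power, a trial sized only for the overall test may need a larger sample if approval in B− is also a design objective.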

Overview of results
Frequentist rules to assess whether a new treatment should be approved in a lower response subgroup, conditional on a significant effect overall, have been developed and evaluated. Approval based solely on a significant overall test may be unacceptable if there are severe side effects and/or if the subgroup drawn from the low response population is under-represented due to enrichment sampling. Rules are based either on measures of influence, such as the size of the effect in this subgroup or the increase in significance due to inclusion of the subgroup, or on the difference in effect size between the groups (interaction). When choosing a rule during trial design, as well as specifying estimates of the expected outcomes and their variance components, investigators must either take a random sample from the full population, in which case the trial will represent clinical practice, or decide the proportion of patients to be sampled from each sub-population. Using conditional power as a measure of efficiency, the proportion of patients drawn from each sub-population had a large impact, but correlation between the groups induced by covariate adjustment was less important. For all rules, conditional power decreased as $\mu_{B+}/\sigma_{B+}$ increased for fixed $\mu_{B-}/\sigma_{B-}$.

Discussion of individual rules
After ensuring that $Z_F > 1.96$, the simplest approach is to perform tests in B+ and B− separately as part of a closed testing procedure, and allow a more relaxed significance level for $Z_{B-}$ (Rule 1). This significance level is related to both the proportion of the trial sample in B− and the effect size that is acceptable given the safety profile of the treatment. Hence, the level can be set based on prior knowledge of treatment effect and prevalence of low responders in the population. In our illustration, for trials with 50% of trial patients in B+, Rule 1 had the highest power for our chosen threshold of $Z_{B-} > 1$.
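To make the link between the Rule 1 threshold and a "relaxed significance level" concrete, the threshold L can be converted to the one-sided significance level it implies for $Z_{B-}$, and back. This is a small sketch using the standard normal distribution; the function names are hypothetical.

```python
from statistics import NormalDist


def one_sided_alpha(L):
    """One-sided significance level implied by requiring Z_neg > L."""
    return 1.0 - NormalDist().cdf(L)


def threshold_for_alpha(alpha):
    """Threshold L that corresponds to a one-sided significance level alpha."""
    return NormalDist().inv_cdf(1.0 - alpha)


for L in (1.0, 1.645, 1.96):
    print(f"L = {L:<5}: one-sided alpha = {one_sided_alpha(L):.3f}")
```

For example, the threshold L = 1 used in the illustrations corresponds to a one-sided level of about 0.16, compared with 0.025 for the conventional L = 1.96.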
Rule 2 (B− patients should increase significance) is rather ad hoc but has the benefit of not requiring specification of an additional parameter. To satisfy Rule 2, B− must preserve at least some proportion of the estimated efficacy of B+. Further, the conditional power of Rule 2 decreases as the proportion of patients from B− decreases, which is also an attractive property. Despite these benefits, Rule 2 never demonstrated highest conditional power in our analyses.
Rule 3 uses an interaction test with a relaxed significance level to recommend approval in the B− subpopulation. Interaction tests in clinical trial publications typically aim to identify heterogeneity between subgroups and are mostly exploratory. However, a significant subgroup-treatment interaction at the 5% level may not preclude approval in the lower response group, provided that there is a minimum level of efficacy, particularly if side effects are mild or there are few alternatives for this subgroup. Our more targeted objective is equivalent to testing whether the difference in treatment effects between the two groups is within acceptable limits and can be recast as an equivalence or non-inferiority test. That is, the (interaction test) significance level can be based on a priori estimates of the maximum acceptable difference between the two subgroups ($\beta_3$ in equation (1)). In our analyses, for trials where more than 50% of patients arise from B+, Rule 3 had the highest power to approve B− given an overall significant result.
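One way to read the link between the interaction significance level and a maximum acceptable difference is as follows. Rule 3 is met when the standardised estimated difference $(\hat\mu_{B+} - \hat\mu_{B-})/\sigma_3$ falls below $z_{\alpha_I/2}$, so the largest estimated difference that still passes is $z_{\alpha_I/2}\,\sigma_3$. The sketch below inverts that relationship. It uses the interaction standard error from the appendix, $\sigma_3^2 = \sigma_{B+}^2 + \sigma_{B-}^2 - 2\rho\sigma_{B+}\sigma_{B-}$; the function names and numerical inputs are hypothetical, and this is an interpretation rather than a procedure stated in the paper.

```python
import math
from statistics import NormalDist


def interaction_se(se_pos, se_neg, rho=0.0):
    """Standard error of the difference between the two subgroup estimates."""
    return math.sqrt(se_pos**2 + se_neg**2 - 2 * rho * se_pos * se_neg)


def alpha_for_max_difference(delta_max, se_pos, se_neg, rho=0.0):
    """Two-sided alpha_I such that estimated differences below delta_max pass Rule 3."""
    z = delta_max / interaction_se(se_pos, se_neg, rho)
    return 2.0 * (1.0 - NormalDist().cdf(z))


def max_difference_for_alpha(alpha_i, se_pos, se_neg, rho=0.0):
    """Largest estimated difference that still passes Rule 3 at level alpha_I."""
    return NormalDist().inv_cdf(1 - alpha_i / 2) * interaction_se(se_pos, se_neg, rho)


# Hypothetical standard errors for the two subgroup estimates
delta = max_difference_for_alpha(0.10, se_pos=0.12, se_neg=0.15, rho=0.2)
print(f"largest passing difference at alpha_I = 0.10: {delta:.3f}")
print(f"alpha_I recovered from that difference: "
      f"{alpha_for_max_difference(delta, 0.12, 0.15, 0.2):.3f}")
```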

Regulator input
In practice, acceptability of these approaches will depend on regulators (for drug trials) or commissioners (for academic trials). Since the conditional power of the rules depends crucially on the values chosen for the parameters L and $\alpha_I$, as well as patient sampling, prevalence of high/low responders and analysis methods, early engagement with regulators/commissioners to discuss these decision rules is worthwhile. Discussions also need to consider potential harms (side effects), in order to set realistic and acceptable targets for efficacy. In practice, investigators/sponsors will be required to pre-specify and document these decision rules in discussion with regulators.

Strength, weaknesses and future research
One benefit of our proposed decision rules is that closed-form expressions for conditional power are available for continuous, binary and count outcomes (assuming known variances). This makes estimation of sample sizes relatively simple, and a wide range of scenarios can be explored during the design phase.
In our examples, we used retrospective power calculations to show the differences in conditional power for the three rules based on trial results. We stress that these calculations were provided for illustration only and we do not endorse retrospective power calculations to aid interpretation of statistically non-significant trial results (see for example Hoenig and Heisey 20 ).
In common with many statistical methods, there is an underlying assumption of normality when using generalised linear models. This will hold for most adequately powered, phase III trials where analysis is completed on a scale for which the sampling distributions of estimated coefficients can be assumed normal (e.g. logistic, log). For small trials, or for estimands with very skewed distributions, asymptotic approximations may not hold and analyses should be checked using simulations.
In this paper, we provided expressions for the case where patients were randomised 1:1 to the experimental and control arms, although extension to other allocation ratios is straightforward. It would also be relatively straightforward to extend the methods to biomarkers with more than two levels, although the number of patients at each level is likely to be small in this case, resulting in low conditional power for all proposed rules. An exception might be for biomarkers with ordered levels, in which case the subgroup effect and interaction with treatment could be linear terms in the analysis (equation (1)).
For time-to-event outcomes, power of the study depends directly on the number of events occurring rather than on the number of patients, so that power would also depend on recruitment and censoring patterns. Methods would need to be extended to accommodate these features. Further, we have not embedded these results in more formal decision analytic methods, and this would require further specification of costs, harms (side effects) and utilities (benefits) and would depend on the perspective of the investigator (sponsor or health provider).
In summary, in situations where additional conditions are required for approval of a new treatment in a lower response subgroup, easily applied rules based on minimum effect sizes and relaxed interaction tests are available. These depend on trial design characteristics, particularly the proportion of patients sampled from the two subgroups, and must be pre-specified and documented in the Statistical Analysis Plan.

The numerator is $P(X < 0, Y < 0)$ and can be obtained from standard statistical software.

A.2 Rule 2 numerator
The numerator for Rule 2 is given by $P(Z_F > Z_{B+},\, Z_F > 1.96)$. Making the transformations $X = 1.96 - Z_F$ and $Y = Z_{B+} - Z_F$, the joint distribution of X and Y is found as follows: X is the same as for Rule 1, so that the expectation and variance of X are again $1.96 - \mu_F/\sigma_F$ and 1, respectively.

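The covariance of X and Y is

$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(1.96 - Z_F,\; Z_{B+} - Z_F) = \mathrm{Var}(Z_F) - \mathrm{Cov}(Z_F, Z_{B+}) = 1 - \frac{p\,\sigma_{B+} + \rho(1-p)\,\sigma_{B-}}{\sigma_F}.$$

The joint distribution of X and Y for Rule 2 is

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathrm{BVN}\left( \begin{pmatrix} 1.96 - \mu_F/\sigma_F \\ \mu_{B+}/\sigma_{B+} - \mu_F/\sigma_F \end{pmatrix},\; \begin{pmatrix} 1 & \mathrm{Cov}(X, Y) \\ \mathrm{Cov}(X, Y) & \mathrm{Var}(Y) \end{pmatrix} \right).$$

Again, the numerator for conditional power is $P(X < 0, Y < 0)$, which we obtain from standard statistical software.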
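The Rule 2 calculation can be sketched numerically as follows. This is an illustration under stated assumptions rather than the authors' implementation: the full-population estimate is taken to be the p-weighted average of the subgroup estimates, $\mathrm{Var}(Y)$ is obtained as $2 - 2\,\mathrm{Cov}(Z_F, Z_{B+})$ because both Z-statistics have unit variance, and the bivariate normal probability is computed by Simpson's rule instead of a statistical package. All function names and inputs are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist()


def upper_orthant(a, b, r, steps=4000):
    """P(Z1 > a, Z2 > b) for a standard bivariate normal with correlation r (|r| < 1)."""
    lo, hi = max(a, -10.0), 10.0
    if lo >= hi:
        return 0.0
    s = math.sqrt(1.0 - r * r)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        f = PHI.pdf(z) * (1.0 - PHI.cdf((b - r * z) / s))
        total += (1 if i in (0, steps) else (4 if i % 2 else 2)) * f
    return total * h / 3.0


def rule2_conditional_power(mu_pos, mu_neg, se_pos, se_neg, p, rho=0.0):
    """P(Z_F > Z_pos | Z_F > 1.96), i.e. P(X < 0, Y < 0) / P(X < 0)."""
    mu_f = p * mu_pos + (1 - p) * mu_neg  # assumed weighted average
    se_f = math.sqrt(p**2 * se_pos**2 + (1 - p)**2 * se_neg**2
                     + 2 * p * (1 - p) * rho * se_pos * se_neg)
    c = (p * se_pos + rho * (1 - p) * se_neg) / se_f  # Cov(Z_F, Z_pos)
    cov_xy = 1.0 - c
    sd_y = math.sqrt(2.0 * (1.0 - c))                 # Var(Y) = 2 - 2c
    mean_x = 1.96 - mu_f / se_f
    mean_y = mu_pos / se_pos - mu_f / se_f
    # X < 0 and Y < 0 are upper-tail events for the sign-flipped, standardised variables
    numerator = upper_orthant(mean_x, mean_y / sd_y, cov_xy / sd_y)
    return numerator / (1.0 - PHI.cdf(mean_x))


# Hypothetical inputs
print(round(rule2_conditional_power(0.4, 0.2, 0.14, 0.14, p=0.5), 3))
```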
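
A.3 Rule 3 numerator
We make the transformations $X = 1.96 - Z_F$ and $Y = (\hat\mu_{B+} - \hat\mu_{B-})/\sigma_3 - z_{\alpha_I/2}$, where $\sigma_3^2 = \sigma_{B+}^2 + \sigma_{B-}^2 - 2\rho\,\sigma_{B+}\sigma_{B-}$. Again the expectation and variance of X are $1.96 - \mu_F/\sigma_F$ and 1, respectively. The expectation and variance of Y are $(\mu_{B+} - \mu_{B-})/\sigma_3 - z_{\alpha_I/2}$ and 1, respectively. The covariance of X and Y is given by

$$\mathrm{Cov}(X, Y) = \frac{\rho(2p-1)\,\sigma_{B+}\sigma_{B-} - p\,\sigma_{B+}^2 + (1-p)\,\sigma_{B-}^2}{\sigma_F\,\sigma_3}.$$

Again, the numerator for conditional power is $P(X < 0, Y < 0)$, which we obtain from standard statistical software.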
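A numerical sketch of the Rule 3 calculation is given below. As before, it is an illustration rather than the authors' code: the full-population estimate is assumed to be the p-weighted average of the subgroup estimates, the covariance of X and Y is the appendix expression with $\sigma_F\sigma_3$ in the denominator, and the bivariate normal probability is computed by Simpson's rule. Function names and inputs are hypothetical. The usage example uses standard errors of the form $\sqrt{2\sigma^2/(pn)}$ and $\sqrt{2\sigma^2/((1-p)n)}$ with $\rho = 0$, for which the covariance vanishes.

```python
import math
from statistics import NormalDist

PHI = NormalDist()


def upper_orthant(a, b, r, steps=4000):
    """P(Z1 > a, Z2 > b) for a standard bivariate normal with correlation r (|r| < 1)."""
    lo, hi = max(a, -10.0), 10.0
    if lo >= hi:
        return 0.0
    s = math.sqrt(1.0 - r * r)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        f = PHI.pdf(z) * (1.0 - PHI.cdf((b - r * z) / s))
        total += (1 if i in (0, steps) else (4 if i % 2 else 2)) * f
    return total * h / 3.0


def rule3_conditional_power(mu_pos, mu_neg, se_pos, se_neg, p, rho=0.0, alpha_i=0.10):
    """P(interaction not significant | Z_F > 1.96) = P(X < 0, Y < 0) / P(X < 0)."""
    mu_f = p * mu_pos + (1 - p) * mu_neg  # assumed weighted average
    se_f = math.sqrt(p**2 * se_pos**2 + (1 - p)**2 * se_neg**2
                     + 2 * p * (1 - p) * rho * se_pos * se_neg)
    se_3 = math.sqrt(se_pos**2 + se_neg**2 - 2 * rho * se_pos * se_neg)
    z_crit = PHI.inv_cdf(1 - alpha_i / 2)
    mean_x = 1.96 - mu_f / se_f
    mean_y = (mu_pos - mu_neg) / se_3 - z_crit
    cov_xy = (rho * (2 * p - 1) * se_pos * se_neg
              - p * se_pos**2 + (1 - p) * se_neg**2) / (se_f * se_3)
    # Var(X) = Var(Y) = 1, so cov_xy is also the correlation
    return upper_orthant(mean_x, mean_y, cov_xy) / (1.0 - PHI.cdf(mean_x))


# Hypothetical design: sigma = 1, n = 200 per arm, 30% of patients from B+
p, n = 0.3, 200
se_pos, se_neg = math.sqrt(2 / (p * n)), math.sqrt(2 / ((1 - p) * n))
print(round(rule3_conditional_power(0.4, 0.2, se_pos, se_neg, p), 3))
```

In this equal-variance setting with $\rho = 0$, the conditional power should coincide with the unconditional probability that the interaction test is non-significant, which is a convenient check on any implementation.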

A.4 Equality of conditional and unconditional power for Rule 3 when $\rho = 0$, data are normally distributed and subgroups have the same sampling variance
To explore when conditional and unconditional power are the same, we identify conditions under which $P(R3 \mid Z_F > 1.96) = P(R3)$ or, equivalently, under which the correlation between R3 and $Z_F > 1.96$ is zero.
Recall that the covariance of X and Y is

$$\frac{\rho(2p-1)\,\sigma_{B+}\sigma_{B-} - p\,\sigma_{B+}^2 + (1-p)\,\sigma_{B-}^2}{\sigma_F\,\sigma_3}.$$

If there is no correlation between the two subgroup treatment estimates ($\rho = 0$), then this becomes

$$\frac{-p\,\sigma_{B+}^2 + (1-p)\,\sigma_{B-}^2}{\sigma_F\,\sigma_3}. \tag{6}$$

Recall that, for the normal distribution case with n patients allocated to each treatment arm and common sampling variance $\sigma^2$ in the two subgroups, $\sigma_{B+}^2 = 2\sigma^2/(pn)$ and $\sigma_{B-}^2 = 2\sigma^2/((1-p)n)$. Substituting these into