Familywise error rate control for block response-adaptive randomization

Response-adaptive randomization allows the probabilities of allocating patients to treatments in a clinical trial to change based on the previously observed response data, in order to achieve different experimental goals. One concern over the use of such designs in practice, particularly from a regulatory viewpoint, is controlling the type I error rate. To address this, Robertson and Wason (Biometrics, 2019) proposed methodology that guarantees familywise error rate control for a large class of response-adaptive designs by re-weighting the usual z-test statistic. In this article, we propose an improvement of their method that is conceptually simpler, in the context where patients are allocated to the experimental treatment arms in a trial in blocks (i.e. groups) using response-adaptive randomization. We show the modified method guarantees that there will never be negative weights for the contribution of each block of data to the adjusted test statistics, and can also provide a substantial power advantage in practice.


Introduction
Randomized clinical trials are often designed in such a way that a decision about treatment efficacy is reached as quickly as possible and with a minimum number of patients exposed to inferior treatment options. Response-adaptive randomization (RAR) can help achieve such goals by an allocation process that makes randomization of a newly recruited patient dependent on responses to treatment from previous study participants. This can offer advantages in terms of the benefit for patients recruited into the study, increasing the willingness of patients to participate in the study, or the study's power to detect treatment effects. Many different classes of RAR procedures have been proposed for various trial contexts, and a recent review of methodological and practical issues around the use of RAR in clinical trials can be found in Robertson et al. 1 However, one major concern is the potential for type I error rate inflation arising from such studies. This is particularly the case from a regulatory viewpoint, where control of the type I error rate is required for confirmatory studies. 2, 3 Robertson and Wason 4 proposed a methodology that guarantees type I error rate control for normally distributed outcomes by reweighting the usual z-statistic through iterative application of the conditional invariance (CIV) principle. For that purpose, they assume the existence of a hypothetical 'auxiliary design' in which the test statistic has a known null distribution. At interim analyses of the RAR trial (these may be after blocks/groups of patients or after every patient), the randomization ratios may be changed. The test statistic used to test for treatment efficacy, however, is calculated so that its null distribution matches the null distribution of the test statistic from the auxiliary design. Robertson and Wason 4 show that this guarantees type I error control at level α if the test in the auxiliary design is a level α test.
For multi-arm trials, the testing procedure controls the familywise error rate (FWER), which is defined as the maximum probability of at least one type I error under any configuration of true and false null hypotheses.
In this article, we present an improvement of their proposal based on the CIV principle. It is simpler in that it requires only modifying the final test statistic so that it has a known variance, and not a known mean. This restricts the method to trials where the patients are allocated to the experimental arms in blocks (i.e. groups) using RAR while the allocation to the control arm is fixed. On the other hand, by construction, the method guarantees that there will never be negative weights for the contribution of each block of data to the adjusted test statistic, and the modified method can also provide a substantial power advantage in practice.
The outline of the rest of the article is as follows. In Section 2, we describe the proposed testing procedure and its connection with existing approaches. A simulation study is presented in Section 3 to compare the testing strategies, and a case study is given in Section 4. We conclude with a discussion in Section 5.

Proposed testing procedure
Consider a trial with K ≥ 1 experimental treatment arms and a common control arm. We assume that the allocation of patients to the control arm is fixed throughout the trial, so that there are a total of n 0 control observations. For the treatment arms, there is a 'burn-in period' which allocates b patients according to a predefined allocation scheme. After the end of this burn-in, patients are allocated in blocks to the treatment arms using a RAR scheme. In total, there are n patients allocated to the experimental treatments in the trial across B blocks.
For simplicity, we consider testing each elementary null hypothesis H_k: μ_k = μ_0, where μ_k and μ_0 are the expected responses from patients allocated to experimental treatment k and the control, respectively. Extensions to intersection hypotheses within a closed test procedure will be discussed in Section 2.2. Let a_i denote the treatment allocation for the i-th patient on the experimental treatment arms, where a_i = k if patient i is allocated to experimental treatment k. We assume that each of the observations on the control arm follows a N(μ_0, σ²) distribution, and that X_i | (a_i = k) ∼ N(μ_k, σ²) is the response obtained from patient i given that they are allocated to experimental treatment group k. The variance σ² is assumed known and, without loss of generality, we set σ² = 1.
The standard z-test for H_k uses the test statistic based on X̄_k − X̄_0, where X̄_0 and X̄_k are the mean responses for patients on the control and treatment k, respectively. However, if RAR is used, the distribution of the z-test statistic is affected and hence the type I error rate may no longer be controlled, with the FWER being highly inflated under certain types of RAR.
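As a minimal illustration of the unadjusted analysis, the standard z-statistic comparing treatment k with the control (with σ² = 1, as assumed above) can be sketched as follows; the function name is illustrative:

```python
import math

def z_statistic(x_treat, x_ctrl):
    """Standard z-statistic for H_k: mu_k = mu_0 with known variance 1.

    x_treat, x_ctrl: lists of responses on treatment k and the control."""
    n_k, n_0 = len(x_treat), len(x_ctrl)
    diff = sum(x_treat) / n_k - sum(x_ctrl) / n_0
    return diff / math.sqrt(1.0 / n_k + 1.0 / n_0)
```

Under fixed randomization this statistic is N(0, 1) under H_k; under RAR, as noted above, its null distribution is distorted.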
In order to tackle this issue, as in Robertson and Wason 4 we introduce a hypothetical 'auxiliary design', which can be thought of as one of the allowed randomization lists, chosen before the beginning of the trial. Unless there are reasons to choose otherwise, a default option would be to use fixed (equal) randomization to reflect the uncertainty before the trial begins over which of the treatment options will be superior. Since the allocations to the treatments are all fixed in advance in the auxiliary design, the standard z-test is a valid test that controls the type I error rate.
For the auxiliary design, let b_i denote the treatment allocation for the i-th patient on the experimental treatment arms. We assume Y_i | (b_i = k) ∼ N(μ_k, 1) is the response of patient i, given that they were allocated to treatment group k in the auxiliary design. For i ≤ b (i.e. during the burn-in period), we simply set a_i = b_i. Afterwards, the two designs diverge according to the RAR scheme used.
In this setup, the actual trial design (i.e. the realized allocations of patients to treatments using RAR) can be viewed as a series of data-dependent modifications of the auxiliary design, where we account for these modifications using the CIV principle. Robertson and Wason 4 proposed a modified test statisticT k for testing the hypothesis H k . This test statistic is a difference of weighted means of the observations on the treatment arm k and the control arm, where the weights are calculated recursively based on the number of allocations to the experimental treatments and the control.
We now describe a simplification of the handling of the control group in this setup. The basic idea of the modified proposal is to only match the variances of the test statistics under the auxiliary and actual trial designs, and not matching the means. Due to this, the CIV principle may no longer be used as a justification for the algorithm, but a minor modification of it still applies.
Let X̄_k(n_{0,k}) ∼ N(μ_k, 1/n_{0,k}) be the mean response in treatment group k after the burn-in period, where n_{0,k} patients have been allocated to treatment k according to a pre-specified allocation scheme. For each subsequent block j of recruited patients (j = 1, …, B), the randomization probability for treatment k may be changed in some way such that in block j, we have ñ_{j,k} patients in treatment arm k instead of the pre-planned n_{j,k} patients from the auxiliary design. We proceed in this fashion up to a final block B. In every block, we assume that at least one patient will be randomized to each treatment k. Finally, we let m_{j,k} = n_{j,k} + ⋯ + n_{B,k} denote the sum of the sample sizes of treatment group k in blocks j, …, B of the auxiliary design. For notational simplicity, we define m_{0,k} = n_k, that is, n_k is the total number of patients allocated to treatment k in the auxiliary design.
The following formulae define the resulting summary statistics and their (conditional) distributions.

Block 0: This is the burn-in period. After this block, we define U_0 = (n_{0,k}/n_k) X̄_k(n_{0,k}) ∼ N((n_{0,k}/n_k) μ_k, n_{0,k}/n_k²) and set w_0 = n_k.

Block 1:
We define weighted summary statistics for the following quantities: the data from block 1 of the auxiliary design (U_1); the data from block 1 of the actual design (Ũ_1); and the data from blocks 2, …, B of the auxiliary design. Here, Ȳ_k(n_{1,k}) is the mean response of patients in block 1 who are allocated to treatment k in the auxiliary design, X̄_k(ñ_{1,k}) is the mean response of patients in block 1 who were allocated to treatment k in the actual design, and Ȳ_k(m_{2,k}) is the mean response of patients in all blocks after block 1 who are allocated to treatment k in the auxiliary design. Since ñ_{1,k} is a function of the entire data from the patients in block 0, the given distribution of Ũ_1 is conditional on the block 0 data. We now form the overall test statistics U^{(1)} and Ũ^{(1)} for blocks 1, 2, …, B in the trial, using the block 1 data from the auxiliary and actual designs, respectively. We then choose the weight w_1 so that the conditional variances of U^{(1)} and Ũ^{(1)} are matched. Hence w_1 = w_0 · √((ñ_{1,k} + m_{2,k})/(n_{1,k} + m_{2,k})), which determines the conditional distribution of Ũ^{(1)} given the block 0 data.

Block j ∈ {2, …, B − 1}:
After every block, the randomization probabilities may be changed. For block j we define the analogous weighted summary statistics: the data from block j of the auxiliary design (U_j); the data from block j of the actual design (Ũ_j); and the data from blocks j + 1, …, B of the auxiliary design. Here, Ȳ_k(n_{j,k}) is the mean response of patients in block j on treatment k in the auxiliary design, X̄_k(ñ_{j,k}) is the mean response of patients in block j on treatment k in the actual design, and Ȳ_k(m_{j+1,k}) is the mean response of patients in blocks j + 1, …, B on treatment k in the auxiliary design. Since ñ_{j,k} is a function of the entire data from the patients in blocks 0, 1, …, j − 1, the given distribution of Ũ_j is conditional on these data. We form the overall test statistics U^{(j)} and Ũ^{(j)} for blocks j, j + 1, …, B in the trial, using the block j data from the auxiliary and actual designs, respectively. We choose the weight w_j so that the conditional variances of U^{(j)} and Ũ^{(j)} are matched. Hence w_j = w_{j−1} · √((ñ_{j,k} + m_{j+1,k})/(n_{j,k} + m_{j+1,k})), and the conditional distribution of Ũ^{(j)} given the data from blocks 0, …, j − 1 follows.

Block B:
Since no additional block follows, we match the conditional variances of U^{(B)} and Ũ^{(B)} directly, and hence w_B = w_{B−1} · √(ñ_{B,k}/n_{B,k}).
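A sketch of the resulting weight recursion, assuming the per-arm block sample sizes are recorded for both the auxiliary and the actual design, and reading the variance-matching ratio as (ñ_{j,k} + m_{j+1,k})/(n_{j,k} + m_{j+1,k}); function and variable names are illustrative:

```python
import math

def variance_matching_weights(w0, n_planned, n_actual):
    """Recursive block weights for one experimental arm k.

    n_planned[j]: block-(j+1) sample size on arm k in the auxiliary design.
    n_actual[j]:  realized block-(j+1) sample size on arm k under RAR.
    Returns [w_1, ..., w_B]; all weights stay positive by construction,
    since every block contains at least one patient on each arm."""
    weights = [w0]
    for j in range(len(n_planned)):
        # Auxiliary-design patients still to come after block j+1 (m_{j+2,k}).
        m_rest = sum(n_planned[j + 1:])
        ratio = (n_actual[j] + m_rest) / (n_planned[j] + m_rest)
        weights.append(weights[-1] * math.sqrt(ratio))
    return weights[1:]
```

If the RAR scheme never deviates from the auxiliary design (ñ_{j,k} = n_{j,k} for all j), every weight equals w_0, recovering the unadjusted analysis.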

Final test statistic:
At the end of this procedure, we obtain the test statistic Ũ, whose unconditional null distribution cannot be easily derived due to the response-adaptive modifications.
Since we did not modify the randomization to the control treatment, we have a consistent estimate of μ_0 from X̄_0(n_0) ∼ N(μ_0, 1/n_0), where n_0 is the total number of patients on the control and X̄_0(n_0) is the mean response in the control group across all blocks. If we had known μ_0 at the time of doing the blockwise RAR, we could simply have subtracted μ_0 from every observation X_i and run the algorithm in the described way on X_i − μ_0 instead of X_i. Since E(X_i | a_i = k) = μ_k = μ_0 under H_k, this would have led to E(Ũ_j) = 0 for j = 0, 1, …, B and removed the need to match the weights on the means as well. As we have a record of all weights w_1, …, w_B, we can do this 'post-hoc' using X̄_0(n_0) as a consistent estimator of μ_0, which yields the final test statistic T̃. If σ² is not known, and hence must be estimated from the data, T̃/σ̂ is asymptotically normally distributed.
Regarding the estimation of σ², the suggested approach is no different from other group-sequential or adaptive designs. The most natural choice would be to use the usual pooled estimators σ̂²_j per block j from the observed individual responses x_{ijk}, and then combine these in some appropriate way (e.g. by averaging them with equal weights after every block j). Since the estimates of the mean and variance are stochastically independent in this setting, these per-block estimates are all consistent estimates of σ².
Early stopping: The approach can be modified to allow early stopping for efficacy if the number of control patients is fixed per block. In that case, we can set up the approach similarly to a group-sequential trial with α-spending. Rather than calculating T̃ only once at the end of the trial after block B, we would allow an interim analysis after block S < B. The corresponding test statistic T̃^{(S)} is calculated by treating S as the last block (i.e. setting B = S in the above algorithm for T̃) and using X̄_0(n_{S,0}) instead of X̄_0(n_0), where n_{S,0} is the total number of patients allocated to the control in blocks 0, 1, …, S and X̄_0(n_{S,0}) is the mean response of these patients. Since the numbers of control observations in blocks 0, 1, …, B are fixed in advance, X̄_0(n_{S,0}) ∼ N(μ_0, 1/n_{S,0}). If there is a sequence α_1, …, α_B with ∑_{j=1}^B α_j = α, then rejecting H_k when T̃^{(S)} ≥ Φ^{−1}(1 − α_S) controls the type I error rate for treatment k by the Bonferroni inequality. In ordinary group-sequential designs, the correlation between test statistics from different stages is known (Jennison and Turnbull 5), allowing an improvement over Bonferroni (i.e. assigning a sequence of test levels α_1, …, α_B with a sum larger than α, but still preserving the type I error rate of α). Unfortunately, this improvement is not available here, since the correlation between T̃^{(S)} and T̃ depends on v_{S+1}, …, v_B, which are not available at the time of the interim analysis. The correlation could be calculated if, after the interim analysis, the auxiliary design were strictly followed for blocks S + 1, …, B. However, this would defeat the purpose of RAR.
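The Bonferroni-based interim decision rule just described can be sketched as follows, assuming a pre-specified spending sequence α_1, …, α_B; the function name is illustrative:

```python
from statistics import NormalDist

def reject_at_interim(t_stat, alpha_spent):
    """Reject H_k at an interim look if the adjusted statistic exceeds the
    critical value for the alpha spent at this look (alpha_spent = alpha_S,
    where the alpha_j sum to the overall level alpha)."""
    return t_stat >= NormalDist().inv_cdf(1.0 - alpha_spent)
```

For example, with α_S = 0.025 the critical value is Φ^{−1}(0.975) ≈ 1.96.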
Note that a test of T̃^{(S)} at level α at a random time S would only control the type I error rate if S were stochastically independent of T̃^{(1)}, …, T̃^{(S−1)} under H_k. Clearly, calculating T̃^{(j)} after every block and then deciding when to test based on the observed values could lead to an inflation of the type I error rate.

Connection with existing approaches
There is a close connection between the proposal of Robertson and Wason 4 (and hence our modified proposal) and more traditional adaptive designs. Assume that we want to test the one-sample hypothesis H_0: μ = μ_0 adaptively. Before the start of the study, we plan a first interim analysis after m_0 patients and a weight w_0 ≤ 1 for this first step. At the interim, we calculate the test statistic T_0 = √m_0 (X̄(m_0) − μ_0). Subsequently, we pick a new sample size m̃_1 for the next stage (up to the next interim) and a weight w_1 for this stage. The weight and m̃_1 may both depend on the data from stage 0, but the restriction w_0² + w_1² ≤ 1 must be obeyed. At the end of the first stage, we calculate T_1 = √m̃_1 (X̄(m̃_1) − μ_0) given the stage 0 data and combine it with T_0 to obtain T^{(1)}, which is N(0, 1) under H_0. At the next stage (stage 2), we pick a new sample size m̃_2 and weight w_2 for the following stage, with w_0² + w_1² + w_2² ≤ 1. We combine the test statistics T_2 = √m̃_2 (X̄(m̃_2) − μ_0) and T^{(1)} using w_1, that is, by forming a test statistic T^{(2)} = w_1 T^{(1)} + √(1 − w_1²) T_2. We continue in this fashion until at some point we decide to call the final analysis B. Then, at the penultimate analysis (the last interim before this final one), we spend the rest of the weights such that ∑_{j=1}^B w_j² = 1. Hence, the last weight w_B = √(1 − ∑_{j=1}^{B−1} w_j²) is fixed by design. The weights w_j are allowed to depend on all data up to interim analysis j − 1, so the weights from later stages can be iterated updates of previous weights. This approach was described by Brannath et al. 6 The proposals described in this article and by Robertson and Wason 4 can be viewed as an application of this general procedure, with a special way of calculating the weights and a 'time horizon' (in terms of the total number of patients) which is given from the start.
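This construction rests on the classical inverse-normal combination of stagewise statistics; a minimal sketch (using fixed weights with ∑ w_j² = 1 rather than the iterated weight updates described in the text):

```python
import math

def inverse_normal_combination(stage_stats, weights):
    """Combine stagewise z-statistics T_j with weights w_j.

    If each T_j ~ N(0, 1) under H_0 (conditionally on earlier stages) and
    sum(w_j^2) == 1, the combined statistic is again N(0, 1) under H_0,
    regardless of the data-dependent choice of stage sample sizes."""
    assert abs(sum(w * w for w in weights) - 1.0) < 1e-9
    return sum(w * t for w, t in zip(weights, stage_stats))
```

This is why the type I error rate is preserved even though the stage sample sizes are chosen adaptively: the weights, not the realized sample sizes, determine the null distribution.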
In addition, the method described in this article does not combine standardized stagewise test statistics directly, but rather corrects for fluctuations in the expected values by adjustment at the end of the trial.

Application within a closed test procedure
The approach described above for controlling the type I error rate for a single hypothesis H_k generalizes to controlling the FWER by applying a closed test procedure (CTP). In the CTP (Marcus et al. 7), H_k, k = 1, …, K is rejected if and only if all H_I, I ⊆ {1, …, K} with k ∈ I are rejected at level α, where H_I = ∩_{k∈I} H_k. In order to test H_I: μ_k = μ_0 ∀k ∈ I, several approaches can be considered. For example:
• All observations of the experimental treatment arms in I are pooled and treated as a single treatment arm to test against the control treatment with the approach described in Section 2. This is called the 'closed z-test' in the simulations below. This approach will lead to low power unless all treatments in I are truly effective. Otherwise, the test statistic for the treatments in I will be 'diluted' towards the null by the contribution of the treatments in I that are truly null.
• Assume that the approach from Section 2 is used on all experimental treatment arms separately, leading to test statistics T̃_k, k = 1, …, K. Using max_{k∈I} T̃_k as the test statistic for H_I, H_I is rejected if max_{k∈I} T̃_k ≥ Φ^{−1}(1 − α/|I|). This is the Bonferroni-Holm method in the simulations of Section 3.
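The second approach (max statistic with a Bonferroni adjustment inside the closed test, which is equivalent to Bonferroni-Holm) can be sketched as follows; the function name is illustrative:

```python
from itertools import combinations
from statistics import NormalDist

def closed_max_test(t_stats, alpha=0.05):
    """Closed test: reject H_k iff every intersection hypothesis whose index
    set contains k is rejected, where an intersection over index set I is
    rejected when max_{k in I} T_k >= Phi^{-1}(1 - alpha/|I|).

    t_stats: dict mapping arm index k -> adjusted test statistic T_k.
    Returns the set of arm indices whose elementary hypotheses are rejected."""
    inv_cdf = NormalDist().inv_cdf
    arms = list(t_stats)
    rejected = set(arms)
    for size in range(1, len(arms) + 1):
        crit = inv_cdf(1.0 - alpha / size)
        for subset in combinations(arms, size):
            if max(t_stats[k] for k in subset) < crit:
                # This intersection is not rejected, so by the closed test
                # principle no H_k with k in it can be rejected.
                rejected -= set(subset)
    return rejected
```

Enumerating all subsets is exponential in K, but for the small number of arms typical in multi-arm trials this is not a practical concern.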
In contrast, a 'Dunnett-like' closed test (see Magirr et al. 8) is not straightforward. The marginal null distribution of T̃_k is N(0, 1), but the conditional correlation of T̃_{k_1} and T̃_{k_2} is not independent of the sample size modifications induced by the RAR procedure.

Simulation studies
To investigate the operating characteristics of the suggested design, we use the set-up of a trial with B = 3 blocks (not including the burn-in), with block sizes (40, 40, 40) for all of the experimental treatments and (20, 20, 20) for the control.
In the burn-in period, five patients are allocated to each of the treatments, including the control. We set the true control mean μ_0 = 0 and the significance level α = 0.05. The auxiliary designs in all scenarios were simply random draws from a discrete uniform distribution on {1, …, K}, where K is the number of experimental treatment arms. The R code to generate the data and reproduce the simulation results can be found at https://github.com/dsrobertson/FWER_block_RAR.

Bayesian adaptive randomization
We compare the methods under a Bayesian adaptive randomization (BAR) scheme. Following Robertson and Wason, 4 we use a block-randomized BAR scheme by Wason et al. 9 The randomization probabilities (π_1, …, π_K) for the experimental treatments at the (j + 1)th stage are given by π_k ∝ P(μ_k > μ_0 | x)^γ, where P(μ_k > μ_0 | x) is the posterior probability that μ_k is greater than μ_0 given the observed data x; see the Supplemental material for full details. In our simulations, we set γ = 0.5. In addition, since our proposal requires at least one patient to be allocated to each experimental treatment per block, we ensure this by allocating the last K* ≥ 0 patients in each block to each of the K* experimental treatments that have zero observations.
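A sketch of such a BAR update, assuming the common form in which each allocation probability is proportional to the posterior probability of beating the control raised to a power γ; this proportional form, the function name, and passing the posterior probabilities in directly (rather than computing them from a posterior model) are assumptions made here for illustration:

```python
def bar_probabilities(post_probs, gamma=0.5):
    """Allocation probabilities pi_k proportional to
    P(mu_k > mu_0 | data)^gamma, normalized to sum to one.

    post_probs[k]: posterior probability that arm k beats the control."""
    powered = [p ** gamma for p in post_probs]
    total = sum(powered)
    return [p / total for p in powered]
```

With γ = 0.5 the skewing towards apparently better arms is damped relative to γ = 1, while γ = 0 recovers equal randomization.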

Error inflator scheme
To assess the FWER and power in a situation where type I error control is known to be violated, we also investigate the allocation scheme presented in Section 2.3 of Robertson and Wason, 4 adapted to block randomization. This rule keeps allocating patients to treatment 1 (apart from one patient per block to each of the other experimental treatments) as long as the mean response of treatment 1 remains below a fixed threshold of 0.5. As soon as the threshold is crossed, all subsequent patients not randomized to the control are allocated with equal probability to the other experimental treatments (except for one patient per block on treatment 1). Full details are in the Supplemental material.
Tables 1 and 2 show the weights from two simulations under BAR and the error inflator scheme, respectively. Throughout, we set μ_1 = 0 and μ_2 = 1 for the experimental treatments. The proposal from Robertson and Wason, 4 denoted RW, is compared with that from Section 2. In both examples, the weights can be very different for the two methods and also differ from both the observed sample sizes and the sample sizes in the auxiliary design. Both weight calculations up-weight the treatment arm with fewer allocations, while the RW weights tend to be more variable overall. The weight calculation from Robertson and Wason also produces negative weights for the control group under the error inflator scheme, something that is not possible with the calculation from Section 2. Note that these statements refer to a single simulation run. For the error inflator scheme, this is a case where treatment 2 crossed the threshold after block 0. Observed sample sizes and weights are very different for other simulation runs where this does not happen.
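The per-block allocation under this rule can be sketched as follows; the deterministic equal split over the remaining arms replaces the 'equal probability' randomization of the original scheme, and all names are illustrative:

```python
def error_inflator_allocation(block_size, n_arms, mean_arm1, threshold=0.5):
    """Experimental-arm allocation counts for one block.

    While the running mean of arm 1 is below the threshold, pile patients
    onto arm 1 (with one patient per block on every other arm); once the
    threshold is crossed, keep one patient on arm 1 and split the remaining
    patients over the other arms."""
    counts = [0] * n_arms
    if mean_arm1 < threshold:
        counts[0] = block_size - (n_arms - 1)
        counts[1:] = [1] * (n_arms - 1)
    else:
        counts[0] = 1
        base, extra = divmod(block_size - 1, n_arms - 1)
        counts[1:] = [base + (1 if k < extra else 0) for k in range(n_arms - 1)]
    return counts
```

Because the switch depends on the observed responses, the realized sample sizes carry information about the data, which is exactly what inflates the naive z-test.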

Simulation results
To investigate the performance of the various approaches, we conducted simulations for both the BAR and the error inflator scheme. As a standard comparison, we also provide simulation results for fixed (equal) randomization in the Supplemental material. The weighting approach from Robertson and Wason, 4 the proposal from Section 2 and the naive approach (treating observed sample sizes as if they had been fixed in advance) were used. In all these approaches, the closed test procedure and the Bonferroni-Holm procedure are applied to adjust for the multiplicity arising from the testing of experimental treatments against a common control. In Tables 3 and 4, disjunctive power is the probability of rejecting at least one false null hypothesis (if there is one) and error is the FWER. Nominal test levels are 5%.
Table 3 shows the results for the BAR scheme. As is well known, the closed test procedure has a slight power advantage if the treatments are equally effective, but is inferior when one of the treatments is effective and the other(s) are not. FWER inflation did not occur in the simulations, even when the naive approach is used. The naive, the RW and the proposed approach lead to practically identical type I error rates and power here. Although our new method does not perform better in terms of power, we consider it reassuring that it does not lose any power compared to the standard analysis either. Hence, in this specific case, there is no 'price to pay' for the guarantee of type I error control.
The results for the error inflator scheme are shown in Table 4. We see that the error inflator scheme indeed does not control the FWER. The inflation remains modest, with a FWER not exceeding 7.5% in any of the simulation scenarios. This is, however, an inflation which is clearly beyond random simulation variation and which would not be acceptable in a confirmatory clinical trial. As expected, no FWER inflation arises when the two adaptive test methods are used. In line with what Robertson and Wason 4 observed, there is a price to pay for the FWER control: both methods tend to suffer power losses relative to their naive counterparts. The proposed procedure, however, has higher power than the RW approach and for some scenarios the gain is substantial. For example, in scenarios 2, 5, 6 and 7, the power gain for the new Holm test compared with the RW test is between 10% and 20% in absolute terms. We hypothesize that this is because the variation of the weights is limited and cannot diverge as wildly from the observed sample sizes as it might with the RW approach (as illustrated in Table 2).

Case study
In this section, we revisit the case study used in Robertson and Wason 4 of a phase II placebo-controlled trial in primary hypercholesterolemia to compare the effects of using the SAR236553 antibody with high-dose or low-dose atorvastatin, as compared with high-dose atorvastatin alone (Roth et al. 10). The primary outcome was the least-squares mean percent reduction from baseline of low-density lipoprotein cholesterol. Patients were randomly assigned, in a 1:1:1 ratio, to receive 80 mg of atorvastatin plus placebo, 10 mg of atorvastatin plus SAR236553, or 80 mg of atorvastatin plus SAR236553. For convenience, we label these interventions as the 'control', 'low dose' and 'high dose', respectively. We use the observed values from the trial and assume that the primary outcome variable is distributed according to N(17.3∕3.5, 1) for the control, N(66.2∕3.5, 1) for the low dose and N(72.3∕3.5, 1) for the high dose, respectively. We suppose the trial is conducted using the BAR scheme presented in Section 3.1, with three blocks each of size 15 for the experimental doses (low and high doses). In the burn-in period, eight patients are allocated to the low dose and eight patients to the high dose. As in the actual trial, a total of 31 patients are allocated to the control, and 61 to the experimental doses. Table 5 gives the results for a simulated trial with standard weights (i.e. realized sample sizes) of 31 for the control, 27 for the low dose and 34 for the high dose. For the experimental doses, the realized breakdown per block (excluding the burn-in period) was (6, 6, 7) for the low dose and (9, 9, 8) for the high dose, respectively. We see that both the RW approach and our proposed approach give similar test statistics and weights in this case. In particular, given that all the observed p-values are < 0.001, using either approach would lead to rejecting the null hypothesis of no treatment effect.

Discussion
In this article, we have proposed an improved testing strategy based on the one by Robertson and Wason, 4 which guarantees FWER control in the context of block-randomized response-adaptive trials with a fixed control allocation. Our proposal is simpler but more restrictive, as it is not applicable to fully sequential RAR or to trials with an adaptive control allocation. However, our proposal guarantees that the weights are non-negative, and there can be substantial power gains in some settings.
As noted by Robertson and Wason, 4 since the proposed testing procedure is based on the CIV principle, it has the additional important flexibility of being valid when the allocation is changed due to external information. Our proposal is also designed for normally distributed outcomes, although it can be applied to other types of outcomes asymptotically. However, a natural extension of this work would be to work directly with binary endpoints (for example) and potentially apply the CIV principle to this setting.
The use of the CIV principle to 'reweight' the test statistics raises interesting questions around the design of optimal response-adaptive trials (i.e. the formulation of RAR procedures that optimize certain criteria). For example, some RAR procedures incorporate a formal power constraint, but this is based on standard test statistics. If an alternative testing strategy such as our proposed one is used, then there is a mismatch between the optimality criterion and the subsequent analysis of the trial.
More generally, it is important to remember that there can be trade-offs between the different objectives in a trial. For example, we have seen that insisting on the use of RAR methods which guarantee FWER control can lead to a substantial loss in power. As another example, more 'extreme' RAR procedures (i.e. those that skew the randomization probabilities close to 0 or 1) that perform well in terms of patient benefit metrics may conversely have low power and mean that analysis methods developed for a completely random sample are no longer appropriate. Hence the question of whether to use RAR as opposed to a fixed randomisation scheme is not a simple one, and crucially depends on the trial context and goals.

Data availability
The R code to generate the data and reproduce the results in this article can be found at https://github.com/dsrobertson/FWER_block_RAR.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: DSR was funded by the UK Medical Research Council (MC_UU_00002/14) and the Biometrika Trust.

Supplemental material
Supplemental material for this article, including Web Appendices and tables, is available online.