Covariate adjustment in Bayesian adaptive randomized controlled trials

In conventional randomized controlled trials, adjustment for baseline values of covariates known to be at least moderately associated with the outcome increases the power of the trial. Recent work has shown a particular benefit for more flexible frequentist designs, such as information adaptive and adaptive multi-arm designs. However, covariate adjustment has not been characterized within the more flexible Bayesian adaptive designs, despite their growing popularity. We focus on a subclass of these which allow for early stopping at an interim analysis given evidence of treatment superiority. We consider both collapsible and non-collapsible estimands and show how to obtain posterior samples of marginal estimands from adjusted analyses. We describe several estimands for three common outcome types. We perform a simulation study to assess the impact of covariate adjustment using a variety of adjustment models in several different scenarios. This is followed by a real-world application of the compared approaches to a COVID-19 trial with a binary endpoint. For all scenarios, it is shown that covariate adjustment increases power and the probability of stopping the trials early, and decreases the expected sample sizes as compared to unadjusted analyses.


Introduction
In conventional, fixed size randomized controlled trials (RCTs), adjustment for baseline values of covariates known to be at least moderately associated with the outcome has been shown to increase the power of the trial [Hernández et al., 2004, Hernández et al., 2006, Benkeser et al., 2021, Harrell and Slaughter, 2021]. This is because covariate adjustment improves the precision of the estimated treatment effect and accounts for the outcome heterogeneity within each treatment arm that is explained by the adjustment variables [Lewis, 1999, Senn, 2013, Benkeser et al., 2021]. Adjustment also corrects for any chance imbalance of important baseline variables which may exist post-randomization [Kahan et al., 2014]. Therefore, covariate adjustment in the primary analysis of clinical trials is now recommended by both the US Food and Drug Administration (FDA) and the European Medicines Agency (EMA) [US Food and Drug Administration, 2021, European Medicines Agency, 2015]. Additionally, systematic reviews have suggested its use in practice has grown over time [Austin et al., 2010, Ciolino et al., 2019]. Recently, these power increases have been demonstrated in more flexible frequentist designs, such as information adaptive and adaptive multi-arm designs [Van Lancker et al., 2022, Lee et al., 2022]. However, the simulation scenarios investigated in each of these designs contained at least one of the following: only continuous outcomes with no treatment-covariate interactions, where the marginal and conditional estimands are the same; only a single sample size; a data generating process containing only a small number of variables; or only a small number of covariate adjustment models. A more comprehensive investigation is needed to better understand the benefits of covariate adjustment under a broader array of flexible design scenarios.
While the impact of covariate adjustment has been demonstrated in the flexible frequentist designs mentioned above, it has not been characterized within flexible Bayesian adaptive designs, where early stopping at an interim analysis is permitted given evidence of treatment superiority or futility. In these designs one may learn about a treatment effect while potentially requiring fewer participants than fixed designs. Additionally, Bayesian trials allow for seamless incorporation of prior information for model parameters including covariate effects. The impact of combining prior information with covariate adjustment has not been previously investigated. With the growing interest in Bayesian adaptive designs, a characterization of covariate adjustment for several commonly used sample sizes and outcome types would be highly valuable for researchers.
In this work, we consider the impact of covariate adjustment in Bayesian adaptive designs which allow for early stopping for superiority and provide a step-by-step tutorial for post-adjustment marginalization. We explore several data generating processes for continuous, binary, and time-to-event outcomes, and consider adjustment models which include several forms of misspecification while incorporating varying levels of prior information for the covariate effects. The covariate adjustment described herein is performed using generalized linear models (GLMs). However, the methods, results, and recommendations discussed below are not specific to these models, and are expected to generalize to other parametric and nonparametric models.
The manuscript is organized in the following manner. We first introduce and describe Bayesian adaptive designs which allow for early stopping, as well as the targets of inference (i.e., estimands), which are marginal treatment effects for each endpoint. We then describe the specific collapsible and non-collapsible estimands used in this manuscript. For the non-collapsible estimands, we describe their estimation and marginalization through a Bayesian framework. This is followed by a simulation study which shows the impact of covariate adjustment on design operating characteristics including power, the probability of stopping the trial early, and expected sample size, for multiple sample sizes. We then show a real-world application of covariate adjustment within a COVID-19 RCT and end with a discussion.

Bayesian Adaptive Designs with Early Stopping
Bayesian adaptive designs allow for predetermined changes to the trial design at interim analyses based on evidence provided by the accumulating data [Berry et al., 2010, Giovagnoli, 2021]. These designs include sequentially randomized trials which allow for early stopping for superiority or futility at interim analyses. Interim analyses are performed to determine whether to stop the trial early and declare treatment superiority or futility, or to continue the trial. The decision to stop the trial early at an interim analysis is controlled by a predefined decision rule. Decision criteria may be defined with respect to different functionals of the posterior distribution of the parameter of interest or estimand.
Posterior or posterior predictive probability statements about the estimand are commonly used statistics.In the RCT setting, estimands are typically defined as marginal treatment effects in a specified population of interest.We adopt this convention throughout, but delay further discussion of marginal estimands and their posterior estimation until the next section.
In the Bayesian adaptive designs described in this work, interim or final decisions are defined with respect to posterior probability statements about a marginal treatment effect, γ(θ), which is a function of model parameters θ. The alternative hypothesis of the trial is formulated as this marginal treatment effect being greater than a clinically meaningful threshold, γ_0. A Bayesian test statistic can be defined as the posterior probability of this alternative hypothesis given the data, D_{n_t} = {Y_{n_t}, A_{n_t}, X_{n_t×p}}, which may include any observed outcomes (Y_{n_t}), treatment assignments (A_{n_t}), and p additional covariates (X_{n_t×p}), for the n_t participants who are enrolled in the trial at an interim or final analysis conducted at time t:

T(D_{n_t}) = P(H_A | D_{n_t}) = P(γ(θ) > γ_0 | D_{n_t}).    (1)

This statistic is then used to define a decision rule, which declares treatment superiority at any interim or final analysis if the statistic exceeds some upper probability threshold, u, i.e., if T(D_{n_t}) > u. If a superiority declaration is made at an interim analysis, the trial stops early. The common approach in the design of Bayesian adaptive trials is the "hybrid" approach, where the upper probability threshold, u, is optimized so that the trial design has desirable values of frequentist operating characteristics [Berry et al., 2010]. For example, power (P) and the Type 1 error rate (T1E) are defined as the probabilities of declaring superiority when the data are generated under the alternative and null treatment effects, respectively. Since the sampling distribution of a Bayesian posterior probability is generally unknown, calibration of the design to meet frequentist operating characteristics requires simulation studies. Note that models that adjust for covariates under most settings result in analytically intractable posterior distributions. Therefore, the evaluation of T(D_{n_t}) within every trial simulation requires posterior sampling or approximation techniques. In this paper we use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution when it is not available in closed form.
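As a concrete sketch, once posterior draws of the marginal effect γ(θ) are available (via MCMC in practice; here stand-in draws from a normal distribution), the test statistic and stopping decision take only a few lines. The numeric values below are illustrative assumptions, not the paper's calibrated settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in posterior draws of the marginal treatment effect gamma(theta);
# in a real analysis these would come from MCMC on the outcome model.
gamma_draws = rng.normal(loc=0.4, scale=0.1, size=3000)

gamma_0 = 0.0   # clinically meaningful threshold (illustrative)
u = 0.99        # upper probability threshold for declaring superiority

# Bayesian test statistic: Monte Carlo estimate of P(gamma > gamma_0 | D).
T = np.mean(gamma_draws > gamma_0)

# Decision rule: declare superiority (and stop early at an interim look)
# if the posterior probability exceeds the threshold u.
declare_superiority = T > u
```

With these stand-in draws nearly all posterior mass lies above γ_0, so the rule declares superiority; calibrating u so the design attains target frequentist operating characteristics is what the simulation studies below do.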

Estimands and Bayesian Estimation
In what follows, let A_i be defined as a binary treatment assignment for the i-th participant, where A_i = 1 represents being randomized to the treatment group and A_i = 0 to the control group. Let X_{n_t×d} represent a matrix of j = 1, ..., d covariates measured at baseline for i = 1, ..., n_t participants in the study at an interim or final analysis conducted at time t. Let x_i = (x_1, x_2, ..., x_d)_i represent the row of this matrix corresponding to the full covariate pattern of the i-th participant, and let Y_i be an arbitrary outcome of interest for the i-th participant. For notational simplicity, subscripts are dropped for the remainder of this article except when strictly necessary. Treatments are assigned through simple randomization and are thus independent of all covariates measured at baseline (i.e., A ⊥⊥ x), but we allow for the possibility of chance imbalance of covariates between treatment groups.
We use the term "unadjusted analysis" to refer to a model which includes only the treatment assignment indicator (A), and the term "adjusted analysis" for a model which additionally includes p ≤ d covariates selected from X, with x_i = (x_1, ..., x_p)_i representing the i-th participant's covariate pattern used for adjustment. For adjusted analyses, many different covariate sets may be adjusted for in addition to the binary treatment indicator, including any covariates Z ⊆ X for which a treatment-covariate interaction exists. Let φ be the regression coefficient for the treatment assignment indicator, β_0 be the intercept, β = {β_1, ..., β_p} be the vector of covariate main effects, and ω = {ω_1, ..., ω_m} be the vector of treatment-covariate interaction effects for those covariates used in the adjustment model within a GLM setting. Additionally, let ζ be a set of nuisance parameters not of direct interest but which are required for model specification (e.g., baseline hazard parameters in a time-to-event setting). Then let θ = {φ, β_0, β, ω, ζ} denote the collection of model parameters.

In this manuscript, the decision rule used at an interim or final analysis is defined with respect to a posterior probability statement about a marginal estimand γ(θ), presented in Equation 1, which is the average treatment effect in a population of interest. We note this marginal estimand can be defined as a contrast of population average quantities, such as a difference of means or a ratio of population-level event or survival probabilities. Let µ(θ; A) be the population level parameter for treatment group A used as input for the contrast f(·). Then γ(θ) can be represented by

γ(θ) = f(µ(θ; A = 1), µ(θ; A = 0)).

Unadjusted analyses yield posterior samples of the marginal parameter µ(θ; A) directly. Adjusted analyses, however, yield posterior samples of the conditional parameter µ(θ; A, X) for a fixed covariate pattern X. When a treatment effect is collapsible, it can be represented as a contrast between either the marginal or the conditional parameters for a fixed covariate pattern X:

f(µ(θ; A = 1), µ(θ; A = 0)) = f(µ(θ; A = 1, X), µ(θ; A = 0, X)).

Thus, under collapsibility, samples from the posterior distribution of the marginal estimand can be obtained from either an unadjusted or adjusted analysis. When a treatment effect is non-collapsible, the marginal treatment effect cannot be represented as a contrast between conditional parameters for a fixed covariate pattern X [Daniel et al., 2021]. This is commonly the case for treatment effects modeled by GLMs or in the presence of treatment-covariate interactions.
As an example, consider a hypothetical RCT with a binary endpoint which follows the logistic regression model

logit{P(Y = 1 | A, X)} = φA + βX,

with binary treatment assignment A and binary covariate X, where P(X = 1) = P(A = 1) = 0.5 and where φ = log(5), β = log(10), and θ = {φ, β}. Define µ(θ; A, X) to be the treatment specific conditional risk. The conditional odds ratio for those who are treated versus untreated can be represented as a contrast of these conditional risk parameters, and is 5 regardless of the value of X. To find the marginal odds ratio, the treatment specific conditional risks must be averaged with respect to the distribution of X before calculating the odds ratio, effectively collapsing over X in a stratified two-by-two table.
Doing so gives treatment specific marginal risks µ(θ; A) which are used to obtain a marginal odds ratio of 4.1. This value is smaller than the conditional odds ratio and shows the marginal odds ratio cannot be represented as a contrast of the treatment specific conditional risks. Thus, the odds ratio is non-collapsible (see Section G of the Online Supplementary Material for more details). Under non-collapsibility, samples from the posterior distribution of the marginal estimand can still be obtained directly from unadjusted analyses, but not from adjusted analyses. To obtain samples from the posterior distribution of the marginal estimand using an adjusted analysis, the posterior samples of µ(θ; A, X) must be marginalized with respect to the distribution of X, yielding samples of µ(θ; A) which are then used in the contrast:

µ(θ; A = a) = ∫ µ(θ; A = a, X) p(X) dX, a ∈ {0, 1}.    (2)

This marginalization is commonly called standardization or Bayesian G-computation for point treatments [Kalton, 1968, Freeman Jr and Holford, 1980, Lane and Nelder, 1982, Saarela et al., 2015, Remiro-Azócar et al., 2020, Keil et al., 2018, Daniel et al., 2021]. The integral over p(X) is approximated through summation where, for each of s = 1, ..., S Monte Carlo samples θ_s from π(θ | D_{n_t}), the following calculation is performed:

µ(θ_s; A = a) = Σ_{i=1}^{n_t} w_{i,s} µ(θ_s; A = a, x_i).

The x_i are the covariate patterns used for adjustment and which are contained in the joint empirical distribution of the collected sample data. The weights w_s = (w_{1,s}, ..., w_{n_t,s}) are sampled as w_s ∼ Dirichlet(1_{n_t}), corresponding to the Bayesian bootstrap, where 1_{n_t} is the n_t-dimensional vector of 1's [Rubin, 1981, Oganisian and Roy, 2021, Linero and Antonelli, 2023]. This can then be used to obtain a single sample from the posterior of the marginal treatment effect:

γ_s = f(µ(θ_s; A = 1), µ(θ_s; A = 0)).

Note that when any X_j for j = 1, ..., p is the propensity score being jointly modeled with the outcome of interest, a different Bayesian bootstrap procedure should be utilized [Stephens et al., 2022]. For the remainder of the manuscript, it is assumed X does not contain a propensity score estimated in this manner.
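A minimal sketch of this Bayesian G-computation step, applied to the worked odds-ratio example above. To keep the arithmetic easy to check, the posterior is treated as a point mass at the true θ; in a real analysis each θ_s would be an MCMC draw:

```python
import numpy as np

rng = np.random.default_rng(2)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Worked example: logit P(Y=1 | A, X) = phi*A + beta*X with X ~ Bernoulli(0.5).
phi, beta = np.log(5.0), np.log(10.0)

# Empirical covariate sample standing in for the collected trial data.
n = 10_000
x = rng.binomial(1, 0.5, size=n)

def marginal_odds_ratio(w):
    """One G-computation step: average the conditional risks over the
    weighted empirical covariate distribution, then form the contrast."""
    mu1 = np.sum(w * expit(phi + beta * x))  # marginal risk under A = 1
    mu0 = np.sum(w * expit(beta * x))        # marginal risk under A = 0
    return (mu1 / (1.0 - mu1)) / (mu0 / (1.0 - mu0))

# One Bayesian-bootstrap weight vector per posterior draw: w_s ~ Dirichlet(1_n).
S = 200
gamma_samples = np.array(
    [marginal_odds_ratio(rng.dirichlet(np.ones(n))) for _ in range(S)]
)

conditional_or = np.exp(phi)        # 5, regardless of X
marginal_or = gamma_samples.mean()  # approximately 4.1
```

The sketch reproduces the non-collapsibility of the odds ratio: the marginalized contrast concentrates near 4.1, below the conditional value of 5.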
The procedure above enables S samples from the posterior distribution of the marginal estimand γ(θ) to be obtained using an arbitrary adjustment model. This allows for a direct performance comparison between adjustment models within Bayesian adaptive designs. Below we describe this procedure within the context of specific models that are most commonly used for different outcome types in clinical trials.
Difference in Means: No Treatment-Covariate Interactions
Consider a continuous outcome under the assumption of no treatment-covariate interactions. The estimand of interest is the marginal difference in means, and estimation proceeds assuming independent outcomes and a linear model. Since this treatment effect is collapsible, posterior samples of the marginal estimand can be obtained from either an unadjusted or adjusted analysis, and samples from the posterior distribution of the treatment indicator coefficient φ are commonly used.

Difference in Means: Treatment-Covariate Interactions
Consider again a continuous outcome, but now under the assumption of at least one treatment-covariate interaction (i.e., Z ≠ ∅; treatment effect heterogeneity). In this case the difference in means is non-collapsible. Estimation proceeds assuming independent outcomes and a linear model including treatment-covariate interaction terms. Since this estimand is non-collapsible, posterior samples of the conditional µ(θ; A, X) must be marginalized using (2) before forming the contrast to obtain a posterior sample of the marginal difference in means. An outline of this procedure is provided in Appendix A in the Online Supplementary Materials.

Relative Risk and Odds Ratio
For a binary endpoint, the relative risk represents the ratio comparing the risk of an event for those assigned to treatment versus those assigned to control. The odds ratio represents the ratio comparing the odds of an event for those assigned to treatment versus those assigned to control. Estimation proceeds assuming independent outcomes and a logistic regression model. To obtain posterior samples of the marginal relative risk or odds ratio, posterior samples of the conditional µ(θ; A, X) must be marginalized using (2) before forming the contrast, an outline of which is provided in Appendix A in the Online Supplementary Materials.

Hazard Ratio
Let T denote the time to an event of interest. Let h(t | A) represent the hazard, the instantaneous event rate at time t, for those assigned to treatment A. Under the assumption of no competing risks, there is a one-to-one relationship between the hazard and the survival probability at time t:

S(t | A) = exp{−∫_0^t h(u | A) du}.

Further assuming proportional hazards, an estimand of interest is the marginal hazard ratio γ(θ) := log{µ(θ; A = 1)}/log{µ(θ; A = 0)}, where µ(θ; A = a) = S(t | A = a; θ). This estimand represents the ratio comparing the hazard for those assigned to treatment versus those assigned to control. We note that the estimation framework described below is general and may be utilized to target other estimands of interest (e.g., the risk difference or risk ratio).
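A quick numeric check of this estimand under assumed exponential hazards, where S(t | A = a) = exp(−λ_a t) and the log-survival contrast recovers λ_1/λ_0 at every time point:

```python
import numpy as np

# Assumed exponential hazards for illustration: S(t | a) = exp(-lambda_a * t).
lam0, lam1 = 0.02, 0.01          # control and treatment hazard rates (assumed)
t = np.linspace(1.0, 50.0, 50)   # grid of time points (t > 0 so log S != 0)

S0 = np.exp(-lam0 * t)
S1 = np.exp(-lam1 * t)

# gamma(theta) = log S(t | 1) / log S(t | 0) = (-lam1*t)/(-lam0*t) = lam1/lam0.
hr = np.log(S1) / np.log(S0)
```

Under proportional hazards the contrast is constant in t, equal to the hazard ratio λ_1/λ_0 = 0.5 here, which is why a single time point suffices to define the estimand.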
Under an RCT framework which allows for right censoring only, the time from a participant's initial enrollment to an event of interest may occur after the trial has ended. Let T_i be the i-th participant's observed event time or right censoring time. Let δ_i be the i-th participant's observation indicator, where δ_i = 1 means the event time is observed before the end of the trial and δ_i = 0 means the event time is right censored. Let Y_i = {T_i, δ_i} be the observed data. On the hazard scale, the hazard of the event at time t for the i-th participant can be modeled as below, where h_0(t) is the baseline hazard function:

h(t | A_i, x_i; θ) = h_0(t) exp(φA_i + x_i β).

This yields the corresponding survival probability:

S(t | A_i, x_i; θ) = exp{−H_0(t) exp(φA_i + x_i β)}, where H_0(t) = ∫_0^t h_0(u) du.

The baseline hazard function may be flexibly modeled, with one possible choice being through M-splines [Brilleman et al., 2020]. Let M(t; ψ, k, δ) be an M-spline function,

M(t; ψ, k, δ) = Σ_{l=1}^{L} ψ_l M_l(t; k, δ),

where ψ is the vector of coefficients for the L M-spline basis terms, with degree δ and knot locations k. Integrating this M-spline function yields the corresponding I-spline function, which is evaluated using the same coefficients, degree, and knot locations:

I(t; ψ, k, δ) = Σ_{l=1}^{L} ψ_l I_l(t; k, δ), where I_l(t; k, δ) = ∫_0^t M_l(u; k, δ) du.

Both M-spline and I-spline functions can be evaluated analytically [Wang and Yan, 2021]. By flexibly modeling the baseline hazard with M-splines, the hazard and the survival probability become, respectively,

h(t | A_i, x_i; θ) = M(t; ψ, k, δ) exp(φA_i + x_i β) and S(t | A_i, x_i; θ) = exp{−I(t; ψ, k, δ) exp(φA_i + x_i β)}.

Estimation then proceeds by assuming independent outcomes under this proportional hazards model. Posterior samples of the marginal hazard ratio are obtained by first marginalizing samples of µ(θ; A, X) = S(t | A = a, X; θ) using (2) and then forming the contrast [Daniels et al., 2023, Stitelman et al., 2011, Remiro-Azócar et al., 2020]. This procedure is outlined in Appendix A in the Online Supplementary Materials.
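To illustrate the hazard-to-survival relationship used above, the sketch below swaps the M-spline baseline for a simple stand-in baseline h_0(t) = 0.02t whose integral is known in closed form, and checks that numerically integrating the hazard reproduces the survival probability. All parameter values are assumptions for illustration:

```python
import numpy as np

# Stand-in flexible baseline hazard with a known integral (cumulative hazard):
# h0(t) = 0.02*t, so H0(t) = 0.01*t^2. An M-spline baseline plays the same
# role, with the I-spline supplying H0 analytically.
def h0(t):
    return 0.02 * t

def H0(t):
    return 0.01 * t ** 2

phi, beta = np.log(0.7), 0.5   # illustrative conditional effects (assumed)
A, x = 1, 1.2                  # a single covariate pattern
risk_score = np.exp(phi * A + beta * x)

t_grid = np.linspace(0.0, 30.0, 3001)
hazard = h0(t_grid) * risk_score   # h(t | A, x) = h0(t) exp(phi*A + x*beta)

# Recover S(30 | A, x) by trapezoid-rule integration of the hazard, and
# compare against the closed form exp(-H0(30) * exp(phi*A + x*beta)).
H_num = np.sum((hazard[1:] + hazard[:-1]) / 2.0 * np.diff(t_grid))
S_numeric = np.exp(-H_num)
S_exact = np.exp(-H0(30.0) * risk_score)
```

Because the stand-in hazard is linear in t, the trapezoid rule is exact here; with an M-spline baseline the same survival quantity is obtained analytically through the I-spline.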

Simulation Study
In this section we perform simulations for the design and models described in the previous sections. We consider a design with a superiority stopping rule where superiority is declared at any interim or final analysis performed at time t if T(D_{n_t}) > 0.99. The same value of u = 0.99 is selected for all maximum sample sizes to control the overall Type 1 error rate of the unadjusted model (i.e., a Type 1 error rate below 0.05). The unadjusted model is selected as a conservative choice for trial planning purposes, since the true strength of any covariate effects and the adjustment benefit may not be known in practice at the trial planning stage [Benkeser et al., 2021]. Note that our interest is in comparing the performance of different adjustment models, not in optimizing u for each maximum sample size and model, so the single conservative value of u above is chosen for all simulations. Our marginal treatment effects of interest are the difference in means of a Normal endpoint under the assumption of no treatment-covariate interactions, the relative risk under a binary endpoint, and the hazard ratio under a time-to-event endpoint. Data generating processes with five covariates and a treatment assignment indicator are used for multiple sample sizes with each endpoint. We consider adjustment models which include several forms of misspecification and which incorporate varying levels of prior information for the covariate effects. To obtain marginal estimates from the adjustment models, the procedures described in the previous section are utilized. We follow a setup similar to that reported in a previous study which investigated covariate adjustment for endpoints commonly used in COVID-19 RCTs [Benkeser et al., 2021]: for each maximum sample size with each endpoint, three treatment effect values are chosen. The first is the null treatment effect, and the second and third are those where the unadjusted model achieves roughly 50% and 80% power. This excludes the scenarios with a maximum sample size of 100 under the binary and time-to-event endpoints, whose second and third treatment effect sizes are chosen as those where the unadjusted model achieves roughly 30% and 40% power.
This ensures all simulations maintain realistic values for the marginal treatment effects, and that the impact of covariate adjustment is compared at a value of the treatment effect for which trials are commonly powered (i.e., 80%). For each maximum sample size with each endpoint, the impact of covariate adjustment is quantified through the values of the following design operating characteristics: power, Type 1 error rate, expected sample size, probability of stopping early, bias, and root mean squared error (RMSE).
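The calibration-by-simulation logic can be sketched with a deliberately simplified two-look design: a normal endpoint with known unit variance and a flat prior, so that T(D) has a closed form rather than requiring MCMC. All design constants (look sizes, u, effect size) are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def simulate_trial(delta, n_interim=50, n_max=100, u=0.99):
    """One trial: with known sigma = 1 and a flat prior, the posterior of the
    mean difference gamma is N(diff, 2/n), so T(D) = P(gamma > 0 | D) is a
    normal CDF. Superiority is declared at any look where T(D) > u."""
    y1 = rng.normal(delta, 1.0, n_max)  # treatment arm outcomes
    y0 = rng.normal(0.0, 1.0, n_max)    # control arm outcomes
    for n in (n_interim, n_max):
        T = norm_cdf((y1[:n].mean() - y0[:n].mean()) / sqrt(2.0 / n))
        if T > u:
            return True, 2 * n          # superiority declared at this look
    return False, 2 * n_max             # maximum sample size reached

# Monte Carlo estimates of the operating characteristics under delta = 0.5.
results = [simulate_trial(delta=0.5) for _ in range(2000)]
power = np.mean([stopped for stopped, _ in results])
prob_stop_early = np.mean([n == 100 for _, n in results])  # stopped at interim
expected_n = np.mean([n for _, n in results])
```

Rerunning the same loop with `delta=0.0` estimates the Type 1 error rate, and u would be tuned until that rate falls below the design target; this is the "hybrid" calibration described earlier, done here in miniature.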

Data Generating Mechanisms
For each combination of endpoint and maximum sample size {100, 200, 500, 1000}, the data generating mechanisms for the treatment assignment and the covariate distributions measured at baseline assume joint independence between all variables. Letting η represent the linear predictor used in the data generating mechanisms, the set {β, φ} represents the conditional covariate and treatment effects on the linear predictor scale. Note that the variables {X_6, X_7, X_8} are noise and are not predictive of, or correlated with, any other variables in the data generating mechanism. For the binary endpoint, β_0 is optimized to generate datasets which exhibit the correct marginal control event risk of p_ctr = 0.3 (Appendix B in the Online Supplementary Materials). For the continuous and time-to-event endpoints, β_0 = 0. For the time-to-event endpoint, an exponential baseline hazard with rate λ = 0.02 is used. For the non-collapsible treatment effects, the true values of the marginal estimand γ are not available in closed form. Thus, the reported values of the marginal estimands are obtained through simulation (denoted by S(·); Appendix B in the Online Supplementary Materials), and the values of γ and φ are reported together. For the continuous endpoint, β = (0, 0.5, −0.25, 0.5, −0.05, 0.25). For the binary endpoint, β = (−1.26, 1, −0.5, 1, −0.1, 0.5). For the time-to-event endpoint, β = (0, 1, −0.5, 1, −0.1, 0.5). For each maximum sample size (max ss) within each outcome type, 1,000 treatment-covariate datasets are generated. These are used to generate 1,000 different outcome vectors for each value of the marginal treatment effect within the corresponding maximum sample size. The specific parameter values used for all simulations are included in Table 1.
Table 1: Simulation study parameter settings for the marginal estimand (γ) and conditional treatment effect (φ) for each endpoint and maximum sample size (max ss). The marginal estimands for the continuous, binary, and time-to-event endpoints are, respectively, the difference in means under the assumption of no treatment-covariate interactions, the relative risk, and the hazard ratio.
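The simulation-based evaluation of a true marginal estimand, S(·), can be sketched as follows for the binary endpoint's relative risk. The covariate distributions and the value of φ below are assumptions for illustration only; the paper's generating distributions and effect values are given in Appendix B and Table 1:

```python
import numpy as np

rng = np.random.default_rng(4)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# S(.): simulate a very large covariate sample from an assumed baseline
# distribution, average the conditional risks under each arm assignment,
# then form the contrast. Standard normal covariates are used purely for
# illustration; phi = log(0.7) is likewise an assumed value.
N = 1_000_000
X = rng.normal(size=(N, 5))

beta0 = -1.26                                  # binary-endpoint intercept
beta = np.array([1.0, -0.5, 1.0, -0.1, 0.5])   # conditional covariate effects
phi = np.log(0.7)                              # assumed conditional treatment effect

eta = X @ beta
mu1 = expit(beta0 + phi + eta).mean()  # marginal risk under treatment
mu0 = expit(beta0 + eta).mean()        # marginal risk under control
true_marginal_rr = mu1 / mu0           # true marginal relative risk
```

Because the relative risk is non-collapsible under this conditional model, the marginal value obtained this way differs from exp(φ), which is why the paper reports γ and φ together.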

Adjustment Models
Six adjustment models are considered for all endpoints: (1) correct; (2) no quad; (3) correct noise; (4) correct prior; (5) correct strong prior; and (6) unadjusted. The correct model corresponds to an adjustment model which matches the data generating mechanism. The no quad model drops the quadratic component of X_3 from the correct model. The correct noise model adds noise variables {X_6, X_7, X_8} to the correct model. These three models include priors for all parameters which are weakly informative only. The correct prior model is the same as the correct model, but includes priors for the covariate effects centered at the values used in the data generating mechanism. Similarly, the correct strong prior model both centers and rescales these priors to be more informative. Note that the prior for the treatment indicator coefficient remains weakly informative in these models. Finally, the unadjusted model includes only the binary treatment indicator and uses weakly informative priors.
All simulations are performed using R (version 4.2.1). All modeling is performed using the GLM and survival functionality of the rstanarm package (version 2.21.2), a front-end to the Stan probabilistic programming language [R Core Team, 2020, Goodrich et al., 2020, Brilleman et al., 2020]. For all coefficients other than {β†_j, β††_j}, the package's default weakly informative priors are used [Gabry and Goodrich, 2022]. These priors induce moderate regularization and help improve computational stability. The prior for the intercept is placed after all covariates have been internally centered by rstanarm. Under a Normal likelihood for the continuous endpoint, this equates to the following priors, where β_{0,c} represents the intercept's prior after covariate centering has been performed:

β_{0,c} ∼ Normal(ȳ, 2.5 s_y), φ ∼ Normal(0, 2.5(s_y/s_A)), β_j ∼ Normal(0, 2.5(s_y/s_{x_j})), j = 1, ..., p.

Under the binary and time-to-event endpoints, the above priors for {φ, β} are used with ȳ = 0 and s_y = 1. For the time-to-event endpoint, the coefficients of the M-spline basis (ψ) are constrained to a simplex to ensure identifiability of both the basis and linear predictor intercepts. Thus, the basis coefficients receive rstanarm's default Dirichlet prior [Brilleman et al., 2020]. All models specify three Markov chains, each with 2,000 posterior samples. Half of the samples within each chain are used during the warm-up period, so 3,000 posterior samples in total are available for inference. Given the scale of the simulations performed, visual diagnostics assessing convergence of the Markov chains cannot be inspected for every model fit.

Null and alternative hypotheses are specified for the continuous, binary, and time-to-event (TTE) endpoints, where γ(θ) is the marginal difference in means, the marginal relative risk, and the marginal hazard ratio, respectively, with the alternative hypothesis corresponding to treatment superiority on each scale. For the continuous and binary endpoints, all outcomes are assumed to be observed immediately upon participant enrollment. For the time-to-event endpoint, it is assumed all outcomes are observed strictly after enrollment. For the continuous endpoint, interim analyses are performed after every 25, 50, 125, and 250 participants are enrolled for maximum sample sizes of 100, 200, 500, and 1,000, respectively. For the binary and time-to-event endpoints, interim analyses are event driven. For the binary endpoint, interim analyses are performed after at least 10, 20, 50, and 100 new events occur for maximum sample sizes 100, 200, 500, and 1,000, respectively. For the time-to-event endpoint, interim analyses are performed after at least 20, 40, 100, and 200 new events occur for maximum sample sizes 100, 200, 500, and 1,000, respectively. These numbers are chosen for each endpoint to ensure that, on average, the total number of analyses performed under the null treatment effect is less than four, which helps control the Type 1 error rate. They are also large enough to ensure there is a moderate chance of stopping at an early interim analysis under the non-null treatment effects. For the continuous and binary endpoints, interim analyses are performed until the trial is stopped early for superiority or until the maximum sample size is reached, at which time the final analysis is performed.
For the time-to-event endpoint, interim analyses are performed until the trial is stopped early for superiority or until 50 time units from the start of the trial have elapsed, at which time the final analysis is performed. For this endpoint, participant enrollment is permitted until 25 time units. This ensures that participants enrolled at later time points are under follow-up long enough to have a moderately high probability of experiencing the event before the end of the trial.
It also ensures that the maximum number of participants will be enrolled if there is not clear evidence of superiority at an early interim analysis.No loss to follow-up is assumed and administrative censoring of those still at risk is performed at 50 time units from the start of the trial.

Simulation Study Results
Within each outcome type and maximum sample size, the following design operating characteristics are investigated: power, Type 1 error rate, probability of stopping early, expected sample size, posterior median bias, and RMSE. Since the sampling distribution of the test statistic T(D_{n_t}) is unknown, power and the Type 1 error rate are estimated via Monte Carlo using the 1,000 datasets. While this number is lower than that required by the FDA for adaptive simulations used in RCTs [US Food and Drug Administration, 2019], our goal here is to compare model adjustment performance, not to obtain precise estimates of operating characteristics. The probability of stopping early is estimated as the proportion of times the trial stops before performing a final analysis. For the continuous and binary outcomes, this is the proportion of times the trial stops before enrolling the maximum number of participants. For the time-to-event outcome, this is the proportion of times the trial stops before 50 time units. In Bayesian adaptive designs which allow for early stopping, the sample size is a random variable. Thus, the expected sample size is estimated as the average of the 1,000 sample sizes at trial end. Posterior median bias is defined as the bias resulting from using the posterior median of γ obtained from an adjustment model as an estimator for the value of γ used in the simulation, and is estimated through Monte Carlo using the 1,000 datasets. The Monte Carlo distribution of RMSE is displayed for each value of the marginal estimand. Here the entire posterior distribution from an adjustment model is used as the estimator for γ, so this is equivalent to the posterior expected squared error loss. For each of the 1,000 simulations, a single value of RMSE is obtained using the 3,000 posterior draws γ_s for the value of γ used in the simulation:

RMSE = √((1/S) Σ_{s=1}^{S} (γ_s − γ)²), S = 3,000.

Results for the continuous, binary, and time-to-event endpoints are displayed in Figures 1-3 and Tables 2-4. For all endpoints, adjusting for variables known to be associated with the outcome increases the power of the trial and the probability of stopping the trial early as compared to the unadjusted analysis (Figures 1-3). Additionally, failing to correctly specify the functional form of a covariate (no quad) has only a minor impact on power and the probability of stopping early. Under all scenarios, incorporating stronger prior information appears to provide little to no benefit as compared to the weakly informative priors used in the correct models. This results from the priors being dominated by the data due to the high effective sample sizes. For the binary endpoint, this is induced by the control event risk of 0.3, which ensures that a moderately large number of events occurs throughout the trial. For the time-to-event endpoint, this is induced by the exponential baseline hazard rate of λ = 0.02, which ensures there are a large number of events within the maximum time limit of 50 time units.
Compared to other adjustment models, adjusting for noise (correct noise) has minimal impact under most scenarios. However, for the smallest maximum sample size under the binary endpoint (max ss = 100), adjustment for noise slightly increases power and the probability of stopping early as compared to the correct model. This may result from non-negligible correlation being induced between the outcome and noise variables under this setting, so that adjustment provides a further power benefit. However, this comes at the cost of a strong inflation in the Type 1 error rate as compared to the correct model (i.e., T1E = 0.079 versus T1E = 0.063 in Table 3). This underscores the importance of adjusting only for variables which are known to be associated with the outcome, and in a pre-specified manner [Hauck et al., 1998, Committee for Proprietary Medicinal Products, 2004, Senn, 2013, European Medicines Agency, 2015].
There is some suggestion that adjusted analyses tend to have slightly lower RMSE than unadjusted analyses under all scenarios. However, no clear pattern emerges for posterior median bias for the non-null treatment effects, where all adjustment models are comparable for practical purposes (Figures D.1-D.3 in the Online Supplementary Materials).
For all scenarios except the smallest maximum sample size under the binary endpoint (max ss = 100), posterior median bias for the non-null treatment effects is negative. This results from the estimated treatment effect being further from the null than the true treatment effect (i.e., overestimation), which is expected in trials which allow early stopping for treatment superiority (i.e., truncated trials) [Mueller et al., 2007, Walter et al., 2019, Robertson et al., 2022]. For the null treatment effect, early stopping for superiority leads to non-zero but minimal bias under all endpoints and maximum sample sizes (Tables 2-4). For the binary endpoint, it is consistently larger in magnitude for the unadjusted model. We note that bias under the null is negative for the continuous endpoint but positive for the binary and time-to-event endpoints. This results from the marginal difference in means being unbounded below, whereas the marginal relative risk and marginal hazard ratio are bounded below by zero. When bias under the null is evaluated for these latter estimands on the log scale, most values become negative, as in the continuous endpoint case. We elaborate further on this overestimation-induced bias in Section F of the Online Supplementary Material. Under all scenarios except the smallest maximum sample size (max ss = 100) for the binary endpoint, the Type 1 error rate is maintained below 0.05 for all adjustment models (Tables 2-4). Under the smallest maximum sample size within the binary endpoint, however, all adjustment models lead to an increased Type 1 error rate as compared to the unadjusted model. This results from using too many covariates in the adjustment model given the low effective sample size, a phenomenon known as over-stratification [Kahan et al., 2014]. We observe that as the maximum sample size, and thus effective sample size, increases, the inflation in the Type 1 error rate disappears. Under the time-to-event endpoint, there is minimal inflation in the Type 1 error rate for adjusted analyses as compared to the unadjusted analysis. This may also result from over-stratification as described above. Future work should determine the optimal number of covariates to include in an adjustment model under these scenarios to avoid over-stratification. However, considering the substantial power gains achieved by adjusting under the time-to-event endpoint, and that the Type 1 error rate is maintained by selecting a conservative value of the probability of superiority threshold u, this slight increase in the Type 1 error rate is unlikely to be problematic. Across all non-null treatment effect scenarios, the adjusted models have lower expected sample sizes than the unadjusted model. When combined with the probability of stopping early results, this implies that adjusted analyses are stopping more often and at earlier interim analyses for all endpoints.
We note that the reduction in expected sample size is not as great for the time-to-event endpoint as for the other endpoints. This results from the maximum sample size being included for any interim analyses conducted past the halfway point of the trial, since all trial participants are enrolled by this point. A final simulation (included in Section E of the Online Supplementary Material), which incorporated varying degrees of prior information on the treatment effect, was performed for the binary endpoint. This resulted in increased power and probability of stopping early for smaller maximum sample sizes, but at the cost of an inflated Type 1 error rate. A more complete investigation of including informative priors on treatment effects remains as future work.

Application: CCEDRRN-ADAPT
In this section we consider the design of a hypothetical platform trial which seeks to study the effectiveness of oral therapies against mild to moderate COVID-19 infection in individuals discharged from Canadian Emergency Departments.
The trial design takes advantage of an already established network of physicians and researchers called the Canadian COVID-19 Emergency Department Rapid Response Network (CCEDRRN). We consider the first stage only, where a single oral therapy is compared to the standard of care. The binary outcome of interest is a composite endpoint of 28-day hospitalization or mortality. Realistic values used in the trial simulation performed below are taken from a COVID-19 Emergency Department risk model developed by the CCEDRRN researchers [Hohl et al., 2022a]. This risk model was developed using data from a high quality, population-based registry, which was also developed by the CCEDRRN researchers [Hohl et al., 2021, Hohl et al., 2022b]. While the binary outcome for the risk model is all-cause emergency department and in-hospital mortality, this is very likely to be highly correlated with the trial's composite outcome. Thus, the simulation's results are expected to generalize to the composite endpoint as well.
The data generating mechanism for the trial simulation is shown below, where a single maximum sample size of 3,000 is chosen due to the very low marginal control event risk (p_ctr = 0.07) and reflects the sample size used in CCEDRRN-ADAPT. Letting Y = 1 represent 28-day hospital admission or mortality, the marginal estimand of interest is the relative risk, γ(θ) = µ(θ; A = 1)/µ(θ; A = 0), where µ(θ; A = a) = E[Y | A = a; θ]. The estimand is intention-to-treat, that is, we compare the outcome distributions of those who are assigned to treatment versus those who are assigned to control. As above, let η represent the linear predictor used in the data generating mechanisms. The set {β, φ} represents the conditional covariate and treatment effects on the log-odds scale. As in the binary simulation case, the value of β_0 is optimized to ensure the generated datasets exhibit the correct marginal control event risk of p_ctr = 0.07. The values of φ are selected as those where the unadjusted model achieves approximately 50% and 80% power. The value of γ = γ(θ) corresponding to a specific value of φ is determined through simulation, as described in Appendix B of the Online Supplementary Materials. Thus, the values of φ and γ are reported together. Let F be a truncated normal distribution. The covariate distributions are simulated using values of summary statistics for the derivation cohort from a previously reported risk model for individuals presenting to Canadian Emergency Departments with COVID-19 symptoms [Hohl et al., 2022a]. Covariates included in the risk model and whose summary statistics were available are included in the simulation. These are age in years (X_1), respiratory rate upon arrival to the Emergency Department, measured in breaths/minute (X_2), female sex (X_3), chest pain (X_4), and arrival by police or ambulance (X_5). As the risk model was developed before COVID-19 vaccines were widely available, vaccination status is not included as a potential covariate. The corresponding risk model coefficients are used as the conditional effects {β_1, ..., β_5} in the simulation. Since the summary statistics did not include covariance information, joint independence between all variables is assumed. Due to study inclusion criteria, the range of the distribution of age is set to [18, 90]. Due to biological constraints, the range of the distribution of respiratory rate is set to [12, 40]. Values for age and respiratory rate are simulated from the truncated normal distribution F described above.
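A minimal sketch of simulating bounded covariates such as age and respiratory rate from truncated normal distributions. Rejection sampling is our own simple implementation choice, and the means and standard deviations below are fabricated placeholders, not the published Hohl et al. values.

```python
import numpy as np

rng = np.random.default_rng(1)

def truncated_normal(mean, sd, low, high, size, rng):
    """Draw `size` values from a normal(mean, sd) truncated to
    [low, high] by simple rejection sampling."""
    out = np.empty(0)
    while out.size < size:
        draw = rng.normal(mean, sd, size=size)
        out = np.concatenate([out, draw[(draw >= low) & (draw <= high)]])
    return out[:size]

# Placeholder summary statistics (fabricated); the trial simulation
# uses values from the published risk model [Hohl et al., 2022a].
age = truncated_normal(54.0, 18.0, 18.0, 90.0, 3000, rng)        # years, [18, 90]
resp_rate = truncated_normal(20.0, 4.0, 12.0, 40.0, 3000, rng)   # breaths/min, [12, 40]
```

Binary covariates such as female sex or chest pain would be drawn as Bernoulli variables with the reported cohort proportions, independently of the continuous covariates under the joint-independence assumption.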
A total of 1,000 treatment-covariate datasets are generated, each containing 3,000 participants. Using these same 1,000 datasets, different outcome vectors are generated for each value of the marginal relative risk γ ∈ {1, 0.73, 0.63}, which corresponds to φ ∈ {0, −0.42, −0.60}. The outcome vectors are generated as follows:

The trial uses the same null and alternative hypotheses as those specified under the binary endpoint simulation above and includes a probability of superiority threshold of u = 0.99. Estimation for the marginal relative risk proceeds as previously described. Four adjustment models are considered, mirroring those used in the binary simulations: correct, correct prior, correct strong prior, and unadjusted. All outcomes are assumed to be observed immediately upon participant enrollment. Interim analyses are event driven, and 75 new events must be observed before an interim analysis is performed. This ensures a moderately high probability of stopping at an earlier interim analysis given treatment superiority. Interim analyses continue until the trial is stopped early for superiority or until 3,000 participants are enrolled, at which time the final analysis is performed.
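The superiority decision applied at each interim analysis can be sketched as follows. This is a minimal illustration with fabricated posterior draws, not the trial's actual implementation; the function name is ours.

```python
import numpy as np

U = 0.99  # probability-of-superiority threshold from the design

def stop_for_superiority(rr_draws, u=U):
    """Stop the trial early if the posterior probability of treatment
    benefit (marginal relative risk below 1) exceeds the threshold u."""
    return float(np.mean(rr_draws < 1.0)) > u

rng = np.random.default_rng(2)
# Fabricated posterior draws concentrated well below 1: stop.
stop_yes = stop_for_superiority(rng.normal(0.70, 0.05, size=3000))
# Fabricated draws centered at the null: continue enrolling.
stop_no = stop_for_superiority(rng.normal(1.00, 0.10, size=3000))
```

In the simulated trial this check is applied each time 75 new events accrue, until superiority is declared or the 3,000-participant final analysis is reached.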
Results for the CCEDRRN-ADAPT design simulation study are summarized in Figure 4 and Table 5. Adjusting for variables known to be associated with the outcome increases the power of the trial and the probability of stopping the trial early as compared to the unadjusted analysis (Figure 4). As in the binary simulations above, including stronger priors on the covariate effects has minimal impact on power and the probability of stopping early as compared to the weakly informative priors used in the correct models. While there is some suggestion that adjusted analyses tend to have slightly lower RMSE than unadjusted analyses, all adjustment models have comparable posterior median bias for the non-null treatment effects (Figure D.4 in the Online Supplementary Materials). The Type 1 error rate is maintained below 0.05, and bias under the null is minimal, for all adjustment models (Table 5). As in the binary simulations above, however, there is slight inflation in bias under the null for the unadjusted model. The adjusted models have lower expected sample sizes than the unadjusted model across both non-null values of the relative risk. Again, we see that the adjusted analyses stop more often and at earlier interim analyses as compared to the unadjusted analysis.
For example, under a relative risk of 0.63, the correct model stops at the first interim analysis 50% of the time whereas the unadjusted model stops at the first interim analysis only 40% of the time.

Discussion
The impact of covariate adjustment and of incorporating prior information on covariate effects has not been previously investigated in Bayesian adaptive trials with early stopping criteria. In this article, we assessed this impact using a variety of adjustment models which incorporated varying levels of prior information or types of model misspecification. It was shown that covariate adjustment increases power and the probability of stopping early, and decreases expected sample size, over all scenarios. Furthermore, adjusting for covariates leads to trials which stop more often and at earlier interim analyses, and can decrease RMSE as compared to unadjusted analyses. These findings are fairly robust to adjustment for noise variables, but extra caution is needed for small sample size trials (max ss = 100) with binary endpoints, where noise adjustment may lead to an inflated Type 1 error rate. This reinforces the existing best practice of adjusting only for covariates which have been pre-specified by subject matter experts [Hauck et al., 1998, Committee for Proprietary Medicinal Products, 2004, Senn, 2013, European Medicines Agency, 2015]. For the scenarios explored here, which had moderate effective sample sizes, weakly informative priors on the covariate effects perform well, with stronger prior information providing little benefit. This included the CCEDRRN-ADAPT COVID-19 RCT design. Although the control event risk (0.07) of this trial is small, the maximum sample size was selected to be large enough to power the study for reasonable values of the relative risk, which ensured moderately high effective sample sizes. Including stronger prior information is expected to be helpful in trials which have small effective sample sizes (e.g., oncology trials), so future work may consider covariate adjustment within these contexts. Although we did not consider designs which also include futility stopping rules, we expect that our conclusions generalize to them.
In our simulation study, all covariates were assumed to be jointly independent. The assumption of independence may not hold in some cases, and these should be investigated further. Since they carry similar information, adjusting for covariates which are moderately or highly correlated may yield smaller power increases than adjusting for approximately independent covariates. Future work might perform a simulation study to assess how different strengths of association between covariates affect the results reported here. Another limitation of the current work is that adjustment was shown under only a single set of covariate effects within each endpoint. There is strong evidence in favor of covariate adjustment when the covariate effects are known to be strong. However, previous simulations (not shown) showed that adjustment under weaker covariate effects yielded only modest benefit. This finding was especially pronounced in the case of the non-linear endpoints. Future work might investigate a wider range of covariate effects to ascertain the magnitude that is required for covariate adjustment to be more than just modestly beneficial, though this may be context dependent.
It is important to note that the marginalization procedure used in this work is sensitive to the joint empirical distribution of the covariates with respect to which the conditional posterior samples are being marginalized. That is, the marginalization procedure is sensitive to the participants enrolled in the trial. While trialists work hard to obtain representative samples for use in RCTs, the participants volunteer and consent to be enrolled in the trials. Thus, they are not a random sample from the population of interest [Lindsey and Lambert, 1998]. If the sample is not as representative of the population of interest as desired, indirect standardization could be performed. Here, conditional posterior samples would be marginalized with respect to a set of covariate patterns which more closely resembles the population of interest. Additionally, the participant covariate patterns could be augmented by pseudo-participant covariate patterns until the sample is more representative. These could be acquired from registry data or from participants in previous trials which contained a similar population and target of interest.
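A minimal sketch of this indirect standardization idea for a binary endpoint, assuming a conditional logistic model: each posterior draw's conditional risks are averaged over external covariate patterns (e.g., registry rows) rather than over the enrolled participants. The posterior draws, effects, and registry rows below are all fabricated, and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(7)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def standardized_risk_draws(post_draws, X_target, a):
    """For each posterior draw (beta0, phi, beta) of a conditional
    logistic model, average the conditional event risks over the
    external covariate patterns X_target under assignment A = a."""
    return np.array([
        np.mean(expit(b0 + phi * a + X_target @ beta))
        for (b0, phi, beta) in post_draws
    ])

# Fabricated posterior draws (intercept, treatment effect, covariate
# effects on the log-odds scale) and fabricated registry patterns.
post = [(rng.normal(-2.6, 0.10),
         rng.normal(-0.4, 0.05),
         rng.normal([0.4, -0.2], 0.05)) for _ in range(100)]
X_reg = rng.normal(size=(1000, 2))

risk1 = standardized_risk_draws(post, X_reg, a=1)
risk0 = standardized_risk_draws(post, X_reg, a=0)
rr_std = risk1 / risk0  # standardized marginal relative-risk draws
```

Replacing `X_reg` with the enrolled participants' covariate rows recovers the within-trial marginalization used in the main analyses.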
This work focused on Bayesian adaptive designs which employed simple randomization, where covariate adjustment takes place within an adaptive decision rule. Covariate adjustment may also occur within an adaptive allocation rule, such as in Covariate Adjusted Response Adaptive (CARA) designs [Rosenberger and Sverdlov, 2008, Villar and Rosenberger, 2018]. These designs estimate treatment arm allocation probabilities from models which include covariates. While frequentist in nature, similar ideas can be applied within Bayesian response adaptive randomization designs. Since adaptive randomization leads to greater covariate imbalance across treatment arms, including covariate adjustment in both the adaptive decision and adaptive allocation rules may provide greater benefit than either does individually. This is an interesting direction for future work.
The simulation study above included variants of the correct adjustment model, which corresponded to the data generating mechanism used. In reality, this mechanism will be unknown. Variable selection and shrinkage methods might be employed to select covariates to be used in the adjustment model. In particular, Bayesian model averaging could be used, where the decision rules would use marginal posterior samples obtained from a consensus of plausible models and might be less sensitive to misspecification of any single adjustment model. However, this approach is likely to be very computationally intensive.
Another interesting direction of future research is the exploration of Bayesian nonparametric models, such as Gaussian processes, to allow more flexible functional forms for the adjustment covariates. This approach might be especially advantageous in settings where the underlying association between the adjustment covariates and the outcome is complex and hard to correctly specify, such as under a high degree of covariate non-linearity or interaction.
A.3 Time-to-event Outcome: Marginalization of Conditional Hazard Ratio Estimates

Let Y = {T, δ} be defined as in the section for hazard ratios in the manuscript, where the target of inference is the marginal hazard ratio: γ(θ) = h(t | A = 1)/h(t | A = 0) = log{µ(θ; A = 1)}/log{µ(θ; A = 0)}, where µ(θ; A = a) = S(t | A = a; θ). Estimation proceeds assuming independent outcomes, no competing risks, and the following model:

As the hazard ratio is non-collapsible, s = 1, ..., S marginal posterior samples from adjusted analyses are obtained through marginalization of the log-transformed survival probabilities:

Dividing log-transformed survival probabilities can be numerically unstable and is undefined for all t such that S(t | A = a) ∈ {0, 1}. Thus, we use a more numerically stable identity [Stitelman et al., 2011, Remiro-Azócar et al., 2020]:

The integrals are approximated using the Bayesian bootstrap procedure described in Section 3 of the manuscript. After fitting the flexible, semi-parametric proportional hazards model, s = 1, ..., S samples are obtained from the joint posterior distribution of the model parameters, π(θ | D_nt). Let θ_s represent the s-th draw from this joint posterior distribution. For every row i = 1, ..., n_t in the sample data, a value of A_i = 1 is assigned. Then, for each θ_s, the following procedure is performed. For the t corresponding to the time from the start of the trial to the current analysis, the n_t values of the indexed conditional survival probabilities, µ(θ_s; A_i = 1, X_i = x_i), are calculated. A vector w_s = (w_{1,s}, ..., w_{n_t,s}) is drawn from a Dirichlet(1_{n_t}) distribution. Using w_s, the n_t values are then averaged, Σ_{i=1}^{n_t} w_{i,s} µ(θ_s; A_i = 1, X_i = x_i), which marginalizes them with respect to the observed X = x, yielding a single sample µ(θ_s; A = 1) from the posterior distribution of µ(θ; A = 1). For numerical stability, a log{−log[·]} transformation is applied to yield a single sample from the posterior distribution of log{h(t | A = 1)}. This occurs for all θ_s to yield S draws from the posterior distribution of log{h(t | A = 1)}. This entire process is then repeated for A_i = 0 to yield S draws from the posterior distribution of log{h(t | A = 0)}. These posterior draws are subtracted and then exponentiated to yield samples from the posterior distribution of the marginal hazard ratio γ(θ). A brief summary is below.
1. Fit a flexible semi-parametric proportional hazards model.
2. Obtain s = 1, ..., S draws θ_s from the joint posterior distribution π(θ | D_nt).
3. Create one copy of the sample data where A_i = 1 for all i = 1, ..., n_t.
4. For each θ_s, perform the following:
(a) For each i = 1, ..., n_t, calculate the conditional survival probability at the time t corresponding to the time from the start of the trial to the current analysis, µ(θ_s; A_i = 1, X_i = x_i).
(b) Sample w_s = (w_{1,s}, ..., w_{n_t,s}) from a Dirichlet(1_{n_t}) distribution.
(c) Average these n_t values to marginalize with respect to the observed X = x.

We first select a value of β_0 on the log-odds scale in the adjusted data generating models, such that the simulated datasets have a specific marginal control event risk (p_ctr). We then use β_0 and the obtained values of φ (i.e., those where the unadjusted model achieves 50% and 80% power, also on the log-odds scale) to select the reported value of the marginal relative risk.
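The marginalization steps above (continued in steps 5-7 later in this appendix) can be sketched numerically as follows. The conditional survival probabilities here are fabricated constants so the resulting hazard-ratio draws are easy to check by hand; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def marginal_hr_draws(surv_treat, surv_ctrl, rng):
    """surv_treat, surv_ctrl: (S, n_t) arrays of conditional survival
    probabilities at the analysis time t for each posterior draw,
    under A = 1 and A = 0. Returns S marginal hazard-ratio draws."""
    S, n = surv_treat.shape
    draws = np.empty(S)
    for s in range(S):
        w1 = rng.dirichlet(np.ones(n))  # Bayesian bootstrap weights, A = 1 pass
        w0 = rng.dirichlet(np.ones(n))  # second pass, repeated for A = 0
        log_h1 = np.log(-np.log(np.sum(w1 * surv_treat[s])))  # log{-log[.]}
        log_h0 = np.log(-np.log(np.sum(w0 * surv_ctrl[s])))
        draws[s] = np.exp(log_h1 - log_h0)  # subtract, then exponentiate
    return draws

# Fabricated constants: every treated survival probability is 0.8 and
# every control one is 0.7, so each draw equals log(0.8)/log(0.7).
surv_treat = np.full((100, 50), 0.8)
surv_ctrl = np.full((100, 50), 0.7)
hr_draws = marginal_hr_draws(surv_treat, surv_ctrl, rng)
```

With constant inputs, every Dirichlet-weighted average reduces to the constant itself, so all draws equal log(0.8)/log(0.7) ≈ 0.626, matching the identity in Section A.3.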
To find β_0, let Y be a binary outcome. As a reminder, we define A as the treatment assignment indicator, where A = 1 means being assigned to the treatment group and A = 0 means being assigned to the control group. Let l be the number of participants assigned to control, k be the number of participants assigned to treatment, and l + k = n be the total number of participants potentially enrolled in the trial. Let X_{n×p} be the set of covariates used in the adjusted data generating model, and X_i be the row vector corresponding to the values of the covariates for the i-th participant.
Recall that the marginal control event risk, p_ctr, is the risk of having an event among those assigned to control. Then p_ctr can be defined with respect to an adjusted data generating model as follows:

Given a fixed value of p_ctr, conditional covariate effects β on the log-odds scale, and an initial simulation of the treatment assignment and covariate distributions, {A, X}, β_0 can be optimized using the last line above (e.g., using uniroot() in R).
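A sketch of this root-finding step in Python, with bisection standing in for R's uniroot(). The covariates and conditional effects below are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def solve_beta0(X, beta, p_ctr, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection analogue of R's uniroot(): find the intercept beta0
    such that the average control-arm event risk equals p_ctr. The
    average risk is monotone increasing in beta0, so a sign change is
    guaranteed on a wide enough bracket."""
    f = lambda b0: np.mean(expit(b0 + X @ beta)) - p_ctr
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Fabricated covariates and conditional effects (log-odds scale).
X = rng.normal(size=(5000, 3))
beta = np.array([0.4, -0.2, 0.1])
beta0 = solve_beta0(X, beta, p_ctr=0.07)
risk = float(np.mean(expit(beta0 + X @ beta)))  # should be ~0.07
```

The recovered `beta0` then anchors the outcome simulation so the control arm exhibits the targeted marginal event risk.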
In the simulations for each relative risk value within each maximum sample size, 5,000 datasets (each with 5,000 participants) were generated using {β, A, X} as described under the binary outcome data generating mechanism. From these, 5,000 values for β_0 were found, and the mean of this distribution was selected as the value of β_0. Using this and the value of φ, 5,000 values for the marginal relative risk were obtained by dividing the proportion of events among those assigned to treatment (Ê[Y | A = 1]) by the proportion of events among those assigned to control (Ê[Y | A = 0]). The mean of this distribution was then reported as the value of the marginal relative risk corresponding to β_0 and φ.
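The relative-risk ascertainment step can be sketched as follows. The design values are fabricated, and far fewer replicate datasets than the paper's 5,000 are used here, for speed.

```python
import numpy as np

rng = np.random.default_rng(5)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def empirical_marginal_rr(beta0, phi, beta, X, A, rng):
    """Simulate one outcome vector from the adjusted logistic model and
    return the empirical marginal relative risk E[Y|A=1] / E[Y|A=0]."""
    y = rng.binomial(1, expit(beta0 + phi * A + X @ beta))
    return y[A == 1].mean() / y[A == 0].mean()

# Fabricated design; beta0 = -2.6 gives a control risk of roughly 0.07
# for these fabricated covariates and effects.
X = rng.normal(size=(5000, 3))
A = rng.integers(0, 2, size=5000)
beta = np.array([0.4, -0.2, 0.1])
rrs = [empirical_marginal_rr(-2.6, -0.42, beta, X, A, rng) for _ in range(200)]
gamma = float(np.mean(rrs))  # reported marginal relative risk for this phi
```

Averaging over replicate datasets smooths the Monte Carlo noise in each dataset's event proportions, mirroring how the reported (φ, γ) pairs were produced.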

B.2 Ascertainment of Marginal Hazard Ratio (Time-to-event Outcome)
Our goal is to specify a value of the reported marginal hazard ratio which corresponds to the value of φ (on the log-hazard scale) used in the adjusted data generating models. Recall our assumption of proportional hazards, under which the marginal hazard ratio is not time-dependent. Let Y = {T, δ} be defined as in the section for hazard ratios in the manuscript. Define A as the treatment assignment indicator, where A = 1 means being assigned to the treatment group and A = 0 means being assigned to the control group. Let t be the maximum duration of the trial, and let P(T > t | A = 1) = S(t | A = 1) and P(T > t | A = 0) = S(t | A = 0) be the survival probabilities at time t for those assigned to treatment and control, respectively. In the simulations for each hazard ratio value within each maximum sample size, 5,000 datasets (each with 5,000 participants) were generated using {β, A, X, t = 50} as described under the time-to-event outcome data generating mechanism. For each dataset, the value of the marginal hazard ratio was calculated as log{Ŝ(t | A = 1)}/log{Ŝ(t | A = 0)}, where Ŝ denotes the empirical survival proportion at t = 50. The mean of this distribution was then reported as the value of the marginal hazard ratio corresponding to φ.
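A sketch of this ascertainment for the time-to-event case, assuming the exponential data generating mechanism with baseline hazard λ = 0.02 and t = 50 described in the manuscript. The covariate effects and replicate counts below are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

T_MAX = 50.0   # maximum trial duration (time units)
LAMBDA = 0.02  # exponential baseline hazard from the simulations

def empirical_marginal_hr(phi, beta, X, A, rng, t=T_MAX):
    """Simulate exponential event times with conditional hazard
    LAMBDA * exp(phi*A + X @ beta), then return the marginal hazard
    ratio log S(t|A=1) / log S(t|A=0) from empirical survival at t."""
    rate = LAMBDA * np.exp(phi * A + X @ beta)
    times = rng.exponential(1.0 / rate)
    s1 = np.mean(times[A == 1] > t)  # empirical S(t | A = 1)
    s0 = np.mean(times[A == 0] > t)  # empirical S(t | A = 0)
    return np.log(s1) / np.log(s0)

# Fabricated covariates and conditional effects (log-hazard scale).
X = rng.normal(size=(5000, 2))
A = rng.integers(0, 2, size=5000)
beta = np.array([0.3, -0.3])
hrs = [empirical_marginal_hr(-0.5, beta, X, A, rng) for _ in range(200)]
gamma = float(np.mean(hrs))  # reported marginal hazard ratio for this phi
```

Note the marginal value is attenuated relative to exp(φ) because the hazard ratio is non-collapsible over the covariate distribution.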
[Walter et al., 2019] show that overestimation is to be expected when trials permit early stopping for superiority. They consider the case of frequentist group-sequential designs and compare three different stopping rules which differ in how the overall α is divided among the interim and final analyses. The Pocock (PCK) stopping rule evenly divides α across all analyses (interim and final), keeping the stringency of the stopping criteria constant. This is the frequentist group-sequential stopping rule most like the Bayesian stopping rule employed in the current manuscript, where a single value for the upper probability threshold u is used (u = 0.99), thereby also keeping the stringency of the stopping rule constant across all interim and final analyses. It is shown that overestimation is expected for the PCK stopping rule, and so we conclude it should also be expected for the Bayesian stopping rule employed here, thus inducing the observed bias in the treatment effect under the simulation scenarios. In Section 3.1 of [Walter et al., 2019], the authors discuss observing greater overestimation for the Haybittle-Peto (HP) stopping rule as compared to the PCK stopping rule. They state that "rules (such as HP) that have a more stringent threshold for stopping involve a greater risk of over-estimation if the rule is actually invoked." This suggests that Bayesian stopping rules which have more stringent initial stopping criteria may lead to increased bias as compared to the stopping rule employed in the current manuscript, though we do not explore this further here.

D Summary graphics for bias and RMSE
Considering bias under the null, the difference in signs between the continuous endpoint versus the binary and time-to-event endpoints reflects the lower bounds of the estimands. The difference in means under the continuous endpoint is unbounded below, whereas the relative risk and hazard ratio are bounded below by zero. When bias is calculated on the log-relative risk and log-hazard ratio scales, most values for bias under the null for both the binary and time-to-event endpoints become negative as well, with greater bias for smaller sample sizes, as in the continuous endpoint. This is explained visually in Figures F.1 and F.2.

1 Difference in Means: No Treatment-Covariate Interactions

Consider the difference in means of a continuous endpoint under the assumption of no treatment-covariate interactions (i.e., Z = ∅; homogeneity of the treatment effect), γ(θ) := µ(θ; A = 1) − µ(θ; A = 0), where µ(θ; A = a) = E[Y | A = a; θ]. This marginal estimand represents the difference in expected outcomes between those assigned to treatment versus those assigned to control. Estimation proceeds assuming independent outcomes and the following model:

Figure 1: Continuous outcome. A) Power and B) probability of stopping early. Panels correspond to various maximum sample sizes (max ss). Points are jittered horizontally.

Figure 2: Binary outcome. A) Power and B) probability of stopping early. Panels correspond to various maximum sample sizes (max ss). Points are jittered horizontally.

Figure 3: Time-to-event outcome. A) Power and B) probability of stopping early. Panels correspond to various maximum sample sizes (max ss). Points are jittered horizontally.

Figure 4: COVID-19 trial with binary outcome. A) Power and B) probability of stopping early. Points are jittered horizontally.
(d) Apply a log{−log[·]} transformation to yield a single sample from the posterior distribution of log{h(t | A = 1)}.
5. This yields S samples from the posterior distribution of log{h(t | A = 1)}.
6. Repeat steps 3-4 for A_i = 0 to yield S samples from the posterior distribution of log{h(t | A = 0)}.
7. Subtract and then exponentiate to obtain S samples from the posterior distribution of the marginal hazard ratio, γ(θ).

B Ascertainment of Marginal Estimand Values

B.1 Ascertainment of Marginal Relative Risk (Binary Outcome)

Figure D.1: Continuous outcome. A) Posterior median bias and B) root mean squared error. Panels correspond to various maximum sample sizes (max ss). Points are jittered horizontally.

Figure F.1: 100 posterior distributions of the relative risk (top) and log-relative risk (bottom) for a trial with a binary endpoint and maximum sample size of 100. Vertical line represents the null treatment effect.
Figures F.1 and F.2 display 100 posterior distributions (Figure F.1) and 100 posterior medians (Figure F.2) for the null treatment effect for a trial with a binary endpoint under a maximum sample size of 100. The vertical lines represent the null values used for calculation of bias. In the top panel of Figure F.1, on the relative risk scale, we see many right-skewed posteriors, which lead to some posterior median estimates that are much greater than the null (Figure F.2, top panel). This induces positive bias under the null. When we move to the log scale, the right-skewed distributions become more normal in shape and are more centered around the null, but some of the posteriors which were closer to the lower boundary of 0 on the relative risk scale become left-skewed (Figure F.1, bottom panel). This results in some posterior medians becoming much less than the null (Figure F.2, bottom panel), which leads to negative bias under the null, as in the continuous endpoint case. When calculating bias for non-null treatment effects on the relative risk and hazard ratio scales, the posteriors are pushed further toward 0 than in the figures included in this section, which is why they can still attain negative values and clearly exhibit overestimation in these cases.

Figure F.2: 100 posterior medians of the relative risk (top) and log-relative risk (bottom) for a trial with a binary endpoint and maximum sample size of 100. Vertical line represents the null treatment effect.

Table 2: Continuous outcome. Type 1 error rate (T1E), bias under the null (Bias*), and expected sample size at three different values of the marginal difference in means (γ).

Table 3: Binary outcome. Type 1 error rate (T1E), bias under the null (Bias*), and expected sample size at three different values of the marginal relative risk (γ).

Table 4: Time-to-event outcome. Type 1 error rate (T1E), bias under the null (Bias*), and expected sample size at three different values of the marginal hazard ratio (γ).

Table 5: COVID-19 trial with binary outcome. Type 1 error rate (T1E), bias under the null (Bias*), and expected sample size at three different values of the marginal relative risk (γ).