A comparison of model-free phase I dose escalation designs for dual-agent combination therapies

It is increasingly common for therapies in oncology to be given in combination. In some cases, patients can benefit from the interaction between two drugs, although often at the risk of higher toxicity. A large number of designs to conduct phase I trials in this setting are available, where the objective is to select the maximum tolerated dose combination. Recently, a number of model-free (also called model-assisted) designs have provoked interest, providing several practical advantages over the more conventional approaches of rule-based or model-based designs. In this paper, we demonstrate a novel calibration procedure for model-free designs to determine their most desirable parameters. Under the calibration procedure, we compare the behaviour of model-free designs to model-based designs in a comprehensive simulation study, covering a number of clinically plausible scenarios. It is found that model-free designs are competitive with the model-based designs in terms of the proportion of correct selections of the maximum tolerated dose combination. However, there are a number of scenarios in which model-free designs offer a safer alternative. This is also illustrated in the application of the designs to a case study using data from a phase I oncology trial.


Introduction
The aim of phase I clinical trials investigating a single therapy is to find the highest dose that can be administered whilst ensuring that patients are at a low risk of serious side effects. To offer patients a higher chance of successful treatment, there is willingness to accept a dose that leads to more toxic responses, commonly labelled as dose-limiting toxicities (DLTs). The highest dose for which the treatment has a prespecified probability of leading to a toxic outcome is called the maximum tolerated dose (MTD). In an analysis of over 400,000 clinical trials conducted between 2000 and 2015 [21], it was found that 57.6% of all phase I oncology trials successfully progressed to phase II. By comparison, in 73% of trials excluding oncology, treatments were successful in moving to phase II. This demonstrates the importance of successful dose-finding methods in oncology, where drugs are clearly harder to develop.
In this work, we consider phase I oncology trials in which a combination of two therapies is investigated. Here the objective is to identify a maximum tolerated dose combination (MTC), the highest dose combination with a probability of toxicity at the target. Phase I oncology trials in this dual-agent setting have recently provoked notable interest [22]. In particular, it was found that immunotherapy, a targeted agent that stimulates the immune system to fight cancerous cells [3], can provide benefit to patients when administered in combination with chemotherapy or another targeted agent [17]. One difficulty in the dual-agent setting is that the order of toxicity is unknown for some combinations: if the amount of one compound in the combination is increased while another is decreased, it is unknown whether the overall toxicity goes up or down.
A number of dose-finding methods for dual-agent combination phase I trials that relax the monotonicity assumption on the order of some of the combinations have been proposed in the literature. They broadly belong to one of three categories: rule-based, model-based and model-free (also known as model-assisted) designs. Rule-based designs (e.g. 3+3+3 or extensions of this [6]) rely on a number of prespecified rules to determine when a dose is escalated, de-escalated and chosen as the MTD. Model-based designs (e.g. the six-parameter model [19]) model the relationship between dose and probability of toxicity through a parametric function. Through the course of a trial, parameter estimates are updated to better describe this relationship. The model-free designs [1,10] do not pre-specify any relationship between dose and toxicity, and thus do not rely on any parametric assumptions in their search for the MTD. However, unlike rule-based designs, the decision process in which the dose can be escalated or de-escalated is assisted with a statistical model.
Despite numerous papers demonstrating flaws in rule-based designs and their performance in drug combination trials [15,18,2], it was reported that less than 5% of combination trials in oncology between 2011 and 2013 deviated from rule-based designs [16]. It is perhaps the restrictions associated with model-based designs, such as difficulty of implementation or communication to clinicians, that have made these less commonly used in real trials. Recently, model-free designs have attracted attention due to their practicality [20], although these have not yet been fully evaluated in the literature.
The objective of this work is to review four recently proposed model-free dose-finding designs for phase I dual-agent combination studies, namely, the Bayesian Optimal Interval design [7, BOIN], the Keyboard design [23, KEY], the surface-free design [9, SFD], and the product of independent beta probabilities design [8, PIPE]. We evaluate their performance in an extensive simulation study. To compare the methods on equal grounds, we propose a calibration procedure that selects the parameters of each of the designs that maximise the proportion of correct selections (subject to a safety constraint). We compare the performance of these designs to the Bayesian Logistic Regression Model (BLRM), a model-based approach that uses a two-parameter logistic model for each compound [13], as well as a non-parametric optimal benchmark. We also evaluate the performance of each of the designs in a case study of neratinib and temsirolimus [5], to highlight the differences between approaches in a real trial setting.
The rest of the paper continues as follows. We provide a review of model-free designs in Section 2, before using a novel method to calibrate the parameters of each design leading to good performance in Section 3. We then present detailed results from our simulation study across a wide range of toxicity scenarios in Section 4, including a conventional model-based design for comparison. Each design is also applied to a real case study of a dose-finding trial of combination therapies in Section 5. We finish with a discussion of our results and thoughts in Section 6.

Methodological Review
In this section, we describe the dose escalation procedure for each of the four approaches in a general dose-finding trial. It is assumed patients enter the trial in cohorts, and the dose combination for the next cohort is assigned once the previous cohort's responses are available. We first define the admissible combinations for each design. These are the dose combinations that are allowable for assignment for the next cohort of patients based on the last tested combination. We then describe the details of the escalation procedure in each of the designs in the following setting. Consider a dual-agent trial with $I$ doses of drug A and $J$ doses of drug B, denoted $d^A_i$ and $d^B_j$ for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, so that $d_{ij}$ denotes the combination of $d^A_i$ and $d^B_j$. The total number of patients who receive combination $d_{ij}$ and the number of those who experience a toxic response on $d_{ij}$ during the trial are denoted $n_{ij}$ and $y_{ij}$ respectively. The probability of toxic response at $d_{ij}$ is written as $\pi_{ij}$ and the target toxicity is denoted $\phi$.

Admissible Combinations
Before deciding on a dose for the next cohort, each design defines a set of combinations that are admissible, i.e. combinations that the next cohort could be allocated to. These are best illustrated with a diagram, Figure 1. Suppose we are at $d_{22}$ in Figure 1, indicated by the ' ' symbol. Admissible combinations for the BOIN and KEY designs are the current combination and the combinations adjacent to it, represented by the ' ' symbols.
In addition to these combinations, the other designs we consider also allow for diagonal de-escalation, where the next cohort is administered a combination that is one dose level lower in each drug, and also allow for anti-diagonal escalation, meaning the next cohort receives a combination that is one dose level higher in one drug and one dose level lower in the other. These are depicted by the ' * ' symbols in Figure 1, where reaching $d_{11}$ requires diagonal de-escalation and reaching $d_{31}$ or $d_{13}$ requires anti-diagonal escalation. The rationale is that by enabling faster movement across the combination grid, the design can move to the MTC quickly, and de-escalate quickly if patients are treated at highly toxic combinations.
All designs prohibit diagonal escalation, where the next cohort receives a combination one dose level higher in each drug, and no dose levels can be skipped. These non-admissible doses are shown by the '×' in the red cells in Figure 1.
Figure 1: Illustration of the admissible combinations for each design. The ' ' symbol illustrates the current dose combination, and the symbols ' ' and ' * ' represent the possible combinations the next cohort could receive for different designs.
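The admissible sets described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `admissible` and the flag `allow_diagonal_moves` are hypothetical and do not come from any of the design papers; dose levels are 1-indexed to match $d_{ij}$.

```python
def admissible(i, j, n_a, n_b, allow_diagonal_moves):
    """Return the set of (i, j) dose-level pairs the next cohort may receive.

    `allow_diagonal_moves` is a hypothetical flag: False mimics the BOIN and
    KEY designs, True mimics the designs that additionally permit diagonal
    de-escalation and anti-diagonal moves.
    """
    # Stay at the current combination, or change exactly one drug by one level.
    moves = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    if allow_diagonal_moves:
        # Diagonal de-escalation and the two anti-diagonal moves.
        moves += [(-1, -1), (1, -1), (-1, 1)]
    # Diagonal escalation (+1, +1) is never included, for any design.
    return {
        (i + di, j + dj)
        for di, dj in moves
        if 1 <= i + di <= n_a and 1 <= j + dj <= n_b
    }


# Current combination d_22 on a 3 x 3 grid, as in Figure 1.
boin_key = admissible(2, 2, 3, 3, allow_diagonal_moves=False)
others = admissible(2, 2, 3, 3, allow_diagonal_moves=True)
print(sorted(boin_key))           # same combination plus the four adjacent ones
print(sorted(others - boin_key))  # extra combinations reached by diagonal moves
```

From $d_{22}$, the second set adds $d_{11}$, $d_{13}$ and $d_{31}$, while $d_{33}$ is excluded in both cases.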

The BOIN Design
The Bayesian Optimal Interval (BOIN) design [7] uses the intuitive estimator $\hat{\pi}_{ij} = y_{ij}/n_{ij}$ for the probability of toxicity at combination $d_{ij}$, so that $\hat{\pi}_{ij}$ is the proportion of observed toxic responses on $d_{ij}$ across the whole trial. The estimator $\hat{\pi}_{ij}$ only updates after patient responses are observed on $d_{ij}$, and is then used to guide dose escalation. This escalation process is defined by pre-specified values $\lambda_e$ and $\lambda_d$, to which $\hat{\pi}_{ij}$ is compared after each cohort. Values of $\lambda_e < \phi$ and $\lambda_d > \phi$ are chosen to locally minimise the chance of incorrect escalation and de-escalation decisions during a trial, and are calculated using constants $\phi_1$ and $\phi_2$. Whilst $\phi$ is the target toxicity, $\phi_1$ is the highest toxicity probability deemed sub-therapeutic and $\phi_2$ is the lowest toxicity probability deemed overly toxic. These can be specified by the clinicians. $\lambda_e$ and $\lambda_d$ are defined in Equation (1):
$$\lambda_e = \frac{\log\left(\frac{1-\phi_1}{1-\phi}\right)}{\log\left(\frac{\phi(1-\phi_1)}{\phi_1(1-\phi)}\right)}, \qquad \lambda_d = \frac{\log\left(\frac{1-\phi}{1-\phi_2}\right)}{\log\left(\frac{\phi_2(1-\phi)}{\phi(1-\phi_2)}\right)}. \tag{1}$$
Both $\lambda_e$ and $\lambda_d$ are invariant to $d_{ij}$, $n_{ij}$ and $y_{ij}$, so that optimising these parameters depends only on constants $\phi$, $\phi_1$ and $\phi_2$. After defining $\lambda_e$ and $\lambda_d$, the rules for the dose-finding procedure are as follows:
• If $\hat{\pi}_{ij} \le \lambda_e$, the next combination is chosen from $A_E = \{d_{(i+1)j}, d_{i(j+1)}\}$.
• If $\hat{\pi}_{ij} \ge \lambda_d$, the next combination is chosen from $A_D = \{d_{(i-1)j}, d_{i(j-1)}\}$.
• Otherwise, $\lambda_e < \hat{\pi}_{ij} < \lambda_d$ and the next combination is the same.
In this way, dose skipping, diagonal escalation and diagonal de-escalation are prohibited (see Section 2.1 for more details). If the next combination is to be chosen from an empty $A_E$ or $A_D$ (for example, the current combination is the highest in both doses and the design chooses to escalate), then the next cohort receives the same combination. The design assumes each patient response is independent, $y_{ij} \sim \text{Binomial}(n_{ij}, \pi_{ij})$, and assigns a vague Beta(1,1) prior distribution to each $\pi_{ij}$, giving the posterior distribution for $\pi_{ij}$ as
$$\pi_{ij} \mid n_{ij}, y_{ij} \sim \text{Beta}(1 + y_{ij},\, 1 + n_{ij} - y_{ij}). \tag{2}$$
To choose between combinations in the chosen set, the BOIN design computes the posterior probability $P(\pi_{ij} \in (\lambda_e, \lambda_d) \mid n_{ij}, y_{ij})$. The combination maximising this probability is administered to the next cohort. For combinations yet to be tested, this probability is based on the vague prior distribution only. In the event of ties, which always occur when multiple potential combinations are yet to be administered, the next combination is selected at random from the chosen set. Note that no toxicity information is borrowed between the combinations under this model, as the combinations are treated independently. The design uses an overdosing criterion stating that a combination satisfying $P(\pi_{ij} > \phi \mid n_{ij}, y_{ij}) \ge \epsilon_{\text{BOIN}}$ for some overdosing probability threshold $0 < \epsilon_{\text{BOIN}} \le 1$, and any combination that is more toxic under monotonicity, cannot be administered to the next cohort. For the BOIN design, if $d_{ij}$ satisfies this condition, dose $d_{ij}$ and higher combinations are eliminated from the trial, and the dose maximising $P(\pi_{ij} \in (\lambda_e, \lambda_d) \mid n_{ij}, y_{ij})$ within $A_D$ is chosen for the next cohort. If combination $d_{11}$ satisfies the overdosing criterion, the trial is terminated early for safety.
After all patients are treated, estimates of each $\pi_{ij}$ are calculated via matrix isotonic regression [4]. This simple technique guarantees that estimates of $\pi_{ij}$ at higher combinations are at least as high as estimates of $\pi_{ij}$ at lower combinations, which follows the assumption of monotonicity. The MTC is selected as the combination whose isotonic estimate of $\pi_{ij}$ is closest to $\phi$.
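To make the decision step concrete, the following Python sketch applies the interval rule and the overdosing check to the data observed at a single combination. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the variable `eps_boin` (the overdosing threshold) are hypothetical, and the numerical inputs are the calibrated values reported later in the paper (boundaries 0.245 and 0.359, threshold 0.84). Because the Beta(1,1) prior gives a posterior with integer parameters, the Beta distribution function can be computed exactly from a binomial sum without external libraries.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(a, n + 1))


def boin_step(y, n, lam_e, lam_d, phi, eps_boin):
    """Return (decision, P(pi > phi | data), overdose flag) for one combination."""
    pi_hat = y / n
    if pi_hat <= lam_e:
        decision = "escalate"
    elif pi_hat >= lam_d:
        decision = "de-escalate"
    else:
        decision = "stay"
    # Posterior under the Beta(1, 1) prior is Beta(1 + y, 1 + n - y).
    p_over = 1 - beta_cdf_int(phi, 1 + y, 1 + n - y)
    return decision, p_over, p_over >= eps_boin


# One cohort of three patients with 0, 1 or 2 toxic responses.
for y in (0, 1, 2):
    print(y, boin_step(y, 3, lam_e=0.245, lam_d=0.359, phi=0.30, eps_boin=0.84))
```

With these inputs, one toxicity in three patients keeps the trial at the same combination, whereas two in three both trigger de-escalation and flag the combination under the overdosing criterion.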

The Keyboard Design
The Keyboard design (KEY) [23] is very similar to the BOIN design, defining an interval about the target toxicity $\phi$, denoted $I_{\text{target}} = (\phi - \Delta_1, \phi + \Delta_2)$, for constants $\Delta_1, \Delta_2 > 0$, which can be chosen by the clinicians. A combination with estimated toxicity probability within this interval is said to have acceptable toxicity. The design then divides the (0,1) space into "keys", defined as intervals $I_t$ of equal length $\Delta_1 + \Delta_2$ (allowing for shorter keys at either end of (0,1)) for $t = 1, \ldots, T$, where $T$ is the number of keys. The interval $I_{\text{target}}$ is fixed pre-trial, chosen to minimise the chance of incorrect escalation and de-escalation decisions.
The KEY design assigns a vague Beta(1,1) prior distribution to each $\pi_{ij}$, and assumes that the number of toxic responses follows a binomial distribution, $y_{ij} \sim \text{Binomial}(n_{ij}, \pi_{ij})$. The posterior distribution for each $\pi_{ij}$ is computed as in Equation (2). Again, this means there is no borrowing of toxicity information across combinations. The design then identifies the key $I_t$ that is most likely to contain $\pi_{ij}$, labelled $I_{\max}$,
$$I_{\max} = \underset{I_t,\; t = 1, \ldots, T}{\arg\max}\; P(\pi_{ij} \in I_t \mid n_{ij}, y_{ij}). \tag{3}$$
Once the key $I_{\max}$ is identified, escalation and de-escalation decisions happen as follows:
• If $I_{\max} < I_{\text{target}}$, the next combination is chosen from $A_E = \{d_{(i+1)j}, d_{i(j+1)}\}$.
• If $I_{\max} > I_{\text{target}}$, the next combination is chosen from $A_D = \{d_{(i-1)j}, d_{i(j-1)}\}$.
• If $I_{\max} = I_{\text{target}}$, the next combination is the same.
To choose between combinations in $A_E$ (or $A_D$), the design computes the posterior probability $P(\pi_{ij} \in I_{\text{target}} \mid n_{ij}, y_{ij})$ for each candidate. The combination maximising this probability is administered to the next cohort. The remainder of the escalation process and the selection of the MTC is analogous to the BOIN design, with an identical overdosing rule using threshold $\epsilon_{\text{KEY}}$ and the MTC chosen via isotonic regression [4].
Both the BOIN and KEY designs model dose combinations independently. In the following two designs, however, the connections between the dose combinations are also taken into account.

The Surface-Free Design
The surface-free design (SFD) [9] does not restrict the MTC search to a parametric surface and does not require the order of toxicity between combinations to be known. The main idea is to parametrise ratios between toxicity probabilities for different combinations, defining $\theta = 1 - \pi_{11}$ and $\theta_i = \frac{1-\pi_{i,j}}{1-\pi_{i-1,j}}$. Then $\theta$ is the probability of a patient having no toxic response on the lowest dose combination and $\theta_i$ denotes the ratio between the probability of a patient having no toxic response on dose combinations $d_{ij}$ and $d_{(i-1)j}$ for $i = 2, \ldots, I$. Similarly, $\tau_j = \frac{1-\pi_{i,j}}{1-\pi_{i,j-1}}$ is defined as the ratio between the probability of a patient having no toxic response on $d_{ij}$ and $d_{i(j-1)}$ for $j = 2, \ldots, J$. Thus, the probability of toxicity for each combination $d_{ij}$ is
$$\pi_{ij} = 1 - \theta \prod_{k=2}^{i} \theta_k \prod_{l=2}^{j} \tau_l, \tag{4}$$
where an empty product is taken to equal 1. Due to monotonicity, each ratio $\theta_i, \tau_j \in (0, 1)$, and the SFD assigns each of these ratios an independent Beta prior distribution. The hyper-parameters of the prior distributions can be chosen to match the clinicians' prior mean estimates of toxicity probability on each combination and effective sample sizes.
After each cohort, the SFD updates the posterior means for ratios $\theta, \theta_2, \ldots, \theta_I, \tau_2, \ldots, \tau_J$ using Bayes' theorem, which can be related back to $\pi_{ij}$ through Equation (4) to give estimates of the toxicity probabilities. In this way, the SFD borrows information across the drug combinations previously tested in the trial to make an informed decision on escalation. Additionally, the continual multiplication of Beta random variables implies that $\pi_{ij}$ for higher combinations has higher variance, allowing for more cautious escalation at higher combinations. Considering all neighbouring combinations apart from the one higher in both doses, the next combination is chosen as the one with estimated $\pi_{ij}$ closest to $\phi$. An overdosing criterion prohibits any combination from being administered if $P(\pi_{ij} > \phi \mid n_{ij}, y_{ij}) \ge \epsilon_{\text{SFD}}$ for some $\epsilon_{\text{SFD}} > 0$, and the trial is terminated if this is satisfied for $d_{11}$.
Once all patients have been treated, the MTC is selected as the combination with toxicity probability closest to $\phi$. Note that the SFD is more computationally intensive than the other model-free designs, as MCMC methods are required to sample from the posterior distribution.

The PIPE Design
The PIPE design [8] differs from other model-free designs in that it was originally proposed to find the MTC contour, labelled $\mathrm{MTC}_\phi$. This is a line partitioning the combination space into safe and overly toxic combinations. Those below the contour are believed to have toxicity probability less than target toxicity $\phi$, whilst those above are believed to have toxicity probability greater than $\phi$.
Assuming the $\pi_{ij}$ are independent, they are assigned a Beta prior distribution, $\pi_{ij} \sim \text{Beta}(a_{ij}, b_{ij})$ for hyper-parameters $a_{ij}$ and $b_{ij}$, for $i = 1, \ldots, I$ and $j = 1, \ldots, J$. Priors can be prespecified if knowledge on the toxicity of combinations is available. Assuming each patient is independent such that $y_{ij} \sim \text{Binomial}(n_{ij}, \pi_{ij})$ for all $i, j$, the posterior for $\pi_{ij}$ can be written as
$$\pi_{ij} \mid n_{ij}, y_{ij} \sim \text{Beta}(a_{ij} + y_{ij},\, b_{ij} + n_{ij} - y_{ij}). \tag{5}$$
The posterior distribution is only updated after a cohort of patients is treated on the corresponding combination, but the $\mathrm{MTC}_\phi$ is re-estimated regardless of which combination was tested. The monotonicity assumption means that the PIPE design needs only to consider contours satisfying this property, limiting the number of possible contours to $\binom{I+J}{I}$. Each contour can be represented by a binary matrix, where entries are 0 or 1 depending on whether estimates of the toxicity probability for a combination are below or above the contour respectively. Let $\vartheta$ be the set of all monotonic contours for an $I \times J$ dose combination space and define $C_s \in \vartheta$ as the binary matrix representing the contour $s = 1, \ldots, \binom{I+J}{I}$.
To estimate the $\mathrm{MTC}_\phi$ given the current data, the design calculates the posterior probability of each toxicity probability being less than or equal to $\phi$, that is
$$p_{ij} = P(\pi_{ij} \le \phi \mid n_{ij}, y_{ij}), \tag{6}$$
where the right-hand side of Equation (6) is equal to the cumulative distribution function of the Beta posterior distribution evaluated at $\phi$. Equation (7) gives the general formula for calculating the probability that the $\mathrm{MTC}_\phi$ is defined by the matrix $C_s$:
$$P(\mathrm{MTC}_\phi = C_s \mid \text{data}) = \prod_{i=1}^{I} \prod_{j=1}^{J} (1 - p_{ij})^{C_s[i,j]}\, p_{ij}^{\,1 - C_s[i,j]}, \tag{7}$$
where $[i, j]$ represents the entry in the $i$th row and $j$th column of the binary matrix $C_s$. The contour maximising Equation (7) is the contour most likely to be the $\mathrm{MTC}_\phi$ given the current data. This contour then assists the escalation process by identifying the combinations closest to it, before the design selects one of these for the next cohort based on a weighted randomisation procedure. This involves weighting each combination by the inverse of its sample size, with the rationale being varied experimentation around the $\mathrm{MTC}_\phi$. Escalation continues in this way until all patients are treated, at which point all combinations closest from below the $\mathrm{MTC}_\phi$ are recommended for phase II. The design uses an overdosing rule which considers the expected probability of $d_{ij}$ being above the most probable $\mathrm{MTC}_\phi$, averaged over all monotonic contours. This quantity is denoted $q_{ij}$, and $d_{ij}$ cannot be administered to the next cohort if $q_{ij} \ge \epsilon_{\text{PIPE}}$ for some $\epsilon_{\text{PIPE}} > 0$. A trial is terminated if combination $d_{11}$ satisfies this condition.
The PIPE design can recommend multiple combinations for phase II, as it recommends all combinations closest from below its $\mathrm{MTC}_\phi$. For consistency across designs, in our implementation we ensure only one combination is recommended as the MTC. Therefore, for each recommended combination, we find the posterior mean probability of toxicity, which can be calculated using the posterior distributions in Equation (6). The combination with posterior mean closest to $\phi$ is selected as the MTC, only choosing a combination at random in the event of a tie.
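The contour search can be illustrated with a short Python sketch. It enumerates every monotonic binary matrix on a 3 × 3 grid by brute force and picks the one with the largest product of cell-wise probabilities, where a cell coded 1 (above the contour) contributes one minus the probability that its toxicity is at most $\phi$, and a cell coded 0 contributes that probability itself. The function names and the matrix `p` of posterior probabilities are hypothetical values chosen for illustration; in a real trial they would come from the Beta posteriors.

```python
from itertools import product
from math import comb, prod


def monotone_contours(n_a, n_b):
    """All n_a x n_b binary matrices that are non-decreasing along rows and columns."""
    contours = []
    for bits in product((0, 1), repeat=n_a * n_b):
        c = [bits[r * n_b:(r + 1) * n_b] for r in range(n_a)]
        rows_ok = all(c[i][j] <= c[i][j + 1]
                      for i in range(n_a) for j in range(n_b - 1))
        cols_ok = all(c[i][j] <= c[i + 1][j]
                      for i in range(n_a - 1) for j in range(n_b))
        if rows_ok and cols_ok:
            contours.append(c)
    return contours


def contour_weight(c, p):
    """Product over cells of (1 - p_ij) if the cell is above the contour, else p_ij."""
    return prod(
        (1 - p[i][j]) if c[i][j] else p[i][j]
        for i in range(len(c)) for j in range(len(c[0]))
    )


# Hypothetical values of P(pi_ij <= phi | data), for illustration only.
p = [[0.95, 0.80, 0.40],
     [0.80, 0.45, 0.20],
     [0.40, 0.20, 0.05]]

contours = monotone_contours(3, 3)
best = max(contours, key=lambda c: contour_weight(c, p))
print(len(contours), comb(6, 3))  # number of monotonic contours on a 3 x 3 grid
for row in best:
    print(row)
```

Brute-force enumeration over all $2^{IJ}$ binary matrices is only practical for small grids such as the one considered here; it is used in this sketch for transparency rather than efficiency.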

Calibration of Designs
Model-based and model-free designs based on a Bayesian framework give clinicians more control over their performance. The PIPE design, the SFD and most model-based designs allow for knowledge on the toxicity of each drug from monotherapy trials to be incorporated into the design through their prior distributions. As the BOIN and KEY designs assign vague priors to the toxicity probabilities, their behaviour is primarily determined by the pre-defined intervals guiding escalation. Although it is in theory possible to incorporate historical data through the prior in the BOIN and KEY designs, doing so in this comparison would defeat the purpose of a design with all escalation boundaries pre-specified at the design stage for ease of implementation.
In this comparison study, the purpose of the calibration procedure is to give all designs a set-up which leads to consistently high proportions of selections of combinations with toxicity probability close to $\phi$ in all scenarios. To achieve this, we calibrate each design using a novel two-stage approach. The first stage of the calibration is concerned with choosing values for hyper-parameters that give a good performance in selecting the MTC without considering safety. The second stage then focusses on safety, calibrating the overdose rule taking into account not only good performance in terms of correct selections, but also the number of patients who are treated at unsafe doses.
This approach employs a grid search over hyper-parameter or interval values (depending on the design), each stage involving running simulations over four clinically plausible scenarios and determining which values lead to superior performance. We refer to the priors resulting in superior performance across the four scenarios as operational priors. For the purposes of this work, they provide a fairer comparison between the Bayesian designs, addressing the challenge of ensuring that the same amount of prior information is used for each design. The obtained operational priors are also applicable in the practical case where no reliable prior information about the compound is available.
To evaluate which design inputs lead to superior performance in recommending the MTC in the first stage, the proportion of correct selections (PCS) is examined in each of the four scenarios. That is, the proportion of trials in which a design selects any combination with a true toxicity probability of exactly 0.30. To summarise overall performance across these four scenarios, the geometric mean PCS is considered. Suppose $x_1, \ldots, x_N$ represent the PCS in $N$ scenarios. The geometric mean, $\left(\prod_{n=1}^{N} x_n\right)^{1/N}$, is used instead of the arithmetic mean because it has the useful property of penalising cases in which PCS are more dispersed across scenarios. The design with priors or intervals resulting in the highest geometric mean PCS across the four scenarios will be the design variant we choose. For the remainder of this section, the mean will refer to the geometric mean. We note that during the first stage of the calibration procedure, no overdosing rules are included, meaning no trials are stopped before all patients have been recruited, because we choose to calibrate the parameter controlling the overdosing rule in the separate second stage. Once the first stage of calibration is complete, this will lead to the selection of intervals for the BOIN and KEY designs, and operational priors for the PIPE, SFD and BLRM designs.
The second stage of the calibration procedure is for $\epsilon$, the parameter regulating the overdosing rule in each model-free design. Calibration of $\epsilon$ involves decreasing its value starting from 1, and observing the proportion of correct outcomes in the chosen scenarios. As selecting overly toxic combinations is more of an ethical concern, as a general rule we take as a starting point the highest value of $\epsilon$ resulting in at least 85% of trials recommending no combinations when considering an overly toxic scenario. We acknowledge this proportion may differ in practice depending on the clinicians' judgement. It is important to note that the interpretation of $\epsilon$ differs between designs because of the construction of each overdosing rule, and this should be accounted for when communicating with clinicians. This is reflected by subscripts for the individual designs in the following specifications. The second stage of the calibration procedure for each design is illustrated in Figure 2.
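The penalising property of the geometric mean can be seen in a small numerical example. The PCS values below are hypothetical and chosen only so that both sets share the same arithmetic mean; they are not results from the simulation study.

```python
from statistics import geometric_mean, mean

# Hypothetical PCS values in four scenarios, for illustration only.
even = [0.40, 0.40, 0.40, 0.40]    # consistent performance
uneven = [0.70, 0.10, 0.70, 0.10]  # same arithmetic mean, more dispersed

print(mean(even), mean(uneven))                      # arithmetic means coincide
print(geometric_mean(even), geometric_mean(uneven))  # geometric mean drops when dispersed
```

A design variant that performs very well in some scenarios but poorly in others is therefore ranked below one with the same average but more consistent performance.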

Setting
Each design is calibrated in the same setting that is then explored in the simulation study (see Section 4.1), representative of a phase I trial in oncology. There are two drugs with three dose levels each, which results in nine combinations, and the first cohort is treated at the lowest combination. The objective is to select a single combination as the MTC with true toxicity probability $\phi = 0.30$. The sample size is 36 patients, recruited in cohorts of three patients. All combination-toxicity scenarios are presented in Table 1. For calibration, however, only four scenarios are used: they are chosen to explore noticeably different clinical cases, in which the number and location of the MTCs vary, whilst restricting the number of scenarios keeps the procedure computationally feasible.
In stage 1 of the calibration procedure, Scenarios 1, 8, 10 and 13 are chosen. Scenarios 1 and 13 are chosen to represent the extremes: when the highest combination is the only true MTC and all others are safe, and when the lowest combination is the only true MTC and all others are overly toxic, respectively. Scenario 8 covers situations in which most combinations are safe but true MTCs do not lie on the same diagonal. Scenario 10 captures the case where most combinations are overly toxic and true MTCs lie on the same diagonal. Note that we often refer to the set of combinations in a scenario as the combination grid.
In stage 2, simulations are run for each design over Scenarios 8, 10, 13 and 14 for different values of $\epsilon$. In Scenarios 8, 10 and 13, the PCS is as previously defined, whilst in the unsafe Scenario 14 we consider the PCS as the proportion of trials in which no combinations are selected. We refer to selecting no combinations in Scenario 14 as the 'correct outcome'.

Calibrating the BOIN Design
To guide dose escalation, the BOIN design relies on the interval $(\lambda_e, \lambda_d)$ around the target toxicity. Interval boundaries $\lambda_e$ and $\lambda_d$ are a function of $\phi$, $\phi_1$ and $\phi_2$, where $\phi_1 = a_1\phi$ and $\phi_2 = a_2\phi$ for constants $a_1 < 1$, $a_2 > 1$. To calibrate the design, we run 4000 simulations for each scenario for pairs $(a_1, a_2)$ from the sets $a_1 = \{0.85, 0.80, \ldots, 0.40\}$ and $a_2 = \{1.15, 1.20, \ldots, 1.60\}$, resulting in a total of 100 pairs. As constants $a_1$ and $a_2$ deviate further from 1, the interval becomes wider, thus the design will choose to escalate and de-escalate on fewer occasions.
The optimal values are found to be $a_1 = 0.65$ and $a_2 = 1.4$. Substituting these into Equation (1) gives the interval boundaries $\lambda_e$ and $\lambda_d$, and hence the interval (0.245, 0.359) to guide dose escalation. This interval implies that escalation occurs if $\hat{\pi}_{ij}$ is below 0.245, de-escalation occurs if $\hat{\pi}_{ij}$ is above 0.359, and otherwise the combination remains the same.
In the second stage of calibration, we find that as $\epsilon_{\text{BOIN}}$ decreases, the design benefits more in Scenario 14, where the proportion of trials in which no combinations are recommended increases (see Figure 2). For $\epsilon_{\text{BOIN}} \le 0.84$, over 85% of trials recommend no combinations in Scenario 14. The trade-off in the other scenarios is that, as $\epsilon_{\text{BOIN}}$ increases, the PCS increases steeply but so does the number of patients treated on overly toxic doses. Therefore $\epsilon_{\text{BOIN}} = 0.84$ is chosen.
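As a check on the arithmetic, the Python sketch below builds the calibration grid and evaluates the boundaries at the selected pair. It assumes the standard BOIN boundary formulas from the original BOIN proposal; the function name `boin_boundaries` is hypothetical. Under that assumption the computed boundaries agree with the reported interval (0.245, 0.359) to three decimal places.

```python
from math import log


def boin_boundaries(phi, phi1, phi2):
    """Escalation and de-escalation boundaries (standard BOIN formulas, assumed)."""
    lam_e = log((1 - phi1) / (1 - phi)) / log(phi * (1 - phi1) / (phi1 * (1 - phi)))
    lam_d = log((1 - phi) / (1 - phi2)) / log(phi2 * (1 - phi) / (phi * (1 - phi2)))
    return lam_e, lam_d


phi = 0.30

# Calibration grid: a1 in {0.85, 0.80, ..., 0.40}, a2 in {1.15, 1.20, ..., 1.60}.
a1_grid = [round(0.85 - 0.05 * k, 2) for k in range(10)]
a2_grid = [round(1.15 + 0.05 * k, 2) for k in range(10)]
pairs = [(a1, a2) for a1 in a1_grid for a2 in a2_grid]
print(len(pairs))  # number of (a1, a2) pairs in the grid search

# Boundaries at the selected pair a1 = 0.65, a2 = 1.40.
lam_e, lam_d = boin_boundaries(phi, 0.65 * phi, 1.40 * phi)
print(round(lam_e, 3), round(lam_d, 3))
```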

Calibrating the Keyboard Design
Using a similar method to BOIN, we first calibrate the parameters which define the interval for KEY. The interval $I_{\text{target}} = (b_1, b_2)$ guides escalation entirely, so it is an important component of the design. We run 4000 simulations across each scenario for pairs $(b_1, b_2)$ from the sets $b_1 = \{0.27, 0.25, \ldots, 0.19\}$ and $b_2 = \{0.33, 0.35, \ldots, 0.41\}$, resulting in a total of 25 pairs. Mean PCS are displayed in Figure 2 in the online supplementary materials, indicating that interval (0.21, 0.39) yields the highest mean PCS, which differs from the recommendation of (0.25, 0.35) in the original paper [23]. As explained in Section 2.3, this means escalation occurs only if the posterior probability $P(\pi_{ij} \in (0.03, 0.21) \mid n_{ij}, y_{ij})$ is higher than $P(\pi_{ij} \in (0.21, 0.39) \mid n_{ij}, y_{ij})$.
In the second stage of calibration, we find that as $\epsilon_{\text{KEY}}$ decreases, the design benefits more in Scenario 14, where the proportion of trials in which no combinations are recommended increases. Choosing $\epsilon_{\text{KEY}} = 0.84$ leads to approximately 85% of trials correctly selecting no combinations in Scenario 14, as shown in Figure 2, in line with the value obtained for the BOIN design.
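The keys implied by the calibrated interval, and the resulting decision for one cohort, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the helper names are hypothetical, and the posterior is the Beta(1 + y, 1 + n − y) distribution implied by the Beta(1,1) prior, whose distribution function is computed exactly from a binomial sum because its parameters are integers.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a, b (binomial-sum identity)."""
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(a, n + 1))


def make_keys(lower, upper):
    """Split (0, 1) into keys of width upper - lower, with shorter keys at the ends."""
    width = upper - lower
    edges = [lower, upper]
    while edges[0] - width > 0:
        edges.insert(0, round(edges[0] - width, 10))
    while edges[-1] + width < 1:
        edges.append(round(edges[-1] + width, 10))
    edges = [0.0] + edges + [1.0]
    return list(zip(edges[:-1], edges[1:]))


def keyboard_decision(y, n, keys, target):
    """Find the key with highest posterior probability and compare it to the target key."""
    a, b = 1 + y, 1 + n - y
    probs = [beta_cdf_int(hi, a, b) - beta_cdf_int(lo, a, b) for lo, hi in keys]
    strongest = keys[probs.index(max(probs))]
    if strongest[1] <= target[0]:
        return "escalate"
    if strongest[0] >= target[1]:
        return "de-escalate"
    return "stay"


target = (0.21, 0.39)
keys = make_keys(*target)
print(keys)
for y in (0, 1, 2):
    print(y, keyboard_decision(y, 3, keys, target))
```

With a single cohort of three patients, zero toxicities lead to escalation, one toxicity keeps the combination unchanged, and two toxicities lead to de-escalation.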

Calibrating the Surface-Free Design
The SFD assigns Beta priors to each of its parameters, the ratios between toxicity probabilities. In this setting, there are five ratios ($\theta$, $\theta_2$, $\theta_3$, $\tau_2$ and $\tau_3$, defined in Section 2.4) to parametrise, meaning a total of 10 hyper-parameters for the Beta priors must be defined for the operational priors. Instead of specifying these directly, we specify a prior mean and prior effective sample size for each ratio, which can be used to calculate the corresponding hyper-parameters. To make the calibration task computationally feasible, we assume that all prior mean ratios, $m$, are equal (meaning the increase in dose corresponds to the same proportion increase in toxicity) and all effective sample sizes for each ratio, $s_{\text{SFD}}$, are equal. Thus we only need to calibrate pairs of $m$ and $s_{\text{SFD}}$, which we choose from sets $m = \{0.95, 0.925, \ldots, 0.85\}$ and $s_{\text{SFD}} = \{1, 2, \ldots, 5\}$.
For each pair, we run 500 simulations (which is lower than for the other model-free designs due to the computational demands of the design) and examine the mean PCS across the four scenarios. Our results in Figure 4 in the online supplementary materials show that the mean PCS is highest for $m = 0.875$ and $s_{\text{SFD}} = 4$. This is equivalent to every ratio being assigned the prior distribution Beta(3.5, 0.5), and corresponds to mean prior toxicity probabilities on $d_{11}$ and $d_{33}$ of 0.125 and 0.487 respectively.
For the calibration of $\epsilon_{\text{SFD}}$, shown in Figure 2, we found that $\epsilon_{\text{SFD}} = 0.65$ resulted in at least 85% of trials selecting no combinations in Scenario 14. There is evidence to suggest that increasing or decreasing $\epsilon_{\text{SFD}}$ has a sizeable effect not only on the PCS in Scenario 13, but also on the number of patients treated at unsafe doses, demonstrating that the design is highly sensitive to changes in its overdosing rule.
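The figures quoted above follow from two short calculations, sketched here in Python. A Beta prior with mean $m$ and effective sample size $s$ has hyper-parameters $ms$ and $(1-m)s$. Because the ratios have independent priors, the prior mean of a product of ratios is the product of their means, and the toxicity probability at the combination in row $i$ and column $j$ involves $i + j - 1$ ratios. The variable names are illustrative.

```python
m, ess = 0.875, 4  # calibrated prior mean ratio and effective sample size

# Beta hyper-parameters implied by the mean and effective sample size.
a, b = m * ess, (1 - m) * ess
print(a, b)

# Prior mean toxicity at d_ij: one minus the product of (i + j - 1) ratio means.
prior_mean_tox = [[1 - m ** (i + j - 1) for j in range(1, 4)] for i in range(1, 4)]
for row in prior_mean_tox:
    print([round(v, 3) for v in row])
```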

Calibrating the PIPE Design
Similar to the SFD, the PIPE designs assigns beta priors to each π ij .A prior mean and prior sample size for each π ij are specified, giving a total of 18 values to specify from which the hyperparameters for the beta priors can be calculated.To make calibration feasible, we assume that prior sample size s PIPE is equal for each combination and to set the prior means, we divide the grid of combinations into five diagonal segments, with toxicity increasing as we move through each segment.In this way, the design follows the monotonicity assumption.To assign a toxicity to each combination, we specify the toxicity of the lowest combination, ρ, and the size of the increments in toxicity between each segment, δ.In the illustration in Figure 3, we have chosen ρ = 0.05 and δ = 0.025 to construct the grid.
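One way to read this construction is sketched below in Python. Two points are assumptions rather than statements from the paper: the diagonal segment of a combination is taken to be determined by the sum of its two dose levels, and the prior sample size is taken to be the sum of the two Beta hyper-parameters. The value 1/18 used for the prior sample size is one of the candidate values considered in the calibration.

```python
rho, delta = 0.05, 0.025  # values used in the Figure 3 illustration
s_pipe = 1 / 18           # one candidate prior sample size (assumed a + b = s_pipe)

# Assumed segment index: 0 for d_11, rising by one along each anti-diagonal
# up to 4 for d_33, which gives five segments on a 3 x 3 grid.
prior_mean = [[rho + delta * (i + j) for j in range(3)] for i in range(3)]

# Beta hyper-parameters (a_ij, b_ij) with the given mean and prior sample size.
hyper = [[(mu * s_pipe, (1 - mu) * s_pipe) for mu in row] for row in prior_mean]

for row in prior_mean:
    print([round(mu, 3) for mu in row])
```

Under these assumptions the prior mean toxicity rises from 0.05 at the lowest combination to 0.15 at the highest.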
Our approach involves calibrating three parameters simultaneously to create operational priors, and are chosen from the sets ρ = {0.025,0.05, 0.075, 0.10}, δ = {0.025,0.05, 0.075, 0.10} and s PIPE = {1/72, 1/36, 1/18, 1/9}.For each triple, we run 2000 simulations in each of the four scenarios, which is fewer than for the BOIN and KEY designs due to the minor increase in computational expense.We provide one grid in Figure 5 in the online supplementary materials to account for mean PCS on each s PIPE value.The triple s PIPE = 1/18, ρ = 0.05 and δ = 0.025 leads to the highest mean PCS, although we observe that there were many triples that resulted in similar values.We note our choice of prior sample size, s PIPE = 1/18, only differs to the recommendation of 1/9 in the original paper [8].For prior sample sizes s PIPE ≤ 1/18, the design is found to be robust.Mean PCS only varies between 37% and 40%, suggesting that a number of operational priors could lead to consistently high PCS.
For the second stage of the calibration, the PIPE overdosing-rule parameter is varied as shown in Figure 2, and a value of 0.50 is chosen as it provides at least an 85% chance of correctly recommending no combinations in Scenario 14, as well as balancing the number of patients treated at unsafe doses in the four considered scenarios.

Simulation Study
In this section we describe the setting for the simulation study before presenting the results, including a comparison to a model-based approach and a non-parametric optimal benchmark.

Setting
In order to compare the discussed designs, we conduct a simulation study, performing 2000 simulations of each of the 15 scenarios depicted in Table 1 for all five designs. As before, the objective is to select a single combination as the MTC with true toxicity probability φ = 0.30. Any combination with probability of toxicity greater than 0.33 is labelled as overly toxic, and any combination with probability of toxicity in the interval [0.16, 0.33] is labelled as acceptable. In this section, the mean refers to the arithmetic mean. All simulations are carried out using R [14], with code provided in the online supplementary materials.
In general, the number of overly toxic combinations available for selection increases as we move through Scenarios 1 to 14. Scenario 1 has a single MTC which is the highest combination available. Scenarios 3 and 4 contain very few overly toxic combinations and have MTCs on the edge of the grid. Scenario 5 is similar to these, except its only MTC is located in the centre of the grid. In Scenarios 2, 6, 7, 8, 9 and 10, there are multiple combinations to explore which have toxicity probability φ. In particular, Scenarios 8 and 9 aim to investigate design behaviour when underlying MTCs are not on the same diagonal. Scenarios 11, 12 and 13 represent settings in which most combinations are overly toxic, meaning designs should avoid combinations away from d_11. Scenario 14 is of importance because all of its combinations are overly toxic, making the trial very unethical. In this instance, the only correct outcome is to recommend no combination for phase II. Scenario 15 represents a situation where all combinations are true MTCs, and is used to monitor escalation behaviour when combinations are safe and increasing the dose of either drug does not affect toxicity.
In order to accentuate the differences in the designs, we do not implement any accuracy or sufficient information rules, as these may mask some key elements of the designs. We focus on the operating characteristics of proportion of correct selections (PCS) and proportion of acceptable selections (PAS) as measures of accuracy, and proportion of overly toxic selections and the number of patients treated on unsafe dose combinations as measures of safety.
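To make these four operating characteristics concrete, the following Python sketch computes them from a set of simulated trials. It is a minimal illustration under stated assumptions: the function name, the data layout (one selected combination or `None` per trial, and one dictionary of patient counts per trial) and the toy toxicity values are all hypothetical, and the special case of Scenario 14, where recommending no combination is the correct outcome, is not handled.

```python
import math

def operating_characteristics(true_tox, selections, allocations,
                              target=0.30, lower=0.16, upper=0.33):
    """Summarise accuracy and safety over a list of simulated trials."""
    n_trials = len(selections)
    correct = acceptable = toxic = 0
    for sel in selections:
        if sel is None:                      # trial recommended no combination
            continue
        p = true_tox[sel]
        if p > upper:                        # overly toxic selection
            toxic += 1
        elif p >= lower:                     # within [0.16, 0.33]: acceptable
            acceptable += 1
            if math.isclose(p, target):      # true toxicity equals the target
                correct += 1
    unsafe = sum(n for alloc in allocations
                 for combo, n in alloc.items() if true_tox[combo] > upper)
    return {"PCS": correct / n_trials,
            "PAS": acceptable / n_trials,
            "overly_toxic": toxic / n_trials,
            "mean_unsafe_patients": unsafe / n_trials}

# Toy 2 x 2 example (hypothetical values, not taken from Table 1).
true_tox = {(0, 0): 0.10, (0, 1): 0.20, (1, 0): 0.30, (1, 1): 0.45}
selections = [(1, 0), (0, 1), (1, 1), None]
allocations = [{(1, 0): 9, (1, 1): 3}, {(0, 1): 12}, {(1, 1): 6}, {(0, 0): 3}]
summary = operating_characteristics(true_tox, selections, allocations)
print(summary)
```

Because every correct selection is also an acceptable one, the PAS computed this way is never smaller than the PCS.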

A Model-Based Comparator
To provide a comparison between model-free and model-based designs, we also consider a conventional model-based approach in our simulation study, the Bayesian Logistic Regression Model (BLRM) [13]. In this approach, the toxicity probabilities for each combination, π_ij, are modelled as in Equation 8 for i = 1, ..., I and j = 1, ..., J, where doses d^A_i and d^B_j are scaled by reference doses. Let d_ij be the combination of d^A_i and d^B_j, while n_ij and y_ij are the number of patients and toxic responses on each combination respectively. Parameters α_1 and β_1 describe the toxicity of drug A, α_2 and β_2 describe the toxicity of drug B, and η models the interaction between drugs. The five parameters are assigned normal prior distributions, and the likelihood is a product of Bernoulli densities, proportional to
\[ \prod_{i=1}^{I} \prod_{j=1}^{J} \pi_{ij}^{y_{ij}} (1 - \pi_{ij})^{n_{ij} - y_{ij}}. \]
After each cohort is observed, the joint posterior distribution is approximated using MCMC methods, and samples of each parameter are drawn from their full conditional distributions. Estimates of π_ij are made by sampling parameters from their posteriors and substituting these along with the corresponding doses into Equation 8. Note that all parameters except η are sampled on the log scale and then exponentiated since they must be positive.
The BLRM can only escalate to combinations satisfying the neighbourhood constraint and the Escalation With Overdose Control (EWOC) principle. The neighbourhood constraint prevents escalation or de-escalation to any combination that is more than one dose level of either drug away, and also prevents escalation to a combination in which both dose levels are higher. For a trial with target toxicity φ = 0.30, the EWOC principle states that d_ij can only be administered if P(π_ij > 0.33) is below the calibrated BLRM overdosing threshold. Among the combinations satisfying both constraints, the one maximising P(0.16 < π_ij < 0.33) is administered to the next cohort. If no combinations satisfy the two constraints, the trial is terminated. Once the sample size has been exhausted, the MTC is selected from combinations on which at least six patients have been treated, and is the one maximising P(0.16 < π_ij < 0.33). The BLRM requires dosing quantities for each drug to be specified. In all of the implementations of the BLRM, these doses are 100, 200 and 300 mg for each drug. The same proposed calibration procedure as is applied to the other designs is applied to the BLRM, with details provided in the online supplementary materials.
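The two escalation constraints can be sketched as follows, assuming that posterior draws of each π_ij are already available, for example from an MCMC sampler. The function names, the zero-based indexing of dose levels, the argument `ewoc_threshold` and the toy draws are assumptions made for this illustration; no particular threshold value is implied.

```python
def neighbours(current, n_a=3, n_b=3):
    """Combinations within one level of each drug, excluding the step
    in which both dose levels increase."""
    i, j = current
    allowed = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 1 and dj == 1:          # no simultaneous escalation
                continue
            ni, nj = i + di, j + dj
            if 0 <= ni < n_a and 0 <= nj < n_b:
                allowed.append((ni, nj))
    return allowed

def next_combination(current, posterior_draws, ewoc_threshold,
                     lower=0.16, upper=0.33):
    """Admissible neighbour maximising P(lower < pi < upper), or None to stop."""
    best, best_prob = None, -1.0
    for combo in neighbours(current):
        draws = posterior_draws[combo]
        p_over = sum(d > upper for d in draws) / len(draws)
        if p_over >= ewoc_threshold:         # fails the EWOC principle
            continue
        p_target = sum(lower < d < upper for d in draws) / len(draws)
        if p_target > best_prob:
            best, best_prob = combo, p_target
    return best

# Toy posterior draws (hypothetical) for the neighbours of combination (0, 0).
draws = {(i, j): [0.9, 0.9, 0.9, 0.9] for i in range(3) for j in range(3)}
draws[(0, 0)] = [0.05, 0.10, 0.10, 0.20]
draws[(0, 1)] = [0.20, 0.25, 0.30, 0.40]
draws[(1, 0)] = [0.20, 0.35, 0.40, 0.50]
print(next_combination((0, 0), draws, ewoc_threshold=0.5))
```

Returning `None` corresponds to the case in which no combination satisfies both constraints and the trial is terminated.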

A Non-Parametric Optimal Benchmark Comparator
While the primary goal of this work is to compare the performance of different model-free designs to each other, there is a risk that all methods might perform equally poorly on some scenarios. In this case, the comparison of the designs to each other would not identify why the poor performance is observed: whether it is due to an inherently challenging scenario or to all designs having difficulty identifying a particular MTC. To provide context for the comparison of operating characteristics, we include the performance of the non-parametric benchmark for combination studies, a tool that provides an estimate for the upper bound on the PCS under the given combination-toxicity scenario [11, 12]. The benchmark takes into account the "difficulty" of a scenario in terms of how close the toxicity risks for the combinations (under this scenario) are to the target level of 30%, and also accounts for the unknown monotonic ordering in the combination setting. We refer the reader to the recent work by Mozgunov et al. [12] for further technical details on the implementation of the benchmark for combinations.

Proportions of Correct and Acceptable Selections
Figure 4 presents the summary of the operating characteristics of the considered designs in terms of the PCS and PAS (with the full set of results given in the online supplementary materials). Scenarios 14 and 15 have been excluded as these have no true MTCs for the design to select. For scenarios in which the only acceptable combinations are also correct combinations (Scenarios 6, 9, 10, 11 and 13), the PCS and PAS are equal. The mean PCS across Scenarios 1-13 for the BLRM, BOIN, KEY, PIPE and SFD designs is 40.0%, 39.8%, 42.4%, 31.2% and 41.5% respectively, whilst the mean PAS is 58.4%, 58.7%, 62.0%, 56.0% and 59.0% respectively. First of all, the benchmark reveals the differences in how challenging it is to identify the MTC in the considered scenarios: the PCS for the benchmark varies from approximately 35% under Scenario 7 to more than 80% under Scenario 13. As expected, the benchmark corresponds to the highest average PCS and PAS, at 55% and nearly 70% respectively. Similarly, under the majority of scenarios the benchmark corresponds to the highest PCS and PAS, as it employs the concept of complete information. The largest difference between the benchmark and other designs can be seen under Scenario 13. At the same time, there are scenarios under which the benchmark is outperformed by a competing design, for example Scenario 7; this can be a sign of the design favouring particular combinations under the calibrated priors.
The variety of performances across the scenarios demonstrates the variability between the different designs in different settings. Considering the model-free designs, on average the KEY design has the highest proportion of both correct and acceptable selections, but is vastly outperformed in some scenarios by the SFD design. In six of the scenarios, the KEY has the highest PCS out of all the model-free designs, being superior in scenarios with few overly toxic combinations. However, for example in Scenario 11, where the MTC is the middle dose of drug A and lowest dose of drug B, the SFD outperforms the next best performing design by 20.6%. The PIPE design shows poor performance in many scenarios, most notably in Scenario 1 where the PCS is 5.5% and PAS is 54.0%. A likely reason is that for the PIPE design, the choice of MTC must be below the MTC contour, and a scenario where the true MTC is the highest dose combination gives rise to underestimation since we cannot explore above the true MTC contour. In addition, the procedure discussed in Section 2.5 to choose one MTC from the recommended set will make our results differ from those originally reported by Mander and Sweeting [8], where a 'correct selection' was defined as the MTC being in the set of recommended doses.
When considering the BLRM as a comparator, we see that in many scenarios the BLRM outperforms the KEY. For example, in Scenario 1 where the MTC is the highest combination, the BLRM has PCS over 20% higher than the next best performing design, the KEY. In fact, when including the BLRM in the comparison, the KEY is the best performing design in only one scenario, Scenario 8. The SFD does however outperform the BLRM in some cases: the BLRM has the highest PCS in Scenarios 1, 2, 3, 5 and 7, while the SFD is the best performing design in Scenarios 6, 9, 10, 11 and 12.

Proportions of Overly Toxic Selections
Figure 5 illustrates the proportion of overly toxic selections for each design. Scenarios 1 and 15 have no overly toxic combinations, so the proportion is zero for these cases. We observe that the SFD and the BLRM recommend more overly toxic combinations on average, in 20.4% and 17.8% of trials respectively. This is evidence of the trade-off between selecting combinations close to φ and the willingness to recommend more overly toxic combinations.
In three of the scenarios, the SFD recommends overly toxic combinations in over 25% of the simulated trials, and in six of the scenarios it is the design with the highest proportion of overly toxic recommendations. The BLRM stands out in Scenarios 9 and 11 with a very high percentage of simulated trials recommending overly toxic doses, driving up the average across scenarios.
The PIPE design demonstrates a very low proportion of overly toxic selections with a mean of 9.2% across the 13 scenarios, 6.2% below any of the other designs. It has the lowest proportion in all but three scenarios. This is a further illustration of the tendency of the design to recommend combinations near but lower than the estimated MTC contour.
A focus on Scenario 14, where all dose combinations are overly toxic, shows the BLRM is the most efficient at stopping for safety, with 93.7% of simulations not recommending any dose combination.

Number of Patients Treated at Overly Toxic Combinations
Figure 6 outlines the mean number of patients treated at overly toxic combinations in Scenarios 1-15 for each design. Note that we report the number rather than proportion of patients, as this will also give insight into how effectively each design stops for safety.
The most notable feature of these results is the large number of patients treated at overly toxic combinations by the BLRM design. This aggressive escalation is driven by the informative prior, calibrated to give high values of PCS. We refer the reader to the online supplementary materials where an alternative prior leading to more conservative escalation (but considerably lower PCS and PAS) is explored.
The SFD, KEY and BOIN have reasonable performance. The SFD performs particularly well, with the lowest overall mean number of patients treated on overly toxic doses, at seven patients.
Careful attention must again be paid to Scenario 14, where all dose combinations are overly toxic. The PIPE design treats an average of 20 patients per trial, over six cohorts, which is an unacceptable level of exploration in such a scenario. In this scenario, we also consider that although the BLRM showed good performance in stopping early for safety in the highest number of simulated trials, it also has a high number of patients treated on average before stopping.
We see that overall the model-free approaches are more conservative in their escalation than the BLRM, with fewer patients treated on unsafe doses, while the more aggressive escalation of the BLRM brings no noticeable increase in PCS. Of the model-free approaches, the SFD shows the most promising PCS, at the cost of somewhat higher overly toxic selections. It is also worth noting that the SFD has a substantially higher computational cost than the other model-free designs. To investigate the escalation behaviour further, we consider the application of each method to a case study in the following section.

Case Study
The simulation study gives insight into the operating characteristics of each design; however, for further insight into the escalation behaviour, we apply each method to an example case study. We consider a phase I oncology (breast and lung cancer) study enrolling patients to dosing combinations of four dose levels of neratinib and temsirolimus [5]. A total of 60 patients (in cohorts of size 2 or 3) were treated on 12 of 16 possible dosing combinations. Results from 52 patients were included and 10 DLTs were observed, with full results of the trial displayed in Table 2.
The purpose of this case study is not to investigate whether each design chooses the same MTC as the real study did, but to give an illustration of how each design explores the dosing grid, given identical patient responses.
In order to use the calibrated prior specifications, and in line with the simulation study, we restrict the dosing grid to three doses of each drug, removing the lowest dose of temsirolimus and the highest dose of neratinib. We also fix the cohort size to three patients and the maximum total sample size to 36.
To ensure a fair comparison between designs, we define a fixed set of 36 ordered patient responses for each dose combination. The first n_ij responses in this set are the true y_ij DLT responses and n_ij − y_ij non-DLT responses, in a random permutation (note that this is the same random permutation for each of the methods). The remaining 36 − n_ij responses are generated in the following way. Each patient has an individual probability of DLT, generated from Beta(1 + y_ij, 1 + n_ij − y_ij). Then a binary response is generated with this probability. Where there were no patients assigned to the dose combination in the real study, the individual P(DLT) is generated from a Beta(3, 3) distribution, to indicate the dose combination is unsafe, since this is the reason the combination was not escalated to. This process uses the information from the real study, but also introduces enough variability in the subsequent responses to account for the small sample size.
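The generation of the response set for a single dose combination can be sketched in Python as below. This is an illustration of the procedure described above rather than the code used for the case study, which was written in R; the function name and the seed are hypothetical, and DLT responses are coded as 1.

```python
import random

def response_sequence(n_obs, y_obs, total=36, rng=None):
    """Ordered 0/1 responses for one combination: observed data first,
    then simulated responses up to `total` patients."""
    rng = rng or random.Random()
    if n_obs == 0:
        observed = []
        a, b = 3, 3                          # unexplored combination: Beta(3, 3)
    else:
        observed = [1] * y_obs + [0] * (n_obs - y_obs)
        rng.shuffle(observed)                # random permutation of real responses
        a, b = 1 + y_obs, 1 + n_obs - y_obs
    simulated = []
    for _ in range(total - n_obs):
        p_dlt = rng.betavariate(a, b)        # individual probability of DLT
        simulated.append(1 if rng.random() < p_dlt else 0)
    return observed + simulated

# Example: a combination with 1 DLT among 12 observed patients.
seq = response_sequence(12, 1, rng=random.Random(2024))
print(seq)
```

Generating each sequence once and passing the same sequence to every design keeps the permutation and the simulated responses identical across methods.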
Table 2 displays the results of each of the methods, with the number of patients treated at each combination, the number of DLTs observed, and the concluded MTC highlighted in bold.
The BOIN and KEY designs show very similar exploration, first escalating in neratinib, then temsirolimus. The highest combination is not explored, as the combinations with the next lowest dose of each drug were considered unsafe. The only difference is that the KEY assigns one more cohort to the 200mg/50mg combination, even when the previous cohort had 2/3 observed DLT responses.
The PIPE design explores differently, not escalating to the highest dose of temsirolimus at all, even though only 1/12 DLT responses were observed on the 160mg/50mg combination. The SFD explores more of the highest dose of temsirolimus than any other of the model-free designs, although still not the highest combination. An interesting observation here is that the final recommended dose has 4/9 observed DLT responses, a level that would generally be considered unsafe. This is in line with the simulation results that showed this design to have the highest level of overly toxic selections.
The BLRM is executed with two prior distributions, the calibrated prior and the alternative prior. Both show a more aggressive escalation than the model-free designs, with patients allocated to the highest combination. The calibrated prior gives a slightly more aggressive approach with a second cohort assigned, even when the first observed 2/3 DLT responses. This also means that the dosing grid is not as well explored as some of the model-free designs, as the lowest dose of temsirolimus is only explored in combination with the lowest dose of neratinib. These results are in line with the simulation study for the calibrated prior, where the BLRM had on average the most patients treated on overly toxic doses and also a high proportion of overly toxic recommendations.
The case study highlights some key differences in the approaches, illustrating how both the escalation schemes and final recommendation differ. Particularly of note is the somewhat aggressive behaviour of not de-escalating after observing 2/3 DLT responses, and of recommending a final dose combination with 4/9 observed DLT responses, seen from both the SFD and BLRM. This behaviour, which could be considered unsafe, is not necessarily obvious from simulation results and underlines the importance of studying the individual escalations in an example case study. It is also important to consider that in practice, such a statistical approach provides guidance for dose recommendation that should be supported by an overall evaluation of the safety, pharmacokinetics and clinical rationale.

Discussion
This paper provides a review of a wide range of combination designs in phase I oncology, exploring the more recently proposed model-free designs in detail, as well as providing a novel approach for the calibration of such designs. The comprehensive simulation study we conduct suggests that model-free designs are competitive with the BLRM in terms of the proportion of correct combinations selected. The operating characteristics of model-free designs in a number of scenarios suggest they offer a safer alternative. The case study example highlighted the key differences in how the methods explore the dosing grid given the same patient responses, with more aggressive approaches missing the lower doses, and conservative approaches missing the higher ones.
The discussed results depend upon the specification of the intervals for the BOIN and KEY designs, and the operational priors for the PIPE, SFD and BLRM designs, which were calibrated using a novel approach. This included calibrating the overdosing rules in each design to reduce the risk of recommending overly toxic combinations for phase II. Naturally, our work does not allow for comparison between designs when complete and reliable prior information on the toxicity of each drug is available. In practice, the PIPE, SFD and BLRM designs can exploit this prior knowledge to help the escalation process.
The calibration procedure, although novel in approach, is relatively straightforward to implement. It does however highlight the computational intensity of the different methods. Both the BLRM and SFD are very computationally intensive, with the calibration procedure taking substantially longer than for any of the other designs. The procedure has nonetheless shown great promise in specifying prior distributions that yield high PCS values.
Moreover, our simulations do not allow for the early selection of an MTC. For example, if at least 9 patients are treated at a combination and the next cohort is recommended to be treated at this combination, then a trial could be stopped and this combination selected as the MTC. We acknowledge this rule is useful to reduce sample sizes, especially in scenarios where the true MTC is a low dose combination. Another limitation of the work is that only 3 × 3 combination grids are evaluated. This was chosen as a balance between providing a large enough grid to observe interesting differences between designs and remaining computationally feasible and representative of a realistic dose-finding study and sample size. We hence acknowledge our results may not necessarily hold for all settings with varying grid sizes, and emphasize that the prior specifications we have recommended here for comparison would need to be re-calibrated for a different grid size, sample size or cohort size.
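The early selection rule given as an example above, which is not implemented in our simulations, can be written as a one-line check. The Python sketch below is only an illustration; the function name and the dictionary of patient counts are hypothetical.

```python
def stop_early(n_treated, recommended, min_patients=9):
    """True if the combination recommended for the next cohort has already
    been given to at least `min_patients` patients."""
    return n_treated.get(recommended, 0) >= min_patients

# Hypothetical patient counts after four cohorts of three.
n_treated = {(0, 0): 3, (0, 1): 9}
print(stop_early(n_treated, (0, 1)))   # rule met: (0, 1) could be selected as the MTC
print(stop_early(n_treated, (0, 0)))   # rule not met: the trial continues
```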
An additional area of interest for such dose-finding studies is the sample size and cohort size. Conducting a sensitivity analysis on both of these for each design would be an excellent opportunity to investigate whether designs can still achieve high PCS with fewer patients, or significantly higher PCS with extra patients, and whether a larger or smaller cohort size would lead to better exploration of the dosing grid.
Finally, we conclude this comparison with an overview of recommendations for the use of each design in the context of this work. The BOIN and KEY designs give a balanced approach, with a good level of PCS and PAS across a range of scenarios. Overly toxic explorations and selections are also well balanced across scenarios. The PIPE design is more cautious in its selection, with a consistently low proportion of overly toxic selections, although at the cost of also recommending correct combinations a lower proportion of the time. The Surface Free design offers a high PCS and PAS and a generally low number of patients treated at overly toxic combinations, but this must be balanced against the high proportion of overly toxic selections. The BLRM provides the most aggressive approach with a calibrated prior, with a large number of patients treated on overly toxic doses but a good level of PCS and PAS. With an alternative, intuitive prior, the number of overly toxic explorations is reduced, but at the cost of the high PCS values.

Data Availability Statement
The data that support the findings of this research are available in Table 2 of this article, originally from Gandhi et al. [5], with all other data simulated according to the specifications described.

Figure 4: An illustration of the PCS and PAS for Scenarios 1-13 for each design. The solid bars measure the PCS and the more transparent bars measure the PAS. The rightmost group of bars shows the means.

Figure 5: An illustration of the proportion of overly toxic selections across Scenarios 1-15 for each design. The rightmost group of bars shows the means.

Figure 6: An illustration of the number of patients treated at overly toxic combinations during trials in Scenarios 1-15 for each design. The rightmost group of bars shows the means.

Table 2: Results for each of the designs applied to the case study, including the raw trial data of the study by Gandhi et al.