Optimal study designs for cluster randomised trials: An overview of methods and results

There are multiple possible cluster randomised trial designs that vary in when the clusters cross between control and intervention states, when observations are made within clusters, and how many observations are made at each time point. Identifying the most efficient study design is complex though, owing to the correlation between observations within clusters and over time. In this article, we present a review of statistical and computational methods for identifying optimal cluster randomised trial designs. We also adapt methods from the experimental design literature for experimental designs with correlated observations to the cluster trial context. We identify three broad classes of methods: using exact formulae for the treatment effect estimator variance for specific models to derive algorithms or weights for cluster sequences; generalised methods for estimating weights for experimental units; and, combinatorial optimisation algorithms to select an optimal subset of experimental units. We also discuss methods for rounding experimental weights, extensions to non-Gaussian models, and robust optimality. We present results from multiple cluster trial examples that compare the different methods, including determination of the optimal allocation of clusters across a set of cluster sequences and selecting the optimal number of single observations to make in each cluster-period for both Gaussian and non-Gaussian models, and including exchangeable and exponential decay covariance structures.


Introduction
The cluster randomised trial is an increasingly popular experimental study design.It is used to evaluate interventions applied to groups of people, like classrooms, clinics, or villages, or when the outcome for one individual in the group depends on the outcomes for the others, as is the case with infectious diseases, for example 1,2 .The design of a cluster trial involves the specification of all aspects of the study, many of which are determined by practical, ethical, and contextual restrictions.However, one major aspect of cluster trial design that can be resolved, or at least supported, with statistical analysis is the sample size of individuals and clusters, when observations are captured from the clusters and individuals, and when each cluster receives the intervention(s).
From both an ethical and practical standpoint, minimising the number of individuals, clusters, or observations required to achieve an inferential goal is highly desirable.For cluster trials, inferences are almost always based on the variance of the estimator of the treatment effect.Designs that minimise the variance of a specific parameter, or combination of parameters, are said to be 'c-optimal'.However, for any particular design problem, enumerating all the different possible designs and their associated variances to identify the c-optimal design is often impossible, given the number of possible variants.Therefore, we use algorithms that can quickly identify an efficient, or 'optimal', design.][5] There has been recent methodological advances in the optimal experimental design literature to identify optimal designs in studies with correlated observations, as well as several recent studies to look at the problem specifically for certain types of cluster randomised trial.In this article, we review the literature on optimal cluster randomised trial designs and review and translate more general methods and algorithms from the broader literature to this context.We present results for different cluster trial design scenarios using a range of methods to illustrate use of the different approaches and to identify optimal cluster trial design for a range of contexts.

What is the optimal cluster trial problem?
There are multiple types of optimality in the experimental design literature, which are referenced using a "alphabet" system of letters. 6,7The primary objective of a cluster randomised trial is almost always to provide an estimate of the treatment effect of an intervention and an associated measure of uncertainty for one or more outcomes.Other parameters in the statistical model, such as the covariance parameters or intraclass correlation coefficient, are not of primary interest.For example, the predominant method used to justify the sample size within a particular study design is the power for a null hypothesis significance test of the treatment effect parameter 3,4 .Thus, efficiency and optimality in this setting relates to minimising the variance of the treatment effect estimator, which is c-optimality.
We now introduce concepts and notation to describe the methods to identify c-optimal cluster trial designs.We assume time is modelled discretely where there are repeated measures, such that observations are considered to have been made within clusters at discrete points in time.Approximations can be made to a continuous time model by finely discretising time within this framework; using a regular grid over a continuous space is a common strategy in optimal design work. 8In what follows, we represent matrices using capital letters, e.g.X.We use a subscript to denote submatrices or elements of a matrix, in particular, we notate X A as the rows of X in set A with all the columns, and Σ A as the sub-matrix of Σ with rows and columns in A. Lower case letters represent scalars and so X i,j indicates the element of X in the ith row and jth column.
We assume there are N possible observations that could be made as part of the study.An observation consists of a single 'design point' that generates an outcome datum.Often observations are grouped into higher-level units around which the study is designed.For cluster trials with cross-sectional sampling each observation will be a unique individual grouped into cluster-periods and then clusters.For cohort designs, observations will be grouped within an individual trial participant and then in clusters.We refer to an experimental unit as the smallest indivisible set of observations for the design problem.For example, we may wish to choose which whole cluster sequences to include, so the cluster sequence is the experimental unit.Equally, we may select clusterperiods, if we do not need to include all time periods within a cluster, or indeed which specific observations.The set of all experimental units is the design space.To simplify matters, we assume each observation is in one and only one experimental unit.However, experimental units in the design space need not be unique.For example, in the absence of individual-level covariates, an observation made in a cluster at a given time will have identical values of the fixed effect parameters to all other observations in the same clusterperiod.
The top left panel of Figure 1 (Design Space A) represents a cluster randomised trial design space.This design space includes the following restrictions: i No reversibility, i.e. clusters can only cross from control to intervention states.ii There must be contemporaneous comparison in at least one time period, i.e. a before-and-after design would not be permitted since it would not include a randomised comparison.
Each row indicates a cluster sequence, and each column a discrete time interval or period.Within each cell, there are a pre-specified number of observations.Each row in the diagram is illustrated only once, but there may be multiple repeats of the same row in the design space depending on the formulation of the problem.Where there are multiple instances of the same cluster sequence, the design space includes the most common types of cluster randomised trial design with repeated measures: a parallel design, in which a cluster receives the intervention in all periods or the control in all periods, or a stepped-wedge cluster randomised trial, where the intervention roll out is staggered such that all clusters start in the control condition and then one or more clusters receives the intervention in each time period until all clusters are in the intervention state.A 'hybrid' design consists of a mix of parallel and stepped-wedge cluster sequences, and a 'staircase' design includes only the cluster-periods on the diagonal.Figure 1 illustrates these designs.Given that this design space incorporates among the most widely used cluster randomised trial designs, and that these restrictions reflect common real-world limitations, it is an obvious choice for many applications.However, more complex design spaces (such as Design Space B) are required to allow for alternative designs like cluster cross-over.Such a design space is illustrated in Figure 2, which removes the no reversibility restriction.In these cases the cross-over design is almost always the optimal design. 9e base our analyses around a generalised linear mixed model (GLMM).For outcome vector y: where F is a statistical distribution with mean µ and scale parameter σ, and h −1 is a link function.We assume that the distribution F is in the exponential family.The matrix X is the design matrix of fixed effects, the matrix Z is the design matrix for the random effects u, and D is the covariance matrix of the random effects.We discuss below the specification of X, Z, and D.
We assume that the matrices X and Z have N rows and so contain all of the observations in the design space.There are J experimental units, which we denote as E j for j = 1, ..., J where each E j ⊂ [1, ..., N ] contains a subset of the rows.We also denote the design space as D := {E j : j = 1, ..., J} and a specific design as d ⊂ D. Our aim is to identify the 'optimal' design d * of size m < J by selecting the most efficient set of m experimental units from D.
Most experimental design criteria are based on the Fisher information matrix.For the GLMM above, the information matrix for the generalised least squares estimator, the best linear unbiased estimator, for a particular design is: where Σ is the covariance matrix of the observations y.Then the c-optimal design criterion is: where c is a vector consisting of zeroes except for a one in the position of the treatment effect parameter.For some designs, such as if there were no observations in the treatment condition, M would not be positive semi-definite and so we define the variance as infinite, i.e. the design provides no information on the parameter.The formal design problem is then to find d * ⊂ D that minimises f such that |d * | = m < J, i.e. d * = arg min d f (d).

GLMM Specifications for Cluster Randomised Trials
Without loss of generality, we focus on models for cluster trials where individuals are cross-sectionally sampled in each cluster-period.Where relevant and also without loss of generality, we use r to represent the number of observations per cluster-period.For a comprehensive discussion of different models relevant to cluster randomised trials, see Li et al 5 .

Covariance Function
The observed outcome for an individual i in cluster k at time t is specified as y ikt with linear predictor η ikt = x ikt β + s ikt where s ikt = z ikt u represents the random effects.The covariance function defines the entries of the covariance matrix D. We define a covariance function as: where ∆t = |t − t | and ∆k = |k − k |.We define the following covariance functions: In the cluster and nested exchangeable functions above, the parameter τ 2 represents the between cluster variance, ω 2 is the within-cluster, between-period variance.For the autoregressive function, τ 2 similarly represents the between cluster variance, with λ the autoregressive parameter describing the rate of temporal decay.For Gaussian-identity models, we use σ 2 to denote the observation-level variance.Gaussian-identity models are often re-parameterised in terms of other parameters, in particular: ICC Intra-class correlation coefficient.Equal to ρ = τ 2 τ 2 +σ 2 for EXC1 and AR1 and ρ = τ 2 +ω 2 τ 2 +σ 2 +ω 2 for EXC2.More precisely, this is the 'within-period ICC' for designs with repeated measures and EXC2 function. 3AC Cluster-autocorrelation coefficient.Equal to r = τ 2 τ 2 +ω 2 for EXC2 and not defined for the other models.
Matrix X The n × P matrix X is a matrix of covariates.For cluster trials with repeated measures X typically consists of an intercept, time period indicators for T − 1 time periods, and a treatment indicator.Equivalently, matrix X may be specified without the intercept and with T time period indicators.We use this specification for the examples in subsequent sections.For some trial designs, investigators may consider alternative specifications.For a parallel design, the treatment effect estimator from a model that does not adjust for time period is unbiased.However, such a specification would result in a biased treatment effect estimator where the intervention roll out is staggered over time.Thus, we assume that for most applications where staggered designs feature in the design space, adjustment for time is incorporated in the specification for X.We may also consider adjusting for continuous functions of time, such as polynomials, if the time periods are an approximation to continuous time.For example, Hooper and Copas 10 consider cubic and piecewise continuous polynomials.
In this discussion, we also assume there is a single treatment that enters the model as a dichotomous treatment indicator.More complex cluster trial designs may feature multiple arms and treatments, 11 including continuous treatments representing dose.We do not consider these designs here, however the optimal design methods below can be extended to these cases.

Methods and previous literature
We divide the currently available methods for the optimal cluster trial design problem into three categories: (i) derivation of exact formulae for the treatment effect variance or precision for specific models and design spaces; (ii) general 'multiplicative' methods that derive weights to place on each unique experimental unit; and (iii) general combinatorial optimisation algorithms designed to select the optimum m items from a discrete set of size J.

Exact Formulae
For simpler models one can derive explicit formulae for f (d).Given a statement of the variance or precision one can then either determine an algorithm to identify an optimal solution, or use it to calculate the variance for a wide range of designs and/or parameter values and compare numerically or graphically.Many such studies are based on the formula for the variance of the treatment effect estimator in the linear mixed model with EXC1 covariance given by Hussey and Hughes. 12irling and Hemming 9 provide perhaps the most notable study of this type for cluster trials.They derive a formula for the precision of the treatment effect estimator in a linear mixed model under covariance structure EXC2 along with individual-level cohort effects, although we drop the individual level cohort effects for this summary.They consider the problem of determining which periods to introduce the intervention into each of the m clusters.Each cluster is observed in each of the T time periods, and each cluster-period has n observations.One can map this problem onto Design Space A in Figure 1 where each row is an experimental unit repeated m times, and the goal is to identify the optimal m set of experimental units.
We can rewrite model (1) as a linear mixed model for individual i in cluster k at time t as: where J kt is an indicator for if cluster k has the intervention at time t, w t is a time period indicator with time period parameters γ t , and α 1k ∼ N (0, τ 2 ) and θ kt ∼ N (0, ω 2 ) are the cluster and cluster-period random effect terms, and u ikt ∼ N (0, σ 2 ) is the error term.The fixed effect parameters are β = [δ, γ 1 , ..., γ T ] T and δ is the treatment effect parameter.With discrete clusters and time periods, we can aggregate model (5) into a model for the cluster-period means: where ȳkt = 1 r r i=1 y ikt is the mean outcome for cluster k in time period t and Var(e kt ) ∼ N (0, ω 2 + σ 2 r ).Girling and Hemming, following work by Hussey and Hughes 12 and others, then show that the precision of the treatment effect estimator δ is given by: where ρ = where the dot indicates the index over which the mean is taken.Girling and Hemming 9 provide a method for using Equation (7) to produce an optimal design under a no reversibility constraint.We assume the clusters are numbered such that a lower numbered cluster has greater than or equal number of intervention periods than any higher numbered cluster.We then map the cluster-period indexed to coordinates on a unit square (j, t) → (x 0j , x 1t ) where x 0j , x 1t ∈ [−1/2, 1/2].All cluster-periods start in the control state, and then starting with cluster 1 in the T th period, one successively changes the cluster-period to an intervention state in the order of decreasing values of Rx 1t − x 0j until J•• cluster-periods are included in the treated set.Examples of this method are provided in the article, which we reproduce in the examples section.
Lawrie, Carlin, and Forbes 13 derive explicit formulae for the optimal proportion of clusters to allocate to each sequence (row) in the Design Space A to minimise f (d) using a linear mixed model and the EXC1 covariance function.They show that the optimal proportion of clusters allocated to the tth sequence in the stepped-wedge design space (see Figure 1) with T − 1 sequences, φ t is: A similar analysis for EXC1 structure and a linear mixed model is given in Woertman et al. 14 Zhan, Bock, and Heuvel 15 extend Lawrie et al's analyses using Girling and Hemming's work to identify more general 'optimal unidirectional switch designs' by extending the probability weights (8) to a larger design space with sequences incorporating exclusively control or intervention conditions, and with EXC1 covariance functions.Here, unidirectional switching means no reversibility, giving, for example Design Space A in Figure 1.The more general probability weights for the design space with T + 1 sequences are: Zhan, Bock, and Heuvel also extend this analysis to smaller design spaces including only a subset of the rows of Design Space A. We discuss below methods for rounding proportions to whole numbers of clusters.
There are several other studies that derive expressions for the treatment effect variance to identify efficient study designs.Hooper and Copas 16 consider a linear mixed model with AR1 covariance for a cluster randomised trial with continuous recruitment.They consider a parallel study design with baseline measures and aim to identify the when the intervention should be implemented in the intervention arm under different sample sizes and covariance parameters.They calculate the value of (3) for a large range of models and graphically compare the results.Copas and Hooper 17 take a similar approach with a linear mixed model with EXC1 covariance with a parallel trial design.They aim to identify optimal sample sizes and the proportion of data to collect in baseline and endline periods.Moerbeek 18 also uses an explict criterion, although not strictly for c-optimality, as they aim to identify an optimal sample size within treatment and control groups subject to a budget constraint.They consider only a single time period, such that the treatment effect estimator is a difference in means.Lemme et al also consider a similar cost-benefit optimisation approach for multicentre trials. 19eriving explicit formulae for the variance or precision is appealing due to its relative simplicity.Identifying a c-optimal design does not require specialist tools and can be done using spreadsheet software.However, these methods are typically limited to specific models and designs, such as exchangeable covariance structures, linear models, and equal cluster-period sizes.The mathematical approach used to derive the precision formula does not carry over to more complex covariance structures or design spaces, nor to problems where the experimental unit is an observation or cluster-period.One can calculate the value of the c-optimality criterion directly for any design, as Hooper and Copas 16 do.However, the number of designs one must calculate the variance for grows exponentially and prohibitively with the size of the design space.More general methods are required for these extended problems.

Multiplicative weighting methods
Determining probability weights for experimental units, as the studies cited above do explicitly 13,15 , is a useful strategy to simplify the optimal design problem.One can generalise this approach to tackle more complex models and design spaces.We place a probability measure φ on D so that our design is characterised by φ := {(E j , φ j ) : j = 1, ..., J} where φ j ∈ [0, 1] are weights.The optimal design problem can then be re-stated as finding a design that minimises f (φ).
Elfving's Theorem Elfving's Theorem is a classic result in the theory of optimal designs. 20The original formulation considered independent, identically distributed observations.Holland-Letz, Dette, and Pepelyshev (2011) 21 and Sangol (2011) 22 generalised the theorem to the case where there is correlation within experimental units and multiple observations, such as within a cluster, but not between experimental units, such as if the experimental unit was a cluster-period or observation.Elfving's theorem provides a geometric characterisation of the c-optimal design problem.If the experimental units are uncorrelated, the information matrix in Equation ( 2) can be rewritten as: thus for the approximate design φ we can write: which we can rewrite as: where where co denotes the convex hull.This set leads us to a generalised Elfving theorem: For proof see 21,22 .
Sagnol 22 shows how the generalised Elfving theorem can be used to define a secondorder cone program, which is a type of conic optimisation problem than can be solved with interior point methods.This program returns the optimal values of φ 1 , ..., φ k .We provide functionality for the problems we consider in this article in the R package glmmrOptim, including this program.Other proposals exist for identifying the optimal weights, such as using a multiplicative algorithm based on an upper bound for the solution. 23.
Mixed Model Weights Girling (forthcoming) has proposed an algorithm for finding the optimum set of weights that can be applied to the case when experimental units are equivalent to cluster-periods.Since observations in this context are exchangeable within a cluster-period, when the weights are rounded to number of observations (see next section), the result is equivalent to when the experimental unit is a single observation.We consider the aggregated cluster-period model (6).The best linear unbiased estimator for the linear combination b = c T β can be written as where a = [a 1,1 , ..., a 1,T , a 2,1 , ..., a K,T ] is a vector of weights, with a k,t the estimation weight for cluster k and time t, and a = L T a .As before, L is the Cholesky decomposition of Σ and F = L T X.By the Gauss-Markov theorem, the estimator is unbiased if giving us the generalised least squares estimator.In the linear model case we can write Σ = σ 2 N W + ZDZ T , where W is a diagonal matrix of weights such that the number of observations in the cluster period (k, t) is φ k,t N with kt φ kt = 1.The variance of the estimator can then be written as: where g(.) is the covariance function (4).Ignoring the first part of the final line, which is not determined by the cluster-period weights in each cell, the Cauchy-Schwarz inequality shows that: which then gives us a lower bound on the variance when adding in the coviarance terms.This inequality becomes an equality, and hence the minimal variance, when the weights Prepared using sagej.clsare set as: The above argument therefore suggests a simple algorithm to identify the cluster-period weights than minimise the variance, and hence are the c-optimal design, which is shown in Algorithm 1.We have implemented this algorithm in the R package glmmrOptim as the "Girling algorithm".To ensure the algorithm terminates, we had to add additional steps to the algorithm described below.In particular, on each iteration of the algorithm we exclude cluster-periods where the weight is smaller than some lower bound (10 −7 ) to avoid the weights continually shrinking and causing possible floating point errors; and excluding time periods in the linear predictor if the total weights for that period are zero.The algorithm applies to only the cases where the experimental units are clusterperiods or individual observations; however, one might sum the weights within larger experimental conditions for different contexts.
Algorithm 1 Optimal mixed model weights for J experimental units with a target total number of observations N and is the tolerance of the algorithm. procedure end for δ ← arg max j |φ j − φ j | for all j ∈ {1, ..., J} do φ j ← φ j end for Σ ← (σ 2 /N )diag(φ −1 ) + ZDZ T end while end procedure

Rounding proportions of experimental units
Where a method produces an optimal design in terms of the proportion of experimental units of each type to include, we must use a rounding procedure to translate it into exact numbers.There are several methods for converting proportions to integer counts that sum to a given total.The problem was famously identified for converting popular vote totals in states into numbers of seats in the US House of Representatives; the solutions are named after their proposers. 24Pukelsheim and Rieder (1992) 25 following others 26 argue that the procedure of John Quincy Adams is the most efficient method of rounding to an exact design.As Pukelsheim and Rieder note though, the design weights do not contain enough information to exactly identify a experimental design, and so multiple designs may be generated.However, this procedure is based on the assumption that a 'fair' allocation includes at least one experimental unit of each type.For many cluster trial design problems we do not require this restriction, for example, a parallel trial is optimal in some cases. 9In other cases though, there may be practical reasons to ensure staggering of the roll-out, 10,27 in which case this rounding scheme would be the most efficient.Hamilton's rounding procedure is an alternative method.We initially assign Jφ j clusters to each sequence (where x is the floor of x), and the incrementally add clusters according to the largest remainder Jφ j − Jφ j .In the later examples (and the implementation in the R package glmmrOptim) we use all rounding procedures and then select the design with the smallest value of f (d) since evaluating the variance for the small number of possibly optimal designs does not bear a high computational cost.
While the solutions generated by different rounding schemes, and the algorithms discussed in the next section, may in fact be an exact optimal solution, they cannot guarantee such a result.In the results section we provide several examples where the results of the methods may disagree.The equivalence theorem 28 provides precise conditions to check whether a given design is indeed optimal.However, it requires knowledge of the optimal design.Girling and Hemming 9 use an approach of comparing the relative efficiency of the design to that of a cluster cross-over, which is the most efficient if it is within the design space.Not all design spaces include the cluster crossover design, and so the optimal design may not be known.Holland-Letz, Dette, and Renard (2012) 23 derive a lower bound for the relative efficiency of a given design in the context of a pharmacokinetic study with correlated observations.

Combinatorial Optimisation Algorithms
Watson and Pan 29 show how the c-optimal design criterion in Equation ( 3) is a 'monotone supermodular function', which means it is amenable to one of several combinatorial optimisation algorithms that are well-studied in the literature.A supermodular function is one for which, given a design d ⊂ D and a smaller design 30 Intuitively, one can see this is the case for the design problems considered in this article since it states that the decline in variance from adding a new experimental unit E is smaller for larger designs.The function is monotone , which means that the variance will be at least as large if you remove any observations.The advantage of these algorithms is that they allow identification of optimal designs in cases where there is correlation between experimental units, such as when the experimental units are cluster-periods or single observations.
2][33] We exclude the greedy search algorithm here, as it starts from the empty set and successively adds observations.As we require a minimum of P observations to ensure a positive semidefinite information matrix, the algorithm therefore performs poorly as Watson and Pan 29 show.The local and reverse greedy searchers are shown in Algorithm box 2. These algorithms are also implemented in the R package glmmrOptim.
Finding the subset of size m from the design space that minimises f (d) is an NPhard problem, however, much work has been produced from the 1970s onwards on computationally efficient methods of finding approximate solutions.In some cases, these algorithms give a 'constant factor approximation', that is the worst case result has a provable bound on f (d)/f (d * ) if d * is the c-optimal design.For the design problem we consider in this article, only the local search has a constant factor approximation.However, we also include the reverse greedy search as it, or similar variants, have appeared in the literature for cluster trials.
The local search algorithm starts from a design of the desired size m and then makes the swap of an experimental unit in the design with one not in the design that leads to the greatest reduction in the c-optimality criterion.Such swaps are made until no further value improving swaps are available.The worst possible design that this algorithm produces under a cardinality constraint (i.e.|d| ≤ J) has a value no larger than 3/2 times the true c-optimal design. 32This bound can be improved to 1 + 1/e with certain extensions to the algorithm. 34he reverse greedy algorithm starts from the complete design space and successively removes the experimental unit that results in the largest decrease in variance.Proofs of the constant factor approximation for the reverse greedy algorithm depends on the 'steepness' or 'curvature' of f (.), which depends on the value of f (∅), where ∅ is the empty set, i.e. a design with no observations.A reasonable choice for the variance of an estimator from a design with no observations is infinity, as we specify in (3).However, the resulting curvature of the function then means there is no constant factor approximation bound. 30,35Alternatively we could say f (∅) is undefined, and we would again lack a theoretical guarantee.
Watson and Pan 29 investigate these algorithms for a range of study designs, including cluster randomised trials.They find that empirically the reverse greedy and local search algorithms provide similar performance in terms of the variance of the resulting design.The reverse greedy search is deterministic, while the local search starts from a random design, so Watson and Pan run the local search multiple times and select the best design.They also suggest several approaches to improve the computational efficiency of these algorithms.
Kasza and Forbes 36 use a reverse greedy approach to identify optimal designs.They describe the method as estimating the 'information content' of clusters or cluster-periods in a design space like Figure 1, where their measure of information is the marginal change in variance from removing the observations from the design.The results presented by Kasza and Forbes 36 are qualitatively similar to those using other methods and algorithms, such as those presented below.
Hooper, Kasza, and Forbes (2020) 37 examine optimal cluster trial designs in the context of the linear mixed model with covariance function AR1.They consider a discrete approximation to a continuous time model with continuous recruitment and polynomial functions of time.The design space consists of individuals regularly spaced over a time interval within clusters; the individuals constitute the experimental unit.They aim to provide a set of illustrative optimal designs under different parameter values for the covariance function.The method used to identify these designs could also be described as a variant of the 'reverse greedy' algorithm.Each iteration of the algorithm is supplemented with a type of local search, although the swaps of experimental units that can be made are limited at each step to preserve a no reversibility restriction.The designs presented by Hooper, Kasza, and Forbes 10 are often qualitatively different from those presented here resulting from other methods.However, the design space they use includes a wide range of other designs, and their specfication of X does not include time period indiciators, which may account for some of the differences.

Computational Complexity
The computational complexity of the local and greedy searches scales as O(m 4 r 3 (J − m)) and O(J 3 r 3 (J − m), respectively, 29 where r is the number of observations in an experimental unit.These algorithms scale relatively poorly with the size of the design.However, the approach taken by Girling and Hemming 9 discussed above suggests a way of improving the computational time of these algorithms when the experimental unit is a cluster or cluster-period.Equation ( 6) specifies a model for the cluster-period mean under covariance function EXC2.A similar model can be specified for the AR1 function with equal sized cluster-periods: The advantage of using a model for the cluster-periods is that it only requires a single swap or addition to change an experimental unit as opposed to n swaps or additions to the design.

Non-Gaussian Models
The multiplicative weighting, optimal mixed model weights, and combinatorial methods all require calculation of the covariance matrix Σ and its inverse.For Gaussian models with identity link function Σ = σ 2 I + ZDZ T , so it can be calculated exactly.For non-Gaussian models, such as Binomial or Poisson, generating Σ can be computationally demanding.For non-linear models, an approximation to Σ and hence to the information matrix M , is typically used. 38Breslow and Clayton (1993) 39 used the marginal quasilikelihood of the GLMM to propose the first-order approximation: where W is a diagonal matrix with entries W i,i = ∂µ ∂η 2 Var(y|u) , which are the GLM iterated weights. 40Here, W is evaluated at the marginal mean Xβ.For the optimal mixed model weights algorithm we can generate Σ = 1 N W −1 diag(φ −1 ) + ZDZ T .Zeger et al (1988) 41 suggest that when using the marginal quasilikelihood, approximations can be improved by 'attenuating' the linear predictor.For example, for the binomial-logit model one would use ) where a = 16 √ 3/15π.For other types of optimality this attenuation can improve the resulting designs, 38 however for c-optimality there was little evidence of a difference in the designs considered by Watson and Pan. 29Other information matrix approximations that may be relevant for non-Gaussian models include using the GEE working covariance matrix or higher order approximations, however, these methods are either more restrictive or there is little evidence they improve the designs.The approximation also permits the use of cluster-period mean models, like ( 6) and ( 13), with heteroskedastic errors given by Var(e jt ) = Wjt,jt rjt where the r jt is the number of observations in cluster j at time period t, and W jt,jt the individual-level variance of an observation in that cluster-period.For the non-Gaussian examples we give below, we use Equation 14 without attenutation.
Morbeek and Maas (2005) 42 examine optimal designs for clustered studies with a binomial-logisitic mixed model.The derive an approximation to the variance of the treatment effect parameter under the EXC1 covariance function using a linearisation approach with the marginal quasilikelihood.They specifically aim to identify the optimal number of individuals within a cluster in a cost-benefit framework.

Robust Optimality
The methods to generate an optimal design have so far assumed the model parameters are known.However, a well known issue for optimal experimental design methodology is that a design that may be optimal for one set of parameters or model specification may perform poorly for another.Robust methods that are efficient across a range of designs are therefore desirable.There are multiple possible criteria for modifying the c-optimal design criterion to account for multiple designs.For example, Girling and Hemming 9 consider a minimax criterion in which they identify a design that maximises (minimises) the minimum (maximum) precision (variance) over all values of the correlation between cluster-period means.This results in a 'hybrid' trial design (see Figure 1).Van Breukelen and Candel (2015) 43 also consider a minimax criterion to identify a robust optimal cluster trial design when the ICC is unknown.Similarly to Moerbeek 42 , they use a cost-benefit framework and examine the optimal design under a fixed budget.
As a robust optimality criterion, the maximin function is not necessarily generally applicable.For the combinatorial methods, we require that the objective function is supermodular, and the maximum of a set of supermodular functions is not necessarily supermodular.As an alternative, we can use a 'weighted average'.In particular, we assume there is a set of L candidate models and we specify a prior probability for each model p 1 , ..., p L with the property L l=1 p l = 1.Dette (1993) 44 and Lauter (1974) 45 propose the following generalisation of the c-optimality criterion: where M d,l = (X T d,(l) Σ −1 d,(l) X d,(l) ) −1 represents the information matrix for design D under the lth model.As well as the parameters varying between model specification, the vectors c l and matrices X (l) and Σ (l) can vary between models, for example, there may be different specifications of time and covariance functions.
Dette 44 generalises the Elfving theorem for this robust criterion for models with uncorrelated observations.One can further generalise this theorem to the case where observations are correlated within experimental units following the results of Holland-Letz et al 21 and Sagnol 22 .However, a specification for a program to solve this generalised problem using conic optimisation methods, extending the results of Sagnol in the single model case, is not currently available, and remains a topic for future research.An extension of the optimal mixed model weights method to robust optimal designs is similarly an open question.
Another robust c-optimality criterion is the weighted average: Both this criterion and ( 15) can be used with the combinatorial search methods, since they are also supermodular and maintain the same theoretical guarantees.Following Dette 44 we describe a design that maximises either of these criteria as being c-optimal for the class A with respect to the prior p.

Results and Examples
In this section we provide a range of examples to illustrate the use of the methods and summarise results from several of the papers cited above.For the combinatorial algorithms, we use the reverse greedy algorithm.For multiplicative weighting methods, we select the best design from a variety of different rounding methods.Where applicable we also compare the results to those presented by Girling and Hemming 9 .

Clusters as Experimental Units
For the first set of examples we consider Design Space A in Figure 1 with seven unique cluster sequences and six time periods.Our goal is to identify a design of m = 10 clusters.Each row is repeated up to five times in the design space, which is to say each sequence could be duplicated up to five times in the final design.Limiting the number of duplicate sequences to five, rather than ten, prevents the final design being, for example, a purely before and after design, while permitting parallel, stepped-wedge, and hybrid designs (although, we have not found a scenario where before-and-after design is optimal).Before and after designs may not be desirable as they lack any randomised comparison; treatment status will be correlated strongly with secular temporal trends, which is why they are unlikely to be optimal.We consider the linear mixed models given in Equations ( 6) and ( 13) with EXC2 and AR1 covariance functions, respectively.The method proposed by Girling and Hemming is applicable in the EXC2 case (the scenario here is the same as that given in Figure 5 of Girling and Hemming 9 ). Figure 2 shows the results using the EXC2 covariance function with m = 10 individuals per cluster-period.The resulting designs for each set of covariance parameters are the same from each method, with only a couple of exceptions.However, the difference between the variances from the designs do not exceed 0.0001.In all cases, the design from the combinatorial method has the lowest variance.Figure 3 shows the results from the model with AR1 covariance function.As with EXC2, the designs are generally the same from both combinatorial and weighting methods, but where there is a difference, the combinatorial method produces a design with marginally lower variance.For both covariance functions, as the level of correlation within a cluster and between periods or the overall level of within cluster-period correlation gets higher, the degree of 'staggering' increases.
The previous example assumes any design might be permissible within the design space.However, more restrictive design problems may be of interest given practical Prepared using sagej.clslimitations on intervention roll out.As an example, we may require there to be only two trial arms within which all clusters receive the intervention at the same time.The question is then when each arm should receive the intervention (if at all).We can consider this problem as selecting two experimental units from Design Space A containing the seven experimental units in Figure 1, since the variance of this design is proportional to a design with J clusters allocated 1:1 to each of the two sequences.Figure 4 shows the optimal two cluster sequences using combinatorial and weighting methods.The two methods agree for all parameter values with the AR1 covariance function, however, for the EXC2 function the weighting method produces designs with higher variance.For low values of the CAC or λ and the ICC a parallel design is optimal.For higher values of these parameters, inclusion of baseline or endline observations in which both trial arms are in control or treatment states, respectively, is superior to a purely parallel design.

Single Observations as Experimental Units
For the next examples we specify a single observation as the experimental unit.The design space is as specified in Figure 1 with seven clusters and six time periods, and each cluster-period has ten unique individuals who each contribute an observation.Using the combinatorial algorithms, our goal here is to select 80 observations of the 420 possible observations up to a maximum of ten per cluster-period.The mixed model weights can also be calculated using Algorithm 1 for comparison.Figures 5 and 6 show the results for the EXC2 and AR1 covariance functions, respectively.In general, the levels of within cluster-period correlation (CAC or λ) appear to determine the optimal design, with higher levels resulting in greater numbers of observations placed along the main diagonal.Not all the designs are exactly symmetric, which may suggest the algorithm has not found the exactly optimal design.

Non-Gaussian Models
For non-Gaussian models, we illustrate how the parameters β affect the resulting optimal design.We consider the design problem given for the examples shown in Figures 5 and  6 with single observations as experimental units and Design Space A of Figure 1 with up to ten individuals per cluster-period.We specify a binomial-logistic model.In all the examples we use parameters τ 2 = 0.16 and ω 2 = 0.04 for EXC2 or τ 2 = 0.20 and λ = 0.8 for AR1, giving an approximate ICC of 0.05.The time period parameters are specified to give a control group mean outcome proportion of either 5%, 25%, or 50% and odds ratios for the six time periods of 0.8, 0.9, 1.0, 1.0, 1.1, and 1.2, respectively.The treatment effect is an odds ratio of either 0.5 or 1.5.
Figure 7 shows the optimal designs of 80 individuals for the binomial-logistic example using the combinatorial and optimal mixed model weight algorithms.When the base rate is low, the relative difference in individual-level variance between time periods is larger, and the resulting designs favour placing more observations in those later time periods.When the base rate is higher, the designs more closely resemble those from the linear model in Figures 5 and 6.The optimal weights suggest that when the base rate is low in this example, we should place all our efforts in the last periods; the combinatorial algorithms have specified a cap of ten observations per cluster-period and so distribute the observations in the next-best cluster-periods.

Robust Optimal Designs
To illustrate robust optimal designs we consider the 18 models and parameter values represented by the panels Figure 1 and 2. We assume that there is no prior knowledge of the likely values of the covariance parameters, nor the covariance function, and so assign equal prior weights to all 18 designs.We use the weighted average robust criterion (16), and run the local search algorithm 100 times, selecting the lowest variance design.The left panel of Figure 8 shows the resulting optimal design with respect to the equal weighting prior.Similarly to Girling and Hemming 9 , the design is a 'hybrid' trial design with six of ten clusters following a parallel trial design, and the remaining four a staggered implementation roll-out.We also identify a robust optimal design for individual experimental units with the 18 designs shown in Figures 5 and 6 using the same procedure.The resulting design is shown in the right-hand panel of Figure 8.

Comparison of algorithms
The correlation between observations in a cluster randomised trial setting complicates identification of optimal study designs.Indeed, there have been relatively few studies on the topic of optimal cluster trial designs, particularly when compared with individuallevel randomised controlled trials.However, recent methodological advances provide several approaches for approximating c-optimal designs with correlated observations.We have discussed three different types of method within a general framework for cluster trials with discrete time: using exact formulae for specific models specifications and design spaces and using an algorithm or enumerating and evaluating multiple relevant designs; determining weights to place on each experimental units in a design space; and, combinatorial algorithms for selecting an optimal subset of experimental units.These categories are not exhaustive and new methods may be developed using novel approaches.Each of the three types of method has their advantages and disadvantages.Minimising exact functions for the estimator variance would be preferable, but explicit formulae are only available in the simpler cases.Many authors (e.g. 9,13,15) consider the linear mixed model with cluster and cluster-period exchangeable random effects, for example.The combinatorial algorithms produced the lowest variance design in all the examples we considered where we could compare methods, but were generally more computationally demanding when one takes into account the suggestion to run the algorithm multiple times and select the best design.The optimal mixed model weights algorithm identifies the optimal weights for each cluster-period, although may not produce an exact design when rounding the totals.The optimal mixed model weights algorithm is much faster to run than other generic algorithms.For the examples presented in Figures 5 and 6, the reverse greedy search took around one minute, the local search ten seconds, and the model weights 50 milliseconds.In many circumstances, it is difficult or impractical to specify exact numbers of individuals, and so weights would be sufficient, in which case the mixed model weights are likely the best choice given its efficiency.However, for more complex design problems, such as setting maximum or minimum numbers of observations in different cluster-periods, the combinatorial approaches may be required.

Small sample bias
A well recognised issue for cluster trials, and GLMMs in general, is that the generalised least squares estimator of the standard errors of β in Equation ( 2) exhibits a small sample bias.7][48] ).All of the examples given in this article may well suffer from this issue.There are two reasons for the bias.First, the information matrix M d is estimated in practice by evaluating the the covariance matrix at the estimated values of the covariance parameters.The GLS estimator (2) does not account from this additional variability from estimating the covariance parameters.Second, the estimator for the information matrix is itself a biased estimator the variance of β.Kackar and Harwell 49 describe an approximation to the small sample variance of β for linear mixed models that accounts for the estimation of the covariance parameters and Kenward and Rogers 50 extend this approximation to also account for the bias.One might consider therefore using these "corrected" estimators in place of the generalised least squares information matrix in the optimality criterion.However, it is not clear whether this approach would perform well or not; both corrections are first-order approximations and can exhibit behaviour that may undermine the performance of the algorithms.For example, in exploratory testing we found it was possible, while using fixed covariance parameter values, for a smaller design to have a marginally smaller "corrected" variance than a larger design.The algorithms produced similar, but not identical, "optimal" designs using these corrected matrices though.Optimal designs with small sample corrections thus remains an important topic for future research in this area.

Usefulness of optimal designs
Optimal designs are not always practical.For example, many of the designs in Figures 5 to 8 where the experimental unit was the individual included cluster-periods with a single individual.It is very unlikely that this would ever be implementable in practice given the logistics of data collection within clusters such as hospitals, clinics, or schools.However, one can view these optimal designs as a benchmark against which to justify a chosen study design.Hooper 10 suggests that there is a common misconception among cluster trial practitioners that the stepped-wedge design is more efficient than a parallel trial.The results of Girling and Hemming, 9 which are replicated in Figure 2, and others show Prepared using sagej.clsthat this is not the case.The most efficient design depends on the covariance parameters, and in the case of a non-linear model, the parameters in the linear predictor too.Indeed, a useful heuristic is that emerges from these results is that the less variable the cluster means over time, the more 'variable' the intervention should be (i.e. more staggered over time).Identifying an optimal design can help design a practicable trial that is more efficient than might otherwise be considered.Where individual-level experimental units are used, it can identify which cluster-periods to exclude entirely and which to place more effort into.Kasza et al 36 propose just such an approach based on a 'reverse greedy' type algorithm.
The framework we use to present these methods requires enumeration of all the unique experimental units.For more complex design problems the design space can then become very large.For example, Hooper et al 16 use a discrete approximation to a continuous time model, and aim to identify when a cluster should start and stop recruiting and when it should implement the intervention.There is a very large number of possible cluster sequences that would fit within this design space given the large number of time increments, even with the no reversibility and symmetric restrictions they use.Enumerating the complete design space and subjecting it to one of the algorithms above would likely be highly computationally demanding.Indeed, this issue raises the question of how one might approach cluster trial optimal design question with continuous time.Other examples in the literature in which a treatment variable is potentially continuous, have relied on selecting a small number of discrete possible values; the finer the discretisation the better the result. 8Extending this to larger numbers of possible conditions, or treating time as truly continuous thus remains a topic of future research.One potentially useful approach for continuous time models may be 'particle swarm' optimization and other 'nature inspired' methods. 51

Bayesian optimal designs
We have also not considered Bayesian optimal design.While Bayesian methods are relatively rarely used for the design and analysis of cluster randomised trials, there are growing number of examples (e.g. 52).Chaloner 53 provides a review of Bayesian optimal experimental design criteria.Bayesian optimal designs are based on maximising a utility function for the experiment.The resulting optimality criteria though are highly similar to their Frequentist counterparts, but they introduce the added complexity of needing to integrate over the prior distributions of the model parameters.There have been several methodological advances and new algorithms proposed for identifying Bayesian optimal experiemental designs.For example, Overstall and Woods (2017) 54 provide perhaps the most general solution to this problem for non-linear Bayesian models.The algorithms in this article might also be used to find approximate solutions to Bayesian cluster trial design problems.For example, the robust criterion (16) could be translated to a Bayesian context where the weights are derived using a Riemann sum approximation to the integral over the prior distributions. 29However, further research is required into Bayesian methods for the design and analysis of cluster randomised trials.

Conclusion
The final choice of study design for a cluster randomised trial results from the confluence of a range of practical, financial, and statistical considerations.However, there is an ethical obligation to try to minimise the sample size required to achieve a research objective.Methods to identify optimal or approximately optimal study designs therefore serve a useful purpose where there is flexibility in the roll out of an intervention.We have identified several methods relevant to cluster randomised trials, which can be used on a standard computer in a short amount of time.We would therefore suggest that examining the optimal trial design should be a step in the design of every cluster randomised trial.

Additional Results
Prepared using sagej.cls

Figure 1 .
Figure 1.Examples of cluster trial design spaces and study designs for six time periods.Each row represents a cluster, or cluster sequence, and may be repeated more than once in the design space.Each cell is a cluster-period and contains one or more individual potential observations.Design Space A encodes a no reversibility assumption and includes contemporaneous comparisons.Design Space B allows for both addition and removal of the intervention over time.Parallel, stepped-wedge, hybrid, and staircase are all designs within both design spaces.

τ 2 τ 2
+ω 2 +σ 2 /n is equivalent to the ICC at the cluster-period mean level and R = T ρ 1+(T −1) ρ is the cluster mean correlation.The coefficients a d and b d are determined by the study design:

Algorithm 2
Combinatorial algorithms to generate a design of size m procedure LOCAL SEARCH Let D 0 be size m design Set δ = 1 and D ← D 0 while δ > 0 do for all element E k ∈ D and

Figure 2 .
Figure 2. Optimal study designs with ten clusters and six time periods for different values of the ICC and CAC using a linear mixed model with EXC2 covariance structure with m = 10 individuals per cluster-period.'Combin' are results from the combinatorial local search run 100 times and selecting the best design, 'G-H' are results using the method from Girling and Hemming, and 'Weight' are designs produced by estimating experimental unit weights.The number is the estimator variance from the design.

Figure 3 .
Figure 3. Optimal study designs with ten clusters and six time periods for different values of the ICC and autoregressive parameter λ ('lambda') using a linear mixed model with AR1 covariance structure with m = 10 individuals per cluster-period.'Combin' are results from the combinatorial local search run 100 times and selecting the best design and 'Weight' are designs produced by estimating unit weights.The number is the estimator variance from the design.

Figure 4 .
Figure 4. Optimal study designs of two cluster sequences and six time periods for different values of the covariance parameters with the EXC2 and AR1 covariance functions.C = Combinatorial local search.W = experimental unit weights.The number on each panel is the treatment effect estimator variance for the design.The rows are difference values of the ICC.

Figure 5 .
Figure 5. Optimal study designs of 80 individuals with seven clusters and six time periods using a linear mixed model with EXC2 covariance structure with different values of the ICC (rows) and CAC (columns).Results from the combinatorial reverse greedy search (with up to ten individuals per cluster-period) and optimal mixed model weights algorithms.The number for the left two columns is the estimator variance from the design.The number within each cell is the intervention status and the colour represents the number of observations (left two columns) or the weight (right two columns).

Figure 6 .
Figure 6.Optimal study designs of 80 individuals for different values of the ICC and λ using a linear mixed model with AR1 covariance structure with different values of the ICC (rows) and autoregressive parameter λ (columns).Results from the combinatorial reverse greedy search (with up to ten individuals per cluster-period) and optimal mixed model weights algorithms.The number for the left two columns is the estimator variance from the design.The number within each cell is the intervention status and the colour represents the number of observations (left two columns) or the weight (right two columns).

Figure 7 .
Figure 7. Optimal study designs of 80 indivudals with ten clusters and six time periods for different values of the base rate (rows) and intervention effect size (columns) with a binomial-logistic mixed model.Results from the combinatorial reverse greedy search (with up to ten individuals per cluster-period) and optimal mixed model weights algorithms.The number for the left two columns is the estimator variance from the design.The number within each cell is the intervention status and the colour represents the number of observations (left two columns) or the weight (right two columns).

Figure 8 .
Figure 8. Robust optimal study designs of 80 indivudals with ten clusters and six time periods with respect to a prior that weights each possibility from earlier examples equally.Results from the combinatorial local search run 100 times and selecting the best design.The left panel is for a design space with clusters as experimental units, and the right panel where individuals are experimental units.The numbers in the cells on the right panel show the intervention status.

Figure 9 .
Figure 9. Optimal study designs with ten clusters and six time periods for different values of the ICC and CAC using a linear mixed model with EXC2 covariance structure with m = 100 individuals per cluster-period.'Combin' are results from the combinatorial local search run 100 times and selecting the best design, 'G-H' are results using the method from Girling and Hemming, and 'Weight' are designs produced by estimating experimental unit weights.

Figure 10 .
Figure 10.Optimal study designs with ten clusters and six time periods for different values of the ICC and autoregressive parameter λ ('lambda') using a linear mixed model with AR1 covariance structure with m = 100 individuals per cluster-period.'Combin' are results from the combinatorial local search run 100 times and selecting the best design, 'G-H' are results using the method from Girling and Hemming, and 'Weight' are designs produced by estimating experimental unit weights.