Meta-analysis of the severe acute respiratory syndrome coronavirus 2 serial intervals and the impact of parameter uncertainty on the coronavirus disease 2019 reproduction number

The serial interval of an infectious disease, commonly interpreted as the time between the onset of symptoms in sequentially infected individuals within a chain of transmission, is a key epidemiological quantity involved in estimating the reproduction number. The serial interval is closely related to other key quantities, including the incubation period, the generation interval (the time between sequential infections), and time delays between infection and the observations associated with monitoring an outbreak such as confirmed cases, hospital admissions, and deaths. Estimates of these quantities are often based on small data sets from early contact tracing and are subject to considerable uncertainty, which is especially true for early coronavirus disease 2019 data. In this paper, we estimate these key quantities in the context of coronavirus disease 2019 for the UK, including a meta-analysis of early estimates of the serial interval. We estimate distributions for the serial interval with a mean of 5.9 (95% CI 5.2; 6.7) and SD 4.1 (95% CI 3.8; 4.7) days (empirical distribution), the generation interval with a mean of 4.9 (95% CI 4.2; 5.5) and SD 2.0 (95% CI 0.5; 3.2) days (fitted gamma distribution), and the incubation period with a mean 5.2 (95% CI 4.9; 5.5) and SD 5.5 (95% CI 5.1; 5.9) days (fitted log-normal distribution). We quantify the impact of the uncertainty surrounding the serial interval, generation interval, incubation period, and time delays, on the subsequent estimation of the reproduction number, when pragmatic and more formal approaches are taken. These estimates place empirical bounds on the estimates of most relevant model parameters and are expected to contribute to modeling coronavirus disease 2019 transmission.


Introduction
This is a brief note explaining our approach to producing samples from results reported in the literature as parameterised distributions. The motivation is the need to combine multiple studies that each report a quantity (e.g. the serial interval of an infection) as a parameterised statistical distribution (e.g. a Gamma distribution), with a central estimate of the mean, a central estimate of the standard deviation, and typically confidence limits on both (or credible intervals if the distribution was estimated in a Bayesian framework). Combining such results into a single estimate through meta-analysis does not fit within the standard approaches, which generally assume a normally distributed one-dimensional effect; while this assumption is probably valid for the means of the parameterised distributions, it is not necessarily valid for the parameter defining the spread (i.e. the standard deviation).
The challenge in combining such distributions is essentially that of estimating the mixture of all possible distributions that are compatible with the results published in all the studies. The resulting mixture distribution may then be further analysed in a range of ways.
An additional important capability for us is the ability to combine studies that present results as a parameterised distribution with studies where only empirical estimates of the quantities of interest are made and the raw data are available. In this case generating representative samples from the original report is important, so that parameterised results can be combined with empirical ones. The approach is akin to parametric bootstrapping, but the bootstrapping is performed not on data but on the uncertain estimate of the parameterised distribution.
The key step of this re-sampling is the conversion of an uncertain parameterised distribution into a representative set of precisely specified distributions that can in turn be sampled. The set of studies described in Table 1 is a typical example of the kind of information analysed with this approach.
The specific problem of generating a representative set of precisely specified parameterised distributions from an uncertainly specified result is somewhat similar to that of sampling parameter values within a Bayesian framework, where the mean and standard deviation of a parameter distribution are themselves specified by prior distributions. In that scenario, however, the choice of distribution for the mean and spread parameter (usually the variance) as hyper-parameters can be assumed. Because of the central limit theorem, a sensible choice of prior for the mean is a Gaussian distribution, but the spread parameter typically has support between zero and infinity, and weakly informative priors are often chosen for it, from either the uniform or the half-t family (including the Cauchy distribution) (Gelman 2006). In our situation we are doing the reverse: given a mean and standard deviation, with confidence intervals for each, but no knowledge of the distributions of these quantities, the challenge is to produce a set of sampling distributions that accurately reflects the study definition.

The sampling distribution of the mean
To do this we need to make some assumptions about the nature of the sampling distribution of the mean. Fortunately this is rather simple: a key consequence of the central limit theorem is that, regardless of the underlying distribution, as the number of samples n increases, the sampling distribution of the mean ($E_n$) tends to a Gaussian,

$$E_n \sim N\!\left(\mu, \frac{\sigma^2}{n}\right),$$

where $\mu$ is the central estimate of the mean². Knowing that the sampling distribution of the mean is Gaussian, we can use this assumption to estimate the quantity $\sigma/\sqrt{n}$, the standard deviation of the sampling distribution of the mean, from the confidence intervals, giving us a fully specified sampling distribution of the mean.
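As a concrete sketch (in Python, with illustrative numbers rather than values taken from any particular study; the function name is our own), the standard deviation of this Gaussian can be recovered from a reported symmetric 95% confidence interval on the mean, and candidate means then sampled from it:

```python
import numpy as np
from scipy.stats import norm

def mean_sampling_sd(ci_lower, ci_upper, level=0.95):
    """Recover sigma/sqrt(n), the SD of the Gaussian sampling
    distribution of the mean, from a symmetric confidence interval."""
    z = norm.ppf(0.5 + level / 2)          # ~1.96 for a 95% interval
    return (ci_upper - ci_lower) / (2 * z)

# Illustrative numbers only: a study reporting mean 5.9 (95% CI 5.2-6.7)
se = mean_sampling_sd(5.2, 6.7)
rng = np.random.default_rng(0)
candidate_means = rng.normal(loc=5.9, scale=se, size=1000)
```

Each draw in `candidate_means` is one plausible value of the study's underlying mean, consistent with the reported interval.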

The sampling distribution of the variance and standard deviation
Suppose a normally distributed variable x with expected value $\mu$ and standard deviation $\sigma$ is sampled n times. If the set of n observations has a mean of $\bar{x}$ and an observed variance of $S_n^2$, the sampling distribution of the variance can be shown to be a chi-squared distribution² with $n-1$ degrees of freedom:

$$\frac{(n-1)S_n^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Given that the chi-squared distribution is a particular form of the Gamma distribution (here parameterised with shape $\alpha$ and rate $\beta$), and given the definition of the Nakagami-m distribution³, the following holds:

$$S_n^2 \sim \mathrm{Gamma}\!\left(\alpha = \frac{n-1}{2},\ \beta = \frac{n-1}{2\sigma^2}\right), \qquad S_n \sim \mathrm{Nakagami}\!\left(m = \frac{n-1}{2},\ \Omega = \sigma^2\right).$$

With no information about the nature of the underlying distribution, this expression is a bounding limit on the sampling distribution of the standard deviation, and we use it in the situation where the central estimate of the standard deviation is given alongside the sample size, but with no other information. However, this estimate is unreliable where sample sizes are small or there is kurtosis in the underlying distribution⁴, and it could lead to a broader range of samples than would be compatible with the reported results when these are Gamma or log-normally distributed.
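Assuming this chi-squared/Nakagami relationship, the sampling distribution of the standard deviation can be constructed directly. The sketch below (function name and numbers are illustrative) builds the Nakagami form and cross-checks it against the chi-squared form by simulation:

```python
import numpy as np
from scipy.stats import nakagami, chi2

def sd_sampling_dist(sigma, n):
    """Nakagami-m sampling distribution of the sample standard deviation
    for a normal sample of size n with true SD sigma:
    S ~ Nakagami(m=(n-1)/2, Omega=sigma^2)."""
    m = (n - 1) / 2
    return nakagami(m, scale=sigma)     # scipy's scale = sqrt(Omega)

# Cross-check: (n-1) S^2 / sigma^2 ~ chi-squared with n-1 df
rng = np.random.default_rng(1)
sigma, n = 4.0, 30
s_samples = sd_sampling_dist(sigma, n).rvs(size=100_000, random_state=rng)
s2_from_chi2 = sigma**2 * chi2(n - 1).rvs(100_000, random_state=rng) / (n - 1)
```

Both constructions give a variance estimator with expectation $\sigma^2$ and variance $2\sigma^4/(n-1)$, as the identity requires.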
In O'Neill (2014)⁵ the asymptotic sampling distribution of the variance is explored with respect to the kurtosis of the underlying distribution, and this modifies the degrees of freedom applied to the chi-squared distribution above to the following expression, where $\kappa$ is the kurtosis of the underlying distribution (this is their result 14):

$$DF = \frac{2n}{\kappa - \frac{n-3}{n-1}}.$$
Information about the kurtosis of the underlying distribution is available from the confidence limits on the standard deviations quoted in the source studies, and a closed-form expression for these is given in O'Neill (2014)⁴. This involves the size of the population from which the sample is taken, which is information we do not generally have. With both confidence intervals it would be possible to eliminate the unknown population size (or we could reasonably assume it is very much larger than our sample size), but it is also possible to estimate the associated Nakagami distribution numerically from the confidence intervals and the central estimate of the standard deviation ($\sigma$) using the expression above. These again describe bounding distributions for the sampling distribution of the standard deviation.

Generating samples from uncertain distributions
The main purpose of this approach is to generate a representative sample set from uncertainly specified parameterised distributions such that they can be combined. To test this we investigate a list of published studies that give estimates of the serial interval of SARS-CoV-2 as a parameterised distribution (see Table S2.1). Using the methodology described above, a sampling distribution for the mean and standard deviation is estimated for each of these studies. This is used to generate 1000 representative parameterised distributions per study, and from each of these 1000 distributions 1000 random samples are taken, giving 1,000,000 generated samples per study. With all studies taken together and weighted equally, the combined sample has a mean and SD of 5.32 ± 4.34 (95% CI −0.03 to 15.99); in reality, however, we would take a number of samples proportional to the study size when combining.
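The generation step described above can be sketched as follows, assuming Gamma-distributed serial intervals and moment matching from each sampled (mean, SD) pair. The function name, the `sd_m` shape parameter (the Nakagami m), and the input numbers are illustrative, not values from any specific study:

```python
import numpy as np

def resample_study(mean, mean_se, sd, sd_m, n_dists=1000, n_samples=1000, seed=0):
    """Draw plausible (mean, SD) pairs from their sampling distributions,
    convert each pair to a Gamma(shape, rate) parameterisation by moment
    matching, then draw samples from each precisely specified Gamma."""
    rng = np.random.default_rng(seed)
    # Gaussian sampling distribution of the mean
    means = rng.normal(mean, mean_se, n_dists)
    # Nakagami draws for the SD via the sqrt of a Gamma (Omega = sd^2)
    sds = sd * np.sqrt(rng.gamma(sd_m, 1.0 / sd_m, n_dists))
    shapes = (means / sds) ** 2          # Gamma moment matching
    rates = means / sds ** 2
    return np.concatenate(
        [rng.gamma(a, 1.0 / b, n_samples) for a, b in zip(shapes, rates)]
    )

# Illustrative numbers, loosely based on a serial-interval study
pooled = resample_study(mean=5.9, mean_se=0.38, sd=4.1, sd_m=50)
```

The pooled sample's mean and SD recover the input central estimates, slightly inflated by the parameter uncertainty that has been propagated through.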
More relevant, though, is comparing the distribution of means and standard deviations recovered from the re-sampling process. Keeping the 1000 samples from each of the 1000 inferred distributions separate for each study, and summarising these samples, shows how well the distribution sampling is performing. Figure S2.1 shows this for the Bi et al. 2020 study: Panel A shows the sampled mean and standard deviation in each of the 1000 precisely specified distributions, compared with the quoted central estimates (solid red lines) and confidence intervals (dashed lines); Panel B shows the equivalent distributions expressed in the shape and rate parameters of the Gamma distribution.
Combining the summaries from Figure S2.1 allows us to reconstruct the uncertainty in the mean and standard deviation of our sampled data, and hence to reconstruct central estimates and confidence intervals for each source; these are shown in Table S2.2. They compare well with the originally reported values where the number of cases is sufficiently large or the reported confidence intervals are not excessively wide. The reconstruction is less accurate where the very small numbers in some of the studies lead to wide confidence intervals, for example Zhang et al. 2020. In such cases the ability to replicate the exact shape is arguably less important for our intended purpose, as such small studies will be relatively down-weighted during meta-analysis.
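The summarising step can be sketched as follows (a hypothetical helper, demonstrated on simulated rather than study data): keep the per-distribution samples in a 2-D array, compute a mean and SD per row, and summarise those across rows.

```python
import numpy as np

def summarise(samples_by_dist):
    """Recover central estimates and 95% intervals for the mean and SD
    from an (n_dists, n_samples) array of re-sampled values, keeping the
    samples from each inferred distribution separate."""
    means = samples_by_dist.mean(axis=1)
    sds = samples_by_dist.std(axis=1, ddof=1)
    def central_and_ci(x):
        return float(np.median(x)), tuple(np.percentile(x, [2.5, 97.5]))
    return {"mean": central_and_ci(means), "sd": central_and_ci(sds)}

# Illustrative check: 1000 draws from each of 1000 identical Gamma
# distributions (hypothetical shape/scale, loosely serial-interval-like)
rng = np.random.default_rng(2)
sims = rng.gamma(shape=2.07, scale=2.85, size=(1000, 1000))
summary = summarise(sims)
```

With identical input distributions the recovered intervals reflect only sampling noise; with distributions drawn from the uncertain specification, they reconstruct the reported uncertainty, as in Table S2.2.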

Conclusion
We have presented a short summary of the method we use to generate samples from uncertainly specified parameterised distributions. We have demonstrated that it can produce a set of exactly specified parameterised distributions representative of the uncertainty in the original specification, and that from these we can generate samples covering the range of possibilities described in the original source in a representative way. When aggregated, these samples recover the original uncertainty with a reasonable degree of fidelity. This method is used in our approach to the meta-analysis of quantities such as the serial interval, which are reported in the literature as uncertainly specified parameterised distributions rather than as one-dimensional effect sizes.