Estimating a time-to-event distribution from right-truncated data in an epidemic: A review of methods

Time-to-event data are right-truncated if only individuals who have experienced the event by a certain time can be included in the sample. For example, we may be interested in estimating the distribution of time from onset of disease symptoms to death and only have data on individuals who have died. This may be the case, for example, at the beginning of an epidemic. Right truncation causes the distribution of times to event in the sample to be biased towards shorter times compared to the population distribution, and appropriate statistical methods should be used to account for this bias. This article is a review of such methods, particularly in the context of an infectious disease epidemic, like COVID-19. We consider methods for estimating the marginal time-to-event distribution, and compare their efficiencies. (Non-)identifiability of the distribution is an important issue with right-truncated data, particularly at the beginning of an epidemic, and this is discussed in detail. We also review methods for estimating the effects of covariates on the time to event. An illustration of the application of many of these methods is provided, using data on individuals who had died with coronavirus disease by 5 April 2020.


Introduction
Data on time to an event are said to be right truncated if they come from a set of individuals who have been randomly sampled from a population using a sampling mechanism that selects only individuals who have experienced the event by a given time, called the truncation time. An example is data on the time from onset of coronavirus disease  symptoms to death collected from a sample of individuals who all developed symptoms and died by 5 April 2020. Among individuals in the population whose symptoms began on, say, 20 March, only those whose time from symptom onset to death was less than 16 days could be included in the sample. For these people, the truncation time is 16 days. Likewise, among those whose onset was on 31 March, only those whose time to death was less than 5 days could be included. Their truncation time is 5 days. Among the sampled individuals, the proportion whose time to death is less than t days (e.g. t = 14 days) will be greater than the proportion in the population of people who (eventually) die with COVID-19. So, a naive estimate of the distribution of time to death in this population will be biased. Moreover, the average time to death in sampled individuals whose symptom onset was on 31 March will be shorter than the average time in those whose onset was on 20 March, even if time to death is independent of onset time in the population.
Ideally, we would have a random sample of individuals who experience an initial event (e.g. onset of COVID-19 symptoms) that places them at risk of a final event of interest (e.g. death with  and follow them to see how many experience the final event and the times from initial to final event. However, this is not always feasible. For example, in the case of COVID-19 many people will have experienced symptoms but never reported them. Another example is HIV/AIDS, where many people will not have discovered they are infected with HIV (the initial event) until they were diagnosed with AIDS (the final event).
The analysis of right-truncated data requires statistical methods that account for the fact that each of the sampled individuals must have experienced the final event by their truncation time. The purpose of this article is to review such methods in the context of an infectious disease epidemic, like COVID- 19. Much of the statistical methodology for right-truncated data was developed in the 1980s and early 1990s in the context of the HIV/AIDS epidemic (e.g. 1,2 ). However, some of it predated the 1980's. For example, some methods described in this article involve the idea of reversing the time axis, that is, counting backwards in time from the final event to the intial event. This reversal has the effect of making right-truncated data left-truncated. The Lynden-Bell 3 estimator is the extension of the original Kaplan-Meier estimator to handle left truncation. Some of the other methods are based on Turnbull's general algorithm for estimating a distribution function under general patterns of censoring or truncation 4 . Recently, the emergence of COVID-19 has highlighted the need for methods that handle right truncation. Data from early in the epidemic have been used to estimate the distributions of time from infection to symptom onset, symptom onset to hospitalisation, symptom onset to death, and hospitalisation to death. Although many researchers have accounted for the right truncation of the data (e.g. 5-7 some have not done so or it is unclear whether they have (e.g. [8][9][10] ).
An important issue when estimating the distribution of time to event using right-truncated data is that of identifiability. When the maximum truncation time that can be observed in the sample is less than the maximum time to event in the population, the time-to-event distribution can only be estimated up to a constant of proportionality. In the article, we shall pay particular attention to this issue.
The structure of the article is as follows. In the section 'Estimating marginal distribution of time to event' we look at methods that estimate the marginal distribution of time from onset to death in the population of people who (eventually) die from the infection. Some of these methods model both time of onset, and hence truncation time, and time from onset to death, whereas others model only the time from onset to death. Some methods are parametric; others, non-parametric. We review these methods and investigate their asymptotic relative efficiency (ARE) for estimating the mean time from onset to death. Section 'Estimating and testing covariate effects' looks at methods for estimating the effect of covariates on the distribution of time to death, and for testing independence between covariates and time to death. These include parametric and semi-parametric methods. An illustration of the application of some of these methods to COVID-19 deaths is described in the section 'Application to COVID-19'. Section 'Discussion' contains some practical recommendations and a brief discussion of more general truncation patterns and of censoring.

Estimating marginal distribution of time to event
Let X and Y denote the times of an individual's initial and final events, respectively, with 0 ≤ X ≤ Y . These two events could be, respectively, onset of COVID-19 symptoms and death from COVID-19. Alternatively, they could be, for example, infection with COVID-19 and hospitalisation, or HIV seroconversion and AIDS diagnosis. Let T = Y − X denote the individual's time from initial to final event; we call this the 'delay'. We assume that the support of X includes zero and that X and T are either both continuous or both discrete variables; if they are discrete, we suppose, without loss of generality, that they take integer values and we interpret integrals as sums. Unless stated otherwise, we shall assume T is independent of X . Let f * T (t) and F * T (t) denote, respectively, the probability density (or mass) function of T and the distribution function of T .
We obtain an i.i.d. sample, (x 1 , t 1 ), . . . , (x n , t n ), from the probability distribution of (X , T) given X + T ≤ τ for some τ > 0, that is, from where f X (x) denotes the conditional probability density (or mass) function of X given X ≤ τ, and I(.) denotes the indicator function. If X and T are discrete, we shall assume, again without loss of generality, that τ is an integer. We should like to estimate F * T (t), but first we need to discuss whether this is possible.

Identifiability
The maximum truncation time in the sample is τ. If this is greater than the maximum delay in the population, then F * T (τ) = 1 and F * T (t) can be estimated from the data. If, on the other hand, the maximum truncation time is less than the maximum delay in the population, then F * T (τ) < 1 and F * T (t) is only (non-parametrically) identified up to a constant For example, if we assume f X (x; λ) ∝ exp (λx) and T ∼ Gamma(θ 1 , θ 2 ), then We might call L 1 the 'joint-conditional' likelihood, because it is based on the joint distribution of X and T and is conditional on the final event occurring by time τ. Notice that equation (3) can be equivalently written as This highlights that L 1 depends on the distribution of T only through F T (t), its conditional distribution given T ≤ τ. In the example given immediately above, this distribution is Any distribution F * T (t) for which equation (5) describes F T (t) = F * T (t)/F * T (τ) would yield the same likelihood L 1 . Thus, the data do not distinguish between T ∼ Gamma(θ 1 , θ 2 ) and, for example, F * T (t) = F T (t). In the former case, ; in the latter case, F * (τ) = 1. Another possibility is that F * T (t) = F T (t) × 0.001, in which case F * (τ) = 0.001. In short, F * (τ) could theoretically be anywhere between 0 and 1 and only F T (t) is nonparametrically identified.
In practice, we might believe that the parametric model accurately describes the whole of the delay distribution, or we might have additional information that makes us confident that F * T (τ) = 1, or the data themselves might suggest that F * T (τ) equals 1 or is close to 1. The latter might be the case if there were a reasonably large number of sampled individuals with truncation times close to τ and all or almost all of these individuals had delays that were far less than τ. The absence of any longer delays among these people who could have been sampled even if their delays had been longer might lead one to conclude that longer delays are rare. However, caution is warranted, because there remains a possibility that the distribution of delays is bimodal, with a fraction of the population having much longer delays than the rest. For example, this might be the case for COVID-19 if some very seriously ill patients are kept alive on ventilators for a long period of time.
We should be careful not to be misled by a good fit of the parametric model to the data. We can assess the fit on this parametric model only over the range t ∈ [0, τ]. In all three examples of distributions that give rise to equation (5), this fit is perfect. In particular, if the proportion, θ θ 1 2 τ 0 t (θ 1 −1) exp ( − θ 2 t) dt/Γ(θ 1 ), of the probability mass of the Gamma(θ 1 , θ 2 ) distribution that lies in the interval [0, τ] is far from 1, then we are only judging the fit of the model in the early part of the distribution. Note that while it is common in time-to-event studies (without right truncation) for administrative censoring to prevent assessment of the fit of a parametric model in the tail of the distribution, this censoring prevents F * T (t) from being (non-parametrically) identified only for times t after the censoring time.
An equivalent likelihood. So far, we have assumed the data (x 1 , t 1 ), . . . , (x n , t n ) represent a sample from the population and that the sample size n is fixed. An alternative framework involves assuming that initial events are generated from a (non-homogeneous) Poisson process with rate h(t) at time t, the delays T are independent of the initial event times, and we observe all N individuals which gives rise to the likelihood function Kalbfleisch and Lawless 2 show that the maximum likelihood (ML) estimates and observed and estimated Fisher Information for (λ, θ) are the same whether they are obtained from L 1 or L 2 . Thus, θ (and λ) can be estimated from whichever of these likelihoods is most computationally convenient, even if the assumption that the initial events are generated from a Poisson process does not hold (as might be the case, for example, if the initial events occur in clusters).

Applications.
Kalbfleisch and Lawless 2 illustrate the use of L 2 . They estimate the distribution of time T from infection to onset of AIDS in blood transfusion patients. A common assumption is that at the beginning of an epidemic, cases arise from a Poisson process with rate h(t) = λ 0 exp (λt). This means that α = λ 0 { exp (λτ) − 1}/λ and f X (x; λ) = λ exp (λx)/{ exp (λτ) − 1}.
Kalbfleisch and Lawless make this assumption and assume that T has a Weibull distribution. They obtain the ML estimate of (λ, θ) using a Fisher scoring algorithm. Another example of the use of L 2 is given by Cox and Medley, 11 who estimate the distribution of the time T taken for an AIDS diagnosis to be reported to the Communicable Disease Surveillance Centre. They allow the rate of AIDS diagnoses to be increasing sub-exponentially, by using h(t; λ) = λ 0 exp (λ 1 t + λ 2 t 2 ), and test the null hypothesis that λ 2 = 0. They consider several parametric models for the distribution of the reporting delay T . Salje et al. 7 use equation (1), the basis of the likelihood L 1 , to estimate f T (t), but do not use ML. In their application, T is the time from hospitalisation with COVID-19 to death. They assume f X (x), the distribution of hospitalisation times in the population, is known and model f * T (t) as a mixture of an exponential distribution and a log normal distribution. They estimate the parameters of this mixture distribution by finding the values that minimise the sum of squared differences between the distribution of f T (t | X + T ≤ τ) implied by equation (1) and the observed distribution of delays T in the sample.
Modelling the distribution of T Parametric modelling. A third likelihood function that can be used to estimate θ comes from factorising f X , and using only the second factor. This yields the likelihood For example, if we assume T ∼ Gamma(θ 1 , θ 2 ), then We might refer to L 3 as the 'conditional-on-initial (event time)' likelihood. The issues regarding the identifiability of F * T (t) that were discussed in section 'Parametric modelling of joint distribution of X and T ' continue to apply when L 3 , rather than L 1 , is used.
Non-parametric modelling L 3 is also the basis of a non-parametric estimator,F (NP) . This estimator is obtained by applying the familiar Kaplan-Meier estimator in reverse-time, that is, to τ − T . By reversing time, right truncation of T (i.e. T must be ≤ τ − X ) becomes left truncation of τ − T (i.e. τ − T must be ≥ X ). Left truncation (or 'late entry') is easily handled by the original Kaplan-Meier estimator (or more accurately, the Lynden-Bell 3 estimator, which extends the Kaplan-Meier estimator to handle left truncation). For simplicity, we shall now describe the estimatorF (NP) T (t) for discrete T ; Lagakos et al. 12 describe the corresponding estimator for continuous T .
Let D j = n i=1 I(T i = j) be the number of delays observed to equal j (j = 0, . . . , τ), and let M j = n i=1 I(T i ≤ j ≤ τ − X i ) be the number of delays observed to be at most j in those individuals whose truncation time τ − X is such that a delay of j would have been observed.
is a consistent estimator of F T (t).F (NP) T (t) is asymptotically normally distributed and its variance can be consistently estimated using the following adaptation of the Greenwood formula for the variance of the Kaplan-Meier estimator 12 : This guarantees that the confidence interval lies within the interval [0, 1]. An alternative way to calculate the same estimatorF (NP) as the number of sampled individuals with X = x and T = t and fit the model Obtaining an estimate of the variance of F (NP) T (t) this way is, however, more difficult. The non-parametric estimatorF (NP) T (t) can be used to check the fit of a parametric model for T . When doing this, it is important to compareF (NP) For example, if we maximise the likelihood L 3 given by equation (7), we obtain both an estimate of F * T (t) = P(T ≤ t), the distribution function of a gamma distribution, and an estimate of should be compared with the latter. It is important to notice that only the M τ individuals with X = 0, and hence a truncation time of τ, contribute to the calculation ofĜ τ . Therefore, if M τ is small,Ĝ τ will have large variance, which causesF (NP) T (t) also to have large variance not just for t = τ but for all values of t. In this case, it is advisable to choose a value τ * < τ such that M τ * is reasonably large and replace the estimatorF (NP) Relative efficiency of likelihoods L 1 and L 3 Unlike L 1 (or equivalently L 2 ), L 3 does not require a model f X (x; λ) to be specified for the distribution of the initial event times. This eliminates the risk that such a model may be misspecified. However, it has the disadvantage that some of the information in the data is being discarded, which makes L 3 less efficient than L 1 , especially when τ is small. Brookmeyer and Gail 13 found that the ML estimator of θ based on L 3 could be considerably less asymptotically efficient that the estimator based on L 3 when the density of X is known. Jewell 14 and Kalbfleisch and Lawless 2 also comment that the loss of efficiency from using L 3 can be considerable. In section 'Study of ARE of estimators' we carry out a more extensive study of relative efficiency, comparing: (i) L 1 treating the distribution of X as known; (ii) L 1 treating this distribution as unknown; (iii) L 3 ; and (iv) the likelihood L 4 described in section 'Modelling the distribution of X given Y '.
A simple example illustrates the information that L 1 uses but L 3 does not. Suppose X and T are discrete, with X equal to either 0 or 1 and T equal to either 0 or 1, and τ = 1. We observe 10 individuals with (X , T ) = (0, 0), 10 individuals with (X , T ) = (0, 1), and no individuals with (X , T ) = (1, 0). If we use L 3 , we would estimate f T (0) = f T (1) = 0.5. Now suppose that we know that f X (0) = f X (1) = 0.5. Unlike L 3 , L 1 uses this information. Since we have only observed individuals with X = 0, it seems likely that there are quite a few individuals with X = 1 whom we have not observed. Since we have not observed them, they must all have T = 1. This suggests that f T (1) > 0.5.
In section 'Parametric modelling of joint distribution of X and T', we considered the use of L 1 only with parametric models. If X and T are discrete variables (possibly formed by discretising continuous variables), the distribution of X can instead be modelled non-parametrically. Writing λ = (λ 0 , . . . , λ τ ), with λ x = f X (x), L 1 then becomes Here, F T (t; θ) can be a non-parametric or parametric model. Kalbfleisch and Lawless 2 show that the ML estimate of θ obtained from L (NP) 1 is identical to that obtained from L 3 . This is true whether T is modelled parametrically or nonparametrically and regardless of how finely time (if continuous) is discretised. This shows that when L 1 is more efficient than L 3 , this greater efficiency comes from the modelling assumptions L 1 makes about the marginal distribution of X .
Modelling the distribution of X given Y Verity et al. 5 proposed a fourth likelihood function, which arises from factorising f X , and using only the second factor. This factor can be written as They assume that the initial events follow a Poisson process with rate h(t) = λ 0 exp (λt) and that T ∼ Gamma(θ 1 , θ 2 ). Equation (10) then becomes which is the density of a truncated gamma distribution with shape θ 1 , rate θ 2 + λ and truncated to [0, y]. This gives the likelihood We might refer to L 4 as the 'conditional-on-final (event time)' likelihood. It is evident from equation (11) that only θ 1 and θ 2 + λ are identified. Practical use of L 4 therefore requires that λ be known.
Verity et al. 5 analysing data on 24 COVID-19 deaths that occurred in China very early in the epidemic, assumed λ = 0.14 per day and estimated that the mean time from onset of symptoms to death was 19 (95% CI 16-50) days. They did not explain why they used L 4 , rather than L 1 . In view of the small sample, it was probably impractical to estimate both λ and θ. However, L 1 could have been used instead, also with λ fixed at 0.14. The appeal of L 4 may have been ease of use: it is just the likelihood of a truncated gamma distribution. Another advantage of L 4 relative to L 1 , which may have been relevant, is that the former is a valid likelihood no matter how individuals are sampled, provided that the sampling probabilities depend only on Y . Validity of L 1 as a likelihood requires a simple random sample of individuals with X + T ≤ τ.
Equation (11) is derived from the assumptions that f X (x) ∝ exp (λt) and T has a gamma distribution. The 'conditional-on-final' likelihood could also be derived from other assumed distributions for X and T , but would not have the form of a truncated gamma distribution.

Study of ARE of estimators
We carried out a study of the AREs of the ML estimators of the expected delay, E(T ), based on (i) L 1 with known λ ('L kwn 1 '), (ii) L 3 , and (iii) L 4 (with known λ), all relative to the ML estimator based on (iv) L 1 with unknown λ ('L est 1 '). We assumed f x (x) ∝ exp (λx) and T ∼ Gamma(θ 1 , θ 2 ), and considered multiple scenarios defined by different combinations of values of λ, θ 1 , θ 2 and τ. For the distribution of X , we used λ = 0, 0.035, 0.07, 0.14 and 0.28. For the distribution of T , we used θ 1 = 1, 2, 5 and 10, and set θ 2 = θ 1 /19, so that T has mean E(T ) = 19. The mean of 19 (days) was chosen because it was the estimate calculated by Verity et al. (2020) early in the COVID-19 pandemic. τ was varied from 10 to 60. Note that the AREs are invariant to the choice of E(T ) = 19, in the sense that they do not change if λ and θ 2 are both multiplied by some constant and τ is divided by the same constant (keeping θ 1 unchanged).
To calculate an ARE, we first calculated the asymptotic variance of each of the four ML estimators of θ = (θ 1 , θ 2 ). Then we obtained the corresponding asymptotic variance of each of the four ML estimators of E(T ) using the Delta Method. The ratio of the asymptotic variances of two estimators of E(T ) is their ARE. The formulae used for these calculations are given in the Supplemental Materials. Figure 1 shows the results. Each row corresponds to a different value of λ; each column, to a different value of θ 1 . The x-axis of each graph represents τ and the y-axis represents the ARE. The ARE of L kwn 1 (relative to L est 1 ) varies from slightly over 1.0 to about 1.4. It increases with λ; it also increases with τ, at least when τ ≤ 30. The ARE of L 3 (relative to L est 1 ) varies from 0.67 to almost 1. It decreases with increasing λ or θ 1 , and mostly increases with τ. In particular, it is close to 1 when X is uniformly distributed (λ = 0) and T is exponentially distributed (θ 1 = 1), and is 0.67 when λ = 0.28, θ 1 = 10 and τ = 10. When X is uniformly distributed (λ = 0), L 4 has exactly the same efficiency as L 3 (see Supplemental Material for proof). However, as λ increases, L 4 becomes relatively more efficient, and approaches the efficiency of L kwn 1 , especially for larger values of τ.
We also calculated the AREs for the ML estimators of the median of T . These were almost identical to the AREs for the expectation, E(T) (see Supplemental Material).

Estimating and testing covariate effects
Let Z be a covariate or vector of covariates. We assume, unless stated otherwise, that Z is independent of X . We may be interested in testing whether T is independent of Z and/or estimating the effect of Z on T .

Parametric models
A parametric model f * T (t | Z = z; β) can be specified for the distribution of T given Z. For example, T might be assumed to have a gamma distribution with log E(T | Z) = β 0 + β 1 Z and shape parameter β 2 . Then β = (β 0 , β 1 , β 2 ) can be estimated by maximising L 1 , L 2 or L 3 . Just as in the case with no covariates (section 'An equivalent likelihood'), the ML estimate and Fisher information for β obtained from L 1 and L 2 are identical. 2 If Z is a function of X , this same method can still be applied using either L 1 or L 3 . A likelihood ratio test or Wald test can be used for the null hypothesis that one or more elements of the vector β equal zero.
Kalbfleisch and Lawless 2 give an example of using L 2 with a Weibull regression model for the effect of age Z on time T from HIV infection to AIDS diagnosis.

Semi-parametric models
Brookmeyer and Liao 15 propose a generalisation of the discrete-time estimatorF (NP) T (t) to estimate covariate effects. Fit the τ binomial regression models g{P(T = j | T ≤ j ≤ τ − X , Z = z) = β 0j + β 1 z (j = 1, . . . , τ) simultaneously, where g is a specified link function. Letβ 0j andβ 1 denote the resulting estimates. Then calculateF T (t | Z = z) = τ j=t+1 {1 − g −1 (β 0j +β 1 z)}. In the absence of covariates, this is equivalent toF (NP) T (t). Brookmeyer and Liao recommend using the complementary log link, g(p) = log { − log (1 − p)}, because the model then implies F T (t | Z = z) = {F T (t | Z = 0)} exp (β 1 z) , which provides an interpretation of β 1 . The null hypothesis that β 1 (or a subvector of a vector β 1 ) equals zero can be tested using a likelihood ratio test or Wald test. This method can also be used when Z is a function of X .
Kalbfleisch and Lawless 16 extend the method of Brookmeyer and Liao to continuous time and derive score tests of the null hypothesis that β 1 = 0. Following a similar approach, Lagakos et al. 12 had earlier described a log rank test for testing independence of T and a binary covariate Z.
The Poisson regression approach to calculatingF (NP) T (t), described in section 'Non-parametric modelling', is extended by Brookmeyer and Damiano 1 to perform a likelihood ratio test of the global null hypothesis that β 1 = 0. This is done by including interaction terms in the Poisson model. This approach is less useful, however, for testing whether a set of covariates is conditionally independent of T given another set of covariates or for estimating covariate effects. 15 Proportional hazards models A common assumption in both parametric and semi-parametric models for a time-to-event outcome is that hazards are proportional. Brookmeyer and Liao's 15 semi-parametric model, described in section 'Semi-parametric models', assumes proportional hazards, but in reverse time. This differs from the usual proportional hazards assumption, which is in forward time and states that hazard ratios β 1 are constant over time and 1 − F * and Liao's model implies that when β 1 z > 0 (respectively, β 1 z < 0), the (forward-time) hazard ratio comparing Z = z to Z = 0 is initially greater (less) than one and decreases (increases) monotonically over time, becoming equal to one at time τ − 1.
Finkelstein et al. 17 show that the (forward-time) proportional hazards assumption suffices to identify F * T (τ), provided that the hazard ratios of the covariates in the model do not all equal zero. When the hazard ratios all equal zero, F * T (τ) is not identified, just as in the non-parametric case with no covariates. Finkelstein et al. describe how to fit the semiparametric proportional hazards model by ML. Provided that the hazard ratios of the covariates do not all equal zero, this provides estimates of the hazard ratios and F * T (τ). Unfortunately, the identification of F * T (τ) relies entirely on the proportional hazard assumption. If this does not hold, the estimate of F * T (τ) can be heavily biased. Moreover, if there is only one covariate Z in the model, its hazard ratio is estimated very imprecisely. Finkelstein et al. discourage the use of their method for estimating F * T (τ) or the hazard ratio of a single covariate. When there are multiple covariates with non-zero hazard ratios in the model (and possibly additional covariates with zero hazard ratio), Finkelstein et al. find that these hazard ratios can be estimated more precisely. However, it is unclear how big might be the effect of a small violation of the proportional hazards assumption on the bias of these estimates when F T (τ) < 1. Alioum and Commenges 18 suggest that when there is only one covariate, its hazard ratio could be estimated twice, once with F * T (τ) fixed at its lowest value considered plausible, and once at its highest plausible value. However, the resulting range of hazard ratios in their example is very wide.
Perhaps   19 propose a simpler Cox regression approach with inverse probability weighting. This uses a modification of the Cox partial likelihood, which does not depend on the baseline hazard. This method involves weighting the observed individuals so that they represent both themselves and those individuals who were not observed because of the right truncation. The method requires either that F * T (τ) = 1 or that F * T (τ | Z = 0) is known. If F * T (τ | Z = 0) is unknown, a sensitivity analysis can be performed, analysing the data using a range of plausible values of F * T (τ | Z = 0). This method can be applied using the R package coxrt. We are not aware of software being available for implementing Finkelstein et al.'s method.

Application to COVID-19
The first case of COVID-19 in the UK was reported on 30 January 2020. Public Health England (PHE) receives reports every day of COVID deaths from National Health Service England, the Demographics Batch Service (DBS) and Health Protection Teams (HPT). DBS and HPT also report date of symptom onset, when available, for these deceased individuals.
Here we illustrate some of the methods discussed above using data available early in the epidemic, specifically at 9 April. We estimate the distribution of time ('delay') from symptom onset to death and investigate the effects of sex and age on this distribution. This is intended only as a simple illustration of methods; results should be interpreted with caution.
To allow for reporting delays, we exclude deaths occurring between 6 and 9 April; around 80% of deaths are reported within 4 days. 20 Of the remaining 7415 deaths, the symptom onset date was known for 316 (4.3%). Of these 316, we excluded 12 because of missing sex or age. The remaining 304 constitute the sample we shall use. 180 were male and 124 female; 25 were aged under 65, 33 aged 65-74, 100 aged 75-84, and 146 were aged over 85. Figure 2 shows the distribution of onset times X . The distribution is skewed, with most onsets being in the second half of March. This reflects exponential growth in the early phase of the epidemic. The earliest observed onset date was 1 February (time zero); the two individuals with onset on that day have the maximum truncation time of τ = 64 days. Only 13 other individuals have onsets before 2 March (time 30), and so truncation times τ − X greater than 34 days; most truncation times are less than 20 days.
The mean delay in the sample is 7.1 days (range: 0-52 days). As only those who die by 5 April can be included in the sample (right truncation), the mean in the population could be much greater. Using the R package flexsurv, 21 we estimated the distribution of delays in the population by fitting two parametric models: a gamma distribution and a log normal distribution. Each was fitted in four ways: by maximising L 1 with unknown λ ('L est 1 '), L 1 with known λ ('L kwn 1 '), L 3 and L 4 . For L kwn 1 and L 4 , we assumed λ = 0.14, the estimate calculated early in the epidemic by Verity et al. 5 . Figure 3 shows the estimates of the survival distribution (i.e. 1 − F * T (t)). These are quite diverse. For example, estimated survival at 30 days varies from 0.21 to 0.88. However, as expected, these estimates are all greater than the proportion (0.02) of the sample who have delays greater than 30 days. The estimates from the log normal model are systematically greater than those from the gamma model, and all estimates have wide associated confidence intervals. For a given model, the estimates from L 3 and L est 1 are similar to one another, and those from L kwn 1 and L 4 are almost identical to one another. This is perhaps not surprising, given our findings in section 'Study of ARE of estimators' for the gamma distribution. There we found that: (1) when λ ≥ 0.07, L 3 and L est 1 have similar asymptotic efficiency, unless τ is small compared to the mean delay E(T ); and (2) L 3 and L est 1 have similar asymptotic efficiencies when λ < 0.14 and the shape parameter of the gamma distribution equals 1. For the COVID data, the estimates of E(T ) from the gamma model varied from 19 for L 3 or L est 1 to 36 for L kwn 1 or L 4 , and the estimates of the shape varied from 1.16 to 1.24.
For both the gamma and log normal models, the estimates of λ from L est 1 were 0.11 (95% CI 0.09-0.12). The difference between this estimate and the assumed value of λ = 0.14 used by L kwn 1 and L 4 may explain why the estimates of survival from L est 1 and L 3 are lower than those from L kwn 1 and L 4 . Compared to λ = 0.11, λ = 0.14 implies a later average onset time, E(X ), in the population, and hence a higher proportion of people in that population that have delays too long to be sampled (T > τ − X ). This implied greater extent of right truncation implies a greater difference between the mean delay in the sample and the mean delay in the population.
We used the non-parametric estimate of survival conditional on T ≤ τ * , that is, 1 −F (NP) T (t), to assess the fit of the parametric models. To avoid having wide confidence intervals forF (NP) Figure 3). This illustrates the point made in sections 'Identifiability' and 'A joint-conditional likelihood' that two models can give similar estimates of F T (t) and yet very different estimates of F * T (t). Next we fitted the gamma and log normal models with sex and age group (0-64, 65-74, 75-84 and ≥ 85) as covariates, again using the R package flexsurv. This was done using the likelihood L 3 ; flexsurv does not currently allow L 1 or L 4 to be used with covariates. The gamma (respectively, log normal) model assumes that the log rate (respectively, mean of the log delay) is a linear function of the covariates. Delays were estimated to be longer for males than females and for younger than for older people. Both effects were borderline-significant in the gamma model. Neither was significant in the log normal model, although age was found to be significant when a trend test was used (see Supplemental Materials).
If the gamma or log normal model is misspecified, tests of covariate effects may not be valid. Brookmeyer and Liao's method allow tests that do not depend on parametric assumptions. Using this method, we found again that delays are longer for males and younger people. Neither effect reached statistical significance, although age was significant when a trend test was used (see Supplemental Materials).
Finally, we used Vakulenko-Lagun's et al.'s Cox regression method. This requires that either F * T (τ) = 1 or we specify a value for F * T (t | Z = 0). Figure 3 suggests it is unlikely that F * T (τ) is close to 1. Using the R package coxrt, we fitted the model that included the covariate sex, varying F * T (t | Z = 0) over the full range from 1.0 to 0.1, where here Z = 0 means male. The method uses inverse probability weighting to account for the right truncation, with the weights being functions of a Kaplan-Meier estimate of the truncation time distribution. As explained in the Supplemental Materials, this Kaplan-Meier estimate could not be calculated for our data set, until we excluded the 12 individuals with onset times before 1 March. This makes 1 March the new time zero, and so τ now equals 36. Figure 5 shows how the estimated log hazard ratio associated with being female changes as P(T ≥ τ | Z = 0) = 1 − F * T (τ | Z = 0) changes. Females are estimated to have a greater hazard than males (and hence shorter mean delay) and the estimated log hazard ratio increases as P(T ≥ τ | Z = 0) increases. However, the confidence intervals, calculated using 1000 bootstrap samples, indicate that this effect is not significant, at least not until P(T ≥ τ | Z = 0) = 0.1. There were convergence problems: the percentage of bootstrap samples for which convergence was not achieved was 0.0% when P(T ≥ τ | Z = 0) is 0.2 or less, 0.3% when it is 0.3, 1.4% when it is 0.5, 3.1% when it is 0.7, and 7.9% when it is 0.9. This may make the estimated confidence intervals unreliable for the largest values of P(T ≥ τ | Z = 0). We also tried to fit the model with age group, both as a four-level unordered categorical variable and an ordinal categorical variable with linear effect, but the fitting algorithm did not converge. We could, however, fit the model with age as a binary variable, although again with some convergence problems in the bootstrap samples (see Supplemental Materials).

Discussion
We have considered ML estimation of the marginal distribution of the delay T , using four likelihoods. Likelihoods L 1 and L 2 are based on the joint distribution of T and the time of the initial event X ; L 3 on the distribution of T given X ; and L 4 on the distribution of T given the time of the final event Y = X + T . Estimates from L 1 and L 2 are identical. L 3 has the advantage of not requiring a model for f X (x) but the disadvantage of yielding the least efficient estimates. L 4 requires f X (x) to be known exactly. When f X (x) is known, L 1 is more efficient than L 4 . L 1 also has the advantage over L 4 that it can be used when f X (x) is a function of unknown parameters. However, L 4 has the advantage that, unlike L 1 and L 3 , it yields valid estimates even when the sampling probabilities depend on the actual values of Y , rather than only on whether Y ≤ τ. In our study of asymptotic efficiency, we found the ARE of L 3 relative to L 1 varied between 0.67 and 1 when f X (x) was unknown. Because L 3 does not use information on f X (x), these AREs became more marked when f X (x) was known, varying between 0.58 and 0.92. AREs of 0.67 and 0.58 correspond to reductions in sample size of 33% and 42%, respectively. These AREs were calculated assuming a gamma distribution for the delay and exponential growth over time in the number of initial events. In the early phase of an epidemic, the assumption of exponential growth may be tenable, but it is unlikely to hold later on. More research on the AREs when X and T have other distributions is warranted, as well as on finite-sample relative efficiencies.
The non-parametric estimator,F (NP) T (t), of the delay distribution has the attraction of not requiring distributional assumptions. It does, however, only estimate the distribution of T conditional on T ≤ τ; the unconditional distribution of T is estimable only by using parametric assumptions. One use ofF (NP) T (t) is to assess the fit of parametric models over the range t ∈ [0, τ]. However, the confidence intervals associated withF (NP) T (t) may be very wide at the beginning of an epidemic, when the numbers of sampled individuals with large truncation times τ − X may be small.
To estimate the effects of covariates on the delay, and to test whether these effects are non-zero, L 1 or L 3 can be used with parametric models. Brookmeyer and Liao's (1990) semi-parametric model can also be used, particularly for the purpose of testing whether the effect of a single covariate is zero. The semi-parametric Cox regression method of Vakulenko-Lagun et al. (2019) allows hazard ratios of covariates to be estimated under a standard proportional hazards assumption. However, it does assume that the covariates are independent of the truncation time τ − X , and hence of X . Moreover, it requires that F * T (τ) be equal to one or that an interval can be specified within which F * T (τ) is believed to lie. As this interval becomes wider, the uncertainty in the hazard ratios increases. We also had some convergence problems when fitting these models to the COVID-19 data (section 'Application to COVID-19').
The R package flexsurv can be used to fit parametric models using L 1 , L 3 and L 4 , and also to calculateF (NP) T (t). Brookmeyer and Liao's (1990) method can be applied using any software for fitting generalised linear models. The R package coxrt applies the method of Vakulenko-Lagun et al. (2019). We have focussed on ML estimation, but also Figure 5. Estimate of log hazard ratio associated with sex=female as a function of P(T ≥ τ | sex=male). Dotted lines indicate 95% confidence limits calculated by bootstrap; these may be unreliable when P(T ≥ τ | sex=male) is large (see text).
Bayesian analyses can be carried out using the likelihoods L 1 , L 2 , L 3 and L 4 . Indeed, the analysis of Verity et al. (2020) was Bayesian, using L 4 with informative priors.
In addition to being right-truncated, Y may be censored. This is easily handled in parametric models by replacing f * T (t i ) in L 1 and L 3 by F * T (t U i ) − F * T (t L i ), where [t L i , t U i ] is the interval within which individual i's delay is known to lie. If individual i is left-censored, t L i = 0; if right-censored, t U i = ∞. The estimatorF (NP) T (t) is easily extended to allow left censoring of Y . 22 The non-parametric estimator of F T (t) under general censoring of both X and Y and right truncation of Y is described by Sun. 23 Alioum and Commenges (1996) generalise Finkelstein et al.'s (1993) method to allow interval censoring of Y . Double truncation (i.e. simultaneous left and right truncation) of Y is addressed by, among others. 4,24,25,18,23,26,27 The R package DTDA 28 can be used to calculate the non-parametric estimator of F T (t) when Y is double truncated (it can also be used when Y is only right-truncated, but calculatingF (NP) T (t) using equation (8) is faster). Brookmeyer and Gail 13 showed that when Y is double-truncated and the distributions of X and T are modelled parametrically, the efficiency gain from using L 1 rather than L 3 to estimate f T (t) could be considerably greater than the efficiency gains we showed in section 'Study of ARE of estimators'. In the Supplemental Materials, we extend our study of ARE to double-truncated data and replicate Brookmeyer and Gail's finding.
F T (t) and F * T (t) describe the distribution of T in the population of individuals who will (eventually) experience the final event. They do not describe what proportion of the population will never experience the final event. An alternative sampling mechanism randomly samples individuals who experience any one of a number of mutually exclusive types of final event by time τ. For example, one might have a random sample from the population of individuals who develop COVID-19 symptoms and go on to die or recover by time τ. This situation of competing risks and right truncation is discussed by Hudgens et al. 29 and de Una-Alvarez, 30 who describe how to estimate cumulative incidence functions. These functions describe what proportion of individuals who die or recover by time τ will die by time t and what proportion will recover by time t (t ≤ τ).
In addition to the problems of right-truncation, censoring and competing risks, other issues can complicate the estimation of a time-to-event distribution early in an epidemic. Overton et al. 31 describes some of these, which include the possibility that some individuals who experience the final event leave the country before being detected, changes over time in the definition of the final event or in the format of the data being collected, and delays in reporting the final event.