The cure model in perinatal epidemiology

In the statistical literature, the class of survival analysis models known as cure models has received much attention in recent years. Cure models seem not, however, to be part of the statistical toolbox of perinatal epidemiologists. In this paper, we demonstrate that in perinatal epidemiological studies where one investigates the relation between a gestational exposure and a condition that can only be ascertained after several years, cure models may provide the correct statistical framework. The reason for this is that the hypotheses being tested often concern an unobservable outcome that, in view of the hypothesis, should be thought of as occurring at birth, even though it is only detectable much later in life. The outcome of interest can therefore be viewed as a censored binary variable. We illustrate our argument with a simple cure model analysis of the possible relation between gestational exposure to paracetamol and attention-deficit hyperactivity disorder, using data from the Norwegian Mother, Father and Child Cohort Study conducted by the Norwegian Institute of Public Health, and information about the attention-deficit hyperactivity disorder diagnoses obtained from the Norwegian Patient Registry.


Introduction
Perinatal epidemiological studies investigating the possible effects of some gestational exposure on a postnatal condition can roughly be split into two categories. Those where the condition is observable immediately after birth and those where it may take years before the condition is ascertained, if ever. This paper is concerned with the latter. Smoking and low birth weight; infant supine position and sudden infant death syndrome; and foetal alcohol spectrum disorders fall in the first category. The association between prenatal marijuana exposure on neuropsychological conditions 1 and the association between prenatal exposure to pharmaceuticals and neurodevelopmental disorders belong to the second category. The present study was motivated by the hypothesis linking gestational exposure to paracetamol and an increased risk of neurodevelopmental disorders, attentiondeficit hyperactivity disorder (ADHD) in particular, [2][3][4][5] hypotheses that are pertinent examples of the latter category.
From a statistical modelling perspective, the main difference between these two types of hypotheses is that the data in the latter are plagued by censoring. That is, the outcome in studies in the second category may be unknown at the time of study due to a lack of follow-up. Thus, for studies in the first category, standard regression analysis is a natural choice (e.g. linear, Poisson, logistic), while for the latter type of studies, the need to handle censoring often leads to survival analysis methods being employed (e.g. the Cox model). A consequence of opting for a survival analysis model is that the outcome is defined as the time to diagnosis, a convenient choice due to the availability of efficient survival analysis software, but that, we argue, can in many cases be an imprecise operationalisation of the outcome in view of the hypothesis being tested. The reason for this is that in perinatal studies belonging to our second group, the hypotheses often concern an exposure that is only present during pregnancy, and consequently the outcome of interest should be thought of as occurring when the effect of the exposure ceases to have an effect, that is, at birth. Think of a frailty model with hazard ZaðtÞ, where Z is a frailty variable. Our reading of the hypotheses in the second group is that they concern the effect of the exposure on the distribution of Z, but not on aðtÞ. By defining the outcome as the time to diagnosis, one is effectively testing another hypothesis than initially intended. In this paper, we show that in cases where the outcome (occurring at birth, but unobservable at that time) can be thought of as binary, the class of statistical models known as cure models is a viable alternative to standard regression and survival analysis models. In concluding, we also propose modelling alternatives for situations where the unobservable outcome variable is continuous, and for situations where the presence of the condition under study can be ruled out during the course of a life.
In the statistical literature, cure models have received much attention in recent years. [6][7][8][9][10][11][12] The name stems from medical applications where some patients never experience a relapse of the disease under study, and these patients are therefore considered cured. Cure models have also been proposed in the field of reproductive epidemiology to account for the possibility of some of the individuals under study being sterile. 13 It is worth noting that the motivation typically underlying cure models is rather different from the argument we put forward in this paper. Typically, cure models are solidly anchored in the survival analysis world, while our approach, which is focused on the probability of belonging to the susceptible group, is more akin to a misclassification-or missing data problem. In other words, in this paper, we are less interested in survival quantities such as hazard rates and survival functions per se, but view them as nuisance parameters that must be tended to in order to make inferences on the parameters determining whether a child is born susceptible or not. See Farewell 14 for an early paper advocating for cure models in a similar manner.
The article proceeds as follows. In Section 2, we provide a brief introduction to the cure model, and motivate this class of models in light of the hypothesis linking paracetamol and ADHD (hereafter referred to as the paracetamol-ADHD hypothesis). This section also contains some theoretical results on simple logistic and Cox models when such are fitted to data that contain a cure fraction. These results are illustrated with two small simulation studies. In Section 3, we fit different cure models to the data on gestational exposure to paracetamol and ADHD, and compare these with a logistic regression and a Cox regression model. The aim of this application is to investigate whether our reading of the paracetamol-ADHD hypothesis finds empirical backing, and illustrate the fact that all three classes of models are likely to lead to rather similar conclusions about the paracetamol-ADHD hypothesis.

The cure model and ADHD
In this section, we first, using the paracetamol-ADHD hypothesis as our example, elaborate on why we find the class of cure models appropriate for the perinatal studies discussed in this paper. Subsequently, we give a brief introduction to the standard mixture cure model.

The paracetamol-ADHD hypothesis
The use of cure models in perinatal epidemiological studies can be motivated by the directed acyclic graph (DAG) in Figure 1. In this DAG, x represents the gestational exposure, Y is a binary indicator representing the condition the child is born in, while u is a set of confounders. In perinatal studies belonging to our second category, we think of Y as an indicator of a being born susceptible (Y ¼ 1) or nonsusceptible (Y ¼ 0) to the condition in question, i.e. the variable Y indicates the incidence of, or vulnerability to, a particular disease or condition, or a lifetime free of the disease or condition under study. 14 The variable T is the minimum of the time at which the presence of the condition in the child is discovered and a censoring time, d is an indicator taking the value 1 if the value Y ¼ 1 is discovered before censoring and z is a set of postnatal variables influencing the time to an eventual diagnosis.
The paracetamol-ADHD hypothesis suggests that gestational exposure to paracetamol is associated with ADHD. More precisely, it states that -all else equal during the gestational period -two children with the exact same gestational exposure to paracetamol should lead to the same conclusion about the effect of gestational exposure to paracetamol on the risk of ADHD, even though the two children were diagnosed at different ages. This entails that the exposure effectively ceases to have an effect once the child is born, which is the reason for there not being a direct arrow from x to ðT; dÞ in Figure 1. From this perspective, the outcome variable of interest is not the time to diagnosis, nor is it the time to onset of ADHD, but rather a latent susceptibility variable whose realisation takes place when the exposure ceases, which is at birth. This latent variable is represented by the Y in the DAG, so according to the hypothesis, it is the relation between x and Y we seek to make inferences on. That is, had Y been observable, we would have analysed the relationship between x and Y by a binary regression analysis.
The Y's are, however, only partially observable so the T; d's are what we have at our disposal for making inferences on the relation between x and Y. It is tempting to use the censoring indicators d as stand-ins for the latent Y's. The problem with this is that the probability of observing a diagnosis is not the same as the probability of being susceptible. The former probability depends on the distribution of the diagnosis times, hence the need to model the diagnosis times, which is what the cure model of the next section does.

The standard cure model
As above, let Y be the indicator of susceptibility (Y ¼ 1), or of a lifetime free of the condition (Y ¼ 0), with p the probability of Y ¼ 1. The time to diagnosis is a variableT subject to right censoring, i.e. what we observe is T ¼ minT; C È É and d ¼ IfT Cg, where C is a random censoring time. The standard cure model takes the population survival function as given by where SðtÞ ¼ PrT ! tjsusceptible À Á is the survival function of the susceptible group. This latter survival function is assumed to be proper in the sense that it tends to 0 as t ! 1, hence S pop ðtÞ ! 1 À p, which is the nonsusceptible fraction of the population. Both p and S(t) are typically modelled as functions of covariates, common choices being a logistic function for the probability of being susceptible, and a proportional hazards model for the survival function of the susceptible group. That is, for the i'th individual in terms of a baseline hazard function a 0 ðtÞ that might be parametric or nonparametric, and covariate vectors x i and z i that can be equal, overlapping or completely different. As regards perinatal studies, an important feature of the cure model (equation (2)) is that it allows the researcher to distinguish between prenatal and postnatal covariate effects. The covariate vector x governs the distribution of Y, while the covariate vector z governs the distribution of the diagnosis times. This means that in order to give the effect estimates of the covariates in x a direct causal meaning, they must be present during the gestational period. The covariate vector z, on the other hand, might contain covariates that do not influence the foetus, such as characteristics of the kindergarten or the school the child attends.
For the perinatal studies that are the object of this paper, two features of the model in equation (1) should be pointed out. First, by using this model, we are assuming that the nonsusceptible individuals are never diagnosed with the condition in question, that is, we assume that there are no false positives in the sample. In the case of ADHD, this assumption may be questioned. In the US, there is evidence of ADHD overdiagnosis in some communities, 15 meaning that the prevalence of ADHD is higher than the standard 3-5% prevalence estimate. [16][17][18] In the data set we analyse in Section 3, only about 2.3% of the children are diagnosed with ADHD. Since this number is well below the standard prevalence estimates, it would lead one to believe that false positives are not a major issue in our data.
Notice that if d ¼ 1, then we know that Y ¼ 1, while if d ¼ 0, we do not know whether the individual is susceptible or nonsusceptible. This brings us to the second point, if the data contain information on nonsusceptibility (e.g. a medical test that ascertains immunity to a certain disease), then this information ought to be taken into account. As it stands, the model in equation (1) cannot incorporate such information (see Remark 1 in Section 4 for further discussion).
The log-likelihood function of the model in equation (2) is If a 0 ðtÞ is parametrically specified, it is straight forward to maximise this log-likelihood. When the hazard rate is nonparametric, the log-likelihood can be maximised using the expectation-maximisation algorithm introduced in Sy and Taylor 7 and Peng and Dear. 8 The R-package smcure 19 implements this algorithm. The asymptotic theory of the maximum likelihood estimator in the semiparametric case was worked out by Fang et al. 9 and Lu 10 , building on previous work of Murphy 20 for the Gamma frailty model.

Fitting logistic and Cox models to cure data
In this section, we provide some insight on the bias incurred in the parameter estimates when the data stem from a cure model, but a logistic regression model or a Cox regression model, is chosen.
Suppose that the data ðT; dÞ are generated from a model with survival function where pðuÞ ¼ expðuÞ=ð1 þ expðuÞÞ, x a binary indicator and S(t) is a proper survival function that can be expressed as SðtÞ ¼ expðÀAðtÞÞ, in term of the cumulative hazard A(t). Thus, we assume that the true data generating mechanism is that of a cure model. Our parameter of interest is b 1 , giving the effect of the exposure x on susceptibility (Y ¼ 1). Consider fitting a logistic regression model to independent data ðT 1 ; d 1 Þ; . . . ; ðT n ; d n Þ generated by equation (3), with fixed covariates x 1 ; . . . ; x n , and independent censoring. The expectation of d given x is the expectation with respect to the distribution G of the censoring times. Since x is binary, we can define n 0 ¼ #fi : The maximum likelihood estimators of p 0 and p 1 arê which converge in probability to E ½d j x ¼ 0 and E ½d j x ¼ 1, respectively. Being invariant under transformation, the maximum likelihood estimator of b 1 is then so that by continuous mappinĝ as n tends to infinity. From this expression, we see that the estimatorb 1 will be biased (negatively if b 1 > 0), and that the degree to which the estimator is biased depends on the distribution of the diagnosis times and on the value of b 0 (and on the distribution of the censoring times). If S(t) rapidly approaches zero, which is the case if the condition under study is likely to be discovered early in life, then the bias term will be small. And, through its dependence on b 0 , we see that the bias ofb 1 is less pronounced if the condition in question is rare, which is the case with ADHD (recall that only 2:3% of the children in our sample were diagnosed with ADHD). Now, consider fitting a Cox regression model with hazard rate h 0 ðtÞexpðcxÞ, with h 0 ðtÞ left unspecified, to the data generated by equation (3). In this case, it turns out that if expðb 0 þ b 1 x i Þ is close to zero for all x i , then the point estimateĉ obtained by maximising the Cox partial likelihood will not deviate much from the estimateb 1 obtained by maximising the likelihood of the true model. The details are as follows (an excellent exposition of the machinery used in the following can be found in Gill 21 ): the counting processes corresponding to the model in equation (3) are N i ðtÞ ¼ M i ðtÞ þ K i ðtÞ; i ¼ 1; . . . ; n, with where the M i ðtÞ and Y i ðtÞ are martingales and at-risk indicators, respectively; A(t) is the cumulative hazard of the susceptible individuals; while pðÁÞ is the logistic function; and we have used that d logS pop ðtÞ ¼ Àpðb 0 þ b 1 x i À AðtÞÞ dAðtÞ.
then the score function U n ðcÞ of Cox's partial likelihood is If the second term on the right is zero, which it is when the model h 0 ðtÞexpðcxÞ is the true model, then U n ðcÞ ¼ 0 is an unbiased estimating equation. The function is approximately zero when the function is approximately constant. Since a more rapidly increasing cumulative hazard A(t) will on an average result in shorter lifetimes, the function x 7 ! gðx; tÞ is approximately constant only when b 0 is small, that is, if the probability of being susceptible to the event of interest is low.
In summary, when it comes to estimating b 1 , the logistic model provides decent estimates when b 0 is small or the cumulative hazard increases rapidly, while the Cox model only gives decent estimates when b 0 is small.
To illustrate this, we performed two simulations studies with varying parameter values. In both, the data were simulated from a cure model of the form given in equation (3), with the parameter of interest set to b 1 ¼ 1:5, x being a binary exposure, the censoring variables were drawn from an exponential distribution with mean 8 and the sample size set to 4000.
In the simulations reported in Figure 2, we set AðtÞ ¼ t=8 (i.e. the lifetimes of the susceptible population stemmed from an exponential distribution with mean 8) and varied the b 0 parameter. We see that the logistic model and the Cox model estimates are close to the truth for small values of b 0 , and that the bias of these estimators increases with b 0 . The increasing variability of the semiparametric estimates is due to pðb 0 þ b 1 x i Þ approaching one as b 0 increases.
In the simulations reported in Figure 3, we set b 0 ¼ 1:2 (thus pðb 0 Þ ¼ 0:77 and pðb 0 þ b 1 Þ ¼ 0:94) and varied the cumulative hazard AðtÞ ¼ at, taken to be that of exponential distributions. As the hazard rate a increases, the bias of the logistic model decreases, eventually converging to 'unbiasedness'. Varying values of a does not, as discussed above, have an effect on the estimates of the Cox model.
In the data set we analyse in Section 3, only 2:3% of the children are diagnosed with ADHD. This indicates that b 0 is small. Moreover, more than half of the diagnoses occur before the age of 12 years, indicating that the cumulative hazard increases quickly. The insights of the current section therefore suggest that we should expect to see a nominal similarity between the Cox model estimates and the estimates of the logistic model, as well as a similarity of both these estimates to the estimates of the logistic part of the cure models.

Data analysis
The cure models we fit to the paracetamol-ADHD data have population survival functions S pop ðt; x i ; z i Þ ¼ 1 À pðx t i bÞ þ pðx t i bÞS 0 ðtÞ expðz t i cÞ ; with pðx t i bÞ ¼ expðx t i bÞ=f1 þ expðx t i bÞg and S 0 ðtÞ a baseline survival function being either nonparametric or that of a gamma distribution. The argument presented in Section 2.1 entails that we expect gestational exposure to paracetamol to have an effect on whether or not a child belongs to the susceptible group, but, given that a child belongs to the susceptible group, we do not expect paracetamol to have an effect on determining when in life the child might be diagnosed with ADHD. This means that if the exposure variable enters both covariates vectors in equation (5) (i.e. both x i and z i ), then we anticipate that the true band c-coefficients corresponding to the exposure should be positive and zero, respectively.
For comparison, we also fit a logistic model and a Cox model to the paracetamol-ADHD data. As elaborated on in Section 2.3, we have reason to expect a nominal similarity between the exposure estimates from these models to those of the corresponding cure models. This is because the prevalence of ADHD in the data is low, and because most children diagnosed with ADHD are diagnosed quite early in life.
The data used in this analysis stem from the Norwegian Mother, Father and Child Cohort Study (MoBa) conducted by the Norwegian Institute of Public Health. Information about the ADHD diagnoses was obtained from the Norwegian Patient Registry (NPR). The analyses of this section are motivated by and use essentially the same data as Ystrom et al., 2 and a more elaborate discussion of the MoBa and the NPR can be found therein.
After having removed observations with missing values, the sample consisted of n ¼ 95 545 units (pairs of mothers and one of their offspring). Among the children in this sample, 2 165 had been diagnosed with ADHD by the end of the follow up in 2016, that is about 2:3%, a number which is about half the international estimate of ADHD prevalence. [16][17][18] The mean and median age at diagnosis were 10.8 and 11 years, respectively. Half of the children with a diagnosis of ADHD were diagnosed when they were between 9 and 12 years old, while the youngest and oldest child to be diagnosed were 1 and 16 years old, respectively. Table 1 gives the birth year of  the children in the full sample (before deleting 7 713 observations due to missing values on the covariates), the number of diagnoses observed in the relevant birth cohort and the percentage of mothers who consumed paracetamol at least once during pregnancy in each cohort.
The cure models we fit have population survival functions of the form (equation (5)), with the baseline survival function being either nonparametric or that of a gamma distribution with density ðb a =CðaÞÞt aÀ1 expðÀbtÞ. The four covariates we considered were binary indicators of gestational exposure to paracetamol; of whether paracetamol was consumed due to fever; of whether the mother consumed alcohol more than once a month during pregnancy; and of whether the mother had four years or more university education (or equivalents). The paracetamol indicator is the exposure of interest, while the three other covariates are potential confounders. Summary statistics for these covariates are presented in Table 2.
We fitted three different cure models for each of the two specifications of the baseline survival function S 0 ðtÞ. One where all four covariates entered both regression parts of the model, one where they only entered the survival part and one where they only entered the logistic part. Note that the second corresponds to treating the latent Y's as independent and identically distributed, and the third to treating the diagnosis times of the susceptible group as independent and identically distributed. Estimates from a logistic regression on the event indicators d, and from a Cox regression model (assuming no cured fraction), with the same four covariates, are included for comparative purposes. The nominal similarity of the estimates in the logistic model and Cox model to those of the logistic part of the cure models is discussed in Section 2.3. Table 3 reports the parameter estimates and estimated standard errors of these for all eight models. Figure 4 displays estimates of the proper survival functions (that is, S 0 ðtÞ in equation (5)) for the two cure models that treat the diagnosis times as independent and identically distributed (Gamma 1 and Semipara. 1).
In Table 3, the first thing to notice is that in the cure models that include covariates on both the logistic and the survival part (Gamma 2 and Semipara. 2), the estimated effects of paracetamol on the logistic part are significant Note: The last column is the percentage of mothers in the data who consumed paracetamol at least once during pregnancy. ADHD: attention-deficit hyperactivity disorder. Note: All the covariates are binary (0-1). For an individual, a value of 1 means, respectively, that paracetamol was consumed at least once during gestation, the mother has higher education, the mother consumed alcohol at least once a month during gestation and that paracetamol has been consumed to alleviate fever. at the 95% level, while the estimated effects of paracetamol on the survival part is close to zero and insignificant at all reasonable significance levels. Among the three parametric cure models, Gamma 2 is the one with superior performance according to the Akaike information criterion (AIC), albeit only slightly better than the gamma model treating the diagnosis times as independent and identically distributed (Gamma 1). Not surprisingly, removing the paracetamol indicator and alcohol indicators from the survival part of the Gamma 2 model results in an improvement of the AIC score, that is, it has an AIC superior to all the models reported in Table 3 (this model has an AIC score of -28038.08. The estimates are similar and not reported). The results reported in Table 3 are interesting because they can be seen as corroborating the reading of the paracetamol-ADHD hypothesis expounded in Section 2.1, namely that gestational exposure to paracetamol determines whether or not a child is susceptible, while being unimportant for the time to diagnosis. In other words, given susceptibility the time to diagnosis appears to be independent of the exposure.
The important issue of identifiability of the semiparametric cure model should be pointed out. Loosely speaking, for the fraction of susceptible children to be accurately estimated, we must assume that the (covariate dependent) distribution function of the survival times of the susceptible individuals reaches unity before the distribution function of the censoring times. 12 In effect, for identifiability reasons, when we fit the semiparametric cure models, the survival functions are set to zero for all survival times above the largest observed diagnosis time. No such fix is demanded when fitting fully parametric cure models. See Section 4 for further discussion of these issues, and Amico and Van Keilegom 12 for a thorough discussion of identifiability in semiparametric cure models.

Discussion and concluding remarks
In this section, we briefly discuss the above findings and introduce some topics for possible future research.
The cure model was motivated by arguing that the scientifically interesting question in many perinatal studies is how the exposure relates to a partly unobservable variable indicating whether or not the child is susceptible to the condition or disease of interest.
The empirical analysis of the paracetamol-ADHD hypothesis of Section 3 indicates that the diagnosis times are independent of the exposure when susceptibility is accounted for. These findings have important implications for studies on most childhood long-term outcomes as there will always be a fraction of the children that is never diagnosed with the condition studied, and among these many should, for all practical purposes, be regarded as nonsusceptible to the condition in question. When a fraction of the offspring are nonsusceptible, conventional survival analysis methods will give biased effect estimates.  Table 3.

Remark 1
As discussed in Section 2.1, when using the cure model, we assume that we do not have data on the absence of the condition or disease, i.e. d ¼ 0 does not inform us on what the true value of Y is. Now, consider a different scenario, where one does indeed have data on the absence of a condition or disease. Then, one would want to model a positive probability of nonsusceptibility (Y ¼ 0) being discovered. This motivates a model where the children born susceptible (Y ¼ 1) have a hazard rate aðtÞ governing the time to diagnosis, while the nonsusceptible (Y ¼ 0) children have a hazard rate bðtÞ, governing the time to it is ascertained that they do not have the condition or disease under study. Define the variable, D i ¼ Y i if d i ¼ 1, and 0 otherwise. The likelihood function is then If p, aðtÞ and bðtÞ are parametrically specified, one can proceed with likelihood inference on this model. Theory for the situation where one or both of the hazard rates are nonparametric is a topic for further research.

Remark 2
A class of survival models that can give estimates of continuous levels of susceptibility are so called first hitting time models. 22,23 One example is the following. Consider a Wiener process Z(t) with drift l and VarðZðtÞÞ ¼ r 2 , starting at c 0 > 0. It is well known that the first time Z(t) hits zero follows an Inverse Gaussian distribution with parameters l; r and c 0 . 24 Here, the parameter c 0 can be interpreted as the degree of susceptibility, with higher values translating to lower degrees of susceptibility. One could also let c 0 stem from some distribution on the positive half line and build some regression structure on this distribution. Moreover, if l > 0, then the distribution of the first hitting times is not proper. In particular, the probability of never being diagnosed is 1 À expfÀ2c 0 l=r 2 g > 0, which is what we want in order to allow for some of the children to be nonsusceptible to the condition in question.

Remark 3
We have argued that in the perinatal studies discussed in this paper, the quantity of scientific interest is p, the probability of being born susceptible, while parameters related to the distribution of the diagnosis times are nuisance parameters. Nevertheless, the model selection criterion employed in Table 3 is the AIC, a criterion that assesses general overall issues and goodness of fit aspects of the cure models, and not only how good the inference on p or related quantities is. Preferably, when the scientific question directs attention to one part of the cure model, the model selection criterion employed ought to reflect this. Therefore, a possible topic for future research is developing a focused information criterion (see Jullum and Hjort 25 and Claeskens and Hjort 26 ) for comparing different parametric, as well as parametric and semiparametric cure models. The idea is to select the model that best estimates a focus parameter, say w, where the quality of the estimator is assessed by (an estimate of) the mean squared error E ½ðŵ À wÞ 2 . The obvious focus parameter in the context of the paracetamol-ADHD hypothesis is b 1 , but other interesting quantities include Prfsusceptiblejnondiagnosed at tg, or pðx t 0 bÞ, for a covariate vector x 0 of particular interest.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work and Emil A Stoltenberg's PhD is funded by The PharmaTox Strategic Research Initiative. Hedvig ME Nordeng and Eivind Ystrom are funded by the European Research Council Starting Grant 'DrugsInPregnancy', ERC-STG-2014 under agreement No. 639377.