Modelling life course blood pressure trajectories using Bayesian adaptive splines

No single study has collected data over individuals’ entire lifespans. To understand changes over the entire life course, it is necessary to combine data from various studies that cover the whole life course. Such combination may be methodologically challenging due to potential differences in study protocols, information available and instruments used to measure the outcome of interest. Motivated by our interest in modelling blood pressure changes over the life course, we propose the use of Bayesian adaptive splines within a hierarchical setting to combine data from several UK-based longitudinal studies where blood pressure measures were taken in different stages of life. Our method allowed us to obtain a realistic estimate of the mean life course trajectory, quantify the variability both within and between studies, and examine overall and study specific effects of relevant risk factors on life course blood pressure changes.


Introduction
Most evidence about changes over the lifespan in markers of physical function has been produced either by comparison of cross sectional data at different points in time or by longitudinal studies with limited follow-up. The comparison of cross sectional data can produce misleading results as samples compared over time may differ due to differential dropout, study design, cohort effects and eventual changes in measuring instruments. Although some of these are factors that also affect longitudinal studies, longitudinal data offer the opportunity of investigating population and person level changes in the outcome of interest. Yet, most longitudinal studies conducted so far have limited follow-up. Although studies exist (1946( British birth cohort, 1958 British birth cohort) where individuals have been followed since birth, study participants have not yet reached old age. Therefore, to describe changes over the entire lifespan, it is necessary to combine data from studies where individuals have been assessed during different stages of life.
The combination of data from longitudinal studies, as of cross sectional studies, is methodologically challenging. As mentioned before, studies will be likely to differ in their design and these differences will need to be taken into account in the models. For example, some samples may be age homogeneous at study entry, whilst other samples may be conformed by an age heterogeneous group; some samples may be population representative whilst others may be selected samples; and some studies may be gender specific whilst other studies may include both men and women. It is likely that in some situations, studies may have employed different instruments to measure the same variable of interest. Furthermore, these differences may be present within the same study when instruments were changed in the different data collection waves. For instance, as more technologically advanced devices became available, a switch in the measuring device used to measure blood pressure occurred between the 1982 and 1989 assessments in the NSHD 1946.
Longitudinal data are often modelled using parametric random effects models, 1 as these models permit the description of mean change whilst informing about variability across individuals about that mean change. However, standard parametric formulations of random effects models may be inappropriate for the description of change of some biological markers as most of these parametric formulations produce unrealistic trajectory shapes. [2][3][4] Non-parametric formulations of random effects models are a more flexible alternative for the description of change of biological markers. In particular, splines 5 are an appealing and computationally efficient alternative that can be used to smooth noisy data. Initially an approach where a spline with a fixed and predetermined number of knots could be used to model the data; however, the choice and location of these knots may influence the estimation of the smoothed line. For example, a large number of knots would produce a very detailed line whilst a small number of knots may not capture the pattern of change sufficiently well.
To overcome these difficulties, DiMatteo et al. 6 proposed the use of Bayesian adaptive splines to fit curves with free knots to data drawn from an exponential family. In DiMatteo's proposed method, reversible jump Markov chain Monte Carlo (RJMCMC 7 ) allows the number of knots, as well as their locations, to be estimated as model parameters; hence the spline's level of detail adapts to the information present in the data. A prior distribution is required for the number of knots, their locations, the corresponding spline coefficients and the residual standard deviation. DiMatteo et al.'s chosen prior allowed the joint marginal posterior distribution for the number and location of knots to be obtained analytically (by integrating out the other parameters), allowing for efficient exploration of that marginal posterior by the RJMCMC method. Note that RJMCMC can be difficult to implement and often relies on the model being mathematically tractable in some way. DiMatteo et al. applied their approach to single studies, but it is not applicable in hierarchical settings, as are discussed in this paper.
In this paper, motivated by our interest in understanding changes in BP across the life course, we extend DiMatteo et al.'s method by incorporating a hierarchical structure to combine data from multiple longitudinal studies thus covering the entire life course. To the best of our knowledge, ours is the first attempt to model trajectories of a biomarker over the entire life course. Previously, Wills et al. 8 modelled blood pressure measures of a set of longitudinal studies that include the studies used in our work independently using splines. In this publication, random effects models that described change using cubic and quadratic polynomials were fitted to blood pressure measures and the effect of several risk factors was examined. This paper produced relevant information about study-specific blood pressure changes and the effect of risk factors on each of these studies, but did not combine the different studies. This limited the ability of researchers to examine information such as variability across the different studies and of estimating a life course trajectory.
The paper is organised as follows. In Section 2, we describe the data sets used. Then, we describe the model structure in Section 3 and present results in Section 4. We discuss results in Section 5 and provide the WinBUGS code used for the analyses conducted in Appendix 1. Blood pressure measures were taken by trained nurses in all studies except for CAPS, where measures were taken by physicians in the first four data collection waves and by a trained field worker in the fifth wave. In most studies at least two measurements were taken at each assessment. Instead, in CAPS one single blood pressure measure was taken in all collection waves except the second when two measures were taken. At the time of taking the measures, participants were seated and allowed at least 2 min rest prior to measurement. Different devices were used over time and in the different studies to take blood pressure reading. In ALSPAC and late waves of CAPS, NSHD and T-07 an automated oscillometric (AO) device was used to take the readings, whilst a manual random zero sphygmomanometer (MRZ) was used in earlier waves of the CAPS, NSHD and T-07 studies. Specifically, in CAPS, an MRZ machine was used in the first four waves and an AO machine in the last wave. In NSHD, an MRZ machine was used in the first two waves and an AO in the third one, whilst T-07 switched from an MRZ to an AO in the third wave, when both machines were used. Table 1 presents the range of ages of individuals, the number of waves and total number of individuals for each cohort. Each study collected information specific to their original research aims, but a common core of socio-demographic variables such as participants' height and weight, occupational social class at study entry and marital status was collected in all studies. In studies of young people (ALSPAC and youngest cohort of the T-07 study), social class was defined by the parents' main occupation. Adult males' and non-married women's social classes were defined based on their own occupations; married women's social classes were defined according to their husbands' occupations or, if this information was not available, on their own occupations. Information about medication intake collected by nurses was available in all studies of individuals aged 18 years and older.

Data
We derived a set of variables to account for factors known to be associated with hypertension. First, using information about social class, we derived an indicator of manual occupation at study entry. Body mass index (BMI) was calculated as weight=height 2 ðkg=m 2 Þ. Using published cut-off points, 13 we defined binary indicators to classify individuals as underweight, normal weight and overweight. 14 For children we used a different criterion as BMI classification varies by age and gender. If a child has BMI for age below the 5th percentile then it was considered as in the underweight group while a child with BMI for age above the 95th percentile was considered in the overweight group, 14 with all other children classified as in the normal weight group.
Where information about medication intake was available, blood pressure readings were corrected for medication intake using published methods 15 that assume that blood pressure from medicated individuals is higher than the observed measure. This correction was conducted by adding 15% of its value to the observed measure.
To correct for possible differences in the blood pressure measures taken in the first and second reading, we calculated the average of the two readings. Given the differences observed in blood pressure trajectories within the different studies, data from men and women were analysed separately.

Methods
Suppose we wish to model life course trajectories using splines in some way. As we will invariably have data on many individuals, we might consider fitting a separate spline to each individual and then characterising the population distribution of spline parameters in some way. However, this would require that all individuals share the same set of parameters, albeit with different values, which is impractical when individuals are observed over different stages of life, not least because each individual spline would have to be extrapolated into the unobserved age ranges. As an alternative, we propose using a single spline to define the population mean life course, with individuals' departures from that mean behaviour accounted for by adding random effects to the spline rather than its parameters. We also add random effects to account for differences between cohorts/studies that are due to unmeasured covariates. A principal advantage of this approach is that the spline may be adaptive, with an a priori unknown number of knots. Individual-specific adaptive splines would have different numbers of parameters with different meanings, in general, which would preclude overall inferences.

Model description
Let y ijk denote the kth blood pressure measurement k ¼ 1, . . . , K ij À Á taken for the jth individual where s ijk denotes the spline function to be fitted as described below; i and i are study-level random effects, representing study-specific deviations from the mean spline and effects of using an MRZ device, respectively (mrz ijk ¼ 1 if y ijk was measured using an MRZ device and mrz ijk ¼ 0 if an AO device was used), whilst ij represents an individual-level random effect. The term r ijk represents the combined effects of various individual-level risk factors In this expression parameters 1 i , 2 i and 3 i are study-level random effects and manual ijk , bmiu ijk and bmio ijk are binary indicators of manual occupation, being underweight and being overweight, respectively.
The spline s ijk is a piecewise polynomial given by where X ijk denotes the age of individual j in study i at the kth observation time, x 0 is the lowest age for which the function is defined and . . , q are the 'knots' of the spline and the s are regression coefficients. The constants c ! 0 and d ! c represent the 'continuity' and 'order' of the spline, respectively, and ðl, ' In this paper we use a linear spline by selecting c ¼ d ¼ 1. We choose a linear form, as opposed to a quadratic or cubic form, say, because then the knots correspond directly to change points in the fitted trajectory, whereas higher order splines can change direction away from the knots. This aids in interpreting the fitted trajectory and in specifying a prior distribution (see below) for the number of knots q, which is estimated as an unknown parameter, so that the smoothness of the fitted curve is estimated as part of the model, and hence the spline adapts itself to the information present in the data. We scale the age range so that the minimum and maximum values are 0 and 1, respectively.
Possible differences in the variability of measurements by device were accounted for by modelling the residual standard deviation ijk as a function of the device used as follows where a i and m i are study-level random effects representing the study-specific residual standard deviations associated with the AO and MRZ devices, respectively.

Model estimation
The model was estimated within a Bayesian framework. Analyses were performed using WinBUGS 16,17 with 'Jump' interface installed. 18 As described by Lunn et al., 18 the reversible jump algorithm is implemented in WinBUGS 16,17 to draw samples from the joint full conditional distribution of the coefficients , knots and number of knots q, whilst standard Gibbs/Metropolis steps are used to update the remaining model parameters. As the value of q changes during the Markov chain Monte Carlo (MCMC) simulation, so do the dimensions of and . One of the main challenges in such 'variable dimension' analyses is choosing sensible values for the spline coefficients when attempting a dimension-changing move. This is because the change of dimension necessitates a change of parameter space, which may have been visited previously (in the MCMC simulation) only rarely, if ever. Hence the MCMC sampler will have had little opportunity to learn about appropriate parameter values in the new space. This problem can be alleviated if we can derive the full conditional distribution for the coefficients in closed form, since appropriate coefficients can then be generated instantly for any set of proposed knots. A multivariate normal prior for the coefficients combines with the normal likelihood to give a multivariate normal full conditional, which is straightforward to derive. In our analyses the prior mean and variance for each coefficient are 0 and 100 2 , respectively.
As the age range was scaled to the interval (0,1), we chose a Unif(0,1) distribution as a prior distribution for the location of each knot. The RJMCMC algorithm considered here increased the model flexibility with regard to the number of knots of the model. However, it requires a prior distribution for the number of knots q. We chose a Poisson prior with mean 3 as this distribution represents our a priori expectation that, after increasing throughout childhood and adolescence, blood pressure may begin to level off in early adulthood; it may then begin to increase again in middle age, with a possible further change in later life. The shape of the Poisson distribution then penalises large numbers of knots, which encourages parsimony.
The individual-and study-level random effects (or their logarithms) were assumed to arise from normal population distributions. The code can be found in Appendix 1, and a graphical representation of the model is presented in Figure 1. For each analysis, two MCMC chains were simulated from different initial states. Convergence was assessed via visual inspection of the trace plots and by the Brooks-Gelman-Rubin method. 19,20 The Monte Carlo standard error (MCSE) for all parameters of interest was examined periodically to ensure accurate inferences. The rule of thumb of requiring an MCSE less than 5% 21 (p. 277) was fully satisfied in our analysis, with 52% achieved for the vast majority of parameters.

Results
Our results indicate that blood pressure increases with age as illustrated in Figure 2. The figure shows, for both genders, a rapid increase in blood pressure coinciding with peak adolescent growth, followed by a gentle increase in early adulthood, a midlife acceleration beginning in the fifth decade of life and a final period of deceleration in late adulthood. Following the rapid acceleration during adolescence, women typically have lower blood pressure than men (by up to around 10 mmHg), until their late 60s.  Table 2 presents posterior median estimates, along with 95% credible intervals, for the overall mean parameters. Male individuals with a manual occupation had higher blood pressure than individuals with a non-manual occupation (posterior median ¼ 0.85). Although there is a suggestion that this is also the case for females (posterior median ¼ 0.56), the effect is not statistically significant, with a posterior credible interval that includes zero. The effect of manual occupation on blood pressure seems to be greater for males. There is an indication that underweight individuals have lower blood pressure than normal weight individuals, but this effect is not significant for either gender. Overweight individuals, on the other hand, were found to have significantly higher blood pressure than normal weight individuals, for both males and females. For neither gender was there a significant effect on expected blood pressure of using an MRZ as opposed to an AO device. Posterior median parameter estimates for both genders were negative, however, suggesting that blood pressures may be lower, on average, when measured with an MRZ device. The level of residual variability was similar for both devices, suggesting that neither device is more variable than the other. Finally, a similar number of knots were estimated in modelling the men's and women's data, reflecting the fact that similar trajectories were estimated for both genders (see Figure 2). The estimated number of knots was also larger than expected a priori. This is probably due to confounding, leading to artefacts in the fitted spline, as discussed below. Mean posterior estimates of study specific effects are presented in Table 3. The results suggest that individuals with a manual occupation have higher blood pressure than individuals with a nonmanual occupation consistently over the different studies. However, in studies of female individuals, estimates only reached significance in ALSPAC, midlife T-07 cohort and NSHD. In studies of male individuals, significant estimates were found in midlife T-07, NSHD and CAPS studies. Underweight men and women typically had lower blood pressure than normal weight individuals. Although the direction of the effect reversed in older NSHD women, the effect of being underweight in this study did not reach significance (as it also did not for the older T-07 cohort, both genders, NSHD men and T-07 men in the midlife cohort). Consistently across studies, larger body size was significantly associated with higher blood pressure in both men and women.
It is interesting to note that the direction of estimated study-level random effects ( i ) is consistently the same for both men and women, although only the ALSPAC effect for women and the CAPS effect for men are significantly non-zero. These suggest that ALSPAC women typically have blood pressures above the mean curve and CAPS men typically have blood pressures below the mean curve. Only in the CAPS study was the effect of using a different device () significantly non-zero, indicating that measurement with the manual device was associated with higher blood pressures in the CAPS study. Study-specific estimates of residual variation were consistently similar for men and women, and for both devices. There appears to be an upward trend in residual variation with age.
Previous to the work reported here, we conducted a series of sensitivity analyses on each of the independent datasets to examine robustness of results to the method of adjustment for medication intake and to examine whether results varied if the first or second measure was modelled instead of their average. In both cases, results were robust.
Results reported here were obtained adjusting blood pressure measure for medication intake by adding 15% to the registered measure. To examine robustness of results to this method, we varied the percentage added (5, 10, 15, 20, 25%) and also added fixed constants (5, 10, 15 mmHg). Results obtained remained robust.

Discussion
In this paper, we propose a method to combine data from longitudinal studies of individuals in different stages of life to produce a life course trajectory. The method has been illustrated using blood pressure measures from four UK-based longitudinal studies. Our results elucidate the nature of the life course and provide scope for targeting public health interventions at specific age groups. They are also in general agreement with those reported by Wills et al., who found a rapid increase in blood pressure coinciding with peak adolescent growth, gentle increase in early adulthood, midlife acceleration beginning in the fifth decade of life and a final period of deceleration in late adulthood. In addition, they reported differences in blood pressure trajectories by gender, with a similar blood pressure level by the seventh decade. Our results show that following rapid acceleration during adolescence, women typically have lower blood pressure than men (by up to around 10 mmHg), until their late 60s. The local maxima occurring in the late teens for both genders are likely an artefact of the model rather than true maxima, as discussed below. Males with a manual occupation have higher blood pressures, as do overweight individuals from both genders. Although some cohort effects may exist on the risk factors that were not optimally considered when we opted for using cutoff points for the definition of BMI categories, our results suggest other interesting effects, such as underweight individuals having lower blood pressure and a trend between residual (unexplained) variation and age, but these have not been shown to be statistically significant.
Modelling life course trajectories is severely complicated by the fact that individuals are invariably only monitored over some fraction of their life course. In attempting to draw overall inferences about the population of interest, we might naturally wish to fit a parametric model to each individual's data and characterise the population distribution of model parameters. However, unless the appropriate model is well established, this seems unrealistic, since in order to obtain a complete overall/population trajectory, each individual's fitted trajectory must cover the whole life course, with much 'borrowing of strength' required to extrapolate into the (extensive) unobserved periods. If the parametric model is linear in the parameters, such as a spline, then one way around this might be to integrate out the individual-level parameters. However, this would require a bespoke MCMC algorithm, particularly if the number of knots is unknown, as here, rather than the more general modelling framework offered by BUGS, say. This might exploit some aspects of DiMatteo et al.'s RJMCMC approach, but would be substantially more complex due to the hierarchical structure of our model. (Note that we would have to assume a common set of knots for all individuals.) A very basic alternative might be to simply average the observed data in each of a series of short 'age bins', but this would preclude the possibility of controlling for covariates, say. Instead, our approach has been to start with the population life course trajectory and account for departures away from this 'overall' behaviour, due to variability between cohorts and individuals within cohorts, say, by adding random effects. Such an approach is limited, however, by the additive nature of the random effects. For example, excepting the effects of covariates, an individual can only deviate from his/her cohort's overall trajectory, and that cohort can only deviate from the 'global' trajectory, by a constant amount (not changing with age). We could explore the inclusion of multiplicative random effects also, but the model's flexibility would still be limited.
A principal objective of this work has been to avoid the need for strong parametric assumptions regarding the shape of the fitted trajectory whilst accommodating possible risk factors. By assuming the number and location of the spline knots to be unknown, the spline's 'smoothness' is not predetermined and it can adapt itself to the information in the data; estimation is thus largely data driven. As a consequence, the methodology is applicable to much smaller data sets than those considered herein, although the level of detail in the fitted curve will be reduced accordingly. We chose to use a linear adaptive spline, as opposed to a quadratic or cubic spline, say, because then the knots correspond directly to change points in the fitted trajectory, whereas higher order splines can change direction away from the knots (the change points also tend to be more 'visible' with a linear spline). This aids in the specification of a prior distribution for the number of knots and in the interpretation of the fitted trajectory. We note, however, that extensions to higher order splines are straightforward. In our work, a larger than initially expected number of knots was estimated (12 for men and 10 for women). This may be a consequence of the large amount of data included in the analysis, which allows even small variations in the data to be tracked inexpensively by the spline.
In our analyses, the estimation of study-level random effects ( i ) presented some difficulty, with very high autocorrelation in the monitored Markov chains. This may be explained by several factors. First, there is limited overlap between some of the studies, which makes it difficult to identify differences between the studies. (In fact, in the presence of more overlap, in addition to a study effect, it may be necessary to include a cohort effect to better account for potential differences between the studies.) Second, there is a limited number of studies, meaning that even if contrasts between studies can be identified, the actual values of the random effects may still be poorly determined, since a wider range of values is supported by the hierarchical prior when the number of effects is small. Finally, the flexibility of the spline is such that, where there is little overlap, it can account for any differences between studies itself, and so the study effects and the spline are somewhat confounded, making for relatively slow exploration of the posterior distribution. The problem of slow exploration is alleviated straightforwardly by simulating longer Markov chains, though in our case this led to lengthy run-times, due to the large data sets (30-40,000 observations). The problem of confounding is much more challenging, and we believe may be responsible for the some of the 'sudden' direction changes in the fitted trajectories.
The introduced methodology assumes that missing information on blood pressure measurements is modelled as missing at random. Although a missing at random missing data assumption is likely to be plausible in the younger cohorts, it may not be realistic in the older cohorts as individuals with higher blood pressure are more likely to dropout of the studies. Extensions to the proposed method are under consideration to account for informative missing data.