Meta-analysis of prediction model performance across multiple studies: Which scale helps ensure between-study normality for the C-statistic and calibration measures?

If individual participant data are available from multiple studies or clusters, then a prediction model can be externally validated multiple times. This allows the model’s discrimination and calibration performance to be examined across different settings. Random-effects meta-analysis can then be used to quantify overall (average) performance and heterogeneity in performance. This typically assumes a normal distribution of ‘true’ performance across studies. We conducted a simulation study to examine this normality assumption for various performance measures relating to a logistic regression prediction model. We simulated data across multiple studies with varying degrees of variability in baseline risk or predictor effects and then evaluated the shape of the between-study distribution in the C-statistic, calibration slope, calibration-in-the-large, and E/O statistic, and possible transformations thereof. We found that a normal between-study distribution was usually reasonable for the calibration slope and calibration-in-the-large; however, the distributions of the C-statistic and E/O were often skewed across studies, particularly in settings with large variability in the predictor effects. Normality was vastly improved when using the logit transformation for the C-statistic and the log transformation for E/O, and therefore we recommend these scales be used for meta-analysis. An illustrative example is given using a random-effects meta-analysis of the performance of QRISK2 across 25 general practices.
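As a minimal sketch of the recommended scales, the snippet below applies the logit transformation to a C-statistic and the log transformation to an E/O statistic before pooling; the delta-method standard error shown here is a standard approximation and the example values (C = 0.80, SE = 0.02, E/O = 1.25) are purely illustrative, not taken from the paper.

```python
import math

def logit(c):
    """Logit transformation of a C-statistic (c must lie strictly in (0, 1))."""
    return math.log(c / (1.0 - c))

def se_logit(c, se_c):
    """Approximate standard error on the logit scale via the delta method."""
    return se_c / (c * (1.0 - c))

# Illustrative study result: C = 0.80 with SE 0.02, and E/O = 1.25.
c, se_c, eo = 0.80, 0.02, 1.25

logit_c = logit(c)               # meta-analyse the C-statistic on this scale
se_logit_c = se_logit(c, se_c)   # delta-method SE on the logit scale
log_eo = math.log(eo)            # meta-analyse E/O on the log scale

# Back-transform a (pooled) logit-C result to the C-statistic scale.
pooled_c = 1.0 / (1.0 + math.exp(-logit_c))
```

The back-transformation at the end is what would be applied to the pooled estimate and its confidence limits after the random-effects meta-analysis has been performed on the transformed scale.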


Supplementary material 3

Simulation extension 1
To be more realistic, the values of age sampled for patients were restricted to between -42 and 40 on the mean-centred scale (corresponding to between 18 and 100 years if the mean age is 60). Therefore, if an age < -42 or age > 40 was sampled for a patient, that value was discarded and another value was sampled until an age within the specified range was obtained.
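The redraw-until-in-range procedure described above is simple rejection sampling from a truncated normal distribution. A minimal sketch, using the mean-centred age distribution with SD 17.6 stated in the supplementary material (the function name is our own):

```python
import random

def sample_restricted_age(mean=0.0, sd=17.6, lower=-42.0, upper=40.0):
    """Draw a mean-centred age, redrawing until it falls inside [lower, upper]."""
    while True:
        age = random.gauss(mean, sd)
        if lower <= age <= upper:
            return age

# Draw a sample of restricted ages; every value lies in the permitted range.
ages = [sample_restricted_age() for _ in range(10_000)]
```

Because the bounds sit roughly 2.3 SDs from the mean, almost all draws are accepted, so the loop rarely iterates more than once per patient.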
Restricting the age range did not result in skewed distributions for any of the scenarios. It had very little effect on the distributions at all, except for the C-statistic, which was only slightly lower when age was restricted (Supplementary Figure 9).

Supplementary Figure 9: Histograms of performance statistics in scenario 7, with data generated using the original mean-centred age distribution N(0, 17.6²) and with age restricted to between -42 and 40 (corresponding to 18 and 100 years if the mean age is 60).

Simulation extension 2
In addition to limiting the age range as in extension 1 above, the distribution from which age was sampled was allowed to vary across studies. To this end, we sampled the study-specific mean and SD values for age from normal distributions.
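A minimal sketch of drawing study-specific age distributions is given below. The hyperparameter values here are hypothetical placeholders (the values actually used in the simulations are not reproduced in this extract); only the structure, a normal distribution for the study-specific mean and another for the study-specific SD, follows the text.

```python
import random

# HYPOTHETICAL hyperparameters -- placeholders only, not the values used in the paper.
MEAN_OF_MEANS, SD_OF_MEANS = 0.0, 2.0    # distribution of study-specific means
MEAN_OF_SDS, SD_OF_SDS = 17.6, 1.0       # distribution of study-specific SDs

def sample_study_age_distribution():
    """Draw the mean and SD of the age distribution for one simulated study."""
    study_mean = random.gauss(MEAN_OF_MEANS, SD_OF_MEANS)
    study_sd = abs(random.gauss(MEAN_OF_SDS, SD_OF_SDS))  # keep the SD positive
    return study_mean, study_sd
```

Each simulated study would first draw its own (mean, SD) pair, and then sample patient ages from N(mean, SD²), subject to the range restriction of extension 1.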
The between-study distributions of the calibration measures (E/O, calibration slope and calibration-in-the-large) were very similar to extension 1, where the range of ages was restricted. However, the between-study distribution for the C-statistic was wider when the mean and SD varied, and started to skew with strong predictors (such as in scenarios 7 and 8, Supplementary Figure 10). Using the logit transformation offered some improvement towards normality, but distributions sometimes remained skewed. Note that scenario 9 was defined to have a high number of outcomes and a strong predictor; with varying age distributions, computational problems were therefore encountered and performance statistics could not be calculated for all studies, so scenario 9 was excluded.

Supplementary Figure 10: Histograms of performance statistics comparing a fixed mean and SD for age with random effects on the mean and SD for age in scenario 7.
Simulation extension 3
Additional simulation settings were also considered that involved generating data from a multivariable model that included a second predictor and an interaction between age and the additional predictor. However, the model to be examined for its performance in each study still only included age, as in the previous simulation settings. Thus, this reflects a situation where the model being considered for use in clinical practice is incomplete (i.e. it misses important predictors), and is therefore a potentially more realistic alternative to the settings described previously. Extension 3 built upon the previous two extensions, so age was restricted and random effects were assumed for the mean and SD of age. However, for simplicity, no variability in the intercept or predictor effects was considered in this extended setting, and simulations were also restricted to scenarios 4 to 6, where the predictor effect was moderate rather than weak (as in scenarios 1 to 3) or strong (as in scenarios 7 to 9). Scenarios 4 to 6 were considered ideal because the original predictor, age, could discriminate reasonably well between patients who have the outcome and patients who do not, but with room for improvement in the model if a further predictor and interaction were added.
The model for generating data in extension 3 was specified as:

logit(p_ij) = α + β1 age_ij + β2 x2_ij + β3 (age_ij × x2_ij)   (8)

where x2_ij denotes the additional predictor for patient j in study i. This extended setting was considered for both a continuous and a categorical additional predictor. The data for the 500 000 patients in each of the 1000 studies were generated, using the new model with the specified parameter values, in a similar manner to the steps outlined in Box 1.
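The per-patient data generation for this extended setting can be sketched as follows. The coefficient values are illustrative placeholders (the actual values are given in Supplementary Table 2), and the binary version of the additional predictor is assumed here to be Bernoulli(0.5) for the sketch:

```python
import math
import random

# Illustrative coefficient values only -- NOT the values used in the paper.
ALPHA, B_AGE, B_X2, B_INT = -2.0, 0.05, 0.5, 0.01

def generate_patient(continuous_x2=True):
    """Generate one patient from the extended model with an age-by-x2 interaction."""
    # Mean-centred age, redrawn until it lies in the restricted range [-42, 40].
    while True:
        age = random.gauss(0.0, 17.6)
        if -42.0 <= age <= 40.0:
            break
    # Additional predictor: continuous or binary, depending on the variant considered.
    x2 = random.gauss(0.0, 1.0) if continuous_x2 else float(random.random() < 0.5)
    lp = ALPHA + B_AGE * age + B_X2 * x2 + B_INT * age * x2  # linear predictor, model (8)
    p = 1.0 / (1.0 + math.exp(-lp))                          # outcome probability
    y = int(random.random() < p)                             # binary outcome
    return age, x2, y
```

In the full simulation this would be repeated for each of the 500 000 patients per study, with the study-specific age mean and SD drawn as in extension 2.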
For the prediction model to be examined, the assumed value of the single age coefficient would, in reality, also account for some of the variation in the other terms not fitted.
Therefore, to calculate its assumed value, a large sample of five million patients was generated from model (8) and used to estimate α and the single age coefficient, and these estimates were used to form the prediction model to be examined (Supplementary Table 2). The additional predictor and interaction affected the α values, and the age coefficient values were slightly larger only when the missing predictor was continuous.
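The idea that the fitted age-only coefficient absorbs part of the omitted terms can be sketched as below: generate data from the extended model and fit the age-only logistic model by maximum likelihood. All parameter values are illustrative placeholders (the actual values are in Supplementary Table 2), the sketch omits the age restriction and study-level random effects, and it uses 20 000 patients rather than five million purely to keep the example fast.

```python
import math
import random

random.seed(1)

# Illustrative parameter values only -- NOT the values used in the paper.
ALPHA, B_AGE, B_X2, B_INT = -2.0, 0.05, 0.5, 0.01

# Generate a large sample from the extended model (continuous x2 variant).
data = []
for _ in range(20_000):
    age = random.gauss(0.0, 17.6)
    x2 = random.gauss(0.0, 1.0)
    lp = ALPHA + B_AGE * age + B_X2 * x2 + B_INT * age * x2
    y = int(random.random() < 1.0 / (1.0 + math.exp(-lp)))
    data.append((age, y))

# Fit the age-only logistic model by maximum likelihood (Newton-Raphson),
# so the single age coefficient partly absorbs the omitted terms.
a, b = 0.0, 0.0
for _ in range(25):
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        g0 += y - p            # score contribution for the intercept
        g1 += (y - p) * x      # score contribution for the age coefficient
        w = p * (1.0 - p)      # weight for the information matrix
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01           # invert the 2x2 information matrix
    a += (h11 * g0 - h01 * g1) / det
    b += (h00 * g1 - h01 * g0) / det
```

After fitting, `a` and `b` play the role of the assumed intercept and age coefficient of the prediction model to be examined; with a continuous omitted predictor they differ slightly from the conditional values used to generate the data.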

The between-study distributions of the performance statistics for this extended setting are shown in Supplementary Figure 11. The variances of the between-study distributions remained very small and are likely due to the minimal amount of sampling error rather than any between-study heterogeneity.