Estimation of required sample size for external validation of risk models for binary outcomes

Risk-prediction models for health outcomes are used in practice as part of clinical decision-making, and it is essential that their performance be externally validated. An important aspect in the design of a validation study is choosing an adequate sample size. In this paper, we investigate the sample size requirements for validation studies with binary outcomes to estimate measures of predictive performance (the C-statistic for discrimination, and the calibration slope and calibration in the large for calibration). We aim for sufficient precision in the estimated measures. In addition, we investigate the sample size needed to achieve sufficient power to detect a difference from a target value. Under normality assumptions on the distribution of the linear predictor, we obtain simple estimators for sample size calculations based on the measures above. Simulation studies show that the estimators perform well for common values of the C-statistic and outcome prevalence when the linear predictor is marginally Normal. Their performance deteriorates only slightly when the normality assumptions are violated. We also propose estimators which do not require normality assumptions but require specification of the marginal distribution of the linear predictor and the use of numerical integration. These estimators were also seen to perform very well under marginal normality. Our sample size equations require a specified standard error (SE) and the anticipated C-statistic and outcome prevalence. The sample size requirement varies according to the prognostic strength of the model, outcome prevalence, choice of the performance measure and study objective. For example, to achieve an SE < 0.025 for the C-statistic, 60–170 events are required if the true C-statistic and outcome prevalence are between 0.64–0.85 and 0.05–0.3, respectively. For the calibration slope and calibration in the large, achieving SE < 0.15 would require 40–280 and 50–100 events, respectively.
Our estimators may also be used for survival outcomes when the proportion of censored observations is high.


1 Proof of the formulae for the variance of the estimated C-statistic

Let η^(1)_i, i = 1, …, n_1, and η^(0)_j, j = 1, …, n_0, denote the linear predictor for the ith case and jth control, respectively. Also, let η^(1) = (η^(1)_1, …, η^(1)_{n_1}) and η^(0) = (η^(0)_1, …, η^(0)_{n_0}). The Mann-Whitney estimator of C, the probability that a randomly selected observation from the sample represented by G_0 will be less than or equal to a randomly selected observation from the population represented by G_1, is:

Ĉ = (1/(n_0 n_1)) Σ_{i=1}^{n_1} Σ_{j=1}^{n_0} ψ(η^(1)_i, η^(0)_j),

where ψ(X, Y) = 1 if X > Y, ψ(X, Y) = 1/2 if X = Y, and ψ(X, Y) = 0 if X < Y. We define for each case, i, and each control, j, the quantities

V_10(η^(1)_i) = (1/n_0) Σ_{j=1}^{n_0} ψ(η^(1)_i, η^(0)_j) and V_01(η^(0)_j) = (1/n_1) Σ_{i=1}^{n_1} ψ(η^(1)_i, η^(0)_j).

As DeLong et al. (1988) and Cleve (2012) do, we subsequently omit the third term on the right-hand side of equation (2) as it is negligible when n_0 and n_1 are large. We first obtain an expression for the variance of the C-statistic based on DeLong's asymptotic variance. Subsequently we aim to obtain a closed-form expression for the variance of the estimated C-statistic that is free from patient-level information.
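As a concrete illustration, Ĉ and the DeLong-type variance (with the negligible third term dropped, as above) can be computed directly from patient-level data. The sketch below is a minimal Python implementation; the function name `c_statistic_delong` is hypothetical:

```python
import numpy as np

def c_statistic_delong(eta_cases, eta_controls):
    """Mann-Whitney C-statistic and DeLong-style variance estimate.

    Returns (C_hat, var_hat), where var_hat keeps only the first two
    terms of DeLong's variance expression (the third is negligible
    when n0 and n1 are large).
    """
    n1, n0 = len(eta_cases), len(eta_controls)
    # psi(x, y) = 1 if x > y, 0.5 if x == y, 0 otherwise
    psi = (eta_cases[:, None] > eta_controls[None, :]).astype(float)
    psi += 0.5 * (eta_cases[:, None] == eta_controls[None, :])
    c_hat = psi.mean()
    v10 = psi.mean(axis=1)   # per-case structural components V_10
    v01 = psi.mean(axis=0)   # per-control structural components V_01
    var_hat = np.var(v10, ddof=1) / n1 + np.var(v01, ddof=1) / n0
    return c_hat, var_hat
```

With conditionally Normal linear predictors (means 1 and 0, common SD 1), the true C is Φ(1/√2) ≈ 0.76, which the estimator recovers closely in large samples.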
To achieve this we make the assumption that the distribution of the linear predictor is conditionally Normal given the binary outcome.
Given an arbitrary (marginal) distribution F of the linear predictor with density function f, the distributions of the linear predictor for cases and controls, respectively, have been given by Gail and Pfeiffer (2005); for a well-calibrated model, the corresponding cumulative distribution functions are

G(x) = (1/p) ∫_{−∞}^{x} π(u) f(u) du (cases) and K(x) = (1/(1 − p)) ∫_{−∞}^{x} [1 − π(u)] f(u) du (controls),

where π(u) = (1 + exp(−u))⁻¹. These probability distributions can be obtained using numerical integration, after assuming a functional form for the probability density function of η. Having obtained these probability distributions, the expectations in (13) can also be computed using numerical integration.
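For illustration, these conditional distributions can be evaluated by numerical integration as described. The sketch below (the helper name `case_control_cdfs` is hypothetical) assumes a Normal marginal density for η and a well-calibrated model, so that P(Y = 1 | η = u) = (1 + exp(−u))⁻¹:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def case_control_cdfs(x, mu, sigma):
    """Conditional c.d.f.s of the linear predictor for cases (G) and
    controls (K) at x, by numerical integration, assuming eta has a
    Normal marginal density with mean mu and sd sigma and the model
    is well calibrated, i.e. P(Y=1 | eta=u) = expit(u).

    Returns (G(x), K(x), p), where p is the implied prevalence."""
    expit = lambda u: 1.0 / (1.0 + np.exp(-u))
    f = lambda u: norm.pdf(u, mu, sigma)
    p, _ = integrate.quad(lambda u: expit(u) * f(u), -np.inf, np.inf)
    num_G, _ = integrate.quad(lambda u: expit(u) * f(u), -np.inf, x)
    num_K, _ = integrate.quad(lambda u: (1 - expit(u)) * f(u), -np.inf, x)
    return num_G / p, num_K / (1 - p), p
```

A useful internal check is that the mixture of the two conditional c.d.f.s recovers the marginal: p·G(x) + (1 − p)·K(x) = F(x) for all x.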
In practice, risk models most often include a number of continuous and categorical predictors, and, unless this number is very small or there are only binary predictors with extreme prevalences, the distribution of η is likely to be approximately marginally Normal.
Assumption 1: Marginal normality of the linear predictor

In applying equation (13) under the assumption of marginal normality, values for the parameters µ and σ² need to be chosen to match the anticipated values of the outcome prevalence and C-statistic. To avoid the use of simulation in choosing suitable values for µ and σ², we obtain in the next subsection expressions for µ and σ² that correspond approximately to the required anticipated values of C and p. We also show that the approximation works very well for a wide range of values of C and p (within 1.5% of the required anticipated values in all scenarios). More information for cross-checking that these values are adequate is given in Supplementary Material 3.
To obtain a simpler estimator of the variance of Ĉ that does not depend on patient-level data and involves less computation, we alternatively assume that the linear predictor is Normally distributed conditionally on Y.
Assumption 2: Conditional normality of the linear predictor

Under Assumption 2, the assumption of conditional normality for the distribution of the linear predictor, a simple closed-form expression can be conveniently obtained for the variance of the estimated C-statistic, by substituting K and G in (13) with the cumulative distribution function of the Normal distribution.
Hence, from equation (21) we obtain the variance estimator under Assumption 2. As we demonstrate later, conditional normality of the linear predictor also corresponds approximately to marginal normality of the linear predictor for values of the C-statistic up to about 0.9. As we show in simulations in the main paper, our derived formula performs very well when the linear predictor is Normally distributed, which is a realistic assumption that is likely to hold in practice.
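As an independent way to evaluate var(Ĉ) under conditional normality, one can use the classical Mann-Whitney (Hanley-McNeil) form of the asymptotic variance, with its two moments Q_1 = P(η^(1) > η^(0), η^(1) > η^(0)') and Q_2 = P(η^(1) > η^(0), η^(1)' > η^(0)) obtained by numerical integration. The sketch below follows that standard route under the binormal equal-variance model; it is not a transcription of the closed-form equation above, and the name `var_c_binormal` is hypothetical:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def var_c_binormal(C, p, n):
    """Asymptotic variance of the Mann-Whitney C-statistic under
    conditional normality with equal variances, using the classical
    Mann-Whitney variance with moments Q1 and Q2 evaluated by
    numerical integration. n1 = p*n cases, n0 = (1-p)*n controls."""
    n1, n0 = p * n, (1 - p) * n
    delta = np.sqrt(2) * norm.ppf(C)  # (mu1 - mu0) / sigma_c implied by C
    # Q1 = E[ Phi(u + delta)^2 ] over a standard normal u (case fixed)
    q1, _ = integrate.quad(
        lambda u: norm.cdf(u + delta) ** 2 * norm.pdf(u), -8, 8)
    # Q2 = E[ (1 - Phi(u - delta))^2 ] over a standard normal u (control fixed)
    q2, _ = integrate.quad(
        lambda u: (1 - norm.cdf(u - delta)) ** 2 * norm.pdf(u), -8, 8)
    return (C * (1 - C) + (n1 - 1) * (q1 - C ** 2)
            + (n0 - 1) * (q2 - C ** 2)) / (n0 * n1)
```

A built-in check: at C = 0.5 the expression reduces to the null variance of the Mann-Whitney statistic, (n_0 + n_1 + 1)/(12 n_0 n_1).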
Proof of the formulae for the values of µ and σ² under marginal normality

Assuming that the distribution of the linear predictor is Normal with parameters µ and σ², in applying (13) values for these parameters can be chosen by simulation to correspond to the anticipated p and C.
However, we note that when the conditional distribution of the linear predictor (given the outcome) is Normal with common variance in the cases and controls groups, then the marginal distribution of the linear predictor is also approximately Normal when C is not too large (< 0.9). This can be seen in Figure (1) below for p = 0.1 and values of C between 0.64 and 0.98. Hence, for values of C < 0.9, the marginal linear predictor can be reasonably approximated as the mixture of two conditionally Normal (on the outcome) distributions with common variance. Letting η^(1) ∼ N(µ_1, σ_c²) and η^(0) ∼ N(µ_0, σ_c²) denote the conditional distributions of the linear predictor in the cases and controls groups, respectively, the marginal distribution of the linear predictor, η, can be approximated by the mixture

η ∼ p N(µ_1, σ_c²) + (1 − p) N(µ_0, σ_c²).   (23)

We first note that

C ≈ Φ((µ_1 − µ_0)/(√2 σ_c)).

Consider the calibration model

P(Y = 1 | η) = (1 + exp[−(α + βη)])⁻¹.   (24)

Using the relationship between the calibration parameters in the logistic regression model (24) and the parameters of the corresponding LDA model (e.g. Efron (1975)), these parameters can be expressed as

β = (µ_1 − µ_0)/σ_c² and α = log(p/(1 − p)) − (µ_1² − µ_0²)/(2σ_c²).

Assuming a well-calibrated model with α = 0 and β = 1 in model (24), then

µ_1 − µ_0 = σ_c² and µ_1 + µ_0 = 2 log(p/(1 − p)),

and hence

σ_c² = 2[Φ⁻¹(C)]², µ_1 = log(p/(1 − p)) + [Φ⁻¹(C)]², µ_0 = log(p/(1 − p)) − [Φ⁻¹(C)]².

Using (23) we can approximate the mean and the variance of the marginally Normally distributed linear predictor, given p and C:

µ = E(η) ≈ p µ_1 + (1 − p) µ_0 = log(p/(1 − p)) + (2p − 1)[Φ⁻¹(C)]²   (29)

and

σ² = var(η) ≈ σ_c² + p(1 − p)(µ_1 − µ_0)² = 2[Φ⁻¹(C)]² + 4p(1 − p)[Φ⁻¹(C)]⁴.   (30)

In practice, the selected values of µ and σ² should correspond to the anticipated C and p. For given values of µ and σ², the actual p and C can be computed using simulation and the following steps:
1. Set the anticipated p and C. These are the input values in the steps below.
2. Compute µ and σ² using (29) and (30).
3. Generate a large sample of linear predictor values η ∼ N(µ, σ²).
4. Generate outcomes Y ∼ Bernoulli(π), where π = (1 + exp(−η))⁻¹.
5. Compute the actual p as the observed outcome proportion and the actual C as the Mann-Whitney C-statistic of η with respect to Y.
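The simulation check in steps (1)-(5) can be sketched as follows. The helper names `mu_sigma2` and `actual_C_p` are hypothetical, and the expressions for µ and σ² coded below are those given by (29) and (30):

```python
import numpy as np
from scipy.stats import norm, rankdata

def mu_sigma2(C, p):
    """mu and sigma^2 of the marginally Normal linear predictor that
    correspond (approximately) to an anticipated C-statistic C and
    prevalence p, via equations (29) and (30)."""
    d2 = norm.ppf(C) ** 2
    mu = np.log(p / (1 - p)) + (2 * p - 1) * d2
    sigma2 = 2 * d2 + 4 * p * (1 - p) * d2 ** 2
    return mu, sigma2

def actual_C_p(C, p, n=400_000, seed=2023):
    """Steps 1-5: simulate eta ~ N(mu, sigma^2), Y ~ Bernoulli(expit(eta)),
    and return the actual (empirical) C-statistic and prevalence."""
    mu, sigma2 = mu_sigma2(C, p)
    rng = np.random.default_rng(seed)
    eta = rng.normal(mu, np.sqrt(sigma2), n)
    y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
    n1 = y.sum()
    ranks = rankdata(eta)                      # Mann-Whitney C via ranks
    c_act = (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * (n - n1))
    return c_act, y.mean()
```

For anticipated values in the ranges studied here (C up to about 0.8), the actual C and p recovered by the simulation are close to the inputs, in line with Table (1).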
To check the quality of the chosen values for µ and σ² using (29) and (30), we apply steps (1)-(5) above for a range of values of the anticipated, input values of C and p. The results presented in Table (1) show that the actual p and C for the chosen values of µ and σ² are very close to the anticipated, input C-statistic and prevalence. Hence, for the purposes of sample size calculations, (29) and (30) can be reliably used to select the values of µ and σ² that correspond to the desired C and p, avoiding the need for trial and error. Specifically, the agreement is remarkably good for values of C up to 0.8 (within 0.5% of the true value), while for C >= 0.85 the disagreement increases slightly (up to 1.5% of the true value when C = 0.9). Should it be required, this minor disagreement for high values of C can be resolved by slightly inflating σ_c in (28) by a factor f_c. Inflating by a factor of 1.02-1.03 when C = 0.85 and 1.03-1.05 when C = 0.9 will provide actual values that are closer to the required anticipated values. More details, and the code to perform these checks, are given in Supplementary Material 3.

[Table 1: Actual p and C for values of µ and σ² chosen using (29) and (30). The actual p and C are, for the purposes of sample size calculations, sufficiently close to the required anticipated values.]

2 Proof of the closed-form formula for the variance of the estimated calibration in the large

The calibration in the large is the intercept term in the following logistic regression model:

logit{P(Y_i = 1)} = α_CIL + η_i,   (31)

which is equivalent to model (36), with the coefficient of the calibration slope set to 1 (i.e. the estimated linear predictor is included as an offset term).
The variance of the estimated calibration in the large can be obtained as the asymptotic approximation based on the inverse of Fisher's information in model (31):

var(α̂_CIL) ≈ 1/(n E(W)),   (32)

where W = π(1 − π), π = (1 + exp(−η))⁻¹, and η is assumed to follow a distribution F with parameters θ.
Assuming that the distribution of η is Normal with mean µ and variance σ² (Assumption 1), we use Taylor approximations to obtain a closed-form expression for E(W) in terms of µ and σ² only.
The Taylor expansion of w = g(η) = π(1 − π) around η = µ is

g(η) ≈ g(µ) + g′(µ)(η − µ) + (1/2) g″(µ)(η − µ)² + (1/6) g‴(µ)(η − µ)³,

where g′(η) = w(1 − 2π) and g″(η) = w(1 − 2π)² − 2w². Taking the expectation of the expression above, the odd central moments of η are zero, hence the expectation of the Taylor expansion up to order 3 is:

E(W) ≈ g(µ) + (σ²/2) g″(µ).   (34)

We note that equation (34) is also true for any distribution of the linear predictor with mean µ and variance σ², assuming that the terms above order 2 in the Taylor approximation are negligible. Writing π_µ = (1 + exp(−µ))⁻¹ and w_µ = π_µ(1 − π_µ), we obtain

E(W) ≈ w_µ + (σ²/2)[w_µ(1 − 2π_µ)² − 2w_µ²].   (35)
Substituting (35) in (32) we obtain an expression for the variance of the estimated calibration in the large that only depends on the sample size, µ and σ 2 .
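As a numerical illustration, the Taylor approximation to E(W) can be checked against direct numerical integration over the Normal density, and the resulting variance of the estimated calibration in the large computed. The function names below are hypothetical:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def expected_W_taylor(mu, sigma2):
    """Second-order Taylor approximation to E(W), with W = pi(1 - pi)."""
    pi = 1 / (1 + np.exp(-mu))
    w = pi * (1 - pi)
    # g''(mu) = w*(1 - 2*pi)^2 - 2*w^2
    return w + 0.5 * sigma2 * (w * (1 - 2 * pi) ** 2 - 2 * w ** 2)

def expected_W_numeric(mu, sigma2):
    """E(W) by numerical integration over the Normal density of eta."""
    sd = np.sqrt(sigma2)
    integrand = lambda u: (1 / (1 + np.exp(-u))) ** 2 * np.exp(-u) \
        * norm.pdf(u, mu, sd)   # pi*(1-pi) = expit(u)^2 * exp(-u)
    val, _ = integrate.quad(integrand, mu - 10 * sd, mu + 10 * sd)
    return val

def var_cil(n, mu, sigma2):
    """Approximate variance of the estimated calibration in the large,
    1 / (n * E(W)), using the Taylor approximation to E(W)."""
    return 1 / (n * expected_W_taylor(mu, sigma2))
```

For moderate σ² the Taylor and numerical values of E(W) agree closely; the agreement degrades as σ² grows, since the neglected higher-order terms then matter more.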
3 Proof of the closed-form formula for the variance of the estimated calibration slope

The calibration slope is the coefficient of the linear predictor in the following logistic regression model:

logit{P(Y_i = 1)} = α + β_cs η_i.   (36)

We aim to obtain an estimator for the variance of the estimated calibration slope that is free from patient-level information. To achieve this we start by assuming that the distribution of the linear predictor is conditionally Normal given the binary outcome, and that the corresponding variances are equal.
Assumption 3: Conditional normality of the linear predictor with equal variances, i.e. η^(1)_i ∼ N(µ_1, σ_c²) and η^(0)_j ∼ N(µ_0, σ_c²), where η^(1)_i and η^(0)_j are defined as above.
Substituting (41) into (40) gives the closed-form expression (42) for the variance of the estimated calibration slope. This formula for calculating the variance of β̂_cs^logis is valid as long as the variance of β̂_cs^LDA is the same as the variance of the estimated calibration slope, β̂_cs^logis, obtained from the fit of model (36). Since logistic regression models P(Y | η) while LDA models P(η | Y) and P(Y), hence using more information than logistic regression, var(β̂_cs^LDA) will, at least asymptotically, be smaller than var(β̂_cs^logis). As noted by Efron (1975), the efficiency of LDA relative to logistic regression is higher for higher values of C, while the efficiency of the two methods is similar for values of C up to 0.8-0.85 and prevalence close to 0.5. This is confirmed by a simulation in Section 5 comparing the efficiency of the two methods when data are generated under Assumption 3 (DGM1 of Section 5), presented in Figure (2) below. Therefore our variance formula above is expected to work well for this range of values. For very high values of the C-statistic, the variance of the estimated calibration slope obtained from fitting model (36) will tend to be underestimated by equation (42).
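As a sketch of how such a comparison can be run, the Monte Carlo below generates data under Assumption 3 with a well-calibrated model (so the true slope is 1), fits the calibration model (36) by Newton-Raphson, and reports the empirical mean and SD of the estimated slope. All names and parameter choices are illustrative only:

```python
import numpy as np
from scipy.stats import norm

def fit_logistic_slope(eta, y, iters=25):
    """Fit logit P(Y=1) = a + b*eta by Newton-Raphson; return b."""
    X = np.column_stack([np.ones_like(eta), eta])
    beta = np.zeros(2)
    for _ in range(iters):
        pi = 1 / (1 + np.exp(-X @ beta))
        W = pi * (1 - pi)
        grad = X.T @ (y - pi)
        hess = (X * W[:, None]).T @ X
        # tiny ridge guards against a near-singular Hessian
        beta = beta + np.linalg.solve(hess + 1e-10 * np.eye(2), grad)
    return beta[1]

def mc_slope_sd(C, p, n, reps=300, seed=7):
    """Empirical mean and SD of the estimated calibration slope when
    data are generated under Assumption 3 (conditional normals with
    equal variance) and a well-calibrated model, so the true slope
    is 1.  The conditional parameters follow the derivation above."""
    rng = np.random.default_rng(seed)
    d2 = norm.ppf(C) ** 2
    mu1, mu0 = np.log(p / (1 - p)) + d2, np.log(p / (1 - p)) - d2
    sc = np.sqrt(2 * d2)
    slopes = []
    for _ in range(reps):
        y = rng.binomial(1, p, n)
        eta = np.where(y == 1, rng.normal(mu1, sc, n), rng.normal(mu0, sc, n))
        slopes.append(fit_logistic_slope(eta, y.astype(float)))
    slopes = np.array(slopes)
    return slopes.mean(), slopes.std(ddof=1)
```

The empirical SD from such a simulation can serve as a benchmark against which a closed-form variance such as (42) is checked, mirroring the comparison reported in Figure (2).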