Value-of-Information Analysis for External Validation of Risk Prediction Models

Background: A previously developed risk prediction model needs to be validated before being used in a new population. The finite size of the validation sample entails uncertainty around model performance. We apply value-of-information (VoI) methodology to quantify the consequences of this uncertainty in terms of net benefit (NB).

Methods: We define the expected value of perfect information (EVPI) for model validation as the expected loss in NB due to not confidently knowing which of the alternative decisions confers the highest NB. We propose bootstrap-based and asymptotic methods for EVPI computation and conduct simulation studies to compare their performance. In a case study, we use the non-US subsets of a clinical trial as the development sample for predicting mortality after myocardial infarction and calculate the validation EVPI for the US subsample.

Results: The computation methods generated similar EVPI values in simulation studies. EVPI generally declined with larger samples. In the case study, at the prespecified threshold of 0.02, the best decision with current information would be to use the model, with an incremental NB of 0.0020 over treating all. At this threshold, the EVPI was 0.0005 (relative EVPI = 25%). When scaled to the annual number of heart attacks in the US, the expected NB loss due to uncertainty was equal to 400 true positives or 19,600 false positives, indicating the value of further model validation.

Conclusion: VoI methods can be applied to the NB calculated during external validation of clinical prediction models. While uncertainty does not directly affect the clinical implications of NB findings, validation EVPI provides an objective perspective on the need for further validation and can be reported alongside NB in external validation studies.

Highlights:
- External validation is a critical step when transporting a risk prediction model to a new setting, but the finite size of the validation sample creates uncertainty about the performance of the model.
- In decision theory, such uncertainty is associated with loss of net benefit because it can prevent one from identifying whether the use of the model is beneficial over alternative strategies.
- We define the expected value of perfect information for external validation as the expected loss in net benefit from not confidently knowing whether the use of the model is net beneficial.
- The adoption of a model for a new population should be based on its expected net benefit; independently, value-of-information methods can be used to decide whether further validation studies are warranted.


The model
We model a stylized scenario, as described in the main text, in which there is only one predictor, $x$, which has a standard normal distribution in the target (validation) population. The true data-generating function for the binary outcome $Y$ is of the typical logistic form $\mathrm{logit}\{P(Y = 1 \mid x)\} = \beta_0 + \beta_1 x$.
We chose $\beta_0 = -1.55$ and $\beta_1 = 0.77$ to reflect a 'typical' scenario: the event probability in the population is 0.2, and the best prediction model, which is the true outcome-generating function itself, has a c-statistic of 0.70.

Changing model discrimination
Model discrimination was degraded by introducing noise to the predicted risks. This was modeled by adding a normally distributed random variable (with zero mean and varying standard deviation [SD]) to the logit of the predicted probabilities. Adding such noise results in random changes in the ranking of predicted risks, thus reducing model discrimination. Because the variance of predicted risks generated in this way exceeds that of the true risks, this setup simulates optimistic predictions, manifested as a calibration slope of less than one. We varied the SD of the noise to generate models with different levels of discrimination.
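As an illustration only, the following is a minimal Python sketch of this data-generating setup and the logit-scale noise perturbation. All function and variable names (e.g., simulate_population, degrade_discrimination, noise_sd) are ours and not taken from the original analysis code.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_population(n, beta0=-1.55, beta1=0.77):
    """Draw the single predictor and the binary outcome from the true logistic model."""
    x = rng.standard_normal(n)                      # predictor ~ N(0, 1)
    true_logit = beta0 + beta1 * x                  # logit(P(Y = 1 | x))
    p_true = 1 / (1 + np.exp(-true_logit))          # true event probabilities
    y = rng.binomial(1, p_true)                     # observed binary outcomes
    return x, p_true, y

def degrade_discrimination(p_true, noise_sd):
    """Add N(0, sd^2) noise on the logit scale to mimic a model with poorer discrimination."""
    noisy_logit = np.log(p_true / (1 - p_true)) + rng.normal(0, noise_sd, p_true.size)
    return 1 / (1 + np.exp(-noisy_logit))           # perturbed 'predicted' risks

x, p_true, y = simulate_population(2000)
p_model = degrade_discrimination(p_true, noise_sd=0.5)
print(y.mean())  # event rate, approximately 0.2
```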

Changing calibration intercept while preserving discrimination
This was performed by creating prediction models that differ from the actual outcome probabilities only in the intercept ($\beta_0$). This approach preserves the c-statistic of the model (0.70) but changes its calibration intercept. We applied differences of {-1.6, -0.8, -0.4, 0.4, 0.8, 1.6} to the intercept. We note that at extreme negative values the model severely underestimates the correct risks, and its use is effectively equivalent to treating none. Conversely, at extreme positive values the model severely overestimates the correct risks, and its use is effectively equivalent to treating all. A corresponding sketch is given below.
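Continuing the hypothetical sketch above (again, the function name shift_intercept is our own), the intercept perturbation can be written as a constant shift on the logit scale, which leaves the ranking of predicted risks, and hence the c-statistic, unchanged:

```python
def shift_intercept(p_true, delta):
    """Shift the intercept on the logit scale; the risk ranking (c-statistic) is unchanged."""
    shifted_logit = np.log(p_true / (1 - p_true)) + delta
    return 1 / (1 + np.exp(-shifted_logit))

# intercept differences used in the simulations
for delta in (-1.6, -0.8, -0.4, 0.4, 0.8, 1.6):
    p_miscal = shift_intercept(p_true, delta)
```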

Analysis
EVPIs were evaluated at three risk thresholds: 0.1 (lower than the outcome prevalence, so the model competes with treating all), 0.2 (equal to the prevalence, where the NBs of treating all and treating none are both zero), and 0.3 (higher than the prevalence, so the model competes with treating none).
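For reference, the quantities being compared are the standard decision-curve NB definitions (the notation $z$ for the risk threshold and $\pi$ for the outcome prevalence is ours):

$$\mathrm{NB}_{\text{model}}(z) = \frac{\mathrm{TP}(z) - \mathrm{FP}(z)\,\frac{z}{1-z}}{n}, \qquad \mathrm{NB}_{\text{all}}(z) = \pi - (1-\pi)\,\frac{z}{1-z}, \qquad \mathrm{NB}_{\text{none}}(z) = 0.$$

At $z = \pi$, $\mathrm{NB}_{\text{all}} = \pi - (1-\pi)\frac{\pi}{1-\pi} = 0$, which is why both default strategies have zero NB when the threshold equals the prevalence.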
EVPIs were produced for sample sizes of 250, 500, 1,000, and 2,000 using the Bayesian bootstrap, ordinary bootstrap, and asymptotic methods. The bootstrap-based methods used 1,000 bootstrap samples. Each reported value is the average over 10,000 independent simulations. Figure S1 shows the EVPI of the perturbed models as a function of model discrimination; Figure S2 rearranges the same results as a function of the calibration intercept.
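The following is a minimal sketch of how the Bayesian-bootstrap EVPI could be computed from a validation sample, under our own assumptions about implementation details (function names, the use of Dirichlet weights over observations, and the threshold argument z are ours; the published analysis may be implemented differently):

```python
import numpy as np

rng = np.random.default_rng(1)

def nb_components(y, p_model, z, w):
    """Weighted net benefit of the model, treat-all, and treat-none at threshold z."""
    w = w / w.sum()
    treat = p_model >= z
    tp = np.sum(w * (treat & (y == 1)))
    fp = np.sum(w * (treat & (y == 0)))
    nb_model = tp - fp * z / (1 - z)
    prev = np.sum(w * y)
    nb_all = prev - (1 - prev) * z / (1 - z)
    return np.array([nb_model, nb_all, 0.0])         # [model, treat all, treat none]

def evpi_bayesian_bootstrap(y, p_model, z, n_sim=1000):
    """EVPI = E[max NB] - max E[NB], with the expectation over Bayesian bootstrap draws."""
    n = len(y)
    draws = np.empty((n_sim, 3))
    for i in range(n_sim):
        w = rng.dirichlet(np.ones(n))                # Bayesian bootstrap weights
        draws[i] = nb_components(y, p_model, z, w)
    return draws.max(axis=1).mean() - draws.mean(axis=0).max()
```

The ordinary-bootstrap version would replace the Dirichlet weights with resampling of observations with replacement; the rest of the calculation is the same.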

Results
When evaluating EVPI as a function of sample size, EVPI changed in the expected direction in all scenarios (lower EVPI values with larger sample sizes).
The EVPI monotonically increased as model discrimination was degraded (Figure S1). This is expected because in all these simulations the model remains moderately calibrated (1), and it is known that as long as a model is calibrated, $\mathrm{NB}_{\text{model}} \geq \max\{\mathrm{NB}_{\text{all}}, 0\}$ (2). This can be interpreted as greater uncertainty about the superiority of a model with degraded performance (lower c-statistic): with degraded discrimination, the incremental NB of the model over the best default strategy gets closer to 0 but remains non-negative. With a smaller incremental NB, we become less confident about the superiority of the model, so the EVPI increases accordingly.

Figure S1: EVPI as a function of the SD of the prediction noise (x axis) and sample size (colored lines), for the three calculation methods (columns) and the three exemplary thresholds (rows).

On the other hand, the EVPI as a function of the calibration intercept of the model was non-monotonic and highly dependent on the risk threshold. Two different patterns are observed, depending on whether the threshold was equal to the outcome prevalence (middle row) or away from it (first and last rows).

EVPI as a function of model intercept when risk threshold is away from outcome prevalence (first and third rows)
When the model is calibrated, the EVPI is low because there is little uncertainty about the superiority of the model over the default strategies. On the other hand, at extremes of model miscalibration, the EVPI approaches zero. This is because the NB of the model is negative and the EVPI effectively becomes about the uncertainty between the two default decisions:

$$\mathrm{EVPI} = \mathrm{E}\left[\max\{\mathrm{NB}_{\text{all}}, \mathrm{NB}_{\text{none}}\}\right] - \max\{\mathrm{E}[\mathrm{NB}_{\text{all}}], \mathrm{E}[\mathrm{NB}_{\text{none}}]\}.$$

At thresholds away from the outcome prevalence, on the left treating all dominates treating none, and on the right treating none dominates treating all. In both instances, the distribution of $\mathrm{NB}_{\text{all}}$ (the difference in NB between the two default strategies, since $\mathrm{NB}_{\text{none}} = 0$) does not cover 0, and the two terms on the right-hand side of the above equation are equal, resulting in EVPI = 0. The behavior of the EVPI in between is non-monotonic and depends on the direction of miscalibration and on whether the risk threshold is above or below the outcome prevalence.
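To spell out why the EVPI vanishes in these extremes (in our notation, using $\mathrm{NB}_{\text{none}} = 0$ so that the expression reduces to a single random quantity):

$$\mathrm{EVPI} = \mathrm{E}\left[\max\{\mathrm{NB}_{\text{all}}, 0\}\right] - \max\{\mathrm{E}[\mathrm{NB}_{\text{all}}], 0\}.$$

If the distribution of $\mathrm{NB}_{\text{all}}$ lies entirely above 0, both terms equal $\mathrm{E}[\mathrm{NB}_{\text{all}}]$; if it lies entirely below 0, both terms equal 0. In either case the EVPI is 0; only when the distribution straddles 0 do the two terms differ.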

EVPI as a function of model intercept when risk threshold is equal to outcome prevalence
The EVPI here has its smallest value when the calibration intercept is 0. This is because at this threshold the calibrated model has the highest incremental NB, and there is little uncertainty about the superiority of the model. However, as calibration is degraded, the incremental NB gets close to 0 and the EVPI increases.
At higher levels of miscalibration, the NB of the model becomes negative. In such instances, there is little uncertainty about the inferiority of the model, and the EVPI effectively becomes about the uncertainty between the two default decisions:

$$\mathrm{EVPI} = \mathrm{E}\left[\max\{\mathrm{NB}_{\text{all}}, \mathrm{NB}_{\text{none}}\}\right] - \max\{\mathrm{E}[\mathrm{NB}_{\text{all}}], \mathrm{E}[\mathrm{NB}_{\text{none}}]\}.$$

But a crucial point is that the NB curves of the two default strategies cross at this risk threshold.
At this threshold, where $\mathrm{E}[\mathrm{NB}_{\text{all}}] = \mathrm{E}[\mathrm{NB}_{\text{none}}] = 0$, there is substantial uncertainty about which of the two default strategies truly confers the higher NB; thus the two terms on the right-hand side of the above equation are not equal, and the EVPI remains positive.

Figure S2: EVPI as a function of the calibration intercept (x axis) and sample size (colored lines), for the three calculation methods (columns) and the three exemplary thresholds (rows).