Abstract
Although the C-statistic is widely used for evaluating the performance of diagnostic tests, its limitations for evaluating the predictive performance of biomarker panels have been widely discussed. The increment in C obtained by adding a new biomarker to a predictive model has no direct interpretation, and the relevance of the C-statistic to risk stratification is not obvious. This paper proposes that the C-statistic should be replaced by the expected information for discriminating between cases and non-cases (expected weight of evidence, denoted as Λ), and that the strength of evidence favouring one model over another should be evaluated by cross-validation as the difference in test log-likelihoods. Contributions of independent variables to predictive performance are additive on the scale of Λ. Where the effective number of independent predictors is large, the value of Λ is sufficient to characterize fully how the predictor will stratify risk in a population with given prior probability of disease, and the C-statistic can be interpreted as a mapping of Λ to the interval from 0.5 to 1. Even where this asymptotic relationship does not hold, there is a one-to-one mapping between the distributions in cases and non-cases of the weight of evidence favouring case over non-case status, and the quantiles of these distributions can be used to calculate how the predictor will stratify risk. This proposed approach to reporting predictive performance is demonstrated by analysis of a dataset on the contribution of microbiome profile to diagnosis of colorectal cancer.