Abstract
When developing prediction models for use in clinical practice, health practitioners usually categorise clinical variables that are continuous in nature. Although categorisation is inadvisable from a statistical point of view, because it loses information and power, it remains common practice in medical research. Providing researchers with a useful and valid categorisation method is therefore relevant when developing prediction models. Without recommending the categorisation of continuous predictors, we aim to propose a valid way to do it whenever clinical researchers consider it necessary. This paper focuses on categorising a continuous predictor within a logistic regression model so as to obtain the best discriminative ability, measured as the highest area under the receiver operating characteristic curve (AUC). The proposed methodology is validated in settings where the location of the optimal cut points is known in theory or in practice. In addition, the method is applied to a real data set of patients with an exacerbation of chronic obstructive pulmonary disease, in the context of the IRYSS-COPD study, in which a clinical prediction rule for severe evolution was being developed. The clinical variable PCO2 was categorised in both a univariable and a multivariable setting.
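The idea of choosing cut points to maximise the AUC can be illustrated with a minimal Python sketch of the univariable case. It is not the authors' algorithm: the function names (`categorical_auc`, `best_cut_points`), the exhaustive grid search and the quantile grid of candidate cut points are all assumptions made for illustration. The sketch relies on the fact that a logistic regression with a single categorical predictor orders patients by the observed event rate of their category, so the AUC can be computed from per-category counts without fitting the model.

```python
from bisect import bisect_right
from itertools import combinations


def categorical_auc(x, y, cuts):
    """AUC of the categorised predictor, scoring each category by its event rate.

    In a logistic model whose only predictor is the categorised variable,
    the fitted probabilities are ordered like the per-category event rates,
    so this equals the AUC of that model (univariable case only).
    """
    k = len(cuts) + 1
    events = [0] * k
    nonevents = [0] * k
    for xi, yi in zip(x, y):
        c = bisect_right(cuts, xi)  # index of the interval containing xi
        if yi == 1:
            events[c] += 1
        else:
            nonevents[c] += 1
    # Empty categories get rate 0.0 and contribute nothing to the sums below.
    rates = [e / (e + m) if e + m else 0.0 for e, m in zip(events, nonevents)]
    concordant = 0.0
    for j in range(k):
        for l in range(k):
            if rates[j] > rates[l]:
                concordant += events[j] * nonevents[l]
            elif rates[j] == rates[l]:
                concordant += 0.5 * events[j] * nonevents[l]  # ties count half
    return concordant / (sum(events) * sum(nonevents))


def best_cut_points(x, y, n_cuts=2, grid_size=20):
    """Exhaustive search over a quantile grid (hypothetical search strategy)."""
    xs = sorted(x)
    candidates = sorted(
        {xs[(i * len(xs)) // (grid_size + 1)] for i in range(1, grid_size + 1)}
    )
    best = max(
        combinations(candidates, n_cuts),
        key=lambda cuts: categorical_auc(x, y, cuts),
    )
    return list(best), categorical_auc(x, y, best)


# Toy data (not from the paper): the event occurs only for 30 <= x < 70,
# so the optimal cut points are known in advance.
x = list(range(100))
y = [1 if 30 <= v < 70 else 0 for v in x]
cuts, auc = best_cut_points(x, y, n_cuts=2, grid_size=9)
print(cuts, auc)
```

On this toy data the search recovers the cut points 30 and 70, mirroring on a small scale the idea of validating a method where the optimal location is known. An exhaustive grid becomes expensive as the number of cut points grows, and the event-rate shortcut no longer applies in a multivariable setting, where the model would have to be refitted for each candidate.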
