Abstract
Despite its well-known weaknesses, researchers continue to choose the kappa coefficient (Cohen, 1960, Educational and Psychological Measurement 20: 37–46; Fleiss, 1971, Psychological Bulletin 76: 378–382) to quantify agreement among raters. Part of kappa's persistent popularity seems to arise from the scarcity of alternative agreement coefficients in statistical software packages such as Stata. In this article, I review Gwet's (2014, Handbook of Inter-Rater Reliability) recently developed framework of interrater agreement coefficients. This framework generalizes several agreement coefficients to handle any number of raters, any number of rating categories, any level of measurement, and missing values. I introduce the kappaetc command, which implements this framework in Stata.
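To make the setup concrete, here is a minimal usage sketch of the command. The variable names rater1 through rater4 are hypothetical placeholders for one rating variable per rater, with one observation per rated subject; the wgt() option in the last call is assumed here to select one of the weighting schemes from Gwet's framework for ordinal ratings.

* install the command from the Statistical Software Components archive
. ssc install kappaetc

* unweighted agreement coefficients for four raters; missing ratings
* are accommodated rather than dropped listwise
. kappaetc rater1 rater2 rater3 rater4

* quadratic disagreement weights for ordinal rating categories (assumed syntax)
. kappaetc rater1 rater2 rater3 rater4, wgt(quadratic)

Because one column holds the ratings of one rater, subjects rated by only a subset of the raters simply carry missing values in the remaining columns, in line with the framework's handling of missing values described above.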
References
Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall.
Bennett, E. M., Alpert, R., and Goldstein, A. C. 1954. Communications through limited-response questioning. Public Opinion Quarterly 18: 303–308.
Bland, J. M., and Altman, D. G. 1986. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 327: 307–310.
Brennan, R. L., and Prediger, D. J. 1981. Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement 41: 687–699.
Byrt, T., Bishop, J., and Carlin, J. B. 1993. Bias, prevalence and kappa. Journal of Clinical Epidemiology 46: 423–429.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37–46.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70: 213–220.
Conger, A. J. 1980. Integration and generalization of kappas for multiple raters. Psychological Bulletin 88: 322–328.
Cox, N. J. 2016. entropyetc: Stata module for entropy and related measures for categories. Statistical Software Components S458272, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s458272.html.
Feinstein, A. R., and Cicchetti, D. V. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43: 543–549.
Feng, G. C. 2013. Factors affecting intercoder reliability: A Monte Carlo experiment. Quality and Quantity 47: 2959–2982.
Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76: 378–382.
Fleiss, J. L., Cohen, J., and Everitt, B. S. 1969. Large sample standard errors for kappa and weighted kappa. Psychological Bulletin 72: 323–327.
Fleiss, J. L., Levin, B., and Paik, M. C. 2003. Statistical Methods for Rates and Proportions. 3rd ed. Hoboken, NJ: Wiley.
Gwet, K. L. 2008a. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61: 29–48.
Gwet, K. L. 2008b. Variance estimation of nominal-scale inter-rater reliability with random selection of raters. Psychometrika 73: 407–430.
Gwet, K. L. 2014. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. 4th ed. Gaithersburg, MD: Advanced Analytics.
Gwet, K. L. 2015. Standard error of Krippendorff's alpha coefficient. K. Gwet's Inter-Rater Reliability Blog. http://inter-rater-reliability.blogspot.de/2015/08/standard-error-of-krippendorffs-alpha.html.
Gwet, K. L. 2016. Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement 76: 609–637.
Harrison, D. 2004. kaputil: Stata module to generate confidence intervals and sample size calculations for the kappa-statistic. Statistical Software Components S446501, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s446501.html.
Hayes, A. F., and Krippendorff, K. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1: 77–89.
Klein, D. 2014. kalpha: Stata module to compute Krippendorff's alpha-reliability. Statistical Software Components S457862, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457862.html.
Krippendorff, K. 1970. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement 30: 61–70.
Krippendorff, K. 2011. Computing Krippendorff's alpha-reliability. https://repository.upenn.edu/asc_papers/43/.
Krippendorff, K. 2013. Content Analysis: An Introduction to Its Methodology. 3rd ed. Thousand Oaks, CA: Sage.
Landis, J. R., and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics 33: 159–174.
Lazaro, J., Zamora, J., Abraira, V., and Zlotnik, A. 2013. kappa2: Stata module to produce generalizations of weighted kappa for incomplete designs. Statistical Software Components S457739, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457739.html.
Mitnik, P. 2016. kanom: Stata module to estimate Krippendorff's alpha for nominal variables. Statistical Software Components S458277, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s458277.html.
Mitnik, P., and Cumberworth, E. 2016. Measuring social class with changing occupational classifications: Reliability, competing measurement strategies, and the 1970–1980 U.S. classification divide. Working Paper, Stanford Center on Poverty and Inequality. https://web.stanford.edu/~pmitnik/Mitnik_Cumberworth_2016.pdf.
Reichenheim, M. E. 2004. Confidence intervals for the kappa statistic. Stata Journal 4: 421–428.
Scott, W. A. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19: 321–325.
Staudt, A., and Krewel, M. 2013. krippalpha: Stata module to compute Krippendorff's alpha intercoder reliability coefficient. Statistical Software Components S457750, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457750.html.
Warrens, M. J. 2012. Some paradoxical results for the quadratically weighted kappa. Psychometrika 77: 315–323.
Warrens, M. J. 2014. Power weighted versions of Bennett, Alpert, and Goldstein's S. Journal of Mathematics 2014: 231909.
Warrens, M. J., and Pratiwi, B. C. 2016. Kappa coefficients for circular classifications. Journal of Classification 33: 507–522.
Wongpakaran, N., Wongpakaran, T., Wedding, D., and Gwet, K. L. 2013. A comparison of Cohen's kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples. BMC Medical Research Methodology 13: 61.
Zapf, A., Castell, S., Morawietz, L., and Karch, A. 2016. Measuring inter-rater reliability for nominal data—Which coefficients and confidence intervals are appropriate? BMC Medical Research Methodology 16: 93.
