Despite its well-known weaknesses, researchers continue to choose the kappa coefficient (Cohen, 1960, Educational and Psychological Measurement 20: 37–46; Fleiss, 1971, Psychological Bulletin 76: 378–382) to quantify agreement among raters. Part of kappa's persistent popularity seems to stem from the scarcity of alternative agreement coefficients in statistical software packages such as Stata. In this article, I review Gwet's (2014, Handbook of Inter-Rater Reliability) recently developed framework of interrater agreement coefficients. This framework extends several agreement coefficients to handle any number of raters, any number of rating categories, any level of measurement, and missing values. I introduce the kappaetc command, which implements this framework in Stata.
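
As a minimal sketch of what such an analysis might look like (the rater variables rater1, rater2, and rater3 are hypothetical; kappaetc can be installed from the Statistical Software Components archive):

    . ssc install kappaetc                    // install the command from the SSC archive
    . kappaetc rater1 rater2 rater3           // chance-corrected agreement among three raters

By default, the command reports several agreement coefficients from this framework side by side, together with standard errors and confidence intervals; see help kappaetc for the exact syntax and for options such as agreement weights for ordinal or interval ratings.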

Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall.
Bennett, E. M., Alpert, R., and Goldstein, A. C. 1954. Communications through limited-response questioning. Public Opinion Quarterly 18: 303–308.
Bland, J. M., and Altman, D. G. 1986. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 327: 307–310.
Brennan, R. L., and Prediger, D. J. 1981. Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement 41: 687–699.
Byrt, T., Bishop, J., and Carlin, J. B. 1993. Bias, prevalence and kappa. Journal of Clinical Epidemiology 46: 423–429.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37–46.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70: 213–220.
Conger, A. J. 1980. Integration and generalization of kappas for multiple raters. Psychological Bulletin 88: 322–328.
Cox, N. J. 2016. entropyetc: Stata module for entropy and related measures for categories. Statistical Software Components S458272, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s458272.html.
Feinstein, A. R., and Cicchetti, D. V. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43: 543–549.
Feng, G. C. 2013. Factors affecting intercoder reliability: A Monte Carlo experiment. Quality and Quantity 47: 2959–2982.
Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76: 378–382.
Fleiss, J. L., Cohen, J., and Everitt, B. S. 1969. Large sample standard errors for kappa and weighted kappa. Psychological Bulletin 72: 323–327.
Fleiss, J. L., Levin, B., and Paik, M. C. 2003. Statistical Methods for Rates and Proportions. 3rd ed. Hoboken, NJ: Wiley.
Gwet, K. L. 2008a. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61: 29–48.
Gwet, K. L. 2008b. Variance estimation of nominal-scale inter-rater reliability with random selection of raters. Psychometrika 73: 407–430.
Gwet, K. L. 2014. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. 4th ed. Gaithersburg, MD: Advanced Analytics.
Gwet, K. L. 2015. Standard error of Krippendorff's alpha coefficient. K. Gwet's Inter-Rater Reliability Blog. http://inter-rater-reliability.blogspot.de/2015/08/standard-error-of-krippendorffs-alpha.html.
Gwet, K. L. 2016. Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement 76: 609–637.
Harrison, D. 2004. kaputil: Stata module to generate confidence intervals and sample size calculations for the kappa-statistic. Statistical Software Components S446501, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s446501.html.
Hayes, A. F., and Krippendorff, K. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1: 77–89.
Klein, D. 2014. kalpha: Stata module to compute Krippendorff's alpha-reliability. Statistical Software Components S457862, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457862.html.
Krippendorff, K. 1970. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement 30: 61–70.
Krippendorff, K. 2011. Computing Krippendorff's alpha-reliability. https://repository.upenn.edu/asc_papers/43/.
Krippendorff, K. 2013. Content Analysis: An Introduction to Its Methodology. 3rd ed. Thousand Oaks, CA: Sage.
Landis, J. R., and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics 33: 159–174.
Lazaro, J., Zamora, J., Abraira, V., and Zlotnik, A. 2013. kappa2: Stata module to produce generalizations of weighted kappa for incomplete designs. Statistical Software Components S457739, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457739.html.
Mitnik, P. 2016. kanom: Stata module to estimate Krippendorff's alpha for nominal variables. Statistical Software Components S458277, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s458277.html.
Mitnik, P., and Cumberworth, E. 2016. Measuring social class with changing occupational classifications: Reliability, competing measurement strategies, and the 1970–1980 U.S. classification divide. Working Paper, Stanford Center on Poverty and Inequality. https://web.stanford.edu/~pmitnik/Mitnik_Cumberworth_2016.pdf.
Reichenheim, M. E. 2004. Confidence intervals for the kappa statistic. Stata Journal 4: 421–428.
Scott, W. A. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19: 321–325.
Staudt, A., and Krewel, M. 2013. krippalpha: Stata module to compute Krippendorff's alpha intercoder reliability coefficient. Statistical Software Components S457750, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457750.html.
Warrens, M. J. 2012. Some paradoxical results for the quadratically weighted kappa. Psychometrika 77: 315–323.
Warrens, M. J. 2014. Power weighted versions of Bennett, Alpert, and Goldstein's S. Journal of Mathematics 2014: 231909.
Warrens, M. J., and Pratiwi, B. C. 2016. Kappa coefficients for circular classifications. Journal of Classification 33: 507–522.
Wongpakaran, N., Wongpakaran, T., Wedding, D., and Gwet, K. L. 2013. A comparison of Cohen's kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples. BMC Medical Research Methodology 13: 61.
Zapf, A., Castell, S., Morawietz, L., and Karch, A. 2016. Measuring inter-rater reliability for nominal data—Which coefficients and confidence intervals are appropriate? BMC Medical Research Methodology 16: 93.