Abstract
Psychological science relies on behavioral measures to assess cognitive processing; however, the field has not yet developed a tradition of routinely examining the reliability of these behavioral measures. Reliable measures are essential for drawing robust inferences from statistical analyses, and subpar reliability has severe implications for measures’ validity and interpretation. Without examining and reporting the reliability of the measurements used in an analysis, it is nearly impossible to ascertain whether results are robust or have arisen largely from measurement error. In this article, we propose that researchers adopt a standard practice of estimating and reporting the reliability of behavioral assessments of cognitive processing. We illustrate the need for this practice using an example from experimental psychopathology, the dot-probe task, although we argue that reporting reliability is relevant across fields (e.g., social cognition and cognitive psychology). We explore several implications of low measurement reliability and show how failing to assess reliability undermines the interpretability and comparability of results and, in turn, research quality. We argue that researchers in the field of cognition need to report measurement reliability as routine practice so that more reliable assessment tools can be developed. To provide some guidance on estimating and reporting reliability, we describe the use of bootstrapped split-half estimation and intraclass correlation coefficients to estimate internal consistency and test-retest reliability, respectively. For future researchers to build upon current results, it is imperative that all researchers provide psychometric information sufficient for estimating the accuracy of inferences and informing further development of cognitive-behavioral assessments.
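The bootstrapped split-half approach recommended above can be sketched as follows: repeatedly split each participant's trials at random into two halves, correlate the resulting half-test scores across participants, apply the Spearman-Brown correction to each split, and average. This is a minimal Python illustration on simulated data, not the authors' splithalf R package; the function name and the simulation parameters (60 participants, 80 trials, equal signal and noise variance) are illustrative assumptions.

```python
import numpy as np

def bootstrapped_split_half(trials, n_splits=2000, rng=None):
    """Mean Spearman-Brown-corrected random split-half reliability.

    trials: 2-D array (participants x trials) of per-trial scores
            (e.g., trial-level bias scores from a dot-probe task).
    """
    rng = rng or np.random.default_rng()
    n_trials = trials.shape[1]
    half = n_trials // 2
    estimates = np.empty(n_splits)
    for i in range(n_splits):
        order = rng.permutation(n_trials)           # random split of trials
        a = trials[:, order[:half]].mean(axis=1)    # first-half scores
        b = trials[:, order[half:]].mean(axis=1)    # second-half scores
        r = np.corrcoef(a, b)[0, 1]                 # correlate across participants
        estimates[i] = 2 * r / (1 + r)              # Spearman-Brown correction
    return estimates.mean()

# Simulated data: stable individual differences (signal) plus
# independent trial-level noise of equal variance.
rng = np.random.default_rng(1)
true_scores = rng.normal(0, 1, size=(60, 1))
data = true_scores + rng.normal(0, 1, size=(60, 80))
rel = bootstrapped_split_half(data, rng=rng)
```

Averaging over many random splits avoids the arbitrariness of a single odd/even split; with substantial trial-level noise relative to true individual differences, the estimate drops accordingly, which is the pattern reported for tasks such as the dot-probe.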


