Abstract
The interpretation of psychometric test results is usually based on norm scores. We compared semiparametric continuous norming (SPCN) with conventional norming methods by simulating results for test scales with different item numbers and difficulties via an item response theory approach. We then modeled the norm scores from random samples of varying sizes, using either a conventional ranking procedure or SPCN. The norms were cross-validated against an entirely representative sample of N = 840,000, for which several measures of norming error were computed. This process was repeated 90,000 times. Both approaches benefitted from an increase in sample size, with SPCN reaching optimal results with much smaller samples. Conventional norming performed worse on data fit, age-related errors, and the number of missing values in the norm tables. With fixed subsample sizes, the data fit of conventional norming varied with the granularity of the age brackets, which calls into question general recommendations for sample sizes in test norming. We recommend basing test norms on statistical models of the raw score distributions instead of simply compiling norm tables via conventional ranking procedures.
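To make the conventional ranking procedure concrete, the following minimal sketch simulates raw scores and compiles a rank-based norm table for a single age bracket. It is an illustration under stated assumptions, not the code used in the study: the Rasch (1PL) response model, the standard normal ability distribution, the midpoint percentile rank, the T-score scale (mean 50, SD 10), and the function names `simulate_raw_scores` and `rank_based_norm_table` are all choices made here for the example.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def simulate_raw_scores(n_persons, item_difficulties, rng):
    """Simulate raw sum scores under a Rasch (1PL) model (assumed here)."""
    scores = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # latent ability, assumed standard normal
        score = 0
        for b in item_difficulties:
            p_correct = 1.0 / (1.0 + math.exp(-(theta - b)))
            if rng.random() < p_correct:
                score += 1
        scores.append(score)
    return scores


def rank_based_norm_table(raw_scores, mean=50.0, sd=10.0):
    """Map each observed raw score to a normalized score via percentile ranks.

    Uses the midpoint percentile rank (cases below plus half the ties),
    then the inverse normal CDF. Raw scores that never occur in the
    sample get no entry, i.e. they are missing from the norm table.
    """
    n = len(raw_scores)
    counts = Counter(raw_scores)
    standard_normal = NormalDist()
    table = {}
    below = 0
    for raw in sorted(counts):
        percentile = (below + 0.5 * counts[raw]) / n
        table[raw] = mean + sd * standard_normal.inv_cdf(percentile)
        below += counts[raw]
    return table


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical 20-item scale with evenly spaced difficulties.
    difficulties = [-2.0 + 0.2 * i for i in range(20)]
    sample = simulate_raw_scores(100, difficulties, rng)
    table = rank_based_norm_table(sample)
    missing = [raw for raw in range(len(difficulties) + 1) if raw not in table]
    print("norm table entries:", len(table))
    print("raw scores without a norm score:", missing)
```

With small samples, some raw scores typically do not occur at all and therefore have no norm score. A continuous model of the raw score distribution can fill such gaps because it does not depend on every score being observed in every age bracket.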