Abstract
Test scores are commonly reported in a small number of ordered categories. Examples of such reporting include state accountability testing, Advanced Placement tests, and English proficiency tests. This article introduces and evaluates methods for estimating achievement gaps on a familiar standard-deviation-unit metric using data from these ordered categories alone. These methods hold two practical advantages over alternative achievement gap metrics. First, they require only categorical proficiency data, which are often available where means and standard deviations are not. Second, they result in gap estimates that are invariant to score scale transformations, providing a stronger basis for achievement gap comparisons over time and across jurisdictions. The authors find three candidate estimation methods that recover full-distribution gap estimates well when only censored data are available.
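The abstract does not spell out the three candidate methods, so the following is only a minimal sketch of one way a standard-deviation-unit gap could be recovered from ordered category counts alone. It assumes both groups' scores are normal after some shared monotone transformation, probit-transforms each group's cumulative proportions at the cut scores, and fits a least-squares line through the resulting points. The function name `gap_from_categories`, its arguments, and the least-squares fit are illustrative assumptions, not the authors' procedure.

```python
from math import sqrt
from statistics import NormalDist


def gap_from_categories(counts_a, counts_b):
    """Sketch: estimate (mean_b - mean_a) in pooled-SD units.

    counts_a, counts_b: counts (or proportions) in the same ordered
    categories, lowest category first. Assumes at least three categories
    and cumulative proportions strictly between 0 and 1 at every cut.
    """
    std_normal = NormalDist()

    def probit_cumulative(counts):
        total = sum(counts)
        running, z_values = 0.0, []
        for count in counts[:-1]:  # the final cumulative proportion is 1
            running += count
            z_values.append(std_normal.inv_cdf(running / total))
        return z_values

    z_a = probit_cumulative(counts_a)
    z_b = probit_cumulative(counts_b)

    # Under the normality assumption, z_a = intercept + slope * z_b with
    # intercept = (mean_b - mean_a) / sd_a and slope = sd_b / sd_a.
    n = len(z_a)
    mean_a, mean_b = sum(z_a) / n, sum(z_b) / n
    s_xy = sum((x - mean_b) * (y - mean_a) for x, y in zip(z_b, z_a))
    s_xx = sum((x - mean_b) ** 2 for x in z_b)
    slope = s_xy / s_xx
    intercept = mean_a - slope * mean_b

    # Rescale from sd_a units to pooled-SD units.
    return intercept / sqrt((1 + slope ** 2) / 2)


# Hypothetical counts in four proficiency categories for two groups.
print(gap_from_categories([300, 350, 250, 100], [180, 320, 320, 180]))
```

The cut scores themselves never enter the calculation; only the proportion of each group below each cut does. That is why an estimate of this kind is unchanged by any order-preserving transformation of the score scale.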
