What Affects the Quality of Score Transformations? Potential Issues in True-Score Equating Using the Partial Credit Model

This simulation study investigated to what extent departures from construct similarity, as well as differences in the difficulty and targeting of scales, impact the score transformation when scales are equated by means of concurrent calibration using the partial credit model with a common-person design. Practical implications of the simulation results are discussed with a focus on scale equating in health-related research settings. The study simulated data for two scales, varying the number of items and the sample sizes. The factor correlation between scales was used to operationalize construct similarity. Targeting of the scales was operationalized through increasing departures from equal difficulty and by varying the dispersion of the item and person parameters in each scale. The results show that low similarity between scales goes along with lower transformation precision. At equal levels of similarity, precision improves in settings where the range of the item parameters encompasses the range of the person parameters. With decreasing similarity, score transformation precision benefits more from good targeting. Difficulty shifts of up to two logits somewhat increased the estimation bias but did not affect the transformation precision. The observed robustness against difficulty shifts supports the advantage of applying a true-score equating method over identity equating, which was used as a naive baseline method for comparison. Finally, larger sample sizes did not improve the transformation precision in this study, and longer scales improved the quality of the equating only marginally. The insights from the simulation study are applied in a real-data example.

LID often occurs when items are redundant and measure approximately the same or very similar aspects of a latent construct, e.g., the SF-36 items "walking 100 yards", "walking half a mile", and "walking more than a mile" in Horton and Tennant (2011). Typically, the correlations of the standardized Rasch residuals, or $Q_3$ values, are used to detect LID (Yen 1984). High positive correlations indicate LID. Negative residual correlations can reveal multidimensionality of the test form. LID leads to inflated reliability estimates (Baghaei 2008). Marais (2013) recommends evaluating LID relative to the average residual correlation, as the magnitude of the residual correlations depends on the number of items. Christensen, Makransky, and Horton (2017) formalized this, suggesting that if the largest $Q_3$ value is more than 0.2 above the average, i.e., if $Q_{3,*} = Q_{3,\max} - \bar{Q}_{3} > 0.2$, this indicates a departure from local independence.
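As an illustrative sketch (not the code used in this study), the $Q_3$ check can be carried out with the R package mirt along the following lines; the response matrix `resp` and the object name `pcm` are hypothetical placeholders.

```r
## Illustrative sketch: Yen's Q3 statistics for local item dependence,
## assuming polytomous responses in the hypothetical matrix `resp`.
library(mirt)

pcm <- mirt(resp, model = 1, itemtype = "Rasch", verbose = FALSE)  # partial credit model

q3  <- residuals(pcm, type = "Q3")   # matrix of residual correlations
off <- q3[lower.tri(q3)]             # off-diagonal Q3 values

## Christensen, Makransky, and Horton (2017): flag local dependence when the
## largest Q3 exceeds the average Q3 by more than 0.2.
(max(off) - mean(off)) > 0.2
```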
For a valid raw score, a test form should measure only one latent construct. With a test form assessing data over several dimensions, the calculation of one unique total score is no longer valid. Principal component analysis (PCA) of the standardized Rasch residuals tests unidimensionality by searching for non-random patterns in the analysis residuals (E. V. Smith Jr. 2002). If unidimensionality holds, the residuals are free of any salient component loading structure. Unidimensionality can be tested by means of pairwise t-tests that compare the PP estimates ($\hat\theta$) from separate but anchored Rasch analyses of the items that loaded positively ($PC^{+}$) or negatively ($PC^{-}$) on the first component of the PCA of a common calibration. The anchoring of the separate analyses uses the IP estimates of a common calibration of the items to be tested for unidimensionality. The t-test involves each individual pair of $\hat\theta$ estimates, here $\hat\theta_{PC^{+}}$ and $\hat\theta_{PC^{-}}$, and their respective measurement errors ($SE_{\hat\theta_{PC^{+}}}$ and $SE_{\hat\theta_{PC^{-}}}$), using the following formula:

$$t = \frac{\hat\theta_{PC^{+}} - \hat\theta_{PC^{-}}}{\sqrt{SE_{\hat\theta_{PC^{+}}}^{2} + SE_{\hat\theta_{PC^{-}}}^{2}}}$$

The percentage of significant t-tests should not exceed 5% (Andrich and Marais 2019). This simulation intentionally varies the degree of similarity between test forms but expects unidimensionality within each test form; the t-tests and $Q_3$ values are expected to reflect this.
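Given anchored person estimates and standard errors for the $PC^{+}$ and $PC^{-}$ item sets, the pairwise t-test can be sketched as follows; the helper name and its inputs are hypothetical, not the authors' code.

```r
## Pairwise t-tests comparing anchored person estimates from the positively
## (PC+) and negatively (PC-) loading item subsets; unidimensionality is
## retained when at most about 5% of the tests are significant.
unidim_ttests <- function(theta_pos, se_pos, theta_neg, se_neg, alpha = 0.05) {
  t_stat <- (theta_pos - theta_neg) / sqrt(se_pos^2 + se_neg^2)
  signif <- abs(t_stat) > qnorm(1 - alpha / 2)  # normal approximation to the reference distribution
  list(t = t_stat, pct_significant = 100 * mean(signif))
}
```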
When reporting psychometric analyses, the reliability coefficient is often considered a critical statistic to support the quality of an assessment scale (Colton et al. 1997). In practice, when deciding to include an assessment instrument in a study or a survey, the information about its reliability is central. In modern test theory, reliability uses the derived PP ($\hat\theta$) to describe how well an assessment instrument is expected to differentiate between levels of ability among test takers. The reliability estimate relies entirely on the distribution of the $\hat\theta$ estimates, specifically their variance $\sigma^{2}_{\hat\theta}$, and the mean of the squared measurement errors ($MSE_{\hat\theta}$). The R package mirt uses an empirical estimate of the reliability, formalized as

$$\hat\rho^{2}_{\theta} = \frac{\sigma^{2}_{\hat\theta}}{\sigma^{2}_{\hat\theta} + MSE_{\hat\theta}}.$$

This reliability coefficient indicates how much the measurement process has reduced the uncertainty of the PP estimates (Mislevy et al. 1992). The empirical reliability coefficient is an alternative form of the Person Separation Reliability (PSR) found in Rasch software (Linacre 2015; Andrich, Sheridan, and Luo 2010) and R packages (Mueller 2020; Mair, Hatzinger, and Maier 2021). Typically, a PSR of 0.8 or above is interpreted as indicating good reliability. Excellent reliability can be expected with PSR values above 0.9 (Tennant and Conaghan 2007). The simulation indirectly affects the test forms' reliability by varying the dispersion of the person estimates.
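A minimal sketch of this computation, reusing the hypothetical mirt fit `pcm` from the earlier snippet, is given below; mirt also ships an `empirical_rxx()` helper that computes an empirical reliability from the same `fscores()` output.

```r
## Empirical reliability from the person estimates and their standard errors,
## following the formula above; `pcm` is the hypothetical fit from before.
fs  <- fscores(pcm, method = "WLE", full.scores.SE = TRUE)   # columns F1 and SE_F1
rho <- var(fs[, "F1"]) / (var(fs[, "F1"]) + mean(fs[, "SE_F1"]^2))
rho   # values of 0.8 or above are usually read as good, above 0.9 as excellent
```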
The PP of this study were estimated with the weighted likelihood estimation (WLE) approach of Warm (1989). WLE reduces bias to a considerable degree by decreasing the weight of inconsistent items, i.e., items with large residuals (Schuster and Yuan 2011), and it further provides finite estimates for extreme scores (Warm 1989). For Rasch models, such as the PCM (Masters 1982), all test takers with the same raw score ($R$) have the same PP estimate ($\hat\theta$), independently of the response pattern.
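To illustrate the last point, a short check on the hypothetical objects from the previous snippets confirms that the WLE estimates collapse to a single value per raw score, which is what makes a raw-score-to-$\hat\theta$ conversion table, and hence true-score equating, possible.

```r
## Within each raw score, the spread of the WLE estimates should be numerically
## zero, so the score-to-theta table is well defined; `resp` and `fs` are the
## hypothetical objects from the snippets above.
raw <- rowSums(resp)                                   # raw scores R
tapply(fs[, "F1"], raw, function(x) diff(range(x)))    # ~0 for every raw score

score_to_theta <- aggregate(list(theta = fs[, "F1"]), list(R = raw), mean)
```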
Finally, a complete description of the measurement assumptions tested in an analysis with a Rasch model would also mention the analysis of differential item functioning (DIF), which examines whether each item is free of subgroup effects, e.g., gender or age effects. No DIF analysis was undertaken in this study, as all items were simulated free of DIF.

Table 2: Mean (SD) of the factor correlation, mean (SD) of the difficulty shift, and mean (SD) of the difficulty dispersion in simulation settings with a person parameter dispersion of $\sigma_\theta = 1$, for the two-dimensional PCM analyses, aggregated over the 500 replications. [Table body not shown; column groups: Simulation Input (Factor, IP Distribution) and Simulation Result (Factor Correlation).]

Table 3: Mean (SD) of the factor correlation, mean (SD) of the difficulty shift, and mean (SD) of the difficulty dispersion in simulation settings with a person parameter dispersion of $\sigma_\theta = 2$, for the two-dimensional PCM analyses, aggregated over the 500 replications.

Table 4: Relative Bias when I = 10 and N = 500

Table 5: Relative Bias when I = 20 and N = 500

Table 6: Relative Bias when I = 10 and N = 1000

Table 7: Relative Bias when I = 20 and N = 1000

Table 8: Score Transformation Precision when I = 10 and N = 500

Table 9: Score Transformation Precision when I = 20 and N = 500

Table 10: Score Transformation Precision when I = 10 and N = 1000