Efficiency Analysis of Item Response Theory Kernel Equating for Mixed-Format Tests

This study aims to evaluate the performance of Item Response Theory (IRT) kernel equating in the context of mixed-format tests by comparing it to IRT observed score equating and kernel equating with log-linear presmoothing. Comparisons were made through both simulations and real data applications, under both equivalent groups (EG) and non-equivalent groups with anchor test (NEAT) sampling designs. To prevent bias towards IRT methods, data were simulated both with and without the use of IRT models. The results suggest that the difference between IRT kernel equating and IRT observed score equating is minimal, both in terms of the equated scores and their standard errors. The application of IRT models for presmoothing yielded smaller standard errors of equating than the log-linear presmoothing approach. When test data were generated using IRT models, IRT-based methods proved less biased than log-linear kernel equating; when data were simulated without IRT models, log-linear kernel equating showed less bias. Overall, IRT kernel equating shows great promise for equating mixed-format tests.


IRT simulations when populations differ
Assume a simulation scenario (S1) in which F_X(x) and F_Y(y) are known and the abilities of test-taker populations P and Q are equal. In this scenario, the true equating transformation is ϕ(x) = F_Y^{-1}(F_X(x)). We aim to simulate another scenario (S2) in which a difference in abilities between P and Q is introduced without altering the difference in test form difficulty. Assuming that the anchor score distributions F_AP(a) and F_AQ(a) are known, we want to find a new score distribution for form Y, F_Y2(y), such that ϕ(x) in S2 is the same as in S1. Assuming the chained equating (CE) transformation F_Y2^{-1}(F_AQ(F_AP^{-1}(F_X(x)))) is the true one and setting it equal to ϕ(x), we get

F_Y2(y) = F_AQ(F_AP^{-1}(F_Y(y))),

which can be used to generate Y scores corresponding to the same difficulty difference between forms X and Y as when using F_X(x) and F_Y(y) and equal populations.
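The derivation above can be checked numerically. The sketch below assumes, purely for illustration, that all four score distributions are normal with invented means and standard deviations; it builds F_Y2 from the derived relation and verifies that the chained transformation in S2 reproduces ϕ(x) from S1:

```python
import numpy as np
from math import erf, sqrt

def ncdf(x, mu, sd):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0))))

grid = np.linspace(-30.0, 130.0, 2001)            # discrete score grid

def cdf_on_grid(mu, sd):
    return np.array([ncdf(g, mu, sd) for g in grid])

def inv(cdf, p):
    """Inverse CDF by linear interpolation on the grid."""
    return np.interp(p, cdf, grid)

# Hypothetical score distributions (illustrative values only).
FX  = cdf_on_grid(50.0, 10.0)   # form X in population P
FY  = cdf_on_grid(47.0, 10.0)   # form Y under equal populations (S1)
FAP = cdf_on_grid(20.0, 5.0)    # anchor A in population P
FAQ = cdf_on_grid(22.0, 5.0)    # anchor A in population Q (Q more able)

# Derived distribution for form Y in S2: F_Y2(y) = F_AQ(F_AP^{-1}(F_Y(y)))
FY2 = np.interp(inv(FAP, FY), grid, FAQ)

# True transformation in S1: phi(x) = F_Y^{-1}(F_X(x))
phi_s1 = inv(FY, FX)
# Chained (CE) transformation in S2: F_Y2^{-1}(F_AQ(F_AP^{-1}(F_X(x))))
phi_s2 = inv(FY2, np.interp(inv(FAP, FX), grid, FAQ))
```

Away from the extreme tails, where the discretised CDFs are flat, phi_s1 and phi_s2 agree, confirming that generating Y scores from F_Y2 preserves the S1 equating transformation.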

C True equating transformation, IRT simulations
Algorithm C1 True equating transformation for IRT simulated data
1: Generate a sequence of ability values, Θ_P, evenly distributed over the most plausible range of abilities for population P.
2: For each ability value, obtain the score probabilities for each item using the GPC model with the true IRT parameters from the scenario of interest.
3: For each ability value and the probabilities from the previous step, calculate the sum score probabilities P_P(X = x|θ) for each possible x using the recursive algorithm described in Thissen et al. (1995).
4: Let f_θP(θ) be the ability density function for population P. Approximate the integral P_P(X = x) = ∫ P_P(X = x|θ) f_θP(θ) dθ by summing over all ability values in Θ_P, weighting each conditional probability from the previous step with the relative likelihood of each θ.
5: Repeat steps 1 to 4 for population Q.
6: Obtain the sum score probabilities for the target population, T = 0.5P + 0.5Q, using a weighted average of the results from each population: P_T(X = x) = 0.5 P_P(X = x) + 0.5 P_Q(X = x).
7: Repeat steps 2 to 6 for test form Y.
8: Perform equipercentile equating using the probabilities obtained from the previous steps. The equate R package was used for this step.
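Steps 2–4 of the algorithm can be sketched in code. The following is a Python translation (the paper's implementation used R); the GPC parameterisation shown and any item parameters passed in are illustrative assumptions, and the recursion is the standard convolution over item score distributions:

```python
import numpy as np

def gpc_probs(theta, a, b):
    """Category probabilities for one GPC item at ability theta.
    a: discrimination; b: array of step parameters (len = n_categories - 1)."""
    # Cumulative sums of a*(theta - b_k); category 0 contributes 0.
    z = np.concatenate(([0.0], np.cumsum(a * (theta - b))))
    ez = np.exp(z - z.max())          # stabilised softmax
    return ez / ez.sum()

def sum_score_probs(theta, items):
    """Recursive algorithm for P(X = x | theta): convolve the score
    distribution with each item's category probabilities in turn."""
    dist = np.array([1.0])            # P(score = 0) before any item
    for a, b in items:
        p = gpc_probs(theta, a, b)
        new = np.zeros(len(dist) + len(p) - 1)
        for k, pk in enumerate(p):    # item contributes k points with prob pk
            new[k:k + len(dist)] += pk * dist
        dist = new
    return dist

def marginal_score_probs(items, thetas, weights):
    """Approximate P(X = x) = ∫ P(X = x | θ) f(θ) dθ by a weighted sum
    over an evenly spaced ability grid."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                   # normalise the relative likelihoods
    return sum(wi * sum_score_probs(t, items) for wi, t in zip(w, thetas))
```

For dichotomous items the GPC reduces to the two-parameter logistic model, so the same recursion covers both the multiple-choice and constructed-response parts of a mixed-format test.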

D Algorithm for item weight selection
Algorithm D1 Selecting realistic item probability weights for simulations
1: Compute the probability of responding correctly to each SweSAT item used to fit the smoothing spline density function for the test form of interest (the ratio between the number of correct responses and the total number of responses for each item). The goal is to find weights that make the simulated item score probabilities mimic these probabilities.
2: Generate a large number of total test scores using the fitted spline density function. We used 100,000. A larger number results in more accurate performance estimates of the weights.
3: Initially set the weight for each item to 0.5 and set the initial value of the step parameter S, used later in the algorithm to adjust the weights, to 0.3. A larger item weight implies that an item is easier and more likely to be scored as correct.
4: Generate item responses using the weights from step 3 for all total scores from step 2. Items with correct responses were sampled using the sample function in R, to which the weights can be supplied through the prob argument.
5: Calculate the probability of getting each item correct in the large dataset of 100,000 test takers generated in step 4. The ratio between the number of correct responses and the total number of responses for each item was used for the calculations.
6: For each item, compare the probability of getting the item correct (step 1) with the item-correct probability obtained using the current weights (previous step). If the probability from using the current weight is too low, increase the weight by the current value of S; if it is too high, decrease the weight by S.
7: Decrease the value of S by 25%.
8: Repeat steps 4-7 until S < 0.01.
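A compact rendering of Algorithm D1 is sketched below. The paper's implementation used R's sample function with the prob argument; this Python sketch substitutes NumPy's Generator.choice, and the function name, default step values, and test inputs are illustrative assumptions:

```python
import numpy as np

def fit_item_weights(target_probs, total_scores, step=0.3, tol=0.01, seed=1):
    """Iteratively adjust per-item weights until the simulated item-correct
    probabilities mimic the observed target probabilities (Algorithm D1)."""
    rng = np.random.default_rng(seed)
    n_items = len(target_probs)
    weights = np.full(n_items, 0.5)          # step 3: initial weights
    while step >= tol:                       # step 8: stop when S < tol
        # Step 4: for each total score, sample which items are correct,
        # with selection probability proportional to the item weights.
        correct_counts = np.zeros(n_items)
        for score in total_scores:
            chosen = rng.choice(n_items, size=score, replace=False,
                                p=weights / weights.sum())
            correct_counts[chosen] += 1
        # Step 5: simulated item-correct probabilities.
        simulated = correct_counts / len(total_scores)
        # Step 6: move each weight by +/- step towards the target.
        weights += np.where(simulated < target_probs, step, -step)
        weights = np.clip(weights, 0.01, None)   # keep weights positive
        # Step 7: shrink the step by 25%.
        step *= 0.75
    return weights
```

The geometric shrinkage of S means the loop always terminates, and the final weights differ from their converged values by at most the sum of the remaining step sizes.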

E Monte Carlo standard errors
In this section, we derive the formulas used to calculate Monte Carlo SEs for AAB, WAB, ASEE and WSEE. We use the notation Bias_x and SEE_x to denote the estimated bias and SEE for a total score of x on the test form we equate from. From Table 6, the weighted measures are WAB = Σ_i w_i |Bias_i| and WSEE = Σ_i w_i SEE_i, where w_i is the weight given to item i. For AAB and ASEE, we use the above formulas with w_i = 1/81 for all items.
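As a complement to the analytic derivation, the sketch below shows one standard way to compute WAB and a Monte Carlo SE for it from replication-level equated scores, using a bootstrap over replications rather than the closed-form expressions; all inputs, the function name, and the bootstrap settings are assumptions for illustration:

```python
import numpy as np

def wab_and_mcse(equated, true_eq, weights, n_boot=500, seed=0):
    """Weighted absolute bias (WAB) and a Monte Carlo SE for it,
    estimated by bootstrapping over the R replications.
    equated: (R, n_scores) equated scores from R Monte Carlo replications;
    true_eq: (n_scores,) true equated scores; weights: (n_scores,), sum to 1."""
    rng = np.random.default_rng(seed)

    def wab(sample):
        bias = sample.mean(axis=0) - true_eq   # per-score-point bias
        return np.sum(weights * np.abs(bias))  # weighted absolute bias

    point = wab(equated)
    R = equated.shape[0]
    # Resample whole replications with replacement and recompute WAB.
    boots = [wab(equated[rng.integers(0, R, R)]) for _ in range(n_boot)]
    return point, np.std(boots, ddof=1)
```

Setting all weights to 1/81 recovers AAB, and replacing the bias summand with the per-score-point SEE gives the analogous computation for WSEE and ASEE.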

Figure F1: Bias plots from the non-IRT simulations. The x-axis shows the test score of form X. X ≠ Y indicates whether or not the test forms differed in difficulty, and P ≠ Q indicates whether or not the test takers taking each form differed in ability.