Detecting Differential Item Functioning in Multidimensional Graded Response Models With Recursive Partitioning

Differential item functioning (DIF) is a common challenge when examining latent traits in large-scale surveys. In recent work, methods from the field of machine learning, such as model-based recursive partitioning, have been proposed to identify subgroups with DIF when little theoretical guidance is available and many potential subgroups exist. On this basis, we propose and compare recursive partitioning techniques for detecting DIF, with a focus on measurement models with multiple latent variables and ordinal response data. We implement tree-based approaches for identifying subgroups that contribute to DIF in multidimensional latent variable modeling and propose a robust yet scalable extension inspired by random forests. The proposed techniques are applied and compared in simulations. We show that the proposed methods efficiently detect DIF and allow the extraction of decision rules that lead to subgroups with well-fitting models.


S1 Additional Tables and Figures
Figure 1: PIEG model for three time points t and three items i. One latent state variable η_t is assumed for each time point. Item 1 serves as the reference item, so that β_2 and β_3 are the only latent item effect variables in the model.

For both types of response variables (numerical and ordinal) and all four sample sizes (250, 500, 750, 1000), we repeat the sampling process 1000 times to compile the final set of simulated data sets.
Next, the common CFA model for numerical data (with 33 parameters) is fitted to all data sets using the ML estimator, and the generalized M-fluctuation test is applied using one random numerical and one random categorical partitioning variable. We use all six test statistics that are offered in the semtree R package (Arnold et al., 2021).

S3.1 Methodology
Even for a small number of items with a small number of response categories, there is a multitude of unique possible response patterns for an individual respondent. A response pattern y_r indicates a sequence of category responses k_i, that is, y_r = (k_1, k_2, ..., k_m). For m items with l_i response categories each, there are ∏_{i=1}^m l_i different response patterns. A full information approach to estimating the parameters of the MGR model uses all the information contained in these response patterns (Forero & Maydeu-Olivares, 2009). The standard full information estimation method for MGR models is the marginal maximum likelihood (MML) method, which is usually computed via the expectation-maximization (EM) algorithm (Bock & Aitkin, 1981).
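The pattern count is simply the product of the per-item category numbers; a two-line Python sketch (the function name is ours):

```python
from math import prod

def n_response_patterns(categories):
    """Number of distinct response patterns for items with the given
    numbers of response categories: the product of the l_i."""
    return prod(categories)

# nine five-category items, as in the orthogonal PIEG model below
print(n_response_patterns([5] * 9))   # 1953125
```

Even this modest design already yields almost two million distinct patterns, which is why full information estimation works with pattern frequencies rather than enumerating patterns individually.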
In the MGR model, local independence is assumed, so that within a group of respondents with the same values for ξ, the distributions of the item responses are independent of each other (Samejima, 1997). Therefore, the ξ-conditional probability of answering in response pattern y_r is

P(Y = y_r | ξ) = ∏_{i=1}^m P(Y_i = k_i | ξ).

For a random subject sampled from a population with a continuous multivariate ability distribution g(ξ), the unconditional probability of answering in response pattern y_r is

P(Y = y_r) = ∫ P(Y = y_r | ξ) g(ξ) dξ,

where the integral over ξ is a p-dimensional multiple integral. The EM algorithm estimates the probability P(Y = y_r) at every iteration through numerical approximation of the p-dimensional integral. A disadvantage of this approach is the considerable amount of computing power required. The computational burden increases exponentially with the number of latent variable dimensions (Forero & Maydeu-Olivares, 2009).
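The exponential growth can be illustrated with a tensor-product Gauss-Hermite grid, a standard numerical approximation for such integrals (a Python sketch; the function name and node count are ours, not the paper's implementation):

```python
import numpy as np

def tensor_grid(n_nodes, p):
    """Build a p-dimensional Gauss-Hermite tensor grid whose weights
    integrate a standard multivariate normal density to 1."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    nodes_1d = np.sqrt(2.0) * x          # change of variables for N(0, 1)
    weights_1d = w / np.sqrt(np.pi)      # normalize: sum of weights = 1
    grids = np.meshgrid(*([nodes_1d] * p), indexing="ij")
    nodes = np.column_stack([g.ravel() for g in grids])
    wgrids = np.meshgrid(*([weights_1d] * p), indexing="ij")
    weights = np.prod(np.column_stack([g.ravel() for g in wgrids]), axis=1)
    return nodes, weights

# the grid size grows exponentially in the number of dimensions p
for p in (1, 2, 3, 4):
    nodes, weights = tensor_grid(15, p)
    print(p, len(weights))                    # 15, 225, 3375, 50625
    assert abs(weights.sum() - 1.0) < 1e-8    # integrates N(0, I) to 1
```

With 15 quadrature nodes per dimension, a five-dimensional latent space (as in the orthogonal PIEG model) already requires 15^5 = 759,375 evaluation points per response pattern and iteration.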
The MML method is used to find the estimates of the item parameters in ϑ (see Equation 2) that maximize the probability for all respondents to answer in their respective response patterns. Maximizing the log-likelihood is equivalent to minimizing the objective function F_MML(ϑ) through the EM algorithm. Let n be the sample size and p_r = n_r/n the relative frequency of occurrence of response pattern y_r. In a sense, the objective function represents the difference between the relative frequency p_r of a certain response pattern y_r and the unconditional probability of answering in that response pattern, that is

F_MML(ϑ) = ∑_r p_r ln[ p_r / P(Y = y_r) ].

The minimization algorithm generates successive parameter estimates ϑ^(1), ϑ^(2), ..., such that

F_MML(ϑ^(1)) ≥ F_MML(ϑ^(2)) ≥ ... .

At each iteration of the minimization algorithm, the gradient of the objective function is used as the search direction (gradient descent approach). This way, the next set of parameter estimates can be chosen so that the objective function F_MML(ϑ) decreases (Jöreskog & Moustaki, 2006).
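A generic sketch of such a gradient-descent scheme on a KL-type objective (the softmax parameterization and the toy frequencies are ours for illustration, not the MGR likelihood itself):

```python
import numpy as np

def gradient_descent(objective, grad, theta0, lr=0.1, n_iter=100):
    """Generic descent producing a decreasing sequence F(theta^(1)) >= F(theta^(2)) >= ..."""
    theta = np.asarray(theta0, dtype=float)
    history = [objective(theta)]
    for _ in range(n_iter):
        theta = theta - lr * grad(theta)
        history.append(objective(theta))
    return theta, history

# toy example: fit probabilities to observed relative frequencies by
# minimizing the discrepancy sum_r p_r * ln(p_r / q_r(theta)),
# with q = softmax(theta) so the fitted probabilities stay valid
p = np.array([0.5, 0.3, 0.2])                    # observed pattern frequencies
def q(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()
def objective(theta):
    return float(np.sum(p * np.log(p / q(theta))))
def grad(theta):
    return q(theta) - p                          # gradient of KL w.r.t. softmax logits

theta, hist = gradient_descent(objective, grad, np.zeros(3), lr=0.5, n_iter=200)
assert all(a >= b - 1e-12 for a, b in zip(hist, hist[1:]))  # objective decreases
assert np.allclose(q(theta), p, atol=1e-3)       # fitted frequencies match observed
```

The assertions verify exactly the two properties claimed in the text: the objective decreases monotonically along the iterates, and at the minimum the fitted probabilities reproduce the observed pattern frequencies.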
When the objective function F_MML is used, the overall model fit can be tested using the test statistic 2n F_MML(ϑ̂), which asymptotically follows a χ² distribution with degrees of freedom equal to the number of different response patterns minus one minus the number of independent elements of ϑ (Jöreskog & Moustaki, 2006). It can then be used to test the model against the associated saturated model, in which all possible parameters are freely estimated. This way, a test statistic for global model fit is obtained. Schneider et al. (2021) show that it is possible to perform the generalized M-fluctuation test for several multidimensional polytomous IRT models that are fitted using MML estimation. We can thus apply partykit to MGR models while using MML estimation for growing the decision tree. We call this approach GRM Tree. The steps performed by GRM Tree are shown in Algorithm 5.
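The bookkeeping for this global fit test can be made concrete as follows (a sketch; the Wilson-Hilferty approximation to the χ² tail is used only to keep it dependency-free, and the numbers in the example are hypothetical):

```python
import math

def mml_global_fit_test(F_min, n, n_patterns, n_params):
    """Global fit test for the MML objective: the statistic 2*n*F_MML is
    asymptotically chi-square with df = (#response patterns) - 1 - (#free
    parameters). The p-value uses the Wilson-Hilferty normal approximation
    to the chi-square tail distribution."""
    stat = 2.0 * n * F_min
    df = n_patterns - 1 - n_params
    z = ((stat / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return stat, df, p_value

# hypothetical numbers: F_min = 0.03 at n = 1000, 100 observed patterns,
# 51 free parameters (as in the orthogonal PIEG model)
stat, df, p = mml_global_fit_test(0.03, 1000, 100, 51)
print(stat, df, round(p, 3))
```

A non-significant p-value indicates that the restricted model is not rejected against the saturated multinomial model.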
Algorithm 5: partykit for MGR models using MML estimation (GRM Tree)

Full information estimation becomes problematic when the latent variables are correlated (Forero & Maydeu-Olivares, 2009). This is the case for the original measurement model defined in Section 3.1. In order to apply GRM Tree to the PIEG model, it must be redefined as an MGR model with orthogonal latent variables. We thus define an orthogonal PIEG model and fix the covariances of the latent variables at 0 and the variances at 1. Discrimination parameters are freely estimated.
As in Section 3.1, we consider three reference latent state variables η_t and two latent item effect variables β_i. The latent variables are derived from three items at three time points, resulting in nine five-category ordinal response variables Y_it. The cumulative category response function of the orthogonal PIEG model is

P(Y_it ≥ k | η_t, β_i) = Φ(λ_it η_t + λ^(β)_it β_i − τ_itk),

where the item effect term is absent for the reference item (i = 1). In this model, there are 36 free threshold parameters (4 for every five-category item) and 15 free discrimination parameters, resulting in 51 free parameters in total. The model structure is shown in Figure 2. Restrictions on the tuning parameters (such as the p-value threshold and minimum terminal node size) were imposed in order to reduce the computational burden in the application of GRM Tree.
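Assuming the usual normal-ogive graded response step (the exact link used in the paper's implementation may differ), the category probabilities of one item are adjacent differences of the cumulative curves; a minimal sketch with illustrative parameter values:

```python
import math

def grm_category_probs(eta, beta, lam_eta, lam_beta, thresholds):
    """Category probabilities of one five-category graded response item,
    computed as adjacent differences of the cumulative response curves
    P(Y >= k) = Phi(lam_eta*eta + lam_beta*beta - tau_k).
    Parameter names and values are illustrative, not taken from the paper."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    cum = [1.0] + [Phi(lam_eta * eta + lam_beta * beta - t) for t in thresholds]
    cum.append(0.0)
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

probs = grm_category_probs(eta=0.0, beta=0.0, lam_eta=1.0, lam_beta=0.5,
                           thresholds=[-1.5, -0.5, 0.5, 1.5])
assert abs(sum(probs) - 1.0) < 1e-12 and all(p >= 0 for p in probs)
```

The four thresholds per item yield five category probabilities that sum to one, matching the 4-thresholds-per-item count stated above.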
Simulation Results. The results of the GRM Tree application are shown in Figure 3. It becomes apparent that the algorithm does not retrieve the simulated subgroups correctly. The variable cat3 is wrongly chosen as the partitioning variable at the tree's inner nodes 2 and 7. This indicates that the results of the generalized M-fluctuation test may be less accurate when all threshold parameters are part of the score function (see Equation 10).
An additional disadvantage of GRM Tree, compared to partykit or score-based SEMTree for MGR models, is its immense computational cost. The computation of GRM Tree for the simulated sample described above took 450 minutes (7.5 hours) on a single processor core with 170 GB RAM. Considering that many limitations were imposed on this particular simulation to keep computation time low, we conclude that GRM Tree is computationally impractical.
for b = 1 to B do
    Grow recursive partitioning tree using partykit or semtree for MGR with random draws from partitioning variables;
    save decision rules and model fit indices for terminal nodes;
end
Select exclusive subgroups with model fit indices that do not exceed cutoff;

S2 Performance of the Generalized M-fluctuation Test with Ordinal Data

We generate multiple samples to test the performance of the generalized M-fluctuation test with numerical and ordinal data. partykit and semtree use the results of the generalized M-fluctuation test to decide if the sample should be split into groups. The results of the test also guide the selection of the partitioning variable Z_r*. semtree even uses the test statistic associated with the M-fluctuation test to determine the split point. The M-fluctuation test, in turn, draws on the scores of the fitted model. It is thus crucial for partykit and for semtree that the M-fluctuation test detects parameter stability correctly, even if parameter estimates derived from ordinal data are based on model assumptions of a common CFA model for metric items. We simulate samples with 250, 500, 750, and 1000 observations with numerical response variables for which a model holds that has the same structure as the outlined PIEG model. Furthermore, corresponding samples with ordinal response variables are simulated for which the PIEG model is true. The parameters are stable for all simulated observations, i.e., there is no DIF. All samples are based on the same input parameters for latent variable variances, latent variable covariances, and mean structure. For the ordinal data set, 36 input threshold parameters are created instead of input intercepts.
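The forest-style loop above can be sketched as follows (Python; the tree grower is stubbed out, and all helper names are ours rather than the actual semtree/partykit API):

```python
import random

def grow_tree(data, partitioning_vars, fit_fn):
    """Stand-in for one partykit/semtree run: returns (decision_rule, fit_index).
    In the real procedure this call fits an MGR model per node and splits
    on M-fluctuation tests; here the tree is stubbed out for illustration."""
    rule = tuple(sorted(partitioning_vars))
    return rule, fit_fn(data, rule)

def forest_subgroups(data, all_vars, fit_fn, B=50, n_draw=2, cutoff=0.08, seed=1):
    """Forest-style extension: grow B trees on random draws of the
    partitioning variables, pool the terminal-node decision rules, and keep
    the subgroups whose model fit index does not exceed the cutoff."""
    rng = random.Random(seed)
    pooled = {}
    for _ in range(B):
        draw = rng.sample(all_vars, n_draw)       # random covariate subset
        rule, fit = grow_tree(data, draw, fit_fn)
        pooled.setdefault(rule, []).append(fit)   # collect rules across trees
    return {r: sum(f) / len(f) for r, f in pooled.items()
            if sum(f) / len(f) <= cutoff}         # apply the fit cutoff

# toy fit function: rules involving the DIF variable 'cat3' fit well
def toy_fit(data, rule):
    return 0.05 if "cat3" in rule else 0.12

kept = forest_subgroups(None, ["num1", "cat1", "cat2", "cat3"], toy_fit)
assert len(kept) >= 1 and all("cat3" in rule for rule in kept)
```

The random draws of partitioning variables decorrelate the individual trees, in the spirit of random forests, while the fit cutoff retains only rules that lead to well-fitting subgroup models.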

Figure 3: Results of the application of GRM Tree to simulated data.

Table 2: Input threshold parameters for simulation 1 (R_1 to R_4) and simulation 2 (R_1 and R_2).

Naive semtree for MGR models
Initialization: Assign data to root node
Parameters: minimum sample size in terminal node, p-value threshold
Estimate model parameters in θ for the sample in the current node using the ML estimator (template model);
else select partitioning variable with lowest p-value in LR test;
split node into two subnodes at optimal split point;
for each node of current tree do
    continue partitioning process;
end
end
for each terminal node do
    re-fit models using WLS estimator;
end

Algorithm 2: partykit for MGR models
Initialization: Assign data to root node
Parameters: minimum sample size in terminal node, p-value threshold
Estimate model parameters in θ for the current node using ML estimation;
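The node-splitting loop shared by these listings can be sketched as a recursive skeleton (Python; the p-value stub compares subgroup means and is not a real LR or M-fluctuation test, it only makes the sketch executable):

```python
def grow(rows, covariate_names, min_size=10, alpha=0.05, rule=""):
    """Skeleton of the tree-growing loop: at each node, test every binary
    partitioning variable, split on the one with the lowest (stub) p-value,
    and recurse; otherwise create a terminal node."""
    def p_stub(name):
        a = [r["y"] for r in rows if r[name] == 0]
        b = [r["y"] for r in rows if r[name] == 1]
        if min(len(a), len(b)) < min_size:
            return 1.0                     # enforce minimum node size
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        return max(0.0, 1.0 - diff)        # smaller "p" when means differ

    name, p = min(((n, p_stub(n)) for n in covariate_names), key=lambda t: t[1])
    if p >= alpha:
        return [{"rule": rule or "root", "n": len(rows)}]   # terminal node
    left = [r for r in rows if r[name] == 0]
    right = [r for r in rows if r[name] == 1]
    return (grow(left, covariate_names, min_size, alpha, rule + f"{name}=0 ") +
            grow(right, covariate_names, min_size, alpha, rule + f"{name}=1 "))

# toy data: covariate z1 carries a large mean shift in y, z2 carries none
rows = [{"y": 0.0 + 2.0 * (i % 2), "z1": i % 2, "z2": (i // 2) % 2}
        for i in range(100)]
leaves = grow(rows, ["z1", "z2"])
assert sorted(l["rule"].strip() for l in leaves) == ["z1=0", "z1=1"]
```

The skeleton splits once on z1 (the variable carrying the shift) and then stops, since neither the p-value criterion nor the minimum node size permits further splits; in the real algorithms, the terminal-node models would then be re-fitted with the WLS estimator.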

Select split point that maximizes the score-based test statistic;
This includes three test statistics for the numerical covariate and three test statistics for the categorical covariate (see Merkle & Zeileis, 2013; Merkle et al., 2014). With this setup, we can determine how the generalized M-fluctuation test performs when the assumptions of a common CFA model are tested with data that follow an MGR model, and which test statistics are least susceptible to this type of misspecification.

Results. The simulation results are shown in Table 3. We calculated the percentage across all simulated data sets for which the generalized M-fluctuation test is significant (p-value below 0.05). Because the parameters in the simulated samples are stable, we denote this number as the dropout rate. Notably, none of the generalized M-fluctuation tests for models fitted to simulated numerical response variables performed considerably better than the tests for models fitted to simulated ordinal response variables. Even for small sample sizes, there is no test statistic that yields larger dropout rates for simulated ordinal responses. With numeric covariates, the CvM test statistic yields very high dropout rates of around 50% for both ordinal and numeric response variables. However, this is due to the fact that critical values of the CvM statistic are not provided for models with more than 25 parameters. The best test statistics for multivariate latent variable models with a considerable number of parameters (33 in our simulation) seem to be the maxLM and DM statistics for numerical covariates (see Merkle & Zeileis, 2013) and the LM statistic for categorical covariates (see Merkle et al., 2014).
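The dropout rate is simply the empirical rejection rate across replications; a minimal sketch (function name ours), which under stable parameters should stay near the nominal 5% level:

```python
import numpy as np

def dropout_rate(p_values, alpha=0.05):
    """Proportion of simulated data sets with a significant M-fluctuation
    test; under stable parameters (no DIF) this estimates the empirical
    type-I error rate, called the dropout rate here."""
    p = np.asarray(p_values)
    return float(np.mean(p < alpha))

# under the null hypothesis, p-values are approximately uniform, so the
# dropout rate over 1000 replications should be close to alpha = 0.05
rng = np.random.default_rng(0)
rate = dropout_rate(rng.uniform(size=1000))
assert abs(rate - 0.05) < 0.03
```

A well-calibrated test statistic is one whose dropout rate stays near this nominal level; statistics with inflated rates (such as CvM here) over-reject even though no DIF is present.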

Table 3: Results of the simulation. The proportion of p-values of the generalized M-fluctuation test across simulated data sets that are smaller than 0.05 is shown. The column label 'numerical' indicates that numerical response variables were simulated; the label 'ordinal' indicates that ordinal response variables were simulated.

Initialization: Assign data to root node
Parameters: minimum sample size in terminal node, p-value threshold
Estimate model parameters in ϑ for the current node using MML estimation;
Assess item parameter instability through generalized M-fluctuation test with respect to each covariate Z_1, ..., Z_R;

Figure 2: PIEG model with orthogonal latent variables for three time points t and three items i. One latent state variable η_t is assumed for each time point. Item 1 serves as the reference item, so that β_2 and β_3 are the only latent item effect variables in the model. All latent variable variances are fixed at 1.

To test GRM Tree, we simulate a sample with a subgroup structure similar to that of simulation 1 in Section 3.2. The only difference with respect to the subgroup structure is that the numeric partitioning variable num1 is replaced by the categorical partitioning variable cat3. The structure of the entire simulated sample