Model Misspecification and Robustness of Observed-Score Test Equating Using Propensity Scores

This study explores the usefulness of covariates for equating test scores from nonequivalent test groups. The covariates are captured by an estimated propensity score, which is used as a proxy for latent ability to balance the test groups. The objective is to assess the sensitivity of the equated scores to various misspecifications of the propensity score model. The study assumes a parametric form of the propensity score and evaluates the effects of various misspecification scenarios on equating error. The results, based on both simulated and real testing data, show that (1) omitting an important covariate leads to biased estimates of the equated scores, (2) misspecifying a nonlinear relationship between the covariates and test scores increases the equating standard error in the tails of the score distributions, and (3) the equating estimators are robust against omitting a second-order term as well as using an incorrect link function in the propensity score estimation model. The findings demonstrate that auxiliary information is beneficial for test score equating in complex settings. However, they also shed light on the challenge of making fair comparisons between nonequivalent test groups in the absence of common items. The study identifies scenarios in which equating performance is acceptable and scenarios in which it is problematic, provides practical guidelines, and identifies areas for further investigation.


Introduction
Test score equating is a crucial statistical tool that enables the comparison of test scores from different test forms and ensures fairness in assessments (González & Wiberg, 2017). When equating scores from nonequivalent test groups, it is essential to account for differences in both the ability levels of the test groups and the difficulty of the test forms. To make the scores comparable, differences in ability and difficulty must be separated, so that the scores are adjusted only for differences in difficulty. For this purpose, testing programs generally rely on either an assumption of common test-takers or the use of common items. The former assumes that the test groups to be equated are random samples from the same underlying population, whereas the latter views the groups as samples from different populations. In the latter case, a subset of common items is used to adjust for the differences in ability between the test groups. These common items are often referred to as anchor items, and the corresponding data collection design is known as the Nonequivalent Groups With Anchor Test (NEAT) design (von Davier et al., 2004b). However, not all testing programs have common items available, yet they still need to adjust for ability imbalances. Examples of such tests are the Invalsi test (Invalsi, 2013), the Armed Services Vocational Aptitude Battery (Quenette et al., 2006), and, until recently, the Swedish Scholastic Aptitude Test (SweSAT; Stage & Ögren, 2004). If the ability imbalances are ignored, the equated test scores will be biased, which can have severe consequences in high-stakes testing scenarios.
One way of applying a nonequivalent groups design without anchor items is to use background information about the test-takers in the form of measured covariates (Wiberg & Bränberg, 2015). There are several ways that covariates can be utilized within equating. Kolen (1990), Cook et al. (1990), and Wright and Dorans (1993) used covariates to balance the test groups before equating the test forms, Liou et al. (2001) applied covariates in a similar fashion to anchor items, Bränberg and Wiberg (2011) incorporated covariates in linear equating, and Hsu et al. (2002) used covariates within item response theory (IRT) true-score equating. However, as the covariate vector grows, controlling for the covariates quickly becomes very difficult. For example, conditioning on four categorical covariates, each with four categories, yields 256 possible outcomes. The matrix of all possible combinations of test scores and covariate realizations would therefore have a large number of empty cells. To overcome this problem, the test-takers can instead be compared on their propensity score, which is a scalar function of the covariates. Livingston et al. (1990) were the first to propose the use of covariates within a propensity score for equating. More recently,  explored the use of two anchor tests within a propensity score, Powers (2010) applied chained equating (CE), frequency estimation, IRT true-score, and observed-score equating using propensity scores, Haberman (2015) used propensity scores to create pseudoequivalent groups from nonequivalent groups, and Longford (2015) used them as a tool for matching before equating. Wallin and Wiberg (2019) were the first to propose propensity scores for both a poststratification equating (PSE) and a CE estimator within the kernel equating framework (von Davier et al., 2004b). Their results showed that a level of precision and accuracy similar to that of the NEAT design could be achieved.
However, their results were based on the assumption that the propensity score was known. Since this will never be the case in any real testing situation, it is of great importance to assess the sensitivity of the results to violations of this assumption. Thus, the aim is to study the functional form of the propensity score through which the covariates are conditioned and to investigate how sensitive the equated scores are to model misspecification of the estimated propensity score, using both real and simulated data.
Propensity score model misspecification has previously been studied within the field of causal inference. Drake (1993) showed that a substantial bias was introduced when estimating the average treatment effect if a confounding covariate was omitted from the propensity score estimation model. Dehejia and Wahba (1999) had similar findings but also noted that causal estimates were not sensitive to the specification of the functional form of the propensity score, once all important covariates had been included. This has been shown in more recent studies as well: Waernbaum (2010, 2012) showed that the average treatment effect can be unbiasedly estimated using propensity scores even when, for example, the link function is misspecified or higher order terms of the covariates are left out. There were furthermore situations with no efficiency loss, and one of the key components to obtain such results was that the true propensity score was a function of the misspecified model. There are currently no existing studies on propensity score model misspecification in the equating context. This is critical to examine since the equating results are often used for decision-making on an individual level (e.g., admission decisions to universities) and for educational policy making. The current study therefore investigates the sensitivity of the equating function to model misspecification of the propensity score. Assuming a parametric model for the propensity score, three misspecifications are considered, inspired by the studies of Waernbaum (2010, 2012): (1) misspecifying the link function, (2) excluding an important (true confounder) covariate, and (3) excluding a higher order moment of a confounding covariate. Each misspecification will be evaluated in terms of the precision and accuracy of the equating function to determine how critical it is.
The structure of this article is as follows. The kernel equating framework is introduced in Section 2, followed by an introduction to propensity scores in Section 3. Section 4 includes an empirical illustration, and Section 5 presents a simulation study. This article is concluded with a discussion of the results together with some practical guidelines.

Kernel Equating
We denote the new test form by $\mathsf{X}$ and the old test form by $\mathsf{Y}$, and their respective scores by $X$ and $Y$. The realizations of $X$ and $Y$ are denoted $x_j$, $j = 1, \ldots, J$, and $y_k$, $k = 1, \ldots, K$. The test-takers receiving test form $\mathsf{X}$ are viewed as a random sample from population $P$, and the test-takers receiving test form $\mathsf{Y}$ as a random sample from population $Q$. With randomly sampled groups, the score variables $X$ and $Y$ are considered random variables with sample spaces $\mathcal{X}$ and $\mathcal{Y}$. An equating function thus maps the test scores from $\mathcal{X}$ to $\mathcal{Y}$. However, not every such function qualifies as an equating function. See Kolen and Brennan (2014) for a list of requirements.
Consider the random variable $\xi = F(X)$, which is well known to follow a uniform distribution on the interval $[0, 1]$, given that $F$ is a continuous and strictly increasing cumulative distribution function (CDF). It is consequently true that $V = G^{-1}(\xi)$ exactly follows the distribution given by $G$, as long as $G$ has a properly defined inverse. The equipercentile function (Braun & Holland, 1982) is undoubtedly the most common equating function and uses this simple relationship between distributions of continuous random variables. With $G_T$ and $F_T$ denoting the CDFs of $Y$ and $X$ on the target population $T$ for the equating parameter, the equipercentile equating function is defined as

$\varphi(x) = G_T^{-1}(F_T(x))$. (1)

The equipercentile function thus matches all of the moments of $Y$ by matching the scores from $X$ and $Y$ that are at the same quantile of their respective distributions, that is,

$F_T(x) = u = G_T(y)$. (2)

However, since most test scores are discrete, their CDFs are not continuous but step functions. Hence, for any value $u \in (0, 1)$, it is rarely the case that there are two scores $x$ and $y$ that satisfy Equation 2. All test score equating methods that utilize the equipercentile function in (1) therefore need to resolve this issue.
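As a minimal illustration of the equipercentile function in (1) for continuous distributions, the sketch below composes a CDF with an inverse CDF. The two normal score distributions and their parameters are assumptions made for this example only and are not taken from the study.

```python
from statistics import NormalDist

# Hypothetical continuous score distributions on the target population T.
F_T = NormalDist(mu=50.0, sigma=10.0)  # scores on the new form X
G_T = NormalDist(mu=55.0, sigma=8.0)   # scores on the old form Y

def equipercentile(x: float) -> float:
    """Map an X score to the Y scale via G_T^{-1}(F_T(x))."""
    return G_T.inv_cdf(F_T.cdf(x))

def linear(x: float) -> float:
    """Linear equating; for two normal distributions it coincides with the map above."""
    return G_T.mean + G_T.stdev / F_T.stdev * (x - F_T.mean)

for x in (30.0, 50.0, 65.0):
    print(x, round(equipercentile(x), 4), round(linear(x), 4))
```

With discrete scores the two CDFs are step functions and `inv_cdf` has no direct counterpart, which is the issue that continuization is meant to resolve.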
Since kernel equating (Holland & Thayer, 1989; von Davier et al., 2004b) generalizes many of the most common and modern equating approaches, we present our theory in terms of this framework, although the proposed method is applicable to, for example, traditional equipercentile and linear equating as well. This framework consists of five steps: (1) fitting a regression model (typically a log-linear model) to the empirical score distributions, (2) estimating the test score probabilities on the target population based on the estimated model in Step 1 and given the data collection design, (3) making continuous approximations to the estimated discrete score distributions from Step 2, (4) equating the test scores using the equipercentile function, and (5) evaluating the estimated equating function (González & Wiberg, 2017; von Davier et al., 2004b). From Equation 1, it is clear that in order to estimate $\varphi(\cdot)$, we need estimators of $F_T$ and $G_T$. Kernel equating first uses the maximum likelihood estimates of the test score probabilities $r_j = P(X = x_j)$ and $s_k = P(Y = y_k)$ and then makes continuous approximations of these distributions using kernel functions. It is thus a semiparametric method of estimating the equating function $\varphi(\cdot)$. For this purpose, we define the joint distributions of $(X, A)$ and $(Y, A)$, where $A$ denotes a proxy variable for the latent ability that the test is constructed to measure. Typically, $A$ represents an anchor test score, but as will be presented in the next section, we will instead consider a set of covariates that are gathered in a propensity score. Let $\mathbf{P} = \{p_{jl}\}_{J \times L}$, where $p_{jl} = \Pr(X = x_j, A = a_l \mid P)$, $j = 1, \ldots, J$ and $l = 1, \ldots, L$. Letting $\mathbf{p}_l = (p_{1l}, \ldots, p_{Jl})^T$, we vectorize the matrix $\mathbf{P}$, such that the vectors $\mathbf{p}_l$, $l = 1, \ldots, L$, are stacked onto each other. We denote this by $v(\mathbf{P})$. For details, see von Davier et al. (2004b).
It is common practice to fit a log-linear model to the data to reduce sampling variance, so we will assume that $v(\mathbf{P})$ can be described by a log-linear model with $R$ free parameters:

$\log v(\mathbf{P}) = \alpha + \mathbf{u} + \mathbf{B}^T \boldsymbol{\beta}$, (3)

where $\alpha$ is a normalizing constant, $\mathbf{u}$ is a known constant of length $J$ that specifies the null model when $\boldsymbol{\beta} = \mathbf{0}$, $\mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_J)$ is a matrix of dimension $R \times J$ of known constants, and $\boldsymbol{\beta}$ is an $R$-dimensional vector of unknown parameters. An equivalent model assumption is made for $v(\mathbf{Q})$, where $\mathbf{Q} = \{q_{kl}\}_{K \times L}$ with $q_{kl} = \Pr(Y = y_k, A = a_l \mid Q)$, $k = 1, \ldots, K$ and $l = 1, \ldots, L$. The model parameters in (3) are estimated through maximum likelihood. The next step is to estimate the score probabilities $\mathbf{r} = (r_1, \ldots, r_J)^T$ and $\mathbf{s} = (s_1, \ldots, s_K)^T$, where $\mathbf{r}$ and $\mathbf{s}$ are functions of $v(\mathbf{P})$ and $v(\mathbf{Q})$, respectively, and of a design function that depends on the choice of data collection design and equating estimator. We will save the introduction of necessary assumptions for Section 3, where two propensity score-based estimators are presented. For now, we assume that legitimate estimators of $\mathbf{r}$ and $\mathbf{s}$ are available. We also choose to present the required quantities only for the $X$ scores since the expressions for the $Y$ scores are given by corresponding formulas.
Let the mean and variance of $X$ be denoted by $(\mu_X, \sigma_X^2)$, and let $V$ denote a continuous random variable, such that $E(V) = 0$ and $V(V) = \sigma_V^2$. Lastly, let

$a_X = \sqrt{\dfrac{\sigma_X^2}{\sigma_X^2 + \sigma_V^2 h_X^2}}$,

where $h_X > 0$ denotes a smoothing parameter, from here on referred to as the bandwidth. With the introduced notation, we define a new random variable $\tilde{X}$ by constructing a linear combination of $X$, $h_X$, and $V$. This will serve as a continuous version of $X$:

$\tilde{X} = a_X (X + h_X V) + (1 - a_X) \mu_X$.

Wallin and Wiberg
The random variable $\tilde{X}$ is defined such that $E(\tilde{X}) = E(X) = \mu_X$ and $V(\tilde{X}) = V(X) = \sigma_X^2$. To define the CDF of $\tilde{X}$, let $K_V(\cdot)$ denote the kernel function of $V$. It is then straightforward to show that the CDF of $\tilde{X}$ is equal to

$F_{\tilde{X}}(x) = \sum_{j=1}^{J} r_j K_V\left(\dfrac{x - a_X x_j - (1 - a_X)\mu_X}{a_X h_X}\right)$. (4)

In most studies of kernel equating, the function $K_V(\cdot)$ is set to the standard normal CDF $\Phi(\cdot)$, although other choices have been suggested (Lee & von Davier, 2011). In the empirical illustration of this study, the Gaussian kernel function is used, although the proposed estimators can be used together with any other proper kernel function. Since $F_{\tilde{X}}$ is a function of the quantities $\mu_X$, $\sigma_X$, and $a_X$, which in turn are all functions of $\mathbf{r}$, every component needed to estimate the continuized score CDFs, except for the bandwidth $h_X$, is available after estimating $\mathbf{r}$.
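The continuization step can be sketched in a few lines. The example below evaluates a Gaussian-kernel continuized CDF for a small, hypothetical score distribution; the function names, the score probabilities, and the bandwidth value are assumptions chosen for illustration, and the standard normal kernel implies a kernel variance of 1.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal kernel, so the kernel variance equals 1

def moments(scores, probs):
    """Mean and variance of a discrete score distribution."""
    mu = sum(x * r for x, r in zip(scores, probs))
    var = sum(r * (x - mu) ** 2 for x, r in zip(scores, probs))
    return mu, var

def continuized_cdf(x, scores, probs, h):
    """Gaussian-kernel CDF of the continuized score variable, evaluated at x."""
    mu, var = moments(scores, probs)
    a = sqrt(var / (var + h ** 2))  # shrinkage factor that preserves the variance
    return sum(
        r * PHI.cdf((x - a * xj - (1 - a) * mu) / (a * h))
        for xj, r in zip(scores, probs)
    )

# Hypothetical symmetric score distribution on 0, ..., 4.
scores = [0, 1, 2, 3, 4]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
h = 0.6
cdf_at_mean = continuized_cdf(2.0, scores, probs, h)
print(cdf_at_mean)
```

Because the example distribution is symmetric around its mean, the continuized CDF evaluated at the mean should be one half, which gives a quick sanity check of the formula.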
The bandwidth $h_X$ determines the degree of smoothness of the score distribution $F_{\tilde{X}}$ and is often selected by minimizing a certain criterion function. The most commonly used criterion, first suggested in von Davier et al. (2004b), is given by

$\mathrm{PEN}(h_X) = \sum_{j=1}^{J} \left(\hat{r}_j - f_{\tilde{X}}(x_j; h_X)\right)^2 + \kappa \sum_{j=1}^{J} A_j$, (5)

where $f_{\tilde{X}}(x; h_X)$ is the density function of $\tilde{X}$ for bandwidth $h_X$, obtained by differentiating $F_{\tilde{X}}(x)$ with respect to $x$, $A_j = 1$ if

$[(f'_{\tilde{X}}(x_j - \omega) > 0) \cap (f'_{\tilde{X}}(x_j + \omega) < 0)] \cup [(f'_{\tilde{X}}(x_j - \omega) < 0) \cap (f'_{\tilde{X}}(x_j + \omega) > 0)]$,

and $A_j = 0$ otherwise (von Davier, 2013; Wallin et al., 2021). The weight $\kappa$ could be chosen through, for example, cross-validation but is typically set to 1, and $\omega$ determines the neighborhood for which the criterion function penalizes a bandwidth that permits sign changes in $f'_{\tilde{X}}$. In this study, we use $\omega = 0.25$, as it has yielded densities that closely follow the raw score histograms. However, it has been shown that the equated scores are not sensitive to the choice of bandwidth among the methods that are currently available. In both the empirical illustration and the simulation study, we therefore use the criterion function in (5).
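A rough sketch of bandwidth selection with a penalty criterion of this type is given below. It assumes a Gaussian kernel, evaluates the squared distance between the score probabilities and the continuized density at each score point, and adds a unit penalty whenever the density derivative changes sign within a neighborhood of a score point. The score probabilities, the candidate grid, and the plain grid search are hypothetical choices; they are not the optimizer used by kequate.

```python
from math import exp, pi, sqrt

def normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def density_and_slope(x, scores, probs, h):
    """Continuized density and its first derivative at x (Gaussian kernel)."""
    mu = sum(s * r for s, r in zip(scores, probs))
    var = sum(r * (s - mu) ** 2 for s, r in zip(scores, probs))
    a = sqrt(var / (var + h ** 2))
    f = df = 0.0
    for xj, r in zip(scores, probs):
        z = (x - a * xj - (1 - a) * mu) / (a * h)
        f += r * normal_pdf(z) / (a * h)
        df -= r * z * normal_pdf(z) / (a * h) ** 2
    return f, df

def penalty(h, scores, probs, kappa=1.0, omega=0.25):
    """Sum of squared deviations plus kappa times the number of sign changes."""
    pen1 = pen2 = 0.0
    for xj, r in zip(scores, probs):
        f, _ = density_and_slope(xj, scores, probs, h)
        pen1 += (r - f) ** 2
        _, left = density_and_slope(xj - omega, scores, probs, h)
        _, right = density_and_slope(xj + omega, scores, probs, h)
        if left * right < 0:  # derivative changes sign around x_j
            pen2 += 1.0
    return pen1 + kappa * pen2

# Hypothetical score probabilities on 0, ..., 10.
scores = list(range(11))
probs = [0.02, 0.05, 0.09, 0.13, 0.16, 0.17, 0.15, 0.11, 0.07, 0.04, 0.01]
grid = [0.3 + 0.05 * i for i in range(35)]  # candidate bandwidths from 0.30 to 2.00
best_h = min(grid, key=lambda h: penalty(h, scores, probs))
print(best_h, penalty(best_h, scores, probs))
```

A grid search is used here only because it is easy to follow; a numerical optimizer over a continuous range would be the more usual choice in practice.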
Remark 1. As the bandwidth grows to infinity, the continuous score CDF $F_{\tilde{X}}(x) \approx \Phi\left(\frac{x - \mu_X}{\sigma_X}\right)$, which makes the KE estimator approach the linear equating function $\mathrm{Lin}(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$. See von Davier et al. (2004b) for the proof. If the bandwidth is set large, for example, $h_X = 10\sigma_X$, the linear equating estimator can be closely approximated. If the bandwidth instead is set to something very small, $F_{\tilde{X}}$ is a close approximation of the step function $F$. The traditional percentile rank method, where $F_{\tilde{X}}$ and $G_{\tilde{Y}}$ are piecewise linear functions, can thus also be closely approximated (von Davier et al., 2004b). These two results emphasize that KE comprises a family of equating methods that incorporates both of the traditional methods as special cases when the bandwidth is either very large (linear equating) or very small (percentile rank method).
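The large-bandwidth limit in Remark 1 can be checked numerically. The sketch below equates two hypothetical score distributions with a Gaussian kernel and bandwidths set to ten standard deviations, and compares the result with the linear equating function. The score probabilities and the bisection-based inversion are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()

def moments(scores, probs):
    mu = sum(x * p for x, p in zip(scores, probs))
    var = sum(p * (x - mu) ** 2 for x, p in zip(scores, probs))
    return mu, sqrt(var)

def kernel_cdf(x, scores, probs, h):
    mu, sd = moments(scores, probs)
    a = sqrt(sd ** 2 / (sd ** 2 + h ** 2))
    return sum(p * PHI.cdf((x - a * xj - (1 - a) * mu) / (a * h))
               for xj, p in zip(scores, probs))

def kernel_quantile(u, scores, probs, h, lo=-100.0, hi=100.0):
    """Invert the continuized CDF by bisection (the CDF is strictly increasing)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, scores, probs, h) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def kernel_equate(x, xs, r, ys, s, h_x, h_y):
    """Equipercentile map from the X scale to the Y scale using continuized CDFs."""
    return kernel_quantile(kernel_cdf(x, xs, r, h_x), ys, s, h_y)

# Hypothetical score probabilities for forms X and Y on 0, ..., 10.
xs = ys = list(range(11))
r = [0.02, 0.05, 0.09, 0.13, 0.16, 0.17, 0.15, 0.11, 0.07, 0.04, 0.01]
s = [0.04, 0.08, 0.12, 0.16, 0.17, 0.15, 0.12, 0.08, 0.05, 0.02, 0.01]
mu_x, sd_x = moments(xs, r)
mu_y, sd_y = moments(ys, s)

results = []
for x in (2, 5, 8):
    ke = kernel_equate(x, xs, r, ys, s, 10 * sd_x, 10 * sd_y)
    lin = mu_y + sd_y / sd_x * (x - mu_x)
    results.append((x, ke, lin))
    print(x, round(ke, 3), round(lin, 3))
```

Replacing the factor 10 with a small bandwidth should move the kernel result away from the linear function and toward the percentile rank solution.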
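With the estimated, continuized score distributions $\hat{F}_{\tilde{X}}(x) = F_{\tilde{X}}(x; \hat{\mathbf{r}})$ and $\hat{G}_{\tilde{Y}}(y) = G_{\tilde{Y}}(y; \hat{\mathbf{s}})$, the kernel equating estimator of the equipercentile function $\varphi(x)$ equals

$\hat{\varphi}(x; \hat{\mathbf{r}}, \hat{\mathbf{s}}) = G_{\tilde{Y}}^{-1}(F_{\tilde{X}}(x; \hat{\mathbf{r}}); \hat{\mathbf{s}})$.

The asymptotic distribution of $\hat{\varphi}(x; \hat{\mathbf{r}}, \hat{\mathbf{s}})$ is given by

$N\left(\varphi(x; \mathbf{r}, \mathbf{s}),\ \mathbf{J}_{\varphi} \mathbf{J}_{DF} \mathbf{C} \mathbf{C}^T \mathbf{J}_{DF}^T \mathbf{J}_{\varphi}^T\right)$,

where $\mathbf{J}_{\varphi}$ denotes the Jacobian of the equating function, $\mathbf{J}_{DF}$ denotes the Jacobian of the design function, and $\mathbf{C}$ is the covariance matrix of the score distributions $v(\mathbf{P})$ and $v(\mathbf{Q})$. See Wallin and Wiberg (2019) for the specific formula of these quantities. The standard error of equating (SEE; von Davier et al., 2004b) is consequently given by

$\mathrm{SEE}(x) = \| \hat{\mathbf{J}}_{\varphi} \hat{\mathbf{J}}_{DF} \hat{\mathbf{C}} \|$,

where $\| \cdot \|$ denotes the Euclidean norm.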

Nonequivalent Groups With Covariate (NEC) Design
This section will clarify the viewpoint we take on the nonequivalent groups designs in test score equating and the specific assumptions underlying the NEC design (Wiberg & Bränberg, 2015). The NEC design assumes that the group of test-takers administered test form $\mathsf{X}$ is a random sample from population $P$, and the group of test-takers administered test form $\mathsf{Y}$ is a random sample from population $Q$, where $P \neq Q$ and $\mathsf{X} \neq \mathsf{Y}$. Each test-taker thus has a recorded test score on only one of the test forms, never both. In addition to the test score, there is a vector of measured covariates $\mathbf{D} = (D_1, \ldots, D_m)$ for all test-takers regardless of test form. The NEC design is summarized in Table 1.
The covariates in $\mathbf{D}$ take a role similar to that of the anchor score in the NEAT design, meaning that they are intended to adjust for any imbalance in ability between the test groups. All covariates confounding the relationship between the test form assignment mechanism and $(X, Y)$ need to be controlled for. We denote the test form assignment by $Z = 1$ if a randomly chosen test-taker is administered test form $\mathsf{X}$ and $Z = 0$ if test form $\mathsf{Y}$ is administered.

In Figure 1, the variables $Z$, $\mathbf{D}$, and $(X, Y)$ are illustrated in a directed acyclic graph (DAG). In the DAG, the relationship between test form assignment and test score is confounded by the covariate vector $\mathbf{D}$. A proper equating procedure under the NEC design thus needs to control for this confounding, or else it will result in biased equated scores. In this sense, there is no difference from the use of anchor test scores $A$ in NEAT design equating. Simply replace $\mathbf{D}$ with $A$ in the DAG, and it will graphically summarize the NEAT design. Just like the anchor test score $A$, the covariate vector $\mathbf{D}$ is thus used as a proxy for ability.

Propensity Scores
The basic idea of the NEC design is to replace the anchor test scores with the covariates and then to equate the test scores, treating the covariate realizations as if they were in fact anchor scores. When using more than only a few covariates, the number of empty cells in the frequency table will grow large. There is thus a practical problem with the NEC design that is unrelated to the theoretical justification of the method. The curse of dimensionality is a well-known problem far beyond the equating literature, and a well-established method to handle this problem is to use a dimension-reducing function of the covariates called the propensity score. It reduces the dimension of the covariate vector down to a scalar and is defined as $e(\mathbf{D}) = P(Z = 1 \mid \mathbf{D})$.
The propensity score possesses the appealing property of being a balancing score (Rosenbaum & Rubin, 1983). This means that it is sufficient to control for $e(\mathbf{D})$ to balance the covariates between the test groups, if all confounders of the relationship between $Z$ and $(X, Y)$ are contained in $\mathbf{D}$. It is worth recalling that the variable we truly wish to control for is latent ability. It follows that the usefulness of balancing the test groups on the covariates depends on the quality of $\mathbf{D}$ as a proxy variable for ability. Note that this is completely in line with the assumptions underlying NEAT-based equating using anchor scores.

As the propensity score is not known, it needs to be estimated. A common method is to use logistic regression, which will be used here. Following Rosenbaum and Rubin (1984) and Wallin and Wiberg (2019), the estimated propensity scores of the test-takers will thereafter be partitioned into strata based on the percentiles. The test-takers in each stratum are treated as homogeneous in terms of the latent ability, meaning that the equivalent groups design assumptions hold true within each stratum.
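The estimation and stratification steps can be sketched as follows. The example simulates a single hypothetical covariate, fits a logistic regression by Newton-Raphson, and assigns test-takers to strata based on percentiles of the estimated propensity score. The data-generating coefficients, the sample size, and the choice of five strata are assumptions made for illustration; the study itself relies on R for these steps.

```python
import random
from math import exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def fit_logistic(d, z, iterations=25):
    """Newton-Raphson for logit P(Z = 1 | D) = b0 + b1 * D with one covariate."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for di, zi in zip(d, z):
            p = sigmoid(b0 + b1 * di)
            w = p * (1.0 - p)
            g0 += zi - p           # score with respect to the intercept
            g1 += (zi - p) * di    # score with respect to the slope
            h00 += w
            h01 += w * di
            h11 += w * di * di
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: one covariate that drives test form assignment.
random.seed(1)
n = 2000
d = [random.gauss(0.0, 1.0) for _ in range(n)]
z = [1 if random.random() < sigmoid(-0.2 + 0.8 * di) else 0 for di in d]

b0, b1 = fit_logistic(d, z)
e_hat = [sigmoid(b0 + b1 * di) for di in d]  # estimated propensity scores

# Stratify on percentiles of the estimated propensity score.
n_strata = 5
order = sorted(range(n), key=lambda i: e_hat[i])
stratum = [0] * n
for rank, i in enumerate(order):
    stratum[i] = rank * n_strata // n + 1  # strata labelled 1, ..., n_strata

print(round(b0, 3), round(b1, 3))
```

The stratum labels play the role of the anchor score categories in the subsequent equating steps.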

Equating Estimators Based on the Propensity Score
In the following, two propensity score-based equating estimators are derived and presented together with their underlying assumptions, following the estimators presented in Wallin and Wiberg (2019). Note that as these estimators were presented without much theoretical justification in the original paper by Wallin and Wiberg (2019), special attention is given to motivate them in this section.

PSE estimator.
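To define the PSE estimator, abbreviated PS-PSE, define the elements in $\mathbf{r}$ and $\mathbf{s}$ as

$r_j = P(X = x_j \mid T) = w r_{Pj} + (1 - w) r_{Qj}$, (8)

$s_k = P(Y = y_k \mid T) = w s_{Pk} + (1 - w) s_{Qk}$. (9)

The probabilities are defined on the target population $T$ and on populations $P$ and $Q$. For PSE, the target population is a somewhat theoretical construct and is described by the symbolic equation $T = wP + (1 - w)Q$, where $w$ is a weight often set according to the relative sample sizes.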
In Equations 8 and 9, the terms $r_{Qj}$ and $s_{Pk}$ cannot be calculated from data since the $P$ sample has only been administered test form $\mathsf{X}$ and the $Q$ sample only test form $\mathsf{Y}$. There is thus data missing by design, in the same sense as is thoroughly discussed in Sinharay and Holland (2010b). The following assumption is therefore needed to define the PS-PSE estimator. We follow the notation in Dawid (1979) and let $\perp\!\!\!\perp$ denote statistical independence.
Assumption 1: For the PS-PSE estimator, we assume that $(X, Y) \perp\!\!\!\perp Z \mid e(\mathbf{D})$ and $0 < e(\mathbf{D}) < 1$. Note, for a dichotomous treatment (i.e., a pair of test forms to be equated), true. This is sometimes referred to as strong ignorability in the causal inference literature (Hernan & Robins, 2020).
The first part of Assumption 1 means that the test scores are conditionally independent of the test form assignment given the propensity score. The test groups would thereby be only randomly different from each other, as in the equivalent groups design. The second part of Assumption 1 ensures that all test-takers have a nonzero probability of being assigned either test form. If the propensity score has been stratified into $L$ strata, such that the test groups are balanced on $\mathbf{D}$ in each stratum, estimators of the missing-data quantities in Equations 8 and 9 can be identified under Assumption 1. To that end, let the stratified propensity score be denoted $M \in \{1, \ldots, L\}$ with realizations denoted $m$.
Under Assumption 1, we furthermore assume that

$P(X = x_j \mid M = m, P) = P(X = x_j \mid M = m, Q)$ (10)

and

$P(Y = y_k \mid M = m, P) = P(Y = y_k \mid M = m, Q)$. (11)

Equation 10 states that the probability of test score $x_j$ is the same in populations $P$ and $Q$ conditional on the observed, stratified propensity score $M = m$. The corresponding probability statement is true for the $Y$ scores in Equation 11. Estimators of $r_{Qj}$ and $s_{Pk}$ are now possible to define. In the following, such estimators are defined and justified.
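The poststratification idea behind Equations 10 and 11 can be sketched numerically: the conditional score distribution within each stratum is taken from the population where the test form was observed and is then reweighted by the stratum probabilities of the other population. The joint probability tables below are hypothetical, and the function name is an assumption introduced for this example.

```python
def poststratify(p_joint, q_joint, w=0.5):
    """Score probabilities on T = wP + (1 - w)Q by poststratification.

    p_joint[j][m] = P(X = x_j, M = m | P); q_joint[k][m] = P(Y = y_k, M = m | Q).
    """
    n_strata = len(p_joint[0])
    t_p = [sum(row[m] for row in p_joint) for m in range(n_strata)]  # P(M = m | P)
    t_q = [sum(row[m] for row in q_joint) for m in range(n_strata)]  # P(M = m | Q)
    r_p = [sum(row) for row in p_joint]  # observed X score probabilities in P
    s_q = [sum(row) for row in q_joint]  # observed Y score probabilities in Q
    # Reweight each conditional score distribution by the other population's strata.
    r_q = [sum(row[m] / t_p[m] * t_q[m] for m in range(n_strata)) for row in p_joint]
    s_p = [sum(row[m] / t_q[m] * t_p[m] for m in range(n_strata)) for row in q_joint]
    r = [w * a + (1 - w) * b for a, b in zip(r_p, r_q)]
    s = [w * a + (1 - w) * b for a, b in zip(s_p, s_q)]
    return r, s

# Hypothetical joint distributions with three score points and two strata.
p_joint = [[0.10, 0.05], [0.20, 0.25], [0.10, 0.30]]
q_joint = [[0.25, 0.05], [0.25, 0.15], [0.10, 0.20]]

r, s = poststratify(p_joint, q_joint, w=0.5)
r_q_only, _ = poststratify(p_joint, q_joint, w=0.0)  # w = 0 isolates the Q part of r
print(r, s)
```

In practice the joint tables would come from fitted log-linear models rather than being specified by hand.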
Proposition 1: Denote the (from log-linear models) estimated joint distributions of $X$ and $M$ in $P$, and of $Y$ and $M$ in $Q$, by $\hat{p}_{jm}$ and $\hat{q}_{km}$, respectively. If Assumption 1 holds true and the propensity score has been stratified, such that the covariate distribution is balanced in the test groups, $r_{Qj}$ and $s_{Pk}$ can be estimated by

$\hat{r}_{Qj} = \sum_{m=1}^{L} \dfrac{\hat{p}_{jm}}{\sum_{j} \hat{p}_{jm}} \sum_{k} \hat{q}_{km}$ and $\hat{s}_{Pk} = \sum_{m=1}^{L} \dfrac{\hat{q}_{km}}{\sum_{k} \hat{q}_{km}} \sum_{j} \hat{p}_{jm}$.

The proof of Proposition 1 is found in Online Appendix A. Lastly, plug the estimated test score probabilities $\hat{\mathbf{r}}$ into Equation 4, do the same for $\hat{\mathbf{s}}$, and functionally compose the equating estimator as

$\hat{\varphi}(x; \hat{\mathbf{r}}, \hat{\mathbf{s}})_{PSE} = G_{\tilde{Y}}^{-1}(F_{\tilde{X}}(x; \hat{\mathbf{r}}); \hat{\mathbf{s}})$.

CE estimator.
Even though conditioning on the propensity score, as has been outlined for the PS-PSE estimator, is the traditional way of removing dependencies between the outcome and the treatment, CE methods have a long-standing tradition within test score equating. Several studies have shown that the result of linking, or chaining, together a sequence of equating functions is often very similar to, and sometimes even better than, that of the competing PSE estimator (Sinharay & Holland, 2010a, 2010b; Wallin & Wiberg, 2019). In fact, when equating with an anchor, the PSE and CE estimators coincide if the anchor score distributions in $P$ and $Q$ are equal (von Davier et al., 2004a). Thus, for the second equating estimator considered, abbreviated PS-CE, let

$r_{Pj} = P(X = x_j \mid P)$, $t_{Pm} = P(M = m \mid P)$, $t_{Qm} = P(M = m \mid Q)$, and $s_{Qk} = P(Y = y_k \mid Q)$.

Note that these are score probabilities for populations $P$ and $Q$, respectively, and not for the mixture population $T$. The quantity $t_{Pm}$ is to be understood as the probability of the random, stratified variable $M$ being equal to the realization $m$ in population $P$, and $t_{Qm}$ is interpreted analogously. We furthermore define the corresponding continuized score CDFs that are functions of the score probabilities:

$F_{\tilde{X}P}(x) = F_{\tilde{X}P}(x; \mathbf{r}_P)$, $H_{\tilde{e}P}(m) = H_{\tilde{e}P}(m; \mathbf{t}_P)$, $H_{\tilde{e}Q}(m) = H_{\tilde{e}Q}(m; \mathbf{t}_Q)$, $G_{\tilde{Y}Q}(y) = G_{\tilde{Y}Q}(y; \mathbf{s}_Q)$, (13)

where $F_{\tilde{X}P}$, $G_{\tilde{Y}Q}$, $H_{\tilde{e}P}$, and $H_{\tilde{e}Q}$ denote CDFs that have been continuized in a similar fashion as in Equation 4. The PS-CE estimator depends on the linking of distributions between populations $P$ and $Q$. There is an underlying assumption that there is a link between the $X$ scores and the propensity scores in population $P$ and a link between the propensity scores and the $Y$ scores in population $Q$. This is specified in Assumption 2.
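Assumption 2: For the PS-CE estimator, we assume that

$H_{\tilde{e}P}^{-1}(F_{\tilde{X}P}(x)) = H_{\tilde{e}T}^{-1}(F_{\tilde{X}T}(x))$ and $G_{\tilde{Y}Q}^{-1}(H_{\tilde{e}Q}(m)) = G_{\tilde{Y}T}^{-1}(H_{\tilde{e}T}(m))$

for any target distribution of the form $T = wP + (1 - w)Q$.
Assumption 2 is to be understood as a statement regarding population invariance of the equipercentile function linking $X$ to $e(\mathbf{D})$ on $P$ and of the equipercentile function linking $e(\mathbf{D})$ to $Y$ on $Q$.
Proposition 2: The PS-CE estimator is given by linking the functions in Equation 13 together in a chain:

$\hat{\varphi}(x)_{CE} = G_{\tilde{Y}Q}^{-1}\left(H_{\tilde{e}Q}\left(H_{\tilde{e}P}^{-1}\left(F_{\tilde{X}P}(x)\right)\right)\right)$.

The proof of Proposition 2 is found in Online Appendix B.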
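The chaining of two equipercentile links can be illustrated with continuous distributions. In the sketch below, all four distributions are hypothetical normal distributions, and the propensity score is treated as a continuous proxy variable purely for illustration, whereas the estimator described in the text works with continuized distributions of the stratified propensity score.

```python
from statistics import NormalDist

# Hypothetical continuous distributions (assumptions for this example only).
F_P = NormalDist(50.0, 10.0)  # X scores in population P
H_P = NormalDist(0.55, 0.10)  # propensity-score proxy in population P
H_Q = NormalDist(0.45, 0.12)  # propensity-score proxy in population Q
G_Q = NormalDist(52.0, 9.0)   # Y scores in population Q

def x_to_proxy(x: float) -> float:
    """First link: X scores to the proxy scale, estimated on P."""
    return H_P.inv_cdf(F_P.cdf(x))

def proxy_to_y(m: float) -> float:
    """Second link: proxy scale to Y scores, estimated on Q."""
    return G_Q.inv_cdf(H_Q.cdf(m))

def chained_equate(x: float) -> float:
    """Chain the two links: G_Q^{-1}(H_Q(H_P^{-1}(F_P(x))))."""
    return proxy_to_y(x_to_proxy(x))

for x in (40.0, 50.0, 60.0):
    print(x, round(chained_equate(x), 4))
```

Note that no mixture population enters the computation; each link uses data from a single population only, which is why the population invariance in Assumption 2 is needed.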

A Motivating Example Using Empirical Data
As a motivating example, two test administrations of the SweSAT are analyzed. The SweSAT is used in the selection process for Swedish university programs and consists of a verbal and quantitative section. These sections in turn consist of 80 items each and are equated separately. Only recently, the SweSAT started including anchor items. Prior to this, covariates were used in a matching procedure when the test forms were equated (Wiberg & Bränberg, 2015). In this empirical study, both the PS-PSE and PS-CE estimators will be used to equate the quantitative sections from two SweSAT test administrations from the past decade.

Data and PS Models
The score distributions of the analyzed test forms are shown in Figure 2. As seen, the $Y$ score distribution (the old test form) is slightly skewed and shifted to the left of the $X$ score distribution. The empirical distributions suggest that either the X test group is on average more capable than the Y test group, or the X test form is easier, or a combination of both. In addition to the test scores, each test-taker has a set of covariates recorded. Based on previous studies (Altintaş & Wallin, 2021; Bränberg et al., 1990; Wallin & Wiberg, 2019) and on availability, the covariates used in this study are gender, age, and the test score from the verbal section, as these have been shown to correlate with the quantitative score. In Table 2, summary statistics are presented for the variables of the empirical study. The variable Age is reported in five categories: It equals 1 if an individual's age is within $[0, 20]$, 2 if the age is within $[21, 24]$, 3 if the age is within $[25, 29]$, 4 if the age is within $[30, 39]$, and 5 if the age is 40 or older. Note that at the time of these test administrations, there was no age restriction for individuals taking the test. Since there is no known true propensity score model, a number of candidate models are set up for both the PS-PSE and PS-CE equating estimators. Let $D_1$ denote the test score on the verbal section of the test, $D_2$ denote age, and $D_3$ denote gender. The candidate propensity score models are estimated using logistic regression with a logit link except when indicated. We consider most of the possible combinations of covariates and factors, resulting in 13 models, which are shown in Table 3.
Hence, in total, 26 equating estimators will be considered, 13 for the PS-PSE estimator and 13 for the PS-CE estimator. The equated scores and the SEEs of each estimator will be analyzed to determine the extent to which they vary with changes in the propensity score model's parameterization. The difference that matters (Dorans & Feigenbaum, 1994), defined to be larger than half a raw score point, will also be investigated. Goodness-of-fit measures like the Akaike information criterion (Akaike, 1974) or the Bayesian information criterion (BIC; Schwarz, 1978) are not suitable for evaluating the propensity score models, since the priority is not their parameter estimates but rather the achieved covariate balance between the test groups (Augurzky & Schmidt, 2001; Stuart, 2010). The absolute standardized mean difference (ASMD; Austin, 2008) will therefore be used to evaluate the level of achieved covariate balance:

$\mathrm{ASMD} = \dfrac{|\bar{D}_{Z=1} - \bar{D}_{Z=0}|}{\sqrt{(s^2_{Z=1} + s^2_{Z=0})/2}}$,

where $\bar{D}_{Z=1}$ and $\bar{D}_{Z=0}$ denote the sample means of a covariate in the two test groups, and $s^2_{Z=1}$ and $s^2_{Z=0}$ denote their respective variances. There exists no general threshold for the ASMD, but a value above 0.10 is considered to indicate covariate imbalance (Austin, 2008). Once a proper stratification has been achieved, bivariate log-linear models of the test scores and the stratified, estimated propensity score according to (3) are fit to the empirical data. The BIC is used to choose the parametrization of the log-linear models since it has been shown to have a high selection accuracy for bivariate smoothing (Moses & Holland, 2010). All analyses are made using R (R Core Team, 2021) and the R package kequate (Andersson et al., 2013).

TABLE 3. Candidate propensity score models.
1. $D_1$, $D_2$, and $D_3$
2. $D_1$, $D_2$, and $D_3$ with probit
3. $D_1$ and $D_2$
4. $D_1$, $D_1^2$, $D_2$, and $D_3$
5. $D_1$
6. $D_2$
7. $D_3$
8. $D_1$ and $D_3$
9. $D_2$ and $D_3$
10. $D_2$, $D_2^2$, and $D_3$
11. $D_1$, $D_1^2$, and $D_2$
12. $D_1$, $D_2$, $D_2^2$, and $D_3$
13. $D_1$, $D_1^2$, $D_2$, $D_2^2$, and $D_3$
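The balance check with the ASMD can be sketched as below, both for two whole groups and within propensity score strata. The pooled standard deviation follows the usual definition attributed to Austin (2008); the function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def asmd(group_x, group_y):
    """Absolute standardized mean difference of one covariate between two groups."""
    pooled_sd = sqrt((variance(group_x) + variance(group_y)) / 2.0)
    return abs(mean(group_x) - mean(group_y)) / pooled_sd

def asmd_by_stratum(values, z, stratum):
    """ASMD of one covariate within each stratum (z = 1: form X, z = 0: form Y)."""
    out = {}
    for m in sorted(set(stratum)):
        gx = [v for v, zi, s in zip(values, z, stratum) if s == m and zi == 1]
        gy = [v for v, zi, s in zip(values, z, stratum) if s == m and zi == 0]
        out[m] = asmd(gx, gy)
    return out

# Hypothetical covariate values for twelve test-takers in two strata.
values = [1, 2, 3, 2, 3, 4, 5, 6, 7, 5, 6, 7]
z = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
stratum = [1] * 6 + [2] * 6

balance = asmd_by_stratum(values, z, stratum)
print(balance)
```

A stratification would be judged acceptable by this check when every stratum falls below the 0.10 guideline mentioned in the text.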
In Figure 3, the ASMDs between the treatment (test form X) and control (test form Y) group for the covariate Verb are displayed. The ASMD was calculated within each stratum of the propensity score. To determine the number of strata, a sequence of potential stratifications was set up. For each possible number of strata, the ASMD was calculated within each stratum. The stratification that produced a low ASMD for all strata was chosen, resulting in 20 strata for this particular dataset. In Figure 3, the first two models show the best performance in terms of ASMD since most of the strata have an ASMD below 0.1. This can be compared with the ASMD obtained when not controlling for the propensity score, which equals 0.386. Stratifying on the propensity score has thus successfully brought the test groups substantially closer in terms of their covariate distribution. The corresponding plots were examined for the covariates Gender and Age as well, where similar patterns were observed. It could thus be suspected that these models will lead to low equating error, given that the covariates Verb, Gender, and Age are at all associated with the latent ability.

FIGURE 3. The absolute standardized mean difference between the treatment group (test form X) and the control group (test form Y) for the covariate Verb, for each of the 20 strata of the four candidate propensity score models.
In the next stage, bivariate log-linear models are fit to the observed test scores and the stratified propensity scores. A set of candidate models is considered and evaluated in terms of BIC. In Tables 4 and 5, the estimated coefficients together with their corresponding standard errors, p values, and the BIC are presented for four candidate models. The notation X : M refers to an included interaction term between X and M. We decided to use the third model for both the X and Y scores, as it showed the best fit in terms of the BIC. We thereafter continuized the score distributions by applying a Gaussian kernel to the distribution approximation in Equation 4 and selected the degree of smoothness using the criterion in Equation 5.

Results
To illustrate the general trend among the estimators, we display the results of the equating estimators using propensity score models 1-4 in Figure 4.

Wallin and Wiberg
Propensity score model number 3, which does not include the covariate Gender, deviates clearly. In the upper part of the score scale, Model 3 differs from the other estimators by a margin that clearly matters. Since gender has been established as an important covariate when analyzing the SweSAT (Bränberg et al., 1990), with a fairly strong correlation with the test scores, it comes as no surprise that the equated scores are affected when gender is excluded. Far less important for this dataset is the choice of link function or whether a second-order term is included. On the other hand, the SEEs of all estimators are more or less similar along the whole score scale. In Figure 5, the equated scores (upper part) are shown for the four PS-CE estimators, together with the SEE (lower part). The pattern from the PSE estimators is evident here as well, with clear deviations for the model that fails to include gender in the propensity score model and with a negligible difference in terms of SEE. We also notice a distinct difference between the equated scores produced by the PSE-based estimators in Figure 4 and the CE-based estimators in Figure 5. In the online supplements, the estimated equating functions resulting from all 13 propensity score models are given.

Simulation Study
For the empirical illustration, the results suggested that a critical component when using propensity scores to equate test scores is to include all important covariates in the propensity score estimation model. The equated scores were less sensitive to the choice of link function and the inclusion of higher order polynomials. Since it is not possible to generalize these results, the robustness of the PS-PSE and PS-CE estimators to misspecifications of the propensity score model is evaluated in a simulation study. We assume that the propensity score is described by a parametric model and consider two different simulation designs. Both designs are inspired by the simulation study in Wallin and Wiberg (2019) but with propensity score model misspecifications added. The misspecifications considered are (1) using the wrong link function, (2) leaving out a covariate, and (3) leaving out higher order terms. The simulation designs closely follow the studies typically seen in the causal inference literature, where potential outcomes under different treatment regimes are generated and the observed outcomes depend on the realization of the treatment variable, which in turn is a function of a covariate vector. Inspired by this, and to mimic the situation described in Figure 1, we generated covariates that affected both the test form assignment (through the propensity score) and the test scores, making them true confounders. Both potential test scores and observed test scores are generated, as explained in the simulation designs. The presented results are based on $n = 10{,}000$ simulated test-takers and 1,000 iterations, although sample sizes of $n = 1{,}000$ and $n = 5{,}000$ were considered as well. As the difference between those results and the ones based on $n = 10{,}000$ was negligible, they have been excluded but can be sent upon request. As in the empirical study, all calculations are carried out in R with the R package kequate.
3. The potential test scores on test form X are for all test-takers generated as and the potential test scores on test form Y are for all test-takers generated as Since the covariates in these expressions represent the ability differences between the groups, the $E$ terms represent the difficulty of the test forms, where $E_X \sim N(2, 1.5)$ and $E_Y \sim N(0, 1)$. The means and variances of the test scores are $E[X] = 23$, $E[Y] = 18$, $V[X] \approx 56.92$, and $V[Y] = 61$. With the data generated, the distributions of the covariates differ between the test groups.
4. The observed test score for each test-taker is defined as To generate an observed score $U^*$ for each test-taker, we set $U^* = \min(U, 40)$, which is to be understood as the rounded value of whichever is the smaller of $U$ and 40. Although no generated score was smaller than 0, such a score would have been truncated to 0. The score range is therefore set to $[0, 40]$.
5. The propensity score is estimated using logistic regression. Based on the percentiles, it is thereafter divided into 20 categories. The number of categories was chosen to reach covariate balance between the test groups as measured by the ASMD. Four candidate models are defined: one that is correctly specified according to Equation 15, one that uses a probit link function instead of the correct logit link, one that leaves out $D_2$, and one that leaves out $D_1^2$ and $D_2^2$.
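The truncation and rounding in step 4 amount to a small piece of arithmetic, sketched here in Python. The function name is hypothetical. Note that Python's `round` rounds exact halves to the nearest even integer; the rounding rule for exact halves is not specified in the text, and with continuous simulated scores such ties are practically irrelevant.

```python
def observed_score(u, max_score, min_score=0):
    """Truncate a raw score to [min_score, max_score], then round to an integer."""
    return int(round(min(max(u, min_score), max_score)))

# Design A uses a score range of [0, 40].
examples = [observed_score(u, 40) for u in (41.7, -3.2, 17.4, 39.6)]
print(examples)
```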

Simulation Design B
For Design B, the DGP is as follows:
1. Generate the covariates $D_1, D_2 \sim \mathrm{Uniform}(1, 5)$.
2. Generate $n \in \{1{,}000, 5{,}000, 10{,}000\}$ Bernoulli trials to compose the treatment variable $Z \sim \mathrm{Bernoulli}(e(D))$, where It follows that the test groups will be of approximately the same size.
3. The scores on test form X are for all test-takers generated as and the scores on test form Y are for all test-takers generated as where $E_X \sim N(0, 1)$ and $E_Y \sim N(5, 1.5)$. Note that the covariates in this design have a nonlinear relationship with the test scores and that there is an interaction term included. The means and variances of the test scores are $E[X] \approx 46.17$, $E[Y] \approx 49.67$, $V[X] \approx 129.79$, and $V[Y] \approx 131.04$.
4. The observed test score for each test-taker is generated as To generate an observed score $U^*$ for each test-taker, we set $U^* = \min(U, 90)$, which is to be understood as the rounded value of whichever is the smaller of $U$ and 90. Although no generated score was smaller than 0, such a score would have been truncated to 0 as in Design A. The score range is therefore set to $[0, 90]$.
5. As in Design A, the propensity score is estimated using logistic regression and thereafter divided into 20 categories, based on the absolute standardized mean difference. Four candidate models will be used: one that is correctly specified according to Equation 16, one that uses a probit link function instead of the correct logit link, one that leaves out $D_2$, and one that leaves out $D_1^2$ and $D_2^2$.
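The overall shape of such a data-generating process can be sketched in Python as below. Only the covariate distribution, the Bernoulli assignment, the noise means, and the score range $[0, 90]$ are taken from the steps above. Everything else is a loudly hypothetical stand-in: the propensity score coefficients, the score equations, reading the second parameter of the normal distributions as a standard deviation, and taking the observed score to be the X score when $Z = 1$ and the Y score otherwise. The actual equations of Design B are not reproduced here.

```python
import math
import random

random.seed(2024)
MAX_SCORE = 90

def propensity(d1, d2):
    # HYPOTHETICAL coefficients, not those of the paper. The linear predictor
    # changes sign when d1 and d2 are swapped, so the groups are about equal in size.
    lp = 0.5 * (d1 - d2) + 0.1 * (d1 ** 2 - d2 ** 2)
    return 1.0 / (1.0 + math.exp(-lp))

def potential_scores(d1, d2):
    # HYPOTHETICAL nonlinear score model with an interaction term.
    signal = 10.0 + 1.5 * d1 ** 2 + 1.5 * d2 ** 2 + 0.5 * d1 * d2
    x = signal + random.gauss(0.0, 1.0)  # form X noise; 1.0 read as an SD (assumption)
    y = signal + random.gauss(5.0, 1.5)  # form Y noise; 1.5 read as an SD (assumption)
    return x, y

def simulate(n):
    rows = []
    for _ in range(n):
        d1, d2 = random.uniform(1, 5), random.uniform(1, 5)
        z = 1 if random.random() < propensity(d1, d2) else 0
        x, y = potential_scores(d1, d2)
        u = x if z == 1 else y  # assumed link between Z and the administered form
        u_star = int(round(min(max(u, 0), MAX_SCORE)))
        rows.append({"d1": d1, "d2": d2, "z": z, "x": x, "y": y, "u_star": u_star})
    return rows

data = simulate(10000)
share_x = sum(r["z"] for r in data) / len(data)
print(f"share assigned to form X: {share_x:.3f}")
```

Because both potential scores are stored for every simulated test-taker, the same data can later be used to compute a reference equating function against which the estimators are judged.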
Remark 3. The potential test score X is to be interpreted as the test score that a test-taker would have obtained if they had been administered the X test form, and Y is the potential test score if test form Y had been administered. In this way, every test-taker has a potential, but not observed, test score on both forms. The observed test score U reflects the test form actually administered to each of the test-takers. In addition, for both Designs A and B, each test-taker has an observed covariate vector $D$ and an estimated propensity score $\hat{e}(D)$. Also, a discrete version of the covariates was considered by splitting them into five equally spaced groups, inspired by the DGP in Wiberg and Bränberg (2015). The reason for doing this was to mimic testing programs where only categorized versions of the background information have been stored, such as prespecified age intervals instead of the actual ages of the test-takers. Lastly, it is important to note that it is possible to define a true equating function for both DGPs given above, since each test-taker has a potential test score on both test forms.
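To make the last point of Remark 3 concrete, the sketch below computes an equipercentile mapping from the potential X scores to the potential Y scores of the same simulated test-takers, using plain empirical distributions. This is a simplified Python illustration under stated assumptions: the scores are hypothetical and noise-free, form Y is constructed to be exactly 5 points easier, and no presmoothing or continuization is applied, so it is not the exact definition of the true equating function used in the study.

```python
import bisect
import random

def equipercentile(x, xs_sorted, ys_sorted):
    """Map x to the Y scale: the Y value at the same empirical percentile rank."""
    n, m = len(xs_sorted), len(ys_sorted)
    rank = bisect.bisect_right(xs_sorted, x)  # number of X scores <= x
    idx = (rank * m + n - 1) // n - 1         # ceil(rank * m / n) - 1, in integers
    return ys_sorted[min(max(idx, 0), m - 1)]

# Hypothetical potential scores for the SAME test-takers on both forms.
random.seed(7)
ability = [random.uniform(1, 5) for _ in range(5000)]
x_pot = sorted(10.0 * a for a in ability)
y_pot = sorted(10.0 * a + 5.0 for a in ability)

points = (15.0, 30.0, 45.0)
equated = [equipercentile(x, x_pot, y_pot) for x in points]
print([round(e, 2) for e in equated])
```

In this toy setting the recovered mapping is close to adding 5 points at each evaluated score, which is the difficulty difference built into the data.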
Remark 4. Note that in Design A, the covariates are associated with the log odds of the propensity score and the test scores in a linear way, whereas in Design B, that relationship involves both higher order terms and interactions. In this way, we are able to investigate whether there is any connection between model complexity, model misspecification, and sensitivity of the equated scores.

Evaluation Measures
The PS-PSE and PS-CE estimators are evaluated by calculating the bias and SE, as given in Wiberg and González (2016). Let $\hat{\varphi}^{(g)}(x_i)$ denote the estimated equating function evaluated at $x_i$ for replicate $g$, $g = 1, \ldots, 1{,}000$.
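As a rough guide to how such measures are computed over replicates, the following Python sketch uses the usual Monte Carlo definitions: the bias at a score point is the average deviation of the estimated equated score from the true one, and the SE is the standard deviation of the estimates across replicates. The divisor (the number of replicates rather than one less) and the function name are assumptions; the exact expressions in Wiberg and González (2016) are not reproduced here.

```python
import math

def bias_and_se(estimates, truth):
    """Pointwise Monte Carlo bias and standard error.

    estimates: list of replicates, each a list of equated scores per score point.
    truth: list with the true equated score at each score point.
    """
    n_rep = len(estimates)
    bias, se = [], []
    for i, true_value in enumerate(truth):
        values = [rep[i] for rep in estimates]
        avg = sum(values) / n_rep
        bias.append(avg - true_value)
        se.append(math.sqrt(sum((v - avg) ** 2 for v in values) / n_rep))
    return bias, se

# Two hypothetical replicates evaluated at two score points.
bias, se = bias_and_se([[1.0, 2.0], [3.0, 2.0]], truth=[2.0, 1.0])
print(bias, se)
```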
Propensity Score Misspecification in Equating

Simulation Results-Design A
The bias of the PS-PSE and PS-CE estimators is presented in Figure 6. Note that for propensity score models with a misspecified link function and for those that fail to include the second-order term, the bias is very similar. Although not illustrated in the figure, their biases practically coincide with the biases of their correctly specified counterparts (the difference is less than 0.01 for each score point). This pattern is present for both estimators, all considered sample sizes, all evaluation measures, and both simulation designs.
As the upper part of Figure 6 illustrates, the PS-PSE estimators exhibit only a small bias for all scores, with the exception of the KE estimators with a propensity score model that leaves out a covariate. It is also noteworthy that it does not matter whether or not the covariates have been categorized; the biases for all estimators stay similar regardless. The estimators that misspecify the link function and that leave out the second-order term show the best performance, with differences between them being too small to be discovered in the figure. As these estimators more or less coincide with the estimator using a correctly specified model, the results suggest that the propensity score is successful at balancing the test groups for the PS-PSE estimator.
FIGURE 6. The bias of the PS-PSE and PS-CE estimators for $n = 10{,}000$ test-takers under Simulation Design A, considering both categorized and uncategorized covariates, using a misspecified link function and a missing covariate, respectively, in the propensity score estimation model.
The lower part of Figure 6 depicts the bias for the PS-CE estimators. As with the PS-PSE estimators, misspecifying the link function (and leaving out the second-order term) yields small biases across the score range. There is a negligible difference between using categorized and uncategorized covariates in the propensity score model, but the bias increases substantially when a covariate is left out and grows particularly large for categorized covariates.
The SE of the PS-PSE and PS-CE estimators is illustrated in Figure 7. Generally, the SE is larger in the lower and upper end of the score range regardless of the type of misspecification. This is due to the sparse data at the most extreme scores. The estimators perform similarly with few exceptions. However, the PS-PSE estimator with a propensity score model that leaves out the second-order term of the categorized covariates yields a slightly larger SE, especially in the middle segment of the score scale. For the PS-CE estimators, it is instead the misspecification consisting of a left-out covariate that results in such a pattern. Recall that the solid curves also represent the results of the estimators with the second-order terms missing, down to a very small difference. In the same way, the dot-dashed curves represent two types of misspecifications for categorized covariates.
FIGURE 7. The standard error of the PS-PSE and PS-CE estimators for $n = 10{,}000$ test-takers under Simulation Design A, considering both categorized and uncategorized covariates, using a misspecified link function and a missing covariate, respectively, in the propensity score estimation model.
From Design A, we conclude that misspecifying the link function or failing to include a second-order term, for both the original covariates and the categorized versions of them, introduces far less error than failing to include a covariate in the propensity score model.

Simulation Results-Design B
The results of Design B are presented for $n = 10{,}000$, as the results for $n = 1{,}000$ and $n = 5{,}000$ are more or less the same, both in magnitudes of the evaluation measures and in relative performance of the estimators.
The bias of the PS-PSE and PS-CE estimators is displayed in Figure 8. The similarity with the biases in Design A is apparent. Once again, failing to include an important covariate leads to severe bias for both estimators. The results are particularly inaccurate for the PS-CE estimator with categorized covariates. The estimators with a misspecified link function and those that fail to include the second-order term show robust results in the presence of model misspecification.
The SE of the estimators is shown in Figure 9. Both estimators perform similarly for all misspecifications, with the best overall performance for the models with a misspecified link function and with a missing second-order term, respectively. In contrast to the other estimators, the SE of these estimators also drops for the top scores. This could be a meaningful difference since the most critical decisions in many tests, for example, selection tests, are made at the top scores. It should however be noted that the SEs are large, especially in the tails.
Similar to Design A, we conclude from Design B that the estimators with an incorrect link function and those that do not include the second-order term are relatively robust. The PS-CE estimator that fails to include one of the categorized covariates shows the overall worst performance. We also observe that the results of Design B are approximately proportional to those of Design A, possibly due to both designs having the same type of covariates (uniformly distributed on the interval [1,5]). However, Design B has a more intricate relationship between the covariates and the propensity score, as well as the covariates and the test scores. As a result, the biases displayed in Design B's results are roughly twice as large as those seen in Design A, and the SEs have also increased. Therefore, the additional complexity in the DGP has amplified the equating error.

Discussion
The goal of this study was to investigate how sensitive the equated scores are to model misspecification of the propensity score, when the propensity score is used to equate nonequivalent test groups. It has already been shown in Wallin and Wiberg (2019) that equating with propensity scores has the possibility to reach similar precision and accuracy as equating with an anchor, and superior results compared to equating under a false assumption of equivalent groups. But since the results of Wallin and Wiberg (2019) are based on the assumption that the propensity score is known, which it typically is not in practical research scenarios, it was crucial to study how sensitive these results are to model misspecification. The propensity score is a useful tool in research as it possesses the desirable feature of being a balancing score, which has led to its widespread application across various domains. However, its high degree of flexibility means that there are numerous modeling options available, emphasizing the need for careful scrutiny to determine when the propensity score can effectively balance test-taker groups and when it falls short.
FIGURE 8. The bias of the PS-PSE and PS-CE estimators for $n = 10{,}000$ test-takers under Simulation Design B, considering both categorized and uncategorized covariates, using a misspecified link function and a missing covariate, respectively, in the propensity score estimation model.
The propensity score methods explored in this study demonstrate potential as the equated scores remain insensitive to both link function misspecification and the omission of a second-order term in the estimation model. This applies to both linear (Simulation Design A) and nonlinear (Simulation Design B) relationships between covariates and outcomes. Notably, the model misspecifications resulted in a similar bias and SE (in rounded score terms) to the correctly specified models, signifying robustness of the equated scores to such errors in the propensity score model. On the other hand, the equated scores were negatively affected by a propensity score model that omitted a true confounding covariate. These conclusions remained the same for all considered sample sizes and for both simulation designs. The results therefore clearly point to the importance of using all pertinent information related to latent ability when using the propensity score as a proxy variable. This aligns with earlier research on the propensity score, which indicates that omitting a higher order term that exists in the actual model while estimating the propensity score does not result in biased estimates (Dehejia & Wahba, 1999; Drake, 1993; Stuart, 2010; Waernbaum, 2010, 2012). Incorporating all true confounding variables is linked to the unconfoundedness assumption that forms the foundation of the propensity score method for covariate balancing. Consistent with earlier research, it was found that this aspect is crucial in the equating context as well.
FIGURE 9. The standard error of the PS-PSE and PS-CE estimators for $n = 10{,}000$ test-takers under Simulation Design B, considering both categorized and uncategorized covariates, using a misspecified link function and a missing covariate, respectively, in the propensity score estimation model.
As in Waernbaum (2010, 2012), we note that as long as the true propensity score is a function of the misspecified model, unbiased estimation of the parameter of interest is possible. We note that for Design B, the standard errors are fairly large but should be seen in relation to previous research that has shown that equating error and variability are even greater when falsely assuming equivalent groups (Wallin & Wiberg, 2019). A misspecification of the propensity score model when the relationship between the test scores and the covariates is nonlinear is thus a delicate scenario. Since reported scores often are used for individual-level decision making, the current results suggest that future research should carefully study nonlinear cases.
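One simple special case may help build intuition for the robustness to the link function when the propensity score is only used through percentile-based strata: any strictly increasing transformation of the same linear predictor yields identical strata. The Python sketch below illustrates this with hypothetical linear predictor values. It is an idealization, because in practice logit and probit fits produce different coefficient estimates, so the strata agree only approximately, and it is not a restatement of the results in Waernbaum (2010, 2012).

```python
import math
import random

def inverse_logit(lp):
    return 1.0 / (1.0 + math.exp(-lp))

def inverse_probit(lp):
    return 0.5 * (1.0 + math.erf(lp / math.sqrt(2.0)))

def percentile_strata(scores, n_strata):
    """Assign units to n_strata equally sized strata based on score ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(scores)
    return strata

# Hypothetical linear predictor values shared by both links.
random.seed(3)
linear_predictor = [random.uniform(-3.0, 3.0) for _ in range(2000)]

strata_logit = percentile_strata([inverse_logit(v) for v in linear_predictor], 20)
strata_probit = percentile_strata([inverse_probit(v) for v in linear_predictor], 20)
same_strata = strata_logit == strata_probit
print("identical strata under both links:", same_strata)
```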
We emphasize that the quality of the ability balancing suggested in this article depends strictly on the quality of the auxiliary information. The restrictions that come with the data at hand need to be evaluated with the identifying Assumptions 1 and 2 in mind. Two examples of restrictions in the empirical data analyzed in this study are the limited number of covariates and the fact that the variable Age is only available in a categorized version. Since the proposed method has been shown to perform similarly to anchor test-based equating for this particular data set (Wallin & Wiberg, 2019), there is reason to believe that the current covariate restrictions have not reversed the results. In the case of propensity score-based equating, we advise seeking input from experts in the subject matter concerning the testing program and test groups that need to be equated. Additionally, we suggest conducting a comprehensive analysis of the associations between the collected covariates and test scores. Since both the propensity score and anchor test score are employed as proxies for ability, they can be evaluated using similar methods.
Some limitations of the current study include the following. We only considered two types of covariates, and future studies could expand on this, both by using a propensity score model that is a function of both discrete and continuous covariates and by considering different dependence structures between them. We however emphasize that the aim of this article was to study propensity score model misspecification, and the misspecifications were thus the main focus and not different types of covariates. We therefore chose to vary the relationship between the treatment variable, the test scores, and the covariates but not the covariates themselves. On this note, it should be pointed out that Assumptions 1 and 2 are strong, but of similar magnitude to the assumptions underlying NEAT equating. The results in both the original paper by Wallin and Wiberg (2019) and the current article furthermore suggest that there are several realistic test scenarios where propensity score stratification is a viable technique for a sufficient ability imbalance reduction. It would therefore be of importance to further investigate how sensitive the equating function parameter is to violations of the propensity score assumption. Studying the omission of a true confounder in the propensity score model could be considered a first step toward such analysis, since this violates the unconfoundedness assumption in Assumption 1. A diagnostic tool would in the future be of great use for such analysis. In Online Appendix C, further simulation results are presented, considering both missing data and another case of model assumption violation. These results suggest that the PS-PSE particularly is robust against certain missingness, but that bias is introduced when a subset of test-takers have a true propensity score equal to 1 (or equivalently, equal to 0).
These scenarios could, for example, happen when there is an age restriction to the test in question, and certain test-takers were not allowed to take the test in the previous administration. An empirical check of the propensity scores should therefore always be conducted.
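One possible form of such an empirical check is sketched below in Python: it counts estimated propensity scores that lie very close to 0 or 1 and reports how many test-takers fall outside the range of scores shared by both test groups. The function name, the threshold `eps`, and the example values are hypothetical; the study does not prescribe a specific diagnostic.

```python
def overlap_check(ps, treated, eps=0.01):
    """Summarize extreme propensity scores and the common support of two groups."""
    ps_t = [p for p, z in zip(ps, treated) if z]
    ps_c = [p for p, z in zip(ps, treated) if not z]
    low = max(min(ps_t), min(ps_c))    # lower end of the shared range
    high = min(max(ps_t), max(ps_c))   # upper end of the shared range
    return {
        "n_extreme": sum(1 for p in ps if p < eps or p > 1 - eps),
        "common_support": (low, high),
        "n_outside": sum(1 for p in ps if p < low or p > high),
    }

# Hypothetical estimated scores and form indicators (1 = form X, 0 = form Y).
report = overlap_check([0.02, 0.3, 0.5, 0.7, 0.995, 0.6], [0, 0, 1, 1, 1, 0])
print(report)
```

A large share of test-takers outside the common range, or many scores near the boundaries, would suggest that some test-takers in one group have no comparable counterparts in the other.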
It is worth mentioning that the outcomes of Simulation Design B demonstrate a proportional relationship with those of Simulation Design A. This is attributed to the intricate association among the covariates, the treatment variable, and the outcome in Design B, which is more complicated than that in Design A. In addition to these factors, there are testing programs that have access to both covariates and an anchor test. It would therefore be worth investigating if there is any additional gain by using both sources of information to control for ability differences. Incorporating both covariates and anchor test scores has been studied within the NEC design (Albano & Wiberg, 2019; Wiberg & Bränberg, 2015), but never when considering propensity scores. We expect this to improve the results, as demonstrated in the small example in Online Appendix C. Generalizing these results and quantifying the improvement would be a significant contribution to equating nonequivalent groups. Finally, this study has only considered parametric regression models to estimate the propensity score, and other existing methods should be examined in future research.
As a final note, we point out the recent critique that has been raised toward NEAT-based equating in San Martín and González (2022). With data being partially missing by design in the nonequivalent groups designs, the test score distributions, and thus the equating estimator, are not identified. Most methods, including the methods studied in this article, make identifying assumptions to estimate the score distributions. An alternative approach, suggested in San Martín and González (2022), is to use the theory of partial identification (Manski, 2009) to define identification regions for the equating function. This is a new perspective that we believe sheds light on the discussion of whether equating has any potential to report fair scores under nonequivalent groups designs; see, for example, Bolsinova and Maris (2016). Their approach could also serve as a useful tool to investigate the sensitivity of the identifying assumptions presented in this article.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: This work was supported by Vetenskapsrådet (2020-06484) and Marianne och Marcus Wallenbergs Stiftelse (2019.0129).