Fast Bayesian Estimation for the Four-Parameter Logistic Model (4PLM)

There is rekindled interest in the four-parameter logistic item response model (4PLM) after three decades of neglect in the psychometrics community. Recent breakthroughs in item calibration include a Gibbs sampler tailored to the 4PLM and the Bayes modal estimation (BME) method as implemented in the R package mirt. Unfortunately, MCMC is often time-consuming, while the BME method suffers from instability due to the prior settings. This paper proposes an alternative BME method, the Bayesian Expectation-Maximization-Maximization-Maximization (BE3M) method, which is developed by combining an augmented variable formulation of the 4PLM with a mixture model conceptualization of the 3PLM. The simulations show that the BE3M can produce estimates as accurately as the Gibbs sampling method and as fast as the EM algorithm. A real data example is also provided.

standard errors of the item parameters cannot be calculated, as a consequence of program errors in computing the Hessian matrix, when a Beta distribution prior ("expbeta" in mirt) is imposed on the guessing or slipping parameters.
At present, we propose an alternative Bayesian modal estimation approach that we term the Bayesian Expectation-Maximization-Maximization-Maximization (BE3M) method, which is developed by using a latent variable augmentation approach for the 4PLM in conjunction with the mixture modeling approach for the 3PLM (Zheng et al., 2017). The remainder of the manuscript proceeds as follows: First, we present a detailed derivation of the BME and discuss potential difficulties in item calibration. Second, we present a latent variable augmentation approach to formulating the 4PLM (Béguin & Glas, 2001; Culpepper, 2016), and the resulting BE3M estimation method. Third, we present two simulation studies assessing the accuracy of parameter recovery for the four calibration methods (BE3M, mirt, fourPNO, Stata). Fourth, we apply the four methods to bullying item responses of 7,491 adolescents from the 2005 to 2006 Health Behavior in School-Aged Children (HBSC) study. Lastly, we discuss future research directions and offer concluding remarks.

Bayesian Modal Estimation for the 4PLM
Following Culpepper (2016), the slope-intercept formulation of the 4PLM is

P(Y_ij = 1 | θ_i, ψ_j) = γ_j + (1 − γ_j − ς_j) · exp(α_j θ_i + β_j) / [1 + exp(α_j θ_i + β_j)],

where Y_ij is the observed score of examinee i on item j; ψ_j = (α_j, β_j, γ_j, ς_j) collects the item slope, intercept, guessing, and slipping parameters of item j (j = 1, 2, ..., J), respectively; and θ_i is the ability parameter of examinee i. The marginal likelihood integrates each examinee's contribution over g(θ_i | τ), the density function of θ_i, where τ is the vector containing the parameters of the examinee population ability distribution:

L(ψ, τ | Y) = ∏_i ∫ ∏_j P(Y_ij = 1 | θ, ψ_j)^{Y_ij} [1 − P(Y_ij = 1 | θ, ψ_j)]^{1 − Y_ij} g(θ | τ) dθ.

Please refer to Supplemental Appendix A for a detailed derivation. It is worth pointing out that the BME is a natural extension of the one for the 3PLM, but that the Fisher information matrix is slightly larger (4 × 4 rather than 3 × 3), with small diagonal elements for the guessing and slipping parameters. Taken together, these features produce a great deal of numerical instability in the calculation of the Fisher information matrix, so the BME requires a much larger sample size than the 3PLM to avoid convergence problems. According to Waller and Feuerstahler (2017), the convergence rate for a sample of 1,000, a size that is generally adequate for the 3PLM, is only 9%. The new method in the current paper addresses this issue by reformulating the 4PLM as a latent mixture model.
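As a concrete illustration of the slope-intercept 4PLM response function above, the following minimal sketch evaluates the probability of a correct response (function and parameter names are ours, not from any of the packages discussed):

```python
import numpy as np

def irf_4pl(theta, alpha, beta, gamma, slip):
    """Slope-intercept 4PLM item response function:
    P(Y = 1 | theta) = gamma + (1 - gamma - slip) * logistic(alpha*theta + beta)."""
    p_star = 1.0 / (1.0 + np.exp(-(alpha * theta + beta)))
    return gamma + (1.0 - gamma - slip) * p_star

# The lower asymptote is gamma and the upper asymptote is 1 - slip:
theta = np.array([-8.0, 0.0, 8.0])
p = irf_4pl(theta, alpha=1.5, beta=0.0, gamma=0.2, slip=0.1)
```

At theta = 0 the curve sits exactly halfway between the two asymptotes, here at 0.55.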

The Latent Mixture Modeling Reformulation of the 4PLM
From inspection, we can determine that the 4PLM can be regarded as a latent mixture model. First, note that

γ_j + (1 − γ_j − ς_j) P*_ij = (1 − ς_j) P*_ij + γ_j (1 − P*_ij),

where P*_ij = exp(α_j θ_i + β_j) / [1 + exp(α_j θ_i + β_j)]. Next, we define a latent indicator variable W_ij to indicate whether examinee i knows the answer to item j, where W_ij = 1 if he knows it and 0 if he does not (Béguin & Glas, 2001; Culpepper, 2016), so that P(W_ij = 1 | θ_i, ψ_j) = P*_ij. Thus, we find the joint distribution of Y_ij = 1 and W_ij (conditional on item and person parameters) as follows:

P(Y_ij = 1, W_ij = w | θ_i, ψ_j) = [(1 − ς_j) P*_ij]^w [γ_j (1 − P*_ij)]^{1 − w},  w ∈ {0, 1},

and, analogously, P(Y_ij = 0, W_ij = w | θ_i, ψ_j) = [ς_j P*_ij]^w [(1 − γ_j)(1 − P*_ij)]^{1 − w}. Let w_i be the latent indicator vector for examinee i. Then the joint distribution for the new augmented complete data (Y_i, w_i, θ_i) is the product of these terms over items, multiplied by g(θ_i | τ). Just like the MMLE/EM algorithm, the BE3M algorithm also marginalizes out the latent ability variable, so the likelihood function of the BE3M without priors is

L(ψ | Y, W) = ∏_i ∫ ∏_j P(Y_ij, W_ij | θ, ψ_j) g(θ | τ) dθ.

This formulation follows Culpepper (2016) for the 4PNO. In contrast to his work, however, we now develop a Bayesian modal method to estimate model parameters for the 4PLM. This approach follows the basic premise of an EM algorithm and proposes to split the maximization step into three parts: first maximizing over γ_j, then maximizing over ς_j, and finally jointly maximizing over α_j and β_j. We further use prior distributions for the item parameters in the maximization steps. As such, we call this approach the Bayesian Expectation-Maximization-Maximization-Maximization (BE3M) method.
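The mixture identity underlying this reformulation is easy to verify numerically. The sketch below (our notation) checks that the "knower/non-knower" mixture form reproduces the standard 4PLM curve:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=1000)
alpha, beta, gamma, slip = 1.2, -0.3, 0.15, 0.1

p_star = 1.0 / (1.0 + np.exp(-(alpha * theta + beta)))   # P(W = 1 | theta)
# Mixture form: examinees who know the item (W = 1) answer correctly with
# probability 1 - slip; those who do not (W = 0) guess correctly with gamma.
p_mixture = (1.0 - slip) * p_star + gamma * (1.0 - p_star)
# Standard 4PLM form.
p_4pl = gamma + (1.0 - gamma - slip) * p_star
assert np.allclose(p_mixture, p_4pl)
```

The two forms agree term by term because (1 − ς)P* + γ(1 − P*) = γ + (1 − γ − ς)P*.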

Expectation Step and Artificial Data
In the expectation step, we take the expectation of the log-likelihood over the incomplete portion of the data, the latent variables W_ij and θ_i. In the maximization step, we maximize the posteriors for the item parameters. This maximization process is equivalent to finding the zero of the summation of the first derivatives of the expected log-likelihood and the first derivatives of the priors. Thus, here we find the expected value of the first derivatives of the log-likelihood function. To begin, the first derivative of the complete-data log-likelihood for item j with respect to any of its parameters is the sum over examinees of ∂ log P(Y_ij, W_ij | θ_i, ψ_j)/∂ψ_j. Notice that this expression includes the latent variable W_ij, so when we take the expectation of the log-likelihood, it will be necessary to calculate E(W_ij | Y_ij, θ_i, ψ_j). To find this, we first find the conditional distribution of W_ij using Bayes' rule, P(W_ij = 1 | Y_ij, θ_i, ψ_j) = P(Y_ij, W_ij = 1 | θ_i, ψ_j) / P(Y_ij | θ_i, ψ_j). Thus, we find the conditional expectation of W_ij to be

E(W_ij | Y_ij = 1, θ_i, ψ_j) = (1 − ς_j) P*_ij / [γ_j (1 − P*_ij) + (1 − ς_j) P*_ij],
E(W_ij | Y_ij = 0, θ_i, ψ_j) = ς_j P*_ij / [(1 − γ_j)(1 − P*_ij) + ς_j P*_ij].

Finally, let X_k, k = 1, 2, ..., q, be nodes on the ability scale with associated weights A(X_k). Using the Gauss-Hermite quadrature method to approximate the integral, the expected first derivatives for each item become sums over the nodes weighted by P(X_k | Y_i, ψ, τ), the posterior density of θ_i at quadrature point X_k. Following Bock and Aitkin (1981), we can define several "artificial data" points for our algorithm:

f_jk = Σ_i P(X_k | Y_i, ψ, τ),  r_jk = Σ_i Y_ij P(X_k | Y_i, ψ, τ),

and

f^W_jk = Σ_i E(W_ij | Y_ij, X_k, ψ_j) P(X_k | Y_i, ψ, τ),  r^W_jk = Σ_i Y_ij E(W_ij | Y_ij, X_k, ψ_j) P(X_k | Y_i, ψ, τ).

By taking their expectations, we find that these expressions have very intuitive interpretations: f_jk is the expected number of persons with ability X_k; r_jk is the expected number of persons with ability X_k who answer item j correctly; f^W_jk is the expected number of persons with ability X_k who know the answer to item j; and r^W_jk is the expected number of persons with ability X_k who know the answer to item j and also answer it correctly.
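A minimal numerical sketch of this E-step follows, using a fixed 61-point grid with normal-density weights in place of true Gauss-Hermite nodes, and our own names (`e_step`, `f`, `r`, `fW`, `rW`) for the artificial-data tables; the paper's exact symbols may differ:

```python
import numpy as np

def e_step(Y, alpha, beta, gamma, slip, nodes, weights):
    """One E-step: posterior node weights P(X_k | Y_i) and four artificial-data
    tables of shape (q, J). Notation (f, r, fW, rW) is ours."""
    p_star = 1.0 / (1.0 + np.exp(-(np.outer(nodes, alpha) + beta)))  # P(W=1 | X_k)
    p = gamma + (1.0 - gamma - slip) * p_star                        # P(Y=1 | X_k)
    loglik = Y @ np.log(p).T + (1.0 - Y) @ np.log(1.0 - p).T         # (N, q)
    post = np.exp(loglik) * weights                                  # ∝ P(X_k | Y_i)
    post /= post.sum(axis=1, keepdims=True)
    f = post.sum(axis=0)                      # expected examinees at node k
    r = post.T @ Y                            # expected correct responses, (q, J)
    ew1 = (1.0 - slip) * p_star / p           # E(W | Y = 1, X_k)
    ew0 = slip * p_star / (1.0 - p)           # E(W | Y = 0, X_k)
    fW = ew1 * r + ew0 * (f[:, None] - r)     # expected "knowers"
    rW = ew1 * r                              # knowers who answered correctly
    return f, r, fW, rW

# Synthetic responses from a small 4PLM test.
rng = np.random.default_rng(1)
N, J = 500, 4
theta = rng.normal(size=N)
alpha, beta = np.array([1.0, 1.5, 0.8, 1.2]), np.zeros(J)
gamma, slip = np.full(J, 0.15), np.full(J, 0.10)
p_true = gamma + (1 - gamma - slip) / (1 + np.exp(-(np.outer(theta, alpha) + beta)))
Y = (rng.random((N, J)) < p_true).astype(float)

nodes = np.linspace(-6, 6, 61)                # 61-point grid as in the paper
weights = np.exp(-0.5 * nodes**2)
weights /= weights.sum()
f, r, fW, rW = e_step(Y, alpha, beta, gamma, slip, nodes, weights)
```

By construction, the node counts f sum to N, the knower counts never exceed the node counts, and r^W never exceeds either r or f^W.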

Maximization Step 1: γ_j-Parameters
Let λ_{γ_j} be the summation of the expected first derivatives of the log-likelihood with respect to γ_j and the first derivative of the prior distribution of γ_j. Following usual practice (Baker & Kim, 2004), we use a beta prior with hyperparameters υ(γ_j) and ρ(γ_j). Then,

λ_{γ_j} = Σ_k [ (r_jk − r^W_jk)/γ_j − ((f_jk − f^W_jk) − (r_jk − r^W_jk))/(1 − γ_j) ] + (υ(γ_j) − 1)/γ_j − (ρ(γ_j) − 1)/(1 − γ_j),

where f_jk, r_jk, f^W_jk, and r^W_jk are the artificial data from the expectation step (the expected numbers, at quadrature point X_k, of examinees, of correct responses, of examinees who know the answer, and of examinees who know the answer and respond correctly, respectively). By setting λ_{γ_j} equal to 0 and solving for γ_j, we get a closed-form solution for the guessing parameter:

γ̂_j = [ Σ_k (r_jk − r^W_jk) + υ(γ_j) − 1 ] / [ Σ_k (f_jk − f^W_jk) + υ(γ_j) + ρ(γ_j) − 2 ].

Standard errors will require the expected second derivative as well, which is

E(∂²l/∂γ_j²) = −Σ_k [ (r_jk − r^W_jk)/γ_j² + ((f_jk − f^W_jk) − (r_jk − r^W_jk))/(1 − γ_j)² ] − (υ(γ_j) − 1)/γ_j² − (ρ(γ_j) − 1)/(1 − γ_j)².
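The closed-form update is simply the posterior-mode proportion of correct answers among expected non-knowers. A sketch of the update (our reconstruction from the artificial-data definitions, not the IRTBEMM source):

```python
import numpy as np

def update_gamma(f, r, fW, rW, a=2.0, b=8.0):
    """Closed-form M-step for the guessing parameter under a Beta(a, b) prior:
    among examinees who do NOT know the item, the posterior-mode proportion
    answering correctly. f has shape (q,); r, fW, rW have shape (q, J)."""
    not_know = f[:, None] - fW            # expected non-knowers per node/item
    correct_guess = r - rW                # of those, expected correct answers
    return (correct_guess.sum(axis=0) + a - 1.0) / (not_know.sum(axis=0) + a + b - 2.0)

# Single-node, single-item toy example: 100 examinees, 60 expected knowers,
# 70 expected correct answers, 58 of them from knowers.
f = np.array([100.0])
r, fW, rW = np.array([[70.0]]), np.array([[60.0]]), np.array([[58.0]])
gamma_hat = update_gamma(f, r, fW, rW)    # (12 + 1) / (40 + 8) = 13/48
```

Without the prior (a = b = 1) this reduces to the raw proportion 12/40 = 0.30; the Beta(2, 8) prior shrinks it toward its mode.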

Maximization Step 2: ς_j-Parameters
Let λ_{ς_j} be the summation of the expected first derivatives of the log-likelihood with respect to ς_j and the first derivative of the prior distribution of ς_j. As with the γ_j-parameters, we use a beta prior with hyperparameters υ(ς_j) and ρ(ς_j). Then,

λ_{ς_j} = Σ_k [ (f^W_jk − r^W_jk)/ς_j − r^W_jk/(1 − ς_j) ] + (υ(ς_j) − 1)/ς_j − (ρ(ς_j) − 1)/(1 − ς_j),

where f^W_jk and r^W_jk are the artificial data from the expectation step (the expected numbers, at quadrature point X_k, of examinees who know the answer and of those who know the answer and respond correctly, respectively). By setting λ_{ς_j} equal to 0 and solving for ς_j, we get a closed-form solution for the slipping parameter:

ς̂_j = [ Σ_k (f^W_jk − r^W_jk) + υ(ς_j) − 1 ] / [ Σ_k f^W_jk + υ(ς_j) + ρ(ς_j) − 2 ].

Standard errors will require the expected second derivative as well, which is

E(∂²l/∂ς_j²) = −Σ_k [ (f^W_jk − r^W_jk)/ς_j² + r^W_jk/(1 − ς_j)² ] − (υ(ς_j) − 1)/ς_j² − (ρ(ς_j) − 1)/(1 − ς_j)².
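The slipping update mirrors the guessing update: it is the posterior-mode proportion of incorrect answers among expected knowers. A sketch (our reconstruction, analogous to the guessing step):

```python
import numpy as np

def update_slip(fW, rW, a=2.0, b=8.0):
    """Closed-form M-step for the slipping parameter under a Beta(a, b) prior:
    among examinees who know the item, the posterior-mode proportion answering
    incorrectly. fW and rW have shape (q, J)."""
    missed = fW - rW                      # expected knowers who answered wrong
    return (missed.sum(axis=0) + a - 1.0) / (fW.sum(axis=0) + a + b - 2.0)

# Toy example continued: 60 expected knowers, 58 of whom answered correctly.
fW, rW = np.array([[60.0]]), np.array([[58.0]])
slip_hat = update_slip(fW, rW)            # (2 + 1) / (60 + 8) = 3/68
```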

Maximization Step 3: α_j- and β_j-Parameters
Let α*_j = ln(α_j), and let λ_{α*_j} and λ_{β_j} be the summations of the expected first derivatives of the log-likelihood and the first derivatives of the prior distributions with respect to α*_j and β_j, respectively. In this maximization step, we first solve for the estimate of α*_j and then obtain the α_j parameter via α_j = exp(α*_j). For both the α*_j- and β_j-parameters, we use normal priors with means μ_{α*_j} and μ_{β_j} and variances σ²_{α*_j} and σ²_{β_j}, respectively. Then,

λ_{α*_j} = α_j Σ_k X_k (f^W_jk − f_jk P*_jk) − (α*_j − μ_{α*_j})/σ²_{α*_j}

and

λ_{β_j} = Σ_k (f^W_jk − f_jk P*_jk) − (β_j − μ_{β_j})/σ²_{β_j},

where P*_jk = exp(α_j X_k + β_j)/[1 + exp(α_j X_k + β_j)]. Unlike the guessing and slipping parameters, α*_j and β_j do not have closed-form solutions for their estimates, so it is necessary to find them numerically. The method used here is the Fisher scoring algorithm, which uses the expected values of the second derivatives along with the first derivatives. These will also be used later for standard error calculations. We find the expected second derivatives to be

E(∂²l/∂α*_j²) = −α_j² Σ_k X_k² f_jk P*_jk (1 − P*_jk) − 1/σ²_{α*_j},
E(∂²l/∂β_j²) = −Σ_k f_jk P*_jk (1 − P*_jk) − 1/σ²_{β_j},
E(∂²l/∂α*_j ∂β_j) = −α_j Σ_k X_k f_jk P*_jk (1 − P*_jk).
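The Fisher scoring iteration for (α*_j, β_j) can be sketched for a single item by treating the knower sub-model as a weighted 2PL with artificial "trials" f_jk and "successes" f^W_jk at each node (our reconstruction under the normal priors; all names are illustrative):

```python
import numpy as np

def fisher_scoring_item(nodes, f, fW, mu_a=0.0, s2_a=0.25, mu_b=0.0, s2_b=2.0,
                        alpha0=1.0, beta0=0.0, n_iter=20):
    """Fisher scoring on (alpha* = ln(alpha), beta) for one item, treating the
    knower sub-model as a 2PL with artificial trials f[k] and successes fW[k]
    at each quadrature node, plus normal priors on alpha* and beta."""
    a_star, beta = np.log(alpha0), beta0
    for _ in range(n_iter):
        alpha = np.exp(a_star)
        p = 1.0 / (1.0 + np.exp(-(alpha * nodes + beta)))   # P*(X_k)
        w = f * p * (1.0 - p)                               # information weights
        resid = fW - f * p                                  # expected score residual
        grad = np.array([alpha * np.sum(nodes * resid) - (a_star - mu_a) / s2_a,
                         np.sum(resid) - (beta - mu_b) / s2_b])
        info = np.array([[alpha**2 * np.sum(nodes**2 * w) + 1.0 / s2_a,
                          alpha * np.sum(nodes * w)],
                         [alpha * np.sum(nodes * w),
                          np.sum(w) + 1.0 / s2_b]])
        a_star, beta = np.array([a_star, beta]) + np.linalg.solve(info, grad)
    return np.exp(a_star), beta

# Artificial data generated from a known 2PL (alpha = 1.5, beta = 0.5):
nodes = np.linspace(-4.0, 4.0, 21)
f = np.full(21, 500.0)
fW = f / (1.0 + np.exp(-(1.5 * nodes + 0.5)))
alpha_hat, beta_hat = fisher_scoring_item(nodes, f, fW)
```

With this much artificial data the priors contribute negligibly, so the recovered (α, β) should sit very close to the generating values.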

Standard Errors of Parameter Estimation in BE3M
An important index of estimation quality is the standard error (SE). However, one major criticism of the EM algorithm is that parameter estimate SEs are not a natural product of the algorithm, and so other methods have to be devised (McLachlan & Krishnan, 2007). BE3M falls prey to this criticism, just as all members of the EM algorithm family do. It can be shown that the inverse of the negative expected value of the matrix of second derivatives of the log-likelihood (i.e., the inverse of the information matrix) is the asymptotic covariance matrix of the estimates (Stuart & Kendall, 1968). As such, the square roots of the diagonal elements of the resulting matrix are the asymptotic standard errors of the parameters. Generically, the expected second derivative matrix for a given item j is a 4 × 4 matrix in which each entry is the expected partial second derivative of the log-likelihood with respect to the pair of parameters indicated by its row and column. Convergence issues resulting from the ill-conditioning of this matrix plague the MMLE/EM for the 4PLM just as they do for the 3PLM (Baker & Kim, 2004). In contrast, due to the divide-and-conquer strategy implemented in the BE3M, we contend that the resulting expected second derivative matrix does not have these issues. In the BE3M, the item parameter estimation problem is separated into three smaller distinct estimation problems: (1) estimation of the guessing parameter, (2) estimation of the slipping parameter, and (3) joint estimation of the difficulty and discrimination parameters. This separation has the benefit that any instabilities present in one of the three steps will not negatively affect either of the other two steps. Statistically, this separation implies that the covariances between γ̂_j and α̂*_j, β̂_j, and ς̂_j are zero; similarly, the covariances between ς̂_j and α̂*_j, β̂_j, and γ̂_j are zero.
Collected together, we find the item j expected second derivative matrix Λ_j to be block-diagonal: a 2 × 2 block for (α*_j, β_j) and scalar entries for γ_j and ς_j. The standard errors of the estimates for item j, then, are the square roots of the diagonal elements of the matrix Σ_j = −Λ_j^(−1). Furthermore, the standard error of the α_j parameter can be obtained from Λ_j via the delta method (Oehlert, 1992).
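Because the information matrix is block-diagonal under the BE3M separation, the SE computation decomposes into three independent pieces; the delta-method step converts the SE of α* = ln α into an SE for α. A sketch (our illustration, not package code; the function takes the positive information blocks, i.e., −Λ_j):

```python
import numpy as np

def item_ses(info_gamma, info_slip, info_ab, alpha_hat):
    """SEs for one item from the separated (block-diagonal) expected information.
    info_gamma and info_slip are scalars; info_ab is the 2x2 block for
    (alpha*, beta). SE(alpha) follows from the delta method: d exp(a*)/da* = alpha."""
    cov_ab = np.linalg.inv(np.asarray(info_ab, dtype=float))
    se_a_star, se_beta = np.sqrt(np.diag(cov_ab))
    return {"alpha": alpha_hat * se_a_star,
            "beta": se_beta,
            "gamma": 1.0 / np.sqrt(info_gamma),
            "slip": 1.0 / np.sqrt(info_slip)}

# Hypothetical information values for one item:
ses = item_ses(info_gamma=100.0, info_slip=25.0,
               info_ab=np.array([[4.0, 0.0], [0.0, 1.0]]), alpha_hat=2.0)
```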

Simulation Studies
This section presents Monte Carlo results regarding the accuracy of the developed BE3M procedure for estimating 4PLM item parameters, compared with the BME in the mirt package (version 1.30; Chalmers, 2012), the MCMC in the R package fourPNO (Culpepper, 2016), and the MCMC in the "bayesmh" module of Stata/SE 16. Results from two simulation studies are reported. The goal of the simulation studies was to assess the influence of various design features, including the location of the item threshold, sample size, and priors, in educational (Simulation 1) and psychological (Simulation 2) scenarios.

Item and Person Parameter Generation
For Simulation 1, the item parameters were generated in the same way as in Culpepper (2016). For Simulation 2, we used the item parameters for the 23 items calibrated in Waller and Reise's (2010) application to an adolescent self-esteem scale; these were also used in Culpepper (2016). For both testing scenarios, person parameters for three sample sizes (2,500, 5,000, and 10,000) were sampled from the standard normal distribution.

The Setting of MCMC in R Package fourPNO
Since Culpepper (2016) introduced an extra latent variable and imposed various truncation restrictions on the item parameters, the priors in the fourPNO package differ in obvious ways from those of traditional MCMC procedures (see Culpepper, 2016, for details). Furthermore, since Culpepper (2016) concluded that "no more than 50,000 iterations were needed to achieve R̂ < 1.1 for all item parameters," only one chain was run in the current studies (100,000 MCMC iterations, with the first 50,000 discarded as burn-in).

The Setting of BE3M, BME in R Package mirt and MCMC in Stata
Both the BE3M and mirt used 61 quadrature points from −6 to +6 to approximate the Gaussian distribution in all simulation studies and real-data examples. The priors for the α and β parameters were set as ln α ~ N(0, 0.25) and β ~ N(0, 2). Furthermore, two different priors for γ and ς were used in the BE3M, mirt, and Stata: (1) γ, ς ~ Beta(2, 8), which is the default setting for the guessing parameter in the advanced IRT software flexMIRT (Houts & Cai, 2015), and (2) γ, ς ~ Beta(4, 16), which is the default setting for the guessing parameter in the classical IRT software BILOG-MG3 (Zimowski et al., 2003). In sum, for both the educational and psychological testing scenarios, 21 conditions were simulated: seven estimation method conditions (fourPNO, BE3M with prior conditions 1 and 2, mirt with prior conditions 1 and 2, and Stata with prior conditions 1 and 2) crossed with three examinee sample size conditions (2,500, 5,000, and 10,000). For each condition, R = 100 replications were run to reduce sampling error. Each condition was assessed by its accuracy in recovering the individual item parameters, as measured by the bias and root mean squared error (RMSE) across the replications:

Bias(x̂) = (1/R) Σ_r (x̂_r − x)  and  RMSE(x̂) = sqrt[ (1/R) Σ_r (x̂_r − x)² ],

where x̂_r is the estimate of a generic parameter x in replication r. Furthermore, since Waller and Feuerstahler (2017) pointed out that recovery of the individual parameters cannot fully reflect the accuracy of fitting the 4PLM, the current studies also introduced the root integrated mean squared error (RIMSE; Ramsay, 1991) to assess the overall accuracy via the item response function of the 4PLM:

RIMSE_j = sqrt{ (1/R) Σ_r Σ_k [P̂_jr(X_k) − P_j(X_k)]² A(X_k) },

where P̂_jr is the estimated item response function of item j in replication r and P_j is the true one.
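The bias, RMSE, and an RIMSE-style summary can be sketched as follows (our implementation of the generic formulas; the paper's exact RIMSE weighting may differ):

```python
import numpy as np

def bias_rmse(est, true):
    """est: estimates of one parameter across replications; true: scalar truth."""
    err = np.asarray(est) - true
    return err.mean(), np.sqrt((err ** 2).mean())

def rimse(est_params, true_params, nodes, weights):
    """Quadrature-weighted root mean squared difference between estimated and
    true 4PLM item response functions. Params are (alpha, beta, gamma, slip)."""
    def irf(params, th):
        a, b, g, s = params
        p_star = 1.0 / (1.0 + np.exp(-(a * th + b)))
        return g + (1.0 - g - s) * p_star
    diff = irf(est_params, nodes) - irf(true_params, nodes)
    return np.sqrt(np.sum(weights * diff ** 2) / np.sum(weights))

# Toy check: zero error when estimates equal truth, positive otherwise.
nodes = np.linspace(-6, 6, 61)
weights = np.exp(-0.5 * nodes**2)
b0, rm0 = bias_rmse([1.0, 3.0], 2.0)
rimse_zero = rimse((1.0, 0.0, 0.1, 0.1), (1.0, 0.0, 0.1, 0.1), nodes, weights)
rimse_pos = rimse((1.2, 0.0, 0.1, 0.1), (1.0, 0.0, 0.1, 0.1), nodes, weights)
```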

Results
The results of the two simulation studies are reported at two levels: the individual level for each type of parameter, and the overall level for the item response function of the 4PLM. For the individual level, the detailed results for the two simulation studies are presented in tables in Supplemental Appendices B and C, respectively. Only the RMSEs (Tables 1 and 2) for the condition with 5,000 examinees are summarized and presented here, since those for the other conditions are very similar. The results of the two simulation studies can be briefly summarized as follows: (1) In general, the results of the MCMC in the fourPNO package replicate the previous study in both scenarios; (2) the item parameter recovery of the BE3M is very similar to that of the MCMC, although the RMSEs of the α parameters yielded by the fourPNO package and Stata are slightly larger than those from the BE3M in the first simulation study; (3) the mirt package and the BE3M have similar bias and RMSEs for the α and β parameters, but the BE3M performs better in estimating the γ and ς parameters; and (4) compared with the mirt package, the BE3M and Stata provide more stable estimates when the priors for the γ and ς parameters are changed.
For the overall level, the RIMSEs for the two simulation studies are presented in Figure 1. In general, as the sample size increases, the RIMSEs of the BE3M, fourPNO, Stata, and the mirt package gradually decrease, and all of the RIMSEs across the different conditions are relatively small. Furthermore, the RIMSE results show no obvious difference between the BE3M and the BME in the R package mirt, although the BE3M performs better in estimating the asymptotes, which is consistent with Waller and Feuerstahler (2017).

Real Data Example
The 4PLM is applied to bullying items collected as part of the 2005 to 2006 Health Behavior in School-Aged Children study (Iannotti, 2005) funded by the National Institute of Child Health and Human Development. This real data example was chosen to replicate previous results (Culpepper, 2016), and to demonstrate comparability between BE3M and the MCMC approaches (fourPNO package and "bayesmh" module in Stata) in terms of estimation accuracy and BE3M's advantage in terms of estimation time. The prior settings for the "bayesmh" module in Stata are the same as those for BE3M and BME in the simulation studies. To further compare BE3M with BME, items were also calibrated with BME in the mirt package.
In addition, as we discussed in the introduction, SEs are not available for all item parameters in the mirt package because the Beta priors ("expbeta" in mirt) were imposed on the guessing and slipping parameters in this example.
The data include student responses to ten items related to bullying behavior for 7,491 adolescents. The ten bullying items asked students "How often have you bullied another student(s) at school in the past couple of months in the ways listed below?" The items were dichotomized as "1 → 0" and ">1 → 1" to indicate that the student has not bullied or has bullied, respectively, in the given way over the past couple of months. For this application, the latent variable W_ij can be interpreted as indicating whether student i actually bullied in the way described by item j, so that ς_j is the probability that a student who bullied over the past couple of months neglected to report bullying. On the other hand, γ_j can be expected to be near zero in this application, because it is unlikely for students who did not bully to report bullying behaviors (Culpepper, 2016).
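The "1 → 0, >1 → 1" dichotomization can be expressed in one line; the raw response values below are hypothetical, standing in for the HBSC frequency codes:

```python
import numpy as np

# Hypothetical raw responses on a frequency scale (1 = "never", higher = more often).
raw = np.array([[1, 2, 1],
                [5, 1, 3]])
# 1 -> 0 (did not bully in that way); >1 -> 1 (did bully in that way).
Y = (raw > 1).astype(int)
```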

Results
Item calibration results from the three methods are summarized in Table 3. The estimates and SEs yielded by fourPNO are the same as in Culpepper's (2016) study because we used the identical package (fourPNO) and estimation conditions. The BE3M produces point estimates comparable to both MCMC approaches, and the estimates from the BE3M and Stata remain stable when the priors are changed. Unlike the performance of the mirt package in the simulation studies, it is interesting to note that changing the priors has no obvious impact on the guessing and slipping parameters, but rather on the slope parameters of Items 1 and 2, which is perhaps caused by the unstable Hessian matrix of the BME method. The implication is that priors should be chosen carefully if one uses the mirt package (which implements the BME method) to pursue precise estimates of the 4PLM. In terms of SEs, the two priors do not seem to influence the quality of estimation, with similar values for the SEs in general. We also find that the SEs for the MCMC approaches are similar to those obtained with the BE3M, with the exception of the slope parameters.
The main finding is that BE3M is much more computationally efficient than the MCMC. The Gibbs sampler in the fourPNO package required approximately 47 minutes to complete 100,000 iterations and the MH sampler in Stata required approximately 3 hours to complete 10,000 iterations with N = 7,491 using a 2.4 GHz processor and 6 GB of RAM while BE3M only needed about 1 minute. We hope that this finding in particular will help to promote further applications of the 4PLM in practice.

Discussion
In recent years, interest in the four-parameter IRT model has been on the rise in measurement research. The current study presented a new Bayesian formulation of the 4PLM. Through two simulation studies and a real data analysis, we demonstrated that the BE3M combines the strengths of the two existing approaches: it can produce estimates as accurate as those of the Gibbs sampling method, as quickly as the Bayes modal estimation approach in mirt. To facilitate calibration, the authors have made available an R package, IRTBEMM (to be published on CRAN soon). The authors believe this package offers a better alternative for 4PLM calibration, helping to promote the use of the 4PLM in educational and psychological research.
Several opportunities exist for future research related to the 4PLM. First, future research can help to identify other substantive research areas and large survey datasets that could benefit from modeling the slipping effect (Beghetto, 2019). Second, for successful and highly efficient integration into a CAT testing system, the ability to implement online calibration of item parameters is important (Zheng, 2016). Since the method proposed here is a member of the EM family, to which most online calibration algorithms belong, there is little barrier to developing online calibration methods based on the BE3M.

Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent
This article does not contain any studies with human participants performed by any of the authors.

Data Availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partially supported by the Flower of Happiness Project in social science of East China Normal University (2020ECNU-XFZH007, 2019ECNU-XFZH015) and the Peak Discipline Construction Project of Education at East China Normal University.

Supplemental Material
Supplemental material for this article is available online.