Exploratory cognitive diagnosis models (CDMs) estimate the Q matrix, which is a binary matrix that indicates the attributes needed for affirmative responses to each item. Estimation of Q is an important next step for improving classifications and broadening application of CDMs. Prior research primarily focused on an exploratory version of the restrictive deterministic-input, noisy-and-gate model, and research is needed to develop exploratory methods for more flexible CDMs. We consider Bayesian methods for estimating an exploratory version of the more flexible reduced reparameterized unified model (rRUM). We show that estimating the rRUM Q matrix is complicated by a confound between elements of Q and the rRUM item parameters. A Bayesian framework is presented that accurately recovers Q using a spike–slab prior for item parameters to select the required attributes for each item. We present Monte Carlo simulation studies, demonstrating the developed algorithm improves upon prior Bayesian methods for estimating the rRUM Q matrix. We apply the developed method to the Examination for the Certificate of Proficiency in English data set. The results provide evidence of five attributes with a partially ordered attribute hierarchy.

Cognitive diagnosis models (CDMs) offer a useful psychometric framework for classifying individuals on a collection of fine-grained binary attributes. A critical component for applying existing CDMs is the specification of a binary Q matrix, which indicates the attributes required for each item. Correctly specifying Q is essential for ensuring accurate diagnoses and model parameter estimates (Henson & Templin, 2007a; Rupp & Templin, 2008). Prior CDM research mainly relied upon content expert knowledge to specify Q. However, specifying Q is a challenging task. Cognitive theory may be too underdeveloped to offer guidance for specifying Q. Additionally, even in cases where cognitive theory is available, there is no guarantee that content experts will agree on all entries of a Q matrix. The general unavailability of Q for most content areas and data sets poses a barrier to widespread applications of CDMs.

Clearly, a fundamental problem for psychometric research and practice is the estimation of Q. Recent research addressed this critical issue for CDMs by developing procedures for validating a specified Q (Chiu, 2013; de la Torre, 2008; de la Torre & Chiu, 2016), estimating a subset of elements of Q (DeCarlo, 2012; Templin & Henson, 2006), estimating Q for the deterministic-input, noisy-and-gate (DINA) model (Y. Chen, Culpepper, Chen, & Douglas, 2016; Chung, 2014; J. Liu, Xu, & Ying, 2012, 2013; Xiang, 2013), and estimating Q for more general CDMs (Y. Chen, Liu, Xu, & Ying, 2015; Xu & Shang, 2017). Estimation of the DINA Q was an important step for broadening the application of CDMs. However, the DINA model is one of the more restrictive CDMs. That is, if K denotes the number of binary attributes, the DINA model assumes that responses for members of the 2^K latent classes are described by only two parameters for each item, which may be too restrictive in some cases. Consequently, there is an opportunity to develop exploratory methods for estimating Q for more flexible CDMs. One alternative to the DINA is the reduced reparameterized unified model (rRUM). The rRUM has at most K + 1 item parameters to describe response probabilities for the 2^K classes and is therefore a more flexible model than the DINA. It should be noted that the rRUM is a special case of several more general CDMs (de la Torre, 2011; Henson, Templin, & Willse, 2009; von Davier, 2008; Xu, 2017) that offer even greater flexibility by estimating at most 2^K parameters per item. We consider the rRUM in this article given that the model is more flexible than the DINA and estimation of the rRUM Q is an important development for estimating Q for more general CDMs.

We develop a Bayesian method for estimating the rRUM Q, which we refer to as the exploratory rRUM (i.e., E-rRUM). The developed E-rRUM explicitly addresses three issues associated with estimating Q. First, the item response function (IRF) for the rRUM presents a unique identifiability issue that must be addressed in order to accurately estimate Q. In the next section, we show that a confound arises when attempting to jointly estimate elements of Q and the corresponding rRUM item parameters. Consequently, accurate recovery of the rRUM Q requires directly addressing the confound between model parameters. We employ a spike–slab prior for the item parameters to address the confounding with elements of Q and to select which attributes are needed for which items.

Second, Y. Chen, Culpepper, Chen, and Douglas (2016) report Monte Carlo evidence that the accuracy of Bayesian estimation of the DINA Q matrix declines as the sample size N and number of attributes K increase, which is undesirable given that more flexible CDMs such as the rRUM may require more data for accurate parameter recovery. We use Bayesian model selection procedures to more accurately estimate Q using Markov chain Monte Carlo (MCMC). In fact, our Monte Carlo simulation results suggest that recovery of Q improves with larger N.

Third, in our preliminary investigations, we found that estimating Q is more difficult whenever the associations among the attributes satisfy a more parsimonious structure. Accordingly, there may be some instances where a parsimonious model for attributes provides better fit than an unstructured model (e.g., see Culpepper & Hudson, 2017). We incorporate a higher order model (de la Torre & Douglas, 2004; Maris, 1999) for attributes to provide a more parsimonious model for the 2^K latent classes and offer a new Gibbs sampling algorithm to estimate model parameters.

The remainder of this article includes five sections. The first section discusses model identifiability issues associated with estimating the rRUM Q. In the second section, we introduce a Bayesian formulation for estimating the E-rRUM model parameters. The third section reports results from a Monte Carlo simulation study and provides evidence the developed procedures outperform prior strategies (e.g., see Y. Chen, Liu, et al., 2015; Chung, 2014) for estimating the rRUM Q. The fourth section presents an application of the E-rRUM to the Examination for the Certificate of Proficiency in English (ECPE) data (Templin & Hoffman, 2013) and compares the model fit of the E-rRUM with the rRUM using an expert-specified Q and a two-parameter item response theory (IRT) model. The final section discusses the implications of this study and provides concluding remarks.

Let Y_ij be the observed binary response for individual i (1 ≤ i ≤ N) to item j (1 ≤ j ≤ J). Let α_i = (α_i1, ..., α_iK), where α_i ∈ {0, 1}^K and α_ik is the latent binary attribute for individual i on attribute k (1 ≤ k ≤ K). Furthermore, let q_j = (q_j1, ..., q_jK) be the jth row of Q such that q_jk = 1 if attribute k is required for item j and 0 otherwise. The IRF for the rRUM is

P(Y_ij = 1 | α_i, π_j*, r_j*, q_j) = π_j* ∏_{k=1}^{K} (r_jk*)^{(1 − α_ik) q_jk},   (1)

where π_j* is the probability that Y_ij = 1 when individual i has all the required attributes, r_j* = (r_j1*, ..., r_jK*), and 0 ≤ r_jk* ≤ 1 is a penalty parameter for missing required attribute k.
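As a minimal sketch of Equation 1 (the function name and parameter values below are ours, chosen for illustration), the rRUM IRF can be coded directly:

```python
def rrum_irf(alpha, pi_star, r_star, q):
    """P(Y_ij = 1 | alpha_i) for the rRUM (Equation 1).

    alpha  : list of 0/1 attribute indicators for one examinee
    pi_star: probability of a correct response given all required attributes
    r_star : penalty parameters, one per attribute (0 <= r* <= 1)
    q      : the item's row of the Q matrix (0/1 entries)
    """
    prob = pi_star
    for a, r, qk in zip(alpha, r_star, q):
        # the penalty r*_jk applies only when attribute k is required
        # (q_jk = 1) but not possessed (alpha_ik = 0)
        prob *= r ** ((1 - a) * qk)
    return prob

# A master of all required attributes responds correctly with probability pi*;
# each missing required attribute multiplies in its penalty.
assert rrum_irf([1, 1, 1], 0.9, [0.6, 0.6, 0.6], [1, 1, 0]) == 0.9
```

Missing the first required attribute here would multiply 0.9 by the penalty 0.6, giving 0.54.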

One observation from Equation 1 is that there is a confound between q_jk and r_jk*. Specifically, it is impossible to distinguish between rRUM IRFs with q_jk = 0 versus q_jk = 1 and r_jk* = 1, given that

P(Y_ij | α_i, r_j*, π_j*, q_j, r_jk* = 1, q_jk = 1) = P(Y_ij | α_i, r_j*, π_j*, q_j, q_jk = 0).   (2)

Clearly, the estimation of q_jk is confounded with r_jk*. Furthermore, researchers can expect empirical identifiability issues whenever r_jk* ≈ 1. In fact, during our preliminary investigations estimating the rRUM Q via MCMC, we generally found that a true q_jk = 0 was estimated as q_jk = 1 with a corresponding r_jk* ≈ 1. Therefore, successful estimation of the rRUM Q must directly address the confound between q_jk and r_jk*.
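The identifiability issue of Equation 2 can be made concrete with a small numeric check (the values and the `rrum_irf` helper below are our own illustration of Equation 1, not code from the article):

```python
def rrum_irf(alpha, pi_star, r_star, q):
    # rRUM IRF of Equation 1: penalties apply to required, unpossessed attributes
    prob = pi_star
    for a, r, qk in zip(alpha, r_star, q):
        prob *= r ** ((1 - a) * qk)
    return prob

alpha = [0, 1]   # examinee lacks attribute 1, possesses attribute 2
pi_star = 0.9

# Case A: attribute 1 is "required" but carries no penalty (r* = 1)
p_a = rrum_irf(alpha, pi_star, [1.0, 0.6], [1, 1])
# Case B: attribute 1 is simply not required (q = 0); its r* is then irrelevant
p_b = rrum_irf(alpha, pi_star, [0.3, 0.6], [0, 1])

# The two IRFs are indistinguishable (Equation 2), for every attribute pattern
assert p_a == p_b
```

No amount of data can separate Case A from Case B, which is why the estimation strategy below fixes q_jk = 1 and instead decides whether r_jk* = 1.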

In the next section, we present a strategy for estimating the E-rRUM. That is, we fix the elements of Q equal to 1 (i.e., q_jk = 1 for all j and k) and perform model selection to determine whether the associated item parameter is r_jk* = 1 (i.e., the attribute is not needed for the item) or 0 ≤ r_jk* < 1 (i.e., the attribute is needed for the item).

Bayesian Formulation

Overview

This subsection discusses the following three components of the E-rRUM: models for (1) the observed Y_ij, (2) the latent structure for α_i, and (3) the item parameters π* = (π_1*, ..., π_J*) and r*.

Model for the observed Y_ij

The E-rRUM employs a data augmentation strategy to model the rRUM IRF (e.g., see Culpepper & Hudson, 2017). The rRUM is a generalized noisy input, deterministic “AND” gate model, which assumes that observed responses are deterministically related to a collection of latent responses according to an “AND” logic gate and that the latent responses are stochastically related to the underlying attributes. Accordingly, let X_ijk be a latent response that equals 1 if individual i correctly applies attribute k to item j and 0 otherwise, let X_ij = (X_ij1, ..., X_ijK) be the random vector of latent responses for individual i on item j, and let x_ij be a realized value. We use the conjunctive condensation rule (i.e., an “AND” logic gate) to deterministically relate the observed Y_ij to the latent responses,

Y_ij = ∏_{k=1}^{K} X_ijk.   (3)

Unlike Culpepper and Hudson (2017), the conjunctive condensation rule in Equation 3 is not a function of the Q matrix. Instead, we fix q_jk = 1 in Equation 3 to address the confound between r_jk* and q_jk. Consequently, as discussed below, under our formulation, “inactive” latent responses are characterized by X_ijk = 1.

We model each latent response as conditionally independent given attributes and augmented item parameters,

X_ijk | α_ik, β_0jk, β_1jk ~ Bernoulli[Φ(β_0jk + β_1jk α_ik)],   (4)

where Φ(·) is the standard normal cumulative distribution function, β_0jk is an intercept, and β_1jk is the impact of possessing the attribute on the probability of a correct latent response. Under this formulation, the probability of “guessing” latent response k on item j is Φ(β_0jk), and the probability of slipping is 1 − Φ(β_0jk + β_1jk).

We use a probit data augmentation strategy to model X_ijk (e.g., see Albert, 1992). That is, we define a deterministic relationship between X_ijk and a latent augmented variable, X_ijk*, such that X_ijk = I(X_ijk* > 0). The prior for the continuous augmented latent response is X_ijk* | α_ik, β_0jk, β_1jk ~ N(β_0jk + β_1jk α_ik, 1).
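As a concrete illustration of Equations 3 and 4, the sketch below (our own code, with hypothetical β values) simulates latent responses through the probit augmentation and condenses them with the AND gate; for a master of all attributes, the mean of Y matches ∏_k Φ(β_0jk + β_1jk):

```python
import math
import random

random.seed(1)

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_response(alpha, beta0, beta1):
    """One Y_ij: draw the augmented X*_ijk ~ N(b0 + b1*alpha_k, 1),
    threshold to X_ijk = I(X*_ijk > 0), then apply the AND gate
    Y_ij = prod_k X_ijk (Equation 3)."""
    x = []
    for a, b0, b1 in zip(alpha, beta0, beta1):
        x_star = random.gauss(b0 + b1 * a, 1.0)  # augmented latent datum
        x.append(1 if x_star > 0 else 0)
    return min(x)  # for 0/1 values, min() equals the product

# With all attributes possessed, each component succeeds w.p. Phi(b0 + b1)
alpha = [1, 1, 1]
beta0, beta1 = [0.5, 0.5, 0.5], [1.5, 1.5, 1.5]
mean_y = sum(simulate_response(alpha, beta0, beta1) for _ in range(20000)) / 20000
assert abs(mean_y - Phi(2.0) ** 3) < 0.02
```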

Model for the latent structure of α_i

We consider two different models for the latent attributes α_i. First, we consider an unstructured model for α_i (e.g., see Culpepper, 2015; Culpepper & Hudson, 2017), as presented in Equations 5 and 6:

p(α_i | π) = ∏_{c=0}^{2^K − 1} π_c^{I(α_i′v = c)},   v = (2^{K−1}, ..., 2, 1),   (5)
π = (π_0, π_1, ..., π_{2^K − 1}) ~ Dirichlet(d_0),   d_0 = (d_00, ..., d_{0,2^K−1}),   (6)

where I(·) is the indicator function, π_c = P(α_i′v = c), and the K-vector v is used to create a bijection between the 2^K classes and the integers between 0 and 2^K − 1 (Chung, 2014; von Davier, 2014a). Prior information about class probabilities can be encoded in d_0. We use a uniform prior for π with d_0 = 1_{2^K}, a 2^K-vector of ones.
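The bijection via v can be sketched as follows (the helper names are ours):

```python
def class_index(alpha):
    """Map an attribute pattern to its class index via alpha'v,
    with v = (2^(K-1), ..., 2, 1) as in Equation 5."""
    K = len(alpha)
    v = [2 ** (K - 1 - k) for k in range(K)]
    return sum(a * vk for a, vk in zip(alpha, v))

def attribute_pattern(c, K):
    """Inverse map: recover the K-vector of attributes from a class index."""
    return [(c >> (K - 1 - k)) & 1 for k in range(K)]

assert class_index([0, 0, 0]) == 0
assert class_index([1, 0, 1]) == 5
assert attribute_pattern(5, 3) == [1, 0, 1]
```

Each of the 2^K attribute patterns thus corresponds to exactly one integer class label, which is what allows π to be indexed by c = 0, ..., 2^K − 1.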

In practice, the latent class probabilities may satisfy some underlying structure that is better described by a more parsimonious model. At least two alternatives have been proposed in the literature. For instance, Henson, Templin, and Willse (2009) and Templin, Henson, Templin, and Roussos (2008) describe a multivariate probit model underlying α_i, whereas de la Torre and Douglas (2004) and Maris (1999) assume the binary attributes are independent when conditioned upon higher order factors. We consider the higher order factor model as an alternative to the unstructured model in Equations 5 and 6, where the elements of α_i are assumed independent given a p-vector of higher order factors, ξ_i = (ξ_i1, ..., ξ_ip),

p(α_i | ξ_i, τ, Λ) = ∏_{k=1}^{K} p(α_ik | ξ_i, τ, Λ),   (7)

where τ = (τ_1, ..., τ_K) is a vector of thresholds and Λ = (Λ_1, ..., Λ_K)′ is a K × p loading matrix. There are many options for the parametric form of the model relating α_ik to ξ_i. We use a probit model to relate latent factors to attributes, that is, p(α_ik = 1 | ξ_i, τ_k, Λ_k) = Φ(Λ_k′ξ_i − τ_k).

Our formulation for the higher order probit model follows strategies from Bayesian IRT models (Albert, 1992; Béguin & Glas, 2001; Culpepper, 2016). That is, we introduce a normally distributed augmented variable α_ik* and specify a deterministic relationship between α_ik and α_ik*. Let α_i* = (α_i1*, ..., α_iK*) be a K-vector of augmented variables for individual i. Similar to de la Torre and Douglas (2004) and Templin et al. (2008), we consider a single higher order factor ξ_i (i.e., p = 1) for individual i and let ξ = (ξ_1, ..., ξ_N) (note that the formulation below can be easily extended to the p > 1 case). Additionally, let λ = (λ_1, ..., λ_K) be a K-vector of loadings for the single factor. The Bayesian formulation for the higher order probit model with a single factor follows:

α_ik = I(α_ik* > 0),   (8)
α* = (α_1*, ..., α_N*)′ | ξ, λ, τ ~ MN_{N×K}(ξλ′ − 1_N τ′, I_N, I_K),   (9)
ξ ~ N_N(0_N, I_N),   (10)
τ ~ N_K(0_K, σ_τ² I_K),   (11)
λ ~ N_K(0_K, σ_λ² I_K) ∏_{k=1}^{K} I(λ_k > 0).   (12)

The deterministic relationship between α_ik and α_ik* in Equation 8 and the matrix normal prior for the N × K matrix α* in Equation 9 together imply the probit models for the higher order single-factor model. Equation 10 specifies a multivariate normal prior for ξ across individuals with an N-vector of 0s (i.e., 0_N) as the expected value and an N × N identity matrix as the variance–covariance matrix (i.e., the higher order factors have unit variance, and independence is assumed between individuals). Equations 11 and 12 specify multivariate normal priors for the latent thresholds τ and loadings λ with means 0_K and variance–covariance matrices σ_τ² I_K and σ_λ² I_K, where σ_τ² and σ_λ² are specified constants. Note the identification restriction that each λ_k > 0.
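To illustrate the single-factor higher order model of Equations 8 through 12, the sketch below (our own code; the loading and threshold values are hypothetical) generates attribute vectors and checks one implied marginal probability:

```python
import math
import random

random.seed(7)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def draw_attributes(lam, tau):
    """Draw alpha_i under the single-factor higher order probit model:
    xi_i ~ N(0, 1); alpha*_ik ~ N(lam_k * xi_i - tau_k, 1);
    alpha_ik = I(alpha*_ik > 0). The shared xi_i induces dependence."""
    xi = random.gauss(0.0, 1.0)
    return [1 if random.gauss(l * xi - t, 1.0) > 0 else 0
            for l, t in zip(lam, tau)]

lam, tau = [1.0, 1.0, 1.0], [-0.5, 0.0, 0.5]
draws = [draw_attributes(lam, tau) for _ in range(20000)]

# Marginally alpha*_ik ~ N(-tau_k, lam_k^2 + 1), so
# P(alpha_ik = 1) = Phi(-tau_k / sqrt(lam_k^2 + 1)).
p1 = sum(d[0] for d in draws) / 20000
assert abs(p1 - Phi(0.5 / math.sqrt(2.0))) < 0.02
```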

Prior for π* and r*

We formulate a prior for the item parameters by using a data augmentation strategy that defines π_j* and r_jk* as functions of augmented item parameters (Hartz, 2002). Specifically, the deterministic functions relating the augmented item parameters to π_j* and r_jk* are

π_j* = ∏_{k=1}^{K} Φ(β_0jk + β_1jk),   r_jk* = Φ(β_0jk) / Φ(β_0jk + β_1jk),   k = 1, ..., K.   (13)

As noted above, we estimate the E-rRUM Q by determining which attributes are active (i.e., 0 ≤ r_jk* < 1) versus inactive (i.e., r_jk* = 1) for each item. The extent to which r_jk* is active is characterized by different combinations of values for β_0jk and β_1jk. For instance, Equation 13 shows that r_jk* is inactive (i.e., r_jk* = 1) whenever β_1jk = 0. Additionally, r_jk* ≈ 1 in cases where β_0jk is large and positive. For instance, values of β_0jk > 1.96 correspond with guessing probabilities for the latent response X_ijk exceeding 0.95 and r_jk* near 1. In contrast, r_jk* is active when β_1jk > 0 and the guessing probability is not too large (e.g., β_0jk is not large and positive).
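Equation 13 can be coded directly. The sketch below (the function name is ours) maps augmented item parameters to (π_j*, r_jk*) and confirms the two routes to an inactive attribute just described:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def item_params(beta0, beta1):
    """Map augmented item parameters to (pi*_j, r*_j) via Equation 13."""
    pi_star = 1.0
    for b0, b1 in zip(beta0, beta1):
        pi_star *= Phi(b0 + b1)
    r_star = [Phi(b0) / Phi(b0 + b1) for b0, b1 in zip(beta0, beta1)]
    return pi_star, r_star

# beta1 = 0 makes r* = 1 exactly (an inactive attribute) ...
_, r = item_params([0.5], [0.0])
assert abs(r[0] - 1.0) < 1e-12
# ... and a large positive beta0 drives r* toward 1 even when beta1 > 0
_, r = item_params([3.0], [1.0])
assert r[0] > 0.99
```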

These observations imply we can specify a spike–slab prior (e.g., see George & McCulloch, 1993) for β_0jk and β_1jk to select whether r_jk* is active versus inactive. Let δ_jk be an augmented binary variable that indicates whether attribute k is needed for item j. Specifically, if attribute k is needed for item j, then δ_jk = 1, β_1jk > 0, and β_0jk is not large and positive. Otherwise, if δ_jk = 0, then β_1jk is near 0 and β_0jk is large in the positive direction. The spike–slab prior to characterize the activeness of r_jk* is

p(β_0jk, β_1jk | δ_jk) ∝ δ_jk f_1(β_0jk, β_1jk) + (1 − δ_jk) f_0(β_0jk, β_1jk),   (14)

where f1 denotes the “slab” and f0 is the “spike.”

O’Hara and Sillanpää (2009) review several popular approaches for Bayesian variable selection in the statistical literature, and we follow their recommendation to use a “stochastic search variable selection” prior specification. That is, we assume the augmented item parameters are independent and define the slab as β_0jk ~ N(0, 1/v_01) and β_1jk ~ N(0, 1/v_11) I(β_1jk > 0) and the spike as β_0jk ~ N(0, 1/v_00) and β_1jk ~ N(0, 1/v_10) I(β_1jk > 0), where v_00, v_01, v_10, and v_11 are constants. O’Hara and Sillanpää (2009) note that the scale constants must be tuned. We found in the Monte Carlo simulation study that using the values v_00 = 0.1, v_10 = 10, v_01 = 4, and v_11 = 2 accurately selected the active and inactive r_jk*. The values for the constants have a practical interpretation as well. We noted that active r_jk* are less than 1 when β_0jk is not too large and β_1jk is positive. Therefore, setting v_01 = 4 and v_11 = 2 corresponds to prior variances for β_0jk and β_1jk of 0.25 and 0.5, respectively, which matches the notion of an active r_jk*. In contrast, v_00 = 0.1 implies the variance of β_0jk is 10, and v_10 = 10 implies the variance of β_1jk is 0.1, so the spike allows β_0jk to be large and positive and β_1jk to be near 0, as characterized by an inactive r_jk*.

The spike–slab model selection parameters δjk for j=1,...,J and k=1,...,K are conditionally independent Bernoulli random variables with probability ω,

δ_jk | ω ~ Bernoulli(ω), independently over j and k,   (15)
ω ~ Beta(a, b).   (16)

Furthermore, ω follows a Beta distribution with prior parameters a and b. We employed a uniform prior for ω in the simulation study and application by setting a = b = 1.

The spike–slab approach in Equations 14 through 16 offers at least two benefits. First, the prior directly addresses the confound when estimating q_jk and r_jk*. Specifically, the elements of Q are set to 1, and δ_jk is used to infer which attributes are needed for which items. Second, each δ_jk can be directly used to perform model selection. For instance, one option for estimating q_jk is to use the posterior average δ̄_jk and set q̂_jk = I(δ̄_jk > 0.5).

Approximating the Posterior Distribution

We use Gibbs sampling to estimate the E-rRUM parameters. Specifically, the algorithm sequentially updates the (1) augmented latent responses X_ij, (2) latent attributes α_i under either the unstructured or higher order factor model, (3) rRUM item parameters, and (4) spike–slab parameters for performing model selection. Additional details regarding each step are available in Appendix A of the online version of the journal.

Overview

We report results from a Monte Carlo simulation study in this section to assess the relative performance of the developed Bayesian algorithms against a Bayesian method that employs a Metropolis–Hastings (MH) sampler (Chung, 2014) and an L1 regularization estimator (Y. Chen, Liu, et al., 2015). Chung (2014) used MH sampling to update the item parameters π* and r* and a Gibbs sampler for updating rows of Q by sampling q_j from the 2^K − 1 nonzero configurations. The L1 estimator penalizes the likelihood function for model complexity, and simulation evidence suggests the approach is accurate for recovering the DINA model Q. It is important to note that the MH sampler does not explicitly address the confound between Q and r*, so we expect the E-rRUM to be more accurate.

In the simulation study, we considered two sample sizes (i.e., N = 1,000 and 2,000), four levels of attribute dependence (i.e., one condition with uniformly distributed attributes and three conditions with dependence generated from the multivariate normal threshold model with a latent correlation of ρ = .05, .15, and .25), and two values for the total number of attributes (i.e., K = 3 and 4). For the conditions with uniformly distributed attributes, attributes were uniformly sampled from the 2^K classes. In contrast, dependent attributes were generated from the latent multivariate normal probit model (e.g., see Chiu, Douglas, & Li, 2009) with thresholds for attribute k defined by k/(K + 1), a mean vector of zero, and a constant correlation among the attributes of ρ. Following Chiu and Köhn (2016), the item parameters were defined as π_j* = 0.9 and r_jk* = 0.6. Furthermore, J = 30 for all conditions, and the true Q was sampled in each replication by first randomly placing three I_K matrices into the rows of Q and then filling the remaining rows by sampling uniformly from the 2^K − 1 nonzero arrangements.
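The dependent-attribute generation can be sketched as follows (our own code; we read the thresholds as the normal quantiles Φ⁻¹(k/(K + 1)) applied to the latent scores, which is an assumption on our part, and the one-factor construction is one convenient way to obtain a constant correlation ρ):

```python
import random
from statistics import NormalDist

random.seed(3)
nd = NormalDist()

def gen_attributes(N, K, rho):
    """Dependent attributes via a latent multivariate normal with unit
    variances and constant correlation rho (one-factor construction),
    thresholded at the quantiles Phi^{-1}(k / (K + 1))."""
    cuts = [nd.inv_cdf(k / (K + 1)) for k in range(1, K + 1)]
    out = []
    for _ in range(N):
        u = random.gauss(0.0, 1.0)  # shared factor induces correlation rho
        z = [rho ** 0.5 * u + (1 - rho) ** 0.5 * random.gauss(0.0, 1.0)
             for _ in range(K)]
        out.append([1 if zk >= ck else 0 for zk, ck in zip(z, cuts)])
    return out

alphas = gen_attributes(20000, 3, 0.25)
# Marginal prevalence of attribute k is about 1 - k/(K + 1); for k = 1, K = 3
# that is 0.75, regardless of rho.
p1 = sum(a[0] for a in alphas) / 20000
assert abs(p1 - 0.75) < 0.02
```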

We compared our unstructured and higher order single-factor models discussed in the previous section with parameter estimates from the MH sampler and the L1 estimator. For our variable selection method, we fixed v_00 = 0.1, v_10 = 10, v_01 = 4, and v_11 = 2. Note the single higher order factor model was estimated with the constraint that loadings were constant across k such that λ_k = λ, whereas the elements of τ were freely estimated. The L1 estimator requires a specified value for the lasso penalty parameter. We considered a grid of values for the penalty parameter from 0.01 to 0.15 in increments of 0.005, and we report the most accurate set of results for each replication.

The outcome of interest was the accuracy of the estimated Q across the four methods. We computed accuracy both matrix-wise and entry-wise across the replications. That is, the matrix-wise accuracy rate was computed as

(1/100) Σ_{r=1}^{100} I(Q = Q̂_r),   (17)

where Q̂_r is the estimated Q for replication r. The entry-wise accuracy rate was defined as

(1/100) Σ_{r=1}^{100} (1/(JK)) Σ_{j=1}^{J} Σ_{k=1}^{K} I(q_jk = q̂_jk,r),   (18)

where q̂_jk,r is the estimated q_jk for replication r.
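The two accuracy rates in Equations 17 and 18 can be computed as follows (our own sketch; Q matrices are represented as nested 0/1 lists):

```python
def matrix_wise(Q_true, Q_hats):
    """Fraction of replications recovering every entry of Q (Equation 17)."""
    return sum(Q_hat == Q_true for Q_hat in Q_hats) / len(Q_hats)

def entry_wise(Q_true, Q_hats):
    """Average proportion of correctly recovered entries (Equation 18)."""
    J, K = len(Q_true), len(Q_true[0])
    total = 0.0
    for Q_hat in Q_hats:
        total += sum(q == qh
                     for row, rhat in zip(Q_true, Q_hat)
                     for q, qh in zip(row, rhat)) / (J * K)
    return total / len(Q_hats)

Q = [[1, 0], [0, 1]]
hats = [[[1, 0], [0, 1]], [[1, 1], [0, 1]]]  # second replication misses one entry
assert matrix_wise(Q, hats) == 0.5
assert entry_wise(Q, hats) == 0.875
```

Matrix-wise accuracy is the stricter criterion: one wrong entry fails the whole replication, which is why it degrades faster than entry-wise accuracy as J and K grow.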

For the MH sampler, we followed Chung (2014) and computed Q̂ using the entry-wise mode. For the E-rRUM, we computed q̂_jk using the entry-wise mode for δ_jk (i.e., q̂_jk = I(δ̄_jk > 0.5)). Note the L1 estimator for Q̂ is implied by the pattern of nonzero coefficients in the transformed rRUM IRF (e.g., see Y. Chen, Liu, et al., 2015, p. 856).

We executed 100 replications for each combination of parameter values and used chain lengths of 80,000 with a burn-in of 20,000 iterations for both Bayesian methods. The Bayesian methods were programmed using Rcpp (Eddelbuettel & François, 2011). The run times for the N = 2,000 and K = 4 conditions on a cluster with 2.4 GHz processors were approximately 2 hours, 16 hours, and 20 minutes for the E-rRUM, MH sampler, and L1 estimator, respectively.

Monte Carlo Results

Table 1 summarizes the results from the Monte Carlo simulation study regarding the accuracy of Q estimation for the methods developed in this article (i.e., the higher order and unstructured models for αi), the MH sampler, and the L1 regularization estimator. We note three features about the Monte Carlo results. First, the Monte Carlo results support the use of the developed methods in this article to estimate the E-rRUM Q matrix. That is, the higher order and unstructured models had better matrix-wise and entry-wise recovery rates than both the MH sampler and L1 methods.

Table 1. Summary of Monte Carlo Simulation Study Accuracy of Q Estimation for the Higher Order Single Factor Model (HO), Unstructured Model (Uns.), Metropolis–Hastings Sampler (MH), and L1 Regularization for Values of K, N, and ρ

Second, the results in Table 1 provide evidence that matrix-wise recovery of Q is more challenging for the rRUM than for the DINA model. For example, Y. Chen, Liu, Xu, and Ying (2015) and Y. Chen, Culpepper, et al. (2016) report matrix-wise recovery rates in the 80s and 90s for the DINA under the conditions studied in this article. The Monte Carlo results suggest that the additional complexity of the rRUM affects the recovery of Q. It is important to note that, while the matrix-wise recovery rates are smaller for the rRUM than for the DINA, the entry-wise recovery rates exceed 95% for the higher order model across the simulated conditions. Additionally, the matrix-wise recovery rates of Q improved as N increased.

Third, the relative performance of our unstructured and higher order models varied by condition. Namely, the results in Table 1 suggest that the higher order model tended to have higher matrix-wise recovery rates for larger N and ρ. Furthermore, the entry-wise recovery rates for the E-rRUM models were comparable for the K = 3 and 4 conditions.

We analyze the ECPE data set (e.g., see Templin & Hoffman, 2013), which is available in the CDM R package version 6.4-23 (Robitzsch, Kiefer, George, & Uenlue, 2015). The ECPE data include item responses from 2,922 examinees on a University of Michigan test of advanced English skills. The ECPE data were originally analyzed with the rRUM by Henson and Templin (2007b) using an expert-specified Q matrix, which consists of three attributes: (1) morphosyntactic rules, (2) cohesive rules, and (3) lexical rules. Subsequent studies analyzed the ECPE using clustering algorithms (Chiu et al., 2009) and confirmatory CDMs such as the DINA (Y. Liu, Douglas, & Henson, 2009), the rRUM (Chiu & Köhn, 2016; Henson & Templin, 2007b; Y. Liu et al., 2009), and more general CDMs (Templin & Bradshaw, 2014; Templin & Hoffman, 2013; von Davier, 2014b).

Recent research with the ECPE data set offered competing conclusions regarding the underlying structure. Specifically, Templin and Bradshaw (2014) and von Davier (2014b) analyzed the ECPE data set with a general CDM using the expert specified Q and found evidence that the probability of membership in some classes was near 0 (i.e., that several elements of π are near 0). Researchers offered two competing inferences based on this finding. First, one inference was that the attributes are arranged into a linear hierarchy where some attributes must be mastered before others (Templin & Bradshaw, 2014). Second, an alternative interpretation is that a single, ordered skill or a continuous unidimensional trait might underlie performance on ECPE items (von Davier, 2014b; von Davier & Haberman, 2014).

We apply the unstructured E-rRUM to the ECPE data to estimate Q and provide new evidence about the underlying structure. Specifically, we fit the unstructured E-rRUM with K = 2–7 and compare the exploratory models with the confirmatory rRUM that uses the expert-specified Q matrix from Templin and Hoffman (2013) (e.g., see Table 2), estimated using the Bayesian algorithm of Culpepper and Hudson (2017). Additionally, von Davier (2014b) found evidence that a two-parameter IRT model had a lower Akaike information criterion than the rRUM with an expert-specified Q, so we also compare the model fit of a two-parameter normal ogive (2PNO; Albert, 1992) model with the E-rRUM. We used chains of length 80,000 and discarded the first 20,000 iterations as burn-in. The E-rRUMs were all estimated using the values for v_00, v_01, v_10, and v_11 specified in the simulation study. In the remainder of this section, we first discuss model fit and then present results from the best fitting model.

Table 2. Summary of ECPE Data Unstructured E-rRUM Model Parameter Estimates for K = 5

Model Fit

We first compare the relative fit of the models using the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & Van Der Linde, 2002). Specifically, we computed the marginal DIC (i.e., we marginalize the likelihood over attributes for the diagnostic models and integrate over the latent traits for the 2PNO) rather than a conditional DIC for missing data models (e.g., see Celeux, Forbes, Robert, & Titterington, 2006). The smallest DIC value of 85,058 corresponded to the E-rRUM with K = 5. In contrast, the DICs for the expert Q rRUM and the 2PNO were 111,143 and 85,203, respectively, and the DICs for the E-rRUM with K = 2, 3, 4, 6, and 7 were 85,258, 85,202, 85,124, 85,074, and 85,079.

We also evaluated relative model fit using posterior predictive checks. In particular, the Q matrix specifies the underlying structure, so one way to evaluate relative model fit is to compare how well the models reproduce the item means and observed relationships among items. That is, an inadequate model will be less likely to predict item means and reproduce the associations among the observed variables. We therefore assess model fit by computing posterior predictive probabilities (PPPs) of the item means and odds ratios for each pair of items (e.g., see W.-H. Chen & Thissen, 1997; Sinharay, Johnson, & Stern, 2006). We follow Sinharay, Johnson, and Stern (2006) and consider PPPs smaller than 0.05 or greater than 0.95 to be extreme and evidence of misfit.

We found evidence that all of the models effectively reproduced the observed item means, so we do not report those results here. We computed odds-ratio PPPs by (1) simulating observed responses Y^(r) using model parameters from iteration r of the MCMC sampler; (2) computing the odds ratio for each pair of items at iteration r as OR^(r) = n_11^(r) n_00^(r) / (n_10^(r) n_01^(r)), where n_11^(r) is the frequency of ones on both variables at iteration r, n_10^(r) is the frequency of ones on the first item and zeros on the second at iteration r, and so on; and (3) computing PPPs for each item pair as the proportion of generated OR^(r)'s that exceeded the observed odds ratio. We found evidence the E-rRUM better predicted relationships among the ECPE items than the rRUM with an expert-specified Q and the 2PNO. In fact, 28.3% of the expert Q rRUM pairwise odds ratios were considered too extreme (i.e., the confirmatory rRUM did a poor job of describing pairwise item odds ratios), and 10.8% of the 2PNO PPPs were considered extreme. In contrast, the percentages of out-of-range odds-ratio PPPs for the E-rRUM were 11.1%, 10.3%, 7.4%, 4.8%, 5.8%, and 5.8% for K = 2, 3, 4, 5, 6, and 7, respectively. The DICs and odds-ratio PPPs support the E-rRUM with K = 5. In the next subsection, we summarize the results of the empirical application of the E-rRUM with K = 5.
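Steps 2 and 3 of the PPP computation can be sketched as follows (our own helper functions; `y` holds 0/1 responses with items in columns, and a real implementation would typically add a continuity correction to guard against empty cells):

```python
def pairwise_or(y, j, k):
    """Sample odds ratio n11*n00 / (n10*n01) for items j and k.
    Assumes every cell of the 2x2 table is nonempty (no correction)."""
    n = [[0, 0], [0, 0]]
    for row in y:
        n[row[j]][row[k]] += 1
    return (n[1][1] * n[0][0]) / (n[1][0] * n[0][1])

def ppp(obs_or, sim_ors):
    """Posterior predictive probability: proportion of replicated odds
    ratios exceeding the observed one; values < .05 or > .95 flag misfit."""
    return sum(s > obs_or for s in sim_ors) / len(sim_ors)

# Six examinees, two items: n11 = 2, n10 = 1, n01 = 1, n00 = 2
y = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 0], [0, 0]]
assert pairwise_or(y, 0, 1) == 4.0
# Two of four replicated odds ratios exceed the observed value of 4
assert ppp(4.0, [1.0, 3.0, 5.0, 6.0]) == 0.5
```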

Empirical Results for K = 5

The results in the previous subsection support the presence of five binary attributes rather than a continuous latent trait. In this subsection, we report results for the E-rRUM with K = 5 and discuss the extent to which there is evidence for an attribute hierarchy. Table 2 reports the E-rRUM parameter estimates (i.e., Q̂, δ̂, r̂*, and π̂*) along with the previously employed expert Q. Note that Q̂ is defined using the posterior element-wise mode for δ (i.e., q̂_jk = I(δ̄_jk > 0.5)). One immediate observation from Table 2 is that the expert Q is not captured by Q̂, which may be expected given the differences in model fit discussed in the previous subsection.

The results in Table 2 offer evidence about the extent to which there is an attribute hierarchy. Specifically, Table 2 shows that the estimated Q implies the first attribute is needed for all items. The posterior averages of δ_jk in Table 2 quantify uncertainty regarding the corresponding q̂_jk, and the estimates suggest that the posterior probability that the first attribute is required exceeds 0.90 for all items. One conclusion is that students must possess the first attribute to have the greatest probability of successfully answering the items. In fact, the penalty for not possessing the first attribute is as low as r̂_7,1* = 0.56 and r̂_22,1* = 0.42 for items 7 and 22, and the average penalty for not possessing Attribute 1 equaled 0.72.

Additional evidence to support an attribute hierarchy is found in the estimated attribute structure. Table 3 reports the estimated ECPE latent class proportions (i.e., π̂) for the E-rRUM with K = 5. Our results in Table 3 agree with prior research given that the probabilities of membership in several classes are near 0. Specifically, the estimated π suggests there is a smaller probability of being classified into an attribute class that is missing the first attribute. In fact, only 9.8% of the sample was assigned to one of the 14 classes that possess at least one attribute other than α_1. Table 3 shows that seven of the 32 possible latent classes had membership proportions exceeding 5%. Specifically, the structural probabilities for the seven largest latent classes were .062 for (00000), .060 for (10000), .060 for (11000), .075 for (11001), .081 for (11010), .086 for (11011), and .242 for (11111).

Table 3. ECPE Data Estimated Latent Class Proportions π̂ for Unstructured E-rRUM With K = 5

Templin and Bradshaw (2014) interpreted the class probabilities and concluded there was evidence of a linear attribute hierarchy. The results in Table 3 may instead provide evidence of a more complex attribute hierarchy (e.g., see Köhn & Chiu, 2018). That is, using 5% as a threshold to infer which classes are nonzero implies that Attribute 2 requires that Attribute 1 be mastered, Attributes 4 and 5 require Attribute 2, and Attribute 3 requires that Attributes 4 and 5 be mastered. Rather than a linear hierarchy, the estimated class probabilities provide evidence of a partially ordered attribute hierarchy, given that Attributes 4 and 5 are not ordered.

We presented a new exploratory approach for estimating the rRUM Q matrix. We conclude with a summary of our contribution and recommendations for future research.

The E-rRUM provides a more flexible modeling framework than previous research that estimated the DINA Q. Specifically, the more restrictive DINA model includes only two item parameters, whereas the rRUM has up to K + 1 parameters per item. It is important to note that the increased flexibility of the E-rRUM translated to lower matrix-wise recovery rates of Q than observed in prior research for the DINA model. This finding has implications for future research on estimating Q for more general CDMs. Namely, the rRUM is a special case of several general CDMs (de la Torre, 2011; Henson et al., 2009; von Davier, 2008), and it is likely that estimating Q will be more challenging for these models than for the rRUM. An important goal for future research is to develop procedures for estimating Q for more general CDMs. Xu and Shang's (2017) results provide a theoretical foundation for such work, and future research must necessarily examine how study design features (e.g., sample size, latent attribute structure, the number of attributes) impact recovery of Q for these general models.

We also offer improved Bayesian strategies for estimating the Q matrix. The Monte Carlo evidence supports the use of the developed procedures over existing Bayesian methods for estimating the rRUM Q; in the simulation study, the MH sampler failed to recover all elements of Q in any replication or condition. The difference in performance is likely attributable to the fact that our Bayesian algorithms addressed the confound between Q and r*. Specifically, we addressed the confound by fixing all elements of Q equal to 1 and then using spike–slab priors for the augmented item parameters to determine which penalty parameters are active. Another important finding from the Monte Carlo study is that recovery of Q improved as N increased for K = 3 and K = 4.
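The spike–slab selection step can be illustrated with a generic two-component mixture in the style of George and McCulloch (1993). The densities, scales, and coefficient parameterization below are illustrative choices for exposition, not the exact algorithm developed in the article.

```python
import math

def inclusion_prob(beta, omega=0.5, spike_sd=0.05, slab_sd=1.0):
    """Posterior probability that a coefficient was drawn from the diffuse
    slab (attribute active) rather than the near-zero spike (attribute
    inactive), under a two-component normal mixture with prior inclusion
    weight omega."""
    def norm_pdf(x, sd):
        return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    slab = omega * norm_pdf(beta, slab_sd)
    spike = (1.0 - omega) * norm_pdf(beta, spike_sd)
    return slab / (slab + spike)

# A coefficient near zero is attributed to the spike (penalty inactive),
# whereas a clearly nonzero coefficient is attributed to the slab (active).
print(round(inclusion_prob(0.01), 3))
print(round(inclusion_prob(2.00), 3))
```

In a Gibbs sampler, each inclusion indicator would be drawn as a Bernoulli variable with this probability at every iteration, so that the posterior mean of the indicator estimates the probability that the corresponding attribute is required.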

The developed E-rRUM also offers a practical contribution for applied researchers. The E-rRUM is applicable for theory development in a manner similar to how exploratory factor analysis has been applied across the social sciences. In fact, our application of the E-rRUM to the ECPE data provides examples of the types of inferences that are available with exploratory CDMs. First, we found evidence that estimating Q improved model fit in comparison to using an expert-specified Q matrix and a unidimensional latent trait. Accordingly, future researchers can use the E-rRUM to evaluate the plausibility of existing cognitive theory. Second, we uncovered a different latent structure for attributes than found in previous research using an expert-specified Q matrix. One implication is that estimating Q has the potential to impact inferences and theory development. For example, as noted above, prior research concluded that the attributes underlying the ECPE data either satisfied a hierarchy or were ordered along a unidimensional trait. The results of the E-rRUM instead support the existence of five hierarchically structured attributes. That is, we found evidence from the estimated Q that the first attribute is required for all items, and the pattern of latent class probabilities may support a partially ordered attribute hierarchy. Consequently, using the E-rRUM rather than the rRUM improved model fit and altered substantive conclusions regarding the nature of the latent attribute structure.

There are several directions for future research. First, although the Monte Carlo simulation study employed a design similar to that of previous studies, there is an opportunity in future research to examine model parameter accuracy for additional latent attribute structures, numbers of attributes, and sample sizes. Subsequent Monte Carlo research will improve our understanding of factors that impact recovery of Q. For instance, additional evidence is needed about parameter recovery and classification accuracy for cases where a subset of the items might not load on the common skills and attributes. Second, future research should consider extending the developed methods to more general CDMs. Our application of the E-rRUM to the ECPE data set assumed a conjunctive relationship, which may not be appropriate for all items. Future research should apply more general exploratory CDMs to the ECPE data to evaluate the plausibility of the conjunctive assumption and to further evaluate the presence of an attribute hierarchy. Third, we considered a fully exploratory approach for estimating Q; however, researchers may have partial information where some values are known based upon cognitive theory. In such cases, if some qjk are known, our approach can be used by updating the remaining elements of Q conditioned upon the known values. Future research should also more generally consider how to incorporate expert knowledge into estimation. One approach researchers may consider is to replace our prior with one that uses expert knowledge to specify the prior probability that each entry is 0 versus 1.

A final direction for future research relates to model identifiability for exploratory CDMs. For instance, the estimated ECPE Q reported in Table 2 does not satisfy sufficient conditions that ensure the model parameters are strictly identified (i.e., that the model parameters map to distinct likelihoods). Specifically, Xu and Shang (2017) showed that one of the sufficient conditions to strictly identify CDM parameters is that Q includes two identity matrices (see Appendix B, available in the online version of the journal, for a summary of Xu and Shang's [2017] results). The estimated ECPE Q matrix for K = 5 does not include simple structure for 10 items. Consequently, it may be the case that the reported E-rRUM parameters are not strictly identified. In preliminary analyses, we fit an E-rRUM that explicitly enforced model identifiability results (see Appendix B, available in the online version of the journal, for details about the algorithm for enforcing identifiability constraints). The results from the constrained E-rRUM supported a solution with K = 3, and we uncovered an estimated Q with dense structure that included six simple-structure items, with the remaining entries equal to 1. One possible explanation for uncovering a dense Q is that the true Q may not satisfy Xu and Shang's (2017) sufficient conditions for strict identifiability, and enforcing the identifiability conditions may yield a misspecified model. Although the estimated Q we report may not satisfy conditions for strict identifiability, the model parameters may be generically identified, which "…implies that the set of points for which identifiability does not hold has measure zero. In this sense, any observed data set has probability one of being drawn from a distribution with identifiable parameters" (Allman, Matias, & Rhodes, 2009, p. 3102).
Satisfying generic identifiability is likely sufficient for practical applications of exploratory CDMs, given there is no guarantee the available items satisfy the more stringent strict identifiability conditions. Additional research is clearly needed to establish generic identifiability results for more general CDMs.
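The two-identity-matrices requirement discussed above is straightforward to check mechanically: up to row permutation, Q contains two K × K identity submatrices exactly when every attribute is measured by at least two simple-structure items (rows with a single 1). The function below is our own sketch of this one piece of the condition; Xu and Shang (2017) state additional requirements beyond it.

```python
def contains_two_identity_blocks(Q):
    """Check whether, up to row permutation, the binary matrix Q contains
    two K x K identity submatrices, i.e., whether every attribute has at
    least two simple-structure rows (rows with exactly one 1)."""
    K = len(Q[0])
    counts = [0] * K  # simple-structure rows per attribute
    for row in Q:
        if sum(row) == 1:
            counts[row.index(1)] += 1
    return all(c >= 2 for c in counts)

# A small Q with two I_2 blocks plus one complex item satisfies the check;
# dropping one simple-structure row per attribute fails it.
Q_ok = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 1]]
Q_bad = [[1, 0], [0, 1], [1, 1], [1, 1]]
print(contains_two_identity_blocks(Q_ok))   # True
print(contains_two_identity_blocks(Q_bad))  # False
```

Applying such a check to an estimated Q, as for the ECPE solution discussed above, indicates when strict identifiability cannot be guaranteed and generic identifiability arguments must be invoked instead.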

In conclusion, CDMs are popular for their ability to classify individuals on a fine-grained collection of attributes. The development of exploratory methods to estimate Q for more general models is critical for widespread application of CDMs. We introduced an exploratory version of the rRUM and made several advances for Bayesian estimation of Q. The developed methods improved upon existing research and further broadened the use and applicability of CDMs in applied research.

We thank Jeffrey Douglas, the editor, and two anonymous reviewers for helpful comments and suggestions, and Yunxiao Chen for sharing R code for estimating the DINA Q matrix using the L1 regularization estimator, which we modified to estimate the rRUM Q matrix. Any remaining errors belong to the authors.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Spencer Foundation Grant #201700062 and by a grant from the National Science Foundation Methodology, Measurement, and Statistics program grant #1632023. Opinions reflect those of the authors and do not necessarily reflect those of the granting agency.

Albert, J. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational and Behavioral Statistics, 17, 251–269.
Allman, E. S., Matias, C., Rhodes, J. A. (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37, 3099–3132.
Béguin, A. A., Glas, C. A. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–561.
Celeux, G., Forbes, F., Robert, C. P., Titterington, D. M. (2006). Deviance information criteria for missing data models. Bayesian Analysis, 1, 651–673. doi:10.1214/06-BA122
Chen, W.-H., Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.
Chen, Y., Culpepper, S. A., Chen, Y., Douglas, J. (2016). Bayesian estimation of the DINA Q. Paper presented at the International Meeting of the Psychometric Society, Asheville, NC.
Chen, Y., Liu, J., Xu, G., Ying, Z. (2015). Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110, 850–866.
Chiu, C.-Y. (2013). Statistical refinement of the Q-matrix in cognitive diagnosis. Applied Psychological Measurement, 37, 598–618.
Chiu, C.-Y., Douglas, J. A., Li, X. (2009). Cluster analysis for cognitive diagnosis: Theory and applications. Psychometrika, 74, 633–665.
Chiu, C.-Y., Köhn, H.-F. (2016). The reduced RUM as a logit model: Parameterization and constraints. Psychometrika, 81, 350–370.
Chung, M. (2014). Estimating the Q-matrix for cognitive diagnosis models in a Bayesian framework (Unpublished doctoral dissertation). Columbia University, New York, NY.
Culpepper, S. A. (2015). Bayesian estimation of the DINA model with Gibbs sampling. Journal of Educational and Behavioral Statistics, 40, 454–476.
Culpepper, S. A. (2016). Revisiting the 4-parameter item response model: Bayesian estimation and application. Psychometrika, 81, 1142–1163.
Culpepper, S. A., Hudson, A. (2017). An improved strategy for Bayesian estimation of the reduced reparameterized unified model. Applied Psychological Measurement, 42, 99–115. doi:10.1177/0146621617707511
DeCarlo, L. T. (2012). Recognizing uncertainty in the Q-matrix via a Bayesian extension of the DINA model. Applied Psychological Measurement, 36, 447–468.
de la Torre, J. (2008). An empirically based method of Q-matrix validation for the DINA model: Development and applications. Journal of Educational Measurement, 45, 343–362.
de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76, 179–199.
de la Torre, J., Chiu, C.-Y. (2016). A general method of empirical Q-matrix validation. Psychometrika, 81, 253–273.
de la Torre, J., Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353.
Eddelbuettel, D., François, R., Allaire, J., Chambers, J., Bates, D., Ushey, K. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40, 1–18.
George, E. I., McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88, 881–889.
Hartz, S. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practicality (Unpublished doctoral dissertation). University of Illinois at Urbana-Champaign, Champaign, IL.
Henson, R. A., Templin, J. (2007a). Importance of Q-matrix construction and its effects on cognitive diagnosis model results. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL.
Henson, R. A., Templin, J. (2007b). Large-scale language assessment using cognitive diagnosis models. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL.
Henson, R. A., Templin, J. L., Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191–210.
Köhn, H.-F., Chiu, C.-Y. (2018). Identifiability of the latent attribute space and conditions of Q-matrix completeness for attribute hierarchy models. In Wiberg, M., Culpepper, S., Janssen, R., González, J., Molenaar, D. (Eds.), Quantitative psychology (pp. 363–375). Springer Proceedings in Mathematics & Statistics, Vol. 233. Cham, Switzerland: Springer. doi:10.1007/978-3-319-77249-3_30
Liu, J., Xu, G., Ying, Z. (2012). Data-driven learning of Q-matrix. Applied Psychological Measurement, 36, 548–564.
Liu, J., Xu, G., Ying, Z. (2013). Theory of the self-learning Q-matrix. Bernoulli, 19, 1790–1817.
Liu, Y., Douglas, J. A., Henson, R. A. (2009). Testing person fit in cognitive diagnosis. Applied Psychological Measurement, 33, 579–598.
Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187–212.
O'Hara, R. B., Sillanpää, M. J. (2009). A review of Bayesian variable selection methods: What, how and which. Bayesian Analysis, 4, 85–117.
Robitzsch, A., Kiefer, T., George, A. C., Uenlue, A. (2015). CDM: Cognitive diagnosis modeling (R package version 4.6-0) [Computer software manual]. Retrieved from http://CRAN.R-project.org/package=CDM
Rupp, A. A., Templin, J. L. (2008). The effects of Q-matrix misspecification on parameter estimates and classification accuracy in the DINA model. Educational and Psychological Measurement, 68, 78–96.
Sinharay, S., Johnson, M. S., Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Applied Psychological Measurement, 30, 298–321.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 583–639.
Templin, J. L., Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79, 317–339.
Templin, J. L., Henson, R. A. (2006). A Bayesian method for incorporating uncertainty into Q-matrix estimation in skills assessment. Symposium conducted at the meeting of the American Educational Research Association, San Francisco, CA.
Templin, J. L., Henson, R. A., Templin, S. E., Roussos, L. (2008). Robustness of hierarchical modeling of skill association in cognitive diagnosis models. Applied Psychological Measurement, 32, 559–574.
Templin, J. L., Hoffman, L. (2013). Obtaining diagnostic classification model estimates using Mplus. Educational Measurement: Issues and Practice, 32, 37–50.
von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307.
von Davier, M. (2014a). The DINA model as a constrained general diagnostic model: Two variants of a model equivalency. British Journal of Mathematical and Statistical Psychology, 67, 49–71.
von Davier, M. (2014b). The log-linear cognitive diagnostic model (LCDM) as a special case of the general diagnostic model (GDM) (Research Report ETS RR-14-40). ETS Research Report Series. doi:10.1002/ets2.12043
von Davier, M., Haberman, S. J. (2014). Hierarchical diagnostic classification models morphing into unidimensional "diagnostic" classification models: A commentary. Psychometrika, 79, 340–346.
Xiang, R. (2013). Nonlinear penalized estimation of true Q-matrix in cognitive diagnostic models (Unpublished doctoral dissertation). Columbia University, New York, NY.
Xu, G. (2017). Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45, 675–707.
Xu, G., Shang, Z. (2017). Identifying latent structures in restricted latent class models. Journal of the American Statistical Association. doi:10.1080/01621459.2017.1340889

Authors

STEVEN ANDREW CULPEPPER is an associate professor in the Department of Statistics at the University of Illinois at Urbana-Champaign, Illini Hall, Room 115, 725 S. Wright Street, Champaign, IL 61820, USA; email: . His research interests include large scale testing, Bayesian models and computation, restricted latent class models, and longitudinal modeling.

YINGHAN CHEN is an assistant professor in the Department of Mathematics & Statistics at the University of Nevada, Reno, 1664 N. Virginia Street, Reno, NV 89557, USA; email: . Her research interests include statistical computing, advanced Monte Carlo methods, Bayesian analysis, and latent class models.
