A Note on the Relationship of the Shannon Entropy Procedure and the Jensen–Shannon Divergence in Cognitive Diagnostic Computerized Adaptive Testing

The purpose of this study is to investigate the relationship between the Shannon entropy procedure and the Jensen–Shannon divergence (JSD), both of which are used as item selection criteria in cognitive diagnostic computerized adaptive testing (CD-CAT). Because the JSD itself is defined in terms of the Shannon entropy, we apply the well-known relationship between the JSD and the Shannon entropy to establish a relationship between the item selection criteria based on these two measures. To better illuminate this relationship, an alternative derivation is also provided. Theoretical derivations and empirical examples show that the Shannon entropy procedure and the JSD in CD-CAT are linearly related under cognitive diagnostic models. Consistent with our theoretical conclusions, simulation results show that the two item selection criteria behaved quite similarly in terms of attribute-level and pattern recovery rates under all conditions, and that they selected the same set of items for each examinee from an item bank with item parameters drawn from a uniform distribution U(0.1, 0.3) under post hoc simulations. We provide some suggestions for future studies and a discussion of the relationship between the modified posterior-weighted Kullback–Leibler index and the G-DINA (generalized deterministic inputs, noisy "and" gate) discrimination index.


Original Research
Summative assessments are typically used for grading and accountability purposes, whereas formative assessments are often used to support student learning (Laveault & Allal, 2016). Researchers and practitioners began to focus on formative assessments for student learning, rather than solely on summative assessments, because a large body of evidence showed that formative assessments produce significant and often substantial learning gains and improve student confidence and achievement (Black & Wiliam, 1998; Laveault & Allal, 2016). Cognitive diagnosis assessment (CDA) can be regarded as a kind of formative assessment because it is intended to promote assessment for learning, modifying instruction and learning in classrooms by providing formative diagnostic information about students' cognitive strengths and weaknesses (Jang, 2008; Leighton & Gierl, 2007). CDA has received increasing attention in recent years (Leighton & Gierl, 2007; Rupp et al., 2010; K. K. Tatsuoka, 2009), especially since the No Child Left Behind Act of 2001 mandated the selection and use of diagnostic assessments to improve teaching and learning, and the new federal grant program known as "Race to the Top" (RTTT) ushered in a new era of K-12 assessments emphasizing both accountability and instructional improvement (Chang, 2012).
Computerized adaptive testing (CAT) has become a popular mode for many summative and formative assessments (Quellmalz & Pellegrino, 2009). As a method of administering test items, CAT tailors item difficulty to the ability level of the individual examinee (Chang & Ying, 2007). It is attractive to practitioners because it yields high measurement precision with a short test. Within the CAT framework, cognitive diagnostic computerized adaptive testing (CD-CAT) is also a popular mode of online testing for cognitive diagnosis, as it can help one make informed decisions about the next steps in instruction for each student, greatly facilitate individualized learning (Chang, 2015), and provide many benefits in support of formative assessments (Gierl & Lai, 2018). In particular, the U.S. National Education Technology Plan 2017, titled "Reimagining the Role of Technology in Education" (U.S. Department of Education, 2017), emphasizes that technology can help us redefine assessment to meet the needs of the learner in a variety of ways. In technology-based formative assessments or CAT, test items are adapted to the learner's ability and knowledge during the testing process. Thus, CAT can provide real-time reporting of results during the instructional process, which is crucial for personalized learning (Chen & Chang, 2018).
A key ingredient in CD-CAT is the item selection index, and researchers have investigated many such indices. The first type is based on the Kullback-Leibler (KL) information, such as the KL index (Cheng, 2009; McGlohen & Chang, 2008; C. Tatsuoka & Ferguson, 2003; Xu et al., 2003), the likelihood- or posterior-weighted KL (LWKL or PWKL) index and the hybrid KL index (Cheng, 2009), the restrictive progressive or threshold PWKL index (Wang et al., 2011), the aggregate ranked information index and the aggregate standardized information index (Wang et al., 2014), the modified PWKL (MPWKL) index (Kaplan et al., 2015), the KL expected discrimination index (W. Y. Wang et al., 2015), the posterior-weighted cognitive diagnostic model (CDM) discrimination index and the posterior-weighted attribute-level CDM discrimination index (Zheng & Chang, 2016), and the information product index (Zheng et al., 2018). The second is based on the Shannon entropy, called the Shannon entropy (SHE) procedure (Cheng, 2009; McGlohen & Chang, 2008; C. Tatsuoka, 2002; C. Tatsuoka & Ferguson, 2003; Xu et al., 2003, 2016). The third is based on the mutual information, including the expected mutual information index (Wang, 2013) and the Jensen-Shannon divergence (JSD) index (Kang et al., 2017; Minchen & de la Torre, 2016; Yigit et al., 2018). There are other indices, such as the generalized deterministic inputs, noisy "and" gate (G-DINA; de la Torre, 2011) model discrimination index (GDI; Kaplan et al., 2015), the rate function approach (Liu et al., 2015), the halving algorithm (C. Tatsuoka & Ferguson, 2003; W. Y. Wang et al., 2015; Zheng & Wang, 2017), and so on. Yigit et al. (2018) proved that the mutual information index and the JSD index are equivalent.
Although previous simulation studies showed that the SHE and the JSD (or mutual information) perform quite similarly, the main purpose of this study is to describe the theoretical relationship between the SHE procedure and the JSD index.

CDMs
Before introducing item selection indices for CD-CAT, the general concept of CDMs and the G-DINA model, a general CDM, are described here. CDMs have been defined by Rupp and Templin (2008) as "probabilistic, confirmatory multidimensional latent variable models with a simple or complex loading structure" (p. 226). The loading structure for a CDM is represented by its Q-matrix (K. K. Tatsuoka, 1983, 2009). The entries of a Q-matrix are 1 or 0: $q_{jk} = 1$ when attribute $k$ is required for answering item $j$ correctly, and $q_{jk} = 0$ otherwise. CDMs define an item response function of a Q-matrix, an examinee's discrete latent variables, and item parameters to predict the probability of an observable categorical response to an item. This study focuses only on CDMs designed to handle dichotomous responses. For a dichotomous CDM, the item response function for a binary response variable is denoted by $P_{ju}(\alpha_i) = P(U_{ij} = u \mid \alpha_i, q_j, \beta_j)$, where $u \in \{0, 1\}$, $q_j$ is the $j$th row of the Q-matrix, and $\beta_j$ is the vector of item parameters for item $j$. Note that this study will use $P_{ju}(\alpha_i)$ to discuss the theoretical relationship between the SHE procedure and the JSD index. Under the G-DINA model, an item $j$ that measures $K_j^*$ attributes partitions the $2^K$ distinct attribute patterns into $2^{K_j^*}$ latent groups. To simplify the notation, let $\alpha_{ij}^*$ be the reduced attribute pattern of the full attribute pattern $\alpha_i$ with respect to the attributes required by item $j$, and let $U_{ij}$ be the response of examinee $i$ to item $j$. We define $P_{j\cdot}(\alpha_i)$ to be the probability distribution of the binary random variable $U_{ij}$, where $P_{j1}(\alpha_i)$ and $P_{j0}(\alpha_i) = 1 - P_{j1}(\alpha_i)$ are the probabilities of a right and a wrong answer on item $j$ by examinee $i$ with full attribute pattern $\alpha_i$ or reduced attribute pattern $\alpha_{ij}^*$. For the G-DINA model, the probability of correctly answering item $j$ by examinee $i$ is given by de la Torre (2011) and Ma et al. (2016) as

$g\left[P(U_{ij} = 1 \mid \alpha_{ij}^*)\right] = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{ik} + \sum_{k'=k+1}^{K_j^*}\sum_{k=1}^{K_j^*-1} \delta_{jkk'}\alpha_{ik}\alpha_{ik'} + \cdots + \delta_{j12\cdots K_j^*}\prod_{k=1}^{K_j^*}\alpha_{ik},$

where $g(\cdot)$ is a link function, $\delta_{j0}$ is the intercept, $\delta_{jk}$ is the main effect of attribute $k$, and the remaining $\delta$ terms are interaction effects.
The resulting model from the log link function is referred to as the log CDM (de la Torre, 2011). For the identity link, that is, $g(P) = P$, the "deterministic input; noisy 'and' gate" (DINA) model (de la Torre & Douglas, 2004; Haertel, 1989; Junker & Sijtsma, 2001), the "deterministic input; noisy 'or' gate" (DINO) model (Templin & Henson, 2006), and the additive CDM (A-CDM) can be obtained from the G-DINA model when appropriate constraints are applied. For example, the item response function of the DINA model,

$P(U_{ij} = 1 \mid \alpha_{ij}^*) = g_j + (1 - s_j - g_j)\prod_{k=1}^{K_j^*}\alpha_{ik},$

is obtained by setting all main-effect and lower-order interaction terms to zero and taking $\delta_{j0} = g_j$ and $\delta_{j12\cdots K_j^*} = 1 - s_j - g_j$. The parameter $g_j$ is the probability of correctly guessing the answer when an examinee lacks at least one of the required attributes, and the parameter $s_j$ is the probability of slipping and incorrectly answering the item when an examinee has mastered all the required attributes. The DINA model is a parsimonious and interpretable model that requires only two parameters per item regardless of the number of attributes being considered.
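As a concrete illustration, the DINA item response function above can be sketched in a few lines of Python. This is a minimal sketch; the function name and the example parameter values are ours, not from the paper.

```python
def dina_prob(alpha, q, s, g):
    """P(U_ij = 1 | alpha) under the DINA model.

    alpha : 0/1 attribute-mastery pattern of the examinee
    q     : 0/1 Q-matrix row of the item
    s, g  : slipping and guessing parameters of the item
    """
    # eta = 1 only if the examinee has mastered every required attribute
    eta = all(a >= qk for a, qk in zip(alpha, q))
    return 1 - s if eta else g

# A master of both required attributes succeeds with probability 1 - s;
# lacking one required attribute drops the probability to g.
print(dina_prob([1, 1, 0], [1, 1, 0], s=0.1, g=0.2))  # 0.9
print(dina_prob([1, 0, 0], [1, 1, 0], s=0.1, g=0.2))  # 0.2
```

Note how the third attribute, with $q_{j3} = 0$, is ignored: only the reduced pattern with respect to the required attributes matters, as described above.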

SHE Procedure
After an item bank has been calibrated with a CDM, one must determine how to choose items for examinees from the item bank. CD-CAT employs algorithms that select items sequentially on the basis of the examinee's responses and are designed to classify a student's attribute pattern accurately with a short test. The SHE procedure (Cheng, 2009) and the JSD index (Minchen & de la Torre, 2016; Yigit et al., 2018) are described below.
Suppose that the prior for the attribute patterns is chosen as $\pi_0(\alpha_c)$, $c = 1, \ldots, 2^K$. Assume that $\pi_{i,t}(\alpha_c)$ is the updated prior probability distribution after $t$ items have been administered to examinee $i$, and that $U_{ij} = u$ is the item response for candidate item $j$ in $R_i^{(t)}$, the set of items remaining in the bank at stage $t$. The posterior distribution $\pi_{i,t+1}$ then becomes

$\pi_{i,t+1}(\alpha_c) = \frac{\pi_{i,t}(\alpha_c) P_{ju}(\alpha_c)}{\sum_{c=1}^{2^K} \pi_{i,t}(\alpha_c) P_{ju}(\alpha_c)},$

where $\alpha_c$ has the updated prior distribution $\pi_{i,t}$. By considering the uncertainty of the item response $U_{ij}$, the marginal probability distribution of the response to item $j$ given the distribution $\pi_{i,t}$ can be computed as

$P_j(u) = \sum_{c=1}^{2^K} \pi_{i,t}(\alpha_c) P_{ju}(\alpha_c),$

which can be derived from the assumption of local independence, as the current posterior distribution $\pi_{i,t}$ can be viewed as a new prior for $\alpha_c$ after the test data $u_i^{(t)}$ have been observed.
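The posterior update and the marginal response distribution can be sketched as follows. This is a hedged illustration: the four-pattern prior and the per-pattern success probabilities are invented toy numbers, not values from the paper.

```python
def posterior_update(prior, p1, u):
    """Update pi_{i,t} to pi_{i,t+1} after observing response u on item j.

    prior : pi_{i,t}(alpha_c) over all attribute patterns
    p1    : P_{j1}(alpha_c), success probability of item j per pattern
    u     : observed response, 0 or 1
    Returns the posterior distribution and the marginal P_j(u).
    """
    likelihood = [p if u == 1 else 1 - p for p in p1]
    joint = [pi * lk for pi, lk in zip(prior, likelihood)]
    p_j_u = sum(joint)              # marginal P_j(u), the normalizing constant
    return [x / p_j_u for x in joint], p_j_u

prior = [0.25, 0.25, 0.25, 0.25]    # uniform over four hypothetical patterns
p1 = [0.2, 0.2, 0.9, 0.9]           # hypothetical success probabilities
posterior, p_correct = posterior_update(prior, p1, u=1)
print(round(p_correct, 3))          # 0.55
```

A correct response shifts mass toward the patterns with high success probability, exactly as the update equation prescribes.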
The next item to be selected for examinee $i$ by the SHE procedure is the one in $R_i^{(t)}$ that minimizes the expected SHE:

$SHE_{ij} = \sum_{u=0}^{1} P_j(u)\, H(\pi_{i,t+1} \mid U_{ij} = u),$

where $H(\pi) = -\sum_{c=1}^{2^K} \pi(\alpha_c)\log \pi(\alpha_c)$ is the Shannon entropy. From the definition of conditional entropy (Cover & Thomas, 2006), $SHE_{ij}$ is the entropy of the posterior distribution averaged over the two possible responses, with the marginal probabilities $P_j(u)$ as the weights. Thus, $SHE_{ij}$ can be considered as the conditional entropy $H(\alpha_c \mid U_{ij})$.
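The expected posterior entropy can be sketched directly from the definitions above (natural logarithms; the toy prior and success probabilities are invented for illustration):

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (0 log 0 treated as 0)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def she_index(prior, p1):
    """SHE_ij: posterior entropy averaged over u with weights P_j(u)."""
    she = 0.0
    for u in (0, 1):
        joint = [pi * (p if u == 1 else 1 - p) for pi, p in zip(prior, p1)]
        p_j_u = sum(joint)                      # marginal P_j(u)
        posterior = [x / p_j_u for x in joint]  # pi_{i,t+1} given U = u
        she += p_j_u * entropy(posterior)
    return she

prior = [0.25] * 4
informative = she_index(prior, [0.2, 0.2, 0.9, 0.9])
flat = she_index(prior, [0.5, 0.5, 0.5, 0.5])
print(informative < flat)  # True: the SHE criterion prefers the first item
```

An item whose success probability does not vary across patterns leaves the uncertainty at $H(\pi_{i,t}) = \log 4$, whereas a discriminating item reduces it, which is why the SHE criterion minimizes this quantity.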

JSD Index
The JSD, a class of information measures based on the SHE, was introduced by Lin (1991) to measure the overall difference among any finite number of distributions. Let $P_{j\cdot}(\alpha_c)$, $c = 1, \ldots, 2^K$, be the response distributions of item $j$ for the $2^K$ attribute patterns, with weights $\pi_{i,t}(\alpha_c)$, respectively. By the definition of the generalized JSD in Equation 5.1 of Lin (1991), or from Equations A.3, A.4, and A.5 in the online appendices of Yigit et al. (2018), the JSD for item $j$ can be written as

$JSD_{ij} = H\!\left(\sum_{c=1}^{2^K} \pi_{i,t}(\alpha_c) P_{j\cdot}(\alpha_c)\right) - \sum_{c=1}^{2^K} \pi_{i,t}(\alpha_c)\, H\!\left(P_{j\cdot}(\alpha_c)\right),$

where $H(P_{j\cdot}(\alpha_c)) = -\sum_{u=0}^{1} P_{ju}(\alpha_c)\log P_{ju}(\alpha_c)$. The next item to be administered to examinee $i$ is the one in $R_i^{(t)}$ that maximizes $JSD_{ij}$. Yigit et al. (2018) proved that $JSD_{ij}$ can be considered as the mutual information between the two discrete random variables $\alpha_i$ and $U_{ij}$. Thus, the JSD index is also a measure of the amount of information one random variable, $\alpha_i$, contains about another, $U_{ij}$.
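A matching sketch of the JSD index follows (natural logarithms; each attribute pattern contributes a Bernoulli response distribution, and the toy numbers are invented for illustration):

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (0 log 0 treated as 0)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jsd_index(prior, p1):
    """Generalized JSD of the Bernoulli response distributions of item j,
    weighted by the current distribution pi_{i,t}."""
    # Entropy of the pi-weighted mixture distribution of U_ij ...
    mix1 = sum(pi * p for pi, p in zip(prior, p1))
    mixture_term = entropy([1 - mix1, mix1])
    # ... minus the pi-weighted average of the component entropies
    average_term = sum(pi * entropy([1 - p, p]) for pi, p in zip(prior, p1))
    return mixture_term - average_term

prior = [0.25] * 4
print(round(jsd_index(prior, [0.5, 0.5, 0.5, 0.5]), 12))  # 0.0, identical components
print(jsd_index(prior, [0.2, 0.2, 0.9, 0.9]) > 0)         # True
```

When every pattern induces the same response distribution the JSD is zero, mirroring the fact that such an item carries no mutual information about $\alpha_i$; the JSD criterion maximizes this quantity.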
Similar results have been observed by Kang et al. (2017) within the framework of dual-objective CD-CAT (Kang et al., 2017; McGlohen & Chang, 2008; Wang et al., 2014; Zheng et al., 2018). To simultaneously estimate examinees' $\alpha_i$ and general ability $\theta_i$, dual-objective CD-CAT is based on both a CDM and an item response theory model. In other words, the item response $U_{ij}$ has two Bernoulli distributions, $P_{ju}(\alpha_i)$ and $P_{ju}(\theta_i)$. Taking the two-parameter logistic model as an example, the probability of responding correctly to item $j$ is

$P_{j1}(\theta_i) = \frac{1}{1 + \exp[-D a_j(\theta_i - b_j)]}.$

Here, $D$ is a constant, $a_j$ is the discrimination parameter, and $b_j$ is the difficulty parameter. The JSD for dual-objective CD-CAT was defined as the mutual information between the two random variables $U_{ij}$ and $Z$, where $U_{ij}$ has a mixture distribution between $P_{ju}(\alpha)$ and $P_{ju}(\theta)$, and $Z$ is the binary indicator variable for each component distribution. For detailed information about the relationship between the JSD, KL information, and Fisher information, please refer to Kang et al. (2017).

Relationship Between the SHE and the JSD
The purpose of this section is to establish that the SHE and the JSD, as two item selection criteria in CD-CAT, are linearly related. Because the JSD itself is defined in terms of the SHE, we apply the well-known relationship between the mutual information (or JSD) and the SHE to establish a relationship between the item selection criteria developed from these two measures. The mutual information and SHE satisfy two well-known equations, 2.43 and 2.44, from Theorem 2.4.1 in Cover and Thomas (2006, p. 21), which can be written as

$I(X; Y) = H(Y) - H(Y \mid X)$ and $I(X; Y) = H(X) - H(X \mid Y),$

where $I(X; Y)$ and $H(X \mid Y)$ are, respectively, the mutual information and the conditional entropy for two random variables $X$ and $Y$. Let $X = \alpha_i$ and $Y = U_{ij}$; then $JSD_{ij} = I(\alpha_i; U_{ij}) = H(\pi_{i,t}) - SHE_{ij}$, which follows directly from the second well-known equation because, as shown in the "SHE Procedure" section, $SHE_{ij}$ can be written as the conditional entropy $H(\alpha_i \mid U_{ij})$ under the current distribution $\pi_{i,t}$. Next, we provide an alternative way to prove the above statement, which is useful for a better understanding of the relation. For simplicity, let the denominators, or normalizing constants, of Equations 2 and 4 be $C_1$ and $C_2$; note that $C_2$ depends on the response $u$. The detailed mathematical steps are as follows. Substituting Equations 4 and 5 into Equation 6, $SHE_{ij}$ can be written in an equivalent form in which every posterior probability appears as a quotient over $C_2$. Recall two basic logarithmic properties: the log of a quotient equals the difference between the logs of the numerator and denominator, and the log of a product equals the sum of the logs of the factors. Applying these properties to Equation 11, notice that $C_2$ and $\log C_2$ can be factored out of the third summation of the resulting Equation 12, as they remain constant over the summation index $c$ from 1 to $2^K$.
Hence, after changing the order of summation in Equation 12 and factoring the two constant terms $\pi_{i,t}(\alpha_c)\log\pi_{i,t}(\alpha_c)$ and $\pi_{i,t}(\alpha_c)$ out of the new second summation, the first term on the right-hand side of Equation 13 reduces to the entropy of the current distribution, $H(\pi_{i,t})$. Based on Equations 3, 13, and 14, $SHE_{ij}$ can be written as

$SHE_{ij} = H(\pi_{i,t}) - JSD_{ij},$ (16)

which can be rewritten as

$JSD_{ij} = H(\pi_{i,t}) - SHE_{ij}.$ (17)

The meaning of Equation 17 is consistent with the fact that the JSD, or mutual information, is a special case of a more general quantity called relative entropy. Because $H(\pi_{i,t})$ is not a function of the item parameters or the item responses of the candidate item $j$, it is a constant for examinee $i$. Thus, minimizing $SHE_{ij}$ is equivalent to maximizing $JSD_{ij}$: both select the same item for administration at the $(t+1)$th stage of testing. This completes the proof that the SHE and the JSD have a linear relationship under the G-DINA model. Note that both proofs of the relationship rely only on the form of an item response function for a binary response variable and do not depend on any particular CDM, because the item response function $P_{ju}(\alpha_i)$, the prior distribution $\pi_0(\alpha_c)$, and the current item response vector $u_i^{(t)}$ are all we need to calculate the SHE and the mutual information.
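The linear relation is easy to check numerically. The following sketch, with an arbitrary current distribution and invented item parameters (natural logarithms throughout), verifies that $SHE_{ij} + JSD_{ij}$ equals $H(\pi_{i,t})$ to floating-point precision:

```python
import math
import random

def entropy(p):
    """Shannon entropy of a discrete distribution (0 log 0 treated as 0)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def she_and_jsd(prior, p1):
    """Compute SHE_ij and JSD_ij for one candidate item."""
    she = 0.0
    for u in (0, 1):
        joint = [pi * (p if u == 1 else 1 - p) for pi, p in zip(prior, p1)]
        p_j_u = sum(joint)
        she += p_j_u * entropy([x / p_j_u for x in joint])
    mix1 = sum(pi * p for pi, p in zip(prior, p1))
    jsd = entropy([1 - mix1, mix1]) - sum(
        pi * entropy([1 - p, p]) for pi, p in zip(prior, p1))
    return she, jsd

random.seed(0)
prior = [0.1, 0.2, 0.3, 0.4]                    # any current pi_{i,t}
p1 = [random.uniform(0.1, 0.9) for _ in prior]  # arbitrary candidate item
she, jsd = she_and_jsd(prior, p1)
print(abs(she + jsd - entropy(prior)) < 1e-12)  # True: the linear relation holds
```

Because the residual is zero for any prior and any Bernoulli item response function, this check also illustrates that the identity does not depend on a particular CDM.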
Tables A1 and A2 in the appendix illustrate the computation of the SHE and JSD indices. There, the SHE and JSD are computed for two items with different item response distributions (i.e., different item parameters), using a discrete uniform prior distribution over attribute patterns. In these two tables, the relationship between the SHE and JSD for the two items satisfies Equation 16 (equivalently, Equation 17), and minimizing $SHE_{ij}$ is equivalent to maximizing $JSD_{ij}$. These empirical examples show that the two item selection criteria are expected to behave similarly in CD-CAT.

Simulation Study
Design

A small-scale simulation study was conducted to compare the performance of the SHE and JSD. Following a design similar to those of Cheng (2009) and Xu et al. (2016), the DINA model and five independent attributes were considered. To generate the four item banks, a Q-matrix for 300 items was first simulated; its entries were generated item by item and attribute by attribute, with each item having a 20% chance of measuring each attribute. Four item banks were considered: (a) in three banks, the slipping and guessing parameters were fixed at one of three levels (0.05, 0.1, or 0.2), and (b) in the fourth, both slipping and guessing parameters were randomly drawn from a uniform distribution on the interval [0.1, 0.3]. Test length was fixed at either 5 or 10 items. The sample size was set to 2,000 examinees, whose attribute patterns were randomly drawn from all possible attribute patterns with equal probability. Details of the simulation design are presented in Table A3 in the appendix.
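The item bank generation described above can be sketched as follows. This is a sketch under assumptions: the paper does not state how all-zero Q-matrix rows (items measuring no attribute) are handled, so we redraw them here, and the seed is arbitrary.

```python
import random

random.seed(123)           # arbitrary seed, for reproducibility only
K, J = 5, 300              # five attributes, 300 items per bank

def simulate_q_row(k, p=0.2):
    """One Q-matrix row: each attribute is measured with probability p.
    All-zero rows are redrawn (an assumption; the paper does not specify)."""
    while True:
        row = [int(random.random() < p) for _ in range(k)]
        if any(row):
            return row

q_matrix = [simulate_q_row(K) for _ in range(J)]
# Item bank (b): slipping and guessing drawn independently from U(0.1, 0.3)
bank = [{"q": row,
         "s": random.uniform(0.1, 0.3),
         "g": random.uniform(0.1, 0.3)} for row in q_matrix]
print(len(bank))  # 300
```

The three fixed-parameter banks would be built the same way, with the `uniform` draws replaced by the constants 0.05, 0.1, or 0.2.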
To examine how the simulation of item responses in CD-CAT affects the performance of the SHE and JSD, two types of CAT simulation were considered: full simulations and post hoc simulations (Magis et al., 2017). In a full CAT simulation, the item response of examinee $i$ to item $j$ was randomly drawn from a Bernoulli distribution, Bernoulli$(P_{j1}(\alpha_i))$; full simulations therefore allow an examinee's responses to the same set of items to differ across administrations. Under the post hoc simulation scenario, a complete item response matrix was created for all examinees on each item bank before the CD-CAT administrations, and the responses to the items selected by the SHE or JSD were drawn from that complete matrix. Post hoc simulations were included primarily so that exactly the same item responses were used by the two item selection methods (SHE and JSD) under each item bank in CD-CAT. Each type of simulation process was replicated 100 times under each condition, and the final estimates of the attribute patterns of all examinees were recorded.

Results
The attribute-level recovery rate is defined as the proportion of each attribute that is correctly identified. The pattern recovery rate is defined as the proportion of entire attribute patterns that are correctly recovered. Means and standard deviations of attribute-level and pattern recovery rates for each level of item parameters are shown in Tables 1-4. For the SHE, our results are consistent with those of Xu et al. (2016). Consistent with our theoretical conclusions, the SHE and JSD behaved quite similarly: their attribute-level and pattern recovery rates were very close to each other under all conditions. Figure 1 presents pattern recovery rates for different test lengths and simulation types with slipping and guessing parameters drawn from U(0.1, 0.3). From the two top panels of Figure 1, it can be observed that the pattern recovery rates obtained by the SHE and JSD are identical in each replication under post hoc simulations. Under full simulations, in contrast, the pattern recovery rates for the SHE and JSD differed across replications because different item responses affected item selection. When the test length was 5, 43% of the pattern recovery rates of the SHE were higher than the mean pattern recovery rate of the JSD, and 51% of the pattern recovery rates of the JSD were higher than the mean pattern recovery rate of the SHE. The percentages became 49% and 60%, respectively, when the test length was 10. This result is consistent with a previous finding: "The mutual information item selection algorithm generates nearly the most accurate attribute pattern recovery in more than half of the conditions" (Wang, 2013, p. 1030).
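The two outcome measures defined above can be sketched as follows (the toy true and estimated patterns are invented for illustration):

```python
def recovery_rates(true_patterns, est_patterns):
    """Attribute-level and pattern recovery rates over a sample.

    Returns (per-attribute recovery rates, whole-pattern recovery rate).
    """
    n = len(true_patterns)
    k = len(true_patterns[0])
    attribute = [sum(t[a] == e[a] for t, e in zip(true_patterns, est_patterns)) / n
                 for a in range(k)]
    pattern = sum(t == e for t, e in zip(true_patterns, est_patterns)) / n
    return attribute, pattern

true = [[1, 0], [1, 1]]
est = [[1, 0], [0, 1]]     # second examinee's first attribute misclassified
attr, patt = recovery_rates(true, est)
print(attr, patt)  # [0.5, 1.0] 0.5
```

A single misclassified attribute lowers the pattern recovery rate as much as a wholly wrong pattern, which is why the pattern rate is always the stricter of the two criteria.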
We also checked whether the two item selection algorithms selected the same set of items for each examinee under post hoc simulations. For the first three item banks, the two algorithms based on the SHE and JSD indeed selected the same set of items, but in slightly different orders. Because all items in these banks have the same item parameter values, some items presented in different positions have the same value of the SHE or JSD; for example, two items with the same item parameters but a single distinct attribute may have the same value of the SHE or JSD. For the fourth item bank, the two item selection algorithms selected the same set of items.

Discussion
In this study, we completed the proof that the SHE procedure and the JSD are linearly related under CDMs. In other words, we showed that minimizing the SHE and maximizing the JSD can be used interchangeably because they select the same items in CD-CAT. The two measures are linearly related but not equal: they satisfy $JSD_{ij} = H(\pi_{i,t}) - SHE_{ij}$. Although they are not equal, item selection methods based on the SHE and the JSD will select the same item for administration at the next stage of testing (choosing at random among items tied on the index value), because $H(\pi_{i,t})$ is a constant for examinee $i$, so minimizing $SHE_{ij}$ is equivalent to maximizing $JSD_{ij}$ at the current stage of testing.
This study is not without limitations. Theoretically, the SHE, KL information, and mutual information are three ways to measure uncertainty, and they are related to one another. It would be interesting to further investigate the relationships among item selection indices based on the KL information, the SHE, the JSD, and other indices under general dichotomous or polytomous CDMs. For example, the GDI and the MPWKL might be related, because they perform similarly and both outperform the PWKL in terms of correct attribute classification rates or test lengths. We believe the GDI is simply a weighted variance of an item's success probabilities across attribute patterns given an attribute pattern distribution (Kaplan et al., 2015):

$GDI_j = \sum_{c} \pi(\alpha_c)\left[P_{j1}(\alpha_c) - \bar{P}_j\right]^2, \quad \text{where } \bar{P}_j = \sum_{c} \pi(\alpha_c) P_{j1}(\alpha_c).$

Therefore, one could start by comparing the weighted KL with this weighted variance to establish a relationship.

The findings of this study may contribute to the growing literature on formative assessments. First, theoretical derivations and empirical examples have shown that both indices (SHE and JSD) are expected to select the same next item given the response pattern on the same set of previously administered items in CD-CAT. Consistent with our theoretical conclusions, simulation results have shown that the SHE and JSD behaved quite similarly in terms of attribute-level and pattern recovery rates. This finding can help practitioners choose an effective item selection algorithm (SHE or JSD) in the development and application of CD-CAT systems in the field of educational and psychological measurement. Second, the effectiveness of the item selection algorithm in CD-CAT will affect the quality of curriculum delivery and the outcomes of learning.
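The weighted-variance reading of the GDI suggested above can be sketched directly. This is our illustration of the variance form with invented toy numbers, not code from Kaplan et al. (2015):

```python
def gdi(weights, p1):
    """GDI as the weighted variance of an item's success probabilities
    across attribute patterns, under a pattern distribution `weights`."""
    p_bar = sum(w * p for w, p in zip(weights, p1))         # weighted mean
    return sum(w * (p - p_bar) ** 2 for w, p in zip(weights, p1))

# Uniform pattern distribution over four hypothetical patterns; the item
# separates two low-probability patterns from two high-probability ones.
print(gdi([0.25] * 4, [0.2, 0.2, 0.9, 0.9]))
```

Like the SHE and JSD, the GDI is zero exactly when the success probabilities do not vary across patterns, which is what makes a comparison with weighted KL-based indices plausible.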
If individual diagnosis results with high measurement precision can be provided by an effective item selection algorithm in CD-CAT, then diverse instructional materials can cater to the diverse needs and specific knowledge states of all learners (Lashley, 2019). Finally, information-based indices are now not only widely applied in CD-CAT but are also useful at any test construction stage in which items are selected on the basis of their statistical characteristics (e.g., Henson & Douglas, 2005; Henson et al., 2008; Kuo et al., 2016). For example, the cognitive diagnostic index, the attribute-level discrimination index, and their modified versions, as KL-information-based measures, have been used for the construction of diagnostic tests. Future research on automated test assembly for cognitive diagnosis will expand the scope of application of the current finding.

Table A1. Example 1 for the Illustration of Computation of Values of Indices for the SHE and JSD.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially supported by the Key Project of National Education Science "Twelfth Five Year Plan" of the Ministry of Education of China (Grant No. DHA150285).

Table A3. Details of the Simulation Design.

Attribute structure: Independent structure with five attributes.
CDM: The DINA model.
Examinees: Sample size is 2,000; attribute patterns are generated by taking one of the 2^5 possible patterns with equal probability.
Item banks: Each of four item banks consists of 300 items; each item has a 20% chance of measuring each attribute; item parameters are set to s = g = 0.05, s = g = 0.1, s = g = 0.2, or s ~ U(0.1, 0.3) and g ~ U(0.1, 0.3).
CD-CAT: Test length is fixed at either 5 or 10 items; the two item selection indices are the SHE and JSD with a uniform prior distribution; the MLE method is used to estimate examinees' attribute patterns.
Simulations: Full simulations or post hoc simulations are used to generate item responses.
Note. CDM = cognitive diagnostic model; DINA = deterministic inputs, noisy "and" gate; CD-CAT = cognitive diagnostic computerized adaptive testing; SHE = Shannon entropy; JSD = Jensen-Shannon divergence; MLE = maximum likelihood estimation.