Development of a New Measure of Cognitive Ability Using Automatic Item Generation and Its Psychometric Properties

Measurement of cognitive ability has played a key role in cognitive science's effort to understand human intelligence and the mind. As data-scientific approaches to cognitive neuroscience have developed, there has been growing demand for a measure that can capture cognition over short, repeated intervals. This paper introduces an innovative measure of cognitive ability based on an automatic item generation approach, which can measure cognitive ability efficiently and effectively over time, and examines its psychometric properties. Content validity of the assessment was grounded in the Cattell-Horn-Carroll theory, and construct validity, via convergent and discriminant validity, was examined by confirmatory factor analysis. The reliability of the measure, examined through internal consistencies as well as test-retest reliabilities of each subdomain of cognitive ability, was satisfactory. These psychometric properties clearly support the measure's potential utility in both educational and clinical settings, especially in fields requiring repeated measures of cognitive ability.


Introduction
The study of the mind in the human brain is one of the main research topics in cognitive science (Friedenberg & Silverman, 2006), and the purpose of cognitive-scientific study of the mind is to learn the mechanisms that underlie cognitive performance (Sternberg, 1986). Understanding these brain mechanisms requires a comprehensive and efficient assessment tool for measuring cognitive abilities (Gur et al., 2010), along with identifying and understanding the structure of human cognition (Wasserman, 2019). Even when the domains of cognitive ability are assumed to be identified, however, developing a practical assessment tool remains a challenge because of a lack of consensus between theoretical and practical conceptions of intelligence (Canivez & Youngstrom, 2019). Despite this challenge, many tools and scales have been developed on the strength of psychometric theories (Beaujean, 2015; Bruijnen et al., 2020; Caemmerera et al., 2020; Dombrowski et al., 2021; Geisinger, 2019; McDicken et al., 2019). Based on the Cattell-Horn-Carroll (CHC; Bryan & Mayer, 2020; Flanagan et al., 2013; McGrew, 2009; Schneider & McGrew, 2012) theory of cognitive abilities, this study demonstrates how an innovative psychometric method, known as automatic item generation (AIG; Drasgow et al., 2006; Embretson & Yang, 2006; Gierl & Haladyna, 2013; Gierl et al., 2021; Irvine & Kyllonen, 2002), helps to develop a new practical assessment tool of cognitive ability.
Cognitive ability tests are often used as repeated measures in longitudinal studies, although serial testing has an inherent limitation known as the "practice effect" (Kaufman, 1990; Temkin et al., 1999). For example, in many clinical settings, serial cognitive tests are administered to investigate the effect of treatments on changes in cognitive abilities (Cerulla et al., 2019; Elman et al., 2018; D. M. Jacobs et al., 2017) and to make decisions about disease progression or recovery (Beglinger et al., 2005). Serial cognitive assessments are also essential when new pharmaceutical treatments for conditions influencing cognition are developed (Beglinger et al., 2005). A similar demand exists in educational settings, particularly in longitudinal studies. At the same time, repeated testing in educational settings may also produce the practice effect, defined as changes in a person's test performance on re-testing (Hausknecht et al., 2007) due to examinees' memory of and familiarity with items.
Although individually administered cognitive assessment tools provide a wealth of information about examinees, group-based tests have received considerable attention in both clinical and educational settings, especially as diagnostic tests for screening purposes. Clinicians continue to consider group tests because short clinical test intervals leave little time to screen a patient's current ability (Sparrow & Davis, 2000). In educational settings, individually administered ability tests, though very helpful for diagnostic purposes, have shortcomings for talent identification (Lohman & Gambrell, 2012). In particular, an equity issue arises because the high cost of individualized assessment limits which children can be assessed: only those who can afford to pay for tests are given the opportunity to test or retest (Renzulli, 2005). The nonverbal battery of the Cognitive Abilities Test (CogAT; Lohman, 2011) and the Naglieri Nonverbal Ability Test (NNAT; Naglieri, 2008), both administered as group tests, were developed for the identification of gifted students in educational settings, but the CogAT nonverbal test is susceptible to practice effects (Lohman & Gambrell, 2012). Thus, there has been demand for a group-based measure of nonverbal cognitive ability that reduces practice effects.
To address the issues of practice effects and group administration in a measure of cognitive ability, we developed a new measure of cognitive ability using automatic item generation (AIG; Drasgow et al., 2006; Embretson & Yang, 2006; Gierl & Haladyna, 2013; Irvine & Kyllonen, 2002). AIG is a method of producing computerized item models that generate item instances with similar psychometric properties, such as item difficulty; the approach is therefore highly effective at reducing the practice effect. Compared with the traditional (manual, item-by-item) writing process, the AIG-based method is known to be an effective and efficient assessment development process (Wainer, 2002). Because AIG can efficiently generate isomorphic items (e.g., items at a similar level of difficulty) from item models, it offers further benefits: increased test security through minimized item exposure, a reduced unit cost per assessment item, and improved test-retest reliability in repeated-measures situations (Choi, 2018; Wainer, 2002).
In addition, we applied the Cattell-Horn-Carroll theory of cognitive abilities, which is well known as a comprehensive and integrated model that precisely specifies the range of cognitive abilities. Accordingly, the current study has two objectives: (1) to demonstrate how the new nonverbal cognitive ability test was developed using AIG and (2) to examine its psychometric properties. To meet these objectives, we hypothesize that the new measure of cognitive ability based on the CHC theory holds validity and reliability, as evaluated through empirical data analysis.

Cognitive Domains in the Measure of Cognitive Ability
To address the demand for group-administered nonverbal cognitive measures, we developed a new measure of cognitive ability (hereafter, the MOCA) comprising item models that measure fluid reasoning (Gf) and visual processing (Gv) in the CHC theory. The notion of an item model in AIG restructures the process of establishing guidelines and standards in traditional item writing by using computer code. AIG allows item and test writers to specify the constructs to be measured a priori using item/test design principles (Bormuth, 1970; Mislevy, 2018; Thorndike, 1971), which is important because cognitive domains have been identified over the past century and their structure is well understood and widely accepted in the CHC theory.
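To make the idea of an item model concrete, the following is a minimal, hypothetical sketch (it is not the CAFA software used for the MOCA): a template plus a fixed rule family, from which each call generates an isomorphic instance with new surface features.

```python
import random

def number_analogy_model(rng):
    """Hypothetical item model 'a : b :: c : ?', where b = a * k.
    Every generated instance shares the same rule family (and thus a
    similar difficulty), while the surface numbers vary."""
    k = rng.choice([2, 3, 4])        # fixed rule family -> isomorphic instances
    a = rng.randint(2, 9)
    c = rng.randint(2, 9)
    key = c * k                      # correct answer
    options = [key - k, key, key + k, key + 2 * k]  # one key, three distractors
    rng.shuffle(options)
    return {"stem": f"{a} : {a * k} :: {c} : ?", "options": options, "key": key}
```

Generating a fresh instance per administration is what lets AIG curb item exposure: two examinees, or two occasions, see different numbers but the same underlying reasoning step.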
In the MOCA, fluid reasoning (Gf) refers to the ability to solve unfamiliar problems without relying on previous knowledge, and the Gf portion of the MOCA is specified with three sub-domains: (1) inductive reasoning (IR), (2) general sequential reasoning (SR), also known as deductive reasoning, and (3) quantitative reasoning (QR). Visual processing (Gv) refers to the ability to use visual imagery to solve problems, and the Gv portion of the MOCA measures the capability to use simulated mental imagery. For example, the MOCA measures Gv by asking examinees to simulate how the movement of one figure affects another, or how figures might look from a different angle, as discussed in detail in the Results section.

Item Models Created in the Measure of Cognitive Ability (MOCA)
Initially, we developed 100 item models covering the two broad abilities, Gf and Gv, which generate item instances measuring four sub-domains: inductive reasoning (IR), sequential reasoning (SR), and quantitative reasoning (QR) within fluid reasoning (Gf), and visualization (Vz) within visual processing (Gv). We then constructed two forms of the MOCA test, each consisting of 36 item models. Because 18 item models are shared by both forms as common item models, 54 of the 100 item models were used in total. Table 1 summarizes the structure of each form with the difficulty levels of the item models. Across the two forms there are 36 unique item models (18 per form) and 18 common item models; in terms of cognitive domains, 45 item models measure Gf and 9 measure Gv. All item models were created using the CAFA AIG software (Choi & Zhang, 2019).

Empirical Data
The data used in this study, approved by the university IRB (with both students' and parents' consent), were obtained from the MOCA administered at two time points with a 2-week interval. Because the test is nonverbal, teachers helped students only with accessing their accounts and taking the MOCA online. A total of 1,198 participants, comprising fifth graders (n = 141), sixth graders (n = 122), seventh graders (n = 300), eighth graders (n = 298), and ninth graders (n = 337) across 56 classes in 4 schools, were included. At the second time point (after 2 weeks), 334 (27.9%) of the 1,198 students participated: fifth graders (n = 119), sixth graders (n = 114), eighth graders (n = 40), and ninth graders (n = 61) across 21 classes in 4 schools. Of the 1,198 students at Time 1, 602 took Form A and 596 took Form B; of the 334 students at Time 2, 174 took Form A and 160 took Form B. The MOCA tests were administered repeatedly as two different forms in Spring 2019. When examining the psychometric properties with repeated measures, we considered only the 334 students who participated at both time points.

Psychometric Properties of the MOCA
Confirmatory factor analysis. Based on the structure of the cognitive domains described in the CHC theory, we conducted confirmatory factor analyses (CFAs) to examine whether the structure is supported by the empirical data. We thus conducted two CFAs, one for each dataset obtained from the two forms. At Time 1, the first dataset (Form A) contained 602 students and the second dataset (Form B) contained 596 students.
Factor reliability. As a psychometric property indicating the degree to which factor scores are precise, the reliability of each factor was examined using the factor reliability (ρ; Raykov, 1997, 2004). The factor reliability is defined as the ratio of explained variance to total variance computed from CFA parameters:

ρ = (Σi λi)² φ / [(Σi λi)² φ + Σi θii],

where Σi λi is the sum of the estimated unstandardized factor loadings among indicators of the same factor, φ is the estimated factor variance, and Σi θii is the sum of the unstandardized error variances of those indicators. In CFA, factor loadings, error variances, and error covariances are estimated, and these determine the true and total variance. Thus, factor reliability computed from the CFA estimates is preferable to Cronbach's alpha computed from unrefined composite scores for the scale (Brown, 2006).
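The computation is straightforward; the sketch below is illustrative (the function name and the numbers are ours, not estimates from the MOCA data):

```python
def factor_reliability(loadings, factor_variance, error_variances):
    """Raykov's factor (composite) reliability:
    rho = (sum(lambda))^2 * phi / ((sum(lambda))^2 * phi + sum(theta_ii))."""
    true_var = sum(loadings) ** 2 * factor_variance
    return true_var / (true_var + sum(error_variances))

# Illustrative unstandardized CFA estimates for one factor:
rho = factor_reliability([1.0, 0.8, 0.9], 1.0, [0.30, 0.40, 0.35])
```

With these illustrative estimates the true variance is (2.7)² × 1.0 = 7.29 and the total is 7.29 + 1.05, so ρ ≈ .87.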

Construct validity.
As a further psychometric evaluation, we examined construct validity by obtaining convergent validity and discriminant validity, in addition to showing that factor loadings were greater than .45 (Brown, 2006, 2015). According to Brown (2015), the CFA results provide evidence of how strongly indicators of a latent variable are interrelated (convergent validity) and how weakly latent variables are correlated with one another (discriminant validity). Convergent validity was supported when factor reliabilities were greater than .70 (Nunnally & Bernstein, 1994), and discriminant validity was supported when factor correlations were lower than .80 (Brown, 2015).
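The two thresholds can be expressed as a simple check; the function below is a hypothetical helper of ours, with the cut-offs (.70 and .80) taken from the criteria above:

```python
def construct_validity_checks(factor_reliabilities, factor_correlations):
    """Convergent validity: every factor reliability > .70 (Nunnally & Bernstein).
    Discriminant validity: every inter-factor correlation < .80 (Brown)."""
    convergent = all(r > 0.70 for r in factor_reliabilities)
    discriminant = all(abs(c) < 0.80 for c in factor_correlations)
    return convergent, discriminant
```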

Measure of Cognitive Ability (MOCA)
Item models in inductive reasoning (IR). In the MOCA, as the key sub-domain of fluid reasoning (Gf), IR is defined as "the ability to discern rules and patterns in what is observed" (Schneider & Newman, 2015). IR ability is measured in two sub-tests, figure classification (FC) and figure matrices (FM), depicted in Figure 1. An item instance from an FC item model is shown in Figure 2a, with a set of figures serving as the item stem. To answer correctly, examinees must recognize, from the set of figures in A1, the conceptual pattern of an object sliding toward the right-hand corner, and then select the figure that matches the set in the same way. The five options given in the item model were generated as one correct answer and four distractors. The cognitive model for solving the item is depicted as a flowchart in Figure 3.
Figure 2b demonstrates an item instance from an FM item model. Examinees are asked to find the proper object for the lower right-hand cell. To solve the problem, examinees must induce the following relations: (1) the shape changes from row to row, (2) two different objects move toward the center from column to column, and (3) the two objects end up at the center. These relations can be induced from A1 and A2 in Figure 2b. Once examinees understand the three relations, they select the proper image for A3 from the options generated from the alternatives in the item model. In their taxonomy of figural-relations items, P. I. Jacobs and Vandeventer (1972) pointed out that item models of this kind for IR ability can be generated from principles of cognitive test items and can reduce practice effects by varying patterns and objects. The procedure is shown as a flowchart in Figure 4.
Item models in quantitative reasoning (QR). QR, referring to "the ability to reason with quantities, mathematical relations, and operators" (Schneider & McGrew, 2018), is assessed in two further sub-tests, Number Analogies (NA) and Number Puzzles (NP). For example, in the NA sub-test shown in Figure 5a, examinees must identify the relation between the two given pairs in A1 and A2 and then select the most appropriate number for the question mark (?) in A from the options given below. In the NP item presented in Figure 5b, a system of equations with unknowns denoted by a question mark (?) and a diamond (◊) is provided. Unlike the items discussed above, examinees first solve the equation in A1 by making both sides equal. After identifying ◊ (= 16), examinees substitute it into equation A, which becomes 48 ÷ ? = 16, and then select the number from the options that makes both sides of equation A equal. The NP sub-test thus differs from NA, FM, and FC in requiring a backward rather than a forward solution procedure.
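The backward-solving structure of an NP item can be sketched as a hypothetical item model (simplified relative to the actual MOCA models):

```python
import random

def number_puzzle_model(rng):
    """Hypothetical NP item model: solve A1 for the diamond first,
    then substitute it into A and solve for '?' (backward procedure)."""
    diamond = rng.choice([4, 8, 12, 16])
    answer = rng.randint(2, 9)                 # the key for '?'
    left = diamond * answer                    # so that left / ? = diamond
    split = rng.randint(1, diamond - 1)        # A1: diamond expressed as a sum
    return {
        "A1": f"◊ = {split} + {diamond - split}",
        "A": f"{left} ÷ ? = ◊",
        "key": answer,
    }
```

For instance, with ◊ = 16 the second equation might read 48 ÷ ? = 16, whose key is 3, matching the worked example in the text.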
Item models in sequential reasoning (SR). SR, defined as "the ability to reason logically using known premises and principles" (Schneider & McGrew, 2012), is assessed in the Number Series (NS) sub-test of the MOCA, in which a series of numbers is presented. To solve an NS problem, examinees must identify the relation between the numbers in the series and select, from the options below, the number that should come next. An example NS item is shown in Figure 6a.
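A minimal, hypothetical NS item model might use an arithmetic series with a constant step (the actual MOCA models may use richer rules):

```python
import random

def number_series_model(rng):
    """Hypothetical NS item model: a 4-term arithmetic series; the key is
    the next term. Instances are isomorphic, varying only start and step."""
    start = rng.randint(1, 20)
    step = rng.choice([2, 3, 5, 7])
    series = [start + i * step for i in range(4)]
    key = start + 4 * step
    options = [key - step, key, key + step, key + 2 * step]
    rng.shuffle(options)
    return {"series": series, "options": options, "key": key}
```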

Item models in visualization (Vz).
Vz, within visual processing (Gv), is defined as "the ability to perceive complex patterns and mentally simulate how they might look when transformed (e.g., rotated, twisted, inverted, changed in size, partially obscured)" (Schneider & McGrew, 2018, p. 126). It is the key ability of Gv, and in the MOCA it is measured in the Paper Folding (PF) subtest, where a series of figures, A, is presented as shown in Figure 6b. The figures show a square or house-shaped (⌂) piece of paper that is folded, with holes cut in circle, triangle, clover, or other shapes. A one-headed arrow indicates how the paper is folded. Examinees must visualize what figure A will look like when the paper is unfolded and select the most appropriate figure from the possible answers.

Reliabilities and Validity Based on CFA of MOCA
Model evaluation. We fit the second-order factor model depicted in Figure 7 (with the items in Figure 1) to the two datasets collected from the two forms of the MOCA at Time 1. The hierarchical structure distinguishes two levels: the sub-domains IR, QR, SR, and Vz, nested within Gf (IR, QR, and SR) and Gv (Vz), where Gv serves as a phantom construct for Vz. We obtained good fit indexes for both datasets: RMSEA = 0.032 and 0.026, CFI = 0.969 and 0.977, and SRMR = 0.072 and 0.062 for Form A and Form B, respectively.
Factor reliability and validity. We examined the five factor reliabilities for IR (.938 and .877), SR (.899 and .934), QR (.962 and .960), Vz (.916 and .923), and Gf (.808 and .917) for Form A and Form B, respectively, where Gv is identical to Vz. These factor reliabilities supported convergent validity, with factor loadings greater than .45 for all four sub-domains except item 2 of QR in Form B (.429), item 27 of IR in Form A (.317), and the IR loading on Gf in Form A (.350). The factor correlations between Gf and Gv were .685 for Form A and .642 for Form B, both lower than .80, which supports discriminant validity. Thus, based on the CFA results for Form A and Form B, the construct validity of the MOCA was confirmed.

Item Analyses Based on the Two-Parameter Logistic (2PL) Model for the MOCA

Item parameter estimates of Form A and Form B.
The hierarchical factor structure of the MOCA was confirmed via CFA. We then examined the conformity between the a priori difficulty levels of the item models from AIG and the difficulty levels of the item instances estimated via item response theory (IRT). When selecting the 54 item models for the two MOCA forms, we considered three a priori difficulty classes: low, medium, and high. Given the complexity of the solution procedures shown in Figures 3 and 4, the three classes do not imply exact criteria or cut-offs for the IRT-estimated difficulties. Comparing the three classes, we found that items in the low class had significantly lower estimated difficulties than items in the medium or high classes (t33 = −1.963, −3.687, −2.199, and −2.256 and p = .063, .001, .039, and .032 for Form A at Time 1, Form A at Time 2, Form B at Time 1, and Form B at Time 2, respectively). However, the estimated difficulties of the medium and high classes did not differ significantly from each other, and the discrimination parameters did not differ by a priori difficulty level. The results are summarized in Tables 2 to 4.
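For reference, the 2PL model gives the probability of a correct response as P(θ) = 1 / (1 + exp(−a(θ − b))), and a class comparison of this kind is a two-sample t-test on estimated b parameters. The sketch below uses made-up difficulty estimates, not the MOCA values:

```python
import math
from scipy import stats

def p_correct_2pl(theta, a, b):
    """2PL: probability of a correct response given ability theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Made-up item difficulty estimates grouped by a priori class:
b_low = [-1.2, -0.8, -1.0, -0.6, -0.9]
b_medium = [0.1, 0.4, -0.1, 0.3, 0.2]
t_stat, p_val = stats.ttest_ind(b_low, b_medium, equal_var=False)  # Welch's test
```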
Test-retest reliability based on two time points. Using the sample of examinees who took the MOCA at both Time 1 and Time 2, we examined test-retest reliabilities using Pearson correlation coefficients based on the narrow-ability factor scores of IR, QR, SR, and Vz. All correlations of narrow abilities between Time 1 and Time 2 were significant (p < .001). In addition, each ability's test-retest correlation was higher than its cross-correlations with the other narrow abilities, except for SR (Table 4). The correlations were, however, modest, ranging from .415 for SR to .606 for QR. As expected, correlations between Vz and each of IR, QR, and SR were lower than the other correlations (Figure 8).

Practice Effects
Based on the narrow abilities of IR, QR, SR, and Vz, we examined the mean differences over time using dependent-samples t-tests. For both Form A and Form B, IR and Vz showed no change over time, whereas QR and SR showed improvement at Time 2. The results are listed in Table 5. In terms of sub-tests, we observed that Figure
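The dependent-samples t-test compares each examinee with themselves across occasions. A sketch with scores fabricated purely for illustration (not the MOCA data):

```python
from scipy import stats

# Illustrative (not actual MOCA) QR factor scores for the same 8 examinees:
qr_time1 = [0.10, -0.20, 0.35, 0.00, 0.15, -0.05, 0.25, 0.05]
qr_time2 = [0.30, 0.05, 0.50, 0.20, 0.40, 0.10, 0.45, 0.25]

# Dependent-samples (paired) t-test for a mean change between time points:
t_stat, p_val = stats.ttest_rel(qr_time1, qr_time2)
```

A significant negative t statistic here would indicate higher Time 2 scores, the pattern the text reports for QR and SR.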

Discussion
The MOCA is not an exhaustive measure of all the cognitive domains depicted in Figure 1; it measures four narrow abilities, inductive reasoning (IR), sequential reasoning (SR), quantitative reasoning (QR), and visualization (Vz), nested within two broad abilities, fluid reasoning (Gf) and visual processing (Gv). These four domains are, however, essential in educational settings, which is why many other group-administered measures also include them. Moreover, the MOCA was uniquely created using AIG, which is beneficial in educational settings because the test can be used for repeated measurement.
The MOCA has been used in educational settings to understand students' cognitive strengths and weaknesses, and it will also be launched in clinical settings, including an Atopic Dermatitis school program (Jang et al., 2015). Although Elbin et al. (2019) recently showed that the Immediate Post-Concussion Assessment and Cognitive Testing (ImPACT), a computerized neurocognitive battery for short-term serial assessment of neurocognitive functioning, is suitable for repeated administration, the use of that battery is limited to clinical settings. The MOCA also covers broad age groups because it is a nonverbal cognitive test. In this study, the MOCA was validated psychometrically, including through item analysis and an examination of practice effects, and the reliabilities of its cognitive domains and the hierarchical structure of its six subtests were validated via CFA. Importantly, these properties were established at the level of item models in automatic item generation (AIG) rather than at the level of individual item instances. In other words, the MOCA is psychometrically sound at the item-model level and is therefore applicable to intensive longitudinal data analysis, exploiting the massive item production made possible by the isomorphism of item instances generated from item models. Thus, each individual's MOCA assessment will contain different items, yet all conform to the same set of item models in terms of measuring cognitive ability.
With these psychometric properties, the MOCA can be used across broad age ranges of students, and its capability as a group-administered measure contributes to educational equity in access to cognitive ability testing. As an AIG-implemented measure, the MOCA not only maintains item security, preventing the impact of construct-irrelevant variance from practice effects, but also ultimately provides more economical tests by generating many item instances from each item model. As we go through the COVID-19 pandemic, educators and parents are concerned about disparities in students' academic achievement, which is associated with their cognitive

Limitations
In this study, the number of assessments was limited to two time points; the practice effects should be examined with intensive longitudinal data. In addition, linking in the item analysis was not performed because this study focused on reliability and validity rather than scale development; in future research, we will examine the status of cognition using anchor items. Although we considered two forms of the MOCA, age- and grade-based versions have not yet been developed. In educational settings, we will consider MOCA tests tailored by grade level and by gifted/special education status; in clinical settings, we will consider tailored MOCA tests consisting of different sets of cognitive domains. As a new measure of cognitive ability, the MOCA should also be examined for concurrent validity with other cognitive measures. Because the MOCA is unique in implementing AIG, however, a concurrent validity study is not simply a matter of examining associations with other cognitive measures but requires adjustments between AIG and traditional item generation. We plan to study the concurrent validity.

Concluding Remarks
The MOCA, based on CHC theory and the AIG method, was developed as a new measure of cognitive ability suited to studies requiring repeated measures of cognition. We expect this psychometrically sound measure to serve cognitive science researchers who need to measure cognition repeatedly over time.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.