Summary Intervals for Model-Based Classification Accuracy and Consistency Indices

When scores are used to make decisions about respondents, it is of interest to estimate classification accuracy (CA), the probability of making a correct decision, and classification consistency (CC), the probability of making the same decision across two parallel administrations of the measure. Model-based estimates of CA and CC computed from the linear factor model have recently been proposed, but the parameter uncertainty of the CA and CC indices has not been investigated. This article demonstrates how to estimate percentile bootstrap confidence intervals and Bayesian credible intervals for CA and CC indices, which have the added benefit of incorporating the sampling variability of the parameters of the linear factor model into the summary intervals. Results from a small simulation study suggest that percentile bootstrap confidence intervals have appropriate coverage, although they display a small negative bias. Bayesian credible intervals with diffuse priors, in contrast, have poor coverage, but their coverage improves once empirical, weakly informative priors are used. The procedures are illustrated by estimating CA and CC indices from a measure used to identify individuals low on mindfulness for a hypothetical intervention, and R code is provided to facilitate the implementation of the procedures.

Test scores and scale scores are often used to make decisions about individuals. Examples include using scores to determine whether a student meets a minimal level of proficiency (Lee, 2010), whether a respondent is flagged by a screener for further treatment, or whether an individual should obtain a license to practice a profession (Moses & Kim, 2015). In many situations, responses from the test or scale are aggregated into an observed summed score $X^*_i$, and then one determines whether the summed score is above or below a cutpoint (Youngstrom, 2014). A critical measurement concern when using scores to make decisions is describing their expected classification accuracy (CA) and classification consistency (CC; American Educational Research Association et al., 2014).
CA refers to the probability of making a correct classification during the decision process. Three indices that describe CA are the classification rate, sensitivity, and specificity (Pepe, 2003; Youngstrom, 2014). The classification rate refers to the probability of making a correct decision. Sensitivity refers to the probability of correctly classifying individuals who meet the condition (e.g., identifying students who are truly proficient or selecting respondents who truly need the treatment). Specificity refers to the probability of correctly classifying individuals who do not meet the condition (e.g., identifying students who are not proficient or ruling out respondents who do not need the treatment). In this article, these three indices are referred to as CA indices. For the CA indices, the true condition is determined by a latent variable score (e.g., scores from a gold standard, a clinical interview, etc.), but in applied settings, the classification of respondents is often based on the summed score $X^*_i$, which suffers from measurement error. In contrast, classification consistency refers to the probability of making the same decision across two parallel administrations of the same test or scale (Livingston & Lewis, 1995). Note that consistency differs from accuracy in that consistency does not describe whether the classification is correct, only whether it is the same. The estimation of classification consistency assumes that there is no change in the construct (e.g., changes due to maturation, carry-over effects, treatment effects, etc.) between the administrations of the measure. Throughout the article, this index is referred to as the CC index.
In situations in which one has respondent-level classifications at Time 1, classifications at Time 2, and true classifications, the estimation of the CA and CC indices is a relatively simple task: one can cross-tabulate classifications at Time 1 with true classifications to estimate the CA indices, and cross-tabulate classifications at Time 1 and Time 2 to estimate the CC index. In this case, the CA and CC indices are proportions, so their uncertainty could be estimated using confidence intervals for proportions (Newcombe, 1998). However, limited resources might prevent researchers from having either true classifications or scores from a second administration of the measure. Therefore, model-based estimates of CA and CC that require only one administration of the measure have recently been proposed and investigated (Lee, 2010; Millsap & Kwok, 2004).
Previous research has shown that, when responses to the test or scale are treated as continuous, CA and CC indices can be calculated from the parameter estimates of a linear factor model (further discussed below; Millsap & Kwok, 2004). However, uncertainty estimation for model-based CA and CC has not been widely discussed, aside from brief treatments in prior work. Moreover, when the point estimates of factor model parameters are used to derive model-based CA and CC, the factor model parameters are treated as fixed, even though these parameters are subject to sampling variability. In practice, this source of uncertainty is ignored (e.g., Lee, 2010; Millsap & Kwok, 2004), but it may be consequential to the estimation of summary intervals for CA and CC indices. A promising solution is to estimate the uncertainty of model-based CA and CC indices using the bootstrap (Efron & Tibshirani, 1993) or Bayesian estimation (Levy & Mislevy, 2017). Both approaches provide natural frameworks to propagate the uncertainty of factor model parameters to derived quantities via their sampling procedures (Kelley & Pornprasertmanit, 2016; Levy & Mislevy, 2017). Recent work suggests that percentile bootstrap confidence intervals (Kelley & Pornprasertmanit, 2016) and Bayesian credible intervals (Pfadt et al., 2021; Tanzer & Harlow, 2021) have appropriate coverage when estimating test score reliability. As such, it is expected that bootstrap and Bayesian summary intervals could be viable approaches to estimate the uncertainty of CA and CC indices, although this has not been investigated.
This article has three goals: (a) show how to estimate bootstrap confidence intervals and Bayesian credible intervals for model-based CA and CC indices, (b) evaluate whether the summary intervals have appropriate coverage of the true value, and (c) facilitate the reporting of the uncertainty of model-based CA and CC indices by providing R code. The structure of the article is as follows. First, the estimation of CA and CC using parameters from the linear factor model is introduced. Then, background on bootstrapping and Bayesian inference is provided, along with how each can be used to incorporate parameter uncertainty into the summary intervals of the CA and CC indices. Next, the estimation and interpretation of bootstrap and Bayesian summary intervals are shown using an applied example. Finally, results from a small simulation study on the coverage of the percentile bootstrap and Bayesian summary intervals for the CA and CC indices are discussed. In the supplement, R code is provided to assist researchers with the implementation of this methodology.

Estimating Model-Based CA and CC
Below, some implied relations of the linear factor model for the expected mean and variance of summed scores are introduced, and then their connection to model-based CA and CC indices is reviewed.

Linear Factor Model and Some Implied Relations
Let $X_{ij}$ be the score of respondent $i$ on item $j$. When $X_{ij}$ is continuous and the items are homogeneous, the relationship between $X_{ij}$ and the latent variable $\eta$ underlying the assessment can be described using a unidimensional linear factor model,

$$X_{ij} = \tau_j + \lambda_j \eta_i + \epsilon_{ij}, \tag{1}$$

where $\tau_j$ is the intercept of item $j$, $\lambda_j$ is the factor loading of item $j$, and $\epsilon_{ij}$ is the unique score for person $i$ on item $j$, with variance $\psi_{jj}$ across individuals. Recall that researchers typically make decisions using the summed item responses for each respondent, $X^*_i$. Consequently, the relations implied by the linear factor model can be used to calculate the expected mean $\mu_{X^*}$ and variance $\sigma^2_{X^*}$ of $X^*$ (Millsap & Kwok, 2004),

$$\mu_{X^*} = \tau^* + \lambda^* \kappa, \qquad \sigma^2_{X^*} = (\lambda^*)^2 \Phi + \psi^*, \tag{2}$$

where $\tau^*$ is the sum of the intercepts, $\lambda^*$ is the sum of the factor loadings, $\psi^*$ is the sum of the variances of the unique scores, $\kappa$ is the mean of $\eta$, and $\Phi$ is the variance of $\eta$. For the estimation of the CA indices, we need to calculate $\mathrm{Cor}(X^*, \eta)$, the expected correlation between $X^*$ and $\eta$ (Millsap & Kwok, 2004),

$$\mathrm{Cor}(X^*, \eta) = \frac{\lambda^* \Phi}{\sigma_{X^*} \sqrt{\Phi}}, \tag{3}$$

which simplifies to $\lambda^*/\sigma_{X^*}$ when the linear factor model is identified by setting $\Phi = 1$. For the CC index, we need to calculate $\mathrm{Cor}(X^*_1, X^*_2)$, the correlation between the summed scores from two administrations of the measure in quick succession. Recall that the CC estimation assumes parallel administrations of the measure, which implies

$$\mathrm{Cor}(X^*_1, X^*_2) = \frac{(\lambda^*)^2 \Phi}{(\lambda^*)^2 \Phi + \psi^*}, \tag{4}$$

which is the equation for coefficient omega, a commonly used estimator of reliability for homogeneous tests (McDonald, 1999), and is equivalent to $\mathrm{Cor}(X^*, \eta)^2$ from Equation 3.
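To make these relations concrete, the following R sketch computes the implied summed-score moments, $\mathrm{Cor}(X^*, \eta)$, and coefficient omega from Equations 2 through 4. The parameter values are hypothetical, and this is not the authors' supplement code.

```r
# Hypothetical parameter values for a five-item measure
lambda <- c(.7, .6, .8, .5, .7)       # factor loadings, lambda_j
tau    <- c(3.2, 3.0, 3.5, 2.9, 3.1)  # item intercepts, tau_j
psi    <- c(.51, .64, .36, .75, .51)  # unique variances, psi_jj
kappa  <- 0                           # latent mean (fixed for identification)
phi    <- 1                           # latent variance (fixed for identification)

lambda_star <- sum(lambda)
tau_star    <- sum(tau)
psi_star    <- sum(psi)

mu_x   <- tau_star + lambda_star * kappa                 # Equation 2: mean of X*
var_x  <- lambda_star^2 * phi + psi_star                 # Equation 2: variance of X*
cor_xe <- lambda_star * phi / (sqrt(var_x) * sqrt(phi))  # Equation 3
omega  <- lambda_star^2 * phi / var_x                    # Equation 4 (coefficient omega)
all.equal(omega, cor_xe^2)                               # TRUE: omega equals Cor(X*, eta)^2
```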

Bivariate Distribution and Indices
Collecting the preceding developments, one can define two bivariate normal distributions (see Figure 1) used in the estimation of the CA and CC indices, which map onto the decision cross-tabulations described above (see Figure 1A and 1C). In an ideal scenario, one would select individuals who are above a cutpoint on $\eta$, which represents a gold standard, but because latent variable scores are not observed and rarely used in practice, selection takes place by comparing the observed summed score $X^*$ with a cutpoint on $X^*$. To study the relation between selecting individuals based on $X^*$ and on $\eta$, one can define a bivariate normal distribution based on $\mu_{X^*}$, $\sigma^2_{X^*}$, $\kappa$, $\Phi$, and $\mathrm{Cor}(X^*, \eta)$, on which cutpoints for $X^*$ and $\eta$ can be imposed. Imposing the cutpoints on the distribution forms quadrants A, B, C, and D (see Figure 1B), which are defined as follows: A is the proportion of respondents below both the cutpoint on $X^*$ and the cutpoint on $\eta$ (a true negative; a correct classification), B is the proportion below the cutpoint on $X^*$ but above the cutpoint on $\eta$ (a false negative; an incorrect classification), C is the proportion above the cutpoint on $X^*$ but below the cutpoint on $\eta$ (a false positive; an incorrect classification), and D is the proportion above both the cutpoint on $X^*$ and the cutpoint on $\eta$ (a true positive; a correct classification). The proportion of the bivariate normal distribution in each quadrant, estimated via integration, is then used to estimate the CA indices. One can estimate the classification rate, the probability of making a correct classification, as (A + D)/(A + B + C + D); sensitivity, the probability of correctly classifying those who meet the condition determined by $\eta$, as D/(B + D); and specificity, the probability of correctly classifying those who do not meet the condition determined by $\eta$, as A/(A + C). Similarly, one can estimate the CC index by imposing the cutpoints on $X^*$ on the bivariate normal distribution based on $\mu_{X^*_1}$, $\mu_{X^*_2}$, $\sigma_{X^*_1}$, $\sigma_{X^*_2}$, and $\mathrm{Cor}(X^*_1, X^*_2)$. Recall that parallel tests have the same mean and variance, so $\mu_{X^*_1} = \mu_{X^*_2} = \mu_{X^*}$ and $\sigma^2_{X^*_1} = \sigma^2_{X^*_2} = \sigma^2_{X^*}$, and the cutpoint on $X^*_1$ is the same as the cutpoint on $X^*_2$. Imposing the cutpoints on the distribution forms quadrants E, F, G, and H, shown in Figure 1D and defined analogously to quadrants A, B, C, and D in Figure 1B. Classification consistency, the probability of making the same classification in two parallel administrations of the measure, can be estimated as (E + H)/(E + F + G + H).
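A minimal sketch of the quadrant integration, continuing the hypothetical values from the previous block and assuming cutpoints at the mean; the pmvnorm function from the mvtnorm package performs the bivariate normal integration.

```r
library(mvtnorm)

cut_eta  <- 0                 # cutpoint on the latent variable
cut_x    <- mu_x              # cutpoint on the summed score
mu_ca    <- c(mu_x, kappa)    # joint mean of (X*, eta)
Sigma_ca <- matrix(c(var_x, cor_xe * sqrt(var_x * phi),
                     cor_xe * sqrt(var_x * phi), phi), 2, 2)

# Quadrant proportions of the (X*, eta) distribution; dimension order is (X*, eta)
A <- pmvnorm(upper = c(cut_x, cut_eta), mean = mu_ca, sigma = Sigma_ca)  # true negative
B <- pmvnorm(lower = c(-Inf, cut_eta), upper = c(cut_x, Inf),
             mean = mu_ca, sigma = Sigma_ca)                             # false negative
C <- pmvnorm(lower = c(cut_x, -Inf), upper = c(Inf, cut_eta),
             mean = mu_ca, sigma = Sigma_ca)                             # false positive
D <- pmvnorm(lower = c(cut_x, cut_eta), mean = mu_ca, sigma = Sigma_ca)  # true positive

classification_rate <- (A + D) / (A + B + C + D)
sensitivity         <- D / (B + D)
specificity         <- A / (A + C)

# CC: two parallel administrations with Cor(X*_1, X*_2) = omega
Sigma_cc <- matrix(c(var_x, omega * var_x, omega * var_x, var_x), 2, 2)
E <- pmvnorm(upper = c(cut_x, cut_x), mean = c(mu_x, mu_x), sigma = Sigma_cc)  # both below
H <- pmvnorm(lower = c(cut_x, cut_x), mean = c(mu_x, mu_x), sigma = Sigma_cc)  # both above
consistency <- E + H   # the four quadrant proportions sum to 1
```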
Note that the estimation of the CA and CC indices requires $\mu_{X^*}$ and $\sigma^2_{X^*}$, along with $\mathrm{Cor}(X^*, \eta)$. These quantities rely on the point estimates of $\tau$, $\lambda$, and $\psi$, as shown in Equations 2 and 3. In practice, when researchers use Equations 2 and 3, the estimates of the factor model are treated as fixed, but these estimates have standard errors and are subject to sampling variability. Consequently, the estimation of CA and CC ignores the uncertainty of the factor model parameters from the previous step. Although this might not be an issue in simple problems with a large sample size, ignoring the uncertainty of the factor model parameters might be consequential when estimating the uncertainty of the CA and CC indices (Levy & Mislevy, 2017), leading to artificially narrow confidence intervals for CA and CC. A possible solution is to turn to bootstrapping and Bayesian methods, which provide natural approaches to acknowledge the uncertainty in the factor model parameters in the estimation of CA and CC summary intervals. Both are discussed below.

Percentile Bootstrap for Confidence Intervals
A motivation for bootstrapping methods is that it can be better to estimate confidence intervals from the empirical distribution of a test statistic rather than from its theoretical distribution (Efron & Tibshirani, 1993), especially when the theoretical distribution deviates from normality. To approximate the empirical distribution of a test statistic with the bootstrap, one resamples the dataset with replacement (i.e., the same case can appear multiple times in each resample) many times (e.g., 500 or more), estimates the test statistic in each resampled dataset, and derives the empirical distribution of the statistic by plotting or summarizing the estimates across resampled datasets. Confidence intervals can then be derived using percentiles of the empirical distribution. For example, a 95% percentile bootstrap confidence interval is defined by the 2.5th and 97.5th percentiles of the empirical distribution. Recall that confidence intervals describe the long-run frequency with which the true value of the parameter would be covered if confidence intervals were repeatedly and identically constructed.
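As a minimal illustration with hypothetical data, the percentile bootstrap for a sample correlation might look as follows; the same logic applies to any statistic.

```r
set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y <- .5 * dat$x + rnorm(200)

B <- 500
boot_est <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample(nrow(dat), replace = TRUE)    # resample cases with replacement
  boot_est[b] <- cor(dat$x[idx], dat$y[idx])  # statistic in the resampled data
}
quantile(boot_est, c(.025, .975))  # 95% percentile bootstrap confidence interval
```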

Bayesian Estimation for Credible Intervals
To introduce Bayesian methods, it is useful to contrast them with frequentist methodology, which tends to dominate in the social sciences. Frequentist methods assume that the data are a random sample from the population and that the parameters we estimate (e.g., factor loadings, item intercepts, etc.) are fixed, so interest lies in the magnitude of the point estimate and its standard error. In Bayesian methods, the data are treated as fixed observations and the parameters are treated as random, so the parameters take probability distributions (Levy & Mislevy, 2017). Before the analysis, parameters are assigned prior distributions that encode the knowledge we have about them. This information can come from multiple sources, such as estimates from pilot data, results from a meta-analysis, information elicited from subject matter experts, or a combination of these (van de Schoot et al., 2021). Users specify the distribution for each parameter (e.g., normal, inverse gamma, etc.) and the numerical values that govern the shape of the distribution, and consequently the amount of information they encode. For the estimation of the CA and CC indices, one needs to specify priors for the factor loadings, item intercepts, and residual variances. Suppose that a researcher assigns a factor loading a normal prior with a mean of .5 and a standard deviation of 1. In this case, the researcher is effectively encoding the belief that the likely values of the factor loading lie between −1.5 and 2.5 (roughly the 2.5th and 97.5th percentiles of that prior normal distribution). Perhaps a meta-analysis suggests that the factor loading is unlikely to be negative, so one could specify a prior distribution in which the likely values are all positive, such as a normal distribution with a mean of .5 and a standard deviation of .2. In the case of pilot data, estimates and standard errors of the parameters of interest can be translated into priors, with the prior centered on the estimate and its variability a function of the standard error of the estimate. When information is obtained by eliciting expert knowledge, there are many tools and protocols to facilitate the process, discussed elsewhere (see Veen et al., 2017, and references therein). Overall, the inclusion of informative priors based on pilot data, meta-analyses, or expert elicitation is a strength of Bayesian methods. When there is substantial uncertainty about the parameters, diffuse prior distributions (priors that allow a large range of parameter values) or weakly informative distributions (informative priors as discussed above but with added uncertainty) are assigned instead.
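The percentiles quoted above for the normal(.5, 1) prior can be checked directly in R:

```r
# 2.5th and 97.5th percentiles of a normal prior with mean .5 and SD 1
qnorm(c(.025, .975), mean = .5, sd = 1)
#> [1] -1.459964  2.459964   (roughly -1.5 and 2.5, as stated above)
```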
Once prior distributions are assigned, the knowledge of the parameters is updated with observed data using Bayes' theorem. Bayes' theorem indicates that $p(\theta \mid \text{data})$, the posterior distribution of the parameter $\theta$ after observing the data, is the product of the likelihood of observing the data given the parameter, $p(\text{data} \mid \theta)$, and the prior distribution of the parameter, $p(\theta)$, divided by the marginal probability of observing the data, $p(\text{data})$. This reduces to $p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta)\, p(\theta)$ because $p(\text{data})$ remains the same across analyses. Finally, Markov chain Monte Carlo (MCMC) estimation is used to combine the likelihood and the prior to approximate the posterior distribution. In general, MCMC estimation can be thought of as a set of sampling algorithms used to draw from a posterior distribution; in this case, a Gibbs sampler was used (Casella & George, 1992).
Inferences about the parameters are then based on draws from the posterior distributions. The posterior distribution can be described using central tendency and summary intervals. For central tendency, one could use the mean or the median of the distribution. Summary intervals can be either equal-tail or highest posterior density (HPD) intervals. Equal-tail intervals capture the specified density and are balanced so that the excluded density is the same above the upper limit and below the lower limit. For example, a 95% equal-tail credible interval is defined by the 2.5th and 97.5th percentiles of the posterior distribution. HPD intervals, in contrast, are constructed iteratively so that they capture the specified density but are not balanced (Gelman et al., 2004); their defining property is that no value outside the HPD interval has a higher posterior density than any value inside it. Although HPD intervals are more commonly used to summarize posterior distributions, the same procedures can be used to summarize empirical distributions from the bootstrap. Finally, the interpretation of credible intervals differs from that of the confidence intervals described above: credible intervals describe the probability that the true parameter is within the range of values.
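A small sketch contrasting the two interval types on a skewed set of draws, using the coda package (the distribution here is hypothetical):

```r
library(coda)

set.seed(2)
draws <- rbeta(3000, 20, 2)              # skewed draws of a hypothetical index
quantile(draws, c(.025, .975))           # 95% equal-tail interval
HPDinterval(as.mcmc(draws), prob = .95)  # 95% HPD interval; shifts toward the mode
```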

How Do Bootstrapping and Bayesian Methods Handle the Uncertainty of Derived Quantities?
An advantage of estimating summary intervals with bootstrapping and Bayesian methods is the ease of quantifying the uncertainty of derived quantities that are functions of other parameters. Bootstrapping methods incorporate the uncertainty of the factor model parameters into the summary intervals by (a) estimating the factor model in each of the bootstrap datasets, (b) saving the parameters of the linear factor model for each dataset, (c) calculating $\mu_{X^*}$, $\sigma^2_{X^*}$, and $\mathrm{Cor}(X^*, \eta)$ from Equations 2 and 3, (d) defining the bivariate normal distributions, (e) estimating CA and CC for each dataset, and (f) collecting the estimates to derive the empirical distributions of CA and CC. The empirical distributions of the classification rate, sensitivity, specificity, and classification consistency can then be summarized with central tendency and summary intervals. In each bootstrapped dataset, the estimates of the linear factor model will vary, reflecting their sampling variability. Bayesian methods, in turn, incorporate the uncertainty by using the posterior distributions of the estimated factor model parameters, not just their point estimates, in the estimation of posterior distributions for the CA and CC indices. In essence, one estimates the linear factor model using MCMC, obtaining a posterior distribution for each parameter (Edwards, 2010). Then, at each MCMC iteration, the set of draws from the posterior distributions is used to calculate $\mu_{X^*}$, $\sigma^2_{X^*}$, and $\mathrm{Cor}(X^*, \eta)$ from Equations 2 and 3, define the bivariate normal distributions, and calculate the CA and CC indices. Ultimately, there are posterior distributions for the classification rate, sensitivity, specificity, and classification consistency, which can be summarized using central tendency and summary intervals. As such, the uncertainty in the factor model parameters is propagated to the empirical distribution or the posterior distribution of CA and CC, and thus acknowledged in their summary intervals.

Present Study
So far, two approaches to estimating summary intervals for model-based CA and CC indices, both of which propagate the uncertainty of the factor model parameters to the CA and CC estimates, have been introduced. Below, the estimation and interpretation of bootstrap and Bayesian summary intervals are demonstrated using an applied example. Then, simulation results on the bias and coverage of bootstrap and Bayesian summary intervals for the CA and CC indices are discussed to provide general recommendations. Consistent with prior research (Pfadt et al., 2021), the summary intervals are expected to be centered around the true value (i.e., the median estimate should have low bias) and to have coverage close to 95%, although coverage might fall slightly below the nominal level in conditions with small sample sizes. It is also expected that summary interval width will decrease as the sample size increases.

Data
The dataset for this illustration is from a study by Eisenberg et al. (2018) on the measurement of the self-regulation construct. The dataset comprised N = 522 MTurk participants who responded to a large battery of self-regulation measures. All participants passed a basic validity check designed to filter out bogus responses. Of the respondents in the sample, 50.2% were female, 78.8% were White, and 83.9% had at least a college education. Among the battery of measures, participants responded to the Mindful Attention Awareness Scale (MAAS; Brown & Ryan, 2003), a 15-item measure that assesses one's tendency to be fully aware of the present experience without distraction. Items are rated on a 6-point Likert-type scale ranging from 1 (almost always) to 6 (almost never), so the range of possible summed scores is 15 to 90. Items include ''I snack without being aware that I'm eating'' and ''I do jobs or tasks automatically, without being aware of what I'm doing.'' Items were recoded so that higher scores represented more mindfulness, and the summed score reliability was α = .92. Suppose that we were interested in identifying respondents who are low on mindfulness, defined as being at −1 SD or below on the mindfulness construct, to enroll them in an intervention. Ideally, we would choose individuals based on the latent variable score with a cutpoint of −1 (i.e., $\eta \le -1$), which yields true classifications; however, the summed score, which is prone to measurement error, is what is commonly used in practice to select individuals (i.e., $X^* \le -1$ SD, corresponding to $X^* \le 53$). The goal of this illustration is to estimate the CA and CC of the MAAS for those respective cutpoints, along with both bootstrap and Bayesian summary intervals for the indices.

Procedure
First, a unidimensional factor model was fit to the data using lavaan (Rosseel, 2012) in R, model fit was evaluated, and the CA and CC indices were estimated as outlined above. Then, to estimate the bootstrap confidence intervals, 500 datasets were sampled with replacement, a unidimensional factor model was fit to each bootstrapped dataset using lavaan, and the factor model parameters for each bootstrapped dataset were saved. For each set of factor model parameters, Equations 2 and 3 were used to determine $\mu_{X^*}$, $\sigma^2_{X^*}$, and $\mathrm{Cor}(X^*, \eta)$. Next, two bivariate normal distributions were derived. First, the bivariate normal distribution of $X^*$ and $\eta$, with the cutpoints on $X^*$ and $\eta$ imposed to create four quadrants, was used to estimate the CA indices. The proportion in each quadrant was estimated via integration using the pmvnorm function in the mvtnorm R package. Second, the bivariate normal distribution of $X^*_1$ and $X^*_2$, in which $\mathrm{Cor}(X^*_1, X^*_2)$ was set to $\mathrm{Cor}(X^*, \eta)^2$ and the cutpoints on $X^*$ were imposed, was used to estimate the CC index; similar steps were taken to determine the proportion in each quadrant. Finally, the quadrant proportions were used to estimate the classification rate, sensitivity, specificity, and classification consistency as indicated above. Overall, there were 500 values of each CA and CC index, which were summarized using the median, a 95% equal-tail interval, and a 95% HPD interval computed with the coda package.
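A condensed sketch of this bootstrap pipeline is below; the published supplement contains the authors' full code, so names and details here (e.g., a data frame `dat` holding the 15 recoded MAAS items named x1 through x15) are illustrative. Only classification consistency is shown; the CA indices follow from the $(X^*, \eta)$ distribution as sketched earlier.

```r
library(lavaan)
library(mvtnorm)

model   <- paste("eta =~", paste0("x", 1:15, collapse = " + "))
cut_x   <- 53         # summed-score cutpoint (X* <= 53, i.e., -1 SD)
B       <- 500
cc_boot <- numeric(B)

for (b in seq_len(B)) {
  boot_dat <- dat[sample(nrow(dat), replace = TRUE), ]
  fit <- cfa(model, data = boot_dat, std.lv = TRUE, meanstructure = TRUE)
  est <- parameterEstimates(fit)
  lambda_star <- sum(est$est[est$op == "=~"])
  tau_star    <- sum(est$est[est$op == "~1" & est$lhs != "eta"])
  psi_star    <- sum(est$est[est$op == "~~" & est$lhs == est$rhs & est$lhs != "eta"])
  mu_x  <- tau_star                    # kappa = 0 (Equation 2)
  var_x <- lambda_star^2 + psi_star    # phi = 1 (Equation 2)
  omega <- lambda_star^2 / var_x       # Cor(X*_1, X*_2) (Equation 4)
  Sigma <- matrix(c(var_x, omega * var_x, omega * var_x, var_x), 2, 2)
  E <- pmvnorm(upper = c(cut_x, cut_x), mean = c(mu_x, mu_x), sigma = Sigma)
  H <- pmvnorm(lower = c(cut_x, cut_x), mean = c(mu_x, mu_x), sigma = Sigma)
  cc_boot[b] <- E + H                  # classification consistency per resample
}

median(cc_boot)                                        # point estimate
quantile(cc_boot, c(.025, .975))                       # 95% equal-tail interval
coda::HPDinterval(coda::as.mcmc(cc_boot), prob = .95)  # 95% HPD interval
```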
To estimate the Bayesian summary intervals, a unidimensional factor model was fit to the data using JAGS (Plummer, 2003) via R with the R2jags package (Su & Yajima, 2020). The full posterior distribution for the unidimensional factor model is

$$p(\boldsymbol{\tau}, \boldsymbol{\lambda}, \boldsymbol{\psi}, \boldsymbol{\eta} \mid \mathbf{X}) \propto \prod_{i=1}^{N} \left[ p(\eta_i) \prod_{j=1}^{J} p(X_{ij} \mid \tau_j, \lambda_j, \psi_{jj}, \eta_i) \right] \prod_{j=1}^{J} p(\tau_j)\, p(\lambda_j)\, p(\psi_{jj}).$$

In this case, $\tau_j$, $\lambda_j$, and $\psi_{jj}$ had diffuse priors relative to the scale of the items: the $\tau_j$ had normal priors with a mean of 3 and a standard deviation of 10, the $\lambda_j$ had normal priors with a mean of 1 and a standard deviation of 10, and the $\psi_{jj}$ had inverse gamma priors, a popular distribution for (residual) variances, with a shape parameter of 5 and a scale parameter of 10. These inverse gamma parameters roughly translate to residual variances around 2 with little certainty, as if the estimate came from a sample of 10 individuals (Levy & Mislevy, 2017). Note that there are no prior distributions for $\kappa$ or $\Phi$, which are set to 0 and 1, respectively, for identification.
For the MCMC implementation, there were three chains, each run for 5,000 iterations. The first 1,000 iterations were discarded as burn-in, leaving 4,000 post-burn-in iterations per chain. A thinning interval of 4 was specified, so every fourth iteration was saved for further analysis, yielding 3,000 post-thinning iterations (i.e., 1,000 per chain) that defined the posterior distribution of each parameter. Trace plots and the potential scale reduction factor (PSRF; Brooks & Gelman, 1998) were used as evidence of convergence of the chains, with a PSRF criterion of < 1.1. Finally, each set of draws per iteration (e.g., the first draw of $\tau_1, \tau_2, \ldots, \lambda_1, \lambda_2, \ldots, \psi_{11}, \psi_{22}, \ldots$) was used to estimate $\mu_{X^*}$, $\sigma^2_{X^*}$, and $\mathrm{Cor}(X^*, \eta)$ and to conduct the procedure described above to estimate the CA and CC indices, which were summarized using the median, a 95% equal-tail interval, and a 95% HPD interval computed with the coda package (Plummer et al., 2006).
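A sketch of the JAGS model and estimation settings described above, run through R2jags; this is an assumed reconstruction, not the authors' exact supplement code. Note that JAGS parameterizes normal distributions by precision (1/variance), so a prior SD of 10 becomes a precision of .01, and an inverse gamma prior on a variance is expressed as a gamma prior on the corresponding precision.

```r
library(R2jags)

jags_model <- "
model {
  for (i in 1:N) {
    eta[i] ~ dnorm(0, 1)                # kappa = 0, Phi = 1 for identification
    for (j in 1:J) {
      mu[i, j] <- tau[j] + lambda[j] * eta[i]
      X[i, j] ~ dnorm(mu[i, j], prec[j])
    }
  }
  for (j in 1:J) {
    tau[j]    ~ dnorm(3, .01)           # normal(mean 3, sd 10)
    lambda[j] ~ dnorm(1, .01)           # normal(mean 1, sd 10)
    prec[j]   ~ dgamma(5, 10)           # variance ~ inverse gamma(shape 5, scale 10)
    psi[j]   <- 1 / prec[j]             # residual variance
  }
}"
model_path <- tempfile(fileext = ".txt")
writeLines(jags_model, model_path)

jags_data <- list(X = as.matrix(dat), N = nrow(dat), J = ncol(dat))
fit <- jags(data = jags_data, parameters.to.save = c("tau", "lambda", "psi"),
            model.file = model_path, n.chains = 3,
            n.iter = 5000, n.burnin = 1000, n.thin = 4)
print(fit)   # includes the PSRF (Rhat) per parameter

# Each saved draw of (tau, lambda, psi) is then pushed through Equations 2 and 3
# and the quadrant integration to yield posterior draws of the CA and CC indices.
```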

Results
Model fit statistics suggest that the unidimensional factor model fits the 15 MAAS items generally well, $\chi^2$(90) = 494.214, p < .001, comparative fit index (CFI) = .924, root mean square error of approximation (RMSEA) = .079, standardized root mean square residual (SRMR) = .042. For the MCMC estimation, the PSRF for each factor model parameter suggests that the chains converged to the same distribution. Parameter estimates for the factor loadings, intercepts, and residual variances are shown in the supplement. Results suggest that the Bayesian equal-tail intervals, the Bayesian HPD intervals, and their bootstrap counterparts were similar to one another (see Table 1). Figure 2 shows density plots of the bootstrap empirical distributions and the Bayesian posterior distributions for the CA and CC estimates, all of which are unimodal. Here, the equal-tail summary intervals of the CC index are interpreted; the remaining estimates have similar interpretations. For the bootstrap, the median of the empirical distribution of classification consistency was .923, with a 95% equal-tail confidence interval of [.914, .931], which conveys the sampling variability of the estimate. For the Bayesian approach, the median of the posterior distribution of classification consistency was .921, with a 95% equal-tail credible interval of [.914, .928], meaning that there is a 95% probability that classification consistency is between .914 and .928. By reporting summary intervals for the CA and CC indices, the uncertainty of those estimates is acknowledged.

Simulation Factors and Procedure
For the simulation, datasets were generated in R based on a unidimensional linear factor model, and three factors were varied: sample size (N = 200 and 500), the number of items (i = 5, 10, and 20), and the average factor loading between the indicators and the latent variable (f = .5, .6, and .7). The factor mean and variance were set to zero and one, respectively, and the true factor loadings were drawn from a uniform distribution centered at f with limits at ±.2 (see Table A0 in the supplementary materials for these values). Furthermore, the item intercepts for all items were specified to be zero, and the residual variances were specified so that the total variance of each indicator was 1 (i.e., the residual variance of item $j$ was $1 - \lambda_j^2$). The number of items and the magnitude of the factor loadings are expected to affect the CA and CC estimates, whereas the sample size is expected to affect the sampling variability of the factor model parameters.
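A sketch of the data generation for a single condition (assumed here: N = 200, 10 items, average loading .6):

```r
set.seed(123)
N <- 200; J <- 10; f <- .6

lambda <- runif(J, f - .2, f + .2)  # true loadings: uniform centered at f, limits +/- .2
psi    <- 1 - lambda^2              # residual variances for unit total variance
tau    <- rep(0, J)                 # item intercepts fixed at zero

eta <- rnorm(N, 0, 1)               # factor mean 0, variance 1
X <- sapply(seq_len(J), function(j) tau[j] + lambda[j] * eta + rnorm(N, 0, sqrt(psi[j])))
colnames(X) <- paste0("x", seq_len(J))
dat <- as.data.frame(X)
```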
The general steps for data analysis, computation of the classification rate, sensitivity, specificity, and classification consistency indices, and estimation of summary intervals were largely similar to those described in the illustration. Three approaches were examined: bootstrap confidence intervals, Bayesian credible intervals with diffuse priors, and (for illustrative purposes) Bayesian credible intervals with empirical, weakly informative priors. The diffuse priors were similar to those used in the illustration. For the analyses with empirical, weakly informative priors, the priors for the loadings and intercepts were normal distributions centered at the empirical values obtained by fitting a unidimensional model to the data using maximum likelihood, with a standard deviation of 1. Note that these prior distributions carry added uncertainty about the parameters, larger than the uncertainty reflected in the standard errors of the maximum likelihood solution. Similarly, for the residual variances, a prior inverse gamma distribution was used that reflects a variance similar to the empirical values obtained, with a pseudo sample size of 20. For each method, a 95% equal-tail summary interval and a 95% HPD summary interval were estimated using three sets of cutpoints: both the expected summed score and $\eta$ had a cutpoint at the mean, 0.75 SD above the mean, or 1.5 SD above the mean. The cutpoint for the summed score was computed from the implied moments of the data-generating item parameters, so the cutpoints were fixed across replications (i.e., the cutpoints were not re-estimated). Overall, there were 18 conditions (2 Sample Sizes × 3 Average Loadings × 3 Numbers of Items), with 1,000 replications per condition, per index (4: sensitivity, specificity, classification rate, and classification consistency), cutpoint (3: mean, .75 SD, 1.5 SD), type of interval (2: equal-tail or HPD), and method for estimating summary intervals (3: bootstrap, Bayesian diffuse, Bayesian empirical weakly informative), for a total of 1,296,000 summary intervals examined.
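A sketch of how such empirical, weakly informative priors could be constructed from a maximum likelihood fit; the shape/scale choice for the inverse gamma below is one common way to encode a pseudo sample size of 20 and may differ from the authors' exact specification.

```r
library(lavaan)

model  <- paste("eta =~", paste0("x", 1:10, collapse = " + "))
fit_ml <- cfa(model, data = dat, std.lv = TRUE, meanstructure = TRUE)
est    <- parameterEstimates(fit_ml)

# Normal priors centered at the ML estimates, with an SD of 1 (added uncertainty)
prior_lambda_mean <- est$est[est$op == "=~"]
prior_tau_mean    <- est$est[est$op == "~1" & est$lhs != "eta"]
prior_sd          <- 1

# Inverse gamma priors centered near the ML residual variances,
# with a pseudo sample size of n0 = 20 (assumed parameterization)
psi_ml   <- est$est[est$op == "~~" & est$lhs == est$rhs & est$lhs != "eta"]
n0       <- 20
ig_shape <- n0 / 2
ig_scale <- (n0 / 2) * psi_ml   # prior mean approximately psi_ml
```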

Simulation Outcomes
The true values of the CA and CC indices were determined with Equations 2 and 3 and are shown in the supplementary materials (see Table A1). Based on those true values, the simulation outcomes were the relative bias of the CA and CC point estimates and summary interval coverage. Relative bias was estimated by subtracting the true value from the median estimate of the CA and CC indices in each replication and dividing by the true value; a relative bias within ±.05 was deemed acceptable. Summary interval coverage was estimated as the proportion of replications in which the true value fell within the summary interval limits; coverage between .925 and .975 was deemed acceptable. The average width and the balance of each summary interval (i.e., the proportion of times the true value fell below the lower limit or above the upper limit) were also recorded. Note that the true CA and CC indices vary by cutpoint, average factor loading, and number of items, so the performance of the summary intervals is described within each condition rather than compared across conditions.
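For a single condition, these outcomes reduce to simple summaries over replications; a sketch, assuming vectors of per-replication median estimates (`est`) and interval limits (`ll`, `ul`) plus the true value:

```r
relative_bias <- mean((est - true_value) / true_value)  # positive = overestimation
coverage      <- mean(ll <= true_value & true_value <= ul)
avg_width     <- mean(ul - ll)
miss_left     <- mean(true_value < ll)  # imbalance: true value below the lower limit
miss_right    <- mean(true_value > ul)  # imbalance: true value above the upper limit
```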

Results
Bootstrap Summary Intervals. Overall, the bootstrap summary intervals exhibited the best performance of the summary intervals studied. For conditions with a cutpoint at the mean of the summed score and the latent variable, the median estimate of the bootstrap empirical distribution across the CA and CC indices had a relative bias within ±.05 (see Table A2 in the supplement). Regarding coverage, both the 95% equal-tail and the 95% HPD bootstrap summary intervals had coverage between .925 and .975 in all but five conditions (in which coverage was .890; see Table 2). The bootstrap summary intervals were generally balanced for sensitivity, specificity, and classification consistency, but there was right imbalance for the classification rate: the true value of the classification rate was consistently above the upper limit of the interval. Also, bootstrap summary interval width decreased as sample size increased (see Table A3 in the supplement). Similar findings were observed for conditions in which the cutpoint was at .75 SD (see Tables A4 and A5 in the supplement). For conditions in which the cutpoint was at 1.5 SD, most of the summary intervals for the classification rate had coverage above .975, suggesting that those intervals were unnecessarily wide (see Tables A6 and A7 in the supplement). For sensitivity, specificity, and classification consistency, 14 summary intervals had coverage between .90 and .925, 12 of which were HPD summary intervals.
Bayesian Summary Intervals With Diffuse Priors. Across conditions, the Bayesian summary intervals with diffuse priors exhibited poor performance. The summary interval coverage for the classification rate and classification consistency was well below .925, reaching as low as .008, across all cutpoints (see Table 3 for the cutpoint at the mean, and Tables B2 and B4 in the supplement for the cutpoints at .75 SD and 1.5 SD). There was noticeable right imbalance in the summary intervals, which maps onto the negative relative bias of the median estimates (see Tables B1, B3, and B5 in the supplement). There were some conditions in which the summary intervals for sensitivity and specificity had coverage between .925 and .975 (still exhibiting right imbalance), but these are not discussed further given the poor summary interval performance for the classification rate and consistency.
Bayesian Summary Intervals With Empirical, Weakly Informative Priors. Shifting from diffuse priors to empirical, weakly informative priors yielded better coverage and narrower credible intervals. Across conditions, most of the median estimates of the CA and CC indices had a relative bias within ±.05 (see Tables C1, C3, and C5 in the supplement). This finding is not surprising given that the prior distributions of the factor model parameters were centered on their empirical values estimated via maximum likelihood. Regarding coverage, all credible intervals for sensitivity and specificity with a cutpoint at the mean had coverage between .925 and .975. In 10 conditions with a cutpoint at the mean, the credible intervals for the classification rate had coverage above .975 (i.e., they were unnecessarily wide). Also, in conditions with a sample size of 200 and 10 items, the credible intervals for classification consistency had coverage below .925 (the lowest being .874). Similar findings of unnecessarily wide credible intervals for the classification rate were observed in conditions with cutpoints of 0.75 SD and 1.5 SD (see Tables C2 and C4). However, sensitivity had credible intervals with coverage below .925, while specificity and classification consistency had most of their credible intervals between .925 and .975.

Discussion
When researchers use scores to make decisions, indices such as the classification rate, sensitivity, specificity, and classification consistency can help describe the quality of those decisions. However, guidance on how to estimate the uncertainty of these CA and CC indices has been limited. This article addressed that limitation by demonstrating how to estimate summary intervals for CA and CC indices using the percentile bootstrap and Bayesian estimation. The application of summary intervals for CA and CC indices was illustrated with a mindfulness measure, and results from a small simulation study on the statistical properties of these summary intervals were reported. Generally, the simulation results suggest that the percentile bootstrap confidence intervals have low bias and coverage close to .95 across most conditions studied. Moreover, the results suggest that Bayesian credible intervals with empirical, weakly informative prior distributions outperform credible intervals with diffuse priors in terms of coverage, although there were many conditions in which the credible intervals with empirical, weakly informative priors were unnecessarily wide. We do not encourage the use of data-dependent priors for the estimation of summary intervals of CA and CC indices, but a takeaway of the results is that using informative priors (e.g., from pilot data, a meta-analysis, or information elicited from experts) could potentially improve the coverage of the CA and CC summary intervals. As such, we recommend using equal-tail bootstrap confidence intervals to estimate the uncertainty of the CA and CC indices. Overall, the example, R code, and simulation results provide guidance on how to estimate summary intervals for CA and CC indices and on the conditions under which the summary intervals work well.
The overarching goal of this article is to encourage researchers to report summary intervals for the CA and CC indices when they use item responses for classification. Note that the CA and CC indices are specific to the cutpoints used on $X^*$ and $\eta$. Often, researchers use cutpoints that aim to maximize the sensitivity and specificity of the measure, which implicitly weighs false positives and false negatives equally (Youngstrom, 2014). However, in many applications, making a false negative (e.g., failing to treat someone with depression) is riskier than making a false positive (e.g., treating an individual for depression who is not depressed). Such risks can be mitigated by shifting the cutpoint to increase (or decrease) sensitivity at the cost of (or in exchange for) specificity (Youngstrom, 2014). Once the application-specific cutpoint is determined, the CA and CC indices and their summary intervals can be reported.
There are several limitations and future directions for this study. As with any model-based estimate, the estimation of CA and CC indices presumes that the model fits the data well and is correctly specified, but model misspecification was not studied. An important future direction would be to examine how unmodeled multidimensionality (Reise et al., 2013), local dependence (Edwards et al., 2018), or model error (MacCallum et al., 2001) affects the estimation of the CA and CC indices and their summary intervals. Moreover, it would be interesting to study the estimation and summary intervals of model-based positive and negative predictive values (PPV and NPV), which might be more useful than the CA and CC indices for understanding respondent-level classifications. In this case, PPV would describe the probability that an individual has the condition as indicated by $\eta$ given that the individual is above the cutpoint on $X^*$, and NPV would describe the probability that an individual does not have the condition as indicated by $\eta$ given that the individual is below the cutpoint on $X^*$. Unlike the CA and CC indices discussed here, PPV and NPV depend on the prevalence of the condition defined by $\eta$ (Pepe, 2003). Furthermore, the simulation study included only a limited set of prior distributions for the Bayesian summary intervals, which limits the generality of the conclusions. Future work should determine how best to obtain and encode prior information for the factor model, which in turn affects the CA and CC indices, and should study the sensitivity of the results to specific priors (van Erp et al., 2018). Also, there are model-based procedures to estimate CA and CC indices for items that do not fit the linear factor model (i.e., binary or polytomous items), but the performance of summary intervals for those approaches was not studied; examining summary intervals for model-based CA and CC with discrete items is another future direction. In conclusion, researchers are encouraged to report CA and CC indices, along with bootstrap confidence intervals, whenever they use scores to make decisions about respondents.