Exploratory Item Classification Via Spectral Graph Clustering.

Large-scale assessments are supported by a large item pool. An important task in test development is to assign items into scales that measure different characteristics of individuals, and a popular approach is cluster analysis of items. Classical methods in cluster analysis, such as the hierarchical clustering, K-means method, and latent-class analysis, often induce a high computational overhead and have difficulty handling missing data, especially in the presence of high-dimensional responses. In this article, the authors propose a spectral clustering algorithm for exploratory item cluster analysis. The method is computationally efficient, effective for data with missing or incomplete responses, easy to implement, and often outperforms traditional clustering algorithms in the context of high dimensionality. The spectral clustering algorithm is based on graph theory, a branch of mathematics that studies the properties of graphs. The algorithm first constructs a graph of items, characterizing the similarity structure among items. It then extracts item clusters based on the graphical structure, grouping similar items together. The proposed method is evaluated through simulations and an application to the revised Eysenck Personality Questionnaire.


Introduction
With the rapid developments in information technology during the past several decades, Internet-based and mobile-enabled psychological assessments are multiplying. As a result, large item pools tend to emerge, together with availability of a large amount of response data. An important question then becomes how to make use of such rich resource to provide better psychological measurement. As a first step to answering this question, the authors propose a data-driven approach to assign a large number of items into clusters in such a way that items within each cluster are homogeneous, while items in different clusters are heterogeneous. One application of the proposed method is in personality assessments, a psychological assessment that is widely used to assess aspects of an individual's character, such as in relationship counseling, career counseling, and employment testing. Due to the heterogeneity of personality, its assessments traditionally have large item pools, typically on the scale of hundreds. In recent years, even larger item pools become available, and largescale response data are collected. For example, the International Personality Item Pool (IPIP; Goldberg et al., 2006) contains more than 3,000 items. The proposed method analyzes such large item pools by decomposing them into scales of items that are much easier to understand and interpret.
In this study, the authors propose a method to answer two questions: How many clusters should be formed among a set of items and how to assign each item to a cluster? In particular, they propose a classification method for item analysis based on a spectral clustering algorithm. The proposed method is capable of handling high-dimensional responses and admits a low computational overhead; for instance, a maximum of 500 items is considered in the simulation study in Online Appendix A. Unlike many traditional clustering methods, this one can be effectively used when data have missing or incomplete responses that is often the case when analyzing a large item pool. The response type can be either continuous or ordinal. And, consequently, the method can cover a wide range of situations. Moreover, the authors propose methods to systematically choose various tuning parameters, among which the most important one is the number of clusters. The proposed method assumes no prior information of the item structure and thus is exploratory.
In psychological assessment, people typically use latent variable models for analyzing item response data. The authors emphasize that the proposed method is a useful complement to, not a replacement of, the latent factor models. One common practice of assigning items into scales is to analyze the loading structure from exploratory factor analysis, which usually works well. However, in exploratory factor analysis, one needs to decide how many factors to keep, which rotation to use, what threshold to truncate the small loadings, and so on, which typically requires an experienced data analyst and can be labor intensive when the number of items is large. However, little human intervention is needed for the proposed method and thus can be used as a first step in data analysis. To further use an item cluster as an assessment scale, the items should be confirmed by domain experts and further analyzed by latent factor models. One may fit unidimensional latent factor models to each item cluster and conduct standard analysis, such as goodness of fit and dimensionality check, so that the measurement can still be backed by a latent factor model.
Clustering/classification is a common technique for exploratory data analysis used in psychology, image analysis, bioinformatics, and many others (Blashfield, 1980;Blashfield & Aldenderfer, 1988;Hartigan, 1975;Jain, Murty, & Flynn, 1999;Xing et al., 2010). In psychological and educational measurement, cluster analysis has been used to classify individuals into groups (Chiu & Douglas, 2013;Chiu, Douglas, & Li, 2009;Park & Lee, 2011;Willse, Henson, & Templin, 2007). Cluster analysis has also been widely used for exploratory item analysis as an alternative to factor analysis (e.g., Borgen & Barnett, 1987;Roussos, Stout, & Marden, 1998;Zhang, 2007Zhang, , 2013Zhang & Stout, 1999). In particular, learning the loading structure in item response theory models and diagnostic classification models (Chen, Liu, Xu, & Ying, 2015;J. Liu, Xu, & Ying, 2012, 2013Sun, Chen, Liu, Ying, & Xin, 2016) is a special type of item cluster analysis that is based on probabilistic models. As pointed out by Loevinger, Gleser, and DuBois (1953), unlike factor analysis, which weights items by their factor loadings, cluster analysis solves the practical problem of assigning items to scales on an all or none basis. In the current context, item clustering is to assign J items into M groups performed on an N-dimensional feature space, where N is the total number of respondents or the sample size. When both J and N are large, substantial computational overhead occurs for many clustering methods (Jain et al., 1999). In addition, probability-model-based methods (e.g., latent-class models) often suffer from model lack of fit for high-dimensional features. For such models, the local independence is often assumed, that is, the features are independent given the latent-class membership. Such an assumption is often violated in the presence of high-dimensional data (Chen, Li, Liu, & Ying, 2016). Moreover, the usage of most traditional clustering methods requires complete data. Upon having missing data, mainstream approaches reduce the problem to a complete data formulation by deleting missing cases or imputation, for which the costs are often significant (Chi, Chi, & Baraniuk, 2016).
Spectrum-based item clustering methods foot on graph theory, a branch of mathematics that studies the properties of graphs. A spectral clustering algorithm first constructs a graph of items based on a similarity measure defined on each pair of items, where items more similar to each other tend to be connected. Then, by spectral analysis of the graph, it produces clusters so that items within a cluster are densely connected and different clusters are loosely connected. Spectral methods admit both sound theoretical properties under regularity conditions and reliable empirical performance (Chung, 1997;Ng, Jordan, Weiss, 2002;Rohe, Chatterjee, & Yu, 2011;Shi & Malik, 2000). Such methods have been successfully applied to many fields, including image segmentation (Shi & Malik, 2000), text mining (Dhillon, 2001), and network analysis (Rohe et al., 2011).
In most applications mentioned above, the features take continuous values, and similarity measures are constructed naturally via the Euclidean distance in R N . In the context of assessment/measurement, the responses are often binary or ordinal, for which Euclidean distance is inappropriate. In this article, the authors propose a similarity matrix based on a nonparametric statistic known as Kendall's tau (Kendall & Gibbons, 1990), which is applicable to all ordinal variables that may take different numbers of values. The calculation of the Kendall's tau statistic for item pairs can be done in parallel, further adding to its computational efficiency.
In the analysis, an important task is to assess the number of clusters. In the literature of spectral clustering, eigengap heuristic is typically used that is based on the distribution of the eigenvalues of the Laplacian matrix (von Luxburg, 2007). However, a "gap" of the eigenvalue distribution is not always straightforward to identify. The authors propose a method to select the number of clusters based on the eigenvalue distribution, motivated by parallel studies in factor analysis (Horn, 1965) and the gap statistic (Tibshirani, Walther, & Hastie, 2001). In particular, they consider the distribution of the eigenvalues, when there is no cluster structure, as the reference distribution, that can be obtained by Monte Carlo method. The idea is to find the eigengap by comparing the observed eigenvalues with a reference distribution obtained by simulation. This method performs well in the authors' simulation studies.
The authors apply the method to the revised Eysenck Personality Questionnaire (S. B. Eysenck, Eysenck, & Barrett, 1985), one of the most popular personality assessments. This questionnaire is derived according to Eysenck's theory on personality, which contains three scales measuring three aspects of personality: Extraversion/Introversion, Neuroticism/ Stability, and Psychoticism/Socialization. The authors' analysis validates the three scales and explores finer test structure driven by data.
The rest of the article is organized as follows: The spectral clustering algorithm and the method for choosing the number of clusters are presented in the section "Spectral Clustering," and are illustrated via simulated examples in the section "Simulated Examples." The proposed method is further investigated by a real data example in the section "Application to EPQ-R Data." The section "Summary and Discussion" provides a summary and further discussions. Simulation studies, theory on spectral clustering, and a review of the K-means algorithm are presented in the online appendix.

Spectral Clustering Spectral Clustering Algorithm for Items
Consider N individuals, each of whom responds to J items. Let R ij be the response of individual i to item j and R = (R ij ) N×J be the entire data matrix. The responses are ordinal. Without loss of generality, assume R ij ∈ {1, 2, … , n j }. The responses to item j are further denoted by R j = (R 1j , … , R Nj ).
A similarity matrix S = (S jj′ ) J×J is introduced that is a symmetric matrix of nonnegative entries constructed based on the data, that is, S jj′ =S j′j ≥ 0 for all j and j′. The construction of S will be discussed in the section "Construction of the Similarity Matrix." The spectral clustering algorithm aims to classify items into clusters, so that items in the same cluster tend to be similar (i.e., S jj′ tends to take a large value if items j and j′ belong to the same cluster) and those in different clusters tend to be dissimilar to each other (i.e., S jj′ tends to be small if j and j′ belong to different clusters). In this article, the authors focus on similarity matrices whose entries are binary (S jj′ ∈ {0, 1}), in which case S can be viewed as the adjacency matrix of an undirected graph. That is, each vertex in the graph corresponds to an item, and there is an edge between items j and j′ if and only if S jj′ = 1.
A description of the spectral clustering algorithm (Shi & Malik, 2000) is provided, followed by explanations. First, suppose that the similarity graph S is given and the number of clusters M is known. The construction of S and the choices of M are discussed later.

Algorithm 1
The similarity matrix S and the number of clusters M are given.

1.
Construct a diagonal matrix D and a Laplacian matrix L as follows:

2.
Let λ 1 ≥ λ 2 ≥ ⋯ ≥ λ J be the eigenvalues of L and X 1 , X 2 … , X j be the corresponding eigenvectors. Let be the J×M matrix containing the first M eigenvectors.

3.
The matrix X is written in rows, and let Y 1 ,… , Y J ∈ ℝ M be the row vectors of X,that is Consider Y 1 , … , Y J as the data and apply K-means algorithm (Algorithm 1 in the online appendix) to Y 1 , … , Y J , assuming that there are M clusters. Let π̂j ∈ {1, …, M} be the classification of Y j .
In what follows, the authors describe heuristically the underlying mechanism of the spectral clustering algorithm. They consider an example of 10 items falling into three clusters, the graphical representation of which is shown in Figure 1. These items fall into three clusters naturally, that is, {1, 2, 3}, {4, 5, 6, 7}, and {8, 9, 10}, where items within a cluster are connected via a path and items from different clusters are disconnected. In this case, the similarity matrix S is block diagonal, as shown in the heat map in the right panel of Figure 1, where the values 0 and 1 are indicated in white and blue, respectively. For this similarity matrix, the X matrix obtained in Step 2 of the algorithm is as follows: (1) Let Y 1 , … , Y 10 be the row vectors of X. In particular, Y 1 , Y 2 , and Y 3 are (1/√3, 0, 0) ⊤ , Y 4 through Y 7 are (0, 1/√4, 0) ⊤ , and Y 8 , Y 9 , Y 10 are (0, 0, 1/√3) ⊤ . Thus, the K-means algorithm, which minimizes the within-cluster sum of squares, assigns items j and j′ to the same cluster if Y j = Y j′ . The spectral clustering algorithm is developed based on the above observation.
The above heuristic argument can be made general through results in graph theory. To do so, some concepts need to be introduced. Consider an undirected graph G = (V, E), where V = {1, …, J} is the set of vertices and E ⊂ V × V is the set of edges. The graph is undirected in the sense that (i,j) ∈ E implies (j,i) ∈ E. Furthermore, let S be the adjacency matrix of G. A subset A ⊂ V is said to be connected if any two vertices in A can reach each other through a path belonging to the set A. A subset A is called a connected component if it is connected, and there are no connections between A and A c , such as subsets {1, 2, 3}, {4, 5, 6, 7}, and {8, 9, 10} in the left panel of Figure 1. Each connected component is viewed as a cluster, as connectivity indicates similarity. For a graph with M connected components, if we input its adjacency matrix S and M into Algorithm 1, spectral graph theory (in particular, Theorem 1 in Online Appendix B) guarantees that Y i = Y j if i and j are from the same connected component and Y i ≠ Y j otherwise. As a result, the K-means method is able to recover the clusters. As Y 1 , …, Y J can be viewed as item-specific features characterizing their cluster memberships, they are called graph features of items, and the matrix X is called the graph feature matrix.
In practice, such an ideal, that is, perfectly separated, case, is rarely observed as the similarity matrix S is constructed with noise and the underlying cluster structure may not be perfect. However, if the observed similarity graph approximates the ideal situation, that is, items within a cluster are densely connected and different clusters are loosely connected, the spectral clustering method is able to recover the clusters, backed by matrix perturbation theory (Stewart & Sun, 1990). Readers are referred to von Luxburg (2007) for further details. This point is confirmed by simulation studies in the section "Simulated Examples" and Online Appendix A.

Construction of the Similarity Matrix
The authors now propose a method for constructing a similarity matrix S for the spectral clustering algorithm from item response data. To begin with, Kendall's tau statistic is considered as a similarity measure between two items (Howell, 2012). Let (R 1j , R 1j′ ), …, (R Nj , R Nj′ ) be the responses of N individuals to items j and j′. Then, Kendall's tau for item pair (j, j′) is where the function sgn(x) takes value 1 if x>0, 0 if x = 0, and −1 otherwise. Clearly, τ̂j j′ lies between −1 and 1. Kendall's tau, which has been widely used in nonparametric statistics (Sprent & Smeeton, 2016), measures the strength of dependence. A substantial deviation from 0 means a strong correlation between the responses, and hence a high similarity between the two items. Notice that if two items are measuring a single construct but are worded reversely, then it is likely to lead to a negative τ̂j j′ . Therefore, items j and j′ are considered highly similar, if |τ̂j j′ | takes a large value. The authors remark that τ̂j j′ can be estimated well even when data contain missing or incomplete responses, provided that there are enough individuals responding to both items. See the section "A Brief Summary and Some Remarks" for further discussion.
To better capture the local neighborhood relations between items, a K-nearest neighbor graph is considered (von Luxburg, 2007). The set of K-nearest neighbors of an item j is first defined. Let {|τ̂j (j′) |} be the ordered statistics of {|τ̂j j′ | : j′ ≠ j} in a descending order. Then, the K-nearest neighbor of item j is The similarity matrix S is defined such that j and j′ are connected if j is within the K-nearest neighbors of j′ or item j′ is one of the K-nearest neighbors of j, that is, S jj′ = 1 if j ∈ j′ (K) or j′ ∈ j (K) and S jj′ = 0 otherwise. With this construction, S is a symmetric adjacency matrix.
Algorithm 2 Construction of similarity matrix S.
Input: Responses (R ij ) N×J and connectivity parameter K.

2.
Construct the set of K-nearest neighbor j (K) for each item j.
Output: Similarity matrix S.
The authors discuss the choice of connectivity parameter K. Roughly speaking, K should be chosen such that the resulting graph S is neither too sparse nor too dense. When the graph is too sparse (K being too small), S typically has many connected components, so that the clustering result may not be stable. When the graph is too dense (K being too large), clusters tend to merge, and thus the classification becomes less accurate and the number of clusters cannot be well estimated. This will be further illustrated in the simulation studies. For large graphs, as suggested by the asymptotic connectivity results (Brito, Chavez, Quiroz, & Yukich, 1997), the resulting graph is almost certainly connected when K is chosen in the order of log(J). Following the above discussion, the authors suggest a rule of thumb for choosing K. It performs reasonably well based on results from simulation studies and real data analysis.
Algorithm 3 Choice of connectivity parameter K.
Initialize: K= ⌈log (J)⌉, where ⌈log (J)⌉ is the smallest integer greater than or equal to log (J).

1.
Obtain a similarity matrix S using Algorithm 2.

If S constructed in
Step 1 corresponds to a connected graph, stop. Otherwise, K ← K+ 1, and repeat Steps 1 and 2.
Output: Connectivity parameter K.

Finding the Number of Clusters
A commonly used method for choosing M in the spectral clustering is the so-called eigengap heuristic (Azran & Ghahramani, 2006;von Luxburg, 2007), a method catered to the spectral clustering. It chooses the number m, so that all eigenvalues λ l , …, λ m are very large, but λ m+1 is relatively small. This method comes from the ideal situation of M completely disconnected clusters, where the eigenvalue 1 has multiplicity M, and then there is a gap to the (M + 1)th eigenvalue λ m+1 <1. However, in practice, a "gap" of the eigenvalue distribution is not always straightforward to identify, as shown in the simulated examples in the section "Simulated Examples." The authors propose a method that improves on the eigengap heuristic. The method is motivated by the gap statistic (Tibshirani et al., 2001) and the parallel analysis in factor analysis (Horn, 1965).
The basic idea behind the authors' approach is to compare the observed eigenvalues with a reference distribution, that is, the distribution of the eigenvalues when there is no cluster structure. If there are M clusters, then the top M eigenvalues of the Laplacian matrix are significantly larger than this reference distribution. Specifically, the following reference distribution is considered. Given an observed similarity matrix S with binary entries, let be the graph sparsity. A random similarity matrix S̃ = (S̃j j′ ) is considered, such that Sj j′ are independent and identically distributed Bernoulli random variables with success probability p S . Let L̃ be the Laplacian matrix of S, and λ(x00303) 1 ≥ λ̃2 >… > λ̃J be the eigenvalues of L. The joint distribution of (λ̃1, …, λJ) is considered as the reference distribution, denoted by F S . For each 1 ≤ j ≤ J, let tj be the 1 − α quantile of the marginal distribution of λ̃j. The number of clusters is then chosen as where λ m is the m-th largest eigenvalue of the observed Laplacian matrix L. In their implementation, the authors choose α = 5%. In practice, the distribution F S is not analytically tractable, and they obtain it by Monte Carlo. The following description summarizes the method: Algorithm 4 Parallel analysis for choosing the number of clusters.
Input: Symmetric similarity matrix S.

1.
Compute the eigenvalues λ 1 ,…, λ J of L and the sparsity of S, P S .

2.
Simulate B independent samples from the reference distribution F S , denoted by 3.

1.
Computation. The most computationally intensive step in Algorithm 1, the Kmeans step, has O(JM 2 ) complexity per iteration, while that of directly applying the K-means clustering is O(JMN). The difference is significant for a large sample size N. In addition, although spectral clustering has an additional step of constructing a similarity matrix, it can be efficiently obtained by parallel computing, that is, computing |τ jj′ | and s jj′ can be paralleled in item pairs, and calculating j (K) can be paralleled in items.

2.
Comparison. One traditional approach to assigning items to scales is exploratory item factor analysis, which often produces reasonable results. It typically requires human intervention, which can be labor intensive when the number of items is large. However, no human intervention is needed for the proposed method. In addition, as a nonparametric approach, the proposed method performs quite robustly under different settings in terms of choosing the number of clusters and classification, whereas exploratory factor analysis may fail due to model misspecification. See the section "Comparing With Exploratory Factor Analysis" for examples.

3.
Missing data. Item response data often contain missing values. When the number of items is large, it is likely that many or even all individuals have missing responses. Most traditional clustering methods suffer from such a situation. This is because they usually throw away incomplete cases, resulting in severe information loss. Another strategy for handling missing data is imputation, which fills in missing entries with plausible estimates of their values. Imputation methods can be very complicated, because information about the joint distribution of the missingness patterns and the data is needed for effective imputation. It is nontrivial to find a reasonable imputation model, and even with a given model, the imputation process is often computationally expensive (Chi et al., 2016).
The proposed method makes better use of data in a computationally efficient manner. Let Z ij indicate whether the response from individual i to item j is missing (Z ij = 0) or not (Z ij =1). Then for an item pair (j,j′), Kendall's tau statistic adapted to missing data becomes where can be viewed as the effective sample size for item pair (j,j′). For the traditional methods, the sample size of complete cases is , which is usually much smaller than Ñ. For a reasonably large Ñ, τ̃j j′ is a good estimator of its population version. In particular, it is consistent when data are missing at random. Therefore, to develop a test, it can be designed to have each individual responding to a subset of items of an item pool, making data collection easier and reduce the time cost for each respondent. It is worth noting that one should be careful when the data missingness is nonignorable.

4.
Data driven. Personality scale construction starts with a large item pool, with items from multiple sources, such as textbooks in psychiatry, earlier published scales of personal and social attitude, and so on. In the early stage of inventory construction, little is known for a large number of items. Two items may measure a single construct but are reversely worded, such as items "Do you prefer to go your own way rather than act by the rules?" and "Should people always respect the law?" from the P scale of Eysenck Personality Questionnaire-Revised (EPQ-R). Similarity measure based on the Euclidian distance regards two items as dissimilar, based on which the two items are unlikely to be clustered together. The proposed method solves this issue by taking the absolute value of Kendall' tau statistic as the similarity measure. This point will be further illustrated in the real data analysis discussed in the section "Application to EPQ-R Data."

5.
Other similarity measures. Kendall's tau statistic is a standard measure of the association between ordinal variables, and is easy to compute. However, thanks to the flexibility of the authors' method, other reasonable similarity measures can also be used. For example, if it is believed that the responses are obtained by truncating Gaussian random variables, the similarity may be better measured by polychoric correlation (Lee, Poon, & Bentler, 1995).

6.
Relationship between clusters. When item clusters are obtained from exploratory factor analysis, each cluster is associated with a latent factor and one may quantify the similarity between the clusters by the corresponding correlation between factors. There is an analog under the proposed approach, which makes use of the constructed similarity graph of items. That is, the similarity between two item clusters could be quantified by connectivity measures, such as the number of edges between two clusters or its normalized version (Kolaczyk, 2009).

Simulated Examples Illustration
The authors first illustrate the use of the proposed method using simulated examples from a latent factor model, which is one of the most popular models for personality test (Cattell & Kline, 1977;H. J. Eysenck, 1982). A simple structure is often believed to exist in these factor models (Cattell & Kline, 1977), that is, each item loads on only a few factors. In this section, the authors investigate the spectral clustering algorithm using two data sets generated from a latent factor model. The data admit a simple structure that each item only loads on one factor, so that the item clusters are well defined (by the latent factor that each item measures). In particular, there are M latent factors. For each individual i, let θ i = (θ i1 , …, θ iM ) be the M-dimensional factor score. For each item j, let π j ∈ {1, … , M} be the factor it measures and a j be the corresponding factor loading. More precisely, Furthermore, each item loads on factor m with probability P(π j = m) =p m for m = 1, …, M. The authors construct the similarity matrix corresponding to K= 10 nearest neighbors. The similarity matrices are visualized via heat maps in Figure 2, in which blue cells indicate an edge between two items. They also circle items in the same cluster. There are more withinclass edges than between-class edges. In addition, the top eigenvalues are displayed in Figure 3. For the case n = 400, the largest three eigenvalues are close to 1.
The authors illustrate the performance of the spectral clustering algorithm for these two data sets. The vectors Y 1 ,…, Y J corresponding to the second and the third eigenvectors are displayed in Figure 4. The authors did not display the first eigenvector because it is a vector of all ones and contains no information (Proposition 1 in Online Appendix B). In the figure, the true class memberships are labeled by different colors. Cluster boundaries identified by the K-means algorithm are the black lines. For both data sets, the Y vectors admit the same cluster structure as that of the items. The proposed method performs well, even if the similarity matrices contain certain level of noise. There are six items misclassified for the case n = 100 and one item misclassified for the case n = 400.
To illustrate the sensitivity to the choice of the connectivity parameter K, the authors consider the data set with n = 100 and various connectivity parameter values K = 5, 10, 20, and 30. Note that K = 5 is the choice based on Algorithm 3. The corresponding similarity matrices and the results of clustering are shown in Figures 5 and 6. According to Figure 5, as K increases, the number of edges increases. Nonetheless, there are still more within-cluster edges than between-cluster edges. In addition, according to Figure 6, the Y vectors corresponding to different clusters are reasonably separated. Therefore, the spectral clustering algorithm is reasonably robust to the choice of K, even for a large connectivity parameter value that leads to a dense graph (e.g., when K = 30). The numbers of misclassified items are small, that is, 7, 6, 5, and 5, respectively, for K = 5, 10, 20, and 30. The sensitivity to the choice of K on clustering accuracy is further studied in the simulation studies in Online Appendix A.
The authors further illustrate the proposed parallel analysis for selecting M. They consider several connectivity parameters K = 5, 10, 20, and 30. Figures 7 and 8 show the observed eigenvalues and the corresponding thresholds computed based on the Monte Carlo samples from the reference distribution. All these plots start from the second largest eigenvalue, because the largest eigenvalue is always 1 (Proposition 1 in Online Appendix B). According to the results, the proposed method correctly selects the number of factors under all K. The influence of K on the parallel analysis will be further evaluated in the simulation studies and real data analysis.
For these two data sets, if K-means algorithm is directly applied (given the number of clusters), the numbers of misclassified items are 5 and 1, respectively, which is similar to the result of spectral clustering. However, if there exist reversely worded items, the K-means algorithm performs poorly. For example, the authors randomly choose half of the items and flip the responses, so that these items can be viewed as reversely worded items. The number of misclassified items for the two data sets then increases to 30 and 22, respectively. This is because the K-means algorithm measures the similarity between data points by their Euclidian distance.
Under this measure, two reversely worded items that measure the same construct will be regarded as dissimilar and thus not clustered together. However, the proposed method is invariant due to the use of the absolute value of Kendall's tau statistic as a similarity measure.

Comparing With Exploratory Factor Analysis
When data are generated from the latent factor model as above, exploratory factor analysis tends to work in terms of choosing both the number of factors (clusters) and classification.
Exploratory factor analysis is applied to the data set with N = 400 in the section "Illustration," using the R package "psych" (Revelle, 2016). The top eigenvalues of the estimated polychoric correlation are shown in Figure 9, and the parallel analysis of exploratory factor analysis (Horn, 1965) selects three factors. Under the "oblimin" rotation, a simple structure consistent with the generating model is observed, where "oblimin" is a widely used oblique rotational method (Costello & Osborne, 2009). Clustering the estimated loadings using the K-means algorithm only misclassifies one item.
The above result is not surprising, as the data are generated from the model assumed by exploratory factor analysis. However, what if item clusters are generated from a different model? To answer this, the authors simulate an example from an unfolding model, which is widely used in attitude measurement (Roberts, Donoghue, & Laughlin, 2000). The data are generated from a multidimensional unfolding model (DeSarbo & Hoffman, 1986), with the following specifications: the number of clusters M = 5, the number of items J = 60, sample size n = 1, 000, and P(π j = m) = 1/M, where π 1 , …, π J are the cluster membership.
Furthermore, each item is presented by an ideal point, denoted by δ j = (δ j1 , δ j2 ) ⊤ , on a twodimensional space, such that where μ 1 = (1, 0) ⊤ , μ 2 = (0, 1) ⊤ , μ 3 = (−1, 0) ⊤ , μ 4 = (0, −1), and μ 5 = (0, 0). The generated ideal points of the items are shown in Figure 10, where clusters are indicated by colors. Each individual is represented by an ideal point, denoted by θ i , on the same space. It is assumed that θ i , follows a standard multivariate normal. The responses are generated by where ║θ i − δ j ║ is the Euclidian distance between individual i and item j in the ideal point space. The closer the two ideal points, the more likely the individual agrees with the item (R ij = 1).
The spectral clustering method correctly selects the number of clusters, and only three points are misclassified. The authors then analyze the data with exploratory factor analysis. As the item clusters are generated from a different model, the estimated number of factors is no longer a good indication of the number of clusters and the extracted factors are difficult to interpret. The top eigenvalues of the estimated polychoric correlation matrix are shown in Figure 11, and parallel analysis selects four latent factors. Although the exploratory factor analysis model is misspecified, the estimated factor loadings still contain information about the clusters. That is, if we fit five-dimensional and four-dimensional exploratory factor models and then run a K-means algorithm to the estimated factor loadings under the "oblimin" rotation, the numbers of misclassified items are 5 and 6, respectively. Finally, the authors replicate the comparison of the above results by simulating 100 data sets for each setting, and the corresponding results are shown in Table 1. For spectral clustering, they check the estimated number of clusters and the classification error when the number of clusters is given. For exploratory factor analysis, the estimated number of factors from parallel analysis is used as the estimated number of clusters, and the classification is from Kmeans clustering of the estimated loadings of an M -dimensional factor model. For the estimation of the number of clusters, they report the percentage of correct estimation over 100 replications, and for the classification, they report the averaged classification error rate and its standard error. From Table 1, it is evident that both methods perform well when the true model is a latent factor model. When data are generated from an unfolding model, the parallel analysis of exploratory factor analysis is not intended to estimate the number of clusters and underestimates the number of clusters all the time, while the spectral clustering method correctly identifies the number of clusters 76% of the time. The misclassification error rates are similar for both methods, with the spectral clustering method performing slightly better.

Application to EPQ-R Data
In this section, the authors apply the spectral clustering method to the EPQ-R (S. B. Eysenck et al., 1985). The study is initially a confirmatory study associated with three scales: Psychoticism/Socialization (P), Extraversion/Introversion (E), and Neuroticism/Stability (N). Briefly speaking, the P scale measures a personality pattern typified by aggressiveness and interpersonal hostility; the E scale measures an attitude type characterized by concentration of interest on the external object, and the N scale measures a personality type characterized by anxiety, fear, moodiness, worry, envy, frustration, jealousy, and loneliness. The data set contains a total of 79 items with 824 respondents. There are 32, 23, and 24 items on the P, E, and N scales, respectively. The specific questions and the score keys can be found in Appendix 1 and Table 4 of S. B. Eysenck et al. (1985). To facilitate the presentation, the items are reodered by scales. The new item IDs in this analysis and the corresponding ones in S. B. Eysenck et al. (1985) are presented in Table 2. In particular, Items 14 to 32 and 53 to 55 (under the current set of IDs) are reversely worded. The data have been analyzed using item response theory models, such as in Maydeu-Olivares and Liu (2015) and Y. Liu, Magnus, Quinn, and Thissen (in press). To the best knowledge of the authors, this is the first cluster analysis of EPQ-R items.
The three scales are first validated by cluster analysis. In Table 3, the results of spectral clustering are presented, assuming that there are three clusters, and the connectivity parameter K is chosen as 5, as suggested in Algorithm 3. The results turn out to be insensitive to the choice of K. The item clustering results are identical to the confirmatory structure mentioned in the previous paragraph, except for six items: Item 14 "Do you stop to think things over before doing anything?" from the P scale is classified as Cluster 2 that corresponds to the E scale, and Items 4 "Do you have enemies who want to harm you?" 15 "Do you take much notice of what people think?" 16 "Would being in debt worry you?" 24 "Does it worry you if you know there are mistakes in your work?" and 29 "Can you on the whole trust people to tell the truth?" from the P scale are classified as Cluster 3 that corresponds to the N scale. Figure 12 presents the graph features corresponding to the second and the third eigenvectors of the Laplacian matrix, where the items from the P, E, and N scales are marked in black,red,and blue,respectively. Items 14,4,15,16,24, and 29 are on the boundary, suggesting that Item 14 is similar to both items from the P and E scales,and Items 4,15,16,24,and 29 are similar to both items from the P and N scales. The algorithm reconstructs the three scales reasonably well without prior information of the confirmatory structure, given that M = 3 clusters are enforced.
The authors compare the results above with results obtained from the K-means method. Simply applying the K-means method completely destroys the three-scale structure of EPQ-R, as shown in Table 4, mainly due to the existence of reversely worded items. In particular, it is observed that each cluster is a mixture of items from the three scales. If the data are preprocessed by manually flipping the responses to the reversely worded items, the K-means method produces reasonable results. However, even with that, the result of K-means method is still not as good as that of spectral clustering. In particular, as shown in Table 5, the three clusters correspond to the P, E, and N scales, respectively, but there are 12 items misclassified. These items include 42, 68, 58, 77, 78, 2, 6, 8, 20, 30, 45, and 46. The authors point out that flipping the responses to a subset of items does not change the result of the proposed method, because the absolute values of Kendall's tau statistics do not change. The better performance of spectral clustering is possibly due to that the similarity measure based on Kendall's tau statistic captures the underlying structure of the items better than that based on the Euclidian distance. In addition, it is often unknown which items are reversely worded, especially at the early stage of inventory construction. In such a situation, the proposed method still works, while the K-means is likely to fail.
The authors extend the analysis to the situation when the number of clusters is unknown. They choose K = 5 as suggested in Algorithm 3. The result of parallel analysis is shown in Figure 13, where the suggested number of clusters is 8. The clustering results for M = 8 and K = 5 are shown in Table 6. The eight clusters shown in Table 6 are basically finer scales of the original three scales in the confirmatory analysis. In particular, the P scale is split into three subscales, the E scale is split into two subscales, and the N scale is split into three subscales, except that Items 40, 45, and 48 from the E scale are classified into Cluster 3, a subscale of Psychoticism.
Kendall's tau statistic. This construction makes the algorithm applicable to both binary and ordinal response data and effective when data have a significant amount of missing or incomplete responses. The proposed algorithm is computationally efficient and easy to implement. The authors also propose a method to select the number of clusters. The method is easy to implement and performs well according to the simulation studies and real data application. In the application to EPQ-R data, eight clusters are suggested, and the corresponding clusters receive good interpretation. The authors point out that the proposed spectral clustering algorithm is an exploratory tool. To be used as assessment scales, the resulting item clusters need to be confirmed or improved upon by domain experts and by standard analysis based on latent factor models.
The proposed method is applicable to analyzing item response and other behavioral data from many areas, including personality, psychiatry, marketing, political science, and so on. Thanks to its computational efficiency, flexibility to different data structures, and reliable performance, the proposed method could serve as a useful exploratory tool for analyzing data that are heterogeneous and of large scale.
There are remaining issues that need to be addressed: First, the Kendall's tau similarity measure can be replaced by a different measure of similarity, such as the polychoric correlation. Second, the theoretical properties of the proposed method need to be established. In particular, if data are generated from a latent factor model with a simple structure, it is not difficult to show that the proposed method consistently clusters items that measure the same latent factor, as the sample size goes to infinity. A more challenging asymptotic setting is that the sample size, the number of items, and the number of factors all grow to infinity. It is of interest to investigate when the proposed algorithm works and when it does not. Finally, it is likely that there are items that measure multiple latent factors. For this, it is necessary to develop spectral clustering algorithms that allow observations to belong to multiple clusters. The graphical representation of an ideal example that each cluster is a connected component and different clusters are disconnected.   Result from parallel analysis for the EPQ-R data. Note. EPQ-R = Eysenck Personality Questionnaire-Revised. Chen  Results of Comparing Spectral Clustering and Exploratory Item Factor Analysis.
Appl Psychol Meas. Author manuscript; available in PMC 2018 February 01.  Table 3 Spectral Clustering With Three Clusters.