SPSS Syntax for Combining Results of Principal Component Analysis of Multiply Imputed Data Sets using Generalized Procrustes Analysis

Multiple imputation (Rubin, 1987) is a well-known method for handling missing data. Applying the procedure to an incomplete data set results in several plausible complete versions of the incomplete data set, which are then all analyzed with the same statistical analysis. To obtain one overall analysis that is used for interpretation, the analysis results of these completed data sets are combined using specific combination procedures. For principal component analysis (PCA), Van Ginkel and Kroonenberg (2014) proposed generalized Procrustes analysis (GPA; Gower, 1975; Ten Berge, 1977) to combine the results. To date, GPA seems to have been little used for combining PCA results in multiply imputed data sets, as suggested by the relatively few citations of Van Ginkel and Kroonenberg (2014) in applied research papers. One reason could be that only a few software packages have implemented GPA. Exceptions are the "shapes" package (Dryden & Mardia, 2016) in R (R Core Team, 2018) and the standalone program 3WayPack (Kroonenberg & De Roo, 2010). In addition, these software packages may not be well known to applied researchers, and it may not be obvious to them that the packages can also be used for combining the results of PCA in multiply imputed data. For these researchers, the authors developed a user-friendly SPSS subroutine which is specifically aimed at combining the results of PCA, as described by Van Ginkel and Kroonenberg (2014), and which can be applied completely within SPSS. To run the subroutine, one must first carry out a PCA on each of the imputed data sets in SPSS and save the results to a data file. Next, the subroutine combines the saved results using GPA. Within the subroutine, a number of required arguments and some optional arguments are specified. Among the most important optional arguments are the display of the Varimax rotated centroid solution in the output, and

• Incomplete2 imp.sav: A multiple-imputation data set in SPSS format containing the responses of 300 'respondents' to 41 items, denoted V1, ..., V41. The original incomplete data set is a simulated data set from a simulation study by Van Ginkel, Van der Ark, and Sijtsma (2007). The data file contains the original incomplete data set, plus five completed versions of the incomplete data set. The different versions are indicated by an additional variable imputation, which contains the data set number (0 indicating the original incomplete data set). The five completed versions were created using multiple Two-Way imputation for separate scales (Van Ginkel, Van der Ark, & Sijtsma, 2007). Variables V1 to V40 have ordered answer categories ranging from 0 to 4; variable V41 is a dichotomous variable with values 1 and 2. Five percent of the scores are missing. Missing values are indicated by a comma.

Generalized Procrustes Analysis
In GPA, a transformation matrix is computed for each imputed dataset such that the sum of the squared distances between the transformed loadings of the imputed datasets is minimized. Let A_m be a matrix containing the PCA loadings of the mth imputed dataset (m = 1, ..., M), and T_m the transformation matrix for the mth imputed dataset. The transformation matrices are computed by minimizing the following function, using an algorithm by Ten Berge (1977).
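Written out, a least-squares criterion of this kind is commonly given in the form below. This is a sketch based on the verbal description above: the use of the Frobenius norm and the orthonormality constraint on each T_m are assumptions, not taken from the original text.

```latex
f(\mathbf{T}_1, \ldots, \mathbf{T}_M)
  = \sum_{m < m'} \left\lVert \mathbf{A}_m \mathbf{T}_m - \mathbf{A}_{m'} \mathbf{T}_{m'} \right\rVert^2 ,
\qquad \mathbf{T}_m^{\top} \mathbf{T}_m = \mathbf{I} \quad (m = 1, \ldots, M).
```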
The mean of the M transformed loading matrices is called the centroid solution. The centroid solution is used as the combined solution for the imputed datasets. For more information on the algorithm for Generalized Procrustes Analysis, see Gower (1975), Ten Berge (1977), and Van Ginkel and Kroonenberg (2014).
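The centroid and the quantity being minimized can be illustrated with a small, self-contained Python sketch. It assumes the loadings have already been transformed; the function names and the numbers are hypothetical and are not part of the authors' GPA.sps.

```python
from itertools import combinations


def centroid(loadings):
    """Element-wise mean of M equally sized loading matrices (lists of rows)."""
    m = len(loadings)
    n_rows, n_cols = len(loadings[0]), len(loadings[0][0])
    return [
        [sum(mat[i][j] for mat in loadings) / m for j in range(n_cols)]
        for i in range(n_rows)
    ]


def gpa_criterion(loadings):
    """Sum of squared element-wise differences over all pairs of matrices."""
    total = 0.0
    for a, b in combinations(loadings, 2):
        for row_a, row_b in zip(a, b):
            total += sum((x - y) ** 2 for x, y in zip(row_a, row_b))
    return total


# Hypothetical transformed loadings: 2 variables, 2 components, M = 3.
transformed = [
    [[0.80, 0.10], [0.20, 0.70]],
    [[0.78, 0.12], [0.22, 0.68]],
    [[0.82, 0.08], [0.18, 0.72]],
]
combined = centroid(transformed)       # the centroid (combined) solution
loss = gpa_criterion(transformed)      # value of the least-squares criterion
```

The sketch only evaluates the criterion; finding the transformation matrices that minimize it requires the iterative algorithm described in the references above.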

Disclaimer and Bugs
It should be emphasized that this SPSS syntax is distributed without any warranty on the part of the authors. Although the SPSS syntax has been tested thoroughly, one can never fully exclude the possibility of errors. The authors appreciate suggestions and reports of detected errors (please enclose the SPSS data file). All correspondence can be sent to [address to be revealed after acceptance].

Using the SPSS Syntax

Requirements
This SPSS syntax uses the programming language Python. Python is included in the standard installation of SPSS version 22.0 and later versions.

Preparing your SPSS File
Several steps must be taken before the component loadings can be combined using the SPSS syntax file. After imputing the data, a PCA needs to be carried out on each imputed dataset separately. Next, the output of these PCAs must be saved to an SPSS data file. These steps are explained in the following sections and illustrated using the file Incomplete2 imp.sav.

Split File
Incomplete2 imp.sav is a file containing multiple imputed versions of the incomplete dataset. The PCA needs to be carried out on each imputed dataset separately. To do this, we must split the dataset by the imputation number in SPSS. Select the Split File option in the Data menu. Select Compare groups and add imputation as the grouping variable. These steps are shown in Figure 1 and Figure 2. Alternatively, one can use the following syntax command:
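A syntax equivalent of these menu steps can be sketched as follows. The helper below only builds the command text; SPLIT FILE LAYERED BY corresponds to the Compare groups option. The helper name and the variable name Imputation_ are assumptions, so substitute the name of the imputation variable in your own file.

```python
def split_file_syntax(group_var):
    """Return SPSS syntax that sorts the cases by `group_var` and splits on it.

    LAYERED corresponds to the 'Compare groups' option in the Split File dialog.
    """
    return (
        f"SORT CASES BY {group_var}.\n"
        f"SPLIT FILE LAYERED BY {group_var}."
    )


# 'Imputation_' is an assumed variable name; use the name in your own file.
syntax = split_file_syntax("Imputation_")
print(syntax)

# Inside SPSS, the commands could be submitted from a BEGIN PROGRAM block:
#     import spss
#     spss.Submit(syntax)
```

The printed commands can also be pasted directly into an SPSS syntax window.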

OMS
The OMS (Output Management System) option allows us to save the output of the analyses to an SPSS data file. To use this option, select OMS Control Panel from the Utilities menu (Figure 3). Next, select Tables, Factor Analysis and Factor Matrix.
Under Output Destinations, select File and specify a file name and location of your choosing. Next, click on Options and select SPSS Statistics Data File as the format. These steps are shown in Figure 4. Lastly, click on Add and close the window by clicking OK twice (Figure 5).
Next, carry out the principal component analysis for the multiply imputed dataset. This can be done by means of Analyze, Dimension Reduction, Factor (Figure 6). Add V1 to V40 to the Variables box. Click on Extraction, select Fixed number of factors, and set Factors to extract to the number of components you want to extract (Figure 7).
Do not base the number of components on the Eigenvalue criterion, as different imputations can lead to different numbers of extracted factors with the Eigenvalue method. Four components were used in our example. Next, click on OK. The last step is to turn off OMS logging by going back to the OMS Control Panel under Utilities and clicking on End (Figure 8). Close this window by clicking OK twice (Figure 9).
If the steps are followed correctly on the Incomplete2 imp.sav file, the newly created file will look like ExamplePCA.sav (Figure 10).
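For readers who prefer syntax over menus, the OMS and PCA steps above can be approximated with the commands generated by the sketch below. It assumes that the OMS, FACTOR and OMSEND commands with these subcommands reproduce the dialog choices described above, and that the unrotated solution is requested. The helper name and the output path are hypothetical.

```python
def pca_with_oms_syntax(outfile, variables, n_components):
    """Build SPSS syntax that routes the Factor Matrix table to a .sav file,
    runs an unrotated PCA with a fixed number of components, and ends OMS."""
    return "\n".join([
        "OMS",
        "  /SELECT TABLES",
        "  /IF COMMANDS=['Factor Analysis'] SUBTYPES=['Factor Matrix']",
        f"  /DESTINATION FORMAT=SAV OUTFILE='{outfile}'.",
        "FACTOR",
        f"  /VARIABLES {variables}",
        f"  /CRITERIA FACTORS({n_components})",  # fixed number, not eigenvalue-based
        "  /EXTRACTION PC",
        "  /ROTATION NOROTATE.",
        "OMSEND.",
    ])


# Hypothetical output location; four components as in the example.
syntax = pca_with_oms_syntax("C:/Example/ExamplePCA.sav", "V1 TO V40", 4)
print(syntax)
```

Because Split File is active, FACTOR is run once per imputed dataset and all loading matrices end up in the same output file.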

Syntax options
The options for the syntax are in Run GPA.sps. This file can be adjusted to change the options used in the GPA.sps file. When the Run GPA.sps file is opened, the syntax looks like this:

At least five arguments need to be specified to successfully carry out the procedure. These arguments are explained below.

Specifying the File Locations
Two file locations need to be specified in order for the syntax to be run. The first is the location of the GPA.sps syntax file, which is given in the INSERT FILE command, for example: spss.Submit(r"INSERT FILE='C:/Example/GPA.sps'.") Note that forward slashes are used in the example; SPSS may not be able to locate the file if backslashes are used. After the location of GPA.sps has been specified, the location of the file with the component loadings needs to be specified as well. This is the file that was created in section 3.2.2 using the OMS command. Suppose this file is called ExamplePCA.sav and is located in the folder C:\Example\. The line source_file = "{path using / + filename of the source file}" , must then be changed into source_file = "C:/Example/ExamplePCA.sav" ,
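If a path has been copied from Windows Explorer, it will contain backslashes. A minimal way to convert it to the forward-slash form used above is shown below; the helper name is hypothetical.

```python
def to_spss_path(path):
    """Replace Windows backslashes with forward slashes."""
    return path.replace("\\", "/")


# The raw string keeps the backslashes of the copied Windows path intact.
source_file = to_spss_path(r"C:\Example\ExamplePCA.sav")
print(source_file)
```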

Specifying the Names of the Components
To specify the names of the components, adjust the following line: component_names = "{names of components}" , by replacing {names of components} with the names of the variables holding the components you want to use. Separate these names by commas. For example, in ExamplePCA.sav these variables are named @1, @2, @3, and @4. To use these names, the following line can be used: component_names = "@1, @2, @3, @4" ,
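When many components are extracted, the string can be generated rather than typed. The sketch below assumes that the OMS output names the component columns @1, @2, and so on, as in ExamplePCA.sav; the helper name is hypothetical.

```python
def component_names_argument(n_components):
    """Build a comma-separated string '@1, @2, ...' for n_components columns."""
    return ", ".join(f"@{k}" for k in range(1, n_components + 1))


component_names = component_names_argument(4)
print(component_names)
```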

Specifying the Variable that Includes the Original Variable Names
After using the OMS option to create a file with the factor matrices, a variable will be created including the original variable names. In ExamplePCA.sav this variable is called Var1 (see Figure 11). To specify that the original variable names are in Var1, change variable_names = "{variable with the original variable names}" , into variable_names = "Var1" ,

Optional Arguments
Besides the five required arguments there are 10 optional arguments. They differ from the mandatory arguments in that they have a default setting that will be used if nothing is specified.

Specifying the Rotation Method
This argument supports two options: printing only the centroid solution of the Generalized Procrustes Analysis, or printing the centroid solution followed by its Varimax rotation. To print the Varimax rotated centroid solution, add the following line to the syntax: rotation = "var" , Thus, to display the Varimax rotated solution of the ExamplePCA.sav file, add rotation = "var" , to the argument list of gpa_pool, after the required arguments. When the rotation argument is not added, only the unrotated pooled solution is provided. Displaying only the unrotated solution can also be requested explicitly by adding rotation = "no" instead of rotation = "var".

Specifying whether the Original Dataset is Included
When multiple imputation is used on a dataset with missing values in SPSS, SPSS adds the imputed datasets below the original dataset. The standard setting for the GPA.sps file is to ignore the first dataset for the Generalized Procrustes Analysis. However, if the first dataset is not the original dataset with missing data, the following line must be added to the syntax: original_dataset = "no" , Thus, suppose that ExamplePCA.sav did not include the original dataset. The complete syntax would then look like:

BEGIN PROGRAM.
import spss
spss.Submit(r"INSERT FILE='C:/Example/GPA.sps'.")
theDict['gpa_pool'](
    source_file = "C:/Example/ExamplePCA.sav" ,
    component_names = "@1, @2, @3, @4" ,
    M = 5 ,
    variable_names = "Var1" ,
    original_dataset = "no" ,
)
END PROGRAM.

If the first dataset in the file is the original dataset with missing values, this option can either be set to "yes" or removed altogether. In this case the syntax will revert to the default setting of "yes". Alternatively, the options "y" and "n" may also be used.

Saving the Combined Results in a Data File
The dataset with the results of the Generalized Procrustes Analysis is stored in a new SPSS data file. The default value of this parameter is "no", which does not save the SPSS data file. If this option is chosen, the file can still be saved manually. Suppose the desired file name is outcome.sav and the desired location is C:\Example\. To save the data file automatically, add the following line:

Graphical Options
The syntax has five graphical options. The first two options specify whether the results of the Generalized Procrustes Analysis and of the Generalized Procrustes Analysis with Varimax rotation are displayed in loading plots. For both options, the default value is "no".
To display the loading plots of the Generalized Procrustes Analysis, add the following line:
To display the loading plots of the Generalized Procrustes Analysis with Varimax rotation, add graph_varimax = "yes" ,
If the loading plots are displayed, the default setting is to show both the centroids and the convex hulls of the solutions for all the variables.
To disable displaying the convex hulls, use:
To disable displaying the centroid solutions, use:
Alternatively, the options "y" and "n" may also be used.
The last graphical option makes it possible to show only a subset of the original variables. If the original dataset has many variables, the loading plots can become cluttered and difficult to interpret. With this option, a subset of variables may be specified. For example, if you only want to see the variables V1, V3, and V4, add the following line: subset = "V1, V3, V4" ,
Figure 12: Example of a Graph Created by the Syntax.
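To clarify what the centroids and convex hulls in the loading plots represent: for each variable, the M imputed datasets give M points in the plane of two components. The centroid is their mean, and the convex hull is the smallest convex polygon enclosing them. The sketch below computes both using Andrew's monotone chain algorithm. It is an illustration with hypothetical loadings, and the authors' syntax may construct its plots differently.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull of 2D points in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def centroid_2d(points):
    """Mean position of a list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# Hypothetical loadings of one variable on two components in M = 5 imputations.
loadings = [(0.60, 0.20), (0.70, 0.20), (0.70, 0.30), (0.60, 0.30), (0.65, 0.25)]
hull = convex_hull(loadings)
center = centroid_2d(loadings)
```

A small hull around the centroid indicates that the loadings of that variable hardly differ across imputations.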

Specifying the Maximum Number of Iterations
The maximum number of iterations allowed for convergence can be specified separately for the Generalized Procrustes Analysis algorithm and for the Varimax rotation algorithm. The default maximum for both procedures is 1000. To set the maximum number of iterations of the GPA algorithm to, say, 3000, use: