Development and Validation of an 18-Item Medium Form of the Ravens Advanced Progressive Matrices

The Ravens Advanced Progressive Matrices (APM) is a widely used measure of general intelligence (g) across both settings and cultures. Because of its lengthy 40-min administration time, several researchers have developed short-form scales, yet these forms typically yield significantly lower reliability. This article describes the creation of an 18-item short form (APM-18) and its validation in three samples of Southwestern U.S. university students (total N = 633). The APM-18 shows psychometric properties similar to those of both the previously published 36-item long form and 12-item short form, but retains a reliability estimate closer to that of the original APM. This, plus the shorter administration time (25 min) relative to the complete APM (40-60 min), makes it useful for time-constrained or mass-testing situations.

The Ravens Progressive Matrices Test, developed by Raven (1941) as a measure of general intelligence (g), has undergone many revisions, ranging from colored versions for children to the standard and advanced matrices for adults of different cognitive levels. The most recent published version is the Ravens Advanced Progressive Matrices (APM; Raven, Raven, & Court, 1993), which was developed for higher ability adult populations (i.e., college-level and above). This test is constructed of 36 items of increasing difficulty broken into three 12-item sets; in each item, the examinee is asked to complete a visual pattern by choosing one of eight possible solutions.
Due to its nonverbal format, the APM is purported to be a culturally fair, unbiased measure of fluid intelligence (Cattell, 1963), eductive ability (J. Raven et al., 1993), or, as we will refer to it, general intelligence (g; Spearman, 1927), and has shown itself to be especially useful in situations where English is not an individual's primary language. As such, the Standard and Advanced Progressive Matrices have been used extensively in many applied settings in the United States (e.g., Ackerman, 1992) and across many cultures (Owen, 1992; J. C. Raven, 2000; Rushton, Cvorovic, & Bons, 2007). However, the positive aspects of this test are marred by its lengthy administration time (40-60 min), making it difficult to use in time-constrained multivariate research or classroom settings.
In answer to this limitation, Arthur and Day (1994) developed a 12-item short form of the APM (which we call APM-12), with an administration time of 15 min. Several studies have shown that this 12-item form has acceptable psychometric properties (e.g., Cronbach's alpha, test-retest reliability, convergent validity; see Arthur, Tubre, Paul, & Sanchez-Ku, 1999, for review). However, this short form shows relatively low and variable internal consistency (IC). For example, Cronbach's alphas range from .58 to .66 for the short form itself and from .72 to .73 for the 12 short-form items extracted from the full 36-item version (Arthur & Day, 1994).
More recently, Hamel and Schmittmann (2006) have argued that the complete 36-item APM can be administered as a 20-min speed test. Scores on this speeded form of the APM show strong correlations with scores on slower timed (40 min, r = .74) and untimed versions (r = .75) of the APM. However, these authors failed to report the IC of the Speed Test Scale. We also suspect that giving typical adults only 20 min to complete 36 very challenging abstract reasoning problems might impose undue stress.
The purpose of the current study was to develop a medium-form version of the APM that resulted in higher IC than the 12-item version (APM-12), but shorter administration time than the full 36-item APM (APM-36), a combination of features that might be useful for time-constrained and mass-testing situations. Here, we report the development and construct validity of this 18-item scale.

Study 1: Scale Construction and Construct Validity
Method
Participants. A total of 633 students (198 male, 435 female) from three southwestern universities participated in this study as a partial requirement for experimental course credit. The mean age for participants was 20.92, SD = 4.07 (M male = 20.85, SD male = 3.90; M female = 20.96, SD female = 4.15). Ages ranged from 17 to 58 years old (male = 18-41, female = 17-58).

Measures
The Ravens Advanced Progressive Matrices 18-Item Short Form. This 18-item short-form version of the APM is printed in a booklet format on 8½″ by 11″ white paper, with each test item printed on a separate page. The first four pages of the test booklet contain three example items (Practice Items 1, 5, and 9 from APM-36) to explain the task.
The 18 actual test items were derived by adding six items from the longer 36-item version (J. Raven et al., 1993) to Arthur and Day's (1994) published 12-item version. Arthur and Day used Items 1, 4, 8, 11, 15, 18, 21, 23, 25, 30, 31, and 35 from the 36-item APM based on a set of three decision rules, which can be summed up as (a) dividing the APM into 12 three-item sections based on difficulty; (b) taking the item with the highest item-total correlation for each section; and (c) in the case of a tie, including the item that resulted in the largest drop in IC if it was excluded from the full test. Following these same rules, we added six more items of increasing difficulty: two that were easy (96% and 75% of examinees from the normative sample answered correctly), two that were moderate (50% and 48% of examinees from the normative sample answered correctly), and two that were difficult (37% and 32% of examinees from the normative sample answered correctly). These items (2, 20, 22, 24, 34, and 32) were integrated in order of difficulty to mimic their presentation order in the original APM.
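Decision rules (a) and (b) can be illustrated with a minimal Python sketch. It is not the procedure code used by Arthur and Day or by the present authors: the response matrix is simulated, the function names (`pearson`, `item_rest_correlations`, `select_items`) are hypothetical, the columns are assumed to be already ordered by difficulty, the item-total correlation is assumed to be the corrected (item-rest) version, and the tie-break rule (c) is omitted.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_rest_correlations(rows):
    """Correlate each item with the sum of all the other items."""
    totals = [sum(row) for row in rows]
    result = []
    for j in range(len(rows[0])):
        item = [row[j] for row in rows]
        rest = [t - v for t, v in zip(totals, item)]
        result.append(pearson(item, rest))
    return result


def select_items(rows, section_size=3):
    """Rule (a): split difficulty-ordered items into consecutive sections.
    Rule (b): keep the item with the highest correlation in each section."""
    r = item_rest_correlations(rows)
    chosen = []
    for start in range(0, len(r), section_size):
        section = r[start:start + section_size]
        best = max(range(len(section)), key=lambda i: section[i])
        chosen.append(start + best)
    return chosen


# Simulated 0/1 responses (assumption): 300 examinees, 36 items of
# increasing difficulty, generated from a simple logistic model.
rng = random.Random(0)
difficulties = [-2 + 4 * j / 35 for j in range(36)]
data = []
for _ in range(300):
    ability = rng.gauss(0, 1)
    data.append([
        1 if rng.random() < 1 / (1 + math.exp(-(ability - d))) else 0
        for d in difficulties
    ])

chosen = select_items(data)  # 0-based column indices, one per section
print(chosen)
```

With 36 items and sections of three, the sketch returns 12 indices, one from each section, which mirrors the structure of the 12-item form.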
Procedure. The new APM-18 test was given in classroom settings with several examinees at a time. This was done because this test was developed as a measure of g that could be used in environments such as classrooms, where there are time limits on research sessions. In one subsample (n = 175), tests were given with no time constraints, but with completion times recorded, to determine the average time needed for completion. The other two subsamples (n = 232 and n = 226) were constrained to finish the test within 25 min, with no individual completion times recorded.
Analyses. All statistical analyses were conducted using SAS Version 8.2 (SAS Institute, 1999). Cronbach's alphas and bivariate correlations were computed using the PROC CORR procedure. Tests for mean differences between sexes were calculated through t test (PROC TTEST) procedures. Hierarchical general linear models (GLMs) were tested using PROC GLM.

Results
IC estimates were computed by using Cronbach's alpha. The IC of the APM-18 scale yielded moderate reliability (α = .79). This alpha is lower than normative IC reports for the APM-36 (α = .84; Forbes, 1964), but higher than those for the APM-12 (ranging from α = .58-.66; see Arthur et al., 1999). Furthermore, the alpha of the APM-18 was larger than that of the embedded APM-12 (α = .73). Table 1 shows the results for each of the APM-18 items, with respect to their item-total correlations, item difficulties, and scale α of the overall scale if the item is deleted. As seen, deleting any item reduces the overall reliability of the scale, suggesting that all items should be retained.
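For readers who want to reproduce this kind of item analysis outside SAS, the following is a minimal Python sketch of Cronbach's alpha and the alpha-if-item-deleted statistic. The function names and the five-examinee toy data are illustrative assumptions, not the study data; the formula is the standard one, α = k/(k − 1) × (1 − Σ item variances / variance of total scores).

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-examinee item-score lists."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(rows):
    """Alpha of the remaining items after dropping each item in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([row[:j] + row[j + 1:] for row in rows])
        for j in range(k)
    ]


# Toy 0/1 responses (assumption): 5 examinees by 4 items.
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(toy)
deleted = alpha_if_item_deleted(toy)
print(round(alpha, 3), [round(a, 3) for a in deleted])
```

In the toy data, every alpha-if-deleted value falls below the full-scale alpha, which is the same pattern the text describes for Table 1.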
The mean APM-18 score was 9.73, SD = 3.59 (M male = 10.43, SD male = 3.52; M female = 9.41, SD female = 3.59), with a range of 18. For the subsample in which completion times were recorded (n = 175), the mean test completion time was 17.5 min (SD = 4.67), with a range of 7 to 25 min; 21% of the participants took longer than 20 min, but no one took longer than 25 min. In this subsample, there was a significant positive relationship between the amount of time it took for participants to take the test and their APM-18 score (r = .41, p < .001), but there was no relationship between age and APM-18 score, or age and time required to complete the test (for each r ≤ .03, p ≥ .71). However, for the complete sample (N = 633), younger participants scored a little higher (age and APM-18 scores correlated r = −.15, p < .001), and males scored a little higher (sex, coded female = 0 and male = 1, and APM-18 scores correlated r = .13, p < .001).
Hierarchical GLMs were tested to explore whether the apparent differences in male and female APM-18 scores might have been indirectly attributable to the relationship between age and APM-18 scores. This model defined the APM-18 score as the criterion variable, with the ordered predictor variables being age and then sex. The hierarchical model was designed to allow age to absorb as much variance as possible, with sex entered into the model only afterward. Using this model, both GLMs indicated a significant effect for age (F = 15.39, p < .001) and then also for sex after age had been statistically controlled (F = 10.66, p = .001).
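The logic of entering age first and sex second corresponds to sequential (Type I) sums of squares, in which each predictor is credited only with the variance left over after the predictors entered before it. The sketch below illustrates that idea in plain Python; it is not the authors' PROC GLM code, the function name `sequential_ss` and the six-case data set are hypothetical, and p values (which require the F distribution) are omitted.

```python
def sequential_ss(y, predictors):
    """Sequential (Type I) sums of squares and F ratios.

    predictors is an ordered list of (name, values) pairs; each predictor
    is residualized on the intercept and on all earlier predictors before
    its sum of squares is computed.
    """
    n = len(y)

    def center(v):
        m = sum(v) / n
        return [a - m for a in v]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    yc = center(y)
    basis = []  # mutually orthogonal residualized predictors
    ss = {}
    for name, values in predictors:
        r = center(values)
        for b in basis:
            coef = dot(r, b) / dot(b, b)
            r = [ri - coef * bi for ri, bi in zip(r, b)]
        ss[name] = dot(r, yc) ** 2 / dot(r, r)
        basis.append(r)

    df_error = n - 1 - len(predictors)
    mse = (dot(yc, yc) - sum(ss.values())) / df_error  # full-model error
    table = {name: {"ss": s, "F": s / mse} for name, s in ss.items()}
    return table, df_error


# Hypothetical toy data (assumption): six cases, sex coded female = 0, male = 1.
score = [1, 2, 3, 4, 5, 7]
age = [1, 2, 3, 4, 5, 6]
sex = [0, 1, 0, 1, 0, 1]

age_first, df_error = sequential_ss(score, [("age", age), ("sex", sex)])
sex_first, _ = sequential_ss(score, [("sex", sex), ("age", age)])
print(age_first, df_error)
```

Because the two predictors are correlated, the sum of squares credited to each one changes with entry order, while the total explained sum of squares stays the same. That is why entering age first gives a conservative test of the sex effect.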

Discussion
The results presented here suggest that the APM-18 may serve as a useful compromise between the lower reliability APM-12 and the much longer APM-36. The hierarchical GLMs identify both age and sex to be significant predictors of APM-18 scores, with younger individuals and males generally scoring higher. These results are consistent with many previous studies looking at general intelligence (e.g., Jackson & Rushton, 2006). Results of Study 1, however, do not test the convergent validity of this scale relative to other measures of intelligence. Study 2 was designed to do this.

Study 2: Convergent Validity
Study 2 was conducted to assess the convergent validity of the APM-18 with other measures of intelligence, academic achievement, and personality. To do so, we tested two separate subsamples (n = 193 and 229) taken from Study 1, each of which used different criterion measures. In Sample 1, two widely used measures of adult intelligence were used: the Mill-Hill Vocabulary Scale-Multiple Choice Sets A & B (MHV-MC; J. Raven, Raven, & Court, 1997), developed to be used in conjunction with the APM-36 as a measure of reproductive ability, that is, the ability to store and retrieve information (J. C. Raven, 1989); and the Shipley Institute of Living Scale (SILS; Zachary, 1986), which is a stand-alone intelligence test composed of two subscales (Vocabulary, which tests crystallized intelligence, and Abstraction, which tests fluid intelligence). We also examined academic performance via self-reported grade point average (GPA) and scholastic aptitude test (SAT) scores. In Sample 2, we examined correlations between APM-18 scores and Big Five personality dimensions assessed with the NEO Five-Factor Inventory (NEO-FFI) Scale (Costa & McCrae, 1992), and verbal and drawing creativity (Miller & Tal, 2007). In addition, ACT scores were collected in this second sample.

Method
Participants. Sample 1 was comprised of 193 students (94 male, 99 female) from an introductory psychology course at the University of Arizona. Mean age of participants was 19.11, SD = 1.62 (M male = 19.23, SD male = 1.07; M female = 19.01, SD female = 2.00). Due to the length of time required to administer the APM-18, the Shipley, and the Mill-Hill, 10 participants did not complete the Mill-Hill Test. We urged participants to record their SAT, ACT, and GPA scores only if they were certain of them; due to this constraint, many of these scores were also missing.
Sample 2 was comprised of 229 students (65 male, 164 female) from various undergraduate courses at the University of New Mexico. Mean age of participants was 20.19, SD = 3.43 (M male = 21.05, SD male = 5.01; M female = 19.85, SD female = 2.48). Again, we urged participants to record their ACT scores only if they were certain of them, leaving us with ACT scores for only 129 participants.

Measures
APM-18. The APM-18 consisted of the same items identified in Study 1. In Sample 1, the form was presented first in a series of measures examining adult intelligence. In Sample 2, it was presented in the middle of a questionnaire packet concerning personality, creativity, sexual behavior, and intelligence.
The SILS. The SILS (Zachary, 1986) is a timed (10 min per subscale), 60-item self-report measure that examines both verbal intelligence (40 items) and abstract intelligence (20 items). The test is considered appropriate for average English-speaking individuals from 14 to adult ages, who are motivated test takers. Validities and norms published in the manual were taken from a sample of 322 army recruits. Split-half reliabilities for each subscale are reported as .87 for Vocabulary, .89 for Abstraction, and .92 for the total score.
The MHV-MC. The MHV-MC (J. Raven et al., 1997) is a 68-item self-administered multiple-choice vocabulary test designed to complement the APM-36. Whereas the APM aimed to measure an individual's ability to solve novel problems and think in novel ways (i.e., fluid intelligence), the Mill-Hill aimed to measure an individual's ability to recall learned information (i.e., crystallized intelligence). To this extent, it indicates educational attainments, cultural background, and familiarity with the test's language. The Mill-Hill typically shows split-half reliabilities more than .90 and test-retest reliabilities ranging between .87 and .95 (Raven et al., 1997).
Academic performance. Academic performance was measured by self-reported GPAs and SAT scores in Sample 1. Sample 2 participants were asked for SAT and ACT scores. A variety of studies have identified moderate to strong correlations between these academic achievement and aptitude measures and a variety of other traits, including intelligence, personality, and psychopathology (Barton, Dielman, & Cattell, 1971; Brown, 1994; Dyer, 1987; Mouw & Khanna, 1993).

NEO-FFI. The NEO-FFI (Costa & McCrae, 1992) is the most widely used measure in research on the Five-Factor model of personality. It is a shortened version of the 240-item Revised NEO Personality Inventory (NEO-PI-R; Costa & McCrae, 1992), composed of 60 items that measure five global personality factors (12 items per factor): Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. In our version, participants rated degree of agreement with statements about their personalities and behavioral propensities on a 5-point scale ranging from −2 (strongly disagree) to 0 (neutral) to +2 (strongly agree). This scale has shown strong IC, with Cronbach's alphas ranging from .74 to .89 for each factor, and consistent cross-cultural validity (McCrae & Costa, 1997).
Verbal and drawing creativity tasks. Participants completed six 2-min verbal creativity tasks and eight 1-min drawing creativity tasks (Miller & Tal, 2007). Because a mating-oriented mind-set promotes creativity (Griskevicius, Cialdini, & Kenrick, 2006), participants were asked to complete these tasks as creatively as possible with the intention of attracting a romantic partner. Examples of verbal tasks included writing answers to thought-provoking questions, such as "How would you keep a marriage exciting after the first couple of years?" "What do you hope the world will be like in a 100 years?" and "Imagine that all clouds had really long strings hanging from them, strings hundreds of feet long. What would be the implications of that fact for nature and society?" There were two types of drawing tasks, four abstract (e.g., "Please draw an abstract symbol, pattern, or composition that represents your happiness as a child doing a favorite activity") and four representational (e.g., "In the space below, please draw an animal that you admire for its strength, grace, speed, or beauty"). Each participant's responses to each of the 14 creativity tasks were scored independently by four raters on a 1- to 5-point creativity scale. The resulting composite verbal creativity and drawing creativity measures showed high interrater reliability and IC (Cronbach's alphas = .91 and .90, respectively; Miller & Tal, 2007).

Results
Sample 1
IC estimates were computed using Cronbach's alpha. The APM-18 showed moderate IC (α = .71), with the embedded APM-12 yielding a slightly lower value (α = .63). Although these internal consistencies are lower than those reported in Study 1, they are still moderate in strength.
The mean APM-18 score was 10.68, SD = 3.25 (M male = 11.07, SD male = 3.13; M female = 10.31, SD female = 3.34), with a range of 13 (4 to 17). There was no relationship between APM-18 score and age (r = .03, p = .71), or sex (r = .11, p = .10). Mean scores for each of the measures of intelligence can be seen in Table 2. Because of the significant sex difference in APM-18 scores in Study 1, mean sex differences on all measures in this study were checked via t tests. There were no significant sex differences for any of the intelligence measures in this sample, except for a moderate male advantage on self-reported SAT scores (t = −3.00, p = .003). Therefore, the remaining analyses were conducted on the full sample rather than by sex.
As seen in Table 3, both the APM-18 and embedded APM-12 correlated significantly with most of the other measures of intelligence and academic achievement and aptitude used in this sample. Specifically, both APM scales showed their strongest positive correlations with the Shipley Abstraction scale and self-reported SAT scores. This is not surprising, as the APM is designed to be a measure of g, which may be most easily identified in relation to abstract, analytical measures: the Shipley Abstraction scale is one such measure, and the SAT contains an analytical subscale.

Sample 2
IC estimates were again computed using Cronbach's alpha. As in Study 1 and Sample 1 of this study, the APM-18 showed moderate reliability (α = .79), whereas the embedded APM-12 again showed slightly lower reliability (α = .74). The mean APM-18 score was 9.53, SD = 3.57 (M male = 10.29, SD male = 3.98; M female = 9.23, SD female = 3.36), with a range of 1 to 18. There was no relationship between APM-18 score and age (r = −.03, p = .61), but there was a relationship between APM-18 score and sex (r = .13, p = .04), with males again scoring slightly higher (t = −2.04, p < .05). There were no other sex differences on the other intelligence measures (for all ts ≥ −1.20, p > .05). Table 4 shows mean scores for the intelligence measures and NEO-FFI factors.
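As a rough consistency check (not an analysis reported in the article), the correlation between a two-level variable such as sex and a score can be converted to the equivalent pooled-variance t statistic. The worked equation below assumes the full Sample 2 size (n = 229) and uses the rounded value r = .13; the sign of t depends only on which group is subtracted from which.

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.13\sqrt{227}}{\sqrt{1-0.13^{2}}}
  \approx \frac{1.959}{0.992}
  \approx 1.98
```

This is close to the reported magnitude of 2.04. The small gap is what one would expect if the unrounded correlation were slightly above .13 (a value near .134 would give t ≈ 2.04).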

Discussion
Each sample in Study 2 used different methods of assessing the convergent validity of the APM-18. Sample 1 focused on relationships between the APM-18 and other standard measures of intelligence and academic achievement (e.g., verbal intelligence tests, self-reported GPA, and SAT scores), whereas Sample 2 examined the relationship between the APM-18, creativity, self-reported ACT scores, and Big Five personality traits. Both samples confirmed that the APM-18 is related to these measures in a predictable manner. Generally speaking, both the APM-18 and the embedded APM-12 showed the same pattern of correlations with the other measures used in these samples. However, the higher IC of the APM-18 suggests that it may be better at detecting individual variation in g.

Conclusion
Each of the 18 items used in this new APM-18 test was chosen to maintain the progressive difficulty of both the long form (APM-36) and the short form (APM-12). Unsurprisingly, although the APM-18's reliability was lower than that of the APM-36, it was higher than that of the APM-12 developed by Arthur and Day (1994). Furthermore, the patterns of correlation with other measures of intelligence are virtually identical to those of the APM-12, which has, in previous studies, been shown to mimic the APM-36 results (Arthur & Day, 1994; Arthur et al., 1999). Combined with an average administration time of 17.53 min (25 min maximum), these findings suggest that the APM-18 may work well as a compromise for researchers who want a reasonably accurate measure of general intelligence in a relatively short amount of time. The cross-validation in the three samples reported here is an initial attempt to collect normative data for the APM-18. Our results may generalize only to other college students. However, the APM-18's short administration time, high IC, reasonable validity, and ease of administration by paper and pencil in large college classroom settings make it well suited to behavioral science studies where researchers want a reasonably fast, accurate intelligence score as part of a larger questionnaire battery.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research and/or authorship of this article.