Retrieval Practice: Beneficial for All Students or Moderated by Individual Differences?

Retrieval practice is a learning technique that is known to produce enhanced long-term memory retention when compared to several other techniques. This difference in learning outcome is commonly called “the testing effect”. Yet there is little research on how individual differences in personality traits and working memory capacity moderate the size of the retrieval-practice benefits. The current study is a conceptual replication of a previous study, further investigating whether the testing effect is sensitive to individual differences in the personality traits Grit and Need for Cognition, and working memory capacity. Using a within-subjects design (N = 151), participants practiced 60 Swahili–Swedish word pairs (e.g., adhama–honor) through retrieval practice and re-studying. Learning was assessed at three time points: five minutes, one week, and four weeks after practice. The results revealed a significant testing effect at all three time points. Further, the results showed no association between the testing effect and the personality traits, or between the testing effect and working memory, at any time point. To conclude, retrieval practice seems to be a learning technique whose benefits are not moderated by individual differences in these specific personality traits or in working memory capacity, and it is thus possibly beneficial for all students.

of including feedback in retrieval-practice research is to equate exposure to the learning materials between learning conditions (Kang et al., 2007).
Some aspects of the effectiveness of retrieval practice are still uncertain. For example, it is still underexplored whether individual differences in personal attributes associated with learning will moderate the effect of retrieval practice, or if the method is equally beneficial regardless of academic aptitude. Tse and Pu (2012) demonstrated that cognition and personality can interact in such a way that people with higher levels of a personality trait, such as trait test anxiety, and lower cognitive levels (measured by WMC) benefit less from retrieval practice than people with a higher cognitive level. However, in a follow-up study, Tse et al. (2019) were unable to replicate the results, which calls for further research.
In this study, the focus is on investigating the impact of personality traits and WMC on the testing effect, in order to bring more clarity into this area. We argue that it is important to evaluate whether individual characteristics that are said to influence school performance (see e.g., Alloway & Alloway, 2010; Cacioppo et al., 1996; Duckworth & Quinn, 2009) are also critical for the effects of retrieval practice.
Research has identified several personality traits that affect people's aptitude for learning (Arbabi et al., 2015; Duckworth & Quinn, 2009; Lounsbury et al., 2003; Sadowski & Gulgoz, 1992). One specific trait is Grit, which has received much attention in recent years. Grit is defined as "perseverance and passion for long term goals" (Duckworth et al., 2007, p. 1087) and contains two subconstructs, that is, consistency of interest and perseverance of effort. Grit has been found to be predictive of both academic performance and other types of success (e.g., completion of the summer training program at West Point Military Academy and performance in Scripps National Spelling Bee; Duckworth & Quinn, 2009). However, Grit is known to have a strong positive correlation with Big Five Conscientiousness (r = .77; Duckworth et al., 2007; Duckworth & Quinn, 2009; Meriac et al., 2015), and some researchers question the distinctness of Grit from Conscientiousness (Credé et al., 2017). Another personality trait is Need for Cognition (NFC), defined as "the tendency for an individual to engage in and enjoy thinking" (Cacioppo & Petty, 1982, p. 116). NFC explains individual differences in motivation and effort when engaging in cognitive activities (van Seggelen-Damen, 2013). The concept of NFC has been examined for a long time and shown to be positively related to enhanced academic performance (Sadowski & Gulgoz, 1992), attending gifted classes (Meier et al., 2014), problem solving (Cacioppo et al., 1996), and the use of more advanced strategies for learning (Cazan & Indreica, 2014).
It seems likely that theoretical explanations for the testing effect emphasizing the advantage of retrieval effort should be more influenced by Grit and NFC, while explanations such as TAP would be less influenced. The present study focuses on the acquisition of a foreign language vocabulary for upper-secondary school students. The typical experimental procedure includes a learning phase and a manipulation phase (retrieval vs. re-study practice) followed by retention tests assessing learning (in this case, five minutes, one week, and four weeks after the learning phase). Participants with high Grit should be able to persevere through the learning phase to a higher extent and thus learn more and perform better across all three retention tests, while individuals with high NFC should engage more in assignments involving thinking and would therefore perform better at the retention tests irrespective of material. High motivation and effort when engaging in cognitive activities, which, as pointed out above, characterize individuals with high NFC, are aspects found to be especially important for test performance (Unsworth et al., 2013; Van Barneveld, 2007). Further, knowledge of effective strategies (for example, retrieval practice), which also characterizes individuals with high NFC, should be especially helpful for long-term memory consolidation (see e.g., Antony et al., 2017) independent of cognitive ability.
Only two previous studies have specifically examined NFC and Grit in relation to the effects of retrieval practice (Bertilsson et al., 2017;. used a between-subjects design (N = 98) to compare the learning effects of retrieval practice with group discussions with or without feedback, and whether retention was influenced by NFC. The results showed no relationship between NFC and the testing effect. Bertilsson et al. (2017) conducted two experiments, in which the participants learned Swedish-Swahili word pairs. Experiment 1 (N = 39) investigated the effect of retrieval practice relative to repeated studying using a between-subjects design. In Experiment 2 (N = 29), a within-subjects design was employed, and all participants used both retrieval practice and re-study to learn the materials. The learning outcome was assessed by means of cued recall tests at three different time points: immediately, one week, and four weeks after the intervention. The results of both experiments showed that neither Grit nor NFC was related to the effect of retrieval practice. While these findings are interesting, the conclusions are based on only two studies with rather few participants, and only one of the studies included Grit (i.e., Bertilsson et al., 2017). To draw firm conclusions and be able to generalize the findings, more research is needed targeting both NFC and Grit using a larger sample.
Besides specific personality traits, various cognitive abilities have been shown to have a significant impact on our ability to learn. Studies show that students with high WMC (Alloway & Alloway, 2010; Cowan, 2014), executive functioning (St. Clair-Thompson & Gathercole, 2006), or IQ (Deary et al., 2007) perform better in school than students who possess lower abilities in these areas. Working memory (Baddeley, 2010) has been suggested to have an essential role in a number of skills required for being successful in school, as well as for coping well with classroom activities in general (see Alloway, 2006, for a review). There are several hints that WMC may play an important role in learning word pairs (Swedish-Swahili word pairs were used in the present study) and in retrieving them across a period of four weeks. For example, studies have shown that the frontal lobe is critical during the acquisition of vocabulary (Karlsson Wirebring et al., 2015) and that the search process for long-term memory retrieval is driven by WMC (Unsworth et al., 2013). However, prior studies investigating the relationship between WMC and the testing effect have so far produced quite differing results. In a sample of college students who were instructed to learn general knowledge facts, Agarwal et al. (2017) found that on a delayed test two days after learning, retrieval practice improved performance for all students, but more so for low WMC students. In contrast, there are also a number of studies that have not found a relationship between WMC and the testing effect (Bertilsson et al., 2017; Brewer & Unsworth, 2012; Minear et al., 2018; Tse et al., 2019; Wiklund-Hörnqvist et al., 2014). One possible explanation for these mixed results is that the experiments use different methods.
Many aspects of the retrieval practice intervention (i.e., type of material, number of items, amount of practice, the lag between practice and retention test) vary between experiments and are likely to impact the relationship between WMC and the testing effect.
When studying predictors of academic performance, researchers have repeatedly found that personality traits are predictive of academic performance over and above variance predicted by cognitive ability (e.g., O'Connor & Paunonen, 2007). Under some circumstances, personality traits are better predictors of academic performance than cognitive ability is (Chamorro-Premuzic & Furnham, 2008; Furnham et al., 2003). In the present study, we therefore controlled for WMC before entering the traits Grit and NFC in the analyses, with the aim of investigating the amount of variance remaining after controlling for WMC.
In light of previous individual-difference research on the effects of retrieval practice, the purpose of this study is to further investigate whether the testing effect is sensitive to individual differences in Grit, NFC, and WMC, or whether the technique is equally beneficial for all students. One way to achieve this is to verify results from previous studies through replication. Replicability is a vital part of all research since conclusions drawn from results are not valid if the results cannot be replicated (Asendorpf et al., 2013). This is especially true when there are very few studies, as with personality and the testing effect, and when the results are inconclusive, as with WMC and the testing effect. Of the limited number of studies that have included personality traits, only one has used a within-subjects design (i.e., Bertilsson et al., 2017), which is preferable when investigating the effects of individual differences, but it included a small number of participants. Another strength of the experimental design of Experiment 2 in Bertilsson et al. (2017) is that retention was measured at three different time points, and both accumulated and uniquely tested word pairs were included at each retention test, making it possible to investigate the relationship between the independent variables and the testing effect at different intervals. The present study will, therefore, replicate the design, procedure, and, partially, the statistical analyses of Experiment 2 in Bertilsson et al. (2017), using a larger sample. In addition, while Bertilsson et al. (2017) examined individual differences in learning effects from retrieval practice and re-study practice separately, the present study makes an innovative contribution by extracting the performance difference scores between the learning conditions (retrieval practice vs. re-study practice) at each retention interval.
This calculation is necessary when examining individual differences in relation to the testing effect with a within-subject design, as separate analyses of the two conditions might cause problems with validity.

Participants
In total, 196 students (49.5% female; M age = 17.2, SD = .65) from two types of study programs (natural sciences and social sciences) were recruited from an upper-secondary school in northern Sweden. Thirty-eight students did not complete all parts of the study and were therefore excluded. Seven outliers were identified in the measure for WMC using a criterion of 1.5 times the interquartile range, and the corresponding cases were excluded from all analyses. No outliers were identified in the measures of Grit and NFC. This resulted in a final sample of 151 participants with ages ranging from 16 to 20 years (45.7% female; M age = 17.1, SD = .62). The participants received two movie tickets as a reimbursement for their participation, and written informed consent was obtained in accordance with the Declaration of Helsinki. The study was approved by the Regional Ethical Review Board, Sweden (2017/517-31).
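As an illustration of an interquartile-range exclusion rule of this kind, the sketch below applies the conventional fences (values more than 1.5 interquartile ranges below the first quartile or above the third quartile). The quartile method, the function names, and the scores are assumptions made for the example; the text does not specify how the quartiles were computed.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR outside the quartiles."""
    # The "inclusive" quartile method is an assumption, not taken from the study.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def exclude_outliers(scores, k=1.5):
    """Keep only participants whose score lies within the fences."""
    low, high = tukey_fences(list(scores.values()), k)
    return {pid: s for pid, s in scores.items() if low <= s <= high}


# Invented WMC scores for eight hypothetical participants.
wmc = {"p1": 40, "p2": 42, "p3": 45, "p4": 47,
       "p5": 50, "p6": 52, "p7": 55, "p8": 5}

kept = exclude_outliers(wmc)
print(sorted(kept))
```

In this invented sample only the participant with the extremely low score falls outside the fences. In the study itself, the exclusions add up as 196 − 38 − 7 = 151 participants.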
Based on Bertilsson et al. (2017; Experiment 2, N = 29), two a priori power analyses were conducted using G*Power 3.1.9.7 (Faul et al., 2009). The first a priori power analysis (repeated-measures ANOVA) targeted the testing effects at each retention interval (five minutes, one week, four weeks) with the lowest effect size (ηp² = 0.80) from Bertilsson et al. (2017, Experiment 2) as input. The analysis indicated that with an alpha of 0.05 and a statistical power of 0.95, a minimum of six participants is required, showing that the study by Bertilsson et al. (2017) had a sufficient sample size. The second a priori power analysis (multiple linear regression) targeted the effects of the Mental Effort Tolerance Questionnaire (METQ; Dornic et al., 1991) and Grit at each retention interval (five minutes, one week, four weeks). The correlations between METQ and retrieval practice word pairs, and between Grit and retrieval practice word pairs, at each retention interval were entered as input in the power analysis. The analysis indicated that with an alpha of 0.05 and a statistical power of 0.95, a sample size of 89 participants is required, which exceeds the sample size of Experiment 2 in Bertilsson et al. (2017).

Materials
The material used in the learning intervention consisted of 60 Swahili-Swedish word pairs (Karlsson Wirebring et al., 2015; Nelson & Dunlosky, 1994). Further, three instruments were used to measure WMC and personality; these are described below.
WMC. An automated version of the Operation Span task (Ospan; Unsworth et al., 2005), a complex working memory task, was used to measure WMC. The automated Ospan is administered on a computer and can be completed by the participants independently of the experiment leader. The task comprises two subtasks, a letter span and a concurrent math task, that are alternated so that a letter is presented between each math operation. The participant is required to solve the math task while at the same time maintaining the presented letters in memory. This continues for 3-7 trials before the participant is shown a matrix of 12 letters and is asked to recall the letters by clicking a box next to the letters in the order they were shown. To ensure that participants do not ignore the math tasks in favor of rehearsing the letters (thus measuring short-term memory rather than working memory) there is an 85% accuracy criterion on the math tasks. The current percentage of correctly solved math tasks is displayed to the participant during letter recall. Unsworth et al. (2005) reported good test-retest reliability, r = .83, and internal consistency, α = .78. They found the automated Ospan to be a valid measure of WMC as it was significantly related to the original Ospan (Turner & Engle, 1989), r = .45, and the two measures correlated similarly with Raven's Progressive Matrices, a measure of fluid abilities (automated Ospan, r = .38; original Ospan, r = .42) (Unsworth et al., 2005). The statistical analyses were conducted using the number of letters recalled in the correct position (i.e., partial credit load scoring, cf. Conway et al., 2005).
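The scoring rule named at the end of the paragraph, counting letters recalled in the correct position, can be sketched as follows. The data layout, the function name, and the handling of short or missing responses are assumptions for illustration. This is not the scoring code of the automated Ospan.

```python
def partial_credit_load(trials):
    """Sum, over all trials, the letters recalled in their correct serial position.

    `trials` is a list of (presented, recalled) letter sequences. Positions
    beyond the end of a short response simply earn no credit (an assumption).
    """
    total = 0
    for presented, recalled in trials:
        total += sum(1 for shown, given in zip(presented, recalled) if shown == given)
    return total


# Invented responses from one hypothetical participant.
trials = [
    ("FHJ", "FHJ"),      # all three letters in the correct position
    ("KLNPQ", "KLPNQ"),  # N and P swapped: K, L, and Q still earn credit
    ("RSTY", "RS"),      # only the first two letters were recalled
]

score = partial_credit_load(trials)
print(score)  # 3 + 3 + 2 = 8
```

Under this rule a trial with some errors still contributes its correctly placed letters, in contrast to all-or-nothing scoring where only perfectly recalled trials count.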
NFC. NFC was measured using the METQ (Dornic & Ekehammar, 1991), a Swedish adaption of the original NFC scale. The original scale is a well-validated measure of NFC (for meta-analysis, see Cacioppo et al., 1996) that was developed by Cacioppo and Petty (1982). The METQ contains 30 items with responses given using a 5-point Likert-type scale ranging from 1 (Do not agree at all) to 5 (Agree completely). The NFC score is the total sum of all items; however, 18 of the items are phrased negatively and therefore require reverse scoring. The internal consistency in the present study was α = .88, which is in line with previous studies that have evaluated the psychometric properties of the questionnaire (Dornic et al., 1991;. Psychometric evaluations have also found evidence of validity in the Swedish adaption of the NFC scale .
Grit. The Short Grit Scale (Grit-S; Duckworth & Quinn, 2009), an eight-item adaption of the original Grit Scale, was used to measure Grit. The questionnaire was translated to Swedish and independently back-translated to English by a professional translator to ensure good quality. Grit-S contains two subconstructs, that is, consistency of interest and perseverance of effort, each measured by four items in the questionnaire. Responses are given using a 5-point Likert-type scale ranging from 1 (Not like me at all) to 5 (Very much like me). Half of the items are phrased negatively and require reverse scoring. To generate the Grit score, the sum of the items is divided by the number of questions. Grit-S has been reported to have good validity and reliability, with an internal consistency ranging between α = .73 and .84 (Duckworth & Quinn, 2009). Similar levels of internal consistency have also been reported in previous studies using the Swedish translation (Bertilsson et al., 2017). In the present study, the internal consistency was α = .65, which suggests that the scale has a questionable reliability in this case. This lower level might be explained by differences between the two subconstructs that the scale measures. Hence, a single construct scale would potentially have had a higher internal consistency than a scale that is built on two subscales (for an overview of the Grit scale, see Credé et al., 2017).
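The Grit-S scoring rule described above (reverse-score the negatively phrased items, then divide the item sum by the number of items) can be sketched as below. The set of reverse-scored item numbers is a placeholder, since the text states only that half of the eight items are reversed, and the responses are invented.

```python
def reverse_score(response, scale_min=1, scale_max=5):
    """Mirror a response on the scale, so that 1 <-> 5 and 2 <-> 4."""
    return scale_min + scale_max - response


def grit_score(responses, reversed_items):
    """Mean of the item responses after reverse-scoring the negative items.

    `responses` holds the answers in item order. `reversed_items` holds the
    1-based numbers of the negatively phrased items.
    """
    adjusted = [
        reverse_score(r) if item in reversed_items else r
        for item, r in enumerate(responses, start=1)
    ]
    return sum(adjusted) / len(adjusted)


# Placeholder item numbers: an assumption, not taken from the text.
REVERSED_ITEMS = {1, 3, 5, 6}

# Invented responses from one hypothetical participant (items 1 to 8).
responses = [2, 4, 1, 5, 2, 3, 4, 4]

score = grit_score(responses, REVERSED_ITEMS)
print(score)  # (4 + 4 + 5 + 5 + 4 + 3 + 4 + 4) / 8 = 4.125
```

The same reverse-scoring step would apply to the 18 negatively phrased METQ items, with the difference that the NFC score is the item sum rather than the mean.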

Design
A schematic overview of the study design and experimental procedure can be seen in Figure 1 (a)-(c). A 2 × 3 factorial within-subjects design was used, meaning that the participants learned half of the word pairs using retrieval practice (repeated testing with immediate feedback) and the other half using repeated studying. Three retention tests were given at different delays in order to assess the amount of learning: five minutes after learning, after one week, and after four weeks. In addition, the word pairs were randomly divided into three groups to be measured at different lags.

Procedure
The automated Ospan (WMC), METQ, and Grit were completed one week before the intervention as part of a data collection within a larger research project with the title 'Learning to engage the brain'. The intervention consisted of a learning session, and an assessment session including three retention tests.
Learning session. The learning session took place in the participants' classrooms during a class period and was conducted on individual computers using a web-based program that was designed to present the 60 Swahili-Swedish word pairs. Immediately prior to the learning session, all 60 word pairs were presented, one at a time, on the participants' computer screens in order to familiarize the participants with the material. Each pair was shown for eight seconds. Next, the learning session started and consisted of six practice rounds in which all participants learned half of the material via retrieval practice and the other half through repeated study (see Figure 1(b)). Re-study word pairs and retrieval practice word pairs were interleaved and randomly assigned to one of the two conditions on an individual level, meaning that retrieval practice and re-study word pairs differed between participants. The instructions explained that when both words in a pair were presented (Adhama-Honor, re-study condition) the participants should read and learn the words, and when only the Swahili word was presented (Mashua-?, retrieval practice condition) the participants were instructed to type in the Swedish equivalent. Word pairs in the retrieval practice condition were shown for eight seconds while the participants wrote the Swedish translation, followed by one second of correct answer feedback. To equate exposure to the material in the two conditions, word pairs in the re-study condition were shown for nine seconds each.
Immediate and Delayed Tests. Retention was assessed by means of a cued recall test at three separate time points (see Figure 1(c)). A five-minute break separated the learning phase from the retention test. At each retention interval (5 minutes, 1 week, and 4 weeks) participants were tested on 20 unique Swahili-Swedish word pairs, 10 re-study, and 10 retrieval practice pairs. In addition, the word pairs tested in prior retention tests were included in the subsequent tests as well. As a result of this procedure, each retention test contained an increasing number of word pairs. For example, after 4 weeks, participants were tested on all 60 word pairs. Of those 60 pairs, 20 had previously been tested after 5 minutes and 1 week, 20 had previously been tested after 1 week, and the final 20 (10 re-study and 10 retrieval practice pairs) were unique to the 4-week test. See Table 1 for a correlation matrix of all variables used in this study.
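The composition of the three retention tests can be sketched as follows: each test adds 20 unique pairs (10 per condition) to everything tested earlier, giving 20, 40, and 60 items. The function name, the pair labels, and the seeded shuffle are assumptions for illustration, not the web-based program used in the study.

```python
import random


def build_retention_tests(retrieval_pairs, restudy_pairs, n_tests=3, seed=0):
    """Split each condition evenly across tests and accumulate earlier items."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    retrieval, restudy = retrieval_pairs[:], restudy_pairs[:]
    rng.shuffle(retrieval)
    rng.shuffle(restudy)
    per_test = len(retrieval) // n_tests

    tests, accumulated = [], []
    for i in range(n_tests):
        chunk = slice(i * per_test, (i + 1) * per_test)
        unique = retrieval[chunk] + restudy[chunk]
        accumulated = accumulated + unique
        tests.append({"unique": unique, "accumulated": list(accumulated)})
    return tests


# Hypothetical labels standing in for the 30 + 30 word pairs.
retrieval_pairs = [f"retrieval_{i:02d}" for i in range(30)]
restudy_pairs = [f"restudy_{i:02d}" for i in range(30)]

tests = build_retention_tests(retrieval_pairs, restudy_pairs)
sizes = [len(t["accumulated"]) for t in tests]
print(sizes)  # [20, 40, 60]
```

Analyses of "unique" word pairs use only the 20 new items per test, whereas analyses of "accumulated" word pairs use every item presented at that test.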

Results
The alpha level was set to .05, and as measures of effect size, partial eta squared (ηp²) and the coefficient of determination (r²) were used, where applicable. Greenhouse-Geisser corrected degrees of freedom were reported when Mauchly's test indicated that the assumption of sphericity had been violated.
First, a factorial 2 × 3 repeated-measures ANOVA was used to investigate changes in retention for retrieval practice and re-study word pairs, with the variables retention interval (five minutes, one week, and four weeks) and practice condition (retrieval practice and re-study practice) as within-subjects factors. The analysis was conducted using the word pairs that were unique to each retention test, that is, 20 word pairs at each interval. The results revealed main effects of retention interval, F(1.8, 268) = 396.687, p < .001, ηp² = .73, and practice condition, F(1, 150) = 248.982, p < .001, ηp² = .62, as well as a practice condition × retention interval interaction, F(2, 300) = 6.247, p = .002, ηp² = .04. The interaction was driven by a decreased difference between retrieval practice and re-study word pairs from the one-week to the four-week test, potentially caused by recall of re-study word pairs approaching the floor (see Figure 2). To determine whether there was a significant testing effect (that is, whether retrieval practice led to better retention than re-studying at each retention interval), the pairwise comparisons (Bonferroni corrected) for the main effects were inspected. They revealed that the significant main effect of retention interval reflects significant differences between all levels of the variable (all ps < .001) and that the significant main effect of practice reflects a significant difference in favor of the retrieval-practice condition (p < .001, see Figure 2).
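The reported effect sizes can be cross-checked against the F statistics with the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the three values reported above. The function name is hypothetical, and the degrees of freedom are used as rounded in the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error) as reported for the unique word pairs.
reported = [
    ("retention interval", 396.687, 1.8, 268),
    ("practice condition", 248.982, 1, 150),
    ("interaction", 6.247, 2, 300),
]

recomputed = {
    label: round(partial_eta_squared(f, df1, df2), 2)
    for label, f, df1, df2 in reported
}
print(recomputed)
```

The recomputed values (.73, .62, and .04) agree with the effect sizes given in the text.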
Next, the effect of WMC, Grit, and NFC on the testing effect was investigated using hierarchical regression analyses where WMC was entered in the first step, and Grit and NFC were added in the second step. The three regression analyses were Bonferroni corrected for multiple analyses (p < .017). As mentioned in Bertilsson et al. (2017), the reason for setting up the analyses this way was to examine the effects of WMC separately, and then to control for WMC when analyzing the effects of Grit and NFC. The differences in retention between retrieval practice and re-study practice on the three retention tests were the dependent variables (i.e., the testing effect). This setup is in contrast to Bertilsson et al. (2017), where separate regression analyses were conducted using performance in the retrieval-practice and re-study-practice conditions at each retention interval. The regression analyses for unique word pairs in the present study revealed no significant relations between WMC, Grit, or NFC and the testing effect (Table 2).
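The two-step structure of such an analysis can be sketched with ordinary least squares: fit the testing effect on WMC alone, then on WMC, Grit, and NFC, and compare the variance explained. Everything below (the helper functions, the data, the use of R² change as the summary) is an illustrative assumption in plain Python, not the authors' analysis. Significance tests of the coefficients are omitted.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(predictors, y):
    """R^2 of an OLS fit of y on the predictor columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Invented data for eight hypothetical participants.
wmc = [40, 52, 47, 60, 35, 55, 44, 50]
grit = [3.1, 3.8, 2.9, 3.5, 4.0, 3.3, 3.6, 2.7]
nfc = [95, 88, 110, 102, 99, 120, 91, 105]
effect = [2, 3, 1, 2, 3, 1, 2, 2]  # retrieval minus re-study recall

r2_step1 = r_squared([wmc], effect)             # step 1: WMC only
r2_step2 = r_squared([wmc, grit, nfc], effect)  # step 2: add Grit and NFC
delta_r2 = r2_step2 - r2_step1                  # variance added beyond WMC

alpha_per_test = 0.05 / 3  # Bonferroni threshold for three analyses, about .017
print(round(r2_step1, 3), round(r2_step2, 3), round(delta_r2, 3))
```

The change in R² between the steps expresses how much variance in the testing effect Grit and NFC account for once WMC has been controlled, and the Bonferroni threshold of .05 / 3 corresponds to the p < .017 criterion mentioned in the text.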
To investigate whether the results would be confounded by retrieval-practice effects that arise from taking the tests after 5 minutes and 1 week (i.e., accumulated word pairs), a second factorial 2 × 3 repeated-measures ANOVA was conducted using the accumulated word pairs tested at each retention interval (5 minutes = 20 items; 1 week = 40 items; 4 weeks = 60 items). The results showed main effects of retention interval, F(1.6, 242) = 232.906, p < .001, ηp² = .61, as well as practice condition, F(1, 150) = 83.264, p < .001, ηp² = .36 (see Figure 3), but no interaction effect. The lack of interaction between retention interval and practice condition suggests that retention of both types of word pairs declined between each of the three consecutive retention tests (see Figure 3).
The same hierarchical regression analyses as for the unique word pairs were conducted, but now including word pairs that had been tested in previous retention tests (i.e., accumulated; Bonferroni corrected for multiple analyses, p < .017). The results again revealed that none of the independent variables were related to the difference in performance on any of the retention tests (Table 3).
Because of the non-significant effects of NFC, Grit, and WMC, a post-hoc power analysis was conducted (Onwuegbuzie & Leech, 2004) using G*Power 3.1.9.7 and the statistical test of multiple linear regression (Faul et al., 2009). The correlations between NFC, Grit, and WMC were used as input and provided an effect size of f² = 0.21. With an alpha level set to .05 and the sample size of 151, the post-hoc power analysis provided a power of .99, indicating a small risk of a Type II error.
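For orientation, Cohen's f² for multiple regression is conventionally defined as f² = R² / (1 − R²). Assuming this definition underlies the reported value, the sketch below shows the proportion of explained variance that f² = 0.21 would correspond to. The function names are hypothetical, and the text does not detail how the input correlations were converted.

```python
def cohens_f2(r_squared):
    """Cohen's f^2 from the proportion of variance explained."""
    return r_squared / (1.0 - r_squared)


def r2_from_f2(f2):
    """Inverse mapping: proportion of variance explained implied by f^2."""
    return f2 / (1.0 + f2)


implied_r2 = r2_from_f2(0.21)
print(round(implied_r2, 3))  # 0.174
```

Under this assumption, the effect size entered in the power analysis corresponds to roughly 17% explained variance.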

Discussion
Figure 3. Recall Performance for Accumulated Word Pairs
Note. Proportion of correctly recalled accumulated word pairs as a function of practice (retrieval vs restudy) and retention interval (5 min, 1 week, vs 4 weeks).

This study partially replicated the design and procedure previously used by Bertilsson et al. (2017) using a larger sample size with the purpose of adding insight into how the benefits of retrieval practice are associated with interindividual differences in WMC and personality traits. Furthermore, this study contributes new insights compared to Bertilsson et al. (2017) by examining the testing effect using difference scores (retrieval practice − re-study practice). As in Bertilsson et al. (2017), the ANOVA conducted using unique word pairs showed that retention of retrieval practice word pairs was significantly better than the retention of re-study word pairs on all three retention tests, that is, testing effects. However, in contrast to the results in Bertilsson et al. (2017), an interaction effect between retention interval and practice condition was found, meaning that the decline in performance between the retention tests differed between the learning techniques. When it comes to the ANOVA conducted using accumulated word pairs, within-subjects testing effects were obtained, and, in line with the previous results, no interaction effect was found between retention interval and practice condition. This indicates that for unique word pairs there is a difference in the decline in performance between the two learning techniques, while for accumulated word pairs the decline is similar irrespective of learning technique. However, as can be seen in Figure 2, a floor effect can potentially explain the interaction rather than a real decreased difference between retrieval practice and re-study word pairs at the four-week test (but for related findings, see Carpenter et al., 2008).
The better retention of accumulated word pairs, compared to the unique word pairs, across time illustrates that in a pedagogical setting it is essential as a learner to have the possibility to retrieve the to-be-learned material several times in order to better consolidate the information (e.g., Rawson & Dunlosky, 2011; Wiklund-Hörnqvist et al., 2020). Bertilsson et al. (2017) did not find any relationships between NFC, Grit, or WMC and performance on word pairs, in either retrieval-practice or re-study-practice conditions. However, the analyses were conducted separately for retrieval practice and re-study word pairs, using performance at each retention interval as the dependent variables, which could be interpreted as reflecting episodic memory performance rather than the testing effect. In the present study, the regression analyses were performed using the performance difference score between the two conditions at each retention interval. The analyses were again conducted on both unique and accumulated word pairs, and, in line with the findings from Bertilsson et al. (2017), no relationships were found between any of the predictors and performance on any of the retention tests (albeit corrected for multiple comparisons).
The non-significant relationships between Grit, NFC, and retrieval practice performance found in both the current study and Bertilsson et al. (2017), with respect to both unique and accumulated word pairs, suggest that individual differences in personality traits emphasizing effort seem to be unrelated to the effects of retrieval practice. It therefore seems plausible that the testing effects in the present study and in Bertilsson et al. (2017), to a large extent, were driven by TAP, an argument that is in line with Agarwal's (2019) findings that TAP is the critical process in retrieval practice. However, an important caveat is, of course, that we did not directly manipulate practice and retention test format. Further, since it was found that the measure of Grit had a lower internal consistency in the present study than it has previously been shown to have, we have to be cautious in drawing conclusions about the relationship between Grit and the testing effect.
With that in mind, one aspect to consider is whether students who enjoy thinking (NFC) or show Grit found the experimental setup and the vocabulary language materials to be cognitively stimulating or motivational. Perhaps grittiness and/or NFC would have a significant impact on performance with more complex material, if the tasks had been subject to grading, or if the participants were allowed to choose whether they wanted to use retrieval practice or not. Moreover, it is also possible that the use of a more comprehensive measure of personality (e.g., the 48-item Conscientiousness scale of the Revised NEO Personality Inventory [Costa & McCrae, 2008]) would have yielded a different result. Such aspects should be investigated in future studies. Thus, it cannot be ruled out that for Grit, the small number of items included in the instrument and possible differences between the two subconstructs are a validity problem.
The non-significant relation between WMC and the testing effect is in line with previous studies (Brewer & Unsworth, 2012; Minear et al., 2018; Wiklund-Hörnqvist et al., 2014) and further underscores the conclusion that cognitive abilities (at least WMC) are of less importance for the use of retrieval practice. However, it should be noted that both Brewer and Unsworth (2012) and Minear et al. (2018) included measurements of intelligence (gf) beyond examining the non-significant effects of WMC. While Brewer and Unsworth (2012) found that retrieval practice was most beneficial for those with lower gf (relative to higher gf), Minear et al. (2018) found that students with lower (compared to higher) gf showed a larger testing effect for easy items. The opposite pattern was found for difficult items, such that students with higher (compared to lower) gf showed a larger testing effect. However, no significant relationship between gf and the overall testing effect was evident (Minear et al., 2018). In line with the current study, both studies used foreign language vocabulary as the to-be-learned material, but in contrast to the current study the testing effect was examined after one day (Brewer & Unsworth, 2012) or after two days (Minear et al., 2018), while the current study spanned weeks (see also Wiklund-Hörnqvist et al., 2014). In addition, the current study included both accumulated and unique word pairs, and, as evident from the results, the level of performance differed between accumulated and unique word pairs such that accumulated word pairs were retained to a higher degree relative to unique word pairs, possibly also due to testing. Importantly, independent of accumulated or unique word pairs, nonsignificant effects of WMC were evident across all three retention intervals, suggesting that cognitive load associated with retrieving unique word pairs relative to accumulated word pairs was comparable despite differences in performance level.
In sum, the results from Bertilsson et al. (2017) and the present study indicate that retrieval practice is a useful learning strategy in the context of acquiring a foreign language vocabulary. The present study also highlights retrieval practice as an effective learning strategy useful for students irrespective of the cognitive prerequisites and personality characteristics targeted. Such scientific evidence further emphasizes the significance of merging psychological and didactical knowledge for the purpose of optimizing learning outcomes in the classroom. Together with previous studies, we argue that retrieval practice should be explained and taught to students and teachers both as a pedagogical tool and as an individual learning strategy.
The present study has some limitations regarding the conclusions that can be drawn from the results. While a testing effect was identified, it is not possible to discern whether it is the result of a direct effect of testing, an indirect effect of testing (e.g., test-potentiated learning or forward testing effect), or whether both direct and indirect effects contribute to the advantage of retrieval practice. Arnold and McDermott (2013) suggest that the observed benefit of feedback on the testing effect may actually be the result of the testing effect and test-potentiated learning in conjunction. While most of the literature on the testing effect ignores this ambiguity, some studies have attempted to separate the effects (e.g., Arnold & McDermott, 2013; Kubik et al., 2016).
Future research aiming to investigate the effectiveness of retrieval practice should differentiate between direct and indirect effects of testing, so that individual differences can be interpreted in relation to the different types of testing effect. Although a wealth of studies shows that retrieval practice produces superior long-term retention even in the absence of feedback (e.g., Roediger & Butler, 2011), achieving long-lasting learning remains one of the challenges within the educational system, for teachers and students alike. From an educational perspective, the results of the current study indicate that retrieval practice accompanied by feedback can be one way for educators to respond to individual variability in personality traits and cognitive abilities that are associated with learning.

Author Biographies
Frida Bertilsson is a PhD student with a master's degree in cognitive science. She is part of the research project 'The learning brain', which focuses on studying learning strategies in relation to cognition.
Tova Stenlund is an associate professor in the Department of Psychology, Umeå University, Sweden. She has a PhD degree in educational measurement, and her research and teaching interests lie in the field of educational psychology and cognitive psychology, with a particular focus on validity of test and assessment results, test-taking behavior, and effects of repeated testing on memory and learning.
Carola Wiklund-Hörnqvist is an assistant professor in the Department of Psychology, Umeå University, Sweden. She has a PhD in psychology, and her research and teaching interests lie in the field of educational neuroscience, focusing on the relationship between memory and learning, specifically on the effects of repeated testing on memory and learning as reflected in both behavioral and neuroimaging data.
Bert Jonsson is a professor in the Department of Applied Educational Science. He is the principal investigator of the project 'The learning brain' financed by the Swedish Research Council. The project investigates fundamental questions arising in educational science and pertaining to the cognitive neuroscience of children's learning. The project focuses mainly on learning strategies in relation to cognition.