The Development of Scientific Strategy Knowledge Across Grades

In this study, we developed a new test on scientific strategy knowledge and investigated the construct validity of the resulting test scores. Measurement invariance across grade levels was analyzed to ensure the generalizability of the assessment, and convergent and discriminant validity were investigated. A total of N = 1,182 German high school students of Grade Levels 8, 10, and 12 completed tasks on strategy knowledge, fluid intelligence, content knowledge, interest in science, and scientific self-concept within a cross-sectional study. Multigroup confirmatory factor analysis was used to check for measurement invariance. Our results show that scalar invariance holds across grades and that there are significant differences in performance favoring students of higher grade levels. Furthermore, fluid intelligence and content knowledge are relevant predictors of strategy knowledge, whereas gender and motivational constructs do not show significant effects. Implications for developmental studies on strategy knowledge and assessment practice are discussed.


Introduction
Metacognitive abilities are key factors in scientific problem solving. Research has provided evidence of their importance in carrying out the individual steps of problem-solving processes assumed to underlie scientific investigations (H. Kim & Pedersen, 2011; Kuhn, Iordanou, Pease, & Wirkala, 2008; Künsting, Wirth, & Paas, 2011; Thillmann, 2007; Zimmerman, 2007). Moreover, research suggests that metacognitive abilities are prerequisites for the transfer of knowledge across different steps in problem solving and can, therefore, be regarded as essential in science education (Cooper & Sandi-Urena, 2009; Gamo, Sander, & Richard, 2010; Kapa, 2007). Efklides and Vlachopoulos (2012) differentiated three main facets of metacognition, which were also proposed by Flavell (1979) and Kuhn (2000). First, metacognitive knowledge refers to declarative knowledge about tasks (Kuhn & Pearsall, 1998), beliefs about knowledge and cognition (Liu, 2010), persons, and knowledge about strategies that can be used in specific problem-solving situations (Efklides & Vlachopoulos, 2012; Neuenhaus, Artelt, Lingel, & Schneider, 2011; Taasoobshirazi & Glynn, 2009). Second, metacognitive skills refer to the application of procedural knowledge in problem-solving processes (Funke & Frensch, 2007; Goode & Beckmann, 2010). Finally, metacognitive experiences are closely related to emotions and one's self-efficacy (Efklides & Vlachopoulos, 2012). Based on the concept of scientific inquiry, Mayer (2007) argued that knowledge about strategies and tasks can be regarded as one crucial factor in problem solving. In this context, problem solving is defined as the ability to perform operations to bridge the gap between an initial and a goal state (e.g., Novick & Bassok, 2005).
However, the relationship between knowledge about strategies and the application of strategies in problem situations is not deterministic (Amsel et al., 2008; Kuhn & Pearsall, 1998; Neuenhaus et al., 2011), although correlations have been found (Funke & Frensch, 2007; Goode & Beckmann, 2010). Whereas research on metacognition has mainly focused on procedural aspects, little is known about the structure and development of the declarative components. Given the lack of assessments of knowledge about strategies (strategy knowledge), the questions of domain-specificity and how this construct could be evaluated in specific contexts have rarely been addressed (Neuenhaus et al., 2011; Rickey & Stacy, 2000). Researchers who are interested in the assessment of strategy knowledge have also argued that developmental research on students' performance and its relation to constructs such as intelligence and content knowledge can be regarded as desiderata in educational research (Van der Stel & Veenman, 2010).

[Author affiliation: Humboldt-Universität zu Berlin, Germany. Scherer and Tiemann, SAGE Open, research article, 2014, DOI: 10.1177/2158244014522076]

Taking into account these research gaps, we developed a test on strategy knowledge for the domain of chemistry and evaluated the resulting measurement model according to its factorial structure. To use the test for analyzing differences across grade levels, further statistical properties such as measurement invariance are investigated. In addition, this study provides information on the relationships between strategy knowledge and related constructs as evidence on the measure's convergent and discriminant validity (Borsboom, Mellenbergh, & Van Heerden, 2004; Campbell & Fiske, 1959; Cronbach & Meehl, 1955; Messick, 1995). Solaz-Portolés and Sanjosé López (2008) defined declarative knowledge as "static knowledge about facts and principles that apply within a certain domain" (p. 107).
It contains knowledge about variables and about how to control them to obtain information on a system or experiment (Flavell, Miller, & Miller, 2002). It is, therefore, part of the declarative metamemory (Flavell, 1979). According to Kuhn's (2000) framework of metacognition, in which metacognitive awareness and controlling processes were differentiated, strategy knowledge refers to the first category of awareness. Artelt, Beinicke, Schlagmüller, and Schneider (2009) as well as Schneider and Artelt (2010) followed this approach and operationalized strategy knowledge as declarative knowledge about the nature of problems, tasks, and strategic behavior. This definition also contains knowledge about the control of variables and processing information. In our study, we refer to this operationalization of the construct as a part of metacognitive knowledge.

Metacognitive Knowledge About Strategies
Until now, there have been various studies on the question of whether or not certain types of knowledge or competencies are domain-general or domain-specific (De Jong & Ferguson-Hessler, 1996). From a psychological perspective, strategy knowledge might be domain-general because it refers to common problem-solving strategies that could be applied in various contexts (Van der Stel & Veenman, 2010; Veenman, Elshout, & Meijer, 1997). From an educational perspective, knowledge and competencies are always acquired in specific contexts and can, thus, be regarded as domain-specific (e.g., Hambrick, 2005; Neuenhaus et al., 2011). De Jong and Ferguson-Hessler (1996) further argued that knowledge about strategies is always bound to specific problems within a domain. In our study, we consider strategy knowledge as a domain-specific construct but argue that there might be an overlap with other domains.
There have been a few approaches to assess students' strategy knowledge in science. For example, Thomas, Anderson, and Nashon (2008) as well as Velayutham, Aldridge, and Fraser (2011) developed paper-and-pencil tests that were based on self-reporting scales. Other studies used interview scenarios or peer tutoring procedures. Recent research has focused on the assessment of strategy knowledge with questionnaires that contain specific problems and strategies (e.g., Artelt et al., 2009; Klieme et al., 2010; Shahat, Ohle, Treagust, & Fischer, 2013; Thillmann, 2007). In these approaches, students have to evaluate strategies according to their adequacy in solving a problem or task. Schneider and Artelt (2010) noted that strategy knowledge should not be assessed during problem-solving processes because it is regarded as a declarative and therefore static component of knowledge. In contrast, procedural knowledge could be assessed during problem-solving processes to capture its application in specific situations. In our study, the development of an appropriate test instrument for the domain of chemistry follows the static assessment procedure proposed by Artelt et al. (2009).
According to Berger and Karabenick (2011) and Thillmann (2007), there are different covariates that affect students' performance on strategy knowledge. Besides motivational and self-related constructs, domain-specific performance (e.g., school grades) and fluid intelligence show significant correlations (Hoffman & Schraw, 2009; Panaoura & Philippou, 2007; Van Kraayenoord & Schneider, 1999). In addition, De Jong and Ferguson-Hessler (1996) identified content knowledge and the accessibility of mental models as further covariates. Taking these results into account, the present study also intends to describe the relationships among strategy knowledge and covariates to check for convergent and discriminant validity (Campbell & Fiske, 1959; Messick, 1995).

Developmental Aspects of Strategy Knowledge
There have been only a few attempts to investigate strategy knowledge across grade levels. In this section, the main outcomes of these studies are presented.
Kolic-Vehovec, Bajsanki, and Roncevic Zubkovic (2010) argued that the importance of metacognition in reading comprehension intensified with increasing age. In a longitudinal study, they found that patterns of age differences varied across the different components of metacognition. Furthermore, the development of metacognitive knowledge differed according to the transitions across the life span (Annevirta & Vauras, 2001; Panaoura & Philippou, 2007; Schneider, 2008, 2010). For instance, Schneider (2010) proposed a major shift in strategy knowledge at the end of kindergarten and elementary school, whereas Lemaire and Lecacheur (2011) found a development from Grades 3 to 7 in selecting problem-solving strategies. Van der Stel and Veenman (2010) were also able to show that metacognitive skills and procedural knowledge increased over time. Further studies revealed that developmental patterns of metacognition also differed across gender groups. For instance, girls performed better in strategy knowledge tasks during high school (Leutwyler, 2009). However, first-grade boys significantly outperformed girls of the same age group in the domain of mathematics within a study conducted by Carr, Jessup, and Fuller (1999). In sum, research indicates that there is a growth in metacognitive knowledge and skills with increasing age or grade level.

Measurement Invariance
The growing interest in comparing students' performance across grade levels, gender, or ethnic groups has led to the question of which psychometric properties have to be fulfilled to conduct these comparisons. If researchers intend to compare test performance across different groups, they have to ensure that the tests assess the same construct in each of these groups. This test feature accounts for the generalizability and structure of the construct (Messick, 1995). Psychometricians have recently focused on establishing statistical models that can be used to investigate whether or not a construct holds across groups (Brown, 2006; E. S. Kim & Yoon, 2011). In these models, the analysis of measurement invariance has become an important method to check for test validity. Invariance can also be regarded as a prerequisite for establishing vertical scales in longitudinal or cross-sectional studies and facilitates the modeling of learning progressions (Kolen & Brennan, 2004; Köller & Parchmann, 2012; Wang & Jiao, 2009). If a measure holds across grade levels, researchers could use the test for investigating developmental changes by comparing latent means, variances, and covariances. Educational researchers and psychologists have therefore applied this concept for various constructs across different domains (e.g., Bowden, Saklofske, & Weiss, 2011; Doran, Aldridge, Roesch, & Myers, 2011; Lakin, 2012; Martin, 2009).

Research Goals
In light of the research gaps on strategy knowledge, the present study focuses on the assessment of the construct across grade levels and analyzes the factorial invariance of the resulting measurement model. In this regard, we provide a methodological approach of comparing students' performance across grades, which could be transferred to similar cross-sectional studies. More precisely, our research goals are as follows:

1. Developing a test on strategy knowledge for the domain of chemistry
2. Analyzing the resulting measurement model and its measurement invariance across grade levels
3. Comparing students' performance of strategy knowledge across grades
4. Analyzing the effects of covariates on strategy knowledge to validate the test

In our study, we mainly focus on the evaluation of measurement invariance to legitimize comparisons in strategy knowledge across Grades 8, 10, and the upper secondary level (Grade 12). The analysis of measurement invariance provides evidence on the generalizability of test scores across grade levels and thus contributes to construct validity. Further analyses account for construct validity by analyzing the relationships among the construct and its covariates. These analyses are regarded as methods of obtaining evidence on whether the strategy knowledge scores measure the intended construct (Borsboom et al., 2004; Cronbach & Meehl, 1955). In this regard, we aim to show that strategy knowledge, fluid intelligence, and content knowledge are empirically related constructs. To sum up, this article provides a new assessment of strategy knowledge and validates the resulting test scores across three grade levels in the context of construct validity.

Participants
As this study aims to compare performance on strategy knowledge across grades, we chose a cross-sectional design with three grade levels. In our study, participants were N = 1,182 German high school students of Grade Levels 8 (n₈ = 453), 10 (n₁₀ = 378), and the upper secondary level (n₁₂ = 351, Grade 12), each of whom attended one of 61 chemistry classes. The mean age of the entire sample was 15.61 years (SD = 1.68) and ranged between 13 and 21 years. Of these students, 48.9% were female. Students worked on computerized versions of tests on strategy knowledge and related constructs.

Measures
Strategy knowledge. The development of the test on knowledge about problem-solving strategies (strategy knowledge) followed a common approach of assessing the construct within the domains of reading, mathematics, and physics (Klieme et al., 2010; Neuenhaus et al., 2011; Thillmann, 2007). Due to the domain-specificity of strategy use (e.g., Schukajlow & Leiss, 2011), these tasks contain specific problems or scientific hypotheses as well as various strategies that might be performed to solve the problem or to check the hypotheses. Following the argumentation of Mayer (2007), these tasks refer to metacognitive knowledge about scientific problems and problem-solving procedures. The resulting measure strongly refers to metacognitive knowledge and indicates the degree of the students' awareness of the best problem-solving methods. In this framework, the construct is part of the declarative component of metacognitive knowledge and knowledge about tasks or contexts (Efklides & Vlachopoulos, 2012). The process of test development was performed in two steps: First, we developed six chemistry problems that contained a specific research question or hypothesis. Following Berry and Dienes (1991), we distinguished between implicit and explicit tasks by varying the surface structure of the stimuli. This concept was transferred to the construct of strategy knowledge and resulted in three tasks that did not contain explicit information on the variables to be controlled (Figure 1). Accordingly, to evaluate possible strategies in implicit tasks, students first had to identify the relevant variables of the system or experiment. In contrast, explicit information on the number and types of variables was given in three explicit tasks (Figure 2).
These tasks were administered to 20 experts in the field of science education and chemistry. Each of them was asked to evaluate the strategies according to their adequacy in solving the problem on a 6-point Likert-type scale ranging from 1 = highly adequate to 6 = insufficient. We then analyzed whether the expert ratings significantly differed from each other by transposing the data matrix into a matrix with raters as variables and item ratings as cases. The internal consistency was computed as a measure of intraclass correlation (see Thillmann, 2007). The resulting value of .96 was significant, F(80, 1520) = 26.94, p < .001, and indicated a sufficient rater agreement for the entire test. To evaluate the agreement on an item level, we subsequently checked the agreements on how raters compared pairs of two strategies. More precisely, if more than 70% of the raters evaluated Strategy A better than B, the agreement on this comparison was regarded as sufficient and was used for further analyses (Scherer & Tiemann, 2012). Finally, 20 comparisons fulfilled the above-mentioned criterion and were used for further analyses. The number of items obtained in this study was reasonable, as compared to Artelt et al.'s (2009) reading strategy test that contained 26 out of 77 comparisons. This procedure of item development accounted for content validity of the underlying measure (Messick, 1995).
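The item-level agreement criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rating data and the names `pair_agreement` and `retained_comparisons` are hypothetical, and only the 6-point scale (1 = highly adequate) and the 70% threshold are taken from the text.

```python
from itertools import combinations

def pair_agreement(ratings, a, b):
    """Share of raters who rated strategy `a` strictly better (lower grade) than `b`."""
    better = sum(1 for r in ratings if r[a] < r[b])
    return better / len(ratings)

def retained_comparisons(ratings, threshold=0.70):
    """Return ordered pairs (better, worse) on which more than `threshold` of raters agree."""
    strategies = sorted(ratings[0])
    kept = []
    for a, b in combinations(strategies, 2):
        if pair_agreement(ratings, a, b) > threshold:
            kept.append((a, b))
        elif pair_agreement(ratings, b, a) > threshold:
            kept.append((b, a))
    return kept

# Hypothetical ratings of three strategies by five experts
# (1 = highly adequate, 6 = insufficient); not data from the study.
ratings = [
    {"A": 1, "B": 3, "C": 3},
    {"A": 2, "B": 4, "C": 3},
    {"A": 1, "B": 2, "C": 4},
    {"A": 2, "B": 5, "C": 2},
    {"A": 3, "B": 3, "C": 5},
]
kept = retained_comparisons(ratings)
```

In this toy example, four of five raters prefer A over B and A over C, so both comparisons pass the 70% threshold, whereas the B versus C comparison is dropped because the raters are split.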
Second, the students' ratings were scored as follows (e.g., Artelt et al., 2009; Artelt, Schiefele, & Schneider, 2001; Thillmann, 2007): If a student rated Strategy A better than B and this relation aligned with the experts' ratings, then the item comparing A and B was coded as 1. If the student rated Strategies A and B equally, 0.5 points were awarded; otherwise, the item was coded as 0. Given this scoring procedure, the items are dependent because the comparisons among the strategies of one stimulus are interwoven. For example, rating Strategies A and C might be affected by rating Strategies A and B. Due to these dependencies among items and the varying numbers of items within a task, we used the z-transformed sum scores (SK01-SK06) as six indicators of strategy knowledge. This is a common procedure when dealing with testlets or tasks that consist of several dependent items (Lakin, 2012; Little, Cunningham, Shahar, & Widaman, 2002; Yen, 1993).
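The scoring rule and the subsequent z transformation of task sum scores can be illustrated as follows. This is a minimal sketch, not the authors' scoring script: the function names and the example ratings are hypothetical, and the use of the population standard deviation for the z transformation is an assumption, because the text does not say which variant was used.

```python
from statistics import mean, pstdev

def score_comparison(rating_better, rating_worse):
    """Score one expert-validated comparison of two strategies.

    `rating_better` is the student's grade for the strategy the experts preferred,
    `rating_worse` the grade for the other one (1 = very good, 6 = not sufficient).
    """
    if rating_better < rating_worse:
        return 1.0  # student's order matches the experts' order
    if rating_better == rating_worse:
        return 0.5  # student rated both strategies equally
    return 0.0      # student's order contradicts the experts' order

def task_sum_score(student_ratings, expert_pairs):
    """Sum the comparison scores of one task for one student."""
    return sum(score_comparison(student_ratings[a], student_ratings[b])
               for a, b in expert_pairs)

def z_transform(scores):
    """Standardize sum scores (population SD is an assumption of this sketch)."""
    m, sd = mean(scores), pstdev(scores)
    return [(s - m) / sd for s in scores]

# Hypothetical task with two retained comparisons: experts prefer A over B and A over C.
expert_pairs = [("A", "B"), ("A", "C")]
students = [
    {"A": 1, "B": 3, "C": 2},
    {"A": 2, "B": 2, "C": 1},
    {"A": 4, "B": 2, "C": 5},
]
sums = [task_sum_score(s, expert_pairs) for s in students]
z_scores = z_transform(sums)
```

Because each task contributes a different number of comparisons, standardizing the task sum scores puts the six indicators on a comparable metric before they enter the factor model.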
Covariates of strategy knowledge. As previous research on covariates of strategy knowledge suggested (Neuenhaus et al., 2011; Phakiti, 2008; Van der Stel & Veenman, 2010; Veenman et al., 1997), fluid intelligence, content knowledge, self-concept, and motivational constructs can be regarded as predictors of the construct and could, therefore, be used for an external validation of the measure. Besides these covariates, we additionally assessed grades in chemistry to obtain information on the domain-specificity of strategy knowledge and used interest and enjoyment in science as indicators of motivation.
[Figure 1. Task stimulus: Sandra hypothesizes that different factors determine the solubility of substances in water. Different procedures occur which might be performed in order to check her hypothesis in a series of experiments. Evaluate these methods with grades ranging from 1 (= very good) to 6 (= not sufficient).]

Fluid intelligence. The test on fluid intelligence was based on the figural scale of a cognitive ability test, which significantly loads on the g factor of intelligence (Heller & Perleth, 2000). This test consisted of 45 items, including 15 anchor items for adjacent grade levels and 10 grade-specific items. Each student had to solve 25 figural problems within a time limit of 8 min. The resulting answers were dichotomously scored.
Content knowledge and grades in chemistry. Content knowledge was assessed with 44 multiple-choice items that referred to the contents of the strategy knowledge tasks. We developed one test form for each grade level and linked these forms with 3 to 10 anchor items. Finally, each test contained 20 to 24 dichotomous items. Because the test covered many different content areas, we expected the reliability to be low but sufficient.
Furthermore, grades in chemistry were assessed to examine the influence of this performance-based variable on strategy knowledge. This variable ranged from 1 = very good/superior to 6 = insufficient and was handled as an ordinal variable. In line with the German grading system, high performance in chemistry corresponds to numerically low grades.
Due to the anchor designs of the tests on fluid intelligence and content knowledge, an item response theory model had to be used to place students' person parameters concurrently on a common scale (Kolen & Brennan, 2004). The resulting person parameters of the Rasch model (weighted likelihood estimates, WLEs) representing the latent trait were used for further analyses. Item response modeling was performed in ACER ConQuest 2.0 (Wu, Adams, Wilson, & Haldane, 2007).
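For readers less familiar with the model, the dichotomous Rasch model referred to above can be written as follows. The notation is ours and not taken from the article: θ_p denotes the ability of person p and β_i the difficulty of item i.

```latex
P(X_{pi} = 1 \mid \theta_p, \beta_i) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)}
```

Because the anchor items appear in more than one grade-specific test form, a concurrent calibration estimates all item difficulties in one run, so that the person parameters of all grade levels are expressed on the same scale.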
Motivation and self-concept. Motivational and person-related constructs were assessed by empirically validated questionnaires of the Programme for International Student Assessment (PISA) 2006 study (Organisation for Economic Co-Operation and Development [OECD], 2009). In our study, we focused on two scales of motivation and interest: enjoyment (6 items) and interest in science (9 items). We further used the PISA scale for scientific self-concept (5 items). These three scales were administered as subtests with a 4-point Likert-type scale that ranged from 1 = I disagree to 4 = I totally agree.

Procedure
In our survey, we chose a cross-sectional design to describe and compare subpopulations of different grade levels. This design has the major advantage of assessing students' strategy knowledge at one time point without any follow-ups and dropouts over time. In our study, we did not aim to capture individual growth patterns, but rather differences between subpopulations. The tests on strategy knowledge and covariates were administered as computerized versions in two sessions of 90 min each. In all tests, students were able to return to previous items and correct their solutions, if necessary. The resulting data were simultaneously logged and finally coded in SPSS 19 (IBM, 2010).

Statistical Analyses
Confirmatory factor analysis (CFA). To check for the structure of the strategy knowledge test, we conducted CFAs in Mplus 6.0 (Muthén & Muthén, 2010). In these models, missing values can be imputed on a model-based level. In this context, the Full Information Maximum Likelihood (FIML) procedure was used (Enders, 2010). In our study, 3.2% of the data were missing. According to Little's missing completely at random (MCAR) test that tests the assumption of MCAR against missing at random, our data were more likely to follow MCAR, χ²(37, N = 1,182) = 37.09, p = .51. Therefore, the model-based imputation of the FIML procedure was legitimate.

[Figure 2. Task stimulus: Reaction rates depend on different factors such as temperature, concentration, and the properties of surfaces. Tim wants to investigate how the temperature of hydrochloric acid influences the rate of the reaction with CaCO₃. Different procedures occur which might be performed in order to analyze the influence of temperature in a series of experiments. Evaluate these methods with grades ranging from 1 (= very good) to 6 (= not sufficient).]
In our analyses, we used the robust maximum likelihood estimator and evaluated the goodness of fit by taking into account the following statistics: the Satorra-Bentler scale-corrected chi-square value (SB-χ²), the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA), and the Standardized Root Mean Square Residual (SRMR; Brown, 2006). Common guidelines for an acceptable model fit require a nonsignificant SB-χ² value, a CFI above .90, an RMSEA below .08, and an SRMR below .09 (Hu & Bentler, 1999; Marsh, Hau, & Grayson, 2005). However, we note that the statistical significance of the SB-χ² value strongly depends on the sample size (Berger & Karabenick, 2011). For further comparisons of competing and nested CFA models, we applied χ²-difference tests with a Satorra-Bentler correction (Bryant & Satorra, 2012).
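The fit guidelines listed above can be expressed as a small checking function. This is only an illustrative sketch: the function name `acceptable_fit` is hypothetical, and the significance level of .05 for the SB-χ² test is an assumption that the text does not state. The example values are the fit indexes reported later in this article for the model with constrained latent means.

```python
def acceptable_fit(cfi, rmsea, srmr, p_value=None, alpha=0.05):
    """Check fit indexes against the cutoffs named in the text.

    Returns a dict mapping each criterion to True/False. The chi-square
    criterion is only checked when a p value is supplied; `alpha` = .05
    is an assumption of this sketch.
    """
    checks = {
        "CFI > .90": cfi > 0.90,
        "RMSEA < .08": rmsea < 0.08,
        "SRMR < .09": srmr < 0.09,
    }
    if p_value is not None:
        checks["SB-chi2 nonsignificant"] = p_value > alpha
    return checks

# Values reported in the article for the model with constrained latent means
result = acceptable_fit(cfi=0.97, rmsea=0.02, srmr=0.04, p_value=0.17)
for criterion, passed in result.items():
    print(f"{criterion}: {'met' if passed else 'not met'}")
```

Such cutoffs are guidelines rather than strict decision rules, which is why the text also points to the sample-size dependence of the chi-square test.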
Multigroup CFA. The investigation of measurement invariance was conducted by multigroup CFA, which allows the evaluation of different types of invariance and yields information on model-based fit criteria (Campbell, Barry, Joe, & Finney, 2011). In this concept, four types of measurement invariance are considered (Hildebrandt, Wilhelm, & Robitzsch, 2009; Lubke, Dolan, Kelderman, & Mellenbergh, 2003): (a) configural invariance, which can be established by freely estimating factor loadings and by assuming the same number of factors and loading patterns across groups; (b) metric invariance, in which equal factor loadings are added to the configural model; (c) scalar invariance, in which equal intercepts are introduced to the metric model; and finally (d) strict invariance, in which equal residual variances are added to the scalar model. Again, χ²-difference tests and goodness of fit indexes can be used to compare these hierarchically ordered models (Byrne & Stewart, 2006; Cheung & Rensvold, 2002). As Brown (2006) suggested, if at least scalar invariance is met, comparisons of means, variances, and covariances can be conducted.
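Comparisons of nested invariance models with robust estimators cannot simply subtract the two scaled chi-square values. A common form of the Satorra-Bentler scaled difference test, as typically applied with robust maximum likelihood output, can be sketched as follows. The function name and the numerical inputs are hypothetical, and we assume, without confirmation from the text, that this is the variant the authors applied.

```python
def sb_scaled_difference(t0, df0, c0, t1, df1, c1):
    """Satorra-Bentler scaled chi-square difference for two nested models.

    t0, df0, c0: scaled chi-square, degrees of freedom, and scaling correction
                 factor of the more constrained (nested) model.
    t1, df1, c1: the same quantities for the less constrained comparison model.
    Returns the scaled difference statistic and its degrees of freedom.
    """
    df_diff = df0 - df1
    # Scaling correction factor of the difference test
    cd = (df0 * c0 - df1 * c1) / df_diff
    # t * c recovers the uncorrected ML chi-square of each model
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, df_diff

# Hypothetical values for a metric model (0) nested in a configural model (1)
trd, df_diff = sb_scaled_difference(t0=60.0, df0=37, c0=1.10,
                                    t1=45.0, df1=27, c1=1.05)
print(f"scaled difference = {trd:.2f} on {df_diff} df")
```

The resulting statistic is referred to a chi-square distribution with `df_diff` degrees of freedom; a nonsignificant value indicates that the added equality constraints do not worsen the fit. When both scaling factors equal 1, the formula reduces to the ordinary chi-square difference.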

Latent regression analyses.
To examine convergent and discriminant validity of the strategy knowledge measure, we conducted regression analyses with scientific self-concept, enjoyment in science, interest in science, content knowledge, fluid intelligence, grades in chemistry, and gender as manifest predictors. In this analysis, strategy knowledge was modeled as a latent construct with a single factor. These analyses were based on the best invariance model of the multigroup CFA.
General comments on latent variable modeling and construct validity. In the present study, we used a latent variable modeling approach. First, measurement models were specified, which comprised a latent factor and manifest indicators of strategy knowledge. Second, the model's generalizability across grade levels was investigated by latent multigroup CFA. Third, the relations among covariates and strategy knowledge were used to check for convergent and discriminant validity. These analyses finally served as tools to investigate the test's construct validity (Campbell & Fiske, 1959; Cronbach & Meehl, 1955). In this regard, latent variables were modeled as representatives of unobservable constructs, which are measured by manifest and observable items. This procedure has the major advantage of correcting and controlling for measurement error (Borsboom, Mellenbergh, & Van Heerden, 2003). Furthermore, the latent regression analyses resulted in more precise path coefficients when using the latent measurement model of strategy knowledge. Brown (2006) also argued that using latent models reduces the dimensionality of data because latent factors represent the common variance shared by manifest indicators.

Results
In this section, we first present the psychometric properties and descriptive statistics of the tests on strategy knowledge and covariates. Second, we analyze the internal structure of strategy knowledge by means of CFA. Based on the outcomes of these analyses, we check for measurement invariance to justify comparisons across grade levels. Finally, latent regression models are applied with the aim of obtaining information on construct validity of the strategy knowledge measure.

Descriptive Statistics
Strategy knowledge. As shown in Table 1, the test on strategy knowledge revealed an internal consistency of α = .77 for the entire sample with item-to-total correlations between .16 and .52. Furthermore, the strategy knowledge scale showed a slight ceiling effect and yielded a mean sum score of 16.29 (SD = 2.93) for the 20 items and the entire sample. The internal consistencies were sufficient and comparable with previous studies on metacognitive knowledge in science (e.g., Liu, 2010; Thillmann, 2007). Further descriptives of the scales' sum scores are presented in Table 1. Differentiating between grade levels led to acceptable reliabilities between .76 and .81 for the test, which can be regarded as sufficient at this stage of the test development (Liu, 2010).
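The internal consistency reported above is Cronbach's α, which can be computed from complete item-level data as sketched below. The function name and the small data set are hypothetical and serve only to show the computation; they are not data from the study.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for complete cases.

    `item_scores` is a list of respondents, each a list of k item scores.
    Uses sample variances of the items and of the sum score.
    """
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical scores of four respondents on three items
data = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
alpha = cronbach_alpha(data)
print(f"alpha = {alpha:.2f}")
```

The coefficient increases when the items covary strongly relative to their individual variances, which is why low item-to-total correlations for single items tend to depress it.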

Fluid intelligence, content knowledge, and grades in chemistry.
Using the unidimensional Rasch model, we found a sufficient reliability of the test on fluid intelligence (WLE reliability = .90) and a lower value for the test on content knowledge (WLE reliability = .61). However, the reliability of the knowledge test is still defensible for a test that measures knowledge in different content areas (Kalyuga, 2006). For both tests, the mean of students' ability scores was set to zero. The students' grades in chemistry were moderate (M = 2.65, SD = 0.89, Median = 3.00, Minimum = 1.00, Maximum = 5.00).

Measurement Model of Strategy Knowledge: CFA
To check the structure of strategy knowledge, we established a measurement model with one latent factor for the data of the combined sample. The resulting model is shown in Figure 3. This model revealed an acceptable goodness of fit and represented the data sufficiently (see Table 2).
We then analyzed whether or not this model held across grade levels by conducting the analysis in each grade level separately. Similar to the combined sample, reasonable and acceptable model fits resulted for Grades 8 and 12 (Table 2).
However, goodness-of-fit statistics were poor for Grade Level 10. This might have resulted from less homogeneous response patterns in the strategy knowledge items in Grade 10. Because the model can be applied to the combined sample and to Grades 8 and 12, we continued with further analyses. Lakin (2012) simplified the statistical prerequisites of the baseline model by arguing that the measurement model must reveal an acceptable goodness of fit for each subgroup in terms of plausibility; on this basis, we conclude that the proposed model of strategy knowledge could be accepted as a baseline model.

Measurement Invariance: Multigroup CFA
Having established a baseline model for each grade level and the combined sample, we analyzed different levels of measurement invariance across grades. First, we checked for configural invariance by applying a multigroup CFA model.
In this analysis, all factor loadings, residuals, and intercepts were freely estimated. The resulting values of the fully standardized factor loadings across grades are shown in Table 3.
As shown in Table 3, the loadings of the z-standardized indicators followed the same pattern in each grade level. The highest loadings on the latent factor were found for items SK01 and SK04. Taken together, the configural invariance model revealed an acceptable fit with similar patterns of factor loadings across grades. Given that configural invariance was supported, we further constrained the factor loadings to equality across grades (metric invariance). The resulting model revealed acceptable fit statistics and was, thus, accepted. We then compared the two nested models of configural and metric invariance by conducting a ΔSB-χ² difference test and found that the metric model was empirically preferred (Table 4). This indicated that the items had equal salience in Grade Levels 8, 10, and 12. Taken together, the factorial structure of the unidimensional measurement model held across grades.
As a further step, scalar invariance was tested. Again, the model revealed a remarkable goodness of fit and sufficiently represented the data (Table 2). Compared with the previous model, the scalar model was empirically preferred (Table 4). Taken together, the measurement model of strategy knowledge, assuming one latent factor, could be used to compare group means across grades. Finally, we applied the most restrictive model of strict invariance by constraining residuals to equality across grades. In this model, the assumption was made that the measurement of the underlying construct was not biased and revealed the same reliability and accuracy in each grade level. The resulting fit indexes were poor (Table 2) and indicated that the model was empirically not preferred (Table 4). In light of these results, the scalar model was accepted as the final model. Given the scalar invariance of the unidimensional model, we were able to compare means, variances, and covariances of students' performance across grade levels (Byrne & Stewart, 2006; Lakin, 2012).

Table note (descriptive statistics). These values are based on the raw scores before the z transformation. Only complete data sets were included in the analysis of internal consistencies (Cronbach's α).

Comparing Latent Means Across Grade Levels
As another step in evaluating the effects of grade level on strategy knowledge, we analyzed the differences in latent means. In these analyses, we used the scalar model and constrained latent means of one grade level as a reference to zero (Byrne & Stewart, 2006; Van de Schoot, Lugtig, & Hox, 2012). By setting Grade 8 as a reference group, we were able to compare the means of Grades 8 and 10 as well as 8 and 12. The second analysis with Grade 10 as a reference was necessary to compare Grade Levels 10 and 12. Moreover, the resulting measurement model with constrained means showed a very good model fit and was, therefore, accepted, SB-χ²(47, N = 1,177) = 65.22, p = .17, CFI = .97, RMSEA = .02, p(RMSEA ≤ .05) = .99, SRMR = .04. To evaluate the practical importance of these differences, we computed Hedges's g and transformed the resulting value into the standardized effect size of r (Steinmetz, Schmidt, Tina-Booh, Wieczorek, & Schwartz, 2009). The results are shown in Table 5.
There were significant differences between Grade Levels 8 and 10, 8 and 12, and 10 and 12 favoring students of higher grades. Effect sizes (r) were small to moderate and ranged between .16 and .42. The differences in latent means between Grades 10 and 12 were smaller than for Grades 8 and 10 or 8 and 12, respectively. Taken together, our data suggested an increase in strategy knowledge across grades.
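The effect size computation mentioned above can be illustrated with a short sketch. It assumes the standard small-sample correction for Hedges's g and the usual conversion of a standardized mean difference to r; the computation on latent means in the article may differ in detail, and the function names and example values are hypothetical.

```python
import math


def hedges_g(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference (group 2 minus group 1) with
    the small-sample bias correction."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    d = (mean2 - mean1) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


def g_to_r(g, n1, n2):
    """Convert a standardized mean difference to the effect size r.

    The term a adjusts for unequal group sizes and equals 4 when n1 == n2.
    """
    a = (n1 + n2) ** 2 / (n1 * n2)
    return g / math.sqrt(g**2 + a)


# Hypothetical example: reference group fixed at 0, comparison group at 0.5.
g = hedges_g(mean1=0.0, mean2=0.5, sd1=1.0, sd2=1.0, n1=400, n2=380)
print(f"g = {g:.3f}, r = {g_to_r(g, 400, 380):.3f}")
```

A positive value indicates that the comparison group (here, the higher grade) scored above the reference group.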

Effects of Covariates on Strategy Knowledge
The analysis of convergent and discriminant validity was conducted by introducing covariates of strategy knowledge to the scalar model for the entire sample. The resulting model represented the data with an acceptable goodness of fit and explained 39.9% of variance (Table 6). The effect size for the latent factor was medium (f² = .664).
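The reported effect size follows from the explained variance if Cohen's f² is computed as R² / (1 − R²). The short sketch below reproduces this arithmetic; the function name is an assumption introduced for illustration.

```python
def cohens_f2(r_squared):
    """Cohen's f-squared for a given proportion of explained variance."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must lie in the interval [0, 1)")
    return r_squared / (1 - r_squared)


# 39.9% explained variance, as reported for the covariate model.
print(round(cohens_f2(0.399), 3))  # 0.664
```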
Based on this regression model, we found that content knowledge, fluid intelligence, and grades in chemistry were significant predictors. Further covariates such as gender, interest in science, scientific self-concept, and enjoyment in science did not contribute to the variance explanation in strategy knowledge. In sum, the data suggested that performance-based predictors significantly affected strategy knowledge, whereas personality-related constructs did not show effects.

Note. 90% CI = 90% confidence interval of the RMSEA. The Satorra-Bentler corrected χ² values (SB-χ²) were based on the robust maximum likelihood estimator (Bryant & Satorra, 2012). In this analysis, 5 observations had to be excluded due to an unreasonably high number of missing values.

Discussion
To compare students' performance across subpopulations, the present study investigated whether a unidimensional measurement model of strategy knowledge held across grade levels. In these analyses, the concept of measurement invariance was applied. Furthermore, the relationships between scientific strategy knowledge and covariates were analyzed.

Assessment of Strategy Knowledge
We were able to develop a test on strategy knowledge for the domain of chemistry with sufficient internal consistencies across grades. The values of Cronbach's α for the overall scale were above .70 for each grade and could, thus, be regarded as sufficient at this stage of research. For example, Thillmann (2007) found values between .58 and .85, and Artelt et al. (2009) between .56 and .81 for different domains. As Liu (2010) and Muis, Bendixen, and Haerle (2006) discussed in their reviews, these values are common for the assessment of metacognitive knowledge. Although factor loadings and item-to-total correlations were comparatively low for some items, our findings broadly align with previous test statistics derived from the PISA assessments of strategy knowledge in reading (α = .84, r_it = .21-.57) and mathematics (α = .78, r_it = .03-.45) (Klieme et al., 2010). However, at this stage of the test development, we cannot clarify whether or not this was an artifact of statistical or methodological issues (e.g., item parceling). Further studies have to be conducted to improve the measurement accuracy of the strategy knowledge scale. We further discuss these results in light of the difficulty of measuring constructs of implicit metacognition (Flavell et al., 2002; Muis et al., 2006; Reber, 1993). Within implicit tasks, many processes are involved that cannot be assessed by explicit answers on specific tasks. These processes often relate to constructs such as content knowledge (De Jong & Ferguson-Hessler, 1996), epistemological views of science (Liu, 2010; Tsai, Ho, Liang, & Lin, 2011), reading ability (Artelt, Schiefele, & Schneider, 2001), and modes of knowledge representation (Alibali, Phillips, & Fischer, 2009; Dienes & Perner, 1999). The resulting effects on the strategy knowledge scale could have led to weaker statistical properties and interfered with students' performance.
In addition, students' views of scientific knowledge and methods play an important role while evaluating strategies according to their goodness of fit (Liu, 2010).
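The internal consistency values discussed above refer to Cronbach's α. As a minimal sketch of how this coefficient is obtained from complete item-level data, the following function applies the standard formula; the function name and the example scores are hypothetical and not taken from the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete cases.

    rows: one list of item scores per respondent, all of equal length.
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != k for row in rows):
        raise ValueError("every respondent needs a score on every item")
    # Sum of the sample variances of the individual items.
    item_var_sum = sum(variance(column) for column in zip(*rows))
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical scores of five respondents on four items.
scores = [
    [1, 2, 2, 1],
    [2, 2, 3, 2],
    [3, 3, 3, 2],
    [3, 4, 4, 3],
    [4, 4, 5, 4],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```

Because the formula relies on item and total-score variances, respondents with missing item scores have to be excluded or handled beforehand, which is consistent with restricting the analysis to complete data sets.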
The analysis of the structure of the strategy knowledge test was based on a single latent factor, which was measured by implicit and explicit tasks. However, there might be further dimensions of strategy knowledge. As Leutwyler (2009) and Phakiti (2008) suggested, the construct of strategy knowledge could show a higher order structure with different factors across domains and facets of the construct. In our study, however, we did not specify a multidimensional structure because we hypothesized only one latent trait in a single domain.
At this stage of the proposed research, we are not able to empirically clarify whether the construct of strategy knowledge can be regarded as domain-specific or domain-general. Together with Kaberman and Dori (2009), Lingel, Neuenhaus, Artelt, and Schneider (2010), and Neuenhaus et al. (2011), we argue that there is a domain-specific component. To address this issue, further research should analyze the structure of the construct across further domains. In our study, we conceptualized strategy knowledge as a construct, specifically defined and assessed for chemistry. This assumption has been supported by the significant correlations with students' grades in chemistry, but still requires further investigation. Certainly, further studies on our tasks need to be conducted to validate the test measure and improve its quality. In addition, qualitative approaches could clarify further cognitive and metacognitive processes which are closely related to strategy knowledge. Moreover, it would also be interesting to investigate the relationships with further aspects of metacognition, as proposed by Efklides and Vlachopoulos (2012).
Our study provided a test on strategy knowledge which can be regarded as objective, reliable, and valid according to the assumptions of classical test theory. Moreover, we combined research on knowledge in scientific problem solving with the theory of declarative metacognition for a specific domain. It, therefore, contributes to the field of research on metacognition in educational settings.

Measurement Invariance Across Grades
In our study, the results revealed scalar invariance of the measurement model of strategy knowledge across grades. Hence, as the construct holds across grades, we argue that the strategy knowledge test could be used to assess developmental changes of strategy knowledge with increasing grade level (Hildebrandt et al., 2009). From a methodological perspective, we also note that multigroup CFAs are appropriate models for assessing invariance and legitimize meaningful comparisons of means, variances, and covariances (Sass, 2011; Van de Schoot et al., 2012). In light of the current discussion on how to model learning progressions, we further claim that these procedures gain importance in educational research because they provide evidence on whether or not there is a construct shift over time and, thus, contribute to the construct validity of test measures (Köller, & Parchmann, 2012). Our study, therefore, provided an example of how to handle multigroup data to assess differences across grade levels.

Latent Means Across Grades
Our analyses also revealed a significant effect of grade level on students' performance in strategy knowledge favoring upper grades. This finding confirms the results of a study conducted by Short, Schatschneider, and Friebert (1993), who found that there is a development of strategy knowledge with increasing grade level in math. In addition, our finding aligns with previous longitudinal studies in other domains (Neuenhaus et al., 2011; Schneider, 2010). In light of these results, we argue that students grow in their cognitive awareness of problem-solving strategies.
Our results also show that the effects are smaller for the differences between Grade Levels 10 and 12. This might be due to the heterogeneity of the upper secondary level. In this grade level, students attended basic and advanced courses, which differed in their curricula and could be voluntarily chosen by the students. In addition, students of adjacent age groups also took part in these courses, leading to a more heterogeneous subsample of the upper secondary level. Therefore, grade level differences of one year occurred within this subgroup of participants. Furthermore, we note that our sample was derived from the German high school and was, therefore, quite selective. The effects of grade level on strategy knowledge might be different across school types or federal states. In addition, the selectiveness of the German school system might have influenced the grade level differences. For example, students of the upper secondary level were educated for the transition to the tertiary level, whereas students of the lower secondary level had to take an obligatory education in all subjects. In light of this selectivity, the effect sizes of the differences between students at Grade Level 10 and the upper secondary level could be influenced by further effects of the school system.

Effects of Covariates on Strategy Knowledge
All analyses of the relationships between strategy knowledge and covariates revealed correlations below .50 and, thus, indicated that the construct can be distinguished from its covariates. Within the regression models, content knowledge significantly predicted strategy knowledge across grade levels. This finding supports the results of a study conducted by Cromley and Azevedo (2011), who found that the application of domain-specific strategies is determined by students' background knowledge. It is also in line with the findings of H. Kim and Pedersen (2011). We conclude that meaningful comparisons of strategies that refer to scientific problems or hypotheses require a certain level of content knowledge that can be activated. Furthermore, the construct was strongly affected by fluid intelligence. In light of theories on knowledge application and metacognition (De Jong & Ferguson-Hessler, 1996; Flavell, 1979; Neuenhaus et al., 2011), intelligence is a crucial factor that determines the application of (metacognitive) knowledge. Thillmann (2007) supported this argument for the domain of physics. It seems apparent that content knowledge and intelligence are important predictors. In addition, students' grade in chemistry showed significant effects, whereas motivational constructs did not contribute to strategy knowledge. The latter finding indicates that our measure of strategy knowledge is more determined by cognitive factors (Scherer & Tiemann, 2012). In the context of construct validity, convergent validity was present for cognitive constructs and discriminant validity was found for motivational constructs. Taken together, the regression coefficients suggested that strategy knowledge is strongly determined by performance-based variables.
In light of this discussion, we also note that strategy knowledge is quite difficult to assess because it interferes with epistemologies, students' views of strategies, and content knowledge. To understand these processes more deeply, further analyses are necessary. Taken together, our findings align with previous research on strategy knowledge (Schneider, 2008, 2010) and lead to the conclusion that our test measures a construct which requires not only intelligence and content knowledge but also further knowledge and skills.

Limitations of the Present Study
Although our data supported the theoretical assumptions and models of strategy knowledge and provided evidence on the differences of the construct across grades, this study has a number of limitations that warrant discussion. First, our results revealed that a unidimensional structure of strategy knowledge was present. However, this result should be interpreted with caution because our sample did not represent the entire German school system across all grade levels. As Rost (2009) argued, the dimensionality of a construct could be due to the selectivity of the sample. Second, further studies across all grade levels are necessary to generalize our findings. A broader range of age should be taken into account in the modeling process, as this might lead to a greater variation in performance, and to more precise measurement models.
Third, further covariates should be taken into account to obtain more precise information on construct validity of the strategy knowledge measure. It might be worth analyzing the effects of domain-specific competencies and domain-general metacognitive abilities. Results on these issues would contribute to the discussion of the domain-specificity of the construct. Fourth, due to the cross-sectional character of this study, our conclusions are limited to the comparison of grade levels. Our results are therefore only indicators for the development of strategy knowledge and refer to interindividual changes. Further longitudinal analyses are necessary to identify intraindividual growth patterns and to explain causal effects of covariates such as the degree of problem-based learning scenarios in science lessons.

Conclusion
Taken together, strategy knowledge remains a construct which is quite difficult to assess. In light of our modeling approach, measurement invariance is given. In general, further research on learning progressions or the development of psychological constructs should take into account invariance as a prerequisite for analyzing differences between groups. However, given the findings of Artelt et al. (2009) and Welsh and Huizinga (2005), which indicated that strategy knowledge does not necessarily imply a meaningful application of knowledge within problem situations, it would now be interesting to examine the relationship between strategy knowledge and its application in problem-solving scenarios. Initial small-scale studies revealed moderate relationships for physics (Thillmann, 2007), but no effects for chemistry (Scherer & Tiemann, 2012). These analyses could reveal significant implications for educational classroom practice and contribute to new teaching strategies or approaches on test development. For instance, studies on fostering the application of strategies in problem-solving situations could take into account that various processes are involved in strategy knowledge. The proposed assessment approach and its connection to manifest indicators could provide an instructional guide for educational assessment.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.