Researchers in the field of early adolescence who are interested in quantifying environmental influences on a response variable over time often use cluster sampling (i.e., obtaining repeated measures from students nested within classrooms and/or schools) to obtain the needed sample size. The resulting longitudinal data are nested at three levels (e.g., repeated measures [Level 1] collected across participants [Level 2] nested within different schools [Level 3]). A previous publication addressed statistical analysis issues specific to cross-sectional three-level designs. This article expands upon that publication to address topics specific to longitudinal three-level data analyses. Although all analysis examples are demonstrated using SAS, the equivalent SPSS and Mplus syntax scripts, as well as the generated example data and additional supplemental materials, are available online.

Longitudinal research designs and data analysis techniques have become the sine qua non for investigating how and why concepts and constructs change across individuals over time, and longitudinal multilevel modeling (MLM) has quickly become the data analysis method of choice for such research designs. It is well known that sampling repeated measures over time from a number of study participants is an example of longitudinal data nested at two levels (e.g., Hox, 2010; Raudenbush & Bryk, 2002; Singer & Willett, 2003; Snijders & Bosker, 2012), and that such data require techniques such as MLM to accurately quantify and describe the “how” and “why” of individual change. However, developmental researchers in general, and early adolescent researchers in particular, often are interested in the effect of environmental settings on individual change over time.

“Contextualism” is a term that is becoming synonymous with the study of change within particular environments, and “context effects” (e.g., De Leeuw & Kreft, 1986; Hofmann & Gavin, 1998; Raudenbush & Bryk, 2002; Y.-W. B. Wu & Wooldridge, 2005) as a methodological term has been used to describe a class of longitudinal data analysis techniques that quantify the impact of higher level environmental settings (Level 3) on lower level change over time (Level 1) across individuals (Level 2). Although any number of published sources are available to researchers wanting assistance in analyzing change-over-time data (Level 1 and Level 2; for example, see Longford, 1993; Peugh, 2010; Peugh & Enders, 2005; Raudenbush & Bryk, 2002; Singer, 1998; Singer & Willett, 2003; Snijders & Bosker, 2012), researchers seeking aid in analyzing longitudinal three-level data (i.e., repeated response variable measurements over time, Level 1; sampled from numerous individuals, Level 2; nested within organizational structures, for example, classrooms or schools, Level 3) currently have few such resources available. The purpose of this article is to demonstrate the step-by-step procedures needed to accurately analyze, present, and discuss the results of longitudinal three-level analyses using a linear model emphasis and demonstrational style similar to that found in published two-level (e.g., Peugh, 2010; Peugh & Enders, 2005; Singer, 1998) and three-level cross-sectional (e.g., Peugh, 2014) manuscripts.

One of the often-cited (e.g., Hox, 2010; Singer & Willett, 2003) advantages of an MLM approach to longitudinal data analysis is its ability to handle both unstructured and unbalanced data. Longitudinal data are balanced if the repeated measures response data are collected on the same schedule for all individuals, regardless of whether the intervals between adjacent data collection occasions are equally (e.g., every 3 months for all participants) or unequally (e.g., 3 months for Person A, 3.75 months for Person B) spaced. In most applied longitudinal data collection scenarios, however, individuals differ in their data collection occasions over time and the data are likely unbalanced (i.e., individuals are assessed on different schedules). Furthermore, if all individuals are assessed on the same number of occasions (assuming no attrition/dropout) over time, the data are said to be structured, but in applied research settings, dropout/attrition likely results in participants having an unequal number of assessment occasions over time, or unstructured data. The simulated data used here are both balanced (all participants were assessed on the same dates) and structured (all participants have T = 4 response variable measurements over time). However, the procedures illustrated here apply to unbalanced and unstructured data as well.

This article will use simulated data based on a selected subset of the Hawaiian Department of Education database (e.g., Heck, 2000, 2007; Heck & Hallinger, 2009) that contains response variable data collected at T = 4 yearly time points (Level 1) between 1994 and 1997 from I = 2,925 students (Level 2) nested within J = 50 schools (Level 3) across the Hawaiian Islands. A longitudinal three-level analysis model will be used to illustrate the steps needed to answer the following research questions:

  • Research Question 1: What longitudinal trend best quantifies mathematics achievement score change over time across participants, on average?

  • Research Question 2: What effects do gender and the percentage of students within a school receiving free/reduced lunch have on mathematics achievement scores?

Consistent with the two research questions posed above, this article will (a) illustrate the need to define and center the metric of time (Level 1) used to observe changes in mathematics achievement response variable scores over time; (b) show how to determine what combination of linear and non-linear functions of time best model changes in mathematics achievement scores over time, and show how to test whether those time functions vary significantly across individuals (Level 2) and across schools (Level 3); (c) discuss why testing for an alternative residual covariance structure at Level 1 is a prudent step in longitudinal MLM; (d) elaborate upon why binary covariates should be centered prior to testing the effects of a Level 2 covariate, such as gender, on mathematics achievement scores; (e) demonstrate how to determine if the effect of gender on mathematics achievement scores varies significantly across schools (Level 3); and (f) examine the possible effect of a school-level or “contextual” predictor variable, such as the percentage of students within a school receiving free/reduced lunch, on mathematics achievement scores.

To accomplish these goals, a “model-building” approach, involving multiple data analysis “runs” conducted in a sequential and stepwise fashion, will be needed for three reasons. First, repeated measurements at Level 1 are nested within participants at Level 2, who are in turn nested within schools at Level 3. A misspecified model at Level 1 not only results in biased parameter estimates at Level 1, but that bias also migrates to Level 2 and Level 3, resulting in biased parameter estimates at those levels also (e.g., Ferron, Dailey, & Yi, 2002; Kwok, West, & Green, 2007). Second, a well-known and well-documented difficulty with MLMs is parameter estimation non-convergence due to an excessive number of variance (i.e., random effect) estimates in a proposed model. A model-building approach helps ensure that only those random effect estimates essential to answering the research question are included in the MLM. Third, a model-building approach that increases the likelihood that each level-specific model is properly specified, while estimating only the random effects needed to answer the research question, also increases the possibility that the resulting parameter estimates can be easily interpreted in the context of the original research question.1 Finally, all of the procedures mentioned above will be illustrated here using SAS, but the equivalent SPSS and Mplus (Muthén & Muthén, 1998-2015) syntax files for each analysis (where applicable) are provided as additional online materials at http://jea.sagepub.com/supplemental.

Three sources of response variable variation arise from longitudinal and multilevel sampling schemes such as the one used to collect the Hawaiian Department of Education mathematics achievement data: (a) variation in the response variable scores that occurs across all (T = 4) repeated measurements (Level 1) collected from all i individuals nested within j schools, (b) variation in the average response variable scores for all (I = 2,925) students (Level 2) nested within j schools, and (c) variation in the mean response variable scores across all (J = 50) schools (Level 3).2 As such, it would be reasonable for researchers to pose two questions. The first question would involve the proportion of overall response variable (math achievement) variance present at each of the three levels. The second question would involve the magnitude of math achievement response variable relationships (or non-independence) at various levels, stemming from the fact that the (T = 4) repeated measures within i individual students, and the average response variable scores between i individual students within the same j school, are both correlated. As will be shown below, both questions can be answered via straightforward computations using the parameter estimates obtained from an unconditional (i.e., without predictor variables) three-level model (e.g., see Hox, 2010).

Partitioning and quantifying the response variable variance that occurs at all three levels can be a reasonable first data analysis step because many three-level longitudinal research questions involve quantifying the influence of an environmental context (Level 3) on average response variable change over time (Levels 1 and 2). The level-specific linear model equations used by some statistical analysis packages (e.g., Mplus, HLM, etc.) to specify the example three-level model analyses presented here are shown in Appendix A. The “combined” linear model equations that result from substituting the Level 3 equation into the Level 2 equation, and substituting the result into the Level 1 equation, used by SAS and SPSS for model specification will be presented here. The combined linear model equation that specifies an unconditional (sometimes referred to as an empty or null) three-level model is

Y_{tij} = \gamma_{000} + u_{00j} + r_{0ij} + \varepsilon_{tij}    (1)

In Equation 1, the math achievement score (Y) at time t for participant i enrolled in school j is modeled as a linear combination of a grand mean math achievement score (γ000) averaged across all repeated measures for all students within all schools, plus a random effect (or variance) estimate at Level 3 that quantifies the variation in mean math achievement scores for all j schools as they deviate from the sample mean math achievement score (u00j; VAR[u00j] = τβ000), plus a random effect estimate at Level 2 that quantifies the variation in student i’s mean math achievement score as it deviates from the school j-specific mean math achievement score (r0ij; VAR[r0ij] = τπ000), plus a third random effect at Level 1 (i.e., residual variance) that quantifies the variation in repeated math achievement scores collected across T = 4 time points as they deviate around each i individual’s (within school j) mean math achievement score (εtij; VAR[εtij] = σ2).3 The SAS syntax needed to estimate the unconditional three-level linear model shown in Equation 1 is

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
   CLASS PARTICIPANT_ID SCHOOL_ID;
   MODEL MATH_DV = /SOLUTION DDFM=KENWARDROGER NOTEST;
   RANDOM INTERCEPT /SUBJECT=PARTICIPANT_ID(SCHOOL_ID);
   RANDOM INTERCEPT /SUBJECT=SCHOOL_ID;
RUN;

In the interest of space, a brief point-by-point review of the SAS PROC MIXED commands shown above and used for all analysis examples is presented in Appendix B (e.g., see Singer, 1998; Verbeke & Molenberghs, 2009).

Results from the unconditional model specified in both Equation 1 and the above SAS syntax are shown in the second column of Table 1. The sample mean math achievement score was MATH_DV: γ000 = 651.46. Random effects estimates (shown as covariance parameter estimates in SAS and SPSS output) showed the following math achievement score variance estimates: σ2 = 1,571.72 at Level 1, τπ000 = 490.37 at Level 2, and τβ000 = 168.06 at Level 3. Now that the math achievement score variance has been partitioned at all three levels, it can easily be shown that 70% of math achievement variation (1,571.72 / [1,571.72 + 490.37 + 168.06] = 0.70) occurs across repeated measurements within i students nested within j schools, 22% of math achievement variation (490.37 / [1,571.72 + 490.37 + 168.06] = 0.22) occurs across the mean math achievement scores for i students nested within j schools, and 8% of math achievement variation (168.06 / [1,571.72 + 490.37 + 168.06] = 0.08) occurs across the mean math achievement scores for all J = 50 schools (see “Folder_2_Response_Variable_ICC”). Prior to demonstrating how these partitioned variances can be used to compute additional indices, a word of caution is warranted. Notice in Table 1 that even though the three random effect estimates (σ2, τπ000, and τβ000) are listed with their respective standard errors, no indication of statistical significance is given. It is well known (e.g., Hox, 2010; Snijders & Bosker, 2012) that significance tests of variance estimates for random effects increase the likelihood of committing Type 1 inference errors because the sampling distribution for variance estimates is skewed in the population (e.g., Hox, 2010). However, for pedagogical purposes, we will include their respective interpretations for this and all other models presented, but we will also present the correct procedure for testing models that differ in random effect estimates later in our model-building discussion (see “Model Comparisons” section below).
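For readers who wish to verify these proportions, the arithmetic can be reproduced directly from the covariance parameter estimates reported above. The SAS DATA step below is a minimal sketch of that computation; the dataset and variable names are arbitrary placeholders, and the values are simply the estimates from Table 1.

DATA VARIANCE_PARTITION;
   SIGMA2   = 1571.72;    /* Level 1 residual variance              */
   TAU_PI   = 490.37;     /* Level 2 (student) intercept variance   */
   TAU_BETA = 168.06;     /* Level 3 (school) intercept variance    */
   TOTAL    = SIGMA2 + TAU_PI + TAU_BETA;
   PROP_L1  = SIGMA2   / TOTAL;   /* = .70, within-student variation  */
   PROP_L2  = TAU_PI   / TOTAL;   /* = .22, between-student variation */
   PROP_L3  = TAU_BETA / TOTAL;   /* = .08, between-school variation  */
RUN;
PROC PRINT DATA=VARIANCE_PARTITION; RUN;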

Table 1. Modeling the Effects of Centered Decimal AGE (Agetij − 12) on Mathematics Achievement at Level 1.

After partitioning response variable variance, the question of how much response variable variance is shared, or correlated, across different combinations of levels could be asked. The intraclass correlation (ICC) in MLM analyses has been defined as “. . . an estimate of the expected (population) correlation between two randomly chosen elements in the same group” (Hox, 2010, p. 34). As stated previously, one purpose of longitudinal three-level modeling is to assess environmental (Level 3) influences on individual average (Level 2) change over time (Level 1). As such, three different ICCs could be calculated to assess the influence of school on change in math achievement over time. First, as shown above in Equation 1 and in the second column of Table 1, the proportion of total response variable variation that occurs between math achievement means across J schools was estimated as (168.06 / [1,571.72 + 490.37 + 168.06]) = 0.08. Consistent with the above definition, this Level 3 ICC estimate is interpreted as the expected correlation between two math achievement scores drawn completely at random (from any of the T time points) from two students within the same school. Second, an alternative Level 3 ICC estimate can be calculated as (168.06 / [168.06 + 490.37]) = 0.26 and interpreted as the expected correlation between the mean (i.e., averaged across T = 4 repeated measurements) response variable scores of two students drawn completely at random from the same school (e.g., see Hoffman, 2015; Snijders & Bosker, 2012). Finally, a Level 2 ICC could also be calculated as ([490.37 + 168.06] / [1,571.72 + 490.37 + 168.06]) = 0.30 and is interpreted as the expected correlation between two repeated measurements sampled from the same student. Readers interested in additional information regarding ICCs can consult several sources (Davis & Scott, 1995; Hedges & Hedberg, 2007; Hoffman, 2015; Siddiqui, Hedeker, Flay, & Hu, 1996; Snijders & Bosker, 2012).
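The two additional ICCs described above can be computed from the same covariance parameter estimates (the first, total-variance Level 3 ICC is identical to PROP_L3 in the previous sketch). The short DATA step below illustrates the arithmetic; again, all names are arbitrary.

DATA ICC_ESTIMATES;
   SIGMA2 = 1571.72;  TAU_PI = 490.37;  TAU_BETA = 168.06;
   /* expected correlation between the T-averaged scores of two students in the same school */
   ICC_L3_MEANS = TAU_BETA / (TAU_BETA + TAU_PI);                        /* = .26 */
   /* expected correlation between two repeated measurements from the same student */
   ICC_L2 = (TAU_PI + TAU_BETA) / (SIGMA2 + TAU_PI + TAU_BETA);          /* = .30 */
RUN;
PROC PRINT DATA=ICC_ESTIMATES; RUN;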

Some of the most useful, if somewhat under-utilized, techniques available to researchers conducting longitudinal analyses are graphs and plots that allow researchers to visualize the average mathematics achievement score (across students within schools) change over the course of a longitudinal study, as well as obtain a sense of the overall variation in mathematics achievement scores across students within schools at each assessment. Students in the Hawaiian Department of Education simulated dataset were all assessed at T = 4 time points, so the mean mathematics achievement values (across students within schools) at each of the four time points, along with their respective 95% confidence intervals, can be plotted graphically as error bars (by using integer time points 1 to 4 to represent the repeated assessment occasions on the x axis), as shown in Figure 1. Not surprisingly, mathematics achievement scores increase, on average, across the four time points, and variability in the mean scores across the four time points remains relatively constant. However, the question to address is how to best model and quantify this mathematics achievement score change over time. Recall that a trend analysis within a repeated measures ANOVA design involving p repeated measurements could require as many as p − 1 polynomial trend components to model the trend in the means accurately. As such, T − 1 = 3 possible fixed effect polynomial trend components (linear, quadratic, and cubic) could be needed to capture and model the mathematics achievement change over time shown in Figure 1 (e.g., Singer & Willett, 2003). So the more specific question for researchers becomes whether linear-only, linear and quadratic, or linear, quadratic, and cubic polynomial components for time would be needed to model change accurately and parsimoniously. A quick glance at the figure, for example, suggests that a linear-only trend is likely to capture the change in individuals’ scores optimally over time. Furthermore, it is not difficult to imagine that the change in mathematics achievement for each individual i student, as well as for each j school, could vary randomly around the average change trajectory specified by the needed polynomial trend components (i.e., linear, quadratic, etc.). Specifically, with T = 4 repeated measurements, T − 2 random slope effects are possible (Snijders & Bosker, 2012), meaning that significant linear change variance or significant linear and quadratic change (if needed) variance could be present in the mathematics achievement trajectories across both the i = 2,925 students at Level 2 and the j = 50 schools at Level 3. The procedures needed to arrive at the most parsimonious model, defined as a model with the minimum number of fixed and random effects needed to capture mathematics achievement score changes accurately over time across all students within all schools, are shown in subsequent sections. However, before beginning the process of modeling mathematics achievement change over time, a discussion of the most appropriate metric of time necessary to clock such changes is needed.



Figure 1. Error bar graph of mathematics achievement change over time.
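A figure like Figure 1 can be produced in SAS with a summary step followed by a plotting step. The sketch below assumes a long-format dataset (here called HAWAII_LONG, an arbitrary name) containing the integer assessment indicator WAVE and the response MATH_DV; it computes the mean and its 95% confidence limits at each wave and plots them as error bars.

PROC MEANS DATA=HAWAII_LONG NOPRINT NWAY;
   CLASS WAVE;
   VAR MATH_DV;
   OUTPUT OUT=MATH_MEANS MEAN=MEAN_MATH LCLM=LOWER_CL UCLM=UPPER_CL;
RUN;

PROC SGPLOT DATA=MATH_MEANS;
   SCATTER X=WAVE Y=MEAN_MATH / YERRORLOWER=LOWER_CL YERRORUPPER=UPPER_CL;
   SERIES  X=WAVE Y=MEAN_MATH;
RUN;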

Arguably, the most important decisions to be made in a longitudinal data analysis situation are the choices of the metric of time used to clock response variable changes, and how the chosen time metric is to be centered. The means, variances, and covariances of the observed repeated measures, as well as several parameter estimates (i.e., intercept fixed effect, intercept random effect, and the intercept-slope covariance), all depend upon the chosen time metric (Biesanz, Deeb-Sossa, Papadakis, Bollen, & Curran, 2004; Mehta & West, 2000). The simulated Hawaii Department of Education data contain each participant’s birthdate as well as the dates on which each of the four longitudinal assessments occurred for each participant (see “Folder_1_Database Setup”). With both birthdate and date of assessment, several chronological functions of participant age (in days, weeks, months, or years) could be calculated. For illustration purposes, it is assumed that age in years is the most meaningful chronological metric with which to clock math achievement changes.

Once the metric of time, such as age in years, is chosen, the next immediate step to consider is how age should be centered. Centering predictors allows a meaningful interpretation for the intercept as the value of the response variable when the value of each predictor in the model is at zero. The need for centering becomes clear when we consider the substantively and theoretically dubious interpretation of the intercept (γ000) that would result if age in years were entered into the analysis model as a Level 1 predictor in un-centered form: the expected mathematics achievement score for an individual at age 0 (birth?). Although a complete discussion of all predictor variable centering options (and their possible exceptions) is beyond the scope of this article, a brief review of centering a Level 1 predictor variable in a longitudinal analysis model is needed.

As has been shown elsewhere, two types of predictor variable centering are used most often in MLM analyses in general and longitudinal analyses specifically: grand mean centering and group mean centering (sometimes referred to as “centering within cluster”; Enders & Tofighi, 2007). In longitudinal analyses, group mean centering of a Level 1 predictor such as age could be accomplished in one of three ways: (a) Age could be centered at each participant’s age at their first assessment time point (AGEtijAGE(t=1)ij), (b) age could be centered at each participant’s average age across the T = 4 assessment time points (AGEtijAGE¯ij), or (c) age could be centered at the average age for the particular j school attended (AGEtijAGE¯j). Statistically, centering AGE at each participant’s age at their first assessment time point, or at each participant’s average age across the T = 4 assessment time points, results in ICCs of zero for AGE at Level 2 and Level 3, meaning that AGE cannot be related, or share variance, with mathematics achievement at Level 2 or Level 3. By extension, centering AGE at the average age for the particular j school attended results in an ICC of zero for AGE at Level 3, meaning also that AGE cannot have a relationship, or share variance, with mathematics achievement at Level 3. Although centering a metric of time such as AGE in any of these ways could be needed based on the research question, it is uncommon for a time metric to be centered in this way.

Grand mean centering of a longitudinal Level 1 predictor such as age could be accomplished in one of two ways: subtracting the mean age for the data sample from the age of each participant (AGEtijAGE¯), or by subtracting a meaningful constant age value from the age of each participant (e.g., AGEtij12years). Statistically, as opposed to the group mean or within cluster centering options shown above, centering AGE at the sample grand mean or a theoretically meaningful constant results in a non-zero ICC for age at all three levels, meaning that a non-zero relationship between AGE and mathematics achievement is possible at all three levels. Although these and other centering options for the metric of time are available, the choice of centering the time metric at Level 1 should be driven by both the research question and a consideration of where in the span of time captured by the longitudinal research design the theoretical concerns occur (but see Kreft, De Leeuw, & Aiken, 1995; Nezlek, 2001; Singer & Willett, 2003, for exceptions and additional possibilities). For illustrative purposes, and assuming both the research question and additional substantive concerns related to mathematics achievement occur at the center of the early adolescent (ages = 10-14) age range, age will be centered at age 12 (i.e., AGEtij12years in the combined model equations, and shown as “C_AGE” in the SAS syntax examples) and added to the multilevel model as a Level 1 predictor for all subsequent analyses.
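Creating the centered age predictor is a simple DATA step operation. The sketch below assumes the long-format data contain a birthdate and an assessment date (BIRTH_DATE and ASSESS_DATE are placeholder names, not variables documented in the example files); YRDIF converts the two dates to decimal age in years, and subtracting 12 yields C_AGE. The squared term used in the quadratic models later in the article is created at the same time.

DATA HAWAII_LONG;
   SET HAWAII_LONG;
   AGE    = YRDIF(BIRTH_DATE, ASSESS_DATE, 'AGE');   /* decimal age in years at each assessment */
   C_AGE  = AGE - 12;                                /* age centered at 12 years                */
   C_AGE2 = C_AGE**2;                                /* quadratic term for later models         */
RUN;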

Recall that with T = 4 repeated measurements, three polynomial fixed effects and two random polynomial slopes may be needed to accurately model math achievement change over time. The key question to keep in mind when building the data analysis model at Level 1 is “How should researchers parsimoniously capture and quantify the average change across individuals in mathematics achievement scores as age increases?” However, a parsimonious model must also be specified correctly to allow for the proper interpretation of the parameter estimates. The steps illustrated here, taken in the interests of building both a parsimonious and an interpretable model, will be guided by three general principles of multilevel model-building.

First, one way of thinking about fixed effects is as averages, or means. For example, the fixed effect estimate for centered AGE (AGEtij − 12 years) quantifies the constant straight line change in mathematics achievement scores per yearly increase in age as an average across all I = 2,925 participants nested within J = 50 schools. Such a fixed effect could also be thought of as the mean or average around which each i individual student’s change in mathematics achievement scores could deviate at Level 2. Similarly, the change over time in mathematics achievement for each j school at Level 3 could also deviate from the average linear change established by the fixed effect estimates. For this reason, fixed effects are added to a multilevel model first to establish an average, followed immediately by adding and testing the corresponding random effects at Level 2, and then at Level 3, respectively, to test for the presence of significant variation in the average trajectory across both i students at Level 2 and across j schools at Level 3. Second, a given polynomial function of age could have a non-significant fixed effect estimate, but a test of the corresponding random effect for that same polynomial function of age could be statistically significant. For example, assume the parameter estimate for a linear fixed effect for age showed a statistically non-significant effect on mathematics achievement scores, but the random effect for linear age was statistically significant. Both the fixed effect and random effect for linear age would remain in the multilevel model because the fixed effect estimate indicates that, on average, linear change in math achievement scores is not different from zero, but the significant random effect shows that individual participant trajectories at Level 2 (and possibly individual school trajectories at Level 3) exhibit notable non-zero straight-line increases or decreases in mathematics achievement scores over time, indicating that the fixed effect is not a representative indication of change across all i individuals in j schools. Finally, a note on the use of polynomial functions of age is needed and is best illustrated by example: If a cubic fixed effect for age, that is, ([AGEtij − 12 years]3), is needed to model change in mathematics achievement over time accurately, all lower order fixed effects, such as the intercept ([AGEtij − 12 years]0), linear age ([AGEtij − 12 years]1), and quadratic age ([AGEtij − 12 years]2), must all be included regardless of their statistical significance because higher-order fixed effects make a unique and additive contribution to modeling mathematics achievement change only if the lower polynomial components are included. In similar fashion, for example, if a random effect for quadratic age is needed, then the random effects for the intercept and linear age must also be included (Hedeker & Gibbons, 2006; Hoffman, 2015; Singer & Willett, 2003). As such, the process of modeling change in mathematics achievement over time begins by first adding the effect of linear age (AGEtij − 12 years) to the previous unconditional model.

The combined linear model equation that adds centered linear age as a Level 1 predictor is

Y_{tij} = \gamma_{000} + \gamma_{100}(\mathrm{AGE}_{tij} - 12) + u_{00j} + r_{0ij} + \varepsilon_{tij}    (2)

and building upon the syntax for the previous unconditional model, the SAS syntax needed to estimate the model in Equation 2 is

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
   CLASS PARTICIPANT_ID SCHOOL_ID;
   MODEL MATH_DV = C_AGE /SOLUTION DDFM=KENWARDROGER NOTEST;
   RANDOM INTERCEPT /SUBJECT=PARTICIPANT_ID(SCHOOL_ID);
   RANDOM INTERCEPT /SUBJECT=SCHOOL_ID;
RUN;

where estimating the fixed effect for centered linear age on mathematics achievement scores is indicated by adding “C_AGE” to the “MODEL” command line.

Results for the model that adds centered linear age as a Level 1 predictor variable are shown in the third column of Table 1. As predicted by linear age, the average mathematics achievement score at age 12 for all I participants nested in J schools was γ000 = 667.36, and the constant expected rate of change in mathematics achievement scores per yearly increase in age beyond 12, on average, for all I participants nested in J schools was γ100 = 27.01 points. Significant variation in mathematics achievement scores remained at all three levels, but two results were noteworthy. The Level 1 residual variance decreased sharply from σ2 = 1,571.72 in the unconditional model to σ2 = 272 after adding centered linear age as a Level 1 predictor. Specifically, adding linear age to the previous unconditional multilevel model decreased the Level 1 residual variance by 83% ([1,571.72 − 272] / 1,571.72 = 0.83). However, mathematics achievement intercept variance at Level 2 and Level 3 both increased from the previous unconditional model (see “Folder_3_Age_Fixed_Effect”). More specifically, at Level 2, variation in individual i students’ mean math achievement scores at age 12 (i.e., where centered age = 0) as predicted by linear age (τπ000) increased from 490.37 to 909.01. At Level 3, variation across all j school means, obtained by averaging the mean math achievement scores for all i students in school j as predicted by linear age (τβ000), increased from 168.06 to 188.65. This result is not surprising and underscores the fact that the three analysis levels are separate, but not independent. As discussed elsewhere (e.g., see Hox, 2010; Snijders & Bosker, 1994) and shown below, initial observed variance estimates at higher levels must be corrected for random variance at lower levels of the multilevel model.

Specifically, let the “true” Level 2 intercept variance in the unconditional model be the sum of the observed Level 2 intercept variance (τπ000 = 490.37) plus the Level 1 time point specific random variation in repeated measures (σ2/T; 1,571.72 / 4 = 392.93), or 490.37 + 392.93 = 883.30, giving τπ000 = 883.30. Furthermore, if decimal age is replaced with an integer indicator of time (e.g., wave = 0, 1, 2, 3, for all participants) as a Level 1 predictor, the dataset is perfectly balanced; all i = 2,925 participants have four repeated measures indicated by an integer value for each assessment. For the repeated occasions, all subjects have the same series of time point measurements, so there is no variation associated with time points across subjects at Level 2. If wave is added to the model as a Level 1 predictor, the Level 2 intercept variance estimate was shown as τπ000 = 815.31 (a notable increase from the observed Level 2 intercept variance in the unconditional model, 490.37), and the Level 1 residual variance was observed to be σ2 = 271.94. However, because there is no between-group variation in the integer values for wave, the full information maximum likelihood (ML) or restricted maximum likelihood (REML) estimator will adjust for the smaller Level 1 variance estimate, but as a result, the between-groups variance component will increase (Hox, 2010). To illustrate, if we define the “observed” Level 2 variance as the difference between the “true” Level 2 intercept variance from the unconditional model (883.30) and the random variation in repeated measurements obtained from the conditional model that used wave (i.e., 271.94 / 4 = 67.99), the observed Level 2 intercept variance from the conditional model that used wave can be reproduced as 883.30 − 67.99 = 815.31. As such, what appears to be an increase in the Level 2 intercept variance as a result of adding a Level 1 predictor to the model is in reality a change in the corrective factor (σ2/T, with T = 4) used to obtain the true sampling variance at Level 1 for the unconditional model as it is applied to the observed Level 2 intercept sampling variance for the conditional model (for additional details, see Hoffman, 2015; Hox, 2010; Snijders & Bosker, 2012). Finally, it is worth noting that students are not independent from the schools within which they are nested, and if the number of students within each school were equivalent across the J = 50 schools (i.e., if the data were balanced at Level 2), we could use similar logic to explain the increase in Level 3 intercept variance from the unconditional model to the model containing centered age as a fixed effect (τβ000 increased from 168.06 to 188.65, respectively; see Hoffman, 2015).
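The correction described above can be verified with a few lines of arithmetic. The DATA step below is a sketch using the estimates reported in the text; it reproduces both the “true” Level 2 intercept variance implied by the unconditional model and the observed Level 2 intercept variance from the wave-based conditional model.

DATA LEVEL2_CORRECTION;
   T = 4;                      /* number of repeated measures per student         */
   SIGMA2_UNCOND = 1571.72;    /* Level 1 residual variance, unconditional model  */
   TAU_PI_UNCOND = 490.37;     /* observed Level 2 intercept variance, uncond.    */
   SIGMA2_WAVE   = 271.94;     /* Level 1 residual variance, wave as predictor    */
   TRUE_TAU_PI = TAU_PI_UNCOND + SIGMA2_UNCOND / T;   /* 490.37 + 392.93 = 883.30 */
   OBS_TAU_PI  = TRUE_TAU_PI - SIGMA2_WAVE / T;       /* 883.30 - 67.99  = 815.31 */
RUN;
PROC PRINT DATA=LEVEL2_CORRECTION; RUN;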

As shown in the previous analysis, mathematics achievement scores increased (γ100 = 27.01) 27.01 points, on average, per yearly increase in age beyond 12 years across all I students in J schools. However, researchers collect repeated measurements data from students (within schools) over time expecting differences in rates of change between students. One way to conceptualize the effect of linear age on mathematics achievement scores at Level 1 varying randomly at Level 2 is to first imagine that each of the i students in one j school could have her or his own linear age slope that quantifies the expected increase in mathematics achievement scores for each year of increase in their age beyond 12 years. If the slopes for all the i students in the specific school j were averaged, the result would be the average linear effect of centered age on mathematics achievement for that j school. As such, the question becomes, “Is there significant variation in the linear age slopes for each of the i students within school j around the mean linear age slope for school j?” Furthermore, if this process were repeated for all J = 50 schools, then the key question regarding variation in the Level 1 effect of linear age on mathematics achievement at Level 2 is “On average, is there significant variation in the age slopes for each of the I = 2,925 students as they deviate around the mean change trajectory for each of the j schools within which the students are nested?”

To answer this question, the combined linear model that allows the effect of centered linear age on mathematics achievement to vary across individual i students at Level 2 (within J schools at Level 3) is

Y_{tij} = \gamma_{000} + \gamma_{100}(\mathrm{AGE}_{tij} - 12) + u_{00j} + r_{0ij} + r_{1ij}(\mathrm{AGE}_{tij} - 12) + \varepsilon_{tij}    (3)

where, compared with the previous model in Equation 2, Equation 3 above contains an additional random effect, or variance, term (r1ij) that represents the effect of centered linear age on mathematics achievement varying randomly across all I individuals nested within their respective J schools. The SAS syntax to estimate the model shown in Equation 3 is

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
   CLASS PARTICIPANT_ID SCHOOL_ID;
   MODEL MATH_DV = C_AGE /SOLUTION DDFM=KENWARDROGER NOTEST;
   RANDOM INTERCEPT C_AGE /SUBJECT=PARTICIPANT_ID(SCHOOL_ID) TYPE=UN;
   RANDOM INTERCEPT /SUBJECT=SCHOOL_ID;
RUN;

where allowing the effect of linear age (C_AGE) to vary randomly across i individuals in j schools is specified by adding “C_AGE” to the “RANDOM” command line with the “SUBJECT=PARTICIPANT_ID(SCHOOL_ID)” specification. Furthermore, adding the “TYPE=UN” specification results in three random effects estimated at Level 2: mathematics achievement variation across students (τπ000) at age 12, variation in the effect of centered age on mathematics achievement across students within schools (τπ110), and the intercept/slope covariance (τπ010), or the shared variation between mathematics achievement at age 12 and the change in mathematics achievement per yearly increase in age for all I students within J schools.

Results for the model specified by Equation 3 and the SAS syntax above are shown in the fourth column of Table 1. The expected mathematics achievement score for an age 12 participant (γ000 = 667.68), as well as the expected linear change in mathematics achievement per year increase in age beyond 12 years (γ100 = 27.08), remained relatively unchanged from the previous analysis. The Level 1 residual variance decreased 36% (σ2 = 174.70; [272 − 174.70] / 272 = 0.36), the mean mathematics achievement variation across schools decreased 5% (Level 3; τβ000 = 178.71; [188.65 − 178.71] / 188.65 = .05), while the variation in mean mathematics achievement scores across age 12 students within schools (Level 2; τπ000 = 974.85) increased from the previous analysis (due to estimating the random effect at Level 2 for the age slope parameter in this model rather than it being fixed to 0 in the previous model). Results further showed that linear increases in mathematics achievement scores differed significantly across all I students within all J schools (VAR[r1ij] = τπ110 = 55.24, p < .001). The fixed effect slope estimate for the effect of linear age on mathematics achievement (γ100 = 27.08), together with the estimate of the slope variance at Level 2 (τπ110 = 55.24), can be used to construct a 95% plausible value range to facilitate interpretation. Specifically, mathematics achievement scores for 95% of students within schools increase linearly, on average, by between 12.52 and 41.64 points (√55.24 = 7.43; 27.08 ± [1.96 × 7.43] = 12.52, 41.64), around a mean increase of 27.08 points per year increase in age beyond 12 years. Furthermore, the intercept/slope covariance estimate (τπ010 = 52.83) was also statistically significant (p < .01). Together with a significant and positive slope for linear age (γ100 = 27.08, p < .01), the covariance estimate can be interpreted as follows: Students within schools having higher mathematics achievement scores at age 12 will tend to show greater increases in mathematics achievement scores per year increase in age beyond 12 years compared with students within schools having lower mathematics achievement scores at age 12 (see “Folder_4_Age_Random_At_Level2”).
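The 95% range reported above combines the fixed linear age slope with the Level 2 slope variance. The sketch below reproduces that computation; substituting the Level 3 slope variance reported in the next analysis gives the school-level range in the same way.

DATA SLOPE_RANGE;
   GAMMA_100 = 27.08;            /* fixed linear age slope                    */
   TAU_PI_11 = 55.24;            /* Level 2 variance of the linear age slope  */
   SD_SLOPE  = SQRT(TAU_PI_11);               /* = 7.43             */
   LOWER = GAMMA_100 - 1.96 * SD_SLOPE;       /* = 12.52 points/year */
   UPPER = GAMMA_100 + 1.96 * SD_SLOPE;       /* = 41.64 points/year */
RUN;
PROC PRINT DATA=SLOPE_RANGE; RUN;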

Just as the effect of linear age on mathematics achievement scores can vary significantly across participants at Level 2, as shown in the previous analysis, the effect of age on mathematics achievement scores could also vary significantly across schools at Level 3. The fixed effect slope for the effect of centered linear age on mathematics achievement (γ100 = 27.08) describes the expected change in mathematics achievement per yearly increase in age beyond 12 years for all I students in J schools. It was shown in the previous section how the mean linear age slope could be obtained for a particular school j, and this process could be repeated such that a school-specific slope for the effect of centered linear age on mathematics achievement scores per yearly increase in age beyond 12 could be obtained for each of the J = 50 schools. The key question regarding variation in the Level 1 effect of linear age on mathematics achievement across schools at Level 3 is “Is there significant variation in the age slopes for each of the J = 50 schools as they deviate around the grand mean age slope (γ100 = 27.08)?”

To answer this question, the linear model that allows the effect of centered age on the change in mathematics achievement scores to vary across schools at Level 3 is

Y_{tij} = \gamma_{000} + \gamma_{100}(\mathrm{AGE}_{tij} - 12) + u_{00j} + u_{10j}(\mathrm{AGE}_{tij} - 12) + r_{0ij} + r_{1ij}(\mathrm{AGE}_{tij} - 12) + \varepsilon_{tij}    (4)

In this analysis, the effect of centered linear age on mathematics achievement scores will be allowed to vary randomly both across individuals at Level 2 (τπ110) and across schools at Level 3 (VAR[u10j]=τβ011). The SAS syntax needed to analyze the model shown in Equation 4 above is

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
   CLASS PARTICIPANT_ID SCHOOL_ID;
   MODEL MATH_DV = C_AGE /SOLUTION DDFM=KENWARDROGER NOTEST;
   RANDOM INTERCEPT C_AGE /SUBJECT=PARTICIPANT_ID(SCHOOL_ID) TYPE=UN;
   RANDOM INTERCEPT C_AGE /SUBJECT=SCHOOL_ID TYPE=UN;
RUN;

where allowing the effect of age on mathematics achievement to vary across schools is specified by adding “C_AGE” to the second “RANDOM” command line with the “SUBJECT=SCHOOL_ID” specification.

The results for the model specified by Equation 4 above, which allows the effect of age on mathematics achievement to vary both across individuals and across schools, are shown in column 5 of Table 1. The fixed effects estimates (γ000 = 667.54; γ100 = 26.91) and the Level 1 residual variance (σ2 = 174.69) remained essentially unchanged from the previous analysis. All of the random effect estimates at Level 2 (τπ000 = 967.43, τπ010 = 44.88, and τπ110 = 46.91) decreased notably from the previous analysis, but all remained statistically significant.

At Level 3, significant intercept variance (τβ000 = 206.72) remained, but results further showed significant variation in the effect of age on mathematics achievement across schools (τβ011 = 8.78). Again, the fixed effect slope estimate for the effect of linear age on mathematics achievement (γ100 = 26.91), together with the estimate of the slope variance at Level 3 (τβ011 = 8.78), can be used to construct a 95% plausible value range to facilitate interpretation. Specifically, mathematics achievement scores for 95% of schools increase linearly, on average, by between 21.11 and 32.71 points (√8.78 = 2.96; 26.91 ± [1.96 × 2.96] = 21.11, 32.71), around a mean increase of 26.91 points (γ100 = 26.91) per year increase in student age beyond 12 years. Furthermore, the intercept/slope covariance estimate at Level 3 (τβ010 = 19.88) was also significant; schools with higher mathematics achievement means for students aged 12 years tend to show steeper increases in mathematics achievement scores per year (see “Folder_5_Age_Random_At_Level2_&_Level3”).

Results from the previous analyses showed significant linear change (or [Agetij − 12]1) in mathematics achievement scores per year increase in age beyond 12 years, and showed that the linear increase varies randomly both across individuals within schools (Level 2) and across schools (Level 3). Consistent with the model-building template described earlier, the next question to consider is whether the addition of a second-order polynomial, or quadratic fixed effect of age (i.e., [Agetij − 12]2; shown as C_AGE2 in syntax), would better model average changes in mathematics achievement scores over time for all I students nested in J schools. A linear model that adds a quadratic function of centered age to the analysis model as a Level 1 predictor variable (fixed effect) is shown as

Y_{tij} = \gamma_{000} + \gamma_{100}(\mathrm{AGE}_{tij} - 12) + \gamma_{200}(\mathrm{AGE}_{tij} - 12)^2 + u_{00j} + u_{10j}(\mathrm{AGE}_{tij} - 12) + r_{0ij} + r_{1ij}(\mathrm{AGE}_{tij} - 12) + \varepsilon_{tij}    (5)

Furthermore, the SAS syntax needed to estimate the model shown in Equation 5 above is

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
   CLASS PARTICIPANT_ID SCHOOL_ID;
   MODEL MATH_DV = C_AGE C_AGE2 /SOLUTION DDFM=KENWARDROGER NOTEST;
   RANDOM INTERCEPT C_AGE /SUBJECT=PARTICIPANT_ID(SCHOOL_ID) TYPE=UN;
   RANDOM INTERCEPT C_AGE /SUBJECT=SCHOOL_ID TYPE=UN;
RUN;

where both the combined model in Equation 5 above (the γ200[AGEtij − 12]2 term) and the SAS syntax above (C_AGE2) differ from the previous model only in the addition of the quadratic age fixed effect term.

The Level 1 residual variance estimate (σ2 = 174.80) increased slightly, while the three random effect estimates at Level 2 (τπ000 = 966.08, τπ010 = 43.57, and τπ110 = 46.70) and at Level 3 (τβ000 = 206.58, τβ010 = 19.75, and τβ011 = 8.77) all decreased slightly following the addition of the quadratic fixed effect. The two fixed effect estimates from previous analyses (γ000 = 667.69; γ100 = 26.71) did change slightly, but both remained statistically significant; 667.69 is the average expected mathematics achievement score for a student at age 12 as predicted by linear and quadratic age, but the interpretation of the effect of linear age on mathematics achievement (γ100 = 26.71) now changes with the inclusion of the quadratic fixed effect (γ200) in the model. The linear effect of centered age, in previous models, was interpreted as the constant rate of expected change in mathematics achievement scores for any yearly increase in age beyond 12 years. However, in the presence of a quadratic fixed effect, the linear effect is now interpreted as the instantaneous rate of expected change at age 12 because, with each yearly increase in age beyond 12 years, the expected rate of change in mathematics achievement scores (γ100 = 26.71) is altered by the quadratic effect.

The key to understanding the effect of quadratic age on mathematics achievement involves interpreting the quadratic change (or curvature; see Singer & Willett, 2003) fixed effect (γ200 = −0.17, p > .05) estimate correctly, and three considerations should be kept in mind. First, it is important to keep in mind that, unlike the interpretation of the linear effect for age given above, the interpretation of the quadratic effect for age does not depend on our choice for centering age. As a general rule, the highest order polynomial fixed effect in a multilevel model can be interpreted without regard to the choice of centering, but the interpretation of all lower order polynomial terms must acknowledge how the Level 1 metric of time (age) was centered (Raudenbush & Bryk, 2002).

Second, it may be instructive to think of the quadratic effect of age not as a polynomial, but as an interaction term. If a between-subjects factorial ANOVA analogy is used to illustrate, then begin by considering main effects A and B, as well as the A × B interaction term, and their effects on the response variable. By definition, the effect of A on the response variable cannot be quantified unless a specific level of B is known (and vice versa), or the main effect of A is moderated by the level of B. If we consider the quadratic effect of centered age on mathematics achievement not as a polynomial, but as an ([Age − 12] × [Age − 12]) interaction term and use the same logic, we interpret that interaction as the main effect of centered linear age on mathematics achievement as moderated by centered linear age. Said differently, centered linear age is considered twice because it serves both as a main effect and a moderator for interpretation purposes. For this reason, the quadratic change fixed effect is doubled when the expected mathematics achievement score change per year increase in age beyond 12 years is considered, or γ100 + [2 × γ200 × T], or 26.71 + [2 × −0.17 × T]. As such, the fixed effect coefficient estimate (γ200 = −0.17) is interpreted as half of the effect of quadratic age on mathematics achievement scores because it reflects only 50% of the slowing down, or deceleration, in predicted mathematics achievement scores per year increase in age beyond 12 years (e.g., see Hoffman, 2015; Raudenbush & Bryk, 2002).

Third, to further illustrate the slowing down of expected mathematics achievement score changes per yearly increase in age, every quadratic function has a point (sometimes referred to as a peak, a trough, or a point of stationarity) along the x axis (centered age) at which the slope of change in expected mathematics achievement scores is zero. This zero slope point can be found by solving the equation −γ100 / [2 × γ200], or −26.71 / [2 × −0.17] = 78.56; recalling that age was centered at 12 years, the point on the quadratic function (or age) at which expected change in mathematics achievement scores slows to a slope of zero is 78.56 + 12 = 90.56 years of age. This result underscores the fact that the quadratic fixed effect of centered age on mathematics achievement was non-significant. However, as stated previously, this fixed effect should temporarily be kept in the multilevel model because the non-significant fixed effect estimate could be accompanied by significant variation at Level 2, Level 3, or both, indicating that students, schools, or both could show significant non-zero variation in their respective quadratic change in mathematics scores, rendering the fixed effect a poor average indicator of actual quadratic changes.
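Both interpretive points, the changing (instantaneous) yearly rate of change and the age at which that rate would reach zero, can be illustrated with the fixed effect estimates from this model. The DATA step below is a sketch of that arithmetic.

DATA QUADRATIC_TRAJECTORY;
   GAMMA_100 = 26.71;      /* instantaneous linear age effect at age 12   */
   GAMMA_200 = -0.17;      /* quadratic age effect (nonsignificant here)  */
   /* age at which the expected rate of change reaches zero */
   ZERO_SLOPE_AGE = (-GAMMA_100 / (2 * GAMMA_200)) + 12;   /* 78.56 + 12 = 90.56 years */
   /* expected yearly rate of change, evaluated 0 to 3 years beyond age 12 */
   DO YEARS_PAST_12 = 0 TO 3;
      RATE = GAMMA_100 + 2 * GAMMA_200 * YEARS_PAST_12;
      OUTPUT;
   END;
RUN;
PROC PRINT DATA=QUADRATIC_TRAJECTORY; RUN;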

Finally, although not included here in the interests of space, and consistent with the model-building steps described earlier, two additional models were analyzed. The first model allowed the quadratic effect of centered age on mathematics achievement to vary randomly across students within schools. Results from that analysis showed an estimate of zero for the quadratic random effect (τπ220) at Level 2. The second model allowed the quadratic effect of centered age on mathematics achievement to vary randomly across students within schools at Level 2 and across schools at Level 3. That model failed to converge upon admissible parameter estimates, possibly due to attempting to “over-fit” the Level 1 model. As such, the first research question, regarding the most parsimonious Level 1 model to describe mathematics achievement score changes over time, has been answered: A linear fixed effect for centered age that is allowed to vary randomly both across students within schools, and across schools, best describes average mathematics achievement score change at Level 1 (see “Folder_6_Quadratic_Age_Fixed-Effect”).

MLM techniques such as the three-level longitudinal analysis used here are conceptually very similar to analysis techniques such as ordinary least squares (OLS) multiple regression. But one of the ways MLM differs markedly from OLS multiple regression is in the contrasting assumptions made by the two analysis techniques regarding residual variance, defined for our purposes here as that portion of mathematics achievement variance not explained by the linear effect of centered decimal age. Two classic residual variance assumptions made by OLS multiple regression are homogeneity (or homoscedasticity) and independence. Homoscedasticity means that the amount of variance in the response variable not explained by a predictor variable is the same across all values of that predictor variable, and independence refers to the assumption that response variable residual variance is both uncorrelated with the predictor(s) and uncorrelated across all i participants (e.g., Cohen, Cohen, West, & Aiken, 2003; Pedhazur, 1997; Singer & Willett, 2003).

In contrast, two response variable residual variance assumptions commonly associated with MLM techniques are heteroscedasticity and autocorrelation. For our purposes, heteroscedasticity means that the amount of variance in mathematics achievement scores not explained by the linear effect of centered decimal age can differ, or be heteroscedastic, across the T = 4 time points of assessment for all I students in J schools. However, recall that our example dataset is unstructured with respect to our chosen metric of time; that is, each of the i = 2,925 students could have different decimal ages at the T = 4 assessment time points, implying that there could be as many as 2,925 different sets of residual variance values across the T = 4 assessment time points, one set of T = 4 residual variance values for each of the I = 2,925 students (within the J = 50 schools). To ease the computational burden, however, the T = 4 residual variance values are typically constrained to equality across all i students within the J = 50 schools (e.g., see Singer & Willett, 2003). Furthermore, autocorrelation refers to the fact that, just as repeated mathematics achievement response values are likely to share variance, or correlate, within students over time, the residual response variable variance at a given T assessment time point likely is correlated with the residual variance at all other T time points.

One question researchers may be wondering at this point is “After entering centered decimal age as a Level 1 predictor into the longitudinal three-level model, what do the residual variances and covariances for mathematics achievement look like?” As discussed previously, the advantage of using decimal age as a Level 1 predictor in the analysis model is that it is the most accurate metric of time available for clocking changes in mathematics achievement scores over time. One difficulty that arises from the use of decimal age is that it is an unstructured measure of time and does not lend itself to the computation of a single residual covariance matrix. More specifically, each of the i = 2,925 students in the dataset likely has different values for centered decimal age at each of the T = 4 assessment time points. As such, there is not a single residual covariance matrix, but there could be as many different residual covariance matrices as there are unique values for decimal age across all i = 2,925 students (in the J = 50 schools) in the dataset.

Recall that earlier we used an integer metric of time, called “wave” (with values of 0, 1, 2, 3 for all i = 2,925 students), to explain a “jump” in Level 2 variance because “wave” is time structured. If we use “wave” (time structured) instead of decimal age (time unstructured) as the metric of time in Equation 4 and its associated SAS syntax, a single residual covariance matrix can be calculated4 from the resulting parameter estimates (sometimes referred to as the “standard model”) for demonstration purposes. As shown in the Level 1 covariance matrix below, the different variances shown on the main diagonal indicate the presence of heteroscedasticity, and the off-diagonal covariances illustrate the autocorrelation, or shared variance, between all the possible pairs of residuals at the four time points. However, as described elsewhere (e.g., see Singer & Willett, 2003), the residual covariance matrix resulting from such a “standard model” relies heavily on the metric of time chosen in computations and makes several additional assumptions regarding the behavior of the residual variance values over time that may not produce an acceptable fit to the data.

[Level 1 residual covariance matrix from the wave-based “standard model” (values not reproduced here).]

Although not exhaustive (at this writing, SAS can fit as many as 23 different residual covariance matrix models), Table 2 illustrates seven popular and possible alternative residual covariance matrix models. A quick glance at the third column of Table 2 shows that residual variances can be modeled as an additive constant, homogeneous, or heterogeneous, and residual covariances can be modeled as constant (i.e., all covariances constrained to equality), autocorrelated (i.e., all covariances multiplied by a correlation estimate), or banded (i.e., residuals at immediately adjacent time points are more highly correlated than residuals separated by one time point, which in turn are more highly correlated than residuals separated by two time points, etc.). But again, the residual matrix shown was computed based on “wave” because decimal age does not lend itself to the computation of a single residual covariance matrix. So the question of how to test the suitability of the “standard” model residual covariance matrix becomes paramount. However, as will be shown below, we need only the fit index values from a “standard model” that uses centered decimal age as the metric of time at Level 1 to evaluate that residual covariance matrix against the alternatives.

Table 2. Modeling the Residual Covariance Structure: Descriptive.

At this point, a third question might arise: “Why is this issue important?” Recall that one reason for the “model-building” approach taken here, where fixed and random effects are added one at a time and tested for significance, is to reduce the possibility of model misspecification. In similar fashion, if the “standard model” residual covariance matrix is adopted by researchers automatically, but is found to be a less than optimal fit to the data, that is a source of model misspecification at Level 1. As discussed previously, the three levels of analysis are correlated, which means that misspecification of the residuals matrix at Level 1 can propagate to the student (Level 2) and school (Level 3) levels. Specifically, although the fixed effect parameter estimates for predictor variables added at Level 2 and Level 3 are unbiased, Type 1 or Type 2 errors for the significance tests of those predictor variables can occur due to positively or negatively biased standard error estimates resulting from a misspecified residual covariance matrix at Level 1.

Given the importance of properly specifying the residual covariance matrix at Level 1, the final question becomes “What are the specific procedures involved in identifying the best-fitting residual covariance matrix at Level 1?” For clarity, we return first to our OLS multiple regression analogy. In OLS multiple regression, residuals (εi) are defined as the difference between each i individual’s observed (Yi) and model-predicted (Ŷi) response variable scores, and residual variance is defined as the variation in those difference scores. Extending this analogy to a longitudinal repeated measures design, Level 1 residuals (εt) are defined as the difference between observed (Yt) and model-predicted (Ŷt) response variable scores at each of the T = 4 time points of assessment, and residual variation is defined as the time point specific variation in those difference scores. As shown in the response variable partitioning section above, variation in the response variable difference (Yt − Ŷt) scores at each of the T = 4 time points defines response variable variance at Level 1; response variable variation also occurs across all I students at Level 2 and across all J schools at Level 3. If the influence of student-level and school-level response variable variation could be removed, Level 1 variation would remain.

For convenience, the SAS syntax for the best-fitting Level 1 model is reprinted below.

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
  CLASS PARTICIPANT_ID SCHOOL_ID;
  MODEL MATH_DV = C_AGE / SOLUTION DDFM=KENWARDROGER NOTEST;
  RANDOM INTERCEPT C_AGE / SUBJECT=PARTICIPANT_ID(SCHOOL_ID) TYPE=UN;
  RANDOM INTERCEPT C_AGE / SUBJECT=SCHOOL_ID TYPE=UN;

Recall that the two RANDOM statements above allow the variance in mathematical achievement scores to be partitioned into the portions that occur across students at Level 2 and across schools at Level 3.

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
  CLASS WAVE PARTICIPANT_ID SCHOOL_ID;
  MODEL MATH_DV = C_AGE / SOLUTION DDFM=KENWARDROGER NOTEST;
  REPEATED WAVE / TYPE=CS SUBJECT=PARTICIPANT_ID(SCHOOL_ID) R;

With the two “RANDOM” statements in the SAS analysis syntax removed, all deviations from the average rate of linear change in mathematics achievement scores defined by the fixed effect of centered decimal age (C_AGE) are captured by the heteroscedasticity and autocorrelations of the Level 1 residual covariance matrix. As such, different forms of heteroscedasticity and autocorrelation for the Level 1 residuals can be tested with the covariance matrix models shown in Table 2, and the fit statistics for those models can be compared against the fit of the analysis model that included fixed and random effects for centered decimal age shown in the first row of Table 2.

The previous two “RANDOM” statements have been replaced with a “REPEATED” statement that includes the time-structured “WAVE” variable. This statement, together with the subject specification (SUBJECT = PARTICIPANT_ID[SCHOOL_ID]), essentially dummy codes “WAVE” so that a Level 1 residual variance for mathematics achievement at each of the T = 4 time points, along with all residual covariances, can be estimated. The covariance matrix model to be tested is then specified with the “TYPE =” subcommand. Although compound symmetry (CS) is used as an example, any of the models shown in Table 2 can be tested by substituting the appropriate abbreviation shown in the first column, as sketched immediately below. Finally, the covariance matrix specified by “TYPE =” can be printed to the output window by including the “R” specification (the equivalent residual correlation matrix can be printed to the output window by adding the “RCORR” specification).
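
As one illustration of that substitution (a minimal sketch that simply reuses the variable names above; it is not reproduced from the article’s supplemental syntax files), the unstructured matrix would be requested with TYPE=UN, and the heterogeneous Toeplitz model discussed next would instead use TYPE=TOEPH:

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
  CLASS WAVE PARTICIPANT_ID SCHOOL_ID;
  MODEL MATH_DV = C_AGE / SOLUTION DDFM=KENWARDROGER NOTEST;
  REPEATED WAVE / TYPE=UN SUBJECT=PARTICIPANT_ID(SCHOOL_ID) R RCORR;
RUN;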

All of the Level 1 residual covariance matrix models shown in Table 2 were fit to the simulated Hawaii Department of Education dataset, and the results are shown in Table 3. Only two residual covariance matrix models showed a better fit to the data than the “standard” model that used fixed and random effects for centered decimal age. The heterogeneous Toeplitz model, which estimates seven additional parameters, showed a 316-point decrease in deviance (−2 LogL) and Akaike information criterion (AIC), and a 287.6-point decrease in Bayesian information criterion (BIC). However, the unstructured model, which estimates 10 additional parameters, showed a 9,008.1-point drop in deviance (−2 LogL), a 9,002.1-point drop in AIC, and an 8,955.7-point drop in BIC. From these results, an unstructured residual covariance matrix appears to be needed to properly model the Level 1 residuals (for more information on minimal differences between information criteria values in model comparison and selection situations, see Anderson, 2008; Liu, Rovine, & Molenaar, 2012; Verbeke & Lesaffre, 1997; W. Wu, West, & Taylor, 2009).

Table 3. Modeling the Residual Covariance Structure: Results.

Now that the average change trajectory in mathematics achievement scores across participants over time and the variation in that average change trajectory have both been identified, the process of answering the second research question can begin: “What effects do gender and the percentage of students within a school receiving free/reduced lunch have on mathematics achievement scores?” In keeping with the model-building process, the Level 2 predictor variable gender is first added to the multilevel model as a fixed effect. However, just as the Level 1 predictor variable decimal age needed to be centered to facilitate its interpretation, the question of centering needs to be addressed before adding gender at Level 2. At first glance, there would seem to be no need to center a binary predictor variable. If a binary variable such as gender is dummy coded (0 = female, 1 = male), the intercept (γ000) is the expected mathematics achievement score for a 12-year-old female student (i.e., when gender and centered decimal age both equal 0), and the intercept variance term (τβ000) is interpreted as the variation in expected mathematics achievement scores for females at 12 years of age across schools. By definition, for dummy-coded binary variables, a value of zero has a meaningful interpretation. However, analysis situations could arise in which estimating the means and variances of the group coded zero is less informative, and centering a binary covariate allows intercepts and variances to be interpreted in terms of the average mathematics achievement score across the two coded groups (e.g., Hox & Roberts, 2011). So, prior to adding gender to the model as a Level 2 predictor variable, a brief conceptual, statistical, and theoretical discussion of binary covariate centering is needed.

Group mean centering gender (sometimes referred to as “centering within cluster,” or in this case, centering within schools) involves subtracting the gender mean (i.e., the proportion of male students) for each j school from all of the gender values within that school (GENDERij − GENDER¯j). Conceptually, group mean centering gender results in a “weighted” or “unadjusted” intercept (γ000), as illustrated below.

Gender CWS (centered within schools):

\gamma_{000} = \frac{n_{j=1}\beta_{00(j=1)} + n_{j=2}\beta_{00(j=2)} + \cdots + n_{j=50}\beta_{00(j=50)}}{\sum_{j=1}^{J} n_j}.

Each school-specific mathematics achievement mean (β00j) is now “weighted” by the differing number of males and females (nj) within each j school. As such, the resulting intercept is “unadjusted” because each school-specific response variable mean is weighted by the different numbers of males and females within each school rather than treated as if the numbers of males and females across all schools were equal (e.g., Cohen et al., 2003; Enders & Tofighi, 2007). Statistically, just as mathematics achievement was shown earlier to have non-zero variation (and non-zero ICC values) at all three levels, the average proportions of males versus females will vary across schools at Level 3 (and have a non-zero ICC at Level 3). However, if gender is group mean centered (or centered within schools), the resulting centered gender variable will have zero variance (and an ICC = 0) at Level 3.

From a practical perspective, the issue of choosing between grand mean (centering on the sample mean) or group mean (or centering within cluster) centering involves a theoretical consideration of whether an absolute or relative effect of the predictor variable on the response variable is of interest. For example, if two very similar female students would be expected to have very different mathematics achievement scores if they were placed into schools that differed notably in the proportion of females versus males (e.g., an all-female school vs. schools with lower or higher proportions of female students), a relative effect of gender on mathematics achievement would be of interest and group mean centering, or centering within cluster, would be appropriate.

In contrast, grand mean centering gender involves subtracting the gender grand mean, or the overall proportion of males in the dataset, from all the gender values in the dataset (GENDERij − GENDER¯). Conceptually, grand mean centering gender results in an “unweighted” or “adjusted” intercept (γ000), as illustrated below.

Gender GMC (grand mean centered):

\gamma_{000} = \frac{\beta_{00(j=1)} + \beta_{00(j=2)} + \beta_{00(j=3)} + \cdots + \beta_{00(j=50)}}{J}

Each of the school-specific mathematics achievement means at Level 3 (β00j) is interpreted as the school-level mathematics achievement mean that would have been observed had the proportions of males to females been equal (i.e., unweighted) across all (J = 50) schools. The intercept (γ000) that results from grand mean centering gender is conceptually “adjusted” for the different proportions of males and females across schools by treating each β00j as if there were an equal number of males and females in all (J) schools. Statistically, in contrast to group mean centering (or centering within cluster), grand mean centering binary gender results in a variable that will show variation in average proportions across schools at Level 3 (i.e., non-zero ICCs at Level 2 and Level 3). From a practical perspective, grand mean centering would be appropriate if an absolute (i.e., the effect of gender on mathematics achievement would not be expected to vary if the gender proportion varies across schools), rather than a relative (i.e., the effect of gender on mathematics achievement would be contingent on the gender proportion of the school), effect of gender on mathematics achievement scores is of theoretical interest.
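
As a computational aside, both centered versions of gender can be created before the analysis. The sketch below shows one way to do so in SAS; the dataset name MATH_LONG and the 0/1 coding of GENDER are assumptions made for illustration rather than features of the example data described above.

PROC SQL;
  /* Centering within schools: subtract each school's proportion of males
     (the school mean of the 0/1 GENDER code) from every student's value */
  CREATE TABLE MATH_CWC AS
  SELECT *, GENDER - MEAN(GENDER) AS GROUP_GENDER
  FROM MATH_LONG
  GROUP BY SCHOOL_ID;

  /* Grand mean centering: subtract the overall proportion of males */
  CREATE TABLE MATH_CENTERED AS
  SELECT *, GENDER - (SELECT MEAN(GENDER) FROM MATH_LONG) AS GRAND_GENDER
  FROM MATH_CWC;
QUIT;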

Group mean centered gender (i.e., GENDERij − GENDER¯j; shown as GROUP_GENDER in the SAS syntax below) will be added to the three-level analysis model as a Level 2 predictor variable under the assumption that a relative effect of gender on mathematics achievement, rather than an absolute effect, is of theoretical interest. The linear equation for a model that includes both the main effect of group mean centered gender, as well as the centered decimal age by centered gender interaction effect, on mathematics achievement is shown as

Y_{tij} = \gamma_{000} + \gamma_{100}(AGE_{tij} - 12) + \gamma_{010}(GENDER_{ij} - \overline{GENDER}_j) + \gamma_{110}(AGE_{tij} - 12)(GENDER_{ij} - \overline{GENDER}_j) + u_{00j} + u_{10j}(AGE_{tij} - 12) + r_{0ij} + r_{1ij}(AGE_{tij} - 12) + \varepsilon_{tij}.

Furthermore, the SAS syntax needed to estimate the model shown in Equation 6 above is5

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
  CLASS PARTICIPANT_ID SCHOOL_ID;
  MODEL MATH_DV = C_AGE GROUP_GENDER C_AGE*GROUP_GENDER / SOLUTION DDFM=KENWARDROGER NOTEST;
  RANDOM INTERCEPT C_AGE / SUBJECT=PARTICIPANT_ID(SCHOOL_ID) TYPE=UN;
  RANDOM INTERCEPT C_AGE / SUBJECT=SCHOOL_ID TYPE=UN;

Results for the Level 2 predictor variable model shown in Equation 6 and in the above SAS syntax are shown in the second column of Table 4. The main effect for gender was significant (γ010 = −13.95); age 12 males showed, on average, mathematics achievement scores that were 13.95 points lower than the mathematics achievement scores of age 12 females. The gender by age interaction effect was also significant (γ110 = −1.60); the change in mathematics achievement scores for males per year beyond age 12 was, on average, 1.60 points lower than that for female students. This significant interaction is presented graphically (Aiken & West, 1991) in Figure 2. All other fixed and random effect estimates remained statistically significant, and their interpretations were relatively unchanged from the previous analysis (see “Folder_7_Gender_Fixed_Effect”).

Table 4. Level 2 and Level 3 Predictor Variables: Gender and Percentage of Students Receiving Free/Reduced Lunch.



Figure 2. A graphical display of the significant two-way interaction.

The significant main effect for (group mean centered) gender could vary significantly across schools at Level 3 just as the effect of linear age on mathematics achievement was shown to vary significantly across both individuals at Level 2 and across schools at Level 3. The linear model equation that allows the effects of gender to vary randomly across schools at Level 3 is

Y_{tij} = \gamma_{000} + \gamma_{100}(AGE_{tij} - 12) + \gamma_{010}(GENDER_{ij} - \overline{GENDER}_j) + \gamma_{110}(AGE_{tij} - 12)(GENDER_{ij} - \overline{GENDER}_j) + u_{00j} + u_{01j}(GENDER_{ij} - \overline{GENDER}_j) + u_{10j}(AGE_{tij} - 12) + r_{0ij} + r_{1ij}(AGE_{tij} - 12) + \varepsilon_{tij},

and the SAS syntax needed to estimate the model shown in Equation 7 above is

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
  CLASS PARTICIPANT_ID SCHOOL_ID;
  MODEL MATH_DV = C_AGE GROUP_GENDER C_AGE*GROUP_GENDER / SOLUTION DDFM=KENWARDROGER NOTEST;
  RANDOM INTERCEPT C_AGE / SUBJECT=PARTICIPANT_ID(SCHOOL_ID) TYPE=UN;
  RANDOM INTERCEPT C_AGE GROUP_GENDER / SUBJECT=SCHOOL_ID TYPE=UN;

Results for the model that allowed the centered gender main effect to vary across schools at Level 3 are shown in the third column of Table 4. Two findings are noteworthy. First, all of the fixed and random effect estimates remained statistically significant and did not appreciably change from previous results. Second, allowing the gender main effect to vary randomly at Level 3 resulted in three additional random effects estimated at Level 3. The relationship between the linear model notation, the SAS output for the random effect estimates, and the Level 3 covariance matrix (τβ) can be shown as

\tau_\beta =
\begin{bmatrix}
\tau_{\beta 000} & & \\
\tau_{\beta 010} & \tau_{\beta 011} & \\
\tau_{\beta 020} & \tau_{\beta 021} & \tau_{\beta 022}
\end{bmatrix}
=
\begin{bmatrix}
UN(1,1) & & \\
UN(2,1) & UN(2,2) & \\
UN(3,1) & UN(3,2) & UN(3,3)
\end{bmatrix}
=
\begin{bmatrix}
208.31 & & \\
19.98 & 8.79 & \\
38.05 & 4.43 & 56.06
\end{bmatrix}.

It is important to note that estimates along the main diagonal of the Level 3 covariance matrix (τβ) are listed in the same order as the corresponding random effects on the “RANDOM INTERCEPT C_AGE GROUP_GENDER / SUBJECT=SCHOOL_ID” line. As such, τβ000 = 208.31 is the estimate of variation in the average mathematics achievement score across schools for age 12 students, and τβ011 = 8.79 is the estimate of the variation across schools in linear mathematics achievement score change per yearly increase in age beyond age 12. However, the purpose of this analysis model was to test whether the main effect of gender on mathematics achievement varied significantly across schools at Level 3.

As shown in the third column of Table 4 and in the reproduced Level 3 covariance matrix above, the estimate of the variation in the main effect of gender on mathematics achievement (τβ022 = 56.06) was statistically significant. Once again, the estimate of the effect of gender on mathematics achievement (γ010 = −13.67) combined with the estimate of the variance in that effect (τβ022 = 56.06) can be used to compute a 95% confidence interval to facilitate interpretation. The effect of centered gender on mathematics achievement is expected to range between −28.35 and 1.01 points (√56.06 = 7.49; −13.67 ± [1.96 × 7.49] = −28.35, 1.01) around a mean effect of −13.67 across 95% of the J = 50 schools. The three remaining off-diagonal elements in the Level 3 covariance matrix (τβ) are all possible covariances among these three random effect estimates (see “Folder_8_Gender_Random_At_Level3”).

Now that the average change in mathematics achievement scores as a function of increasing age has been modeled at Level 1 and the effects of gender have been modeled at Level 2, a Level 3 predictor variable can also be added. The percentage of students in a school receiving free/reduced lunch has been used in educational research as a proxy indicator for school-level socio-economic status (SES). Prior to including this predictor variable in the analysis model, the issue of centering again must be considered. Grand mean centering is the only option for centering a Level 3 predictor variable such as the percentage of students in a school receiving free/reduced lunch because free/reduced lunch percentages vary across, but not within, schools. As such, a model that includes grand mean centered percentage of students receiving free/reduced lunch (PCT_FR_LUNCHj − PCT_FR_LUNCH¯; shown as GRAND_FR_LUNCH in the SAS syntax below) as a Level 3 predictor variable (but see also Bauer, Gottfredson, Dean, & Zucker, 2013) is

Y_{tij} = \gamma_{000} + \gamma_{100}(AGE_{tij} - 12) + \gamma_{010}(GENDER_{ij} - \overline{GENDER}_j) + \gamma_{001}(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) + \gamma_{110}(AGE_{tij} - 12)(GENDER_{ij} - \overline{GENDER}_j) + \gamma_{101}(AGE_{tij} - 12)(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) + \gamma_{011}(GENDER_{ij} - \overline{GENDER}_j)(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) + \gamma_{111}(AGE_{tij} - 12)(GENDER_{ij} - \overline{GENDER}_j)(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) + u_{00j} + u_{10j}(AGE_{tij} - 12) + u_{01j}(GENDER_{ij} - \overline{GENDER}_j) + r_{0ij} + r_{1ij}(AGE_{tij} - 12) + \varepsilon_{tij},

and the SAS syntax needed to estimate the model shown in Equation 8 is

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
  CLASS PARTICIPANT_ID SCHOOL_ID;
  MODEL MATH_DV = C_AGE GROUP_GENDER GRAND_FR_LUNCH
        C_AGE*GROUP_GENDER C_AGE*GRAND_FR_LUNCH GROUP_GENDER*GRAND_FR_LUNCH
        C_AGE*GROUP_GENDER*GRAND_FR_LUNCH
        / SOLUTION DDFM=KENWARDROGER NOTEST;
  RANDOM INTERCEPT C_AGE / SUBJECT=PARTICIPANT_ID(SCHOOL_ID) TYPE=UN;
  RANDOM INTERCEPT C_AGE GROUP_GENDER / SUBJECT=SCHOOL_ID TYPE=UN;

As shown in Equation 8, the SAS syntax above, and the fourth column of Table 4, adding the percentage of students receiving free/reduced lunch as a Level 3 predictor variable results in seven (non-intercept) fixed effect estimates: the main effects for centered decimal age, centered gender, and free/reduced lunch; the two-way interactions of centered decimal age by centered gender, centered decimal age by free/reduced lunch, and centered gender by free/reduced lunch; and one three-way interaction of centered decimal age by centered gender by free/reduced lunch, all as they impact mathematics achievement. As also shown in the fourth column of Table 4, four of these seven fixed effect estimates were statistically significant. The main effect for free/reduced lunch on mathematics achievement (γ001 = −2.99) was significant; average mathematics achievement scores for age 12 students decreased by roughly three points, on average, for every percentage-point increase in students receiving free/reduced lunch beyond the sample mean. Neither of the two two-way interactions nor the three-way interaction involving the percentage of students receiving free/reduced lunch was significant. However, the centered decimal age by centered gender interaction (γ110 = −1.56) remained statistically significant. All other fixed and random effect estimates, and their statistical significance, remained relatively unchanged from previous analyses with one exception: the intercept variation in the average mathematics achievement scores across schools at Level 3 (τβ000 = 141.60) decreased notably from the previous analysis (τβ000 = 208.31) as a result of adding the effects of free/reduced lunch percentage to the model at Level 3 (see “Folder_9_Free_Reduced_Lunch_Fixed_Effect”).

When looking from left to right at the rows labeled “−2LogL (Deviance)” in Tables 1 and 4, readers will quickly notice that as the model-building process proceeds and the number of estimated parameters increases with each analysis model, the −2LogL (deviance) value decreases, indicating that the fit of the multilevel models to the data continues to improve. However, if a model comparison process could be described as a researcher’s offer of proof that the additional parameters estimated at each successive step in the model-building process result in an improved fit of the three-level model to the sample data, the next logical question becomes “Do the additional parameters estimated at each successive step in the model-building process produce a statistically significant improvement in the fit of the analysis model to the sample data?” To answer this question, researchers must first pose three additional questions: (a) Are the two multilevel models being compared “nested”? (b) If so, are the models nested with respect to fixed effect estimates or random effect estimates? and (c) Were the parameters for the two models under comparison estimated using ML or REML?

Multilevel model comparison tests can only be conducted on “nested” models, and a model that estimates a smaller number of parameters is nested within a model that estimates a larger number of parameters if fixing one or more parameter estimates to zero in the larger model results in the smaller model. For example, the model that estimates the fixed effects of gender (Table 4, second column) is nested with respect to random effects within the model that allows gender effects to vary randomly at Level 3 (Table 4, third column) because fixing the following three random effects to zero (τβ020, τβ021, and τβ022) in the model that allows gender effects to vary randomly at Level 3 results in the gender fixed effects model. In similar fashion, the model that allows gender effects to vary randomly at Level 3 (Table 4, third column) is nested with respect to fixed effects within the model that includes free/reduced lunch percentages as a Level 3 predictor variable (Table 4, fourth column) because fixing the following four fixed effects to zero (γ001, γ101, γ011, and γ111) in the free/reduced lunch percentages predictor variable model results in the model with gender effects varying randomly at Level 3. These two examples, both involving models with parameters estimated using REML instead of ML, will be used to illustrate two model comparison testing procedures: a likelihood ratio (or chi-square) nested model test and a multi-parameter Wald test (cf. Schafer, 1997).

At this point, researchers may be wondering why two model comparison procedures would be needed, and the answer lies in the differences between ML and REML parameter estimation in model comparison situations. Although described in detail elsewhere (e.g., see Goldstein, 2011; Hox, 2010; Singer & Willett, 2003; Snijders & Bosker, 2012), a quick review of parameter estimation for model comparison purposes is needed. ML estimation considers fixed effect parameters in a multilevel model to be “known” values and does not allot the proper degrees of freedom to fixed effect parameter estimation. This results in negatively biased random effect parameter estimates due to positively biased degrees of freedom for random effect parameter estimation. In contrast, REML considers fixed effect parameters to be unknown values and allots the proper degrees of freedom for their estimation. Those same degrees of freedom are subsequently removed from the likelihood function so that unbiased random effect estimates can be obtained. As such, and assuming REML estimation was used, competing models that are nested in random effects can be compared using a likelihood ratio (or equivalently, chi-square) nested model test, which involves first computing both the difference in the −2LogL (or chi-square) model fit statistics (i.e., Δ−2LogL = [−2LogLsmaller] − [−2LogLlarger]) and the difference in the number of estimated parameters (Δp = plarger − psmaller) between the two nested models. For example, a likelihood ratio nested model test comparing a model that estimated the fixed effects of gender (Table 4, second column) with a model that allowed gender effects to vary randomly at Level 3 (Table 4, third column) could easily be conducted by first using the information at the bottom of Table 4 to compute these difference values (Δ−2LogL = [−2LogLsmaller] − [−2LogLlarger] = 105349.4 − 105340.4 = 9; Δp = plarger − psmaller = 14 − 11 = 3). The actual nested model “test” involves referring the −2LogL difference value (Δ−2LogL = 9) to a chi-square sampling distribution with degrees of freedom equal to the difference in the number of estimated parameters (Δp = 3; χ²[3] = 9, p = .03). The results of the likelihood ratio nested model test show that allowing the effects of gender to vary randomly at Level 3 resulted in a significant improvement in the fit of the analysis model to the data.
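
The arithmetic of the likelihood ratio test is easily scripted. The sketch below is a minimal SAS DATA step (the dataset name LRT_TEST is illustrative); the deviance values and parameter counts are those reported at the bottom of Table 4, and the PROBCHI function returns the lower-tail chi-square probability.

DATA LRT_TEST;
  DEV_SMALLER = 105349.4;   /* -2LogL: gender fixed-effects model           */
  DEV_LARGER  = 105340.4;   /* -2LogL: gender random at Level 3 model       */
  DF          = 14 - 11;    /* difference in number of estimated parameters */
  CHI_SQUARE  = DEV_SMALLER - DEV_LARGER;
  P_VALUE     = 1 - PROBCHI(CHI_SQUARE, DF);   /* upper-tail p value */
RUN;

PROC PRINT DATA=LRT_TEST; RUN;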

However, how can the model that allows gender effects to vary randomly at Level 3 (Table 4, third column) be compared against a model that includes free/reduced lunch percentages as a Level 3 predictor variable (Table 4, fourth column)? These two models are nested with respect to fixed effect estimates, and models estimated using REML that are nested with respect to fixed effects cannot be compared with a likelihood ratio (or chi-square) nested model test. As shown elsewhere (e.g., Enders, 2010; Li, Raghunathan, & Rubin, 1991; Schafer, 1997), these two models can be compared using a multi-parameter Wald test in a manner akin to how the coefficient of determination (i.e., R2) in a multiple regression tests the inclusion of all predictor variables simultaneously for a significantly non-zero proportion of response variable variance explained. The SAS syntax needed to compare the random gender effects model to the free/reduced lunch percentages model (both estimated using REML) via a multi-parameter Wald (MPW) test is

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
  CLASS PARTICIPANT_ID SCHOOL_ID;
  MODEL MATH_DV = C_AGE GROUP_GENDER GRAND_FR_LUNCH
        C_AGE*GROUP_GENDER C_AGE*GRAND_FR_LUNCH GROUP_GENDER*GRAND_FR_LUNCH
        C_AGE*GROUP_GENDER*GRAND_FR_LUNCH
        / SOLUTION DDFM=KENWARDROGER NOTEST;
  RANDOM INTERCEPT C_AGE / SUBJECT=PARTICIPANT_ID(SCHOOL_ID) TYPE=UN;
  RANDOM INTERCEPT C_AGE GROUP_GENDER / SUBJECT=SCHOOL_ID TYPE=UN;
  CONTRAST 'MPW: Four Fixed Effects'
        GRAND_FR_LUNCH 1,
        C_AGE*GRAND_FR_LUNCH 1,
        GROUP_GENDER*GRAND_FR_LUNCH 1,
        C_AGE*GROUP_GENDER*GRAND_FR_LUNCH 1;

where the “CONTRAST” statement is used in SAS for multi-parameter Wald testing. A text title (“MPW: Four Fixed Effects”) can be included in quotes; each of the four fixed effect parameters is then listed to indicate its inclusion in the test, followed by a “1” indicating that each fixed effect is weighted equally; and the parameters are separated by commas so that the four effects are tested as a unified set. Results for the multi-parameter Wald test in SAS, F(4, 63.8) = 5.53, p < .001, showed that the inclusion of the four fixed effects in the free/reduced lunch model resulted in a significant improvement in model fit (multi-parameter Wald test syntax specifications for SPSS and Mplus are included in “Folder_10_Multi_Parameter_Wald_Test”).

In the previous section addressing model comparisons, the multi-parameter Wald test was described as being able to compare multilevel models that differ in multiple fixed effect parameter estimates by testing the multiple fixed effects as a set, in a manner similar to the way that an R2 statistic tests all predictor variable slopes in a multiple regression model as a unified “set” for their ability to explain response variable variance. To continue this analogy, it is well-known that an R2 statistic is frequently cited as an overall effect size estimate in OLS multiple regression. Perhaps less well-known, because R2 statistics are almost always provided as default output for OLS multiple regression analyses, is the three-step process that would allow researchers to calculate an OLS multiple regression R2 statistic: (a) using the intercept and slope estimates, solve the OLS multiple regression equation for all participants to obtain model-predicted response variable scores; (b) obtain the Pearson correlation between the model-predicted response variable scores calculated in the previous step and the observed response variable scores in the dataset; and then (c) square that Pearson correlation. Using these same three steps, an overall “pseudo-R2” effect size estimate (Hox & Roberts, 2011; Peugh, 2010, 2014; Singer & Willett, 2003) can be calculated that quantifies the proportion of mathematics achievement variance modeled by the inclusion of all predictor variable fixed effects (it is a “pseudo”-R2 because, unlike the independent response variable values assumed in OLS multiple regression, the mathematics achievement response variable scores are not independent across the three levels of analysis, as previously shown).

Specifically, if only the fixed effects portion of Equation 8 above is considered and the gammas (γ) are replaced with their coefficient estimates from Table 4, the result is

Y_{tij} = 667.96 + 26.93(AGE_{tij} - 12) - 13.62(GENDER_{ij} - \overline{GENDER}_j) - 2.99(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) - 1.56(AGE_{tij} - 12)(GENDER_{ij} - \overline{GENDER}_j) - 0.17(AGE_{tij} - 12)(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) - 0.16(GENDER_{ij} - \overline{GENDER}_j)(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) - 0.13(AGE_{tij} - 12)(GENDER_{ij} - \overline{GENDER}_j)(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}).

Solving Equation 9 will produce a column of predicted mathematics achievement scores (MATH_DVtij-Predicted). Obtaining the Pearson correlation between the observed mathematics achievement scores (MATH_DVtij-Observed) and the model-predicted mathematics achievement scores, then squaring the resulting correlation (i.e., [rPredicted,Observed]²), will produce the pseudo-R² value. Results for the simulated Hawaii Department of Education data showed that R²Pseudo = (.66)² = .44; that is, 44% of the variance in mathematics achievement scores across the T = 4 time points for all I students within all J schools was modeled by the main and interaction effects involving age, gender, and percentage of students receiving free/reduced lunch as shown in Equation 9. However, one important caveat warrants mention: Equation 9 contains only fixed effect estimates, not random effect estimates. This means that the scores generated from solving Equation 9 are predicted mathematics achievement scores for students with values of zero on all predictor variables. More specifically, because of the choices made here for centering each of the three predictor variables, the scores generated from solving Equation 9 are predicted scores for an age 12 student, male or female (because gender was centered), attending a school with a percentage of students receiving free/reduced lunch at the sample mean. As such, it is important to note that the R²Pseudo value obtained should be considered a conditional effect size value.
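
In practice, the three steps above do not need to be carried out by hand. The sketch below shows one way to obtain the pseudo-R² in SAS: the OUTPM= option on the MODEL statement writes the fixed-effects-only predicted values (the variable Pred) to a new dataset, and PROC CORR supplies the correlation to be squared. The dataset name MATH_LONG and the output dataset name FIXED_PRED are assumptions made for illustration.

PROC MIXED DATA=MATH_LONG METHOD=REML NOCLPRINT NOITPRINT;
  CLASS PARTICIPANT_ID SCHOOL_ID;
  MODEL MATH_DV = C_AGE GROUP_GENDER GRAND_FR_LUNCH
        C_AGE*GROUP_GENDER C_AGE*GRAND_FR_LUNCH GROUP_GENDER*GRAND_FR_LUNCH
        C_AGE*GROUP_GENDER*GRAND_FR_LUNCH
        / SOLUTION DDFM=KENWARDROGER NOTEST OUTPM=FIXED_PRED;
  RANDOM INTERCEPT C_AGE / SUBJECT=PARTICIPANT_ID(SCHOOL_ID) TYPE=UN;
  RANDOM INTERCEPT C_AGE GROUP_GENDER / SUBJECT=SCHOOL_ID TYPE=UN;
RUN;

/* PRED holds the Equation 9 (fixed-effects-only) predicted scores;
   squaring the MATH_DV-PRED correlation yields the pseudo-R-square */
PROC CORR DATA=FIXED_PRED;
  VAR MATH_DV PRED;
RUN;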

Furthermore, just as a pseudo-R2 estimate can be used as an overall effect size, indices analogous to pseudo-R2 can be computed to obtain an effect size estimate for a specific predictor variable based on the proportional reduction in mean squared prediction error (e.g., Snijders & Bosker, 2012). Returning to OLS multiple regression as an analogy, with no predictor variables in the regression model, the best predictor of any response variable score (Yi) for a sample of data is the sample mean response variable score (Y¯), prediction error would be defined as Yi − Y¯, and mean squared prediction error would be defined as VAR(Yi − Y¯), or just VAR(Yi). If a predictor variable X is added to the regression model, the best predictor of any response variable score (Yi) is the value predicted by the regression equation β0 + β1Xi, prediction error would be defined as (Yi − [β0 + β1Xi]), and mean squared prediction error would be defined as VAR(Yi − [β0 + β1Xi]). If this logic is extended to consider the proportion of reduction in mean squared prediction error that resulted from adding a Level 1 predictor variable (R12), such as centered decimal age (AGE − 12), to a three-level model predicting mathematics achievement scores (MATH_DV), it can be shown that (Snijders & Bosker, 2012)

R_1^2 = 1 - \frac{VAR\left(MATH\_DV_{tij} - \left[\gamma_{000} + \gamma_{100}(AGE_{tij} - 12)\right]\right)}{VAR\left(MATH\_DV_{tij}\right)} = 1 - \frac{\left(\sigma^2 + \tau_{\pi 000} + \tau_{\beta 000}\right)_{CONDITIONAL\ MODEL}}{\left(\sigma^2 + \tau_{\pi 000} + \tau_{\beta 000}\right)_{UNCONDITIONAL\ MODEL}},

and from Table 1, it can be shown that

R_1^2 = 1 - \frac{272.00 + 909.01 + 188.65}{1571.72 + 490.37 + 168.06} = 0.39,

or 39% of the variance in mathematics achievement scores was modeled by adding centered decimal age as a Level 1 predictor variable (see also Snijders & Bosker, 1994).6
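
As a quick arithmetic check of the computation above (a minimal sketch; the dataset name R2_LEVEL1 is illustrative, and the variance components are those reported in Table 1):

DATA R2_LEVEL1;
  VAR_UNCONDITIONAL = 1571.72 + 490.37 + 168.06; /* sigma^2 + tau_pi000 + tau_beta000, unconditional model */
  VAR_CONDITIONAL   = 272.00 + 909.01 + 188.65;  /* same components, conditional (centered age) model */
  R2_L1 = 1 - (VAR_CONDITIONAL / VAR_UNCONDITIONAL);  /* = 0.39 */
RUN;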

The goals of this article were threefold: (a) to integrate the relatively scarce existing empirical literature on currently accepted three-level longitudinal data analysis methods, (b) to conceptually illustrate how the linear model equations at all three levels can be used to properly specify the needed statistical analysis syntax, using the same pedagogical style as previously published two-level (e.g., Peugh, 2010; Peugh & Enders, 2005; Singer, 1998) and cross-sectional three-level (Peugh, 2014) articles, and (c) to address topics not widely discussed in MLM publications, such as why binary predictor variables should also be centered and why it is important to model the Level 1 residual covariance matrix. Yet, pedagogical articles such as this one suffer from inherent limitations, one of which an anonymous reviewer of this article summarized quite well: “It is impossible to provide a comprehensive treatment of this model in the confines of a journal article.” We could not agree more with the anonymous reviewer’s perspective. However, just as we sought to accomplish the three goals above throughout the article, we conclude with a threefold hope: (a) that readers will find the example analyses and illustrations presented here and in the online supplemental materials helpful in familiarizing themselves with MLM in general and three-level longitudinal analysis models specifically, so that they can (b) expand their MLM expertise to enable the posing of more intricate MLM research questions that would require more complex (e.g., mediation, moderated mediation, mediated moderation) multilevel analyses, and (c) do so by expanding upon the SAS (PROC MIXED), SPSS (MIXED), and Mplus (TYPE = THREELEVEL) analysis syntax examples used here, or by translating the decisions made and the procedures used here for use in any statistical analysis software package capable of analyzing three-level longitudinal data.

Level-Specific Linear Model Equations

Unconditional Longitudinal Model

Level 1: Y_{tij} = \pi_{0ij} + \varepsilon_{tij}

Level 2: \pi_{0ij} = \beta_{00j} + r_{0ij}

Level 3: \beta_{00j} = \gamma_{000} + u_{00j}

Level 1 Model: Age Fixed Effect

Level 1: Y_{tij} = \pi_{0ij} + \pi_{1ij}(AGE_{tij} - 12) + \varepsilon_{tij}

Level 2: \pi_{0ij} = \beta_{00j} + r_{0ij}

Level 2: \pi_{1ij} = \beta_{10j}

Level 3: \beta_{00j} = \gamma_{000} + u_{00j}

Level 3: \beta_{10j} = \gamma_{100}

Level 1 Model: Age Random at Level 2

Level 1: Y_{tij} = \pi_{0ij} + \pi_{1ij}(AGE_{tij} - 12) + \varepsilon_{tij}

Level 2: \pi_{0ij} = \beta_{00j} + r_{0ij}

Level 2: \pi_{1ij} = \beta_{10j} + r_{1ij}

Level 3: \beta_{00j} = \gamma_{000} + u_{00j}

Level 3: \beta_{10j} = \gamma_{100}

Level 1 Model: Age Random at Level 2 and Level 3

Level 1: Y_{tij} = \pi_{0ij} + \pi_{1ij}(AGE_{tij} - 12) + \varepsilon_{tij}

Level 2: \pi_{0ij} = \beta_{00j} + r_{0ij}

Level 2: \pi_{1ij} = \beta_{10j} + r_{1ij}

Level 3: \beta_{00j} = \gamma_{000} + u_{00j}

Level 3: \beta_{10j} = \gamma_{100} + u_{10j}

Level 1 Model: Quadratic Age Fixed

Level 1: Y_{tij} = \pi_{0ij} + \pi_{1ij}(AGE_{tij} - 12) + \pi_{2ij}(AGE_{tij} - 12)^2 + \varepsilon_{tij}

Level 2: \pi_{0ij} = \beta_{00j} + r_{0ij}

Level 2: \pi_{1ij} = \beta_{10j} + r_{1ij}

Level 2: \pi_{2ij} = \beta_{20j}

Level 3: \beta_{00j} = \gamma_{000} + u_{00j}

Level 3: \beta_{10j} = \gamma_{100} + u_{10j}

Level 3: \beta_{20j} = \gamma_{200}

Level 2 Model: Gender Fixed Effect

Level 1: Y_{tij} = \pi_{0ij} + \pi_{1ij}(AGE_{tij} - 12) + \varepsilon_{tij}

Level 2: \pi_{0ij} = \beta_{00j} + \beta_{01j}(GENDER_{ij} - \overline{GENDER}_j) + r_{0ij}

Level 2: \pi_{1ij} = \beta_{10j} + \beta_{11j}(GENDER_{ij} - \overline{GENDER}_j) + r_{1ij}

Level 3: \beta_{00j} = \gamma_{000} + u_{00j}

Level 3: \beta_{01j} = \gamma_{010}

Level 3: \beta_{10j} = \gamma_{100} + u_{10j}

Level 3: \beta_{11j} = \gamma_{110}

Level 2 Model: Gender Main Effect Random at Level 3

Level 1: Y_{tij} = \pi_{0ij} + \pi_{1ij}(AGE_{tij} - 12) + \varepsilon_{tij}

Level 2: \pi_{0ij} = \beta_{00j} + \beta_{01j}(GENDER_{ij} - \overline{GENDER}_j) + r_{0ij}

Level 2: \pi_{1ij} = \beta_{10j} + \beta_{11j}(GENDER_{ij} - \overline{GENDER}_j) + r_{1ij}

Level 3: \beta_{00j} = \gamma_{000} + u_{00j}

Level 3: \beta_{01j} = \gamma_{010} + u_{01j}

Level 3: \beta_{10j} = \gamma_{100} + u_{10j}

Level 3: \beta_{11j} = \gamma_{110}

Level 3 Model: Percentage of Students Receiving Free/Reduced Lunch

Level 1: Y_{tij} = \pi_{0ij} + \pi_{1ij}(AGE_{tij} - 12) + \varepsilon_{tij}

Level 2: \pi_{0ij} = \beta_{00j} + \beta_{01j}(GENDER_{ij} - \overline{GENDER}_j) + r_{0ij}

Level 2: \pi_{1ij} = \beta_{10j} + \beta_{11j}(GENDER_{ij} - \overline{GENDER}_j) + r_{1ij}

Level 3: \beta_{00j} = \gamma_{000} + \gamma_{001}(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) + u_{00j}

Level 3: \beta_{01j} = \gamma_{010} + \gamma_{011}(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) + u_{01j}

Level 3: \beta_{10j} = \gamma_{100} + \gamma_{101}(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH}) + u_{10j}

Level 3: \beta_{11j} = \gamma_{110} + \gamma_{111}(PCT\_FR\_LUNCH_j - \overline{PCT\_FR\_LUNCH})

SAS Proc Mixed Syntax Descriptions

PROC MIXED METHOD=REML NOCLPRINT COVTEST NOITPRINT;
  CLASS PARTICIPANT_ID SCHOOL_ID;
  MODEL MATH_DV = / SOLUTION DDFM=KENWARDROGER NOTEST;
  RANDOM INTERCEPT / SUBJECT=PARTICIPANT_ID(SCHOOL_ID);
  RANDOM INTERCEPT / SUBJECT=SCHOOL_ID;

  • The “METHOD = REML” option specifies the use of restricted maximum likelihood (REML), as opposed to full information maximum likelihood (ML), parameter estimation. REML has been shown to produce more accurate random effect estimates at sample sizes commonly found in a wide variety of research settings (see Peugh, 2014, for additional specifics).

  • The “NOCLPRINT” option inhibits the output of class level (CL; that is, CLASS PARTICIPANT_ID SCHOOL_ID;) descriptive information such as the number of students at Level 2 and the number of schools at Level 3.

  • The “COVTEST” option prints the significance test results for the variances and covariances at all three levels, while the “NOITPRINT” option suppresses the output of the parameter estimation iteration information.

  • Categorical identification variables at Level 1 (WAVE, shown later), Level 2 (PARTICIPANT_ID), and Level 3 (SCHOOL_ID) are listed as classification variables after the “CLASS” statement.

  • The response variable (MATH_DV =) is always listed first after the “MODEL” statement, and the predictor variables (i.e., fixed effects) to be tested for significance are listed on the right side of the “=” on the “MODEL” statement line.

  • The “/SOLUTION” specification prints the results of the fixed effects significance tests to the SAS output window.

  • The “DDFM=KENWARDROGER” option specifies the use of the Kenward–Roger (Kenward & Roger, 1997) denominator degrees of freedom correction method (DDFM) for statistical significance tests involving fixed effect estimates. The Kenward–Roger procedure has been shown to better maintain Type-1 errors across a number of multilevel dataset conditions (see Peugh, 2014, for additional details).

  • The “NOTEST” option on the MODEL statement suppresses the (somewhat redundant) output of Type-III significance tests for fixed effects.

  • Random effects (i.e., variances and covariances) to be tested for significance are listed after the “RANDOM” statement, and the “/SUBJECT=” modifier allows researchers to specify where the variation to be tested occurs. For example, “RANDOM INTERCEPT /SUBJECT = PARTICIPANT_ID (SCHOOL_ID);” indicates that response variable (MATH_DV) variation occurs in average response variable scores across participants (intercepts at Level 2) nested within schools, and “RANDOM INTERCEPT /SUBJECT= SCHOOL_ID;” indicates that variation occurs in the mean response variable scores (intercepts) across schools at Level 3.

The authors wish to extend their appreciation and gratitude to Jill Tao, Kathleen Kiernan, and Phil Gibbs of SAS Technical Support for their very helpful assistance.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Aiken, L. S., West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Thousand Oaks, CA: SAGE.
Anderson, D. R. (2008). Model based inference in the life sciences: A primer on evidence. New York, NY: Springer.
Bauer, D. J., Gottfredson, N. C., Dean, D., Zucker, R. A. (2013). Analyzing repeated measures data on individuals nested within groups: Accounting for dynamic group effects. Psychological Methods, 18, 1-14. doi:10.1037/a0030639
Biesanz, J. C., Deeb-Sossa, N., Papadakis, A. A., Bollen, K. A., Curran, P. J. (2004). The role of coding time in estimating and interpreting growth curve models. Psychological Methods, 9, 30-52. doi:10.1037/1082-989X.9.1.30
Cohen, J., Cohen, P., West, S. G., Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Davis, P., Scott, A. (1995). The effect of interviewer variance on domain comparisons. Survey Methodology, 21, 99-106.
De Leeuw, J., Kreft, I. (1986). Random coefficient models for multilevel analysis. Journal of Educational and Behavioral Statistics, 11, 57-85.
Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press.
Enders, C. K., Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models: A new look at an old issue. Psychological Methods, 12, 121-138. doi:10.1037/1082-989X.12.2.121
Ferron, J., Dailey, R., Yi, Q. (2002). Effects of misspecifying the first-level error structure in two-level models of change. Multivariate Behavioral Research, 37, 379-403. doi:10.1207/S15327906MBR3703_4
Goldstein, H. (2011). Multilevel statistical models (4th ed.). West Sussex, UK: John Wiley.
Heck, R. H. (2000). Examining the impact of school quality on school outcomes and improvement: A value-added approach. Educational Administration Quarterly, 36, 513-552. doi:10.1177/0013161X00364002
Heck, R. H. (2007). Examining the relationship between teacher quality as an organizational property of schools and students’ achievement and growth rates. Educational Administration Quarterly, 43, 399-432. doi:10.1177/0013161X07306452
Heck, R. H., Hallinger, P. (2009). Assessing the contribution of distributed leadership to school improvement and growth in math achievement. American Educational Research Journal, 46, 659-689. doi:10.3102/0002831209340042
Hedeker, D., Gibbons, R. D. (2006). Longitudinal data analysis. New York, NY: John Wiley.
Hedges, L. V., Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis, 29, 60-87.
Hoffman, L. (2015). Longitudinal analysis: Modeling within-person fluctuation and change. New York, NY: Routledge.
Hofmann, D. A., Gavin, M. B. (1998). Centering decisions in hierarchical linear models: Implications for research in organizations. Journal of Management, 24, 623-641.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York, NY: Routledge.
Hox, J. J., Roberts, J. K. (Eds.). (2011). Handbook of advanced multilevel analysis. New York, NY: Routledge.
Kenward, M. G., Roger, J. H. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983-997.
Kiernan, K., Tao, J., Gibbs, P. (2012). Tips and strategies for mixed modeling with SAS/STAT procedures (Paper 332-2012). Orlando, FL: SAS Global Forum. Retrieved from http://support.sas.com/resources/papers/proceedings12/332-2012.pdf
Kreft, I. G., De Leeuw, J., Aiken, L. S. (1995). The effect of different forms of centering in hierarchical linear models. Multivariate Behavioral Research, 30, 1-21.
Kwok, O. M., West, S. G., Green, S. B. (2007). The impact of misspecifying the within-subject covariance structure in multiwave longitudinal multilevel models: A Monte Carlo study. Multivariate Behavioral Research, 42, 557-592. doi:10.1080/00273170701540537
Li, K. H., Raghunathan, T. E., Rubin, D. B. (1991). Large-sample significance levels from multiply imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association, 86, 1065-1073.
Liu, S., Rovine, M. J., Molenaar, P. (2012). Selecting a linear mixed model for longitudinal data: Repeated measures analysis of variance, covariance pattern model, and growth curve approaches. Psychological Methods, 17, 15-30. doi:10.1037/a0026971
Longford, N. T. (1993). Random coefficient models. New York, NY: Oxford University Press.
Mehta, P. D., West, S. G. (2000). Putting the individual back into individual growth curves. Psychological Methods, 5, 23-43. doi:10.1037/1082-989X.5.1.23
Muthén, L. K., Muthén, B. O. (1998-2015). Mplus user’s guide (7th ed.). Los Angeles, CA: Author.
Nezlek, J. B. (2001). Multilevel random coefficient analyses of event- and interval-contingent data in social and personality psychology research. Personality and Social Psychology Bulletin, 27, 771-785.
Pedhazur, E. J. (1997). Multiple regression in behavioral research: Explanation and prediction (3rd ed.). Fort Worth, TX: Harcourt Brace College Publishers.
Peugh, J. L. (2010). A practical guide to multilevel modeling. Journal of School Psychology, 48, 85-112. doi:10.1016/j.jsp.2009.09.002
Peugh, J. L. (2014). Conducting three-level cross-sectional analyses. Journal of Early Adolescence, 34, 7-37. doi:10.1177/0272431613498646
Peugh, J. L., Enders, C. K. (2005). Using the SPSS mixed procedure to fit cross-sectional and longitudinal multilevel models. Educational and Psychological Measurement, 65, 717-741. doi:10.1177/0013164405278558
Raudenbush, S. W., Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Newbury Park, CA: SAGE.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. Boca Raton, FL: Chapman & Hall.
Siddiqui, O., Hedeker, D., Flay, B. R., Hu, F. B. (1996). Intraclass correlation estimates in a school-based smoking prevention study: Outcome and mediating variables, by gender and ethnicity. American Journal of Epidemiology, 144, 425-433.
Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 23, 323-355. doi:10.3102/10769986023004323
Singer, J. D., Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York, NY: Oxford University Press.
Snijders, T. A. B., Bosker, R. J. (1994). Modeled variance in two-level models. Sociological Methods & Research, 22, 342-363.
Snijders, T. A. B., Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling. London, England: SAGE.
Verbeke, G., Lesaffre, E. (1997). The effect of misspecifying the random-effects distribution in linear mixed models for longitudinal data. Computational Statistics & Data Analysis, 23, 541-556.
Verbeke, G., Molenberghs, G. (2009). Linear mixed models for longitudinal data. New York, NY: Springer.
Wu, W., West, S. G., Taylor, A. B. (2009). Evaluating model fit for growth curve models: Integration of fit indices from SEM and MLM frameworks. Psychological Methods, 14, 183-201.
Wu, Y.-W. B., Wooldridge, P. J. (2005). The impact of centering first-level predictors on individual and contextual effects in multilevel data analysis. Nursing Research, 54, 212-216.

Author Biographies

James L. Peugh is an associate professor of pediatrics, and has a joint appointment as a quantitative psychologist and a biostatistician at the Cincinnati Children’s Hospital Medical Center. His research focuses on the use of Monte Carlo simulation techniques to test advanced cross-sectional, longitudinal, and multilevel latent variable (continuous and categorical) structural equation mixture models. He has publications in a variety of quantitative, educational, and psychological journals.

Ronald H. Heck is a professor and former Dai Ho Chun Chair in the College of Education at the University of Hawaii at Manoa. He has written extensively in the areas of educational policy, performance assessment, and research methods. He has publications in a wide variety of quantitative, educational, and public policy journals.
