In order to investigate causality in situations where random assignment is not possible, propensity scores can be used in regression adjustment, stratification, inverse-probability treatment weighting, or matching. The basic concepts behind propensity scores have been extensively described. When data are longitudinal or missing, the estimation and use of propensity scores become a challenge. Traditional methods of propensity score estimation delete cases listwise. Missing data estimation by multiple imputation can be used to alleviate problems due to missing values, if performed correctly. Longitudinal studies are another situation in which propensity score use may be difficult because of attrition and the need to account for propensities that may vary over time. This article discusses the issues of missing data and longitudinal designs in the context of propensity scores. The syntax, datasets, and output used for these examples are available at http://jea.sagepub.com/content/early/recent for readers to download and follow.

In our previous article (Beal & Kupzyk, 2014), we reviewed the concept of propensity scores, how they can be estimated, and how they can be used to account for confounding variables to support causal inference in situations where random assignment is not possible. A propensity score is the probability of assignment to or membership in a group, conditional on a set of observed covariates (Rosenbaum & Rubin, 1983). The scores can be estimated using a logistic regression model with group membership as the outcome variable. The saved probability of group membership for each individual in the dataset is the propensity score variable; that is, the probability, or propensity, of being in the focal group conditional on the set of covariates entered into the model. A variety of analytic approaches with propensity scores can then be conducted (Austin, 2011). Matching can be performed to create a subsample of participants from each group who are similar in their propensity or probability of being in the focal group, to account for preexisting differences between groups (Bai, 2011; Fan & Nowell, 2011). Data can be stratified based on propensity scores to account for nonlinearity in relations between treatment and outcome, dependent on the probability of receiving treatment (Linden & Adams, 2008). Finally, propensity scores can be used to account for measured confounds via adjustment in a multiple regression model (D’Agostino, 1998). Our previous article described each of these techniques. Propensity scores can also be used via inverse-probability treatment weighting (IPTW) to adjust the contribution of observations based on likelihood of receiving treatment, as described by Robins and Hernán (2009). This article highlights several situations that make estimation of a propensity score challenging, regardless of how the propensity score is used. First is the problem of missing data.
Traditional methods of propensity score estimation will exclude a case if missing data are present on any of the covariates included. The second situation is when research using propensity scores is carried out longitudinally.
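To make the estimation step concrete, the sketch below fits a logistic regression with group membership as the outcome and saves each case's predicted probability of being in the focal group as its propensity score. The simulated data, covariate names, and use of Python's scikit-learn are our own illustrative assumptions; the article's accompanying materials use SAS syntax.

```python
# Sketch: estimating propensity scores with logistic regression.
# Data and covariate names here are hypothetical, not from the article.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Two hypothetical baseline covariates
age = rng.normal(14, 2, n)
ses = rng.normal(0, 1, n)
X = np.column_stack([age, ses])
# Group membership (1 = focal group), related to the covariates
treat = (0.3 * (age - 14) + 0.8 * ses + rng.normal(0, 1, n) > 0).astype(int)

model = LogisticRegression().fit(X, treat)
# The saved probability of membership in the focal group is the propensity score
pscore = model.predict_proba(X)[:, 1]
```

Note that a case with a missing value on `age` or `ses` could not be scored by this model, which is precisely the listwise-deletion problem discussed below.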

Having some amount of missing data is a common occurrence in social science research, especially in longitudinal studies (for a review, see Enders, 2010), and there are different reasons data may be missing and different methods of handling missing data. The three types, or mechanisms, of missing data originally described by Rubin (1976) are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The ideal case is MCAR, in which none of the variables, dependent or independent, are related to or predict missingness. Traditional analysis methods that implement listwise deletion assume that data are MCAR, and methods are available to establish that data are not MCAR (see Enders, 2010; Raykov, 2011, for further details). It is important to note that MCAR cannot be proven to be true but can be refuted to a reasonable degree with a null hypothesis test. Statistical methods that use maximum-likelihood (ML) estimation assume that data are MAR, meaning that the probability of missingness in one variable may be related to another variable in the model (Enders, 2010). If missingness on a variable is due to the unobserved value of the variable itself, data are said to be MNAR. MNAR can also arise from unobserved variables: if a variable that causes missingness is omitted from the model, a possible predictor of the variable with missing data is left out. Although there are some statistical methods that can account for MNAR (see Enders, 2011, for details), such as pattern-mixture models, most analysis methods used in research cannot, and there is no way to test whether data are MNAR. As a result, when independent variables or covariates are associated with missing data on an outcome, analysts can only assume that missingness is MAR at best. There are many articles that have addressed the types of missing data and further issues related to missing data.
While a review of that literature is outside the scope of this article, an understanding of missing data handling in general is an important foundation for understanding missing data with propensity scores. We encourage readers to review Enders (2010) and Graham (2009) to gain a better understanding of the issues more generally. Now, we turn to the features of missing data that pose particular challenges to propensity scores, in the context of three approaches to handling missing data: listwise deletion, multiple imputation (MI), and full-information maximum-likelihood (FIML) estimation.

Listwise Deletion

Listwise deletion is an approach wherein only those cases with complete data on all variables in the model of interest are included in the analyses (Enders, 2010). By default, a logistic regression model, which is used to estimate propensity scores, deletes cases in a listwise manner when outcomes or predictor variables are missing. Cases that have a missing value on any one of the covariates in the model will be excluded from the analysis and will not be assigned a propensity score. Having many covariates in the model, while increasing the precision of the propensity score, also greatly increases potential loss of cases due to missing values. Hong and Yu (2008) used approximately 200 background variables to estimate propensity scores and match children who had been retained in school to those who were permitted to advance to the next grade in order to assess the effects of different retention policies. Such a large number of covariates greatly increased the chance of losing cases in an analysis due to listwise deletion. In this type of situation, researchers may choose to exclude certain covariates due to high levels of missing values and only include variables in propensity score models that have complete data. While this strategy would defend against loss of cases, the covariates excluded may have been important for the accurate estimation of propensity scores. Failing to include variables known to be important for treatment leads to bias in the estimation of propensity scores, negating the use of propensity scores for causal inference.

MI

Imputation deals with missing data by predicting, or imputing, the missing data based on the correlations observed between the variables in the dataset (Enders, 2010). Imputed datasets thus contain model-predicted values (imputed values in place of previously missing data), with some amount of error added to maintain unbiased standard errors. Imputing values only once (i.e., single imputation) is known to result in underestimated standard errors (Schafer, 1999). MI (Rubin, 1987) has become a commonly used method for alleviating missing data problems by creating a number of imputed datasets (a minimum of five; Schafer & Olsen, 1998), performing analyses on each dataset, and combining the parameter estimates and standard errors using a set of combination rules informally known as Rubin’s rules (1987; see also Schafer & Olsen, 1998). The rules take into account the variability in the parameters and standard errors across the imputations and provide a final statistical test for the outcome of interest. The SAS MI procedure can be used to run MI, and PROC MIANALYZE summarizes the results using the combination rules. Importantly, this approach may not be appropriate in instances where missingness is not at random (i.e., MNAR; Enders, 2010).
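To make the combination step concrete, the following sketch pools a single parameter across imputations using Rubin's rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance combines the average within-imputation variance with the between-imputation variance. The estimates and standard errors below are made-up illustrative numbers; in practice, PROC MIANALYZE performs this pooling.

```python
# Sketch of Rubin's (1987) rules for pooling one parameter across m
# imputations; the inputs below are illustrative, not from the article.
import math

def pool(estimates, std_errors):
    """Combine per-imputation estimates and standard errors."""
    m = len(estimates)
    qbar = sum(estimates) / m                     # pooled point estimate
    w = sum(se ** 2 for se in std_errors) / m     # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    t = w + (1 + 1 / m) * b                       # total variance
    return qbar, math.sqrt(t)

# Five imputations (the minimum recommended by Schafer & Olsen, 1998)
est, se = pool([0.52, 0.48, 0.55, 0.50, 0.45], [0.10, 0.11, 0.10, 0.12, 0.10])
```

The pooled standard error (about .114 here) exceeds every per-imputation standard error because the between-imputation variability is added in; this is how the rules propagate imputation uncertainty into the final statistical test.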

In order to use MI efficiently, variables that are associated with the outcome of interest need to be included in the imputation model (Enders, 2010). The interdependence between variables is utilized to predict the missing values. Variables included in the imputation model can be continuous or categorical, and are referred to as background variables, covariates, or auxiliary variables if they are not part of the analytic model to be carried out. Including more variables that are associated with the outcome will reduce bias in the imputations. Collins, Schafer, and Kam (2001) determined that a more inclusive strategy for MI of including as many variables as possible had increased efficiency and less bias, as compared with a more restrictive strategy. Including too many variables, however, could produce increased variability in the prediction. Hardt, Herke, and Leonhart (2012) found that inclusion of more than 10 variables decreases precision and underestimates regression coefficients when missing data rates exceed 40%. When missing data rates exceeded 20%, bias increased when the number of auxiliary variables was greater than 20. Unfortunately, no agreed-upon decision rule exists for determining how many or which variables to include. The Multivariate Imputation by Chained Equations (MICE) package for R uses a correlation cutoff of .1 between candidate variables and the outcome to decide which auxiliary variables to include (van Buuren & Groothuis-Oudshoorn, 2011), while Enders (2010) suggested a more stringent criterion of .4 in determining whether auxiliary variables should remain in the imputation model.
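A correlation-based screen of this kind can be sketched as follows. The simulated data and cutoff logic are our own illustration in the spirit of the MICE default (|r| ≥ .1) and Enders's stricter .4 criterion, not the packages' actual implementations.

```python
# Sketch: screening candidate auxiliary variables by their correlation with
# the variable to be imputed. All data here are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 500
outcome = rng.normal(size=n)
candidates = {
    "strong_aux": 0.6 * outcome + 0.8 * rng.normal(size=n),  # r near .6
    "moderate_aux": 0.3 * outcome + rng.normal(size=n),      # r near .29
    "noise_aux": rng.normal(size=n),                         # r near 0
}

def select(aux, y, cutoff):
    """Keep auxiliary variables whose absolute correlation meets the cutoff."""
    return [name for name, x in aux.items()
            if abs(np.corrcoef(x, y)[0, 1]) >= cutoff]

mice_style = select(candidates, outcome, 0.1)  # lenient cutoff keeps more
strict = select(candidates, outcome, 0.4)      # stringent cutoff keeps fewer
```

The lenient cutoff retains the moderately correlated variable; the stringent cutoff drops it, illustrating how the choice of criterion changes the imputation model.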

Adding to the complexity of MI with propensity scores, Hardt et al. (2012) suggested that the reason adding too many variables increases bias is the instability introduced when the ratio of cases to variables becomes too low. The authors suggest that the number of complete cases should be at least 3 times the number of auxiliary variables. The highest number of auxiliary variables that can be used, therefore, is a function of both the amount of missing data and the sample size of the study. The more data that are missing and the smaller the dataset, the poorer MI performs. This, combined with the lack of clear direction as to which auxiliary variables should be included, suggests that researchers should be careful to begin any imputation (and analysis more generally) by examining descriptive and bivariate statistics, including rates of missing data and correlations between predictors and outcomes. Such information should be used when deciding whether to impute and which predictors should be included in the imputation model. With too much missing data, estimation may be so poor that MI does more harm than good. Situations differ, so there is no one best way to handle missing data. When deciding what approach to use for imputation, we encourage readers to review Enders (2010) and similar sources for guidance.

As with any analysis using MI, researchers must determine when to aggregate results to have one summarized finding. Mitra and Reiter (2012) compared two approaches to combining propensity scores and MI. In the first method, the “within” approach, matching and analysis are performed on each imputed dataset, with the treatment effect averaged across the imputations (see also Mattei, 2009). Using the second method, the “across” approach, propensity scores are estimated for each imputation and averaged across imputed datasets prior to matching. Although the second approach (averaging scores, then matching) is much simpler, it was found to produce more biased results than averaging treatment effects across the matched samples from each imputation. These findings, together with the use of MI for other analyses, suggest that researchers should perform analyses within each imputed dataset, and then combine findings.
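The mechanics of the two pooling strategies can be sketched as follows. The propensity scores, outcomes, and simple greedy matcher below are illustrative stand-ins, not Mitra and Reiter's actual procedure; the point is only where the averaging happens.

```python
# Sketch contrasting the "within" and "across" approaches of Mitra and
# Reiter (2012). The per-imputation propensity scores are made-up numbers
# standing in for scores estimated on three imputed datasets.
import numpy as np

def nn_match(ps, treat):
    """Nearest-neighbor matching on the propensity score; returns the control
    index matched to each treated case (controls reused for simplicity)."""
    treated = np.flatnonzero(treat == 1)
    controls = np.flatnonzero(treat == 0)
    return {t: controls[np.argmin(np.abs(ps[controls] - ps[t]))] for t in treated}

treat = np.array([1, 1, 1, 0, 0, 0])
y = np.array([5.0, 6.0, 7.0, 4.0, 5.5, 6.5])       # outcome values
ps_imps = [np.array([.8, .6, .7, .5, .6, .7]),      # imputation 1
           np.array([.7, .5, .6, .4, .6, .8]),      # imputation 2
           np.array([.9, .6, .8, .5, .7, .6])]      # imputation 3

# "Within": match and estimate the effect in each imputation, then average
effects = []
for ps in ps_imps:
    pairs = nn_match(ps, treat)
    effects.append(np.mean([y[t] - y[c] for t, c in pairs.items()]))
within_effect = np.mean(effects)

# "Across": average the propensity scores first, then match once
ps_avg = np.mean(ps_imps, axis=0)
pairs = nn_match(ps_avg, treat)
across_effect = np.mean([y[t] - y[c] for t, c in pairs.items()])
```

Even on this toy example the two strategies can yield different matched sets and therefore different treatment effect estimates, which is why the choice between them matters.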

When planning how to proceed with propensity score estimation and matching across imputations, the analyst will also need to decide whether or not to use a caliper, which is the maximum allowable distance in the propensity score metric for a control participant to be matched with a case. Most often, analysts will use simple nearest-neighbor matching. When performing MI, nearest-neighbor matching will result in equal sample sizes across imputations. If a caliper is used, however, the sample sizes across imputations may differ if certain cases cannot be matched to a control within the caliper. Similarly, choosing to exclude treatment cases with propensity scores that do not overlap with control cases (and vice versa) will also introduce unequal sample sizes across imputations. Rubin’s rules for combining parameter estimates do not take the sample size of the imputations into account directly. If some imputations have a slightly smaller sample size, that should be reflected in the slightly larger standard errors for the parameter estimates from those imputations. Because Rubin’s combination rules (Rubin, 1987) do take into account the standard errors from each imputation, unequal sample sizes across imputations are not a cause for concern when the MI and MIANALYZE procedures are used in SAS. However, researchers should be aware that unequal sample sizes may be problematic when combining imputation estimates in other statistical packages (e.g., SPSS, R).
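The effect of a caliper on sample size can be seen in a small sketch. The greedy matcher and the score values are our own illustration; real matching software offers more sophisticated options.

```python
# Sketch of greedy nearest-neighbor matching with an optional caliper
# (maximum allowable propensity score distance); scores are illustrative.
import numpy as np

def greedy_match(ps_treated, ps_control, caliper=None):
    """Match each treated score to the nearest unused control score.
    Returns (treated_index, control_index) pairs; with a caliper, treated
    cases with no control inside the caliper are left unmatched."""
    unused = list(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        dists = [abs(ps_control[j] - p) for j in unused]
        k = int(np.argmin(dists))
        if caliper is None or dists[k] <= caliper:
            pairs.append((i, unused.pop(k)))
    return pairs

ps_t = [0.62, 0.55, 0.90]
ps_c = [0.60, 0.50, 0.58, 0.20]

all_matched = greedy_match(ps_t, ps_c)                # every treated case matched
with_caliper = greedy_match(ps_t, ps_c, caliper=0.1)  # .90 has no close control
```

Without a caliper, all three treated cases are matched; with a caliper of .1, the treated case at .90 is left unmatched, so the analysis sample shrinks. Run across imputations, this is exactly how unequal sample sizes arise.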

Performing propensity score matching in conjunction with MI raises another analytic issue, that of balance checking. When matching is typically carried out, the propensity score model may be modified several times until a set of participants have been matched that achieve balance on demographic variables and pretreatment covariates (Rosenbaum & Rubin, 1984). When performing MIs with matching done on each imputation, it is possible that balance may be achieved on some of the imputations, while groups differ on other imputations. With a small amount of missing data, it is likely that no differences will occur across imputations, but this may not always be the case. We were unable to find any literature that considered what course of action to take in the event that imputations differ on balance after matching. One could attempt to improve the propensity score model by adding predictors, and then reestimate until balance is achieved across all imputations. Alternatively, the number of imputations could be increased until the desired number of imputations with balanced groups is achieved, and then only balanced imputations could be used for subsequent analyses. This is akin to simulation studies, or Monte Carlo studies, where datasets can be produced that do not converge, and analysts may continue creating new replications until the desired number of converged solutions is attained. More research is needed to empirically determine how to proceed when balance is not achieved across all imputed datasets.
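A common way to operationalize a balance check is the standardized mean difference (SMD) on each covariate in the matched sample, often with |SMD| < 0.1 used as a rough benchmark. The formula below (pooled standard deviation in the denominator) and the toy covariate values are our own illustration.

```python
# Sketch of a balance check on one covariate in a matched sample using the
# absolute standardized mean difference; values are illustrative.
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = np.sqrt((np.var(x_treated, ddof=1) + np.var(x_control, ddof=1)) / 2)
    return (np.mean(x_treated) - np.mean(x_control)) / pooled_sd

age_t = np.array([14.0, 15.0, 13.5, 14.5])  # matched treated cases
age_c = np.array([14.2, 14.8, 13.6, 14.4])  # matched control cases
balance_ok = abs(smd(age_t, age_c)) < 0.1   # common (not universal) benchmark
```

With MI, a check like this would be repeated for every covariate within every imputed dataset, which is where imputations can disagree on whether balance was achieved.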

FIML Estimation

FIML uses all available data to estimate parameters for the population of interest, without losing cases in the way listwise deletion does (Enders, 2010). Many studies have compared FIML with MI and listwise deletion, with varying results. Cheung (2007) conducted a simulation study using latent growth models under the assumption that data were MCAR. The results showed that FIML and listwise deletion both outperformed MI in terms of correctly estimating standard errors. FIML is always recommended over listwise deletion because all available data can be utilized (Enders, 2010). Cheung did not examine the MAR case, under which the results may be different, but this and the previously discussed results show that MI is not always better, and at times may be worse, than ML estimation. MI also performs particularly poorly under simulated conditions of MNAR.

While FIML is known to perform at least as well as MI in many analytic approaches (Enders, 2010), it is not particularly well suited to working with propensity scores. When propensity scores are estimated using logistic regression with FIML, a score will not be assigned to any case that is missing one or more of the covariates. FIML works well only when data are missing on the outcome of the propensity score model (i.e., treatment), because in those instances it will still provide a propensity score estimate. When covariates are missing, however, cases without an estimated propensity score would subsequently be excluded from further analyses. While excluding cases because of missingness always has the potential to introduce bias, it is particularly concerning when coupled with small sample sizes, because the resulting analyses may be underpowered to detect significant effects. Some propensity score techniques (e.g., matching) almost always reduce the sample further. Therefore, if researchers plan to use FIML to estimate propensity scores and then match, they must ensure that the final sample is large enough to detect significant results. While propensity score adjustment does not reduce the sample through matching, any case that is missing data on any of the covariates in the propensity score model will not have a propensity score assigned to it and will therefore be dropped when the propensity score is used as a predictor variable, essentially resulting in listwise deletion. In either case, matching or covariate adjustment, MI may be the only reasonable solution for dealing with missing data in an analysis involving propensity scores. With MI, the imputations replace all points of item-level nonresponse with estimated values, so every case is assigned a propensity score.

Comparing Missing Data Approaches With Propensity Scores

Mattei (2009) provided information on methods for combining missing data estimation and propensity score matching. She compared several methods of handling missing background data, including a complete-case approach (traditional), a pattern-mixture model approach developed by Rosenbaum and Rubin (1984), and MI. No single approach was uniformly superior, although Mattei did find that an MI approach with the outcome variable included in the imputation model outperformed the other approaches. Under the technique Mattei describes, the propensity score model fitting takes place on the imputed datasets, and both matching and stratification are used. In the matching demonstration, propensity scores were estimated and matching was performed for each imputation, with the average treatment effect then calculated across imputations. For stratification, the average treatment effect was calculated for each imputed dataset and then averaged across imputations. While more research and replication of findings are needed to resolve the question of how to appropriately address missing data in studies involving propensity scores, the studies reviewed thus far seem to indicate that when researchers have missing data patterns that are MCAR or MAR, either FIML estimation methods or MI could be used to account for missingness (Enders, 2010). When deciding which approach is more appropriate, however, it is important to consider the rates of missingness and how much information is available to predict missing values, as well as the impact of using FIML (as compared with MI) on sample size.

Up to this point, we have primarily discussed propensity score analysis assuming cross-sectional data. Using propensity scores in cross-sectional designs is much more straightforward than in studies involving repeated measures, where there are multiple ways in which the scores can be used. When propensity scores are utilized in a longitudinal setting, the most common approach in the literature is to match at baseline, and then carry out longitudinal statistical methods using the matched groups (e.g., Slade et al., 2008). This approach is most consistent with a randomized controlled trial (Shadish, Cook, & Campbell, 2002). In that instance, the implicit assumption is that individuals are not changing treatment groups. There are several examples of this in the literature. Kohls and Walach (2008) used propensity score matching to select groups of spiritual versus nonpracticing individuals to compare changes in psychological outcomes. Repeated-measures ANOVA was performed as it normally would be after matching. John, Wright, Duku, and Williams (2008) used longitudinal hierarchical linear models with samples matched at baseline to compare behavioral outcomes of youth who did and did not receive arts instruction.

Although most longitudinal studies implementing propensity scores use matching, some have used stratification to determine differences in treatment effects for individuals with differing propensity for receiving treatment (for a description of how to use stratification with propensity scores, see Beal & Kupzyk, 2014). Hodges and Grunwald (2005) performed repeated-measures ANCOVA with stratification accounted for at baseline, in order to assess the effects of a home-based mental health program on youth at risk. In this case, stratum was used as a factor instead of performing separate analyses within each stratum. A treatment by stratum interaction was included along with the overall treatment effect. While this method may be preferable to conducting stratum-specific analyses due to the pooled error term, the authors did not discuss it in great detail, and we were unable to find any other examples of this method being used. One could also conduct separate analyses for each stratum, which may make it easier to see and interpret how the results may differ at each level of probability for treatment, prior to averaging the treatment effect. Adding stratum as a factor may account for some of the variability in the outcome of interest across strata but does not directly assess how the relationship between treatment and the outcome differs across strata. If an interaction term were to be added, that would indicate whether the treatment effect varies across strata.
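The stratum-as-factor model can be sketched as a single regression with treatment, stratum dummies, and their interaction. The noiseless made-up data below are chosen so the cell effects are exactly recoverable; this is an illustration of the model structure, not of Hodges and Grunwald's analysis.

```python
# Sketch of the stratum-as-factor approach: one model with treatment,
# propensity score stratum, and their interaction, fit by ordinary least
# squares on noiseless illustrative data.
import numpy as np

# Two observations per treatment-by-stratum cell, three strata
treat = np.array([0, 0, 1, 1] * 3)
stratum = np.repeat([0, 1, 2], 4)
# True model: baseline 10, stratum shifts +1/+2, and a treatment effect of
# 2, 3, 4 in strata 0, 1, 2 (the effect grows with propensity for treatment)
effect_by_stratum = np.array([2.0, 3.0, 4.0])
y = 10 + 1.0 * stratum + treat * effect_by_stratum[stratum]

# Design matrix: intercept, treat, stratum dummies, treat-by-stratum dummies
d1, d2 = (stratum == 1).astype(float), (stratum == 2).astype(float)
X = np.column_stack([np.ones_like(y), treat, d1, d2, treat * d1, treat * d2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[1] is the stratum-0 treatment effect; beta[4] and beta[5] are the
# interaction terms showing how the effect differs in strata 1 and 2
```

Nonzero interaction coefficients are exactly what signals that the treatment effect varies across strata, which is the question the interaction term is meant to answer.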

Alternatively, a researcher could use regression adjustment in a longitudinal study where all predictors are measured only at baseline and the outcome is measured at one or more subsequent time points, a procedure that would be similar to using regression adjustment with cross-sectional data. With this approach, the propensity score would be estimated at baseline and included in analyses that predict outcomes occurring post-baseline (e.g., Time 2 outcome, outcome slope). Of note, with this approach the underlying assumption is that treatment condition is time invariant (i.e., treatment condition would never change across time). As an example, Prendergast and colleagues examined whether women-only or mixed-gender substance abuse treatment was more effective in preventing recidivism 12 months later, using propensity score adjustment at baseline to account for existing bias in who receives which program (Prendergast, Messina, Hall, & Warda, 2011). The use of propensity score adjustment at baseline indicates that (a) the authors did not expect enrollment in treatment to occur after baseline and (b) participants could only be in one of the two treatment conditions (i.e., not switching from a women-only program to a mixed-gender program). Importantly, propensity score adjustment will reduce bias because it accounts for confounds between the treatment conditions; however, this approach is generally known to be less effective than other techniques (e.g., matching) in reducing confounding bias (Austin, 2011).

A fourth option, IPTW, can also be used. IPTW uses the inverse of the propensity score to adjust the relative contribution of each case to the overall model, thereby creating an effect estimate for the pseudo-population, where some individuals in the sample population may represent fewer than one case, and others may represent more than one case (Robins & Hernán, 2009). While several approaches to estimating IPTW exist (see Robins & Hernán, 2009, for further discussion), the most straightforward approach is to use the formula wi = (Zi / pi) + ([1 − Zi] / [1 − pi]), where wi represents the weight for a given individual, Zi represents the treatment condition for a given individual, and pi represents the propensity score for a given individual (Austin, 2011). Take, for example, two cases in the treatment condition—one with a treatment probability of .9 (high) and the other with a treatment probability of .2. The individual who was in the treatment condition (Z = 1) with high probability for treatment would be weighted to represent 1.11 cases in the pseudo-population, while the individual with a low probability for treatment would be weighted to represent five cases in the pseudo-population. In contrast, an individual with a low probability for treatment who was in the control condition (Z = 0) would be weighted to represent 1.25 cases in the pseudo-population, whereas an individual in the control condition with a high probability for treatment would be weighted to represent 10 cases in the pseudo-population.
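The weighting formula and the worked example above can be computed directly; the function name here is our own, but the formula is the one given in the text.

```python
# The IPTW formula from the text: w = Z/p + (1 - Z)/(1 - p), where Z is
# treatment condition (1/0) and p is the propensity score.
def iptw(z, p):
    return z / p + (1 - z) / (1 - p)

treated_high = iptw(1, 0.9)  # 1/.9, about 1.11 pseudo-cases
treated_low = iptw(1, 0.2)   # 1/.2  = 5 pseudo-cases
control_low = iptw(0, 0.2)   # 1/.8  = 1.25 pseudo-cases
control_high = iptw(0, 0.9)  # 1/.1  = 10 pseudo-cases
```

The control case with a .9 propensity illustrates the extreme-weight problem discussed below: as p approaches 0 or 1, one term of the formula blows up for cases observed in the unlikely condition.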

Using this technique, cases can be weighted at baseline and outcomes at one or more subsequent time points could be predicted; the contributions of cases where individuals could plausibly have been in the other condition (e.g., low probability for treatment but in treatment condition, high probability of treatment but in control condition) are elevated relative to cases whose observed condition was consistent with their propensity score. For example, Bassok (2010) used IPTW to estimate the impact of race on subsequent preschool involvement. The assumption in this instance is again that treatment (i.e., race) is not changing across time. In most cases, IPTW leads to similar estimated effects of treatment compared with matching (Kurth et al., 2006). However, cases in the data with extreme scores (e.g., .01, .99) will receive very large weights, which can lead to IPTW performing poorly (Kurth et al., 2006). Researchers should be cautious when interpreting results if this is the case. In addition, individuals with a propensity score of 0 or 1 will not have treatment weights assigned; as noted by Linden and Adams (2010), the population for IPTW therefore becomes individuals for whom the probability of treatment is greater than 0 but is also not 100%. It is also worth noting that up to this point, we have been considering analyses where covariates and treatment are time invariant, that is, they are not changing across time in longitudinal designs.

Some studies have taken more than one strategy in longitudinal settings and compared results. Dunlay, Pack, Thomas, Killian, and Roger (2014), for example, used propensity score adjustment, IPTW, and matching at baseline to compare patients who attended cardiac rehabilitation with those who did not on readmission and death rates. The authors found that all three methods report a reduction in risk for participants compared with nonparticipants, but stated that although the methods agreed, IPTW was the most conservative. Similarly, Bassok (2010) tested whether the effects of preschool attendance on cognitive outcomes varied by race using regular ordinary least squares regression, propensity score matching, and IPTW. Although the significance of some covariates differed between methods, the effects of interest were very similar. It may not always be the case that these different methods produce similar results, but researchers will likely benefit from testing more than one method. Whether or not methods differ, the results could provide interesting and important points of discussion.

Time-Varying Covariates

In a longitudinal design where background information is obtained only at baseline (i.e., it is time invariant; Singer & Willett, 2003), the only option for propensity score estimation is to use baseline data. In this instance, we are typically interested in how groups differ in change on a longitudinal outcome, as tested by a time by group interaction, where time varies and treatment is fixed. If, however, there are time-varying covariates, then propensity scores could logically change over time, and separate estimation of propensity scores at each time point may be appropriate (Robins & Hernán, 2009). Consider as an example smoking behavior in adolescents over time. As youth age, they may be more likely to start smoking for many reasons, including exposure to friends who smoke, increased stress levels, or more willingness to experiment (Williams, 1973). In this case, the likelihood of being a smoker can increase throughout the course of a study, and youth may indeed change from being a nonsmoker to a smoker during a study. As a result, both the treatment (smoking) and the predictors of treatment (e.g., friends who smoke) are time varying, and failure to account for that in a longitudinal analysis may lead to biased results (Robins & Hernán, 2009).

To understand the challenges in analyzing these data with propensity scores, we must keep in mind that a key difference between the use of propensity scores in time-invariant versus time-varying treatment situations is the assumption of what causes treatment. Specifically, in time-invariant situations treatment is treated as exogenous (i.e., not systematically affected in response to changes in other variables in the model; Gunasekara, Carter, & Blakely, 2008). However, treatment post-baseline is often endogenous (i.e., systematically affected in response to changes in other variables in the model) because it is dependent on or predicted by previous treatment status, predictors, or covariates. To build on our smoking example, when smoking status does not vary across time for most participants (e.g., teens who never smoke, teens who always smoke), smoking status at Times 2 and 3 would be strongly predicted by smoking status at the previous time point. In contrast, when smoking status changes often across time, it is presumably because of the covariates used to estimate the propensity score (e.g., best friend started smoking at Time 2, teen started smoking at Time 3). Failing to account for this in a model using propensity scores would introduce bias because the model omits a variable that is known to cause changes in the model’s predictors and outcomes (i.e., an endogenous variable is assumed to be exogenous; Bollen & Bauldry, 2011). Making such an error is known to result in biased estimates of the effect of the predictor on the outcome—in this case, failing to account for time-varying effects of treatment and/or covariates would result in biased estimates of treatment effects on the outcome. Given that propensity score techniques in particular emphasize determining cause using observational data, this bias is an added concern.
Thus, it is critical that researchers accurately specify treatment and covariates as time varying when they do change across time.

There are several analytic approaches that can account for time-varying propensity for treatment. One option is to use IPTW, where propensity scores are used as weights to adjust the relative contribution of an individual based on their probability of receiving treatment (Hernán, Lanoy, Costagliola, & Robins, 2006) either as time invariant or as time varying (Robins & Hernán, 2009). A second approach is to adjust for time-varying propensity scores in a regression framework without using weighting. Achy-Brou, Frangakis, and Griswold (2010) demonstrated how propensity scores can be estimated at each time point in a longitudinal study. In that study, propensity scores were used as time-varying covariates. Several authors have advised using caution when attempting to use propensity scores as covariates in a regression model as was done by Achy-Brou et al. due to the sensitivity to unequal variances and covariances between groups (D’Agostino, 1998) and potential for nonlinearity in the effect on the outcome of interest (Austin & Mamdani, 2006). When covariates and treatment are time varying, however, challenges related to matching or stratifying a sample across multiple waves (described further below) may make using propensity scores as time-varying covariates in regression models (either through regression adjustment or through IPTW) the only feasible course of action.
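Estimating a separate propensity score at each wave can be sketched as a loop over time points. The data generation (peer smoking and stress as wave-specific covariates drifting upward as youth age) is hypothetical and mirrors the smoking example above; in practice, each wave's observed covariates and treatment status would be used.

```python
# Sketch: wave-by-wave propensity score estimation when treatment (smoking)
# and its predictors are time varying. All data here are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, waves = 300, 3
pscores = np.empty((n, waves))
for w in range(waves):
    # Hypothetical wave-specific covariates, drifting upward over time
    peers = rng.normal(0.2 * w, 1, n)
    stress = rng.normal(0.1 * w, 1, n)
    X = np.column_stack([peers, stress])
    # Observed smoking status at this wave, related to the covariates
    smoker = (0.9 * peers + 0.5 * stress + rng.normal(0, 1, n) > 1).astype(int)
    pscores[:, w] = LogisticRegression().fit(X, smoker).predict_proba(X)[:, 1]
# pscores[:, w] can then enter wave w's analysis as a time-varying covariate
# (or be converted to inverse-probability weights)
```

The result is one propensity score per person per wave, which is what both the time-varying IPTW and the time-varying regression adjustment approaches require.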

When treatment is time varying, matching or stratification is not recommended because of the increased bias when time-varying treatment effects are not accounted for (Robins & Hernán, 2009). Matching is also much less practical in this setting. If matching were performed separately at multiple time points, cases would likely be placed in different groups over time or left unassigned at some point. Furthermore, in a randomized controlled trial, which propensity score methods are intended to mimic, participants are not permitted to change groups or conditions; when they do, they may be excluded from the analysis after the point at which they switched groups (Nye, Hedges, & Konstantopoulos, 2000; Sheridan et al., 2014; Sheridan, Knoche, Kupzyk, Edwards, & Marvin, 2011). For researchers interested in conducting propensity score analysis with time-varying covariates, careful consideration of the hypotheses to be tested and the constructs included should guide decisions about when to match (i.e., baseline vs. at multiple time points) and how to account for individuals who change treatment groups after baseline.

Two additional approaches to time-varying effects and propensity scores are regression adjustment (particularly within the context of hierarchical linear modeling [HLM]) and IPTW. Including time-varying propensity scores in an HLM is fairly straightforward: the propensity score is calculated at each time point and included as a time-varying covariate in the model. This allows the likelihood of treatment to vary within individuals across time. As with any time-varying covariate, the effect of treatment would be adjusted for propensity to receive treatment. Importantly, while including the propensity score as a covariate that adjusts the effect of treatment is potentially useful, it is difficult to interpret and does not make a compelling case for causal inference because it does not completely eliminate bias from the findings (Austin, 2011). This bias arises primarily because, unlike other propensity score methods, regression adjustment does not compare the effects of treatment on the outcome when treatment is equally likely for both groups. Instead, adjustment can occur even if propensity scores differ systematically between treatment and control conditions. Such differentiation would indicate that there are distinct populations in the study, rather than a homogeneous sample containing a mix of those who did and did not receive treatment.

IPTW has been more prominently investigated and used when treatment effects or covariates change across time. As described by Robins and Hernán (2009), weights can be estimated at each time point and used in a subsequent model to adjust the relative contribution of each individual at each time point to the overall estimate of the effect of treatment on the outcome variable. Using this approach, and assuming that the proper covariates have been selected, the data are adjusted in such a way that the sample simulates what would have been expected if treatment had been randomized, allowing for strong causal inference. For example, Sampson, Sharkey, and Raudenbush (2008) used IPTW to estimate the impact of neighborhood disadvantage on subsequent verbal ability. Their findings suggested that exposure to concentrated neighborhood poverty has detrimental effects on children’s verbal ability, an effect that is diminished but persistent when children move out of those communities.
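The weight construction behind IPTW can be illustrated with a small, hypothetical sketch. The Python below (all names and data values are ours, not drawn from the cited studies) computes stabilized weights, 1/p for treated cases and 1/(1 - p) for controls, each multiplied by the marginal treatment probability, and then a weighted mean difference in the outcome:

```python
def iptw_weights(treated, ps, stabilized=True):
    """Inverse-probability-of-treatment weights: 1/ps for treated,
    1/(1 - ps) for controls; optionally stabilized by marginal P(T=1)."""
    p_t = sum(treated) / len(treated)
    w = []
    for t, p in zip(treated, ps):
        base = 1.0 / p if t == 1 else 1.0 / (1.0 - p)
        num = p_t if t == 1 else (1.0 - p_t)
        w.append(num * base if stabilized else base)
    return w

def weighted_effect(treated, outcome, w):
    """Difference in weighted outcome means (treated minus control)."""
    def wmean(group):
        num = sum(wi * yi for t, yi, wi in zip(treated, outcome, w) if t == group)
        den = sum(wi for t, wi in zip(treated, w) if t == group)
        return num / den
    return wmean(1) - wmean(0)

# Hypothetical single-wave illustration
treated = [1, 1, 0, 0, 1, 0]
ps      = [0.8, 0.6, 0.3, 0.2, 0.5, 0.4]
outcome = [1.0, 1.2, 1.5, 1.6, 1.1, 1.4]
w = iptw_weights(treated, ps)
effect = weighted_effect(treated, outcome, w)
```

In a longitudinal application, the weights would be estimated wave by wave and typically cumulated across waves, as Robins and Hernán (2009) describe; the sketch above shows only the single-wave building block.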

Longitudinal Studies With Missing Data

Another complexity for longitudinal designs is that of attrition. In a longitudinal study, item nonresponse is not the only issue regarding missing data (item nonresponse was addressed in the previous section). Attrition, or dropout, is always a concern in longitudinal designs and leads to additional missing data (Enders, 2010). Depending on when attrition occurs, the outcomes or time-varying covariates necessary for missing data estimation may not be available. In this case, measures collected prior to when an individual dropped out of the study may be used to impute the missing outcomes and time-varying covariates (Enders, 2010). Prior research has indicated that there is increased bias in imputing missing data as part of longitudinal designs, especially when these data are being imputed to estimate values for time points beyond when participants were present (Engles & Diehr, 2003). Due to these considerations, we advise caution when imputing data outside the bounds of the time points where participants were present. MI was designed to predict item-level missingness and was not initially intended to estimate a person’s full set of data for a time point (Enders, 2010). Imputing for time points after a person has dropped out means doing so without any of the time-varying outcomes or covariates that are used for other individuals with some data available. If data are to be imputed, it makes sense to only impute for time points where there is some other information gathered from which to reasonably predict missing values. In other words, MI approaches will be most reliable where there are some other data available from that time point (i.e., item-level missingness).

When longitudinal datasets are to be imputed, however, they are typically set up in “wide” format, with one row per person and repeated measures included as separate variables. This makes it especially difficult to avoid imputing values after an individual has dropped out of a study. To avoid this, one would have to build a macro that creates a matrix of present/not-present codes indicating whether an individual is still in the study at each time point. This matrix could then be compared with the imputed datasets, and within each imputation the imputed values could be removed for cases that had already dropped out of the study. This would be exceedingly difficult, and we are not aware of any examples in the literature where such a method has been used.

The alternative would be to restructure the data file into “long” format, where there are as many rows for each individual as there are valid time points (the format that is used for mixed models). Imputing in long format would avoid the problem of imputing values after dropout, but doing so still comes at a cost. The imputation model assumes independence of cases because it utilizes bivariate correlations between variables. Having multiple rows from an individual results in a dependence of cases (because data from the same person have increased shared variance compared with data from two different people) and introduces a potential bias in the imputations. There are some examples in the literature of MI of multilevel data that have predominantly used cross-sectional clustered data (e.g., students within classrooms), but this is analytically equivalent to time points within persons. The imputation procedure in SPSS and PROC MI in SAS ignores any clustering present in hierarchical data (van Buuren, 2011).
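The restructuring described above can be sketched concretely. This hypothetical Python fragment (the variable names and values are ours) emits a long-format row only for waves where a participant contributed at least one measurement, so that imputation performed on the long file cannot extend past dropout:

```python
# Hypothetical wide-format records: one row per girl, waves 1-3;
# None marks a measurement that is missing.
wide = [
    {"id": 1, "smoke_1": 0, "smoke_2": 0, "smoke_3": 1,
              "bmc_1": 900, "bmc_2": 950, "bmc_3": 980},
    {"id": 2, "smoke_1": 1, "smoke_2": 1, "smoke_3": None,
              "bmc_1": 880, "bmc_2": None, "bmc_3": None},
]

def to_long(rows, waves=(1, 2, 3), measures=("smoke", "bmc")):
    """Restructure wide to long, keeping a row only for waves where the
    participant contributed at least one measurement (item-level
    missingness), and dropping waves that are entirely absent (dropout)."""
    long_rows = []
    for row in rows:
        for w in waves:
            values = {m: row[f"{m}_{w}"] for m in measures}
            if all(val is None for val in values.values()):
                continue  # wave entirely missing: do not create a row
            long_rows.append({"id": row["id"], "wave": w, **values})
    return long_rows

long_data = to_long(wide)
```

Here girl 2 retains her Wave 2 row (where only BMC is missing and can reasonably be imputed) but contributes no Wave 3 row, because she provided no data at that wave.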

Some user-contributed SAS packages have been developed that will more accurately model the dependence between cases using a model-based imputation approach (e.g., MMI_IMPUTE macro; Mistler, 2013). Two major limitations with using MMI_IMPUTE in the context of propensity score analysis are that the macro assumes multivariate normality, so it cannot impute dichotomous or categorical variables, which, by definition, occur in propensity score analyses; additionally, it only works with PROC MIXED. Propensity score analyses require an estimated probability of treatment, which can only be estimated using PROC LOGISTIC. As a workaround, Mplus could carry out these imputations but cannot execute all of the subsequent steps involved in a propensity score analysis (e.g., matching on propensity score). In order to properly impute multilevel data and perform the propensity score analyses on the imputed datasets, one would have to impute in Mplus or other imputation programs that support imputing categorical, multilevel data, and transfer the imputations to SAS for analysis. Unless a procedure such as this is undertaken, the best course of action may simply be to use FIML, which, as described earlier, will result in a complete-case analysis. This should only be considered if there is enough available complete data and there is no indication that participants with complete data are qualitatively different from those with incomplete data.

In deciding between imputing after dropout and violating independence of cases, the researcher may simply have to decide which is the lesser of two evils; most often, analysts will use the wide format and impute missing values even after attrition has occurred. It seems possible, however, that in some cases individuals who drop out of a study do so because of their levels on the variables collected. When this is the case, the data are MNAR and MI will produce biased results. As with any analysis involving MNAR, if a researcher has reason to believe that data are MNAR, the best course of action may be FIML. In studies where propensity scores are being used with large datasets, restricting the analysis to individuals who completed all measures in a study may not be as problematic. Data may still be imputed, of course, on variables and items that were skipped by the individual, but the analyst would not have to deal directly with attrition and with the great amount of uncertainty that comes from imputing after dropout.

A third analytic difficulty encountered in longitudinal studies is that participants are not always measured at exactly equal intervals, complicating the use of time-varying propensity scores. Assessors and data collectors may not be able to gather data on all participants on the same day. For example, in a study with monthly measurement, if some participants’ second time point is obtained at 3 weeks and some at 5 weeks, a great deal of variability may be observed across participants in the amount of time elapsed since baseline. For this reason, among others, mixed models (i.e., hierarchical or multilevel models; Raudenbush & Bryk, 2002) are often preferred over basic repeated-measures analysis of variance models, because mixed models can accommodate varying interval lengths. This situation would make time-specific estimation of propensity scores inappropriate and uninterpretable, because it assumes every individual’s second and third waves line up in time when they really occurred at differing amounts of time since baseline. If propensity score estimation and matching or stratification are to be done only at baseline, then the logistic regression can be run in wide format, where only one row is present per individual. Alternatively, if propensity scores are to be estimated for each time point and the waves do line up in time, then a separate logistic regression should be run for each time point using either the wide format or the long format with the analysis split by wave (i.e., using the SPLIT FILE command in SPSS or the BY statement in SAS). If, however, waves do not line up across individuals, time-specific estimation is better done by estimating propensity scores with one logistic regression in long format.
With a variable representing time since baseline included, the propensity score model can take into account the varying amounts of time that have elapsed since baseline, and any changes in time-varying covariates will be captured as well.

For this example, we extend what was previously presented in Example 1 of Beal and Kupzyk (2014). The data are derived from a larger longitudinal study of adolescent girls. The study (Dorn, Susman, Pabst, Huang, Kalkwarf, & Grimes, 2008) had a 90% retention rate across 3 years of data collection; however, attrition did contribute to missing data sporadically throughout the study. Our research question, an extension of previously published work (Dorn et al., 2013), is whether smoking among adolescent girls negatively affects bone health via depletion in bone mineral content (BMC). Specifically, we hypothesize that adolescent girls who smoke will, as a result of their smoking, experience lower levels of accrual in BMC than their peers who do not smoke. The challenge in testing this causal hypothesis is that girls may be motivated to smoke for a variety of reasons (e.g., increased depression, increased time spent with older peers who smoke, propensity for risk taking), and those factors may also directly or indirectly affect bone (e.g., depression decreases accrual in girls; girls with earlier onset of puberty not only have higher levels of accrual than same-aged peers but also spend more time with older peers and engage in more risky behaviors). While a randomized study examining these mechanisms in human adolescents would in some ways be ideal, it would be highly unethical to randomize adolescents to smoking or nonsmoking conditions. As a result, we must rely on propensity scores and other methods to examine causality in these relations. Pertinent to the discussions in this article, this example has three annual time points, has missing data due to participants not being present for all time points, and has a subset of the sample who changed smoking status during the course of the study.

Drawing on the principles and issues raised in this article and the features of this example dataset, we analyzed and compared results using the following approaches: (a) propensity score adjustment in a mixed modeling framework, with propensity score as a time-varying covariate; and (b) propensity score matching at baseline with a later (Wave 3) outcome. Each approach was examined using the differing techniques for handling missing data that have been described above: ML Estimation and MI. Thus, there are four sets of output to review as part of this example. The syntax and annotated output are available at [http://jea.sagepub.com/content/early/recent]. Below, we review relevant output and describe key features of the differences with estimating and evaluating each approach. To avoid redundancy with our previous article, we will not discuss the process of estimating propensity scores in extensive detail here, and encourage readers unfamiliar with those approaches to review the following articles: Austin (2011), Beal and Kupzyk (2014), D’Agostino (1998), and Rosenbaum and Rubin (1983).

Method

Propensity scores were estimated for all three time points, using measures collected within each time point as predictors of the underlying propensity score. Those measures included race, age, physical activity, use of birth control, body mass index, calcium intake, depressive and anxious symptoms, socioeconomic status, tanner stage, and age at menarche. Only one of those predictors (race) was time invariant across participants.

Imputation of data was conducted with data structured into long format—while this violates the assumptions of independence previously described, it also prevented data from being imputed for entire time points where participants were missing and accounted for variability in the timing of data collection by wave. Exact age at time of measurement was included in the propensity score estimation for that reason. Five imputations were conducted, and separate propensity scores were estimated within each imputation. In instances where matching occurred, nearest-neighbor matching was done separately for each imputation, as well as with the original data being used for the ML estimation example. Balance checking was performed for each imputed dataset, and balance was achieved across all imputations. Upon completion of all analyses, estimates from MI Iterations 1 to 5 output were aggregated, in line with published recommendations (Rubin, 1987; Schafer & Olsen, 1998). Summaries of that output are included in text when appropriate.
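The aggregation step follows Rubin’s (1987) rules: the pooled point estimate is the mean of the per-imputation estimates, and the pooled variance combines within-imputation and between-imputation variance. A minimal Python sketch with hypothetical numbers (not the estimates reported below):

```python
import math

def pool_mi(estimates, variances):
    """Pool one parameter across m imputations by Rubin's (1987) rules:
    point estimate = mean of estimates;
    total variance = within + (1 + 1/m) * between."""
    m = len(estimates)
    qbar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - qbar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return qbar, math.sqrt(total)

# Hypothetical smoking-effect estimates and squared standard errors
# from five imputations
pooled, pooled_se = pool_mi([-52.1, -49.8, -55.3, -50.6, -53.2],
                            [16.0, 15.2, 16.8, 15.5, 16.1])
```

Note that it is the analysis estimates that are pooled this way, not the propensity scores themselves; this point is revisited in the conclusions.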

Adjusting for Time-Varying Propensity to Smoke

We first examine findings when the propensity score is used as a time-varying covariate to adjust for confounding effects in the estimate of the effect of smoking on BMC. Those findings are summarized in Table 1. Covariate balance was achieved between groups in each of the imputations. Of note, there was little variation in the estimates across each of the five imputations, and the aggregated results are similar to what was estimated with ML. Findings indicate that, regardless of which approach to handling missing data was used, smoking contributed to an initial decrease in BMC, with minimal differences in accrual rate for girls who do and do not smoke, once initial and ongoing differences in the likelihood of smoking are taken into account. This effect can be contrasted with an estimate of the effect of smoking and changes in smoking when propensity score is not treated as a time-varying covariate. In that instance, results with ML estimation indicate an effect of smoking initially at −209.21 (p < .01), and an increase in BMC of 113.73 (p < .01) for girls who do not smoke, and 36.67 (p < .01) for girls who do smoke. Comparing these results with those reported in Table 1, failure to account for changes in propensity score contributed to an almost fourfold increase in the effect of smoking on the intercept, and a slope estimate that was lower for girls who did not smoke but almost 5 times lower for girls who did smoke. Thus, results are substantially different when potential variation in propensity for smoking across time is not taken into account.

Table 1. Model Estimates for Smoking With Propensity Score as a Time-Varying Covariate.

Matching at Baseline to Predict Subsequent BMC

We now turn to examining differences in findings when matching is conducted at baseline. Of note, matching occurred within each imputation, and for this example data were restructured into wide format. No adjustment for changes in propensity score was used, and both MI and ML approaches were used to address missing data. Once matching had occurred, mean differences in BMC at Time 3 were compared using t tests. A significant mean difference was found in both instances: with ML, the mean difference was 192.3, with girls who smoke having significantly lower BMC than girls who did not smoke (p < .01). The aggregated results using MI indicated a larger significant mean difference of 247.5, again with girls who smoke having lower BMC (p < .01).
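As a concrete illustration of baseline matching, the following hypothetical Python sketch performs greedy 1:1 nearest-neighbor matching without replacement on the propensity score (the IDs, scores, and caliper value are invented for illustration):

```python
def nn_match(treated_ps, control_ps, caliper=None):
    """Greedy 1:1 nearest-neighbor matching on the propensity score;
    each control is used at most once (matching without replacement),
    and pairs farther apart than the caliper are discarded."""
    available = dict(control_ps)  # id -> score, shrinks as controls are used
    pairs = []
    for t_id, t_ps in sorted(treated_ps.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if caliper is None or abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

# Hypothetical baseline propensity scores
treated_ps = {"t1": 0.62, "t2": 0.35, "t3": 0.80}
control_ps = {"c1": 0.30, "c2": 0.60, "c3": 0.78, "c4": 0.10}
pairs = nn_match(treated_ps, control_ps, caliper=0.1)
```

The matched groups would then be compared on the Wave 3 outcome (here, BMC) with a t test, as described above; balance on the covariates should be checked before any outcome comparison.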

Taken together, these examples reveal the following principles to keep in mind when analyzing longitudinal data with time-varying treatment conditions and covariates. First, there was a marked difference in the estimate of the impact of smoking on bone when time-varying adjustment was used versus baseline matching. This may be in part because propensity as a time-varying covariate accounted for more of the effect on bone, diminishing the effect of smoking. Consistent with that notion, the overall effect of smoking on bone when propensity was treated as time invariant was similar to the mean difference when matching was used. This speaks to the importance of accounting for changing propensities for treatment, when that is appropriate. Second, there were minimal differences in model estimates when ML versus MI was used, at least in the case where propensity score was also properly accounted for. This is consistent with the pattern of findings reported by Mattei (2009) and may suggest that modeling propensity score variance across time is a larger concern than which approach to missing data is used, at least when rates of missingness are relatively low.

Statistical methodologists are continuing to study the estimation and use of propensity scores, and new methods and estimation models are being developed and tested. Thus far, however, there does not appear to be clear evidence of a method that consistently performs better than the traditional estimation and matching methods that have been used by many researchers since Rosenbaum and Rubin developed them in the 1980s. Basic propensity score estimation with a logistic regression model, followed by matching with a nearest-neighbor or optimal matching approach, is a reasonably safe and efficient method for most researchers to support causal inference with nonrandomized groups. In longitudinal studies, the method most likely to provide helpful and understandable results is to match on propensity scores using baseline data and maintain those groups for the remainder of the time points. This most closely resembles a typical randomized longitudinal study. There certainly will be situations where more complex methods perform better than traditional methods, and more work should be done to understand when these methods can and should be applied and how the results can be interpreted. One particularly difficult situation is when treatment is not fixed, as we saw in our example. If group status can change over time, deciding what to do analytically is a challenge; different methods of dealing with this situation are possible, and they can result in different interpretations. How to analyze these data may depend on what question the researcher is asking. For example, even if group membership and the propensity score are time varying, using that information may not be necessary. Mimicking an intent-to-treat analysis, cases can simply be analyzed according to their baseline group, under the assumption that they remain in that group.
If future treatment decisions are to be made in a clinical setting based on information gathered at baseline, matching at baseline would be the best plan of action, as opposed to using time-varying propensity scores as covariates.

In general, matching at multiple time points should be avoided. The situation may lead to individuals being both included and excluded at varying time points, and probabilities of group membership may change as a result of changing covariate values. The situation is also counterintuitive, as we are used to participants remaining in one group throughout a study. Having individuals change groups creates dependence between groups, because the groups no longer contain independent cases. Still, one’s propensity to be in a group can indeed change over time, as in the example of youth becoming smokers or quitting smoking during a study. In this case, if a researcher wishes to account for the possibility of changing groups, treatment can be used as a within-persons predictor, and propensity scores can be used as a weight or time-varying covariate instead of being used for matching. Substantive knowledge and theory will also be important in guiding these decisions.

If there is a small amount of missing data in a study, using a missing data estimation method may not be worth the trouble involved if a large sample is still available with complete data. In situations where a dataset has a large amount of missing data with an insufficient number of participants with complete data, MI is a reasonable method for estimating the missing data, so that propensity scores can be calculated for all individuals in a sample. The important thing to remember in this case is that the “within” approach should be used, where matching and analysis are performed on each imputed dataset, and the treatment effect is then averaged across the imputations. Simply averaging the propensity scores across imputations and performing the intended analysis may bias results.
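The “within” strategy can be sketched in a few lines: matching and effect estimation happen inside each imputed dataset, and only the resulting treatment effects are averaged. A hypothetical Python illustration (the data and function names are ours):

```python
# Hypothetical imputed datasets: each row is (treated, propensity, outcome)
imputations = [
    [(1, 0.61, 10.0), (0, 0.60, 11.0), (1, 0.32, 9.0), (0, 0.30, 10.5)],
    [(1, 0.63, 10.2), (0, 0.59, 11.1), (1, 0.35, 8.8), (0, 0.33, 10.4)],
]

def match_and_effect(data):
    """Within one imputed dataset: greedily pair each treated case with
    the nearest-propensity control (without replacement) and return the
    mean matched outcome difference (treated minus control)."""
    controls = [row for row in data if row[0] == 0]
    diffs = []
    for t, ps, y in data:
        if t != 1:
            continue
        c = min(controls, key=lambda r: abs(r[1] - ps))
        controls = [r for r in controls if r is not c]
        diffs.append(y - c[2])
    return sum(diffs) / len(diffs)

# "Within" approach: match and estimate inside each imputation,
# then average the effect estimates (never the propensity scores)
effects = [match_and_effect(d) for d in imputations]
pooled_effect = sum(effects) / len(effects)
```

In a full analysis the per-imputation standard errors would also be pooled by Rubin’s rules rather than simply averaged; the sketch shows only the point-estimate logic.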

Little literature is available on many of the issues discussed in this article (citations have been included where possible), so we have made general recommendations on theoretical grounds. Several issues were raised, around MI especially, that pose difficult questions as to the best course of action; further research into these topics would be valuable in guiding researchers’ decisions. Methodologists are exploring many further uses for propensity scores, including latent growth curve models (Retelsdorf, Becker, Köller, & Möller, 2012), group-based trajectory modeling (Haviland, Nagin, Rosenbaum, & Tremblay, 2008), structural equation modeling (Hoshino, Kurata, & Shigemasu, 2006), moderation analyses (Green & Stuart, 2014), and mediation analyses (Coffman, 2011). In basic and advanced analytic situations, the use of propensity scores should continue to increase, as they are valuable tools for evaluating possible causality when randomized clinical trials are not feasible.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Achy-Brou, A. C., Frangakis, C. E., Griswold, M. (2010). Estimating treatment effects of longitudinal designs using regression models on propensity scores. Biometrics, 66, 824-833. doi:10.1111/j.1541-0420.2009.01334.x
Google Scholar | Crossref | Medline | ISI
Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46, 399-424. doi:10.1080/00273171.2011.568786
Google Scholar | Crossref | Medline | ISI
Austin, P. C., Mamdani, M. M. (2006). A comparison of propensity score methods: A case-study estimating the effectiveness of post-AMI statin use. Statistics in Medicine, 25, 2084-2106. doi:10.1002/sim.2328
Google Scholar | Crossref | Medline | ISI
Bai, H. (2011). A comparison of propensity score matching methods for reducing selection bias. International Journal of Research & Method in Education, 34, 81-107. doi:10.1080/1743727x.2011.552338
Google Scholar | Crossref
Bassok, D. (2010). Do Black and Hispanic children benefit more from preschool? Understanding differences in preschool effects across racial groups. Child Development, 81, 1828-1845.
Google Scholar | Crossref | Medline | ISI
Beal, S. J., Kupzyk, K. A. (2014). An introduction to propensity scores: What, when, and how. Journal of Early Adolescence, 34, 66-92. doi:10.1177/0272431613503215
Google Scholar | SAGE Journals | ISI
Bollen, K. A., Bauldry, S. (2011). Three Cs in measurement models: Causal indicators, composite indicators, and covariates. Psychological Methods, 16, 265-284.
Google Scholar | Crossref | Medline | ISI
Cheung, M. (2007). Comparison of methods of handling missing time-invariant covariates in latent growth models under the assumption of missing completely at random. Organizational Research Methods, 10, 609-634. doi:10.1177/1094428106295499
Google Scholar | SAGE Journals | ISI
Coffman, D. L. (2011). Estimating causal effects in mediation analysis using propensity scores. Structural Equation Modeling, 18, 357-369. doi:10.1080/10705511.2011.582001
Google Scholar | Crossref | Medline | ISI
Collins, L. M., Schafer, J. L., Kam, C.-M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351. doi:10.1037//1082-989X.6.4.330
Google Scholar | Crossref | Medline | ISI
D’Agostino, R. B. (1998). Propensity score methods for bias reduction in the comparison of a treatment to a nonrandomized control group. Statistics in Medicine, 17, 2265-2281. doi:10.1002/(SICI)1097-0258(19981015)17:19<2265::AID-SIM918>3.0.CO;2-B
Google Scholar | Crossref | Medline | ISI
Dorn, L. D., Susman, E. J., Pabst, S., Huang, B., Kalkwarf, H., Grimes, S. (2008). Association of depressive symptoms and anxiety with bone mass and density in ever-smoking and never-smoking adolescent girls. Archives of Pediatrics & Adolescent Medicine, 162, 1181-1188. doi:10.1001/archpedi.162.12.1181
Google Scholar | Crossref | Medline
Dorn, L. D., Beal, S. J., Kalkwarf, H. J., Pabst, S., Noll, J. G., Susman, E. J. (2013). Longitudinal impact of substance use and depressive symptoms on bone accrual among girls aged 11-19 years. The Journal of Adolescent Health, 52, 393-399. doi:10.1016/j.jadohealth.2012.10.005
Google Scholar | Crossref | Medline | ISI
Dunlay, S. M., Pack, Q. R., Thomas, R. J., Killian, J. M., Roger, V. L. (2014). Participation in cardiac rehabilitation, readmissions, and death after acute myocardial infarction. The American Journal of Medicine, 127, 538-546. doi:10.1016/j.amjmed.2014.02.008
Google Scholar | Crossref | Medline | ISI
Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press.
Google Scholar
Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychological Methods, 16, 1-16. doi:10.1037/a0022640
Google Scholar | Crossref | Medline | ISI
Engles, J. M., Diehr, P. (2003). Imputation of missing longitudinal data: A comparison of methods. Journal of Clinical Epidemiology, 56, 968-976. doi:10.1016/S0895-4356(03)00170-7
Google Scholar | Crossref | Medline | ISI
Fan, X., Nowell, D. L. (2011). Using propensity score matching in educational research. Gifted Child Quarterly, 55, 74-79. doi:10.1177/0016986210390635
Google Scholar | SAGE Journals | ISI
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576. doi:10.1146/annurev.psych.58.110405.085530
Google Scholar | Crossref | Medline | ISI
Green, K. M., Stuart, E. A. (2014). Examining moderation analyses in propensity score methods: Application to depression and substance use. Journal of Consulting and Clinical Psychology, 82, 773-783. doi:10.1037/a0036515
Google Scholar | Crossref | Medline | ISI
Gunasekara, F. I., Carter, K., Blakely, T. (2008). Glossary for econometrics and epidemiology. Journal of Epidemiology & Community Health, 62, 858-861.
Google Scholar | Crossref | Medline | ISI
Hardt, J., Herke, M., Leonhart, R. (2012). Auxiliary variables in multiple imputation in regression with missing X: A warning against including too many in small sample research. BMC Medical Research Methodology, 12, Article 184. doi:10.1186/1471-2288-12-184
Google Scholar | Crossref | Medline | ISI
Haviland, A., Nagin, D. S., Rosenbaum, P. R., Tremblay, R. E. (2008). Combining group-based trajectory modeling and propensity score matching for causal inferences in nonexperimental longitudinal data. Developmental Psychology, 44, 422-436. doi:10.1037/0012-1649.44.2.422
Google Scholar | Crossref | Medline | ISI
Hernán, M. A., Lanoy, E., Costagliola, D., Robins, J. M. (2006). Comparison of dynamic treatment regimes via inverse probability weighting. Basic & Clinical Pharmacology & Toxicology, 98, 237-242. doi:10.1111/j.1742-7843.2006.pto_329.x
Google Scholar | Crossref | Medline | ISI
Hodges, K., Grunwald, H. (2005). The use of propensity scores to evaluate outcomes for community clinics: Identification of an exceptional home-based program. The Journal of Behavioral Health Services & Research, 32, 294-305. doi:10.1007/BF02291829
Google Scholar | Crossref | Medline | ISI
Hong, G., Yu, B. (2008). Effects of kindergarten retention on children’s social-emotional development: An application of propensity score method to multivariate, multilevel data. Developmental Psychology, 44, 407-421. doi:10.1037/0012-1649.44.2.407
Google Scholar | Crossref | Medline | ISI
Hoshino, T., Kurata, H., Shigemasu, K. (2006). A propensity score adjustment for multiple group structural equation modeling. Psychometrika, 71, 691-712. doi:10.1007/s11336-005-1370-2
Google Scholar | Crossref | ISI
John, L., Wright, R., Duku, E. K., Williams, J. D. (2008). The use of propensity scores as a matching strategy. Research on Social Work Practice, 18, 20-26. doi:10.1177/1049731507303958
Google Scholar | SAGE Journals | ISI
Kohls, N., Walach, H. (2008). Validating four standard scales in spiritually practicing and non-practicing samples using propensity score matching. European Journal of Psychological Assessment, 24, 165-173. doi:10.1027/1015-5759.24.3.165
Kurth, T., Walker, A. M., Glynn, R. J., Chan, K. A., Gaziano, J. M., Berger, K., Robins, J. M. (2006). Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. American Journal of Epidemiology, 163, 262-270. doi:10.1093/aje/kwj047
Linden, A., Adams, J. L. (2008). Improving participant selection in disease management programmes: Insights gained from propensity score stratification. Journal of Evaluation in Clinical Practice, 14, 914-918. doi:10.1111/j.1365-2753.2008.01091.x
Linden, A., Adams, J. L. (2010). Using propensity score-based weighting in the evaluation of health management programme effectiveness. Journal of Evaluation in Clinical Practice, 16, 175-179. doi:10.1111/j.1365-2753.2009.01219.x
Mattei, A. (2009). Estimating and using propensity score in presence of missing background data: An application to assess the impact of childbearing on wellbeing. Statistical Methods & Applications, 18, 257-273. doi:10.1007/s10260-007-0086-0
Mistler, S. A. (2013, April). A SAS macro for applying multiple imputation to multilevel data. Paper presented at the SAS Global Forum, San Francisco, CA.
Mitra, R., Reiter, J. P. (2012). A comparison of two methods of estimating propensity scores after multiple imputation. Statistical Methods in Medical Research, 25, 188-204. doi:10.1177/0962280212445945
Nye, B., Hedges, L. V., Konstantopoulos, S. (2000). The effects of small classes on academic achievement: The results of the Tennessee class size experiment. American Educational Research Journal, 37, 123-151.
Prendergast, M. L., Messina, N. P., Hall, E. A., Warda, U. S. (2011). The relative effectiveness of women-only and mixed-gender treatment for substance-abusing women. Journal of Substance Abuse Treatment, 40, 336-348. doi:10.1016/j.jsat.2010.12.001
Raudenbush, S. W., Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: SAGE.
Raykov, T. (2011). On testability of missing data mechanisms in incomplete data sets. Structural Equation Modeling: A Multidisciplinary Journal, 18, 419-429. doi:10.1080/10705511.2011.582396
Retelsdorf, J., Becker, M., Köller, O., Möller, J. (2012). Reading development in a tracked school system: A longitudinal study over 3 years using propensity score matching. British Journal of Educational Psychology, 82, 647-671. doi:10.1111/j.2044-8279.2011.02051.x
Robins, J. M., Hernán, M. A. (2009). Estimation of the causal effects of time-varying exposures. In Fitzmaurice, G., Davidian, M., Verbeke, G., Molenberghs, G. (Eds.), Advances in longitudinal data analysis (pp. 553-599). New York, NY: Chapman & Hall/CRC Press.
Rosenbaum, P. R., Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55. doi:10.1093/biomet/70.1.41
Rosenbaum, P. R., Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516-524. doi:10.2307/2288398
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581-592. doi:10.1093/biomet/63.3.581
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: John Wiley.
Sampson, R. J., Sharkey, P., Raudenbush, S. W. (2008). Durable effects of concentrated disadvantage on verbal ability among African-American children. Proceedings of the National Academy of Sciences, 105, 845-852.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3-15. doi:10.1177/096228029900800102
Schafer, J. L., Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst’s perspective. Multivariate Behavioral Research, 33, 545-571. doi:10.1207/s15327906mbr3304_5
Shadish, W. R., Cook, T. D., Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Sheridan, S. M., Knoche, L. L., Edwards, C. P., Kupzyk, K. A., Clarke, B. L., Kim, E. M. (2014). Efficacy of the getting ready intervention and the role of parental depression. Early Education and Development, 25, 746-769. doi:10.1080/10409289.2014.862146
Sheridan, S. M., Knoche, L. L., Kupzyk, K. A., Edwards, C. P., Marvin, C. (2011). A randomized trial examining the effects of parent engagement on early language and literacy: The getting ready intervention. Journal of School Psychology, 49, 361-383. doi:10.1016/j.jsp.2011.03.001
Singer, J. D., Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York, NY: Oxford University Press.
Slade, E. P., Stuart, E. A., Salkever, D. S., Karakus, M., Green, K. M., Ialongo, N. (2008). Impacts of age of onset of substance use disorders on risk of adult incarceration among disadvantaged urban youth: A propensity score matching approach. Drug and Alcohol Dependence, 95(1-2), 1-13. doi:10.1016/j.drugalcdep.2007.11.019
van Buuren, S. (2011). Multiple imputation of multilevel data. In Roberts, J. K., Hox, J. J. (Eds.), Handbook of advanced multilevel analysis (pp. 173-196). New York, NY: Routledge.
Williams, A. F. (1973). Personality and other characteristics associated with cigarette smoking among young teenagers. Journal of Health and Social Behavior, 14, 374-380.

Author Biographies

Kevin A. Kupzyk is an assistant professor and statistician at the University of Nebraska Medical Center. He received his training in quantitative methods in education at the University of Nebraska–Lincoln, and his research emphasizes power analysis and the use of hierarchical linear modeling for studying change as a result of interventions.

Sarah J. Beal is an assistant professor at Cincinnati Children’s Hospital Medical Center. She received her training in developmental psychology and research methods at the University of Nebraska–Lincoln. Her program of research emphasizes development in adolescence and the transitions to adulthood.
