Natural experiments offer a unique opportunity to explore some of the most elusive questions about the political world. They are characterized by circumstances that allow researchers to assume as-if random or haphazard treatment conditions even though treatment allocation is not determined by a random device (
Rosenbaum, 2010;
Dunning, 2012;
Keele, 2015). One of the most popular types of natural experiments uses geographic or administrative boundaries to construct treated and control groups by exploiting certain geographical features that generate as-if random variation in the treatment assignment (
Keele and Titiunik, 2016). Such geographic natural experiments (GNEs) have been used to study topics such as ethnic relations in Zambia and Malawi (
Posner, 2004), political polarization in the United States (
Nall, 2018), electoral choices after disasters in Chile (
Visconti, 2022), and support for authoritarian regimes in East Germany (
Kern and Hainmueller, 2009) that might otherwise escape attempts to establish causal inference.
However, as with any methodological approach, there are important limitations to (geographic) natural experiments. For instance, because randomization is not guaranteed, researchers must provide a compelling justification for the as-if random assumption (
Dunning, 2012;
Sekhon and Titiunik, 2012). Even with empirical and theoretical justification, though, “the strong possibility that unobserved differences across groups may account for difference in average outcomes is always omnipresent in observational studies” (
Dunning, 2008; 289). This concern may obscure important relationships and even undermine the validity of causal claims.
The local geographic ignorability design (LGID) therefore emerges as an attractive empirical approach to limit potential unobservable factors from biasing results. Under the assumption that the treatment was as-if randomly assigned to units that are especially close to a given geographic or administrative boundary, there can be greater confidence in the assumed independence of potential outcomes (
Keele and Titiunik, 2016).
For example, when studying the effects of a policy intervention, an LGID would examine differences between residents living within a small buffer area around the border. Presumably, these residents would be more similar to each other than to those living far away from the administrative boundary. The LGID approach, however, might still require adjustment for pretreatment covariates. One solution to this problem is to enhance geographic designs by using matching as a flexible form of statistical adjustment (
Keele et al., 2015).
While this is undoubtedly a powerful research design, it comes with an important limitation: its inherent locality. Although matched treated and control groups may, in fact, be quite similar to each other, they could also be markedly distinct from a larger population of interest (e.g., a city or state). Given an unrepresentative sample, “the estimate of a causal effect may fail to characterize how effects operate in the population of interest” (
Aronow and Samii, 2016; 250). Such external validity concerns are often of particular interest for political scientists (
McDermott, 2002), as it may be difficult to determine whether any causal effects identified must be restricted to only areas within the narrowly defined boundary or if they can be generalized across cases to answer fundamental questions about broader political phenomena.
We present an approach that addresses this problem, inspired by the idea of template matching (
Silber et al., 2014) as well as by recent advances in optimal matching and the construction of representative matched samples (
Visconti and Zubizarreta, 2018;
Bennett et al., 2019). When a target population, such as a city, state, or country, is used as a template to implement the matching, the matched treated and control groups will be similar not only to each other but also to the population of interest. This can increase the generalizability of causal evidence from GNEs, providing a kind of external validity check. By implementing this method, researchers would not have to rely only on collecting multiple studies conducted in diverse contexts to learn about the generalizability of an effect, since template matching reveals, within the original study, the hidden studies that resemble other populations. We see this strategy as a second step to be implemented after the main analysis to explore whether results are consistent across samples that look like the populations of interest. In the following sections, we describe the assumptions and the methodology for this approach and provide an empirical illustration.
Notation and assumptions
When using a sample to draw causal inference, the evidence can be generalized to a target population only when that sample was randomly selected from the target population of interest. In the case of geographic natural experiments, the sample (e.g., the buffer from either side of the administrative boundary) is not constructed by randomly selecting people from the target population (e.g., the city). As a consequence, generalizability efforts must rely on an observational data analysis assumption (
Stuart et al., 2018).
In randomized experiments, the most common quantity of interest is the average treatment effect (ATE). Let $Y_i(1)$ denote the potential outcome if subject $i$ were treated and $Y_i(0)$ the potential outcome if subject $i$ were not treated, and let $T_i = 1$ indicate that subject $i$ received the treatment. The average treatment effect is $\mathrm{ATE} = E(Y_i(1)) - E(Y_i(0))$. In observational studies, the estimand of interest is usually the average treatment effect on the treated (ATT), which can be expressed as $\mathrm{ATT} = E(Y_i(1) \mid T_i = 1) - E(Y_i(0) \mid T_i = 1)$. The counterfactual outcomes $Y_i(0)$ of the treated units are not observed. As a result, it is necessary to construct a control group by using two assumptions: conditional independence and common support (Hidalgo and Sekhon, 2011).
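As a minimal illustration of the ATT expression above, the following Python sketch estimates it as the mean within-pair outcome difference. It assumes that each treated unit has already been paired with one control unit; the function name and the outcome values are hypothetical and are not taken from any study cited here.

```python
from statistics import mean

def att_from_matched_pairs(pairs):
    """Estimate the ATT as the mean within-pair outcome difference.

    `pairs` is a list of (treated_outcome, matched_control_outcome) tuples.
    The matched control outcome stands in for the unobserved Y_i(0) of the
    treated unit, which is credible only under conditional independence
    and common support.
    """
    if not pairs:
        raise ValueError("need at least one matched pair")
    return mean(y_treated - y_control for y_treated, y_control in pairs)

# Hypothetical outcomes (e.g., turnout rates) for three matched pairs.
pairs = [(0.62, 0.55), (0.48, 0.50), (0.71, 0.60)]
att_hat = att_from_matched_pairs(pairs)
```

The estimate is only as good as the matching that produced the pairs: the code takes the comparability of treated and control units as given.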
In this paper, we instead focus on a different estimand: the target average treatment effect on the treated (TATT), which will inform us about how the treatment effects operate on the target population of interest. The TATT is defined over a sample of $n$ units, and this sample needs to resemble the target population of interest.
We propose a design based on template matching to extend beyond the local effects estimated when using local geographic ignorability designs and to recover the target average treatment effect on the treated (TATT). Template matching was developed by
Silber et al. (2014) to make standardized comparisons based on observed characteristics. Their study randomly selected 300 patients (i.e., the template) and used them to match 300 patients at each of 217 hospitals, constructing samples that resembled the template used to implement the multivariate matching.
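The logic of template matching can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used in the cited work: it uses greedy nearest-neighbor matching on standardized covariates rather than optimal matching, and all function names, covariates, and data are hypothetical.

```python
import math
import random
from statistics import pstdev

def draw_template(population, size, seed=0):
    """Randomly sample `size` covariate vectors from the target population."""
    rng = random.Random(seed)
    return rng.sample(population, size)

def scales(population):
    """Per-covariate standard deviations, used to put covariates on one scale."""
    columns = list(zip(*population))
    return [pstdev(col) or 1.0 for col in columns]

def distance(a, b, sd):
    """Euclidean distance on standardized covariates."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, sd)))

def match_to_template(template, candidates, sd):
    """Greedy 1:1 matching without replacement: for each template unit,
    pick the closest candidate that has not been used yet."""
    if len(candidates) < len(template):
        raise ValueError("need at least as many candidates as template units")
    unused = list(range(len(candidates)))
    matched = []
    for t in template:
        best = min(unused, key=lambda j: distance(t, candidates[j], sd))
        unused.remove(best)
        matched.append(best)
    return matched

# Hypothetical covariates: (age, urban indicator).
population = [(25, 0), (40, 1), (60, 1), (33, 0), (52, 0), (47, 1)]
treated = [(26, 0), (58, 1), (41, 1), (70, 0)]
control = [(39, 1), (24, 0), (61, 1), (45, 0)]

sd = scales(population)
template = draw_template(population, size=3, seed=1)
treated_idx = match_to_template(template, treated, sd)
control_idx = match_to_template(template, control, sd)
```

Because the treated and control groups are each matched to the same template, the two matched groups are pulled toward the target population as well as toward each other. A greedy pass depends on the order of the template units, which is one reason to prefer an optimal matching algorithm in applied work.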
Two assumptions are needed to claim that the matched sample resembles the population of interest and to provide causal evidence after adjusting on observables. The first, the
ignorability of sample selection, states that after adjusting for the relevant observed covariates, treatment effects are the same in the matched sample and the target population (
Visconti and Zubizarreta, 2018;
Stuart et al., 2018). Specifically, the target average treatment effect on the treated (TATT) and the population (of interest) average treatment effect on the treated (PATT) need to be equivalent. In that case, we expect that $\mathrm{TATT} = \mathrm{PATT}$. Second, the
conditional geographic ignorability in local neighborhood assumption holds that, within a neighborhood, the potential outcomes are independent of treatment assignment conditional on observed covariates (Keele et al., 2015). In this case, every unit $i$ has a score $S_i = (S_{i1}, S_{i2})$ that refers to the geographic location of the subject, which will be used to compute the distance to any point $(b_1, b_2)$ located on the boundary. A collection of points within a small geographic neighborhood is defined as $N(b_1, b_2)$. The set of covariates used to obtain covariate balance is defined as $X_i$. Therefore, for each point $(b_1, b_2)$ located on the boundary, we can find a neighborhood $N(b_1, b_2)$ where $(Y_i(1), Y_i(0)) \perp\!\!\!\perp T_i \mid X_i$ for all subjects $i$ with score $(S_{i1}, S_{i2})$ in $N(b_1, b_2)$.
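To make the neighborhood notation concrete, the following Python sketch selects the units whose geographic score falls within a fixed radius of a boundary point. It assumes planar (projected) coordinates and a circular neighborhood; the function name, the radius, and the unit locations are hypothetical choices made only for illustration.

```python
import math

def neighborhood(units, boundary_point, radius):
    """Return ids of units whose score (s1, s2) lies within `radius`
    of the boundary point (b1, b2).

    Assumes planar (projected) coordinates, so Euclidean distance is
    meaningful; latitude/longitude would need a geodesic distance instead.
    """
    b1, b2 = boundary_point
    return [
        unit_id
        for unit_id, (s1, s2) in units.items()
        if math.hypot(s1 - b1, s2 - b2) <= radius
    ]

# Hypothetical unit locations in kilometres on a projected grid.
units = {"a": (0.2, 1.0), "b": (3.5, 0.4), "c": (-0.6, 0.9), "d": (0.1, 4.0)}
near = neighborhood(units, boundary_point=(0.0, 1.0), radius=1.0)
```

The radius governs a trade-off: a smaller neighborhood makes the ignorability assumption more plausible but leaves fewer units to match.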
A key question is how to define the appropriate template or target population. Recent research has advocated for a stronger connection between theory and causal identification. Scholars point to the advantages of theory-driven endeavors, which can help to better recognize undefined potential outcomes (
Slough, 2022), to improve covariate balance (
Resa and Zubizarreta, 2016), and to generalize a causal effect to other contexts (
Gailmard et al., 2021). While the nature of causal identification strategies may require a narrow focus, the theories researchers wish to test may be far more extensive. When constructing a generalizable geographic natural experiment, we argue that researchers should ask not only what identification strategy is best to recover causal effects, but also what template or population they wish to mimic that would best test their broader theory.
For example,
Posner (2004) takes advantage of the border between Zambia and Malawi to study the political salience of a cultural cleavage. Chewa and Tumbuka people live on both sides of the border. While their cultural differences are identical on both sides of the border, their political differences are more salient in Malawi than in Zambia. The rationale behind exploiting this distinction is that Chewas and Tumbukas are large groups relative to the country as a whole in Malawi and, therefore, can be used as a base for coalition-building. Meanwhile, in Zambia, Chewas and Tumbukas are small relative to the country as a whole, creating little incentive to rely on them for coalition-building.
As a result, Posner's (2004) theory directly connects with a population of interest (i.e., all Chewa and Tumbuka people in Malawi and Zambia) rather than just the four villages along the border used in the study. If people from these villages have different distributions of observed characteristics than people in the entire country, using a traditional geographic natural experiment might generate estimates that do not speak to the theory. Thus, we would advocate implementing a generalizable geographic natural experiment to improve the connection between theory and causal identification.
The utility of using template matching is also evidenced in more recent implementations of geographic natural experiments.
Keele and Titiunik (2018) aim to uncover the effects of all-mail voting on turnout. To do so, they rely on data from two counties, one that used only in-person voting and one that used all-mail voting. While the resulting estimates can tell us about turnout effects at a local scale, they may not extend to the true populations of interest: Colorado and even the United States as a whole. Employing template matching in this case would provide a kind of external validity check on how well the theory underlying the paper connects with the analysis and results of the causal identification strategy.
It is important to note that we do not equate external validity and representativeness. Our goal is to show that a treatment effect can be generalized across different populations of interest (i.e., external validity). We use template matching to construct representative matched samples that are similar to the population of interest (i.e., representativeness). Using template matching to build representative matched samples can improve the limited external validity of studies that have an especially local nature, often a result of researchers’ efforts to reduce heterogeneity and decrease sensitivity to hidden biases (
Rosenbaum, 2005). In observational studies, reducing heterogeneity often means decreasing the sample size to improve comparability between units (
Keele, 2015). Therefore, we could end up with treated and control groups that allow us to make credible inferences but that might be substantially different from the target population.
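One simple way to gauge how far a matched sample sits from the target population is a standardized difference in covariate means. The Python sketch below is an illustration under stated assumptions: scaling by the population standard deviation is one of several possible conventions, and the function name and all values are hypothetical.

```python
from statistics import mean, pstdev

def standardized_difference(sample_values, population_values):
    """Difference in means between a matched sample and the target
    population, scaled by the population standard deviation."""
    sd = pstdev(population_values)
    if sd == 0:
        raise ValueError("covariate has no variation in the population")
    return (mean(sample_values) - mean(population_values)) / sd

# Hypothetical ages in the target population and in two matched samples.
population_age = [25, 40, 60, 33, 52, 47]
local_sample_age = [58, 61, 70]      # matched only within the local buffer
template_sample_age = [26, 41, 58]   # matched to a population template

smd_local = standardized_difference(local_sample_age, population_age)
smd_template = standardized_difference(template_sample_age, population_age)
```

In this constructed example the template-matched sample lies much closer to the population mean than the purely local sample, which is the sense of representativeness discussed above. A small standardized difference on observed covariates does not by itself establish external validity.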