The use of online convenience samples for experimental research has exploded in recent decades, with far-reaching and mostly positive consequences for scholarship in the social sciences. Due to its low cost and quick turnaround time, Amazon’s Mechanical Turk (MTurk) in particular has become a popular testing ground for many social scientific hypotheses. Where once researchers may have only speculated about causal effects, now they can test, refine, and retest in short order.
The main purpose of this paper is to introduce a new source of subjects – Lucid – that satisfies many desiderata, including a large subject pool, demographic diversity, and low cost. Lucid is an aggregator of survey respondents from many sources. It collects basic demographic information from all subjects who flow through its doors, facilitating quota sampling to match US Census demographic margins.
Berinsky et al. (2012) demonstrated the validity of MTurk by replicating classic experiments originally conducted on probability samples; we follow their lead and do the same on Lucid. As an empirical matter, our Lucid replications recover treatment effect estimates very similar to those of the original studies. That said, whether our particular set of replications on Lucid matches previous estimates is only tangentially related to whether researchers should adopt the platform. Past success is no guarantee of future success, and what worked for one experiment may not work for the next.
For this reason, a second purpose of the paper is to consider the question of when survey experimenters should opt for convenience samples in general and Lucid in particular. Our answer to this question follows the “fit-for-purpose” compromise proposed by public opinion scholars in an attempt to resolve debates over non-probability samples. The basic distinction drawn in Baker et al. (2013) is between descriptive work, which requires probability samples, and work that “models relationships between variables,” which can make fruitful use of non-probability samples. Similarly, we think that if the purpose of a study is to estimate sample average treatment effects (SATEs), convenience samples are usually fit for purpose. A key distinction will be whether the causal effects under study should, according to the social scientific theories guiding the design of the experiment, obtain among the convenience sample as well as a broader population. In our experience, it is the rare theory whose scope conditions specifically exclude the sort of people who take online surveys, though one could come up with counterexamples, for example theories whose predictions depend on the level of digital literacy (Koltay, 2011).
If it is determined that a convenience sample is fit for purpose for survey experimentation, the question remains which source to use. MTurk is a widely used platform and scholars know a tremendous amount about it, both positive and negative. On the positive side, recent meta-analyses of experimental studies conducted on both MTurk and US national probability samples (Coppock, 2017; Coppock et al., 2018; Mullinix et al., 2015) have found high replication rates. On the negative side, Behrend et al. (2011) show that MTurk responses are slightly more susceptible to social desirability bias than other samples. Others are concerned that MTurk respondents perceive a conditional relationship between the answers they give and the pay they earn. Bullock et al. (2015) have shown that political beliefs (as expressed by survey responses) can be affected by payments for “correct” responses. Rightly or wrongly, subjects on MTurk may believe that they will earn more money if they respond in a particular manner. We note that recent experimental evidence has found little to no evidence of demand effects (De Quidt et al., 2018; Mummolo and Peterson, 2018; White et al., 2018), even when the investigators’ preferred responses are signaled with heavy-handed messages. Some scholars are concerned that MTurk is “overfished” and that many respondents have become professional survey takers (Chandler et al., 2015; Rand et al., 2014). Stewart et al. (2015) estimate the pool of MTurk respondents active for a given lab at approximately 7300 subjects at any one time. Lastly, MTurk subjects have access to websites where they share information about academic surveys, which is particularly troubling for experiments in which subjects’ compensation depends on how they respond. MTurk participants share advice on how to maximize these payoffs on sites such as Turkopticon (turkopticon.ucsd.edu) or Turkernation (turkernation.com).
When are convenience samples fit for purpose?
Before turning to the specifics of the Lucid platform, we consider the conditions under which researchers should turn to online convenience samples as sources of subjects in general. Convenience samples have met with resistance largely because they have no design-based justification for generalizing from the sample to the population and typically have to rely on some combination of statistical adjustment and argument instead. Debates over the scientific status of non-probability samples have raged for decades. Warren Mitofsky’s 1989 presidential address to the American Association for Public Opinion Research (AAPOR) describes an acrimonious dispute from a half-century prior over quota versus probability sampling in which “[t]here was no meeting of the minds among the participants.” In that same address, Mitofsky describes his own journey from probability-sample purist to convenience-sample convert, at least in some settings and for some scientific purposes.
In 2013, AAPOR issued a report on non-probability samples (Baker et al., 2013) that formalizes a “fit-for-purpose” framework for assessing whether a given sampling design is fit for the scientific purpose to which it is put. The fit-for-purpose framework represents a compromise: for descriptive work, we need probability samples, but for research that models the relationships between variables, convenience samples may be acceptable. Levay et al. (2016) provide some empirical support for the compromise’s underlying reasoning: MTurk and probability samples are descriptively quite different, but the correlations among survey responses are similar after a modicum of statistical adjustment. And while it is commonplace in the popular media to conduct opinion polls using convenience samples of viewers or listeners (Kent et al., 2006), most descriptive work in political science uses explicit random sampling or reweighting techniques to target population quantities (Park et al., 2004). We would note, however, that even extremely idiosyncratic convenience samples (e.g. Xbox users; Gelman et al., 2016) can sometimes produce estimates that turn out to have been accurate. Nevertheless, in line with the fit-for-purpose framework, we would not generally recommend using Lucid (or any convenience sample) when the goal is descriptive inference about a particular population.
In contrast to descriptive studies, which seek to estimate a population quantity on the basis of a sample, the goal of much experimental work is to estimate a particular sample quantity, the SATE, though other estimands (such as SATEs conditional on pretreatment covariates) are also common. Estimates of the SATE are said to exhibit strong internal validity if the standard experimental assumptions are met; this logic extends to samples obtained from Lucid. But the question of whether a particular convenience sample should be used depends not on whether we can estimate the SATE well, but on whether the SATE is worth estimating at all. In our view, the choice to use a convenience sample should depend on whether the SATE is relevant for theory. A similar distinction is drawn in Druckman and Kam (2011), who were responding to the critique of student samples given in Henrich et al. (2010). Druckman and Kam (2011) point out that a convenience sample might pose a problem if it lacks variation on an important moderating variable. Indeed, variation in the moderator is required to demonstrate that effects are different for different subgroups, but we would submit that even in the absence of such variation, the SATE in a convenience sample could be relevant for theory.
Whether a given SATE is relevant for theory will doubtless be a matter of debate in any substantive area. If the goal is to study the effect of an English-language newspaper article on political opinion, the SATE from a convenience sample of French-only monolinguals would not be relevant for theory, for the simple reason that the hypothesized causal process could not take place among subjects who do not speak English. A heuristic for determining whether a SATE is relevant for theory is to consider whether the theory’s predictions also apply to that sample, not whether that sample is “representative” of some different population. Our guess is that if a theory applies to the US national population (i.e., adult Americans), it should usually apply to a subset of that population (i.e., adult Americans on Lucid), though we grant there may be exceptions.
The SATE is often contrasted with the population average treatment effect (PATE), and the SATE is said to exhibit poor external validity if it differs from the PATE. We do not share this view of external validity. The PATE and the SATE are different estimands, and estimates of each may be more or less useful depending on the target of inference. If a SATE is relevant for theory, then it is interesting in its own right, regardless of whether the SATE and the PATE are the same number (or even have the same sign). Researchers always have to defend the provenance of their samples; defending convenience samples means specifically arguing that the theory under examination applies to the people in the convenience sample.
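To fix ideas, the two estimands can be written in standard potential-outcomes notation (our notation, following textbook treatments such as Gerber and Green, 2012, rather than a formula from any of the original papers):

\[
\text{SATE} = \frac{1}{N}\sum_{i=1}^{N}\bigl[Y_i(1) - Y_i(0)\bigr], \qquad
\text{PATE} = \mathbb{E}\bigl[Y_i(1) - Y_i(0)\bigr],
\]

where \(Y_i(1)\) and \(Y_i(0)\) are subject \(i\)'s potential outcomes under treatment and control, the sum runs over the \(N\) subjects in the sample at hand, and the expectation is taken over the broader population of interest.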
Why would SATEs and PATEs ever differ? We need to distinguish between three kinds of heterogeneity: idiosyncratic, treatment-by-covariate, and treatment-by-treatment (Gerber and Green, 2012: ch. 9). Idiosyncratic heterogeneity occurs when subjects’ responses to treatment are different, but this heterogeneity is not caused by systematic factors. Treatment-by-covariate heterogeneity occurs when groups of subjects defined by pre-treatment covariates have different average responses to treatment. This kind of heterogeneity can cause SATEs and PATEs to differ if the covariates that are correlated with treatment effects are also correlated with the characteristics that influence selection into the convenience sample (Hartman et al., 2015; Kern et al., 2016). If these important moderators are measured in the sample and are known in the population, then SATEs can be reweighted to estimate PATEs (Franco et al., 2017; Miratrix et al., 2018). Lastly, treatment-by-treatment heterogeneity occurs when the response to one treatment depends on the level of another treatment, as in a two-by-two factorial design. In our empirical section, we investigate both treatment-by-covariate and treatment-by-treatment interactions.
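As a sketch of the reweighting logic (a standard post-stratification formulation rather than the specific estimators developed in the papers cited above), suppose the measured moderators partition subjects into groups indexed by \(g\), with known population shares \(\pi_g\). A reweighted estimate of the PATE is then

\[
\widehat{\text{PATE}} = \sum_{g} \pi_g \, \widehat{\text{CATE}}_g,
\]

where \(\widehat{\text{CATE}}_g\) is the estimated average treatment effect among sample members in group \(g\). The approach succeeds only to the extent that the measured moderators capture the treatment-by-covariate heterogeneity that distinguishes the sample from the population.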
As it happens, survey experimental SATE and PATE estimates are frequently quite similar (Coppock, 2017; Coppock et al., 2018; Mullinix et al., 2015), and the main explanation for this finding seems to be low treatment effect heterogeneity in response to the sorts of treatments studied by social scientists in survey experiments. Boas et al. (n.d.) report a similar finding from a comparison of subjects recruited via Facebook, Qualtrics, and MTurk. Whether or not future experiments will also exhibit low treatment effect heterogeneity is, of course, only a matter of speculation.
A second kind of external validity concerns whether the treatments and outcomes in the experiment map onto the “real-world” treatments and outcomes that the study is meant to illuminate. This sort of external validity has less to do with who the experimental subjects happen to be and more to do with the strength of the analogy from the experimental design to the social or political phenomenon of interest. Our ability to aggregate experimental findings into a broader understanding of politics and society is arguably much more important than the relative magnitudes of particular SATEs and PATEs. Assessing this kind of external validity is outside the scope of the current paper, but our guess is that the choice of one convenience sample over another does not alter it for better or worse.
In our empirical section, we replicate five survey experiments that were originally conducted on other samples. As an exercise, we read each paper with an eye towards understanding whether the theory under study should, in principle, apply to the sorts of people who participate in online surveys. We also noted whether treatment effects were predicted to be moderated by particular variables in the original paper. This is relevant because, as noted in Druckman and Kam (2011), a sample needs sufficient variation on a moderating variable in order to demonstrate the presence of treatment effect heterogeneity.
Table 1 pulls together the results of this exercise. In three cases, the group to whom the theory appears to apply is all adult English-speaking Americans and, in two cases, the group is simply all adult humans. Lucid subjects are a strict subset of both groups. The theoretical moderators were education, ideology, gender, risk acceptance, subject attentiveness, and partisanship.
Experiments
We now turn to our five replication experiments. For space reasons, we provide brief descriptions of each experiment in the main text along with summary figures comparing the estimated treatment effects across samples. In the Welfare, Asian Disease, Kam and Simas, Hiscox, and Berinsky facets of Figure 2, we present standardized treatment effect estimates, where we have scaled the outcome variables for Lucid and MTurk by the mean and standard deviation of the original experiments. The Berinsky facet does not include an MTurk estimate because that study has not previously been replicated on an MTurk sample. Fuller descriptions of our procedures and results (including treatment and outcome question wordings as well as regression tables) are available in the online appendix. We did not pre-register our analyses because, in the main, we follow the analysis strategies of the original authors. Again following the original authors, we drop subjects with missing or “don’t know” outcomes. In all cases, we estimate HC2 robust standard errors to construct 95% confidence intervals and conduct hypothesis tests.
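For concreteness, the following is a minimal sketch of the standardization and estimation strategy just described, using hypothetical variable names (outcome, treat) and simulated data rather than our actual replication files; statsmodels’ HC2 covariance option implements the same HC2 estimator used throughout.

# Minimal sketch of the standardization and HC2 estimation described above.
# Variable names (outcome, treat) and the original-study summary statistics
# are hypothetical placeholders, not our actual replication code.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def standardized_ate(df, original_mean, original_sd):
    """Scale the outcome by the original study's mean and SD, then regress
    it on the treatment indicator with HC2 robust standard errors."""
    df = df.copy()
    df["outcome_std"] = (df["outcome"] - original_mean) / original_sd
    X = sm.add_constant(df["treat"])
    fit = sm.OLS(df["outcome_std"], X).fit(cov_type="HC2")
    estimate, se = fit.params["treat"], fit.bse["treat"]
    return estimate, (estimate - 1.96 * se, estimate + 1.96 * se)

# Illustration with simulated data: a true effect of 0.5 on the original scale.
rng = np.random.default_rng(1)
sim = pd.DataFrame({"treat": rng.integers(0, 2, size=1000)})
sim["outcome"] = 1.0 + 0.5 * sim["treat"] + rng.normal(size=1000)
print(standardized_ate(sim, original_mean=1.0, original_sd=1.0))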
Experiment 1: Welfare spending
Our first experiment replicates a classic question wording experiment. Control subjects are asked whether we are spending too little, about right, or too much on “welfare.” Treatment subjects are asked the same question about “Assistance to the poor” or “Caring for the poor.” The General Social Survey (GSS) has conducted this experiment every other year since 1984; we use the 2014 GSS estimate as the baseline result. This experiment behaves on Lucid much as it does on MTurk and the GSS – a large increase in support for redistribution when the question is phrased as assistance or caring for the poor rather than as “welfare.”
Experiment 2: Asian Disease problem
Our second experiment is also a classic, this time from the behavioral economics literature. Tversky and Kahneman (1981) show that people take the riskier option when in a “loss frame” rather than a “gain frame.” Subjects are asked to “Imagine that your country is preparing for the outbreak of an unusual disease, which is expected to kill 600 people. Two alternative programs to combat the disease have been proposed.” Subjects in the control condition are told: “If Program A is adopted, 200 people will be saved. If Program B is adopted, there is one-third probability that 600 people will be saved, and two-thirds probability that no people will be saved.” Subjects in the treatment group (the “mortality frame”) are told: “If Program A is adopted, 400 people will die. If Program B is adopted there is one-third probability that nobody will die, and two-thirds probability that 600 people will die.”
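As a reminder of why this is a pure framing manipulation (our arithmetic, not a calculation reported in the original paper), the two programs have the same expected outcome in either frame:

\[
\mathbb{E}[\text{Program A}] = 200 \text{ saved}, \qquad
\mathbb{E}[\text{Program B}] = \tfrac{1}{3}(600) + \tfrac{2}{3}(0) = 200 \text{ saved},
\]

and, equivalently, 400 expected deaths under either program in the mortality frame; only the description of the gamble changes.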
Across all three samples (the original experiment was conducted in a classroom setting among undergraduates), the treatment has average effects in the same direction, with subjects in the mortality (loss) frame far more likely to choose the probabilistic (risky) outcome, though the magnitudes of the effects do differ substantially by sample. Lacking a US national sample benchmark, it is unclear how to grade Lucid’s performance relative to MTurk, though we would argue that the qualitative conclusions drawn from the experiment are the same across all samples.
Experiment 3: Framing and risk
Our third experiment replicates Kam and Simas (2010), who show that risk acceptance correlates with choosing the risky option in an Asian Disease-type experiment, but that the treatment effect of the mortality frame does not vary appreciably with risk acceptance. This finding replicates in both the MTurk and Lucid samples. Receiving the mortality frame increases the likelihood of selecting the probabilistic choice. Risk acceptance correlates with choosing the risky option, but does not moderate the effect of treatment, as we discuss in greater depth below.
Experiment 4: Free trade
Study 4 is a replication of Hiscox (2006), which measured the effects of positive, negative, and expert opinion frames on support for free trade. The study employed a 2 × 4 design. The first factor is the Expert treatment, which informed subjects that economists are nearly unanimously in favor of free trade. The second factor is the valence frame, which highlights positive, negative, or both positive and negative impacts of free trade on the economy and jobs. Control subjects saw no frames before proceeding to the outcome question answered by all subjects: “Do you favor or oppose increasing trade with other nations?” The Expert frame increases support for free trade in all examined samples, while the positive frame has negligible (or even negative) effects and the negative frame has unambiguously negative effects. In both the original sample and the MTurk sample, the combination of the positive and negative frames decreased support. Overall, the studies yield similar experimental estimates.
Experiment 5: Healthcare rumors
We conclude our set of five experiments with a note of caution. We attempted to replicate Berinsky’s (2017) experiment on belief in rumors surrounding the Affordable Care Act (ACA), specifically the false rumor that the ACA would create “death panels” that would make end-of-life decisions for patients without their consent. In the original experiment (conducted in 2010 on a sample provided by Survey Sampling International (SSI)), a large portion of the sample believed the rumor, and corrections delivered by Republicans, Democrats, and Nonpartisan groups were all effective in correcting false beliefs.
When we replicated the experiment on Lucid, we found a similar level of baseline belief in the rumor. On a −1 to 1 scale (with 0 indicating the respondent was “not sure”), average belief was −0.17 on Lucid, compared with −0.19 in the original. However, none of the corrections (with the possible exception of the Republican correction) appears to have had effects as large as those documented in the original. It could be that the Lucid sample is uniquely impervious to these corrections, but that explanation is hard to reconcile with the fact that the original sample was an online convenience sample much like Lucid. We think a more plausible explanation for this divergence is that opinion on the ACA hardened in the six years between the original implementation and our replication. These results underline that treatment effects can vary both across individuals within the same time period and across time periods within the same individuals.
Treatment effect heterogeneity
As previously discussed, an important determinant of whether an experimental sample is fit for purpose is whether, in addition to average treatment effects (ATEs) for the full sample, it can be used to estimate conditional average treatment effects (CATEs). In this section, we assess treatment effect heterogeneity in four of the five experiments replicated above.
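Throughout this section, heterogeneity is assessed by comparing CATEs across groups. One standard way to implement such a comparison (a sketch with hypothetical variable names, not our replication code) is to interact the treatment indicator with the moderator and inspect the interaction terms, again with HC2 robust standard errors:

# Sketch of a treatment-by-covariate interaction test; the column names
# (outcome, treat, moderator) are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

def heterogeneity_test(df: pd.DataFrame) -> pd.DataFrame:
    """Regress the outcome on treatment, moderator, and their interaction.
    The interaction coefficients estimate how the CATE differs across
    moderator categories; HC2 errors match the main analyses."""
    fit = smf.ols("outcome ~ treat * C(moderator)", data=df).fit(cov_type="HC2")
    # Collect estimates and 95% confidence intervals for the interaction terms.
    results = pd.DataFrame({"estimate": fit.params,
                            "conf_low": fit.conf_int()[0],
                            "conf_high": fit.conf_int()[1]})
    return results[results.index.str.contains("treat:")]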
For the welfare spending experiment, we test whether subjects’ race or ethnicity conditions the effect of the “assistance to the poor” phrasing. Though this is not one of the factors theorized to condition the phrasing treatment effect in the original iteration of this experiment, race has since been identified as perhaps the single most important factor influencing positions on welfare spending (Gilens, 1996). We assess whether the treatment effect of receiving the “assistance to the poor” versus “welfare” phrasing varies among white, black, and Latino respondents, in both the Lucid sample and the 2016 GSS. Both samples generally exhibit low treatment effect heterogeneity, with CATEs among white, black, and Latino respondents statistically indistinguishable from one another. While the point estimates for the 2016 GSS CATEs are nearly identical, in the Lucid sample white respondents exhibited a larger treatment effect than black or Latino respondents, though again these differences are not statistically significant. These estimates are shown in the first facet of Figure 3.
In the third facet of Figure 3, we assess whether subjects’ prior risk acceptance conditions the effect of receiving the mortality frame. In neither sample do we see a significant conditioning effect of risk acceptance – both the original sample and the Lucid sample yield similar estimates of (the lack of) heterogeneous treatment effects.
For the Hiscox free-trade framing experiment, we test for two different types of treatment effect heterogeneity, as can be seen in Figure 4. We assess both heterogeneity based on respondents’ prior characteristics, in the form of education levels, and heterogeneity that is randomly assigned as part of the experimental design, in whether or not subjects receive the summary of expert opinions. Across both the original sample and the Lucid sample, we see no evidence of heterogeneous treatment effects for any of the possible treatment conditions. While the results seem to suggest that subjects with “low” education levels, defined in Hiscox (2006) as subjects who have not attended any college, are slightly more influenced by both framing and expert opinions, these differences are far from statistically significant. Receiving the expert opinions does not appear to moderate the effect of receiving any of the possible treatment frames.
Treatment effect heterogeneity for the healthcare rumors experiment is also low for both the original Berinsky (2017) sample and the Lucid sample (Figure 5). Treatment effects for all treatment conditions are similar for Democrats and Republicans in both samples. It is important to note that while we did not replicate the original ATEs for this study in the previous section, here we see that the CATEs are statistically distinguishable across samples only for Democrats receiving the Democratic correction to the healthcare rumor. We can, therefore, clarify our findings for this replication: subjects identifying as Democrats sampled from SSI in 2010 reacted differently to the Democratic correction than did Democratic subjects sampled from Lucid in 2016. Whether this difference is due to altered political context over time, solidification of beliefs about the ACA, or differences in the subject pools between SSI and Lucid cannot be determined with certainty.
Discussion
The surge in research conducted online has brought many benefits. Researchers can pilot quickly and make adjustments to strengthen their designs. Because online convenience samples are inexpensive to collect, researchers can more easily conduct experiments at scale. Online surveys have also lowered the barriers to entry for early career scholars. The dramatic increase in the use of online convenience samples raises at least two questions. First, for which research tasks are online convenience samples appropriate? Second, when convenience samples are appropriate, is MTurk the best option, or are there alternatives?
We have relied on the fit-for-purpose framework to answer the first question. The purpose of most survey experiments is to estimate a SATE; whether a given SATE is interesting depends on whether the sample is relevant for theory. Theoretical relevance concerns whether the theory’s predictions extend to a particular sample, not whether the sample is drawn at random from some population. While we think the theories that underlie most survey experiments conducted in the USA would extend to Lucid, we emphatically do not mean to suggest that any experiment conducted on a convenience sample is relevant for theory.
In our five experiments, Lucid performed remarkably well, recovering estimates close to the original estimates. In most cases, our estimates matched the originals in terms of sign and significance. In no case did we recover an estimate that was statistically significant with the opposite sign from the original. We think that the best explanation for this pattern is low treatment effect heterogeneity, which is another way of saying that the causal theories laid out in the original papers extend in a straightforward way to the Lucid sample. We test this heterogeneity directly and conclude that, in nearly every case, low treatment effect heterogeneity is indeed the reality, at least along the dimensions we assess.
Among our five experiments, we have one instance of the Lucid sample producing substantively different results from the original study. In no way do we think our results contradict or overturn those reported by Berinsky (2017). Instead, we suspect that the correction no longer works because times have changed since the original experiment. While this line of reasoning is admittedly post hoc, one might argue that the Lucid sample was not relevant for theory because, by 2016, attitudes and opinions about Barack Obama were strongly held by most Americans. If so, this heterogeneity in response to treatment is a feature of Americans generally and not a unique feature of the special subset of Americans who take surveys on Lucid. Alternatively, we might say that, ex ante, we considered the Lucid sample relevant for theory and that these new results require us to update the theory forwarded in that paper.
Regarding the second question of how to choose among sources of convenience samples, we believe we have shown that subjects obtained via Lucid can serve as a drop-in replacement for subjects recruited on MTurk. Lucid boasts a much larger pool of subjects than MTurk; the risk of cooperation among subjects is minimal given their diverse sources; subjects are less professionalized; and subjects are more similar to US national benchmarks in terms of their demographic, political, and psychological profiles. Experimental results obtained on Lucid are solidly in line with the results obtained on other platforms. That said, researchers have developed tools to implement a wide variety of studies on MTurk. For example, the MTurkR software (Leeper, 2015) makes it easy to implement panel studies on MTurk. Similar tools have not been developed for Lucid, so some researchers would face significant costs in changing their workflows.
Lastly, we note that MTurk survey respondents are among the very best-studied human beings on the planet. While we advocate in this paper that scholars seek out new sources of survey respondents, we recognize that the knowledge we have about MTurk workers is valuable. As a research community, we have honed our understanding of how these people respond to incentives, question wordings, and experimental stimuli. We know how they respond to attention checks and distraction tasks. Journal editors and peer reviewers are already familiar with the strengths and weaknesses of MTurk data. Diversifying our subject pools will necessarily involve learning how other online samples are similar and different. While we are reassured that, on most dimensions, Lucid data appear to equal or outperform MTurk data, we also recognize that changing data sources does not come without costs.