Equivalence Tests

Scientists should be able to provide support for the absence of a meaningful effect. Currently, researchers often incorrectly conclude an effect is absent based on a nonsignificant result. A widely recommended approach within a frequentist framework is to test for equivalence. In equivalence tests, such as the two one-sided tests (TOST) procedure discussed in this article, an upper and lower equivalence bound is specified based on the smallest effect size of interest. The TOST procedure can be used to statistically reject the presence of effects large enough to be considered worthwhile. This practical primer with accompanying spreadsheet and R package enables psychologists to easily perform equivalence tests (and power analyses) by setting equivalence bounds based on standardized effect sizes and provides recommendations to prespecify equivalence bounds. Extending your statistical tool kit with equivalence tests is an easy way to improve your statistical and theoretical inferences.

Scientists should be able to provide support for the null hypothesis. A limitation of the widespread use of traditional significance tests, where the null hypothesis is that the true effect size is zero, is that the absence of an effect can be rejected, but not statistically supported. When you perform a statistical test, and the outcome is a p value larger than the $\alpha$ level (e.g., p > .05), the only formally correct conclusion is that the data are not surprising, assuming the null hypothesis is true. It is not possible to conclude there is no effect when $p > \alpha$; our test might simply have lacked the statistical power to detect a true effect.
It is statistically impossible to support the hypothesis that a true effect size is exactly zero. What is possible in a frequentist hypothesis testing framework is to statistically reject effects large enough to be deemed worthwhile. When researchers want to argue for the absence of an effect that is large enough to be worthwhile to examine, they can test for equivalence (Wellek, 2010). By rejecting an effect (indicated in this article by $\Delta$) more extreme than predetermined lower and upper equivalence bounds ($-\Delta_L$ and $\Delta_U$, e.g., effect sizes of Cohen's d = −.3 and d = .3), we can act as if the true effect is close enough to zero for our practical purposes. Equivalence testing originates from the field of pharmacokinetics (Hauck & Anderson, 1984), where researchers sometimes want to show that a new cheaper drug works just as well as an existing drug (for an overview, see Senn, 2007, Chapters 15 and 22). A very simple equivalence testing approach is the "two one-sided tests" (TOST) procedure (Schuirmann, 1987). In the TOST procedure, an upper ($\Delta_U$) and lower ($-\Delta_L$) equivalence bound is specified based on the smallest effect size of interest (SESOI; e.g., a positive or negative difference of d = .3). Two composite null hypotheses are tested: $H0_1$: $\Delta \le -\Delta_L$ and $H0_2$: $\Delta \ge \Delta_U$. When both these one-sided tests can be statistically rejected, we can conclude that $-\Delta_L < \Delta < \Delta_U$ or that the observed effect falls within the equivalence bounds and is close enough to zero to be practically equivalent (Seaman & Serlin, 1998).
Psychologists often incorrectly conclude there is no effect based on a nonsignificant test result. For example, the words "no effect" had been used in 108 articles published in Social Psychological and Personality Science up to August 2016. Manual inspection revealed that in almost all of these articles, the conclusion of "no effect" was based on statistical nonsignificance. Finch, Cumming, and Thomason (2001) reported that in the Journal of Applied Psychology, a stable average of around 38% of articles with nonsignificant results accept the null hypothesis. This practice is problematic. With small sample sizes, nonsignificant test results are hardly indicative of the absence of a true effect, and with huge sample sizes, effects can be statistically significant but practically and theoretically irrelevant. Equivalence tests, which are conceptually straightforward, easy to perform, and highly similar to widely used hypothesis significance tests that aim to reject a null effect, are a simple but underused approach to reject the possibility that an effect more extreme than the SESOI exists (Anderson & Maxwell, 2016).
Psychologists would gain a lot by embracing equivalence tests. First, researchers often incorrectly use nonsignificance to claim the absence of an effect (e.g., "there were no gender effects, p > .10"). This incorrect interpretation of p values would be more easily recognized and should become less common in the scientific literature if equivalence tests were better known and more widely used. Second, where traditional significance tests only allow researchers to reject the null hypothesis, science needs statistical approaches that allow us to conclude meaningful effects are absent (Dienes, 2016). Finally, the strong reliance on hypothesis significance tests that merely aim to reject a null effect does not require researchers to think about the effect size under the alternative hypothesis. Exclusively focusing on rejecting a null effect has been argued to lead to imprecise hypotheses (Gigerenzer, 1998). Equivalence testing invites researchers to make more specific predictions about the effect size they find worthwhile to examine. Bayesian methods can also be used to test a null effect (e.g., Dienes, 2014), but equivalence tests do not require researchers to switch between statistical philosophies to test the absence of a meaningful effect, and the availability of power analyses for equivalence tests allows researchers to easily design informative experiments.
There have been previous attempts to introduce equivalence testing to psychology (Quertemont, 2011; Rogers, Howard, & Vessey, 1993; Seaman & Serlin, 1998). I believe there are four reasons why previous attempts have largely failed. First, there is a lack of easily accessible software to perform equivalence tests. To solve this problem, I've created an easy-to-use spreadsheet and R package to perform equivalence tests for independent and dependent t tests, correlations, and meta-analyses (see https://osf.io/q253c/) based on summary statistics. Second, in pharmacokinetics, the equivalence bounds are often defined in raw scores, whereas it might be more intuitive for researchers in psychology to express equivalence bounds in standardized effect sizes. This makes it easier to perform power analyses for equivalence tests (which can also be done with the accompanying spreadsheet and R package) and to compare equivalence bounds across studies in which different measures are used. Third, there is no single article that discusses both power analyses and statistical tests for one-sample, dependent and independent t tests, correlations, and meta-analyses, which are all common in psychology. Finally, guidance on how to set equivalence boundaries has been absent for psychologists, given that there are often no specific theoretical limitations on how small effects are predicted to be (Morey & Lakens, 2017) nor cost-benefit boundaries of when effects are too small to be practically meaningful. This is a chicken-egg problem, since using equivalence tests will likely stimulate researchers to specify which effect sizes are predicted by a theory (Weber & Popova, 2012).
To bootstrap the specification of equivalence bounds in psychology, I propose that when theoretical or practical boundaries on meaningful effect sizes are absent, researchers set the bounds to the smallest effect size they have sufficient power to detect, which is determined by the resources they have available to study an effect.

Testing for Equivalence
In this article, I will focus on the TOST procedure (Schuirmann, 1987) of testing for equivalence because of its simplicity and widespread use in other scientific disciplines. The goal in the TOST approach is to specify a lower and upper bound, such that results falling within this range are deemed equivalent to the absence of an effect that is worthwhile to examine (where $\Delta$ is a difference that can be defined by either standardized differences such as Cohen's d or raw differences such as .3 scale points on a 5-point scale). In the TOST procedure, the null hypothesis is the presence of a true effect of $\Delta_L$ or $\Delta_U$, and the alternative hypothesis is an effect that falls within the equivalence bounds or the absence of an effect that is worthwhile to examine. The observed data are compared against $\Delta_L$ and $\Delta_U$ in two one-sided tests. If the p value for both tests indicates the observed data are surprising, assuming $\Delta_L$ or $\Delta_U$ are true, we can follow a Neyman-Pearson approach to statistical inferences and reject effect sizes more extreme than the equivalence bounds. When making such a statement, we will not be wrong more often, in the long run, than our Type 1 error rate (e.g., 5%). It is also possible to test for inferiority, or the hypothesis that the effect is smaller than an upper equivalence bound, by setting the lower equivalence bound to $-\infty$ (see Note 1). Furthermore, equivalence bounds can be symmetric around zero.
When both null hypothesis significance tests (NHST) and equivalence tests are used, there are four possible outcomes of a study: The effect can be statistically equivalent (larger than $\Delta_L$, smaller than $\Delta_U$) and not statistically different from zero, statistically different from zero but not statistically equivalent, statistically different from zero and statistically equivalent, or undetermined (neither statistically different from zero nor statistically equivalent).
In Figure 1, mean differences (black squares) and their 90% (thick lines) and 95% confidence intervals (CIs; thin lines) are illustrated for four scenarios. To conclude equivalence (Scenario A), the 90% CI around the observed mean difference should exclude the $\Delta_L$ and $\Delta_U$ values of −.5 and .5 (indicated by black vertical dashed lines; see Note 2). The null hypothesis of the traditional two-sided significance test is rejected (Scenario B) when the 95% CI around the mean difference does not include 0 (the vertical gray dotted line). Effects can be statistically different from zero and statistically equivalent (Scenario C) when the 90% CI excludes the equivalence bounds and the 95% CI excludes zero. Finally, an effect can be undetermined, or not statistically different from zero, and not statistically equivalent (Scenario D) when the 90% CI includes one of the equivalence bounds and the 95% CI includes zero.
In this article, the focus lies on the TOST procedure, where two p values are calculated. Readers are free to replace decisions based on p values by decisions based on 90% CIs if they wish. Formally, hypothesis testing and estimation are distinct approaches. For example, while sample size planning based on CIs focuses on the width of CIs, sample size planning for hypothesis testing uses power analysis to estimate the probability of observing a significant result (Maxwell, Kelley, & Rausch, 2008). Since the TOST procedure is based on a Neyman-Pearson hypothesis testing approach to statistics, and I'll explain how to calculate the tests as well as how to perform power analysis, I'll focus on the calculation of p values for conceptual consistency.
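The four scenarios can be expressed as a small decision rule on the two intervals. The following is a minimal sketch, not part of the accompanying spreadsheet or R package; the function name `classify_outcome`, the returned labels, and the example intervals are hypothetical and chosen only for illustration.

```python
def classify_outcome(ci90, ci95, low_bound, high_bound):
    """Classify a result from its 90% and 95% CIs, each given as (lower, upper)."""
    # Equivalence: the whole 90% CI lies strictly inside the equivalence bounds.
    equivalent = low_bound < ci90[0] and ci90[1] < high_bound
    # Difference from zero: the 95% CI does not contain zero.
    different = ci95[0] > 0 or ci95[1] < 0
    if equivalent and different:
        return "different from zero and equivalent"
    if equivalent:
        return "equivalent"
    if different:
        return "different from zero"
    return "undetermined"


# Hypothetical intervals on a raw scale, with equivalence bounds of -0.5 and 0.5.
print(classify_outcome((-0.20, 0.30), (-0.25, 0.35), -0.5, 0.5))
print(classify_outcome((0.30, 0.90), (0.24, 0.96), -0.5, 0.5))
```

The 90% interval is used for the equivalence decision and the 95% interval for the difference from zero, mirroring the two one-sided tests (each at 5%) and the two-sided test described in the text.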

Equivalence Tests for Differences Between Two Independent Means
The TOST procedure entails performing two one-sided tests to examine whether the observed data are surprisingly larger than an equivalence boundary lower than zero ($\Delta_L$) or surprisingly smaller than an equivalence boundary larger than zero ($\Delta_U$). The equivalence test assuming equal variances is based on:
$$t_L = \frac{M_1 - M_2 - \Delta_L}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \quad \text{and} \quad t_U = \frac{M_1 - M_2 - \Delta_U}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \tag{1}$$
where $M_1$ and $M_2$ indicate the means of each sample, $n_1$ and $n_2$ are the sample size in each group, and $s$ is the pooled standard deviation (SD):
$$s = \sqrt{\frac{(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2}{n_1 + n_2 - 2}} \tag{2}$$
with $SD_1$ and $SD_2$ the SDs of the two samples. Even though Student's t test is by far the most popular t test in psychology, there is general agreement that whenever the number of observations is unequal across both conditions, Welch's t test (1947), which does not rely on the assumption of equal variances, should be performed by default (Delacre, Lakens, & Leys, 2017; Ruxton, 2006). The equivalence test not assuming equal variances is based on:
$$t_L = \frac{M_1 - M_2 - \Delta_L}{\sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}} \quad \text{and} \quad t_U = \frac{M_1 - M_2 - \Delta_U}{\sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}} \tag{3}$$
where the degrees of freedom (df) for Welch's t test are based on the Satterthwaite (1946) correction:
$$df_W = \frac{\left(\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}\right)^2}{\frac{(SD_1^2/n_1)^2}{n_1 - 1} + \frac{(SD_2^2/n_2)^2}{n_2 - 1}} \tag{4}$$
These equations are highly similar to the Student's and Welch's t-statistic for traditional significance tests. The only difference is that the lower equivalence bound $\Delta_L$ and the upper equivalence bound $\Delta_U$ are subtracted from the mean difference between groups. These bounds can be defined in raw scores or in standardized differences (see also Berger & Hsu, 1996). The spreadsheet and R package can be used to perform this test, but some commercial software such as Minitab (Minitab 17 Statistical Software, 2010) also include the option to perform equivalence tests for t tests.
As an example, Eskine (2013) showed that participants who had been exposed to organic food were substantially harsher in their moral judgments relative to those in the control condition (d = .81, 95% CI [0.19, 1.45]). A replication by Moery and Calin-Jageman (2016, Study 2) did not observe a significant effect (control: n = 95, M = 5.25, SD = .95; organic food: n = 89, M = 5.22, SD = .83). The authors followed Simonsohn's (2015) recommendation to set the equivalence bound to the effect size the original study had 33% power to detect. With n = 21 in each condition of the original study, this means the equivalence bound is d = .48, which equals a difference of .384 on a 7-point scale given the sample sizes and a pooled SD of .894. We can calculate the TOST equivalence test t-values from these summary statistics. Because the prediction seems to be whether the effect is smaller than the upper equivalence bound (a test for inferiority), only the one-sided t test against the upper equivalence bound could be performed and reported. Note that the spreadsheet and R package allow you to either directly specify the equivalence bounds in Cohen's d or set the equivalence bound in raw units.
An a priori power analysis for equivalence tests can be performed by calculating the required sample sizes to declare equivalence for two one-sided tests based on the lower equivalence bound and upper equivalence bound. When equivalence bounds are symmetric around zero (e.g., $\Delta_L = -.5$ and $\Delta_U = .5$), the required sample sizes (referred to as $n_L$ and $n_U$ in Equation 5) will be identical. Following Chow, Shao, and Wang (2002), the normal approximation of the power equation for equivalence tests (for each independent group of an independent t test) given a specific $\alpha$ level and desired level of statistical power ($1 - \beta$) is:
$$n_L = \frac{2(z_\alpha + z_{\beta/2})^2}{\Delta_L^2}, \qquad n_U = \frac{2(z_\alpha + z_{\beta/2})^2}{\Delta_U^2} \tag{5}$$
where $\Delta_L$ and $\Delta_U$ are the standardized mean difference equivalence bounds (in Cohen's d), and $z_\alpha$ and $z_{\beta/2}$ are upper critical values of the standard normal distribution. This equation calculates the required sample sizes based on the assumption that the true effect size is zero (see Table 1).
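To make the Welch variant concrete, the sketch below computes the two t-values and the Satterthwaite degrees of freedom from summary statistics. It is an illustration under stated assumptions rather than the code of the accompanying R package: the function name `tost_welch` and the example numbers are hypothetical, bounds are assumed to be given in raw units, and p-values are not computed because the Python standard library has no t distribution.

```python
import math


def tost_welch(m1, sd1, n1, m2, sd2, n2, low_bound, high_bound):
    """Return (t_lower, t_upper, df) for Welch's TOST; bounds are in raw units."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2   # squared standard error of each mean
    se = math.sqrt(v1 + v2)                 # standard error of the mean difference
    diff = m1 - m2
    t_lower = (diff - low_bound) / se       # tested against the lower bound
    t_upper = (diff - high_bound) / se      # tested against the upper bound
    # Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_lower, t_upper, df


# Hypothetical summary statistics with raw equivalence bounds of -0.5 and 0.5.
t_lower, t_upper, df = tost_welch(5.0, 1.0, 50, 4.9, 1.2, 40, -0.5, 0.5)
print(round(t_lower, 2), round(t_upper, 2), round(df, 1))
```

Equivalence would be concluded when `t_lower` is larger than the one-sided critical t-value for `df` and `t_upper` is smaller than its negative.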
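Returning to the replication example, the t-values can be recomputed from the reported summary statistics. This is a sketch assuming equal variances (Student's version with the pooled SD) and the raw bound of .384 exactly as stated in the text; variable names are illustrative. Note that multiplying d = .48 by the pooled SD of .894 gives roughly .43 rather than .384, so the sketch simply takes the stated raw bound as given.

```python
import math

# Summary statistics reported for the replication (control vs. organic food).
n1, m1, sd1 = 95, 5.25, 0.95
n2, m2, sd2 = 89, 5.22, 0.83
low_bound, high_bound = -0.384, 0.384   # raw equivalence bounds as stated in the text

# Pooled SD and standard error of the mean difference (equal variances assumed).
pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
se = pooled_sd * math.sqrt(1 / n1 + 1 / n2)

t_lower = (m1 - m2 - low_bound) / se    # test against the lower bound
t_upper = (m1 - m2 - high_bound) / se   # test against the upper bound
df = n1 + n2 - 2

print(f"pooled SD = {pooled_sd:.3f}, df = {df}")
print(f"t_lower = {t_lower:.2f}, t_upper = {t_upper:.2f}")
```

With 182 degrees of freedom the one-sided critical t-value at a 5% level is roughly 1.65, so both t-values (about 3.14 and −2.68) are more extreme than the critical value and both one-sided tests can be rejected.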
If a nonzero true effect size is expected, an iterative procedure must be used. A highly accessible overview of power analysis for equivalence, superiority, and noninferiority designs with power tables for a wide range of standardized mean differences and expected true mean differences that can be used to decide upon the sample size in your study is available in Julious's (2004) study. The narrower the equivalence bounds, or the smaller the effect sizes one tries to reject, the larger the sample size that is required. Large sample sizes are required to achieve high power when equivalence bounds are close to zero. This is comparable to the large sample sizes that are required to reject a true but small effect when the null hypothesis is a null effect. Equivalence tests require slightly larger sample sizes than traditional null hypothesis tests.
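The effect of narrowing the bounds on the required sample size can be illustrated with the normal approximation for a true effect of zero, $n = 2(z_\alpha + z_{\beta/2})^2/\Delta^2$ per group. The sketch below is a minimal illustration using only the Python standard library; the function name `n_per_group` and the chosen bounds are assumptions for the example, and it does not cover the iterative procedure needed for nonzero true effects.

```python
import math
from statistics import NormalDist


def n_per_group(bound_d, alpha=0.05, power=0.80):
    """Approximate n per group for a symmetric bound in Cohen's d, true effect = 0."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)                 # one-sided critical value
    z_beta_half = z.inv_cdf(1 - (1 - power) / 2)   # beta is split over the two tests
    return math.ceil(2 * (z_alpha + z_beta_half) ** 2 / bound_d ** 2)


# Narrower bounds require larger samples.
for bound in (0.5, 0.3, 0.1):
    print(bound, n_per_group(bound))
```

Because the required n scales with $1/\Delta^2$, halving the bound roughly quadruples the sample size per group.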

Equivalence Tests for Differences Between Dependent Means
When comparing dependent means, the correlation between the observations has to be taken into account, and the effect size directly related to the statistical significance of the test (and thus used in power analysis) is Cohen's $d_z$ (see Lakens, 2013). The t-values for the two one-sided test statistics are:
$$t_L = \frac{M_1 - M_2 - \Delta_L}{SD_{diff}/\sqrt{N}} \quad \text{and} \quad t_U = \frac{M_1 - M_2 - \Delta_U}{SD_{diff}/\sqrt{N}} \tag{6}$$
where $SD_{diff}$ is the SD of the difference scores and $N$ is the number of pairs. The bounds $\Delta_L$ and $\Delta_U$ can be defined in raw scores, or in a standardized bound based on Cohen's $d_z$, where $\Delta = d_z \times SD_{diff}$, or $d_z = \Delta/SD_{diff}$. Equation 5 can be used for a priori power analyses by inserting Cohen's $d_z$ instead of Cohen's d. The number of pairs needed to achieve a desired level of power when using Cohen's $d_z$ is half the number of observations needed in each between-subject condition specified in Table 1.
There are no suggested benchmarks of small, medium, and large effects for Cohen's $d_z$. We can consider two approaches to determining benchmarks. The first is to use the same benchmarks for Cohen's $d_z$ as for Cohen's d. This assumes r = .5, when Cohen's d and Cohen's $d_z$ are identical (see Note 3). A second approach is to scale the benchmarks for Cohen's $d_z$ based on the sample size we need to reliably detect an effect. For example, in an independent t test, 176 participants are required in each condition to achieve 80% power for d = .3 and $\alpha = .05$. With 176 pairs of observations and $\alpha = .05$, a study has 80% power for a Cohen's $d_z$ of .212. The relationship between d and $d_z$ is a factor of $\sqrt{2}$, which means we can translate the benchmarks for Cohen's d for small (.2), medium (.5), and large (.8) effects into benchmarks for Cohen's $d_z$ of small (.14), medium (.35), and large (.57). There is no objectively correct way to set benchmarks for Cohen's $d_z$. I leave it up to the reader to determine whether either of these approaches is useful.
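The dependent-means test can be sketched in the same way as the independent case, with the bounds optionally specified in Cohen's d_z and converted to raw units via $\Delta = d_z \times SD_{diff}$. The function name `tost_paired`, its arguments, and the example values are hypothetical; this is an illustration of the calculation, not the package's implementation.

```python
import math


def tost_paired(mean_diff, sd_diff, n_pairs, low_dz, high_dz):
    """Return (t_lower, t_upper, df) for dependent means; bounds given in Cohen's d_z."""
    # Convert standardized bounds to raw units: Delta = d_z * SD_diff
    low_raw, high_raw = low_dz * sd_diff, high_dz * sd_diff
    se = sd_diff / math.sqrt(n_pairs)   # standard error of the mean difference
    t_lower = (mean_diff - low_raw) / se
    t_upper = (mean_diff - high_raw) / se
    return t_lower, t_upper, n_pairs - 1


# Hypothetical data: mean difference 0.1, SD of differences 2.0, 100 pairs.
print(tost_paired(0.1, 2.0, 100, -0.3, 0.3))
```

Passing raw bounds instead would only require skipping the conversion step.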
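Both approaches to benchmarks can be checked with a few lines of arithmetic. The sketch below uses the identity $d_z = d/\sqrt{2(1 - r)}$, which is an assumption added here (it holds when both measurements have equal SDs) rather than a formula stated in the text; it reproduces $d_z = d$ at r = .5 and the factor of $\sqrt{2}$ at r = 0. The function name `dz_from_d` is hypothetical.

```python
import math


def dz_from_d(d, r):
    """Convert Cohen's d to Cohen's d_z for correlation r, assuming equal SDs."""
    return d / math.sqrt(2 * (1 - r))


# First approach: with r = .5 the two effect sizes coincide.
print(dz_from_d(0.5, 0.5))

# Second approach: dividing the benchmarks for d by the square root of 2.
for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(label, round(dz_from_d(d, 0.0), 2))
```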

Equivalence Tests for One-Sample t Tests
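The t-values for the two one-sided tests for a one-sample t test are:
$$t_L = \frac{M - \mu - \Delta_L}{SD/\sqrt{N}} \quad \text{and} \quad t_U = \frac{M - \mu - \Delta_U}{SD/\sqrt{N}} \tag{7}$$
where M is the observed mean, SD is the observed standard deviation, N is the sample size, $\Delta_L$ and $\Delta_U$ are lower and upper equivalence bounds, and $\mu$ is the value that the mean is tested against.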
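As a brief illustration, the one-sample case differs from the dependent-means case only in that the reference value is subtracted from the observed mean. The function name `tost_one_sample` and the numbers below are hypothetical, and the bounds are assumed to be in raw units.

```python
import math


def tost_one_sample(mean, sd, n, mu, low_bound, high_bound):
    """Return (t_lower, t_upper, df) for a one-sample TOST with raw bounds."""
    se = sd / math.sqrt(n)
    t_lower = (mean - mu - low_bound) / se
    t_upper = (mean - mu - high_bound) / se
    return t_lower, t_upper, n - 1


# Hypothetical data: observed mean 3.1 tested against 3.0, bounds of -0.3 and 0.3.
print(tost_one_sample(3.1, 1.0, 100, 3.0, -0.3, 0.3))
```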

Equivalence Tests for Correlations
Equivalence tests can also be performed on correlations, where the two one-sided tests aim to reject correlations smaller than a lower equivalence bound ($r_L$) and larger than an upper equivalence bound ($r_U$). I follow Goertzen and Cribbie (2010), who use Fisher's z transformation on the correlations, after which test statistics are calculated that can be compared against the normal distribution:
$$Z_L = \frac{z_r - z_{r_L}}{\sqrt{1/(N - 3)}} \quad \text{and} \quad Z_U = \frac{z_r - z_{r_U}}{\sqrt{1/(N - 3)}}, \qquad z_r = \frac{1}{2}\ln\frac{1 + r}{1 - r} \tag{8}$$
where N is the sample size. With the statistics defined in this direction, the two one-sided tests are rejected if $Z_L \ge z_\alpha$ and $Z_U \le -z_\alpha$, where $z_\alpha$ is the upper critical value of the standard normal distribution. Benchmarks for small, medium, and large effects, which can be used to set equivalence bounds, are r = .1, r = .3, and r = .5. Power analysis for correlations can be performed by converting r to Cohen's d using:
$$d = \frac{2r}{\sqrt{1 - r^2}} \tag{9}$$
after which Equation 5 can be used. This approach is used by, for example, G*Power (Faul, Erdfelder, Lang, & Buchner, 2007).
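The correlation test can be sketched with the standard library alone, because the test statistics are compared against the normal distribution. The sketch assumes the statistics are defined as the Fisher-transformed observed correlation minus the transformed bound, divided by $\sqrt{1/(N - 3)}$; the function names and example values are hypothetical, and the r-to-d conversion uses the common formula $d = 2r/\sqrt{1 - r^2}$.

```python
import math
from statistics import NormalDist


def fisher_z(r):
    """Fisher's z transformation of a correlation."""
    return 0.5 * math.log((1 + r) / (1 - r))


def tost_correlation(r, n, low_r, high_r):
    """Return (z_lower, z_upper, p) where p is the larger of the two one-sided p-values."""
    se = math.sqrt(1 / (n - 3))
    z_lower = (fisher_z(r) - fisher_z(low_r)) / se
    z_upper = (fisher_z(r) - fisher_z(high_r)) / se
    p_lower = 1 - NormalDist().cdf(z_lower)   # H0: true correlation <= lower bound
    p_upper = NormalDist().cdf(z_upper)       # H0: true correlation >= upper bound
    return z_lower, z_upper, max(p_lower, p_upper)


def r_to_d(r):
    """Convert a correlation to Cohen's d for use in the power equation."""
    return 2 * r / math.sqrt(1 - r ** 2)


# Hypothetical study: r = .05 in N = 200, bounds set to the medium benchmark.
print(tost_correlation(0.05, 200, -0.3, 0.3))
print(round(r_to_d(0.3), 3))
```

Reporting the larger of the two p-values is a convenient summary: equivalence is concluded only when both one-sided tests are significant.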

Equivalence Test for Meta-Analyses
Rejecting small effects in an equivalence test requires large samples. If researchers want to perform an equivalence test with narrow equivalence bounds (e.g., $\Delta_L = -.1$ and $\Delta_U = .1$), in most cases, only a meta-analysis will have sufficient statistical power. Rogers, Howard, and Vessey (1993) explain the straightforward approach to performing equivalence tests for meta-analyses:
$$Z_L = \frac{\Delta - \Delta_L}{SE} \quad \text{and} \quad Z_U = \frac{\Delta - \Delta_U}{SE} \tag{10}$$
where $\Delta$ is the meta-analytic effect size (Cohen's d or Hedges' g), and SE is the meta-analytic standard error (or $\sqrt{\mathrm{var}}$). These values can be calculated with meta-analysis software such as metafor (Viechtbauer, 2010). With the statistics defined in this direction, the two one-sided tests are rejected if $Z_L \ge z_\alpha$ and $Z_U \le -z_\alpha$, where $z_\alpha$ is the upper critical value of the standard normal distribution. Alternatively, the 90% CI can be reported. If the 90% CI falls within the equivalence bounds, the observed meta-analytic effect is statistically equivalent.
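The meta-analytic test and the 90% CI criterion lead to the same decision, which the following sketch illustrates. It assumes the effect size and standard error have already been obtained from meta-analysis software; the function name `tost_meta` and the example inputs are hypothetical.

```python
from statistics import NormalDist


def tost_meta(effect, se, low_bound, high_bound, alpha=0.05):
    """Return (z_lower, z_upper, ci90, equivalent) for a meta-analytic effect size."""
    norm = NormalDist()
    z_lower = (effect - low_bound) / se
    z_upper = (effect - high_bound) / se
    p_lower = 1 - norm.cdf(z_lower)    # H0: true effect <= lower bound
    p_upper = norm.cdf(z_upper)        # H0: true effect >= upper bound
    z_crit = norm.inv_cdf(1 - alpha)   # one-sided critical value
    ci90 = (effect - z_crit * se, effect + z_crit * se)
    equivalent = max(p_lower, p_upper) < alpha
    return z_lower, z_upper, ci90, equivalent


# Hypothetical meta-analytic estimates with bounds of -0.1 and 0.1.
print(tost_meta(0.02, 0.03, -0.1, 0.1))
print(tost_meta(0.08, 0.03, -0.1, 0.1))
```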

Setting Equivalence Bounds
In psychology, most theories do not state which effects are too small to be interpreted as support for the proposed underlying mechanism. Instead, feasibility considerations are often the strongest determinant of the effect sizes a researcher can reliably examine. In daily practice, researchers have a maximum sample size they are willing to collect in a single study (e.g., 100 participants in each between-subject condition). Given a desired level of statistical power (e.g., 80%) and a specific $\alpha$ (e.g., .05), this implies a smallest effect size they find worthwhile to examine or a SESOI (Lakens, 2014) they can reliably examine. Based on a sensitivity analysis in power analysis software (such as G*Power), we can calculate that with 100 participants in each condition, 80% desired power, and an $\alpha$ of .05, the SESOI in a null effect significance test is $\Delta = 0.389$; and using the power analysis calculation for an equivalence test for independent samples, assuming a true effect size of 0, 80% power is achieved when $\Delta_L = -0.414$ and $\Delta_U = 0.414$. As such, without practical boundaries or theoretical boundaries that indicate which effect size is meaningful, the maximum sample size you are willing to collect implicitly determines your SESOI. Therefore, setting equivalence boundaries to your SESOI in an equivalence test allows you to reject effect sizes larger than you find worthwhile to examine, given available resources. When researchers are not willing (or not able) to collect a decent sample size, the extremely large equivalence bounds will make it clear they can at best reject extremely large effects, but that their data are not informative about the presence or absence of a wide range of plausible and interesting effect sizes. This recommendation differs from practices in drug development, where equivalence bounds are often set by regulations (e.g., differences up to 20% are not considered to be clinically relevant).
In psychology, such general regulations about what constitutes a meaningful effect seem unlikely to emerge and perhaps even undesirable. Using equivalence bounds based on effect sizes a researcher finds worthwhile to examine does not allow psychologists to conclude an effect is too small to be meaningful for anyone. When other researchers believe a smaller effect size is plausible and theoretically interesting, they can design a study with a larger sample size to examine the effect. In randomized controlled trials, it is expected that equivalence bounds are prespecified (e.g., see CONSORT guidelines; Piaggio et al., 2006), and this should also be considered best practice in psychology. When in the abstract of an article, authors conclude an effect is "statistically equivalent," the abstract should also include the equivalence bounds that are used to draw this conclusion. Simonsohn (2015) proposes to test for inferiority for replication studies (an equivalence test where the lower bound is set to negative infinity). He suggests setting the upper equivalence bound in a replication study to the effect size that would have given an original study 33% power. For example, an original study with 60 participants divided equally across two independent groups has 33% power to detect an effect of d = .4, so $\Delta_U$ is set to d = .4. This approach limits the sample size required to test for equivalence to 2.5 times the sample size of the original study. The goal is not to show the effect is too small to be feasible to study but too small to have been reliably detected by the original experiment, thus casting doubt on the original observation.
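Two of the calculations in this section can be approximated with the normal distribution. The sketch below first inverts the power equation to find the symmetric bound that a given sample size can reject with 80% power, and then approximates the effect size an original study had 33% power to detect. Both functions are illustrative assumptions rather than the procedures used by G*Power or by Simonsohn (2015): the names are hypothetical, and the second function ignores the negligible opposite tail and uses z rather than t, so its results are only approximate.

```python
import math
from statistics import NormalDist

Z = NormalDist()


def equivalence_bound_for_n(n_per_group, alpha=0.05, power=0.80):
    """Symmetric bound (Cohen's d) rejectable with the desired power, true effect = 0."""
    z_alpha = Z.inv_cdf(1 - alpha)
    z_beta_half = Z.inv_cdf(1 - (1 - power) / 2)
    return math.sqrt(2 * (z_alpha + z_beta_half) ** 2 / n_per_group)


def small_telescopes_bound(n_per_group_original, alpha=0.05, power=0.33):
    """Approximate d that a two-sided independent test detects with the given power."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)   # negative when power is below 50%
    return (z_crit + z_power) / math.sqrt(n_per_group_original / 2)


print(round(equivalence_bound_for_n(100), 3))   # bound implied by 100 per group
print(round(small_telescopes_bound(30), 2))     # original study with 30 per group
```

The first value agrees with the bounds of ±0.414 mentioned for 100 participants per condition, and the second is close to the d = .4 given for an original study with 60 participants.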
If feasibility constraints are practically absent (e.g., in online studies), another starting point to set equivalence bounds is by setting bounds based on benchmarks for small, medium, and large effects. Although using these benchmarks to interpret effect sizes is typically recommended as a last resort (e.g., Lakens, 2013), their use in setting equivalence bounds seems warranted by the lack of other clear-cut recommendations. By far the best solution would be for researchers to specify their SESOI when they publish an original result or describe a theoretical idea (Morey & Lakens, 2017). The use of equivalence testing will no doubt lead to a discussion about which effect sizes are too small to be worthwhile to examine in specific research lines in psychology, which in itself is progress.

Discussion
Equivalence tests are a simple adaptation of traditional significance tests that allow researchers to design studies that reject effects larger than prespecified equivalence bounds. They allow researchers to reject effects large enough to be considered worthwhile. Adopting equivalence tests will prevent the common misinterpretations of nonsignificant p values as the absence of an effect and nudge researchers toward specifying which effects they find worthwhile. The simple spreadsheet and R package provided here, which perform power calculations and equivalence tests for common statistical tests in psychology, should enable researchers to easily improve their research practices.
Rejecting effects more extreme than the equivalence bounds implies that we can conclude equivalence for a specific operationalization of a hypothesis. It is possible that a meaningful effect would be observed with a different manipulation or measure. Confounds can underlie observed equivalent effects. An additional nonstatistical challenge in interpreting equivalence concerns the issue of whether an experiment was performed competently (Senn, 2007). Complete transparency (sharing all materials) is a partial solution since it allows peers to evaluate whether the experiment was well designed (Morey et al., 2016), but this issue is not easily resolved when the actions of an experimenter might influence the data. In such experiments, even blinding the experimenter to conditions is no solution since an experimenter can interfere with the data quality of all conditions. This is an inherent asymmetry between demonstrating an effect and demonstrating the absence of a worthwhile effect. The only solution for anyone skeptical about studies demonstrating equivalence is to perform an independent replication.
Equivalence testing is based on a Neyman-Pearson hypothesis testing approach that allows researchers to control error rates in the long run and design studies based on a desired level of statistical power. Error rates in equivalence tests are controlled at the $\alpha$ level when the true effect equals the equivalence bound. When the true effect is more extreme than the equivalence bounds, error rates are smaller than the $\alpha$ level. It is important to take statistical power into account when determining the equivalence bounds because, in small samples (where CIs are wide), a study might have no statistical power (i.e., the CI will always be so wide that it is necessarily wider than the equivalence bounds).
There are alternative approaches to the TOST procedure. Updated versions of equivalence tests exist, but their added complexity does not seem to be justified by the small gain in power (for a discussion, see Meyners, 2012). There are also alternative approaches to providing statistical support for a small or null effect, such as estimation (calculating effect sizes and CIs), specifying a region of practical equivalence (Kruschke, 2010), or calculating Bayes factors (Dienes, 2014; Rouder, Speckman, Sun, Morey, & Iverson, 2009). Researchers should report effect size estimates in addition to hypothesis tests. Since Bayesian and frequentist tests answer complementary questions, with Bayesian statistics quantifying posterior beliefs, and frequentist statistics controlling Type 1 and Type 2 error rates, these tests can be reported side by side.
Other fields are able to use raw measures due to the widespread use of identical measurements (e.g., the number of deaths, the amount of money spent), but in some subfields in psychology the variability in the measures that are collected requires standardized effect sizes to make comparisons across studies (Cumming & Fidler, 2009). A consideration when using standardized effect sizes as equivalence bounds is that in two studies with the same mean difference and CIs in raw scale units (e.g., a difference of 0.2 on a 7-point scale with 90% CI [−0.13, 0.17]), the same standardized equivalence bounds can lead to different significance levels in an equivalence test. The reason for this is that the pooled SD can differ across the studies, and as a consequence, the same equivalence bounds in standardized scores imply different equivalence bounds in raw scores. If this is undesirable, researchers should specify equivalence bounds in raw scores instead.
Ideally, psychologists could specify equivalence bounds in raw mean differences based on theoretical predictions or cost-benefit analyses, instead of setting equivalence bounds based on standardized benchmarks. My hope is that as equivalence tests become more common in psychology, researchers will start to discuss which effect sizes are theoretically expected while setting equivalence bounds. When theories do not specify which effect sizes are too small to be meaningful, theories can't be falsified. Whenever a study yields no statistically significant effect, one can always argue that there is a true effect that is smaller than the study could reliably detect (Morey & Lakens, 2017). Maxwell, Lau, and Howard (2015) suggest that replication studies demonstrate the absence of an effect by using equivalence bounds of $\Delta_L = -.1$ and $\Delta_U = .1$ or even $\Delta_L = -.05$ and $\Delta_U = .05$. I believe this creates an imbalance where we condone original studies that fail to make specific predictions, while replication studies are expected to test extremely specific predictions that can only be confirmed by collecting huge numbers of observations.
Extending your statistical tool kit with equivalence tests is an easy way to improve your statistical and theoretical inferences. The TOST procedure provides a straightforward approach to reject effect sizes that one considers large enough to be worthwhile to examine.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes
1. As Wellek (2010, p. 30) notes, for all practical purposes, one can simply specify a very large value for the infinite equivalence bound.
2. A 90% confidence interval (CI; $1 - 2\alpha$) is used instead of a 95% CI ($1 - \alpha$) because two one-sided tests (each with an $\alpha$ of 5%) are performed.
3. The author would like to thank Jake Westfall for this suggestion.