Researchers’ Intuitions About Power in Psychological Research

Many psychology studies are statistically underpowered. In part, this may be because many researchers rely on intuition, rules of thumb, and prior practice (along with practical considerations) to determine the number of subjects to test. In Study 1, we surveyed 291 published research psychologists and found large discrepancies between their reports of their preferred amount of power and the actual power of their studies (calculated from their reported typical cell size, typical effect size, and acceptable alpha). Furthermore, in Study 2, 89% of the 214 respondents overestimated the power of specific research designs with a small expected effect size, and 95% underestimated the sample size needed to obtain .80 power for detecting a small effect. Neither researchers’ experience nor their knowledge predicted the bias in their self-reported power intuitions. Because many respondents reported that they based their sample sizes on rules of thumb or common practice in the field, we recommend that researchers conduct and report formal power analyses for their studies.
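The "actual power" mentioned above follows from a standard power calculation given a cell size, an expected effect size, and an alpha level. As a rough illustration (not the paper's computation), the following Python sketch approximates the power of a two-sided, two-sample t test using the normal approximation to the t distribution; the inputs d = 0.5 and n = 25 per cell are hypothetical example values, not survey results.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_two_sample(d, n_per_cell, alpha=0.05):
    """Approximate power of a two-sided two-sample t test with
    n_per_cell subjects per group and standardized effect size d,
    using the normal approximation to the t distribution."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    delta = d * sqrt(n_per_cell / 2)  # noncentrality of the test statistic
    return 1 - NormalDist().cdf(z_crit - delta)

# Hypothetical example: d = 0.5, 25 subjects per cell, alpha = .05
print(round(approx_power_two_sample(0.5, 25), 2))  # ≈ 0.42
```

Because the normal approximation ignores the heavier tails of the t distribution, it slightly overstates power for small cells; exact calculations use the noncentral t distribution (as in, e.g., G*Power).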

Table S1 reports the 20% trimmed means and the medians of the variables presented in Table 1, separately for participants in the researcher and reviewer conditions. The distributions of the variables are given in Figure S1. The unjust removal of outliers can increase the Type I error rate, whereas keeping genuine outliers and non-normal distributions in the data can decrease the power of parametric tests. Furthermore, decisions about outliers are often highly subjective, and these choices may be susceptible to p-hacking or other biases. We therefore use robust statistics, which retain high power when assumptions are violated (see Bakker & Wicherts, 2014, for an extensive review of this issue). We report 20% trimmed means and use the Yuen-Welch test to compare two independent groups.
Table S1. Trimmed means (medians) of typical (for researchers) and desired (for reviewers) alpha, effect size, N, and power given by the respondents, and the power estimates and bias.
Notes: The typical ES and N given by the participants contained some extreme outliers that distorted the plots. We therefore show only the histograms of responses within the range 0 to 3 (for d) and 0 to 501 (for N). Excluded from the plots are one participant (researcher condition) with d = 99999, one participant (reviewer condition) with d = 5, and one participant (researcher condition) with N = 10000.
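The analyses in this supplement were run in R (with the WRS package); as an illustration only, the pure-Python sketch below shows what the two robust statistics compute: a 20% trimmed mean, and the Yuen-Welch statistic, which compares trimmed means using winsorized variances. The p-value step, which requires the t distribution, is omitted to keep the sketch dependency-free.

```python
from math import floor, sqrt

def trimmed_mean(x, trim=0.20):
    """20% trimmed mean: drop the lowest and highest 20% of observations."""
    xs = sorted(x)
    g = floor(trim * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

def yuen_statistic(x, y, trim=0.20):
    """Yuen-Welch statistic and df for comparing two trimmed means."""
    def winsorized_var(v, g):
        # Replace the g lowest/highest values by their nearest kept neighbor.
        vs = sorted(v)
        w = [vs[g]] * g + vs[g:len(vs) - g] + [vs[-g - 1]] * g
        m = sum(w) / len(w)
        return sum((u - m) ** 2 for u in w) / (len(w) - 1)
    gx, gy = floor(trim * len(x)), floor(trim * len(y))
    hx, hy = len(x) - 2 * gx, len(y) - 2 * gy  # effective (trimmed) sizes
    dx = (len(x) - 1) * winsorized_var(x, gx) / (hx * (hx - 1))
    dy = (len(y) - 1) * winsorized_var(y, gy) / (hy * (hy - 1))
    t = (trimmed_mean(x, trim) - trimmed_mean(y, trim)) / sqrt(dx + dy)
    df = (dx + dy) ** 2 / (dx ** 2 / (hx - 1) + dy ** 2 / (hy - 1))
    return t, df

# Example: one extreme outlier barely moves the 20% trimmed mean.
print(trimmed_mean([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))  # 5.5
```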

Differences between researchers and reviewers
We did not find significant differences in typical α, ES, N, or reported power between the researcher and reviewer conditions. The typical α had no variance (after trimming), so the Yuen test could not be applied; ES: t_y(159.

Statistical knowledge
To see whether respondents' self-assessed statistical knowledge was related to better power intuitions, we correlated the calculated power and bias with respondents' self-reported statistical knowledge (Spearman's rank-order correlation was used because of non-normality). In both conditions, we failed to find significant correlations (power, researcher condition: r_s = -.01, p = .865; power, reviewer condition: r_s = .03, p = .763; bias, researcher condition: r_s = -.11, p = .144; bias, reviewer condition: r_s = -.10, p = .265).
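Spearman's rank-order correlation is simply the Pearson correlation computed on the ranks of the data, which makes it robust to the non-normality noted above. A minimal pure-Python sketch (for illustration; the reported analyses were run in R):

```python
def ranks(x):
    """Midranks: tied values receive the average of their rank positions."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rank-order correlation: Pearson r on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Any monotone relation yields r_s = 1, even if it is nonlinear.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))  # 1.0
```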

Number of publications
We also investigated whether respondents' number of publications was related to their power intuitions. Table S2 presents the trimmed means of the calculated power and bias for the different publication categories. A robust regression (using the rlm() function of the MASS package in R) with condition and number of publications as predictors failed to show a significant main effect of number of publications or an interaction between research output and condition. Number of publications and its interaction with condition also failed to significantly predict α, ES, and power. However, number of publications positively predicted N (b = 3.58, p = .002).
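By default, MASS::rlm() fits a regression by iteratively reweighted least squares with Huber weights (tuning constant k = 1.345). The pure-Python sketch below mimics that idea for a single predictor; it is an illustration of the technique, not the code used for the reported analysis.

```python
def huber_rlm(x, y, k=1.345, iters=50):
    """Simple robust regression (intercept a, slope b) via iteratively
    reweighted least squares with Huber weights, roughly what
    MASS::rlm() does by default for one predictor."""
    n = len(x)
    w = [1.0] * n
    a, b = 0.0, 0.0
    for _ in range(iters):
        # Weighted least-squares step.
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        b = sxy / sxx
        a = my - b * mx
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled for normality.
        s = sorted(abs(r) for r in resid)[n // 2] / 0.6745
        if s == 0:
            break  # perfect fit; nothing left to reweight
        # Huber weights: downweight points with large residuals.
        w = [1.0 if abs(r) <= k * s else k * s / abs(r) for r in resid]
    return a, b
```

With one gross outlier, the Huber fit stays close to the trend of the remaining points, whereas ordinary least squares would be pulled far off.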

Research field
We investigated possible differences in power intuitions between the different research fields. Table S3 presents the trimmed means of the calculated power and bias for each research field separately. Because only two participants indicated forensic psychology as their main field of research, we could not include them in a two-way ANOVA of trimmed means. We used the function t2way() of the WRS package, which does not report df or ES. Furthermore, the results for the main effects of condition differ slightly from those of the robust t test used above, because the means are trimmed in every cell (9 × 2) rather than in only two cells. We did not find a main effect of research field on estimated power (F_t = 7.32, p = .622) or an interaction between field and condition (F_t = 15.26, p = .155). Similarly, bias showed neither a main effect of research field (F_t = 3.01, p = .951) nor an interaction between field and condition (F_t = 7.69, p = .580).
We found no differences between subfields, and no interactions with condition, for the reported α and reported power (both had no variance after trimming). For ES, we did not find a main effect of subfield (F_t = 16.95, p = .113), but we did find a significant interaction with condition (F_t = 23.18, p = .032). The estimated ES differed between the conditions for participants whose main field of research was health psychology, personality psychology, or social psychology, with trimmed mean ESs in the researcher condition of M_t = 0.29, 0.40, and 0.41, respectively, and in the reviewer condition of M_t = 0.46, 0.27, and 0.30, respectively. We also found a main effect of subfield on N (F_t = 21.44, p = .032), but no interaction with condition (F_t = 17.31, p = .081).

Additional questions
In Study 1 we included a question about whether respondents would prefer to conduct (or, as a reviewer, see in a manuscript) multiple small studies or one large study. We found differences between the conditions in whether respondents preferred 5 studies (N = 20 each), 4 studies (N = 25 each), 2 studies (N = 50 each), or 1 study (N = 100; see Table S4). A 2 (researcher vs. reviewer) by 4 (number of studies) χ² test was significant (χ²(3) = 23.3, p < .001, φ = .28). A majority of the participants who answered the question from a researcher's perspective preferred one large study, whereas most participants who answered from a reviewer's perspective preferred two smaller studies. Study 1 also included a question about outliers, with two versions in each condition. In one version, removing the outliers would change the results from nonsignificant to significant; in the other version, the results were significant and not substantially different with or without the outliers. We asked the respondents whether they would report the results with the outliers, the results without the outliers, both results, or other. The results are presented in Table S5; they show a strong preference for reporting both results.
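The χ² test above follows the usual contingency-table computation, and for a 2 × k table the reported effect size is φ = √(χ²/N) (here √(23.3/291) ≈ .28, matching the value in the text). A minimal sketch with made-up counts standing in for Table S4:

```python
def chi_square_independence(table):
    """Pearson chi-square for an r x c contingency table, plus the
    effect size phi (Cramer's V; for a 2 x k table this is sqrt(chi2/N))."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    v = (chi2 / (n * (min(len(rows), len(cols)) - 1))) ** 0.5
    return chi2, v

# Hypothetical 2 (condition) x 4 (preferred number of studies) counts;
# these are NOT the survey's data. df = (2-1)*(4-1) = 3; look up p in
# a chi-square table.
counts = [[10, 15, 30, 65],   # researcher condition
          [12, 20, 55, 40]]   # reviewer condition
chi2, phi = chi_square_independence(counts)
print(chi2, phi)
```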

Invitation letter
Dear fellow psychological scientist,
We are contacting you because you published an article in a high-impact journal in 2014. We would be very grateful if you would complete a short survey about statistical intuitions in research. The survey consists of 10 short questions and will take no more than 3 minutes of your time.
To participate in the survey, please click on the following link: ${l://SurveyLink?d=Take the Survey}
Or copy and paste the URL below into your internet browser: ${l://SurveyURL}
All results will be analyzed in aggregate only, and answers will never be associated with individual participants. We disabled any form of IP-address logging on Qualtrics.
We would greatly appreciate your participation.

Reminder prompt
Dear fellow psychological scientist,
This is a quick reminder of our request to take part in our survey on statistical intuitions. The survey consists of 10 short questions and will take approximately 3 minutes of your time.
To participate in the survey, please click on the following link: ${l://SurveyLink?d=Take the Survey}
Or copy and paste the URL below into your internet browser: ${l://SurveyURL}
All results will be analyzed in aggregate only, and answers will never be associated with individual participants. We disabled any form of IP-address logging on Qualtrics.

Tables S6 and S7 contain the average power and sample size estimates based on all available data. They show the same patterns as presented in the paper.

Other factors
To investigate the influence of other factors, we summarized the three questions (actual knowledge of power, how often the respondent conducted power analyses, and a self-assessment of statistical knowledge) by means of a PCA. The first component explained 50% of the variance, and we used the component scores (CS) to investigate whether these scores predicted estimates of power and sample sizes. We used hierarchical regression analyses. In the first model we included only CS; in the second model we added condition by means of two dummy-coded variables, with the small sample size condition as the reference category (D1 = 1 when the sample size condition is medium; D2 = 1 when the sample size condition is large); and in the third model we added the interaction between CS and the dummy-coded condition. Table S8 presents the results of the hierarchical regression analyses with power estimates as the dependent variable, separately for small, medium, and large underlying ES. Table S9 reports these results with sample size estimates as the dependent variable. We selected the best-fitting model (bold text) based on the R² change. Because differences between the conditions are expected when power estimates are the dependent variable, we focus on the effect of CS. For all three underlying ESs, Model 2 fitted best, and only when the ES was medium or large did respondents with a high CS have higher (and hence more accurate) power estimates (b = 0.021 and b = 0.041, respectively). With sample size as the dependent variable, we did not expect any differences between the conditions. Nevertheless, we observed an effect of condition when the underlying ES was medium or large; these are probably carry-over effects. Furthermore, when the underlying ES was large, respondents with a high CS provided smaller sample size estimates (b = -12.89), again resulting in estimates closer to the true value.
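Extracting first-component scores from three items amounts to standardizing each item, finding the leading eigenvector of their 3 × 3 correlation matrix, and weighting the z-scores by that eigenvector. A minimal pure-Python sketch using power iteration (an illustration only; the item data in the test are hypothetical, and items are assumed to have nonzero variance):

```python
def first_component_scores(items):
    """First principal component of standardized item scores.
    `items` is a list of equal-length variables (here: three survey
    questions). Returns the share of variance explained and the
    component scores for each respondent."""
    n = len(items[0])
    # Standardize each item (z-scores; assumes nonzero SD).
    z = []
    for v in items:
        m = sum(v) / n
        sd = (sum((x - m) ** 2 for x in v) / (n - 1)) ** 0.5
        z.append([(x - m) / sd for x in v])
    p = len(z)
    # Correlation matrix of the standardized items.
    corr = [[sum(z[i][t] * z[j][t] for t in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]
    # Power iteration for the leading eigenvector.
    w = [1.0] * p
    for _ in range(200):
        w = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue.
    eigval = sum(w[i] * sum(corr[i][j] * w[j] for j in range(p))
                 for i in range(p))
    explained = eigval / p  # share of total (standardized) variance
    scores = [sum(w[i] * z[i][t] for i in range(p)) for t in range(n)]
    return explained, scores
```

When the three items are highly correlated, as a single underlying "statistical sophistication" factor would imply, the first component captures most of the variance, here reported as 50%.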
To investigate the two questions with a continuous scale separately, we ran the same hierarchical regression analyses with how often the respondent conducted power analyses (PA) as predictor (Tables S10 and S11) and self-assessed statistical knowledge (SK) as predictor (Tables S12 and S13). With PA as predictor and power estimates as the dependent variable, Model 2 was selected for all three underlying ESs. PA was a significant positive predictor only when the underlying ES was medium (b = 0.016) or large (b = 0.020); respondents who conducted power analyses more often thus gave higher, and hence more accurate, power estimates. With sample size as the dependent variable, we expected no difference between the conditions. Nevertheless, we observed the same pattern as with CS: an effect of condition when the underlying ES was medium or large. Again, these results are probably due to carry-over effects. Furthermore, when the underlying ES was large, respondents with higher PA gave smaller sample size estimates (b = -12.67), again resulting in estimates closer to the true value. When the underlying ES was small, Model 3 seemed to fit significantly better than Model 2; however, Model 3 was not statistically significant overall (F(5, 208) = 1.873, p = .100).
With SK as predictor and power estimates as the dependent variable, Model 2 was selected for all underlying ESs. SK was a significant positive predictor only when the underlying ES was large (b = 0.022). With sample size as the dependent variable, we again observed the same pattern as with CS: an effect of condition when the underlying ES was medium or large. SK was not a significant predictor for any of the underlying ESs.
To investigate the influence of actual knowledge of power, which was scored dichotomously, on power and sample size estimates, we used a robust 3 (condition) by 2 (correct/incorrect) ANOVA for trimmed means. With a small ES, we found an interaction between condition and answering the knowledge question correctly on the power estimates (F_t = 9.350, p = .027), which makes it difficult to interpret any differences between participants who answered the question correctly and those who answered it incorrectly. This question also predicted the sample size estimates when the ES was small (F_t = 4.721, p = .035): participants who answered the question correctly gave higher sample size estimates, which were closer to the true value. When the ES was medium, we found an interaction between condition and answering the question correctly on the sample size estimates (F_t = 10.450, p = .012). Participants in the small and medium sample size conditions who answered the power question incorrectly gave somewhat higher sample size estimates than participants who answered it correctly; for participants in the large sample size condition, this relation was reversed.