Effect Size Reporting Practices in Applied Linguistics Research: A Study of One Major Journal

Many surveys of effect size (ES) reporting practices have been conducted in social science fields such as psychology and education, but few such studies are available in applied linguistics. To bridge this gap and to echo recent calls for more robust statistics from scholars in applied linguistics and beyond, this study represents the first attempt in the field of applied linguistics to focus upon ES reporting practices. With an innovative "two-standards" approach to coding, which overcomes a limitation of similar studies in other social science fields (e.g., communication), this study assesses ES reporting practices over a span of 6 years in a major journal. Findings include the following: (a) the ES reporting rate is about 50%, and (b) some improvement in ES reporting over time is in evidence. Future research directions (e.g., examining whether and how ES is interpreted after being reported) are suggested.


Introduction
The importance of effect size vis-à-vis the inherent limitations of Null Hypothesis Significance Testing (NHST; including the significance level, viz. the p value) has been underlined for five decades, not only in psychology (e.g., Hays, 1963) and education (American Educational Research Association, 2006) but also, more recently, in applied linguistics (Larson-Hall, 2010; Norris & Ortega, 2007; Oswald & Plonsky, 2010). The limitations of NHST are not delineated here due to space constraints (see Kline, 2004, for an excellent and detailed account, and Henson, 2006; Oswald & Plonsky, 2010; Sun & Fan, 2010, for summaries). One major limitation is that the p value depends upon the sample size; in other words, an increased sample size will eventually yield a small enough p value (viz. p of .05 or smaller; Biskin, 1998; Wasserstein & Lazar, 2016).
In contrast, effect size, simply put, is "an objective and (usually) standardized measure of the magnitude of observed effect" (Field, 2009, p. 56). Compared with the p value, effect size is much less influenced by sample size (cf. Fan & Konold, 2010). 1 In this sense, effect size is as important as, if not more important than, the significance level. Fan (2001) uses the two-sides-of-one-coin analogy to argue that the p value and effect size complement but do not substitute for each other, suggesting that researchers report both in their quantitative studies. Larson-Hall (2012) goes a step further by stating that "effect size is much more important than a null significance hypothesis test" (the p value included) (p. 472).
The importance of effect size notwithstanding, only a handful of journals in applied linguistics make such reporting mandatory in their editorial policies. While Larson-Hall (2010, p. 114) claims that "currently, the only journal in the second language research field which requires effect sizes is Language Learning," to our knowledge, TESOL Quarterly has required the reporting of effect size measures in quantitative studies since the early 2000s; since about the same time, The Modern Language Journal has released similar editorial policies with respect to effect size reporting (Plonsky & Gass, 2011). Other journals with such requirements include Language Learning & Technology, Language Testing, Second Language Research, and Studies in Second Language Acquisition. In contrast to the increasing awareness of the importance of effect size, little is known about the current status of effect size reporting in the field of applied linguistics. Although Plonsky (2013) and Lindstromberg (2016) note that the effect size reporting rates in their sampled papers are not high (25% and 49%, respectively), the focus of these studies is not on effect size reporting practices in the field. In contrast, many studies in such fields as education, psychology, and communication (e.g., Meline & Wang, 2004; Sun & Fan, 2010) have focused upon effect size reporting practices (see "Literature Review" section).
In view of the undue neglect of effect size reporting in applied linguistics, this article aims to contribute to our understanding by surveying such practices in System, subtitled An International Journal of Educational Technology and Applied Linguistics. This journal was selected for two reasons. First, it is a major journal in the field, as reflected in the fact that it is indexed in the Social Sciences Citation Index (SSCI) and has been regarded as "major" by previous studies (e.g., Benson, Chik, Gao, Huang, & Wang, 2009; Jung, 2004; Wang & Gao, 2008). Second, it does not mandate effect size reporting in its editorial policy, as is the case with most journals in the field, which means that it may better reflect the general situation of effect size reporting in applied linguistics than the few above-mentioned journals that do mandate such reporting.
This exploratory study focuses upon the effect size reporting practices concerning five statistical procedures: t test, analysis of variance (ANOVA), 2 correlation, regression, and chi-square (χ2) test. They were selected primarily because they are "the top five" most frequently used methods in four major second language acquisition (SLA) academic journals (Gass, 2009), and thus presumably the most frequently used in the wider field of applied linguistics (Larson-Hall, 2012). Furthermore, these five tests are also the focal methods in other social science fields such as communication (Sun & Fan, 2010) and education (Alhija & Levy, 2009); findings from our study may thus have wider implications beyond the field of applied linguistics per se.
Three research questions are pursued: (a) To what extent are measures of effect size reported? (b) Do the effect size reporting practices vary across the years? (c) For each of the five focal statistical methods, what is the effect size reporting rate, and what effect size measures are typically reported? In the remainder of this article, after providing a more detailed introduction to the definition and use of effect size with an illustrative example from a published study, we review relevant studies from such fields as education and psychology as well as from applied linguistics. We then report upon the data collection and analysis methods of our study. After presenting and discussing major findings, we conclude the article by offering suggestions for effect size reporting practices and for further studies that help contribute to the ongoing methodological reform in applied linguistics.

Definitions of Effect Size
Effect size is defined here as an objective and standardized measure of the magnitude of an observed effect, that is, Field's (2009, p. 56) concise definition cited above with the hedge "(usually)" removed. Although some other definitions (e.g., Meline & Wang, 2004; Sun & Fan, 2010) are broad enough to include nonstandardized forms (e.g., raw mean difference), it is strongly recommended that effect size measures be confined to standardized forms only, so as to maximize their benefits, such as letting "the reader compare effects across groups" and "meta-analysts compare studies even if they use different original measures" (Larson-Hall & Plonsky, 2015, p. 135). Furthermore, the danger of relying on raw mean difference, vis-à-vis the benefits of drawing upon standardized forms of effect size, will be illustrated with one authentic example of effect size reporting below.
Dozens of effect size measures are available, each with relative strengths and weaknesses for particular purposes (Ellis, 2010; Henson, 2006; Kirk, 1996). Two types 3 of effect sizes highly relevant to applied linguistics research are the d family and the r family. The d family is based on standardized measures of mean differences (e.g., Cohen's d), whereas the r family includes standardized measures of the strength of relations, based either on the proportion of variance accounted for (e.g., the R squared in regression) or on the correlation between two variables. Table 1 provides some frequently used effect size measures for the focal statistical methods; only Cohen's d and Hedges's g belong to the d family, and the others belong to the r family. Table 1 also lists some benchmarks for interpreting effect sizes recommended by Cohen (1988) and by researchers in applied linguistics. Cohen's (1988) benchmark system is best reserved "as a last resort" (Ellis, 2010, p. 42), although it has been used by too many researchers as ironclad criteria without reference to the measurements taken, the study design, or the practical importance of the findings. Whenever possible, researchers should try to interpret effect sizes by grounding them in a meaningful context (e.g., comparisons with previous studies vis-à-vis the measurements and study design) or by assessing their contribution to knowledge (e.g., in terms of practical or clinical value). The two benchmark systems from researchers in applied linguistics (see Table 1) provide more nuanced guidance for interpreting the effect size in question than Cohen's system: Plonsky and Oswald's (2014) benchmarks are highly relevant to experiment-based studies in what they called "L2 research," and Wei and Hu's (2018) to survey-based studies examining the effects of sociobiographical variables (e.g., gender and multilingualism) on (socio-)psychological variables (e.g., L2 joy and tolerance of ambiguity).
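Cohen's benchmark system can be expressed mechanically, which also makes plain how crude a last-resort instrument it is. The following Python sketch is our own illustration (the function name is hypothetical); the thresholds are Cohen's (1988) conventional cut-offs for the d family.

```python
def cohen_label_d(d, small=0.2, medium=0.5, large=0.8):
    """Label |d| against Cohen's (1988) d-family benchmarks.

    These cut-offs are a last resort: they ignore the measurements taken,
    the study design, and the practical importance of the findings.
    """
    d = abs(d)
    if d < small:
        return "negligible"
    if d < medium:
        return "small"
    if d < large:
        return "medium"
    return "large"
```

A researcher applying such a function verbatim would, for example, label both d = 0.51 and d = 0.79 as "medium" despite their very different practical implications, which is precisely why contextual interpretation is preferable.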
As Zientek, Capraro, and Capraro (2008) point out, "not reporting effect size can be detrimental" (p. 212). Presenting one authentic example helps drive home the consequences of failing to report effect sizes. Table 2 is adapted from Wei and Su's (2015) analysis of the respondents' self-reported data concerning their English spoken proficiency and other variables from the largest language survey in China. The major modification made to Wei and Su's (2015) original table was that we added a column containing Cohen's d values. We suggest that an effect size from either the r or the d family can be used, and in fact, one can be easily converted into the other (see Larson-Hall, 2010, pp. 117-119, for conversion formulas). Take t tests as an example. As indicated in Table 1, both r and Cohen's d can be used as effect size measures. Although many textbooks on statistical procedures rigidly recommend Cohen's d for t tests, Field's (2009) textbook is one interesting exception, in which he writes that "I'm going to stick with the effect size r because it's widely understood, frequently used, and yes, I'll admit it, I actually like it!" (p. 332).
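Since an effect size from either family can be converted into the other, the commonly cited conversion formulas (see Larson-Hall, 2010, pp. 117-119) can be sketched in Python as follows. This is an illustrative sketch of our own; the d-to-r formula assumes roughly equal group sizes.

```python
import math

def r_to_d(r):
    """Convert an r-family effect size to Cohen's d: d = 2r / sqrt(1 - r^2)."""
    return 2 * r / math.sqrt(1 - r ** 2)

def d_to_r(d):
    """Convert Cohen's d back to r: r = d / sqrt(d^2 + 4)."""
    return d / math.sqrt(d ** 2 + 4)

# A "medium" correlation of r = .3 corresponds to d of roughly 0.63,
# and the conversion round-trips back to the original r.
d = r_to_d(0.3)
r = d_to_r(d)
```

The two functions are algebraic inverses of each other, which is why reporting either family suffices for later meta-analysts.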

Consequences of Not Reporting Effect Size
The corresponding research question for Table 2 asks, with regard to English spoken proficiency, was there a significant difference between the national average and the city average for each of the seven selected cities? The authors answer this question with results (see Table 2) from a series of one-sample t tests.
Note (adapted from Wei and Su, 2015, p. 182). In each of the t tests, the degrees of freedom (df) equal the sample size concerned minus one. A five-point Likert-type scale was used for self-rated spoken proficiency, with 5 = able to act as interpreters on formal occasions, 4 = able to converse quite fluently, 3 = able to conduct daily conversations, 2 = able to say some greetings, and 1 = able to utter a few words. The national average was 1.928 (SD = 0.922) based on 55,737 valid responses.
Two important observations can be made regarding Table 2. First, if one relies on the raw mean differences (viz. the city mean minus the national mean) for Beijing (0.269) and Shenzhen (0.256), one might conclude that Beijing performed better than Shenzhen with the national average as a baseline. In fact, the opposite conclusion holds: Shenzhen performed better than Beijing, because the effect size for Shenzhen (0.326) was higher than that for Beijing (0.295). In this example, effect size, rather than raw mean difference, is the appropriate measure of the magnitude of the real difference. In other words, failure to use effect size and reliance upon unstandardized measures (e.g., raw mean difference) can lead to a completely opposite conclusion. Second, many researchers with traditional training tend to believe, erroneously, that "the smaller the p value, the larger the effect" (Zhang, 2009, p. 68); consequently, many might conclude that Shanghai, Dalian, and Tianjin performed equally well because their p values share the same level (0.000). However, the true scenario revealed by the effect sizes is that Shanghai (0.238) scored higher than the national average, Dalian (0.475) performed better than Shanghai, and Tianjin (0.572) better still; put differently, the actual difference (reflected by effect size rather than raw mean difference) between Tianjin and the nation was largest, whereas that between Shanghai and the nation was smallest. All in all, failing to report effect size along with the p value masks much of the (more) important information in the results of an inferential statistical procedure.
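To make the first observation concrete, the following Python sketch computes one-sample Cohen's d as the raw mean difference divided by the city's standard deviation. The city SD values below are hypothetical (they do not appear in the excerpt of Table 2 reproduced here), chosen only so that the raw differences and d values reproduce the reported pattern.

```python
national_mean = 1.928  # national average on the five-point scale

# City means and hypothetical standard deviations for illustration only.
cities = {
    "Beijing":  {"mean": 2.197, "sd": 0.912},
    "Shenzhen": {"mean": 2.184, "sd": 0.785},
}

results = {}
for city, s in cities.items():
    raw_diff = s["mean"] - national_mean  # unstandardized mean difference
    d = raw_diff / s["sd"]                # one-sample Cohen's d = (M - mu) / SD
    results[city] = (round(raw_diff, 3), round(d, 3))

# Beijing shows the larger raw difference (0.269 vs. 0.256),
# yet Shenzhen shows the larger standardized effect (0.326 vs. 0.295).
```

The reversal arises because the two cities differ in variability: dividing by a smaller SD inflates the standardized effect even when the raw difference is smaller.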
It is noteworthy that Wei and Su (2015) explain why they do not attempt to interpret the r values after reporting these effect sizes. Their main justification is that simply drawing upon the frequently used Cohen's (1988) guidelines is not instructive, and that no similar studies had reported relevant effect sizes for their analysis to compare and contrast with. Considering the random sampling approach and the large sample size used in their study, Wei and Su (2015) suggest that future studies of similar topics use their effect size values as a baseline for cross-study comparisons. These practices resonate with the earlier suggestion that referring to effect size values from relevant previous studies is a better practice for interpreting effect size than using Cohen's (1988) benchmarks.

Literature Review
Many surveys of effect size reporting (and, to a lesser extent, interpreting) practices have been conducted in such fields as psychology, education, and communication. For example, in gifted education research, examining all 723 papers from six full volumes of three selected journals, Paul and Plucker (2004, p. 69) report that "28.9% of the quantitative research blocks contained effect size estimates"; the so-called "quantitative research blocks" comprise three subgroups (descriptives, univariate blocks, and multivariate blocks), and the effect size reporting rates for the latter two blocks were 17.9% and 52.2%, respectively. To these authors, there is no need for papers utilizing only descriptive statistics to report effect sizes.
More recently, in the fields of education and psychology, Sun, Pan, and Wang's (2010) survey of 1,243 articles published in 14 journals across three full volumes (2005-2007) reveals an effect size reporting rate of 49%. In the field of communication, after examining four full volumes (2003-2006) of four influential journals, Sun and Fan (2010) find a relatively high effect size reporting rate (about 75%) in their 224 sampled papers. One major limitation of Sun and Fan's (2010) study is that their coding method tends to overestimate the effect size reporting rate. If, in one particular article, two or more focal statistical procedures (say, t test and ANOVA) are used but only one procedure (say, t test) has an effect size reported, Sun and Fan (2010, p. 333) give the "benefit of the doubt" to that article by coding it as one that reports effect size. Another coding standard, more stringent than Sun and Fan's (2010), is that an article using two or more focal statistical procedures has to report effect sizes for all the procedures to qualify as a paper that reports effect size. This more stringent standard is likely to yield a lower effect size reporting rate than Sun and Fan's (2010) standard.
However, in the field of applied linguistics, no studies focus upon effect size reporting practices. Although Plonsky (2013) finds an effect size reporting rate of 25% by examining 606 articles from Language Learning and Studies in Second Language Acquisition, both of which mandate effect size reporting in their submission guidelines, his study does not focus on effect size reporting, but rather on a wider range of features reflecting study quality (e.g., designs, statistical analyses, reporting practices, and outcomes). Understandably, sufficient coding details are not provided regarding papers with two or more statistical procedures, although in Plonsky's (2013, p. 669) sample, up to 60% of the articles use multiple statistical techniques. Another study that sheds light upon effect size reporting is Lindstromberg (2016), who finds that 49% of the 96 (quasi-)experimental studies, reported in 90 articles across 19 volumes (1997-2015) of Language Teaching Research, report effect sizes. Again, it is unclear how a paper or study with two or more statistical procedures was dealt with. Given that Plonsky (2013) and Lindstromberg (2016) each report a single effect size reporting rate, it seems that only one standard for coding papers with multiple statistical procedures was used, although neither study clarifies whether a relatively loose standard such as Sun and Fan's (2010) or a more stringent one (e.g., the above-proposed "two-standards" approach) was adopted.
To date, no studies concerning effect size reporting in applied linguistic research endeavor to make explicit the coding standard regarding papers with multiple statistical procedures, let alone adopt two standards to arrive at a more comprehensive picture of the reporting practices. Furthermore, no studies have surveyed papers from journals that do not mandate effect size reporting, as all the journals covered in Plonsky (2013) and Lindstromberg (2016) have such a mandate.

Sampling
To contribute to the current understanding of effect size reporting in the field, six full volumes (2011-2016) of System were examined. A span of 6 years was decided upon for three reasons. First, publications over a span of 6 years should be sufficient for a study of an exploratory nature such as the present one to show whether any systematic change in effect size reporting practices took place over time (see Research Question 2). Second, similar (and often shorter) time spans were adopted in studies of effect size reporting in other social science areas such as communication (e.g., 4 years, see Sun & Fan, 2010), education (e.g., 2 years, see Alhija & Levy, 2009), and psychology (e.g., 2 years, see Dunleavy, Barr, Glenn, & Miller, 2006). Third, the number of articles used for coding (see below) was comparable with that in similar research in other fields (e.g., Sun & Fan, 2010) and manageable for a study of an exploratory nature.
The sampling frame for this study comprised all 414 full-length research articles from the six selected volumes of System. Our first two rounds of initial review, as per the first six questions on a checklist developed for the present study (see the appendix), led to the identification of a total of 217 articles that are supposed to report effect sizes, which formed the core dataset of this study. Specifically, the first round identified 17 nonempirical research articles, which were excluded from the development of the core dataset. Empirical research articles are those that are data-based, characterized by systematic collection and analysis of data (cf. Gao, Li, & Lü, 2001). The second round identified 96 empirical articles utilizing only qualitative data and another 83 using purely descriptive statistics (e.g., frequencies and percentages; cf. Dunleavy et al., 2006; Meline & Wang, 2004). After these articles were excluded, 218 articles remained, one of which was a meta-analysis. Following Keaton and Bodie's (2013, p. 117) rationale that "reporting conventions for meta-analytic reviews are remarkably different from those for individual (primary) studies," we removed this meta-analysis in the development of the core dataset.
Finally, the remaining 217 articles formed the core dataset, each of which should have effect size(s) reported. This total number was used as the denominator to generate the overall effect size reporting rates for Research Question 1.

Coding
The unit of analysis was the individual article. Each article in the core dataset was coded in terms of research topic, publication year, nature of the paper (empirical or not), types of statistical procedures, effect size reporting practices, types of effect size measures, and the authors' awareness of effect size (see the appendix). Two coding standards were used for situations where two or more of the focal statistical procedures appear in a single paper, so as to achieve a more comprehensive picture of effect size reporting in applied linguistics and to facilitate comparisons with findings from other fields: one is Sun and Fan's (2010) standard, which tends to give the "benefit of the doubt" and hence is relatively loose; the other is the more stringent standard proposed in the "Literature Review" section.
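The two coding standards can be stated precisely in a few lines of code. The sketch below is our own illustration (the function and data structure are hypothetical, not taken from the study's coding instruments): under the loose standard, an article counts as reporting effect size if any focal procedure has one; under the stringent standard, only if all of them do.

```python
def codes_effect_size(article, standard):
    """Code one article as reporting effect size (True) or not (False).

    `article` maps each focal statistical procedure used in the article
    to whether an effect size was reported for that procedure.
    standard="loose"     -> Sun and Fan's (2010) benefit-of-the-doubt rule
    standard="stringent" -> every procedure must have an effect size
    """
    if standard == "loose":
        return any(article.values())
    return all(article.values())

# An article that reports d for its t test but nothing for its ANOVA
# counts under the loose standard but not under the stringent one.
article = {"t test": True, "ANOVA": False}
```

Because the stringent rule is a strict subset of the loose rule, the stringent reporting rate can never exceed the loose one, which matches the pattern reported in the findings.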
The use of these two standards might introduce an element of subjectivity, although most of the coded variables are dichotomous and involve little subjective judgment (e.g., reported vs. not reported). To ensure consistent application of the checklist, the first and second authors independently coded a common set of 44 articles (20.3% of the core dataset). The intercoder agreement rate was 93.1%, above the commonly accepted range of 85% to 90% (cf. Miles, Huberman, & Saldana, 2014), and points of disagreement were resolved through collegial discussion. Once consistency was established, the second author coded the remaining articles.

Data Analysis
After data were coded, both descriptive and inferential statistics were generated with the statistical package SPSS 21.0. For Research Question 1 regarding the extent of effect size reporting practices, only descriptive statistics in the forms of percentage and frequency were generated. To answer Research Question 2 concerning whether the effect size reporting practices change over time, a series of chi-square tests were performed, using Cramer's V as an effect size. For Research Question 3 concerning the types of effect size typically reported, frequencies and percentages were used, supplemented with qualitatively enumerated examples.

Research Question 1: To What Extent Are Measures of Effect Size Reported?
As Table 3 shows, overall, 73.27% of the sampled papers that should have reported effect sizes do report them when Sun and Fan's (2010) standard is adopted for papers with two or more statistical procedures. This reporting rate drops to 52.07% when the more stringent standard is adopted. It is unfortunate that effect sizes, whose importance is no less than that of the p value, are reported in only about half of the papers that are supposed to report them. These findings point to a troubling lack of effect size reporting in applied linguistics research.
These remarks may seem overly critical toward the field of applied linguistics. To be fair, we need to situate the discussion in a broader context by reiterating that the underreporting of effect sizes has also been observed in other fields. One comparable study is a survey of 256 papers from the Journal of Counseling & Development over 11 years, where Bangert and Baumberger (2005) find an effect size reporting rate of less than 50% among the papers that conduct statistical significance tests and, hence, need to report effect sizes.

Research Question 2: Do the Effect Size Reporting Practices Vary Across the Years?
A chi-square test based on the effect size reporting frequencies under Sun and Fan's (2010) standard, χ2(5) = 3.533, p = .618, revealed a small-to-medium association (Cramér's V = 0.128) between reporting rates and publication year. Another chi-square test based on the frequencies under the more stringent standard, χ2(5) = 10.730, p = .057, also revealed a small-to-medium association (Cramér's V = 0.222) between these two variables. Although the corresponding p values (e.g., .057) were higher than the conventional statistical significance level (p = .05), this does not diminish the importance of the results reflected by the effect sizes. In the words of authorities on statistics, "surely, God loves the 0.06 nearly as much as the 0.05" (Rosnow & Rosenthal, 1989, as cited in Ellis, 2010, p. 49). With a large enough sample, a future replication study would likely yield a small enough p value.
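Cramér's V is computed directly from the chi-square statistic, the sample size, and the smaller dimension of the contingency table. As a consistency check, the sketch below (ours; it assumes the 217 core-dataset articles form the sample and a 2 × 6 reported-by-year table, so the smaller dimension is 2) reproduces the two V values reported above.

```python
import math

def cramers_v(chi2, n, min_dim):
    """Cramér's V = sqrt(chi2 / (n * (k - 1))),
    where k is the smaller of the table's row and column counts."""
    return math.sqrt(chi2 / (n * (min_dim - 1)))

# 2 x 6 table (effect size reported vs. not, by year), n = 217 articles.
v_loose = cramers_v(3.533, 217, 2)       # loose standard, approx. 0.128
v_stringent = cramers_v(10.730, 217, 2)  # stringent standard, approx. 0.222
```

That the reported V values fall out of the reported chi-square statistics under these assumptions lends support to the coding described in the Method.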
The effect size values reported above are higher than those reported in previous studies. The counterpart Cramér's V in one earlier survey was only 0.07, although the corresponding p value was small, χ2(2) = 5.66, p = .06, probably because of the large sample size involved. This suggests that in that survey the variable of interest (effect size reported or not) was associated with publication year at only a negligible-to-small level according to Cohen's benchmarks.
All in all, the answer to Research Question 2 is that effect size reporting practices do vary across time. The strength of association lies between Cohen's (1988) small and medium benchmarks.

Research Question 3: For Each of the Five Focal Statistical Methods, What Is the Effect Size Reporting Rate and What Effect Size Measures Are Typically Reported?
For papers that used correlation analysis, 94.29% (see Table 4) reported an effect size measure. This extremely high reporting rate can be attributed to the fact that the test statistic (i.e., the correlation coefficient) is itself an effect size. Similarly high effect size reporting rates for correlation analysis can be found in other fields. For instance, in the field of communication, Sun and Fan (2010, p. 334) note that "nearly 100% of studies" that used Pearson correlation reported effect size measures, whereas the corresponding rate in Alhija and Levy (2009) reached 100% in the field of education. In this study, the effect size measures typically used were correlation coefficients such as Pearson's r and Spearman's rho.
For the papers that used regression analysis, about 84.00% (see Table 4) reported effect sizes. High effect size reporting rates for regression analysis can be found in other fields such as communication (nearly 100%, see Sun & Fan, 2010) and education (100%, see Alhija & Levy, 2009). The effect size measure most used was adjusted R2, which is consistent with observations from other fields (e.g., Alhija & Levy, 2009; Sun & Fan, 2010). The underlying reason might be that, as noted by Kirk (1996), adjusted R2 is readily available in the regression output generated by popular statistics packages such as SPSS.
More than 60% (64.10%, see Table 4) of the papers using ANOVA reported effect size measures. This is highly similar to the counterpart rates of 56.5% and 57%, respectively, from Sun and Fan (2010) and Alhija and Levy (2009). The reporting rate for ANOVA was lower than that for regression, partly because effect sizes for ANOVA are not as readily available as those for regression in statistics packages. Take SPSS as an example. In SPSS, ANOVA can be run in three ways. The most common way is to initiate the test by clicking "Compare Means → One-way ANOVA," but an effect size measure for ANOVA, eta-squared, cannot be generated in the output this way, misleading many researchers into believing that SPSS does not provide eta-squared for ANOVA (Zhang, 2009). However, this effect size can be generated via the two less commonly used ways in SPSS (cf. Plonsky & Oswald, 2017). 4 Therefore, when effect sizes were reported, (partial) eta-squared 5 was unsurprisingly the most reported, which is consistent with earlier findings (e.g., Sun & Fan, 2010). In light of the observation from the field of communication that "researchers arbitrarily selected one of these two" (i.e., eta-squared and partial eta-squared; Sun & Fan, 2010, p. 338) and a recent discussion of the misuses of (partial) eta-squared in the field of L2 research, future research needs to investigate whether these effect sizes have been correctly used when reported.
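When statistical output omits eta-squared, it can be computed by hand from the ANOVA sums of squares. The following Python sketch is our own illustration: it computes eta-squared from raw group data as SS_between divided by SS_total.

```python
def eta_squared(groups):
    """Eta-squared for a one-way ANOVA from raw group data.

    eta^2 = SS_between / SS_total, the proportion of total variance
    attributable to the grouping factor.
    """
    all_vals = [x for g in groups for x in g]
    grand_mean = sum(all_vals) / len(all_vals)
    ss_total = sum((x - grand_mean) ** 2 for x in all_vals)
    ss_between = sum(
        len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total
```

Because SS_between and SS_total appear in any standard ANOVA table, the same ratio can be formed directly from printed output without access to the raw data.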
Twenty-three (34.33%) of the 67 articles that used t tests reported effect sizes. Similarly moderate reporting rates are found in other fields. In Alhija and Levy's (2009) sampled papers from five educational journals that do not mandate effect size reporting, the corresponding rate was 31%. In Sun and Fan's (2010) sampled papers from two communication journals that do not require reporting effect sizes for statistically significant results, the corresponding rate was 25%. These lower reporting rates for t tests, compared with those for ANOVA, are understandable considering that SPSS does not provide effect size measures for the various t tests (independent-samples, one-sample, or paired-samples); these measures have to be calculated by hand or by entering relevant values (e.g., the t value and degrees of freedom) into online calculators (cf. Ellis, 2010). Most papers in our sample used Cohen's d, whereas only six reported r as an effect size measure for t tests and another two used eta-squared. One of the above-mentioned six papers gives the following justification for choosing r rather than d: "Two commonly used effect sizes of t-tests are Cohen's d and a point-biserial correlation coefficient (i.e., r), and this study adopted the latter as r ranges from 0 (no effect) to 1 (a perfect effect)" (Koga, 2010, p. 176). This practice echoes our earlier suggestion that an effect size index from either the r or the d family can be used, although some textbooks recommend only Cohen's d for t tests.
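Both families of t-test effect sizes can be recovered from the reported t value and degrees of freedom, which is handy when the statistical output omits them. The sketch below (our illustration) uses r = sqrt(t^2 / (t^2 + df)), the formula given by Field (2009), and the common approximation d = 2t / sqrt(df), which assumes an independent-samples design with equal group sizes.

```python
import math

def t_to_r(t, df):
    """r effect size from a t statistic: r = sqrt(t^2 / (t^2 + df))."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

def t_to_d(t, df):
    """Approximate Cohen's d from an independent-samples t,
    assuming equal group sizes: d = 2t / sqrt(df)."""
    return 2 * t / math.sqrt(df)

# Example: a reported t(38) = 2.00 yields r of about .31 and d of about .65,
# so the effect size is recoverable even when the authors did not report one.
```

This is also how meta-analysts salvage effect sizes from older papers that reported only t and df.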
Seven (30.43%) of the 23 articles that used chi-square tests reported effect sizes. Similarly, low reporting rates are in evidence elsewhere. In Alhija and Levy's (2009) sampled papers from five educational journals that do not require effect size reporting, the corresponding rate was 17%. In Sun and Fan's (2010) sampled papers from two communication journals without effect size reporting requirements, none of the five papers that used chi-square tests reported effect size; to account for this, the authors speculate that "it is likely that neither Cramer's V nor φ is well known to communication researchers" (Sun & Fan, 2010, p. 338). In our sample, most papers correctly used Cramer's V, with only two using odds ratios.

Conclusion
This study has examined the effect size reporting practices in one major applied linguistics journal. These practices seem to have improved in the past few years, although the identified reporting rate of about 50% remains inadequate. While such improvement is encouraging, evidence from other disciplines suggests that advances in effect size reporting can be lost without continued vigilance (Loewen et al., 2014). Therefore, journal editors, researchers, and researcher trainers need to (continue to) encourage and/or implement good reporting practices (e.g., reporting effect sizes along with exact p values).
Although this exploratory study is innovative in terms of its "two-standards" approach to coding and its choice of target journal, it has three major limitations. First, it would have benefited from a larger sample size. The above findings and conclusions are tentative and require verification and/or falsification in future research. In terms of generalizability, the results may not be representative of the use of effect sizes in applied linguistics in general, as this study focused on only one journal in the field. Second, the findings provide limited information about effect size reporting practices for statistical procedures other than the five focal ones (such as factor analysis and structural equation modeling). Third, the present study has provided evidence of the frequency with which effect sizes are reported in the focal journal, but it does not indicate whether these effect sizes have been correctly applied (cf. the above-mentioned misuses of [partial] eta-squared in L2 research).
To contribute to the ongoing methodological reform in applied linguistics (Larson-Hall & Plonsky, 2015), more studies on effect size reporting are needed. Future studies stand to gain by expanding the sample size and/or comparing reporting practices across journal types (journals with vs. without a requirement for effect size reporting). It would also be useful to examine whether effect sizes are reported more frequently for statistically significant results than for their nonsignificant counterparts, as Plonsky (2013) notes that some authors tend to report effect sizes solely for statistically significant results, although such information was "not coded for throughout the entire sample" in his study. Furthermore, future studies of effect size reporting need to incorporate an element of effect size interpretation in a more systematic way, as the reporting of effect sizes should not be treated "as an end in itself" (Larson-Hall & Plonsky, 2015, p. 135). It is useful to know how effect sizes are interpreted after being reported. Ellis (2010) predicts that "If history is anything to go by, statistical reforms adopted in psychology will eventually spread to other social science disciplines" (p. xiv). Recently, the editors of Basic and Applied Social Psychology (Trafimow & Marks, 2015) issued a journal-wide ban on NHST. This ban represents a natural progression of the long-standing critiques 6 of NHST and a strong call for the use of more robust statistics (e.g., effect size) in reporting practices. It is our firm belief that applied linguistics will soon be one of the disciplines in Ellis's prediction.
Yuhang Hu (Sophie) is a master's student at the Department of Linguistics, with a concentration in Applied Linguistics, at Georgetown University. Her areas of research include (socio-)psychological variables in bilingualism and quantitative methodology. She will commence her PhD study in Applied Linguistics at Northern Arizona University this fall.
Jianhui Xiong, PhD, conducts research concerning educational policy and comparative education at the National Center for Education Development Research, Ministry of Education of the People's Republic of China. His recent research interests include internationalization of education and the use of big data in education.