Review of sample size determination methods for the intraclass correlation coefficient in the one-way analysis of variance model

Reliability of measurement instruments providing quantitative outcomes is usually assessed by an intraclass correlation coefficient. When participants are repeatedly measured by a single rater or device, or are each rated by a different group of raters, the intraclass correlation coefficient is based on a one-way analysis of variance model. When planning a reliability study, it is essential to determine the number of participants and measurements per participant (i.e. number of raters or number of repeated measurements). Three different sample size determination approaches under the one-way analysis of variance model were identified in the literature, all based on a confidence interval for the intraclass correlation coefficient. Although eight different confidence interval methods can be identified, the Wald confidence interval with Fisher’s large sample variance approximation remains the most commonly used despite its well-known poor statistical properties. Therefore, a first objective of this work is comparing the statistical properties of all identified confidence interval methods, including those overlooked in previous studies. A second objective is developing a general procedure to determine the sample size using all approaches, since a closed-form formula is not always available. This procedure is implemented in an R Shiny app. Finally, we provide advice for choosing an appropriate sample size determination method when planning a reliability study.


Introduction
All measurement and evaluation processes are subject to measurement error. These errors can have a serious impact on research, undermining the conclusions of a study, as well as in daily practice when measurement and evaluation processes are used to make diagnoses or assess the progression of participants, for example. It is therefore essential for measurement instruments to be reliable (i.e. the device/rater is able to distinguish among participants in a population) and valid (i.e. measurements reflect the underlying true values). The reliability of a device/rater is usually evaluated during a reliability study. Generally, a reliability study consists of participants measured repeatedly under similar conditions by the same device/rater (intrarater reliability) or by different devices/raters (interrater reliability). In interrater reliability studies, the set of raters can be the same, or different, for every participant. In this article, we focus (1) on intrarater studies where the same number of repeated measurements is made simultaneously on each participant and the order of the measurements is interchangeable, and (2) on specific interrater reliability studies where the set of raters is different for every participant and the same number of raters rates each participant. In the second case, the reliability coefficient additionally reflects the differences between raters, next to the measurement error.
When the outcome measurements are quantitative, reliability can be quantified using an intraclass correlation coefficient (ICC). The ICC is defined as the correlation between repeated measurements at multiple occasions made by the same rater/device or by different raters/devices on the same participants. It compares the variability of measurements/ratings within participants to the variability of measurements/ratings between participants. Depending on the design of the study, different forms of ICC should be used.4,5 This article focuses on the ICC defined in the one-way analysis of variance (ANOVA) model, ICC(1).4 When planning a reliability study, determining the minimum number of raters/repetitions and participants is of prime importance. In fact, too many participants may prove to be time-consuming and may also increase the research budget, while too few may adversely impact the precision of the ICC estimate, preventing any conclusion from being drawn from the study. Several approaches to determine sample sizes can be identified in the literature. The aim of this review is two-fold. First, it is to compare the statistical properties of the sample sizes obtained with the approaches in realistic settings. Second, it is to develop a general procedure for sample size determination, since a closed-form formula is not always available for all the approaches.
Three approaches to sample size determination can be identified in the literature: the width of confidence interval approach, the assurance probability approach, and the testing approach.6–10 The confidence interval approach requires defining, around a planned ICC, a target width of the confidence interval that the researcher aims to achieve. A generalization of the width of the confidence interval approach, the assurance probability approach,6 is based on testing whether the width of the confidence interval is less than a pre-specified width with a given assurance probability. The testing approach is based on the power of testing the hypothesis that the ICC is lower than or equal to (null hypothesis), or above (alternative hypothesis), a pre-specified value of the ICC. A common feature of these approaches is that the variance of the ICC estimator needs to be defined. In the literature, two closed-form approximations of the large-sample variance of the ICC estimate are mainly used: the Swiger variance,11 which is based on a Taylor-series expansion of the ratio of the ANOVA mean squares, and the Fisher variance,12 a large-sample approximation obtained by Fisher. We further consider another form of the variance, known as the Zerbe variance,13 based on the formulation of the estimator as the ratio of two independent F-statistics. This variance is far less popular and was not included in previous reviews.
Confidence intervals formed around the ICC are mainly based on the Wald method6,14 or on the F-statistic, termed the Searle method.15 The Wald and the Searle methods can be further applied using a normalizing transformation.12,16 When comparing the coverage probability of the confidence intervals (confidence intervals based on the Wald method with the Fisher variance and the Searle method), Zou6 concluded that the normalized Searle method performs better than the Wald method with the Fisher variance. However, when comparing the coverage probabilities and mean interval widths of confidence intervals obtained with the Wald method (with the Swiger variance), the Searle method, and the normalized Searle method, Donner and Wells17 concluded that no method was superior in all situations.
In the context of sample size determination with the confidence interval approach, a closed-form formula was derived by Bonett7 for the Wald confidence interval with the Fisher variance, and it is the most common choice.18 While Shieh more recently defined a numerical procedure for the Searle method,19 no procedure to determine sample sizes exists for the other methods. Note that, in common statistical software like R,18 SAS, and PASS,20 the Wald method with the Fisher variance is the only one that is available (see Appendix E). As for the comparison among the methods, Shieh19 compared the statistical properties of the Wald method (with the Fisher variance) and the Searle method with respect to the width of the confidence interval approach and the assurance probability approach. To summarize the results, the Searle method and the assurance probability approach, with a 90% assurance probability, showed better coverage than the width of the confidence interval approach,6 and this was achieved with a somewhat smaller width of the confidence interval. We aim to complete these comparisons by considering all the identified confidence interval methods.
For the testing approach, sample size determination was derived only for the normalized Searle method18 and the Searle method (numerically). Only the latter is available in common statistical software.20 As for the comparison, Shieh21 showed that the approximate sample size formula obtained using the normalized Searle method8 under-performs, with respect to the observed power of the hypothesis test, when compared to numerical sample sizes obtained via the Searle method. We extend the work of Shieh21 by comparing the results that can be obtained using all the methods for sample size determination identified in this article.
Several studies have investigated inference procedures for the ICC in this context, but they are incomplete in that they do not consider all the confidence interval methods identified here. In summary, our contribution is as follows. First, we compare the statistical properties of all identified confidence interval methods. Second, we analytically derive the sample size formulas using the Swiger and the Zerbe variances. Third, we develop a numerical procedure to obtain sample sizes with all identified confidence interval methods under the three sample size approaches. In this numerical procedure, we derive formulas to approximate the assurance probability function and the power function (except for the Searle method, for which these formulas were already derived21). Additionally, we provide guidelines for end users. We further provide a user-friendly and interactive R Shiny application to obtain sample sizes with all the methods discussed in this article, available on GitHub (https://github.com/DiproMondal/sample-size-ICC) and the Shiny server (https://dipro.shinyapps.io/sample-size-icc/).
The article is organized as follows. Section 2 introduces the methods to estimate the ICC, its variance, and its confidence interval. Section 3 introduces the simulation setup that is used to evaluate the statistical properties of the confidence interval methods. Section 4 describes different approaches for sample size calculation when the number of raters, k, is fixed. We further propose a general procedure to obtain minimum sample sizes under any approach. Section 5 presents a case study. Finally, Section 6 concludes the article with a summary and a discussion of the results.

Definition
Consider the scenario in which each participant is measured on a quantitative scale by a different set of raters randomly drawn from a population of raters,4 or is measured repeatedly by a measuring device several times under identical conditions. Further assume that the number of raters/repeated measurements per participant is the same, which is a common assumption when planning a reliability study. Let Y_ij represent the measurement of participant i (i = 1, 2, …, n) by rater j (j = 1, 2, …, k). This outcome can be described by a one-way ANOVA model, which can be written as

Y_ij = μ + s_i + ε_ij,    (1)

where μ is the grand mean, s_i is the effect of participant i, and ε_ij is the measurement error for participant i measured by rater j. The total number of observations is denoted by N (N = kn). The assumptions of this ANOVA model are that the participant effects s_i are identically and normally distributed with mean 0 and variance σ_s², the measurement errors ε_ij are identically and normally distributed with mean 0 and variance σ_ε², and the errors and participant effects are independent. Table 1 shows the variance components of this one-way ANOVA model. Using this variance decomposition, the ICC is defined as

ρ = σ_s² / (σ_s² + σ_ε²).    (2)

Note that the value of ρ becomes closer to 1 as the measurement error variance becomes smaller (σ_ε² ≪ σ_s²) and closer to 0 as it increases (σ_ε² ≫ σ_s²).

Estimation of ICC
The ICC is usually estimated using the ANOVA4 or the maximum-likelihood estimator. The ANOVA estimator is given by

ρ̂_ANOVA = (MSB − MSW) / (MSB + (k − 1)MSW),    (3)

where MSB and MSW are the between- and within-participant mean squares. Since this estimator is negatively biased,22 a maximum-likelihood estimator, ρ̂_ML, has been suggested.23 Comparing the bias of the two estimators, Wang et al.23 showed that the bias of ρ̂_ML is still quite large and decreases only slightly for large samples. For instance, to achieve a bias of ρ̂_ML of no more than 10%, a total of 100 observations (e.g. 20 participants and five raters) are required when expecting ρ = 0.5 (the value of ρ at which the bias is maximum). When one expects higher values of ρ, as in the context considered in this article, the bias of ρ̂_ANOVA is small and the two estimators lead to almost identical estimates. For this reason, the maximum-likelihood estimator is generally not used in the literature. Accordingly, we will only consider the ANOVA estimator in this article. Note that this estimator relies on the assumptions of the ANOVA model (equation (1)). A brief discussion of what happens when these assumptions are violated is given in Section 6.
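As an illustration, the ANOVA estimator (equation (3)) can be computed directly from the between- and within-participant mean squares of the balanced one-way model. The following Python sketch (the function name `icc1_anova` is ours, not from the article) returns the estimate together with MSB and MSW:

```python
import numpy as np

def icc1_anova(Y):
    """ANOVA estimator of ICC(1) for a balanced one-way design.

    Y: (n, k) array-like, n participants each measured k times.
    Returns (rho_hat, msb, msw).
    """
    Y = np.asarray(Y, dtype=float)
    n, k = Y.shape
    grand_mean = Y.mean()
    row_means = Y.mean(axis=1)
    # Between-participant mean square (n - 1 degrees of freedom)
    msb = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    # Within-participant mean square (n(k - 1) degrees of freedom)
    msw = np.sum((Y - row_means[:, None]) ** 2) / (n * (k - 1))
    rho_hat = (msb - msw) / (msb + (k - 1) * msw)
    return rho_hat, msb, msw

# Perfect agreement across repetitions gives rho_hat = 1
rho_hat, msb, msw = icc1_anova([[1, 1], [2, 2], [3, 3]])
print(rho_hat)  # 1.0
```

Note that the estimator is bounded below by −1/(k − 1), not by 0; with k = 2, complete within-participant disagreement yields an estimate of −1.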

Large sample variance of the ICC
Here we focus on the three approximate closed-form expressions of the variance of ρ̂ available in the literature for large n. Swiger et al.11 provided the large sample variance of ρ̂ as

var(ρ̂)_S = 2(N − 1)(1 − ρ)²[1 + (k − 1)ρ]² / [k²(n − 1)(N − n)].    (4)

Given that k(N − n) = nk(k − 1), as N = kn, this leads to the variance obtained by Fisher12 when (N − 1)/(k(n − 1)) ≈ 1, which is a reasonable assumption for small k and n ≥ 30:24

var(ρ̂)_F = 2(1 − ρ)²[1 + (k − 1)ρ]² / [k(k − 1)n].    (5)

Note that in equation (5), n is sometimes replaced by n − 1.25 Lastly, following Zerbe and Goldgar13 and Kaart,26 the variance can also be estimated from the formulation of the estimator as the ratio of two independent F-statistics; we denote this form, given in equation (6), as the Zerbe variance. These three formulas are related by the following inequality: var(ρ̂)_Ze > var(ρ̂)_S > var(ρ̂)_F (see Appendix A for proof).
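The relation between the Swiger and Fisher forms can be checked numerically. The Python sketch below (written directly from the variance expressions of equations (4) and (5) as stated above) verifies that their ratio equals (N − 1)/(k(n − 1)) and hence shrinks to 1 as n grows with k fixed:

```python
def var_swiger(rho, n, k):
    # Swiger et al. large-sample variance of the ANOVA estimator (equation (4))
    N = n * k
    num = 2 * (N - 1) * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
    return num / (k ** 2 * (n - 1) * (N - n))

def var_fisher(rho, n, k):
    # Fisher's large-sample approximation (equation (5))
    num = 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
    return num / (k * (k - 1) * n)

for n in (20, 100, 1000):
    ratio = var_swiger(0.8, n, 3) / var_fisher(0.8, n, 3)
    print(n, round(ratio, 4))  # ratio > 1, approaching 1 as n grows
```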

Confidence interval for the ICC
In the literature, there are four methods to compute the upper (U) and lower (L) bounds of the confidence interval for ρ, namely the Wald method,6 the Searle method,9,15 and their normalized versions. Demetrashvili et al.27 further suggested two generic methods, not considered here because they are not accurate in the balanced one-way random effects model.

Wald confidence interval (Wald S , Wald F , and Wald Ze )
Based on the central limit theorem, the upper (U) and lower (L) bounds of the confidence interval for the ICC can be written as6,14

(L, U) = ρ̂ ∓ z_{1−α/2} √var(ρ̂),    (7)

where z_{1−α/2} is the (1 − α/2) × 100 percentile of the standard normal distribution. Plugging equations (4) to (6) into (7) as the variance leads to confidence intervals, which we denote as Wald S, Wald F, and Wald Ze, respectively. The Wald method assumes that the sampling distribution of ρ̂ is normal. However, ρ is bounded between 0 and 1, implying a skewed sampling distribution of ρ̂ when ρ is close to the boundaries.28 Since typically ICC values close to one are of interest in a reliability study, Wald confidence intervals may thus have poor statistical properties in this context.
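A minimal Python sketch of the Wald F interval follows; truncating the limits to [0, 1] is our display choice, since the raw Wald limits can fall outside the parameter space:

```python
from statistics import NormalDist

def wald_ci_fisher(rho_hat, n, k, alpha=0.05):
    """Wald interval with the Fisher variance (equation (5)), truncated to [0, 1]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    var_f = 2 * (1 - rho_hat) ** 2 * (1 + (k - 1) * rho_hat) ** 2 / (k * (k - 1) * n)
    half = z * var_f ** 0.5
    return max(0.0, rho_hat - half), min(1.0, rho_hat + half)

lo, hi = wald_ci_fisher(0.9, 30, 2)
print(round(lo, 3), round(hi, 3))  # 0.832 0.968
```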

Searle method (Fρ)
Under the normality assumption of the ANOVA model, the ratio of the between-mean squares and within-mean squares (i.e. the F-statistic) is distributed as [(1 + (k − 1)ρ)/(1 − ρ)] F_{ν1,ν2}, where F_{ν1,ν2} represents an F-distribution with ν1 = n − 1 and ν2 = n(k − 1) degrees of freedom. We represent this ratio as

F(ρ̂) = MSB/MSW.    (8)

Then, the upper and lower bounds of the confidence interval for ρ are given by Searle14 as

L = (F(ρ̂)/F_u − 1) / (F(ρ̂)/F_u + k − 1),  U = (F(ρ̂)/F_l − 1) / (F(ρ̂)/F_l + k − 1),    (9)

where F_l and F_u are the α/2 × 100 and the (1 − α/2) × 100 percentiles of an F-distribution with n − 1 and n(k − 1) degrees of freedom, respectively. We denote this method as Fρ.
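Assuming the F-ratio of equation (8) and Searle's bounds as written above, the exact interval can be sketched with `scipy.stats.f` (the function name `searle_ci` is ours):

```python
from scipy.stats import f as f_dist

def searle_ci(msb, msw, n, k, alpha=0.05):
    """Searle's F-based confidence interval for the ICC in the one-way model."""
    F_obs = msb / msw
    nu1, nu2 = n - 1, n * (k - 1)
    F_l = f_dist.ppf(alpha / 2, nu1, nu2)      # lower alpha/2 percentile
    F_u = f_dist.ppf(1 - alpha / 2, nu1, nu2)  # upper alpha/2 percentile
    lower = (F_obs / F_u - 1) / (F_obs / F_u + k - 1)
    upper = (F_obs / F_l - 1) / (F_obs / F_l + k - 1)
    return lower, upper

lo, hi = searle_ci(msb=10.0, msw=1.0, n=30, k=3)
print(round(lo, 3), round(hi, 3))
```

Because F_l < 1 < F_u for the usual degrees of freedom, the interval always contains the point estimate (F(ρ̂) − 1)/(F(ρ̂) + k − 1).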
Rather than making a normality assumption on ρ̂, this method makes an assumption of normality on the outcome Y_ij. Hence this method has been referred to as an exact procedure by several authors.6,17

Normalized ICC method (Z S, Z F, and Z Ze)

The Fisher transformation can be applied to the ICC so that the transformed ICC approximately follows a normal distribution. Applying this transformation to ρ̂ leads to

Z(ρ̂) = ½ ln[(1 + ρ̂)/(1 − ρ̂)],    (10)

where E(Z(ρ̂)) = ½ ln[(1 + ρ)/(1 − ρ)], and the variance var(Z(ρ̂)) can be derived by applying the Delta method16 to one of the variances defined in equations (4) to (6). Since Z(ρ̂) is approximately normally distributed and defined on the real line, we can compute the Wald confidence interval on this transformed scale. Finally, the confidence interval for ρ is obtained by back-transformation. We refer to the confidence intervals obtained by these methods as Z S, Z F, and Z Ze, respectively.
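The transform-and-back-transform recipe can be sketched as below for Z F; `math.atanh` and `math.tanh` implement the Fisher transformation and its inverse, and the delta-method factor 1/(1 − ρ̂²)² is applied to the Fisher variance of equation (5):

```python
import math
from statistics import NormalDist

def z_fisher_ci(rho_hat, n, k, alpha=0.05):
    """Normalized (Fisher z) interval built on the Fisher variance (method Z_F)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    var_f = 2 * (1 - rho_hat) ** 2 * (1 + (k - 1) * rho_hat) ** 2 / (k * (k - 1) * n)
    z_rho = math.atanh(rho_hat)                 # 0.5 * ln((1 + rho_hat)/(1 - rho_hat))
    var_z = var_f / (1 - rho_hat ** 2) ** 2     # delta method: dZ/drho = 1/(1 - rho^2)
    half = z_crit * math.sqrt(var_z)
    # Back-transform: the limits stay inside (-1, 1) by construction
    return math.tanh(z_rho - half), math.tanh(z_rho + half)

lo, hi = z_fisher_ci(0.9, 30, 2)
print(round(lo, 3), round(hi, 3))
```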

Normalized Searle method (ZF 𝜌 )
The F-statistic F(ρ̂) can also be normalized by a log-transformation to obtain confidence limits.6,9,14 Starting from equation (8), the normalized statistic Z(F(ρ̂)) = ln F(ρ̂) is approximately normally distributed, and a Wald confidence interval can be computed on this log-transformed scale. Note that the expression for var(Z(F(ρ̂))) provided in equation (3) of Zou6 is not correct, so we use the corrected var(Z(F(ρ̂))) as specified above. The confidence limits for ρ can then be obtained directly by back-transforming. We denote this method as ZFρ. Note that for k = 2, the confidence intervals based on the transformed F-statistic and the normalized ICC with the Swiger variance are the same.

Simulation comparison of the confidence interval methods
We set up a Monte Carlo simulation to evaluate the statistical properties of the eight confidence interval methods described in Section 2.3. Based on the ANOVA model defined in equation (1), n participant effects (s_i) are drawn from a standard normal distribution. Then, N (= nk) errors (ε_ij) are drawn from a normal distribution with zero mean and variance determined by the relation in equation (2) for a given value of ρ. This process is replicated 25,000 times. For each replication, the confidence interval using the eight methods described in Section 2.3 is obtained. We study the properties of the methods for values of k varying from 2 to 10 (in steps of 1), n from 20 to 100 (in steps of 10), and ρ from 0.1 to 0.9 (in steps of 0.1).
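As a compact stand-in for the replication loop, the two ANOVA mean squares can be drawn directly from their scaled chi-square distributions instead of generating individual observations. The Python sketch below is ours (not the article's R code); it sets σ_s² = ρ and σ_ε² = 1 − ρ so that the ICC equals ρ, and estimates the coverage of the exact Searle interval with a reduced number of replications:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)

def coverage_searle(rho, n, k, reps=4000, alpha=0.05):
    """Empirical coverage of the exact (Searle) interval under the one-way model."""
    nu1, nu2 = n - 1, n * (k - 1)
    # MSB ~ (sigma_e^2 + k sigma_s^2) chi2_{nu1}/nu1 and MSW ~ sigma_e^2 chi2_{nu2}/nu2
    msb = (1 + (k - 1) * rho) * rng.chisquare(nu1, reps) / nu1
    msw = (1 - rho) * rng.chisquare(nu2, reps) / nu2
    F_obs = msb / msw
    F_l = f_dist.ppf(alpha / 2, nu1, nu2)
    F_u = f_dist.ppf(1 - alpha / 2, nu1, nu2)
    lower = (F_obs / F_u - 1) / (F_obs / F_u + k - 1)
    upper = (F_obs / F_l - 1) / (F_obs / F_l + k - 1)
    return float(np.mean((lower <= rho) & (rho <= upper)))

print(coverage_searle(0.8, 30, 3))  # close to the nominal 0.95
```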
The methods are compared based on the coverage probability and average confidence interval width in each scenario. The coverage probability is defined as the proportion of times the true value of ρ is covered by the confidence intervals across the 25,000 replications. We define the coverage probability as acceptable if it falls within the range (1 − α) ± z_0.975 √[(1 − α)α/N_sim], where 1 − α is the nominal coverage and N_sim (= 25,000) is the number of simulations. This is the range of proportions from the simulation in which these proportions are expected to lie in 95% of the cases if the nominal coverage is the true coverage probability. Specifically, for a nominal coverage of 95%, the coverage probabilities from the simulation are expected to lie between 0.947 and 0.953. The average width of a confidence interval is defined as the average difference between the upper and lower limits of the confidence interval over the 25,000 replications. Since a shorter confidence interval is desirable, methods with a smaller average width are considered to be better. Table 2 summarizes the results for ρ ≥ 0.7, while complete results can be found in Supplemental Material 1. Table 2 shows that for k = 2 and ρ ≥ 0.7, Wald Ze, Fρ, Z S (equivalent to ZFρ), and ZFρ provide acceptable coverage for all values of n, while Z F provides acceptable coverage only for n ≥ 40. Wald S, Wald F, and Z Ze do not provide acceptable coverage (based on the sample sizes explored in Table 2, i.e. n ≤ 100).
For k > 2, Fρ still provides acceptable coverage under all scenarios, while ZFρ and Wald Ze do so only for n ≥ 40. The coverage of Z S and Z F first deteriorates when increasing k from 2 to 3, and then improves as k increases further; these confidence interval methods provide acceptable coverage when ρ ≥ 0.7 for n ≥ 90. Z Ze, on the other hand, provides acceptable coverage when ρ ≥ 0.7 for n ≥ 80. Wald S provides acceptable coverage when ρ ≥ 0.7 for n ≥ 80, while Wald F provides acceptable coverage only when ρ ≥ 0.9 for n ≥ 90. The effect of increasing k is not monotonic for some of the confidence interval methods. However, increasing k above 5 does not seem to notably improve the coverage of the methods (see Supplemental Material 1).
The confidence interval methods providing the smallest average width most frequently under the different scenarios are marked in bold. The difference in average width between the different confidence interval methods decreases from approximately 0.1 to less than 0.01 as n increases from 20 to 60 or more. Note that although Wald Ze provides better coverage than Wald F, it has the largest width among the confidence interval methods.
In summary, for ρ ≥ 0.7, Wald Ze, ZFρ, and Fρ provide acceptable coverage in almost all scenarios.
Table 2. Summary of the methods that show acceptable coverage for the 95% confidence interval, that is, between 0.947 and 0.953, for ICC ρ ≥ 0.7 and different numbers of raters, k, and participants, n. In each row, the method providing the minimum average width of the confidence interval is marked in bold. For n ≥ 60, the differences in average width are < 0.01; therefore, none of the methods is marked in bold in those cases.

Sample size determination
Sample size determination when the number of raters, k, is fixed is reviewed for three approaches, namely, the width of confidence interval approach, the assurance probability approach, and the testing approach. These sample size approaches require a planning value of ρ and yield valid results when the initial guess for ρ is accurate. The eight confidence interval methods reviewed in Section 2.3 can be used with each of the three approaches. However, a closed-form formula for sample size determination is not always available, which necessitates numerical evaluation procedures to determine sample sizes.

Width of confidence interval approach
The approach consists in finding the minimum number of participants for a given value of the expected width, ω, of the confidence interval around a planned value of ρ and for a given number of raters, k. Bonett7 derived an analytical formula based on the Wald confidence interval and the Fisher variance (Wald F). We generalize this approach by considering all large sample variance formulas reviewed in Section 2.2.
The expected width of the Wald confidence interval is given by ω = 2z_{1−α/2} √var(ρ̂), where 1 − α is the confidence level and the variance can be estimated using equations (4) to (6). Using the Swiger variance (equation (4)), under the approximation N ≈ N − 1 and taking the positive root, the minimum number of participants is given by (see Appendix B.1 for the derivation)

n = 8z²_{1−α/2}(1 − ρ)²[1 + (k − 1)ρ]² / [k(k − 1)ω²] + 1.    (17)

Using the Fisher variance (equation (5)), the expression for the required minimum number of participants is the same as equation (17), but subtracting one participant. Bonett7 used the Fisher variance with n − 1 in the denominator of equation (5) instead of n. As a result, the sample size derived by Bonett is the same as equation (17).
Using the Zerbe variance (equation (6)), the minimum number of participants can also be obtained in closed form (equation (18); see Appendix B.2 for the derivation). Giraudeau and Mary29 provided an approximate formula for the width of the confidence interval obtained with the Searle method, which coincides with the width obtained using the Wald confidence interval with the Fisher variance. Analytical formulas can hardly be obtained for the Searle and the normalization methods. Hence, we propose a general numerical procedure to determine the minimum sample size, n, which can be used with all confidence interval methods. Specifically, this numerical evaluation consists of finding the expected width of the confidence interval for the specified values of ρ and k. This is done for every n, starting from n = 4 and increasing n by one unit at a time. The minimum sample size is the smallest value of n for which the expected width of the confidence interval is smaller than or equal to ω. Bonett7 and Shieh19 used a similar numerical approach to obtain sample sizes for Fρ.
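The stepping procedure can be sketched in Python for the Wald F interval, next to a closed-form counterpart in the style of Bonett's formula, n = 8z²(1 − ρ)²[1 + (k − 1)ρ]²/(k(k − 1)ω²) + 1 (our transcription); by construction the two results differ by at most one participant:

```python
from math import ceil, sqrt
from statistics import NormalDist

def min_n_width_wald_fisher(rho, k, omega, alpha=0.05):
    """Smallest n whose expected Wald_F width 2 z sqrt(var_F) is at most omega."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = 4  # start of the stepwise search, as in the numerical procedure
    while 2 * z * sqrt(2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
                       / (k * (k - 1) * n)) > omega:
        n += 1
    return n

def bonett_n(rho, k, omega, alpha=0.05):
    """Closed-form sample size in the style of Bonett (Fisher variance with n - 1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    core = 8 * z ** 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (k * (k - 1) * omega ** 2)
    return ceil(core + 1)

print(min_n_width_wald_fisher(0.8, 3, 0.2), bonett_n(0.8, 3, 0.2))  # 35 36
```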
Table 3 shows the minimal sample sizes obtained by using the numerical evaluation for ω ∈ {0.1, 0.2}, ρ ∈ {0.7, 0.8, 0.9}, and k ∈ {2, 3, 6}. The values within parentheses indicate sample sizes obtained using equation (17), equation (17) with a subtraction of one participant, and equation (18) for Wald S, Wald F, and Wald Ze, respectively. It can be observed that the sample sizes obtained with the different confidence interval methods are rather close. Sample sizes providing acceptable coverage (the calculation of the acceptable range is given in Section 3) for different combinations of ρ, ω, and k are marked in bold. Table 3 indicates that the confidence interval methods Wald Ze, Fρ, and ZFρ provide sample sizes with acceptable coverage in most cases. Note that the numerical approaches of Bonett7 and Shieh19 lead to sample sizes very close to the values we obtain (data not shown).
Table 3. The minimum number of participants, n, required to achieve an expected width, ω, of the 95% confidence interval, given ρ and the number of raters, k, according to the numerical evaluation method. Sample sizes that provide coverage within an acceptable range (based on 25,000 simulations, i.e. between 0.947 and 0.953) are marked in bold. The values in parentheses indicate sample sizes obtained with the analytical formulas given in equation (17) for Wald S, equation (17) with a subtraction of one participant for Wald F, and equation (18) for Wald Ze.

Assurance probability approach
The assurance probability approach based on the width of the confidence interval for ρ19 consists of finding the minimum number of participants n such that

P(W ≤ ω) ≥ 1 − γ,    (19)

where P(W ≤ ω) is the probability that the width W is less than or equal to a constant, ω, and 1 − γ is the assurance probability. The assurance probability approach based on the width of the confidence interval was introduced by Zou,6 who pointed out that the width of confidence interval approach seen in the previous subsection is a special case corresponding to setting the assurance probability to 0.5. Zou6 also introduced an assurance probability approach based on the lower limit of a confidence interval; see Section 4.3. Zou6 derived an analytical formula based on the Wald confidence interval and the Fisher variance (Wald F). Shieh19 later extended the approach numerically to the Searle method (Fρ). In this article, we numerically generalize the assurance probability approach by considering all the confidence interval methods mentioned in Section 2.3.
Using the Wald confidence interval and the Fisher variance (equation (5)), Zou6 obtained the minimum number of participants in closed form (equation (20)). Zou6 used n − 1 in equation (5) and derived the formula considering the half-width of the confidence interval; as a result, the formula in Zou has different coefficients than equation (20). Using the Swiger variance (equation (4)), we derived the sample size under the approximation N ≈ N − 1 (taking the positive root), which leads to equation (20) with the addition of one participant.
The analytical forms of the other confidence interval methods (including Wald Ze ) are too complex.Therefore, we propose a generalization of the numerical approach explained in Section 4.1, which uses assurance probability functions to find the minimum n satisfying a pre-defined value of the assurance probability (1 − ).This numerical procedure works with all the confidence interval methods mentioned in Section 2.3.The derivation of the assurance probability functions are given in Appendix C. Table 4. Minimum number of participants, n, required to achieve an expected width, , of the 95% confidence interval, given , the number of raters, k, and the assurance probability 1 −  = 0.9, according to the numerical procedure using assurance probability functions.Sample sizes that provide acceptable empirical assurance probability (i.e.above 0.896 for 25,000 simulations) are marked in bold.The values in parentheses indicate sample sizes obtained from the analytical formulas given in equation (20) for Wald F and equation ( 20) with an addition of 1 participant for Wald S , respectively.Table 4 shows the sample sizes obtained by the numerical procedure using assurance probability functions.The analytical counterparts are shown between parentheses (when available).It can be observed that the minimum sample sizes obtained analytically are close to the values obtained via the numerical procedure.Further, our method gives sample sizes close to the ones obtained by Shieh, who also used a numerical method for F (Tables 8 and 9 of Shieh 19 ).It can be further observed that the sample sizes obtained by the different confidence interval methods are rather close.Sample sizes providing acceptable assurance probability under different combinations of , , and k, for 1 −  = 0.9 are marked in bold.The lower limit of the acceptable range of assurance probabilities is calculated in the same way as in Section 3 where 1 −  is replaced by 1 − .Confidence interval methods F, Z S , Z F , and Z Ze 
provide sample sizes with acceptable assurance probability in most cases while ZF  for k = 2 only.
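The assurance probability version of the numerical search can be sketched by simulating the width distribution of the Wald F interval through the ANOVA mean squares; this Monte Carlo sketch is our own stand-in for the analytical assurance probability functions of Appendix C (replication count and seed are arbitrary):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)

def min_n_assurance_wald_fisher(rho, k, omega, gamma=0.1, alpha=0.05, reps=2000):
    """Smallest n with simulated P(Wald_F width <= omega) >= 1 - gamma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = 4
    while True:
        nu1, nu2 = n - 1, n * (k - 1)
        # Draw the mean squares from their scaled chi-square distributions
        msb = (1 + (k - 1) * rho) * rng.chisquare(nu1, reps) / nu1
        msw = (1 - rho) * rng.chisquare(nu2, reps) / nu2
        rho_hat = (msb - msw) / (msb + (k - 1) * msw)
        width = 2 * z * np.sqrt(2 * (1 - rho_hat) ** 2 * (1 + (k - 1) * rho_hat) ** 2
                                / (k * (k - 1) * n))
        if np.mean(width <= omega) >= 1 - gamma:
            return n
        n += 1

print(min_n_assurance_wald_fisher(0.8, 3, 0.2))  # noticeably above the width-approach value
```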

Testing approach
The testing approach consists of finding the minimum number of participants when one is interested in achieving a prespecified power (1 − β) when testing the null hypothesis that ρ is less than or equal to a constant ρ0 (ρ ≤ ρ0) against the alternative that ρ is greater than ρ0 (ρ > ρ0). Denoting ρ = ρA under the alternative hypothesis, the power of this test is defined as the probability that the null hypothesis is rejected when the alternative hypothesis is true (ρ = ρA). In our case, this is the probability that the lower limit L of the confidence interval for ρ is greater than ρ0 when the alternative hypothesis is true. The criterion under this approach can be written as6

P(L ≥ ρ0 | ρ = ρA) ≥ 1 − β,    (21)

where P(L ≥ ρ0 | ρ = ρA) is the probability that the lower limit L of the confidence interval for ρ is greater than the pre-specified value ρ0 under the alternative hypothesis that ρ = ρA (1 > ρA > ρ0 > 0). Donner and Eliasziw,10 Walter et al.,9 and Zou6 derived an analytical formula for the minimum number of participants, n, based on the transformation of the F-statistic (ZFρ) when applying the criterion specified in equation (21); this formula is given in equation (22).

Table 5. Minimum number of participants, n, for a given value of ρ considering the null (ρ0) and alternative (ρA) hypotheses for a specified number of raters, k, and power of the test 1 − β, according to the numerical procedure using power functions. Sample sizes that provide acceptable empirical power (i.e. above 0.896 when 1 − β = 0.9 and above 0.795 when 1 − β = 0.8, for 25,000 simulations) are marked in bold. The values in parentheses indicate the sample sizes obtained from the analytical formula given in equation (22).

Shieh21 used a numerical evaluation procedure to obtain sample sizes for the Searle method. Zou6 obtained equation (22) by introducing an assurance probability based on a pre-specified lower limit of an asymmetrical interval procedure, which is equivalent to the testing approach. We derived power functions following equation (21) for all the confidence interval methods (see Appendix D). These power functions were then used to obtain sample sizes for the testing approach using the numerical procedure mentioned in Section 4.2: the procedure uses the power functions to find the minimum n satisfying a pre-defined power (1 − β).
Table 5 shows the sample sizes obtained by the numerical procedure using the power functions and numerical evaluation. The values within parentheses indicate sample sizes obtained by the analytical formula for the method ZFρ, which correspond exactly to the ones obtained by our numerical procedure. The values obtained for the method Fρ using our numerical procedure are exactly one unit greater than the values obtained by the numerical method of Shieh.21 Furthermore, unlike the previous approaches, the Wald confidence interval methods tend to require smaller sample sizes than the other confidence interval methods. The actual power of the hypothesis test was also calculated at the obtained sample sizes. Sample sizes providing acceptable power for different combinations of 1 − β, ρ0, ρA, and k are marked in bold. The lower limit of the acceptable range of power is calculated in the same way as in Section 3, with 1 − α replaced by 1 − β. The confidence interval methods Fρ and Z Ze provide sample sizes with acceptable power in most cases, while Z S, Z F, and ZFρ provide sample sizes with acceptable power for k = 2 only. The Wald methods always have power below the acceptable value (i.e. < 0.795 when 1 − β = 0.8, and < 0.896 when 1 − β = 0.9). For example, the actual power for the Wald methods can go as low as 0.729, which is the case for 1 − β = 0.8, ρ0 = 0.7, ρA = 0.8, and k = 2.
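As with the assurance approach, a Monte Carlo stand-in for the power functions can be sketched for the Searle lower limit, assuming a one-sided test at level α (the function name, the example planning values ρ0 = 0.6 and ρA = 0.8, and the simulation shortcut through the mean squares are ours; the article derives analytical power functions in Appendix D):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(11)

def min_n_power_searle(rho0, rhoA, k, power=0.8, alpha=0.05, reps=2000):
    """Smallest n with simulated P(lower Searle limit > rho0 | rho = rhoA) >= power."""
    n = 4
    while True:
        nu1, nu2 = n - 1, n * (k - 1)
        msb = (1 + (k - 1) * rhoA) * rng.chisquare(nu1, reps) / nu1
        msw = (1 - rhoA) * rng.chisquare(nu2, reps) / nu2
        F_obs = msb / msw
        F_u = f_dist.ppf(1 - alpha, nu1, nu2)  # one-sided: lower limit of a (1 - alpha) CI
        lower = (F_obs / F_u - 1) / (F_obs / F_u + k - 1)
        if np.mean(lower > rho0) >= power:
            return n
        n += 1

print(min_n_power_searle(0.6, 0.8, 3))
```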

Software for sample size calculation
Currently, only the method Wald_F is available in common software (see Appendix E) for the width of the confidence interval and assurance probability approaches, while only Z_F' is available for the testing approach. Therefore, a Shiny app containing all the approaches to determine minimum required sample sizes has been developed30 and made available on GitHub (https://github.com/DiproMondal/sample-size-ICC) and on the Shiny server (https://dipro.shinyapps.io/sample-size-icc/).

Reliability of systolic blood pressure measurements
In this section, we illustrate how the confidence interval methods described in Section 2.3 and the approaches for sample size determination described in Section 4 are used in the context of a reliability study. In the study of Bland and Altman,31 three repeated systolic blood pressure measurements (k = 3) were made on 85 participants (n = 85) by two experienced observers (raters J and R) and a semi-automatic blood pressure monitor. For the purpose of our illustration, we use the measurements made by rater J only, which can be modeled by a one-way ANOVA. The ANOVA model assumes that the outcome measurements are normally distributed and that the variance across repetitions is homogeneous across participants. Exploratory data analysis revealed mild excess kurtosis, moderate skewness, and mild heteroscedasticity in the repeated measurements. Following equation (3), we obtain ρ = 0.962. The confidence intervals obtained using the eight confidence interval methods, shown in Table 6, have rather similar bounds.
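For a one-way design, the point estimate follows from the ANOVA mean squares as ρ̂ = (F − 1)/(F + k − 1), and the exact (Searle) interval is obtained by dividing F by the appropriate F-quantiles. The following is a minimal Python sketch of that computation (our own illustration, not the code used to produce Table 6; the example data are synthetic):

```python
import numpy as np
from scipy.stats import f

def icc_oneway(x, alpha=0.05):
    """ICC and exact (Searle) CI from an n-by-k matrix under the one-way ANOVA model."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    msb = k * np.sum((x.mean(axis=1) - x.mean()) ** 2) / (n - 1)            # between-participant MS
    msw = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))  # within-participant MS
    fobs = msb / msw
    icc = (fobs - 1.0) / (fobs + k - 1.0)
    # exact interval: divide F by the upper/lower F-quantiles, then map back to the ICC scale
    fl = fobs / f.ppf(1.0 - alpha / 2.0, n - 1, n * (k - 1))
    fu = fobs / f.ppf(alpha / 2.0, n - 1, n * (k - 1))
    return icc, (fl - 1.0) / (fl + k - 1.0), (fu - 1.0) / (fu + k - 1.0)
```

With the Bland and Altman data for rater J in place of the synthetic matrix, this reproduces the F row of Table 6.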

Planning a reliability study
A researcher may be interested in planning a study to measure blood pressure aiming at a reliability of ρ = 0.9. The sample size approaches described in the previous sections can be used to find the number of participants required for such a study.
Figure 1 shows, for each k in the interval [2, 30] (x-axis), the minimum required n (y-axis) using (top to bottom) the width of confidence interval approach described in Section 4.1, the assurance probability approach described in Section 4.2, and the testing approach described in Section 4.3 for the confidence interval methods F and Z_F'. For example, suppose the study only allows for three repeated measurements per participant. Then, using the width of confidence interval approach, the assurance probability approach, and the testing approach, the researcher would require, respectively, 43, 67, and 133 participants (considering the Searle confidence interval method) for the criteria given in Figure 1. Increasing the number of measurements per participant to four decreases these numbers to 37, 59, and 115 participants, respectively. The gain from a smaller number of participants diminishes as the number of measurements per participant increases.
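For the width of confidence interval approach, no closed-form power function is needed: the expected width at a given (n, k, ρ) can be approximated by Monte Carlo and the minimum n found by the same forward search. The sketch below does this for the exact Searle interval (our own illustration; the replication count is an arbitrary choice):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(2024)

def expected_width(n, k, rho, alpha=0.05, reps=20_000):
    """Monte Carlo estimate of the expected exact-CI width when the true ICC is rho."""
    df1, df2 = n - 1, n * (k - 1)
    theta = 1.0 + k * rho / (1.0 - rho)
    # under the one-way ANOVA model, F_obs = theta * F(df1, df2)
    fobs = theta * f.rvs(df1, df2, size=reps, random_state=rng)
    fl = fobs / f.ppf(1.0 - alpha / 2.0, df1, df2)
    fu = fobs / f.ppf(alpha / 2.0, df1, df2)
    return np.mean((fu - 1.0) / (fu + k - 1.0) - (fl - 1.0) / (fl + k - 1.0))

def min_n_width(k, rho, target_width, **kwargs):
    """Smallest n whose expected CI width is at most the target."""
    n = 2
    while expected_width(n, k, rho, **kwargs) > target_width:
        n += 1
    return n
```

The same template yields the assurance probability approach by replacing the mean width with the proportion of simulated intervals narrower than the target.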
If, instead, there is flexibility in choosing the number of repetitions per participant, the researcher can consider a cost-constraint approach to find the optimal combination of the number of participants (n) and the number of repeated measures per participant (k). The optimal combination (k, n) is the one for which the total cost, T, is minimum. A plausible cost function is

T = c_1 n + c_2 nk,    (23)

where T is the total cost, c_1 is the cost of recruiting a participant, and c_2 is the cost of making one observation. Table 7 shows the optimal combinations of (k, n) obtained by minimizing the total cost T (equation (23)) for different combinations of c_1 and c_2. As c_1 increases relative to c_2, more repetitions per participant and fewer participants are required to achieve the same criterion value.
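The cost minimization itself is a one-dimensional search over k: for each candidate k, take the minimum n from the chosen sample size approach and evaluate equation (23). In the sketch below, `n_required` is a hypothetical stand-in for any of the three approaches, used only to illustrate the search:

```python
def total_cost(n, k, c1, c2):
    # equation (23): T = c1*n + c2*n*k
    return c1 * n + c2 * n * k

def optimal_design(n_required, c1, c2, k_range=range(2, 31)):
    """(k, n) minimizing total cost, where n_required(k) is supplied by any approach."""
    return min(((k, n_required(k)) for k in k_range),
               key=lambda kn: total_cost(kn[1], kn[0], c1, c2))

def n_required(k):
    # hypothetical stand-in: required participants decrease in k with diminishing returns
    return max(10, round(160 / (k - 1) + 20))
```

As in Table 7, raising c_1 relative to c_2 shifts the optimum toward more repetitions per participant and fewer participants.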

Discussion
Sample size determination is a crucial aspect of the planning stage of a reliability study. Usually, the number of raters, k, is fixed due to budget or time constraints, and the sample size of participants, n, needs to be determined. This article gives a complete overview of the different approaches available in that case. Analytical closed-form solutions for sample size determination exist in only a few cases. Therefore, we proposed a general procedure that entails deriving an assurance probability or power function (depending on the approach) and finding the optimal n via a simple search procedure.
Before inspecting the different approaches for sample size determination, we looked at the statistical properties of the different confidence interval methods. We have shown that the confidence interval based on the Searle method (F) provides acceptable coverage in almost all scenarios for n ≥ 20, and Wald_Ze and Z_F' for n ≥ 40. This can be explained by the fact that F is an exact method and Z_F' is based on a normalizing transformation of F. Wald_Ze is the Wald method based on the Zerbe variance, which was also derived as a ratio of F-statistics. It must be noted, however, that Wald_Ze does not provide acceptable coverage for small ρ (when ρ < 0.5, see Supplemental Material 1). The other methods, based on various approximations, provide acceptable coverage in only a few scenarios. It is worth noting that the Wald confidence interval using the Fisher variance, Wald_F, although widely used in the literature, shows acceptable coverage only for large sample sizes, n ≥ 80, when ρ ≥ 0.9 and k > 2. Note that the Zerbe variance provides better statistical properties than the Fisher variance when ρ ≥ 0.7, but the width of the confidence interval is larger.
Sample sizes were determined using three different approaches, all relying on the limits of a confidence interval for ρ. Sample sizes for the width of confidence interval approach were obtained via a numerical evaluation. We derived the assurance probability and power functions for the assurance probability and testing approaches, respectively, to determine sample sizes. These functions, combined with the numerical evaluation, enabled us to determine sample sizes for all the methods discussed. Sample sizes obtained through this procedure and through the corresponding available analytical formulas were similar. Furthermore, sample sizes obtained with different confidence interval methods were similar in the width of confidence interval approach and the assurance probability approach. However, this was not the case in the testing approach, where smaller sample sizes were obtained using the Wald confidence interval methods to achieve a required power level compared to the other confidence interval methods. This is probably because the Wald confidence interval method assumes a symmetric distribution for the estimate of ρ, which is not a realistic assumption when ρ is large (e.g. 0.8, 0.9).28 In all the approaches, the Searle method (F) provided sample sizes with good statistical properties, as did Z_F' when k = 2. We therefore advise the use of these methods to make statistical inference on the ICC in the one-way ANOVA setting.
We have shown that the choice of the approach to determine sample size, and even the choice of the confidence interval method, has an impact on the resulting sample size. We therefore advise researchers to carefully consider the requirements of their studies as a guide to choosing the appropriate sample size approach. For the three different approaches discussed in this article, the Searle confidence interval method demonstrated good statistical properties, making it our recommended choice. Furthermore, to determine sample sizes, we have developed an R Shiny app which we believe will prove valuable to researchers in need of a simple and efficient interface for obtaining sample sizes.
Our study is not without limitations. First, the confidence interval methods investigated in this paper, except the Searle confidence interval method (which is exact), rely on large sample approximations. Therefore, practitioners should exercise caution when calculations lead to a small minimal sample size, because good statistical behavior is then not guaranteed. Note that the minimal sample sizes obtained with the different approaches rarely go below 20 in realistic scenarios (see Tables 3 to 5). Second, the estimator of ρ and its confidence interval rely on the assumptions of normality and homoscedasticity, in line with the one-way ANOVA model (equation (1)). Violations of these conditions impact the statistical properties of the confidence intervals.[34][35][36][37] However, simulation studies38 showed that the effect of heteroscedasticity outweighs the effect of non-normality on the Type-I error rate of the F-statistic, even for a balanced design39 as considered here. We therefore advise researchers to check for violations of the assumptions of the ANOVA model (equation (1)) before using the methods described in this article. Readers interested in non-parametric estimators of the ICC, which do not require the normality assumption, are directed to the works of Rothery,40 Shirahata,41 Commenges and Jacqmin,42 and Ukoumunne et al.43 Note, however, that these papers do not develop a sample size procedure. Third, as previously mentioned, we consider an equal number of ratings per participant, constituting a balanced design. Considering unbalanced designs would require specifying the degree of imbalance in advance, which is not an easy task. Furthermore, Donner14 showed that with an unbalanced design the F-statistic is not exact, which in turn affects the statistical properties of the ICC and its confidence interval. Fourth, we focused on reliability in the context of a one-way ANOVA model. Whether the numerical procedure we developed can be extended to multi-way ANOVA models requires further investigation, as methods to construct confidence intervals are different in that case.44,45

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Figure 1.
Combinations of n and k for the confidence interval methods F and Z_F' satisfying the criteria specified for the three different sample size approaches. The criteria are given in the sub-figures where, for the width of confidence interval approach, ω is the expected width of the confidence interval around a given value ρ; for the assurance probability approach, the notation is the same with the addition of 1 − γ denoting the assurance probability; for the testing approach, 1 − β is the power of the hypothesis test and ρ_0 and ρ_A are the values under the null and alternative hypotheses. (a) Width of confidence interval approach; (b) assurance probability approach; and (c) testing approach.

Table 1.
Variance decomposition for the one-way ANOVA model described by equation (1).

Table 6.
Lower and upper limits of the 95% confidence intervals for ρ for the systolic blood pressure measurements.

Table 7.
Optimal combinations of the number of repetitions and participants, (k, n), for the sample size approaches with the confidence interval methods F and Z_F', for different costs of recruiting a participant, c_1, and making an observation, c_2.