A statistical framework for planning and analysing test–retest studies of repeatability

There is an increasing number of potential quantitative biomarkers that could allow for early assessment of treatment response or disease progression. However, measurements of such biomarkers are subject to random variability. Hence, differences of a biomarker in longitudinal measurements do not necessarily represent real change but might be caused by this random measurement variability. Before utilizing a quantitative biomarker in longitudinal studies, it is therefore essential to assess the measurement repeatability. Measurement repeatability obtained from test–retest studies can be quantified by the repeatability coefficient, which is then used in the subsequent longitudinal study to determine if a measured difference represents real change or is within the range of expected random measurement variability. The quality of the point estimate of the repeatability coefficient, therefore, directly governs the assessment quality of the longitudinal study. Repeatability coefficient estimation accuracy depends on the case number in the test–retest study, but despite its pivotal role, no comprehensive framework for sample size calculation of test–retest studies exists. To address this issue, we have established such a framework, which allows for flexible sample size calculation of test–retest studies, based upon newly introduced criteria concerning assessment quality in the longitudinal study. This also permits retrospective assessment of prior test–retest studies.


Introduction
A biomarker is a characteristic objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or response to a therapeutic intervention.[1] Biomarkers used as indicators of response to a therapeutic intervention, or disease progression, are called treatment response biomarkers. One prime, established treatment response biomarker is lesion size change in cross-sectional imaging. For clinical trials concerning solid tumors, the measurement of lesion size is formalized in the so-called response evaluation criteria in solid tumors (RECIST),[2] which categorize treatment response. With the rapid advancement in medical sciences, there is an increasing number of new potential treatment response biomarkers that could possibly allow for early and objective assessment of treatment response or disease progression in clinical trials and clinical practice.[3]
However, using a biomarker in practice requires some basic research into the reliability of its measurement. In addition to a fixed systematic measurement error (bias), which can be investigated by comparing measurements with a known target value (e.g. phantom studies), it is important to take into account that measurements of quantitative biomarkers are subject to random variability. Hence, changes in a biomarker in longitudinal measurements made under the same conditions do not necessarily represent real change but might be caused by exactly this random measurement variability. Before testing or even utilizing a quantitative biomarker in longitudinal studies, it is therefore of principal importance to assess the measurement repeatability.[4]
The repeatability of measurement is determined by test-retest studies, which are then also referred to as repeatability studies. In such studies, replicate measurements are made on a sample of subjects under conditions that are as constant as possible.[5] Measurement repeatability can be quantified by the within-subject standard deviation ($w_{SD}$). Using $w_{SD}$, the repeatability coefficient (RC) can be calculated.[4,6,7] RC is then used in the longitudinal study to determine if a difference in the biomarker represents presumed real change or is within the range of random measurement variability. It is defined in such a way that a desired specificity to detect changes (usually 95%) is targeted.
The $w_{SD}$ and the RC, as determined by the test-retest study, are point estimates, and hence suffer from random error. As we will show, the targeted specificity is therefore generally not achieved in practice. Following standard statistical results, the more subjects and the more repeated measurements are included in the test-retest study, the more reliable the estimates of $w_{SD}$ and RC will be. Accordingly, the probability of a relevant deviation of the actually achieved value from the targeted specificity will decrease. The quality of assessments in the longitudinal study and consequently the validity of its results is directly governed by the precision of the estimates of $w_{SD}$ and RC.
Of course, exact knowledge of measurement repeatability is not only crucial for biomarkers. For example, excellent measurement repeatability of scales and other laboratory instruments is mandatory. The reliability of a scale can be checked using weights with a known mass and it is possible to perform many repeated measurements. In contrast, many biomarkers are measured in vivo, rendering attainment of large sample sizes difficult. Also, it might be necessary from an ethical point of view to keep sample sizes as low as possible, since the measurement in question might be inconvenient, invasive, or even harmful for the patient or the healthy test person. For example, a biomarker might be derived from computed tomography, which involves ionizing radiation. Yet, if the sample size in the test-retest study is small, there is a high chance of obtaining suboptimal estimates of RC with associated detrimental effects on sensitivity and specificity in the longitudinal study.
Related to but different from repeatability is reproducibility. While repeatability represents the measurement precision under constant conditions, that is, same measurement procedure, same operators, same measuring system, etc., reproducibility is, in contrast, measurement precision under differing conditions, such as various operators, measuring systems, etc.[8] In the following, we will focus solely on repeatability.
Statistical literature concerning requirements for test-retest studies is scarce. One notable study investigating sample size requirements is by Obuchowski and Bullen.[6] Their work includes a critical examination of several statistical issues that arise in the use of quantitative imaging biomarkers in clinical applications. The technical performance of these markers is investigated by multiple simulation studies. The repeatability and reproducibility of measurements as well as their bias are analysed. In addition to the standard models, heteroscedasticity and non-linear relationships are also considered. One aspect of their work concerns the relation between the sample size in the test-retest study and the difference between a fixed targeted specificity of 95% and the expected value of the specificity actually achieved in a following longitudinal study. In order to limit this difference to 1 percentage point, the authors give a blanket recommendation for the sample size of test-retest studies with two repeated measurements based on their results from a fixed set of simulation parameters.
Our goal is to expand upon the results of Obuchowski and Bullen[6] for the particular problem of specificity. First, we want to introduce new quality criteria for the planning of test-retest studies. For this purpose, random variables are introduced to distinguish between the targeted specificity and the specificity actually achieved in a subsequent longitudinal assessment that is based upon the results of a test-retest study. Furthermore, we will expand the considerations to include sensitivity, which has not been investigated in the literature so far. Finally, we aim to provide analytical solutions. In contrast to simulation studies, this allows for flexible calculation of sample size requirements without restrictions to fixed parameters and also the retrospective assessment of test-retest studies, as we will show. In doing so, we establish a comprehensive framework in which the notions introduced above are precisely defined.
In what follows, we will introduce the model used for our framework and study the aspects of specificity and sensitivity in separate sections. Afterwards, we demonstrate the application of our concepts in a practical example and discuss our results.

Definitions
One possible approach to distinguish true change from random variation in the longitudinal study is to estimate measurement variability in a test-retest study. For this purpose, $n$ patients are measured $m$ times within a short period of time, in which their true value presumably does not change. For our considerations we assume independent subjects, for example, measurement of one target per patient. In addition, independent replicate measurements are necessary, that is, measurements on a subject need to be made independently of the knowledge of its previous value(s).[9] Consequently, we establish the following model for the $j$-th measurement of the $i$-th patient $Y_{ij}$ of the test-retest study:
$$Y_{ij} = \mu_i + \varepsilon_{ij}, \tag{1}$$
where $\mu_i$ is the true value for the $i$-th patient and $\varepsilon_{ij}$ is the random error. For our following considerations, it is irrelevant whether the values $(\mu_i)_{i=1,\dots,n}$ stem from a fixed effects or a random effects model. We assume the random errors to be independent and normally distributed with mean 0 and variance $w_{SD}^2$.[6] In particular, it follows that $Y_{ij} \sim \mathcal{N}(\mu_i, w_{SD}^2)$ for any $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, m\}$. This model is appropriate when true replicates are studied and a learning effect can be ruled out. As we are only addressing measurement repeatability, a fixed bias does not need to be considered since it cancels out. We also assume that the measurement error is independent of the magnitude of $\mu_i$. From this data, we can estimate the within-patient standard deviation $w_{SD}$[7] by
$$\hat{w}_{SD} := \sqrt{\frac{1}{n(m-1)} \sum_{i=1}^{n} \sum_{j=1}^{m} \left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2},$$
where $\bar{Y}_{i\cdot} := \frac{1}{m} \sum_{j=1}^{m} Y_{ij}$ denotes the mean value of the measurements of patient $i$. In this formula, the patient-specific means $(\mu_i)_{i=1,\dots,n}$ directly cancel each other out. Following Cochran's theorem,[10] the distribution of this entity can be derived from
$$\frac{n(m-1)\,\hat{w}_{SD}^2}{w_{SD}^2} \sim \chi^2_{n(m-1)}. \tag{2}$$
A two-sided confidence interval for $w_{SD}$ with confidence level $1-\alpha$ is given by
$$\left[\hat{w}_{SD}\sqrt{\frac{n(m-1)}{\chi^2_{n(m-1),\,1-\alpha/2}}},\ \hat{w}_{SD}\sqrt{\frac{n(m-1)}{\chi^2_{n(m-1),\,\alpha/2}}}\right], \tag{3}$$
where $\chi^2_{n(m-1),\alpha}$ denotes the $\alpha$-quantile of the $\chi^2$-distribution with $n(m-1)$ degrees of freedom. For any fixed $m$, a central limit theorem can be applied to $\hat{w}_{SD}^2$ for $n \to \infty$. A subsequent application of the delta method[11] using the square root function yields the convergence
$$\sqrt{2n(m-1)}\left(\frac{\hat{w}_{SD}}{w_{SD}} - 1\right) \xrightarrow{d} Z \quad (n \to \infty) \tag{4}$$
for any fixed $m$, where $Z$ is a standard normally distributed random variable. Furthermore, this convergence also holds for any fixed $n$ when $m$ goes to infinity. This is a consequence of two subsequent applications of the delta method[11] and the commutativity of addition and convergence in distribution for independent random variables.
If the number of repeated measurements differs between subjects, that is, the $i$-th subject is measured $m_i$ times, the value $n(m-1)$ needs to be replaced by $\sum_{i=1}^{n}(m_i - 1)$ in all formulas. For the sake of simplicity, we restrict ourselves to the case of an equal number of repetitions $m$ per subject.
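As a minimal illustration of the estimator described above, the following Python sketch pools the squared deviations from each subject's own mean and divides by the degrees of freedom $\sum_i (m_i - 1)$, so that unequal numbers of replicates are handled as well. The function name `within_subject_sd` and the simulated data (200 subjects, 2 replicates, true within-subject standard deviation 1.5) are assumptions made for this example only; they are not taken from the Supplemental R Code.

```python
import math
import random


def within_subject_sd(measurements):
    """Pooled within-subject SD from a list of per-subject replicate lists.

    Hypothetical helper for illustration: implements
    sqrt( sum_i sum_j (Y_ij - mean_i)^2 / sum_i (m_i - 1) ).
    """
    sum_sq = 0.0  # pooled sum of squared deviations from the subject means
    df = 0        # degrees of freedom, sum over subjects of (m_i - 1)
    for reps in measurements:
        if len(reps) < 2:
            continue  # a single measurement carries no within-subject information
        mean_i = sum(reps) / len(reps)
        sum_sq += sum((y - mean_i) ** 2 for y in reps)
        df += len(reps) - 1
    if df == 0:
        raise ValueError("need at least one subject with two or more replicates")
    return math.sqrt(sum_sq / df)


# Simulated test-retest study (assumed values): n = 200, m = 2, true w_SD = 1.5.
rng = random.Random(1)
true_w_sd = 1.5
data = []
for _ in range(200):
    mu_i = rng.uniform(50.0, 100.0)  # subject-specific true value, cancels out
    data.append([mu_i + rng.gauss(0.0, true_w_sd) for _ in range(2)])

w_hat = within_subject_sd(data)
print(f"estimated w_SD: {w_hat:.3f} (true value {true_w_sd})")
```

Because the subject means are subtracted before squaring, the spread of the true values across subjects has no influence on the estimate.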
In order to assess changes in the measurements of a single patient in the subsequent longitudinal study, the RC is computed.[7] It indicates the range in which two repeated measurements are expected to fall with a certain probability. In what follows, we restrict ourselves to the assessment of changes in both directions. We want to keep our decision rules flexible, that is, we establish a target specificity $p_{sp} \in (0, 1)$ which shall be reached for patients with no change in their true biomarker value. Hence, RC is a function of $p_{sp}$ and is given by
$$\mathrm{RC}(p_{sp}) := \sqrt{2}\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right) w_{SD}, \tag{5}$$
where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. In most literature, the RC is only considered for a fixed targeted specificity of 95%, that is, $\mathrm{RC}(0.95)$.[4,12] In practice, $w_{SD}$ is unknown and hence replaced by its consistent estimator $\hat{w}_{SD}$ to obtain the estimated RC
$$\widehat{\mathrm{RC}}(p_{sp}) := \sqrt{2}\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right) \hat{w}_{SD}.$$
A confidence interval for the RC at the same level is given by multiplying the limits of the corresponding confidence interval for $w_{SD}$ from (3) by $\sqrt{2}\,\Phi^{-1}((1+p_{sp})/2)$.
The point estimate $\widehat{\mathrm{RC}}(p_{sp})$ can be applied as cutpoint in a longitudinal study to determine whether there has been change between two consecutive measurements $Y_{pre}$ and $Y_{post}$. Here, we also assume that the measured values have independent errors, but the true levels $\mu_{pre}$ and $\mu_{post}$ might actually be different, that is, we have $Y_{pre} = \mu_{pre} + \varepsilon_{pre}$ and $Y_{post} = \mu_{post} + \varepsilon_{post}$ with $\varepsilon_{pre}$ and $\varepsilon_{post}$ being independent and normally distributed with mean 0 and variance $w_{SD}^2$. In case the true values have not changed, that is, $\mu_{pre} = \mu_{post}$, the difference $Y_{post} - Y_{pre}$ is normally distributed with mean 0 and variance $2w_{SD}^2$. Hence, following the definition of RC from (5), we obtain in this case
$$\mathbb{P}\left(|Y_{post} - Y_{pre}| \le \mathrm{RC}(p_{sp})\right) = p_{sp}.$$
The rule to decide whether there is a change in a patient with the two measured values $Y_{pre}$ and $Y_{post}$ should thus be whether their difference lies outside or inside the interval $[-\mathrm{RC}(p_{sp}), \mathrm{RC}(p_{sp})]$. As the bounds are unknown in practice, this decision rule is replaced by the decision rule based on the estimated interval $[-\widehat{\mathrm{RC}}(p_{sp}), \widehat{\mathrm{RC}}(p_{sp})]$. Consequently, the targeted specificity $p_{sp}$ will never be exactly met. This applies analogously to considerations for the sensitivity of this procedure.
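The decision rule described above can be sketched in a few lines of Python. The sketch assumes the usual form of the repeatability coefficient, $\sqrt{2}\,\Phi^{-1}((1+p_{sp})/2)$ times the within-subject standard deviation; the function names and the numerical values ($\hat{w}_{SD} = 0.1$ and the example measurements) are hypothetical and serve only as illustration.

```python
import math
from statistics import NormalDist


def repeatability_coefficient(w_sd, p_sp=0.95):
    """RC for target specificity p_sp, given a (true or estimated) w_SD."""
    z = NormalDist().inv_cdf((1 + p_sp) / 2)  # two-sided standard normal quantile
    return math.sqrt(2) * z * w_sd


def is_change(y_pre, y_post, rc):
    """Flag a change if the difference lies outside the interval [-RC, RC]."""
    return abs(y_post - y_pre) > rc


# Hypothetical estimate from a test-retest study.
rc_hat = repeatability_coefficient(w_sd=0.1, p_sp=0.95)
print(f"estimated RC(0.95): {rc_hat:.4f}")
print("pre 1.00, post 1.20 ->", is_change(1.00, 1.20, rc_hat))
print("pre 1.00, post 1.40 ->", is_change(1.00, 1.40, rc_hat))
```

For $p_{sp} = 0.95$ the multiplier is roughly 2.77, which is the factor commonly quoted for the repeatability coefficient.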

Effective specificity as a criterion for sample size estimation
Our goal is to quantify the uncertainty introduced by the replacement of $w_{SD}$ by its estimator $\hat{w}_{SD}$. As mentioned, the targeted specificity ($p_{sp}$) is not met in practice. To assess this problem, we introduce the effective specificity $P_{esp}$, which is the specificity actually achieved if a realisation of the estimate $\hat{w}_{SD}$ is plugged in. Hence, $P_{esp}$ is a random quantity as it depends on the value of $\hat{w}_{SD}$. We use a capital letter to emphasize that it is indeed a random variable. It can be implicitly defined via
$$\mathrm{RC}(P_{esp}) = \widehat{\mathrm{RC}}(p_{sp}). \tag{6}$$
Although this quantity is unknown in practice, we can nevertheless analyse its distribution. Firstly, we can compute the expected value $\mathbb{E}[P_{esp}]$ and the bias, that is, the difference $\mathbb{E}[P_{esp}] - p_{sp}$. This is also the quantity targeted by Obuchowski and Bullen.[6] Their quality criterion requires $|\mathbb{E}[P_{esp}] - p_{sp}|$ to be at most 0.01, that is, they want the mean effective specificity to differ by no more than 1 percentage point from the target specificity, which they set to 95%. But what is even more important, from our point of view, is that we can compute quantiles of the distribution of $P_{esp}$, which will enable us to establish quality guarantees on the effective specificity of the longitudinal studies based on the design parameters $n$ and $m$ of the test-retest study.

Expected value and bias
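According to (6), $P_{esp}$ is given by
$$P_{esp} = \mathrm{RC}^{-1}\!\left(\widehat{\mathrm{RC}}(p_{sp})\right) = 2\,\Phi\!\left(\frac{\hat{w}_{SD}}{w_{SD}}\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right) - 1.$$
The function RC can be inverted as it is a continuous, monotonically increasing function on $(0, 1)$. The expectation of this random quantity can be computed exactly using (2) or approximately using the central limit theorem (4), according to which the distribution of $\hat{w}_{SD}/w_{SD}$ can be approximated with a normal distribution with expectation 1 and variance $1/(2n(m-1))$. Hence, we get
$$\mathbb{E}[P_{esp}] = \int_0^{\infty} \left[2\,\Phi\!\left(w\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right) - 1\right] f_{\chi^2,n(m-1)}\!\left(n(m-1)w^2\right) 2wn(m-1)\,\mathrm{d}w \tag{7}$$
$$\approx \int_{-\infty}^{\infty} \left[2\,\Phi\!\left(w\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right) - 1\right] \sqrt{2n(m-1)}\;\varphi\!\left(\sqrt{2n(m-1)}\,(w-1)\right)\mathrm{d}w, \tag{8}$$
where $f_{\chi^2,n(m-1)}$ denotes the probability density function (PDF) of a $\chi^2$-distributed random variable with $n(m-1)$ degrees of freedom and $\varphi$ denotes the PDF of the standard normal distribution. By numerical evaluation of the terms in (7) and (8), the bias can be computed.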
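To make the notion of effective specificity more tangible, the following Python sketch evaluates $P_{esp}$ as a function of the ratio $\hat{w}_{SD}/w_{SD}$ and approximates its expected value by Monte Carlo simulation instead of numerical integration. It relies on the assumption that $n(m-1)\hat{w}_{SD}^2/w_{SD}^2$ follows a $\chi^2$-distribution with $n(m-1)$ degrees of freedom, drawn here as a gamma variate with shape $n(m-1)/2$ and scale 2. The function names, the number of draws, and the seed are choices made for this illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def effective_specificity(ratio, p_sp=0.95):
    """Specificity achieved when RC is computed from w_hat = ratio * w_SD."""
    z = STD_NORMAL.inv_cdf((1 + p_sp) / 2)
    return 2 * STD_NORMAL.cdf(z * ratio) - 1


def simulate_mean_esp(n, m, p_sp=0.95, draws=20000, seed=0):
    """Monte Carlo approximation of the expected effective specificity."""
    rng = random.Random(seed)
    k = n * (m - 1)  # degrees of freedom of the test-retest study
    total = 0.0
    for _ in range(draws):
        chi2 = rng.gammavariate(k / 2, 2.0)  # chi-square variate with k d.f.
        total += effective_specificity(math.sqrt(chi2 / k), p_sp)
    return total / draws


mean_esp = simulate_mean_esp(n=35, m=2)
bias = mean_esp - 0.95
print(f"mean effective specificity: {mean_esp:.4f}, bias: {bias:+.4f}")
```

An underestimated $w_{SD}$ (ratio below 1) lowers the effective specificity, while an overestimate raises it; since the mapping is concave in the relevant range, the mean ends up below the target.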

Quantiles of the distribution of P esp
We need to be aware that even if $\mathbb{E}[P_{esp}]$ is close to $p_{sp}$, that is, the bias is low, the probability for a substantial deviation of the actually realized specificity from the targeted specificity might be large (Figure 1).
Therefore, we want to know with which confidence $p_{conf}$ we can say that the effective specificity is larger than some lower bound $p_{esp,lb}$. This is expressed by the formula
$$p_{conf} := \mathbb{P}\left(P_{esp} \ge p_{esp,lb}\right).$$
We want to introduce a new quality criterion based on this concept.
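The quantity $p_{conf}$ is a function of $p_{esp,lb}$ and of course also depends on $p_{sp}$, $n$, and $m$. For notational convenience, however, we omit those arguments. After some calculations, one obtains
$$p_{conf} = 1 - F_{\chi^2,n(m-1)}\!\left(n(m-1)\left(\frac{\Phi^{-1}\!\left(\frac{1+p_{esp,lb}}{2}\right)}{\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)}\right)^{2}\right) \tag{9}$$
$$\approx \Phi\!\left(\sqrt{2n(m-1)}\left(1 - \frac{\Phi^{-1}\!\left(\frac{1+p_{esp,lb}}{2}\right)}{\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)}\right)\right), \tag{10}$$
where $F_{\chi^2,n(m-1)}$ denotes the cumulative distribution function of a $\chi^2$-distributed random variable with $n(m-1)$ degrees of freedom. These formulas can now be used in different ways. In the above form, one can determine the confidence with which the effective specificity exceeds a fixed bound $p_{esp,lb}$ with given design parameters $n$ and $m$ of the test-retest study. Analogous considerations can be made for upper bounds by computing the probability of the complementary event. The Supplemental Material contains a simulation study that confirms this key calculation.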
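The confidence of exceeding a lower bound can be evaluated numerically as follows. This Python sketch assumes that the event $P_{esp} \ge p_{esp,lb}$ is equivalent to $\hat{w}_{SD}/w_{SD} \ge q$ with $q = \Phi^{-1}((1+p_{esp,lb})/2) / \Phi^{-1}((1+p_{sp})/2)$, so that the exact confidence is one minus the $\chi^2$ distribution function at $n(m-1)q^2$ and the asymptotic confidence is $\Phi(\sqrt{2n(m-1)}(1-q))$. Since the Python standard library offers no $\chi^2$ distribution, the helper `chi2_cdf` implements a standard series for the regularized incomplete gamma function; all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def chi2_cdf(x, k):
    """Chi-square CDF with k d.f. via the series of the regularized
    lower incomplete gamma function P(k/2, x/2)."""
    if x <= 0:
        return 0.0
    a, t = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    i = 1
    while term > 1e-16 * total and i < 10000:
        term *= t / (a + i)
        total += term
        i += 1
    return min(1.0, total * math.exp(a * math.log(t) - t - math.lgamma(a)))


def bound_ratio(p_lb, p_sp):
    """Smallest ratio w_hat / w_SD for which the bound p_lb is still reached."""
    return (STD_NORMAL.inv_cdf((1 + p_lb) / 2)
            / STD_NORMAL.inv_cdf((1 + p_sp) / 2))


def confidence_exact(p_lb, p_sp, n, m):
    k = n * (m - 1)
    q = bound_ratio(p_lb, p_sp)
    return 1.0 - chi2_cdf(k * q * q, k)


def confidence_asymptotic(p_lb, p_sp, n, m):
    k = n * (m - 1)
    q = bound_ratio(p_lb, p_sp)
    return STD_NORMAL.cdf(math.sqrt(2 * k) * (1 - q))


for n in (10, 20, 35, 53, 54):
    exact = confidence_exact(0.90, 0.95, n, 2)
    approx = confidence_asymptotic(0.90, 0.95, n, 2)
    print(f"n = {n:3d}: exact {exact:.4f}, asymptotic {approx:.4f}")
```

The probability of falling below a bound is obtained as one minus the returned confidence, which corresponds to the complementary event mentioned above.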
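In the planning stage of the test-retest study, it could be beneficial to choose the sample size $n$ in such a way that a desired lower bound $p_{esp,lb}$ is achieved with a prespecified confidence $p_{conf}$. To this end, the asymptotic formula (10) can be solved explicitly for $n$:
$$n \approx \frac{\left(\Phi^{-1}(p_{conf})\right)^2}{2(m-1)}\left(1 - \frac{\Phi^{-1}\!\left(\frac{1+p_{esp,lb}}{2}\right)}{\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)}\right)^{-2}. \tag{11}$$
The exact formula (9) cannot be explicitly solved for $n$. However, one can numerically solve
$$n = \min\left\{\tilde{n} \in \mathbb{N} : 1 - F_{\chi^2,\tilde{n}(m-1)}\!\left(\tilde{n}(m-1)\left(\frac{\Phi^{-1}\!\left(\frac{1+p_{esp,lb}}{2}\right)}{\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)}\right)^{2}\right) \ge p_{conf}\right\}. \tag{12}$$
In our application example, we will apply these formulas in the planning stage of a hypothetical test-retest study.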
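The asymptotic sample size calculation lends itself to a short Python sketch. It assumes the closed form $n \approx (\Phi^{-1}(p_{conf}))^2 / (2(m-1)(1-q)^2)$ with $q = \Phi^{-1}((1+p_{esp,lb})/2) / \Phi^{-1}((1+p_{sp})/2)$, which follows from solving the normal approximation for $n$. The function name and the returned tuple (real-valued solution and rounded-up sample size) are choices made for this illustration; the exact calculation would additionally require a numerical search over the $\chi^2$ distribution function.

```python
import math
from statistics import NormalDist


def sample_size_specificity(p_lb, p_conf, p_sp=0.95, m=2):
    """Asymptotic sample size so that P_esp >= p_lb holds with confidence p_conf.

    Returns the real-valued solution and the smallest integer n above it.
    """
    if not 0 < p_lb < p_sp < 1:
        raise ValueError("the lower bound must lie below the target specificity")
    nd = NormalDist()
    q = nd.inv_cdf((1 + p_lb) / 2) / nd.inv_cdf((1 + p_sp) / 2)
    n_real = nd.inv_cdf(p_conf) ** 2 / (2 * (m - 1) * (1 - q) ** 2)
    return n_real, math.ceil(n_real)


n_real, n_required = sample_size_specificity(p_lb=0.90, p_conf=0.95)
print(f"asymptotic solution: {n_real:.1f}, required sample size: {n_required}")
```

Tightening the lower bound towards the target specificity drives $q$ towards 1 and therefore inflates the required sample size quadratically.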
The number of measurements needed to comply with the quality criteria proposed here does not have to be formulated in terms of the number of cases $n$ for an arbitrary fixed $m$. Although this is common practice, the formulas given here show that the product $n(m-1)$ is decisive. Hence, if the number of replicated measurements $m$ can also be chosen freely, the total number of measurements $N = mn$ can be reduced by increasing $m$. In this sense, setting $n = 1$ and $m = N$ is the most efficient configuration. However, such a design of a test-retest study might be regarded as inadvisable for several reasons, as laid down in the discussion.
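The trade-off between the number of subjects and the number of replicates can be tabulated with a few lines of Python. The sketch assumes that a design is adequate as soon as $n(m-1)$ reaches a required number of degrees of freedom; the value 54 used below is the one that results from the exact calculation in the application example of this article, while the function name and the selection of $m$ values are arbitrary.

```python
import math


def designs_for_df(required_df, m_values):
    """For each number of replicates m, return (m, n, N) with the smallest n
    such that n * (m - 1) >= required_df and N = m * n total measurements."""
    rows = []
    for m in m_values:
        n = math.ceil(required_df / (m - 1))
        rows.append((m, n, m * n))
    return rows


for m, n, total in designs_for_df(54, [2, 3, 4, 7, 55]):
    print(f"m = {m:2d}: n = {n:2d} subjects, N = {total:3d} measurements")
```

The total number of measurements approaches the required degrees of freedom plus one as $m$ grows, which is why the single-subject design is formally the most efficient one.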
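If one wants to identify the worst possible cases for given $n$ and $m$, one could compute the lower bound of the effective specificity which is reached with confidence $p_{conf}$:
$$p_{esp,lb} = 2\,\Phi\!\left(\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\sqrt{\frac{\chi^2_{n(m-1),\,1-p_{conf}}}{n(m-1)}}\right) - 1 \tag{13}$$
$$\approx 2\,\Phi\!\left(\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\left(1 - \frac{\Phi^{-1}(p_{conf})}{\sqrt{2n(m-1)}}\right)\right) - 1. \tag{14}$$
Accordingly, in $(1 - p_{conf}) \cdot 100\%$ of all cases, the effective specificity will be even lower than the obtained $p_{esp,lb}$.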
From our point of view, the probability of exceeding a lower bound $p_{esp,lb}$ is a valid criterion for evaluating the quality of assessment in a longitudinal study. Different from the expected value of $P_{esp}$, which has been previously proposed as a quality criterion,[6] our criterion considers the tails of the distribution of $P_{esp}$. This allows us to bound the probability of strongly deviating from the desired specificity.
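For a given design, the lower bound that is reached with a chosen confidence can be computed as sketched below. The Python code assumes that the exact bound is obtained by inserting the $(1 - p_{conf})$-quantile of the $\chi^2$-distribution with $n(m-1)$ degrees of freedom for $n(m-1)\hat{w}_{SD}^2/w_{SD}^2$, and that the asymptotic bound uses the ratio $1 - \Phi^{-1}(p_{conf})/\sqrt{2n(m-1)}$ instead. The $\chi^2$ quantile is found by bisection on a series implementation of the distribution function, because the standard library provides neither; all function names are hypothetical.

```python
import math
from statistics import NormalDist

ND = NormalDist()


def chi2_cdf(x, k):
    """Chi-square CDF with k d.f. (series for the regularized incomplete gamma)."""
    if x <= 0:
        return 0.0
    a, t = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    i = 1
    while term > 1e-16 * total and i < 10000:
        term *= t / (a + i)
        total += term
        i += 1
    return min(1.0, total * math.exp(a * math.log(t) - t - math.lgamma(a)))


def chi2_quantile(p, k):
    """p-quantile of the chi-square distribution with k d.f. by bisection."""
    lo, hi = 0.0, k + 10.0 * math.sqrt(2.0 * k) + 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def lower_bound_exact(p_conf, n, m, p_sp=0.95):
    k = n * (m - 1)
    ratio = math.sqrt(chi2_quantile(1 - p_conf, k) / k)
    return 2 * ND.cdf(ND.inv_cdf((1 + p_sp) / 2) * ratio) - 1


def lower_bound_asymptotic(p_conf, n, m, p_sp=0.95):
    k = n * (m - 1)
    ratio = 1 - ND.inv_cdf(p_conf) / math.sqrt(2 * k)
    return 2 * ND.cdf(ND.inv_cdf((1 + p_sp) / 2) * ratio) - 1


for n in (10, 20, 139):
    print(f"n = {n:3d}, m = 2: exact {lower_bound_exact(0.95, n, 2):.4f}, "
          f"asymptotic {lower_bound_asymptotic(0.95, n, 2):.4f}")
```

The same functions can be used retrospectively, by inserting the $n$ and $m$ of an already published test-retest study.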

Consideration of effective sensitivity
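Concerning the sensitivity, that is, the ability to detect real change between two measurements of one patient in the longitudinal study, we can make similar considerations. Before coming back to the problem of the uncertainty caused by the estimation of $w_{SD}$, we first assume that $w_{SD}$ and hence also $\mathrm{RC}(p_{sp})$ is known. Of course, the sensitivity strongly depends on the difference between $\mu_{pre}$ and $\mu_{post}$. Also, such differences are more difficult to detect if $w_{SD}$ is large and a large target specificity is chosen. To be more precise, the sensitivity $p_{se}$ to detect a difference can be written as a function of $\mu_\Delta := \mu_{post} - \mu_{pre}$, $w_{SD}$ and the chosen specificity $p_{sp}$. If the intercepts $\mu_{post}$ and $\mu_{pre}$ are considered random, all the following probability statements are conditional on $\mu_\Delta$. Then, this relationship is given by
$$p_{se}(p_{sp}, \mu_\Delta, w_{SD}) = \Phi\!\left(\frac{\mu_\Delta}{\sqrt{2}\,w_{SD}} - \Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right) + \Phi\!\left(-\frac{\mu_\Delta}{\sqrt{2}\,w_{SD}} - \Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right). \tag{15}$$
In this form, the function can also be seen as a function of the effect size $\delta := \mu_\Delta / w_{SD}$, that is,
$$p_{se}(p_{sp}, \delta) = \Phi\!\left(\frac{\delta}{\sqrt{2}} - \Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right) + \Phi\!\left(-\frac{\delta}{\sqrt{2}} - \Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right).$$
This dependence of the sensitivity on the effect size $\delta$ is visualized in Figure 2.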
As $w_{SD}$ is unknown and needs to be estimated by $\hat{w}_{SD}$, which will then be plugged in to compute $\widehat{\mathrm{RC}}(p_{sp})$, the sensitivity computed in (15) will not be reached. Analogously to our considerations for the specificity, we introduce the effective sensitivity $P_{ese}$, which is the sensitivity that is actually achieved if a realization of the estimate $\hat{w}_{SD}$ is plugged in. Of course, it is also a random variable and depends again on $\mu_\Delta$, $w_{SD}$ and $p_{sp}$. It can be defined by the equation
$$P_{ese} = \Phi\!\left(\frac{\delta}{\sqrt{2}} - \frac{\hat{w}_{SD}}{w_{SD}}\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right) + \Phi\!\left(-\frac{\delta}{\sqrt{2}} - \frac{\hat{w}_{SD}}{w_{SD}}\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right). \tag{16}$$
With this expression and the exact distribution of $\hat{w}_{SD}$ given in (2), or the approximation of the distribution of $\hat{w}_{SD}/w_{SD}$ by a normal distribution from (4), we can now quantify the bias caused by the replacement of $w_{SD}$ by $\hat{w}_{SD}$ and compute quantiles of the distribution of $P_{ese}$, which will enable us to also give quality guarantees on the effective sensitivity. Unlike our considerations for the specificity, these values will also depend on the actual $w_{SD}$ and the difference $\mu_\Delta$ of the longitudinal study and hence will be regarded as functions thereof.
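The sensitivity for a known $w_{SD}$ and the effective sensitivity for an estimated one can be evaluated with the same small function. The Python sketch below assumes that a change is flagged whenever the difference of two measurements, which is normally distributed with mean $\mu_\Delta$ and variance $2w_{SD}^2$, falls outside the estimated interval; the argument `ratio` stands for $\hat{w}_{SD}/w_{SD}$, so that `ratio=1.0` corresponds to a known within-subject standard deviation. The function name and the example values are hypothetical.

```python
import math
from statistics import NormalDist

ND = NormalDist()


def sensitivity(delta, p_sp=0.95, ratio=1.0):
    """Probability of flagging a change for effect size delta = mu_delta / w_SD.

    ratio = w_hat / w_SD; ratio = 1.0 gives the sensitivity for known w_SD,
    any other value the effective sensitivity for that realisation of w_hat.
    """
    z = ND.inv_cdf((1 + p_sp) / 2)
    shift = delta / math.sqrt(2)
    upper = ND.cdf(shift - z * ratio)   # difference above +RC
    lower = ND.cdf(-shift - z * ratio)  # difference below -RC
    return upper + lower


print(f"delta = 4, known w_SD:          {sensitivity(4.0):.4f}")
print(f"delta = 4, w_hat 10% too large: {sensitivity(4.0, ratio=1.1):.4f}")
print(f"delta = 4, w_hat 10% too small: {sensitivity(4.0, ratio=0.9):.4f}")
```

For $\delta = 0$ the function returns $1 - p_{sp}$ when `ratio=1.0`, which is the false positive rate implied by the target specificity.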

Expected value and bias
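To compute the bias as a function of $p_{sp}$, $\mu_\Delta$, and $w_{SD}$, we can take the expectation of the right hand side of (16) and use the exact distribution (2) and the central limit theorem (4) to obtain the result
$$\mathbb{E}[P_{ese}] = \int_0^{\infty} \left[\Phi\!\left(\frac{\delta}{\sqrt{2}} - w\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right) + \Phi\!\left(-\frac{\delta}{\sqrt{2}} - w\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right)\right] f_{\chi^2,n(m-1)}\!\left(n(m-1)w^2\right) 2wn(m-1)\,\mathrm{d}w \tag{17}$$
$$\approx \int_{-\infty}^{\infty} \left[\Phi\!\left(\frac{\delta}{\sqrt{2}} - w\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right) + \Phi\!\left(-\frac{\delta}{\sqrt{2}} - w\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right)\right] \sqrt{2n(m-1)}\;\varphi\!\left(\sqrt{2n(m-1)}\,(w-1)\right)\mathrm{d}w, \tag{18}$$
where $f_{\chi^2,n(m-1)}$ is the PDF of the $\chi^2$-distribution with $n(m-1)$ degrees of freedom and $\varphi$ is the PDF of the standard normal distribution. Please note that this can essentially be seen as a function of $\delta$. Following (16), the bias of the effective sensitivity can be considered a function of $\delta$ for any given $p_{sp}$, that is, $\mathbb{E}[P_{ese}(p_{sp}, \delta)] - p_{se}(p_{sp}, \delta)$.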

Quantiles of the distribution of P ese
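For the most accurate examination of the distribution of $P_{ese}$, we would need to consider both events
$$\left\{Y_{post} - Y_{pre} > \widehat{\mathrm{RC}}(p_{sp})\right\} \tag{19}$$
and
$$\left\{Y_{post} - Y_{pre} < -\widehat{\mathrm{RC}}(p_{sp})\right\}. \tag{20}$$
However, this leads to expressions that are difficult to handle analytically. Actually, the probabilities of these two events, given $\hat{w}_{SD}$, sum up to the effective sensitivity. However, in the presence of an effect, one of them will be much larger than the other.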
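In the case $\delta > 0$, the probability in (19) is larger than that from (20), which is bounded from above by 0.025 and quickly converges to 0 as $\delta$ increases. To enable the derivation of analytical formulas, we will therefore restrict ourselves to the consideration of $\delta > 0$ and the event (19). It is nevertheless possible to circumvent this simplification by numerical inversion of the relationship given in (16). But here, we will approximate
$$P_{ese} \approx \Phi\!\left(\frac{\delta}{\sqrt{2}} - \frac{\hat{w}_{SD}}{w_{SD}}\,\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)\right).$$
In analogy to the previous section, we can provide confidence levels $p_{conf}$ which indicate the probability that the effective sensitivity for some effect $\delta$ exceeds the lower bound $p_{ese,lb}$:
$$p_{conf} := \mathbb{P}\left(P_{ese} \ge p_{ese,lb}\right) \approx F_{\chi^2,n(m-1)}\!\left(n(m-1)\left(\frac{\delta/\sqrt{2} - \Phi^{-1}(p_{ese,lb})}{\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)}\right)^{2}\right) \tag{21}$$
$$\approx \Phi\!\left(\sqrt{2n(m-1)}\left(\frac{\delta/\sqrt{2} - \Phi^{-1}(p_{ese,lb})}{\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)} - 1\right)\right). \tag{22}$$
Of course, such considerations only make sense if $p_{se} > p_{ese,lb}$ for the chosen effect size $\delta$. As above, analogous considerations can be made for upper bounds by computing the probability of the complementary event.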
While (21) allows us to compute the confidence of reaching a certain lower bound of the sensitivity for an effect $\delta$, this formula may also be transformed to be used in the planning stage of the test-retest study. If one wants to achieve a fixed confidence with which the effective sensitivity for an effect size $\delta$ exceeds some lower bound, one can use the exact results from above or the approximations made thereafter to determine the sample size $n$ of the test-retest study in which each patient is measured $m$ times. It shall be chosen such that
$$n = \min\left\{\tilde{n} \in \mathbb{N} : F_{\chi^2,\tilde{n}(m-1)}\!\left(\tilde{n}(m-1)\left(\frac{\delta/\sqrt{2} - \Phi^{-1}(p_{ese,lb})}{\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)}\right)^{2}\right) \ge p_{conf}\right\} \tag{23}$$
$$\approx \frac{\left(\Phi^{-1}(p_{conf})\right)^2}{2(m-1)}\left(\frac{\delta/\sqrt{2} - \Phi^{-1}(p_{ese,lb})}{\Phi^{-1}\!\left(\frac{1+p_{sp}}{2}\right)} - 1\right)^{-2}. \tag{24}$$
Analogous to the preceding section, we can use these results in the planning stage of a test-retest study, as we will demonstrate in the following application example. Even if a study is planned based on considerations of the specificity, the formulas allow us to assess the distribution of the effective sensitivity for any given effect size of interest. Calculations for $\delta < 0$ follow analogously to the considerations for $\delta > 0$.
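A sample size calculation based on the effective sensitivity can be sketched in the same manner as for the specificity. The Python code below rests on two assumptions: only the dominant tail is considered for $\delta > 0$, so that the bound is reached whenever $\hat{w}_{SD}/w_{SD} \le r$ with $r = (\delta/\sqrt{2} - \Phi^{-1}(p_{ese,lb})) / \Phi^{-1}((1+p_{sp})/2)$, and the distribution of $\hat{w}_{SD}/w_{SD}$ is replaced by its normal approximation. Function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_sensitivity(delta, p_lb, p_conf, p_sp=0.95, m=2):
    """Asymptotic sample size so that the effective sensitivity for effect
    size delta > 0 is at least p_lb with confidence p_conf.

    Returns the real-valued solution and the smallest integer n above it.
    """
    nd = NormalDist()
    z = nd.inv_cdf((1 + p_sp) / 2)
    r = (delta / math.sqrt(2) - nd.inv_cdf(p_lb)) / z
    if r <= 1:
        # The bound is not even reached for a perfectly known w_SD.
        raise ValueError("lower bound exceeds the sensitivity for known w_SD")
    n_real = nd.inv_cdf(p_conf) ** 2 / (2 * (m - 1) * (r - 1) ** 2)
    return n_real, math.ceil(n_real)


n_real, n_required = sample_size_sensitivity(delta=4.0, p_lb=0.75, p_conf=0.95)
print(f"asymptotic solution: {n_real:.1f}, required sample size: {n_required}")
```

The guard against $r \le 1$ mirrors the remark that the calculation is only meaningful if the sensitivity for a known $w_{SD}$ lies above the requested lower bound.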

Application example
To illustrate our considerations, we will discuss a hypothetical application for early treatment response assessment in recurrent or metastatic nasopharyngeal carcinoma. While some patients with recurrent nasopharyngeal carcinoma show response or stable disease after systemic treatment, many patients will have progressive disease, which is invariably lethal.[13,14] Nevertheless, futile treatments should be avoided due to associated toxicity.[13,15] To suspend futile treatment as soon as possible, an imaging biomarker is desirable which accurately classifies treatment response earlier than change in morphologic lesion size, the current standard. A promising biomarker in this context is diffusion weighted magnetic resonance imaging (DWI).[16] DWI depends on the differences in the movement of water molecules based on Brownian motion, which can be quantified by the apparent diffusion coefficient (ADC). An exemplary measurement of ADC is shown in Figure 3. [18][19]

Prospective planning of test-retest studies
As laid out above, before conducting a longitudinal study in which a biomarker is applied to assess treatment response, a test-retest study should be conducted to assess repeatability. In our example, we will set $p_{sp}$ to 95% and $m = 2$, as these are the usual values in the literature. We imagine the researcher would want to obtain a specificity of at least 90% ($p_{esp,lb}$) with 95% certainty ($p_{conf}$) in the longitudinal study. What sample size ($n$) is necessary in the test-retest study? This question can be answered using the asymptotic formula (11):
$$n \approx \frac{\left(\Phi^{-1}(0.95)\right)^2}{2(2-1)}\left(1 - \frac{\Phi^{-1}(0.95)}{\Phi^{-1}(0.975)}\right)^{-2} \approx 52.3,$$
so that a sample size of 53 is required. This can also be concluded from Figure 4(A). Numerical solution of the exact formula (12) yields a sample size of 54. Resulting sample sizes for other values of $p_{esp,lb}$ can be taken from Figure 4(B). The resulting scenario in terms of the distribution of the relative error in the estimation of $\hat{w}_{SD}$ and its effect on $P_{esp}$ is displayed in Figure 5(A). As already noted in Section 3.2, it can be seen in Figure 4(B) that the overall number of measurements $N = mn$ can be reduced by increasing $m$. Here, for $m = 3$, a total number of 81 = 3 × 27 measurements is necessary, while for $m = 2$, 108 = 2 × 54 are required. Taking this to the extreme, 55 measurements of only one patient would yield the same result in terms of the distribution of $P_{esp}$, although this may not be advisable.
Analogous considerations can be made for the effective sensitivity. We consider the sensitivity for an underlying true effect size of $\delta = 4$ in a study with $p_{sp} = 0.95$ and $m = 2$. According to formula (15), a sensitivity of 80.74% would be achieved if $w_{SD}$ were a known quantity. However, this will not be met in practice. What is the minimum sample size ($n$) of the test-retest study such that we can be 95% ($p_{conf}$) sure to achieve at least a sensitivity of 75% ($p_{ese,lb}$) for that effect size? This question can be answered using the approximate formula (24):
$$n \approx \frac{\left(\Phi^{-1}(0.95)\right)^2}{2(2-1)}\left(\frac{4/\sqrt{2} - \Phi^{-1}(0.75)}{\Phi^{-1}(0.975)} - 1\right)^{-2} \iff n \approx 138.1.$$
Accordingly, a sample size of 139 patients in the test-retest study would be recommended to achieve the set targets. Using the exact formula (13) or the asymptotic formula (14), an effective specificity of at least 92.25% or 92.27%, respectively, is reached with a certainty of 95% in this scenario. This is also depicted in Figure 5(B).

Retrospective assessment of test-retest studies
Conducting a test-retest study prior to a longitudinal study may not always be necessary. The RC used in the longitudinal study might be adopted from already published test-retest studies. If one intends to use the point estimate of the RC obtained in a previous study, it is advisable to retrospectively assess the resulting distribution of the effective specificity and sensitivity. This allows us to evaluate the impact of the sample size of the used test-retest study on quality criteria of the longitudinal study, especially the probability of exceeding a given $p_{esp,lb}$.
1][22][23][24][25] If the point estimator of $\mathrm{RC}(0.95)$ resulting from a test-retest study with a sample size of 10 and two repeated measurements is used, the distribution of the effective specificity will have prominent tails as illustrated in Figure 1. According to (13), for sample sizes of 10 and 20, the lower bounds of the effective specificity obtained with 95% confidence are 0.7814 and 0.8512, respectively (Figure 4). This might be insufficient. Note that for the recommendation by Obuchowski and Bullen[6] of a sample size of 35 for test-retest studies with $m = 2$, the probability of achieving an effective specificity below 94% is 39.74%.
Such considerations are also possible for the effective sensitivity.

Discussion
We have established a comprehensive framework for planning of test-retest studies concerning repeatability. It enables flexible calculation of sample size requirements and retrospective assessment of such studies with regard to different quality criteria.
To better discuss planning of test-retest studies, we have introduced the notions of effective specificity ($P_{esp}$) and effective sensitivity ($P_{ese}$), allowing for clearer differentiation of the targeted specificity $p_{sp}$ and sensitivity $p_{se}$ from the values actually achieved in the longitudinal study. Both $P_{esp}$ and $P_{ese}$ are random quantities and their actual values are unknown in practical application. However, we can determine their distribution and thus can compute different characteristics which properly reflect the uncertainty caused by the estimation process.
Expanding on the work of Obuchowski and Bullen,[6] we have introduced a new quality criterion for sample size calculation of test-retest studies. In their work, Obuchowski and Bullen[6] demand that the mean effective specificity ($\mathbb{E}[P_{esp}]$) deviates at most by 0.01 from the fixed targeted specificity ($p_{sp}$) of 0.95. However, using the mean effective specificity as sole quality criterion has limitations, since the whole distribution of the effective specificity is not properly taken into account. As illustrated in Figure 1, there is a high probability that the actually achieved effective specificity deviates strongly from its target even if the mean effective specificity may be close to the targeted specificity. Therefore, we propose a quality criterion for sample size calculations based on the probability that the effective specificity exceeds a chosen lower bound, taking into account the tails of the distribution of $P_{esp}$.
Figure. Visualization of the application example. The x-axis denotes the relative error between $\hat{w}_{SD}$ and $w_{SD}$. The solid black line represents the asymptotic probability density function (PDF) of the relative error. The dashed line shows the effective specificity. Analogously, the dotted line shows the effective sensitivity for an underlying effect size of $\delta = 4$. (A) For $n = 53$ and $m = 2$: The area shaded in light gray represents 95% of the area under the normal curve. That is, there is a 95% chance of obtaining a $\hat{w}_{SD}$ from the test-retest study that will result in an effective specificity of >90%. (B) For $n = 139$ and $m = 2$: The area shaded in light gray represents 95% of the area under the normal curve. That is, there is a 95% chance of obtaining a $\hat{w}_{SD}$ from the test-retest study that will result in an effective sensitivity of >75%. In this case, an effective specificity of 92.27% will be reached with a certainty of 95%.
Typically, the number of repeated measurements per patient is set to $m = 2$. Nevertheless, efficiency gains in the total number of measurements of the test-retest study can be attained by increasing $m$, as the distribution of $P_{esp}$ only depends on the product $n(m-1)$. However, caution is advised here, as a very small number of subjects with a large number of repeated measurements may come with some disadvantages. From an ethical point of view, it can be unjustified to subject a patient to a high number of potentially harmful examinations. But also from a statistical point of view this can be unfavorable. Assumptions of the underlying model, such as homoskedasticity, cannot be investigated properly in such a trial, and habituation effects after a large number of measurements could bias the estimate of $w_{SD}$.
In contrast to previous works, we expand our consideration also to issues of sensitivity. Here, of course, it must also be taken into account that the sensitivity depends on the underlying effect size. Nevertheless, we can determine the distribution of the effective sensitivity for any effect size and provide sample size formulas analogous to those for the specificity.
Finally, our study is the first to provide analytical rather than simulation results. This provides greater flexibility as the targeted specificity $p_{sp}$ and number of repeated measurements $m$ may be chosen freely. Hence, it allows the readers to avoid conducting time-consuming simulation studies themselves. While our formulas enable flexible calculations for all scenarios, for convenience of the reader we also provide a table with sample sizes for some exemplary scenarios in Figure 4(B). Sample sizes resulting from other choices of the parameters $p_{conf}$, $p_{esp,lb}$, $p_{sp}$, and $m$ can be found in Supplemental Tables S1 to S4.
A larger sample size than often found in the literature for test-retest studies is required if a high quality of the estimate $\hat{w}_{SD}^2$ is desired according to the choices of $p_{conf}$, $p_{esp,lb}$, and $p_{ese,lb}$. From a purely statistical perspective, multiple estimates $\hat{w}_{SD,1}, \ldots, \hat{w}_{SD,k}$ of the within-patient standard deviation (e.g. from different studies) can be combined in a meta-analytic way into a pooled estimate to meet this increased sample size requirement. The properties of this aggregated estimate can be investigated using all the tools introduced in our work. However, for this combination to be sensible, it must be ensured that all of these estimators estimate the same quantity.
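For illustration, one standard way to write such a pooled estimate weights each study by its degrees of freedom. The form below is a sketch under the assumptions of independent studies, normally distributed errors and a common $w_{SD}$; the symbols $n_i$, $m_i$, $\nu_i$ and $\nu$ are introduced here and may differ from the authors' notation.

```latex
% Hypothetical notation: study i contributes n_i subjects with m_i replicates each.
\nu_i = n_i (m_i - 1), \qquad \nu = \sum_{i=1}^{k} \nu_i,
\qquad
\hat{w}_{SD,\mathrm{pooled}}^{2} = \frac{1}{\nu} \sum_{i=1}^{k} \nu_i \, \hat{w}_{SD,i}^{2},
\qquad
\frac{\nu \, \hat{w}_{SD,\mathrm{pooled}}^{2}}{w_{SD}^{2}} \sim \chi^{2}_{\nu}.
```

Under these assumptions the pooled estimate behaves like the estimate from a single study with $\nu$ degrees of freedom, so criteria based on $n(m-1)$ can be applied with $\nu$ in its place.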
Our study has some limitations. The field of application is restricted to test-retest studies in which true replicates of measurements are possible, for example, in quantitative imaging markers. Our considerations are not valid if the measurement process itself results in a change of the measurand (learning/practice effect), as has been described for some psychological assessments. 26,27 Our standard model (1) assumes independent and identically normally distributed errors. It is, therefore, advisable to examine whether there is a relationship between the within-subject variation and the level of the measured value before applying our approach. 28 If the variability of the measurement error increases with the magnitude of the measured value, a log transformation might resolve the issue. 29,30,28 Beyond that, non-normally distributed error terms are not covered so far. We also have not specifically considered the scenario of clustered data, for example, measuring multiple lesions per subject. However, if a hierarchical model structure with independent errors can be assumed, this does not restrict the application of our approach.
It should be noted that exact solutions based on the $\chi^2$-distribution are available for all our considerations. In some cases, when an analytic solution is not possible, these exact solutions require the application of numerical methods. In order to give completely analytic solutions, some of our formulas rely on asymptotic results and approximations. The differences between exact and approximate results are most severe for small sample sizes and small effect sizes. Applying both exact and approximate formulas in our application example shows that these differences are negligible in practically relevant scenarios. However, we recommend using the exact calculations for the final planning of a test-retest study. Implementations of exact and approximate solutions can be found in our Supplemental R Code. 31
So far, our considerations are limited to repeatability, that is, they assume the same measurement conditions for the repeated measurements. However, for real-world application of biomarkers, consideration of reproducibility is also important, since longitudinal measurements are often performed under different measuring conditions, for example, varying readers or scanners. Therefore, our model should be extended to include aspects of reproducibility, such as a fixed bias as, for example, in some models considered by Obuchowski and Bullen 6 or multiple variance components that take into account different sources of variability in such settings. Nevertheless, since repeatability limits reproducibility, a good knowledge of the former is useful in order to interpret reproducibility studies properly. 9
Test-retest studies of repeatability should be well planned to guarantee sufficient quality of the dependent longitudinal studies. Our framework allows the derivation of analytical solutions for quality criteria that can be used to assess implications of the test-retest study design on subsequent longitudinal studies.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figure 2. Sensitivity as a function of the effect size ($\Delta$ scaled by $w_{SD}$) for $p_{sp} = 0.95$. Note that we assume $w_{SD}$ to be known here.

Figure 3. Example case of a tumor in the right nose showing restricted diffusion (A) with a mean apparent diffusion coefficient (ADC) of $610 \times 10^{-6}$ mm$^2$/s. (B) Region of interest outlined in yellow.

Figure 4. (A) Lower bound of effective specificity reached with confidence of 95% as a function of sample size ($n$) and number of repeated measurements ($m$) of the test-retest study. (B) Sample size resulting from (12) for different values of the desired lower bound $p_{esp,lb}$ that shall be exceeded with a fixed confidence of 95%.