Reference range: Which statistical intervals to use?

Reference ranges, which are data-based intervals aiming to contain a pre-specified large proportion of the population values, are powerful tools to analyse observations in clinical laboratories. Their main point is to classify any future observations from the population which fall outside them as atypical and thus may warrant further investigation. As a reference range is constructed from a random sample from the population, the event ‘a reference range contains (100 P)% of the population’ is also random. Hence, all we can hope for is that such event has a large occurrence probability. In this paper we argue that some intervals, including the P prediction interval, are not suitable as reference ranges since there is a substantial probability that these intervals contain less than (100 P)% of the population, especially when the sample size is large. In contrast, a (P,γ) tolerance interval is designed to contain (100 P)% of the population with a pre-specified large confidence γ so it is eminently adequate as a reference range. An example based on real data illustrates the paper’s key points.


Introduction
The 'Choose Wisely' campaign was developed in the United States in 2012 by the American Board of Internal Medicine Foundation and was launched in the United Kingdom in 2016 by the Academy of Medical Royal Colleges. It aims to encourage a dialogue between clinicians and patients regarding the risk and benefits of interventions, and the practice of evidence-based treatment regimens. 1 As described recently, 2 this conversation often refers to the patient's observed values of relevant clinical markers. Since the clinical laboratory provides comparator intervals to assist the clinician in determining a context for an individual value, a natural question from the patient is 'Are my test results typical with respect to a healthy population?'. Although such assessment values are often referred to as the test's normal range, this terminology should be discouraged as it implies that such a result has a binary 'normal or abnormal' quality which may lead to an arbitrary dichotomous interpretation of the patient's health status. 2 Instead, the terms 'reference limits' or 'reference range' should be used in this context.
Reference ranges are powerful tools in laboratory medicine to aid decision making 3 and their use has become increasingly prevalent in clinical practice. Searching in the Web of Science engine at the time of writing for articles published between 1999 and 2019 with 'reference range' as a topic, we found 5431 articles of which 469 appeared in 2019, in contrast to 268 articles that appeared in 2009. These articles have collectively been cited by 91,034 publications of which 11,270 appeared in 2019, 2.4 times more than the number of citing publications 10 years earlier.
Apart from the important individual overtones for patients, incorrectly estimating the reference range of a sensitive clinical marker of physiological function has enormous public health implications. For example, underestimating the upper limit of a reference range would mean classifying a large number of people as diseased, thus affecting the doses of medication prescribed. 4 Construction of appropriate reference ranges is therefore crucial in laboratory medicine practice. Well-known general references 3,5-9 and a case for teaching tolerance intervals in introductory statistics courses 10 are available.
It is common practice to assume that clinical markers related to a disease follow a normal distribution among healthy subjects. If there is evidence against this assumption we could fit models to specify optimal transformations to normality, e.g. logarithmic or square root though this might still result in biased estimates of the upper or lower limits of the reference range depending on whether the distribution is right or left skewed. 9 Alternatively we could construct reference ranges under specific parametric assumptions different to normality, or follow a nonparametric procedure. The focus of this paper is on the construction of parametric and nonparametric reference ranges for a selected reference population based on a random sample from the population. The problems related to selecting a reference population have been discussed elsewhere. 6 A P (commonly set to 95%) reference range is a data-based interval that purports to include 100 P ð Þ% of the values in the population of interest. Their main point is to classify any future observations from the population which fall outside these intervals as atypical and thus may warrant further investigation.
Let FðÁÞ denote the continuous cumulative distribution function (cdf) of the population, and F À1 ðcÞ denote the ð100 cÞ-th percentile of the population for a given c 2 ð0; 1Þ. The interval F À1 ð1 À PÞ=2 ð Þ ; F À1 ð1 þ PÞ=2 ð Þ Â Ã contains exactly 100 P ð Þ% of the population and would be used as the P reference range had F been known. Since FðÁÞ is usually not known completely in real problems, the reference range has to be estimated from a random sample X 1 ; . . . ; X n from the population, i.e. X 1 . . . ; X n are independent random variables identically distributed FðÁÞ. Note that we follow the notation in Krishnamoorthy and Mathew 11 thus denoting the interval's content level by P instead of the commonly used 1 À a, and its confidence level by c.
When FðÁÞ is assumed to have a normal distribution Nðl; r 2 Þ with unknown mean l and unknown variance r 2 , we have F À1 ðcÞ ¼ l þ z c r where z c denotes the ð100 cÞ-th percentile of the standard normal distribution N(0, 1). When FðÁÞ is not assumed to have a parametric form, nonparametric (or distribution free) methods can be used. In this paper, both normal-based and nonparametric methods are considered.
As a reference range depends on the random sample, the proportion of the population contained in it is also random. Thus the question is 'which statistical intervals should be used as reference ranges?' In this article we argue that a P prediction interval, which continues to be used as a reference range in the literature, 6,12,13 is not fit for the purpose of interest since there is a substantial probability (due to the randomness in the sample) that the prediction interval contains less than 100 P ð Þ% of the population. We then argue that a ðP; cÞ tolerance interval, with confidence c 2 ð0; 1Þ set at a pre-specified large value, c ¼ 0:95 say, is valid as a reference range since it guarantees, with large confidence c due to the randomness in the sample, to contain 100 P ð Þ% of the population values. Several authors have proposed to use tolerance intervals as reference ranges. 5,14,15 With almost 80 years of research on tolerance intervals or regions, various parametric and nonparametric procedures are readily available for use as reference ranges.
The next two sections discuss reference ranges based on the normal distribution, and nonparametric reference ranges. They are followed by a section considering a numerical example, and a final one with concluding remarks.
2 Reference ranges based on the normal distribution

Reference ranges currently in use
Based on the sample, one reference range that has been widely used is the P prediction interval for a future observation Y from a population with Nðl; r 2 Þ distribution 6,12,13 is the ð100 dÞ-th percentile of the t distribution with degrees of freedom (df), ¼ n À 1, and c 1 ¼ t ð1þPÞ=2; ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi 1 þ 1=n p . A relevant guide on prediction intervals for reference regions is available, 7 and we note that the prediction interval RR 1 has also been called the P expectation tolerance interval. 16,17 Other reference ranges are based on estimators of the percentiles l AE z ð1þPÞ=2 r and include X þ c 3 S has the minimum variance among unbiased estimators of l þ z c r, and X þ c 4 S has minimum mean squared error among estimators of the form X þ c S where c is a constant. 12 One immediate question is whether these reference ranges RR i contain 100 P ð Þ% of the values in the population, which is the objective of a reference range. Note that the proportion of the population within the reference range RR i ¼ X AE c i S is given by where Y$Nðl; r 2 Þ and is independent of the sample X 1 ; . . . ; X n , Pr YjX 1 ;...;X n fÁg is the conditional probability of Y conditioning on the sample X 1 ; . . . ; X n , and UðÁÞ is the cdf of a N(0, 1) random variable. Hence the objective of a reference range is to have K i ! P. It is clear from equation (1) that K i is a random variable depending on the random sample via X and S so whether 'K i ! P' is also random. As a result, all we can hope is that the event ghas a large probability of occurrence. We note from equation (1) that K i increases as c i increases. Hence, among the RR i (1 i 4) given above, the one that has the largest c i contains the largest proportion of the population. Figure 1 compares the c i for given sample sizes n ¼ 2 : 150 and P ¼ 0.95. Clearly, c 1 is the largest among the c i (1 i 4), and so RR 1 contains the largest proportion of the population among the four reference ranges. We therefore investigate whether or not 'K 1 ! P' has a large probability to occur in order for RR 1 to be used as a reference range. First, note that where the equality in equation (2) results directly from the well-known conditional expectation formula, 18 and the equality in equation (3) follows from the fact that Y À of S=r which has the distribution ffiffiffiffiffiffiffiffiffi ffi v 2 = p , with v 2 denoting a chi-squared random variable with ¼ n À 1 df. That the probability in equation (3) is equal to P qualifies RR 1 as a P prediction interval for a future observation Y from the same population that the sample X 1 ; . . . ; X n is drawn.
Second, the distribution of K 1 can be studied by simulating a large number, R sim ¼ 1; 000; 000 say, of independent realisations of K 1 . Note from equation (1) that are statistically independent. From equation (4), K 1 can easily be simulated. For given P and n, R sim ¼ 1; 000; 000 replicas of K 1 are simulated, based on which the probability density function (pdf) of K 1 can be accurately approximated. In Figure 2, the kernel density estimate 19 of the pdf of K 1 based on the simulated K 1 values is plotted (by using the R package KernSmooth) 20 for n ¼ 20, 50, 100 and 150. Based on the simulated K 1 values, we approximated PrfK 1 < Pg by the proportion of the K 1 values that are less than P ¼ 0:95, which are given by 0.385, 0.429, 0.450 and 0.459 for n ¼ 20, 50, 100 and 150, respectively. Note that PrfK 1 < Pg is given in Figure 2 by the area under the pdf to the left of the vertical line at P ¼ 0:95.
Given equation (3), it can be shown by the delta method that ffiffi ffi n p K 1 À P ð Þtends when n ! 1 to a normal distribution with zero mean and finite variance. This is supported by Figure 2 which shows that the pdf of K 1 is getting closer to be symmetric and centered with decreasing variance at P as n increases. Note that n ¼ 150 is not large enough yet for the pdf of K 1 to converge to a normal pdf. From a brief simulation study we found that in order to achieve this satisfactorily the sample size must be very large indeed. Even for n ¼ 10, 000 the skewness and kurtosis values suggest a significant lack of normality. The coefficient of variation of K 1 for n ¼ 150 is 0.014, and becomes smaller than 0.01 for n ! 300, and is around 0.002 for n ¼ 10,000. This asymptotic normal distribution implies that PrfK 1 < Pg ! 0:5 as n ! 1, that is, the probability of the reference range RR 1 containing less than 100 P ð Þ% of the population is about 1/2 when the sample size is large. The argument above means that, due to the sample's randomness, using RR 1 as the reference range implies that there is a substantive probability, close to 50% when n is sufficiently large, that the reference range does not fulfill its objective of containing 100 P ð Þ% of the population. Its property E ðK 1 Þ ¼ P in equation (3) has the following interpretation. A large number of individuals, I say, collect independent samples, one each, and compute the corresponding reference ranges RR 1 based on their own samples. Then the proportions of the population contained in these I reference ranges, K 1;1 ; . . . ; K 1;I , are random values from the interval (0, 1) and form a random sample from the distribution of K 1 although some values could be very close to 0 and some values could be very close to 1. The property E ðK 1 Þ ¼ P merely says that ðK 1;1 þ Á Á Á þ K 1;I Þ=I is close to P when I is large. Hence, the proportion of the population that one particular reference range contains could be very small but this is compensated by some very large proportions of the population that some other individuals' reference ranges might contain in the sense that ðK 1;1 þ Á Á Á þ K 1;I Þ=I is close to P. This potential for compensation from other reference ranges is unlikely to offer any comfort for knowing that one's reference range has a substantial probability of containing less than 100 P ð Þ% of the population. It is clearly desirable to have a high confidence that our own reference range contains 100 P ð Þ% of the population. Hence RR 1 falls short on this ground and should not be used as a reference range.
The justification for using prediction intervals as reference ranges 5,13 is that exactly 100 P ð Þ% of the future observations from the population should fall within the prediction intervals. It is clear from the line of reasoning stated in the previous paragraphs that this argument is not valid. The inappropriateness of prediction regions when used as reference regions has also been noted in Sections 2.2 and 3.3 of Dong and Mathew. 15 In the next section we discuss tolerance intervals since several authors 5,14,15 have proposed to use them as reference ranges. For example it has been stated that 'it would seem that the statistical tolerance interval is what clinical chemists have in mind when they speak of a reference range derived from a sample of individuals representing some defined population' 5 (p. 55).

Tolerance intervals
A tolerance interval with content level P is a data-based random interval constructed to contain 100 P ð Þ% of the population with a pre-specified (large) confidence level c about the randomness in the sample. 11,16,17,[21][22][23] Specifically, a ðP; cÞ tolerance interval is given by 11 where the critical constant c 5 ¼ c 5 ðP; c; nÞ is chosen such that where the random variables Z and v 2 in equation (5) are the same as those in equation (4). The R package tolerance 24,25 can be used to compute c 5 . Figure 3 compares c 1 and c 5 for given sample sizes n ¼ 2 : 150 with P ¼ 0.95 and c ¼ 0:90; 0:95 f g . It is clear from Figure 3 that c 5 is considerably larger than c 1 in order that RR 5 contains 100 P ð Þ% of the population with a pre-specified large confidence c about the randomness in the sample. Also, as expected, c 5 increases with c as seen in Figure 3.

Equal-tailed tolerance intervals
The tolerance interval RR 5 contains 100 P ð Þ% of the population with a pre-specified (large) confidence c about the randomness in the sample. But the proportion P of the population contained in RR 5 may not be the central ð Þ% interval of the population. If we insist that a reference range should contain that central proportion of the population, i.e. ½l À z ð1þPÞ=2 r; l þ z ð1þPÞ=2 r with pre-specified confidence c about the randomness in the sample, then we should use the following interval as the reference range where the critical constant c 6 ¼ c 6 ðP; c; nÞ is chosen such that Pr X À c 6 S < l À z ð1þPÞ=2 r and l þ z ð1þPÞ=2 r < X þ c 6 S È É ¼ c This interval is called the equal-tailed or central ðP; cÞ tolerance interval. 15 A formula for values of c 6 is available 11 and can be computed using the function K.factor of the R package tolerance. 24,25 This interval can be viewed as a c confidence simultaneous lower confidence bound on quantile l À z ð1þPÞ=2 r and upper confidence bound on quantile l þ z ð1þPÞ=2 r. 26 It is clear that comprising the central 100 P ð Þ% of the population ½l À z ð1þPÞ=2 r; l þ z ð1þPÞ=2 r implies containing 100 P ð Þ% of the population. Hence the equal-tailed RR 6 satisfies a more stringent requirement than RR 5 and, as a result, c 6 is larger than c 5 . Figure 4 compares c 5 and c 6 for given sample sizes n ¼ 2 : 150 with P ¼ 0.95 and confidence c ¼ 0:90; 0:95; 0:99 f g . It is clear from Figure 4 that c 6 > c 5 , as expected. Our view is that the ðP; cÞ tolerance interval should be used as the reference range since its form X AE c 5 S is centered at X, mimicking the form of the equal-tailed tolerance interval l AE c 6 r, and with a large confidence c it does contain 100 P ð Þ% of the population. Only if we specifically require the reference range to contain the central 100 P ð Þ% of the population, l AE z ð1þPÞ=2 r, then the equal-tailed ðP; cÞ tolerance interval should be used; otherwise it is unnecessarily wider and flags as atypical fewer individuals than the ðP; cÞ tolerance interval.

Nonparametric prediction intervals
When FðÁÞ is not assumed to have a specific form, nonparametric reference ranges can be considered and are based on the order statistics X ½1 < . . . < X ½n of the sample X 1 ; . . . ; X n , and the sample quantiles have been used to estimate the population quantiles F À1 1 À P ð Þ=2 À Á and F À1 1 þ PÞ=2 ð Þ . 6 In what follows, j ðpÞ and j ðtÞ are indices used for prediction and tolerance intervals, respectively. Let j ðpÞ , with 1 j ðpÞ n=2, be the largest natural number such that where Y is a future observation from the population FðÁÞ independent of the random sample X 1 ; . . . ; X n as before.
Using the well-known facts that U 1 ¼ FðX 1 Þ; . . . ; U n ¼ FðX n Þ are independent, each having a uniform distribution on the interval (0, 1), and that U ½k ¼ FðX ½k Þ is the k-th order statistic of U 1 ; . . . ; U n and has a beta distribution with parameters k and n À k þ 1, the probability in (6) is equal 16 to n þ 1 À 2 j ðpÞ À Á =ðn þ 1Þ. Hence the constraint on j ðpÞ required in equation (6) gives where hai denotes the integer part of a. This leads to use the nonparametric prediction interval RR 7 ¼ X ½j ðpÞ ; X ½nÀj ðpÞ þ1 À Á as a reference range. An interesting remark is that X ½j ðpÞ and X ½nÀj ðpÞ þ1 are consistent point estimators of the population quantiles F À1 ð1 À PÞ=2 ð Þand F À1 ð1 þ PÞ=2 ð Þ , respectively. The proportion of the population contained in RR 7 is given by which is a random variable. The important question is whether the probability that this proportion is at least P, given by Pr U ½nÀj ðpÞ þ1 À U ½j ðpÞ ! P È É is sufficiently large to qualify the P prediction interval RR 7 ¼ X ½j ðpÞ ; X ½nÀj ðpÞ þ1 À Á as a reference range. By noting that U ½nÀj ðpÞ þ1 À U ½j ðpÞ and U ½nÀ2j ðpÞ þ1 follow the same beta distribution B nÀ2j ðpÞ þ1;2j ðpÞ , Tukey's equivalence blocks result 27 directly implies that Pr U ½nÀj ðpÞ þ1 À U ½j ðpÞ ! P È É ¼ 1 À B nÀ2j ðpÞ þ1;2j ðpÞ ðPÞ (10) where B nÀ2j ðpÞ þ1;2j ðpÞ ðÁÞ denotes the cdf of the beta distribution with parameters n À 2 j ðpÞ þ 1 and 2 j ðpÞ . This probability can be easily calculated using the function pbeta in R. Note that, as n ! 1, the beta distribution B nÀ2j ðpÞ þ1;2j ðpÞ converges to a normal distribution with mean P thus the probability in equation (10) approaches 0.5 as n ! 1. Figure 5 plots this probability against n for P ¼ 0:90; 0:95; 0:99 f g . The plots are saw-tooth shaped due to the discreetness of n and j ðpÞ . It is clear from the figure that this probability can be substantially smaller than P, and approaches 0.5 as n is large as expected from the asymptotic normal distribution pointed out above. This shows that the nonparametric prediction interval has a substantial probability, close to 0.5 when n is large, of containing less than 100 P ð Þ% of the population values. Hence, this nonparametric prediction interval should not be used as a reference range for the same reason as the prediction interval based on the normal distribution.

Nonparametric tolerance intervals
A nonparametric tolerance interval is constructed to contain 100 P ð Þ% of the population with a pre-specified (large) confidence c about the randomness in the sample. Consider the following nonparametric tolerance interval 21 RR 8 ¼ ðX ½j ðtÞ ; X ½nÀj ðtÞ þ1 Þ where j ðtÞ satisfies that 1 j ðtÞ n=2 should be the largest natural number such that the proportion of the population contained in RR 8 , given by ...;X n Y 2 ðX ½j ðtÞ ; X ½nÀj ðtÞ þ1 Þ È É ¼ U ½nÀj ðtÞ þ1 À U ½j ðtÞ following similar lines as K 7 in equation (8), is at least P with probability c about the randomness in the sample X 1 ; . . . ; X n . It follows therefore from equation (10) that 1 j ðtÞ n=2 should be the largest natural number that satisfies Pr U ½nÀj ðtÞ þ1 À U ½j ðtÞ ! P È É ¼ 1 À B nÀ2j ðtÞ þ1;2j ðtÞ ðPÞ ! c For given n, P and c, j ðtÞ can be easily computed by a direct search over the natural numbers in the range from 1 to n=2. Note that if the sample size n is too small, then the existence of j ðtÞ is not guaranteed unless n satisfies 11 1 À n P nÀ1 À ðn À 1Þ P n À Á ! c The equal-tailed or central nonparametric tolerance intervals can be constructed in a similar way. Our view is that a ðP; cÞ nonparametric tolerance interval is pertinent as a reference range similar to the normal distribution case. Hence we do not go into the details about the equal-tailed nonparametric tolerance intervals to save space. Figure 6 compares j ðtÞ and j ðpÞ for given sample sizes n with P ¼ 0.95 and c ¼ 0:90; 0:95; 0:99 f g . It is clear from Figure 6 that j ðtÞ is considerably smaller than j ðpÞ , and so RR 8 is wider than RR 7 , in order that RR 8 contains 100 P ð Þ% of the population with a pre-specified large confidence c about the randomness in the sample. Also, as expected, j ðtÞ decreases as c increases.

Example
A random sample of n ¼ 210 observations on fasting plasma glucose is taken from the population of interest. The data and the R code for all the computations in this paper are available at http://www.personal.soton.ac.uk/ wl/RefRange/.
Suppose that the usual normality tests 28 show that it is reasonable to assume the population has a normal distribution. The sample mean and standard deviation are computed to be X ¼ 5:31 and S ¼ 0.41 (in unit mmol/ L). If we use the prediction interval as the reference range, then it is given by Note, however, as pointed out above, that the probability of the prediction interval containing less than 100 P ð Þ% of the population can be substantial and is computed to be 47%. So there is a 47% probability that the interval does not do what it purports to do: containing 100 P ð Þ% of the population. If we use the ðP; cÞ tolerance interval as the reference range, with c ¼ 0:95, then it is given by This interval is wider than the prediction interval. But, as we pointed out, the tolerance interval does contain 100 P ð Þ% of the population with probability c ¼ 0:95. Therefore, any future observations falling outside this interval can be regarded as atypical and should be considered for further investigation.
While the tolerance interval above has a confidence c ¼ 95% of containing 100 P ð Þ% of the population, it has a less than c ¼ 95% probability of containing the central 100 P ð Þ% of the population, l AE z ð1þPÞ=2 r. This probability is computed to be 86%.
In order to have a c ¼ 95% probability of containing the central 100 P ð Þ% of the population, l AE z ð1þPÞ=2 r, we can use the equal-tailed ðP; cÞ tolerance interval, which is given by The confidence that this equal-tailed tolerance interval contains 100 P ð Þ% of the population is computed to be 99%, which is much larger than c ¼ 95%. Hence, with a 99% probability, the equal-tailed tolerance interval contains 100 P ð Þ% of the population. Furthermore, we estimated that the equal-tailed tolerance interval X AE 2:21 S is the ð0:957; cÞ tolerance interval, that is, the interval contains 95.7% of the population with confidence c ¼ 95%.
Now suppose that the distribution of the population cannot be assumed to be normal. Then nonparametric reference ranges should be used. If we use the prediction interval as the reference range, then it is given by with j ðpÞ ¼ 5. Note, however, as we have pointed out, that the probability of the prediction interval containing less than 100 P ð Þ% of the population can be substantial and is computed to be 39%. So there is a 39% probability that the interval does not do what it purports to do: containing 100 P ð Þ% of the population. If we use the ðP; cÞ nonparametric tolerance interval as the reference range, with c ¼ 0:95, then it is given by RR 8 ¼ ½X ½3 ; X ½nÀ3þ1 ¼ ½X ½3 ; X ½208 ¼ ½4:38; 6:27 with j ðtÞ ¼ 3. This tolerance interval is wider than the nonparametric prediction interval but, as we pointed out, it does contain 100 P ð Þ% of the population with 95% confidence. Therefore, any future observations falling outside this interval can be regarded as atypical and should be considered for further investigation.
Finally, we note that nonparametric intervals are usually wider than the corresponding parametric ones since they require fewer assumptions than the parametric model.

Conclusions
The objective of a reference range is to contain a pre-specified large content level 100 P ð Þ% of the population with c confidence level, so that a future observation falling outside the reference range is regarded as atypical and considered for further investigation. This procedure should be useful as part of screening programmes, whose aim is to identify subjects at sufficient risk of a specific disorder who may benefit from further investigation or direct preventive action to avoid death or disability and to improve their quality of life. 29 Since a reference range depends on the random sample, the event 'a reference range contains 100 P ð Þ% of the population' is also random and so we can never be certain that a reference range contains 100 P ð Þ% of the population. All we can hope for is that the event 'a reference range contains 100 P ð Þ% of the population' occurs with a large probability, c.
Based on this premise, we have argued that the prediction interval is not suitable as a reference range since there is a substantial probability, close to 50% when n is large, that the prediction interval contains less than 100 P ð Þ% of the population. In contrast, a ðP; cÞ tolerance interval is designed to contain 100 P ð Þ% of the population with a pre-specified large confidence c so it is eminently adequate as a reference range.
Tolerance intervals or regions have been studied by many statisticians since the 1940s. Various parametric and nonparametric procedures are readily available for use as reference ranges or reference regions. 11,16,17,24 Finally, we note that there is some work on constructing reference ranges specifically assuming that the clinical marker follows a log-normal distribution, 30 and on sample size calculation for reference ranges, 31,32 and tolerance intervals. [33][34][35] These aspects, however interesting, fall beyond the scope of our paper.