Predicting the Finite Population Distribution Function under a Multilevel Model

Chambers and Dunstan proposed a model-based predictor of the population distribution function that makes use of auxiliary population information under a general sampling design. Subsequently, Rao, Kovar, and Mantel proposed design-based ratio and difference predictors of the population distribution function that also use this auxiliary information. Both predictors (CD and RKM) assume a single level model for the target population. In this article we develop predictors of the finite population distribution function for a population that follows a multilevel model. These new predictors use the same smearing approach underpinning the CD predictor. We compare our new predictors with the CD and RKM predictors via design-based simulation, and show that they perform better than these single level predictors when there is significant intra-cluster correlation. The performances of these new two level predictors are also examined via an empirical study based on data from a large-scale UK business survey aimed at estimating the distribution of hourly pay rates. AMS Subject Classification: Primary 62G30, Secondary 62G32

predictor, which assumes that the target population follows a single level model with heteroskedastic errors, is hereafter referred to as the CD predictor.It is based on combining the smearing approach of Duan [6] with a model for the finite population distribution function of Y , and is model-consistent for a finite population distribution function of this variable.These authors also show that the proposed model-based predictor provides significant gains over competing design-based estimators when there is auxiliary population information that is linearly related to the target variable of interest.Following on from this, Rao et al. [9] define design-based predictors of the finite population distribution function (hereafter referred to as RKM predictors) based on ratio and difference approaches under a general sampling design and show that the RKM predictor is preferable to the CD predictor under model misspecification.However, both the CD and RKM predictors assume an uncorrelated data structure, in the sense that they do not take into account the presence of intra-cluster correlation in the population.As a consequence it is expected that they will be inefficient when there is intra-cluster correlation.Chambers and Clark [3] argue that without the simultaneous use of auxiliary information, there is little to be gained from just assuming a two level structure (more specifically a clustered structure) when predicting the value of a finite population distribution function.We therefore propose an alternative framework for predicting the finite population distribution function in the presence of clustering based on assuming that the population of interest is a realization of a super-population that can be modelled via a two level linear regression structure, and show that applying the Chambers and Dunstan [4] approach in this situation leads to more efficient and robust predictors of the finite population distribution function of Y .
We note that our framework also allows us to calculate small-area estimates for quantities that can be defined in terms of functionals of alternative smearing-type estimators of the small-area finite population distribution function, for example using the outlier-resistant finite population distribution function prediction approach developed in Tzavidis et al. [10] In what follows we use simulation and an empirical study to compare the performance of our proposal with various potential predictors, including CD and RKM, under the two level super-population model.
Let U be a finite population of size N consisting of D domains each with N i sub-population units and let y i j (i = 1, . . ., D; j = 1, . . ., N i ) denote the value of the response variable Y for j th unit of i th domain.The finite population distribution function of Y is then defined as where �(e) is a step function such that �(e) = 1 if e ≥ 0 otherwise �(e) = 0. Let X denotes a set of p + 1 auxiliary variables (including the intercept) related to Y where the corresponding realizations of X are known for all units of the population.Furthermore, assume that the response variable Y follows a two level super-population model where β is a vector of ( p + 1) unknown regression parameters, σ 2 u and σ 2 � are unknown domain-specific and unit-specific random error variances respectively, typically referred to as variance components, and υ(x i j ) is a known and strictly positive function of x i j .Note that, υ(x i j ) = 1 if the unit-specific random error is homoskedastic.

Model-based Approach
The distribution function F N (t) can be decomposed into sampled (s) and non-sampled (r ) parts as where s i and r i indicate the set of sampled and non-sampled units of domain i, respectively.If we assume that the variance components σ 2 u and σ 2 � are known, a predictor of F N (t) can be defined following the smearing approach described in Chambers and Dunstan [4] , from now on referred to as a global smearing approach.This leads to the GSA predictor where where β is the vector of the estimates of the regression coefficients and u = {ũ 1 , .. . . ., ũ D } is the best linear unbiased predictor (BLUP) of u = {u 1 , .. . . ., u D }.An alternative to global smearing is local smearing.Here domain-specific residuals are used to define FG r (t, β, u).We refer to this as local smearing below, and it leads to the LSA predictor

.3)
One needs to know the variance components σ 2 u and σ 2 � before one can calculate the BLUP u used in (2.2) and (2.3).In practice these parameters will not be known.However, they can be estimated from the sample data using maximum likelihood (ML) or restricted maximum likelihood (REML), and the resulting estimates then plugged into (2.2) and (2.3), leading to the empirical versions of the GSA and LSA predictors respectively: and where û is empirical version of ũ, typically referred to as the empirical BLUP, or EBLUP, of u.Note that under homoskedasticity, ûl can be written as ûl = 3) are defined by conditioning on the vector u of domain-specific random errors.This allows us to treat the conditional residuals y i j − x T i j β − u i as independent and identically distributed.An unconditional version of these predictors can also be developed following the specification of the usual CD predictor.This is based on the estimates of the regression coefficient β and the marginal unit level error variances of the unit level errors e i j = u i + � i j that are obtained after fitting a two level regression model to the sample data.In this case the global smearing version of the predictor FG r (.) can be defined as follows: where υ * (x i j ) = (σ 2 u + σ 2 � υ(x i j ))/σ 2 u .Similarly, the local smearing version of this predictor is then: Clearly empirical versions of these unconditional CD-type predictors are easily written down once we have estimates of the variance components σ 2 u and σ 2 � .We denote these empirical versions by MGSA (2.6) and MLSA (2.7) below.

Monte Carlo Approximation to Smearing
Predictors like (2.4) and (2.5) can be computationally extensive for realistic large-scale application like poverty mapping because all the sample residuals are used in the smearing method.In order to speed up calculation in this situation, Marchetti et al. [8] propose an alternative approach to implementing smearing based on Monte Carlo (MC) simulation.Following this approach, a Monte-Carlo approximation to the value of the non-sample part of F * N (t) in the EGSA predictor is where the γ (b) lk are random draws with replacement from the sample residuals γij = υ(x i j ) −1 (y i j − x T i j β − ûi ), i = 1, . . ., D and j = 1, . . ., s i .Similarly, the local smearing predictor ELSA can be approximated by drawing the γ (b) lk from the corresponding domain-specific sample residuals.

Mean Squared Error Estimation via Bootstrap
We propose two different bootstrap methods for estimating the mean squared error (MSE) of the EGSA/ELSA predictors.They are based on the non-parametric bootstrap developed by Marchetti et al. [8] and the two level block-bootstrap procedure developed by Chambers and Chandra. [1]We refer to the non-parametric bootstrap procedure of Marchetti et al. [8] as MTP, while the block bootstrap procedure of Chambers and Chandra [1] is referred to as CC.The steps of the MTP bootstrap procedure are as follows: Step 1 Generate bootstrap population values using the estimated errors as i and ê * i j are randomly sampled with replacement from the corresponding rescaled vectors of estimated random errors ûi and êij , i = 1, . . ., D and j = 1, . . ., s i , and then calculate the bootstrap domain-specific target parameters F MT P N (t) from these bootstrap population values.
Step 2 Extract a sample s * of size n from the bootstrap population using the same sample design as that used to obtain the original sample.(t) be its corresponding predicted value.
Step 5 The MTP bootstrap estimator of the MSE of F N (t) is then The CC-based MSE estimation method follows the same steps except for generation of the bootstrap population.Under this approach, the bootstrap population is generated as follows: Step 1 Calculate sample residuals as r i j = y i j − x t i j β − ûi , i = 1 . . .D, j = 1 . . .n i and then calculate estimated level two residuals as group averages ri = n i j=1 r i j of the r i j for the D groups and estimated level one residuals as r 1 i j = r i j − ri .

Step 2 Obtain û *
i by sampling independently with replacement from the ri , i = 1 . . .D.
Step 3 Obtain êij for all the individuals belong to the i th group by sampling independently with replacement from the set of r 1 i j s defined by the i th group.The CC bootstrap procedure then replicates steps 2-5 of the MTP procedure, leading to the CC bootstrap estimator of the MSE of F N (t):

Numerical Evaluations
This section uses design-based simulation to illustrate the performances of the EGSA and ELSA smearing predictors of the finite population distribution function (2.1).These predictors are compared with their marginal versions MGSA (2.6) and MLSA (2.7) as well as with the single level predictors proposed in Chambers and Dunstan [4] and Rao et al., [9] and with the unweighted (Dir) and the sample weighted (WDir) direct estimators given in Chambers and Clark [3] .In the two level CD-type estimators EGSA and ELSA ûi is assumed to be an asymptotically consistent predictor of the group-specific random effect.However, as is well known, these predictors are subject to shrinkage.As an alternative we therefore also consider implementations of EGSA and ELSA based on unshrunken versions of the predictors of the group-specific random effects Chambers et al. [2] These are referred to as UGSA and ULSA respectively below.The aim here is to reduce bias due to the shrinkage effect in EGSA and ELSA Chambers et al. [2] The simulation is based on a population of 338 sugar cane farms corresponding to a sample of these farms obtained in a 1982 survey of the Queensland sugar cane industry, for detail see in Chambers and Dunstan [4] .The unplanned domains in this case are the four cane-growing regions of Queensland, with three response variables used for the simulation: (a) total cane harvested; (b) gross value of cane; (c) total farm expenditure.The auxiliary variable x is the measure of size, the assigned area for cane planting.It is assumed that υ(x i j ) in (1.1) is x 1/2 i j .We tested the null hypothesis of zero between region variation by computing the conditional-AIC (cAIC) value proposed in Vaida and Blanchard [11] and compared this to the AIC value for a linear regression model without random effects.The cAIC and AIC values for the three response variables are shown in Table 1.The cAIC value for the linear mixed model is always smaller than the AIC value for the linear regression model.This indicates that the linear mixed model represents a better fit than the linear regression model that does not include random effects.A linear regression model that included region as a fixed effect was also fitted and the traditional CD estimator based on this model was then used to predict the finite population distribution functions of the three response variables.These are denoted by CD1 below.We expect CD1 to perform better than the CD predictor that does not account for between region variability.The AIC values for the regression models with region as a fixed effect are close to the cAIC values for the corresponding random effects fits.
For each response variable, S = 500 samples of size n = 30 were drawn from the population under three different sampling designs: simple random sampling, stratified random sampling with 2 strata and proportional allocation, and stratified random sampling with 2 strata and optimal Neyman allocation based on x following the approach described in Chambers and Dunstan [4] .The stratum boundaries were the same for both methods of stratified sampling and was chosen to make the total measure of size (x) for each stratum as nearly equal as possible.The target parameter was the value of F N (t) for t equal to the 10, 25, 50, 75 and 90 percentiles of the finite population distribution of each of the three study variables.The performances of the various predictors in the simulation study were evaluated by computing their Mean Integrated Bias (MIB), Mean Integrated Absolute Error (MIAE) and Mean Integrated Mean Squared Error (MIMSE).The first two of these indicators measure bias related performance while the third one measures variability related performance.In order to assess the performance of the MTP and CC estimators of MSE, the simulation procedure was repeated using the same S = 500 samples, each with B = 500 bootstraps.The performances of the MSE estimators were also assessed using the MIB, MIAE and MIMSE indicators.Formally the MIB, MIAE and MIMSE indicators measure performance by averaging over samples and target percentiles as follows: where t denotes the 10, 25, 50, 75 and 90 percentile values, with T = 5.

Simulation Results
The simulation performances of the different distribution function predictors for the response variables total cane harvested, gross value of cane and total farm expenditure are set out in Table 2 for simple random sampling and stratified sampling with proportional allocation and optimal allocation respectively.These results show a mixed picture for the MIB measure, with design-based methods (WDir and RKM) performing best in 5 scenarios and model-based methods (MLSA, ULSA, ELSA and CD) performing best in the remaining 4 scenarios.However, this changes when we consider performances with respect to the MIAE and MIMSE measures.Here it is clear that the EGSA predictor performs best overall.In particular, it is outright best in 6 scenarios, second best after CD1 in two scenarios and equal best with CD in the remaining scenario.Although this seems a surprising result at first, given that CD1 is based on fixed region effects while EGSA is based on random region effects, it can be explained by noting that CD1 assumes homoskedasticity in model errors while EGSA allows for heteroskedasticity in level one errors.The importance of allowing for heteroskedasticity in model specification when predicting the value of a finite population distribution function also explains why CD and CD1 perform similarly, even though the the latter allows for average regional differences.It also emphasizes the fact that although the cAIC values generated by a two level model with heteroskedasticity and AIC values generated by a homoskedastic single level model with region as factor seem close (see Table 1), the CD1 predictor based on the latter model can perform worse than conditionally specified predictors like EGSA and ELSA based on the former model that also allow for heteroskedasticity.The simulation performances of the MTP and CC based MSE estimators of the EGSA and ELSA predictors are set out in Table 3.Interestingly, for both simple random sampling and for stratified sampling with proportional allocation, the variations in performance shown here depend on the variable of interest, rather than the performance measure used, or the predictor whose MSE is being estimated.In particular, for the variables total cane harvested and total farm expenditure the MTP method outperforms the CC method for both EGSA and ELSA, while the reverse holds for gross value of cane, where the CC method outperforms the MTP method for EGSA and ELSA.For stratified random sampling with optimal allocation, there does not appear to be any particular trend in these results.The MTP method works well for EGSA with total farm expenditure, while the CC works well with ELSA for gross value of cane.There appears to be little to choose between the two methods of MSE estimation for the variable total cane harvested.

An Empirical Study: UK Business Survey Data
In this section we use data from the 2002 UK New Earnings Survey (NES2002) in an empirical study of the distribution of hourly pay rates (in pence units).Among N = 142,999 employees surveyed in NES2002, n = 71,481 were able to provide exact hourly pay rates (Y ) while the remaining 71,518 surveyed employees could not provide their exact hourly pay rate.However, an implicit hourly pay rate (X 1 ) can be calculated for all surveyed employees based on their total wage and total working hours.Here we assume that the n = 71,481 employees for whom we have both an exact hourly rate and implicit hourly rate constitute our sample, while the remaining N − n = 71,518 employees in the NES2002 sample constitute the non-sampled part of our target population of N = 142,999 employees.Our domains of interest for this study are the 76 occupation groups covered by NES2002, with employee implicit hourly pay rate (X 1 ), sex (X 2 ), and age-group (X 3 ) used as explanatory variables in fitting a regression model to Y .Preliminary analysis shows that the quantiles of X 1 are higher in the non-sampled employees (X 1r ) compared to the sampled employees (X 1s ) and all employees (X 1U ).Since Y and X 1 are highly correlated, the sample quantiles of Y are close to those of X 1s as can be seen in Table 4.In order to calculate sampling weights, the sample was split into 220 strata based on the deciles of X 1 , X 2 , and X 3 (11 age-groups).These sampling weights were then used to estimate the pay rate distribution function using the weighted direct (WDir) estimator following Chambers and Clark [3] .Finally, we see from Figure 1 that the stratum-specific and occupation-specific means of Y and X 1 (denoted by X ) vary significantly in the survey data.
Linear models, with and without domain random effects, were fitted to Y using X 1 , X 2 , X 3 , and the interactions between X 2 and X 3 as fixed effects.The intra-cluster (occupation group is the cluster here) correlation was calculated to be 24%.Assuming homoskedastic level one and level two errors, the CD, RKM, EGSA, and ELSA predictors were used to obtain the estimated proportion of employees with hourly pay rates below t = 500, 600, 700, 800, 900 and 1,000 pence.The MTP based MSE estimation method was used to estimate the MSEs of these estimated proportions, including those based on the CD and RKM approaches.In these cases, the bootstrap procedure was implemented by generating a bootstrap population based on marginal residuals, as y * i j = x t i j β + ê * i j where ê * i j was randomly selected with replacement from the sample residuals êij = y i j − ŷij .Table 5 shows that the estimates obtained using the unweighted (Dir) and weighted (WDir) direct estimators are higher than those obtained using the other methods.This was expected given that the quantiles of X 1s are lower than those of X 1r (see Table 4).All the other estimators appropriately adjust for this issue, with CD making the smallest adjustments.The RKM, EGSA and ELSA predictors lead to very similar estimates in all cases.As far as MSE estimation is concerned, the CD and RKM predictors show very poor performance while the EGSA and ELSA predictors lead to significantly lower MSE estimates (see Table 5).It is of some concern that the MSE estimates for CD and RKM were considerably worse than those for the direct estimators.

Concluding Remarks
In this article we extend the Chambers and Dunstan [4] approach to prediction of a finite population distribution function given two level clustered population data.Our simulation results and our empirical study provide evidence that two level smearing-based extensions of this approach (EGSA and ELSA) perform well when the population data exhibit significant intra-cluster correlation.We also develop two bootstrap methods for estimating the MSE of the predictors that we propose.These clearly show that ignoring intra-cluster correlation (a feature of single level predictors like CD and RKM) can lead to serious error when predicting the value of the population distribution function.The conditional two level smearing predictors EGSA and ELSA that we develop outperform their unconditional alternatives MGSA and MLSA as well as conditional versions based on unshrunken random effects (UGSA and ULSA).Some important advantages of the two level prediction methods that we propose in this article are that they can be easily extended to prediction of small area distribution functions and associated functionals (e.g., small area poverty indicators) and can also be easily extended to outlier robust inference following the approach described in Welsh and Ronchetti. [12]In further research, we aim to develop analytical MSE estimators of the the EGSA and ELSA predictors in order to further evaluate the performances of the bootstrap-based MSE estimators introduced in this article.We also note that non-parametric versions of the EGSA and ELSA predictors can be developed following Chambers et al., [5] and also Kuk and Welsh. [7]

Step 3
Calculate the bootstrap values of the EGSA/ELSA predictors F * MT P N (t) based on the bootstrap sample data.Step 4 Repeat steps 1 − 3 B times.In the b th bootstrap replication, let F MT P(b) N (t) be the quantity of interest and let F * MT P(b) N

Figure 1 .
Figure 1.Stratum and Occupation (group) Specific Mean Hourly Pay Rate ( Ȳ ) and Implicit Hourly Pay Rate ( X) for NES2002 Data.

Table 1 .
cAIC and AIC Values Computed for the three Response Variables Available in the Queensland Sugar Cane Industry Survey Data.

Table 2 .
Simulation Values of MIB, MIAE and MIMSE for the Direct (Dir) and Sample Weighted Direct (WDir) Estimators, the Single Level Model-based Predictors CD and CD1, the Single Level Model-assisted Estimator RKM, and the Two Level Model-based Predictors MGSA & MLSA (Marginal CD Using Shrunken Random Effects), UGSA & ULSA (Conditional CD Using Unshrunken Random Effects) and EGSA & ELSA (Conditional CD Using Shrunken Random Effects) Under (i) Simple Random Sampling, (ii) Stratified Random Sampling with 2 Strata and Proportional Allocation, and (iii) Stratified Random Sampling with 2 Strata and Optimal Allocation.The Best Performances for Each Method of Sampling, Variable of Interest, and Performance Measure are Highlighted in Bold.

Table 3 .
Simulation Values of MIB, MIAE and MIMSE for the MTP and CC Based Estimators of the Mean Squared Error (MSE) of the Global (EGSA.MTP & EGSA.CC) and Local (ELSA.MTP & ELSA.CC) Smearing Based Two Level CD Predictors Under (i) Simple Random Sampling, (ii) Stratified Random Sampling with 2 Strata and Proportional Allocation, and (iii) Stratified Random Sampling with 2 Strata and Optimal Allocation.The Best Performing Predictor/MSE Estimator Combination is Highlighted in Bold.

Table 4 .
Distributions of Employee Hourly Pay Rate (Y ) and Implicit Hourly Pay Rate (X 1 ) by Sample (s), Non-sample (r ) and Population (U ) for NES2002 Data.

Table 5 .
Estimated values for the Distribution Function of Hourly Wage Rates (in Pence) Along with the Estimated Mean Squared Errors (MSE, ×10 4 ) Using Direct (Dir), Weighted Direct (WDir), CD, RKM, Mixed Model CD with Global (EGSA) and Local (ELSA) Smearing Approaches Applied to NES2002 Data.