Researchers studying income inequality, economic segregation, and other subjects must often rely on grouped data—that is, data in which thousands or millions of observations have been reduced to counts of units by specified income brackets. The distribution of households within the brackets is unknown, and highest incomes are often included in an open-ended top bracket, such as “$200,000 and above.” Common approaches to this estimation problem include calculating midpoint estimators with an assumed Pareto distribution in the top bracket and fitting a flexible multiple-parameter distribution to the data. The authors describe a new method, mean-constrained integration over brackets (MCIB), that is far more accurate than those methods using only the bracket counts and the overall mean of the data. On the basis of an analysis of 297 metropolitan areas, MCIB produces estimates of the standard deviation, Gini coefficient, and Theil index that are correlated at 0.997, 0.998, and 0.991, respectively, with the parameters calculated from the underlying individual record data. Similar levels of accuracy are obtained for percentiles of the distribution and the shares of income by quintiles of the distribution. The technique can easily be extended to other distributional parameters and inequality statistics.

Researchers studying inequality and economic segregation must often deal with a paucity of data in the upper tail of the income distribution, which is unfortunate given that the correct characterization of the highest incomes is central to many important measures. Although individual-level data sets are best, even when they are available, they may not correspond to the geographic areas of interest or they may have limited geographic information to protect confidentiality. In the context of the United States, Public Use Microdata Sample (PUMS) data are released for geographic areas that do not always correspond to other important geographies, such as counties or metropolitan areas. When the PUMS data do match well in a given year, changes in metropolitan area boundaries may not allow the construction of a panel of geographic areas with consistent boundaries, inhibiting longitudinal research.

For these reasons, it is often necessary to rely on grouped data, in which individual, family, or household incomes are presented in terms of a series of brackets of varying width. Typically, the mean and median are presented as well. Surprisingly, the variance and/or standard deviation of income, let alone more advanced inequality measures, are rarely presented. Thus, there is a need for a reliable methodology for estimating the variance of income and other inequality statistics, such as the Gini coefficient and the Theil index, from grouped data.

In grouped data, the information provided is the number of observations between lower and upper income limits, such as “$20,000 to $25,000” or “$50,000 to $75,000.” Grouped data pose three main challenges. First, the exact arrangement of the individual observations within the brackets is unknown and could have large effects on summary statistics and inequality measures. The fewer in number and the wider the brackets are, the worse this problem becomes. Second, the brackets tend to be of different widths and are often quite wide in the upper tail of the distribution, providing less precise information about exactly those observations that contribute disproportionately to the variance and many measures of inequality. Third, the top and bottom income brackets are usually open ended, such as “$10,000 and below” and “$200,000 and above,” increasing the difficulty of making inferences about the underlying income values in those brackets. The open-ended top bracket, in particular, presents a daunting challenge given the high and growing inequality in the United States and other nations (Atkinson, Piketty, and Saez 2011; Piketty 2014; Piketty and Saez 2003). Nor is the problem limited to the study of income distributions. For example, educational achievement scores are presented in grouped form to protect confidentiality (Ho and Reardon 2012), and historical data on housing values may be presented in tables by categories, with no access to the original underlying data (Soltow 1981).

In this article, we propose mean-constrained integration over brackets (MCIB), a new method for estimating distributional parameters and inequality statistics from grouped data, which is both less biased and more precise than previously available methods. We illustrate the method using data on household income for 297 U.S. metropolitan areas. The performance of MCIB is compared with that of the traditional midpoint estimator (Henson 1967), the robust Pareto midpoint estimator (RPME), and the multimodel generalized β estimator (MGBE) (von Hippel, Scarpino, and Holas 2016). MCIB improves on prior methods by making more complete use of the available information. First, given that the bracket amounts are post hoc inventions of a statistical agency, it is reasonable to assume that the densities change in a roughly continuous way within brackets and across the bracket boundaries. Second, incomes in the upper tail of the distribution are assumed to follow a Pareto distribution, as in prior research, but the crucial shape parameter of that distribution is estimated more accurately than in the midpoint approach. Third, the parameters of the income distribution are estimated by computing integrals of desired quantities over the income range spanned by each bracket, rather than by assuming that all observations within a bracket are equal to the midpoint. MCIB also improves upon distribution-fitting methods by analyzing brackets separately rather than constraining the distribution as a whole to follow a continuous unimodal distribution.

In Section 2 we describe characteristics of metropolitan income distributions using household-level data from the 2011 PUMS of the American Community Survey (ACS) (Ruggles et al. 2015), a 5 percent sample spanning the years 2007 to 2011. In Section 3 we review the literature on the existing methods to recover the original parameters from grouped data. In Section 4 we discuss the proposed methodology. In Section 5 we evaluate and compare the performance of MCIB relative to the RPME and the MGBE (von Hippel et al. 2016) in estimating the variances, Gini coefficients, and Theil indexes of inequality. We find that MCIB produces estimates that are both less biased and substantially more precise. MCIB estimates also have smaller root mean square errors (RMSEs) than the RPME estimates and are nearly perfectly correlated with the parameters from the original data. In Section 6 we conclude with a discussion of the implications of the findings as well as areas for future research.

To provide a context for the long-standing problem of recovering the parameters of the household income distribution from aggregated data reported in income brackets, we begin by examining actual household income distributions from 297 metropolitan areas using individual PUMS data from the ACS for 2011.

Of 87 million metropolitan households, slightly more than 1 million (1.2 percent) reported zero income. Another 32,000 (0.04 percent) reported negative incomes that averaged –$6,270. Negative and zero incomes are often discarded because the statistical distributions and inequality measures that researchers use to study income distribution are undefined for negative or zero values (e.g., those based on the log of household income). This approach is problematic. Although they are a small part of the entire sample, the negative and zero incomes make up 16.5 percent of the “below $10,000” income bracket used in the grouped data. Moreover, these observations are well below the metropolitan mean and therefore could contribute disproportionately to deviation-based measures of inequality. Systematically discarding the poorest households for computational convenience seems like a questionable approach to studying inequality.

Table 1 reports on several approaches to handling these observations using data for the 297 metropolitan areas combined. Although dropping the negatives has little effect on the mean or standard deviation of income, dropping the households with zero income decreases the weighted count by 1 million households and increases the mean income by nearly $1,000. Instead, recoding the negative and zero incomes to $1 preserves the sample size (which could be an issue with smaller metropolitan areas) and produces nearly identical summary statistics to the data as originally reported. We follow this practice here, allowing a consistent sample to be used for all the estimation techniques being compared.

Table 1. Household Income in U.S. Metropolitan Areas: Alternative Definitions

Summary data published by the Census Bureau currently use a set of 16 income categories of varying widths. The counts of households for the Los Angeles–Long Beach, California, metropolitan area as well as the richest (Stamford, CT), median (Vineland-Millville-Bridgeton, NJ), and poorest (Flint, MI) metropolitan areas are shown in Table 2. Nearly 6 percent of households in the Los Angeles–Long Beach metropolitan area are in the open-ended top bracket. In contrast, nearly 25 percent of the households are in the top bracket in the richest metropolitan area, Stamford, Connecticut. Vineland-Millville-Bridgeton, New Jersey, and Flint, Michigan, have only 2.3 percent and 0.6 percent of households in that category, respectively. The bracket containing each metropolitan area’s median income is shown in boldface type and is different for each metropolitan area. Clearly, any estimation technique must be able to cope with the fact that the fixed bracket amounts cut the different metropolitan income distributions in different relative positions.

Table 2. Household Income by Income Brackets

Taking advantage of the fact that in this case we have the underlying household-level data, Figure 1a shows the histogram of household income for the Los Angeles metropolitan area using consistent $10,000 brackets. The boundaries for the income brackets used for grouped data are indicated, the highest of which is the open-ended top bracket beginning at $200,000. The profound rightward skew compresses the histogram of household income, obscuring all detail in the lower income brackets. To overcome this compression, we break the distribution into lower, middle, and upper sections. Figure 1b shows the lower brackets of the income distribution, using $1,000 brackets to better show the details of the distribution. Incomes below $10,000 are bimodal, with many households at both extremes of the first bracket. The distributions in the eight $5,000-wide brackets are highly irregular. For households in the $50,000 to $200,000 brackets, however, there is a pattern of declining density, as shown in Figure 1c. This exact within-bracket distribution is not known when we have access only to the grouped data, but the general tendency of the within-bracket distributions should be taken into account to estimate more accurately the variance of the distribution and other statistics from the bracket counts.


Figure 1. Distribution of income, Los Angeles–Long Beach metropolitan area, 2011.

The open-ended top bracket is the most troublesome. Figure 1d shows the distribution of households with incomes above $200,000. A Pareto distribution is superimposed on the histogram with a shape parameter α, estimated from the data, and a scale parameter β equal to the lower bound of $200,000. Although the Pareto fits the declining pattern in a general way, there are clumps of households at different values, possibly due to top coding. Estimating the shape parameter is easier using the underlying data than the counts of households in brackets, but it is important to remember that the Pareto is not an exact fit to the individual-level data. Moreover, because the right tail of the distribution is so extended, even small errors in estimating α will produce large errors in the estimation of the variance and other parameters of interest.

Although there are variations across the 297 metro areas in the sample, the general features of the household income distributions described above are nearly universal. Keeping these particulars of the income distribution in mind, we turn to the question of estimating the household income variance and related quantities when the household-level data are not available. First, we briefly review the literature on this problem, with an emphasis on the recent valuable contribution of von Hippel et al. (2016). We then lay out an improved method, MCIB, which takes into account the empirical regularities discussed above and makes more complete use of the limited information provided in grouped data. Following that, we compare the accuracy of different methods of recovering the parameters of the underlying data.

Introductory statistics textbooks present a method for calculating the variance of data presented in a series of brackets:

\[
s^2 = \frac{1}{N-1} \sum_{b=1}^{B} f_b (m_b - G)^2, \tag{1}
\]

where \(b\) indexes the brackets, \(f_b\) is the count of observations in the bracket, \(G\) is the grand mean, and \(m_b\) is the midpoint of the bracket. The formula would be exactly correct only if all observations were situated at the bracket midpoints. Typically, however, observations will be clustered within the brackets so that the mean of the observations in a bracket differs from its midpoint. In the brackets above $50,000, for example, households were more frequent toward the lower end and declined toward the upper limit (Figure 1c). Even if the mean of a given bracket is equal to its midpoint, this estimator of the variance is still incorrect, because

\[
\sum_{i=1}^{n_b} (x_i - G)^2 \neq n_b (x_{\mathrm{mid}} - G)^2. \tag{2}
\]

In other words, the bracket’s sum of squared deviations will differ from the number of households times the squared deviation of the midpoint even when the mean is equal to the midpoint.
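The point of equations (1) and (2) can be checked with a small numerical sketch. The six incomes and the two brackets below are hypothetical toy values chosen so that each bracket mean equals its midpoint; the function names are illustrative and do not come from any published implementation.

```python
def midpoint_variance(counts, midpoints, grand_mean):
    """Equation (1): variance with every household placed at its bracket midpoint."""
    n_total = sum(counts)
    ss = sum(f * (m - grand_mean) ** 2 for f, m in zip(counts, midpoints))
    return ss / (n_total - 1)


def sample_variance(values):
    """Variance computed from the individual records."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical toy data: brackets [0, 10) and [10, 20), three households each.
# The bracket means (5 and 15) coincide with the bracket midpoints.
incomes = [2.0, 5.0, 8.0, 12.0, 15.0, 18.0]
counts = [3, 3]
midpoints = [5.0, 15.0]
grand_mean = sum(incomes) / len(incomes)

est = midpoint_variance(counts, midpoints, grand_mean)
actual = sample_variance(incomes)
print(est, actual)
```

Even in this favorable case the midpoint estimate falls short of the variance of the individual records, because the spread of households around each midpoint is ignored.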

Despite these problems, the midpoint estimator has been widely used. In an early study of the U.S. income distribution, U.S. Census Bureau demographer Mary Henson (1967) described her method as follows:

Since $500 or $1,000 levels were used below $10,000 for 1947 to 1950 and below $15,000 thereafter, the midpoint of each interval below the open-end was assumed to be the average. A value of $19,000 was used for the $15,000 to $24,999 interval. In general, the average for the open-end interval . . . was obtained by fitting a Pareto Curve to the data. (p. 33)

Note that Henson did not use the exact midpoint for the closed bracket of $15,000 to $24,999, but she adjusted it from $20,000 to $19,000 without offering any explanation. We can surmise she made this ad hoc adjustment to account for the likelihood that households were clustered toward the lower end of this bracket.

No midpoint is available for the open-ended top interval. However, if the households in the top bracket follow a Pareto distribution, the average income of the top bracket is a simple function of the Pareto shape parameter α, given by

\[
\mu_B = \beta \frac{\alpha}{\alpha - 1}, \tag{3}
\]

in which β is the lower limit of the open-ended bracket. The idea is to estimate α and then use that to estimate the average income of the top bin; that figure is then used in the same way that the midpoint is used to represent the lower bins. In other words, the entire population of the top bracket is assumed to have the mean income of the bracket, just as the households in the lower brackets are assumed to be at one of the other midpoints.

The accuracy of the estimated top bracket mean depends on the estimate of α. The Pareto distribution can be written as \(N = A Y^{-\alpha}\), where \(N\) is the number of households with incomes of \(Y\) or higher; in other words, in this formulation it is cumulative from the top. Taking the log of both sides yields a linear version: \(\ln N = \ln A - \alpha \ln Y\). On the basis of this identity, one common method to estimate the shape parameter is to estimate a linear regression of ln N on ln Y for all the brackets above the median income (Bronfenbrenner 1971:44–45). However, diagnostic analysis of the residuals from such regressions suggests a lack of parameter stability, and large errors result when extrapolating these regression coefficients to households in the topmost bracket (Cloutier 1988).

An alternative is to calculate the slope directly from the two highest brackets. For example, using the Los Angeles–Long Beach data presented in Table 2, the estimate of the shape parameter of the Pareto distribution is computed as follows:

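\[
\alpha = \frac{\ln(N_{15}) - \ln(N_{16})}{\ln(Y_{16}) - \ln(Y_{15})} = \frac{\ln(161{,}829 + 189{,}414) - \ln(189{,}414)}{\ln(200{,}000) - \ln(150{,}000)} = 2.15, \tag{4}
\]

where \(N_b\) is the number of households with incomes at or above \(Y_b\), the lower limit of bracket \(b\).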

This is the commonly used two-point estimator (Cloutier 1988; Henson 1967; Jargowsky 1996; Miller 1966; von Hippel et al. 2016). This approach assumes that the incomes in the top two brackets are Pareto distributed and share a common shape parameter. Given that this estimate of α is based on only two points at the thinner end of the distribution, “it is not robust or even usable in some small samples” (von Hippel et al. 2016:219).
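The two-point calculation and the implied top bracket mean from equation (3) can be reproduced with a short sketch. It assumes, as the arithmetic in equation (4) implies, that 189,414 is the count in the open-ended top bracket and 161,829 is the count in the second-highest bracket; the function names are illustrative only.

```python
import math


def two_point_alpha(n_top, n_second, y_top, y_second):
    """Two-point Pareto shape estimate from the top two brackets.

    Counts are cumulated from the top: n_top households have incomes of
    y_top or more, and n_top + n_second have incomes of y_second or more.
    """
    cum_second = n_top + n_second
    return (math.log(cum_second) - math.log(n_top)) / (
        math.log(y_top) - math.log(y_second)
    )


def pareto_mean(beta, alpha):
    """Equation (3): mean of a Pareto distribution with lower limit beta."""
    if alpha <= 1:
        raise ValueError("the Pareto mean is not finite for alpha <= 1")
    return beta * alpha / (alpha - 1)


alpha_la = two_point_alpha(189_414, 161_829, 200_000, 150_000)
top_mean_la = pareto_mean(200_000, alpha_la)
print(round(alpha_la, 2), round(top_mean_la))
```

Because the top bracket mean is proportional to α/(α − 1), small changes in α near 2 move the implied mean by tens of thousands of dollars, which is the sensitivity the text describes.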

Dissatisfied with the performance of this “naive” midpoint estimator, von Hippel et al. (2016) systematically evaluated a number of more robust midpoint estimators (RPME). They noted that much of the instability in the estimates of income distribution parameters using the two-point estimator is due to the high degree of sensitivity of the top bracket mean to the value of α. To make the midpoint estimator robust, they suggested replacing the arithmetic mean with the median, harmonic mean, or geometric mean of the top bracket. Although these alternatives are also functions of α, they are more stable. They are also smaller in magnitude than the arithmetic mean. von Hippel et al. (2016) acknowledged that the substitution “could introduce some negative bias, but the bias will be compensated by some reduction in variance” (pp. 220–21). They also recommended setting a lower bound on the value of α of 2 in the case of the arithmetic mean and 1 for the alternative measures.
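To make the comparison concrete, the following sketch evaluates the four candidate summaries of a Pareto top bracket. The closed forms for the median, geometric mean, and harmonic mean are standard properties of the Pareto distribution and are not spelled out in the text; whether they match the exact RPME implementation is an assumption, and the function name is hypothetical.

```python
import math


def pareto_top_bracket_summaries(beta, alpha):
    """Candidate 'representative incomes' for a Pareto top bracket.

    beta is the lower limit of the bracket and alpha the shape parameter.
    """
    return {
        # Arithmetic mean is finite only when alpha > 1.
        "arithmetic_mean": beta * alpha / (alpha - 1) if alpha > 1 else math.inf,
        "median": beta * 2 ** (1 / alpha),
        "geometric_mean": beta * math.exp(1 / alpha),
        "harmonic_mean": beta * (1 + 1 / alpha),
    }


summaries = pareto_top_bracket_summaries(beta=200_000, alpha=2.15)
for name, value in summaries.items():
    print(f"{name}: {value:,.0f}")
```

For any α greater than 1 the three alternatives lie below the arithmetic mean, which is the source of the negative bias that von Hippel et al. (2016) acknowledged.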

A completely different approach is to fit a single, flexible probability distribution to the entire set of brackets and household counts. McDonald (1984) described a number of two-, three-, and four-parameter distributions that can be used to model income distributions and gave formulas to calculate the means, variances, and inequality measures as functions of the estimated parameters. For example, Corcoran and Evans (2010) estimated inequality in school districts by fitting a three-parameter Dagum distribution to income bracket data from a panel of school districts. Minoiu and Reddy (2014) tested a nonparametric kernel density estimator but found that it performed worse than parametric approaches in many situations.

von Hippel et al. (2016) also developed an estimator (MGBE) that calculates 10 related distributions from the generalized beta family and either selects a best model on the basis of fit statistics or provides a weighted average of several models. They found that the MGBE outperformed single distribution approaches but was not uniformly more accurate than the RPME and was less accurate in more recent data. The quality of the MGBE estimator depends on how well the actual income distribution fits one of the unimodal, smoothly changing theoretical distributions, a cause for concern given the lumpiness and inconsistency of the actual metropolitan income distributions described above (see also Minoiu and Reddy 2008). In contrast, the RPME “with enough bins, can fit the nooks and crannies of any income distribution, regardless of its shape” (von Hippel et al. 2016:216). A further consideration is that “the RPME is about 1,000 times faster than MGBE” (p. 216). In the “Results” section, the performance of the naive midpoint estimator, the RPME, and the MGBE is compared with that of the MCIB method proposed here.

Midpoint estimators perform worse as the bracket width increases (Heitjan 1989). In the case of U.S. census and ACS data, there are 16 income brackets that vary enormously in width; the widest closed-end bracket spans a range of $50,000. The method proposed here estimates the variance, Gini coefficient, Theil index, and other inequality statistics by the computation of integrals of density functions over the brackets that explicitly take into account the variation of values within the brackets. If the changing relative frequency of households is well captured by the density functions used, the width of any given bracket will matter far less.

The MCIB estimation proceeds in three steps. First, a combination of uniform and linear density functions is used to estimate the aggregate household income and mean income within the closed-end brackets. Second, that information in combination with the grand mean is used to directly compute the mean in the open-ended top bracket, which implies a specific value for the shape parameter of the Pareto distribution for the open-ended top bracket. Third, a combination of uniform, linear, and Pareto density functions is used to compute the contribution of each bracket to the parameter of interest. These are then combined to get the overall estimate. To illustrate the process, the steps to estimate the variance of household income from the grouped data are spelled out as follows.

  • Step 1: Estimate the density functions of the closed-end brackets.

Assume there are \(B\) brackets, indexed by \(b = 1, 2, \ldots, B\) in ascending order of income level. Each bracket has an upper limit, \(U_b\), and a lower limit, \(L_b\), but the top and bottom brackets are technically open ended. However, given the small number and negligible impact of negative values in the PUMS data discussed above, the bottom bracket can be considered closed at 0 (or $1 if the data are recoded to eliminate zeros). Each bracket contains \(n_b\) households adding up to \(N\), the total number of households. The unknown mean income of each bracket is denoted \(\mu_b\); in general, these means are not equal to the midpoint of the bracket, though in narrower brackets they may be approximately the same.

For those brackets in which the density is consistently rising or falling, however, the density of households within the bracket may be described by a linear function:

\[
f_b(y) = \frac{m_b y + c_b}{N}, \tag{5}
\]

where \(m_b\) is the slope and \(c_b\) is the constant of the line that describes the relative frequency of households in bracket \(b\). To determine the slopes and intercepts for the brackets, we take the number of households in each bracket and divide by the width of the bracket, giving a frequency per dollar of income for each bracket. The slopes, \(m_b\), are calculated as the average of the slopes from bracket \(b - 1\) to \(b\) and from \(b\) to \(b + 1\), if both are available, reflecting the assumption that the trend within a bracket roughly reflects the trend across the neighboring brackets. The constants, \(c_b\), are then calculated to force the line of slope \(m_b\) through the relative frequency point, thus preserving the correct overall frequency for the bracket (Liebenberg and Kaitz 1951). The slopes are constrained to produce only non-negative densities. Finally, the function is divided by the total households \(N\), so the density integrates to \(n_b/N\) over the bracket, that is, the given bracket’s contribution to the total probability function.

In some brackets, a uniform within-bracket distribution may be preferable to a linear function. For example, if a bracket is either higher or lower than both of its neighboring brackets (effectively an island or a trench), there is no basis for concluding the density is either rising or falling within the bracket, so the slope is set to zero. The large spike of households frequently observed at $0 of income argues against an upward-sloping function in the first bracket, so a uniform density is applied there as well. For the sake of illustration, the relative frequencies and resulting slopes for the closed-end brackets for the Los Angeles–Long Beach metropolitan area are shown in Figure 2. In the “Results” section, we empirically evaluate several options for specifying uniform distributions: (1) no brackets, (2) first bracket only, (3) all brackets below the mean, and (4) for the sake of comparison only, all closed-end brackets. There are discontinuities in the linear density functions at the bracket boundaries, but this reflects the actual messiness of the income distribution shown in the data. Because each bracket is evaluated separately, the discontinuities have no effect on the subsequent calculations. The discontinuities may appear odd, but if a single continuous function fits the data well, we would not need to engage in this exercise. The validity of the procedure will be evaluated on the basis of how well the resulting estimates match the parameter values of the underlying income distributions.
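Step 1 can be sketched as follows for the closed-end brackets. Several details are assumptions made for illustration and are not specified in the text: slopes between neighboring brackets are measured between bracket midpoints, the line is forced through the per-dollar frequency at the bracket midpoint (which preserves the bracket count), and non-negativity is enforced by clamping the slope so the line does not cross zero inside the bracket. The function name is hypothetical.

```python
def linear_bracket_densities(lowers, uppers, counts, uniform_first=True):
    """Return a list of (slope m_b, intercept c_b) for closed-end brackets.

    m_b * y + c_b is the number of households per dollar at income y.
    """
    widths = [u - l for l, u in zip(lowers, uppers)]
    mids = [(l + u) / 2 for l, u in zip(lowers, uppers)]
    freq = [n / w for n, w in zip(counts, widths)]  # households per dollar
    last = len(counts) - 1
    params = []
    for b in range(len(counts)):
        # Slopes toward the neighbouring brackets, where a neighbour exists.
        left = (freq[b] - freq[b - 1]) / (mids[b] - mids[b - 1]) if b > 0 else None
        right = (freq[b + 1] - freq[b]) / (mids[b + 1] - mids[b]) if b < last else None
        if b == 0 and uniform_first:
            m = 0.0  # uniform density in the first bracket
        elif left is not None and right is not None:
            # Island or trench: neighbours slope in opposite directions.
            m = 0.0 if left * right < 0 else (left + right) / 2
        elif left is not None:
            m = left
        elif right is not None:
            m = right
        else:
            m = 0.0
        # Clamp so the density is non-negative at both ends of the bracket.
        max_abs = freq[b] / (widths[b] / 2)
        m = max(-max_abs, min(max_abs, m))
        # Pass the line through (midpoint, frequency) to preserve the count.
        c = freq[b] - m * mids[b]
        params.append((m, c))
    return params


# Hypothetical toy brackets: [0, 10), [10, 20), [20, 30).
params = linear_bracket_densities([0, 10, 20], [10, 20, 30], [30, 20, 10])
print(params)
```

Because a line integrated over an interval equals the interval width times the value at the midpoint, anchoring each line at the midpoint frequency keeps every bracket's household count unchanged regardless of the slope chosen.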


Figure 2. Linear functions for relative frequency within brackets, with uniform density in first bracket.

  • Step 2: Estimate the mean and Pareto parameter for the open-ended bracket.

By definition the grand mean, \(G\), is related to the bracket means through the following identity: \(G = \frac{1}{N} \sum_{b=1}^{B} n_b \mu_b\). The mean income of a given area is nearly always published along with the counts of households by income brackets. This information makes it possible to calculate the mean of the incomes in the open-ended top bracket by pulling the top bracket out of the summation and rearranging terms:

\[
\mu_B = \left( N G - \sum_{b=1}^{B-1} n_b \mu_b \right) \Big/ n_B. \tag{6}
\]

In other words, we aggregate all the income in the brackets below the top bracket, subtract that from total aggregate income, and divide by the number of households in the top bracket. Despite the unlimited possible income values in the top bracket, the mean of the top bracket is nevertheless constrained by the grand mean.

In brackets assumed to have a linear trend in the household distribution, the following integral estimates the mean household income in the brackets below the top bracket:

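\[
\mu_b \approx \frac{N}{n_b} \int_{L_b}^{U_b} y\, f_b(y)\, dy = \frac{N}{n_b} \int_{L_b}^{U_b} y \left( \frac{m_b y + c_b}{N} \right) dy = \frac{1}{n_b} \left. \left( \frac{m_b}{3} y^3 + \frac{c_b}{2} y^2 \right) \right|_{L_b}^{U_b}. \tag{7}
\]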

(The term N/nb weights the density to sum to 1 over the bracket.) These estimated mean incomes, rather than the bracket midpoints, are then plugged into equation (6) to get the top bracket mean, µB.
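Equations (6) and (7) translate directly into code. The sketch below uses hypothetical toy values (a bracket from 10 to 20 with slope −0.1 and constant 3.5, and a grand mean of 15); the function names are illustrative.

```python
def linear_bracket_mean(m, c, lower, upper, n):
    """Equation (7): mean income in a closed bracket holding n households,
    where m * y + c is the number of households per dollar at income y."""
    def antiderivative(y):
        return m / 3 * y ** 3 + c / 2 * y ** 2
    return (antiderivative(upper) - antiderivative(lower)) / n


def top_bracket_mean(grand_mean, counts, closed_means):
    """Equation (6): mean of the open-ended top bracket.

    counts lists every bracket, with the top bracket last;
    closed_means lists the estimated means of the closed-end brackets.
    """
    n_total = sum(counts)
    closed_income = sum(n * mu for n, mu in zip(counts[:-1], closed_means))
    return (n_total * grand_mean - closed_income) / counts[-1]


# Hypothetical example: a declining density pulls the mean below the midpoint.
mu_2 = linear_bracket_mean(m=-0.1, c=3.5, lower=10, upper=20, n=20)
mu_top = top_bracket_mean(grand_mean=15.0, counts=[30, 20, 10], closed_means=[5.0, mu_2])
print(mu_2, mu_top)
```

In the toy bracket the estimated mean is about 14.58 rather than the midpoint of 15, and that difference carries through to the implied top bracket mean.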

For the open-ended bracket, we use the Pareto density function (Mandelbrot 1960):

\[
\pi(y) = \frac{\alpha \beta^{\alpha}}{y^{\alpha + 1}}, \tag{8}
\]

in which β is the minimum income, in this case the lower limit of the top bracket. Rather than estimating the shape parameter α from the counts of households in the top two brackets, it is calculated directly from our estimate of the top bracket mean, µB, using the following relationship (Quandt 1966):

\[
\alpha = \frac{\mu_B}{\mu_B - \beta}. \tag{9}
\]

Estimating α from the top bracket mean is the reverse of the procedure usually used in prior research, in which α is first estimated and implies a value for the top bracket mean (Henson 1967; Jargowsky 1996; von Hippel et al. 2016).
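A minimal sketch of the mean-to-shape conversion in equation (9) follows. The guard against a top bracket mean at or below the bracket's lower limit is an added assumption about how to handle inconsistent inputs, not part of the published method.

```python
def alpha_from_top_mean(top_mean, beta):
    """Equation (9): Pareto shape implied by the top bracket mean.

    beta is the lower limit of the open-ended top bracket.
    """
    if top_mean <= beta:
        raise ValueError("top bracket mean must exceed the bracket's lower limit")
    return top_mean / (top_mean - beta)


# A top bracket mean of twice the lower limit implies alpha = 2.
alpha = alpha_from_top_mean(400_000, 200_000)
print(alpha)
```

Equation (9) is simply equation (3) solved for α, so feeding the implied mean back through equation (3) recovers the same mean.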

  • Step 3: Estimate the variance (and other parameters).

Accurate estimates of the parameters of the income distribution are computed using the density functions developed in the first two steps. The procedure is illustrated using the variance. The variance has deficiencies as a measure of inequality, but it is clearly an important characteristic of the distribution and it is useful in calculating other measures. The formula for the variance of the income distribution on the basis of individual households is additively separable by income brackets:

\[
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - G)^2 = \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{n_b} (y_i - G)^2. \tag{10}
\]

We replace the inner summation with the appropriate integral:

\[
\sigma^2 = \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{n_b} (y_i - G)^2 \approx \sum_{b=1}^{B} \int_{L_b}^{U_b} (y - G)^2 f_b(y)\, dy. \tag{11}
\]

In other words, MCIB estimates the contributions to the variance separately in each bracket and sums them up to obtain the estimate of the variance (Cowell 1977).

The variance component for the closed-end brackets depends on the type of density assumed for each bracket. If the relative frequency of the households is assumed to change in a linear fashion within the bracket, the linear density function described above is used; the resulting integral for the variance component in brackets from 1 to \(B - 1\) is therefore

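\[
\begin{aligned}
\int_{L_b}^{U_b} (y - G)^2 f_b(y)\, dy
&= \int_{L_b}^{U_b} (y^2 - 2Gy + G^2) \left( \frac{m_b y + c_b}{N} \right) dy \\
&= \frac{1}{N} \int_{L_b}^{U_b} \left( m_b y^3 + c_b y^2 - 2G m_b y^2 - 2G c_b y + G^2 m_b y + G^2 c_b \right) dy \\
&= \frac{1}{N} \int_{L_b}^{U_b} \left( m_b y^3 + (c_b - 2G m_b) y^2 + (G^2 m_b - 2G c_b) y + G^2 c_b \right) dy \\
&= \frac{1}{N} \left. \left( \frac{m_b}{4} y^4 + \frac{c_b - 2G m_b}{3} y^3 + \frac{G^2 m_b - 2G c_b}{2} y^2 + G^2 c_b y \right) \right|_{L_b}^{U_b}.
\end{aligned} \tag{12}
\]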

Note that for brackets assumed to follow a uniform density, the same integral is used but with mb set to zero and cb adjusted accordingly. An alternative integral based on the uniform density per se produces identical results.
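The closed-form result in equation (12) can be coded as a single antiderivative evaluated at the bracket limits. The example values below are hypothetical, and the function name is illustrative.

```python
def linear_variance_component(m, c, lower, upper, grand_mean, n_total):
    """Equation (12): contribution of one closed-end bracket to the variance.

    m * y + c is the number of households per dollar at income y;
    set m = 0 for a bracket with uniform density.
    """
    g = grand_mean

    def antiderivative(y):
        return (
            m / 4 * y ** 4
            + (c - 2 * g * m) / 3 * y ** 3
            + (g ** 2 * m - 2 * g * c) / 2 * y ** 2
            + g ** 2 * c * y
        )

    return (antiderivative(upper) - antiderivative(lower)) / n_total


# Hypothetical toy brackets with a grand mean of 15 and 60 households in total.
uniform_part = linear_variance_component(0.0, 3.0, 0, 10, grand_mean=15.0, n_total=60)
linear_part = linear_variance_component(-0.1, 3.5, 10, 20, grand_mean=15.0, n_total=60)
print(uniform_part, linear_part)
```

Setting the slope to zero reproduces the uniform-density result, which is why a separate uniform integral is not needed.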

Finally, the variance component from the households in the top bracket, which is assumed to follow a Pareto distribution, can be estimated through the following integral:

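\[
\begin{aligned}
\int_{L_B}^{\infty} (y - G)^2 f_B(y)\, dy
&= \int_{L_B}^{\infty} (y^2 - 2Gy + G^2) \left( \frac{n_B}{N} \right) \left( \frac{\alpha \beta^{\alpha}}{y^{\alpha + 1}} \right) dy \\
&= \frac{n_B \alpha \beta^{\alpha}}{N} \int_{L_B}^{\infty} \left( y^{1 - \alpha} - 2G y^{-\alpha} + G^2 y^{-\alpha - 1} \right) dy \\
&= \frac{n_B \alpha \beta^{\alpha}}{N} \left. \left( \frac{y^{2 - \alpha}}{2 - \alpha} - \frac{2G y^{1 - \alpha}}{1 - \alpha} + \frac{G^2 y^{-\alpha}}{-\alpha} \right) \right|_{L_B}^{\infty}.
\end{aligned} \tag{13}
\]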

The Pareto density π(y) that was used above summed to 1 over the open-ended bracket; here it is adjusted by nB/N so that the contribution to the variance from this bracket is properly weighted.
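The antiderivative in equation (13) can be evaluated between the bracket's lower limit and any finite upper limit. The sketch below takes the upper limit as an argument, which is an assumption made so that the function returns a finite value for any α other than 1 or 2; the function name and the toy values are illustrative.

```python
def pareto_variance_component(alpha, beta, upper, grand_mean, n_top, n_total):
    """Equation (13) evaluated from beta to a finite upper limit.

    beta is the lower limit of the top bracket, n_top the number of
    households in it, and n_total the total number of households.
    """
    if alpha in (1.0, 2.0):
        raise ValueError("antiderivative is undefined for alpha equal to 1 or 2")
    g = grand_mean

    def antiderivative(y):
        return (
            y ** (2 - alpha) / (2 - alpha)
            - 2 * g * y ** (1 - alpha) / (1 - alpha)
            + g ** 2 * y ** (-alpha) / (-alpha)
        )

    scale = n_top * alpha * beta ** alpha / n_total
    return scale * (antiderivative(upper) - antiderivative(beta))


# Toy check with alpha = 3, beta = 1, and a grand mean of 1.
component = pareto_variance_component(3.0, 1.0, 10.0, grand_mean=1.0, n_top=1, n_total=1)
print(component)
```

For this toy case the integral can be worked by hand as 3(−1/y + 1/y² − 1/(3y³)) evaluated from 1 to 10, which gives a quick check on the signs in the antiderivative.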

There are two difficulties in completing the calculation of the last integral on the basis of the Pareto distribution. First, the upper bound is technically infinity. Although the expected number of households falls dramatically as the income level rises, the small probability of extreme values continues to inflate the estimate, particularly given that the quantity being estimated grows as the square of the distance from G. Indeed, if α is less than 2, the variance of the Pareto distribution, on the basis of the squared deviations from the upper bracket mean, is infinite (Mandelbrot 1960). It thus stands to reason that the sum of squared deviations from the grand mean would be infinite as well. One possibility is to use an arbitrary cutoff for the upper limit of the integral, such as $2 million. The disadvantage of a fixed cutoff approach is that more of the Pareto density is included in the calculation for less unequal places and vice versa, possibly introducing a systematic bias.

Instead, we use a value for the upper limit of the integral that includes a consistent 99.5 percent of the Pareto density function across all metropolitan areas. The exact dollar amount to set as the upper limit of integration is obtained by setting the cumulative probability function for the Pareto distribution (Quandt 1966) equal to 0.995 and solving for the income level:

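\[
F(U_B) = 1 - \left( \frac{\beta}{U_B} \right)^{\alpha} = 0.995 \;\Rightarrow\; U_B = \exp\!\left( \ln \beta - \frac{\ln(1 - 0.995)}{\alpha} \right). \tag{14}
\]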

Note that β is the minimum level of income; in the ACS data, it is the lower bound of the top bracket ($200,000), and it is the same for all metropolitan areas. Thus, the cutoff is a function of α alone for any given cumulative probability. A second potential computational problem is that the integral is undefined when α is exactly 2 or exactly 1. However, this is extremely unlikely to happen in practice. First, the mean of the Pareto distribution is not finite for α of 1 or less, so an α derived from a finite top bracket mean through equation (9) always exceeds 1. Second, the integral’s value is stable on both sides of each discontinuity. Thus, should an α value of exactly 1 or 2 occur, it is only necessary to add or subtract a small amount, such as 0.001, to the value of α to avoid a missing value for the top bracket interval.
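The cutoff in equation (14) and the small adjustment to α can be sketched as follows. The function names and the default arguments are illustrative choices; only the 0.995 coverage and the 0.001 adjustment come from the text.

```python
import math


def pareto_upper_limit(alpha, beta, coverage=0.995):
    """Equation (14): income below which `coverage` of the Pareto density lies."""
    return math.exp(math.log(beta) - math.log(1 - coverage) / alpha)


def nudge_alpha(alpha, eps=0.001):
    """Shift alpha slightly when it lands exactly on 1 or 2."""
    return alpha + eps if alpha in (1.0, 2.0) else alpha


upper_limit = pareto_upper_limit(alpha=2.0, beta=200_000)
print(round(upper_limit))
```

With α = 2 the cutoff is β times the square root of 200, or roughly $2.8 million; a larger α pulls the cutoff closer to the lower limit of the bracket.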

Finally, the variance components from all the brackets are added together to get the estimate of the variance of the entire household income distribution. We take the square root to get the standard deviation and divide that by the grand mean to get the coefficient of variation.

4.1. Extensions to Additional Measures

The same procedure can be used to estimate any quantity that is additively separable across the income brackets in the sense described above. In general, if there is an income distribution statistic Z, with a calculation formula z(y), the general formula to estimate it from the grouped data is

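$$Z = \sum_{b=1}^{B}\int_{L_b}^{U_b} z(y)\,f_b(y)\,dy = \sum_{b=1}^{B-1}\int_{L_b}^{U_b} z(y)\left(\frac{m_b y + c_b}{N}\right)dy + \int_{L_B}^{\infty} z(y)\left(\frac{n_B}{N}\right)\left(\frac{\alpha\beta^{\alpha}}{y^{\alpha+1}}\right)dy. \tag{15}$$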
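The general formula can be sketched numerically for any z(y). The example below is only an illustration under stated assumptions: it uses a simple composite Simpson's rule rather than closed-form integrals, it truncates the top bracket at the income level holding 99.5 percent of the Pareto density, and the bracket bounds, slopes, intercepts, and counts are made-up toy values. The names `simpson` and `estimate_statistic` are hypothetical.

```python
import math


def simpson(func, a, b, n=2000):
    """Composite Simpson's rule on [a, b] using n subintervals (n must be even)."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0


def estimate_statistic(z, closed, top, n_total, p_cut=0.995):
    """Approximate Z by integrating z(y) against the bracket densities.

    closed: list of (L_b, U_b, m_b, c_b) for brackets 1..B-1
    top:    (L_B, n_B, alpha) for the open-ended bracket
    The top bracket is truncated where p_cut of the Pareto mass is covered.
    """
    total = 0.0
    for lo, hi, m, c in closed:
        total += simpson(lambda y: z(y) * (m * y + c) / n_total, lo, hi)
    l_top, n_top, alpha = top
    upper = math.exp(math.log(l_top) - math.log(1.0 - p_cut) / alpha)

    def pareto_part(y):
        return z(y) * (n_top / n_total) * alpha * l_top ** alpha / y ** (alpha + 1)

    return total + simpson(pareto_part, l_top, upper)


# Toy data (hypothetical): two uniform brackets and a Pareto top bracket.
closed_brackets = [(0.0, 100.0, 0.0, 6.0), (100.0, 200.0, 0.0, 3.0)]
top_bracket = (200.0, 100.0, 3.0)
share = estimate_statistic(lambda y: 1.0, closed_brackets, top_bracket, 1000.0)
mean = estimate_statistic(lambda y: y, closed_brackets, top_bracket, 1000.0)
```

Setting z(y) = 1 recovers the share of households covered by the integration (slightly below 1 because of the truncation), z(y) = y gives the mean, and z(y) = (y − G)² would give the variance.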

We illustrate this using the Theil index in the “Results” section.

Another advantage of MCIB is that the multipart density function can produce percentiles of the distribution and inequality measures based on them. To calculate the median, for example, it is first necessary to identify the bracket that contains it, which will vary from one metropolitan area to another. For example, if the first five brackets hold 47 percent of the distribution in a particular area and the sixth bracket holds an additional 10 percent, then the value must be found between the upper and lower bounds of the sixth bracket that corresponds to an additional 3 percent of the overall distribution. In other words, the integral of the density function for that bracket must be solved for the upper value θ that corresponds to the required additional probability, Δp:

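$$\int_{L_b}^{\theta}\left(\frac{m_b y + c_b}{N}\right)dy = \left.\left(\frac{m_b y^{2}}{2N} + \frac{c_b y}{N}\right)\right|_{L_b}^{\theta} = \Delta p. \tag{16}$$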

Expanding the definite integral and rearranging terms results in a quadratic equation that can easily be solved for the critical value, θ:

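$$\left(\frac{m_b}{2N}\right)\theta^{2} + \left(\frac{c_b}{N}\right)\theta - \left(\frac{m_b L_b^{2}}{2N} + \frac{c_b L_b}{N} + \Delta p\right) = 0, \tag{17}$$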

with the quadratic coefficients shown in parentheses. Only one of the roots will fall between the upper and lower limits of the bracket; the other is discarded. For the open-ended top bracket, equation (14) can be used again, substituting the desired p in place of 0.995. The percentile points can then be used to calculate familiar inequality measures such as the ratio of the 90th to the 10th percentile or the shares of income accruing to different quintiles of the distribution. In contrast, the midpoint method does not produce percentiles beyond identifying the midpoint of the bin in which the percentile is located.
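The root selection described here can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `percentile_in_bracket` and the slope and intercept in the usage note are hypothetical, and the uniform case (m = 0) is handled separately because the equation is then linear.

```python
import math


def percentile_in_bracket(lo, hi, m, c, n_total, delta_p):
    """Find theta in [lo, hi] such that the bracket density (m*y + c)/N
    integrated from lo to theta equals delta_p."""
    a = m / (2.0 * n_total)
    b = c / n_total
    k = -(m * lo ** 2 / (2.0 * n_total) + c * lo / n_total + delta_p)
    if a == 0.0:
        # Uniform density: the quadratic collapses to a linear equation.
        return -k / b
    disc = b * b - 4.0 * a * k
    if disc < 0.0:
        raise ValueError("delta_p is larger than the bracket can supply")
    roots = [(-b + sign * math.sqrt(disc)) / (2.0 * a) for sign in (1.0, -1.0)]
    inside = [r for r in roots if lo <= r <= hi]
    if not inside:
        raise ValueError("no root falls inside the bracket")
    return inside[0]
```

For a $50,000 to $60,000 bracket holding 100 of 1,000 households with a downward-sloping density (m = −4e-7, c = 0.032), asking for an additional 3 percent of the distribution places the percentile near $52,614; the second root lies well above $60,000 and is discarded.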

Not all inequality measures can be estimated directly from the bracket integrals. For example, computational formulas for the Gini coefficient are not additively separable on the income axis. However, the Gini coefficient can be estimated geometrically from the Lorenz curve, which plots the cumulative proportion of aggregate income against the cumulative proportion of households. Points on the Lorenz curve at the bracket endpoints may be calculated from the bracket totals, and to improve the accuracy of the estimate additional points within the brackets can be computed by successive applications of the appropriate density functions for households and income (Henson 1967). This procedure is illustrated in the “Results” section.

MCIB as proposed here makes better use of the information in the bracket data in several ways. The method makes reasonable inferences about the distribution of income within the brackets, on the basis of general experience with income distributions and by taking advantage of the trend in frequency across neighboring brackets. MCIB constrains the estimates of α, solving a problem that has led to instability in other estimation methods, by requiring that the estimate is consistent with the overall mean income. By allowing the household density to vary from one end of the bracket to the other, MCIB estimates are less sensitive to the specific cutoff points used and the width of the brackets. All of these factors contribute to a high degree of accuracy in the estimates of income statistics using MCIB, as the following section demonstrates.

Household income data from 297 metropolitan areas in the 2011 PUMS data are used to evaluate and compare methods of estimating the variance. The PUMS household-level data described above serve as the true population, so that we know the parameters of interest and can assess how well each of the methods recovers these parameters. These metropolitan areas vary in population size from 38,000 to 3.5 million. The mean household income ranges from $40,409 in Flint, Michigan, to $159,351 in Stamford, Connecticut. In fact, Stamford is quite an outlier; the next wealthiest metropolitan area is Danbury, Connecticut, with a mean household income of $112,314. The standard deviations of the household income distributions vary widely as well, from $37,718 in Flint to $191,634 in Stamford. The wide dispersion of population size, income levels, and income dispersion provides a good test of the estimation methods.

5.1. Closed Bracket Means

Table 3 shows the closed-end bracket limits, midpoints, the true bracket means calculated from the household-level data, and the bracket means estimated by integration for the Los Angeles–Long Beach metropolitan area. In none of the brackets is the actual mean equal to the midpoint. In the first bracket, however, the midpoint is closer to the true value than the mean estimated by integration. That is because the upward-sloping density from the first bracket to the second does not account for the observations stacked at zero, reducing the mean in the first bracket. In the brackets above the first and up to the median bracket ($50,000 to $60,000 in Los Angeles–Long Beach), the midpoint and estimated means are nearly identical and slightly above the true means. In the brackets above the median, however, the means estimated by integration are closer to the actual means than the midpoint in every case.

Table 3. Bracket Midpoints, Actual Means, and Estimated Means for Closed-end Brackets in Los Angeles–Long Beach, California

This is the general pattern across the 297 metropolitan areas, although there are variations, especially in smaller metropolitan areas. On average, weighting by households, the midpoints are $24 too low in the first bracket, whereas the means calculated using the linear function are $475 too high. For brackets below the metropolitan median other than the first, the two perform about equally well: The midpoint average is $188 too high, whereas the linear integration means are $183 too high. In median brackets, the integration method performs better than the midpoint; the former is $261 off on average compared with $384 for the midpoint method. But the biggest differences come in the wider brackets above the median, where the downward-sloping linear functions perform substantially better: the average error is only $104 for the integration method, compared with $1,377 for the midpoints. Henson (1967) was right to adjust the estimated mean for the upper closed bracket down, but we can do better than guessing which value to use.

Of course, in any real application of these methods, the true bracket means would not be known. However, these results suggest that it makes sense to use a uniform density function for the first bracket, and it should make little difference whether a uniform or linear density is used in the remaining brackets below the median. The results also suggest that a linear density should be used for all closed-end brackets at or above the median to capture the downward-sloping density that is common in the upper part of the distribution. We evaluate the performance of these choices below.
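The bracket means estimated by integration can be sketched as the ratio of two integrals over the bracket: aggregate income divided by the number of households. The sketch assumes the within-bracket household density has the linear form m·y + c used elsewhere in the article; the function name `bracket_mean` and the example slope and intercept are hypothetical.

```python
def bracket_mean(lo, hi, m, c):
    """Mean income in a closed bracket whose household density is m*y + c."""
    # Integral of y * (m*y + c) over the bracket: aggregate income.
    income = (m * hi ** 3 / 3.0 + c * hi ** 2 / 2.0) - (m * lo ** 3 / 3.0 + c * lo ** 2 / 2.0)
    # Integral of (m*y + c) over the bracket: number of households.
    households = (m * hi ** 2 / 2.0 + c * hi) - (m * lo ** 2 / 2.0 + c * lo)
    return income / households


uniform_mean = bracket_mean(50000.0, 60000.0, 0.0, 0.01)      # flat density
sloped_mean = bracket_mean(50000.0, 60000.0, -4e-7, 0.032)    # downward slope
```

A uniform density reproduces the midpoint ($55,000), whereas the downward-sloping density pulls the estimated mean below the midpoint (about $54,667 in this toy case), which is the direction of adjustment the results above favor for brackets above the median.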

5.2. Alpha and the Top Bracket Mean

The shape parameter of the Pareto distribution, although essential to evaluating the open-ended bracket, is challenging to estimate from grouped data. We compare several methods of estimating α with the maximum likelihood estimates obtained from the ungrouped household data. Regression using the linear form of the Pareto distribution on the cumulative totals from the brackets above the mean, as described by Bronfenbrenner (1971), does not work well, as can be seen in Figure 3a. Cloutier (1988) was right to criticize this practice. The frequently used two-point estimate does better on average, but it is imprecise and produces many estimates of α less than 2, as shown in Figure 3b. As von Hippel et al. (2016) noted, “as α approaches 1 from above, the [top bracket] mean grows arbitrarily large very quickly” (pp. 219–20). They recommended constraining the value of α to be greater than or equal to 2 when estimating the arithmetic mean of the top bracket, as shown in Figure 3c. In the MCIB method, the top bracket mean is estimated first, using the grand mean and the lower bracket incomes to calculate a constrained value of α. The estimates thus produced are highly correlated with the values fit by maximum likelihood from the household data (about 0.90, vs. 0.37 for the constrained two-point estimates), as shown in Figure 3d. The extreme outlier on the upper right is Flint, Michigan, the poorest metropolitan area in these data, which has very few households in the topmost bracket (0.6 percent).
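A minimal sketch of the mean-constrained step, assuming the closed-bracket means have already been estimated: the aggregate income not accounted for by the closed brackets is assigned to the top bracket, and α then follows from the standard Pareto mean formula αβ/(α − 1). The function names and toy numbers are hypothetical and are not taken from the authors' software.

```python
def top_bracket_mean(grand_mean, counts, closed_means):
    """Mean of the open-ended bracket implied by the grand mean.

    counts:       household counts for all B brackets (top bracket last)
    closed_means: estimated means for the B-1 closed brackets
    """
    n_total = sum(counts)
    closed_income = sum(n * mu for n, mu in zip(counts[:-1], closed_means))
    return (grand_mean * n_total - closed_income) / counts[-1]


def alpha_from_top_mean(top_mean, beta):
    """Invert the Pareto mean alpha*beta/(alpha-1), valid for alpha > 1."""
    if top_mean <= beta:
        raise ValueError("top bracket mean must exceed its lower bound")
    return top_mean / (top_mean - beta)


# Toy data (hypothetical): three brackets, top bracket starting at 200.
mu_top = top_bracket_mean(105.0, [600, 300, 100], [50.0, 150.0])
alpha = alpha_from_top_mean(mu_top, 200.0)
```

In this toy case the implied top bracket mean is 300 and α is 3. Because the implied mean must exceed the bracket's lower bound, any α obtained this way is greater than 1.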


Figure 3. Alternative estimates of the Pareto shape parameter.

Note: PUMS = Public Use Microdata Sample.

Not surprisingly, given the use of the grand mean, the estimated mean income of the households in the top bracket is much more closely estimated by the MCIB method than by inference from the two-point α, as shown in Figure 4. But both the Pareto shape parameters and the top bracket means are just intermediary variables. The real test of the method is how it performs estimating the variance and inequality measures of interest to researchers, which we address below.


Figure 4. Estimates of top bracket mean.

Note: PUMS = Public Use Microdata Sample.

5.3. Variance, Standard Deviation, and Coefficient of Variation

Table 4 presents several numeric criteria to compare the various methods: four versions of the RPME, two versions of the MGBE, and MCIB with different choices of brackets restricted to the uniform distribution. For comparison, the naive midpoint estimator is also shown, that is, a midpoint estimator using the arithmetic mean and no constraint on the value of α. The results of the naive estimator are quite poor by any standard. Merely by constraining the value of α to be greater than 2, the first RPME method performs substantially better. The bias drops from $7,000 to less than $900, the RMSE falls by more than 90 percent, and the correlation of the estimator and the true value rises from 0.63 to 0.93. The improvement is remarkable when you consider that it is entirely due to the 31 metropolitan areas in which the two-point estimate of α is less than 2; for the remaining metropolitan areas, the estimated standard deviations are identical. The remaining RPME methods, which replace the arithmetic mean with the harmonic mean, geometric mean, or median, have better correlations with the true values but a large negative bias. The first MGBE method selects the best of ten statistical distributions on the basis of the Akaike information criterion. The second provides the average predictions of the models that converge weighted by the Akaike information criterion. As von Hippel et al. (2016) found in their evaluation, the MGBE performs well but not demonstrably better than the RPME.

Table 4. Evaluation of Alternative Methods

The MCIB method, however, performs far better than any of the midpoint or multimodel estimators. All four versions are correlated at greater than 0.99 with the true values. The substitution of a uniform distribution in the first bracket reduces the bias, but the difference is small. It makes little difference whether uniform or linear densities are used in the remaining brackets below the median. In contrast, using uniform densities in all closed brackets degrades the performance significantly, leading to an underestimate of the standard deviations by $2,700 and a doubling of the RMSE. Thus, taking into account the declining density within the higher brackets contributes to more accurate estimates of the standard deviations of household income.

A related measure is the coefficient of variation, defined as the standard deviation divided by the mean, which measures variability of a variable independent of the scale of measurement (Abdi 2010). The MCIB estimates of the coefficient of variation are correlated with the true values calculated from the PUMS at greater than 0.97, whereas the correlations for RPME and MGBE range from 0.80 to 0.83.

5.4. Theil Index

The Theil index is an inequality measure based on principles from information theory (Shannon 1948; Theil 1967). Although the exact form of the measure is not particularly intuitive, the Theil index is based on differences between individual units in the share of total income each possesses (Conceição and Ferreira 2000). Moreover, it is popular with researchers because, unlike the Gini coefficient, its mathematical form allows decomposition of total inequality into the shares that are within and between groups (Shorrocks 1980). Similar to the variance, the Theil index is additively separable across the income brackets. Thus, we can estimate this index in the same manner as the variance after substituting its formula in place of the squared deviations from the mean and computing the appropriate integrals. The Appendix shows the resulting integrals for the linear and Pareto density functions.

The RPME and MGBE estimates are quite good, with correlations between 0.90 and 0.93 to PUMS values, but the MCIB estimates are better. Figure 5 shows the estimates from the RPME-arithmetic estimator and MCIB with a uniform density in the first bracket only compared with the actual values calculated from the household-level data. The RPME is far too low for one metropolitan area: Stamford, Connecticut, the extreme outlier in terms of mean income, variance, and percentage of households in the open-ended bracket. Given that the Theil index is highly sensitive to incomes in the upper tail (Champernowne 1974), placing all households in the upper bracket at that bracket’s mean (however calculated) underestimates the true value. In contrast, the correlation of the MCIB estimates with the true values is 0.9908 (with uniform density in the first bracket), compared with 0.9081 for the RPME with an arithmetic mean, as shown in Table 4. The RMSE for the MCIB estimate is 0.0082, compared with 0.0226 for the RPME estimate, and the bias of MCIB is about one-fourth that of RPME.


Figure 5. Robust Pareto midpoint estimator (RPME) and mean-constrained integration over brackets (MCIB) Theil index estimates.

Note: PUMS = Public Use Microdata Sample.

5.5. Percentiles and Quintile Shares

A number of percentiles and statistics based on them are useful in assessing and comparing income distributions. Key among them are the median, the first and third quartiles, and the interquartile range. Studies of income distribution often track the ratio of the 90th to the 10th percentile and the shares of total income received by different deciles or quintiles of the distribution (Piketty 2015). Table 5 compares the MCIB estimates of a number of percentiles for the 297 metropolitan areas with those calculated from the original PUMS data. (The RPME does not produce these estimates.) Between the 10th and 90th percentiles, the figures match to a high degree of accuracy. Not surprisingly, the 5th and 95th percentiles are not quite as accurate, but they are still highly correlated with the true values. The shares of household income for each quintile are also estimated with great accuracy. For example, the MCIB estimates indicate that the top quintile possesses 48.5 percent of total income on average; the actual figure calculated from the PUMS is 48.7 percent. As a group, the metropolitan area values are correlated at 0.9991. On the basis of these results, the MCIB estimates for quantiles and income shares should be very reliable except for points at the extremes, such as below the 5th or above the 95th percentile.

Table 5. Percentiles and Income Shares by Quintile

Several interesting inequality measures are derived from quantiles of the data. The ratio of the 90th to the 10th percentile averages 10.90 in the PUMS data. The MCIB estimates average 10.85 and are correlated with the true values at 0.973. The average of the estimates of the interquartile range is $59,269, only $34 higher than the figure calculated from the household level data, and the correlation of the estimates with the true values is 0.978.

5.6. Gini Coefficient

The Gini coefficient is based on the mean difference between all pairs of observations (Gini 1912, 1921). There are many computational formulas, but unlike the variance/standard deviation and Theil index, these formulas are not additively separable over the income values that define the brackets. However, the Gini coefficient can also be computed geometrically from a graph of the Lorenz curve, which plots the cumulative proportion of total income against the cumulative proportion of households (Lorenz 1905). A 45-degree line on this graph represents perfect equality; the area between the Lorenz curve and the line of equality divided by the total area under the line of equality is the Gini coefficient. Figure 6 shows the Lorenz curve for the household-level PUMS data for the Los Angeles–Long Beach metropolitan area, on which plotted points have been superimposed, representing the grouped data as calculated by MCIB. The area under the Lorenz curve is computed by adding up the areas of the trapezoids formed by adjacent bracket endpoints and the x axis (Bronfenbrenner 1971:47–50). The Gini calculated from the PUMS data is 0.488, whereas the Gini computed from the Lorenz curve by “connecting the dots” is 0.480.


Figure 6. Lorenz curves for Los Angeles–Long Beach metropolitan area.

Note: PUMS = Public Use Microdata Sample.

The implicit assumption of this approach is that there is no inequality within the brackets; therefore, this point-to-point estimation represents a lower bound estimate (Cowell 1977:127). The straight lines cut slightly above the Lorenz curve, especially in the open-ended top bracket as the figure illustrates. The estimate can be made more precise by increasing the number of points that are plotted. For example, in Figure 6b, each closed-end bracket is divided into five equal intervals by width. The open-ended bracket has no upper bound and therefore no width to divide; however, using equation (14), we can calculate lower and upper bounds for five regions of equal probability. The revised Gini estimate of 0.486 comes closer to the value estimated from the household-level data for Los Angeles–Long Beach.
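The trapezoid calculation, and the effect of adding points, can be sketched as follows. The function name `gini_from_lorenz` is hypothetical, and the example uses an artificial Lorenz curve, L(x) = x², whose exact Gini coefficient is 1/3, rather than any of the metropolitan data.

```python
def gini_from_lorenz(points):
    """Gini coefficient from Lorenz curve points by the trapezoid rule.

    points: (cumulative household share, cumulative income share) pairs,
            including the endpoints (0, 0) and (1, 1).
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid under the curve
    # Area between the curve and the 45-degree line, divided by 0.5.
    return 1.0 - 2.0 * area


# Artificial Lorenz curve L(x) = x**2 sampled at 3 and then 5 points.
coarse = gini_from_lorenz([(x, x ** 2) for x in (0.0, 0.5, 1.0)])
fine = gini_from_lorenz([(x, x ** 2) for x in (0.0, 0.25, 0.5, 0.75, 1.0)])
```

Here the coarse estimate is 0.25 and the finer one is 0.3125; both lie below the exact value of 1/3, and the extra points move the estimate toward it, mirroring the lower-bound behavior described above.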

To assess how the procedure works in general, we estimate the Gini coefficient for all 297 metropolitan areas using five segments per bracket and compare the results with the RPME. As shown in Figure 7, the MCIB approach is far more accurate in these data. The numerical results are also included in Table 4. The fit of the RPME is excellent, with a correlation in the 297 metropolitan areas of 0.9500, but the correlation using the MCIB method is a remarkable 0.9981. This degree of predictive power is all the more impressive when you consider that aside from the dollar amounts of the bracket bounds, the MCIB estimates are based on just 17 numbers per metropolitan area: the counts of the households in the 16 brackets and the grand mean.


Figure 7. Robust Pareto midpoint estimator (RPME) and mean-constrained integration over brackets (MCIB) Gini coefficient estimates.

It is interesting to note that there is little gained by dividing the brackets into smaller and smaller parts. Dividing the original brackets into five segments, as illustrated above, reduces the RMSE from 0.00669 to 0.00233; increasing to 10 segments reduces it to 0.00223; for any number of segments from 50 to 1,000, the RMSE rounds off to 0.00221. Some degree of error always remains, because the distribution within the brackets does not exactly follow the density functions used.

We propose the MCIB as a method of estimating income distribution parameters and inequality measures from grouped data. We show that the MCIB is less biased and more accurate than previously available methods. The improvements in performance are driven by making more complete use of the information typically provided by statistical agencies. First, the relative frequencies of adjoining brackets are used to infer the general trend within closed-end brackets, typically rising on the left side and falling on the right side of the median household income. Second, the within-bracket distributions are used to estimate quantities more accurately within the closed brackets, such as the bracket mean and aggregate income within the bracket. This procedure reduces the sensitivity of the estimates to the exact placement and width of the brackets, a significant advantage in studying trends over time, given that bracket boundaries are often changed explicitly or implicitly through inflation. Third, the overall mean of the distribution, which is normally available, is used in conjunction with the closed bracket means to calculate the top bracket mean, constraining the value of the shape parameter of the Pareto distribution used to model that bracket. Fourth, integrals over each bracket are used to estimate distributional parameters and inequality measures rather than weighted calculations using a single value (the midpoint) to represent each bracket. Fifth, the MCIB method does not impose a particular statistical distribution on the data as a whole. Our inspection of actual income distributions revealed many irregularities. In our method, these irregularities inform the analysis of each bracket, but each bracket is ultimately evaluated separately, allowing a flexible fit to the data.

The resulting estimates of the standard deviation, Theil index, and Gini coefficient are correlated at 0.997, 0.991, and 0.998, respectively, with the values calculated from the PUMS data. Percentiles and income shares for different quantiles and the measures based on them are also estimated with great precision. The method can be extended to include other parameters of the income distribution and measures of inequality. Further research is needed to test the performance of the method in smaller samples and in data from different time periods in which income distributions may have different characteristics. However, the results presented here suggest that this technique could be used to allow more systematic analysis of income distribution data that are presented in grouped form. Researchers trying to use such data have often been hampered by inconsistencies in the number of brackets and the cutoff points between them. By providing reliable estimates of underlying distributional parameters, the MCIB can help overcome these differences in the way data are published and can therefore be a useful tool for investigating longer term trends in inequality and for comparing geographic areas for which individual-level data are not available. By specifying an appropriate density function for the open-ended bracket in place of the Pareto, the method could be extended to other types of data presented in grouped summaries, such as student test scores, housing values, or any size distribution for which individual-level data are not available.

Integrals for the Theil Index

The general pattern for estimating a parameter of the income distribution using MCIB, assuming it is additively decomposable over brackets, is to compute the integral of the equation for the statistic weighted by the linear and Pareto density functions. The Theil index, T, is additively separable over the income brackets:

$$T = \frac{1}{N}\sum_{i=1}^{N}\frac{y_i}{G}\ln\left(\frac{y_i}{G}\right) \approx \sum_{b=1}^{B}\int_{L_b}^{U_b} t(y)\,f_b(y)\,dy$$

in which $t(y) = \frac{y}{G}\ln\left(\frac{y}{G}\right)$ and G is the grand mean. For the uniform (m = 0) and linear (m ≠ 0) brackets, from b = 1 to B − 1, we calculate

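$$\int_{L_b}^{U_b} t(y)\,f_b(y)\,dy = \int_{L_b}^{U_b}\frac{y}{G}\ln\left(\frac{y}{G}\right)\left(\frac{m_b y + c_b}{N}\right)dy = \left.\frac{1}{GN}\left(\ln\left(\frac{y}{G}\right)\left(\frac{m_b y^{3}}{3} + \frac{c_b y^{2}}{2}\right) - \frac{m_b y^{3}}{9} - \frac{c_b y^{2}}{4}\right)\right|_{L_b}^{U_b}.$$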

This quantity is solved bracket by bracket. In the first bracket, the lower bound must be set to $1 instead of zero to avoid a missing value for the natural log of Y over G. Recall that incomes less than or equal to zero were recoded to $1 at the start of the analysis.
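The closed-bracket calculation can be sketched and checked against a brute-force numerical integral. This is an illustration only: the function names are hypothetical, the antiderivative is obtained by integrating y·ln(y/G) and y²·ln(y/G) by parts, and the example bracket, slope, intercept, and grand mean are made-up values.

```python
import math


def theil_antiderivative(y, m, c, grand_mean, n_total):
    """Antiderivative of (y/G) * ln(y/G) * (m*y + c) / N, evaluated at y."""
    log_ratio = math.log(y / grand_mean)
    inner = (log_ratio * (m * y ** 3 / 3.0 + c * y ** 2 / 2.0)
             - m * y ** 3 / 9.0 - c * y ** 2 / 4.0)
    return inner / (grand_mean * n_total)


def theil_closed_bracket(lo, hi, m, c, grand_mean, n_total):
    """Theil contribution of one closed bracket with density (m*y + c) / N."""
    lo = max(lo, 1.0)  # a zero lower bound is moved to $1 so the log is defined
    return (theil_antiderivative(hi, m, c, grand_mean, n_total)
            - theil_antiderivative(lo, m, c, grand_mean, n_total))


def theil_closed_bracket_numeric(lo, hi, m, c, grand_mean, n_total, steps=20000):
    """Midpoint-rule approximation of the same integral, used as a check."""
    lo = max(lo, 1.0)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        total += (y / grand_mean) * math.log(y / grand_mean) * (m * y + c) / n_total
    return total * h
```

For a bracket that lies entirely below the grand mean, the contribution is negative because ln(y/G) is negative throughout; brackets above the grand mean contribute positive amounts.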

For the open-ended bracket, B, we substitute the Pareto density function scaled by the proportion in the upper bracket:

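$$\int_{L_B}^{\infty} t(y)\,f_B(y)\,dy = \int_{L_B}^{\infty}\frac{y}{G}\ln\left(\frac{y}{G}\right)\left(\frac{n_B}{N}\right)\frac{\alpha\beta^{\alpha}}{y^{\alpha+1}}\,dy = \left.-\frac{\alpha\beta^{\alpha} n_B}{G N\, y^{\alpha-1}(\alpha-1)^{2}}\left((\alpha-1)\ln\left(\frac{y}{G}\right) + 1\right)\right|_{L_B}^{\infty} = \frac{\alpha (L_B)^{\alpha} n_B}{G N (L_B)^{\alpha-1}(\alpha-1)^{2}}\left((\alpha-1)\ln\left(\frac{L_B}{G}\right) + 1\right).$$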

The last step follows because as the upper bound of integration approaches infinity, the first term in the definite integral goes to zero as long as α is greater than 1, which it must be in all cases by the definition of the Pareto distribution. Also note that the Pareto parameter β is the minimum income; hence, it is equal to LB, the lower bound of the interval.
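The top-bracket term can be sketched in a few lines. The function name `theil_top_bracket` is hypothetical, and the code uses the simplification that, with β equal to the lower bound of the top bracket, the ratio of β^α to the lower bound raised to α − 1 reduces to the lower bound itself.

```python
import math


def theil_top_bracket(l_top, n_top, alpha, grand_mean, n_total):
    """Theil contribution of the open-ended Pareto bracket (requires alpha > 1).

    l_top: lower bound of the top bracket (also the Pareto minimum beta)
    n_top: number of households in the top bracket
    """
    if alpha <= 1.0:
        raise ValueError("the integral converges only for alpha > 1")
    # beta = L_B, so beta**alpha / L_B**(alpha - 1) simplifies to L_B.
    scale = alpha * l_top * n_top / (grand_mean * n_total * (alpha - 1.0) ** 2)
    return scale * ((alpha - 1.0) * math.log(l_top / grand_mean) + 1.0)
```

Adding this term to the closed-bracket contributions gives the estimated Theil index. Unlike the variance calculation, no finite upper cutoff is needed here because the integral converges for any α above 1.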

Finally, the integrals are summed over all the brackets to generate the estimated Theil index. This pattern may be followed for any statistic of the income distribution whose computation is additively separable across the income brackets, provided the definite integrals exist (or can be estimated by other means).

We are grateful for helpful comments and suggestions from Marie Chevrier, Chris Goodman, Michael Hayes, Ross Matsueda, Austin Nichols, Adam Okulicz-Kozaryn, Sean Reardon, and three anonymous reviewers.

Funding
Support for this research was provided by a grant from the National Science Foundation (award 1636520) and by a 2016–2017 fellowship from the Center for Advanced Study in the Behavioral Sciences, Stanford University, where the article was developed and written. All opinions and errors are the responsibility of the authors.

Abdi, Hervé. 2010. “Coefficient of Variation.” Encyclopedia of Research Design 1:169–71.
Allison, Paul D. 1978. “Measures of Inequality.” American Sociological Review 43(6):865–80.
Atkinson, Anthony B. 1970. “On the Measurement of Inequality.” Journal of Economic Theory 2(3):244–63.
Atkinson, Anthony B., Piketty, Thomas, Saez, Emmanuel. 2011. “Top Incomes in the Long Run of History.” Journal of Economic Literature 49(1):3–71.
Bronfenbrenner, Martin. 1971. Income Distribution Theory. New York: Aldine.
Champernowne, D. G. 1974. “A Comparison of Measures of Inequality of Income Distribution.” Economic Journal 84(336):787–816.
Cloutier, Norman R. 1988. “Pareto Extrapolation Using Grouped Income Data.” Journal of Regional Science 28(3):415–19.
Conceição, Pedro, Ferreira, Pedro. 2000. “The Young Person’s Guide to the Theil Index: Suggesting Intuitive Interpretations and Exploring Analytical Applications.” Retrieved March 6, 2017 (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=228703).
Corcoran, Sean, Evans, William N. 2010. “Income Inequality, the Median Voter, and the Support for Public Education.” Cambridge, MA: National Bureau of Economic Research. Retrieved February 18, 2017 (http://www.nber.org/papers/w16097).
Cowell, F. A. 1977. Measuring Inequality. Oxford, UK: Philip Allan.
Crimi, Nicole, Eddy, William. 2014. “Top-coding and Public Use Microdata Samples from the U.S. Census Bureau.” Journal of Privacy and Confidentiality 6(2). Retrieved (http://repository.cmu.edu/jpc/vol6/iss2/2).
Gini, Corrado. 1912. “Variabilità e Mutabilità.” Reprinted in Memorie Di Metodologia Statistica, edited by Pizetti, E., Salvemini, T. Rome, Italy: Libreria Eredi Virgilio Veschi.
Gini, Corrado. 1921. “Measurement of Inequality of Incomes.” Economic Journal 31(121):124–26.
Heitjan, Daniel F. 1989. “Inference from Grouped Continuous Data: A Review.” Statistical Science 4(2):164–79.
Henson, Mary F. 1967. “Trends in the Income of Families and Persons in the United States, 1947–1964.” Washington, DC: U.S. Department of Commerce, Bureau of the Census.
Ho, Andrew D., Reardon, Sean F. 2012. “Estimating Achievement Gaps from Test Scores Reported in Ordinal ‘Proficiency’ Categories.” Journal of Educational and Behavioral Statistics 37(4):489–517.
Jargowsky, Paul A. 1996. “Take the Money and Run: Economic Segregation in US Metropolitan Areas.” American Sociological Review 61(6):984–98.
Kinney, Satkartar K., Karr, Alan. 2017. “Public-use vs. Restricted-use: An Analysis Using the American Community Survey.” Rochester, NY: Social Science Research Network. Retrieved February 21, 2017 (https://papers.ssrn.com/abstract=2909935).
Liebenberg, Maurice, Kaitz, Hyman. 1951. “An Income Size Distribution from Income Tax and Survey Data, 1944.” Pp. 378–462 in Studies in Income and Wealth, edited by Conference on Research in Income and Wealth. Cambridge, MA: National Bureau of Economic Research. Retrieved February 18, 2017 (http://www.nber.org/chapters/c5728.pdf).
Lorenz, M. O. 1905. “Methods of Measuring the Concentration of Wealth.” Publications of the American Statistical Association 9(70):209–19.
Mandelbrot, Benoit. 1960. “The Pareto-Lévy Law and the Distribution of Income.” International Economic Review 1(2):79–106.
McDonald, James B. 1984. “Some Generalized Functions for the Size Distribution of Income.” Econometrica 52(3):647–63.
Miller, Herman Phillip. 1966. “Income Distribution in the United States.” Washington, DC: U.S. Government Printing Office.
Minoiu, Camelia, Reddy, Sanjay G. 2008. “Estimating Poverty and Inequality from Grouped Data: How Well Do Parametric Methods Perform?” Rochester, NY: Social Science Research Network. Retrieved March 8, 2017 (https://papers.ssrn.com/abstract=925969).
Minoiu, Camelia, Reddy, Sanjay G. 2014. “Kernel Density Estimation on Grouped Data: The Case of Poverty Assessment.” Journal of Economic Inequality 12(2):163–89.
Piketty, Thomas. 2014. Capital in the Twenty-first Century. Cambridge, MA: Belknap.
Piketty, Thomas. 2015. The Economics of Inequality. 3rd ed. Cambridge, MA: Belknap.
Piketty, Thomas, Saez, Emmanuel. 2003. “Income Inequality in the United States, 1913–1998.” Quarterly Journal of Economics 118(1):1–41.
Quandt, Richard E. 1966. “Old and New Methods of Estimation and the Pareto Distribution.” Metrika 10(1):55–82.
Ruggles, Steven, Genadek, Katie, Goeken, Ronald, Grover, Josiah, Sobek, Matthew. 2015. “Integrated Public Use Microdata Series” [Data set]. Minneapolis: University of Minnesota.
Shannon, C. E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal 27(3):379–423, 623–56.
Shorrocks, A. F. 1980. “The Class of Additively Decomposable Inequality Measures.” Econometrica 48(3):613–25.
Soltow, Lee. 1981. “The Distribution of Property Values in England and Wales in 1798.” Economic History Review 34(1):60–70.
Theil, Henri. 1967. Economics and Information Theory. Amsterdam, the Netherlands: North-Holland.
U.S. Census Bureau. n.d. “American Community Survey and Puerto Rico Community Survey: 2011 Subject Definitions.” Retrieved (https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2011_ACSSubjectDefinitions.pdf).
von Hippel, Paul T., Scarpino, Samuel V., Holas, Igor. 2016. “Robust Estimation of Inequality from Binned Incomes.” Pp. 212–51 in Sociological Methodology, Vol. 46, edited by Alwin, Duane F. Thousand Oaks, CA: Sage.

Author Biographies

Paul A. Jargowsky is a professor of public policy and director of the Center for Urban Research and Education at Rutgers University–Camden. He is a fellow at the Century Foundation, an affiliated scholar at the Urban Institute, and a member of the Poverty and Geography Thematic Research Network of the Institute for Research on Poverty. His principal research interests are inequality, the geographic concentration of poverty, and residential segregation by race and class.

Christopher A. Wheeler is the chief data officer of the New Jersey Department of Community Affairs, responsible for the department’s program evaluation and research activities. A former public-sector consulting professional, he has almost six years of experience working with school districts, state, and municipal governments on management and budget issues, including extended engagements with the City of New Orleans and the School District of Philadelphia. His research interests include tax policy, poverty dynamics, economic development, community development, and housing affordability.
