segregsmall: A command to estimate segregation in the presence of small units

Suppose that a population, composed of a minority and a majority group, is allocated into units, which can be neighborhoods, firms, classrooms, etc. Qualitatively, there is some segregation whenever allocation leads to the concentration of minority individuals in some units more than in others. Quantitative measures of segregation have struggled with the small-unit bias. When units contain few individuals, indices based on the minority shares in units are upward biased. For instance, they would point to a positive amount of segregation even when allocation is strictly random. The command segregsmall implements three recent methods correcting for such bias: the nonparametric, partial identification approach of D’Haultfœuille and Rathelot (2017, Quantitative Economics 8: 39–73); the parametric model of Rathelot (2012, Journal of Business & Economic Statistics 30: 546–553); and the linear correction of Carrington and Troske (1997, Journal of Business & Economic Statistics 15: 402–409). The package also allows for conditional analyses, namely, measures of segregation accounting for characteristics of the individuals or the units.


Introduction
We consider a population made of two groups (minority and majority) whose individuals are spread across units. Units can be geographical areas, residential neighborhoods, firms, classrooms, or other clusters provided that every individual belongs to exactly one unit. We seek to measure the extent to which individuals from the minority group are concentrated in some units more than in others. Throughout the article, we follow the literature and use the word "segregation" as a neutral term to refer to such concentration. Measuring the magnitude of segregation is a necessary step to understand the underlying mechanisms and to design adequate policies.
A natural way to measure segregation is to start from the minority shares $X_i/K_i$, where $X_i$ is the number of individuals from the minority group and $K_i$ the number of individuals (or unit size) in unit $i \in \{1, \dots, n\}$, and then compute an inequality index based on the distribution of the proportions $X_i/K_i$ across the $n$ units.
There are two possible benchmarks to assess the magnitude of these indices. Evenness relates to the case where all minority shares $X_i/K_i$ are equal across units. Randomness relates to the case where the underlying allocation assigns minority individuals at random across units. If $p_i$ is the probability that an arbitrary individual in unit $i$ belongs to the minority, randomness means that the probabilities $p_i$ are equal across units $i$. Past research has stressed the difference between both benchmarks, especially when the units are small (Cortese, Falk, and Cohen 1976). The minority share $X_i/K_i$ is only an estimate of $p_i$, and even if $p_1, \dots, p_n$ are all equal, there will be some variation in the $X_i/K_i$, especially if the unit sizes $K_i$ are small. If one is interested in deviations from the randomness benchmark, indices based on minority shares, which measure the deviation from evenness, will overestimate the level of segregation. This issue is known as the small-unit bias.
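To see why the two benchmarks differ for small units, consider a worked example (our own, for illustration): units of size $K = 2$ in which every unit has minority probability $p_i = 1/2$, that is, pure randomness. Because $X_i \sim \mathrm{Bin}(2, 1/2)$,

$$\Pr(X_i/K_i = 0) = \tfrac{1}{4}, \qquad \Pr(X_i/K_i = \tfrac{1}{2}) = \tfrac{1}{2}, \qquad \Pr(X_i/K_i = 1) = \tfrac{1}{4},$$

so the minority shares vary across units although the $p_i$ are all equal: an index based on the $X_i/K_i$ detects a departure from evenness even though allocation is strictly random.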
The problem is pervasive in applied research. Regarding workplace and school segregation, a large share of firms have fewer than 10 employees, and classrooms usually have between 20 and 40 students. The bias also arises when the units are not small per se but only surveys of individuals are available. This is the case when one attempts to measure residential segregation using the local strata of household surveys.
Two main approaches have been proposed in the literature to deal with the small-unit bias. One strand proposes to correct the so-called naive inequality indices based on the minority shares $X_i/K_i$. The idea was initially proposed by Cortese, Falk, and Cohen (1976) and Winship (1977) for the Duncan index. Carrington and Troske (1997, CT hereafter) extend the correction to other indices. Åslund and Skans (2009) adapt it to measure segregation conditional on covariates. Allen et al. (2015) develop another adjustment based on the bootstrap. These corrections all aim at switching the benchmark from evenness to randomness by subtracting an estimate of the bias from the initial, naive index.
Another approach, adopted by Rathelot (2012, R hereafter) and D'Haultfoeuille and Rathelot (2017, HR hereafter), defines segregation through an inequality index based on the unobserved probabilities $p_i$, as a functional of the distribution $F_p$ of $p_i$. In line with the rest of the literature, they assume that, conditional on $K_i$ and $p_i$, the $X_i$ are independent and follow a $\mathrm{Bin}(K_i, p_i)$ distribution. R assumes a mixture of beta distributions for $F_p$ and derives the segregation index as a function of the parameters of the distribution. HR follow a nonparametric method leaving $F_p$ unspecified; they show that the first moments of $F_p$ are identified under the binomial assumption and obtain partial identification results on the segregation measure. Both R and HR construct confidence intervals (CIs) for the segregation indices. HR also extend the methodology to study conditional segregation indices, namely, measures of "net" or "residual" segregation accounting for other covariates (either of units or individuals) that may influence allocation.
The command segregsmall allows social researchers to measure segregation in the context of small units. The command implements the methods proposed by R, HR, and CT. Conditional indices are available for all three methods. With the R and HR methods, the command computes CIs obtained by bootstrap. Finally, the command also implements a test of the binomial assumption. This article describes the command and presents the three methods it implements. Section 2 defines the setup and the parameters of interest and synthesizes the estimation and inference methods of R, HR, and CT. Section 3 details the syntax, options, and stored results of the segregsmall command and discusses its execution time. Section 4 presents an application of the command on French firm data to measure segregation between foreigners and natives across workplaces. Section 5 concludes.

The setting and the parameters of interest
The population studied is assumed to be split into two groups: a group of interest, henceforth the minority group, and the rest of the population. Individuals are distributed across units. For each unit, we assume that there exists a random variable p that represents the probability for any individual belonging to this unit to be a member of the minority. The total number of individuals in a unit is denoted by K.
We now introduce the segregation indices we focus on hereafter. We consider first unconditional indices; conditional indices are introduced in section 2.6. Let us first assume that $K$ is fixed. A segregation index $\theta$ is then a functional of the cumulative distribution function (c.d.f.) $F_p$ of $p$ and of $m_{0,1} = E(p)$; that is, $\theta = g(F_p, m_{0,1})$. Roughly speaking, one expects such an index to be minimal when $F_p$ is degenerate and maximal when $p \in \{0, 1\}$. In the former case, the probability of belonging to the minority is the same in all units, whereas in the latter case, the minority group is concentrated in a subset of units only.
The command segregsmall estimates five classical segregation indices satisfying this property: the Duncan, Theil, Atkinson, Coworker, and Gini indices (their expressions as functionals of $F_p$ and $m_{0,1}$ are given in HR and in appendix A.1).

When $K$ is random and takes values in $\mathcal{K}$, $\theta$ is defined as a weighted average of the indices conditional on $K = k$, denoted $\theta_k = g(F^k_p, m^k_{0,1})$, with $F^k_p$ the c.d.f. of $p$ conditional on $K = k$ and $m^k_{0,1} = E(p \mid K = k)$. Whether we study segregation at the unit level or at the individual level matters for the weights used. The unit-level index $\theta_u$ satisfies

$$\theta_u = \sum_{k \in \mathcal{K}} \Pr(K = k)\, \theta_k, \qquad (1)$$

whereas the individual-level segregation index $\theta_i$ is defined by

$$\theta_i = \sum_{k \in \mathcal{K}} \frac{k \Pr(K = k)}{E(K)}\, \theta_k. \qquad (2)$$

To estimate $\theta$, we assume hereafter that the researcher observes $K$; the probability $p$, however, remains unobserved. Instead, the researcher observes only $X$, the number of individuals belonging to the minority in the unit. By definition of $p$, we have $E(X \mid K, p) = Kp$, which implies that the proportion of individuals from the minority, $X/K$, is an unbiased estimator of $p$. However, because it varies conditional on $p$, $X/K$ is more dispersed than $p$. Thus, for usual segregation indices, including the five above,

$$g(F_{X/K}, m_{0,1}) > g(F_p, m_{0,1}) = \theta.$$

In other words, even in the absence of statistical uncertainty on the distribution of $X/K$, we would still overestimate the segregation index by using $X/K$ in place of $p$. Moreover, this bias increases as $K$ decreases. We refer to this issue as the small-unit bias hereafter.
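The following minimal simulation (ours, not part of the package) illustrates the point: all units share $p = 0.1$, so $V(p) = 0$ and any index based on $p$ is zero, yet the observed shares $X/K$ are clearly dispersed.

* Small-unit bias under strictly random allocation
clear
set seed 12345
set obs 10000                       // 10,000 units
generate K = 5                      // small units of constant size 5
generate X = rbinomial(K, 0.1)      // random allocation with p = 0.1
generate share = X/K
quietly summarize share
display "V(X/K) = " r(Var) "  vs  V(p) = 0"   // close to p(1-p)/K = 0.018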
The binomial assumption. We assume henceforth that individuals are allocated into units independently from each other. Namely, X is assumed to follow, conditional on p and K, a binomial distribution Bin(K, p). This assumption may be restrictive when allocation is in some way sequential and influenced by the composition of units. But more importantly, this assumption is testable (see section 2.5).

Nonparametric approach
Identification. This approach, followed by HR, leaves the distribution $F_p$ of $p$ unrestricted. Combined with the binomial assumption, it entails a nonparametric binomial mixture model for $X$. Let us first suppose that $K$ is constant; if not, we can simply retrieve the aggregated indices $\theta_u$ and $\theta_i$ using (1) and (2). We also assume that $K > 1$; if $K = 1$, the distribution of $X$ is not informative on $\theta$, and we get only trivial bounds on it, namely, 0 and 1 for the five indices above.
First, some algebra yields a one-to-one mapping between the distribution of $X$, defined by the $K$ probabilities $P_0 = (P_{0,1}, \dots, P_{0,K})'$ with $P_{0,j} = \Pr(X = j)$, and the first $K$ moments of $F_p$, denoted $m_0 = (m_{0,1}, \dots, m_{0,K})'$:

$$P_0 = Q\, m_0,$$

where $Q$ is an invertible $K \times K$ matrix. It follows that $m_0$ is identified from the distribution of $X$; hence, any parameter depending only on $m_0$ is point identified. This is the case of the Coworker index as soon as $K \geq 2$. Second, there may be a single distribution $F^*$ corresponding to $m_0$. This happens if (and only if) $m_0$ belongs to the boundary $\partial\mathcal{M}$ of the moment space $\mathcal{M}$. Then, $F^*$ is a discrete distribution with at most $L + 1$ support points, where $L$ is the integer part of $(K + 1)/2$. For instance, when $K = 2$, $\mathcal{M} = \{(m_{0,1}, m_{0,2}) \in [0, 1]^2 : m_{0,1}^2 \leq m_{0,2} \leq m_{0,1}\}$, because $V(p) \geq 0$ and $p^2 \leq p$. Then, $\partial\mathcal{M}$ corresponds to Dirac and Bernoulli distributions, for which we have, respectively, $V(p) = 0$ and $p^2 = p$.
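To make the mapping concrete, the $K = 2$ case can be worked out explicitly (our own derivation): integrating the binomial probabilities over $F_p$ gives

$$P_{0,1} = E\{2p(1-p)\} = 2(m_{0,1} - m_{0,2}), \qquad P_{0,2} = E(p^2) = m_{0,2},$$

which is inverted by $m_{0,2} = P_{0,2}$ and $m_{0,1} = P_{0,1}/2 + P_{0,2}$.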
When $m_0$ belongs to the interior $\mathcal{M}^\circ$ of the moment space, there are infinitely many distributions $F_p$ corresponding to $m_0$. Then, unless we consider the Coworker index, $\theta$ is not identified in general. Nevertheless, HR show that the sharp identified set on $\theta$ can be computed relatively easily under the following restriction.

Assumption 1. The index can be written as $\theta = \int h(x, m_{0,1})\, dF_p(x)$ for some known function $h$.

Assumption 1 fails for the Gini but is satisfied by the Duncan, the Theil, the Atkinson, and the Coworker indices. Under this condition, the bounds on $\int h(x, m_{0,1})\, dF(x)$, and thus on $\theta$, are attained by distributions with no more than $K + 1$ support points. Specifically, let $\mathcal{D}^{K+1}_{m_0}$ denote the set of distributions on $[0, 1]$ with at most $K + 1$ support points for which the vector of first $K$ moments equals $m_0$. Then, the sharp identified set on $\theta$ is $[\underline{\theta}, \overline{\theta}]$, with

$$\underline{\theta} = \min_{F \in \mathcal{D}^{K+1}_{m_0}} \int h(x, m_{0,1})\, dF(x), \qquad \overline{\theta} = \max_{F \in \mathcal{D}^{K+1}_{m_0}} \int h(x, m_{0,1})\, dF(x). \qquad (3)$$

The following theorem, which reproduces theorem 2.1 of HR, summarizes the previous discussion. Hereafter, we let $\underline{\theta}$ and $\overline{\theta}$ denote the sharp lower and upper bounds on $\theta$, whether or not $\theta$ is point identified.

Theorem 1. Suppose assumption 1 holds. If $m_0 \in \mathcal{M}^\circ$, the sharp identified set on $\theta$ is $[\underline{\theta}, \overline{\theta}]$ defined by (3). If $m_0 \in \partial\mathcal{M}$, $\theta$ is point identified: $\underline{\theta} = \overline{\theta} = g(F^*, m_{0,1})$, where $F^*$ is the unique c.d.f. for which the first $K$ moments are equal to $m_0$. Moreover, $F^*$ has at most $L + 1$ support points.
In the interior case, computing the bounds still requires a nonlinear optimization under constraints that are also nonlinear in the support points. Yet the problem can be simplified further, under an additional assumption, using the theory of Chebyshev systems. In particular, this requires that the function $h$ in assumption 1 not depend on $m_{0,1}$, a condition satisfied by the Theil and Atkinson indices. For those two indices, no numerical optimization is needed to compute the bounds $\underline{\theta}$ and $\overline{\theta}$. The idea is that the bounds are attained by two special discrete distributions, called principal representations. Finding the principal representations boils down to obtaining the roots of specific polynomials, which is much simpler and faster than solving (3). We refer to HR for more details on this matter.
Estimation. Let us assume we have a sample $(X_i)_{i=1,\dots,n}$ of independent and identically distributed variables with constant unit sizes equal to $K > 1$. Theorem 1 shows that $\theta$ is either point or partially identified, depending on whether $m_0 \in \partial\mathcal{M}$ or $m_0 \in \mathcal{M}^\circ$. We follow this result to estimate $(\underline{\theta}, \overline{\theta})$. First, we estimate $P_0$, and thus $m_0 = Q^{-1} P_0$, by constrained maximum likelihood. The constraints come from the binomial mixture model: $P_0 \in \mathcal{P} = \{Qm : m \in \mathcal{M}\}$. To compute the constrained maximum-likelihood estimator (MLE), HR show lemma 1 below. Let us define $N_k = \sum_{i=1}^n 1\{X_i = k\}$, the number of units with exactly $k$ minority individuals.

Lemma 1. The constrained MLE $\widehat{P} = (\widehat{P}_1, \dots, \widehat{P}_K)'$ satisfies $\widehat{P} = Q\widehat{m}$, where $\widehat{m}$ collects the first $K$ moments of the discrete distribution with support points $\widehat{x} = (\widehat{x}_1, \dots, \widehat{x}_{L+1})$ and weights $\widehat{y} = (\widehat{y}_1, \dots, \widehat{y}_{L+1})$ that maximize the sample log likelihood.

Second, we estimate $(\underline{\theta}, \overline{\theta})$. We first check whether $\widehat{m} \in \partial\mathcal{M}$. A simple possibility to do so is to check whether the unconstrained MLE $\widetilde{P} = (\widetilde{P}_1, \dots, \widetilde{P}_K)'$ satisfies $\widetilde{P} = \widehat{P}$ (in which case $\widehat{m} \in \mathcal{M}^\circ$ with probability approaching one). Note that the unconstrained MLE simply satisfies $\widetilde{P}_k = N_k/n$ for all $k$.
When $\widetilde{P} \neq \widehat{P}$, we simply let $\widehat{\underline{\theta}} = \widehat{\overline{\theta}} = g(\widehat{F}, \widehat{m}_1)$, where $\widehat{F}$ is the distribution corresponding to $(\widehat{x}, \widehat{y})$. We refer to this situation as the constrained case. If $\widetilde{P} = \widehat{P}$, there are infinitely many distributions corresponding to $\widehat{m}$, and we estimate bounds for $\theta$. We refer to this situation as the unconstrained case. For the Theil and Atkinson indices, the estimated bounds are obtained from the principal representations computed from $\widehat{m}$. For the Duncan index, optimization is required to obtain the estimated bounds: we obtain estimators of $\underline{\theta}$ and $\overline{\theta}$ by solving (3), replacing $m_0$ with its estimator $\widehat{m}$. Finally, the Coworker index depends only on $(m_{0,1}, m_{0,2})$. Thus, whether or not $\widetilde{P} = \widehat{P}$, this index can be estimated directly by replacing $(m_{0,1}, m_{0,2})$ with $(\widehat{m}_1, \widehat{m}_2)$.
Inference. When assumption 1 holds, HR show that the estimators of the bounds are consistent: $(\widehat{\underline{\theta}}, \widehat{\overline{\theta}}) \stackrel{P}{\longrightarrow} (\underline{\theta}, \overline{\theta})$ as the number of units $n$ tends to infinity. Under additional assumptions, HR characterize their asymptotic distributions. This enables one to build valid asymptotic CIs for the index $\theta$ using a modified bootstrap procedure. The construction needs to account for the fact that the lower and upper bounds collapse when $m_0 \in \partial\mathcal{M}$ (point identification), whereas they differ when $m_0 \in \mathcal{M}^\circ$ (partial identification). The underlying idea relates to the construction of CIs under partial identification (see Imbens and Manski [2004] and Stoye [2009]). HR define a CI for the interior case, where only one of the two ends of the interval matters in the asymptotic coverage, and another for the boundary case. To obtain the nominal asymptotic coverage in all situations, HR define the final CI by selecting one of them according to the length of the estimated identification interval $(\widehat{\overline{\theta}} - \widehat{\underline{\theta}})$ relative to sampling error.

Random unit size. The previous identification and estimation results can be adapted to cases where $K$ is random and takes values in $\mathcal{K}$. Using the definitions of $\theta_u$ and $\theta_i$ in (1) and (2), the idea is to reason conditional on the unit size to get each $\theta_k$, $k \in \mathcal{K}$, and to replace the theoretical weights by plug-in estimators. More precisely, let $\widehat{\underline{\theta}}_k$ and $\widehat{\overline{\theta}}_k$ denote the estimators of the bounds of $\theta_k$ based on the subsample of units of size $k$, and let $\widehat{\Pr}(K = k) = n^{-1} \sum_{i=1}^n 1\{K_i = k\}$. Then, the estimators of the bounds on $\theta_u$ and $\theta_i$ satisfy

$$\widehat{\underline{\theta}}_u = \sum_{k \in \mathcal{K}} \widehat{\Pr}(K = k)\, \widehat{\underline{\theta}}_k, \qquad \widehat{\underline{\theta}}_i = \sum_{k \in \mathcal{K}} \frac{k\, \widehat{\Pr}(K = k)}{\sum_{k' \in \mathcal{K}} k'\, \widehat{\Pr}(K = k')}\, \widehat{\underline{\theta}}_k,$$

and similarly for the upper bounds. Remark that as soon as the index $\theta_k$ is not point identified for one size $k$, the resulting aggregated index will be partially identified too. To obtain point identification of $\theta_u$ or $\theta_i$, one must be in the constrained case for each $k \in \mathcal{K}$. This is unlikely to happen when the support of $K$ contains very small sizes $k$, typically lower than 10.
As in the constant-size case, CIs for the aggregated indices $\theta_u$ and $\theta_i$ are constructed by the modified bootstrap procedure detailed in HR. The randomness of $K$ just involves an additional step that consists in drawing $K$ from its empirical distribution.
Assuming independence between K and p. The previous estimation and inference procedures are fully agnostic as regards possible dependence between $K$ and $p$, which is a safe option when unit size may be a potential determinant of segregation. However, if one is ready to impose independence between these two variables, the identified bounds on $\theta_u = \theta_i$ get closer to each other. This is because the $F^k_p$ coincide with the unconditional distribution of $p$. Thus, we can gather all units and identify the first $\overline{K}$ moments of $F_p$, with $\overline{K} = \max(\mathcal{K})$. Estimation and inference are performed as in the case of constant unit size, with $K$ replaced by $\overline{K}$. Thus, assuming independence between $K$ and $p$ leads to an improvement in identification because we identify more moments. It also leads to more accurate estimators because one estimates a single vector $P$ on the whole sample, instead of doing so on each subsample $\{i : K_i = k\}$, for all $k \in \mathcal{K}$.

An important particular case occurs when only some individuals are observed in the unit (for example, survey data). Imagine units are of sizes $(K_i)_{i=1,\dots,n}$ but that, for each unit $i$, only $n_{K,i}$ individuals are sampled and observed. We let $X_i$ denote the number of individuals belonging to the reference group in this subgroup of $n_{K,i}$ people. As previously, $X_i$ follows a binomial distribution $\mathrm{Bin}(n_{K,i}, p_i)$ conditional on $p_i$ and $n_{K,i}$. The previous results apply by simply replacing the unit size $K$ with the number $n_K$ of individuals observed in each unit. Moreover, in such settings, it is usually plausible to assume that the random variable $n_K$ is independent of $p$ conditional on the unit size, because $n_K$ depends on the survey process, which, a priori, is orthogonal to the segregation phenomenon.

Parametric approach
This approach, followed by R, is like that of HR, except that it imposes a parametric restriction on $F_p$. Specifically, $F_p$ is supposed to be a mixture of beta distributions. Combined with the binomial assumption for the conditional distribution of $X$, the model becomes fully parametric and thus can be estimated by maximum likelihood. The indices are therefore point identified, contrary to the nonparametric approach of HR.
A concern might be that the parametric restriction leads to invalid results when the model is misspecified. However, R shows through simulations that segregation indices associated with various distributions, both continuous and discrete, are accurately proxied by his parametric approach.
Estimation and inference. As in HR, we first assume that $K$ is constant. Let $B(\cdot, \cdot)$ denote the beta function, $c$ the number of components of the beta mixture, and $v = (\alpha_j, \beta_j, \lambda_j)_{j=1,\dots,c}$ the vector of parameters, with $(\alpha_j, \beta_j) \in \mathbb{R}^*_+ \times \mathbb{R}^*_+$ the two shape parameters of the $j$th beta distribution and $\lambda_j \in [0, 1]$ its weight ($\sum_{j=1}^c \lambda_j = 1$). The probability density function of $p$, distributed as a $c$-component mixture of beta distributions with parameters $v$, is

$$f_v(p) = \sum_{j=1}^{c} \lambda_j\, \frac{p^{\alpha_j - 1} (1 - p)^{\beta_j - 1}}{B(\alpha_j, \beta_j)}.$$

In this model, the probability that $k$ individuals belong to the minority group can be written, after some algebra, as

$$\Pr(X = k) = \binom{K}{k} \sum_{j=1}^{c} \lambda_j\, \frac{B(\alpha_j + k,\ \beta_j + K - k)}{B(\alpha_j, \beta_j)}.$$

Thus, the log likelihood satisfies, up to terms independent of the parameter $v$,

$$\mathcal{L}_n(v) = \sum_{i=1}^{n} \ln\left\{ \sum_{j=1}^{c} \lambda_j\, \frac{B(\alpha_j + X_i,\ \beta_j + K - X_i)}{B(\alpha_j, \beta_j)} \right\}.$$

With the parametric assumption on $F_p$, the MLE $\widehat{v}$ translates into an estimator $\widehat{F}_p$ of the distribution of $p$, which in turn yields an estimator $\widehat{\theta}$ of $\theta$. The explicit expressions of the five indices above, as functions of the parameter $v$, are given in appendix A.1. Inference can be achieved by the delta method or by the bootstrap, performed at the unit level.
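To illustrate the likelihood, here is a minimal sketch of how the one-component case ($c = 1$, a beta-binomial model) could be fit with Stata's ml command. The program name mybetabin and the variable names K and X are ours; segregsmall's internal implementation may differ.

* Sketch: beta-binomial log likelihood (c = 1), using
* ln B(a,b) = lngamma(a) + lngamma(b) - lngamma(a+b)
program define mybetabin
    args lnf lnalpha lnbeta
    tempvar a b
    quietly generate double `a' = exp(`lnalpha')    // enforces alpha > 0
    quietly generate double `b' = exp(`lnbeta')     // enforces beta > 0
    quietly replace `lnf' = lngamma(K+1) - lngamma(X+1) - lngamma(K-X+1) ///
        + lngamma(`a'+X) + lngamma(`b'+K-X) - lngamma(`a'+`b'+K)         ///
        + lngamma(`a'+`b') - lngamma(`a') - lngamma(`b')
end

. ml model lf mybetabin (lnalpha:) (lnbeta:)
. ml maximize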
Random unit size. The adaptation to this case exactly mirrors the HR method. For each $k \in \mathcal{K}$, the MLE of $\theta_k$ is obtained using the subsample of units of size $k$. The weights are estimated by their empirical counterparts. The estimated aggregated indices are then obtained by plug-in, using (1) and (2). When $K$ and $p$ are assumed independent, all units can be pooled, independently of their size, to compute the MLE of $v$ on the whole sample. As above, the resulting estimator $\widehat{v}$ allows us to estimate the distribution of $p$, and then $\theta$.

Correction of the naive index
The approaches of HR and R are immune to the small-unit bias because they directly estimate $g(F_p, m_{0,1})$. Other, earlier approaches rather start from the naive index $\theta_N = g(F_{X/K}, m_{0,1})$ and attempt to modify it so that the parameter becomes less sensitive to changes in $K$. We present here the correction proposed by CT, which is the most popular in applied work.
CT's correction relies on the distinction between the randomness and evenness benchmarks, introduced notably by Cortese, Falk, and Cohen (1976) and Winship (1977). Evenness corresponds to $X/K$ being constant, whereas randomness refers to the case where $p$ is constant. Under the binomial model, however, evenness cannot occur. The central idea of CT is then to convert $\theta_N$, which measures the departure from evenness, into a distance to randomness. Let $\theta^{ra}_N$ denote $g(F_{X^{ra}/K}, m_{0,1})$, where $X^{ra} \mid K \sim \mathrm{Bin}\{K, E(X/K)\}$; $X^{ra}/K$ is the proportion we would observe if $p$ were constant and equal to $E(p) = E(X/K)$. Then, assuming that $\theta \in [0, 1]$, a constraint satisfied by the five indices above, CT's correction $\theta_{CT}$ is defined by

$$\theta_{CT} = \frac{\theta_N - \theta^{ra}_N}{1 - \theta^{ra}_N}.$$

CT suggest the following simulation-based estimator of $\theta_{CT}$. Let $\widehat{E}(p)$ denote the sample average of $X/K$. For $s = 1, \dots, S$, draw $X^{ra}_{i,s} \sim \mathrm{Bin}\{K_i, \widehat{E}(p)\}$ independently for each unit $i$. Then, letting $\widehat{F}^{ra}_s$ and $\widehat{m}_{1,s}$ denote, respectively, the empirical distribution and mean of $(X^{ra}_{i,s}/K_i)_{i=1,\dots,n}$, compute $\widehat{\theta}^{ra}_{N,s} = g(\widehat{F}^{ra}_s, \widehat{m}_{1,s})$. The estimator of $\theta^{ra}_N$ is then the mean over the $S$ replications, $\overline{\theta}^{ra}_N = S^{-1} \sum_{s=1}^S \widehat{\theta}^{ra}_{N,s}$, and $\theta_{CT}$ is estimated by $\widehat{\theta}_{CT} = (\widehat{\theta}_N - \overline{\theta}^{ra}_N)/(1 - \overline{\theta}^{ra}_N)$, with $\widehat{\theta}_N$ the plug-in estimator of $\theta_N$. The quantiles of $(\widehat{\theta}^{ra}_{N,s})_{s=1,\dots,S}$ can be used to test that the data are consistent with random allocation using randomization tests (see Boisso et al. [1994] and CT).
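For concreteness, here is a minimal sketch of this simulation-based correction for the Duncan index, written for a unit-level dataset with variables K and X (all other names are ours for illustration; segregsmall implements the correction for all five indices):

* Sketch of CT's correction for the Duncan index,
* D = (1/2) * sum_i |X_i/Xtot - (K_i - X_i)/(Ktot - Xtot)|
set seed 12345
local S = 50                               // number of random-allocation draws
quietly summarize X, meanonly
local Xtot = r(sum)
quietly summarize K, meanonly
local Ktot = r(sum)
local phat = `Xtot'/`Ktot'                 // estimate of E(p)

* naive index
tempvar d
generate double `d' = abs(X/`Xtot' - (K - X)/(`Ktot' - `Xtot'))/2
quietly summarize `d', meanonly
local Dnaive = r(sum)

* mean naive index over S draws under random allocation
local Dra = 0
forvalues s = 1/`S' {
    tempvar Xra dra
    generate double `Xra' = rbinomial(K, `phat')
    quietly summarize `Xra', meanonly
    local Xrtot = r(sum)
    generate double `dra' = abs(`Xra'/`Xrtot' - (K - `Xra')/(`Ktot' - `Xrtot'))/2
    quietly summarize `dra', meanonly
    local Dra = `Dra' + r(sum)/`S'
    drop `Xra' `dra'
}

display "naive: " `Dnaive' _n "random: " `Dra' _n "CT-corrected: " (`Dnaive' - `Dra')/(1 - `Dra')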
Links with HR and R. In general, $\theta_{CT} \neq \theta$. They do coincide, however, in the extreme cases of no segregation, where $p$ is constant, and of "full" segregation, where $p$ follows a Bernoulli distribution. We refer to section 2.3 of R and section 2.4 of HR for further discussion of the relationship between $\theta_{CT}$ and $\theta$.

Test of the binomial assumption
We have relied so far on the binomial assumption $X \mid K, p \sim \mathrm{Bin}(K, p)$. This assumption implies that $P_0 \in \mathcal{P} = \{Qm : m \in \mathcal{M}\}$. A vector $(m_1, \dots, m_K)$ in $\mathcal{M}$ has to satisfy some restrictions, such as $m_2 \geq m_1^2$ (that is, nonnegative variance). Hence, we could have $Q^{-1} P_0 \notin \mathcal{M}$ if the distribution of $X$ conditional on $K$ and $p$ is not binomial. In other words, the binomial assumption is testable.
HR propose a likelihood-ratio (LR) test of $P_0 \in \mathcal{P}$, where the constrained estimator under the null hypothesis is $\widehat{P}$, whereas the unconstrained MLE is $\widetilde{P}$. Note that these estimators are already computed to estimate $(\underline{\theta}, \overline{\theta})$. For a unit size equal to $k$, the test statistic satisfies

$$LR_{n,k} = 2 \sum_{j} N_{j,k} \ln\left(\widetilde{P}_{j,k} \big/ \widehat{P}_{j,k}\right),$$

where $N_{j,k}$ is the number of units of size $k$ with $j$ minority individuals, and $\widetilde{P}_{j,k}$ and $\widehat{P}_{j,k}$ are the corresponding unconstrained and constrained MLEs. With a random unit size, the test statistic is then $LR_n = \sum_{k \in \mathcal{K}} \widehat{\Pr}(K = k)\, LR_{n,k}$. The critical values of the test are obtained by approximating the distribution of $LR_n$ under the null by bootstrap. The bootstrap is performed as follows. First, we draw $n$ units of sizes $(K^*_i)_{i=1,\dots,n}$ from the empirical distribution of $K$; then, each $X^*_i$ is drawn under the null hypothesis, from the constrained estimated distribution $\widehat{P}$ given $K^*_i$, and the bootstrap statistic $LR^*_n$ is computed on the resulting sample. For a level $1 - \alpha \in (0, 1)$, the critical region of the test is defined by $\{LR_n > q^*_{1-\alpha}\}$, with $q^*_{1-\alpha}$ the empirical quantile of order $1 - \alpha$ of the bootstrapped statistics. The results of HR imply that the test has an asymptotic level equal to $\alpha$ and is consistent. Remark, however, that it tests $P_0 \in \mathcal{P}$, which is an implication of the binomial assumption, rather than this assumption itself. This means that the binomial assumption may fail while $P_0 \in \mathcal{P}$: $X \mid K, p$ could fail to be binomial, yet the distribution of $X$ given $K$ could be rationalized by a binomial mixture.
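In our notation, with $B$ bootstrap repetitions and $LR^*_{n,b}$ the statistic computed on the $b$th bootstrap sample drawn under the null, the bootstrap p-value is

$$\widehat{pv} = \frac{1}{B} \sum_{b=1}^{B} 1\{LR^*_{n,b} \geq LR_n\},$$

and rejecting when $\widehat{pv} \leq \alpha$ corresponds to the critical region above.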

Conditional segregation indices
Conditional indices aim at accounting for the fact that part of the segregation along the dimension at stake may be driven by sorting according to other dimensions. In this sense, they measure the net or residual level of segregation, once the contribution of covariates to segregation is removed (see Åslund and Skans [2009]). To illustrate this point, let us consider workplace segregation between foreigners and natives. Foreigners may be hired more in some sectors of the economy on the basis of sector-specific skills. Imagine an extreme case where, within each sector, all firms hire foreigners with the same probability. As long as these probabilities differ from one sector to another, an unconditional segregation index would be positive. On the contrary, the conditional index defined in (4) below would indicate no segregation because it controls for the influence of the sector, a characteristic of units, in the allocation process. Similarly, foreigners may be hired with the same probability for all low-skilled jobs (respectively, all high-skilled jobs), but the probabilities for these two types of job may differ. In this case again, failing to account for this characteristic would lead to a positive unconditional index, while the conditional index defined in (5) below would indicate no segregation.
The previous discussion underscores that covariates can be defined either at the unit level or at the level of an individual (or of a position). We separate the two cases below because they lead to different treatments.
Unit-level covariates. Let $Z \in \{1, \dots, \overline{Z}\}$ denote a characteristic of a unit, which is assumed to be discrete. To account for $Z$ in the allocation process, we measure segregation conditional on $Z$. For each $z \in \{1, \dots, \overline{Z}\}$, let $\theta_{0z}$ denote the segregation index of interest conditional on $Z = z$. The subscript 0 indicates that we consider a generic index, which could correspond to either $\theta$ or $\theta_{CT}$. Whatever the index, the estimation of $\theta_{0z}$ is done exactly as in the unconditional case, focusing on the subsample $\{i : Z_i = z\}$.
The index $\theta_{0z}$ can be of interest by itself. We can also consider an aggregate conditional index defined as

$$\theta^{cond}_{0,u} = \sum_{z=1}^{\overline{Z}} \Pr(Z = z)\, \theta_{0z}. \qquad (4)$$

The estimation of $\theta^{cond}_{0,u}$ is obtained by plug-in, with $n^{-1} \sum_{i=1}^n 1\{Z_i = z\}$ the empirical counterpart of $\Pr(Z = z)$. For the HR and R methods, a bootstrap procedure similar to that of the random-size case provides asymptotic CIs for $\theta^{cond}_{0,u}$; the initial step of the bootstrap becomes drawing units from the joint empirical distribution of $(K, Z)$.

Individual- or position-level covariates. Let $W \in \{1, \dots, \overline{W}\}$ denote a characteristic of an individual or of a position. To resume the example of workplace segregation, a characteristic attached to individuals can be education, whereas a characteristic linked to positions can refer to the type of occupation (for example, high skilled versus low skilled). While these two forms of covariates may lead to different interpretations, they are similar as regards estimation and inference.

For each unit and each type $w \in \{1, \dots, \overline{W}\}$, we suppose we observe $X_w$ and $K_w$, which are, respectively, the number of individuals with characteristic $W = w$ (or in positions satisfying $W = w$) who belong to the minority group and the overall number of individuals (or positions) of type $W = w$ in the unit. As above, we define $\theta_{0w}$ as the segregation index of interest conditional on $W = w$. With individual- or position-level covariates, the idea is to consider the subsample of individuals (or positions) such that $W = w$, instead of a subsample of units. Hence, $\theta_{0w}$ can be estimated exactly as in the unconditional case, simply using $(X_w, K_w)$ instead of $(X, K)$. Remark that, in the general random-size case without assuming $K \perp\!\!\!\perp p$, it makes more sense to consider the index $\theta_i$, which uses individual-level weights (rather than unit-level ones), because the types are defined at this individual or position level.

Again, $\theta_{0w}$ might be a relevant parameter of interest on its own. Researchers can also be interested in an aggregated conditional index:

$$\theta^{cond}_{0,i} = \sum_{w=1}^{\overline{W}} \Pr(W = w)\, \theta_{0w}. \qquad (5)$$

The estimation of $\theta^{cond}_{0,i}$ is obtained by plug-in, with $(\sum_{i=1}^n K_{wi})/(\sum_{i=1}^n K_i)$ the empirical counterpart of $\Pr(W = w)$. For the HR and R methods, as previously, a modified bootstrap procedure provides asymptotic CIs for $\theta^{cond}_{0,i}$; the initial step of the bootstrap becomes drawing units from the empirical distribution of units, hence keeping fixed the composition of the units with respect to $W$.

The segregsmall command
The segregsmall command is compatible with Stata 14.2 and later versions.

Description and main options
The command segregsmall estimates the five classical segregation indices mentioned above (Duncan, Theil, Atkinson, Coworker, and Gini) using the HR, R, or CT method. It provides CIs obtained by bootstrap in the approaches of HR and R and allows for conditional analysis for all three methods.
format(format) indicates the format of the dataset used and needs to be either unit (datasets where an observation is a unit) or indiv (datasets where an observation is an individual). format() is required. The option determines the variables to be put in varlist. For unconditional analyses (the default without the option conditional()), these are the following:

• K X for unit-level datasets; K and X correspond to the variables $K$ and $X$ introduced in section 2: the number of individuals and the number of minority individuals, respectively. K has to contain strictly positive integers and X nonnegative integers, with X lower than or equal to K for each unit.
• id_unit I_minority for individual-level datasets; id_unit is the identifier of the unit the individual belongs to, and I_minority is a dummy variable equal to 1 when the individual belongs to the minority group and 0 otherwise.

method(method) specifies the method used to compute the segregation indices. method must be one of np, beta, or ct. method() is required. The argument np, standing for nonparametric, implements the HR method. The command does not report the Gini index in this case because it does not satisfy assumption 1. The choice beta implements R's method assuming a beta distribution for $F_p$.8 Both methods provide estimates of the same parameters of interest, namely, $\theta$ if $K$ is fixed and, unless independencekp is specified, $(\theta_u, \theta_i)$ if $K$ is random. By default, they report asymptotic CIs obtained by bootstrap. With the argument ct, the command estimates the naive and CT-corrected indices $\theta_N$ and $\theta_{CT}$. Confidence intervals are not computed for these parameters.

8. R assumes a mixture of beta distributions. However, simulations reveal that the differences between the indices obtained with a two- or higher-component mixture versus a simple beta are marginal in most cases, so segregsmall uses a beta assumption for simplicity. The command also allows one to assess the reliability of this restriction because the indices obtained with the beta restriction can be compared with the nonparametric estimates that leave $F_p$ unrestricted.
conditional(conditional) estimates conditional segregation indices. conditional must be either unit or indiv, and it specifies the level at which the covariates included in the analysis are defined. For conditional analyses, varlist has to be

• K X Z for unit-level datasets, or id_unit I_minority Z for individual-level datasets, with covariates defined at the unit level (unit). The variables K, X, id_unit, and I_minority are the same as in unconditional analyses. Z corresponds to the variable $Z$, the characteristic of units defined in section 2.6. Z needs to take values in $\{1, 2, \dots, \overline{Z}\}$ with $\overline{Z} \geq 2$.
• id_unit I_minority W for individual-level datasets with covariates defined at the level of individuals or any subunit level (indiv). W corresponds to the variable $W$, the individual (or position) characteristic introduced in section 2.6. W has to take values in $\{1, 2, \dots, \overline{W}\}$ with $\overline{W} \geq 2$.
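To fix ideas, typical calls look as follows (the dataset is assumed to be in memory, with the variable names used above):

. segregsmall K X, format(unit) method(np)
. segregsmall K X, format(unit) method(beta) independencekp
. segregsmall id_unit I_minority, format(indiv) method(ct)
. segregsmall K X Z, format(unit) method(np) conditional(unit)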

Additional options
withsingle includes single units (units with only one individual) in the analysis. As explained in section 2.2, single units are in general uninformative about the level of segregation. By default, they are excluded from the data used. The option is available for both unconditional and conditional analyses.
excludingsinglepertype excludes single cells (unit × type) from the analysis. The option is relevant and available only in conditional analyses with covariates defined at the individual or subunit level. In this setting, the role of a unit in unconditional analyses is played by a cell defined as the intersection of a unit and an individual type (see section 2.6). As just described, units with only one individual are dropped by default. Yet this does not prevent the existence of single cells coming from units with more than one individual but that have only one individual of a given type W = w. Without option excludingsinglepertype, those single cells are included in the analysis, which can lead to wide estimated identified intervals in the HR method, especially when the number W of types is large. With the option, they are dropped. For consistency, the options withsingle and excludingsinglepertype are mutually exclusive.
independencekp assumes independence between K and p. The option is available only with method(np) and method(beta).
repbootstrap(#) specifies the number of bootstrap repetitions used to construct CIs with method(np) and method(beta). The default is repbootstrap(200). It is also the number of bootstrap repetitions used to test the binomial assumption.
level(#) sets the confidence level, which has to be a scalar in (0, 1). With method(np) and method(beta), by default, the traditional 90%, 95%, and 99% confidence levels are stored (see section 3.4), and the 95% CI is displayed in the Stata output. The option permits saving and displaying a personalized level in addition (the other three are still stored). With method(ct), by default, the empirical quantiles of the index under random allocation are stored for the orders 0.01, 0.05, 0.10, 0.90, 0.95, and 0.99. The option additionally saves the empirical quantiles at orders $\tau$ and $1 - \tau$, with $\tau$ the argument of the option.
noci restricts the command to estimation: CIs are not computed. The option is applicable only to method(np) and method(beta).
testbinomial implements the test of the binomial assumption. More precisely, with method(np) and without the options independencekp or noci, the test is performed by default and stored: the option only displays the result in the Stata output. In any other situation (method(beta) or method(ct), no CIs, or assuming $K \perp\!\!\!\perp p$), the option performs the test in addition to estimation and potential inference. In both cases, the number of bootstrap repetitions used for the test is the one specified by the option repbootstrap(). When the user wants to test the binomial assumption, we recommend always doing so combined with inference using the HR method in the general case (namely, without assuming independence between $K$ and $p$): together with the test, it will give estimation and CIs from method(np) virtually for free. The option is available only in unconditional analyses.9

9. The test of the assumption type by type can be done manually by restricting the sample used through the qualifiers if and in.

repct(#) sets the number $S$ of draws used to estimate $\theta^{ra}_N$ in CT's correction. Its argument needs to be a positive integer. The default is repct(50).

atkinson(#) allows the user to specify the parameter $b$ of the Atkinson index. Its argument has to be a real number in (0, 1). The default is atkinson(0.5); it is the only value that ensures the symmetry property for the Atkinson index (that is, the index does not change when swapping the minority and majority labels).

Stored results
The objects stored by segregsmall depend on the options, in particular, whether the analysis is unconditional or conditional. They can be gathered into three types of information about i) the data included in the analysis, ii) the method and assumptions used, and iii) the estimation and inference results.
In this section, we list the objects stored in e() by the command and detail their contents when they relate to estimation and inference results. The remaining objects have self-explanatory names and are described in the help page of the segregsmall command.
The matrices whose names include estimates_ci store the results of estimation and possible inference. The content of e(estimates_ci) varies with the method used, but its structure remains similar. Each row corresponds to an index.
With method(beta), 10 rows represent the 2 possible aggregated indices $\theta_u$ (unit-level weights) and $\theta_i$ (individual-level weights), when $K$ is considered random, for each of the 5 indices (Duncan, Theil, Atkinson, Coworker, and Gini). For each possible index (weights × mapping), the columns store the estimated index using the R method with a beta-distribution restriction on $F_p$ and asymptotic CIs at the traditional 90%, 95%, and 99% levels (plus the one specified by level(), if any).
With method(np), the rows are identical, but there are only 8 of them because the Gini indices are absent. For each possible index (weights × mapping), the columns of e(estimates_ci) save the estimated bounds $\widehat{\underline{\theta}}_u$ and $\widehat{\overline{\theta}}_u$ for unit-level weights (or $\widehat{\underline{\theta}}_i$ and $\widehat{\overline{\theta}}_i$ for individual-level weights); a dummy variable equal to 1 if the CI used is the boundary-case interval and 0 for the interior case; and the resulting asymptotic CIs at the classical 90%, 95%, and 99% levels (plus the one specified by level(), if any).

In conditional analyses, either with unit-level or with individual- or position-level covariates, the matrices e(estimates_ci_aggregated) and e(estimates_ci_type_#) store exactly the same information as e(estimates_ci): the former for the aggregated conditional index $\theta^{cond}_{0,u}$ or $\theta^{cond}_{0,i}$, the latter for the index conditional on a given type #, that is, $\theta_{0z}$ with unit-level characteristics or $\theta_{0w}$ with individual- or position-level characteristics (# ranges from $z = 1$ to $\overline{Z}$ or from $w = 1$ to $\overline{W}$).
With method(ct), five rows correspond, respectively, to the Duncan, the Theil, the Atkinson, the Coworker, and the Gini indices. The columns contain the naive index $\widehat{\theta}_N$; the index under random allocation $\overline{\theta}^{ra}_N$; the CT-corrected index $\widehat{\theta}_{CT}$; the empirical standard deviation of the draws $(\widehat{\theta}^{ra}_{N,s})_{s=1,\dots,S}$ under random allocation; the "standardized score" originally proposed by Cortese, Falk, and Cohen (1976), namely, $(\widehat{\theta}_N - \overline{\theta}^{ra}_N)$ divided by that standard deviation; and the empirical quantiles of $(\widehat{\theta}^{ra}_{N,s})_s$ at the orders 0.01, 0.05, 0.10, 0.90, 0.95, and 0.99 (plus $\tau$ and $1 - \tau$, with $\tau$ the argument of the option level(), if this option is used).
e(I_constrained_case) is a dummy equal to 1 in the constrained case and 0 otherwise. As discussed in section 2.2, with random unit size, it is equal to 1 if and only if we are in the constrained case ($\mathcal{D}^{k+1}_{\widehat{m}}$ restricted to a singleton) for each size $k \in \mathcal{K}$. In this case, method(np) yields point estimates for all indices. e(I_constrained_case) plays the same role in conditional analyses: the dummy is equal to 1 provided we are in the constrained case for each type. Otherwise, $\theta^{cond}_{0,u}$ and $\theta^{cond}_{0,i}$ are only partially identified with method(np).

e(test_binomial_results) is stored when the test of the binomial assumption is performed (see the option testbinomial). It is a row vector whose first element saves the value of the test statistic $LR_n$ and whose second element saves the p-value of the test, where the null hypothesis is the binomial assumption.

method(np) and method(beta) save e(info_distribution_of_p) in unconditional analyses. This matrix contains the information learned about the distribution of $p$ in the estimation. In the general case, without assuming $K \perp\!\!\!\perp p$, this means information about the conditional distributions $F^k_p$, for each $k \in \mathcal{K}$. With the option independencekp, it is about the unconditional distribution $F_p$.
With the method(beta) option, all the $(F^k_p)_{k \in \mathcal{K}}$ (or $F_p$ when assuming $K \perp\!\!\!\perp p$) are supposed to follow a beta distribution. In the general case, e(info_distribution_of_p) is a matrix with $|\mathcal{K}|$ rows. Each row is associated with a size $k$, and the columns report the size $k$; the number of units of size $k$ in the data used, that is, $\sum_{i=1}^n 1\{K_i = k\}$; the latter quantity expressed as a proportion of the $n$ units studied; the number of components of the beta mixture considered (that is, 1); and the MLEs $\widehat{\alpha}_1$ and $\widehat{\beta}_1$ of the two shape parameters characterizing the beta distribution assumed for $F^k_p$. When $K \perp\!\!\!\perp p$ is assumed, the matrix e(info_distribution_of_p) is similar but consists of a single row because only one estimation is done, pooling all units together. It contains the maximal size $\overline{K}$, the number of units $n$ used for the estimation, and the estimates of the parameters that characterize the beta distribution assumed for $F_p$.
With the method(np) option, the structure of e(info_distribution_of_p) is more involved because the approach is nonparametric. Without the restriction $K \perp\!\!\!\perp p$, it contains $3 \times |\mathcal{K}|$ rows and should be read in blocks of three rows, the $k$th block concerning $F^k_p$. The first row of a block shows some general information, namely, the size $k$, the number of units of size $k$, and the proportion of such units within the data used (as in method(beta)). The most important element is displayed in the fourth column and consists of a dummy variable equal to 1 if we are in the constrained case for $F^k_p$, that is, $\widehat{m} \in \partial\mathcal{M}$ conditional on $K = k$. In this case, despite the nonparametric approach, the constrained maximum-likelihood estimation yields an estimate $\widehat{F}$ of $F^k_p$, which turns out to be a discrete distribution with at most $\lfloor (k+1)/2 \rfloor + 1$ support points (see section 2.2, Estimation). In this situation, the fifth column of the first row within the three-row block indicates the number of support points of $\widehat{F}$, and the two following rows characterize $\widehat{F}$ by reporting its support points and the corresponding probabilities. In the unconstrained case, the dummy is 0, and the last two rows within the three-row block are empty because there is then no estimate of $F^k_p$.

When assuming $K \perp\!\!\!\perp p$, the matrix e(info_distribution_of_p) is analogous but is made of a single three-row block because it deals only with the unconditional distribution $F_p$. In this case (see section 2.2, Assuming independence between K and p), the estimation uses the first $\overline{K}$ moments of $F_p$. It is likely to fall in the constrained case because $\overline{K}$ will exceed 10 in most applications, a size above which simulations reveal that the probability of being in the constrained case is close to 1 even with large sample sizes $n$.

e(info_distribution_of_p) is interesting because virtually any segregation index is a functional of the distribution $F_p$ (of the conditional distributions $(F^k_p)_k$ in the general case accounting for the randomness of $K$). Consequently, an estimate of $F_p$ (respectively of the $(F^k_p)_k$) enables one to recover any other personalized segregation index.
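For instance, after an unconditional analysis, the stored objects can be inspected as follows (a sketch, with the object names listed above):

. segregsmall K X, format(unit) method(np)
. matrix list e(estimates_ci)              // estimated bounds and CIs
. display e(I_constrained_case)            // 1 if in the constrained case for all sizes
. matrix list e(info_distribution_of_p)    // what was learned about F_p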

Execution time
The times reported below are averaged over 50 repetitions on a desktop computer running Windows 10 Enterprise with an Intel® Core™ i5-6600 CPU at 3.30 GHz and 16 GB of RAM. The operations of segregsmall can be decomposed into a preparation stage and a stage devoted to estimation and inference.
Preparation stage. The preparation stage is common to the three methods and reshapes the dataset. Its execution time is short compared with the whole command and increases with the number $n$ of units. For instance, with unit-level datasets, for $K$ taking values in $\mathcal{K} = [5, 15]$, it lasts around 0.06 seconds with $n = 1{,}000$ and 0.99 seconds with $n = 300{,}000$. In conditional analyses, the execution time is approximately multiplied by the number of types: for example, 6.03 seconds for 5 types and 9.17 seconds for 10 types, with $\mathcal{K} = [5, 15]$ and $n = 300{,}000$. With individual-level datasets, the preparation stage is longer because the units must first be formed. With $\mathcal{K} = [5, 15]$, it takes 0.24 seconds with 1,000 units and 9.99 seconds with 300,000 units.
Estimation and inference stage. The subsequent operations depend on the method used. The central building block of method(np) and method(beta) is the estimation of the indices for a given dataset (original or bootstrapped). The construction of CIs repeats the operation for each bootstrapped dataset; the execution time is thus roughly linear in the number of bootstrap repetitions (fixed by the option repbootstrap()). method(ct) requires reshuffling the data under the randomness benchmark, hence an execution time broadly linear in the number of draws (controlled by the option repct()). Table 1 illustrates this dependence for method(np) and method(beta) as well as the effect of the option conditional(). Regarding the latter, for all three methods, the same operations as in unconditional analyses are done for each type (see section 2.6). Thus, the execution time of segregsmall is roughly linear in the number of types included in the analysis. The primary determinant of the computation time is the unit sizes: both the number of distinct values in the support $\mathcal{K}$ and the magnitude of $K$, as shown by table 3. With method(ct), the execution time quickly increases with the magnitude of $K$, whereas the increase is moderate for method(np) and even smaller for method(beta).

Table 3. Execution time in seconds, by support $\mathcal{K}$ of $K$ and by method. Setting: unit-level datasets; for each $\mathcal{K}$, $n = 10{,}000$ (except 9,000 for the first row, $\mathcal{K} = [1, 9]$), that is, 1,000 units per distinct size; option noci for method(np) and method(beta); 50 draws (the default) for method(ct). [Table entries not reproduced here.]

Example
We use the command to measure workplace segregation between natives and foreigners in France (see D'Haultfoeuille and Rathelot [2017] for details about the context). A large share of workers is employed in small establishments. This section shows the importance of correcting for the small-unit bias, which may otherwise lead to erroneous economic conclusions.
The data used are the 2007 Déclarations Annuelles des Données Sociales, French data linking workers to their employers. The data are exhaustive in the private sector (1.77 million establishments). In the application, we use the 1.04 million establishments that have between 2 and 25 employees. The minority group consists of individuals born outside France with the nationality of a country outside Europe. The overall proportion of minority individuals is 4.1% in the sample studied.

Figure 1 shows the estimates of workplace segregation by firm size for the Duncan, the Theil, the Atkinson (with parameter $b = 0.5$), and the Coworker indices. The Gini index does not satisfy the conditions required by the nonparametric method of HR and is thus not displayed (but see figure 2 in appendix A.2 for the corresponding graph without the nonparametric estimator). The distinct methods of the package are used: the estimated bounds $\widehat{\underline{\theta}}$ and $\widehat{\overline{\theta}}$ on $\theta$ by method(np) ("np bounds"); the 95%-level CI for this parameter using the modified bootstrap procedure of method(np), with the default 200 bootstrap iterations ("np CI"); the point estimate $\widehat{\theta}$ by method(beta) ("beta"); the naive index $\widehat{\theta}_N$ ("naive"); and the CT-corrected index $\widehat{\theta}_{CT}$ using method(ct), with the default 50 draws under random allocation ("ct").

Figure 1 shows that the naive indices overestimate the actual level of segregation: they are almost always above the CI obtained by method(np) (except for the Atkinson index with $K \in \{7, 8\}$). This bias decreases with the size of the units. For the Duncan, the Theil, and the Atkinson indices, the estimated identification interval for $\theta$ quickly becomes informative for $K \geq 5$ and reduces to a singleton for $K \geq 9$ (see the discussion in section 2.2). Whenever the unit size is larger than 1, the estimated bounds of method(np) boil down to a point estimate for the Coworker index.
The point estimate $\widehat{\theta}$ using method(beta) is within the identification bounds of HR for the Duncan, the Theil, and the Atkinson indices but is below HR's CIs for the Coworker index. The CT-corrected measure $\widehat{\theta}_{CT}$ underestimates the Duncan and Theil indices, being always below method(np)'s CI. $\widehat{\theta}_{CT}$ lies within the CI and is quite close to the estimated identified set of $\theta$ for the Atkinson and Coworker indices.
Interestingly, the naive indices exhibit a stronger negative relationship between segregation levels and unit size than the corrected ones. Neglecting the small-unit bias would produce a statistical artifact: because the magnitude of the bias decreases with $K$, the naive indices would support a negative correlation even where there is none. On the contrary, the indices that account for the small-unit bias can address this question (see section 5 of HR for further details).
Finally, we report below the Stata output obtained with the segregsmall command for method(np) with the option testbinomial. Appendix A.2 displays the output associated with method(beta) and method(ct). Compared with the analyses of figure 1 (performed size by size), the estimation in this output is performed over the entire sample of units ($\mathcal{K} = [2, 25]$) without assuming $K \perp\!\!\!\perp p$. As detailed in section 3.3, the test of the binomial assumption is automatically performed and stored in this configuration; the option only displays the result in the Stata output. In this application, we cannot reject the binomial assumption at any standard level.
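The call takes the following form, with the unit-level variables K and X of section 3:

. segregsmall K X, format(unit) method(np) testbinomial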

Conclusion
This article presented the segregsmall command, which implements three methods (D'Haultfoeuille and Rathelot 2017; Rathelot 2012; Carrington and Troske 1997) to measure segregation in settings where units (neighborhoods, firms, classrooms, etc.) contain few individuals. In such situations, naive indices overestimate the actual level of segregation and produce measures that are not comparable across settings or over time because the small-unit bias may vary. segregsmall enables social scientists to compute segregation indices in those cases and makes the HR nonparametric approach easy to use. It provides asymptotic CIs for the HR and R parameters. For all three methods, conditional indices can be estimated: they account for other covariates (either at the unit level or at the individual or position level) that may influence the allocation process of individuals into units and therefore measure "net" or "residual" segregation. The HR and R methods can be used whatever the unit size to measure segregation as a departure from the relevant benchmark of randomness. Even with large units of more than 100 individuals, the parametric approach of the R method remains quite affordable in terms of computational requirements, even including inference by the bootstrap.

Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type

. net sj 21-1
. net install st0631 (to install program files, if available)
. net get st0631 (to install ancillary files, if available)

A.1 Expressions of the indices as functions of the parameter v

Gini index. Contrary to the previous indices, there is no closed-form expression for the Gini index under a mixture of beta distributions for $F_p$: its expression involves the term $\int \{F_p(u)\}^2\, du$, which has to be approximated by numerical methods.

A.2 Supplementary material for the example

Figure 2 is the counterpart of figure 1 for the Gini index. Because the Gini index does not satisfy assumption 1, only the output of method(beta) and method(ct) is reported: the point estimate $\widehat{\theta}$ ("beta") and the 95%-level asymptotic CIs obtained by bootstrap ("beta CI") using method(beta) (with the default 200 bootstrap iterations); the naive index $\widehat{\theta}_N$ ("naive"); and the CT-corrected index $\widehat{\theta}_{CT}$ using method(ct) with the default 50 draws under random allocation ("ct").
As with the Duncan, the Theil, the Atkinson, and the Coworker indices, figure 2 illustrates some points discussed in section 2 in the particular case of the Gini index. Regarding the CT correction, there is no reason why $\theta_{CT}$ should be close to $\theta$. For the Gini, the CT-corrected index happens to fall far below the CI for the index obtained by method(beta).

We report below the Stata output obtained with method(beta) and method(ct). These estimations are done over the whole sample of units ($\mathcal{K} = [2, 25]$). As an illustration, the option independencekp is used for method(beta).

. segregsmall K X, format(unit) method(ct) repct(100)
*** Construction of relevant databases for the analysis ***
*** Estimation and correction ***
CT-correction - current random allocation iteration x10 (out of 100): .........+
Estimates for segregation indices using CT-correction (ct) method:
Unconditional analysis
Number of units studied in the analysis: 1036840
(0 unit with a single individual are excluded from the analysis)
Number of individuals studied: 6178564
Proportion of minority (or reference) group: 4.1e-02
No inference for naive and CT-corrected indices
CT-correction is made using 100 draws under random allocation (