arhomme: An implementation of the Arellano and Bonhomme (2017) estimator for quantile regression with selection correction

Despite constituting a major theoretical breakthrough, the quantile selection model of Arellano and Bonhomme (2017, Econometrica 85: 1–28) based on copulas has not found its way into many empirical applications. We introduce the command arhomme, which implements different variants of the estimator along with standard errors based on bootstrapping and subsampling. We illustrate the command by replicating parts of the empirical application in the original article and a related application in Arellano and Bonhomme (2018, Handbook of Quantile Regression, chap. 13).


Introduction
Ever since the contributions by Gronau (1974) and Heckman (1974), economists and researchers from other disciplines have been aware of the possibility that measured relationships may suffer from selection bias. The classic example is the determinants of pay (that is, wages) and the selectivity through participation in employment. If one is interested in measuring how individuals with certain characteristics are paid, one has to deal with the possibility that some of them may actually not take up employment, especially if their potential pay is too low (in comparison with their alternative options). If these individuals differ in terms of unobservables from the general population, omitting them from wage regressions will yield biased estimates of regression coefficients.
Following Heckman (1979), a large literature has studied generalized models of sample selection for regression models, for example, Ahn and Powell (1993); Andrews and Schafgans (1998); Chen and Khan (2003); and Das, Newey, and Vella (2003). This literature initially focused on correcting regressions for the mean outcome (for example, the mean wage). An even more challenging case is to correct entire outcome distributions for selection bias. In an influential contribution, Buchinsky (1998, 2001) proposed a control function approach to correcting quantile regressions for selection bias. However, it was later shown by Huber and Melly (2015) that the proposed correction was based on restrictive assumptions that are unlikely to hold in general (conditional independence and additivity). It was not until the contribution by Arellano and Bonhomme (2017) that the selection problem for entire distributions was solved in some generality. In particular, Arellano and Bonhomme (2017) showed that, in the general case, sample selection corrections may not be additive but nonlinearly "rotate" observed distributional ranks.
Despite representing a theoretical breakthrough, Arellano and Bonhomme's (2017) method has not yet found its way into many empirical applications (recent exceptions include Maasoumi and Wang [2019] and Bollinger et al. [2019]). The purpose of this article is to provide an implementation of their method that is easy to use by practitioners. We also provide some replications of original analyses in Arellano and Bonhomme (2017, 2018). More generally, Arellano and Bonhomme's (2017) contribution is part of an active recent literature that addresses the problem of correcting entire distributions for selection with potential applications in many fields (for example, Albrecht, van Vuuren, and Vroman [2009]; Picchio and Mussida [2011]; Fernández-Val, van Vuuren, and Vella [2018]; D'Haultfoeuille et al. [2020]; and Biewen, Fitzenberger, and Seckler [2020]).
The rest of this article is organized as follows. Section 2 describes the Arellano and Bonhomme (2017) selection model and estimation method. Section 3 introduces and describes the command arhomme, which implements this estimation method along with several options. Section 4 presents three empirical examples, two of them being successful replications of original applications in Arellano and Bonhomme (2017, 2018). Section 5 concludes.

Model
Although they consider more general versions in theoretical parts of their analysis, the practical version of the Arellano and Bonhomme (2017) quantile regression model with selection correction takes the form

Y* = X β(U)   (1)
D = 1{V ≤ p(Z)}   (2)
Y = Y* if D = 1   (3)

where Y* is the potential outcome, D the selection indicator, and Y the observed outcome (available only for individuals with D = 1). The vectors X and Z are covariate vectors, where X is assumed to be a strict subset of Z (exclusion restriction). The uniformly distributed variable U denotes the rank of the individual in the conditional distribution Y*|X, while the uniformly distributed V represents a normalized error term describing the resistance toward selection. The propensity score p(Z) = P(D = 1|Z) describes the selection probability of individuals with characteristics Z. The propensity score is assumed to follow a probit model; that is, p(Z) = Φ(Z γ). The main substantive assumption of the model is that (U, V) is jointly statistically independent of Z given X.
Equation (1) is a linear quantile regression model for the potential outcome Y * defining the value of Y * that an individual with rank U would get if he or she was selected (for example, the U th quantile in a distribution of wage offers for individuals with characteristics X). Equation (2) specifies that, among individuals with characteristics Z, a percentage of p(Z) gets selected, but only those whose resistance toward selection V is low enough. Equation (3) states that outcomes are observed only for selected individuals (for example, an individual would earn some wage Y * if he or she decided to work, but this wage is observed only if he or she actually decides to do so).
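To make the mechanics of (1) to (3) concrete, the model can be simulated. The following Python sketch is our own illustration (all functional forms and parameter values are illustrative choices, not taken from Arellano and Bonhomme [2017]): it draws the ranks (U, V) from a Gaussian copula and applies the selection rule.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
N = 200_000
rho = -0.5  # Gaussian copula parameter; negative rho means positive selection

# Ranks (U, V) from a Gaussian copula: normal scores mapped to uniforms
scores = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=N)
U, V = norm.cdf(scores[:, 0]), norm.cdf(scores[:, 1])

# Covariates: X is a strict subset of Z; here Z = (X, Zx) (exclusion restriction)
X = rng.uniform(0, 1, N)
Zx = rng.normal(size=N)

# Eq. (1): linear quantile model; X beta(u) is increasing in u for all X >= 0
Ystar = norm.ppf(U) + (1.0 + U) * X

# Eq. (2): selected if the resistance V lies below the propensity score p(Z)
pZ = norm.cdf(0.2 + 0.5 * X + 1.0 * Zx)
D = V <= pZ

# Eq. (3): Y is observed only for the selected
Y = np.where(D, Ystar, np.nan)

# With rho < 0, low-V (selected) individuals tend to have high U, so the
# observed mean exceeds the mean of potential outcomes (positive selection)
print(np.nanmean(Y) > Ystar.mean())
```

Dropping the nonselected observations from this simulated sample therefore overstates average outcomes, which is exactly the bias the selection correction addresses.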
The interest lies in uncovering the coefficients β(U) characterizing the conditional distribution Y*|X, which includes all individuals with characteristics X, although not all of these individuals actually produce observable outcomes Y. For example, the β(U) describe the pay structure for women with characteristics X, although not all of these women actually take part in the labor market. To establish a link between the quantiles of the distribution of observable outcomes Y and those of the distribution of potential outcomes Y*, Arellano and Bonhomme (2017) showed that

P{Y ≤ x β(τ) | D = 1, Z = z} = C_x{τ, p(z)} / p(z) ≡ G_x{τ, p(z)}   (4)

where C_x(u, v) = P(U ≤ u, V ≤ v | X = x) is the conditional copula of U and V; that is, G_x{τ, p(z)} is the probability for an individual with characteristics Z = z to have at most rank τ in the outcome distribution conditional on having at most rank p(z) in the propensity to get selected. Because the left-hand side of (4) describes outcome quantiles in the selected population, this means that the coefficients β(τ) belonging to the τth quantile in the overall population can be recovered by looking at the G_x{τ, p(z)}-quantile observations of the selected population. This establishes the validity of a "rotated" quantile regression that uses the observed outcomes Y but applies to them the individual-specific ranks G_x{τ, p(z)} instead of the target rank τ [see (7) and (8) below].
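For the Gaussian copula, the conditional copula G_x in (4) has a simple numerical form. The following Python sketch is our own illustration (arhomme itself implements this in Mata); the function name G_gaussian is ours.

```python
from scipy.stats import multivariate_normal, norm

def G_gaussian(tau, p, rho):
    """Conditional Gaussian copula G(tau, p) = C(tau, p; rho)/p,
    i.e., P(U <= tau | V <= p) when (U, V) follow a Gaussian copula."""
    C = multivariate_normal.cdf([norm.ppf(tau), norm.ppf(p)],
                                mean=[0.0, 0.0],
                                cov=[[1.0, rho], [rho, 1.0]])
    return C / p

# Under independence (rho = 0) the rotated rank equals the target rank tau
print(round(G_gaussian(0.5, 0.7, 0.0), 3))
```

With a negative copula parameter (positive selection), G_gaussian(τ, p, ρ) falls below τ: to target the τth quantile of the overall population, one looks at a lower rank within the selected sample.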

Estimation
Based on an independent and identically distributed sample {(D_i, Z_i, D_i Y_i) : i = 1, …, N}, Arellano and Bonhomme's (2017) estimation method proceeds as follows. For practical implementation, one assumes that the true copula C_{U,V|X=x}(u, v) belongs to a parametric family with parameter ρ (such as Gaussian or Frank; see below) and that it does not depend on X. The latter restriction can be relaxed by carrying out estimations by subgroup (see below). The resulting conditional copula function is denoted by G(u, v; ρ). For the following, define a^+ = max(a, 0) and a^− = max(−a, 0).
Propensity score estimation (step 1):

γ̂ = argmax_γ Σ_{i=1}^N [D_i ln Φ(Z_i γ) + (1 − D_i) ln{1 − Φ(Z_i γ)}]   (5)

Estimation of the copula parameter (step 2):

ρ̂ = argmin_{r ∈ R} ‖ Σ_{l=1}^L Σ_{i=1}^N D_i ϕ(Z_i) [1{Y_i ≤ X_i β̂(τ_l, r)} − G{τ_l, Φ(Z_i γ̂); r}] ‖   (6)

where, for each candidate value r,

β̂(τ_l, r) = argmin_b Σ_{i=1}^N D_i [G{τ_l, Φ(Z_i γ̂); r} (Y_i − X_i b)^+ + (1 − G{τ_l, Φ(Z_i γ̂); r}) (Y_i − X_i b)^−]   (7)

Rotated quantile regression (step 3): for the quantile of interest τ,

β̂(τ) = argmin_b Σ_{i=1}^N D_i [Ĝ_{τ,i} (Y_i − X_i b)^+ + (1 − Ĝ_{τ,i}) (Y_i − X_i b)^−]   (8)

where Ĝ_{τ,i} = G{τ, Φ(Z_i γ̂); ρ̂}. Step 1 estimates the probit parameters of the propensity score.
Step 2 is a generalized method of moments estimating equation for the copula parameter ρ, which is identified by a conditional moment condition following from (4). As an instrumental variable for estimating ρ in (6), one can use a suitable function ϕ(Z_i) of Z_i, for example, ϕ(Z_i) = Φ(Z_i γ̂). Minimization in (6) is carried out over a grid of candidate values r ∈ R for the copula parameter. For each candidate value r, the estimates β̂(τ_l, r) in (6) are obtained by rotated quantile regression over a grid of auxiliary quantiles τ_1, …, τ_L [see (7)].
Step 3 estimates, for any desired τ, selection-corrected quantile regressions based on the preestimated copula parameter ρ̂. For this, the individual-specific rotated ranks Ĝ_{τ,i} = G{τ, Φ(Z_i γ̂); ρ̂} are used. The effect of the rotation can be seen by comparing (8) with the (infeasible) quantile regression that would be carried out if the potential outcomes Y* were observed for the whole population (that is, if there was no selection problem),

β̃(τ) = argmin_b Σ_{i=1}^N [τ (Y*_i − X_i b)^+ + (1 − τ) (Y*_i − X_i b)^−]   (9)

Note that, if one is interested only in β(τ_1), …, β(τ_L), step 3 is not necessary, because these are already estimated in step 2. However, for computational reasons and for reasons of flexibility, it may be useful to separate steps 2 and 3. For example, one may already have obtained individual-specific copula estimates Ĝ_{τ,i} (for example, by subgroup estimation) and may then carry out the desired rotated quantile regressions of step 3 conditional on these preestimated quantities (also see below). Arellano and Bonhomme (2017) showed that the estimators defined in (6), (7), and (8) are asymptotically normal. However, the resulting form of the asymptotic variance matrix is very complex. This makes the use of resampling techniques attractive. In their empirical application, Arellano and Bonhomme (2017) used subsampling (Politis, Romano, and Wolf 1999). An alternative is the bootstrap (for example, Shao and Tu [1995]). The bootstrap draws independent and identically distributed resamples of size N from the original sample and repeats the estimation over a number of bootstrap replications. The empirical distribution of the bootstrap replications then serves as an estimate of the asymptotic distribution. Subsampling draws subsamples of size m < N without replacement from the original sample and repeats the estimation on the subsamples to obtain an estimate of the asymptotic distribution (after a rescaling that accounts for the smaller subsample size; for a √N-consistent estimator, standard errors are rescaled by √(m/N)). A related method is the m-out-of-n bootstrap, which also draws subsamples of size m < N from the original sample but with replacement.
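The rotated minimizations in (7) and (8) differ from ordinary quantile regression only in that the constant rank τ is replaced by observation-specific ranks. The following Python sketch is our own minimal illustration using a generic optimizer (not the interior point method of the actual implementation); the function name rotated_qreg is ours.

```python
import numpy as np
from scipy.optimize import minimize

def rotated_qreg(y, X, G):
    """Minimize sum_i [G_i (y_i - x_i'b)^+ + (1 - G_i)(y_i - x_i'b)^-],
    a quantile regression with individual-specific ranks G_i in (0, 1)."""
    G = np.broadcast_to(G, y.shape)

    def loss(b):
        r = y - X @ b
        # G*r for positive residuals, (G-1)*r for negative residuals
        return np.sum(np.where(r > 0, G * r, (G - 1.0) * r))

    res = minimize(loss, np.zeros(X.shape[1]), method="Nelder-Mead",
                   options={"xatol": 1e-9, "fatol": 1e-9, "maxiter": 10_000})
    return res.x

# With constant G_i = tau this reduces to ordinary quantile regression:
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.ones((5, 1))
print(rotated_qreg(y, X, 0.5))  # intercept close to the median, 3
```

Passing a vector of ranks G_{τ,i} instead of the scalar 0.5 yields the rotated regressions (7) and (8); the interior point algorithm used by arhomme solves the same problem much faster.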
Subsampling and the m-out-of-n bootstrap require that N → ∞ and m/N → 0; that is, the subsamples are required to be small in relation to the sample size N. (As evident from their MATLAB codes, Arellano and Bonhomme [2017] actually use the m-out-of-n bootstrap but call it subsampling. In view of the requirement m/N → 0, the numerical difference between subsampling and the m-out-of-n bootstrap is typically small.)

Inference
Subsampling and the m-out-of-n bootstrap work under more general conditions than the bootstrap. In particular, they do not require that the asymptotic distribution be normal. It suffices that a suitably normalized version of the estimator has a limit distribution (Politis, Romano, and Wolf 1999). The bootstrap is guaranteed to work if the limit distribution is normal (Shao and Tu 1995). In the given case, both methods will work because the limit distribution is known to be normal. Subsampling and the m-out-of-n bootstrap are attractive for computational reasons if the sample size is very large because estimations have to be repeated on smaller portions of the data only. However, subsampling and the m-out-of-n bootstrap have to deal with the difficult issue of determining the subsample size (for example, Politis, Romano, and Wolf [1999]; Chernozhukov and Fernández-Val [2005]; Bickel and Sakov [2008]). Based on Chernozhukov and Fernández-Val (2005), Arellano and Bonhomme (2017) used a subsample size of a constant plus the square root of the sample size, where the constant is chosen such that the subsamples are large enough to ensure a reasonable finite sample performance of the estimator (Arellano and Bonhomme 2017, footnote 19).
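As an illustration of the rescaling involved, the following Python sketch (our own toy example using the sample mean, not the arhomme estimator; the function name m_out_of_n_se is ours) computes an m-out-of-n bootstrap standard error. For a √N-consistent estimator, the spread of the subsample estimates is scaled back to the full-sample scale by √(m/N).

```python
import numpy as np

def m_out_of_n_se(data, estimator, m, B, rng):
    """Draw B subsamples of size m with replacement, re-estimate on each,
    and rescale the standard deviation by sqrt(m/N) to the full-sample scale."""
    N = len(data)
    stats = np.array([estimator(rng.choice(data, size=m, replace=True))
                      for _ in range(B)])
    return np.sqrt(m / N) * stats.std(ddof=1)

rng = np.random.default_rng(7)
x = rng.normal(size=10_000)  # true standard error of the mean: 0.01
se = m_out_of_n_se(x, np.mean, m=400, B=500, rng=rng)
```

Because each replication uses only m observations, the computational cost per replication is much lower than for the conventional bootstrap, which matters for estimators as expensive as (6) to (8).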
Our version of Arellano and Bonhomme's (2017) estimator implements the m-out-of-n bootstrap as well as the conventional bootstrap.

Algorithms
It is well known that quantile regression problems such as (9) can be solved using linear programming techniques. However, the rotated versions (7) and (8) cannot be handled by standard implementations of quantile regression such as qreg, because these do not allow for the individual-specific ranks G_{τ,i}. An exception is the code by Morillo, Koenker, and Eilers (available at http://www.econ.uiuc.edu/~roger/research/rq/rq.m), also used by Arellano and Bonhomme (2017). For our implementation of Arellano and Bonhomme (2017), we translated this code from MATLAB to Mata. The code is based on an interior point algorithm (Koenker 2005), as opposed to the exterior point algorithm used in the current version of qreg. In our experience, the interior point algorithm converged considerably faster than the one used in qreg for most of the datasets we analyzed. Our implementation also includes safeguards against problems related to copula values too close to the boundary cases of the countermonotonicity and the comonotonicity copula (in which case G_{τ,i} → 0 or G_{τ,i} → 1; see the ado-file).
Our implementation allows sampling weights (as used in the empirical application of Arellano and Bonhomme [2018], which we replicate in section 4.2). If sampling weights are specified, we premultiply observations Y i , X i , and ϕ(Z i ) with the sampling weight of observation i before carrying out all calculations. This ensures that the sums over i = 1, . . . , N in (6) to (8) are weighted sums. In addition, we include weights in (5).
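Why premultiplication by the weights yields weighted sums follows from the fact that the (rotated) check function is positively homogeneous of degree one: multiplying a residual by w > 0 multiplies its loss contribution by w. A quick numerical check in Python (generic simulated data, not from the application):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
y = rng.normal(size=n)
X = rng.normal(size=(n, 2))
w = rng.uniform(0.5, 2.0, size=n)  # sampling weights
b = np.array([0.3, -0.7])
G = 0.4                            # any rank in (0, 1)

def check(r, G):
    """Rotated check function: G*r^+ + (1 - G)*r^-."""
    return np.where(r > 0, G * r, (G - 1.0) * r)

# Premultiplying y_i and x_i by w_i ...
lhs = check(w * y - (w[:, None] * X) @ b, G).sum()
# ... equals weighting each term of the objective by w_i:
rhs = (w * check(y - X @ b, G)).sum()
print(np.isclose(lhs, rhs))  # True
```

The identity holds because w_i y_i − (w_i x_i) b = w_i (y_i − x_i b) and the check function pulls the positive factor w_i out of both branches.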

Copula functions
Our implementation allows the user to choose among four copula functions, as shown in table 1 (for an overview of copula functions and their properties, see Joe [2015]). An important feature of all of these copulas is that they contain as limit cases the extreme forms of positive (or negative) dependence described by the comonotonicity (countermonotonicity) copula. Not restricting the strength of dependence between U and V (and therefore the strength of selection) appears to be important to avoid imposing restrictions that are not compatible with the data. Out of the four copulas listed in table 1, the Frank, Gaussian, and Plackett copulas can represent only symmetrical patterns, while the Joe and Ma (2000) copula can also accommodate asymmetrical patterns (Joe 2015).
[Table 1 about here: the four copula functions and their parameter ranges; see Joe (2015). F_Γ(·, a) denotes the cumulative gamma distribution function with shape parameter a, which appears in the Joe and Ma (2000) copula.]
Because the value of the copula parameter ρ typically has no direct interpretation, our estimation command reports standard measures of bivariate concordance as listed in table 2. These represent generalized measures of correlation between U and V , which are a function of the copula and the copula parameter (Joe 2015). The concordance measures describe the association between the rank in the latent outcome distribution U and that in the distribution of resistance toward selection V . For example, if high values of U are associated with low values of V , then individuals who get selected tend to have higher outcomes than those who do not (positive selection). Note that positive (or negative) selection will be represented by negative (or positive) concordance measures because of the definition of V as the resistance toward selection.
The interpretation of the different concordance measures is as follows. Spearman's rank correlation measures the (ordinary) correlation between the ranks U and V. Kendall's tau is positive if concordant pairs of ranks (both ranks moving in the same direction across individuals) are more likely than discordant pairs, and it is negative otherwise. Blomqvist's beta is positive if it is more likely that the ranks U and V lie on the same side of the median rank (which is one half) than on opposite sides.
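For the Gaussian copula, all three concordance measures have well-known closed forms in the copula parameter ρ: Kendall's τ = (2/π) arcsin(ρ), Spearman's rank correlation = (6/π) arcsin(ρ/2), and Blomqvist's β = (2/π) arcsin(ρ). A short Python sketch (the function name gaussian_concordance is ours):

```python
import numpy as np

def gaussian_concordance(rho):
    """Closed-form concordance measures implied by a Gaussian copula."""
    kendall = (2 / np.pi) * np.arcsin(rho)
    spearman = (6 / np.pi) * np.arcsin(rho / 2)
    blomqvist = (2 / np.pi) * np.arcsin(rho)
    return kendall, spearman, blomqvist

# A negative copula parameter (positive selection) yields negative measures
tau, rs, beta = gaussian_concordance(-0.59)
print(tau < 0 and rs < 0 and beta < 0)  # True
```

Note that for the Gaussian copula Kendall's tau and Blomqvist's beta coincide; for the other copula families, the measures are computed from their respective formulas (Joe 2015).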

Selection
select([depvar_s =] varlist_s) specifies the variables and options for the selection equation. It is an integral part of specifying the Arellano and Bonhomme (2017) model and is required. The selection equation must contain at least one variable that is not in the outcome equation.
If depvar_s is specified, it should be coded as 0 or 1, with 0 indicating an observation not selected and 1 indicating a selected observation. If depvar_s is not specified, observations for which depvar is not missing are assumed selected, and those for which depvar is missing are assumed not selected.

Grid tuning
rhopoints(#) determines the number of candidate points for the copula parameter grid search. The default is rhopoints(19). When the option frank is chosen, the copula candidate values are constructed as follows. First, the unit interval is divided into # + 1 equidistant intervals. Then, the ith candidate is defined as the ith quantile of a Cauchy distribution with scale meshsize() and location centergrid(). With the option gaussian, the candidates are quantiles of a sine density with emphasis centergrid() and range meshsize() × (1 − |centergrid()|). The grid for the copula options plackett and joema is constructed as the square root of the ith unit interval point divided by 1 minus this point. This method ensures that the resulting grid is denser around centergrid(). The user can shift the focus of the grid search by specifying the desired centergrid(), by reducing (or increasing) meshsize(), or by increasing (or reducing) rhopoints(). Note that the default rhopoints(19) is likely to be too small for many applications.
taupoints(#) specifies the number of quantiles for which the moment restriction is supposed to hold (step 2 in Arellano and Bonhomme [2017]). We recommend using this option together with graph. The resulting scatterplot should suggest a smooth objective function (at least around the gravity center of the search; the objective function may look erratic toward outer values no matter how many taupoints() are used). Increase taupoints() to further smooth the objective function. The default taupoints(3) is a good start in many applications, but a larger number of supporting quantiles is recommended for more reliable estimates.
meshsize(#) scales the grid search interval up (or down). For large #, the resulting grid becomes less dense but searches a wider range. # is restricted to strictly positive real values for the options frank, plackett, and joema and is restricted to (0, 1] when using gaussian. The default meshsize(1) tends to be a good start.
centergrid(#) sets the gravity center of the grid. If you already suspect the optimal copula parameter to be a specific value, this option helps shift the emphasis of your search. # is restricted to (−1, 1) with gaussian and to (0, ∞) for plackett and joema, and it is unrestricted with frank. By default, the grid will always be symmetric about the independence copula, that is, centergrid(0) for frank and gaussian, and centergrid(1) for plackett and joema.
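The exact grid construction is defined in the ado-file; as an illustration of the idea for the frank case, the ith candidate being the i/(# + 1) quantile of a Cauchy distribution with location centergrid() and scale meshsize() can be sketched in Python as follows (the function name frank_grid is ours):

```python
import numpy as np

def frank_grid(npoints=19, center=0.0, mesh=1.0):
    """Illustrative candidate grid for the Frank copula parameter:
    Cauchy(center, mesh) quantiles at i/(npoints + 1), i = 1, ..., npoints.
    The Cauchy quantile function concentrates points near 'center' while
    still reaching far into the tails."""
    p = np.arange(1, npoints + 1) / (npoints + 1)
    return center + mesh * np.tan(np.pi * (p - 0.5))

g = frank_grid()  # 19 candidates, symmetric and densest around 0
```

Increasing mesh widens the search range at the cost of a coarser grid, which is exactly the trade-off controlled by meshsize().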
frank specifies the Frank copula to model individually rotated quantiles. The copula parameter is ρ ∈ R, with ρ → −∞ corresponding to the lower Fréchet-Hoeffding bound, ρ = 0 to the independence copula, and ρ → ∞ to the upper Fréchet-Hoeffding bound.
plackett specifies the Plackett copula used to model individually rotated quantiles. The copula parameter is ρ ∈ (0, ∞), with ρ → 0 corresponding to the lower Fréchet-Hoeffding bound, ρ = 1 to the independence copula, and ρ → ∞ to the upper Fréchet-Hoeffding bound. If standard errors are computed, the copula parameter is tested for ρ = 1 instead of ρ = 0. The p-value is reported accordingly.
joema specifies the Joe and Ma (2000) copula to model individually rotated quantiles. The copula parameter is ρ ∈ (0, ∞), with ρ → 0 corresponding to the lower Fréchet-Hoeffding bound, ρ = 1 to the independence copula, and ρ → ∞ to the upper Fréchet-Hoeffding bound. If standard errors are computed, the copula parameter is tested for ρ = 1 instead of ρ = 0. The p-value is reported accordingly.
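The copula functions themselves are standard (Joe 2015). For reference, here is a Python sketch of the Frank and Plackett copulas showing the independence limits noted above (the Gaussian and Joe–Ma forms are omitted for brevity; the function names are ours):

```python
import numpy as np

def frank(u, v, rho):
    """Frank copula; rho -> 0 gives independence, C(u, v) = u*v."""
    if abs(rho) < 1e-8:
        return u * v
    return -np.log1p(np.expm1(-rho * u) * np.expm1(-rho * v)
                     / np.expm1(-rho)) / rho

def plackett(u, v, theta):
    """Plackett copula; theta = 1 gives independence, C(u, v) = u*v."""
    if abs(theta - 1.0) < 1e-8:
        return u * v
    s = 1.0 + (theta - 1.0) * (u + v)
    return (s - np.sqrt(s * s - 4.0 * theta * (theta - 1.0) * u * v)) \
        / (2.0 * (theta - 1.0))

# Near the independence limits both reduce to u*v = 0.18
print(round(frank(0.3, 0.6, 1e-4), 4), round(plackett(0.3, 0.6, 1.0001), 4))
```

As ρ grows large, the Frank copula approaches the upper Fréchet–Hoeffding bound min(u, v), and as ρ → −∞ it approaches the lower bound max(u + v − 1, 0), so the strength of selection is not restricted a priori.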

Standard errors/subsampling
nostderrors disables the computation of standard errors. This option precludes the use of subsample(#) and repetitions(#).
subsample(#) draws samples of size # with replacement from the marked dataset. Standard errors are computed by the m-out-of-n bootstrap method. If # is greater than or equal to the effective size of the entire dataset, the conventional bootstrap is executed.
repetitions(#) specifies the number of bootstrap replications to be used to obtain an estimate of the variance-covariance matrix of the estimators. The default is repetitions(100), which is likely to be too small in many applications.
fillfraction(#) determines up to which fraction of overall bootstrap repetitions the program replaces subsamples in case of failed convergence. If this limit is reached, further failed subsamples are dropped without being replaced. The default is fillfraction(.3).

Instrument/copula parameter
instrument(varname) lets the user define a variable in the dataset that serves as the instrument to estimate the copula parameter [see (6)]. The instrument has to be a function of varlist s . The default is the propensity score.
copulaparameter(varname) indicates that the copula parameter has already been estimated by the user (for example, separately by sample subgroups in a first stage) and stored per observation in the variable varname. In this case, only step 3 of Arellano and Bonhomme (2017) is performed (estimation of the selection-corrected quantile coefficients). The values in varname are restricted to (−1, 1) with the option gaussian and to the positive real line for plackett and joema. They are unrestricted for frank. This option must be used in connection with nostderrors and precludes the use of rhopoints(), taupoints(), meshsize(), centergrid(), subsample(), repetitions(), instrument(), and graph. The reason is that the user will have to code his or her own bootstrap procedure, including all the different stages of his or her estimations (for example, using bootstrap). It is only in this way that the sampling variability of the preestimated copula parameters is accounted for.
graph specifies that a scatterplot of all objective function values be automatically generated after estimation.
output(normal bootstrap) defines whether the output table is based on the asymptotic (that is, normal) or the bootstrap distribution. If both are specified, two separate output tables are produced, with the results based on the normal distribution reported first. The first-stage (probit) standard errors in the output are always asymptotic (coming from the default probit command). repetitions() should always be set to at least 500 when choosing output(bootstrap).

Comparison with Heckman selection model
Our first empirical illustration uses the example in the Stata manual for the command heckman, which fits the Heckman (1979) selection model for the conditional mean. We then use arhomme to fit a selectivity-corrected regression model for the median. In doing so, we also illustrate a useful stepwise procedure to arrive at a reasonable choice for the grid used to estimate the copula parameter.
Second step (gaussian copula parameter estimation) successfully completed.
Found objective function minimum 1.705e-05 for rho = -0.5903
Third step (minimization of rotated check function) successfully completed.

We then use a refined grid:

. arhomme wage $X, select($X $B) nostderrors gaussian quantiles(.5)
>     taupoints(7) rhopoints(25)

So far, we have used the option nostderrors to save computing time. In the next step, we also include the computation of standard errors:

. set seed 1337
. arhomme wage $X, select($X $B) gaussian quantiles(.5) taupoints(7)
>     centergrid(`c') repetitions(250)
option subsample left unspecified: subsample automatically set to 2000 (bootstrap)
use option nostderrors to disable estimation of covariance matrix
First step estimation (probit model) successfully completed.
Second step (gaussian copula parameter estimation) successfully completed.

Both heckman and arhomme find substantial positive selection (recall that a negative copula parameter in arhomme represents positive selection). The coefficients for education and age in arhomme for the median wage are quite similar to those for the mean wage in heckman. This is the expected outcome if the conditional distributions are symmetric (implying that the mean equals the median).

Replication of Arellano and Bonhomme (2018)
This section replicates the empirical example in Arellano and Bonhomme (2018). The data are from Huber and Melly (2015) and can be downloaded from the Journal of Applied Econometrics data archive (http://qed.econ.queensu.ca/jae/2015-v30.7/hubermelly/). The application refers to the returns to education and experience for women in the United States using data from the 2011 Current Population Survey. The sample covers white non-Hispanic women aged between 25 and 54 years. Individuals who are self-employed or work for the military, public, or agricultural sector are excluded. Working is defined as having worked for more than 35 hours in the week preceding the survey. The application uses the Current Population Survey sampling weights.
We first load the data and modify some of the variable names to make them conform to those used in Arellano and Bonhomme (2018):

. use application, clear
. rename hsg educ_7
. rename some_college educ_8
. rename associate educ_9
. rename college educ_11
. rename advanced educ_13
. rename mw midwest
. rename So south
. rename We west

We now apply arhomme, following as closely as possible the specification in Arellano and Bonhomme (2018):

. global X educ_7 educ_8 educ_9 educ_11 educ_13 exp exp2 exp_edu exp2_edu
>     midwest south west married
. global B child02 child35 child613 child02_m child35_m child613_m
. set seed 1337
. arhomme lwage $X [pw=wgt], select(ft = $X $B) taupoints(4) rhopoints(39)
>     gaussian subsample(1000) repetitions(500)

The results for the selection-corrected quantile regression coefficients are almost identical to those reported by Arellano and Bonhomme (2018, table 1). Standard-error estimates are also very similar. The point estimate for the copula parameter and its standard error also come close to those reported by Arellano and Bonhomme (2018) (ρ̂ = −0.10 with standard error 0.054). Small differences between our results and those of Arellano and Bonhomme (2018) are to be expected because standard errors are the result of a random process (subsampling), because of numerical software differences, and because we do not know the exact choices of Arellano and Bonhomme (2018) for the grid search, the number of supporting quantiles, etc.

Partial replication of Arellano and Bonhomme (2017)
Our last empirical example replicates selected results in Arellano and Bonhomme (2017). wagedata.dta can be downloaded from Stéphane Bonhomme's webpage (https://sites.google.com/site/stephanebonhommeresearch/) and refers to selectivity-corrected wage distributions for the UK for the period 1978–2000. This section illustrates the use of the graph option, which displays the grid search optimization for the copula parameter.

Conclusion
In this article, we described a command, arhomme, implementing the Arellano and Bonhomme (2017) method of sample selection correction for quantile regressions along with standard errors based on bootstrapping and subsampling. arhomme is fast and potentially applicable in many fields in which there is a need to correct estimates of conditional distributions for sample selection. If one is interested in obtaining unconditional distributions corrected for sample selection, the resulting conditional distributions may be aggregated up as described in Albrecht, van Vuuren, and Vroman (2009) or Chernozhukov, Fernández-Val, and Melly (2013).

Acknowledgments
We thank Michael Wolf (University of Zürich) for discussions and advice on subsampling.
We also thank Ben Jann, Blaise Melly, Philippe Van Kerm, and the participants of the Swiss Stata Conference 2020 for many helpful comments and suggestions. Financial support through the DFG Priority Program 1764 and DFG project BI 767/3-1 is gratefully acknowledged.

Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type

. net sj 21-3
. net install st0648    (to install program files, if available)
. net get st0648        (to install ancillary files, if available)