On Lasso and adaptive Lasso for non-random samples in credit scoring

Prediction models in credit scoring are often formulated using available data on accepted applicants at the loan application stage. The use of these data to estimate the probability of default (PD) may lead to bias due to non-random selection from the population of applicants. That is, the PD in the general population of applicants may not be the same as the PD in the subpopulation of accepted applicants. A prominent model for the reduction of bias in this framework is the sample selection model, but there is as yet no consensus on its utility. It is unclear whether the bias-variance trade-off of regularization techniques can improve predictions of PD in the non-random sample selection setting. To address this, we propose the use of the Lasso and adaptive Lasso for variable selection and optimal predictive accuracy. By appealing to the least squares approximation of the likelihood function of the sample selection model, we optimize the resulting function subject to L1 and adaptively weighted L1 penalties using an efficient algorithm. We evaluate the performance of the proposed approach and competing alternatives in a simulation study and apply it to the well-known American Express credit card dataset.


Introduction
Credit scoring models are used to evaluate the likelihood of credit applicants defaulting in order to decide whether to grant them credit. The scoring systems are based on the past performance of consumers who are similar to those who will be assessed under the system. In other words, several loan applicant attributes are used to assign a score, and these scores are used to determine the creditworthiness of the applicant. In practice, credit scores are transformed into the probability of default (PD). The PD is the expected probability that a borrower will default on the debt before its maturity.
A key concern in the use of these models is that they are typically designed and calibrated using data from applicants who were previously considered adequately creditworthy to have been granted credit (Banasik et al., 2003). Consider, as an example, a loan application made to a bank. The bank uses the loan applicant attributes to grant or reject the loan request. If the request is accepted, then the bank will observe the loan performance over time. Marshall et al. (2010) classified these procedures into two: the credit granting process (accept or reject) and the loan performance process (default or non-default). A model developed using the accept-only applicants from the credit granting process may rest on a non-random sample from the target population and can lead to selection bias.
A strategy for addressing the problem of sample selection bias in credit scoring is reject inference (Hand & Henley, 1993; Crook & Banasik, 2004). Reject inference is the process of inferring how rejected loan applicants would have behaved had they been granted a loan. The techniques for reject inference can be classified under two different assumptions (Kim & Sohn, 2007). The first assumption is that the distribution pattern of accepted applicants can be extended to that of rejected ones. That is, P(default|X, rejected) = P(default|X, accepted), where X is the vector of applicants' attributes. This implies that the PD in the population of accepted applicants can be applied to the rejected ones. Examples of statistical methods in this category include re-weighting and extrapolation methods. The second assumption implies that P(default|X, rejected) ≠ P(default|X, accepted). In this case, the PD in the population at large, P(default|X), cannot be approximated by the conditional model based on P(default|X, accepted) for an applicant selected at random from the full population. A widely used method under this assumption is the bivariate probit model with sample selection (Dubin & Rivers, 1989).
There are two discordant viewpoints on the utility of sample selection models for reject inference in the literature. Greene (1998) and Greene (2008), for example, analysed the risk of loan default for credit cardholders using a sample selection model and concluded that the model with adjustment for sample selection bias exhibits better discrimination than the model based on accept-only data. By taking variable selection into account, Marshall et al. (2010) showed that the model that ignores sample selection bias can underestimate PD. Other studies that reported higher model performance can be found in Banasik et al. (2003), Banasik and Crook (2007) and Kim & Sohn (2007). On the other hand, Little (1985) and Crook & Banasik (2004) showed that adjusting for selectivity bias may not yield improved predictions when the proportion of rejected applicants is low. In a simulation study, Wu and Hand (2007) also reported the importance of the proportion of accepted or rejected applicants in reject inference. It was shown that even with the normality assumption in place for the sample selection model, correction for selection bias may not improve predictions when the proportion of accepted applicants is large. Further examples can be found in Puhani (2000) and Chen and Åstebro (2012).
There are various reasons for the discordant viewpoints mentioned above. These include the proportion of rejected applicants; the inclusion of 'noise' variables (variables that are not predictive of PD) in both the loan granting and loan performance processes, which may lead to overfitting; and the degree of correlation between the error terms in the loan granting and loan performance processes. Indeed, some of these issues have been dealt with to some extent in the literature. Marshall et al. (2010) used bootstrap variable selection to control for the effect of noise variables, but their method was not optimized for prediction in the way regularization methods are. Data mining techniques have been used in reject inference to improve the quality of credit scorecards, but these are yet to be applied to the Heckman-type selection models (see Li et al., 2017 and references therein). The need to harmonize these issues within the Heckman-type selection models for reject inference is the motivation for this article.
The contribution of this article is therefore twofold. First, we introduce the Lasso (Tibshirani, 1996) and adaptive Lasso (Zou, 2006) penalized Heckman-type bivariate probit model and assess its performance in identifying predictive features of PD in credit scoring. Since the model is made up of two components, each of which may have different variables, feature selection is somewhat complex. Thus, our framework appeals to the unified treatment of $L_1$-constrained model selection of Wang and Leng (2007), which is based on a least squares approximation (LSA) of the likelihood function. The resulting LSA is then solved subject to $L_1$ and adaptively weighted $L_1$ penalties using the coordinate descent algorithm. Unlike the bootstrap variable selection approach of Marshall et al. (2010), regularization methods have the advantage of simultaneous estimation of parameters and selection of variables. Second, since variable selection provides a sparse solution for the true model with true zero coefficients, the predictive performance of the model can be enhanced. We therefore propose a bootstrap internal validation method (Harrell et al., 1996; Ogundimu, 2019) for both the regularized and unregularized sample selection models. Unlike in previous work, where model validation is done using a hold-out sample, the bootstrap approach can be used to quantify the degree of optimism in the model.
The remainder of the article is organized as follows. In Section 2, we describe the dataset used in the study. The bivariate probit model with sample selection (BPSSM) and its extension using a copula-based sample selection model (CBSSM) are described in Section 3. We develop Lasso and adaptive Lasso estimators for the models and provide the computational algorithm for their maximization in Section 4. In Section 5, we describe five metrics for predictive performance that are not threshold dependent and the procedure for internal validation. The simulation study and data example are presented in Section 6. Finally, in Section 7, we provide concluding remarks; further results are presented in the Supplementary Materials. We also provide a package in R (HeckmanSelect) for the implementation of the methods.

Dataset
We used the American Express credit card dataset (Greene, 1998, 2008) in this study. The dataset consists of 13,444 observations on credit card applications received in a single month in 1988. Of the full sample, 10,499 applications were approved, and the subsequent 12 months of spending and default behaviour were observed. Important variables in the data include demographic and socioeconomic factors of the applicants (e.g., age, income, whether the applicant owns his or her home, whether the applicant is self-employed and the number of dependents living with the applicant). An important factor for granting the credit facility is grouped under 'Derogatories and Other Credit Data'. These influential variables are the numbers of major and minor derogatory reports (60-day and 30-day delinquencies). Details of the variables used in this study can be found in Table A1 (Supplementary Materials).
The dataset contains two outcome variables: Cardholder status (C), which takes 1 if the application for a credit card was accepted and 0 if not, and Default status (D), which takes 1 if the cardholder defaulted and 0 if not. Default is defined as having skipped payment for six months, and the corresponding status is observed only when C = 1, that is, for 10,499 observations. Table 1a shows the distribution of the cardholder status. Out of the 13,444 applicants, 2,945 (21.9%) are censored, 996 (7.41%) of those selected to receive the card defaulted and 9,503 (70.7%) applicants paid back their loans. Table 1b shows the default status distribution of the selected sample. As is common in credit scoring, the event rate is less than 10%.

Sample selection model with binary outcome
The use of a sample selection model in reject inference assumes that P(default|X, rejected) ≠ P(default|X, accepted). This implies that reject inference can be construed as a missing data problem under the assumption of missing not at random (MNAR), and the Heckman selection model can be adapted for parameter estimation and inference. Henceforth, we treat the loan granting process (accept/reject) as the selection equation ($S_i$) and loan performance (default/non-default) as the outcome submodel of interest ($Y_i$).
Let $Y^*$ and $S^*$ be two latent (unobservable) variables characterizing the outcome and selection equations, respectively. That is,

\[
Y_i^* = \beta^T x_i + \varepsilon_{1i}, \qquad S_i^* = \gamma^T w_i + \varepsilon_{2i},
\tag{3.1}
\]

where $\beta^T = (\beta_0, \beta_1, \ldots, \beta_p)$ and $\gamma^T = (\gamma_0, \gamma_1, \ldots, \gamma_q)$ are unknown parameters with corresponding covariates $x_i^T = (1, x_{i1}, \ldots, x_{ip})$ and $w_i^T = (1, w_{i1}, \ldots, w_{iq})$, and $\varepsilon_i^T = (\varepsilon_{1i}, \varepsilon_{2i})$ are random errors with means zero, variances one and correlation $\rho$. Define further $Y_i$ and $S_i$ as two observable versions of equation (3.1) such that

\[
S_i = \mathbb{1}(S_i^* > 0), \qquad Y_i = \mathbb{1}(Y_i^* > 0) \ \text{observed only when } S_i = 1.
\]

The probability mass function (PMF) of $Y_i$ and $S_i$ is Bernoulli, where the probability of success depends on the parameters $\beta$ and $\gamma$, respectively.

Bivariate probit model with sample selection (BPSSM)
Suppose that the error terms in equation (3.1) follow a bivariate standard normal distribution with correlation $\rho$. There is no selection bias when $\rho = 0$; in this case, the missing data mechanism is said to be ignorable. We now have three levels of observability: $S_i = 0$ (rejected loans); $S_i = 1, Y_i = 0$ (accepted loans, non-default); and $S_i = 1, Y_i = 1$ (accepted loans, default). Thus,

\[
P(S_i = 0) = \Phi(-\gamma^T w_i), \quad
P(Y_i = 0, S_i = 1) = \Phi_2(-\beta^T x_i, \gamma^T w_i; -\rho), \quad
P(Y_i = 1, S_i = 1) = \Phi_2(\beta^T x_i, \gamma^T w_i; \rho),
\tag{3.2}
\]

where $\Phi(\cdot)$ and $\Phi_2(\cdot, \cdot; \rho)$ denote the univariate and bivariate standard normal cumulative distribution functions (CDF), respectively. The log-likelihood function is easily derived from equation (3.2) as

\[
l(\theta) = \sum_{\{i:\, S_i = 0\}} \log \Phi(-\gamma^T w_i)
+ \sum_{\{i:\, S_i = 1,\, Y_i = 0\}} \log \Phi_2(-\beta^T x_i, \gamma^T w_i; -\rho)
+ \sum_{\{i:\, S_i = 1,\, Y_i = 1\}} \log \Phi_2(\beta^T x_i, \gamma^T w_i; \rho),
\tag{3.3}
\]

where $\theta = (\beta, \gamma, \rho)$. The probability of interest is

\[
P(Y_i = 1 \mid S_i = 1) = \frac{\Phi_2(\beta^T x_i, \gamma^T w_i; \rho)}{\Phi(\gamma^T w_i)}.
\tag{3.4}
\]

It is straightforward to show that equation (3.4) reduces to $\Phi(\beta^T x_i)$ when $\rho = 0$. This is indeed the model for the accept-only applicants; that is, $P(Y_i = 1) = \Phi(\beta^T x_i)$, a probit regression model. The performance of the rejected loans can be imputed via

\[
P(Y_i = 1 \mid S_i = 0) = \frac{\Phi_2(\beta^T x_i, -\gamma^T w_i; -\rho)}{\Phi(-\gamma^T w_i)}.
\]

The evaluation of the performance of our method is based on equation (3.4), that is, the PD given that a loan is accepted.
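For illustration, the three contributions to equation (3.3) can be coded directly. The sketch below is not the paper's implementation (which fits the model with the GJRM package in R); it assumes `X` and `W` already include intercept columns, and `bpssm_loglik` is a hypothetical name:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def bpssm_loglik(beta, gamma, rho, X, W, Y, S):
    """Log-likelihood of the bivariate probit sample selection model,
    summing over the three observability levels of equation (3.3)."""
    xb = X @ beta          # outcome linear predictor, beta' x_i
    wg = W @ gamma         # selection linear predictor, gamma' w_i

    def biv_cdf(a, b, r):
        # Phi_2(a, b; r): bivariate standard normal CDF, evaluated pointwise
        cov = np.array([[1.0, r], [r, 1.0]])
        return np.array([multivariate_normal.cdf([ai, bi], mean=[0, 0], cov=cov)
                         for ai, bi in zip(a, b)])

    ll = 0.0
    rej = S == 0                       # rejected: Phi(-gamma' w_i)
    ll += np.sum(np.log(norm.cdf(-wg[rej])))
    nd = (S == 1) & (Y == 0)           # accepted, non-default
    ll += np.sum(np.log(biv_cdf(-xb[nd], wg[nd], -rho)))
    d = (S == 1) & (Y == 1)            # accepted, default
    ll += np.sum(np.log(biv_cdf(xb[d], wg[d], rho)))
    return ll
```

When $\rho = 0$, each bivariate term factorizes into a product of univariate probit terms, which gives a quick correctness check against the accept-only model.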

Copula-based sample selection model
Although there is no reason to discountenance the symmetric dependence and the underlying normality assumption used in Section 3.1 for prediction purposes, the model can be generalized using copulas. Let the marginal CDFs of $Y_i^*$ and $S_i^*$ be $F_1(\beta^T x_i)$ and $F_2(\gamma^T w_i)$, respectively; their joint CDF can then be written as

\[
F(\beta^T x_i, \gamma^T w_i) = C\{F_1(\beta^T x_i), F_2(\gamma^T w_i); \rho\},
\]

where $C(\cdot, \cdot)$ is a two-dimensional copula function, $\beta$ and $\gamma$ are as defined in equation (3.1) and $\rho$ is an association copula parameter representing the dependence between the two marginal distributions. Note that $\rho$ is defined as an association measure in this case and can therefore take values outside the usual correlation range of $[-1, 1]$. Since the realized 'outcomes' $Y$ and $S$ are both binary, we can define the probability of the event $(Y_i = 1, S_i = 1)$ as

\[
P(Y_i = 1, S_i = 1) = C\{F_1(\beta^T x_i), F_2(\gamma^T w_i); \rho\}.
\]

In addition, $P(Y_i = 0, S_i = 1) = F_2(\gamma^T w_i) - C\{F_1(\beta^T x_i), F_2(\gamma^T w_i); \rho\}$ and $P(S_i = 0) = 1 - F_2(\gamma^T w_i)$. Thus, the likelihood function in equation (3.3) can be generalized as

\[
l(\theta) = \sum_{\{i:\, S_i = 0\}} \log\{1 - F_2(\gamma^T w_i)\}
+ \sum_{\{i:\, S_i = 1,\, Y_i = 0\}} \log\big[F_2(\gamma^T w_i) - C\{F_1(\beta^T x_i), F_2(\gamma^T w_i); \rho\}\big]
+ \sum_{\{i:\, S_i = 1,\, Y_i = 1\}} \log C\{F_1(\beta^T x_i), F_2(\gamma^T w_i); \rho\}.
\tag{3.5}
\]

If we assume a Gaussian copula with normal marginals, that is, $C\{\Phi(\beta^T x_i), \Phi(\gamma^T w_i); \rho\} = \Phi_2(\beta^T x_i, \gamma^T w_i; \rho)$, then equations (3.5) and (3.3) are essentially the same. The probability of interest given in equation (3.4) is then generalized as

\[
P(Y_i = 1 \mid S_i = 1) = \frac{C\{F_1(\beta^T x_i), F_2(\gamma^T w_i); \rho\}}{F_2(\gamma^T w_i)}.
\]

Details of various copulas that can be used in non-Gaussian sample selection models, including the marginal distributions, can be found in Marra et al. (2017b) and Gomes et al. (2019). We illustrate the proposed method using the Ali-Mikhail-Haq (AMH) copula function with Gaussian marginal distributions for both the outcome and the selection equations. We note that the AMH copula can only allow for relatively modest dependence (see Section A.3 in the Supplementary Materials), and as such, we only consider comparable dependence between the outcome and the selection process of the BPSSM and CBSSM in our simulation settings.
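For reference, the AMH copula has the known closed form $C(u, v; \theta) = uv / \{1 - \theta(1-u)(1-v)\}$ with $\theta \in [-1, 1)$, and its implied Kendall's tau also has a closed form whose limited range (roughly $(-0.18, 1/3)$) is why only modest dependence is attainable. A small illustrative sketch (function names are ours, not from the paper's R package):

```python
import numpy as np

def amh_copula(u, v, theta):
    """Ali-Mikhail-Haq copula C(u, v; theta) = uv / (1 - theta(1-u)(1-v)),
    theta in [-1, 1); theta = 0 gives the independence copula."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return u * v / (1.0 - theta * (1.0 - u) * (1.0 - v))

def amh_kendall_tau(theta):
    """Kendall's tau implied by the AMH copula:
    tau = 1 - 2{theta + (1-theta)^2 log(1-theta)} / (3 theta^2),
    with limiting values -0.1817... as theta -> -1 and 1/3 as theta -> 1."""
    if theta == 0:
        return 0.0
    return 1.0 - 2.0 * (theta + (1.0 - theta) ** 2 * np.log(1.0 - theta)) / (3.0 * theta ** 2)
```

Matching Kendall's tau is one way to line up AMH association values with Gaussian correlations: for instance, $\theta_{\mathrm{AMH}} = 0.498$ yields approximately the same tau as a Gaussian copula with $\rho = 0.2$.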
Lasso and adaptive Lasso

Lasso and adaptive Lasso for BPSSM and CBSSM

Ogundimu (2021) introduced a regularization method for the sample selection model with continuous outcomes. We generalize the method to binary outcomes by using the unified Lasso approach of Wang and Leng (2007). Since the model is a two-component model, similar to mixture models, we adapt the method implemented in Zeng et al. (2014).
Consider the log-likelihood function $l(\theta)$ given in equation (3.3) (equivalently, equation (3.5)). Suppose that the last three elements of $\theta$ are $\beta_0$, $\gamma_0$ and $\rho$, and that the first $p + q$ elements of $\theta$ are $\beta_j$, $j = 1, \ldots, p$, and $\gamma_k$, $k = 1, \ldots, q$. This construction ensures that $\beta_0$, $\gamma_0$ and $\rho$ are not penalized. The Lasso estimator (Tibshirani, 1996) for the sample selection model is given by

\[
\hat{\theta}_{\mathrm{lasso}} = \arg\min_{\theta} \Big\{ -l(\theta) + \lambda \sum_{j=1}^{p+q} |\theta_j| \Big\},
\tag{4.1}
\]

where the second term on the RHS of equation (4.1) is the $L_1$ penalty, which shrinks small coefficients to zero to obtain a sparse representation of the solution, and $\lambda$ is a tuning parameter controlling the amount of shrinkage, often chosen via cross-validation. The optimization problem reduces to familiar maximum likelihood estimation when $\lambda = 0$. Equation (4.1) does not have a closed-form solution, and various algorithms for its computation have been studied. These include the shooting algorithm (Fu, 1998), least angle regression (LARS; Efron et al., 2004) and the coordinate descent algorithm (Friedman et al., 2007). Since the Lasso penalizes all the regression coefficients equally, it over-penalizes the important variables, thereby resulting in biased estimators. The lack of the oracle property (Fan & Li, 2001) of the Lasso prompted the development of the adaptive Lasso (Zou, 2006), which possesses this property; the oracle property implies the method is consistent in variable selection, unbiased and asymptotically normal. The estimator is defined as

\[
\hat{\theta}_{\mathrm{alasso}} = \arg\min_{\theta} \Big\{ -l(\theta) + \lambda \sum_{j=1}^{p+q} w_j |\theta_j| \Big\},
\tag{4.2}
\]

where $w = (w_1, \ldots, w_{p+q})$ is a data-driven adaptive weight vector, often given as $w = 1/|\tilde{\theta}|$, where $\tilde{\theta}$ is any consistent estimator of $\theta$. We take this to be the maximum likelihood estimator, $\hat{\theta}_{\mathrm{ml}}$.

Least squares approximation
It is not straightforward to optimize the penalized log-likelihood function in equation (4.2). To simplify the optimization problem, we approximate $l(\theta)$ by the least squares approximation (LSA) method. Consider the second-order Taylor expansion of $l(\theta)$ at $\hat{\theta}_{\mathrm{ml}}$,

\[
l(\theta) \approx l(\hat{\theta}_{\mathrm{ml}}) + l'(\hat{\theta}_{\mathrm{ml}})^T (\theta - \hat{\theta}_{\mathrm{ml}}) + \tfrac{1}{2} (\theta - \hat{\theta}_{\mathrm{ml}})^T l''(\hat{\theta}_{\mathrm{ml}}) (\theta - \hat{\theta}_{\mathrm{ml}}),
\]

where $l'(\cdot)$ and $l''(\cdot)$ are the first- and second-order derivatives of the log-likelihood function. Since $l(\hat{\theta}_{\mathrm{ml}})$ is a constant, $l'(\hat{\theta}_{\mathrm{ml}}) = 0$ and $l''(\hat{\theta}_{\mathrm{ml}}) = -\hat{\Sigma}^{-1}$, where $\hat{\Sigma}$ is the estimated variance-covariance matrix of $\hat{\theta}_{\mathrm{ml}}$, we have, up to an additive constant,

\[
-l(\theta) \approx \tfrac{1}{2} (\theta - \hat{\theta}_{\mathrm{ml}})^T \hat{\Sigma}^{-1} (\theta - \hat{\theta}_{\mathrm{ml}}).
\]

We fitted the models and obtained $\hat{\theta}_{\mathrm{ml}}$ and $\hat{\Sigma}$ by using the GJRM package in R (Marra & Radice, 2020). Pseudo data are created as

\[
Y^* = X^* \hat{\theta}_{\mathrm{ml}}, \qquad (X^*)^T X^* = \hat{\Sigma}^{-1},
\]

where $X^*$ is a square matrix containing all the $(p + q + 3)$ predictors along with the correlation parameter $\rho$ and the intercepts, and $Y^*$ is the corresponding pseudo response. Thus, equation (4.2) can be rewritten as

\[
\hat{\theta}_{\mathrm{alasso}} = \arg\min_{\theta} \Big\{ \tfrac{1}{2} \| Y^* - X^* \theta \|^2 + \lambda \sum_{j=1}^{p+q} w_j |\theta_j| \Big\},
\]

which we optimized using the coordinate descent algorithm (see Friedman et al., 2010; Simon et al., 2011; Ogundimu, 2021).
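The LSA step can be sketched as follows: build the pseudo-data from a Cholesky factor of $\hat{\Sigma}^{-1}$, then run coordinate descent with soft-thresholding. This is an illustrative Python implementation, not the paper's R code; `lsa_lasso` and its arguments are our names, and setting a weight to zero leaves the corresponding coefficient (intercepts, $\rho$) unpenalized:

```python
import numpy as np

def soft_threshold(z, t):
    # sign(z) * max(|z| - t, 0): the coordinate-wise Lasso update
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lsa_lasso(theta_ml, Sigma, lam, weights, n_iter=200):
    """Coordinate descent for the LSA objective
       (1/2)(theta - theta_ml)' Sigma^{-1} (theta - theta_ml) + lam * sum_j w_j |theta_j|.
    Pseudo-data: Sigma^{-1} = L'L (Cholesky), X* = L, Y* = L theta_ml.
    weights[j] = 0 leaves theta_j unpenalized; weights[j] = 1 gives the Lasso,
    weights[j] = 1/|theta_ml[j]| the adaptive Lasso."""
    L = np.linalg.cholesky(np.linalg.inv(Sigma)).T   # upper factor: L'L = Sigma^{-1}
    Xs, Ys = L, L @ theta_ml
    theta = theta_ml.copy()
    for _ in range(n_iter):
        for j in range(len(theta)):
            # partial residual correlation for coordinate j
            r_j = Xs[:, j] @ (Ys - Xs @ theta + Xs[:, j] * theta[j])
            theta[j] = soft_threshold(r_j, lam * weights[j]) / (Xs[:, j] @ Xs[:, j])
    return theta
```

With $\hat{\Sigma} = I$ the objective decouples, and the solution is the soft-thresholded maximum likelihood estimate, a useful sanity check.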

Selection of tuning parameter
The optimal tuning parameter $\lambda$ can be estimated by using the AIC (Akaike information criterion), BIC (Bayesian information criterion) or GCV (generalized cross-validation). It has been shown that the combination of the adaptive Lasso penalty and a BIC-type tuning parameter selector results in an LSA estimator that can be as efficient as the oracle estimator (Wang et al., 2007). Thus, we focus on the BIC criterion, although the method is implemented for AIC and GCV as well. The criterion is

\[
\mathrm{BIC}_\lambda = (\hat{\theta}_\lambda - \hat{\theta}_{\mathrm{ml}})^T \hat{\Sigma}^{-1} (\hat{\theta}_\lambda - \hat{\theta}_{\mathrm{ml}}) + df_\lambda \log(n),
\]

where $0 \le df_\lambda \le (p + q)$ is the degrees of freedom corresponding to the number of nonzero coefficients of $\hat{\theta}_\lambda$. The optimal value of $\lambda$ is computed over a grid of candidate values between $\lambda = 0$ and $\lambda_{\max}$, with a step size of 0.1, where $\lambda_{\max}$ is the value of $\lambda$ for which the entire vector of penalized coefficients is zero. We allowed the optimal $\lambda = 0$ for the unregularized solution.
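As a sketch, selecting $\lambda$ by a BIC-type criterion only requires the fitted coefficient vectors on the grid. In the code below (function and argument names are illustrative, not from the paper's package), the LSA quadratic form stands in for $-2$ times the log-likelihood up to an additive constant, which does not affect the minimizer:

```python
import numpy as np

def bic_select(theta_ml, Sigma_inv, candidates, lambdas, n):
    """Pick lambda minimizing a BIC-type criterion for the LSA estimator:
        BIC(lam) = (theta_lam - theta_ml)' Sigma^{-1} (theta_lam - theta_ml)
                   + df_lam * log(n),
    where df_lam counts the nonzero coefficients of theta_lam and
    candidates[i] is the fitted coefficient vector at lambdas[i]."""
    best = None
    for lam, th in zip(lambdas, candidates):
        d = th - theta_ml
        df = int(np.sum(th != 0))
        bic = float(d @ Sigma_inv @ d) + df * np.log(n)
        if best is None or bic < best[1]:
            best = (lam, bic)
    return best[0]
```

Because $\hat{\Sigma}^{-1}$ grows with $n$, dropping a genuinely nonzero coefficient inflates the quadratic term much faster than the $\log(n)$ reward for a smaller $df_\lambda$, which is what drives the criterion's consistency.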

Variance estimation
The variance of the nonzero components of $\hat{\theta}$ (based on the optimal tuning parameter $\lambda$) can be derived using the local quadratic approximation (LQA) sandwich formula given in Fan & Li (2001). We suggest an alternative formulation using a block decomposition of the Hessian matrix $l''(\theta)$ (see Section A.2 in the Supplementary Materials) and, by generalization, of the Hessian matrix corresponding to equation (3.5). Let $\hat{\theta}_1$ (with $r$ elements, $r \le s$) be the non-vanishing component of $\hat{\theta}$, and let $M_{11}$ and $A_{11}$ denote the first $r \times r$ submatrices of the matrices $M$ and $A$ of that decomposition. We have presented the variance estimation formula for the sake of completeness, as the focus of the current work is on predictions. Further details on variance estimation for the regularized sample selection model can be found in Ogundimu (2021).

Performance metrics and bootstrap validation
We describe the metrics for predictive accuracy and the bootstrap approach for model validation.

Metrics for predictive performance
In credit risk assessment, the misclassification of loan defaulters as non-defaulters will result in a loss for banks/creditors. Therefore, it is more important that the true defaulters are correctly classified. Here, we focus on model evaluation criteria for predictions in the context of regression analysis rather than classification. This ensures that the metrics for predictive performance are not threshold dependent, and that users of the model can determine the appropriate threshold for classification. Unlike in previous studies, where the area under the curve is the most common metric of prediction accuracy, we examine the performance of four other metrics based on model discrimination and calibration. The following performance metrics are used: (1) Area under the receiver operating characteristic curve (AUROC): The c-index (Harrell et al., 1982) is a generalization of the AUROC, a measure of how well a model separates subjects with the event of interest from subjects without the event (discrimination). It calculates the proportion of pairs in which the predicted event probability is higher for the subject with the event of interest than for the subject without the event. A model with no discriminatory ability has a value around 0.5, whereas a value close to 1 suggests excellent discrimination.
Statistical Modelling 2024; 24(2): 115-138

(2) Area under the precision-recall curve (AUPRC): Suppose a true positive (TP) is defined as an actual defaulter who is correctly predicted, a false negative (FN) as an actual defaulter who is predicted as a non-defaulter, and a false positive (FP) as an actual non-defaulter predicted as a defaulter. Then, Recall = TP/(TP + FN) and Precision = TP/(TP + FP). The precision-recall (PR) curve shows the relationship between precision and recall for every possible threshold value, and the area under the curve is a single-number summary of the information in the PR curve. A key advantage of the AUPRC is that it takes into consideration the prior probability of the outcome of interest, thereby reflecting the ability of the model to identify defaulters (often the minority class in credit scoring). Unlike the AUROC, whose baseline value is 0.5, the baseline of the AUPRC is the prior probability of the outcome, and its value approaches 0 as the prior probability decreases (Davis & Goadrich, 2006).
We computed this metric by using the PRROC package in R (Grau et al., 2015).
(3) Brier score (BS): This is a measure of agreement between the observed binary outcome (i.e., default versus non-default) and the predicted PD,

\[
\mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,
\]

where $Y_i$ is the outcome and $\hat{Y}_i$ is the predicted PD. It is a proper scoring rule in that its expected value is minimized when the correct probabilities are used. A key advantage is that it captures both discrimination and calibration, and a low value of the metric is preferred.
(4) Calibration metrics: We consider two metrics of calibration, the expected calibration error (ECE) and the maximum calibration error (MCE). The metrics are computed by sorting the predicted probabilities and partitioning them into a fixed number $K$ of equal-frequency bins. The ECE calculates the average calibration error over the bins,

\[
\mathrm{ECE} = \sum_{i=1}^{K} P(i)\, |y_i - \hat{y}_i|,
\]

where $y_i$ is the true fraction of positive instances in bin $i$, $\hat{y}_i$ is the average of the predicted probabilities for the instances in bin $i$, and $P(i)$ is the empirical probability (fraction) of all instances that fall into bin $i$. The MCE calculates the maximum calibration error over the bins,

\[
\mathrm{MCE} = \max_{i \in \{1, \ldots, K\}} |y_i - \hat{y}_i|.
\]

Choices between $K = 10$ and $K = 100$ have been reported in the literature (Naeini et al., 2015; Wang et al., 2019); we chose $K = 10$ in this study. Like the Brier score, the lower the values of ECE and MCE, the better the calibration of a model.
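The Brier score and the binned calibration errors are straightforward to compute. The sketch below is illustrative (the study itself uses the PRROC package in R for the curve-based metrics) and uses equal-frequency bins as described:

```python
import numpy as np

def brier_score(y, p):
    """Mean squared difference between outcome and predicted PD (lower is better)."""
    return float(np.mean((np.asarray(y, float) - np.asarray(p, float)) ** 2))

def ece_mce(y, p, K=10):
    """Expected and maximum calibration error over K equal-frequency bins
    of the sorted predicted probabilities."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    order = np.argsort(p, kind="stable")
    bins = np.array_split(order, K)            # equal-frequency partition
    n = len(p)
    ece, mce = 0.0, 0.0
    for idx in bins:
        if len(idx) == 0:
            continue
        gap = abs(y[idx].mean() - p[idx].mean())   # |true fraction - mean predicted|
        ece += (len(idx) / n) * gap                # weighted by empirical bin probability P(i)
        mce = max(mce, gap)
    return ece, mce
```

A perfectly calibrated set of predictions gives ECE = MCE = 0, while a constant prediction far from the bin-wise event rates inflates both.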

Bootstrap internal validation
Although we implemented penalized sample selection models to alleviate overfitting, some degree of optimism may nonetheless persist. Harrell et al. (1996) presented a procedure for estimating optimism in predictive models, and we extend this method to incorporate variable selection. Without loss of generality, consider a dataset $D = \{x_i, Y_i, S_i\}$, where $x_i$, $Y_i$ and $S_i$ are as defined in Section 3.1, and a performance metric $P$. A generic algorithm for the procedure is given in Algorithm 1. The optimism-corrected metric $P$ is the metric that has been corrected for overfitting. It is noteworthy that an optimal lambda value is selected for each of the $B$ bootstrap samples.

Numerical studies
In this section, we use a simulation study and real data to evaluate the utility of the proposed estimators in reject inference.
The intercepts of the outcome equation, $\beta_0 = -2.78$, and the selection equation, $\gamma_0 = 1.90$, are chosen such that the event rate and the proportion of missing data are about 10% and 22%, respectively. The 10% event rate is typical of datasets for modelling PD (Ogundimu, 2019). The simulation design ensures that there is one predictor in the selection equation that is not in the outcome equation (an exclusion restriction, although this is not essential, as demonstrated in Ogundimu, 2021). The covariates $x_i$ and $w_i$ are independent of the error terms $\varepsilon_i^T = (\varepsilon_{1i}, \varepsilon_{2i})$. We generated the underlying error distribution in two ways:
i BPSSM: the errors are generated with mean zero and correlation matrix

\[
\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},
\]

where $\rho = \{0, 0.2, 0.5\}$ ($\rho = 0$ corresponds to an ignorable selection process);
ii CBSSM: the errors are generated from the AMH copula with association measure $\theta_{\mathrm{AMH}} = \{0, 0.498, 1\}$. These values are equivalent to the values of $\rho$ under the BPSSM in the sense of matching Kendall's tau $\tau$: for Gaussian errors, $\tau = (2/\pi)\arcsin(\rho)$, while for the AMH copula,

\[
\tau = 1 - \frac{2\{\theta_{\mathrm{AMH}} + (1 - \theta_{\mathrm{AMH}})^2 \log(1 - \theta_{\mathrm{AMH}})\}}{3\,\theta_{\mathrm{AMH}}^2},
\]

with the limiting value $\tau = 1/3$ at $\theta_{\mathrm{AMH}} = 1$.
The covariates $x_1, \ldots, x_8$ are generated such that their distributions are marginally standard normal with pairwise correlations $\mathrm{corr}(x_j, x_k) = 0.5^{|j-k|}$, which allows for moderate correlation between the covariates. The binary versions of $Y_i^*$ and $S_i^*$, denoted $Y_i$ and $S_i$ respectively, are generated as in Section 3.1; $Y_i$ is observed if $S_i = 1$ and missing otherwise. A sample of $n = 1000$ is used with 200 replications. We combined weak ($\beta = 0.2$) and moderate ($\beta = 0.7$) covariate effects in this design.
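A hedged sketch of this data-generating process (the function name, the construction of the extra instrument in the selection equation and the use of -1 to code unobserved outcomes are our illustrative choices, not the paper's code):

```python
import numpy as np

def simulate_bpssm(n, beta, gamma, rho, decay=0.5, seed=1):
    """Generate data from the bivariate-probit selection design sketched above:
    AR(1)-correlated standard normal covariates with corr(x_j, x_k) = decay**|j-k|,
    bivariate normal errors with correlation rho, and Y observed only when S = 1.
    beta includes the intercept in position 0; gamma additionally covers the
    extra instrument appended to the selection equation."""
    rng = np.random.default_rng(seed)
    p = len(beta) - 1
    # covariates with corr(x_j, x_k) = decay**|j-k|
    R = decay ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    x = rng.multivariate_normal(np.zeros(p), R, size=n)
    w = np.column_stack([x, rng.normal(size=n)])   # instrument: exclusion restriction
    e = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    y_star = beta[0] + x @ beta[1:] + e[:, 0]      # latent outcome
    s_star = gamma[0] + w @ gamma[1:] + e[:, 1]    # latent selection
    S = (s_star > 0).astype(int)
    Y = np.where(S == 1, (y_star > 0).astype(int), -1)   # -1 marks missing outcomes
    return x, w, Y, S
```

Intercepts such as $\beta_0 = -2.78$ and $\gamma_0 = 1.90$ can then be tuned empirically to hit target event and missingness rates.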
We evaluated the performance of the methods using sensitivity (the mean proportion of nonzero coefficients that were correctly identified) and specificity (the mean proportion of zero coefficients that were correctly identified). The predictive accuracy of the models is evaluated using the bootstrap method described in Section 5. For each bootstrap sample, the optimal $\lambda$ is computed over a grid of candidate values of $\lambda$ as described in Section 4.3, yielding a model whose predictors and coefficients are based on that penalty.
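These two selection metrics can be computed directly from an estimated and a true coefficient vector; a small helper (hypothetical name) makes the definitions concrete:

```python
import numpy as np

def selection_accuracy(est, truth):
    """Sensitivity: fraction of truly nonzero coefficients estimated nonzero.
    Specificity: fraction of truly zero coefficients estimated zero."""
    est, truth = np.asarray(est, float), np.asarray(truth, float)
    nz, z = truth != 0, truth == 0
    sens = np.mean(est[nz] != 0) if nz.any() else np.nan
    spec = np.mean(est[z] == 0) if z.any() else np.nan
    return float(sens), float(spec)
```

Averaging these over the 200 replications gives the quantities reported in Table 2.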
In Table 2, we present the results on the sensitivity and specificity of the methods for variable selection. Lasso methods have higher sensitivity but lower specificity than the other methods. This observation is not surprising, since the Lasso estimator lacks the oracle property (Zou, 2006) and is generally known to include true covariates but also irrelevant covariates (Meinshausen & Bühlmann, 2006). Although the data were generated from a bivariate normal distribution, the performance of the CBSSM methods is superior to the corresponding BPSSM methods in terms of specificity. There is no clear distinction among the methods in terms of sensitivity. The adaptive Lasso methods have slightly better overall performance on the combined effects of sensitivity and specificity. To see the impact of weak and moderate covariate effects on the methods, we present the frequency with which the variables are selected in 200 replications in Table 3. The unregularized methods (BPSSM P-value and Probit P-value) selected true covariates less often when the effect is weak, but they are slightly better at selecting the fewest covariates with true zero coefficients. CBSSM Lasso is slightly better than BPSSM Lasso across the correlation values. There is no clear advantage of sample selection models over complete case analyses in Tables 2 and 3.
Table 4 shows the results of quantifying optimism in the models' predictions. Regularized methods are expected to exhibit smaller optimism, as regularization is meant to alleviate the problem of overfitting. The results indicate that the use of sample selection models in the reject inference problem, whether regularized or not, does not improve on the accuracy of complete case analysis. Lasso-based methods are slightly better than the other methods in terms of the metrics for discrimination (AUROC and AUPRC). This is counterbalanced by their performance on the two metrics for calibration (ECE and MCE), where calibration results are consistently poorer than for the other methods; this may be due to the inclusion of unimportant variables by the Lasso methods. Table A2 in the Supplementary Materials is equivalent to the results in Table 2 but with the data generated from the AMH copula. Overall, CBSSM performance is slightly better than BPSSM (owing to its performance on specificity). In general, complete case analyses are slightly better in terms of specificity for the outcome model, whereas the regularized CBSSM sample selection models are better in terms of sensitivity. CBSSM ALasso is better than BPSSM ALasso (see Table A3 in the Supplementary Materials). The results also show that there is a slight benefit of the sample selection models

Algorithm 1: Bootstrap validation with variable selection

Input: D = {x_i, Y_i, S_i}.
1. Fit the regularized sample selection model to D and use it to predict on D, giving the apparent performance, P_orig.
2. for b = 1 to B do
   (a) Take a bootstrap sample D_b from D.
   (b) Fit the regularized sample selection model to D_b (the grid search for the optimal λ is done on each D_b).
   (c) Predict on the same D_b sample and compute the predictive accuracy metric of interest, P_boot(b).
   (d) Use the model to predict on D and compute the predictive accuracy metric, P_orig(b).
3. Optimism = (1/B) Σ_{b=1}^{B} {P_boot(b) − P_orig(b)}.
Output: Optimism-corrected metric, P = P_orig − Optimism.
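Algorithm 1 can be sketched generically, with `fit` and `metric` as user-supplied callables standing in for the regularized sample selection fit and any metric of Section 5 (an illustrative sketch, not the HeckmanSelect implementation; data are simplified to a feature matrix and a response):

```python
import numpy as np

def optimism_corrected(X, y, fit, metric, B=50, seed=42):
    """Harrell-style bootstrap optimism correction, as in Algorithm 1.
    fit(X, y) refits the (regularized) model, re-selecting lambda on each
    bootstrap sample; metric(model, X, y) returns the performance metric."""
    rng = np.random.default_rng(seed)
    model = fit(X, y)
    p_orig = metric(model, X, y)                  # apparent performance on D
    optimism = 0.0
    for _ in range(B):
        idx = rng.integers(0, len(y), len(y))     # bootstrap resample D_b
        m_b = fit(X[idx], y[idx])
        p_boot = metric(m_b, X[idx], y[idx])      # performance on D_b
        p_test = metric(m_b, X, y)                # performance back on D
        optimism += (p_boot - p_test) / B
    return p_orig - optimism
```

Because the model is refit (including the tuning-parameter search) inside every bootstrap replicate, the variable selection step itself is part of what gets validated.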

Table 2
Results on the covariate selection based on selection model and corresponding complete case analyses-Bivariate normal data generation

Table 3
Simulation results for the number of times each covariate is selected (out of 200) with both weak and moderate covariate effects (outcome equation)-Bivariate normal data generation

Table 4
Results of the optimism corrected model performance-Bivariate normal data generation