pystacked: Stacking generalization and machine learning in Stata

The pystacked command implements stacked generalization (Wolpert, 1992, Neural Networks 5: 241–259) for regression and binary classification via Python’s scikit-learn. Stacking combines multiple supervised machine learners—the “base” or “level-0” learners—into one learner. The currently supported base learners include regularized regression, random forest, gradient boosted trees, support vector machines, and feed-forward neural nets (multilayer perceptron). pystacked can also be used as a “regular” machine learning program to fit one base learner and thus provides an easy-to-use application programming interface for scikit-learn‘s machine learning algorithms.


Introduction
When faced with a new prediction or classification task, it is a priori rarely obvious which machine learning algorithm is best suited.A common approach is to evaluate the performance of a set of machine learners on a hold-out partition of the data or via cross-validation and then select the machine learner that minimizes a chosen loss metric.However, this approach is incomplete as combining multiple learners into one final prediction might lead to superior performance compared to each individual learner.This possibility motivates stacked generalization, or simply stacking, due to Wolpert (1992) and Breiman (1996).Stacking is a form of model averaging.Theoretical results in van der Laan et al. (2007) support the use of stacking as it performs asymptotically at least as well as the best-performing individual learner as long as the number of base learners is not too large.
In this article, we introduce pystacked for stacking regression and binary classification in Stata.pystacked allows users to fit multiple machine learning algorithms via Python's scikit-learn (Pedregosa et al. 2011;Buitinck et al. 2013) and combine these into one final prediction as a weighted average of individual predictions.pystacked adds to the growing number of programs for machine learning in Stata, including, among others, lassopack for regularized regression (Ahrens et al. 2020), rforest for random forests (Schonlau and Zou 2020) and svm for support vector machines (Guenther 2016;Guenther and Schonlau 2018).Similarly to pystacked, Cerulli (2022) and Droste (2020) provide an interface to scikit-learn in Stata.mlrtime allows Stata users to make use of R's parsnip machine learning library (Huntington-Klein 2021).pystacked differs from these in that it is, to our knowledge, the first to make stacking available to Stata users.Furthermore, pystacked can also be used to fit a single machine learner and thus provides an easy-to-use and versatile API to scikit-learn's machine learning algorithms.
Stacking is widely used in applied predictive modeling in many disciplines, e.g., for predicting mortality (Hwangbo et al. 2022), bankruptcy filings (Liang et al. 2020;Fedorova et al. 2022), or temperatures (Hooker et al. 2018).The use of stacking as a method and pystacked as a program is, however, not only restricted to pure prediction or classification tasks.A growing literature exploits machine learning to facilitate causal inference (see for an overview Athey and Imbens 2019).Indeed, a motivation for writing pystacked is that it can be used in combination with ddml (Ahrens et al. 2023a,b), which implements the Double-Debiased Machine Learning (DDML) methodology of Chernozhukov et al. (2018).DDML utilizes cross-fitting, a form of iterative sample splitting, which allows leveraging a wide class of supervised machine learners, including stacking, for the estimation of causal parameters.For instance, in the context of DDML, stacking can be used to estimate the conditional expectation of an outcome with respect to confounders or propensity scores.
We stress that pystacked relies on Python's scikit-learn (version 0.24 or higher) and the ongoing work of the scikit-learn contributors.Thus, pystacked relies on Stata's Python integration which was introduced in Stata 16.0.We kindly ask users to cite scikit-learn along with this article when using pystacked.Throughout we refer to version 0.7 of pystacked, which is the latest version at the time of writing.Please check for updates to pystacked on a regular basis and consult the help file to be informed about new features.The pystacked help file includes information on how to install a recent Python version and set up Stata's Python integration.Section 2 introduces stacking.Section 3 presents the main features of the pystacked program.Section 4 demonstrates the use of the program using examples.

Methodology
In this section, we briefly summarize the stacking approach for regression and binary classification tasks.Van der Laan et al. (2011) provides a book-length treatment; for a concise textbook treatment, see Hastie et al. (2009).
We first focus on stacking for regression problems where the aim is to predict the continuous outcome y i using predictors x i .The idea of stacking is to combine a set of "base" (or "level-0") learners using a "final" (or "level-1") estimator.It is advisable to include a relatively large and diverse set of base learners to capture different types of patterns in the data.The same algorithm can also be included more than once using different tuning or hyper-tuning parameters.Typical choices for base learners are regularized regression or ensemble methods, such as random forests or gradient boosting.
In the first step of stacking, we obtain cross-validated predicted values ŷ−k(i) (j),i for each base learner j and observation i.The super-script "−k(i)" indicates that we form the cross-fitted predicted value for observation i by fitting the learner to all folds except fold k(i), which is the fold that includes observation i.The use of cross-validation is necessary as stacking would otherwise give more weight to base learners that suffer from over-fitting.The second step is to fit a final learner using the observed y i as the outcome and the cross-validated predicted values ŷ−k(i) (1),i , . . ., ŷ−k(i) (J),i as predictors.A typical choice for the final learner is constrained least squares, which enforces the stacking weights to be non-negative and sum to one.This restriction facilitates the interpretation of stacking as a weighted average of base learners and may lead to better performance (Breiman 1996;Hastie et al. 2009).Algorithm 1 summarizes the stacking algorithm for regression problems as it is implemented in pystacked.
The stacking predicted values are defined as ŷ i = j ŵj ŷ(j),i where ŵj is the estimated stacking weight corresponding to learner j and ŷ(j),i are the predicted values from re-fitting learner j on the full sample I.

It is instructive to compare
Step 2 with the classical 'winner-takes-all' approach that selects one base learner as the one which exhibits the lowest cross-validated loss.Stacking, in contrast, may assign non-zero weights to multiple base learners, thus combining their strengths to produce a better overall predictor than any of the individual base learners.Other choices for the final learner are possible.In addition to the default final learner, pystacked supports, among others, non-negative least squares without the constraint w j = 1, ridge regression, the aforementioned 'winner-takes-all' approach that selects the base learner with the smallest cross-validated mean-squared error, and unconstrained least squares.

Stacking classification
Stacking can be applied in a similar way to classification problems.pystacked supports stacking for binary classification problems where the outcome y i takes the values 0 or 1.The main difference to stacking regression is that ŷ−k (j),i represent cross-validated predicted probabilities.

Base learners
In the following paragraphs, we briefly describe the base learners supported by pystacked and highlight central tuning parameters.We repeat that each of the machine learners discussed below can be fit using pystacked as a regular stand-alone machine learner without the stacking layer.
Note that it goes beyond the scope of the article to describe each learner in detail.Familiarity with linear regression, logistic regression, and classification and regression trees is assumed.We recommend consulting machine learning textbooks, e.g.Hastie et al. (2009), for more detailed discussion.
Regularized regression imposes a penalty on the size of coefficients to control overfitting.The lasso penalizes the absolute size of coefficients, whereas the ridge penalizes the sum of squared coefficients.Both methods shrink coefficients toward zero, but only the lasso yields sparse solutions where some coefficient estimates are set to exactly zero.The elastic net combines lasso and ridge-type penalties.For classification tasks with a binary outcome, logistic versions of lasso, ridge and elastic net are available.
The severity of the penalty is most commonly chosen by cross-validation.For lasso only, pystacked also supports selecting the penalty by AIC or BIC.The use of AIC or BIC has the advantage that it is computationally less intensive than cross-validation.Ahrens et al. (2020) compare the two approaches.
Random forests rely on fitting a large number of regression or decision trees on bootstrap samples of the data.The random forest prediction is obtained as the average across individual trees.A crucial aspect of random forests is that, at each split when growing a tree, one may consider only a random subset of predictors.This restriction aims at de-correlating the individual trees.Central tuning parameters are the number of trees (n_estimators()), the maximum depth of individual trees (max_depth()), the minimum number of observations per leaf (min_samples_leaf()), the number of features to be considered at each split (max_features()) and the size of the bootstrap samples (max_samples()).
Gradient boosted trees also rely on fitting a large number of trees.In contrast to random forests, these trees are fit sequentially to the residuals from the current model.The learning rate determines how much the latest tree contributes to the overall model.Individual trees are usually fit to the whole sample, although sub-sampling is possible.In addition to tuning parameters relating to the trees, the learning rate (learnings_rate()) and the number of trees (n_estimators()) are the most important tuning parameters.
Support vector machines (SVM).Support vector classifiers span a hyperplane that separates observations by their outcome class.The hyperplane is chosen to maximize the distance (margin) to correctly classified observations while allowing for some classification errors.The tuning parameter C (C()) controls the frequency and degree of classification mistakes.The hyperplane can be either linear or fitted using kernels.The SVM algorithm can also be adapted for regression tasks.To this end, the hyperplane is constructed to include as many observations as possible in a tube of size 2 around the hyperplane.Central tuning parameters for regression are (epsilon()) and C (C()), which determines the cost of observations outside of the tube.
Feed-forward neural networks consist of hidden layers that link the predictors (referred to as input layers) to the outcome.Each hidden layer is composed of multiple units (nodes) which pass signals to the next layer using an activation function.Central tuning parameters are the choice of the activation function (activation()), and the number and size of hidden layers (hidden_layer_sizes()).Further tuning choices relate to stochastic gradient descent algorithms which are typically used to fit neural networks.The default solver is Adam (Kingma and Ba 2014).The option early_stopping can be used to set aside a random fraction of the data for validation.The optimization algorithm stops if there is no improvement in performance over a pre-specified number of iterations (see related options n_iter_no_change() and tol()).

Program
This section introduces the program pystacked and its main features.pystacked offers two alternative syntaxes between which the user can choose (see Section 3.1 and 3.2).The two syntaxes offer the same functionality and are included to accommodate different user preferences.Section 3.3 and 3.4 list post-estimation commands and general options, respectively.Section 3.5 discusses supported base learners.Section 3.6 is a note on learner-specific predictors.Section 3.7 explains the pipeline feature.

Syntax 1
The first syntax uses method(string) to select base learners, where string is a list of base learners.Options are passed on to base learners via cmdopt1(string), cmdopt2(string), etc.That is, base learners can be specified and options are passed on in the order in which they appear in method(string) (see Section 3.5).Likewise, the pipe*(string) option can be used for pre-processing predictors within Python on the fly, where '*' is a placeholder for '1', '2', etc. (see Section 3.7).Finally, xvars*(predictors) allows specifying a learner-specific variable lists of predictors.

Syntax 2
In the second syntax, base learners are added before the comma using method(string) Predicted values (in-and out-of-sample) are calculated when pystacked is run and stored in Python memory.predict pulls the predicted values from Python memory and saves them in Stata memory.This storage structure means that no changes on the data in Stata memory should be made between the pystacked call and the predict call.If changes to the data set are made, predict will return an error.
The option basexb returns predicted values for each base learner.By default, the predicted values from re-fitting base learners on the full estimation sample are returned.If combined with cvalid, the cross-fitted predicted values are returned for each base learner.
Tables.After estimation, pystacked can report a table of in-sample (both cross-validated and full-sample) and, optionally, out-of-sample (or holdout sample) performance for both the stacking regression and the base learners.For regression problems, the table reports the root mean-squared prediction error (RMSPE).For classification problems, a confusion matrix is reported.The default holdout sample used for out-of-sample performance with the holdout option is all observations not included in the estimation.Alternatively, the user can specify the holdout sample explicitly using the syntax holdout(varname).The table can be requested after estimation as a replay command or as part of the pystacked estimation.

pystacked , table holdout[(varname)]
Graphs.pystacked can also create graphs of in-sample and, optionally, out-of-sample performance for both the stacking regression and the base learners.For regression problems, the graphs compare predicted and actual values of depvar.For classification problems, the default is to generate receiver operator characteristic (ROC) curves.Optionally, histograms of predicted probabilities are reported.As with the table option, the default holdout sample used for out-of-sample performance is all observations not included in the estimation, but the user can instead specify the holdout sample explicitly.The table can be requested after estimation or as part of the pystacked estimation command.The graph option on its own reports the graphs using pystacked's default settings.Because graphs are produced using Stata's twoway, roctab and histogram commands, the user can control either the combined graph (graph(options)) or the individual learner graphs (lgraph(options)) by passing options to these commands.

General options
A full list of general options is provided in the pystacked help file.We list only the most important general options here: type(string) allows reg(ress) for regression problems or class(ify) for classification problems.The default is regression.finalest(string) selects the final estimator used to combine base learners.The default is non-negative least squares without an intercept and the additional constraint that weights sum to 1 (nnls1).Alternatives are nnls0 (non-negative least squares without an intercept and without the sum-to-one constraint), singlebest (use the base learner with the minimum MSE), ls1 (least squares without an intercept and with the sum-to-one constraint), ols (ordinary least squares) or ridge for (logistic) ridge, which is the scikit-learn default.folds(integer ) specifies the number of folds used for cross-validation.The default is 5. foldvar(varname) is the integer fold variable for cross-validation.bfolds(integer ) sets the number of folds used for base learners that use cross-validation (e.g.cross-validated lasso); the default is 5.
pyseed(integer ) sets the Python seed.Note that since pystacked uses Python, we also need to set the Python seed to ensure replicability.There are three options: 1. pyseed(-1) draws a number between 0 and 10 8 in Stata which is then used as a Python seed; this is pystacked's default behavior.This way, one only needs to deal with the Stata seed.For example, set seed 42 is sufficient, as the Python seed is generated automatically.
2. Setting pyseed(x) with any positive integer x allows to control the Python seed directly.
njobs(integer ) sets the number of jobs for parallel computing.The default is 0 (no parallelization), -1 uses all available CPUs, -2 uses all CPUs minus 1. backend(string) backend used for parallelization.The default is 'threading'.voting selects voting regression or classification which uses pre-specified weights.By default, voting regression uses equal weights; voting classification uses a majority rule.voteweights(numlist) defines positive weights used for voting regression or classification.The length of numlist should be the number of base learners -1.The last weight is calculated to ensure that the sum of weights equals 1. sparse converts predictor matrix to a sparse matrix.This conversion will only lead to speed improvements if the predictor matrix is sufficiently sparse.Not all learners support sparse matrices and not all learners will benefit from sparse matrices in the same way.One can also use the sparse pipeline to use sparse matrices for some learners but not for others.printopt prints the default options for specified learners.Only one learner can be specified.This is for information only; no estimation is done.See Section 3.5 for examples.showopt prints the options passed on to Python.showpy prints Python messages.

Base learners
The base learners are chosen using the option method(string) in combination with type(string).The latter can take the value reg(ress) for regression and class for classification problems.Table 1 provides an overview of supported base learners and their underlying scikit-learn routines.
cmdopt*(string) (Syntax 1) and opt(string) (Syntax 2) are used to pass options to the base learners.Due to the large number of options, we do not list all options here.We instead provide a tool that lists options for each base learner.For example, to get the default options for lasso with cross-validated penalty, type The naming of the options follows scikit-learn.Allowed settings for each option can be inferred from the scikit-learn documentation.We strongly recommend that the user reads the scikit-learn documentation carefully.1

Learner-specific predictors
By default, pystacked uses the same set of predictors for each base learner.Using the same predictors for each method is often not desirable as the optimal set of predictors may vary across base learners.For example, when using linear machine learners such as the lasso, adding polynomials, interactions and other transformations of the base set of predictors might greatly improve out-of-sample prediction performance.The inclusion of transformations of base predictors is especially worth considering if the base set of observed predictors is small (relative to the sample size) and the relationship between outcome and predictors is likely non-linear.Tree-based methods (e.g., random forests  and boosted trees), on the other hand, can detect certain types of non-linear patterns automatically.While adding transformations of the base predictors may still lead to performance gains, the added benefit is less striking relative to linear learners and might not justify the additional costs in terms of computational complexity.
There are two approaches to implement learner-specific sets of predictors: Pipelines, discussed in the next section, can be used to create some transformations on the fly for specific base learners.A more flexible approach is the xvars*(predictors) option, which allows specifying predictors for a particular learner.xvars*() supports standard Stata factor variable notation.

Pipelines
scikit-learn uses pipelines to pre-preprocess input data on the fly.In pystacked, pipelines can be used to impute missing values, create polynomials and interactions, and to standardize predictors.Table 2 lists the pipelines currently supported by pystacked.
Remarks.First, regularized regressors (i.e., the methods lassoic, lassocv, ridgecv and elasticcv) use the stdscaler pipeline by default.pipe*(nostdscaler) disables this behavior.Second, the stdscaler0 pipeline is useful in combination with sparse, which transforms the predictor matrix into a sparse matrix.stdscaler0 does not center predictors so that the predictor matrix retains its sparsity property.

Applications
This section demonstrates how to apply pystacked for regression and classification tasks.Before we discuss stacking, we first show how to use pystacked as a 'regular' machine learning program for fitting a single supervised machine learner (see next subsection).We then illustrate stacking regression and stacking classification in Section 4.2 and 4.3, respectively.

Single base learner
We import the Pace and Barry (1997) California house price data, and split the sample randomly into training and validation partitions using a 75/25 split.The aim of the prediction task is to predict median house prices (medhousevalue) using a set of house price characteristics. 2We prepare the data for analysis as follows.
. clear all .use https://statalasso.github.io/dta/cal_housing.dta, clear .set seed 42 .gen train=runiform() .replace train=train<.75(20,640 real changes made) .replace medh = medh/10e3 variable medhousevalue was long now double (20,640 real changes made) Gradient-boosted trees.As a first example, we use pystacked to fit gradient-boosted regression trees and save the out-of-sample predicted values.Since we consider a regression task rather than a classification task, we specify type(reg) (which is also the default).The option method(gradboost) selects gradient boosting.We will later see that we can specify more than one learner in methods(), and that we can also fit gradient-boosted classification trees.The output shows the stacking weights associated with each base learner.Since we only consider one method, the output is not particularly informative and simply shows a weight of one for gradient boosting.Yet, pystacked has fitted 100 boosted trees (the default) in the background using scikit-learn's ensemble.GradientBoostedRegressor.
Before we tune our gradient boosting learner, we retrieve a list of available options.The default options for gradient boosting can be listed in the console, and the scikitlearn documentation provides more detail on the allowed parameters of each option.
. We can then compare the performance across the three models using the out-ofsample mean-squared prediction error (MSPE): . gen double res_gb1_sq=(medh-yhat_gb1)^2 if !train (15,448  The initial gradient booster achieves an out-of-sample MSPE of 29.87.The second gradient booster uses a reduced learning rate of 0.01 and performs much worse, with an MSPE of 71.10.The third gradient booster performs only slightly worse than the first, illustrating the trade-off between the learning rate and the number of trees.
Pipelines.We can make use of pipelines to pre-process our predictors.This is especially useful in the context of stacking when we want to, for example, use second-order polynomials of predictors as inputs for one method, but only use elementary predictors for another method.Here, we compare lasso with and without the poly2 pipeline: . pystacked medh longi-medi if train, type(reg) methods(lassocv) We could replace pipe1(poly2) with xvars1(c.(medhlongi-medi)##c.(medhlongi -medi)).In fact, the latter is more flexible and allows, for example, to create interactions for some predictors and not for others.
We again calculate the out-of-sample MSPE: . gen double res_lasso1_sq=(medh-yhat_lasso1)^2 if !train (15,448  The poly2 pipeline improves the performance of the lasso, indicating that squared and interaction terms constitute important predictors.However, the lasso does not perform as well as gradient boosting in this application.

Stacking regression
We now consider a stacking regression application with five base learners: (1) linear regression, (2) lasso with penalty chosen by cross-validation, (3) lasso with secondorder polynomials and interactions, (4) random forest with default settings, and (5) gradient boosting with a learning rate of 0.01 and 1000 trees.That is, we use the lasso twice-once with and once without the poly2 pipeline.Indeed, nothing keeps us from using the same algorithm multiple times.This way, we can combine the same algorithm with different settings.
Note the numbering of the pipe*() and cmdopt*() options below.We apply the poly2 pipe to the first and third methods (ols and lassoic).We also change the default learning rate and the number of estimators for gradient boosting (the 5th estimator).
. The above syntax becomes a bit difficult to read with many methods and many options.We offer an alternative syntax for easier use with many base learners.The stacking weights shown in the output determine how much each method contributes to the final stacking predictor.In this example, OLS and lasso based on both sets of predictors all receive a weight of zero.Random forest receives a weight of 83.8%, and gradient boosting contributes the remaining 16.2% of the weight to the final predictor.
Predicted values.In addition to the stacking predicted values, we can also get the predicted values of each base learner using the basexb option: Plotting.pystacked also comes with plotting features.The graph option creates a scatter plot of predicted values on the vertical and observed values on the horizontal axis for stacking and each base learner, see Figure 1.The black line is a 45-degree line and shown for reference.Since pystacked with graph can be used as a post-estimation command, there is no need to re-run the stacking estimation.
. pystacked, graph(scheme(sj)) lgraph(scheme(sj)) holdout Number of holdout observations: 5192 Figure 1 shows the out-of-sample predicted values.To see the in-sample predicted values, simply omit the holdout option.Note that the holdout option will not work if the estimation was run on the whole sample.The example below is more complicated.We go through it step-by-step: • We use five base learners: logistic regression, two gradient boosters and two neural nets.• We apply the poly2 pipeline to logistic regression, which creates squares and interaction terms of the predictors, but not to other methods.• We employ gradient boosting with 600 and 1000 classification trees.
• We consider two specifications for the neural nets: one neural net with two hidden layers of 5 nodes each, and another neural net with a single hidden layer of 5 nodes.• Finally, we use type(class) to specify that we consider a classification task and njobs(8) switches parallelization on utilizing 8 cores.
. As in the previous regression example, gradient boosting receives the largest stacking weight and thus contributes most to the final stacking prediction.
Confusion matrix.Confusion matrices allow comparing actual and predicted outcomes in a 2 × 2 matrix.pystacked provides a compact table format that combines confusion matrices for each base learner and for the final stacking classifier, both for the training and validation partition.
Plotting.pystacked supports ROC curves which allow assessing the classification performance for varying discrimination thresholds.The y-axis in a ROC plot corresponds to sensitivity (true positive rate) and the x-axis corresponds to 1-specificity (false positive rate).The Area Under the Curve (AUC) displayed below each ROC plot is a common evaluation metric for classification problems. .

Conclusion
In this article, we introduce the Stata program pystacked.The program not only makes a range of popular supervised machine learners available, such as regularized regression, random forests, and gradient-boosted trees, but also facilitates combining multiple learners for stacking regression or classification.pystacked comes with a number of practically relevant features: for example, sparse matrix support, learner-specific predictors, pipelines for predictor transformations, and two alternative syntaxes.Nevertheless, we also see the potential for improvement in at least two directions.First, pystacked offers few diagnostics for individual learners.Features such as variable importance plots and coefficient estimates for parametric learners could be added in later versions.Secondly, the deep learning features of pystacked are currently relatively limited and could be improved by adding support for other deep learning algorithms.

Algorithm 1 :
Stacking regression. 1. Cross-validation: a. Split the sample I = {1, . . ., n} randomly into K partitions of approximately equal size.These partitions are referred to as folds.Denote the set of observations in fold k = 1, . . ., K as I k , and its complement as I c k such that I c k = I \ I k .I k constitutes the step-k validation set and I c k the step-k training sample.b.For each fold k = 1, . . ., K and each base learner j = 1, . . ., J, fit the supervised machine learner j to the training data I c k and obtain out-of-sample predicted values ŷ−k (j),i for i ∈ I k .2. Final learner: Fit the final learner to the full sample.The default choice is nonnegative least squares (NNLS) with the additional constraint that coefficients sum to one: min w1,...,w J pystacked depvar predictors if in , methods(string) cmdopt1(string) cmdopt2(string) ... pipe1(string) pipe2(string) ... xvars1(predictors) xvars2(predictors) ... general options where general options are discussed in Section 3.4 below.

Figure 1 :
Figure 1: Out-of-sample predicted values and observed values created using the graph option after stacking regression.

Table 1 :
Overview of machine learners available in pystacked.
Note:The first two columns list all allowed combinations of method(string) and type(string), which are used to select base learners.Column 3 provides a description of each machine learner.The last column lists the underlying scikit-learn learn routine.'CV penalty' indicates that the penalty level is chosen to minimize the cross-validated MSPE.'AIC/BIC penalty' indicates that the penalty level minimizes either either the Akaike or Bayesian information criterion.SVC refers to support vector classification.
RMSPE table.The table option allows comparing stacking weights with in-sample, cross-validated and out-of-sample RMSPEs.As with the graph option, we can use table as a post-estimation command: