Linear or smooth? Enhanced model choice in boosting via deselection of base-learners

Abstract

The specification of a particular type of effect (e.g., linear or non-linear) of a covariate in a regression model can be based on graphical assessment, on subject matter knowledge or on data-driven model choice procedures. For the latter variant, we present a boosting approach that is available for a large number of different model classes. Boosting is an indirect regularization technique that leads to variable selection and can easily incorporate non-linear or smooth effects as well. Furthermore, the algorithm can be adapted to automatically select whether to model a continuous variable with a smooth or a linear effect. We enhance this model choice procedure by compensating for the inherent bias towards the more complex effect, incorporating a pragmatic and simple deselection technique that was originally developed for enhanced variable selection. We illustrate our approach in the analysis of T3 thyroid hormone levels from a larger Galician cohort and investigate its performance in a simulation study.

1 Introduction

With the introduction of newer and more complex model classes (e.g., distributional regression; Rigby and Stasinopoulos, 2005; Kneib, 2013) and of more flexible types of effects, the toolbox of data analysts has become considerably richer in potential statistical modelling options over the last decades. However, this increased flexibility also comes with the burden of having to select the best possible option from a vast number of suitable solutions. This is also reflected by the introduction of P-splines (Eilers and Marx, 1996), based also on the work by Brian Marx, whom we remember with this special issue. P-splines allow for a very flexible form of effects for continuous variables, but at the same time their flexibility is regulated by penalization in order to ensure smooth effects. This trade-off between flexibility and simplicity is at the core of many discussions in statistical modelling, with the researchers involved trying to find the best solution for the research question at hand. Generally, the problem of selecting a model that is as complex as necessary but also as simple as possible can be tackled either from a subject-matter perspective or in a data-driven, automated manner. With the emergence of larger data sets and new data sources (e.g., proteomics, genomics), there is also an increased need for automated procedures to answer questions of dimensionality, complexity reduction and model choice.
A natural solution for complexity reduction in the context of statistical modelling is penalized regression. The most popular regularized regression techniques, such as the lasso (Tibshirani, 1996), ridge regression (Hoerl and Kennard, 1970) or the elastic net (Zou and Hastie, 2005), typically aim for linear effects. Boosting, on the other hand, is an indirect approach to regularized regression that can easily incorporate non-linear and smooth effects (as introduced by Schmid and Hothorn, 2008). For the smooth components, the boosting framework relies on the iterative application of P-splines (Eilers and Marx, 1996) as base-learners, which have also been extended to monotone or cyclical effects (Hofner et al., 2016). Boosting algorithms originally emerged from machine learning, but were later adapted to statistical modelling (Bühlmann and Hothorn, 2007) and are nowadays a versatile option to fit various model classes, including very complex ones (for an overview see Mayr et al., 2014a,b). For the remainder of this article, we will focus on statistical boosting algorithms in the sense of component-wise gradient boosting with regression models as base-learners (Bühlmann and Hothorn, 2007). It can be shown that, with linear base-learners, boosting with early stopping leads to solutions very similar to those of the lasso (Hepp et al., 2016).
In the context of dimensionality or complexity reduction, statistical boosting algorithms simultaneously allow one (a) to select the most influential variables from a potentially high-dimensional set of candidate variables; (b) to perform model choice by selecting the most appropriate type of effect via a decomposition into linear and non-linear parts; and (c) to estimate the final prediction model.
The notion of model choice in the context of statistical boosting was introduced by Kneib et al. (2009). The basic idea is to decompose the effect of a continuous variable into a linear component, represented by a linear base-learner, and a smooth component, represented by a non-linear base-learner. To avoid bias towards more complex base-learners, either when choosing between linear and smooth effects or in the presence of categorical variables, Hofner et al. (2011) developed a framework for unbiased model choice by incorporating penalized least squares base-learners.
Over the last years, however, practical experience suggests that statistical boosting still has a tendency to select overly complex models, particularly in settings with a relatively large number of observations n compared to the number of potential predictor variables p (Staerk and Mayr, 2021; Strömer et al., 2022). As overfitting is less problematic in such rather low-dimensional settings, the algorithm stops relatively late and tends to include many base-learners. The effect of some of these base-learners in the final model might be very small, as they were likely updated only once or twice; however, this behaviour limits the interpretability of the prediction model without having a relevant impact on prediction performance. This carries over to the question of model choice.
Further regularization might be indicated if variable selection or model choice is key. An approach that can be applied to many regularization techniques is stability selection (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013), which was later successfully transferred to boosting (Hofner et al., 2015; Mayr et al., 2016; Thomas et al., 2018). However, this approach is computationally expensive, and it remains unclear how the final model should be fitted.
Here, we propose to tackle this issue with an approach similar to the one described by Strömer et al. (2022) in the context of variable selection: we directly enforce model choice by deselecting more complex base-learners that do not contribute enough to the final model. In other words, if the smooth component of a continuous predictor was selected but does not have a larger impact on the predictive risk than the linear one, we can deselect it and re-fit the model with only the simpler linear component. We investigate the performance of this pragmatic and simple approach in a small simulation study and illustrate its merits by modelling thyroid hormone levels from a large Galician cohort based on different continuous and categorical predictor variables.

2 Methods

Our approach to model choice is based on boosting, which emerged in machine learning (Freund and Schapire, 1996) as a method for classification based on simple decision trees, but was later adapted to estimate regression models via gradient descent in function space (Friedman et al., 2000; Friedman, 2001). For given observations $(y_i, x_i^\top)$, $i = 1, \ldots, n$, and a loss function $\rho$, for example, the common $L_2$ loss $\rho(y, f(x)) = (y - f(x))^2$, the component-wise boosting algorithm aims to minimize the empirical risk
$$r = n^{-1} \sum_{i=1}^{n} \rho\left(y_i, f(x_i^\top)\right)$$
by estimating a statistical model $f$. While the loss function defines the type of regression setting (e.g., $L_2$ leading to least-squares mean regression, $L_1$ to median regression), the algorithm additionally allows the (pre-)specification of base-learners $h_j(x_l)$, $j = 1, \ldots, J$, for each predictor variable $x_l$, $l = 1, \ldots, p$, according to its potential effect on the outcome $y$, for example, linear, non-linear, smooth or spatial base-learners. The negative gradient vector $u$ of the loss function $\rho(y, f)$ is evaluated in iteration $m$ at the model fitted in the previous iteration, $\hat{f}^{[m-1]}(x)$:
$$u^{[m]} = -\left.\frac{\partial \rho(y, f)}{\partial f}\right|_{f = \hat{f}^{[m-1]}(x)}$$
(2.1)
and afterwards every base-learner is fitted to the negative gradient vector $u^{[m]}$. Starting with an offset value $\hat{f}^{[0]}$, for example, the mean of $y$ for mean regression, the algorithm fits all specified base-learners to the negative gradient vector of the loss function $u^{[m]}$ (which is evaluated at the model fit $\hat{f}^{[m-1]}$ from the previous iteration; see equation (2.1)) and chooses only the best-performing base-learner $h_{j^*}$. The model is then updated by $\hat{f}^{[m]} = \hat{f}^{[m-1]} + \nu\, \hat{h}_{j^*}$ in component $j^*$ with a fixed step length $0 < \nu \leq 1$, until the algorithm reaches a final stopping iteration $m_{\text{stop}}$. The step length is typically chosen to be small, for example, $\nu = 0.1$. The main tuning parameter is the stopping iteration $m_{\text{stop}}$, which controls the complexity of the final model. Higher values of $m_{\text{stop}}$ hence lead to larger and more complex models, while stopping the algorithm early leads to sparser and simpler models. Usually, the optimal stopping iteration is determined via cross-validation, subsampling or bootstrapping.
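To make these steps concrete, the following minimal R sketch implements component-wise gradient boosting for the $L_2$ loss with simple linear base-learners on simulated toy data; all names and settings are illustrative and not part of the original analysis code (the mboost package, used later, provides the full implementation).

```r
## Minimal sketch of component-wise L2 boosting with one linear base-learner per
## covariate (illustrative only; see the mboost package for the full implementation).
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - X[, 2] + rnorm(n)

nu    <- 0.1               # fixed step length
mstop <- 150               # stopping iteration (would normally be tuned)
beta  <- rep(0, p)         # aggregated coefficients of the linear base-learners
f_hat <- rep(mean(y), n)   # offset: mean of y for L2 (mean) regression

for (m in seq_len(mstop)) {
  u <- y - f_hat                                     # negative gradient of the L2 loss
  fit_j <- sapply(seq_len(p), function(j)            # fit each base-learner to u
    coef(lm(u ~ X[, j] - 1)))
  rss_j <- sapply(seq_len(p), function(j)            # residual sum of squares per fit
    sum((u - X[, j] * fit_j[j])^2))
  j_star <- which.min(rss_j)                         # best-performing base-learner
  beta[j_star] <- beta[j_star] + nu * fit_j[j_star]  # update only component j*
  f_hat <- f_hat + nu * X[, j_star] * fit_j[j_star]
}
round(beta, 2)   # covariates never selected keep a coefficient of exactly zero
```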
Note that in most cases the number of base-learners $J$ equals the number of candidate variables $p$ (one base-learner for each variable), but it is possible to specify multiple base-learners per variable, for example, in a model including both main effects and interactions, or for the decomposition into linear and non-linear effects that we discuss in this work. Conversely, one could also specify a single base-learner based on multiple variables, for example, in the case of spatial effects. As in each boosting iteration only the best-fitting base-learner is used, that is, either the effect of one base-learner $h_j$ is updated or a new one is incorporated into the model, one can perform automated variable selection by stopping the algorithm before convergence. Base-learners and their corresponding predictor variables that have never been selected up to that point are effectively excluded from the final model.

2.1 Boosting P-splines

P-splines are typically used as base-learners to model a potential non-linear effect of a predictor variable on the outcome (Schmid and Hothorn, 2008). P-spline base-learners can be expressed via a simple penalized least squares regression model, regardless of the distributional assumption of the overall model (Eilers and Marx, 2021; Fahrmeir et al., 2022, Ch. 8). The effect estimate for a P-spline base-learner $h_j(x_l)$ in any given iteration $m$ then boils down to the following penalized regression fit (dropping indices for simplicity)
$$\hat{u}^{[m]} = X\left(X^\top X + \lambda K\right)^{-1} X^\top u^{[m]} = S_\lambda u^{[m]},$$
where $X$ in this case is the design matrix for base-learner $h_j(x_l)$, $K$ is the corresponding penalty matrix, and $\lambda$ is the smoothing parameter. For P-splines, the design matrix is a B-spline matrix $B$ and the penalty matrix is a difference matrix. The smoother matrix, which plays a similar role to the standard hat matrix in unpenalized regression settings, is denoted by $S_\lambda$. In the most common implementation, P-spline base-learners are incorporated with cubic B-splines, 20 equidistant knots, and a second-order difference penalty (Schmid and Hothorn, 2008; Hofner et al., 2014). In standard spline approaches for additive models, one typically needs to choose the smoothing parameter $\lambda \geq 0$ for each smooth effect in an adequate way, for example, via cross-validation techniques. This parameter controls the smoothness of the spline fit via the penalization of the regression estimate. In settings with many smooth effects, this tuning can quickly become burdensome. When P-splines are used in boosting models, tuning each smoothing parameter is not necessary; instead, $\lambda$ is chosen for each smooth function such that it results in a fixed (and equal) number of degrees of freedom (Schmid and Hothorn, 2008; Hofner et al., 2011). The boosting algorithm updates only the fit of the best-fitting base-learner in each iteration, and only by a comparably small step length $\nu$. As each base-learner can be selected and updated multiple times, the flexibility of the final spline fit depends on the number of boosting iterations and adapts itself to the data.
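As an illustration of this penalized least-squares view, the sketch below fits a single P-spline base-learner to a current gradient vector and chooses $\lambda$ such that the degrees of freedom $\mathrm{df}(\lambda) = \mathrm{tr}(2S_\lambda - S_\lambda S_\lambda^\top)$ (the definition used below in Section 2.2) hit a small fixed target; the toy data, knot placement and target value are assumptions for illustration rather than the exact settings of mboost.

```r
## Sketch: one penalized least-squares fit of a P-spline base-learner to the
## current negative gradient u (cubic B-splines, equidistant knots,
## second-order difference penalty); illustrative settings only.
library(splines)

set.seed(1)
x <- seq(0, 1, length.out = 200)
u <- sin(2 * pi * x) + rnorm(200, sd = 0.3)      # stand-in for the gradient vector

B <- bs(x, knots = seq(0.05, 0.95, length.out = 20),
        degree = 3, intercept = TRUE)            # B-spline design matrix
D <- diff(diag(ncol(B)), differences = 2)        # second-order difference operator
K <- crossprod(D)                                # penalty matrix K = D'D

## smoother matrix S_lambda = B (B'B + lambda * K)^{-1} B'
smoother <- function(lambda) B %*% solve(crossprod(B) + lambda * K, t(B))

## degrees of freedom df(lambda) = tr(2 S - S S') (cf. Hofner et al., 2011)
df_lambda <- function(lambda) {
  S <- smoother(lambda)
  sum(diag(2 * S - tcrossprod(S)))
}

## choose lambda so that the base-learner has a small, fixed flexibility
## (here df = 4; the decomposition in Section 2.2 uses df = 1 for the
## centred smooth deviation instead)
lambda <- exp(uniroot(function(l) df_lambda(exp(l)) - 4,
                      interval = c(-5, 25))$root)

u_hat <- smoother(lambda) %*% u                  # fitted values S_lambda %*% u
```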
Figure 1 illustrates this aspect, displaying in the upper row the spline fit after different numbers of boosting iterations and in the bottom row a smoothing spline fitted to the current residuals at that iteration (see also Mayr and Hofner, 2018). One can nicely observe how the P-spline fitted to the residuals (corresponding here to $u^{[m]}$) adapts to the overall data and the true function over the iterations, and only eventually starts to overfit. The base-learner hence learns more from $u^{[m]}$ in every iteration, thereby reducing the structure in the residuals (lower plot) and moving closer to the observations (upper plot). Similarly, multiple P-splines can be fitted jointly and adapt themselves to the required flexibility without the smoothing parameter $\lambda$ having to be actively selected.
Figure 1 Illustration of the iterative fitting process when boosting a single P-spline. The figure visualizes the influence of the stopping iteration $m$ on the complexity of the spline, from oversmoothing (small $m$, left) to overfitting (larger $m$, right). The dashed line in the upper plot displays the true underlying function. The red solid line (top) is the boosting fit $\hat{f}^{[m]}$ at iteration $m$. For the same iteration $m$, the blue line (bottom) is a classical cubic smoothing spline displaying the remaining structure of the residuals (which correspond in this $L_2$ case to the gradient of the loss).

2.2 Model choice in boosting

As discussed above, boosting with early stopping allows for variable selection, as only one base-learner is updated per boosting iteration. Hence, not all variables will be incorporated into the final model. If we additionally define separate base-learners per effect type (linear, smooth, interaction) for each variable, the selection of base-learners naturally leads to the selection of different effect types, that is, model choice. It is obvious that a linear effect $\beta_0 + x_l\beta_l$ is less flexible than a smooth effect $f_j(x_l)$, which therefore has a tendency to be selected more frequently, even with the small degrees of freedom applied in the boosting context. It should be noted that smooth effects incorporate the linear effect (in the case of second-order differences) even for $\lambda \to \infty$ (the so-called null space), and hence the degrees of freedom cannot become arbitrarily small. To allow a fair choice between effect types, Kneib et al. (2009) suggested decomposing the smooth effect
$$f_j(x_l) = \beta_0 + x_l\beta_l + f_{j,\text{dev}}(x_l)$$
(2.2)
with intercept $\beta_0$, a linear effect $x_l\beta_l$ and the smooth deviation from the linear effect $f_{j,\text{dev}}(x_l)$. Following this approach, one now defines separate base-learners for (a) the intercept; (b) the linear effect; and (c) the smooth deviation from linearity; and chooses $\lambda$ for the smooth deviation (c) such that its degrees of freedom equal 1. The degrees of freedom are naturally equal to 1 for the intercept (a) and the linear effect (b). Hofner et al. (2011) showed that this leads to an (almost) unbiased selection when the degrees of freedom are defined as $\mathrm{df}(\lambda) := \mathrm{tr}\left(2S_\lambda - S_\lambda S_\lambda^\top\right)$.
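In practice, this decomposition can be specified directly with the R add-on package mboost (cf. Hofner et al., 2014); the following sketch on simulated toy data is only meant to illustrate the typical call, not the exact specification used in our analyses.

```r
## Sketch: decomposition (2.2) with mboost, one linear and one centred smooth
## base-learner (df = 1) per continuous variable; toy data for illustration.
library(mboost)

set.seed(2)
dat <- data.frame(x1 = runif(300), x2 = runif(300))
dat$y <- 2 * dat$x1 + sin(6 * dat$x2) + rnorm(300, sd = 0.5)

## in practice, covariates should be (mean-)centred and an explicit intercept
## base-learner may be added (cf. Hofner et al., 2011); the default offset of
## gamboost for Gaussian regression is the mean of y
fit <- gamboost(
  y ~ bols(x1, intercept = FALSE) + bbs(x1, center = TRUE, df = 1) +
      bols(x2, intercept = FALSE) + bbs(x2, center = TRUE, df = 1),
  data = dat,
  control = boost_control(mstop = 1000, nu = 0.1)
)

## tune the stopping iteration, e.g. via 25-fold bootstrap of the empirical risk
cvr <- cvrisk(fit, folds = cv(model.weights(fit), type = "bootstrap", B = 25))
fit <- fit[mstop(cvr)]   # set the model to the optimal stopping iteration
selected(fit)            # indices of the base-learners chosen up to mstop
```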
As before, the crucial parameter for the complexity of the model is the stopping iteration. If the boosting algorithm chooses only the linear effect of $x_l$, this simplifies the model and shows that no smooth effect is needed. If, on the other hand, only the smooth deviation is chosen, this leads to a smooth effect centred around zero. When both the linear and the smooth component are chosen, this leads to a smooth effect with 'trend'. It is noteworthy that the smooth base-learner with one degree of freedom for the deviation from linearity still allows the model to fit much more flexible overall effects, with arbitrarily high degrees of freedom for later stopping iterations (comparable to Figure 1).

2.3 Enhanced model choice with deselection

As discussed in the Introduction, boosting models eventually tend to select too many different base-learners. If variable selection or model choice is of key interest, additional sparsity is warranted. Our approach builds on the classical decomposition into linear and smooth components of continuous variables as described in Section 2.2, but additionally incorporates a recent deselection procedure (Strömer et al., 2022) that was developed for enhanced variable selection. The core idea is to identify selected base-learners with only minor importance for an initial boosting model and to remove them from the set of potential predictors before boosting the model again on the subset of base-learners that were initially selected and have not been deselected. In the second boosting run, the same number of iterations $m_{\text{stop}}$ is used as in the initial boosting model. To measure the relevance of base-learner $h_j(x_l)$, Strömer et al. (2022) propose to rely on the attributable risk reduction $R_j$:
$$R_j = \sum_{m=1}^{m_{\text{stop}}} I\!\left(j = j^{*[m]}\right)\left(r^{[m-1]} - r^{[m]}\right), \quad j = 1, \ldots, J,$$
where $I(\cdot)$ is an indicator function, $j^{*[m]}$ identifies the base-learner that was selected for the update in boosting iteration $m$, and $r^{[m]}$ is the empirical risk (loss function evaluated on the underlying training data) at iteration $m$. In other words, $R_j$ represents the risk reduction that can be attributed to base-learner $h_j$, $j = 1, \ldots, J$, over the course of the boosting iterations. A base-learner is deselected if its contribution $R_j$ to the overall risk reduction of the model from iteration 0 to $m_{\text{stop}}$ is smaller than a pre-specified threshold $\tau$, that is,
$$\frac{R_j}{r^{[0]} - r^{[m_{\text{stop}}]}} < \tau.$$
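The attributable risk reductions $R_j$ and this threshold rule can be computed directly from a fitted mboost model; the sketch below continues the previous example and assumes that the in-bag risk path and the list of base-learners are available via the fit$risk() and fit$baselearner accessors (internal parts of the model object rather than a formally documented interface).

```r
## Sketch: attributable risk reductions R_j and the deselection threshold of
## Strömer et al. (2022), computed from the mboost fit of the previous snippet.
## Assumption: fit$risk() returns the in-bag risk at iterations 0, ..., mstop.
r   <- fit$risk()
sel <- selected(fit)                     # base-learner updated in iterations 1, ..., mstop
drops <- -diff(r)                        # per-iteration reductions r^[m-1] - r^[m]
stopifnot(length(drops) == length(sel))  # guard for the assumption above

nbl <- length(fit$baselearner)           # number of specified base-learners (assumed accessor)
Rj  <- tapply(drops, factor(sel, levels = seq_len(nbl)), sum)
Rj[is.na(Rj)] <- 0                       # base-learners never selected contribute nothing
names(Rj) <- names(fit$baselearner)

tau  <- 0.01                             # pragmatic threshold (see below)
keep <- Rj / sum(drops) >= tau           # original rule: share of total risk reduction
round(Rj / sum(drops), 3)                # relative contribution per base-learner
```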
Strömer et al. (2022) propose to use a pragmatic threshold of $\tau = 0.01$. In our context with the decomposition of smooth effects, simply applying this procedure could lead to biased results: let $R_{\text{lin}}(x_l)$ denote the risk reduction attributed to the linear base-learner $h_{\text{lin}}(x_l) = x_l\beta_l$ of variable $x_l$, and $R_{\text{sm}}(x_l)$ the one attributed to the smooth deviation from the linear effect, $h_{\text{sm}}(x_l) = f_{j,\text{dev}}(x_l)$, for the same variable (see equation (2.2)). Due to the decomposition, the corresponding individual base-learners run a higher risk of not passing the threshold. In the most extreme case, both base-learners (and therefore the corresponding variable $x_l$) are deselected because they individually do not pass the threshold, while in the same scenario another, for example categorical, variable $x_k$ remains in the model although its attributed risk reduction $R_k = R_k(x_k)$ is actually smaller than the combined one of variable $x_l$:
$$\frac{R_{\text{lin}}(x_l)}{r^{[0]} - r^{[m_{\text{stop}}]}} < \tau, \qquad \frac{R_{\text{sm}}(x_l)}{r^{[0]} - r^{[m_{\text{stop}}]}} < \tau, \qquad \frac{R_k(x_k)}{r^{[0]} - r^{[m_{\text{stop}}]}} > \tau, \quad \text{but}$$
$$R_k = R_k(x_k) < R_{\text{lin}}(x_l) + R_{\text{sm}}(x_l) = R_l.$$
To overcome this, we adapted the original deselection procedure and deselect all base-learners referring to the variable $x_l$ only if
$$\frac{R_{\text{lin}}(x_l) + R_{\text{sm}}(x_l)}{r^{[0]} - r^{[m_{\text{stop}}]}} < \tau$$
to ensure that the decomposition does not lead to a tendency to deselect these variables. To enforce simpler and more interpretable models, we additionally propose to further adapt the procedure by Strömer et al. (2022) to enhance model choice. The base-learner $h_{\text{sm}}(x_l)$ should be deselected if
$$R_{\text{sm}}(x_l) < R_{\text{lin}}(x_l).$$
In other words, we remove the spline base-learner representing the deviation from the linear trend when its contribution to the risk reduction is smaller than that of the linear base-learner. The justification for this procedure is pragmatic: in case of doubt, go for the simpler solution. This does not mean that the deviation from the linear trend is considered non-existent; rather, our approach aims at ensuring that only variables with an effect clearly deviating from linearity are actually modelled with a spline.
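Continuing the sketch above, the two adapted rules and the subsequent re-fit with the same stopping iteration could look as follows; the pairwise grouping of base-learner indices per variable mirrors the order in which the base-learners were specified in the toy formula and is an assumption of this illustration, not part of the original code.

```r
## Sketch: adapted deselection for model choice (Section 2.3), continuing the
## previous snippets; columns of 'pairs' hold the (linear, smooth) base-learner
## indices of each continuous variable as specified in the toy formula.
total <- sum(drops)                      # overall risk reduction r^[0] - r^[mstop]
pairs <- matrix(seq_len(nbl), nrow = 2)  # here: x1 -> (1, 2), x2 -> (3, 4)

keep <- logical(nbl)
for (k in seq_len(ncol(pairs))) {
  lin <- pairs[1, k]
  sm  <- pairs[2, k]
  if ((Rj[lin] + Rj[sm]) / total >= tau) {          # rule 1: judge the variable as a whole
    keep[lin] <- Rj[lin] > 0                        # keep only base-learners selected initially
    keep[sm]  <- Rj[sm] > 0 && Rj[sm] >= Rj[lin]    # rule 2: drop the spline if it reduced
  }                                                 #         the risk less than the linear part
}

## re-fit only the remaining base-learners, with the same mstop as before
bl_terms  <- names(fit$baselearner)[keep]
fit_desel <- gamboost(reformulate(bl_terms, response = "y"), data = dat,
                      control = boost_control(mstop = mstop(fit), nu = 0.1))
```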

3 Empirical results

3.1 Simulation

Simulations conducted by Hofner et al. (2011) showed a biased selection of categorical base-learners with more factor levels and of smooth base-learners compared to linear base-learners. To overcome this bias, the authors proposed to use a different, more appropriate definition of the degrees of freedom and to assign equal degrees of freedom (i.e., usually df = 1, as described in more detail in Hofner et al., 2011) to all competing base-learners. Simulations indicate that despite these adjustments boosting still has a tendency to select the more complex base-learners even when not needed; for example, the smooth base-learner gets selected even though the true underlying effect is strictly linear. To illustrate this, we first use a setup with six informative variables, of which five have a linear effect and one a non-linear effect, as well as 20 completely non-informative variables (similar to power case 3 in Hofner et al., 2011):
$$y_i = 2x_{i1} - x_{i2} + x_{i3} + 2x_{i4} + 3x_{i5} + \sin\!\left(4z_{i1}^2 - 0.6\,(2z_{i1})^3\right) + \varepsilon_i$$
As an additional scenario, we consider a model with four informative linear predictors, three informative non-linear predictors and only six non-informative predictors, with a reduced noise term $\varepsilon_i$ (sd = 1 compared to sd = 1.95 in the first scenario). To perform model choice via the decomposition as in Section 2.2, one linear and one smooth base-learner with df = 1 are specified for each variable:
$$y_i = 2x_{i1} - x_{i2} + 0.5x_{i3} + 3x_{i4} + \sin\!\left(4z_{i1}^2 - 0.6\,(2z_{i1})^3\right) + 2|z_{i2}| + 10z_{i3}^2 + \varepsilon_i$$
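For illustration, data for the first scenario could be generated along the following lines; the sample size and the covariate distributions are assumptions made for this sketch, as only the model structure and the noise standard deviation are given above.

```r
## Sketch: data generation for the first simulation scenario (5 linear + 1 smooth
## informative variables, 20 non-informative variables); n and the covariate
## distributions are illustrative assumptions.
set.seed(3)
n <- 500
X <- matrix(runif(n * 25), nrow = n)           # x1-x5 informative, x6-x25 pure noise
z <- runif(n)                                  # variable with the smooth effect
y <- 2 * X[, 1] - X[, 2] + X[, 3] + 2 * X[, 4] + 3 * X[, 5] +
  sin(4 * z^2 - 0.6 * (2 * z)^3) + rnorm(n, sd = 1.95)
sim_dat <- data.frame(y = y, x = X, z = z)     # columns y, x.1, ..., x.25, z
```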
Results show that in the first scenario 9.8% of variables with a true linear effect were falsely incorporated also or only with a smooth base-learner (averaged over all truly linear variables and all simulation runs). This proportion increased to 47% in the second scenario. Applying the proposed deselection procedure for enhanced variable selection and model choice from Section 2.3, the proportion of variables incorrectly identified as representing a smooth effect decreased to 2.2% in the setup of Hofner et al. (2011) and to 5.8% in the second scenario. On the other hand, the proportion of variables with a true smooth effect that were incorrectly included only with a linear term increased from 20% to 23% in the first and from 1.3% to 5.3% in the second scenario. The changes in prediction accuracy due to deselection were only minimal: the Root Mean Square Error (RMSE) on independent test data decreased slightly from 1.887 (95% CI: [1.782; 1.992]) for the standard boosting algorithm to 1.885 [1.782; 1.992] after performing deselection both for variable selection and model choice in the first scenario, and from 1.098 [1.039; 1.166] to 1.097 [1.038; 1.182] in the second scenario. These results were similar for seven additional scenarios; more detailed information on all scenarios (including separate results for variable selection and model choice) can be found in the Supplementary Material. The code to reproduce the results via the R add-on package mboost (Hothorn et al., 2022) is available on GitHub (https://github.com/wistubaT/ModelChoice_SmoothOrLinear).

3.2 Modelling thyroid hormone levels from the AEGIS study

The A Estrada Glycation and Inflammation Study (AEGIS) is a cross-sectional observational study conducted in Galicia, Northwestern Spain, including n = 1 516 participants from the adult population (Gude et al., 2017). Thyroid hormones are crucial for normal development and the proper functioning of physiological systems. Thyroid hormone synthesis is regulated by feedback mechanisms: decreased thyroid hormone levels lead to increased synthesis of hypothalamic thyrotropin-releasing hormone (TRH), which increases the secretion of thyroid-stimulating hormone (TSH) (Babić Leko et al., 2021). TSH stimulates the production of the thyroid hormones thyroxine (T4) and triiodothyronine (T3). Variation in TSH and thyroid hormone levels may indicate that normal thyroid function has been altered. Genetic factors account for up to 65% of the inter-individual variation in TSH and thyroid hormone levels, but demographic factors (age and sex), intrinsic factors (stress), and environmental factors such as diet, smoking, alcohol and exercise also influence thyroid function. Anemia and other blood abnormalities are common in thyroid function abnormalities (cf. Babić Leko et al., 2021). Red cell mass is frequently reduced in hypothyroid patients and is typically increased in hyperthyroidism. While neutropenia has long been associated with hyperthyroidism, as recently reviewed by Scappaticcio et al. (2021), only a few studies describe the relationship between thyroid function and blood cell components in euthyroid subjects (e.g., Bremner et al., 2012). The aim of this study was to examine the association between thyroid function and hematological parameters.
We consider a total of p = 33 potential predictor variables, of which 29 are continuous and four categorical. The final sample size for a complete-case analysis was reduced to n = 1 352. A first graphical assessment of the data showed potentially non-linear associations between some of the continuous predictor variables and thyroid hormones, particularly for T3, on which we will focus in the following. The results of the same analyses for T4 and TSH can be found in the Supplementary Material.
For the classical boosting fit with the $L_2$ loss we incorporated the decomposition with df = 1 from Section 2.2 for all continuous variables. After optimal stopping via 25-fold bootstrapping, 32 of the total $2 \times 29 + 4 = 62$ base-learners were selected, representing 24 variables. Performing the proposed enhanced variable selection and model choice via deselection from Section 2.3 resulted in a final model with 16 base-learners (14 variables), of which only three entered the model with a smooth effect. The procedure excluded ten variables completely and removed the non-linear spline base-learner for four continuous variables.
The variables in the final model, in descending order w.r.t. overall risk reduction, were hemoglobin, age, monocytes percentage, red blood cells count, height, erythrocyte sedimentation rate, transferrin, ferritin, transferrin saturation, platelet mean volume, red cell distribution width, smoking, platelet distribution width and physical activity. The three variables with non-linear effects were transferrin saturation, ferritin and red cell distribution width. The two remaining categorical variables were smoking and physical activity (both with three categories). The deselection process adjusted the variables originally modelled with a smooth effect (monocytes percentage, transferrin saturation, age and height) to be included in the final model as linear predictors.
In order to further evaluate the stability of our proposed procedure, we repeated the fitting, deselection and re-fitting on 1 000 bootstrap samples of the original data set. The estimated effects from standard boosting and from boosting with the previously described deselection procedure, together with 95% bootstrap confidence intervals (Hofner et al., 2016), are displayed for selected variables in Figure 2. Similar graphics for all other continuous variables with non-linear effects identified by the classical approach, as well as a table with all selected base-learners, can be found in the Supplementary Material.
Figure 2 Results from the AEGIS study: the dashed lines refer to the estimated effects from standard boosting with model choice (left) and the enhanced approach with deselection (right). We show effects of two variables (red cell distribution width and ferritin) that remained with their smooth component in the model and two variables (transferrin saturation and age) that were switched to a linear effect after deselection. The shaded areas refer to empirical 95%-bootstrap confidence intervals.
These results are in accordance with other studies reporting that environmental factors (smoking, diet, physical exercise, body mass index) can affect thyroid function. Iron metabolism (reflected, among others, by hemoglobin, ferritin and transferrin) is also intricately connected to thyroid hormone metabolism. Thyroid hormone insufficiency may lead to iron deficiency and vice versa. Other factors like oxidative stress may also play a role (erythrocyte sedimentation rate). Finally, we should keep in mind that the thyroid gland is the organ most commonly affected by autoimmune disease.
To evaluate model fit and predictive performance we computed the RMSE both on the training and on the test data. For the standard boosting algorithm this resulted in RMSE = 0.878 (95% CI: [0.829; 0.907]) on the training and 0.959 [0.875; 0.992] on the test data. Incorporating the proposed deselection procedure led to slightly higher values: RMSE = 0.884 [0.839; 0.915] on training and 0.961 [0.880; 1.006] on test data (see also Figure 3).
Figure 3 Root Mean Square Errors (RMSE) on training and test data (generated via 1 000 bootstrap samples) for standard boosting with decomposition and for the deselection approach for enhanced model choice and variable selection, applied to T3 thyroid hormone levels from the AEGIS study.
To summarize, in the example of modelling T3 thyroid hormone levels from a larger Galician cohort with $n \gg p$, the original boosting approach with model choice led to a rather large and overly complex model with 24 variables, of which nine were incorporated with a smooth effect. The additional deselection approach for enhanced model choice and variable selection decreased the number of variables with a non-linear effect to only three, with a total of 14 variables in the model. This much simpler model leads to essentially the same data fit and prediction accuracy. For the other hormone levels the impact of the enhanced model choice via deselection was less pronounced, because standard boosting already led to smaller models with fewer variables with smooth effects (see Supplementary Material).

4 Discussion

We have proposed a simple and pragmatic approach to enhance model choice regarding the decision whether to include a continuous variable with a linear or a smooth effect in a boosted statistical model. We have illustrated, both in a simulation study and on thyroid hormone data, that our approach can help researchers decide on the type of effect, and at least in the considered settings it led to very promising results: the resulting models were sparser and incorporated only those non-linear effects that were really necessary, while leading to essentially the same fit and prediction accuracy.
There are, however, several points and limitations to consider. First, our approach actively changes the optimal model selected by the boosting procedure. As the boosting algorithm is typically tuned for predictive risk, it should hence be expected that deselecting components that were initially selected will yield some loss in prediction accuracy. The initial deselection step is controlled by the parameter $\tau$, which is set to 0.01 (as proposed by Strömer et al., 2022). Tuning this parameter is non-trivial, as on average the best prediction accuracy should be achieved without any deselection. Second, our model choice approach is designed to favour the simple linear component. The smooth component is only retained if it outperforms the linear part w.r.t. risk reduction; in any other case we stay with the linear effect. The argument for this is the general notion that in doubt one should always opt for the simplest model possible, but this obviously might not be favourable for all research questions. This choice (effectively a threshold of 0.5 on the smooth component's share of the combined risk reduction) is also not tuned and could lead to a loss of prediction accuracy. Third, the deselection procedure leads to a longer runtime, as the model effectively needs to be fitted twice. Fourth, our approach focuses on data-driven, automated variable selection and model choice. In many modelling situations, it might be more reasonable to decide on the particular type of effect for a variable based on subject matter knowledge or practical implications (what the model will be used for) rather than on an automated procedure. Fifth, our approach is simple and pragmatic, but not based on theoretical insights. Furthermore, inherited from the iterative updating of the boosting algorithm, there is no closed-form solution to determine standard errors for effect estimates. The construction of confidence intervals or even significance testing can hence only be performed via resampling techniques (Hofner et al., 2016; Mayr et al., 2017; Hepp et al., 2019).
With all that in mind, we still have reason to believe that this simple procedure can be a valuable option for practical data modelling. Automated model choice is an often cited and highlighted feature of statistical boosting, but its practical relevance over the last years has seemed limited. Combined with the proposed pragmatic deselection procedure, the decomposition of effects into linear and non-linear components could become a true asset, not only for the simple Gaussian models presented here. The development of more complex and flexible model classes (Kneib et al., 2021) also calls for methods to reduce this complexity again, in order to keep the models manageable for data analysts and interpretable for subject matter researchers.
Further research is warranted on how this procedure performs in multi-dimensional optimization problems where the same continuous variable could enter in different model components (like for location and scale, Mayr et al., 2012). Another field of future research might be to extend the procedure not only for selecting the type of effect, but also to allocate variables to different model components like in joint models for longitudinal and time-to-event data (Waldmann et al., 2017; Rappl et al., 2022).

Acknowledgements

This article would not have been possible without two fellow researchers and friends who are no longer with us and are deeply missed. The work of Professor Brian D. Marx († 25 November 2021) on P-splines laid the foundation for how we model smooth effects today. With his enthusiasm and support, Brian also triggered our commitment to statistical modelling in general. Professor Carmen Cadarso Suárez († 3 June 2022) was still with us when we started to work on this project and participated in the first meetings. Only Carmen's collaborative way of working and her talent for bringing people together made our collaboration and the work on the Galician cohort possible.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

References

Babić Leko M, Gunjača I, Pleić N, and Zemunik T (2021) Environmental factors affecting thyroid-stimulating hormone and thyroid hormone levels. International Journal of Molecular Sciences, 22, 6521.
Bremner AP, Feddema P, Joske DJ, Leedman PJ, O’Leary PC, Olynyk JK, and Walsh JP (2012) Significant association between thyroid hormones and erythrocyte indices in euthyroid subjects. Clinical Endocrinology, 76, 304–311.
Bühlmann P, and Hothorn T (2007) Boosting algorithms: Regularization, prediction and model fitting (with discussion). Statistical Science, 22, 477–522.
Eilers PHC, and Marx BD (1996) Flexible smoothing with B-splines and penalties (with discussion). Statistical Science, 11, 89–121.
Eilers PHC, and Marx BD (2021) Practical smoothing: The joys of P-splines. Cambridge University Press.
Fahrmeir L, Kneib T, Lang S, and Marx BD (2022) Regression: Models, methods and applications. Springer Nature, New York.
Freund Y, and Schapire R (1996) Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156. Morgan Kaufmann Publishers Inc.
Friedman JH (2001) Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.
Friedman JH, Hastie T, and Tibshirani R (2000) Additive logistic regression: A statistical view of boosting (with discussion). The Annals of Statistics, 28, 337–407.
Gude F, Díaz-Vidal P, Rúa-Pérez C, Alonso-Sampedro M, Fernández-Merino C, Rey-García J, Cadarso-Suárez C, Pazos-Couselo M, García-López JM, and Gonzalez-Quintela A (2017) Glycemic variability and its association with demographics and lifestyles in a general adult population. Journal of Diabetes Science and Technology, 11, 780–790.
Hepp T, Schmid M, Gefeller O, Waldmann E, and Mayr A (2016) Approaches to regularized regression: A comparison between gradient boosting and the lasso. Methods of Information in Medicine, 55, 422–430.
Hepp T, Schmid M, and Mayr A (2019) Significance tests for boosted location and scale models with linear base-learners. The International Journal of Biostatistics, 15.
Hoerl AE, and Kennard RW (1970) Ridge regression: Applications to nonorthogonal problems. Technometrics, 12, 69–82.
Hofner B, Hothorn T, Kneib T, and Schmid M (2011) A framework for unbiased model selection based on boosting. Journal of Computational and Graphical Statistics, 20, 956–971.
Hofner B, Mayr A, Robinzonov N, and Schmid M (2014) Model-based boosting in R: A hands-on tutorial using the R package mboost. Computational Statistics, 29, 3–35. doi: 10.1007/s00180-012-0382-5.
Hofner B, Boccuto L, and Göker M (2015) Controlling false discoveries in high dimensional situations: Boosting with stability selection. BMC Bioinformatics, 16, 144.
Hofner B, Kneib T, and Hothorn T (2016) A unified framework of constrained regression. Statistics and Computing, 26, 1–14. doi: 10.1007/s11222-014-9520-y.
Hothorn T, Bühlmann P, Kneib T, Schmid M, and Hofner B (2022) mboost: Model-based boosting. URL https://CRAN.R-project.org/package=mboost. R package version 2.9-7.
Kneib T (2013) Beyond mean regression. Statistical Modelling, 13, 275–303.
Kneib T, Hothorn T, and Tutz G (2009) Variable selection and model choice in geoadditive regression models. Biometrics, 65, 626–634.
Kneib T, Silbersdorff A, and Säfken B (2021) Rage against the mean—a review of distributional regression approaches. Econometrics and Statistics. doi: 10.1016/j.ecosta.2021.07.006.
Mayr A, Fenske N, Hofner B, Kneib T, and Schmid M (2012) Generalized additive models for location, scale and shape for high-dimensional data: A flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61, 403–427.
Mayr A, Binder H, Gefeller O, and Schmid M (2014a) The evolution of boosting algorithms. Methods of Information in Medicine, 53, 419–427.
Mayr A, Binder H, Gefeller O, and Schmid M (2014b) Extending statistical boosting. Methods of Information in Medicine, 53, 428–435.
Mayr A, and Hofner B (2018) Boosting for statistical modelling: A non-technical introduction. Statistical Modelling, 18, 365–384.
Mayr A, Hofner B, and Schmid M (2016) Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection. BMC Bioinformatics, 17, 288.
Mayr A, Schmid M, Pfahlberg A, Uter W, and Gefeller O (2017) A permutation test to analyse systematic bias and random measurement errors of medical devices via boosting location and scale models. Statistical Methods in Medical Research, 26, 1443–1460.
Meinshausen N, and Bühlmann P (2010) Stability selection (with discussion). Journal of the Royal Statistical Society, Series B, 72, 417–473.
Rappl A, Mayr A, and Waldmann E (2022) More than one way: Exploring the capabilities of different estimation approaches to joint models for longitudinal and time-to-event outcomes. The International Journal of Biostatistics, 18, 127–149.
Rigby RA, and Stasinopoulos D (2005) Generalized additive models for location, scale and shape (with discussion). Applied Statistics, 54, 507–554.
Scappaticcio L, Maiorino MI, Maio A, Esposito K, and Bellastella G (2021) Neutropenia in patients with hyperthyroidism: Systematic review and meta-analysis. Clinical Endocrinology, 94, 473–483.
Schmid M, and Hothorn T (2008) Boosting additive models using component-wise P-splines. Computational Statistics & Data Analysis, 53, 298–311.
Shah RD, and Samworth RJ (2013) Variable selection with error control: Another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55–80.
Staerk C, and Mayr A (2021) Randomized boosting with multivariable base learners for high-dimensional variable selection and prediction. BMC Bioinformatics, 22, 1–28.
Strömer A, Staerk C, Klein N, Weinhold L, Titze S, and Mayr A (2022) Deselection of base-learners for statistical boosting — with an application to distributional regression. Statistical Methods in Medical Research, 31, 207–224.
Thomas J, Mayr A, Bischl B, Schmid M, Smith A, and Hofner B (2018) Gradient boosting for distributional regression: Faster tuning and improved variable selection via noncyclical updates. Statistics and Computing, 28, 673–687.
Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Waldmann E, Taylor-Robinson D, Klein N, Kneib T, Pressler T, Schmid M, and Mayr A (2017) Boosting joint models for longitudinal and time-to-event data. Biometrical Journal, 59, 1104–1121.
Zou H, and Hastie T (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.

Supplementary Material


Published In

Statistical Modelling
Article first published online: August 18, 2023
Issue published: October 2023

Keywords

  1. boosting
  2. model choice
  3. prediction modelling
  4. sparsity
  5. splines

Rights and permissions

© 2023 The Author(s).

Authors

Affiliations

Andreas Mayr
Department of Medical Biometry, Informatics and Epidemiology, Faculty of Medicine, University of Bonn, Germany
Tobias Wistuba
Department of Medical Biometry, Informatics and Epidemiology, Faculty of Medicine, University of Bonn, Germany
Jan Speller
Department of Medical Biometry, Informatics and Epidemiology, Faculty of Medicine, University of Bonn, Germany
Francisco Gude
Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, Galicia, Spain
Benjamin Hofner
Paul-Ehrlich-Institut, Langen, Germany
Department of Medical Informatics, Biometry, and Epidemiology, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany

Notes

Address for correspondence: Andreas Mayr, Department of Medical Biometry, Informatics and Epidemiology, Faculty of Medicine, University of Bonn, Venusberg-Campus 1, 53127 Bonn Germany. E-mail: [email protected]
