3.1 Simulation
Simulations conducted by
Hofner et al. (2011) showed that base-learner selection is biased towards categorical base-learners with more factor levels and towards smooth base-learners over linear ones. To overcome this bias, the authors proposed to use a different, more appropriate definition of the degrees of freedom and to assign equal degrees of freedom (i.e., usually df = 1, as described in
Hofner et al. (2011) in more detail) to all competing base-learners. Simulations indicate that, despite these adjustments, boosting still tends to select the more complex base-learners even when they are not needed; for example, the smooth base-learner is selected even though the true underlying effect is strictly linear. To illustrate this, we first use a setup with six informative variables, of which five have a linear effect and one has a non-linear effect, as well as
completely non-informative variables (similar to
power case 3 in
Hofner et al., 2011):
As an additional scenario, we consider a model of four informative linear predictors, three informative non-linear predictors and only six non-informative predictors with reduced noise term
(sd = 1 compared to sd = 1.95 in the first scenario). To perform model choice via the decomposition as in Section 2.2, for each variable one linear and one smooth base-learner with df = 1 are specified:
Results show that in the first scenario 9.8% of variables with a true linear effect were incorrectly included with a smooth base-learner, either in addition to or instead of the linear one (averaged over all true linear variables and all simulation runs). This proportion increased to 47% in the second scenario. When the proposed deselection procedure for enhanced variable selection and model choice from Section 2.3 was applied, the proportion of variables incorrectly identified as having a smooth effect decreased to
in the case of the setup by
Hofner et al. (2011) and to 5.8% in the second scenario. On the other hand, the proportion of variables with a true smooth effect that were incorrectly included only with a linear term increased from 20% to 23% in the first and from 1.3% to 5.3% in the second scenario. The change in prediction accuracy due to deselection was minimal. The Root Mean Square Error (RMSE) on independent test data decreased slightly from
for the standard boosting algorithm to
after performing deselection both for variable selection and model choice in the first scenario and from
to
in the second scenario. Results were similar for seven additional scenarios; more detailed information on all scenarios (including separate results for variable selection and model choice) can be found in the Supplementary Material. The code to reproduce the results via the R add-on package
mboost (
Hothorn et al., 2022) is available on GitHub (
https://github.com/wistubaT/ModelChoice_SmoothOrLinear).
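As a rough illustration of the second scenario and of the model-choice error rate reported above, the following Python sketch (standard library only; it is not the authors' R code) generates data with four linear, three non-linear and six non-informative predictors and a noise term with sd = 1. The predictor distribution, the coefficients and the non-linear functions are assumptions made for illustration, as the text above only fixes the number of predictors of each type and the noise level. The function names are hypothetical.

```python
import math
import random


def simulate_scenario_two(n, seed=0):
    """Draw n observations with 4 linear, 3 non-linear and 6 noise predictors."""
    rng = random.Random(seed)
    betas = (2.0, -1.5, 1.0, -0.5)  # hypothetical linear coefficients
    rows, y = [], []
    for _ in range(n):
        # 13 predictors, uniform on [-1, 1] (assumed distribution)
        x = [rng.uniform(-1.0, 1.0) for _ in range(13)]
        linear = sum(b * x[j] for j, b in enumerate(betas))
        # hypothetical non-linear effects for predictors 4, 5 and 6
        smooth = math.sin(3.0 * x[4]) + x[5] ** 2 + math.cos(2.0 * x[6])
        # predictors 7 to 12 are non-informative and do not enter the response
        y.append(linear + smooth + rng.gauss(0.0, 1.0))  # noise with sd = 1
        rows.append(x)
    return rows, y


def false_smooth_proportion(selected, truly_linear):
    """Share of truly linear variables that were selected with a smooth term.

    `selected` holds one dict per simulation run, mapping a variable index to
    the set of base-learner types ("linear", "smooth") chosen for it.
    """
    hits = 0
    total = 0
    for run in selected:
        for var in truly_linear:
            total += 1
            if "smooth" in run.get(var, set()):
                hits += 1
    return hits / total
```

The second function averages over all truly linear variables and all simulation runs, and counts a variable as an error when a smooth base-learner was selected for it, with or without the linear one.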
3.2 Modelling thyroid hormone levels from the AEGIS study
The A Estrada Glycation and Inflammation Study (AEGIS) is a cross-sectional observational trial performed in Galicia, Northwestern Spain, including
participants of the adult population (
Gude et al., 2017). Thyroid hormones are crucial for normal development and the proper functioning of physiological systems. Thyroid hormone synthesis is regulated by feedback mechanisms: decreased thyroid hormone levels lead to increased synthesis of hypothalamic thyrotropin-releasing hormone (TRH) increasing the secretion of thyroid-stimulating hormone (TSH) (
Babić Leko et al., 2021). TSH stimulates the production of the thyroid hormones thyroxine (T4) and triiodothyronine (T3). Variation in TSH and thyroid hormone levels may indicate that normal thyroid function has been altered. Genetic factors account for up to 65% of inter-individual variation in TSH and thyroid hormone levels, but demographic factors (age and sex), intrinsic factors (stress), and environmental factors such as diet, smoking, alcohol and exercise also influence thyroid function. Anemia and other blood abnormalities are common in thyroid function abnormalities (cf.
Babić Leko et al., 2021). Red cell mass is frequently reduced in hypothyroid patients, and it is typically increased in hyperthyroidism. While neutropenia has long been associated with hyperthyroidism as recently reviewed in
Scappaticcio et al. (2021), only a few studies describe the relationship between thyroid function and blood cell components in euthyroid subjects (e.g.,
Bremner et al., 2012). The aim of this study was to examine the association between thyroid function and hematological parameters.
Overall, we consider p = 33 potential predictor variables, of which 29 are continuous and four are categorical. The final sample size for a complete case analysis was reduced to n = 1 352. A first graphical assessment of the data showed potentially non-linear associations between some of the continuous predictor variables and thyroid hormones, particularly for T3, on which we focus in the following. The results of the same analyses with T4 and TSH can be found in the Supplementary Material.
For the classical boosting fit with the loss we incorporated the decomposition with df = 1 from Section 2.2 for all continuous variables. After optimal stopping via 25-fold bootstrapping, of the total base-learners were selected, representing 24 variables. Performing the proposed enhanced variable selection and model choice via deselection from Section 2.3 resulted in a final model with 16 base-learners (14 variables), of which only three entered the model with a smooth effect. The procedure excluded ten variables completely and removed the non-linear spline base-learner for four continuous variables.
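The deselection step can be pictured with a small Python sketch. It assumes that the rule from Section 2.3 removes every base-learner whose share of the total risk reduction falls below a threshold, after which the model is refitted with the remaining base-learners. That rule, the threshold value `tau` and all numbers below are assumptions for illustration and are not results from the AEGIS analysis. The labels merely mimic the names of linear (`bols`) and smooth (`bbs`) base-learners in mboost.

```python
def deselect(risk_reduction, tau):
    """Return the base-learners kept after deselection, largest contribution first.

    risk_reduction: dict mapping a base-learner label to its attributable
                    risk reduction in the initial boosting fit.
    tau:            minimum share of the total risk reduction a base-learner
                    must reach to stay in the model (assumed rule).
    """
    total = sum(risk_reduction.values())
    kept = {name: r for name, r in risk_reduction.items() if r / total >= tau}
    return sorted(kept, key=kept.get, reverse=True)


# Hypothetical contributions of a linear and a smooth base-learner per variable.
example = {"bols(x1)": 6.0, "bbs(x1)": 0.05, "bols(x2)": 3.0, "bbs(x2)": 0.95}
print(deselect(example, tau=0.01))
```

In this toy example the smooth base-learner for `x1` contributes only 0.5% of the total risk reduction and is dropped, so `x1` would remain in the refitted model with a linear effect only, while `x2` keeps both components.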
The variables in the final model, in descending order of overall risk reduction, were hemoglobin, age, monocytes percentage, red blood cells count, height, erythrocyte sedimentation rate, transferrin, ferritin, transferrin saturation, platelet mean volume, red cell distribution width, smoking, platelet distribution width and physical activity. The three variables with non-linear effects were transferrin saturation, ferritin and red cell distribution width. The two remaining categorical variables were smoking and physical activity (both with three categories). The deselection process changed the originally smoothly modelled variables monocytes percentage, transferrin saturation, age and height so that they enter the final model as linear predictors.
In order to further evaluate the stability of our proposed procedure we repeated the fitting, deselection and re-fitting on 1 000 bootstrap samples of the original data set. The estimated effects from standard boosting and from boosting with the previously described deselection procedure with
-bootstrap confidence intervals (
Hofner et al., 2016) are displayed for selected variables in
Figure 2. Similar graphics for all other continuous variables with non-linear effects identified by the classical approach as well as a table with all selected base-learners can be found in the Supplementary Material.
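The resampling scheme behind such intervals can be sketched in Python as follows. This is a generic percentile bootstrap and not the authors' implementation: the `estimator` callable stands in for the full fit, deselection and re-fit pipeline, the confidence level is passed as an argument, and all function names are hypothetical.

```python
import random


def bootstrap_estimates(data, estimator, n_boot, seed=0):
    """Apply `estimator` to n_boot resamples of `data` drawn with replacement."""
    rng = random.Random(seed)
    return [estimator(rng.choices(data, k=len(data))) for _ in range(n_boot)]


def percentile_interval(values, level):
    """Percentile interval covering a share `level` of the bootstrap estimates."""
    ordered = sorted(values)
    alpha = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lower = ordered[round(alpha * last)]          # nearest-rank lower quantile
    upper = ordered[round((1.0 - alpha) * last)]  # nearest-rank upper quantile
    return lower, upper
```

For an estimated effect curve, the interval would be computed pointwise, that is, separately at each grid value of the predictor.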
These results are in accordance with other studies reporting that environmental factors (smoking, diet, physical exercise, body mass index) can affect thyroid function. Iron metabolism (among others hemoglobin, ferritin and transferrin) is also very intricately connected to thyroid hormone metabolism. Thyroid hormone insufficiency may lead to deficiency of iron and vice versa. Other factors like oxidative stress may also play a role (erythrocyte sedimentation rate). Finally, we should keep in mind that the thyroid gland is the organ most commonly affected by autoimmune disease.
To evaluate model fit and predictive performance we computed the RMSE both on the training as well as on the test data. For the standard boosting algorithm this resulted in
on the training and
on the test data. Additionally applying the proposed deselection procedure led to slightly higher values:
on training and
on test data (see also
Figure 3).
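For reference, the RMSE used here is the square root of the mean squared difference between observed and predicted values. A minimal Python sketch (the function name is hypothetical):

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Computing it once on the data used for fitting and once on held-out data gives the training and test values compared above.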
To summarize the results, in the example of modelling T3 thyroid levels from a larger Galician cohort with , the original boosting approach with model choice led to a rather large and overly complex model with 24 variables, of which nine were incorporated with a smooth effect. The additional deselection approach for enhanced model choice and variable selection reduced the number of variables with a non-linear effect to only three, with a total of 14 variables in the model. This much simpler model yields essentially the same data fit and prediction accuracy. For the other hormone levels the impact of the enhanced model choice via deselection was less pronounced, because standard boosting already led to smaller models with fewer variables with smooth effects (see Supplementary Material).