Heterogeneous treatment effect estimation for observational data using model-based forests

The estimation of heterogeneous treatment effects has attracted considerable interest in many disciplines, most prominently in medicine and economics. Contemporary research has so far primarily focused on continuous and binary responses where heterogeneous treatment effects are traditionally estimated by a linear model, which allows the estimation of constant or heterogeneous effects even under certain model misspecifications. More complex models for survival, count, or ordinal outcomes require stricter assumptions to reliably estimate the treatment effect. Most importantly, the noncollapsibility issue necessitates the joint estimation of treatment and prognostic effects. Model-based forests allow simultaneous estimation of covariate-dependent treatment and prognostic effects, but only for randomized trials. In this paper, we propose modifications to model-based forests to address the confounding issue in observational data. In particular, we evaluate an orthogonalization strategy originally proposed by Robinson (1988, Econometrica) in the context of model-based forests targeting heterogeneous treatment effect estimation in generalized linear models and transformation models. We found that this strategy reduces confounding effects in a simulated study with various outcome distributions. We demonstrate the practical aspects of heterogeneous treatment effect estimation for survival and ordinal outcomes by an assessment of the potentially heterogeneous effect of Riluzole on the progress of Amyotrophic Lateral Sclerosis.


Introduction
Over the past years, there has been emerging interest in methods to estimate heterogeneous treatment effects (HTEs) in various application fields.In healthcare, HTE estimation can be understood as a core principle driving personalized medicine.As opposed to average treatment effects, which assume a constant effect of a treatment on an outcome for the whole population, HTEs account for the heterogeneity in the effect for subgroups or individuals based on their characteristics.Most research on HTE estimation has mainly focused on continuous and binary response variables.These methods have typically built upon Rubin's potential outcomes framework, a statistical approach to formulating and inferring causal effects in various designs. 1,2raditionally, statistical models were used to estimate the treatment effect, but machine learning methods have been more and more adapted for these tasks over the past decade.Machine learning models rely on weaker assumptions and can automatically learn complex relationships such as higher order interaction effects, resulting in greater predictive performance in a variety of applications.2][13] Kuenzel et al. 14 proposed general frameworks-T-learners, S-learners, U-learners, and X-learners-that base treatment effect estimates on arbitrary machine learning models.Chernozhukov et al. 15 coined the term double/debiased machine learning models, which uses machine learning models for nuisance parameter estimation.The approach still relies on parametric models for estimating treatment effects, but Nie and Wager 16 derived so-called R-learners that allow for arbitrary (nonparametric or semiparametric) models.
Beyond continuous or binary responses, research on machine learning methods for HTE estimation have primarily focused on (right-censored) survival data.Methods have been proposed based on BARTs, 17 random forest-type methods, 18,19 or deep learning approaches. 12,13Theoretically, any machine learning model for survival analysis-such as random survival forests 20 or a Cox regression-based deep neural network (deepSurv) 21 -can estimate HTEs. 22These models can estimate survival or hazard functions in both treatment groups separately; HTEs are then defined as the difference in derived properties of the two functions, e.g. as differences in the median survival time.However, Hu et al. 22 found that methods specifically designed for HTE estimation, like the adapted BART, 17 produce more reliable estimates.
In general, for a continuous or binary outcome Y conditional on treatment w and covariates x, the conditional average treatment effect (CATE) (x) can be estimated from the model (Y | X = x, W = w) = (x) + (x)w even if the model is misspecified, e.g. when the prognostic effect (x) cannot be fully estimated due to missing covariate information.Beyond mean regression, stricter assumptions are necessary both for randomized and for observational studies to estimate HTEs.For example, under a true Cox model with survivor function exp(− exp(h(t) + (x) + w)) with log-cumulative baseline hazard h(t) at time t and log-hazard ratio , the prognostic effect (x) must be specified correctly, even in a randomized trial.Estimated marginal log-hazard ratios τ -i.e. when the model is fitted under the constraint (x) ≡ 0 -are shrunken towards zero if this constraint is unrealistic. 23Naturally, this problem carries over to heterogeneous log-hazard ratios (x).
Consequently, HTE estimation in more complex models requires the simultaneous estimation of both the prognostic part (x) and the predictive HTE (x).1][32][33] In a nutshell, model-based forests combine the parametric modeling framework with random forests to estimate individual treatment effects. 25By using generalized linear models and transformation models, model-based forests can be adapted for survival data, [24][25][26] ordinal data, 27 or clustered data. 28n observational studies, the treatment group assignment is not under control of the researcher and confounding effects could bias the estimation of HTEs.Robinson's 34 orthogonalization strategy has shown to be instrumental for achieving robustness to confounding effects for HTE estimation in observational studies. 5,35,36In this work, we propose and evaluate novel model-based forest variants based on the orthogonalization strategy that are suitable for HTE estimation in the observational setting.Compared to previous work, our work focuses on the utilization of model-based forests for HTE estimation and covers a richer class of models-(generalized) linear and transformation models.Simulation and application studies demonstrate the efficacy of our method for estimating HTEs under confounding biases yielding a versatile model framework applicable to a wide range of application cases.We contribute the first holistic approach to HTE estimation for observational studies which is applicable to many clinically relevant outcomes, including survival times, ordinal scores, binary events, or counts.
We review key components of model-based forests for HTE estimation in randomized trials in Section 2. In Section 3, we start introducing the orthogonalization approach by Robinson. 34We motivate previous developments using linear models 36 and leverage adaptations to more complex models discussed by Gao and Hastie 35 to define novel model-based forest variants suitable for HTE in the observational setting.These variants' performances are empirically assessed in a simulation study with a range of outcome distributions in Section 4. Finally, in Section 5 presenting a re-analysis of the patient-specific effect of Riluzole in patients with amyotrophic lateral sclerosis (ALS), practical aspects of model estimation and interpretation are discussed.

Review of model-based forests for randomized trials
We are interested in estimating HTEs based on i.i.d.observations (y i , x i , w i ) with i = 1, … , n, where y i , x i and w i are realizations of the outcome Y , covariates X ∈ , and control vs. treatment indicator W ∈ {0, 1}.For the remainder of this manuscript (with the exception of equation ( 4)), we will omit the subscript i. Y (0) and Y (1) denote the potential outcomes under the two treatment conditions W ∈ {0, 1}.Throughout this paper, we assume that X includes all relevant variables to explain heterogeneity both in the treatment effect and the outcome Y , and that the base model underlying model-based forests is correctly specified.
We review model-based forests for HTE estimation based on randomized trials as introduced by Seibold et al. 25 and Korepanova et al. 26 Within this section, we only consider settings where the treatment assignment is randomized and, therefore, follows a binomial model W | X = x ∼ B(1, (x)) with constant propensities (x) ≡ .We omit discussion of the abstract framework underlying model-based forests and instead discuss the important linear, generalized linear, 25 and transformation models 26 in detail.

Linear model
For a continuous outcome Y ∈ ℝ with symmetric error distribution, a model-based forest might be defined based on the model where the residuals are given by the error term Z with (Z|X, W ) = 0 and standard deviation  > 0. 36 We are mainly interested in estimating (x), the treatment effect that depends on predictive variables in x.With model-based forests, however, we also obtain an estimated value for the prognostic effect (x), which depends on prognostic variables in x.A variable might be predictive and prognostic at the same time.We refer to these situations as "overlays."Because we assume in this section that (x) ≡  applies, W ⟂ ⟂ X holds.Consequently, (x) can be interpreted as a CATE on the absolute scale.To estimate ((x), (x)) ⊤ the L 2 loss is minimized w.r.t. and  using an ensemble of trees.Inspired by recursive partitioning techniques, 37,38 split variable and split point selection are separated.Specifically in every node,  and  in the constant model (Y | W = w) =  + w are estimated using the observations in the node.The split variable is then the variable that has the lowest p-value for the bivariate permutation tests for the H 0 -hypothesis that  and  are constant and independent of any split variable.The cut-point is the point of the chosen split variable at which the score functions 0][41] First, for the ith training sample, the frequency  i (x) with which it falls in the same leaf as x over all B trees is measured.The obtained weighting vector ( 1 (x), … ,  n (x)) is used as an input for minimizing where  i denotes the loss for the ith sample.Model-based forests easily allow adaptions if HTEs for an outcome variable Y that is not well represented by equation ( 1) should be estimated.In this case, model-based forests can build on generalized linear models or transformation models in the recursive partitioning framework. 38As detailed in the following sections, the loss function  in equation (3) changes from the squared error to the negative (partial) log-likelihood of some appropriate model.

Generalized linear models
When the conditional outcome distribution is better described through a generalized linear model with parameter  depending on the additive function (x) + (x)w, the conditional mean is linear on the scale of a link function g.Thus, the interpretation of (x) as CATE (2) generally no longer holds.Instead, the predictive effect is understood as the difference in natural parameters (DINA) 35 In contrast to the linear model case, HTEs (x) are now defined on relative scales, such as odds ratios in binary logistic regression models or multiplicative mean effects in a Poisson or Gaussian model with a log-link.The negative log-likelihood contribution of some observation (Y , x, w) is with f as the conditional density of an exponential family distribution Model-based trees and forests 38,24,25 After the tree fitting phase, an HTE is estimated with equation (4) with (, , ) of equation (7) as the corresponding loss function.Thus, model-based forests can be directly applied to estimate HTEs on relative scales for binary outcomes (binary logistic or probit regression, for example), counts (Poisson or quasi-Poisson regression), or continuous outcomes where a multiplicative effect is of interest (normal model with log-link).

Transformation models
More complex responses like ordered categorical or time-to-event outcomes are not covered by generalized linear models but can be analyzed using transformation models; corresponding model-based forests for survival analysis have been introduced by Korepanova et al. 26 For some at least ordered outcome Y , we write the conditional distribution function as The transformation function h is monotone non-decreasing and the inverse link function F governs the interpretability of  as log-odds ratios (F = logit −1 ), log-hazard ratios (F = cloglog −1 ), log-reverse time hazard ratios (F = loglog −1 ), or shift effects (F = Φ, the cumulative distribution function of the standard normal).The shift term  w (x) differs between the two treatment groups w ∈ {0, 1}.The distribution functions of the potential outcomes are F(h(y) − (x)) for Y (0) and . The negative log-likelihood of a discrete or interval-censored observation ( ȳ, ȳ] (where ȳ is the lower interval bound, ȳ is the upper) is For a continuous datum y ∈ ℝ, we obtain details are given in Hothorn et al. 42 Transformation forests apply the model-based recursive partitioning principle and estimate  in each node along with the transformation function h (a "nuisance" parameter) by minimizing  Trafo (h,  ≡ 0, ). 29Because h contains an intercept term, the parameter  is not identified.We thus estimate the model under the constraint  ≡ 0. Variable and cut-points are selected using the bivariate gradient This model family includes proportional odds logistic regression (for ordered categorical, count or continuous outcomes), Box-Cox type models, Cox proportional hazards model, Weibull proportional hazards models for discrete and continuous outcomes, reverse time proportional hazards models relying on Lehmann alternatives, and many more. 42Forests for ordinal outcomes were evaluated by Buri and Hothorn, 27 and a general approach to "transformation forests" is described in Hothorn and Zeileis. 29pplication of the ideas underlying model-based forests allows HTEs to be estimated for such outcomes under all types of random censoring and truncation. 26For example, for Weibull distributed outcomes under right censoring, h(y) =  1 +  2 log(y) is chosen for the conditional distribution function in equation (8). 42n this case, we define Y as the event time, C as the censoring time and T = min(Y , C) as the observed time.For identification of (x) under potential censoring, the following assumption must hold 18 :

Assumption 1 ((Ignorable censoring)). Censoring time C is independent of survival time Y conditional on treatment indicator W and covariates X
An important special case represents the Cox proportional hazards model, where the profile likelihood over the baseline hazard function defines the partial log-likelihood  PL (, ) with  ≡ 0. The scores with respect to the constant  ≡ 0 are known as martingale residuals.Model-based forests for such models, and extensions to time-varying prognostic and predictive effects, are discussed in Korepanova et al. 26

Noncollapsibility
As mentioned in the introduction, one problem with the Cox model is that misspecifications of prognostic effects (x) lead to biased estimates such that the estimated hazard ratios cannot be interpreted causally.This issue arises from the noncollapsiblity of the Cox model, the notion of which is characterized by the fact that in these models, the mean of the conditional effect estimates defined over covariates X does not coincide with the marginal effect over X.Because the noncollapsiblity of the Cox model arises from its nonadditivity of the hazard function, models such as a Weibull model (parameterized in form of an accelerated failure time model) do not suffer from this issue because they satisfy the additivity condition.Consequently, misspecifications of prognostic effects do not affect treatment effect estimates. 23he noncollapsibility issue is not limited to the Cox model but also affects members of the exponential family without identity or linear link functions.Without adjustments, effect estimates can only be interpreted causally if there is no treatment effect ( ≡ 0) or there are no prognostic covariates. 43f this is not the case, specific methods are needed; ignoring the estimation of (x) at all and only focusing on (x) does not solve the problem.Conditioning on available prognostic variables is a common solution and is already applied by model-based forests because they estimate both the prognostic effect (x) and (x).The ensemble of trees used to estimate these effects provides a high degree of flexibility and might therefore retain some of the potential complexity in the underlying (x) to mitigate misspecification.Whether conditioning resolves the noncollapsibility issue depends heavily on the assumption that all prognostic variables are known which is often not the case in the real world. 23or members of the exponential family and the Cox model, Gao and Hastie 35 derived a method to account for noncollapsibility in the context of observational data with confounding effects.While we consider the noncollapsibility issue beyond the scope of this work, we briefly review the work of Gao and Hastie and discuss its applicability to model-based forests in Section A of the Supplemental Material.

Model-based forests for observational studies
In the previous section, we described model-based forests in the randomized setting under the assumption that (x) ≡ .In observational studies, in which the treatment group assignment is not under the control of the researcher, the propensity score (and therefore, the probability of being in the treatment group) often depends on covariates x In this case, confounding effects could bias the estimation of treatment effects (x), and stricter assumptions are necessary in order to interpret (x) causally. 44sumption 2 ((Ignorability/Unconfoundedness)).The treatment assignment is independent of the potential outcomes conditional on covariates x The propensity score (x) must be bounded away from 0 and 1 Assumption 2 could be violated by an unmeasured confounder, while Assumption 3 could be violated if all observations in a certain group (defined via x) are in the treatment group.][47][48] Dandl et al. 36 showed for mean regression models that model-based forests are not robust to confounding effects and need further adaptions to estimate causal effects in case of observational data.One strategy for dealing with confounding effects is the orthogonalization strategy originally introduced by Robinson, 34 which has received considerable attention in recent years. 15,5,16The reformulation of the linear model given the conditional mean function motivates this approach. 5,36verall, the orthogonalization strategy consists of two steps: First, nuisance parameters m 34 used kernel estimators, but any machine learning method could be employed. 15,16Regressing Y − m(x) on W − π(x) then yields unbiased estimates for (x).Subtracting m(x) and π(x) from Y and W , respectively, partially eliminates the association between X and Y and between X and W .The orthogonalization strategy has the distinct advantage over other methods against confounding-such as inverse propensity weighting and matching-that it is stable for extreme propensity scores and forgoes stratification. 35obinson 34 and Chernozhukov et al. 15 use parametric models to estimate treatment effects based on residualized W and Y , but these models could be replaced by non-parametric or local parametric models 16,49 -such as model-based forests.For mean regression, Dandl et al. 36 adapted the orthogonalization strategy to model-based forests.Their approach closely follows causal forests, which were the first to combine the orthogonalization strategy with tree-based estimators for (x).
Gao and Hastie 35 proposed extensions of Robinson's strategy to members of the exponential family and the Cox model, where DINA ( 6) is of interest.Gao and Hastie 35 assume (x) = x ⊤  and use parametric models to estimate (x), but they conclude that non-parametric or local parametric models could be applied instead.We review model-based forests in combination with linear models for observational data in the next section and summarize the idea by Gao and Hastie 35 in Section 3.2.On this basis, we assess how the orthogonalization strategy could be employed in model-based forests beyond mean regression with generalized linear models and transformation models as base models.
3.1 Review of Dandl et al. 36 As noted above, Athey et al. 5 were the first to combine the orthogonalization strategy of Robinson with tree-based estimators to estimate (x).First, the marginal model m(x) = (Y | X = x) and propensity score (x) = (W | X = x) are estimated by regression forests.Afterwards, causal forests estimate individual treatment effects (x) in the model using the "locally centered" outcomes Y − m(x) and treatment indicators W − π(x).Equation (13) shows that causal forests and model-based forests share common foundations for mean regression.The main difference is that the splitting scheme of model-based forests allows splitting according to heterogeneity in both treatment and prognostic effects, whereas causal forests only split with respect to heterogeneity in treatment effects (in equation (11), (x) cancels out).
Dandl et al. 36 identified which elements of both approaches lead to improved performance in randomized trials and observational studies.They defined and evaluated the following blended versions of model-based forests and causal forests: (1) mob( Ŵ , Ŷ ), which uses model-based forests to estimate the model i.e. after centering the treatment indicator w and the outcome Y .Both parameters μ and  are estimated simultaneously.
(2) mob( Ŵ ), which uses model-based forests to estimate the model i.e. after only centering the treatment indicator w but not outcome Y .Both  and  are estimated.
(3) cfmob, a method which uses model-based forests to estimate the model i.e. after only centering the treatment indicator w and splitting only according to τ.That is, only the parameters  are estimated in this variant.
Their blended approaches competed with the original implementations of (uncentered) model-based forests and causal forests in an extensive simulation study.In case of confounding, the authors identified local centering of treatment indicator w and simultaneous estimation of both predictive and prognostic effects of the treatment indication (mob( Ŵ )) as the key driver for good performance.Additionally, centering Y (mob( Ŵ , Ŷ )) is recommended, because it further improved performances in some cases.Splitting only according to τ but not μ (cfmob) resulted in lower predictive performance.

Review of Gao and Hastie (2022)
Robinson 34 derived the orthogonalization strategy only for semi-parametric additive models with Y ∈ ℝ. Gao and Hastie 35 extended the idea to a broader class of distributions including the exponential family and Cox model.
Local centering of the treatment indicator works analogously to mean regression.First, propensity scores (x) = ℙ(W | X = x) are estimated.The effects of the covariates X on the treatment assignment are then regressed out by subtracting π(x) from W .

Method
Linear predictor Definitions Orthogonalization of Y is not straightforward due to the link function that relates the linear predictor  w (x) in equation ( 5) to the outcome Y .To understand how Gao and Hastie derived m(x) to center Y , we consider equation (10) as a model of the exponential family with identity link function g.Now we can rewrite equation (12) to Accordingly, Gao and Hastie derived g((Y | X = x)) for all other distributions of the exponential family by We can regard the estimated m(x) as an offset in the linear predictor Note that equation (14) states that (only) for the Gaussian distribution we can directly estimate m(x) = (Y | X = x) without estimating  0 (x) and  1 (x).We can also derive m(x) for transformation models based on the definition of  0 and  1 in equation (8).As mentioned in Section 2.4, compared to the difference in conditional means, the DINA additionally suffers from the noncollapsibility issue. 50Gao and Hastie 35 also extend the Robinson strategy to tackle not only the confounding but also the noncollapsibility issue for members of the exponential family (without a linear or log-link function, otherwise confounding is not an issue) and the Cox model.While it is beyond the scope of this work to address the noncollapsibility issue, Section A of the Supplemental Material briefly summarizes and discusses the applicability of Gao and Hastie's approach to model-based forests.

Novel model-based forests for observational data
As stated above, our main goal is to assess how the orthogonalization strategy proposed for continuous outcomes could be extended to models beyond mean regression, specifically generalized linear models and transformation models.Based on Dandl et al. 36 and Gao and Hastie 35 we propose two different versions of model-based forests, which should be more robust against confounding.Following Dandl et al., 36 we formulate research questions for these versions, which we aim to answer empirically in Section 4.An overview of all proposed versions is given in Table 1.
The first version of model-based forests directly applies Robinson's orthogonalization strategy: First, we estimate propensities (x) as well as  0 (x) and  1 (x) to derive m(x).Then, we update the linear predictor of equation ( 5) by centering W by π(x) and by adding the offset m(x).For generalized linear models, we obtain and for the conditional distribution function of equation (8) in case of transformation models Based on the updated models, both prognostic and predictive effects μ(x) and (x) are simultaneously estimated by modelbased forests.
In the simulation study and practical example in Sections 4 and 5, we use regression forests to estimate (x) and gradient boosting machines (with tailored loss functions) to estimate  0 and  1 .In the following, we denote this version of modelbased forests as Robinson forests in recognition of Robinson 34

Empirical evaluation
In a simulation study, we aim to compare the different model-based forest versions (Table 1) by answering the following research questions: The study includes different outcome types, different predictive and prognostic effects, and a varying number of observations and covariates.Model-based forests were fitted with the model4you R add-on package. 51Similar to Dandl et al., 36 we base our study settings on the four setups (A, B, C and D) of Nie and Wager. 16In addition, in Section B of the Supplemental Material, we show the results for simulation settings first proposed by Wager and Athey 49 and later reused by Athey et al. 5

Data generating process
Given P = {10, 20}, for Setup A, we sampled X ∼ U([0, 1] P ).For all other setups, we used X ∼ N(0, 1 P×P ).The treatment indicator was binomially distributed with W | X = x ∼ B(1, (x)).The propensity function (x) differed for the four considered setups: (x) ≡ 0.5 in Setup B implies a randomized study.The treatment effect function (⋅) and the prognostic effect function (⋅) also differed between the setups Setup A has extensive confounding that must be eliminated before estimating an easily predictable treatment effect function (x).Setup B needs no confounding adjustment for reliable estimation of .Although Setup C contains strong confounding, the propensity score function is easier to estimate than the prognostic effect, while the treatment effect is constant.In Setup D, the treatment and control arms are unrelated, and therefore, learning the conditional expected outcomes of both arms jointly is not beneficial. 16,36o assess robustness of the methods to missing prognostic covariates, we additionally created Setup A' from Setup A by removing covariate X 3 from the training data.Therefore, the DGP of Setup A and Setup A' are identical, the only difference being that the training data did not contain X 3 although X 3 affects the prognostic effect.
We studied four different simulation models Due to w − 0.5 in all scenarios, half of the (negative) predictive effect (x) was added to the prognostic effect.We refer to the implied scenario-where one variable which is both prognostic (impact in (x)) and predictive (impact in (x)) exists-as overlay.Apart from Setup C in which the treatment effect is constant and independent of any covariate, overlay was present for all scenarios.
Like Dandl et al., 36 we compared all study settings and outcome types for a varying number of samples N ∈ {800, 1600} and dimensions P ∈ {10, 20}.All model-based forests were grown with the same hyperparameter options specified in Section 7. We used random forests as implemented in the grf package to estimate (x) for centering W . 52 To estimate  0 (x) and  1 (x) to derive m(x), we relied on different tree-based estimators depending on the outcome type.For normally distributed outcomes (models (15a)), we used grf regression forests. 52For all other outcomes, we relied on gradient boosting machines (with adapted loss functions) as implemented in mboost and gbm. 53,54The employed distribution varied depending on the outcome type.
In accordance with Dandl et al., 36 we evaluated the models with respect to the mean squared error  X {( τ(X) − (X)) 2 }, bias  X { τ(X) − (X)} and standard error √ Var X { τ(X) − (X)} on a test sample of size 1000.The results for the mean squared error are shown in Figure 1 and were statistically analyzed by means of a normal linear mixed model with a log-link.The model explained the estimated mean squared error for τ(x) by a four-way interaction of the data generating process, sample size N, dimension P, and random forest variant.We estimated the mean squared error ratios between different model-based forest versions according to the research questions stated in Section 4. The corresponding tables are given in Tables 2 to 4. In the Supplemental Material, the results for the bias and standard error are presented in Figures S. 2 and S. 3, as well as the statistical analysis of the relative efficiency in Tables S. 4 to S. 6.

Results
The results for the normal distribution coincide with the results obtained by Dandl et al. 36 summarized in Section 3.1.To some degree, they also hold for the other distributions.The boxplots are not directly comparable between different data generating processes because of different signal-to-noise ratios.In general, a more informative outcome (binary < ordered < right-censored < exact normal), more data (higher N), and less noise (lower P) leads to better results.Using a Cox model compared to a Weibull model (last two rows of Figure 1) did not lead to a major decrease in performance, although knowledge of the true functional form of the transformation function did not enter the Cox modeling process.
For Setup A, model-based forests without centering (Naive) were unable to cope with complex confounding, but solely centering of the treatment indicator (Robinson Ŵ ) was valuable.Additionally adding m(x) as an offset (Robinson) did not further improve the results for the normal, binomial, and Weibull distributions, but an improvement was observed for the multinomial distribution.Suppression of X 3 in the training dataset (Setup A') had little to no effect on the ranking of the methods.Slight deterioration in the performance of all methods was only observed for the binomial data with the logistic regression model as a base model.This is expected since this model is noncollapsible (see Section 2.4 for details).
For Setup B, the Robinson forests performed slightly better in disentangling the more complicated prognostic and predictive effects compared to Naive and Robinson Ŵ model-based forests.An exception is the binomial model: Robinson Ŵ forests performed similarly to Robinson forests.
In Setup C, over all distributions, uncentered model-based forests (Naive) failed to overcome the strong confounding effect and therefore did not provide accurate estimates for the treatment effect.The performance was fundamentally    improved by centering the treatment indicator (Robinson Ŵ ) and was further improved by additionally adding m(x) as an offset (Robinson).
In Setup D -with unrelated treatment and control arms -all methods had a higher mean squared error than in the other setups, as jointly modeling the expected conditional outcomes for both arms has no benefit.Apart from the normal distributions, Robinson forests were inferior to the Robinson Ŵ and Naive model-based forests.
The empirical evidence of our simulation study can be summarized as follows: If confounding was present, model-based forests performed better when centering W by π(x) (Robinson Ŵ ) compared to not centering W (Naive). Adding m(x) as an offset (Robinson) further improved the performance -especially in cases with very strong confounding.

Effect of Riluzole on progression of ALS
ALS is a progressive nervous system disease causing loss of muscle control.The status of the disease as well as the rate of progression is commonly evaluated by the ALS functional rating scale (ALSFRS). 55,56Here, physical abilities such as speaking, handwriting, and walking are assessed and rated on a scale from 0 (inability) to 4 (normal ability).In 1995, the FDA approved the first drug to manage and slow progression of ALS, named Riluzole.The largest database for study results on the effect of Riluzole offers the Pooled Resource Open-Access Clinical Trials (PROACT) database -initiated by the nonprofit organization Prize4Life (http://www.prize4life.org).The data comes from different randomized and observational studies not disclosed in the data.Thus, the assumption of random treatment assignment is quite hard to justify in an analysis.Patient characteristics and treatment group sizes might vary greatly between the centers, which affect both the probability of receiving treatment as well as the outcome.To account for these potential confounding effects, we compared the treatment effects estimated by the naive model-based forests to the ones estimated with local centering by Robinson.As in Section 4, we use random forests to estimate the propensity scores to center W and gradient boosting machines (with adapted loss functions) to estimate the values of the linear predictors  0 (x) and  1 (x) to center Y .Model-based forests, random forests, and gradient boosting machines rely on the hyperparameter values stated in Section 7. As for Seibold et al. 25 and Korepanova et al., 26 16 phase II and phase III randomized trials and one observational study from the PROACT database serve as a training dataset.We analyze the effect of Riluzole with respect to two outcome variables: Survival time and the handwriting ability score approximately 6 months after treatment-an item of the ALSFRS.We omitted observations with missing outcome values.As splitting variables, Seibold et al. 25 used demographic, medical history, and family history data, which were informative in the sense that not more than half of their values were missing.

Survival Time
The dataset for the survival time contains 3306 observations and 18 covariates.Of the 3306 observations, 2199 received Riluzole.Because very few patients had event times that exceed those of the others by a factor of two, we artificially censored five observations with (censoring or event) times of more than 750 days.The Kaplan-Meier estimates of survival probabilities for both treatment arms of the preprocessed dataset are shown in Figure 2. Overall, the estimated survival curves are very close to each other, and the treated group has only a slight survival advantage compared to the untreated group.As a base model, we use a Cox proportional hazards model.We compared treatment effects from two approaches: The naive uncentered model-based forests (Naive) and the model-based forest with Robinson's orthogonalization (Robinson).

Personalized models
For the naive model-based forests, the underlying Cox proportional hazards base model for the survival outcome T was, on the hazard scale, Because  0 (t) contains an intercept term,  is not identified (and was constraint to  ≡ 0).The treatment effect  is the log-hazard ratio of the treated versus untreated patients and our aim is to replace a constant marginal effect  with a heterogeneous (and thus conditional) log-hazard ratio (x) and, simultaneously, to estimate prognostic effects (x).
For Robinson's strategy, we first centered the treatment indicator W by estimating the propensity scores (x) = ℙ(W | X = x) using a regression forest.Figure 3 compares the distributions of estimated propensity scores (left) and of the  estimated centered treatment W − π(x) (Robinson's strategy, right), both obtained from regression forests.We can already see a decent overlap of propensity scores in the two treatment arms without centering, but the overlap increases if the strategy by Robinson was applied.
In addition to centering W , Robinson's strategy requires the estimation of m(x) to use as an offset (see Section 3).As in Section 4, we used gradient boosting machines (with the negative log partial likelihood of the Cox proportional hazards model as a loss) to estimate the natural parameters  0 (x) and  1 (x) for the control and treatment group, respectively. 57The offset m(x) for each observation is equal to the sum of natural parameter estimates weighted by π(x) (see equation ( 14)).The final base model for model-based forests using Robinson's orthogonalization is

Model-based forests
The corresponding base models serve as an input for the model-based forests to estimate personalized effects of Riluzole.Figure 4 compares the kernel density estimates of (x) for each forest version (Naive and Robinson).The naive approach reveals that on average the treatment reduced the hazard compared to no treatment, whereas the model-based forest with centering according to Robinson obtained weaker effects of Riluzole with more mass centered around 0. A meta-analysis of previous studies by Andrews et al., 58 also yielded a mixed picture: only eight of the 15 studies meeting their inclusion criteria showed a statistically significant increase of median survival time due to Riluzole.Over all strategies, for both approaches there were some patients for which Riluzole was estimated to increase the hazard.The dependency plots in Figures S. 6 and S. 7 in the Supplemental Material provide indications of the characteristics of the group of harmed individuals.For example, both the naive and centering approach agree that for patients with atrophy or fasciculation, Riluzole intake would increase the hazard.The estimated effects differed most between the uncentered forest (Naive) and the orthogonalized forest (Robinson) for the covariate sex (  25 : For middle-aged people with a longer time between disease onset and start of treatment, lower height, and no weakness, the treatment appears to be more beneficial.By considering confounding effects due to orthogonalization (Robinson forests), these effects diminished.For Korepanova et al. 26 the effect of Riluzole was also rather weak and showed low heterogeneity across covariates.

Handwriting ability score
The dataset for the handwriting ability score-an ordinal outcome with five categories-contains 2538 observations and 58 covariates.Besides the covariate age, all covariates had missing values (but less than 50 % of the values were missing per variable enforced by the preprocessing step stated at the beginning of this section).Of the 2538 observations, 1754 received Riluzole, and 784 did not.Figure 5 displays the frequency of the ability scores for both treatment groups.Most of the patients have an ability score of 3 or 4 (normal ability); only a few have ability scores less than 2. Note that the plot shows the conditional proportions given the treatment indicator.We chose a proportional odds logistic regression model as a base model for the model-based forests-once without further adaptions (Naive), and once parameterized with centered W and with an offset (Robinson).
In addition to the handwriting ability score after 6 months, the ability score values at treatment start are also available.In the following, we denote Y 6 as the handwriting score after 6 months and Y 0 as the handwriting score at the beginning of the treatment period.To account for the ability level at treatment start, Y 0 served as an additional splitting variable for both model-based forests (Naive and Robinson) and was included in X.The alluvial plot in Figure 6 breaks down the change in each ability class over 6 months.Overall, for most patients, the handwriting ability remained constant over the 6 months or worsened slightly.Rarely, patients experienced a progression to both extremes (0 to 4, or 4 to 0).These results hold regardless of whether patients received Riluzole or not.

Personalized models
The proportional odds logistic regression model for the naive model-based forests is defined as 59,60 logit(ℙ(Y 6 ≤ k|X = x, W = w, Y 0 = y 0 )) =  k (x, y 0 ) − (x, y 0 )w with k ∈ {0, … , 3} as the ordinal ability score classes.The parameters  k are increasing thresholds, depending on covariates x and the initial score y 0 .Due to the proportional odds assumption, the treatment effect (x, y 0 ) is the same for all scores k.Negative (x, y 0 ) indicate a negative effect of Riluzole, as treated patients are expected to have a higher odds of low writing ability scores compared to untreated patients.
Figure 7 compares the estimated treatment indicators with W as the outcome in the random forest without centering (left), with (W − π(x, y 0 )) as the outcome in the random forest (right).Before centering, there is a lack of overlap of the propensity scores; the distribution of π for the control group is bimodal, and the distribution for the treatment group is heavily left-skewed.After centering, the distributions of the estimated W − π(x, y 0 ) for the treatment groups move closer together and have a similar unimodal shape.However, there is still a lack of overlap of the groups, which indicates that important covariates to explain the remaining heterogeneity in the two treatment groups seem to be missing.

Model-based forests
The proportional odds logistic regression models served as a base model for the (Naive and Robinson) model-based forests to derive personalized treatment effects.Figure 8 displays the kernel density estimates of (x, y 0 ) for each forest version (Naive and Robinson).
Both random forests estimate on average a negative effect of Riluzole.Naive model-based forest estimated on average a log-odds of τ = −0.08,which indicates that treated patients have a 0.08 points higher log-odds for low writing scores than untreated patients.The distribution of τ(x, y 0 ) for the model-based forest relying on the Robinson orthogonalization is slightly shifted to the left ( τ = −0.10).For a larger subgroup of patients, the naive approach estimates a negative effect of Riluzole (−1 ≤ (x, y 0 ) ≤ −0.5), meaning that patients receiving treatment with Riluzole have higher odds of low writing scores than untreated patients.According to the dependency plots (Figures S. 8 to S. 13 in the Supplemental Material), this subgroup could be identified as having low initial ability scores (left side of Figure S. 8 (a)).For all other splitting variables, the distributions of estimated treatment effects are very similar.

Discussion and outlook
HTE estimation is a challenging problem, especially for observational studies and even more when the outcome cannot be modeled by a linear model.Covering diverse outcomes is not straightforward due to the noncollapsibility issue.Measures like the hazard ratios are shrunken towards zero and are incomparable between individuals if the full model including prognostic effects is not estimated.Model-based forests allow for the full model estimation given randomized trial data.In this work, we investigated several versions of model-based forests for the estimation of potentially complex HTEs (x) based on observational data with various outcome types based on the orthogonalization strategy by Robinson. 34These investigations suggest the following workflow for model-based forests: (1) Estimate propensities (x) using some machine learning procedure (binary random forests are a good default), (2) center the treatment indicator w − π(x) for each observation, (3) set up an appropriate model for the outcome conditioning on the centered treatment and -if possible -add an offset for centering Y , (4) use model-based forests to estimate predictive and prognostic effects (x) and (x) simultaneously.Notably, (x) is the CATE only in specific models, especially a linear or log-linear model.We demonstrate these steps by estimating the individual effects of Riluzole for ALS patients using survival times and ordinal ability scores as outcomes.
Many conceptual and technical issues remain to be addressed, for example the question how model-based forests perform for survival data for which the censoring procedure is not randomized but depends on X, or how (k-fold) cross-fitting influences the performance, where only one part of the data is used to estimate nuisance parameters and the other part to estimate (x). 15We leave investigations to these questions to future research.
Last but not least, we want to emphasize that all approaches for estimating HTEs -including those presented in this work -rely on strong and typically untestable assumptions.For example, for models beyond mean regression, τ(x) cannot while model-based forests without centering W and without offset m(x) are called Naive forests.Similar to Dandl et al.-who saw an improvement in performance when only centering W (compared to the naive modelbased forests)-we define an approach called Robinson Ŵ forests that applies model-based forests to models with linear predictors (x) + (x)(w − π(x))

RQ 1
To what extent does centering W by π(x) and including m(x) as an offset (Robinson) affect the performance of model-based forests in the presence of confounding?RQ 2 Do centered treatment indicator model-based forests (Robinson Ŵ ) perform better than uncentered model-based forests in the presence of confounding?RQ 3 Are model-based forests with centered treatment indicators (Robinson Ŵ ) relevantly outperformed by model-based forests with m(x) as an additional offset (Robinson) in the presence of confounding?
Model (15a) is a normal linear regression model, model (15b) is a binary logistic regression model, model (15c) is a 4nomial model with log-odds function  k −(x)−(x)(w−0.5)with threshold parameters  k = logit(k∕4) for k = 1, 2, 3, and model (15d) is a Weibull model with log-cumulative hazard function 2 log(y)−(x)−(x)(w−0.5).We added 50 % random right-censoring to the Weibull-generated data.Additionally, we applied a Cox proportional hazards model to the Weibull data to determine if the performance of model-based forests degrades when the forests do not take the true underlying model as their base model.

Figure 1 .
Figure 1.Model-based forest results for the empirical study (Section 4), Cox means a Cox model applied to the Weibull data.For the Weibull and Cox model, treatment effects (x) are estimated as conditional log hazard ratios.Direct comparison of model-based forests without centering (Naive), model-based forests with local centering according to Robinson 34 of Y and W (originally proposed) (Robinson) or only of W (Robinson Ŵ ).

Figure 2 .
Figure 2. Kaplan-Meier curves of survival probability for both treatment arms.

Figure 3 .
Figure 3. Distribution of estimated propensities π(x) (left) and estimated propensities of the centered treatment indicators (right, Robinson's strategy) as estimated by regression forests for the two treatment groups.

Figure 4 .
Figure 4. Kernel density estimates of the personalized treatment estimates for the naive model-based forest (Naive) and for the model-based forest with Robinson orthogonalization (Robinson).

Figure 5 .
Figure 5. Relative frequency distribution plot of the handwriting ability score (Y 6 ) (left) and of changes of the handwriting ability score over 6 months (Y 6 − Y 0 ) (right) for both treatment arms.Frequencies were calculated relative to the treatment indicator.
Figure S. 6 (c)), the covariate of whether patients swallow, and for the covariate specifying whether cases in the same generation exist (Figure S. 7 (f) and (i)).For the variables time onset treatment, age, height and weakness the dependency plots (Figure S. 6 (a), (d), (e) and Figure S. 7 (g)) of the Naive forest agree with the ones of Seibold et al.

Figure 6 .
Figure 6.Alluvial plot of the progression of the handwriting ability score over 6 months for both treatment arms.

Figure 7 .
Figure 7. Estimates returned by the regression forest for orthogonalization of the treatment indicator: left for original W as an outcome in the random forest such that it estimates propensity scores (x, y 0 ); right for the centered treatment indicator W − π(x, y 0 ) as an outcome in the random forest.

Figure 8 .
Figure 8. Kernel density estimates of the personalized treatment estimates for the naive model-based forest (Naive) vs. the forest with Robinson orthogonalized (Robinson).
jointly estimate the prognostic effect (x) and the predictive effect (x).

Table 1 .
Overview of proposed model-based forest versions.

Table 2 .
Results of RQ 1 for the experimental setups in Section 4. Comparison of mean squared errors for τ(x) in the different scenarios.Estimates and simultaneous 95 % confidence intervals were obtained from a normal linear mixed model with log-link.Cells printed in bold font correspond to a superior reference of the Naive model-based forests, and cells printed in italics indicate an inferior reference.

Table 3 .
Results of RQ 2 for the experimental setups in Section 4. Comparison of mean squared errors for τ(x) in the different scenarios.Estimates and simultaneous 95 % confidence intervals were obtained from a normal linear mixed model with log-link.Cells printed in bold font correspond to a superior reference of the Naive model-based forests, and cells printed in italics indicate an inferior reference.

Table 4 .
Results of RQ 3 for the experimental setups in Section 4. Mean squared error ratio for RQ 3: Robinson vs. Robinson Comparison of mean squared errors for τ(x) in the different scenarios.Estimates and simultaneous 95 % confidence intervals were obtained from a normal linear mixed model with log-link.Cells printed in bold font correspond to a superior reference of Robinson Ŵ , and cells printed in italics indicate an inferior reference.