Penalized variable selection in multi-parameter regression survival modeling

Standard survival models such as the proportional hazards model contain a single regression component, corresponding to the scale of the hazard. In contrast, we consider the so-called “multi-parameter regression” approach whereby covariates enter the model through multiple distributional parameters simultaneously, for example, scale and shape parameters. This approach has previously been shown to achieve flexibility with relatively low model complexity. However, beyond a stepwise-type selection method, variable selection methods are underdeveloped in the multi-parameter regression survival modeling setting. Therefore, we propose penalized multi-parameter regression estimation procedures using the following penalties: the least absolute shrinkage and selection operator, the smoothly clipped absolute deviation, and the adaptive least absolute shrinkage and selection operator. We compare these procedures using extensive simulation studies and an application to data from an observational lung cancer study; the Weibull multi-parameter regression model is used throughout as a running example.


Introduction
Multi-parameter regression (MPR) modelling refers to the approach whereby covariates are allowed to enter the model through multiple distributional parameters simultaneously. This is in contrast to standard approaches where covariates enter through a single parameter (e.g., a location parameter) while the remaining parameters (e.g., scale or shape parameters) are held constant. Multi-parameter regression approaches have been considered in many areas of applied statistics. One of the earliest examples appears in the econometrics literature, where Park (1966) proposed a log-linear model for the scale parameter in the normal linear model when the assumption of homoscedasticity is violated, and described a two-stage process for the estimation of the parameters. Harvey (1976) developed a maximum likelihood estimation procedure for these so-called heteroscedastic regression models, and Stirling (1985) and Aitkin (1987) illustrated the procedure using examples and GLIM (Generalized Linear Interactive Modelling) macros for the implementation of these models. Other examples of multi-parameter regression models include an extension of generalized linear models to allow for the joint modelling of the mean and dispersion (Smyth, 1989; Nelder and Lee, 1991; Lee and Nelder, 1998), and the generalized additive models for location, scale and shape (GAMLSS), which go beyond the exponential family to general parameters (not necessarily location and scale) for a variety of distributions (Rigby and Stasinopoulos, 2005). Taylor and Verbyla (2004) modelled the location and scale parameters of the t-distribution jointly to rectify the dependence of the scale parameter on the location.
In this article, we consider MPR modelling in the setting of survival analysis. Examples of the MPR modelling framework in this setting include Anderson (1991), who extended the Weibull accelerated failure time model so that both the location and dispersion parameters depend on covariates. The class of threshold regression models describes survival time through an underlying diffusion process (e.g., a Wiener process) whose drift and initial state parameters both depend on covariates (Lee and Whitmore, 2006; Aalen et al., 2008). More recently, Burke and MacKenzie (2017) explored the general use of parametric hazard MPR models, and demonstrated the favourable interpretation and flexibility afforded by jointly modelling the scale and shape parameters of a Weibull distribution.
The variable selection problem is omnipresent in most, if not all, statistical applications. This problem arises due to the uncertainty associated with selecting a subset of explanatory variables to model some variable of interest (Thompson, 1978; George, 2000). With the growing availability of data, this problem has received a significant amount of attention in recent years. Traditionally, variable selection has been performed using methods such as backward elimination, forward selection, and stepwise regression; Miller (2002) provides a comprehensive discussion of these methods. Due to their inherent discreteness (i.e., covariates are either "in" or "out"), these methods can be unstable (Breiman, 1996). Furthermore, they are known to be computationally intensive, since the number of possible submodels (2^p for p predictors) grows exponentially. Modern approaches have focused on penalized regression, whereby parameter estimation and (continuous) model selection are carried out simultaneously. Methods such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), and the adaptive LASSO (ALASSO) (Zou, 2006) are used to simultaneously select variables and estimate their regression coefficients.
Standard regression approaches have only one regression component, and, therefore, the variable selection literature mainly focuses on this case, i.e., selection of covariates in a location parameter (cf. Fan and Lv (2010)). Beyond this standard setting, MPR models require variable selection in multiple distributional parameters. However, little work has been done in this area. Wu and Li (2012) considered penalized likelihood for variable selection in the case of an inverse Gaussian joint mean and dispersion model, while Wu et al. (2013) and Wu (2014) applied this procedure to joint location-scale skew-normal and skew-t-normal models. None of the aforementioned papers consider censored data, and, furthermore, these papers only consider the case of a common tuning parameter for the two regression components. Therefore, the main objective of this article is to develop penalized variable selection in MPR models more generally. Using the Weibull MPR model as an example, we investigate the need for a separate tuning parameter for each regression component. We select tuning parameters by minimizing a BIC function, and, because this function is multi-modal, we propose the use of a differential evolution "global" optimization procedure.
The remainder of the paper is structured as follows. In Section 2, the Weibull multi-parameter regression model, the penalty functions, and the penalized likelihood estimation procedure are introduced. Section 3 describes the model fitting process and the algorithm used in the selection of the tuning parameter(s). Simulation studies to investigate the characteristics of the algorithm and evaluate its performance in both variable selection and parameter estimation are given in Section 4. The proposed methodologies are then illustrated on a real data example, a lung cancer dataset, in Section 5. A discussion of the proposed methods and some concluding remarks are given in the last section, Section 6.

Weibull MPR Model
Although the variable selection methods we consider in this article can be applied to any parametric MPR model, it is helpful to focus on a specific example. We therefore consider the Weibull MPR model, as the Weibull distribution is one of the most popular parametric survival distributions. The Weibull hazard function for the survival time $\tilde T_i$ of the ith individual is given by

$$h(t_i) = \tau_i \gamma_i t_i^{\gamma_i - 1}, \qquad i = 1, 2, \ldots, n,$$

where $\tau_i > 0$ is the scale parameter and $\gamma_i > 0$ is the shape parameter. The Weibull MPR model is obtained by letting both distributional parameters depend on covariates as follows:

$$\log(\tau_i) = x_i^T \beta, \qquad \log(\gamma_i) = z_i^T \alpha,$$

where $x_i = (1, x_{i1}, \ldots, x_{ip})^T$ and $z_i = (1, z_{i1}, \ldots, z_{iq})^T$ are scale and shape covariate vectors which may or may not have covariates in common, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ and $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_q)^T$ are the corresponding regression coefficients, and the log link is used to ensure positivity of the parameters.
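To make the structure concrete, the following is a minimal R sketch of the hazard under these log-links; the function name and the example values are ours, purely illustrative:

```r
# Weibull MPR hazard under log-links: a minimal illustrative sketch.
# x and z include a leading 1 for the intercept.
mpr_hazard <- function(t, x, z, beta, alpha) {
  tau   <- exp(sum(x * beta))    # scale: log(tau_i)   = x_i' beta
  gamma <- exp(sum(z * alpha))   # shape: log(gamma_i) = z_i' alpha
  tau * gamma * t^(gamma - 1)    # h(t_i) = tau_i * gamma_i * t^(gamma_i - 1)
}

# One individual with a single standardized covariate (values hypothetical)
mpr_hazard(t = 2, x = c(1, 0.3), z = c(1, 0.3),
           beta = c(-1.5, -1.0), alpha = c(0.5, 0.4))
```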
Parameter estimation within the unpenalized MPR model can be carried out in a standard fashion using maximum likelihood. First, let $T_i = \min(\tilde T_i, C_i)$ be the observed survival time for the ith individual, where $C_i$ is the censoring time. The associated log-likelihood function is then given by

$$\ell(\theta) = \sum_{i=1}^{n} \left[ \delta_i \{\log \tau_i + \log \gamma_i + (\gamma_i - 1)\log t_i\} - \tau_i t_i^{\gamma_i} \right], \qquad (2.1)$$

where $\theta = (\beta^T, \alpha^T)^T$ is the full parameter vector, $t_i$ is the realization of $T_i$, and $\delta_i$ is the censoring indicator, which takes the value 0 for censored survival times and 1 for uncensored survival times. Beyond the Weibull case we consider here, the likelihood function for a general parametric MPR model takes the form $\ell(\theta) = \sum_{i=1}^n [\delta_i \log h(t_i) - H(t_i)]$, where $h(\cdot)$ and $H(\cdot)$ are the hazard and cumulative hazard functions whose parameters depend on covariates.
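A direct R implementation of this log-likelihood might look as follows (a sketch under the notation above; `X` and `Z` are the design matrices with intercept columns, and `delta` is the censoring indicator):

```r
# Censored Weibull MPR log-likelihood (2.1); theta = (beta, alpha).
mpr_loglik <- function(theta, X, Z, time, delta) {
  p1 <- ncol(X); q1 <- ncol(Z)
  beta  <- theta[1:p1]
  alpha <- theta[(p1 + 1):(p1 + q1)]
  tau <- exp(drop(X %*% beta))     # scale parameters, one per individual
  gam <- exp(drop(Z %*% alpha))    # shape parameters, one per individual
  sum(delta * (log(tau) + log(gam) + (gam - 1) * log(time)) - tau * time^gam)
}
```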

Penalized Likelihood
Penalized MPR estimation can be developed on the basis of maximizing a penalized log-likelihood given by

$$\ell_\lambda(\theta) = \ell(\theta) - n \sum_{j=0}^{p} J_{\lambda_{\beta_j}}(|\beta_j|) - n \sum_{j=0}^{q} J_{\lambda_{\alpha_j}}(|\alpha_j|), \qquad (2.2)$$

where $\ell(\theta)$ is the unpenalized log-likelihood, $\lambda = (\lambda_{\beta_0}, \lambda_{\beta_1}, \ldots, \lambda_{\beta_p}, \lambda_{\alpha_0}, \lambda_{\alpha_1}, \ldots, \lambda_{\alpha_q})$ is a vector of coefficient-specific tuning parameters, and $J_{\lambda_{\beta_j}}(\cdot)$ and $J_{\lambda_{\alpha_j}}(\cdot)$ are scale and shape penalty functions which we assume have the same functional form (but differ with respect to the tuning parameter). As is standard practice, we assume that the intercepts are not penalized, and, therefore, define $\lambda_{\beta_0} \equiv \lambda_{\alpha_0} \equiv 0$ (rather than, for example, assuming the intercepts are zero as other authors do); we also assume that the covariates are standardized. Although we have defined $\lambda$ quite generally, we will in fact impose constraints on this vector (beyond fixing $\lambda_{\beta_0} \equiv \lambda_{\alpha_0} \equiv 0$) by considering the following possibilities (for $j \neq 0$):

i) single penalty: $\lambda_{\beta_j} = \lambda_{\alpha_j} = \lambda$;

ii) single adaptive penalty: $\lambda_{\beta_j} = \lambda w_{\beta_j}$ and $\lambda_{\alpha_j} = \lambda w_{\alpha_j}$, where $w_{\beta_j}$ and $w_{\alpha_j}$ are predefined weights;

iii) separate non-adaptive penalties: $\lambda_{\beta_j} = \lambda_\beta$ and $\lambda_{\alpha_j} = \lambda_\alpha$;

iv) separate adaptive penalties: $\lambda_{\beta_j} = \lambda_\beta w_{\beta_j}$ and $\lambda_{\alpha_j} = \lambda_\alpha w_{\alpha_j}$.

Note that the only "adaptive" penalty considered for the purpose of this article is the ALASSO penalty. Schemes (i) and (ii) are standard approaches where a single penalty, $\lambda$, applies to the whole vector of parameters. This is reasonable in more standard settings where there is only a $\beta$ vector. However, in this particular MPR setting, we have two separate distributional parameters, which exist on different scales. For this reason we investigate schemes (iii) and (iv), which apply different penalties to the two regression vectors via $\lambda_\beta$ and $\lambda_\alpha$.
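As an illustration of the four schemes, the following R sketch builds the coefficient-specific tuning vectors; all names are ours, and `w_beta` and `w_alpha` stand for the predefined adaptive weights:

```r
# Construct lambda vectors for schemes (i)-(iv); intercept entries are 0.
make_lambda <- function(scheme, lam = NULL, lam_beta = NULL, lam_alpha = NULL,
                        w_beta = NULL, w_alpha = NULL, p, q) {
  lb <- switch(scheme,
    single            = rep(lam, p),
    single_adaptive   = lam * w_beta,
    separate          = rep(lam_beta, p),
    separate_adaptive = lam_beta * w_beta)
  la <- switch(scheme,
    single            = rep(lam, q),
    single_adaptive   = lam * w_alpha,
    separate          = rep(lam_alpha, q),
    separate_adaptive = lam_alpha * w_alpha)
  list(beta = c(0, lb), alpha = c(0, la))  # lambda_beta0 = lambda_alpha0 = 0
}

# Example: scheme (iii) with p = q = 10
make_lambda("separate", lam_beta = 0.2, lam_alpha = 0.5, p = 10, q = 10)
```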
For the purpose of this article, we consider the most commonly used penalties, namely: the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996),

$$J_\lambda(|\theta_j|) = \lambda |\theta_j|,$$

which, although popular, is known to select too many variables (Radchenko and James, 2008); the non-convex smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001), defined through its derivative

$$J'_\lambda(|\theta_j|) = \lambda \left\{ I(|\theta_j| \le \lambda) + \frac{(a\lambda - |\theta_j|)_+}{(a-1)\lambda}\, I(|\theta_j| > \lambda) \right\},$$

where $a = 3.7$; and the adaptive LASSO (ALASSO) (Zou, 2006),

$$J_\lambda(|\theta_j|) = \lambda w_j |\theta_j|,$$

where, typically, $w_j = 1/|\hat\theta_{0,j}|$ and $\hat\theta_{0,j}$ is an unpenalized estimate of $\theta_j$. These so-called adaptive weights are used to apply different penalties to different regression coefficients such that a larger amount of shrinkage is applied to the unimportant variables. Note that, here, we use $\theta_j$ to denote a generic regression coefficient, and $\lambda_{\theta_j}$ is the corresponding tuning parameter. The latter two penalties (i.e., SCAD and ALASSO) are known to possess the oracle property, i.e., the procedure asymptotically identifies the right subset model and estimates the coefficients and covariance matrix as though the true model were known in advance (Fan and Li, 2001).
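In R, these penalties (as functions of $x = |\theta_j|$) can be sketched as follows; the SCAD penalty is written in its closed form, which integrates the derivative above with $a = 3.7$:

```r
# The three penalties as functions of x = |theta_j| (vectorized sketches).
lasso_pen <- function(x, lam) lam * x

scad_pen <- function(x, lam, a = 3.7) {
  ifelse(x <= lam, lam * x,
         ifelse(x <= a * lam,
                -(x^2 - 2 * a * lam * x + lam^2) / (2 * (a - 1)),
                (a + 1) * lam^2 / 2))
}

# ALASSO: weight w = 1/|theta0| from an unpenalized estimate theta0
alasso_pen <- function(x, lam, theta0) lam * x / abs(theta0)
```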

Model Fitting
We define

$$\hat\theta_\lambda = \arg\max_\theta\, \ell_\lambda(\theta), \qquad (3.3)$$

where $\ell_\lambda(\theta)$ is given by (2.2). The corresponding score functions are given by

$$\frac{\partial \ell_\lambda(\theta)}{\partial \beta} = X^T U_\beta - n V_\beta, \qquad \frac{\partial \ell_\lambda(\theta)}{\partial \alpha} = Z^T U_\alpha - n V_\alpha, \qquad (3.4)$$

where $X$ is an $n \times (p+1)$ matrix whose ith row is $x_i$, and $Z$ is an $n \times (q+1)$ matrix whose ith row is $z_i$; $U_\beta$ and $U_\alpha$ are vectors of length $n$ such that $U_{\beta i} = \delta_i - \tau_i t_i^{\gamma_i}$ and $U_{\alpha i} = \delta_i + (\delta_i - \tau_i t_i^{\gamma_i})\gamma_i \log t_i$; and $V_\beta$ and $V_\alpha$ are vectors of lengths $p+1$ and $q+1$, respectively, such that, for $j \ge 0$, $V_{\beta,j+1} = dJ_{\lambda_{\beta_j}}(|\beta_j|)/d\beta_j = J'_{\lambda_{\beta_j}}(|\beta_j|)\, d|\beta_j|/d\beta_j$ and $V_{\alpha,j+1} = dJ_{\lambda_{\alpha_j}}(|\alpha_j|)/d\alpha_j = J'_{\lambda_{\alpha_j}}(|\alpha_j|)\, d|\alpha_j|/d\alpha_j$. Note, however, that the presence of the absolute value function renders the penalty functions non-differentiable at zero. Various algorithms have been developed to overcome this issue, including quadratic programming (Tibshirani, 1996), least angle regression (LARS) (Efron et al., 2004), co-ordinate descent (Friedman et al., 2007), and the local quadratic approximation (Fan and Li, 2001). In this paper, we take a different approach, and use an extension of the absolute value function given by

$$a_\epsilon(x) = \sqrt{x^2 + \epsilon^2},$$

where $\lim_{\epsilon \to 0} a_\epsilon(x) = |x|$. This yields a differentiable penalty so that standard gradient-based optimization algorithms can be applied straightforwardly and transparently. Thus, $a'_\epsilon(x) = x/\sqrt{\epsilon^2 + x^2}$ (which is an approximation of the signum function) and $a''_\epsilon(x) = \epsilon^2/(\epsilon^2 + x^2)^{3/2}$. Smaller values of $\epsilon$ bring the approximate penalty closer to the original penalty, but also closer to the penalty being non-differentiable; we have found that fixing $\epsilon = 10^{-4}$ generally works well. As we use smooth $J(\cdot)$ functions, and $a_\epsilon(x)$ in place of $|x|$, (3.4) is then smooth in the parameters and can therefore be solved using the Newton-Raphson algorithm.
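The following R sketch illustrates the smooth approximation and a gradient-based fit; for brevity we call `optim` with BFGS rather than spelling out the Newton-Raphson updates. It assumes the `mpr_loglik` and penalty sketches above, and the per-coefficient tuning vectors `lam_b` and `lam_a` include zero entries for the intercepts (for the ALASSO, fold the weights into these vectors and use `lasso_pen`):

```r
a_eps <- function(x, eps = 1e-4) sqrt(x^2 + eps^2)  # smooth |x|; -> |x| as eps -> 0

# Maximize the penalized log-likelihood (2.2) by minimizing its negative.
mpr_pen_fit <- function(theta0, X, Z, time, delta, lam_b, lam_a, pen) {
  nobs <- length(time); p1 <- ncol(X); q1 <- ncol(Z)
  nll <- function(theta) {
    beta  <- theta[1:p1]
    alpha <- theta[(p1 + 1):(p1 + q1)]
    -mpr_loglik(theta, X, Z, time, delta) +
      nobs * sum(pen(a_eps(beta),  lam_b)) +   # lam_b[1] = 0: intercept free
      nobs * sum(pen(a_eps(alpha), lam_a))     # lam_a[1] = 0: intercept free
  }
  optim(theta0, nll, method = "BFGS")  # smooth objective: gradient methods apply
}
```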

Tuning Parameter Selection
The selection of "optimal" tuning parameter(s) is typically done through the use of data-driven criteria such as generalized cross-validation (GCV), the Akaike information criterion (AIC), or the Bayesian information criterion (BIC). GCV and the AIC are known to be asymptotically "loss-efficient" and "selection-inconsistent" variable selection criteria (Shao, 1997; Yang; Wang et al., 2009). Wang et al. (2007) provide a formal proof that the shrinkage or tuning parameter selected using GCV may not be able to identify the true model consistently for the SCAD estimator in linear models and partially linear models. Instead, they suggest using the BIC and prove its model selection consistency property. A similar conclusion has been reached by Wang and Leng (2007) for the ALASSO. Hence, due to its widely reported superior empirical performance in variable selection, we use a BIC-type criterion to determine the values of the tuning parameter(s),

$$\mathrm{BIC}(\lambda) = -2\,\ell(\hat\theta_\lambda) + \log(n)\, e_\lambda,$$

where $\ell(\hat\theta_\lambda)$ is the unpenalized likelihood function defined in (2.1), $n$ is the sample size, and $e_\lambda = \mathrm{tr}[\{I_\lambda(\hat\theta_\lambda)\}^{-1} I_0(\hat\theta_\lambda)]$ is the effective degrees of freedom (Ha et al., 2007); we define the selected tuning parameter vector as $\hat\lambda^* = \arg\min_{\lambda^*} \mathrm{BIC}(\lambda^*)$. Note that, as described in Section 2.2, $\lambda^*$ will either be one-dimensional (when a common penalty is applied to $\beta$ and $\alpha$) or two-dimensional (when separate penalties are applied).
The simplest method of solving this optimization problem is a grid search. While it is straightforward to implement, grid search is known to suffer from the curse of dimensionality, i.e., the number of grid points grows exponentially with the dimension. Furthermore, if the grid is too coarse, the minimum may be overlooked. This is especially true in the case of a multi-modal function, such as the one we are trying to optimize (see Figure 1). To overcome the issues associated with the grid search algorithm, we consider "global" optimization algorithms. In an empirical comparison of various algorithms for continuous global optimization, Mullen (2014) found "DEoptim" (implemented in R) (Mullen et al., 2011) to be among the best. The function implements a differential evolution algorithm, an example of an evolutionary strategy developed by Storn and Price (1997) (see Mullen et al. (2011) for a detailed overview of the underlying algorithm).
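A hedged sketch of this step, in the two tuning parameter case, is shown below; it builds on the sketches above, computes the effective degrees of freedom by numerical Hessians, and the search bounds and DEoptim control settings are our illustrative choices, not values from the paper:

```r
library(DEoptim)  # differential evolution (Mullen et al., 2011)

# BIC as a function of lam = (lambda_beta, lambda_alpha): refit the penalized
# model, then e_lambda = tr{ I_lambda^{-1} I_0 } via numerical Hessians.
bic_fun <- function(lam, X, Z, time, delta, theta0, pen) {
  nobs <- length(time); p1 <- ncol(X); q1 <- ncol(Z)
  lam_b <- c(0, rep(lam[1], p1 - 1))   # intercepts unpenalized
  lam_a <- c(0, rep(lam[2], q1 - 1))
  fit <- mpr_pen_fit(theta0, X, Z, time, delta, lam_b, lam_a, pen)
  pen_obj <- function(th)              # penalized negative log-likelihood
    -mpr_loglik(th, X, Z, time, delta) +
      nobs * sum(pen(a_eps(th[1:p1]), lam_b)) +
      nobs * sum(pen(a_eps(th[(p1 + 1):(p1 + q1)]), lam_a))
  I0 <- optimHess(fit$par, function(th) -mpr_loglik(th, X, Z, time, delta))
  Il <- optimHess(fit$par, pen_obj)
  e  <- sum(diag(solve(Il, I0)))       # effective degrees of freedom
  -2 * mpr_loglik(fit$par, X, Z, time, delta) + log(nobs) * e
}

out <- DEoptim(bic_fun, lower = c(0, 0), upper = c(2, 2),
               control = DEoptim.control(NP = 20, itermax = 50, trace = FALSE),
               X = X, Z = Z, time = time, delta = delta,
               theta0 = theta0, pen = lasso_pen)
lam_hat <- out$optim$bestmem           # selected (lambda_beta, lambda_alpha)
```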

Variable Selection Algorithm
The variable selection algorithm described above is summarized in the following bullet points:

- Compute the unpenalized maximum likelihood estimate, used as a starting value and, for the ALASSO, to form the adaptive weights $w = 1/|\hat\theta_0|$.
- For a candidate tuning parameter vector $\lambda^*$, maximize the penalized log-likelihood (2.2) using the Newton-Raphson algorithm, with the differentiable approximation $a_\epsilon(x)$ in place of $|x|$.
- Evaluate the BIC criterion defined above at the resulting estimate $\hat\theta_\lambda$.
- Minimize $\mathrm{BIC}(\lambda^*)$ over $\lambda^*$ using the differential evolution algorithm (DEoptim).
- Return the penalized estimates corresponding to the selected $\hat\lambda^*$.
Simulation Studies

Setup
The performance of the proposed variable selection methods is evaluated through simulation studies. The failure time is simulated from a Weibull MPR model with

$$\log(\tau_i) = x_i^T (-1.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.8, 0.5, 0.0, 0.0)^T,$$
$$\log(\gamma_i) = z_i^T (0.5, 0.4, 0.0, 0.0, 0.0, 0.4, -0.2, 0.0, 0.0, 0.0, 0.0)^T,$$

where $x_i = z_i = (1, x_{i1}, \ldots, x_{i10})^T$ is a vector of correlated covariates generated from an AR(1) process with correlation coefficient $\rho = 0.5$. Each variable is marginally standard normal, and the correlation between any two variables $x_{ij}$ and $x_{ik}$ is given by $\rho^{|j-k|}$. The corresponding censoring times were generated from an exponential distribution. This setup was chosen so as to yield realistic survival data, where the true model is sparse and correlation in the covariates exists. Three different sample sizes (n = 100, 500 and 1000) and two censoring proportions (p_cen = 25%, 50%) are considered. For each scenario, we considered the LASSO, SCAD, and ALASSO penalties with either a single tuning parameter or two tuning parameters (i.e., one for each of the two regression components).
Each simulation was repeated 200 times.
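The data-generating process can be sketched in R as follows; the exponential censoring rate is our illustrative knob (the paper does not report its value), to be tuned by trial to reach a target censoring proportion:

```r
# Sketch of the simulation design: 10 AR(1) covariates (rho = 0.5),
# Weibull MPR survival times by inverse transform, exponential censoring.
set.seed(1)
n <- 500; p <- 10; rho <- 0.5
Sig <- rho^abs(outer(1:p, 1:p, "-"))      # AR(1) correlation matrix
Xc  <- matrix(rnorm(n * p), n, p) %*% chol(Sig)
X <- Z <- cbind(1, Xc)                    # common covariates, intercept first

beta  <- c(-1.5, -1.0, 0, 0, 0, 0, 0, -0.8, 0.5, 0, 0)
alpha <- c( 0.5,  0.4, 0, 0, 0, 0.4, -0.2, 0, 0, 0, 0)
tau <- exp(drop(X %*% beta)); gam <- exp(drop(Z %*% alpha))

Tt <- (-log(runif(n)) / tau)^(1 / gam)    # inverts S(t) = exp(-tau * t^gamma)
Ct <- rexp(n, rate = 0.05)                # rate controls the censoring level
time  <- pmin(Tt, Ct)
delta <- as.numeric(Tt <= Ct)
mean(1 - delta)                           # realized censoring proportion
```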

Simulation Results
The variable selection and estimation procedures described in Sections 2 and 3 are applied to the simulated data, and the results are summarized and discussed here. A number of metrics are used to evaluate the performance of the variable selection procedures, namely: the average number of true zero coefficients correctly set to zero (C), the average number of true non-zero coefficients incorrectly set to zero (IC), and the probability of choosing the true model (PT); for the oracle model, C = 7 and IC = 0. As a measure of prediction accuracy, we also consider the mean squared error (MSE), given by

$$\mathrm{MSE} = (\hat\theta - \theta)^T V (\hat\theta - \theta),$$

where $V$, the simulated sample covariance matrix of the covariates, is computed for each simulation replicate (Zhang and Lu, 2007; Tibshirani, 1997). These metrics, averaged over simulation replicates for the scenarios with 25% censoring, are reported in Table 1. (The results for 50% censoring, shown in the Appendix, are similar.) As the sample size increases, we see an improvement across all four metrics, for both the shape and the scale parameters and across all penalties. However, it is evident that the LASSO penalty does not set enough covariates equal to zero (i.e., it selects an overly complex model), irrespective of whether there is one tuning parameter or two. SCAD performs better but over-selects in the shape component, $\alpha$, when there is only one tuning parameter; this is improved by having two separate tuning parameters. The best performance comes from the ALASSO penalty, which, for the largest sample size, selects the true scale and shape covariates more than 90% of the time. Interestingly, the ALASSO performs well even with a single tuning parameter (but it does improve a little with two tuning parameters). Computation times for each of the penalties are given in Table 2, where we see that SCAD is considerably slower than LASSO and ALASSO, while the times for these latter two are comparable. Furthermore, the computation times for the cases with two tuning parameters are 2-3 times longer than those with one tuning parameter.
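For reference, the metrics for a single replicate can be computed along the following lines (a sketch; the MSE line follows our reading of the formula above):

```r
# Variable-selection metrics for one replicate: est and truth are coefficient
# vectors (excluding intercepts); V is the sample covariance of the covariates.
sel_metrics <- function(est, truth, V) {
  zero_true <- truth == 0
  C   <- sum(est == 0 & zero_true)                 # correct zeros
  IC  <- sum(est == 0 & !zero_true)                # incorrect zeros
  PT  <- as.numeric(all((est == 0) == zero_true))  # true model chosen?
  MSE <- drop(t(est - truth) %*% V %*% (est - truth))
  c(C = C, IC = IC, PT = PT, MSE = MSE)
}
```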
Besides variable selection, we also consider parameter inference in terms of estimation bias, the accuracy of the estimated standard error (SEE) computed using the sandwich formula (3.6), and the empirical coverage probability (CP) of a nominal 95% confidence interval. The results for the ALASSO penalty (at the 25% censoring level) are presented in Table 3. These results show that the SEE is accurate for moderate sample sizes, but may underestimate the standard error (SE) for smaller samples. For n = 100 and n = 500, the smaller parameters are overshrunk, i.e., they are biased downwards, and, for this reason, the CPs do not perform well. However, for n = 1000, the CPs are close to the nominal 95% level (although perhaps a little low for the shape coefficients), and an improvement can be seen in the case of two tuning parameters. Similar tables for the other penalties (and 50% censoring) can be found in the Appendix; to summarize these additional results: LASSO overshrinks all parameters and the standard errors are underestimated in all cases, whereas the results for SCAD are better, but not as good as those for the ALASSO; in particular, the coverage of SCAD confidence intervals is much lower than the nominal level even for n = 1000. As expected, when the censoring rate is increased from 25% to 50%, the variation (i.e., SE, SEE and MSE) of the estimates increases overall across all three penalties.

Lung Cancer Data
To illustrate the penalized variable selection methods on real data, a lung cancer dataset is considered. This dataset was collected as part of a PhD thesis by Wilkinson (1995) (see also Burke and MacKenzie (2017)). It contains all individuals, of all ages, diagnosed with lung cancer in Northern Ireland during the one-year period 1 October 1991 to 30 September 1992. Only cases of primary lung cancer were included. The date of diagnosis was taken to be the time origin for an individual, and the end point was the earlier of death or the study end date, 30 May 1993. Individuals who were still alive on the study end date were taken to have censored survival times. Individuals who died from another cause or who dropped out of the study were also censored.

Adequacy of Weibull
Before considering covariates and variable selection, we first carry out an initial check that a baseline Weibull distribution is appropriate for the lung cancer data. The cumulative hazard function for the Weibull model is given by

$$H(t) = \int_0^t h(u)\, du = \tau t^{\gamma},$$

and, hence, $\log H(t) = \log \tau + \gamma \log t$. Therefore, given an estimate $\hat H(t)$, a plot of $\log \hat H(t)$ against $\log t$ should produce a straight line. This standard Weibull model check is shown in Figure 2, and, despite a slight deficiency for very small survival times, it appears that the Weibull model is reasonable.
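This check can be reproduced in R with the survival package (a sketch; `time` and `delta` denote the observed times and event indicators):

```r
library(survival)

km   <- survfit(Surv(time, delta) ~ 1)   # Kaplan-Meier estimate
Hhat <- -log(km$surv)                    # estimated cumulative hazard
keep <- Hhat > 0 & km$time > 0           # avoid log(0)
plot(log(km$time[keep]), log(Hhat[keep]),
     xlab = "log t", ylab = "log H(t)")
abline(lm(log(Hhat[keep]) ~ log(km$time[keep])), lty = 2)  # reference line
```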

Variable Selection Results
The variable selection results for the different penalties are summarized in Table 4. In line with the results of the simulation study, the LASSO penalty selects the most complex model and the ALASSO penalty selects the least complex. Both ALASSO penalties (the one and two tuning parameter cases) are in agreement on the non-importance of sex and smoking status, and, although age group is selected in the scale in the one tuning parameter case, it is not significant. Interestingly, the two tuning parameter ALASSO selects the same set of covariates as identified by Burke and MacKenzie (2017) using a BIC stepwise procedure (albeit they additionally selected treatment in the shape). We also see that, in the two tuning parameter case, the scale tuning parameter is smaller than that of the one tuning parameter case, while the shape tuning parameter is larger. This suggests that the single penalty over-penalizes the scale coefficients and under-penalizes the shape coefficients; this is also evident from the scale and shape degrees of freedom. Interestingly, the one tuning parameter ALASSO converges in less than half the time of the two tuning parameter ALASSO, and achieves similar results. We expect this based on our simulation studies, and also expect the results of the two tuning parameter case to be marginally better (albeit it takes longer to converge). Table 5 displays the estimated coefficients for both ALASSO penalties along with the unpenalized coefficients (we focus on the ALASSO due to its superior performance in our simulation studies, but similar tables for LASSO and SCAD can be found in the Appendix). From Burke and MacKenzie (2017), note that the scale coefficients characterize the overall scale of the hazard (a positive value indicates an increase relative to the reference category), while the shape coefficients characterize its time evolution (a positive value indicates a hazard which increases over time relative to the reference category). We clearly see the similarity of the coefficient values for the one and two tuning parameter ALASSO penalties, and, furthermore, that the selected variables are broadly in line with those which are statistically significant in the unpenalized model. Focussing on the results of the two tuning parameter case, we find that: all treatments (apart from chemotherapy) have a negative scale coefficient, suggesting that treatment reduces the hazard (relative to palliative care); worse WHO status, small cancer cell type, presence of metastases, and reduced sodium and albumin levels increase the hazard; and, lastly, sex, age group, and smoking status have no significant effect on the hazard. Since no variable appears in the shape component (i.e., all shape coefficients are set to zero), the selected model is a proportional hazards model, and exponentiating the scale coefficients yields hazard ratios; e.g., the surgery hazard ratio is exp(−0.98) ≈ 0.375, so that the risk of death for a patient who received surgery is approximately 37.5% of that for a patient receiving palliative care.

Discussion
The MPR approach results in flexible models which extend standard models, but the presence of multiple regression components means that variable selection is necessarily more challenging than in standard settings where there is only a single regression component. In this article, we have proposed a penalized variable selection procedure for the simultaneous selection of significant variables in the shape and scale parameters of a Weibull MPR model in the survival analysis setting. The performance of these methods was illustrated using simulation studies and a real data example. While we have considered the Weibull model example in this article, the proposed variable selection procedures can be applied easily to other MPR models.
Given that we model different distributional parameters (a scale and a shape parameter), there is no reason to assume that variable selection can be achieved with a single penalty applied to both regression components; hence, we also investigated the need for a separate tuning parameter for each regression component. We have found that the ALASSO performs very favourably in terms of identifying the true subset of covariates and the coverage of calculated confidence intervals. This is true even with a single tuning parameter; however, the results are improved when there are two tuning parameters (albeit this is more computationally intensive). On the other hand, SCAD does not perform well in the MPR setting, selecting an overly complex model and exhibiting poor confidence interval coverage for the shape parameters.

Figure 1 :
Figure 1: The BIC function evaluated at different tuning parameter values for the Weibull MPR model with the LASSO penalty for the lung cancer dataset analysed in Section 5; panel (a) corresponds to one tuning parameter and panel (b) to two.

Table 5 :
Coefficient estimates and standard errors for the ALASSO penalties (lung cancer dataset); significance at the 5% level is indicated.

Table 1 :
Simulation results: variable selection metrics averaged over 200 simulation replicates.
C, average correct zeros; IC, average incorrect zeros; PT, the probability of choosing the true model; MSE, the average mean squared error.

Table 2 :
Average computation time per simulation replicate (in minutes).

Table 3 :
Further simulation results: estimates, standard errors, and confidence intervals.
SE, standard deviation of estimates over 200 replications; SEE, average of estimated standard errors over 200 replications; CP, the empirical coverage probability of a nominal 95% confidence interval.

Table 4 :
Summary of penalized models (lung cancer dataset); β = "selected in scale", α = "selected in shape", and terms which are non-significant (at the 5% level) are shown in gray.

Table 6 :
Simulation results: variable selection metrics averaged over 200 simulation replicates.

Table 7 :
Further simulation results: estimates, standard errors, and confidence intervals.

Table 8 :
Further simulation results: estimates, standard errors, and confidence intervals.

Table 9 :
Further simulation results: estimates, standard errors, and confidence intervals.

Table 10 :
Further simulation results: estimates, standard errors, and confidence intervals.

Table 11 :
Further simulation results: estimates, standard errors, and confidence intervals.

Table 12 :
Coefficient estimates and standard errors for the one tuning parameter setup (lung cancer dataset)

Table 13 :
Coefficient estimates and standard errors for the two tuning parameter setup (lung cancer dataset)