Effect fusion using model-based clustering

In social and economic studies many of the collected variables are measured on a nominal scale, often with a large number of categories. The definition of categories is usually not unambiguous, and different classification schemes using either a finer or a coarser grid are possible. Categorisation has an impact when such a variable is included as a covariate in a regression model: too fine a grid will result in imprecise estimates of the corresponding effects, whereas with too coarse a grid important effects will be missed, resulting in biased effect estimates and poor predictive performance. To achieve automatic grouping of levels with essentially the same effect, we adopt a Bayesian approach and specify the prior on the level effects as a location mixture of spiky normal components. Fusion of level effects is induced by a prior on the mixture weights which encourages empty components. Model-based clustering of the effects during MCMC sampling makes it possible to simultaneously detect categories which have essentially the same effect size and to identify variables with no effect at all. The properties of this approach are investigated in simulation studies. Finally, the method is applied to analyse the effects of high-dimensional categorical predictors on income in Austria.


Introduction
Researchers in medicine, social and economic sciences routinely collect data measured on a nominal scale as potential predictors in regression models. The usual approach to include such categorical predictors in regression type models is to define one category as the baseline or reference category and use dummy variables for the effects of all other categories with respect to this baseline. Thus, the effect of one categorical covariate with c + 1 categories is captured by a set of c regression coefficients.
This leads to several issues. Including such predictors even with a moderate number of categories can easily lead to a high-dimensional vector of regression coefficients.
Further, only the subset of observations with a specific covariate level provides information on its effect, which may result in high standard errors and unstable estimates for the effects of infrequent levels. These issues become even more pronounced if the researcher uses a fine classification grid when categorising the data. As the definition of categories is often not completely dictated by subject-specific considerations, the researcher could categorise observations either finer or coarser when collecting the data. With both strategies she/he could run into problems when categorical variables are used as covariates in a regression model: fine categories can result in only a few subjects per category and imprecise estimates of the corresponding effects, whereas estimated effects using too coarse categories might be biased due to confounding effects of finer categories.
In order to avoid the risk of overlooking substantial differences in level effects, it would be appealing to have a method which makes it possible to start with a large regression model, including categories on a very fine classification grid, and to obtain a sparser representation of this model during estimation. Sparsity can be achieved whenever the effects of a categorical predictor can be represented by fewer than c regression effects.
Basically, there are three different situations where sparsity is an issue. First, if all level effects are zero, the whole covariate can be excluded from the model. Second, if some of the level effects are zero, the corresponding levels can be excluded from the model. Finally, if some levels have essentially the same effect on the response, sparsity is achieved by fusing the effects of these levels.
Usually, sparsity in regression type models is achieved by applying variable selection methods which identify regressors with non-zero effects, e.g. the lasso (Tibshirani, 1996) or the elastic net (Zou and Hastie, 2005) in the frequentist framework, and shrinkage priors (Park and Casella, 2008; Griffin and Brown, 2010) or spike and slab priors (Mitchell and Beauchamp, 1988; George and McCulloch, 1997; Ishwaran et al., 2001) in the Bayesian framework. However, these methods are not appropriate for categorical covariates as only single level effects are selected or excluded from the model. Approaches that address exclusion of a whole group of regression effects have been proposed by Chipman (1996); Yuan and Lin (2006); Raman et al. (2009); Kyung et al. (2010), and recently by Simon et al. (2013), but none of these approaches also allows for effect fusion.
For metric predictors, effect fusion can be performed by the fused lasso (Tibshirani et al., 2005) and the Bayesian fused lasso (Kyung et al., 2010). Both methods assume some ordering of effects and shrink only effect differences of consecutive levels to zero, and hence are not appropriate for nominal predictors where any pair of level effects should be subject to fusion. Explicit effect fusion for nominal predictors is considered in Bondell and Reich (2009) and by Gertheiss and Tutz (Gertheiss and Tutz, 2009; Gertheiss et al., 2011; Gertheiss and Tutz, 2010; Tutz and Gertheiss, 2016), who specify lasso-type penalties on effects and effect differences. In a recent Bayesian approach, a prior distribution was specified that can be interpreted as a spike and slab prior on effects and effect differences. However, these approaches are limited to covariates with a moderately large number of categories, as for a covariate with c + 1 categories all c(c + 1)/2 pairwise differences have to be considered, which inflates the large model even more.
An appealing approach for effect fusion which avoids classification of effect differences and allows effects to be fused directly is to use model-based clustering techniques which rely on mixture prior distributions. Sparse modelling of regression effects by specifying a mixture prior has so far primarily been used for continuous variables. Yengo et al. (2014) and Yengo et al. (2016) define a normal mixture prior for the regression effects and determine the number of components, i.e. coefficient groups, using model choice criteria. In a nonparametric framework, MacLehose and Dunson (2010) use an infinite mixture of heavy-tailed double-exponential distributions on the coefficients of continuous predictors to allow groups of coefficients to be shrunk towards the same, possibly non-zero, mean. Only Dunson et al. (2008) consider categorical covariates. They propose a multi-level Dirichlet process (DP) prior on the effects of single nucleotide polymorphisms (SNPs) in genetic association studies. This prior takes the hierarchical structure of the predictors into account and allows clustering of SNPs both within and across genes. However, with only 22 markers, each with three levels, just a small number of levels is investigated.
Following this line of research we propose to achieve model based clustering of level effects by specifying a finite normal mixture prior. Our approach is explicitly designed to address effect fusion for categorical covariates and has several advantageous features.
First, fusing the level effects directly instead of focusing on all effect differences enables us to handle categorical covariates with a large number of categories, e.g. 100 or more. Second, the specified mixture prior can be interpreted as a generalisation of the standard spike and slab prior (George and McCulloch, 1993), where a spike distribution at zero is combined with a rather flat slab distribution to allow selective shrinkage of effects; see Malsiner-Walli and Wagner (2011) for an overview. We replace the slab distribution by a location mixture distribution with different, non-zero means. This mixture prior makes it possible to shrink non-zero effects to various non-zero values and introduces a natural clustering of the categories: if level effects are assigned to the same mixture component, they are assumed to be (almost) identical and can be fused.
Third, the hyperparameters of the mixture prior are chosen very carefully to achieve the modelling aims. Their specification is based on the data to yield recommendations that are applicable to a wide range of real data situations. The 'fineness' of the estimated level classification can be guided by the size of the specified component variance, with smaller variances inducing a larger number of estimated effect groups.
The prior on the mixture weights is specified following the concept of a 'sparse finite mixture' (Malsiner-Walli et al., 2016). Specifying a sparsity inducing prior on the weights in an overfitting mixture avoids unnecessary splitting into superfluous components and encourages concentration of the posterior distribution on a sparse cluster solution, thus allowing the number of effect groups to be estimated from the data.

The paper is organised as follows. In Section 2 the model and the prior distributions for the model parameters are introduced. Details on posterior inference and model selection are given in Section 3. The method is evaluated in a simulation study in Section 4 and applied to a regression model for income data in Austria in Section 5.

Effect clustering prior
We consider a standard linear regression model with observations i, i = 1, ..., N, continuous response y and J categorical covariates with categories 0, ..., c_j, where j = 1, ..., J. For each covariate, 0 is defined as the baseline category and X_jk denotes the dummy variable corresponding to the k-th category of covariate j. Hence, the regression model is given as

y_i = β_0 + Σ_{j=1}^{J} Σ_{k=1}^{c_j} β_jk X_{i,jk} + ε_i,     (2.1)

where ε_i ∼ N(0, σ²) is a normal error term, β_0 is the intercept, and β_jk, k = 1, ..., c_j, is the effect of the k-th category of covariate j with respect to the baseline category.
We call β jk the 'level effect' of category k.
To complete the Bayesian model specification, prior distributions have to be assigned to all model parameters. We assume that regression effects are independent between covariates and use a prior of the structure

p(β_1, ..., β_J | ξ_1, ..., ξ_J) = Π_{j=1}^{J} p(β_j | ξ_j),     (2.2)

where ξ_j denotes additional covariate-specific hyperparameters. A flat normal prior β_0 ∼ N(0, B_0) is assigned to the intercept, and an improper inverse gamma prior is assigned to the error variance σ².
Our goal is to specify a prior for the level effects of covariate j which allows the identification of effect groups. Therefore, we specify a finite mixture of normal distributions as a prior on the level effects β jk . In contrast to the popular spike and slab priors employed for selection of regression effects, we use a location mixture of more than two components which have a small variance, i.e. all components are spiky.
The prior on a regression effect β_jk is specified hierarchically as

β_jk | η_j, µ_j, ψ_j ∼ Σ_{l=0}^{L_j} η_jl N(µ_jl, ψ_j),     (2.3)

where L_j + 1 is the number of normal mixture components for covariate j with location parameters µ_jl and scale parameter ψ_j. For each covariate, the location parameter of the first component µ_j0 is fixed at 0 to allow identification of categories which have the same effect as the baseline category. If all level effects are assigned to this component, the covariate can be completely excluded from the model. We subsume in µ_j = (µ_j1, ..., µ_jL_j) all other component means, which are assumed to be conditionally independent and follow a flat normal hyperprior with location and scale parameters m_j0 and M_j0. For each covariate, the variance ψ_j is the same for all components in order to ensure that each level effect group has the same dispersion; however, ψ_j may vary between covariates. Finally, a symmetric Dirichlet distribution Dir_{L_j+1}(e_0) with parameter e_0 is specified for the mixture weights η_j = (η_j0, ..., η_jL_j).
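As an illustration, this hierarchical prior can be simulated directly. The following numpy sketch draws a set of level effects for one covariate; all parameter values are illustrative choices of ours, not recommendations from the text.

```python
import numpy as np

def draw_level_effects(c_j, L_j, psi_j, e0, m0, M0, rng):
    """Draw c_j level effects from the spiky location-mixture prior:
    component 0 is pinned at mean 0 (fusion to the baseline), the other
    L_j component means come from the flat N(m0, M0) hyperprior."""
    # Mixture weights from the symmetric Dirichlet Dir(e0).
    eta = rng.dirichlet([e0] * (L_j + 1))
    # Component means: mu_j0 = 0 fixed, the rest from the hyperprior.
    mu = np.concatenate(([0.0], rng.normal(m0, np.sqrt(M0), size=L_j)))
    # Allocate each level effect to a component, then draw it.
    S = rng.choice(L_j + 1, size=c_j, p=eta)
    beta = rng.normal(mu[S], np.sqrt(psi_j))
    return beta, S, mu, eta

rng = np.random.default_rng(1)
beta, S, mu, eta = draw_level_effects(c_j=9, L_j=9, psi_j=0.01,
                                      e0=0.01, m0=0.0, M0=4.0, rng=rng)
```

With a small e0 most Dirichlet weight concentrates on a few components, so the drawn effects typically form only a handful of distinct groups even though ten components are available.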
An alternative to our finite mixture approach would be to specify an infinite mixture based on a Dirichlet process prior DP(α) for the level effects of covariate j. In this case, the a priori specification of the number of components L_j + 1, a well-known limitation of finite mixtures, would not be necessary, as it can be estimated from the data. However, we overcome this weakness of finite mixtures by specifying a sparse finite mixture (Malsiner-Walli et al., 2016) as prior on the level effects. This makes it possible to estimate the number of 'true' components through the number of 'non-empty' components in an overfitting mixture. More details on this strategy will be provided in Section 2.1.
Additionally, it has to be pointed out that the clustering behaviour of finite and infinite mixtures is quite different. For infinite mixtures the a priori expected number of level groups is proportional to α · log(c_j) (MacLehose and Dunson, 2010; Malsiner-Walli et al., 2016), which means that the expected number of clusters grows with the number of levels c_j. In contrast, for a finite mixture prior as proposed here, the a priori number of non-empty level groups is asymptotically independent of the number of levels c_j (Malsiner-Walli et al., 2016). Hence, using a finite mixture prior for the effects of a categorical predictor seems more suitable, as one would expect that in a hierarchical categorisation scheme there exists a certain level of aggregation which is able to capture all relevant effect differences, and the number of different effect sizes would not increase with a 'finer' classification grid.
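The logarithmic growth under the DP prior can be checked numerically: for n draws from a DP(α), the a priori expected number of occupied clusters is Σ_{i=1}^{n} α/(α + i − 1). A small sketch of ours, not taken from the paper:

```python
def dp_expected_clusters(alpha, n):
    """A priori expected number of occupied clusters when n effects are
    drawn from a DP(alpha) prior: sum_{i=1}^{n} alpha / (alpha + i - 1)."""
    return sum(alpha / (alpha + i - 1) for i in range(1, n + 1))

# With alpha = 1 the expectation is the harmonic number H_n ~ log(n):
# roughly 2.9 groups for 10 levels but already 5.2 for 100 levels, so a
# finer categorisation grid alone inflates the prior cluster count.
e10 = dp_expected_clusters(1.0, 10)
e100 = dp_expected_clusters(1.0, 100)
```

This is exactly the behaviour the finite sparse mixture avoids: its prior number of non-empty groups does not grow with c_j.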

Choice of hyperparameters
The specification of the prior hyperparameters is crucial to achieve our modelling aims. To obtain recommendations that are applicable to a wide range of situations, we take an empirical approach and choose the hyperparameters depending on the data.
The location parameter of the first mixture component, µ_j0, is fixed at 0 in order to allow fusion to the baseline. For the location parameters of all other components, µ_jl, we specify a normal hyperprior located at the 'centre' of the effects and with a large variance, in order to induce only little shrinkage towards the prior mean. Thus, we set the mean m_j0 of the normal hyperprior to m_j0 = mean(β̂_j) and the variance M_j0 to the squared range of β̂_j, i.e. M_j0 = (max_k β̂_jk − min_k β̂_jk)², where β̂_j is the estimated coefficient vector of covariate j under a flat prior.
Level effects should be assigned to the same component only if the sizes of their effects are almost identical. Therefore, specification of the component variance ψ_j is crucial, as it reflects the notion of negligible/relevant effect differences. As the prior on the component variance ψ_j should take into account the scaling of covariates, we allow ψ_j to vary across covariates but not between levels of one covariate.
We define the component variance ψ_j as some proportion 1/ν of the variation of the estimated level effects β̂_j under a flat prior, i.e.

ψ_j = (max_k β̂_jk − min_k β̂_jk)² / ν.

With increasing ν the shapes of the mixture components become more spiky and more distinct groups of level effects will be identified. Thus, ψ_j implicitly controls the 'fineness' of the estimated partition of level effects and hence the size of the selected model. As mentioned above, the component variances are defined covariate-specifically in order to account for the dispersion of the level estimates within a covariate. However, the component variances could also be specified globally, i.e. with the same spike size for all covariates, if interest lies in defining a 'global' threshold for level effect differences across all covariates.

Since the choice of the prior component variance ψ_j influences effect fusion, as an alternative we consider ψ_j to be random with a hyperprior ψ_j ∼ G⁻¹(g_0, G_0j). We expect to obtain more robust cluster solutions, as the influence of a fixed parameter ψ_j should be mitigated. For a given value of g_0, we choose G_0j such that a priori E(ψ_j) matches the fixed value specified above; the shape parameter g_0 controls the deviation from this expected value. To allow only for small deviations from the expectation we set g_0 = 100; thus, a priori the standard deviation of ψ_j is around 1/10 of the expected mean. We investigate the influence of the variance parameter ν for fixed variance ψ_j as well as under a hyperprior in the simulation study in Section 4.
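In code, these data-based choices can be sketched as follows. The exact form of ψ_j is our reading of the truncated formula in the text, taking 1/ν of the squared range of the flat-prior estimates, in line with the definition of M_j0:

```python
import numpy as np

def mixture_hyperparameters(beta_hat, nu):
    """Data-based hyperparameters for one covariate, computed from the
    effect estimates beta_hat obtained under a flat prior."""
    m_j0 = beta_hat.mean()                   # centre of the effects
    r = beta_hat.max() - beta_hat.min()      # range of the estimates
    M_j0 = r ** 2                            # flat hyperprior variance
    psi_j = r ** 2 / nu                      # spiky component variance
    return m_j0, M_j0, psi_j

beta_hat = np.array([0.02, -0.01, 0.48, 0.52, 1.01])
m_j0, M_j0, psi_j = mixture_hyperparameters(beta_hat, nu=100)
# Larger nu => smaller psi_j => a 'finer' partition of the level effects.
```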
We now turn to the specification of the number of mixture components L_j + 1. We set L_j = c_j in order to capture the redundant case where all effects are different from each other (and from the baseline). Thus, our prior defines an overfitting mixture model, where the mixture distribution on the level effects has more components than level effects to be estimated. In order to achieve a sparser estimation of the overfitting mixture model by encouraging superfluous components to be emptied, we follow the sparse finite mixture approach of Malsiner-Walli et al. (2016).

Posterior inference

The posterior distribution, which results from combining the likelihood derived from equation (2.1) with the prior distribution of (β, σ²) specified in (2.2)-(2.6), is not of closed form, and therefore MCMC methods are used for posterior inference. During MCMC sampling the whole model space is explored, i.e. different clustering solutions for the covariate effects are visited, which makes it possible to assess model uncertainty and also to determine model averaged estimates. However, although model averaged estimates of the coefficients may give good results in terms of prediction, researchers are often interested in the selection of a final model and the interpretation of its results. In regression models with categorical predictors, model selection is more involved than in standard variable selection, as the problem is to determine an appropriate clustering of level effects, which means that both the number of clusters and the members of each cluster have to be determined.
We address this problem in Section 3.3, where we present two different strategies for model selection for the effects of a categorical covariate.

MCMC sampling
Parameter estimation is performed through MCMC sampling based on data augmentation (Diebolt and Robert, 1994; Frühwirth-Schnatter, 2006). For each covariate j, latent allocation variables S_j = (S_j1, ..., S_jc_j) are introduced to indicate the component a regression effect β_jk is assigned to; S_jk takes values in {0, 1, ..., L_j}. Conditional on S_jk = l, the prior distribution for β_jk is the normal mixture component distribution N(µ_jl, ψ_j). The sampling scheme includes the following model-based clustering steps:

5. If a hyperprior is specified on ψ_j, sample the mixture component variances ψ_j from their inverse gamma posterior G⁻¹(g_jN, G_jN) for j = 1, ..., J; otherwise this step is omitted.
6. Sample the latent allocation indicators S from their full conditional posterior distribution.

More details on the sampling steps are given in Appendix 1.1. The method is implemented in the R package effectFusion, which is available on CRAN.
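The allocation update in step 6 can be sketched in a few lines of numpy. This is a hypothetical stand-alone version of the update, not the package code: conditional on the current values of β_j, µ_j, ψ_j and η_j, each indicator S_jk is drawn from a discrete distribution with P(S_jk = l) ∝ η_jl · N(β_jk; µ_jl, ψ_j).

```python
import numpy as np

def sample_allocations(beta, mu, psi, eta, rng):
    """One Gibbs update of the latent allocations for one covariate:
    P(S_jk = l | beta_jk, ...) is proportional to eta_l * N(beta_jk; mu_l, psi)."""
    # Unnormalised log posterior over components, one row per level effect.
    logp = -0.5 * (beta[:, None] - mu[None, :]) ** 2 / psi + np.log(eta)[None, :]
    logp -= logp.max(axis=1, keepdims=True)        # numerical stability
    prob = np.exp(logp)
    prob /= prob.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(mu), p=p) for p in prob])

rng = np.random.default_rng(0)
beta = np.array([0.00, 0.01, 0.50, 0.51, 1.00])
mu = np.array([0.0, 0.5, 1.0])          # current component means
S = sample_allocations(beta, mu, psi=1e-4, eta=np.ones(3) / 3, rng=rng)
# With such a spiky psi, each effect is allocated to its nearest mean.
```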

Model averaged estimates
The MCMC draws approximate the whole posterior distribution, taking into account model uncertainty: e.g. for a regression effect β_jk the posterior is the mixture distribution

p(β_jk | y) = Σ_M p(β_jk | y, M) p(M | y),

where the mixture components p(β_jk | y, M) are the model-specific posterior distributions and the mixture weights p(M | y) are the posterior model probabilities. Hence, the mean over all MCMC draws for β_jk should be a robust, model-averaged estimator. Its predictive performance is investigated in Section 4.

Model selection
To perform model selection, generally the sampled mixture models have to be identified. In the Bayesian framework, identification of a finite mixture model requires handling the 'label switching' problem (Redner and Walker, 1984), which is caused by the invariance of representation (2.3) with respect to reordering the components:

Σ_{l=0}^{L_j} η_jl N(µ_jl, ψ_j) = Σ_{l=0}^{L_j} η_jρ(l) N(µ_jρ(l), ψ_j),

where ρ is an arbitrary permutation of {0, ..., L_j}. In practice it may happen that during MCMC sampling the labels associated with the components change, which impedes component-specific inference from the MCMC output. The label switching problem is usually solved by post-processing the MCMC output in order to obtain a unique labelling of the draws. We avoid the label switching problem by basing model selection on the information whether a pair of level effects is assigned to the same or to different clusters. For each iteration m and each covariate j, we record whether each pair of level effects is allocated to the same component; averaging these indicators over the iterations yields a similarity matrix C_j of joint inclusion probabilities.

Similar to k-means clustering, k-medoids clustering aims at clustering points by minimising the distances between points assigned to a cluster and the point defined as the centre of the cluster. k-medoids always chooses a data point as the centre of a cluster (the 'medoid') and works with arbitrary distance metrics between the data points. This feature makes it attractive for our approach, since the similarity matrix can easily be transformed to a distance matrix D_j = 1 − C_j, where 1 is a matrix with elements 1. We use the clustering algorithm Partitioning Around Medoids (PAM) proposed by Kaufman and Rousseeuw (2005), which yields an optimal partition for a specified number of clusters. The final partition is chosen by comparing partitions with different numbers of clusters by their silhouette coefficients (Rousseeuw, 1987). The definition of the silhouette coefficient is given in Appendix 1.2.1.
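A self-contained sketch of this strategy follows. The PAM routine is a simplified version of ours with a greedy farthest-point initialisation; the paper uses the full algorithm of Kaufman and Rousseeuw (2005).

```python
import numpy as np

def similarity_matrix(S_draws):
    """Joint inclusion probabilities from M saved allocation vectors
    (an M x c_j array): entry (k, k') is the fraction of iterations in
    which level effects k and k' sit in the same mixture component."""
    M, c = S_draws.shape
    C = np.zeros((c, c))
    for s in S_draws:
        C += (s[:, None] == s[None, :])
    return C / M

def pam(D, n_clusters, n_iter=100):
    """Simplified Partitioning Around Medoids on a distance matrix D:
    greedy farthest-point start, then alternate assignment/medoid update."""
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < n_clusters:
        medoids.append(int(np.argmax(D[:, medoids].min(axis=1))))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for g in range(n_clusters):
            members = np.where(labels == g)[0]
            if members.size:
                new[g] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return np.argmin(D[:, medoids], axis=1)

# Toy allocations: levels 0, 1 and levels 2, 3 co-cluster in most draws.
S_draws = np.array([[0, 0, 1, 1], [2, 2, 0, 0], [1, 1, 1, 1]])
D = 1.0 - similarity_matrix(S_draws)
labels = pam(D, n_clusters=2)
```

Note that the co-clustering of levels 0 and 1 is recovered although the component labels differ from draw to draw, which is exactly why this construction sidesteps label switching.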
An advantage of this approach is that level effect clusters are correctly identified even if distances are high, i.e. joint inclusion probabilities are rather small. This can happen if the number of categories is large and strong overlapping of the mixture components induces frequent switching of levels between the components, so that the joint inclusion probabilities of any two level effects become small and the most frequently sampled model is not a good representative of the sampled models. However, a drawback of this approach is that the silhouette coefficient cannot be computed for a one-cluster solution. Therefore, with this strategy it is not possible to identify the case where all level effects are assigned to the zero component and the corresponding predictor can be excluded from the model.

Simulation study
A sparser representation of the effects of a categorical covariate is possible when 1) some or 2) all of the levels have no effect at all, or 3) some levels have the same effect and hence can be fused. To investigate the performance of the proposed prior distribution in these situations, we perform a simulation study where categorical covariates with moderate as well as large numbers of levels represent the various types of sparsity.
We evaluate both model selection strategies proposed in Section 3.3, i.e. using either the most frequent sampled partition or the partition selected by performing PAM and the silhouette coefficient, with respect to correct model selection. Further, we determine estimation accuracy and predictive performance of the estimates based on the selected models as well as the model averaged estimates.

Set-up
We define a regression model according to (2.1) with four independent categorical predictors, the first three predictors having 10 and the fourth 100 categories. All categories have uniform prior class probabilities. The level effects of the first covariate take three different values (β_1 = (0, 0, 0, 0.5, 0.5, 0.5, 1, 1, 1)), for the second covariate only one level has a non-zero effect on the outcome (β_2 = (0, 0, 0, 0, 0, 0, 0, 0, 1)), the levels of the third covariate have no effect at all, and the levels of the last covariate have six different effects (0, 0.5, 1, 1.5, 2, 2.5), equally distributed among the levels.

Model selection results
The model selection results are evaluated by reporting the estimated number of level effect groups. Additionally, the clustering quality is assessed by calculating the adjusted Rand index (Hubert and Arabie, 1985), the error rate, false negative and false positive rate.
The adjusted Rand index (Hubert and Arabie, 1985) quantifies the similarity between the true and estimated partitions of the level effects. It is a corrected form of the Rand index (Rand, 1971), adjusted for chance agreement. A value of 1 corresponds to perfect agreement between two partitions, whereas an adjusted Rand index of 0 corresponds to results no better than expected by randomly drawing two partitions, each with a fixed number of clusters and a fixed number of elements in each cluster.
A formal definition of the index can be found in Appendix 1.2.2.
The error rate (err) of the clustering result is the number of misclassified categories divided by the number of all categories; it should be as small as possible. Since interest mainly lies in avoiding incorrect fusion of categories rather than unnecessary splitting of 'true' groups, additionally the false negative rate (FNR) and false positive rate (FPR) are reported. They are defined as

FNR = FN / (FN + TP),    FPR = FP / (FP + TN),

where FN is the number of levels incorrectly fused, FP is the number of levels incorrectly split, and TN and TP are the numbers of levels fused and split correctly, respectively.

The strategy 'pam' is not able to select the one-cluster solution with all level effects being 0, and also 'most' has some difficulty fusing all levels to the baseline. However, using a broader variance by setting ν to 10 or 10², fusion to the baseline is perfect for this variable, as can be seen in Table 6. In Table 3 we report the results for covariate 4, which is of special interest due to its large number of levels; results for all other covariates are reported in Appendix 1.3.
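As a sketch, these quantities can be computed by comparing the true and estimated partitions over all pairs of levels; this pairwise reading of the level-wise counts in the text is our own plausible interpretation.

```python
from itertools import combinations

def fusion_error_rates(true_labels, est_labels):
    """Pairwise err, FNR and FPR: a pair of levels is 'fused' if both
    share a group; FN = incorrectly fused, FP = incorrectly split pairs."""
    tp = fp = tn = fn = 0
    for k, l in combinations(range(len(true_labels)), 2):
        same_true = true_labels[k] == true_labels[l]
        same_est = est_labels[k] == est_labels[l]
        if same_true and same_est:
            tn += 1        # correctly fused
        elif same_true:
            fp += 1        # incorrectly split
        elif same_est:
            fn += 1        # incorrectly fused
        else:
            tp += 1        # correctly split
    err = (fp + fn) / (tp + fp + tn + fn)
    fnr = fn / (fn + tp) if fn + tp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return err, fnr, fpr

# True groups {0,1} and {2,3}; the estimate wrongly fuses level 2 with 0, 1.
err, fnr, fpr = fusion_error_rates([0, 0, 1, 1], [0, 0, 0, 1])
```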
For fixed ψ_j, as expected, the number of identified groups increases with ν as the spike variance ψ_j decreases. To detect the 'true' effect clusters, a good choice for ν is a value in the range of ν = 10² to ν = 10³; the adjusted Rand index and error rate are also good for these values.

Parameter estimation accuracy and predictive performance
To evaluate the performance of the proposed approach with respect to estimation accuracy of the parameters, we compute the mean squared error (MSE) of the coefficient estimates by averaging over all data-set-specific mean squared errors

MSE_i = (1/C) Σ_{j,k} (β̂_jk^(i) − β_jk)²,

where i is the number of the data set and C = Σ_{j=1}^{J} c_j is the dimension of the vector of regression coefficients β in the full model.
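For concreteness, a short sketch of this evaluation with illustrative arrays (not the simulation results):

```python
import numpy as np

def average_mse(beta_true, estimates):
    """Average over data sets of the per-data-set mean squared error
    (1/C) * sum_k (beta_hat_k - beta_k)^2 of the coefficient vector."""
    C = beta_true.size
    return np.mean([np.sum((b - beta_true) ** 2) / C for b in estimates])

beta_true = np.zeros(4)
estimates = [np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0])]
mse = average_mse(beta_true, estimates)   # (0.25 + 0.25) / 2 = 0.25
```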
The MSE results are displayed in Figure 2. Predictive performance is measured by the mean squared prediction error (MSPE), where ŷ^(i) is the estimate in data set i; the average MSPE is displayed in Figure 3.

Application
We illustrate the proposed approach to effect fusion in an application to data from the EU-SILC (Statistics on Income and Living Conditions) survey 2010 in Austria.
Relying on a questionnaire, the EU-SILC data are the main source for statistics on income distribution and social inclusion at the European level; see the web page of Statistics Austria. We use a linear regression model to analyse the effects of sociodemographic variables on the (log-transformed) annual income and aim at identifying levels of categorical covariates which account for income differences.
As potential regressors we consider the continuous covariate age (as linear and squared term) and the categorical predictors gender, citizenship, federal state of residence in Austria, highest education level a person achieved, the economic sector a person is working in, and the job function.
The economic sector is classified using the NACE classification scheme (statistical classification of economic activities in the European Community), whereas the job function is determined using a two-level scheme. Both classifications have a hierarchical structure, with 21 and 5 categories on the first level and 84 and 25 categories on the second level of aggregation, respectively. The categories of the two levels of the covariate job function are described in more detail in Table 9 in Appendix 1.4. Figure 4 shows the 95% HPD intervals for the level effects under a flat normal prior.
We fit regression models with prior specifications as described in Section 2, with fixed and random component variances and ν = 10, ..., 10⁶, and perform model selection as described in Section 3.3. To visualise the cluster solutions for different values of ν, the estimated effects of the (refitted) selected models for the variable job function are plotted in Figure 6.
Obviously, with decreasing spike variance the clustering of the level effects gets 'finer'. With a higher resolution of the effects (e.g. ν ≥ 10⁵) an interesting structure is revealed: as the levels are ordered by hierarchy of function within each contract type (see Table 9 in Appendix 1.4), effects are obviously fused across contract types. This structure would have been missed by using the coarser classification level; on the other hand, even for the very fine resolution with ν = 10⁶ the number of estimated effects is less than half of that in the full model.
With a hyperprior on the component variance ψ_j the selected models are very sparse and the number of effect clusters is almost constant, see Table 5. This is in agreement with the results from the simulation study, and the considerably higher values of BICmcmc indicate that a prior with fixed component variances should be preferred.
Concluding remarks

In this paper, we proposed to specify finite normal mixture priors on the level effects of a categorical predictor to obtain a sparse representation of these effects. We noted that, surprisingly, the specification of a hyperprior on the component variances did not work well. In contrast to the common clustering of data, we aim at clustering regression effects, which are not fixed but have to be estimated from the data. Assigning an effect to a mixture component corresponds to selecting a particular prior distribution for its estimation and hence has an impact on its estimated value.

We investigated two different model selection strategies: either to select the model sampled most frequently ('most') or to apply PAM to the matrix of posterior inclusion probabilities and select the final model using the silhouette coefficient ('pam'). Both strategies were shown to perform similarly. An advantage of the 'most' strategy is that also a one-group solution can be selected, which is not possible for the 'pam' strategy, but the latter is robust against the switching of single effects between groups.
The approach works well even if the number of categories is high, e.g. around 100.
For Gaussian response regression models the computational effort is low, as a standard Gibbs sampling scheme can be used for MCMC estimation. However, the method is not at all restricted to Gaussian regression models. It can easily be implemented as an "add-on" to an MCMC sampling scheme for any regression type model with a multivariate normal prior on the regression effects, as in each MCMC iteration only the steps for model-based clustering and the update of the prior parameters of the regression effects are required.
2. Sample the error variance σ² from its full conditional posterior distribution.

Model based clustering steps

Here, β̄_jl is the mean of all elements of β_j assigned to component l.

Silhouette coefficient
The silhouette coefficient in Rousseeuw (1987) is defined as follows. Let i be any object in the data set and let A be the cluster to which it has been assigned. If cluster A contains other objects apart from i, then a(i) is the average dissimilarity of i to all other objects of A. Further, d(i, C) is the average dissimilarity of i to all objects in a cluster C different from A. Compute d(i, C) for all clusters C ≠ A and set b(i) = min_{C ≠ A} d(i, C). The silhouette coefficient is then computed as

s(i) = (b(i) − a(i)) / max(a(i), b(i)).
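A direct implementation of this definition, assigning s(i) = 0 to singleton clusters (a common convention):

```python
import numpy as np

def silhouette_coefficient(D, labels):
    """Average silhouette value over all objects, from a symmetric
    distance matrix D and an integer label vector."""
    labels = np.asarray(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False
        if not own.any():            # singleton cluster: s(i) set to 0
            continue
        a = D[i, own].mean()
        b = min(D[i, labels == g].mean()
                for g in np.unique(labels) if g != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Two tight pairs: within-pair distance 0.1, between-pair distance 1.
D = np.array([[0.0, 0.1, 1.0, 1.0],
              [0.1, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.1],
              [1.0, 1.0, 0.1, 0.0]])
score = silhouette_coefficient(D, [0, 0, 1, 1])
```

For a one-cluster partition b(i) is undefined and the function fails, which mirrors the drawback of the 'pam' strategy noted in Section 3.3.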

Adjusted Rand index
The adjusted Rand index (Hubert and Arabie, 1985) is a form of the Rand index (Rand, 1971) which is adjusted for chance agreement. If n is the number of elements and X = {X_1, X_2, ..., X_r} and Y = {Y_1, Y_2, ..., Y_s} are two clusterings of these elements, the adjusted Rand index is defined as

ARI = [Σ_ij (n_ij choose 2) − Σ_i (a_i choose 2) Σ_j (b_j choose 2) / (n choose 2)] / [(1/2)(Σ_i (a_i choose 2) + Σ_j (b_j choose 2)) − Σ_i (a_i choose 2) Σ_j (b_j choose 2) / (n choose 2)],

where a_i and b_j are the numbers of objects in X_i and Y_j, respectively, and n_ij is the number of objects in X_i ∩ Y_j.
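The formula translates directly into code (it is undefined when both partitions put everything in one cluster, since the denominator is then zero):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(x, y):
    """Adjusted Rand index of two partitions given as label sequences."""
    n = len(x)
    # Contingency counts n_ij and the marginal counts a_i, b_j.
    sum_ij = sum(comb(v, 2) for v in Counter(zip(x, y)).values())
    sum_a = sum(comb(v, 2) for v in Counter(x).values())
    sum_b = sum(comb(v, 2) for v in Counter(y).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # relabelled copy
ari_orth = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # crossed split
```

The first call illustrates that the index ignores labelling; the second shows that partitions agreeing less than expected by chance give a negative value.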

Further simulation results
We report simulation results for covariates 1 to 3 of the simulation study in Tables 6 to 8. Table 9 describes the two-level classification scheme of the variable job function and the frequencies of the categories.