Tensor product P-splines using a sparse mixed model formulation

A new approach to representing P-splines as a mixed model is presented. The corresponding matrices are sparse, allowing the new approach to find the optimal values of the penalty parameters in a computationally efficient manner. While the new mixed model P-splines formulation is similar to the original P-splines, a key difference is that the fixed effects are modelled explicitly, and extra constraints are added to the random part of the model. An important feature ensuring that the entire computation is fast is a sparse implementation of Automated Differentiation of the Cholesky algorithm. It is shown by means of two examples that the new approach is fast compared to existing methods. The methodology has been implemented in the R-package LMMsolver, available on CRAN (https://CRAN.R-project.org/package=LMMsolver).


Introduction
Several approaches exist to represent tensor product P-splines as a mixed model; Piepho et al. (2022) give an overview of the different two-dimensional P-splines models.
Here, a new and efficient approach is outlined. In Boer et al. (2020) it was shown that the linear variance (LV) mixed model (Williams, 1986) is equivalent to P-splines with first-degree B-splines and first-order differences. This idea will be generalized here, to show that P-splines with kth-order difference penalties are equivalent to a class of mixed models. The extension to tensor product P-splines is relatively straightforward.
Another property of the LV model is that it is computationally attractive, as the precision matrix is sparse. It will be shown that the P-splines model is equivalent to a class of mixed models with a sparse precision matrix. In this article we will show that these sparse mixed model P-splines are computationally efficient.
The layout of the article is as follows. In Section 2 we discuss in detail the equivalence between the linear variance mixed model and a special case of P-splines. Section 3 gives a brief overview of B-splines, with some results that will be used to derive the connection between mixed models and P-splines. In Section 4 it is shown how the idea presented in Section 2 can be generalized to P-splines. Section 5 shows how this can be further extended to mixed model tensor product P-splines. In Section 6 an implementation of the sparse mixed model in the LMMsolver R-package is presented, and its computational efficiency is demonstrated by means of two examples. The article concludes with a brief discussion in Section 7. The R-scripts to reproduce the figures are available in the Supplementary Material.

Linear variance model and Whittaker smoother
The Whittaker smoother (Whittaker, 1923) and the Linear Variance (LV) mixed model introduced by Williams (1986) are equivalent, as shown by Boer et al. (2020). Here we use a new approach to show that the two models are equivalent, with the aim of developing a computationally efficient method to estimate the optimal penalty parameters using mixed models, a method that can be further extended to P-splines. The Whittaker smoother is a special case of P-splines (Frasso and Eilers, 2015); Whittaker (1923) originally used third-order differences. Here we use a simpler variant of Whittaker's proposal, with first-order differences.
Suppose there are n observations at equidistant points i = 1, 2, . . ., n, and let y_i be the observation at point i. The LV mixed model is given by

y_i = μ + u_i + e_i,  (2.1)

where μ is the intercept, and the parameters e_i represent the residual error and are assumed to be independent and identically distributed, e_i ~ N(0, σ_e^2), where σ_e^2 is the residual variance. The covariances between u_i and u_j are given by

cov(u_i, u_j) = σ_u^2 (α(n − 1) − |i − j|/2),  (2.2)

where α is a scaling factor, and σ_u^2 is the variance corresponding to the LV model. Whereas Williams (1986) used α = 1, here we will use α = 1/2, which simplifies the notation when showing the equivalence with the Whittaker smoother. As can be seen from equation (2.2), the covariance decreases linearly with the distance between two observation points i and j, and with α = 1/2 the covariance between the first and last observations is equal to zero.
In this article we will formulate mixed models in terms of the penalty and precision matrix (the inverse of the covariance matrix), which is a more modern representation (Lindgren et al., 2022). The parameter λ = σ_e^2/σ_u^2 is called the tuning or penalty parameter (Eilers and Marx, 1996; Hastie et al., 2009). From equations (7) and (A2) in Boer et al. (2020) it follows that the objective function to be minimized for equation (2.1) with respect to the parameters μ and u_i is given by

S_1(μ, u_1, . . ., u_n) = Σ_{i=1}^n (y_i − μ − u_i)^2 + λ [ Σ_{i=1}^{n−1} (u_{i+1} − u_i)^2 + (Σ_{i=1}^n c_i u_i)^2 ],  (2.3)

where the constants c_i are determined by the precision matrix of the LV model. To make the connection between the LV model and the Whittaker smoother, we define the following invertible linear transformation:

a_i = μ + u_i (i = 1, . . ., n),  η = Σ_{i=1}^n c_i u_i.  (2.4)

Substituting equation (2.4) into equation (2.3) gives

S_2(a_1, . . ., a_n, η) = Σ_{i=1}^n (y_i − a_i)^2 + λ Σ_{i=1}^{n−1} (a_{i+1} − a_i)^2 + λη^2.  (2.6)

Equation (2.6) can be decomposed as S_2(a_1, . . ., a_n, η) = S_3(a_1, . . ., a_n) + S_4(η), where

S_3(a_1, . . ., a_n) = Σ_{i=1}^n (y_i − a_i)^2 + λ Σ_{i=1}^{n−1} (a_{i+1} − a_i)^2  (2.7)

and S_4(η) = λη^2. To find the minimum of S_2, the functions S_3 and S_4 can be minimized separately.
The objective function S_4 has the trivial solution η = 0. Equation (2.7) is the objective function to be minimized for the Whittaker smoother with first-order differences. This shows that equations (2.3) and (2.7) are equivalent. The LV model is not the only mixed model equivalent to the Whittaker smoother: other values of c_i can also be chosen, provided that Σ_i c_i ≠ 0. An example is the Random Walk (RW) mixed model, with c_1 = 1 and c_i = 0 for i = 2, . . ., n, cf. equation (A3) in Boer et al. (2020).
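The decomposition above means that, for a given λ, the Whittaker fit itself reduces to solving the sparse tridiagonal system (I + λD'D)a = y, where D forms first-order differences. As a minimal illustration (a sketch in Python with scipy, not the paper's R implementation):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker(y, lam):
    """Whittaker smoother with a first-order difference penalty:
    minimizes sum_i (y_i - a_i)^2 + lam * sum_i (a_{i+1} - a_i)^2."""
    n = len(y)
    # D is the (n-1) x n first-order difference matrix
    D = sparse.diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1], shape=(n - 1, n))
    A = sparse.eye(n) + lam * (D.T @ D)   # sparse tridiagonal system matrix
    return spsolve(A.tocsc(), y)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(200)
fit = whittaker(y, lam=50.0)
```

The system matrix is tridiagonal, so the solve is linear in n; this is the computational advantage that the sparse mixed model formulation preserves in the more general settings below.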
The LV and the RW mixed models have a sparse precision matrix and can be solved in a computationally efficient way (Boer et al., 2020). This approach can be extended to show that there are mixed models, with sparse precision matrices, that are equivalent to P-splines. The method can be generalized further to tensor product P-splines, as shown in Section 5.

B-splines
In this section we give a brief overview of B-splines; for more details see, for example, De Boor (1978) and Lyche et al. (2018). Let the domain [x_min, x_max], with x_min < x_max and x_min, x_max ∈ R, be divided into n_seg intervals, each of length h = (x_max − x_min)/n_seg. The knots for the B-spline basis are given by

t_j = x_min + (j − p − 1)h,  j = 1, . . ., n_seg + 2p + 1,

where p is the degree of the B-spline functions, denoted by B_{j,p}(x). The B-spline functions have local support, that is, B_{j,p}(x) = 0 outside the interval [t_j, t_{j+p+1}). The number of B-splines is equal to q = n_seg + p. Let g(x) be a linear combination of the B-splines:

g(x) = Σ_{j=1}^q u_j B_{j,p}(x).

The kth-order derivative of g(x) can be expressed as follows (De Boor, 1978; Eilers and Marx, 1996):

d^k g(x)/dx^k = h^{−k} Σ_j Δ^k u_j B_{j,p−k}(x),  (3.3)

where Δ^k u_j is the kth-order difference, recursively defined as Δ^k u_j = Δ^{k−1}(Δu_j). Let f(x) be a (k − 1)-degree polynomial with coefficients β_0, . . ., β_{k−1}. The representation of f(x) in terms of B-splines is given by

f(x) = Σ_{i=0}^{k−1} β_i x^i = Σ_{j=1}^q ( Σ_{i=0}^{k−1} β_i ξ_{i,j,p} ) B_{j,p}(x),  (3.4)

where the ξ_{i,j,p} are constants; analytic expressions for ξ_{i,j,p} can be found in Lyche et al. (2018). Taking the kth-order derivative of equation (3.4) and using equation (3.3) gives

Δ^k ( Σ_{i=0}^{k−1} β_i ξ_{i,j,p} ) = 0,  (3.5)

since the kth derivative of a (k − 1)-degree polynomial vanishes. This result will be used to establish, in Sections 4 and 5, the equivalence of a class of mixed models with P-splines.
The idea of representing polynomials in terms of B-splines can be extended a bit further, and will be used to derive our main result in Section 4. Suppose we are interested in the interpolating (k − 1)-degree polynomial (Press et al., 2002, p. 123) defined by k pairs (x_{*,i}, y_{*,i}). For k = 1 we use x_{*,1} = (x_min + x_max)/2. For k > 1 the points x_{*,i} are taken equidistant on the interval [x_min, x_max]:

x_{*,i} = x_min + (i − 1)(x_max − x_min)/(k − 1),  i = 1, . . ., k.

Let B_* be the k × q matrix with (i, j)th entry B_{j,p}(x_{*,i}), and let G be the q × k matrix with (j, i)th entry ξ_{i−1,j,p}, so that the columns of G hold the B-spline coefficients of the monomials. The determinant of the k × k matrix B_*G can be evaluated using the properties of partitioned matrices (Harville, 1998, section 13.6); it is unequal to zero, since the points x_{*,i} are distinct.
Statistical Modelling 2023; 23(5-6): 465-479
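The polynomial-reproduction property can be made concrete numerically. The sketch below (an illustration in Python with scipy, not part of the paper's implementation) builds the equally spaced basis on [0, 1] and verifies that the B-spline coefficients of a straight line have vanishing second-order differences, the property used in Section 4 in the matrix form DG = 0:

```python
import numpy as np
from scipy.interpolate import BSpline

# Equally spaced knots on [xmin, xmax], extended p knots beyond each boundary
xmin, xmax, nseg, p = 0.0, 1.0, 10, 3
h = (xmax - xmin) / nseg
knots = xmin + h * np.arange(-p, nseg + p + 1)   # q + p + 1 knots in total
q = nseg + p                                     # number of B-splines

x = np.linspace(xmin, xmax, 101)
# Basis matrix B: column j evaluates B_{j,p} at the points x
B = np.column_stack([BSpline(knots, np.eye(q)[j], p)(x) for j in range(q)])

# The linear polynomial f(x) = 1 + 2x lies in the span of the basis; its
# B-spline coefficients are linear in the (equally spaced) Greville
# abscissae, so their second-order differences vanish.
coef, *_ = np.linalg.lstsq(B, 1.0 + 2.0 * x, rcond=None)
d2 = np.diff(coef, n=2)
```

The same check with kth-order differences and a degree-(k − 1) polynomial illustrates equation (3.5) for general k.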

A new formulation for P-splines
Using the B-spline properties described in the previous section, we can generalize the results from Section 2 to P-splines. First we introduce some notation. Let y = (y_1, y_2, . . ., y_n)' be the response variable, depending on the variable x = (x_1, x_2, . . ., x_n)' with x_i ∈ [x_min, x_max] for i = 1, . . ., n. Let B be the n × q matrix with (i, j)th entry B_{j,p}(x_i). The matrix X is defined by

X = BG.

In the next step we define a model equivalent to P-splines, following the same steps as in Section 2. To make this connection clearer, we use the same subscripts for the objective functions to be minimized, cf. equations (2.3), (2.6), and (2.7). The objective function S_1 to be minimized is given by

S_1(β, u) = ||y − Xβ − Bu||^2 + λ u'D'Du + λ u'B_*'B_*u,  (4.1)

where β = (β_0, . . ., β_{k−1})', u = (u_1, . . ., u_q)', and D is the (q − k) × q matrix forming kth-order differences. Next, we define the linear transformation

a = Gβ + u,  η = B_*u,  (4.2)

with a = (a_1, . . ., a_q)' and η = (η_1, . . ., η_k)'. This transformation is invertible since the determinant of B_*G is unequal to zero, as shown in Section 3. For the differences we have

Da = D(Gβ + u) = Du,  (4.3)

using DG = 0, which is equation (3.5) in matrix notation. By substitution of equation (4.2) into equation (4.1) and using equation (4.3) we obtain

S_2(a, η) = ||y − Ba||^2 + λ a'D'Da + λ η'η.  (4.4)

Equation (4.4) can be decomposed as S_2(a, η) = S_3(a) + S_4(η), where

S_3(a) = ||y − Ba||^2 + λ a'D'Da  (4.5)

is the objective function to be minimized for P-splines (Eilers and Marx, 1996), and S_4(η) = λη'η attains its minimum at η = 0. It follows that equations (4.1) and (4.5) are equivalent. The advantage of the new formulation is that we now have a computationally efficient formulation as a mixed model. This will be explained in more detail in Section 6.
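The equivalence argument can be checked numerically. The following sketch (Python; an illustration under the notation of this section, with the hypothetical choice k = 2, where the columns of G can be taken as the all-ones vector and the Greville abscissae, which represent the constant and linear monomials) solves both objective functions and confirms that the fitted values coincide and that B_*û = 0:

```python
import numpy as np
from scipy.interpolate import BSpline

xmin, xmax, nseg, p, k = 0.0, 1.0, 10, 3, 2
h = (xmax - xmin) / nseg
t = xmin + h * np.arange(-p, nseg + p + 1)
q = nseg + p

rng = np.random.default_rng(7)
x = rng.uniform(xmin, xmax, 150)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(150)

B = np.column_stack([BSpline(t, np.eye(q)[j], p)(x) for j in range(q)])
D = np.diff(np.eye(q), n=k, axis=0)               # (q-k) x q, k-th order differences
greville = np.array([t[j + 1: j + p + 1].mean() for j in range(q)])
G = np.column_stack([np.ones(q), greville])       # columns represent 1 and x; D @ G = 0
xstar = np.linspace(xmin, xmax, k)                # k equidistant points
Bstar = np.column_stack([BSpline(t, np.eye(q)[j], p)(xstar) for j in range(q)])
X = B @ G
lam = 1.0

# (a) classic P-splines: minimize ||y - B a||^2 + lam * ||D a||^2
a_hat = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)

# (b) new formulation: minimize ||y - X beta - B u||^2 + lam*||D u||^2 + lam*||Bstar u||^2
P = lam * (D.T @ D + Bstar.T @ Bstar)
C = np.block([[X.T @ X, X.T @ B], [B.T @ X, B.T @ B + P]])
rhs = np.concatenate([X.T @ y, B.T @ y])
sol = np.linalg.solve(C, rhs)
beta, u = sol[:k], sol[k:]
```

At the minimum, η̂ = B_*û = 0 and â = Gβ̂ + û, so the two fits agree.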

Extension to tensor product P-splines
In this section we extend the new formulation of P-splines to higher dimensions, generalising the notation used for the one-dimensional case. Let M be the number of dimensions, and x_m the covariate for dimension m. The matrix B_m is the n × q_m B-spline basis matrix corresponding to dimension m, D_m is a (q_m − k_m) × q_m difference matrix, G_m is a q_m × k_m matrix, and B_{*,m} is a k_m × q_m matrix. Let q = Π_{m=1}^M q_m be the total number of parameters, and k = Π_{m=1}^M k_m. The Kronecker product is denoted by ⊗ and the row-wise Kronecker product (Eilers and Marx, 2003) is denoted by □. The matrices B, B_* and G are defined by

B = B_1 □ B_2 □ . . . □ B_M,  B_* = B_{*,1} ⊗ . . . ⊗ B_{*,M},  G = G_1 ⊗ . . . ⊗ G_M.

The fixed part of the model is given by X = BG. For the difference penalty corresponding to dimension m we have

D̃_m = I_{q_1} ⊗ . . . ⊗ D_m ⊗ . . . ⊗ I_{q_M},

and for the product between G and D̃_m we obtain

D̃_m G = I_{q_1}G_1 ⊗ . . . ⊗ D_mG_m ⊗ . . . ⊗ I_{q_M}G_M = 0,

using D_mG_m = 0. Define the tensor product objective function S_1 to be minimized:

S_1(β, u) = ||y − Xβ − Bu||^2 + Σ_{m=1}^M λ_m u'D̃_m'D̃_mu + λ u'B_*'B_*u.  (5.1)

Using the linear transformation defined in equation (4.2) we obtain

S_2(a, η) = ||y − Ba||^2 + Σ_{m=1}^M λ_m a'D̃_m'D̃_ma + λ η'η.  (5.2)

Since S_2 can be decomposed as S_2(a, η) = S_3(a) + S_4(η) and S_4 attains its minimum at η = 0, we obtain the tensor product P-splines model introduced by Eilers and Marx (2003):

S_3(a) = ||y − Ba||^2 + Σ_{m=1}^M λ_m a'D̃_m'D̃_ma.  (5.3)

This shows that equations (5.1) and (5.3) are equivalent. In Section 6 we will present an efficient mixed model formulation of equation (5.1). Rodríguez-Álvarez et al. (2015) use another transformation of tensor product P-splines to a mixed model, which is based on the spectral decomposition of the difference penalty matrices D_m'D_m. For a detailed explanation of this mixed model representation for data on a multidimensional grid, see Section 6 in Currie et al. (2006). As will be shown in Section 6, the new approach presented in this article is considerably faster than the algorithm in Rodríguez-Álvarez et al. (2015).
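The row-wise Kronecker product and the per-dimension penalties are easy to state in code. A small sketch for M = 2 (Python; the column ordering, with the first dimension varying slowest, is an assumption of this illustration):

```python
import numpy as np

def row_kron(A, B):
    """Row-wise Kronecker product: row i of the result is kron(A[i, :], B[i, :])."""
    return (A[:, :, None] * B[:, None, :]).reshape(A.shape[0], -1)

rng = np.random.default_rng(0)
n, q1, q2 = 8, 4, 5
B1, B2 = rng.uniform(size=(n, q1)), rng.uniform(size=(n, q2))
B = row_kron(B1, B2)                       # n x (q1*q2), plays the role of B1 "box" B2

# Defining property: for u = kron(a1, a2), (B1 box B2) u = (B1 a1) * (B2 a2) elementwise
a1, a2 = rng.standard_normal(q1), rng.standard_normal(q2)
lhs = B @ np.kron(a1, a2)
rhs = (B1 @ a1) * (B2 @ a2)

# Per-dimension second-order difference penalties on the q1*q2 coefficients
D1 = np.diff(np.eye(q1), n=2, axis=0)
D2 = np.diff(np.eye(q2), n=2, axis=0)
P1 = np.kron(D1.T @ D1, np.eye(q2))        # penalty along dimension 1
P2 = np.kron(np.eye(q1), D2.T @ D2)        # penalty along dimension 2
```

The penalties P1 and P2 overlap on the same coefficient vector, which is exactly the "overlapping penalties" structure discussed in Section 6.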

The LMMsolver R-package
The LMMsolver R-package on CRAN is a general Linear Mixed Model (LMM) solver. The aim of the LMMsolver package is to provide an efficient and flexible system to estimate variance parameters using restricted maximum likelihood, or REML (Patterson and Thompson, 1971). The package was developed specifically for models where the mixed model equations are sparse, including the mixed model P-splines introduced in the previous sections. An example of the use of LMMsolver for a standard mixed model, allowing for heterogeneous residual errors, can be found in the R-package statgenMPP (Li et al., 2022). In this article we only give the mixed model for the tensor product P-splines, although extra fixed or random terms can be added to the mixed model in LMMsolver.

Solving the linear mixed equations
The penalized regression model defined by equation (5.1) can be formulated as a mixed model

y = Xβ + Zu + e,  (6.1)

with X = BG and Z = B, and where θ_0 = 1/σ_e^2 and θ_m = λ_m/σ_e^2 (m = 1, . . ., M) are precision parameters to be estimated. For M > 1, equation (6.1) is not a standard mixed model: it has overlapping penalties (Currie et al., 2006; Rodríguez-Álvarez et al., 2015; Rodríguez-Álvarez et al., 2019). The mixed model equations corresponding to equation (6.1) are defined by

[[θ_0 X'X, θ_0 X'Z], [θ_0 Z'X, θ_0 Z'Z + P(θ)]] (β̂', û')' = θ_0 (X'y, Z'y)',  (6.3)

where P(θ) is the precision matrix of the random part implied by the penalty in equation (5.1), with Σ_{m=1}^M θ_m D̃_m'D̃_m as its leading term. The matrix on the left-hand side of equation (6.3) is called the mixed model coefficient matrix and will be denoted by C. This matrix is sparse, and therefore equation (6.3) can be solved in a computationally efficient way using the Cholesky decomposition of C (Furrer and Sain, 2010).
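For fixed precision parameters, solving the mixed model equations is a single sparse linear solve. The sketch below is an illustration only (Python with scipy; scipy's sparse LU factorization is used as a stand-in for the sparse Cholesky factorization the paper describes, and the random Z is merely a placeholder for the sparse B-spline basis B):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

rng = np.random.default_rng(0)
n, q = 200, 15
X = np.column_stack([np.ones(n), rng.uniform(size=n)])   # fixed part (X = BG in the paper)
Z = rng.uniform(size=(n, q))                             # placeholder for the basis B
D = np.diff(np.eye(q), n=2, axis=0)
P = D.T @ D                                              # second-order difference penalty
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(n)

theta0, theta1 = 1.0, 10.0                               # precision parameters, held fixed here
# Mixed model coefficient matrix C, assembled in sparse format
C = sparse.csc_matrix(np.block([
    [theta0 * X.T @ X, theta0 * X.T @ Z],
    [theta0 * Z.T @ X, theta0 * Z.T @ Z + theta1 * P],
]))
rhs = theta0 * np.concatenate([X.T @ y, Z.T @ y])
sol = splu(C).solve(rhs)                                  # sparse factorize-and-solve
beta_hat, u_hat = sol[:2], sol[2:]
```

In the actual algorithm this solve is repeated inside the iterative updates of the θ parameters, so an efficient sparse factorization of C is reused at every iteration.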
Update equations for the precision parameters θ_i are derived in Appendix A; they depend on the residuals ê = y − Xβ̂ − Zû and on quantities ρ_i and τ_i defined in equation (6.5). The iterative algorithm defined by equations (6.3) and (6.4) is an extension of the modified Henderson algorithm described in Harville (1977).
An important element is to calculate ρ_i and τ_i, defined by equation (6.5), in an efficient way, avoiding the explicit calculation of matrix inverses such as C^{-1}, which are not sparse. One way to do this is to calculate the so-called sparse inverse (Takahashi, 1973; Misztal and Perez-Enciso, 1993; Johnson and Thompson, 1995; Lindgren et al., 2022). In LMMsolver another, but closely related, method was implemented, using Automated Differentiation of the Cholesky algorithm (Smith, 1995). Backward differentiation was implemented, which calculates the partial derivatives of the likelihood efficiently (Meyer and Smith, 1996; Smith, 1995). The chol() function in the spam package was used to permute the rows and columns of C in an optimal way; see Furrer and Sain (2010) for details. The automated differentiation was implemented using a supernodal Cholesky factorization (Ng and Peyton, 1993; Furrer and Sain, 2010). The implementation was written in C++ using the Rcpp package (Eddelbuettel and Balamuta, 2018).
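The key quantity behind both the sparse inverse and the differentiated Cholesky factorization is the derivative of the log-determinant term of the REML log-likelihood, ∂ log det C(θ)/∂θ_m = tr(C^{-1} ∂C/∂θ_m). Because ∂C/∂θ_m shares the sparsity pattern of C, only the entries of C^{-1} on that pattern are needed. A small dense numerical check of the identity (Python; illustrative only, not the supernodal sparse implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
q = 20
A0 = rng.standard_normal((q, q))
A0 = A0 @ A0.T + q * np.eye(q)                  # fixed positive definite part
P1 = np.diff(np.eye(q), n=2, axis=0)
P1 = P1.T @ P1                                  # penalty component multiplied by theta

def logdet(theta):
    # log-determinant via the Cholesky factor: 2 * sum(log(diag(L)))
    L = np.linalg.cholesky(A0 + theta * P1)
    return 2.0 * np.sum(np.log(np.diag(L)))

theta = 0.7
C = A0 + theta * P1
analytic = np.trace(np.linalg.solve(C, P1))     # tr(C^{-1} dC/dtheta)
eps = 1e-6
numeric = (logdet(theta + eps) - logdet(theta - eps)) / (2 * eps)
```

Backward differentiation of the Cholesky algorithm produces the same derivative directly from the factor, without ever forming the dense C^{-1}.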

Example 1: United States precipitation data set
In this section we compare the performance of LMMsolver with the R-packages mgcv (Wood, 2017) and SOP (Rodríguez-Álvarez et al., 2019). The method in the SOP package uses the same tensor product P-splines model as LMMsolver. For a detailed comparison between tensor product P-splines in mgcv and SOP, see Rodríguez-Álvarez et al. (2015). We use two-dimensional P-splines for the USA precipitation data set (Rodríguez-Álvarez et al., 2015). There are n = 5,906 observations, with longitude-latitude positions of monitoring stations, total precipitation in millimeters, and a standardization of this raw observation for April 1948. This data set can be found in the R-package spam (Furrer and Sain, 2010), under the name USprecip.
All computations were performed in R 4.2.3 (R Core Team, 2023) on a 2.90 GHz Intel Core i5-9400 CPU with 24 GB of RAM and a Windows 10 operating system. Version 1.8.42 of mgcv, version 1.0 of SOP, and version 1.0.5 of LMMsolver were used. For mgcv we used the bam() function, with method='fREML'. Cubic B-splines with second-order differences were used for both latitude and longitude. The differences between the estimated effective dimensions (Rodríguez-Álvarez et al., 2015) for the SOP and LMMsolver packages were less than 5 × 10^{-3}. We compared the computation times for different numbers of segments, using the same number in both dimensions.
Figure 1A shows the sparse structure of the mixed model coefficient matrix corresponding to equation (5.1) for 40 segments.
In Figure 1B the computation time is compared for different numbers of segments, showing that LMMsolver is much faster than mgcv and SOP. For example, with 40 segments the computation time for mgcv is 600 seconds (= 10 minutes), for SOP 140 seconds, and for LMMsolver only one second. The difference in computation time between mgcv and SOP is a factor of 4, consistent with the results in Rodríguez-Álvarez et al. (2015).
There are two main reasons why LMMsolver is fast for this two-dimensional mixed model P-splines example. First, it retains the sparse structure of the original two-dimensional P-splines as in Eilers and Marx (2003). Second, the efficient implementation of Automated Differentiation of the Cholesky algorithm in LMMsolver avoids calculating the matrix inverses otherwise needed for the derivatives of the REML log-likelihood (Smith, 1995; Meyer and Smith, 1996).

Example 2: Three-dimensional simulated data
In this section we use a simulated three-dimensional example to illustrate that LMMsolver can analyze large data sets in an efficient way. One hundred thousand (n = 100,000) values of the covariates z_1, z_2, z_3 were simulated independently from a uniform distribution on the interval [0, 1], and the response was generated from a smooth test function of (z_1, z_2, z_3) (Wood, 2006; Rodríguez-Álvarez et al., 2015), with residual errors distributed N(0, 1). Figure 2A shows the structure of the mixed model coefficient matrix C, which is less sparse than in the two-dimensional case shown in Figure 1A. Therefore the difference in computation time between LMMsolver and the packages mgcv and SOP is smaller than in the two-dimensional case, but still substantial: if ten segments are used in each direction, the computation time for mgcv is 38 minutes, for SOP 10 minutes, and for LMMsolver half a minute.

Discussion
In this article we have shown that there is a mixed model formulation for P-splines that adheres closely to the original proposal by Eilers and Marx (1996). The main difference is that in the new approach there is an explicit term for the fixed part, with an additional term for the penalty. An important feature is that the precision matrices and the mixed model equations are sparse, so the sparse matrix algebra developed for mixed models can be used. An additional advantage of the new approach is that the mixed model P-splines formulation can be further extended to Generalized Additive Models (GAM; Hastie et al., 2009).

Figure 1
Figure 1 Panel A: The mixed model coefficient matrix C for the new approach for the US precipitation example with 40 segments for both latitude and longitude. The non-zeros are indicated in red, showing that C is sparse. Panel B: Computation time on a logarithmic scale for the three methods as a function of the number of segments in both dimensions: LMMsolver is much faster, especially if the number of segments is high.

Figure 2
Figure 2 Panel A: The mixed model coefficient matrix C for the new approach for the three-dimensional simulated data with 5 segments for each dimension. The non-zeros are indicated in red. Panel B: Computation time on a logarithmic scale for LMMsolver, mgcv, and SOP as a function of the number of segments in each dimension.