Poisson-Tweedie mixed-effects model: a flexible approach for the analysis of longitudinal RNA-seq data

We present a new modelling approach for longitudinal count data that is motivated by the increasing availability of longitudinal RNA-sequencing experiments. The distribution of RNA-seq counts typically exhibits overdispersion, zero-inflation and heavy tails; moreover, in longitudinal designs repeated measurements from the same subject are typically (positively) correlated. We propose a generalized linear mixed model based on the Poisson-Tweedie distribution that can flexibly handle each of the aforementioned features of longitudinal RNA-seq counts. We develop a computational approach to accurately evaluate the likelihood of the proposed model and to perform maximum likelihood estimation. Our approach is implemented in the R package ptmixed, which can be freely downloaded from CRAN. We assess the performance of ptmixed on simulated data and we present an application to a dataset with longitudinal RNA-sequencing measurements from healthy and dystrophic mice. The applicability of the Poisson-Tweedie mixed-effects model is not restricted to longitudinal RNA-seq data, but it extends to any scenario where non-independent measurements of a discrete overdispersed response variable are available.


Introduction
In the last decade, RNA-sequencing (RNA-seq, Mortazavi et al. 2008) has progressively replaced microarrays and has become the standard method to measure gene expression.
A typical goal of RNA-seq studies is the identification of genes that are differentially expressed across groups of subjects, or are associated with a variable of interest. To this end, a statistical model is postulated to relate the expression levels of a gene to the variable(s) of interests and relevant confounders and hypothesis testing is performed to assess the significance of the relevant differences or associations. While microarrays yielded continuous measurements that could be analysed using models that relied on normality assumptions (Smyth, 2004), RNA-seq measurents are discrete and require statistical models more suitable for the analysis of count data. While the very first attempts to model RNA-seq data relied on the use of the Poisson distribution (Marioni et al., 2008), it was soon observed that the distribution of RNA-seq counts is typically characterized by phenomena such as overdispersion (the variation of counts is higher than the mean), zero-inflation and heavy tails, that the Poisson distribution cannot account for. To take overdispersion into account,  and Love et al. (2014) proposed two statistical models based on the negative binomial (NB) distribution, implemented in the R packages edgeR and DESeq2. To deal with small sample sizes, which were a very common feature of the first cross-sectional RNA-seq studies, they proposed to estimate the dispersion parameter of the NB distribution by borrowing information across genes. Esnaola et al. (2013) tested the adequacy of the NB distribution on data from a large-scale RNA-seq experiment. They showed that although the NB fitted sufficiently well the empirical distribution of counts for many genes, such distribution was inadequate for a relevant proportion of genes that exhibit strong zeroinflation or heavy tails. Therefore, they proposed a more flexible generalized linear model (GLM) based on the Poisson-Tweedie (PT) distribution, implemented in the R packages tweeDEseq. The PT is a three-parameter distribution that encompasses the NB as special case, and differently from the NB it can accommodate different levels of zero-inflation and heavy-tailness for a given amount of overdispersion. A different approach, called limma-voom, was proposed by Law et al. (2014), who combined a linear model for the log-transformed counts with sample-specific precision weights, making it possible to apply to RNA-seq measurements a pipeline originally developed for microarray data. At the beginning of the diffusion of RNA-seq, the high sequencing costs discouraged the setup of longitudinal RNA-seq studies; for this reason, limma-voom, edgeR, DE-Seq2 and tweeDEseq were developed to deal with cross-sectional RNA-seq datasets. However, interest in the dynamic evolution of gene expression and declining sequencing costs are nowadays motivating more longitudinal studies. One such example is our mice experiment carried out at the Leiden University Medical Center (LUMC) that aimed to identify biomarkers of disease progression for Duchenne Muscular Dystrophy (DMD). In this experiment, described more in detail in Section 4, blood samples were collected from healthy and dystrophic mice every 6 weeks over a period of 30 weeks, and gene expression in blood was quantified using RNA-seq. In longitudinal studies, repeated measurements are collected from the same subject at several time points. This introduces correlation between the repeated measurements from the same subject, a correlation that is not present in cross-sectional designs where all samples can be assumed to be independent. Nevertheless, often longitudinal RNAseq data are analysed using popular approaches (mostly edgeR, DESeq2 and limmavoom) for cross-sectional RNA-seq data, either ignoring the longitudinal nature of the data and assuming samples from the same individual to be independent, or adding for each individual a subject-specific indicator as a covariate (fixed effect). Cui et al. (2016) showed that modelling longitudinal RNA-seq data using such approaches results into increased false positive rates (FPR), or in a loss of power when the FPR is controlled. They argued that longitudinal RNA-seq data should instead be analyzed using generalized linear mixed models (GLMMs, McCulloch et al., 2008) and they illustrated their point by comparing those "improper" modelling approaches to a Poisson GLMM. They motivated their choice to use the Poisson distribution instead of a more flexible alternative based on frequent convergence problems that they encountered when attempting to fit a negative binomial GLMM with random intercept. Although their choice of a Poisson GLMM simplifies model estimation, it is clear that such a model cannot account for the different combinations of overdispersion, zero-inflation and heavy tails typical of RNA-seq counts. Therefore, the use of GLMMs based on more flexible distributions for count data, such as the NB and PT distributions, should be preferred even if their estimation is more involved. Recent proposals tailored to the analysis of longitudinal RNA-seq data include MaSig-Pro (Nueda et al., 2014), ShrinkBayes (van de Wiel et al., 2014), DyNB (Äijö et al., 2014 and timeSeq (Sun et al., 2016). MaSigPro fits a NB GLM that lets the expected value of a gene depend on time polynomials and their interaction with a group indicator; similarly to the methods for cross-sectional RNA-seq, this approach relies on a simple GLM that does not model properly the repeated measurements design. ShrinkBayes can fit Poisson, NB, zero-inflated Poisson (ZIP) and zero-inflated NB (ZINB) GLMMs that can accommodate a random intercept. Owing to the possibility to choose the ZIP and the ZINB as distributions, this approach can be employed to model both zero-inflation and the correlation between repeated measurements. However, it does not allow to fit GLMMs for heavy-tailed distributions. Moreover, the method requires the user to prespecify one specific distributional shape that is fitted to all genes, rather than letting the estimation procedure determine it automatically and separately for each gene as in the case of the PT family of models proposed by Esnaola et al. (2013). Finally, often interest may lie in assessing the joint significance of two or more coefficients, for example that of a main effect and of the corresponding interaction term, or in testing a system of linear combinations of the regression coefficients (an example of this type of hypotheses can be found in Section 4). However, the R package ShrinkBayes allows users to test only hypotheses that involve a single regression coefficient (although with some additional programming one may still be able test more complex hypotheses). DyNB fits a semiparametric model that combines a NB GLM with Gaussian-processes regression. Finally, timeSeq considers a NB GLMM where time is modelled using splines. Both DyNB and timeSeq are semiparametric and restricted to the NB case, thus they do not attempt to model zero-inflation and heavy tails. In this paper we present ptmixed, a Poisson-Tweedie GLMM that is suitable for the analysis of longitudinal RNA-seq data and that, more in general, can be employed to model discrete, longitudinal outcomes that are overdispersed, zero-inflated and/or heavy tailed. Owing to the use of the PT distribution, our approach is capable to capture different combinations of zero-inflation and heavy-tails that typically characterize the distribution of overdispersed counts and it automatically performs model selection among a range of different discrete distributions separately for each gene, similarly to tweeDEseq (Esnaola et al., 2013). Moreover, our mixed-effects approach allows to properly model within-subject correlations (Cui et al., 2016) and to flexibly test both one-dimensional and multi-dimensional statistical hypotheses. To the best of our knowledge, our proposal is the first extension of the PT GLM of Esnaola et al. (2013) to the context of random effects models, while a different extension based on generalized estimating equations has been proposed by Bonat et al. (2018). We provide an R implementation of functions for the estimation the proposed model and related hypothesis testing through the R package ptmixed (Signorelli et al., 2020), which can be freely downloaded from the Comprehensive R Archive Network (CRAN). The remainder of the article is organized as follows: in Section 2 we introduce our model, describe the use of the adaptive Gauss Hermite quadrature rule to approximate its likelihood, present our computational approach to maximum likelihood estimation and discuss the use of Wald and likelihood ratio tests for hypothesis testing. In Section 3 we assess the performance of our method using simulated data, and we compare it to several alternative methods that are commonly used in the analysis of longitudinal RNA-seq studies. In Section 4 we illustrate an application of ptmixed to the data from our motivating experiment. We conclude by discussing advantages and limitations of our approach in Section 5.

Filtering and normalization of RNA-seq data
The analysis of RNA-seq counts requires careful consideration of two pre-processing steps: filtering and normalization. First, because the expression profiles of lowly expressed genes are typically very noisy and strongly subject to measurement error, those genes are usually filtered out and excluded from the statistical analysis (Sha et al., 2015). Filtering allows to focus on those genes for which sufficient information is available and to mitigate the severity of multiple testing corrections. Furthermore, gene expression profiles are normalized to remove the effect of systematic technical biases due to the specific technology used to measure gene expression. Hereafter we assume that lowly-expressed genes have been preliminarly filtered out, and that a vector of sample-specific normalization weights has been obtained using one of the available RNA-seq normalization methods, such as for example the Trimmed Mean of M values (TMM) method of Robinson and Oshlack (2010).

Poisson-Tweedie generalized linear mixed model
We consider a longitudinal design in which j = 1, ..., m i repeated measurements on g = 1, ..., G genes are available for i = 1, ..., n individuals or subjects. The Poisson-Tweedie mixed-effects model assumes that conditionally on a vector of subject-specific random effects v g = (v g1 , ..., v gn ), the expression value of gene g measured in the j-th sample from individual i, Y gij , follows a PT distribution with mean µ gij , dispersion D g and power a g : where the mean expression profile µ gij depends on a vector of fixed effects x ij , a vector of random effects z ij and on an offset term o ij that corresponds to the logarithm of the sample-specific normalization weights discussed in Section 2.1. x ij will typically comprise both covariates of interests and relevant confounders. Although in principle it is possible to consider any random effect structure for z ij , hereafter we focus on the case in which z ij = 1 in (1), so that v gi i.i.d.
∼ N (0, σ 2 g ) denotes a subject-specific random intercept. The PT distribution (El-Shaarawi et al., 2011), also known as generalized NB distribution (Gerber, 1991), is a very flexible three-parameter distribution for discrete data that can be characterized in terms of a mean µ > 0, dispersion D ≥ 1 and power a ≤ 1. It encompasses several discrete random variables as special cases, including the Poisson (a = 1, D = 1), Poisson inverse Gaussian (a = 0.5), NB (a = 0), Pólia-Aepply (a = −1) and Neyman type A (a → −∞) distributions. El-Shaarawi et al. (2011) pointed out that the PT is particularly useful to account for extra zero-inflation or heavy tails compared to the NB distribution. In Figure 1 of Supplementary File 1 we provide a graphical illustration of the flexibility of the PT, comparing the probability mass function (pmf) of the PT distribution for different values of D and a. When D is close to the boundary value 1, the PT pmf is very similar to that of a Poisson distribution irrespective of the value of a. For larger values of D, the difference between pmfs becomes more noticeable: zero-inflation arises with negative values of a, while heavy tails are obtained when a is positive. Esnaola et al. (2013) proposed to exploit the flexibility of the PT in the analysis of cross-sectional RNA-seq data and introduced the PT generalized linear model (GLM), highlighting how the PT power parameter allows to perform automatic model selection from a wide range of different discrete distributions. They showed that for several genes the distribution of RNA-seq counts cannot be assumed to be NB, and that in those cases the use of the PT distribution allows to achieve a better model fit, to improve the estimation of the dispersion parameter and to control the type I error of tests on differences in mean between two groups already with moderate sample sizes. In this article we extend the model of Esnaola et al. (2013) by introducing the Poisson-Tweedie GLMM of equation (1), thus extending the applicability of the PT class of models to the analysis of longitudinal data.

Maximum likelihood estimation
Maximum likelihood (ML) estimation of mixed models is usually carried out by maximizing the marginal likelihood whose computation requires the integration of the joint likelihood of the response y g and the random effects v g over the distribution of the random effects. While for the linear mixed model it is possible to perform this integration analytically, for more complex GLMMs computation of the marginal likelihood requires a numeric approximation of the integrals over the random effects (we will discuss this in Section 2.3.1). The marginal likelihood of the PT mixed model with random intercept can be written as where f (z) = f (z β g , D g , a g ) denotes the density of the Tweedie distribution. If a more complex random effects structure is specified, then the unidimensional integrals in dv gi in (2) have to be replaced with multidimensional ones. A major complication with the computation of (2) is that because the pmf of the PT distribution does not have a closed-form expression, the computation of the marginal likelihood requires numeric evaluation for both of the integrals in (2). As a consequence, ML estimation requires the implementation of a combination of numerical integration and optimization techniques that we illustrate in Sections 2.3.1 and 2.3.2.

Evaluation of the likelihood function
To compute the marginal likelihood in (2) we first need to evaluate numerically the pmf ∫ ∞ 0 z y gij e −z y gij ! f (z)dz of the PT distribution, and then to carry out a numeric approximation of the integrals with respect to the random effects. We base the numeric evaluation of the PT pmf on the recursive formula proposed by El-Shaarawi et al. (2011), resorting in particular to its implementation in the R package tweeDEseq (Esnaola et al., 2013). Then, we implement the adaptive Gauss-Hermite quadrature (AGHQ) method (Pinheiro and Chao, 2006) to approximate the improper integrals associated to the random effects.
To illustrate how the AGHQ method works, we begin by explaining the simpler Gauss-Hermite quadrature (GHQ) method. Let where (w 1 , ..., w k ) is a vector of known weights (Steen et al., 1969). The quality of the approximation in (3), however, depends on whether the mass of q(x) is concentrated around 0; for this reason, in the AGHQ method the numerical integration is instead performed through an adaptive algorithm whereby one first obtains estimates of the modev of q(v gi ) and ofŝ 2 = −[d 2 log q(v) dv 2 gi ] −1 , and then performs the change of variable When k = 1, the AGHQ is equivalent to the Laplace approximation (LA), which is often employed to approximate the likelihood of GLMMs. Our choice to employ the AGHQ method is motivated by the fact that approximating the likelihood of GLMMs with the simpler but less accurate LA can result in frequent convergence problems during ML estimation. This problem is particularly frequent for mixed-effects extensions of the Poisson GLMM, such as for example the NB GLMM, and it is commonly encountered when analysing longitudinal RNA-seq counts. For example, Cui et al. (2016) attempted to fit NB GLMMs to a dataset with longitudinal RNA-seq measurements by employing glmmADMB (Fournier et al., 2012), an R package where the likelihood of GLMMs is approximated using the LA, and reported frequent convergence problems that induced them to fit the simpler Poisson GLMM instead of the NB GLMM. However, such convergence problems can often be solved if the likelihood function is approximated using an AGHQ with a suitable number of quadrature points instead of the LA. Moreover, ML estimates of the variance components obtained with the LA have been shown to suffer from severe bias, which vanishes if an AGHQ approximation with a sufficient number of quadrature points is used (Pinheiro and Chao, 2006). Overall, although computationally more intensive than the LA, the AGHQ produces more precise evaluations of the likelihood, reduces convergence issues, yields more accurate parameter estimates and allows a better type I error control. In Section 3.2 we will provide a demonstration of the gain associated to the use of the AGHQ in place of the LA.

Likelihood maximization
Estimation of the PT mixed model can be performed using the function ptmixed() available in the R package ptmixed (Signorelli et al. 2020). This function maximizes the marginal likelihood of the model using the computational approach outlined below, and it allows a certain degree of flexibility in the choice of the number of quadrature points, starting points and maximum number of iterations, as well as of the convergence criterion.
A critical point in the efficiency and success rate of the maximization is the choice of good starting values. As initial values for β and σ 2 we employ the maximum likelihood estimates of those parameters from a negative binomial GLMM or, when its estimation fails, from a Poisson GLMM; for the dispersion parameter D we consider either a method of moments estimate, or the ML estimator of D obtained by fitting a negative binomial GLMM through the function nbmixed(), that estimates the simpler NB GLMM. For the power parameter a, we use the method of moment estimator of a for the density of the PT distribution. Numeric maximization of the likelihood function in (2) is carried out using the Nelder-Mead (NM, Nelder and Mead 1965) algorithm (as implemented in the function optim()); when the NM method fails, a second attempt is made using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. Because NM may require to evaluate the likelihood function several thousands of times, ptmixed offers the possibility to either update the position of the quadrature points at every iteration of the NM algorithm, or after every k iterations. In practice, maximization of the likelihood when the variance of the random intercept is very small may fail irrespective of the starting values and algorithm used; for this reason, when the starting value for σ 2 < ε we remove the random intercept and fit a PT GLM (we set ε = 0.001 as default in ptmixed()).

Inference
We base the evaluation of the standard errors of the parameter estimates on the asymptotic distribution of the ML estimator (El-Shaarawi et al., 2011), and use the Wald and the likelihood ratio tests for hypotheses testing. Ifθ is the ML estimate of θ = (β, D, a, σ 2 ), then the asymptotic distribution ofθ is multivariate normal with mean θ and variance V =V ar(θ) = I −1 (θ), where I(θ) denotes the observed Fisher information evaluated at θ =θ (McCulloch et al., 2008). Hypotheses that involve linear combinations of regression coefficients can be expressed as H 0 ∶ K T β = K T b 0 and tested with the Wald test statistic Hypotheses that can be expressed in the form H 0 ∶ θ ∈ Θ 0 , where Θ 0 is a subset of the parameter space Θ, can be verified with the the likelihood ratio test (LRT) statistic t = 2 log L(θ)− 2 log L(θ 0 ), whereθ 0 = argmin θ∈Θ 0 L(θ). When the null hypothesis does not involve testing values of θ on the boundary of Θ, the asymptotic distribution of t under H 0 is χ 2 d , where d is the number of restrictions implied by H 0 . If, instead, the hypothesis involves values on the boundary of the parameter space, the asymptotic distribution of t is not χ 2 in general; however, it is possible to show that in certain cases, such as for example when H 0 ∶ σ 2 = 0, the asymptotic distribution of t is a mixture of χ 2 distributions (Molenberghs and Verbeke, 2007). We have implemented the computational approach for ML estimation outlined in Section 2.3 in the R package ptmixed that is freely available from CRAN (Signorelli et al., 2020). The package comprises functions for the estimation of the Poisson-Tweedie GLMM and of simpler models such as the Poisson-Tweedie GLM and the negative binomial GLM and GLMM. Table 1 provides an overview of the functions currently available in the package; these functions can be used for data visualization, to fit the PT GLMM, NB GLMM, PT GLM and NB GLM, to summarize the results of each model fit, to perform hypothesis testing and, for the GLMMs, to obtain the best linear unbiased predictor (BLUP) of the random effects.

Simulations
In this section we assess the performance of ptmixed through several simulations. We begin by evaluating the accuracy of our approach for ML estimation and hypothesis  testing across a range of different distributional shapes (from zero-inflated to heavytailed) and sample sizes (Section 3.1). In Section 3.2 we show the added value of approximating the likelihood function with the AGHQ method by comparing it to the LA. Finally, in Section 3.3 we compare the performance of our method to that of several state of the art approaches for the analysis of RNA-seq data. R scripts to reproduce the results presented here are available at https://github.com/m-signo/ptmixed.

Accuracy of ML estimates and hypothesis testing
To assess the accuracy of the ML estimates and of the Wald and likelihood ratio tests introduced in Section 2.4, we simulated data from the PT mixed model in (1) considering different types of discrete distributions: zero-inflated (a = −5, −1), NB (a = 0), heavy-tailed (a = 0.5) and Poisson (a = 1). We fixed the dispersion parameter D = 3 (except for the case a = 1, where D = 1) and the number of repeated measurements per subject m i = 5 ∀i ∈ {1, ..., n}, and considered an increasing number of subjects n ∈ {10, 25, 50, 100, 150}. We splitted individuals into two groups, denoted by d i ∈ {0, 1}, of equal size. We let the mean µ ij depend on a subject-specific random intercept v i , on group d i and time t i in such a way that log(µ ij ) = β 0 + In simulation A we let β = (2.5, 0.3, 0) and test whether the time effect is null, i.e. H 0 ∶ β 2 = 0. Table 2 reports the root mean square error (RMSE) ofβ 2 for different values of a and n. We can observe that the magnitude of the RMSE is comparable across different values of a, and that it decreases linearly with √ n as expected. Table  3 compares the false positive rate (FPR) at α = 0.05 between the Wald test and LRT for each of the distributions considered and increasing sample size; we observe that the two tests exhibit similar performance, producing an FPR close to the nominal type I error (α = 0.05). In simulation B, instead, we let β = (2.5, 0, 0.2) and test whether the group coefficient is null, i.e. H 0 ∶ β 1 = 0. Also in this case we see that the RMSE of the parameter of interest (hereβ 1 ) is of comparable magnitude across different values of a, and that as expected it decreases linearly with √ n ( Table 4). Assessment of the FPR at α = 0.05 (Table 5) points out that for very small sample sizes (namely n = 10) both the Wald

Comparison between the AGHQ and Laplace approximations of the likelihood
To quantify the improvement associated to the use of the AHGQ in the approximation of the likelihood of our model, we illustrate a comparison of its performance to that of the LA, focusing on the case with a = −1 and n = 100 considered in simulation B (for the description of the settings of this simulation, see Section 3.1). Table 6 compares the mean of the ML estimates obtained when using the LA or an AGHQ with 5 quadrature points. The results show that even for n = 100 the LA can yield rather biased parameter estimates, in particular for D, a and σ 2 . Use of the AGHQ makes it possible to obtain more accurate parameter estimates for all parameters. Importantly, use of the AGHQ produced considerably higher convergence rates (87.5% versus the 65.2% of the LA) using the default starting values computed by ptmixed (thus reducing the risk that one may need to try different starting points to ensure convergence). Furthermore, the AGHQ reduced the proportion of false positives (5.6% versus 7% for the LA, α = 0.05) enabling a better type I error control. All these performance improvements, which are due to the fact that the use of more quadrature points allow a better approximation of the likelihood function, typically required a 12% increase in computing time with respect to the LA (Table 7).

Comparison with alternative modelling approaches
In this Section we describe the results of two simulation studies (C and D) designed to compare ptmixed to several state of the art approaches that are typically employed in the analysis of longitudinal RNA-seq data. Contrary to Section 3.1, where we assessed the performance of our method in parameter estimation and testing for an individual gene, here we evaluate the overall performance of each method on simulated datasets comprising G = 500 genes each; the reason for this is that some of the methods included in the comparison (DESeq2, edgeR and limma-voom) carry out parameter estimation by borrowing information across different genes, and therefore they require information from all genes to estimate and test the parameters associated to each individual gene. We have included in this comparison approaches that (i) treat repeated measurements from the same individual as independent, (ii) include the subject identifiers as fixed effects into the model, and (iii) use random effects to model the correlation induced by the repeated measurements design. All methods included in the comparison are listed in Table 8, where we summarize the type of model used and the way in which they handle the repeated measurements design. All methods were estimated including group and time as main effects, with the exception of MaSigPro where also an interaction term was included (in the R package MaSigPro the interaction term is automatically included and cannot be suppressed). As concerns hypothesis testing, for ptmixed we assessed the performance both of the Wald test and of the LRT introduced in Section 2.4. We simulate datasets comprising gene expression values of G = 500 genes for a sample of n subjects who are splitted into two groups of equal size. We assume that m i = 5 repeated measurements are available for each subject. The data are generated from model (1) assuming that where d i ∈ {0, 1} is the group indicator, t ij ∈ {0, 1, ..., 4} represents time, o ij ∼ N (0, 0.5 2 ) is an offset term whose purpose is to mimic the presence of sample-specific normalization weights, v gi ∼ N (0, σ 2 g ) with σ 2 g ∼ U (0.2, 0.8) is a random intercept, and the dispersion parameter D follows a translated gamma distribution, D − 1 ∼ Γ(α = 2, β = 1). Consistently with the results of Esnaola et al. (2013), who showed that the hypothesis that the distribution of RNA-seq counts is NB cannot be rejected for most genes, but that at the same time the distribution is not NB for a relevant proportion of genes, we let the distribution of Y gij to be NB (a g = 0) for 300 genes, zero-inflated (a g ∼ U (−10, −1)) for 100 genes and heavy-tailed (a g ∼ U (0.3, 0.7)) for the remaining 100 genes. In simulation C we compare the performance of each method in testing if a gene is upor down-regulated over time, i.e., β g2 = 0. We let β g0 ∼ N (3, 0.5), β g1 ∼ N (0, 0.3), and we assume that 400 genes are not dysregulated over time (i.e., β g2 = 0), 50 genes are upregulated with β g2 ∼ U (0.1, 0.5) and 50 genes downregulated with β g2 ∼ U (−0.5, −0.1). In simulation D, instead, we check the performance of each method in testing if a gene is differentially expressed (DE) between the two groups, i.e., β g1 = 0. We let β g0 ∼ N (3, 0.5), β g2 ∼ U (−0.1, 0.1), and we assume that 400 genes are not DE (i.e., β g1 = 0), 50 genes are up-regulated with β g1 ∼ U (0.5, 1) and 50 genes downregulated with β g1 ∼ U (−1, −0.5). First, we assess the precision of the parameter estimates obtained by the different methods by comparing their RMSEs. We report the RMSE for all methods with the exception of MaSigPro, for which the computation was not possible because this method reports parameter estimates only for the significant genes, and not for all genes. Then, in order to evaluate the extent to which each method controls the type I error we compute the False Positive Rate (FPR), which is the proportion of truly non-DE genes that are declared as DE by the test at a fixed significance level α. Furthermore, for the methods whose FPR appears to be under control we compare the True Positive Rate (TPR), which is the proportion of truly DE genes that are correctly declared as DE by the test at level α. Finally, we also present a comparison of the computing times of each method. The results of all methods were corrected for multiple testing. We employed the Benjamini-Hochberg (Benjamini and Hochberg, 1995) correction for all methods list in Table 8, with the exception of ShrinkBayes, for which we employed the Bayesian FDR suggested in van de Wiel et al. (2014). Results for each simulation and sample size are based on 50 random replications (i.e., 50 datasets comprising 500 genes each). Table 9 compares RM SE(β g2 ) in simulation C, where interest lies in the estimation and testing of the time effect β g2 , which is a within-subjects effect. In this simulation, the method with the lowest RMSEs is ShrinkBayes, closely followed by tweeDEseq when n = 10 and n = 20, and by ptmixed when n = 20 and n = 40. The remaining six methods are associated to larger RMSEs across all considered sample sizes. Table  10 compares RM SE(β g1 ) in simulation D, where interest lies in the estimation and testing of the group effect β g1 , which is a between-subjects effect. Here, ptmixed is by far the method with the lowest RMSEs. The RMSEs of ShrinkBayes, tweeDEseq,   edgeR, limma and DEseq2 are between 5 and 20 times higher than those of ptmixed, whereas the three approaches that use subject-specific fixed effects have extremely high RMSEs that do not even decrease with n. Overall, these results show that attempts to model repeated measurements using fixed effects can yield extremely inaccurate parameter estimates of between-subjects effects. Table 11 reports the estimated FPRs in simulation C. Overall, we can observe that in this case almost all methods control the type I error, achieving an FPR < 5% for all sample sizes. The only exception to this is DESeq2+FE. It can also be noted that the FPRs of the two mixed-effects modelling approaches, ptmixed (both with the Wald test and the LRT) and ShrinkBayes, are much closer to the prespecified α than those of all other methods, which appear to be more conservative. The estimated TPRs for the methods that control the type I error in simulation C are shown in Table 12. In general, all methods achieve comparable TPRs, with the exception of MaSigPro whose TPR is almost 0 for all n. The methods with the highest TPR when n = 10 are the three fixed-effects approaches, whereas for n = 20 and n = 40 edgeR+FE and limma-voom+FE are the most powerful, followed by ptmixed, ShrinkBayes and tweeDEseq. The estimated FPRs in simulation D are presented in Table 13. In this case, ptmixed and MaSigPro control the type I error across all sample sizes, and ShrinkBayes does so for n = 20 and n = 40. All other methods fail to control the type I error, producing much higher FPRs: the approaches that assume independence (DESeq2, edgeR, limma-voom and tweeDEseq) yield FPRs that range from 0.184 to 0.342, whereas the approaches that employ fixed effects (DESeq2+FE, edgeR+FE and limma-voom+FE) have FPRs close to 1. The estimated TPRs for the methods that control the type I error in simulation D are shown in Table 14. Once again, MaSigPro has TPRs close to 0, giving an indication Table 12: True Positive Rate in simulation C (H 0 ∶ β g2 = 0, α = 0.05) for the methods that control the type I error (FPR ≤ α in Table 11). Asterisks (*) denote methods that did not control the type I error (see Table 11).   Table 14: True Positive Rate in simulation D (H 0 ∶ β g1 = 0, α = 0.05) for the methods that control the type I error (FPR ≤ α in Table 13). Asterisks (*) denote methods that did not control the type I error (see Table 13). that the method has little power to reject H 0 both in simulation C and simulation D.
As concerns ptmixed, the Wald test appears to be a bit more powerful than the LRT, especially when n = 10 and n = 20. Finally, in the two cases in which it controls the type I error (n = 20 and n = 40), ShrinkBayes achieves a TPR slightly higher than the Wald test from ptmixed.
In Tables 1 and 2 of Supplementary File 1 we report the computing time of each method in simulations C and D. Overall, estimation of the GLMM approaches, ptmixed and ShrinkBayes, was considerably slower than that of the other approaches that fit simpler GLMs. For most methods, computing time increased linearly with n; this did not hold for the fixed-effects approaches, where computing time increased more than linearly with n, and for ShrinkBayes, which took more time to convergence when n = 10 than when n = 20 or n = 40.

Characterizing the dynamic evolution of muscular dystrophy in mouse
In this section we illustrate how the PT mixed model can be employed to analyse longitudinal RNA-seq data. We consider a longitudinal RNA-seq experiment performed at the LUMC (Leiden, NL) whose aim was to characterize the dynamic evolution of gene expression in mdx mice. The mdx mouse is the most used mouse model for DMD, an X-linked recessive genetic disease caused by protein truncating mutations in the DMD gene that encodes dystrophin. The progression of DMD is characterised by a process of muscle degeneration and regeneration whose consequences are muscle instability and the replacement of muscular tissue with fibrotic and fatty tissues. Eventually, this process leads to loss of muscle function, loss of ambulation and premature death. To date no effective treatment for DMD has been found; one of the reason for drug failure during clinical trials is the use functional tests with high inter-and intra-individual variability, resulting in underpowered studies. The possibility to quantify disease progression non-invasively with objective biomarkers is highly needed. In this experiment, repeated blood measurements were taken in order to obtain estimates of the evolution of blood-based gene expression over time.
The experiment involved five different groups of mice, but for simplicity of illustration here we focus on the comparison of the two most interesting groups: wild-type (wt), comprising 5 healthy mice, and mdx, comprising 3 dystrophic mice that shared the same genetic background of wt mice but carried a nonsense mutation in exon 23 that led to lack of dystrophin. Mice were kept under observation for 30 weeks, during which blood samples were collected from each mouse every 6 weeks starting from 6 weeks of age. Transcriptomic expression in these samples was quantified using RNAsequencing (GEO accession number GSE132741). We focus our attention on those genes that belong to 5 pathways, listed in Table 15, that are relevant in the DMD pathophysiology. We filter out lowly expressed genes by excluding genes that have less than 1 count per million in more than half of the samples; after filtering, 379 genes are retained in the analysis. We let the expression value for gene g from mouse i at the j-th time point follow a PT GLMM where Y gij v gi ∼ PT(µ gij , D g , a g ) and where d i is a dummy variable which is 0 for wt and 1 for mdx mice, t ij ∈ {0, 1, 2, 3, 4} denotes time (0 = week 6, 4 = week 30) and o ij is an offset term equal to the logarithm of the effective library sizes computed using the TMM normalization method (Robinson and Oshlack, 2010). We estimated model 5 using an AGHQ with 10 quadrature points and considered up to two different starting values (ptmixed default values and, when necessary, ML estimates of the negative binomial GLMM) for the optimization. ML estimation was successful for 315 genes; for the remaining 64 genes, for which estimation of model 5 failed, we attempted to estimate the simpler NB mixed model by fixing a g = 0; estimation of such model was successful for 63 of those genes. We do not consider further the remaining gene, for which also estimation of the NB GLMM failed. Figure 1 shows the distribution of the ML estimates of the power parameter,â g , for the 315 genes for which estimation of the PT GLMM was successful. For 74 geneŝ a g < −1, for 62 genesâ g ∈ [−1, 0), for 106 genesâ g ∈ [0, 0.5), and for the remaining 72 genesâ g ∈ [0.5, 1). The spread ofâ g over a range of values far from 0 is an indication of the fact that for several genes the NB distribution (that is, a PT with a = 0) does not provide the best fit to the data, in line with the observations made by Esnaola et al. (2013) about the distribution of cross-sectional RNA-seq data. With the aim of identifying genes whose progression is different between mdx and wt mice, we tested the null hypothesis H 01 ∶ β g1 = β g3 = 0 of no difference in the average expression level of gene g between cases and controls at any time point. p-values were adjusted for multiple testing using the Benjamini-Hockberg method (Benjamini and Hochberg, 1995). Detailed results of the test are provided in Supplementary File 2. In short, we found evidence of differential expression (FDR-corrected p-value < 0.05) for 120 genes; for these genes, we furthermore tested whether there was sufficient evidence that the intercept or the slope are different between groups, H 02 ∶ β g1 = 0 and H 03 ∶ β g3 = 0 respectively. These tests allowed to further identify 38 genes with a significant intercept (difference between wt and mdx at week 6) and 38 genes where the progression (slope) is different between groups.
In Figure 2 we show the trajectories of four of the significant genes, Igfbp4, Jak1, Stat1 and Tgif1. Igfbp4 is one of the insulin-like growth factor binding proteins, it binds to both IGF-1 and IGF-2 affecting the half-life of IGFs in circulation and their interaction with cell receptors. IGF-1 has been linked to muscle regeneration and multiple papers show a beneficial effect of IGF-1 in DMD (Forcina et al., 2019). IGF binding proteins such as IGFBP3 and IGFBP5 have also been described to be reduced in DMD patients serum compared to healthy controls (Hathout et al., 2019). In this work we observe a significant elevation of Igfbp4 in mdx mice, which is particularly apparent at weeks 6 and 12 when intensive muscle degeneration and regeneration is ongoing. Jak1 and Stat1 are part of the JAK-STAT pathway; inspection of the trajectory plots in Figure 2 shows that the two genes have a similar pattern of expression, with an elevation in mdx mice at week 6 followed by a steep decrease at the later time points. It has been described how the JAK-STAT pathway contributes to muscle repair promoting regeneration in DMD (Moresi et al., 2019). The observed behaviour of Jak1 and Stat1 in mdx mice is particularly interesting for two reasons: the intensive muscle regeneration ongoing at 6 weeks of age in mdx mice, and the previous finding that Stat1 is activated in mdx mice diaphragm at 6 weeks of age compared to healthy mice (Dogra et al., 2008).
Finally, for Tgif1 we can observe a downregulation in mdx mice at 6 and 12 weeks; later on, the expression of the gene in mdx mice increases towards the levels observed in wt mice. Tgif1 is part of the TGFβ pathway, a well known pathway in the DMD pathophysiology with genetic modifiers mapping to this pathway (Flanigan et al., 2013) and a number of reports involving TGFβ in DMD (Ceco and McNally, 2013).
In Figure 3 we compare the p-values for H 01 ∶ β g1 = β g3 = 0 obtained from the PT mixed model with those that would have been obtained if we had used a NB mixed model, focusing on the 311 genes for which both models could be fitted. It can be observed that while for several genes the p-values are quite similar, there are also numerous instances of genes for which the choice between NB and PT matters, as it may entail a different conclusion on the significance of the null hypothesis. In particular, p-values obtained from the PT mixed model are smaller than those from the NB mixed model in 178 cases, and larger 133 times; after application of the Benjamini-Hochberg multiple testing correction, 44 genes are found as significant (α = 0.05) with both the NB and the PT mixed model; furthermore, 12 genes are identified as significant just by the NB mixed model, whereas 53 genes are found as significant only by the PT mixed model. Overall, these results suggest that while there appears to be a good level of agreement between the two models, identifying for each gene the most suitable distributional shape within the PT family of distributions, rather than arbitrarily fixing a g = 0 and choosing the NB distribution, can have important consequences on the identification of differentially expressed genes and on downstream analyses. tails. The analysis of RNA-seq data arising from longitudinal experiments should be able to deal with such features and, at the same time, to account for the correlation arising from the availability of repeated measurements from the same individual. However, often such data are analysed with methods that either do not account for overdispersion, zero-inflation and heavy tails (Esnaola et al., 2013), or fail to model the dependence between repeated measurements (Cui et al., 2016).
In this article we have introduced ptmixed, a GLMM based on the PT distribution that can be employed to analyse of longitudinal sequencing data. On the one hand, the reliance of ptmixed on PT distribution allows to flexibly model the typical distributional shapes of RNA-seq counts and, at the same time, it performs automatic model selection thanks to its power parameter, which allows to distinguish between several discrete distributions that can be obtained as special cases of the PT (Esnaola et al., 2013). On the other hand, ptmixed properly models the correlation between repeated measurements obtained from the same subject through the specification of subject-specific random effects. The applicability of ptmixed extends beyond RNAseq data, embracing all scenarios where non-independent measurements of a discrete, overdispersed response variable are available; such settings can arise, for example, in ecology (Bolker et al., 2009) and in clinical trials (Albert, 1999). Our proposal represents the first theoretical development and software implementation of a mixed model based on the PT distribution. An important contribution of our work is to have shown that an accurate evaluation of the integrals in (2) makes it possible, in most (but not all) cases, to model longitudinal overdispersed counts using a GLMM that is more flexible than the Poisson and NB GLMMs. In our experience, we have observed that the rate of success for the estimation of the PT mixed model depends to a good extent on the sample size and on whether the values of a and σ 2 are on the boundary of the parameter space; in particular, convergence appears to be easier with larger n, and to be more problematic when a is very close to 1 or σ 2 approaches 0. The simulations presented in Sections 3.1 and 3.2 show that ptmixed allows to accurately perform parameter estimation and hypothesis testing, and demonstrate the advantage connected to the use of the AGHQ instead of the LA. Furthermore, in Section 3.3 we presented a comparison between ptmixed and alternative modelling approaches through two genomewide simulations. Those simulations showed that analysing longitudinal sequencing data with methods that do not properly model the repeated measurements design (assuming independence or introducing subject-specific fixed effects) yield inaccurate parameter estimates (Tables 9 and 10) and fail to control the type I error in tests on between-subjects effects (Table 13). On the other hand, they showed that use of GLMM approaches such as ptmixed and ShrinkBayes make it possible to accurately estimate and test both within-and between-subject effects. The results of Section 3.3 further allowed to compare ShrinkBayes and ptmixed. As concerns parameter estimation, the RMSEs of ShrinkBayes were found to be 1.5 / 2 times smaller than those of ptmixed for within-subject effects (Table 9), but 5 times larger than those of ptmixed for between-subjects effects (Table 10). Both methods successfully controlled the FDR at α = 0.05 in simulation C (Table 11), but ShrinkBayes did not control it when n = 10 in simulation D. Interestingly, ptmixed was somewhat more powerful than ShrinkBayes in simulation C (Table 12), while the opposite was found in simulation D (Table 14). Finally, the estimation of ptmixed was faster than ShrinkBayes with small sample sizes, but slower with larger sample sizes. Overall, the comparison pointed out that none of the two methods uniformly outperformed the other, suggesting that both methods are valid modelling approaches with their own advantages and weaknesses. Aside from those performance measures, there are a few additional features of ptmixed that make it more flexible than ShrinkBayes: as a matter of fact, ptmixed is capable of modelling not only zero-inflation but also heavy-tails, it allows to select a different distributional shape for each gene rather than fitting the same distribution to all genes, and it allows to easily test not only simple univariate hypotheses involving a single parameter, but also more general multivariate hypotheses involving linear combinations of parameters and / or the joint significance of two or more parameters. In Section 4 we illustrated an application of ptmixed to data from a longitudinal RNAseq experiment involving WT and mdx mice. Through this application we showed not only that ptmixed is capable to identify as differentially expressed many genes known to play a major role in the evolution of Duchenne Muscular Dystrophy, but also that using the more general PT GLMM instead of the simpler NB GLMM can yield substantially different conclusions on which genes are differentially expressed. This result points out the importance of properly modelling of zero-inflation and heavy-tails not just from a theoretical point of view but also from an applied perspective, since doing so can substantially change the conclusions of real data analyses.  Figure 1: Plots of the empirical probability mass function of the PT distribution for different values of the dispersion parameter D (1.1, 5, 10) and power parameter a (-20, 5, 0, 0.5) for fixed mean µ = 10. Each plot is obtained based on 10000 random draws from the PT pmf.