Dynamic modelling of corporate credit ratings and defaults

In this article, we propose a longitudinal multivariate model for binary and ordinal outcomes to describe the dynamic relationship among firm defaults and credit ratings from various raters. The latent probability of default is modelled as a dynamic process which contains additive firm-specific effects, a latent systematic factor representing the business cycle and idiosyncratic observed and unobserved factors. The joint set-up also facilitates the estimation of a bias for each rater which captures changes in the rating standards of the rating agencies. Bayesian estimation techniques are employed to estimate the parameters of interest. Several models are compared based on their out-of-sample prediction ability and we find that the proposed model outperforms simpler specifications. The joint framework is illustrated on a sample of publicly traded US corporates which are rated by at least one of the credit rating agencies S&P, Moody's and Fitch during the period 1995–2014.


Introduction
The last decades have witnessed an increased interest from practitioners in the financial industry, researchers and regulators alike in developing tools for appropriately measuring and modelling credit risk as well as developing and amending regulations which limit and monitor such risks. In this context, credit risk assessment typically relies on statistical models based on a historical database of defaults together with debtor-specific and market variables or on credit ratings which are forward-looking opinions of a debtor's creditworthiness and are assigned by external credit rating agencies (CRAs). This approach is also reflected in the regulations introduced in the Basel Accords I and II (Basel Committee on Banking Supervision, 2004Supervision, , 2011. For example, under Pillar I of Basel II financial intermediaries can develop their own default prediction models, which, in case the history of defaults is limited and portfolio coverage is satisfactory, can be enhanced or replaced by models using external rating information. However, clear guidelines on how to integrate all available information sources into a proper modelling framework are scarce. More recently, the need of such an integrated approach has been stressed by both academia (e.g., Hilscher and Wilson, 2017;Hirk et al., 2020) and regulators with the IFRS 9 accounting standard issued by the International Accounting Standards Board (IASB, 2014) requiring banks to build provisions based on forward-looking expected loss models by considering 'all reasonable and supportable information, including forward-looking measures'.
In this article, we extend the work of Hirk et al. (2020) and propose a framework for modelling defaults and credit ratings from Standard and Poor's (S&P), Moody's and Fitch in a multivariate ordinal model, where the latent credit quality is modelled as a dynamic process. The dynamic modelling of credit risk allows for dependencies in the cross-section and over time to be accounted for by typically making a distinction between systematic and idiosyncratic risk, where the systematic risk relates to the relationship between credit risk and a business factor and is of prime importance in portfolio credit risk modelling (see Vasicek, 2002;Koopman and Lucas, 2005;McNeil and Wendin, 2007;Betz et al., 2018). We therefore build a longitudinal model of binary default events and ratings on an ordinal scale where a common latent variable which is a measure of credit quality is underlying the observations. In particular, we model the conditional distribution of these responses given a set of financial covariates known to be relevant for credit risk modelling by assuming that the latent variable corresponding to the credit quality, referred to in this article as a 'probability of default (PD) score', depends on unobserved firm-specific effects, and on a systematic as well as an idiosyncratic factor which both have an auto-regressive structure of order one. This approach allows us to disentangle differences in firms' credit quality due to idiosyncratic causes from the effects due to business conditions. In modelling the credit ratings we also consider several characteristics of the ratings market. First, CRAs claim to provide a forward-looking long-term measure of credit quality by employing a so-called 'through-the-cycle' (TTC) approach to ensure that their ratings are stable over the business cycle (as opposed to a 'point-in-time' (PIT) approach, which measures credit quality at a given point in time). Second, we do not disregard the criticism of the three big CRAs for their inability to assess risk accurately (e.g., Becker and Milbourn, 2011;Bolton et al., 2012;Bar-Isaac and Shapiro, 2013;Kashyap and Kovrijnykh, 2016) and, from the modelling perspective, we assume the ratings to be noisy observations of the underlying PD score by assuming a rater 'bias' for each of the CRAs which depends on covariates and has a time-varying component common to all the CRAs which captures yearly shifts in the rating standards of the rating agencies. Here, we resort to Bayesian estimation techniques implemented in the open-source software Stan (Stan Development Team, 2018) to estimate the parameters of interest and illustrate the dynamic framework on a subset of the COMPUSTAT-CRSP universe of publicly traded US firms which are rated by at least one of the big three CRAs over the period 1995-2014. Dynamic modelling of corporate credit ratings and defaults 3 Over the last decade, joint modelling frameworks for credit risk measures have become more popular but it is still common practice in both industry and academia to model defaults and credit ratings separately. Statistical binary response (typically logit) models are often employed for predicting defaults (among the most prominent articles in the finance literature, for example Shumway, 2001;Campbell et al., 2008;Tian et al., 2015), while static ordinal or linear regression is used in modelling the credit ratings (e.g., Alp, 2013;Baghai et al., 2014). Ordered regression models with a dynamic specification have also been employed for modelling rating transitions (e.g., Malik and Thomas, 2012;Creal et al., 2014). Several articles have jointly investigated different credit risk measures, including credit ratings from possibly various raters for the purpose of credit risk measurement. For example, Hornik et al. (2010) propose a static parametric framework based on the existence of contemporaneous PD estimates for the same obligor provided by different rating sources for estimating consensus ratings as well as validate the different rating sources by analyzing the mean/variance structure of the rating errors. Grün et al. (2013) extend the analysis to a dynamic model and analyze a panel of ratings from the three big CRAs by first transforming ratings to PD estimates by using observed default rates. Creal et al. (2014) build a multivariate dynamic factor model for signal extraction and forecasting of macro, credit, and loss given default risk conditions in the US. Hilscher and Wilson (2017) provide an analysis of both ratings and defaults (even though not in a joint statistical model) where they investigate whether the measures have different information content. They conclude that, while the credit ratings are poor measures of raw default probabilities, they are strongly related to systematic risk. Hirk et al. (2018) build a joint ordinal model for the ratings from the big three CRAs while Hirk et al. (2020) propose a static framework for jointly modelling defaults and ratings using the class of multivariate ordinal regression models and show improved results in terms of default prediction conditional on the observed ratings.
The article is organized as follows: The joint modelling framework is introduced in Section 2 and details regarding the estimation and prior specifications are presented in Section 3. Section 4 introduces the data set used in the analysis and presents the results while Section 5 concludes the article.

General set-up
Suppose that for each firm i ∈ {1, . . . , I} in period t ∈ T i , with T i being the set of all available time points for firm i, we observe the corresponding (P × 1) vector of covariates x i (t) measuring firm liquidity, profitability, indebtedness, a binary default indicator y def i (t) which takes the value one if the firm defaulted between time t and t + 1, and a vector of available credit ratings y rat i (t) = [y rat ij (t)] j∈J i (t) which are observed at the end of period t for a non-empty subset J i (t) of all available raters {S&P, Moody's, Fitch}.

Laura Vana and Kurt Hornik
The one-year probability of default of firm i at time t denoted by PD i (t) is modelled as a random variable which follows a logistic regression model [Logistic regression is widely employed in the credit risk literature (e.g., Campbell et al., 2008;Tian et al., 2015).] and we assume that, conditional on PD i (t), the default indicator y def i (t) follows a Bernoulli distribution: where S i (t) is a real-valued score indicating the credit quality of firm i at time t with a high (low) value implying a low (high) credit quality. We refer to this quantity as the 'one-year PD score'. For the credit ratings, we employ an ordinal regression model using the cumulative link approach by treating the ratings observed for the jth rater y rat ij (t) which can take one of C j classes as a coarser version of an underlying latent variable y rat ij (t), which can be interpreted as a real-valued rating score. A regression model is then assumed on the latent scale: where θ j = (θ j,0 , θ j,1 , . . . , θ j,C j ) is a set of rater-specific threshold parameters satisfying the order restriction θ j,0 = −∞ < θ j,1 < · · · < θ j,C j = ∞ which are used to slot the underlying variable into intervals corresponding to the non-default ordinal rating classes r ∈ {1, . . . , C j } (where 1 denotes the class with best creditworthiness); η ij (t) is a rater-specific bias term; ε ij (t) iid ∼ L(0, 1) is a mean-zero noise term which follows a standard (to ensure identifiability) logistic distribution.
By allowing rater-specific thresholds we are able to capture the heterogeneity in the rating scale and methodology of the different CRAs. [The raters employ different coding for the rating classes: S&P and Fitch employ a rating scale with eight main non-default rating categories AAA, AA, A, BBB, BB, B, CCC, CC while Moody's scale is Aaa, Aa, A, Baa, Ba, B, Caa, Ca. Moreover, the CRAs claim to use different credit risk measures in their assessments: Moody's relies on loss given default while S&P and Fitch measure the relative likelihood of default.] Furthermore, in this application one may think of the term µ ij (t) in Equation (2.1) as the 'expected rating score' assigned by rater j to firm i in year t, which the raters then transform to an ordinal scale using a suitable mapping. In the specification of expected rating score, we assume that the raters observe and/or produce noisy versions of the 'one-year PD score' S i (t) and that the CRAs' biases η ij (t) can be modelled additively on the scale of the underlying latent variables. The noisiness in the ratings can be motivated on the one hand by some sort of information asymmetry between firm owners and raters and on the other hand by the fact that the ratings are forward-looking measures  Dynamic modelling of corporate credit ratings and defaults 5 of creditworthiness for horizons longer than one year which are assigned by the CRAs based on different metrics (see, e.g., Grün et al., 2013). On the other hand, the assumption of additivity on the rating score scale is in line with previous specifications of rating models (e.g., Alp, 2013;Baghai et al., 2014) but also by Merton-type models under partial information, where the error in the observation of the log firm value is additive on the scale of the log normal firm value process (first introduced in Duffie and Lando, 2001).

Dynamic specification of the latent PD score S i (t)
The basic framework underlying dynamic models of credit risk also adopted by regulators (see, e.g., the methodology underlying the CreditMetrics TM framework) is that credit risk depends on a systematic and an idiosyncratic component and a Gaussian single factor model is typically employed in the modelling process (Vasicek, 2002). Moreover, it is reasonable to assume that the PD scores are correlated over the time dimension owing to macroeconomic events such as recessions whose influence is not fully captured by the covariates or over the cross-section from direct effects of one corporate failure on other distressed corporations. We propose the following dynamic specification of the latent one-year PD score: where β 0 is an intercept term, β is a (P × 1) vector of regression coefficients and u i (t) is a random effect which consists of a firm-specific effect a i , a latent market factor b(t) and idiosyncratic changes i (t), and the loading ω measures the dependence of S i (t) on the latent market factor (a similar specification has been proposed in Grün et al., 2013). The effects a i are firm-specific deviations from the overall intercept which can capture unobserved heterogeneity such as management ability, and are assumed to be normally distributed with a firm-specific variance. An auto-regressive structure of order one is assumed for the latent market factor b(t) whereas the above sign convention implies that positive b(t)'s correspond to favourable market conditions. The modelling of the systematic factor as a latent quantity is rather standard in the literature, owing to the fact that the theory on which observed variables would be adequate as a proxy for systematic credit risk is rather scarce (see remarks in, e.g., Koopman and Lucas, 2005). By restricting ω to be constant for all firms and years we implicitly assume that b(t) is indeed a common market factor impacting all firms equally, while the remaining unexplained variation can be captured by the idiosyncratic effect i (t).
[We also investigated whether the model improves when allowing the factor loading to depend on the industry in which firm i operates but find no compelling improvement in the performance.] The variance of υ b (t) is fixed to one to ensure identifiability and we restrict |φ b | < 1 to ensure stationarity. Finally, the idiosyncratic disturbances i (t) are independent over the firms but are serially correlated through an AR(1) process with |ρ| < 1 for ensuring stationarity. Due to the rather short history for most firms in the sample (see also Figure A.1 in the Supplementary Material) the data might not deliver enough information on identifying a firm-specific persistence or standard-deviation parameters, this is why we assume a constant ρ and ψ for all firms.

Dynamic specification of the rater bias η ij (t)
The rater bias specification proposed in this article is given by: where γ j is a rater-specific (P × 1) vector of regression coefficients, δ(t) is a time-varying 'rater factor' which is modelled dynamically using an AR(1) process with |φ δ | < 1 and λ j is a rater-specific factor loading. The rater-specific bias η ij (t) depends linearly on the covariates x i (t), which were also employed in the specification of the latent PD score S i (t) in Equation (2.2). This assumption is reasonable as these risk factors also affect the credit ratings, but to a different extent than they affect defaults. However, one could include the lagged ratings as a proxy for the 'stickiness' of credit ratings or the number of years that the firm has been rated by a CRA (having the firm as a client for a longer time period can potentially reduce information asymmetry) as potential covariates. We investigated whether these variables improve the model performance and find that, considering these additional covariates does not improve the results markedly. The specification above does not account for any dependence among the three raters, which might seem rather restrictive as lead-lag relationships among the raters have been found empirically (e.g., Güttler and Wahrenburg, 2007;Berwart et al., 2019). However, the rating adjustments are reported to follow within months of the lead rating change, so this effect can be expected to be negligible on an yearly basis. The rater factor δ(t) is independent of observed covariates and of S i (t), and should thus pick up any time variation specific to the CRAs' behaviour and industry practices beyond that implied by the variation of the PD scores caused by the market factor. The reason for including δ(t) in the model lies mainly in previous results showing shifts in the rating standards over the sample period analyzed in this article (e.g., Alp, 2013). Moreover, given the low number of defaults in the sample, omitting the rater factor from the model specification might make the estimation of the business factor cumbersome given the strong (weak) signal in the data coming from the ratings (defaults). For the sake of parsimony, we do not employ one factor for each rating agency, as we expect the behaviour of the three CRAs to be similar given the oligopoly structure of the ratings' market and the high degree of agreement among the raters.
Dynamic modelling of corporate credit ratings and defaults 7

Bayesian inference
Several methods can be considered for the estimation of the proposed model, which contains effects at different levels of hierarchy. The non-nested firm and time effects, however, restrict the techniques which can be employed, as the estimation of ordinal models with non-nested effects is cumbersome due to the necessity to compute high-dimensional integrals. The maximum likelihood approach using the EM algorithm can be employed (e.g., Cagnone et al., 2009, employ the EM algorithm in estimating a multivariate ordinal model with item and time-specific random effects). McCulloch (1994) proposed the inclusion of a Metropolis Hastings step in the E-step of the EM algorithm so that the required expectation can be approximated by the average of Monte Carlo samples from the target distribution (approach used by, e.g., Xie et al., 2013, in a two-level model). Bellio and Varin (2005) tackle the dimensionality issue by maximizing the product of the pairwise marginal likelihoods and estimate the parameters of a two-level generalized linear mixed model (GLMM) with crossed random effects. The Bayesian framework is an attractive alternative for multi-level models with (crossed) effects at different levels of hierarchy. Through the specification of a prior distribution, Bayesian estimates of the effects can be obtained even in cases where there are few data points per group Moreover, the growing number of available open software tools for performing Bayesian inference make the implementation and estimation of such models more accessible.

Posterior
The posterior, which is proportional to the product of the likelihood of the four responses and the prior densities on all unobservables (i.e., parameters and latent quantities), is the object of interest in the analysis and inference relies on samples drawn from this posterior distribution. Samples from the posterior are drawn by using the package rstan (Stan Development Team, 2019) for R (R Core Team, 2020), which is an interface to the open-source software Stan. Stan is a probabilistic C++ library which provides full Bayesian inference through Markov chain Monte Carlo (MCMC) methods to obtain posterior simulations. In order to investigate how well the parameters of the proposed model can be estimated empirically, we conducted a simulation study which is presented in Section A.3 of the Supplementary Material.
Conditional on the latent PD scores S i (t), the responses are independent over all i ∈ {1, . . . , I} and t ∈ T i and the joint likelihood is the product of the individual likelihoods of the responses.
The term of the likelihood corresponding to the default indicator is given by the Bernoulli probability mass function p(y def while the likelihood of the ordinal responses is given by the probability of observing category r ij (t): p(y rat ij (t)|x i (t), ζ) = C j r=1 P(y rat ij (t) = r|x i (t), ζ) 1{r=r ij (t)} , where 1{.} denotes the indicator function and for the sake of notational simplicity ζ denotes unobservables. In the cumulative link logit model, the probability can be rewritten as P(y rat (t)). Priors are set on all model parameters. We find the results to be insensitive to the prior specified on the coefficients in the multivariate ordinal regression, which is to be expected given that the covariates were pre-selected based on their relevance in the existing literature. For the coefficients β and γ j of the standardized covariates, we proceed with the Student-t prior. For the threshold parameters, we employ a Dirichlet prior on the probability of the ordinal outcome falling in each of the C j categories π j ∼ Dirichlet(α j ) and obtain the thresholds by the transformation θ j,r = logit −1 r i=1 π j,i . For the firm-specific intercepts a i a realistic assumption in our dataset is to expect the a i 's to have different variances a i ∼ N(0, τ 2 i ) as we expect firms to have different variability in their creditworthiness. We separate the prior on the regression coefficients from the prior on the random firm-specific intercepts a i , especially due to information imbalance (the number of observations available for identifying a i is given by the number of years each firm spends in the sample, which for the analyzed sample ranges from 1 to 20, as can be seen in Figure A.1 in the Supplementary Material). Individual shrinkage of the firm effects is achieved by imposing a shrinkage prior which has a hierarchical representation. We consider here a non-Gaussian shrinkage prior on a i with τ i ∼ N(0, q 2 ), where we treat q 2 as a hyper-parameter of the shrinkage prior above with an inverse Gamma distribution q 2 ∼ G −1 (c 0 , C 0 ) (for more details see, e.g., Frühwirth-Schnatter and Wagner, 2011). The prior on the persistence parameters φ b , φ δ and ρ is chosen as a scaled beta distribution x+1 2 ∼ Beta(a x , b x ), where the hyper-parameters are chosen to reflect prior information. For all other parameters, we use the improper prior p(x) = 1/x.

Model evaluation and out-of-time prediction
In order to evaluate the performance of the model in terms of out-of-sample prediction, we use two approaches. First, we employ approximate leave-one-out (LOO) cross-validation methods (as proposed in Vehtari et al., 2017). Second, we perform a K-fold cross-validation exercise adapted to the panel data structure. The difference between the two approaches is that the K-fold cross-validation requires refitting the model K times whereas approximate LOO methods require only one evaluation of the model. K-fold cross-validation has, however, the advantage that it allows to check the ability of the model to perform well out-of-time, in which case LOO methods are not suitable for assessing the performance on unseen time points (see, e.g., discussion in Vehtari and Ojanen, 2012 Dynamic modelling of corporate credit ratings and defaults 9 density employed here can be found in Vehtari et al. (2017). In the following, we proceed with the exposition of the out-of-time prediction exercise. Assume we fit model M on the period up to a time t which implies we observe the ratings and the covariates up to time t and we also observe whether the company defaulted in the year following the rating observations. We denote the responses and the covariates observed up to time t by Z o [1:t] = {(y def i (τ), y rat ij (τ)); τ = 1, . . . t, i ∈ I t , j ∈ J} and X [1:t] = {x i (τ); τ = 1, . . . t, i ∈ I t }, respectively, where I t denotes the set of observed firms up to time t. We evaluate the predictive performance of M by making use of the posterior predictive density, which for a future data point containing the default and rating observations z * = (y def * , y rat * ) and a future set of covariates x * is given by: In addition to the joint posterior predictive densities, for the application case it is also relevant to assess whether adding the information about the ratings at the end of each year conditionally improves the prediction of the default component (see, e.g., Hirk et al., 2020). Hence, for a future default observation y def * we compute the conditional default probability implied by the model, that is, the one-year probability of default conditional on the corresponding ratings y rat * taking values r 1 , . . . , where both the denominator and the numerator can be rewritten as integrals similar to the one in Equation (3.1). The predictive performance of M can then be measured by evaluating the above quantities at specific observations. For this purpose, we employ Bayesian cross-validation adapted to account for the time dimension of the data at hand. More specifically, we employ a one-step-ahead out-of-time exercise on an expanding window basis where, for increasing t, we repeatedly partition the data into a training set (Z train • the posterior mean conditional default probability for firm i in t + 1: . From these conditional probabilities, we compute for each test sample a measure of calibration, namely the (square root of) Brier score (Brier, 1950), which is mean the squared error between the conditional PDs and the observed binary default indicator. Finally, for evaluating the discriminatory power we compute the area under the precision-recall curve (Davis and Goadrich, 2006) [Using the precision-recall curve is more desirable than employing the receiver operating curve given the imbalance in the default indicator.].

Empirical application
In this section, we introduce the dataset used in the analysis and proceed with a discussion of the results obtained from the proposed dynamic framework as well a with comparison to benchmark models in terms of the predictive performance. Throughout the analysis, the hyper-parameters for the dynamic specification are kept constant. We set c 0 = C 0 = 0.5 for the distribution of q 2 , which implies a median of one for the variance τ 2 i of the random effects and allows for heavy tails to accommodate for extreme observations; the parameters for the Student-t priors for the regression coefficients are fixed to four degrees of freedom, mean zero and unit variance; for the intercept β 0 we employ the same prior with a variance of two; for the threshold parameters, the 'concentration' hyper-parameter of the Dirichlet distribution is chosen in a data-dependent fashion by setting α j,r as the number of ratings in class r assigned by rater j. For the persistence parameters, we choose a φ b = 20 and b φ b = 2.5 which translates into a prior mean of roughly 0.8 and a prior standard deviation of 0.12, motivated by previous results which find the market factor to be persistent. We choose the same hyper-parameters for the prior on φ δ . For the persistence of the idiosyncratic effects we set a ρ = 20 and b ρ = 5 which has a larger variation, that is, is vaguer and has a mean around 0.60. For all models, five chains of length 2 000 were randomly initialized and run in parallel. The first 1000 MCMC iterations of each chain were discarded as burn-in. Trace plots and density plots show satisfactory convergence of all chains in each model.

Data
The empirical analysis is performed on a dataset containing credit ratings from S&P, Moody's and Fitch, default data and firm-level information for a sample of publicly traded and rated US corporates, excluding financial and utilities companies over the period 1995-2014. Long-term issuer credit ratings of S&P were downloaded by the S&P Capital IQ's COMPUSTAT North America Ratings file. The ratings from Moody's and Fitch were purchased by the research institution directly from the CRAs. Data on corporate defaults and bankruptcies were obtained from the UCLA-LoPucki Bankruptcy Research Database and the Mergent issuer default file. Firm-level data was downloaded from the merged COMPUSTAT/CRSP database.
We perform the analysis on a calendar year basis and match the latest available firm-level information with all available end-of-year ratings. The default indicator is set to one whenever we observe a bankruptcy filing under Chapter 7 (liquidation) or Chapter 11 (reorganization) of the US bankruptcy code or if a default rating [Default ratings assigned by a CRA include in distressed exchanges or missed interest payments addition to bankruptcy filings.] is assigned by at least one CRA in the year following the rating observations. In all other instances, the default indicator is set to zero, including cases where the firm disappears from the dataset for some reason other than bankruptcy such as acquisition, delisting or if no longer rated. The firm-level information is used to construct covariates which have been identified as significant predictors of credit quality in the literature. In our analysis, we rely on the work of Tian et al. (2015), who employ model selection techniques to identify factors relevant for bankruptcy prediction: current liabilities/total assets (LCT/TA), total debt/total assets (F/TA), net income over market value of total assets (NI/MTA), annualized standard deviation of stock returns over a three month period (SIGMA), the logarithm of the end-of-year stock price, whereas the stock price is capped at USD 15 (PRICE) and gross excess log return over value-weighted S&P 500 return (EXRET). All variables are winsorized at 99% and 1% if negative values are allowed.
After eliminating the missing data in the covariates (which appears mainly due to different coverage of the data sources), the merged dataset contains 2528 firms and 19952 yearly observations for all firms (so-called firm-year observations), whereas the panel is highly unbalanced in the time dimension, with firms staying on average 7.89 years in the sample (see Figure A.1 in the Supplementary Material). The sample contains 375 defaults (1.88% sample default rate). Figure 1 shows the cyclical behaviour of the default rates from 1996 to 2014, with default rates increasing during and immediately after recessions.
There is a high rating agreement in the sample with Spearman's correlation higher than 90% for all pairs of raters. Not all ratings are observed for all firm-years, with 97.52% of the firm-years being rated by S&P, 58.67% by Moody's and 17.17% by Fitch. For all CRAs the number of ratings falling into the best and worst classes is small so for the analysis we use the following aggregated rating classes: AAA/A, BBB, BB, B, CCC/C and Aaa/A, Baa, Ba, B, Caa/Ca, respectively. The rating distribution for S&P and Moody's ratings is relatively stable over the sample period, while Fitch assigns more favourable ratings especially at the beginning In the sample, we also observe that defaulted firms exhibit on average higher liabilities ratios (LCT/TA, F/TA, TL/MTA), higher stock price volatility, lower stock prices as well as negative income ratios and negative excess returns. The summary statistics of the covariates for both the entire and the defaulted sample are presented in Table A.1 in the Supplementary Material. Table 1 shows the posterior mean and posterior standard deviation for the regression coefficients β and γ j corresponding to the standardized covariates (the posterior distribution of these coefficients is illustrated in Figure A.3 of the Supplementary Material). We observe that the coefficients β of the latent PD score all have the expected signs. Firms with higher current liabilities, debt or total liabilities ratios have a higher likelihood to default. Similarly, a higher volatility of the stock price is associated with higher PD scores. On the other hand, higher profitability ratios, higher stock prices and higher excess returns lead to lower PD scores and improved creditworthiness. When looking at the rater biases, we observe no marked differences among the three raters. The quantity β + γ j corresponds to the rater-specific coefficients and, while caution should be employed when comparing the magnitude of these coefficients among raters (as absolute scale of the underlying variables in ordinal models is unidentifiable and assumed to be equal to one, see, e.g., Kern and Stein, 2015), insight can be gained from looking at the signs of these coefficients. It is worth noting that the resulting rater-specific coefficients for the covariates change sign for the covariates LCT/TA and EXRET. This has been previously observed empirically (e.g., Alp, 2013;Hirk et al., 2018) and is likely due to the ratings being forward-looking measures of creditworthiness over periods extending beyond one year. In this setting, the resulting negative coefficient of LCT/TA can be explained by the fact that short-term liabilities, that is, liabilities due within one year, which are problematic for firms close to default, in the longer run are less risky than long-term liabilities. Similarly, the positive coefficient of EXRET indicates that firms with higher excess returns, while more likely to avoid an imminent default, are considered typically to be riskier. The posterior mean and the 80% symmetric credible intervals for the market factor b(t) are shown in the left-most panel of Figure 2. We observe the drops in the estimated posterior means of the market factor during the burst of the dot-com bubble and the financial crisis 2007-2009. We also investigate the time-varying intercept in the rater bias which captures changes in the CRAs' behaviour. Figure 2 illustrates the posterior means and the 80% credible intervals of the λ j δ(t) term for each CRA, (note that in the analysis we fix λ S&P = 1 to reduce the parameter space and to ensure a better convergence of the model). We observe a rather counter-cyclical behaviour, where the drops (spikes) represent a relaxation (tightening) of the rating standards. The factor loading λ Fitch is larger than the one for the other two CRAs, probably owing to Fitch being the youngest of the CRAs in the US market and to the least ratings being observed for Fitch in the sample. The tightening and relaxation of the rating standards in a counter-cyclical fashion could potentially be explained by the TTC approach employed by the raters, who adapt their standards to counterbalance the business cycle in order to keep the rating distribution constant. This hints towards the fact that the rating standards become stricter after economic downturns, behaviour not completely explained by the TTC approach (in line with e.g., Alp, 2013; Bar-Isaac and Shapiro, 2013). The posterior summaries of the other parameters are presented in Table A.2 of the Supplementary Material.

Comparison with benchmark models
For comparison purposes, we formulate several alternative model specifications which differ from the proposed modelling framework in the specification of the random effect u i (t) and in the specification of the average rater bias η ij (t). We consider two static benchmark models which only contain one level of hierarchy in the random effects specification after accounting for the covariates, that is, the idiosyncratic term is normally distributed term u i (t) ∼ N(0, ψ 2 ). In model (S1) we assume no rater bias η ij (t) = 0 for all raters while in model (S2) we allow for covariate dependent rater bias η ij (t) = γ j x i (t) for all raters. The other benchmark models considered share the dynamic specification of u i (t) introduced in Equation (2.3) but differ in the specification of the rater bias. Model (D1) assumes η ij (t) = 0, while model (D2) assumes that the rater bias is a linear combination of the covariates η ij (t) = γ j x i (t). The proposed model is denoted by (PM). For a motivation on the choice for the rater bias specification in model (PM), we direct the reader to Section A.4 of the Supplementary Material for an exploratory residual analysis. We consider for model comparison purposes the widely used Bayesian leave-one-out estimate of the expected log pointwise predictive (ELPD LOO). Table 2 contains the difference in ELPD LOO relative to the best estimated model, that is, the model with highest ELPD LOO, together with an estimate for the standard error of the differences. We observe a value of zero for the proposed model and notice that the differences for all other models are more than two standard deviations away from zero, confirming the superior performance of Model (PM).
In order to compare the proposed joint model with the above models in terms of out-of-time performance, we set up a cross-validation exercise as described in Section 3.2 and start by training the model on the period 1995-2006 and then sequentially add one sample year of data to the training set. This results in seven Table 2 This table presents for five models considered the difference in ELPD LOO relative to the best performing model, the corresponding standard error, and the measures of the out-of-time exercise: one-step-ahead joint log predictive density, the square root of the Brier score and the area under the precision-recall curve averaged over 2007-2013 (whole) and 2007-2009 (crisis)  Notes: a η ij (t) = λ j δ(t) + γ j x i (t), b η ij (t) = γ j x i (t), c η ij (t) = 0, d u i (t) as in Eq. 2.3, e u i (t) ∼ N(0, ψ 2 ).
training and test samples. Table 2 presents the one-step-ahead joint and marginal LPD averaged over all test samples (i.e., over the period 2007-2013) and over test samples covering the financial crisis 2007-2009, respectively. Similarly, we report the average one-step-ahead square root of the Brier score and area under the precision-recall curve based on the conditional probabilities of default. We observe that the proposed model (PM) performs best out-of-time in terms of the joint log predictive density scores, and more generally, the models with a dynamic specification in the PD score are better in terms of predictive performance than the static models (S1) and (S2). Model (PM) also performs best in terms of calibration, as it achieves the lowest Brier score on average over all samples and over the crisis period. In terms of discriminatory power, the static model (S2) performs best for the crisis years, while dynamic model (D2) performs best for the whole test period. This suggests that including a time-varying rater factor in the rater bias specification does not necessarily improve the ability of the model to discriminate between defaults and non-defaults.

Concluding remarks
In this article we present a joint analysis of defaults and credit ratings for a sample of US publicly traded corporates by considering a multidimensional longitudinal model of multivariate ordinal data. We integrate both default and forward-looking credit rating data in a joint statistical model and employ a dynamic specification in the latent creditworthiness equation and in the rater bias. Bayesian methods implemented in the R package rstan are used for estimation. To examine the empirical performance of the posterior estimates under the proposed joint model, we conducted a simulation study which is presented in Section A.3 of the Supplementary Material. Conditional on well-established covariates in the bankruptcy prediction literature, we allow the latent PD score to depend on a dynamic unobservable market factor common to all firms, as well as on idiosyncratic effects with a dynamic specification. Moreover, the ratings from the big three CRAs are assumed to be noisy observations of the latent PD score, where a rater bias which depends on covariates and has a time-varying component is specified for each CRA. When comparing the predictive performance of the proposed framework to benchmark models we find that the proposed model has superior overall predictive performance. Future research avenues include the incorporation of a wide range of covariates in the model and tackling the issue of variable selection to allow a data-driven identification of relevant factors for both the latent PD score and the rater biases as well as the exploration of more flexible mixed-effect specifications for the latent PD score, for example, more general parameterizations for the factor loading ω capturing the dependence between the latent PD score and the latent market factor as well as for the persistence ρ and standard deviation ψ of the idiosyncratic effects by allowing, for example, industry-specific parameters. Finally, the potential of the proposed joint model in serving as a framework for estimating and validating TTC versus PIT PDs and for measuring the degree to which rating systems are employing the TTC approach (topics of renewed interest mainly in the context of IFRS 9), could be further investigated.