Solutions to problems of nonexistence of parameter estimates and sparse data bias in Poisson regression

Poisson regression can be challenging with sparse data, in particular with certain data constellations where maximum likelihood estimates of regression coefficients do not exist. This paper provides a comprehensive evaluation of methods that give finite regression coefficients when maximum likelihood estimates do not exist, including Firth’s general approach to bias reduction, exact conditional Poisson regression, and a Bayesian estimator using weakly informative priors that can be obtained via data augmentation. Furthermore, we include in our evaluation a new proposal for a modification of Firth’s approach, improving its performance for predictions without compromising its attractive bias-correcting properties for regression coefficients. We illustrate the issue of the nonexistence of maximum likelihood estimates with a dataset arising from the recent outbreak of COVID-19 and an example from implant dentistry. All methods are evaluated in a comprehensive simulation study under a variety of realistic scenarios, evaluating their performance for prediction and estimation. To conclude, while exact conditional Poisson regression may be confined to small data sets only, both the modification of Firth’s approach and the Bayesian estimator are universally applicable solutions with attractive properties for prediction and estimation. While the Bayesian method needs specification of prior variances for the regression coefficients, the modified Firth approach does not require any user input.


Introduction
Poisson regression is widely used to model the distribution of count variables as functions of predictive covariates. This approach provides particular utility when accommodating differential follow-up times of study subjects, 1 as well as the modelling of 'non-events' or excess zeroes through so-called zero-inflated models. 2 As with other generalized linear models, such as logistic regression, Poisson regression can be especially challenging in the presence of rare events, making it more likely that particular covariate patterns in a given dataset result in the nonexistence of maximum likelihood (ML) estimates; for example, if no events are observed for one of two groups represented by a binary covariate. A necessary and sufficient condition for the nonexistence of ML estimates has been identified by Correia et al. 3 This problem of 'separation' has been studied extensively for logistic regression, 4-7 but comparatively little is known about how various approaches perform when separation arises in Poisson modelling, such as the bias reduction method of Firth 8,9 and (in special cases) conditional exact Poisson (EP) regression. 10 Bayesian alternatives, including the application of weakly informative priors for the distribution of the log relative risk, have likewise not been evaluated with regard to separation. Given the common use of Poisson regression, particularly for rare events and in other settings that are prone to separation, more definitive empirical studies are needed to assess the comparative performance of these modelling options. In particular, we need to better understand how various modelling choices impact the estimation of regression coefficients to ensure that analysts are confident in their predictions.
In this paper, we provide a more comprehensive evaluation of methods that provide finite estimates of regression coefficients when separation arises. We propose a modification of Firth's approach to improving its performance for predictions, without compromising its attractive bias-correcting properties for regression coefficients. Furthermore, we include the conditional median unbiased estimator 11 as implemented in the software package LOGXACT 10 or in SAS's PROC GENMOD and a Bayesian estimator using weakly informative priors in our evaluation. To provide context, we first illustrate the issue of separation for rare event data with a dataset arising from the recent outbreak of COVID-19. We subsequently provide a brief review of Poisson regression and describe proposed solutions to the problem of separation. We then summarize the results of a comprehensive simulation study to compare these options under a variety of realistic scenarios. We further illustrate their relative performance with an additional example.

Motivating example
During the outbreak of the coronavirus (COVID- 19) in winter and spring 2020, employees of supermarkets, nursing homes, and hospitals were considered key professionals as they are in contact with many clients each day even under the lockdowns of public life that were imposed by many countries in that time. Many of their clients such as older adults or persons with chronic diseases were considered to be at high risk of a severe course of the disease if they got infected. At some point, the question arose if more stringent control of virus spread should be imposed on one of these groups, in particular, considering the unknown probability of asymptomatic infections. Therefore, on 28 March 2020, and 30 March 2020, a representative series of 1161 tests for the presence of an infection with the novel coronavirus (COVID-19) was performed in Austria among randomly selected asymptomatic employees of supermarkets, nursing homes, and hospitals. 12 The research question behind this series was how the risk of infection differs between employees of supermarkets, nursing homes, and hospitals. The question could be answered by Poisson regression to compute risk ratios, for example, comparing nursing homes and hospitals to supermarkets.
Among the 1161 persons tested, only six tested positive for COVID-19. Three of them were working in hospitals and three in nursing homes (Table 1).
Unfortunately, neither SAS/PROC GENMOD nor R/glm was able to provide risk ratios based on a Poisson regression analysis because ML analysis failed as there was a category with no events. However, the conditional median unbiased estimates for the two interesting risk ratios and associated exact 13 95% confidence intervals (CIs) were 3.05 (0.46, ∞) for nursing homes versus supermarkets and 3.71 (0.56, ∞) for hospitals versus supermarkets. Using Firth's bias reduction approach, 8 which in the absence of covariates can be obtained by adding 0.5 events to each of the three observations, the analysis resulted in risk ratios (95% profile penalized likelihood (PPL) CI) of 5.55 (0.54, 746.13) and 6.75 (0.65, 907.60), respectively. Therefore, while it appears that employees of nursing homes or hospitals are at considerably higher risk of spreading the disease compared to supermarket employees, this claim is not fully supported by the study but still could be found in local media. Note that estimates of risk accompanied by 95% CI in each group, which could be obtained much easier than risk ratios, give a good summary of the data but do not answer the question of whether supermarket employees or health care workers are at higher risk of infection.

Methods
In this section, we will review the Poisson regression model with a special focus on nonexistence of ML estimates. We will further review Firth's correction in the context of the Poisson regression model and we will propose a modification that gives unbiased predictions. Finally, we will present exact conditional analysis and Bayesian estimation with weakly informative priors.

The Poisson model
The Poisson model assumes that the counts of events in a study, Y, follow a Poisson distribution with parameter μ: Y ∼ Poisson(μ), where the logarithm of μ is modelled by a linear combination of covariates: log (μ) = X β + Z. Here, X describes a n × (k + 1) matrix of covariates, with n and k denoting the number of observations and covariates, respectively. By convention, X ·0 , the first column of X, consists of 1s only to enable the estimation of an intercept. Z denotes an offset variable, for example, the logarithm of follow-up time or the number of people tested in total. The parameter μ is interpreted as incidence per unit of follow-up time, β 0 is the intercept and β j , j = 1, . . . , k, is the log incidence rate ratio (IRR) between two individuals differing in X j by one unit. The log-likelihood of the Poisson model is given by Estimates of β j , j = 0, . . . , k, can be obtained by ML estimation, solving the score equations where x ij is the observed value of covariate j for subject i, x i· is the row vector of covariate values for subject i, and z i is the value of the offset for that subject.

Conditions for nonexistence of ML estimates in Poisson regression and consequences
In the coronavirus testing study, no infections were detected among the 352 supermarket workers, while among the 365 hospital employees, three were infected. The risk ratio would be computed as (3 of 365)/(0 of 352), which is not defined because of the division by zero. Similarly, there is no finite maximizer β of the corresponding Poisson likelihood. Correia et al. 3 showed that in Poisson regression the ML estimate does not exist if and only if there is a non-zero (k + 1)-dimensional vector γ * such that x i· γ * = 0 for i with y i > 0 and x i· γ * ≤ 0 for i with y i = 0, see Appendix 1 for a replication of their proof in the special case of Poisson regression. If such a linear combination exists, we say that the data are 'separated'. From a geometrical point of view, the data are separated if and only if there exists a hyperplane such that all observations with y i > 0 lie on the plane and all observations with y i = 0 lie on one side of the plane or also on the plane. For the corona virus testing study, multiplying the dummy variable for supermarket employees by −1 represents a linear combination satisfying the condition given by Correira et al. More generally, we observe that the ML estimate would exist if and only if for each category (supermarket, nursing home, and hospital) at least one person had been tested positive. The existence of the ML estimate only depends on the number of events (people tested positive), but not on the number of people tested in total (Table 1, last column). This observation highlights the difference to the concept of separation in logistic regression (Albert and Anderson, 1984): for the corona virus testing study, ML estimates in logistic regression would exist, if and only if at least one person had been tested positive and at least one person had been tested negative for each category. In the following, we will always use the term separation in the context of Poisson regression as defined above.
As in logistic regression, adding covariates will not remove the nonexistence in Poisson regression. While numerical ML algorithms may declare convergence when the log-likelihood cannot be improved by a further iteration, 14 'a spurious solution is characterized by a "perfect" fit for the observations with y i = 0',that is, exp (x i·β ) 0 for all y i = 0.

Firth's likelihood penalization applied to the Poisson model
Generally, Firth 8 suggested adding a penalty term to the log-likelihood of exponential family models that resembles the Jeffreys invariant prior such that the penalized log-likelihood becomes: where |I(β)| denotes the determinant of the Fisher information matrix. The modification is motivated by elimination of bias of order O(n −1 ) in the ML estimates of β, and various empirical studies have proven the bias-preventive properties of Firth's correction. It also prevents the nonexistence of ML estimates of β and has become the default solution to solve this problem for logistic and Cox regression. 5,15 Already in Firth's seminal paper, 8 an example with Poisson regression was included. However, Firth's likelihood penalization (FL) for the Poisson model has not been studied any further. FL estimates maximizing (1) can be obtained through iteratively solving the modified score equations: where h i are the diagonals of the 'hat' matrix H = XW 1/2 (X ′ WX ) −1 XW 1/2 , with W = diag(exp (x i· β + z i )). Generally, tr(H) = k + 1 and h i > 0 for all i. Equation (2) can been written in a form revealing that FL estimates can be obtained by ML estimation on an augmented data set that consists of the original data in which observed outcomes y i were augmented by h i / 2. While we expect that Firth's correction will correct some of the small-sample bias of the ML estimates also in Poisson regression, it will supply predictions for the countsμ i = exp (x i·β + z i ), which are slightly too high because the modified score equation

A modified Firth correction to achieve unbiased prediction
Similarly, in logistic regression with rare events, Firth's penalization provides predicted probabilities that are on average higher than the observed event rate. To solve this problem, Puhr et al. 7 suggested two methods, 'FL with intercept correction' (FLIC) and 'FL with added covariate' (FLAC). Here we explore the performance of these methods in the setting of Poisson regression. FLIC consists of first obtaining the FL solution and then to correct the intercept parameter by adding a constant δ such that n i=1 y i = n i=1μi . This is achieved by using the linear predictors from the FL solution,η i = k j=0 x ijβj + z i as offsets in a second logistic regression that only estimates an intercept δ FLIC by ML. The regression coefficientŝ β j , j = 1, . . . , k, are left unchanged by FLIC, and the new intercept is given byβ FLIC 0 =β 0 +δ FLIC . FLAC starts by estimating the FL solution as well, but this is only done to compute the values of h i . Subsequently, an augmented data set is constructed, adding a pseudo observation to each original observation with event count h i / 2, and defining an 'added covariate' G such that it distinguishes original from pseudo-observations by assuming values of 0 and 1 for them, respectively. The augmented data set with the added covariate is then subjected to ML Poisson regression where an additional coefficient γ FLAC corresponding to G is estimated. For predictions, G is assigned a value of 0. As for FLIC, we have n i=1 y i = n i=1μi (see Appendix 1). While the two approaches generally give different results for logistic regression, it is remarkable that not only the FLIC estimates, but also FLAC estimates of β 1 , . . . , β k coincide with those obtained by FL for Poisson regression, and that β FLIC 0 =β FLAC 0 (see Appendix 1). Because FLIC and FLAC do not modify the FL regression coefficients, CI for the regression coefficients can be obtained out of the FL model, and in the case of data sparsity should preferably be computed by the PPL method. 5,6 Here, a (1 − α) × 100 per cent CI for a parameter β j is defined as the set of values β * j for which quantile of the chi-squared distribution with one degree of freedom. PPL CI can become asymmetric and in such a case indicate the inadequacy of the Wald method for CI estimation.
We have written a SAS macro FLACPOISSON that performs the Firth correction and the FLAC modification based on iterated data augmentation using repeated calls to PROC GENMOD (https://github.com/georgheinze/flicflac). For simplicity, we approximate the PPL CI in FLACPOISSON by evaluating the profile likelihood (PL) of the augmented data fixing the event counts of the pseudo data at the values h i / 2 obtained at the FL solution. In LogXact, 10 PPL for FL is available where the hat diagonals are iteratively updated when computing the confidence limits. We will illustrate the difference between the two methods in an example of the occurrence of complications in implant dentistry.

Exact conditional Poisson regression with median unbiased estimation
In exact conditional Poisson regression, inference is based on the exact conditional likelihood of a parameter β j conditional on the observed sufficient statistics t j ′ of all other parameters β j ′ ; j ′ ∈ {1, . . . , k} \ j, where a sufficient statistic t j ′ is given by t j ′ = n i=1 x i j ′ y i . 13 The maximum conditional likelihood estimate (MCLE) is the value of β j that maximizes its exact conditional likelihood. In case a finite MCLE does not exist, it can be replaced by a median unbiased estimate (MUE). 16 If the exact distribution is degenerate, neither MCLE nor MUE can be computed. Exact conditional Poisson regression does not provide an estimate of the intercept and thus cannot be used for prediction. The implementations in SAS/PROC GENMOD, in Cytel studio, and in Cytel's PROC LOGXACT add-on for SAS 10 allow for computation of exact and mid-p CIs and more details can be found in the software documentation. 17 In the remainder, we will refer to this method as EP regression.

Bayesian data augmentation
Bayesian estimation with properly specified priors also solves the separation issue. To overcome problems of computing time and diagnostics in Bayesian analysis with Markov chain Monte Carlo algorithms, Sullivan and Greenland 18 illustrate the use of data augmentation to specify normal priors, including an example for Poisson regression. The spread of the normal prior for a regression coefficient is determined by the width of a prior interval for the associated IRR. For example, if a 95% prior interval for the IRR of (1/1000, 1000) is specified, then based on a normal distribution the prior standard error is log (1000) / 1.96 = 3.52, suggesting a prior variance v of 3.52 2 = 12.39. The prior distribution can be specified by adding pseudo-observations, one for each regression coefficient, with a value of 1/S for the associated covariate and 0 for all other covariates, where S denotes an approximation constant (higher values giving a better approximation). No pseudo-observations are specified for the intercept. The event count of each pseudo observation is set to y = S 2 / v, and the corresponding offset to z = log(y). While Sullivan and Greenland 18 chose S = 25, we used S = 10,000 to obtain from the pseudo-observations 95% PL CI for the regression coefficients that were symmetric up to the third decimal place. After data augmentation, maximum posterior estimates can be computed by applying ML methods to the augmented data, and intervals for them from PL. Bayesian data augmentation (BDA) results in unbiased predicted counts in the sense that n i=1 y i = n i=1μi .

Methods
The methodology of our simulation study is described as recommended by Morris et al. 19 Aims: We intended to explore and compare the performance of different estimation methods for Poisson regression with sparse data.
Data-generating mechanisms: To capture a plausible context, we considered a data-generating scheme as described by Binder et al. 20 and Zöller et al. 21 First, we generated data sets of n observations on 10 covariates X 1 , . . . , X 10 of different, prespecified distributions that were obtained by applying certain transformations to normal random variables Z 1 , . . . , Z 10 sampled from a standard multivariate normal distribution with correlation matrix Σ ( Table 2). In this way, we generated four binary covariates X 1 , . . . , X 4 , two ordinal covariates X 5 and X 6 with three levels, and four continuous covariates X 7 , . . . , X 10 . The correlation structure of the variables Z 1 , . . . , Z 10 is transferred to the variables X 1 , . . . , X 10 in a somewhat attenuated way.
We kept all other β j fixed with β 2 , β 4 = 0.69; β 3 = −0.69; β 5 = 0.35; β 6 = −0.35; β 7 , β 9 = 0.69 / ISR; β 8 , β 10 = −0.69 / ISR, where ISR was the intersextile range (difference between fifth and first sextile) of the corresponding continuous covariate. The intercept β 0 was chosen such that the marginal event incidence was ∼0.1. We simulated a rate multiplier ψ following our example on the occurrence of complications in implant dentistry (see below) by sampling from a zero-truncated Poisson distribution (restricted to numbers greater than 0) with mean 1.6. The outcome (number of events) y i was then drawn from a Poisson distribution with parameter μ i = exp (η i ) = exp (β 0 + x i1 β 1 + · · · + x ik β k )ψ i . The sample size was determined by fixing the expected EPV ratio at desired values typical for sparse epidemiological data sets (3, 5, or 10). This resulted in 81 possible combinations of simulation parameters. We simulated 10,000 data sets with each of those combinations.
Methods: We analysed each simulated data set by fitting a Poisson regression model including log ψ i as an offset variable and estimating the regression coefficients by maximizing the likelihood (ML), using BDA based on prior intervals for the IRR of (1/1000, 1000) for binary and ordinal covariates and of (1/100, 100) for continuous covariates, using FL, and using FLAC. We also included EP regression in simulated scenarios where it was computationally feasible, that is, with k ≤ 5 and n ≤ 250.
We estimated 95% CI for regression coefficients by the Wald method for ML using PROC GENMOD, and by likelihood profiles applied to the augmented data for BDA, FL, and FLAC (FLACPOISSON macro). In the case of separation, the values for ML are those reported by PROC GENMOD at the last iteration.
Exact point and interval estimates and mid-p corrected CI were obtained by PROC LOGXACT. 10 Estimands: The estimands in this study were the expected event counts μ i and the regression coefficient β 1 . We also evaluated the frequency of nonexistence of ML estimates (separation).
Performance measures: For point estimates of β and predictions, we evaluated bias and root mean squared error (RMSE) × n √ . For predictions, the mean squared prediction error was obtained as n −1 n i=1 (μ i − μ i ) 2 in each data set and then averaged over all simulated data sets in a scenario. The root of the averaged mean squared prediction error (root mean squared prediction error (RMSPE)) times n √ is reported. For CI of β, we evaluated left/right-tailed one-sided coverage rates (nominal levels 0.975) and power (probability to exclude 0). We summarized the simulation results graphically using nested loop plots. 23,24

Incidence of separation
The incidence of separation was generally higher in scenarios with smaller sample sizes and with larger negative values of β 1 , see Figure S1. The latter phenomenon is a consequence from the imbalance of X 1 , where with negative β 1 events were less likely in the less frequent group (X 1 = 1), and data sets with no events when X 1 = 1 occurred more frequently. Among the scenarios with a given EPV ratio and a given value of β 1 , scenarios with 10 covariates often had the fewest separated data sets. This might seem counterintuitive since, for a fixed non-separated data set, omitting covariates can never induce separation. However, in our simulation study, scenarios with 2, 5, and 10 covariates do not only differ in the number of covariates but also in the type of covariates, the magnitude of the intercept, and the sample size.
We only included EP in the comparison of methods for simulation scenarios with n ≤ 250 and k ≤ 5. For larger data sets, its application was computationally not feasible or not possible because with continuous covariates (when k = 10) degenerate distributions of sufficient statistics were encountered for which no inference is possible. The MCLE in EP did not exist and had to be replaced by the MUE for almost the same data sets where ML estimation failed, see Figure S1. Only with large positive values of β 1 and small-sample sizes, there were considerably more data sets where the MCLE had to be replaced by the MUE than separated data sets. Finally, there were a few datasets where neither the MCLE nor the MUE existed (at most 0.7% of data sets, as observed for the scenarios with n = 60, k = 2 and β 1 = −log(16) or β 1 = −log(4)).

Predictions
The description of the prediction performance is restricted to the methods ML, FL, FLAC, and BDA, since XL does not allow for predictions. Across all methods, predictions were more accurate in terms of RMSPE(μ) for simulation scenarios with fewer variables or with higher EPV ratio. Throughout all evaluated scenarios, FLAC and BDA yielded the most accurate predictions, followed by ML and FL, which, because of the overprediction, performed worst, see Figure 1. With FL, the bias in predictions is fully characterized by the number of covariates, in fact the sum of predictions n i=1μi overestimates the sum of observed counts n i=1 y i by (k + 1) / 2. As described in Section 3, ML, FLAC, and BDA yield unbiased predictions in the sense that n i=1 y i = n i=1μi . Figure S2 shows the scaled bias and RMSPE(μ) in relation to the true incidence exemplarily for one simulation scenario.

Regression coefficients: point estimates
When β 1 was evaluated as estimand, FLAC was not considered separately as it yields the same regression coefficients as FL. Concerning accuracy of regression coefficients, FL, EP and BDA performed similarly well, with some advantage of BDA for extreme, negative values of β 1 and some advantage of FL and EP for β 1 close to 0, see Figure 2. A major drawback of EP is that it is not applicable with larger sample sizes. The worse performance of ML is partly due to the occurrence of separation, but also to a generally higher inaccuracy for data with a low EPV ratio.
ML lead to a large negative bias because of divergent estimation caused by separation when β 1 was negative, see Figure S3. By contrast, FL, EP and, to a lesser extent, BDA lead to a positive bias in scenarios with large negative

Regression coefficients: CIs
Left-tailed and right-tailed coverage rates of 95% two-sided CIs are depicted in Figure 3, and the power to exclude β 1 = 0 is shown in Figure S6. For large negative β 1 , all methods yielded higher than nominal right-tailed coverage often in combination with low power. In these scenarios ML-Wald intervals showed undercoverage of their left tails despite the considerable amount of separated data sets (cf. Figure S1). This left-tailed undercoverage for large negative β 1 was even more severe with FL-PPL intervals. While exact CIs were over conservative, mid-p intervals could correct the conservatism, resulting in increased power and coverage close to the nominal one-sided level of 97.5%. BDA-PL intervals generally were preferable, both in terms of coverage and power. Median width of CIs is described in Figure S7.

Implant dentistry study
Feher et al. 25 studied risk factors for complications in implant dentistry. Risk factors were assessed in 1133 patients undergoing 2405 implantations. We used Poisson regression to model the number of haematological complications by the risk factors age (in decades), smoking (no smoking, light smoking, and heavy smoking), and diabetes mellitus and considered the number of implantations performed per patient as the rate multiplier. Smoking was coded using 'ordinal coding', that is, Figure 3. Left-tailed and right-tailed coverage of 95% two-sided CIs for regression coefficient β 1 for all 81 simulation scenarios. CIs were estimated using the Wald method with ML estimation, likelihood profiles with FL, exact interval estimates (EP), and mid-p corrected CIs (EP:Mid-p) with EP regression and likelihood profiles with BDA. The upper halves of the plots describe the right-tailed coverage, lower halves the left-tailed coverage. Nominal levels of 97.5% are marked by dashed lines. Rows correspond to the number of covariates in the respective simulation scenario, columns to the EPV ratio, and ticks on the x-axis to the true value of β 1 . Grey step functions below the plots indicate sample size. For one scenario (n = 60, k = 2, β 1 = log(16)) the left-tailed coverage of the CIs for ML exceeded the lower limit of the plotting range. CI: confidence interval; ML: maximum likelihood; BDA: Bayesian data augmentation; EP: exact Poisson; FL: Firth's likelihood penalization; EPV: events per variable. two dummy variables were defined contrasting light smokers from non-smokers, and heavy smokers from light smokers. Table S1 contains the basic descriptives for these variables.
ML analysis failed to converge because no haematological complications were observed for light smokers (Table 3). This caused the regression coefficients of the two dummy variables associated with smoking to diverge. PROC GENMOD reported arbitrarily large estimates and standard errors for the corresponding regression coefficients but rather than reflecting the large standard errors, the reported Wald CI was collapsed at the point estimates.
By contrast, FL, BDA, and EP using the MUE gave plausible estimates for all variables (Table 3). These methods also supplied 95% CI, which provided some evidence that light smokers experienced fewer haematological complications than non-smokers or heavy smokers. When comparing the two methods of estimating CI for FL, as expected, fixing the pseudo data with weights computed at the maximum penalized likelihood estimate ('PL CI' from the augmented data, used in the simulation study) led to slightly narrower intervals than iterating the weights ('PPL CI'). We also compared the impact of specifying different priors with BDA. Compared to weakly informative priors ('BDA 100/1000': 95% prior intervals for the IRR extending to 100 for age and to 1000 for the binary covariates, used in the simulation study), with narrower priors ('BDA 5/50': 95% prior intervals extending to 5 for age and to 50 for the binary covariates), all point estimates and nearly all CI were pulled towards unity. The effect of changing the prior was particularly strong for the two IRRs corresponding to smoking which caused the separation problem. Employing weakly informative priors resulted in CI supporting the hypothesis that light smokers experienced fewer complications than non-smokers and heavy smokers, while with narrower priors the CI included unity. EP generally resulted in IRR and CI closer to the estimates by BDA 5/50 than to the estimates by other methods.
While 37 hematologic complications were observed, 39.5 complications were predicted by FL, which estimated the intercept at −4.3728. Applying FLAC changed the intercept to −4.4382, and recalibrated the total number of predicted complications to the observed number of 37. Because age was centred at 50 years, exp(β 0 ) expresses the risk of a complication with one implantation for a 50-year-old non-smoking non-diabetic person. This risk was estimated at 1.26% by FL and at 1.18% by FLAC, and for a 70-year-old diabetic at 21.0% or 19.7%, respectively.

Discussion
We proposed and investigated Poisson regression methods to deal with the problem of separation, which leads to nonexistence of ML estimates. We adapted two modifications of FL, namely FLIC and FLAC, which were originally developed to debias predictions in logistic regression, for Poisson regression. It turned out that in Poisson regression FLIC and FLAC lead to the same estimation method. This method, which we refer to as FLAC, competed well with alternative approaches such as EP regression, which needs special software and is only applicable for small-sized problems, or BDA, which is easy to implement but crucially depends on the choice of the width of the prior distribution. A possible advantage of BDA could arise if a model with many covariates should be fitted. It is essentially equal to ridge regression, and unlike FL can handle situations where the number of covariates exceeds the number of events by regularizing parameter estimates. In our application and simulations we fixed the variance of the prior distribution, which is inversely proportional to the penalty parameter in ridge regression. Optimizing that penalty parameter by, for example, cross-validation invalidates inference about regression coefficients, is not robust to separation 14 and can lead to instable results. 26 To sum up, for data sets fitting into the framework of this study, that is, with an EPV ratio of 3 or higher and moderate correlation between covariates, we advocate using FL as it does not need any user input or optimization of a penalty parameter and is computationally feasible, while showing good performance. In some situations, count outcomes can be naturally reinterpreted as dichotomous outcomes, for example, in the coronavirus testing study where we can either count the number of infections per working place or determine whether a person is infected or not. These data allow for analysis via Poisson regression as well as via logistic regression. It was not the aim of this study to provide guidance on which analysis method to prefer with sparse data, but primarily the decision should depend on the estimand of interest, that is, whether risk ratios or odds ratios should be estimated. Our study employed a realistic design for simulations that resembled data one could typically see in epidemiological studies. Such a design is suitable to draw conclusions on the relative performance of methods in practically relevant situations. While in the simulation study we used R for data generation and summarizing results, we focused on the SAS software for fitting the Poisson models. SAS is widely used among epidemiologists, and with the Cytel SAS procedures, robust and efficient software was available to include EP regression in our comparisons. Hence we employed Cytel's PROC LOGXACT for EP regression even if an implementation of EP regression is readily available in PROC GENMOD. We provide a SAS macro to apply the Firth correction and its modification FLAC in Poisson regression. The median unbiased estimator implemented in PROC LOGXACT has been described to suffer from extreme shrinkage in case the exact conditional distribution of the sufficient statistic is nearly degenerate. Therefore, an alternative estimator based on maximizing the conditional penalized likelihood was proposed. 27 We did not include it in our comparison as Heinze and Puhr 27 found it to be very similar to the FL estimator. We also did not consider the median bias-corrected estimator of Kosmidis et al. 28 R code for fitting a Poisson model with FL is also available, 29 however, without the possibility to invoke the FLAC extension or the estimation of PPL CI. We are also not aware of software packages in R which allow fitting EP regression models. Our SAS macro FLACPOISSON, a SAS macro to implement BDA and further SAS macros implementing FL and FLAC for logistic, conditional logistic and Cox regression are available on the GitHub repository https://github.com/georgheinze/flicflac. A further public repository, https://github.com/georgheinze/PoissonF, contains the aggregated data set of the implant dentistry study and code to reproduce its analysis, all codes used to conduct the simulation study, and an R markdown file which summarizes its results.
The FLAC method lends itself to several extensions. Most naturally, it can easily accommodate overdispersion by including the estimation of a dispersion parameter. 30 This can already be achieved with our SAS macro which is based on PROC GENMOD. Further work could be done to investigate the advantages of considering FLAC for this and other extensions, such as zero-inflated Poisson models and Poisson hurdle models. Theorem 1. Let the n × (k + 1) design matrix X be of full rank. ML estimation does not give a finite solution for the Poisson regression model if and only if there exists a non-zero vector γ * ∈R k+1 with x i· γ * =0 for i with y i > 0 and x i· γ * ≤ 0 for i with y i = 0. (A1) Proof. The log-likelihood is given by . Let β, γ ∈ R k+1 be arbitrary vectors with γ ≠ 0 and let k > 0 be a positive scalar. Differentiating ℓ(β + kγ) with respect to k yields − exp (x i· β + x i· kγ + z i )x i· γ + x i· γy i We will first show that if there exists γ * fulfilling properties A1, then the log-likelihood function ℓ(β) is always increasing in the direction of γ * , in other words there does not exist a finite β maximizing the log-likelihood. For γ * satisfying conditions A1, the directional derivative reduces to This expression is greater than 0 since the exponential function is positive and x i· γ * ≤ 0 for all i with y i = 0 and x i· γ * < 0 for some i with y i = 0. (The inequality x i γ * ≤ 0 has to be strict for at least one observation because otherwise the full rank assumption would be violated.) This proves that for any β and for any k > 0 we have ℓ(β + kγ * ) > ℓ(β), that is, we cannot find an ML solution maximizing ℓ.
Next, we will show that if there does not exist γ * with the properties A1, then the log-likelihood has a maximum. Assume that there does not exist γ * satisfying A1. Then for every γ ∈ R k+1 at least one of the following two conditions must be true: 1. there is j such that x j γ ≠ 0 and y j > 0, 2. there is j such that x j γ > 0 and y j = 0.
Denote by ℓ i (β) = − exp (x i· β + z i ) + (x i· β + z i )y i − log (y i !) the summand in the log-likelihood function corresponding to the ith observation. One can show that if any of the two conditions above holds, then we have lim k ∞ ℓ j (β + kγ) = −∞. Since the components ℓ i (β) have a common upper bound we can find a positive scalar k such that ℓ(β + kγ) = i ℓ i (β + kγ) < i ℓ i (β) = ℓ(β) for any β, γ ∈ R k+1 and k > k. This means that following the log-likelihood from any starting point β in any direction γ will finally yield a decrease in the log-likelihood, guaranteeing the existence of a finite ML solution. □ Combining the two equations, we conclude that i y i = i exp (x i· β + z i ) = iμi .

A.2 Properties of FLAC
A.2.2 FLAC estimates of β 1 , . . . , β k agree with the FL estimates in Poisson regression Consider the score equations, U FL 0 , . . . , U FL k , of FL given as follows: The corresponding equations for FLAC are given by: x ij exp (x i1 β 1 + · · · + x ik β k + z i ) = 0. (A2) This shows that estimates of β 1 , . . . , β k by FL and FLAC are identical.
A.2.3 FLIC and FLAC give identical estimates β 0 , . . . , β k in Poisson regression We have shown above that FLAC estimates of β 1 , . . . , β k agree with the FL estimates. The same is true for FLIC by definition. Moreover, for both methods the sum of the predicted counts iμi is equal to i y i . This implies that FLIC and FLAC also give identical estimates for the intercept β 0 .