Development and evaluation of probabilistic forecasting methods for small area populations

Planning and development decisions in both the government and business sectors often require small area population forecasts. Unfortunately, current methods often produce forecasts that are inaccurate, particularly for remote areas and those with smaller populations. Such inaccuracy necessitates the development and evaluation of methods to forecast and communicate forecast uncertainty; however, little research has been conducted in this domain for small area populations. In this paper, we evaluate a set of probabilistic forecasting methods, including autoregressive integrated moving average (ARIMA), exponential smoothing (ETS), THETA, LightGBM and XGBoost, to produce point forecasts and 80% prediction intervals for Australian SA2 small area populations. We also investigate methods to combine the intervals to produce ensemble forecasts. Our results show that individual probabilistic methods generally produce prediction intervals which underestimate forecast uncertainty. Combining forecasts improves the overall accuracy of point forecasts and the coverage of their intervals; however, coverage still tends to be less than the expected 80% for all but the most conservative combination method.


Introduction
One of the key elements of urban planning is the forecasting of local area populations. Such small area population forecasts are used widely by both governments and businesses to guide planning and development decisions. Current practices often produce forecasts which are inaccurate, particularly for areas which are remote and have small populations (Wilson and Rowe, 2011). The methods applied are generally deterministic, producing point forecasts which give users no indication of uncertainty. However, planning decisions would benefit from an understanding of the uncertainty of the predictions being used (Keilman, 2020). A recent review of the small area population forecasting literature found that very few projection methods incorporate any indication of forecast uncertainty (Wilson et al., 2021a). As pointed out by Keyfitz (1981), whilst demographers cannot be held responsible for providing inaccurate forecasts, they are responsible for '…warning one another and our public what the error of our estimates is likely to be' (Keyfitz, 1981, p. 579). Therefore, given the extensive use of small area forecasts, it is vital that methods to calculate and communicate forecast uncertainty be developed and evaluated.
Statistical agencies often communicate forecast uncertainty by applying a scenario-based approach and creating low, medium and high variants of deterministic forecasts. However, such an approach does not explicitly provide uncertainty measures, and it does not enable forecast users to ascertain the likelihood that a future population will fall within a certain range (Tayman, 2011). Furthermore, forecast users often give little attention to the high and low variants and focus on the medium forecast (De Beer, 1997). A more useful approach is to create forecasts with prediction intervals, which give a predicted range into which a future value is expected to fall, provide a better representation of uncertainty, and encourage forecast users to consider a range of population outcomes in their decision-making.
Prediction intervals can be generated using probabilistic forecasting methods, empirical investigations of past forecast errors, or a combination of the two (Rayer et al., 2009). Whilst deterministic methods remain the norm, probabilistic methods are now commonly applied to generate forecasts for countries (e.g. Azose et al., 2016; Keilman et al., 2002; Okita et al., 2011; United Nations, 2019; Wilson and Bell, 2004) and large sub-national geographic areas (e.g. Setiawan et al., 2021; Tayman et al., 2007; Wilson and Bell, 2007). As we move to smaller areas such as local council areas or census tracts, probabilistic forecasts become 'more challenging' (Statistics New Zealand, 2016, p. 78).
Quantitative, out-of-sample evaluations of the accuracy of prediction intervals for small areas are rare and challenging. Rayer et al. (2009) generated empirical prediction intervals for US counties by examining the forecast errors produced by extrapolative models. They found that 90th percentile errors for trimmed mean forecasts of the prior period provided reasonably good estimates of 90% prediction intervals for the subsequent period. Wilson et al. (2018) generated empirical prediction intervals for Australian local area population forecasts based on past error distributions. Simpson et al. (2020) similarly created empirical prediction intervals for English local authority areas by using errors of past cohort-component forecasts for prior years and applying them to later periods. A few papers have also developed and evaluated model-based prediction intervals. Zhang and Bryant (2020) developed a Bayesian approach to estimating and forecasting internal migration for Iceland. Bryant and Graham (2013) used a Bayesian approach to estimate and forecast the components of population change, whilst Jiang et al. (2007) used an approach based on the Hidden Markov model to create point forecasts and prediction intervals for suburbs of Canberra, Australia. The literature on the application of machine learning algorithms to the probabilistic forecasting of small area populations is scarce, even though such methods have previously been shown to be the most accurate approaches for creating prediction intervals in the M4 (Makridakis et al., 2018) and M5 (Makridakis and Spiliotis, 2021) forecasting competitions. These M series competitions challenge forecasters to create the most accurate forecasting methods and evaluate them on a large dataset of time series (Makridakis et al., 2020).
The purpose of this research is to provide a comprehensive evaluation of probabilistic forecasting methods. We consider statistical and machine learning models, and combinations thereof, for Australian SA2 areas (Australian Bureau of Statistics, 2017a); these areas are the smallest units in the Australian Bureau of Statistics geographical hierarchy for which population statistics are regularly collected and published. This work advances prior research by Grossman et al. (2022), which investigated combination methods for Australian SA2s but focused exclusively on deterministic forecasts and included a slightly different set of individual models. Specifically, this paper has the following objectives: (1) to identify probabilistic models that produce the most accurate forecasts with the best coverage; (2) to investigate methods to combine the interval forecasts produced by our set of probabilistic models; and (3) to investigate how the results are influenced by the remoteness of the SA2 areas.
In this study, we generate population forecasts with 5- and 10-year forecast horizons for Australian small areas. The "Results" section presents the model evaluation results, which are then discussed in the "Discussion" section. The aims, findings, and implications are summarised in the "Conclusions" section.

Data
We prepared and evaluated probabilistic forecasts with 80% prediction intervals using the 2016 mid-year Estimated Resident Population (ERP) totals for Australian SA2 areas (Australian Bureau of Statistics, 2017a). ERPs are produced by the Australian Bureau of Statistics using the place of usual residence data collected during the five-yearly census. Data on births, deaths and migration, together with administrative data, are used to adjust sub-state ERPs in census years and to estimate populations between censuses. ERPs are revised after the subsequent census (Queensland Government Statistician's Office, 2019). According to post-enumeration survey findings, the estimates are generally of good quality: for SA2s with total populations greater than 1,000, the average 2016 absolute intercensal difference was 3.5% (Australian Bureau of Statistics, 2017b).
The 2016 dataset includes population totals from 1991 to 2016; the median 2016 population across areas was 9,681. Two sets of forecasts were prepared with two different jump-off years: 10-year forecasts from 2006 and 5-year forecasts from 2011. These forecasts covered the 2066 SA2 areas whose populations remained above 100 for the duration of the base period.

Methods
In our study, we use three statistical models and two machine learning models to generate point and probabilistic forecasts. The probabilistic forecasts express the uncertainty of the model forecasts in the form of a probability distribution. Specifically, we provide the point forecast and 80% prediction intervals for each tested model. We implemented the three statistical models (ARIMA, ETS and THETA) using the 'forecast' package (Hyndman and Khandakar, 2008; Hyndman et al., 2022) in the R programming language, whereas the gradient boosting based machine learning models (LGBM and XGBoost) were developed using the packages 'lightgbm' (Ke et al., 2021) and 'sklearn' (Pedregosa et al., 2011) in Python.
We generate the prediction intervals for the statistical models under the assumption that the residuals produced by these models are uncorrelated and normally distributed. This assumption is reasonable because the statistical models (ETS, ARIMA and THETA) are capable of capturing seasonality and trend, and trend patterns dominate population datasets. In contrast to the statistical models, the machine learning models produce prediction intervals based on quantile loss, which does not require the residuals to be uncorrelated.
The 'forecast' package generates prediction intervals that widen as the forecast horizon increases, computing the bounds for a given interval from the standard deviation of the forecast distribution. We use the level parameter of the 'forecast' package to generate probabilistic forecasts with 80% prediction intervals. To produce probabilistic forecasts with the machine learning models, we use quantile loss functions that optimise each model for a specific forecast quantile.
Autoregressive integrated moving average (ARIMA). The Autoregressive integrated moving average (ARIMA) model is widely used for time series forecasting (Box et al., 2015). ARIMA models aim to describe autocorrelations in the time series data. An ARIMA model is a composition of differencing with autoregression (AR) and moving average (MA) components. Here, the AR component uses the lagged values of a time series in a regression model, whereas the MA component uses past forecast errors (residuals) in a regression model. The integration (I) term represents the degree of differencing (d) applied for the time series to become stationary. A general ARIMA(p,d,q) model can be defined as

$$X'_t = c + \phi_1 X'_{t-1} + \dots + \phi_p X'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t \quad (1)$$

In equation (1), $X'_t$ is the time series after differencing, p denotes the order of lags in the AR component and q denotes the order of error terms in the MA component. Moreover, c is a constant, $\varepsilon_t$ is a white noise error term, and $\phi$ and $\theta$ are the model parameters. We used the auto.arima function from the R package 'forecast' (Hyndman and Khandakar, 2008; Hyndman et al., 2022). This implementation uses the corrected Akaike's Information Criterion (AICc) to determine the optimal orders p and q of an ARIMA model, while the Kwiatkowski-Phillips-Schmidt-Shin test is used to identify the degree of differencing d required to make the time series stationary.
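As an illustrative sketch only (the study itself used R's auto.arima, not Python), a roughly equivalent automatic ARIMA fit with an 80% prediction interval could look as follows using the pmdarima package; the data here are synthetic.

```python
# A minimal sketch, assuming pmdarima as a Python analogue of auto.arima.
# Note: R's auto.arima selects by AICc; pmdarima's default criterion is AIC.
import numpy as np
import pmdarima as pm

rng = np.random.default_rng(0)
population = 5000 + np.cumsum(rng.normal(50, 80, size=25))  # synthetic yearly ERP

model = pm.auto_arima(population, seasonal=False)  # selects p, d, q automatically

# alpha=0.2 gives an 80% prediction interval around the point forecasts.
point, bounds = model.predict(n_periods=10, return_conf_int=True, alpha=0.2)
lower, upper = bounds[:, 0], bounds[:, 1]
```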
Exponential smoothing (ETS). The Exponential Smoothing (ETS) algorithm generates forecasts from a weighted average of past observations, with the weights of the past observations set to decrease exponentially as the observations get older. A simple ETS model can be defined as

$$\hat{X}_{t+1} = \alpha X_t + (1 - \alpha)\hat{X}_t \quad (2)$$

$$\hat{X}_{t+1} = \alpha X_t + \alpha(1 - \alpha) X_{t-1} + \alpha(1 - \alpha)^2 X_{t-2} + \dots \quad (3)$$

In equations (2) and (3), $0 \le \alpha \le 1$ refers to the smoothing parameter that controls the influence of the past observations in the time series $X_1, \dots, X_t$ on $\hat{X}_{t+1}$, the one-step-ahead forecast. In this way, α assigns higher weights to more recent observations, whereas smaller weights are assigned to older observations. In our experiments, we used the ets function from the 'forecast' package (Hyndman and Khandakar, 2008; Hyndman et al., 2022). The AICc information criterion is used to determine the optimal ETS model for a given time series.
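To make the exponential weighting concrete, here is a minimal sketch of the smoothing recursion in equation (2); the paper's ets function fits a richer model family, so this illustrates the weighting logic only.

```python
def ses_one_step(series, alpha):
    """One-step-ahead simple exponential smoothing forecast.

    The recursion gives observation X_{t-j} a weight proportional to
    alpha * (1 - alpha)**j, so weights decay as observations get older.
    """
    forecast = series[0]  # initialise with the first observation
    for x in series[1:]:
        forecast = alpha * x + (1 - alpha) * forecast
    return forecast

# Forecast the next value of a short (illustrative) population series.
print(ses_one_step([980, 1010, 1025, 1060, 1100], alpha=0.4))
```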
THETA. THETA is a relatively simple statistical model, widely used by forecasting practitioners, which extrapolates the trend and the level of a series (Assimakopoulos and Nikolopoulos, 2000).
The THETA model first decomposes a time series into a new set of series, known as Theta-lines, and then applies a forecasting method to extrapolate the Theta-lines separately. Exponential smoothing methods are often used to extrapolate the Theta-lines. The THETA model can also be treated as a special case of the simple exponential smoothing (SES) with drift method (Hyndman and Billah, 2003). The final forecasts of the time series are obtained by combining the individual forecasts of the Theta-lines. Unlike ETS, the THETA model is incapable of modelling seasonal patterns. However, it can capture time series trends and often deals with short time series better than models such as ETS and ARIMA.
The THETA model performed well for forecasting yearly time series in the M3 competition (Makridakis and Hibon, 2000), and it was also included as a base model in the feature-based forecast combination (FFORMA) framework that achieved the second-best results in the M4 competition (Makridakis et al., 2018). As small area population datasets typically consist of yearly, non-seasonal time series, we included the THETA model in our evaluation of probabilistic forecasting models for small area populations. We used the thetaf function from the 'forecast' package (Hyndman and Khandakar, 2008; Hyndman et al., 2022) to obtain THETA forecasts for a given population time series.
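As a hedged Python sketch (the study used R's thetaf; statsmodels ships a comparable ThetaModel whose defaults may differ from those of 'forecast'):

```python
# A sketch only: statsmodels' ThetaModel approximates R's forecast::thetaf.
import numpy as np
import pandas as pd
from statsmodels.tsa.forecasting.theta import ThetaModel

rng = np.random.default_rng(1)
population = pd.Series(
    3000 + np.cumsum(rng.normal(30, 40, size=25)),  # synthetic yearly ERP
    index=pd.period_range("1992", periods=25, freq="Y"),
)

# Yearly population data are non-seasonal, so deseasonalisation is disabled.
fit = ThetaModel(population, deseasonalize=False).fit()
point = fit.forecast(steps=10)
bounds = fit.prediction_intervals(steps=10, alpha=0.2)  # 80% interval bounds
```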
Gradient boosting models (GBM). Gradient Boosting Models (GBM) are a popular class of machine learning algorithms based on decision trees that can be used to solve a wide range of machine learning problems. The fundamental idea of GBMs is to fit an ensemble of weak decision-tree learners sequentially, where each weak learner attempts to correct the errors of the previous model. Because this process is repeated sequentially, GBMs are also referred to as sequential ensemble decision-tree models. In the machine learning literature, Extreme Gradient Boosting (XGBoost) models (Chen and Guestrin, 2016) and Light Gradient Boosting (LGBM) models (Ke et al., 2017) are identified as the most popular variants of GBMs, owing to their accuracy and computational efficiency. The main structural difference between XGBoost and LGBM is that XGBoost uses a depth-wise algorithm to grow each decision tree, while LGBM employs a leaf-wise algorithm.
The competitive performance of LGBMs was recently demonstrated in the M5 forecasting competition, where a globally trained LGBM-based solution achieved first place in both the point and uncertainty prediction tracks, and many of the other top-ranked solutions were also LGBM-based (Makridakis et al., 2022). In our study, we use both XGBoost and LGBM architectures to generate point and probabilistic forecasts for small area populations.
Total populations vary considerably between different SA2s; some areas have populations in the 100s, others in the 10,000s. Machine learning models which are trained globally across time series often require the time series within a dataset to be on the same scale for optimal performance. Thus, we first pre-processed the time series in our dataset following the recommendations of Bandara et al. (2020) and Hewamalage et al. (2021) and applied the mean-scale transformation strategy, where each time series is divided by its mean value:

$$X_{i,\text{normalised}} = \frac{X_i}{\frac{1}{k}\sum_{t=1}^{k} X_{i,t}}$$

where $X_{i,\text{normalised}}$ represents the ith normalised population time series, k is the number of observations in time series $X_i$, and $i \in \{1, 2, \dots, n\}$, with n the total number of time series in the dataset.
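A minimal sketch of this mean-scaling step (variable names are ours, not the paper's):

```python
import numpy as np

def mean_scale(series_list):
    """Divide each series by its own mean so all SA2s share a common scale."""
    return [np.asarray(x, dtype=float) / np.mean(x) for x in series_list]

# Two areas on very different scales become directly comparable.
scaled = mean_scale([[250, 260, 255, 270], [41000, 42500, 43900, 45200]])
```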
After normalising each time series, we applied the Moving Window (MW) transformation strategy to convert the time series observations into input and output frames, following the recommendations of Hewamalage et al. (2021). The generated input and output frames were then pooled together to train a global model. Figure 1 shows an example of a single input and output window. Here, the input window sequence $\{x_1, x_2, \dots, x_N\}$ has size N and the output window sequence $\{y_1, y_2, \dots, y_M\}$ has size M. Following the Multi-Input Multi-Output (MIMO) strategy for multi-step forecasting (Hewamalage et al., 2021), we made the size of the output window M identical to the intended forecast horizon. MIMO is a direct forecasting strategy widely used in the forecasting literature (Ben Taieb et al., 2012) that enables direct forecasts of all future values up to the intended horizon. As the default implementations of XGBoost and LGBM can only be trained to produce a single output from a given set of inputs, we trained a separate model for each forecast horizon; in other words, we train M models to generate forecasts for the intended horizons. In our experiments, we set M to 5 and 10 to generate 5-year-ahead and 10-year-ahead probabilistic forecasts, respectively. As exogenous variables, we used the mean and the standard deviation of each input frame to train the XGBoost and LGBM models, as sketched below.
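The windowing step can be sketched as follows (window sizes and names are illustrative; the study sets the output size M to the 5- or 10-year horizon):

```python
import numpy as np

def moving_windows(series, n_in, m_out):
    """Slide over one normalised series, producing (input, output) frame pairs."""
    series = np.asarray(series, dtype=float)
    pairs = []
    for start in range(len(series) - n_in - m_out + 1):
        x = series[start : start + n_in]                 # input window
        y = series[start + n_in : start + n_in + m_out]  # output window
        # The paper also feeds each input frame's mean and standard
        # deviation to the GBMs as exogenous features.
        features = np.concatenate([x, [x.mean(), x.std()]])
        pairs.append((features, y))
    return pairs

# Frames from all SA2 series are pooled to train a single global model.
pairs = moving_windows(range(1, 21), n_in=5, m_out=5)
```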
As discussed in Januschowski et al. (2021), GBMs can be trained with any loss function. Therefore, we use the quantile loss function to train the XGBoost and LGBM models to generate probabilistic forecasts for the required prediction intervals. Given a forecast $\hat{y}_t$ and actual observation $y_t$, the quantile loss $L_t$ for a given quantile τ at time t is defined as

$$L_t(\hat{y}_t, y_t) = \tau \max(y_t - \hat{y}_t, 0) + (1 - \tau)\max(\hat{y}_t - y_t, 0) \quad (4)$$

In equation (4), as τ lies within the 0-1 range, the first term becomes positive when $y_t \ge \hat{y}_t$, representing the under-forecasting scenario, whereas the second term becomes positive when $\hat{y}_t > y_t$, representing the over-forecasting scenario. When τ is set to 0.5, under-forecasting and over-forecasting are penalised by the same factor, producing median forecasts. In our study, to obtain the 80% prediction interval, we use τ values of 0.90 and 0.10, which generate the required upper and lower forecasts.
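Equation (4), often called the pinball loss, can be written directly as a short function; this is a sketch of the loss itself, not of either library's internal implementation.

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile tau (0 < tau < 1)."""
    diff = y_true - y_pred
    # tau * diff dominates when under-forecasting (diff > 0);
    # (tau - 1) * diff dominates when over-forecasting (diff < 0).
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# At tau = 0.9, under-forecasting by 10 costs 9x an equal over-forecast.
print(quantile_loss(np.array([100.0]), np.array([90.0]), tau=0.9))
print(quantile_loss(np.array([100.0]), np.array([110.0]), tau=0.9))
```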
To implement the LGBM model, we use the lightgbm function from the Python package 'lightgbm' (Ke et al., 2021), whereas the GradientBoostingRegressor function from the Python package 'sklearn' (Pedregosa et al., 2011) is used to run the XGBoost models.
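A hedged sketch of how the two interval bounds could be fitted with these packages (the training arrays are synthetic placeholders; the study's hyper-parameters are listed in Tables 1 and 2):

```python
import numpy as np
import lightgbm as lgb
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder pooled frames standing in for the moving-window output.
rng = np.random.default_rng(0)
X_train = rng.normal(1.0, 0.1, size=(200, 7))
y_train = X_train[:, :5].mean(axis=1) + rng.normal(0, 0.05, size=200)

# One model per bound: tau = 0.10 and 0.90 bracket an 80% interval.
lgbm_lower = lgb.LGBMRegressor(objective="quantile", alpha=0.10).fit(X_train, y_train)
lgbm_upper = lgb.LGBMRegressor(objective="quantile", alpha=0.90).fit(X_train, y_train)

gbr_lower = GradientBoostingRegressor(loss="quantile", alpha=0.10).fit(X_train, y_train)
gbr_upper = GradientBoostingRegressor(loss="quantile", alpha=0.90).fit(X_train, y_train)
```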
XGBoost and LGBM models contain numerous hyper-parameters which need to be optimised for a given dataset. To achieve this, we created a validation dataset from the pre-processed training data (the input and output window tuples), so that the model hyper-parameters could be optimised on an unseen portion of the training data. We used an 80%/20% split to divide the original pool of windows into training and validation data. We used Hyperopt (Bergstra et al., 2013), a Python implementation of Bayesian hyper-parameter optimisation, to identify the optimal values for these hyper-parameters. Tables 1 and 2 summarise the ranges of hyper-parameter values explored for the LGBM and XGBoost models.
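The tuning loop can be sketched with Hyperopt as below; the search space and the dummy objective are illustrative stand-ins, since the true objective would refit a GBM and score it on the 20% validation split.

```python
from hyperopt import Trials, fmin, hp, tpe

# Illustrative search space; the actual ranges are listed in Tables 1 and 2.
space = {
    "num_leaves": hp.quniform("num_leaves", 16, 256, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

def objective(params):
    # In the study this would train an LGBM/XGBoost on the 80% split and
    # return its quantile loss on the 20% validation split. A smooth dummy
    # surrogate is used here so the sketch runs standalone.
    return (params["learning_rate"] - 0.1) ** 2 + params["num_leaves"] / 1e4

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
```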

Ensemble models
Combining forecasts often produces improved accuracy due to a reduction in model variance and bias (Schapire, 1999; Breiman, 2001). Most of the literature on combination methods focuses on point forecasts. Taking the mean of a set of point forecasts is one of the most widely used ensembling techniques (Timmermann, 2006; Genre et al., 2013). Investigations of methods to combine interval forecasts have been rarer, particularly in demography. Grushka-Cockayne and Jose (2020) investigated several methods for the combination of interval forecasts using the M4 competition results (Makridakis et al., 2018). Here we consider three of the interval combination methods which performed relatively well in Grushka-Cockayne and Jose (2020) to create four ensemble models: STAT, ALL-average, ALL-envelope and ALL-inner-trim. For each of these models, we produce point, upper and lower forecasts. The point forecasts of the ALL models are created using the simple mean of the mean forecasts generated by the five constituent individual models. The point forecast of the STAT model is created using the simple mean of the three statistical methods. Taking the mean of a set of forecasts has previously been shown to work well for small area population forecasts (Grossman et al., 2022). The methods used to create the lower and upper forecasts are described below.
Simple mean. For the ALL-average and STAT ensemble models, the upper forecast $U_a$ is calculated as the mean of the upper bounds of the 80% prediction intervals generated by the included models $j = 1, \dots, n$:

$$U_a = \frac{1}{n}\sum_{j=1}^{n} U_j$$

where n is the number of included models. The ALL-average model includes all five of the individual models, whilst the STAT model includes only the three statistical models.
The lower bound $L_a$ is calculated as the mean of the lower bounds generated by the included models:

$$L_a = \frac{1}{n}\sum_{j=1}^{n} L_j$$

Envelope. For the ALL-envelope ensemble model, the upper bound $U_e$ is equal to the largest upper bound among the included forecasts

$$U_e = \max_{j} U_j$$

and the lower bound $L_e$ is equal to the smallest lower bound among the included forecasts

$$L_e = \min_{j} L_j$$

In this way the interval of ALL-envelope covers the intervals of all the constituent models. This is a conservative approach to forecast uncertainty which may be helpful for small area population forecasting given the high level of expected forecast error.
Interior trimming. For the ALL-inner-trim ensemble model, the upper bound $U_{it}$ is calculated by averaging the $(n-1)$ largest upper bounds, which equals the four largest upper bounds in our application of the model:

$$U_{it} = \frac{1}{n-1}\sum_{j=2}^{n} U_{(j)}$$

where $U_{(1)} \le \dots \le U_{(n)}$ denote the ordered upper bounds. The lower bound $L_{it}$ is equal to the average of the $(n-1)$ smallest lower bounds, which equals the four smallest lower bounds in our application of the model:

$$L_{it} = \frac{1}{n-1}\sum_{j=1}^{n-1} L_{(j)}$$

where $L_{(1)} \le \dots \le L_{(n)}$ denote the ordered lower bounds. This method provides a more conservative interval than using a simple mean (Jose et al., 2013; Yaniv, 1997).
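The three combination rules reduce to a few lines; the bound values below are illustrative.

```python
import numpy as np

# Per-model 80% interval bounds for one area and horizon (illustrative).
uppers = np.array([1200.0, 1250.0, 1180.0, 1150.0, 1300.0])
lowers = np.array([900.0, 880.0, 950.0, 970.0, 860.0])

u_avg, l_avg = uppers.mean(), lowers.mean()  # simple mean (ALL-average / STAT)
u_env, l_env = uppers.max(), lowers.min()    # envelope (ALL-envelope)

# Interior trimming (ALL-inner-trim): drop the least conservative bound
# on each side, then average the remaining n - 1 bounds.
u_trim = np.sort(uppers)[1:].mean()   # drop the smallest upper bound
l_trim = np.sort(lowers)[:-1].mean()  # drop the largest lower bound
```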

Error metrics
For each model we produce point, low and high forecasts, where the latter two represent the lower and upper bounds of the 80% prediction interval. 80% prediction intervals are commonly presented with probabilistic population forecasts, including those produced by the United Nations (e.g. United Nations, 2019). The accuracy of point forecasts is assessed using the absolute percentage error (APE), a standard metric in population forecasting research (Rayer, 2007). At time t, the absolute percentage error equals

$$\text{APE}_t = \left|\frac{A_t - F_t}{A_t}\right| \times 100$$

where $A_t$ is the actual population at time t and $F_t$ is the mean population forecast at time t. The median absolute percentage error (MedAPE) and the mean absolute percentage error (MAPE) are used to summarise the APE across all small areas for each forecast horizon. Population forecasts generally produce APE distributions with a long tail of errors. The impact of these errors is smaller on the MedAPE than on the MAPE; therefore, we focus our analyses and discussion on the MedAPE.
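A short sketch of the point-forecast metrics (the arrays are illustrative):

```python
import numpy as np

def ape(actual, forecast):
    """Absolute percentage error per area, as in the APE formula above."""
    return np.abs((actual - forecast) / actual) * 100.0

errors = ape(np.array([1000.0, 5200.0, 800.0]), np.array([940.0, 5900.0, 1100.0]))
medape = np.median(errors)  # robust to the long tail of large errors
mape = errors.mean()        # pulled upward by the tail
```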
We use two metrics to assess the accuracy of the prediction intervals. The first is Coverage, which we report as the percentage of true values that fall within the 80% prediction intervals. Coverage is a simple, easy to understand metric which has previously been used in probabilistic forecast evaluations (Rayer et al., 2009; Sevcikova et al., 2018; Tayman et al., 2007). However, a method that produces very large prediction intervals will achieve high Coverage while giving unhelpful forecasts. Therefore, we also calculated the Mean Scaled Interval Score (MSIS; Gneiting and Raftery, 2007), a scale-independent metric for the evaluation of prediction intervals. MSIS is based on the interval score developed by Gneiting and Raftery (2007), whose work underpins a significant portion of the literature evaluating probabilistic forecasts, including the evaluation of national probabilistic forecasts by Keilman et al. (2020) and of the probabilistic forecasts in the M4 competition (Makridakis et al., 2020).
The formula for the MSIS is

$$\text{MSIS} = \frac{1}{h}\sum_{t=n+1}^{n+h}\frac{(U_t - L_t) + \frac{2}{\alpha}(L_t - Y_t)\mathbf{1}\{Y_t < L_t\} + \frac{2}{\alpha}(Y_t - U_t)\mathbf{1}\{Y_t > U_t\}}{\frac{1}{n-m}\sum_{t=m+1}^{n}|Y_t - Y_{t-m}|}$$

where n is the length of the time series and m is the frequency of the time series, which is equal to one as our data are yearly. $L_t$ is the lower bound of the prediction interval and $U_t$ the upper bound. h is the forecast horizon; when we evaluate the 5- and 10-year forecasts, h is set to 5 and 10 respectively. $Y_t$ is the actual value, and α is the significance level, which is 0.2 for our study as we present 80% prediction intervals. The calculation of the MSIS is conditional: if the actual value falls outside the interval, an additional penalty is added to the score. The MSIS is scale independent because the score is divided by the average seasonal difference over the length of the time series (Castle et al., 2019):

$$\frac{1}{n-m}\sum_{t=m+1}^{n}|Y_t - Y_{t-m}| \quad (13)$$
In our study the data are yearly; therefore, equation (13) gives the average absolute change in population from one year to the next over the base period. We calculate the MSIS for each area and present the mean MSIS across areas. The MSIS metric penalises methods which produce large intervals. However, we suggest that large intervals are not necessarily indicative of a bad model if those intervals accurately show that an area's population is difficult to predict and that forecasts for it are likely to be erroneous. To investigate this, for each model we consider the correlation between the absolute percentage errors of the point forecasts and the half-widths of its prediction intervals (Lee and Tuljapurkar, 1994).
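A hedged sketch of both interval metrics for a single area (m = 1 because the data are yearly; the names and values are ours):

```python
import numpy as np

def coverage(actual, lower, upper):
    """Percentage of actual values falling inside the prediction interval."""
    return np.mean((actual >= lower) & (actual <= upper)) * 100.0

def msis(train, actual, lower, upper, alpha=0.2, m=1):
    """Mean scaled interval score: interval width plus 2/alpha penalties for
    misses, scaled by the base period's mean absolute change (equation 13)."""
    width = upper - lower
    below = (2.0 / alpha) * (lower - actual) * (actual < lower)
    above = (2.0 / alpha) * (actual - upper) * (actual > upper)
    scale = np.mean(np.abs(train[m:] - train[:-m]))
    return np.mean(width + below + above) / scale

train = np.array([950.0, 980.0, 1010.0, 1025.0, 1060.0])
print(msis(train, np.array([1150.0]), np.array([1000.0]), np.array([1120.0])))
```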

Remoteness
Australian SA2 areas can be classified by their remoteness as major cities, inner regional, outer regional, remote and very remote (Australian Bureau of Statistics, 2013). Here we disaggregate our forecast results by remoteness and present tables containing the MedAPE, MAPE, Coverage, MSIS, half-widths, and the correlation between the APEs of the point forecasts and the half-widths of the prediction intervals, for each remoteness class.

Probabilistic forecast evaluation
Table 3 summarises the overall results of the probabilistic forecasting models for the 5- and 10-year forecast horizons. The most accurate method in terms of MedAPE is the ALL model. The MedAPE is the same for each of the three ALL ensemble variants because the variants differ in how their intervals are calculated, not in how their point forecasts are calculated. The ensemble models perform notably better than the individual models. The statistical models and GBMs perform similarly in terms of MedAPE.
The Coverage of our 80% prediction intervals is shown below the MedAPE in Table 3. Coverage was less than 80% for all methods except ALL-envelope; thus, for most methods the prediction intervals did not adequately capture the range within which actual populations fell. The statistical methods produced better Coverage than the GBMs. The ALL-envelope model produced the best MSIS results.
Half-widths for XGBoost and LGBM are notably smaller than for the other methods, explaining why their Coverage is lower despite comparable MedAPEs. As expected, the ALL-envelope model produces forecasts with the largest intervals. THETA had the narrowest intervals of the individual statistical methods.
Correlations between the absolute percentage errors of a model's point forecasts and the size of its half-widths are shown at the bottom of Table 3. THETA consistently produced correlations between 0.4 and 0.5, suggesting that the size of its intervals gives forecast users a better estimate of forecast uncertainty than the other models considered, although it still cannot account for most small area forecast uncertainty.
Figure 2 visualises the 2006-based 10-year forecasts for the Bellingen SA2, an outer regional area in New South Wales. The visualisation indicates that the methods which produce wider intervals in the first few years of a forecast may differ from those that produce wider intervals over the longer term. THETA produces relatively conservative intervals for year one of the forecast and may be appropriate for probabilistic nowcasting (refer to Tables S1 to S6 of the Supplementary Materials for results for all forecast horizons).

Remoteness
Previous research has shown that the accuracy of deterministic forecasts tends to be lower for remote and very remote areas (Wilson and Rowe, 2011; Wilson et al., 2021b). We investigated whether remoteness influences the performance of both the point forecasts and the interval forecasts. These results are presented in the supplementary material as Tables S7 to S12.
Table S7 presents the MedAPEs for the SA2 total population point forecasts. The ensemble models tend to outperform the individual models for major cities and inner regional areas. However, the benefit of ensembling for the MedAPE score is less clear for more remote areas, where individual models, particularly THETA, often perform better. Comparing the MedAPEs of the STAT and ALL-average models indicates that including the machine learning models via a simple mean combination tended to increase forecast accuracy for major cities; however, there was no clear pattern for the other remoteness classes. There is a 'U'-shaped relationship between remoteness and forecast accuracy: accuracy tends to be better for regional areas than for SA2s in major cities, remote areas, or very remote areas. The 'U' is asymmetrical, with the accuracy of forecasts for very remote areas being notably worse than for major cities. The MAPE results by remoteness are presented in Table S8. They are similar to the MedAPE results, although the remote areas have particularly high MAPEs for the 2006-based 10-year forecasts. Table S9 presents Coverage by remoteness for the prediction intervals. For the 5-year forecasts, Coverage tends to be worst for more remote areas; after 10 years, Coverage tends to be worst for major cities. The simple mean interval combination methods did not improve Coverage to the desired 80% level, and the inclusion of the machine learning forecasts tended to reduce interval performance, so the STAT model tends to have slightly better Coverage than the ALL-average model. ALL-envelope is the only model to consistently generate prediction intervals that cover approximately 80% of actual values across remoteness levels, jump-off years and forecast horizons (range: 65.96%-90.85%). All other models consistently underestimate forecast uncertainty.
MSIS results are presented in Table S10. The ensemble methods, particularly ALL-envelope, tended to perform better than the individual methods. The machine learning methods tended to have relatively poor MSIS scores. Most models performed relatively poorly for major cities for the 2006-based 10-year forecasts.
Table S11 presents the mean half-widths of the 80% prediction intervals by remoteness. LGBM and XGBoost have the smallest half-widths. As expected from its construction, the ALL-envelope model has the largest prediction intervals across remoteness classes. Across the models, the half-widths tend to be greatest for very remote areas and smallest for inner regional and outer regional SA2s.
Table S12 presents the correlations between the APEs generated by the point forecasts and the half-widths of the prediction intervals for each of the remoteness levels. The models which produce the highest correlations vary depending on the jump-off year, forecast horizon and remoteness. None of the models produces reliably high correlations across the remoteness classes.

Discussion
In this study, we have investigated and evaluated the use of probabilistic models for small area population forecasting. This context requires stochastic forecasting methods to be developed and implemented because of the high uncertainty involved. We considered five individual probabilistic forecasting models and four ensemble models. Our overall results indicate that ensemble models performed better when they included both statistical models and globally trained GBMs. However, an analysis by remoteness found that whilst GBMs improved the performance of ensemble models for SA2s in major cities, they did not improve performance for remote and very remote areas. Below we discuss these findings and how they addressed our aims, explore issues related to model-based prediction intervals, and consider the limitations and implications of this study.
Our first aim was to evaluate the performance of individual probabilistic models, which included statistical models and GBMs. Of these, ARIMA and ETS produced the point forecasts with the smallest MedAPEs, and the 80% prediction intervals with the highest Coverage, for the 2006- and 2011-based forecasts respectively. However, the Coverage of the best performing individual methods (54%-65%) was still considerably below what would be expected from 80% prediction intervals. The GBMs did not perform particularly well in our study: whilst their point forecasts were relatively accurate, the Coverage of their interval forecasts was poor. In contrast, the top performing methods in the M5 uncertainty competition utilised LGBM methods (Makridakis et al., 2021). The M5 competition dataset contains daily hierarchical retail sales time series, which are longer than the yearly time series in our small area dataset, and additional explanatory variables were used to develop the models. Moreover, the results of the M5 uncertainty competition showed that the benefits of the LGBM-based methods relative to their benchmark ARIMA model were most evident at the higher levels of the hierarchy; the benefit was limited at the lowest levels (Makridakis et al., 2021). Thus, GBM-based methods may be more useful for forecasting populations at higher geographical scales.
Our second aim was to investigate methods to combine the forecasts produced by our individual probabilistic models. The overall MedAPEs of the ensemble models were considerably lower than those of the individual models. Indeed, the ALL ensemble produced more accurate point forecasts in terms of MedAPE than all bar one of the unconstrained results found in an evaluation of the top M4 competition methods for point forecasts of Australian SA2 small areas (Wilson et al., 2021b). Overall, the ALL-average model performed better than the STAT model, indicating that it was advantageous to include the GBMs in the ensemble. This is in line with previous work investigating combination methods for small area population point forecasts (Grossman et al., 2022).
Our third aim was to investigate how the remoteness of small areas affected the performance of the point forecasts and prediction intervals. Across all models, forecasts are generally more accurate and have better Coverage for regional areas than for major cities or more remote areas. Similar results were found in an evaluation of deterministic methods for Australian SA2s by Wilson et al. (2021b); however, results differed for New Zealand, where major urban and rural areas generally had smaller MedAPEs than medium and small urban areas. Whilst our overall results indicate that the addition of GBMs to ensemble models improves forecast performance, the analysis by remoteness found that this is only consistently true for major cities. SA2s in major cities make up 1187 of the 2066 time series in the dataset; conversely, there are only 47 remote and 50 very remote areas. It is therefore unsurprising that the GBM forecasts are less helpful for more remote areas, which were not well represented in the training set.
Model-based prediction intervals routinely underestimate forecast uncertainty. This is a well-documented phenomenon; it was one of our findings and also one of the results of the M4 competition (Makridakis et al., 2018). Hyndman et al. (2002) found that the 95% prediction intervals generated by exponential smoothing methods covered between 71.1% and 87.5% of the series' true values. This attribute of prediction intervals arises because multiple sources of uncertainty affect the performance of forecasts. Hyndman (2014) lists four: random error, parameter estimates, model choice, and the uncertainty over whether the processes which generate the data will change in the forecast period. The intervals generated by probabilistic forecasting models generally only address the uncertainty linked with the error term (Hyndman, 2014). Using an ensemble model approach can help address some of the uncertainty linked with the choice of model. Overall, our ensemble models tended to be more accurate and provided better Coverage than the individual models. ALL-envelope is the only model which achieved Coverage at the desired 80% level, with relatively wide prediction intervals whose half-widths were approximately 10% of the point forecast after 5 years and 20% after 10 years. Whilst other models produced narrower intervals, this came at the cost of understating the level of uncertainty associated with their predictions. Bryant and Zhang (2016) used a Bayesian approach to create probabilistic forecasts for New Zealand small area emigration rates. Their 50% and 90% intervals covered 58% and 90.6% of the true values respectively. Their intervals were large, but they argued that this is necessary given that long-term forecasting is uncertain, and it prompts 'users to confront the substantial uncertainty about long-term trends' (Bryant and Zhang, 2016, p. 1337).

Conclusions
Given the inherent uncertainty in small area population forecasting, it is important to provide users with an indication of forecast uncertainty. In this study, we evaluated five individual and four ensemble models for probabilistic small area population forecasts. The individual probabilistic models produced prediction intervals that cover less than the desired 80% of true values. This is a known issue for model-based prediction intervals. We show that combination methods such as the inner-trim and envelope techniques can improve Coverage. Ensemble models which included both statistical models and GBMs tended to outperform the individual models for both point forecasts and prediction intervals. However, when we disaggregated the forecast results by the remoteness of the small areas, we found that the inclusion of GBMs in ensemble models was only beneficial for major cities. Practitioners using global models should consider whether the areas of interest are well represented by the wider dataset and investigate the use of additional variables.
One limitation of this work is that we evaluated our models using only an Australian SA2 dataset. It is important to investigate how the models perform across multiple datasets before recommendations can be made to practitioners about which probabilistic models to use. In turn, it is important for practitioners to help researchers understand what the requirements for the intervals are and how they will be used.

Figure 1 .
Figure 1. The input and output window training procedure.

Figure 2 .
Figure 2. 2006-based 10-year forecasts for the Bellingen SA2 area. Note: forecasts are shown with 80% prediction intervals indicated in light blue.

Table 1 .
The hyper-parameter ranges used to train LGBM in our experiments.

Table 2 .
The hyper-parameter ranges used to train XGBoost in our experiments.

Table 3 .
SA2 total population forecast errors: an evaluation of point forecasts and prediction intervals after 5 and 10 years. Note: The most accurate forecast is bolded. MedAPE is the median absolute percentage error; the smaller the MedAPE, the more accurate the overall forecasts. Coverage refers to the percentage of true values that fall within the projected 80% prediction intervals. MSIS is the mean scaled interval score; forecasts are penalised for having larger prediction intervals, and smaller values are better. We present the mean half-width, which is half the width of the prediction interval divided by the point forecast. The Pearson correlation coefficient between the half-widths of the prediction intervals and the absolute percentage errors of the point forecasts is given at the bottom of the table. The ALL ensembles each have the same point forecast; therefore they have the same MedAPE and MAPE results.