Estimating Ambient Air Pollution Using Structural Properties of Road Networks

In recent years, the world has become increasingly concerned with air pollution. Particularly in the global north, countries are implementing systems to monitor air pollution on a large scale to aid decision-making. Such efforts are essential, but they have at least three shortcomings: (1) they are costly and difficult to implement expediently; (2) they focus on urban areas, which is where most people live, but this choice is prone to inequalities; and (3) the process of estimating air pollution lacks transparency. In this paper, we demonstrate that air pollution can be estimated using open-source information about the structural properties of roads. We focus on England and Wales in the United Kingdom (UK), although the methods described here are not dependent on specific datasets. Our approach makes it possible to implement an inexpensive method of estimating air pollution concentrations to an accuracy level that can underpin policymakers' decisions, while providing an estimate in all districts, not just urban areas, through a process that is transparent and explainable.
Impact Statement. We show that a linear regression model using a single structural property (the length of the track and unclassified road network within 0.36% of districts in England and Wales, in the UK) can accurately estimate which districts are the most polluted. The model presents a transparent and low-cost, yet effective, alternative to more expensive models such as the one currently used by DEFRA in the UK. The model has apparent practical uses for policymakers who want to pursue clean-air initiatives but lack the capital to invest in comprehensive monitoring networks. Its low implementation cost, accessible model design, and the worldwide coverage of the dataset provide a basis for implementing systems to estimate air pollution concentrations in low-income countries.


Introduction
The world is becoming increasingly urbanised. Urban environments bring efficiencies and opportunities unavailable in rural settings, such as job opportunities, education and healthcare access. However, urbanisation comes with several problems, such as air pollution (Qing, 2018). Moreover, air pollution affects not only those who live within the urban setting itself, but also those in surrounding rural areas. Although the urban-rural gap has increased since the 1970s (a 33% decrease in rural population and a 100% increase in urban population) and air pollution has affected more urban areas (by a factor of more than 5) (Deng and Mendelsohn, 2021), there is plenty of evidence that rural areas continue to suffer the consequences of air pollution mostly generated in large urban concentrations (Agrawal, 2005; Du et al., 2018; Aunan et al., 2019).

Related Work
As the effects of climate change become more prevalent, pressure mounts on governments and policymakers to act to counter the detrimental effect on society. Among the many issues, air pollution is particularly prominent for many reasons, including its effect on health, its indiscriminate effects (it affects everyone regardless of possible differences), and its global effect (no location in the world is immune to its effects, regardless of its own contribution to the issue) (Shaddick et al., 2020). In fact, the issue is so globalised that certain nations are asking for restitution (loss and damage) from polluting countries (Johnson, 2017).
In order to have a global view of the issues, we need estimators because of the lack of real-time monitoring infrastructures. Previous works have attempted to estimate air pollution caused by road traffic in a transport network (Gualtieri and Tartaglia, 1998; Karppinen et al., 2000). These works estimate the use of the road infrastructure by vehicles alongside a dispersion model for the pollution, which results in a fine-grained pollution map for the study period over the study area. However, the issue with this approach is the data and computation needed to underpin the model. The constraints presented by the model have recently been eased by modern techniques such as Artificial Neural Networks (ANNs) (Catalano et al., 2016). However, road traffic data are still prohibitively expensive and time-consuming to collect, resulting in spatially sparse datasets. The prospect of using road traffic data for air pollution estimation over a sizeable spatial extent is further complicated by the changing traffic dynamics on roads over small distances. These constraints make it challenging to expand the method to a national level. While there are models to assess air pollution at a national level, such as the UK DEFRA Modelling of Ambient Air Quality (MAAQ) contract model, they are closed source, convoluted and expensive (Brookes et al., 2021), with the 2021 renewal contract valued at £3,800,000 (DEFRA Network eTendering Portal, 2021).
In the scientific literature, one will encounter works that are based on deep learning and others that are not. The non-deep-learning methods are further separated into deterministic and statistical methods (Liu et al., 2021). Deterministic models are argued to have limited predictive performance due to various factors, including parameter estimation (Stern et al., 2008; Pak et al., 2020). Statistical methods are further subdivided into works using classical statistics and ones based on machine learning (ML). Although statistical methods can capture interesting features such as non-linearity in the data, they are expensive to run and not able to extract complex features such as spatial correlations in historic data (Yan et al., 2021). This work explores the possibility of estimating ambient air pollution at a coarse but adequate level to inform policymakers' decisions using open-source, low-cost and non-invasive data extracted from the structural properties of road networks. Our work's final output helps free resources for direct investment into clean air initiatives rather than monitoring infrastructure.

Literature Gap
A 2019 review by Public Health England outlined air pollution as the most significant environmental threat to health in the UK, with 28,000-36,000 deaths attributable yearly to long-term exposure (Jim Stewart-Evans, 2019); unfortunately, this is also true worldwide (Shaddick et al., 2020).
The first step to tackling air pollution is knowing where pollution occurs, that is, mapping the levels of pollution around the country. The most robust method of determining air pollution within an area is to perform ground measurements with specialised equipment. However, the cost of this equipment can be prohibitive, especially if there is a need to cover large areas. Costly equipment limits the scope of air pollution monitoring networks. Even in wealthier countries such as the UK, which has reduced air pollution as a core policy goal (Eustice and of Richmond Park, 2021), the scope of the air pollution monitoring network is limited. Figure 1 shows the scope of the England and Wales automatic air pollution monitoring network, a part of the UK Automatic Urban and Rural Network (AURN). Note that, despite the name of the network, the majority of the stations are in urban environments, which leads to poor performance in rural settings and inequalities between urban and rural residents.
Given the gaps that exist in ground monitoring, model estimates for air pollution can supplement the ground observation network to ensure compliance with policy targets across the whole of the UK (Department of Environment, Food and Rural Affairs, 2019a). The UK's air pollution datasets are augmented with outputs from the UK DEFRA Modelling of Ambient Air Quality (MAAQ) model, which, like the other aforementioned models, suffers from shortcomings such as a lack of transparency; we do not actually know the details of how this model works.
The goal of this work is threefold: (i) we want to estimate pollution concentrations based on datasets that are already available in most countries (road infrastructure properties); (ii) we aim to achieve results equal to or better than those of current models using open-source input data, while also making the model itself open source, ensuring that cost is no barrier to implementing the approach and that the approach is accessible; (iii) we streamline the process as far as possible, using minimal data, to inform data acquisition for policymakers constrained by a lack of data availability, which is becoming an increasing divide between countries (Karlsson, 2002).
The amount of data collected around the world, and the fact that information about a phenomenon can be embedded in datasets collected for other purposes, mean that a lack of data for air pollution models is not necessarily an excuse for not having a model. This work shows that we can effectively use secondary datasets containing information about the phenomenon we want to model; this can be important in locations where data are collected by institutions such as non-profits, foundations and the United Nations, but no curated pollution datasets exist.
We chose the UK (England and Wales) as the study area to build a proof-of-concept model, as its air pollution monitoring data are open source and freely available. However, we see no reason why a similar approach could not be used in other locations around the world, as long as data similar to those used here are available.
When dealing with spatial data, we often have to choose the aggregation level for the model. We chose Middle Layer Super Output Areas (MSOA) as the districts under which we would aggregate the road and air pollution data. The three district designs considered can be seen in Section S1. The OS National Grid district design was ruled out because many districts had no roads, which created districts for which we could not create a feature vector. A feature vector is an ordered list of features about an observed phenomenon. Each feature describes a measurable property of the observed phenomenon, in this case, the road network within a given district. A corresponding target vector is then created, and the goal is for the model to learn the relationship between the corresponding features and targets. This study's target vector is related to observed ambient air pollution concentrations within a district. The Local Authority District (LAD) design was a viable option but had very inconsistent geographical sizes for the different districts across the study area, ranging from an area of 2.8 km² in the City of London LAD to 26,159.6 km² in the Highland LAD, alongside only 381 districts. The MSOA design offered a higher number of districts (7,201) while also providing a measure of population density, with the population for each MSOA readily available; the population density then allowed for an estimation of how urbanised an area was on a continuous spectrum from least to most urbanised.

Data and Methods
In this work, we used OpenStreetMaps (OSM) data from 2014-2019. We selected 2014-2017 for training the models, 2018 for the test set, and 2019 for data exploration and parameter searching. Data prior to 2014 are quite sparse, and data after 2019 were avoided in this first instance because of the COVID-19 pandemic, which has resulted in changes in the way people behave spatially (Santana et al., 2022) and had significant implications for air pollution across the world (Brown et al., 2021).
The OpenAir package (Carslaw and Ropkins, 2012) provides access to data from the UK Automatic Urban and Rural Network (AURN) (Stevenson et al., 2009), with measurements taken every 15 minutes. Data start from early 1973. Our study focused on 12 pollutants: Carbon Monoxide (CO), Nitrogen Oxide (NO), Nitrogen Dioxide (NO2), Nitrogen Oxides (NOx), Particles < 10 µm (PM10), Particles < 2.5 µm (PM2.5), Non-volatile PM10 (NV10), Non-volatile PM2.5 (NV2.5), Volatile PM10 (V10), Volatile PM2.5 (V2.5), Ozone (O3), and Sulphur Dioxide (SO2). The number of active AURN stations per pollutant by year is shown in Figure S4.
Inverse Distance Weighting (IDW) interpolation (Shepard, 1968) was used to achieve full spatial coverage of the study area from the AURN ground observations by creating a raster from which air pollution values at specific locations could be sampled. IDW interpolation uses a power parameter to determine the effect of neighbouring sample points on a location's value, which can be tuned for specific air pollutants that are more or less susceptible to neighbouring pollution. We experimented with a range of possible values from 0 to 3.5 to determine the power parameter value for each pollutant, selecting the value that minimised the root-mean-square error (RMSE) (Daintith, 2009) in a leave-one-out validation process; Figure S5 shows the plots for the various power parameter values that were used to choose the power parameter. Figure S6 shows the raster produced by the IDW process with the best performing power parameter for each of the 12 pollutants.
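The power parameter search described above can be illustrated with a minimal sketch. It assumes station coordinates in a projected plane and one annual value per station; the function names, the toy stations and the step of 0.25 between candidate powers are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def idw_predict(xy_known, values, xy_query, power):
    """Inverse distance weighted estimate at each query point."""
    xy_known = np.asarray(xy_known, dtype=float)
    values = np.asarray(values, dtype=float)
    xy_query = np.asarray(xy_query, dtype=float)
    # Pairwise distances, shape (n_query, n_known).
    dist = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    out = np.empty(len(xy_query))
    for i, row in enumerate(dist):
        exact = row == 0
        if exact.any():
            # A query point that coincides with a station takes its value.
            out[i] = values[exact].mean()
        else:
            weights = row ** -power
            out[i] = np.sum(weights * values) / np.sum(weights)
    return out

def loo_rmse(xy, values, power):
    """Leave-one-out RMSE: predict each station from all the others."""
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    errors = []
    for i in range(len(values)):
        keep = np.arange(len(values)) != i
        pred = idw_predict(xy[keep], values[keep], xy[i:i + 1], power)[0]
        errors.append(pred - values[i])
    return float(np.sqrt(np.mean(np.square(errors))))

def best_power(xy, values, powers):
    """Return the candidate power with the lowest leave-one-out RMSE."""
    scores = {float(p): loo_rmse(xy, values, p) for p in powers}
    return min(scores, key=scores.get), scores

# Hypothetical stations (x, y in metres) and annual mean concentrations.
stations = [[0, 0], [1000, 0], [0, 1000], [1000, 1000], [500, 400]]
annual_mean = [10.0, 12.0, 11.0, 15.0, 12.5]
power, scores = best_power(stations, annual_mean, np.arange(0.0, 3.75, 0.25))
```

A power of 0 weights all stations equally, so the estimate reduces to the mean of the stations; larger powers make the estimate increasingly local.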
We compared the resulting raster to the UK Government DEFRA air pollution model output for the same year, 2019 (Department of Environment, Food and Rural Affairs, 2019b), to verify the suitability of using interpolation to create a raster from ground observations.The dataset from the DEFRA model gives point estimates with a 1km resolution.We sampled the raster produced at the exact location of the 1km point, which we then aggregated to the district level, giving a value for the pollution in a given area.
We calculated the Pearson correlation coefficient (Benesty et al., 2009) for the DEFRA and interpolated outputs against road length per district to ensure that the relationship between the datasets was similar. We obtained correlation values of 0.95 and 0.94, indicating that the datasets describe the same phenomenon and that the IDW process is able to replicate the point sample annual observations on a complete spatial scale. The Spearman rank-order correlation coefficient (Spearman, 1961) was also calculated between the DEFRA and interpolated districts to ensure they had a similar ordering of the most and least polluted districts. The correlation value was high at 0.98, indicating that there is also agreement between the two datasets as to which districts are the most and least polluted across the study area. Figure 2 shows the Pearson correlation plot between the DEFRA and interpolated output and the road length within a given district.
OpenStreetMaps is an open-source collaborative project that contains road data for most parts of the world, and it is quite comprehensive for the UK (OpenStreetMap contributors, 2021). To create historical road infrastructure datasets, we used the OpenStreetMaps history files (.osh.pbf) to revert an OpenStreetMaps file (.osm.pbf) to its state at a given point in time. We could then use the historical OpenStreetMaps file to extract the network's structural properties at that time, such as the total road length per district.
The feature vector within our study is based on the length of the road network within a given area, determined by summing the length of the roads within the district under the WGS 84 / World Mercator EPSG: 3395 coordinate reference system (CRS), giving the length of the line segments of roads in meters. This process is then repeated for each of the MSOA districts. Each MSOA district then gives one element to the final feature vector for a given time.
Figure 2: AURN ground observations interpolation vs. DEFRA air pollution model. Comparison between the two approaches. The blue points and line of best fit depict the relationship between total air pollution in an MSOA (acquired via interpolating ground observations). In contrast, the orange points and line of best fit represent air pollution data acquired via the UK DEFRA Ambient Air Quality Contract Model. The chart shows the two measurements of pollution, ordered by road length (x-axis). A similar relationship between the two lines of best fit indicates that the methods for generating the air pollution data are similar.
The target vector is created by summing point estimates across a uniform point grid with 250m intervals. We chose the 250m interval to ensure that every MSOA had a point estimate. Details of the uniform point grid used can be seen in Table S1, alongside a visualisation for a single MSOA in Figure S7.
We observed a positive linear correlation between the total road network length and the annual aggregate air pollution within a district. This correlation is shown in Figure 3 for the pollutant PM2.5 in 2019. This observation led us to use a linear model to predict the annual level of air pollution. The Pearson correlation coefficient between the aggregated interpolation pollution and road length was 0.943.
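The two agreement checks above can be sketched as follows. This is an illustrative implementation using NumPy only; the helper names are hypothetical, and the study may well have used a statistics library instead.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def average_ranks(a):
    """Ranks starting at 1, with tied values sharing their average rank."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="mergesort")
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    for value in np.unique(a):
        tie = a == value
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson measures how linear the relationship between the two district-level series is, while Spearman only checks that both series order the districts from least to most polluted in a similar way.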
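The road-length feature described above can be sketched in a few lines. The sketch assumes that each road has already been clipped to its district and projected to a metric CRS, so that coordinates are in meters; the tuple layout and the district identifiers are hypothetical.

```python
import math
from collections import defaultdict

def polyline_length(coords):
    """Length of a polyline given projected (x, y) coordinates in meters."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))

def length_by_district_and_type(roads):
    """Sum road length per district and road type.

    roads: iterable of (district_id, road_type, coords) tuples.
    """
    table = defaultdict(lambda: defaultdict(float))
    for district_id, road_type, coords in roads:
        table[district_id][road_type] += polyline_length(coords)
    return {district: dict(types) for district, types in table.items()}

# Hypothetical roads in two districts.
roads = [
    ("district_A", "track", [(0, 0), (3, 4)]),
    ("district_A", "residential", [(0, 0), (0, 10), (10, 10)]),
    ("district_B", "track", [(0, 0), (1, 0)]),
]
per_type = length_by_district_and_type(roads)
# Total road length per district: one element of the feature vector each.
total_length = {d: sum(t.values()) for d, t in per_type.items()}
```

Keeping the per-type totals alongside the overall total means the same pass over the data can serve a model that uses total length and one that uses a separate feature per road type.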

Models
We used three variations of the model using different feature vectors. Example feature vectors and predicted pollution values for the models can be seen in Section S3.
The first model aimed to predict a district's pollution from the total road network length; a simple linear regression (Freedman, 2009) was used. The results from this model can be seen in Table 1, detailing the mean absolute error, mean squared error, and the R² (coefficient of determination) (Draper and Smith, 1998). This table should be read in conjunction with the other tables described later to allow a fair comparison between the models. An example feature vector for the length model can be seen in Table S2, with predictions made by the model shown in Table S5. In relation to Figure 4, the model only considers the total length of the blue road network.
The second variation of the model splits the road network into the different types of highways (roads) detailed in the OpenStreetMaps database (OpenStreetMap contributors, 2021), and creates a length element per road type in the feature vector. In relation to Figure 4, the feature vector still only considers the blue road network but now creates a separate feature for each road type, such as the solid black lines denoting the residential roads within the blue road network. An example feature vector for the composition model can be seen in Table S3, with predictions made by the model shown in Table S6. This model was implemented as a multiple linear regression (Freedman, 2009). The results from this model can be seen in Table 2, detailing the mean absolute error, mean squared error and the R² (coefficient of determination) (Draper and Smith, 1998). When contrasting this with Table 1, which uses the total length of roads, it becomes clear that this model, in which lengths are separated per type of road, performs better.
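As an illustration of the length model and the three reported metrics, the following sketch fits an ordinary least squares line and scores it. The function names are hypothetical and the closed-form fit is an assumption about how such a regression could be implemented, not a description of the code used in the study.

```python
import numpy as np

def fit_simple(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    return float(slope), float(intercept)

def regression_metrics(y_true, y_pred):
    """Mean absolute error, mean squared error and R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mae": float(np.mean(np.abs(resid))),
        "mse": float(np.mean(resid ** 2)),
        "r2": float(1.0 - ss_res / ss_tot),
    }
```

With this definition, R² equals 1 for a perfect fit, 0 for a model that always predicts the mean of the observed values, and is negative for a model that does worse than predicting the mean.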
The third model created was a spatial variant of the second model, using the same multiple linear regression technique as the composition model. We created the spatial model to capture whether the district was in the centre of an urbanised zone, on the outskirts or wholly removed, as the primary sources of a district's air pollution might be outside the district's boundaries. The feature vector for the spatial model comprised the same features as the composition model; however, the process conducted for the blue road network within Figure 4 was repeated for the yellow road network, providing a total length of the adjacent districts' road networks, thereby doubling the number of features present in the feature vector over the composition model, as seen in Table S4. Predictions made by the spatial model are shown in Table S7. The additional set of elements, which give the sum of the road network length by road type in the MSOAs adjacent to the estimated district, has an effect akin to a network in which the neighbours influence the values of the current location, causing the values used in the composition model feature vector to be more contextualised. The results from this model can be seen in Table 3, detailing the mean absolute error, mean squared error and the R² (coefficient of determination) (Draper and Smith, 1998).
... S5 the interpolation process was unable to estimate a power parameter for the sample locations for CO, SO2 and V10, indicating that there are not enough stations within the UK to pick up the variability over the distances separating the stations at the annual temporal level. However, V10 achieved an R² score of 0.86. The reason for this discrepancy between the three pollutants is the change in air pollution between the training (2014-2017) and test (2018) years. For the V10 interpolated raster, a change in concentration from 3.03 µg/m³ in 2017 to 3.16 µg/m³ in 2018 (a 4.3% increase) was observed. The change in V10 was considerably smaller than that in CO, which decreased from 0.22 µg/m³ in 2017 to 0.18 µg/m³ in 2018 (an 18.2% decrease). Similarly, SO2 decreased from 1.99 µg/m³ in 2017 to 1.73 µg/m³ in 2018 (a 13.1% decrease). The Mean Absolute Error is included to give context for the model performance between pollutants with considerably different maximum values, as seen in Figure S5. The Mean Squared Error is also included to highlight the model's performance concerning the more extreme values seen within the dataset, which arise from the varying geographical sizes of the districts being estimated.
While the improvements in the R² score are minimal, there is no overhead cost in creating the additional features for the vector, as data collection concerning the roads has already taken place. Similar computational costs between the models also arguably make the time investment worthwhile, as there are minimal barriers to implementing the spatial variant of the composition model.
Overall, each of the three variations of the model addresses a specific situation that could be encountered, and each has different benefits. The length model has the benefit of being conceptually simple, allowing multiple datasets to be integrated because road length is a property shared by all road datasets. The composition model offers improvements over the length model but restricts the usable datasets to those whose classification schema is consistent across the study period. Finally, the spatial model offers some further performance improvements but requires considerably more data than the composition model, alongside adding further complexity to the model itself.
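The construction of the spatial feature vector can be sketched as below. The sketch assumes a matrix of road length per district and road type together with an adjacency list of neighbouring districts; both inputs and all function names are hypothetical, and the least squares fit stands in for whichever regression implementation was used.

```python
import numpy as np

def spatial_features(own, neighbours):
    """Append the summed road lengths of adjacent districts to each row.

    own: (n_districts, n_types) array of road length per road type.
    neighbours: neighbours[i] lists the indices of districts adjacent to i.
    """
    own = np.asarray(own, dtype=float)
    adjacent = np.zeros_like(own)
    for i, idx in enumerate(neighbours):
        if len(idx):
            adjacent[i] = own[list(idx)].sum(axis=0)
    # The result has twice as many columns as the composition features.
    return np.hstack([own, adjacent])

def fit_multiple(X, y):
    """Multiple linear regression by least squares; intercept comes first."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef

def predict_multiple(coef, X):
    """Predict targets from fitted coefficients."""
    return coef[0] + np.asarray(X, dtype=float) @ coef[1:]
```

Passing only the district's own per-type lengths to `fit_multiple` corresponds to the composition model, while passing the output of `spatial_features` corresponds to the spatial variant.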

Feature Selection
The OpenStreetMaps dataset is exceptionally comprehensive with regard to road information. The feature vector in the previously described models included all road types within the OpenStreetMaps dataset. There are 99 different road types, but some are misclassified (shown in tables in Section S3.3). This section describes the work conducted to reduce the input dataset's size to only what was needed to achieve the goal of estimating air pollution within a district.
Given the number of variables (road types) that exist in the OpenStreetMaps dataset, we relied on information theory (Shannon, 1948), and in particular on the mutual information between the road types and the different air pollutants, to determine which road types have the strongest relation with each air pollutant. As mentioned above, the initial number of road types in the dataset was 99. Removing road types that had no relevance to the air pollutants being studied, represented by a mutual information summation value of 0, kept 88 road types. The types removed were mostly misclassifications by a user inputting a road type, such as "fence" or "residential;footway". Our next step was to increment the removal threshold for a road type's summation value and compare the resulting model's performance by its R² score. Figure 5a shows a plot of the resulting R² score for both the length and composition models against differing numbers of road types, subset by the value of the threshold for inclusion of the mutual information summation. Table S8 shows mutual information values for the pollutants across a range of road types, and Table S9 shows the model's performance with reduced input datasets.
We then repeated the experiments; however, we included only road types relevant to all 12 pollutants in the test. 25 out of 99 road types had relevance to all 12 pollutants to be estimated; relevance is measured by ensuring that the road type had non-zero mutual information for all 12 pollutants. The results of this experiment are shown in Figure 5b. Table S10 shows the results from the previous three models with road types with zero mutual information with at least one of the pollutants.
In Figure 5, the best model uses two road types, track and unclassified, to predict the pollution within a district, seen distinctly with an increase in the R² score for the length model to 0.83, up from 0.78. We can see that some road types cause the length model to perform worse, but even when the result improves, it never performs as well as the composition model. Eventually, with only one road type remaining, the length and composition models are reduced to the same linear regression model, producing an R² score of 0.65.
We computed the pairwise mutual information between the road types to further reduce the dataset. The goal was to determine the road types that contained the most information about other road types within the dataset. The road types for which we computed the mutual information were the 25 road types with some mutual information with every pollutant covered in the study.
Figure 6 shows a heatmap of the mutual information values between the 25 road types. The values in the heatmap have been normalised row-wise. In the heatmap, there are 625 data points. 457 points have a value of mutual information greater than zero. Some data points have a value of 0 for mutual information; for example, living street and raceway have zero pairwise mutual information.
The heatmap was modelled as a network of the 25 road types to determine the road types that contain the most mutual information about other road types, as shown in Figure S8. Each node in the network represents a road type, and the edge weight between two nodes is the mutual information between the two road types. The degree of each node is equal to the sum of the weights of the edges adjacent to the given node, giving a value for the amount of mutual information a given road type contains about other road types in the network.
The node with the highest degree was chosen for inclusion in the model feature vector and then removed from the network, as the mutual information of its edges no longer needed to be included in the dataset. This process was repeated, with the next highest node based on degree weight chosen, until no nodes remained. The results of this process on the network representation can be seen in Figure S9.
The test results in Figure 7 show that a model underpinned by the "track" and "unclassified road" road types achieves an R² score of 0.83 while only making use of 25.9% of the total road network length in the study area, indicating that these two types of roads are sufficient to provide a good estimation of air pollution.
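The thresholding step above can be sketched with a simple histogram-based estimate of mutual information. The estimator, the number of bins and the dictionary layout are all assumptions made for illustration; the study does not state which estimator was used.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of the mutual information (in nats) of two samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

def select_road_types(lengths, pollutants, threshold, bins=8):
    """Keep road types whose mutual information, summed over all pollutants,
    exceeds the threshold.

    lengths: dict mapping road type to per-district road lengths.
    pollutants: dict mapping pollutant name to per-district values.
    """
    summed = {
        road: sum(mutual_information(values, p, bins) for p in pollutants.values())
        for road, values in lengths.items()
    }
    kept = [road for road, total in summed.items() if total > threshold]
    return kept, summed
```

A road type that is absent from every district has a constant length of zero, so its estimated mutual information with any pollutant is zero and a threshold of 0 removes it, which mirrors the first filtering step described above.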

Missing Districts Tests
We also conducted experiments to explore whether some districts within the study area could be used for training, with the remaining districts estimated from the resulting model. The idea was to reduce the input dataset further by removing similar districts from the training set, retaining only a subset of urban, suburban or rural districts. The population density was used as a proxy metric for how urbanised a district is, with example districts shown in Table S12. The goal was to understand the minimal number of roads that need to be measured across different districts to predict pollution across all districts accurately. We did not include the spatial variant in these tests, as it would require additional data acquisition over the minimal amount of measuring needed for just the district itself. We used the data from 2018 for the tests. For a set of 16 random states, different MSOAs were randomly sampled from the 7201 total MSOA districts. When selecting the additional districts to be included in the test, we ensured that the districts used in the previous iterations were still present. For example, the districts used in one random state in the 100 training size set were also used in the 500 training size set, as visualised in Figure S10.
Table S13 shows the mean R² score for both the length and the composition model as the training set size is reduced, across 16 random state tests. We chose to explore test set sizes ranging from 100 MSOAs to 7200 MSOAs, with the training set being made up of the remaining MSOAs; for example, 100 MSOAs in the test set and 7101 MSOAs in the training set.
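One simple way to obtain training sets that are nested in the manner described above is to shuffle the districts once per random state and take prefixes of the shuffled order. The sketch below illustrates this; the function name and the use of NumPy's random generator are assumptions, not details taken from the study.

```python
import numpy as np

def nested_training_sets(n_districts, train_sizes, seed):
    """Return nested training sets of district indices for one random state."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_districts)
    # Every smaller set is a prefix of the same shuffled order, so the
    # districts in a small training set also appear in every larger one.
    return {size: shuffled[:size] for size in sorted(train_sizes)}

# 16 random states, each with training sizes of 26, 100 and 500 districts.
runs = [nested_training_sets(7201, [26, 100, 500], seed) for seed in range(16)]
```

The districts not selected for a given training size form the corresponding test set.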
Figure 8 shows the scatter plots for the data detailed in Table S13. The random tests show that the length model is more robust than the composition model, indicating that the composition of the road network can vary greatly between districts, resulting in poor performance for some random training sets. The length model maintains its performance, with an R² score of 0.85, as the training size is reduced from the initial 7101 MSOAs to 26 MSOAs. Below this size, the performance of the model begins to decrease. Across a range of different random states, around 26 MSOAs, representing 0.36% of the total number of MSOAs, are needed to train a model that successfully predicts pollution in all MSOAs.

Figure 6: Mutual information between road types. The heatmap shows the mutual information between all 25 road types with a non-zero mutual information value with all the pollutants in the study. The heatmap values are normalised by row with the diagonal removed (which has value 1), allowing for an understanding of which road type has the highest mutual information with the other road types in the dataset. Motorway link has the highest mutual information with the motorway road type. While the motorway type has strong mutual information with the motorway link type, it also has strong mutual information scores with other road types, such as track and path. This difference in the strength of mutual information indicates that the motorway link dataset contains information mainly about the motorway road type, whereas the motorway dataset contains information about motorway link and other road types such as track and path, making it more useful for retaining information about all datasets.

Figure 7: Total inter road cumulative mutual information in input feature vector vs R² score. Shown is the performance of the length and composition models when removing road types from the feature vector based on each road type's mutual information with other roads, aiming to reduce the data required while maintaining performance by including road types that contain information about other road types. A single road type was removed at each step, with the x-axis detailing the total remaining mutual information about other road types within the datasets used for the feature vector. Eventually, the two models reduce to the same model when only a single road type remains. The performance improvement of the length model as some road types are removed indicates that some of the included datasets are noisy and affect model performance; this issue is not present with the composition model, which can differentiate between road types and shows a gradual reduction in performance. The full experiment results can be seen in Table S11.

Figure 8: Test set size (districts) vs R² score. (a) Length Model. (b) Composition Model. Shown is the performance of the length and composition models as the input districts for the training set are reduced from 7101 to 1. Of note is the difference in the scale of the y-axis between the figures. As the test set size increases (and so the training set size decreases), the performance of both models remains stable until the test set comprises 7,175 of the 7,201 districts, leaving 26 (0.36%) for training. At that point, both models begin to reduce in performance. The length model maintains its performance at an R² score of around 0.8 for longer, with the composition model degrading to an average R² score of 0.65. When the test set size is 7,200, meaning only a single district is used as the training set, both models perform worse than simply estimating the average air pollution value; however, the composition model performs significantly worse than the length model. This indicates that, while the composition model initially performs better than the length model, it requires more input data to perform well.

As seen in Figure 8, the R² score for the length model averaged 0.83, providing a baseline performance for the next experiment. We trained a set of models, each on 1500 districts with similar population densities, to explore the effect of the population density distribution of the training set on model performance. Starting from the least urbanised MSOAs and moving to the most urbanised in increments of 125, a model was trained and its R² score computed at each step. The results from this experiment are shown in Figure 9, where the average R² score across the 46 tests was 0.83. The R² score remained close to the 0.83 expected from the random tests as the training set changed from least to most urbanised. This suggests that the R² score of the length model is not related to the population density distribution of the MSOAs included in the training subset. Table S14 shows the full experiment results.
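The training-set windows of the population-density experiment can be enumerated directly. The following is a minimal sketch, assuming MSOAs are indexed from 0 (least urbanised) upwards by population-density rank; the function name and parameter names are illustrative and are not taken from the study's code.

```python
def density_windows(first_start=0, last_start=5625, step=125, size=1500):
    """Return (min_rank, max_rank) pairs, one per training window.

    Ranks order MSOAs from least (0) to most urbanised. The default
    values are the ones quoted for the experiment in the text.
    """
    return [(start, start + size - 1)
            for start in range(first_start, last_start + 1, step)]


windows = density_windows()
print(len(windows))   # number of models trained
print(windows[0])     # least urbanised training set
print(windows[-1])    # most urbanised training set
```

With these settings the enumeration gives 46 windows, running from ranks 0 to 1,499 up to ranks 5,625 to 7,124, which agrees with the 46 tests and the first and last points reported for Figure 9.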

Use Cases
From the tests conducted as part of this work, the critical road types for predicting the pollution in an area are the "minor roads": the road types "track" and "unclassified" within the OpenStreetMap classification schema. Not only do "track" and "unclassified" have the most mutual information with the 12 pollutants analysed in this study, with normalised summation values of 12 and 10.12 respectively, but they also contain the most mutual information about the other road types, with total values of 2.97 and 3.00 respectively. Using only the "track" and "unclassified" road types as input data produces a model with an R² score of 0.83 for both the length and composition models. The model then degrades to an R² score of 0.65 when only the "track" road type is included, at which point the length and composition models are the same. While the R² score of the complete composition model, with all 25 road types, improves to 0.89 compared to the length model's R² score of 0.78, it can be argued that the increased amount of data needed to attain the increase of 0.11 in R² is not worthwhile. The entire length of the road network within the MSOA boundaries in 2018 is 1,140,835,548 m, of which 295,948,514 m is "track" and "unclassified" road, representing a 74.1% reduction in input data for minimal performance loss. However, it is also vital to consider the context in which the model is intended for use. The model was never designed to predict the pollution in an area to a specific value, but rather to indicate where pollution is highest within an area, to help identify areas for intervention. Within this context, the loss of 0.11 in R² score has little bearing on the model's practical use.
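The size of the input-data reduction follows from the two road lengths quoted above. A short check, using only those two figures (the variable names are illustrative):

```python
# Road lengths (metres) within MSOA boundaries in 2018, as quoted in the text.
total_length_m = 1_140_835_548
track_unclassified_m = 295_948_514

share_kept = track_unclassified_m / total_length_m
reduction_pct = (1 - share_kept) * 100

print(f"share of network kept: {share_kept * 100:.1f}%")
print(f"reduction in input data: {reduction_pct:.1f}%")
```

The "track" and "unclassified" roads make up about 25.9% of the total network length, so restricting the input to them removes about 74.1% of the data.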
The missing districts tests have clear implications for how monitoring stations should be deployed to create a target vector. The experiments show that a distribution of urban, suburban and rural districts across the training set is unnecessary. In the UK case, only about 0.36% of districts (26 of 7201) were required to accurately predict pollution across all districts at the annual level. Using the proposed model would give policymakers full spatial coverage of air pollution while deploying monitoring stations in only 0.36% of districts, representing significant monetary savings over deploying a comprehensive air pollution monitoring network like the UK's AURN.
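The structure of a missing districts test can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the data are synthetic (a noisy linear relationship between road length and pollution), the helper names are hypothetical, and a single-feature ordinary least squares fit stands in for the length model.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def missing_districts_test(lengths, pollution, n_train, seed):
    """Train on n_train randomly chosen districts and score on all the others."""
    rng = random.Random(seed)
    train = rng.sample(range(len(lengths)), n_train)
    train_set = set(train)
    test = [i for i in range(len(lengths)) if i not in train_set]
    slope, intercept = fit_line([lengths[i] for i in train],
                                [pollution[i] for i in train])
    predictions = [slope * lengths[i] + intercept for i in test]
    return r2_score([pollution[i] for i in test], predictions)


# Synthetic stand-in data (assumption): 7,201 districts, noisy linear relation.
data_rng = random.Random(0)
lengths = [data_rng.uniform(10_000, 400_000) for _ in range(7201)]
pollution = [0.002 * length + 50 + data_rng.gauss(0, 40) for length in lengths]

# Repeat over several random states, as in the random tests.
scores = [missing_districts_test(lengths, pollution, 26, seed) for seed in range(10)]
print(f"training share: {26 / 7201:.2%}")
print(f"mean R2 over 10 random states: {sum(scores) / len(scores):.3f}")
```

Repeating the split over several random states, rather than relying on one draw, mirrors the way the text reports the 26-district threshold "across a range of different random states".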
The findings of this study show that a linear regression model underpinned by a single structural property, the length of the track and unclassified road network, trained on 0.36% of districts within England and Wales, had enough data to identify an ordering of which districts are the most polluted. Furthermore, the model presents a low-cost method of achieving similar results to more expensive models such as the one currently used by DEFRA. The model has apparent practical uses for policymakers who want to pursue clean air initiatives but lack the capital to invest in comprehensive dedicated monitoring networks.
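Because the intended use is an ordering of districts rather than exact concentrations, agreement between predicted and observed rankings can be checked alongside R². The sketch below is an assumption about how such a check could look: the study itself reports R², the Spearman rank correlation and top-k overlap are swapped in here for illustration, and the five district values are hypothetical.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result


def spearman(a, b):
    """Spearman rank correlation for untied data."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def top_k(values, k):
    """Indices of the k largest values."""
    return set(sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k])


def top_k_overlap(actual, predicted, k):
    """Fraction of the k most polluted districts also ranked in the model's top k."""
    return len(top_k(actual, k) & top_k(predicted, k)) / k


# Hypothetical values for five districts (not taken from the study).
actual = [31.0, 12.5, 44.2, 8.1, 27.3]
predicted = [29.4, 15.0, 40.8, 9.9, 30.1]

print(spearman(actual, predicted))
print(top_k_overlap(actual, predicted, k=2))
```

A rank-based check of this kind is insensitive to the absolute error of each estimate, which matches the stated purpose of identifying where intervention is most needed.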

Discussion
In this work, we presented a solution for estimating annual air pollution using a dataset describing the structure of roads. We focused on a model based on the length of roads and on a composition model that considers other factors associated with road classifications.
Our contributions can be split into three categories.
(1) We have shown that good annual estimates can be obtained on a very low budget. This contrasts with current approaches in the UK, which cost millions of pounds (£). An approach that is cheaper and accessible to multiple stakeholders is more inclusive, allowing society to use the estimates in secondary applications (health exposure, traffic, city planning, etc.).
(2) Our model is transparent: both the characteristics used and the estimator, which is based on simple regression, are explainable. This improves the reusability and generality of the approach.
(3) In a more general sense, we believe this work shows that in data science, and in environmental data science in particular, a lack of data about a particular phenomenon can be overcome by using the information embedded in other datasets that may not have been collected with that intent. This is possible because of mutual information: aspects of one phenomenon may be present in other pieces of information.
While road length is inferior to dynamic data for estimating air pollution, the method presented has the benefit of a nominal implementation cost due to the open-source nature of the data, minimal training computation, and accessible model design. The missing districts experiments give a framework for placing a minimal number of monitoring stations across an area (in the case of the UK, in 0.36% of MSOAs) and using the model presented to fill in the missing districts, producing a full spatial map of air pollution at the annual temporal level. The full spatial map ensures that all districts, urban, suburban and rural alike, have an estimate for ambient air pollution, helping to address the inequality caused by the current focus on urban areas in monitoring station placement.
While the composition model surpassed the length model in performance, there is a trade-off in operationalising the composition model over the length model due to the data used. The length model has the benefit of using a universal structural property of roads, their length, which is present across datasets such as OpenStreetMap, used in our study, and other road network datasets such as Ordnance Survey Open Roads⁴. However, the composition model is immediately limited by being tied to a single dataset and the classification schema it uses for road types. For example, in OS Open Roads there are only 6 classifications of roads, but as seen in Section 4.2, there were 99 different classifications in OpenStreetMap. The road classification used also exposes the model to potential system shocks. One of them is the potential for OpenStreetMap to change its classification schema, which could make it challenging to build a complete dataset spanning multiple years. This issue is simple to solve with the length model, which can combine datasets depending on their availability to extend the temporal coverage of the data used.
While the UK and many other global north countries can invest in comprehensive monitoring-station networks, this is not always the case for lower-income countries, particularly those in the global south. Data concerning annual air pollution globally would help to inform policymakers' decisions and help to combat some of the 4.2 million deaths per year attributed to ambient air pollution (Public Health, Social and Environmental Determinants of Health Department, 2018). The framework presented, paired with the global availability of data on the structural properties of roads through OpenStreetMap, provides a basis for future work to design a global annual air pollution model from secondary datasets similar to those used in this study.
Future work could explore the possibility of using the same framework to estimate air pollution at a finer temporal resolution, such as the daily or hourly level. However, there are likely limits to the approach presented in this paper at a finer temporal level due to the static nature of road transport infrastructure. Such work would probably need to move from data concerning the static elements of the road, such as its length, to more dynamic aspects, such as traffic counts for different types of vehicles. However, this data is more expensive to acquire and more invasive of privacy. Therefore, critical evaluation is needed of whether the more accurate air pollution intelligence gained from a model underpinned by dynamic data is worth the additional cost and privacy intrusion.

S2.1 Air Pollution Ground Observation Data
In 2019 there were 173 functional monitoring stations across the UK in the Automatic Urban and Rural Network (AURN). Each monitoring station measures only a subset of the 12 air pollutants of interest in this study. Figure S4 shows the number of stations online in each year of the 2014-2019 period, coloured by air pollutant.
Supplementary Figure S4: Number of active AURN monitoring stations by year. The number of AURN stations, according to UK Air (DEFRA), that are online in any given year. Some air pollution monitoring stations are decommissioned each year, and others are commissioned. Of note is that a single air pollution monitoring station can monitor multiple air pollutants.
Another critical aspect of the AURN is the type of location in which a station is placed. There are six different station location types†. The stations used in this study were also restricted to stations within the area covered by the MSOAs, meaning only stations within England and Wales were included. Local Authority Environmental Health Offices are responsible for running the stations. While some locations within the MSOA area are closer to a monitoring station outside the MSOA boundary, it was essential to restrict the stations to the top-level political authority because the local authority runs the station. We made this choice in case of differences in standard operating procedures between health officers in different local authorities‡. S7, that the model aims to estimate.

S2.2 Air Pollution Data Interpolation
Figure S5 shows, for each of the 12 pollutants, scatter plots of the Root Mean Square Error (RMSE) against the value of the power parameter used during the leave-one-out validation of the interpolation.
Figure S6 shows the resulting raster interpolated from the monitoring station ground observations, using for each pollutant the power parameter that minimised the RMSE.
Of note is that some air pollutants, such as Carbon Monoxide (CO), had very few ground observation monitoring stations. In these cases the interpolation estimated the mean of the stations across the study area, as indicated by a power parameter of 0.
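The role of the power parameter can be illustrated with a small sketch. It assumes the interpolation is inverse distance weighting (IDW), which the power parameter and the power-0 behaviour described above suggest but which is not stated in this section; the station coordinates, values and candidate powers are hypothetical.

```python
import math


def idw(stations, values, point, power):
    """Inverse distance weighted estimate at `point`.

    With power = 0 every station gets weight 1, so the estimate is
    simply the mean of all station values.
    """
    weights = []
    for (sx, sy), value in zip(stations, values):
        distance = math.hypot(point[0] - sx, point[1] - sy)
        if distance == 0:
            return value  # the point coincides with a station
        weights.append(distance ** -power)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def loo_rmse(stations, values, power):
    """Leave-one-out RMSE: predict each station from all the others."""
    squared_errors = []
    for i in range(len(stations)):
        others = stations[:i] + stations[i + 1:]
        other_values = values[:i] + values[i + 1:]
        estimate = idw(others, other_values, stations[i], power)
        squared_errors.append((estimate - values[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical station coordinates (km) and annual means (µg/m³).
stations = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
values = [20.0, 24.0, 18.0, 30.0, 23.0]

# Pick the power that minimises the leave-one-out RMSE.
candidate_powers = [0, 0.5, 1, 2, 3]
best_power = min(candidate_powers, key=lambda p: loo_rmse(stations, values, p))
print(best_power, idw(stations, values, (2, 3), best_power))
```

When only a handful of stations exist, the leave-one-out error gives little evidence for any distance decay, which is consistent with a selected power of 0 and a raster equal to the station mean.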

S3 Model Variant Details
S3.1 Model Variant Feature Vector
Table S2 shows an example feature vector for the length model. Table S3 shows an example feature vector for the composition model. Table S4 shows an example feature vector for the spatial variant of the composition model.
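The three feature vectors differ only in how per-road-type lengths are aggregated. The sketch below is illustrative: the total length (223,092 m) and the residential figures (23,235 m and 82,500 m) follow the Figure 4 caption, while the "footway" and "service" values, the combined neighbour entry and all function names are hypothetical.

```python
# Road lengths in metres per MSOA and road type. Only the residential values
# and the City of London 001 total come from the Figure 4 caption; the other
# numbers and the "Neighbours (combined)" entry are hypothetical.
road_lengths = {
    "City of London 001": {"residential": 23_235, "footway": 120_000, "service": 79_857},
    "Neighbours (combined)": {"residential": 82_500, "footway": 60_000, "service": 15_000},
}
neighbours = {
    "City of London 001": ["Neighbours (combined)"],
    "Neighbours (combined)": ["City of London 001"],
}
road_types = ["residential", "footway", "service"]


def length_features(msoa):
    """Length model: a single element, the total road length."""
    return [sum(road_lengths[msoa].values())]


def composition_features(msoa):
    """Composition model: one element per road type."""
    return [road_lengths[msoa].get(road_type, 0) for road_type in road_types]


def spatial_features(msoa):
    """Spatial variant: own per-type lengths plus per-type totals of adjacent MSOAs."""
    adjacent = [sum(road_lengths[n].get(road_type, 0) for n in neighbours[msoa])
                for road_type in road_types]
    return composition_features(msoa) + adjacent


print(length_features("City of London 001"))
print(composition_features("City of London 001"))
print(spatial_features("City of London 001"))
```

The length vector stays the same size whatever classification schema the source dataset uses, whereas the composition and spatial vectors grow with the number of road types.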

S3.2 Model Variant Example Output Estimations
Table S5 shows example pollution predictions for the length model for different MSOAs. Table S6 shows example pollution predictions for the composition model for different MSOAs.

S3.3 Model Feature Selection
Table S8 shows the mutual information between every 5th road type and all 12 pollutants, with the final column, summation, providing the total mutual information for that road type across all pollutants. The mutual information is normalised between 0 and 1, and the rows are ordered from least to most total mutual information.
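For readers unfamiliar with the quantity, mutual information between a road-type length and a pollutant can be estimated as sketched below. The estimator is an assumption: equal-width binning followed by the discrete definition is used here for illustration, dividing by the largest score is one possible way of normalising to the range 0 to 1, and the eight district values are hypothetical. The study's own estimator is not described in this section.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())


def bin_values(values, n_bins):
    """Equal-width binning of continuous values into integer labels."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def normalise_by_max(scores):
    """Scale a dict of non-negative scores so that the largest equals 1."""
    top = max(scores.values())
    return {key: value / top for key, value in scores.items()}


# Hypothetical values for eight districts.
track_length = [1200, 300, 4500, 800, 3900, 150, 2700, 5100]   # metres
no2 = [14.1, 9.8, 25.3, 11.0, 22.7, 9.1, 18.4, 27.9]           # µg/m³

mi = mutual_information(bin_values(track_length, 3), bin_values(no2, 3))
print(round(mi, 3))
```

Summing the normalised scores of one road type over all 12 pollutants gives a value between 0 and 12, which is the scale of the summation column described above.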
Table S9 shows the results from repeating the experiments for all three models while reducing the input dataset by removing specific road types, and so reducing the data used to create the feature vector. The road types to remove were chosen by incrementing a threshold value and removing road types whose summation mutual information value fell below the threshold.
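The threshold procedure can be sketched as a simple filter applied repeatedly. In this illustration the summation values for "track" (12) and "unclassified" (10.12) are the ones quoted in the Use Cases section; the other values, the threshold schedule and the treatment of values exactly equal to the threshold are assumptions.

```python
def select_by_threshold(mi_sum, threshold):
    """Keep road types whose summed mutual information exceeds the threshold."""
    return {road: total for road, total in mi_sum.items() if total > threshold}


# Summed, normalised mutual information per road type. Only the track and
# unclassified values come from the text; the rest are hypothetical.
mi_sum = {
    "track": 12.0,
    "unclassified": 10.12,
    "motorway": 6.4,
    "living_street": 0.05,
    "turning_circle": 0.0,
}

# Hypothetical threshold schedule; the first step only drops zero-sum types.
steps = []
for threshold in [0.0, 0.1, 7.0, 11.0]:
    kept = select_by_threshold(mi_sum, threshold)
    steps.append((threshold, sorted(kept)))
    if len(kept) == 1:
        break

for threshold, kept in steps:
    print(threshold, kept)
```

At each step the length and composition models would be retrained on the surviving road types; once a single road type remains, the two feature vectors coincide.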
Table S10 shows the results from the same tests as Table S9, but with the starting set restricted to road types that have a non-zero mutual information value with every pollutant. This initial set contained 25 road types.
Figure S8 shows the full network representation of the 25 road types with non-zero mutual information with all 12 pollutants, and Figure S9 shows the network state at key stages of the removal process used to reduce the road type inclusion list, detailed in Table S11.
Supplementary Figure S8: Network model of the inter road mutual information. The mutual information between the different road types was modelled as a network, with the nodes representing road types and the edge between two nodes weighted by the mutual information between the two.
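A weighted network of this kind, together with a one-at-a-time removal process, can be sketched with plain dictionaries. The edge weights are hypothetical, and the removal criterion (drop the road type sharing the least total mutual information with the remaining ones) is an assumption made for illustration; the exact criterion used in the study is not given in this section.

```python
# Hypothetical symmetric mutual information values between road types.
pairs = {
    ("motorway", "motorway_link"): 0.9,
    ("motorway", "track"): 0.6,
    ("motorway", "path"): 0.5,
    ("track", "path"): 0.7,
    ("motorway_link", "track"): 0.1,
}


def build_network(pairs):
    """Adjacency dict: node -> {neighbour: edge weight}."""
    graph = {}
    for (a, b), weight in pairs.items():
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def strength(graph, node):
    """Total mutual information a road type shares with the remaining ones."""
    return sum(graph[node].values())


def remove_weakest(graph):
    """Drop the node with the lowest strength and return its name."""
    weakest = min(graph, key=lambda node: strength(graph, node))
    for neighbour in graph.pop(weakest):
        del graph[neighbour][weakest]
    return weakest


graph = build_network(pairs)
removal_order = []
while len(graph) > 1:
    removal_order.append(remove_weakest(graph))
print(removal_order, list(graph))
```

In this toy network "motorway_link" is removed first because almost all of its information is shared with "motorway" alone, echoing the interpretation given for Figure 6.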

S4 Missing Districts Tests Details
Figure S10 shows the visualisation of the MSOAs selected for inclusion in the training set during the missing districts tests. Table S12 shows the population density for a range of MSOAs. Table S13

Figure 1: Spatial distribution and classification of monitoring stations within England and Wales. The Automatic Urban and Rural Network (AURN) stations are divided into 3 classes: urban (202), suburban (12), and rural (21). Note the inequality of station numbers; most of the stations are in urban settings. The AURN has 274 stations in total, of which 235 are in England and Wales.

Figure 3: MSOA district road network total length vs aggregate PM2.5 pollution in 2019. Line of best fit for a set of points depicting the relationship between the aggregate PM2.5 pollution within an MSOA and the total road length within the corresponding MSOA. The linear relationship led to our decision to use a linear regression model.

Figure 4: Road network structure within MSOA City of London 001 (blue) and surrounding MSOAs (yellow). The solid black line represents the residential roads within MSOA City of London 001 and the dashed black line the residential roads within the neighbouring MSOAs. The road network contributes differently to the feature vector values depending on the framework used. In the case of the length model, the overall length of the road network is used, which for MSOA City of London 001 is 223,092 m, visualised with the blue and black solid lines. In the case of the composition model, each road type contributes a single feature vector element, with the black solid residential roads contributing an element with the value 23,235 m. The spatial variant of the model takes into account the total road length of each road type in adjacent MSOAs alongside the elements of the composition model, meaning the total for the dashed black line comprises a single element of the feature vector, in this case 82,500 m.

Figure 5: Model performance vs R² score, with road type inclusion based upon road type mutual information with air pollution. (a) Starting with all 99 road types. (b) 25 road types relevant to all air pollutants. Shown is the performance of the length and composition models when removing certain road types from the feature vector. Inclusion in the feature vector depends on the sum of the mutual information of the road type with the pollutants considered in the study. A threshold is incremented to remove the road types with less mutual information than others. The requirement for a non-zero mutual information sum removes 11 road types with no relation to pollution, such as the OpenStreetMap classification of turning circles, leaving 88 road types. Many road types have a minimal mutual information sum, so incrementing the threshold to 0.1 removes 65 road types, including classifications such as living street. This process is repeated until a single road type remains, in this case the track road classification, at which point the length and composition models reduce to the same model. Figure 5b details the same experiment as Figure 5a; however, only road types that have a non-zero mutual information value with all pollutants were considered, leaving 25 road types and removing road types such as lane that only have non-zero mutual information values with SO2.

Figure 9: Test sizes vs R² score for the length model for changing population density within the training set. Shown is the length model's performance across a range of training sets chosen based on the MSOAs' population density. The population density of an MSOA is used as a proxy for how urbanised it is; the higher the population density, the more urbanised the MSOA. 1,500 MSOAs were chosen to be part of each training set, with the variation being in the average population density of the training set MSOAs. The first point (Min 0, Max 1,499) is where the least urbanised MSOAs form the training set. The last point (Min 5,625, Max 7,124) is where the most urbanised MSOAs form the training set. The consistent performance of the length model as the average urbanisation of the training set MSOAs increases indicates that how urbanised an MSOA is does not affect the model performance. Table S14 shows the full experiment results. Shown are the results from the experiment conducted to explore the effect of changing the mean urbanisation score in the training set on the performance of the length model. The training set moves from least (Min: 0.0, Max: 1499.0) to most (Min: 5625.0, Max: 7124.0) urbanised, and the performance of the length model remains consistent, indicating there is no need to consider the distribution of how urban the MSOAs are within the training set.

Table 1: Results from the length feature vector model using all road types. Columns: Pollutant Name; Mean Absolute Error (µg/m³); Mean Squared Error ((µg/m³)²); R² Coefficient of Determination. The results show that model performance achieves a high of 0.87 in R² score across the 12 pollutants. Of note is that CO and SO2 have a lower score of around 0.535. As shown in Figure

Table 2: Results from the composition feature vector model using all road types. Columns: Pollutant Name; Mean Absolute Error (µg/m³); Mean Squared Error ((µg/m³)²); R² Coefficient of Determination. The R² for all air pollutants has improved over the length model detailed in Table 1, alongside reductions in both the Mean Absolute Error and Mean Squared Error, indicating an improved model framework for estimating air pollutants, with the same relative performance between air pollutants. Note that this model is also based on road lengths, but the lengths here are not the total but are split by road type.

Table 3: Results from the spatial composition feature vector model using all road types. Columns: Pollutant Name; Mean Absolute Error (µg/m³); Mean Squared Error ((µg/m³)²); R² Coefficient of Determination. The R² has improved over the composition model detailed in Table 2 for some of the pollutants included in the study, other than NO2 and O3. However, not all the metrics have improved in the same manner as in the transition from the length to the composition model; for example, the Mean Absolute Error and Mean Squared Error have increased, potentially indicating a less robust model framework for estimating air pollutants, and one that requires additional data over the other models detailed in Table .

Deep learning-based PM2.5 prediction considering the spatiotemporal correlations: A case study of Beijing, China. Science of The Total Environment, 699:133561.
Public Health, Social and Environmental Determinants of Health Department (2018). Burden of disease from ambient air pollution for 2016. Technical report, World Health Organization (WHO), 1211 Geneva 27, Switzerland.

Table S1: Example target vector for NOx in 2018 at the annual temporal level. Sample (every 500th MSOA by sample point count) of the aggregate pollution sum for MSOA district boundaries, with the associated number of point samples from the raster based on the uniform point grid shown in Figure

Table S1 details each MSOA's aggregate pollution value calculated from the 2019 interpolated raster using the uniform point grid, with a count of the number of points sampled.

Table S2: Example feature vector for the length model for 2018. The feature vector is sorted in ascending order by total road length (m), with every 500th MSOA shown.

Table S3: Example feature vector for the composition model for 2018. The feature vector is sorted in ascending order by total road length (m), with a selection of different road types for every 500th MSOA.

Table S4: Example feature vector for the spatial model for 2018. The feature vector is sorted in ascending order by total road length (m). The same feature vector values as in Table S3 are included alongside their neighbouring MSOA road-type counterparts.

Table S5: Length model air pollution concentration predictions for NOx in 2018. Results from the length model with every 500th MSOA shown and sorted by the difference between predicted and actual pollution.

Table S7 shows example pollution predictions for the spatial model for different MSOAs.

Table S6: Composition model air pollution concentration predictions for NOx in 2018. Results from the composition model with every 500th MSOA shown and sorted by the difference between predicted and actual pollution.

Table S7: Spatial model air pollution concentration predictions for NOx in 2018. Results from the spatial model with every 500th MSOA shown and sorted by the difference between predicted and actual pollution.

Supplementary Table S8: Mutual information between road type and air pollutants. Mutual information for every 5th road type with every air pollutant, ordered by summation mutual information (the total mutual information a road type has with all air pollutants).

Table S9: Experiment results for model performance on reducing feature vector input road types. Columns: R² Score Length Model; R² Score Composition Model; R² Score Spatial Model; Threshold Value; No. of Road Types. Results from reducing the road types included in the feature vector based on the total summation mutual information a road type has with all air pollutants, subsetted according to the denoted threshold value.

Table S10: Experiment results for model performance on reducing feature vector input road types, restricted to road types with non-zero mutual information with all air pollutants. The results from rerunning the experiment detailed in Table S9; however, only road types with non-zero mutual information with all air pollutants are included in the initial set.

Table S11: Experiment results for the inter road mutual information subsetting tests. Columns: R² Length; R² Composition; Inter Road Cumulative Mutual Information; No. of Road Types; New Road Type Included. Detailed is the average model performance of the length and composition models across all air pollutants with different road type subsets input into the feature vector. The New Road Type Included column details the additional road type included in the input feature vector over the previous row of the table.

Table S14: Experiment results for a consistent training set size (1500) with changing mean urbanisation scores across included MSOAs. Columns: Min and Max Urbanised MSOA In Training Set; R² Score Length Model; Normalised Range of MSOA Population Density; Range of MSOA Population Density (individuals/m²); Mean MSOA Population Density (individuals/m²).