What would other emergency stroke teams do? Using explainable machine learning to understand variation in thrombolysis practice

Introduction: The aim of this work was to understand between-hospital variation in thrombolysis use among emergency stroke admissions in England and Wales. Patients: A total of 88,928 patients who arrived at any of the 132 emergency stroke hospitals in England and Wales within 4 h of stroke onset, from 2016 to 2018. Methods: Machine learning was applied to the Sentinel Stroke National Audit Programme (SSNAP) data set to learn which patients in each hospital would likely receive thrombolysis. We used XGBoost machine learning models, coupled with a SHAP model for explainability; SHAP (Shapley additive explanations) values provide estimates of how patient features, and hospital identity, influence the odds of receiving thrombolysis. Results: Thrombolysis use in patients arriving within 4 h of known or estimated stroke onset ranged from 7% to 49% between hospitals. The odds of receiving thrombolysis fell 9-fold over the first 120 min of arrival-to-scan time, varied 30-fold with stroke severity, fell 3-fold with estimated rather than precise stroke onset time, fell 6-fold with increasing pre-stroke disability, fell 4-fold with onset during sleep, fell 5-fold with use of anticoagulants, fell 2-fold between 80 and 110 years of age, fell 3-fold between 120 and 240 min of onset-to-arrival time, and varied 13-fold between hospitals. The majority of the between-hospital variance was explained by the hospital itself, rather than by differences in local patient populations. Conclusions: Using explainable machine learning, we identified that the majority of the between-hospital variation in thrombolysis use in England and Wales may be explained by differences in in-hospital processes and differences in attitudes to judging suitability for thrombolysis.

Data were retrieved for 246,676 emergency stroke admissions to acute stroke teams in England and Wales for the three calendar years 2016-2018, obtained from the Sentinel Stroke National Audit Programme (SSNAP, https://www.strokeaudit.org/). Data fields were provided for the hyper-acute phase of the stroke pathway, up to and including our target feature: receipt of thrombolysis (full details of the data fields obtained are provided in the appendix). Of these patients, 88,928 arrived within 4 hours of known (precise or estimated) stroke onset, and were used in this modelling study. The data included 132 acute stroke hospitals (these were all units admitting an average of 100 patients per year, and delivering thrombolysis to at least 10 patients over the 3 years). Sixty original features from the SSNAP dataset were used. SSNAP has near-complete coverage of all acute stroke admissions in the UK (outside Scotland). All hospitals admitting acute stroke participate in the audit, and year-on-year comparison with Hospital Episode Statistics (https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics) confirms estimated case ascertainment of 95% of coded cases of acute stroke.
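The inclusion criteria above can be illustrated with a short Pandas sketch; note that the file name and column names below (`onset_to_arrival_min`, `stroke_team`, `thrombolysis`) are hypothetical stand-ins for illustration only, not the SSNAP field names used in the study.

```python
import pandas as pd

# Hypothetical file and column names, used only to illustrate the inclusion
# criteria; the study uses SSNAP field names such as StrokeTeam and S2Thrombolysis.
raw = pd.read_csv("ssnap_admissions_2016_2018.csv")

# Patients arriving within 4 hours (240 minutes) of known or estimated onset
cohort = raw[raw["onset_to_arrival_min"] <= 240].copy()

# Teams averaging 100 admissions per year (300 over the 3 years) and giving
# thrombolysis to at least 10 patients over the 3 years
admissions_per_team = raw.groupby("stroke_team").size()
thrombolysis_per_team = raw.groupby("stroke_team")["thrombolysis"].sum()
eligible_teams = admissions_per_team[
    (admissions_per_team >= 300) & (thrombolysis_per_team >= 10)].index
cohort = cohort[cohort["stroke_team"].isin(eligible_teams)]
```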

Stroke Team
• StrokeTeam: Pseudonymised SSNAP 'routinely admitting team' unique identifier. For emergency care it is expected that each hospital has one stroke team (though post-72 hour care may be reported under a different team at that hospital).

Patient - general
• Pathway: Total number of team transfers, excluding community teams
• S2StrokeType: Whether the stroke type was infarction or primary intracerebral haemorrhage
• S2TIAInLastMonth: Whether the patient had a Transient Ischaemic Attack during the last month. Item from the SSNAP comprehensive dataset questions (not mandatory)

Patient - thrombolysis given
• S2Thrombolysis: Whether the patient was given thrombolysis (clot busting medication)

Patient - reason stated for not giving thrombolysis
• Age: If the answer to thrombolysis given was "no but", the reason was Age
• Comorbidity: If the answer to thrombolysis given was "no but", the reason was Co-morbidity
• Haemorrhagic: If the answer to thrombolysis given was "no but", the reason was Haemorrhagic stroke
• Improving: If the answer to thrombolysis given was "no but", the reason was Symptoms improving
• Medication: If the answer to thrombolysis given was "no but", the reason was Medication
• OtherMedical: If the answer to thrombolysis given was "no but", the reason was Other medical reason
• Refusal: If the answer to thrombolysis given was "no but", the reason was Refusal
• TimeUnknownWakeUp: If the answer to thrombolysis given was "no but", the reason was Symptom onset time unknown/wake-up stroke
• TimeWindow: If the answer to thrombolysis given was "no but", the reason was that the patient was outside the thrombolysis time window
• TooMildSevere: If the answer to thrombolysis given was "no but", the reason was Stroke too mild or too severe

A.2 Probability, odds, and SHAP values (log odds shifts): A brief explanation
Many of us find it easiest to think of the chance of something occurring as a probability. For example, there might be a probability of 10% that it will rain today; that is the same as saying that, out of ten days with this given probability of rain, one would be expected to be rainy.
In our stroke thrombolysis model, SHAP values tell us how knowing something particular about a patient (such as the patient feature 'Is their stroke caused by a clot or a bleed?') adjusts our prediction of whether they will receive thrombolysis. This is made a little more complicated because SHAP is usually reported as a log odds shift. It is useful to see how log odds shifts relate to probabilities, and to get a sense of how significant SHAP values in the range 0.5 to 5 (or -0.5 to -5) are, as that is a common range of SHAP values in our models.

A.2.1 Probability
We will take the example where SHAP reports that a model's base prediction, before consideration of any features, is 0.25, or a 25% probability of receiving thrombolysis; that is, 1 in 4 patients with this prediction would be expected to receive thrombolysis.

A.2.2 Odds
Probability expresses the chance of something happening as the number of positive occurrences as a fraction of all occurrences (i.e. the number of patients receiving thrombolysis as a fraction of the total number of patients).
Odds express the chance of something happening as the ratio of the number of positive occurrences (i.e. receiving thrombolysis) to the number of negative occurrences (i.e. not receiving thrombolysis).
If we have a probability prediction of 0.25, that means 1 in 4 of those patients would receive thrombolysis. Expressed as odds, for every one patient who receives thrombolysis, three will not. The odds are written as 1:3 or 1/3, which may also be expressed as a decimal (1 divided by 3), 0.333.

A.2.3 SHAP values: Log odds shifts
Here we will calculate the effect of SHAP values, and try to build some intuition about the size of effect that SHAP values of 0.5 to 5 give (we will look at both positive and negative SHAP values).
SHAP usually outputs the effect of a particular feature as how much it shifts the odds. For reasons we will not go into here, that shift (which is the 'SHAP value') is usually given in 'log odds' (the logarithm of the odds value). For the mathematically inclined, we use the natural log (ln).
Let's look at some SHAP values (log odds) and see how much they change the odds of receiving thrombolysis.
First we'll look at the shift in odds that the SHAP values give. This is calculated as shift = exp(SHAP value) (table A.1). So, for example, a SHAP value of -0.5 for one particular feature tells us that that feature in that patient shifts our expected probability of that patient receiving thrombolysis from 25% to 17%. A SHAP value of -5 for the same feature would shift the probability of that patient receiving thrombolysis down to about 0.2%.
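As a minimal illustration of this arithmetic (not part of the original analysis), the conversion from a baseline probability, through odds and a SHAP log odds shift, and back to a probability can be written as:

```python
import math

def shift_probability(base_probability: float, shap_value: float) -> float:
    """Apply a SHAP log odds shift to a baseline probability."""
    odds = base_probability / (1 - base_probability)   # e.g. 0.25 -> 1/3
    shifted_odds = odds * math.exp(shap_value)          # shift applied in odds space
    return shifted_odds / (1 + shifted_odds)            # convert back to probability

print(round(shift_probability(0.25, -0.5), 3))  # ~0.17 (25% -> 17%)
print(round(shift_probability(0.25, 0.5), 3))   # ~0.36 (25% -> ~36%)
print(round(shift_probability(0.25, 5.0), 3))   # ~0.98 (25% -> 98%)
```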

A.2.4 Observations about SHAP values
We begin to get some intuition about SHAP values. A SHAP value of 0.5 (or -0.5) leads to a small, but still noticeable, change in probability. SHAP values of 5 (or -5) effectively push probabilities to one extreme or the other.

A.2.5 Limitations of SHAP
SHAP is a popular method for explaining the predictions of machine learning models, but it does have some limitations. SHAP values are an approximation of the Shapley values on which they are based, made in order to calculate them efficiently; even so, SHAP can be computationally expensive and slow for large datasets and complex models. SHAP can only help to explain the fitted model, and so it can only be as good as the model (with caveats around training data containing bias, or incomplete information). Another limitation is that the interpretation of SHAP values and their components (the main effect and interactions) can be challenging; to aid our dissemination of the findings from SHAP we have engaged with clinicians, patients and carers to learn how best to communicate this information. SHAP also assumes feature independence, that is, the assumption that the values of each feature in a dataset are independent of the values of the other features in the same dataset; in other words, the value of one feature should not have a direct effect on the value of another feature. We used feature selection to ensure that very little covariance existed between the 10 features that were included in the models.
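As an illustrative sketch only (the correlations actually used are reported in table A.4; random placeholder data is used here so the snippet runs, and Pearson correlation is assumed), pairwise correlations between selected features can be checked with Pandas:

```python
import numpy as np
import pandas as pd

# Placeholder data standing in for the 10 selected features (in the study these
# come from the SSNAP-derived feature table).
rng = np.random.default_rng(42)
X_selected = pd.DataFrame(rng.normal(size=(1000, 10)),
                          columns=[f"feature_{i}" for i in range(10)])

# Pairwise (Pearson) correlations between the selected features
correlations = X_selected.corr()

# Ignore the diagonal and report the largest absolute pairwise correlation
off_diagonal = correlations.abs().where(~np.eye(len(correlations), dtype=bool))
print(f"Largest absolute pairwise correlation: {off_diagonal.max().max():.2f}")
```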

A.3 Python libraries
All modelling and analysis was performed using Python in Jupyter Notebooks, with general analysis and plotting performed using NumPy, Pandas, Scikit-Learn, and Matplotlib.

A.4 Variation in thrombolysis use
Thrombolysis use in the original data varied between hospitals (Figure A.1), from 1.5% to 24.3% of all patients, and 7.3% to 49.7% of patients arriving within 4 hours of known stroke onset.
For the XGBoost models we used default settings, apart from the *learning rate*, which was set to 0.5 (see section A.12).
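As a hedged sketch of this configuration (synthetic placeholder data is used so the snippet runs; this is not the authors' pipeline), an XGBoost classifier with the non-default learning rate can be set up as follows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the SSNAP-derived feature table: in the study, X holds
# the patient features plus a one-hot encoded hospital ID, and y records whether
# thrombolysis was given.
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 10))
y = (rng.random(5000) < 0.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Default XGBoost settings, apart from the learning rate, which is set to 0.5
model = XGBClassifier(learning_rate=0.5)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```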

A.8 Model accuracy
Model accuracy was measured using stratified 5-fold cross-validation. The key results are shown in table A.5. We found an overall accuracy of 85.0% (balanced accuracy is reported in table A.5). The predicted thrombolysis rate of 29.4% was very close to the observed thrombolysis rate of 29.6%.
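A minimal sketch of this kind of stratified 5-fold evaluation, reusing the synthetic `X` and `y` from the previous snippet (again, not the study's exact pipeline):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

# Stratified 5-fold cross-validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracies, balanced, predicted_rates = [], [], []

for train_idx, test_idx in kfold.split(X, y):
    model = XGBClassifier(learning_rate=0.5)
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    accuracies.append(accuracy_score(y[test_idx], y_pred))
    balanced.append(balanced_accuracy_score(y[test_idx], y_pred))
    predicted_rates.append(y_pred.mean())   # predicted thrombolysis rate

print(f"Mean accuracy: {np.mean(accuracies):.3f}")
print(f"Mean balanced accuracy: {np.mean(balanced):.3f}")
print(f"Mean predicted thrombolysis rate: {np.mean(predicted_rates):.3f}")
```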

A.9 Model calibration
The model calibration was checked by binning predictions by probability, and comparing the mean predicted probability in each bin with the fraction of patients who were actually positive (table A.6 and figure A.4). In a well-calibrated model, the average predicted probability of receiving thrombolysis in each bin should be close to the proportion of patients who actually received thrombolysis. Results demonstrated that the model was naturally well calibrated, and was not in need of any calibration correction. As expected, the fraction of predictions that were correct is related to the predicted probability of receiving thrombolysis: when predictions were close to a 50% probability of receiving thrombolysis the model was correct about 50% of the time, whereas when the model predicted less than a 10% or greater than a 90% probability of receiving thrombolysis, the model was correct about 90% of the time.
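A short sketch of this binning check, continuing the earlier train/test snippet and using scikit-learn's `calibration_curve` as a stand-in for the binning used in the study:

```python
from sklearn.calibration import calibration_curve

# Out-of-sample predicted probabilities from the fitted model in the earlier sketch
y_prob = model.predict_proba(X_test)[:, 1]

# Bin predictions into 10 probability bins and compare the mean predicted
# probability in each bin with the observed fraction receiving thrombolysis.
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
for predicted, observed in zip(prob_pred, prob_true):
    print(f"mean predicted {predicted:.2f} -> observed fraction {observed:.2f}")
```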
Nearly 50% of patients fell in the 0-10% probability bin; that is, the model gave a confident prediction that these patients would not receive thrombolysis, and the model was correct in these predictions 98% of the time. We also examined the variation in predicted probability of receiving thrombolysis across 30 bootstrapped models for a 10k patient cohort, recording the standard deviation of the predicted probability for each patient (figure A.5). The mean of these standard deviations was 0.057, but the variation depended on the probability, peaking at about 0.13 when the predicted probability of receiving thrombolysis was around 0.5.
Additionally, we used the models and the test set to predict thrombolysis use at each of the 132 hospitals, as if the 10k cohort of patients had attended each hospital (by changing the one-hot hospital encoding; figure A.6). We predicted the thrombolysis use at each hospital and examined the variation between the 30 bootstrapped models. The mean standard deviation across bootstrap replicates was 1.7% (where hospital thrombolysis use rates ranged from 10% to 45%).
Bagging experiments were repeated with Bayesian bootstrapping, based on weighting training samples using a Dirichlet distribution. Very similar results were achieved, with a mean standard deviation of bootstrap-replicate probability predictions of 0.054, and a mean standard deviation of bootstrap-replicate 10k thrombolysis use in hospitals of 1.6%.
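A minimal sketch of the Dirichlet weighting scheme (reusing the synthetic `X_train`, `y_train` and `X_test` from the earlier snippets; an illustration of the idea rather than the authors' code):

```python
import numpy as np
from xgboost import XGBClassifier

# Bayesian bootstrap: instead of resampling rows, each replicate weights the
# training samples with draws from a flat Dirichlet distribution (scaled so the
# mean weight is 1).
rng = np.random.default_rng(0)
replicate_probs = []

for _ in range(30):
    weights = rng.dirichlet(np.ones(len(X_train))) * len(X_train)
    model = XGBClassifier(learning_rate=0.5)
    model.fit(X_train, y_train, sample_weight=weights)
    replicate_probs.append(model.predict_proba(X_test)[:, 1])

# Patient-level spread of predicted probability across the 30 replicates
probs = np.stack(replicate_probs)
print(f"Mean SD of predicted probabilities: {probs.std(axis=0).mean():.3f}")
```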
The evaluation of bootstrapped replicates gave us confidence that a single model fit would be sufficient.

A.11 Learning curves
Learning curves evaluate the relationship between training set size and model accuracy. Learning curves were produced using stratified 5-fold cross-validation, with random sampling (without replacement) of the training set (figure A.7). The maximum accuracy achieved was 85%, using 70k training instances; 82.5% accuracy was achieved with 4k training instances. There was only a shallow improvement between 4k and 70k training points.
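As a rough sketch of how such a learning curve can be produced (again assuming the synthetic `X` and `y` from the earlier snippets; not the study's exact code):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

# Subsample the training fold at increasing sizes (without replacement) and
# record accuracy on the held-out fold.
train_sizes = [1000, 2000, 4000]
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rng = np.random.default_rng(42)

for size in train_sizes:
    scores = []
    for train_idx, test_idx in kfold.split(X, y):
        subsample = rng.choice(train_idx, size=min(size, len(train_idx)),
                               replace=False)
        model = XGBClassifier(learning_rate=0.5)
        model.fit(X[subsample], y[subsample])
        scores.append(model.score(X[test_idx], y[test_idx]))
    print(f"{size} training instances: mean accuracy {np.mean(scores):.3f}")
```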

A.12 Fine-tuning of model regularisation
As hospital ID is encoded as one-hot, and there are 132 hospitals, it is possible that the effect of hospital ID becomes 'regularised out', especially as, for each one-hot encoded column, about 99% of the values are zero. As we are concerned with differences between hospitals, we did not want to over-regularise the model. To optimise the *learning rate* we looked at the between-hospital variation in predicted thrombolysis use for a 10k cohort of patients (with the model predicting the use of thrombolysis in each hospital for the same 10k cohort). The model was trained on the remaining 78,928 patients, with varying learning rates (figure A.8 and table A.7).
Reducing the learning rate below 0.5 led to reduced between-hospital variation in the predicted use of thrombolysis, suggesting that the effect of hospital ID was being reduced by over-regularisation.
A learning rate of 0.5 was chosen for all modelling (including the accuracy measurements above).
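A hedged sketch of this tuning loop is shown below; it builds its own small synthetic dataset (20 hospitals rather than 132) purely so the snippet runs, and illustrates the idea of switching the one-hot hospital encoding to send the same cohort to each hospital:

```python
import numpy as np
from xgboost import XGBClassifier

# Synthetic stand-in data: 6 clinical features plus one-hot encoding for
# 20 hospitals (the study uses 10 features and 132 hospitals).
rng = np.random.default_rng(1)
n, n_hospitals = 20000, 20
clinical = rng.normal(size=(n, 6))
hospital_ids = rng.integers(0, n_hospitals, size=n)
X = np.hstack([clinical, np.eye(n_hospitals)[hospital_ids]])
y = (rng.random(n) < 0.3).astype(int)
hospital_cols = np.arange(6, 6 + n_hospitals)

X_train, y_train, cohort_10k = X[:10000], y[:10000], X[10000:].copy()

# For each candidate learning rate, predict thrombolysis use for the same
# cohort "attending" each hospital, then summarise the between-hospital spread.
for learning_rate in [0.1, 0.25, 0.5, 1.0]:
    model = XGBClassifier(learning_rate=learning_rate)
    model.fit(X_train, y_train)

    hospital_rates = []
    for col in hospital_cols:
        cohort = cohort_10k.copy()
        cohort[:, hospital_cols] = 0   # clear the existing hospital encoding
        cohort[:, col] = 1             # send the whole cohort to one hospital
        hospital_rates.append(model.predict(cohort).mean())

    print(f"learning_rate={learning_rate}: "
          f"SD of predicted hospital use = {np.std(hospital_rates):.3f}")
```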

Figure A.1: Histogram of observed thrombolysis use in 132 hospitals. Left: thrombolysis shown as a percentage of all emergency stroke admissions. Right: thrombolysis shown as a percentage of those patients who arrived at hospital within 4 hours of known stroke onset.

Figure A.2: The effect of increasing the number of features on model accuracy, measured by Receiver Operating Characteristic (ROC) Area Under Curve (AUC). Left: improvement in ROC AUC with selection of up to 25 features. Right: improvement in ROC AUC with selection of the best 10 features. ROC AUC was measured with stratified 5-fold cross-validation; results show the mean of the 5 k-fold replicates.

Figure A.3 shows the receiver operating characteristic curve, along with the trade-off between sensitivity and specificity.

Figure A.3: Model accuracy of an XGBoost model using 10 features. Left: Receiver Operating Characteristic (ROC) curve, with Area Under Curve (AUC). Right: the trade-off between sensitivity and specificity. Accuracy was measured with stratified 5-fold cross-validation, and both charts show all 5 k-fold replicates.

Figure A.4: Calibration check of the model. Left: the proportion of patients receiving thrombolysis for binned predicted probability of receiving thrombolysis. Right: the proportion of predictions in each bin (blue), and the proportion of predictions that are correct (orange). Plots show results for all 5 k-fold replicates.

Figure A.5: Standard deviation of the predicted probability of receiving thrombolysis, from 30 bootstrapped models predicting the probability of receiving thrombolysis for 10k patients. Results are binned by predicted probability.

Figure A.6: Mean and standard deviation of predicted thrombolysis use at 132 hospitals from 30 bootstrapped models. Results are for predicted thrombolysis use for the same 10k patient cohort at each hospital. In addition to the results for the bagged models, the predicted thrombolysis use from a single model without bootstrap sampling is shown. Results are ordered by the thrombolysis use at each hospital predicted by the single non-bootstrap model.

Figure A.7: Learning curves showing the relationship between training set size and model accuracy. Left: training set sizes up to 70k. Right: training set sizes up to 5k (the same results as on the left, over a restricted range). Results are shown for all 5 k-fold replicates.

Figure A.8: Effect of adjusting the XGBoost learning rate on the distribution of predicted thrombolysis use across 132 hospitals. A narrower distribution indicates that hospital thrombolysis rates are tending towards the mean hospital thrombolysis rate.
• S1Ethnicity: Patient ethnicity, aggregated to White, Black, Mixed, Asian and Other
• S1AdmissionQuarter: Year quarter (Q1: Jan-Mar; Q2: Apr-Jun; Q3: Jul-Sep; Q4: Oct-Dec)
• S1AdmissionYear: Year of admission
• S2BrainImagingTime min: Time from Clock Start to brain scan, in minutes. "Clock Start" is used throughout SSNAP reporting to refer to the date and time of arrival at the first hospital for newly arrived patients, or to the date and time of symptom onset if the patient was already in hospital at the time of their stroke.
• S2ThrombolysisTime min: Time from Clock Start to thrombolysis, in minutes (Clock Start as defined above).

Patient - comorbidities
• CongestiveHeartFailure: Pre-stroke congestive heart failure
• Hypertension: Pre-stroke systemic hypertension
• AtrialFibrillation: Pre-stroke atrial fibrillation (persistent, permanent, or paroxysmal)
• Diabetes: Pre-stroke diabetes mellitus
• StrokeTIA: Pre-stroke history of stroke or Transient Ischaemic Attack (TIA)
• AFAntiplatelet: Only available if "Yes" to the atrial fibrillation comorbidity. Whether the patient was on antiplatelet medication prior to admission
• AFAnticoagulent: Prior to 01-Dec-2017, only available if "Yes" to the atrial fibrillation comorbidity
• Visual: National Institutes of Health Stroke Scale item 3, visual fields (higher values indicate more severe deficit)

Patient - other clinical features
• S2INR: Patient's International Normalised Ratio (INR) on arrival at hospital (available since 01-Dec-2017)
• S2INRHigh: INR was greater than 10 on arrival at hospital (available since 01-Dec-2017)
• S2INRNK: INR not checked (available since 01-Dec-2017)
• S2NewAFDiagnosis: Whether a new diagnosis of atrial fibrillation was made on admission
• S2RankinBeforeStroke: Patient's modified Rankin Scale score before this stroke (higher values indicate more disability)

Table A.1: The relationship between odds and log odds.

Now let us work through an example: starting with a known baseline probability (before we consider what we know about a particular patient feature), converting that to odds, applying a SHAP log odds shift for that particular feature, and converting back to a probability after we have applied the influence of that feature. The effects of these shifts on our baseline probability of 0.25 are shown in table A.2. A SHAP value of 0.5 for one particular feature tells us that that feature in that patient shifts our expected probability of that patient receiving thrombolysis from 25% to 36%; a SHAP value of 5 for the same feature would shift the probability of that patient receiving thrombolysis up to 98%. If we have a negative SHAP value then the odds are reduced (a SHAP value of -1 leads to the odds being divided by 2.72, which is the same as multiplying by 1/2.72, or 0.3679), as shown in table A.3.

Table A.2: The effect of SHAP values between 0.5 and 5 on a base probability of 0.25.

Table A.4: Correlations between the 10 features selected for the XGBoost machine learning model.

Table A.5: Accuracy of the 10 feature XGBoost model in predicting thrombolysis use in patients arriving at hospital within 4 hours of known stroke onset.

Table A.6: Model calibration based on binning by predicted probability of thrombolysis.

Table A.7: Statistics on the variation in predicted thrombolysis use between hospitals, with varying learning rate.