A comparative analysis of Patient-Reported Expanded Disability Status Scale tools

Background: Patient-Reported Expanded Disability Status Scale (PREDSS) tools are an attractive alternative to the Expanded Disability Status Scale (EDSS) during long-term or geographically challenging studies, or in pressured clinical service environments. Objectives: Because the studies reporting these tools have used different metrics to compare the PREDSS and EDSS, we undertook an individual patient data level analysis of all available tools. Methods: Spearman’s rho and the Bland–Altman method were used to assess correlation and agreement, respectively. Results: A systematic search for validated PREDSS tools covering the full EDSS range identified eight such tools. Individual patient data were available for five PREDSS tools. Excellent correlation was observed between EDSS and PREDSS with all tools. A higher level of agreement was observed with increasing levels of disability. In all tools, the 95% limits of agreement were greater than the minimum EDSS difference considered to be clinically significant. However, the intra-class correlation coefficient was greater than that reported for EDSS raters of mixed seniority. The visual functional system was identified as the most significant predictor of the PREDSS–EDSS difference. Conclusion: This analysis will (1) enable researchers and service providers to make an informed choice of PREDSS tool, depending on their individual requirements, and (2) facilitate improvement of current PREDSS tools.


Introduction
Kurtzke introduced the Expanded Disability Status Scale (EDSS) in 1983 1 as a revision of his initial 1955 Disability Status Scale, 2 to provide a valid and comprehensive assessment of multiple sclerosis (MS)-related disability, for which it remains the gold-standard tool despite its limitations. 3 EDSS scores range from 0 to 10 in 0.5-point steps, with 0 being no impairment and 10 being death from MS. At low levels of disability (scores from 0 to 3.5), the EDSS score is determined by neurological examination, while at high levels of disability (scores ⩾ 5.5), it is primarily influenced by ambulation and dependence on help in daily activities. EDSS scores between 4.0 and 5.0 are reached by combinations of neurological examination, functional status and ambulation assessment.
Physician-determined EDSS (henceforth referred to as EDSS) is time-consuming and expensive, and it restricts assessment to clinic visits, which may be infrequent or impractical. Several tools have been developed to enable patients to report their own EDSS score, that is, a Patient-Reported EDSS (henceforth referred to as PREDSS). [4][5][6][7][8][9][10][11] PREDSS is potentially useful in various situations, such as patient follow-up during long-term or geographically challenging studies where clinic attendance is difficult, or to enable EDSS assessment in busy or under-staffed clinical service environments.
There are two clinical scenarios where PREDSS may be employed instead of the EDSS:
1. In the first scenario, PREDSS and EDSS are used interchangeably and therefore it is important to have agreement between the two. In this case, agreement statistics would be relevant.
There are several measures of agreement. Percentage agreement is a useful and directly intuitive measure, but it does not correct for chance. Cohen's kappa statistic is the proportion of agreement after having allowed for that expected by chance. The weighted kappa coefficient additionally assigns a weight to the distance between disagreements. The value of kappa is dependent on the prevalence of the scores within a particular population. 12 The intra-class correlation coefficient (ICC) measures the proportion of total variance that is due to differences between patients (with the rest being the variance due to differences in the scales being compared); therefore, its size depends on the variability in the sample. 13 The Bland-Altman method visualizes the data and more openly describes agreement, instead of attempting to summarize agreement as a statistic. 14 It is now recognized that the Bland-Altman method is the most appropriate way to assess agreement, and as a result, it has become the most frequently used method. 15 The differences between the two scores are plotted against the reference or 'gold standard' method (in this case, the EDSS). Horizontal lines are drawn at the mean difference, and at the 95% limits of agreement, which are defined as the mean difference plus and minus 1.96 times the standard deviation of the differences. If the PREDSS consistently underscores or overscores the EDSS, and the difference between the 95% limits of agreement is not clinically significant, a correction factor (the mean difference) may be used to enable interchangeability between PREDSS and EDSS. It is accepted that EDSS change is clinically significant if its magnitude is at least 1.0 point on Kurtzke's EDSS in patients with an EDSS score of 5.5 or lower, or 0.5 point in patients with a higher EDSS score. 16
2. In the second scenario, PREDSS is the only tool used to serially assess patients in a clinic or study. Here, agreement of scores between PREDSS and EDSS is less necessary, but it is important that the linear relationship between PREDSS and EDSS is as good as possible with respect to strength and direction. In this case, correlation statistics would be relevant.
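The limits of agreement described under the first scenario can be sketched in a few lines of Python. This is only an illustration, not the analysis code used in this study (which was run in SPSS); the function name `bland_altman` and the paired scores below are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(predss, edss):
    """Return (mean difference, lower limit, upper limit) for paired scores.

    Differences are taken as PREDSS minus EDSS, so a positive mean
    difference means the patient-reported score overestimates the EDSS.
    """
    if len(predss) != len(edss) or len(predss) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    diffs = [p - e for p, e in zip(predss, edss)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired scores, for illustration only
edss = [1.0, 2.0, 3.5, 4.0, 6.0, 6.5]
predss = [2.0, 2.5, 4.0, 4.0, 6.0, 6.5]
bias, lower, upper = bland_altman(predss, edss)
```

The quantity compared against the clinically significant EDSS change is the distance `upper - lower`, not the mean difference itself.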
It is difficult to compare the PREDSS tools with each other since the original study reports used different metrics to compare PREDSS and EDSS scores. This study aims to make a head-to-head comparison of the tools for which the original individual patient level data were available, thus enabling researchers or clinicians to make a well-informed decision in choosing a PREDSS tool that best suits their needs in a particular setting.

These phrases were searched in combination and independently. The outcomes of these searches were inspected by three authors (I.G., C.C. and B.I.) for the inclusion criteria of: (1) patient-reported EDSS, (2) physician-assessed EDSS score and (3) inclusion of all levels of disability. The authors of eligible studies were invited to participate as co-authors, dependent on the availability of their studies' raw data.

Statistical analysis
All analyses were performed in SPSS v. 22. On receipt of the data, the identity of the studies was masked using a coding system so that the analysis was blinded. The distribution, mean and variance of data from all studies were compared in order to help guide the correct choice of statistical analysis; this was performed visually and using one-way analysis of variance (ANOVA) for means and Levene's test for variances. Spearman's rho was used for correlation. Bland-Altman analysis was employed to assess agreement; the gold-standard EDSS was plotted on the x-axis. The relationship of EDSS and tool identity with the PREDSS-EDSS difference was explored using analysis of covariance (ANCOVA) within the General Linear Model. For stepwise multivariate linear regression, standard assumptions were met. Significant difference from the null hypothesis was considered to be present when p < 0.05.
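As an illustration of the correlation step, Spearman's rho can be computed as the Pearson correlation of the ranked scores, with tied values sharing the mean of their ranks (ties are frequent on a 0.5-step scale such as the EDSS). The sketch below is a plain-Python approximation of what a statistics package does, not the SPSS procedure itself; all function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman_rho(edss_scores, predss_scores)` would be called with two equal-length lists of paired scores (both names are placeholders).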

Literature search
The systematic literature search resulted in 423 publications. Eight publications met the inclusion criteria for this study. The first and last authors of each publication were invited to participate by providing a copy of the raw data, which included the physician-assessed EDSS scores, PREDSS scores and functional system (FS) scores. At least one author for each publication responded to the invitation. Data were unavailable for three of the eight studies. [9][10][11]

PREDSS tool study characteristics
Table 1 presents the main characteristics of the studies. The tools were developed over a period of 15 years, studying a total of 460 patients. Three of the studies deployed their questionnaire directly to the patients in a printed format, 4-6 while one was assessed using an online electronic format, 8 and another was deployed via telephone. 7 The basic design concept is similar among the tools, using a combination of dichotomous, multiple choice or Likert-type questions to assess each of the FS scores within the EDSS, as well as ambulation and dependence on help in daily activities; exceptions are Tool 4, which does not use Likert-type questions, and Tool 5, which also includes some scaling questions asking patients to give percentages. An FS score is generated for each FS, and from this the overall EDSS is calculated. The way in which information about neurological symptoms and functional status was collected differed between tools; this is described in detail in the Supplementary material. In all studies, physician EDSS was performed by raters working in the field of MS, in established centres; raters were all trained and assessed, and in two of the studies this was done using a standardized audiovisual package (Neurostatus). 7,8 Study populations were predominantly female, ranging from 56% to 82%, and approximately half the cases were relapsing-remitting MS. There was no difference between studies with respect to gender, MS type or age.
Sample size was similar between studies except for Tool 5 which had a very small sample size of 30 patients. There were significant differences in the mean EDSS and its variance across studies.

Using PREDSS interchangeably with EDSS: agreement
In Clinical Scenario 1, agreement between EDSS and PREDSS would be needed for interchangeability during data collection, or comparison between datasets.
Of the three statistical methods used to assess reliability, the Bland-Altman analysis was considered to be the most suited; it enables direct visualization.
Bland-Altman analysis provides a numerical and pictorial estimate of the differences and their 95% limits of agreement. The Bland-Altman plots for EDSS-PREDSS agreement across the whole EDSS range are depicted in Figure 1, which shows a tendency for less agreement at lower levels of disability. The Bland-Altman data, across the whole EDSS range and for EDSS ⩽ 5.5 and > 5.5, are listed in Tables 2-4; this division was necessary since the minimum clinically significant change in EDSS is different in these two disability categories. For EDSS ⩽ 5.5, all the tools overestimated the EDSS (mean difference of all tools combined = 0.51), while for EDSS > 5.5, there was a tendency to slightly underestimate the EDSS (mean difference of all tools combined = -0.02). PREDSS can be corrected for over- or underestimation of the EDSS by subtracting or adding the mean difference, respectively, with 95% confidence that the real value of the EDSS lies between the 95% limits of agreement shown on the Bland-Altman plots. Hence, the 95% limits of agreement are more crucial than the mean difference. For all tools, the difference between the 95% limits of agreement exceeded the EDSS change that is considered to be clinically meaningful. Hence, none of the tools can be used interchangeably with the physician-derived EDSS. For EDSS ⩽ 5.5, where a change of ⩾1 is considered to be meaningful, the smallest difference between the 95% limits of agreement was three times higher (3.09, Tool 5). For EDSS > 5.5, where a change of 0.5 is considered to be meaningful, the smallest difference between the 95% limits of agreement was nearly twice as much (0.85, Tool 2).
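The comparison made in this section can be written out as a short worked example. The sketch below uses only the figures quoted above (mean difference 0.51, limit widths 3.09 and 0.85, thresholds 1.0 and 0.5); the function names are hypothetical, and the representative EDSS values 3.0 and 6.0 are chosen purely to select the disability band.

```python
def minimum_meaningful_change(edss):
    """Clinically significant EDSS change, using the thresholds given in the text."""
    return 1.0 if edss <= 5.5 else 0.5


def correct_predss(predss, mean_difference):
    """Remove the systematic over- or underestimation (PREDSS minus EDSS)."""
    return predss - mean_difference


def loa_width_ratio(loa_width, threshold):
    """How many times wider the 95% limits of agreement are than the threshold."""
    return loa_width / threshold


# Smallest limit widths reported in each disability band
low_band = loa_width_ratio(3.09, minimum_meaningful_change(3.0))   # Tool 5, EDSS <= 5.5
high_band = loa_width_ratio(0.85, minimum_meaningful_change(6.0))  # Tool 2, EDSS > 5.5

# Correcting a hypothetical PREDSS of 4.0 with the pooled mean difference for EDSS <= 5.5
corrected = correct_predss(4.0, 0.51)
```

A ratio above 1 in either band means that, even after correction for the mean difference, the residual uncertainty exceeds a clinically meaningful change.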

Putting PREDSS-EDSS agreement in context: comparison with EDSS inter-rater agreement
To put the PREDSS in context, the agreement between EDSS and PREDSS was compared with published inter-rater and intra-rater agreement data for the EDSS.

PREDSS-EDSS agreement varies with tool identity and EDSS
The Bland-Altman analysis showed different levels of PREDSS-EDSS agreement among studies. In addition, agreement was better at the higher levels of disability. Agreement was not also better at the lower end of the EDSS scale, which argues against a floor or ceiling effect. This suggested that PREDSS-EDSS agreement was dependent on the extent of disability. ANCOVA, using tool identity as a fixed factor and EDSS as a covariate, against the PREDSS-EDSS difference as the dependent variable, showed that both EDSS and tool identity significantly affected the variance in PREDSS-EDSS difference. The contribution of EDSS (4.7%) to the variance of the PREDSS-EDSS difference was roughly double that of the tool identity (2.6%).

The contribution of individual FS scores to PREDSS-EDSS agreement
To explore the relative contribution of FS scores to the PREDSS-EDSS difference, the ANCOVA was repeated with tool identity as a fixed factor and EDSS and FS differences as covariates, against the PREDSS-EDSS difference as the dependent variable. Most, but not all, cases had FS score data available (n = 383). Tool identity and EDSS maintained a significant relationship with the PREDSS-EDSS difference. The pyramidal, cerebellar and visual FS score differences significantly affected the variance in PREDSS-EDSS difference, indicating that the differences between physicians and patients in the scoring of these domains were contributing to the overall difference in scoring between the PREDSS and EDSS. Stepwise multivariate linear regression of the EDSS and functional score differences against the PREDSS-EDSS difference within individual studies identified the visual domain as the most common FS significantly affecting the PREDSS-EDSS difference, with substantial standardized beta coefficients (Table 6).

Using PREDSS on its own: correlation
Clinical Scenario 2, described in the 'Introduction' section, does not require agreement between the PREDSS and EDSS. In this scenario, correlation between PREDSS and EDSS would indicate the ability of PREDSS to substitute for EDSS, as long as PREDSS is used throughout the data collection, and no external comparison is made to EDSS datasets.
The output of all the PREDSS tools correlated highly with the EDSS (Table 7). The highest correlation coefficients were seen with Tools 2, 4 and 5. Correlation differed markedly across disability categories in most studies. The highest coefficients were seen in Tool 2, which also exhibited the least variation of correlation between disability categories.
In order to determine how FS scores contributed to the correlation between the PREDSS and EDSS, correlation coefficients between patient- and physician-derived scores were computed for all FS scores (Table 7). One-way ANOVA showed that there was a significant difference in correlation coefficients between FS scores (p = 0.004). Dunnett's post-hoc analysis confirmed the mental, visual and brainstem domains as having statistically significantly lower correlation coefficients.

Clinical Scenario 1
This clinical scenario is where agreement is required between PREDSS and EDSS, that is, when the PREDSS and EDSS are used interchangeably, whether this is a research or clinical service setting.
Bland-Altman analysis showed that most tools performed better at higher EDSS. Using the EDSS score as the gold standard for the measurement of disability throughout its range, three reasons could explain the effect of EDSS on PREDSS scoring. First, PREDSS may be easier to score as disability levels rise, for instance, if patients become more aware of their disability because of having a more severe condition for longer. Second, the use of ambulation capacity in the higher EDSS scoring categories may allow for better performance of PREDSS because patient report of ambulation status better matches physician-assessed ambulation capacity (especially if the latter is derived by asking the patient). Third, EDSS in the range of 0 to 3.5 is particularly prone to inter-rater disagreement compared to the higher range, 19,24 possibly because the combination of FS scores means there are more opportunities to have a poorer correlation; therefore, the disagreement between PREDSS and EDSS at the low end of the scale may reflect the inherent uncertainty in this region.
Strictly speaking, none of the PREDSS tools can be used interchangeably with the EDSS, since the Bland-Altman 95% limits of agreement were wider than the minimum clinically significant EDSS change; this was the case in all tools, across all EDSS categories. Tool 2, in the setting of an EDSS > 5.5, was closest, giving the user 95% confidence that a corrected PREDSS was within 0.85 EDSS points of the physician-derived EDSS.

Clinical Scenario 2
This clinical scenario is where agreement is not required between PREDSS and EDSS, but changes on the two scales need to be comparable, that is, when the PREDSS is used instead of the EDSS and comparability needs to be retained with respect to rate of disability progression (ratio of change), whether this is a research or clinical service setting.
All the PREDSS tools correlated highly with EDSS. It is important to emphasize that correlation is not a measure of agreement; 26 it tests the presence of a relationship between two variables, and the strength and direction of this relationship. Hence, the high correlation demonstrates that PREDSS can replace the physician-derived EDSS in serial measurements for the sole purpose of ensuring proximity of percentage changes between PREDSS and EDSS, but not absolute values of scores or score differences. Correlation coefficients varied depending on the disability level and therefore one may want to select the tool that best suits their application, using Table 7.
Agreement statistics (used in Clinical Scenario 1) and correlation (used in Clinical Scenario 2) measure different entities. 26 Hence, if there is high agreement, then correlation must be high, but the reverse is not necessarily true, as happened here. Agreement statistics assess to what extent scoring is identical, while correlation statistics measure the relationship between the scores, irrespective of agreement.
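The distinction between correlation and agreement can be made concrete with a small, entirely hypothetical example in which every patient-reported score is exactly 1.0 point above the physician score. The sketch below uses the textbook rank-difference formula for Spearman's rho, which is valid only when there are no tied values; the function names and data are illustrative assumptions, not study data.

```python
def ranks_no_ties(values):
    """Ranks 1..n for a list with no repeated values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_no_ties(x), ranks_no_ties(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores: PREDSS is always exactly 1.0 point above EDSS
edss = [1.0, 2.0, 3.0, 4.5, 6.0]
predss = [e + 1.0 for e in edss]

rho = spearman_no_ties(edss, predss)
mean_diff = sum(p - e for p, e in zip(predss, edss)) / len(edss)
```

Here the rank correlation is perfect even though no single pair of scores agrees, which is the situation in which a tool may suit Clinical Scenario 2 but not Clinical Scenario 1.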

Future directions
Improved versions of these tools should concentrate on the way that pyramidal, cerebellar, brainstem, mental and visual FS scores are scored, since these domains were identified as significant contributors to disagreement and lack of correlation between the PREDSS and EDSS. Among these, the visual FS deserves most attention, since it performed poorly in both Clinical Scenarios (i.e. agreement and correlation). The identification of the visual FS as a major contributor to disagreement with EDSS presents a real opportunity for improvement of PREDSS tools since a smartphone/tablet-based visual acuity testing app, validated for clinical and community-based practice, is now available. 27