The reliability of the graded Wolf Motor Function Test for stroke

Introduction: The graded Wolf Motor Function Test assesses upper limb function following stroke. Its clinical utility is limited by the requirement to video record for scoring purposes. This study aimed to (a) assess whether video recording is required through examination of inter-rater reliability and agreement; and (b) assess intra-rater reliability and agreement.
Method: A convenience sample of 30 individuals was recruited following stroke. The graded Wolf Motor Function Test was administered within 2 weeks of rehabilitation commencement and at 3 months. Two occupational therapists scored participants through either direct observation or video. Inter- and intra-rater reliability and agreement were examined for item-level and summary scores.
Results: Excellent inter-rater reliability (n = 28) was found between scoring through direct observation and by video (intraclass correlation coefficients >0.9), and excellent intra-rater reliability (n = 21) was found (intraclass correlation coefficients >0.9) for item-level and summary scores. Low agreement was found between raters at the item level. Adequate agreement was found for total functional ability, with increased measurement error found for total performance time.
Conclusion: The graded Wolf Motor Function Test is a reliable measure of upper limb function. Video recording may not be required by therapists. In view of low agreement, future studies should assess the impact of standardised training.


Introduction
Upper limb impairment is common following stroke (Lawrence et al., 2001), with survivors generally experiencing a combination of reduced motor control, reduced coordination and somatosensory deficits (Lang et al., 2013). With links to increased dependence in daily life activities (Lang et al., 2013), improvement in upper limb motor control and function is central to stroke rehabilitation (Pollock et al., 2014).
Choice of outcome measure has been identified as one of the top three research priorities for improving clinical trials (Smith et al., 2014). Currently, various upper limb outcome measures are recommended according to treatment modality (Sivan et al., 2011), sample group or setting (Langhorne et al., 2011), with no consensus demonstrated in the guidelines (Intercollegiate Stroke Working Party, 2016). The use of standardised outcome measures is essential for evidence-based occupational therapy practice and promoted across occupational therapy guidelines (Association of Canadian Occupational Therapy Regulatory Organizations, 2011; College of Occupational Therapists, 2017; Occupational Therapy Australia, 2018).
The Wolf Motor Function Test (WMFT) was developed to measure upper limb motor activity following stroke and traumatic brain injury (Wolf et al., 1989). Demonstrating adequate psychometric properties among people who have had a stroke (Lin et al., 2009; Morris et al., 2001; Wolf et al., 2001), the WMFT has become a widely used and recommended assessment of upper limb activity (Alt Murphy et al., 2015; Santisteban et al., 2016). The WMFT is recommended for individuals with mild to moderate upper limb impairment (Taub et al., 2011) and is most sensitive to those with a higher level of motor function (Thompson-Butel et al., 2014; Wolf et al., 2001), with floor effects found when used in the early stages of stroke (Lin et al., 2009). The graded Wolf Motor Function Test (gWMFT) was developed for accurate assessment of moderate to severe upper limb impairment (Constraint Induced Movement Therapy Research Group, 2002). The WMFT and gWMFT are conducted in real time with performances video recorded to reduce measurement error when scoring this complex assessment (Constraint Induced Movement Therapy Research Group, 2002; Taub et al., 2011).
A systematic review explored the clinical application and psychometric properties of the gWMFT reported in the literature (Turtle et al., 2019). This review found that the gWMFT was a secondary outcome measure in 11 clinical trials, with two versions of the outcome measure reported: the 14-item gWMFT and the more recent 13-item gWMFT. The studies included in the review were predominantly of low quality due to inconsistencies in how the gWMFT was administered and scored, with some authors adapting it to meet study objectives (Bonifer et al., 2005; Iwamuro et al., 2011; Triandafilou and Kamper, 2014).
Reliability of the two versions of the gWMFT has been assessed across two studies. The 14-item gWMFT was assessed by Bonifer et al. (2005), who found a high level of intra-rater reliability for scoring functional ability in 20 individuals more than 12 months post-stroke. Pereira et al. (2015) found a high level of inter-rater reliability for scoring functional ability and performance time using a Brazilian Portuguese version of the 13-item gWMFT in 10 individuals in the chronic stage of stroke. With no further psychometric evaluation of the gWMFT reported, the gWMFT has limited utility in clinical practice and research. For a more detailed review of the application and psychometric properties of the graded Wolf Motor Function Test, see Turtle et al. (2019).
As noted previously, authors of the gWMFT recommend the use of video recording for scoring participants (Constraint Induced Movement Therapy Research Group, 2002). However, this adds to the burden of delivery and may not be appropriate for use in clinical practice, with evidence suggesting video recording the WMFT is not required for accurate scoring (Whitall et al., 2006). Therefore, the aims of the current study were to investigate inter- and intra-rater reliability and agreement, and internal consistency for the gWMFT in a sub-acute stroke population (within 3 months of stroke onset).

Method
This study is presented based on the published guidelines for reporting reliability and agreement (Kottner et al., 2011). Ethical approval was granted by the Office for Research and Ethics Committees (Ref:14/NI/1149). All participants provided written informed consent.

Participants
Thirty individuals in the sub-acute phase of stroke recruited to an ongoing pilot randomised controlled trial formed the sample (ClinicalTrials.gov: NCT02276729).
Inclusion criteria were: adults aged 18 years or over and recently admitted to an inpatient rehabilitation ward; stroke diagnosis within 3 months with upper limb motor loss and upper limb rehabilitation a key component of treatment; able to understand and follow two-part verbal and written commands in the English language; and able to provide written consent. Exclusion criteria were: having had a previous stroke or gross cognitive impairment.

Raters
Rater one and rater two were research occupational therapists. The therapists were employed solely to collect outcome measures on the trial and had no clinical relationship with the participants. Training for both raters involved reviewing the manual (Constraint Induced Movement Therapy Research Group, 2002) and viewing training videos, the scoring of which was verified by occupational therapists experienced in the clinical administration of the outcome.

Outcome measure
The gWMFT assesses timed performance and quality of movement (Constraint Induced Movement Therapy Research Group, 2002). The gWMFT consists of 13 graded test items (Appendix 1) (Constraint Induced Movement Therapy Research Group, 2002) and takes approximately 40 minutes to administer. Video recording of the gWMFT is recommended to enable retrospective scoring of functional ability. A template can be purchased from the test's authors to standardise placement of the 13 test items.

Video recording
Test items 1 to 8 require placement of the video camera to the side of the template, 3 feet to the side of the participant being tested, allowing the view of their entire torso (Constraint Induced Movement Therapy Research Group, 2002). Test items 9 to 12 require the same placement of the video camera but zoomed in to detail the upper limb and fine finger movements. Test item 13 requires placement of the video camera to the front of the template and 3 feet in front of the participant (Constraint Induced Movement Therapy Research Group, 2002).

Scoring of the gWMFT
Quality of movement is assessed on the gWMFT using a functional ability scale (FAS). This is an eight-point ordinal scale, ranging from zero (not attempted) to seven (normal movement). Items are completed on two levels (A and B), where level A items are of a higher level of difficulty and are scored between four and seven. Level B items are of a lower level of difficulty and are scored between zero and three. Any items not completed are scored zero. For the assessment of performance time, participants have 30 seconds to complete level A items, and if unable to do so have a second opportunity to complete the task at level B. Sixty seconds are added onto performance time for level B items, with a maximum time of 120 seconds.
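The timing rule described above can be illustrated with a minimal Python sketch. It is not taken from the test manual: the helper names `recorded_time` and `fas_matches_level`, the treatment of an unattempted item as 120 seconds (as reported for this sample in the Discussion), and the rejection of level A times above 30 seconds are assumptions made for illustration only.

```python
from typing import Optional

LEVEL_A_LIMIT_S = 30.0    # time allowed for a level A attempt
LEVEL_B_PENALTY_S = 60.0  # seconds added to a level B performance time
MAX_TIME_S = 120.0        # ceiling on the recorded performance time


def recorded_time(level: Optional[str], raw_seconds: Optional[float]) -> float:
    """Return the recorded performance time for one item (hypothetical helper)."""
    if level is None or raw_seconds is None:
        # Assumption: an item that was not attempted or not completed
        # is recorded at the maximum time.
        return MAX_TIME_S
    if level == "A":
        if raw_seconds > LEVEL_A_LIMIT_S:
            raise ValueError("level A items must be completed within 30 seconds")
        return raw_seconds
    if level == "B":
        # Level B times carry the 60-second addition, capped at 120 seconds.
        return min(raw_seconds + LEVEL_B_PENALTY_S, MAX_TIME_S)
    raise ValueError(f"unknown level: {level!r}")


def fas_matches_level(level: str, fas: int) -> bool:
    """Check that a FAS score lies in the band for the level performed."""
    return 4 <= fas <= 7 if level == "A" else 0 <= fas <= 3
```

Under these assumptions, a level B performance of 20 seconds is recorded as 80 seconds, and one of 70 seconds is capped at 120 seconds. Because the 60-second addition depends on the level assigned, two raters who disagree on level A versus level B for the same attempt can record markedly different times.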

Procedure
The test was administered and video recorded according to protocol guidelines by one occupational therapist (rater one) (Constraint Induced Movement Therapy Research Group, 2002). To standardise placement of objects and participants, the template was devised from a plexiglass sheet according to protocol instructions and securely affixed to a table top (Appendix 2). The gWMFT was used to assess the participant's affected arm.
Assessments were completed at 2 weeks (T1) and 3 months (T2). The assessments completed at T1 took place in a private room used for research purposes on the hospital site. Assessments completed at T2 generally took place in the participant's own home.
For inter-rater analyses, rater one completed scoring through direct observation and rater two later viewed and scored participant videos for assessments completed at T1.
For intra-rater analyses, rater two scored assessment videos completed at T2 and re-scored one month later.
Internal consistency was assessed using rater two scoring at T1 and T2.
All recorded participant footage was viewed in a private room on hospital premises. Raters were blinded to each other's scoring.

Measurement constructs
Reliability and agreement determine the amount of measurement error in an outcome, and contribute to test validity (Kottner et al., 2011; Streiner et al., 2015). Reliability refers to the amount of variability between rater scores, while agreement assesses the degree to which allocated scores are identical (Kottner et al., 2011; Streiner et al., 2015). Internal consistency is a form of reliability that assesses the degree to which test items are inter-related and therefore indicative of measuring the same construct (Cronbach, 1951).

Data analysis
Descriptive statistics for age, gender and side of hemiparesis were recorded. The mean value was reported for the total FAS score, and the median value was reported for total performance time (Constraint Induced Movement Therapy Research Group, 2002). Score distributions were examined for both time points. Floor and ceiling effects were present if 15% or more of the sample achieved the minimum or maximum scores (McHorney and Tarlov, 1995).
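The 15% criterion for floor and ceiling effects can be sketched as follows. This is an illustrative Python example, not the analysis code used in the study (which was run in SPSS); the function names and the sample data are hypothetical.

```python
from typing import Sequence


def proportion_at(scores: Sequence[float], bound: float) -> float:
    """Proportion of the sample scoring exactly at the given bound."""
    return sum(1 for s in scores if s == bound) / len(scores)


def has_floor_or_ceiling(scores: Sequence[float], bound: float,
                         threshold: float = 0.15) -> bool:
    """True if 15% or more of the sample sits at the bound."""
    return proportion_at(scores, bound) >= threshold


# Hypothetical performance times: 10 of 28 participants at the 120-second maximum.
times = [120.0] * 10 + [45.0] * 18
effect_present = has_floor_or_ceiling(times, bound=120.0)
```

For performance time the worst possible score is the maximum of 120 seconds, so the bound passed in for a floor effect is the maximum time rather than zero.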
Item-level reliability and agreement analyses were completed to determine if there were any issues with individual items of the gWMFT. Inter-rater reliability for total and item-level functional ability and performance time was assessed using a two-way random, consistency intraclass correlation coefficient (ICC(2,1)) (Shrout and Fleiss, 1979). This enables generalisations to be made to other raters within the same population.
Intra-rater reliability for total and item-level functional ability and performance time was assessed using a two-way mixed effects, consistency ICC (ICC(3,1)) (Shrout and Fleiss, 1979). Consistency coefficients determine the level of consistency in the ranking of scores (Hallgren, 2012). A reliability score of 0.60 and above was considered acceptable (Cicchetti, 1994).
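A single-measures consistency ICC can be computed from the mean squares of a two-way analysis of variance, as in the hedged Python sketch below. It is not the SPSS procedure used in the study; the function name `icc_consistency` and the data layout (one row per participant, one column per rater or scoring occasion) are assumptions. For the consistency definition, the point estimate is the same under the two-way random and two-way mixed models; the models differ in how far the result is generalised.

```python
from typing import Sequence


def icc_consistency(ratings: Sequence[Sequence[float]]) -> float:
    """Single-measures consistency ICC from a participants x raters table."""
    n = len(ratings)        # participants (rows)
    k = len(ratings[0])     # raters or occasions (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-participant mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical FAS scores: rater two is always two points higher than rater one.
offset_scores = [[1, 3], [2, 4], [3, 5], [5, 7]]
icc_offset = icc_consistency(offset_scores)
```

In the hypothetical data the two raters never give the same score, yet they rank participants identically, so the consistency coefficient is at its maximum. This illustrates why a high ICC can coexist with low exact agreement.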
To examine item-level inter- and intra-rater agreement, proportion of agreement and proportion of agreement ±1 point were calculated for functional ability. Standard error of measurement (SEM) (Stratford and Goldsmith, 1997) was calculated for item-level performance time. Standard error of measurement was calculated for the total scores of both functional ability and performance time. The SEM portrays the amount of measurement error in scoring; the larger the value, the greater the variability between raters.
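The agreement indices can be sketched in Python as follows. The function names and example scores are hypothetical. The source does not state which SEM formulation was applied, so the sketch uses one widely used form, SEM = SD × √(1 − ICC), purely as an assumption.

```python
from math import sqrt
from typing import Sequence


def proportion_of_agreement(rater_a: Sequence[int], rater_b: Sequence[int],
                            tolerance: int = 0) -> float:
    """Share of paired scores that differ by no more than `tolerance` points."""
    pairs = list(zip(rater_a, rater_b))
    return sum(1 for a, b in pairs if abs(a - b) <= tolerance) / len(pairs)


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM under the assumed formulation SD * sqrt(1 - ICC)."""
    return sd * sqrt(1.0 - icc)


# Hypothetical FAS scores for one item from two raters.
rater_one = [4, 5, 3, 7]
rater_two = [4, 6, 1, 7]
exact = proportion_of_agreement(rater_one, rater_two)                  # identical scores
within_one = proportion_of_agreement(rater_one, rater_two, tolerance=1)  # within ±1 point
```

The SEM is expressed in the units of the score itself, so a value for performance time is read in seconds and a value for functional ability in FAS points.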
Internal consistency of functional ability and performance time was analysed using Cronbach's alpha. Values above 0.70 were considered indicative of test items measuring the same construct and correlating well together (Terwee et al., 2007).
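Cronbach's alpha compares the sum of the item variances with the variance of the total score. The Python sketch below is a minimal illustration rather than the SPSS routine used in the study; the function name and the data layout (one list of participant scores per test item) are assumptions.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; `items` holds one list of participant scores per item."""
    k = len(items)
    # Total score per participant, summing across items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1.0 - item_variance_sum / variance(totals))


# Hypothetical scores on two items for four participants.
example_items = [[1, 2, 3, 4], [2, 1, 4, 3]]
alpha = cronbach_alpha(example_items)
meets_threshold = alpha > 0.70  # threshold used in this study
```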
All analyses were completed using SPSS Statistics (Version 24.0. IBM Corporation, Armonk, NY).

Results
A total of 30 participants were recruited (mean days post-stroke [SD], 14.73 [8.36]). Due to medical reasons, loss to follow-up and technical difficulties in viewing recorded videos, two and nine participants were not assessed at T1 and T2, respectively. Consequently, inter-rater analyses were based on data from 28 participants (mean age [SD], 71.3 [9.85]; 18 males and 10 females) and intra-rater analyses on data from 21 participants (mean age [SD], 70.5 [8.7]; 16 males and five females).
Technical difficulties prevented the scoring of one item for participant one and one item for participant two at T2. In order to utilise existing data, summary scores were calculated using the available items. Patient characteristics are presented in Table 2.

Floor and ceiling effects
Ceiling effects were not evident for either assessment session. At T1, floor effects were found for performance time and functional ability by both raters, with 35.7% and 21.4% of the sample achieving the maximum score of 120 seconds and minimum score of zero, respectively (Table 2).
At T2, floor effects were found for performance time, with 33.7% of the sample achieving the maximum score of 120 seconds (Table 2). Floor effects were also found for functional ability at both testing sessions, with 19% of the sample achieving the minimum score of zero (Table 2).

Inter-rater reliability and agreement
High levels of reliability were found between rater one scoring through direct observation and rater two scoring using recorded videos for item-level (Table 3) and total (Table 4) functional ability and performance time, with ICC values above 0.8.
The proportion of agreement for scoring functional ability at the item level ranged from 0.43 to 0.64 and proportion of agreement ±1 point ranged from 0.56 to 0.96 (Table 3). Agreement based on SEM values for performance time at the item level ranged from 0.32 to 19.30, with greater differences found for scoring items 1 and 4 through 12 (Table 3). Standard error of measurement values for total scores were 0.33 for functional ability and 6.49 for performance time (Table 4). Larger differences for scoring performance time occurred where there were differences between raters in assigning participant performance to level A or level B tasks.

Intra-rater reliability and agreement
High levels of reliability were found for item-level (Table 3) and total (Table 4) functional ability and performance time, with ICCs above 0.9. The proportion of agreement ranged from 0.57 to 0.86 and proportion of agreement ±1 point ranged from 0.90 to 1 for functional ability scores at the item level (Table 3). Agreement based on SEM values for item-level performance time ranged from 0.07 to 9.29, with greater differences found for scoring items 3, 4, 5, 9, 11 and 12 (Table 3). Standard error of measurement values for total scores were 0.19 for functional ability and 3.64 for performance time (Table 4).

Internal consistency
Internal consistency values for functional ability and performance time for both assessment points were above 0.9 (Table 4).

Discussion
This study estimated the psychometric properties of the gWMFT in a cohort of individuals with stroke and compared the results between scoring through direct observation and using video. Excellent inter-rater reliability was found for the FAS and performance time, and adequate agreement was found for scoring functional ability through direct observation and by video. However, unacceptable measurement error was found for scoring performance time. Excellent reliability was also found for intra-rater analyses. This is the first reported study to investigate the reliability and agreement properties of the gWMFT in the sub-acute phase of stroke. With limited psychometric evaluation existing, the ability to compare this study to previous literature is limited. Substantial floor effects were found for performance time, with a high proportion of scores clustering at the maximum performance time allowed. Floor effects for the FAS were found by both raters at T1, and at both testing sessions at T2. Comparable findings were found for the WMFT when used with lower-functioning participants, with five participants unable to complete any item within 120 seconds (Thompson-Butel et al., 2015). Lin et al. (2009) found floor effects for the WMFT FAS when applied within 14 days of stroke onset. A large proportion of the current sample were unable to attempt all test items. With no recorded item available to score, participants scored 120 seconds and zero on the FAS. The pilot study, from which this sample was derived, did not preclude individuals with more severe upper limb impairment from recruitment procedures, potentially explaining the floor effects found. With participants demonstrating varying degrees of upper limb function, the gWMFT was not able to sensitively measure the range of motor capabilities exhibited.
The high levels of inter-rater reliability found between raters scoring through direct observation and by video indicate that scoring by video may not be a necessary adjunct. This was further substantiated by adequate agreement found between raters for scoring functional ability. While agreement for total FAS scores was adequate, exact agreement was poor across all items. The SEM for performance time highlighted greater discrepancies between raters. Examination of scores at the item level highlighted rater variations in assigning participant performance to level A or level B. Examining agreement at the item level, SEM values greater than 9 seconds were found for 10 items. Whilst the raters underwent training separately, the training content was consistent for both. This comprised reading the manual (Constraint Induced Movement Therapy Research Group, 2002), viewing training videos of an experienced occupational therapist administering the test with stroke survivors, and scoring in real time. This was augmented by a review of the scoring results with an experienced occupational therapist in a training session. In previous studies raters have been required to demonstrate approximate scoring to each other prior to study commencement (Morris et al., 2001; Whitall et al., 2006). This was not required in this study, potentially leading to measurement error and the disagreements demonstrated at the item level. Duff et al. (2015) recognised the issues of variability in ascribing the subjective aspects of the WMFT to patient performance and designed a quality process to ensure rater standardisation.
Excellent intra-rater reliability for total and item-level functional ability and performance time was found, indicating consistent scoring by one rater over a 1-month interval. Intra-rater SEM values for functional ability displayed minimal variation between scoring sessions, indicating a good level of agreement. Adequate agreement was found for nine test items, with proportion of agreement greater than 0.7. However, similar to inter-rater agreement analyses, there were unacceptable differences in scoring performance time at both the item level and for total scores.
A previous study has reported good agreement between videotaped and observed scoring for the WMFT based on ICC(2,1) agreement factor (greater than 0.9) (Whitall et al., 2006). However, the ICC is not a recommended agreement parameter, potentially obscuring the presence of wider variability (Kottner et al., 2011). Whilst differences in scoring modality may have impacted on rater differences in the current study, unacceptable measurement error was found for scoring performance time using video alone. This indicates the presence of additional factors impacting on measurement error. The study authors consider this the result of differences in accurately differentiating between a level A and level B performance by participants.
Table 4. Inter- and intra-rater reliability, standard error of measurement and internal consistency of gWMFT.
Although recommended by authors of the gWMFT and the WMFT (Constraint Induced Movement Therapy Research Group, 2002; Taub et al., 2011), the less affected limb was not tested. Scores for the less affected limb may act as a comparison for the more affected limb and help raters discern between FAS ratings accordingly.

Limitations and future research
As part of an ongoing pilot study, the sample size was small, limiting the amount of data available. This study examined participants in the sub-acute phase of stroke, with most experiencing difficulty attempting all test items. Therefore, consideration of reliability and agreement estimates should be applied with caution. Future study could stratify participants according to level of ability and examine use of the gWMFT in chronic stroke. In addition, the grade 5 Wolf Motor Function Test could be used, which was developed for individuals with more severe upper limb impairment (Uswatte et al., 2018).
Due to the discrepancies in rater agreement, provision of a standardised training programme throughout may reduce disagreement across level of item assigned, minimising error, and should be considered in future studies.

Implications for occupational therapy practice
The results of this study have the following implications for occupational therapy practice:
• The gWMFT is a reliable measure for assessing upper limb function post-stroke.
• Different therapists could potentially deliver the gWMFT with stroke survivors and score at different time points, leading to reliable results.
• Given the complexity of the assessment, training would be recommended prior to use, potentially using a fidelity check as developed by Morris et al. (2009) for the WMFT.
• Video recording may not be necessary when scoring the gWMFT, thereby increasing its clinical utility. This would also help to avoid technical errors in video recording and issues with obtaining consent and adhering to General Data Protection Regulations.
• The gWMFT showed floor effects. Therefore, caution should be applied in using the gWMFT with individuals who demonstrate more severe impairments following stroke. The level 5 WMFT could act as a suitable alternative (Uswatte et al., 2018).

Conclusion
The gWMFT demonstrated good levels of inter- and intra-rater reliability and internal consistency. There was acceptable agreement for functional ability, with greater measurement error found for performance time. This study demonstrates the potential use of the gWMFT in a sub-acute stroke population, without the additive strain of scoring individuals by video.

Key findings
• The graded Wolf Motor Function Test can be reliably scored by video and/or by direct observation.
• Inadequate agreement for scoring performance time and individual items indicates future studies should consider the impact of standardised training in the use of the assessment.

What the study has added
The graded Wolf Motor Function Test is a reliable measure of upper limb function in sub-acute stroke, and videotaping for scoring purposes may not be required.