Measuring health-related quality of life in chronic headache: A comparative evaluation of the Chronic Headache Quality of Life Questionnaire and Headache Impact Test (HIT-6)

Objective To compare the quality and acceptability of a new headache-specific patient-reported measure, the Chronic Headache Quality of Life Questionnaire (CHQLQ) with the six-item Headache Impact Test (HIT-6), in people meeting an epidemiological definition of chronic headaches. Methods Participants in the feasibility stage of the Chronic Headache Education and Self-management Study (CHESS) (n = 130) completed measures three times during a 12-week prospective cohort study. Data quality, measurement acceptability, reliability, validity, responsiveness to change, and score interpretation were determined. Semi-structured cognitive interviews explored measurement relevance, acceptability, clarity, and comprehensiveness. Results Both measures were well completed with few missing items. The CHQLQ’s inclusion of emotional wellbeing items increased its relevance to participants’ experience of chronic headache. End effects were present at item level only for both measures. Structural assessment supported the three and one-factor solutions of the CHQLQ and HIT-6, respectively. Both the CHQLQ (range 0.87 to 0.94) and HIT-6 (0.90) were internally consistent, with acceptable temporal stability over 2 weeks (CHQLQ range 0.74 to 0.80; HIT-6 0.86). Both measures responded to change in headache-specific health at 12 weeks (CHQLQ smallest detectable change (improvement) range 3 to 5; HIT-6 2.1). Conclusions While both measures are structurally valid, internally consistent, temporally stable, and responsive to change, the CHQLQ has greater relevance to the patient experience of chronic headache. Trial registration number: ISRCTN79708100. Registered 16th December 2015, http://www.isrctn.com/ISRCTN79708100


Introduction
Chronic headaches, which can be defined epidemiologically as headaches on 15 or more days per month for at least 3 months (1-3), have profound effects on people's lives. Those affected describe strained relationships and report that the spectre of headaches can be a crucial driver of their behaviour (4). For trials of treatments for these chronic headache disorders, an international, multi-stakeholder consensus process rated the measurement of the overall health impact of chronic headaches as being at least as important as counting headache days (5). These health impacts should be assessed using patient-reported outcome measures (PROMs) with robust evidence of measurement quality, relevance, and acceptability (5,6). There is substantial heterogeneity in the PROMs used in trials of headache disorders (7).
A 2018 systematic review of PROMs for headaches found that the strongest, albeit limited, evidence was for two headache-specific measures (7): the Migraine-Specific Questionnaire (MSQ v2.1) (8) and the six-item Headache Impact Test (HIT-6) (9). However, essential evidence of data quality and interpretation, reliability, and responsiveness was mostly absent or of insufficient quality. Moreover, the relevance and acceptability of these measures to people with headache were not explored. The use of PROMs that lack relevance to patients, and hence fail to capture the outcomes that matter, places an unnecessary burden on patients and may be judged to be unethical (10).
We report here on a mixed-methods comparative evaluation of the measurement and practical properties of the HIT-6 and of the Chronic Headache Quality of Life Questionnaire (CHQLQ v1.0), an adaptation of the MSQ v2.1 intended to make it suitable for people with unspecified chronic headache disorders.

Methods
The Chronic Headache Education and Self-Management Study (CHESS) is a programme grant funded by the UK's National Institute for Health Research (RP-PG-1212-20018) to test the effectiveness of a supportive self-management intervention for people living with chronic headache disorders (11). This current work forms part of the feasibility study (January 2016 to April 2017; Black Country Research Ethics Committee (15/WM/0165)), which is reported elsewhere (12). In summary, participants completed questionnaires on three occasions during a 12-week prospective cohort study (baseline, 2 and 12 weeks).

Study population
We recruited people living with chronic headaches, predominantly chronic migraine or chronic tension-type headache, from general practices in the West Midlands region of the UK. Practices wrote to people who had, in the previous 2 years, consulted for headaches or had a prescription for a migraine-specific drug (i.e. triptans/pizotifen), inviting expressions of interest in the study. In a subsequent telephone interview, study team members assessed whether participants met an epidemiological definition of chronic headaches: headache for 15 or more days per month for at least 3 months (1-3). For this validation of a generic headache-related quality of life outcome that is not diagnosis specific, this is the appropriate population. However, as part of this overall programme of work, we also validated a classification interview in this population. Of the 131 people included in this report, 107 (82%) also had paired telephone interviews with research nurses and doctors from the National Migraine Centre. The final classification was: definite chronic migraine (59; 55%), probable chronic migraine (40; 37%), chronic tension-type headache (6; 6%), cluster headache (2; 2%), hemicrania continua (1; 1%). Over half, 44/74 (59%), also had medication overuse defined as "headache occurring on 15 or more days per month taking acute or symptomatic headache medication (on 10/15 or more days per month, depending on the medication) for more than 3 months". The sample size was driven by requirements for validation of a chronic headache classification interview. This work is described in detail elsewhere (13).

Patient-reported outcome measures
The feasibility study included general headache-specific (not diagnosis-specific), generic and domain-specific measures and a headache-specific health transition question (detailed in Appendix 1). The CHQLQ is a 14-item questionnaire, which assesses the functional aspects of headache-related quality of life, producing three domain scores (role prevention, role restriction, and emotional function) (8). Modification of the CHQLQ from the MSQ (v2.1) simply involved replacing the word 'migraines' with 'headaches' throughout the questionnaire. The HIT-6 is a 6-item questionnaire, which produces a single index score of headache impact on functional ability (9). Participants self-completed postal questionnaires at baseline, 2 and 12 weeks.
Data quality and interpretability. Item-scale characteristics, completion rates (missing data) and percentage of computable scale scores are reported (15,16). Interpretability was informed by evidence of end effects and by calculation of the minimal important change (MIC), the smallest change in score perceived as important by participants (15), calculated as the mean change score for people reporting "minimal change" in their headache at 12 weeks.
Structural validity and internal consistency. An exploratory factor analysis of baseline data tested the hypothesis that the CHQLQ's original three-factor solution would be retained. Absolute item loadings ≥0.45 were accepted as sufficient correlation with a principal component to support domain inclusion. Confirmatory factor analysis was then used to confirm the three- and one-factor structures of the CHQLQ and HIT-6, respectively. Factor loadings exceeding 0.3-0.4 were judged to be meaningful (15-17). Internal consistency was assessed with Cronbach's alpha (15,16); values between 0.70 and 0.90 suggest good to excellent agreement between items and the total (domain) score (15,16).
Reliability and measurement error. Two-week test-retest reliability (intra-class correlation coefficient (ICC 2,1)) was assessed in those indicating no change in their headache. We calculated the standard error of measurement (SEM) to determine the extent of absolute measurement error (6,18,19). The SEM supports score interpretation by accounting for variability, or error, in measurement: only a change greater than measurement error is considered 'real' (18). The SEM was subsequently converted into the smallest detectable change (SDC), representing the smallest change in score that is greater than measurement error; the SDC was calculated for individuals and for groups (19,20). The SDC allows one to rule out measurement error (i.e. distinguishing measurement error from true change) when assessing the reliability of a self-reported measure to detect change in health status. Thus, a score change greater than the SDC value is necessary to provide evidence of true change (improvement or deterioration) in health status.
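As an illustration of the internal-consistency statistic named above, the following is a minimal sketch of Cronbach's alpha using the conventional formula (the ratio of summed item variances to the variance of the total score). It is not the study's analysis code; the function name `cronbach_alpha` and the demonstration responses are hypothetical, and complete cases are assumed.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete-case data.

    rows: one list of item responses per respondent, all of equal length.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item across respondents.
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    # Variance of the summed (domain or index) score.
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses (4 respondents x 3 items); not study data.
demo = [[1, 2, 2], [2, 3, 3], [4, 4, 5], [5, 5, 6]]
print(round(cronbach_alpha(demo), 3))
```

In practice the function would be applied separately to the items of each CHQLQ domain and to the six HIT-6 items.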
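The conversion from SEM to SDC described above can be sketched as follows. The text does not state the formulas used, so this sketch assumes the conventional ones: SEM = SD × √(1 − ICC), SDC for individuals = 1.96 × √2 × SEM, and SDC for groups = SDC for individuals ÷ √n. The function names and the group size in the example are hypothetical.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from a score SD and a reliability (ICC)."""
    return sd * math.sqrt(1.0 - icc)


def sdc_individual(sem, z=1.96):
    """Smallest detectable change for one person (95% confidence by default)."""
    return z * math.sqrt(2.0) * sem


def sdc_group(sem, n, z=1.96):
    """Smallest detectable change for the mean of a group of n people."""
    return sdc_individual(sem, z) / math.sqrt(n)


# HIT-6 SEM as reported in the Results for people with stable symptoms.
hit6_sem = 2.42
print(round(sdc_individual(hit6_sem), 2))  # 6.71

# Hypothetical group size, for illustration only.
print(round(sdc_group(hit6_sem, n=100), 2))  # 0.67
```

Under these assumed formulas the rounded SEM of 2.42 gives an individual-level SDC of about 6.7, in line with the 6.69 reported in the Results; the small difference is presumably due to rounding of the SEM.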
Construct validity. Score correlation between measures was assessed to evaluate convergent validity (Pearson's correlation coefficient). Hypothesised theoretical associations were considered a priori (Appendix 2).
Responsiveness. Responsiveness reflects the ability of a measure to detect real change in health that is greater than measurement error.

(i) Smallest detectable change (SDC)
We calculated the absolute measurement error at 12 weeks (standard error of measurement (SEM) and the smallest detectable change (SDC)), to represent the smallest change in score that is greater than measurement error in patients reporting change in headache at 12 weeks. We calculated the minimal important change (MIC) as the mean change in those reporting minimal improvement or deterioration at 12 weeks. We calculated the minimal clinically important difference (MCID) as the mean change in score in those who are "somewhat better" minus the mean change in those who are the same at 12 weeks (6,16).
(ii) Criterion-based assessment
Receiver operating characteristic (ROC) curves were calculated to assess the ability of measures to discriminate between people whose headache had improved or deteriorated (on the headache-specific transition question) at 12 weeks (16). An area under the curve (AUC) of > 0.70 is considered sufficiently discriminatory; an AUC of 0.5 suggests no discriminatory power.
(iii) Effect size (ES) and standardised response mean (SRM)
The ES and SRM were calculated for subgroups of patients in each health transition category. The main hypotheses we tested were that the ES and SRM would be <0.2 for patients who reported no change in headache; >0.2 for patients reporting a slight improvement; >0.5 for patients reporting improvement (much better); and greater for patients indicating an improvement in their headache than for those indicating no change.
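The two anchor-based quantities defined above reduce to simple means of change scores grouped by the answer to the transition question. A minimal sketch follows; the function names, category labels and change scores are hypothetical, and only the final line uses figures from the Results.

```python
from statistics import mean


def minimal_important_change(changes_by_category, category):
    """MIC: mean change score in the group reporting minimal change."""
    return mean(changes_by_category[category])


def minimal_clinically_important_difference(changes_by_category,
                                            improved="somewhat better",
                                            stable="same"):
    """MCID: mean change in the improved group minus mean change in the stable group."""
    return mean(changes_by_category[improved]) - mean(changes_by_category[stable])


# Hypothetical 12-week change scores keyed by transition answer; not study data.
demo_changes = {
    "somewhat better": [12.0, 8.0, 10.0],
    "same": [4.0, 6.0, 5.0],
}
print(minimal_important_change(demo_changes, "somewhat better"))   # 10.0
print(minimal_clinically_important_difference(demo_changes))       # 5.0

# Role-restriction means reported in the Results: improved 10.79, unchanged 7.04.
print(round(10.79 - 7.04, 2))  # 3.75
```

The last line reproduces the role-restriction MCID of 3.75 reported in the Results from the two reported group means.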
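To make the AUC thresholds concrete, the sketch below computes the area under the ROC curve through its rank-based (Mann-Whitney) equivalent: the probability that a randomly chosen improved participant has a larger change score than a randomly chosen non-improved participant, counting ties as one half. The function name and data are hypothetical, and this is not necessarily how the study's ROC curves were produced.

```python
def auc_from_change_scores(improved, not_improved):
    """AUC as the share of (improved, not improved) pairs ranked correctly.

    Assumes a larger change score means more improvement. For a measure on
    which lower scores are better (such as the HIT-6), negate the change
    scores before calling.
    """
    if not improved or not not_improved:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for a in improved:
        for b in not_improved:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(improved) * len(not_improved))


# Hypothetical change scores; not study data.
improved_demo = [12, 8, 15, 3]
not_improved_demo = [2, 5, -1, 8, 0]
print(auc_from_change_scores(improved_demo, not_improved_demo))  # 0.875
```

An AUC of 0.875 in this toy example would exceed the 0.70 threshold, whereas identical distributions in the two groups would give 0.5.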
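The two distribution-based statistics can be sketched as follows, assuming the conventional definitions (the text does not spell them out): the ES divides the mean change by the standard deviation of baseline scores, and the SRM divides the mean change by the standard deviation of the change scores. Function names and data are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """ES: mean change divided by the SD of baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardised_response_mean(baseline, follow_up):
    """SRM: mean change divided by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical paired domain scores for one transition subgroup; not study data.
baseline_demo = [40, 50, 60, 70]
follow_up_demo = [50, 55, 75, 80]
print(round(effect_size(baseline_demo, follow_up_demo), 2))                 # 0.77
print(round(standardised_response_mean(baseline_demo, follow_up_demo), 2))  # 2.45
```

Each statistic would be computed within a transition category and then compared against the hypothesised thresholds of 0.2 and 0.5.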

Content validity
Semi-structured cognitive interviews were conducted within 24 h of questionnaire self-completion with a purposive sample (age, gender, headache type) of participants. Measurement relevance, acceptability, clarity, and comprehensiveness were explored (21,22). Overarching questions explored how patients determined headache improvement, and if specific questions were missing. Interviews continued until thematic saturation was achieved; they were audio-recorded, transcribed verbatim, and checked for accuracy (VN). We used framework analysis (23) and cross-case comparison to generate themes. NVivo software (QSR International Pty Ltd. Version 11, 2015) supported data organisation. Data were independently explored by two researchers (VN, KH); emergent themes were discussed and interpreted with a third researcher (FG) and with two of our patient research partners (BB, LM).

Data quality and interpretability
Item missing data for the CHQLQ was low (range 0% to 3%); domain scores were computable for 96% (role prevention), 97% (role restriction) and 100% (emotional function) of respondents (Table 2). All response options were endorsed. Except item 12 ("fed up or frustrated"), which correlated more highly with role restriction (0.71) than emotional function domain (0.64), all item-total correlations with specified domains were greater than 0.7 (Table 3).
There were no missing data for the HIT-6; index scores were computable for all responders. Except for item 1 (pain severity), for which response option 1 ("never") was not endorsed, all response options were supported. Item-total correlations ranged from 0.68 to 0.79, with five of the six items achieving scores higher than 0.70 (Table 3).
Table notes. a CHQLQ: Role function-restrictive (RR), Role function-preventative (RP) and Emotional function (EF) scores are calculated as the sum of item responses across each domain, rescaled to a 0-100 scale, where the higher score indicates better headache-related quality of life. A floor effect at item level is where more than 15% of responders score at the minimum (floor) indicating "best" health on the CHQLQ. b HIT-6: Each item has five descriptive response options, with each awarded a specific number of points: "Never" (6 points), "Rarely" (8 points), "Sometimes" (10 points), "Very often" (11 points) and "Always" (13 points). The score is the sum of item (points) responses. The index score ranges from 36 to 78, where scores ≤49 indicate little to no impact on life; 50-55 indicates some impact on life; 56-59 indicates substantial impact on life; and ≥60 indicates very severe impact on life. A floor effect at item level is where more than 15% of responders score at the minimum (floor) indicating "best" health on the HIT-6. c End effects: Where more than 15% of respondents score the minimum (floor) or maximum (ceiling) score respectively.
Floor effects (>15%) were identified for three CHQLQ role-prevention items and two emotional function items, suggesting many respondents were not "prevented" from undertaking usual activities or did not experience specific emotional difficulties (Table 2). Ceiling effects were observed for two HIT-6 items: >15% of respondents indicated they would "always" "lie down" or feel "fed up or irritated" when experiencing a headache, suggesting the importance of these items, but further impact discrimination was impossible.
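The HIT-6 scoring rule given in the table notes (points per response option, summed to an index of 36 to 78 and banded by impact) can be sketched as below. The function names are hypothetical; the point values and bands are those stated in the notes, reading the lowest band as scores of 49 or less and the highest as 60 or more.

```python
HIT6_POINTS = {
    "never": 6,
    "rarely": 8,
    "sometimes": 10,
    "very often": 11,
    "always": 13,
}


def hit6_score(responses):
    """Sum the points for the six item responses (index range 36 to 78)."""
    if len(responses) != 6:
        raise ValueError("the HIT-6 has exactly six items")
    return sum(HIT6_POINTS[r.strip().lower()] for r in responses)


def hit6_band(score):
    """Map an index score to the impact band described in the table notes."""
    if not 36 <= score <= 78:
        raise ValueError("HIT-6 index scores range from 36 to 78")
    if score <= 49:
        return "little to no impact"
    if score <= 55:
        return "some impact"
    if score <= 59:
        return "substantial impact"
    return "very severe impact"


# Hypothetical response pattern; not study data.
answers = ["Sometimes", "Very often", "Always", "Rarely", "Never", "Sometimes"]
total = hit6_score(answers)
print(total, hit6_band(total))  # 58 substantial impact
```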
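The 15% criterion for item-level end effects can be checked with a few lines of code. This is a minimal sketch with a hypothetical function name and hypothetical responses; it flags an end effect when more than 15% of respondents choose the lowest or highest response option.

```python
def end_effects(responses, minimum, maximum, threshold=0.15):
    """Share of respondents at each end of the scale and whether it exceeds the threshold."""
    if not responses:
        raise ValueError("no responses supplied")
    n = len(responses)
    floor_share = sum(1 for r in responses if r == minimum) / n
    ceiling_share = sum(1 for r in responses if r == maximum) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical points for one HIT-6 item (6 = "never", 13 = "always"); not study data.
demo_item = [13, 13, 11, 10, 13, 8, 11, 10, 10, 11]
result = end_effects(demo_item, minimum=6, maximum=13)
print(result["floor_effect"], result["ceiling_effect"])  # False True
```

Whether the minimum or the maximum response represents "best" health depends on the measure's scoring direction, so the labels should be interpreted per instrument.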

Structural validity and internal consistency
Standard loadings and goodness-of-fit indices for the CHQLQ exploratory factor analysis supported the three-factor model, with factor loadings > 0.50 for all items except item 12 ("fed up or frustrated").
Table 3. Exploratory (EFA) and confirmatory (CFA) factor analysis: Standardised factor loadings for the proposed three-factor measurement model for the CHQLQ and single-factor measurement model of the HIT-6.

Reliability
All values for the CHQLQ and HIT-6 exceeded the lower threshold for acceptable test-retest reliability (intra-class correlation coefficient > 0.70), supporting use with groups of patients. This implies that, when using the CHQLQ for individual assessment, changes in people with stable symptoms would need to be greater than 22, 24 or 29 points (between 22% and 29% of total score change) to be distinguishable from measurement error. Alternatively, on a group level, group means would need to differ between 2.74 and 3.58 (up to 4% of total score change) to ensure a true detection of a difference in people with stable symptoms. The standard error of measurement for the HIT-6 was 2.42, resulting in an SDC individual of 6.69 and SDC group of 0.78. When using the HIT-6 in individual assessment, changes in people with stable symptoms would need to exceed 6.7 points (16% of total score change) to be distinguishable from measurement error. Alternatively, on a group level, group means need to differ by 0.78 (up to 2% of total score change) to be distinguishable from measurement error in people with stable symptoms.

Construct validity
Most hypothesised associations were supported (Table 5): the CHQLQ's three domains were strongly associated, with moderate to strong associations with the HIT-6. However, the association between role restriction and the SF-12 mental component score was stronger (moderate) than that observed with emotional function, reflecting the emotional component of the role-restriction domain (Appendix 3). Similarly, although smaller than hypothesised, associations between role restriction and the HADS were similar to or greater than those observed for emotional function, reflecting the limited emotional content of the emotional-function domain specifically, and the CHQLQ generally. Moderate associations between the CHQLQ and the Social Impact Scale and Pain Self-Efficacy Scales reflect the CHQLQ focus on the social impact of headache and pain, respectively.
A strong association with the Pain Self-Efficacy Questionnaire reflects the HIT-6 focus on pain. Apart from the moderate association with the Social Impact Scale, reflecting the HIT-6 emphasis on social impact, small associations with the remaining measures evidence a limited focus on the emotional impact of headache.

Responsiveness (Table 6)
Of the 105 people completing the 12-week questionnaire, 94 and 100 completed the health-transition question and CHQLQ or HIT-6, respectively.
Smallest detectable change (SDC). The CHQLQ standard error of measurement ranged from 5.60 to 10.31 for participants indicating minimal improvement or deterioration in headache status at 12 weeks. The resultant smallest detectable change for individuals (SDC individual) for improvement ranged from 15 (role prevention) to 21 (role restriction), and from 26 (role restriction and role prevention) to 28 (emotional function) for deterioration. The corresponding smallest detectable change for groups (SDC group) ranged from 3 (role prevention) to 5 (role restriction) for improvement, and from 7 (role prevention) to 8 (emotional function) for deterioration. These results imply that when using the CHQLQ for individual assessment, changes of <21 (improvement) or <28 (deterioration) points cannot be distinguished from error. However, much smaller differences are detectable for groups of patients: for groups who indicate minimal improvement, a change from baseline to 12 weeks of >5 points on the role-restriction and emotional-function domains and > 4 on the role-prevention domain is required to demonstrate a change that is greater than measurement error. For groups indicating minimal deterioration, a change of approximately 8 points is required to demonstrate change that is greater than measurement error.
The standard error of measurement for the HIT-6 ranged from 1.7 (deterioration) to 3.5 (improvement). The smallest detectable change at the individual level (SDC individual) was 9.5 and 1.7, and at the group level (SDC group ) was 2.1 and 1.3 for improvement and deterioration, respectively.
Minimal important change (MIC). Fifty-three of the 94 valid CHQLQ responses at 12 weeks (56%) indicated no change in headache status (mean change in score between 2.57 (SD 13.6) (emotional function) and 7.04 (SD 13.35) (role restriction)). Nineteen reported some ("better") improvement, with a mean score improvement (minimally important change) of 5.26 (role prevention), 8.00 (emotional function) and 10.79 (role restriction). The remaining 12 participants reported a deterioration ("worse") in headache status and a mean score deterioration of −0.75 (role prevention), −2.25 (emotional function), and −3.17 (role restriction). The smallest difference between clinically stable and improved participants (i.e. the minimal clinically important difference (MCID)) was 0.84 (role prevention), 3.75 (role restriction) and 5.43 (emotional function). The minimally important change for the HIT-6 is −3.15 and 0.42 for minimal improvement and deterioration, respectively. The smallest difference between clinically stable and improved patients (minimal clinically important difference) is −1.06 for the HIT-6.
For both measures, the minimal important changes were greater than the smallest detectable change in groups (SDC group ), indicating that a greater change in score is required to denote "important change" than that required to illustrate change that is greater than measurement error.
Criterion-based responsiveness (Figure 1). Moderate correlations between CHQLQ and HIT-6 change scores and the headache-specific transition item (range −0.35 (emotional function) to −0.45 (role prevention); 0.36 (HIT-6)) supported its use as an external marker of change (24). Higher AUC scores were found when dichotomising patients according to those who were "much better" versus those reporting that they were "better, the same or worse" (Figure 1).
Figure 1. ROC curves. Note: Respondents were dichotomised in three different ways: i) "Much better": Headache was "much better" versus headache was better, about the same or worse; ii) "much better, better": Headache was "much better" or "better" (that is, the improved group) versus headache was the same or worsened (the not improved group); and iii) "much better, better, same": Headache had improved or remained about the same vs. headaches had deteriorated.
Effect size statistics. As hypothesised, both effect size and standardised response means for patient subgroups increased with increased reported improvement on the transition question. Moderate to large effect sizes were found for people reporting some (better) and greater (much better) improvements in headache status at 12 weeks for both the CHQLQ and HIT-6. However, for patients who were unchanged, most values (75%) exceeded 0.2, contrary to the hypothesis. Small numbers limited interpretation of any headache deterioration.

Content validity
We interviewed 14 participants (age 21-72 years; nine female) with chronic migraine. Typically, participants felt the CHQLQ was relevant to their headache experience, specifically welcoming the emotional impact items. However, item overlap, particularly around work, caused participants to refer back to previous items, and increased completion time. Participants described experiencing different headache intensities across the 4-week recall period, requiring judgement as to how they selected the most appropriate response. Double-barrelled items that aligned headache impact on "work" with "leisure activities" or "home" were challenging, as different environments influenced response. Contextual situations (for example, being retired or without dependents) caused participants to rate headache impact differently.
Typically, participants felt that the HIT-6 was relevant, welcoming its brevity and simplicity. However, when considering different headache intensities, the lack of recall period (items 1 to 3) was problematic: a range of recall periods (daily, weekly, fortnightly, monthly, study duration) were reported to assist in completion. The lack of "pain severity" definition (item 1) was problematic: participants made their own judgement of severity before answering. The double-barrelled nature of three items (2, 5, and 6) caused concern. The impact of headache on work, social or household activities could be scored differently: some chose one activity, whereas others "averaged" activities. Ambiguity of meaning was raised for three items: item 3, "wishing" that one could lie down versus "actually" being able to lie down; item 4, what "tiredness" was, and its relationship to headache; and item 5, "fed up or irritated" was perceived as unclear.

Discussion
This comparative evaluation of the CHQLQ (adapted MSQ v2.1) and HIT-6 supported the appropriateness of the CHQLQ as a measure of headache-specific quality of life. Whilst the measurement properties of the HIT-6 were similarly strong, concerns over its content and relevance were identified.
Although the shortness of the HIT-6 was welcomed, the capture of headache impact was limited when compared to the CHQLQ. The CHQLQ questions addressing the emotional, symptomatic and social impact of headache were appreciated. However, item repetition and redundancy unnecessarily increased completion time. Participants "averaged" responses to manage the CHQLQ's 4-week recall period; however, the lack of recall period for several HIT-6 items was a greater concern. This limitation was not identified by the quantitative analysis, highlighting the importance of seeking end-user perspectives throughout development and testing. Low levels of missing data supported the acceptability of both measures.
The CHQLQ three-factor model was supported. However, the dual loading of item 12 ("fed-up or frustrated") on both role-restriction and emotional-function domains suggested multiplicity and interpretation problems (25,28), which was further supported by a stronger item-total correlation with the role-restriction domain than with the emotional-function domain. Qualitative interviews further identified CHQLQ item interplay between domains, describing the importance of context when thinking about headache impact. Similar contextual problems, including a noticeable divide between work and social commitments, were described for both the CHQLQ and HIT-6: for example, interviewees reported endeavouring to keep going while at work, but would often cancel social activities.
The magnitude of the between-domain correlations found in our work suggests that the CHQLQ domains are measuring somewhat different aspects of headache-related health and should be retained. Our confirmatory factor analysis and work by Rendas-Baum et al. (26) further support this. High alpha values supported the internal consistency of the three CHQLQ domains. Similarly, high alpha values have been reported for the MSQv2.1 following completion by patients with chronic (27,28) and episodic migraine (8,27).
The single-domain structure of the HIT-6 was supported by both factor analysis and high alpha values, confirming evidence following completion across chronic and episodic headache populations (29,30).
Low reliability was reported for the MSQv2.1 (ICC < 0.70) in patients with "stable" episodic migraine at a 4-week retest (26). Acceptable levels have been reported for the HIT-6 (29,30). The high levels of reliability in this study support application of both measures in groups, with the smallest detectable change (SDC) suggesting a CHQLQ difference in group means greater than 2.74 (role restriction), 2.86 (role prevention), 3.58 (emotional function) and 0.78 for the HIT-6 is required to demonstrate a real change in stable patients.
Associations between different variables provided acceptable evidence of CHQLQ and HIT-6 construct validity, consistent with earlier MSQv2.1 (26,28) and HIT-6 (9,31) evaluations. However, the associations of the CHQLQ's emotional function domain with alternative measures of emotional wellbeing were weaker than hypothesised. Given the importance afforded by patients to the emotional impact of headache, the inclusion of measures providing a more nuanced assessment of emotional wellbeing is recommended.
Both measures demonstrated acceptable evidence of responsiveness to headache improvement over 12 weeks. Moreover, two CHQLQ domains (role restriction, emotional function) and the HIT-6 discriminated between dichotomous configurations of self-reported change in health when grouped as "much better" versus "better, same or worse". The role-prevention domain was unable to discriminate at a higher level of discrimination.
The minimal important change (MIC) values for both measures were greater than the smallest detectable change (SDC) for groups of patients whose headaches had minimally improved, indicating an "important change" for participants is greater than measurement error. The minimally important change values for CHQLQ domains closely approximate those reported following a 3-month completion of the MSQv2.1 by a large US-based, mixed population of migraineurs: role restriction 5, role prevention 5 to 7.9, emotional function 8.0 to 10.6 (32).
The HIT-6 minimal important change value closely approximates that determined in US patients with chronic headache (−3.7) (33) and Dutch patients with episodic migraine (−2.5) (34). However, it is smaller than a minimal important change of 8.0 proposed in a Dutch study of patients with tension-type headache (35), where global improvement was defined according to both global improvement and a reduction in headache days (greater than 50%). Published minimal important change values for the HIT-6 range from −1.5 (episodic migraine) to −2.3 (chronic daily headache) (7,33-35), approximating the minimal clinically important difference (MCID) of −1.06 found in this study.
This study describes the first mixed-methods comparative evaluation of two generic, headache-related quality of life measures that are not diagnosis specific, in a UK-based cohort of patients living with chronic headaches. Despite the importance of content validity to the relevance and acceptability of measures, few PROM-evaluative studies explore the qualitative aspects of measures (7). While both measures demonstrated comparable psychometric properties, qualitatively the content validity of the CHQLQ was enhanced by the inclusion of items assessing the emotional toll of chronic headache. However, all interviews were conducted with people with definite or probable chronic migraine, potentially limiting the generalisability of these findings to other headache types. While the number of participants was adequate to support a robust evaluation of measurement data quality, reliability and validity, the majority of participants reported "no change" in health at the 12-week follow-up, substantially reducing the numbers available to explore measurement responsiveness. Further evaluations of measurement responsiveness in a larger cohort and following an active intervention will further enhance confidence in the measure's ability to capture important change, and contribute towards calculation of the minimal important change in score. Evidence suggests that the CHQLQ shows potential for further use in other groups of patients with chronic headache, but this analysis is limited to participants in a feasibility study (for a larger trial) (12). Hence, some caution is required in generalising conclusions and recommendations more widely to the general population of people with chronic headaches.
Since the reported PROM evaluation was explicitly in people without a specific headache diagnosis, the evidence supports application of both measures in trials where recruitment takes place before diagnosis; for example, where diagnosis is part of the intervention, or for epidemiologic surveys -for example, capturing the impact of headache disorders. Further work may be needed to evaluate use of the CHQLQ in other populations of people with chronic headaches where case mix may be different. For example, it might be a useful measure for people with definite chronic migraine and medication overuse headache after further evaluation in that population. That the design of this study did not allow a precise diagnosis for all participants is not a weakness since the evaluation sought to provide evidence in support of the CHQLQ when assessing people with undiagnosed headache disorders.

Conclusion
This study describes the first comparative evaluation of the new CHQLQ with the HIT-6, demonstrating the added value to be gained from a mixed-methods approach to PROM evaluation. The results of this study, and their consistency with previous evaluations, support recommendation of the CHQLQ as a high-quality, relevant and acceptable measure for chronic headache. In comparison to the HIT-6, for which similarly strong psychometric evidence was reported, the CHQLQ had greater relevance to the wide-ranging impact of chronic headache.

Clinical implications
• The quality, relevance and acceptability of a new measure of chronic headache quality of life, the Chronic Headache Quality of Life Questionnaire (CHQLQ), was compared with that of an existing measure, the 6-item Headache Impact Test (HIT-6), following completion in a UK population.
• The CHQLQ better captured the emotional, symptomatic and social impact of chronic headache.
• Both measures had comparable measurement properties.
• The CHQLQ is recommended as a high quality, relevant and acceptable measure for use with patients with chronic headache.

Ethics approval
Ethics approval was given on 11

Availability of data and material
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Author contributions
KH conceived of the comparative analysis PROM study, developed the analysis plan, analysed the statistical data, designed and led the qualitative analysis, supported the PPI activities and took the lead on writing the manuscript. FA contributed to the design of the analysis plan, ran the statistical analysis, analysed the data and contributed to the writing of the manuscript. VN undertook the qualitative interviews, contributed to the qualitative analysis, supported PPI activities and contributed to the writing of the manuscript. GP was a patient research partner on the study, contributing to the design, data analysis and writing the manuscript. BB was a patient research partner on the study, contributing to the design, data analysis and writing the manuscript. LM was a patient research partner on the study, contributing to the design, data analysis and writing the manuscript. SP supported the feasibility study (intervention design and training), supported PPI activities and provided critical revision to the manuscript. FG contributed to the qualitative analysis and contributed to the writing of the manuscript. KS supported the feasibility study set-up and management, supported PPI activities and reviewed the manuscript. MU secured funding for the overall project, contributed to the design of this study, data analysis, and contributed to the writing of the manuscript. MSM supported securing of funding, supported the concept and design of the PROM analysis, and provided critical revisions to the manuscript. All authors have read and approved the final manuscript.

Declaration of conflicting interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: MSM serves on the advisory board for Allergan, Medtronic, Novartis and TEVA and has received payment for the development of educational presentations from Allergan, electroCore, Medtronic, Novartis and TEVA. MU is a director and shareholder of Clinvivo Ltd. Use of this company for some data collection, not included in this paper, was specified in the original application for funding to NIHR. MU has recused himself from all subsequent discussions regarding the use of Clinvivo in this study. All contracting processes have been in accord with University of Warwick financial regulations. MU was Chair of the NICE accreditation advisory committee until March 2017, for which he received a fee. He is chief investigator or co-investigator on multiple previous and current research grants from the UK National Institute for Health Research, Arthritis Research UK and is a co-investigator on grants funded by Arthritis Australia and Australian NHMRC. He has received travel expenses for speaking at conferences from the professional organisations hosting the conferences. He is part of an academic partnership with Serco Ltd related to return to work initiatives. He is a co-investigator on two studies that receive support in kind from Orthospace Ltd. He was, until March 2021, an editor of the NIHR journal series, and a member of the NIHR Journal Editors group, for which he received a fee. SP is a director of Health Psychology Services Ltd, which in part provides psychological treatments for those with chronic pain.

Comparator measures: Headache-specific Headache Impact Test (HIT-6).
A six-item measure that purports to provide an overall assessment of headache impact on an individual's ability to function. Each item has five descriptive response options, with each awarded a specific number of points: "never" (6 points), "rarely" (8 points), "sometimes" (10 points), "very often" (11 points) and "always" (13 points). No recall period: items 1-3. 4-week recall period: items 4-6. The score is the sum of item (points) responses (15,16).

Interpretability: the ability to assign qualitative meaning to a score or change in score (https://www.cosmin.nl/wp-content/uploads/COSMIN-definitions-domains-measurement-properties.pdf) (14,15)

End-effects
Where more than 15% of respondents score the minimum (floor) or maximum (ceiling) score (6,15).

Minimal important change (MIC)
Description: The MIC is defined as the smallest change in score perceived as important by participants (14,15).
Analysis and interpretation: Calculated as the mean change score for people reporting "minimal change" in headache at 12 weeks on the headache-specific health transition questionnaire (HTW) (that is, "better" or "worse").

Internal consistency
Description: Assesses the relationship (interrelatedness) between items within a measure (or sub-domains), reflecting the total number of items and their average correlation (15,16).
Analysis and interpretation: The internal consistency of the three CHQLQ domains and the HIT-6 was assessed by calculation of Cronbach's alpha (15,16). Values between 0.7 and 0.90 suggest a good to excellent agreement between items and the total (domain) score (15,16).

Reliability and measurement error: the degree to which a measure is free from measurement error

Test-retest reliability
Description: The extent to which scores for patients who have not changed are the same for repeated assessments over time (temporal stability) (6,14-16).
Analysis and interpretation: Two-week test-retest reliability was assessed in patients who indicated on health transition item that their headaches had remained stable (15,16). The intra-class correlation coefficient (ICC 2,1) was used to measure the level of agreement between test and re-test (15,16
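
As an illustration of the HIT-6 scoring rule and the internal consistency statistic described above, the following is a minimal Python sketch, not the authors' analysis code. The function names `hit6_score` and `cronbach_alpha` are hypothetical, the point values are those listed in the text, and the alpha calculation uses the standard textbook formula based on sample variances.

```python
from statistics import variance

# Points per response option, as listed in the HIT-6 description above.
HIT6_POINTS = {
    "never": 6,
    "rarely": 8,
    "sometimes": 10,
    "very often": 11,
    "always": 13,
}


def hit6_score(responses):
    """Sum the points for the six HIT-6 item responses (hypothetical helper)."""
    if len(responses) != 6:
        raise ValueError("HIT-6 has exactly six items")
    return sum(HIT6_POINTS[r.strip().lower()] for r in responses)


def cronbach_alpha(item_scores):
    """Cronbach's alpha; rows are respondents, columns are items.

    Standard formula: alpha = k/(k-1) * (1 - sum(item variances) / variance(totals)).
    Requires at least two respondents and two items.
    """
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # one tuple of scores per item
    sum_item_variances = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)
```

With these point values, six "never" responses sum to 36 and six "always" responses sum to 78, so any valid total lies between those two values.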

Table 2 (continued)

Analysis and interpretation: The SEM supports score interpretation by accounting for variability, or error, in measurement; only a change greater than measurement error is considered "real" (15,17). The Smallest Detectable Change (SDC) represents the smallest change in score that is greater than measurement error. The SDC allows one to rule out measurement error (i.e. distinguishing measurement error from true change) when assessing the reliability of a self-reported measure to detect change in health status. Thus, a score change greater than the SDC value is necessary to provide evidence of true change (improvement or deterioration) in health status.

Construct and content validity: the degree to which a measure measures what it purports to measure

Content validity (qualitative evidence in support of purported measurement focus)
Description: The degree to which the content of the PROM measures the construct(s) it purports to measure (https://www.cosmin.nl/wp-content/uploads/COSMIN-definitions-domains-measurement-properties.pdf) (14-16). Evidence that details the clarity of measurement content in terms of relevance, comprehensiveness and comprehensibility with respect to the purported measurement focus (construct of interest) and the target population (for example, chronic headache) (14-16).
Analysis and interpretation: Semi-structured cognitive interviews were conducted with a purposive sample of patients with confirmed chronic headache to explore the relevance, acceptability, clarity and comprehensiveness of the measures, as per the four stages of cognitive processing (21,22):
• Comprehension: The process of making sense of the questions and developing a response
• Memory retrieval: The process of accessing relevant information to enable a response
• Judgement: The process to determine if memory retrieval is accurate and complete
• Response mapping: The process by which an appropriate response is selected
Several overarching questions sought to explore how patients determined an improvement in their headache, and if specific questions were missing. Interviews were continued until thematic saturation was achieved. Participants were interviewed within 24 h of questionnaire self-completion. Verbal prompting was used to facilitate the interview process. To counteract fatigue bias, the CHQLQ and HIT-6 were alternately completed. All interviews were audio recorded, transcribed verbatim, and checked for accuracy (VN). Framework analysis (23) and cross-case comparison was used to generate themes, informed by PROM item content, relevance and the additional over-arching questions. NVivo software was used to organise the data. Data was independently explored by two researchers (VN, KH). Emergent themes were discussed and interpreted with a third researcher (FG).

Construct validity (quantitative evidence in support)
Description: The degree to which PROM scores are consistent with hypotheses, and based on the assumption that the PROM is a valid measure of the construct to be measured (https://www.cosmin.nl/wp-content/uploads/COSMIN-definitions-domains-measurement-properties.pdf) (14-16).
Analysis and interpretation: Assessed by correlating the scores for separate measures to assess the convergent validity of related domains (Pearson's correlation coefficient): it was expected that related constructs would correlate more strongly. Hypothesised theoretical associations between the three domains of the CHQLQ and comparator measures were considered a priori (Appendix Table 3).
The RF and RP domains of the CHQLQ and HIT-6 measure related domains and hence a stronger association than with the EF domain was hypothesised. Similarly, the SF-12 PCS and EQ-5D-5L have a greater focus on physical aspects of health, and a stronger association with the RL and RP domains was hypothesised than with the EF domain; but a stronger association between the EF and the SF-12 MCS was hypothesised. A stronger association between the EF domain and the HADS-A and HADS-D was hypothesised. Several items within all three domains of the CHQLQ consider the social impact of headache, and hence moderate to strong association with the HEiQ SIS was hypothesised. The focus of the PSEQ is one's ability (and confidence) in managing pain and engaging in (largely) physical and social activities; hence a moderate association with the RL and RP domains, but small association with the EF domain was hypothesised.

Responsiveness: the ability of a measure to detect real change in health over time that is greater than measurement error

Smallest detectable change (SDC) in score
Description: To understand the smallest change in score that is greater than measurement error in patients reporting change in headache at 12 weeks.
Analysis and interpretation: Standard error of measurement (SEM) and smallest detectable change (SDC): To represent the smallest change in score that is greater than measurement error in those patients reporting change in headache in the headache-specific health transition question at 12 weeks, we calculated:
• The absolute measurement error at 12 weeks (SEM)
• The SEM was subsequently converted into the smallest detectable change (SDC) (14-16,19,20)
• The SDC was calculated for both individuals and groups: SDCindividual

*EQ-5D item content: Stronger focus on physical function (mobility, usual activities, self-care), so stronger association with physical than with emotional domains hypothesised.
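
The SEM and SDC calculations referred to above can be sketched in Python as follows. The formulas the authors used are not given in the text, so this sketch assumes the conventional ones: SEM = SD × √(1 − ICC), SDC for individuals = 1.96 × √2 × SEM, and SDC for groups = SDC for individuals ÷ √n. The function names and the input values in the usage note are hypothetical and are not study data.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from a score SD and a reliability (ICC).

    Assumed conventional formula: SEM = SD * sqrt(1 - ICC).
    """
    return sd * math.sqrt(1.0 - icc)


def sdc_individual(sem, z=1.96):
    """Smallest detectable change for an individual.

    Assumed conventional formula: SDC = z * sqrt(2) * SEM, with z = 1.96
    for a 95% confidence level.
    """
    return z * math.sqrt(2.0) * sem


def sdc_group(sem, n, z=1.96):
    """Smallest detectable change for a group of n people.

    Assumed conventional formula: SDC_group = SDC_individual / sqrt(n).
    """
    return sdc_individual(sem, z) / math.sqrt(n)
```

For example, with an illustrative SD of 10 and ICC of 0.75, `sem_from_icc(10, 0.75)` gives an SEM of 5.0; a change in an individual's score would then need to exceed `sdc_individual(5.0)` to be interpreted as greater than measurement error.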