Construct Validity of the EuroQoL–5 Dimension and the Health Utilities Index in Head and Neck Cancer

Objective The objective of this study was to evaluate the construct validity of 2 health utility instruments—the EuroQoL–5 Dimension (EQ-5D) and the Health Utilities Index–Mark 3 (HUI-3)—and to compare them with disease-specific measures in patients with head and neck cancer. Study Design Prospective cross-sectional analysis. Setting Princess Margaret Cancer Centre. Methods Patients were administered the EQ-5D, HUI-3, the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire (EORTC QLQ-C30) and its head and neck cancer module (EORTC QLQ-H&N35), and the University of Washington Quality of Life Questionnaire (UWQoL). Several a priori expected relations were examined. The correlative and discriminative properties of the various instruments were examined. Results A total of 209 patients completed the 4 questionnaires. A significant ceiling effect was observed among EQ-5D responses (23% reported a maximum score of 1). The EQ-5D (rho = 0.79) and HUI-3 (rho = 0.60) had a strong correlation with the social-emotional domain of the UWQoL. The EQ-5D had a moderate correlation with the physical domain of the UWQoL (rho = 0.42), whereas the HUI-3 had a weak correlation (rho = 0.29). The EQ-5D and HUI-3 were able to distinguish among levels of health severity measured on the EORTC QLQ-C30 though not the QLQ-H&N35. Comparatively, the UWQoL was able to distinguish levels of disease severity on the EORTC QLQ-C30 and QLQ-H&N35. Conclusion The results of this study demonstrate that disease-specific domains from head and neck quality-of-life instruments are not strongly correlated with the EQ-5D and HUI-3. Consideration should be put toward development of a disease-specific preference-based measure for health economic evaluation. Level of evidence 4.

Module (EORTC QLQ-H&N35) 6 and the University of Washington Quality of Life Questionnaire (UWQoL). 7 The EORTC and UWQoL have been shown to be valid and reliable for patients with HNC. They are capable of detecting even small changes in HRQoL. [6][7][8][9][10][11] However, 2 key limitations exist. First, both questionnaires are specific to the HNC population and cannot be used across different diseases. Second, they do not provide health utilities (HUs). HUs are a universal measure of health outcomes and a key component of cost-effectiveness studies. 12,13 HU scores are anchored between 1 (perfect health) and 0 (death). HUs are combined with length of time spent in that health state to generate quality-adjusted life years.
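As a rough illustration of how HUs and time combine into quality-adjusted life years, the following minimal Python sketch sums utility multiplied by duration over a sequence of health states. The function name `qalys` and the example trajectory are hypothetical, and discounting of future years (common in economic evaluations) is deliberately omitted.

```python
def qalys(health_states):
    """Sum utility x duration over (utility, years) pairs.

    Assumption: no discounting is applied; each pair is one health state
    with a constant utility over its duration.
    """
    return sum(utility * years for utility, years in health_states)


# Hypothetical trajectory: 1 year at utility 0.60, then 3 years at 0.80.
total = qalys([(0.60, 1.0), (0.80, 3.0)])
print(round(total, 2))
```

A small utility difference between 2 treatments therefore translates directly into a difference in quality-adjusted life years, which is why the sensitivity of the HU instrument matters.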
Several generic questionnaires have been developed that allow the HUs to be determined, including the EuroQoL-5 Dimension (EQ-5D) and the Health Utilities Index-Mark 3 (HUI-3). 14,15 Despite the EQ-5D and HUI-3 being commonly used for many cancer sites, there has been limited uptake within head and neck oncology. [16][17][18] Patients with HNC face unique toxicities, such as difficulties with swallowing, speech, and breathing. None of these symptoms are explicitly asked about in the EQ-5D or HUI-3. In other diseases, this lack of specificity has been shown to make the EQ-5D and HUI-3 nonresponsive to changes in HRQoL. [19][20][21] This is problematic, particularly in the clinical trials space, where we are often comparing costly therapies that may lead to small but important differences in health status. Therefore, instruments that are able to generate HUs while being sensitive to change are desired.
Given this, we sought to better delineate the potential role of the EQ-5D and HUI-3 in HNC cost-effectiveness studies. Specifically, we aimed to evaluate the validity of the EQ-5D and HUI-3 by assessing whether these tools were able to distinguish across varying degrees of disease and treatment severity.

Patient Population and Setting
This validation study was conducted with prospectively collected data obtained from the Princess Margaret Cancer Centre, Toronto, Canada, between November 24, 2017, and March 23, 2018. HNC care in Ontario is provided through a universal single-payer health care system. Adult patients with mucosal squamous cell carcinomas of the upper aerodigestive tract were consecutively recruited from outpatient clinics. Non-English speakers and those who lacked decision capacity were excluded. Patients independently completed the EQ-5D, HUI-3, EORTC, and UWQoL in the same setting. Basic sociodemographic and clinical data were obtained.

Instruments
The EQ-5D 14 is a generic HU questionnaire that consists of items relating to mobility, self-care, usual activities, pain and/or discomfort, depression and/or anxiety, as well as a visual analog scale of the overall health status. We administered the 5-level version, which provides participants with a greater number of rating options as compared with the 3-level version. This has been shown to have reduced ceiling effects (fewer patients reporting the maximum score of 1). 14 EQ-5D responses are transformed into utility scores (range, 0-1), which we performed using US population tariffs (preference weights derived from the US general population).
The HUI-3 is another generic HU questionnaire. 15 It contains 8 questions/attributes pertaining to vision, hearing, speech, ambulation, dexterity, emotion, cognition, and pain and/or discomfort. Scores are assigned to each attribute and are combined with a formula to generate an HU score ranging from 0 to 1. The HUI-3 and EQ-5D allow for the generation of negative utilities, indicating a health state worse than death.
Version 4 of the UWQoL 7 is an HNC-specific questionnaire based on 12 items, with each being scored on a scale from 0 (worst) to 100 (best). The physical and social-emotional domains are scored separately. The physical domain includes chewing, swallowing, speech, taste, saliva, and appearance. The social-emotional domain comprises anxiety, mood, pain, activity, recreation, and shoulder function. Each domain score is the mean of its item scores and therefore also ranges from 0 to 100.
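The domain scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the official UWQoL scoring code: the dictionary keys, the function name `domain_score`, and the example responses are hypothetical, and missing items are not handled.

```python
from statistics import mean

# Item groupings as described in the text; key names are illustrative.
PHYSICAL = ("chewing", "swallowing", "speech", "taste", "saliva", "appearance")
SOCIAL_EMOTIONAL = ("anxiety", "mood", "pain", "activity", "recreation", "shoulder")


def domain_score(responses, items):
    """Return the mean of the item scores (each 0-100) for one domain."""
    values = [responses[item] for item in items]
    if any(not 0 <= value <= 100 for value in values):
        raise ValueError("item scores must lie between 0 and 100")
    return mean(values)


# Hypothetical responses from a single patient.
patient = {
    "chewing": 50, "swallowing": 70, "speech": 100,
    "taste": 30, "saliva": 30, "appearance": 75,
    "anxiety": 70, "mood": 75, "pain": 50,
    "activity": 75, "recreation": 75, "shoulder": 100,
}
physical = domain_score(patient, PHYSICAL)
social_emotional = domain_score(patient, SOCIAL_EMOTIONAL)
print(round(physical, 1), round(social_emotional, 1))
```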
The EORTC QLQ-C30 is a cancer-specific quality-of-life questionnaire. 22 It consists of 30 questions spanning key functional and symptom domains. The EORTC QLQ-H&N35 is an extension of the QLQ-C30 and is a module developed for use among patients with HNC. 6 It consists of 35 items related to pain, swallowing, taste/smell, speech, social eating, social contacts, sexuality, teeth problems, trismus, dry mouth, sticky saliva, cough, and feeling ill. Unlike the UWQoL, an overall domain score does not exist.

Determination of Validity
Validity is defined as the degree to which an instrument truly measures the constructs that it purports to measure. 23 Two main types of validity testing exist: construct and criterion validity. Criterion validity assesses how the measure compares against a gold standard, while construct validity tests expectations about how a measure should behave relative to hypotheses explicit to a conceptual framework. 24 Owing to the absence of a gold standard in the measurement of HRQoL and health status, construct validity alone is typically measured. An early step in construct validity testing is to hypothesize how different measures should relate, also known as convergent construct validity. 25 The more that an instrument behaves according to a priori hypothesized relations, the stronger the evidence for validity in that setting. 24
We examined construct validity in several ways. First, we hypothesized that there would be a strong positive correlation between the EQ-5D/HUI-3 and the physical and social-emotional domains of the UWQoL. As a person's health status improves, so should one's quality of life. Indeed, the UWQoL, EQ-5D, and HUI-3 all ask about mental health and physical functioning in various ways. Second, unlike the UWQoL, the EORTC does not yield an overall score. Nonetheless, it does have several questions pertaining to attributes known to heavily influence HRQoL. We selected several of these questions a priori through a review of the literature and hypothesized that median EQ-5D and HUI-3 scores would be lower for patients reporting higher levels of symptom severity as compared with lower levels of symptom severity. 26 We included UWQoL scores as a reference/comparator. Finally, we examined the EQ-5D, HUI-3, and UWQoL scores for patients who were <2 years from treatment completion, as compared with those who were ≥2 years from treatment completion. We hypothesized that HRQoL and HUs would be higher for those who were further from treatment completion.

Statistical Analysis
Baseline characteristics were examined with descriptive statistics for the entire cohort. Categorical variables were reported as absolute number and proportion and continuous variables as either mean with standard deviation or median with interquartile range. Continuous variables were assessed for normality through the Shapiro-Wilk test and through visual inspection of the histogram and quantile-quantile plots.
The minimal clinically important difference for each HU instrument was calculated through a distribution-based method, which uses statistical criteria defined from the measurement results themselves, as opposed to external indicator "anchors." 27 We defined the threshold of discrimination and thus the minimal clinically important difference for each HU instrument as half of the standard deviation for the generated HU score. 28
Spearman correlation coefficients were calculated to assess convergent construct validity between the UWQoL and the EQ-5D and HUI-3. Coefficients >0.60, 0.40 to 0.59, 0.21 to 0.39, and ≤0.20 were considered strong, moderate, weak, and no correlation, respectively.
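The 2 calculations in this paragraph can be sketched with the Python standard library alone. The analyses in this study were run in SAS, so this is only an illustrative reimplementation: the function names and example data are hypothetical, the sample standard deviation is assumed for the half-SD rule, and the handling of boundary values (classifying on the coefficient rounded to 2 decimals, with a rounded 0.60 treated as strong, as the reported HUI-3 result suggests) is likewise an assumption.

```python
from statistics import mean, stdev


def distribution_based_mcid(scores):
    """Half of the sample standard deviation of the scores (assumed)."""
    return 0.5 * stdev(scores)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def correlation_strength(rho):
    """Map a coefficient to the categories in the text (boundaries assumed)."""
    r = round(abs(rho), 2)
    if r >= 0.60:
        return "strong"
    if r >= 0.40:
        return "moderate"
    if r > 0.20:
        return "weak"
    return "none"


# Hypothetical paired scores for 5 patients.
utility = [0.55, 0.62, 0.70, 0.83, 0.95]
uwqol = [48, 40, 75, 66, 90]
rho = spearman_rho(utility, uwqol)
print(round(rho, 2), correlation_strength(rho))
print(round(distribution_based_mcid(utility), 3))
```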
The effect size, or standardized mean difference, between 2 groups on a measured outcome was also calculated. Responses from the EORTC were classified into meaningful comparator groups: not at all/a little bit or quite a bit/very much. The standardized mean difference describes the difference in means in units of standard deviation between 2 groups. It therefore allows us to directly compare the discriminative abilities of the EQ-5D, HUI-3, and UWQoL despite their different scales and variances. 29 The absolute values of the effect sizes were categorized as small (0.2-0.5), medium (0.5-0.8), or large (>0.8). 30 Instruments that displayed larger effect sizes in a particular analysis were considered to have superior discriminative ability as compared with instruments displaying smaller effect sizes. Finally, we used the Mann-Whitney U test to compare EQ-5D, HUI-3, and UWQoL scores for those who were ≥2 or <2 years from treatment completion.
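The standardized mean difference can be illustrated with the short Python sketch below. The text does not state which standard deviation was used as the denominator, so the pooled standard deviation (Cohen's d) is an assumption here, as are the function names, the example data, the "negligible" label for values below 0.2, and the assignment of boundary values to the higher category.

```python
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled SD (assumed denominator)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def effect_size_category(d):
    """Categorize |d| using the thresholds in the text (boundaries assumed)."""
    size = abs(d)
    if size > 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical utility scores dichotomized by response to one EORTC item.
not_at_all_or_a_little = [0.80, 0.85, 0.90]
quite_a_bit_or_very_much = [0.55, 0.60, 0.65]
d = standardized_mean_difference(not_at_all_or_a_little, quite_a_bit_or_very_much)
print(round(d, 2), effect_size_category(d))
```

Because the difference is expressed in standard deviation units, the same function can be applied to utility scores on a 0-1 scale and to UWQoL domain scores on a 0-100 scale, and the results compared directly.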
A 2-sided P value < .05 was considered significant. All analyses were performed with SAS University Edition (SAS Institute). The study received ethics approval from the University Health Network Research Ethics Board.

Results
In total, 285 consecutive participants were approached and 209 agreed to participate (73%). Demographic characteristics are listed in Table 1. The majority of patients were men (72%) with a mean age of 63 years. The most common tumor sites were the oral cavity (35%) and oropharynx (25%). Relatively equal numbers underwent primary surgery and primary radiotherapy. Over 90% of patients had completed treatment and were being seen in surveillance.
Table 2 reports the mean and median scores and the frequency of maximum scores (the ceiling effect). Ceiling effects were more common for the EQ-5D (23%) and the physical domain of the UWQoL (17%) than for the HUI-3 (7.7%) and the social-emotional domain of the UWQoL (9.6%). With the distribution-based method, the minimal clinically important difference was 0.06 for the EQ-5D, 0.12 for the HUI-3, and 0.08 for the physical and social-emotional domains of the UWQoL. As the EORTC has no overall score, a minimal clinically important difference could not be calculated.
Table 3 presents the correlations among the EQ-5D, HUI-3, and physical and social-emotional domains of the UWQoL. As anticipated, the EQ-5D (rho = 0.79) and HUI-3 (rho = 0.60) had a strong correlation with the social-emotional domain of the UWQoL. The EQ-5D had a moderate correlation with the physical domain of the UWQoL (rho = 0.42), whereas the HUI-3 had a weak correlation (rho = 0.29).
Table 4 presents various relations between the EQ-5D, HUI-3, and UWQoL and selected items from the EORTC known to affect quality of life. 26 As an example, a patient who answered quite a bit or very much on EORTC question 4 ("Do you need to stay in bed or a chair during the day?") had an average EQ-5D score of 0.62 on a 0-1 scale. In contrast, a patient who answered not at all or a little bit on question 4 had an average EQ-5D score of 0.83. This corresponds to an effect size of 1.89 for the EQ-5D. Anything >0.8 is considered a large effect size.
Among the 7 selected questions from the generic EORTC QLQ-C30, the effect sizes between the not at all/a little bit and quite a bit/very much health states were large (>0.8) for most questions across all 3 instruments: UWQoL (7/7 questions), EQ-5D (6/7 questions), and HUI-3 (5/7 questions). Within the head and neck-specific module (QLQ-H&N35), the discriminative ability of the HU instruments was more limited. Whereas the effect sizes between dichotomized health states were large for the UWQoL (7/9 questions), large effect sizes were generated in only 2 of 9 questions for the EQ-5D (trouble talking, Q53; painkillers, Q61) and the HUI-3 (trouble talking, Q53; feeding tube use, Q63). For the EQ-5D and HUI-3, the generated effect sizes were small or nonsignificant for swallowing (Q38), dry mouth (Q41), sticky saliva (Q42), and nutritional supplementation (Q62). This suggests that these instruments have poor discriminative ability when attempting to differentiate varying levels of HNC disease severity.
As expected, the median UWQoL, HUI-3, and EQ-5D scores were higher for those ≥2 years from treatment completion, though none of the differences reached the predetermined P < .05 threshold.
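The ceiling effect reported in the Results is simply the proportion of respondents at an instrument's maximum score. A minimal Python sketch follows; the function name, the tolerance used for floating-point comparison, and the example scores are hypothetical.

```python
def ceiling_proportion(scores, maximum=1.0, tol=1e-9):
    """Fraction of respondents whose score equals the instrument maximum."""
    at_maximum = sum(1 for score in scores if abs(score - maximum) <= tol)
    return at_maximum / len(scores)


# Hypothetical utility scores for 5 respondents (maximum utility is 1).
example_scores = [1.0, 0.83, 1.0, 0.62, 0.91]
print(ceiling_proportion(example_scores))

# For a UWQoL domain score, the maximum would be passed as 100.
print(ceiling_proportion([100, 75, 60, 100], maximum=100))
```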

Discussion
Accurate HU elicitation is the cornerstone of health economics. National health agencies have stated the EQ-5D or HUI-3 should be preferentially used to measure HUs unless there is proof that these measures are not valid in the target population. 31 This study demonstrates suboptimal construct validity for the EQ-5D and HUI-3 in the HNC population. While both measures correlate well with the social-emotional domain of the UWQoL and generic items of the EORTC QLQ-C30, they show moderate to weak correlation with the physical domain of the UWQoL and poor convergent validity with the EORTC QLQ-H&N35. The EQ-5D is superior to the HUI-3 in that it has a stronger correlation with disease-specific quality-of-life measures and a tighter standard deviation, leading to a smaller minimal clinically important difference. The EQ-5D is, however, limited by a more prominent ceiling effect.
Previous literature examining the validity of HU instruments in HNC is relatively sparse. 16 Rogers et al demonstrated significant correlation between domains of the UWQoL and the EQ-5D in a cohort of 224 patients with HNC. 8 Our group has shown that direct and indirect measures of HUs often produce disparate results and that the EQ-5D and HUI-3 (indirect measures) are better at distinguishing various measures of cancer severity relative to standard gamble and time trade-off (direct measures). 18 We recently used the same data set from this study to generate a series of mapping algorithms to convert EORTC and UWQoL responses into EQ-5D and HUI-3 HU scores with ordinary least squares regression and 2-part models. 32,33 The predictive performance of both algorithms was strong, though notably many of the HNC-specific items were not significant predictors of HUs on multivariable analysis.
Important differences exist between generic HU measures (EQ-5D and HUI-3) and disease-specific tools (EORTC and UWQoL). Because the EORTC and the UWQoL are focused on patients with HNC, more relevant and cancer-specific domains are included. From a clinical perspective, the EORTC and UWQoL offer superior face and content validity when compared with the EQ-5D and HUI-3. 34 The EORTC and UWQoL appear to measure appropriate dimensions of HRQoL for patients with HNC (face validity) and have been well designed to include all important facets of HRQoL for this population (content validity). Disease-specific tools are, at times, required to identify relative differences in health status for economic evaluations in cases where generic HU measures are unable to do so. In this study, the EQ-5D and HUI-3 had suboptimal correlation with HNC disease-specific items (UWQoL physical and QLQ-H&N35), implying that these tools may not be able to detect all relevant differences in health status for this population. This is a concern, particularly in the context of clinical trials, where we are often attempting to distinguish between 2 modalities that may generate relatively small changes in health status. As an example, the NRG-HN006 trial compares sentinel lymph node biopsy with elective neck dissection in early oral cavity cancer and is collecting EQ-5D data as part of a planned cost-effectiveness study. 35 It may be that the EQ-5D is unable to distinguish between these 2 health states, which would limit the investigators' ability to generate accurate quality-adjusted life years.
The results of this study must be interpreted in the context of several key limitations. HU tariffs are region specific, and the results of this study might not generalize to other parts of the world. Specifically, the Canadian population and its ethnodemographic profile might not mirror other jurisdictions. Additionally, while we assessed convergent validity, we did not assess the responsiveness of the instruments, as this requires longitudinal assessment. Another limitation of this study is a possible selection bias because the questionnaires were distributed largely among survivors of HNC, potentially affecting the generalizability of our findings. Since the validity of the EQ-5D is in part limited by its prominent ceiling effect, its discriminative ability may improve in a broader HNC population. Additionally, there is an element of nonresponse bias in this study for which we cannot readily adjust in our analysis. Finally, the EORTC QLQ-H&N35 was recently updated to the QLQ-H&N43. Several items were modified in this process, and results may not be directly transferable. 36

Conclusion
Generic instruments such as the EQ-5D and HUI-3 are preferred by various national health agencies, although they run the risk of being unable to detect changes in the health status of certain patient populations. The EQ-5D and HUI-3 demonstrate suboptimal construct validity for patients with HNC. Head and neck oncology may benefit from a disease-specific preference-based measure for HU elicitation.