Psychometry: Cutting-Off Points and Standardization of the Jefferson Empathy Scale Adapted for Students of Kinesiology

Currently, the most common measurement of empathy is obtained using scales that offer a continuum between a minimum and a maximum value. The objectives of this study were to establish a norm and estimate cut-off points for interpreting the Jefferson Scale of Empathy (JSE), version for Health Professions students (HPS-version), and to determine its psychometric properties in Chilean physical therapy students. A secondary analysis was done on a data set from three schools of physical therapy (n = 850; 412 women [48.5%] and 438 men [51.5%]), applying confirmatory factor analysis (CFA) and hierarchical cluster analysis. The CFA replicated the original three-factor model of empathy, which fit the data sufficiently well. A hierarchical cluster analysis yielded four categories for the level of empathy: high, medium-high, medium-low, and low. Multi-group analyses supported the assumption of a gender-invariant factor structure. Results confirmed the reliability of the global scale (α = .835), and the Perspective Taking (α = .732), Compassionate Care (α = .842), and Walking in Patient’s Shoes (α = .686) dimensions. The instrument made it possible to establish four ordinal categories in the level of students’ empathy. We conclude that the HPS-version of the JSE has adequate psychometric properties, namely validity, reliability, and cut-off points, that justify administering it to Chilean physical therapy students.


Introduction
Empathy is defined as the ability to understand feelings and emotions objectively and rationally, experiencing what other people feel and think (Díaz-Narváez et al., 2017; Preusche & Lamm, 2016; Svenaeus, 2016). Empathy is an attribute that plays an important role in the interaction between physical therapists and their patients. It involves both emotional and cognitive factors (Svenaeus, 2016), and is conditioned by their interactions. Some authors have argued that the cognitive component of empathy may be taught via disciplinary training processes, but that the same does not occur with affective empathy (Preusche & Lamm, 2016). The latter seems to consolidate from the earliest processes of ontogeny and continues to form until late adolescence. In this sense, the full longitudinal integration of empathy teaching into undergraduate curricula is important, and will positively influence patient care.
Compassionate Care (CC) is associated with the emotions of the subject, and seems to be influenced by both biology (i.e., a product of evolution, ontogeny, and their interaction) and culture, and is associated with moral, altruistic, and religious behavior, among others (Calzadilla-Núñez et al., 2017; Díaz-Narváez et al., 2017). Perspective Taking (PT) is associated with a person's ability to differentiate themselves from others (i.e., patients) and avoid "emotional contagion." The ability to understand others, or "Walking in Patient's Shoes" (WIPS), refers to the ability to actively observe a subject and thereby penetrate their thinking. CC is part of the emotional component, while PT and WIPS are parts of the cognitive component (Calzadilla-Núñez et al., 2017).
There is a positive correlation between empathy and compassion (Lee & Seomun, 2016). People who suffer from burnout, stress, excessive academic load, and compassion fatigue (among other conditions) are observed to have lower empathy (Hunt et al., 2017). These findings hint at the complex and multidimensional nature of empathy.
Regarding gender, while women are often perceived as more empathetic than men, both in the general population and among health students (Rueckert et al., 2011), empirical measurements of empathy in students from Latin America have yielded contradictory results (Calzadilla-Núñez et al., 2017; Díaz-Narváez et al., 2017; González-Martínez et al., 2018).
Teaching empathy in Chilean universities is not formalized in their curricula (Díaz-Narváez et al., 2015). There is a trend of conducting "empathic interventions"; however, these lack prior definitions of this attribute's behavior. Nevertheless, some authors recognize the possible errors of these activities and postulate that interventions of this type should involve modification of the entire curriculum to achieve permanent positive changes regarding empathy in students (Calzadilla-Núñez et al., 2017; Díaz-Narváez et al., 2015; Galán et al., 2014; González-Martínez et al., 2018; Madera-Anaya et al., 2016; Preusche & Lamm, 2016). As a consequence, an objective diagnosis of specific characteristics of empathic behavior is both important and necessary so that our interventions correspond to reality. However, currently no cut-off points allow us to classify students into empathy categories such as "high," "medium," or "low." Such classification is important, as it allows for pre- and post-intervention comparison within and between student groups from different schools and universities. The development of cut-off points requires that criteria for their delimitation be established, and that clinical (or physiological) evidence for setting these values, or norms for generating them, be found; however, such evidence does not currently exist, at least regarding empathy measurements produced with the JSE in Chile or in the rest of Latin America.
Norms for administering the JSE are recent, and nationwide guidelines only exist for US osteopathic medicine students (Hojat et al., 2018, 2019). In those studies conducted in Latin American countries (Chile, Colombia, Mexico, El Salvador, Ecuador, the Dominican Republic, Panama, and Puerto Rico, among others), JSE results have been reported using raw scores due to the absence of national or international norms, preventing researchers from adequately interpreting respondents' performance relative to their reference populations.
A suitable adaptation of the JSE for Chilean students would require both developing a norm and setting cut-off scores to ensure the usefulness of the results obtained in research and empathy diagnosis. The aim of the present study is to establish such a norm and cut-off scores based on a criterion that distinguishes levels of empathy in physical therapy students in Chile, detailing the psychometric evidence in support of these constructs.

Participants
The participants were 850 physical therapy students, 412 women (48.5%) and 438 men (51.5%), selected via convenience sampling from the University of Atacama (n = 191, 22.5%), Bernardo O'Higgins University (n = 484, 56.9%), and Universidad Mayor (n = 175, 20.6%), located in Copiapó, Santiago, and Temuco, respectively, thus covering the north, center, and south of Chile. The sample (n = 850) was randomly subdivided into two sub-samples. The first sub-sample (n₁ = 510, 60%) was used to establish the cut-off scores that generate four levels of empathy. The second sub-sample (n₂ = 340, 40%) was used to classify the students according to the previously defined levels and to establish the norm. We compared the sub-samples using the Mann-Whitney U test, which revealed no statistically significant differences in empathy and its dimensions (all p-values > .05).

Procedures
This is a methodological study with a descriptive design, based on a secondary analysis of empathy data collected from 2015 to 2020. Before the Spanish version of the scale was applied, the instrument was culturally adapted for physical therapy students using raters' criteria. The scale was then administered to groups of between 30 and 50 students during regular class hours after they had given informed consent. This study was approved by the Ethics Committee of the Universidad San Sebastián, Chile (Resolution 2020-2).

Data Analysis
The data were subjected to normality testing via the Kolmogorov-Smirnov test (Kolmogorov, 1933). Levene's (1960) test was used to evaluate the equality of variances between sub-samples according to empathy categories. Reliability was evaluated in multiple ways: using Cronbach's alpha (Cronbach, 1951) to establish internal consistency, the intraclass correlation coefficient (ICC) to establish the stability of data among universities (Mohamed & Shoukri, 2004), and McDonald's coefficient omega (McDonald, 1999), which yields a more accurate measurement of reliability by considering the homogeneity of items through item-test correlation. To analyze the factor structure of the JSE, confirmatory factor analysis (CFA) was used, with maximum likelihood (ML) estimation as recommended by Curran et al. (1996). To evaluate the fit of the models, various goodness-of-fit indices were considered: (a) the chi-square index (χ²), for which a non-significant value indicated a good fit; (b) the normed chi-square (χ²/df), for which values lower than 2 indicated adequate fit; (c) the goodness of fit index (GFI), comparative fit index (CFI), and adjusted goodness of fit index (AGFI), for which values ≥ .90 indicated an acceptable fit and values ≥ .95 a good fit; (d) the root mean square error of approximation (RMSEA), for which a value ≤ .05 (90% CI ≤ .08) indicated a good fit; and (e) the standardized root mean square residual (SRMR), for which values around .08 were judged as acceptable fits and values around .06 as excellent fits (Bentler & Bonett, 1980; Browne & Cudeck, 1992; Hu & Bentler, 1999; Kline, 2005). Factor loadings greater than or equal to 0.40 were considered significant (Stevens, 1992).
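As an illustration of the internal-consistency estimate named above, the following is a minimal sketch of the standard Cronbach's alpha formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores). It is not the SPSS routine used in the study; the function name `cronbach_alpha` and the toy response matrix are hypothetical and serve only to show the arithmetic.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of each respondent's total (summed) score.
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data (not from the study): 5 respondents x 4 items.
toy = [
    [7, 6, 7, 6],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [6, 6, 5, 6],
    [2, 3, 2, 2],
]
alpha = cronbach_alpha(toy)
print(round(alpha, 3))
```

The same variance estimator is used in the numerator and the denominator, so the choice between population and sample variance does not change the ratio.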
Factor invariance was analyzed using a multi-group analysis model (Jöreskog, 1971), using the chi-square test (χ²) to assess goodness of fit. Because the chi-square test is sensitive to sample size, decreases in the CFI of less than .01 (ΔCFI < .01) in comparison with the previous model were considered the most adequate indicator of invariance (Cheung & Rensvold, 2002). To determine the cut-off scores, as no prior criteria existed, we conducted a hierarchical cluster analysis according to the recommendations of Hair et al. (2013). We clustered the cases of the first sample (n₁ = 510) a posteriori, using standardized data and centroid clustering based on the squared Euclidean distance. This process yielded approximate values that made it possible to establish ranges with which to classify participants according to their empathy scale scores. Multiple statistics were calculated to describe the clusters, adding a Huber M-estimator due to the lack of symmetry in the distribution of the variables, which generated optimal values regardless of error distribution (Cajal et al., 2012). To determine whether sufficiently different clusters were generated, we estimated the difference between empathy means using a one-factor ANOVA and the minimum significant difference (MSD) method to make multiple comparisons of the measurements, calculating the effect size (partial eta squared, η²) and the adjusted coefficient of determination (R²). To validate the cut-off scores, we employed the second sample (n₂ = 340) and calculated its sensitivity and specificity relative to the overall scale score, to evaluate whether new cases had been accurately classified according to the cut-off scores established with the first sample (n₁).
Finally, we established a norm that made it possible to estimate a percentile value based on scores yielded by the JSE and each of its dimensions according to two different levels of the measurement and according to the cut-off scores that we had set. The level of significance used was α < .05 and β ≤ .20. All analyses were conducted with IBM SPSS Statistics 25 and Amos 25 (Figure 1).

Descriptive Statistics
The results of the normality and homoscedasticity tests were not significant (p > .05); therefore, the data on empathy and its components were considered to be normally distributed with equal variances. Descriptive statistics for the total sample and for the sample segmented by gender and university are presented in Table 1. The mean empathy score for the total sample was 107.70 (SD = 16.54, range = 59-140); the mean score for men was 105.70 (SD = 16.67, range = 66-137) and for women it was 109.82 (SD = 16.15, range = 59-140), with statistically significant differences between the sub-samples (t(848) = 3.66, p < .0001, d = .251).
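The gender comparison above can be checked from the reported summary statistics alone. The sketch below assumes the test was an independent-samples t test with a pooled standard deviation, which the text does not state explicitly; the function and variable names are illustrative.

```python
from math import sqrt


def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(m1, m2, sp):
    """Standardized mean difference using the pooled SD."""
    return (m1 - m2) / sp


def t_statistic(m1, m2, sp, n1, n2):
    """Equal-variance independent-samples t statistic."""
    return (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))


# Summary statistics reported in the text (women vs. men).
n_w, mean_w, sd_w = 412, 109.82, 16.15
n_m, mean_m, sd_m = 438, 105.70, 16.67

sp = pooled_sd(n_w, sd_w, n_m, sd_m)
d = cohens_d(mean_w, mean_m, sp)
t = t_statistic(mean_w, mean_m, sp, n_w, n_m)
df = n_w + n_m - 2
print(f"t({df}) = {t:.2f}, d = {d:.3f}")
```

Under these assumptions the computation agrees with the reported t, degrees of freedom, and effect size to the precision given in the text.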

Reliability Analysis
Reliability was estimated for the total sample, made up of the three universities, and reached suitable values for the global scale (Cronbach's alpha, α = .835, and McDonald's omega, ω = .832), with satisfactory values in the Compassionate Care (α = .842), Perspective Taking (α = .732), and Walking in Patient's Shoes (α = .686) dimensions. Coefficients were consistent with the estimates made with each university's sub-samples (see Table 1). The intraclass correlation coefficient was 0.835 (CI: 0.818, 0.851) and highly significant (F = 6.05, p < .001). When examining the homogeneity of items by means of the corrected item-total correlation (r), a range was found between .095 and .699, with a median of .407. Seventy-five percent of the items presented an adequate value above 0.30, and five items (items 15, 17, 5, 18, and 20) displayed correlations below the expected value (r15 = .095, r17 = .166, r5 = .240, r18 = .278, and r20 = .279).

Confirmatory Factor Analysis
To evaluate the construct validity of the scale and confirm the structure of the latent variables of the JSE, a confirmatory factor analysis (CFA) was performed (n₁ = 510) that sought to test the theoretical structure of three factors and 20 items proposed by Hojat and his collaborators (Hojat et al., 2018). A baseline model with adequate fit was established (χ² = 373.006, p = .0001; χ²/df = 2.275; GFI = .958; AGFI = .946; CFI = .951; RMSEA = .039 [90% CI = .034-.044]; SRMR = .043), whose significant standardized factor loadings vary between λ = .236 and λ = .790 for the total sample. Models established by university and by gender present similar values.
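As a reading aid, the reported baseline indices can be compared against the thresholds listed in the Data Analysis section. This is only a sketch: the function name `grade_fit` and the label wording are hypothetical, and the thresholds described in the text as "around" .06 and .08 are treated here as upper bounds, which is an assumption.

```python
def grade_fit(indices):
    """Label each index using the thresholds stated in the Data Analysis section."""
    labels = {}
    labels["chi2/df"] = "adequate" if indices["chi2/df"] < 2 else "above 2"
    for name in ("GFI", "AGFI", "CFI"):
        value = indices[name]
        if value >= 0.95:
            labels[name] = "good"
        elif value >= 0.90:
            labels[name] = "acceptable"
        else:
            labels[name] = "poor"
    labels["RMSEA"] = "good" if indices["RMSEA"] <= 0.05 else "above .05"
    srmr = indices["SRMR"]
    if srmr <= 0.06:
        labels["SRMR"] = "excellent"
    elif srmr <= 0.08:
        labels["SRMR"] = "acceptable"
    else:
        labels["SRMR"] = "above .08"
    return labels


# Baseline model indices as reported in the text.
reported = {
    "chi2/df": 2.275,
    "GFI": 0.958,
    "AGFI": 0.946,
    "CFI": 0.951,
    "RMSEA": 0.039,
    "SRMR": 0.043,
}
labels = grade_fit(reported)
for name, label in labels.items():
    print(f"{name}: {label}")
```

Read this way, the normed chi-square sits slightly above the stated threshold of 2, while the remaining indices fall in the acceptable-to-good range.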
The general model includes three items with factor loadings of below 0.40 (items 15, 17, and 18, see Table 2).
The CFA generates a similar pattern of results for each sub-sample, achieving an adequate fit to the global sample (made up of the three universities and divided by gender), with goodness-of-fit indices that confirm the fit of the original three-factor model to the samples studied (see Table 3).

Invariance Analysis
A factor invariance analysis was performed comparing women and men via a multigroup analysis. Likewise, invariance by university was analyzed: χ² = 849.490, p < .001, χ²/df = 1.727, SRMR = .057, GFI = .908, AGFI = .882, CFI = .914, RMSEA = .029 (90% CI = .026-.033). After establishing the baseline models by gender and university, nested models were established from the base model. Significant changes were observed in the chi-square value by university, which is reasonable given the high sensitivity of this statistic to sample size (Lévy & Iglesias, 2006); however, the differences in CFI are negligible (ΔCFI < .01 for universities and ΔCFI < .001 for gender). Because these differences are less than .01, we can assume configural and metric invariance (Cheung & Rensvold, 2002) (see Table 4).

Establishment of Cut-Off Scores
Our cluster analysis of the initial sample (n₁ = 510) yielded four clusters, clearly defined according to both their total empathy scores and the individual dimension scores, and cut-off scores were set using the upper limit of the mean in each cluster. For instance, for the total scale score (E), we defined cut-off values of 88, 108, and 121, which resulted in four levels of empathy: low (20-88 points), medium-low (89-108 points), medium-high (109-121 points), and high (122-140 points). Since the minimum and maximum possible scores are 20 and 140 points, respectively, we employed these theoretical values to set limits on the extreme values of total empathy, which empirically ranged from 66 points (minimum) to 139 points (maximum). This criterion was also applied to the scores for each dimension of empathy. The descriptive statistics of each level of empathy and its dimensions are shown in Table 5.
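The total-score cut-offs above translate directly into a lookup. The following minimal sketch assigns a level to a JSE total score using the reported cut-off values of 88, 108, and 121; the function name `empathy_level` is hypothetical, and the per-dimension cut-offs (reported in Table 5) are not included.

```python
from bisect import bisect_left

# Upper bounds of the low, medium-low, and medium-high levels (total score).
CUTOFFS = [88, 108, 121]
LEVELS = ["low", "medium-low", "medium-high", "high"]


def empathy_level(total_score):
    """Map a JSE total score (20-140) to one of the four empathy levels."""
    if not 20 <= total_score <= 140:
        raise ValueError("JSE total scores range from 20 to 140")
    # bisect_left keeps a score equal to a cut-off in the lower level.
    return LEVELS[bisect_left(CUTOFFS, total_score)]


for score in (88, 89, 108, 109, 121, 122):
    print(score, empathy_level(score))
```

Using `bisect_left` keeps each cut-off value inside the lower category, which matches the ranges given in the text (for example, 88 is low and 89 is medium-low).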
To verify that the levels set for empathy and its dimensions had been well delineated and that the categories they define are sufficiently different, we performed an analysis of variance to compare the means of each level for total empathy and each of its dimensions. Results revealed an adequate effect size (η², partial eta squared, and R², coefficient of determination), with statistically significant differences as follows: Empathy (F = 1373.78, p = .0001, η² = .932, R² = .931), CC (F = 813.16, p = .0001, η² = .890, R² = .889), PT (F = 571.97, p = .0001, η² = .850, R² = .849), and WIPS (F = 627.3, p = .0001, η² = .862, R² = .860). We used the MSD method to perform multiple comparisons among the four empathy levels, for both overall and per-dimension scores, which revealed statistically significant differences in all pairs of levels compared (p < .001).
To validate the cut-off values of the overall scale, we used the second sample (n₂ = 340) to perform a hierarchical cluster analysis that yielded four a priori clusters. We first classified all cases according to the cluster to which they originally belonged, and then according to the cut-off scores set. With this information, we generated 2 × 2 tables comparing low and medium-low levels, medium-low and medium-high levels, and finally medium-high and high levels. The data thus organized enabled us to establish the sensitivity and specificity of the scale for classifying participants according to their level of empathy (see Table 6). Table 7 presents the percentiles of interest for each level of empathy and its dimensions. This information makes it possible to classify each participant as belonging to an empathy level and improves our interpretation of their score based on a normative sample.
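For readers unfamiliar with the validation step, sensitivity and specificity follow from the four cells of each 2 × 2 table. The sketch below uses hypothetical counts, not the values in Table 6, and the function name is illustrative; here the cluster membership plays the role of the reference classification and the cut-off score the role of the test.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity and specificity from the cells of a 2 x 2 table."""
    sensitivity = tp / (tp + fn)  # share of reference positives detected
    specificity = tn / (tn + fp)  # share of reference negatives detected
    return sensitivity, specificity


# Hypothetical counts for one pair of adjacent levels (not from Table 6).
tp, fn, fp, tn = 45, 5, 2, 98
se, sp = sensitivity_specificity(tp, fn, fp, tn)
false_negative_rate = 1 - se
print(f"Se = {se:.3f}, Sp = {sp:.3f}, FN rate = {false_negative_rate:.3f}")
```

The false negative rate is the complement of sensitivity, so a sensitivity of .728 corresponds to a false negative rate of 27.2%.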

Discussion
Regarding the students' mean scores, data are not available for comparison with other physical therapy samples, but when compared with medical students from various other national and international studies, they present a mean score slightly lower than 112 points, with a standard deviation of around 12 (Hojat et al., 2018). There is a higher mean in women than in men, with a small effect size, d = 0.25 (Cohen, 1988, 1992), indicating that this difference would not be highly relevant from a practical or clinical perspective. The reliability of the measurement (Cronbach's α = .835) slightly exceeds other estimates made from a variety of university student samples in Chile and in other countries, where Cronbach's alpha values have ranged from .70 to .80 with an average of .78. The intraclass correlation of .835 is likewise indicative of good reliability (Hojat, 2018; Koo & Li, 2016). The JSE is a reliable measure for use with physical therapy students.
By studying the factor structure of the JSE, a three-factor model was obtained that fit the data sufficiently well. The model confirmed all 10 items of the Perspective Taking factor, with factor loadings equal to or greater than 0.24, low factor loadings (<0.40) on items 15 and 17, and a coefficient α = .73. The Compassionate Care factor included eight items with factor loadings equal to or greater than 0.33, with a low factor loading on item 18 and an α = .84. The third factor, Walking in Patient's Shoes, included two items with factor loadings of 0.67 and 0.77 and a coefficient α = .69. This three-factor model agrees with those reported for medical students in the US (Hojat et al., 2018), Spain (Ferreira-Valente et al., 2016), and Turkey (Bilgel & Ozcakir, 2017). The low internal consistency of the third and final factor is, we believe, explained by the small number of items that comprise it; ideally, a minimum of three items is required to stably determine a factor (Velicer & Fava, 1998). Three, or even better four, items per factor would significantly increase internal consistency (Ferrando & Anguiano-Carrasco, 2010). Despite the low factor loadings of three items (items 15 and 17 in PT, and 18 in CC), their presence does not damage the reliability of the measurement, although they have high error variance (with squared multiple correlations between .056 and .127).
Currently, the measurement of differences in levels of empathy is made using statistical estimates (Ye et al., 2020;Yuguero et al., 2019) that do not provide information about the qualitative changes observed when empathy is gained. The observed results of the factor model formed the basis for the determination of cut-off points. As a consequence, the establishment of these cut-off values provides a possible reference by which to establish comparisons between the empathy values observed not only in different schools of the same discipline within a country, but between countries and beyond this case of Chilean physical therapy students. Additionally, they serve as a reference point from which to measure the qualitative effect of a given empathic intervention; to discern whether interventions can effect change from one category to another or determine if such interventions have produced changes only within the same category.
Sensitivity exceeded 90% at two of the three cut-off points, which is satisfactory for an instrument that measures a cognitive-affective psychological attribute; it fell slightly only for the highest empathy categories (to 0.728), where it generated a false negative rate of 27.2%. We correctly identified those who possessed high empathy with a specificity that reached 98.7%. Without a doubt, further investigations into a criterion to permit the identification of those with high and low empathy are required.
Finally, the results of this study of Chilean physical therapy students provide evidence that allows inferring adequate psychometric properties of the JSE in this particular population, which is consistent with the evidence observed in samples of medical and dental students from Chile and Latin America. The three-factor, 20-item measurement model was reasonably fitted to the data, with satisfactory goodness-of-fit indicators, confirming the factor structure.
Thus, evidence was also obtained regarding factor invariance by gender, which indicated that the measure of empathy is equivalent in male and female students of physical therapy, favoring the comparison of the measurements by gender. The clusters generated provide cut-off points that assess empathy and its components categorically, and examine its changes, permitting potential comparisons between student populations, facilitating the interpretation of this variable, and ultimately simplifying decision-making processes.
A limitation of this study, due to a lack of representativeness in the sample used, is that its results cannot be generalized to all physical therapy students in Chile. To mitigate this, we selected samples from three different geographical areas of Chile.
This study includes elements that contribute to the use of the JSE in professional contexts beyond the wide use it has been given in research. The psychometric properties of the instrument were reviewed, and a norm and cut-off points were established which may be of wide and straightforward use for those health professionals who must measure empathy. It also provides sensitivity and specificity values, which permit its use as a diagnostic test. The wide use of the JSE in the context of research constitutes a foundation for its future development and, in this context, it appears to us that the establishment of test norms should continue, and that national norms should be developed for each country in which the test is used, considering different health professions and health science training areas. From the psychometric perspective, the performance of some items requires further investigation, in particular those for which several studies have shown low factor loadings (items 15, 17, and 18), as well as an evaluation of their real contribution to the specification of their respective factors. Equally, it appears reasonable to pay attention to the significance of the WIPS factor in the construct of empathy. In various studies this factor has shown low reliability relative to the scale's other two factors or, indeed, has damaged the general goodness of fit of the original three-factor model. New items that may improve the measurement of the factor could be tested, with a view to widening it to three or four items.

Conclusion
Despite the wide use of the JSE at a global level, the establishment of norms has not advanced sufficiently to allow the interpretation of a person's scores in relation to a representative sample of their population. This article proposes a norm for Chilean physical therapy students that makes it possible to position them relative to others by translating their test score into a percentile value. The use of the raw empathy score to place students in the categories of high, medium-high, medium-low, and low with respect to the total empathy score or its dimensions, together with the contribution of cut-off points of adequate sensitivity and specificity, widens the possibilities of use of the JSE. The confirmation of a factor structure that contributes to the validity of the construct, along with adequate internal consistency, indicates the reliability of the measurement for use with physical therapy students.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.