Assessing Somatic Symptoms With the Patient Health Questionnaire (PHQ-15) in Syrian Refugees

Somatic symptoms are common among Syrian refugees. To quantify somatic symptom load, sum score models derived from the Patient Health Questionnaire (PHQ-15) have been frequently applied without psychometric justification. Across two studies (total N = 776), we (a) tested different PHQ-15 factor solutions in Syrian refugees, (b) investigated measurement invariance (MI) of the factor solutions compared with German residents, and (c) scrutinized whether sum score models adequately represent the data and differ in associations with external validators compared with factor scores. One-factor, three-factor, four-factor, and a reduced one-factor solution all displayed acceptable to good model fit. The four-factor solution showed the best fit, enabling differential symptom analyses. Sum score models often had poor model fit, necessitating independent investigations before applying them. For all factor solutions, (partial) strict MI between residents and refugees could be established. All scoring methods displayed high and comparable associations with functional impairment, depressive, and anxiety symptoms.

More than 13 million Syrians have been forcibly displaced as a result of the Syrian civil war (UNHCR, 2020). Although most Syrian refugees fled to immediate neighboring countries, more than one million Syrian refugees have arrived in Germany since 2015 (UNHCR, 2020). Most of these individuals had been exposed to potentially traumatic events before and during their flight and remain vulnerable to psychological distress (Echterhoff et al., 2020). Meta-analytical findings suggest high levels of anxiety, depression, and post-traumatic stress disorder in refugees resettling in high-income countries (Henkelmann et al., 2020). However, clinical presentations and epidemiological research also highlight the role of somatic symptoms that remain medically unexplained among refugees (Hassan et al., 2015; Nesterko et al., 2020). In fact, somatic symptoms are common among Syrian refugees (Borho et al., 2021) and play a central role in their mental health symptom constellations.
Clinical studies with refugee populations often use sum score models derived from the Patient Health Questionnaire-15 (PHQ-15; Kroenke et al., 1998) to estimate somatic symptom distress (Nesterko et al., 2020). Although a limited range of psychometric aspects of the Arabic PHQ-15 have been explored in samples of university students (AlHadi et al., 2017) and primary care patients in Saudi Arabia (Becker et al., 2002), the psychometric functioning of this version in Syrian refugee populations currently residing in Western receiving countries such as Germany remains largely unknown. Specifically, there is a lack of knowledge regarding (a) the underlying factor structure in refugee populations, (b) measurement invariance (MI) compared with other populations, and (c) the validity of sum score models with respect to external associations. The present investigation addressed these gaps to provide a sound psychometric basis for this important and practically relevant application of the PHQ-15.

Psychometric Properties of the PHQ-15
Several research findings support the reliability, validity, and efficiency of this screening tool for somatization in populations in Western countries (Kroenke et al., 1998; van Ravesteijn et al., 2009; Zijlema et al., 2013). Furthermore, taxometric analyses support the dimensional structure of somatic symptoms assessed with the PHQ-15 (Jasper et al., 2012). Because of these desirable psychometric properties, the Diagnostic and Statistical Manual of Mental Disorders 5th edition workgroup suggested the PHQ-15 as a measurement tool of somatic symptom severity in somatic symptom disorders (DSM-5; American Psychiatric Association, 2013). Accordingly, researchers investigating refugee mental health in receiving countries commonly use sum score models to estimate somatic distress, implicitly assuming an underlying one-factor solution (Nesterko et al., 2020). Meanwhile, different factor solutions have been proposed for the PHQ-15 but have not been systematically tested in refugee populations. A one-factor solution is the most pertinent solution to test because simple scale composite scores are convenient and often used (Nesterko et al., 2020). Such a factor solution was also suggested in a study that demonstrated good reliability and validity of the Arabic PHQ-15 in university students from Saudi Arabia (AlHadi et al., 2017). Using the Chinese version of the PHQ-15 in other non-refugee populations, a three-factor solution was reported with the three correlated latent factors cardiopulmonary, gastrointestinal, and pain-fatigue symptoms (Liao et al., 2016; Zhang et al., 2016). A further differentiation was made with a four-factor solution by additionally separating the pain and fatigue factor in the general population and a sample of primary care patients in Germany (Witthöft et al., 2013; Zijlema et al., 2013).
Recent work reported an age- and gender-invariant bifactor model that led to an increment in model fit compared with the other factor solutions in primary care patients in Spain (Cano-García et al., 2020). This model specifies one general somatization factor and four orthogonal specific factors from the four-factor solution mentioned above (Cano-García et al., 2020; Witthöft et al., 2013). Furthermore, in a representative sample of the general population in Germany, Gierk et al. (2014) applied a shortened version of the PHQ-15, the Somatic Symptom Scale-8 (SSS-8), consisting of eight items only. Although the proposed bifactor structure reflects gastrointestinal, pain, fatigue, and cardiopulmonary factors alongside an orthogonal general somatic symptom factor, model fit for a one-factor solution was also acceptable. Moreover, most studies that compared the SSS-8 against the PHQ-15 tested similarities in internal consistencies and construct validity of total scores of entire scales (Toussaint et al., 2017). In psychosomatic outpatients in Germany, internal consistencies, item total correlations, and associations with other constructs of the SSS-8 were comparable to those of the PHQ-15 (Gierk et al., 2015). As the SSS-8 is meant to be an abbreviated version of the PHQ-15 providing a parsimonious single score, a one-factor solution is the most pertinent to test in Syrian refugees.

Measurement Invariance
To allow for unbiased mean level comparisons with other populations, multi-group invariance of the established factor solutions needs to be tested (Milfont & Fischer, 2010). Importantly, studies comparing the measurement properties of the PHQ-15 between certain populations revealed that the bifactor model with a general somatization factor and four symptom-specific factors was comparable among German and Dutch samples, but not a Chinese sample (Leonhart et al., 2018). This demonstrates the relevance of independent testing because measurement models may be biased. Accordingly, MI of the PHQ-15 may be violated when the overall construct of somatization is differently represented in refugee populations, indicating a lack of conceptual equivalence between populations. Different factor solutions may represent a more appropriate reflection of the data in refugee samples compared with other populations. In fact, Syrian refugees have specific idioms to express distress, and these may be related differently to the construct of somatization (Hassan et al., 2015). Symptoms may thus be non-invariant if the language used fails to express them in a culturally appropriate way. Certain items may display floor effects when refugees hesitate to agree with statements that do not resonate with their cultural expression of somatic symptoms. This would decrease the ability of these items to differentiate between individuals on the latent factor and could be reflected in higher error variances of the items. There is large variation in the way individuals interpret the meaning of language, and this interpretation can be influenced by slight nuances (Sulaiman et al., 2001). Accordingly, response categories may have different meanings for refugees than for members of the receiving countries, leading to different item thresholds. For instance, cultural norms may lead refugees to deny or underreport problems because they fear being perceived as weak (Hassan et al., 2015). Refugees might thus be hesitant to admit that they are bothered a lot by certain symptoms, and thresholds for this response option may differ compared with residential populations.

Use of Sum Scores and External Validators
Congeneric factor models examining MI do not simply translate into sum score models in which symptoms equally contribute to the underlying latent construct (McNeish & Wolf, 2020). Items contain different information regarding the underlying construct by having different factor loadings. Equal treatment of items in composite scores is oversimplified and may lead to a loss of information. A simulation study suggests that 3% unexplained variability between factor scores and sum scores can result in different conclusions based on these scores (McNeish & Wolf, 2020). As counting somatization symptoms with the PHQ-15 is common practice, it is important to discern whether the factor models fit the data when all factor loadings of a given latent factor are constrained to be equal, which is the underlying assumption of sum scores. Furthermore, from a practical research perspective, it is relevant to test whether associations with external validators differ when factor scores are used compared with sum score models. This way, it is possible to examine the extent of potential bias in external associations when convenient sum score models are applied among refugee populations.

The Present Research
We aimed to systematically examine the psychometric properties of the PHQ-15 in Syrian refugees residing in Germany (Witthöft et al., 2013; Zijlema et al., 2013). Our research goals were threefold. First, we examined model fit of different factor solutions in refugees: a one-factor solution with one latent factor, a three-factor solution with the correlated latent factors pain-fatigue, gastrointestinal, and cardiopulmonary symptoms, a four-factor solution with the correlated latent factors pain, gastrointestinal, cardiopulmonary, and fatigue, a bifactor model including the four orthogonal specific factors plus a general somatization factor, and the SSS-8 factor structure. Second, we investigated MI of these factor solutions in refugees compared with German residents to ensure that the same underlying construct is measured across groups. Third, we tested whether sum score models adequately represent the data and whether associations with external validators differ between factor scores and sum scores. External validators were based on commonly reported associations with somatic symptoms: depressive symptoms, anxiety symptoms, and functional impairment (Jongedijk et al., 2020; Kohlmann et al., 2016; Nesterko et al., 2020).

Participants and Design
Data were drawn from two studies investigating mental health in refugees in Germany compared with German residents. Respondents were at least 18 years old and provided informed consent. Participants were recruited via social media platforms and email lists, and took part online (monetary compensation: €5). Ethical approval for data usage in both studies was granted by the psychology department's ethics committee of the University of Münster. Participants provided informed consent to the anonymized use of their data for research purposes, including the present psychometric analyses. Sample sizes were guided by practical considerations, namely the feasibility of collecting data from refugees in Germany, with a target of 200 participants per group (refugees and residents, respectively) for both studies. In Study 1, N = 402 individuals participated, of which 202 were German residents and 200 were refugees residing in Germany. Initially, N = 410 individuals participated in Study 2, of which 205 were German residents and 205 were refugees. In Study 2, participants were instructed to read a short text about mental health services and were subsequently asked to answer three content-related multiple choice questions to ensure attentive reading. Data from participants who provided false responses were excluded. We used established versions of the German questionnaires, and all scales were then translated into Modern Standard Arabic by a professional translation office. The translated versions were carefully double-checked and compared with the German versions by a native Arabic speaker who was fluent in both Arabic and German.
The PHQ-15 assesses the following somatization symptoms: stomach pain, back pain, pain in arms and legs, pain during intercourse, headaches, chest pain, dizziness, heart race, short breath, constipation, nausea, gas, trouble sleeping, and feeling tired (for a review of established versions of this questionnaire including the German version, see Kroenke et al., 2010). The latter two symptoms are taken from the PHQ-9 module assessing depressive symptoms. Symptoms are assessed for the last four weeks on a three-point Likert-type scale ranging from 0 (not bothered at all) through 1 (bothered a little) to 2 (bothered a lot). The derived total scores thus range from 0 to 30. Higher scores indicate greater severity of somatic symptoms. Cutoffs for somatic distress are minimal (≤4), mild (≥5), moderate (≥10), and severe (≥15). The PHQ-15 was applied in both studies.
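The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the studies: the function names `phq15_sum_score` and `phq15_severity` are hypothetical, and the sketch assumes that all 15 items were answered and coded 0, 1, or 2.

```python
from typing import Sequence


# Hypothetical helper names; the cutoffs follow the description in the text.
def phq15_sum_score(responses: Sequence[int]) -> int:
    """Sum 15 item responses, each coded 0, 1, or 2."""
    if len(responses) != 15:
        raise ValueError("expected 15 item responses")
    if any(r not in (0, 1, 2) for r in responses):
        raise ValueError("each response must be 0, 1, or 2")
    return sum(responses)


def phq15_severity(total: int) -> str:
    """Map a total score (0 to 30) to a severity label."""
    if not 0 <= total <= 30:
        raise ValueError("total score must lie between 0 and 30")
    if total >= 15:
        return "severe"
    if total >= 10:
        return "moderate"
    if total >= 5:
        return "mild"
    return "minimal"
```

Under these cutoffs, a respondent who is bothered a little by every symptom reaches a total of 15 and falls into the severe band.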

Materials
PHQ-9. In Study 1, the PHQ-9 was used as an external validator to assess depressive symptoms during the last 2 weeks (Spitzer et al., 1999). Response options are not at all (0), several days (1), more than half the days (2), and nearly every day (3). The PHQ-9 (α = .70 for residents; α = .81 for refugees) showed acceptable to good internal consistencies in this study. The psychometric functioning of the PHQ-9 has been demonstrated across many populations (Gräfe et al., 2004; for a review, see Kroenke et al., 2010) and in Middle Eastern refugees in Germany alongside evidence for scalar MI between refugees and residents. Because two PHQ-9 items are also included in the PHQ-15, we used PHQ-9 scale composite scores without these two items.
Functional Impairment. As an additional external validator, we used a question tapping into functional impairment with one PHQ item (Kroenke et al., 2010; Spitzer et al., 1999) (If you checked off any problems, how difficult have these problems made it for you to do your work, take care of things at home, or get along with other people?). Participants responded to this item on a scale from (1) not difficult at all to (4) extremely difficult.
PHQ-4. In Study 2, we used the PHQ-4 as an external validator (Kroenke et al., 2009). It comprises four items on a four-point scale from 0 (not at all) to 3 (nearly every day). Items pertain to endorsement in the last two weeks referring to two core symptoms of depression and anxiety, respectively (Gräfe et al., 2004). MI of the PHQ-4 was established between German residents, migrants, and refugees living in Germany (Tibubos & Kröger, 2020).

Analytic Strategy
Factor Solutions. Analysis code and data can be found in the Open Science Framework (https://osf.io/c3t6h/?view_only=66c59c75fdd745f180e6ff15464b5795). Analyses were conducted in R (R Core Team, 2019) and we used the lavaan package for our factor analytical models (Rosseel, 2012). Because of the three response options of the PHQ-15, we treated data as categorical and used the weighted least squares mean and variance adjusted (WLSMV) estimator for all analyses (Asparouhov & Muthén, 2010). In both studies, we tested four different factor solutions as proposed by the literature with confirmatory factor analyses (CFA): a one-factor solution with one latent factor (AlHadi et al., 2017), a three-factor solution with the correlated latent factors pain-fatigue, gastrointestinal, and cardiopulmonary (Zhang et al., 2016), a four-factor solution with the correlated latent factors pain, gastrointestinal, cardiopulmonary, and fatigue (Witthöft et al., 2013), and a one-factor solution of the SSS-8 (Gierk et al., 2015). Table 2 shows the items that belong to each factor. The bifactor model that we initially aimed to test did not converge, indicating that the model is too complex for the data. Likewise, a bifactor model with three orthogonal factors and one general somatization factor did not converge. For each of the factor solutions, we tested whether sum score models adequately represent the data by constraining the factor loadings of all items of a given latent factor to be equal. The following criteria were used to evaluate model fit: The comparative fit index (CFI) and the Tucker-Lewis Index (TLI) should both be larger than 0.95 or 0.90, for a good or acceptable model fit, respectively (Hu & Bentler, 1999). For the standardized root mean square residuals (SRMR), values lower than 0.08 indicate acceptable fit (Hu & Bentler, 1999). For the root mean square error of approximation (RMSEA), values below 0.05 indicate good fit and values up to 0.08 acceptable fit, and they should not exceed 0.10 (Browne & Cudeck, 1992). To have an estimation of reliability for both factor and sum score models, we estimated omega total (ω total) (McDonald, 1999) and Cronbach's α (Cronbach, 1951). Omega total considers the different factor loadings while Cronbach's α assumes equivalent factor loadings; values > .80 indicate good internal consistency and values > .70 acceptable internal consistency (McNeish, 2018).
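The fit criteria and the two reliability coefficients can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the analysis code from the OSF repository: all function names are hypothetical, the label "marginal" for RMSEA values between 0.08 and 0.10 is our own shorthand, and `omega_total` uses the textbook formula for a one-factor model with standardized loadings and uncorrelated residuals, which may differ from variants designed for ordinal items.

```python
from statistics import variance
from typing import Sequence


def classify_cfi_tli(value: float) -> str:
    """CFI/TLI: > 0.95 good, > 0.90 acceptable."""
    if value > 0.95:
        return "good"
    if value > 0.90:
        return "acceptable"
    return "poor"


def classify_rmsea(value: float) -> str:
    """RMSEA: < 0.05 good, up to 0.08 acceptable, should not exceed 0.10."""
    if value < 0.05:
        return "good"
    if value <= 0.08:
        return "acceptable"
    if value <= 0.10:
        return "marginal"  # assumed label for the 0.08 to 0.10 band
    return "poor"


def classify_srmr(value: float) -> str:
    """SRMR: < 0.08 acceptable."""
    return "acceptable" if value < 0.08 else "poor"


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; `items` holds one list of scores per item."""
    k = len(items)
    n = len(items[0])
    totals = [sum(item[i] for item in items) for i in range(n)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def omega_total(loadings: Sequence[float]) -> float:
    """Omega for one factor, standardized loadings, uncorrelated residuals."""
    common = sum(loadings) ** 2
    residual = sum(1 - l ** 2 for l in loadings)
    return common / (common + residual)
```

Because `omega_total` weights each item by its own loading, it can diverge from α when loadings are heterogeneous, which is the situation in which sum scores are least defensible.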

Measurement Invariance. To evaluate whether somatization can be compared between refugees and residents, we systematically tested MI between residents and refugees in a multigroup CFA framework (Meredith, 1993; Milfont & Fischer, 2010). To establish MI, increasingly constrained and nested models were sequentially tested against each other. In these hierarchically nested models, constraints are added at each step in addition to the former constraints (Meredith, 1993; see also Millsap, 2012). First, the factor structure was constrained to be equivalent across groups (configural invariance). This allows researchers to establish whether the construct is conceptually represented with the same factor structure across groups. Next, the factor loadings were constrained to be equal across groups to gauge whether the items relate to the somatization factor in the same way across groups in addition to having the same factor structure (weak/metric invariance). This level of MI allows comparison of variances and covariances between the tested groups (Millsap, 2012). Then, item thresholds were additionally constrained to be equivalent to discern whether the observed thresholds, conditional on the latent factor, are the same across groups in addition to having the same factor structure and equal factor loadings (strong/scalar invariance). In weighted least squares approaches (Muthén, 1984), item thresholds are introduced to account for the ordered nature of the observed data, assuming that participants' responses reflect a discrete categorization of the underlying latent variable and that both are related by a threshold relationship. When an observed variable has r response categories, the variable has r − 1 thresholds τ_j, resulting in two thresholds for the three response options of the PHQ-15. Establishing this level of MI allows comparison of the means, variances, and covariances of the latent factors between the groups (Millsap, 2012).
Last, the residual variances of the items were also constrained to be equal to examine whether the amount of variance in the items not explained by the latent factor is the same across groups in addition to having the same factor structure, equal factor loadings, and equal thresholds (strict/residual invariance; Meredith, 1993). This level of MI indicates that group differences are truly attributable to differences in the underlying construct (Millsap, 2012). To test strict MI, theta instead of delta parameterization was used. To detect violations of MI, we evaluated changes (Δ) in the CFI and RMSEA. That is, we calculated the differences between the fit indices of two nested models and evaluated the level of their discrepancies, with ΔCFI ≥ .010 and ΔRMSEA ≥ .007 indicating substantial deterioration in model fit (Chen, 2007; Meredith, 1993; Milfont & Fischer, 2010). We also report the χ² difference test, although it can lead to inflated Type I error rates and thus falsely indicate poor model fit (Sass et al., 2014). Non-invariance of certain items may not preclude the possibility of unbiased group comparisons (for a simulation study, see Guenole & Brown, 2014). A few parameters (e.g., several factor loadings) can be relaxed relative to the number of invariant parameters while still yielding relatively unbiased estimates (Byrne et al., 1989). When full MI was not supported, partial MI models were tested by iteratively freeing parameters according to their unconstrained between-group discrepancies (Byrne et al., 1989; Guenole & Brown, 2014). These models were then tested against the model established in the step before.
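The decision rule for comparing two nested invariance models can be sketched as follows. This is a hedged illustration, not the authors' procedure: the names `FitIndices`, `fit_deterioration`, and `n_thresholds` are hypothetical, and the sketch reports the two flags separately because the text does not specify whether one or both criteria must be met before a model is rejected.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class FitIndices:
    """Fit indices of one multigroup model (hypothetical container)."""
    cfi: float
    rmsea: float


def fit_deterioration(
    baseline: FitIndices,
    constrained: FitIndices,
    cfi_cutoff: float = 0.010,
    rmsea_cutoff: float = 0.007,
) -> Dict[str, Union[float, bool]]:
    """Compare a more constrained model against its less constrained baseline."""
    delta_cfi = baseline.cfi - constrained.cfi        # drop in CFI
    delta_rmsea = constrained.rmsea - baseline.rmsea  # rise in RMSEA
    return {
        "delta_cfi": delta_cfi,
        "delta_rmsea": delta_rmsea,
        "cfi_flag": delta_cfi >= cfi_cutoff,
        "rmsea_flag": delta_rmsea >= rmsea_cutoff,
    }


def n_thresholds(n_categories: int) -> int:
    """An ordinal item with r response categories has r - 1 thresholds."""
    return n_categories - 1
```

In a sequence of configural, metric, scalar, and strict models, each step would be passed as `constrained` with the previously accepted model as `baseline`; a flagged step would then prompt the search for a partial invariance model.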
External Validation. We aimed to investigate the extent to which different derived PHQ-15 scores vary in their associations with external variables in the refugee sample. Given that MI between German residents and refugees from Syria was already established for the PHQ-9 and the PHQ-4 (Tibubos & Kröger, 2020), we used sum scores of our external validators. For the PHQ-15, we extracted factor scores for each model with the empirical Bayes method (Muthén, 1998-2004). We compared the external associations of these factor scores with the sum scores of the respective models by using Pearson's correlations. In Study 1, we compared the association of the different scoring methods with depressive symptoms (measured by the PHQ-9 excluding the two overlapping items) and functional impairment. In Study 2, we compared them with depressive and anxiety symptoms (PHQ-4). To have a comparison of maximally unconstrained associations, we used factor scores of the configural models and contrasted them with maximally constrained sum scores.
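The comparison of scoring methods rests on Pearson correlations and their confidence intervals. The sketch below shows one common way to obtain them; it assumes a Fisher z interval, which the text does not confirm as the method used, and the names `pearson_r`, `fisher_ci`, and `intervals_overlap` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r: float, n: int, z_crit: float = 1.959964) -> Tuple[float, float]:
    """Approximate 95% CI for r via the Fisher z transformation (assumed method)."""
    if n <= 3 or not -1 < r < 1:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Checking interval overlap is a descriptive heuristic only; because sum scores and factor scores come from the same respondents, a formal test would need to account for the dependence between the two correlations.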

Results
Demographic characteristics are depicted in Table 1. Of the 200 refugees in Study 1, 182 (91%) were Syrians. Because the present study focuses on Syrian refugees, we excluded data from 18 non-Syrian refugees from countries such as Afghanistan or Iraq, resulting in a final sample of N = 384. In Study 2, n = 11 refugees were not from Syria (e.g., again from Afghanistan or Iraq), and their data were excluded from further analyses. Data from seven further refugees were excluded because they failed the attention check.

Descriptive Statistics
Descriptive statistics including the median and interquartile range of all PHQ-15 items from both studies can be found in Table 2. In the refugee samples, the item with the lowest endorsement was fainting, followed by pain during intercourse and menstruation problems. Although most items showed acceptable skewness and kurtosis, the symptom fainting was extremely non-normally distributed. This is in line with former studies, in which this item was excluded due to such distributions (Cano-García et al., 2020). Given that the PHQ-15 had never been tested in Syrian refugees and the WLSMV estimator can handle non-normally distributed items (Asparouhov & Muthén, 2010), we tested the factor solutions in the refugee samples with and without this item. Also, in line with previous research (Cano-García et al., 2020), we excluded the symptom menstruation problems from our analyses because it is gender specific and both refugee samples were predominantly male.

Internal Consistencies
One-Factor Solution. Table 3 shows the internal consistencies of the factor solutions. For the one-factor solution, internal consistencies were good for refugees and residents in both studies (all values ≥ .82).
Three-Factor Solution. In the three-factor solution, internal consistency was acceptable for all factors in the refugee sample according to ω total in both studies. However, Cronbach's α indicated below-acceptable internal consistency for the gastrointestinal factor in both studies, and for the cardiopulmonary factor in Study 1 (all alphas ≥ .66). For residents, internal consistencies ranged from .63 (pain-fatigue, Study 1) to .73 (cardiopulmonary, Study 2).
Four-Factor Solution. In both studies, for the gastrointestinal and cardiopulmonary factors, we found the same pattern as for the three-factor solution. In Study 1, the internal consistency of the latent pain factor was acceptable for refugees (≥ .71), but it was below acceptable according to both indices for refugees in Study 2, and for residents in both studies (range .60-.66). The fatigue factor had acceptable internal consistencies in Study 2 for refugees (.77 according to both indices) and for residents in both studies (all values ≥ .70), but had below-acceptable internal consistency according to both indices for refugees in Study 1 (.68).

SSS-8 Factor. In both studies, internal consistencies were acceptable for both refugees and residents (range .71-.79).

Factor Solutions
All factor solutions were tested with and without the fainting item (see Table 4). In all factor models, we observed a decrement in model fit according to the χ² test statistic and fit indices when the fainting item was included. Although this item had acceptable factor loadings (all λ ≥ .40) and model fit was still acceptable, it had the lowest factor loading in all solutions, which was also apparent in worse model fit in the sum score models. We therefore focus on the models without this item in the following analyses.
One-Factor Solution. In both samples, the one-factor solution had good model fit according to CFI and TLI, but only close to acceptable model fit or non-acceptable model fit according to RMSEA and SRMR. The respective sum score models showed acceptable model fit according to CFI and TLI but not acceptable fit according to the other indices.
Three-Factor Solution. The three-factor solution had good model fit according to CFI and TLI, but only acceptable model fit according to RMSEA and close to acceptable fit according to the SRMR. Model fit was descriptively better than for the one-factor solution. The respective sum score model showed good fit according to CFI and TLI, acceptable fit according to RMSEA, but non-acceptable fit according to SRMR.
Note. CFA = confirmatory factor analyses; 1F = one factor; 3F = three factors; 4F = four factors; SSS-8 = Somatic Symptom Scale-8; G = gastrointestinal; p = pain-fatigue symptoms (3 factor solution); C = cardiopulmonary; F = fatigue symptoms (4 factor solution).
Table 5 shows freely estimated factor loadings for both groups and both studies for the different factor solutions.

Measurement Invariance
Factor loadings for refugees were all good and exceeded .40. In Study 1, pain during intercourse displayed the lowest factor loadings in the refugee group; in Study 2, headaches did. Factor loadings tended to be somewhat higher for their respective factors in the three- and four-factor solutions compared with the one-factor solution and the SSS-8. Some factor loadings were, however, problematic in the residents group. In Study 1, pain in the arms and legs had factor loadings that were below .30 for the one-factor solution. The same applied to back pain in Study 2. The largest factor loading discrepancies between groups were also found for these symptoms.
One-Factor Solution. Table 6 shows the MI tests of the different factor solutions among refugees and German residents. Configural model fit was good to acceptable for the one-factor solution in Studies 1 and 2 according to CFI, RMSEA, and TLI, and non-acceptable according to SRMR. Model fit deteriorated substantially for the metric invariance model in both studies according to ∆CFI and ∆RMSEA. We could establish partial metric MI when setting factor loadings of pain in arms and legs free in Study 1, and factor loadings of back pain in Study 2. In Study 1, we could only establish partial scalar and strict MI when we additionally set the item thresholds of headaches (τ1 refugees = −0.01, τ2 refugees = 1.18; τ1 residents = −0.59, τ2 residents = 1.24) and chest pain (τ1 refugees = 0.33, τ2 refugees = 1.51; τ1 residents = 1.10, τ2 residents = 2.89) free. In Study 2, partial strict MI was established.
Three-Factor Solution. In Study 1, we found the same pattern for the three-factor model with acceptable to good model fit for the configural model according to all indices. Partial strict MI could be established when setting free the same parameters as for the one-factor solution: factor loadings of pain in arms and legs and item thresholds of headaches
Four-Factor Solution. For the four-factor solution, we found good model fit for the configural invariance model in both studies according to the CFI, RMSEA, and TLI, and acceptable fit according to the SRMR. In Study 1, again factor loadings of pain in arms and legs were set free to establish partial metric MI. Partial scalar MI was given. To establish partial strict MI with overall good model fit, we set the residuals of the symptom headaches free. In Study 2, we found strict MI with acceptable to good model fit.
SSS-8 Factor. Configural model fit was acceptable to good in both studies. Metric MI assumptions were violated in both studies. We thus set factor loadings of pain in arms and legs free in Study 1, and factor loadings of back pain in Study 2. To establish partial scalar MI in Study 1, we freed the thresholds of stomach pain (τ1 refugees = 0.22, τ2 refugees = 1.57; τ1 residents = −0.01, τ2 residents = 1.47) and headaches (τ1 refugees = −0.02, τ2 refugees = 1.19; τ1 residents = −0.62, τ2 residents = 1.08). To establish partial strict MI, we relaxed constraints on the residuals of stomach pain and chest pain in Study 1 and chest pain in Study 2.

Mean Differences
Establishing these levels of MI allows a relatively unbiased interpretation of the unadjusted manifest mean differences reported in Table 7 (Guenole & Brown, 2014). To ensure that this assumption holds, we also report the latent mean differences of the models with the highest constraints, that is partial strict MI or strict MI (see Table 7). In Study 1, refugees reported higher unadjusted manifest mean levels in the SSS-8 solution, higher levels of pain-fatigue symptoms on the three-factor solution, and higher values on the latent factors cardiopulmonary and fatigue symptoms in the four-factor solution. In Study 2, refugees reported higher values on all latent factors. The latent mean differences yielded the same interpretations with the exception that residents reported slightly higher means on the pain factor in the four-factor solution in Study 1.

External Validation
Correlations Among Scoring Methods. In Study 1, the correlations between sum scores and their respective factor scores ranged from r = .86 (for the cardiopulmonary factor in both the three- and four-factor solutions) to r = .99 (for the one-factor solution and the SSS-8 solution). Between 2% and 26% of the variance between the scores thus remained unexplained. In Study 2, the correlations between sum scores and their respective factor scores ranged from r = .90 (for the pain factor in the four-factor solution) to r = .99 (for the one-factor solution and the SSS-8 solution). Here, between 2% and 19% of the variance remained unexplained.
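The unexplained-variance percentages follow directly from the reported correlations as 1 − r². A minimal Python check is shown below; the function name `unexplained_variance` is hypothetical.

```python
def unexplained_variance(r: float) -> float:
    """Share of variance not shared between two scores correlated at r."""
    return 1 - r ** 2


# Correlations reported in the text, with the resulting percentages.
for r in (0.86, 0.90, 0.99):
    print(f"r = {r:.2f}: {round(100 * unexplained_variance(r))}% unexplained")
# r = 0.86: 26% unexplained
# r = 0.90: 19% unexplained
# r = 0.99: 2% unexplained
```

Even the smallest of these values is close to the 3% of unexplained variability that, according to the simulation study cited earlier (McNeish & Wolf, 2020), can already lead to different conclusions.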
External Associations. All derived scores correlated highly with functional impairment and with anxiety and depressive symptoms (see Table 8 for correlations and 95% CIs). For the one-factor solution and the SSS-8 factor solution, sum scores and factor scores did not differ in their associations with external variables. In the three-factor solution, the pain-fatigue factor scores and sum scores showed the highest associations with outcome measures, and the different scoring methods did not differ from each other. For the gastrointestinal and the cardiopulmonary symptoms, lower associations were found descriptively compared with the pain-fatigue factor. Moreover, sum scores and factor scores deviated from each other, although most confidence intervals still overlapped. Factor scores descriptively displayed higher correlations than sum scores. For the four-factor solution, the fatigue scores had the highest associations with the external measures, and sum scores and factor scores did not deviate from each other. The other three factors displayed lower associations, and the sum scores deviated from the respective factor scores by up to .17. For the cardiopulmonary symptom factor and its association with the PHQ-9 in Study 1, sum score and factor score intervals did not overlap.
Note (Table 7). M = mean; SD = standard deviation; t = t-value; g = Hedges' g; Latent diff = latent difference for the factor model with the highest constraints (including partial measurement invariance); SSS-8 = Somatic Symptom Scale-8. *p < .001. **p < .01. ***p < .001.
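The External Associations paragraph compares 95% confidence intervals of correlations for sum scores and factor scores. The sketch below shows one common way to obtain such intervals, the Fisher z-transformation, and to check whether two intervals overlap. It is an assumption that this approach resembles the one used for Table 8. The correlations and the sample size are hypothetical values chosen for illustration only (they differ by .17, the largest deviation mentioned in the text), and the function names are illustrative.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.959964) -> tuple:
    """Approximate 95% CI for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)                # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical inputs, NOT values taken from Table 8.
n = 200
ci_sum = fisher_ci(0.40, n)      # hypothetical sum score correlation
ci_factor = fisher_ci(0.57, n)   # hypothetical factor score correlation

print("sum score CI:   ", ci_sum)
print("factor score CI:", ci_factor)
print("overlap:", intervals_overlap(ci_sum, ci_factor))
```

Because both correlations come from the same sample, interval overlap is only a rough descriptive heuristic; a formal comparison would require a test for dependent correlations.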

Discussion
Using data from two studies, we examined the psychometric properties of the PHQ-15 in Syrian refugees currently residing in a Western receiving country (i.e., Germany).
The symptom fainting had the lowest endorsement and led to a decrement in model fit when it was included. Although including this item may not lead to drastically worse model fit, it may reduce accuracy, especially in sum score models, given that it contributes to more heterogeneous factor loadings. In line with previous research, we thus excluded the item from further analyses (Cano-García et al., 2020). However, we need to acknowledge that this item might be theoretically relevant, and future studies including this item are advised to conduct independent testing in their samples. The one-factor solution had adequate model fit. Sum score models, too, had acceptable fit according to some fit indices, while the RMSEA and SRMR mainly indicated poor model fit for sum score models. Omega total and Cronbach's α indicated good internal consistencies for the one-factor solutions. The three-factor and four-factor solutions fitted the data descriptively better, with the latter solution displaying the best model fit, in line with psychometric studies in the general population as well as primary care patients in Germany and China (Liao et al., 2016; Witthöft et al., 2013; Zhang et al., 2016; Zijlema et al., 2013). The respective sum score model seemed acceptable according to some fit indices. The SSS-8 model also showed good to acceptable model fit. Consistent with findings in psychosomatic outpatients in Germany, the psychometric properties were comparable to the PHQ-15 (Gierk et al., 2015). In principle, all models seem applicable, in line with the often-reported good psychometric properties of the PHQ-15 (Zijlema et al., 2013). The present study thus extends previous research on the psychometric functioning of the Arabic PHQ-15 version, which was formerly tested with university students in Saudi Arabia (AlHadi et al., 2017), to Syrian refugees currently residing in Germany. The different models have their own merits.
The one-factor solution allows for convenient applications, for instance, in large epidemiological studies or in a multistep diagnostic process, where single total scores with cut-off values are practical. If resources are limited further, the SSS-8 model can be used, in line with its initial purpose. The three-factor and four-factor solutions may allow for fine-grained analyses such as elucidating distinct symptom profiles. This use is supported by our finding that the subscales showed differential associations with external validation measures. In the three-factor solution, the pain-fatigue factor showed the strongest associations with external validators. However, this was likely attributable to the fatigue items, because the fatigue factor showed the strongest associations with external validators when tested separately in the four-factor solution, while associations of the pain factor became weaker. The cardiopulmonary and gastrointestinal symptom factors may display higher correlations with other constructs not assessed in the present studies (e.g., panic symptoms).
The external associations of sum scores and factor scores differed more strongly for the three-factor and four-factor solutions. For instance, for the cardiopulmonary symptom factor, the associations with depressive symptoms differed between the two scoring methods. Correlations between the factor scores and sum scores were also lower for the subfactors than for the entire scale. This evidence aligns with simulation studies that suggest unexplained variability between sum and factor scores can lead to different conclusions (McNeish & Wolf, 2020). In the present studies, sum scores appeared to be slightly downwardly biased, which should be considered by researchers applying fine-grained symptom analyses with these scales. For the one-factor solution, we did not detect these practical problems concerning external associations, although some information is lost given the worse model fit of this solution.
Recent studies tested more granular bi-factor models (Cano-García et al., 2020; Leonhart et al., 2018), which did not converge in our own analyses. The models may have been too complex for our data given sample size restrictions. However, this may be a problem that researchers face in refugee studies because of typically small sample sizes due to practical constraints (Borho et al., 2021). It thus remains open whether a general somatization factor in a model that additionally accounts for variance by specifying specific orthogonal factors meaningfully reduces biases in total scores. Further studies should establish whether a potential reduction of bias through such models outweighs their computational complexity.
We established either partial or full strict MI between Syrian refugees in Germany and German residents for the tested models, which provides evidence that mean level comparisons between these two groups can be made (Milfont & Fischer, 2010). Previous work has demonstrated comparable scale properties in German and Dutch samples, but not in a Chinese sample (Leonhart et al., 2018). In addition, MI was demonstrated for gender and age in previous studies (Cano-García et al., 2020; Gierk et al., 2014). Demonstrating MI between refugees and members of the residential population is an important premise for understanding cross-cultural differences in somatic symptom prevalence. When problems with metric MI were encountered, this was mainly attributable to lower factor loadings in the German subsample. Therefore, these items do not seem to be problematic in refugees. Concerning scalar MI, we found different thresholds between refugees and residents for a few items. Although the thresholds for headaches were lower in residents, the thresholds for chest pain were considerably lower in refugees. Also, the thresholds for stomach pain were higher for refugees in one model. It needs to be noted that there is no clear consensus on the number of parameters that can be freed to achieve partial MI (for a review see Putnick & Bornstein, 2016). Some studies suggest that at least half of the parameters should be invariant (Vandenberg & Lance, 2000), while simulation studies demonstrate increasing mean level bias with an increasing number of non-invariant items (Guenole & Brown, 2014). We set a maximum of two thresholds per model free, and a maximum of four parameters in total, constituting a relatively small proportion of the estimated parameters. Only in the four-factor solution in Study 1 did one tested mean comparison yield a different result when manifest unadjusted scores were compared with partially invariant latent scores.

Limitations
It must be noted that the present findings reflect a limited perspective. This is because Syrian refugees have specific idioms to describe somatic symptoms and it remains unknown whether the PHQ-15 captures them adequately (Hassan et al., 2015). It may be that other symptoms or other linguistic subtleties that resonate better with these idioms may constitute a more accurate latent somatization factor. Future studies should include additional items that are specifically tailored to Syrian refugees' socio-cultural background. Qualitative studies and item development involving people with lived experience are needed. Also, Modern Standard Arabic was used. Although the translations were double-checked by a native speaker, several parallel forward translations and back-translations by independent bilingual and bicultural translators would be the most accurate way of translating scales in cross-cultural research (Cha et al., 2007). In addition, testing other regional Arabic dialects may unravel nuanced language differences. Understanding such subtleties may prove relevant for practitioners' readiness to work with refugees. The present studies used convenience samples; hence, cross-validation in different samples is warranted. Gender distribution was unequal between refugees and residents, and we lacked the power to test invariance across gender in addition to cultural background. Future studies should empirically test whether gender contributes to measurement variations in somatization. For instance, pain during intercourse had relatively low factor loadings compared with other symptoms, which could be attributable to gender-specific factors. Although the PHQ-15 was invariant across gender in a large sample of outpatients in Spain (Cano-García et al., 2020), this needs to be independently tested with Syrian refugees.
This is because Syrian refugees may have different gender norms than other populations, which may influence the articulation and expression of somatic symptoms (Hassan et al., 2015). Although the sample distribution of predominantly young men mirrors the sociodemographic composition of Syrian refugees in Germany (Juran & Broer, 2017), it needs to be established whether somatic symptoms can be unbiasedly assessed across gender.
Both samples could also have been combined to have more statistical power. However, there was a period of a few months between the two data collections and researchers have highlighted the importance of time-varying differences depending on the migration stage (Wu et al., 2021), based on which we decided to test the samples independently. Moreover, we needed to keep the samples separate for the external validation analysis and wanted to ensure that the tested models reflect the data adequately in these specific samples. Also, the first refugee subsample was somewhat younger than the second subsample, and testing the factor solutions with smaller samples mirrors sample sizes of common convenience samples in refugee research.
For our model evaluation, we used cut-off values that have been commonly applied to determine the fit of CFAs (e.g., Browne & Cudeck, 1992; Hu & Bentler, 1999). However, these cut-offs are based on maximum likelihood estimation and may not detect model misfit when using the WLSMV estimator for categorical data (e.g., Xia & Yang, 2019). Our model fit conclusions and derived recommendations thus need to be interpreted carefully when using the PHQ-15 in Syrian refugees, by considering different aspects of applicability and model fit. Establishing MI with categorical data also has limitations. Using χ² test statistics may lead to inflated Type I error rates, especially with large sample sizes, thus falsely indicating poor model fit (Sass et al., 2014). The validity of cut-off values for fit indices, however, has not been conclusively examined. This again points to the need for cross-validation.
Demonstrating (partial) strict invariance provides evidence for the internal validity of the PHQ-15. Yet, we did not have a broad range of external measures regarding mental health outcomes to demonstrate external validity. Having a broader nomological network could reveal whether the PHQ-15 captures the entire breadth and complexity of the somatization construct. In this regard, it is also noteworthy that functional impairment was assessed with only one item. Area under the curve studies are warranted to demonstrate how well the PHQ-15 discriminates when evaluated against diagnoses established through clinical interviews. In addition, there was no attention check to ensure attentive participation in Study 1.

Conclusion
The current research sheds light on the applicability of the PHQ-15 in Syrian refugee populations who fled to Germany. It adds evidence regarding the good psychometric functioning of the scale's different factor solutions and the potential use of sum scores. However, for the most accurate assessment of somatic symptoms in Syrian refugees, using factor scores is advisable because sum score models had the worst overall model fit. In sum, our data support the validity of the PHQ-15 in assessing somatic symptoms in Syrian refugees with a history of potentially traumatic events and with different cultural backgrounds than most of the previously studied populations.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by SAFIR Münster