Semantic Overlaps Between Chinese Two-Character Words and Constituent Characters: A Normative Study

In written Chinese, the graphic units are Chinese characters (CCs). Most commonly used characters join with others to form two-character words (2C-words) or words of more than two characters. Indeed, over 70% of the commonly used words are 2C-words. Since almost all characters are meaningful in their own right, there are semantic overlaps between 2C-words and their constituent characters. The present study investigated how the normative semantic overlap between 2C-words and their constituent characters (SWC) might be influenced by whether the constituent characters are word or word-not characters (Wording) and by whether they are left or right characters (Positioning), and how it might be predicted by ordinary features of the constituent characters. The results confirmed earlier findings that word-not characters are more strongly associated than word characters with 2C-words, and that right characters are more strongly associated semantically than left characters with 2C-words. The present study is also the first to provide evidence concerning the prediction of SWC by the norm features of the constituent characters. Skilled readers’ perception of the semantic features, frequency features and number-of-word features of the constituent characters may be mediated by Wording and Positioning in 2C-word semantic processing. However, they are not likely to perceive the visual features of the constituent characters. Rather, they seem to take the constituent characters, drawn from the meaningful most commonly used CCs (m-CCs), as individual units, which should be highly familiar to them. In a semantic task on 2C-words, skilled readers may process the constituent characters in terms of number of meanings, concreteness, imageability and emotion arousal, but not in terms of sensory experience arousal. There appears to be a close association in valence between the 2C-words and the word characters, but not between the 2C-words and the word-not characters.
These findings strongly support the theoretical argument that both words and characters should be taken as language units in Chinese. However, Wording and Positioning should be considered carefully when considering a CC as a language unit. These findings may be of more general significance for semantic understanding of compound words.


Introduction
In written Chinese, the graphic units are Chinese characters (CCs). Most of the commonly used characters join with others to form two-character words (2C-words) or words of more than two characters (mC-words). Since almost all characters are meaningful in their own right, there are semantic overlaps between 2C-words and their constituent characters (Tse & Yap, 2018; X. Zhou & Marslen-Wilson, 2000), much like compound words in English. For example, "basketball" is composed of "basket" and "ball," meaning a game in which a ball is thrown through a basket; similarly, "红旗" (red flag) is the semantic combination of the constituent characters "红" (red) and "旗" (flag). In an English two-morpheme compound, the morpheme on the right side is more likely to be the semantic head than the morpheme on the left side (Dronjic, 2011; Juhasz et al., 2015). This is also the case in many 2C-words, where the constituent characters on the right are more likely to be the semantic heads than those on the left (T. Q. Xu, 2010).
Gagné et al. (2019) conducted a normative study on the semantic associations between over 8,000 English compounds and their constituent morphemes. They asked participants to rate each compound in terms of the degree to which its meaning can be predicted from its constituents. They also obtained linguistic characteristics that might influence compound processing and carried out a series of data analyses. The compounds whose meanings were highly predictable from their constituents were likely to be processed faster than those whose meanings were not easily predictable from their constituents. In other words, a normative semantic association between compounds and their constituents is strongly predictive of compound processing. Similarly, in Chinese, the normative semantic overlap of 2C-words and their constituent characters (SWC) should also be valuable in 2C-word processing. Several normative studies have been conducted on the ordinary features of CCs (e.g., Z. G. Cai et al., 2022; Y. Liu et al., 2007; Sze et al., 2014; R. Wang et al., 2020). It is possible that the strength of SWC might be a mathematical function of norm features of the constituent characters. The purpose of the present study is to test this speculation via a normative study.

Research Questions
Wording and Positioning. In English, compound words account for a small portion of the vocabulary (Gao et al., 2022; P. D. Liu & McBride-Chang, 2010), but in Chinese, 2C- and mC-words make up 72% and 22% of the commonly used words (State Language Affairs Commission, 2008), respectively, probably because of the particular language units of CCs. A CC corresponds to a syllable in spoken Chinese. In their first years of school, children learn that 2C- and mC-words are composed of single characters (Cheng et al., 2018; McBride, 2016). To meet the minimum literacy requirements set by education policy, they are required to master the 2,500 most commonly used CCs (State Language Affairs Commission, 1988), each of which is estimated to be used as a constituent character in about 20 commonly used 2C-words, on average (K. Zhang, 1997).
CCs may fall into three categories, according to whether they function only as one-character words, only as constituent characters, or both as one-character words and as constituent characters. Those that can be used as constituent characters but not as one-character words are called "word-not characters" in the present study; those that can be used both as one-character words and as constituent characters are called "word characters." Word characters have more meanings and can join with other CCs to form more 2C-words than word-not characters (Ge, 2018; J. Zhou, 2019). Word characters might be different from word-not characters in how they semantically overlap with the corresponding 2C-words.
The constituent characters of 2C- and mC-words have fixed relative positions. Those on the left and right side of 2C-words are referred to as left and right characters, respectively. Left characters are potentially different from right characters in their semantic and syntactic contributions (Pan, 2002; T. Q. Xu, 2010). In the present study, Research Question One explores how SWC might be influenced by whether the constituent characters are word or word-not characters (Wording) and by whether they are left or right characters (Positioning).
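The three-way categorization above can be sketched as a small decision rule. This is only an illustration: the class name `CharType`, the function `classify` and its two Boolean flags are hypothetical, and in the study itself the word versus word-not status of each character is taken from a published source rather than computed.

```python
from enum import Enum


class CharType(Enum):
    """Hypothetical labels for the three categories of CCs."""
    WORD_ONLY = "functions only as a one-character word"
    WORD_NOT = "word-not character (constituent only)"
    WORD = "word character (one-character word and constituent)"


def classify(is_one_char_word: bool, is_constituent: bool) -> CharType:
    """Map the two usage flags onto the three categories."""
    if is_one_char_word and is_constituent:
        return CharType.WORD
    if is_constituent:
        return CharType.WORD_NOT
    if is_one_char_word:
        return CharType.WORD_ONLY
    raise ValueError("a CC must be a word, a constituent, or both")


# A character used inside 2C-words but not on its own is a word-not character.
print(classify(is_one_char_word=False, is_constituent=True).name)
```

Only the last two categories matter for the analyses, since a constituent character of a 2C-word is by definition either a word character or a word-not character.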
Ordinary Features. CCs have many ordinary features, including number of strokes, number of components, frequency, familiarity, age of acquisition (AOA), number of meanings, semantic transparency, concreteness, imageability, valence, emotion arousal, sensory experience arousal and number of 2C-words in which they are used as left or right characters. The more strokes a CC has, the more visually complex it is. CCs may also be divided into single characters and compound characters; over 95% of the commonly used CCs are compound characters that consist of more than one component. For example, the single character "木" and the compound character "村" are composed of 4 and 7 strokes, respectively; "村" and "树" are thought of as consisting of two ("木" and "寸") and three components ("木," "又," and "寸"), respectively. Most of the limited number of components originate from single characters. Frequency indicates how often CCs are used in everyday language activities. Familiarity indicates the extent to which the reader is familiar with a CC (Juhasz et al., 2015). The AOA for a specific CC refers to the year in which the reader first learns it (Y. Liu et al., 2007). According to the Dictionary of Chinese Characters Information (Science Publishers, 1988), about 53%, 21%, and 19% of CCs have one, two and at least three meanings, respectively. Semantic transparency suggests the degree to which a compound character is semantically associated with its components (Tse & Yap, 2018). Imageability refers to the degree to which a CC arouses the readers' sensory-experience-based (Song & Li, 2021) or emotional-experience-based mental images (R. Wang et al., 2020). There is a high degree of diversity in how concrete the meanings are among the meaningful CCs. Valence indicates the degree to which a CC is positive or negative in meaning (Yee, 2017). Sensory experience arousal and emotion arousal refer to the extent to which a CC arouses the reader's sensory experience (Yin & Ye, 2013) and emotional experience (Newcombe et al., 2012), respectively. Number of words (L) and number of words (R) suggest the number of 2C-words in which a CC is used as a left and right character, respectively.
Several studies have explored the norm features of CCs (e.g., Z. G. Cai et al., 2022; Y. Liu et al., 2007; Sze et al., 2014; R. Wang et al., 2020). For example, the processing efficiency of CCs in lexical decisions can be significantly predicted by features such as number of strokes, frequency, AOA and number of meanings (Sze et al., 2014). The naming time for a CC can be significantly predicted by AOA, frequency, familiarity, concreteness, number of strokes, number of words, imageability and number of components (Y. Liu et al., 2007), or by frequency, familiarity, number of strokes and number of words (Chang et al., 2016). However, few studies have investigated how SWC is predicted by norm features of the constituent characters. In the present study, Research Question Two explores the prediction of SWC by norm features of the constituent characters at each treatment level of Wording by Positioning.

Significance
The present study uses the 24,473 commonly used 2C-words (State Language Affairs Commission, 2008) as sample words, the constituent characters of which are included in the 2,478 meaningful most commonly used CCs (m-CCs; 1,936 word characters and 542 word-not characters). Whether an m-CC is a word or word-not character is determined according to C. Wang (2017). Considering the significant status of CCs in Chinese literacy education, the findings of the present study should be valuable.
It is generally accepted that the constituent characters are processed in the early stage of 2C-word recognition (Miwa et al., 2014; Taft, 2003; Tsang & Chen, 2010, 2013a, 2013b). The processing of a 2C-word is subject to the influences of frequency, number of strokes, number of meanings and number of word formations of its constituent characters (Huang et al., 2006, 2011; Miwa et al., 2014; Peng et al., 1999; Sun et al., 2018; Tsang et al., 2018; Tse et al., 2022; Tse & Yap, 2018; B. Zhang & Peng, 1992). However, few studies have investigated the relative importance of character features in predicting semantic activation of the corresponding 2C-words. The present study aims to fill this gap from a normative perspective.
In addition to providing a rich reference resource for experimental studies on Chinese mental lexicons, the study is also significant for Chinese vocabulary teaching. More generally, given the similarity between English compounds and many 2C-words in syntactic structures, the findings may also enhance understanding of the associations between compound words and their constituent morphemes.

Methods
This section consists of two sub-sections: obtaining ordinary norm feature scores of the m-CCs and evaluating SWC scores for the sample words. The study was approved by the Ethics Committee of Qufu Normal University.

Norm Feature Scores
Objective Feature Scores. The scores in frequency, number of strokes and number of meanings of the m-CCs were obtained with reference to SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles (Q. Cai & Brysbaert, 2010), Xinhua Dictionary (Institute of Linguistics, Chinese Academy of Social Sciences, 2011) and the online version of Handian (https://www.zdic.net/), respectively. The scores of number of words (L) and number of words (R), in which each m-CC was used as the left or right character, respectively, were obtained with reference to Vocabulary Expert (S. Wu, 2000).
Subjective Feature Scores. Subjective feature scores were collected through a series of questionnaire surveys administered to a sample of skilled readers of Chinese.

SWC Scores
Participants. Participants were 9,024 college students (4,763 males, mean age = 19.54 years, SD = 1.46 years) from Qufu Normal University. They were Chinese native speakers and had not participated in the questionnaires on the m-CCs.
Materials. Each sample word was paired with the corresponding left and right characters. A list of 48,946 pairs of sample words and constituent characters was created and randomly divided into 411 groups. Each group contained 119 or 120 word-character pairs, which were printed on a single sheet of paper. To assess the strength of SWC for the sample words, 411 seven-point scaled questionnaires (0 = the constituent character has nothing to do with its meaning in isolation; 6 = the constituent character means exactly the same as it does when in isolation) were designed.
Procedure. The 411 questionnaires were copied 22 times and delivered to the participants. Each participant responded independently to a one-sheet questionnaire.
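The construction of the questionnaire sheets can be illustrated with a short sketch. It assumes a plain shuffle-and-deal scheme, which the text does not specify; the function names `make_pairs` and `split_into_groups`, the `seed` parameter and the integer placeholders standing in for the real word-character pairs are all hypothetical.

```python
import random


def make_pairs(words):
    """Pair each two-character word with its left (L) and right (R) character."""
    pairs = []
    for word in words:
        pairs.append((word, word[0], "L"))
        pairs.append((word, word[1], "R"))
    return pairs


def split_into_groups(items, n_groups, seed=0):
    """Shuffle the items and deal them into n_groups of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    base, extra = divmod(len(shuffled), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)  # first `extra` groups get one more
        groups.append(shuffled[start:start + size])
        start += size
    return groups


# 24,473 words give 48,946 pairs; integers stand in for the real pairs here.
groups = split_into_groups(range(24473 * 2), 411)
sizes = sorted({len(g) for g in groups})
print(make_pairs(["红旗"]))
print(sizes)
```

Dealing 48,946 pairs into 411 groups necessarily yields sheets of 119 or 120 pairs, which agrees with the group sizes reported in Materials.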

Results
A small proportion of the returned questionnaires were invalid and discarded for the subjective norm features (3.5%) and for the SWC scores (1.1%) (X. Xu & Li, 2020), leaving at least 18 valid respondents for each questionnaire item. Table 1 displays the descriptive results. The reliability coefficients were relatively low for familiarity and semantic transparency scores, probably because of participants' high familiarity with the m-CCs. The corresponding data were not included in the follow-up analyses.
To answer Research Questions One and Two, a linear mixed model analysis and four multi-linear regression analyses were conducted, respectively.

Linear Mixed Model Analysis
A linear mixed model analysis was conducted on the SWC scores using lme4 (Bates et al., 2011) in R (R Development Core Team, 2012) with the m-CCs as the random factor and Wording and Positioning as the fixed factors. The results showed that the main effects were significant for Wording and Positioning. The SWC scores were significantly greater for the word-not characters (M = 4.188, SD = 0.517) than for the word characters (M = 4.099, SD = 0.513) (b = .090, SE = 0.019, t = 4.803, p < .0001), and were significantly larger for the right characters (M = 4.190, SD = 0.527) than for the left characters (M = 4.100, SD = 0.500) (b = .094, SE = 0.015, t = 6.157, p < .0001). The interaction between Wording and Positioning was not significant (b = -.005, SE = 0.037, t = -0.143, p = .887).
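As a reading aid, the reported main-effect statistics can be checked for internal consistency. The sketch below assumes that each t value is the estimate divided by its standard error and that b and SE were rounded to three decimals before reporting; both are assumptions, and the helper name `t_range` is hypothetical.

```python
def t_range(b, se, decimals=3):
    """Range of t = b / SE compatible with b and SE rounded to `decimals` places.

    Intended for positive estimates only.
    """
    half = 0.5 * 10 ** -decimals          # maximum rounding error
    return (b - half) / (se + half), (b + half) / (se - half)


# (estimate, standard error, reported t) for the two main effects
reported = {
    "Wording": (0.090, 0.019, 4.803),
    "Positioning": (0.094, 0.015, 6.157),
}

for name, (b, se, t) in reported.items():
    lo, hi = t_range(b, se)
    print(f"{name}: t between {lo:.2f} and {hi:.2f}; "
          f"reported {t}; consistent: {lo <= t <= hi}")
```

Under these assumptions both reported t values fall inside the range implied by the rounded estimates and standard errors.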

Multi-Linear Regression Analyses
Four multi-linear regression analyses were conducted to estimate prediction of the SWC scores by the norm features of the left-word, right-word, left-word-not and right-word-not characters. As displayed in Table 2, the scores of number of meanings, concreteness, imageability, valence, emotion arousal and frequency of the left-word characters were significant in predicting changes in SWC scores (F(12, 1,836), p < .0001, adjusted R² = .23). The scores of number of meanings, concreteness, imageability, emotion arousal, valence, frequency and number of words (L) of the right-word characters were significant in predicting changes in SWC scores (F(12, 1,733), p < .0001, adjusted R² = .25). The scores of number of meanings, concreteness, imageability, emotion arousal and AOA of the left-word-not characters were significant in predicting changes in SWC scores (F(12, 475) = 10.06, p < .0001, adjusted R² = .19). The scores of number of meanings, concreteness, imageability, emotion arousal, frequency and number of words (R) of the right-word-not characters were significant in predicting changes in SWC scores (F(12, 474) = 12.61, p < .0001, adjusted R² = .23).
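The degrees of freedom reported above imply how many characters entered each regression. The sketch below assumes an ordinary least squares model with an intercept and 12 predictors, so that n = model df + residual df + 1; the function names are hypothetical, and the adjusted R² helper states only the textbook definition, not necessarily the computation used for the reported values.

```python
def n_from_df(df_model, df_resid, intercept=True):
    """Number of observations implied by the F-test degrees of freedom."""
    return df_model + df_resid + (1 if intercept else 0)


def adjusted_r2(r2, n, k):
    """Textbook adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# (model df, residual df) as reported for the four regressions
models = {
    "left-word": (12, 1836),
    "right-word": (12, 1733),
    "left-word-not": (12, 475),
    "right-word-not": (12, 474),
}

sizes = {name: n_from_df(*df) for name, df in models.items()}
for name, n in sizes.items():
    print(f"{name}: n = {n}")
```

Under this assumption the four models rest on 1,849, 1,746, 488 and 487 characters, which stays within the totals of 1,936 word characters and 542 word-not characters.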

Discussion
As expected, the results suggested clear answers to the research questions. SWC scores were significantly affected both by whether the constituent characters were word or word-not characters (Wording) and by whether they were left or right characters (Positioning). The prediction of SWC by the character norm features was mediated by Wording and Positioning, which is particularly important for the understanding of the relationship between 2C-words and their constituent characters.

Influences of Wording and Positioning on SWC
The results of the mixed model analysis suggested that the SWC was stronger for the word-not than for the word characters and was stronger for the right than for the left characters. These findings are consistent with the evidence that word-not characters are more strongly associated than word characters with 2C-words (Gao et al., 2022; Shimomura, 1999; M. Wu, 2008), and that right characters are more strongly associated semantically than left characters with 2C-words (Fu, 2003; Li, 2019; Z. W. Lu et al., 1957; Yuan & Huang, 1998; J. Zhou, 2006). A word-not character is semantically constrained in the context of everyday language (M. Wu, 2008). For instance, the dictionary definition of the word-not character "身" is body, but it only appears in words such as "出身" (family background) and "终身" (lifelong). Since it is not a word in its own right, a reader may only be able to infer its meaning from how it is used within words (N. Wang, 1999). A greater priming effect was observed for the word-not character than for the word character primers in a lexical decision task (Gao et al., 2022; Shimomura, 1999), which can be interpreted with the interactive-activation model (Taft, 1994, 2003). Nodes were activated only at the character level for the word-not character primers. For the word character primers, however, nodes were activated both at the character level and at the word level. The word-level activation might have interfered with recognition of the 2C-word targets and thereby inhibited the priming effect of the word character primers. All of this evidence appears to suggest a closer semantic association between word-not characters and 2C-words than between word characters and 2C-words.
Most 2C-words can be divided into five categories according to the syntactic relations between their constituent characters (Ge, 2018). For example, the constituent characters in "飞机" (plane), "窗户" (window), "地震" (earthquake), "开车" (to drive a car) and "提高" (to improve) form the modification, coordination, predication, predicate-object and verb-complement structures, respectively. The largest category is 2C-words with modification structure, in which the left and right characters are modifier and semantic head, respectively (Fu, 2003; Li, 2019; Z. W. Lu et al., 1957; Yuan & Huang, 1998; J. Zhou, 2006). The right characters are endocentric and are more closely associated than the left characters with the 2C-words in semantics (Yan, 2007). This is consistent with the present result that the right characters were more closely associated than the left characters with the corresponding 2C-words in semantics.

Prediction of SWC by Character Norm Features
Summary of Results. The 12 norm-feature predictors listed in Table 2 could be grouped into four categories (Song & Li, 2021): semantic features (number of meanings, concreteness, imageability, emotion arousal, valence and sensory experience arousal), frequency features (frequency and AOA), visual features (number of strokes and number of components) and number-of-word features (number of words (L) and number of words (R)). As summarized in Table 3, the prediction of SWC by the constituent character norm features was mediated by Wording and Positioning. First, four semantic features (number of meanings, concreteness, imageability and emotion arousal) of the constituent characters significantly predicted the strength of SWC. The higher the valence scores of the word characters, the smaller the SWC scores. However, the SWC scores were not significantly predicted by the valence scores of the word-not characters. The scores for sensory experience arousal of the constituent characters did not significantly predict changes in SWC scores.
Second, SWC scores became significantly smaller as the frequency scores of the left- and right-word characters increased. An increase in the frequency scores of the right-word-not characters resulted in a significant decrease in the SWC scores, but this change was not seen with respect to the left-word-not characters. An increase in AOA of the left-word-not characters led to a significant increase in SWC scores, but that of the right-word-not characters did not. The AOA scores of the left- or the right-word characters did not predict the SWC scores.
Third, SWC scores were significantly predicted by the scores of number of words (L) of the right-word characters and number of words (R) of the right-word-not characters. The number-of-word scores of the constituent characters did not otherwise predict SWC strength.
Fourth, the scores for number of strokes and number of components of the constituent characters did not significantly predict SWC scores.
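The pattern summarized above can be laid out as a small lookup structure. The sketch transcribes the four predictor categories and the significant predictors of each regression from the text; the variable names are hypothetical, and the sets record only significance, not the sign or size of any coefficient.

```python
# Four categories of the 12 norm-feature predictors
FEATURE_GROUPS = {
    "semantic": ["number of meanings", "concreteness", "imageability",
                 "emotion arousal", "valence", "sensory experience arousal"],
    "frequency": ["frequency", "AOA"],
    "visual": ["number of strokes", "number of components"],
    "number-of-word": ["number of words (L)", "number of words (R)"],
}

# Significant predictors at each level of Wording by Positioning
core = {"number of meanings", "concreteness", "imageability", "emotion arousal"}
SIGNIFICANT = {
    "left-word": core | {"valence", "frequency"},
    "right-word": core | {"valence", "frequency", "number of words (L)"},
    "left-word-not": core | {"AOA"},
    "right-word-not": core | {"frequency", "number of words (R)"},
}

all_features = {f for group in FEATURE_GROUPS.values() for f in group}
in_every_model = set.intersection(*SIGNIFICANT.values())
in_no_model = all_features - set.union(*SIGNIFICANT.values())

print(sorted(in_every_model))
print(sorted(in_no_model))
```

Read this way, four semantic features predict SWC at every treatment level, whereas sensory experience arousal and the two visual features predict it at none.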
The essentially semantic relationship between 2C-words and their constituent characters (Dronjic, 2011; Tse & Yap, 2018; X. Zhou & Marslen-Wilson, 2000) seems to be strongly confirmed by the finding that SWC was significantly predicted by number of meanings, concreteness, imageability and emotion arousal of the constituent characters. Valence and emotion arousal are closely associated as two critical dimensions of emotional semantics (Yao et al., 2017; Yee, 2017). Unlike emotion arousal, however, valence contributed significantly to the prediction of SWC only for the word characters. This may imply that valence is perceivable for word characters but not for word-not characters. The scores of constituent-character sensory experience arousal did not predict changes in SWC, suggesting that this feature is less perceivable than the other five semantic features of the constituent characters in 2C-word processing.
2C-word recognition is influenced by changes in frequency of constituent characters (Peng et al., 1999; Sun et al., 2018; Tse et al., 2022; Tse & Yap, 2018; B. Zhang & Peng, 1992). However, the present findings suggest that this influence may not generalize to the left-word-not characters, whose frequency scores did not predict SWC. The influence of constituent-character AOA on 2C-word recognition might also be limited, since SWC was not significantly predicted by the AOA scores of the left-word, right-word or right-word-not characters.
The finding that visual features did not significantly predict changes in SWC has two implications. First, participants should have been highly familiar with the m-CCs and therefore there may have been a ceiling effect of the influence of visual features on their perception of the constituent characters. Second, participants might also have tended to ignore their perception of the visual features of the constituent characters in the task of evaluating the strength of SWC. In other words, skilled readers may ignore the visual complexity of the constituent characters when semantically processing 2C-words. Consistent with the mixed model analysis result that the SWC was stronger for the right than for the left characters, number of words (L) of the right-word characters and number of words (R) of the right-word-not characters significantly predicted changes in SWC scores.

Conclusion
The present study has confirmed earlier work that word-not characters are more strongly associated than word characters with 2C-words (Gao et al., 2022; Shimomura, 1999; M. Wu, 2008), and that right characters are more strongly associated semantically than left characters with the 2C-words (Fu, 2003; Li, 2019; Z. W. Lu et al., 1957; Yuan & Huang, 1998; J. Zhou, 2006). The present study is also the first to provide evidence concerning the prediction of SWC by the norm features of the constituent characters. First, skilled readers' perception of the semantic features, frequency features and number-of-word features of the constituent characters may be mediated by Wording and Positioning in 2C-word semantic processing. However, they are not likely to perceive the visual features of the constituent characters. Rather, they seem to take the constituent characters of m-CCs as individual units, which should be highly familiar to them. Second, in a semantic task on 2C-words, skilled readers may process the constituent characters in terms of number of meanings, concreteness, imageability and emotion arousal, but not in terms of sensory experience arousal. There appears to be a close association in valence between the 2C-words and the word characters, but not between the 2C-words and the word-not characters. These findings strongly support the theoretical argument that both words and characters should be taken as language units in Chinese (Chen, 2014; Dong, 2004; J. Zhou, 2010, 2019). However, Wording and Positioning should be considered carefully when considering a CC as a language unit. These findings may be of more general significance for semantic understanding of compound words.
The study has some limitations. Familiarity and semantic transparency scores did not seem to be valid for the constituent characters of the sample words, which is likely because participants were very familiar with the m-CCs. Otherwise, the present study would likely have achieved a deeper understanding of the semantic overlap between 2C-words and the constituent characters. Participants were college students, and therefore not representative of how other populations might perceive relationships between 2C-words and their constituent characters. For example, children often confuse the positioning of the constituent characters, and further work from a developmental perspective may be warranted.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by research Grants awarded to Li Degao by the National Social Science Fund of China under Grant 21AZD139.
Participants. Participants were 4,158 college students (1,643 males, mean age = 19.21 years, SD = 1.53 years) from Qufu Normal University. They were Chinese native speakers and were blind to the purpose of the study.
Materials. The 2,478 m-CCs were randomly divided into 21 groups, with each group listed on a single sheet of paper. A seven-point scale (with instructions printed at the top) was used to obtain scores for familiarity, AOA, semantic transparency, concreteness, imageability, sensory experience arousal, emotion arousal and valence; a five-point scale was used for number of components. Considering the familiarity questionnaire as an example, the main instructions were: "There are seven numbers ([1][2][3][4][5][6][7]) printed on the right side of each CC on the list. If you are very familiar with the CC, put a tick (√) next to the largest number [7]; if you are very unfamiliar with the CC, put a tick (√) next to the smallest number [1]. The more familiar you are with the CC, the larger the number you tick (√)."
Procedure. There were 189 questionnaires, each copied 22 times. The 189 × 22 questionnaire sheets were randomly delivered to participants. Each participant responded to a one-sheet questionnaire independently.
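The counts in this design fit together, as the following arithmetic sketch shows. It assumes that one questionnaire was prepared per feature per character group, which is an inference from the reported total of 189 rather than something the text states; the variable names are hypothetical.

```python
SEVEN_POINT = ["familiarity", "AOA", "semantic transparency", "concreteness",
               "imageability", "sensory experience arousal",
               "emotion arousal", "valence"]
FIVE_POINT = ["number of components"]

n_chars = 2478      # m-CCs
n_groups = 21       # character groups, one sheet each
n_copies = 22       # copies of every questionnaire

n_features = len(SEVEN_POINT) + len(FIVE_POINT)
chars_per_sheet, leftover = divmod(n_chars, n_groups)
n_questionnaires = n_features * n_groups     # assumed: one per feature per group
n_sheets = n_questionnaires * n_copies

print(n_features, chars_per_sheet, leftover, n_questionnaires, n_sheets)
```

On this reading each sheet lists 118 characters, nine features over 21 groups give the 189 questionnaires, and 22 copies of each give 4,158 sheets, one per participant.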

Table 1. Norm Feature Scores and SWC Scores.

Table 2. Coefficients of Constituent Character Norm Features in Predicting SWC Scores.

Table 3. Norm Features of the Constituent Characters Significantly Predicting SWC at Each Treatment Level of Wording by Positioning.