High Frequency Words Produced by Typically Developing Mandarin-Speaking Children Between 3 and 6 Years of Age

The purpose of this study was to provide high frequency word lists for Mandarin-speaking children between 3 and 6 years of age and to explore the differences in each part of speech (POS) category among age groups. Participants were 209 typically developing native Mandarin speakers aged between 3 and 6 years, born in Taiwan, and recruited from Mandarin-language preschools in Taipei, New Taipei City, and Miaoli. Language samples were collected through conversations, free play, and story retelling. The researchers then transcribed the samples, segmented utterances and words, and tagged the POS corresponding to each word. The frequencies of word occurrences were then analyzed and ranked to generate a high frequency word list. The mean frequency of each POS category was calculated to identify significant differences between age groups. The results comprise high frequency word lists with the corresponding POS tags. Significant differences among age groups were found in 10 of the 11 POS categories. The results of this study present preliminary information concerning high frequency words produced by Mandarin-speaking children aged between 3 and 6 years and the development of their use of each POS category.


Introduction
Background
Vocabulary knowledge and competency play an important role in children's language development and their later academic and reading success (Loftus et al., 2010; Stahl & Nagy, 2006). High frequency words have been considered important to help students become efficient readers (Johns & Wilke, 2018). Also, word lists of high frequency words are important in academic settings. Gardner and Davies (2013) developed a corpus of high frequency words appearing in English academic texts which is important for academic success (Biemiller, 2010; Townsend et al., 2012). Word lists like this are useful in developing vocabulary goals, evaluating vocabulary knowledge, analyzing text difficulty, generating reading materials, and designing word learning tools (Nation & Webb, 2011). For children with language disorders, word lists can be applied when assessing their vocabulary and selecting target words for intervention (H. M. Liu & Lin, 2017).

Lexical Development Before School Age
Children acquire new words rapidly during the preschool stage. They acquire around five novel words each day when they are 1.5 to 6 years old (Carey, 1978). As a result, a lexicon of roughly 10,000 words has developed by the time a child turns 6 years old (Anglin, 1993). Nelson (1973) stated that children acquire at most nine novel words each day. With such a rapid rate of learning, 6-year-old children have an expressive vocabulary of approximately 2,600 words and a receptive vocabulary of 20,000 to 24,000 words (Owens, 1996). These studies support the assertion that children's vocabulary grows rapidly every year during the preschool stage.
To better understand children's lexical development, researchers have collected words produced by young children for word lists. Smith (1926) generated a word list for 88 children aged from 2 to 5 years based on 1-hr language samples of each child during their play with other children. Words which occurred more than 100 times across the collected conversation samples were included in the list. Beukelman et al. (1989) collected six typically developing preschoolers' communication samples from three different classrooms. The age range of children was from 3 years, 8 months to 4 years, 9 months. The results showed a list of 250 words with a frequency of at least five per 1,000 words. The researchers also identified the 25 most frequently occurring words, suggesting that the word list was helpful for Augmentative and Alternative Communication (AAC) programming for preschoolers. Dempsey (1956) compiled two word lists to demonstrate words produced by kindergarteners, first graders, second graders, and third graders. These word lists provided detailed information about what words children produced and how frequently they produced them. Such information was helpful for selecting vocabulary targets when teaching children words to increase their vocabulary size.

Development of High Frequency Word Lists
Practical applications of word lists are shown in both first and second language acquisition, where they serve as important references for planning vocabulary instruction. Nation (2001) suggested that high frequency words account for a large proportion of the running words in spoken and written texts, and therefore, these words are particularly significant. Masrai (2019) reported that high frequency words were strongly related to the reading comprehension ability of second language learners. It was suggested that high frequency words are so important that teachers and learners should spend more time on these words than on low frequency words. Learners should master high frequency words before they move on to study words of lower frequencies (Dang et al., 2020; Nation, 2001).
To aid teachers in recognizing high frequency words, previous researchers have developed several word lists by ranking word frequency according to different corpora. West (1953) published A General Service List (GSL) including 2,000 high frequency words in English written texts. It was claimed that knowing these words gives access to about 80% of the words in written texts. Kilgarriff (1997) developed the British National Corpus (BNC) word list, and Davies and Gardner (2013) developed the Corpus of Contemporary American English (COCA) word list. These two word lists were generated by ranking the frequency of words in spoken and written language corpora.
Words collected from assessment tools have helped develop high frequency word lists produced by children. The MacArthur Communicative Development Inventory (CDI), which was developed to assess early vocabulary, is such an example. The assessment tool consists of a checklist of words which a child has produced at home or any other setting. The word list for infants comprises 396 words in 19 semantic categories. The word list for toddlers comprises 680 words in 22 categories. Furthermore, H. M. Liu and Tsao (2013) have adapted the CDIs and developed a Mandarin-Chinese version of the MacArthur Communicative Development Inventories (MCDI-T), which contained word lists for assessing Mandarin-speaking children in Taiwan. With such word lists, Liu and Chen (2015) further analyzed semantic contents of high frequency words in 1,897 young children and generated comprehensive lists of words and their semantic categories. However, the age range for using MCDI-T is 8 to 36 months old. Information about high frequency words used by Mandarin-speaking children aged 3 through 6 years remains unknown.

Considerations of Part of Speech (POS)
After children start producing their first words, the POS of these words attracts closer examination. The contents of children's vocabulary expand during the preschool years (Clark & Sengul, 1978; Cox & Richardson, 1985). Thus, word categorization is utilized when exploring what kinds of words are produced by preschoolers. Analyzing the distinct POS is a popular method when constructing linguistic theories and has been widely used to categorize words in linguistic- and language-related research (Bloom et al., 1993). Studies regarding corpora and word lists often tag words according to their POS, which helps researchers better understand the distributions of these categories (C. R. Huang et al., 2017; Lee & Wong, 1998).
Further examination of the development of POS is particularly critical in Mandarin, which is a minimally inflected language. After their first words, children who learn languages such as English and Spanish spend much time deciphering rules of morphology, whereas Mandarin-speaking children rarely have to do the same. Rather, their further development is more likely to be observed in lexical categories, also known as POS, than in morphology, thus showing language-specific patterns. Tardif (1996) found such a distinction: Mandarin-speaking 22-month-old children produce more verbs than nouns, thereby contradicting the noun bias in early word learning reported for languages such as English. C.-T. J. Huang et al. (2009) pointed out that POS is essential for forming sentences in Mandarin. Thus, it is necessary that we continue investigating Mandarin-speaking children's lexical development and POS beyond age 3 years.
However, to the authors' knowledge, no information concerning the POS percentages produced by children aged 5 and 6 years has been reported in the Mandarin literature. Furthermore, information regarding how different types of words, in terms of POS, contribute to the high frequency word list of preschoolers remains limited. As children's lexicon expands, examining word types gives a clearer picture of developmental growth in each category and shows which words become more important with development.

Research Questions
In this study, the high frequency words produced by Mandarin-speaking children aged 3 to 6 years in their oral language were generated as a word list to show the preschoolers' core vocabulary. High frequency word lists for each age group were also generated. To understand developmental change between ages, the differences of the high frequency words among different age groups were examined. The following research questions were answered:
Research Question 1: What are the high frequency words produced by Mandarin-speaking children aged from 3 to 6 years? What are the percentages of each POS among these words?
Research Question 2: Do children aged 3, 4, 5, and 6 years produce each POS category differently?
Research Question 3: What are the high frequency words produced by Mandarin-speaking children of 3, 4, 5, and 6 years old, respectively? What are the percentages of each POS among these words?

Participants
This study was reviewed and approved by the institutional review board of National Taiwan University, and informed consent forms were signed by the caregivers of all participants. Two hundred nine children aged from 3 to 6 years participated in this study. They were categorized into four groups: 3-year-olds (3y), 4-year-olds (4y), 5-year-olds (5y), and 6-year-olds (6y). Table 1 shows the characteristics of all participants.
Participants were recruited from Miaoli, Taipei, and New Taipei City, Taiwan, by sending brochures with information about the study to preschools. All recruited children speak Mandarin as their native language. None of the children had a diagnosis related to language delay/disorders, intellectual disability, neurological impairment, sensory impairment, psychological disturbance, or autism/pervasive developmental disorder (PDD). All children were given the Revised Evaluation Scale for Preschool Children with Language Disorders (Lin et al., 2008), and all their scores were higher than one standard deviation below the mean. In this study, children aged 3 and 4 years were recruited from Wanhua and Wenshan districts of Taipei, Taiwan. For children aged 4 and 5 years, a questionnaire on the child's background information was filled out by the caregivers. Regarding the main caregiver, 46% of the children were mainly taken care of by their mothers, 40% by both parents, 5% by mothers and grandparents, 4% by fathers, and 2% by parents and grandparents. Regarding their language use, 98% of the reports indicated that children speak Mandarin as their dominant language. Taiwanese was another language used by children. According to the questionnaire responses, one child always spoke Taiwanese, two children often spoke Taiwanese, and four children sometimes spoke Taiwanese. Sixty-nine percent of the children passed the hearing screening and 31% of the caregivers were reported as unavailable. Mothers' educational levels were also reported: 45% of the mothers had a college-level education, 33% had a graduate education, 19% were senior high school graduates, and 3% were junior high school graduates.

Procedure
The members of the research team received 20 hr of training in the collection of language samples, transcribing, word segmenting, and POS classification. The researchers first interacted with children to collect language samples. Language sample collection was conducted in four different contexts: conversation about the child's family, conversation about the child's school, story retelling, and free play. The book "Little Red Riding Hood" was selected for story retelling. This study used the collection procedures and elicitation questions listed in Wu et al. (2019). Language samples were collected from one child at a time. The collected language samples were then transcribed. After transcribing, we followed the rules described in Wu et al. (2019) to segment and select utterances. A total of 100 utterances were selected for each child. To examine transcription reliability, 10 children from each age group were randomly selected, and a second transcriber transcribed their language samples. We used the following formula to obtain the transcription reliability: agreements/(agreements + disagreements), counted in characters. The result was .96 on average across the 40 children. Next, utterances were segmented into words. A Mandarin word might contain multiple characters and the basic semantic unit of Mandarin is the word (Wu et al., 2019). In this study, we used the CKIP Chinese Word Segmentation System (Academia Sinica Taiwan, 2014) to segment words. The CKIP Chinese Word Segmentation System automatically identifies and segments words based on the Academia Sinica dictionary of 80,000 written words (Ma & Chen, 2003).
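The character-level agreement formula above can be sketched in a few lines. This is only an illustration: the paper does not describe how the two transcripts were aligned, so the use of `difflib.SequenceMatcher` and the rule for counting disagreements are assumptions, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def count_agreements(first, second):
    """Return (agreements, disagreements), in characters, for two transcripts.

    Assumption: characters are aligned with difflib's longest-matching-block
    heuristic; the study's own alignment procedure is not specified.
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    agreements = sum(block.size for block in matcher.get_matching_blocks())
    # Assumption: each unmatched character of the longer transcript
    # counts as one disagreement.
    disagreements = max(len(first), len(second)) - agreements
    return agreements, disagreements


def transcription_reliability(agreements, disagreements):
    """agreements / (agreements + disagreements), as in the formula above."""
    return agreements / (agreements + disagreements)


# Toy example (not study data): the transcribers differ on the last character.
agree, disagree = count_agreements("我要去學校玩", "我要去學校完")
print(round(transcription_reliability(agree, disagree), 2))
```

Averaging this ratio over the sampled children would give a single summary figure comparable in form to the .96 reported here.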
According to Ma and Chen (2003), the CKIP system achieves a success rate of 99.77% when segmenting words, not counting mistakes caused by unknown words. The system can automatically extract new words to establish domain vocabularies or to support online word segmentation. It is a Chinese word segmentation system with the ability to recognize new words and add POS tags. The system includes a vocabulary of about 100,000 words and additional data such as POS, word frequency, POS frequency, and double conjunction frequency. Word segmentation is based on this vocabulary, on word formation rules such as those for quantitative words and overlapping words, and on new words identified online, which together resolve word segmentation ambiguity. The system also automatically assigns POS tags to words. The CKIP Chinese Word Segmentation System was used to decrease ambiguity and disagreement in word segmentation. The researchers manually modified the automatic segmentation results following the guidelines in Wu et al. (2019). The reliability of manual word segmentation was 97.6%.
All words were then coded in terms of POS. We adapted the 12 categories from Y. H. Liu et al. (1996) and used 11 of them in this study. The category of onomatopoeia was excluded. Onomatopoeia involves words that imitate sounds, such as "bubu" (imitating car horns) and "huala huala" (imitating raindrops). When children imitated sounds, the phonological forms were sometimes inconsistent and unintelligible. Therefore, we excluded these sounds from the language samples, and the category of onomatopoeia was excluded accordingly. The POS system used by CKIP included 47 categories and was too complicated for coding children's production. We therefore reduced the 47 categories to 11, with each of the 47 categories mapping to one of the 11. To minimize coding errors, the researchers developed simple computer software that automatically converted the tags from the CKIP Chinese Word Segmentation System to the 11 POS categories. The POS categories used in this study and their corresponding tags are as follows: noun (N), verb (V), adjective (Adj), cardinal number (Num), classifier (CL), pronoun (Pron), adverb (Adv), preposition (P), conjunction (C), particle (Part), and interjection (I). The automatically converted tags were manually checked and modified according to Mandarin linguistic rules. A character may have two or more different meanings and POS. We treated the same characters with different POS as different words.
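A tag conversion step of this kind can be sketched as a lookup table. The paper does not publish its 47-to-11 mapping, so the table below is a small hypothetical subset written in CKIP-style tag names, and every entry should be read as an assumption rather than the authors' actual mapping. Tags missing from the table are set aside for manual checking, in the spirit of the manual review described above.

```python
# Hypothetical, partial mapping from fine-grained (CKIP-style) tags to the
# 11 coarse categories. All entries are assumptions for illustration only.
FINE_TO_COARSE = {
    "Na": "N",      # assumed: common noun
    "VC": "V",      # assumed: transitive action verb
    "A": "Adj",     # assumed: adjective
    "Neu": "Num",   # assumed: cardinal number
    "Nf": "CL",     # assumed: classifier
    "Nh": "Pron",   # assumed: pronoun
    "D": "Adv",     # assumed: adverb
    "P": "P",       # assumed: preposition
}


def convert_tags(tagged_words):
    """Map (word, fine_tag) pairs to (word, coarse_tag) pairs.

    Returns (converted, needs_review); pairs whose fine tag is not in the
    table are returned unchanged in needs_review for a human to resolve.
    """
    converted, needs_review = [], []
    for word, fine_tag in tagged_words:
        coarse_tag = FINE_TO_COARSE.get(fine_tag)
        if coarse_tag is None:
            needs_review.append((word, fine_tag))
        else:
            converted.append((word, coarse_tag))
    return converted, needs_review


converted, needs_review = convert_tags([("我", "Nh"), ("吃", "VC"), ("汪", "XYZ")])
print(converted)
print(needs_review)
```

Keeping each word together with its coarse tag as a pair also makes it straightforward to treat the same characters with different POS as different words in later counting.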

Analysis
Word frequency list. A computer program was developed to calculate the frequency of occurrence of each word across all the language samples. Words were ranked by how frequently they occurred in these samples, with the most frequently occurring word ranked first. According to Nation (2016, p. 4), the most frequent 100 words of English cover around 50% of the running words in a text and the first 1,000 different words between 70% and 90%, partly depending on the content of the text and whether the text used is spoken or written language. In this study, the high frequency word list contained the most frequent 302 words, which covered 80% of the word occurrences produced by the children. A high frequency word list was generated for all children aged from 3 to 6 years. We compiled the top 302 words used across all the groups by summing the raw frequencies of words produced by all the children and then categorized these words based on their POS. In addition, a high frequency word list was generated for each age group (3, 4, 5, and 6 years old). The names of characters in the story retelling were excluded from the high frequency word lists.
Differences between POS used by different age groups. POS tagging was used to categorize all the listed words. For the high frequency words of each age group, the percentage of each POS category used by each age group was calculated. The frequency of each POS category was calculated and divided by the number of children to obtain the mean frequency. The mean frequency of each POS category was used to examine whether the category differed significantly between age groups. One-way analysis of variance (ANOVA) and the Bonferroni post hoc test (p < .05) were conducted to examine significance.
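The ranking and coverage computation described for the word frequency list can be sketched as follows. The study's own program is not available, so the function names are hypothetical, and the sketch assumes that the list length is read off as the smallest number of top-ranked words reaching a target share of all word occurrences. Words are keyed by (word, POS) so that the same characters with different POS count as different words.

```python
from collections import Counter


def rank_words(tokens):
    """Rank (word, pos) tokens by frequency, most frequent first."""
    return Counter(tokens).most_common()


def list_size_for_coverage(ranked, target):
    """Smallest number of top-ranked words whose cumulative share of all
    word occurrences reaches `target` (for example 0.80).

    Returns (list_size, achieved_coverage).
    """
    total = sum(count for _, count in ranked)
    running = 0
    for rank, (_, count) in enumerate(ranked, start=1):
        running += count
        if running / total >= target:
            return rank, running / total
    return len(ranked), 1.0


# Toy tokens (not study data); zài appears once as a verb and once as a preposition.
tokens = [("我", "Pron")] * 5 + [("要", "V")] * 3 + [("在", "V")] + [("在", "P")]
ranked = rank_words(tokens)
size, coverage = list_size_for_coverage(ranked, 0.80)
print(size, coverage)
```

Running the same two functions on each age group's tokens separately would yield per-age lists and their cumulative frequency ratios.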

High Frequency Words Produced by Mandarin-Speaking Children Aged From 3 to 6 Years
First, we categorized the top 302 words based on their POS as shown in Table 2. Among all POS, verbs were used the most frequently, accounting for 30.19% of the occurrences of the top 302 words, and had the most word types: 118 different verbs were frequently used by children aged 3 through 6 years. Pronouns had the second highest use rate (17.35%), followed by adverbs (14.01%) and nouns (12.38%, but with the second highest number of word types, 79 nouns). These first four POS together covered 73.93% of the use of the top 302 words. Function words, such as prepositions and interjections, made up smaller portions of the top 302 words, each with a use rate of less than 7%.
Next, we looked closely at individual words in terms of frequency ranking, which provided more details about the exact words that these preschoolers often used (see Table 3). We first examined content words. Among the high frequency verbs, the copula verb shì, yǒu "have," and yào "want" were the top 3, and not surprisingly, many action verbs about daily activities, such as qù "go," zài "locating," shuō "say," wán "play," and chī "eat," were also frequently used by these preschoolers. The adverb jiù was used very frequently, possibly because it is polysemous, covering meanings such as "at once," "right away," "just," and "then." For nouns, kinship names such as māmā "mother" and bàbà "father," the general noun for objects dōngxī "thing," body parts such as dùzǐ "stomach," and names for objects seen in daily life were commonly used. Several adjectives that described objects' properties or features, such as shape and size, were frequently produced by preschoolers.
For function words, some pronouns (such as wǒ "I" and tā "he or she"), particles (such as DE), and the general classifier gè were used most frequently. Although there were only 12 conjunctions among the top 302 words used by 3- to 6-year-olds, the ranking of these conjunctions showed how preschoolers combined clauses into longer sentences. For example, the coordinate conjunctions ránhòu "then" and háiyǒu "and" were used more frequently than the subordinate conjunctions yīnwéi "because" and deshíhòu "when." The conditional conjunction rúguǒ "if" was among the conjunctions with high frequency as well.
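The breakdown by POS involves two different quantities: how many distinct words (types) fall in a category, and what share of all occurrences (tokens) the category accounts for. A minimal sketch of that distinction is given below; the function name is hypothetical, the counts are toy values rather than study data, and reading the reported percentages as token shares is an interpretation.

```python
from collections import defaultdict


def pos_summary(ranked):
    """Summarize a ranked list of ((word, pos), count) entries by POS.

    Returns {pos: (type_count, token_share_percent)}.
    """
    token_counts = defaultdict(int)
    type_counts = defaultdict(int)
    for (_, pos), count in ranked:
        token_counts[pos] += count   # occurrences contributed by this word
        type_counts[pos] += 1        # one more distinct word in this category
    total = sum(token_counts.values())
    return {
        pos: (type_counts[pos], 100 * token_counts[pos] / total)
        for pos in token_counts
    }


# Toy counts (not study data): one very frequent pronoun, two less frequent verbs.
ranked = [(("我", "Pron"), 6), (("要", "V"), 2), (("吃", "V"), 1), (("媽媽", "N"), 1)]
for pos, (types, share) in pos_summary(ranked).items():
    print(pos, types, round(share, 2))
```

In the toy data the pronoun category has the largest token share with a single type, while verbs have more types but a smaller share, which is why the two figures are worth reporting separately.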
To sum up, these top 302 words provided a general idea of the core vocabulary from age 3 to 6 years. However, this list was not very informative about how these words develop over time. Thus, to better understand how the core vocabulary developed with age, we next looked at how the core vocabulary was distributed within each age group.
Table 4 presents the descriptive results for the frequency of each POS category produced by the four age groups. One-way ANOVA was used to examine the significance of differences in each POS category (V, N, Part, P, Pron, CL, Num, Adv, C, I, and Adj) among groups of children aged 3, 4, 5, and 6 years. The results are presented in Table 5. There were significant effects of age on POS frequency at p < .05 for 10 POS categories (V, N, Part, P, Pron, CL, Num, Adv, C, and Adj). The Bonferroni test was used to test for significant differences in each of the six pairwise age comparisons (3 vs. 4, 3 vs. 5, 3 vs. 6, 4 vs. 5, 4 vs. 6, and 5 vs. 6 years) for these 10 POS categories. In every comparison that showed a significant difference, the older children produced higher frequencies than the younger children. The results are presented in Table 6. The p values for all age group comparisons were extracted from Table 6 and are presented in Table 7. Adverbs differed significantly in all six pairwise comparisons. Verbs, cardinal numbers, and conjunctions differed significantly in five comparisons. Prepositions, pronouns, classifiers, adjectives, and nouns differed significantly in four comparisons. Particles differed significantly in two comparisons.
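For readers unfamiliar with the analysis, the one-way ANOVA F statistic and the set of pairwise comparisons can be sketched with the standard library alone. This is a simplified illustration on made-up numbers, not the study's analysis: it computes F and its degrees of freedom but not the p value (a statistics package would supply that), and it expresses the Bonferroni correction as a per-comparison threshold, which is one common formulation and an assumption about how the correction was applied.

```python
from itertools import combinations
from statistics import mean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA.

    groups: dict mapping a group label to a list of per-child frequencies.
    Returns (f_statistic, df_between, df_within).
    """
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    ss_within = sum(
        (x - mean(values)) ** 2 for values in groups.values() for x in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


def bonferroni_pairs(labels, alpha=0.05):
    """All pairwise comparisons and the Bonferroni per-comparison threshold."""
    pairs = list(combinations(labels, 2))
    return pairs, alpha / len(pairs)


# Made-up frequencies for one POS category (not study data).
toy = {"3y": [1, 2, 3], "4y": [2, 3, 4], "5y": [3, 4, 5], "6y": [4, 5, 6]}
print(one_way_anova_f(toy))
print(bonferroni_pairs(list(toy)))
```

With four age groups there are six pairwise comparisons, which matches the six age pairs tested for each POS category.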

High Frequency Words Produced by Mandarin-Speaking Children of 3, 4, 5, and 6 Years Individually and the Percentage of POS
For the age 3 group, the cumulative frequency ratio of the most frequently used 302 words was 0.83, meaning that these 302 words covered 83% of all word occurrences the children produced. For the age 4, age 5, and age 6 groups, the cumulative frequency ratio of the most frequently used 302 words was 0.80 in each case, meaning that these 302 words covered 80% of all word occurrences the children produced. See Table 8 for the cumulative frequency for each age group. The 302 most frequently used words of typically developing Mandarin-speaking 3-, 4-, 5-, and 6-year-olds are presented in Tables 9 to 12. English translations of these words are available in the Supplemental Material.

High Frequency Words Produced by Mandarin-Speaking Children Aged From 3 to 6 Years
To explore the core vocabulary during the preschool period, we examined the high frequency words produced by children between ages 3 and 6 years. Content words, such as verbs (e.g., eat, have, want) and nouns (e.g., mother, home, father), were the high frequency categories with the most word types. We found that verbs accounted for about 30% of Mandarin-speaking children's use of high frequency words, and this result was consistent with previous findings that Mandarin-speaking caregivers produced more verbs than nouns (Tardif et al., 1997). Moreover, children at these ages steadily used a great number of content words, whereas function words did not seem to be used as consistently. However, a closer look shows that function words related to more sophisticated use of language, such as conjunctions that support complex syntax, were evident in the core vocabulary. By carefully examining certain lexical words, parents, clinicians, or educators can better understand children's development beyond words. Thus, the core vocabulary list we present here can be used as an assessment reference, as well as a reminder to parents and clinicians of which words within a given POS a child may need to learn to enhance lexical performance.

Differences Between POS Used by Different Age Groups
The mean frequencies of each POS category were calculated and compared among age groups. The results showed significant differences for 10 of the 11 POS categories, indicating that children's production of words from the POS categories changes as they mature. In general, children tend to use words from each POS category more frequently as their age increases, except for one category, interjection. In the 10 categories that showed significance, 4-year-olds produced higher frequencies than 3-year-olds for eight categories (V, P, Pron, Num, Adv, C, CL, and Adj). Five-year-olds produced higher frequencies than 4-year-olds for five categories (V, N, Num, Adv, and C). Only one category, adverb, showed a significant difference between children aged 5 and 6 years. These results might indicate that, in terms of POS usage, children's development is more evident between ages 3 and 4 years than between ages 4 and 5 years or ages 5 and 6 years. Significant differences were shown in all paired age groups for one category, adverb. For Mandarin-speaking children, adverbs may be a category that children continue to develop and use more often as age increases from 3 to 6 years. Thus, adverbs could be a potential developmental measure for assessing the language development of children.
Le Normand et al. (2008) examined 316 language samples from French-speaking children aged 2 to 4 years and analyzed the frequency of verbs. Verb usage in that study was similar to that in the present study. In Le Normand et al. (2008), the mean number of verb tokens for 3-year-olds was 123.90, and for 4-year-olds, 153.62. In this study, the mean number of verb tokens for 3-year-olds was 122.28, and for 4-year-olds, 152.48. Smith (1926) studied POS percentages in the 1-hr conversations of 101 English-speaking children. The results showed that there were no significant differences from year to year in any POS category; however, there might be a tendency toward greater use of adjectives and pronouns as age increases. In this study, significant differences were shown for children aged 3 and 4 years for 10 POS categories. For adverbs, significant differences were present from year to year from 3 through 6 years old. The different results of these two studies may show that the development of POS usage in Mandarin-speaking children differs from that in English-speaking children.

High Frequency Words Produced by Mandarin-Speaking Children of 3, 4, 5, and 6 Years Individually and the Percentage of POS
The high frequency words and the percentage of POS were listed above. When looking into the percentage of POS in different age groups, it was noticed that the percentage of nouns seemed to decrease with age. Moreover, other POS, such as adverbs, adjectives, prepositions, classifiers, and conjunctions, seemed to increase after age 3 years. This was similar to the results of Yang (2015). Yang conducted a corpus-based study and analyzed the POS of Mandarin-speaking children aged between 19 and 48 months. The percentage of POS tokens for each age stage showed a decrease in noun and verb usage and an increase in other categories such as adverb, conjunction, and preposition. Because Mandarin possesses very little inflectional and derivational morphology, Mandarin-speaking children do not acquire inflectional morphemes such as past tense (-ed) and third-person singular (-s) to make sentences grammatical: They acquire different types of words and word order to make sentences grammatical. Some studies suggested that children with language delays and disorders often have difficulty learning certain types of words, such as classifiers and aspect markers (Law et al., 2009). It is important for Mandarin-speaking children to learn words from the different POS (H. M. Liu & Lin, 2017), especially function words to generate grammatical sentences.
When selecting a target word in vocabulary intervention for preschoolers, different POS may be the focus according to the age of the child. For example, for 3-year-olds, nouns and verbs may be dominant for their target words. However, for 6-year-olds, adverbs, classifiers, and conjunctions may be important to add to children's lexicon as their sentences become longer and more complex. Beukelman et al. (1989) listed 250 high frequency words of preschoolers, and it was noticed that no nouns appeared in the most frequently occurring 25 words. Similar to Beukelman, Jones, and Rowan's study, only one noun (māmā/mother) appeared in the list of 25 most frequently used words for the age 3 group in this study. Pronouns, verbs, and adverbs were dominant in the list of 25 most frequently used words. This might give educators, clinicians, and parents some information about what types of words frequently occurred in children's communication, and these words might be important to facilitate children's communication efficacy.

Conclusion and Future Directions
This is the first study exploring high frequency words and the differences in POS frequency produced by Mandarin-speaking children aged 3 to 6 years. Future studies on word lists should be conducted to provide parents, educators, and clinicians with a reference to utilize when assessing children's vocabulary performance and, furthermore, when selecting age-appropriate targets. In this study, we used POS to code words. In further studies, other categories may be used to group words. For example, nouns can be categorized as food, toy, transportation, and so on, and verbs can be categorized as perception verbs, psychological verbs, and so on. These specific categories may help clinicians choose target words for their clients. Future studies can also apply the high frequency word lists: Studies concerning the development of effective vocabulary intervention programs targeting high frequency words should be conducted. The effectiveness of facilitating communication when using the high frequency words as core vocabulary to program AAC for clients should also be studied. Children's use of adverbs should be examined in future studies to explore its applicability for assessing developmental language disorders in children. Subcategories of adverbs produced by children should be coded and analyzed to study the language development of Mandarin-speaking children. How demographic and environmental differences affect the word production of children should also be explored in future studies.

Author Contributions
Shang-Yu Wu helped in conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, project administration, resources, supervision, validation, visualization, and writing-original draft preparation and review and editing. Shanju Lin helped in conceptualization, formal analysis, investigation, methodology, and writing-original draft preparation. Rei-Jane Huang and I-Fang Tsai helped in methodology, investigation, and resources.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article:

Supplemental Material
Supplemental material for this article is available online.