Words in books versus on television
To examine how children’s book vocabulary differs from that on television, we inspected (1) how many words from the CYP-LEX database are also present in the SUBTLEX-UK database, and (2) for words that are encountered both in CYP-LEX and SUBTLEX-UK, how CYP-LEX word frequencies compare to those in SUBTLEX-UK.
Although SUBTLEX-UK is largely a database of adult word frequencies, it also includes words derived from subtitles on two children’s channels: Cbeebies, which targets pre-school children, and CBBC, which is designed for children between 6 and 12 years. Overall, the SUBTLEX-UK database contains 159,235 types (derived from 201,335,638 tokens) and is thus substantially larger than CYP-LEX (105,694 types, derived from 70,287,217 tokens). However, its Cbeebies subcorpus (27,236 types, derived from 5,848,083 tokens) is considerably smaller than any of the CYP-LEX age bands, while the CBBC subcorpus (58,691 types, derived from 13,612,278 tokens) is comparable in size only to the 7–9 age band (52,851 types, derived from 11,162,653 tokens). The full textual content of the SUBTLEX-UK corpus (i.e., tokens) is not publicly available due to copyright issues, and, consequently, the differences in the size of the CYP-LEX and SUBTLEX-UK corpora pose challenges for direct comparisons between the resulting databases. To minimise the potential impact of these size differences, we combined the Cbeebies and the CBBC subcorpora, which resulted in a list of 63,081 types (derived from 19,460,361 tokens). We then examined how many words in the 7–9 and 10–12 age bands are not included in this combined word list. Regarding the first comparison, because the combined subtitle list is larger than CYP-LEX 7–9 (in terms of the number of both types and tokens) and includes words encountered in television programmes for older children (up to the age of 12), we reasoned that the presence or absence of CYP-LEX 7–9 words in this combined list would be informative regarding the differences in vocabulary in books versus on television. Following this logic, we then also examined how many words in each of the CYP-LEX age bands are not present in the entire SUBTLEX-UK database.
Regarding the comparison between the 10–12 age band and the combined Cbeebies and CBBC list, we report this analysis for completeness but we acknowledge that these figures should be treated with caution given that the 10–12 age band was derived from a slightly larger corpus (a difference of 2.4 million tokens) than that from which the Cbeebies and the CBBC lists were derived.
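The type-level comparison described above amounts to a set difference between two word lists. The following is a minimal sketch of that idea in Python, not the authors' actual pipeline: the function names and the toy word lists are hypothetical, and the sketch assumes that both lists have already been normalised in the same way (e.g., same casing and tokenisation).

```python
def missing_types(source_types, reference_types):
    """Return the sorted words of source_types that do not occur in reference_types."""
    return sorted(set(source_types) - set(reference_types))


def missing_share(source_types, reference_types):
    """Return the proportion of distinct source words absent from the reference list."""
    source = set(source_types)
    if not source:
        return 0.0
    return len(source - set(reference_types)) / len(source)


# Toy, made-up word lists (hypothetical data, for illustration only).
book_types = ["dragon", "sorrowfully", "the", "castle"]
cbeebies_types = ["the", "dragon"]
cbbc_types = ["the", "castle", "football"]

# Combining two subtitle lists is a set union of their types.
combined_tv_types = set(cbeebies_types) | set(cbbc_types)

absent = missing_types(book_types, combined_tv_types)
share = missing_share(book_types, combined_tv_types)
print(absent, share)
```

In this toy case, one of the four book words ("sorrowfully") is absent from the combined subtitle list, giving a share of 0.25.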
This analysis revealed that children’s books contain many words that are never encountered on television. For instance, 28% (N = 14,873) of words in books for 7- to 9-year-olds never appear in age-appropriate BBC television programmes and programmes for older children. Most of these words are nouns (57%), followed by verbs (18%) and adjectives (12%). While the majority of these words (90%) are encountered fewer than 10 times (Zipf frequencies below 3), 2% occur 50 times or more (Zipf frequencies between 3.5 and 5.5). Likewise, 40% (N = 28,533) of words in the 10–12 age band are not encountered in the combined Cbeebies and CBBC list. Finally, while the SUBTLEX-UK database as a whole contains most of the words (91%) in the youngest age band, 14% (N = 10,231) and 21% (N = 19,472) of words in the two older age bands are missing from SUBTLEX-UK. SUBTLEX-UK includes content from all television programmes broadcast on the BBC over 3 years (2010–2012) and thus represents a comprehensive record of British television language; therefore, the fact that children’s books include so many words that are missing from SUBTLEX-UK is remarkable. About 90% of book words not included in SUBTLEX-UK are encountered fewer than 10 times (Zipf frequencies below 2.5); however, 2% occur 50 times or more (Zipf frequencies between 3 and 5). Inspection of these unshared words suggests that many are morphologically complex (e.g., “conquerable”, “unprocurable”, “sorrowfully”). This impression was further supported by an examination of the morphological structure of words missing from SUBTLEX-UK as documented in the MorphoLex database (Sánchez-Gutiérrez et al., 2018). MorphoLex comprises morphological information for 68,624 words from the English Lexicon Project (Balota et al., 2007); 19% (N = 5,257) of CYP-LEX words missing from the SUBTLEX-UK database have entries in the MorphoLex database, and 77% (N = 4,052) of these words are morphologically complex.
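To make the link between raw counts and Zipf frequencies concrete, the sketch below converts a raw count into a Zipf value. It assumes the standard definition of the Zipf scale (the base-10 logarithm of the frequency per million words, plus 3) without any smoothing; the text above does not state whether a smoothing correction was applied, so the function is illustrative only. The token count is the one reported above for the 7–9 age band.

```python
import math


def zipf_frequency(raw_count, corpus_tokens):
    """Zipf value = log10(frequency per million words) + 3 (no smoothing assumed)."""
    if raw_count <= 0 or corpus_tokens <= 0:
        raise ValueError("raw_count and corpus_tokens must be positive")
    per_million = raw_count / (corpus_tokens / 1_000_000)
    return math.log10(per_million) + 3


TOKENS_7_9 = 11_162_653  # size of the 7-9 age band, as reported in the text

zipf_10 = zipf_frequency(10, TOKENS_7_9)
zipf_50 = zipf_frequency(50, TOKENS_7_9)
print(round(zipf_10, 2), round(zipf_50, 2))
```

Under these assumptions, a word seen 10 times in the 7–9 age band has a Zipf value just below 3, and a word seen 50 times has a value a little above 3.5, which agrees with the thresholds quoted for that band.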
For those words that are present both in CYP-LEX and SUBTLEX-UK, Table 2 reports the correlations between these words’ frequencies in each of the CYP-LEX age bands and their frequencies in each of the SUBTLEX-UK subcorpora. The Hotelling–Williams test for differences in correlations that are themselves intercorrelated (Steiger, 1980) showed that frequencies from the 7–9 age band correlated more strongly with those from CBBC than with those from the other two SUBTLEX-UK subcorpora (Cbeebies and adult). For the other two CYP-LEX age bands (10–12 and 13+), we observed higher correlations with the adult subcorpus of SUBTLEX-UK than with either Cbeebies or CBBC. The fact that word frequencies in the 10–12 age band correlate more strongly with those in the adult SUBTLEX-UK subcorpus than with those in the CBBC subcorpus is particularly striking, and suggests that the way that words are used in children’s books may be more sophisticated than the way they are used on children’s television.
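For readers unfamiliar with the test, the sketch below computes Williams' t statistic for comparing two dependent correlations that share one variable, in the form given by Steiger (1980). It is a minimal illustration rather than the authors' code: the function name is hypothetical, and the correlation values and sample size in the example are invented, not taken from Table 2. Here variable 1 would be the CYP-LEX frequencies, and variables 2 and 3 the frequencies in two SUBTLEX-UK subcorpora.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams' t for H0: rho12 == rho13, where variable 1 is shared.

    r12, r13: correlations of the shared variable with variables 2 and 3.
    r23: correlation between variables 2 and 3.
    n: number of observations (here, words). Returns (t, degrees of freedom).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    # Determinant of the 3 x 3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)
    return t, n - 3


# Invented values, for illustration only.
t_value, df = williams_t(r12=0.5, r13=0.3, r23=0.4, n=100)
print(round(t_value, 3), df)
```

The statistic is referred to a t distribution with n − 3 degrees of freedom; a positive value indicates that the first correlation is the larger of the two.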
Words across the age bands
The aim of the analysis reported in this section was to understand the similarities and differences in how words are used across the age bands. To this end, we first examined whether the frequencies of the words that were encountered in more than one age band were alike across the age bands. Of the words in the 7–9 age band, 45,318 (86%) were also present in the 10–12 age band and 47,426 (90%) in the 13+ age band, whereas 59,942 words (84%) from the 10–12 age band were also present in the 13+ age band. The frequencies of the shared words were highly correlated for each pair of age bands (7–9 and 10–12, 7–9 and 13+, and 10–12 and 13+). The Hotelling–Williams test showed that the frequencies in the 7–9 age band were more strongly correlated with those in the 10–12 age band than with those in the 13+ age band, and that the frequencies in the 10–12 age band were more strongly correlated with those in the 7–9 age band than with those in the 13+ age band. This result suggests that, for words that are shared across the age bands, their frequency of use in books for 10- to 12-year-olds is more similar to that in books for younger children than it is to that in books for older children.
Next, we examined which words in the corpus were used most frequently and whether the age bands differed in terms of their most common words. The age bands appear very similar in terms of their 100 most frequent words (Zipf frequencies between 6 and 7.75; see Figure 4). Similar to lexical databases derived from other corpora (e.g., CPWD, SUBTLEX-UK), these words amount to about half of all tokens (54%) in each age band. Most of these are function words (approx. 70%) such as prepositions (e.g., “in”, “with”; 18%), personal pronouns (e.g., “I”, “he”; 17%), auxiliary verbs (e.g., “be”, “are”; 14%), determiners (e.g., “a”, “the”; 7%), and coordinating conjunctions (e.g., “and”, “but”; 3%), but the list also includes a small proportion of adverbs (e.g., “again”, “away”, “back”; 14%), adjectives (e.g., “little”, “right”; 4%), nouns (e.g., “time”, “way”; 3%), and lexical verbs (e.g., “go”, “know”, “think”, “like”, “see”; 6%). Thus, a breakdown of word class in the 100 most common words in the CYP-LEX corpus very closely resembles that in the CPWD, in which 89% were classified as function words (note that, in Masterson et al., 2010, adverbs and verbs such as “say”, “ask”, “look”, and “like”, termed “verbs with general meaning”, were treated as function words).
Among the top 100 words, the most pronounced differences across the age bands pertain to the use of personal pronouns, which, in each age band, amount to around 17% of the 100 most common words. The pronoun “I” is present in every band and its frequency increases as a function of the books’ target age: it is ranked 6th in the 7–9 age band, 5th in the 10–12 age band, and 3rd in the 13+ age band. It is thus more common than function words such as “a” or “to”, possibly indicating a growing self-focus in books targeting teenagers as opposed to younger children. The pronoun “he” is also very common in each age band, ranked 9th, 10th, and 7th in the three age bands, respectively. Intriguingly, “she” is used much less often (14th in the 7–9 and 10–12 age bands, and 17th in the 13+ age band), with the gap between “he” and “she” widening as the books’ target age increases. This pattern also holds at the lemma level, with the lemma “he” (ranked 4th in the 7–9 and 13+ age bands, and 6th in the 10–12 age band) being used more frequently than the lemma “she” (ranked 9th in the 10–12 and 13+ age bands, and 8th in the 7–9 age band). Moreover, the distribution of tf-idf scores for these two lemmas across the individual books (Figure 5) demonstrates that, on average, books in each age band tend to use the lemma “he” more than the lemma “she”, with this difference being particularly large in the 13+ age band. Taken together, these results suggest that children’s books tend to focus more on male than on female characters and that this tendency becomes more pronounced as the books’ target age increases.
Regarding the plural forms of personal pronouns, the frequency of both first-person and third-person plural pronouns does not seem to vary across the age bands. However, in each age band, third-person plurals are used more frequently than first-person plurals, as indexed by the higher frequency of the lemma “they” as compared with that of the lemma “we” in terms of both raw counts and tf-idf scores. Because first- and third-person plurals are often interpreted as markers of group identity (e.g., Pennebaker & Lay, 2002), this pattern could indicate that membership in social groups is an important concept in children’s literature and that, regardless of the readers’ target age, the characters are more often described as not belonging than belonging to a group. Given the high prevalence of “I” in the books (and particularly so in books written for young people), one could further speculate that the book characters’ self-categorisation could be based on personal rather than on social identity. We note, however, that this preliminary interpretation needs further evaluation and that our data do not speak to the uniqueness of this pattern to children’s books, nor can they inform on the extent to which this pattern may apply to other language registers (e.g., spoken language and/or adult literature).
Beyond the top 100 words, our analysis shows that the similarity of vocabulary across the age bands decreases as a function of word frequency (Figure 6). Indeed, while the first hundred of the top 600 words are almost identical across the age bands (93%–97% overlap), the amount of overlap is reduced to 73%–86% for the second hundred and to 53%–73% for the third hundred. By the time the sixth hundred most common words (words 501–600) are reached, the overlap between the 7–9 and the 13+ age bands is reduced to 15% and that between the 7–9 and the 10–12 age bands to 31%. Interestingly, for each set of 100 words among the top 600 words, the overlap between the 7–9 and the 10–12 age bands is greater than that between the 7–9 age band and the 13+ age band, suggesting that, with respect to the most common words, books for children over 13 are less similar to books in the two younger age bands than books in those two age bands are to each other. This pattern was also observed for frequency correlations for words shared across the age bands.
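The overlap figures above can be illustrated with a short sketch. It assumes, because the text does not spell out the exact definition, that the overlap for the k-th hundred is the share of words in that rank bin of one age band that also fall in the same rank bin of the other age band. The function name and the toy ranked lists are hypothetical.

```python
def overlap_by_rank_bin(ranked_a, ranked_b, bin_size=100, n_bins=6):
    """Share of words common to matching rank bins of two frequency-ranked lists.

    ranked_a, ranked_b: words ordered from most to least frequent.
    Returns one proportion per bin (bin 1 = ranks 1..bin_size, and so on).
    """
    if len(ranked_a) < bin_size * n_bins or len(ranked_b) < bin_size * n_bins:
        raise ValueError("both lists must contain at least bin_size * n_bins words")
    shares = []
    for i in range(n_bins):
        lo, hi = i * bin_size, (i + 1) * bin_size
        common = set(ranked_a[lo:hi]) & set(ranked_b[lo:hi])
        shares.append(len(common) / bin_size)
    return shares


# Toy ranked lists with bins of two words (hypothetical data).
band_young = ["the", "and", "cat", "dog"]
band_old = ["and", "the", "dog", "owl"]
shares = overlap_by_rank_bin(band_young, band_old, bin_size=2, n_bins=2)
print(shares)
```

With the real data, `bin_size=100` and `n_bins=6` would correspond to the six sets of 100 words among the top 600 words.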
Our analysis further showed that the decrease in similarity across the top 600 words was accompanied by a decrease in the proportion of function words: for instance, among the second hundred of the top 600 words, only about 43% are functors, and, among the third hundred, no more than 38%. Notably, despite their preponderance in the corpus, function words account for only a very small percentage of types: in each age band, 57%–59% of all types are nouns, 20%–22% are verbs, followed by adjectives (14%), adverbs (4%), proper nouns (3%–4%), and foreign words (1%), with all other parts of speech accounting for less than 0.5% of all types. These findings indicate that the most significant vocabulary differences across the age bands should be attributable to the use of words with lower frequencies and that most of these words are nouns, followed by verbs and adjectives.
Indeed, while a high percentage of words are shared across the age bands, each age band contains vast numbers of words that do not appear in the younger age band, and the vast majority of these are very low in frequency. Thus, 36% (N = 25,627) of words in the 10–12 age band and 48% (N = 43,549) of words in the 13+ age band are not present in books for children aged between 7 and 9 years, while 34% (N = 31,025) of words in the 13+ age band are not encountered in books for children aged 10–12 years. Strikingly, within the age bands, the vast majority of these “new” words (73%–74%) only appear 3 times or fewer, and only about 1% of these words are encountered more than 100 times (most of these words are names and, for the 13+ age band, swear words). It follows that about a third of these “new” words in the 10–12 and 13+ age bands (i.e., words that are never encountered in the 7–9 and 10–12 age bands, respectively) appear in a maximum of 3 books such that the readers are extremely unlikely to ever encounter them (see Supplementary material B for a list of words from bands 10–12 and 13+ that are missing in bands 7–9 and 10–12, respectively; https://doi.org/10.17605/OSF.IO/SQU49).
It is important to recognise that, compared with books for adolescents, books for children who have only recently begun to read independently typically contain illustrations and have shorter sentences, fewer words per page, and fewer and shorter chapters. Indeed, in CYP-LEX, the mean book length (number of tokens) in the 10–12 age band is twice that in the 7–9 age band, while an average book in the 13+ age band is 1.7 times longer than that in the 10–12 age band (see Table 1). In corpus linguistics, it is customary to control for differences in corpus size by equating the corpora in terms of the number of tokens (e.g., by taking random samples of the size of the smaller corpus). Within the context of our work, this approach would simulate an unrealistic scenario, whereas equating the number of books included in each age band is a more ecologically valid approach to quantifying children’s reading experience. We acknowledge that, under this approach, it is not possible to distinguish whether the differences in book vocabulary across the age bands are driven solely by differences in their lexical content or are at least to some extent a result of differences in book length, and, therefore, we do not make any inferences regarding this matter. Rather, we take our findings to indicate that, with about 40% of words in each age band having a raw frequency of 3 or less and about half of the words a raw frequency of 6 or less, a child wishing to enhance their vocabulary and move beyond the most common words would need to read widely. Furthermore, the results reported above also suggest that even for those who do read widely, understanding the vocabulary used in books still poses a challenge, with many new words to work through in age-appropriate books, and even more so as readers transition to books aimed at older children.
Thus far, we have shown that book vocabulary for younger primary school children is comparable to that for older children and young people regarding the first few hundred most common words, but that the age bands differ substantially in terms of words with lower frequencies. Our next analysis showed that this pattern of results holds also at the level of individual books. For each book in the corpus (N = 1,200, with 400 books per age band), Figure 7 shows the proportion of lemmas, out of the 75 most common lemmas (i.e., including function words), that each book shares with every other book in the corpus. It is immediately apparent that, within each age band, most of the books share the first 25 of the 75 most frequent lemmas. Yet, the amount of overlap reduces drastically for lemmas 26–50, and even more so for lemmas 51–75. This interpretation was confirmed through a statistical test which was conducted by first computing a mean vector for each age band and set of lemmas and then comparing these mean vectors by means of a t-test (Table 3). It is of note that, for a book of an average size, 25 lemmas amount to about 1% (7–9 age band), 0.7% (10–12 age band), and 0.5% (13+ age band) of all the lemmas it contains. Consequently, these results suggest that the number of lemmas that can be expected to occur consistently in books written for children of the same age is extremely low (no more than 1%), with the vast majority of these lemmas being function words.
To examine the similarity of vocabulary within the age bands beyond the most common lemmas, we built a document-term matrix, where each row corresponded to a book (N = 1,200) and each column corresponded to a unique lemma observed in the entire CYP-LEX database (N = 75,386). Each cell of the matrix recorded the raw frequency of the corresponding lemma in the corresponding book (a value of zero was recorded if a lemma did not occur in a book). Each book was thus represented as a vector of values giving the frequency, in that particular book, of every lemma in the database. We then measured the similarity between all individual vectors (books) by computing their cosine similarity (i.e., the cosine of the angle between two vectors in an inner product space) and then repeated this process while excluding those lemmas that corresponded to function words. An important advantage of this analysis is that it is not influenced by differences in book length. Figures 8A and 8B visualise the resulting similarity matrices in the form of two heatmaps, and the results of a statistical analysis comparing the similarity scores across the age bands are reported in Table 4. Both of these analyses revealed that the similarity between books in the 7–9 age band is much lower than the similarity between books in the 10–12 and 13+ age bands. Figure 8B clearly shows that, in contrast to the 7–9 age band, where most books appear to have low similarity to one another, only a small subset of books that children aged 13+ read differ from other books regarding their vocabulary. The majority of these “less similar” books are books listed in the GCSE syllabus for English. It is therefore likely that these books’ low similarity to other books in the 13+ age band is due to the fact that they were written long ago when language was used quite differently. Summing up, our findings indicate that, apart from some very frequent (function) words, books for younger primary school children differ more substantially from each other with respect to both which words they use and how often they use them. These differences are attenuated in the literature targeting older readers, suggesting a substantially higher degree of lexical homogeneity in books read by children in the final years of primary school and in secondary school.
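The document-term matrix and cosine similarity computation described above can be sketched as follows. This is a minimal illustration on three made-up "books", not the authors' implementation: the book contents, the tiny function-word list, and all function names are hypothetical. The second book repeats the first one twice, which illustrates why cosine similarity is not affected by book length.

```python
import math
from collections import Counter


def term_vector(lemmas, vocabulary):
    """Raw frequency of each vocabulary lemma in one book (zero if absent)."""
    counts = Counter(lemmas)
    return [counts[lemma] for lemma in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical lemmatised books; book_b is book_a repeated twice.
books = {
    "book_a": ["the", "dragon", "sleep", "the", "cave"],
    "book_b": ["the", "dragon", "sleep", "the", "cave"] * 2,
    "book_c": ["the", "robot", "build", "a", "rocket"],
}
FUNCTION_WORDS = {"the", "a"}  # hypothetical stand-in for a real function-word list

# One row per book, one column per unique lemma in the whole collection.
vocabulary = sorted({lemma for lemmas in books.values() for lemma in lemmas})
matrix = {title: term_vector(lemmas, vocabulary) for title, lemmas in books.items()}

sim_ab = cosine_similarity(matrix["book_a"], matrix["book_b"])
sim_ac = cosine_similarity(matrix["book_a"], matrix["book_c"])

# Repeat the analysis with function words excluded from the columns.
content_vocab = [lemma for lemma in vocabulary if lemma not in FUNCTION_WORDS]
content = {title: term_vector(lemmas, content_vocab) for title, lemmas in books.items()}
sim_ac_content = cosine_similarity(content["book_a"], content["book_c"])
print(sim_ab, sim_ac, sim_ac_content)
```

In this toy case the doubled book is as similar to the original as a book can be, whereas the only similarity between the first and third books comes from the shared function word and disappears once function words are excluded.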