Research article
First published online April 1, 2015

On the Advantages of Word Frequency and Contextual Diversity Measures Extracted from Subtitles: The Case of Portuguese

Abstract

We examined the potential advantages of lexical databases based on subtitles and present SUBTLEX-PT, a new lexical database for 132,710 Portuguese words obtained from a 78-million-word corpus of film and television series subtitles, offering word frequency and contextual diversity measures. Additionally, we validated SUBTLEX-PT with a lexical decision study involving 1920 Portuguese words (and 1920 nonwords) of different lengths in letters (M = 6.89, SD = 2.10) and syllables (M = 2.99, SD = 0.94). Multiple regression analyses on latency and accuracy data were conducted to compare the proportion of variance explained by the Portuguese subtitle word frequency measures with that accounted for by a recent written-word frequency database (Procura-PALavras; P-PAL; Soares, Iriarte, et al., 2014). Like its international counterparts, SUBTLEX-PT explains approximately 15% more of the variance in the lexical decision performance of young adults than the P-PAL database. Moreover, in line with recent studies, contextual diversity accounted for approximately 2% more of the variance in participants' reading performance than the raw frequency counts obtained from subtitles. SUBTLEX-PT is freely available for research purposes (at http://p-pal.di.uminho.pt/about/databases).
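To make the two measures concrete, the following minimal Python sketch computes a raw frequency count, a frequency per million words, and a contextual diversity count for each word in a toy corpus. It is an illustration under stated assumptions, not the procedure used to build SUBTLEX-PT: the function name `subtitle_measures`, the toy sentences, the lowercase whitespace tokenization, and the operationalization of contextual diversity as the number of subtitle files containing a word are all assumptions made for this example.

```python
from collections import Counter


def subtitle_measures(documents):
    """Compute word frequency and contextual diversity measures.

    documents: dict mapping a film/episode id to its subtitle text.
    Tokenization (lowercase + whitespace split) is a simplifying assumption.
    """
    freq = Counter()  # total occurrences of each word across the corpus
    cd = Counter()    # number of documents in which each word appears
    for text in documents.values():
        tokens = text.lower().split()
        freq.update(tokens)
        cd.update(set(tokens))  # each document counts at most once per word

    total_tokens = sum(freq.values())
    n_docs = len(documents)
    return {
        word: {
            "count": count,
            "per_million": count * 1_000_000 / total_tokens,
            "cd_count": cd[word],
            "cd_percent": 100 * cd[word] / n_docs,
        }
        for word, count in freq.items()
    }


# Hypothetical toy corpus: three "films" with a handful of words each.
toy = {
    "film_a": "o gato viu o rato",
    "film_b": "o rato fugiu",
    "film_c": "a casa",
}
m = subtitle_measures(toy)
print(m["o"])
print(m["rato"])
```

In this toy corpus, "o" occurs three times but in only two files, so its frequency count and its contextual diversity count differ; that dissociation is what allows the two measures to be compared as predictors.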


References

Adelman J. S., Brown G. D. A., & Quesada J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17, 814–823.
Alegria J., Marin J., Carrillo S., & Mousty P. (2003). Les premiers pas dans l'acquisition de l'orthographe en fonction du caractère profond ou superficiel du système alphabétique: Comparaison entre le français et l'espagnol. In Romdhane M. N., Gombert J.-E., & Belajouza M. (Eds.), L'apprentissage de la lecture: Perspectives comparatives (pp. 51–67). Rennes: Presses Universitaires de Rennes.
Baayen R. H. (2011). Corpus linguistics and naive discriminative learning. Brazilian Journal of Applied Linguistics, 11, 295–328.
Baayen R. H., Feldman L. B., & Schreuder R. (2006). Morphological influences on the recognition of monosyllabic monomorphemic words. Journal of Memory and Language, 53, 496–512.
Baayen R. H., Piepenbrock R., & Gulikers L. (1993). The CELEX lexical database. Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Bacelar do Nascimento M. F., Pereira L. A. S., & Saramago J. (2000). Portuguese Corpora at CLUL. In Proceedings of the Second International Conference on Language Resources and Evaluation (pp. 1603–1607). Athens, Greece.
Balota D. A., Cortese M. J., Sergent-Marshall S. D., Spieler D. H., & Yap M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133, 283–316.
Balota D. A., Yap M. J., Hutchison K. A., Cortese M. J., Kessler B., Loftis B., … Treiman R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445–459.
Bonin P., Chalard M., Méot A., & Fayol M. (2001). Age-of-acquisition and word frequency in the lexical decision task: Further evidence from the French language. Current Psychology of Cognition, 20, 401–443.
Breland H. M. (1996). Word frequency and word difficulty: A comparison of counts in four corpora. Psychological Science, 7, 96–99.
Brysbaert M., Buchmeier M., Conrad M., Jacobs A. M., Bölte J., & Böhl A. (2011). The word frequency effect. Experimental Psychology, 58, 412–424.
Brysbaert M., & Cortese M. J. (2011). Do the effects of subjective frequency and age of acquisition survive better word frequency norms? The Quarterly Journal of Experimental Psychology, 64, 545–559.
Brysbaert M., & Diependaele K. (2013). Dealing with zero word frequencies: A review of the existing rules of thumb and a suggestion for an evidence-based choice. Behavior Research Methods, 45, 422–430.
Brysbaert M., Keuleers E., & New B. (2011). Assessing the usefulness of Google Books’ word frequencies for psycholinguistic research on word processing. Frontiers in Psychology, 2, 27.
Brysbaert M., & New B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977–990.
Brysbaert M., New B., & Keuleers E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 44, 991–997.
Burgess C., & Livesay K. (1998). The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kučera and Francis. Behavior Research Methods, Instruments, & Computers, 30, 272–277.
Cai Q., & Brysbaert M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS ONE, 5, e10729.
Coltheart M., Rastle K., Perry C., Langdon R., & Ziegler J. (2001). DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review, 108, 204–256.
Comesaña M., Fraga I., Moreira A. J., Frade C. S., & Soares A. P. (2014). Free associate norms for 139 European Portuguese words for children from different age groups. Behavior Research Methods, 46, 564–574.
Cortese M. J., & Khanna M. M. (2007). Age of acquisition predicts naming and lexical-decision performance above and beyond 22 other predictor variables: An analysis of 2,342 words. Quarterly Journal of Experimental Psychology, 60, 1072–1082.
Cuetos F., & Barbón A. (2006). Word naming in Spanish. European Journal of Cognitive Psychology, 18, 415–436.
Cuetos F., Ellis A. W., & Alvarez B. (1999). Naming times for the Snodgrass and Vanderwart pictures in Spanish. Behavior Research Methods, Instruments, & Computers, 31, 650–658.
Cuetos F., Glez-Nosti M., Barbon A., & Brysbaert M. (2011). SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicologica, 32, 133–143.
Davis C. J. (2010). The spatial coding model of visual word identification. Psychological Review, 117, 713–758.
Dimitropoulou M., Duñabeitia J. A., Avilés A., Corral J., & Carreiras M. (2010). Subtitle-based word frequencies as the best estimate of reading behavior: The case of Greek. Frontiers in Psychology, 1, 1–12.
Duchon A., Perea M., Sebastián-Gallés N., Martí A., & Carreiras M. (2013). EsPal: One-stop shopping for Spanish word properties. Behavior Research Methods, 45, 1246–1258.
Engbert R., Nuthmann A., Richter E., & Kliegl R. (2005). SWIFT: A dynamical model of saccade generation during reading. Psychological Review, 112, 777–813.
Equipe DELIC. (2004). Présentation du Corpus de référence du français parlé [Presentation of the Reference Corpus of Spoken French]. Recherches sur le français parlé, 18, 11–42. Also available at http://www.up.univ-mrs.fr/veronis/pdf/2004-presentation-crfp.pdf
European Social Survey. (2008). ESS4-European Social Survey 2002/2008. Available at www.europeansocialsurvey.org/
Ferrand L., New B., Brysbaert M., Keuleers E., Bonin P., … Pallier C. (2010). The French lexicon project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42, 488–496.
Forster K. I., & Forster J. C. (2003). DMDX: A Windows display program with millisecond accuracy. Behavior Research Methods, Instruments, & Computers, 35, 116–124.
Freitas E., Casanova J. L., & Alves N. A. (1997). Hábitos de leitura: Um inquérito à população portuguesa [Reading habits: A Portuguese population survey]. Lisboa: Dom Quixote.
Frota S., Vigário M., & Martins F. (2002). Language discrimination and rhythm classes: Evidence from Portuguese. In Bel B. & Marlien I. (Eds.), Proceedings of the 1st International Conference on Speech Prosody (pp. 315–318). Université de Provence: Laboratoire Parole et Langage: Aix-en-Provence, France.
Goswami U., Gombert J. E., & de Barrera L. F. (1998). Children's orthographic representations and linguistic transparency: Nonsense word reading in English, French and Spanish. Applied Psycholinguistics, 19, 19–52.
Grainger J., & Jacobs A. M. (1996). Orthographic processing in visual word recognition: A multiple read-out model. Psychological Review, 103, 518–565.
Howes D. H., & Solomon R. L. (1951). Visual duration threshold as a function of word-probability. Journal of Experimental Psychology, 41, 401–410.
IMDb website. Retrieved May 3, 2014, from http://www.imdb.com/genre/
Johns B. T., Gruenenfelder T. M., Pisoni D. B., & Jones M. N. (2012). Effects of word frequency, contextual diversity, and semantic distinctiveness on spoken word recognition. The Journal of the Acoustical Society of America, 132, EL74–EL80.
Keuleers E., Brysbaert M., & New B. (2010). SUBTLEX-NL: A new measure for Dutch words frequency based on film subtitles. Behavior Research Methods, 42, 643–650.
Keuleers E., Diependaele K., & Brysbaert M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers in Psychology, 1, 174.
Keuleers E., Lacey P., Rastle K., & Brysbaert M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44, 287–304.
Kučera H., & Francis W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lee C. J. (2003). Evidence-based selection of word frequency lists. Journal of Speech-Language Pathology and Audiology, 27, 172–175.
Leech G., Rayson P., & Wilson A. (2001). Word frequencies in written and spoken English: Based on the British National Corpus. London: Longman.
Lund K., & Burgess C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203–208.
McClelland J. L., & Rumelhart D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375–407.
Michel J. B., Shen Y. K., Aiden A. P., Veres A., Gray M. K., Pickett J. P., … Aiden E. L. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331, 176–182.
Murray W. S., & Forster K. I. (2004). Serial mechanisms in lexical access: The rank hypothesis. Psychological Review, 111, 721–756.
National Endowment for the Arts (NEA). (2004). Reading at risk: A survey of literary reading in America. Washington, DC: National Endowment for the Arts.
New B., Brysbaert M., Veronis J., & Pallier C. (2007). The use of film subtitles to estimate word frequencies. Applied Psycholinguistics, 28, 661–677.
New B., Pallier C., Brysbaert M., & Ferrand L. (2004). Lexique 2: A new French lexical database. Behavior Research Methods, Instruments, & Computers, 36, 516–524.
Open Subtitles (OS) website. Retrieved May 3, 2014, from http://opus.lingfil.uu.se/OpenSubtitles_v2.php
Perea M., Soares A. P., & Comesaña M. (2013). Contextual diversity is a main determinant of word identification times in young readers. Journal of Experimental Child Psychology, 116, 37–44.
Plaut D. C., McClelland J. L., Seidenberg M. S., & Patterson K. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103, 56–115.
Plummer P., Perea M., & Rayner K. (2014). The influence of contextual diversity on eye movements in reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40, 275–283.
Procura-PALavras database [P-PAL] website. Retrieved May 3, 2014, from http://p-pal.di.uminho.pt/tools
Reichle E. D., Pollatsek A., Fisher D. L., & Rayner K. (1998). Toward a model of eye movement control in reading. Psychological Review, 105, 125–157.
Santos M. L., Neves J., Lima M. J., & Carvalho M. (2007). A leitura em Portugal [Reading in Portugal]. Lisboa: Gabinete de Estatística e Planeamento da Educação (GEPE).
Seymour P. H. K., Aro M., & Erskine J. M. (2003). Foundation literacy acquisition in European orthographies. British Journal of Psychology, 94, 143–174.
Simões A. M., & Almeida J. J. (2001). Jspell. In Actas do Encontro Nacional da Associação Portuguesa de Linguística. Lisboa: Associação Portuguesa de Linguística.
Sinclair J. (2005). Corpus and text: Basic principles. In Wynne M. (Ed.), Developing linguistic corpora: A guide to good practice (pp. 1–16). Oxford, UK: Oxbow Books.
Soares A. P., Comesaña M., Pinheiro A. P., Simões A., & Frade C. S. (2012). The adaptation of the Affective Norms for English Words (ANEW) for European Portuguese. Behavior Research Methods, 44, 256–269.
Soares A. P., Costa A., Machado J., Silva A., Oliveira J., Gonçalves A. M., & Comesaña M. (2013). Subjective frequency, imageability and concreteness norms for 3,800 European Portuguese words. Poster presented at 18th Conference of the European Society for Cognitive Psychology (ESCOP), 29 August-01 September, Budapest, Hungary.
Soares A. P., Iriarte A., Almeida J. J., Simões A., Costa A., França P., … Comesaña M. (2014). Procura-PALavras (P-PAL): Uma nova medida de frequência lexical do Português Europeu contemporâneo [Procura-PALavras (P-PAL): A new measure of word frequency for contemporary European Portuguese]. Psicologia: Reflexão e Crítica, 27, 110–123.
Soares A. P., Medeiros J. C., Simões A., Machado J., Costa A., Iriarte A., … Comesaña M. (2014). ESCOLEX: A grade-level lexical database from European Portuguese Elementary to Middle School textbooks. Behavior Research Methods, 46, 240–253.
Soares A. P., Pinheiro A. P., Costa A., Frade C. S., Comesaña M., & Pureza R. (2013). Affective auditory stimuli: Adaptation of the international affective digitized sounds (IADS-2) for European Portuguese. Behavior Research Methods, 45, 1168–1181.
SUBTLEX-PT database website. Retrieved May 3, 2014, from http://p-pal.di.uminho.pt/about/databases
Thorndike E. L. (1921). The teacher's word book. New York: Teachers College, Columbia University.
Thorndike E. L., & Lorge I. (1944). The teacher's word book of 30,000 words. New York: Teachers College, Columbia University.
Tiedemann J. (2009). News from OPUS: A collection of multilingual parallel corpora with tools and interfaces. In Nicolov N., Bontcheva K., Angelova G. & Mitkov R. (Eds.), Recent advances in natural language processing (pp. 237–248). Amsterdam/Philadelphia: John Benjamins.
Universidade de Lisboa. (1987). Português fundamental: Métodos e documentos [Fundamental Portuguese: Methods and documents]. Lisboa: Instituto de Investigação Científica.
van Heuven W. J. B., Mandera P., Keuleers E., & Brysbaert M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67, 1176–1190.
Yap M. J., & Balota D. A. (2009). Visual word recognition of multisyllabic words. Journal of Memory and Language, 60, 502–529.
Zeno S. M., Ivens S. H., Millard R. T., & Duvvuri R. (1995). The educator's word frequency guide. Brewster, NY: Touchstone Applied Science.
Zevin J. D., & Seidenberg M. S. (2002). Age of acquisition effects in word reading and other tasks. Journal of Memory and Language, 47, 1–29.
Ziegler J. C., Petrova A., & Ferrand L. (2008). Feedback consistency effects in visual and auditory word recognition: Where do we stand after more than a decade? Journal of Experimental Psychology: Learning, Memory, and Cognition, 34, 643–661.

Published In

Article first published online: April 1, 2015
Issue published: April 2015

Keywords

  1. Word frequency
  2. Contextual diversity
  3. Subtitles
  4. Portuguese

Rights and permissions

© 2015 Experimental Psychology Society.
PubMed: 25263599

Authors

Affiliations

Ana Paula Soares
Human Cognition Lab, CIPsi, School of Psychology, University of Minho, Minho, Portugal
João Machado
Human Cognition Lab, CIPsi, School of Psychology, University of Minho, Minho, Portugal
Ana Costa
Human Cognition Lab, CIPsi, School of Psychology, University of Minho, Minho, Portugal
Álvaro Iriarte
Centre for Humanistic Studies, University of Minho, Minho, Portugal
Alberto Simões
Centre for Humanistic Studies, University of Minho, Minho, Portugal
José João de Almeida
Computer Science and Technology Center, University of Minho, Minho, Portugal
Montserrat Comesaña
Human Cognition Lab, CIPsi, School of Psychology, University of Minho, Minho, Portugal
Manuel Perea
ERI-Lectura and Departamento de Metodología, Universitat de València, Valencia, Spain

Notes

Human Cognition Lab, CIPsi, School of Psychology, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal. E-mail: [email protected]

This article was published in Quarterly Journal of Experimental Psychology.
