A Multilab Study of Bilingual Infants: Exploring the Preference for Infant-Directed Speech

From the earliest months of life, infants prefer listening to and learn better from infant-directed speech (IDS) compared with adult-directed speech (ADS). Yet IDS differs within communities, across languages, and across cultures, both in form and in prevalence. This large-scale, multisite study used the diversity of bilingual infant experiences to explore the impact of different types of linguistic experience on infants’ IDS preference. As part of the multilab ManyBabies 1 project, we compared preference for North American English (NAE) IDS in lab-matched samples of 333 bilingual and 384 monolingual infants tested in 17 labs in seven countries. The tested infants were in two age groups: 6 to 9 months and 12 to 15 months. We found that bilingual and monolingual infants both preferred IDS to ADS, and the two groups did not differ in terms of the overall magnitude of this preference. However, among bilingual infants who were acquiring NAE as a native language, greater exposure to NAE was associated with a stronger IDS preference. These findings extend the previous finding from ManyBabies 1 that monolinguals learning NAE as a native language showed a stronger IDS preference than infants unexposed to NAE. Together, our findings indicate that IDS preference likely makes similar contributions to monolingual and bilingual development, and that infants are exquisitely sensitive to the nature and frequency of different types of language input in their early environments.

Infants' preference for IDS may play a useful role in early language learning. For example, infants are better able to discriminate speech sounds in IDS than in ADS (Karzon, 1985; Trainor & Desjardins, 2002), more efficiently segment words from continuous speech in an IDS register (Thiessen et al., 2005), demonstrate better long-term memory for words spoken in IDS (Singh et al., 2009), and learn new words more effectively from IDS (Graf Estes & Hurley, 2013; Ma et al., 2011; but see Schreiner et al., 2016).
Although most studies have confirmed a general, early preference for IDS, to date there has been very little research aimed at understanding how different linguistic experiences affect infants' preferences. For instance, although the use of IDS has been demonstrated in a large number of cultures (see the sources cited above), the vast majority of the research on infants' IDS preferences has been conducted in North America, using English speech typically directed at infants learning North American English (NAE; Dunst et al., 2012). Most critically, past work has been limited to a particular kind of linguistic (and cultural) experience: that of the monolingual infant. Here, we present a large-scale, multisite, preregistered study on bilingual infants, a population that is particularly suited for exploring the relationship between language experience and IDS preference. Moreover, this research provides important insight into the early development of bilingual infants, a large but understudied population.

Does Experience Tune Infants' Preference for IDS?
What role might experience play in tuning infants' attention to IDS? We aggregated results from a published meta-analysis (Dunst et al., 2012) with additional community-contributed data available through the MetaLab database (http://metalab.stanford.edu) to examine their combined results. When all 62 studies were considered, we found a moderately sized average effect, Cohen's d = 0.70, showing a preference for IDS compared with ADS. Focusing on the 22 studies most similar to ours (testing IDS preference in a laboratory using looking times of typically developing infants ages 3 through 15 months, with stimuli consisting of naturally produced English-spoken IDS from an unfamiliar female speaker), we found a slightly smaller effect size, d = 0.60. Although this meta-analysis focused on infants in the first year of life, other studies of infants aged 18 to 21 months have also found a preference for IDS over ADS (Glenn & Cunningham, 1983; Robertson et al., 2013). There is some evidence that older infants show a greater preference for IDS than younger infants (Dunst et al., 2012), although an age effect was not found in the subsample of 22 studies mentioned above. More evidence is needed to explore the possibility that increased language experience as children grow enhances their preference for IDS.
Another experimental variable that would be important in understanding the role of experience in the preference for IDS is whether the speech stimuli are presented in a native or nonnative language. Numerous studies in early perception have found different developmental trajectories for perception of native versus nonnative stimuli (e.g., discriminating human faces vs. discriminating monkey faces in Lewkowicz & Ghazanfar, 2006; discriminating native vs. discriminating nonnative speech-sound categories in Maurer & Werker, 2014; segmenting word forms from fluent speech in Polka & Sundara, 2012). Generally, whereas infants show increasing proficiency in discriminating the types of faces and sounds that are present in their environment, they lose sensitivity to the differences between nonnative stimuli over time. This general pattern might lead one to predict that infants will initially be sensitive to differences between IDS and ADS in both the native and the nonnative languages, but that this initial crosslinguistic sensitivity will decline with age. In other words, at some ages, infants' preference for IDS over ADS could be enhanced when they hear their native language. However, to date, researchers have collected very little data relevant to this question. Importantly, this general trend, if it exists, may interact with differences across languages in the production of IDS. The exaggerated IDS of NAE might be either more interesting or less interesting to an infant whose native language is characterized by a less exaggerated form of IDS than to an infant who regularly hears NAE IDS.
Only a handful of IDS-preference studies have explicitly explored infants' preference for IDS in a native versus a nonnative language. Werker et al. (1994) compared 4.5- and 9-month-old English- and Cantonese-learning infants' preference for videos of Cantonese mothers using IDS versus ADS. Both groups showed a preference for IDS; however, the magnitude of the preference was not specifically compared between the two groups. Hayashi et al. (2001) studied Japanese-learning infants' (aged 4-14 months) preference for native (Japanese) and nonnative (English) speech. These infants generally showed a preference for Japanese IDS over Japanese ADS, as well as an increasing preference for Japanese IDS over English IDS. The latter finding shows that infants tune into their native language with increased experience; however, as the study did not measure infants' interest in English ADS, it does not shed light on whether the infants were equally sensitive to the difference between ADS and IDS in the nonnative stimuli, or whether and how such sensitivity might change over time.
Infants growing up bilingual are typically exposed to IDS in two languages. They provide a particularly useful wedge in understanding experiential influences on infants' attention to IDS. Bilingual infants receive less exposure to each of their languages than monolingual infants do, and the exact proportion of exposure to each of their two languages varies from infant to infant. This divided exposure does not appear to slow the overall rate of language acquisition: Bilinguals pass their language milestones, such as the onset of babbling and the production of their first words, on approximately the same schedule as monolingual infants (Werker & Byers-Heinlein, 2008). Nonetheless, children from different language backgrounds receive different types of input and must ultimately acquire different language forms, which can alter some patterns of language acquisition (e.g., Choi & Bowerman, 1991; Slobin, 1985; Tardif, 1996; Tardif et al., 1997; Werker & Tees, 1984). As a consequence, bilingual infants allow researchers to investigate how a given "dose" of experience with a specific language relates to phenomena in language acquisition, while holding infants' age and total experience with language constant.
Aside from providing the opportunity to study dose effects, research on the preference for IDS in bilingual infants is important for the sake of understanding bilingual development itself. Several lines of research suggest that early exposure to two languages changes some aspects of early development, including bilinguals' perception of nonnative speech sounds (i.e., sounds that are in neither of their native languages). For example, a number of studies have found that bilinguals maintain sensitivity to nonnative consonant contrasts (García-Sierra et al., 2016; Petitto et al., 2012), nonnative tone contrasts (Graf Estes & Hay, 2015; Liu & Kager, 2017a), and visual differences between languages (i.e., rhythmic and phonetic information available on talkers' faces; Sebastián-Gallés et al., 2012) until a later age than monolinguals. Other studies have suggested that bilinguals' early speech perception is linked to their language dominance (Liu & Kager, 2015; Molnar et al., 2016; Sebastián-Gallés & Bosch, 2002), such that bilinguals' perception more closely matches that of monolinguals in their dominant language than that of monolinguals in their nondominant language. Bilingual infants also demonstrate some cognitive differences from monolinguals that are not specific to language, including faster visual habituation (Singh et al., 2015), better memory generalization (Brito & Barr, 2014; Brito et al., 2015), and greater cognitive flexibility (Kovács & Mehler, 2009a, 2009b). This might reflect an early-emerging difference in information processing between the two groups. Together, these lines of work raise the possibility that preference for IDS over ADS could have a different developmental course for bilingual and monolingual infants, and that bilinguals' distinct course could interact with factors such as language dominance.

Bilinguals' Exposure to and Learning From IDS
Overall, there is very little research on whether bilinguals' experience with IDS is comparable to monolinguals' experience. Some research has compared English monolinguals and English-Spanish bilinguals in the United States (Ramírez-Esparza et al., 2014). The researchers reported that bilingual infants around 1 year of age received less exposure to IDS than monolingual infants on average. Moreover, in the bilingual families, input was more evenly distributed across infant- and adult-directed registers. It is difficult to know whether the results reported in these studies generalize to other populations of bilinguals, or whether they are specific to this language community. As acknowledged by the authors, the bilinguals in this study were of a lower socioeconomic status (SES) than the monolinguals, which could have driven differences in the amount of IDS that infants heard. On the other hand, it might be the case that bilingual infants more rapidly lose their preference for the IDS register than do monolinguals, and that caregivers of bilinguals respond to this by reducing the amount of IDS input they provide.
Bilingual infants might also hear IDS that differs prosodically and phonetically from that heard by monolingual infants. Bilingual infants often have bilingual caregivers, and even when they are highly proficient speakers, their speech may vary from that of monolinguals. One study compared vowels produced in the IDS of monolingual English, monolingual French, and balanced French-English bilingual mothers living in Montreal (Danielson et al., 2014). Bilingual mothers' vowels were distinct in the two languages, and the magnitude of the difference between French and English vowels was similar to that shown by monolingual mothers. However, another study showed that, in a word-learning task, 17-month-old French-English bilinguals learned new words better from a bilingual speaker than a monolingual speaker, even though acoustic measurements did not reveal what dimension the infants were attending to (similar findings were reported in Mattock et al., 2010). Finally, a study of Spanish-Catalan bilingual mothers living in Barcelona found that some were more variable than monolingual mothers in their productions of a difficult Catalan vowel contrast (Bosch & Ramon-Casas, 2011). Thus, not only may bilingual infants differ from monolingual infants in the amount of IDS they hear in a particular language, but also different populations of bilingual infants may vary in how similar the IDS they hear is to monolingual-produced IDS in the same languages. This could, in turn, lead to greater variability across bilinguals in their preference for IDS over ADS when tested with any particular stimulus materials.
Regardless of bilingual infants' specific experience with IDS, evidence suggests that bilinguals might enjoy the same learning benefits from IDS as monolinguals. For example, Ramírez-Esparza et al. (2017) found that greater exposure to IDS predicted larger vocabulary size in both monolingual and bilingual infants. Indeed, an untested possibility is that exposure to IDS might be of particular benefit to bilingual infants. Bilinguals face a more complex learning situation than monolinguals, as they acquire two sets of sounds, words, and grammars simultaneously (Werker & Byers-Heinlein, 2008). This raises the possibility that bilingual infants might have enhanced interest in IDS relative to monolinguals, or that they might maintain a preference for IDS until a later age than monolinguals, much as bilingual infants' perception has an extended sensitivity to nonnative phonetic contrasts.

Replicability in Research With Bilingual Infants
Working with bilingual infant populations engenders unique replicability issues above and beyond those common in the wider field of infant research (e.g., between-lab variability, methodological variation; see Frank et al., 2017). These issues begin with the nature of the population. Our discussion of bilingual infants thus far has used "bilingual" as a blanket term to describe infants growing up hearing two or more languages. However, this usage belies the large variability in groups of infants described as bilingual. First, some studies of bilinguals have included infants from a homogeneous language background (i.e., all infants are exposed to the same language pair; e.g., English-Spanish in Ramírez-Esparza et al., 2017), whereas others have included infants from heterogeneous language backgrounds (i.e., infants are exposed to different language pairs, e.g., English-other, where "other" might be Spanish, French, Mandarin, Punjabi, etc.; e.g., Fennell et al., 2007). Second, some bilinguals learn two typologically closely related languages (e.g., Spanish-Catalan), whereas others learn two distant languages (e.g., English-Mandarin). Third, there is wide variability among bilingual infants in the amount of exposure to each language, which introduces an extra dimension of individual differences relative to studies with monolingual infants. Fourth, studies define bilingualism in different ways, ranging from a liberal criterion of at least 10% exposure to the nondominant language to at least 40% exposure to the nondominant language (Byers-Heinlein, 2015). Finally, bilingual and monolingual populations can be difficult to compare because of cultural, sociological, and SES differences that exist between samples.
All of the above difficulties have resulted in very few findings being replicated across different samples of bilinguals. The limited research that has compared different types of bilingual learners has indicated that the particular language pair being learned by bilingual infants influences perception of both native (Bialystok et al., 2005; Sundara & Scutellaro, 2011) and nonnative (Patihis et al., 2015) speech sounds. In contrast, other studies have not found differences between bilinguals learning different language pairs in, for example, their ability to apply speech perception skills to a word-learning task (Fennell et al., 2007). Generally, it is not known how replicable most findings are across different groups of bilinguals, or how previously reported effects of bilingualism on learning and perception are impacted by the theoretically interesting moderators discussed above.
Research on bilingual infants also faces many of the same general concerns as other infancy research, such as challenges recruiting sufficient numbers of participants to conduct well-powered studies (Frank et al., 2017). Finding an appropriate bilingual sample further limits the availability of research participants, even in locations with significant bilingual populations. Such issues are particularly relevant given the recent emphasis on replicability and best practices in psychological science (Klein et al., 2014; Open Science Collaboration, 2015; Simmons et al., 2011). Of particular interest is whether bilingual infants as a group show greater variability in their responses than monolingual infants, and how to characterize the variability of responses between the different types of samples of bilinguals that can be recruited by particular labs (e.g., homogeneous vs. heterogeneous samples). Understanding whether variability differs systematically across groups is vital for planning appropriately powered studies.

Description of the Current Study
Here, we report a large-scale, multisite, preregistered study aimed at using data from bilingual infants to understand variability in infants' preference for IDS over ADS. This study, ManyBabies 1 Bilingual, is a companion project to the ManyBabies 1 project, published in a previous issue of this journal (ManyBabies Consortium, 2020). The two studies were conducted in parallel, using the same stimuli and experimental procedure. However, whereas the ManyBabies 1 analyses included all data collected from monolingual infants (including those data from monolinguals reported here), we report on a subset of these data together with additional data from bilingual infants not included in that project's analyses. Our multisite approach gave us precision in estimating the overall effect size of bilingual infants' preference for IDS, while also allowing us to investigate how different types of language experience moderate this effect.
Our primary approach was to compare bilinguals' performance with the performance of monolinguals tested in the same lab. This approach had two notable advantages. First, within each lab, bilinguals shared one of their two languages with monolinguals (the language of the wider community). Second, testing procedures were held constant within each lab. Thus, this approach allowed us to minimize procedural confounds with infants' bilingual status. However, a disadvantage of this approach is that it left out data from monolingual infants tested in other labs (because not all laboratories provided data from bilingual infants), which could potentially have added precision to the measured effects. Thus, we performed additional analyses comparing all bilinguals with all monolinguals within the same age bins, regardless of the lab each had been tested in.
We tested bilinguals in two age windows: 6 to 9 months and 12 to 15 months. The specific age bins selected were based on a preliminary survey of participating laboratories' access to participants of different age ranges. The choice of nonadjacent age bins also increased the chances of observing developmental differences.
All infants were tested using the same stimuli, which consisted of recordings of NAE-accented IDS and ADS. Because of the international nature of this multisite project, these stimuli were native for some infants but nonnative for other infants, both in terms of the language of the stimuli (English) and the variety of IDS (NAE IDS is particularly exaggerated in its IDS characteristics relative to other varieties of IDS; see Soderstrom, 2007, for a review). Moreover, the stimuli were produced by monolingual mothers. Thus, infants' exposure to the type of stimuli used varied from low (monolinguals and bilinguals not exposed to NAE), to moderate (bilinguals learning NAE as one of their two languages), to high (monolinguals learning NAE).
Infants were tested in one of three experimental setups regularly used to test infant auditory preference: central fixation, eye tracking, and head-turn-preference procedure (HPP). The use of a particular setup was the choice of each lab, depending on their equipment and expertise. Labs that tested both monolinguals and bilinguals used the same setup for both groups. In all setups, infants heard a series of trials presenting either IDS or ADS, and their looking time to an unrelated visual stimulus (e.g., a checkerboard) was used as an index of their attention. In the central-fixation setup, infants sat in front of a single screen that displayed a visual stimulus, and their looking time to this visual stimulus while an auditory stimulus was played was coded via button press using a centrally positioned camera. Looking time was recorded similarly in the eye-tracking setup, except that infants' looking was coded automatically using a corneal-reflection eye tracker. In the HPP (see Kemler Nelson et al., 1995), infants sat in the middle of a room facing a central visual stimulus. Their attention was drawn to the left or right side of the room by a visual stimulus while the auditory stimulus played, and the duration of their looking to the visual stimulus was measured via button press using a centrally positioned camera.

Research Questions
We identified three basic research questions to be addressed by this study. Note that it was not always possible to make specific predictions given the very limited data on infants' cross-language preferences for IDS over ADS, and particularly the absence of data from bilingual infants. We also note that the ManyBabies 1 project, focusing on monolingual infants, addressed other more general questions, such as the average magnitude of the IDS preference, changes in preference over age, and the effects of methodological variation on IDS preference (ManyBabies Consortium, 2020). The main questions we planned to address using data from bilingual infants were:
• How does bilingualism affect infants' interest in IDS relative to ADS? As described above, monolingual infants display an early preference for IDS that grows in strength at least through the first year of life. We anticipated that the bilingual experience might result in a different pattern of IDS preference; however, the direction and potential source of any difference were difficult to predict. For example, the more challenging nature of early bilingual environments might induce an even greater preference for IDS over ADS than is found in monolinguals. This enhanced preference could be shown across development, or might be observed only at certain ages. On the other hand, given some evidence that parents of bilingual infants produce relatively less IDS than parents of monolingual infants, it may be that bilinguals show less interest in IDS than monolinguals. We also explored the following questions concerning potential sources for an emerging difference between populations: If an overall difference between monolingual and bilingual infants' preference for IDS is observed, can this be accounted for by systematic differences in SES? Do bilinguals show greater variability in their preference for IDS than monolinguals?
• How does the amount of exposure to NAE IDS affect bilingual infants' listening preferences? Although we expected infants across different language backgrounds to show greater interest in IDS over ADS, we investigated whether this effect was moderated by the amount of exposure to NAE. For monolinguals, this exposure was either 100% (monolingual learners of NAE) or 0% (monolingual learners of other languages). For bilinguals, some infants had 0% exposure to NAE IDS (e.g., bilingual infants learning Spanish and Catalan), whereas others had a range of different exposures (e.g., bilingual infants learning NAE and French). This allowed us to at least partially disentangle dose effects of exposure to NAE IDS from infants' bilingualism. An additional possibility was that infants' exposure to NAE would predict overall attention to both infant-directed and adult-directed NAE, but have no differential effects on interest in IDS versus ADS. Alternately, it was possible that our stimuli would be equally engaging to infants regardless of their experience with NAE.
• Finally, we had planned to ask how bilingual infants' listening to NAE IDS and ADS is impacted by the particular language pair being learned. We intended to ask this question at both the group and the individual level. At the group level, we planned to investigate whether different patterns of overall preference for IDS and group-level variability would be seen in homogeneous versus heterogeneous samples of bilinguals. However, ultimately, we had insufficient homogeneous samples to address this question. At the individual level, we were interested in how the particular language pair being learned modulated infants' preference for IDS. As we did not know a priori what language pairs would have a sufficient sample size for analysis, this was considered a potential exploratory analysis. Ultimately, given the nature of our main results and the diverse language backgrounds of our final sample, we decided to leave this question open for future investigations.

Preregistration
The accepted Stage 1 version of this manuscript and the analysis plan were preregistered via OSF, at https://osf.io/wtfuq.

Data, materials, and online resources
Study instructions and other details are available at the ManyBabies 1 Bilingual OSF site, https://osf.io/zauhq/, and materials are available via the ManyBabies 1 OSF site, https://osf.io/re95x/. Labs submitted anonymized data for central analysis that identified participants by code only. Data and analytic code are available at https://github.com/manybabies/mb1b-analysis-public. Video recordings of individual participants were coded and stored locally at each lab, and when possible were uploaded to a central controlled-access data bank accessible to other researchers (https://databrary.org).

Reporting
We report how we determined our sample size, all data exclusions, all manipulations, and all measures in our study.

Ethical approval
This research was carried out in accordance with the provisions of the World Medical Association Declaration of Helsinki. Each lab followed the ethical guidelines and ethics-review-board protocols of their own institution.

Participation details
Our monolingual sample originated from the ManyBabies 1 project (ManyBabies Consortium, 2020). Here we report some basic information about that sample (the reader is referred to the original study for further details) and focus primarily on the bilingual sample.

Time frame. An open call for labs to participate was issued on February 2, 2017. Participant testing began on May 1, 2017. Testing for monolinguals ended on April 30, 2018. Because of the additional difficulty of recruiting bilingual samples, the end date for collection of these data was extended by 4 months, to August 31, 2018. Because of a miscommunication, one lab continued testing beyond this deadline but prior to data analysis, and these data were included in the final sample.
Age distribution. Labs contributing data from bilingual infants were asked to test participants in at least one of the two (but preferably both) age bins: 6-to 9-month-olds (i.e., ages 6 months 1 day through 9 months 0 days) and 12-to 15-month-olds (i.e., ages 12 months 1 day through 15 months 0 days). Labs were asked to aim for a mean age at the center of the bin, with distribution across the entire age window. Some labs chose to test additional infants outside the target age ranges for future exploratory analyses; these infants were excluded from the current study.
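The age-bin boundaries described above can be sketched as a small lookup. This is a minimal illustration, not the project's analysis code: the representation of age as a (months, days) pair, the bin labels, and the function name `age_bin` are assumptions introduced here.

```python
from typing import Optional, Tuple

# Hypothetical representation: (completed months, additional days).
Age = Tuple[int, int]

# Bin limits as stated in the text: 6 months 1 day through 9 months 0 days,
# and 12 months 1 day through 15 months 0 days (both ends inclusive).
AGE_BINS = {
    "6-9 months": ((6, 1), (9, 0)),
    "12-15 months": ((12, 1), (15, 0)),
}


def age_bin(age: Age) -> Optional[str]:
    """Return the label of the target bin containing `age`, or None."""
    for label, (lower, upper) in AGE_BINS.items():
        # Tuples compare element by element: months first, then days.
        if lower <= age <= upper:
            return label
    return None  # outside both target ranges, so excluded from this study


print(age_bin((6, 1)))    # youngest age in the first bin
print(age_bin((10, 15)))  # falls between the two bins
```

Under these assumptions, an infant aged exactly 6 months 0 days or 9 months 1 day falls outside the first bin.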
Lab participation criterion. Considering the challenges associated with recruiting bilingual infants and the importance of counterbalancing in our experimental design, we asked labs to contribute data from a minimum of 16 infants per age and language group (note that infants who met inclusion criteria for age and language exposure but were ultimately excluded for other reasons counted toward this minimum). We expected that requiring a relatively low minimum number of infants would encourage more labs to contribute a bilingual sample, and under our statistical approach, a larger number of groups was more important than a larger number of individuals (Maas & Hox, 2005). However, labs were encouraged to contribute additional data provided that decisions about when to stop data collection were made ahead of time (e.g., by declaring intended start and end dates before data collection). A sensitivity analysis showed that with a sample of 16 infants and assuming the average effect size of similar previous studies (Cohen's d = 0.70; Dunst et al., 2012; data available through the MetaLab database), individual labs would have 74% power to detect a preference for IDS in a paired-samples t test (α = .05, one-tailed). Assuming a smaller effect size of 0.60, a conservative estimate based on our previously mentioned analysis of 22 studies most similar to ours, individual labs' power would be 61%. The moderate statistical power that individual labs would have to detect this effect highlights the importance of our approach to combine data across labs. We note that some labs were unable to recruit their planned minimum sample of 16 bilingual infants who met our inclusion criteria in the time frame available, a point we return to later.
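For readers who want to see how per-lab power of this kind can be estimated, the following is a rough Monte Carlo sketch. It is not the authors' sensitivity analysis. It assumes that Cohen's d is expressed in units of the standard deviation of the paired IDS-minus-ADS differences and that those differences are normally distributed. The function name `simulated_power` is hypothetical, and the critical values are taken from standard t tables for 15 degrees of freedom. Estimated power depends on these assumptions and on whether the test is one- or two-tailed, so the sketch is not expected to reproduce the reported percentages exactly.

```python
import math
import random

N = 16  # infants per lab, as in the text (df = N - 1 = 15)

# Critical t values for df = 15 at alpha = .05, from standard t tables.
T_CRIT_ONE_TAILED = 1.753
T_CRIT_TWO_TAILED = 2.131


def simulated_power(d, one_tailed=True, reps=10000, seed=1):
    """Estimate the power of a paired-samples t test by simulation.

    Assumption: each infant's IDS-minus-ADS difference score is drawn
    from a normal distribution with mean d and standard deviation 1.
    """
    rng = random.Random(seed)
    crit = T_CRIT_ONE_TAILED if one_tailed else T_CRIT_TWO_TAILED
    hits = 0
    for _ in range(reps):
        diffs = [rng.gauss(d, 1.0) for _ in range(N)]
        mean = sum(diffs) / N
        var = sum((x - mean) ** 2 for x in diffs) / (N - 1)
        t = mean / math.sqrt(var / N)  # one-sample t on the differences
        if one_tailed:
            hits += t > crit
        else:
            hits += abs(t) > crit
    return hits / reps


for d in (0.70, 0.60):
    print(d, simulated_power(d), simulated_power(d, one_tailed=False))
```

Under any of these settings a single lab of 16 infants has only moderate power, which is the point the text makes about pooling data across labs.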
Labs were asked to screen infants ahead of time for inclusion criteria, typically by briefly asking about language exposure over the phone. Despite this screening process, some infants who arrived in the lab for testing fell between the criteria for monolingual and bilingual status based on the comprehensive questionnaire. In such cases, the decision whether to test the infant was left up to individual laboratories' policy, but we asked that data from any babies who entered the testing room be submitted for data processing (even though some such data might be excluded from the main analyses).

Participants
Defining bilingualism. Infants are typically categorized as bilingual on the basis of their parent-reported relative exposure to their languages. However, studies vary considerably in terms of inclusion criteria for the minimum exposure to the nondominant language, which in previous studies has ranged from 10% to 40% of infants' total exposure (Byers-Heinlein, 2015). Some bilingual infants may also have some exposure to a third or fourth language. Finally, infants can vary in terms of when the exposure to their additional languages began, which can be as early as birth or any time thereafter. We aimed to take a middle-of-the-road approach to defining bilingualism, attempting to balance a need for experimental power with a need for interpretable data.
Thus, we asked each participating lab to recruit a group of simultaneous bilingual infants who were exposed to each of two languages between 25% and 75% of the time and who had regular exposure to both languages beginning within the first month of life. There was no restriction as to whether infants were exposed to additional languages; thus, some infants could be considered multilingual (although we continue to use the term bilingual throughout this article). These criteria would include, for example, an infant with 40% English, 40% French, and 20% Spanish exposure, but would exclude an infant with 20% English, 70% French, and 10% Spanish exposure. We also asked labs to recruit a sample of bilingual infants who shared at least one language, namely the community language being learned by monolinguals tested in the same lab. Labs in bilingual communities (e.g., Barcelona, Ottawa, Montréal, Singapore) were free to decide which community language to select as the shared language. Within this constraint, most labs opted to test heterogeneous groups of bilinguals; for example, among the English-other bilinguals for whom English was the community language, the other language might be French, Spanish, Mandarin, or some other language. Only one lab tested a homogeneous group of bilinguals (in this case, all infants were learning English and Mandarin), although we had expected that more labs would test homogeneous samples, given that both heterogeneous and homogeneous samples are used regularly in research with bilingual infants. Because only one homogeneous sample was tested, we were not able to conduct planned analyses examining the impact of this type of sample on our results. Infants who were tested but did not meet inclusion criteria for the bilingual group (e.g., because they did not hear enough of their nondominant language or did not hear enough of the community language) were excluded from the main analyses, but retained for exploratory analyses when appropriate.
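The exposure rule can be illustrated with a minimal sketch. It assumes that exposure is given as parent-reported percentages per language and that an infant qualifies when at least two languages each fall between 25% and 75%. The function name and data layout are hypothetical, and the requirements concerning onset of exposure within the first month of life and the shared community language are not modeled here.

```python
def meets_exposure_criterion(exposure, min_pct=25.0, max_pct=75.0):
    """Return True if at least two languages each fall within [min_pct, max_pct].

    `exposure` maps a language name to its percentage of total exposure.
    This is an illustrative reading of the criterion, not the labs' actual code.
    """
    in_range = [lang for lang, pct in exposure.items() if min_pct <= pct <= max_pct]
    return len(in_range) >= 2


# The two examples given in the text
included = meets_exposure_criterion({"English": 40, "French": 40, "Spanish": 20})
excluded = meets_exposure_criterion({"English": 20, "French": 70, "Spanish": 10})
print(included, excluded)
```

In the second example only French falls inside the 25% to 75% window, so the infant does not qualify even though three languages are heard.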
Assessing bilingualism. Each lab was asked to use a detailed day-in-the-life parental interview questionnaire to quantify the percentage of time that infants were exposed to each language. This approach has been shown to predict bilingual children's language outcomes better than a one-off parental estimate (DeAnda et al., 2016). Moreover, recent findings based on day-long home language recordings show that caregivers can reliably estimate their bilingual child's relative exposure to each language (Orena et al., 2020). Labs were also asked to pay special attention to whether infants had exposure to NAE (based on caregivers' report of the variety of English spoken to their infants), and if so, which caregiver (or caregivers) this input came from. As most of the labs contributing bilingual data had extensive expertise in assessing bilingual language background, we encouraged each lab to use whatever measurement instrument was normally used in their lab (details of the assessment instruments, including source references for most measures, are outlined below). When possible, labs conducted the interview in the parents' language of choice and documented whether the parents' preferred language was able to be used.
Although standardization of measurement tools is often desirable, we reasoned that different questions and approaches might be best for eliciting information from parents in different communities and from different cultures. Indeed, many labs reported that their own instruments had undergone considerable refinement over the years as a function of their experience working with the families in their communities. However, in order to maximize the overall sample size and the diversity of bilingual groups tested, we encouraged participation from laboratories without extensive experience testing bilingual infants. Labs that did not have an established procedure were paired with more experienced labs working with similar communities to refine a language assessment procedure. Twelve of the labs administered a structured interview-style questionnaire based on the one developed by Sebastián-Gallés (1997, 2001; for examples of the measure, see the online supplementary materials of Byers-Heinlein et al., 2019, and DeAnda et al., 2016), and the remaining five labs administered other questionnaires. We describe each of these approaches in detail below.
The Sebastián-Gallés (1997, 2001) questionnaire is typically referred to in the literature as the Language Exposure Questionnaire (LEQ; e.g., Byers-Heinlein et al., 2013) or the Language Exposure Assessment Tool (LEAT; DeAnda et al., 2016). For simplicity, we use the former name here. Administration of the questionnaire takes the form of a parental interview, in which a trained experimenter systematically asks at least one of the infant's primary caregivers detailed questions about the infant's language environment. The interviewer obtains an exposure estimate for each person who is in regular contact with the infant, defined as a minimum contact of once a week. The caregiver gives an estimate of how many hours each of these people speaks to the infant in each language on each of the days of the week (e.g., weekdays and weekends may differ depending on work commitments). Further, the caregiver is asked if the language input from each regular-contact person has been similar across the infant's life history. If not, such as in the case of a caregiver returning to work after parental leave or leaving for an extended stay in another country, an estimate is derived for each different period of the infant's life span. The interviewer also asks the caregiver about the language background of each person with regular contact with the infant (as defined above), asking what languages that person speaks and whether he or she is a native speaker of those languages. The caregiver also gives an estimate of language exposure in the infant's day-care facility, if applicable.
Finally, the caregiver gives a global estimate of the infant's percentage of exposure to the two languages, which includes input from those people in regular contact with the infant and other people with whom the infant has less regular contact (e.g., playgroups, friends of caregivers). This global estimate does not include input from television or radio, as such sources have no known positive impact, and may even have a negative impact on monolingual and bilingual language development in infancy (see Hudon et al., 2013). The primary estimate of the infant's cumulative percentage of exposure in each language is derived from a time-weighted average of input from the primary individuals in the infant's life. Some labs use the global estimate simply to confirm these percentages. Other labs average the primary and global exposure to take into account all language exposure while still giving more weight to the primary individuals. Also, some labs asked additional questions, for example, about videoconferencing with relatives, caregivers' possible mixing of languages when speaking to the infant, or caregivers' cultural background. Finally, although the original form was pen and paper, there have been adaptations, which include using a form-fillable Excel sheet (DeAnda et al., 2016).
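As a rough illustration of a time-weighted average of this kind, the sketch below sums reported hours per language across regular-contact speakers, weights each life period by its length in weeks, and converts the totals to percentages. The data layout, the function name, and the choice of weeks as the weighting unit are assumptions made for illustration; individual labs' scoring sheets may differ.

```python
def cumulative_exposure(periods):
    """Time-weighted percentage of exposure per language.

    `periods` is a list of (weeks, speakers) tuples, where `speakers` maps a
    speaker label to a dict of {language: hours_per_week}. Hypothetical layout.
    """
    totals = {}
    for weeks, speakers in periods:
        for hours_by_language in speakers.values():
            for language, hours in hours_by_language.items():
                # Weight each period's weekly hours by the period's duration
                totals[language] = totals.get(language, 0.0) + weeks * hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no language input reported")
    return {lang: 100.0 * hours / grand_total for lang, hours in totals.items()}


# Hypothetical infant: input changes when the mother returns to work
periods = [
    (20, {"mother": {"English": 40}, "father": {"French": 10}}),
    (10, {"mother": {"English": 20}, "father": {"French": 10},
          "day care": {"French": 20}}),
]
percentages = cumulative_exposure(periods)
print(percentages)
```

For this hypothetical infant, English accounts for 1,000 of 1,500 weighted hours (about 67%) and French for the remaining 500 (about 33%).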
For the other language-exposure measures used by five of the labs, we simply highlight the differences from the LEQ, as there is much overlap among all the instruments used to measure infants' exposure to their languages. Two labs used custom assessment measures they designed themselves. The major difference from the LEQ for the first of these custom measures is that parents estimate the percentage of exposure for each language from primary individuals in the infant's life, rather than the number of hours of exposure per day in each language. The other custom measure, unlike the LEQ, specifies estimates of language exposure in settings where more than one speaker is present by weighting each speaker's language contribution. Two other labs used other measures present in the literature: One used the Multilingual Infant Language Questionnaire (MILQ; Liu & Kager, 2017b), and the other used an assessment measure designed by Cattani et al. (2014). For the MILQ, one major difference from the LEQ is that parents complete the assessment directly using an Excel sheet with clear instructions. The other major difference is that the MILQ is much more detailed: It breaks down language exposure to very specific activities (e.g., car time, book reading, mealtime), asks for more detail about the people in regular contact with the infant (e.g., accented speech, level of talkativeness), and asks for estimates of media exposure (e.g., TV, music). The measure from Cattani et al. focuses on parental exposure and uses Likert scales to determine exposure from each parent. The ratings are converted to percentages, and maternal exposure is weighted more in the final calculation because of data showing that mothers are more verbal than fathers. Finally, one lab did not use a detailed measure, but rather simply asked parents to give an estimate of the percentage exposure to each of the languages their infant was hearing.
For monolinguals, labs either used the same assessment as with bilinguals or minimally checked participants' monolingual status by asking parents a single question: to estimate the percentage of time that their infant was exposed to their native language. Under either approach, if that estimate exceeded 90% exposure to a single language, the infant was considered monolingual.
Demographics. Each lab administered a questionnaire that gathered basic demographic data about the infants, including age, health history, and gestation. Infants' SES was measured via parental report of years of maternal education. To standardize the data across different education systems, where formal schooling may begin at different ages, we counted the number of years of education after kindergarten. For example, in the United States, mothers who had completed high school were considered to have 12 years of education.
Final sample. The final sample of bilinguals who met our infant-level inclusion criteria included 333 infants tested in 17 labs; 148 were 6 to 9 months old, and 185 were 12 to 15 months old (a full account of exclusions is detailed in the Results section). These 17 labs also collected data from 384 monolingual infants who met infant-level inclusion criteria, of whom 181 were 6 to 9 months old and 203 were 12 to 15 months old. Although all analyses required that the data met the infant-level inclusion criteria, some analyses further required that the data met the lab-level inclusion criteria (lab-level inclusion criteria are discussed in the Results section in connection with the specific analyses for which they were implemented). Data from monolingual infants in these age ranges were available from 59 additional labs (6- to 9-month-olds: n = 574; 12- to 15-month-olds: n = 463) that did not contribute bilingual data. Table 1 provides summary information describing the bilingual infants and lab-matched monolingual samples. The appendix details the gender distributions across subsamples (Table A1) and the language pairs being learned by bilingual infants (Table A2).

Materials
Visual stimuli. Labs using a central-fixation or eyetracking method presented infants with a brightly colored checkerboard as the main visual stimulus. A video of a laughing baby was used as an attention getter between trials to reorient infants to the screen. Labs using the HPP used the typical visual stimuli employed in their labs, which were sometimes light bulbs (as in the original development of the procedure in the 1980s) or sometimes colorful stimuli presented on LCD screens. All visual stimuli are available via the ManyBabies 1 OSF site at osf.io/re95x/.
Auditory stimuli. Auditory stimuli consisted of seminaturalistic recordings of mothers interacting with their infants (ranging in age from 122 to 250 days) in a laboratory setting. Mothers were asked to talk about a set of objects with their infant and also separately with an experimenter. A set of eight IDS and eight ADS auditory stimuli of 18 s each was created from these recordings. Details regarding the recording and selection process, acoustic details, and ratings from naive adult listeners can be found in the ManyBabies 1 report (ManyBabies Consortium, 2020) and the associated OSF project at osf.io/re95x.
Note (Table 1): The labs are listed in order of their samples' exposure to North American English (NAE). Because of lab-level inclusion criteria, cells with fewer than 10 participants were excluded from the meta-analyses, but were included in the mixed-effects regression analyses. Labs that tested only monolingual infants are not listed. HPP = head-turn-preference procedure.

Procedure
Each lab used one of three procedures common in infant studies, according to their own expertise and the experimental setups available in the lab: central fixation (three labs), eye tracking (seven labs), or HPP (seven labs). The testing procedure was identical to that used in the ManyBabies 1 project (ManyBabies Consortium, 2020; deviations from the protocol are also described there), and we describe key aspects here. Infants sat on a parent's lap or in a high chair, and parents listened to masking music over headphones throughout the study. Infants saw two training trials that presented an unrelated auditory stimulus (piano music), followed by 16 test trials that presented either IDS or ADS. Trials were presented in one of four pseudorandom orders that counterbalanced the order of presentation of the two stimulus types. Note that within each order, specific IDS and ADS clips were presented adjacently in yoked pairs to facilitate analyses. On each trial, the auditory stimulus played until the infant looked away for 2 consecutive seconds (in labs that implemented an infant-controlled procedure) or until the entire stimulus played, up to 18 s (in labs that implemented a fixed-trial-length procedure). The implementation of the procedure depended on the software that was available in each lab. Trials with less than 2 s of looking were excluded from analyses. Attention-grabbing visual stimuli were presented centrally between trials to reorient infants to the task.
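The trial-termination and trial-exclusion rules described above can be sketched as follows. The sketch assumes that a trial is coded as an ordered list of looking and looking-away segments; the function name, the data layout, and the handling of short look-aways (they count toward elapsed time but do not end the trial) are assumptions for illustration, since the actual implementation depended on each lab's software.

```python
MAX_TRIAL_S = 18.0           # full stimulus length
LOOKAWAY_CRITERION_S = 2.0   # consecutive look-away that ends an infant-controlled trial
MIN_LOOKING_S = 2.0          # trials with less looking are excluded from analysis


def score_trial(segments, infant_controlled=True):
    """Return (looking_time_s, usable) for one trial.

    `segments` is an ordered list of (duration_s, is_looking) tuples.
    """
    elapsed = 0.0
    looking = 0.0
    for duration, is_looking in segments:
        remaining = MAX_TRIAL_S - elapsed
        if remaining <= 0:
            break  # the stimulus has finished playing
        duration = min(duration, remaining)
        if is_looking:
            looking += duration
        elif infant_controlled and duration >= LOOKAWAY_CRITERION_S:
            break  # look-away criterion reached; the trial ends here
        elapsed += duration
    return looking, looking >= MIN_LOOKING_S


segments = [(5.0, True), (1.0, False), (4.0, True), (3.0, False), (6.0, True)]
print(score_trial(segments, infant_controlled=True))
print(score_trial(segments, infant_controlled=False))
```

Under the infant-controlled rule the 3-s look-away ends the trial, whereas under the fixed-length rule looking continues to accumulate until the 18-s limit.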
The main differences between the setups were the type and position of visual stimuli presented and their onset relative to the auditory stimuli. For central-fixation and eye-tracking procedures, infants saw a checkerboard on a central monitor, and the presentation of the checkerboard coincided with the onset of the auditory stimulus on each trial. For the HPP, the visual stimulus (either flashing light bulbs or a colorful stimulus on a monitor) was presented silently in the center of the room and then on one side until the infants turned their head toward the side stimulus, at which point the auditory stimulus began playing.
The dependent variable was looking time to the visual stimulus during each trial. For eye-tracking setups, looking time was measured automatically via corneal reflection. For central-fixation and HPP setups, looking time was measured by trained human coders who were blind to trial type, according to the lab's standard procedures.
Parents completed questionnaires about participants' demographic and language background either prior to or after the main experiment.

Analysis overview
Data exclusion. Labs were asked to submit all data collected as part of the bilingual study to the analysis team, and this section focuses on exclusions for the bilingual sample. The initial data set contained 501 infants, of which 333 met each of the inclusion criteria, which are detailed below. We note that exclusions were applied sequentially (i.e., percentages reflect exclusions among the remaining sample after previous criteria were applied).
• Full term. We defined full term as gestation times greater than or equal to 37 weeks. There were 9 (1.80%) infants who were tested but excluded because they were preterm.
• No diagnosed developmental disorders. We excluded infants whose parents reported developmental disorders (e.g., chromosomal abnormalities) or who were diagnosed with hearing impairments. There were 2 (0.41%) infants who were tested but excluded for these reasons. Because of concerns about the accuracy of parent reports, we did not plan exclusions based on parent-reported ear infections unless parents reported medically confirmed hearing loss.
• Age. We included infants in two age groups: 6- to 9-month-olds and 12- to 15-month-olds. There were 58 (11.84%) infants who were tested in the paradigm but who fell outside our target ages. Some labs chose to test such infants for future exploratory analyses, knowing they would be excluded from the current project.
• Bilingualism. We excluded infants whose language background did not meet our predefined criteria for bilingualism (see above for details). There were 70 (16.20%) infants whose exposure did not meet this criterion. We also excluded seven (1.93%) additional infants who met this criterion, but who were not learning the community language as one of their languages.
• Session-level errors. There were 14 (3.94%) participants excluded because of session-level errors. Seven were excluded for equipment error, three for experimenter error, and four for outside interference.
• Adequate trials for analysis. A total of 855 (13.98%) trials were excluded because of errors such as fussiness, presentation of an incorrect stimulus, or a single instance of interference by a parent or sibling. There was one infant who did not have any trials left for analysis once such trials were excluded. Next, we excluded seven (2.06%) infants who did not have at least one IDS-ADS trial pair available for analysis. For infants with at least one good trial pair, we additionally excluded any trial with less than 2 s of looking (n = 876 trials; 16.92% of trials), which was set as a trial-level minimum so that infants had heard enough of the stimulus to discriminate IDS from ADS. As infants did not have to complete the entire experiment to be included, this meant that different infants contributed different numbers of trials. On average, infants contributed 15.70 trials to the analysis.
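Because exclusions were applied sequentially, each reported percentage is relative to the sample remaining after the preceding steps. The short sketch below reproduces that arithmetic from the infant-level counts reported above; the function name and step labels are ours, and the percentage for the single infant with no remaining trials is computed here rather than reported in the text.

```python
def sequential_exclusions(initial_n, steps):
    """Apply exclusion counts in order.

    Returns one row per step: (label, n_excluded, percent_of_remaining, n_left).
    """
    remaining = initial_n
    rows = []
    for label, n_excluded in steps:
        pct = round(100.0 * n_excluded / remaining, 2)
        remaining -= n_excluded
        rows.append((label, n_excluded, pct, remaining))
    return rows


steps = [
    ("preterm", 9),
    ("developmental disorder or hearing impairment", 2),
    ("outside target age range", 58),
    ("did not meet bilingual exposure criterion", 70),
    ("not learning the community language", 7),
    ("session-level error", 14),
    ("no trials left after trial-level exclusions", 1),
    ("no IDS-ADS trial pair available", 7),
]
rows = sequential_exclusions(501, steps)
for label, n_excluded, pct, n_left in rows:
    print(f"{label}: {n_excluded} excluded ({pct:.2f}%), {n_left} remaining")
```

Starting from 501 infants, the steps leave 333 infants, which matches the final bilingual sample.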
Data-analysis framework. Our primary dependent variable of interest was looking time, which was defined as the time spent fixating on the visual stimulus during test trials. Given evidence that looking times are nonnormally distributed, we log-transformed all looking times prior to statistical analysis in the mixed-effects model (Csibra et al., 2016). We refer to this transformed variable as "log looking time." For the meta-analysis, we analyzed effect sizes computed from raw difference scores, which did not require log transformation. We preregistered a set of analyses to examine whether monolinguals, heterogeneous samples of bilinguals, and homogeneous samples of bilinguals showed different levels of variability. Unexpectedly, only one lab tested a homogeneous sample of bilinguals, so we deviated from our original plan and did not analyze data as a function of whether our bilingual groups were homogeneous versus heterogeneous. For the main analyses, we adopted two complementary data-analytic frameworks parallel to those used in the ManyBabies 1 project (ManyBabies Consortium, 2020): meta-analysis and mixed-effects regression. Under the meta-analytic framework, data from each sample of infants (e.g., 6- to 9-month-old bilinguals from Lab 1) were characterized by (a) the sample's effect size (Cohen's d) and (b) its standard deviation. Effect-size analyses addressed questions about infants' overall preference for IDS, whereas group-based standard deviation analyses addressed questions about whether some groups of infants showed higher variability in their preference than others. Note that meta-analyses of intragroup variability are relatively rare (Nakagawa et al., 2015; Senior et al., 2016). Unfortunately, our preregistration did not account for the eventuality that several labs would contribute very small numbers of infants to certain groups, as each lab had committed to a minimum sample of 16 infants per group.
In two cases, a lab contributed data with a single infant in a particular language group, so it was impossible to compute an effect size. Thus, we implemented a lab-level inclusion criterion for the meta-analysis such that each effect size was computed only if the lab had contributed data from at least 10 infants in that particular language group and age bin. For example, if Lab A had contributed data from seven bilingual 6- to 9-month-old infants and 15 monolingual 6- to 9-month-old infants, we computed the effect size for the monolingual group, but not for the bilingual group. This criterion ensured that each effect size computed was based on a reasonable sample size (i.e., a minimum of 10 infants) and also was consistent with the lab-level inclusion criteria in the ManyBabies 1 study. Because this exclusion criterion was not part of the preregistration, we also ran a robustness analysis with a looser minimum of five infants, which yielded very similar findings (analysis code and results can be found in our GitHub repository).
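The lab-level criterion amounts to counting infants per combination of lab, age bin, and language group and keeping only the sufficiently large cells. The following sketch illustrates this with the Lab A example; the function name and the tuple layout are hypothetical and are not taken from the project's analysis code.

```python
from collections import Counter


def eligible_groups(infants, min_infants=10):
    """Return the (lab, age_bin, language_group) cells with enough infants.

    `infants` holds one (lab, age_bin, language_group) tuple per infant.
    """
    counts = Counter(infants)
    return {group for group, n in counts.items() if n >= min_infants}


# Lab A example from the text: 7 bilingual and 15 monolingual 6- to 9-month-olds
infants = ([("A", "6-9", "bilingual")] * 7) + ([("A", "6-9", "monolingual")] * 15)
main_groups = eligible_groups(infants)                     # criterion of 10
robustness_groups = eligible_groups(infants, min_infants=5)  # looser minimum of 5
print(main_groups)
print(robustness_groups)
```

With the minimum of 10, only the monolingual cell yields an effect size; with the looser minimum of five used in the robustness analysis, both cells would be retained.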
An advantage of the meta-analytic approach is that it is easy to visualize lab-to-lab differences. Further, the meta-analytic framework most closely mirrors the typical single-lab approach for studying monolingual-bilingual differences, which usually compares groups of monolingual and bilingual infants tested within the same lab. We used this approach specifically to test the overall effect of bilingualism on the magnitude of infants' preference for IDS over ADS and the possible interaction of bilingualism with age. We also compared standard deviations for the bilingual group and monolingual group in a meta-analytic approach. This analysis closely followed Nakagawa et al. (2015).
Under the mixed-effects regression model, trial-by-trial data from each infant were submitted for analysis. Further, independent variables of interest could be specified on an infant-by-infant basis. This approach had the advantage of potentially increasing statistical power, as data were analyzed at a more fine-grained level of detail. As with the meta-analytic approach, this analysis tested the effect of bilingualism and its potential interactions with age. We also investigated whether links between bilingualism and IDS preference were mediated by SES. Additionally, this approach allowed us to assess how the amount of exposure to NAE IDS, measured as a continuous percentage, affected infants' listening preferences. Note that for this analysis, unlike for the meta-analysis, we did not need to apply a lab-level inclusion criterion, which maximized our sample size. Thus, data from all infants who met the infant-level criteria were included in this analysis, which resulted in slightly different sample sizes under the meta-analytic and mixed-effects approaches.
Under both frameworks, we used a dual analysis strategy to investigate how infants' IDS preference was related to bilingualism. First, we examined the lab-matched subset of data (i.e., data from labs that contributed both a monolingual sample and a bilingual sample at a particular age). Second, we examined the complete set of data, including data from labs that contributed both monolingual and bilingual samples, as well as additional data from labs that tested only monolinguals at the ages of interest as part of the larger ManyBabies 1 project.

Confirmatory analyses
Meta-analytic approach. This approach focused on the analysis of group-level data sets. We defined a data set as data from a group of at least 10 infants who were tested in the same lab, were of the same age (either 6 to 9 or 12 to 15 months old), and had the same language background (monolingual or bilingual). For analyses of within-group variability, we compared bilingual infants with monolingual infants.
To estimate an effect size for each data set, we first computed individual infants' preference for IDS over ADS by (a) subtracting looking time to the ADS stimulus from looking time to the IDS stimulus within each yoked trial pair and (b) then computing a mean difference score for each infant. A total of 20.58% of trials in the lab-matched data set and 13.78% of trials in the full data set were missing. Trial pairs that had either one or both trials missing were excluded from the analysis; 34.53% of pairs in the lab-matched data set and 25.41% of pairs in the full data set were excluded for this reason. Note that we expected many infants to have missing data, particularly on later test trials, given the length of the study (16 test trials). Then, for each data set, we calculated the mean of these difference scores ($M_d$) and its associated standard deviation ($SD$) across participants. Finally, we used the derived mean and standard deviation to compute a within-subjects Cohen's d using the formula $d_z = M_d / SD$.
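The steps above can be sketched in a few lines. The sketch assumes that each infant's data are stored as a list of (IDS, ADS) looking times in seconds, with None marking a missing trial, and that the standard deviation is the sample standard deviation (n − 1 denominator); both the layout and the function names are illustrative assumptions rather than the project's analysis code.

```python
from statistics import mean, stdev


def infant_difference_score(pairs):
    """Mean IDS-minus-ADS looking time over complete yoked pairs.

    Pairs with one or both trials missing (None) are dropped; returns None
    if no complete pair remains.
    """
    diffs = [ids - ads for ids, ads in pairs if ids is not None and ads is not None]
    return mean(diffs) if diffs else None


def cohens_dz(infant_scores):
    """Within-subjects effect size: mean difference score divided by its SD."""
    scores = [s for s in infant_scores if s is not None]
    return mean(scores) / stdev(scores)


# One hypothetical infant with a missing ADS trial in the second pair
score = infant_difference_score([(10.0, 8.0), (6.0, None), (5.0, 5.0)])

# Hypothetical mean difference scores (in seconds) for five infants in one data set
dz = cohens_dz([2.0, 0.0, 1.0, 3.0, -1.0])
print(score, round(dz, 3))
```

In the hypothetical data set the mean difference is 1.0 s and the standard deviation is about 1.58 s, giving an effect size of about 0.63.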
In the following meta-analyses, random-effects meta-analysis models with a restricted maximum likelihood estimator were fit with the metafor package (Viechtbauer, 2010). To account for the dependence between monolingual and bilingual data sets stemming from the same lab, we added laboratory as a random factor. As part of our preregistered analyses, we planned to include method as a moderator in this analysis if it was found to be a statistically significant moderator in the larger ManyBabies 1 project, which it was (ManyBabies Consortium, 2020). However, only 17 labs contributed data from bilinguals, and we deviated from this plan because of the small number of labs per method (e.g., only three labs used a central-fixation method).
Effect-size-based meta-analysis. Our first set of meta-analyses focused on effect sizes ($d_z$): how our variables of interest contributed to differences in looking time on IDS versus ADS trials. Recall that we ran the analyses in two ways: The first analysis was restricted to the labs that contributed lab-matched data (lab-matched data set), and the second analysis included all available data (i.e., including data from labs that tested only monolinguals or only bilinguals at the ages of interest; full data set).
We initially fit the following model to examine contributions of age and bilingualism to infants' IDS preference, as well as potential interactions between these variables:
$$d_z \sim 1 + \text{bilingual} + \text{age} + \text{bilingual} * \text{age}$$
Bilingualism was dummy coded (0 = monolingual, 1 = bilingual), and age (a continuous variable) was coded as the average age for each lab's contributed sample for each language group (centered for ease of interpretation).
In the lab-matched data set, we did not find statistically significant effects of age ( As bilingualism is the key moderator of research interest in the current article, here we report the effect sizes of monolingual and bilingual infants separately. In the lab-matched data set, the effect size for monolinguals was 0.42 (95% CI = [0.21, 0.63], z = 3.94, p < .001), and for bilinguals the effect size was 0.24 (95% CI = [0.06, 0.42], z = 2.64, p = .008). In the full data set, the effect size for monolinguals was 0.36 (95% CI = [0.28, 0.44], z = 9.15, p < .001), and for bilinguals the effect size was 0.26 (95% CI = [0.09, 0.43], z = 2.97, p = .003). In sum, numerically, monolinguals showed a stronger preference for IDS than bilinguals, but this tendency was not statistically significant in the effect-size-based meta-analyses. A forest plot for the lab-matched meta-analysis is shown in Figure 1.
Within-group-variability meta-analysis. Our second set of preregistered meta-analyses examined whether the variability in infants' preference for IDS within a sample (within-sample variability) was related to language background (monolingual vs. bilingual). Note that this question of within-sample heterogeneity is different from questions of between-samples heterogeneity that can also be addressed in meta-analysis (see Higgins et al., 2003; Higgins & Thompson, 2002, for approaches to between-groups variability in meta-analysis). Specifically, the within-group-variability meta-analysis approach provides additional insights into how two groups differ in terms of their variances, not merely their mean effect sizes. This approach is useful to investigate whether infants' language backgrounds influence not only the magnitude of infants' IDS preference, but also the variability of their IDS preference. In the following, the standard deviations measure variability of infants' looking-time preference for IDS over ADS in each language group. Again, we report $d_z$, an effect size that measures the magnitude of infants' preference for IDS over ADS.
Our preregistered plan was to follow Nakagawa et al. (2015) and Senior et al. (2016), and we further elaborate on this plan here. According to Nakagawa et al., there are two approaches to run within-group-variability meta-analysis: One approach uses lnCVR, the natural logarithm of the ratio between the coefficients of variation, to compare the variability of two groups; a second approach enters lnSD (the natural logarithm of standard deviations) and lnX (the log mean) into a mixed-effects model. When data meet the assumption that the standard deviation is proportional to the mean (i.e., the two are correlated), the first approach should be used, and otherwise, the second approach should be used. Our data did not meet the necessary assumption; therefore, we used the second, mixed-effects approach. In the following metaregression model, the natural logarithm of the standard deviation (lnSD) of IDS preference in each language group is the dependent variable:
$$\ln SD \sim 1 + \text{bilingual} + \ln(|d_z|) + (1 + \text{bilingual} \mid \text{lab})$$
where $|d_z|$ is the absolute value of $d_z$ because we needed to ensure that values entered into the logarithm were positive, and bilingual is the binary dummy variable that indicates whether the language group is monolingual or bilingual. Further, we entered a random intercept and a random slope for bilingualism, which were allowed to vary by lab. We note that this log transformation is entirely unrelated to the log transformation of raw looking times used in the linear mixed-effects models.
In the lab-matched data set, we did not find statistically significant evidence for bilingualism as a moderator of the difference between the two language groups' standard deviations ($d_z$ = −0.08, p = .235). We also did not find significant evidence for such an effect in the full data set ($d_z$ = 0.02, p = .698). In short, we did not find support for the possibility that bilingual infants would show larger within-group variability than monolingual infants.
Mixed-effects approach. Mixed-effects regression allows variables of interest to be specified on a trial-by-trial and infant-by-infant basis. We had anticipated that we would be able to include additional data from labs that aimed to test homogeneous samples (i.e., because we could include infants from these labs who were not learning the homogeneous sample's language pair), but in practice this did not happen, as only one lab contributed a homogeneous data set, and that lab did not test additional infants. We were able to include data from all valid trials, rather than excluding data from yoked pairs with a missing data point as was necessary for the meta-analysis. As under the meta-analytic approach, we ran the models twice, once including only data from labs that contributed lab-matched samples of monolinguals and bilinguals and once including all available data from 6- to 9-month-olds and 12- to 15-month-olds.
The mixed-effects model was specified as follows: The goal of this framework was to examine effects of the independent variables (IV) on the dependent variable (DV), while controlling for variation in both the DV (random intercepts) and the relationship of the IV to the DV (random slopes) based on relevant grouping units (subjects, items, and labs). Following recent recommendations (Barr et al., 2013), we planned to initially fit a maximal random-effects structure, such that all random effects appropriate for our design were included in the model. However, we also recognized that such a large random-effects structure might be overly complex given our data and would be unlikely to converge. After reviewer feedback during Stage 1 of the Registered Report review process, we preregistered a plan to use a parsimonious mixed-models approach for pruning the random effects (Bates et al., 2018;Matuschek et al., 2017). However, we found that it was computationally difficult to first fit complex models (i.e., our models had multiple interactions and cross-levels grouping) under the maximal random-effects structure and then prune the models using a parsimonious mixed-models approach. Further, we note that this was not the approach used in ManyBabies 1 (ManyBabies Consortium, 2020), which would make a direct comparison between ManyBabies 1 and the current study difficult. Therefore, following ManyBabies 1, we fitted and pruned the models we present in this section using the maximal random-effects structure only (Barr et al., 2013). We fitted all models using the lme4 package (Bates et al., 2015) and computed p values using the lmerTest package (Kuznetsova et al., 2016). All steps of the pruning process we followed are detailed in the analytic code on our GitHub repository. Following a reviewer's suggestion during Stage 2 review, we checked our models for potential issues with multicollinearity by examining variance inflation factors (VIFs) for each model. 
Variables that have VIF values exceeding 10 are regarded as violating the multicollinearity assumption (Curto & Pinto, 2011). None of our models violated this assumption.

Below is a description of our variables for the mixed-effects models:
• log looking time: the dependent variable; log-transformed looking time in seconds
• trial type: a dummy-coded variable with two levels, with ADS trials as the baseline, such that positive effects of trial type indicate longer looking to IDS
• bilingual: a dummy-coded variable with two levels, with monolingual as the baseline, such that positive effects of bilingualism reflect longer looking by bilinguals
• language: a dummy-coded variable with two levels, with NAE learners as the baseline; NAE learners (i.e., infants learning NAE as a native language) were defined as monolinguals with at least 90% exposure to NAE and bilinguals with at least 25% exposure to NAE
• nae exposure: a continuous variable for the percentage of time infants heard NAE
• method: a dummy-coded variable to control for effects of different experimental setups, with central fixation as the reference level
• age: age in days, centered for interpretability of main effects
• trial number: the number of the trial pair, recoded such that the first trial pair was 0
• ses: the number of years of maternal education, centered for ease of interpretation

Note that in this analysis plan, we used a concise format for model specification, which is the form used in R. Thus, lower-order effects subsumed by interactions were modeled even though they were not explicitly written. For example, the interaction of trial type and trial number also assumes a global intercept, a main effect of trial type, and a main effect of trial number.
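The way a concise crossing implies its lower-order terms can be sketched as follows. This is an illustration of the general R formula convention (in which a * b stands for a + b + a:b, plus a default intercept), not code from the study; the function name and the underscored variable names are hypothetical.

```python
from itertools import combinations


def expand_crossing(factors):
    """Expand an R-style crossing such as a * b into the terms it implies.

    Returns the intercept ("1"), every main effect, and every interaction
    among the listed factors, with lower-order terms listed first.
    """
    terms = ["1"]  # global intercept, included by default
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            terms.append(":".join(combo))
    return terms


if __name__ == "__main__":
    # Two-way crossing: intercept, two main effects, one interaction.
    print(expand_crossing(["trial_type", "trial_num"]))
    # Three-way crossing: also implies all three two-way interactions.
    print(expand_crossing(["trial_type", "age", "language"]))
```

The same logic explains why a model with a three-way interaction of trial type, age, and language also contains every two-way interaction among those variables.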
Homogeneity of variance. We preregistered a Levene's test to examine whether monolinguals and bilinguals showed different amounts of variance in their IDS preference. Our analysis focused on the residual variance for monolinguals and bilinguals in the main linear mixed-effects models, in order to partition out variance associated with other factors (e.g., age, method). The Levene's test revealed a statistically significant difference in variance between monolinguals and bilinguals for the full data set (p = .02) but not the lab-matched data set (p = .68). We note that the difference in residual variance between the monolingual (variance = 0.24) and bilingual (variance = 0.25) language groups was small, which suggests that the statistically significant Levene's test for the full data set was mainly driven by its larger sample size, rather than by a meaningful difference between monolinguals and bilinguals.
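The dependence of Levene's test on sample size can be illustrated with a small sketch of the test statistic. The text does not state which variant of the test was used, so the mean-centered version is assumed here; the data are toy values rather than model residuals, and the p value (which requires the F distribution with k - 1 and N - k degrees of freedom) is not computed.

```python
from statistics import mean


def levene_w(groups):
    """Levene's W statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group mean.
    deviations = [[abs(x - mean(g)) for x in g] for g in groups]
    group_means = [mean(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total
    # Between-group variability of the absolute deviations.
    between = sum(
        len(z) * (m - grand_mean) ** 2 for z, m in zip(deviations, group_means)
    )
    # Within-group variability of the absolute deviations.
    within = sum(
        (value - m) ** 2 for z, m in zip(deviations, group_means) for value in z
    )
    return (n_total - k) * between / ((k - 1) * within)


if __name__ == "__main__":
    small = [[1, 2, 3], [2, 4, 6]]
    large = [[1, 2, 3] * 10, [2, 4, 6] * 10]  # same spreads, ten times the data
    print(levene_w(small))
    print(levene_w(large))
```

With identical group spreads, the statistic grows as observations are added, which is in line with the interpretation that a significant result in a large data set need not reflect a large difference in variance.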
Effects of bilingualism on IDS preference. We planned a mixed-effects model that was based on the structure of the final model fitted for the ManyBabies project, including bilingualism as an additional moderator. Note that because data collection for the two projects was simultaneous, we did not know prior to registration what the final model structure for the monolingual-only sample would be (it was expected that pruning of this model would be necessary in the case of nonconvergence). The original model proposed for the monolingual-only sample was designed to include simple effects of trial type, method, language (infants exposed vs. not exposed to NAE IDS), age, and trial number, capturing the basic effect of each parameter on looking time (e.g., longer looking times for IDS, shorter looking times on later trials). Additionally, the model included two-way interactions of trial type with method and with trial number, a two-way interaction of age with trial number, and two- and three-way interactions of trial type, age, and language (see ManyBabies Consortium, 2020, for full justification). This model was specified to minimize higher-order interactions while preserving theoretically important interactions. Note that to reduce model complexity, we treated both developmental effects and trial effects linearly. The planned initial model was:

Our analysis plan specified that we would add bilingualism to the fixed effects of the final pruned model fitted to the monolingual sample. For higher-order interactions in the model, we ensured that we had at least 20 infants per group. For example, we did not include a three-way interaction of bilingualism, language, and age because we had fewer than twenty 6- to 9-month-old bilinguals who were not exposed to NAE.
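The rule of at least 20 infants per group can be checked mechanically before an interaction is added. The sketch below is illustrative only: the record layout, the function name, and the counts are invented, and only the threshold of 20 comes from the text.

```python
from collections import Counter

MIN_PER_CELL = 20  # minimum number of infants per group stated in the text


def small_cells(infants, factors, minimum=MIN_PER_CELL):
    """Return observed factor combinations with fewer than `minimum` infants.

    Combinations that never occur in the data are not reported.
    """
    counts = Counter(tuple(infant[f] for f in factors) for infant in infants)
    return {cell: n for cell, n in counts.items() if n < minimum}


# Hypothetical sample; the counts are invented for illustration only.
SAMPLE = (
    [{"bilingual": True, "age_group": "6-9", "language": "NAE"}] * 25
    + [{"bilingual": True, "age_group": "6-9", "language": "non-NAE"}] * 12
    + [{"bilingual": True, "age_group": "12-15", "language": "non-NAE"}] * 30
)

if __name__ == "__main__":
    # Cells too small to support the three-way interaction.
    print(small_cells(SAMPLE, ("bilingual", "age_group", "language")))
```

A non-empty result would indicate that the corresponding interaction should be left out of the fixed-effects structure under this rule.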
In our preregistration, we were uncertain as to whether our sample size would support a model with a four-way interaction of trial type, age, bilingual status, and language. Given our final sample size, we elected to fit our main model without including the four-way interaction effect. In our main model, we included two fixed three-way interactions, (a) the interaction of bilingualism, age, and trial type and (b) the interaction of language, age, and trial type, as well as other subsumed lower-order interactions.
Regardless of our fixed-effects structure, the model included the random slope for the effect of bilingualism on lab and item, as well as appropriate interactions with other random factors. Our initial unpruned model was:

Overall, the mixed-level analyses in the lab-matched and the full data sets yielded similar results (Tables 2 and 3). More coefficients were statistically significant in the full data set, likely because of the larger sample size. Thus, in the following, we focus on the results of the mixed-level model for the full data set. We found that infants showed a preference for IDS, as indicated by a positive coefficient on the IDS predictor (reflecting greater looking times to IDS stimuli). We did not find an effect of bilingualism on IDS preference or any interaction of bilingualism and other moderators. This finding is consistent with the results of our meta-analysis above.
Surprisingly, the fitted model did not show an interaction between infants' IDS preference and the method used in the lab, a result that is different from the results in the ManyBabies 1 project. However, this finding is likely due to smaller sample sizes in the current project, as we restricted the analysis to participants at particular ages. Apart from this, our findings were largely consistent with those of the ManyBabies 1 study. There was a significant and positive two-way interaction between IDS and language, suggesting greater IDS preferences for children in NAE contexts than for children in other language contexts. The interaction between IDS and age was also significant and positive, suggesting that older children showed a stronger IDS preference. Finally, we found a marginally significant three-way interaction of IDS, age, and NAE, suggesting that older children in NAE contexts tended to show stronger IDS preference than those in the non-NAE contexts.

Note: The proportion of variance accounted for by the fixed effects only (marginal R²) was .087, and the proportion of variance accounted for by both the fixed and the random effects (conditional R²) was .317. IDS = infant-directed speech; HPP = head-turn-preference procedure.

Note: The proportion of variance accounted for by the fixed effects only (marginal R²) was .110, and the proportion of variance accounted for by both the fixed and the random effects (conditional R²) was .361. IDS = infant-directed speech; HPP = head-turn-preference procedure.

Dose effects of exposure to NAE IDS in bilingual infants.
In this analysis, we tested whether we could observe a dose-response relationship between infants' exposure to NAE IDS (measured continuously) and their preference for IDS over ADS.
We decided to conduct this analysis including only data from bilinguals. Our reasoning was that bilingualism status and exposure to NAE IDS were confounded, as monolinguals' exposure to NAE was near either 0% or 100%, whereas bilinguals' NAE experience could be either 0% (because not all bilinguals were learning NAE as one of their two languages) or in the range from 25% to 75%. Because the monolingual sample was larger and their NAE exposure was more extreme, their effects would dominate those of the bilinguals in a merged analysis. Therefore, we reasoned that if there was a dose effect, it should be observable in the bilingual sample alone. Finally, although excluding monolingual infants reduced power overall, we decided that given the relatively large sample of bilingual infants, this disadvantage would be offset by the ease of interpretation afforded by restricting the analysis to bilinguals. On average, bilingual infants in our sample were exposed to 20.17% NAE (range: 0%-75%).
Once again, we based this model on the final pruned monolingual model, replacing the binary measure of exposure to NAE IDS (language) with the continuous measure of exposure (nae exposure), and including a random slope for NAE exposure by item (which was ultimately pruned from the model). After pruning, our model was specified as follows:

Table 4 contains the details of the results in this model. The main effect of infants' exposure to NAE was not significant (β = −0.0007, SE = 0.001, p = .57). This indicates that bilingual infants who were exposed to more NAE did not pay more attention to the NAE speech stimuli than did those who were exposed to less NAE. However, the interaction between trial type and exposure to NAE was significant (β = 0.002, SE = 0.0008, p = .011). That is, bilingual infants who were exposed to more NAE showed stronger IDS preferences. This result confirmed a dose-response relationship between infants' exposure to NAE and their preference for IDS over ADS (Fig. 2) even among bilinguals who were learning NAE as one of their native languages.
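To give a sense of scale for the reported interaction, the following sketch converts the coefficient into an implied change in IDS preference. It rests on several assumptions that are not confirmed by the text: that exposure is coded in percentage points (0 to 100), that looking times were transformed with the natural logarithm, and that all other predictors are held constant. The coefficient is the rounded value reported above, and the function name is hypothetical.

```python
import math

# Reported interaction coefficient (trial type x NAE exposure), as rounded in the text.
BETA_INTERACTION = 0.002


def ids_preference_shift(exposure_low, exposure_high, beta=BETA_INTERACTION):
    """Change in the IDS-minus-ADS difference in log looking time.

    Assumes exposure is expressed in percentage points and that all other
    predictors are held constant.
    """
    return beta * (exposure_high - exposure_low)


if __name__ == "__main__":
    shift = ids_preference_shift(25, 75)
    print(round(shift, 3))            # change on the log scale
    print(round(math.exp(shift), 3))  # multiplicative change in the IDS/ADS looking-time ratio
```

Under these assumptions, moving from 25% to 75% NAE exposure corresponds to a shift of about 0.1 log units in the IDS-minus-ADS difference, or roughly a 10% increase in the ratio of IDS to ADS looking time.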
Socioeconomic status as a moderator of monolingual-bilingual differences. Because SES can vary systematically between monolinguals and bilinguals in the same community, we preregistered an analysis controlling for SES. Given that we did not find an effect of bilingualism on IDS preference, this analysis investigated whether such an effect would be apparent once SES was controlled. Alternately, if we had observed an effect of bilingualism on IDS preference, this analysis would have allowed us to investigate whether it would disappear once SES was controlled.
Thus, this analysis was relevant regardless of whether or not we observed a relationship between IDS preference and bilingualism in the previous model.
First, we computed descriptive statistics for the two groups. Mothers of the bilingual sample had an average of 16.71 years of education (SD = 2.47, range = 10-26), those of the lab-matched monolingual sample had an average of 16.33 years of education (SD = 2.83, range = 5-28), and those of the full monolingual sample had an average of 16.42 years of education (SD = 2.47, range = 8-25).
Our approach was to add SES as a moderator in our final model for bilinguals. We expected that any effects of SES could interact with age. Thus, this model included interactions of trial type, age, and SES as a fixed effect, as well as the corresponding random slope by item. Based on the potential model detailed above for the bilinguals, our expected SES-mediated model was:

In general, across the lab-matched and full data sets (Tables 5 and 6), SES did not have a significant effect on infants' looking time, nor did it affect infants' preference for IDS. However, for the lab-matched data set only, we found a statistically significant three-way interaction of IDS, age, and SES. Specifically, 6- to 9-month-olds showed stronger IDS preference when they were from higher-SES families, but older infants, 12- to 15-month-olds, showed similar IDS preference across families with different SES levels. However, this interaction was not observed in the full data set, which raises the possibility that it is spurious, and arose only in the lab-matched data set because that data set is substantially smaller than the full data set.

Exploratory analysis
Our confirmatory analysis showed that bilingual infants with more exposure to NAE showed stronger IDS preference (see Table 4). However, this initial analysis included a number of bilingual infants who were not exposed to NAE at all. This raises the question of whether the relation between NAE exposure and IDS preference was primarily driven by the infants who were not learning NAE. In the following analysis, we reran the preregistered model of the relationship between NAE exposure and IDS preference, this time restricting the model to infants (n = 135) who were exposed to NAE between 25% and 75% of the time. After pruning for nonconvergence, the final model was:

The interaction between IDS preference and NAE exposure was still statistically significant (β = 0.01, SE = 0.002, p = .004; see Fig. 2). This result suggested that a dose-response relationship between infants' exposure to NAE and their preference for IDS over ADS was not driven by infants living in non-NAE contexts alone (see Table 7 for details of the model).

[Fig. 2. The solid lines show the results of linear regression models, one starting from zero NAE exposure and another focusing on NAE exposure from 25% to 75%. Note that the y-axis was truncated to highlight the trend, and some individual points are not plotted.]

General Discussion
The current study was designed to shed light on the effects of experience on the tuning of infants' preference for IDS compared with ADS. Bilingual infants' language experience is split across input in two different languages, which are being acquired simultaneously. Bilinguals and monolinguals may thus differ in their preference for IDS. To explore this question, we used a collaborative, multilab (N = 17 labs) approach to gather a large data set from infants who were either 6 to 9 or 12 to 15 months old and growing up bilingual (N = 333 bilingual infants in the final sample) and a lab-matched sample of 384 monolingual infants from the same communities. Data were collected as a companion project to ManyBabies 1 (ManyBabies Consortium, 2020), which was limited to infants growing up monolingual. Overall, we found that bilingualism neither enhanced nor attenuated infants' preference for IDS; the magnitude and developmental trajectory of IDS preference were similar for bilinguals and monolinguals from age 6 to 15 months. Although bilingual experience did not appear to moderate infants' preference for IDS, we found striking evidence that experience hearing NAE, the language of our stimuli, contributed to the magnitude of bilingual infants' IDS preference. Bilinguals with greater exposure to NAE showed greater IDS preference (when tested in NAE) than those who had less exposure to NAE. This relationship between NAE exposure and IDS preference was also observed in a subsample of bilingual infants who were all acquiring NAE, but who varied in how much they were exposed to NAE relative to their other native language. These results converge with those from the larger ManyBabies 1 study, which found that monolinguals acquiring NAE had a stronger preference for IDS than monolinguals acquiring another language.
Our approach provides a more nuanced view of the relationship between NAE exposure and IDS preference and suggests that there is a continuous dose effect of exposure on preference. Together, our findings have a number of implications for bilingual language acquisition during infancy. In the following, we discuss each of our research questions in turn, followed by limitations and implications of our study.
Note: The proportion of variance accounted for by the fixed effects only (marginal R²) was .088, and the proportion of variance accounted for by both the fixed and the random effects (conditional R²) was .304. IDS = infant-directed speech; HPP = head-turn-preference procedure.

Our first research question asked whether bilingualism affects infants' relative attention to IDS versus ADS.
One possibility we raised was that the complexity of the bilingual infant's learning experience might lead to greater reliance on IDS, given that IDS may be viewed as an enhanced linguistic signal. However, our data were not consistent with this possibility. In the full data set, monolinguals showed a numerically larger meta-analytic effect size, d_z = 0.36 [95% CI = 0.28, 0.44], than bilinguals did, d_z = 0.26 [95% CI = 0.09, 0.43], but this difference was not statistically significant in either the meta-analyses or the mixed-effects linear models. Although small differences are still possible, our data generally support the conclusion that bilingual and monolingual infants show a similar preference for IDS over ADS. Specifically, both groups show a preference for IDS at 6 to 9 months of age, and this preference gets stronger by the age of 12 to 15 months.

An additional part of our first research question asked whether bilinguals might show more variability than monolinguals in their IDS preference, beyond any differences in the magnitude of the preference. We reasoned that given their diversity of language experiences, bilingual groups may have a higher heterogeneity in terms of their IDS preference compared with monolingual groups (see also Orena & Polka, 2019, for a recent experiment that demonstrated this pattern). Both monolingual and bilingual groups showed high variability. The magnitude of the observed difference in variability was very small. We conducted three analyses to compare the variability between the monolinguals and bilinguals. Only one of the three variability analyses (i.e., the Levene's test with the full data set) was statistically significant. This effect was mainly driven by the large sample size in the full data set (N = 1,754) because the difference in variability between the monolinguals and bilinguals remained negligible.
Thus, our results do not support the idea that bilingual infants show meaningfully more variability in their IDS preference than their monolingual peers.
Given that monolinguals and bilinguals can differ systematically in their SES, the third part of our first research question asked whether SES might moderate the effects of bilingualism. Using years of maternal education as a proxy for SES, we found mixed support for SES as a moderator in our data sets. In our smaller lab-matched data set, we found a statistically significant interaction of age, SES, and IDS preference: Six- to 9-month-olds from higher-SES families showed stronger IDS preference than those from lower-SES families, whereas 12- to 15-month-olds showed similar IDS preference regardless of SES. The direction of this effect aligns with other research indicating that children from higher-SES families generally receive more language input and/or higher-quality input (e.g., by engaging in conversations with more lexical diversity, complexity, and structural variations) than children from lower-SES families (Fernald et al., 2013; Hart & Risley, 1995; Hoff, 2006; Tal & Arnon, 2018). This could suggest that infants from higher-SES families may show stronger IDS preference early in life because they hear more or higher-quality IDS in their daily lives. Further, this positive impact of SES may be most beneficial to younger infants, whose IDS preference is still developing. However, given that in our larger (full) data set SES was unrelated to IDS preference in either 6- to 9- or 12- to 15-month-olds, this result might be spurious and should be interpreted with caution. Further, it is important to note that our samples (both monolingual and bilingual groups) were mainly from higher-SES families. Indeed, in the lab-matched data set, the mothers of 67.79% of children had earned at least a bachelor's degree. Our samples, therefore, had low variability in infants' SES. Thus, this question would be better tested with future studies that have participants from more diverse SES backgrounds.
Our second research question asked whether and how the amount of exposure to NAE affects bilingual infants' listening preferences. Given that our stimuli were produced in NAE, one possibility was that greater exposure to NAE would be linked to greater attention to NAE IDS relative to NAE ADS. Indeed, ManyBabies 1 (ManyBabies Consortium, 2020), which was conducted concurrently with the current study, found that monolinguals acquiring NAE showed a stronger IDS preference than monolinguals not acquiring NAE. However, in the ManyBabies 1 study, exposure to NAE IDS was a binary variable: the infants either heard only NAE or heard only a different language in their environments. In the current project, bilinguals provided a more nuanced way to address this question, as bilinguals' exposure to NAE varied continuously between 25% and 75% (for infants learning NAE as one of their native languages) or was near 0% (for infants learning two non-NAE native languages). We found clear evidence for a positive dose-response relationship between exposure to NAE and infants' preference for NAE IDS. This evidence (that bilinguals with more exposure to NAE showed a stronger NAE IDS preference) was also present when we focused only on bilinguals who were learning NAE as one of their native languages (i.e., those exposed to NAE 25%-75% of the time). Importantly, we did not find a similar effect of exposure to NAE on infants' overall looking time. This suggests that the effect of NAE exposure on preference for IDS is a meaningful relationship, rather than an artifact due to the stimuli being presented in NAE. Further studies with stimuli in other languages would be necessary to solidify this conclusion.

Note: The proportion of variance accounted for by the fixed effects only (marginal R²) was .119, and the proportion of variance accounted for by both the fixed and the random effects (conditional R²) was .362. IDS = infant-directed speech; HPP = head-turn-preference procedure; NAE = North American English.

Our analyses included both meta-analyses and linear mixed-effects models, which allowed us to compare these two approaches. As our field moves toward more large-scale studies of this type, it will be important to determine appropriate standards for analysis. Our meta-analysis allows for better and more direct comparison with prior meta-analyses (e.g., Dunst et al., 2012). However, an important limitation of this approach is that infants' data are collapsed to a single averaged data point per group, which obscures potentially interesting variability. Moreover, because we could not model trial number directly, this average was based on valid adjacent trial pairs, which resulted in many trials being excluded from the analysis. In contrast, the mixed-effects models analyzed data at the individual-trial level, allowing us to examine how data variability could be explained by moderators at the trial and participant level, which increased statistical power. Our finding of a significant effect of age on IDS preference in the mixed models, but not in the meta-analysis, can be attributed to this difference in statistical power. We believe that each of these complementary approaches has its place, but that the mixed-effects model is preferable because it improves statistical power.
Given that this was the first study to recruit and test bilingual infants at such a large scale and at so many sites, we encountered several challenges (see also Byers-Heinlein et al., 2020, for a fuller discussion of challenges in planning and conducting ManyBabies 1). First, several laboratories were not able to recruit the number of bilingual infants originally planned. All labs committed to collecting data from a minimum of 16 bilingual infants per age group. This ended up being unfeasible for some labs within the time frame available (which was more than a year), in some cases because a large number of participants did not meet our strict criterion for inclusion as bilingual. Our experience undoubtedly highlights the challenges for labs in recruiting bilingual infant samples, and moreover raises questions about how bilingualism should be defined and whether it should be treated as a continuous or categorical variable (Anderson et al., 2018; Bialystok et al., 2018; Incera & McLennan, 2018). Second, we had planned to explore the effect of different language pairs on IDS preference. We had expected that some labs would be able to recruit relatively homogeneous samples of infants (i.e., all learning the same language pair), but in the end only one of 17 labs did so (another lab had planned to recruit a homogeneous sample but deviated from this plan when it appeared unfeasible). Thus, we leave the question of the effect of language pair on infants' IDS preference an open issue to be followed up in future studies. By and large, we believe that our large-scale approach to data collection may in the future allow for the creation of homogeneous samples of infants tested at different laboratories around the world. Large-scale and multisite bilingual research projects provide researchers with a powerful way to examine how the diversity and variability of bilinguals impact the development of their language skills and their cognitive development.
Overall, our finding that bilinguals and monolinguals show a similar preference for IDS reinforces theoretical views that emphasize the similarities in attentional and learning mechanisms across monolingual and bilingual infants (e.g., Curtin et al., 2011). IDS appears to be a signal that enhances attention in infants from a variety of language backgrounds. Yet bilingual infants appear to be exquisitely fine-tuned to the relative amount of input in each of their languages. It could have been the case that we would observe a threshold effect of language exposure such that any regular exposure to NAE enhanced infants' preference for NAE IDS, marking it as a highly relevant speech signal. Instead, we observed a graded effect such that the magnitude of bilingual infants' preference varied continuously with the amount of exposure to NAE. The current study shows that just as bilingual infants' relative vocabulary size and early grammar skills in each language are linked to the amount of input in that language (Hoff et al., 2012; Place & Hoff, 2011), the amount of language input may also play an important role in other language-acquisition processes. Indeed, an intriguing but untested possibility is that different input-related attentional biases (e.g., IDS preference) across bilinguals' two languages explain important variability in the early development of bilingual children's vocabulary and grammar. Future bilingual work can investigate this possibility to further delineate the interplay among infants' language input, IDS preference, vocabulary, and grammar skills.
To conclude, the findings of the current study provide a more nuanced view of the development of infants' preference for IDS than prior studies have allowed. IDS preference develops along a similar trajectory among infants from both monolingual and bilingual backgrounds. By testing bilingual infants, our study revealed that this IDS preference operates in a dose-response fashion; the amount of exposure to NAE positively moderated infants' (NAE) IDS preference in a continuous way. Our experience highlights the challenges in recruiting and testing bilingual infants, but also reveals the promise of large-scale collaborations for increasing sample sizes, and thus improving the replicability and generalizability of key findings in infant research.