L2 transfer of L1 island-insensitivity: The case of Norwegian

Norwegian allows filler-gap dependencies into embedded questions, which are islands for filler-gap dependency formation in English. We ask whether there is evidence that Norwegian learners of English transfer the functional structure that permits island violations from their first language (L1) to their second language (L2). In two acceptability judgment studies, we find that Norwegians are more likely to accept ‘island-violating’ filler-gap dependencies in L2 English if the corresponding filler-gap dependency is acceptable in Norwegian: Norwegian learners variably accept English sentences with dependencies into embedded questions, but not into subject phrases. These results are consistent with models that permit transfer of abstract functional structure. Norwegians are still less likely to accept filler-gap dependencies into English embedded questions than Norwegian embedded questions. We interpret the latter finding as evidence that, despite transfer, Norwegian speakers may partially restructure their L2 English analysis. We discuss how indirect positive evidence may play a role in helping learners restructure.


I Introduction
This article addresses first language (L1) transfer in the acquisition of filler-gap dependencies in adult second language (L2) acquisition. We ask whether Norwegian learners of English transfer acceptable filler-gap dependencies from their L1 Norwegian to their L2 English, including dependencies that are unacceptable (and therefore unattested) in English. We also consider whether and how Norwegians might learn that English is more restrictive than Norwegian.
Norwegian and English allow long-distance filler-gap dependencies into embedded declarative clauses. For example, the relative clause (RC) head the signals / signalene can be interpreted as either the direct object (1a, 2a) or subject (1b, 2b) of an embedded verb.
(1) a. Those were the signals i that the sailors said [(that) folks could understand ___ i ].
b. Those were the signals i that the sailors said [ ___ i meant danger].
(2) a. Det var signal-ene i som sjømenn-ene sa [(at) folk kunne forstå ___ i ]. That was signal-def.pl that seamen-def.pl said that folks could understand 'Those were the signals that the sailors said that folks could understand.' b. Det var signal-ene i som sjømenn-ene sa [(at) __ i betydde fare]. That was signal-def.pl that seamen-def.pl said that meant danger 'Those were the signals that the seamen said meant danger.' Norwegian and English differ, however, in subtle ways. Embedded questions are islands in English in that they block filler-gap dependency formation (Chomsky, 1977;Sprouse et al., 2012). Attempting to associate the filler the signals with the embedded verbs in (3) results in unacceptability. In Norwegian embedded questions are not islands (Maling and Zaenen, 1982). It is acceptable to associate the filler signalene with the embedded gaps in (4).
(3) a. *Those were the signals i that the sailors knew [who could understand __ i ].
b. *Those were the signals i that the sailors knew [what __ i meant].
(4) a. Det var signal-ene i som sjømennene visste [hvem som kunne forstå ___ i ]. That was signal-def.pl that seamen-def.pl knew who expl could understand 'Those were the signals that the sailors knew who could understand.' b. Det var signal-ene i som sjømennene visste [hva __ i betydde]. That was signal-def.pl that seamen-def.pl knew what meant 'Those were the signals that the sailors knew what meant.' ~ 'Those were the signals that the sailors knew the meaning of.' In the present study, we investigate if the acceptability of sentences like (4b) leads native Norwegian speakers to accept sentences like (3b) in their L2 English.
We expect Norwegians to accept sentences like (3b) if they inappropriately transfer to L2 English those features of their L1 grammar that render embedded questions nonislands. What could such features be? Under many generative syntactic analyses 'longdistance' movement out of an embedded clause as in (1) and (2) requires successive-cyclic movement through the left-periphery of the embedded clause (e.g. Chomsky, 1977Chomsky, , 2000. In languages like English, this movement uses the specifier of the complementizer phrase (henceforth spec,CP) as an intermediate landing site. Ordinary declarative clauses are not islands, because spec,CP is empty, allowing the moved element to transit through. Embedded questions are islands because spec,CP is already occupied by the wh-phrase -who/what in examples (3a) and (3b) -so an intermediate stop-over is blocked.
Cross-linguistic differences in the islandhood of embedded questions are assumed to reflect parametric variation in the functional structure of the left-periphery of the clause (e.g. Reinhart, 1983;Rizzi, 1982). 1 For the sake of concreteness we make use of a specific proposal made for Mainland Scandinavian languages like Norwegian: Recent work argues that such languages have multiple specifiers in the complementizer domain that would permit successive-cyclic movement through the edge of an embedded question (e.g. Lindahl, 2017;Kush et al., 2018Kush et al., , 2019Vikner et al., 2017). The relevant specifiers are generated by an extra functional head (e.g. the head c under Vikner and colleagues' proposal).
Under this analysis, Norwegians would treat English embedded questions as nonislands if they transfer the extra functional structure of their L1 complementizer domain to English. As we discuss below, whether such transfer is possible is a point of disagreement between models of L2 acquisition. In our investigation, we address three interrelated theoretical questions at the intersection of second-language influence and learnability: 1. To what extent does L1 functional structure transfer to L2? 2. Are L1 features transferred to L2 in a conservative fashion? 3. How do L2 learners restructure after erroneous L1 transfer?
We discuss each question in turn.

What can transfer?
Cases of L1-L2 transfer are well documented. For example, learners often produce or accept L1 word order patterns that are ungrammatical in L2 (Ayoun, 1999;Rankin, 2012;Trahey and White, 1993;Westergaard, 2003;White, 1991). Such instances suggest that L2 learners use some aspects of their L1 (as a starting point) to analyse their L2, but exactly what transfers is a matter of considerable debate.
Models of L2/Ln acquisition disagree on the degree to which L1 functional structure transfers (for review, see Rothman et al., 2019). The Minimal Trees approach of Vainikka and Young-Scholten (1994, 1996, 2006 admits no transfer of functional projections from L1 to L2, positing that learners transfer only lexical projections (VPs) during early acquisition. Higher-level functional projections (e.g. CP) are assumed to emerge later in development via the interaction of L2 input and principles of Universal Grammar (UG) without using L1 functional heads as templates. Most other models of transfer assume that functional projections from L1 transfer to L2, though they differ as to what this entails. Eubank's (1993) Weak Transfer hypothesis holds that functional heads from L1 transfer along with their parameter settings (e.g. basic directionality), but L1-specific lexical feature-values associated with those heads do not transfer. Full Transfer models contend that L1 functional heads, their parameter settings, and their associated feature values serve as the initial interlanguage template for L2 development (e.g. Schwartz andSprouse, 1994, 1996). 2 As far as island-insensitivity in L1 Norwegian can be attributed to the presence of extra functional structure, observing comparable island insensitivity in L2 English would constitute evidence for transfer of that functional structure.

Conservativity and transfer
Language learners often encounter input that is compatible with two (or more) analyses differing in generative capacity: a restrictive analysis that closely fits the observed data and another more powerful analysis that generates both the observed data and additional unattested sentences. In such cases the strings generated by the first analysis represent a subset of the strings generated by the more powerful analysis (with respect to a given phenomenon). 3 When learners must choose between the two analyses, they face a version of the classic subset-superset problem (e.g. Berwick, 1985;Wexler and Manzini, 1987;White, 1989aWhite, , 1989b; for cases in L2, see Judy and Rothman, 2010;Yuan, 1997): should they choose the more, or less, restrictive analysis? What if they choose the less restrictive analysis and it turns out the be incorrect? If so, rejecting the superset analysis may be difficult since strings consistent with the subset analysis are equally consistent with the superset analysis.
Similar learnability considerations apply in L2 acquisition, where the problem is aggravated by the possibility of transfer: If the learner's L1 supports the superset analysis and the analysis is transferred to L2, the result is an overly permissive L2 grammar that generates both acceptable and unacceptable L2 forms. The case of Norwegian is arguably such an instance: if Norwegian learners transfer their L1 functional structure to their analysis of L2 English filler-gap dependencies, they would be able to generate acceptable long-distance dependencies in English, but also island-violating dependencies that should be unacceptable in English.
Prior work in L1 acquisition has argued that learners can avoid erroneous overgeneralization by adopting conservative learning strategies that prefer restrictive analyses (e.g. Snyder, 2007;Westergaard, 2014). 4 In principle, it is possible that transfer is also conservative: L2 learners could eschew transferring features that would potentially overgenerate or avoid transferring typologically marked structures (e.g. Mazurkewich, 1984) without direct evidence for those features. Previous research has shown that L2 learners are less conservative than L1 learners, but these studies have not directly considered the role that transfer might play in these situations (e.g. Anderssen et al., 2018;Clahsen and Muysken, 1986;White, 1989b). If Norwegians treat embedded questions as non-islands in L2 English, this would constitute evidence against conservative transfer.

Retraction and restructuring after transfer
L2 learners can undo transfer of an L1 feature (e.g. restructure) based on positive evidence of conflict between L2 input data and L1 analyses. Many models assume that direct positive evidence of conflict with the L1 analysis of phenomenon, P, is required for restructuring the L2 analysis of P (see, e.g. Schwartz and Sprouse, 1996): metalinguistictype negative evidence (e.g. correction) is believed not to be useful for prompting underlying grammatical restructuring (Schwartz, 1993;White, 2003). For some basic phenomena like determining head-directionality the relevant evidence is in abundance, so restructuring should happen quickly. As the similarity between or number of (surface) forms predicted by L1 and L2 increases, however, the possibility of direct conflict diminishes: the relevant positive data are scarce, if they exist at all. In the absence of (enough) conflicting data, inappropriately transferred features are expected to persist late into acquisition or become fossilized (see, amongst others, Franceschina, 2005;Hawkins et al., 1993;Judy and Rothman, 2010;Lardiere, 2007;Schwartz and Sprouse, 1996). As such, instances where L2 surface forms are a subset of acceptable L1 forms represent paradigm cases where 'persistent' or fossilized transfer should obtain. Given that (most) acceptable English filler-gap dependencies are compatible with a transferred Norwegian analysis, we predict that Norwegians are likely to have restructured their L2 English grammars if transfer has occurred.

Past work on learnability of islands/movement
Before proceeding to our experiments, we briefly consider past work that investigated islands in L2 acquisition to highlight the difference from our research questions. Most prior studies were framed as tests of access to principles of Universal Grammar (UG) during L2 acquisition rather than transfer.
Some earlier experiments explored if L1 speakers of languages without overt whmovement accept island-violating wh-movement in English (Johnson and Newport, 1991;Li, 1998;Martohardjono, 1993;White and Genesee, 1996;White and Juffs, 1998;Wolfe Quintero, 1992). For example: As part of a larger study, Martohardjono (1993) had L1 Chinese and L1 Indonesian participants judge sentences with long-distance whdependencies in their L2 English. Test sentences contained wh-dependencies that into five types of constituents that are islands in English: embedded questions (wh-islands), RCs, complex NPs, adjunct clauses, and sentential subjects. Martohardjono found that participants in both groups correctly rejected island violations on a non-trivial portion of trials, 5 which was taken as evidence for access to UG constraints on wh-movement during L2 acquisition (for similar conclusions, see Li, 1998;White and Juffs, 1998).
Other experiments have tested whether learners accept island-violating L2 filler-gap dependencies that correspond to unacceptable dependencies in their L1. Martohardjono (1993) again provides an example. Martohardjono asked L1 Italian participants to rate the same English sentences as the native Chinese and Indonesian participants in the experiment above. In Italian, wh-movement from all five of the constituents is unacceptable, just as in English (Rizzi, 1982;Sprouse et al., 2016). Martohardjono found that Italian participants rejected the test sentences at rates comparable to L1 English natives. 6 These results demonstrate that participants do not allow island-violating dependencies in their L2 if those dependencies are unacceptable in L1, an empirical conclusion that is also supported by the growing body of research on the real-time processing of islands in L2 (Felser et al., 2012;Kim et al., 2015;Omaki and Schulz, 2012).
The results above do not directly address the limits of transfer because they are in principle compatible with transfer either having or not having occurred. If speakers of non-wh-movement languages initially transferred their L1 analysis of wh-dependencies to L2 English, observing overt movement dependencies would prompt them to restructure and generate a new analysis for the observed forms in the L2 input. If transfer did not occur, they would similarly base an analysis of English wh-dependencies on input forms. Judgments of English dependencies, then, would be based on their input-driven analyses. In the case of Italian, if participants conservatively learn the distribution of acceptable English dependencies from the L2 input alone or transfer their L1 analysis, they should reject island-violating wh-dependencies all the same.
Unlike prior experiments, our work tests whether transfer occurs by testing cases where the dependencies allowed by L1 constitute a larger set than is allowed in L2. If transfer occurs, we expect 'unlearning' the L1 analysis should prove difficult because there is arguably little to no direct evidence that would contradict the transferred analysis. As such, we expect the transferred analysis to persist and to affect participant judgments even despite high otherwise proficiency in the L2 and significant time knowing the L2 well. In particular, we expect participants to accept unacceptable L2 forms generable under the L1 analysis. We tested whether L2 speakers of English make such errors with two acceptability judgment studies. To preview our main results, we find evidence for the predicted non-conservative transfer from L1 Norwegian to L2 English. However, we also find evidence that suggests some degree of restructuring: Norwegians do not uniformly treat embedded questions as non-islands in English as they do in Norwegian. We consider the implications of these facts in the General Discussion.

II Experiments
We ran two acceptability judgment studies that tested Norwegian speakers' intuitions about the acceptability of relative clause dependencies in configurations like (3b) and (4b) in both English and Norwegian. We henceforth refer to such examples as Wh-Trace Configurations to highlight two characteristics of the constructions: (1) the islands in question are embedded questions (wh-islands) and (2) the filler is associated with a subject gap/trace immediately adjacent to the embedded wh-word. Both aspects of the constructions are presumed to result in unacceptability in English: (1) because of an island violation, and (2) because it is unacceptable in (most dialects of) English to have a gap next to an overt element in the complementizer domain (so called Comp-trace effects; Chomsky and Lasnik, 1977;Perlmutter, 1971). As both experiments had the same design, we present information about the materials, procedure, and analysis before discussing the specifics of each experiment.

Materials and design
Both experiments employed the factorial definition of island effects developed by Sprouse (2007) and used in many recent studies of island-sensitivity cross-linguistically (e.g. Kush et al., 2018Kush et al., , 2019Sprouse et al., 2011Sprouse et al., , 2012Sprouse et al., , 2016. The standard 2×2 factorial design for island effects crosses the factors Structure and diStance. All test sentences contain a filler-gap dependency. diStance controls the length of the dependency, while Structure controls the presence or absence of an island configuration. Example (5) illustrates the design with an item for testing Wh-Trace configurations. In (5) the filler-gap dependency is a relative clause dependency. diStance determined whether the head of the RC (either someone or the signal) was linked to the highest subject position in the RC (Short) or to the embedded subject position (Long). Structure manipulated whether the most deeply embedded clause inside the relative clause was declarative (No-Island) or an embedded question (Island). The Long-Island condition corresponds to the only sentence with an 'island violation'. According to the logic of the factorial design, an island effect is defined as a Structure × diStance interaction: it reflects the residual unacceptability associated with a structure like (5d) once the independent effects that long-distance extraction and structural complexity have on acceptability have been accounted for. We crossed the standard 2×2 manipulation above with an additional factor: language. Norwegian counterparts for all English items were created, (6), resulting in a 2×2×2 design. In addition to Wh-Trace items, we tested sensitivity to another island type: Subject islands. The islandhood of subject phrases is determined by different syntactic constraints (e.g. the Condition on Extraction Domains of Huang, 1982) than the embedded questions. As a result, the extra functional structure that permits Wh-Trace island violations should have no effect on the islandhood of subjects. Thus, subjects should be islands in both Norwegian and English. This prediction has been verified by previous studies using the factorial design (Kush et al., 2018(Kush et al., , 2019Sprouse et al., 2011Sprouse et al., , 2016. We adapted materials from Kush et al. (2018Kush et al. ( , 2019 to test the acceptability of RC-dependencies into subject islands. The design crossed Structure × diStance × language, yielding 8 conditions as exemplified below. (7) Sample Subject Island Item (English Conditions): The judge . . . a. met the lawyer that __ hoped that the report would confirm the suspicions. Short | no-iSland b. read the report that the lawyer hoped would __ confirm the suspicions.
long | no-iSland c. met the lawyer that __ hoped that the information in the report would confirm the suspicions. Short | iSland d. read the report that the lawyer hoped that [the information in __] would confirm the suspicions. long | iSland Dommeren. . . a. møtte advokaten som __ håpet at rapporten ville bekrefte mistankene. Short | no-iSland b. leste rapporten som advokaten håpet at __ ville bekrefte mistankene.
long | iSland Subject island judgments provide an independent baseline of island-sensitivity that is not expected to be affected by the hypothesized transfer of functional structure.

Procedure
Test items were distributed across lists according to a Latin Square design and intermixed among filler sentences. The experiment was hosted on IbexFarm (Drummond, 2012). Participants participated on their own personal computers. Sentences were presented one at a time. Participants rated their acceptability on a 7-point scale. All participants rated English items first before judging a Norwegian block to minimize L1 interference. Instructions were presented in English and participants received a break between English and Norwegian blocks.

Analysis
Raw ratings were z-score transformed before analysis. We z-scored ratings by participant and language tested. Z-scoring by-participant helps to control for biases in how individual participants used the 7-point scale. Z-scoring by-language for each participant helps control for the fact that participants may use the scale differently in their L1 and L2 (Sorace, 1996;Spinner and Gass, 2019). Z-scores were analysed with linear mixed effects models implemented using the lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) packages in R (R Core Team, 2013). All models included fixed effects of Structure, diStance and their interaction and random intercepts for both subject and item. When appropriate we also included fixed effects of iSland type and language. We included by-subject random slopes for Structure, diStance and their interaction when such models converged. When the more complex model did not converge, we simplified the random effects structure. Details of individual models are presented in the tables of statistical results below. P-values were computed using likelihood ratio tests. Effect size was measured as a Difference-indifferences (DD) score (Maxwell and Delany, 2003). DD scores were calculated by-participant.

III Experiment 1 1 Materials
Sixteen items of 8 conditions apiece were created for each island type following the diStance × Structure × language design. The 32 test items (16 Wh-Trace, 16 Subject Island) were interspersed among 76 filler sentences (38 English, 38 Norwegian). Each set of language-specific filler sentences contained 22 unacceptable and 16 acceptable fillers varying in length and complexity.

Experiment 1a: Native English controls
Thirty-one native English volunteers recruited as control participants via social media (mean age = 38.0, SD = 11.9, 17 female; 27 from the United States) judged sentences in the English block of the experiment on Ibex Farm. One participant was excluded from analysis for having multiple response times < 500 ms. Because participants rated only the English sentences, participants rated 4 tokens per condition per island.

Results
Average acceptability judgments by condition are found in Figure 1. A summary of statistical analysis is found in Table 1. Native English speakers rated RC dependencies into both subject phrases and embedded Wh-Trace constructions much lower than RC dependencies into non-islands. There were clear island effects (diStance × Structure p < .001). The sizes of Subject and Wh-Trace island effects (DDs = 1.55, 1.69, respectively) did not differ significantly, as evidenced by the absence of a three-way diStance × Structure × iSland type interaction.
We also inspected by-participant ratings of the Subject and Wh-Trace Long-Island sentences to check for inter-trial consistency. Native English participants rejected Wh-Trace Long-Island sentences nearly uniformly. Twenty-eight of 30 participants rejected 4 out of 4 Wh-Trace Long-Island tokens, judging all below z = 0. The two  remaining participants rated a single Wh-Trace Long-Island token above z = 0, but rejected the remaining 3 tokens. Judgments of Subject Long-Island sentences showed slightly more variability. Fourteen participants rejected all four tokens that they rated. Twelve participants rejected three of four tokens. Three participants exhibited more variability: two of the three rejected only two of four Subject Long-Island sentences and one participant rejected only one of four. Overall, however, participants rejected RC-dependencies into subject phrases on the clear majority of trials.

Experiment 1b: Norwegian L1, English L2
a Participants. Twenty-seven native speakers of Norwegian took part (16 female). Two participants' data were excluded because the participants reported exposure to English during infancy. Participants were students enrolled in the English Studies program at the Norwegian University of Science and Technology (NTNU) either at the bachelor's or master's level. Norwegian university students are assumed to have a proficiency in English at least commensurate to the Common European Framework of Reference for Languages (CEFR) at B2 level, as this is the minimum standard for enrollment for foreign students (see, for example, Samordna opptak, 2020). Participants filled out a short survey on their language background and their English exposure. An overview of responses to this survey is in Table 2.
Unlike the English control participants, Norwegian participants rated 8 items per island in each language (2 tokens per condition per language). Average judgments by language and island type are plotted in Figure 2.
We first report the results of the omnibus diStance × Structure × iSland type × language analysis before planned analyses of each island type in isolation; see Table 3. Sentences with long-distance RC dependencies were rated lower on average than sentences with short RC dependencies (p < .000), as were sentences with island structures (p < .000). However, these main effects were qualified by several higher-order interactions. We focus on the two three-way interactions. The significant interaction of diStance × Structure × language (p = .0385) reflects the fact that Norwegians exhibited larger average diStance × Structure island effects in English than in Norwegian. The significant diStance × Structure × iSland type interaction (p < .000) indicates that Norwegians exhibited larger diStance × Structure island effects for Subject island sentences than for Wh-Trace sentences. Although we did not observe a significant four-way interaction, we conducted planned comparisons of the three-way interaction of diStance × Structure × language for each island type separately.
b Subject islands. A statistical summary is given in Table 4. The size of Subject island effect was significantly larger in English (DD = 1.69) than in Norwegian (DD = 1.01), as indicated by a diStance × Structure × language interaction (p = .016). Resolving this interaction confirmed that significant, sizable subject island effects were present both in English and Norwegian (ts = -10.05, -6.16, respectively; ps < .001).
c Wh-Trace Islands. A statistical summary can be found in Table 5. Again, the threeway diStance × Structure × language interaction was significant (p < .01). Resolving the three-way interaction revealed that although there was a significant Wh-Trace island effect in English (t = -4.10, p < .001; DD = 0.38), there was not a significant Wh-Trace island effect in Norwegian (DD = -.01). Visual inspection of Figure1 confirms the absence of even trend towards an interaction in Norwegian.    The Wh-Trace island effect observed in English is smaller in magnitude than the subject island effects in either language, while the average rating of the English Wh-Trace Long-Island sentence is considerably higher (roughly -0.25) than the English Subject Long-Island sentence (roughly -0.80). Following Kush et al. (2018Kush et al. ( , 2019, we investigated whether the smaller effect reflected inconsistent judgments across trials. Figure 3 plots the distribution of z-scores in both Long conditions for each island-language combination. Response consistency is reflected in the degree to which judgments in a condition follow a unimodal distribution. Inconsistent judgments manifest as bimodal or uniform distributions. In each of the island-language pairs, the Long-NoIsland condition provides a baseline level of consistency against which to judge the responses in the Long-Island conditions. The extent of the overlap between the Long-NoIsland and Long-Island judgments provides a rough way of approximating the extent to which the RC-dependencies into islands were perceived as run-of-the-mill long-distance dependencies. Beginning with subject island sentences, we observe relatively little overlap between that the ratings for Long-Island and Long-NoIsland sentences. Judgments of Subject Long-NoIsland sentences cluster unimodally around the higher end of the scale (z = +1), with a thicker left tail. Judgments of the Subject Long-NoIsland condition, by contrast, cluster at the opposite end of the scale (z = -1) and exhibit less of a right skew.
Judgments of Long conditions in the Subject island sub-experiment provide a template for consistent judgment. Judgments in the Wh-Trace sub-experiments clearly do not conform to that template. Judgments in the Norwegian Wh-Trace Long-Island condition are consistent with general acceptability: z-scores are unimodally distributed about the high end of the scale with a fat left tail. The pattern of responses indicates that participants perceived the test sentences as unobjectionable on most trials. Bimodality in the corresponding Long-NoIsland condition suggests that participants were less consistent in judging those sentences. Turning to the English Wh-Trace sub-experiment we see bimodality in both Long conditions, though the larger mode falls on opposite ends of the range between conditions. Norwegian participants tended to accept Long-NoIsland sentences more often than reject them, but there were still a number of trials where they judged the sentences to be unacceptable. Most relevant to our purposes, the Norwegian participants often rejected Long-Island sentences, but there was a non-negligible number of trials on which they accepted structures that native English speakers reject.

Individual differences
Analysis of the rating distributions shows that there was inter-trial inconsistency in the ratings of English Wh-Trace Long-Island sentences, but it does not establish whether the cause was inter-or intra-participant inconsistency. To ascertain whether individual participants were inconsistent, we plotted each participant's maximum judgment against their minimum judgment for each island-language combination (see Kush et al., 2019). In Figure 4, each dot corresponds to an individual participant.
For the purposes of the analysis we adopt a crude definition of 'acceptance' and 'rejection': we treat all judgments that fall below z = 0 as rejections and all judgments that are above z = 0 as acceptances. Using this coarse categorization technique permits identification of three participant response types: Participants that rejected both tokens of an island type occupy quadrant 3 (bottom left). Those that accepted both tokens occupy quadrant 1 (top right). Those that occupy quadrant 4 (top left) rated island tokens inconsistently, accepting one and rejecting the other.
A few participants accepted one or both Norwegian subject island tokens, but subject island judgments otherwise exemplify consistent rejection: In Panels 1 and 2 of Figure 4 most participants fall into quadrant 4. Judgments of the Norwegian Wh-Trace sentences show a different pattern: all participants fell into quadrant 1 (consistent accepters) or quadrant 4 (inconsistent raters). Judgments of English Wh-Trace islands show more variability. Eleven participants consistently rejected Wh-Trace islands in English and 3 consistently accepted the constructions. The remaining 11 participants rated the sentences inconsistently. This level of inter-and intra-participant inconsistency stands in contrast to the relative uniformity of the same participants' judgments of English Subject island tokens.
Given the differences in participant response patterns for the English Wh-Trace, we conducted an exploratory analysis of whether individual variability correlated with self-reported proficiency, weekly hours of English spoken, or English media consumption. We used participants' English Wh-Trace island DD score as the dependent measure of island sensitivity. A positive correlation between DD score and individual measure would be expected on the assumption that increased exposure or proficiency made participants behave more like native English speakers. Hours of spoken English did not correlate with DD score (|t| < 1), nor did self-reported English proficiency (|t| < 1). There was a small, but significant negative correlation between DD score and English media consumption (t = -2.036, p < .05; adjusted R2 = .116). As Figure 5 shows, this correlation indicates -counter-intuitively -that participants who consumed more English media showed reduced sensitivity to English Wh-Trace island effects.

Discussion
As expected, English participants rejected RC-dependencies into subject phrases. Norwegian participants also rejected subject island violations in their L1 Norwegian and L2 English. These results are expected, if subjects are islands in both languages and the extra functional structure that allows filler-gap dependencies into embedded questions does not amnesty subject island violations.
Participants diverged in their judgments of Wh-Trace items. Native English speakers exhibited large Wh-Trace island effects, rejecting RC-dependencies in Wh-Trace configurations. We failed to find a Wh-Trace island effect in Norwegian. Norwegian participants generally accepted RC-dependencies into Wh-Trace configurations in their L1 as readily as RC-dependencies into declarative complement clauses. Interestingly, we found a significant island effect with English Wh-Trace constructions, indicating that Norwegians rated RC-dependencies in Wh-Trace constructions less acceptable on average than RC-dependencies into non-island declarative complement clauses. However, the Wh-Trace island effect was smaller than subject island effects, because Norwegian participants rated English Wh-Trace islands inconsistently: participant ratings were a mix of 'accept' and 'reject' trials. We defer further discussion and interpretation of this finding to the General Discussion.
The number of trials where Norwegian participants accepted English Wh-Trace violations provides suggestive support for transfer from L1 Norwegian to L2 English. However, the experiment was relatively low-powered, with only two observations of the relevant configuration per participant. We wished to test if our findings would replicate, and whether participants would provide more consistent judgments of Wh-Trace island violations in English if given more trials. Therefore, we ran Experiment 2, in which we doubled the number of observations per participant. We also increased our sample size and drew from a wider pool.

IV Experiment 2 1 Participants
Forty-nine native speakers of Norwegian took part in experiment 2 (29 female). Like the participants in Experiment 1, participants in Experiment 2 were enrolled as bachelor's and master's students in a Norwegian university. Unlike the previous participants, participants in Experiment 2 were enrolled in a wide range of degree programs, not only English. All these courses of study presuppose that students have studied English from upper secondary school and have achieved minimum proficiency at CEFR B2 level. Participants provided the same information as in Experiment 1. Table 6 provides an overview of descriptive statistics.

Materials
Participants rated the same items as in Experiment 1 plus 16 new Wh-Trace items. As a result, participants judged 4 tokens per condition per language in the Wh-Trace subexperiment instead of 2 as in Experiment 1.

Results
Participants' average judgments by island type and language are plotted in Figure 6. A summary of the omnibus statistical analysis can be found in Table 7. As in the analysis of Experiment 1, we focus only on the highest-order interaction effects.
There was a significant diStance × Structure × iSland type interaction (p < .000): Subject island effects were, on average, larger than Wh-Trace island effects, irrespective of language. The significant diStance × Structure × language interaction reflects that when collapsing across Subject and Wh-Trace island sentences participants exhibited numerically smaller average island effects in their judgment of Norwegian test items than with English test items. We proceed to the planned comparisons, focusing on each island type separately.
a Subject islands. A statistical summary is in Table 8. As in Experiment 1, Long sentences were rated lower on average than Short sentences (p < .000) and Island sentences were rated lower NoIsland sentences (p < .000) collapsing across languages. The threeway diStance × Structure × language interaction was again significant (p < .001). 4.9 (2.6) 4 1.5-12 English proficiency (self-reported) 5.7 (0.7) 6 4-7 Figure 6. Average z-scored acceptability judgments from Norwegian participants in Experiment 2. Rows correspond to the island judged and columns correspond to the language of presentation. The interaction was driven by the fact that the Norwegian Subject Island effect was numerically smaller than the English effect. However, subject island effects were large in both English and Norwegian (DDs = 1.47, 0.89, respectively) and significant (diStance × Structure: ts = -12.54, -8.01, respectively; ps < .000).
b Wh-Trace islands. Table 9 presents a statistical summary. Long conditions were rated significantly lower on average than Short conditions (p < .000) and Island conditions lower than NoIsland conditions (p < .05). The diStance × Structure × language interaction was significant (p < .001). Figure 6 makes evident that the interaction is because Norwegians exhibited an average island effect for English sentences (DD = 0.568), but not for Norwegian sentences (DD = -0.069). Follow-up analyses verified that there was a significant Wh-Trace island effect in English (t = -4.83, p < .001), but not in Norwegian (t < 1). We again examined the distribution of ratings in Long conditions for all island-language pairs. Distributions are plotted in Figure 7. Subject Long-Island and Long-NoIsland distributions are roughly bimodal with modes at opposite ends of the rating scale. However, in the Norwegian Wh-Trace sentences, the distribution of ratings for Long-Island sentences is essentially indistinguishable from Long-NoIsland sentences:  participants accepted the majority of test sentences in both conditions. English Wh-Trace judgments diverged from the ratings of their Norwegian counterpart sentences. Participants generally accepted Long-NoIsland sentences, judgments of Long-Island sentences are bimodally distributed. The group as a whole appears to accept and reject English Wh-Trace sentences with near equal frequency. Figure 8 plots individuals' minimum and maximum ratings for each island-language pair, to visualize rating consistency. Participants consistently rejected Subject Island violations in English and were generally consistent in their judgment of Norwegian Subject Island violations, as evidenced by the clustering in quadrant 3 in panels 1 and 2 of Figure 8.
Panel 3 shows that every participant accepted at least one Norwegian Wh-Trace island token -most participants fall into quadrant 4 -while many accepted all four tokens. Panel 4 indicates that almost all participants accepted at least one English Wh-Trace island token. Figure 8 only provides information about the range of individual participants' judgments. We were also interested in how many of the 4 Long-Island Wh-Trace tokens each participant accepted. Therefore, we binned participants by how many tokens they rated above 0. The result is in Table 10. Forty of 49 participants accepted 3 or 4 Norwegian Wh-Trace island tokens and none rejected all 4 tokens. Judgments of English Wh-Trace tokens showed less consistency: fewer participants accepted most of the items (18 of 49). Five participants consistently rejected Long-Island tokens. Nevertheless, Norwegian participants clearly displayed a different response pattern than Native English speakers in Experiment 1a, where all participants either uniformly rejected the Wh-Trace island tokens, or rejected 3 of 4.
One question that Table 10 leaves unaddressed is how strongly participants' judgments in the Norwegian Wh-Trace experiment correlate with their judgments in English. We addressed this question in two follow-up analyses. First, we plotted participants' Wh-Trace DD scores in Norwegian against their DD scores in English, to determine whether there was a correlation between island effect size. This plot is in Figure 9. Second, we looked for a correlation between individual participants' probability of accepting a Wh-Trace island violation in Norwegian and English. The correlation plot is provided in Figure 10.
There was no reliable correlation between Norwegian and English Wh-Trace DD scores. As Figure 9 makes apparent, there were many participants who exhibited no island effects in Norwegian (z ⩽ 0), but nevertheless had a positive Wh-Trace DD score in English. Figure 10 shows a numeric trend such that participants who accepted a high proportion of Wh-Trace island violations in Norwegian were slightly more likely to accept Wh-Trace island violations in English, though this correlation was not significant (Adjusted R 2 = .015; t = 1.32). The correlation was weakened by the group of 19 participants who readily accepted Wh-Trace island violations in Norwegian (> 50%), but were Table 10. Participants binned by the number of wh-trace island violation items they rated above the midpoint of the scale. less likely to do so in English. Importantly, all but five of these participants still accepted Wh-Trace violations in English. Finally, we checked whether any of the three individuallevel variables correlate with a participant's Wh-Trace DD score. None of the measures correlated with DD score (ts < 1).

V General discussion
Embedded questions are islands for filler-gap dependency creation in English, but not in Norwegian. The difference between the two languages has been linked to extra functional structure in the left-periphery of the Norwegian clause (Kush et al., 2018(Kush et al., , 2019Vikner et al., 2017). We were interested in determining whether Norwegians transfer this extra functional structure from their L1 to their L2, English. We reasoned that if Norwegians transfer the functional structure to English, they should erroneously treat embedded questions as non-islands in English. Insofar as the set of acceptable Norwegian filler-gap dependencies represents a superset of the acceptable English filler-gap dependencies, acquiring the appropriate generalization in English should prove difficult if transfer has occurred. The difficulty reflects the fact that there is arguably little if any direct evidence to counter-exemplify the less restrictive hypothesis (White, 1989a).
To test whether such transfer occurs, we tested whether adult Norwegian speakers accept filler-gap dependencies into embedded questions (wh-islands) in English. Our results provide evidence of transfer from L1 Norwegian L2 English: Participants accept filler-gap dependencies into wh-islands in English even though they have never encountered those structures in their English input. Importantly, participants do not accept all island-violating filler-gap dependencies in L2 English, as evidenced by participants' consistent rejection of subject-island-violating filler-gap dependencies. The fact that subject island violations were consistently rejected militates against an interpretation that attributes Norwegian participants' acceptance of English Wh-Trace island violations to general island-insensitivity in L2. As predicted by transfer, our participants only accepted island-violations in English if the corresponding dependency was acceptable in Norwegian. Insofar as the non-island status of embedded questions is due to extra CP-level functional structure, our results are consistent with models that permit such transfer, including Weak Transfer (Eubank, 1993) or traditional Full Transfer models (Schwartz andSprouse, 1994, 1996) over models that restrict transfer to minimal grammatical information Young-Scholten, 1994, 1996).
We predicted that transfer of L1 functional structure could lead native Norwegians into a 'superset trap': having assumed that an analysis that allows more filler-gap dependencies than are acceptable in English, learners would be unable to retract to a more restrictive analysis. All else equal, we would therefore predict that Norwegian participants should accept island-violating dependencies as often in L2 English as they do in L1 Norwegian. Participant judgments yielded a more complicated picture: Almost all participants accepted Wh-Trace island violations in English on some portion of trials, but roughly one-third of our participants accepted the English structures less readily than in Norwegian.

The source of inconsistent judgments
Participants' inconsistent judgments of English wh-island violations are consistent with two broad interpretations. First, participants may have rejected the sentences simply due to their increased complexity. This could occur if participants have greater difficulty processing wh-dependencies in their L2 than in their L1 (Juffs, 2005;Juffs and Harrington, 1995). We point out that this explanation presupposes that transfer must have occurred, otherwise the Norwegians would not accept the island-violations in English at all. The explanation holds, however, that Norwegians' tendency to probabilistically reject islandviolations does not constitute evidence of learning the appropriate English analysis: rejection occurs for orthogonal, extra-grammatical reasons.
The second option, which we favor, is that probabilistic rejection provides evidence of learning and partial restructuring. By 'restructuring' we simply mean that changes are made to some aspect of the holistic system or feature set transferred from L1. These changes could represent target-like restructuring, such that Norwegian speakers specifically reject the extra functional structure from Norwegian and adopt a simplified leftperiphery identical to native English speakers. Alternatively, Norwegians could engage in ad-hoc compensatory restructuring wherein other grammatical changes are made to ensure closer surface alignment with acceptable English forms, without directly retracting the L1 functional structure. Our current results do not allow us to distinguish these two possibilities. Which of these outcomes is more likely depends, in part, on what types of L2 input learners receive as evidence that the set of filler-gap dependencies is different in English and how directly that evidence contradicts the L1 analysis. We consider the issue of evidence in the input presently.
Participants' stochastic or inconsistent judgments are compatible with the notion that they have learned that the distribution of filler-gap dependencies differs between the two languages, but that learning or restructuring is not 'complete'. Within a parameter-setting model of L2 acquisition (Schwartz and Sprouse, 1996;White, 2003) or a grammar competition/multiple grammars model (Amaral and Roeper, 2014;Rankin, 2014), this uncertainty could be modeled as a probabilistic competition between different grammars. Transfer would entail that Norwegian learners begin acquisition by assigning a high probability to their L1 analysis. Over time, however, they would accumulate evidence against that analysis and would shift probability to a more restrictive analysis (provided they could avoid the preemption problem; Rothman and Iverson, 2013;Trahey and White, 1993).

Evidence of difference
The question remains what cues learners could use in their English input as evidence in favor of the restrictive analysis. Negative evidence could, in principle, play a role. We consider direct negative evidence in the form of corrections an implausible mechanism given: (1) the relative infrequency of relevant productions, (2) the unreliability and ambiguity of interlocutors' correction, (3) the low probability that wh-island violations are ever addressed explicitly in the English classroom (see, Carroll, 1995Carroll, , 2001Schwartz, 1993). Indirect negative evidence is another option: if Norwegian learners of English expect to encounter English wh-island violations at a rate comparable to Norwegian, then the absence of the structures could over time lead the learner towards the restrictive hypothesis. Prior research has argued that indirect evidence may play a role in L1 (Foraker et al., 2009;Perfors et al., 2011;Ramscar et al., 2013;Regier and Gahl, 2004;Rohde and Plaut, 1999) and L2 acquisition (Dahl, 2004;Plough, 1992). However, it is unclear whether the frequency of the island-violations is high enough in L1 to form the basis for strong predictions in L2.
Direct positive evidence of the unacceptability of wh-island violations does not occur, but some learning models allow indirect positive evidence to play a role (e.g. Pearl and Mis, 2016, or more traditional parameter-based models). Learners can rely on indirect positive evidence if there exist implicational relations between observed (non-island) structures and the possibility of island violations. Under the assumption that additional CP-level functional structure underlies the non-island status of embedded questions in Norwegian, Norwegians would require evidence that this structure is absent in English. Such evidence is only possible if some overt property of the English CP-domain conflicts with the Norwegian analysis. What could such cues be?
We assume that most sentences do not provide unambiguous evidence for deep differences in functional structure of the CP domain, given the similarity of surface word order patterns in the two languages. However, one piece of evidence might prove useful: It has been suggested that evidence for an articulated CP-domain in Mainland Scandinavian comes from embedded V2 phenomena (e.g. Vikner et al., 2017). Mainland Scandinavian languages exhibit V2 word order in main clauses (9a; Holmberg and Platzack, 1995): the finite verb (skal) is the second constituent in the linear string regardless of whether a subject (9a) or non-subject (9b) occupies sentence-initial position: The traditional analysis holds that V2 movement requires movement of the finite verb to C 0 . Canonical word order in embedded clauses is not V2, as evidenced by the position of the verb with respect to adverbs and negation in (10). This entails that the verb does not move to the embedded C position.
(10) Han er lei for at han antakeligvis ikke skal synge i morgen. He is sad for that he presumably neg shall sing tomorrow 'He is sad because he probably won't sing tomorrow.' It has been observed, however, that V2 word order is possible in some embedded clauses (see, amongst others, Bentzen, 2014;Julien, 2007). For example, in (11) the frame adverbial i morgen ('tomorrow') has been fronted internal to the embedded clause and the verb has moved past the embedded subject: Sentences like (11) provide evidence for extra functional structure in the left-periphery of the clause under the assumption that skal has moved to a head in the CP-domain distinct from the head hosting the complementizer head at ('that') and there exists a specifier position between at and the verb that i morgen can occupy.
In English, embedded fronting of a non-subject does not result in V2/subject-auxiliary inversion. Thus, observing the absence of V2 in embedded clauses like (12) might provide evidence that the language lacks the extra functional structure. 7 (12) He said that tomorrow {he will not | *will he not} sing.
Indirect positive evidence might also come from input sentences that do not involve observing different complementizer-level functional structure: English speech errors might also provide relevant evidence of a difference. It is well known that English speakers produce resumptive pronouns inside islands to 'rescue' ill-formed sentences (e.g. Morgan and Wagers, 2018;Ross, 1967). Importantly, English speakers produce resumptives in precisely the locations where Norwegian would allow gaps. For example, the sentences in (11) were observed in natural discourse: (13) a. There were a bunch of people at the party that I didn't know [who they were]. b. '. . . the sale of the uranium that nobody knows what it means' -Donald Trump (Gore et al., 2016, cited in Morgan andWagers, 2018: 861) c. 'Maybe it was a bad idea to get people together and try to record audio with some equipment that we didn't know how it worked.' (CBC Podcast Personal Best, Episode: 'Know more, Carry Less', ~9:00) Based on examples such as those in (13), a learner with the knowledge that resumptive pronouns and gaps are in complementary distribution would be able to infer that embedded questions are islands in English. Importantly, drawing inferences based on indirect positive evidence requires non-trivial prior knowledge of the implicational relations between overt forms and (families of) underlying structures. Indirect positive evidence of the type we describe above is likely to be relatively infrequent in the learner's input. The relative infrequency of such structures may help explain why our participants appear not to have mastered the appropriate generalization and why there is significant inter-individual variation in outcomes despite long-term exposure to, and instruction in, English. Such effects follow under probabilistic models of grammar competition where conclusively shifting to the subset grammar would require repeated exposure to disconfirmatory evidence (e.g. Yang, 2018).

VI Conclusions
We have argued that native Norwegian speakers erroneously transfer the grammatical source of wh-island insensitivity from their L1 to their L2 English. Such effects are compatible with models of transfer that allow transfer of CP-level functional structure, but not those that restrict transfer to lexical information. We also found evidence that suggested that (some) learners may partially restructure, which we suggested could be triggered by indirect positive evidence. However, our data do not tell us whether the restructuring observed involves transition to the target English analysis or adoption of a divergent compensatory hypothesis that simply ensures closer surface alignment with the English forms.
2. We remain agnostic as to whether 'Full Transfer' entails automatic wholesale transfer of a whole grammar at the beginning of acquisition, or whether individual units of functional structure can transfer on a property-by-property basis in response to input (e.g. Westergaard, 2019). 3. Here we deliberately frame the subset-superset distinction in terms of strings or 'weak generative capacity' rather than subset/superset grammars. This formulation is compatible with at least two possibilities: (1) that the two analyses under consideration map to grammars in a proper subset/superset relation, or (2) that they map to intersecting grammars that can be distinguished by their generative capacities in other areas. 4. There are cases where children appear to overgeneralize in non-conservative ways (see, for example, Ambridge et al., 2013;Mazurkewich and White, 1984;Pinker, 1989). 5. Participants' judgments of L2 island violations were often at, or close to, chance. Insofar as participants did not accept the island violations outright, the results were taken as evidence of UG influence. However, in the absence of information about how participants judged acceptable sentences, it is not clear how strongly to interpret the results. 6. Italian wh-movement exhibits sensitivity to wh-islands, but Italian relative clause movement does not (Rizzi, 1982). It would therefore be possible to test another superset-subset configuration using RC-movement in Italian. 7. Of course, the conclusion that English lacks the functional structure is not forced by data like (12), which is also compatible with the hypothesis that English has the extra structure, but lacks V2 movement generally. Thus, even these cases do not provide unambiguous indirect evidence. The possibility of vestigial or remnant V2 in English constructions with fronted negative and only-phrases further complicate this picture.