Simulating Speech Error Patterns Across Languages and Different Datasets

Children’s speech acquisition is influenced by universal and language-specific forces. Some speech error patterns (or phonological processes) in children’s speech are observed in many languages, but the same error pattern may have different effects in different languages. We aimed to explore phonological effects of the same speech error patterns across different languages, target audiences and discourse modes, using a novel method for large-scale corpus investigation. As an additional aim, we investigated the face validity of five different phonological effect measures by relating them to subjective ratings of assumed effects on intelligibility, as provided by practicing speech-language pathologists. Six frequently attested speech error patterns were simulated in authentic corpus data: backing, fronting, stopping, /r/-weakening, cluster reduction and weak syllable deletion—each simulation resulting in a “misarticulated” version of the original corpus. Phonological effects were quantified using five separate metrics of phonological complexity and distance from expected target forms. Using Swedish child-speech data as a reference, phonological effects were compared between this reference and a) child speech in Norwegian and English, and b) data representing different modes of discourse (spoken/written) and target audiences (adults/children) in Swedish. Of the speech error patterns, backing—the one atypical pattern of those included—was found to cause the most detrimental effects, across languages as well as across modes and speaker ages. However, none of the measures reflects intuitive rankings as provided by clinicians regarding effects on intelligibility, thus corroborating earlier reports that phonological competence is not translatable into levels of intelligibility.


Introduction
One important driver of children's speech and language acquisition is their advancing cognitive and motor capacities. And as children's physiological development is largely universal, some developmental trends are found regardless of what language the child is acquiring. For example, children typically acquire plosives before fricatives, and labial and coronal consonants before velar consonants (Vihman, 1993;Lee et al., 2010;Davis & MacNeilage, 1995;McLeod & Crowe, 2018). Universal trends are also observed in children's delayed phonological acquisition, such that the same speech error patterns (or phonological processes) 1 are observed across many languages. For example, velar fronting has been reported in languages like English, Swedish, and Cantonese (e.g., Cleland et al., 2017;Strömbergsson et al., 2015;Cheung & Abberton, 2000). On the other hand, the words and sounds that children hear drive their developmental trajectory in the direction of the surrounding language (e.g., Gierut & Dale, 2007;Yavaş, 2014). Such language-dependent pressure explains why the voiced fricative /v/ is acquired earlier in Swedish, Estonian and Bulgarian than in English (Ingram, 1988), and why English-learning children's first words are predominantly monosyllabic, whereas same-aged children acquiring Japanese, Swedish and French produce more disyllabic or longer words (Vihman, 1993). In descriptions of children's speech error patterns across languages, researchers point to a shared influence of universal and language-dependent forces (e.g., Clausen & Fox-Boyer, 2017;Hua & Dodd, 2006). The present study targets a related question that also resides in the intersection between language-universal and language-dependent pressure, namely: can typological differences between languages result in the same error pattern having more detrimental effects in one language than in another?
Much research underlying the identification of universal and language-specific trends in children's speech and language acquisition has been based on distributional characteristics of adult language, for example, in the computing of probabilistic phonotactics (Stokes, 2014;Storkel, 2004). Although reports have shown that certain aspects of word frequency that influence children's phonological acquisition are insensitive to whether the analysis is based on adult or child data (Gierut & Dale, 2007), others have identified distributional differences in child data compared to adult data (Dollaghan, 1994;Strömbergsson et al., 2017), raising concerns that more ecologically valid sources should be preferred. The lack of valid sources is a problem which is being remediated through the steadily increasing availability of both child-directed and child-produced speech data, for example, through initiatives like CHILDES (MacWhinney, 2000), PhonBank (Rose & MacWhinney, 2014) and LENA (Ganek & Eriks-Brophy, 2018). However, until such resources are universally available, many researchers have to rely on suboptimal resources. Thus, the need for estimating the degree of uncertainty in predictions of children's spoken language based on adult language use still remains.
We investigate phonological effects of six frequently observed speech error patterns through the simulation of these patterns in language data collected from Swedish, Norwegian, and English sources in order to glean how effects of the same error pattern vary across the three languages. Furthermore, by comparing these effects on different data subsets-written as well as spoken data, and data produced by adults as well as by children-we seek to explore the viability of basing investigations of children's phonological acquisition on more readily available sources than the ideal and most ecologically valid ones.
(i.e., two sequences of characters) is the minimum number of single-character insertions, deletions or substitutions needed to change one word into the other. The Levenshtein distance has been used to quantify the difference between pairs of words produced by speakers of different dialects (Gooskens & Heeringa, 2004). When used in this context, the Levenshtein distance has been found to correlate with objective mutual functional intelligibility scores of closely related languages like Danish, Norwegian, and Swedish (Gooskens, 2006). Hence, the measure can be expected to be revealing aspects of the functional consequences of speech not matching expected target forms.
Among measures quantifying speech production proficiency are also suggestions that combine independent and relational aspects. One such measure is the Phonological Mean Length of Utterance (pMLU) (Ingram, 2002). The pMLU is computed as the segmental length of the child's word productions plus the number of correct consonants in these productions, divided by the total number of word tokens. Therefore, in addition to being sensitive to the accuracy of the segments produced (just like the PCC), the pMLU also reflects variation in word length (i.e., structural characteristics). However, the pMLU measure yields similar scores for short words produced accurately as for longer words produced less accurately. Hence, the complexity of the intended word targets is unaccounted for by the pMLU. Addressing this limitation, Ingram (2002) suggested the Proportion of Whole-word Proximity (PWP) (Ingram, 2002) as a measure reflecting the complexity of the child's production in relation to that of the attempted target. The PWP is calculated by dividing the pMLU of the child's production by the pMLU of the target form. For example, for the production /nana/ for "banana," the PWP would be 6/9 = .67, whereas the PWP of /nana/ for "nanny" would be 6/6 = 1.0. Of Ingram's (2002) suggested whole-word measures, the pMLU and the PWP are probably the most widely used (see e.g., Arias and Lleó, 2014). Further, both measures have proved to be sensitive to children's phonological development over time across different languages (Ingram, 2002;Saaristo-Helin, 2009). However, caution has been raised that languagespecific adjustments may be needed when using the pMLU in cross-linguistic comparisons (Saaristo-Helin et al., 2006). This fact, together with the suggestion of the PWP as at least an indirect measure of intelligibility (Ingram, 2002) makes the PWP particularly relevant for inclusion in the present work.
As is clear from the above, many suggestions for quantifying the quality of children's speech production have been put forth. While some have posited at least an indirect link between speech accuracy and intelligibility (Ingram, 2002;McLeod et al., 2012), the functional validity of different quantifications, that is, to what extent a quantitative accuracy score relates to effects on functional communication, has rarely been tested (Mason et al., 2015). Alternative composite measures are rare, even though many have asserted that a comprehensive measure of phonological acquisition needs to be sensitive to more aspects than to the accuracy of phonological segments (e.g., Ingram, 2002;Mason et al., 2015;Masso et al., 2017). Examination of the relation between different existing measures is necessary when inventorying what measure best reflects the functional consequences of a speech disorder.

Ranking speech error patterns by severity
Intelligible speech is typically a long-term goal in the intervention of speech disorders (Lousada et al., 2014). Hence, although there are multiple considerations weighing into clinical decisions regarding what speech error patterns to prioritize in intervention (see e.g., Rvachew & Nowak, 2001), the ranking of error patterns by their impact on intelligibility would provide additional guidance in the prioritization of treatment goals. Efforts have been made to systematically rank error patterns by (un)intelligibility, but suggested rankings have often been based on clinical intuition, rather than on empirical evidence (Klein & Flint, 2006). For example, it has been suggested that deviant production of speech sounds occurring frequently in the child's language has more pervasive effects on intelligibility than misarticulation of sounds occurring less frequently (Brown, 1988), that misarticulation involving the neutralization of phonological contrasts are more damaging than those not causing lexical confusion (Ingram, 1989), and that inconsistent misarticulation is more detrimental to intelligibility than consistent misarticulation (Yavaş & Lamprecht, 1988). Of the empirically based research on the ranking of error patterns with regards to their impact on intelligibility, Paden's work (1981, 1983) is probably the most well-known. By observing differences in the distribution of error patterns across English-speaking children grouped by level of intelligibility, Hodson and Paden concluded, for example, that omissions of speech sounds are more detrimental to intelligibility than phonetic distortions, and that atypical error patterns are more disruptive than those often occurring early in typical phonological acquisition . As such, these observations have described correlation, that is, indirect links, between error patterns and (un)intelligibility. Approaching the issue from a different angle, Klein and Flint (2006) have reported a unique study of more direct links between speech error patterns and effects on intelligibility. Through the simulation of error patterns, these researchers were able to control both the types and the distribution of misarticulation, and, via calculating the proportions of words understood by listeners, to explore their consequences in terms of intelligibility. By ranking the three included speech error patterns-final consonant deletion, stopping and velar fronting-by their impact on intelligibility, final consonant deletion was found to have the most detrimental effect (Klein & Flint, 2006). However, the study has some methodological shortcomings, particularly pertaining to ecological validity. Considering that the results are based on "distorted" phonemic transcriptions being read aloud by an adult male, that is, recordings of an adult speaker producing error patterns that are highly unexpected in adult speech, it cannot be ruled out that listener responses reflect other factors than merely intelligibility. Nevertheless, with the desirable feature of allowing control over the distribution of error patterns, distorting phonemic transcriptions may still be a viable approach to studying the impact of different error patterns in context, particularly for large-scale transcription-based studies.

Phonological structure across different languages
Cross-linguistic explorations of phonological structure need to rely on the phoneme as the unit of analysis, rather than the phone or allophone (McLeod & Crowe, 2018). Further, consonantsrather than vowels, tones or prosodic features-are most often of particular interest in the study of phonological acquisition, within and across languages (McLeod & Crowe, 2018). Inevitably, then, reduction of phonetic detail in children's actual speech production is often accepted, and a segmental focus assumed, in order for more general patterns of similarities and differences between languages to be revealed.
Historically, the study on speech acquisition and on speech sound disorders in children has predominantly been based on English. However, more recent cross-linguistic research has shed light on phenomena described for English that do not generalize to other languages, due to differences in linguistic structure (Saaristo-Helin et al., 2006;Saaristo-Helin, 2009;Bernhardt et al., 2017;Hua & Dodd, 2006b). Although sharing the same Germanic origin, Swedish and Norwegian differ phonologically from English in a number of ways, see Table 1. For example, the Scandinavian languages have more vowels than English, and tonal accents.
We are not aware of any previous investigations involving large-scale transcription-based quantifications of the phonological effects of simulating common error patterns in authentic linguistic data. Hence, the present investigation constitutes a first effort at this task. Basing the investigation on phonologically transcribed data allows relatively convenient upscaling and extension to other languages. Here, phonological effects of the same error patterns are explored across child-speech transcripts in English, Swedish, and Norwegian. Knowing that Swedish and Norwegian are more similar to one another than either of the two is to English, larger differences are expected between English and the two Scandinavian languages, than between the two Scandinavian languages.

Phonological structure across different linguistic sources
Although ecologically valid linguistic speech data is to be preferred, the reality is that at present both the quantity and the quality of available child-speech corpora differ greatly between languages. Therefore, researchers may have to rely on other types of more readily available data. While some phonological phenomena may be insensitive to the specific mode and target age group of the underlying linguistic data, other factors are less robust. For example, in their investigation of the potential influence of word frequency on phonological acquisition in children with a Table 1. Phonological descriptions of Swedish, Norwegian, and English. The description of Swedish is based on Riad (2014) and Engstrand (2004), the description of Norwegian on Kristoffersen (2000), and the description of English on Smit (2004).
/aɪ, aʊ, ɔɪ/ Tonal accents Accent 1 and Accent 2 3 Accent 1 and Accent 2 -Syllable shape phonological disorder, Gierut and Dale (2007) demonstrated results that were consistent across both spoken and written data, and across child and adult data, and they concluded, therefore, that for this specific purpose, it was not necessary to rely on a corpus of transcripts of utterances produced by children. However, more recent studies highlight distributional differences across modes of discourse and populations, thus calling for corpus-based language acquisition studies to be based on data collected from more ecologically valid sources, such as children's lexicons derived from spontaneous speech data (e.g., Daland, 2013;Tsuji et al., 2014). This caution is resonated by Stoel-Gammon (2011), who calls for the study of word frequency effects to include a variety of measures, such as adult word counts, word counts of child input and output, collected from many children, as well as of word input and output in individual children. A systematic investigation of variation in linguistic structure across different modes (spoken vs. written) and different ages (children vs. adults) requires data from four types of sources: child and adult speech data, and texts written for child and adult target audiences. To that end, we explore phonological effects of the same error patterns across such linguistic sources.

Aim
The aim of the current investigation was to explore phonological effects of the same speech error patterns across different languages, target audiences and discourse modes, using a novel method for large-scale corpus investigation. Using phonemic transcripts derived from a dataset of spoken data produced by Swedish children as a reference, and through comparison between this reference and a) corpus data representing other languages (Norwegian and English), and b) Swedish corpus data in different modes of discourse (spoken/written) and intended audiences (adults/children), we aim to explore the following research question: How are commonly observed speech error patterns quantitatively ranked by severity across Swedish, Norwegian, and English?
We quantify severity by measures of phonological complexity and accuracy. Speech error patterns reported as being more severe (e.g., atypical patterns > typical patterns; omissions > substitutions) are also expected to be quantitatively ranked as more severe-across all three languages. We explore this question on the most ecologically valid data, that is, child-produced speech. Due to the relative scarcity of optimal data sources, the question will also be investigated using less ideal (but more easily accessible) data, namely, text. Hence, as a secondary purpose, we aim to explore the potential distortion of basing investigations of children's phonological acquisition on material other than that of spoken utterances produced by children, with the following research question: Within the same language, how does severity vary across a) the age of the producer (adult vs. child), b) the age of the intended audience (adults vs. children), and c) the mode of linguistic data (spoken vs. written)?
Based on preliminary data presented in Strömbergsson et al. (2017), where differences in phonotactic distributions were identified between different types of linguistic data within one and the same language, differences are expected also in what effects simulated error patterns may have. With fewer opportunities where an error pattern is applicable, the smaller the expected effect. For example, if the consonant cluster /str/ occurs less frequently in child-produced speech than in other data sources, the effect of cluster reduction can be expected to be less severe in this dataset.
Because the included quantitative measures of severity concern different phonological aspects of the data, and because it is unknown to what extent these aspects relate to consequences in functional communication (e.g., in terms of intelligibility), the face validity of the rankings provided by the included measures is explored through the following research question: To what extent do the Swedish estimated rankings of severity correspond to intuitive rankings collected from practicing clinicians?
There is little empirical evidence of correlations between level of severity and effects on intelligibility. A finding that any of the included measure correlates with clinically intuitive rankings would strengthen the functional validity of this measure.

Materials
For each of the three languages in this study, a collection of orthographically transcribed spoken data produced by children was selected from open data resources (see Table 2). For child-produced speech, there are high quality corpora for both British and American English. We decided to use both American and British data, which is not unusual in studies of phonological acquisition, for example, . For Norwegian and Swedish, we used all child-speech corpora freely available for research. For Swedish, additional spoken and written data produced by adults was selected from open data resources to provide a balanced representation of the language. An overview of this collection is presented in Table 3. For descriptions of the corpora as well as references, see Appendix A.
Concerning the speech data, child-produced and (adult-produced) child-directed data included orthographic transcripts of spontaneous interaction between infants, toddlers, and young children (up to 8 years) and adults in a naturalistic free-play setting at home, or in a lab. These transcripts Table 2. Overview of included child-speech corpora, for each of the three languages. All corpora are distributed via CHILDES/PhonBank (MacWhinney, 2000;Rose & MacWhinney, 2014) or listed by the OLAC (Bird & Simons, 2003). For a description of the corpora, see Appendix A.

Language/Dataset
Tokens Types  (Bird & Simons, 2003). The (adult-produced) adultdirected Swedish speech data consisted of orthographic transcripts of spontaneous dialog obtained in a lab or an interview setting. For all speech data, orthographic transcripts were used in their original forms, without special considerations paid to potential variation regarding how nonstandard speech forms were represented in the transcripts. As for written data, the child-directed data encompassed fiction with a child target audience (6-12 years). The adult-directed written data was selected to represent a broad set of genres and domains, consisting of about 25% fiction and 75% informative prose (e.g., news text, official documents, academic texts, periodicals). A detailed description of the sources for the Swedish data set can be found in Appendix A.
As the CMU dictionary includes examples of all major categories of American and British spelling variants (e.g., "-or"/ "-our" or "-ize"/" "-ise"), this procedure also includes normalization between British and American English spelling conventions. That is, common words spelled according to British conventions are transformed into American English pronunciation during lexicon look-up, and any word not found in the lexicon is handled by the g2p model, which has been trained to handle both American and British spelling variants as found in the CMU dictionary.
Speech error patterns were simulated in the phonemic transcripts by automatically replacing specific phonemes or phoneme sequences with other phonemes, across all word tokens in an entire corpus subset. The error patterns were applied one at a time, thus generating one "misarticulated" corpus subset version per error pattern. Table 4 lists the six specific error patterns implemented, selected on the basis of having been frequently attested in at least two of the three included languages. Another selection criterion was practicability-only context-independent error patterns were selected, based on their straightforward implementation. (This excluded context-dependent patterns like metatheses and assimilation.) Furthermore, the simulated patterns were restricted to substitutions with segments that hold phonemic status in all three languages. (This excluded, for example, lateralization of /s/, or an /r/-weakening pattern where /r/ is realized as [w].) One can note that one of the six error patterns-backing-is categorized as atypical in all three included languages (Hua & Dodd, 2006b;Lohmander et al., 2013;, whereas the remaining five are often observed early in typical speech acquisition (Hua & Dodd, 2006b;Lohmander et al., 2013;. 3 Including an atypical error pattern among those investigated allowed the chance of exploring the hypothesis that atypical error patterns have more detrimental effects than typical error patterns.  Reported observations of the realization of the six different error patterns have varied. For example, descriptions of cluster reduction illustrate that its realization varies extensively, making implementation particularly challenging (see discussion in McLeod et al., 1997). Further, descriptions of cluster reduction exist primarily for word-initial clusters . In the present study, cluster reduction was not restricted to word-initial position, but was applied in syllable-initial position, thus applying to words like (the Norwegian) "trekker" (English: "pulls"), but also to compounds like (the Norwegian) "juletreet" (English: "the Christmas tree"). This reflects observations reported for both Swedish and Norwegian, and is therefore the desired behavior. For a more detailed description of the implementation of error patterns, together with listed references attesting their realization, see Appendix B. All scripts were computed in Python and are available at https://sprakbanken.speech.kth.se/data/simulerror/.

Clinical survey
A clinical survey was conducted among the audience of a research seminar, held by the first author at an SLP clinic in the Stockholm region, directed to the practicing SLP clinicians employed at this clinic. In all, around 60 individuals participated in this seminar. Forms were distributed to the participants, where the six error patterns (see Table 4) were listed, together with examples illustrating each of them. The participants were instructed to rank the listed error patterns from the least to the most severe, in terms of effects on intelligibility. The rankings were indicated with numbers, ranging from 1 (least severe) to 6 (most severe). On the same form, they were asked to fill in for how long they had practiced as an SLP (in years), and for how long they had worked clinically with children (in years). Forty-seven forms were returned. Fourteen forms were excluded from the analysis; three because they contained ambiguous information, and 11 because the participants had less than one year's experience of clinical pediatric SLP practice. The remaining 33 forms were included in the analysis, representing participants with an average number of years as a practicing SLP of 7.6 years (max = 38; min = 1; SD = 7.1 years), and an average number of years working with children of 4.8 years (max = 20; min = 1; SD = 3.7 years). Table 5 lists the five measures used in quantifying the phonological effects of the simulated error patterns. Structural measures were used to characterize the words in the datasets in terms of phonological complexity, in their original version as well as in "misarticulated" versions, thus enabling estimation of the structural impact of "misarticulation." For example, the difference between the IPC score of the original phonemic transcription of a word, and the IPC score of same word having undergone velar fronting is used as a quantification of the effect of velar fronting for that word type. As an illustration, the Swedish original transcription of the word "gubbar" (Eng. "old men") / 2 ɡɵ.bar/ yields an IPC score of 1, whereas the velar fronted version of the word, / 2 dɵ.bar/, receives an IPC score of 0; hence, in this case, the velar fronting results in a reduction of complexity by one IPC point. "Misarticulation" resulting in a reduction of complexity would, then, result in negative values, whereas positive values would reflect increased complexity. Relational measures were used to quantify the difference (or "distance") between original versions and "misarticulated" versions of datasets.

Data analyses
All structural measures were calculated on open-class words, that is, excluding entries appearing in pre-specified lists of closed-class words. For the PWP, this follows Ingram's (2002) original description of the pMLU calculation. For the IPC and the WCM, the original descriptions are unspecified regarding what types of words are included; however, restricting application to open-class words follows previous work of, for example, Howell and Au-Yeung (2007). Lists of closed-class words were created by manual identification of closed-class words, including various forms of auxiliary and copula verbs among the 100 most frequent word types across all corpus subsets. The purely relational measures-PCC and LVN-were calculated on all word tokens, that is, including both closed-class and open-class words. This is in accordance with the original description of the PCC (Shriberg & Kwiatkowski, 1982). Effects of a specific error pattern within Table 5. Structural, relational, and combined measures used for quantifying linguistic structure and/ or effects of the simulated speech error patterns, together with a description of the calculation of each measure. As an illustration, calculations and outcome scores are provided of an original phonemic transcription Trsc_orig / 2 spøː.ket/ having undergone stopping, resulting in the "misarticulated" phonemic transcription Trsc_error / 2 pøː.ket/ (Swedish: "spöket," Eng. "the ghost").
pMLU Trsc_orig = 10 (6 for n segments + 4 for n consonants) pMLU Trsc_error = 8 (5 for n segments + 3 for n correct consonants) 0.8 1 As no adaptation of the WCM is available for Norwegian, the Swedish adaptation was used also for Norwegian. 2 Here, based on the number of consonants in an original transcription of a word, and the number of consonants left unchanged in the "misarticulated" version of that transcription. 3 Note that the SAMPA version of / 2 spøː.ket/ is /""sp2:$ket/, that is, the marker of accent 2 is indicated by two characters, in contrast to the IPA transcription. Hence, the total number of characters in /""sp2:$ket/ is 10.
a particular corpus subset were expressed as means and standard deviations across word tokens. All scripts analyzing phonological effects were computed in Python and are available at https://sprakbanken.speech.kth.se/data/simulerror/. The face validity of the rankings provided by each of the five different measures of phonological effects was examined through a Kendall's τ rank correlation analysis in relation to average intuitive rankings as obtained via the clinical survey. For this analysis, effect scores as obtained by the structural measures of complexity-the IPC and the WCM-raw values were used in the correlation analysis, thus representing difference in complexity regardless of whether the difference resulted in increasing or decreasing complexity. Figure 1 illustrates the ranking of the six included error patterns, with regard to phonological severity as measured by relational (top panel), structural (middle panel), and combined measures (bottom panel), based on child-produced Swedish, Norwegian, and English speech. The graphs illustrate a general trend with backing consistently being ranked as having the most severe phonological effects, across all measures, and across all three languages. Another general pattern is that stopping is ranked as the error pattern having the second most severe effects, often larger for English than for the two Scandinavian languages. It can also be noted that, with respect to both structural measures, backing involves an increase in phonological complexity. This is also found for /r/-weakening, but only for Swedish and Norwegian, and only with respect to the WCM; for the IPC, /r/-weakening involves no or a negligible reduction of complexity.

Phonological severity ranking: Cross-linguistic variation
The results concerning the other end of the scale, regarding which of the six error patterns results in the smallest phonological effects, are more divergent. weak syllable deletion is, however, consistently ranked as one of the error patterns resulting in the smallest phonological effects, across the three languages. /r/-weakening is also most often ranked as one of the patterns resulting in the smallest phonological effects.
In terms of differences and similarities across the three languages, the expected difference between English and the two Scandinavian languages is modest. stopping can be identified as the error pattern which exhibits the most salient difference; here, the phonological effects are consistently more detrimental in English than in the two Scandinavian languages. A similar tendency can be observed for backing, where effects are more detrimental in English compared to Swedish and Norwegian; however, this only holds true with regard to the structural and combined measures. More unexpected is the observation of a difference between Swedish on the one hand and Norwegian and English on the other hand, concerning cluster reduction, where effects are more detrimental in Swedish, at least with respect to structural and combined measures.
3.2 Phonological severity ranking: Alternative data sources 3.2.1 Age of the speaker. Figure 2 illustrates that phonological effects of the six included error patterns are often larger when estimated on Swedish adult-produced speech (directed to children), compared to when estimated on child-produced speech extracted from the same conversational setting. However, the difference is generally one of degree rather than of quality-the rank order generally remains the same across the two datasets. (An exception is the ordering of cluster reduction and fronting with regards to the WCM.) Hence, backing results in the largest effects, and weak syllable deletion in the smallest effects, regardless of whether speakers are children or adults. Further, the difference between datasets is smaller with respect to relative measures, compared to when measured by structural and/or combined measures. For the relative measures (PCC and LVN), differences between child-produced and adult-produced speech are subtle, except for /r/-weakening and stopping, where phonological effects are larger in adult-produced speech than in child-produced speech. (For these error patterns, the difference between child-produced and adult-produced speech is also observable in the structural and combined measures.)

3.2.2
Age of intended audience. The variation in phonological effects across the age of the intended audience, that is, whether adult speakers address children or adults, is also illustrated in Figure 2. Here, differences between (adult-produced) child-directed and adult-directed speech are often smaller than differences observed between child-produced and adult-produced speech. Where differences between child-directed and adult-directed speech do occur, it is in the direction of larger effects in adult-directed compared to child-directed speech. The most salient difference is observed for weak syllable deletion; this is the one error pattern where the difference between childdirected and adult-directed speech is consistent across measures (although quite subtly with regards to the PCC). One can also note that for /r/-weakening and stopping, phonological effects do not vary across the age of the intended audience. Instead, a larger difference is observed between the adult-produced speech and the child-produced speech reference, such that effects are more severe in adult-produced speech. Again, the ordering of the error patterns with respect to phonological effects remains largely the same across conditions; hence, backing consistently results in the largest effects in both adult-directed and child-directed speech. Figure 3 illustrates variation in phonological effects of the six error patterns across different discourse modes: spoken vs. written data. (Note that in order to isolate the modes of the data sources, text vs. speech, only adult-produced data was included in this analysis, as-for natural reasons-very little data was available from children producing text. Child-produced speech data is, however, included for reference.) As the graphs illustrate, basing the analysis on general text data results in larger effects than basing the analysis on speech data produced by adults, and even larger when compared to speech produced by children. (Note, though, that for fronting, the difference between adult-produced and child-produced speech is not as pronounced as for the other error patterns.)

Mode of discourse-spoken vs. written.
Notably, although the differences between text and speech data are salient with respect to structural and combined measures, they are more subtle with regard to the relational measures (PCC and LVN). Concerning the relational measures, one can observe larger effects for /r/-weakening and stopping in adult-produced language (quite) regardless of mode, compared to child-produced speech. Hence, the results illustrated in Figure 2 for /r/-weakening are reflected here, too: in terms of relational outcome measures, /r/-weakening and stopping result in larger differences from expected targets in adult-produced language compared to in child-produced language. Apart from these observations, Figure 3 illustrates a now familiar overall trend concerning the ordering of error patterns: backing is consistently, and by far, ranked as the pattern resulting in the largest effects, followed by stopping and, although less consistently, fronting. On the other side of the scale, weak syllable deletion and (although less consistently) /r/-weakening and cluster reduction result in the smallest effects.
Visual inspection of the different outcome measures across the described comparisons allows some general observations concerning the nature of the different measures. For example, the two complexity measures-the WCM and the IPC-yield different outcomes regarding /r/-weakening, such that according to the WCM, /r/-weakening results in increasing complexity, whereas with respect to the IPC, it does not affect phonological complexity at all.

Face validity: Rank ordering with regard to intelligibility
The results of the ranking of the six error patterns with respect to their impact on intelligibility as provided by Swedish practicing clinicians are presented in Figure 4.
As indicated in these average rankings, /r/-weakening was ranked as having the least detrimental effect on intelligibility. weak syllable deletion was ranked, on average, as having the most detrimental effects on intelligibility, although the distance to backing on the same end of the scale is small. stopping, fronting and cluster reduction were, on average, ranked as having less severe impact, but still clearly more severe than /r/-weakening. Table 6 lists the average clinical ranking values together with the average outcome scores of each of the five computed severity measures. Table 7 presents the results of the correlation analyses between the clinically intuitive rankings and the average severity scores as obtained by each of the five included severity measures. (Note that for the two structural measures, the IPC and the WCM, absolute average scores were used in the correlation analyses, rather than the raw values presented in Table 6.) As shown in Table 7, none of the included measures correlate with the intuitive rankings of effects on intelligibility. Furthermore, the table illustrates patterns of interrelations between the five measures, for example, that the PCC and the LVN, that is, the two relational measures, are in close agreement. Also, the WCM correlates strongly with all other measures. The combined measure PWP correlates strongly with the two structural measures, that is, the IPC and the WCM.

Discussion
This investigation set out to explore phonological effects of six frequently reported speech error patterns when simulated in transcriptions of authentic samples of children's speech production, and how the effect ranking of these patterns varies across languages, and across different types of linguistic data. We aimed at finding answers to whether findings in one language can be generalized to other languages, and to whether these kinds of investigations of children's speech acquisition need to rely on the most ecologically valid kind of data (i.e., child-produced speech), or whether they can be based on more readily available data (like, for example, text produced by adult writers). Although replication and extension is necessary to allow definite conclusions, we discuss the knowledge gained from our findings, and what theoretical and practical implications these have.
Concerning the first research question-how the six error patterns are ranked by phonological effects across three different languages-the findings are mostly in line with our expectations. Indeed, the one pattern regarded as atypical in the included languages, backing, was consistently ranked as causing the largest phonological effects, across measures, and across the three languages. Almost as consistently, stopping was ranked as the pattern causing second largest effects. Less expected was the finding that the two error patterns involving omission rather than substitution of segments-weak syllable deletion and cluster reduction-were ranked toward the other end of the scale, that is, as causing the smallest phonological effects, across measures, and across the three languages. A possible explanation to this somewhat surprising finding is that the data underlying the analysis-that is, child-produced speech-offer fewer opportunities where these patterns are applicable, compared to language produced by adults. This would align to suggestions of lexical selectivity (e.g., Vihman et al., 2014;Willadsen, 2013), that is, that children's speech production is skewed by their preference to produce words that are composed of phonological elements and structures that they master. The finding that these patterns have larger phonological effects in adultproduced data corroborates this hypothesis. On the surface, the surprisingly small effects caused by weak syllable deletion and cluster reduction in child-produced speech appear to oppose  suggestion that omissions are more disruptive than substitutions. However, Table 7. Kendall τ correlation coefficients (with p-values, two-tailed) for associations between clinically intuitive rankings of severity with regards to effects on intelligibility and outcomes as measured by each of the included measures of phonological effects (based on child-produced speech data).  whereas Hodson and Paden make claims about effects on intelligibility, the present study cannot make claims beyond effects as measured by phonological structure and accuracy; hence, this seeming contradiction may adhere to the fact that phonological effects do not necessarily correspond to level of intelligibility.
In terms of the rank order of severity, there are no dramatic cross-linguistic differences-the six patterns are ranked similarly across the three languages. This can be expected, as all three languages share many typological traits. There were, however, some differences in the size of the phonological effects. For one, stopping was found to have larger effects in English than in the Scandinavian languages, thus confirming the expected pattern of cross-linguistic differences. Examination of the most frequent words in the three respective languages reveals that many of the top-most frequent words in English contain fricatives (e.g., "the," "this," "is"), whereas this is not the case for the corresponding words in the Scandinavian languages (e.g., "det," "den," "är"). This raises the suspicion that the difference is driven by the high frequency of fricatives in the most frequent English words. However, as closed-class words were excluded from the analysis based on structural and combined measures, this fact alone cannot explain the difference. Rather, the fact that English has more fricatives overall-reflected in both types and tokens-than the Scandinavian languages could be a more plausible explanation. Consequently, stopping can be expected to cause more problems in English than in the Scandinavian languages.
Less expected was the observation concerning cluster reduction, where smaller effects were observed for Swedish than for English and Norwegian. Indeed, the proportion of word tokens in each language where cluster reduction was applied was smaller for Swedish (5%) compared to Norwegian (8%) and English (7%). A possible explanation behind this unexpected variation could be that either the Swedish or the Norwegian child-produced speech dataset is skewed. Considering that these corpora consist of data longitudinally collected from only a few children, this is quite possible. By extending the analysis to include other types of linguistic data in English and Norwegian, such tendencies in the child-speech data could be revealed.
Regarding the second research question, concerning the variation in phonological effects across different types of linguistic sources, the results show a general trend of phonological effects being the largest in general text data, followed by adult-produced speech directed to adults and adultproduced speech directed to children, where effects in turn were larger than in child-produced speech. This illustrates that there are indeed differences between the datasets in terms of their phonological characteristics, in line with previous observations (Strömbergsson et al., 2017). However, as the difference is primarily one of degree rather than of quality-the rank order remains largely the same across speaker and listener age-there are aspects in adult-produced speech that can be generalized to child-produced speech. Although this finding needs replication in other languages, it is an indication that large-scale investigations of ranking of severity of different error patterns may be based on adult speech data in cases where authentic child-speech data is not available. In cases where the specific magnitude of the effect is important, this is, however, not recommended.
Concerning adult-produced speech directed to other adults and to children, respectively, and the difference observed with regard to weak syllable deletion, this could be seen as a confirmation of lexical selectivity as an explanation to why this error pattern results in surprisingly small effects in the child-produced speech data. One can note that the difference is primarily one between childproduced speech and adult-produced speech directed to adults, rather than one driven by the age of the speaker. This may be a consequence of the adults adapting their speech to the child they are talking to. A similar trend can be seen in the comparison across different modes of discourse (speech vs. text). At least with reference to complexity measures and combined measures, this trend even affects the rank ordering of the error patterns, so that weak syllable deletion has similar or even more degrading effects than fronting.
The patterns /r/-weakening and stopping were found to result in larger phonological effects in adult-produced language (text and speech) than in child-produced speech. Notably, the difference in effect size between adult-produced speech and child-produced speech was larger than that observed between child-directed and adult-directed speech produced by adult speakers. Hence, even though the speech produced by children and most of the child-directed speech produced by adults were retrieved from the same conversational data, fricatives and /r/s occur more frequently in the adults' than in the children's speech production. In that respect, there were no signs of adult adaptation to the children's phonological or lexical preferences. It should be cautioned, however, that a possible explanation could be sought in the original orthographic transcripts. Word-final /r/s are often reduced by Swedish speakers of all ages. However, we have found that in the Swedish child-speech corpus used in this study, word-final "r"s are orthographically transcribed in adults' speech but less so in the orthographic transcripts of children's speech. An exploratory study of ten frequent Swedish verbs in present tense showed that 57% of all occurrences in the children's speech are transcribed as the reduced form compared with less than 3% in the adults' speech. Taken at face value, this suggests that the children's speech is more reduced than the adult's speech. However, we wonder if the difference is due to the purpose of the transcripts: the child-speech transcripts describe the performance of the child during speech development, whereas the competence of the adult speaker is taken for granted in the adult transcripts.
The ordering of error patterns as provided by the panel of clinicians closely matches expected phonological development, with the error patterns ranked as having the most detrimental effects on intelligibility being the ones that are typically only seen in the earliest stage of phonological acquisition (or, in the case of backing, not expected at all in typical phonological acquisition). So, for example, weak syllable deletion, ranked by the clinicians as the pattern having the most detrimental effects on intelligibility, is also an error pattern which is expected to be overcome early in speech acquisition , whereas /r/-weakening, ranked as the least detrimental pattern, is observed in typically developing children as late as around seven years of age . It cannot be ruled out that the clinicians' knowledge and experience of these norms may have influenced their ranking decisions. Regarding the relation between the included severity measures and the clinical intuitive rankings, no correlation was found. Hence, none of the severity measures reaches face validity. As alluded to before, this may be a consequence of the measures being sensitive to broad phonological characteristics, whereas the clinical intuitive rankings were given with reference to effects on intelligibility; maybe, there is simply not a straightforward correspondence between these two aspects of speech production. It is possible that more sensitive phonological measures would better reflect perceptual similarity/distance between target and "misarticulated" transcriptions, such as by weighting phonological similarity between symbols. This would take into account, for example, that vowels are more similar to other vowels than to consonants, and that fricatives are more similar to other fricatives than to other consonants (see e.g., Preston et al., 2011). For measures purporting to reflect functional effects of deviation from expected target forms this is a desirable feature to be implemented in future work. Doubtlessly, there are also non-phonological factors that contribute to intelligibility, such as the linguistic context, to which the analysis in the present study is insensitive. Moreover, factors like functional load, that is, the quantification of the effect of a loss of a phonological contrast in a language (Stokes & Surendran, 2005), can also be assumed to influence intelligibility. Including contextual information as well as lexical factors like functional load are natural and important venues for future extensions to this work.
A couple of reflections regarding the nature of the included severity measures deserve mentioning. First, it can be noted that according to the structural measures, backing and-although to a lesser extent-/r/-weakening involve increasing articulatory complexity. This is a natural consequence of velar speech sounds being categorized by these measures as more complex than alveolar/dental speech sounds (Jakielski, 2016;Marklund et al., 2018;Stoel-Gammon, 2010), which, in turn, is motivated by an expected later age of acquisition of these speech sound types. In other words, if early acquired (i.e., less complex) sounds are replaced by late acquired (i.e., more complex) sounds, this will be reflected as increased complexity. For backing, this adds to the already established view of this pattern as atypical (Dodd, 2005;. For /r/-weakening, the counterintuitive finding that it should lead to unchanged or even increasing complexity (as observed with regards to the WCM) is a consequence of the phonological classification of the /r/ and /j/ sounds in the included languages. In all three languages, /r/ can take many allophonic forms-in Swedish, for example, [ʐ ʂ ɹ ɾ] (Engstrand, 2004). The phonological classification that both the IPC and the WCM are based on requires reducing this variation and selecting one of these allophonic variants as representing /r/. For all three languages, /r/ was classified as a voiced alveolar rhotic sound. For Swedish and Norwegian, /j/ was classified as a voiced palatal fricative. As the Swedish version of the WCM awards /r/ with one point, but voiced fricatives with two points, the substitution of /j/ for /r/ is reflected in increased complexity. In English, /j/ was categorized as a voiced palatal approximant, representing a sound type which is not awarded any complexity points at all; hence the substitution of /j/ for /r/ will result in reduced complexity in English. This illustrates an inherent restriction in allowing one symbol to represent the many allophonic shapes and forms a specific speech sound may take, and consequently, an issue that needs to be supplemented by investigations based on actual speech data.
Another reflection concerning the nature of the severity measures can be made regarding the close correspondence between the Levenshtein distance and the PCC. Although both these measures are based on counts of symbol omission, insertion, and substitution, they are different in respect of whether they measure distance or similarity (hence, they are inverted), and whether all symbols or only consonants are considered. However, for all implemented error patterns except weak syllable deletion, only consonants were affected, and hence, the measures were expected to be closely (inversely) correlated. This observation may motivate researchers to rely on the Levenshtein distance instead of the PCC in large-scale studies of phonological accuracy, as there exist many freely available scripts to use for its implementation, whereas implementation of the PCC may be more cumbersome. In the present study, the PCC was based on the number of consonant changes made to the original transcriptions in the simulation of misarticulation. In studies where the only information that exists is a target transcription and an observed transcription, the calculation of the PCC will have to rely on other procedures.
Concerning the results for English, it should be borne in mind that the results are based on American English pronunciation variants, stemming from the American lexicon from which phonemic transcriptions are retrieved/derived. This is perhaps most relevant to the results concerning /r/-weakening, where British English phonemic transcriptions would have offered fewer opportunities where this pattern would be applicable. For example, in words like "car," the word-final "r" would be dropped in many British English variants, disallowing /r/-weakening to be applied. Hence, the effects of /r/-weakening presented here can be expected to be more pronounced than they would have been if they were instead based on British English. In this context, it should also be acknowledged that the selected /r/-weakening pattern where /r/ is substituted for [j] is considerably less common in English than /r/ substituted for [w] (Smit, 1993). This serves to underline that the primary value of cross-linguistic comparisons like the present is not the absolute values presented for each of the included languages, but rather, the relation between values across the included languages.

Limitations and future research
In all corpus-based studies, there is a risk that the corpora are not wholly representative of the language discourse the researchers intend to examine. This risk is present also here and isolating only one factor when comparing different languages and data sources proved to be challenging. For example, in the cross-linguistic analysis, there are other factors that vary between the Swedish, Norwegian, and English datasets than merely the language, such as the age of the children included. This should be kept in mind in the interpretation of the findings.
As mentioned above, the included orthographic transcripts do not always follow standard orthographic spelling. For example, in the Swedish CHILDES data, spellings may sometimes reflect both common reductions and rarer misarticulation, like "fö" (for "för," English: "for"), "fyplan" (for "flygplan," English: "airplane"). In cases where non-adult-like learner forms like the English "eated" occur in the orthographic transcripts, this is not a problem. Such cases are handled by the grapheme-to-phoneme conversion, which provides a phonologically reasonable guess of how the previously unseen word is produced. However, in cases where the speaker's reductions/misarticulations are represented in the orthographic transcripts (such as in "fyplan" mentioned above, where a cluster reduction is represented), the estimation of the effect of the simulated misarticulation will be obscured. As the misarticulation is already in the original production, it will-erroneously-be assumed to be the target production. On close examination, this phenomenon was indeed found in the Swedish CHILDES data, as described above. For the English and Norwegian datasets, this was rarer. This should serve to encourage Swedish child-speech researchers to collect and share their data, in order to increase the availability of higher-quality child-speech resources in Swedish. From the authors' point of view, orthographic documentation of speech data should follow standard spelling conventions, ideally with each orthographic word aligned to a phonemic transcription.
The methodological approach used in this study in the simulation of misarticulation is based on a number of assumptions, each of which may be questioned. For one, the error patterns are implemented across the board in an entire dataset. Obviously, this is not an exact representation of children's speech production in real-life. For example, although consistent velar fronting has been attested in many cases (e.g., Cleland et al., 2017), a tendency for velar fronting to be driven by word/syllable context has also been reported (McAllister Buyn, 2012;Mason et al., 2015). Variation in speech production driven by phonotactic frequency is also not reflected (see James et al., 2008). Restraining the distribution of error patterns by syllabic/word context and taking phonotactic frequency into account is feasible within the suggested framework. This is an interesting venue for future work.
Regarding the set of quantitative measures implemented in the present study, it should be noted that all were originally designed to be applied to actual samples of children's speech production. Hence, their values may be conspicuously inflated when applied to thousands, or even millions, of words as in the present study. Typically, in the measurement descriptions, minimum sample sizes are specified (e.g., 100 words for the PCC in Shriberg & Kwiatkowski, 1982), whereas upper limits are rarely considered.
The dataset consistently used as a reference in the present study was Swedish child-produced speech data. No investigation was made into whether results for the other two languages differed dependent on the age of the producer or the audience, or between the modes of the data. This remains a topic for future studies. Also, although there are expected differences between the two Scandinavian languages and English, all three languages included here are from the Germanic language family. Based on previous research observing cross-linguistic differences in phonological structure between children's speech production in English compared to other languages, such as Finnish and Cantonese (e.g., Hua & Dodd, 2006b;Saaristo-Helin et al., 2006), another interesting focus of future studies would be to extend the investigation to include languages that differ more from English. This would complement important large-scale cross-linguistic investigations like those proposed by Bernhardt and colleagues (2017) and pursued by McLeod and Crowe (2018), in addressing a calling need for increased knowledge into phonological acquisition in a wide range of languages other than English.

Conclusions
This investigation is explorative in its nature, and-to our knowledge-the first large-scale investigation of phonological consequences of different speech error patterns across languages, different modes, and age of speaker and intended audience. The rank order of speech error patterns remained rather stable across the three included languages, with backing being ranked as the most severe across all three languages. Hence, although backing was already known to be an atypical error pattern, our findings add the information that it also causes more detrimental phonological effects than the other five, thus serving as support to prioritize it in clinical intervention. Further, stopping was found to cause more detrimental effects in English compared to the two Scandinavian languages. This effect may be linked to typological differences between the included languages. The observation that rank order remained rather stable across modes of discourse and speakers' ages indicates that for the purpose of ordering error patterns by phonological effects, readily available sources like text corpora may be used as a proxy for the ideal and most ecologically valid data. Finally, the finding that none of the included metrics of phonological effects reflected clinicians' intuitive ratings of different speech error patterns in terms of effects on intelligibility corroborates earlier suggestions that phonological competence does not necessarily translate into level of intelligibility.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Swedish Research Council (VR 2015-01525).    Grunwell (1987); Yavaş (1998); Smit (1993b) SW: [ 2 kata] skratta Magnusson (1983) NO: N/A 6 1 To our knowledge, this is the only published observation of an example of a possible case of coronal backing in Norwegian. Note, however, that the example may well be contextually driven, i.e. an assimilation.
2 In cases of /sC/-clusters, the /s/ is removed. 3 No reports found of the form stop+nasal. Now treated as in Swedish, i.e. reduction to stop, except when C 2 is syllabic. (børsten -> / 2 bøʈɳ̍ /) 4 Except for /sl/ in English, which is treated separately.
5 But see Hewlett (1992) and Ingram (1976) for reports of omission of C 1 , and Yavaş (1998: 136) and Klopfenstein and Ball (2010) for reports on omission of either C 1 or C 2 . 6 We have not found any reports of observed reductions of /sCC/ clusters in Norwegian.