Speaker-Sex Discrimination for Voiced and Whispered Vowels at Short Durations

Whispered vowels, produced with no vocal fold vibration, lack the periodic temporal fine structure which in voiced vowels underlies the perceptual attribute of pitch (a salient auditory cue to speaker sex). Voiced vowels possess no temporal fine structure at very short durations (below two glottal cycles). The prediction was that speaker-sex discrimination performance for whispered and voiced vowels would be similar for very short durations but, as stimulus duration increases, voiced vowel performance would improve relative to whispered vowel performance as pitch information becomes available. This pattern of results was shown for women’s but not for men’s voices. A whispered vowel needs to have a duration three times longer than a voiced vowel before listeners can reliably tell whether it is spoken by a man or a woman (∼30 ms vs. ∼10 ms). Listeners were half as sensitive to information about speaker sex when it was carried by whispered compared with voiced vowels.


Introduction
The world is full of complex, dynamically changing sources of sound. One source of sound is other humans speaking. The information voices convey is both linguistic (what has been said) and indexical (sociocultural status, emotional state, physical attributes, etc.; Giles & Powsland, 1975; Krause, Freyberg, & Morsella, 2002; Ladefoged & Broadbent, 1957; Murray & Arnott, 1993; Sachs, Lieberman, & Erikson, 1972). This article concerns one of the most salient and important pieces of indexical information: whether someone speaking is a man or a woman. Of particular interest is how speaker-sex discrimination performance builds up with stimulus duration where the speech sounds are either voiced or whispered.
The communication sounds of mammals (including the speech sounds of humans) are produced by the same underlying physiological mechanism. The diaphragm pushes air from the lungs past the vocal folds. The vocal folds are muscular bands of tissue located in the larynx at the base of the throat. In normal voiced speech, the vocal folds open and close very rapidly in a vibratory motion which has the effect of breaking up the steady stream of air [...] late-available (but more reliable) information. Such an approach (which can be characterized as Bayesian) maximizes performance in a rapidly changing dynamic environment. This reflects a general philosophy of increasing the weighting of the more reliable cue when combining across multiple information sources (e.g., Hillis, Watt, Landy, & Banks, 2004; Jacobs, 2002) where the reliability of those cues changes over time (for a review of Bayesian learning see Knill & Pouget, 2004).
One prediction of this view of how perceptual information is recruited across different time scales is that there should be different speaker-sex discrimination performance as a function of stimulus duration for whispered compared with voiced speech. When humans whisper, the normal vibratory motion of the vocal folds is suspended, and consequently there is no periodic f0 component in whispered speech. This contrasts with voiced speech, where the glottal pulses generated as the vocal folds vibrate form a periodic f0 component in the speech sound which is clearly heard as the pitch of the voice. Thus, voiced speech has an extra speaker-sex cue of voice pitch compared with whispered speech. Interestingly, pitch needs at least two glottal cycles to be present in the sound, so for durations of less than two glottal cycles neither voiced nor whispered speech possesses pitch information. However, both whispered and voiced speech have formant peaks imposed on their frequency spectrum by the filtering action of the vocal tract, so they both have cues to speaker sex related to vocal tract length (VTL). Speaker-sex discrimination performance as a function of stimulus duration for whispered speech should thus take a different form than for voiced speech. At the very shortest of durations, where speaker-sex discrimination performance is driven by early-available VTL-related cues (Smith, 2014), voiced and whispered speaker-sex discrimination performance should be similar. But as stimulus duration increases and information related to glottal pulse rate (GPR) becomes available, voiced speech performance should improve relative to whispered speech performance (as shown in Figure 1). Thus, the underlying psychometric functions, which relate stimulus duration to listeners' correct speaker-sex discrimination responses, are predicted to be markedly different for voiced and whispered speech.
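The two-glottal-cycle limit can be made concrete with a short calculation. The sketch below converts a fundamental frequency into the duration spanned by two glottal cycles. The f0 values are illustrative assumptions only (rough typical values for adult speakers) and are not taken from this study.

```python
def two_cycle_duration_ms(f0_hz: float) -> float:
    """Duration (ms) spanned by two glottal cycles at fundamental frequency f0_hz."""
    if f0_hz <= 0:
        raise ValueError("f0 must be positive")
    return 2.0 * 1000.0 / f0_hz


# Hypothetical f0 values, chosen only for illustration (not from the source).
ASSUMED_F0_HZ = {"man": 120.0, "woman": 220.0}

for speaker, f0 in ASSUMED_F0_HZ.items():
    print(f"{speaker}: two glottal cycles span {two_cycle_duration_ms(f0):.1f} ms")
```

Under these assumed values, a stimulus shorter than the printed duration would contain fewer than two glottal cycles and so, on the account given above, would carry no pitch information.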

Method
Participants
Twenty English-speaking listeners participated in the main experiment (14 female, age range 18-39 years, mean = 20.3 years). A different group of seven English-speaking listeners participated in the supplementary experiment (five female, age range 19-21 years, mean = 20.1 years). All listeners had normal hearing as indicated by their absolute thresholds at both ears at 0.5, 1, 2, and 4 kHz on an audiogram. Listeners were naive to the purpose of the experiments and participated to earn course credit. Written informed consent was given by the participants after the experiments were introduced to them. The experimental procedure was approved by the Hull Psychology Research Ethics Committee (Ref: 1415122506).

Stimuli and Apparatus
Full details of the stimuli and procedures used in this study are given in Smith (2014) and will only be summarized here. One example of each of the five English vowels /a/, /e/, /i/, /o/, /u/ (corresponding to the vowel sounds in "fa", "bay", "bee", "toe", and "zoo"), spoken by four adult men and four adult women, was presented to listeners. Speakers provided both voiced and whispered versions of the vowels. The speakers were native-English-speaking students at the University of Hull. Sounds were recorded with a sampling rate of 48 kHz and an amplitude resolution of 16 bits.
Each vowel was presented at six different durations (8, 12, 18, 27, 40, and 60 ms), obtained by taking segments of different lengths from the central portion of the vowel. Each segment was cosine-square gated to ensure that the sounds came on and went off smoothly over the first and last 1 ms, respectively. All the vowel sounds of all durations were normalized to the same root-mean-square (rms) level of 0.0250 (relative to a maximum of ±1). The sound level of the vowels at the headphones was 77 dB SPL.
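The segment preparation described above (central excerpt, 1-ms cosine-square gating, rms normalization) can be sketched as follows. This is a minimal illustration, not the processing code used in the study: the function name and the synthetic noise "vowel" are hypothetical, and applying the rms normalization after gating is an assumption, since the order of the two steps is not stated.

```python
import numpy as np

FS_HZ = 48_000  # sampling rate reported in the text


def make_segment(vowel, dur_ms, fs=FS_HZ, ramp_ms=1.0, target_rms=0.025):
    """Cut a central segment, apply cos^2 on/off ramps, and normalize the rms level."""
    vowel = np.asarray(vowel, dtype=float)
    n = int(round(dur_ms * fs / 1000.0))
    if n > len(vowel):
        raise ValueError("requested segment is longer than the recording")
    start = (len(vowel) - n) // 2              # centre the excerpt in the recording
    seg = vowel[start:start + n].copy()

    n_ramp = int(round(ramp_ms * fs / 1000.0))
    # Rising sin^2 ramp; its time reversal is the falling cos^2 ramp.
    ramp = np.sin(0.5 * np.pi * (np.arange(n_ramp) + 0.5) / n_ramp) ** 2
    seg[:n_ramp] *= ramp
    seg[n - n_ramp:] *= ramp[::-1]

    # Assumption: normalize after gating so every stimulus has the same final rms.
    seg *= target_rms / np.sqrt(np.mean(seg ** 2))
    return seg


# Hypothetical stand-in for a recorded vowel: 200 ms of Gaussian noise.
rng = np.random.default_rng(0)
fake_vowel = rng.standard_normal(int(0.2 * FS_HZ))
segments = {d: make_segment(fake_vowel, d) for d in (8, 12, 18, 27, 40, 60)}
```

At 48 kHz the 1-ms ramps are 48 samples long, so even the 8-ms segment (384 samples) retains a steady-state portion between its onset and offset ramps.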
A noise mask was presented immediately following the offset of the short duration vowel. The Gaussian noise mask was 500 ms in duration, with an onset and offset that was smoothed by a cosine-gating function of 10 ms. The sound level of the Gaussian noise at the headphones was 69 dB SPL.
The stimuli were played by a 24-bit sound card (X-fi Xtreme Audio, Sound Blaster, Creative) and presented to the listener diotically over Sennheiser HD600 headphones. Listeners were seated in a single-walled IAC sound-attenuating booth.

Procedure
The experiments were performed using a single-interval, one-response paradigm. The listener heard a vowel of a given duration and had to indicate whether a man or a woman had spoken the vowel. The original vowel was equally likely to have been spoken by a man or a woman, and there was a 20% chance that it was any particular vowel from the set of five (/a-u/). The judgement of the sex of the speaker was made by selecting the appropriate button on a visual display. The order of the "man" and the "woman" buttons was quasi-randomly switched at the beginning of each run.
Figure 1. Hypothetical psychometric functions of the form P(t) = g + (1 − g − λ)F(t), where P(t) is the probability of correct discrimination of speaker sex at stimulus duration t, with guess rate g (which in an mAFC task is 1/m, or ½ in our 2AFC task), which sets the lower asymptote representing chance performance, and with lapse rate λ, which sets the upper asymptote representing ceiling performance. The function F is for convenience taken to be the logistic function [1 + exp(−x)]^(−1), which takes values between 0 and 1 for values of t, −∞ < t < ∞ (see Treutwein & Strasburger, 1999). The bracketed region "formants" indicates durations where VTL-related information (the formants of speech) is the main cue to speaker-sex discrimination, the region "f0" indicates durations where GPR-related information (voice pitch, as determined by f0) is the main cue for discriminating speaker sex, and the region "formants and f0" indicates durations where both formants and f0 could contribute to speaker-sex discrimination. Proportion correct values on the y-axis are for illustrative purposes only, and x-axis durations are purposely left blank.
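The psychometric function described above, with guess rate g, lapse rate λ, and a logistic F, can be sketched in a few lines. The location and scale parameters (`alpha_ms`, `beta_ms`) and all default values are hypothetical, introduced only so that F has something to act on. The source does not say how duration is mapped onto the argument of F.

```python
import math


def logistic(x: float) -> float:
    """Logistic function [1 + exp(-x)]^(-1)."""
    return 1.0 / (1.0 + math.exp(-x))


def psychometric(t_ms, g=0.5, lam=0.02, alpha_ms=20.0, beta_ms=5.0):
    """P(t) = g + (1 - g - lam) * F((t - alpha) / beta).

    g sets the lower asymptote (chance) and lam lowers the upper asymptote (1 - lam).
    alpha_ms and beta_ms are illustrative assumptions, not values from the study.
    """
    return g + (1.0 - g - lam) * logistic((t_ms - alpha_ms) / beta_ms)


for t in (8, 12, 18, 27, 40, 60):
    print(f"{t:>2} ms -> P(correct) = {psychometric(t):.3f}")
```

At t = `alpha_ms` the logistic equals ½, so the function passes through (1 − g − λ)/2 + g, halfway between the two asymptotes.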
Listeners were first given a practice run of 30 trials with a single vowel duration of 100 ms of either voiced or whispered vowels. The purpose of the practice was to familiarize listeners with the experimental procedure. The five vowels were each presented six times, with half spoken by men and half spoken by women. Which vowel and whether the vowel was spoken by a man or a woman was quasi-randomly determined. The ability of listeners to correctly judge the sex of the original speaker was measured. Listeners invariably found it an easy task to judge the sex of the speaker of the voiced vowels at this duration (M = 99.17%, SD = 2.39% correct) but harder to judge the sex of the speaker of the whispered vowels (M = 83.50%, SD = 10.62% correct). Each listener was provided with feedback as to their performance level only for the first practice run (whether it was voiced or whispered being counterbalanced). The practice run took approximately 2 to 3 min to complete.
Listeners then proceeded on to the main experiment. The listener was given a run of 180 trials, consisting of six durations (8, 12, 18, 27, 40, and 60 ms), each repeated 30 times. Half the trials were vowels spoken by men, and half the trials were vowels spoken by women (balanced across durations and vowels). Each run consisted of either all voiced or all whispered vowels. The duration, sex, and vowel were presented in a quasi-random order generated by the computer. Which of the four men's or four women's vowels was used in any one trial was quasi-randomly determined by the computer. Whether listeners undertook the voiced-vowel run or the whispered-vowel run first was counterbalanced to control for the effects of experience or fatigue. There was no feedback. After the first experimental run had been completed, the listeners were given a practice run and then the last experimental run (all without feedback). Thus, one participant might do practice-voiced, experimental-voiced, practice-whispered, then experimental-whispered. Another participant might do the whispered practice and experiment first, followed by the voiced practice and experimental conditions. Each experimental run of 180 trials took approximately 10 to 15 min to complete. Each listener did the experiment in one session lasting approximately 45 min.
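One way to build a balanced 180-trial run of the kind described above is sketched here. It assumes three repetitions of every duration × sex × vowel combination (6 × 2 × 5 × 3 = 180), which is one design consistent with the stated balancing but is not confirmed by the source. The function and field names are hypothetical.

```python
import itertools
import random

DURATIONS_MS = (8, 12, 18, 27, 40, 60)
SEXES = ("man", "woman")
VOWELS = ("a", "e", "i", "o", "u")


def make_run(seed=None, reps=3, n_speakers_per_sex=4):
    """Return a shuffled list of trials balanced across duration, sex, and vowel.

    reps=3 is an assumption chosen so that each duration occurs 30 times.
    """
    rng = random.Random(seed)
    trials = []
    for dur, sex, vowel in itertools.product(DURATIONS_MS, SEXES, VOWELS):
        for _ in range(reps):
            trials.append({
                "duration_ms": dur,
                "sex": sex,
                "vowel": vowel,
                # Speaker drawn at random from the four talkers of that sex.
                "speaker": rng.randrange(n_speakers_per_sex),
            })
    rng.shuffle(trials)  # quasi-random presentation order
    return trials


run = make_run(seed=1)
print(len(run), "trials; first trial:", run[0])
```

Each call yields 30 trials per duration and 90 trials per speaker sex, matching the counts given in the text.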

Main Experiment
The first finding is that proportion correct scores for the speaker-sex discrimination task are higher for voiced than for whispered vowels for all durations (Figure 2).
Figure 2. The solid (fitted to main experiment data) and dashed (fitted to supplementary experiment data) curves are best-fitting psychometric functions using nonparametric local linear regression fitting (Żychaluk & Foster, 2009).
A "model-free" approach to estimating the psychometric function was adopted because the form of the underlying function is not known and because of the wish to avoid assumptions about lower and upper asymptote limits. The lower asymptote limit is conventionally set by the guess rate g (which is 0.5 in a two-alternative forced choice (2AFC) task), and the upper asymptote limit is set by the maximum possible proportion correct minus the lapse rate λ. Lapse rate represents incorrect responses that are unconnected to the level of the independent variable (due to momentary loss of attention and incorrect key presses), which tend to be minimal (affecting between 0% and 5% of trials, see Wichmann & Hill, 1999). However, the "lapse" rate can be non-trivial if it incorporates a hard barrier to further improvements in performance, perhaps induced by lack of information, perceptual bias, or a change in the weighting or cue used to make a decision. In parametric fitting of psychometric functions, both g and λ substantially affect the shape of the fitting function (Treutwein & Strasburger, 1999; Wichmann & Hill, 2001). In our situation, there is no reason to assume that λ is trivially small or g strictly equal to chance (0.5) because there may be perceptual biases or cue weighting changes affecting them. Local linear fitting derives the asymptotic values g and λ automatically, provided the psychometric function is sampled in the required region (Żychaluk & Foster, 2009).
The non-parametric fits are as good as parametric fits, with the only assumption being that the function must be smooth (Żychaluk & Foster, 2009). The point at which listeners can reliably tell whether a man or woman spoke, the duration threshold (min_sex) for reliable discrimination, was defined as the duration corresponding to the 0.75 point on the fitted curve (d′ = 1 for a 2AFC task, see Macmillan & Creelman, 1991) and was extracted from the psychometric functions fitted to the voiced and whispered vowel conditions. The slope at the point on the fitted curve with probability P = (1 − g − λ)/2 + g was measured to provide a value for sensitivity, that is, how quickly speaker-sex discrimination performance increases as a function of vowel duration. The reasoning was that a slope measured at this point should not be unduly affected by differences in g and λ, as it would be if the slope were measured at a fixed probability such as 0.75.
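The threshold and slope extraction can be illustrated with a simplified local linear fit. The sketch below is not the Żychaluk and Foster (2009) procedure: it uses a Gaussian kernel on raw proportions with a hand-picked bandwidth `h`, it omits any link function or bandwidth selection, and the proportions are made up for illustration. It only shows how a smooth fitted curve yields a 0.75 threshold and a slope at the midpoint P = (1 − g − λ)/2 + g.

```python
import numpy as np


def local_linear(x, y, x0, h):
    """Kernel-weighted straight-line fit around x0; returns (value, slope) at x0."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)           # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])   # intercept and centred duration
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)       # weighted least squares
    return beta[0], beta[1]


def fit_curve(x, y, grid, h):
    """Evaluate the local linear fit and its slope at every grid point."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    pairs = [local_linear(x, y, x0, h) for x0 in grid]
    return np.array([p[0] for p in pairs]), np.array([p[1] for p in pairs])


def first_crossing(grid, fit, criterion):
    """Duration at which the fitted curve first reaches criterion, else None."""
    idx = np.nonzero(fit >= criterion)[0]
    if idx.size == 0:
        return None                # curve never reaches the criterion (undefined)
    i = idx[0]
    if i == 0:
        return float(grid[0])      # already at criterion at the shortest duration
    frac = (criterion - fit[i - 1]) / (fit[i] - fit[i - 1])
    return float(grid[i - 1] + frac * (grid[i] - grid[i - 1]))


def midpoint(g, lam):
    """Probability halfway between the lower (g) and upper (1 - lam) asymptotes."""
    return (1.0 - g - lam) / 2.0 + g


durations = np.array([8, 12, 18, 27, 40, 60], dtype=float)
p_made_up = np.array([0.58, 0.77, 0.88, 0.95, 0.97, 0.98])  # hypothetical data
grid = np.linspace(durations[0], durations[-1], 521)
fit, slope = fit_curve(durations, p_made_up, grid, h=8.0)    # h is an assumption

threshold_ms = first_crossing(grid, fit, 0.75)
print("threshold (ms):", threshold_ms)
```

With g = 0.5 and λ = 0 the midpoint is 0.75, so the threshold and slope points coincide. A non-zero λ moves the slope point below 0.75. A curve that never reaches 0.75 returns `None`, which corresponds to an undefined threshold.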
The data were first analyzed pooled across both men's and women's voices (Figure 2(a)). The best-fitting psychometric function for voiced vowels (solid curve fitted to filled large circles) is clearly different from that of the whispered vowels (solid curve fitted to open large circles). The duration threshold (min_sex) for reliable discrimination of whether a man or woman spoke was 11.28 (±0.45) ms for voiced vowels versus 33.77 (±5.74) ms for whispered vowels. The uncertainty (SD) in the threshold and slope means was estimated from 200 iterations in a bootstrap procedure (Foster & Bischof, 1991). Comparison of threshold (and slope) estimates across vowel types was made using 99% confidence intervals which maintain at least p < .01 for non-overlapping error bars when the standard error of the estimates differs by a factor of approximately 13 (see Payton, Greenstone, & Schenker, 2003). The threshold estimates for voiced and whispered vowels (Table 1) clearly do not overlap, and we can thus be confident that there is a significant difference between the duration thresholds for voiced and whispered vowels. A measure of the slope was also extracted from the fitted psychometric functions for the voiced and whispered vowels. These were measured at the P = .75 point for the voiced and at the P = .665 point for the whispered. The slopes were 0.0263 (±0.0018) for the voiced and 0.0120 (±0.0026) for the whispered, which are significantly different from each other at least at p < .01 (see Table 1).
Figure 2 panels: data pooled across men and women speakers (Figure 2(a), top); data plotted separately for men speakers (Figure 2(b), middle) and women speakers (Figure 2(c), bottom).
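The bootstrap estimate of uncertainty can be sketched as a parametric resampling loop. This is an illustration under stated assumptions rather than the Foster and Bischof (1991) procedure itself. Responses are resampled binomially at each duration, the threshold estimator is a simple linear-interpolation stand-in rather than the local linear fit, the proportions are made up, and `n_trials=600` assumes pooling of 20 listeners × 30 trials per duration.

```python
import numpy as np

DURATIONS_MS = np.array([8, 12, 18, 27, 40, 60], dtype=float)


def interp_threshold(p, criterion=0.75):
    """Stand-in estimator: first crossing of criterion by linear interpolation."""
    idx = np.nonzero(p >= criterion)[0]
    if idx.size == 0:
        return None                      # threshold undefined for this sample
    i = idx[0]
    if i == 0:
        return float(DURATIONS_MS[0])
    frac = (criterion - p[i - 1]) / (p[i] - p[i - 1])
    return float(DURATIONS_MS[i - 1] + frac * (DURATIONS_MS[i] - DURATIONS_MS[i - 1]))


def bootstrap_sd(estimator, p_obs, n_trials, n_boot=200, seed=0):
    """SD of an estimate over n_boot binomial resamples of the observed proportions."""
    rng = np.random.default_rng(seed)
    p_obs = np.asarray(p_obs, dtype=float)
    estimates = []
    for _ in range(n_boot):
        p_star = rng.binomial(n_trials, p_obs) / n_trials   # resampled proportions
        value = estimator(p_star)
        if value is not None:
            estimates.append(value)
    sd = float(np.std(estimates, ddof=1)) if len(estimates) > 1 else float("nan")
    return sd, len(estimates)


p_made_up = [0.58, 0.77, 0.88, 0.95, 0.97, 0.98]   # hypothetical proportions correct
sd_ms, n_defined = bootstrap_sd(interp_threshold, p_made_up, n_trials=600)
print(f"bootstrap SD of threshold: {sd_ms:.2f} ms over {n_defined} resamples")
```

The second return value counts the resamples in which the threshold was defined, which matters for conditions where the curve may not reach 0.75.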
The differences between voiced and whispered psychometric functions were also evident for the speaker-sex discrimination data for the men's voices analyzed separately (Figure 2(b)). The threshold and slope estimates of the voiced and whispered vowels, derived from the fitted functions, clearly do not overlap: the duration threshold (min_sex) for reliably discriminating whether a man or woman spoke was 11.02 (±0.47) ms for voiced vowels versus 28.87 (±1.72) ms for whispered vowels, and slope estimates were 0.0316 (±0.0029) for the voiced and 0.0146 (±0.0013) for the whispered, all significantly different from each other at least at p < .01 (see Table 1).
Finally, differences between voiced and whispered psychometric functions were apparent when the speaker-sex discrimination data for the women's voices were analyzed separately (Figure 2(c)). It is problematic to compare thresholds because the whispered condition for women's voices never reaches 0.75 probability correct. Nevertheless, there is clearly a difference, with the duration threshold (min_sex) for reliable discrimination of whether a man or woman spoke being 11.49 (±0.76) ms for voiced vowels versus undefined (but at least >60 ms) for the whispered vowels. Comparing slope estimates, we have 0.021 (±0.002) for the voiced and 0.0088 (±0.0024) for the whispered, significantly different from each other at least at p < .01 (see Table 1).
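The reported estimates can be checked for interval overlap with a few lines of arithmetic. The sketch below uses the values quoted in the text and assumes a normal approximation for the 99% confidence interval (estimate ± 2.576 SD). The source does not state how the intervals were constructed, so this is an illustration rather than a reproduction of the analysis.

```python
Z_99 = 2.576  # two-sided 99% quantile of the standard normal (assumed CI construction)

# ((voiced estimate, SD), (whispered estimate, SD)) as reported in the text.
REPORTED = {
    "threshold (ms), pooled": ((11.28, 0.45), (33.77, 5.74)),
    "threshold (ms), men": ((11.02, 0.47), (28.87, 1.72)),
    "slope, pooled": ((0.0263, 0.0018), (0.0120, 0.0026)),
    "slope, men": ((0.0316, 0.0029), (0.0146, 0.0013)),
    "slope, women": ((0.021, 0.002), (0.0088, 0.0024)),
}


def confidence_interval(estimate, sd, z=Z_99):
    """Normal-approximation interval: estimate +/- z * SD."""
    return estimate - z * sd, estimate + z * sd


def intervals_overlap(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


for name, (voiced, whispered) in REPORTED.items():
    ci_v = confidence_interval(*voiced)
    ci_w = confidence_interval(*whispered)
    ratio = whispered[0] / voiced[0]
    print(f"{name}: whispered/voiced = {ratio:.2f}, "
          f"overlap = {intervals_overlap(ci_v, ci_w)}")
```

The whispered-to-voiced ratios computed this way correspond to the "three times longer" threshold and "half as sensitive" slope summaries given elsewhere in the article.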

Supplementary Experiment
Although the voiced and whispered vowels were equated to the same level of 77 dB SPL, it could be argued that the whispered vowels are less salient than the voiced vowels. Thus, the reduced discriminability of speaker sex in the whispered relative to the voiced vowels could be due to the whispered vowels having less perceptual loudness rather than their being impoverished in speaker-sex cues per se. To look at this idea further, the experiments were repeated but with the sounds all increased by 6 dB. All other details were the same. Figure 2 (dashed line, small circles) shows probability correct judgment of original speaker sex, as a function of duration of the vowel, for voiced and whispered vowels in the supplementary experiment. As in the main experiment, the relationship between vowel duration and proportion correct for the voiced and whispered vowels was characterized by using non-parametric local linear regression fitting (Żychaluk & Foster, 2009) to derive a best-fitting psychometric function. Threshold and slope estimates derived from the psychometric functions were compared between identical conditions across the supplementary and main experiment, for example, voiced (men and women speakers) in the supplementary versus voiced (men and women speakers) in the main experiment, whispered (men and women speakers) in the supplementary versus whispered (men and women speakers) in the main experiment, and so forth. In no case for the voiced vowels was there a significant difference between comparable conditions in the main and supplementary experiments (compare the following values against the equivalent values in Table 1: voiced (m + w) min_sex 13.10 (±0.53) ms, slope 0.0308 (±0.0032); voiced (m) min_sex 12.16 (±0.75) ms, slope 0.0343 (±0.0055); voiced (w) min_sex 14.19 (±0.98) ms, slope 0.0256 (±0.004)).
For the whispered speaker-sex discrimination (men speakers), there was a significant difference between comparable conditions in the main and supplementary experiments (whispered (m) min_sex 41.73 (±3.36) ms, slope 0.0074 (±0.0011)), while for the other whispered conditions there were no significant differences (whispered (m + w) min_sex 36.68 (±6.27) ms, slope 0.0065 (±0.0029); whispered (w) min_sex 56.36 (±11.96) ms, slope 0.0019 (±0.0042)).
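For reference, the 6 dB increase used in the supplementary experiment corresponds to roughly a doubling of sound pressure amplitude. The short sketch below shows the standard decibel conversions. The function names are hypothetical, and the resulting presentation level assumes the 6 dB was added to the 77 dB SPL vowel level reported for the main experiment.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Pressure (amplitude) ratio corresponding to a level change in dB."""
    return 10.0 ** (db / 20.0)


def db_to_power_ratio(db: float) -> float:
    """Power (intensity) ratio corresponding to a level change in dB."""
    return 10.0 ** (db / 10.0)


INCREASE_DB = 6.0
MAIN_LEVEL_DB_SPL = 77.0  # vowel level reported for the main experiment

print(f"amplitude ratio: {db_to_amplitude_ratio(INCREASE_DB):.3f}")
print(f"power ratio:     {db_to_power_ratio(INCREASE_DB):.3f}")
print(f"assumed supplementary vowel level: {MAIN_LEVEL_DB_SPL + INCREASE_DB:.0f} dB SPL")
```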

Discussion
This article investigated how speaker-sex discrimination performance improves as a function of stimulus duration for voiced and whispered vowels. The prediction was that speaker-sex discrimination performance for voiced and whispered vowels would be similar for very short durations but, as stimulus duration increased, voiced vowel performance would improve relative to whispered vowel performance. This would be reflected by markedly different psychometric functions (see hypothetical curves in Figure 1) and poorer speaker-sex discrimination performance (in terms of discrimination thresholds and sensitivity slope values) for whispered compared with voiced vowels. This is the case: a whispered vowel needs to have a duration three times longer than a voiced vowel before listeners can reliably tell whether it is spoken by a man or a woman (∼30 ms vs. ∼10 ms). Listeners are approximately half as sensitive to information about speaker sex when it is carried by whispered as opposed to voiced vowels (as shown by the slopes of the psychometric functions).
It was suggested that the relative impairment between voiced and whispered speaker-sex discrimination performance should be least at shorter durations, where the two types of stimuli approach parity because neither possesses pitch information. This was partially confirmed (Figure 2(a), solid lines, filled vs. open large circles). Interestingly, when plotting judgments separately for men and women speakers, the pattern of performance is more mixed. Men's voices, though showing the characteristic poor speaker-sex discrimination performance of whispered vowels relative to voiced vowels, do not show less impairment at very short durations relative to longer durations (Figure 2(b)). Women's voices are more similar to the prediction, showing little difference between whispered and voiced speaker-sex discrimination performance at short durations but a large difference at longer vowel durations (Figure 2(c)). Whispered speech tends to have higher formants (primarily F1) than voiced speech for a given vowel (Kallail & Emanuel, 1984). Higher-frequency formants cue for shorter VTL (Fant, 1970), which would indicate a woman speaker, as women on average have shorter VTLs than men (Fitch & Giedd, 1989). This could lead to some misclassification of male vowels as being spoken by a woman. This seems to be occurring at least at very short durations (8 ms), where performance drops below chance (0.50) for men's vowels (Figure 2(b)).
Another difference between speaker-sex discrimination performance for men's and women's whispered vowels is at longer durations (≥40 ms), where women's whispered vowel speaker-sex discrimination performance asymptotes at approximately 0.70 proportion correct while men's whispered vowel speaker-sex discrimination performance increases up to 0.90 (at 60 ms). The male glottis has a medial surface bulge in the vocal folds while the female glottis converges more linearly (Titze, 1989). This could lead to perceptual differences at longer durations in male and female whispered vowels which might aid speaker-sex discrimination. The greater difficulty of identifying women compared with men speakers is consistent with other studies that have shown a perceptual advantage for male sounds in speaker-sex discrimination tasks (Owren, Berkowitz, & Bachorowski, 2007).
A supplementary experiment exploring whether differences in loudness between voiced and whispered vowels might explain the observed pattern of speaker-sex discrimination performance involved increasing the sound level of the stimuli. This had no effect on performance for voiced vowels (Figure 2(a)-(c), small vs. large filled circles). The effect upon whispered vowels was only significant for men's voices (Figure 2(b), small vs. large open circles), where increasing the sound level by 6 dB led to poorer performance at medium durations. There was no difference in speaker-sex discrimination performance for whispered women's voices (Figure 2(c)) or when men's and women's voices were plotted together (Figure 2(a)). This implies that changes in perceptual loudness do not underlie differences in speaker-sex discrimination performance between voiced and whispered vowels. The suggestion is that the differences are due to a lack of temporal pitch information in whispered speech.
In summary, the impoverished representation of speaker-sex cues (no temporal pitch) in whispered speech leads to poorer speaker-sex discrimination performance for whispered compared with voiced vowels: a whispered vowel has to be three times as long (34 ms) as a voiced vowel (11 ms) to reach the threshold of discrimination (min_sex). The difference between voiced and whispered vowel speaker-sex discrimination performance is least at very short durations because both voiced and whispered vowels contain VTL-related information and have no GPR-related information. However, at longer durations, GPR-related information becomes available in the voiced vowels while still being absent from the whispered vowels. Consequently, whispered vowel speaker-sex discrimination performance does not improve as much as voiced vowel speaker-sex discrimination performance. This is consistent with Smith (2014) in that it provides further support for the idea that speaker-sex discrimination is mediated by VTL-related information at the very shortest durations and then switches to being dominated by GPR-related information when it is available at longer durations. This makes best use of what information is available, using early-available but less reliable information at the beginning of a decision process and then switching to late-available but reliable information as it comes on stream. Such an approach maximizes performance in a rapidly changing dynamic environment.