A basic problem of linguistic phonetics is to explain how the infinite variety of speech sounds in actual utterances can be described with finite means, such that they can be dealt with in the grammar, i.e. phonology, of a language. The crucial concept that was developed to cope with this reduction problem is the sound category, or – when applied to the description of segmental phenomena – the phoneme. This is best conceived of as an abstract category that contains all possible sounds that are mutually interchangeable in the context of a minimal word pair. That is, substitution of one token (allophone) of a phoneme for another does not yield a different word (i.e., a string of sounds with a different lexical meaning).2
The phonemes in a language differ from one another along a finite number of phonetic dimensions, such as degree of voicing, degree of noisiness, degree of nasality, degree of openness, degree of backness, degree of rounding, etc. Each phonetic dimension, in turn, is subdivided into a small number (two to four) of phonologically functional categories, such as voiced/voiceless, (half)closed/(half)open, front/central/back, etc. Phonetic dimensions generally have multiple acoustic correlates. For instance, degree of voicing correlates with a multitude of acoustic cues such as voice onset time, duration of preceding vowel, steepness of intensity decay and of formant bends in preceding vowel, duration of intervocalic (near) silence, duration and intensity of noise burst, steepness of intensity attack and formant bends of following vowel. These acoustic properties typically co-vary in preferred patterns, but may be manipulated independently through speech synthesis. When non-typical (‘conflicting’) combinations of parameter values are generated in the laboratory, some cues prove to be more influential than others; so-called ‘cue trading relationships’ have been established for many phonemic contrasts. In Dutch, for instance, vowel quality (acoustically defined by F1 and F2, i.e., the centre frequencies of the lowest two resonances in the vocal tract) and vowel duration were found to be equally influential in cuing the tense/lax contrast between /a/ and /ɑ/: a duller vowel quality (lower F1 and F2 values), normally cuing /ɑ/, could be compensated for by increasing the duration of the vowel so that native listeners still perceive /a/ (and vice versa; van Heuven, 1986).
Categorization of sounds may proceed along several possible lines. First, many differences between sounds are simply too small to be heard at all: these are subliminal. The scientific discipline of psycho-acoustics provides a huge literature on precisely what differences between sounds can and cannot be heard with the naked ear. Moreover, research has shown that the human hearing mechanism (and that of mammals in general) has developed specific sensitivities to certain differences between sounds and is relatively deaf to others. These predilections have been shown to be present at birth (probably even in utero), and need not be acquired through learning. However, human categorization of sound is further shaped by exposure to language. As age progresses from infancy to adulthood, sound differences that were still above threshold shortly after birth quickly lose their distinctivity. An important concept in this context is the notion of categorical perception. This notion is best explained procedurally in terms of a laboratory experiment.
Imagine a minimal word pair such as English back ~ pack. One important difference between these two tokens is that the onset of voicing in back is more or less coincident with the plosive release, whilst the voice onset in pack does not start until some 50 ms after the release. It is not too difficult in the laboratory to create a series of exemplars by interpolating the voice onset time of a prototypical back (0-ms delay) and that of a prototypical pack (70-ms delay) in steps of, say, 10 ms, so that we now have an 8-step continuum ranging over 0, 10, 20, 30, 40, 50, 60, and 70 ms. These eight exemplars are shuffled in random order and played to an audience of native English listeners for identification as either back or pack (forced choice). The 0-ms voice delay token will naturally come out with exclusively back-responses (0% pack); the 70-ms token will have 100% pack-responses. But what results will be obtained for the intermediate exemplars? If the 10-ms changes in voice delay are perceived continuously, one would predict a constant, gradual increase in %-pack responses for each 10-ms increment in the delay. I.e., when the stimulus increment (from left to right) is plotted against the response increment (from bottom to top), the psychometric function (the line that captures the stimulus-response relationship) is essentially a straight line (open symbols in Figure 1B). The typical outcome of experiments with voiced/voiceless continua, however, is non-continuous. For the first part of the continuum all exemplars are perceived as back-tokens, the rightmost two or three exemplars are near-unanimously perceived as pack. Only for one or two exemplars in the middle of the continuum do we observe uncertainty on the part of the listener: here the distribution of responses is more or less ambiguous between back and pack. The psychometric function for this so-called categorical perception is sigmoid, i.e., has the shape of an S (big solid symbols in Figure 1B). 
In the idealized case of perfect categorical perception we would, in fact, expect to see a step-function jumping abruptly from (almost) 0 to (almost) 100% pack-responses somewhere along the continuum (thin black line with small solid symbols in Figure 1B).
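The three response patterns described above (continuous, categorical, and the idealized step) can be sketched numerically for the 8-step continuum. This is only an illustration: the logistic form, the 35-ms boundary and the slope value are assumptions chosen to resemble the curves described for Figure 1B, not fitted data, and the function names are hypothetical.

```python
import math

# VOT continuum from the text: 0 to 70 ms in 10-ms steps (8 exemplars).
VOT_STEPS_MS = list(range(0, 80, 10))


def continuous_pct_pack(vot_ms, lo=0.0, hi=70.0):
    """Straight-line function: the same gain in %-pack for every 10-ms step."""
    return 100.0 * (vot_ms - lo) / (hi - lo)


def categorical_pct_pack(vot_ms, boundary_ms=35.0, slope_per_ms=0.25):
    """S-shaped (logistic) function; boundary and slope are assumed values."""
    return 100.0 / (1.0 + math.exp(-slope_per_ms * (vot_ms - boundary_ms)))


def step_pct_pack(vot_ms, boundary_ms=35.0):
    """Idealized step function: abrupt jump from 0% to 100% at the boundary."""
    return 100.0 if vot_ms > boundary_ms else 0.0


def identification_table():
    """One row per exemplar: (VOT, continuous, categorical, step) in %-pack."""
    return [
        (v, continuous_pct_pack(v), categorical_pct_pack(v), step_pct_pack(v))
        for v in VOT_STEPS_MS
    ]


if __name__ == "__main__":
    print("VOT(ms)  continuous  categorical  step")
    for vot, cont, cat, step in identification_table():
        print(f"{vot:7d}  {cont:10.1f}  {cat:11.1f}  {step:4.0f}")
```

Printing the table shows the contrast described in the text: the continuous column rises by the same amount at every step, whereas the categorical column stays near 0% or 100% except for the one or two exemplars around the assumed boundary.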
The category boundary (at 35-ms VOT in Figure 1B) is defined as the (interpolated) point along the stimulus axis where the distribution of responses is completely ambiguous, i.e., 50-50%. For a well-defined cross-over from one category to the other there should be a point along the stimulus axis where 75% of the responses agree on one category, and a second point where there is 75%-agreement on the other category. The uncertainty margin is defined in absolute terms as the distance along the stimulus axis between the two 75%-points; equivalent relative measures can be derived from the steepness of the psychometric function (e.g. the slope coefficient or the standard deviation of the cumulative normal distribution fitted to the data points).
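The two measures defined above can be computed by linear interpolation between adjacent data points. The sketch below assumes a monotonically rising identification function; the response percentages are invented for illustration (they are not data from any experiment), and the helper names are hypothetical.

```python
def crossing_point(stimuli, pct, target):
    """Interpolate the stimulus value at which the response curve reaches target.

    Assumes pct rises (weakly) from left to right along the continuum.
    """
    points = list(zip(stimuli, pct))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 == target:
            return float(x0)
        if y0 < target <= y1:
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("response curve never reaches the target percentage")


def boundary_and_margin(stimuli, pct_second_category):
    """Return (category boundary, uncertainty margin) in stimulus units.

    Boundary: the 50% cross-over. Margin: distance between the point with
    75% agreement on the first category (25% on the second) and the point
    with 75% agreement on the second category.
    """
    boundary = crossing_point(stimuli, pct_second_category, 50.0)
    lower = crossing_point(stimuli, pct_second_category, 25.0)
    upper = crossing_point(stimuli, pct_second_category, 75.0)
    return boundary, upper - lower


# Hypothetical %-pack responses for the 0..70 ms VOT continuum.
vot_ms = [0, 10, 20, 30, 40, 50, 60, 70]
pct_pack = [0, 0, 5, 20, 80, 95, 100, 100]

boundary, margin = boundary_and_margin(vot_ms, pct_pack)

if __name__ == "__main__":
    print(f"category boundary: {boundary:.1f} ms")
    print(f"uncertainty margin: {margin:.1f} ms")
```

With these invented percentages the boundary falls at 35 ms and the margin is roughly 8.3 ms; a steeper identification function would yield a smaller margin.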
Figure 1. Panel A. Hypothetical discrimination function for physically same and different pairs of stimuli (one-step difference) reflecting categorical perception. Panel B. Illustration of continuous (open squares) versus categorical (big solid squares) perception in the identification and discrimination paradigm. The thin line with small squares represents the ideal step function that should be obtained when categorical perception is absolute. Category boundary and uncertainty margin are indicated (further, see text).
Although a pronounced sigmoid function (such as the one drawn in Figure 1B) is a clear sign of categorical perception, researchers have always been reluctant to consider it definitive proof. Listeners, when forced to, tend to split any continuum down the middle. For a continuum to be perceived categorically, therefore, two conditions should be met:
(a) results of an identification experiment should show a clear sigmoid function, and
(b) the discrimination function should show a local peak for stimuli straddling the category boundary.
The discrimination function is determined in a separate experiment in which either (i) identical or (ii) adjacent tokens along the stimulus continuum are presented pair-wise. Listeners then decide for each pair whether the two tokens are ‘same’ or ‘different’. Two kinds of error may occur in a discrimination task:
(a) a physically different pair may be heard as ‘same’, and
(b) a pair of identical tokens may be called ‘different’.
The results of a discrimination task are best expressed as the percentage of correct decisions obtained for a ‘different’ stimulus pair minus the percentage of errors for ‘same’ pairs constructed from these stimuli (the latter percentage is often called the response bias). In the case of true categorical perception the discrimination scores show a pronounced peak for the stimulus pair straddling the category boundary, whilst all other pairs are discriminated at or only a little above chance level (see panel A in Figure 1). Physically different sounds that fall in the same perceptual category are hard to discriminate. In the case of continuous perception, there is no local peak in the discrimination function.
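The bias-corrected discrimination score can be sketched as follows. Two points are assumptions rather than part of the definition above: the response bias for a ‘different’ pair is taken as the mean error rate of the two ‘same’ pairs built from its members, and all percentages are invented for illustration. The function names are hypothetical.

```python
def discrimination_scores(pct_different_correct, pct_same_called_different):
    """Bias-corrected score for each adjacent (one-step) stimulus pair.

    pct_different_correct[i]: % 'different' responses to pair (i, i+1).
    pct_same_called_different[i]: % 'different' responses to identical pair (i, i).
    """
    if len(pct_same_called_different) != len(pct_different_correct) + 1:
        raise ValueError("need one 'same' pair per stimulus")
    scores = []
    for i, correct in enumerate(pct_different_correct):
        # Assumption: average the errors of the two 'same' pairs involved.
        bias = (pct_same_called_different[i] + pct_same_called_different[i + 1]) / 2.0
        scores.append(correct - bias)
    return scores


def peak_pair(scores):
    """Index of the adjacent pair with the highest corrected score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical data for the 8-step VOT continuum (7 one-step pairs).
pct_different_correct = [12, 10, 15, 85, 18, 11, 10]
pct_same_called_different = [10, 8, 10, 12, 14, 9, 8, 10]

scores = discrimination_scores(pct_different_correct, pct_same_called_different)
peak = peak_pair(scores)

if __name__ == "__main__":
    for i, score in enumerate(scores):
        print(f"pair {i * 10}-{(i + 1) * 10} ms: {score:5.1f}")
    print(f"peak at pair {peak * 10}-{(peak + 1) * 10} ms")
```

In this invented data set the corrected score peaks for the 30-40 ms pair, i.e. the pair straddling a 35-ms category boundary, while the within-category pairs stay within a few percentage points of zero.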