In 1995 Kessler introduced the use of the Levenshtein distance as a tool for measuring linguistic distances between language varieties. The Levenshtein distance is a string edit distance measure, and Kessler applied this algorithm to the comparison of Irish dialects. Later, this approach was applied to Dutch dialects by Nerbonne, Heeringa, Van den Hout, Van der Kooi, Otten, and Van de Vis (1996). They assumed that distances between all possible pairs of segments are the same. E.g. the distance between an  and an [e] is the same as the distance between the  and . Both Kessler (1995) and Nerbonne and Heeringa (1997) also experimented with more refined versions of the Levenshtein algorithm in which gradual segment distances were used; these were derived from the feature systems of Hoppenbrouwers (1988) and Vieregge et al. (1984).
In this paper we use an implementation of the Levenshtein distance in which the sound distances are derived by comparing spectrograms. In Section 4.1 we motivate the use of spectral distances and explain how we calculate them. Comparisons are made on the basis of the audiotape The Sounds of the International Phonetic Alphabet (Wells and House, 1995). In Section 4.2 we describe the Levenshtein distance and explain how spectral distances can be used in this algorithm.
4.1. Gradual segment distances
When acquiring language, children learn to pronounce sounds by listening to the pronunciation of their parents or other people. The acoustic signal seems to be sufficient to find the articulation which is needed to realize the sound. Acoustically, speech is just a series of changes in air pressure, quickly following each other. A spectrogram is a “graph with frequency on the vertical axis and time on the horizontal axis, with the darkness of the graph at any point representing the intensity of the sound” (Trask, 1996, p. 328).
In this section we present the use of spectrograms for finding segment distances. Segment distances can also be found on the basis of phonological or phonetic feature systems. However, we prefer the use of acoustic representations since they are based on physical measurements. In Potter, Kopp and Green’s (1947) Visible Speech, spectrograms are shown for all common English sounds (see pp. 54-56). Looking at the spectrograms, we can already see which sounds are similar and which are not. We assume that visible (dis)similarity between spectrograms reflects perceptual (dis)similarity between segments to some extent. In Figure 3 the spectrograms of some sounds are shown as pronounced by John Wells on the audiotape The Sounds of the International Phonetic Alphabet (Wells and House, 1995). The spectrograms are made with the computer program PRAAT.
Figure 3. Spectrograms of some sounds pronounced by John Wells. Top: [i] (left) and [e] (right); bottom: [p] (left) and [s] (right).
For finding spectrogram distances between all IPA segments we need samples of one or more speakers for each of them. We found the samples on the tape The Sounds of the International Phonetic Alphabet on which all IPA sounds are pronounced by John Wells and Jill House. On the tape the vowels are pronounced in isolation. The consonants are sometimes preceded, and always followed by an [a]. We cut out the part preceding the [a], or the part between the [a]’s. We realize that the pronunciation of sounds depends on their context. Since we use samples of vowels pronounced in isolation and samples of consonants selected from a limited context, our approach is a simplification of reality. However, Stevens (1998, p. 557) observes that
“by limiting the context, it was possible to specify rather precisely the articulatory aspects of the utterances and to develop models for estimating the acoustic patterns from the articulation”.
The burst in a plosive of the IPA inventory is always preceded by a period of silence (voiceless plosives) or a period of murmur (voiced plosives). When a voiceless plosive is not preceded by an [a], it is not clear how long the period of silence that really belongs to the sound lasts. Therefore we always cut out each plosive in such a way that the time span from the beginning to the middle of the burst is equal to 90 ms. Among the plosives which were preceded by an [a] or which are voiced (so that the real duration of the start-up phase can be found) we found no sounds with a period of silence or murmur which was clearly shorter than 90 ms.
In voiceless plosives, the burst is followed by an [h]-like sound before the following vowel starts. A consequence of including this part in the samples is that bursts often do not match when comparing two voiceless plosives. However, since aspiration is a characteristic property of voiceless sounds, we retained aspiration in the samples. In general, when comparing two voiced plosives, the bursts match. When comparing a voiceless plosive and a voiced plosive, the bursts do not match.
To keep trills comparable to each other, we always cut three periods, even when the original samples contained more periods. When there were more periods, the most regular looking sequence of three periods was cut.
The Levenshtein algorithm also requires a definition of ‘silence’. To get a sample of ‘silence’ we cut a small silent part from the IPA tape. This ensures that silence has approximately the same background noise as the other sounds.
To make the samples as comparable as possible, all vowel and extracted consonant samples are monotonized at the mean pitch of the 28 concatenated vowels. The mean pitch of John Wells was 128 Hertz; the mean pitch of Jill House was 192 Hertz. To monotonize the samples, the pitch contours were changed to flat lines. The volume was not normalized because volume contains too much segment-specific information. For example, it is characteristic of the [v] that its volume is greater than that of the [f].
In the most common type of spectrogram the linear Hertz frequency scale is used. The difference between 100 Hz and 200 Hz is the same as the difference between 1000 Hz and 1100 Hz. However, our perception of frequency is non-linear. We hear the difference between 100 Hz and 200 Hz as an octave interval, but also the difference between 1000 Hz and 2000 Hz is perceived as an octave. Our ear evaluates frequency differences not absolutely, but relatively, namely in a logarithmic manner. Therefore, in the Barkfilter, the Bark-scale is used which is roughly linear below 1000 Hz and roughly logarithmic above 1000 Hz (Zwicker and Feldtkeller, 1967).
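The non-linearity described above can be illustrated with a short sketch. The text does not say which Hertz-to-Bark formula underlies the Barkfilter, so the conversion used here, 7·asinh(f/650), is an assumption: it is one common closed-form approximation, chosen only for illustration. The function name `hz_to_bark` is ours.

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Approximate Hertz-to-Bark conversion.

    Assumed formula (not confirmed by the text): 7 * asinh(f / 650).
    """
    return 7.0 * math.asinh(f_hz / 650.0)

# The same 100 Hz step spans more of the Bark scale at low frequencies
# than at high frequencies.
low_step = hz_to_bark(200) - hz_to_bark(100)
high_step = hz_to_bark(1100) - hz_to_bark(1000)

print(f"100 -> 200 Hz:   {low_step:.2f} Bark")
print(f"1000 -> 1100 Hz: {high_step:.2f} Bark")
```

Under this approximation the step from 100 Hz to 200 Hz is larger in Bark than the step from 1000 Hz to 1100 Hz, although both are 100 Hz wide on the linear scale.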
In the commonly used type of spectrogram the power spectral density is represented per frequency per time. The power spectral density is the power per unit of frequency as a function of the frequency. In the Barkfilter the power spectral density is expressed in decibels (dB’s). “The decibel scale is a way of expressing sound amplitude that is better correlated with perceived loudness” (Johnson, 1997, p. 53). The decibel scale is a logarithmic scale. Multiplying the sound pressure ten times corresponds to an increase of 20 dB. On a decibel scale intensities are expressed relative to the auditory threshold. The auditory threshold of 0.00002 Pa corresponds with 0 dB (Rietveld and Van Heuven, 1997, p. 199).
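As a minimal sketch of the decibel scale described above, the following converts a sound pressure in pascal to decibels relative to the auditory threshold of 0.00002 Pa given in the text. The names `P_REF` and `pressure_to_db` are ours.

```python
import math

P_REF = 0.00002  # auditory threshold in pascal, as given in the text

def pressure_to_db(pressure_pa: float) -> float:
    """Sound pressure level in dB relative to the auditory threshold."""
    return 20.0 * math.log10(pressure_pa / P_REF)

print(pressure_to_db(P_REF))       # the threshold itself maps to 0 dB
print(pressure_to_db(10 * P_REF))  # ten times the pressure: about 20 dB
```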
A Barkfilter is created from a sound by band filtering in the frequency domain with a bank of filters. In PRAAT the lowest band has a central frequency of 1 Bark by default, and each band has a width of 1 Bark. There are 24 bands, corresponding to the first 24 critical bands of hearing as found along the basilar membrane (Zwicker and Fastl, 1990). A critical band is an area within which two tones influence each other’s perceptibility (Rietveld and Van Heuven, 1997). Due to the Bark-scale the higher bands summarize a wider frequency range than the lower bands.
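The widening of the higher bands can be made concrete with a rough sketch. It rests on two assumptions that the text does not confirm: the Bark-to-Hertz conversion 650·sinh(z/7), which is one common approximation, and a simplified view of each band as a rectangle running from half a Bark below to half a Bark above its centre. The real filter shapes are ignored. All names are ours.

```python
import math

def bark_to_hz(z: float) -> float:
    """Approximate Bark-to-Hertz conversion (assumed formula: 650 * sinh(z / 7))."""
    return 650.0 * math.sinh(z / 7.0)

def band_edges_hz(centre_bark: float, width_bark: float = 1.0):
    """Lower and upper edge in Hz of a band, treated as a simple rectangle."""
    half = width_bark / 2.0
    return bark_to_hz(centre_bark - half), bark_to_hz(centre_bark + half)

# 24 bands with centres at 1..24 Bark, each 1 Bark wide, as in the text.
widths_hz = []
for centre in range(1, 25):
    lower, upper = band_edges_hz(centre)
    widths_hz.append(upper - lower)

print(f"band 1 width:  {widths_hz[0]:.0f} Hz")
print(f"band 24 width: {widths_hz[-1]:.0f} Hz")
```

Every band is 1 Bark wide, but under these assumptions its width in Hertz grows steadily from the first band to the last.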
In PRAAT we used the default settings when using the Barkfilter. The sound signal is sampled every 0.005 seconds with an analysis window of 0.015 seconds. Other settings may give different results, but since it was not a priori obvious which settings are optimal, we restricted ourselves to the defaults. In Figure 4 Barkfilters for some segments are shown.
Figure 4. Barkfilter spectrograms of some sounds pronounced by John Wells. Top: [i] (left) and [e] (right); bottom: [p] (left) and [s] (right).
In this section, we explain the comparison of segments in order to get distances between segments that will be used in the Levenshtein distance measure. In a Barkfilter, the intensities of frequencies are given for a range of times. A spectrum contains the intensities of frequencies at one time. The smaller the time step, the more spectra there are in the acoustic representation. We consistently used the same time step for all samples.
It appears that the duration of the segment samples varies. This may be explained by variation in speech rate. Duration is also a sound-specific property. E.g., a plosive is shorter than a vowel. The result is that the number of spectra per segment may vary, although for each segment the same time step was used. Since we want to normalize the speech rate and regard segments as linguistic units, we made sure that two segments get the same number of spectra when they are compared to each other.
When comparing one segment of m spectra with another segment of n spectra, each of the m elements is duplicated n times, and each of the n elements is duplicated m times. So both segments get a length of m × n.
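The duplication step can be sketched as follows. Spectra are represented here as plain lists of intensities; this representation and the helper names `stretch` and `equalize` are illustrative assumptions, not the authors' implementation.

```python
def stretch(spectra, times):
    """Repeat every spectrum `times` times, preserving the original order."""
    return [spectrum for spectrum in spectra for _ in range(times)]

def equalize(seg_a, seg_b):
    """Give two segments the same number of spectra (m * n each)."""
    m, n = len(seg_a), len(seg_b)
    return stretch(seg_a, n), stretch(seg_b, m)

# Toy segments with one intensity value per spectrum.
a = [[1.0], [2.0]]          # m = 2 spectra
b = [[5.0], [6.0], [7.0]]   # n = 3 spectra

a_long, b_long = equalize(a, b)
print(len(a_long), len(b_long))
print(a_long)
print(b_long)
```

After stretching, the spectra of both segments can be paired position by position.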
In order to find the distance between two sounds, the Euclidean distance is calculated between each pair of corresponding spectra, one from each of the sounds. Assume two spectra e1 and e2 with n frequencies each; then the Euclidean distance is:
\[ d(e_1, e_2) = \sqrt{\sum_{i=1}^{n} \left(e_{1i} - e_{2i}\right)^{2}} \]
Equation 1. Euclidean distance
The distance between two segments is equal to the sum of the spectrum distances divided by the number of spectra. In this way we found that the greatest distance occurs between the [a] and ‘silence’. We regard this maximum distance as 100%. Other segment distances are divided by this maximum and multiplied by 100. This yields segment distances expressed in percentages. Word distances and distances between varieties which are based on them may also be given in terms of percentages.
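A minimal sketch of this calculation is given below, assuming that the two segments have already been brought to the same number of spectra. The function names and the toy intensity values are ours and do not come from the measurements reported here.

```python
import math

def spectrum_distance(e1, e2):
    """Euclidean distance between two spectra with the same number of frequencies."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(e1, e2)))

def segment_distance(seg_a, seg_b):
    """Sum of the spectrum distances divided by the number of spectra."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same number of spectra")
    total = sum(spectrum_distance(e1, e2) for e1, e2 in zip(seg_a, seg_b))
    return total / len(seg_a)

def as_percentage(distance, max_distance):
    """Express a distance relative to the maximum segment distance."""
    return 100.0 * distance / max_distance

# Toy data: two spectra per segment, two frequencies per spectrum.
seg_a = [[0.0, 0.0], [0.0, 0.0]]
seg_b = [[3.0, 4.0], [6.0, 8.0]]

d = segment_distance(seg_a, seg_b)   # (5 + 10) / 2
print(d)
print(as_percentage(d, 15.0))        # 15.0 is an invented maximum
```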
In perception, small differences in pronunciation may play a relatively strong role in comparison with larger differences. Therefore we used logarithmic segment distances. The effect of using logarithmic distances is that small distances are weighted relatively more heavily than large distances. Since the logarithm of 0 is not defined, and the logarithm of 1 is 0, distances are increased by 1 before the logarithm is calculated. To obtain percentages, we calculate ln(distance + 1) / ln(maximum distance + 1).
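The effect of the logarithm can be seen in a small sketch. The function below returns the ratio ln(distance + 1) / ln(maximum distance + 1) exactly as stated; multiplying it by 100 to obtain a percentage is our reading of the text. The function name and the example values are illustrative.

```python
import math

def log_distance(distance, max_distance):
    """Ratio ln(distance + 1) / ln(max_distance + 1), between 0 and 1."""
    return math.log(distance + 1.0) / math.log(max_distance + 1.0)

# Compare linear and logarithmic scaling for an invented maximum of 100.
MAX_D = 100.0
for d in (0.0, 10.0, 50.0, 100.0):
    linear = d / MAX_D
    logarithmic = log_distance(d, MAX_D)
    print(f"d = {d:5.1f}  linear = {linear:.2f}  logarithmic = {logarithmic:.2f}")
```

Small distances end up much closer to the maximum on the logarithmic scale than on the linear one, which is the intended re-weighting.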
4.1.4. Suprasegmentals and diacritics
The sounds on the tape The Sounds of the International Phonetic Alphabet are pronounced without suprasegmentals and diacritics. However, a restricted set of suprasegmentals and diacritics can be processed in our system.
Length marks and syllabification are processed by changing the transcription beforehand. In the X-SAMPA transcription, extra-short segments are kept unchanged, sounds with no length indication are doubled, half long sounds are trebled, and long sounds are quadrupled. Syllabic sounds are treated as long sounds, so they are quadrupled.
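The length rule can be sketched as a simple expansion step. To avoid guessing at the X-SAMPA parsing used in our system, the sketch takes segments that are already paired with a length label; these labels, the table `REPEATS` and the function `expand_lengths` are hypothetical names introduced for illustration.

```python
# Number of copies per length category, following the rule in the text.
# The label strings are hypothetical, not X-SAMPA marks.
REPEATS = {
    "extra-short": 1,  # kept unchanged
    "normal": 2,       # no length indication: doubled
    "half-long": 3,    # trebled
    "long": 4,         # quadrupled
    "syllabic": 4,     # treated as long
}

def expand_lengths(segments):
    """Turn (symbol, length label) pairs into a flat list of repeated symbols."""
    expanded = []
    for symbol, length in segments:
        expanded.extend([symbol] * REPEATS[length])
    return expanded

word = [("h", "normal"), ("a", "long"), ("t", "extra-short")]
print(expand_lengths(word))
```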
When processing the diacritics voiceless and/or voiced, we assume that a voiced voiceless segment (e.g. ) and a voiceless voiced segment (e.g. [d]) are intermediate pronunciations of a voiceless segment ([t]) and a voiced segment ([d]). Therefore we calculate the distance between a segment x and a voiced segment y as the average of the distance between x and y and the distance between x and the voiced counterpart of y. Similarly, the distance between a segment x and a voiceless segment y is calculated as the mean of the distance between x and y and the distance between x and the voiceless counterpart of y. For voiced sounds which have no voiceless counterpart (the sonorants), or for voiceless sounds which have no voiced counterpart (the glottal stop) the sound itself is used.
The diacritic apical is only processed for the [s] and the [z]. We calculate the distance between [s] and e.g. [f] as the average of the distance between [s] and [f] and  and [f]. Similarly, the distance between [z] and e.g. [v] is calculated as the mean of [z] and [v] and  and [v].
The idea behind the processing of the diacritic nasal is that a nasal sound is more or less intermediate between its non-nasal version and the [n]. We calculate the distance between a segment x and a nasal segment y as the average of the distance between x and y and the distance between x and [n].
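The averaging rules for the voicing and nasal diacritics can be sketched as follows. The base distances in `BASE` are invented numbers for illustration only, and the counterpart tables cover just the few segments used in the example; none of these values or names come from our measurements or implementation.

```python
# Invented base distances between plain segments (illustration only).
BASE = {
    frozenset(("s", "t")): 30.0,
    frozenset(("s", "d")): 40.0,
    frozenset(("a", "e")): 10.0,
    frozenset(("a", "n")): 60.0,
}

# Hypothetical counterpart tables; a segment without an entry
# (e.g. a sonorant) falls back to itself.
VOICED_OF = {"t": "d"}
VOICELESS_OF = {"d": "t"}

def base_distance(x, y):
    """Distance between two plain segments, looked up symmetrically."""
    return 0.0 if x == y else BASE[frozenset((x, y))]

def distance_to_voiced(x, y):
    """x versus y carrying the diacritic 'voiced'."""
    return (base_distance(x, y) + base_distance(x, VOICED_OF.get(y, y))) / 2

def distance_to_voiceless(x, y):
    """x versus y carrying the diacritic 'voiceless'."""
    return (base_distance(x, y) + base_distance(x, VOICELESS_OF.get(y, y))) / 2

def distance_to_nasal(x, y):
    """x versus y carrying the diacritic 'nasal'."""
    return (base_distance(x, y) + base_distance(x, "n")) / 2

print(distance_to_voiced("s", "t"))     # mean of d(s, t) and d(s, d)
print(distance_to_voiceless("s", "d"))  # mean of d(s, d) and d(s, t)
print(distance_to_nasal("a", "e"))      # mean of d(a, e) and d(a, n)
print(distance_to_voiceless("a", "n"))  # no counterpart: n itself is used
```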