Using the Levenshtein distance, two dialects are compared by comparing the pronunciation of a word in the first dialect with the pronunciation of the same word in the second. The algorithm determines how one pronunciation can be changed into the other by inserting, deleting or substituting sounds. Weights are assigned to these three operations. In the simplest form of the algorithm, all operations have the same cost, e.g. 1. Assume afternoon is pronounced as [tnn] in the dialect of Savannah, Georgia, and as [] in the dialect of Lancaster, Pennsylvania_{15}. Changing one pronunciation into the other can be done as in Table 1 (ignoring suprasegmentals and diacritics for the moment)_{16}:
Table 1. Changing one pronunciation into another using a minimal set of operations.
Pronunciation   Operation   Cost
tnn             delete      1
tnn             insert r    1
trnn            subst. /    1
Total                       3
In fact, many sequences of operations map [tnn] to []. The power of the Levenshtein algorithm is that it always finds the cost of the cheapest mapping.
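The unit-cost form of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: the function name `levenshtein` is an assumption, and the example strings are ASCII placeholders rather than the transcriptions of Table 1.

```python
def levenshtein(source, target):
    """Return the cost of the cheapest mapping from source to target.

    Insertions, deletions and substitutions all cost 1; a match costs 0.
    Both arguments are sequences of segments (a string or a list).
    """
    # previous[j] holds the cost of mapping the source prefix seen so far
    # onto the first j segments of the target.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]  # mapping i source segments onto nothing: i deletions
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,               # delete s
                current[j - 1] + 1,            # insert t
                previous[j - 1] + (s != t),    # substitute s by t, or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Placeholder strings, chosen only to exercise all three operations.
    print(levenshtein("kitten", "sitting"))
```

Because every cell keeps only the minimum over the three operations, the final cell holds the cost of the cheapest mapping, however many alternative operation sequences exist.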
Comparing pronunciations in this way, the distance between longer pronunciations will generally be greater than the distance between shorter pronunciations. The longer the pronunciation, the greater the chance of differences with respect to the corresponding pronunciation in another variety. Because this does not accord with the idea that words are linguistic units, the sum of the operation costs is divided by the length of the longest alignment that yields the minimum cost. The longest alignment has the greatest number of matches. In our example we have the following alignment:
Table 2. Alignment which gives the minimal cost. The alignment corresponds with table 1.
t n n
1 1 1
The total cost of 3 (1+1+1) is now divided by the alignment length of 9. This gives a word distance of 0.33 or 33%.
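The normalization can be sketched as follows, under the assumption that every alignment slot (match, substitution, insertion or deletion) counts as one unit of length. Each cell stores a pair (cost, length) and prefers lower cost first, then greater length, so the final cell describes the longest alignment among those with minimal cost. The function name is hypothetical, and the inputs are placeholder strings.

```python
def normalized_levenshtein(source, target):
    """Minimal unit cost divided by the length of the longest
    alignment that achieves that minimal cost."""
    rows, cols = len(source) + 1, len(target) + 1
    # table[i][j] = (cost, alignment length) for the two prefixes.
    table = [[(0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = (i, i)          # i deletions, i slots
    for j in range(1, cols):
        table[0][j] = (j, j)          # j insertions, j slots
    for i in range(1, rows):
        for j in range(1, cols):
            sub = int(source[i - 1] != target[j - 1])
            candidates = [
                (table[i - 1][j][0] + 1, table[i - 1][j][1] + 1),          # deletion
                (table[i][j - 1][0] + 1, table[i][j - 1][1] + 1),          # insertion
                (table[i - 1][j - 1][0] + sub, table[i - 1][j - 1][1] + 1) # subst./match
            ]
            # Lowest cost wins; ties are broken in favour of the longer alignment.
            table[i][j] = min(candidates, key=lambda c: (c[0], -c[1]))
    cost, length = table[-1][-1]
    return cost / length if length else 0.0
```

For the placeholder pair `"ab"` and `"ba"`, two substitutions and a deletion plus an insertion both cost 2, but the latter alignment has three slots, so the function returns 2/3. In the example above the same division is 3/9 = 0.33.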
In Section 3.1.3 we explained how distances between segments can be found using spectrograms. This makes it possible to refine our Levenshtein algorithm by using the spectrogram distances as operation weights. Now the cost of insertions, deletions and substitutions is not always equal to 1, but varies, i.e., it is equal to the spectrogram distance between the segment and ‘silence’ (insertions and deletions) or between two segments (substitution).
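A weighted variant can be sketched by passing the segment distance in as a function. In the sketch below, `segment_distance`, the `silence` marker and the toy distance values are all assumptions made for illustration; they are not the spectrogram distances of Section 3.1.3.

```python
def weighted_levenshtein(source, target, segment_distance, silence=None):
    """Cheapest mapping when operation costs come from segment distances.

    Insertions and deletions cost the distance between the segment and
    `silence`; substitutions cost the distance between the two segments.
    """
    previous = [0.0]
    for t in target:
        previous.append(previous[-1] + segment_distance(silence, t))
    for s in source:
        current = [previous[0] + segment_distance(s, silence)]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + segment_distance(s, silence),     # delete s
                current[j - 1] + segment_distance(silence, t),  # insert t
                previous[j - 1] + segment_distance(s, t),       # substitute / match
            ))
        previous = current
    return previous[-1]


# Hypothetical toy distances, NOT measured values.
TOY = {
    frozenset(("a", "e")): 0.3,
    frozenset(("a", None)): 0.8,
}


def toy_distance(x, y):
    """Symmetric lookup; identical segments cost 0, unknown pairs cost 1."""
    if x == y:
        return 0.0
    return TOY.get(frozenset((x, y)), 1.0)
```

With these toy values, `weighted_levenshtein("a", "e", toy_distance)` picks the substitution (0.3) over a deletion followed by an insertion (0.8 + 1.0).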
To take syllabification into account, the Levenshtein algorithm is adapted so that only a vowel may match with a vowel, a consonant with a consonant, [j] or [w] with a vowel (or vice versa), [i] or [u] with a consonant (or vice versa), and a central vowel (in our research only the schwa) with a sonorant (or vice versa). In this way unlikely matches (e.g. a [p] with an [a]) are prevented.
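One way to express these matching restrictions is to give each segment a set of classes and to allow a match only when two segments share a class. The sketch below follows that idea. The segment inventories are small hypothetical subsets chosen for illustration, and the class-set encoding is an assumption, not necessarily how the restriction was implemented in this study.

```python
# Hypothetical, deliberately small inventories.
VOWELS = set("aeiouy") | {"ə"}
SONORANT_CONSONANTS = set("mnlr")
SEMIVOWELS = {"j", "w"}
HIGH_VOWELS = {"i", "u"}
SCHWA = "ə"


def classes(segment):
    """Return the classes under which a segment is allowed to match."""
    result = {"vowel"} if segment in VOWELS else {"consonant"}
    if segment in SEMIVOWELS:
        result.add("vowel")        # [j], [w] may also match a vowel
    if segment in HIGH_VOWELS:
        result.add("consonant")    # [i], [u] may also match a consonant
    if segment == SCHWA or segment in SONORANT_CONSONANTS:
        result.add("sonorant")     # schwa may match a sonorant
    return result


def may_match(a, b):
    """True when the two segments share at least one class."""
    return bool(classes(a) & classes(b))
```

A forbidden pair such as `may_match("p", "a")` returns `False`. In a Levenshtein implementation such a pair could, for instance, be given a prohibitively high substitution cost, so that the alignment falls back on an insertion plus a deletion.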
In our research we used 58 different words. When a word occurred in the text more than once, the mean over the different pronunciations was used. So when comparing two dialects we get 58 Levenshtein distances. Now the dialect distance is equal to the sum of 58 Levenshtein distances divided by 58. When the word distances are presented in terms of percentages, the dialect distance will also be presented in terms of percentages. All distances between the 15 language varieties are arranged in a 15 × 15 matrix.
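The aggregation from word distances to a distance matrix can be sketched as below. The data layout (a dictionary from word to a list of pronunciations per variety) and the function names are assumptions. Averaging over all pronunciation pairs of a repeated word is one possible reading of "the mean over the different pronunciations". Any word distance function can be supplied as `word_distance`.

```python
from itertools import product
from statistics import mean


def dialect_distance(words_a, words_b, word_distance):
    """Mean word distance over the words present in both varieties.

    words_a and words_b map each word to a list of pronunciations;
    for a word with several pronunciations, the mean over all
    pronunciation pairs is taken first (an assumption).
    """
    shared = sorted(set(words_a) & set(words_b))
    per_word = [
        mean(word_distance(p, q) for p, q in product(words_a[w], words_b[w]))
        for w in shared
    ]
    return mean(per_word)


def distance_matrix(varieties, word_distance):
    """Return the variety names and the square matrix of dialect distances."""
    names = sorted(varieties)
    matrix = [
        [0.0 if a == b
         else dialect_distance(varieties[a], varieties[b], word_distance)
         for b in names]
        for a in names
    ]
    return names, matrix
```

With 58 words per variety, `dialect_distance` reduces to the sum of the 58 word distances divided by 58, and `distance_matrix` produces the square matrix described above.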
5. Results
The results of the Levenshtein distance measurements are analyzed in two ways. First, on the basis of the distance matrix we applied hierarchical cluster analysis (see Section 5.1). The goal of clustering is to identify the main groups. The groups are called clusters. Clusters may consist of subclusters, and subclusters may in turn consist of subsubclusters, etc. The result is a hierarchically structured tree in which the dialects are the leaves (Jain and Dubes, 1988). Several clustering methods exist. We used the Unweighted Pair Group Method using Arithmetic averages (UPGMA), since dendrograms generated by this method reflected distances which correlated most strongly with the original Levenshtein distances (r = 0.9832; see Sokal and Rohlf, 1962).
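The UPGMA procedure itself can be sketched in plain Python. At each step the two closest clusters are merged, and the distance from the new cluster to every other cluster is the size-weighted average of the two old distances, which equals the arithmetic mean over all pairs of leaves. The function name and return format are assumptions, and the correlation between dendrogram distances and original distances is not computed here.

```python
def upgma(names, matrix):
    """Cluster a symmetric distance matrix with UPGMA.

    Returns a list of merges (cluster_a, cluster_b, height), where each
    cluster is a tuple of leaf names and height is the average distance
    at which the two clusters are joined.
    """
    clusters = {i: (name,) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (a, b), height = min(dist.items(), key=lambda item: item[1])
        size_a, size_b = len(clusters[a]), len(clusters[b])
        merges.append((clusters[a], clusters[b], height))
        # Size-weighted average distance from the merged cluster to the rest.
        new_dist = {}
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop distances involving the two merged clusters, add the new ones.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = merged
        next_id += 1
    return merges
```

For a toy matrix with distances A-B = 2, A-C = 6 and B-C = 8 (invented values), A and B are joined at height 2 and C joins them at (6 + 8) / 2 = 7.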
Second, we ranked all varieties in order of relationship with the standard languages, Frisian and Town Frisian (see Section 5.2). When ranking with relation to Frisian, we looked at the average over all Frisian dialects. Since the ratings with respect to each of the Frisian varieties individually were very similar, averaging was justified.
5.1. The classification of the Germanic languages
Looking at the clusters of language varieties in Figure 5, we note that our results reflect the traditional classification of the Germanic languages to a large extent (see Figure 1). On the highest level there is a division between English and the other Germanic languages. When we examine the group of other Germanic languages, we find a clear division between the North Germanic languages and the West Germanic languages. Within the North Germanic group, we see a clear division between the Scandinavian languages (Danish, Norwegian and Swedish) on the one hand and Faroese and Icelandic on the other. In the genetic tree (see Figure 1), Norwegian is clustered with Icelandic and Faroese. However, due to the isolated position of Iceland and the Faroes and intensive language contact between Norway and the rest of Scandinavia, modern Norwegian has become very similar to the modern languages of Denmark and Sweden. All varieties spoken in the Netherlands, including the Frisian varieties, cluster together, and German clusters more closely with these varieties than English does.
Figure 5. Dendrogram showing the clustering of the 14 language varieties in our study. The scale distance shows average Levenshtein distances in percentages.
All Frisian dialects form a cluster. This clustering corresponds well with the traditional classification as sketched in Section 3.1. The dialects of Hijum and Oosterbierum belong to Klaaifrysk and these dialects form a cluster. The Wâldfrysk dialects of Westergeest and Wetsens also cluster together. The Levenshtein distances between the four dialects are small, ranging from 19.6% between Hijum and Oosterbierum to 23.8% between Oosterbierum and Westergeest. The Súdwesthoeksk dialects, represented by the Tjerkgaast dialect, are also rather close to the Klaaifrysk and Wâldfrysk dialects (distances between 21.6% and 26.4%). The highly conservative dialect of Hindeloopen deviates more from the other dialects (distances between 29.8% and 32.5%). This is also the case for the Town Frisian dialect of Leeuwarden, which is more similar to Dutch (20.3%) than to Frisian (between 32.3% and 35.8%). This confirms the characterization of Town Frisian by Kloeke (1927) as ‘Dutch in Frisian mouth’.
