
On the Boundaries of Phonology and Phonetics



3.3. Difficulty


One way to characterize the complexity of the training set is to compute the entropy of the distribution of successors, for every available left context. The entropy of a language L viewed as a stochastic process measures the average surprise value associated with each element (Mitchell, 1997). In our case, the language is a set of words and the elements are phonemes, hence the appropriate entropy measures the average surprise value for phonemes c preceded by a context s. Entropy is measured for a given distribution, which in our case is the set of all possible successors. We compute entropy Entr(s) for a given context s with (1):
Equation 1. Entropy

Entr(s) = − Σc∈α p(c|s) log2 p(c|s)     (1)

where α is the alphabet of segment symbols and p(c|s) is the probability of successor c given the context s. The average entropy over all available contexts s in L, weighted by their frequencies, then serves as the measure of the complexity of the words: the smaller this measure, the less difficult the words. The maximal possible value for a single context would be log2(45), that is, 5.49, and this would obtain only in the unlikely case that every phoneme was equally likely in that context. The actual average entropy measured for the Dutch monosyllables is 2.24, σ = 1.32; the minimal value was 0.0 and the maximal value was 3.96. These values may be interpreted as follows: the minimal value of 0.0 means that there are left contexts with only one possible successor (log2(1) = 0). The maximal value of 3.96 means that there is one context which is as unpredictable as one in which 2^3.96 ≈ 16 successors were equally likely. The mean entropy of 2.24 says that on average about 2^2.24 ≈ 4.7 phonemes follow a given left context.
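The computation in Equation 1 can be sketched directly: collect the successors of a left context over a word list, turn the counts into probabilities, and sum the surprise values. The lexicon and helper name below are illustrative toys, not the authors' corpus or code.

```python
from collections import Counter
from math import log2

def successor_entropy(words, context):
    """Entropy of the successor distribution after a left context.

    `words` is a list of phoneme strings and `context` a word-initial
    phoneme sequence; both are hypothetical stand-ins for the paper's
    phoneme-encoded Dutch monosyllables.
    """
    successors = Counter(
        w[len(context)]
        for w in words
        if w.startswith(context) and len(w) > len(context)
    )
    total = sum(successors.values())
    # Entr(s) = -sum over successors c of p(c|s) * log2 p(c|s)
    return -sum((n / total) * log2(n / total) for n in successors.values())

# Toy lexicon: after the context "st" two successors occur, equally likely,
# so the entropy is log2(2) = 1 bit.
lexicon = ["stro", "stol", "plan", "pret"]
print(round(successor_entropy(lexicon, "st"), 3))  # → 1.0
```

Weighting these per-context values by context frequency gives the document's overall difficulty measure.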


3.4. Negative Data


We noted above that negative data are also necessary for evaluation. Since we are interested in models that discriminate the strings of L (the Dutch syllables) as precisely as possible, the negative data for the following experiments will be biased toward L.

Three negative test sets were generated and used. First, a set RM containing strings of syllabic form [C]0...3V[C]0...4, based on the empirical observation that Dutch monosyllables have up to three onset (word-initial) consonants and up to four coda (word-final) consonants. The second group consists of three subsets of RM, {R1M, R2M, R3+M}, with fixed distances of the random strings to any existing Dutch word of 1, 2, and 3+ phonemes, respectively (measured by edit distance; Nerbonne, Heeringa & Kleiweg, 1999). Controlling for the distance to any training word allows us to assess the performance of the model more precisely. Finally, a third group consists of random strings built by concatenating n-grams picked at random from Dutch monosyllables. In particular, two sets, R2N and R3N, were generated, based on bigrams and trigrams, respectively.
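The third group's construction can be sketched as follows: slide an n-gram window over the real words, then splice randomly chosen n-grams until the target length is reached. The function name, the tiny word list, and the fixed-length cutoff are assumptions for illustration; the paper does not specify these details.

```python
import random

def random_ngram_string(words, n, length):
    """Build one R2N/R3N-style negative string by concatenating n-grams
    drawn at random from real words (illustrative sketch, not the
    authors' exact generation procedure)."""
    # Inventory of all n-grams occurring in the source words.
    ngrams = [w[i:i + n] for w in words for i in range(len(w) - n + 1)]
    s = ""
    while len(s) < length:
        s += random.choice(ngrams)
    return s[:length]

random.seed(0)  # for reproducibility of the sketch
print(random_ngram_string(["strand", "plicht", "vrees"], 2, 6))
```

Because every local n-gram in such a string is attested, these strings are locally well-formed and hence hard to reject, which is exactly what makes this group a stringent test.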

The latter group is the most "difficult" one, especially R3N, because it consists of the strings closest to Dutch. These sets are also useful for comparing SRN methods with n-gram modeling: the corresponding n-gram models will always wrongly recognize these random strings as words of the language. Wherever the connectionist predictor recognizes them as non-words, it outperforms the corresponding n-gram models, which serve as benchmark models for prediction tasks such as phonotactics learning.

3.5. Training


This section reports on network training. We first add a few more details about the training procedure, then present pilot experiments aimed at determining the hidden-layer size. The later subsections analyze the network performance.

3.5.1. Procedure


The networks were trained in a pool on the same problem, independently of each other, with the BPTT learning algorithm. The training of each individual network was organized in epochs, in the course of which the whole training set was presented in accordance with the word frequencies. The total of the logarithms of the frequencies in the training data base L1M is about 11,000, which is therefore also the number of word presentations per epoch, drawn in random order. For each word, the corresponding sequence of phonemes was presented to the input layer one phoneme at a time, followed by the end-of-sequence marker `#'. Each time step was completed by copying the hidden-layer activations to the context layer, which was used in the following step.
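The per-word presentation loop above is the standard Elman step: input and context feed the hidden layer, the hidden layer feeds the output (the next-phoneme prediction), and the hidden activations are then copied into the context layer. A minimal NumPy sketch, with illustrative weight shapes and logistic activations (the paper does not give these details):

```python
import numpy as np

def srn_forward_word(phoneme_vecs, W_in, W_ctx, W_out):
    """Present one word, phoneme by phoneme, to an Elman-style SRN.

    All weight matrices are hypothetical; the context layer starts at
    zero and is overwritten with the hidden activations after every
    time step, as described in the text.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    context = np.zeros(W_ctx.shape[0])
    outputs = []
    for x in phoneme_vecs:                       # one phoneme per step
        hidden = sigmoid(x @ W_in + context @ W_ctx)
        outputs.append(sigmoid(hidden @ W_out))  # next-phoneme prediction
        context = hidden                         # copy hidden -> context
    return outputs

rng = np.random.default_rng(1)
outs = srn_forward_word(rng.random((3, 5)),              # 3 phonemes, 5 features
                        rng.random((5, 4)),              # input -> hidden
                        rng.random((4, 4)),              # context -> hidden
                        rng.random((4, 5)))              # hidden -> output
print(len(outs))  # one prediction per input phoneme → 3
```

In training, BPTT would compare each output against the actually observed next phoneme (or the `#' marker) and backpropagate the error; the sketch shows only the forward pass with the context copy.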

The parameters of the learning algorithm were as follows: the learning coefficient η started at 0.3 and dropped by 30% each epoch, finishing at 0.001; the momentum (smoothing) term was α = 0.7. The networks required 30 epochs to complete training; after this point, very little improvement was observed.
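One reading of this schedule is geometric decay with a floor: since an unconstrained 0.3 · 0.7^29 would end far below 0.001, the stated final value suggests η is clipped there. The floor interpretation is an assumption on our part, made explicit in the sketch:

```python
def learning_rate(epoch, eta0=0.3, decay=0.7, floor=0.001):
    """Hedged reading of the schedule in the text: eta starts at 0.3,
    shrinks by 30% per epoch, and is assumed to be clipped at 0.001
    (otherwise 0.3 * 0.7**29 would be about 1e-5, not 0.001)."""
    return max(eta0 * decay ** epoch, floor)

print(round(learning_rate(0), 3))   # → 0.3
print(round(learning_rate(29), 3))  # → 0.001
```

Under this reading, η hits the floor around epoch 17 and stays there for the remaining epochs.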


3.5.2. Pilot experiments


Pilot experiments aimed at finding the most appropriate hidden-layer size were run with 20, 40, and 80 hidden neurons. To avoid the additional nondeterminism that comes from the random selection of negative data, during the pilot experiments the networks were tested solely on their ability to distinguish admissible from inadmissible successors. These experiments were done with a small pool of three networks, each trained for 30 epochs, which amounted to approximately 330,000 word presentations, or 1,300,000 segments. The number of presentations of an individual word ranged from 30 to 300, according to its frequency. The results of the training are given in Table 1, under the group of columns "Optimal Phonotactics". In the course of training, the networks typically started with a sharp error drop to about 13%, which soon turned into a very slow decrease (see Table 2).

The training of the three pools with hidden-layer sizes 20, 40, and 80 resulted in networks with similar performance, the largest network performing best. Additional experiments with SRNs with 100 hidden neurons resulted in larger errors than the network with 80 hidden neurons, so we settled experimentally on 80 hidden neurons as the likely optimal size. This procedure is admittedly rough, and one needs to be on guard against premature commitment to one model size.


Table 1. Results of a pilot study on phonotactics learning by SRNs with 20, 40, and 80 hidden neurons (rows). Each network is independently trained on the language LM three times (columns). Performance is measured using the error in predicting the next phoneme (left three columns) and using the L2 (semi-Euclidean) distance between the empirical context-dependent predictions and the network predictions for each context in the tree (right three columns). Neither evaluation method depends on randomly chosen negative data.




Hidden            Optimal Phonotactics          ||SRN_L, T_L||_L2
Layer Size     SRN1     SRN2     SRN3       SRN1     SRN2     SRN3
20             10.57%   10.65%   10.57%     0.0643   0.0642   0.0642
40             10.44%   10.51%   10.44%     0.0637   0.0637   0.0637
80             10.00%    9.97%   10.02%     0.0634   0.0634   0.0632


Table 2. A typical shape of the SRN error during training. The error drops sharply in the beginning and then slowly decreases to convergence.

Epoch       1      2-4    5-10   11-15   16-30
Error (%)   15.0   12.0   10.8   10.7    10.5

