
On the Boundaries of Phonology and Phonetics



3.3. Difficulty


One way to characterize the complexity of the training set is to compute the entropy of the distribution of successors, for every available left context. The entropy of a language L viewed as a stochastic process measures the average surprise value associated with each element (Mitchell, 1997). In our case, the language is a set of words and the elements are phonemes, hence the appropriate entropy measures the average surprise value for phonemes c preceded by a context s. Entropy is measured for a given distribution, which in our case is the set of all possible successors. We compute entropy Entr(s) for a given context s with (1):
Equation 1. Entropy

Entr(s) = − Σc∈α p(c|s) log2 p(c|s)     (1)

where α is the alphabet of segment symbols and p(c|s) is the probability of successor c given the context s. The average entropy over all available contexts s in L, weighted by their frequencies, then serves as the measure of the complexity of the words: the smaller this measure, the less difficult the words. The maximal possible value for a single context would be log2(45), that is, 5.49, and this would obtain only in the unlikely case that every phoneme was equally likely in that context. The actual average entropy measured for the Dutch monosyllables is 2.24, σ = 1.32; the minimal value was 0.0 and the maximal value was 3.96. These values may be interpreted as follows: the minimal value of 0.0 means that there are left contexts with only one possible successor (log2(1) = 0). The maximal value of 3.96 means that there is one context which is as unpredictable as one in which 2^3.96 ≈ 16 successors were equally likely. The mean entropy of 2.24 says that on average about 2^2.24 ≈ 4.7 phonemes follow a given left context.
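The computation in Equation 1 can be sketched directly: collect the successors of a left context over a word list, turn the counts into probabilities, and sum the surprise values. The lexicon and helper name below are illustrative toys, not the authors' corpus or code.

```python
from collections import Counter
from math import log2

def successor_entropy(words, context):
    """Entropy of the successor distribution after a left context.

    `words` is a list of phoneme strings and `context` a word-initial
    phoneme sequence; both are hypothetical stand-ins for the paper's
    phoneme-encoded Dutch monosyllables.
    """
    successors = Counter(
        w[len(context)]
        for w in words
        if w.startswith(context) and len(w) > len(context)
    )
    total = sum(successors.values())
    # Entr(s) = -sum over successors c of p(c|s) * log2 p(c|s)
    return -sum((n / total) * log2(n / total) for n in successors.values())

# Toy lexicon: after the context "st" two successors occur, equally likely,
# so the entropy is log2(2) = 1 bit.
lexicon = ["stro", "stol", "plan", "pret"]
print(round(successor_entropy(lexicon, "st"), 3))  # → 1.0
```

Weighting these per-context values by context frequency gives the document's overall difficulty measure.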


3.4. Negative Data


We noted above that negative data are also necessary for evaluation. Since we are interested in models that discriminate the strings of L (the Dutch syllables) as precisely as possible, the negative data for the following experiments will be biased toward L.

Three negative test sets were generated and used. First, a set RM containing strings of syllabic form [C]0...3V[C]0...4, based on the empirical observation that Dutch monosyllables have up to three onset (word-initial) consonants and up to four coda (word-final) consonants. The second group consists of three subsets of RM, {R1M, R2M, R3+M}, with fixed distances of the random strings to any existing Dutch word of 1, 2, and 3+ phonemes, respectively (measured by edit distance; Nerbonne, Heeringa & Kleiweg, 1999). Controlling for the distance to any training word allows us to assess the performance of the model more precisely. Finally, a third group consists of random strings built by concatenating n-grams picked at random from Dutch monosyllables. In particular, two sets, R2N and R3N, were generated, based on bigrams and trigrams, respectively.
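The third group's construction can be sketched as follows: slide an n-gram window over the real words, then splice randomly chosen n-grams until the target length is reached. The function name, the tiny word list, and the fixed-length cutoff are assumptions for illustration; the paper does not specify these details.

```python
import random

def random_ngram_string(words, n, length):
    """Build one R2N/R3N-style negative string by concatenating n-grams
    drawn at random from real words (illustrative sketch, not the
    authors' exact generation procedure)."""
    # Inventory of all n-grams occurring in the source words.
    ngrams = [w[i:i + n] for w in words for i in range(len(w) - n + 1)]
    s = ""
    while len(s) < length:
        s += random.choice(ngrams)
    return s[:length]

random.seed(0)  # for reproducibility of the sketch
print(random_ngram_string(["strand", "plicht", "vrees"], 2, 6))
```

Because every local n-gram in such a string is attested, these strings are locally well-formed and hence hard to reject, which is exactly what makes this group a stringent test.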

The latter group is the most "difficult" one, especially R3N, because it consists of the strings closest to Dutch. These sets are also useful for comparing SRN methods with n-gram modeling: the corresponding n-gram models will always wrongly recognize these random strings as words of the language. Wherever the connectionist predictor recognizes them as non-words, it outperforms the corresponding n-gram models, which serve as benchmark models for prediction tasks such as phonotactics learning.

3.5. Training


This section reports on network training. We first add a few more details about the training procedure, then present pilot experiments aimed at determining the hidden-layer size. The later subsections analyze the network performance.

3.5.1. Procedure


The networks were trained in a pool on the same problem, independently of each other, with the BPTT learning algorithm. The training of each individual network was organized in epochs, in the course of which the whole training set was presented in accordance with the word frequencies. The total of the logarithms of the frequencies in the training data base L1M is about 11,000, which is therefore also the number of word presentations per epoch, drawn in random order. For each word, the corresponding sequence of phonemes was presented to the input layer one phoneme at a time, followed by the end-of-sequence marker `#'. Each time step was completed by copying the hidden-layer activations to the context layer, which was used in the following step.
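The per-word presentation loop above is the standard Elman step: input and context feed the hidden layer, the hidden layer feeds the output (the next-phoneme prediction), and the hidden activations are then copied into the context layer. A minimal NumPy sketch, with illustrative weight shapes and logistic activations (the paper does not give these details):

```python
import numpy as np

def srn_forward_word(phoneme_vecs, W_in, W_ctx, W_out):
    """Present one word, phoneme by phoneme, to an Elman-style SRN.

    All weight matrices are hypothetical; the context layer starts at
    zero and is overwritten with the hidden activations after every
    time step, as described in the text.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    context = np.zeros(W_ctx.shape[0])
    outputs = []
    for x in phoneme_vecs:                       # one phoneme per step
        hidden = sigmoid(x @ W_in + context @ W_ctx)
        outputs.append(sigmoid(hidden @ W_out))  # next-phoneme prediction
        context = hidden                         # copy hidden -> context
    return outputs

rng = np.random.default_rng(1)
outs = srn_forward_word(rng.random((3, 5)),              # 3 phonemes, 5 features
                        rng.random((5, 4)),              # input -> hidden
                        rng.random((4, 4)),              # context -> hidden
                        rng.random((4, 5)))              # hidden -> output
print(len(outs))  # one prediction per input phoneme → 3
```

In training, BPTT would compare each output against the actually observed next phoneme (or the `#' marker) and backpropagate the error; the sketch shows only the forward pass with the context copy.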

The parameters of the learning algorithm were as follows: the learning coefficient η started at 0.3 and dropped by 30% each epoch, finishing at 0.001; the momentum (smoothing) term was α = 0.7. The networks required 30 epochs to complete training; after this point, very little improvement was observed.
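One reading of this schedule is geometric decay with a floor: since an unconstrained 0.3 · 0.7^29 would end far below 0.001, the stated final value suggests η is clipped there. The floor interpretation is an assumption on our part, made explicit in the sketch:

```python
def learning_rate(epoch, eta0=0.3, decay=0.7, floor=0.001):
    """Hedged reading of the schedule in the text: eta starts at 0.3,
    shrinks by 30% per epoch, and is assumed to be clipped at 0.001
    (otherwise 0.3 * 0.7**29 would be about 1e-5, not 0.001)."""
    return max(eta0 * decay ** epoch, floor)

print(round(learning_rate(0), 3))   # → 0.3
print(round(learning_rate(29), 3))  # → 0.001
```

Under this reading, η hits the floor around epoch 17 and stays there for the remaining epochs.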


3.5.2. Pilot experiments


Pilot experiments aimed at finding the most appropriate hidden-layer size were run with 20, 40, and 80 hidden neurons. To avoid the additional nondeterminism that comes from the random selection of negative data, during the pilot experiments the networks were tested solely on their ability to distinguish admissible from inadmissible successors. These experiments were done with a small pool of three networks, each trained for 30 epochs, which amounted to approximately 330,000 word presentations, or 1,300,000 segments. The number of presentations of an individual word ranged from 30 to 300, according to its frequency. The results of the training are given in Table 1, under the group of columns "Optimal Phonotactics". In the course of training, the networks typically started with a sharp error drop to about 13%, which soon turned into a very slow decrease (see Table 2).

The training of the three pools with hidden-layer sizes 20, 40, and 80 resulted in networks with similar performance, the largest network performing best. Additional experiments with SRNs with 100 hidden neurons resulted in larger errors than the network with 80 hidden neurons, so we settled experimentally on 80 hidden neurons as the likely optimal size. This procedure is admittedly rough, and one needs to be on guard against premature commitment to one model size.


Table 1. Results of a pilot study on phonotactics learning by SRNs with 20, 40, and 80 hidden neurons (rows). Each network is independently trained on the language LM three times (columns). Performance is measured using the error in predicting the next phoneme (left three columns) and using the L2 (semi-Euclidean) distance between the empirical context-dependent predictions and the network predictions for each context in the tree (right three columns). Neither evaluation method depends on randomly chosen negative data.




Hidden            Optimal Phonotactics          ||SRN_L, T_L||_L2
Layer Size     SRN1     SRN2     SRN3       SRN1     SRN2     SRN3
20             10.57%   10.65%   10.57%     0.0643   0.0642   0.0642
40             10.44%   10.51%   10.44%     0.0637   0.0637   0.0637
80             10.00%    9.97%   10.02%     0.0634   0.0634   0.0632


Table 2. A typical shape of the SRN error during training. The error drops sharply in the beginning and then slowly decreases to convergence.

Epoch       1      2-4    5-10   11-15   16-30
Error (%)   15.0   12.0   10.8   10.7    10.5

