On the Boundaries of Phonology and Phonetics



3.6. Evaluation


The performance of a neural predictor trained on phonotactics can be evaluated in different ways, depending on the particular task the network is applied to. In this section we evaluate the neural networks that performed best during the pilot studies.

3.6.1. Likelihoods


The direct outcome of training on the sequential prediction task is learning the distribution of successors. This therefore serves as a basic evaluation method: the empirical context-dependent successor distribution PsL(C) is matched against the context-dependent network predictions NPsL(C). For this purpose, the output of the network is normalized and compared against the distribution in the language data. This procedure resulted in a mean L2 (semi-Euclidean) distance of 0.063–0.064, where the optimal value would be zero (see Table 1, rightmost three columns).19 These values are close to optimal, but baseline models (completely random networks) also result in an L2 distance of approximately 0.085.
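As an illustration, the comparison step can be sketched as follows (the function name and the toy vectors are ours, not from the original implementation): the network's output activations are normalized to sum to one, and the L2 distance to the empirical successor distribution is computed.

```python
from math import sqrt

def l2_distance(net_output, empirical):
    """L2 distance between the normalized network predictions and the
    empirical successor distribution for one context C."""
    p_sum, q_sum = sum(net_output), sum(empirical)
    p = [v / p_sum for v in net_output]   # normalize activations to a distribution
    q = [v / q_sum for v in empirical]
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Toy check: activations proportional to the empirical distribution give distance 0.
print(l2_distance([0.2, 0.2], [0.5, 0.5]))  # prints 0.0
```

A perfectly trained predictor would reach 0 on every context; the reported 0.063–0.064 is the mean of such distances over contexts.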

3.6.2. Phonotactic Constraints


To evaluate the network's success in becoming sensitive to phonotactic constraints, we first need to judge how well it predicts individual phonemes. For this purpose we seek a threshold above which phonemes are predicted to be admissible and below which they are predicted to be inadmissible. This is done empirically: we search for the optimal threshold, i.e. the threshold θ* that minimizes the network error E(θ). The classification obtained in this fashion constitutes the network's predictions about phonotactics.
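A minimal sketch of this optimization (a plain grid search stands in here for the binary search used in the text; all names and data are illustrative): following the definition accompanying Figure 2, E(θ) is taken as the mean of the false-negative and false-positive rates.

```python
def network_error(theta, admissible_acts, inadmissible_acts):
    """E(theta): mean of the false-negative rate (admissible phonemes
    rejected) and the false-positive rate (inadmissible phonemes accepted)."""
    fn = sum(a <= theta for a in admissible_acts) / len(admissible_acts)
    fp = sum(a > theta for a in inadmissible_acts) / len(inadmissible_acts)
    return (fn + fp) / 2

def find_threshold(admissible_acts, inadmissible_acts, steps=1000):
    """Empirical search for the theta* minimizing E(theta)."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda t: network_error(t, admissible_acts, inadmissible_acts))
```

With well-separated activations, e.g. `find_threshold([0.8, 0.9], [0.1, 0.2])`, the search returns a threshold at which E(θ*) = 0; on real network outputs the two error curves overlap, yielding the 6.0% minimum reported below.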

We now turn to evaluating the network's predictions: the method compares the context-dependent network predictions with the corresponding empirical distributions. For this purpose, the method described by Stoianov (2001) is used. The algorithm traverses a trie (Aho, Hopcroft & Ullman, 1983: 163-169), a tree representing the vocabulary in which the initial segments form the first branches; words are paths through this data structure. The algorithm computes the performance at the optimal threshold determined by the procedure described in the last paragraph, i.e., at the threshold which determines which phonemes are admissible and which inadmissible (see also 2.1). This approach compares the actual distribution with the learned distribution, and we normally use the complete database LM for training and testing.
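The trie underlying this evaluation can be sketched in a few lines (a schematic reconstruction, not Stoianov's actual code; end-of-word marking is omitted for brevity): words are paths from the root, and each node holds the attested successors of the context spelled out by the path leading to it.

```python
def build_trie(words):
    """Build a trie: each node is a dict mapping a phoneme to a child node."""
    root = {}
    for word in words:
        node = root
        for ph in word:
            node = node.setdefault(ph, {})
    return root

def contexts(node, prefix=()):
    """Traverse the trie, yielding each context together with its set of
    attested successors -- the empirical data the network predictions
    are compared against."""
    yield prefix, set(node)
    for ph, child in node.items():
        yield from contexts(child, prefix + (ph,))
```

For example, `dict(contexts(build_trie(["kat", "kap"])))` maps the context `("k", "a")` to the successor set `{"t", "p"}`, which is then matched against the phonemes the network activates above threshold in that context.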

Figure 2 shows the error of SRN18 at different values of the threshold. The optimal-threshold search resulted in 6.0% erroneous phoneme predictions at a threshold of 0.0175. This means that if we predict phonemes with this SRN, a phoneme is accepted as an allowed successor if the activation of the corresponding neuron is higher than 0.0175.

3.6.3. Word Recognition


Using an SRN trained on phoneme prediction as a word-recognition device shifts the focus from phoneme prediction to sequence classification. We wish to see whether the network can classify sequences of phonemes into well-formed words on the one hand and ill-formed non-words on the other. To do this we need to translate the phoneme (prediction) values into sequence values. We do this by taking the sum of the phoneme error values for the sequence of phonemes in the string, normalized to correct for length effects. To translate this sum into a classification, however, we again need to determine an acceptability threshold, and we use a variant of the same empirical optimization described above. The threshold arrived at for this purpose is slightly lower than the optimal threshold of the previous algorithm. This means that the network accepts more phonemes, which, however, is compensated for by the fact that a string is accepted only if all of its phonemes are predicted. In string recognition it is better to increase the phoneme acceptance rate, because the chance of detecting a non-word is larger when more tokens are tested.
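On one reading of this procedure, the per-string decision can be sketched as follows (the threshold default and the activation values are illustrative; 0.016 is the sequence-acceptance threshold reported below):

```python
def accept_word(activations, theta=0.016):
    """A string is accepted only if all of its phonemes are predicted,
    i.e. every successive phoneme's activation exceeds the threshold."""
    return all(a > theta for a in activations)

# A single under-predicted phoneme is enough to reject the whole string:
print(accept_word([0.02, 0.05, 0.03]))   # every phoneme predicted
print(accept_word([0.02, 0.01, 0.03]))   # second phoneme below threshold
```

Lowering θ makes each individual phoneme test more lenient, but since every phoneme in the string must pass, a longer string still gives the network more chances to catch a phonotactic violation.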

Figure 2. SRN error (in %) as a function of the threshold θ. The False Negative Error increases as the threshold increases, because more and more admissible phonemes are incorrectly rejected. At the same time, the False Positive Error decreases, because fewer unwanted successors are falsely accepted. The mean of these two errors is the network error, which reaches its minimum of 6.0% at threshold θ* = 0.0175. Notice that the optimal threshold is confined to a small range, which illustrates how critical the exact setting of the threshold is for good performance.

Since the performance measure here is the mean percentage of correctly recognized monosyllables and correctly rejected random strings, we incorporate both in the search for the optimal threshold. The negative data are as described above in 3.4. As for the positive data, this approach allows us to test the generalization capacity of the model, so the training subset L1M and the testing subset L2M may be used here: the first for training the model and evaluating it during training, and the second for testing the generalization capacity of the trained network.

Once the optimal sequence-acceptance threshold (0.016) was determined, we obtained 5% error on the positive training dataset L1M and on the negative strings from RM, where the error varied by 0.5% depending on the random data set generated.

The model was tested further on the second group of negative data sets. As expected, strings that are more unlike Dutch resulted in smaller error: performance on random strings from R3N+ is almost perfect, whereas strings close to real words (from R1N) resulted in larger error.

The generalization capabilities of the network were tested on the positive L2M data, unseen during training. The error on this test set was about 6%. An explanation for this increase in error will be presented later, when the error is studied as a function of the properties of the data.

Another interesting issue is how SRN performance compares to that of other known models, e.g. n-grams. The trained SRN definitely outperformed bigrams and trigrams, which was shown by testing the trained SRNs on the non-words from the R2N and R3N sets, yielding 19% and 35% error, respectively. This means that the SRN correctly rejected four out of five non-word strings composed of correct bigrams, and two out of three non-word strings composed of correct trigrams. To clarify, note that bigram models would have 100% error on R2N, and trigram models 100% error on R3N.
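The 100% error of the n-gram baselines follows directly from how the negative sets were built, and can be made concrete (a schematic illustration with a hypothetical toy lexicon): an n-gram model necessarily accepts any string composed entirely of attested n-grams, which is exactly the construction of the R2N and R3N non-words.

```python
def attested_ngrams(lexicon, n):
    """Collect all n-grams occurring in the lexicon."""
    return {word[i:i + n] for word in lexicon for i in range(len(word) - n + 1)}

def ngram_accepts(string, lexicon, n):
    """An n-gram model accepts a string iff every n-gram in it is attested."""
    grams = attested_ngrams(lexicon, n)
    return all(string[i:i + n] in grams for i in range(len(string) - n + 1))

# Toy lexicon: the non-word "kata" is built from attested bigrams
# (ka, at, ta), so a bigram model cannot reject it -- but the SRN can.
lexicon = ["kat", "tak"]
print(ngram_accepts("kata", lexicon, 2))   # prints True
```

The SRN's 19% and 35% error rates on these sets thus measure phonotactic knowledge beyond what any bigram or trigram statistic alone can capture.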



