On the Boundaries of Phonology and Phonetics


4. Network Analysis


The distributed representations in neural networks prevent the generalizations learned by a trained model from being read off by simple inspection, as symbolic learning methods allow. Smaller NNs may be analyzed to some extent by examination, but for larger networks this is practically impossible.

It is possible, however, to analyze trained networks in order to extract abstract knowledge about their behavior. Elman (1988), for example, trained an SRN on sentences and then analyzed the hidden-layer activations of that SRN in various contexts, from which he showed that the network had internally developed syntactic categories. Similarly, we trained SRNs on phonotactics (Stoianov et al., 1998) and then analyzed the network statically, by viewing the weight vectors of each neuron as pattern classifiers. We showed that the SRN had induced generalizations about phonetic categories. We follow that earlier work here in studying network behavior, and we present the results of this study in the first subsection.

Another approach to the analysis of connectionist models treats them as black boxes and examines how network performance varies as some property of the data is varied (Plaut et al., 1996; Stoianov, Stowe & Nerbonne, 1999). For example, one can vary word frequency, length, etc., and study the network error. When modeling human cognitive functions with this approach, one can compare the behavior of the cognitive system with that of its artificial models. In phonotactic modeling, for example, one can compare results from psycholinguistic studies of a lexical decision task with the network's reaction. This will be the subject of study in the rest of the section.

4.1. Weight Analysis


The neurons of a neural network act as pattern classifiers. The inputs selectively activate one or another neuron, depending on the weight vectors. This means that information about network structure may be extracted from the weight vectors.

In this section we will present a cluster analysis of the neurons in the output layer. For that purpose, the mean weight vectors of the output layer of one of the networks - SRN24 0 (from Table 1) - were clustered using a minimum variance (Ward's) method, and each vector in the resulting dendrogram was labeled with the phoneme it corresponds to.20 The resulting diagram is shown in Figure 3.
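The clustering step can be reproduced along the following lines. This is a minimal Python sketch using SciPy's Ward linkage; the weight matrix, its dimensions, and the phoneme labels are random placeholders standing in for the trained network's actual output-layer weights, not the original data.

# Sketch: Ward's minimum-variance clustering of output-layer weight
# vectors, each labeled with the phoneme its neuron stands for.
# `weight_matrix` and `phoneme_labels` are hypothetical placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
weight_matrix = rng.normal(size=(44, 24))      # one row per output neuron (assumed 44 phonemes x 24 hidden units)
phoneme_labels = [f"p{i}" for i in range(44)]  # stand-ins for the real phoneme symbols

# Ward's method corresponds to the minimum-variance criterion used in the text.
Z = linkage(weight_matrix, method="ward")

dendrogram(Z, labels=phoneme_labels, orientation="right")
plt.title("Cluster analysis of output-neuron weight vectors")
plt.tight_layout()
plt.show()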



Figure 3. Cluster analysis of the weight vectors of the output neurons, labeled with the phonemes they correspond to. The weight vectors split into clusters which roughly correspond to existing phonetic categories.

We can see that the weight vectors (and, correspondingly, the phonemes) cluster into some well-known major natural classes - vowels (at the bottom) and consonants (in the upper part). The vowels are split into two major categories: low and semi-low front vowels (/, , a, e/), and high, back ones. The latter, in turn, are clustered into round+ and round- classes. The consonants appear to be categorized in a way less congruent with phonetics, but here, too, some established groups are distinguished. The first subgroup contains non-coronal consonants (/f, k, m, p, x/), with the exceptions of /l/ and /n/. Another subgroup contains voiced obstruents (/, d, , /). The delimiter '#' is also clustered as a consonant, in a group with /t/, which is also natural. The upper part of the figure seems to contain phonemes from different groups, but most of these phonemes are quite rare in Dutch monosyllables, e.g. //, perhaps because they have been 'loaned' from other languages, e.g. /g/.


4.2. Functional Analysis


We may also study NNs by examining their performance as a function of factors such as word frequency, similarity neighborhood, and word length. Such an analysis relates computational language modeling to psycholinguistics, and we submit that it is useful to compare the models' performance with humans'. In this section we introduce several factors which have played a role in psycholinguistic theorizing. We then examine the performance of our model as a function of these factors.

4.2.1. Psycholinguistic Factors


Frequency is one of the most thoroughly investigated word characteristics affecting performance. Numerous previous studies have demonstrated that the ease and the time with which spoken words are recognized are monotonically related to the experienced frequency of those words in the language environment (Luce, Pisoni & Goldinger, 1990; Plaut et al., 1996). The general tendency found is that the more frequent a word is, the faster and the more precisely it is recognized.

Our perception of a word is likewise known to depend on its similarity to other words. The similarity neighborhood of a word is defined as the collection of words that are phonetically similar to it. Some neighborhoods are dense with many phonetically similar words while others are sparse with few.

The so-called Coltheart-N measure of a word w counts the number of words that can be produced by replacing a single letter of w with another. We modify this concept slightly to make it sensitive to the similarity of sub-syllabic elements, so that we regard two words as similar when they share two of the three sub-syllabic constituents - onset, nucleus and coda. Empty onsets or codas count as matching. A word's neighborhood size is then the number of such similar words. Implemented precisely, the measuring process just described is computationally expensive, so we reduce its cost by probing for sub-syllables rather than for units of variable size starting from a single phoneme. This simplifies and speeds up processing. Neighborhood sizes in the corpus we used ranged from 0 to 77, with a mean of μ = 30 (σ = 13).

For example, the phonological neighborhood of the Dutch word broeds /bruts/ is given below. Note that the neighborhood contains only Dutch words.


/brts/, /brots/, /bruj/, /brujt/, /bruk/, /brur/, /brus/, /brut/, /buts/, /kuts/, /puts/, /tuts/
These represent the pronunciations of Brits `British', broods `bread' (gen.sg.), broei `brew', broeit `brew' (3rd. sg.), broek `pants', broer `brother', broes `spray nozzle', broed `brood', boots `boots' (Eng. loan), koets `coach', poets `clean' and toets `test'. Among the words with very sparse neighborhoods are // schwung, /brts/ boards, /jnt/ joint, and /skrs/ squares, all of which are of foreign origin. Words such as /hk/ hek, /bs/ bas, /lxt/ lacht, and /bkt/ bakt have large neighborhoods.
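The counting procedure described above can be sketched as follows. This is a minimal sketch assuming each word has already been parsed into an (onset, nucleus, coda) triple, with empty onsets and codas represented as empty strings; the tiny lexicon below is a toy subset loosely based on the broeds example (broods, boots, koets), not the actual corpus.

# Sketch: count neighbors that share at least two of the three
# sub-syllabic constituents (onset, nucleus, coda). Words are assumed
# to be pre-parsed into (onset, nucleus, coda) tuples; empty onsets or
# codas are empty strings and therefore match each other.
from typing import List, Tuple

Syllable = Tuple[str, str, str]  # (onset, nucleus, coda)

def neighborhood_size(word: Syllable, lexicon: List[Syllable]) -> int:
    count = 0
    for other in lexicon:
        if other == word:
            continue
        shared = sum(a == b for a, b in zip(word, other))
        if shared >= 2:          # two of onset, nucleus, coda agree
            count += 1
    return count

# Hypothetical mini-lexicon in a broad transcription:
lexicon = [("br", "u", "ts"), ("br", "o", "ts"), ("b", "u", "ts"), ("k", "u", "ts")]
print(neighborhood_size(("br", "u", "ts"), lexicon))  # -> 3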

It is still controversial how the similarity neighborhood influences cognitive processing (Balota, Paul & Spieler, 1999). Intuitively, it seems likely that words with larger neighborhoods are easier to access because of the many similar items, but from another perspective such words might be more difficult to access because of the nearby competitors and the longer selection process. In the more specific lexical decision task, however, the overall activity of many candidates has been shown to facilitate lexical decisions, so we will look for the same effect here.

The property word length might affect performance in the lexical decision task in two different ways. On the one hand, longer words provide more evidence, since more phonemes are available for deciding whether the input sequence is a word; we therefore expect higher precision for longer words and lower precision for particularly short ones. On the other hand, the network error that accumulates over recurrent iterations increases the error in phoneme predictions at later positions, which in turn will increase the overall error for longer words. For these reasons we expect a U-shaped pattern of error as word length increases. Such a pattern was observed in a study modeling grapheme-to-phoneme conversion with SRNs (Stoianov et al., 1999). Static NNs are less likely than dynamic models such as SRNs to produce such patterns.

So far we have presented three main characteristics of individual words which we expect to affect the performance of the model. A statistical correlation analysis (bivariate Spearman test) showed, however, that they are not independent, which means that an analysis of the influence of any single factor should control for the others. In particular, there is a high negative correlation between word neighborhood and word length (r = -0.476), a smaller positive correlation between neighborhood and frequency (r = 0.223), and a very small negative correlation between frequency and word length (r = -0.107). Because of the large amount of data, all these coefficients are significant at the 0.001 level.
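A check of this kind can be run with SciPy's Spearman test, as sketched below; the three per-word arrays are random placeholder values rather than the actual corpus data.

# Sketch: pairwise Spearman rank correlations between word frequency,
# neighborhood size, and word length. The arrays are placeholders.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
freq = rng.integers(1, 1000, size=500)        # hypothetical word frequencies
length = rng.integers(2, 9, size=500)         # hypothetical word lengths (in phonemes)
neighborhood = rng.integers(0, 78, size=500)  # hypothetical neighborhood sizes

pairs = {
    "neighborhood vs length": (neighborhood, length),
    "neighborhood vs frequency": (neighborhood, freq),
    "frequency vs length": (freq, length),
}
for name, (x, y) in pairs.items():
    r, p = spearmanr(x, y)
    print(f"{name}: r = {r:.3f}, p = {p:.4f}")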



Finally, it will be useful to seek a correlate in the simulation for reaction time, which psycholinguists are particularly fond of using as a probe for understanding linguistic structure. Perhaps we can find an SRN correlate to Reaction Time (RT) in the lexical decision task in network confidence, i.e., the amount of evidence that the test string is a word from the training language. The less confident the network, the slower the reaction, which can be implemented with lateral inhibition (Haykin, 1994; Plaut et al., 1996). The network confidence for a given word might be expressed as the product of the activations of the neurons corresponding to the phonemes of that word. A similar measure, which we call uncertainty U, is the negative sum of the logarithms of the (output) neuron activations, normalized with respect to word length |w| (Equation 2). Note that U varies inversely with confidence: less certain sequences get higher (positive) scores.
Equation 2. $U(w) = -\frac{1}{|w|}\sum_{i=1}^{|w|}\log a_i$, where $a_i$ is the activation of the output neuron corresponding to the $i$-th phoneme of $w$.

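A direct implementation of Equation 2 might look as follows; the activation values below are made up for illustration and do not come from the trained SRN.

# Sketch: uncertainty U of a word as the negative mean log of the
# output activations assigned to its successive phonemes (Equation 2).
import math
from typing import Sequence

def uncertainty(activations: Sequence[float]) -> float:
    """Negative sum of log-activations, normalized by word length."""
    return -sum(math.log(a) for a in activations) / len(activations)

# Hypothetical activations for a 5-phoneme word: confident predictions
# give a low U, hesitant ones a high U.
print(uncertainty([0.4, 0.3, 0.5, 0.2, 0.35]))     # ~1.1
print(uncertainty([0.05, 0.1, 0.08, 0.04, 0.06]))  # ~2.8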
To analyze the influence of these parameters, the network scores and U-values were recorded for each monosyllabic word at the optimal threshold θ* = 0.016. The data were then submitted to the statistical package SPSS for an analysis of variance using SPSS's General Linear Model (GLM). For the network score, the analysis revealed main effects of all three parameters discussed above: word neighborhood size (F = 18.4; p < 0.0001), word frequency (F = 19.2; p < 0.0001), and word length (F = 11.5; p < 0.0001). There were also interactions between neighborhood size and the other parameters: the interaction with word frequency had an F-score of 6.6 and the interaction with word length an F-score of 4.9, both significant at the 0.0001 level. Table 3 summarizes the findings. Error decreases as neighborhood size and frequency increase, and the error as a function of length shows the predicted U-shaped form (Table 3c).
Table 3. Effect of (a) frequency, (b) neighborhood density and (c) length on word uncertainty U and word error.

(a)  Frequency      Low     Mid     High
     U              2.30    2.20    2.18
     Error (%)      8.6     4.1     1.5

(b)  Neighb. size   Low     Mid     High
     U              2.62    2.30    2.21
     Error (%)      12.7    3.9     0.8

(c)  Length         Low     Mid     High
     U              2.63    2.20    2.13
     Error (%)      5.2     4.4     13.1

Analysis of variance on the U-values revealed similar dependencies. There were main effects of word neighborhood size (F = 58.2; p < 0.0001), word frequency (F = 45.9; p < 0.0001), and word length (F = 137.5; p < 0.0001), as well as the interactions observed earlier between neighborhood density and the other two variables: word length (F = 10.4; p < 0.001) and frequency (F = 5.235; p < 0.005).
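The SPSS analyses above could be approximated in Python with statsmodels, as sketched below. The DataFrame, its column names, and the binning of the three factors into low/mid/high levels are placeholders rather than the original per-word data; the model formula merely mirrors the main effects and interactions reported above.

# Sketch: ANOVA over uncertainty U with frequency, neighborhood size
# and length as factors, roughly mirroring the SPSS GLM analysis.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "U": rng.normal(2.3, 0.4, size=n),                     # placeholder uncertainty values
    "freq_bin": rng.choice(["low", "mid", "high"], size=n),
    "neigh_bin": rng.choice(["low", "mid", "high"], size=n),
    "len_bin": rng.choice(["low", "mid", "high"], size=n),
})

model = ols(
    "U ~ C(freq_bin) + C(neigh_bin) + C(len_bin)"
    " + C(neigh_bin):C(freq_bin) + C(neigh_bin):C(len_bin)",
    data=df,
).fit()
print(sm.stats.anova_lm(model, typ=2))   # F-values and p-values per factor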



The way error and uncertainty vary with frequency was expected, given the increased evidence available to the network for more frequent words. The length effect showed that the influence of the error accumulated in recursion is weaker than the effect of the stronger evidence available for longer words. Finally, the pattern of performance as neighborhood density varies confirmed the hypothesis from the lexical decision literature that larger neighborhoods make it easier for words to be recognized as such.