On the Boundaries of Phonology and Phonetics



3. Experiments


The challenge in connectionist modeling lies not only in developing theoretical frameworks, but also in getting the most out of the network models during experimentation. This section focuses on experiments on learning the phonotactics of Dutch syllables with Simple Recurrent Networks and discusses a number of related problems. It is followed by a study of the network's behavior from a linguistic point of view.

3.1. Some implementation decisions


SRNs were presented in section 2. A first implementation decision concerns how sounds are to be represented. A simple orthogonal strategy is to choose a vector of n neurons to represent n phonemes, to assign each phoneme (e.g. //) to a neuron (e.g., neuron 5 in a sequence of 45), and then to activate that one neuron and deactivate all the others whenever the phoneme is to be represented (so a // is represented by four deactivated neurons, a single activated one, and then forty more deactivated neurons). This orthogonal strategy makes no assumptions about phonemes being naturally grouped into classes on the basis of linguistic features such as consonant/vowel status, voicing, place of articulation, etc. An alternative strategy exploits such features by assigning each feature to a neuron and then representing a phoneme via a translation of its feature description into a sequence of corresponding neural activations.
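As a concrete illustration, the sketch below builds such an orthogonal (one-hot) vector in Python. The inventory fragment shown is only a placeholder; the actual experiments use the full set of 44 phonemes plus the end-of-word symbol '#'.

```python
import numpy as np

def orthogonal_encode(symbol, inventory):
    """Return a one-hot vector: exactly one active neuron per symbol."""
    vec = np.zeros(len(inventory))
    vec[inventory.index(symbol)] = 1.0
    return vec

# Illustrative inventory fragment; the real inventory has 45 symbols.
inventory = ['#', 'p', 'b', 't', 'd', 'k']
print(orthogonal_encode('t', inventory))   # [0. 0. 0. 1. 0. 0.]
```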

In phonotactics learning, the input encoding may be either feature-based or orthogonal, but the output decoding should be orthogonal in order to obtain a simple prediction of successors and to avoid a bias induced by the peculiarities of the particular feature encoding scheme used. The input encoding chosen here was also orthogonal, which requires the network to discover natural classes of phonemes by itself.

The orthogonal encoding implies that we need as many neurons as we have phonemes, plus one for the end-of-word symbol '#'. That is, the input and output layers will have 45 neurons. However, it is usually difficult to choose the right size of the hidden layer for a particular learning problem. That size is only indirectly related to the learning task and to the encoding chosen (as a subcomponent of the learning task). A linguistic bias in the encoding scheme, e.g., feature-based encoding, would simplify the learning task and decrease the number of hidden neurons required to learn it (Stoianov, 2001). Intuition tells us that hidden layers that are too small lead to an overly crude representation of the problem and larger error. Larger hidden layers, on the other hand, increase the chance that the network wanders aimlessly because the space of possibilities it needs to traverse is too large. Therefore, we sought an effective size in a pragmatic fashion: starting with a plausible size, we compared its performance to nets with double and half the number of neurons in the hidden layer, then repeated this step in the direction of the better-performing size, keeping track of earlier bounds in order to home in on an appropriate size. In this way we settled on a range of 20-80 neurons in the hidden layer, and we continued experimentation on phonotactic learning using only nets of this size.
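This size search can be summarized by the following sketch. Here `evaluate` stands for a hypothetical routine that trains a network with a given hidden-layer size and returns its error on held-out data; the stopping rule is a simplified stand-in for the bound-tracking procedure described above.

```python
def search_hidden_size(evaluate, start=40, max_steps=5):
    """Compare a size with its double and half; move toward the lower error."""
    best_size, best_err = start, evaluate(start)
    for _ in range(max_steps):
        trials = {best_size // 2: evaluate(best_size // 2),
                  best_size * 2: evaluate(best_size * 2)}
        size, err = min(trials.items(), key=lambda kv: kv[1])
        if err >= best_err:          # neither direction improves: stop searching
            break
        best_size, best_err = size, err
    return best_size
```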

However, even given the right size of the hidden layer, training will not always result in an optimal weight set W*, since network learning is nondeterministic: each training run depends on a number of stochastic variables, e.g., the initial network weights and the order of presentation of the examples. Therefore, in order to produce more successful learning, several SRNs with different initial weights were trained in a pool (group).
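A minimal sketch of such a pool, assuming a hypothetical `train_srn` routine that trains a single network from a given random seed and returns the trained network together with its final error:

```python
def train_pool(train_srn, n_networks=5):
    # Each run gets a different seed, and hence different initial weights
    # and a different order of presentation of the training examples.
    return [train_srn(seed=i) for i in range(n_networks)]
```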

The back-propagation learning algorithm is controlled by two main parameters: a learning coefficient η and a smoothing (momentum) parameter α. The first controls the speed of learning and is usually set within the range (0.1…0.3); it is advisable to choose a smaller value when the hidden layer is larger. This parameter may also vary in time, starting with a larger initial value that decreases progressively (as suggested in Kuan, Hornik & White (1994), so that the learning algorithm improves its chances of attaining a global minimum in error). Intuitively, such a schedule helps the network first to locate approximately the region containing the global minimum and later to take more precise steps in searching for that minimum (Haykin, 1994; Reed & Marks II, 1999). The smoothing parameter α was set to 0.7, which also allows the network to escape from local minima during its walk over the error surface.
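The sketch below shows one weight update of this kind, with a learning coefficient that decays over time and α = 0.7 as above. The particular decay schedule and the default starting value η₀ are illustrative assumptions, not the values used in the experiments.

```python
def sgd_momentum_step(w, grad, velocity, epoch,
                      eta0=0.3, decay=0.01, alpha=0.7):
    """One back-propagation weight update with decaying eta and momentum alpha."""
    eta = eta0 / (1.0 + decay * epoch)        # learning coefficient shrinks over time
    velocity = alpha * velocity - eta * grad  # alpha smooths successive updates
    return w + velocity, velocity
```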

The training process also depends on the initial values of the weights, which are set to random values drawn from a region (-r…+r). It is important to find a proper value for r, since large initial weights produce chaotic network behavior, impeding training. We used r = 0.1.

The SRNs used for this problem are schematically represented in Fig. 1, which shows the SRN's reaction to the input sequence /n/ after training on an exemplary set containing the sequences /nt#/, /nts#/ and /ntrk#/. For this particular database, the network has experienced the tokens '#', /s/ and // as possible successors of /n/ during training, and it will therefore activate them in response to this input sequence.
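The sketch below gives a minimal numpy version of such an SRN forward pass, with 45 input and output units and initial weights drawn from (-0.1, +0.1) as described above. The hidden-layer size and the sigmoid output nonlinearity are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SRN:
    """Elman-style simple recurrent network: the hidden state is copied back as context."""

    def __init__(self, n_in=45, n_hidden=40, n_out=45, r=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Initial weights drawn uniformly from (-r, +r), with r = 0.1 as above.
        self.W_in = rng.uniform(-r, r, (n_hidden, n_in))
        self.W_ctx = rng.uniform(-r, r, (n_hidden, n_hidden))
        self.W_out = rng.uniform(-r, r, (n_out, n_hidden))
        self.context = np.zeros(n_hidden)

    def step(self, x):
        # The hidden layer sees the current input plus the previous hidden state.
        h = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
        self.context = h                   # copy hidden state for the next time step
        return sigmoid(self.W_out @ h)     # one activation per possible successor
```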

3.2. Linguistic data - Dutch syllables


A database LM of all Dutch monosyllables - 5,580 words - was extracted from the CELEX (1993) lexical database. CELEX is a difficult data source because it contains many rare and foreign words among its approximately 350,000 Dutch lexical entries, which additionally complicate the learning task. Filtering out non-typical words is a formidable task, and one which might introduce experimenter prejudice; therefore all monosyllables were used. The monosyllables have a mean length of 4.1 tokens (σ = 0.94; min = 2; max = 8) and are built from a set of 44 phonemes plus one extra symbol representing space ('#'), used as a filler marking the end of a word.

The main dataset is split into a training database (L1) and a testing database (L2) in a proportion of approximately 85% to 15%. The training database will be used to train a Simple Recurrent Network, and the testing database will be used to evaluate the success of word recognition. Negative data will also be created for test purposes. The complete database LM will be used for some parts of the evaluation.
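A minimal sketch of such a split, assuming a random shuffle; the exact splitting procedure used in the experiments is not specified here, and the variable names simply mirror the text.

```python
import random

def split_lm(LM, train_fraction=0.85, seed=0):
    """Split the monosyllable list LM into training (L1) and testing (L2) sets."""
    words = list(LM)
    random.Random(seed).shuffle(words)
    cut = int(len(words) * train_fraction)
    return words[:cut], words[cut:]        # (L1, L2)
```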

In language modeling it is important to take into account the frequencies of word occurrence, which naturally bias humans' linguistic performance. If a model is trained on data in proportion to their empirical frequency, learning is focused on the more frequent words, which improves the performance of the model. This also makes it feasible to compare the model's performance with that of humans performing various linguistic tasks, such as a lexical decision task. For these reasons, we used the word frequencies given in the CELEX database. Because the frequencies vary greatly ([0…100,000]), we presented training items in proportion to the natural logarithms of their frequencies, in accordance with standard practice (Plaut, McClelland, Seidenberg & Patterson, 1996). This resulted in presentation frequencies in a range of [1…12].
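The log-frequency scaling can be sketched as follows; the flooring at one presentation for zero- and very low-frequency words is an assumption made for illustration.

```python
import math

def presentation_count(raw_frequency):
    # ln maps raw CELEX counts in [1 ... 100,000] to roughly [0 ... 11.5];
    # flooring at 1 ensures every word is presented at least once.
    if raw_frequency < 1:
        return 1
    return max(1, round(math.log(raw_frequency)))

print(presentation_count(100_000))   # 12  (ln(100000) is about 11.5)
print(presentation_count(2))         # 1
```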
