This section briefly presents Simple Recurrent Networks (Elman, 1988; Robinson & Fallside, 1988) and reviews earlier studies of sequential, especially phonotactic, learning. Detailed descriptions of the SRN processing mechanisms and of the Back-Propagation Through Time learning algorithm used to train the model are available elsewhere (Stoianov, 2001; Haykin, 1994), and will be reviewed only superficially.
Figure 1. Learning phonotactics with the SRNs. If the training data set contains the words /nt#/, /nts#/ and /ntrk#/ then after the network has processed a left context /n/, the reaction to an input token /t/ will be active neurons corresponding to the symbol '#' and the phonemes /s/, and //.
Simple Recurrent Networks (SRNs) were invented to encode simple artificial grammars, as an extension of the Multilayer Perceptron (Rumelhart, Hinton & Williams, 1986) with an extra input: a context layer that holds the hidden layer activations from the previous processing cycle. After training such networks, Elman (1988) investigated how the context evolves over time. The analysis showed graded encoding of the input sequence: similar items presented to the input were clustered at close, but different, shifting positions. That is, the network discovered the rules of the grammar generating the training sequences and represented them implicitly, in a distributed way. This is noteworthy because the rules for context were not encoded in advance, but rather acquired through experience. The capacity of SRNs to learn simple artificial languages was further explored in a number of studies (Cleeremans, Servan-Schreiber & McClelland, 1989; Gasser, 1992).
SRNs have the structure shown in Figure 1. They operate as follows. Input sequences SI are presented to the input layer, one element SI(t) at a time. The input layer simply transfers activation to the hidden layer through a weight matrix. After every step the hidden layer copies its activations to the context layer, which provides an additional input to the hidden layer at the next step, i.e., information about the past after a brief delay. Finally, the hidden layer neurons send their signal through a second weight matrix to the output layer neurons, whose activation is interpreted as the product of the network. Since the activation of the hidden layer depends both on its previous state (the context) and on the current input, SRNs have the theoretical capacity to be sensitive to the entire history of the input sequence. However, practical limitations restrict the time span of the context information to at most 10-15 steps (Christiansen & Chater, 1999). The size of the layers does not restrict the range of temporal sensitivity.
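The processing cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the class name `SRN`, the weight names `W_ih`, `W_ch` and `W_ho`, the logistic activation function, the random initialisation range and the omission of bias terms are all assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a weight matrix (list of rows) by an activation vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class SRN:
    """Minimal SRN forward pass (hypothetical names, no bias terms)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.W_ih = rand(n_hidden, n_in)      # input   -> hidden
        self.W_ch = rand(n_hidden, n_hidden)  # context -> hidden
        self.W_ho = rand(n_out, n_hidden)     # hidden  -> output
        self.n_hidden = n_hidden

    def run(self, sequence):
        """Process one sequence SI and return the output N(t) for every t."""
        context = [0.0] * self.n_hidden       # empty context at sequence start
        outputs = []
        for x in sequence:                    # one element SI(t) at a time
            net = [a + b for a, b in zip(matvec(self.W_ih, x),
                                         matvec(self.W_ch, context))]
            hidden = [sigmoid(n) for n in net]
            outputs.append([sigmoid(o) for o in matvec(self.W_ho, hidden)])
            context = hidden                  # copy hidden layer to context layer
        return outputs
```

Because the context is carried from step to step, the same input token generally produces different outputs at different positions in a sequence, which is the property that makes the network sensitive to its left context.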
The network operates in two working regimes: supervised training and network use. In the latter, the network is presented with the sequential input data SI(t) and computes the output N(t) using contextual information. The training regime involves the same sort of processing as network use, followed by a second, training step, which compares the network reaction N(t) to the desired one ST(t) and uses the difference to adjust the network behavior in a way that improves future network performance on the same data.
The two most popular supervised learning algorithms used to train SRNs are the standard Back-Propagation algorithm (Rumelhart et al., 1986) and the Back-Propagation Through Time (BPTT) algorithm (Haykin, 1994). While the former is simpler because it uses information from one previous time step only (the context activation, the current network activations, and error), the latter trains the network faster, because it collects errors from all time steps during which the network processes the current sequence and therefore adjusts the weights more precisely. However, the BPTT learning algorithm is also cognitively less plausible, since collecting the time-spanning information requires mechanisms specific to symbolic methods. Nevertheless, this compromise allows more extensive research; without it, the problems discussed below would require much longer training time when using standard computers for simulations. Therefore, the BPTT learning algorithm is used in the experiments reported here. In brief, it works in the following way: the network reaction to a given input sequence is compared to the desired target sequence at every time step and an error is computed. The network activation and error at each step are kept in a stack. When the whole sequence has been processed, the error is propagated back through space (the layers) and time, and weight-updating values are computed. Then, the network weights are adjusted with the values computed in this way.
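The BPTT procedure summarised above can be illustrated with the following sketch. It is a minimal version written under several assumptions that are not taken from the source: logistic units, a summed squared error, no bias terms, plain gradient descent, and the hypothetical names `forward`, `sequence_error`, `bptt_gradients` and `apply_update`.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(W_ih, W_ch, W_ho, inputs):
    """Run the SRN over one sequence, stacking the activations of every step."""
    n_hidden = len(W_ch)
    context = [0.0] * n_hidden
    stack = []
    for x in inputs:
        hidden = [
            sigmoid(sum(w * v for w, v in zip(W_ih[j], x))
                    + sum(w * c for w, c in zip(W_ch[j], context)))
            for j in range(n_hidden)
        ]
        out = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in W_ho]
        stack.append((x, context, hidden, out))
        context = hidden
    return stack


def sequence_error(W_ih, W_ch, W_ho, inputs, targets):
    """Summed squared error over all time steps (an assumed error measure)."""
    stack = forward(W_ih, W_ch, W_ho, inputs)
    return 0.5 * sum((y - t) ** 2
                     for (_, _, _, out), tgt in zip(stack, targets)
                     for y, t in zip(out, tgt))


def bptt_gradients(W_ih, W_ch, W_ho, inputs, targets):
    """Propagate the stacked errors back through the layers and through time."""
    stack = forward(W_ih, W_ch, W_ho, inputs)
    n_hidden = len(W_ch)
    g_ih = [[0.0] * len(row) for row in W_ih]
    g_ch = [[0.0] * len(row) for row in W_ch]
    g_ho = [[0.0] * len(row) for row in W_ho]
    from_future = [0.0] * n_hidden  # error reaching hidden(t) from step t+1
    for (x, context, hidden, out), tgt in zip(reversed(stack), reversed(targets)):
        d_out = [(y - t) * y * (1.0 - y) for y, t in zip(out, tgt)]
        for o, row in enumerate(g_ho):
            for j in range(n_hidden):
                row[j] += d_out[o] * hidden[j]
        d_hid = [
            (sum(W_ho[o][j] * d_out[o] for o in range(len(W_ho))) + from_future[j])
            * hidden[j] * (1.0 - hidden[j])
            for j in range(n_hidden)
        ]
        for j in range(n_hidden):
            for i in range(len(x)):
                g_ih[j][i] += d_hid[j] * x[i]
            for k in range(n_hidden):
                g_ch[j][k] += d_hid[j] * context[k]
        # Error sent one step further back through the context connections.
        from_future = [sum(W_ch[j][k] * d_hid[j] for j in range(n_hidden))
                       for k in range(n_hidden)]
    return g_ih, g_ch, g_ho


def apply_update(W, grad, learning_rate=0.1):
    """Adjust the weights in place once the whole sequence has been processed."""
    for row, g_row in zip(W, grad):
        for j in range(len(row)):
            row[j] -= learning_rate * g_row[j]
```

The `from_future` vector is what distinguishes this from standard Back-Propagation: without it, each step would be trained on its own error only, and the context weights would receive no information about errors made later in the sequence.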
2.1. Learning Phonotactics with SRNs
Dell, Juliano & Govindjee (1993) showed that words could be described not only with symbolic approaches, using word structure and content, but also with a connectionist approach. In this early study of learning word structure with neural nets (NNs), the authors trained SRNs to predict the phoneme that follows the current input phoneme, given context information. The data sets contained 100-500 English words. An important part of their paper is the analysis and modeling of a number of speech-error phenomena, which were taken as strong support for parallel distributed processing (PDP) models, in particular SRNs. These phenomena included phonological movement errors (reading list - leading list), manner errors (department - jepartment), phonotactic regularity violations (dorm - dlorm), consonant-vowel category confusions, and initial consonant omissions (cluster-initial consonants being dropped, as when 'stop' is mispronounced [tp]).
Aiming at segmentation of continuous phonetic input, Shillcock et al. (1997) and Cairns et al. (1997) trained SRNs on English phonotactics with a version of the BPTT learning algorithm. They used 2 million phonological segments derived from a transcribed speech corpus, each encoded as a vector of nine phonological features. The neural network was presented with a single phoneme at a time and was trained to produce the previous, the current and the next phonemes. The output corresponding to the predicted phoneme was matched against the following phoneme, measuring cross-entropy; this produced a varying error signal with occasional peaks corresponding to word boundaries. The SRN reportedly learned to reproduce the current phoneme and the previous one, but was poor at predicting the following phoneme. Correspondingly, the segmentation performance was quite modest: only about one-fifth of the word boundaries were predicted correctly, although the network was more successful in predicting syllable boundaries. Segmentation performance improved significantly when other cues, such as prosodic information, were added. This suggests that phonotactics alone might suffice for syllable detection, but that polysyllabic word detection needs extra cues.
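The idea of reading boundaries off the prediction error can be sketched as follows. The sketch is not a reconstruction of the procedure in the cited studies: the function names, the use of a simple local-maximum rule with a minimum height, and the toy error values are all assumptions made for illustration.

```python
import math


def cross_entropy(predicted, actual, eps=1e-12):
    """Cross-entropy between predicted and actual binary feature vectors."""
    total = 0.0
    for y, t in zip(predicted, actual):
        y = min(max(y, eps), 1.0 - eps)  # keep log() away from 0
        total -= t * math.log(y) + (1.0 - t) * math.log(1.0 - y)
    return total


def error_peaks(errors, min_height):
    """Positions where the error signal has a local peak of sufficient height."""
    return [i for i in range(1, len(errors) - 1)
            if errors[i] >= min_height
            and errors[i] > errors[i - 1]
            and errors[i] > errors[i + 1]]


# Hypothetical error signal, one value per phoneme position.
toy_errors = [0.2, 0.3, 1.9, 0.4, 0.2, 2.3, 0.5]
print(error_peaks(toy_errors, min_height=1.0))  # [2, 5]
```

In such a scheme each peak position is a candidate boundary; how high a peak must be before it counts is a free parameter that would have to be tuned on held-out data.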
In another connectionist study on phonological regularities, Rodd (1997) trained SRNs on 602 Turkish words; the networks were trained to predict the following phonemes. Analyzing the hidden layer representations developed during the training, the author found that hidden units came to correspond to graded detectors for natural phonological classes such as vowels, consonants, voiced stops and front and back vowels. This is further evidence that NN models can capture important properties of the data they have been trained on without any prior knowledge, based only on statistical co-occurrences.
Learning the graphotactics and phonotactics of Dutch monosyllables with connectionist models was first explored by Tjong Kim Sang (1995) and Tjong Kim Sang & Nerbonne (1999), who trained SRNs to predict graphemes/phonemes based on preceding segments. The data was orthogonally encoded, that is, for each phoneme or grapheme there was exactly one neuron activated at the input and output layers (see Section 3.1 below). To test the knowledge learned by the network, Tjong Kim Sang and Nerbonne checked whether the activation of the neurons corresponding to the expected symbols is greater than a threshold, determined as the lowest activation for some correct sequence encountered in the training data. This resulted in almost perfect acceptance of unseen Dutch words (generalization), but also in negligible discrimination with respect to (ill-formed) random strings. The authors concluded that “SRNs are unfit for processing our data set” (Tjong Kim Sang & Nerbonne, 1999).
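The orthogonal encoding and the threshold test just described can be illustrated with a short sketch. Everything specific in it is hypothetical: the four-symbol alphabet, the toy words, and in particular `toy_predict`, a lookup table that merely stands in for a trained SRN and, unlike a real SRN, looks only at the last symbol of the left context.

```python
ALPHABET = ["a", "b", "c", "#"]  # hypothetical symbols; '#' marks the word end


def one_hot(symbol, alphabet=ALPHABET):
    """Orthogonal encoding: exactly one active neuron per symbol."""
    return [1.0 if s == symbol else 0.0 for s in alphabet]


def toy_predict(prefix):
    """Stand-in for a trained network: output activations after a left context."""
    table = {
        "a": [0.05, 0.90, 0.10, 0.02],
        "b": [0.03, 0.04, 0.60, 0.50],
        "c": [0.02, 0.03, 0.05, 0.90],
    }
    return table[prefix[-1]]


def min_expected_activation(word, predict, alphabet=ALPHABET):
    """Lowest activation assigned to the symbol that actually follows."""
    lowest = 1.0
    for t in range(len(word) - 1):
        activations = predict(word[: t + 1])
        lowest = min(lowest, activations[alphabet.index(word[t + 1])])
    return lowest


def training_threshold(train_words, predict):
    """Lowest expected-symbol activation found anywhere in the training words."""
    return min(min_expected_activation(w, predict) for w in train_words)


def accepts(word, predict, threshold):
    """A word is accepted if no expected symbol falls below the threshold."""
    return min_expected_activation(word, predict) >= threshold


threshold = training_threshold(["ab#", "abc#"], toy_predict)
print(threshold)                              # 0.5
print(accepts("abc#", toy_predict, threshold))  # True
print(accepts("ac#", toy_predict, threshold))   # False
```

Taking the minimum over the whole training set guarantees that every training word is accepted, but a single poorly predicted training word can pull the threshold so low that almost any string passes, which is consistent with the weak rejection of random strings reported above.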
These early works on learning phonotactics with SRNs prompted the work reported here. First, Stoianov et al. (1998) demonstrated that the SRNs in Tjong Kim Sang and Nerbonne's work were learning phonotactics rather better than those authors had realized. By analyzing the error as a function of the acceptance threshold, Stoianov et al. (1998) were able to demonstrate the existence of thresholds successful at both the acceptance of well-formed data and the rejection of ill-formed data (see Section 3.6.2 below for a description of how we determine such thresholds). The interval of high-performing thresholds is narrow, which is why earlier work had not identified it (see Figure 2 for how narrow the window is). More recently, Stoianov & Nerbonne (2000) have studied the performance of SRNs from a cognitive perspective, attending to the errors produced by the network and to what extent they correlate with the performance of humans on related lexical decision tasks. The current article ties these two strands of work together and presents them systematically.
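Analyzing the error as a function of the acceptance threshold can be sketched as a simple sweep. This is an illustrative sketch only and does not reproduce the procedure of Section 3.6.2: the function names, the choice of the mean of the two error rates as the error measure, the grid of thresholds, and the toy scores are all assumptions.

```python
def error_at(threshold, good_scores, bad_scores):
    """Mean of false-rejection and false-acceptance rates at one threshold.

    Each score is the lowest expected-symbol activation for one string;
    a string is accepted when its score reaches the threshold.
    """
    false_rejections = sum(s < threshold for s in good_scores) / len(good_scores)
    false_acceptances = sum(s >= threshold for s in bad_scores) / len(bad_scores)
    return (false_rejections + false_acceptances) / 2.0


def sweep(good_scores, bad_scores, steps=1000):
    """Error for every threshold on an evenly spaced grid in [0, 1]."""
    return [(i / steps, error_at(i / steps, good_scores, bad_scores))
            for i in range(steps + 1)]


def best_threshold(good_scores, bad_scores, steps=1000):
    """First grid threshold with the lowest error."""
    return min(sweep(good_scores, bad_scores, steps), key=lambda pair: pair[1])


# Hypothetical scores for well-formed and ill-formed strings.
good = [0.020, 0.031, 0.015, 0.050, 0.0125]
bad = [0.001, 0.004, 0.0105, 0.002, 0.0005]
print(best_threshold(good, bad))  # (0.011, 0.0)
```

With these toy scores only a sliver of the threshold range separates the two classes, while thresholds at either extreme accept or reject everything and yield an error of 0.5. A threshold fixed in advance can therefore easily miss the useful interval.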