This paper explores the learning of phonotactics in neural networks. Experiments are conducted on the complete set of over 5,000 Dutch monosyllables extracted from CELEX, and the results are shown to be accurate to within 5% error. Extensive comparisons to human phonotactic learning conclude the paper. We focus on whether phonotactics can be learned effectively and on how the induced learning compares to human behavior.
Phonotactics concerns the organization of the phonemes in words and syllables. The phonotactic rules constrain how phonemes combine in order to form larger linguistic units (syllables and words) in that language (Laver, 1994). For example, Cohen, Ebeling & van Holk (1972) describe the phoneme combinations possible in Dutch, which will be the language in focus in this study.
Phonotactic rules are implicit in natural languages so that humans require no explicit instruction about which combinations are allowed and which are not. An explicit phonotactic grammar can of course be abstracted from the words in a language, but this is an activity linguists engage in, not language learners in general. Children normally learn a language's phonotactics in their early language development and probably update it only slightly once they have mastered the language.
Most work on language acquisition has arisen in linguistics and psychology, and that work employs mechanisms developed specifically for language, typically discrete symbol-manipulation systems. Phonotactics in particular has been modeled with n-gram models, Finite State Machines, Inductive Logic Programming, etc. (Tjong Kim Sang, 1998; Konstantopoulos, 2003). Such approaches are effective, but a cognitive scientist may ask whether the same success would be possible using less custom-made tools. The brain, viewed as a computational machine, exploits other principles, which have been modeled in the approach known as Parallel Distributed Processing (PDP), thoroughly described in the seminal work of Rumelhart & McClelland (1986). Computational models inspired by brain structure and neural processing principles are Neural Networks (NNs), also known as connectionist models.
Learning phonotactic grammars is not an easy problem, especially when one restricts one's attention to cognitively plausible models. Since languages are experienced and produced dynamically, we need to focus on the processing of sequences, which complicates the learning task. The history of research in connectionist language learning shows both successes and failures even when one concentrates on simpler structures, such as phonotactics (Stoianov, Nerbonne & Bouma, 1998; Stoianov & Nerbonne, 2000; Tjong Kim Sang, 1995; Tjong Kim Sang & Nerbonne, 1999; Pacton, Perruchet, Fayol & Cleeremans, 2001).
This paper will attack phonotactics learning with models that have no specifically linguistic knowledge encoded a priori. The models naturally do have a bias, viz., toward extracting local conditioning factors for phonotactics, but we maintain that this is a natural bias for many sorts of sequential behavior, not only linguistic processing. A first-order Discrete Time Recurrent Neural Network (DTRNN) (Carrasco, Forcada & Neco, 1999; Tsoi & Back, 1997) will be used: the Simple Recurrent Network (SRN) (Elman, 1988). SRNs have been applied to a variety of language problems (Elman, 1991; Christiansen & Chater, 1999; Lawrence, Giles & Fong, 1995), including learning phonotactics (Shillcock, Levy, Lindsey, Cairns & Chater, 1993; Shillcock, Cairns, Chater & Levy, 1997). With respect to phonotactics, we have also contributed reports (Stoianov et al., 1998; Stoianov & Nerbonne, 2000; Stoianov, 1998).
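The defining feature of the SRN can be sketched concretely: the hidden layer receives both the current input and a copy of its own previous state, so each output is conditioned on the sequence so far. The following forward pass is a minimal illustrative sketch; the layer sizes, initialization, and softmax output layer are our assumptions for the example, not details drawn from the cited implementations.

```python
import numpy as np

class SRN:
    """Toy Elman-style Simple Recurrent Network (forward pass only)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_ih = rng.normal(0, 0.1, (n_hidden, n_in))      # input -> hidden
        self.W_hh = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden
        self.W_ho = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output
        self.h = np.zeros(n_hidden)                           # context (previous hidden state)

    def step(self, x):
        # The new hidden state depends on the current input AND the
        # previous hidden state, copied back as "context".
        self.h = np.tanh(self.W_ih @ x + self.W_hh @ self.h)
        z = self.W_ho @ self.h
        e = np.exp(z - z.max())
        return e / e.sum()  # softmax: likelihood of each possible successor

    def reset(self):
        # Clear the context before presenting a new string.
        self.h = np.zeros_like(self.h)

# Run a short sequence over a toy 4-symbol alphabet, one-hot encoded.
net = SRN(n_in=4, n_hidden=8, n_out=4)
for sym in [0, 2, 1]:
    x = np.eye(4)[sym]
    p = net.step(x)  # p[k] ~ P(next symbol = k | context so far)
```

After training (by BP or BPTT, as discussed below), the output vector `p` is read as a set of context-dependent successor likelihoods.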
SRNs have been shown capable of representing regular languages (Omlin & Giles, 1996; Carrasco et al., 1999). Kaplan & Kay (1994) demonstrated that the apparently context-sensitive rules that are standardly found in phonological descriptions can in fact be expressed within the more restrictive formalism of regular relations. We begin thus with a device which is in principle capable of representing the needed patterns.
We then simulate the language learning task by training networks to produce context-dependent predictions. We also show how the continuous predictions of trained SRNs - likelihoods that a particular token can follow the current context - can be transformed into more useful discrete predictions, or, alternatively, string recognitions.
In spite of the above claims about representability, the Back-Propagation (BP) and Back-Propagation Through Time (BPTT) learning algorithms used to train SRNs do not always find optimal solutions - SRNs that produce only correct context-dependent successors or recognize only strings from the training language. Hence, section 3 focuses on the practical demonstration that a realistic language learning task may be simulated by an SRN. We evaluate the network learning from different perspectives - grammar learning, phonotactics learning, and language recognition. The last two methods need one language-specific parameter - a threshold - that distinguishes successors/words allowed in the training language. This threshold is found with a post-training procedure, but it could also be sought interactively during training.
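The thresholding step that turns continuous predictions into discrete ones can be illustrated with a stand-in predictor. In this sketch a toy bigram table plays the role of the trained network, and both the probabilities and the threshold value are invented for illustration; the point is only the decision rule: a string is accepted iff every context-dependent successor clears the threshold.

```python
# Toy successor likelihoods, standing in for a trained network's output.
# '#' marks the string start, '$' the string end; values are illustrative.
BIGRAM_P = {
    ('#', 's'): 0.20, ('s', 't'): 0.15, ('t', 'a'): 0.25, ('a', '$'): 0.30,
    ('#', 't'): 0.10, ('t', 's'): 0.001,  # a 'ts-' onset is near-impossible
}

def predict(prev, cur):
    """Likelihood that `cur` follows `prev` (0.0 for unseen pairs)."""
    return BIGRAM_P.get((prev, cur), 0.0)

def accepts(word, threshold=0.01):
    """Recognize `word` iff every transition clears the threshold."""
    symbols = ['#'] + list(word) + ['$']
    return all(predict(a, b) >= threshold
               for a, b in zip(symbols, symbols[1:]))

print(accepts('sta'))  # True: every transition is sufficiently likely
print(accepts('tsa'))  # False: the 't' -> 's' transition falls below threshold
```

The threshold here is fixed by hand; in the paper's setting it is a language-specific parameter found by a post-training search.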
Finally, section 4 assesses the networks from linguistic and psycholinguistic perspectives: a static analysis extracts acquired linguistic knowledge from network weights, and the network performance is compared to humans' in a lexical decision task. The network performance, in particular the distribution of errors as a function of string position, will be compared to alternative construals of Dutch syllabic structure - following a suggestion from discussions of psycholinguistic experiments about English syllables (Kessler & Treiman, 1997).
This section will review standard arguments that demonstrate the cognitive and practical importance of phonotactics. English phonotactic rules such as:
‘/s/ may precede, but not follow /t/ syllable-initially’
(ignoring loanwords such as `tsar' and `tse-tse') may be adduced by judging the well-formedness of sequences of letters/phonemes, taken as words in the language, e.g. /stɒp/ vs. */tsɒp/. There may also be cases judged to be of intermediate acceptability. So, even if all of the following are English words:
/mʌðə/ `mother', /fɑːðə/ `father', /sɪstə/ `sister'
None of the following are, however:
*/m/, */f/, */tss/
None of these sound like English words. However, the following sequences:
/m/, /fu/, /snt/
"sound" much more like English, even if they mean nothing and therefore are not genuine English words. We suspect that, e.g., /snt/ 'santer', could be used to name a new object or a concept.
This simple example shows that we have a feeling for word structure, even if we have no explicit knowledge of it. Given the huge variety of words, it is more efficient to put this knowledge into a compact form - a set of phonotactic rules. These rules state which phonemic sequences sound correct and which do not. In the same vein, second language learners experience a period when they recognize that certain phonemic combinations (words) belong to the language they are learning without knowing the meaning of those words.
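Such a compact rule set can be made fully explicit. The sketch below encodes a handful of English onset constraints - deliberately incomplete, and stated over orthographic characters rather than phonemes for readability; the onset inventory is our illustrative assumption, not a complete grammar.

```python
# A tiny, explicit phonotactic (onset) grammar for illustration only.
# '' permits vowel-initial words; the list is far from exhaustive.
LEGAL_ONSETS = {'', 's', 't', 'm', 'f', 'st', 'sn', 'fr', 'spr', 'str'}
VOWELS = set('aeiou')  # orthographic stand-in for vowel phonemes

def onset_ok(word):
    """Check whether the initial consonant cluster is a legal onset."""
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    return word[:i] in LEGAL_ONSETS

print(onset_ok('stop'))  # True: 'st' is a legal onset
print(onset_ok('tsop'))  # False: 'ts' is not (loanwords aside)
```

A full phonotactic grammar would add nucleus and coda constraints in the same style; the contrast with the network approach is that here every constraint must be written down by hand.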
Convincing psycholinguistic evidence that we make use of phonotactics comes from studying the information sources used in word segmentation (McQueen, 1998). In a variety of experiments, this author shows that word boundary locations are likely to be signaled by phonotactics. The author rules out the possibility that other sources of information, such as prosodic cues, syllabic structure and lexemes, are sufficient for segmentation. Similarly, Treiman & Zukowski (1990) had shown earlier that phonotactics play an important role in the syllabification process. According to McQueen (1998), phonotactic and metrical cues play complementary roles in the segmentation process. In accordance with this, some researchers have elaborated on a model for word segmentation: the Possible Word Constraints Model (Norris, McQueen, Cutler & Butterfield, 1997), in which likely word-boundary locations are marked by phonotactics, metrical cues, etc., and in which they are further fixed by using lexicon-specific knowledge.
Exploiting the specific phonotactics of Japanese, Dupoux, Pallier, Kakehi & Mehler (2001) conducted an experiment with Japanese listeners who heard stimuli that contained illegal consonant clusters. The listeners tended to hear an acoustically absent vowel that brought their perception into line with Japanese phonotactics. The authors were able to rule out lexical influences as a putative source for the perception of the illusory vowel, which suggests that speech perception must use phonotactic information directly.
Further justification for the postulation of a neurobiological device that encodes phonotactics comes from neurolinguistic and neuroimaging studies. It is widely accepted that the neuronal structure of Broca’s area (in the brain's left frontal lobe) is used for language processing, and more specifically that it represents a general sequential device (Stowe, Wijers, Willemsen, Reuland, Paans & Vaalburg, 1994; Reilly, 2002). A general sequential processor capable of working at the phonemic level would be a plausible realization of a neuronal phonotactic device.
Besides cognitive modeling, there are also a number of practical problems that would benefit from effective phonotactic processing. In speech recognition, for example, a number of hypotheses that explain the speech signal are created, from which the impossible sound combinations have to be filtered out before further processing. This exemplifies a lexical decision task, in which a model is trained on a language L and then tests whether a given string belongs to L. In such a task a phonotactic device would be of use. Another important problem in speech recognition is word segmentation. Speech is continuous, but we divide it into psychologically significant units such as words and syllables. As noted above, there are a number of cues we can use to distinguish these elements - prosodic markers, context, but also phonotactics. As with lexical decision, an intuitive strategy here is to split the phonetic/phonemic stream at points where phonotactic constraints are violated (see Shillcock et al. (1997) and Cairns, Shillcock, Chater & Levy (1997) for connectionist modeling). Similarly, the constraints on the letters forming words in written languages (graphotactics) are useful in word-processing applications, for example, spell-checking.
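The violation-based segmentation strategy can be sketched in a few lines. Here a toy inventory of word-internal bigrams stands in for the phonotactic knowledge (whether hand-written rules or a trained network); the inventory and the example stream are our illustrative assumptions.

```python
# Toy set of bigrams that may occur word-internally; everything else
# signals a word boundary. Invented for illustration, not English data.
LEGAL = {'he', 'el', 'lp', 'me', 'mi', 'in', 'nd'}

def segment(stream):
    """Insert a boundary wherever two adjacent symbols violate phonotactics."""
    words, start = [], 0
    for i in range(len(stream) - 1):
        if stream[i:i + 2] not in LEGAL:       # phonotactic violation
            words.append(stream[start:i + 1])  # boundary between i and i+1
            start = i + 1
    words.append(stream[start:])
    return words

print(segment('helpmind'))  # ['help', 'mind']: 'pm' is not a legal bigram
```

Real segmentation would combine this cue with the prosodic and lexical cues mentioned above, since many genuine boundaries fall between phonotactically legal pairs.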
There is another, more speculative aspect to investigating phonotactics. Searching for an explanation of the structure of the natural languages, Carstairs-McCarthy presented in his recent book (1999) an analogy between syllable structure and sentence structure. He argues that sentences and syllables have a similar type of structure. Therefore, if we find a proper mechanism for learning the syllabic structures, we might apply a similar mechanism to learning syntax as well. Of course, syntax is much more complex and more challenging, but if Carstairs-McCarthy is right, the basic principles of both devices might be the same.