As this thesis is based on a corpus analysis, it is necessary to state what a corpus is. The following sections therefore introduce corpora and corpus-related issues. One of them deals with corpus typology; only those types that are relevant to the corpus created for the purposes of this thesis are described closely, with particular attention paid to comparable and parallel corpora. Another section is devoted to the use of corpora in various disciplines, with a special focus on their use in translation.
Before proceeding further, however, the branch of linguistics most closely linked to corpora needs to be at least briefly presented. In the words of Lynne Bowker and Jennifer Pearson, taken from the introductory chapter of Working with Specialized Language, “Corpus linguistics is an approach or a methodology for studying language use”, one which makes extensive use of computer technology. “As you may have guessed,” they continue, “corpus linguistics requires the use of a corpus” (9).
1.1 Corpora Definition
Many definitions of a corpus are available. The Macmillan English Dictionary for Advanced Learners, for example, offers two.
The first states that a corpus is “a collection of writing, for example all the writings of one person” (Rundell 330). This definition is, however, rather vague.
The second can be considered more valuable, as it says that a corpus is “a collection of written and spoken language stored on computer and used for language research and writing dictionaries” (Rundell 330). However, it is still too general.
Therefore, a more elaborate definition, introduced by John Sinclair, is provided here:
A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to present, as far as possible, a language or language variety as a source of data for linguistic research (“Corpus and Text” 16).
These criteria differ from case to case, depending on the type of project for which a corpus is used; determining them is nevertheless the first major step in corpus building. Sinclair also advises that the criteria for the structure of a corpus be “small in number, clearly separated from each other, and efficient as a group delineating a corpus that is representative of the language or a variety under examination” (“Corpus and Text” 5). Such selection criteria are one of the four characteristics listed by Bowker and Pearson that “make corpora different from other types of text collections.” The remaining three are that the collection must also be authentic, electronic, and large (9).
The criteria for the corpus enclosed at the end of this bachelor thesis are elaborated on in section 2.1.1.
1.2 Corpora Typology
There is an immense number of corpus types, as language is investigated in many ways (Bowker 11). The following sections nevertheless attempt to list at least the best-known ones, which are in fact broad categories of corpora defined according to the distinct criteria introduced below.
The first way of determining the type of corpora concerns the mode of texts they include.
Not surprisingly, spoken corpora consist of texts that are “the transcriptions of spoken materials (e.g. conversations, broadcasts, lectures, etc)” (Bowker and Pearson 12). Creating such corpora is, however, very time-consuming and extremely labor-intensive, and the recording of speech brings many further obstacles to their compilation. This type of corpus is therefore very scarce (50). An example of this type is the Corpus of Spoken American English (CSAE) (Biber, Conrad and Reppen 282).
Their opposite are written corpora, which “contain only texts that were originally written”. These corpora are very common, since materials for their compilation are easily obtainable (Bowker and Pearson 12).
In some cases, corpora include both types of texts, for example the British National Corpus (BNC) and the CHILDES Project, which is a corpus of children’s spoken and written English (Biber, Conrad and Reppen 282).
1.2.2 Sample (Samples) vs. Fulltext Corpora
Another way of specifying the type of corpora depends on whether the corpus consists of full texts or only of samples of them. This distinction became necessary as the size of corpora increased thanks to advances in computing, which enabled whole texts to be included in corpora (Sinclair, Corpus, Concordance, Collocation 23).
If corpora are composed of many short extracts of texts amounting altogether to around a million words, they can be called sample corpora (Sinclair, Corpus, Concordance, Collocation 23).
If, on the other hand, corpora are made up of unabridged texts, they are fulltext corpora (Sinclair, Corpus, Concordance, Collocation 23).
The third way of defining the type of corpora concerns whether they aim “at the standard language as a whole or . . . at a special phenomenon” (Teubert and Čermáková 68).
If corpora focus on language for general purposes and, at the same time, consist of texts representative of the given language as a whole, they can be labeled general reference corpora. They usually embrace a wide range of texts: both those meant to be spoken, such as transcriptions of radio and television broadcasts, and those meant to be read, such as broadsheets, tabloids, periodicals, fiction and nonfiction, brochures, instruction leaflets, “and sundry printed materials we come across, from tax forms and telephone bills to tourist brochures and theater programs” (Teubert and Čermáková 66).
Within this category, Federico Zanettin, Silvia Bernardini, and Dominic Stewart further distinguish between reference corpora, which are described as very large, and general corpora, which are considered to be only large (5). Nevertheless, both have the same function: they are “used to make general observations about that particular language” (Bowker and Pearson 12).
By contrast, corpora that concentrate on a particular aspect of a language are special-purpose corpora, as Bowker and Pearson maintain. They add that this type of corpus “could be restricted to the LSP [language for special purpose] of a particular subject field, to a specific text type, to a particular language variety or to the language used by members of a certain demographic group (e.g. teenagers)” (Bowker and Pearson 11). As for size, Zanettin, Bernardini, and Stewart classify corpora of this type as small (5). They are also sometimes called special corpora or LSP corpora.
What is valued in these two types of corpora, among other things, is that comparing and contrasting them brings to light the unique features that distinguish a particular special-purpose language from the general language (Bowker and Pearson 11).
1.2.4 Synchronic vs. Diachronic Corpora
The fourth way of defining the type of corpora is connected to the length of the time period they cover.
Bowker and Pearson state that synchronic corpora “present a snapshot of language use during a limited time frame” (11). Thus, they offer a picture of language only in a particular short period of time. The Brown Corpus is an instance of this type of corpus (Biber, Conrad and Reppen 283).
In contrast with the previous type, diachronic corpora are able to provide researchers with information about the development of a language, as they comprise texts from various periods of time (Bowker and Pearson 11). Examples are the Helsinki Corpus of English Language and the Zurich Corpus of English Newspapers (Biber, Conrad and Reppen 283).
1.2.5 Monolingual Corpora vs. Bi- and Multilingual Corpora
The fifth way of specifying the type of corpora is based on the number of languages they contain.
Bowker and Pearson put it simply: “a monolingual corpus is one that contains texts in a single language” (11). Typical examples are the Czech National Corpus and the Corpus of Contemporary American English (COCA).
Multilingual corpora, by contrast, comprise texts in two or more languages (Bowker and Pearson 11). According to Tony McEnery and Paul Baker, corpora of this type have developed greatly since 1994 (Zanettin, Bernardini and Stewart 89). Multilingual corpora can be further divided into two subtypes, namely parallel and comparable corpora, which are briefly addressed in the following paragraphs and elaborated on in the section concerning corpora and the field of translation.
1.2.5.1 Parallel Corpora
Teubert and Čermáková describe parallel corpora as collections “of original texts in one language and their translations into another (or several other languages)” (72). This is, however, rather imprecise, because the parallel texts are not always translations of each other; they can both be translations of a third text, which need not be included in the corpus. Bowker and Pearson provide another, possibly slightly better, definition: “Parallel corpora contain texts and their translations in one or more languages” (93).
In addition, the user of this type of corpus sometimes does not know which language is the original one or whether the original is included at all. This is often the case in multilingual environments in which several official languages are used (Bowker and Pearson 94). Examples are documents issued by the European Commission and texts published by the Vatican (Teubert and Čermáková 72).
Parallel corpora are also known as translation corpora because of their connection to the work of translators (Teubert and Čermáková 72).
It is also essential to mention that, in well-built parallel corpora, the texts in the different languages are aligned. Teubert and Čermáková explain that alignment is a process in which links are created between the segments in a source language and their parallel counterparts in a target language (or between two target languages, depending on whether the original version of the text is incorporated in the corpus). The process can be done either manually, which is very labor-intensive because the unit of alignment is usually a sentence, or by a computer program. However, even if the texts are aligned by a program, they need to be edited in the final stage, which makes the work time-consuming as well. This problem may be the reason why “there are still only a few parallel corpora of considerable size” (Teubert and Čermáková 75).
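The idea of sentence-level alignment can be illustrated with a minimal Python sketch. It is a deliberately naive illustration, not the method of any particular alignment program: it assumes that the translation keeps the same number of sentences in the same order, and it links them one to one by position. The function names, the sentence-splitting rule and the sample recipe sentences are assumptions made for this example only.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def align_one_to_one(source_text, target_text):
    """Link the n-th source sentence to the n-th target sentence.

    Assumption of this sketch: both texts contain the same number of
    sentences in the same order. If they do not, the function refuses
    to guess and raises an error, which is the point at which manual
    editing would be needed.
    """
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    if len(source) != len(target):
        raise ValueError("sentence counts differ; manual alignment needed")
    return list(zip(source, target))


english = "Peel the apples. Slice them thinly."
czech = "Oloupejte jablka. Nakrájejte je na tenké plátky."

for source_segment, target_segment in align_one_to_one(english, czech):
    print(source_segment, "<->", target_segment)
```

Whenever a translator merges or splits sentences, the positional assumption breaks down, which illustrates why automatically aligned texts still have to be checked and edited by hand.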
1.2.5.2 Comparable Corpora
Bowker and Pearson state that multilingual comparable corpora can be described as “sets of texts in different languages that are not translations of each other” (93). To put it simply, this type of corpus consists of two or more monolingual corpora that share some features according to which their texts were selected. As this is a very short explanation, further information on the topic is provided below.
Subject matter is among the most frequent shared features; others may be text type, the period in which the texts were written, degree of technicality, and many more (Bowker and Pearson 93). On the other hand, only one feature distinguishes these sets of texts from each other, “and that is the language in which they are written” (93).
What can be seen as a shortcoming of multilingual comparable corpora is the fact that the texts in the different languages are not linked (or rather aligned) to each other as those of parallel corpora are (Bowker and Pearson 93). This follows from the fact that they are entirely different texts. However, these corpora provide the researcher with different benefits, which are described in more detail in the section dealing with the use of corpora in translation.
It is essential to add that comparable corpora need not be bilingual; they can also be monolingual, although the latter type is less frequent. Marco Baroni and Silvia Bernardini define monolingual comparable corpora as “collections of original and translated texts in the same language assembled according to comparability criteria” (1). These criteria are the same as the shared features listed above. In this type, it is not two distinct languages but two variants of one language that are compared.
The corpus compiled for the purposes of this thesis contains both original and translated Czech recipes; therefore, it can be seen as a monolingual comparable corpus. On the other hand, it also includes a set of recipes written in English and, in that respect, is a bilingual comparable corpus.
To conclude, a few more types of corpora should at least be named: learner, closed, and open corpora are described in Working With Specialized Language (Bowker and Pearson 11-12); monitor corpora are among those mentioned in Corpus Linguistics (Teubert and Čermáková 71); and, last but not least, reciprocal, monodirectional, bidirectional, and disposable corpora, the last of which are also known as virtual, ad hoc or do-it-yourself web corpora, are among those referenced in Corpora in Translator Education (Zanettin, Bernardini, and Stewart 5).
1.3 Usage of Corpora
Bowker and Pearson also point out that, since corpora contain authentic examples of language, it is only logical that they have been applied in a broad range of disciplines and have helped to investigate many linguistic issues. One of those disciplines is lexicography, in which corpora make it easier to identify neologisms and new meanings of already known words. Corpora are also exploited in sociolinguistics in order to reveal the differences between men’s and women’s speech. Besides, language learners work with corpora to improve their knowledge. In addition, diachronic corpora are used by historical linguists to explore the development of various languages. Corpora are associated with the field of computational linguistics as well, because corpus-based resources are used there. Furthermore, corpora, namely LSP corpora, are applied in various disciplines such as terminology, technical writing and, last but not least, translation (11). As their use in translation is an interesting topic which is also connected to the theme of this thesis, it is described in more detail in the following section.
1.3.1 Corpora in Translation
Corpora are closely connected to translation, in particular, to the translation of specialized languages.
To obtain the information they need, translators often search corpora with various corpus analysis tools, most of which include a concordancer and a word lister. The former, according to Bowker and Pearson, “allows the user to see all the occurrences of a particular word in its immediate context” and usually shows the result in a format known as KWIC, which stands for key word in context (13). The latter “basically allows . . . to perform some simple statistical analyses of your corpus”, such as counting the number of tokens (running words), the number of types (distinct word forms), and their frequencies (13).
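What these two tools do can be illustrated with a minimal Python sketch. It is not a reconstruction of any tool discussed by Bowker and Pearson; the function names `tokenize`, `kwic` and `word_list`, the simple tokenization rule and the sample sentence are assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simple rule)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3):
    """Concordancer: return (left context, keyword, right context) per hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


def word_list(tokens):
    """Word lister: return token count, type count and a frequency table."""
    frequencies = Counter(tokens)
    return len(tokens), len(frequencies), frequencies


sample = "Mix the flour with the sugar. Add the eggs and mix again."
tokens = tokenize(sample)

# Print the concordance with the keyword lined up in the middle column.
for left, key, right in kwic(tokens, "mix"):
    print(f"{left:>20} | {key} | {right}")

n_tokens, n_types, frequencies = word_list(tokens)
print(n_tokens, n_types, frequencies.most_common(3))
```

The sketch treats every distinct lowercase word form as one type; a real word lister may offer further options, such as lemmatization or stop lists, which are left out here.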
Both bilingual and monolingual corpora are useful aids for translators. Bilingual parallel corpora, however, are probably the most favored; examples of their use are therefore described below.
Translators use parallel corpora as bilingual dictionaries. Translation equivalents of the terms they are interested in can easily be derived from a parallel corpus with the help of a bilingual concordancer (Bowker and Pearson 194). This corpus-based approach is advantageous, as Teubert and Čermáková mention, because, unlike those found in bilingual dictionaries, translation equivalents extracted from the corpus are “embedded in their context, which make them unambiguous” (73).
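A bilingual concordance search can be sketched in a few lines of Python. The example assumes that the parallel corpus is already available as a list of aligned sentence pairs; the function name `bilingual_concordance` and the English and Czech sample segments are hypothetical and serve only as an illustration.

```python
import re


def bilingual_concordance(aligned_pairs, term):
    """Return every aligned pair whose source segment contains the term.

    The term is matched as a whole word, ignoring case, so a search for
    "egg" does not return segments that contain only "eggs".
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(source, target) for source, target in aligned_pairs
            if pattern.search(source)]


# Hypothetical aligned segments (English source, Czech target).
pairs = [
    ("Whisk the egg whites.", "Ušlehejte bílky."),
    ("Add the sugar.", "Přidejte cukr."),
    ("Fold in the egg yolks.", "Vmíchejte žloutky."),
]

for source, target in bilingual_concordance(pairs, "egg"):
    print(source, "|", target)
```

The search returns whole aligned segments rather than isolated words, so each candidate equivalent is shown together with its context; picking out the equivalent itself is left to the translator.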
Furthermore, when viewing particular translation equivalents, translators can simultaneously observe which words these equivalents collocate with and how they behave in sentences. This helps them to create translations that sound natural in a particular LSP (Bowker and Pearson 194).
Much has been said above about the use of bilingual parallel corpora in translation; the use of bilingual comparable corpora therefore deserves at least a general comment as well. According to Zanettin, Bernardini and Stewart, “[comparable corpora] provide translators with a better understanding not only of target but also of source texts, allowing them to compare terminology, phraseology and textual conventions across languages and cultures” (6).
The following two paragraphs relate to both mono- and multilingual corpora.
With the help of LSP corpora, translators investigate the styles of specialized languages, which they need to know because “it is not enough”, as Bowker and Pearson note, “to use accurate terminology and correct grammatical structure” if one aims to create a translation that would be accepted by LSP speakers (197).
Besides, translators also use “term identification tools” (also known as term recognition tools) to exploit both types of corpora in order to extract terminology and possibly also to build their own glossaries of terms. These tools either “operate as stand alone products” or are integrated into various packages together with other tools such as translation memories and terminology management systems (Bowker and Pearson 165-66).
Finally, concerning the above-mentioned translation memories, it should be noted that many companies use translation memory software. This software makes it possible to build one’s own aligned parallel corpora. Those who use it translate texts segment by segment. Bowker and Pearson describe the way it functions in these words: “Each time a segment has been translated, a translation unit (i.e. source segment and target segment) is created and stored in the translation memory” (101). The benefit is that already translated parts of texts can be reused in the future; the software therefore simplifies and speeds up translators’ work and, at the same time, helps to guarantee terminological consistency.
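The storing and reuse of translation units can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how any commercial translation memory product is implemented: the class name `TranslationMemory`, its methods and the sample segments are hypothetical, and the sketch retrieves exact matches only (after ignoring differences in case and spacing).

```python
class TranslationMemory:
    """A toy translation memory storing source/target translation units."""

    def __init__(self):
        # Maps a normalized source segment to its stored target segment.
        self._units = {}

    @staticmethod
    def _normalize(segment):
        """Ignore case and extra whitespace when comparing segments."""
        return " ".join(segment.lower().split())

    def store(self, source_segment, target_segment):
        """Create a translation unit once a segment has been translated."""
        self._units[self._normalize(source_segment)] = target_segment

    def lookup(self, source_segment):
        """Return the stored translation of an identical segment, or None."""
        return self._units.get(self._normalize(source_segment))


memory = TranslationMemory()
memory.store("Preheat the oven.", "Předehřejte troubu.")

# The same segment appears again in a later text and can be reused.
print(memory.lookup("Preheat the oven."))
# A segment that has not been translated yet yields no match.
print(memory.lookup("Grease the tin."))
```

Because a repeated segment always receives the stored translation, the same source wording is rendered the same way every time, which is how such software supports terminological consistency.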