As stated in the introductory section, conducting my research required compiling a corpus. Because this thesis deals with a specialist language, I opted for a comparable corpus. Considered from other points of view, it also combines the qualities of special-purpose, diachronic, written, and sample corpora. In my opinion, it is suitable for analyzing how the translation of English recipes has developed and how it has affected the traditional form of Czech recipes.
Originally, a parallel corpus, which is very often used for analyzing translations, was also to be constructed. However, because my research aims to investigate as many types of sources as possible, and because it is rather difficult to obtain both the original and the translated versions of cookbooks and periodicals, the comparable corpus was preferred. To compensate for the absence of a parallel corpus and to enrich the research, some recipes were aligned at the sentence level and used for a minor one-to-one analysis. These recipes are taken from two books obtained in both source-language and target-language versions, namely Jamie’s Kitchen and The Desperate Housewives Cookbook.2
When compiling the corpus, I paid attention to the criteria listed in Working with Specialized Language, which “make corpora different from other types of text collections” (Bowker and Pearson 9). This book served as my guide for creating the corpus.
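Sentence-level alignment of a source recipe and its translation can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the procedure used for this corpus: the function names and the example sentences are hypothetical, sentences are split naively after ".", "!" or "?", and the n-th source sentence is simply paired with the n-th target sentence.

```python
import re
from itertools import zip_longest

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
# Assumption: abbreviations such as "tbsp." would be split wrongly.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the sentences of a text, split at the naive boundary."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def align(source_text, target_text):
    """Pair the n-th source sentence with the n-th target sentence.

    If one side has more sentences, the missing partner is None,
    which marks the place where a manual check is needed.
    """
    return list(zip_longest(split_sentences(source_text),
                            split_sentences(target_text)))


# Invented example sentences, not taken from the corpus.
for pair in align("Preheat the oven. Chop the onion.",
                  "Předehřejte troubu. Nakrájejte cibuli."):
    print(pair)
```

A translator may merge or split sentences, so a one-to-one pairing of this kind would in practice have to be corrected by hand wherever a None appears or the pairs drift apart.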
2.1 Steps to a Functional Corpus
Three steps are necessary to produce a good corpus. The first one focuses on the selection of materials, the second one is concerned with the transformation of materials into an electronic form, and the third one touches upon the Corpus Encoding Standard. The way I proceeded through these steps is described in the following paragraphs.
2.1.1 Selection of Materials
The first step requires selecting suitable and representative materials, which should be ensured by a careful definition of external criteria (Bowker and Pearson 10). According to John Sinclair, five things, among others, have to be considered: language, type of source materials, location of materials, publishing date, and medium3 (“How to build a corpus” 5). Two further points from Bowker and Pearson are added, namely the validity of the sources and the number of materials a researcher is going to collect (45-54). It was therefore necessary to define the criteria with respect to these seven points. They are briefly addressed below.
As for the first criterion – the language of the materials – English and Czech recipes were chosen to create the corpus. More precisely, the corpus consists of three types of recipes: English recipes, original Czech recipes, and translated recipes in which Czech is the target language.
As to the second criterion – the type of materials – recipes from three distinct types of authentic materials (books, periodicals such as magazines and newspapers, and the Internet) were selected. The reasons for this choice are given in the following five paragraphs.
Books are used as a cornerstone because they contain the most standardized form of recipes, possibly because many publishing companies often “give a set of norms to their translators, where they determine their requirements” (Havlásková 41).
Periodicals, unless they specialize in cooking, are less concerned with the established form of recipes, because they cover various other topics besides recipes. Moreover, it is not unusual for original Czech and translated recipes to be published in the same issue of a magazine; the former are therefore likely to be influenced by the new trends of the latter.
Last but not least, the Internet is one of the most variable media. Because it is interactive and its content can easily be changed, it quickly takes up novelties, so new trends in recipe writing are easily observable there. To sum up: “Indeed, these kinds of materials [periodicals and online sources] best reflect the development of the usage because they are less ‘regulated’ than published books, which pass through necessary corrections and they adhere to norms established by a company” (Havlásková 24).
Several types of source materials were incorporated in my corpus because each contributes something different to the research: books show the most standardized form of recipes, whereas periodicals and online sources reflect newer trends.
In addition, I am aware that recipes can be found not only in the materials presented above but also elsewhere, for instance in advertisements and commercials and on food products. However, given the scope of a bachelor thesis, such materials were excluded from my corpus. I nevertheless consider the corpus representative, because the chosen types of materials cover the variations found in this field.
One more point needs to be mentioned about the types of texts used. I decided that the corpus would comprise both texts written by experts for non-experts (the major part) and texts written by experts for experts (the minor part), such as Řeznická Kuchařka by Josef Dušátko.
Regarding the third criterion – the location of texts – the analyzed English and translated materials originate in the United States of America and Great Britain, with a slight prevalence of the former. The Czech-language materials come from the Czech Republic, with one exception, which was published in Greece.
With respect to the fourth criterion – the publishing date – the title of this thesis suggests that only Czech materials published during the last twenty years were chosen. More specifically, the materials written between 1990 and 1994 cover the situation directly after the Velvet Revolution, and those written between 2006 and 2009 show the latest developments. The translated materials come from the same two periods. Finally, the chosen English materials come from the last forty years, as the form of the analyzed features of English recipes has not changed profoundly.
Concerning the fifth criterion – the medium – a written corpus was built, because I decided to exclude transcriptions of speech. The only source that could be labeled controversial is Jamie Oliver’s book Jamie’s Kitchen, because of its connection to his TV show. However, the recipes are only based on the show. Besides, it is a very popular book that has been influential in changing the form of Czech recipes; it was therefore included in the corpus.
As for the sixth criterion – the validity of the texts – materials written by renowned authors or published by reputable publishers were selected. As for the Internet, only websites that were used by Havlásková in her thesis (for example Vareni.cz), honored by the Food Blog Award (such as Simply Recipes), or recommended by my supervisor, PhDr. Jarmila Fictumová, were employed.
As to the seventh and last criterion – the size of the corpus – Bowker and Pearson maintain that “there are no hard and fast rules to determine the ideal size of a corpus,” and that one should take into consideration “factors such as the needs of a project, the availability of data and the amount of time” (47). My aim was therefore to gather as many materials (books and periodicals) as possible, with a minimum of ten4 and a maximum of twenty-five sources for each type of material and each type of recipe (the types of recipes are explained above in the paragraph dealing with languages). A wide range was set because some materials, such as English cookbooks, are harder to obtain than others, such as Czech periodicals. As to Internet sources, four websites were chosen. All in all, one hundred and twelve sources were analyzed.
Furthermore, following another of Bowker and Pearson’s recommendations, to choose “. . . a greater number of texts written by a range of different authors, rather than just a few texts written merely by one or two different authors,” I gathered books written or translated by various people (49). The only exception is Jana Frolíková, who appears in the author list of two magazines; however, she is not the only author in these periodicals.
At this point, I should acknowledge that a great many materials containing recipes are available nowadays, in particular in bookshops and on the Internet, and that my corpus may seem rather small by comparison. Nevertheless, since it would be impossible to analyze all the accessible materials, I decided that this amount would constitute a representative sample, even though I am aware that the outcome of my research is limited by the size of the corpus.
2.1.2 Converting Materials into an Electronic Form
As for the second step, it is first necessary to state where the corpus materials were obtained. The majority were found in libraries; in addition, my own resources were used. None of these materials was in electronic form. Since electronic form is one of the principal requirements of a well-built corpus, it was necessary to convert them. The process is described in the paragraphs below.
First, samples of the above-mentioned materials were scanned. As the form of recipes was uniform within almost every book or magazine, one or two pages from each source were scanned. In the exceptional cases in which the forms differed, enough pages were scanned to cover the varieties.
Second, optical character recognition (OCR) software was used to obtain machine-editable texts. Of the two formats the tool offers for saving the text – a plain text file (.txt) and rich text format (.rtf) – the second was chosen, although Sinclair recommends that one “save the text in a plain text format” (“How to build a corpus” 81). The choice of rich text format follows from the character of the research: structural formatting is lost in plain text, and it would then be impossible to search for information in specific areas of the texts. I consider this a substantial disadvantage, since one part of this work analyzes whether foodstuffs are described in the list of ingredients or in the directions of a recipe.
Approximately one fifth of the corpus materials were taken from Havlásková’s corpus. Although the recipes from this source had already been scanned, they were not machine-editable; thus, the OCR processing had to be repeated.
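Why structure matters for this kind of search can be shown with a small Python sketch. It is a hedged illustration only: it assumes that the parts of a recipe are introduced by recognizable heading lines, and the heading labels, function names, and sample recipe are all hypothetical rather than taken from the corpus or from any corpus analysis tool.

```python
def split_sections(lines, headings):
    """Group recipe lines under the most recent heading line.

    `headings` maps a lower-cased heading text to a section name.
    Lines that appear before the first heading are ignored.
    """
    sections = {}
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped.lower() in headings:
            current = headings[stripped.lower()]
            sections.setdefault(current, [])
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections


def sections_mentioning(sections, term):
    """Return the names of the sections whose text contains `term`."""
    term = term.lower()
    return [name for name, body in sections.items()
            if any(term in line.lower() for line in body)]


# Invented sample recipe with hypothetical heading labels.
recipe = [
    "Ingredients",
    "2 ripe tomatoes, chopped",
    "1 onion",
    "",
    "Directions",
    "Fry the onion until golden.",
]
labels = {"ingredients": "ingredients", "directions": "directions"}
parts = split_sections(recipe, labels)
print(sections_mentioning(parts, "chopped"))
print(sections_mentioning(parts, "onion"))
```

Once the boundaries between the list of ingredients and the directions are lost, a query of this kind can no longer say in which part of the recipe a description such as "chopped" occurs.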
Finally, some recipes were taken from the Internet. These were already in electronic form but not in the required format. For such a situation, Bowker and Pearson offer two solutions: one “option is to simply select all the text [in my case a recipe] on screen and then copy and paste it into the word processor (e.g. Wordpad, Word, WordPerfect)” and “the other option is to use the ‘save as’ feature of the browser to save the text as a plain text file” (Bowker and Pearson 65). The first alternative was preferred, because with the second one the formatting would have been lost (the reasons are mentioned above). Nevertheless, online sources have one drawback: “texts on the Web are encoded using a special language called HyperText Markup Language (HTML),” and thus they “contain a number of tags (shown in angle brackets)” which may not be compatible with corpus analysis tools (63-65). Therefore, some time was also devoted to editing the texts.
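The tag problem can be illustrated with a minimal Python sketch built on the standard library's html.parser module. It is not the procedure used for this corpus, where the texts were edited after copying; the class name, the function name, the choice of tags treated as line breaks, and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect the text of an HTML fragment and drop all tags."""

    # Assumption: these tags start a new line in the output.
    BLOCK_TAGS = ("br", "p", "li", "div", "h1", "h2", "h3")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Trim each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def strip_html(markup):
    """Return the plain text contained in an HTML string."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Invented sample markup, not taken from any of the websites used.
sample = ("<h1>Pancakes</h1><ul><li>2 eggs</li><li>1 cup flour</li></ul>"
          "<p>Mix &amp; fry.</p>")
print(strip_html(sample))
```

Stripping the tags in this way yields clean text for a corpus analysis tool, but it also discards the markup that distinguishes the list of ingredients from the directions, which is the loss of formatting discussed above.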
In the third step, one should follow the Corpus Encoding Standard, which is “the widely accepted set of encoding standards for corpus based work” (Corpus Encoding Standard 1). It states that three categories are worth encoding in a corpus: documentation, primary data, and linguistic annotation (Bowker and Pearson 80).5 In this respect, my corpus can be considered imperfect, for it does not fulfill these requirements. Only titles, and sometimes also the names of the authors or translators, are included in the names of the documents. The corpus was not tagged, because tagging would have been time-consuming and, moreover, such annotation was not necessary for the analysis.
Unquestionably, I am aware of the shortcomings, such as the lack of annotation and the rather small size of the corpus, and I also realize that this is not a corpus in the full sense of the term. Nonetheless, for the purposes of my research, I still consider this “disposable corpus”6 sufficient.