
Announcements

Very Important

1. The draft MUST be close to the final copy.
2. Before you submit it, make sure you take care of the following:
   - spelling
   - grammar
   - no exaggerated judgments
3. TRY to make the draft as close as possible to 13-15 pages.
4. FOLLOW the formatting guidelines in the booklet as much as you can. This will save you time.
5. BIBLIOGRAPHY: Online references: put the URL (cut and paste) of the reference and the DATE you accessed this page.
6. COVER page:

   Ain Shams University
   Faculty of Alsun
   Department of English
   Linguistics

   Book Review of
   "TITLE OF THE BOOK 'NAME of The AUTHOR'"

   Your Name
   Fourth Year

   Under the supervision of
   Dr. Khaled Elghamry

   Academic Year 2007-2008

NO COLORS. NO FANCY FONTS: TIMES NEW ROMAN or GEORGIA is fine.

GOOD LUCK

Tuesday, April 15, 2008

COMMENTS: Ahmed Faried

The book focuses particularly on "balanced corpora", which are created to support "descriptive linguistic analysis", so the intended audience is somewhat restricted to descriptive linguists. Corpus linguists with other interests should not expect much wealth of information about their own subjects; linguists with computational interests, for example, will find that the discussion of tagging and parsing in Chapter 4 is treated in "less detail" than they would expect.

The study questions after each chapter suit the purpose of the book as an introduction to the subject and also make it suitable for in-class courses. Although the questions are not many in number, ranging from four to seven per chapter, they do help the reader focus on the important information of each chapter, and they are especially helpful in this respect because every chapter contains extensive information and numerous thoroughly explained examples. One simple yet valuable addition the author chooses not to include, even in parentheses or in a footnote, is the number of the section to which each question corresponds. This would allow a student to return easily to the section corresponding to a certain question and revise whichever point he has not fully grasped.

It can also be noticed that the book makes no use of illustrations such as flowcharts, which could have enhanced the presentation of its topics and helped the reader grasp the logical steps of the many multi-step processes discussed in the book. Although the author does make use of tables, these tables merely state the results of an analysis or the frequency counts of words or grammatical constructions. A much better use of tables would have been to compare every two or more alternatives introduced throughout the book. For instance, the lengthy discussion of the features of the different taggers and parsers available for annotating a corpus in Chapter 4 could have been summarized in a comparison table, which would not only facilitate understanding but also provide a handy and precise reference that could be consulted whenever needed, without having to read through pages of explanation again and again. A related issue is that the author does not include a list of figures and tables, which could also have helped in locating a specific table related to a specific case study.

The author chooses to plunge the reader into the vast sea of corpus linguistics as early as the preface, which might leave some readers bored or frustrated. After stating that the Brown Corpus is considered the "first computer corpus", he compares it, as a "balanced" corpus containing different genres of written English, with the Penn Treebank, an "unbalanced" corpus that contains a "heterogeneous" collection of texts and is essentially larger (4.9 million words). In connection with the two previous definitions of a corpus, he further states that the two corpora differ in composition and use.
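To make the notion of a balanced corpus concrete, a small sketch follows: it lists the genre categories of the Brown Corpus and their sizes using the NLTK toolkit. The choice of toolkit is my own assumption; the book itself does not prescribe any particular software.

    # A minimal sketch, assuming the NLTK toolkit and its packaged copy of the
    # Brown Corpus (run nltk.download("brown") once beforehand).
    from nltk.corpus import brown

    # A balanced corpus samples many genres of written English.
    print(brown.categories())
    # e.g. ['adventure', 'belles_lettres', 'editorial', 'fiction', ...]

    # Word counts per genre show how the roughly one-million-word sample is
    # spread across text types.
    for genre in brown.categories():
        print(genre, len(brown.words(categories=genre)))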
The balanced corpus is of most value to linguistic researchers conducting "linguistic description and analysis", while the unbalanced one is primarily created for research in Natural Language Processing (NLP), which demands a large corpus, and is therefore of value to computational linguists. The two kinds of linguist nevertheless cherish one aim, which is to base their results on "real" rather than contrived data, and hence the author concludes that corpora must be "carefully created" to ensure that the results of any analysis conducted on them will be "valid".

The author also uses a highly formal style throughout the book. This strict style is evident from the first chapter, which begins with a serious discussion of the conflict between generative grammarians and corpus linguists. Throughout this opening discussion, the author introduces many specialized terms and subjects. He presents the Brown Corpus as the first corpus, created in the 1960s, a time when generative grammar was dominant in linguistics. He then proceeds to explain the difference between language competence and performance and how corpus linguistics concentrates mainly on performance, or "the actual usage of language" ***(get a formal definition), while generative grammar is more concerned with language competence ***(get a formal definition). The discussion then moves on to Chomsky's three types of adequacy. For Chomsky, the highest level of adequacy is the explanatory one, and he therefore sees corpus linguistics as limited to observational or descriptive adequacy only. Explanatory adequacy can help in deducing the rules of Universal Grammar, which is one of the main aims of generative grammarians. Corpus linguists, on the other hand, do not approve of such a "highly abstract and decontextualized" study of language. Evidently, dealing with such extensive theoretical concepts from the very beginning of the book may leave an undesirable first impression on readers, who may feel overwhelmed. This information would have been better either delayed until later in the book or dealt with much more briefly.

In Chapter 2, the author discusses the planning of the corpus. He shows how the "planned uses" of the corpus should be considered first, because different uses demand different plans. Another point he stresses is that planning should be "a cyclical process" that demands "constant re-evaluation" during the compilation of the corpus. The author proceeds to explain the "methodological assumptions" that should be considered in planning the corpus. To be practical, he takes the BNC (British National Corpus) as an example, which seems, to some extent, an inappropriate choice. Firstly, the BNC is a huge corpus totalling about 100 million words and is considered "one of the largest corpora", so it hardly seems an appropriate case study for beginners. Secondly, a beginner in corpus linguistics is unlikely to encounter problems like those encountered during the BNC compilation. The different criteria involved in planning a corpus are then explored in detail, starting with the overall length of the corpus, which the author advises should be determined by two factors: the resources available for the project and the kinds of studies the corpus would permit. The other criteria are the types of genres to include, the length of individual text samples, the number of texts and the range of speakers or writers, the time frame for texts to be included, the inclusion of native versus non-native speakers, and a group of further criteria gathered under "sociolinguistic variables", including gender balance, age, level of education and others. In this respect, the section explaining "gender balance" seems relatively short, at just under one page. At its end, the author concludes that the "variables" affecting gender balance raise many difficulties that cannot be addressed in any single way, and he therefore leaves the compiler to his own judgment to "deal with them as much as is possible". He also does not mention real examples from well-known corpora to illustrate how such problems were dealt with, although he almost always provides a real example for every issue under the other criteria.

Collecting and computerizing data is the next step in corpus compilation and, as the author considers it, the first step in the actual creation of a corpus. This step is described in detail in Chapter 3. In collecting speech and writing, the author informs the reader that some difficulties may arise, especially in obtaining permissions for copyrighted materials. These difficulties may lead to slight changes in the original design of the corpus, but such changes must be made carefully so as not to "compromise the integrity" of the corpus. Methods of collecting and computerizing speech are discussed first. In recording multi-party dialogues, the author lays emphasis on obtaining only "natural speech" and warns against the "unnatural speech" that may result from speakers' awareness that they are being recorded. Since surreptitious recording is also prohibited, he suggests that the participants be offered a written description of the project before the actual recording takes place. It may also help to record a conversation that is as lengthy as possible, so that the transcriber can choose the most natural as well as the most "coherent" and "unified" parts to include in the corpus. Some purely technical issues are discussed as well, including types of tape recorders, microphones and so on. Secondly, concerning the collection of written texts, the problem is usually that of obtaining permission from the authors of those texts; this problem resulted in the use of only 25 percent of the material originally planned for inclusion in the ICE-USA corpus. The author suggests making use of online texts, since they are easily obtained and computerized. Keeping records of the speech and texts gathered, both to help organize the material and to support future uses of the corpus, is another important issue the author sheds light on. Such information includes when and where a recording took place, who the participants were, and a wealth of ethnographic information about them.

The next step is computerizing the data, which the author considers a painstaking process, especially where transcribing speech is concerned. He also discusses the advantages and disadvantages of using ASCII and Unicode when saving files, and the methods of organizing the individual texts into categories. The most important part discussed is how to insert "structural markup" into a text, such as tags indicating speech overlap and speaker identification. Transcribing speech is considered a difficult and tedious task, in addition to being just an "artificial process".
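To picture the structural markup just described, here is a minimal sketch that builds a small, TEI-flavoured utterance element with a speaker attribute and an overlap tag, and saves it as UTF-8. The element and attribute names are illustrative assumptions rather than the book's own scheme.

    # A minimal sketch of structural markup for transcribed speech; the element
    # and attribute names here are illustrative assumptions, not the book's scheme.
    import xml.etree.ElementTree as ET

    text = ET.Element("text", id="sample-001")    # one spoken text in the corpus
    u1 = ET.SubElement(text, "u", who="A")        # speaker identification tag
    u1.text = "well I was gonna say "
    overlap = ET.SubElement(u1, "overlap")        # marks speech overlapping with B
    overlap.text = "that it depends"
    u2 = ET.SubElement(text, "u", who="B")
    u2.text = "yeah exactly"

    # Writing the file as UTF-8 (rather than ASCII) keeps non-ASCII characters intact.
    ET.ElementTree(text).write("sample.xml", encoding="utf-8", xml_declaration=True)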
The author informs the reader that the compiler must decide beforehand how much of a spoken text he wishes to include in its transcription, an issue on which corpus compilers have never reached agreement. The author makes a good point when he mentions two examples from the two extremes: the Corpus of Spoken Professional English, which contains minimal information about the speech in its transcription, and the Santa Barbara Corpus of Spoken American English, which contains an exact transcription of the spoken conversations, including hesitations, repetitions and partially uttered words, as well as annotation marking various features of intonation such as "tone unit boundaries, pauses and pitch contours". The latter approach has the advantage of allowing a broad range of studies to be conducted on the corpus with much confidence in the authenticity of the data. However, the author concludes that whether the transcription of speech contains minimal or detailed description, it is impossible to reflect all of the "subtleties" of speech. This is due to the interference of many other "paralanguage" factors that contribute to the interpretation of a conversation, from "gestures and facial expression" to the attitudes of the participants towards each other.

The practice of "representing speech in standard orthographic form" is a very important part, in which many words and symbols are introduced to the reader. These symbols represent various speech features such as vocalized pauses, speech overlap and linked expressions, that is, two words pronounced as one word, for example "gotta" for "got to" and "hafta" for "have to". There is no convention for transcribing these expressions as either one word or two; the compiler has to decide which form to use throughout the corpus according to the type of analysis the corpus would permit. The part on computerizing written texts is very brief and only tackles the problem of computerizing texts from earlier periods, where more than one version may sometimes exist. Handwritten texts, as well as other types of texts, are not mentioned, because the author is clearly more concerned with the difficulties of computerizing speech than of computerizing written texts throughout this chapter.

The fourth chapter is very interesting and is concerned with the annotation of the corpus. Three kinds of markup are discussed: structural markup, tagging and parsing. The chapter begins with a brief explanation of the various markup systems, from the older SGML-conformant systems and the TEI standards to the more recent XML markup, in addition to a system of markup for describing intonation in speech. The difference between rule-based and probabilistic taggers is also explained. The author opens the chapter with the hypothesis that annotation is necessary for a corpus to be "fully useful to potential users"; however, he does not go on to support this hypothesis, and his discussion seems instead to suggest the opposite. He concludes that although recently developed taggers can achieve an accuracy of more than 95 percent, the remaining inaccuracies can be "more extensive" than one might think, so every automatically tagged corpus must be subjected to human "post-editing", which requires labor-intensive work. Tagsets (the set of tags that a tagger can insert in a text *get a formal definition if possible) can vary significantly, and different tagsets represent more or less "differing conceptions of English grammar".
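A short sketch makes the tagging discussion more tangible. It runs NLTK's default part-of-speech tagger, which uses a Penn Treebank style tagset, over a single sentence; the toolkit and tagset are my own assumptions, since the book surveys several taggers rather than recommending one.

    # A minimal tagging sketch, assuming NLTK (run nltk.download("punkt") and
    # nltk.download("averaged_perceptron_tagger") once beforehand).
    import nltk

    sentence = "The child broke his arm and his mother called a doctor."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)   # Penn Treebank style tags, e.g. ('child', 'NN')
    print(tagged)

    # Even at 95 percent accuracy, a one-million-word corpus would still contain
    # about 50,000 wrongly tagged words, which is why human post-editing matters.
    print(int(1_000_000 * 0.05), "errors expected at 95% accuracy")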
Parsers have even lower accuracy than taggers, ranging from "70 to 80 percent at best", and thus require even more "human intervention". A parser can yield many errors, especially when it encounters constructions like coordination, which is frequent in both writing and speech. The sentence "the child broke his arm and his wrist and his mother called a doctor" can pose many challenges for an automatic parser, because "and" is used here both to join two phrases ("his arm", "his wrist") and to join the first clause to the second ("his mother…"). Furthermore, the grammar underlying each parser reflects "particular conceptions of grammar", so different parsers will yield annotations that differ in their details. Since the corpus compiler cannot fully predict the types of analysis that will be conducted on the corpus, since there are largely no standard parsers or taggers that can account for every possible linguistic analysis, and since relatively few well-known corpora are fully parsed, the author does not make a strong argument for his view that annotation is necessary for every corpus. This is especially true once we realize that automatic annotation is itself a kind of analysis, as each tool represents "different conceptions" of grammar that may obscure some features or mislead a linguist who uses the corpus in his research. Even the case study of analyzing a corpus that the author presents throughout Chapter 5 makes use of only one parsed sub-corpus, alongside six other sub-corpora that are not parsed at all.

The fifth chapter takes the reader from the perspective of a corpus compiler to that of a corpus user, or corpus linguist. This perspective is comprehensively explored through a case study investigating the occurrence of pseudo-titles in the press in the ICE Corpus. The case-study approach is useful because it makes the phases of conducting an analysis as related and coherent as possible. The emphasis on conducting an analysis that is both "quantitative and qualitative", as well as on the necessity of drawing significant conclusions from the analysis, is another good point; it echoes, in practice, what the author stated in the first chapter, namely that corpus linguistics can even exceed "descriptive adequacy" and achieve the "explanatory" kind. This chapter is extensive and guides the researcher step by step, not only through the process of conducting an analysis, but also from framing the research question and clearly defining the concepts to be used, to evaluating the different corpora available for their usefulness in the analysis. The author also explores and compares options, guides the researcher in making decisions, and links specific issues to more general questions whenever possible. For instance, in the section on "subjecting the results of a corpus study to statistical analysis", he explains the different approaches of various linguists towards statistical tests. He covers both the simple frequency count performed by many linguists and more advanced statistical tests, such as the chi-square test, which help to determine whether similarities or differences exist in a corpus and whether these similarities or differences are "statistically significant or not".
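As an illustration of the statistical step the author describes, the sketch below applies a chi-square test to a small, invented contingency table of pseudo-title counts in two hypothetical sub-corpora; the figures are made up purely for illustration, and SciPy is my own choice of library.

    # A minimal sketch of a chi-square test on corpus frequency counts, assuming
    # SciPy; the counts below are invented for illustration only.
    from scipy.stats import chi2_contingency

    # Rows: two hypothetical press sub-corpora; columns: noun phrases with and
    # without a pseudo-title before the proper name.
    observed = [
        [120, 880],   # sub-corpus A: 120 pseudo-titles in 1,000 relevant contexts
        [60, 940],    # sub-corpus B:  60 pseudo-titles in 1,000 relevant contexts
    ]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")

    # A p-value below 0.05 would suggest that the difference between the two
    # sub-corpora is statistically significant rather than due to chance.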
The chapter is really valuable to any beginner in corpus linguistics who wishes to conduct a linguistic analysis of his own.
EXCELLENT: no major comments, just take care of the capitalization remark.

2 comments:

Anonymous said...

Hello Dr. Khaled,

I have some questions please:
1. Do we have to write a separate introduction and conclusion? Will the introduction contain any information we have already written, or do we have to write the introduction without making use of the already written parts?
2. What is the expected length of the introduction and the conclusion?
3. I do not know how to insert the short reference after a quotation or an opinion. I mean, if the source title is long, how can I decide which parts to omit to make it short?

Thank you Dr.

Khaled Elghamry said...

Introduction: one short paragraph summarizing what you will do in the whole paper.

Conclusion: One longer paragraph summarizing what you've found in the review.

Reference: author's family name, date of the book, pages
