Corpus Linguistics: An Empirical Approach for Studying a Natural Language
Dash, Niladri Sekhar, Language Forum
The study of language from empirical and intuitive perspectives is an old trend in the history of human civilisation. We have tried to explore the nature of language to understand how it plays an important role in cognition and communication. Over the centuries, linguistics has evolved as a unique discipline with the goal of establishing intricate conceptual linkages with other branches of human knowledge. At the dawn of the new millennium, it has taken a new turn to explore how theories about various aspects of human language are attested in the evidence of actual language use manifested in the multiple linguistic interactions of common people.
This new approach to language study adds an extra dimension to traditional linguistics. It is made possible by the introduction of computer technology, which helps linguistics to grow and evolve with a new supply of tools and techniques for accumulating examples of actual language use from various sectors of linguistic exercise and for analysing the database from new perspectives. The introduction of this new approach contributes to linguistics in general in two basic ways:
1. It enables us to verify whether age-old theories and assumptions about language and language use are worth pursuing, and
2. It provides ample scope for the direct use of linguistic evidence and information in linguistic and language technology work.
Thus the new method of language research and application works as an elixir for the revival and survival of an age-old discipline that has suffered from a lack of direction, diversion and application for many years. We understand that the invention and advancement of computer technology in the last century has added a new dimension to the discipline of linguistics. It has evolved as an important area of Artificial Intelligence, which aims at looking at language as an instrument of human communication and thinking directly linked with cognition.
Corpus linguistics, as an important area of Artificial Intelligence, follows statistical methods and techniques to collect and supply a large number of empirical language databases accumulated in a systematic manner from various domains of actual linguistic activity. It also provides sophisticated devices to analyse these databases and to extract the language data, examples and information required in applied linguistics, computational linguistics and artificial intelligence, both for understanding human language better and for application in various fields of human knowledge. There is also a strong cognitive and linguistic motivation to envisage how we communicate through language across time and space. Besides, there is a technical motivation to build intelligent systems that can interact linguistically with human beings in an efficient way.
To address these goals, computer scientists and linguists have united to develop systems for language understanding and generation, machine translation, information retrieval and extraction, speech understanding and generation, computer-assisted language teaching, etc., which contribute to the benefit and advancement of all mankind. To develop such systems, we need to understand empirically a natural language with all its regular and rare linguistic features and properties. Here, language corpora become indispensable, since they have the potential to exhibit most of the features of a natural language manifested within a large collection of empirical databases.
At present, many linguists across the world are engaged in generating language corpora, since their primary goal is to characterise, as far as possible, the features of a natural language within the frame of an adequately representative language database. The work of compiling language corpora needs to continue, since investigation of language databases yields information and insights, expressed in a natural language, that address various queries related to language use, language processing and language cognition. For instance, if we want to interpret a simple sentence of a language properly, we need prior information from linguistic analyses of such sentences carried out by experts to strengthen our knowledge base. Thus, the description and analysis of linguistic properties stored in corpora become significant in artificial intelligence, cognitive linguistics, applied linguistics and language technology. In fact, information obtained from corpora does not contribute to these fields alone. It also provides valuable insights into the description and understanding of a language, which is an important part of language description and learning.
2. WHAT IS A CORPUS?
The term corpus is derived from the Latin corpus, meaning "body." The Latin term, however, has two distinct descendants in modern English: (a) corpse (which came via Old French cors) and (b) corps (which came via modern French corps in the 18th century). The first form (i.e. corpse) entered English in the 13th century as cors, and during the 14th century it had its original Latin "p" reinserted. At first it meant simply "body," but by the end of the 14th century the sense "dead body" had become firmly established. The original Latin term corpus itself, on the other hand, was acquired by English in the 14th century (Ayto 1990: 138).
In the domain of Corpus Linguistics, the term "corpus" refers to "a large collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language" (Crystal 1995). That is, it refers to a large collection of written and spoken texts, available in machine-readable form and accumulated in a scientific manner to represent a particular language or variety used by a language community. McEnery & Wilson (1996: 215) classify corpora in a finer scheme based on their inherent features, in the following manner:
1. Loosely, a corpus refers to any body of text;
2. Commonly, it refers to a body of machine-readable text; and
3. Strictly, it refers to a finite collection of machine-readable texts sampled to be maximally representative of a language.
Since a corpus is designed and developed for the accurate empirical study of the linguistic properties, features and phenomena observed in a natural language, we argue that a systematically compiled corpus, however small in size, should adhere to the following criteria (Dash 2005: 12):
* A corpus must faithfully represent the common and special linguistic features of the language or variety from which it is designed and developed. The idea of text representation within a corpus thus indirectly refers to the sum total of components (i.e. letters, words, phrases, clauses, sentences, etc.) included in it. In practice, however, the total number of words included in a corpus may determine its size and yet fail to abide by the principle of proper text representation. Therefore, it is better to keep the fields of a corpus open and the number of words unlimited, for the benefit of the language and its users.
* A corpus should be large and wide enough to encompass text samples from various disciplines. In other words, the directional varieties of language usage manifested in various disciplines and domains should have proportional representation in it. For instance, text samples from the natural sciences should carry the same weight as those from aesthetics, literature, mass media, engineering and the social sciences. Thus, a balanced representation of text samples obtained from all disciplines and domains of language use will ensure the reliability of a corpus.
* A corpus should be a true replica of physical texts …
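The first two criteria above, size counted in words and balanced representation across domains, can be made concrete with a small sketch. It is only an illustration under stated assumptions: the sample texts, the domain labels, the crude whitespace tokeniser and the helper names (`tokenize`, `domain_shares`, `imbalance`) are all hypothetical and are not taken from the article, and an equal split across domains is assumed as the target of balance.

```python
from collections import Counter

# Hypothetical toy corpus: (domain label, text sample) pairs.
SAMPLES = [
    ("natural science", "Water boils at a lower temperature at high altitude."),
    ("literature", "The river kept its own counsel through the long night."),
    ("mass media", "The council approved the new budget on Monday evening."),
    ("literature", "She read the letter twice and then folded it away."),
]


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation (a crude assumption)."""
    tokens = (word.strip(".,;:!?\"'()").lower() for word in text.split())
    return [t for t in tokens if t]


def domain_shares(samples):
    """Return each domain's share of the total word tokens in the corpus."""
    counts = Counter()
    for domain, text in samples:
        counts[domain] += len(tokenize(text))
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}


def imbalance(shares):
    """Largest deviation of any domain's share from an equal split."""
    equal = 1 / len(shares)
    return max(abs(share - equal) for share in shares.values())


shares = domain_shares(SAMPLES)
for domain, share in sorted(shares.items()):
    print(f"{domain}: {share:.2f}")
print(f"imbalance: {imbalance(shares):.2f}")
```

In this toy data the literature domain contributes two samples while the others contribute one each, so its share of word tokens is larger and the imbalance figure is non-zero. A compiler aiming at the balance described above would add samples to the under-represented domains until that figure approaches zero.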
Publication information: Article title: Corpus Linguistics: An Empirical Approach for Studying a Natural Language. Contributors: Dash, Niladri Sekhar - Author. Journal title: Language Forum. Volume: 34. Issue: 2 Publication date: July-December 2008. Page number: 5+. © 2007 Bahri Publications. COPYRIGHT 2008 Gale Group.