Essential Programming for Linguistics



This book introduces programming for linguists. Linguistic programming is becoming an important topic for advanced students of linguistics, especially those who need to process corpus materials or literary texts for B.A. dissertations, Ph.D. theses, or advanced research.

This volume assumes no background knowledge of programming. It introduces linguists to the basic notions and techniques needed for linguistic programming and develops their understanding of electronic texts. Conversely, it also helps computer scientists who are unfamiliar with language processing and wonder which aspects of linguistics may be relevant to conducting intelligent, rather than purely "statistical" language processing. Many examples based on diverse topics demonstrate the immediate applicability of the concepts taught.


This book is mainly intended as an introduction to programming for linguists without any prior programming experience. However, it will hopefully also be useful to students or researchers who have a computer science background and will therefore probably already be familiar with programming concepts, but may have had little experience in understanding and analysing language.

The first question that may come to your mind when picking up a book about programming for linguistics is probably “Why should I need this, anyway?”, so I will try to answer this in a few paragraphs. In the days before computerised corpora became available for language analysis, researchers or scholars in linguistics either invented their examples – if they adhered to what has sometimes been referred to as ‘armchair linguistics’ – or they used reference materials collected and stored in the form of filing cards or other means of storage, which they then needed to search through again, rather painstakingly, each time they wanted to find an example of a particular linguistic phenomenon. With the advent of corpus linguistics and its use of computerised data, and especially today’s means of accessing these through the internet and other sources, finding suitable samples of language has – at least to some extent – become much easier, and it is now possible to analyse and document languages, as well as to validate theories about them, much more efficiently.

However, linguistic analysis of such data often goes far beyond what basic search programs for linguistics – so-called concordancers – have to offer, and frequently involves multiple steps of data preparation and analysis that would be extremely time-consuming – and also potentially very error-prone – if conducted manually. For example, the very first step in analysing data is often to tokenise it, that is, to identify appropriate units in the material under analysis and separate the data into them. This is frequently followed by a stage of morpho-syntactic analysis or tagging, and then in turn by a syntactic analysis, or perhaps a frequency count of specific tagged data in order to conduct genre/variational analyses along the lines of Biber (1988). Now, there may be individual programs available for all these intermediate steps, but chances are that these programs are not freely available, only run on a different operating system from the one you are using, require a particular input format, or do not produce the output format that you need your data in at particular stages of the processing. In the first two cases, perhaps your only option is to write such a program yourself; in the latter two cases, without any knowledge of programming, you would probably end up manually preparing and correcting your . . .
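To give a first, concrete impression of the kind of step described above, here is a minimal sketch of tokenisation followed by a frequency count, written in Python. The choice of language, the regular expression, and the sample sentence are purely illustrative assumptions, not the book's own method; a real tokeniser would also have to deal with contractions, hyphenation, abbreviations, and much more.

```python
import re
from collections import Counter

def tokenise(text):
    """Split text into word and punctuation tokens (deliberately simple)."""
    # \w+ matches runs of word characters (a crude notion of 'word');
    # [^\w\s] matches single punctuation marks, kept as separate tokens.
    # Lower-casing conflates 'The' and 'the' for the frequency count.
    return re.findall(r"\w+|[^\w\s]", text.lower())

sample = "The cat sat on the mat. The mat was flat."
tokens = tokenise(sample)

# A frequency count over the tokens, as in genre/variational analyses.
freq = Counter(tokens)
print(freq.most_common(3))
```

Even this toy example shows why a one-off manual approach does not scale: once the data is tokenised programmatically, counting, sorting, and filtering come almost for free.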
