Distributional Hypothesis: Words for 'Human Being' and Their Estonian Collocates

1. Introduction

The famous dictum of the British linguist John Rupert Firth, "You shall know a word by the company it keeps!", draws attention to the fact that words do not combine in phrases and sentences at random or by purely syntactic rules. Instead, special relations hold between the words, and the way they co-occur reveals important information about those relations. Zellig Harris, sometimes called the last American structuralist, found the idea illuminating and based his semantic classification on word distribution, arguing that words occurring in similar contexts have similar meanings:

"If we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C" (Harris 1954:156).

The distribution of an item is understood as the sum of all its contexts. This general definition by Harris leaves open what is meant by context and what exactly the meaning is that the context is supposed to reflect. In Harris's own publications, context may mean either phonetic or lexical context, or even the whole text. In the present study, context is understood as left collocations, more precisely the three words positioned to the left of the node word. For comparison, some right collocations from a window of the same length (three words) are also presented.
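The window-based notion of context described above can be illustrated with a minimal sketch. It assumes a corpus that has already been tokenized into a flat list of word forms; the function names and the toy token sequence are hypothetical, and the sketch is not the procedure used in the study. It matches exact word forms, so an inflected language such as Estonian would in practice require lemmatized input.

```python
from collections import Counter

def left_collocates(tokens, node, window=3):
    """Count words found up to `window` positions to the left of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Slice is clipped at the start of the token list.
            counts.update(tokens[max(0, i - window):i])
    return counts

def right_collocates(tokens, node, window=3):
    """Count words found up to `window` positions to the right of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Slicing past the end of the list is safe in Python.
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

# Invented toy sequence for illustration only (not taken from any corpus).
tokens = "see on noor mees ja vana mees on siin".split()
left = left_collocates(tokens, "mees")
right = right_collocates(tokens, "mees")
```

Calling `left.most_common(30)` on such a counter would yield the 30 most frequent left collocates of a node word, ordered by frequency.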

Our analysis deals with the 10 most frequent Estonian words for 'human being' and their 30 most frequent collocates. The words are inimene 'human', mees 'man', naine 'woman', laps 'child', tüdruk 'girl', poiss 'boy', tütar 'daughter', poeg 'son', ema 'mother', and isa 'father'. Frequency is especially relevant to the discovery of collocational relations. The least frequent of the above words was tütar, with only 612 occurrences. The list of node words was therefore closed at ten, as less frequent nodes would have introduced too much random material among the 30 collocates presented for each node word. Intuitively, the above ten words should belong to the basic words for 'human being'. True, their psychological salience has not been investigated, but their linguistic salience is reflected in their position in the list of the 130 most frequent Estonian nouns (Kaalep and Muischnek 2002), which is a strong argument for their possible basic word status (for the definition of 'basic word' see Sutrop 2000, Sutrop 2011).

Harris did not give a precise definition of meaning either. The present study is based on the idea that meaning consists of smaller aspects or components (for componential analysis see Fodor and Katz 1963, Nida 1975). A component (for example, 'human being') binds all the relevant words into one semantic field. Thus, according to the distributional hypothesis, all words for 'human being' should share much of their context. However, relations within the semantic field are also of interest. Notably, the words for 'human being' are distinguished by such features as MALE, ADULT, KINSHIP/PARTNERSHIP, GENERATION OLDER THAN SELF and GENERATION YOUNGER THAN SELF. The research question is: Are the values given to those components, i.e. the semantic aspects of words, also manifested in the collocations of those words?
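One simple way to quantify how much context two node words share is the overlap of their collocate sets. The sketch below uses the Jaccard coefficient, which is an assumption made here for illustration and not a measure named in the study; the collocate lists are invented and carry no empirical weight.

```python
def jaccard(a, b):
    """Share of items common to both collections, relative to their union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical collocate lists, invented for illustration only.
collocates = {
    "mees": ["noor", "vana", "üks", "see"],
    "naine": ["noor", "vana", "üks", "ilus"],
    "laps": ["väike", "üks", "see", "oma"],
}

overlap_mees_naine = jaccard(collocates["mees"], collocates["naine"])
overlap_mees_laps = jaccard(collocates["mees"], collocates["laps"])
```

Under the distributional hypothesis, node words that agree on more components would be expected to show a higher overlap than node words that agree on fewer, although whether this holds is exactly the empirical question posed above.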

Semantic relations bind word meanings into semantic networks. Ever since Saussure, semantic relations between words have been divided into syntagmatic and paradigmatic ones. The former hold between words appearing together in the same syntagm. For example, in the phrase 'young man' the words 'young' and 'man' are related syntagmatically. Paradigmatic relations, by contrast, hold between words that are mutually substitutable in the syntagm. In the above syntagm, for example, the word 'young' stands in a paradigmatic relationship with the words 'old', 'tall', etc. (Lyons 1977:240-241). Note that Saussure called the paradigmatic relation associative, arguing that it is a relationship between words associated in memory (Saussure 2000:121). …

