Orthographic Measures of Language Distances between the Official South African Languages/Ortografiese Maatstawwe van Taalafstande tussen die Amptelike Suid-Afrikaanse Tale

By: Zulu, P. N.; Botha, G. et al. | Literator: Journal of Literary Criticism, comparative linguistics and literary studies, April 2008

Zulu, P. N., Botha, G., Barnard, E., Literator: Journal of Literary Criticism, comparative linguistics and literary studies


Abstract

Two methods for objectively measuring similarities and dissimilarities between the eleven official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically.

We also apply the Levenshtein distance measure to the orthographic word transcriptions from the eleven South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships.

Key concepts:

clustering; language distances; language identification; Levenshtein distance; n-gram

Opsomming

Twee metodes vir die bepaling van verwantskappe tussen die elf amptelike tale van Suid-Afrika word beskryf. Die eerste metode maak gebruik van n-gramme. Die verwarrings wat plaasvind in 'n taalherkenningstelsel verskaf inligting oor die verhouding tussen die tale. N-gram-statistieke word vanaf teksdokumente bepaal en word dan gebruik as kenmerke vir klassifikasie. Ons wys dat die uitsette van 'n bevestigingstoets gebruik kan word om te bepaal hoe naby tale aan mekaar lê. Vanuit hierdie metings het ons 'n sigbare voorstelling van die verhouding tussen tale afgelei.

Verder het ons die Levenshtein-metode gebruik om die afstand tussen die ortografiese transkripsies van woorde te bepaal, toegespits op die elf amptelike tale van Suid-Afrika. 'n Grafiese groepering volgens die afstande tussen die verskillende tale toon weer die verhoudings aan tussen die tale en ook familiegroepe. Met sowel die dendrogramme as die multidimensionele skalering word bepaalde familiegroepe aangedui, en selfs ook die fynere verwantskappe binne hierdie familiegroepe.

Kernbegrippe: groepering; Levenshtein-afstand; n-gram; taalafstande; taalherkenning

1. Introduction

The development of objective metrics to assess the distances between different languages is of great theoretical and practical importance. To date, subjective measures have generally been employed to assess the degree of similarity or dissimilarity between different languages (Gooskens & Heeringa, 2004; Van-Hout & Munstermann, 1981; Van-Bezooijen & Heeringa, 2006). Those subjective decisions are, for example, the basis for classifying some varieties as separate languages and certain groups of language variants as dialects of one another. Languages are undoubtedly complex; they differ in vocabulary, grammar, writing format, syntax and many other characteristics. This complexity makes it difficult to construct objective comparative measures between languages. Even if one intuitively knows, for example, that English is closer to French than it is to Chinese, by how much is it closer? And what are the objective factors that allow one to assess these levels of distance?

These questions bear substantial similarities to the analogous questions that have been asked about the relationships between different species in the science of cladistics. As in cladistics, the most satisfactory answer would be a direct measure of the amount of time that has elapsed since the languages first split from their most recent common ancestor. Also, as in cladistics, it is hard to measure this from the available evidence, and various approximate measures have to be employed instead. In the biological case, recent decades have seen tremendous improvements in the accuracy of biological measurements as it has become possible to measure differences between DNA sequences. In linguistics, the analogue of DNA measurements is historical information on the evolution of languages, and the more easily obtained, though indirect, measurements (akin to the biological phenotype) are the textual or acoustic representations of the languages in question.

In the current article, we focus on distance measures derived from text. We apply two different techniques, namely language confusability based on n-gram statistics and the Levenshtein distance between orthographic word transcriptions, in order to obtain measures of dissimilarity among a set of languages. These methods are used to obtain language groupings, which are represented graphically using two standard statistical techniques (dendrograms and multi-dimensional scaling). This allows us to compare the groupings produced by each method against known linguistic facts and so assess the relative reliability of the methods.
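As a rough illustration of the two text-based measures named above, the following Python sketch computes character n-gram count profiles and the Levenshtein distance between two strings. It is not the authors' implementation: the function names, the default n-gram length of 3, the cosine comparison of profiles and the length normalisation of the Levenshtein distance are all assumptions made for this example, since the available text does not specify the classifier or any normalisation.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two n-gram profiles (assumed comparison, not from the article)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete cs
                               current[j - 1] + 1,       # insert ct
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalized_distance(s: str, t: str) -> float:
    """Levenshtein distance divided by the longer length (assumed normalisation)."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0
```

Averaging `normalized_distance` over a list of translation-equivalent word pairs for two languages would give one entry of a language-by-language distance matrix, which could then be passed to a hierarchical clustering or multi-dimensional scaling routine to produce groupings of the kind described here.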

Our evaluation is based on the eleven official languages of South Africa. These languages fall into two distinct groups, namely the Germanic group (represented by English and Afrikaans) and the South African Bantu languages, which belong to the South Eastern Bantu group. The South African Bantu languages can further be classified in terms of different sub-groupings: Nguni (consisting of Zulu, Xhosa, Ndebele and Swati), Sotho (consisting of Southern …
