This study examines problems caused by initial articles in library catalogs. The problematic records observed are those whose titles begin with a word erroneously treated as an article at the retrieval stage. Many retrieval algorithms edit queries by removing initial words that match articles in an exclusion list, regardless of whether the initial word is actually an article. Consequently, a certain number of documents are more difficult to find. The study also examines user behavior during known-item retrieval using the title index in library catalogs, concentrating on the problems caused by the presence of an initial article or of a word homographic to an article. Measures of success and effectiveness are taken to determine whether retrieval is affected in such cases.
When filing entries alphabetically in an index, it is customary to ignore initial definite and indefinite articles. (1) For instance, the book titled The Earth and Its Inhabitants is normally filed under the letter "E." This procedure is used almost universally because initial articles "tend to be used intermittently," and also because, given how frequently titles begin with an article, filing under the article would produce very large groupings of entries beginning with the same word, thus losing the desired alphabetical dispersion of entries within the index. (2) In the current version of the MARC 21 standard, this procedure can be achieved, for the first index subfield in some fields, by using a numerical indicator (the non-filing characters indicator) corresponding to the number of initial characters to be ignored at the beginning of the string being indexed. In the above example, the non-filing indicator of field 245 (title) would be set to 4, indicating that the first four characters (t-h-e and the space) are to be ignored for indexing. (3) Using this technique allows the initial article to be retained in the title field and used for display, without being taken into account in the browse index.
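The effect of the non-filing indicator can be illustrated with a minimal Python sketch. The function name filing_key is hypothetical and does not come from any MARC library; the sketch only assumes, as described above, that the indicator is a single digit giving the number of leading characters to skip.

```python
def filing_key(title: str, nonfiling: int) -> str:
    """Return the part of `title` used for browse indexing.

    `nonfiling` plays the role of the non-filing characters indicator:
    the number of leading characters to ignore when filing.
    """
    if not 0 <= nonfiling <= 9:
        raise ValueError("the non-filing indicator is a single digit, 0-9")
    return title[nonfiling:]


title = "The Earth and Its Inhabitants"
print(title)                 # display form: the article is retained
print(filing_key(title, 4))  # filing form: "The " (4 characters) is skipped
```

The full title stays available for display; only the key handed to the browse index changes.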
Because the non-filing indicator is not available for all the fields in which articles and other non-filing elements occur, and also because non-filing data elements do not always occur at the beginning of a field, a new technique, setting off the non-filing zone by means of control characters, was approved in 1999 as a result of American Library Association (ALA) Machine-Readable Bibliographic Information (MARBI) Committee Proposal 98-16R. (4) Guidelines for use of the new non-filing control characters were discussed in two discussion papers, DP118 (June 1999) and 2002-DP05 (January 2002), and finally published in 2004 by the Network Development and MARC Standards Office of the Library of Congress. (5) This procedure offers more flexibility, as it allows the cataloger to identify non-sorting zones virtually anywhere in the record and to tag them with special control characters that delimit the beginning and the end of the non-filing elements. As far as data representation is concerned, there are fairly standardized, documented, and efficient ways of dealing with initial definite and indefinite articles in data elements; however, the MARC coding controls only the way initial articles are indexed, not the way retrieval is done. (6) Less standardization is found at the retrieval stage, and this is what is investigated in this study.
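The control-character technique can be sketched in Python as follows. The code points used here (U+0098 to open and U+009C to close a non-sorting zone) reflect our understanding of the Unicode encoding of the MARC 21 non-sorting characters and should be treated as an assumption to be checked against the character set specification; the function names are hypothetical.

```python
import re

# Assumed code points for the non-sorting delimiters (verify against the
# MARC 21 character set specification before relying on them).
NSB = "\u0098"  # begins a non-sorting zone
NSE = "\u009c"  # ends a non-sorting zone

_ZONE = re.compile(f"{NSB}.*?{NSE}", re.DOTALL)


def display_form(value: str) -> str:
    """Keep the text of every zone, dropping only the delimiters."""
    return value.replace(NSB, "").replace(NSE, "")


def sort_form(value: str) -> str:
    """Drop every delimited zone, wherever it occurs in the field."""
    return _ZONE.sub("", value)


field = f"{NSB}The {NSE}Earth and Its Inhabitants"
print(display_form(field))  # article kept for display
print(sort_form(field))     # article ignored for filing
```

Unlike a leading character count, the delimiters can mark a zone in the middle of a field, which is the added flexibility the guidelines describe.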
All systems preprocess search strings to some extent (e.g., ignoring case distinction, omitting punctuation or replacing it with spaces, ignoring diacritics) before sending them to the index. When a user launches a browse-title search in a library catalog, the retrieval module may activate an algorithm to detect the presence of an inopportune initial article at the beginning of the query string. Because most initial articles are removed from the entries when the title strings are indexed, the algorithm automatically eliminates any initial article that the user includes in the query and brings the user to the correct entry point in the index. This procedure may prove very useful in some cases. For instance, if the user retains the initial article in a search query (for example, ti=the earth and its inhabitants), the algorithm detects the initial article and automatically suppresses it from the search query before it is sent to the index. In this example, the system therefore will bring the user to the index of titles beginning with the letter "E" rather than the letter "T."
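The preprocessing described above can be approximated with a short Python sketch. It is not the algorithm of any particular catalog: the exclusion list ARTICLES and both function names are hypothetical, and the normalization steps are simply those listed in the paragraph (case, punctuation, diacritics).

```python
import re
import unicodedata

ARTICLES = {"the", "a", "an"}  # hypothetical exclusion list


def normalize(query: str) -> str:
    """Lowercase, drop diacritics, and replace punctuation with spaces."""
    decomposed = unicodedata.normalize("NFD", query)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_marks.lower())
    return " ".join(no_punct.split())


def strip_initial_article(query: str) -> str:
    """Remove one leading word if it appears in the exclusion list."""
    words = normalize(query).split()
    if len(words) > 1 and words[0] in ARTICLES:
        words = words[1:]
    return " ".join(words)


print(strip_initial_article("The Earth and Its Inhabitants"))
```

The test against the exclusion list is purely lexical: nothing in it checks the language of the title or the grammatical role of the first word.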
Nonetheless, most of these algorithms are not sophisticated enough to detect some linguistic subtleties, which can result in retrieval problems. This automatic detection of initial articles in search queries poses a number of problems, particularly in multilingual environments. (7) The cataloger's decision to declare an initial word as an article to be ignored must be based on several factors, among which the language comes first, since it can be reasonably assumed that an initial article in one language may coincide with a legitimate non-article word in another language. This is the case, for instance, in German with the article "die," which is homographic to (i.e., spelled with the same sequence of letters as) the English verb "to die." It would not be correct to file the title Die Another Day under the letter "A." In some cases, it is even necessary to analyze the titles grammatically in order to avoid incorrect assumptions within a language. In French, for instance, the definite article "la" is homographic (apart from the diacritic) to the adverb of place "là" ('there'); and the word "un" can be an indefinite article, as in Un destin tragique; a pronoun, as in L'un d'entre eux; or a number, as in Un, deux, trois, partez! It can even be part of an adverbial locution, as in Un peu de fatigue. Moreover, it is also the homograph of the acronym for the United Nations (UN). Therefore, processing titles case by case is essential. Also, sentences (and titles) can begin with only one article, so it makes no sense grammatically to remove two or more words from the beginning of a title search query. Yet the algorithms tested in this project will remove any number of words at the beginning of a search query that match the words in their exclusion list.
For instance, in Atrium (the Université de Montréal catalog), the query "un thé au Sahara" will be transposed to "au sahara" because the "un" matches a French article and the "thé," when transposed to "the," matches an English article.
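The behavior reported for the query "un thé au Sahara" can be reproduced with a minimal Python sketch. The exclusion list and function names are hypothetical, and the sketch does not claim to be Atrium's actual code; it only contrasts stripping every leading match with stripping at most one word.

```python
import unicodedata

# Hypothetical multilingual exclusion list.
EXCLUSION_LIST = {"un", "une", "le", "la", "les", "the", "a", "an"}


def fold(word: str) -> str:
    """Lowercase a word and drop its diacritics (so "thé" becomes "the")."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_all_leading(query: str) -> str:
    """Remove every leading word found in the list (the observed behavior)."""
    words = [fold(w) for w in query.split()]
    while len(words) > 1 and words[0] in EXCLUSION_LIST:
        words.pop(0)
    return " ".join(words)


def strip_one_leading(query: str) -> str:
    """Remove at most one leading word, since a title has one initial article."""
    words = [fold(w) for w in query.split()]
    if len(words) > 1 and words[0] in EXCLUSION_LIST:
        words.pop(0)
    return " ".join(words)


print(strip_all_leading("un thé au Sahara"))  # au sahara
print(strip_one_leading("un thé au Sahara"))  # the au sahara
```

Limiting the removal to one word avoids this particular error, but it does not address homographs: a single leading word can still be removed when it is not an article.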
The detection algorithms included in most information retrieval systems are not sophisticated enough to detect these linguistic subtleties, and as a result some homographic non-article words may be erroneously removed from queries. This is the case for a title such as Las Vegas, The Success of Excess. This title will be correctly filed in the index under the letter "L" since the word "Las" is part of a place name, but if the word "Las" is included in the exclusion list of the algorithm, it will be interpreted as the Spanish definite article and automatically stripped from the query string, and the user will be misdirected to the letter "V" in the index, where the entry is nowhere to be found.
Suppose a user needs to find the work by Michel Leiris entitled À cor et à cri. Browsing through the title index normally would be done with the standard query "a cor et a cri." Unfortunately, if the initial article detection algorithm is activated, the user will be misdirected to the letter "C" in the index, because the initial "a" of the query will be wrongly interpreted as the English indefinite article "a" and the query text will be truncated, often without the user being aware of it, becoming "cor et a cri." Because the title has been correctly indexed under the letter "A," the user will be wrongly positioned in the index as illustrated (figure 1) and may wrongly assume that the title is not in the collection. This lack of system feedback most probably has a negative impact on end users learning to use the catalog.
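The mispositioning described above can be simulated with a small Python sketch of a browse index. Apart from the Leiris title and The Earth and Its Inhabitants, the index entries are invented for illustration, and the exclusion list and function name are hypothetical.

```python
import bisect

ARTICLES = {"a", "an", "the"}  # hypothetical exclusion list

# Browse index of normalized filing keys. The first entry was filed
# correctly under "A"; the two entries under "C" are invented.
INDEX = sorted([
    "a cor et a cri",
    "cooking for beginners",
    "coral reefs",
    "earth and its inhabitants",
])


def browse(query: str, strip_articles: bool):
    """Return the key sent to the index and the entry the user lands on."""
    words = query.lower().split()
    if strip_articles and len(words) > 1 and words[0] in ARTICLES:
        words = words[1:]
    key = " ".join(words)
    pos = bisect.bisect_left(INDEX, key)
    entry = INDEX[pos] if pos < len(INDEX) else None
    return key, entry


print(browse("a cor et a cri", strip_articles=True))
print(browse("a cor et a cri", strip_articles=False))
```

With stripping enabled, the user lands among the entries under "C" and sees no trace of the title; with stripping disabled, the same query lands on the entry itself. Nothing in the result tells the user that the query was altered.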
[FIGURE 1 OMITTED]
In the University of Toronto catalog shown in figure 1, the user has to choose between two search modes: the keyword mode ("containing") or the browse mode ("starting with"). If the "starting with" option is chosen, the user will probably draw the conclusion that the document being sought is not in the …