Searching Titles with Initial Articles in Library Catalogs: A Case Study and Search Behavior Analysis
Arsenault, Clement; Menard, Elaine. Library Resources & Technical Services
This study examines problems caused by initial articles in library catalogs. The problematic records observed are those whose titles begin with a word erroneously treated as an article at the retrieval stage. Many retrieval algorithms edit queries by removing initial words that match articles in an exclusion list, regardless of whether the initial word is actually an article. Consequently, a certain number of documents remain more difficult to find. The study also examines user behavior during known-item retrieval using the title index in library catalogs, concentrating on the problems caused by the presence of an initial article or of a word that is a homograph of an article. Measures of success and effectiveness are taken to determine whether retrieval is affected in such cases.
When filing entries alphabetically in an index, ignoring initial definite and indefinite articles is customary. (1) For instance, the book titled The Earth and Its Inhabitants is normally filed under the letter "e." This procedure is used almost universally because initial articles "tend to be used intermittently," and also because, given how frequently titles begin with an article, filing under the article would produce very large groupings of entries beginning with the same word, thus losing the desired alphabetical dispersion of entries within the index. (2) In the current version of the MARC 21 standard, this procedure can be achieved, for the first index subfield in some fields, by using a numerical indicator (the non-filing characters indicator) corresponding to the number of initial characters to be ignored at the beginning of the string being indexed. In the above example, the non-filing indicator of field 245 (title) would be set to 4, indicating that the first four characters (t-h-e and the space) are to be ignored for indexing. (3) Using this technique allows the initial article to be retained in the title field and used for display, without being taken into account in the browse index.
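The effect of the non-filing indicator can be illustrated with a minimal sketch. The function name `filing_key` and the lowercasing step are assumptions made for illustration only; they do not describe how any particular library system builds its browse index.

```python
def filing_key(title: str, nonfiling_chars: int) -> str:
    """Return the string filed in the browse index (hypothetical helper).

    `nonfiling_chars` plays the role of the non-filing characters
    indicator: the number of leading characters to skip when indexing.
    The stored title itself is left untouched for display.
    """
    if not 0 <= nonfiling_chars <= 9:
        raise ValueError("the indicator is a single digit (0-9)")
    # Lowercasing is an assumed normalization, not part of MARC 21.
    return title[nonfiling_chars:].lower()


title = "The Earth and Its Inhabitants"
key = filing_key(title, 4)   # skips "T", "h", "e" and the space
print(title)                 # display form keeps the article
print(key)                   # filing form starts at "earth"
```

With the indicator set to 0, the same function would file the title under "t", which is the behavior wanted for titles whose first word only looks like an article.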
Because the non-filing indicator is not available for all the fields in which articles and other non-filing elements occur, and because non-filing data elements do not always occur at the beginning of a field, a new technique, setting off the non-filing zone by means of control characters, was approved in 1999 as a result of the American Library Association (ALA) Machine-Readable Bibliographic Information (MARBI) Committee's Proposal 98-16R. (4) Guidelines for use of the new non-filing control characters were discussed in two discussion papers, DP118 (June 1999) and 2002-DP05 (January 2002), and finally published in 2004 by the Network Development and MARC Standards Office of the Library of Congress. (5) This procedure offers more flexibility, as it allows the cataloger to identify non-sorting zones virtually anywhere in the record and to tag them with special control characters that delimit the beginning and the end of the non-filing elements. As far as data representation is concerned, there are fairly standardized, documented, and efficient ways of dealing with initial definite and indefinite articles in data elements; however, the MARC coding controls only the way initial articles are indexed, not the way retrieval is done. (6) Less standardization is found at the retrieval stage, and this is what the present study investigates.
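The control-character technique can be sketched in the same way. The code points used below for the begin and end delimiters (U+0098 and U+009C) are an assumption about the Unicode encoding of the MARC 21 non-sorting characters and should be checked against the MARC 21 character set specification; the helper names and the second sample string are hypothetical.

```python
import re

# Assumed delimiters for the non-sorting zone (verify against the
# MARC 21 character set specification before relying on them).
NSB = "\u0098"  # non-sorting characters begin
NSE = "\u009c"  # non-sorting characters end

# Matches one delimited zone, including the delimiters themselves.
_ZONE = re.compile(re.escape(NSB) + ".*?" + re.escape(NSE), re.DOTALL)


def display_form(field: str) -> str:
    """Drop only the delimiters; the non-filing text stays visible."""
    return field.replace(NSB, "").replace(NSE, "")


def filing_form(field: str) -> str:
    """Drop every delimited zone, wherever it occurs in the field."""
    return _ZONE.sub("", field)


# Zone at the start of the field.
field = f"{NSB}The {NSE}Earth and Its Inhabitants"
print(display_form(field))
print(filing_form(field))

# Zone in the middle of the field (made-up string), which a
# single leading-character count could not express.
mixed = f"Works. {NSB}The {NSE}Earth"
print(filing_form(mixed))
```

Unlike the indicator, which can only count characters from the start of one subfield, the delimiters mark the zone explicitly, so the same two functions work regardless of where the non-filing element sits.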
All systems preprocess search strings to some extent (e.g., ignoring case distinction, omitting punctuation or replacing it with spaces, ignoring diacritics) before sending them to the index. When a user launches a browse-title search in a library catalog, the retrieval module may activate an algorithm to detect the presence of an inopportune initial article at the beginning of the query string. Because most initial articles are removed from the entries when indexing the title strings, even if a user includes an initial article in his or her query, the algorithm will automatically eliminate the word/article and bring the user to the correct entry point in the index. …
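The query-side behavior described above, and the way it can misfire, can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of any actual catalog: the exclusion list `ARTICLES` is hypothetical, and "Die Hard" is an example chosen here to show a word that is a homograph of an article (the German "die").

```python
import string
import unicodedata

# Hypothetical exclusion list mixing English, French and German articles.
ARTICLES = {"the", "a", "an", "le", "la", "les", "un", "une",
            "der", "die", "das"}

# Maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation,
                                " " * len(string.punctuation))


def normalize(text: str) -> str:
    """Ignore case, strip diacritics, replace punctuation with spaces."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return " ".join(no_marks.translate(_PUNCT_TO_SPACE).lower().split())


def preprocess_query(query: str, articles=ARTICLES) -> str:
    """Drop the first word if it appears in the exclusion list.

    The check is purely lexical: it cannot tell whether the word is
    really functioning as an article in this title.
    """
    words = normalize(query).split()
    if len(words) > 1 and words[0] in articles:
        words = words[1:]
    return " ".join(words)


# Genuine article: the query lands on the indexed entry.
print(preprocess_query("The Earth and Its Inhabitants"))
# Homograph of an article: the record is indexed under "die hard",
# but the edited query points the user to "hard" instead.
print(preprocess_query("Die Hard"))
```

The mismatch arises because the indexing side relies on a cataloger's per-record judgment, while the retrieval side applies the same list to every query.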