LIBRES: Library and Information Science Research Electronic Journal

Selecting the Right Search Term in Query-Based Systems for Deduplication/Die Wahl Des Suchbegriffs in Anfragebasierten Systemen Zur Erkennung Bibliographischer Dubletten/Sélection De Mot De Recherche Dans Les Systèmes Basés Sur Des Requêtes En Vue De Détecter Les Doublons Bibliographiques

Article excerpt

Essentially, three approaches can be identified for choosing a suitable search term to detect bibliographic duplicates. Stop words are excluded in all of them; then (1) the first term of an entry is selected, or (2) the term that produces the smallest number of hits is chosen, or finally (3) the first term whose number of hits falls below a defined threshold is used.

These three procedures are compared with each other here. The results are derived from series of measurements performed with bibliographic data from the Austrian Central Catalog.

When choosing a suitable search term for detecting bibliographic duplicates, essentially three approaches can be identified. Stop words are excluded in all three; then (1) the first term of an entry that is found is chosen, or (2) the term that produces the smallest result set is chosen, or finally (3) the term whose result set lies below a defined threshold is used.

These three procedures are compared with one another here. The results stem from series of measurements carried out with title data from the Austrian library network.

Essentially, three approaches can be identified in the procedure for selecting a term suitable for detecting bibliographic duplicates. All three approaches exclude stop words; then (1) the first term of an entry is selected, or (2) the term that yields the smallest number of results is chosen, or finally (3) the term whose number of results falls below a defined threshold is taken. These three procedures are compared with one another here.

The results come from series of measurements performed on a bibliographic database from the catalogue of the Austrian National Library.

1 Preface

If one tries to classify this article against the current literature, one may quickly be tempted to dismiss the methods discussed here as no longer contemporary. This brief preface, which provides some references to the relevant literature, shows that such a quick judgment is certainly not valid.

When detecting bibliographic duplicates, the particular aim is to avoid comparing too many entries, because the final computation over large numbers of candidate duplicates is rather time-consuming. Far more effective, of course, is any approach that keeps the set of titles to be tested very small. On the other hand, it has to be taken into account that a smaller set of selected titles increases the probability that some duplicates will not be recognized. Thus, whenever the already limited "search space" within which duplicates are verified is restricted further, efficiency and reliability, with their respective requirements, are diametrically opposed.

The approach of proceeding on the basis of keywords is one that matches the reality of working with bibliographic databases. This reality is quite often characterized by the fact that bibliographic entries (from very different databases) which are not available locally have to be integrated into a local system. In doing so, the remote databases are addressed via interfaces (such as Z39.50) and APIs (Application Programming Interfaces), such as REST (Representational State Transfer) or SOAP (Simple Object Access Protocol).
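As a minimal illustration of addressing a remote catalogue through such an interface, the following Python sketch asks a REST-style SRU endpoint how many records match a single keyword. The base URL and the index name dc.title are placeholders of the sketch, not the address or schema of an actual service.

# Hedged sketch: query a remote catalogue over a REST-style SRU interface for
# the number of records matching one keyword. BASE_URL and the index name are
# illustrative placeholders, not taken from the article.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/sru"  # hypothetical endpoint


def hit_count(term, index="dc.title"):
    """Return the number of records matching `term`, as reported by the server."""
    params = urllib.parse.urlencode({
        "version": "1.2",
        "operation": "searchRetrieve",
        "query": f'{index}="{term}"',
        "maximumRecords": "0",  # only the count is needed, not the records themselves
    })
    with urllib.request.urlopen(f"{BASE_URL}?{params}", timeout=10) as response:
        tree = ET.parse(response)
    for element in tree.iter():
        if element.tag.endswith("numberOfRecords"):
            return int(element.text)
    return 0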

The selected approach must therefore be one that can be used in such environments in a functional and effective way (see Schneider 1999). The method of automatically loading the titles to be checked by means of a properly selected keyword is such an approach.
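Given such a hit-count lookup, the three term-selection strategies summarized in the abstract can be sketched as follows. The hit_count callable stands for any lookup of the kind sketched above; the stop-word list and the threshold are illustrative assumptions rather than values taken from the article.

# Sketch of the three term-selection strategies summarized in the abstract.
# `hit_count` is any callable returning the number of matches a keyword produces
# in the target catalogue; stop words and threshold are illustrative only.
STOP_WORDS = {"the", "a", "an", "of", "and", "der", "die", "das", "und", "le", "la"}


def candidate_terms(title):
    """Tokenize a title and drop stop words; all three strategies start here."""
    return [t for t in title.lower().split() if t not in STOP_WORDS]


def first_term(title):
    """Strategy 1: simply take the first remaining term of the entry."""
    terms = candidate_terms(title)
    return terms[0] if terms else None


def smallest_hit_term(title, hit_count):
    """Strategy 2: take the term that produces the smallest number of hits."""
    terms = candidate_terms(title)
    return min(terms, key=hit_count) if terms else None


def below_threshold_term(title, hit_count, threshold=50):
    """Strategy 3: take the first term whose hit count falls below a threshold."""
    for term in candidate_terms(title):
        if hit_count(term) < threshold:
            return term
    return None

In this sketch, strategy 2 requires one query per remaining term, whereas strategy 3 can stop at the first term that proves selective enough.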

As an alternative to a keyword-based approach, the current literature mainly presents methods such as SNM (Sorted Neighborhood Method) (Hernandez & Stolfo 1995; Yan et al. 2007) or the "Blocking Method" (Draisbach & Naumann 2009). In these methods, small strings are extracted from the data (e.g. substrings of the author and title information), joined together into a new string, and sorted alphanumerically. …
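A minimal sketch of the sort-key idea characterized above: short substrings of the author and title fields are concatenated into a key, the records are sorted by that key, and only records falling into a small sliding window are compared. The field names, prefix lengths, and window size are illustrative assumptions, not parameters given in the article.

# Sketch of the Sorted Neighborhood idea: build a short key from substrings of
# author and title, sort by it, and compare only records inside a sliding window.
# Field names, prefix lengths, and window size are illustrative.


def snm_key(record, author_len=4, title_len=4):
    """Concatenate prefixes of the author and title fields into a sort key."""
    author = record.get("author", "").lower().replace(" ", "")
    title = record.get("title", "").lower().replace(" ", "")
    return author[:author_len] + title[:title_len]


def candidate_pairs(records, window=5):
    """Yield record pairs that end up in the same sliding window after sorting."""
    ordered = sorted(records, key=snm_key)
    for i, record in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            yield record, other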
