Academic journal article Reference & User Services Quarterly

"Linkrot" and the Usefulness of Web Site Bibliographies

This article examines whether outdated URLs are still a problem and to what extent they decrease the usefulness of paper bibliographies and reviews of Web sites. The authors make specific recommendations that bibliographers and Web site authors can use to increase the probability of users finding Web sites that may have relocated, thus improving the long-term utility of Web site bibliographies.

A traditional role of librarians has been to collect, evaluate, and annotate information on a given topic and to produce bibliographies and reviews. With the birth of the Internet and World Wide Web, librarians have expanded this role to include collecting and evaluating Web sites. Annotations of Web sites now appear in professional journals for librarians, in traditional librarian-oriented review sources (such as Choice and Library Journal), in such electronic formats as the Internet Scout Project and Librarians' Index to the Internet, and in virtual libraries, including INFOMINE and Argus Clearinghouse.(1) Librarians also publish subject bibliographies that appear in professional journals of other disciplines.

Paper-based subject bibliographies of Web sites have several benefits, including their use as "ready-reference tools for librarians and information specialists," even though these lists, like traditional bibliographies, are usually out of date by the time they are published.(2) An additional problem exists for Web site lists: while traditional bibliographies rarely give the precise location of the resource, the addresses of resources on the World Wide Web (URLs) are included and can change. When Web pages are removed or moved by their creators without a "forwarding address," the URLs given in the bibliography are no longer correct, a phenomenon known as "linkrot." When a link expires, users have to wait until search engines update their databases, unless the new URL is posted at the old address or the user is automatically sent to the new site.

How pervasive is linkrot in this environment? In 1998, Benbow examined two of her previously published Web resource journal articles and found that 50 percent of the URLs selected in mid-1995 and 19 percent of the URLs selected in mid-1996 were no longer accurate.(3) Kitchens and Mosley examined URLs in selected Internet guide books and found that "on average, 29 [percent] of URLs were inactive within two years of publication."(4)

Is linkrot still a significant problem in paper bibliographies of Web resources? To determine this, the authors examined Web site annotations that appeared in the "Internet Resources" column in College & Research Libraries News over a twelve-month period. In addition to reporting the results of the inquiry, we will offer suggestions and strategies for both bibliographers and Web site authors that may make bibliographies more useful on a long-term basis and decrease linkrot.

Method

The authors examined the URLs and annotations of Web sites (those beginning with "http://") in "Internet Resources" columns from October 1997 through October 1998. The analysis had two main goals: to determine how many URLs were outdated and to determine if the annotations provided enough information to retrieve the URLs using a meta search engine. The original sample included 510 URLs. After we excluded obvious typographical errors (for example, "ww." instead of "www.") and duplicates, the final sample contained 482 URLs. These sites were tested during the month of October 1998 to determine whether the URLs were still active. URLs that were active in October 1998 were retested in May 1999 to see if they were still in use.

If the URLs had expired, the authors documented whether and how users were directed to the new locations: by instantly taking users to the new location, by listing the new URL on an error page (for example, "404 File Not Found"), or by providing site maps or site search functions on the error page. We then used MetaCrawler, a meta search engine, to attempt to retrieve all of the sites listed in the articles using only the information given in the annotations.(5) Phrase and Boolean "AND" searching techniques were used; search terms were derived from the names of the resources and other information given in the articles.

We selected MetaCrawler because it compiles results from multiple search engines and so increases the likelihood of retrieving the new location. As Lawrence and Giles have noted, "search engines do not index sites equally, may not index new pages for months, and no engine indexes more than about 16 [percent] of the Web."(6)

There were some limitations with these methods. Web site content could have changed since the articles were written and published, and therefore information in the annotation based on the cited edition of a page would no longer be useful in locating the new version. Also, there is always a delay between the time of the establishment of the new URL and the date that the search engines index the new location. At the time, MetaCrawler had its own limitations: it returned a maximum of thirty sites from each search engine, so it is possible that although individual search engines might have indexed the new locations, this information might not have been included in the results. Although MetaCrawler allowed phrase searching, not all of the individual search engines supported this technique. In addition, the level of indexing varied among the included search engines.
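The URL test described in this section can be approximated in a few lines of code. The following is a minimal sketch, not the authors' procedure (the article does not say what tools were used). It assumes Python's standard urllib; the function name check_url and the three status labels are illustrative. Note that urllib follows HTTP redirects automatically but does not follow META refresh pages, so a referral page would be reported here as active.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Classify a URL as 'active', 'redirected', or 'error'.

    Hypothetical helper: `opener` is injectable so the function can be
    exercised without network access.
    Returns a (status, final_url) tuple; final_url is None on error.
    """
    try:
        with opener(url, timeout=timeout) as response:
            final_url = response.geturl()  # URL after any HTTP redirects
    except (urllib.error.URLError, OSError, ValueError):
        # HTTPError (e.g. 404) is a subclass of URLError;
        # ValueError covers malformed URLs such as a missing scheme.
        return ("error", None)
    if final_url != url:
        return ("redirected", final_url)
    return ("active", final_url)
```

Calling check_url on each address in a bibliography and tallying the three labels would give a rough count of outdated URLs; pages that return an error message listing a new location would still need to be inspected by hand.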

Results

As table 1 indicates, each article contained an average of 40.2 URLs (rounded to the nearest tenth). Most of the annotated Web sites were hosted on commercial sites (32.8 percent), academic sites (24.7 percent), government sites (18.7 percent), and organizational sites (18.0 percent) (see table 2). Of the 482 URLs in the study, 64 (13.3 percent) had expired by October 1998. An additional 43 URLs were inactive in May 1999 for a total of 107 (22.2 percent) outdated addresses (see table 1). Only two military sites were in the study and both URLs were no longer active; the .net domain had the next highest percentage of inactive URLs, followed by commercial sites, academic sites, and government sites (see table 2).

Table 1
Outdated URLs by Article and Date since Publication

                                Changes Noted October 1998

                          Total                            Months
                           URLs    Outdated            %    since
Article Subject       (N = 482)        URLs     Outdated    Publ.

Health                       47          14         29.8       13
Career                       36           4         11.1       12
Biotechnology                37           3          8.1       11
Clip art                     21           2          9.5       10
Finance                      42           7         16.7        9
Social work                  47          11         23.4        8
Nutrition                    28           2          7.1        7
Distance learning            21           4         19.0        6
Travel                       42           6         14.3        5
East Asia                    88           9         10.2        4
Refugees                     31           0          0.0        2
El Nino                      42           2          4.8        1
Total                       482          64            -        -

                               Changes Noted May 1999

                          Add'l                   Months
                       Outdated           %        since
Article Subject            URLs    Outdated        Publ.

Health                        3         6.4           20
Career                        4        11.1           19
Biotechnology                 9        24.3           18
Clip art                      0         0.0           17
Finance                       1         2.4           16
Social work                   3         6.4           15
Nutrition                     2         7.1           14
Distance learning             0         0.0           13
Travel                        5        11.9           12
East Asia                     9        10.2           11
Refugees                      1         3.2            9
El Nino                        6        14.3            8
Total                        43           -            -

                               Totals

                          Total
                       Outdated           %
Article Subject            URLs    Outdated

Health                       17        36.2
Career                        8        22.2
Biotechnology                12        32.4
Clip art                      2         9.5
Finance                       8        19.0
Social work                  14        29.8
Nutrition                     4        14.3
Distance learning             4        19.0
Travel                       11        26.2
East Asia                    18        20.5
Refugees                      1         3.2
El Nino                       8        19.0
Total                       107           -

Table 2
Outdated URLs by Domain Type

                            Number of           Percent of
Domain Type                 Sites (N)      Total Sites (N/482)

Academic (.edu or .ac)        119                  24.7
Commercial (.com)             158                  32.8
Government (.gov)              90                  18.7
Military (.mil)                 2                   0.4
Network (.net)                 24                   5.0
Organization (.org)            87                  18.0
Other                           2                   0.4
Totals                        482                    -

                           Total Outdated     Percent of Total
Domain Type                URLs in May (n)     Outdated (n/N)

Academic (.edu or .ac)          25                 21.0
Commercial (.com)               38                 24.1
Government (.gov)               17                 18.9
Military (.mil)                  2                100.0
Network (.net)                   7                 29.2
Organization (.org)             18                 20.7
Other                            0                    0
Totals                         107                    -

As expected, the oldest article contained the largest percentage of expired URLs, and a statistically significant positive correlation was observed between the age of an article and the percentage of inactive URLs: the correlation coefficients of 0.55 in October and 0.52 in May both exceeded the critical value of r (0.4971, df=10). Other factors--such as time-sensitive subject matter, type of site (commercial, government, etc.), and ownership (corporate vs. individual) of the resource--also might have been influential. For example, personal homepages often change locations as the author changes Internet service providers or employers. Other researchers might want to investigate these relationships further.
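The correlation reported above can be checked against table 1. The sketch below assumes that the authors computed Pearson's r between months since publication and percentage of outdated URLs, and that the May figure uses the cumulative percentages from the "Totals" panel; the article does not state either point explicitly. The names TABLE_1 and pearson_r are illustrative.

```python
from math import sqrt

# (months since publication in October 1998,
#  % outdated in October 1998,
#  cumulative % outdated in May 1999), copied from table 1.
TABLE_1 = [
    (13, 29.8, 36.2),  # Health
    (12, 11.1, 22.2),  # Career
    (11, 8.1, 32.4),   # Biotechnology
    (10, 9.5, 9.5),    # Clip art
    (9, 16.7, 19.0),   # Finance
    (8, 23.4, 29.8),   # Social work
    (7, 7.1, 14.3),    # Nutrition
    (6, 19.0, 19.0),   # Distance learning
    (5, 14.3, 26.2),   # Travel
    (4, 10.2, 20.5),   # East Asia
    (2, 0.0, 3.2),     # Refugees
    (1, 4.8, 19.0),    # El Nino
]


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


months_oct = [row[0] for row in TABLE_1]
months_may = [m + 7 for m in months_oct]  # May 1999 is seven months later
r_oct = pearson_r(months_oct, [row[1] for row in TABLE_1])
r_may = pearson_r(months_may, [row[2] for row in TABLE_1])
print(f"October 1998: r = {r_oct:.2f}; May 1999: r = {r_may:.2f}")
```

Under these assumptions both coefficients should land close to the reported values, and each can be compared with the critical value of 0.4971 for twelve articles (df = 10).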

The most common responses for outdated URLs were error messages alone (41.1 percent), followed by pages that automatically redirected users to the new URL (36.4 percent), error messages that also listed the new location (13.1 percent), and, finally, pages that listed the new location and then sent users to the new site (9.3 percent).

In the second part of the analysis, the authors searched for all 482 sites using MetaCrawler. Of these sites, 21 (4.4 percent) could not be found by the authors; this number may have been larger for inexpert or impatient searchers. Additionally, seven resources were retrieved only by locating a page at the same site that linked directly to the resource; usually this was the homepage.

Discussion and Recommendations

Linkrot still seems to be a significant issue with paper-based bibliographies of Web sites. In the future, initiatives such as OCLC's persistent URL (PURL) project and Alexa, which includes an archive of Internet pages, could reduce the difficulty of locating pages which have moved.(7) Undoubtedly search engines also will improve.

Until then, authors of bibliographies should learn how search engines index Web pages and provide specific information to enhance retrievability and increase the long-term usefulness of the bibliographies. For example, while some search engines' databases include every word of a page, others restrict their indexing to data from certain HTML codes, including the TITLE tag (which displays in the browser's title bar), META tags (including author, keywords, and description of the site), and the Alternative Text attribute (ALT) within Image (IMG) tags.(8)

It might not be possible to include all this information in a journal article. Web bibliographies should be kept online whenever feasible. Online directories can be more easily updated, and there also are fewer space constraints than in paper versions so more data about the sites may be included. For example, because the "Internet Resources" column in College & Research Libraries News is also available online, the online version of the East Asian bibliography contained more resources than the paper version.(9)

Based on the findings in this study, the following recommendations are made for Web site bibliographers:

* Use the name of the resource as given in the browser title bar (TITLE tag) as the name of the resource in the bibliography. If the title bar name is misleading, give the site a name but document the actual title bar information within the annotation.

* Document other information from specific HTML codes, including the META tags and the ALT attribute for images, if available.(10) Include site-specific information given in the first few lines of text on the page. This is another frequently indexed area.

* Include the name of the contact person for the site, electronic mail address, and traditional mail address if available.

* Note the "official" or "preferred" URL for the page if one is given, which might be different from the URL for the page shown in the browser's location bar. Some busy sites keep their information on multiple servers and users reaching the "official" URL are redirected to whichever server has the lowest amount of traffic.(11)

* Include the latest date that the resource was found.

* When listing a resource that is not the homepage of a Web site, also include the homepage URL. Some search engines only index homepages.
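The fields named in these recommendations (the TITLE tag, META tags, and ALT attributes) can be collected from a page's source mechanically rather than by reading the source by hand. The following is a minimal sketch using Python's standard html.parser module; the class name IndexedFieldParser and the sample page are hypothetical, and real pages with malformed markup may need more careful handling.

```python
from html.parser import HTMLParser


class IndexedFieldParser(HTMLParser):
    """Collect the title, named META tags, and image ALT text from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}       # e.g. {"keywords": "...", "description": "..."}
        self.alt_texts = []  # ALT text of each image that supplies one
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") and attributes.get("content"):
            self.meta[attributes["name"].lower()] = attributes["content"]
        elif tag == "img" and attributes.get("alt"):
            self.alt_texts.append(attributes["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page source, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Example Subject Guide</title>
<meta name="Keywords" content="subject guide, bibliography">
</head><body>
<img src="banner.gif" alt="Example Library banner">
<img src="spacer.gif">
</body></html>"""

parser = IndexedFieldParser()
parser.feed(SAMPLE_PAGE)
print(parser.title.strip(), parser.meta, parser.alt_texts)
```

A bibliographer could paste the collected values into the annotation alongside the URL, so that a later search on the title or keywords has a better chance of finding a relocated page.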

Knowledge of how search engines index sites can help Web site creators develop strategies to ensure ongoing access to their products.(12) We suggest the following recommendations for Web site creators:

* Plan the site directory structures before posting pages. Planning will decrease the likelihood that the site will need reorganization and subsequent URL changes.

* List contact information on each page, including the name of the person responsible for the content, a traditional mail address, and an electronic mail address if available. This will allow users to ask questions about the site and to report broken links.

* Include the creation date and date of the last revision on each page. This allows users to evaluate the currency of the information.

* Using the TITLE tag, create a unique title for each page in the Web site. The title should reflect the content of the page.

* If a page is moved, post a referral page that includes the new URL and a link to the new site. Consider using the META tag "Refresh" element to move the user automatically to the new site after several seconds. Some owners prefer to redirect users to the new site without a referral page, but in this case users may not realize that the site has changed and will not update their references to the site.

* Make error pages useful: include links to the Web host's homepage, to the site map, and to the site search function if available. These resources may help users find pages that have been relocated. In addition, provide contact information for the Web page owner (electronic mail address, etc.).(13)

* Some Web site creators use images to display textual information about their site (logos, banners, etc.). Search engines, however, do not index the content of images. Use the ALT attribute of the IMG tag to document this information.

* If the same information is kept on multiple servers and users are redirected to whichever server is available, document the central "official" or "preferred" URL in the text of the page.

* Consider using a PURL or registering a permanent domain name or alias for the Web site.(14) A permanent domain name allows Web pages to have URLs that do not change even if the Web site is moved to a different server at a different Internet Protocol address.

* Use relative links to pages within the site so that URLs will work even if the site is moved to a different server. Relative links do not include the host name, only directory and file name information.

* Check the links on the site frequently, either by hand or with link checking software such as Linkbot or CyberSpyder.(15)

* Submit URL changes to the major search engines.
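Two of the recommendations above, the referral page with a META "Refresh" element and the use of relative links, can be illustrated briefly. The sketch below is one possible approach rather than a prescribed method: the function name referral_page, the wording of the notice, the five-second delay, and the example.edu and example.org addresses are all hypothetical.

```python
from html import escape
from urllib.parse import urljoin


def referral_page(new_url, delay_seconds=5):
    """Build a referral page that names the new URL, links to it,
    and forwards the browser there after `delay_seconds` seconds."""
    safe_url = escape(new_url, quote=True)
    return (
        "<html><head>\n"
        "<title>This page has moved</title>\n"
        f'<meta http-equiv="refresh" content="{delay_seconds}; url={safe_url}">\n'
        "</head><body>\n"
        f'<p>This page has moved to <a href="{safe_url}">{safe_url}</a>. '
        "Please update your bookmarks and references.</p>\n"
        "</body></html>\n"
    )


page = referral_page("http://www.example.edu/new/index.html")

# A relative link carries no host name, so it resolves against whichever
# server the page currently lives on and survives a move unchanged.
relative_link = "topics/linkrot.html"
old_target = urljoin("http://old.example.edu/guides/index.html", relative_link)
new_target = urljoin("http://new.example.org/guides/index.html", relative_link)
print(old_target)
print(new_target)
```

Showing the new address in the body of the page, and not only in the refresh element, gives users the chance to notice the change and update their references before they are forwarded.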

Conclusions

Linkrot will never disappear entirely, if only because standard domain names are not free and they are not required for Web publishing. Paper-based bibliographies of Web sites are still useful even if URLs are ephemeral because the information in the annotations can be used to find new locations via search engines. Bibliographers, however, need to provide a detailed enough description of the site to ensure its future retrieval regardless of possible URL changes.

Acknowledgments

The authors wish to thank Judy MacLeod Reardon of Data Research Associates for her assistance in designing this study, Paul DuBois of the Wisconsin Regional Primate Research Center for his editorial comments and help with the statistical analysis of the study, and Loretta Koch, Roland Person, Kathleen Fahey, and Susan Logue of Morris Library, Southern Illinois University Carbondale, for their editorial comments and support.

References and Notes

(1.) Internet Scout Project, www.scout.cs.wisc.edu, accessed Dec. 1, 1999; Librarians' Index to the Internet, http://lii.org, accessed Apr. 13, 2000; INFOMINE: Scholarly Internet Resource Collections, http://infomine.ucr.edu, accessed Dec. 1, 1999; The Argus Clearinghouse, http://clearinghouse.net, accessed Dec. 1, 1999.

(2.) Subash Gandhi, "Proliferation and Categories of Internet Directories: A Database of Internet Subject Directories," Reference and User Services Quarterly 37 (1998): 322.

(3.) S. Mary P. Benbow, "File Not Found: The Problems of Changing URLs for the World Wide Web," Internet Research: Networking Applications and Policy 8 (1998): 247-50.

(4.) Joel D. Kitchens and Pixey Anne Mosley, "A Study of the Effective Shelf Life of Printed Internet Guides," poster session presented at the Annual Meeting of the American Library Association, New Orleans, La., June 1999.

(5.) The URL for MetaCrawler is now www.metacrawler.com, accessed Apr. 13, 2000. It is still part of Go2Net.com, although this information is now tucked away at the bottom of the page. The old URL will pull you to the new URL.

(6.) Steve Lawrence and C. Lee Giles, "Accessibility of Information on the Web," Nature 400, no. 6,740 (1999): 107.

(7.) PURLs act as automatic redirects, using an intermediary. See the Persistent URL Home Page at http://purl.oclc.org, accessed Dec. 1, 1999; Alexa Internet, www.alexa.com, accessed Dec. 1, 1999.

(8.) Search Engine Watch, "Search Engine Features for Searchers," accessed Dec. 1, 1999, www.searchenginewatch.com/facts/ataglance.html.

(9.) C&RL NewsNet Internet Resources, www.ala.org/acrl/resrces.html, accessed Dec. 1, 1999; July/August 1998 Internet Resources, www.ala.org/acrl/resju198.html, accessed Dec. 1, 1999.

(10.) Most browsers allow viewing of the source code. The View menus in Internet Explorer and Netscape Navigator provide this option.

(11.) The Internet Grateful Med (IGM) User Tips page (http://igm.nlm.nih.gov/user-tips.html, accessed Dec. 1, 1999) discusses why users should bookmark only the main URL for the IGM site.

(12.) Search Engine Watch, "Search Engine Features for Webmasters," accessed Dec. 1, 1999, www.searchenginewatch.com/webmasters/features.html.

(13.) For more ideas for improving error message pages, see Jakob Nielsen, "Improving the Dreaded 404 Error Message," accessed Dec. 1, 1999, www.useit.com/alertbox/404_improvement.html, and Brian Kelly, "Web Watch: 404s--What's Missing?" Ariadne 20 (22 June 1999), accessed Dec. 1, 1999, www.ariadne.ac.uk/issue20/404.

(14.) For a discussion of domain names and aliases, see Wallace C. Koehler Jr., "Unraveling the Issues, Actors, and Alphabet Soup of the Great Domain Name Debates," Searcher 7 (May 1999): 16, 18, 20-26. Also available at: www.infotoday.com/searcher/may99/koehler.htm, accessed Apr. 13, 2000.

(15.) Linkbot, www.tetranetsoftware.com, accessed Dec. 1, 1999; CyberSpyder, www.cyberspyder.com, accessed Dec. 1, 1999.

Mary K. Taylor is Assistant Science/Instructional Support Services Librarian (e-mail: mtaylor@lib.siu.edu) and Diane Hudson is Assistant Undergraduate Librarian, both at Southern Illinois University, Carbondale.
