Journal of Digital Information Management

Soft-404 Pages, a Crawling Problem

Article excerpt

1. Introduction

Currently, the WWW constitutes the biggest repository of information ever built, and it is continuously growing. According to the study presented by Gulli and Signorini [17] in 2005, the Web consisted of billions of pages. Three years later, in 2008, according to the official Google blog (1), the Web contained 1 trillion unique URLs. Due to its large size, search engines are essential tools for users who want to access information relevant to a specific query. The set of tasks performed by a search engine is very complex, and many studies analyse the architecture of each of its main parts.

Baeza-Yates and Ribeiro-Neto discuss in [3] the architecture of a search engine. In short, these are its main components:

--Crawling Process [30]: Responsible for downloading pages from the Web automatically and storing them in a repository.

--Indexing Process: Responsible for creating an index from the repository pages. This index supports answering user queries. The most widely used index structure is an inverted index, composed of all the distinct words of the collection and, for each word, the list of documents that contain it (a brief sketch follows this list).

--Retrieval & Ranking Processes: The first retrieves the documents that satisfy a user query; the second ranks them, so that those most relevant to the user are returned.
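To make the inverted index concrete, the following is a minimal sketch, not taken from the article; the document collection, the whitespace tokenization and the Python data structures are illustrative assumptions, and production indexers apply far more elaborate normalization and compression.

from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: text}; whitespace tokenization is a simplifying assumption
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # store sorted posting lists, as real inverted indexes do
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "soft 404 pages waste crawler resources",
        2: "the crawler downloads pages from the web"}
index = build_inverted_index(docs)
print(index["pages"])    # [1, 2]
print(index["crawler"])  # [1, 2]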

The different tasks of a search engine involve significant challenges in the treatment of vast amounts of information. Among these challenges, specific aspects can be highlighted, such as the technologies used in web pages to access data, both on the server side [34] and on the client side [8], or problems associated with web content, such as Web Spam [18] or duplicated content [26]. We want to highlight the article by Cambazoglu and Baeza-Yates [10], which presents these challenges through an architectural classification, starting from a simple single-node search system and moving towards a multi-site web search architecture. In this study we focus on improving the crawling process to make it more efficient.

Several studies try to improve the performance of the crawler, for example by changing its architecture by means of distributed systems [24] [1] (explained in more detail in Section 2). However, there is a complementary way to improve the efficiency of crawling systems: minimizing the use of resources [20], that is, for instance, reducing the garbage content they have to process. This garbage consists of: a) Web Spam [18] [15]; b) incorrectly identified dead links (Soft-404 pages [5], parking pages); c) duplicate content [26]; etc.

In this article we study a method to avoid part of this garbage content. More precisely, we focus on Soft-404 pages, the set of web pages that some web servers return with a 200 HTTP status code when a nonexistent resource is requested. In Section 6 we will show that many of these pages come from parking domains. These pages have no content, or are generated automatically with many anchors and advertisements to obtain revenue. This causes a crawling system to believe that the page exists and must be processed and indexed, with the consequent waste of resources.
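As background, one widely used heuristic for flagging such hosts, distinct from the content-based approach proposed in this article, is to request a URL on the same host that almost certainly does not exist and check whether the server still answers 200. The sketch below illustrates this probe; the use of the requests library, the timeout value and the probe URL are assumptions for illustration.

import uuid
import requests
from urllib.parse import urljoin

def host_returns_soft_404(base_url, timeout=10):
    # Probe with a random path that is extremely unlikely to exist.
    random_path = "/" + uuid.uuid4().hex + ".html"
    try:
        response = requests.get(urljoin(base_url, random_path),
                                timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return False  # unreachable host: cannot decide, assume well-behaved
    # A well-behaved server should answer 404; a 200 here marks a Soft-404 host.
    return response.status_code == 200

print(host_returns_soft_404("http://example.com/"))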

So, protection mechanisms should be created for: a) final users, who waste their time and perhaps their money; and b) the companies that provide search engines. The latter are especially affected, since they not only lose prestige when Soft-404 pages are shown among their results, but they also waste money and resources analysing and indexing these types of pages.

We propose a system, called Soft404 Detector, based on web content analysis, to identify Soft-404 pages. The idea is to include the proposed system as a module of a crawling system, so that the crawler avoids storing, processing and indexing these types of pages. …
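The exact features and classifier of Soft404 Detector are not given in this excerpt; the following is only a minimal sketch of what a content-based detector module could look like, assuming bag-of-words features over page text and a generic classifier. The training examples and the scikit-learn pipeline are hypothetical, not the article's method.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: page texts labeled 1 (Soft-404) or 0 (normal).
pages = [
    "page not found the requested url does not exist on this server",
    "domain for sale buy now related links sponsored listings",
    "breaking news local team wins championship after dramatic final",
    "recipe for homemade bread flour water yeast salt instructions",
]
labels = [1, 1, 0, 0]

# The real detector may rely on other signals (anchors, ads, page structure).
detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression())
detector.fit(pages, labels)

candidate = "sorry this page could not be found error"
if detector.predict([candidate])[0] == 1:
    print("Soft-404 suspected: skip storing and indexing this page.")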
