The Infocious Web Search Engine: Improving Web Searching through Linguistic Analysis
Ntoulas, Alexandros; Chao, Gerald; Cho, Junghoo. Journal of Digital Information Management
ABSTRACT: In this paper we present the Infocious Web search engine, which currently indexes more than 2 billion pages collected from the Web. The main goal of Infocious is to enhance the way that people find relevant information on the Web by resolving ambiguities present in natural language text. Towards this goal, Infocious performs linguistic analysis on the content of Web pages prior to indexing and exploits the output of this analysis when ranking and presenting the results to users. Our hope is that this additional step of linguistic processing provides Infocious with two main advantages. First, Infocious tries to gain a deeper understanding of the content of Web pages and to better match users' queries with the indexed documents, improving the relevance of the returned results. Second, based on its linguistic processing, Infocious tries to organize and present the results to users in a structured and more intuitive way. In this paper we present the linguistic processing technologies that we investigated and/or incorporated into the Infocious search engine, and we discuss the main challenges in applying these technologies to Web documents. We also present the various components in the architecture of Infocious and how each of these components benefits from the added linguistic processing. Finally, we present preliminary results from our experimental study that evaluates the effectiveness of the described linguistic analysis.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic Processing; H.3.5 [Online Information Services]: Web-based services; I.2.7 [Natural Language Processing]
Web information retrieval, Search engines, Natural Language Processing, Linguistic analysis
Keywords: Infocious search engine, Linguistic processing, Web search engines, Web pages
Millions of users today use Web search engines as the primary (and oftentimes the sole) means for locating relevant and interesting information. They rely on search engines to satisfy a broad variety of informational needs, ranging from researching medical conditions to locating a convenience store to comparing available products and services. The most popular search engines today (e.g., Ask, Google, Live, and Yahoo!) maintain a fresh local repository of the ever-increasing Web. Once a user issues a query, search engines go through their enormous repository and identify the documents most relevant to the user's query.
While the exact process of identifying and ranking the relevant documents for current major Web search engines is a closely guarded secret, search engines generally match the keywords present in the user's query with the keywords in the Web pages and their anchor text (possibly after stemming) in order to identify the pages relevant to the query. In addition, search engines often exploit the link structure of the Web to determine some notion of "popularity" for every page, which is used during the ranking of results. In most cases, simple keyword-based matching can work very well in serving the users' needs, but there are queries for which keyword matching may not be sufficient.
As an example, consider the query jaguar that a user might issue to a search engine. Typically, the major search engines may return results that deal with at least three disjoint topics: (1) Jaguar--the car brand name, (2) Jaguar--one version of MacOS X, (3) Jaguar--the animal. As one can imagine, it is highly unlikely that a user is interested in all three of the above at the same time.
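To make the keyword-matching behavior concrete, the following is a minimal sketch of the generic process described above, not the implementation of any actual search engine. The toy corpus, the URLs, the `popularity` scores, and the crude `stem` function are all hypothetical; a real engine would use a proper stemmer and would derive popularity from the link structure of the Web.

```python
from collections import defaultdict

def stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer (assumption).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Lowercase, split on whitespace, and stem every token.
    return [stem(w) for w in text.lower().split()]

def build_index(pages):
    # Inverted index: stemmed keyword -> set of URLs whose body text
    # or anchor text contains that keyword.
    index = defaultdict(set)
    for url, page in pages.items():
        for term in tokenize(page["text"] + " " + page["anchor_text"]):
            index[term].add(url)
    return index

def search(query, index, pages):
    # A page matches if it contains every query keyword; matches are
    # ordered by a precomputed popularity score (hypothetical values).
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches, key=lambda u: pages[u]["popularity"], reverse=True)

# Hypothetical toy corpus covering three unrelated senses of "jaguar".
PAGES = {
    "a.example": {"text": "jaguar cars and dealers",
                  "anchor_text": "luxury car", "popularity": 0.9},
    "b.example": {"text": "installing jaguar on a mac",
                  "anchor_text": "macos x", "popularity": 0.7},
    "c.example": {"text": "jaguars hunt at night",
                  "anchor_text": "big cats", "popularity": 0.5},
    "d.example": {"text": "convenience stores nearby",
                  "anchor_text": "store locator", "popularity": 0.8},
}
INDEX = build_index(PAGES)

print(search("jaguar", INDEX, PAGES))
print(search("jaguar car", INDEX, PAGES))
```

In this sketch the single-keyword query matches the car, operating-system, and animal pages alike, and the popularity ordering alone cannot tell which sense the user intended. Adding a second keyword narrows the match, but only because the user supplied the disambiguating term.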
The query jaguar is an example of an ambiguous query because it is associated with multiple senses, each one pertaining to a different topic of interest. As a consequence, Web pages that discuss distinct topics but all share the same keywords may be considered relevant and presented to the user all at the same time. …