The Infocious Web Search Engine: Improving Web Searching through Linguistic Analysis

By Ntoulas, Alexandros; Chao, Gerald et al. | Journal of Digital Information Management, October 2007

Ntoulas, Alexandros; Chao, Gerald; Cho, Junghoo. Journal of Digital Information Management


ABSTRACT: In this paper we present the Infocious Web search engine [23], which currently indexes more than 2 billion pages collected from the Web. The main goal of Infocious is to enhance the way that people find relevant information on the Web by resolving ambiguities present in natural language text. Towards this goal, Infocious performs linguistic analysis on the content of Web pages prior to indexing and exploits the output of this analysis when ranking and presenting the results to users. Our hope is that this additional step of linguistic processing provides Infocious with two main advantages. First, Infocious tries to gain a deeper understanding of the content of Web pages and to better match users' queries with the indexed documents, improving the relevance of the returned results. Second, based on its linguistic processing, Infocious tries to organize and present the results to users in a structured and more intuitive way. In this paper we present the linguistic processing technologies that we investigated and/or incorporated into the Infocious search engine, and we discuss the main challenges in applying these technologies to Web documents. We also present the various components in the architecture of Infocious and how each of these components benefits from the added linguistic processing. Finally, we present preliminary results from our experimental study that evaluates the effectiveness of the described linguistic analysis.

Categories and Subject Descriptors

H.3.1 [Content Analysis and Indexing]: Linguistic processing; H.3.5 [Online Information Services]: Web-based services; I.2.7 [Natural Language Processing]

General Terms

Web information retrieval, Search engines, Natural Language Processing, Linguistic analysis

Keywords: Infocious search engine, Linguistic processing, Web search engines, Web pages

1. Introduction

Millions of users today use Web search engines as the primary (and often the sole) means for locating relevant and interesting information. They rely on search engines to satisfy a broad variety of informational needs, ranging from researching medical conditions to locating a convenience store to comparing available products and services. The most popular search engines today (e.g., Ask [3], Google [21], Live [38], and Yahoo! [55]) maintain a fresh local repository of the ever-increasing Web. Once a user issues a query, a search engine goes through its enormous repository and identifies the documents most relevant to the user's query.

While the exact process by which current major Web search engines identify and rank relevant documents is a closely guarded secret, search engines generally match the keywords present in the user's query with the keywords in the Web pages and their anchor text (possibly after stemming) in order to identify the pages relevant to the query. In addition, search engines often exploit the link structure of the Web to determine some notion of "popularity" for every page, which is used during the ranking of results. In most cases, simple keyword-based matching can work very well in serving the users' needs, but there are queries for which keyword matching may not be sufficient.
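As a rough illustration of the generic matching described above, the following Python sketch builds a tiny inverted index over page text and anchor text, stems terms, and ranks matches by a popularity score. Everything in it is an assumption made for illustration: the pages and URLs are hypothetical, `stem` is a crude suffix stripper standing in for a real stemmer, and the in-link count is only a simple proxy for link-based popularity. It does not reflect how any actual search engine, including Infocious, is implemented.

```python
import re
from collections import defaultdict


def stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and stem each token."""
    return [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


# Hypothetical pages: body text, anchor text of links pointing at the page,
# and an in-link count used as a toy popularity score.
PAGES = {
    "a.example/cars": {"text": "Used cars for sale",
                       "anchors": ["cheap car deals"], "inlinks": 12},
    "b.example/garage": {"text": "Car repair garage",
                         "anchors": [], "inlinks": 3},
    "c.example/cats": {"text": "Feeding cats",
                       "anchors": ["cat food"], "inlinks": 7},
}


def build_index(pages):
    """Map each stemmed term to the set of URLs whose text or anchors contain it."""
    index = defaultdict(set)
    for url, page in pages.items():
        combined = page["text"] + " " + " ".join(page["anchors"])
        for term in tokenize(combined):
            index[term].add(url)
    return index


def search(query, pages, index):
    """Return URLs containing every query term, most popular first."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches, key=lambda url: (-pages[url]["inlinks"], url))


INDEX = build_index(PAGES)
print(search("cars", PAGES, INDEX))       # ['a.example/cars', 'b.example/garage']
print(search("car deals", PAGES, INDEX))  # ['a.example/cars']
```

In this sketch the query "car deals" matches the first page only through its anchor text, and the query "cars" matches "car" on the second page only because of stemming.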

As an example, consider the query jaguar that a user might issue to a search engine. Typically, the major search engines may return results that deal with at least three disjoint topics: (1) Jaguar--the car brand name, (2) Jaguar--one version of MacOS X, (3) Jaguar--the animal. As one can imagine, it is highly unlikely that a user is interested in all three of the above at the same time.
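The jaguar case can be reproduced on a toy scale. The Python sketch below runs a plain keyword match over a handful of hypothetical page titles and then groups the hits by topic. The titles and the topic labels are invented and hand-assigned purely for illustration; nothing here shows how a real engine would derive such labels.

```python
from collections import defaultdict

# Hypothetical page titles with hand-assigned topic labels (assumption).
PAGES = [
    ("Jaguar unveils a new luxury sedan", "car brand"),
    ("Upgrading to Mac OS X Jaguar", "operating system"),
    ("The jaguar hunts along rainforest rivers", "animal"),
    ("Leopard habitats in Africa", "animal"),
]


def keyword_match(query, pages):
    """Return every page whose title contains the query as a whole word."""
    q = query.lower()
    return [(text, topic) for text, topic in pages
            if q in text.lower().split()]


def group_by_topic(hits):
    """Group matched titles under their topic label."""
    groups = defaultdict(list)
    for text, topic in hits:
        groups[topic].append(text)
    return dict(groups)


hits = keyword_match("jaguar", PAGES)
for topic, titles in group_by_topic(hits).items():
    print(topic, "->", titles)
```

A single keyword retrieves three pages spread over three unrelated topics, and keyword matching alone gives no basis for preferring one topic over another.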

The query jaguar is an example of an ambiguous query because it is associated with multiple senses, each one pertaining to a different topic of interest. As a consequence, Web pages that discuss distinct topics but all share the same keywords may be considered relevant and presented to the user all at the same time. …
