A Systematic Approach for VSM-Based Web Page Classification

By Li, Lei; Vaishnavi, Vijay K. et al. | Journal of Management Information and Decision Sciences, July 1, 2012

Li, Lei; Vaishnavi, Vijay K.; Vandenberg, Art


ABSTRACT

Effective and efficient Web information classification becomes increasingly important as the Web continues to grow exponentially. The vector space model (VSM) is a traditional algorithm used to classify Web pages. However, VSM's effectiveness relies largely on selecting appropriate parameter values, and such selection is often done in an ad hoc manner. Another challenge is the performance of VSM, especially when the classification algorithm processes large data sets. This paper describes a systematic approach for Web page classification that addresses both of these challenges. A genetic algorithm (GA) is integrated with VSM for the selection of appropriate parameter values, and the integrated algorithm is enhanced to run in a grid-computing infrastructure. By using the GA to discover a set of VSM parameter values, the effectiveness of VSM is greatly improved. By applying grid computing, it becomes possible to deal efficiently with large data sets. A preliminary research prototype has been implemented and used to conduct empirical studies. Results of the experiments are reported and discussed.

Keywords: Web Page Classification, VSM, Genetic Algorithm, Grid, Systematic Approach

(ProQuest: ... denotes formulae omitted.)

INTRODUCTION

There are billions of Web pages on the World Wide Web and the number continues to grow exponentially. People rely on commercial search engines such as Google (www.google.com), Yahoo (www.yahoo.com), and AltaVista (www.altavista.com/) to retrieve information from the Web. Search engine results are often loosely classified according to keyword matches, link analysis, or other mechanisms. Search engines provide a good start for information retrieval but may not be sufficient for complex information inquiry tasks that require relevant classification of a large volume of results. Indeed, numerous studies (Fairthorne 1961; Hayes 1963; Attardi, et al. 1999; Craven, et al. 2000; Flake, et al. 2002; Kennedy and Shepherd 2005; Calado, et al. 2006; Kousha and Thelwall 2007) have demonstrated that appropriate classification of search results can greatly improve the efficiency of information retrieval.

Consider the following real-world use case, which we subsequently refer to as the SURA project. SURA (Southeastern Universities Research Association, www.sura.org) has a strategic initiative to develop cyberinfrastructure such as grids for its 62 member universities. SURA wants a knowledge base of all researchers who have grid-enabled applications or grid-potential activities. Since search results from commercial search engines can include too much irrelevant information, an initial knowledge base of faculty research pages from nine SURA sites was compiled manually, a task that required dedicated staff to spend several months browsing each university Web site, locating and evaluating relevant pages about faculty researchers, and storing the results in a database. While initially useful, such a database quickly grows stale as pages change or whole sites revise their content. Moreover, SURA ultimately wants all 62 university Web sites to be processed in the same way, and doing so manually is impractical and does not scale: each university site contains millions of Web pages.

To address the problem posed by the SURA project, it is important to develop an effective and efficient approach to automatically classify large volumes of Web pages. In this paper, we introduce a systematic approach that integrates a vector space model (VSM) based classification algorithm, a genetic algorithm, and a grid computing infrastructure to classify Web pages effectively and efficiently.
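To make the idea of pairing VSM with a genetic algorithm concrete, the following is a minimal sketch and not the authors' implementation. It assumes raw term counts as the vector representation, cosine similarity as the match score, and a single tunable parameter (a similarity threshold) that a toy genetic search selects by classification accuracy on a few labeled pages. All function names, the sample texts, and the GA settings are hypothetical; the paper's actual parameter set and GA design are not shown in this excerpt.

```python
import math
import random
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split on whitespace, and count terms."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def accuracy(threshold, profile, labeled_pages):
    """Fraction of (text, is_relevant) pairs classified correctly."""
    profile_vec = term_frequencies(profile)
    correct = 0
    for text, is_relevant in labeled_pages:
        score = cosine_similarity(term_frequencies(text), profile_vec)
        correct += (score >= threshold) == is_relevant
    return correct / len(labeled_pages)


def tune_threshold(profile, labeled_pages, generations=20, pop_size=10, seed=0):
    """Toy genetic search over one parameter (assumed: the threshold)."""
    rng = random.Random(seed)

    def fitness(t):
        return accuracy(t, profile, labeled_pages)

    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            child = (p1 + p2) / 2 + rng.gauss(0, 0.05)  # crossover + mutation
            children.append(min(1.0, max(0.0, child)))  # clamp to [0, 1]
        population = parents + children
    return max(population, key=fitness)


# Hypothetical category profile and labeled sample pages.
PROFILE = "grid computing research cluster parallel"
LABELED_PAGES = [
    ("our lab studies grid computing and parallel cluster research", True),
    ("grid computing research projects", True),
    ("campus dining menu and parking", False),
    ("football schedule and tickets", False),
]

if __name__ == "__main__":
    best = tune_threshold(PROFILE, LABELED_PAGES)
    print("threshold:", best, "accuracy:", accuracy(best, PROFILE, LABELED_PAGES))
```

In this sketch each fitness evaluation is independent of the others, which is the kind of workload that could in principle be distributed across grid nodes when the labeled data set is large.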

The rest of the paper is organized as follows. Section 2 provides an overview of related research about Web page classification. Section 3 discusses the proposed research approach in detail and presents the prototype development work. Section 4 describes the experiment design and results. …
