Magazine article Online

Mining Meets the Web

Article excerpt


For most industries, the existence of very large databases holding critical information is not new, and pulling out the data that is needed, when it is needed, has been an age-old challenge. It is estimated that a typical Fortune 500 company manages over a terabyte of electronic information each day, with annual growth projected at 57% [1]. Most companies track sales, marketing, and other financial data in large databases, often referred to as "data warehouses." These large databases allow employees to retrieve specific portions of the data or to run statistical tests on the data to forecast trends. Data mining technologies have been the standard choice for retrieving information from these types of databases, and their use is expanding, with well-established vendors such as SAS and Oracle providing the major products. Data mining use is increasing due to a variety of factors:

Trend toward use of "data warehouses" (see sidebar) for consolidation and management of large sets of related data in organizations

Explosion in amount of information that is captured electronically

Dramatic price decreases in data storage hardware

Focus on knowledge management in organizations, which has increased pressure to share captured electronic data and use it as a competitive advantage


Data mining can be defined as analyzing the data in large databases to identify trends, similarities, and patterns to support managerial decision making. Data mining technologies generally use algorithms and advanced statistical models to analyze data according to rules set forth by the particular application at hand. Data mining models fall into three basic categories: classification, clustering, and associations and sequencing (see Figure 1).

Classification: involves analyzing data and assigning it to predefined concept categories or "tags," based on predefined rules. Automatically assigning controlled vocabulary terms to records based on word occurrence is an example of classification.

Clustering: similar to classification in that different concept categories are identified through analysis of the data using distance or proximity measures; however, no predefined groups are used. All groups are autogenerated from patterns identified in the data. Clustering could be used to dynamically create a controlled vocabulary based on patterns present in the data and then format retrieval groups according to the vocabulary terms or concept categories.

Associations and sequencing: generate descriptive models based on the data that identify rules to allow for prediction of future trends. Associations and sequences allow for modeling of "if, then" scenarios based on patterns identified in the data.

All data mining models can be predictive and are often used for forecasting of future behavior.
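
To make the three categories concrete, here is a minimal Python sketch on a handful of made-up text records. Everything in it is an illustrative assumption rather than anything the article or a vendor product prescribes: the sample records, the keyword rules in RULES, the Jaccard word-overlap measure, the greedy single-pass grouping, and the simple pairwise "if A, then B" rules with a confidence score.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records; this data is invented for illustration.
RECORDS = [
    "quarterly sales forecast for retail stores",
    "retail sales rose after marketing campaign",
    "actuarial risk model for insurance claims",
    "insurance claims risk analysis",
]

# Classification: predefined tags, each with predefined keyword rules (assumed).
RULES = {"Sales": {"sales", "retail"}, "Risk": {"risk", "actuarial", "claims"}}

def classify(text):
    """Return every predefined tag whose keywords occur in the text."""
    words = set(text.lower().split())
    return sorted(tag for tag, keywords in RULES.items() if words & keywords)

def jaccard(a, b):
    """Proximity measure: shared words divided by all distinct words."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def cluster(records, threshold=0.15):
    """Clustering: no predefined groups. Each record joins the first group
    whose first member is close enough, otherwise it starts a new group."""
    groups = []
    for rec in records:
        for group in groups:
            if jaccard(rec, group[0]) >= threshold:
                group.append(rec)
                break
        else:
            groups.append([rec])
    return groups

def association_rules(records, min_support=2, min_confidence=0.8):
    """Associations: 'if word x appears, then word y appears' rules,
    kept when the pair is frequent enough and the rule is reliable enough."""
    word_sets = [set(r.split()) for r in records]
    singles = Counter(w for ws in word_sets for w in ws)
    pairs = Counter(p for ws in word_sets for p in combinations(sorted(ws), 2))
    rules = []
    for (a, b), count in pairs.items():
        if count < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / singles[x]
            if confidence >= min_confidence:
                rules.append((x, y, confidence))
    return sorted(rules)

if __name__ == "__main__":
    for record in RECORDS:
        print(classify(record), "<-", record)
    print(cluster(RECORDS))
    print(association_rules(RECORDS))
```

The contrast to notice is where the categories come from: classify() can only ever return the tags listed in RULES, while cluster() discovers its groups from the records themselves, and association_rules() produces the kind of "if, then" statements that could feed a forecast.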

Data mining applications in the sales and marketing, actuarial, strategic planning, and risk-benefit analysis areas are prevalent. Analysis of sales data over a period of time can be used to predict future consumer trends and expected profit levels. Predictive statistical models run against huge databases form the basis of actuarial work in all areas, and strategic planning and risk-benefit analysis rely heavily on analysis of large sets of past data to forecast future trends. While these applications for data mining are maturing, a growing market for mining and management of Web data is emerging.


The huge amounts of text and documents available on the Web, both on the Internet and on intranets, are ripe for some type of management and organization. Much of this content is "unstructured," and exists in the form of Web pages or documents. Some estimates place the ratio of structured to unstructured information currently stored electronically at 10% structured and 90% unstructured [2], and this trend is expected to continue. …
