Magazine article Online

Mining Meets the Web

Article excerpt

Peggy Zorn (Peggy.Zorn@wl.com) and Mary Emanoil (Mary.Emanoil@wl.com) are Manager, Document Administration Services, and Documentum Consultant, respectively, at Parke-Davis Pharmaceutical Research. Lucy Marshall (edgeinfo@daneris.com) is with Edge Information Services, and Mary Panek (Mary.Panek@carrier.utc.com) is Information Manager at United Technologies Research Center.

For most industries, very large databases holding critical information are not new, and pulling out the data that is needed, when it is needed, has been an age-old challenge. It is estimated that typical Fortune 500 companies manage over a terabyte of electronic information each day, with annual growth projected at 57% [1]. Most companies track sales, marketing, and other financial data in large databases, often referred to as "data warehouses." These large databases allow employees to retrieve specific portions of the data or to perform statistical tests on the data to forecast trends. Data mining technologies have been the standard choice for retrieving information from these types of databases, and their use is expanding, with well-established vendors such as SAS and Oracle providing the major products. Data mining use is increasing due to a variety of factors:

* Trend toward use of "data warehouses" (see sidebar) for consolidation and management of large sets of related data in organizations

* Explosion in amount of information that is captured electronically

* Dramatic price decreases in data storage hardware

* Focus on knowledge management in organizations, which has increased pressure to share captured electronic data and use it as a competitive advantage

WHAT IS DATA MINING?

Data mining can be defined as analyzing the data in large databases to identify trends, similarities, and patterns to support managerial decision making. Data mining technologies generally use algorithms and advanced statistical models to analyze data according to rules set forth by the particular application at hand. Data mining models fall into three basic categories: classification, clustering, and associations and sequencing (see Figure 1).

* Classification--involves analyzing data and assigning it to predefined concept categories or "tags," based on predefined rules. Automatically assigning controlled vocabulary terms to records based on word occurrence is an example of classification.

* Clustering--similar to classification in that different concept categories are identified through analysis of the data using distance or proximity measures; however, no predefined groups are used. All groups are auto-generated through patterns identified in the data. Clustering could be used to dynamically create a controlled vocabulary based on patterns present in the data and then format retrieval groups according to the vocabulary terms or concept categories.

* Associations and sequencing--generate descriptive models based on the data that identify rules to allow for prediction of future trends. Associations and sequences allow for modeling of "if, then" scenarios based on patterns identified in the data.
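To make the "if, then" idea behind associations concrete, here is a minimal Python sketch that scores simple one-item rules by support (how often the items occur together) and confidence (how often the "then" item appears when the "if" item does). The transactions, function names, and thresholds are invented for illustration; commercial data mining products use far more scalable algorithms than this brute-force loop.

```python
from itertools import combinations

# Hypothetical market-basket data: each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    return count(antecedent | consequent, transactions) / base

def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (if_item, then_item, support, confidence) for rules that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        # Test the rule in both directions: "if a then b" and "if b then a".
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            c = confidence({lhs}, {rhs}, transactions)
            if s >= min_support and c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules

for lhs, rhs, s, c in pair_rules(transactions):
    print(f"if {lhs} then {rhs}: support={s:.2f}, confidence={c:.2f}")
```

On this toy data only the bread and butter pairing clears both thresholds, in either direction. Raising or lowering the thresholds changes how many rules survive, which is the main tuning decision in this kind of analysis.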

All data mining models can be predictive and are often used for forecasting of future behavior.
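The classification example mentioned earlier, automatically assigning controlled vocabulary terms to records based on word occurrence, can be sketched in a few lines of Python. The vocabulary, trigger words, and sample record below are hypothetical, and a production system would likely use weighted or statistical rules rather than a single-word match.

```python
import re

# Hypothetical controlled vocabulary: each term maps to its trigger words.
VOCABULARY_RULES = {
    "Data Mining": {"mining", "clustering", "classification"},
    "Data Warehousing": {"warehouse", "warehouses"},
    "Forecasting": {"forecast", "forecasting", "trend", "trends"},
}

def tokenize(text):
    """Lowercase the text and return the set of alphabetic words in it."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(text, rules=VOCABULARY_RULES):
    """Return, in sorted order, every term with at least one trigger word in the text."""
    words = tokenize(text)
    return sorted(term for term, triggers in rules.items() if words & triggers)

record = "Sales data in the warehouse is used to forecast consumer trends."
print(classify(record))  # ['Data Warehousing', 'Forecasting']
```

Because the categories and rules are fixed in advance, this is classification rather than clustering: a new record can only ever receive terms that already exist in the vocabulary.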

Data mining applications in the sales and marketing, actuarial, strategic planning, and risk-benefit analysis areas are prevalent. Analysis of sales data over a period of time can be used to predict future consumer trends and expected profit levels. Predictive statistical models run against huge databases form the basis of actuarial work in all areas, and strategic planning and risk-benefit analysis rely heavily on analysis of large sets of past data to forecast future trends. While these applications for data mining are maturing, a growing market for mining and management of Web data is emerging. …
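As a small illustration of using past sales data to predict a future value, the following Python sketch fits a straight line to a series by ordinary least squares and extends it one period ahead. The sales figures and function names are invented for illustration, and a linear trend is only one of many possible models; real forecasting work would typically account for seasonality and uncertainty as well.

```python
def fit_line(ys):
    """Least-squares fit of y = intercept + slope * x, with x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def forecast(ys, periods_ahead=1):
    """Extend the fitted line periods_ahead steps past the last observation."""
    intercept, slope = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + periods_ahead)

# Hypothetical monthly sales totals (in thousands of dollars).
monthly_sales = [100.0, 110.0, 120.0, 130.0]
print(forecast(monthly_sales))
```

With these perfectly linear figures the forecast for the next month continues the pattern of the previous four; noisier data would give a line that passes near, but not through, the observations.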
