Academic journal article Informatica Economica

Physical Integration of Heterogeneous Web Based Data

Article excerpt

Introduction

Nowadays it is generally accepted that the main source of information is the Internet. The problem is that while the quantity of data is ever increasing, the means of exploring it have not developed at the same pace. According to [1], less than 1% of all existing data worldwide is ever analyzed, while the total amount of data multiplies every few years according to [2], [3] and [4]. Since databases are not designed to work with heterogeneous data, as [5] mentions, an extra step, commonly known as the data integration layer, is needed to prepare the information for analysis. This step can traditionally be implemented either by building a virtual mapping of the data sources that offers the data on demand as a view [6], or through some means of physically integrating the heterogeneous data sources, as described by [7], which is then used to supply the decision support system with the required content. The advantage of the former method is that the integrated data does not need to be stored separately, offering a real-time perspective on the gathered information.

Even if data virtualization offers considerable benefits, it cannot be used in every situation. This is the case for data gathered from the Internet, which is usually collected either through custom-made web crawlers or through automation tools that provide predefined procedures, speeding up the development process while also reducing the number of errors.
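As a minimal sketch of the extraction step such a crawler performs, the snippet below pulls headings out of a semi-structured HTML fragment using Python's standard `html.parser`. The page structure (an `<h2 class="title">` element per item) is a hypothetical example, not one taken from the article; a real crawler would first download the page over HTTP.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

# In a real crawler this HTML would come from an HTTP request;
# a static fragment stands in for a downloaded page here.
page = """
<html><body>
  <h2 class="title">First article</h2>
  <p>Body text</p>
  <h2 class="title">Second article</h2>
</body></html>
"""

extractor = TitleExtractor()
extractor.feed(page)
print(extractor.titles)  # ['First article', 'Second article']
```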

The reason why data virtualization is not a feasible solution in this case lies in several web-specific factors. First of all, the data to be extracted is not stored in any database; rather, most of the time we are dealing with semi-structured or even unstructured data which must be gathered from multiple web pages, requiring a considerable amount of time. This can also include images, videos or other large documents that impact data collection speed. Another factor is the problem of identifying and removing duplicate records that might be present in multiple data sources, which further slows data retrieval and thus makes virtualization unfeasible.
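One common way to identify duplicate records across sources is to hash a normalized form of each record, so that rows differing only in whitespace or letter case collapse to the same key. The sketch below illustrates the idea under that assumption; the field names are hypothetical.

```python
import hashlib

def record_key(record):
    """Hash the normalized field values so that records differing
    only in whitespace or letter case count as duplicates."""
    normalized = "|".join(str(v).strip().lower() for v in record.values())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(records):
    """Keep the first occurrence of each distinct record key."""
    seen = set()
    unique = []
    for rec in records:
        key = record_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

crawled = [
    {"title": "Data Integration", "source": "site-a"},
    {"title": "  data integration ", "source": "Site-A"},  # near-duplicate
    {"title": "Data Virtualization", "source": "site-b"},
]
print(len(deduplicate(crawled)))  # 2
```

Hash-based matching only catches exact duplicates after normalization; fuzzier matches (typos, reordered words) would need similarity measures instead.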

2 Data Integration Techniques

As stated by [8] there are multiple data integration techniques that can be used to obtain a centralized view on the required information. The most common integration techniques are: manual integration, middleware solutions, data virtualization and physical data integration/data warehousing.

2.1 The manual approach requires the user to collect the data, apply the organization's validation and cleansing standards, and then load it into the database. It is only recommended for small datasets, or for handling exceptional cases which the integration software fails to treat. Its drawbacks are the slow speed and the high cost per record, making it an unfeasible solution for large datasets.

2.2 Middleware solutions act as a bridge between multiple data sources, allowing two-way communication with the involved systems. Besides facilitating this communication, they also implement data conversion, mapping and cleansing techniques. Middleware is highly recommended for large enterprises with multiple data source systems, where a central view is necessary for monitoring business-wide indicators that provide knowledge for higher management. A well implemented middleware system can provide real-time information on the status of the organization and often represents a valuable competitive advantage.
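The conversion and mapping role of such a layer can be sketched as renaming source-specific fields into a shared schema and normalizing value types on the way. The two source schemas (a CRM and an ERP) and their field names below are invented for illustration.

```python
# Hypothetical field mappings from two source schemas to a common view.
CRM_MAP = {"cust_name": "customer", "rev": "revenue"}
ERP_MAP = {"client": "customer", "turnover": "revenue"}

def to_common(record, mapping):
    """Rename source-specific fields to the shared schema and
    convert the revenue figure to float."""
    out = {mapping[k]: v for k, v in record.items() if k in mapping}
    out["revenue"] = float(out["revenue"])
    return out

crm_row = {"cust_name": "Acme", "rev": "1200.50"}
erp_row = {"client": "Acme", "turnover": 900}

merged = [to_common(crm_row, CRM_MAP), to_common(erp_row, ERP_MAP)]
print(merged[0]["revenue"] + merged[1]["revenue"])  # 2100.5
```

A production middleware system would of course add validation, error handling and bidirectional synchronization on top of this mapping core.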

In contrast, it comes at a high cost and, because of its inherent complexity, it requires a lot of time to implement [9], resulting in a high chance that the project will end in failure.

2.3 Data virtualization represents an alternative to ETL systems that continues to grow in popularity and which, unlike data warehouses, can provide real-time information about the underlying systems. …
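The core idea, that rows are pulled from the live sources at query time instead of being copied into a separate store, can be sketched with a lazy federated view. The source adapters below are hypothetical stand-ins for queries against live systems.

```python
def virtual_view(*sources):
    """Yield rows lazily from each source at query time instead of
    materializing them into a separate store."""
    for source in sources:
        yield from source()

# Hypothetical source adapters; in practice these would query live systems.
def sales_db():
    return [{"id": 1, "amount": 100}]

def web_orders():
    return [{"id": 2, "amount": 250}]

rows = list(virtual_view(sales_db, web_orders))
print(sum(r["amount"] for r in rows))  # 350
```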
