Academic journal article Informatica Economica

On the Performance of Three In-Memory Data Systems for on Line Analytical Processing

Article excerpt

Introduction

In terms of data persistence and processing, Big Data systems [1] [2] [3] encompass a broad variety of technologies such as NoSQL data stores [4] [5] [6], the Hadoop ecosystem [7] [8] and NewSQL [9]. In-memory distributed systems [10] are one of the most recent developments in Big Data technologies. They are meant to close the gap between OLTP and OLAP workloads by serving both in a single system that offers real-time distributed processing and analytics. Some of them are part of the NewSQL strand. They rely on a distributed cache to speed up data processing by limiting the disk I/O bottleneck. As memory prices have steadily decreased over time, an array of database technologies has emerged on the market to take advantage of in-memory structures. In-memory technologies appear either as brand-new distributed systems (e.g. Spark [11], MemSQL [12], Apache Ignite [13], Geode [14], VoltDB [15]) or as a set of features added to classical relational database systems (Oracle, Microsoft SQL Server and MySQL implement in-memory features in their enterprise editions).

The main promise of in-memory persistence concerns data retrieval and processing speed. Currently, the trade-offs stem from missing or poorly implemented essential functionalities such as high availability, storage and transactions. Additionally, cost could raise serious concerns when adopting an in-memory solution.

In-memory data systems have a broad range of use cases, from OLTP (On Line Transactional Processing) to OLAP (On Line Analytical Processing) and even a mixture of the two - HTAP (Hybrid Transactional Analytical Processing).

This paper investigates the data load performance, memory footprint and query performance of three in-memory systems - MemSQL, Oracle and Microsoft SQL Server. The TPC-H benchmark [16] database was used for testing data loading and query performance. Data was randomly generated using the dbGen tool [17] at various scale factors (database loadings). Query performance was assessed at each database scale factor by collecting the execution duration of the 22 queries in the official set provided by TPC-H.
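The excerpt does not describe the harness used to collect execution durations, so the following Python sketch is only an illustration of the general procedure: load a database at several sizes, run each query, and record how long it takes. Everything in it is an assumption made for the example: SQLite's in-memory mode stands in for the three systems under test, the two-column `lineitem` table and the `TOY_QUERIES` are hypothetical placeholders for the TPC-H schema and its 22 official queries, and row counts stand in for TPC-H scale factors.

```python
import sqlite3
import statistics
import time


def time_queries(conn, queries, repeats=3):
    """Return {query_name: median wall-clock seconds} over `repeats` runs."""
    timings = {}
    for name, sql in queries.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(sql).fetchall()  # fetch every row so the query runs to completion
            samples.append(time.perf_counter() - start)
        timings[name] = statistics.median(samples)
    return timings


def build_toy_database(rows):
    """Create an in-memory SQLite table (a stand-in for one loaded scale factor)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lineitem (l_orderkey INTEGER, l_quantity INTEGER)")
    conn.executemany(
        "INSERT INTO lineitem VALUES (?, ?)",
        ((i, i % 50) for i in range(rows)),
    )
    conn.commit()
    return conn


# Hypothetical placeholder queries, NOT the official TPC-H query set.
TOY_QUERIES = {
    "Q_sum": "SELECT SUM(l_quantity) FROM lineitem",
    "Q_group": "SELECT l_quantity, COUNT(*) FROM lineitem GROUP BY l_quantity",
}


def run_benchmark(sizes=(1_000, 10_000)):
    """Time every query at every database size; returns {size: {query: seconds}}."""
    results = {}
    for rows in sizes:
        conn = build_toy_database(rows)
        results[rows] = time_queries(conn, TOY_QUERIES)
        conn.close()
    return results
```

Calling `run_benchmark()` yields one timing per query and size. Taking the median of a few repetitions is a design choice of this sketch, intended to dampen cache warm-up effects; the excerpt does not say whether the study repeated its runs.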

2 In-Memory Database Systems - Current Products and Research

Moving computation from CPU to memory has gained considerable interest in recent years as a solution for overcoming the bandwidth and latency bottleneck by relieving the CPU of some of its tasks [18]. In this new paradigm, the memory chips have both storage and computation capability. Great benefits can be achieved through parallelism in the form of in-memory clusters which use GPU (Graphics Processing Unit) power (e.g. Kinetica [19] or SQream [20]). Mian [21] explores the results of two in-memory data systems deployed for healthcare big data management, i.e. MemSQL and VoltDB. The case study focuses on the support for detecting medical fraud, diagnosing diseases at an early stage and generating actionable insights for patients, providers and physicians. Healthcare datasets are large in volume and unstructured, making ad-hoc querying painfully slow, so in-memory systems came as a natural solution. The paper argued that VoltDB performs slower than MemSQL when returning large numbers of distinct rows. Results were inconclusive when queries were simpler and returned a small number of rows. Also, no details were provided concerning the data loads, the testing methodology or how results were gathered.

Sen et al. [22] describe MemSQL optimization techniques for complex analytical queries (requiring real-time answers) based on heuristics that generate execution plans. The cost-based optimizer can use either a left-deep tree, where the result of a join is used as the outer input for the next join, or a right-deep tree, where the result of a join is used as the inner input to the next join. Bushy plans, in which both inputs of a join can themselves be join results, are generated via query-rewrite. The effectiveness of these techniques is analyzed against TPC-H and TPC-DS [23] queries. …
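The difference between the two tree shapes can be made concrete with a small Python sketch. It is an illustration only and does not reproduce the MemSQL optimizer: the `Join` class and the helpers `left_deep`, `right_deep` and `show` are hypothetical names introduced here, and the table names are placeholders. The sketch follows the convention stated above, i.e. in a left-deep tree each join result becomes the outer input of the next join, and in a right-deep tree it becomes the inner input.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Join:
    """A binary join node; leaves of the tree are plain table names."""
    outer: "Plan"
    inner: "Plan"


Plan = Union[str, Join]


def left_deep(tables):
    """Each join result is the OUTER input of the next join."""
    plan = tables[0]
    for table in tables[1:]:
        plan = Join(outer=plan, inner=table)
    return plan


def right_deep(tables):
    """Each join result is the INNER input of the next join."""
    plan = tables[-1]
    for table in reversed(tables[:-1]):
        plan = Join(outer=table, inner=plan)
    return plan


def show(plan):
    """Render a plan as a parenthesised string so its shape is visible."""
    if isinstance(plan, str):
        return plan
    return f"({show(plan.outer)} JOIN {show(plan.inner)})"
```

For the placeholder tables `["A", "B", "C", "D"]`, `show(left_deep(...))` gives `(((A JOIN B) JOIN C) JOIN D)` and `show(right_deep(...))` gives `(A JOIN (B JOIN (C JOIN D)))`.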
