paper review: HadoopDB 

The paper I’m reviewing here is titled “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads” by A. Abouzeid et al.

Motivation

MapReduce systems such as Hadoop provide a very flexible data-processing framework in which data does not need to be modeled before analytic work can start. At run time, the system automatically parallelizes a job into tasks and handles load balancing and failure recovery. Parallel DBMSs, on the other hand, have sophisticated cost-based optimization techniques built in, which can yield orders-of-magnitude speedups over Hadoop. However, parallel DBMSs need the data modeled with a schema before useful programs can run, so that the system has enough information to optimize queries.

MapReduce was designed with fault tolerance and unstructured-data analytics in mind; structured-data analysis tools on top of it, such as Hive, emerged later. Historically, parallel DBMSs carried the design assumption that failures do not occur very often, which does not hold in practice, especially in large clusters of heterogeneous machines.

HadoopDB is therefore one of the attempts to combine the fault tolerance of MapReduce with a parallel DBMS's advantage in query optimization.

Main Idea of HadoopDB

HadoopDB has two layers: a data processing layer and a data storage layer. The data storage layer runs a DBMS instance on each node. The data processing layer uses Hadoop as the job scheduler and communication layer. HadoopDB must load data from HDFS into the data storage layer before processing can happen. Like Hive, HadoopDB compiles SQL queries into a DAG of operators such as scan, select, and group-by, each corresponding to either the map phase or the reduce phase in the traditional MapReduce sense. These operators become the jobs that run in the system.
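To make the operator mapping concrete, here is a minimal sketch (not HadoopDB's actual planner) of how a `SELECT region, SUM(amount) ... GROUP BY region` query decomposes into a map phase (scan/select, emitting key-value pairs) and a reduce phase (group-by/aggregate). The table and column names are illustrative, not from the paper.

```python
from collections import defaultdict

# Hypothetical rows of a "sales" table (illustrative data).
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]

def map_phase(rows):
    """Scan + select: emit (group_key, value) pairs, as a map task would."""
    for row in rows:
        yield row["region"], row["amount"]

def reduce_phase(pairs):
    """Group-by + aggregate: sum the values per key, as a reduce task would."""
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

result = reduce_phase(map_phase(rows))
# result: {"east": 17, "west": 5}
```

In HadoopDB the scan/select side of such a plan would be executed by the local DBMS via SQL rather than by hand-written map code, which is exactly where its optimization advantage comes from.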

Compared to Hive, HadoopDB does more at the system-architecture level, which enables higher-level optimization opportunities. For example, Hive does not track whether tables are collocated on a node. HadoopDB, however, detects from the metadata it collects whether the group-by attribute is also the attribute on which the table is partitioned; if so, the group-by operator can be pushed down to each node, and joins on that partitioning attribute can likewise be executed locally.
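The collocation check described above boils down to a simple planning decision. Here is a hedged sketch of that decision logic (the function and plan names are my own illustration, not HadoopDB's API): if the table is partitioned on the grouping attribute, every row for a given key already lives on one node, so the aggregation can run entirely inside each node's DBMS with no shuffle.

```python
def plan_group_by(group_attr, partition_attr):
    """Decide where a group-by runs, mimicking the collocation check
    described in the review. Names here are illustrative only.

    If the table is partitioned on the grouping attribute, each node
    holds all rows for any given key, so the aggregation is pushed
    into the local DBMS and no repartitioning (shuffle) is needed.
    Otherwise, nodes aggregate partially and shuffle by group key.
    """
    if group_attr == partition_attr:
        return "local: push GROUP BY into each node's DBMS, no shuffle"
    return "global: partial aggregate locally, then shuffle by group key"

# Usage: partitioned on customer_id in this hypothetical schema.
local_plan = plan_group_by("customer_id", "customer_id")
global_plan = plan_group_by("order_date", "customer_id")
```

The same test applies to joins: a join whose key matches the partitioning attribute can be answered node-locally, which is the optimization HadoopDB's metadata makes possible.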

The experimental results show that HadoopDB outperforms Hadoop on both unstructured- and structured-data analytic workloads, mainly because indexing and cost-based optimization are available inside the local DBMSs; the one exception is the test case of UDF aggregation over text data. However, HadoopDB's data pre-loading is about 10 times slower than Hadoop's. Weighed against a roughly 10x run-time performance boost, though, the pre-loading cost may be a secondary concern in many cases (not so much in stream data processing).