paper review: Hive

The paper in this review is “Hive — A Petabyte Scale Data Warehouse Using Hadoop” by Ashish Thusoo et al.

Hive is an interesting system. I heard that Facebook actually went through the proof of concept of many commercial or open-source relational databases (RDMS) as well as the Hadoop map-reduce system for their petabyte structured data, before resolved to the latter. However, MapReduce system does not provide an explicit structured data processing framework, so programmers who are familiar with SQL would probably miss the expressiveness of the latter. After all, SQL allows one to model very complicated data relationship in a finite but richer set of operators; whereas MapReduce,  in SQL terms, only provide two “operators”, MAP and REDUCE, which are more primitive and requires more work to model data relationship.

The basic idea of Hive is to provide programmers a SQL-like language, HiveQL. Programs written in HiveQL are then compiled into map-reduce jobs that are executed using Hadoop.

Hive hides the complicated pipelining of multiple map-reduce jobs from the programmers, especially who are not so familiar with MapReduce programming paradigm. With the SQL-like language, programmers can write succinct and complicated queries in ad hoc analysis or reporting dashboard, and leave the translation to and optimization of the map-reduce jobs to the system to run in the beloved fault-tolerant, distributed style.

One cool part I found on Hive is the multi-table insertion. The idea is to parallelize the read operations on a common table among MapReduce jobs, such that each job does its own transformation of the shared source of data input and directs its output to its own destination table. Of course the pre-requisite is that there is no input-data dependency among these MapReduce jobs. One example is as the following: let’s say we are doing join on T1 and T2, and want to run two user different aggregates on the joined table, and store the two separate results to two different files. Using Hive, we only need to compose a HiveQL query block, which consists of a join query over two tables T1 and T2, and two subsequent INSERT OVERWRITE queries that uses the joined result of T1 and T2. What Hive is able to do is to figure out the correct dependency of the three queries, and hence does the right thing in an optimized way: do the join of T1 and T2 only once in one map-reduce job, store the result to a temporary file, and share this temporary file as the input for the subsequent aggregate queries. If doing this in bare-bone Hadoop, one has to write 3 separate MapReduce jobs, which is in total 6 mapper and reducer scripts, and has to mentally figure out the data dependency, and manually run them in the right order. In HiveQL, this style of data processing becomes much simpler and automatic.

However, Hive is not the swiss army knife. It’s not built for OLTP workloads, because of the lack of INSERT INTO, UPDATE and DELETE in HiveQL. Rather, Hive is built for data warehouse processing (OLAP), such as ad hoc queries (e.g. data subset exploration) and reporting dashboards (e.g. joining several fact tables), where joins, aggregates prevail.

Hive does not provide a fully automatic query optimizer: It needs programmer to provide query hint on doing MAPJOINs on small tables, where small tables in equi-joins are copied across mappers to join the separate parts of the big tables.

Hive also needs query hint on 2-stage map/reduce for GROUP BY aggregates where the group-by columns have highly skewed data.

One of the reasons behind the design of the above two programer hints, I suspect, would be that Hive’s query optimizer does not have selectivity, cardinality statistics and estimation around for it to determine the table sizes. However I’m just guessing from outside of the box, and I need to verify this speculation.

It’s not possible to tell HDFS where to store the data blocks. Hive’s data storage operates on the logical level and does not have the power to control where to store the actual data blocks. As a result, some optimization is not possible: If table T1 and T2, both CLUSTERED BY the same key, are to join by that key, ideally the collocation of the partitions of T1 and T2 on the same node eliminates the need to shuffle the tuples between map and reduce phase. But Hive does not guarantee matching buckets will sit on the same node in this case, and so it has to run the shuffle phase.

Overall, I enjoy reading the paper very much. I feel the query optimizer in Hive could probably achieve more if it knows more about the arrangement of the data. Next step for me is to poke around the query optimizer and get myself pretty lost and then hopefully found. 🙂