Tagged: MapReduce

  • zhegu 2:14 am on October 18, 2012
    Tags: Hive, MapReduce

    paper review: Hive 

    The paper in this review is “Hive — A Petabyte Scale Data Warehouse Using Hadoop” by Ashish Thusoo et al.

    Hive is an interesting system. I heard that Facebook actually ran proofs of concept with many commercial and open-source relational databases (RDBMSs) as well as with the Hadoop map-reduce system for their petabytes of structured data before settling on the latter. However, a MapReduce system does not provide an explicit framework for processing structured data, so programmers who are familiar with SQL would probably miss its expressiveness. After all, SQL lets one model very complicated data relationships with a finite but richer set of operators, whereas MapReduce, in SQL terms, provides only two “operators”, MAP and REDUCE, which are more primitive and require more work to model data relationships.

    The basic idea of Hive is to provide programmers with a SQL-like language, HiveQL. Programs written in HiveQL are compiled into map-reduce jobs that are executed on Hadoop.

    Hive hides the complicated pipelining of multiple map-reduce jobs from programmers, especially those who are not so familiar with the MapReduce programming paradigm. With the SQL-like language, programmers can write succinct yet complicated queries for ad hoc analysis or reporting dashboards, and leave the translation into map-reduce jobs, and their optimization, to the system, which runs them in the beloved fault-tolerant, distributed style.

    One cool part I found in Hive is multi-table insertion. The idea is to parallelize reads of a common table among MapReduce jobs, such that each job does its own transformation of the shared input data and directs its output to its own destination table. The prerequisite, of course, is that there is no input-data dependency among these MapReduce jobs. One example: say we are joining T1 and T2, want to run two different user aggregates on the joined table, and want to store the two separate results in two different files. Using Hive, we only need to compose one HiveQL query block, which consists of a join query over T1 and T2 and two subsequent INSERT OVERWRITE queries that use the joined result. Hive is able to figure out the correct dependencies among the three queries and hence does the right thing in an optimized way: it performs the join of T1 and T2 only once in one map-reduce job, stores the result in a temporary file, and shares that temporary file as the input for the two aggregate queries. Doing this in bare-bones Hadoop, one would have to write three separate MapReduce jobs (six mapper and reducer scripts in total), mentally figure out the data dependencies, and manually run the jobs in the right order. In HiveQL, this style of data processing becomes much simpler and automatic, as sketched below.
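
    A minimal sketch of what such a query block could look like, assuming hypothetical tables T1 and T2 joined on a column key, and two made-up destination tables S1 and S2 (none of these names are from the paper):

        -- The join is evaluated once; its result feeds both inserts.
        FROM (
          SELECT t1.key AS key, t1.v1 AS v1, t2.v2 AS v2
          FROM T1 t1 JOIN T2 t2 ON (t1.key = t2.key)
        ) joined
        -- Aggregate 1: row count per key, written to S1.
        INSERT OVERWRITE TABLE S1
          SELECT key, COUNT(1)
          GROUP BY key
        -- Aggregate 2: sum of v2 per key, written to S2.
        INSERT OVERWRITE TABLE S2
          SELECT key, SUM(v2)
          GROUP BY key;

    Hive plans the shared join as one map-reduce job and the two aggregates as downstream jobs over its temporary output.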

    However, Hive is not a Swiss Army knife. It is not built for OLTP workloads, because of the lack of INSERT INTO, UPDATE, and DELETE in HiveQL. Rather, Hive is built for data-warehouse processing (OLAP), such as ad hoc queries (e.g. exploring a subset of the data) and reporting dashboards (e.g. joining several fact tables), where joins and aggregates prevail.

    Hive does not provide a fully automatic query optimizer: it needs the programmer to provide a query hint to use MAPJOIN on small tables, where the small table in an equi-join is copied to all the mappers so that each can join its part of the big table locally.
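
    As a sketch of the hint syntax (the table names fact and dim are made up for illustration):

        -- Replicate the small table dim to every mapper; the join then
        -- happens map-side, avoiding a shuffle of the large fact table.
        SELECT /*+ MAPJOIN(dim) */ fact.key, fact.val, dim.label
        FROM fact JOIN dim ON (fact.key = dim.key);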

    Hive also needs a query hint to use a two-stage map/reduce plan for GROUP BY aggregates when the group-by columns contain highly skewed data.
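
    If I remember correctly, Hive exposes this as a session setting rather than an inline hint; something like the following (the property name is from my recollection of Hive, not from the paper, and page_views is just an example table):

        -- Two-stage plan for skewed keys: the first job spreads rows
        -- randomly across reducers and pre-aggregates; the second job
        -- finishes the aggregation grouped by the real key.
        SET hive.groupby.skewindata = true;

        SELECT page_id, COUNT(1) AS views
        FROM page_views
        GROUP BY page_id;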

    One of the reasons behind these two programmer hints, I suspect, is that Hive's query optimizer does not keep selectivity or cardinality statistics, so it cannot estimate table sizes on its own. However, I'm just guessing from the outside, and I need to verify this speculation.

    It’s not possible to tell HDFS where to store the data blocks. Hive’s data storage operates at the logical level and has no power over where the actual blocks are placed. As a result, some optimizations are not possible: if tables T1 and T2, both CLUSTERED BY the same key, are joined on that key, then ideally collocating the matching partitions of T1 and T2 on the same node would eliminate the need to shuffle tuples between the map and reduce phases. But Hive cannot guarantee that matching buckets sit on the same node, so it still has to run the shuffle phase.
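
    For reference, bucketing is declared when the tables are created; a sketch with made-up column names:

        -- Both tables are bucketed on the join key into the same number
        -- of buckets. Hive can exploit this for a bucketed join, but it
        -- cannot force HDFS to place matching buckets on the same node,
        -- so the join still shuffles.
        CREATE TABLE T1 (key INT, v1 STRING)
          CLUSTERED BY (key) INTO 32 BUCKETS;
        CREATE TABLE T2 (key INT, v2 STRING)
          CLUSTERED BY (key) INTO 32 BUCKETS;

        SELECT t1.key, t1.v1, t2.v2
        FROM T1 t1 JOIN T2 t2 ON (t1.key = t2.key);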

    Overall, I enjoyed reading the paper very much. I feel the query optimizer in Hive could probably achieve more if it knew more about the arrangement of the data. The next step for me is to poke around the query optimizer, get myself pretty lost, and then hopefully found. 🙂

     
  • zhegu 4:36 am on October 5, 2012
    Tags: HadoopDB, MapReduce, RDBMS

    paper review: HadoopDB 

    The paper I’m reviewing here is titled “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads” by A. Abouzeid et al.

    Motivation

    A MapReduce system such as Hadoop provides a very flexible data-processing framework in which the data does not need to be modeled before analytic work can start. At run time, the system automatically figures out how to parallelize a job into tasks, does load balancing, and recovers from failures. Parallel DBMSs, on the other hand, have sophisticated cost-based optimization techniques built in, which allow orders-of-magnitude performance speedups compared to Hadoop. However, parallel DBMSs need the data modeled with a schema before useful programs can run, so that the system has enough information to optimize queries.

    MapReduce was designed with fault tolerance and unstructured data analytics in mind; structured data analysis on MapReduce, such as Hive, emerged later. Historically, parallel DBMSs carried the design assumption that failures do not occur very often, which is not quite the case, especially in large clusters of heterogeneous machines.

    Therefore, HadoopDB is one of the attempts to combine the fault tolerance of MapReduce with the query-optimization advantages of parallel DBMSs.

    Main Idea of HadoopDB

    HadoopDB has two layers, a data processing layer and a data storage layer. The data storage layer runs the DBMS instances, one on each node. The data processing layer runs Hadoop as the job scheduler and communication layer. HadoopDB needs to load data from HDFS into the data storage layer before processing happens. Like Hive, HadoopDB compiles SQL queries into a DAG of operators such as scan, select, group-by, and reduce, each of which corresponds to either a map phase or a reduce phase in the traditional MapReduce sense. These operators become the job that runs in the system.

    Compared to Hive, HadoopDB does more at the system-architecture level, which enables higher-level optimization opportunities. For example, Hive does not care whether tables are collocated on a node. HadoopDB, however, detects from the metadata it collects whether the group-by attribute is also the anchor attribute on which the table is partitioned; if so, the group-by operator can be pushed down to each node, and joins on that partitioning attribute can be pushed down as well.
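
    A hypothetical illustration (the table and column names are mine, not from the paper): if a table sales is partitioned across nodes on customer_id, then, as I understand the planner, a query like the one below can be pushed almost entirely into each node's local DBMS, with Hadoop merely collecting the per-node results:

        -- Every group lives entirely on one node because the table is
        -- partitioned on customer_id, so each local DBMS computes its
        -- groups independently and no repartitioning shuffle is needed.
        SELECT customer_id, SUM(amount) AS total
        FROM sales
        GROUP BY customer_id;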

    The experimental results show that HadoopDB outperforms Hadoop on both unstructured and structured analytic workloads, mainly because indexing and cost-based optimization can be applied inside the per-node DBMSs; the exception is the test case of UDF aggregates on text data. However, HadoopDB's data pre-load time is 10 times slower than Hadoop's. But weighed against a roughly 10x run-time performance boost, the pre-loading cost might be a secondary concern in some cases (not so much in stream data processing, though).

     
    • name 5:16 am on February 18, 2013

      Hello Sam,

      Did you try your hand at HadoopDB?

      • zhegu 2:12 am on February 19, 2013

        Hi! (Sorry I can’t see your name on the comment, so I can’t address you here appropriately.) Thanks for the comment. I did check out the GitHub snapshot of HadoopDB, but didn’t really get a chance to dig in. Luckily I will get to work on a commercialized version of HadoopDB at Hadapt Inc. in a few weeks! So stay tuned! 🙂 -Sam

    • Yong 8:21 pm on September 11, 2013

      Any update about your experience with the product from Hadapt?

    • Anonymous 5:53 pm on September 21, 2013

      Sir,
      Where is the data actually stored in HadoopDB?

  • zhegu 3:55 pm on September 25, 2012
    Tags: Dremel, MapReduce   

    paper review: Dremel 

    Main Ideas of Dremel

    Dremel uses columnar storage for nested records at the bottom, a SQL-like programming language on top, and, in the middle, a query execution engine very different from MapReduce.

    The motivation for Dremel comes from the need to run interactive data analysis over large-scale datasets *in a short time*, often many times with slight modifications of the analytic program during a new feature-extraction task.

    A typical workflow using Dremel is as follows: say Alice wants to extract a new feature from a collection of files. (1) She creates MapReduce jobs that work on the raw input data to produce a dataset. (2) Then she analyzes and evaluates the resulting dataset by running interactive queries over it in Dremel. (3) If she finds irregularities in the dataset, she probably needs to use FlumeJava over it to do more complex bug analysis. (4) After the bug is fixed, she uses FlumeJava over the raw input data to process it continuously, with the results also stored in Dremel; at the last stage of this step she writes a few SQL queries to aggregate results across the Dremel datasets. (5) She registers the new dataset in a catalog.

    The key here is that the debugging phases (2) and (3) can be done in Dremel, where data retrieval is interactive. In this way Dremel helps Alice iterate quickly on analyses that would otherwise need to be run in MapReduce, which does not provide interactive speed.

    The main advantage of columnar storage is that projections and selections can skip irrelevant columns and rows, minimizing disk I/O.

    The advantage of using a SQL-like language is that it enables the various optimization techniques of relational databases, such as pushing down selections and projections.

    The novelty of the query execution is that one SQL query is broken down into equivalent smaller SQL queries, each running over a horizontal partition of the dataset. Each of these smaller queries is self-contained, so each path from the root of the execution tree down to the leaf level works on its own shard, and results can be returned to the user quickly (not for all aggregations, though; but for stream-style aggregations such as TOP(signal1, 100), it can be quick).
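
    For example, a query in the spirit of the paper's TOP aggregation (the table name here is made up) could be answered by streaming partial results up the serving tree:

        -- Each leaf server computes a local top-100 of signal1 with counts;
        -- intermediate servers merge these partial results on the way up,
        -- so the root can answer without materializing the full dataset.
        SELECT TOP(signal1, 100), COUNT(*)
        FROM monitored_events;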

    One interesting point to mention is the difference between columnar storage and column storage. Columnar storage also organizes the data of the same column together, but the record decomposition and record assembly are different: columnar storage transforms nested data records, and record assembly does not do joins as in column storage but instead uses an automaton to reassemble records.

     