paper review: Dremel

Main Ideas of Dremel

Dremel uses columnar storage for nested records at the bottom, a SQL-like query language on top, and in the middle a query execution engine very different from MapReduce (MR).

The motivation for Dremel comes from the need to run interactive data analysis over large-scale datasets *in a short time*, often many times over with slight modifications to the analytic program, as happens during a new feature extraction task.

A typical workflow using Dremel is as follows. Say Alice wants to extract a new feature from a collection of files. (1) She creates MapReduce jobs that process the raw input data and produce a dataset. (2) She then analyzes and evaluates the resulting dataset by running interactive queries over it in Dremel. (3) If she finds irregularities in the dataset, she probably needs to run FlumeJava over it for more complex bug analysis. (4) After the bug is fixed, she uses FlumeJava to process the raw input data continuously, and the results are stored in Dremel as well. At the end of this step she writes a few SQL queries to aggregate the results in the Dremel datasets. (5) She registers the new dataset in a catalog.

The key here is that the work on the dataset in the debugging phase, steps (2) and (3), can be done in Dremel mode, where data retrieval is interactive. In such a way Dremel helps Alice t

need to be run in MapReduce, which does not provide the interactive speed.

The main advantage of columnar storage is that projections and selections can skip irrelevant columns and rows, which minimizes disk I/O.
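To make the I/O argument concrete, here is a minimal sketch that compares how many values a single-column aggregate touches in a row layout versus a column layout. The records, the field names (`doc_id`, `url`, `size`), and the helper functions are all hypothetical, and the example uses flat records only; it says nothing about how Dremel lays out nested data on disk.

```python
# Toy, flat records. Field names are illustrative assumptions, not from the paper.
rows = [
    {"doc_id": 1, "url": "http://a", "size": 120},
    {"doc_id": 2, "url": "http://b", "size": 300},
    {"doc_id": 3, "url": "http://c", "size": 45},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(records, field):
    """Row layout: every field of every record is read to reach one field."""
    total = 0
    values_read = 0
    for record in records:
        values_read += len(record)  # the whole record is loaded
        total += record[field]
    return total, values_read


def sum_column_store(columns, field):
    """Column layout: only the requested column is read."""
    column = columns[field]
    return sum(column), len(column)


row_total, row_reads = sum_row_store(rows, "size")
col_total, col_reads = sum_column_store(to_columns(rows), "size")
print(row_total, row_reads)  # same sum, more values touched
print(col_total, col_reads)  # same sum, fewer values touched
```

Both calls return the same sum, but the column version touches one value per record instead of one value per field per record. The gap widens as records get wider.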

The advantage of using a SQL-like language is that it enables the various optimization techniques of relational databases, such as pushing selections and projections down toward the data.

The novelty of the query execution is that one SQL query is broken down into equivalent smaller SQL queries, each running on a horizontal partition of the dataset. Each of these smaller queries is self-contained, so each path from the root of the execution tree to the execution layer corresponds to a shard, and the results can be returned to the user quickly. This does not hold for all aggregations, but for stream-style aggregations such as TOP(signal1, 100) it can be fast.
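The decomposition described above can be sketched with a group-by count: each leaf runs the same small query on its own shard, and the upper levels of the tree only sum partial results. This is an illustrative sketch under assumptions, not Dremel's implementation; the function names, the `fanout` parameter, and the record shape with a `key` field are all hypothetical.

```python
from collections import Counter


def leaf_query(shard):
    """Leaf level: SELECT key, COUNT(*) ... GROUP BY key over one shard."""
    return Counter(record["key"] for record in shard)


def merge_partials(partials):
    """Upper level: sum the partial counts per key from the children."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged


def run_query(shards, fanout=2):
    """Evaluate the query bottom-up through a tree with the given fanout."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_query(shard) for shard in shards]
    while len(level) > 1:
        level = [
            merge_partials(level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
    return level[0] if level else Counter()


# Hypothetical shards: three horizontal partitions of one table.
shards = [
    [{"key": "a"}, {"key": "b"}],
    [{"key": "a"}],
    [{"key": "c"}, {"key": "a"}],
]
result = run_query(shards)
print(dict(result))
```

A count works here because partial counts can be summed without loss. Aggregations that need all rows in one place do not split this cleanly, which is why the speedup does not apply to every query.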

One interesting point to mention is the difference between columnar storage and column storage. Columnar storage also organizes the data of the same column together, but the record decomposition and record assembly are different. Columnar storage decomposes nested data records into columns. Record assembly does not do joins as in column storage, but uses an automaton to reassemble the records.
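A heavily simplified sketch of decomposition and join-free assembly follows, assuming records with one required field (`doc_id`) and one top-level repeated field (`links`). Both field names and both functions are hypothetical. Each `links` entry carries a repetition level (0 starts a new record, 1 continues the current list) and a definition level (0 marks an empty list, 1 a real value). The scheme in the paper generalizes these levels to arbitrary nesting depth, which this sketch does not attempt.

```python
def stripe(records):
    """Decompose records into columns.

    Each entry in the links column is (value, repetition_level, definition_level).
    """
    doc_ids, links = [], []
    for record in records:
        doc_ids.append(record["doc_id"])
        values = record.get("links", [])
        if not values:
            # Placeholder entry so the record boundary is not lost.
            links.append((None, 0, 0))
        for i, value in enumerate(values):
            links.append((value, 0 if i == 0 else 1, 1))
    return {"doc_id": doc_ids, "links": links}


def assemble(columns):
    """Rebuild records by walking the columns in order, with no joins.

    The loop acts as a two-state automaton: read doc_id, then keep reading
    links while the next repetition level is 1; a level of 0 ends the record.
    """
    link_col = columns["links"]
    pos = 0
    records = []
    for doc_id in columns["doc_id"]:
        record = {"doc_id": doc_id, "links": []}
        value, _rep, definition = link_col[pos]  # first entry has level 0
        pos += 1
        if definition == 1:
            record["links"].append(value)
        while pos < len(link_col) and link_col[pos][1] == 1:
            record["links"].append(link_col[pos][0])
            pos += 1
        records.append(record)
    return records


# Hypothetical nested records, including one with an empty repeated field.
records = [
    {"doc_id": 10, "links": [1, 2, 3]},
    {"doc_id": 20, "links": []},
    {"doc_id": 30, "links": [7]},
]
columns = stripe(records)
print(columns["links"])
print(assemble(columns) == records)
```

The levels stored next to each value are what let assembly proceed as a single sequential pass over each column, rather than matching rows across columns by key.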