Recent Updates Toggle Comment Threads | Keyboard Shortcuts

  • zhegu 3:41 am on October 7, 2015 Permalink | Reply  

    Hand sketch study about the site (Hongkong Central) 

    Source: Hand sketch study about the site (Hongkong Central)

  • zhegu 2:31 am on July 28, 2014 Permalink | Reply
    Tags: , malloc, memory, swap space   

    What happens if I keep calling malloc() forever? 

    Today is not a mood-brightening weather in Providence. So I decided to make fun of my computer. Exactly how I did it was by putting malloc within a gigantic loop. Each iteration it would ask OS to allocate a sizable amount of memory. I wondered what would happen. Good question, right?

    Here is how the program looks like:

    int main(int argc, char** argv) {
        const size_t NUM_MALLOCS = LONG_MAX; // 2^31-1, at least.
        long pageSize = sysconf(_SC_PAGE_SIZE);  // Number of bytes to allocate for each malloc() call.
        vector<void*> mallocPtrs;
        for (size_t i = 0; i < NUM_MALLOCS; ++i) {
            void* p = malloc(static_cast<size_t>(pageSize));
            if (p == NULL) {
                // Malloc failed at total size: i * pageSize.
                cleanUp(mallocPtrs);  // Free the allocated memory.
                return 0;
            }  else {
        return 0;

    In theory there’s one outcome I can think of. What the malloc() function does essentially is simply to add some rooms for the process’ data segment to grow, and this increment happens in the heap portion of the virtual memory space. So if we malloc too many times, we might end up having the process’ data segment trespass into the stack portion of virtual memory space (because heap grows low-to-high, and stack grows high-to-low address-wise). This does strike the chord of the notorious segmentation fault, aye?

    In fact, when the program runs, it doesn’t sour up to 100% memory usage and seg-fault immediately afterwards. The memory utilization goes slower and slower towards 80%, and stays at around 81%, and I notice that the swap space starts to utilize some of its free space. 

    It seems that the OS starts to heavily swap cold physical memory pages, both dirty and clean, to the swap space located in my disk. Incredible slow down on my machine as well, as I try to switch terminal screen sessions or to Chrome.

    I kill the processes after a while, before the swap space runs out (which I shouldn’t, because then I’d have been able to see the legendary OOM-killer (out-of-memory killer) come out and kill random processes!).

    Now I notice a huge vacuum in the free memory space:

    sam@butler$ /usr/bin/procinfo
    Memory:        Total        Used        Free     Buffers
    RAM:         4060612      833360     3227252        1016
    Swap:        8388604     2880484     5508120

    And switching to Chrome and terminal is still taking ages… Probably because the pages are still in swap space.

    I start browsing using Chrome, despite the slowness. And I see it slowly regains the speed among switching tabs, because the swapped pages have returned to the physical memory:

    sam@butler$ /usr/bin/procinfo
    Memory:        Total        Used        Free     Buffers
    RAM:         4060612     1395148     2665464        2568
    Swap:        8388604     2625552     5763052
  • zhegu 9:35 pm on November 10, 2013 Permalink | Reply
    Tags: empty set, math, set   

    Empty basket is a basket of apples, tennis balls, or stars 

    At school I first learned about the empty set:

    An empty set is a subset of any set.

    I still remembered being taught simply taking it for granted, and memorizing it correctly on important occasions (e.g. during an exam).  Despite how odd the mathematical statement sounded like, I did what I was taught to do.

    This is how I makes sense of it.  Given a basket of apples, after all apples were taken by some monkeys, the empty basket would still be called a basket of apples, because it’s literally “a basket of apples that contains none at all”.  Moreover, suppose someone comes along and doesn’t see what kind of fruits the basket originally contained, then he or she could claim that the very same empty basket a basket of pears, or a basket of tennis balls, etc. So an empty basket could be an basket of any kind of fruits, or items.

    How could one actually be sure that a basket is NOT a basket of apples?  Well, if one actually sees that the basket contains anything that’s not an apple, for example a star, then he or she must be able to claim that the basket is not of apples, but of stars.

    So can one say a given empty basket is NOT a basket of apples?  Not really, because it does NOT contain anything that is NOT an apple.  Furthermore, it seems to me that one can not even say an empty basket is NOT a basket of bananas, tennis balls, stars, etc, simply because the empty basket does not contain anything!

    Knowing that one can not say a given empty basket is not a basket of apples, can one say the opposite is true, that a given empty basket is a basket of apple?  

    Here where mathematical logic comes in.  There is an interesting law called the Law of Excluded Middle, which means a mathematical statement is either true of false, nothing in the middle.  So a basket is either a basket of apples or not.  

    So applying the Law of Excluded Middle, one can safely draw from the fact that one can not say a given empty basket is not a basket of apples, to the conclusion that a given empty basket is a basket of apples.

    Substitute apples with anything you like, bananas, tennis balls, stars, black holes, etc.  You end up with the same conclusion that a given empty basket cannot be said to NOT be a basket of some things, so it has to be a basket of those things.

  • zhegu 3:20 pm on October 19, 2013 Permalink | Reply
    Tags: graph theory   

    All Friends Inline 

    Suppose you go to a party.  As the night goes the party gets more and more crowded so now the party host comes up with a cute idea to kick some people out: try to line people up such that each person in the line knows the person standing right next to you on both sides.  The line starts with the party host (of course).  She grabs Andy, who knows her for a long time.  Andy is popular among the crowds and he knows many folks there.  But he has to pick one, so he chooses Barbara, his sister, because he doesn’t want her to be kicked out of the party.  And the line grows…

    Let’s see what we can say about this kind of lines.  Yes, lines.  There are many possible lines.  For example, if Barbara happened to know nobody besides Andy, then the line ends and everyone else other than the party owner, Andy and Barbara would need to head out.  What if Andy didn’t choose Barbara, but chose one of his friends, say Beth, who also happened to have many acquaintances in the crowds, unlike Barbara who didn’t?  So the line would go on.  So, many of this kind of lines possible, and it’s possible that someone won’t be able to get in line.

    Untitled drawing (3)

    One interesting thing you can say is that, if you happen to be the last person in line, then your friends must have all been in line already.  So don’t worry about your friends being kicked out if you are the last person in line!

    This is somehow not very intuitive to me.  How can you be so sure?  But think about it, if you had a friend who’s not yet in line, then she could have joined by your side and would hence have been the last one in the line.  That’s not gonna happen because you are the last one.

    Now that you know all your friends are already in the line, you can at least put forth a rough estimate that the length of the line is at least equal to the number of your friends.  That’s kind of intuitive, because if you have only one friend, and you are in the line, then your friend must be the second to the last, right next to you. But if you have say 20 friends at the party, then you could at least tell that the line ended by you is at least 20 people long, plus you.

    (Propoition 1.3.1., GT)

  • zhegu 7:41 pm on January 11, 2013 Permalink | Reply
    Tags: Battery, CPU, , Graphic card, Heating, Intel, Laptop, , Overheat, Radeon, Sony VAIO, Temperature   

    fedora linux overheat problem on sony vaio 

    I have a Sony VAIO S laptop, with Intel i7 CPU and Radeon Graphics. I bought this pretty in the summer of 2012 when I was on a very well-paid internship at @WalmartLabs, CA. At the time I just started the job and was unsatisfied with a given MacBook Pro. My main complaint was the Apple keyboard places the function key at the left bottom corner where the control button usually is. This posts major problem for me since I’m a heavy command-line and vim user, all the muscle memory has already been wired in my both hands. And if you are about to type more than 1000 lines of code, it’s not so pleasant to have to hustle with the typing habit first.

    So I bought a Sony VAIO, and loaded it with Fedora 17. The installation goes through nice and smooth like a charm. However, once I start typing or browsing the web for a few minutes, the fan starts blowing like a turbine and my wrists just can’t even bear the heat on the surface.

    Measure the Temperature

    I notice that it’s not totally my fan’s fault. It’s running up and down with varied speed, responding to hardware heating well enough. The temperature sensor tool that I use is lm_sensors. It gives a temperature reading for my CPU, graphic card and motherboard.

    $ sudo yum install lm_sensors   # Install lm_sensors package
    $ sensors-detect                # Detect available sensors
    $ sensors                       # Get sensor readings

    With light web browsing on Google Chrome, I got

    $ sensors
    Adapter: Virtual device
    temp1: +67.0°C (crit = +97.0°C)
    radeon-pci-0100   # Radeon graphics card
    Adapter: PCI adapter
    temp1: +68.0°C (crit = +100.0°C)
    coretemp-isa-0000   # CPU
    Adapter: ISA adapter
    Physical id 0:+67.0°C (high = +86.0°C, crit = +100.0°C)
    Core 0: +68.0° (high = +86.0°C, crit = +100.0°C)
    Core 1: +66.0°C (high = +86.0°C, crit = +100.0°C)

    68°C is very very hot. Resting on your hands on the keyboard feels like grilling your wrists.

    Turn off the Graphics Card

    My first suspect to the overheating issue is the graphics card. Since Radeon’s graphics cards have long been known for poor linux support and hard to tame under the open source drivers, I am thinking about simply turning off the Radoen graphics card module from the Linux kernel. If it’s the graphics card that causes the heating on mother board, then I should see some dropping of temperature after disabling it.

    Turning off the graphic card can be done in linux with the following shell commands:

    $ modprobe radeon
    $ chown -R $USER:$USER /sys/kernel/debug
    # Turn off the Radeon graphics card
    $ echo OFF > /sys/kernel/debug/vgaswitcheroo/switch

    Now the Radeon graphic card is turned off. It can be verified as:

    $ cat /sys/kernel/debug/vgaswitcheroo/switch
    # External Graphics card is turned off
    0:DIS: :<strong>Off</strong>:0000:01:00.0     
    # Intel graphics is on

    I shut down the computer, let it sit for a while to cool down to room temperature, and reboot again. Immediately after reboot I run the above commands to turn off the graphic card. I immediately notice the fan speed starts to wind down, and after some Chrome web browsing, here is a sample sensor reading:

    $ sensors
    Adapter: Virtual device
    temp1: +47.0°C (crit = +97.0°C)
    Adapter: PCI adapter
    temp1: -128.0°C  # Negative, because it's turned off!
    Adapter: ISA adapter
    Physical id 0: +50.0°C (high = +86.0°C, crit = +100.0°C)
    Core 0: +48.0°C (high = +86.0°C, crit = +100.0°C)
    Core 1: +50.0°C (high = +86.0°C, crit = +100.0°C)

    It’s about 25% reduction on temperature! So the graphic card is really contributing to the overheating. Now I still notice some warming on the surface of the keyboard, but at least it’s not like the surface of the Sahara Desert anymore. Fan is tamed, and the battery is not drained as fast before. Good start.


    Whatever we have achieved above doesn’t repeat after reboot. Before finding a way to further reduce the temperature, I find turning off the graphic card useful enough to automate this task at Fedora startup. Anyway I would always boot into Windows when I need to stream video on Hulu or Youtube. So here is how you do it: put the corresponding set of shell commands (which turns off the graphics card, as listed in the previous section) into /etc/rc.local, which is a system service that will automatically run the shell commands it contains at boot time. Remember to make the rc.local file executable afterwards.

    modprobe radeon
    echo OFF > /sys/kernel/debug/vgaswitcheroo/switch

    The next step is to enable the rc.local service. If you already have /etc/systemd/system/rc-local.service, enable and start it by

    $ sudo systemctl enable rc-local.service
    $ sudo systemctl start rc-local.service
    # Check that rc.local service is indeed enabled and active
    $ sudo systemctl status rc-local.service 
    rc-local.service - /etc/rc.local Compatibility
    Loaded: loaded (/etc/systemd/system/rc-local.service; enabled)
    Active: active (exited) since XXX; XXX ago

    If you don’t have /etc/systemd/system/rc-local.service, here is a good copy of it from my own Fedora, use it at your own risk 🙂

    #  This file is part of systemd.
    #  systemd is free software; you can redistribute it and/or modify it
    #  under the terms of the GNU General Public License as published by
    #  the Free Software Foundation; either version 2 of the License, or
    #  (at your option) any later version.
    Description=/etc/rc.local Compatibility
    ExecStart=/etc/rc.local start

    -Sam Zhao

    This solution also seems to prolong the battery life of my laptop quite significantly, from average 2 hours before to 4.5 hours now with web browsing and youtubing. Sweet.

    • sebastian leste 9:59 am on June 23, 2013 Permalink | Reply

      Thanks a Great Job

    • Sello Mathakhoe 6:46 pm on February 27, 2014 Permalink | Reply

      what some great info on your blog about fedora linux overheat problem and I am glad that I got a solution for my laptop. At this time I am using Msi motherboard. Check out this ,

      Surely Visiting your blog helped me, Awesome stuff!

      Best Regards

    • bestseoservicereview 1:34 pm on March 20, 2014 Permalink | Reply

      This site has got all that I was looking for, thanks, I will bookmark this.Thanks for your helpfull tips,Cheers!!

    • Gokul 9:17 am on March 7, 2015 Permalink | Reply

      I’ve the same overheating and battery problem with my sony vaio e series having Fedora21 installed in it.
      When i try to execute the steps that you’ve mentioned I’m getting the following error.

      [gokul@localhost ~]$ sudo echo OFF > /sys/kernel/debug/vgaswitcheroo/switch
      bash: /sys/kernel/debug/vgaswitcheroo/switch: No such file or directory
      [gokul@localhost ~]$

      My graphics card detail follows,
      [gokul@localhost ~]$ lspci | grep VGA
      01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Thames [Radeon HD 7550M/7570M/7650M]
      [gokul@localhost ~]$

      Please help me to fix this.. Thanks in advance..

    • service komputer di jakarta 2:52 am on April 9, 2015 Permalink | Reply

      An intriguing discussion is definitely worth comment. I believe that you ought to publish more about this issue, it may not be a taboo subject but typically people do not speak about such issues. To the next! All the best!!

  • zhegu 11:09 pm on November 25, 2012 Permalink | Reply
    Tags: Math puzzle, Missing square puzzle, Triangle Formula   

    Missing square puzzle 

    The puzzle says,


    So what’s going on here? ShapeOne and ShapeTwo should have the same areas, because both equal to the sum of areas of Red, Green, Yellow, and Lime. But if you count the squares of each, ShapeTwo is one square less than ShapeOne, marked by (?) mark!

    The question here is:

    Is ShapeOne = ShapeTwo? Or is ShapeOne one square bigger than ShapeTwo?

    Let’s expose the contradiction in a more formal way. First, let’s be safe and sum up the total area by areas of individual color shapes:

    Method 1:

    Red (triangle) = 8 * 3 / 2 = 12 (squares)

    Green (triangle) = 5 * 2 / 2 = 5 (squares)

    Yellow = 7 (squares)

    Lime = 8 (squares)

    ShapeOne = Red + Green + Yellow + Lime = 12 + 5 + 7 + 8 = 32 (squares)

    ShapeTwo = Red + Green + Yellow + Lime = 32 (squares)

    So, ShapeOne = ShapeTwo, ShapeOne and ShapeTwo are the same.

    On the other hand, our eyes tell us ShapeOne is a right triangle, and ShapeTwo is also a right triangle but with a ?-marked square subtracted off. So what happens if we count the number of squares using the area formula of right triangles assuming each square’s area is 1:

    Method 2:

    ShapeOne’ = 13 * 5 / 2 = 32.5 (squares)

    ShapeTwo’ = 13 * 5 / 2 – 1 * 1 = 32.5 – 1 = 31.5 (squares) (Note: subtract off the ?-marked square)

    So, ShapeOne’ > ShapeTwo’, ShapeOne is one square bigger than ShapeTwo, which is the missing ?-marked square.

    So here is the contradiction: the two methods do not agree on our question of the relationship of ShapeOne and ShapeTwo. Which one is correct?

    Actually we have one more contradiction above, that the summed areas in the two methods above do not agree:

    ShapeOne ≠ ShapeOne’

    ShapeTwo ≠ ShapeTwo’

    So which method has the right number? Could it be our beloved trigonometry wrong?

    Well, actually Method 2 assumed the wrong thing, so of course it gave the wrong result. Our eyes fool our brain. In fact, as I’m going to show soon, that ShapeOne is not a right triangle, and ShapeTwo is not a right triangle with a ?-marked square subtracted off. So we can’t actually run Method 2 with the assumption that both shapes are right triangles–How could we run triangular formula on non-triangles? So the only safe and right method to use is method 1, which only assumes the shape is irregular.


    Now let’s take a closer look at our wrong assumption of Method 2.  Why ShapeOne and ShapeTwo are not right triangles? Let’s first see why ShapeOne is not a right triangle:

    Is edge ACB really a straight line? If it’s not, then it follows that ShapeOne is not a right triangle.

    ACB is straight line only if angle α = angle β, which holds only if CD / AD = BE / CE.

    But CD / AD = 3 / 8 = 0.375, BE / CE = 2 / 5= 0.4.

    So CD / AD ≠ BE / CE. So angle α ≠ angle β. So edge ACB is not a straight line. So ShapeOne is not a right triangle. Similarly, ShapeTwo is not a right triangle with the ?-marked square added.

    So our eyes fooled us that edge ACB is a straight line. In fact, it’s almost a straight line, because angle α and angle β differs very little:

    angle α = arctan(CD / AD) = arctan(3 / 8) = 0.35877067 rad

    angle β = arctan(BE / CE) = arctan(2 / 5) = 0.380506377 rad

    Difference = 0.0217357068 rad

    So in summary, don’t trust your eyes, and assume the wrong things that ShapeOne and ShapeTwo (with ?-marked square) are right triangles and so use your beloved triangle formula. Instead, always be safe and break down the big shape to small shapes that are safe to assume to be fundamental types (triangles, squares, etc) and sum them up.

  • zhegu 10:57 pm on October 30, 2012 Permalink | Reply
    Tags: , ssh, sshd   

    connect to host localhost port 22: connection refused 

    I ran into this problem when I tried to configure Hadoop in pseudo-distributed mode:


    $ ssh localhost
    ssh: connect to host localhost port 22: Connection refused.
    $ bin/hadoop fs -put foo input
    Bad connection to FS. command aborted. exception: Call to localhost/ failed on connection exception: Connection refused

    One possible reason is that because the ssh server daemon, or sshd, is not loaded and running on localhost, so any attempt to ssh connect to localhost would fail. I check to see whether ssh and sshd are running by typing the following command:

    $ ps aux | grep ssh
    # Result: bunch of lines, but none of them is about "sshd", only one line is about "ssh"

    Result is, I only see ssh running, but not sshd. So I know ssh server daemon is probably not started yet.

    First, I verify that I have installed the package “openssh-server” on my fedora box.

    Then I check the status of sshd service:

    $ systemctl status sshd.service
    Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled) # sshd is disabled
    Active: inactive (dead) # sshd is inactive

    Here we go, it says the sshd service is currently disabled.

    So I do the following commands to enable and start it:

    $ systemctl enable sshd.service
    $ systemctl start sshd.service

    Now it is running!

    If I check the status of sshd service again:

    $ systemctl status sshd.service
    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled) # sshd is loaded
    Active: active (running) # sshd is active

    Connecting ssh to localhost also woks fine now.

    • edu links service 9:17 pm on May 2, 2013 Permalink | Reply

      Everything is very open with a clear description of the issues.
      It was truly informative. Your site is very helpful.
      Thank you for sharing!

    • Raquel 4:14 am on August 24, 2013 Permalink | Reply

      You really make it appear really easy along
      with your presentation however I find this matter
      to be really something that I think I might by no means understand.
      It sort of feels too complex and extremely broad for me. I’m having a look forward on your next submit, I will attempt to get the grasp of it!

    • Lowell 5:18 pm on August 24, 2013 Permalink | Reply

      I like the helpful info you provide in your articles.
      I will bookmark your weblog and check again here frequently.

      I am quite certain I will learn a lot of new stuff right here!
      Good luck for the next!

    • Zachery 11:32 pm on August 24, 2013 Permalink | Reply

      Does your site have a contact page? I’m having a tough time locating it but, I’d like to send you an e-mail.

      I’ve got some creative ideas for your blog you might be interested in hearing. Either way, great website and I look forward to seeing it grow over time.

    • chanel かばん 1:22 am on September 18, 2013 Permalink | Reply

      ロキシー ラッシュガード

    • how to clean registry 12:36 pm on September 30, 2013 Permalink | Reply

      You could certainly see your expertise in the article
      you write. The arena hopes for even more passionate writers like you who are not afraid to say how they believe.
      All the time follow your heart.

    • Manjusha 4:06 am on February 26, 2014 Permalink | Reply

      thank you very much it’s working….:)

    • mixoni 6:46 pm on September 25, 2014 Permalink | Reply

      Thank you very much for this solution man…. its straightforward from others on web and its working 🙂

  • zhegu 2:14 am on October 18, 2012 Permalink | Reply
    Tags: , Hive,   

    paper review: Hive 

    The paper in this review is “Hive — A Petabyte Scale Data Warehouse Using Hadoop” by Ashish Thusoo et al.

    Hive is an interesting system. I heard that Facebook actually went through the proof of concept of many commercial or open-source relational databases (RDMS) as well as the Hadoop map-reduce system for their petabyte structured data, before resolved to the latter. However, MapReduce system does not provide an explicit structured data processing framework, so programmers who are familiar with SQL would probably miss the expressiveness of the latter. After all, SQL allows one to model very complicated data relationship in a finite but richer set of operators; whereas MapReduce,  in SQL terms, only provide two “operators”, MAP and REDUCE, which are more primitive and requires more work to model data relationship.

    The basic idea of Hive is to provide programmers a SQL-like language, HiveQL. Programs written in HiveQL are then compiled into map-reduce jobs that are executed using Hadoop.

    Hive hides the complicated pipelining of multiple map-reduce jobs from the programmers, especially who are not so familiar with MapReduce programming paradigm. With the SQL-like language, programmers can write succinct and complicated queries in ad hoc analysis or reporting dashboard, and leave the translation to and optimization of the map-reduce jobs to the system to run in the beloved fault-tolerant, distributed style.

    One cool part I found on Hive is the multi-table insertion. The idea is to parallelize the read operations on a common table among MapReduce jobs, such that each job does its own transformation of the shared source of data input and directs its output to its own destination table. Of course the pre-requisite is that there is no input-data dependency among these MapReduce jobs. One example is as the following: let’s say we are doing join on T1 and T2, and want to run two user different aggregates on the joined table, and store the two separate results to two different files. Using Hive, we only need to compose a HiveQL query block, which consists of a join query over two tables T1 and T2, and two subsequent INSERT OVERWRITE queries that uses the joined result of T1 and T2. What Hive is able to do is to figure out the correct dependency of the three queries, and hence does the right thing in an optimized way: do the join of T1 and T2 only once in one map-reduce job, store the result to a temporary file, and share this temporary file as the input for the subsequent aggregate queries. If doing this in bare-bone Hadoop, one has to write 3 separate MapReduce jobs, which is in total 6 mapper and reducer scripts, and has to mentally figure out the data dependency, and manually run them in the right order. In HiveQL, this style of data processing becomes much simpler and automatic.

    However, Hive is not the swiss army knife. It’s not built for OLTP workloads, because of the lack of INSERT INTO, UPDATE and DELETE in HiveQL. Rather, Hive is built for data warehouse processing (OLAP), such as ad hoc queries (e.g. data subset exploration) and reporting dashboards (e.g. joining several fact tables), where joins, aggregates prevail.

    Hive does not provide a fully automatic query optimizer: It needs programmer to provide query hint on doing MAPJOINs on small tables, where small tables in equi-joins are copied across mappers to join the separate parts of the big tables.

    Hive also needs query hint on 2-stage map/reduce for GROUP BY aggregates where the group-by columns have highly skewed data.

    One of the reasons behind the design of the above two programer hints, I suspect, would be that Hive’s query optimizer does not have selectivity, cardinality statistics and estimation around for it to determine the table sizes. However I’m just guessing from outside of the box, and I need to verify this speculation.

    It’s not possible to tell HDFS where to store the data blocks. Hive’s data storage operates on the logical level and does not have the power to control where to store the actual data blocks. As a result, some optimization is not possible: If table T1 and T2, both CLUSTERED BY the same key, are to join by that key, ideally the collocation of the partitions of T1 and T2 on the same node eliminates the need to shuffle the tuples between map and reduce phase. But Hive does not guarantee matching buckets will sit on the same node in this case, and so it has to run the shuffle phase.

    Overall, I enjoy reading the paper very much. I feel the query optimizer in Hive could probably achieve more if it knows more about the arrangement of the data. Next step for me is to poke around the query optimizer and get myself pretty lost and then hopefully found. 🙂

  • zhegu 4:36 am on October 5, 2012 Permalink | Reply
    Tags: , HadoopDB, , RDMS   

    paper review: HadoopDB 

    The paper I’m reviewing here is titled “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads” by A. Abouzeid et al.


    MapReduce system such as Hadoop provides a very flexible data processing framework where data does not need to be modeled before all the analytic work can start. At run-time, the system automatically figures out how to parallelize the job into tasks, does load-balancing and failure recovery. Parallel DBMSs on the other hand have sophisticated cost-based optimization techniques built-in which allows orders of magnitude speedup on performance as compared to Hadoop. However, parallel DBMSs need to have the data modeled in a schematic way before useful programs can run to provide the system enough information to optimize the queries.

    MapReduce was designed with fault tolerance and unstructured data analytics in mind. Structured data analysis with MapReduce emerged later such as Hive. Historically Parallel DBMSs carried the design assumption that failures do not occur very often, which is not quite the case especially in large clusters with heterogeneous machines.

    Therefore, HadoopDB is one of the attempts to combining the fault tolerance of MapReduce with Parallel DBMS’s advantage of query optimization.

    Main Idea of HadoopDB

    HadoopDB has two layers, the data processing layer and data storage layer. The data storage layer runs the DBMS instances, one at each node. The data processing layer runs the Hadoop as a job scheduler and communication layer. HadoopDB needs to load data from HDFS into the data storage layer before processing happens. Like Hive, HadoopDB compiles the SQL queries into a DAG of operators such as scan, select, group-by, reduce, which correspond to either the map phase or reduce phase in a traditional sense of MapReduce. These operators become the job to run in the system.

    Compared to Hive, HadoopDB does more on the system-level architecture which enables higher-level optimization opportunities. For example, Hive does not care about whether tables are collocated in the node. HadoopDB, however, detects from the metadata it collects whether the attribute to group by is also the anchor attribute to partition the table; if so, then the group-by operator can be pushed to each node, and joins can also be done on this partitioned attribute as well.

    The experiment results show that HadoopDB outperforms Hadoop in unstructured data and structured data analytic workload mainly because of the index technique and cost-based optimization can be done within the former, except the test case of UDF aggregates on text data. However, HadoopDB’s data pre-load time is 10 times slower than Hadoop. But thinking about 10 times run-time performance boost, the data pre-loading cost might be a secondary concern in some cases (not so much in stream data processing, though).

    • name 5:16 am on February 18, 2013 Permalink | Reply

      Hello Sam,

      Did you try your hands on with HadoopDB?

      • zhegu 2:12 am on February 19, 2013 Permalink | Reply

        Hi! (Sorry I can’t see your name on the comment, so I can’t address you here appropriately) Thanks for the comment. I did check out the github snapshot of HadoopDB, but didn’t really catch a chance to dig in. Luckily I will get to work on a commercialized version of HadoopDB in Hadapt Inc. soon in a few weeks! So stay tuned! 🙂 -Sam

    • Yong 8:21 pm on September 11, 2013 Permalink | Reply

      Any update about your experience about the produce from Hadapt?

    • Anonymous 5:53 pm on September 21, 2013 Permalink | Reply

      Where actually data is stored in HadoopDB?

  • zhegu 3:55 pm on September 25, 2012 Permalink | Reply
    Tags: Dremel,   

    paper review: Dremel 

    Main Ideas of Dremel

    Dremel uses columar storage for nested record at the bottom, a SQL-like programming language on top, and in the middle a query execution engine very different than MR.

    The motivation of Dremel comes from the need to run interactive data analysis over larg
    e-scale datasets *in a short time* and many times with slightly modification of the ana
    lytic program during the new feature extraction task.

    A typical workflow using Dremel is as follows: say Alice wants to extract a new feature from a collection of files. (1)She creates MapReduce jobs to work on the raw input data to produce a dataset. (2)Then she analyzes and evaluates her resulting dataset by running interactive queries in Dremel over the dataset. (3)However, if she finds irregularitie in the dataset, she probably needs to use FlumeJava over the dataset to do more complex bug-analysis. (4)After the bug is fixed, she use FlumeJava over the raw input data to process continuously. The results are stored in Dremel as well. At the last stage of this step she programs a few SQL queries to aggregate results in the Dremel datasets. (5)She registers the new dataset in a catalog.

    The key here is that the dataset used in the debugging phase–(2), (3) can be done in t
    he Dremel mode, where data retrieval is interactive. In such a way Dremel helps Alice t

    need to be run in MapReduce, which does not provide the interactive speed.

    The main advantage of columnar storage is where projections and selections can skip irrelevant columns and rows to minimize disk I/O.

    The advantage of using SQL-like language is that it enables the various optimization techniques of relational databases, such as pushing down the selection and projection.

    The novelty of the query exeuction is where one SQL query is broken down into equivalent samller SQL queries running on a horizontal partiion of the dataset. Each of these smaller SQL queries is self-contained and so each path from the execution tree root to the execution layer is a shard, and the results can be retured to the user quickly (not for all aggregations though. But for functions like TOP(signal1, 100), the stream-style aggregation, it’s capable to be quick).

    One interesting to mention is the difference between columnar storage and column storag
    e. Columnar storage also organizes the data of the same column together, but the record
    decomposition and record assembly are different. Columnar storage transforms nested data records. The record assembly does not do joins as in column-storage, but uses a Autu
    maton for assembling.

Compose new post
Next post/Next comment
Previous post/Previous comment
Show/Hide comments
Go to top
Go to login
Show/Hide help
shift + esc