Tagged: Hadoop

  • zhegu 10:57 pm on October 30, 2012 Permalink | Reply
    Tags: Hadoop, ssh, sshd   

    connect to host localhost port 22: connection refused 

    I ran into this problem when I tried to configure Hadoop in pseudo-distributed mode:


    $ ssh localhost
    ssh: connect to host localhost port 22: Connection refused
    $ bin/hadoop fs -put foo input
    Bad connection to FS. command aborted. exception: Call to localhost/ failed on connection exception: java.net.ConnectException: Connection refused

    One possible reason is that the SSH server daemon, sshd, is not loaded and running on localhost, so any attempt to connect to localhost over ssh fails. I check whether ssh and sshd are running with the following command:

    $ ps aux | grep ssh
    # Result: bunch of lines, but none of them is about "sshd", only one line is about "ssh"

    The result shows only the ssh client running, not sshd, so the SSH server daemon is probably not started yet.

    First, I verify that the package “openssh-server” is installed on my Fedora box.
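    On an RPM-based distribution such as Fedora, that check can be done from the package manager. A quick sketch of what I mean (the package name openssh-server matches my setup; it may differ on other distributions):

    ```shell
    # Query the RPM database: prints the installed version,
    # or "package openssh-server is not installed" otherwise.
    $ rpm -q openssh-server

    # Install it if missing (dnf on newer Fedora, yum on older):
    $ sudo dnf install openssh-server
    ```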

    Then I check the status of sshd service:

    $ systemctl status sshd.service
    Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled) # sshd is disabled
    Active: inactive (dead) # sshd is inactive

    Here we go: it says the sshd service is currently disabled and inactive.

    So I do the following commands to enable and start it:

    $ systemctl enable sshd.service
    $ systemctl start sshd.service

    Now it is running!

    If I check the status of sshd service again:

    $ systemctl status sshd.service
    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled) # sshd is loaded
    Active: active (running) # sshd is active

    Connecting to localhost over ssh also works fine now.

    • Manjusha 4:06 am on February 26, 2014 Permalink | Reply

      thank you very much it’s working….:)

    • mixoni 6:46 pm on September 25, 2014 Permalink | Reply

      Thank you very much for this solution, man… it’s more straightforward than the others on the web, and it’s working 🙂

  • zhegu 2:14 am on October 18, 2012 Permalink | Reply
    Tags: Hadoop, Hive

    paper review: Hive 

    The paper in this review is “Hive — A Petabyte Scale Data Warehouse Using Hadoop” by Ashish Thusoo et al.

    Hive is an interesting system. I heard that Facebook actually ran proof-of-concept evaluations of many commercial and open-source relational database systems (RDBMSs), as well as the Hadoop MapReduce system, for their petabytes of structured data before settling on the latter. However, MapReduce does not provide an explicit structured-data processing framework, so programmers who are familiar with SQL would probably miss its expressiveness. After all, SQL allows one to model very complicated data relationships with a finite but rich set of operators, whereas MapReduce, in SQL terms, provides only two “operators”, MAP and REDUCE, which are more primitive and require more work to model data relationships.

    The basic idea of Hive is to provide programmers a SQL-like language, HiveQL. Programs written in HiveQL are then compiled into map-reduce jobs that are executed using Hadoop.

    Hive hides the complicated pipelining of multiple map-reduce jobs from programmers, especially those who are not so familiar with the MapReduce programming paradigm. With the SQL-like language, programmers can write succinct yet complicated queries for ad hoc analysis or reporting dashboards, and leave the translation into map-reduce jobs, and their optimization, to the system, which runs them in the beloved fault-tolerant, distributed style.

    One cool part I found in Hive is multi-table insertion. The idea is to share the read of a common table among MapReduce jobs, such that each job does its own transformation of the shared input data and directs its output to its own destination table. Of course, the prerequisite is that there is no input-data dependency among these MapReduce jobs. One example is the following: say we join T1 and T2, want to run two different aggregates on the joined table, and store the two results in two different tables. Using Hive, we only need to compose one HiveQL query block, consisting of a join query over T1 and T2 and two subsequent INSERT OVERWRITE queries that use the joined result. Hive is able to figure out the correct dependencies among the three queries and hence do the right thing in an optimized way: perform the join of T1 and T2 only once in one map-reduce job, store the result in a temporary file, and share that temporary file as the input for the subsequent aggregate queries. Doing this in bare-bones Hadoop, one would have to write three separate MapReduce jobs (six mapper and reducer scripts in total), mentally work out the data dependencies, and manually run them in the right order. In HiveQL, this style of data processing becomes much simpler and automatic.
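    A sketch of what such a query block might look like, using Hive's FROM … multi-INSERT form (the table and column names here are made up for illustration):

    ```sql
    -- Join T1 and T2 once, then feed the joined rows into two
    -- different aggregates, each writing to its own table.
    FROM (
      SELECT t1.key, t1.v1, t2.v2
      FROM t1 JOIN t2 ON (t1.key = t2.key)
    ) j
    INSERT OVERWRITE TABLE agg_count
      SELECT j.key, COUNT(*) GROUP BY j.key
    INSERT OVERWRITE TABLE agg_sum
      SELECT j.key, SUM(j.v2) GROUP BY j.key;
    ```

    Hive runs the join once, materializes it to a temporary file, and fans that file out to both aggregate jobs.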

    However, Hive is not a Swiss Army knife. It’s not built for OLTP workloads, given the lack of INSERT INTO, UPDATE and DELETE in HiveQL. Rather, Hive is built for data-warehouse processing (OLAP), such as ad hoc queries (e.g. data-subset exploration) and reporting dashboards (e.g. joining several fact tables), where joins and aggregates prevail.

    Hive does not provide a fully automatic query optimizer: it needs the programmer to supply a query hint for MAPJOINs on small tables, where the small table in an equi-join is copied to all mappers so it can be joined against the separate parts of the big table.
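    The hint looks roughly like this (table names are made up; /*+ MAPJOIN(...) */ is Hive's hint syntax):

    ```sql
    -- Ask Hive to replicate the small dimension table to every mapper,
    -- turning the equi-join into a map-side join with no reduce phase.
    SELECT /*+ MAPJOIN(dim) */ fact.key, fact.v, dim.label
    FROM fact JOIN dim ON (fact.key = dim.key);
    ```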

    Hive also needs a hint to run GROUP BY aggregates as a two-stage map/reduce when the group-by columns have highly skewed data.
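    If I recall correctly, this one is given as a session setting rather than an inline comment (hive.groupby.skewindata is a real Hive parameter, but treat the query below as a made-up sketch):

    ```sql
    -- Stage one spreads skewed keys randomly across reducers for
    -- partial aggregation; stage two merges the partial results.
    SET hive.groupby.skewindata = true;
    SELECT skewed_col, COUNT(*) FROM big_table GROUP BY skewed_col;
    ```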

    One of the reasons behind the design of the above two programmer hints, I suspect, is that Hive’s query optimizer does not keep selectivity or cardinality statistics and estimates around to determine table sizes. However, I’m just guessing from the outside, and I need to verify this speculation.

    It’s not possible to tell HDFS where to store the data blocks. Hive’s data storage operates on the logical level and has no power over where the actual data blocks are placed. As a result, some optimizations are not possible: if tables T1 and T2, both CLUSTERED BY the same key, are joined on that key, then ideally collocating the matching partitions of T1 and T2 on the same node would eliminate the need to shuffle tuples between the map and reduce phases. But Hive cannot guarantee that matching buckets sit on the same node, so it has to run the shuffle phase.
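    For reference, bucketing is declared at table-creation time, roughly like this (names and bucket count are made up for illustration):

    ```sql
    -- Both tables hash-partitioned into 32 buckets on the join key.
    -- Matching bucket numbers hold matching keys, but HDFS still
    -- decides which nodes the bucket files actually land on.
    CREATE TABLE t1 (key INT, v1 STRING) CLUSTERED BY (key) INTO 32 BUCKETS;
    CREATE TABLE t2 (key INT, v2 STRING) CLUSTERED BY (key) INTO 32 BUCKETS;
    ```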

    Overall, I enjoy reading the paper very much. I feel the query optimizer in Hive could probably achieve more if it knows more about the arrangement of the data. Next step for me is to poke around the query optimizer and get myself pretty lost and then hopefully found. 🙂

  • zhegu 4:36 am on October 5, 2012 Permalink | Reply
    Tags: Hadoop, HadoopDB, RDBMS

    paper review: HadoopDB 

    The paper I’m reviewing here is titled “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads” by A. Abouzeid et al.


    A MapReduce system such as Hadoop provides a very flexible data-processing framework in which data does not need to be modeled before the analytic work can start. At run time, the system automatically figures out how to parallelize a job into tasks, and does load balancing and failure recovery. Parallel DBMSs, on the other hand, have sophisticated cost-based optimization techniques built in, which allow orders-of-magnitude speedups over Hadoop. However, parallel DBMSs need the data modeled in a schematic way before useful programs can run, so that the system has enough information to optimize the queries.

    MapReduce was designed with fault tolerance and unstructured-data analytics in mind; structured data analysis with MapReduce emerged later, in systems such as Hive. Historically, parallel DBMSs carried the design assumption that failures do not occur very often, which is not quite the case, especially in large clusters of heterogeneous machines.

    Therefore, HadoopDB is one of the attempts to combine the fault tolerance of MapReduce with the parallel DBMS’s advantage in query optimization.

    Main Idea of HadoopDB

    HadoopDB has two layers: a data processing layer and a data storage layer. The data storage layer runs the DBMS instances, one at each node. The data processing layer runs Hadoop as the job scheduler and communication layer. HadoopDB needs to load data from HDFS into the data storage layer before processing happens. Like Hive, HadoopDB compiles SQL queries into a DAG of operators such as scan, select, group-by and reduce, which correspond to either the map phase or the reduce phase in the traditional sense of MapReduce. These operators become the jobs that run in the system.

    Compared to Hive, HadoopDB does more at the system-architecture level, which enables higher-level optimization opportunities. For example, Hive does not care whether tables are collocated on the same node. HadoopDB, however, detects from the metadata it collects whether the group-by attribute is also the anchor attribute on which the table is partitioned; if so, the group-by operator can be pushed down to each node, and joins on that partitioning attribute can be pushed down as well.

    The experimental results show that HadoopDB outperforms Hadoop on both unstructured- and structured-data analytic workloads, mainly because indexing and cost-based optimization are available in the former’s DBMS layer, except in the test case of UDF aggregates on text data. However, HadoopDB’s data pre-load time is 10 times slower than Hadoop’s. Given a roughly 10x run-time performance boost, though, the data pre-loading cost might be a secondary concern in some cases (not so much in stream data processing).

    • name 5:16 am on February 18, 2013 Permalink | Reply

      Hello Sam,

      Did you try your hand at HadoopDB?

      • zhegu 2:12 am on February 19, 2013 Permalink | Reply

        Hi! (Sorry I can’t see your name on the comment, so I can’t address you here appropriately) Thanks for the comment. I did check out the github snapshot of HadoopDB, but didn’t really catch a chance to dig in. Luckily I will get to work on a commercialized version of HadoopDB in Hadapt Inc. soon in a few weeks! So stay tuned! 🙂 -Sam

    • Yong 8:21 pm on September 11, 2013 Permalink | Reply

      Any update on your experience with the product from Hadapt?

    • Anonymous 5:53 pm on September 21, 2013 Permalink | Reply

      Where is the data actually stored in HadoopDB?
