greglu edited this page Jun 22, 2011 · 12 revisions

Hadoop streaming is a utility that allows the mappers and reducers of a MapReduce job to be run as any executable or script. This means that you can write code in your favourite language and have Hadoop use it for computation. Streaming passes input to your script line by line on STDIN, and your script emits its output on STDOUT.
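As a sketch of that STDIN/STDOUT contract, here is a minimal word-count mapper in Python (the `map_line` helper name is ours, not part of Hadoop; streaming only cares that the script reads lines and prints tab-separated key/value pairs):

```python
#!/usr/bin/env python
# Minimal word-count mapper sketch for Hadoop streaming:
# read lines from STDIN, emit one "word<TAB>1" pair per word on STDOUT.
import sys

def map_line(line):
    """Yield (word, 1) pairs for one input line."""
    for word in line.strip().split():
        yield word, 1

if __name__ == "__main__":
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))
```

The tab separator matters: by default, streaming treats everything before the first tab on a line as the key and the rest as the value when shuffling to reducers.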

Choosing this method limits the use of some framework functionality, but for the purposes of learning during a one-day event it's usually the better option, especially if you're not used to programming in Java.

Official Hadoop streaming guide

http://hadoop.apache.org/common/docs/r0.20.2/streaming.html

Usage

Example command for running a Hadoop streaming job:

$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -file ~/users/{team}/mapper.py -mapper ~/users/{team}/mapper.py \
    -file ~/users/{team}/reducer.py -reducer ~/users/{team}/reducer.py \
    -input /datasets/wikipedia/* -output /tmp/{team}/job-output
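A reducer.py to pair with the command above could look like the following sketch (the `reduce_stream` helper is illustrative; the guarantee it relies on is real — streaming sorts mapper output by key before the reducer sees it, so all counts for one word arrive on consecutive lines):

```python
#!/usr/bin/env python
# Sketch of a summing reducer for Hadoop streaming: input lines arrive
# sorted by key, so counts for a given word are contiguous on STDIN.
import sys

def reduce_stream(lines):
    """Sum values for consecutive identical keys; yield (key, total)."""
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield current_key, total

if __name__ == "__main__":
    for key, total in reduce_stream(sys.stdin):
        print("%s\t%d" % (key, total))
```

Both scripts must be executable (`chmod +x`) and start with a shebang line, since the slave nodes invoke them directly.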

Language specific setup

Don't forget that in a MapReduce job, each map and reduce task runs on one of the cluster's slave nodes, and any scripts used in streaming depend on the slaves' local installation of the programming language you choose. The implication is that if your script depends on a library, that library must be installed on ALL of the slave nodes. For example, a Ruby script that requires a Rubygem needs that gem present on every slave. If your team needs this done, please see a Hopper event organizer (probably Greg).
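One defensive pattern (a sketch, not something the framework requires) is to fail fast with a readable message when a dependency is missing on a slave, rather than letting a bare ImportError get buried in the task logs. The `require` helper below is ours, not a Hadoop API:

```python
# Sketch: exit with a clear STDERR message if a library is missing
# on the slave node running this streaming task.
import sys

def require(module_name):
    """Import a module by name, exiting loudly if it is absent."""
    try:
        return __import__(module_name)
    except ImportError:
        sys.stderr.write("missing dependency: %s\n" % module_name)
        sys.exit(1)

json = require("json")  # stdlib placeholder; substitute your real dependency
```

Anything written to STDERR by a streaming task shows up in that task's logs, which makes this kind of failure much easier to diagnose from the JobTracker UI.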

Notes:

Other resources:
