greglu edited this page Jun 22, 2011 · 12 revisions

Hadoop streaming is a utility that allows the mappers and reducers of a MapReduce job to be run as any executable or script. This means that you can write code in your favourite language and have Hadoop use it for computation. Streaming passes input to your script line by line on STDIN, and your script emits its output on STDOUT.
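As a sketch of that STDIN/STDOUT contract, here is a minimal word-count mapper in Python (the `map_line` helper name is ours, not part of Hadoop; streaming only cares that the script reads lines and prints tab-separated key/value pairs):

```python
#!/usr/bin/env python
# Minimal word-count mapper sketch for Hadoop streaming:
# read lines from STDIN, emit one "word<TAB>1" pair per word on STDOUT.
import sys

def map_line(line):
    """Yield (word, 1) pairs for one input line."""
    for word in line.strip().split():
        yield word, 1

if __name__ == "__main__":
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))
```

The tab separator matters: by default, streaming treats everything before the first tab on a line as the key and the rest as the value when shuffling to reducers.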

Choosing this method limits the use of some framework functionality, but for the purposes of learning during a one-day event it's usually the better option, especially if you're not used to programming in Java.

Official Hadoop streaming guide

http://hadoop.apache.org/common/docs/r0.20.2/streaming.html

Usage

Example command for running a Hadoop streaming job:

$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -file ~/users/{team}/mapper.py -mapper ~/users/{team}/mapper.py \
    -file ~/users/{team}/reducer.py -reducer ~/users/{team}/reducer.py \
    -input /datasets/wikipedia/* -output /tmp/{team}/job-output
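A reducer.py to pair with the command above could look like the following sketch (the `reduce_stream` helper is illustrative; the guarantee it relies on is real — streaming sorts mapper output by key before the reducer sees it, so all counts for one word arrive on consecutive lines):

```python
#!/usr/bin/env python
# Sketch of a summing reducer for Hadoop streaming: input lines arrive
# sorted by key, so counts for a given word are contiguous on STDIN.
import sys

def reduce_stream(lines):
    """Sum values for consecutive identical keys; yield (key, total)."""
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield current_key, total

if __name__ == "__main__":
    for key, total in reduce_stream(sys.stdin):
        print("%s\t%d" % (key, total))
```

Both scripts must be executable (`chmod +x`) and start with a shebang line, since the slave nodes invoke them directly.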

Language specific setup

Don't forget that in a MapReduce job, each map and reduce task runs on one of the cluster's slave nodes, and any scripts used in streaming depend on the slaves' local installation of the programming language you choose. The implication is that if your script depends on a library, that library must be installed on ALL of the slave nodes. For example, a Ruby script that requires a Rubygem needs that gem present on every slave. If your team needs this done, please see a Hopper event organizer (probably Greg).
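One defensive pattern (a sketch, not something the framework requires) is to fail fast with a readable message when a dependency is missing on a slave, rather than letting a bare ImportError get buried in the task logs. The `require` helper below is ours, not a Hadoop API:

```python
# Sketch: exit with a clear STDERR message if a library is missing
# on the slave node running this streaming task.
import sys

def require(module_name):
    """Import a module by name, exiting loudly if it is absent."""
    try:
        return __import__(module_name)
    except ImportError:
        sys.stderr.write("missing dependency: %s\n" % module_name)
        sys.exit(1)

json = require("json")  # stdlib placeholder; substitute your real dependency
```

Anything written to STDERR by a streaming task shows up in that task's logs, which makes this kind of failure much easier to diagnose from the JobTracker UI.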

Notes:

Other resources:
