Hadoop Streaming


greglu edited this page Jun 18, 2011 · 12 revisions

Example command for running a Hadoop streaming job:

~/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -file ~/users/you/mapper.py -mapper ~/users/you/mapper.py \
    -file ~/users/you/reducer.py -reducer ~/users/you/reducer.py \
    -input /datasets/wikipedia/* -output job-output
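
The mapper.py and reducer.py named in the command are ordinary scripts: streaming feeds them records on stdin and collects tab-separated key/value lines from stdout. As a rough sketch only, assuming a word-count job (this page does not say what the scripts compute), the two steps could look like the following. The function names map_lines and reduce_lines are hypothetical, and the reducer assumes every input line is a well-formed "key<TAB>count" record.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(lines):
    """Reducer step: sum the counts for each word.

    Relies on records for the same key being adjacent, which holds
    because streaming sorts mapper output by key before the reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (word, total)
```

In this sketch, mapper.py would print each record from map_lines(sys.stdin) and reducer.py each record from reduce_lines(sys.stdin). Both steps can be checked locally, without a cluster, by passing lists of strings.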

Notes:

-mapper and -reducer take paths on the LOCAL filesystem of the master node, not HDFS paths
-file ships a local file to the cluster along with the job; repeat it once for each mapper and reducer script you have
-input and -output are paths on the Hadoop filesystem (HDFS)
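
The flag rules in these notes can be sketched as a small helper that assembles the same argument list as the example command. This is an illustration only: streaming_command and its parameter names are hypothetical, not part of Hadoop.

```python
def streaming_command(jar, mapper, reducer, input_path, output_path):
    """Return the argv list for a streaming job shaped like the example command."""
    cmd = ["bin/hadoop", "jar", jar]
    # -file is repeated once per local script, next to the flag that runs it.
    for flag, script in (("-mapper", mapper), ("-reducer", reducer)):
        cmd += ["-file", script, flag, script]
    # -input and -output are HDFS paths and are passed through unchanged.
    cmd += ["-input", input_path, "-output", output_path]
    return cmd
```

If the list is handed to subprocess.run(cmd, cwd=...) rather than typed into a shell, nothing expands ~ in the local script paths, so they would need os.path.expanduser first.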

Culled from here:

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python