Set up a single-node cluster and, optionally, an Eclipse development environment to create and test your programs.
This project uses a Google Cloud VM to host a Cloudera QuickStart Docker container as the Hadoop MapReduce development environment. The sections below show the setup in detail.
Connect to the Cloudera QuickStart container and check the status of the Hadoop services:
thongnguyen2410@small:~$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                                NAMES
194453c4f758        90568ffbcb7c        "/usr/bin/docker-qui…"   2 days ago          Up 2 days           0.0.0.0:90->80/tcp, 0.0.0.0:7190->7180/tcp, 0.0.0.0:8777->8888/tcp   trusting_gagarin
thongnguyen2410@small:~$ sudo docker exec -it 194453c4f758  bash
[root@quickstart /]# service --status-all
...
Hadoop datanode is running                                 [  OK  ]
Hadoop journalnode is running                              [  OK  ]
Hadoop namenode is running                                 [  OK  ]
Hadoop secondarynamenode is running                        [  OK  ]
Hadoop httpfs is running                                   [  OK  ]
Hadoop historyserver is running                            [  OK  ]
Hadoop nodemanager is running                              [  OK  ]
...

## Get WordCount (test run)
The script run.sh will:
- run `javac` to build the `.java` files, using `-cp` to point at the Hadoop libs (`.jar` files), e.g. in `/usr/lib/hadoop`
- run `jar` to package the `.class` files into a `.jar` file
- run `hadoop fs -copyFromLocal` to copy the test files in `input/*` to HDFS
- run `hadoop jar` to execute the `.jar` file in pseudo-distributed mode (or run `java -jar` in local mode)
- run `hadoop fs -cat` to display the output
Usage of run.sh:

The script defaults to the Hadoop libs at /usr/lib/hadoop. If the Hadoop libs are at a different location in your environment, set the following variable before running the script:

HADOOP_LIB_DIR=</path/to/lib/hadoop>

To run in pseudo-distributed mode:
Usage  : ./run.sh <package_dir> <class_name> [numReduceTasks]
Example: ./run.sh part1/c WordCount

Or to run in local mode:
Usage  : ./run.sh <package_dir> <class_name> local
Example: ./run.sh part1/c WordCount local
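For reference, the core WordCount map/reduce logic can be sketched in plain Java. This is an illustrative sketch without the Hadoop API; the class and method names are made up, not taken from the project's sources:

```java
import java.util.*;

// Plain-Java sketch of the WordCount map/reduce logic (no Hadoop API).
public class WordCountSketch {
    // "map" phase: emit a (word, 1) pair for every token in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.trim().split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "reduce" phase: sum the counts for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("one six three", "two three five",
                "two six four six five", "three six four",
                "four five five six", "four five six");
        // On the sample input above this yields six=6, five=5, four=4, etc.
        System.out.println(reduce(map(input)));
    }
}
```

In the real job, the shuffle phase groups the (word, 1) pairs by key between map and reduce; `numReduceTasks` controls how many `part-r-*` output files the counts are split across.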
[cloudera@quickstart BD_MapRed_Project]$ ./run.sh part1/c WordCount 4
...
==================================================
hadoop fs -cat /user/cloudera/input/*
==================================================
one six three
two three five
two six four six five
three six four
four five five six
four five six
==================================================
hadoop fs -cat /user/cloudera/output/*
==================================================
==>/user/cloudera/output/_SUCCESS<==
==>/user/cloudera/output/part-r-00000<==
==>/user/cloudera/output/part-r-00001<==
one     1
six     6
three   3
==>/user/cloudera/output/part-r-00002<==
==>/user/cloudera/output/part-r-00003<==
five    5
four    4
two     2

## Modify WordCount to InMapperWordCount and test run
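The in-mapper combining pattern aggregates counts inside the mapper instead of emitting one (word, 1) pair per token, which reduces intermediate traffic. A plain-Java sketch of the idea (no Hadoop API; names are illustrative):

```java
import java.util.*;

// Sketch of in-mapper combining for WordCount: the mapper keeps a local
// hash map and emits one (word, partialCount) pair per distinct word,
// instead of one (word, 1) pair per token.
public class InMapperWordCountSketch {
    static Map<String, Integer> mapWithCombining(List<String> lines) {
        Map<String, Integer> local = new HashMap<>();  // per-mapper state
        for (String line : lines)
            for (String word : line.trim().split("\\s+"))
                local.merge(word, 1, Integer::sum);
        return local;  // in real Hadoop, emitted from the mapper's cleanup()
    }

    // The reducer is unchanged: it sums the partial counts per word.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((w, c) -> counts.merge(w, c, Integer::sum));
        return counts;
    }
}
```

Unlike an external combiner, this guarantees local aggregation, at the cost of holding per-mapper state in memory.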
./run.sh part1/d InMapperWordCount

## Average Computation Algorithm for Apache access log
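The average-computation pattern can be sketched in plain Java as follows. The log format and field positions here are assumptions (Common Log Format, client IP first, byte count last); the class and method names are illustrative, not the project's:

```java
import java.util.*;

// Sketch of average computation over an Apache access log: the mapper
// emits (ip, bytes) for each log line; the reducer averages the byte
// counts per ip.
public class ApacheLogAvgSketch {
    static List<Map.Entry<String, Long>> map(List<String> logLines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : logLines) {
            String[] fields = line.trim().split("\\s+");
            String ip = fields[0];                        // client address
            long bytes = Long.parseLong(fields[fields.length - 1]);
            pairs.add(new AbstractMap.SimpleEntry<>(ip, bytes));
        }
        return pairs;
    }

    static Map<String, Double> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, long[]> sumCount = new HashMap<>();   // ip -> {sum, count}
        for (Map.Entry<String, Long> p : pairs) {
            long[] sc = sumCount.computeIfAbsent(p.getKey(), k -> new long[2]);
            sc[0] += p.getValue();
            sc[1]++;
        }
        Map<String, Double> avg = new TreeMap<>();
        sumCount.forEach((ip, sc) -> avg.put(ip, (double) sc[0] / sc[1]));
        return avg;
    }
}
```

Note that averaging must happen in the reducer over the full group: an average of averages would be wrong unless counts are carried along, which is exactly what the in-mapper variant below exploits.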
./run.sh part1/e ApacheLogAvg

## In-mapper combining version of Average Computation Algorithm for Apache access log
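In the in-mapper combining version, the mapper accumulates a (sum, count) pair per key and emits it once, so the reducer combines partial sums rather than raw values. A plain-Java sketch under the same assumed log format (names illustrative):

```java
import java.util.*;

// In-mapper combining version of the average computation: partial
// (sum, count) pairs are associative, so they can be merged safely
// across mappers before the final division.
public class InMapperApacheLogAvgSketch {
    static Map<String, long[]> mapWithCombining(List<String> logLines) {
        Map<String, long[]> local = new HashMap<>();      // ip -> {sum, count}
        for (String line : logLines) {
            String[] fields = line.trim().split("\\s+");
            long[] sc = local.computeIfAbsent(fields[0], k -> new long[2]);
            sc[0] += Long.parseLong(fields[fields.length - 1]);
            sc[1]++;
        }
        return local;
    }

    static Map<String, Double> reduce(List<Map<String, long[]>> partials) {
        Map<String, long[]> merged = new HashMap<>();
        for (Map<String, long[]> p : partials)
            p.forEach((ip, sc) -> {
                long[] m = merged.computeIfAbsent(ip, k -> new long[2]);
                m[0] += sc[0];
                m[1] += sc[1];
            });
        Map<String, Double> avg = new TreeMap<>();
        merged.forEach((ip, sc) -> avg.put(ip, (double) sc[0] / sc[1]));
        return avg;
    }
}
```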
./run.sh part1/f InMapperApacheLogAvg

## Pairs algorithm to compute relative frequencies
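In the pairs approach, the mapper emits a ((w, u), 1) pair for each co-occurrence plus a special marginal pair ((w, *), 1); with keys sorted so that (w, *) arrives first, the reducer can turn counts into relative frequencies f(u | w) = count(w, u) / count(w, *). A plain-Java sketch (defining co-occurrence as "appearing on the same line" is an assumption, and the names are illustrative):

```java
import java.util.*;

// Sketch of the "pairs" relative-frequency algorithm. A TreeMap stands
// in for the sorted shuffle: "(w, *)" sorts before "(w, u)" because
// '*' precedes letters, so each marginal is seen before its pairs.
public class RelativeFreqPairSketch {
    static Map<String, Integer> map(List<String> lines) {
        Map<String, Integer> pairs = new TreeMap<>();
        for (String line : lines) {
            String[] w = line.trim().split("\\s+");
            for (int i = 0; i < w.length; i++)
                for (int j = 0; j < w.length; j++)
                    if (i != j) {
                        pairs.merge("(" + w[i] + ", " + w[j] + ")", 1, Integer::sum);
                        pairs.merge("(" + w[i] + ", *)", 1, Integer::sum);
                    }
        }
        return pairs;
    }

    static Map<String, Double> reduce(Map<String, Integer> sortedPairs) {
        Map<String, Double> freq = new TreeMap<>();
        double marginal = 0;
        for (Map.Entry<String, Integer> e : sortedPairs.entrySet()) {
            if (e.getKey().endsWith(", *)"))
                marginal = e.getValue();          // remember count(w, *)
            else
                freq.put(e.getKey(), e.getValue() / marginal);
        }
        return freq;
    }
}
```

In the real job this ordering is enforced with a custom sort comparator, and a partitioner must send all keys with the same left word to the same reducer.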
./run.sh part2 RelativeFreqPair 2

## Stripes algorithm to compute relative frequencies
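In the stripes approach, the mapper emits one associative array ("stripe") of co-occurrence counts per word; the reducer sums stripes element-wise and normalizes each by its total, so no special marginal key or sort order is needed. A plain-Java sketch with the same same-line co-occurrence assumption (names illustrative):

```java
import java.util.*;

// Sketch of the "stripes" relative-frequency algorithm: one stripe
// {u: count} per left word w; normalizing a stripe by its total
// yields f(u | w) in a single pass.
public class RelativeFreqStripeSketch {
    static Map<String, Map<String, Integer>> map(List<String> lines) {
        Map<String, Map<String, Integer>> stripes = new TreeMap<>();
        for (String line : lines) {
            String[] w = line.trim().split("\\s+");
            for (int i = 0; i < w.length; i++)
                for (int j = 0; j < w.length; j++)
                    if (i != j)
                        stripes.computeIfAbsent(w[i], k -> new TreeMap<>())
                               .merge(w[j], 1, Integer::sum);
        }
        return stripes;
    }

    static Map<String, Map<String, Double>> reduce(Map<String, Map<String, Integer>> stripes) {
        Map<String, Map<String, Double>> freq = new TreeMap<>();
        stripes.forEach((w, stripe) -> {
            double total = stripe.values().stream().mapToInt(Integer::intValue).sum();
            Map<String, Double> f = new TreeMap<>();
            stripe.forEach((u, c) -> f.put(u, c / total));
            freq.put(w, f);
        });
        return freq;
    }
}
```

The trade-off versus pairs: fewer, larger intermediate records, but each stripe must fit in memory.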
./run.sh part3 RelativeFreqStripe 2

## Pairs in Mapper and Stripes in Reducer to compute relative frequencies
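The hybrid approach keeps the mapper simple (it emits plain ((w, u), count) pairs, partitioned by the left word) and rebuilds a stripe per left word on the reducer side, where it is normalized. A plain-Java sketch of the reducer-side regrouping (names illustrative):

```java
import java.util.*;

// Sketch of pairs-in-mapper / stripes-in-reducer: the mapper emits
// "w u" -> count pairs; the reducer regroups them into stripes per
// left word and normalizes each stripe.
public class RelativeFreqPairStripeSketch {
    static Map<String, Integer> map(List<String> lines) {
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (String line : lines) {
            String[] w = line.trim().split("\\s+");
            for (int i = 0; i < w.length; i++)
                for (int j = 0; j < w.length; j++)
                    if (i != j) pairCounts.merge(w[i] + " " + w[j], 1, Integer::sum);
        }
        return pairCounts;
    }

    static Map<String, Map<String, Double>> reduce(Map<String, Integer> pairCounts) {
        // Regroup "w u" -> count into w -> {u: count} (one stripe per w).
        Map<String, Map<String, Integer>> stripes = new TreeMap<>();
        pairCounts.forEach((key, c) -> {
            String[] wu = key.split("\\s+");
            stripes.computeIfAbsent(wu[0], k -> new TreeMap<>()).merge(wu[1], c, Integer::sum);
        });
        // Normalize each stripe by its total to get relative frequencies.
        Map<String, Map<String, Double>> freq = new TreeMap<>();
        stripes.forEach((w, stripe) -> {
            double total = stripe.values().stream().mapToInt(Integer::intValue).sum();
            Map<String, Double> f = new TreeMap<>();
            stripe.forEach((u, c) -> f.put(u, c / total));
            freq.put(w, f);
        });
        return freq;
    }
}
```

This keeps mapper memory bounded (no stripes held during mapping) while avoiding the special marginal key and sort-order tricks of the pure pairs approach.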
./run.sh part4 RelativeFreqPairStripe 2

## Solve a MapReduce problem of your choice!
The problem of finding common friends on Facebook is described here.
Assume friendships are stored as Person -> [List of Friends]; our friends list is then:
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D
...
The result after reduction is:
(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)
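The idea above can be sketched in plain Java: for an input line "P F1 F2 ...", the mapper emits, for each friend Fi, the key (min(P, Fi), max(P, Fi)) with P's whole friend list as the value; each pair key then receives exactly two lists (one from each endpoint), and the reducer intersects them. This sketch uses no Hadoop API and its names are illustrative:

```java
import java.util.*;

// Sketch of the common-friends algorithm. Sorting the two names inside
// the key makes (A, B) and (B, A) land on the same reducer key.
public class FriendFindingSketch {
    static Map<String, List<Set<String>>> map(List<String> lines) {
        Map<String, List<Set<String>>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] t = line.trim().split("\\s+");
            String person = t[0];
            Set<String> friends = new TreeSet<>(Arrays.asList(t).subList(1, t.length));
            for (String f : friends) {
                String key = person.compareTo(f) < 0
                        ? "(" + person + ", " + f + ")"
                        : "(" + f + ", " + person + ")";
                grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(friends);
            }
        }
        return grouped;
    }

    static Map<String, Set<String>> reduce(Map<String, List<Set<String>>> grouped) {
        Map<String, Set<String>> common = new TreeMap<>();
        grouped.forEach((pair, lists) -> {
            Set<String> inter = new TreeSet<>(lists.get(0));
            for (Set<String> l : lists.subList(1, lists.size()))
                inter.retainAll(l);                // intersect the two lists
            common.put(pair, inter);
        });
        return common;
    }
}
```

On the sample friends list above this reproduces the expected result, e.g. (A, B) -> {C, D} and (B, C) -> {A, D, E}.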
./run.sh part5 FriendFinding 2

==================================================
hadoop fs -cat /user/cloudera/input/*
==================================================
A B C D
B A C D E
C A B D E
D A B C E
E B C D
==================================================
hadoop fs -cat /user/cloudera/output/*
==================================================
==>/user/cloudera/output/_SUCCESS<==
==>/user/cloudera/output/part-r-00000<==
(A, B)  [ D, C ]
(A, D)  [ B, C ]
(B, C)  [ D, E, A ]
(B, E)  [ D, C ]
(C, D)  [ E, A, B ]
(D, E)  [ B, C ]
==>/user/cloudera/output/part-r-00001<==
(A, C)  [ D, B ]
(B, D)  [ E, A, C ]
(C, E)  [ D, B ]