GitHub - mbhavik91/United-States-Census-Data-Analysis-Using-MapReduce

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
src/cs455/gigasort/hadoop		src/cs455/gigasort/hadoop
GenerateHash.java		GenerateHash.java
ReadMe		ReadMe
build.xml		build.xml
finalHash_on_unsorted-2_reducers-used_16		finalHash_on_unsorted-2_reducers-used_16

Repository files navigation

#READ ME !!!

Dataset used is "unsorted-2"

Number of reducers used is 16. (change code if asked to use 32 reducers. change partitioner and main job)

Main Classname is MainJob

Package Structure: cs455.gigasort.hadoop

How to run? Please typr the following command:
$HADOOP_HOME/bin/hadoop jar workspace/Assignment3/dist/gigasort.jar cs455.gigasort.hadoop.MainJob /data/gigasort/unsorted-2 /home/output/final

Dataset Size = 19.9 GB

A shell script was developed to get the final results

The following commang was used to generate the hash value of each file
"openssl sha1 part-r-00006 | awk '{print $2}' >> hashvalue"

After that Generatehash.java was used to generate the finalHash in a recursive fashion

Merkle Tree Algorithm was used to generate the final hash value