# Hadoop intro


## Introduction

Before we begin, it’s important to roughly understand the three components of Hadoop:

  1. Hadoop Distributed File System (HDFS) is a distributed file system based on the Google File System (GFS). It splits files into “blocks” and stores them redundantly across several relatively cheap machines, known as DataNodes. Access to these DataNodes is coordinated by a single, relatively high-quality machine, known as the NameNode.

  2. Hadoop MapReduce (based on Google MapReduce) is a paradigm for distributed computing that splits a computation across several machines. In the Map task, each machine in the cluster applies a user-defined function to a subset of the input data. The output of the Map task is then shuffled around the cluster of machines so it can be grouped or aggregated in the Reduce task.

  3. YARN (unofficially Yet Another Resource Negotiator) is a feature introduced in Hadoop 2.0 (released in 2013) that manages resources and job scheduling, much like an operating system. This is an important improvement, especially in production clusters with multiple applications and users, but we won't focus on it for now.
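To make the Map → shuffle → Reduce flow concrete, here is a rough single-machine analogy of a distributed word count using only plain shell tools (an illustration of the paradigm, not Hadoop itself; `input.txt` is a hypothetical file):

```
# "Map": split each line into one word per line
# "Shuffle": bring identical keys together by sorting
# "Reduce": aggregate each group into a count
cat input.txt | tr -s ' ' '\n' | sort | uniq -c
```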

## Spin up AWS Instances

t2.micro instances should be sufficient to start playing with HDFS and MapReduce. The current setup uses the Ubuntu Server 14.04 LTS (HVM), SSD Volume Type AMI. Be sure to terminate the instances when you are done, so you don't use up your AWS free credits for the month. For practice, you can spin up 4 nodes, treating one as the namenode (master) and the remaining three as datanodes.

## Set up passwordless SSH

The namenode of the Hadoop cluster must be able to communicate with the datanodes without requiring a password.

#### Copy the AWS pem-key from localhost to the namenode

```
localhost$ scp -o "StrictHostKeyChecking no" -i ~/.ssh/personal-aws.pem ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns:~/.ssh
```

#### SSH into the namenode

```
localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns
```

#### Generate an authorization key for the namenode

```
namenode$ sudo apt-get update
namenode$ sudo apt-get install ssh rsync
namenode$ ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
namenode$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```

Make sure you can successfully perform passwordless SSH by entering:

```
namenode$ ssh localhost
```

When prompted to add the host, type `yes` and press Enter:

```
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?
```

#### Copy the namenode’s id_rsa.pub to the datanodes’ authorized_keys

  • This must be done once per datanode, changing the datanode-public-dns each time (or use the loop sketch below)

```
namenode$ cat ~/.ssh/id_rsa.pub | ssh -o "StrictHostKeyChecking no" -i ~/.ssh/*.pem ubuntu@datanode-public-dns 'cat >> ~/.ssh/authorized_keys'
```
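If you have several datanodes, a small bash loop saves some retyping (the three DNS placeholders below are hypothetical; substitute your own):

```
# Hypothetical placeholder DNS names; replace with your datanodes' public DNS
for dns in datanode1-public-dns datanode2-public-dns datanode3-public-dns; do
    cat ~/.ssh/id_rsa.pub | ssh -o "StrictHostKeyChecking no" -i ~/.ssh/*.pem ubuntu@$dns 'cat >> ~/.ssh/authorized_keys'
done
```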

## Set up the Hadoop cluster

Hadoop will be installed on the namenode and all the datanodes. The namenode and datanodes will be configured after the main installation.

Run the following on the namenode and all datanodes by SSH-ing into each node:

#### Install the Java Development Kit

```
name-or-data-node$ sudo apt-get update
name-or-data-node$ sudo apt-get install openjdk-7-jdk
```

#### Install Hadoop

```
name-or-data-node$ wget http://mirror.symnds.com/software/Apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz -P ~/Downloads
name-or-data-node$ sudo tar zxvf ~/Downloads/hadoop-*.tar.gz -C /usr/local
name-or-data-node$ sudo mv /usr/local/hadoop-* /usr/local/hadoop
```

#### Set the JAVA_HOME and HADOOP_HOME environment variables and add them to PATH in .profile

```
name-or-data-node$ nano ~/.profile
```

Add the following to the end of the file:

```
export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
```

Then reload the profile:

```
name-or-data-node$ . ~/.profile
```
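As a quick sanity check, the `hadoop` binary should now be on your PATH:

```
name-or-data-node$ hadoop version
```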

#### Set the Java path in hadoop-env.sh

```
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
```

Find the line with `export JAVA_HOME=${JAVA_HOME}` and replace the `${JAVA_HOME}` with `/usr`, e.g.

```
export JAVA_HOME=/usr
```

#### Configure core-site.xml

```
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
```

Scroll down to the `<configuration>` tag and add the following between the `<configuration>` tags:

```
<property>
   <name>fs.defaultFS</name>
   <value>hdfs://namenode-public-dns:9000</value>
</property>
```

Be sure to change namenode-public-dns to your namenode's AWS public DNS.

#### Configure yarn-site.xml

```
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
```

Scroll down to the `<configuration>` tag and add the following between the `<configuration>` tags:

```
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>

<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
   <name>yarn.resourcemanager.resource-tracker.address</name>
   <value>namenode-public-dns:8025</value>
</property>

<property>
   <name>yarn.resourcemanager.scheduler.address</name>
   <value>namenode-public-dns:8030</value>
</property>

<property>
   <name>yarn.resourcemanager.address</name>
   <value>namenode-public-dns:8050</value>
</property>
```

Be sure to change namenode-public-dns to your namenode's AWS public DNS.

#### Configure mapred-site.xml

First make a mapred-site.xml by copying the mapred-site.xml.template file:

```
name-or-data-node$ sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
```

Scroll down to the `<configuration>` tag and add the following between the `<configuration>` tags:

```
<property>
   <name>mapreduce.jobtracker.address</name>
   <value>namenode-public-dns:54311</value>
</property>

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
```

Be sure to change namenode-public-dns to your namenode's AWS public DNS.

## Configure Namenode

#### SSH into the namenode

```
localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns
```

#### Add all nodes to /etc/hosts

```
namenode$ sudo nano /etc/hosts
```

Add one line per node in the cluster, containing the node's public DNS followed by its hostname (the first component of its private DNS). If you have 4 nodes in your Hadoop cluster, then 4 extra lines will be added to the file. For example, a node with public DNS ec2-52-35-169-183.us-west-2.compute.amazonaws.com and private DNS ip-172-31-41-149.us-west-2.compute.internal gets the following line in the namenode's /etc/hosts file:

```
ec2-52-35-169-183.us-west-2.compute.amazonaws.com ip-172-31-41-149
```
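A complete 4-node cluster might then look like this (all DNS names below are hypothetical placeholders following the same pattern):

```
127.0.0.1 localhost

ec2-52-35-169-183.us-west-2.compute.amazonaws.com ip-172-31-41-149
ec2-52-35-169-184.us-west-2.compute.amazonaws.com ip-172-31-41-150
ec2-52-35-169-185.us-west-2.compute.amazonaws.com ip-172-31-41-151
ec2-52-35-169-186.us-west-2.compute.amazonaws.com ip-172-31-41-152
```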

#### Configure hdfs-site.xml
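A minimal sketch of the namenode's hdfs-site.xml, assuming a replication factor of 3 and a hypothetical local directory for the NameNode's metadata (create the directory and give the ubuntu user ownership first):

```
namenode$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
namenode$ sudo chown -R ubuntu /usr/local/hadoop
namenode$ nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
```

Between the `<configuration>` tags:

```
<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>

<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
```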

#### Add namenode and datanode names to the masters and slaves files
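As a sketch, using the hypothetical hostnames from the /etc/hosts example above: the masters file tells start-dfs.sh where to run the SecondaryNameNode, and the slaves file lists the datanodes, one hostname per line.

```
namenode$ nano $HADOOP_HOME/etc/hadoop/masters
```

```
ip-172-31-41-149
```

```
namenode$ nano $HADOOP_HOME/etc/hadoop/slaves
```

```
ip-172-31-41-150
ip-172-31-41-151
ip-172-31-41-152
```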

## Configure Datanodes

#### SSH into the datanodes
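This mirrors the namenode login above, substituting each datanode's public DNS:

```
localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@datanode-public-dns
```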

#### Configure hdfs-site.xml
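On each datanode, a minimal hdfs-site.xml sketch mirrors the namenode's but points at a hypothetical local directory for block storage:

```
datanode$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
datanode$ sudo chown -R ubuntu /usr/local/hadoop
datanode$ nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
```

Between the `<configuration>` tags:

```
<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>

<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
```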

## Start Hadoop services

#### SSH into the namenode
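All of the start scripts below are run from the namenode:

```
localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns
```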

#### Start HDFS
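A sketch of the usual sequence: format the filesystem (only on first setup, since this erases HDFS metadata), start the HDFS daemons, then use jps to verify that a NameNode is running here and a DataNode on each datanode:

```
namenode$ hdfs namenode -format
namenode$ $HADOOP_HOME/sbin/start-dfs.sh
namenode$ jps
```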

#### Start YARN
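The YARN daemons are started with a single script, which launches the ResourceManager on the namenode and a NodeManager on each node listed in the slaves file:

```
namenode$ $HADOOP_HOME/sbin/start-yarn.sh
```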

#### Start Job History Server
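The MapReduce job history server is started separately (script name as shipped with Hadoop 2.7):

```
namenode$ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
```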

## Common HDFS commands
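The HDFS shell deliberately mirrors familiar Unix file commands; a few of the most common, as a sketch (paths and file names are hypothetical):

```
namenode$ hdfs dfs -ls /                         # list a directory
namenode$ hdfs dfs -mkdir -p /user/ubuntu        # create a directory
namenode$ hdfs dfs -put book.txt /user/ubuntu    # copy a local file into HDFS
namenode$ hdfs dfs -cat /user/ubuntu/book.txt    # print a file stored in HDFS
namenode$ hdfs dfs -get /user/ubuntu/book.txt .  # copy a file out of HDFS
namenode$ hdfs dfs -rm /user/ubuntu/book.txt     # delete a file
```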

## Example of working with HDFS
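As an end-to-end sketch (book.txt is a hypothetical input file; the examples jar path matches the Hadoop 2.7.1 tarball installed above), copy a file into HDFS, run the bundled word count job, and inspect the output:

```
namenode$ hdfs dfs -mkdir -p /user/ubuntu
namenode$ hdfs dfs -put book.txt /user/ubuntu
namenode$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /user/ubuntu/book.txt /user/ubuntu/wordcount-out
namenode$ hdfs dfs -cat /user/ubuntu/wordcount-out/part-r-00000
```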
