
Friday, 5 January 2018

Hadoop Multinode Cluster Setup

     This tutorial describes how to set up a Hadoop multi-node cluster. A multi-node cluster in Hadoop contains two or more DataNodes in a distributed Hadoop environment. This is how Hadoop is used in practice in organizations to store and analyze petabytes and exabytes of data.

Recommended Platform
OS: Linux is supported as a development and production platform. You can use Ubuntu 14.04, 16.04 or later (other Linux flavors such as CentOS or Red Hat also work).
Hadoop: Cloudera Distribution for Apache Hadoop (CDH) 5.x (you can also use Apache Hadoop 2.x)

Install Hadoop on Master
      Let us now start by installing Hadoop on the master node in distributed mode.
1. Prerequisites for Hadoop Multi Node Cluster Setup
1. Add Entries in hosts file
     Edit the hosts file and add entries for the master and all slaves:
sudo vi /etc/hosts
MASTER-IP master
SLAVE01-IP slave01
SLAVE02-IP slave02
(NOTE: Replace MASTER-IP, SLAVE01-IP and SLAVE02-IP with the actual IP address of each machine.)

Example
172.26.110.100 master
172.26.110.101 slave01
172.26.110.102 slave02
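
To confirm that the names resolve correctly, you can ping each slave from the master by hostname (a quick sanity check; the IP addresses above are only examples):
Command : ping -c 1 slave01
Command : ping -c 1 slave02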

2. Install Java 8 (Recommended Oracle Java)
     Hadoop requires a working Java installation (Java 7 or later for current Hadoop 2.x releases); using Java 8 is recommended for running Hadoop.
2.1 Install Python Software Properties
Command : sudo apt-get install python-software-properties

2.2 Add Repository
Command : sudo add-apt-repository ppa:webupd8team/java

2.3 Update the source list
Command : sudo apt-get update

2.4 Install Java
Command : sudo apt-get install oracle-java8-installer
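
After the installation completes, verify that Java is available on the PATH (the exact version string printed will depend on the package you installed):
Command : java -version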

3. Configure SSH
     Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
3.1 Install Open SSH Server-Client
Command : sudo apt-get install openssh-server openssh-client

3.2 Generate KeyPairs
Command : ssh-keygen -t rsa -P ""

3.3 Configure password-less SSH
3.3.1 Copy the generated ssh key to the master node's authorized keys.
Command : cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

3.3.2 Copy the master node's ssh key to each slave's authorized keys (replace ashok with your username on the slaves).
Command :
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ashok@slave01
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ashok@slave02
3.4 Check SSH access to all the slaves
Command : ssh slave01
Command : ssh slave02

2. Install Hadoop
1 Download Hadoop
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.5.0-cdh5.3.2.tar.gz

Note:
     You can download any Hadoop 2.x release. Here I am using CDH, Cloudera's 100% open source platform distribution.

2 Untar the tarball
Command : tar xzf hadoop-2.5.0-cdh5.3.2.tar.gz

3. Hadoop multi-node cluster setup Configuration
1 Edit .bashrc
     Edit the .bashrc file located in the user's home directory and add the following parameters:
Command :  vi .bashrc
export HADOOP_PREFIX="/home/ashok/hadoop-2.5.0-cdh5.3.2"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}


Note
     After the above step, restart the terminal so that all the environment variables take effect, or execute the source command:
Command : source .bashrc 
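
You can verify that the Hadoop binaries are now on the PATH (a quick check; the version printed should match the tarball you extracted):
Command : hadoop version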


2 Edit hadoop-env.sh 

     hadoop-env.sh contains the environment variables used by the Hadoop scripts, such as the Java home path. Edit the configuration file hadoop-env.sh (located in HADOOP_HOME/etc/hadoop) and set JAVA_HOME.
Command : vi hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/

3 Edit core-site.xml 

     core-site.xml tells the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings of Hadoop core, such as I/O settings that are common to HDFS and MapReduce.
     Edit the configuration file core-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries:
Command : vi core-site.xml 
<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master:9000</value>
   </property>
   <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/ashok/hdata</value>
   </property>
</configuration> 
Note
     Here /home/ashok/hdata is a sample location; please specify a location where you have Read Write privileges.
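It can help to create this directory up front on the master (the path below is just the sample value used above; adjust it to your own location):
Command : mkdir -p /home/ashok/hdata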
4 Edit hdfs-site.xml 
     hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.
      Edit the configuration file hdfs-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries:
Command : vi hdfs-site.xml 
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>2</value>
   </property>
</configuration>
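With dfs.replication set to 2, each HDFS block is stored on two DataNodes, which matches the two slaves in this cluster. Optionally, you can also pin the NameNode and DataNode storage directories explicitly by adding properties like the following inside the same <configuration> element (sample paths; use locations where you have write access):
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>/home/ashok/hdata/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>/home/ashok/hdata/datanode</value>
   </property>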
5 Edit mapred-site.xml 
     mapred-site.xml contains the configuration settings of the MapReduce framework, such as the number of JVMs that can run in parallel, the memory available to the mapper and reducer processes, the CPU cores available to a process, etc.
     In some cases the mapred-site.xml file is not available and has to be created from the mapred-site.xml template, as shown below. Edit the configuration file mapred-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries:
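If the file is missing, create it from the bundled template first (run this inside HADOOP_HOME/etc/hadoop; the template ships with the Hadoop 2.x tarball):
Command : cp mapred-site.xml.template mapred-site.xml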
Command : vi mapred-site.xml 
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>  

6 Edit yarn-site.xml 
     yarn-site.xml contains the configuration settings of the ResourceManager and NodeManager, such as memory limits for applications, the auxiliary services to run, etc. Edit the configuration file yarn-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries:
Command : vi yarn-site.xml  
<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
   <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>master:8025</value>
   </property>
   <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>master:8030</value>
   </property>
   <property>
      <name>yarn.resourcemanager.address</name>
      <value>master:8040</value>
   </property>
</configuration>
7. Edit slaves
     Edit the configuration file slaves (located in HADOOP_HOME/etc/hadoop) and add the following entries:
slave01
slave02 

Hadoop is now set up on the master; next, set up Hadoop on all the slaves.
Install Hadoop On Slaves 
1. Setup Prerequisites on all the slaves
     Run the following steps on all the slaves:
1. Add Entries in hosts file
2. Install Java 8 (Recommended Oracle Java)

2. Copy configured setups from master to all the slaves
2.1. Create tarball of configured setup
Command : tar czf hadoop.tar.gz hadoop-2.5.0-cdh5.3.2
(NOTE: Run this command on Master)

2.2. Copy the configured tarball on all the slaves
Command : scp hadoop.tar.gz slave01:~
(NOTE: Run this command on Master)

Command  :scp hadoop.tar.gz slave02:~
(NOTE: Run this command on Master)

2.3. Un-tar configured Hadoop setup on all the slaves
Command : tar xvzf hadoop.tar.gz
(NOTE: Run this command on all the slaves)

     Hadoop is now set up on all the slaves. Now start the cluster.

4. Start the Hadoop Cluster
     Let us now see how to start the Hadoop cluster.
4.1. Format the NameNode
Command : bin/hdfs namenode -format
(Note: Run this command on Master)
(NOTE: Format the NameNode only once, when you first install Hadoop; formatting it again later will delete all data in HDFS)

4.2. Start HDFS Services
Command : sbin/start-dfs.sh
(Note: Run this command on Master)

4.3. Start YARN Services
Command : sbin/start-yarn.sh
(Note: Run this command on Master)
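
Once both scripts have finished, you can also check the cluster through the web interfaces (default ports for Hadoop 2.x; adjust them if you changed the defaults):
NameNode UI : http://master:50070
ResourceManager UI : http://master:8088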

4.4. Check for Hadoop services
4.4.1. Check daemons on Master
Command : jps
NameNode
ResourceManager

4.4.2. Check daemons on Slaves
Command : jps
DataNode
NodeManager
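
To confirm that both DataNodes have joined the cluster, you can also run an HDFS report from the master (the output should list two live DataNodes if everything started cleanly):
Command : bin/hdfs dfsadmin -report
(Note: Run this command on Master)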

5. Stop The Hadoop Cluster
     Let us now see how to stop the Hadoop cluster.
5.1. Stop YARN Services
Command : sbin/stop-yarn.sh
(Note: Run this command on Master)

5.2. Stop HDFS Services
Command : sbin/stop-dfs.sh
(Note: Run this command on Master)


Next Tutorial : HDFS Tutorial

Previous Tutorial : Deploy Hadoop on Single Node Cluster 

1 comment:

  1. Changing
    hdfs://localhost:9000
    to
    hdfs://master:9000
    worked for me
