
Friday, 5 January 2018

Deploy Hadoop on Single Node Cluster

Prerequisites to Deploy Hadoop on Single Node Cluster
Step 1: Install Java 8 (Recommended Oracle Java) 
     Hadoop requires a working Java 1.5+ installation. However, using Java 8 is recommended for running Hadoop.
1.1 Install Python Software Properties
Command : sudo apt-get install python-software-properties

1.2 Add Repository
Command : sudo add-apt-repository ppa:webupd8team/java

1.3 Update the source list
Command : sudo apt-get update

1.4 Install Java 
Command : sudo apt-get install oracle-java8-installer
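     To confirm that the installation worked, a quick check along these lines may help. It is only a minimal sketch: java_major is a hypothetical helper (not part of Java or Hadoop) that reduces a version string such as 1.8.0_151 to its major number.

```shell
# Reduce a Java version string to its major number.
# Handles the legacy "1.x" scheme (1.8.0_151 -> 8) and the newer one (11.0.2 -> 11).
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # drop the leading "1." used up to Java 8
  esac
  echo "${v%%[._-]*}"     # keep everything before the first ".", "_" or "-"
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the quoted token on the first line
  ver=$(java -version 2>&1 | head -n 1 | cut -d'"' -f2)
  echo "Detected Java major version: $(java_major "$ver")"
else
  echo "java not found on PATH"
fi
```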

Step 2: Configure SSH
     Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
2.1 Install Open SSH Server-Client
Command : sudo apt-get install openssh-server openssh-client

2.2 Generate KeyPairs
Command : ssh-keygen -t rsa -P ""

2.3 Configure password-less SSH
Command : cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

2.4 Check by SSH to localhost
Command : ssh localhost
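     If ssh localhost still asks for a password, overly open permissions on the .ssh directory are a common cause, because sshd ignores authorized_keys in that case when StrictModes is enabled. The sketch below tightens the permissions and then tries a non-interactive login. fix_ssh_perms is a hypothetical helper name, and the check assumes the host key for localhost has already been accepted in step 2.4.

```shell
# Restrict an SSH directory and its authorized_keys file to the owner only.
fix_ssh_perms() {
  dir="$1"
  mkdir -p "$dir"
  touch "$dir/authorized_keys"
  chmod 700 "$dir"
  chmod 600 "$dir/authorized_keys"
}

fix_ssh_perms "$HOME/.ssh"

# BatchMode disables password prompts, so this succeeds only if key-based login works.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null; then
  echo "password-less SSH to localhost works"
else
  echo "password-less SSH to localhost is not working yet"
fi
```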

Step 3: Install Hadoop
3.1 Download Hadoop

     You can download any Hadoop 2.x release. Here I am using CDH, Cloudera’s 100% open source distribution of Hadoop.

3.2 Untar Tar ball
Command : tar xzf hadoop-2.5.0-cdh5.3.2.tar.gz

     All the required jars, scripts, configuration files, etc. are available in the HADOOP_HOME directory (hadoop-2.5.0-cdh5.3.2).
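     A quick sanity check of the extracted tree could look like the sketch below. check_hadoop_layout is a hypothetical helper, and the path assumes the tarball was extracted in the home directory; the bin, sbin and etc/hadoop directories are the ones used in the later steps.

```shell
# Return 0 if the given directory contains the sub-directories used in this tutorial.
check_hadoop_layout() {
  root="$1"
  for sub in bin sbin etc/hadoop; do
    if [ ! -d "$root/$sub" ]; then
      echo "missing: $root/$sub"
      return 1
    fi
  done
  echo "layout OK: $root"
}

# Assumed extraction location; adjust it to wherever the tarball was unpacked.
check_hadoop_layout "$HOME/hadoop-2.5.0-cdh5.3.2" || echo "extract the tarball first"
```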

Step 4: Setup Configuration
4.1 Edit .bashrc
     Edit the .bashrc file located in the user’s home directory and add the following line.
Command :  vi .bashrc
export HADOOP_PREFIX="/home/hdadmin/hadoop-2.5.0-cdh5.3.2"

     After this step, either restart the terminal or run the source command so that the environment variables take effect.

Command : source .bashrc
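     If you prefer to script this step, the export can be appended without creating duplicates on repeated runs. This is a sketch under assumptions: add_line_once and RC_FILE are hypothetical names, and the PATH line is an optional extra that is not required by the steps here. It writes to a scratch file unless RC_FILE is set.

```shell
# Append a line to a file only if that exact line is not already present.
add_line_once() {
  file="$1"
  line="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a scratch file; set RC_FILE="$HOME/.bashrc" to apply it for real.
RC_FILE="${RC_FILE:-$(mktemp)}"
add_line_once "$RC_FILE" 'export HADOOP_PREFIX="/home/hdadmin/hadoop-2.5.0-cdh5.3.2"'
# Optional (assumption): put the Hadoop commands on PATH.
add_line_once "$RC_FILE" 'export PATH="$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin"'
cat "$RC_FILE"
```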

4.2 Set JAVA_HOME in the Hadoop environment file
     This file contains the environment variables that are used by the scripts that run Hadoop, such as the Java home path. Edit this configuration file (located in HADOOP_HOME/etc/hadoop) and set JAVA_HOME.
Command : vi hadoop–
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/

     Here you can change the Java path according to your Java installation directory.
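     Before saving the path, it may be worth checking that it really points at a Java installation. The sketch below treats a directory as valid if it contains an executable bin/java; valid_java_home is a hypothetical helper, and /usr/lib/jvm is simply where Debian and Ubuntu packages usually install JVMs.

```shell
# Return 0 if the directory contains an executable bin/java.
valid_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

candidate="${JAVA_HOME:-/usr/lib/jvm/java-8-oracle}"
if valid_java_home "$candidate"; then
  echo "JAVA_HOME looks valid: $candidate"
else
  echo "no bin/java under $candidate"
  # Listing the installed JVMs can help find the right directory.
  ls /usr/lib/jvm 2>/dev/null || true
fi
```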

4.3 Edit core-site.xml 
     core-site.xml tells the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of Hadoop core, such as I/O settings, that are common to HDFS & MapReduce.
     Edit the configuration file core-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
Command : vi core-site.xml 

     /home/hdadmin/hdata is a sample location; please specify a location where you have Read Write privileges.
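     As a rough illustration, a minimal core-site.xml for a single node often looks like the one written by the sketch below. The values are assumptions rather than settings taken from this post: the NameNode address hdfs://localhost:9000 is a common choice, and /home/hdadmin/hdata is assumed to be the value of hadoop.tmp.dir. The file is written to a scratch directory (CONF_DIR, a hypothetical variable) so nothing real is overwritten.

```shell
# Write an illustrative core-site.xml into a scratch directory.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Assumed NameNode address for a single node setup -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Sample location; the Hadoop user needs read and write access to it -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdadmin/hdata</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```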

4.4 Edit hdfs-site.xml 
     hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.
     Edit the configuration file hdfs-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
Command : vi hdfs-site.xml 
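     For a single node, the replication factor is usually set to 1, since there is only one DataNode to hold each block. The sketch below prints an illustrative hdfs-site.xml built from a small helper; hadoop_property is a hypothetical function, and the value 1 is an assumption, not a setting quoted from this post.

```shell
# Print one Hadoop <property> element for the given name and value.
hadoop_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Assemble an illustrative hdfs-site.xml on stdout.
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<configuration>'
  # A single DataNode can hold only one copy of each block.
  hadoop_property dfs.replication 1
  echo '</configuration>'
}
```

     Redirect the output to a file if you want to compare it with your own hdfs-site.xml.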

4.5 Edit mapred-site.xml 
     mapred-site.xml contains configuration settings of MapReduce applications, such as the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available for a process, etc.
     In some cases the mapred-site.xml file is not available, so we have to create it from the mapred-site.xml template. Edit the configuration file mapred-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
Command : vi mapred-site.xml 
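     Creating the file from the template can be sketched as below. It assumes the template is named mapred-site.xml.template, as in Apache Hadoop 2.x releases; ensure_mapred_site is a hypothetical helper, and the default path assumes the tarball was extracted in the home directory.

```shell
# Create mapred-site.xml from the shipped template unless it already exists.
ensure_mapred_site() {
  conf_dir="$1"
  if [ -f "$conf_dir/mapred-site.xml" ]; then
    echo "mapred-site.xml already present"
  elif [ -f "$conf_dir/mapred-site.xml.template" ]; then
    cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
    echo "created mapred-site.xml from template"
  else
    echo "no template found in $conf_dir"
    return 1
  fi
}

ensure_mapred_site "${HADOOP_PREFIX:-$HOME/hadoop-2.5.0-cdh5.3.2}/etc/hadoop" || true
```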

4.6 Edit yarn-site.xml 
     yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as application memory management and scheduling settings. Edit the configuration file yarn-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
Command : vi yarn-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
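     For illustration only, a minimal yarn-site.xml commonly enables the shuffle service that MapReduce jobs need on each NodeManager. The sketch below writes such a file into a scratch directory; the property shown is an assumption about a typical single node setup, not an entry quoted from this post, and YARN_CONF_SCRATCH is a hypothetical variable.

```shell
# Write an illustrative yarn-site.xml into a scratch directory.
YARN_CONF_SCRATCH="${YARN_CONF_SCRATCH:-$(mktemp -d)}"

cat > "$YARN_CONF_SCRATCH/yarn-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Assumed: auxiliary shuffle service used by MapReduce on YARN -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

echo "wrote $YARN_CONF_SCRATCH/yarn-site.xml"
```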

Step 5: Start the Cluster
5.1 Format the name node:
Command : bin/hdfs namenode -format 

     This should be done only once, when you first install Hadoop. Formatting the NameNode again will delete all your data from HDFS.
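     To avoid formatting by accident, you can first check whether the NameNode directory has already been formatted. A formatted NameNode keeps a current/VERSION file in its storage directory. The sketch below assumes hadoop.tmp.dir is /home/hdadmin/hdata and that dfs.namenode.name.dir is left at its default, which places the metadata under dfs/name; is_formatted and NAME_DIR are hypothetical names.

```shell
# Return 0 if the NameNode storage directory has already been formatted.
is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Assumed location of the NameNode metadata (see the assumptions above).
NAME_DIR="${NAME_DIR:-/home/hdadmin/hdata/dfs/name}"

if is_formatted "$NAME_DIR"; then
  echo "NameNode already formatted at $NAME_DIR; skipping format"
else
  echo "not formatted yet; it is safe to run: bin/hdfs namenode -format"
fi
```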

5.2 Start HDFS Services
Command : sbin/

5.3 Start YARN Services
Command : sbin/

5.4 Check whether services have been started
     To check that all the Hadoop services are up and running, run the below command.
Command : jps
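     On a single node you would expect jps to list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. The sketch below reports which of these are missing from the jps output; missing_daemons is a hypothetical helper that compares the second column of each line against the expected names.

```shell
# Read jps output on stdin and print each expected daemon that is not listed.
missing_daemons() {
  out=$(cat)
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Exact match on the second column, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$out" | awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' \
      || echo "$d"
  done
}

if command -v jps >/dev/null 2>&1; then
  missing=$(jps | missing_daemons)
  if [ -z "$missing" ]; then
    echo "all five daemons are running"
  else
    echo "not running:" $missing
  fi
else
  echo "jps not found; it ships with the JDK"
fi
```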

Step 6: Stop the Cluster
6.1 Stop HDFS Services
Command : sbin/

6.2 Stop YARN Services
Command : sbin/

Next Tutorial : Hadoop Multinode Cluster Setup

Previous Tutorial : Apache Hadoop Introduction 
