How to Set Up Hadoop on Ubuntu 20.04

Hadoop is a free and open-source software framework written in Java. It is used to store and process large data sets on clusters of machines. With Hadoop, you can manage many dedicated servers as a single cluster.

Install and Configure Hadoop on Ubuntu

Update the System.

apt-get update
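
This guide runs the system-level commands as root. If you are logged in as a regular user with sudo privileges instead, prefix those commands with sudo, for example:

sudo apt-get update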

Install Java.

apt-get install openjdk-11-jdk

Check Java Version.

java -version 

Here is the command output.

openjdk version "11.0.11" 
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
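
The JDK path is needed later for the JAVA_HOME variable. On Ubuntu 20.04 the openjdk-11-jdk package installs under /usr/lib/jvm/java-11-openjdk-amd64; you can confirm the path on your system with:

readlink -f $(which java) | sed 's:/bin/java::'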

Create a User.

adduser hadoop 

Here is the command output.

  • Provide a password for the user when prompted.
Adding user `hadoop' ...
Adding new group `hadoop' (1002) ...
Adding new user `hadoop' (1002) with group `hadoop' ...
Creating home directory `/home/hadoop' ...
Copying files from `/etc/skel' ...
New password:
Retype new password:
passwd: password updated successfully
Changing the user information for hadoop
Enter the new value, or press ENTER for the default
        Full Name []:
        Room Number []:
        Work Phone []:
        Home Phone []:
        Other []:
Is the information correct? [Y/n] 
  • Type Y.

Log in as the hadoop user.

su - hadoop 

Provide the hadoop user password.

Configure the SSH Key.

ssh-keygen -t rsa 

Here is the command output.

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:QSa2syeISwP0hD+UXxxi0j9MSOrjKDGIbkfbM3ejyIk hadoop@ubuntu20
The key's randomart image is:
+---[RSA 3072]----+
| ..o++=.+        |
|..oo++.O         |
|. oo. B .        |
|o..+ o * .       |
|= ++o o S        |
|.++o+  o         |
|.+.+ + . o       |
|o . o * o .      |
|   E + .         |
+----[SHA256]-----+

Append the public key from id_rsa.pub to authorized_keys, then set the correct permissions.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
chmod 640 ~/.ssh/authorized_keys 

Verify the SSH authentication.

ssh localhost 

Here is the command output.

The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:JFqDVbM3zTPhUPgD5oMJ4ClviH6tzIRZ2GD3BdNqGMQ.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
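
When key-based authentication works, you get a shell on localhost without a password prompt. Exit it before continuing:

exit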

Install Hadoop

Log in as the hadoop user.

su - hadoop 

Download Hadoop.

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz 
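
Apache also publishes a .sha512 checksum file alongside the tarball. As an optional sanity check, you can download it and compare the hash of the archive by eye:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz.sha512 
sha512sum hadoop-3.3.0.tar.gz 
cat hadoop-3.3.0.tar.gz.sha512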

Extract the downloaded file.

tar -xvzf hadoop-3.3.0.tar.gz 

Rename the extracted directory to hadoop.

mv hadoop-3.3.0 hadoop 

Open the ~/.bashrc file.

vim ~/.bashrc 

Add the following lines.

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Activate the environment.

source ~/.bashrc 
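
Optionally confirm that the variables are set and the Hadoop binaries are on the PATH:

echo $HADOOP_HOME 
hadoop version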

Open the Hadoop environment file.

vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh 

Add the following line.

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Create the NameNode and DataNode data directories.

mkdir -p ~/hadoopdata/hdfs/namenode 
mkdir -p ~/hadoopdata/hdfs/datanode 

Open the core-site.xml file.

vim $HADOOP_HOME/etc/hadoop/core-site.xml 

Add the following lines. Use hdfs://0.0.0.0:9000 as the value instead if the NameNode should listen on all interfaces.

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://127.0.0.1:9000</value>
        </property>
</configuration>

Open the hdfs-site.xml file.

vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml 

Add the following lines.

<configuration>
 
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
 
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
        </property>
 
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
        </property>
</configuration>

Open the mapred-site.xml file.

vim $HADOOP_HOME/etc/hadoop/mapred-site.xml 

Add the following lines.

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>

Open the yarn-site.xml file.

vim $HADOOP_HOME/etc/hadoop/yarn-site.xml 

Add the following lines.

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>

Format the NameNode as the hadoop user.

hdfs namenode -format 

Here is the command output.

INFO namenode.FSImageFormatProtobuf: Image file /home/hadoop/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-address
************************************************************/

Start the Hadoop cluster.

start-dfs.sh 

Here is the command output.

Starting namenodes on [15.228.82.126]
15.228.82.126: Warning: Permanently added '15.228.82.126' (ECDSA) to the list of 
known hosts.
Starting datanodes
Starting secondary namenodes ip-address
ip-address: Warning: Permanently added 'ip-address' (ECDSA) to the list of known hosts.

Start the YARN service.

start-yarn.sh 

Here is the command output.

Starting resourcemanager
Starting nodemanagers

Check the status of all Hadoop services.

jps 

Here is the command output.

6032 ResourceManager
5625 DataNode
6523 Jps
5836 SecondaryNameNode
6206 NodeManager
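
If any of these daemons is missing, its log file under $HADOOP_HOME/logs usually explains why (the exact file name includes the user and host name), for example:

ls $HADOOP_HOME/logs/ 
tail -n 50 $HADOOP_HOME/logs/hadoop-hadoop-namenode-*.log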

Open ports 9870 and 8088 on the UFW firewall.

ufw allow 9870/tcp 
ufw allow 8088/tcp
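
If UFW is enabled, you can confirm that the rules were added:

ufw status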

Access the Hadoop NameNode web interface.

http://server-ip:9870

Here is the output.

Fig. 1: Hadoop NameNode web interface

Access the Resource Manager web interface.

http://server-ip:8088

Here is the output.

Fig. 2: Resource Manager web interface

Test the Hadoop Cluster.

Create directories in the HDFS filesystem.

hdfs dfs -mkdir /logs 
hdfs dfs -mkdir /example

List the directories.

hdfs dfs -ls / 

Here is the command output.

Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2021-07-19 15:27 /logs
drwxr-xr-x   - hadoop supergroup          0 2021-07-19 15:26 /example

Push log files from the local machine to the Hadoop file system.

hdfs dfs -put /var/log/* /logs/  
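
Verify that the files arrived in HDFS:

hdfs dfs -ls /logs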

Open the file browser in the Hadoop NameNode web interface.

http://server-ip:9870/explorer.html

Here is the output.

Fig. 3: HDFS file browser (explorer.html)
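
When you are finished, stop the cluster with the matching stop scripts:

stop-yarn.sh 
stop-dfs.sh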

 
