Hadoop Multi Node Installation Guide
This document describes the steps needed to set up a distributed, multi-node Apache Hadoop cluster. The setup is a two-step process. First, install two single-node Hadoop machines, then configure and test them as local Hadoop systems. Second, merge the two single-node systems into a multi-node cluster, with one Ubuntu box designated as the master and the other as a slave. The master will also act as a slave for data storage and processing.
There are certain steps required before starting the Hadoop installation. They are listed below, and each is described with the commands needed for additional understanding.
Configure single-node Hadoop cluster
Hadoop requires an installation of Java 1.6 or higher; it is always better to go with the latest Java version. In our case we have installed Java 7u25.
- Please refer to the Installation document for Single Node Hadoop cluster setup, to set up the Hadoop cluster on each Ubuntu box.
- Use the same settings (installation paths, etc.) on both machines so that there are no issues when merging the two boxes into a multi-node cluster.
Once the single-node Hadoop clusters are up and running, we need to make configuration changes to make one Ubuntu box the “master” (this box will also act as a slave) and the other Ubuntu box the “slave”.
Networking plays a major role here. Before merging the servers into a multi-node cluster, make sure that both Ubuntu boxes can ping each other. Both boxes should be connected to the same network/hub so that they can speak to each other.
- Select one of the Ubuntu boxes to serve as the master and the other to serve as the slave. Make a note of their IP addresses. In our case, we have selected 172.16.17.68 as the master (the hostname of the master machine is Hadoopmaster) and 172.16.17.61 as the slave (the hostname of the slave machine is hadoopnode).
- Make sure that the Hadoop services are shut down before proceeding with the network set-up. Execute the following command on each box to shut down the services:
bin/stop-all.sh
- Provide both machines with these respective hostnames in their networking setup, i.e. the /etc/hosts file. Execute the following command:
sudo vi /etc/hosts
- Add the following lines to the file.
172.16.17.68 Hadoopmaster
172.16.17.61 hadoopnode
- In the event more nodes (slaves) are added to the cluster, the /etc/hosts file on each machine should be updated accordingly, using unique names (e.g. 172.16.17.xx hadoopnode1, 172.16.17.xy hadoopnode2, and so on).
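Sketched below is what the resulting /etc/hosts entries might look like with one additional node. The extra IP address (172.16.17.62) is hypothetical, and a temporary file stands in for /etc/hosts so the sketch can run anywhere:

```shell
# Illustrative /etc/hosts entries for this guide's cluster plus one
# hypothetical extra node (172.16.17.62 is an assumed address).
HOSTS_FILE=$(mktemp)   # stand-in for /etc/hosts (edited with sudo on a real box)
cat >> "$HOSTS_FILE" <<'EOF'
172.16.17.68 Hadoopmaster
172.16.17.61 hadoopnode
172.16.17.62 hadoopnode1
EOF
cat "$HOSTS_FILE"
```

On the real boxes, every machine in the cluster needs the same set of entries so that each host can resolve every other host by name.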
Enabling SSH Access
The Hadoop user hduser on the master (i.e. hduser@hadoopmaster) should be able to connect to
i) Its own user account, i.e. ssh Hadoopmaster
ii) Hadoop user account hduser on the slave (i.e. hduser@hadoopnode), via SSH
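The key pair for hduser normally already exists from the single-node setup; if it does not, one can be generated along these lines (a temporary directory stands in for ~/.ssh so the sketch is self-contained):

```shell
# Generate a passphrase-less RSA key pair for hduser.
# KEYDIR is a stand-in for /home/hduser/.ssh on the real master.
KEYDIR=$(mktemp -d)
ssh-keygen -t rsa -N "" -q -f "$KEYDIR/id_rsa"
ls "$KEYDIR"   # id_rsa (private key) and id_rsa.pub (public key)
```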
- Add hduser@Hadoopmaster’s public SSH key to the authorized_keys file of hduser@hadoopnode by executing the following command.
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoopnode
- The above command will prompt for the password of hduser on the slave; once the password is provided, it will copy the SSH key for you.
- Next, test the SSH setup by logging in as hduser on the master and connecting to the hduser account on the slave. This adds the slave’s host key fingerprint to the master’s known_hosts file. Also test the connection from master to master. Execute the following commands:
hduser@Hadoopmaster:~$ ssh Hadoopmaster
hduser@Hadoopmaster:~$ ssh hadoopnode
Hadoop Multi-Node Setup
There are certain steps required for the Hadoop multi-node setup. They are listed below, and each is described with the commands needed for additional understanding.
- Configure the conf/masters file (on Hadoopmaster)
The conf/masters file contains the list of machines on which Hadoop will start SecondaryNameNode daemons. On the master, edit this file and add the line “Hadoopmaster” as below:
vi conf/masters
Hadoopmaster
- Configure the conf/slaves file (on Hadoopmaster)
The conf/slaves file contains the list of machines on which Hadoop will run DataNode and TaskTracker daemons. On the master, edit this file and add the lines “Hadoopmaster” and “hadoopnode” as below:
vi conf/slaves
Hadoopmaster
hadoopnode
If additional nodes (slaves) are added later, list them in this file as well, one hostname per line.
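When nodes are added often, the slaves file can be regenerated from a host list instead of edited by hand. A minimal sketch (the extra hostname "hadoopnode1" is hypothetical, and a temporary file stands in for $HADOOP_HOME/conf/slaves):

```shell
# Rebuild a conf/slaves file from a list of hostnames, one per line.
# "hadoopnode1" is a hypothetical additional slave.
SLAVES_FILE=$(mktemp)   # stand-in for $HADOOP_HOME/conf/slaves
for host in Hadoopmaster hadoopnode hadoopnode1; do
  echo "$host" >> "$SLAVES_FILE"
done
cat "$SLAVES_FILE"
```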
- Configure the conf/core-site.xml.
The following configuration files should be updated on ALL machines so that they match (in our case, on both Hadoopmaster and hadoopnode): conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml.
In core-site.xml (conf/core-site.xml), change the fs.default.name parameter, to specify the hostname of the master. In our case it is Hadoopmaster.
<property>
  <name>fs.default.name</name>
  <value>hdfs://Hadoopmaster:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
- Configure the conf/hdfs-site.xml.
In hdfs-site.xml (conf/hdfs-site.xml), change the dfs.replication parameter to specify the default block replication. This defines how many machines each block of a file is copied to before it becomes available for use. In our case, this value is set to 2.
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
- Configure the conf/mapred-site.xml.
In mapred-site.xml (conf/mapred-site.xml), change the mapred.job.tracker parameter to specify the JobTracker host and port. In our case, this value is set to Hadoopmaster.
<property>
  <name>mapred.job.tracker</name>
  <value>Hadoopmaster:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
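Since these files must match on every machine, a quick way to compare a property's value across boxes is to extract it with sed. A minimal sketch, using an inline fragment in place of a real $HADOOP_HOME/conf/mapred-site.xml (the value shown assumes the master hostname used in this guide):

```shell
# Extract the <value> of a property from a config fragment. On a real
# node, point CONF at $HADOOP_HOME/conf/mapred-site.xml and compare the
# output across machines.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
<property>
  <name>mapred.job.tracker</name>
  <value>Hadoopmaster:54311</value>
</property>
EOF
sed -n 's:.*<value>\(.*\)</value>.*:\1:p' "$CONF"
```

Note that this naive pattern only handles `<value>` elements that sit on a single line, which is the case for these config files.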
There are some other configuration options worth studying. The following information is taken from the Hadoop API Overview.
In file conf/mapred-site.xml:
- mapred.local.dir: determines where temporary MapReduce data is written. It may also be a list of directories.
- mapred.map.tasks: as a rule of thumb, use 10x the number of slaves (i.e., number of TaskTrackers).
- mapred.reduce.tasks: as a rule of thumb, use num_tasktrackers * num_reduce_slots_per_tasktracker * 0.99. If num_tasktrackers is small (as in the case of this tutorial), use (num_tasktrackers - 1) * num_reduce_slots_per_tasktracker.
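As a worked instance of the small-cluster rule (assumed numbers: 2 TaskTrackers with 2 reduce slots each; the real slot count comes from mapred.tasktracker.reduce.tasks.maximum):

```shell
# (num_tasktrackers - 1) * num_reduce_slots_per_tasktracker
NUM_TASKTRACKERS=2       # Hadoopmaster and hadoopnode both run TaskTrackers
REDUCE_SLOTS_PER_TT=2    # assumed value of mapred.tasktracker.reduce.tasks.maximum
echo $(( (NUM_TASKTRACKERS - 1) * REDUCE_SLOTS_PER_TT ))   # prints 2
```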
Formatting the HDFS filesystem via NameNode
The first step in starting your Hadoop installation is formatting the Hadoop filesystem (HDFS), which is implemented on top of the local filesystems of your cluster. This step is required the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), execute the following command in the $HADOOP_HOME/bin directory.
hadoop namenode -format
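Because formatting destroys existing HDFS data, a defensive wrapper can first check whether the NameNode directory already holds metadata. A sketch, with DFS_NAME_DIR as an assumed stand-in for your dfs.name.dir path:

```shell
# Refuse to format if the dfs.name.dir already contains HDFS metadata
# (a formatted NameNode keeps its state under a "current" subdirectory).
DFS_NAME_DIR=$(mktemp -d)   # assumed example path; use your dfs.name.dir value
if [ -d "$DFS_NAME_DIR/current" ]; then
  echo "refusing to format: $DFS_NAME_DIR already contains HDFS metadata"
else
  echo "safe to format"
fi
```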
Starting the multi-node cluster
Starting the cluster is performed in 2 steps.
i) Starting the HDFS Daemons – the NameNode daemon is started on the master (Hadoopmaster) and the DataNode daemons are started on all slaves (in our case Hadoopmaster and hadoopnode).
ii) Starting the MapReduce Daemons – the JobTracker is started on the master (Hadoopmaster), and the TaskTracker daemons are started on all slaves (in our case Hadoopmaster and hadoopnode).
- HDFS Daemon.
Start HDFS by executing the following command on the master (Hadoopmaster). This will bring up HDFS with the NameNode on the master and the DataNode daemons on the machines listed in the conf/slaves file.
bin/start-dfs.sh
Run the following command to list the processes running on the master:
jps
- MapReduce Daemon.
Start MapReduce by executing the following command on the master (Hadoopmaster). This will bring up the MapReduce cluster with the JobTracker on the master and the TaskTracker daemons on the machines listed in the conf/slaves file.
bin/start-mapred.sh
Run the following command to list the processes running on the slave:
jps
Stopping the multi-node cluster
Like starting the cluster, stopping the cluster is also performed in 2 steps.
i) Stopping the MapReduce Daemons – the JobTracker daemon is stopped on the master (Hadoopmaster) and the TaskTracker daemons are stopped on all slaves (in our case Hadoopmaster and hadoopnode).
ii) Stopping the HDFS Daemons - the NameNode daemon is stopped on the master (Hadoopmaster) and the DataNode daemons are stopped on all slaves (in our case Hadoopmaster and hadoopnode).
- Stop MapReduce Daemon.
Stop MapReduce by executing the following command on the master (Hadoopmaster). This will shut down the JobTracker daemon and the TaskTrackers on the machines listed in the conf/slaves file.
bin/stop-mapred.sh
- Stop HDFS Daemon.
Stop HDFS by executing the following command on the master (Hadoopmaster). This will shut down HDFS by stopping the NameNode on the master and the DataNode daemons on the machines listed in the conf/slaves file.
bin/stop-dfs.sh