Configuring Hadoop Cluster using Ansible Playbook

Suman Sourav
Apr 20, 2021

What is Ansible?

Ansible is an open-source software provisioning, configuration management, and application-deployment tool that enables infrastructure as code. It runs on many Unix-like systems and can configure both Unix-like systems and Microsoft Windows. It includes its own declarative language for describing system configuration. Ansible was written by Michael DeHaan and acquired by Red Hat in 2015. It is agentless: it connects temporarily over SSH or Windows Remote Management (which allows remote PowerShell execution) to do its tasks.

What is Hadoop?

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be handled automatically by the framework.

So let's start configuring

Prerequisites

  1. A manager node with Ansible configured
  2. 2 x DataNodes
  3. 1 x NameNode

STEP 1: Edit your Ansible inventory and add the IP addresses of the NameNode and the DataNodes, each under its own group (for example NameNode and DataNode).
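As a minimal sketch of such an inventory, the snippet below writes an INI-style file with one group per node type. The file name inventory.ini and the 192.0.2.x addresses are placeholders, not values from the original setup. The group name NameNode matches the one used by the verification command in STEP 11, and DataNode is assumed by analogy.

```shell
# Write a sample inventory.
# The file name and every IP address here are hypothetical placeholders.
cat > inventory.ini <<'EOF'
[NameNode]
192.0.2.10

[DataNode]
192.0.2.11
192.0.2.12
EOF
```

If this file is not your default inventory, pass it with -i inventory.ini when running ansible or ansible-playbook. SSH access from the manager node to all three hosts is assumed to be set up already.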

STEP 2: Create a folder named HadoopRole and, inside it, create an Ansible role named NameNode

mkdir HadoopRole && cd HadoopRole
ansible-galaxy init NameNode

STEP 3: Edit the main.yml inside the tasks folder of the NameNode role.
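The following is only a sketch of what such a tasks file could contain, not the exact file from the repository. It assumes that Hadoop and Java are already installed on the node with hadoop and hadoop-daemon.sh on the PATH, that the configuration directory is /etc/hadoop, and that /nn is the metadata directory. It uses the older Hadoop 1.x style commands, in line with the hadoop dfsadmin command used in STEP 11.

```shell
# Sketch only: paths /etc/hadoop and /nn are assumptions, and Hadoop/Java
# are assumed to be installed on the target already.
mkdir -p HadoopRole/NameNode/tasks
cat > HadoopRole/NameNode/tasks/main.yml <<'EOF'
---
# tasks file for the NameNode role (illustrative, not the author's exact file)
- name: Create the NameNode metadata directory (assumed path)
  ansible.builtin.file:
    path: /nn
    state: directory

- name: Render the Hadoop configuration templates
  ansible.builtin.template:
    src: "{{ item }}"
    dest: "/etc/hadoop/{{ item }}"
  loop:
    - core-site.xml
    - hdfs-site.xml

- name: Format the NameNode once (skipped when /nn/current already exists)
  ansible.builtin.shell: echo Y | hadoop namenode -format
  args:
    creates: /nn/current

- name: Start the NameNode daemon
  ansible.builtin.command: hadoop-daemon.sh start namenode
EOF
```

The creates argument keeps the format step from wiping an existing NameNode when the playbook is run again.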

STEP 4: Now, in the templates folder of the NameNode role, create two XML files named “core-site.xml” and “hdfs-site.xml” holding the NameNode's Hadoop configuration (the complete files are in the Git repository linked at the end):

  1. core-site.xml

  2. hdfs-site.xml
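As a rough sketch, the two NameNode templates could be created as shown here. The property names fs.default.name and dfs.name.dir are the Hadoop 1.x names (newer releases prefer fs.defaultFS and dfs.namenode.name.dir). Port 9001 and the /nn directory are assumptions and may differ from the original files.

```shell
mkdir -p HadoopRole/NameNode/templates

# core-site.xml: the address the NameNode listens on (port 9001 is an assumption).
cat > HadoopRole/NameNode/templates/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: where the NameNode keeps its metadata (/nn is an assumption).
cat > HadoopRole/NameNode/templates/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF
```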

STEP 5: Create an Ansible role named DataNode inside the HadoopRole folder using the command

ansible-galaxy init DataNode

STEP 6: Edit the main.yml inside the tasks folder of the DataNode role.
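A possible DataNode tasks file is sketched here under a few assumptions: Hadoop and Java are already installed on the node, the configuration directory is /etc/hadoop, and /dn is the block storage directory. Treat it as an illustration rather than the exact file from the repository.

```shell
# Sketch only: paths /etc/hadoop and /dn are assumptions, and Hadoop/Java
# are assumed to be installed on the target already.
mkdir -p HadoopRole/DataNode/tasks
cat > HadoopRole/DataNode/tasks/main.yml <<'EOF'
---
# tasks file for the DataNode role (illustrative, not the author's exact file)
- name: Create the DataNode storage directory (assumed path)
  ansible.builtin.file:
    path: /dn
    state: directory

- name: Render the Hadoop configuration templates
  ansible.builtin.template:
    src: "{{ item }}"
    dest: "/etc/hadoop/{{ item }}"
  loop:
    - core-site.xml
    - hdfs-site.xml

- name: Start the DataNode daemon
  ansible.builtin.command: hadoop-daemon.sh start datanode
EOF
```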

STEP 7: Now, in the templates folder of the DataNode role, create two XML files named “core-site.xml” and “hdfs-site.xml” holding the DataNode's Hadoop configuration (the complete files are in the Git repository linked at the end):

  1. core-site.xml

Here we use a Jinja template: the namenode_ip placeholder is replaced with the value of a variable that we will create in the next step.
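To make the templating concrete, here is a sketch of a DataNode core-site.xml that points at the NameNode through the namenode_ip variable. The port 9001 is an assumption and has to match whatever the NameNode's own core-site.xml uses.

```shell
mkdir -p HadoopRole/DataNode/templates

# The quoted 'EOF' keeps the shell from touching the {{ ... }} expression,
# so Ansible's template module sees it unchanged. Port 9001 is an assumption.
cat > HadoopRole/DataNode/templates/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ namenode_ip }}:9001</value>
  </property>
</configuration>
EOF
```

When the role runs, the template module renders {{ namenode_ip }} into the actual address before copying the file to the DataNode.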

  2. hdfs-site.xml
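A matching DataNode hdfs-site.xml might look like the sketch here. The property dfs.data.dir is the Hadoop 1.x name (newer releases prefer dfs.datanode.data.dir), and the /dn directory is an assumption.

```shell
mkdir -p HadoopRole/DataNode/templates

# hdfs-site.xml: where the DataNode stores HDFS blocks (/dn is an assumption).
cat > HadoopRole/DataNode/templates/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
EOF
```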

STEP 8: Edit the main.yml in the vars folder inside the DataNode role and define the namenode_ip variable.
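The vars file only needs the one variable used by the template. In this sketch the address 192.0.2.10 is a placeholder; replace it with the real IP address of your NameNode.

```shell
mkdir -p HadoopRole/DataNode/vars

# vars file for the DataNode role; the IP address is a hypothetical placeholder.
cat > HadoopRole/DataNode/vars/main.yml <<'EOF'
---
namenode_ip: "192.0.2.10"
EOF
```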

STEP 9: Create a playbook that applies the NameNode role to the NameNode host and the DataNode role to the DataNode hosts
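A minimal playbook along these lines is sketched here. The file name hadoop-cluster.yml is made up for illustration, and the host groups NameNode and DataNode are assumed to exist in your inventory.

```shell
mkdir -p HadoopRole

# Playbook with one play per role; the file name is a hypothetical choice.
cat > HadoopRole/hadoop-cluster.yml <<'EOF'
---
- name: Configure the NameNode
  hosts: NameNode
  roles:
    - NameNode

- name: Configure the DataNodes
  hosts: DataNode
  roles:
    - DataNode
EOF
```

The NameNode play comes first so that the NameNode is already running when the DataNodes start and try to reach it. Because the playbook sits in the same folder as the NameNode and DataNode role directories, Ansible can find the roles without extra configuration.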

STEP 10: Now run the playbook using

ansible-playbook <playbookname.yml>

STEP 11: Done. Check the Hadoop configuration and the DataNodes by running the command

ansible NameNode -m shell -a 'hadoop dfsadmin -report'

GIT : https://github.com/SammyFidato/ansiblehadoop
