Ansible (DevOps): Creating Roles to Set Up Hadoop


DevOps is a set of software development practices that combines Software Development (Dev) with Information Technology Operations (Ops). In this blog we will use Ansible, an IT configuration management and deployment tool, to automate the setup of a Hadoop cluster.


What is Hadoop?

Hadoop is a collection of open-source software utilities that uses many computers, connected over a network, to solve problems involving massive amounts of data and computation. It is maintained by the Apache Software Foundation.

What issue do we solve by automating Hadoop?

With the advancement of technology, time has become a major constraint. Many tasks are still done manually, which is slow. Today, almost every company faces the problem of storing and processing large amounts of data. When a new system is brought into the organization, all the existing configuration has to be deployed onto it to match the organization's needs, and making those changes by hand takes a lot of time. Such work can now be done through automation.

Why do we need to automate Hadoop?

Deploying a production-grade Hadoop cluster is a monumental task, and it can take a long time because every system needs to be configured for its specific purpose: data nodes for storage, nodes for job scheduling and processing, and so on. In this setup we will implement HDFS, which is mainly used for data storage.

Hadoop Architecture


Ansible Architectural Diagram:


The software and hardware requirements of this project are as follows:

| Software | Hardware |
| --- | --- |
| RHEL 7.5 and above | A minimum of 1 GHz processor |
| YAML, Jinja | A minimum of 1 GB RAM |
| Ansible, HTTPD, Hadoop | No strict hard disk requirement |



The steps to automate these infrastructure services are as follows:

Step 1: Install the Ansible package on Linux using yum with the command “yum install ansible”. Before this, yum must be configured.
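On a RHEL system with yum repositories already configured, the installation and a quick verification look like this (whether Ansible is available directly or through an extra repository depends on your RHEL version):

```shell
# Install Ansible from the configured yum repositories
yum install -y ansible

# Verify the installation
ansible --version
```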

Step 2: Then create an Ansible Galaxy role skeleton for the Hadoop cluster in a separate folder, such as a playbooks directory, using the command “ansible-galaxy init hadoop_cluster”. This Hadoop cluster is implemented to solve the big data problem.
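The ansible-galaxy init command generates a standard role directory layout. For a role named hadoop_cluster (the name here is illustrative), the skeleton looks roughly like this:

```
hadoop_cluster/
├── defaults/
│   └── main.yml
├── files/
├── handlers/
│   └── main.yml
├── meta/
│   └── main.yml
├── tasks/
│   └── main.yml
├── templates/
├── tests/
└── vars/
    └── main.yml
```

The tasks for each role go in tasks/main.yml, and Jinja templates such as core-site.xml.j2 go in templates/.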

Step 3: Next, put the client IPs in the hosts file so that Ansible can read them from there and the playbook can run on those systems automatically. The default location of the hosts file is “/etc/ansible/hosts”.

Step 4: Configure Ansible according to your needs in the ansible.cfg file.
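A minimal ansible.cfg sketch; the settings shown are common choices for a lab setup (disabling host key checking is convenient for a lab but insecure in production):

```ini
[defaults]
# Point Ansible at the inventory file from Step 3
inventory = /etc/ansible/hosts
# Skip SSH host key prompts (lab convenience only)
host_key_checking = False
```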

Step 5: Write Hadoop cluster roles to set up the master node and the slave nodes.

Step 6: Then create a site.yml file containing the code to import the Hadoop roles.

Step 7: Execute the file using the command “ansible-playbook site.yml”.

Step 8: In the client node role, we copy the Java and Hadoop setup files to the respective nodes.

Step 9: In the master node role, we copy the master node configuration, i.e. core-site.xml and hdfs-site.xml, to the master node machine.

Step 10: In the slave node role, we copy the slave node configuration, i.e. core-site.xml and hdfs-site.xml, to the slave node machines.

Step 11: Now we have to run the following commands:

On the name node: “hadoop-daemon.sh start namenode”

On the data nodes: “hadoop-daemon.sh start datanode”

Step 12: On the client machine, check the Hadoop setup by running the following command:

“hadoop dfsadmin -report”

Step 13: To upload a file, use the command: “hadoop fs -put filename /”

Step 14: To read a file back, use the command: “hadoop fs -cat /filename”

All the respective yml files are listed below:



Site.yml

```yaml
- name: deploy slave node
  import_playbook: sn.yml

- name: deploy client node
  import_playbook: cn.yml
```

Sn.yml

```yaml
- hosts: dn
  roles:
    - role: slavenode
```


The hosts (inventory) file:

```ini
[dn]
DATA NODE IP ansible_user=root ansible_password=redhat   # slave1
DATA NODE IP ansible_user=root ansible_password=redhat   # slave2
DATA NODE IP ansible_user=root ansible_password=redhat   # slave N

[nn]
MASTER NODE IP ansible_user=root ansible_password=redhat   # master
```

Client role: main.yml

```yaml
- command: "rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force"
- command: "rpm -ivh jdk-8u171-linux-x64.rpm --force"
- template:
    src: ".bashrc"
    dest: "/root/.bashrc"
- template:
    src: "core-site.xml.j2"
    dest: "/etc/hadoop/core-site.xml"
```

Master Node: main.yml

```yaml
- command: "rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force"
- command: "rpm -ivh jdk-8u171-linux-x64.rpm --force"
- template:
    src: ".bashrc"
    dest: "/root/.bashrc"
- file:
    path: /master
    state: directory
- template:
    src: "hdfs-site.xml"
    dest: "/etc/hadoop/hdfs-site.xml"
- template:
    src: "core-site.xml.j2"
    dest: "/etc/hadoop/core-site.xml"
- command: "hadoop namenode -format -force"
# - command: "hadoop-daemon.sh start namenode"
```


core-site.xml.j2

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    {% for i in groups["nn"] %}
    <value>hdfs://{{ i }}:9001</value>
    {% endfor %}
  </property>
</configuration>
```
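The hdfs-site.xml template referenced in the master node role is not reproduced in the listing above. A minimal sketch for Hadoop 1.x, assuming the /master directory created by the master node role is used as the name node's storage directory, might look like:

```xml
<configuration>
  <property>
    <!-- Directory where the name node stores its metadata;
         /master is the directory created by the master node role -->
    <name>dfs.name.dir</name>
    <value>/master</value>
  </property>
</configuration>
```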






