Oozie is a well-known workflow scheduler engine in the Big Data world and is widely used across the industry to schedule Big Data jobs. Oozie provides a simple and scalable way to define workflows for Big Data pipelines. Internally, Oozie runs as a Java web application in a servlet container. In this blog, we will look into what Oozie supports, how it works and how we can write workflows using Oozie.
What does Oozie support?
Oozie is well integrated with various Big Data frameworks and can schedule different types of jobs, such as MapReduce, Hive, Pig, Sqoop, Hadoop File System operations, Java programs, Spark, shell scripts and many more.
How does Oozie work?
Oozie itself is written in Java, but we can write our workflows using XML or Java. In XML, we define various action nodes and control nodes to describe the flow of our system.
For example, suppose you want to build a big data pipeline in which the first step imports some data from an Oracle database using a Sqoop job, the next step creates a schema/table for that data using Hive, and the final step runs some transformation/analytics with an Apache Spark job before saving the result back to the database. This overall pipeline can be specified using Oozie workflows.
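A pipeline like this could be laid out roughly as follows. This is only a sketch: the node names, the `create_table.hql` script, the Spark class and jar, and every `${...}` property (for example `${oracleJdbcUrl}` and `${stagingDir}`) are hypothetical and would come from your own job properties. The schema versions shown are common ones and may differ on your cluster.

```xml
<!-- Sketch of a three-step pipeline: Sqoop import, Hive table creation, Spark transformation. -->
<!-- All names and ${...} properties are illustrative assumptions. -->
<workflow-app name="sample-pipeline" xmlns="uri:oozie:workflow:0.5">
    <start to="import-data"/>

    <!-- Step 1: import data from the Oracle database with Sqoop -->
    <action name="import-data">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect ${oracleJdbcUrl} --table ${sourceTable} --target-dir ${stagingDir}</command>
        </sqoop>
        <ok to="create-table"/>
        <error to="fail"/>
    </action>

    <!-- Step 2: create the schema/table with a Hive script -->
    <action name="create-table">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>create_table.hql</script>
        </hive>
        <ok to="transform"/>
        <error to="fail"/>
    </action>

    <!-- Step 3: run the transformation/analytics with Spark -->
    <action name="transform">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>sample-transform</name>
            <class>org.example.SampleTransform</class>
            <jar>${sparkAppJar}</jar>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pipeline failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action names the node to run next on success (`ok`) and on failure (`error`), which is how the DAG is wired together.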
Terminologies in Oozie
A workflow is a definition file written in XML that defines a DAG (Directed Acyclic Graph) of action and control nodes describing the overall flow.
Action nodes trigger the execution of a task written in MapReduce, Pig, Hive, Sqoop and so on. You can also extend Oozie with customized action nodes. Examples of the action nodes supported by Oozie workflows are discussed later in this blog.
Control nodes control the flow path and also specify the start and end of a workflow. The control nodes supported by Oozie workflows are described below.
Starting & Ending a Workflow
These control nodes specify the start and end of your workflow. The start node points to the first node to run, and you can give the end node any arbitrary name.
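As a minimal sketch, the start and end nodes frame a workflow like this. The workflow name `sample-wf` and the node name `first-step` are hypothetical, and the fragment is not complete until a node called `first-step` is defined.

```xml
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <!-- The start node has no name of its own; it only points to the first node to run -->
    <start to="first-step"/>

    <!-- action and control nodes go here, including one named "first-step" -->

    <!-- Reaching the end node completes the workflow successfully -->
    <end name="end"/>
</workflow-app>
```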
Killing a Workflow
This control node kills the workflow and reports the reason given in its message field. You give the node a name and put the reason for the kill in the message field.
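A kill node might look like the sketch below. The node name `fail` is an arbitrary choice; the message uses Oozie's workflow EL functions `wf:errorMessage` and `wf:lastErrorNode` to report what went wrong.

```xml
<!-- Fragment: belongs inside a <workflow-app> element -->
<kill name="fail">
    <message>Workflow failed, error message [${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
```

Action nodes typically route to this node through their `<error to="fail"/>` transition.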
Making a Decision
A decision node decides the flow of a workflow through conditions written as switch-case statements. You can define any predicate, written as an EL expression, inside a case statement to control the flow.
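The sketch below shows one way a decision node could be written. It assumes a hypothetical job property `inputDir` and hypothetical target nodes `process-data` and `end`; `fs:exists` is one of Oozie's built-in HDFS EL functions.

```xml
<!-- Fragment: belongs inside a <workflow-app> element -->
<decision name="check-input">
    <switch>
        <!-- Cases are evaluated in order; the first predicate that is true wins -->
        <case to="process-data">${fs:exists(inputDir)}</case>
        <!-- Taken when no case matches -->
        <default to="end"/>
    </switch>
</decision>
```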
Splitting & Joining
You can split your execution path into concurrent execution paths using a fork control node, and you can wait for the concurrent execution paths to complete using a join control node.
For example, a fork control node can split the workflow into two concurrent MapReduce jobs named “firstjob” and “secondjob”. Only when both jobs have completed does the third MapReduce job, named “thirdjob”, get executed.
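That fork and join arrangement could be sketched as follows. The job names come from the text above; the fork and join node names, the `fail` kill node and the elided MapReduce configuration are assumptions for illustration.

```xml
<!-- Fragment: belongs inside a <workflow-app> element; a kill node named "fail" is assumed to exist -->
<fork name="fork-jobs">
    <path start="firstjob"/>
    <path start="secondjob"/>
</fork>

<action name="firstjob">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- mapper, reducer and input/output configuration omitted in this sketch -->
    </map-reduce>
    <ok to="join-jobs"/>
    <error to="fail"/>
</action>

<action name="secondjob">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- mapper, reducer and input/output configuration omitted in this sketch -->
    </map-reduce>
    <ok to="join-jobs"/>
    <error to="fail"/>
</action>

<!-- The join waits for every forked path before moving on to "thirdjob" -->
<join name="join-jobs" to="thirdjob"/>
```

Both forked actions must transition to the same join node, otherwise the workflow cannot continue to “thirdjob”.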
Though Oozie supports various types of jobs, in this blog we will discuss only MapReduce, Spark, Sqoop and HDFS action workflow samples.
If you want to execute MapReduce jobs using the Oozie engine, you need to define the job tracker, the name node details, the Mapper class, the Reducer class, and the input and output directories. You can also set more configuration parameters, but these are the base parameters.
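A MapReduce action carrying those base parameters might look like this sketch. The class names `org.example.SampleMapper` and `org.example.SampleReducer`, the directories and the `${...}` properties are hypothetical, and the `mapred.*` property names are the classic (old API) ones.

```xml
<!-- Fragment: belongs inside a <workflow-app> element -->
<action name="mr-node">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <!-- Remove any previous output so the job does not fail on an existing directory -->
            <delete path="${nameNode}/user/${wf:user()}/output"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.mapper.class</name>
                <value>org.example.SampleMapper</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>org.example.SampleReducer</value>
            </property>
            <property>
                <name>mapred.input.dir</name>
                <value>/user/${wf:user()}/input</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>/user/${wf:user()}/output</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>
```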
Oozie also supports running Sqoop commands directly in a workflow using Sqoop action nodes. You can also specify a saved Sqoop job, but a typical example is to fetch all records from the customers table in MySQL and store the result in an HDFS location.
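A Sqoop action for such an import could be sketched as below. The JDBC URL, the database name `shop`, the target directory and the `${dbUser}` and `${dbPasswordFile}` properties are all hypothetical placeholders.

```xml
<!-- Fragment: belongs inside a <workflow-app> element -->
<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- The command is the Sqoop command line without the leading "sqoop" -->
        <command>import --connect jdbc:mysql://db.example.com:3306/shop --username ${dbUser} --password-file ${dbPasswordFile} --table customers --target-dir /user/${wf:user()}/customers -m 1</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

Oozie splits the `<command>` text on whitespace, so arguments that themselves contain spaces are better passed as separate `<arg>` elements instead.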
Oozie workflows also let us work with HDFS storage and run HDFS commands.
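As a sketch, an HDFS (fs) action can delete, create, move and change permissions on paths. All paths below are hypothetical.

```xml
<!-- Fragment: belongs inside a <workflow-app> element -->
<action name="hdfs-ops">
    <fs>
        <!-- Commands run one after another, in the order listed -->
        <delete path="${nameNode}/user/${wf:user()}/tmp"/>
        <mkdir path="${nameNode}/user/${wf:user()}/archive"/>
        <move source="${nameNode}/user/${wf:user()}/output" target="${nameNode}/user/${wf:user()}/archive/output"/>
        <chmod path="${nameNode}/user/${wf:user()}/archive" permissions="755" dir-files="false"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
</action>
```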
Oozie also provides extended support for running Spark actions. You can define various configurations for your Spark job, such as the number of executors, output compression, the memory required and much more.
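A Spark action with a few of those settings might be sketched like this. The application name, class, jar location and argument are hypothetical, and the options passed in `<spark-opts>` are ordinary `spark-submit` options chosen for illustration.

```xml
<!-- Fragment: belongs inside a <workflow-app> element -->
<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn</master>
        <mode>cluster</mode>
        <name>sample-spark-job</name>
        <class>org.example.SampleSparkApp</class>
        <jar>${nameNode}/user/${wf:user()}/apps/spark/lib/sample-spark.jar</jar>
        <!-- Executor count, executor memory and output compression are set here -->
        <spark-opts>--num-executors 4 --executor-memory 2G --conf spark.hadoop.mapreduce.output.fileoutputformat.compress=true</spark-opts>
        <arg>/user/${wf:user()}/input</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```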
Running an Oozie Workflow
Killing an Oozie Workflow
Suspending an Oozie Workflow
Resuming an Oozie Workflow
Apache Airflow versus Oozie
Apache Airflow is a newer scheduling engine introduced in the Big Data world to schedule big data jobs. Its ability to build complex data pipelines has contributed to its immense popularity. However, in order to write workflows in Apache Airflow, you must know Python.
Oozie has been a well-known and widely adopted scheduling engine for about a decade, but recently Apache Airflow has been making its mark in the market. What is the difference between the two?
Oozie is written in Java internally, and workflows can be written in XML. So a non-programmer can also write workflows using Oozie, whereas in order to write workflows in Apache Airflow, you need to learn Python.
Oozie can be used to write simple pipelines, but for complex pipelines you can look to Apache Airflow, where you can write complex logic and integrate with cloud services, Kubernetes and many other technologies.
Though new ways of scheduling big data analytics pipelines, such as Apache Airflow, are coming into the market, Oozie is still the choice for many because of its simplicity. It is not well suited to building complex pipelines, but it caters well to simple use cases.