Workflow Engine Comparison(First Impressions)

April 13, 2014 · ∞ -

I was looking at different options for workflow engines. I have some experience in Oozie, little experience in Luigi and no experience in Azkaban. In this post, I will try to give an overview of these engines in terms of their advantages and disadvantages. Take my word with a grain of salt(based on the experience I have with these tools), though.

Crons do not scale(Surprise!)

If you have a lot of processes which manipulate, transform and write data to database, you will sooner or later will face the limitations of the cron jobs. You want to be able to handle failures, debug processes and rerun the failed jobs. You want to have multiple scripts to run based on data availability, data dependency and time-based scheduling. You may want to also share the data workflow with many people where you cannot do any of the items with cron jobs.

What is sufficient?

Regular scheduling (depending on data availability and time-based)
Workflow of multiple jobs
You should be able to write workflows in the same way that you are writing programs
You should handle the errors and failures gracefully
Communication between cluster and client should be secure
Community support should be very good
Continuous integration(whenever you push to the master, it should adapt the changes(woohoo!))
Testing should be supported out of the box
Let’s cut to the chase; pretty much anything that you want to expect from a decent library or framework in terms of software quality and practices

Let’s write our own workflow engine

Resources are limited(time, effort, human resources)
Is your usage is too specific or could you make it work in one of the available tools?
No need to reinvent the wheel

What do we want from the workflow engines?

First and foremost: resilient to failures
Debugging is necessary and important advantage over cron jobs
If we have similar workflows, we should not write too much boilerplate code to make it all work
Support for databases, HDFS and common file formats
You should be able to write workflows in the same way that you are writing programs
You should handle the errors and failures gracefully
Communication between cluster and client should be secure
Community support should be very good
Continuous integration(whenever you push to the master, it should adapt the changes(woohoo!))
Testing should be supported out of the box
Default logging would be icing on the cake
Let’s cut to the chase; pretty much anything that you want to expect from a decent library or framework in terms of software quality and practices

Oozie

Advantages

Mature
Support from Hadoop community is strong
Documentation
Default support for Pig, ssh, java, filesystem
Coordinators: when data is available, do the computation. For recurring jobs, you do not need to explicitly configure the job flow.
Security
Built in authentication
It has own testing suite(Mini Oozie)

Disadvantages

XML(Verbose)
Control flow is somehow restrictive
Directed Acyclic Graph(Hard to rerun only a component after failure, perfectly goes along with Pig, though; Pig scripts also define DAG)
User Interface

Luigi

Advantages

Python!
Control flow is advanced as tasks are code
Dependencies between flows
Customization and code reuse through object-oriented programming

Disadvantages

Visualizer is not as good as Azkaban
No default support for Pig, Hive
No storage of history and generally persistent storage is lacking

Azkaban

Advantages

If you are using Voldemort, it supports out of the box
Visualizations for tasks (svg, interactive) is advanced
Authentication and Authorization
History of tasks(which are completed and which are not)
Plugins for Pig, Hive and many more
Web deployment

Disadvantages

Support is not as good as Oozie.
Scheduling is only time based. AFAIK, no data availability scheduling
Workflow is somehow limited and restrictive comparing to Luigi and even Oozie.