Workflow Engine Comparison(First Impressions)
I was looking at different options for workflow engines. I have some experience in Oozie, little experience in Luigi and no experience in Azkaban. In this post, I will try to give an overview of these engines in terms of their advantages and disadvantages. Take my word with a grain of salt(based on the experience I have with these tools), though.
Crons do not scale(Surprise!)
If you have a lot of processes which manipulate, transform and write data to database, you will sooner or later will face the limitations of the cron jobs. You want to be able to handle failures, debug processes and rerun the failed jobs. You want to have multiple scripts to run based on data availability, data dependency and time-based scheduling. You may want to also share the data workflow with many people where you cannot do any of the items with cron jobs.
What is sufficient?
- Regular scheduling (depending on data availability and time-based)
- Workflow of multiple jobs
- You should be able to write workflows in the same way that you are writing programs
- You should handle the errors and failures gracefully
- Communication between cluster and client should be secure
- Community support should be very good
- Continuous integration(whenever you push to the master, it should adapt the changes(woohoo!))
- Testing should be supported out of the box
- Let’s cut to the chase; pretty much anything that you want to expect from a decent library or framework in terms of software quality and practices
Let’s write our own workflow engine
- Resources are limited(time, effort, human resources)
- Is your usage is too specific or could you make it work in one of the available tools?
- No need to reinvent the wheel
What do we want from the workflow engines?
- First and foremost: resilient to failures
- Debugging is necessary and important advantage over cron jobs
- If we have similar workflows, we should not write too much boilerplate code to make it all work
- Support for databases, HDFS and common file formats
- You should be able to write workflows in the same way that you are writing programs
- You should handle the errors and failures gracefully
- Communication between cluster and client should be secure
- Community support should be very good
- Continuous integration(whenever you push to the master, it should adapt the changes(woohoo!))
- Testing should be supported out of the box
- Default logging would be icing on the cake
- Let’s cut to the chase; pretty much anything that you want to expect from a decent library or framework in terms of software quality and practices
Oozie
Advantages
- Mature
- Support from Hadoop community is strong
- Documentation
- Default support for Pig, ssh, java, filesystem
- Coordinators: when data is available, do the computation. For recurring jobs, you do not need to explicitly configure the job flow.
- Security
- Built in authentication
- It has own testing suite(Mini Oozie)
Disadvantages
- XML(Verbose)
- Control flow is somehow restrictive
- Directed Acyclic Graph(Hard to rerun only a component after failure, perfectly goes along with Pig, though; Pig scripts also define DAG)
- User Interface
Luigi
Advantages
- Python!
- Control flow is advanced as tasks are code
- Dependencies between flows
- Customization and code reuse through object-oriented programming
Disadvantages
- Visualizer is not as good as Azkaban
- No default support for Pig, Hive
- No storage of history and generally persistent storage is lacking
Azkaban
Advantages
- If you are using Voldemort, it supports out of the box
- Visualizations for tasks (svg, interactive) is advanced
- Authentication and Authorization
- History of tasks(which are completed and which are not)
- Plugins for Pig, Hive and many more
- Web deployment
Disadvantages
- Support is not as good as Oozie.
- Scheduling is only time based. AFAIK, no data availability scheduling
- Workflow is somehow limited and restrictive comparing to Luigi and even Oozie.