PigPen, Hadoop, Pig, Clojure and All That
My humble contribution to ridiculous images on big data.
This post is about PigPen, a library that Netflix open sourced at the beginning of this year. In order to introduce the library, though, I first cover some background, namely Hadoop, Pig, and Clojure, which PigPen builds on top of. If you want to jump right to my rant on PigPen, scroll down a bit.
Hadoop
Hadoop (a.k.a. the distributed data processing framework for counting words), built on top of the idea of map-reduce from the famous Google paper, not only changed how we process data (it is a data processing framework, after all) but also created new perspectives on how we might think about data: what kinds of opportunities it could provide and how we can use the data we collect. The latter, the implicit role of collected data, has had a quite profound effect on different types of skills (data wrangling, data analysis, scraping, and so on), job types (suddenly you started to hear about data engineers, data scientists, and so on), and also company types (data companies that just collect data and provide it through a nice interface, possibly an API; data analytics companies that create nice visualizations and interfaces to interact with and learn from the data; big data consulting companies that provide consulting for big data technologies; and so on). Do not get me wrong, I am not saying Hadoop is the only cause of this change, but it plays a strong role, as it contributed to the change in how we perceive data.
Hadoop adoption was great at first, thanks to "big data" and its tireless marketers, early adopters, and media that repeatedly overloaded the phrase; or maybe it was all the developers who wanted to learn another cool new technology. I am not sure at this point (I bet on the "big data" gods, though). After some time, some researchers found out that Nobody ever got fired for using Hadoop on a cluster, which takes a skeptical view of Hadoop and advocates using memory as it becomes cheaper and cheaper. If you are following the big data space, they were almost suggesting using Spark. However, even if Hadoop's popularity suffered a little bit, its adoption did not seem to decrease over time. People who have big data still try to use Hadoop for their data processing systems.
However, one thing became clear after Hadoop became mainstream: nobody really wanted to write vanilla map-reduce programs. Not only were they very low level, but you also needed to write a lot of boilerplate for things that are very common in data processing systems. People wanted more abstraction than what Hadoop provides. They wanted other language interfaces, too. Because "writing Java has been one of the most fulfilling development experiences," said no experienced developer ever. (By the way, Java is a fine programming language, if not a great one, if you ask me.)
Abstractions, Abstractions, and more Abstractions
We need moar abstractions, said one junior developer who spent two days writing a buggy combiner. Not only more abstractions, we need functional programming language interfaces, said another junior developer who loves Haskell. We need better workflows for common tasks, said another developer. We already know SQL and hate it; why learn another domain-specific language only to hate it too, said a senior developer. Later, this developer would happily announce that he hates Hive as much as he hates SQL.
Software Abstraction Projects on Top of Hadoop
Software abstraction projects were then developed to satisfy some of these needs. The abstraction does not refer merely to software abstractions over the concepts Hadoop provides, but also to how common tasks (counting words) are abstracted so that minimal boilerplate code is required. (If a project offers a one-liner for counting words, it wins.)
Pig and Hive came into play first. Pig provides a nice language and data flow for data processing tasks, extensible through User Defined Functions (UDFs). I wrote a bit about Pig here if you want to learn more.
Hive, whose query language is very similar to SQL, has seen surprisingly good adoption. Data analysts who write SQL queries can start writing HiveQL queries to analyze the data, so the effort required to pick up the new language is minimal. Instead of asking people to learn a new language, this makes sense for companies that want to minimize effort and maximize the output of their employees.
If I compare the two, I would much prefer Pig, just because its language is better than SQL. In general, I strongly believe that the declarative style is not a very good fit for how you might want to interact with your data. Pig's transformation-based approach, especially treating transformations as first-class citizens, is very good in this regard.
Cascading
What if you do not want to learn any new technology, but want to write the code you write every day and have it magically process big data out of the box? Cascading is a project that abstracts the whole map-reduce flow and creates the pipeline for you out of the box. Cascading is the framework name, and it is for Java, but it has quite a lot of interfaces for different languages as well; the most prominent is Scalding for Scala. It also has a Clojure interface, Cascalog, and PigPen could be considered a competitor to Cascalog.
It has a bunch of cool projects on top of it, which I will mention here briefly; if you are interested, check them out on their website. In particular, Pattern seems quite nice for machine learning models, as it supports PMML, and you could deploy models tested on smaller data to big data seamlessly.
Pattern
Pattern combines the Predictive Model Markup Language (PMML) with Cascading to make machine learning deployment much easier. This is very nice for a number of reasons, but the most important one is that after you evaluate your machine learning model on a small dataset, applying the model in production is quite seamless.
Lingual
SQL for Cascading would be a good way to describe what Lingual is. It provides a SQL interface for Hadoop, which makes it easier to interact with the data. It also aims at interoperability, as you can run SQL queries on big data the same way you do at a much smaller scale. If you are using any third-party application on top of your data, you can also connect it through this interface.
Driven
It tries to visualize your data processing system built on top of Cascading: dependencies and other components. Similar to Lipstick.
Netflix's approach
Instead of Cascading, Netflix seems to have chosen Pig as the backbone for processing data, and they seem to build their stack on top of Pig. They put out Lipstick previously, which is very similar to Driven. Now they are putting out PigPen, which is very similar to Cascalog in spirit. The main (maybe not the only) difference is the underlying data processing environment.
Cascading generally differs from Pig and Hive in that it targets not quick-and-dirty scripts but a production environment you could deploy to. Conversely, if you want to do ad-hoc data analysis, Pig or even Hive might provide better solutions than Cascading does. Even for this reason, PigPen is a nice competitor to Cascalog in the Clojure league.
There are also Scoobi, Scrunch, and Spark that you may want to consider for your needs. Especially if your data could fit in memory on a good machine, Spark may be more suitable.
Pig
The Large Scale Machine Learning at Twitter paper gives a good overview of why Pig is good for machine learning and data processing in general. (If you are interested in the paper, its presentation is here.) It basically says that Pig is the backbone of the processing framework, where you can use the mighty Python libraries for scientific computing, data processing, and machine learning. Since you have a general-purpose programming language to depend on via User Defined Functions (UDFs), you can process your data pretty much however you want, independent of the big data aspects, as the platform built on Pig already handles query planning as well as the Hadoop side. Your only job is to write UDFs and combine the processing steps in Pig. Seems quite straightforward. Netflix also seems to adopt this approach, and I find it particularly good for machine learning and data processing in general. I also wrote my rants on these issues: previously on Pig, and on the original paper that explains the rationale behind Pig and some of its advantages and disadvantages over vanilla map-reduce.
I really like Pig and its approach to data processing, as I wrote previously. However, big data is a field that sees surprisingly good adoption and a growing community, so it is not unexpected that people keep pushing the field quite hard, trying to come up with better solutions for data processing problems. Pig is still being developed and sees improvements, new features (the cube and rollup operators) and bug fixes, which is great, but what about its shortcomings?
Disadvantages of Pig
- Even very simple tasks (I/O) require a lot of boilerplate code: importing jar files, registering macros, setting parameters, and so on.
- You need to maintain two and sometimes three different codebases (Java, Pig), or if you are using Jython for UDFs, Python as well.
- All of the scripts wrapped in Pig are specific to processing jobs. Therefore, even though UDFs are reusable across different jobs, the components of jobs are not. (To be fair, macros provide a limited mechanism to do so.)
- Partially due to the above, programs become scripts rather than full-fledged production workflows.
Functional Programming
Functional programming is a paradigm that puts an emphasis on functions that are pure and evaluated like mathematical functions. That means for the same input, they always yield the same result. This is one of the reasons it is popular: since the functions do not have side effects, programs become cleaner, more readable, and less buggy. There are a lot of functional programming languages: Common Lisp, Clojure, Erlang, Haskell, OCaml, and F#, to name a few. However, these languages also differ quite drastically, as some give more importance to types (Haskell) whereas others adopt a more dynamic approach (Clojure). Some programming languages are not pure functional languages but provide various structures and methods that enable developers to write in a functional style: Scala, Python, and C# are examples of this category.
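A tiny Clojure sketch of the difference (my own example):
(defn add-vat [price]
  (* price 1.18))          ; pure: the same input always yields the same output

(def vat-rate (atom 1.18))
(defn add-vat! [price]
  (* price @vat-rate))     ; impure: the result depends on mutable state outside the function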
Clojure
Clojure is a modern dialect of the Lisp programming language specifically targeted at the Java Virtual Machine (JVM). Rich Hickey, the inventor and main developer of Clojure, explains the rationale for why there is a need for Clojure here; I would like to replicate some of the core points here as well:
- Functional Programming Language
- Designed for Concurrency
- Immutable data structures + first-class functions
- Dynamic emphasis instead of strong typing (unlike Haskell)
- The JVM is a good platform, and the Java community and libraries can be useful.
- Write Java wherever you have to, extend it with Clojure wherever you can.
- Object-oriented programming is overrated.
- Inheritance is not the only way to do polymorphism.
Also, some of the design principles and features are listed here.
Syntax
Since it is a dialect of Lisp, and the language itself is homoiconic, the syntax and the layout of programs tend to be very similar. Most people hate the parentheses, but coming from Python, even if it is not as readable as Python, it is much more pleasant than, say, Java or Perl, which exploit a lot of different non-alphanumeric characters to form their syntax. This also makes learning the syntax of Clojure a piece of cake.
(def fibonacci
(lazy-cat [0 1] (map + (rest fibonacci) fibonacci)))
(take 10 fibonacci) ; Get first 10 fibonacci numbers
S-expressions are good, but they become quite hard to read if you are doing step-by-step data processing. Consider an example along these lines:
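(map str
     (map inc
          (filter pos? (vec (range -5 6)))))  ; keep the positives, increment, stringify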
We take a vector, keep only the positive values, increment the rest of the numbers, and return their string representations. However, it is hard to read. Especially if Clojure is to be useful for data processing, this becomes quite cumbersome, as we want to abstract each step (I/O, extract-transform-load (ETL), preprocessing, processing). Clojure provides the threading macro mechanism to handle this type of sequence processing.
Macros to the rescue
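Using the thread-last macro ->>, the pipeline could be written as:
(->> (vec (range -5 6))
     (filter pos?)
     (map inc)
     (map str))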
This is the same expression as before, yet much more readable. Since the order is also reversed, you can follow the execution order, whereas in the previous representation you have to read from the inner expressions outward. What would the equivalent of this processing be in Python?
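A first attempt could be a one-liner list comprehension:
str_vals = [str(x + 1) for x in range(-5, 6) if x > 0]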
Definitely very readable, but not perfect (Pythonic, though). Python supports functional programming to some degree, so we could use a functional approach to tackle the same problem:
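For instance, with lambda expressions all the way down (a sketch):
str_vals = list(map(lambda x: str(x),
                    map(lambda x: x + 1,
                        filter(lambda x: x > 0, range(-5, 6)))))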
We could improve the lambda expressions if we abstract them into functions, similar to the Clojure version:
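With pos and inc_ defined as small named helpers:
def pos(x):
    return x > 0

def inc_(x):
    return x + 1

str_vals = list(map(str, map(inc_, filter(pos, range(-5, 6)))))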
More readable and improved, yet it is still not Pythonic, and it still suffers from reading inner expressions outward. It is better to unroll it into intermediate steps:
pos_vals = filter(pos, range(-5, 6))
inc_vals = map(inc_, pos_vals)
str_vals = list(map(str, inc_vals))
Generally, Python programmers write programs like this, with intermediate variables, as they are much easier to debug and much easier to read. One-liners are cool, but neither readable nor maintainable. I chose this example because of the syntax similarities as well as the similarities in function names between Python and Clojure.
Clojure's threading macro provides a great mechanism; other languages have similar constructs. JavaScript, for example, provides method chaining, where you process data by cascading methods in a readable fashion. Scalding (the Scala implementation of Cascading) also provides a very similar syntax and feel; if you are not familiar with it, check it out as well.
PigPen
PigPen combines the Clojure awesomeness, namely functional programming, Lisp, and the JVM, with Pig. This is quite awesome for a number of reasons:
- You write Clojure, no Pig at all. I know I said Pig is a great language, but with either Python or Java UDFs it is harder to maintain two codebases than one. And for data processing, much as I like Pig's transformations-first approach, I prefer Clojure's functional programming style for dealing with data.
- You get to use Clojure wherever you want in the processing, seamlessly. Whether it is a preprocessing step or something you would otherwise have written a UDF for, it does not matter: use Clojure. Therefore, you are not limited to Pig functions at all. It is similar to UDFs, except you are writing the UDFs inside the overall flow.
- Clojure has not been around for a long time, and its library support is not great yet, so the libraries you want to use may not exist in Clojure. However, do not despair: JVM to the rescue! If a library exists in Java but not in Clojure, import the Java library inside Clojure. Clojure core itself exploits a number of Java methods, like Integer/valueOf to get the integer representation of a string. The JVM works for Clojure; exploit it wherever you can without paying the burden of writing Java.
- You could write composable programs, since Clojure provides much nicer abstractions than Pig. You could write general-purpose functions and reuse them all the time, e.g. for loading data, as in the sketch below.
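A minimal version, assuming tab-separated input (pig/load-tsv comes from pigpen.core; the function name is mine):
(require '[pigpen.core :as pig])

;; every job can start from the same loader
(defn input-data [input-path]
  (pig/load-tsv input-path))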
- Pig becomes a library that you just add to your dependencies; with Leiningen this is straightforward. (If you are using Jython for Python UDFs, jars and dependencies become a nightmare after some time.) To be able to use it, only the following line is needed in project.clj:
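Something along these lines; check PigPen's README for the version that is current when you read this:
:dependencies [[com.netflix.pigpen/pigpen "0.1.2"]] ; use the latest release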
The rest is handled by the mighty Leiningen for you.
It just works. Just works.
Yes, there are no jar files, no external binaries. A single line declares the PigPen dependency, and that is it.
- Pig is actually just another library that you import, and you can use Pig functions wherever you need to. Pig functions can be called after importing PigPen:
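For instance (the data shape and filtering logic here are made up, but pig/filter and pig/map are real pigpen.core functions):
(require '[pigpen.core :as pig])

;; familiar Clojure-style operations, executed as Pig transformations
(defn process [data]
  (->> data
       (pig/filter (fn [[user score]] (pos? (Long/valueOf score))))
       (pig/map (fn [[user score]] {:user user
                                    :score (Long/valueOf score)}))))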
Pig functions are good old Pig functions, with little to no difference from the actual Pig functions.
- Scripts are not specific to jobs; they are regular functions that can use other functions. See the example:
(defn process-data [input-path output-path]
  (->> (input-data input-path)  ; load
       (process)                ; transform
       (pig/dump)               ; execute and collect the results
       (spit output-path)))     ; write out
where process is your awesome logic that includes different transformations and does the heavy processing of your data. The reuse in this example should be obvious: all of the logic is in process, which can change independently from the I/O. Orthogonality is increased. You could further abstract the input and output if you want, as in the sketch below. This could be done in Pig at some level using macros, to be fair. But not this cleanly and not this straightforwardly. This is pure beauty.
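For instance, a job runner that takes the loader, the processing logic, and the writer as parameters (my own sketch, going one step beyond the example above):
(defn run-job [load-fn process-fn store-fn input-path output-path]
  (->> (load-fn input-path)
       (process-fn)
       (pig/dump)
       (store-fn output-path)))

;; the earlier process-data becomes just one instantiation
(run-job input-data process (fn [path data] (spit path data)) "in.tsv" "out.clj")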