Pig Advantages and Disadvantages




February 08, 2014

Introduction

Apache Pig is a dataflow language built on top of Hadoop that makes it easier to process, clean and analyze “big data” without having to write vanilla MapReduce jobs.
It also has many relational database features: good old joins, distinct, union and many more operators are already in the language. What sets Pig apart from a relational database is its applicability to “big data”: it can crunch large files with ease and it does not need structured data. For the same reason, Pig is a natural fit for ETL (Extract, Transform, Load) tasks, since it can handle unstructured data. To tell the truth, that is one of the reasons it exists.
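To make the relational operators mentioned above concrete, here is a minimal sketch in plain Python (not Pig Latin) of what a union, a distinct and an inner join do to small in-memory relations. The sample records and the helper names `union`, `distinct` and `join` are made up for illustration; Pig's own operators run over Hadoop rather than Python lists.

```python
# Plain-Python analogy for three relational operators that Pig provides.
# The records and field layout below are hypothetical sample data.

users = [(1, "ada"), (2, "grace"), (3, "linus")]
clicks_day1 = [(1, "/home"), (2, "/about"), (1, "/home")]
clicks_day2 = [(3, "/home"), (1, "/pricing")]


def union(left, right):
    """Concatenate two relations, keeping duplicates."""
    return left + right


def distinct(rows):
    """Drop duplicate rows while preserving first-seen order."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def join(left, right, left_key, right_key):
    """Inner equi-join: pair each left row with every right row sharing its key."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


all_clicks = union(clicks_day1, clicks_day2)   # 5 rows, one duplicate
unique_clicks = distinct(all_clicks)           # 4 rows
joined = join(users, unique_clicks, 0, 0)      # user fields followed by click fields

for row in joined:
    print(row)
```

Each step takes a relation and produces a new one, which is the same shape of program you write in a dataflow language: a chain of named intermediate results rather than one nested query.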
But let’s ask the fundamental question: why does data analysis matter?

Data Analysis Matters

Data analysis matters because, as the original Pig paper puts it very well:

Data analysis is the “inner loop” of product innovation.

Companies that have data, and especially “big data”, want to automate some of their processes, make better products for their users, and create new products and platforms. Unless you happen to be Steve Jobs, or someone with a natural insight into what users and consumers want from a product and which new features it needs, you are dependent on data. User feedback, usage patterns, website log files and metrics are all things that make you run faster. They are not what you run with (that is the product itself) but how you run faster. (So much for the analogy.)

The Pig paper also introduces the basic motivation for Pig: why it is useful and how it fits into analytics and data processing in Hadoop. Moreover, as you read the paper you realize that the processing pipeline is actually a directed acyclic graph (DAG), and the paper goes into a little more depth on the theoretical aspects of Pig (the programming language).
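To illustrate what it means for a processing pipeline to be a directed acyclic graph, here is a small Python sketch. The step names and the `execution_order` helper are hypothetical, and the ordering uses Kahn's algorithm as a generic technique; this is not how Pig itself plans or schedules jobs.

```python
from collections import deque

# Hypothetical pipeline: each step maps to the steps whose output it consumes.
pipeline = {
    "load_users": [],
    "load_clicks": [],
    "filter_clicks": ["load_clicks"],
    "join": ["load_users", "filter_clicks"],
    "group": ["join"],
    "store": ["group"],
}


def execution_order(steps):
    """Return one valid execution order (Kahn's algorithm); raise ValueError on a cycle."""
    # Number of not-yet-finished inputs for every step.
    pending = {name: len(deps) for name, deps in steps.items()}
    # Reverse edges: which steps consume the output of a given step.
    consumers = {name: [] for name in steps}
    for name, deps in steps.items():
        for dep in deps:
            consumers[dep].append(name)

    ready = deque(name for name, count in pending.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for consumer in consumers[name]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)

    if len(order) != len(steps):
        raise ValueError("pipeline contains a cycle")
    return order


order = execution_order(pipeline)
print(" -> ".join(order))
```

Because the graph has no cycles, every step can run once all of its inputs are finished, and steps that do not depend on each other (the two loads here) could in principle run in parallel.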

So, what does Pig bring to the table, and what is it missing?

Advantages

Disadvantages

Some Pointers

If you want to apply some statistics to your dataset (and who does not nowadays, in order to get good analytics?), then you should check out DataFu. DataFu began at LinkedIn but is now an Apache Incubator project, and it has a lot of good tools for statistics and utility UDFs in general. Last month, Netflix released an interesting project named PigPen, which aims to bring Clojure awesomeness to writing Pig jobs. It is open source, so do not forget to check out the source code. I have not had a chance to use it, but the functional programming paradigm fits pipeline processing quite naturally, so I expect it to be quite successful (apart from Clojure’s own awesomeness).
