Pig Not So Foreign Language Paper Notes




February 09, 2014

These are notes I took from the paper "Pig Latin: A Not-So-Foreign Language for Data Processing", in which the authors explain the design principles and some theoretical aspects of Pig Latin. I gave a basic overview in Pig Advantages and Disadvantages.

Before Pig

Before Pig and Hadoop, there was the mighty Map-Reduce paradigm for parallelization and data processing. An overview of data processing before Pig follows:

Map-Reduce Advantages

  1. Scale
    • Scalable due to simpler design
    • Only parallelizable operations
    • No transactions
  2. Runs on cheap commodity hardware (low cost)
  3. Procedural control - a processing “pipe”

Disadvantages

  1. Extremely rigid data flow (Map-Reduce)
  2. Common operations must be coded by hand (join, filter, projection, aggregation)
  3. Semantics are hidden inside map-reduce functions
    • Difficult to maintain, extend and optimize
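
To make the second and third points concrete, here is a minimal sketch of a hand-written map-reduce job. It is plain Python that simulates the map, shuffle, and reduce phases in memory; it does not use the Hadoop API, and the data and every name in it (`VISITS`, `map_visit`, `reduce_counts`, `run_map_reduce`) are hypothetical. Note how the filter, the projection, and the aggregation are all buried inside the two functions, so the framework cannot see them, let alone optimize them.

```python
from collections import defaultdict

# Hypothetical input: (user, url, seconds_on_page) records.
VISITS = [
    ("amy", "cnn.com", 12),
    ("amy", "bbc.com", 3),
    ("bob", "cnn.com", 40),
    ("bob", "cnn.com", 7),
    ("eve", "bbc.com", 25),
]


def map_visit(record):
    """Map step: the filter and the projection are hand-coded in here."""
    user, url, seconds = record
    if seconds >= 5:      # filter: keep visits of at least 5 seconds
        yield url, 1      # projection: keep only the url, emit (key, value)


def reduce_counts(key, values):
    """Reduce step: the aggregation (a count) is hand-coded in here."""
    return key, sum(values)


def run_map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


visit_counts = run_map_reduce(VISITS, map_visit, reduce_counts)
print(visit_counts)  # {'bbc.com': 1, 'cnn.com': 3}
```

Changing the job, for example to join against a second data set, means rewriting these functions by hand rather than adding one more step to a pipeline.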

What is needed is a high-level, general data-flow language.

Summary

Big demand for parallel data processing.

Building a High-Level Dataflow System on Top of MapReduce

What is Pig

Join implementations

HBase

Advantages of Pig

Needs

First, Pig Latin is the programming language and Pig is the data processing environment on top of Hadoop.

Pig Latin as a DataFlow Language

Nested Data Model

Why is the nested model better?

UDFs

All aspects of processing in Pig Latin can be customized through User Defined Functions.
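
The idea can be sketched in plain Python, assuming a toy data flow of three steps (filter, group, apply per group) where each step takes a user-supplied function. This is only an illustration of the concept, not Pig's actual UDF interface; the step names (`filter_by`, `group_by`, `for_each`), the two example UDFs, and the data are all hypothetical.

```python
from collections import defaultdict


# Generic data-flow steps: each one is customized by a user-supplied function.
def filter_by(rows, predicate):
    """Keep only the rows for which the user's predicate returns True."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key_fn):
    """Collect rows into one bag (list) per key; the result is nested data."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return dict(groups)


def for_each(groups, fn):
    """Apply the user's function to each bag of grouped rows."""
    return {key: fn(bag) for key, bag in groups.items()}


# User-defined functions (hypothetical examples).
def is_long_visit(row):
    return row["seconds"] >= 5


def total_seconds(bag):
    return sum(row["seconds"] for row in bag)


visits = [
    {"user": "amy", "url": "cnn.com", "seconds": 12},
    {"user": "amy", "url": "bbc.com", "seconds": 3},
    {"user": "bob", "url": "cnn.com", "seconds": 40},
    {"user": "bob", "url": "cnn.com", "seconds": 7},
    {"user": "eve", "url": "bbc.com", "seconds": 25},
]

long_visits = filter_by(visits, is_long_visit)
by_url = group_by(long_visits, lambda row: row["url"])
time_per_url = for_each(by_url, total_seconds)
print(time_per_url)  # {'cnn.com': 59, 'bbc.com': 25}
```

Each step stays visible as a named operation in the flow, while the custom logic lives in small functions that are passed in.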
