Pig 101
In this post, I'll be presenting the basics of the Pig system. Note that this is not a tutorial on how to write Pig programs, but rather a system overview. I will be focusing more on the compiler and the techniques used to translate Pig Latin scripts into Map-Reduce jobs.
Why use Pig?
So, what’s the motivation behind Pig? Why use it when you can write plain Java Map-Reduce programs? Simple... hmmm... Well, yes, if all you want to do is WordCount! But when it comes to other operations, like joins or complex algorithms, it’s not that trivial. Pig is very easy to use for people familiar with SQL or scripting languages, and it doesn’t require you to understand how Map-Reduce works. Pig Latin programs are also way smaller than their Java equivalents (obviously), so they’re probably much faster to write too! And the best part is that operations like join, sort, filter and group are already provided in Pig Latin (as you’d expect from a self-proclaimed high-level language :p)
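To make that concrete, here’s what a join looks like in Pig Latin; a hand-written Java Map-Reduce join would need input tagging, a custom partitioner and a lot more code. The file names, delimiter and schemas below are made up for illustration:

```pig
-- hypothetical inputs: 'users' (name,age) and 'visits' (name,url), comma-separated
users  = LOAD 'users'  USING PigStorage(',') AS (name:chararray, age:int);
visits = LOAD 'visits' USING PigStorage(',') AS (name:chararray, url:chararray);
joined = JOIN users BY name, visits BY name;
STORE joined INTO 'joinOutput';
```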
Not convinced? Then I hope the most popular Map-Reduce example will wash away all your doubts!
An example: WordCount
wordInput = LOAD 'input' USING TextLoader();
words = FOREACH wordInput GENERATE FLATTEN((TOKENIZE($0))) AS word;
grouped = GROUP words BY word;
result = FOREACH grouped GENERATE group AS key, COUNT(words) AS count;
STORE result INTO 'wordOutput';
Pig’s declarative nature makes it so much more intuitive to write data analysis programs :-)
You don’t have to think about keys and values or rack your brain over how to fit your problem into a map and a reduce function!
Running this script locally on my machine, giving as input the text of the first section of this post, produces the following output:
I 1
a 2
In 1
be 2
is 1
of 1
on 2
to 2
Pig 3
and 1
but 1
how 1
not 1
the 4
…
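For reference, a script like this can be run in Pig’s local mode, with no Hadoop cluster needed; the script file name below is just an assumption:

```shell
# run the WordCount script in local mode, reading and writing the local filesystem
pig -x local wordcount.pig
```

The -x local flag tells Pig to execute everything in a single local JVM instead of submitting Map-Reduce jobs to a cluster.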
System Overview
Here’s a simple diagram to show the system architecture:
Each node of the Logical Plan represents an operation of the script, and the arrows connecting the nodes show how data flows from one step to the next. Each node in this plan is then translated into one or more nodes of the Physical Plan, depending on the complexity of the operation. In the end, nodes are grouped together to form Map and Reduce operations. In our example, FOREACH and Local Rearrange can be performed inside the Mapper, while the Package and the next FOREACH can be performed inside the Reducer. The Global Rearrange and LOAD/STORE operations are taken care of by the Hadoop framework automatically.
More generally, there are some rules to follow in order to convert a Physical Plan into a Map-Reduce Plan:
- Convert each (CO)GROUP into a Map-Reduce job
  - The map assigns keys based on the BY clause
  - Each FILTER and FOREACH between the LOAD and the (CO)GROUP is pushed into the map function
  - Commands between (CO)GROUP operations are pushed into the reduce function
  - Tagging is performed in case of multiple input sets
- Each ORDER command is compiled into 2 Map-Reduce jobs
  - Job 1 samples the input to determine the key distribution
  - Job 2 generates roughly equal-sized partitions and sorts them
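If you want to see these rules applied to a concrete script, Pig’s EXPLAIN command prints the plans it builds. As a sketch, extending the WordCount script from the first section (the alias names are the ones used there):

```pig
-- ORDER triggers the 2-job sample-then-sort pattern described above
sorted = ORDER result BY count DESC;
-- prints the Logical, Physical and Map-Reduce plans for the alias
EXPLAIN sorted;
```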
These rules were designed for the initial version of the Pig system, so there is a good chance they have changed since. In the coming days I will be studying those rules in more detail and I’ll get back!
Until then, sweet coding!
V.
References and links
- Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. 2009. Building a high-level dataflow system on top of Map-Reduce: the Pig experience.
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing.
- http://pig.apache.org/
- http://www.cloudera.com/videos/introduction-to-apache-pig

