My Journey to the Cloud | http://vasia.posterous.com
Random thoughts, notes I keep, things I've heard, people I've met.

Tue, 08 May 2012 05:10:00 -0700
Basic Pig Operators | http://vasia.posterous.com/basic-pig-operators

In this post I will present some of the most common and useful Pig operators. I will explain how they operate on data and what results they produce, as well as how they are internally translated into Map-Reduce jobs and executed on the Hadoop execution engine.

Let me first recall how the compilation to Map-Reduce works. The compiler that transforms the Physical Plan into a DAG of Map-Reduce operators uses a predecessor depth-first traversal to generate the graph. When compiling an operator, the goal is to merge it into the existing Map-Reduce operator, i.e. into the current Map or Reduce phase. However, some operators, such as GROUP, require the data to be shuffled or sorted, so they force the creation of a new Map-Reduce operator. The new operator is connected to the previous one with a store-load combination.

  • FOREACH

FOREACH takes a record as input and generates a new one by applying a set of expressions to it. It is essentially a projection operator: it selects fields from a record, applies some transformations to them and outputs a new record. FOREACH is a non-blocking operator, meaning it can be included inside the current Map-Reduce operator.

[figure: FOREACH example]
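To make the semantics concrete, here is a toy sketch in plain Python (not Pig's actual implementation) of what a statement like `FOREACH records GENERATE $0, $1 + $2` does: each expression is applied independently to every record, producing one new record per input record.

```python
# Toy records: tuples of fields.
records = [("alice", 3, 7), ("bob", 5, 2)]

# FOREACH records GENERATE $0, $1 + $2 -- a per-record projection/transformation.
# Because each record is handled independently, FOREACH fits into the current
# Map (or Reduce) phase without forcing a new job.
def foreach(records, expressions):
    return [tuple(expr(r) for expr in expressions) for r in records]

projected = foreach(records, [lambda r: r[0], lambda r: r[1] + r[2]])
print(projected)  # [('alice', 10), ('bob', 7)]
```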

  • FILTER

FILTER selects the records from a dataset for which a predicate is true. Predicates may contain equality expressions, regular expressions, boolean operators and user-defined functions. FILTER is also non-blocking and can be merged into the current Map or Reduce plan.

[figure: FILTER example]
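Again as a toy Python sketch (not Pig's code): FILTER keeps exactly those records for which the predicate holds, one record at a time, which is why it is non-blocking.

```python
records = [("alice", 3), ("bob", 5), ("carol", 3)]

# FILTER records BY $1 == 3 -- keep only records satisfying the predicate.
# Like FOREACH, this is per-record work, so no shuffle is needed.
def pig_filter(records, predicate):
    return [r for r in records if predicate(r)]

kept = pig_filter(records, lambda r: r[1] == 3)
```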

  • GROUP BY

GROUP collects all records with the same key inside a bag. A bag is a Pig data structure which can be described as an unordered collection of tuples. GROUP generates records with two fields: the corresponding key, which is assigned the alias "group", and a bag with the collected records for this key.

[figure: GROUP BY example]

We can group on multiple keys and we can also GROUP "all". GROUP "all" will use the literal "all" as the key and will generate one and only one record with all the data in it. This can be useful if we would like to apply some kind of aggregation function, e.g. COUNT, to all our records.
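A minimal Python sketch of the output shape described above: each output record pairs a key (alias "group") with a bag of the records sharing that key, and GROUP "all" is the same operation with a constant key.

```python
from collections import defaultdict

records = [("a", 1), ("b", 2), ("a", 3)]

# GROUP records BY $0 -- output records have two fields:
# the key (alias "group") and a bag of all input records with that key.
def group_by(records, key_fn):
    bags = defaultdict(list)
    for r in records:
        bags[key_fn(r)].append(r)
    return [(key, bag) for key, bag in bags.items()]

grouped = group_by(records, lambda r: r[0])
# GROUP ... "all" uses the literal "all" as the single key,
# so exactly one record with all the data is produced:
grouped_all = group_by(records, lambda r: "all")
```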

GROUP is a blocking operator and it compiles down to three new operators in the Physical Plan: Local Rearrange, Global Rearrange and Package. It requires repartitioning and shuffling, which forces a Reduce phase to be created in the Map-Reduce plan. If we are currently inside a Map phase, this is not a big problem. However, if we are currently inside a Reduce phase, a GROUP will cause the pipeline to go through a whole new Map-Shuffle-Reduce cycle.

  • ORDER BY

The ORDER BY operator orders records by one or more keys, in ascending or descending order. However, what happens behind the scenes is much more interesting than you may imagine. ORDER is not implemented simply as Sort-Shuffle-Reduce. Instead, it forces the creation of two Map-Reduce jobs. The reason is that datasets often suffer from skew: most of the values are concentrated around a few keys, while other keys have far fewer corresponding values. This phenomenon would cause only a few of the reducers to be assigned most of the workload, slowing down the overall execution. The first Map-Reduce job that Pig creates performs a fast random sampling of the keys in the dataset. This job figures out the key distribution, which is used to balance the load among the reducers in the second job. However, just like in the case of Skew Join, this technique breaks the Map-Reduce convention that all records with the same key will be processed by the same reducer.
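The two-job idea can be sketched in a few lines of Python. This is a simplification under my own assumptions, not Pig's actual sampler: "job 1" samples the keys and picks range boundaries, "job 2" range-partitions the records so that each simulated reducer sorts a disjoint key range, and concatenating the reducer outputs yields a total order.

```python
import random

random.seed(0)  # only so this sketch is reproducible
records = [(random.randrange(100),) for _ in range(1000)]

# "Job 1": sample the keys and choose boundaries that split the sampled key
# distribution into roughly equal ranges -- one range per reducer.
def sample_boundaries(records, num_reducers, sample_size=100):
    sample = sorted(r[0] for r in random.sample(records, sample_size))
    step = len(sample) // num_reducers
    return [sample[i * step] for i in range(1, num_reducers)]

# "Job 2": range-partition on those boundaries; each "reducer" sorts its own
# partition, so concatenating the reducer outputs is globally sorted.
def order_by(records, num_reducers=4):
    boundaries = sample_boundaries(records, num_reducers)
    parts = [[] for _ in range(num_reducers)]
    for r in records:
        idx = sum(r[0] >= b for b in boundaries)  # which range the key falls into
        parts[idx].append(r)
    return [r for part in parts for r in sorted(part)]

ordered = order_by(records)
```

Note how records with the same key can land in different positions of the same range but always in the same partition, while the ranges themselves are sized from the sample to balance the load.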

  • JOIN

JOIN has been extensively discussed in this post.

  • COGROUP

COGROUP is a generalization of the GROUP operator, as it can group more than one input based on a key. Of course, it is a blocking operator and is compiled in a way similar to GROUP.

  • UNION

UNION is an operator that concatenates two or more inputs without joining them. It does not require a separate Reduce phase to be created. An interesting point about UNION in Pig is that it does not require the input records to share the same schema. If they do, the output will also have this schema. If the schemas are different, the output will have no schema and different records will have different fields. Also, UNION does not eliminate duplicates.
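A toy Python sketch of these semantics: plain concatenation, no duplicate elimination, and no schema requirement (note the inputs below have different field counts).

```python
a = [("alice", 3)]
b = [("bob", "x", True), ("alice", 3)]  # different "schema", plus a duplicate of a's record

# UNION simply concatenates its inputs: no matching, no duplicate elimination.
def pig_union(*inputs):
    return [r for inp in inputs for r in inp]

unioned = pig_union(a, b)
```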

  • CROSS

CROSS receives two or more inputs and outputs the cartesian product of their records. This means that it matches each record from one input with every record of all other inputs. If we have one input of n records and another of m records, CROSS will generate an output of n*m records. The output of CROSS usually results in very large datasets, so it should be used with care. CROSS is implemented in a quite complicated way: a CROSS logical operator is in reality equivalent to four operators:

[figure: the four operators CROSS compiles to]

The GFCross function is an internal Pig function and its behaviour depends on the number of inputs, as well as on the number of reducers available (specified by "parallel 10" in the script). It generates artificial keys and tags the records of each input in such a way that each pair of records matches exactly once, so all records of one input match all records of the other. If you are interested in more details, you can read the corresponding part of this book.
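Here is a simplified Python sketch of the idea behind GFCross, under my own assumptions rather than Pig's exact scheme: lay the reducers out as a g x g grid, send each left record to every cell of one (random) row and each right record to every cell of one (random) column. Any (left, right) pair then meets in exactly one cell, where the actual pairing happens.

```python
import random
from collections import defaultdict
from itertools import product

left = [("l%d" % i,) for i in range(3)]
right = [("r%d" % i,) for i in range(4)]

def cross(left, right, g=2):
    # cell -> (left records routed here, right records routed here)
    cells = defaultdict(lambda: ([], []))
    for r in left:
        row = random.randrange(g)        # artificial key: pick a row
        for col in range(g):             # replicate across that whole row
            cells[(row, col)][0].append(r)
    for r in right:
        col = random.randrange(g)        # artificial key: pick a column
        for row in range(g):             # replicate down that whole column
            cells[(row, col)][1].append(r)
    out = []
    for lefts, rights in cells.values():
        out.extend(product(lefts, rights))  # each pair meets in exactly one cell
    return out

pairs = cross(left, right)
```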

My conclusion from the above analysis was that even the Physical Plan is very dependent on the Map-Reduce framework and does not reflect the right level for my work to be done. (CO)GROUP is compiled down to three new operators and CROSS is compiled down to four, while they can be mapped directly to the CoGroup and Cross Input Contracts of Stratosphere. That led me to move up one level and start working on compiling the Logical Plan into a PACT plan.

It turned out that things are much simpler up there, but a lot more coding needs to be done. Just to illustrate the simplicity, I will use the script and plan generation from the Pig paper:

[figure: the example script from the Pig paper]

And this is how the Logical Plan is transformed into a Physical and then a Map-Reduce Plan:

[figure: the Logical, Physical and Map-Reduce Plans]

Now, this is how the Logical Plan could be compiled to a PACT Plan:

[figure: the PACT Plan]

Much simpler and much cleaner! I'm quite optimistic =)

And now I have to sit down and code this thing!

Until next time, happy coding!

V.

Permalink | Leave a comment  »

Mon, 07 May 2012 05:28:00 -0700
Pig's Logical Plan Optimizer | http://vasia.posterous.com/pigs-logical-plan-optimizer

Hello from *sunny* Stockholm!

It's been almost a month since my last thesis post and as I was hoping, Spring is finally here :D

It's been a crazy, busy and productive month though, so I will be updating you on my progress by writing two posts today!

This one is about Pig's Logical Plan Optimizer. In my previous posts (here and here) I have explained how Pig creates a data-flow graph, the Logical Plan, from a Pig Latin script, and then transforms this graph into a set of Map-Reduce jobs. The Logical Plan goes through the first compiler and is transformed into a Physical Plan, and the Physical Plan is then sent to the Map-Reduce compiler, which transforms it into a DAG of Map-Reduce jobs:

[figure: Logical Plan to Physical Plan to Map-Reduce Plan]

An intermediate and quite interesting stage, which is not visible in the above diagram, is the optimization of the Logical Plan. The initial Logical Plan is created by a one-to-one mapping of the Pig Latin statements to Logical Operators. The structure of this plan is of course totally dependent on the scripting skills of the user and can result in highly inefficient execution.

Pig performs a set of transformations on this plan before it compiles it to a Physical one. Most of them are trivial and have been long used in database systems and other high-level languages. However, I think they're still interesting to discuss in the "Pig context".

 

Rules, RuleSets, Patterns and Transformers

The base optimizer class is designed to accept a list of RuleSets, i.e. sets of rules. Each RuleSet contains rules that can be applied together without conflicting with each other. Pig applies each rule in a set repeatedly, until no rule is applicable any longer or a maximum number of iterations has been reached. It then moves to the next set and never returns to a previous one.
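The driver loop described above can be sketched in Python. Everything here is hypothetical (the plan representation, the `drop_trivial_filter` "rule"); only the apply-until-fixpoint-or-max-iterations control flow mirrors what the text describes.

```python
MAX_ITERATIONS = 100  # hypothetical cap, mirroring Pig's iteration limit

# Apply each rule in a RuleSet repeatedly until none changes the plan any
# longer (or the iteration cap is hit), then move on and never come back.
def optimize(plan, rule_sets):
    for rules in rule_sets:
        for _ in range(MAX_ITERATIONS):
            changed = False
            for rule in rules:
                new_plan = rule(plan)
                if new_plan != plan:
                    plan, changed = new_plan, True
            if not changed:
                break
    return plan

# A toy "rule" over a list-of-operators plan: drop a FILTER whose
# predicate is the constant true (it matches nothing useful).
def drop_trivial_filter(plan):
    for i, op in enumerate(plan):
        if op == ("FILTER", "true"):
            return plan[:i] + plan[i + 1:]
    return plan

plan = [("LOAD", "a"), ("FILTER", "true"), ("FOREACH", "$0"), ("FILTER", "true")]
optimized = optimize(plan, [[drop_trivial_filter]])
```

The rule only removes one match per invocation, which is exactly why the driver keeps re-applying it until a fixpoint is reached.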

Each rule has a pattern and an associated transformer. A pattern is essentially a sub-plan with specific node types. The optimizer will try to find this pattern inside the Logical Plan and if it exists, we have a match. When a match is found, the optimizer will then have to look more in depth into the matched pattern and decide whether the rule fulfils some additional requirements. If it does, then the rule is applied and the transformer is responsible for making the corresponding changes to the plan.

Some extra caution is needed in two places. The current pattern matching logic assumes that all the leaves in the pattern are siblings. You can read more on this issue here. This assumption creates no problems with the existing rules. However, when new rules are designed, it should be kept in mind that the pattern matching logic might need to be changed.

Another point that needs highlighting has to do with the actual Java implementation. When searching for a matching pattern, the match() method will return a list of all matched sub-plans. Each one of them is a subset of the original plan and the operators returned are the same objects as in the original plan, so changes made by a transformer affect the original plan directly.

 

Some Examples

  • ColumnMapKeyPrune

This rule prunes columns and map keys that are not needed. More specifically, it removes a column if it is mentioned in the script but never used, and a map key if it is never mentioned in the script.

  • FilterAboveForeach

Guess what? It pushes Filter operators above Foreach operators! However, it first checks that the field the Filter works on is present in the predecessor of the Foreach:

[figure: FilterAboveForeach example]

  • MergeFilter

As you can imagine, it merges two consecutive Filter operators by adding the condition of the second Filter to the condition of the first with an AND operator:

[figure: MergeFilter example]
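The equivalence MergeFilter relies on is easy to check in a toy Python sketch: filtering by p1 and then by p2 yields exactly the same records as a single filter by `p1 AND p2`, while making only one pass over the data.

```python
records = [(0, 1), (2, 3), (3, 9)]

def filter_op(predicate):
    return lambda recs: [r for r in recs if predicate(r)]

p1 = lambda r: r[0] > 1
p2 = lambda r: r[1] < 5

separate = filter_op(p2)(filter_op(p1)(records))        # two consecutive Filters
merged = filter_op(lambda r: p1(r) and p2(r))(records)  # one Filter after the rule
```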

  • MergeForeach

This rule merges Foreach operators, but it's not as simple as it sounds. There are a few additional requirements that need to be met. For example, if the first Foreach operator has a Flatten in its internal plan, the rule cannot be applied. The optimizer also checks how many times the outputs of the first Foreach are used by the second Foreach. The assumption is that if an output is referred to more than once, the overhead of calculating the expression multiple times might outweigh the benefits of applying this rule:

[figure: MergeForeach example]

There are several more optimization rules, but I hope the idea is clear from the examples I have already mentioned. All the optimizations performed at this level are general-purpose transformations, decoupled from the execution engine and the Map-Reduce model. However, this is no longer true after the transformation to a Physical Plan. This is also why I now understand that the integration alternatives I had in mind in late February are not worth implementing.

 

The reason will become clear with my next post very very soon.

Until then, happy coding :)

V.

Tue, 10 Apr 2012 08:03:00 -0700
Join Types in Pig | http://vasia.posterous.com/join-types-in-pig

This blog post is on joins! That trivial but extremely useful relational operation I know you're all familiar with!

Inner join, equi-join, natural join, theta-join, outer join, left-outer join, right-outer join, full-outer join, self join, semi-join...

I bet you remember the definitions and can tell the differences as easily as you remember the multiplication tables... Right! Once upon a time, I could too... just before my undergrad databases exam... hmmm...

Honestly, I've always found it hard to remember the specific details for all different types of joins available and I always need to refresh the concepts whenever I need to use a specific type. (Oh, how much I love wikipedia, hell yeah I do! :p)

 

Joins in Map-Reduce

No matter how common and trivial, join operations have always been a headache for Map-Reduce users. A simple Google search on "map-reduce join operation" will give you several blog posts, presentations and papers. The problem originates from Map-Reduce's static Map-Shuffle-Sort-Reduce pipeline and single-input second-order functions. The challenge is finding the most effective way to "fit" the join operation into this programming model.

There are two common strategies, and both consist of a single Map-Reduce job:

  • Reducer-side join: In this strategy, the map phase serves as a preparation phase. The mapper reads records from both inputs and tags each record with a label based on its origin. It then emits the records using the join key as the key. Each reducer then receives all records that share the same key, checks the origin of each record and generates the cross product. Slides 21-22 from this ETH presentation provide a very clear example.

 

  • Mapper-side join: The alternative comes from the introduction of Hadoop's distributed cache. This facility can be used to broadcast one of the inputs to all mappers and perform the join in the map phase. However, it is quite obvious that this technique only makes sense when one of the inputs is small enough to fit in the distributed cache!
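The reducer-side strategy described above can be simulated in a few lines of Python. This is a toy sketch (the `users`/`orders` data and the "U"/"O" tags are my own invention): the map phase tags and keys the records, the shuffle groups them by join key, and the reduce phase builds the per-key cross product.

```python
from collections import defaultdict
from itertools import product

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "mug")]

# Map: tag each record with its origin and emit (join_key, tagged_record).
mapped = [(u[0], ("U", u)) for u in users] + [(o[0], ("O", o)) for o in orders]

# Shuffle: group by join key (what Hadoop does between map and reduce).
by_key = defaultdict(list)
for key, tagged in mapped:
    by_key[key].append(tagged)

# Reduce: per key, separate the records by origin and emit their cross product.
joined = []
for key, tagged_records in by_key.items():
    lefts = [r for tag, r in tagged_records if tag == "U"]
    rights = [r for tag, r in tagged_records if tag == "O"]
    joined.extend(product(lefts, rights))
```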

 

Joins in Pig

Fortunately, Pig users do not need to program join operations themselves, as Pig Latin offers the JOIN statement. Also, since Pig is a high-level abstraction that aims to hide low-level implementation details, they do not need to care about the join strategy... Or do they?

Pig users can use the JOIN operator together with the USING keyword in order to select the join execution strategy. Pig offers the following advanced join techniques:

  • Fragment-Replicate Join: USING 'replicated'

It is advised to use this technique when a small table that fits in memory needs to be joined with a significantly larger table. The small table will be loaded into the memory of each machine using the distributed cache, while the large table will be fragmented and distributed to the mappers. No reduce phase is required, as the join can be implemented completely in the map phase. This type of join can only support inner and left-outer joins, since the replicated table is the one on the right. Pig implements this join by creating two map-only jobs. During the first one, the distributed cache is set up and the small input is broadcast to all machines. The second one actually performs the join operation.

The user must pay attention and keep in mind that the second table in the statement is the one loaded into memory, i.e. in the statement:

joined = JOIN A BY $0, B BY $0 USING 'replicated'

B is the input that will be loaded into memory. Extra care needs to be taken for one more reason when using this type of join: Pig will not check beforehand whether the specified input fits into memory, resulting in a runtime error in case it doesn't!
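A toy Python simulation of the fragment-replicate idea (my own data and helper names, not Pig's code): the small input B is "broadcast" as an in-memory hash table, and each mapper streams its fragment of the large input A past it, so no reduce phase is needed.

```python
from collections import defaultdict

A = [(1, "a1"), (2, "a2"), (2, "a2b"), (4, "a4")]  # large input, fragmented across mappers
B = [(1, "b1"), (2, "b2"), (3, "b3")]              # small input, fits in memory

# "Distributed cache": build the in-memory hash table that every mapper gets.
b_table = defaultdict(list)
for rec in B:
    b_table[rec[0]].append(rec)

def map_task(fragment):
    # Map-side join: probe the replicated table for each record of the fragment.
    return [(a, b) for a in fragment for b in b_table[a[0]]]

# Two mappers, each handling one fragment of A; no reducer at all.
joined = map_task(A[:2]) + map_task(A[2:])
```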

  • Merge Join: USING 'merge'

You should use this type of join when the inputs are already sorted by key. This is a variation of the well-known sort-merge algorithm, where the sort has already been performed :)

In order to execute this join, Pig will first run an initial Map-Reduce job that samples the second input and builds an index from the values of the join keys to their HDFS blocks. The second job takes the first input and uses the index to find the correct block for each key it is looking for. For each key, all records with that particular key are kept in memory and used to do the join. In other words, two pointers are maintained, one for each input. Since both inputs are sorted, only one lookup in the index is required.
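The two-pointer phase can be sketched in Python (the index-building job is omitted; this only shows the merge over two already-sorted inputs, with the right side's records for the current key buffered in memory, as described above).

```python
left = [(1, "l1"), (2, "l2"), (2, "l2b"), (5, "l5")]   # already sorted by key
right = [(2, "r2"), (2, "r2b"), (3, "r3"), (5, "r5")]  # already sorted by key

def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1                      # advance the left pointer
        elif left[i][0] > right[j][0]:
            j += 1                      # advance the right pointer
        else:
            key = left[i][0]
            matches = []                # buffer the right records for this key in memory
            while j < len(right) and right[j][0] == key:
                matches.append(right[j])
                j += 1
            while i < len(left) and left[i][0] == key:
                out.extend((left[i], m) for m in matches)
                i += 1
    return out

joined = merge_join(left, right)
```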

  • Skew Join: USING 'skewed'

The third and last type of join provided by Pig is the skew join. It is quite common that some keys in a dataset are a lot more popular than others, that is, most of the values correspond to a very small set of keys. Using the default algorithm in such a case would significantly overload some of the reducers in the system.

In order to overcome this problem, one can use Pig's skew join. Pig will first sample one of the inputs, searching for the popular keys whose records would not fit in memory. The rest of the records are handled by a default join. However, records that belong to one of the keys identified as popular are split among a number of reducers. The records of the other input that correspond to the split keys are replicated to each reducer that handles part of that key.

Skew is handled in one input only. If both tables are skewed, the algorithm will still work, but it will be significantly slower.

However, extra care should be taken when using this type of join! This algorithm breaks the Map-Reduce convention that all records with the same key will be processed by the same reducer! This could be dangerous or yield unexpected results if one tries to use an operation that depends on all records with the same key being in the same part file!
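A toy Python sketch of the split-and-replicate idea, under my own assumptions (in real Pig the hot-key set comes from the sampling job; here it is hard-coded): the skewed key's records from A are spread round-robin over several reducers, and B's matching records are replicated to every one of those reducers, so each pair still meets exactly once.

```python
from collections import defaultdict
from itertools import product

A = [(1, i) for i in range(6)] + [(2, 0)]  # key 1 is heavily skewed
B = [(1, "x"), (1, "y"), (2, "z")]

NUM_SPLITS = 3
hot_keys = {1}  # would be discovered by the sampling job in real Pig

# partition id -> (A records, B records); a partition is (key, split index).
partitions = defaultdict(lambda: ([], []))
for n, a in enumerate(A):
    split = n % NUM_SPLITS if a[0] in hot_keys else 0  # spread hot-key records
    partitions[(a[0], split)][0].append(a)
for b in B:
    if b[0] in hot_keys:
        for s in range(NUM_SPLITS):                    # replicate to every split
            partitions[(b[0], s)][1].append(b)
    else:
        partitions[(b[0], 0)][1].append(b)

# Each "reducer" joins only its own partition.
joined = []
for a_recs, b_recs in partitions.values():
    joined.extend(product(a_recs, b_recs))
```

Notice that records with the hot key now end up in several partitions, which is exactly the broken convention the warning above is about.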

 

Thoughts...

Pig's philosophy states that "Pigs are domestic animals", meaning that users should be able to control and modify its behaviour. This is one of the reasons why Pig does not have an optimizer that chooses among the available join strategies and leaves this choice to the user. However, this choice implies that users have a deep understanding of how the different techniques work, as well as adequate information about the format and distribution of the data they want to join. If this is not the case, a wrong choice will almost surely lead to severe execution overhead.

My scepticism comes from the high-level nature that such a system is supposed to offer. What do the users of such systems know and what should they know? In my understanding, the whole point of a high-level abstraction is to hide implementation details and low-level information about how the underlying framework works. And honestly speaking, I can't see how an optimizer would conflict with Pig's philosophy of being a "domestic animal". Maybe it could be designed so that it can be disabled.

How is all this related to my thesis? The truth is that I will probably have no time at all to look into this any further. On the other hand, it is interesting to point out that Stratosphere offers an almost natural way of expressing joins and other relational operations using its Input Contracts. The Match Contract essentially maps to an inner join, and the PACT compiler can choose the most effective execution strategy to implement it. The CoGroup Input Contract can be used to realize outer and anti-joins, while the Cross Contract can be used to implement all kinds of arbitrary theta-joins.

I personally find these kinds of issues really intriguing and although I will probably have to "push" them into "future work", I now have something to look forward to after my thesis is done =)

 

I hope it will be Spring already by the next time I post!

Until then, happy coding!

V.

 

PS: For more info on Pig's advanced relational operations, here is da book!

Wed, 21 Mar 2012 12:23:18 -0700
Pig's Hadoop Launcher | http://vasia.posterous.com/pigs-hadoop-launcher

This is a post on the functionality of the main class that launches Pig for Hadoop Map-Reduce and also a good starting point for developers wishing to contribute to the Pig project.

The class in question is MapReduceLauncher and it is found in the package org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.

It extends the abstract class Launcher, which provides a simple interface to:

  • reset the state of the system after launch
  • launch Pig (in cluster or local mode)
  • explain how the generated Pig plan will be executed in the underlying infrastructure

Other methods provided are related to gathering runtime statistics and retrieving job status information.

The most important methods of MapReduceLauncher are compile() and launchPig(). It is advised that launchers for other frameworks (i.e. other than Hadoop MR) should override these methods.

The compile method gets a Physical Plan and compiles it down to a Map-Reduce Plan. This is the point where all optimizations take place. A total of eleven different optimizations are possible at this stage, including combiner optimizations, secondary sort key optimizations, join optimizations etc. Optimizations will be the focus of the second phase of my thesis, when I will have to dig into these classes!

The launchPig method is more interesting to me at this point of my work. It receives the Physical Plan to be compiled and executed as a parameter and returns a PigStats object, which contains statistics collected during the execution.

In short, it consists of the following *simplified* steps:

  • Calls the compile method and retrieves the optimized Map-Reduce Plan
  • Retrieves the Execution Engine
  • Creates a JobClient Object

The JobClient class provides the primary interface for the user-code to interact with Hadoop's JobTracker. It allows submitting jobs and tracking their progress, accessing logs and status information. Usually, a user creates a JobConf object with the configuration information and then uses the JobClient to submit the job and monitor its progress.

  • Creates a JobControlCompiler object. The JobControlCompiler compiles the Map-Reduce Plan into a JobControl object

The JobControl object encapsulates a set of Map-Reduce jobs and their dependencies. It tracks the state of each job and has a separate thread that submits the jobs when they become ready, monitors them and updates their states. I hope the following diagrams will make this clear:

[figure: inside the JobControl object]
[figure: ControlledJob state diagram]

 

  • Repeatedly calls the JobControlCompiler's compile method until all jobs in the Map-Reduce Plan are exhausted
  • While there are still jobs in the plan, retrieves the JobTracker URL, launches the jobs and periodically checks their status, updating the progress and statistics information
  • When all jobs in the Plan have been consumed, checks for native Map-Reduce jobs and runs them
  • Finally, aggregates statistics, checks for exceptions, decides the execution outcome and logs it

Next Steps

When I did the analysis above (almost two weeks ago) I made a list of my next steps including:

  • Browse through the rest of the thesis-related Pig code, i.e. org.apache.pig.backend*
  • Identify the Classes and Interfaces that need to be changed
  • Identify Hadoop dependencies in the Pig codebase
  • Find the Stratosphere "equivalents" of JobControl, JobClient, JobConf etc.
  • Find out how to run a PACT program from inside Pig

Since then, I've been browsing through the Pig code and I have also started coding (finally!). I've identified many more classes and interfaces that need to be changed, even for the simplest version of the system I'm building, and I am certainly amazed by the amount of dependencies I've found and need to take care of... And it seems that finding "equivalents" is not a straightforward or easy task at all!

But the challenge has already been accepted! I'll be updating soon with my solutions :-)

Until then, happy coding!

V.

Tue, 28 Feb 2012 11:35:00 -0800
Pig and Stratosphere Integration Alternatives | http://vasia.posterous.com/pig-and-stratosphere-integration-alternatives

In this post I am going to present some alternative design choices concerning the actual implementation of the project, i.e. the integration of Pig and Stratosphere systems.

The main goal is to have a working system, such that Pig Latin scripts can be executed on top of the Nephele execution engine. However, performance is an issue, and of course, we wouldn't like to end up with a system slower than the current implementation :S The very motivation of this project is to overcome the limitations of the existing system, by exploiting Stratosphere's features.

The architectures of the two systems are shown side by side in the next diagram:

[figure: the Pig and Stratosphere architectures side by side]

The integration can be achieved in several ways and on different levels:

  • Translate MapReduce programs into PACT programs

This is the naive and straightforward way of solving the given problem. PACT already supports Map and Reduce Input Contracts, which can be used for a one-to-one transformation of the Map-Reduce Plan into a PACT Plan. The Logical and Physical Plans that are generated by Pig can be re-used without modification. It is obvious that this solution wouldn't provide any gains over the existing implementation. In fact, it should be slower, since it adds one more layer to the system architecture. However, it is the simplest approach and it will be my starting point, in order to better understand the frameworks' internals =)

[figure: alternative 1, translating Map-Reduce programs into PACT programs]

  • Translate the Physical Plan into a PACT Plan

This is a more natural solution and corresponds to the approach that would have been taken if Pig had been designed with Nephele in mind as the execution engine, instead of Hadoop. It involves completely replacing the MapReduce Plan with a PACT Plan, which will be generated directly from the Physical Plan. This way, the additional Input Contracts, such as Match, Cross and CoGroup, can be used to compile common operations, like joins. I hope and do expect this solution to be advantageous over the existing implementation. With this design, we should be able to exploit Stratosphere's advantages and reflect them as performance gains for certain classes of applications.

[figure: alternative 2, translating the Physical Plan into a PACT Plan]

  • Translate the MapReduce Plan (or even the Physical Plan) into a Nephele Job

If you just look at the two system architectures, as shown in the above figures, you might think that the more layers you take away, the faster the resulting system would be. For example, one could argue that getting rid of both high-level programming frameworks, Map-Reduce and PACT, would speed things up. However, merging at that point would mean re-implementing a job already done, i.e. compiling down to code that can be understood by an execution engine, such as Nephele (or Hadoop). A speedup in this case is quite improbable, and it would mean that there is something wrong with the PACT compiler. Well, I have no reason to suspect so, or any spare time to check this during the 3 months I have left :p

The solutions discussed here are not the only ones possible. One could think of and propose several variations at different levels. For example, in order to take full advantage of Stratosphere's flexibility, it would be reasonable to try to modify Pig at the level of the Physical Plan. Of course, there is the danger of messing up Pig's modularity and making it execution-engine dependent. Moreover, one could exploit Stratosphere's Output Contracts and implement optimization rules for cases such as grouping or joining pre-partitioned or already sorted data.

The thing I like about this project is that I constantly have more and more ideas about variations, optimizations and possible extensions. And every time I meet with my supervisor and his team, I fill my notebook with as many interesting and motivating thoughts from them all! However, I don't have all the time in the world, so I will focus on the first two alternatives for the purpose of my thesis.

Just a final conclusion, and something that I always have in mind while working on this project:

When any kind of abstraction is made, and this applies to high-level languages as well, there is always an overhead you have to pay in exchange for simplicity. The underlying system, whose details the user no longer needs to know, will be designed to take several decisions that will often differ from those an experienced low-level programmer would take.

However, the abstraction only has value provided that the frustration imposed on the user by the slowdown in accomplishing their job is lower than the satisfaction they get from being able to accomplish this job in a simpler way.

 

Hoping for a valuable abstraction!

Until next time, happy coding,

V.

Tue, 24 Jan 2012 02:24:00 -0800
First Dive into Pig Code | http://vasia.posterous.com/first-dive-into-pig-code

The Pig project code base is quite big and complex. In this post, I will focus on the back end side of the system, meaning the execution engine. The hierarchy of Pig’s back end looks roughly like this:

[figure: the hierarchy of Pig's back end]
Zooming in on the Hadoop execution engine, we get the following diagram:

[figure: the Hadoop execution engine]
 
However, the engine itself also has a front end and a back end.
 
The front end takes care of all compilation and transformation from one plan to another. First, the parser transforms a Pig Latin script into a Logical Plan. Semantic checks (such as type checking) and some optimizations (such as determining which fields in the data need to be read to satisfy the script) are done on this Logical Plan. The Logical Plan is then transformed into a PhysicalPlan. In the above hierarchy, PhysicalPlan lies under ExecutionEngine -> PhysicalLayer -> Plans. This Physical Plan contains the operators that will be applied to the data.

This PhysicalPlan is then passed to the MRCompiler, which lies under ExecutionEngine -> MapReduceLayer. This is the compiler that transforms the PhysicalPlan into a DAG of MapReduce operators. It uses a predecessor depth-first traversal of the PhysicalPlan to generate the compiled graph of operators. When compiling an operator, it first tries to merge it into the existing MapReduce operators, in order to keep the number of generated jobs as small as possible. A new MapReduce operator is introduced only for blocking operators and splits. The two operators are then connected using a store-load combination. The output of the MRCompiler is an MROperPlan object, which corresponds to the Map-Reduce plan to be executed.

This plan is then optimized, by using the Combiner where possible, by combining jobs that scan the same input data, etc.

The final set of MapReduce jobs is generated by the JobControlCompiler, which also lies under ExecutionEngine -> MapReduceLayer. It takes an MROperPlan and converts it into a JobControl object with the relevant dependency information maintained. The JobControl object is made up of Jobs, each of which has a JobConf. The conversion is done by the compile() method, which compiles all jobs that have no dependencies, removes them from the plan and returns a JobControl object. It must be called repeatedly with the same plan until the plan is exhausted.

The generated jobs are then submitted to Hadoop and monitored by the MapReduceLauncher.
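The "call compile() with the same plan until exhausted" protocol can be sketched in Python. The plan representation and names here are hypothetical; only the control flow mirrors the description above: each call picks up the jobs whose dependencies are satisfied, removes them from the plan, and the caller loops until nothing is left.

```python
# Hypothetical dependency plan: job name -> names of jobs it depends on.
plan = {"J1": [], "J2": ["J1"], "J3": ["J1"], "J4": ["J2", "J3"]}

# One "compile()" call: return the jobs whose dependencies are all done.
def compile_ready(plan, done):
    return [job for job, deps in plan.items() if all(d in done for d in deps)]

# The driver: call compile_ready with the same plan until it is exhausted.
done, waves = [], []
while plan:
    ready = compile_ready(plan, done)
    waves.append(sorted(ready))
    for job in ready:
        del plan[job]       # remove compiled jobs from the plan
        done.append(job)    # their completion unblocks successors
```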
 
In the back end, PigGenericMapReduce.Map, PigCombiner.Combine and PigGenericMapReduce.Reduce each use the pipeline of physical operators constructed in the front end to load, process and store data.

The goal of my project is to replace the Hadoop execution engine component with a new one, corresponding to the Nephele execution engine. It might sound easy, but it is not as trivial as it looks. Even though Pig was built with modularity in mind, trying to make it independent of the execution engine, it seems that this is not exactly the case. A lot of parameters are Hadoop-specific and there are a lot of dependencies outside the Hadoop packages that need to be taken care of.

Wish me luck!
I wish you happy coding :-)
V.

 

Wed, 18 Jan 2012 01:48:00 -0800 Pig 101 http://vasia.posterous.com/pig-101 http://vasia.posterous.com/pig-101
In this post, I'll be presenting the basics of the Pig system. Note that this is not a tutorial on how to write Pig programs, but rather a system overview. I will be focusing more on the compiler and the techniques used to translate Pig Latin scripts into Map-Reduce jobs.

Why use Pig?

So, what’s the motivation behind Pig? Why use it when you can write simple Java Map-Reduce programs? Simple... hmmm... Well, yes, if all you want to do is WordCount! But when it comes to other operations, like joins or complex algorithms, it’s not that trivial. Pig is very easy to use for people familiar with SQL or scripting languages, and it doesn’t require you to understand how Map-Reduce works. Pig Latin programs are also way smaller than the equivalent Java programs (obviously), so they’re probably much faster to write, too! And the best part is that operations like join, sort, filter and group are already provided in Pig Latin (as expected from a self-proclaimed high-level language :p)
Not convinced? Then I hope the most popular Map-Reduce example will wash away all your doubts!

An example: WordCount

wordInput = LOAD 'input' USING TextLoader();
words = FOREACH wordInput GENERATE FLATTEN(TOKENIZE($0)) AS word;
grouped = GROUP words BY word;  
result = FOREACH grouped GENERATE group AS key, COUNT(words) AS count;
STORE result INTO 'wordOutput';
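For comparison, here is the same dataflow written out step by step in plain Python (the input lines are made up); each stage mirrors one line of the Pig script above.

```python
# WordCount, one stage per Pig statement.
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog"]            # LOAD
words = [w for line in lines for w in line.split()]        # FOREACH ... FLATTEN(TOKENIZE($0))
grouped = defaultdict(list)                                # GROUP words BY word
for w in words:
    grouped[w].append(w)
result = {key: len(bag) for key, bag in grouped.items()}   # FOREACH ... COUNT(words)
print(result["the"])  # 2
```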

Pig’s declarative nature makes it so much more intuitive to write data analysis programs :-)
You don’t have to think about keys and values and squeeze your head on how to fit your problem into a map and a reduce function!
Running this script locally on my machine giving as input the text of the first section of this post produces the following output:

I 1
a 2
In 1
be 2
is 1
of 1
on 2
to 2
Pig 3
and 1
but 1
how 1
not 1
the 4
…           


System Overview

Here’s a simple diagram to show the system architecture:

 

The simple Pig Latin program we wrote above will get parsed, and a Logical Plan of operations will be created. This plan will be optimized and turned into a Physical Plan, which will feed the Map-Reduce compiler, which in turn will generate a Map-Reduce Plan. This plan will be optimized again and sent to Hadoop for execution:


Each node of the Logical Plan represents an operation of the script and the arrows connecting the nodes show how data flows from one step to the next. Each node in this plan is then translated into one or more nodes of the Physical Plan, depending on the complexity of the operation. In the end, nodes are grouped together to form Map and Reduce operations. In our example, FOREACH and Local Rearrange can be performed inside the Mapper, while the Package and the next FOREACH can be performed inside the Reducer. The Global Rearrange and the LOAD/STORE operations are taken care of by the Hadoop framework automatically.
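Roughly, Local Rearrange tags each tuple with its group key on the map side, and Package re-collects tuples with the same key into a bag on the reduce side. A toy Python sketch of these two operators (function names invented, with a local sort standing in for Hadoop's shuffle):

```python
# Local Rearrange (map side) + Package (reduce side), in miniature.
from itertools import groupby

def local_rearrange(tuples, key_index=0):
    """Map side: emit (key, tuple) pairs."""
    return [(t[key_index], t) for t in tuples]

def package(pairs):
    """Reduce side: collect tuples with the same key into a bag (here, a list)."""
    pairs = sorted(pairs, key=lambda kv: kv[0])   # stands in for the shuffle sort
    return {k: [t for _, t in grp]
            for k, grp in groupby(pairs, key=lambda kv: kv[0])}

pairs = local_rearrange([("pig",), ("hadoop",), ("pig",)])
bags = package(pairs)
print(bags["pig"])  # [('pig',), ('pig',)]
```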

More generally, there are some rules to follow in order to convert a Physical Plan into a Map-Reduce Plan:

  • Convert each (CO)GROUP into a Map-Reduce job
  • Map assigns keys based on the BY clause
  • Each FILTER and FOREACH between the LOAD and the (CO)GROUP are pushed into the map function
  • Commands between (CO)GROUP operations are pushed into the reduce function
  • Perform tagging in case of multiple input sets
  • Each ORDER command is compiled into 2 Map-Reduce jobs
    • Job 1 samples the input to determine key distribution
    • Job 2 generates roughly equal-sized partitions and sorts
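The two-job ORDER strategy can be illustrated with a toy range partitioner (this is only a sketch under invented names, not Pig's actual sampler): job 1 samples the keys to pick partition boundaries, and job 2 routes each key to the partition whose range contains it, so that once each partition is sorted locally the output is globally sorted.

```python
# Job 1: sample the keys to estimate boundaries that split the key
# space into roughly equal-sized ranges.
# Job 2: assign each key to the range (partition) that contains it.
import bisect
import random

def pick_boundaries(keys, num_partitions, sample_size=100):
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) // num_partitions
    return [sample[i * step] for i in range(1, num_partitions)]

def partition_of(key, boundaries):
    return bisect.bisect_right(boundaries, key)

random.seed(0)
keys = list(range(1000))
bounds = pick_boundaries(keys, 4)
parts = [partition_of(k, bounds) for k in keys]
sizes = [parts.count(p) for p in range(4)]
print(sizes)  # four roughly equal counts summing to 1000
```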
These rules were designed for the initial version of the Pig system, so there is a high chance they have changed since. In the coming days I will be studying those rules in more detail and I’ll get back!

Until then sweet coding!
V.


References and links

- Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. PVLDB, 2009.
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: a not-so-foreign language for data processing. SIGMOD, 2008.
- http://pig.apache.org/
- http://www.cloudera.com/videos/introduction-to-apache-pig

Mon, 16 Jan 2012 10:56:00 -0800 myThesisProject.init(); http://vasia.posterous.com/mythesisprojectinit http://vasia.posterous.com/mythesisprojectinit
Hej hej from white Stockholm!
Christmas holidays are over and I’m back in town!

This will be the last semester of my MSc, during which I will be working on my thesis in collaboration with the Swedish Institute of Computer Science (SICS). I am very excited about the project and this is the first of a series of posts I intend to do, describing my progress and discoveries :-)

So what is this super-interesting project I’m going to work on?

Before getting to that, I will have to make a small introduction on two systems:
Apache Pig and Stratosphere.

Pig is a platform for analyzing big data sets. It consists of a high-level declarative language, Pig Latin, and an execution engine that “translates” Pig scripts into Map-Reduce jobs.

Stratosphere is a data-processing framework being developed as a research project at TU Berlin. It provides a programming model for writing parallel data analysis applications and an execution engine, Nephele, able to execute dataflow graphs in parallel. You can think of it as an extension/generalization of Hadoop Map-Reduce, and it also shares a lot of ideas with Dryad.

Although right now it is only possible to execute Pig scripts on top of Hadoop, Pig is designed to be modular and it should be straightforward to deploy it on top of another execution engine. And this is exactly the initial idea of the project. Additionally, the current state of the project appears to have some limitations that make it about 1.5 times slower than native Map-Reduce at the moment. I believe that the Stratosphere architecture has several features that could be exploited in order to improve performance.

I am currently in the phase of studying the Pig architecture and the existing Hadoop compiler implementation. (Oh the joy of endless Java code :p )

Soon, I will post here my first findings, so stay tuned!

Until then, sweet coding!

V.

 
