Pig and Stratosphere Integration Alternatives

In this post I present some alternative design choices for the actual implementation of the project, i.e. the integration of the Pig and Stratosphere systems.

The main goal is to have a working system, such that Pig Latin scripts can be executed on top of the Nephele execution engine. However, performance is an issue, and of course, we wouldn't like to end up with a system slower than the current implementation :S The very motivation of this project is to overcome the limitations of the existing system, by exploiting Stratosphere's features.

The architectures of the two systems are shown side by side in the next diagram:

[Figure: the Pig and Stratosphere architectures, side by side]

The integration can be achieved in several ways and on different levels:

  • Translate MapReduce programs into PACT programs

This is the naive and straightforward way of solving the given problem. PACT already supports Map and Reduce Input Contracts, which can be used to transform the MapReduce Plan into a one-to-one PACT Plan. The Logical and Physical Plans that are generated by Pig can be re-used without modification. Clearly, this solution wouldn't provide any gains compared to the existing implementation. In fact, I would expect it to be slower, since it adds one more layer to the system architecture. However, it is the simplest approach and it will be my starting point, in order to better understand the frameworks' internals =)

[Figure: alternative 1, translating the MapReduce Plan into a PACT Plan]
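To make the one-to-one idea a bit more concrete, here is a minimal Python sketch. It is not Pig or Stratosphere code: `MRJob`, `PactNode` and `translate_one_to_one` are hypothetical names I made up for illustration, and the only thing the sketch assumes is that every MapReduce job turns into a Map contract, followed by a Reduce contract when the job has a reduce phase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MRJob:
    """Hypothetical stand-in for one job of Pig's MapReduce Plan."""
    name: str
    has_reduce: bool = True  # map-only jobs have no reduce phase


@dataclass
class PactNode:
    """Hypothetical stand-in for one node of a PACT Plan."""
    contract: str  # "Map" or "Reduce" in this naive translation
    label: str


def translate_one_to_one(jobs: List[MRJob]) -> List[PactNode]:
    """Emit one Map node (plus one Reduce node, if present) per MapReduce job."""
    plan: List[PactNode] = []
    for job in jobs:
        plan.append(PactNode("Map", f"{job.name}.map"))
        if job.has_reduce:
            plan.append(PactNode("Reduce", f"{job.name}.reduce"))
    return plan


if __name__ == "__main__":
    jobs = [MRJob("job1"), MRJob("job2", has_reduce=False)]
    for node in translate_one_to_one(jobs):
        print(node.contract, node.label)
```

Note that the resulting plan only ever contains Map and Reduce nodes, which is exactly why this alternative cannot benefit from the other Input Contracts.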

  • Translate the Physical Plan into a PACT Plan

This is a more natural solution and corresponds to the approach that would have been taken if Pig had been designed with Nephele in mind as its execution engine, instead of Hadoop. It involves completely replacing the MapReduce Plan with a PACT Plan, which will be generated directly from the Physical Plan. This way, the additional Input Contracts, such as Match, Cross and CoGroup, could be used to compile common operations, like Joins. I hope and do expect this solution to be advantageous over the existing implementation. With this design, we should be able to exploit Stratosphere's advantages and turn them into performance gains for certain classes of applications.

[Figure: alternative 2, translating the Physical Plan into a PACT Plan]
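A rough sketch of what such a direct translation could look like is given below, again in plain Python rather than real Pig or Stratosphere code. The operator names and the `CONTRACT_FOR_OPERATOR` table are my own assumptions for illustration (Pig's real physical operators are more fine-grained); the point is only that a join can map to a single Match node instead of being squeezed into Map and Reduce.

```python
from typing import Dict, List, Tuple

# Assumed, simplified mapping from Pig-style operators to PACT Input Contracts.
# This table is illustrative only and is not taken from either code base.
CONTRACT_FOR_OPERATOR: Dict[str, str] = {
    "FILTER": "Map",
    "FOREACH": "Map",
    "GROUP": "Reduce",
    "JOIN": "Match",
    "CROSS": "Cross",
    "COGROUP": "CoGroup",
}


def translate_physical_plan(operators: List[str]) -> List[Tuple[str, str]]:
    """Pair every operator with the Input Contract assumed in the table above."""
    pact_plan: List[Tuple[str, str]] = []
    for op in operators:
        key = op.upper()
        if key not in CONTRACT_FOR_OPERATOR:
            # Fail loudly instead of silently guessing a contract.
            raise ValueError(f"no contract assumed for operator {op!r}")
        pact_plan.append((op, CONTRACT_FOR_OPERATOR[key]))
    return pact_plan


if __name__ == "__main__":
    for op, contract in translate_physical_plan(["filter", "join", "group"]):
        print(f"{op} -> {contract}")
```

Keeping the mapping in one table makes it easy to see which operators are covered, and an unknown operator raises an error rather than falling back to Map or Reduce.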

  • Translate the MapReduce Plan (or even Physical Plan) into a Nephele Job

If you just look at the two system architectures, as shown in the figures above, you might think that the more layers you take away, the faster the resulting system will be. For example, one could argue that getting rid of both high-level programming frameworks, MapReduce and PACT, would speed things up. However, merging at that level would mean re-implementing a job that has already been done, i.e. compiling down to code that can be understood by an execution engine such as Nephele (or Hadoop). A speedup in this case is quite improbable, and if it did happen it would mean that there is something wrong with the PACT compiler. Well, I have no reason to suspect so, nor any spare time to check this during the 3 months I have left :p

The solutions discussed here are not the only ones possible. One could think of and propose several variations at different levels. For example, in order to take full advantage of Stratosphere's flexibility, it would be reasonable to try and modify Pig at the level of the Physical Plan. Of course, there is the danger of messing with Pig's modularity and making it dependent on the execution engine. Moreover, one could exploit Stratosphere's Output Contracts and implement optimization rules for cases such as grouping or joining pre-partitioned or already sorted data.
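As a toy illustration of the kind of optimization rule I have in mind, the sketch below decides which steps a grouping needs, given what is already known about its input. `DataProperties` and `steps_for_grouping` are hypothetical names, and the rule itself is a simplification I am assuming for the example; it does not model any actual Output Contract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataProperties:
    """Hypothetical summary of what is known about an operator's input."""
    partitioned_on: Optional[str] = None  # key the data is partitioned on, if any
    sorted_on: Optional[str] = None       # key each partition is sorted on, if any


def steps_for_grouping(key: str, props: DataProperties) -> List[str]:
    """Return the steps needed to group on `key`, skipping work already done."""
    steps: List[str] = []
    needs_repartition = props.partitioned_on != key
    # Repartitioning shuffles records between partitions, so any existing
    # sort order cannot be relied on afterwards.
    needs_sort = needs_repartition or props.sorted_on != key
    if needs_repartition:
        steps.append("repartition")
    if needs_sort:
        steps.append("sort")
    steps.append("group")
    return steps


if __name__ == "__main__":
    print(steps_for_grouping("user", DataProperties()))
    print(steps_for_grouping("user", DataProperties(partitioned_on="user")))
    print(steps_for_grouping("user", DataProperties("user", "user")))
```

The more the system knows about the data, the shorter the list of steps becomes, which is where the hoped-for savings would come from.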

The thing I like about this project is that I constantly have more and more ideas about variations, optimizations and possible extensions. And every time I meet with my supervisor and his team, I fill my notebook with just as many interesting and motivating thoughts from them all! However, I don't have all the time in the world, so I will focus on the first two alternatives for the purpose of my thesis.

Just a final conclusion and something that I always have in mind while working on this project:

When any kind of abstraction is made, and this applies to high-level languages as well, there is always an overhead you have to pay in exchange for simplicity. The underlying system, whose details the user no longer needs to know, will be designed to make several decisions that often differ from those an experienced low-level programmer would make.

However, the abstraction only has value if the frustration imposed on the user by the slowdown in accomplishing their job is lower than the satisfaction they get from being able to accomplish it in a simpler way.

 

Hoping for a valuable abstraction!

Until next time, happy coding,

V.