Pig and Stratosphere Integration Alternatives
In this post I am going to present some alternative design choices concerning the actual implementation of the project, i.e. the integration of Pig and Stratosphere systems.
The main goal is to have a working system, such that Pig Latin scripts can be executed on top of the Nephele execution engine. However, performance is an issue, and of course, we wouldn't like to end up with a system slower than the current implementation :S The very motivation of this project is to overcome the limitations of the existing system, by exploiting Stratosphere's features.
The architectures of the two systems are shown side by side in the next diagram:
The integration can be achieved in several ways and on different levels:
- Translate MapReduce programs into PACT programs
This is the naive and straight-forward way of solving the given problem. PACT already supports Map and Reduce Input Contracts, which can be used for the transformation of the Map-Reduce Plan into a one-to-one PACT Plan. The Logical and Physical Plans that are generated by Pig can be re-used without modification. It is obvious that this solution wouldn't provide any gains compared to the existing implementation. In fact, it should be slower, since it adds one more layer to the system architecture. However, it is the simplest approach and it will be my starting point, in order to better understand the framworks' internals =)
- Translate the Physical Plan into a PACT Plan
This is a more natural solution and corresponds to the approach that would have been taken if Pig had been designed having Nephele in mind as execution engine, instead of Hadoop. It includes completely replacing the MapReduce Plan by a PACT Plan, which will be generated directly from the Physical Plan. This way, the additional Input Contracts, such as Match, Cross and CoGroup, could be used to compile common operation, like Joins. I hope and do expect this solution to be advantageous over the existing implementation. With this design, we should be able to exploit stratosphere's advantages and reflect them as performance gains, in certain classes of applications.
- Translate the MapReduce Plan (or even Physical Plan) into a Nephele Job
When any kind of abstracton is made, and this applies as well for high-level languages, there is always an overhead you have to pay in exchange for simplicity. The underlying system, of which the details the user doesn't need to know anymore, will be designed to take several decisions that would often differ from those an experienced low-level programmer would take.
However, the abstraction only has value, provided that the frustration imposed to the user by the slow-down of accomplishing their job, is lower than the satisfaction they get by being able to accomplish this job in a simpler way.
Hoping for a valuable abstraction!
Until next time, happy coding,
V.




