Pig's Hadoop Launcher
This post covers the functionality of the main class that launches Pig on Hadoop Map-Reduce; it should also be a good starting point for developers wishing to contribute to the Pig project.
The class in question is MapReduceLauncher, found in the package org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.
It extends the abstract class Launcher, which provides a simple interface to:
- reset the state of the system after launch
- launch Pig (in cluster or local mode)
- explain how the generated Pig plan will be executed in the underlying infrastructure
Other methods provided are related to gathering runtime statistics and retrieving job status information.
The most important methods of MapReduceLauncher are compile() and launchPig(). Launchers for other frameworks (i.e. frameworks other than Hadoop Map-Reduce) are advised to override these methods.
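To make this extension point concrete, here is a minimal, self-contained sketch of the pattern. Everything in it is a hypothetical simplification of my own: the String-based plans and the class bodies are toy stand-ins, not Pig's real classes or signatures.

```java
// Toy stand-ins for Pig's Launcher hierarchy -- a sketch of the
// extension pattern, not the real classes.
abstract class Launcher {
    // In real Pig these work on PhysicalPlan/PigStats objects; Strings
    // are used here only to keep the sketch self-contained.
    abstract String compile(String physicalPlan);
    abstract String launchPig(String physicalPlan);
}

class MapReduceLauncher extends Launcher {
    @Override
    String compile(String physicalPlan) {
        // The real method runs the MR compiler and its optimizers.
        return "MROperPlan(" + physicalPlan + ")";
    }

    @Override
    String launchPig(String physicalPlan) {
        String mrPlan = compile(physicalPlan);
        // The real method submits the jobs and monitors their progress.
        return "PigStats for " + mrPlan;
    }
}

class LauncherSketch {
    public static void main(String[] args) {
        Launcher launcher = new MapReduceLauncher();
        System.out.println(launcher.launchPig("LOAD->FILTER->STORE"));
    }
}
```

A backend for another framework would follow the same pattern: subclass Launcher and provide its own compile and launchPig.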
The compile method takes a Physical Plan and compiles it down to a Map-Reduce Plan. This is the point where all optimizations take place: a total of eleven different optimizations are possible in this stage, including combiner optimizations, secondary sort key optimizations, join optimizations and so on. Optimizations will be the focus of the second phase of my thesis, when I will have to dig into these classes!
The launchPig method is more interesting to me at this point in my work. It receives the Physical Plan to be compiled and executed as a parameter and returns a PigStats object, which contains statistics collected during the execution.
In short, it consists of the following *simplified* steps:
- Calls the compile method and retrieves the optimized Map-Reduce Plan
- Retrieves the Execution Engine
- Creates a JobClient Object
The JobClient class provides the primary interface for user code to interact with Hadoop's JobTracker. It allows submitting jobs, tracking their progress, and accessing logs and status information. Usually, a user creates a JobConf object with the configuration information and then uses a JobClient to submit the job and monitor its progress.
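The configure-submit-poll workflow just described can be sketched with toy stand-ins. ToyJobConf, ToyJobClient, ToyRunningJob and the property key below are all hypothetical simplifications of mine; the real classes live in org.apache.hadoop.mapred and talk to an actual cluster.

```java
import java.util.*;

// Toy model of the JobConf -> JobClient -> RunningJob workflow.
class ToyJobConf {
    final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
}

class ToyRunningJob {
    private int progress = 0;                 // 0..100 percent
    boolean isComplete() { return progress >= 100; }
    void poll() { progress += 50; }           // pretend the cluster advanced
}

class ToyJobClient {
    // Submission returns a handle the caller can poll, as JobClient does.
    ToyRunningJob submitJob(ToyJobConf conf) { return new ToyRunningJob(); }
}

class SubmitSketch {
    public static void main(String[] args) {
        ToyJobConf conf = new ToyJobConf();
        conf.set("job.name", "pig-script-job");    // hypothetical key
        ToyRunningJob job = new ToyJobClient().submitJob(conf);
        while (!job.isComplete()) {
            job.poll();   // in real code: sleep, then query the JobTracker
        }
    }
}
```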
- Creates a JobControlCompiler object. The JobControlCompiler compiles the Map-Reduce Plan into a JobControl object
The JobControl object encapsulates a set of Map-Reduce jobs and their dependencies. It tracks the state of each job and has a separate thread that submits the jobs when they become ready, monitors them and updates their states. I hope the following diagrams will make this clear:
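As a rough, self-contained model of this behaviour: the Toy* classes below are my own simplification, not Hadoop's JobControl, and the control loop runs inline here instead of in a separate monitoring thread.

```java
import java.util.*;

// Toy model of JobControl: jobs with dependencies, submitted only when
// every job they depend on has finished.
class ToyJob {
    final String name;
    final List<ToyJob> deps = new ArrayList<>();
    boolean done = false;
    ToyJob(String name) { this.name = name; }
    void addDependency(ToyJob job) { deps.add(job); }
    boolean ready() { return !done && deps.stream().allMatch(d -> d.done); }
}

class ToyJobControl {
    private final List<ToyJob> jobs = new ArrayList<>();
    final List<String> submissionOrder = new ArrayList<>();
    void addJob(ToyJob job) { jobs.add(job); }
    boolean allFinished() { return jobs.stream().allMatch(j -> j.done); }

    // Hadoop runs this logic in its own thread; here it runs inline.
    // Assumes the dependency graph is acyclic, as an MR plan is.
    void run() {
        while (!allFinished()) {
            for (ToyJob job : jobs) {
                if (job.ready()) {
                    submissionOrder.add(job.name);  // "submit" the job
                    job.done = true;                // pretend it completed
                }
            }
        }
    }
}
```

For example, a join job that depends on a load job is held back until the load job finishes, regardless of the order in which the jobs were added.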
- Repeatedly calls the JobControlCompiler's compile method until all jobs in the Map-Reduce Plan are exhausted
- While there are still jobs in the plan, retrieves the JobTracker URL, launches the jobs and periodically checks their status, updating the progress and statistics information
- When all jobs in the Plan have been consumed, checks for native Map-Reduce jobs and runs them
- Finally, aggregates statistics, checks for exceptions, decides the execution outcome and logs it
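Putting the steps above together, the control flow can be sketched as follows. This is a heavily simplified model of my own: the String-based jobs, the batch-per-call compile and the returned summary string are toy stand-ins, not Pig's actual code.

```java
import java.util.*;

// Toy sketch of the launchPig() loop: repeatedly compile the next batch
// of ready jobs, run them, then wrap up with statistics.
class Sketch {
    // Hands back the next batch of jobs whose inputs are ready;
    // an empty batch means the Map-Reduce Plan is exhausted.
    static List<String> compileNextBatch(Deque<String> mrPlan) {
        List<String> batch = new ArrayList<>();
        if (!mrPlan.isEmpty()) {
            batch.add(mrPlan.poll());
        }
        return batch;
    }

    static String launchPig(Deque<String> mrPlan) {
        int completed = 0;
        List<String> batch;
        while (!(batch = compileNextBatch(mrPlan)).isEmpty()) {
            for (String job : batch) {
                // Submit the job, periodically poll its status...
                completed++;   // ...and update progress and statistics.
            }
        }
        // Run any native MR jobs, aggregate stats, decide the outcome.
        return "PigStats: " + completed + " jobs succeeded";
    }
}
```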
Next Steps
When I did the analysis above (almost two weeks ago), I made a list of my next steps, including:
- Browse through the rest of the thesis-related Pig code, i.e. org.apache.pig.backend*
- Identify the Classes and Interfaces that need to be changed
- Identify Hadoop dependencies in the Pig codebase
- Find the Stratosphere "equivalents" of JobControl, JobClient, JobConf etc.
- Find out how to run a PACT program from inside Pig
Since then, I've been browsing through the Pig code and I have also started coding (finally!). I've identified far more classes and interfaces that need to be changed, even for the simplest version of the system I'm building, and I am certainly amazed by the number of dependencies I've found and need to take care of... And it seems that finding "equivalents" is not a straightforward or easy task at all!
But the challenge has already been accepted! I'll be updating soon with my solutions :-)
Until then, happy coding!
V.