Paper of the week: Incoop
MapReduce for Incremental Computations
This week I am writing about Incoop, a system developed to support incremental computations. Incoop transparently extends Hadoop MapReduce, i.e. existing applications can be executed on it without changing any code.
Incoop's motivation comes from the observation that a large number of MapReduce jobs need to be run repeatedly with slightly different (most often augmented) input. This obviously leads to redundant computations and inefficiencies. Without system support, one has to specially design their MapReduce application to handle this case by storing and reusing state across multiple runs. Incoop's goal is to handle such cases without demanding any extra effort from the programmer.
Incoop extends Hadoop to support incremental computations by making three important modifications:
- Inc-HDFS: A modified HDFS which splits data depending on file contents instead of size
- Contraction Phase: An additional computation phase added before the Reduce phase, used to control task granularity
- Memoization-aware Scheduler: An improved scheduler which takes into account data locality of previously computed results while also using a work-stealing algorithm.
System Overview
The main goal of Incoop is supporting incremental computations while offering transparency and efficiency. Transparency means being backwards compatible, which limits the modifications one can make to the initial MapReduce model. On the other hand, efficiency is mainly determined by the granularity of the elements which can be reused.
To achieve these goals, Incoop modifies HDFS and the MapReduce scheduler, and also adds a memoization server to the architecture. The memoization server simply stores a mapping from the input of a computation to its output. Whenever a task runs, it queries the server to find out whether the result it is about to produce already exists. A high-level view of the system is shown below:
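To make the idea concrete, here is a minimal sketch of such a memoization server as an in-memory key-value store. The class and function names (`MemoizationServer`, `run_task`) are my own illustration, not Incoop's API; the real system stores references to persisted HDFS results rather than the results themselves.

```python
import hashlib


class MemoizationServer:
    """Toy stand-in for Incoop's memoization server: maps a fingerprint
    of a task's input to a reference to its previously stored output."""

    def __init__(self):
        self._store = {}

    def _key(self, task_input: bytes) -> str:
        # Fingerprint the input so large inputs index cheaply.
        return hashlib.sha1(task_input).hexdigest()

    def lookup(self, task_input: bytes):
        # Returns the memoized result reference, or None on a miss.
        return self._store.get(self._key(task_input))

    def insert(self, task_input: bytes, result_ref):
        self._store[self._key(task_input)] = result_ref


def run_task(server, task_input: bytes, compute):
    """Query the server first; compute and memoize only on a miss."""
    cached = server.lookup(task_input)
    if cached is not None:
        return cached, True       # result reused
    result = compute(task_input)
    server.insert(task_input, result)
    return result, False          # result computed fresh
```

Running the same task twice hits the store the second time, which is exactly the reuse an incremental run exploits.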
Inc-HDFS
Incremental HDFS mainly differs from the original HDFS in the way data is partitioned into chunks. In the original HDFS, data is divided into fixed-size blocks, and these blocks are also the units of computation for map tasks. This design is not suitable for incremental computations, as the slightest change in the input could shift the contents of all subsequent blocks. One could propose computing differences between the entire input files, but this choice would be too expensive in our case.
What Inc-HDFS does instead is divide data into chunks based on content rather than size. The idea comes from the data-deduplication area: fingerprints are used to match patterns in the data and mark chunk boundaries. With this approach, a change in the input will most likely affect only the chunk it falls into, while the rest can be reused.
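A simple way to see why this works is a sketch of content-defined chunking. This is not Inc-HDFS's actual fingerprinting code (the paper builds on deduplication-style fingerprints; the window size and boundary mask below are arbitrary choices of mine), but it shows the key property: boundaries are chosen by local content, so an edit near the start does not shift the chunks that follow it.

```python
import hashlib


def content_chunks(data: bytes, window: int = 8, mask: int = 0x1F):
    """Content-defined chunking sketch: declare a boundary wherever the
    fingerprint of the last `window` bytes matches a pattern (here: the
    low bits of a SHA-1 digest are zero). Because boundaries depend only
    on local content, inserting bytes early in the input disturbs at most
    the chunks around the edit; later boundaries stay put."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        fp = hashlib.sha1(data[i - window:i]).digest()
        if (fp[-1] & mask) == 0:      # boundary pattern matched
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])       # final chunk
    return chunks
```

With fixed-size blocks, prepending three bytes would change every block; here, the chunkings of the original and the edited input share all chunks after the edit point.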
Incremental MapReduce
In the computation part, map task results are stored persistently, and references to them are inserted into the memoization server. If a running map task is part of an incremental computation, it first queries the memoization server and retrieves any result that already exists.
Making the reduce phase support incremental computations is a bit more complicated than in the map case. That is because we do not have control over the input to the reducers, as it is defined directly by the mappers' output. For this reason, Incoop adds an additional phase to the model, right before the reduce phase. This Contraction phase leverages the idea of Combiners to "break" the reduce task into a tree hierarchy of smaller tasks. The process runs recursively until the last level, where the reduce function is applied. To produce a data partitioning suitable for reuse, content-based partitioning is again performed at every level of combiners.
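The tree of combiner tasks can be sketched in a few lines. This is a simplification of the Contraction phase under my own assumptions (a fixed fanout, a single key, and a `combine` function that plays the role of the Combiner); the point is that each small group is a separate task whose result can be memoized, so an incremental run only recomputes the path of the tree touched by changed input.

```python
def contract(values, combine, fanout: int = 4):
    """Contraction-phase sketch: instead of one monolithic reduce over all
    values of a key, apply a combiner to fixed-size groups of values,
    recursively, until a single value remains. Each `combine` call stands
    for a small task whose output could be memoized and reused."""
    level = list(values)
    while len(level) > 1:
        # Combine each group of `fanout` adjacent values into one result.
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]
```

For an associative operation like summing word counts, the tree produces the same answer as a single reduce, which is why the Combiner idea transfers directly.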
Scheduler
The default MapReduce scheduler assigns tasks to nodes based on data locality and availability of execution slots. Data locality is a desirable factor in the case of Incoop as well, but the location of memoized results also needs to be taken into account. The memoization-aware scheduler tries to schedule tasks on the nodes that hold data which can be reused. However, this approach might create load imbalance when some data is very popular, leading to straggler tasks. To avoid this situation, the scheduler implements a simple work-stealing algorithm: when a node runs out of work, the scheduler locates the node with the largest task queue and delegates a task to the idle node.
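The stealing rule itself is simple enough to sketch. The function below is my own minimal rendering of the behavior described above, with per-node task queues as plain deques; the real scheduler additionally weighs locality when picking tasks.

```python
from collections import deque


def steal_if_idle(queues: dict, idle_node: str):
    """Work-stealing sketch: if `idle_node` has an empty queue, take one
    task from the node with the longest queue and hand it over.
    Returns the stolen task, or None if nothing was stolen."""
    if queues[idle_node]:
        return None                    # the node is not actually idle
    victim = max(queues, key=lambda n: len(queues[n]))
    if not queues[victim]:
        return None                    # no pending work anywhere
    task = queues[victim].pop()        # steal from the tail of the queue
    queues[idle_node].append(task)
    return task
```

Stealing from the longest queue keeps popular (heavily memoized) nodes from becoming stragglers while leaving locality-friendly assignments in place for the tasks that remain queued.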
Thoughts
Incoop's design decisions invite discussion on several levels, and a thorough evaluation and profiling are needed to measure overheads and locate bottlenecks. For example, the content-based partitioning in Inc-HDFS could possibly lead to overloaded map tasks. Moreover, the performance gains are questionable when even one map task needs to be re-executed: since the reduce phase cannot start before all map tasks have finished, the speedup inevitably depends on how slow that re-executed map task is.
The paper contains an extensive evaluation section which I will not describe here. The authors use five applications, both data and computation intensive, and try to answer the above questions and study the tradeoffs of their approach.
Happy reading and coding,
V.

