Paper of the week: Themis
An I/O Efficient MapReduce
This week's paper is named after a Titaness of the Greek Mythology, Themis, protector of order and law. However, in our context, Themis is a MapReduce-inspired system, whose main goal is to minimize I/O operations.
It is true that many MapReduce jobs are I/O-bound and it would certainly benefit performance if we could find a mechanism to limit I/O operations. At this point, I need to clarify that Themis aims to reduce disk I/O, i.e. find a way to process records in memory and spil records to disk as rarely as possible.
In order to achieve this goal, Themis needs to make substantially different design choices from common MapReduce systems. The three most important features that enable disk I/O elimination are:
1. Relaxation of fault-tolerance guarantees
Classical MapReduce implementations offer fault-tolerance at the task level, i.e. if a node fails, the tasks that were running on this node will be re-scheduled and executed on another node. The rest of the tasks, running or completed, are unaffected. This model allows for fast recovery and has satisfactory results in very large clusters where failures are common. However, the creators of Themis argue that clusters with thousands of machines are not that common in the real world. Surely Google, Facebook, Yahoo! and a couple others are running MapReduce jobs on clusters of such scale, however, the majority of Hadoop adaptors have much smaller deployments. Making the assumption of a small to mid-size cluster, Themis offers fault-tolerance at the job level only, i.e. if a node fails during execution, the whole job will need to be re-executed. This decision greatly simplifies implementation and allows for data pipelining and time-consuming checkpoint elimination.
2. Dynamic memory management
In order to minimize disk operations, Themis aims to process a record entirely as soon as it is read from disk. To achieve this, the system needs a mechanism to ensure that there will be enough memory throughout the duration of the record processing. Themis implements a dynamic memory manager with the following pluggable policies:
- Pool-based management
This policy views the available memory as a pool of fixed-sized pre-allocated buffers. The application and reserve a buffer from the pool and return it when it is not needed anymore. If a worker requests a block from a pool which is empty, it will block until a buffer is returned from another worker. The obvious advantage of this policy is its simplicity by performing all memory allocation and pool setup at startup. On the other hand, this static partitioning limits the maximum record size supported and sacrifices flexibility.
- Quota-based management
The second policy is design to better handle varying-sized records. It controls the flow of data among computational stages, so that stages receiving input do not get overwhelmed. To achieve this, the manager uses queues between stages. A quota is the amount of memory that is allocated between a producer and a consumer stage. If the producer is about to exceed its allowed quota, it is stalled until the corresponding consumer has processed enough data.
- Constraint-based management
The third and more sophisticated policy dynamically adjusts memory allocation based on worker requests and currently available memory. The manager monitors memory usage by each worker and accepts requests for memory allocation. If enough memory is available, the request is satisfied. Worker requests are prioritized based on their position on the execution graph. This policy adds a significant overhead to the execution time and performs well when the number of allocation requests is relatively small.
3. Per-node disk-I/O management
A disk scheduler, running on every Themis node, manages the way data is grouped and written to disk. The scheduler organizes data in large batches in order to reduce disk accesses.
System Overview
Like many other systems, Themis is implemented as a dataflow graph consisting of stages which are grouped in phases and executed in parallel by workers.
Phase Zero samples input data in a distributed fashion and extracts information about the distribution of records and keys. This phase is necessary in order to ensure that partitions in each of the following stages will be small enough to be processed and sorted in memory. Figuring out the distribution of input data is fairly simple, however, intermediate data also need to be considered. Themis applies the map function to a subset of the input data in order to generate a subset of intermediate data, which can be used to approximate the intermediate distribution. Both range and hash partitioning mechanisms are supported.
Phase One implements the mapping and shuffling, but it is split down to more fine-grained steps, shown in the diagram below:
The Reader, Mapper and Sender are the "producer-side" of the phase and their functionality is as simple as their names suggest. The rest of the stages are the "consumer-side" of the phase and it is the place where the disk-scheduler, described in the previous section, comes into the picture.
Phase Two consists of the sorting and the reduce phase. Each partition is sorted by key, always keeping the result in memory. The stages of phase two are shown below:
Evaluation
Themis is evaluated using several applications, like sorting, WordCount, PageRank and CloudBurst. Both synthetic and real data are used for evaluation. Experiments are run on a fairly small, 20-node cluster, or even on a single node if data is small enough. Experiments test disk performance, memory management policies and compare Themis with Hadoop. Comparison with Hadoop shows that I/O bound applications can greatly benefit from Themis' design. However, Themis does not use a distributed file system or any kind of replication and it is not clear whether those factors were taken into account when taking measurements.
Thoughts
In my opinion, Themis is a quite immature system with some important limitations that need to be solved. However, it clearly shows that Hadoop is not the best fit for all cluster-sizes and types and that many deployments could be highly benefited by alternative designs.
Happy coding and reading,
V.



