Paper of the week: SkewTune
Mitigating Skew in MapReduce Applications
This week's paper is about SkewTune, a Hadoop extension built at the University of Washington, which addresses the problem of skew in MapReduce applications.
In general, skew refers to the case where task runtimes vary widely, i.e. one or more of a set of tasks running in parallel take much longer to finish than the rest. Skew can appear in a parallel execution for several reasons, including computational load imbalance, characteristics of the algorithm or the specific dataset, or degraded performance of individual nodes. In a parallel system, execution time is limited by the slowest task, so skew has always been a headache and has been extensively researched in certain domains, such as parallel databases. In the MapReduce context, the straggler problem was identified early on and addressed by its creators with speculative execution: when a task is detected to take longer than expected, a copy of it is scheduled on a different node and the result of the fastest copy is reported. This approach works well on a cluster with thousands of nodes, where failures are common or hard to distinguish from degraded performance, but suffers from the fact that work already done must be repeated.
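The speculative-execution idea can be sketched in a few lines. This is a toy single-machine illustration, not Hadoop's actual scheduler (which tracks per-task progress scores rather than a fixed timeout); the function name and the `backup_delay` parameter are my own:

```python
import concurrent.futures

def run_with_speculation(task, backup_delay=0.1):
    # Run `task`; if it has not finished within `backup_delay` seconds,
    # launch a duplicate (the "speculative copy") and return whichever
    # result arrives first.  Both copies compute the same answer, so
    # either result is valid -- but the duplicated work is wasted.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(task)
        done, _ = concurrent.futures.wait([first], timeout=backup_delay)
        if done:
            return first.result()
        backup = pool.submit(task)  # speculative copy on "another node"
        done, _ = concurrent.futures.wait(
            [first, backup],
            return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()
```

The wasted duplicate work in the slow path is exactly the drawback SkewTune wants to avoid.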
SkewTune addresses skew at a different level and from different sources: uneven distribution of input data, and skew caused by some parts of the input requiring longer processing times than others.
Types of Skew in MapReduce
- Map phase: Expensive Record
One would not expect skew to exist in the map phase, since MapReduce divides input data evenly among mappers, i.e. each map task processes roughly the same amount of data (in bytes). However, some records may require significantly more CPU or memory than others, either because the value of the key-value pair is simply larger or because the algorithm's runtime depends on the value itself. The PageRank algorithm is an example where this type of skew can appear, since its runtime depends on the out-degree of each vertex of the graph being processed.
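The PageRank case can be made concrete with a minimal map-function sketch (my own simplification, not SkewTune's or Hadoop's code):

```python
def pagerank_map(vertex, rank, out_edges):
    # Emit an equal share of this vertex's rank to each out-neighbour.
    # Runtime is proportional to the out-degree, so a single
    # high-degree vertex (an "expensive record") can stall its whole
    # map task even though every task received the same number of
    # records.
    if not out_edges:
        return []
    share = rank / len(out_edges)
    return [(dst, share) for dst in out_edges]
```

A vertex with two out-edges is processed almost instantly, while a celebrity node with millions of out-edges dominates its task's runtime.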
- Map phase: Heterogeneous Map
Map and reduce functions typically accept one input. Since this is an important limitation when expressing several algorithms, MapReduce developers often emulate n-ary operations by concatenating multiple input datasets and adding special tags to the records in order to distinguish their source. Although this technique has proved very useful, it can also become an important source of skew, as mappers will be assigned different record types which may require different processing.
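The tagging trick looks roughly like the following sketch of a repartition join (dataset names, record layouts and function names are hypothetical, chosen only to illustrate the pattern):

```python
def tag_inputs(users, orders):
    # Concatenate both datasets, tagging each record with its source,
    # so a single unary map function can tell them apart.
    return [("user", r) for r in users] + [("order", r) for r in orders]

def join_map(tag, record):
    # Re-key each record on the join attribute (the user id).  The two
    # record types take different code paths, so a mapper whose split
    # happens to contain mostly one type may do very different work
    # than its peers -- the heterogeneous-map skew described above.
    if tag == "user":
        user_id, name = record
        return (user_id, ("user", name))
    order_id, user_id, amount = record
    return (user_id, ("order", amount))
```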
- Reduce phase: Partitioning skew
In the reduce phase, the causes of skew are more obvious and more common. The first type occurs when the partitioning function fails to distribute the load fairly. In many applications a small set of keys is much more popular than the rest, e.g. frequently visited URLs or common words in text.
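A quick way to see partitioning skew is to count how many records a hash partitioner would route to each reducer. The sketch below uses a toy deterministic hash in place of Hadoop's default partitioner:

```python
def toy_hash(key):
    # Deterministic stand-in for a hash partitioner's hash function.
    return sum(key.encode())

def partition_counts(keys, n_reducers):
    # Count how many records each reduce task would receive.  With a
    # heavily skewed key distribution, one reducer ends up with most
    # of the data no matter how the keys hash, because a single
    # popular key cannot be split across partitions.
    counts = [0] * n_reducers
    for k in keys:
        counts[toy_hash(k) % n_reducers] += 1
    return counts
```

With 90 occurrences of one popular key and 10 distinct rare keys, whichever partition receives the popular key gets at least 90% of the records.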
- Reduce phase: Expensive Key Group
This is the case analogous to the Expensive Record in the map phase.
For a more detailed description of skew types and their causes, I suggest you read this short study.
SkewTune Overview
SkewTune is built as a Hadoop extension and aims to be transparent to the user: no existing code needs to be modified and no guarantees of the MapReduce model are violated. SkewTune detects stragglers and mitigates skew by repartitioning their remaining unprocessed input data.
To decide when a task should be treated as a straggler, while avoiding unnecessary overhead and false positives, SkewTune uses Late Skew Detection. Simply put, SkewTune delays any skew mitigation as long as all slots in the cluster are busy; only when a resource becomes idle does it trigger straggler detection. At that point, the system must decide which task, if any, to identify as a straggler. The team's experience showed that skew mitigation is only beneficial when considering one straggler task at a time, so SkewTune selects the task with the highest remaining-time estimate and initiates repartitioning only if half of this estimate is greater than the repartitioning overhead.
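The selection rule described above can be sketched as follows (the function and parameter names are my own, not SkewTune's API; the half-the-estimate threshold is the paper's heuristic):

```python
def pick_straggler(remaining, repartition_overhead):
    # remaining: {task_id: estimated seconds of work left}.
    # Consider only the single worst task, and mitigate only when half
    # its remaining time exceeds the repartitioning overhead --
    # otherwise mitigation would cost more than it saves.
    if not remaining:
        return None
    task, t_left = max(remaining.items(), key=lambda kv: kv[1])
    return task if t_left / 2 > repartition_overhead else None
```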
Skew mitigation happens in three steps. First, the task identified as a straggler is stopped. Next, its remaining input data is scanned in order to collect information about it: SkewTune builds a compressed summary of the remaining data in the form of a series of approximately equal key intervals. Depending on the size of the remaining data, SkewTune may decide to scan it locally or in parallel. Finally, the coordinator of the system plans the repartitioning strategy and schedules the additional tasks.
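The key-interval summary can be pictured with a small sketch; this is a simplification of the summary format the paper describes, with names of my own choosing:

```python
def key_intervals(sorted_keys, n):
    # Chop the sorted remaining keys into at most n roughly equal
    # chunks, keeping only the first and last key of each chunk.  The
    # planner can then assign one or more intervals to each of the new
    # tasks it schedules.
    step = max(1, -(-len(sorted_keys) // n))  # ceiling division
    return [(sorted_keys[i], sorted_keys[min(i + step, len(sorted_keys)) - 1])
            for i in range(0, len(sorted_keys), step)]
```

Keeping only interval boundaries is what makes the summary compressed: it grows with the number of intervals, not with the amount of unprocessed data.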
In Hadoop, SkewTune implements skew mitigation as a separate MapReduce job for each parallel data scan and for each mitigation. When repartitioning a map task, a map-only job is executed and the job tracker broadcasts all information about the mitigated map to all reducers in the system. When repartitioning a reduce task, the inflexibility of MapReduce's static pipeline requires an identity map phase to be run before the additional reduce tasks.
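The identity map phase does no work of its own; it exists only because every Hadoop job must start with a map phase. In sketch form:

```python
def identity_map(key, value):
    # Pass each record through unchanged; the shuffle that follows
    # this map phase is what actually re-routes the records to the
    # newly scheduled reduce tasks.
    yield (key, value)
```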
Performance
The paper contains an extensive performance evaluation based on popular applications, such as PageRank, Inverted Index and CloudBurst. It shows that SkewTune successfully avoids delays due to skew and rebalances the load in the cluster. Most importantly, it makes no assumptions about the cause of the skew, and it is also effective in the case of job misconfigurations. According to the published results, it can make applications run up to 4 times faster in the presence of skew, while imposing negligible overhead otherwise.
Happy reading and coding,
V.

