Paper of the week: HAIL
Only Aggressive Elephants are Fast Elephants
The main character of this week's paper is poor Bob, an analyst, and the story follows his adventures with elephants, i.e. Hadoop MapReduce. Bob's queries often reveal interesting information or anomalies that need further examination, spawning new queries and therefore triggering new MapReduce jobs. In such a scenario, the problem stems from slow query runtimes due to the lack of proper schemas and data indexing. Poor Bob ends up drinking too much coffee while waiting for his queries to finish, and he also has to deal with his angry boss! The paper proposes inexpensive index creation on Hadoop data attributes, in order to reduce execution times in exploratory MapReduce use-cases.
The main idea comes from the observation that HDFS keeps three replicas of each block by default. Why not keep each of these copies in a different sort order and with a different index? If a query happens to filter data by an attribute for which an index exists, we gain a great speedup!
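To make the idea concrete, here is a minimal sketch in Python (HAIL itself is a Java/Hadoop modification; the rows, attributes, and helper names below are made up for illustration). Each "replica" holds the same rows in a different sort order, and a filter on an attribute uses binary search on the matching replica:

```python
import bisect

# Toy rows: (id, department, salary). In HAIL, each of the three HDFS
# replicas of a block holds the same rows, sorted by a different attribute.
rows = [(3, "sales", 70), (1, "hr", 50), (2, "it", 90), (4, "it", 60)]

# One replica per sort attribute (tuple index 0, 1, 2).
replicas = {attr: sorted(rows, key=lambda r: r[attr]) for attr in range(3)}

def lookup(attr, value):
    """Answer an equality filter via binary search on the replica
    whose clustered sort order matches the filtered attribute."""
    rep = replicas[attr]
    keys = [r[attr] for r in rep]
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return rep[lo:hi]

print(lookup(1, "it"))  # rows whose department == "it", found without a full scan
```

A scan touches every row regardless of selectivity; the binary search above touches O(log n) keys, which is the source of the speedup for selective queries.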
As usual, it's not as easy as it sounds. How can indexes be created during upload time, so that no additional scan is required? How does HDFS need to be modified in order to distinguish the different replicas? How can MapReduce applications exploit the sort orders and indexes effectively? And how can we make the scheduler aware?
HAIL (Hadoop Aggressive Indexing Library) answers all these questions.
System Overview
HAIL is a Hadoop modification which keeps each physical replica of an HDFS block in a different sort order and creates clustered indexes on HDFS blocks when uploading the data. The HAIL client converts each HDFS block to binary PAX format and sends it to three datanodes. Then, each node sorts the data contained in the block using a different sort order. Sorting and indexing happen in main memory. Each datanode creates a different clustered index for each data block and stores it with the sorted data. The process is shown below:
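The PAX conversion step can be sketched as follows: within a single block, the row data is regrouped column by column, so a query can later read only its projected attributes (a minimal sketch; the field names are made up, and real PAX operates on binary pages rather than Python dicts):

```python
# A block's worth of parsed rows.
rows = [
    {"id": 1, "name": "alice", "age": 34},
    {"id": 2, "name": "bob",   "age": 28},
]

def to_pax(rows):
    """Regroup a block of rows into a PAX-style columnar layout:
    one 'minipage' (list) per attribute, preserving row order."""
    return {key: [r[key] for r in rows] for key in rows[0]}

pax_block = to_pax(rows)
print(pax_block["age"])  # [34, 28] -- one column read without touching the rest
```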
Upload Pipeline
HAIL modifies the upload pipeline of HDFS to support index creation and sorting. One major difference is how the HAIL client parses each file into rows. In HDFS, each block boundary is defined by a fixed number of bytes, while HAIL determines block boundaries from the content, never splitting a row between two blocks. After the blocks have been defined, each one of them is converted to a binary PAX representation.
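Content-aware block splitting can be sketched like this (a simplification, assuming one row per line and treating the block size as a soft cap; the real client works on HDFS packets):

```python
def split_blocks(lines, block_size):
    """Content-aware splitting: a row (line) never straddles two blocks.
    Sizes count the encoded bytes of each line plus its newline."""
    blocks, current, used = [], [], 0
    for line in lines:
        n = len(line.encode()) + 1  # +1 for the newline separator
        if current and used + n > block_size:
            blocks.append(current)  # close the block before overflowing
            current, used = [], 0
        current.append(line)
        used += n
    if current:
        blocks.append(current)
    return blocks

print(split_blocks(["aaaa", "bb", "cccc", "d"], 8))
# [['aaaa', 'bb'], ['cccc', 'd']] -- every row lands whole in exactly one block
```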
HAIL also modifies the checksum mechanism of HDFS. In HDFS, only the datanode receiving the last replica verifies the checksums and then acknowledges the packets back to the other datanodes. In HAIL, however, no checksum or data is flushed to disk right away, as sorting is required first. A typical block size is 64MB, which allows HAIL to perform several block sortings and index creations entirely in memory. After sorting, indexing and metadata creation, each datanode computes its own checksums and the data is flushed to disk.
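The reordering of these steps can be sketched as follows, assuming a simple list-of-strings block (HDFS computes per-chunk CRC checksums; the CRC32 call and helper name here are illustrative):

```python
import zlib

def finalize_block(rows, sort_key):
    """Sketch of HAIL's reordered upload step: sort the block in memory
    first, then compute the checksum over the *sorted* bytes before
    flushing, since each replica's bytes differ after sorting."""
    ordered = sorted(rows, key=sort_key)
    payload = "\n".join(ordered).encode()
    checksum = zlib.crc32(payload)  # checksum must cover the final on-disk bytes
    return payload, checksum

payload, checksum = finalize_block(["row-b", "row-a"], sort_key=lambda r: r)
print(payload)  # b'row-a\nrow-b'
```

Computing the checksum before sorting would be useless: each replica's on-disk bytes differ once it is re-sorted, so each datanode must checksum its own, locally ordered data.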
At this point, I have to make clear that there is no impact on the fault-tolerance properties of HDFS. Data is only reorganized inside the same block, therefore in case of failure, no information is lost except for the sort order and index of that particular replica.
Query Pipeline
From the application perspective, a MapReduce program requires only a few changes in order to use HAIL. Bob needs to annotate his map function and specify a selection predicate and the projected attributes. In other words, he needs to specify the values to filter the input records and the fields he wants the query to return as result. This information will help the runtime schedule the tasks on nodes containing the appropriate indexes. This annotation simplifies the map function, where Bob does not have to include the filtering anymore.
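In the real system this annotation is a Java one on the map function; as a language-neutral sketch, a Python decorator can play the same role of attaching the predicate and projection as metadata that the runtime inspects (all names here, `hail_query`, `hail_filter`, and so on, are made up for illustration):

```python
def hail_query(filter_attr, filter_value, projection):
    """Stand-in for HAIL's map-function annotation: attach the selection
    predicate and projected attributes as metadata on the function, so
    the scheduler can pick nodes holding a matching clustered index."""
    def wrap(map_fn):
        map_fn.hail_filter = (filter_attr, filter_value)
        map_fn.hail_projection = projection
        return map_fn
    return wrap

@hail_query(filter_attr="department", filter_value="it", projection=["id"])
def my_map(record):
    # No filtering logic here: matching records arrive pre-filtered
    # by the index, already projected to the requested attributes.
    return record["id"]

# Scheduler-side code reads the metadata off the function object:
print(my_map.hail_filter)      # ('department', 'it')
print(my_map.hail_projection)  # ['id']
```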
From HAIL's perspective, the record splitting policy needs to be modified, as each replica now has a different size due to its indexes. HAIL groups several data blocks inside one input split, resulting in fewer map tasks. If the job requires a full scan or no relevant index exists, HAIL falls back to the standard Hadoop mechanisms.
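The grouping of blocks into splits amounts to simple packing (a sketch; the function name and block IDs are illustrative, and real input splits also carry locations and byte ranges):

```python
def group_splits(block_ids, blocks_per_split):
    """HAIL-style split grouping: pack several data blocks into one
    input split so fewer map tasks are launched overall."""
    return [block_ids[i:i + blocks_per_split]
            for i in range(0, len(block_ids), blocks_per_split)]

print(group_splits(list(range(10)), 4))  # 3 splits instead of 10 map tasks
```

Fewer map tasks means less per-task scheduling and startup overhead, which matters when an index makes each task's actual work tiny; the flip side, noted in the evaluation, is that a failed task covers more data.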
Evaluation
The paper provides an extensive evaluation. The system is compared against Hadoop and Hadoop++ (previous work by the same group). Experiments test upload performance, the impact of index creation, scalability, query performance and the impact of the new splitting mechanism. Six different clusters and two different datasets are used. The applications include queries with varying degrees of selectivity.
The results show that HAIL matches or often outperforms Hadoop in data uploading, even though it also has to sort blocks and create indexes, which demonstrates the efficiency of the binary PAX representation. Query execution times decrease significantly when appropriate indexes are present, and having fewer map tasks reduces end-to-end job runtimes. However, when a map task fails, more data needs to be re-processed, slowing down recovery.
Links
Happy reading and coding,
V.


