Paper of the week: Nectar
Automatic Management of Data and Computation in Datacenters
This week's paper, Nectar, comes from Microsoft Research and tackles some very fundamental problems of modern datacenters. Nectar is fundamentally built to interact with Dryad and LINQ...
Wait! Don't stop reading!
I know you know that Microsoft dropped Dryad a while ago and conceded Hadoop's victory, blah blah blah. But trust me, the ideas in this paper go beyond the specific platform, and it's worth reading nevertheless :)
Motivation
Nectar basically deals with data management in large datacenters. Manual data management can cause a lot of pain, including data loss and wasted resources due to careless or oblivious users. Especially in big data scenarios, where user applications generate large amounts of output data, it is common for this data to be accessed once and then never deleted. Consequently, large parts of the datacenter's storage are occupied by data that is never or rarely accessed.
System Overview
Nectar automates data management by associating data with the computation that created it. Datasets can be uniquely identified by the computations that produced them. All successfully completed programs run in the datacenter are saved in a special store and can be used to re-create deleted data. In this way, data and computations can be used interchangeably: high-value data is cached and can be used instead of computation when possible, while low-value data is garbage-collected and can be retrieved when needed by re-running the computation that produced it.
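To make the data/computation duality concrete, here is a minimal sketch of the idea, not Nectar's actual implementation: a derived dataset is identified by a fingerprint of the program and its inputs, a cache holds the (deletable) data, and a program store keeps enough information to recompute it after eviction. All names (`run`, `program_store`, `sort_clicks`) are hypothetical.

```python
import hashlib

# Hypothetical in-memory stand-ins for Nectar's cache and program store.
cache = {}          # fingerprint -> derived dataset (may be garbage-collected)
program_store = {}  # fingerprint -> (function, inputs), kept permanently

def fingerprint(program_name, inputs):
    """Uniquely identify a derived dataset by the computation that produced it."""
    key = program_name + "|" + "|".join(map(str, inputs))
    return hashlib.sha1(key.encode()).hexdigest()

def run(program_name, func, inputs):
    """Return cached data if present; otherwise (re)compute and cache it."""
    fp = fingerprint(program_name, inputs)
    program_store[fp] = (func, inputs)   # successful programs are always stored
    if fp not in cache:                  # miss: data was never made, or was deleted
        cache[fp] = func(inputs)
    return cache[fp]

clicks = [3, 1, 2]
top = run("sort_clicks", sorted, clicks)        # computed and cached
del cache[fingerprint("sort_clicks", clicks)]   # simulate garbage collection
top_again = run("sort_clicks", sorted, clicks)  # transparently recomputed
```

The point of the sketch is the symmetry: deleting a cache entry loses nothing permanently, because the program store can always regenerate it.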
Nectar's input is a LINQ program. Like many of the high-level analysis languages, LINQ's operators are functional, i.e. they accept a dataset as input, do a transformation on it and output a different dataset. This property allows Nectar to create the association between datasets and programs.
Datasets are classified in Nectar as primary or derived. Primary datasets are created once and are not garbage-collected. Derived datasets are created from primary or other derived datasets. A mapping is created between them and the program that produced them and they can be automatically deleted and re-created. In a scenario where we want to analyze website click logs, the raw click logs files would be primary, while all datasets produced by analysis on the click logs, e.g. most popular URLs, would be derived.
The Nectar architecture consists of a client-side program rewriter, a cache server, a garbage collection service and a datastore on the underlying distributed file system, TidyFS. When a DryadLINQ program is submitted for execution, it first goes through a rewriting step. The rewriter consults the cache and tries to transform the submitted program into a more efficient one, using a cost estimator to choose the best alternative. The rewriter also decides which intermediate datasets to cache, and always creates a cache entry for the final output of a computation. Rewriting can be useful in two scenarios: (a) sub-expression replacement by cached datasets and (b) incremental computations, where cached data can be reused to save computation time. After the rewriting, the optimized program is compiled into a Dryad job and executed.
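Sub-expression replacement can be sketched as a bottom-up walk over the program's expression tree: whenever the fingerprint of a subtree is found in the cache, the whole subtree is swapped for a reference to the materialized dataset. This is only an illustrative toy, with hypothetical tuple-based expression nodes and a made-up TidyFS path:

```python
import hashlib

# Hypothetical expression nodes: ("op", child) or ("data", name).
cache = {}  # fingerprint of a sub-expression -> path of its materialized result

def fp(expr):
    """Stable fingerprint of an expression tree."""
    return hashlib.sha1(repr(expr).encode()).hexdigest()

def rewrite(expr):
    """Replace cached sub-expressions with references to their stored results."""
    if fp(expr) in cache:
        return ("cached", cache[fp(expr)])
    if expr[0] == "data":
        return expr                 # primary dataset, nothing to replace
    op, child = expr
    return (op, rewrite(child))     # recurse into the sub-expression

# Suppose an earlier job already materialized Select(logs):
select = ("select", ("data", "logs"))
cache[fp(select)] = "urls.tidyfs"

# Today's GroupBy(Select(logs)) then only needs to run the GroupBy stage.
plan = rewrite(("groupby", select))
# plan == ("groupby", ("cached", "urls.tidyfs"))
```

A real rewriter would also weigh the cost of reading the cached data against recomputing it, which is the cost estimator's job.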
The cache service is distributed and datacenter-wide. It serves lookup requests and in the case of a miss, it has a mechanism of backtracking all necessary computations in order to recreate the requested dataset. It also implements a cache replacement policy, keeping track of the least frequently accessed datasets.
The garbage collector runs in the background without interfering with program execution or other datacenter jobs. It identifies datasets that are not referenced by any cache entry and deletes them.
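The collector's logic amounts to a simple mark-and-sweep over derived datasets: anything referenced by a cache entry is live, everything else is deleted (and remains recoverable by re-running its program). A toy sketch with hypothetical dataset names:

```python
# Hypothetical mark-and-sweep over derived datasets in the store.
store = {"a.pt", "b.pt", "c.pt"}                  # derived datasets on disk
cache_entries = {"job1": "a.pt", "job2": "c.pt"}  # live cache references

def collect_garbage(store, cache_entries):
    """Delete derived datasets not referenced by any cache entry."""
    live = set(cache_entries.values())   # mark: referenced datasets stay
    dead = store - live                  # sweep: the rest are deleted
    return store - dead, dead

store, deleted = collect_garbage(store, cache_entries)
# deleted datasets can later be re-created by re-running their programs
```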
Evaluation Results
Nectar was deployed and evaluated on an internal 240-node cluster at Microsoft Research, and was also evaluated using execution logs from 25 other production clusters. The analysis showed that Nectar could eliminate up to 7,000 hours of daily computation.
Overall, we could summarize Nectar's advantages as follows:
- Space savings by caching, garbage collection and on-demand re-creation of datasets
- Execution time savings by sub-computation sharing and support for incremental computations
- Easy content management and dataset retrieval
Implementation details and an extensive evaluation can be found in the paper.
Thoughts
Personally, I love caching. I believe it's a brilliant idea and I think we should use it more often, wherever applicable, and especially in big data systems. This recent (and enlightening) study on MapReduce workloads clearly shows that we can only benefit from caching.
Happy coding and reading,
V.

