Sector and Sphere
design and implementation of a high-performance data cloud
This week's post discusses Sector and Sphere, two systems built at the University of Illinois, which together provide a framework for executing data-intensive applications in the cloud.
The vision of the project is to provide a computing cloud which can be geographically distributed. The authors build on the assumption that available clusters are connected by high-speed networks and that data inputs and query outputs are relatively large.
Sector
Sector provides the storage layer of the system. It assumes a large number of well-connected nodes, located in the same or different datacenters. Datasets are divided into Sector slices, which are also the units of replication in the system. An overview of the Sector architecture is shown below:
Sector is one of the very few large-scale data storage services that addresses security. Security is provided by an external, dedicated security server, which maintains user credentials, file access information and the IP addresses of authorized nodes. The actual storage service is provided by a master node and a set of slave nodes. The master stores metadata, monitors the slaves and handles client requests. Clients connect to the master over SSL. The master then contacts the security server to verify the user's credentials and obtain their access privileges. When a client requests to read a file, the master checks permissions and elects a slave node to serve the request. Communication between clients and slaves, as well as among slaves, is always coordinated by the master node.
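The read-request flow can be sketched roughly as follows. This is only an illustration of the steps described above; the class, function names and the "first available slave" election policy are all assumptions, not the actual Sector API:

```python
class SecurityServer:
    """Stand-in for Sector's external security server (illustrative only)."""

    def __init__(self, credentials):
        self.credentials = credentials  # user -> password

    def authenticate(self, user, password):
        return self.credentials.get(user) == password


def handle_read_request(user, password, path, security_server, slaves, acl):
    """Sketch of the master handling a client's read request."""
    # 1. Verify the user's credentials against the security server.
    if not security_server.authenticate(user, password):
        raise PermissionError("authentication failed")
    # 2. Check the user's access privileges for the requested file.
    if path not in acl.get(user, set()):
        raise PermissionError("access denied")
    # 3. Elect a slave node to serve the request (first available here;
    #    the real master would consider load and topology).
    return slaves[0]
```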
Sector relies on the local native file system of each node to store files. Each Sector slice is stored as a separate file in the native file system and is not split further (e.g. into blocks, as in Hadoop). Apart from metadata, the master also stores information about the current topology and the available resources of the slaves. It periodically checks the number of existing copies of each file in the system; if this falls below the specified replication factor, it chooses a slave node to create a new copy.
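A minimal sketch of that periodic replication check, assuming a simple in-memory view of which slaves hold each slice (the names and the random placement policy are made up; the real master also takes topology into account):

```python
import random

REPLICATION_FACTOR = 3  # assumed default for illustration


def under_replicated(slice_locations, factor=REPLICATION_FACTOR):
    """Return the slices whose current copy count is below the factor."""
    return [s for s, nodes in slice_locations.items() if len(nodes) < factor]


def choose_replication_target(all_slaves, current_holders):
    """Pick a slave that does not already hold the slice (random here)."""
    candidates = [n for n in all_slaves if n not in current_holders]
    return random.choice(candidates) if candidates else None
```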
Sector uses UDP (with a reliable message-passing library on top) for messages and UDT for data transfers.
Sphere
Sphere provides the computing layer of the system. It transparently manages data movements, message-passing, scheduling and fault-tolerance.
Sphere's computing paradigm is based on stream-processing. In this context, each slave corresponds to an ALU in a GPU or a core in a CPU. A Sphere program takes a stream as input and produces one or more streams as output. Each stream is divided into segments, which are assigned to different Sphere Processing Engines (SPEs) for execution. The computing paradigm is shown in the following diagram:
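The splitting step can be illustrated with a toy function over in-memory records; in the real system the unit of splitting is tied to Sector slices, so this is only a sketch:

```python
def split_stream(records, num_segments):
    """Divide an input stream into contiguous, roughly equal segments,
    one per Sphere Processing Engine (illustrative only)."""
    n = len(records)
    size = -(-n // num_segments)  # ceiling division
    return [records[i:i + size] for i in range(0, n, size)]
```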
One could easily implement MapReduce-like computations in Sphere: when writing a Sphere UDF, one can specify a bucket ID for each record of the output, which imitates the shuffling phase of MapReduce.
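A hedged sketch of the idea, using word count as the example: the UDF hashes each output key to a bucket ID so that equal keys land in the same bucket, as a MapReduce shuffle would. The function names and record shapes are assumptions, not the actual Sphere API:

```python
NUM_BUCKETS = 4  # assumed for illustration


def word_count_udf(segment):
    """Process one input segment; yield (bucket_id, record) pairs so
    records with the same key end up in the same bucket."""
    for line in segment:
        for word in line.split():
            yield hash(word) % NUM_BUCKETS, (word, 1)


def shuffle(outputs):
    """Group emitted records by bucket ID, as the runtime would."""
    buckets = {}
    for bucket, record in outputs:
        buckets.setdefault(bucket, []).append(record)
    return buckets
```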
When a client sends a data-processing request to the master, it receives a list of slaves available for computation. The client selects the nodes it needs and starts an SPE on each of them. While an SPE is running, it periodically reports its progress back to the client.
The Sphere client is responsible for splitting the input into segments and distributing them uniformly to the available SPEs. The scheduling rules aim primarily at data locality and data access concurrency. To detect failures, the Sphere client monitors the running SPEs and regards them as failed after a timeout. There is no check-pointing mechanism, so a failed SPE's segment has to be re-scheduled and re-computed from scratch. Sphere also uses a technique very similar to MapReduce's to deal with stragglers.
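The timeout-based failure handling could look roughly like this. Everything here (names, the 60-second timeout, the pending queue) is an illustrative assumption:

```python
SPE_TIMEOUT = 60.0  # seconds; assumed value


def find_failed_spes(last_report, now, timeout=SPE_TIMEOUT):
    """Return the SPEs whose last progress report is older than the timeout."""
    return [spe for spe, t in last_report.items() if now - t > timeout]


def reschedule(failed, segment_of, pending):
    """With no check-pointing, a failed SPE's entire segment goes back to
    the pending queue for complete re-computation."""
    for spe in failed:
        pending.append(segment_of[spe])
    return pending
```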
Thoughts
Overall, the design of Sector and Sphere is straightforward, and a number of decisions seem to have been made to avoid implementation complexity. Sector performs best when dealing with a small number of relatively large files. The programming model is more general and flexible than MapReduce, though lower-level. The Terasort benchmark is used for evaluation and there is a comparison with Hadoop/MapReduce, although it is not extensive enough.
Links
Happy reading and coding,
V.


