Paper of the week: ASTERIX

Towards a scalable, semistructured data platform for evolving-world models

ASTERIX is a data-intensive storage, management, and analysis platform developed in collaboration by UC Irvine, UC Riverside, and UC San Diego.

ASTERIX was designed around the requirements of an "evolving world model". In this kind of model, think of the system as a place where we store a small portion of the real world in the form of data. As the world changes, the data changes with it, capturing the world's evolution. We can store and analyze information about events of the present and the past, and we allow data to change and new types of data to be added in the future.

In such a model we can ask several interesting questions about the current or the past state of the world, or "as-of" queries:

What is the best route to get to the Olympic Stadium right now?
What is the traffic situation like on Saturday nights close to the city center?
How many visitors who visited the City Hall during the past year also went for dinner at a nearby restaurant?
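Questions like the last one are essentially joins with temporal predicates. As a purely hypothetical sketch (the dataset names Visits and Dinners, the field names, and the AQL-flavored syntax are all invented for illustration, not taken from the paper), it might be phrased as:

```
/* hypothetical datasets Visits and Dinners, joined on visitorId */
count(
  for $v in dataset('Visits')
  for $d in dataset('Dinners')
  where $v.placeName = 'City Hall'
    and $d.placeName = 'Nearby Restaurant'
    and $v.visitorId = $d.visitorId
    and $v.visitTime >= datetime('2010-01-01T00:00:00')
  return $d
)
```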

ASTERIX builds on and extends work from three main areas.
Its data model is inspired by work done in the area of semi-structured data, e.g. XML, JSON, Avro, and XQuery.
It is greatly influenced by parallel database systems, to which it adds support for more complex data and fuzzy queries, as well as scalability to larger clusters and fault tolerance. 
Inevitably, ASTERIX also borrows from the area of data-intensive computing, while preserving its database roots. What distinguishes it from other data-intensive frameworks, e.g. Hadoop, is that its runtime was built with a high-level analysis language in mind, not the other way around. Consider the following: Map-Reduce is an over-simplified programming model, and in fact such a poor fit for high-level analytics that it quickly spawned projects like Pig and Hive. But what if Pig had been developed before Hadoop? Would Hadoop then have been a reasonable choice as its runtime?
The answer is most probably no.

_
Data Model

In the "evolving world model" a system needs to support various types of data, i.e. structured, semi-structured, and unstructured. Updates can be frequent, involving not only values but also schema changes. And of course, the amount of data is large and data can arrive in the system at high rates. In other words, all three V's of "Big Data" need to be handled.

ASTERIX defines its own data model, the ASTERIX Data Model (ADM). In this model, data is semi-structured, typed, and self-describing. Data is organized in datasets, which can be accessed in parallel, indexed, partitioned, and possibly replicated across the cluster. Datasets are entities close to tables in the relational world. By default, data schemas are "open", meaning that schemas can be extended with new types in the future. Datasets related to an application can be grouped in a collection called a dataverse, which corresponds to a database in a relational context. External datasets are also supported in a Hive-like way. Users create a dataverse using a data definition language (DDL), in which they must specify a primary key and an optional partitioning key for each dataset.
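As a rough sketch of what such DDL might look like, consider the following AQL-style declarations; the syntax is modeled on the later open-source AsterixDB system, and all names (CityEvents, VisitType, Visits) are invented for illustration:

```
/* illustrative dataverse and open datatype; names are hypothetical */
create dataverse CityEvents;
use dataverse CityEvents;

create type VisitType as open {
  visitId: int32,
  visitorName: string,
  placeName: string,
  visitTime: datetime
};

/* a primary key is required; a separate partitioning key is optional */
create dataset Visits(VisitType) primary key visitId;
```

Because VisitType is declared open, records may carry extra fields beyond those listed, which is how the "open" schemas described above accommodate new types of data.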

_
Query Language

The ASTERIX Query Language (AQL), inspired by XQuery, is designed to fit the needs of the data model described above. AQL has a declarative syntax and offers expressive power similar to that of XQuery and Jaql. AQL queries are compiled into a logical plan of operators, a DAG (Directed Acyclic Graph), which then goes through a static optimization process. The goal is to provide a reusable algebra layer, which would make it easy to support other popular languages, such as Pig and Hive. After optimization, the logical plan is transformed into a Hyracks job, i.e. a job that can be executed on the ASTERIX runtime system, described in a later section.
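To make the declarative style concrete, here is a minimal hypothetical query over an invented Visits dataset; the names and syntax details are assumptions, modeled on AQL as later released in AsterixDB:

```
use dataverse CityEvents;

/* return the name and time of every recorded City Hall visit */
for $v in dataset('Visits')
where $v.placeName = 'City Hall'
order by $v.visitTime
return { "visitor": $v.visitorName, "time": $v.visitTime }
```

The compiler would turn such a query into a DAG of operators (scan, select, sort, project), optimize it, and hand the result to Hyracks for parallel execution.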

_
System Architecture

Similar to Map-Reduce, ASTERIX targets large shared-nothing cluster architectures. Users submit AQL queries to the system, which compiles them and generates a runtime execution plan. ASTERIX's execution layer, Hyracks, is responsible for determining the degree of parallelism for each level of the submitted job. To make these decisions, Hyracks takes into account the characteristics of the operators being used, as well as the current state of the cluster. Hyracks is multi-purpose and general enough to support even Map-Reduce-style computation and other kinds of operator graphs.

An ASTERIX cluster contains two kinds of nodes. The Metadata Nodes store information about the datasets as well as about system resources. They are also responsible for query compilation and planning. The rest of the nodes are Compute Nodes; they contain only what is necessary to manage dataset partitions and participate in the execution of Hyracks jobs.

_
Hyracks Overview

Hyracks is the execution engine of ASTERIX, analogous to Nephele in Stratosphere. Similarly, a Hyracks job is a DAG of operator nodes called HODs (Hyracks Operator Descriptors) connected by CDs (Connector Descriptors). CDs expose data dependencies and encapsulate the data distribution logic to be used at runtime; examples of CDs are one-to-one (1:1) and M:N hash-merge connectors. A HOD creates several HADs (Hyracks Activity Nodes), which tell Hyracks how a job can be divided into stages. Hyracks can then decide the evaluation order of the stages and set the degree of parallelism for each one. Finally, HONs (Hyracks Operator Nodes) are created and are responsible for the actual execution.

_
The first open-source release of ASTERIX is expected later this year, while Hyracks is already available on Google Code.

_
Links

V.