Motivation

For analytical workloads, users usually run scan queries on a few attributes. For such workload, columnar formats are usually superior to the row-major formats.

User Model

AsterixDB supports both row-major and column-major formats. When creating a dataset, you can specify the dataset's format (either row or column) in the WITH-clause. For example:

Create a dataset with a columnar format

CREATE TYPE DatasetType AS {
	uid: uuid
};
    
CREATE DATASET ColumnDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
	-- Specify the format as such
	"dataset-format":{"format":"column"}
}

CREATE DATASET RowDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
	-- Specify the format as such
	"dataset-format":{"format":"row"}
}

CREATE DATASET DefaultFromatDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED; -- row format is the default

Here, the created dataset ColumnDataset is in a column-major format. The other dataset (i.e., RowDataset) is in a row-major format. If no format is specified, then the default is a row-major format.

Other parameters can be specified in the "dataset-format" field, which will describe later in <LINK-TO-ADVANCED-PARAMETERS>

Columnizing ADM Records

In AsterixDB, we used a Dremel (or Apache Parquet)-like format to represent ADM (or JSON) data in columns. Here, we will go through a few concepts and example on how ADM records are represented in columns.

First thing first, a schema of the ingested values are required. Hence, we use a similar technique as detailed in this paper to infer the schema. We will give an overview later on in <LINK-TO-INGESTION-WORKFLOW>.

Object

Values in an object could be nested and null/missing values could occur in different nesting levels. Let's take a few examples:

JSON Example

{"id": 0}
{"id": 1, "name": {"first": "Ann"}}
{"id": 2, "name": {"first": "John", "last": "Smith"}}

Here is the schema for the three records

We can see that the root has two fields: id and name and their types are Integer and Object, respectively. The object name has two fields first and last, where both fields are of type String.

Page tree

Columnar Format: AsterixDB Mega Attributes Across (AMAX)

Motivation

User Model

Columnizing ADM Records

Object