Motivation
For analytical workloads, users usually run scan queries on a few attributes. For such workload, columnar formats are usually superior to the row-major formats.
User Model
AsterixDB supports both row-major and column-major formats. When creating a dataset, you can specify the dataset's format (either row or column) in the WITH-clause. For example:
CREATE TYPE DatasetType AS { uid: uuid }; CREATE DATASET ColumnDataset(DatasetType) PRIMARY KEY uid AUTOGENERATED WITH { -- Specify the format as such "dataset-format":{"format":"column"} } CREATE DATASET RowDataset(DatasetType) PRIMARY KEY uid AUTOGENERATED WITH { -- Specify the format as such "dataset-format":{"format":"row"} } CREATE DATASET DefaultFromatDataset(DatasetType) PRIMARY KEY uid AUTOGENERATED; -- row format is the default
Here, the created dataset ColumnDataset is in a column-major format. The other dataset (i.e., RowDataset) is in a row-major format. If no format is specified, then the default is a row-major format.
Other parameters can be specified in the "dataset-format" field, which will describe later in <LINK-TO-ADVANCED-PARAMETERS>
Columnizing ADM Records
In AsterixDB, we used a Dremel (or Apache Parquet)-like format to represent ADM (or JSON) data in columns. Here, we will go through a few concepts and example on how ADM records are represented in columns.
First thing first, a schema of the ingested values are required. Hence, we use a similar technique as detailed in this paper to infer the schema. We will give an overview later on in <LINK-TO-INGESTION-WORKFLOW>.
Object
Values in an object could be nested and null/missing values could occur in different nesting levels. Let's take a few examples:
{"id": 0} {"id": 1, "name": {"first": "Ann"}} {"id": 2, "name": {"first": "John", "last": "Smith"}}
Here is the schema for the three records
We can see that the root has two fields: id and name and their types are Integer and Object, respectively. The object name has two fields first and last, where both fields are of type String.