Motivation
For analytical workloads, users usually run scan queries on a few attributes. For such workload, columnar formats are usually superior to the row-major formats.
User Model
AsterixDB supports both row-major and column-major formats. When creating a dataset, you can specify the dataset's format (either row or column) in the WITH-clause. For example:
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
CREATE TYPE DatasetType AS {
uid: uuid
};
CREATE DATASET ColumnDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
-- Specify the format as such
"dataset-format":{"format":"column"}
}
CREATE DATASET RowDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
-- Specify the format as such
"dataset-format":{"format":"row"}
}
CREATE DATASET DefaultFromatDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED; -- row format is the default
|
Here, the created dataset ColumnDataset is in a column-major format. The other dataset (i.e., RowDataset) is in a row-major format. If no format is specified, then the default is a row-major format.
Other parameters can be specified in the "dataset-format" field, which will describe later in <LINK-TO-ADVANCED-PARAMETERS>
Columnizing ADM Records
In AsterixDB, we used a Dremel (or Apache Parquet)-like format to represent ADM (or JSON) data in columns. Here, we will go through a few concepts and example on how ADM records are represented in columns.
First thing first, a schema of the ingested values are required. Hence, we use a similar technique as detailed in this paper to infer the schema. We will give an overview later on in <LINK-TO-INGESTION-WORKFLOW>.
Object
Values in an object could be nested and null/missing values could occur in different nesting levels. Let's take a few examples:
Code Block | ||
---|---|---|
| ||
{"id": 0}
{"id": 1, "name": {"first": "Ann"}}
{"id": 2, "name": {"first": "John", "last": "Smith"}} |
Here is the schema for the three records
draw.io Diagram | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
We can see that the root has two fields: id and name and their types are Integer and Object, respectively. The object name has two fields first and last, where both fields are of type String.