Motivation

For analytical workloads, users usually run scan queries on a few attributes. For such workload, columnar formats are usually superior to the row-major formats.

User Model

AsterixDB supports both row-major and column-major formats. When creating a dataset, you can specify the dataset's format (either row or column) in the WITH-clause. For example:

Code Block

language	sql
firstline	1
title	Create a dataset with a columnar format
linenumbers	true

CREATE TYPE DatasetType AS {
	uid: uuid
};
    
CREATE DATASET ColumnDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
	-- Specify the format as such
	"dataset-format":{"format":"column"}
}

CREATE DATASET RowDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
	-- Specify the format as such
	"dataset-format":{"format":"row"}
}

CREATE DATASET DefaultFromatDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED; -- row format is the default

Here, the created dataset ColumnDataset is in a column-major format. The other dataset (i.e., RowDataset) is in a row-major format. If no format is specified, then the default is a row-major format.

Other parameters can be specified in the "dataset-format" field, which will describe later in <LINK-TO-ADVANCED-PARAMETERS>

Columnizing ADM Records

In AsterixDB, we used a Dremel (or Apache Parquet)-like format to represent ADM (or JSON) data in columns. Here, we will go through a few concepts and example on how ADM records are represented in columns.

First thing first, a schema of the ingested values are required. Hence, we use a similar technique as detailed in this paper to infer the schema. We will give an overview later on in <LINK-TO-INGESTION-WORKFLOW>.

Object

Values in an object could be nested and null/missing values could occur in different nesting levels. Let's take a few examples:

Code Block

title	JSON Example

{"id": 0}
{"id": 1, "name": {"first": "Ann"}}
{"id": 2, "name": {"first": "John", "last": "Smith"}}

Here is the schema for the three records

draw.io Diagram

border	true

diagramName	Schema_object
simpleViewer	false
width
links	auto
tbstyle	top
lbox	true
diagramWidth	221
revision	1

We can see that the root has two fields: id and name and their types are Integer and Object, respectively. The object name has two fields first and last, where both fields are of type String.

Page tree

Versions Compared

Old Version 1

New Version 2

Key

Motivation

User Model

Columnizing ADM Records

Object

Page tree

Page History

Versions Compared

Old Version 1

New Version 2

Key

Motivation

User Model

Columnizing ADM Records

Object