Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Motivation

For analytical workloads, users usually run scan queries on a few attributes. For such workload, columnar formats are usually superior to the row-major formats.

User Model

AsterixDB supports both row-major and column-major formats. When creating a dataset, you can specify the dataset's format (either row or column) in the WITH-clause. For example:

Code Block
languagesql
firstline1
titleCreate a dataset with a columnar format
linenumberstrue
CREATE TYPE DatasetType AS {
	uid: uuid
};
    
CREATE DATASET ColumnDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
	-- Specify the format as such
	"dataset-format":{"format":"column"}
}

CREATE DATASET RowDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
	-- Specify the format as such
	"dataset-format":{"format":"row"}
}

CREATE DATASET DefaultFromatDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED; -- row format is the default

Here, the created dataset ColumnDataset is in a column-major format. The other dataset (i.e., RowDataset) is in a row-major format. If no format is specified, then the default is a row-major format.

Other parameters can be specified in the "dataset-format" field, which will describe later in <LINK-TO-ADVANCED-PARAMETERS>

Columnizing ADM Records

In AsterixDB, we used a Dremel (or Apache Parquet)-like format to represent ADM (or JSON) data in columns. Here, we will go through a few concepts and example on how ADM records are represented in columns.

First thing first, a schema of the ingested values are required. Hence, we use a similar technique as detailed in this paper to infer the schema. We will give an overview later on in <LINK-TO-INGESTION-WORKFLOW>.

Object

Values in an object could be nested and null/missing values could occur in different nesting levels. Let's take a few examples:

Code Block
titleJSON Example
{"id": 0}
{"id": 1, "name": {"first": "Ann"}}
{"id": 2, "name": {"first": "John", "last": "Smith"}}

Here is the schema for the three records

draw.io Diagram
bordertrue
diagramNameSchema_object
simpleViewerfalse
width
linksauto
tbstyletop
lboxtrue
diagramWidth221
revision1

We can see that the root has two fields: id and name and their types are Integer and Object, respectively. The object name has two fields first and last, where both fields are of type String.