Motivation

For analytical workloads, users usually run scan queries on a few attributes. For such workloads, columnar formats are usually superior to row-major formats.

User Model

AsterixDB supports both row-major and column-major formats. When creating a dataset, you can specify the dataset's format (either row or column) in the WITH-clause. For example:

Create a dataset with a columnar format
CREATE TYPE DatasetType AS {
	uid: uuid
};
    
CREATE DATASET ColumnDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
	-- Specify the format as such
	"dataset-format":{"format":"column"}
}

CREATE DATASET RowDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
	-- Specify the format as such
	"dataset-format":{"format":"row"}
}

CREATE DATASET DefaultFormatDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED; -- row format is the default

Here, the created dataset ColumnDataset is in a column-major format, while the other dataset (i.e., RowDataset) is in a row-major format. If no format is specified, the default is a row-major format.

Other parameters can be specified in the "dataset-format" field, which we will describe later in <LINK-TO-ADVANCED-PARAMETERS>.

Columnizing ADM Records

In AsterixDB, we use a Dremel-like (or Apache Parquet-like) format to represent ADM (or JSON) data in columns. Here, we will go through a few concepts and examples of how ADM records are represented in columns. Our implementation is slightly different from the one in the original paper – we have changed it to simplify the code and make it easier to read. We will highlight the differences as we explain the following subsections.

First things first: a schema of the ingested values is required. Hence, we use a technique similar to the one detailed in this paper to infer the schema.
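To make the idea concrete, here is a minimal Python sketch of per-path schema inference over a stream of JSON records. The function name and the output shape (a map from field path to the set of observed types) are purely illustrative, not AsterixDB's actual API, and real inference must also handle type unions and incremental schema evolution:

```python
import json

def infer_schema(records):
    """Toy schema inference: collect, for every field path, the set of
    types observed across all records (hypothetical helper)."""
    schema = {}

    def visit(path, value):
        if isinstance(value, dict):
            schema.setdefault(path, set()).add("object")
            for k, v in value.items():
                visit(f"{path}.{k}" if path else k, v)
        elif isinstance(value, list):
            schema.setdefault(path, set()).add("array")
            for item in value:
                visit(path + "[]", item)
        else:
            # Python type names stand in for ADM scalar types here
            schema.setdefault(path, set()).add(type(value).__name__)

    for line in records:
        visit("", json.loads(line))
    return schema

docs = ['{"id": 0}', '{"id": 1, "name": {"first": "Ann"}}']
print(infer_schema(docs))
# -> {'': {'object'}, 'id': {'int'}, 'name': {'object'}, 'name.first': {'str'}}
```

Note how the second record extends the schema inferred from the first: the field name.first appears only once, yet it still gets a path (and, in the real system, a column).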

Object

Values in an object can be nested, and null/missing values can occur at different nesting levels. Let's take a few examples:

JSON Example
{"id": 0}
{"id": 1, "name": {"first": "Ann"}}
{"id": 2, "name": {"first": "John", "last": "Smith"}}

Here is the schema for the three records:

We can see that the root has two fields, id and name, whose types are Integer and Object, respectively. The object name has two fields, first and last, both of type String. Below each scalar value in the schema, there is a number that represents a ColumnIndex. The following tables show the column representation of those scalar values.

Definition Level    Value
1                   0
1                   1
1                   2

id column (ColumnIndex = 0) – max definition level = 1

Definition Level    Value
0                   MISSING
2                   Ann
2                   John

name.first column (ColumnIndex = 1) – max definition level = 2

Definition Level    Value
0                   MISSING
1                   MISSING
2                   Smith

name.last column (ColumnIndex = 2) – max definition level = 2

Each column stores the values of one "attribute" or field (e.g., name.first) for the three records. The Definition Level tells us whether a record has NULL, MISSING, or an actual value for a specific field. NULL and MISSING can occur at different nesting levels, and the definition level value tells us which of the nested values were present (i.e., not NULL or MISSING). The maximum definition level is the number of levels in a path. For example, the path for the field name.first is root (0) → name (1) → first (2); hence, the maximum definition level is 2. When the definition level of a value equals the maximum definition level, the scalar value is present and is neither NULL nor MISSING.

To illustrate, let us look at the values for the field name.last. The definition levels are 0, 1, and 2 for the three records, respectively. Definition level 0 means that the object name is MISSING in the first record (see the JSON records above); definition level 1 means the object name was present but the field last was not; and definition level 2 means the string value name.last is present and its value is "Smith".
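The encoding in the tables above can be sketched in a few lines of Python. The helper column_for_path is hypothetical and deliberately simplified: it handles objects only (no arrays yet), and it emits (definition level, value) pairs for one field path:

```python
MISSING = object()  # sentinel for an absent field

def column_for_path(records, path):
    """Compute (definition level, value) pairs for a scalar field path,
    e.g. ("name", "last"). Toy version of the Dremel-style encoding:
    the level counts how many steps of the path were present."""
    out = []
    for rec in records:
        level, node = 0, rec
        for step in path:
            if not isinstance(node, dict) or step not in node:
                node = MISSING
                break
            node = node[step]
            level += 1
        out.append((level, "MISSING" if node is MISSING else node))
    return out

records = [
    {"id": 0},
    {"id": 1, "name": {"first": "Ann"}},
    {"id": 2, "name": {"first": "John", "last": "Smith"}},
]
print(column_for_path(records, ("name", "last")))
# -> [(0, 'MISSING'), (1, 'MISSING'), (2, 'Smith')]
```

Running it on ("name", "first") reproduces the name.first table the same way: (0, 'MISSING'), (2, 'Ann'), (2, 'John').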

Arrays and Multisets

Arrays and multisets can have what the Dremel paper calls repeated values. That means an attribute in a record can have more than one value. Therefore, we need to determine the start and the end of each repeated value. Arrays and multisets in ADM and JSON can be nested (e.g., arrays of arrays, or arrays of objects where each field can have nested arrays). It is important to determine which value belongs to which record, along with its order (or index). Let's see an example:

JSON Example
{"id": 0, "numbers": [1, 2]}
{"id": 1, "numbers": [4, 5]}
{"id": 2}
{"id": 3, "numbers": [6, 7, 8, 9]}

Here is the schema, and the numbers column's values:

Definition Level    Value
2                   1
2                   2
0                   <DELIMITER>
2                   4
2                   5
0                   <DELIMITER>
0                   MISSING
2                   6
2                   7
2                   8
2                   9
0                   <DELIMITER>

numbers column (ColumnIndex = 1) – max definition level = 2 – max-delimiter = 0

In this example, we only have one column, as there is only one scalar value. However, the scalar value belongs to an array; thus, multiple values can belong to a single record (e.g., the last record has four values). The maximum definition level is 2: root (0) → numbers (1) → int (2). We also have a max-delimiter, which is 0. We see that the first two values are followed by a "<DELIMITER>", which has definition level 0 (our max-delimiter value). The delimiter's role is to determine the end of a repeated value. Thus, we know that the first two values [1, 2] belong to the first record. The next two, [4, 5], belong to the second record. However, after the second delimiter, we see a missing value with definition level 0. A delimiter can occur only if the array is present (i.e., there is a value with a definition level > 0). Hence, we know that the definition level 0 for the value that follows the second delimiter means the array itself is MISSING.
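Reading such a column back can be sketched as follows. This is a toy decoder for the singly nested case (max-delimiter = 0), not the actual implementation; the "<DELIMITER>" string and the "MISSING" marker mirror the table above:

```python
def decode_column(entries, max_def_level):
    """Reassemble per-record arrays from (definition level, value) pairs
    of a singly nested array column. A delimiter ends the current
    record's array; a non-delimiter entry below the max definition
    level means the array itself is NULL/MISSING for that record."""
    records, current = [], []
    for level, value in entries:
        if value == "<DELIMITER>":
            records.append(current)
            current = []
        elif level == max_def_level:
            current.append(value)
        else:
            records.append("MISSING")  # the whole array is absent
    return records

column = [
    (2, 1), (2, 2), (0, "<DELIMITER>"),
    (2, 4), (2, 5), (0, "<DELIMITER>"),
    (0, "MISSING"),
    (2, 6), (2, 7), (2, 8), (2, 9), (0, "<DELIMITER>"),
]
print(decode_column(column, max_def_level=2))
# -> [[1, 2], [4, 5], 'MISSING', [6, 7, 8, 9]]
```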

Let's take another (more interesting) example

JSON Example
{"id": 0, "numbers": [[1, 2], [4, 5, 6]]}
{"id": 1, "numbers": [[7, 8]]}
{"id": 2, "numbers": [[10]]}

Definition Level    Value
3                   1
3                   2
1                   <DELIMITER>
3                   4
3                   5
3                   6
1                   <DELIMITER>
0                   <DELIMITER>
3                   7
3                   8
1                   <DELIMITER>
0                   <DELIMITER>
3                   10
1                   <DELIMITER>
0                   <DELIMITER>

numbers column (ColumnIndex = 1) – max definition level = 3 – max-delimiter = 1

In this example, we have two nested arrays. Thus, the max-delimiter is 1, which tells us that the definition levels 0 and 1 can act as delimiters. After [1, 2], we see a delimiter with definition level 1, which indicates the end of the inner array. Then, we have [4, 5, 6], which is also delimited by definition level 1, as it is the second inner array. After that delimiter, we have another delimiter with definition level 0, which tells us that the outer array ends at this point and the following values belong to the second record. Following this, we see that each array (inner or outer) is delimited once, and the remaining values follow the same pattern (i.e., values followed by delimiters). As opposed to the original paper, we delimit every inner array before any outer array. In the original paper, consecutive delimiters (e.g., 1 and 0) are compressed into a single delimiter 0. This, to some extent, complicates the code, as we would need to check whether an inner array has not yet been delimited.
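The doubly nested case can be decoded with the same idea, now with two levels of delimiters. Again, a toy sketch rather than the actual implementation, and it assumes every array is present (no MISSING handling), which holds for the three records above:

```python
def decode_nested(entries, max_def_level, max_delimiter):
    """Reassemble a doubly nested array column. A delimiter at the
    max-delimiter level (1) closes the current inner array; a delimiter
    at level 0 closes the outer array, i.e., the record."""
    records, outer, inner = [], [], []
    for level, value in entries:
        if value == "<DELIMITER>":
            if level == max_delimiter:   # end of an inner array
                outer.append(inner)
                inner = []
            else:                        # level 0: end of the record
                records.append(outer)
                outer = []
        elif level == max_def_level:
            inner.append(value)
    return records

column = [
    (3, 1), (3, 2), (1, "<DELIMITER>"),
    (3, 4), (3, 5), (3, 6), (1, "<DELIMITER>"), (0, "<DELIMITER>"),
    (3, 7), (3, 8), (1, "<DELIMITER>"), (0, "<DELIMITER>"),
    (3, 10), (1, "<DELIMITER>"), (0, "<DELIMITER>"),
]
print(decode_nested(column, max_def_level=3, max_delimiter=1))
# -> [[[1, 2], [4, 5, 6]], [[7, 8]], [[10]]]
```

Because every inner array is delimited before its outer array, the decoder never has to track whether an inner array is still "open" when it sees a level-0 delimiter; that is exactly the simplification over the original paper's compressed-delimiter scheme.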
