Motivation
For analytical workloads, users usually run scan queries that touch only a few attributes. For such workloads, columnar formats are usually superior to row-major formats.
User Model
AsterixDB supports both row-major and column-major formats. When creating a dataset, you can specify the dataset's format (either row or column) in the WITH-clause. For example:
```sql
CREATE TYPE DatasetType AS { uid: uuid };

CREATE DATASET ColumnDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
  -- Specify the format as such
  "dataset-format": {"format": "column"}
};

CREATE DATASET RowDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
  -- Specify the format as such
  "dataset-format": {"format": "row"}
};

-- Row format is the default
CREATE DATASET DefaultFormatDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED;
```
Here, the created dataset ColumnDataset is in a column-major format, while RowDataset is in a row-major format. If no format is specified (as with DefaultFormatDataset), the default is the row-major format.
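For illustration, the same DDL can also be submitted programmatically. The following is a minimal sketch, assuming a local AsterixDB instance with the default HTTP query service on port 19002 and the Python requests package; it is just one way to issue the statements, not the only one.

```python
# Minimal sketch: submit the DDL above through AsterixDB's HTTP query service.
# Assumptions: a local instance listening on the default port 19002, and the
# `requests` package installed.
import requests

DDL = """
CREATE TYPE DatasetType AS { uid: uuid };

CREATE DATASET ColumnDataset(DatasetType)
PRIMARY KEY uid AUTOGENERATED WITH {
  "dataset-format": {"format": "column"}
};
"""

response = requests.post(
    "http://localhost:19002/query/service",
    data={"statement": DDL},
)
response.raise_for_status()
print(response.json()["status"])  # expected: "success"
```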
Other parameters can be specified in the "dataset-format" field; we will describe them later in <LINK-TO-ADVANCED-PARAMETERS>.
Columnizing ADM Records
In AsterixDB, we use a Dremel-like (or Apache Parquet-like) format to represent ADM (or JSON) data in columns. Here, we will go through a few concepts and examples of how ADM records are represented in columns.
First things first, a schema of the ingested values is required. Hence, we use a technique similar to the one detailed in this paper to infer the schema. We will give an overview of the inference process later in <LINK-TO-INGESTION-WORKFLOW>.
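To give a feel for what schema inference involves, here is a simplified, hedged Python sketch (purely illustrative, not AsterixDB's actual algorithm): each ingested record's structure is merged into a single evolving schema, and fields that appear in only some records simply remain in the merged schema, with their absence captured later by definition levels.

```python
# Simplified sketch of schema inference: walk each record and merge the
# observed structure into one schema. Not AsterixDB's implementation.
def infer(value):
    if isinstance(value, dict):
        return {"kind": "object",
                "fields": {k: infer(v) for k, v in value.items()}}
    if isinstance(value, list):
        item = {"kind": "any"}
        for v in value:
            item = merge(item, infer(v))
        return {"kind": "array", "item": item}
    return {"kind": type(value).__name__}  # e.g. "int", "str"

def merge(a, b):
    if a["kind"] == "any":
        return b
    if b["kind"] == "any":
        return a
    if a["kind"] == b["kind"] == "object":
        fields = dict(a["fields"])
        for k, v in b["fields"].items():
            fields[k] = merge(fields[k], v) if k in fields else v
        return {"kind": "object", "fields": fields}
    if a["kind"] == b["kind"] == "array":
        return {"kind": "array", "item": merge(a["item"], b["item"])}
    if a["kind"] == b["kind"]:
        return a
    return {"kind": "union", "options": [a, b]}  # heterogeneous values

records = [{"id": 0}, {"id": 1, "name": {"first": "Ann"}}]
schema = {"kind": "any"}
for r in records:
    schema = merge(schema, infer(r))
```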
Object
Values in an object could be nested, and NULL/MISSING values could occur at different nesting levels. Let's take a few examples:
{"id": 0} {"id": 1, "name": {"first": "Ann"}} {"id": 2, "name": {"first": "John", "last": "Smith"}}
Here is the schema for the three records
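Roughly, the inferred schema looks like this (a hand-drawn textual sketch, not AsterixDB's actual output; the ColumnIndex of each scalar is shown in parentheses):

```
root: Object
 ├── id: Integer         (0)
 └── name: Object
      ├── first: String  (1)
      └── last: String   (2)
```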
We can see that the root has two fields, id and name, whose types are Integer and Object, respectively. The object name has two fields, first and last, both of type String. Below each scalar value in the schema, there is a number that represents its ColumnIndex. The following tables show the column representation of those scalar values.
Definition Level | Value |
---|---|
1 | 0 |
1 | 1 |
1 | 2 |
id column (ColumnIndex = 0) – max definition level = 1
Definition Level | Value |
---|---|
0 | MISSING |
2 | Ann |
2 | John |
name.first column (ColumnIndex = 1) – max definition level = 2
Definition Level | Value |
---|---|
0 | MISSING |
1 | MISSING |
2 | Smith |
name.last column (ColumnIndex = 2) – max definition level = 2
Each column stores the values of one "attribute" or field (e.g., name.first) for the three records. The Definition Level tells us whether a record has NULL, MISSING, or an actual value for a specific field. NULL and MISSING could occur at different nesting levels, and the definition level tells us which of the nested values were present (i.e., not NULL or MISSING). The maximum definition level is the number of levels in a path. For example, the path of the field name.first is root (0) → name (1) → first (2); hence, the maximum definition level is 2. When the definition level of a value equals the maximum definition level, the scalar value is present and is not NULL or MISSING. To illustrate, let us look at the values of the field name.last. The definition levels were 0, 1, and 2 for the three records, respectively. The definition level 0 means that the object name is MISSING in the first record (see the JSON records above), the definition level 1 means that the object name was present but the field last was not, and the definition level 2 means that the string value name.last is present and its value is "Smith".
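To make the mechanics concrete, here is a small, hedged Python sketch (illustrative only, not AsterixDB's implementation) that reproduces the (Definition Level, Value) pairs shown in the three tables above. It handles object fields only and ignores NULL and arrays.

```python
MISSING = "MISSING"

# Leaf columns of the schema above, keyed by dotted path.
# ColumnIndex: id -> 0, name.first -> 1, name.last -> 2.
columns = {"id": [], "name.first": [], "name.last": []}

def columnize(record):
    for path, column in columns.items():
        value, level = record, 0
        for step in path.split("."):
            if not isinstance(value, dict) or step not in value:
                column.append((level, MISSING))   # missing at this nesting level
                break
            value, level = value[step], level + 1
        else:
            column.append((level, value))         # level == max definition level

records = [
    {"id": 0},
    {"id": 1, "name": {"first": "Ann"}},
    {"id": 2, "name": {"first": "John", "last": "Smith"}},
]
for r in records:
    columnize(r)

print(columns["name.last"])   # [(0, 'MISSING'), (1, 'MISSING'), (2, 'Smith')]
```

Running it yields exactly the name.last column shown above: the definition level records how deep the path was present before a field went missing.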
Arrays and Multisets
Arrays and multisets can have what the Dremel paper calls repeated values; that is, an attribute in a record could have more than one value. Therefore, we need to determine the start and the end of each repeated value. Arrays and multisets in ADM and JSON could be nested (e.g., arrays of arrays, or arrays of objects where each field can itself have nested arrays). It is important to determine which value belongs to which record, and its order (or index) within that record. Let's see an example:
{"id": 0, "numbers": [1, 2]} {"id": 1, "numbers": [4, 5]} {"id": 2} {"id": 3,"numbers": [6, 7, 8, 9]}
the schema,
and the numbers column's values:
Definition Level | Value |
---|---|
2 | 1 |
2 | 2 |
0 | <DELIMITER> |
2 | 4 |
2 | 5 |
0 | <DELIMITER> |
0 | MISSING |
2 | 6 |
2 | 7 |
2 | 8 |
2 | 9 |
0 | <DELIMITER> |
numbers column (ColumnIndex = 1) – max definition level = 2 – max-delimiter = 0
In this example, we only have one column, as there is only one scalar field. However, the scalar values belong to an array; thus, multiple values could belong to a single record (e.g., the last record has four values). The maximum definition level is 2: root (0) → numbers (1) → int (2). We also have a max-delimiter, which is 0 here. We see that the first two values, [1, 2], are followed by a "<DELIMITER>" with definition level 0 (our max-delimiter value). The delimiter's role is to determine the end of a repeated value. Thus, we know that the first two values [1, 2] belong to the first record. The next two values [4, 5] belong to the second record. However, after the second delimiter, we see a MISSING value with definition level 0. A delimiter can only occur if the array is present (i.e., there is a value with a definition level > 0). Hence, we know that the definition level 0 of the entry that follows the second delimiter is there because the array itself is MISSING.
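These rules are enough to reassemble the records from the column alone. Below is a small, hedged Python sketch (illustrative only, not AsterixDB's code) that reads the entries back using nothing but the definition levels; NULL values and empty arrays are ignored for simplicity.

```python
MISSING, DELIM = "MISSING", "<DELIMITER>"

def read_numbers(entries, max_level=2):
    """Reassemble each record's `numbers` array from (definition level, value)
    entries, using only the definition levels: an entry below `max_level`
    closes the current array if one has started; otherwise the array itself
    is MISSING for that record. NULL and empty arrays are ignored here."""
    records, current, started = [], [], False
    for level, value in entries:
        if level == max_level:
            current.append(value)          # a present scalar inside the array
            started = True
        elif started:
            records.append(current)        # delimiter: close this record's array
            current, started = [], False
        else:
            records.append(MISSING)        # the array itself is absent
    return records

entries = [(2, 1), (2, 2), (0, DELIM),
           (2, 4), (2, 5), (0, DELIM),
           (0, MISSING),
           (2, 6), (2, 7), (2, 8), (2, 9), (0, DELIM)]
print(read_numbers(entries))
# [[1, 2], [4, 5], 'MISSING', [6, 7, 8, 9]]
```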
Let's take another (more interesting) example:
{"id": 0, "numbers": [[1, 2], [4, 5, 6]]} {"id": 0, "numbers": [[7, 8]]} {"id": 0, "numbers": [[10]]}
Definition Level | Value |
---|---|
3 | 1 |
3 | 2 |
1 | <DELIMITER> |
3 | 4 |
3 | 5 |
3 | 6 |
1 | <DELIMITER> |
0 | <DELIMITER> |
3 | 7 |
3 | 8 |
1 | <DELIMITER> |
0 | <DELIMITER> |
3 | 10 |
1 | <DELIMITER> |
0 | <DELIMITER> |
numbers column (ColumnIndex = 1) – max definition level = 3 – max-delimiter = 1
In this example, we have two nested array levels. Thus, the max-delimiter is 1, which tells us that the definition levels 0 and 1 can act as delimiters. After [1, 2], we see a delimiter with definition level 1, which indicates the end of the inner array. Then, we have [4, 5, 6], which is also delimited by definition level 1, as it is the second inner array. After that delimiter, we have another delimiter with definition level 0, which tells us that the outer array finishes at this point and the following values belong to the second record. Following this, we see that each array (inner or outer) is delimited once; the remaining values follow the same pattern (i.e., values followed by delimiters).
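As a final illustration, here is a hedged Python sketch (again, not AsterixDB's actual implementation) that emits the (Definition Level, Value) entries for an arbitrarily nested array column. It reproduces the table above: a present scalar gets the maximum definition level, and closing an array emits a delimiter whose level is one less than the array's own definition level, which yields exactly the 0/1 delimiter pattern shown.

```python
DELIM, MISSING = "<DELIMITER>", "MISSING"

def emit_array(array, def_level, max_level, out):
    """Append (definition level, value) entries for `array`, whose own
    definition level is `def_level`; scalars live at `max_level`."""
    for item in array:
        if def_level + 1 == max_level:
            out.append((max_level, item))                  # scalar leaf
        else:
            emit_array(item, def_level + 1, max_level, out)
    out.append((def_level - 1, DELIM))                     # end of this array

records = [
    {"id": 0, "numbers": [[1, 2], [4, 5, 6]]},
    {"id": 0, "numbers": [[7, 8]]},
    {"id": 0, "numbers": [[10]]},
]
entries = []
for record in records:
    if "numbers" not in record:
        entries.append((0, MISSING))    # the array itself is MISSING
    else:
        # `numbers` sits at definition level 1 (root -> numbers); ints at 3.
        emit_array(record["numbers"], 1, 3, entries)

for level, value in entries:
    print(level, value)
```

The same function with max_level = 2 also produces the single-level numbers column of the previous example, so one recursive rule covers both cases.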