...

In AsterixDB, we use a Dremel-like (or Apache Parquet-like) format to represent ADM (or JSON) data in columns. Here, we go through a few concepts and examples of how ADM records are represented in columns. The implementation differs slightly from the one in the original paper: we changed it to simplify the code and make it easier to read. We highlight the differences as we explain them in the following subsections.

First things first: a schema of the ingested values is required. Hence, we use a technique similar to the one detailed in this paper to infer the schema. We will give an overview later on in <LINK-TO-INGESTION-WORKFLOW>.

Object

Values in an object can be nested, and null/missing values can occur at different nesting levels. Let's take a few examples:

...

In this example, we have two nested arrays. Thus, the max delimiter is 1, which tells us that definition levels 0 and 1 can be used as delimiters. After [1, 2], we see a delimiter with definition level 1, which indicates the end of the inner array. Then we have [4, 5, 6], which is also delimited by definition level 1, as it is the second inner array. After the second delimiter, we have another delimiter with definition level 0, which tells us that the outer array is finished at this point and that the following values belong to the second record. As this shows, each array (inner or outer) is delimited exactly once. The following values have the same pattern (i.e., values followed by delimiters). As opposed to the original paper, we delimit every inner array before any outer array. In the original paper, consecutive delimiters (e.g., 1 and 0) are compressed into a single delimiter 0. This complicates the code to some extent, as we would need to check whether an inner array has not yet been delimited.
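
The delimiting scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the AsterixDB implementation: the names `encode` and `decode`, the `("value", v)` / `("delim", level)` token representation, and the second record's values are all assumptions made for the example. It also assumes that every value sits at the deepest nesting level and ignores null/missing values.

```python
from typing import Any, List, Tuple

# Hypothetical token representation: ("value", v) or ("delim", level).
Token = Tuple[str, Any]


def encode(array: List[Any], depth: int = 0) -> List[Token]:
    """Flatten a nested array into values and delimiters.

    Every array, inner or outer, emits exactly one delimiter at its own
    depth, so inner arrays are always delimited before the outer one.
    """
    tokens: List[Token] = []
    for item in array:
        if isinstance(item, list):
            tokens.extend(encode(item, depth + 1))
        else:
            tokens.append(("value", item))
    tokens.append(("delim", depth))
    return tokens


def decode(tokens: List[Token], max_delimiter: int) -> List[Any]:
    """Rebuild the records from a token stream.

    Assumes all values live at depth `max_delimiter`. Because each array is
    delimited exactly once, a delimiter at level L simply closes the array
    that is currently open at level L; no look-ahead is needed.
    """
    records: List[Any] = []
    open_arrays: List[List[Any]] = [[] for _ in range(max_delimiter + 1)]
    for kind, payload in tokens:
        if kind == "value":
            open_arrays[max_delimiter].append(payload)
        else:
            finished = open_arrays[payload]
            open_arrays[payload] = []
            if payload == 0:
                records.append(finished)  # outer array done: record complete
            else:
                open_arrays[payload - 1].append(finished)
    return records


# First record follows the text; the second record is made up for illustration.
first = [[1, 2], [4, 5, 6]]
second = [[7], [8, 9]]
stream = encode(first) + encode(second)
restored = decode(stream, max_delimiter=1)
```

For the first record, `encode` produces the values 1 and 2, a delimiter at level 1, the values 4, 5, and 6, another delimiter at level 1, and finally a delimiter at level 0. Because every array closes itself, the decoder never has to check whether an inner array is still open when it sees an outer delimiter.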