ID	IEP-92
Author	Aleksey Demakov
Sponsor
Created
Status	DRAFT

Motivation

Considering e.g. a storage engine internal operation, its interaction with the SQL engine, or a database system as a whole we can see that row-like tuples are used in a variety of cases.

We already have IEP-54[1] that defines a representation of table rows suitable for a key-value storage engine. It clearly separates key and value columns by keeping them in two distinct chunks. The separation is strictly defined by the table schema. The primary key columns go to the IEP-54[1] key chunk and other columns go to the value chunk.

However this approach does not cover all the cases when a database system deals with row-like data. For instance, execution of an SQL query produces a result set that may consist of rows containing columns from different tables, SQL-expressions, aggregate functions, etc. For such an arbitrary set of columns the distinction between key and value columns makes no sense at all.

When a storage engine deals with secondary indexes it picks some columns from a row and forms a tuple where the distinction between primary-key and non-primary-key columns does not play a role.

For a parametrized SQL query the query parameters also might need to be packed together into single tuple to be fed to the database.

This proposal introduces a new binary format that might be suitable for all of these cases and more.

The design principles behind this format are as follows:

Keep it simple. Easy to build and parse.
Provide O(1) random access to every value.
Make it reasonably compact.

Description

We consider only the case when the tuple content is completely defined by a corresponding schema. Schema relieves tuples of any metadata and thus provides space savings. The downside is that the tuple data can only be properly interpreted if the corresponding schema is available. For our use cases this is not a problem as the schema can always be determined from the context.

However for fast random access to individual tuple fields the general tuple structure depends only on the schema size. No other metadata is needed to find the required field. But one needs to learn the field data type from the schema when it is required to interpret the field as a value.

Tuple Layout

A binary tuple is kept as a sequence of fields contiguously going in memory one after another in the same order as they go in the tuple schema. The number of fields is defined by the schema too. Each field is a sequence of bytes that can be interpreted only after learning the field type in the schema.

The field location information is represented as a table of field offsets. This table allows to find both the offset and the length of each field.

To represent NULL values a tuple may contain a nullmap. Nullmap is skipped if there are no actual NULL values in a given tuple.

To be more precise the layout of a tuple looks like this:

Header;
Nullmap;
Offset table;
Value area.

Header

It contains only one byte with flags.

Bits 0-1: Size class of the variable-length area.

Bit 2: Set if the nullmap is present.

Size class encodes the number of bytes used for entries in the offset table. It is encoded like this:

0b00 - 1 byte

0b01 - 2 bytes

0b10 - 4 bytes

0b11 - 8 bytes

So the entry size can be calculated as follows:

Size = 1 << (flags & 0b11)

NOTE: By the way, the last option works fine for C++ and possibly other languages but for Java it will be hard to support, so it might be excluded.

Nullmap

If a schema has nullable columns then the corresponding tuple may contain a nullmap. The nullmap is skipped if there are no actual NULL values in a given tuple instance.

Nullmap is a bitset that contains N bits, occupies (N + 7) / 8 bytes, where `N` is the number of columns in the schema.

In the bitset if a bit is set to `1` then the respective field is NULL.

Offset Table

The number of elements in the table is equal to the number of columns in the schema.

The size of table elements might be 1, 2, 4, or 8 bytes depending on the header flags.

Each element in the table specifies where the corresponding field ends. Or in other words, for each given field the table stores the offset where the next field starts. The last element in the table specifies the end of the last field and at the same time the end of the whole value area and consequently the end of the entire tuple.

The offset is measured relative to the start of the value area.

The start of the first element is implied and is always equal to zero.

Value area

Contains data for each tuple field one after another as described above.

Field Values

The list of supported data types for tuple fields is the same as in IEP-54[1]. Fixed-size values are stored in the little-endian byte order. If a value is equal to NULL then it is absent in the value area. This means that in the offset table the corresponding entry is equal to the previous entry. At the same time the corresponding bit in the nullmap is set.

Hence for any variable-length field as we find its length from the offset table we can encounter the case when the length is equal to zero. At this point the value has to be disambiguated in this way:

If the corresponding nullmap bit is clear then this is a zero-length (empty) value;
If the corresponding nullmap bit is set then this is a NULL value.

This approach is naturally extended to fixed-size fields. But instead of a zero-length value we interpret it as a value of zero. That is

0 for all integer types;
0.0 for floating point types;
00000000-0000-0000-0000-000000000000 for UUID;
00:00:00.000000 for time;
Jan 1, 1970 00:00:00.000000 for timestamp;
Jan 1, 1 BC (as 1 BC immediately precedes 1 AD in the Gregorian calendar)

Integer Field Compression

This is optional and subject to discussion.

Uniform use of offset tables for all fields lets us take advantage of variable-length principle for integer fields.

Thus for integer values it is possible to keep only their significant bytes, omitting their high insignificant bytes. That is even if the original data type of a field is INTEGER and should occupy 4 bytes but the actual value is below 128 and above -129 then we can store it just as a single byte and reflect this in the offset table.

Corollary

If the number of fields is N and t is an array that stores a binary tuple we can find the answers for the following:

Does the tuple contain a nullmap?

hasNullmap = t[0] & 0b100;

How many bytes are occupied by the nullmap?

nullmapBytes = hasNullmap ? (N + 7) / 8 : 0;

How many bytes are occupied by one offset table entry?

offsetEntryBytes = 1 << (t[0] & 0b11);

How many bytes are occupied by the offset table?

offsetTableBytes = N * offsetEntryBytes;

What is offset of the value area?

valueBaseOffset = 1 + nullmapBytes + offsetTableBytes;

What is the whole tuple size?

tupleSize = valueBaseOffset + valueAreaSize; // valueAreaSize goes from the last offset table entry

Reference Links

1. IEP-54: Schema-first Approach

Page tree

Motivation

Description

Tuple Layout

Header

Nullmap

Offset Table

Value area

Field Values

Integer Field Compression

Corollary

Reference Links

Tickets

Page tree

IEP-92: Binary Tuple Format

Motivation

Description

Tuple Layout

Header

Nullmap

Offset Table

Value area

Field Values

Integer Field Compression

Corollary

Reference Links

Tickets