ID

IEP-92

Author

Aleksey

Aleksey Demakov

Sponsor

Created

16 Jun 2022

Status


colour	Grey
title	DRAFT

...

We already have IEP-54[1] that defines specifies a representation of table rows suitable for a key-value storage engine. It clearly separates key and value columns by keeping them in two distinct chunks. The separation is strictly defined by the table schema. The primary key columns go to the IEP-54[1] key chunk and other columns go to the value chunk.

...

For a parametrized SQL query the query parameters also might need to be packed together into a single tuple to be fed to the database.

...

The design principles behind this the format are as follows:

Keep it simple. Easy to build and parse.
Provide O(1) random access to every value.
Make it reasonably compact.

...

We consider only the case when the tuple content is completely defined by a corresponding schema. A tuple schema may match a table schema (when tuples represent table rows) or may be very different (e.g. when tuples represent a result set of some complex SQL query).

Schema relieves tuples of from bearing any metadata and thus provides space savings. The downside is that the tuple data can only be properly interpreted if the corresponding schema is available. For our use cases this is not a problem as the schema can always be determined from the context.

However for , to provide fast and easy random access to individual tuple fields, the general basic tuple structure layout depends only on the schema size. No other metadata is needed to find get the required field. But one needs to learn the field data type from the schema when it is required to interpret the field as a value.

...

A binary tuple is kept as a sequence of fields contiguously going in memory one after another in the same order as they go in the tuple schema. The number of fields is defined by the schema too. Each field is a sequence of bytes that can be interpreted only after learning the field type in the schema.

This The field location information is represented as a table of field offsets. This table allows to find determine both the location offset and the length of each field.

To represent NULL values a tuple may contain a nullmap. Nullmap is skipped if there are no actual NULL values in a given tuple.To be more precise the layout of a tuple looks like this:

Header;
Nullmap;
Offset table;
Value area.

Header

It contains only one byte with flags in 1 byte.

Bits 0-1: Size class of the variable-length area.

Bit 2: Set if the nullmap is presentFlag indicating that the size class is not optimal.

Size class encodes the number of bytes used for entries in the offset table. It is encoded like this:

0b00 - 1 bytesbyte

0b01 - 2 bytes

0b10 - 4 bytes

...

NOTE: By the way, the last option works fine for C++ and possibly other languages but for Java it will be hard to support, so it might be excluded.

Nullmap

If a schema has nullable columns then the corresponding tuple may contain a nullmap. The nullmap may be skipped if there are no actual NULL values in a given tuple instance.

Nullmap is a bitset that contains N bits, occupies (N + 7) / 8 bytes, where `N` is the number of columns in the schema.

In the bitset every bit set to `1` means that the respective field is NULL.

Offset Table

The number of elements in the table is equal to the number of columns in the schema.

...

Each element in the table specifies where the corresponding field ends. Or in In other words, for each given field the table stores the offset where the next field starts. The last element in the table specifies the end of the last field and at the same time the end of the whole value area and consequently the end of the entire tuple.

...

The start of the first element is implied and is always equal to zero.

Value area

Contains A value area contains data for each tuple field one after another as described above.

Field Values

Potentially a field can contain any kind of data. However there are rules for representation of common data types there. These rules are specified in the following section. Data types for which there are no specified rules still may be put into a tuple as a Binary field, but the interpretation in such cases is fully up to a particular application.

Field Encoding

A tuple field is a sequence of bytes. This sequence is interpreted according to the associated data type provided by the tuple schema. There are different encoding rules for different values. For some data types the rules include a very simple compression mechanism. So even if the original type has fixed size it may take different number of bytes when encoded as a tuple field.

The list of supported data types for tuple fields is the same as in IEP-54[1]. Fixed-size values are stored in the little-endian byte order. If a value is equal to NULL then it is absent in the value area. This means that in the offset table the corresponding entry is equal to the previous entry. At the same time the corresponding bit in the nullmap is set.
Hence for any

For some variable-length field as we find its length from the offset table types we can encounter the case when the length is equal to zero. At this point the value has to be disambiguated in this way:

If the corresponding nullmap bit is clear then this is a zero-length (empty) value;
If the corresponding nullmap bit is set then this is a NULL value.

This approach is naturally extended to fixed-size fields. But instead of zero-length value we interpret it as a value of zero. That is 0 for integer fields, 0.0 for floating point fields, and whatever default value is for more complex types like date and time.

Integer Field Compression

This is optional and subject to discussion.

Uniform use of offset tables for all fields lets us take advantage of variable-length principle for integer fields.

a value with zero length. To distinguish a zero-length variable-length value from a null value we use a special magic byte that denotes an empty value. That is an empty value is encoded as single-byte sequence – 0x80. In turn if the original variable-length value starts with the magic byte 0x80 byte then this byte is doubled.

The Number and Decimal types are never empty, at least one significant byte is always present. The variable-length types that use the magic byte for encoding are as follows:

String;
Binary;
Bitmask.

The list of supported data types is as follows:

Type	Field Size	Description
Int8	1	1-byte signed integer
Int16	1, 2	2-byte signed integer, but may occupy less space due to compression mechanism described below
Int32	1, 2, 4	4-byte signed integer, but may occupy less space due to compression mechanism described below
Int64	1, 2, 4, 8	8-byte signed integer, but may occupy less space due to compression mechanism described below
Float	4	4-byte floating-point number
Double	4, 8	8-byte floating-point number, but may occupy 4 bytes if fits into float w/o loss of precision
Number	variable	Variable-length integer
Decimal	variable	Variable-length fixed-point number, the scale is determined by the schema
UUID	16	UUID
String	variable	An utf-8 encoded string
Binary	variable	Variable-length arbitrary binary data
Bitmask	variable	Variable-length binary data representing a bit-string
Date	3	A timezone-free date (a year, month, day)
Time	4, 5, 6	A timezone-free time (hour, minute, second, microseconds)
DateTime	7, 8, 9	A timezone-free datetime encoded as (date, time)
Timestamp	8, 12	Number of microseconds since Jan 1, 1970 00:00:00.000000 (with no timezone)
Duration	8, 12	See below
Period	3, 6, 12	See below
Boolean	1	A boolean value (either `true` of `false`)

Integer Representation

All integer values are stored in the little-endian byte order.

For Thus for integer values it is possible to keep only their significant bytes, omitting their high insignificant bytes. That is even if the original data type of a field is INTEGER and should occupy 4 bytes but the actual value is below 128 and above -129 then we can store it just as a single byte and reflect this in the offset table.

For signed integers the most significant bit is the sign bit. Thus a compressed signed integer must be sign-extended on decompression.

In SQL standard there are no unsigned integers. So their support is omitted for now. But should it become needed for some reason then a compressed unsigned integer must be zero-extended on decompression.

UUID Representation

A UUID value occupies 16 bytes. It is encoded with a slight twist. It is split into two 8-byte integers – the most significant and least significant ones. The most significant goes first, the least significant goes after it. But each of these integers is stored in little-endian byte order. So a UUID 00112233-4455-6677-8899-aabbccddeeff will be encoded as byte sequence 77 66 55 44 33 22 11 00 ff ee dd cc bb aa 99 88.

Date Representation

A date field occupies 3 bytes.

23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
year - 15 bits as a signed two's-complement value, so the highest bit is a sign bit															month - 4 bits				month day - 5 bits
byte 2								byte 1								byte 0

Time Representation

A time field may occupy 4, 5, or 6 bytes depending on the factional part precision that may be up to a millisecond, microsecond, and nanosecond respectively.

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
padding - 5 zero bits					hour - 5 bits					minute - 6 bits						second - 6 bits						milliseconds - 10 bits
byte 3								byte 2								byte 1								byte 0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
3 zero bits			hour - 5 bits					minute - 6 bits						second - 6 bits						microseconds - 20 bits
byte 4								byte 3								byte 2								byte 1								byte 0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
zero	hour - 5 bits					minute - 6 bits						second - 6 bits						nanoseconds - 30 bits
byte 5								byte 4								byte 3								byte 2								byte 1								byte 0

DateTime Representation

A DateTime field is represented as a Date followed by a Time. So it may occupy from 7 to 9 bytes depending on the factional part of the Time value.

Timestamp Representation

A timestamp field may occupy either 8 or 12 bytes. It consists of a 64-bit seconds part optionally followed by a 32-bit nanoseconds part. The seconds part is a signed number representing the time since so called Epoch time – Jan 1, 1970 00:00:00. The nanoseconds part varies between 0 and 999,999,999. It can be omitted meaning that it is equal to zero.

Bitmask Representation

A bitmask field is represented as a byte sequence in little-endian order.

Number and Decimal representation

Number and Decimal have identical representation. The Decimal type has a scale parameter but it is stored in the schema rather than in each tuple. Fields of these types are represented as a byte sequence containing the two's-complement binary value. Unlike all the other types here the bytes go in big-endian order. That is the most significant byte is at the very start of the sequence and the least significant byte is at its end.

Duration Representation

A duration field may occupy either 8 or 12 bytes. It consists of a 64-bit signed seconds part optionally followed by a 32-bit nanoseconds part in range from 0 to 999,999,999.

Period Representation

A period field may occupy 3, 6, or 12 bytes. It consists of 3 signed varlen int parts: years, months, days. When every part fits into 8 bits, we write 3 signed 8-bit integers. When every part fits into 16 bits, we write 3 signed 16-bit integers. Otherwise, every part is written as a signed 32-bit integer.

Unlike Date, every part is independent from others and can be in range from Integer.MIN_VALUE to Integer.MAX_VALUE.

Boolean Representation

A single byte containing 1 for the value of true and 0 for the value of false.

Corollary

If the number of fields is N and t is an array that stores a binary tuple we can find the answers for the following:

How many bytes are occupied by one offset table entry?

offsetEntryBytes = 1 << (t[0] & 0b11);

How many bytes are occupied by the offset table?

offsetTableBytes = N * offsetEntryBytes;

What is offset of the value area?

valueBaseOffset = 1 + offsetTableBytes;

What is the whole tuple size?

tupleSize = valueBaseOffset + valueAreaSize; // valueAreaSize goes from the last offset table entry

Corollary #2

In order to build a tuple using minimum possible space it is required to learn what is the total length of all non-null values. After that we can figure out the minimum possible size of the offset table entries.

Thus, generally speaking, building a binary tuple is a two-pass procedure. Sometimes it might be possible to turn this into a single pass (almost) by over-provisioning the allocated storage for the worst case and then fixing it up at the end.

Reference Links

1. IEP-54: Schema-first Approach

Tickets

Jira

server	ASF JIRA
columnIds	issuekey,summary,issuetype,created,updated,duedate,assignee,reporter,priority,status,resolution
columns	key,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
maximumIssues	20
jqlQuery	labels=iep-92
serverId	5aa69414-a9e9-3523-82ec-879b028fb15b

Page tree

Versions Compared

Old Version 8

New Version Current

Key

Header

Nullmap

Offset Table

Value area

Field Values

Field Encoding

Integer Field Compression

Integer Representation

UUID Representation

Date Representation

Time Representation

DateTime Representation

Timestamp Representation

Bitmask Representation

Number and Decimal representation

Duration Representation

Period Representation

Boolean Representation

Corollary

Corollary #2

Reference Links

Tickets

Page tree

Page History

Versions Compared

Old Version 8

New Version Current

Key

Header

Nullmap

Offset Table

Value area

Field Values

Field Encoding

Integer Field Compression

Integer Representation

UUID Representation

Date Representation

Time Representation

DateTime Representation

Timestamp Representation

Bitmask Representation

Number and Decimal representation

Duration Representation

Period Representation

Boolean Representation

Corollary

Corollary #2

Reference Links

Tickets