ID: IEP-54
Author:
Sponsor:
Created:
Status: DRAFT

Motivation

The way Ignite works with data schemas is inconsistent:

  • The binary protocol creates schemas for anything that is serialized. These schemas are updated implicitly – the user doesn't have any control over them.
  • The SQL engine has its own schema, separate from the binary schema, although SQL runs on top of binary objects. The SQL schema is created and updated explicitly by the user.
  • Caches themselves are basically schemaless – you're allowed to store multiple versions of multiple data types in a single cache.

This creates multiple usability issues:

  • The SQL schema can be inconsistent or even incompatible with the binary schema. If either of them is updated, the other is not affected.
  • SQL can't be used by default. The user has to explicitly create the SQL schema, listing all fields and making sure the list is consistent with the content of the binary objects.
  • Binary schemas are decoupled from caches, so if a cache is destroyed, the binary schema is not removed.
  • Etc.

Description

The general idea is to have a one-to-one mapping between data schemas and caches/tables. There is a single unified schema for every cache, which is applied both to the data storage itself and to SQL.

When a cache is created, it is configured with a corresponding data schema. There must be an API and a tool to see the current version of the schema for any cache, as well as make updates to it. Schema updates are applied dynamically without downtime.

DDL should work on top of this API, providing similar functionality. E.g., a CREATE TABLE invocation translates to the creation of a cache with the schema described in the statement.

Anything stored in a cache/table must be compliant with the current schema. An attempt to store incompatible data should fail.

The binary protocol should be used only as the data storage format. All serialization that happens for communication only should be performed by a different protocol. The data storage format will be coupled with the schemas, while the communication format is independent of them. As a bonus, this will likely allow for multiple optimizations on both sides, as the serialization protocols will become more narrowly purposed.

BinaryObject API should be reworked, as it will no longer represent actual serialized objects. It should be replaced with something like BinaryRecord or DataRecord representing a record in a cache or table. Similar to the current binary objects, records will provide access to individual fields. A record can also be deserialized into a class with any subset of the fields represented in the record.
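
As a purely illustrative sketch (the IEP only suggests the names BinaryRecord or DataRecord, so the interface shape below is an assumption), such a record API could look like:

// Hypothetical record API; method names are illustrative only.
interface DataRecord {
    /** Access to an individual field, as with today's binary objects. */
    <T> T field(String name);

    /** Deserialize the record into a class with any subset of the record's fields. */
    <T> T deserialize(Class<T> targetClass);
}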

Schema Definition API

There are several ways a schema can be defined. The initial entry point to the schema definition is the SchemaBuilder Java API:

TBD

The schema builder calls are transparently mapped to DDL statements, so that all operations possible via the builder are also possible via DDL and vice versa.
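
Purely as an illustration of this correspondence (the actual builder API is TBD above, so every name here is an assumption), a builder call chain and its DDL counterpart could look like:

// Hypothetical builder usage -- the real SchemaBuilder API is TBD.
SchemaBuilder.create("Person")
    .addKeyColumn("id", ColumnType.INT32)
    .addColumn("name", ColumnType.string(32))
    .build();

// Intended DDL equivalent:
// CREATE TABLE Person (id INT PRIMARY KEY, name VARCHAR(32));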

Additionally, we may introduce an API that infers the schema from a key-value pair using class fields and annotations. The inference happens on the call site of the node invoking the table modification operation.

The table schema should be automatically exposed to the table configuration subtree, so that simple schema changes are available via the Ignite CLI and the schema can be defined during table creation via the Ignite CLI.

Data restrictions

The schema-first approach imposes certain natural requirements which are stricter than those of the binary object serialization format:

  • The column type must be one of a predefined set of available 'primitives' (including strings, UUIDs, date & time values)
  • Arbitrary nested objects and collections are not allowed as column values. Nested POJOs should either be inlined into the schema or stored as BLOBs
  • Date & time values should be compressed while preserving natural order, and decompression should be a trivial operation (like applying a bitmask); see the sketch after the type table below

The suggested set of supported built-in data types is listed in the table below:

Type        | Size      | Description
Bitmask(n)  | n/8 bytes | A fixed-length bitmask of n bits
Int8        | 1 byte    | 1-byte signed integer
Uint8       | 1 byte    | 1-byte unsigned integer
Int16       | 2 bytes   | 2-byte signed integer
Uint16      | 2 bytes   | 2-byte unsigned integer
Int32       | 4 bytes   | 4-byte signed integer
Uint32      | 4 bytes   | 4-byte unsigned integer
Int64       | 8 bytes   | 8-byte signed integer
Uint64      | 8 bytes   | 8-byte unsigned integer
Float       | 4 bytes   | 4-byte floating-point number
Double      | 8 bytes   | 8-byte floating-point number
Number([n]) | Variable  | Variable-length number (optionally bound by n bytes in size)
Decimal     | Variable  | Variable-length floating-point number
UUID        | 16 bytes  | UUID
String      | Variable  | A string encoded with a given Charset
Date        | 3 bytes   | A timezone-free date encoded as a year (15 bits), month (4 bits), day (5 bits)
Time        | 4 bytes   | A timezone-free time encoded as padding (5 bits), hour (5 bits), minute (6 bits), second (6 bits), millisecond (10 bits)
Datetime    | 7 bytes   | A timezone-free datetime encoded as (date, time)
Instant     | 8 bytes   | Number of milliseconds since Jan 1, 1970 00:00:00.000 (with no timezone)
BLOB        | Variable  | Variable-size byte array
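
As an illustration of the compression requirement from the Data restrictions section, a minimal sketch (helper names are assumptions, not part of the IEP) of the 3-byte Date encoding above: with the year in the most significant bits, unsigned comparison of encoded values preserves chronological order, and decoding is a couple of shifts and bitmasks.

// Encodes a date into the low 24 bits: year (15 bits), month (4 bits), day (5 bits).
static int encodeDate(int year, int month, int day) {
    return (year << 9) | (month << 5) | day;
}

// Decoding is a trivial bitmask operation, as required above.
static int[] decodeDate(int encoded) {
    return new int[] { encoded >>> 9, (encoded >>> 5) & 0xF, encoded & 0x1F };
}
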
Data Layout

Given a set of user-defined columns, the set is rearranged so that fixed-size columns go first. This sorted set of columns is used to form a tuple. The tuple layout is as follows:

Field                              | Size
Schema version                     | 2 bytes
Flags                              | 1 byte
Key columns hash                   | 4 bytes
Key columns:
  Full size                        | 2 (3?) bytes
  Varsize columns offsets table size | 2 bytes
  Varsize columns offsets table    | Variable (number of non-null non-default varsize columns * 2 (3?) bytes)
  Null-defaults map                | number of columns / 8
  Fixed size values                | Variable
  Variable size values             | Variable
Value columns:
  Full size                        | 2 (3?) bytes
  Varsize columns offsets table size | 2 bytes
  Varsize columns offsets table    | Variable (number of non-null non-default varsize columns * 2 (3?) bytes)
  Null-defaults map                | number of columns / 8
  Fixed size values                | Variable
  Variable size values             | Variable

The flags field is a bitmask with each bit treated as a flag, with the following flags available (from flag 0 being the LSB to flag 7 being MSB):

  • Flag 0: tombstone. If the flag is set, the value chunk is omitted, and the tuple represents a tombstone
  • Flag 1: skip key nullmap. If the flag is set, all values in the key chunk are non-null and non-default, so that the null map for the key chunk is omitted
  • Flag 2: skip value nullmap. If the flag is set, all values in the value chunk are non-null and non-default, so that the null map for the value chunk is omitted
  • Flags 3-7: Reserved for future use
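
For illustration, a minimal sketch of reading the fixed part of the tuple header and testing these flags (the class and field names, as well as the byte order implied by ByteBuffer defaults, are assumptions of the sketch):

import java.nio.ByteBuffer;

class TupleHeader {
    static final int TOMBSTONE = 1;               // flag 0 (LSB)
    static final int SKIP_KEY_NULLMAP = 1 << 1;   // flag 1
    static final int SKIP_VALUE_NULLMAP = 1 << 2; // flag 2

    final short schemaVersion; // 2 bytes
    final byte flags;          // 1 byte
    final int keyHash;         // 4 bytes

    TupleHeader(ByteBuffer buf) {
        schemaVersion = buf.getShort();
        flags = buf.get();
        keyHash = buf.getInt();
    }

    boolean isTombstone()     { return (flags & TOMBSTONE) != 0; }
    boolean hasKeyNullmap()   { return (flags & SKIP_KEY_NULLMAP) == 0; }
    boolean hasValueNullmap() { return (flags & SKIP_VALUE_NULLMAP) == 0; }
}
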
Schema evolution

Unlike the Ignite 2.x approach, where the binary object schema ID is defined by the set of fields present in a binary object, the schema-first approach assigns a monotonically growing identifier to each version of the cache schema. The ordering guarantees should be provided by the underlying metadata storage layer (for example, the current distributed metastorage implementation or a consensus-based metadata storage). The schema identifier should be stored together with the data tuples (but not necessarily with each tuple individually: we can store the schema ID along with a page or a larger chunk of data). The history of schema versions must be stored for long enough to allow upgrading all existing data stored in a given cache.

Given the schema evolution history, a tuple migration from version N-k to version N is a straightforward operation. We identify the fields that were dropped during the last k schema operations and the fields that were added (taking into account default field values) and update the tuple based on these field modifications. Afterward, the updated tuple is written in the schema version N layout format. The tuple upgrade may happen on read with an optional writeback, or on the next update. Additionally, a background tuple upgrade is possible.

Since the tuple key hashcode is inlined into the tuple data for quick key lookups, we require that the set of key columns does not change during schema evolution. In the future, we may remove this restriction, but this will require careful hashcode calculation adjustments, since the hash code value must not change after adding a new column with a default value. Removing a column from the key columns does not seem possible, since it may produce duplicates, and checking for duplicates may require a full scan.

In addition to adding and removing columns, it will be possible to allow column type migrations when the type change is unambiguous (a type upcast, e.g. Int8 → Int16, or a change by means of a certain expression, e.g. Int8 → String using a CAST expression). Type conversions that narrow the column range (e.g. Int16 → Int8) must only be allowed via explicit expressions that allow Ignite to validate that no RangeOutOfBoundsException is possible during the conversion.

For example, consider the following sequence of schema modifications expressed in SQL-like terms:

CREATE TABLE Person (id INT, name VARCHAR(32), lastname VARCHAR(32), taxid INT);
ALTER TABLE Person ADD COLUMN residence VARCHAR(2) DEFAULT 'GB';
ALTER TABLE Person DROP COLUMN lastname, taxid;
ALTER TABLE Person ADD COLUMN lastname VARCHAR(32) DEFAULT 'N/A';

This sequence of modifications will result in the following schema history:

ID | Columns                               | Delta
1  | id, name, lastname, taxid             | N/A
2  | id, name, lastname, taxid, residence  | +residence ("GB")
3  | id, name, residence                   | -lastname, -taxid
4  | id, name, residence, lastname         | +lastname ("N/A")

With this history, upgrading a tuple (1, "John", "Doe", 42) of version 1 to version 4 means erasing the columns lastname and taxid and adding the column residence with default "GB" and the column lastname (the column is brought back) with default "N/A", resulting in the tuple (1, "John", "GB", "N/A").
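
A minimal sketch of replaying these deltas to upgrade a version-1 tuple to version 4 (representing the tuple as a column-name-to-value map is an assumption for illustration):

import java.util.LinkedHashMap;
import java.util.Map;

static Map<String, Object> upgradeV1ToV4(Map<String, Object> v1) {
    Map<String, Object> tuple = new LinkedHashMap<>(v1);
    tuple.put("residence", "GB"); // 1 -> 2: +residence ("GB")
    tuple.remove("lastname");     // 2 -> 3: -lastname
    tuple.remove("taxid");        // 2 -> 3: -taxid
    tuple.put("lastname", "N/A"); // 3 -> 4: +lastname ("N/A")
    return tuple;                 // (1, "John", "GB", "N/A")
}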

Class-agnostic schema mapping

It's clear that, given a fixed schema, we can generate an infinite number of classes that match the columns of this schema. This observation can be used to simplify ORM for end users. For the APIs which return Java objects, the mapping from schema columns to object fields can be constructed dynamically, allowing a single tuple to be deserialized into instances of different classes.

For example, let's say we have a schema PERSON (id INT, name VARCHAR(32), lastname VARCHAR(32), residence VARCHAR(2), taxid INT). Each tuple of this schema can be deserialized into the following classes:

class Person {
    int id;
    String name;
    String lastName;
}
class RichPerson {
    int id;
    String name;
    String lastName;
    String residence;
    int taxId;
}

For each table, a user may specify a default Java class binding, and for each individual operation a user may provide a target class for deserialization:

Person p = table.get(key, Person.class);

Given the set of fields in the target class, Ignite may optimize the amount of data sent over the network by skipping fields that would be ignored during deserialization.
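
A rough reflection-based sketch of how such a class-agnostic mapping could work (not the actual Ignite implementation; name matching is simplified and annotation or case handling is omitted):

import java.lang.reflect.Field;
import java.util.Map;

static <T> T mapTuple(Map<String, Object> tuple, Class<T> target) throws ReflectiveOperationException {
    T obj = target.getDeclaredConstructor().newInstance();
    for (Field f : target.getDeclaredFields()) {
        Object value = tuple.get(f.getName()); // columns absent from the class are simply skipped
        if (value != null) {
            f.setAccessible(true);
            f.set(obj, value);
        }
    }
    return obj;
}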

Type mapping

Ignite will provide out-of-the-box mapping from standard platform types (Java, C#, C++) to the built-in primitives. A user will be able to alter this mapping using some external mechanism (e.g. annotations to map long values to Number). The standard mapping is listed in the table below:

Built-in    | Java                                | .NET                   | C++
Bitmask(n)  | BitSet                              | byte[] (?)             |
Int8        | byte (Byte if nullable)             | sbyte                  |
Uint8       | short with range constraints        | byte                   |
Int16       | short (Short if nullable)           | short                  |
Uint16      | int with range constraints          | ushort                 |
Int32       | int (Integer if nullable)           | int                    |
Uint32      | long with range constraints         | uint                   |
Int64       | long (Long if nullable)             | long                   |
Uint64      | BigInteger with range constraints   | ulong                  |
Float       | float (Float if nullable)           | float                  |
Double      | double (Double if nullable)         | double                 |
Number([n]) | BigInteger                          | BigInteger             |
Decimal     | BigDecimal                          | decimal                |
UUID        | UUID                                | Guid                   |
String      | String                              | string                 |
Date        | LocalDate                           | NodaTime.LocalDate     |
Time        | LocalTime                           | NodaTime.LocalTime     |
Datetime    | LocalDateTime                       | NodaTime.LocalDateTime |
Instant     | Date (Instant?)                     | NodaTime.Instant       |
BLOB        | byte[]                              | byte[]                 |

Java has no native support for unsigned types. We can still introduce an 'unsigned' flag in the schema type, or separate binary type codes, and allow mapping to the closest type of a wider range, e.g. map Uint8 → short and recheck constraints during serialization.

If one tries to serialize an object with a 'short' value outside of the Uint8 range, serialization will fail with an exception (ColumnValueIsOutOfRangeException).
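
A sketch of the implied range check when mapping a Java short to Uint8 (the exception type follows the name suggested above and is illustrative only):

static byte toUint8(short value) {
    if (value < 0 || value > 255)
        throw new IllegalArgumentException("Uint8 out of range: " + value); // e.g. ColumnValueIsOutOfRangeException
    return (byte) value; // stored byte is reinterpreted as unsigned on read
}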

Dynamic schema expansion (flexible schemas)

One of the important benefits of binary objects was the ability to store objects with different sets of fields in a single cache. We can accommodate very similar behavior in the schema-first approach.

When an object is inserted into a table, we attempt to 'fit' the object fields to the schema columns. If the Java object has some extra fields which are not present in the current schema, the schema is automatically updated to store the extra fields present in the object.

On the other hand, if an object has fewer fields than the current schema, the schema is not updated automatically (such a scenario usually means that the update is executed from an outdated client which has not yet received the proper object class version). In other words, columns are never dropped during automatic schema evolution; a column can only be dropped by an explicit user command.
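
A rough sketch of this 'fit' step (pseudo-API; the in-place schema extension stands in for a real schema update operation and is an assumption for illustration):

import java.util.Map;
import java.util.Set;

static void fit(Map<String, Object> objectFields, Set<String> schemaColumns) {
    for (String field : objectFields.keySet()) {
        if (!schemaColumns.contains(field))
            schemaColumns.add(field); // extra field: extend the schema (e.g. ALTER TABLE ... ADD COLUMN)
    }
    // Schema columns missing from the object require no action:
    // the schema is never shrunk automatically.
}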

Risks and Assumptions

n/a

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/IEP-54-Schema-first-approach-for-3-0-td49017.html

Reference Links

n/a

Tickets

n/a
