IDIEP-110

Author:
Sponsor:
Created:
Status: DRAFT

Motivation

IEP-98: Schema synchronization defines the design of the schema synchronization framework.

This document defines how DDL operations interact with DML operations (namely, how DML operations after the DDL are handled in transactions started before the DDL).

The document also describes how we deal with basic schema changes (these are changes that have nothing to do with indices and don’t require existing data validation) under the Schema Synchronization framework. Compatibility issues are covered for basic schema changes.

Description

Basic requirements

This is a dependent design: some requirements are defined in the Schema synchronization design. As a highlight, we should avoid blocking both transaction processing (due to DDLs) and DDLs (due to running transactions); if we must choose between blocking and aborting some transactions, aborting is the preferred approach.

Here we add the following requirements:

  1. Handling of a transborder transaction (a transaction that starts on one schema but commits on another; see below) should not produce any weird (from the point of view of the user) behavior. An example of such weirdness is a transaction that runs on more than one schema during its life, or that looks (to other transactions) as if it did so.
  2. After a CREATE TABLE, the created table must be available for any transaction started before the CREATE TABLE was executed.
  3. If an ALTER TABLE happens after a transaction was started and only later (after the DDL) the table gets enlisted in the transaction, the transaction should not be aborted due to the schema change.

Overall flow

  1. A client comes and tries to execute an operation in a transaction (it sends its schema version clientSV in each tuple it sends in the request)
  2. The tx coordinator gets the request, takes baseTs=tableEnlistTs(table, tx), then resolves the schema version tableEnlistSV corresponding to that timestamp. If clientSV does not match tableEnlistSV, the coordinator returns an error to the client (the error contains tableEnlistSV); on getting the error, the client retries its request using tableEnlistSV. The coordinator caches tableEnlistSV for each table in each transaction (a sketch of this check is given after the list).
  3. If the client request validation succeeds, the coordinator sends tableEnlistSV to the primary (either implicitly [in binary tuples of the request] or explicitly)
  4. (As a result, for the same table in the same tx, there will be the same tableEnlistSV)
  5. The primary uses tableEnlistSV as I (initial schema) when validating schema compatibility for each transaction read/write operation
  6. The coordinator uses tableEnlistSV as I (initial schema) when validating schema compatibility during commit
  7. If the transaction is aborted due to schema validation failure (items 5 and 6), an error code is returned to the client that says that the transaction should be retried
  8. If the client gets a retriable error code, it retries the transaction
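
The following is a minimal, hypothetical sketch (in Java) of the coordinator-side check from steps 2-4. The names used here (TxCoordinator, SchemaVersions, SchemaVersionMismatchException, TxTableKey) are illustrative assumptions, not the actual Ignite API; timestamps are modeled as plain longs instead of hybrid timestamps.

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    class TxCoordinator {
        /** Resolves the schema version of a table that is active at a given timestamp. */
        interface SchemaVersions {
            int versionAt(int tableId, long ts);
        }

        /** Key for caching tableEnlistSV per (transaction, table) pair. */
        record TxTableKey(UUID txId, int tableId) {}

        /** Returned to the client so that it can retry the request using the expected version. */
        static class SchemaVersionMismatchException extends RuntimeException {
            final int expectedVersion;

            SchemaVersionMismatchException(int expectedVersion) {
                super("Expected schema version " + expectedVersion);
                this.expectedVersion = expectedVersion;
            }
        }

        private final SchemaVersions schemaVersions;

        /** Cached tableEnlistSV for each table in each transaction (step 2). */
        private final Map<TxTableKey, Integer> enlistSchemaVersions = new ConcurrentHashMap<>();

        TxCoordinator(SchemaVersions schemaVersions) {
            this.schemaVersions = schemaVersions;
        }

        /**
         * Validates clientSV against tableEnlistSV. baseTs is tableEnlistTs(table, tx); the schema
         * version resolved at that timestamp stays fixed for this (tx, table) pair, so the same
         * tableEnlistSV is used for all operations on this table in this transaction (step 4).
         */
        int validateAndGetEnlistVersion(UUID txId, int tableId, int clientSV, long tableEnlistTs) {
            int tableEnlistSV = enlistSchemaVersions.computeIfAbsent(
                    new TxTableKey(txId, tableId),
                    key -> schemaVersions.versionAt(tableId, tableEnlistTs));

            if (clientSV != tableEnlistSV) {
                // Step 2: the client gets the expected version back and retries the request with it.
                throw new SchemaVersionMismatchException(tableEnlistSV);
            }

            return tableEnlistSV; // Step 3: sent to the primary along with the request.
        }
    }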

The diagram illustrates the following:

  • Client schema version validation and refresh
  • Successful writes
  • Schema validation at commit
  • Retry of an aborted transaction

Weirdness and DDL linearization wrt DML

We are not going to implement transactional DDL for now, and we don’t want to block transactions while executing DDL. This means that transborder transactions (TB) are possible: transactions that start on one schema but commit on another one.

There are 3 options:

  1. We allow a TB to read/write on the new schema (if the change is compatible) as soon as a DDL executes (so the TB sees the effect of the DDL right away wrt reads/writes). The effect of such TBs could look weird from the POV of transactions reading tuples written by the TB: even though they seem to appear at the same time (commitTs), some of them would adhere to one set of constraints (corresponding to the old schema), but some would adhere to another set of constraints (corresponding to the new schema).
  2. We allow a TB to read/write after a DDL happens (if it’s compatible), but it still reads/writes on the old schema. This might look weird from the POV of the user running the TB: the DDL is not transactional, but the effect of a DDL is not kicking in right away.
  3. We don’t allow a TB to read/write after a DDL happens (so it can only commit [if the schema change is compatible])

We choose option 3 to avoid the weirdness of the first two options at the expense of a slightly increased probability of rollbacks. To make sure it holds, we require that a DDL operation is linearized wrt any DML operation (in other words, if we treat a DDL as a ‘transaction’, it is strictly serializable). This means that a DDL operation immediately affects any transactions that started before it, and new transactions will start using the new schema. No transaction can have reads/writes run both on the old schema and the new schema AND have the possibility to detect that they were executed on different schemas.

When schema changes are validated

  1. When committing, we validate that I (the initial schema at tableEnlistTs) is compatible with C (the commit schema at commitTs)
  2. At each read/write operation in a transaction (each operation has its own opTs): if the schema at opTs is different from I, we fail the operation and abort the transaction (we do so to make it impossible for a transaction to observe that it runs on a different schema). A sketch of this check and of the commit-time check (item 1) is given below.
  3. When a schema change happens, we find all transactions that touched this table (and started on an earlier schema) and mark all transactions for which the schema change is incompatible as rollback-only. This is not needed for correctness; it is an optimization to make sure we abort transactions as soon as possible if they will never be able to commit.

The result is that the ability to commit transborder transactions for forward-compatible schema changes is a minor optimization. It might have a noticeable effect only for very short transactions (implicit ones and explicit transactions having just a couple of operations each). This optimization might be postponed.
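
Below is a minimal sketch, under assumed names (SchemaValidation, SchemaRegistry), of the two validation points from the list above: per-operation validation on the primary (item 2) and commit-time validation on the coordinator (item 1). It is an illustration of the rules, not the actual implementation.

    class SchemaValidation {
        /** Assumed facade over the Catalog: resolves schema versions and compatibility. */
        interface SchemaRegistry {
            int versionAt(int tableId, long ts);

            boolean isForwardCompatible(int tableId, int fromVersion, int toVersion);
        }

        private final SchemaRegistry registry;

        SchemaValidation(SchemaRegistry registry) {
            this.registry = registry;
        }

        /** Item 2: on each read/write (at opTs), the schema must still be I (tableEnlistSV). */
        void validateOperation(int tableId, int tableEnlistSV, long opTs) {
            if (registry.versionAt(tableId, opTs) != tableEnlistSV) {
                // The transaction must never observe that it runs on a different schema.
                throw new IllegalStateException("Schema changed concurrently, aborting transaction");
            }
        }

        /** Item 1: at commit (commitTs), I must be forward-compatible with C. */
        void validateCommit(int tableId, int tableEnlistSV, long commitTs) {
            int commitVersion = registry.versionAt(tableId, commitTs);

            if (!registry.isForwardCompatible(tableId, tableEnlistSV, commitVersion)) {
                throw new IllegalStateException("Incompatible schema change, aborting transaction");
            }
        }
    }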

Schema compatibility

Object types

All objects in a database are either top-level (defined directly in the Catalog) or belong to some other object. Please note that the notion of being a top-level object or a dependent object is specific to this design and might be different from usual considerations (for instance, here we consider a table to be a top-level object, even though one might say that a table belongs to a database).

Top-level:

  • Table
  • View
  • Distribution zone

Belonging to a table:

  • Index
  • Column
  • Constraint

General schema compatibility

When a schema change happens, the change is compatible from the point of view of any transaction that has not enlisted the top-level object of the change in question. So, for a transaction that only touches table A, a change in table B or in any of the indices of table B is compatible in both directions.

When a schema change from I to C actually consists of a few steps (for instance, the table was altered a few times since I), each of the steps (which is in turn a schema change) is validated in sequence. If any of them is not compatible, the whole change is not compatible. It works the same for forward and backward compatibility.

When a schema change is compound (that is, it consists of a few simple schema changes), it is represented for validation purposes as a sequence of these simple changes and then each simple change is validated, in order. The compound change is compatible if and only if all the simple changes are compatible.

An example of such a compound change is a mixed ALTER TABLE statement like the following: ALTER TABLE t1 DROP COLUMN c1, ADD COLUMN c2 VARCHAR(20), which will be represented as ‘drop column c1, then add column c2’.

Please note that columns are identified by their IDs, not names, so after a column is dropped and readded with the same name, the result is actually a column different from the original column.

(It seems that for GA we are only going to support the ability to drop a few columns at once, so for GA the compound schema changes support will be very limited.)
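
As an illustration of the step-by-step validation described above, here is a hypothetical sketch in which a (possibly compound, possibly multi-version) change from I to C is represented as an ordered list of simple changes; SimpleChange is an assumed abstraction, not an existing class.

    import java.util.List;

    final class ChangeValidation {
        /** A simple schema change, e.g. 'drop column c1' or 'add column c2 VARCHAR(20)'. */
        interface SimpleChange {
            boolean isForwardCompatible();
        }

        private ChangeValidation() {
        }

        /**
         * Validates the change I -> C represented as an ordered sequence of simple changes
         * (one or more per intermediate schema version). The whole change is compatible
         * if and only if every step is compatible.
         */
        static boolean isForwardCompatible(List<SimpleChange> steps) {
            for (SimpleChange step : steps) {
                if (!step.isForwardCompatible()) {
                    return false; // A single incompatible step makes the whole change incompatible.
                }
            }

            return true;
        }
    }

For example, ALTER TABLE t1 DROP COLUMN c1, ADD COLUMN c2 VARCHAR(20) would be passed as the two-element list [drop c1, add c2] and rejected, because the first step (drop column) is forward-incompatible.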

Initial schema I is forward-compatible with commit schema C if all of the following requirements are met:

  1. A tuple written with schema I is allowed to exist under schema C (i.e. it does not violate any constraints on the new schema)
  2. Let’s suppose that tuple X was written on I. When reading it on C, we apply an upgrading conversion. After such a conversion, the resulting tuple should not contain less information than X (this makes DROP COLUMN incompatible).
  3. Not a single transaction reading X on C should be able to detect that X was actually written on I (this rules out modifications of default values)

Backward compatibility

Imagine the following scenario.

  1. A transaction T1 is started on a schema I
  2. A schema change happens, so the schema changes to C
  3. Another transaction T2 writes a tuple using schema C and commits on C
  4. T1 reads that tuple

If a schema change at step 2 did not happen, such a scenario would be perfectly valid according to the Transaction protocol. But, as the schema change has happened, we need to represent it as a schema change transaction Ts (a virtual transaction, just for the sake of argument, because the current design does not represent schema changes as ‘real’ transactions). Let’s look at the serialization ordering:

  1. T1 < Ts (because T1 runs on I completely and sees no schema change)
  2. Ts < T2 (because T2 runs on C)
  3. T2 < T1 (because T1 reads a tuple written by T2)

We have a cycle, so serializability is broken and the three transactions cannot all commit.

The notion of backward compatibility is about the cases where we pretend that no schema update has happened (because whether the transaction runs on the old I or the new C does not affect the effects of the tx). For example, if the schema change is an addition of a nullable column, then, when T1 reads the row from T2, the row gets automatically downgraded to I, so T1 has no chance of noticing that something has changed; such a change could be called backward compatible as it would allow us to pretend that no schema change has happened, so there is no Ts, serializability is not broken and T1 can still commit.

Currently, the described scenario still seems suspicious, so, for the sake of safety, we always forbid T1 from committing (so, in our terms, any schema change is backward incompatible).

Note that, when determining forward compatibility, only the ‘type’ of the change is considered and not the actual writes that were made in the transaction (to avoid scanning all the transaction’s writes to the table during the commit prepare phase). In contrast, if we supported backward compatibility, we would check it for each read, so the rules would be able to consider not just the type of the schema change, but also the tuple’s contents.

Basic schema changes

Table-related changes

Tables

(Note: there is a table at the end of the document summarizing the table-related rules)

  1. Creation of a table is forward-compatible
  2. Drop of a table is not forward-compatible because the table does not exist in the commit schema
  3. Rename of a table is not forward-compatible because the table identified with the original name (seen by the transaction) does not exist in the commit schema

Checking the existence of a table when reading/writing

When a table is dropped, it does not make sense to pretend that it’s still there until commit and only abort the transaction at commit time. Each RW transaction operation has a timestamp associated with it (this timestamp moves the corresponding partition forward). Before an operation gets executed, the schema must be obtained for the operation timestamp; if the table does not exist in this schema, the operation should fail and the transaction should be aborted.
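
A minimal sketch of this check, assuming a simplified Catalog interface (the real API may differ): the schema is resolved at opTs before the operation executes, and a missing table fails the operation and aborts the transaction.

    class TableExistenceCheck {
        /** Assumed view of the Catalog: tells whether a table exists at a given timestamp. */
        interface Catalog {
            boolean tableExistsAt(int tableId, long ts);
        }

        private final Catalog catalog;

        TableExistenceCheck(Catalog catalog) {
            this.catalog = catalog;
        }

        /** Called before each RW operation with the operation timestamp opTs. */
        void ensureTableExists(int tableId, long opTs) {
            if (!catalog.tableExistsAt(tableId, opTs)) {
                // Fail right away instead of pretending the table is still there until commit.
                throw new IllegalStateException("Table " + tableId + " was dropped, aborting transaction");
            }
        }
    }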

Columns

Columns are identified by their IDs, not names, for schema compatibility calculations. This means that, if a column X exists, then gets dropped and then column X gets added again, the second X column has a different ID, so it’s a column different from the original X.

  1. Addition of a column with a known default value (it is nullable or has a default). Forward compatible (because absence of a value is not a problem: the column is nullable/has a default).
  2. Removal of a column. Forward incompatible (because, if we upgrade an old tuple to the new schema, we get a tuple that has less information).
  3. Addition of a column that is not null and has no default value. Not allowed by SQL.
  4. Renaming of a column is forward compatible (because internally we identify columns by IDs, not by names).

Column nullability

  1. Making a non-null column nullable. Always forward-compatible.

Default values

  1. Any manipulation with default values (adding/renaming/modifying a default value) is forward-incompatible.

Column type

  1. Changing a column type when there is an exact conversion from the old type to the new one. These are:
    1. Widening conversions between integral types (like int->long)
    2. Widening conversions between floating-point types (float->double)
    3. Widening conversions between date-time types (increasing precision)
    4. Widening conversions between decimal types (increasing precision)
    5. Increasing allowed length of VARCHAR
    6. Increasing allowed length of VARBINARY
    7. Conversions of anything to VARCHAR with enough capacity

Always forward-compatible (because the value written in a narrow type in the old schema can always be read in a wide type in the new schema).
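
The sketch below illustrates the widening rules above with a deliberately simplified type model (date-time types and the ‘anything to VARCHAR with enough capacity’ rule are omitted for brevity); the type names and the ColumnType record are assumptions for the example, not the actual Catalog types.

    final class TypeCompatibility {
        enum Family { INT8, INT16, INT32, INT64, FLOAT, DOUBLE, DECIMAL, VARCHAR, VARBINARY }

        /** Simplified column type: a family plus a single precision/length parameter. */
        record ColumnType(Family family, int precisionOrLength) {}

        private TypeCompatibility() {
        }

        /**
         * True if a value written with type 'from' can always be read exactly with type 'to'.
         * Only called for actual type changes (an unchanged type never reaches this check).
         */
        static boolean isExactConversion(ColumnType from, ColumnType to) {
            // Widening conversions between integral types (e.g. INT32 -> INT64).
            if (isIntegral(from.family()) && isIntegral(to.family())) {
                return to.family().ordinal() >= from.family().ordinal();
            }

            // Widening conversion between floating-point types (FLOAT -> DOUBLE).
            if (from.family() == Family.FLOAT && to.family() == Family.DOUBLE) {
                return true;
            }

            // Increasing precision of DECIMAL or allowed length of VARCHAR/VARBINARY.
            if (from.family() == to.family()
                    && (from.family() == Family.DECIMAL
                            || from.family() == Family.VARCHAR
                            || from.family() == Family.VARBINARY)) {
                return to.precisionOrLength() >= from.precisionOrLength();
            }

            return false;
        }

        private static boolean isIntegral(Family family) {
            return family == Family.INT8 || family == Family.INT16
                    || family == Family.INT32 || family == Family.INT64;
        }
    }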

Constraints

Dropping any constraint is forward-compatible.

Creation of constraints is planned to be defined in another IEP.

Combining type, nullability and default value changes

ALTER TABLE syntax allows changing column type, nullability and default value all at once.

When validating any table schema change, the rules above are applied in order; every rule that matches the change triggers the corresponding compatibility validation. As a result, a complex change (like INT NOT NULL -> LONG DEFAULT 42) triggers several rules, and all of them are applied, in order.
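
A hypothetical sketch of how such a combined alteration could be decomposed into simple changes that are then validated in order by the rules above; ColumnDefinition and SimpleChange are illustrative names, and the default value is modeled as a plain Object for brevity.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Objects;

    final class CombinedColumnChange {
        record ColumnDefinition(String type, boolean nullable, Object defaultValue) {}

        enum SimpleChange { CHANGE_TYPE, CHANGE_NULLABILITY, CHANGE_DEFAULT }

        private CombinedColumnChange() {
        }

        /**
         * Splits a combined change (e.g. INT NOT NULL -> LONG DEFAULT 42) into the list
         * [CHANGE_TYPE, CHANGE_NULLABILITY, CHANGE_DEFAULT]; each matched rule is then
         * validated, in this order.
         */
        static List<SimpleChange> decompose(ColumnDefinition before, ColumnDefinition after) {
            List<SimpleChange> changes = new ArrayList<>();

            if (!before.type().equals(after.type())) {
                changes.add(SimpleChange.CHANGE_TYPE);
            }
            if (before.nullable() != after.nullable()) {
                changes.add(SimpleChange.CHANGE_NULLABILITY);
            }
            if (!Objects.equals(before.defaultValue(), after.defaultValue())) {
                changes.add(SimpleChange.CHANGE_DEFAULT);
            }

            return changes;
        }
    }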

Index changes

This is out of scope of this design and planned for another IEP.

View changes

CREATE/DROP/RENAME are analogous to the corresponding table changes.

Distribution zone changes

Creation of a distribution zone precedes creation of a table, so no special handling is required.

Drop of a distribution zone follows drop of a table, so, again, no special handling is required.

Alteration of a distribution zone that does not change the partition count does not affect the tuples written, so no special handling is needed here either. Changing the partition count is not supported for now.

Quirks

The design has a quirk. Namely, the following scenario is possible:

  1. The client comes with the first operation on table A in transaction tx
  2. The coordinator takes its HLC.now() as tableEnlistTs and resolves the corresponding schema version as tableEnlistSV, then it sends a request to a primary
  3. A schema change on A activates at ddlTs (ddlTs > tableEnlistTs)
  4. The primary takes ITS OWN HLC.now() as opTs (opTs > ddlTs) and validates the schema version corresponding to opTs against tableEnlistSV.

The validation will fail because the schema version corresponding to opTs is different from tableEnlistSV, even though, logically, tableEnlistTs corresponds to the very same operation.

This seems to be acceptable as:

  1. The schema change happened practically simultaneously with the operation
  2. The client would just retry the transaction (it should be ready to retry anyway)
  3. Such cases should be extremely rare

Possible simplifications

Aborting a transaction that enlists a table that was ALTERed after the transaction started, but before it was enlisted

One of the basic requirements sounds like this:

If an ALTER TABLE happens after a transaction was started and only later (after the DDL) the table gets enlisted in the transaction, the transaction should not be aborted due to the schema change.

This forces us to maintain tableEnlistTs for each enlisted table in each transaction both on the coordinator and primaries; this complicates the code.

A possible simplification is to give this requirement up and just abort a transaction that tries to enlist a table that was ALTERed after the transaction was started. This would allow using max(beginTs, tableCreationTs) as baseTs instead of tableEnlistTs; both beginTs and tableCreationTs are already available.

The problem here is that a table might be renamed. If we resolve a table by the same name ‘A’ before and after the rename, we might get two different tables. To still be consistent with how we work with tables in the same transaction, we still have to track either tableEnlistTs or a mapping from table names to IDs on the coordinator. On primaries, this is not needed (they are chosen by table ID).

The suggestion is to start the implementation without the mentioned requirement and only support it later.
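
A tiny sketch of the simplified baseTs computation, assuming the simplification above is taken (timestamps are modeled as plain longs):

    final class SimplifiedBaseTs {
        private SimplifiedBaseTs() {
        }

        /** Instead of tracking tableEnlistTs per (tx, table), both inputs are already available. */
        static long baseTs(long beginTs, long tableCreationTs) {
            return Math.max(beginTs, tableCreationTs);
        }
    }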

Postponed optimizations

The following possibilities were evaluated, but postponed for a possible later implementation:

  1. We don’t allow any read/write operation in a transaction after an ALTER TABLE happened on an enlisted table, even for white-listed compatible DDLs (like ADD COLUMN or dropping a constraint), to make sure that even a transborder transaction cannot see any weirdness due to working on 2 schemas. Such weirdness can only be observed by looking at table metadata (this way, the transaction might find out that a new column has appeared), because the client-side schema is fixed for the duration of the transaction due to the validation on the coordinator. We might consider such a level of weirdness negligible and allow transborder transactions to read/write even after a compatible DDL.
  2. For now, we consider an addition of a NOT NULL column without a default value as impossible, but we could allow it for empty tables. This requires an additional design.
  3. For now, we treat RENAME TABLE as incompatible. This is simple and consistent, but it will cause transborder transactions touching the renamed table to be aborted. It is possible to use a two-stage mechanism that keeps the old name as an alias along with the new one for some time, to allow the older transactions to finish.
  4. For safety, we consider all changes as backward incompatible, but it might be possible to make some of them backward compatible later to reduce the abort rate due to schema changes.

Risks and Assumptions

N/A

Discussion Links

N/A

Reference Links

  1. IEP-91: Transaction Protocol
  2. IEP-98: Schema synchronization
  3. IEP-108: Change column type

Tickets

