| ID | IEP-98 |
| Author | |
| Sponsor | |
| Created | |
| Status | DRAFT |
Consider a table whose schema is altered just before a data transaction starts. Some nodes might have already applied the new table definition, while others have not. As a result, when the transaction is committed, it might be committed with different versions of the table schema on different nodes (or on some nodes the table might already be considered dropped, while on others not), which breaks consistency.
To ensure consistency, all participant nodes must use the same version of the table schema when committing a transaction.
Additionally, it is imperative that a client can "read their own schema updates": if a client initiates a schema update and receives confirmation that the update has been applied, the client must observe the effect of that update on any node it accesses.
A schema update U for an object Q has the form "since the moment Tu, object Q has schema S". Tu is the activation moment of the new schema version S; in other words, S activates at Tu. Schema version S is effective from its activation moment Tu until the activation moment Tu' of the next schema version S' of the same object (including the start and excluding the end of the interval), or indefinitely if S is currently the most recent version of the schema; so the effective interval is either [Tu, Tu') or [Tu, +inf).
A schema update U is visible to a node N at moment Top if N received a message M containing this update at a moment T ≤ Top. Schema update messages are distributed via the Meta-Storage RAFT group (consuming a message means applying it to a state machine), so the update messages are delivered in order (hence, if a message is not delivered, none of the messages that follow it will be delivered either) and are not duplicated. This is due to the fact that there is exactly one Meta-Storage RAFT state machine on each node.
We want all nodes to give the same answer, for any given timestamp Tq, to the question 'is schema update U visible and effective at the moment Tq?'.
This can be reformulated: a schema update saying that a new schema version activates at Tu splits all transactions into two sets: those that commit strictly before Tu (Tc < Tu) and hence operate on the old schema version, and those that commit at or after Tu (Tc ≥ Tu) and hence operate on the new schema version, regardless of which nodes participate in the corresponding transaction.
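To make the interval semantics concrete, here is a minimal, self-contained Java sketch of a per-object schema history resolving the version effective at a given commit timestamp. All names are hypothetical and timestamps are simplified to plain longs; this is an illustration of the rule above, not the actual Ignite API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical per-object schema history; timestamps are simplified to longs. */
class SchemaHistory {
    /** Activation moment Tu -> schema version S, kept sorted for "floor" lookups. */
    private final ConcurrentSkipListMap<Long, SchemaVersion> versions = new ConcurrentSkipListMap<>();

    /** Records update U: "since the moment activationTs, the object has this schema version". */
    void put(long activationTs, SchemaVersion version) {
        versions.put(activationTs, version);
    }

    /**
     * Resolves the version effective at commitTs: the version with the greatest
     * activation moment Tu such that Tu <= commitTs, i.e. commitTs falls into
     * [Tu, Tu') or [Tu, +inf). A transaction with Tc < Tu thus gets the old
     * version and one with Tc >= Tu the new one, on every node alike.
     */
    SchemaVersion effectiveAt(long commitTs) {
        Map.Entry<Long, SchemaVersion> entry = versions.floorEntry(commitTs);
        if (entry == null) {
            throw new IllegalStateException("No schema version is active at " + commitTs);
        }
        return entry.getValue();
    }
}

record SchemaVersion(int number) {}
```

Since this lookup is a pure function of the recorded activation moments, agreement between nodes reduces to ensuring that every node has received all relevant updates before serving the lookup, which the mechanisms described below provide.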
An important thing here is the clock we use to define the timestamps.
As the nodes' clocks do not have to be synchronized, what matters is each node's local resolution of 'has this timestamp been reached by the node's clock or not'; the decision is not made relative to some absolute clock, of which each node has only an approximate idea.
Let’s call the property defined above the Atomicity Property (AP).
The following property makes AP hold: no node may receive a message about an update U(Tu, S) after it has already fetched (and used to handle some transaction) a schema for a moment Top ≥ Tu.
To maintain real-time atomicity, we would need to coordinate all cluster nodes on each schema update, making sure every node receives the update before any node can commit a transaction that uses it; a lagging node would thus make all nodes wait, processing no transactions. Alternatively, if a schema update might not be acknowledged by all nodes in a timely fashion, the update could be held in an inactive state for an indeterminate period of time, creating a form of unavailability of the whole cluster.
That's why we separate the timestamps assigned to the events (such as schema version activations and schema version fetches) from the actual moments when the events tagged with these timestamps are processed on each node. In particular, a schema update installed at moment T is tagged with the activation moment T + DD, where DD is a delay duration that gives the update time to propagate to the nodes.
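As a small illustration (again with hypothetical names and simplified long timestamps), the decoupling means that the answer to 'is update U active at Tq?' is a pure function of the activation tag carried by the update, not of the local moment at which a node happened to apply the update message:

```java
/** Hypothetical helper showing the event-timestamp/processing-moment decoupling. */
final class ActivationTag {
    static final long DD_MILLIS = 500; // illustrative delay duration DD

    /** An update installed at installTs is tagged to activate at installTs + DD. */
    static long activationTs(long installTs) {
        return installTs + DD_MILLIS;
    }

    /**
     * Every node evaluates this pure function of the tag and the query
     * timestamp, so all nodes agree regardless of when each of them
     * actually applied the update message locally.
     */
    static boolean isActiveAt(long activationTs, long tq) {
        return activationTs <= tq;
    }
}
```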
[Diagram: a simplified flow of information on a schema update; low-level details (timestamp arithmetic used for optimization) are not shown.]
Even on the Meta-Storage leader, a just-installed schema update is not visible right away, because it activates only after DD passes. This might confuse a user who has just issued an ALTER TABLE query, because 'read your own writes' is violated. This can be alleviated by completing the future used to install the schema update no earlier than DD passes; however, if a client fails before the wait elapses and then reconnects, it might still expect to see the update.
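A minimal sketch of the mitigation, with hypothetical names (DdlHandler, DD_MILLIS); a real implementation would derive the wait from the cluster's clock machinery rather than from a naive scheduled delay:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical DDL entry point illustrating the "read your own schema updates" fix. */
class DdlHandler {
    private static final long DD_MILLIS = 500; // illustrative delay duration DD

    CompletableFuture<Void> alterTable(String ddl) {
        return installIntoMetaStorage(ddl)
                // Hold the client's future until DD has passed since installation,
                // i.e. until the new schema version has activated.
                .thenCompose(installTs -> delay(DD_MILLIS));
    }

    /** Placeholder for the Meta-Storage write; completes with the install timestamp. */
    private CompletableFuture<Long> installIntoMetaStorage(String ddl) {
        return CompletableFuture.completedFuture(System.currentTimeMillis());
    }

    private static CompletableFuture<Void> delay(long millis) {
        return CompletableFuture.runAsync(() -> { },
                CompletableFuture.delayedExecutor(millis, TimeUnit.MILLISECONDS));
    }
}
```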
A primary replica needs to wait for safeTime to reach Top − DD before processing a transaction at a reference timestamp Top: an update that activates at Tu ≤ Top was installed at Tu − DD ≤ Top − DD, so once safeTime reaches Top − DD, the replica has seen every update relevant to Top. If the node does not receive any Meta-Storage updates (i.e., it lags behind the Meta-Storage leader), it cannot proceed.
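The wait can be sketched as follows (hypothetical SafeTimeTracker; it reuses the SchemaHistory sketch from above). If the watermark never advances, the returned future simply never completes, which is exactly the 'cannot proceed' behavior described above:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical watermark of the Meta-Storage safe time as applied locally. */
class SafeTimeTracker {
    private final ConcurrentSkipListMap<Long, CompletableFuture<Void>> waiters =
            new ConcurrentSkipListMap<>();

    private long safeTime;

    /** Completes once the locally applied safe time reaches ts. */
    synchronized CompletableFuture<Void> waitFor(long ts) {
        if (safeTime >= ts) {
            return CompletableFuture.completedFuture(null);
        }
        return waiters.computeIfAbsent(ts, k -> new CompletableFuture<>());
    }

    /** Called as Meta-Storage updates (and safe-time ticks) are applied locally. */
    synchronized void advance(long ts) {
        safeTime = Math.max(safeTime, ts);
        var reached = waiters.headMap(safeTime, true);
        reached.values().forEach(f -> f.complete(null));
        reached.clear();
    }
}

class SchemaResolver {
    private static final long DD_MILLIS = 500; // illustrative delay duration DD

    /** Schema lookup for a transaction operating at reference timestamp top. */
    CompletableFuture<SchemaVersion> schemaForTx(SchemaHistory history, SafeTimeTracker safeTime, long top) {
        // An update activating at Tu <= Top was installed at Tu - DD <= Top - DD,
        // so once safeTime >= Top - DD this node has seen every relevant update.
        return safeTime.waitFor(top - DD_MILLIS)
                .thenApply(v -> history.effectiveAt(top));
    }
}
```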
Note that if a primary P lags, then transactions where P does not participate are not affected.
An unpleasant consequence is that a node lagging too far behind the Meta-Storage freezes its transaction processing even in the absence of any schema updates.