
IDIEP-91
Author: Alexey Scherbakov
Sponsor: Alexey Scherbakov
Created:
Status: DRAFT

If I have seen further it is by standing on ye shoulders of Giants

Isaac Newton

Motivation

One of the major features of AI3, as a distributed database, is the ability to execute multiple table operations as a single atomic operation, known as a transaction. We need to design a modern and robust distributed transaction protocol that takes current best practices into account. Both key-value and SQL database access methods will rely upon it. Compared to AI2, we aim to support transactional SQL from the beginning and to remove limitations such as the restriction on transaction size.

Definitions

In this section I'll give some definitions encountered throughout the text, for easier understanding.

Record (aka Row, Tuple, Relation) - a collection of attribute-value pairs.

Transaction - a sequence of logically related partially ordered actions (reads or writes) over the database objects.

Atomicity - a transaction property which declares: either all actions are carried out or none are.

Consistency - a transaction property which guarantees that a transaction moves a database from one consistent state to another. The meaning of the consistent state is defined by a user.

Isolation - a measure of mutual influence between interleaved transactions.

Durability - a transaction property which guarantees that the effects of a committed transaction survive any subsequent failures.

Schedule - a way of executing interleaved transactions.

Serial schedule - a schedule where all transactions are executed sequentially.

Serializable schedule - a schedule which is equivalent to some serial execution of interleaved transactions.

Concurrency control (CC) - a technique to preserve database consistency in case of interleaved committed transactions.

Multi-version concurrency control (MVCC) - a family of concurrency control techniques based on writing multiple record versions (copy-on-write).

Recoverable schedule - a schedule which is not affected by aborting some of the involved transactions. To achieve this, a transaction reads only committed values.

Interactive transaction - a transaction whose operation set is not known a priori. It can be aborted at any time, if not committed yet.

Cascading abort - a situation in which the abort of one transaction causes the abort of another dependent transaction to avoid inconsistency.

Design Goals

To define the key points of the protocol design, let's look at some features which can be provided by the product and rate them from 1 to 3, where 3 means maximum importance for product success.

  1. Strong isolation
  2. Support for interactive transactions
  3. Conflict resistance
  4. Read-only (long lived) transactions
  5. Consistent replica reads
  6. Optimized for fast path execution
  7. Geo-distribution aware
  8. Unlimited or very large transaction size
  9. Transactional DDL
  10. Data loss toleration

Let's take a look at each feature in detail and give it a value.

Strong isolation

Here we take into account the isolation property of a transaction. The strongest isolation level is known to be Serializable, implying that all transactions appear to execute sequentially. This is very convenient for a user, because it prevents hidden data corruptions https://pmg.csail.mit.edu/papers/adya-phd.pdf and security issues http://www.bailis.org/papers/acidrain-sigmod2017.pdf. The price for this can be reduced throughput/latency due to the increased overhead of the CC protocol. Another option is to allow a user to choose a weaker isolation level, like SNAPSHOT. The ultimate goal is to implement Serializability without sacrificing performance too much, having Serializable as the default isolation level. I measure it with 2

Support for interactive transactions

This is the most intuitive way to use transactions. I measure it with 3

Conflict resistance

This is a general property of a transactional protocol, defining how many transactions will be restarted in case of a serialization conflict, causing progress loss. For example, optimistic CC causes more frequent restarts under contention, because the conflict check is delayed until a commit. Avoiding cascading aborts also reduces the number of restarts. I measure it with 1

Read-only long lived transactions

Such transactions can be used to build complex OLAP reports without affecting the concurrent OLTP load. Any SQL read query naturally maps to this type of transaction. Such transactions can also read snapshot data in the past, at some timestamp. Must have, I measure it with 3

Consistent replica reads

A very useful feature for load balancing, especially in conjunction with the previous one. I measure it with 3

Optimized for common scenarios

We can try to optimize the protocol to handle common scenarios better. For example, small transactions can be optimized by buffering writes until a commit to reduce lock hold time. I measure it with 1

Geo-distribution awareness

Geo-distributed clusters are gaining popularity. While they suffer from network latency issues due to the speed-of-light limit, they are the best option for high availability. So, the protocol should minimize the number of messages sent between regions. I measure it with 2

Unlimited or very large transaction size

Some databases limit the number and total size of records enlisted in a transaction. This is inconvenient for a user. I measure it with 3

Transactional DDL

Nice to have, can help with migration scenarios. I measure it with 1

Data loss toleration

It's important to know how many node failures we can tolerate before declaring unavailability due to temporary data loss (or permanent loss in the case of an in-memory deployment). More is better. I measure it with 2

High level interpretation

Looking at the evaluation, it's easy to notice that our freshly-baked protocol design favors usability over performance. This doesn't mean we don't need performance - we just need an acceptable level of performance and, more importantly, scalability. Optimizations can be postponed until later.

Let’s define key points of a design. It’s necessary to have:

  1. Interactive transactions 
  2. Long running report-like read-only queries, which are able to execute on replicas.
  3. Serializable isolation
  4. Optimized latency for geo-distributed setups, when replicas are spread between geographic regions for high availability (HA)
  5. Unlimited (or very large) txn size

The first requirement rules out deterministic protocols like Calvin, because they need to know the txn read-write set in advance (or require an expensive reconnaissance step).

The second requirement is only achievable using MVCC, where interleaved txn reads don't block writes and vice versa. 

The third requirement implies a CC protocol (actually a kind of MVCC) which allows only serializable schedules. We will look at alternatives later.

The fourth requirement means we need to minimize the number of inter-data-center messages during transaction execution.

The fifth requirement implies not buffering a whole transaction in memory.

The system also has to be horizontally scalable. To achieve scalability, the data will be partitioned using a hash or range partitioning method. The exact partitioning method is not important for the purposes of this document; we treat a partition here as a synonym for a shard. Each partition is assigned to a cluster node and replicated to a predefined number of replicas to achieve high availability. Adding more nodes increases the number of partitions in the cluster (or reduces the number of partitions per node in the case of static partitioning), thus increasing scalability.
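As an illustration, hash partitioning can be sketched as follows. This is a minimal sketch, not a design decision: the function names and the use of CRC32 are assumptions made only for this example.

```python
import zlib

def partition_for(key: str, partitions: int) -> int:
    # Map a record key to one of `partitions` shards using a stable hash.
    # Any stable hash works; CRC32 is used here purely for illustration.
    return zlib.crc32(key.encode("utf-8")) % partitions

def node_for(key: str, partitions: int, nodes: int) -> int:
    # A partition is then assigned to a node, e.g. round-robin by partition id.
    return partition_for(key, partitions) % nodes
```

Note that with static partitioning, adding nodes changes only the partition-to-node assignment, while the key-to-partition mapping stays stable.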

Note that correct data partitioning is a key factor in cluster efficiency. This is a topic for another IEP and will not be covered here.

A transaction can span multiple partitions, making it distributed. Providing atomicity on commit is an additional difficulty in a distributed environment. Typically this is achieved by using the two-phase commit protocol or its improved consensus-based version https://www.microsoft.com/en-us/research/uploads/prod/2004/01/twophase-revised.pdf.

It turns out we want a Google Spanner clone: Spanner seems to have been designed with similar goals in mind. Other known clones are CockroachDB and YugabyteDB.

We aim to reuse the common replication infrastructure. This means data records will be durably replicated using a consensus-based protocol, like RAFT. This approach tolerates f failed nodes out of n total nodes, where n >= 2f + 1. Other products can do better; for example, FoundationDB tolerates f failed nodes out of n, where n >= f + 1 (but consensus is still required). A CC protocol is not tied to the underlying replication protocol; we can change the replication protocol in the future if we want.
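The failure-tolerance arithmetic for majority-based consensus can be checked with a one-liner (a sketch; the function name is ours):

```python
def tolerated_failures(n: int) -> int:
    # A majority quorum needs floor(n/2) + 1 live replicas, so the group
    # tolerates f = floor((n - 1) / 2) failures, i.e. n >= 2f + 1.
    return (n - 1) // 2
```

For example, a 3-replica group tolerates 1 failure and a 5-replica group tolerates 2; going from 3 to 4 replicas buys no extra tolerance under majority quorums.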

Serializability

Before continuing towards the discussion of a CC protocol which provides serializable schedules, let's dive into serialization theory.

Example 1. Assume we have 3 transactions:

T1 = r1[x] r1[z] w1[x], T2 = r2[y] r2[z] w2[y], T3 = w3[x] r3[y] w3[z]

and two schedules:

S1 = w3[x] r3[y] w3[z] r2[y] r2[z] w2[y] r1[x] r1[z] w1[x]

S2 = w3[x] r1[x] r3[y] r2[y] w3[z] r2[z] r1[z] w2[y] w1[x]

We assume a commit of each transaction. What can we tell about the serializability of S2? Recall the serializable schedule definition: to be serializable, it must be equivalent to some serial execution order of transactions T1, T2, T3.

Two actions on the same data object, performed by different transactions, conflict if at least one of them is a write. The three anomalous situations can be described in terms of when the actions of two transactions T1 and T2 conflict with each other: in a write-read (WR) conflict T2 reads a data object previously written by T1; we define read-write (RW) and write-write (WW) conflicts similarly. These conflicts cause anomalies like dirty reads, unrepeatable reads, lost updates, and others.

S1 is obviously serial: it corresponds to the execution sequence T3, T2, T1. It's not that obvious whether S2 is serializable or not. To prove it we should find an equivalent serial schedule. We can attempt to swap non-conflicting operations (preserving the order of conflicting ones) until an equivalent serial schedule is produced.

w3[x] r1[x] r3[y] r2[y] w3[z] r2[z] r1[z] w2[y] w1[x] → w3[x] r3[y] r1[x] r2[y] w3[z] r2[z] r1[z] w2[y] w1[x] → w3[x] r3[y] r2[y] r1[x] w3[z] r2[z] r1[z] w2[y] w1[x] → w3[x] r3[y] r2[y] w3[z] r1[x] r2[z] r1[z] w2[y] w1[x] → w3[x] r3[y] w3[z] r2[y] r1[x] r2[z] r1[z] w2[y] w1[x] → w3[x] r3[y] w3[z] r2[y] r2[z] r1[x] r1[z] w2[y] w1[x] → w3[x] r3[y] w3[z] r2[y] r2[z] w2[y] r1[x] r1[z] w1[x]

So, S2 is serializable and equivalent to T3, T2, T1. Such schedules are called conflict serializable - they can be converted to a serial schedule by swapping non-conflicting operations, which has no effect on the execution outcome. Every conflict serializable schedule is serializable, but not vice versa.
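The swap argument above can be mechanized: two schedules over the same operations are conflict equivalent iff every conflicting pair of actions appears in the same relative order. A minimal sketch, representing each action as a (action, txn, object) tuple and assuming no duplicate operations (the helper names are ours):

```python
def conflicting(a, b):
    # Same object, different transactions, at least one write.
    return a[2] == b[2] and a[1] != b[1] and 'w' in (a[0], b[0])

def conflict_pairs(s):
    # All ordered conflicting pairs (earlier action, later action) of a schedule.
    return {(a, b) for i, a in enumerate(s) for b in s[i + 1:] if conflicting(a, b)}

def conflict_equivalent(s1, s2):
    # Same operations and the same relative order of every conflicting pair.
    return sorted(s1) == sorted(s2) and conflict_pairs(s1) == conflict_pairs(s2)

# S1 (the serial schedule T3, T2, T1) and S2 from Example 1:
S1 = [('w','T3','x'), ('r','T3','y'), ('w','T3','z'), ('r','T2','y'),
      ('r','T2','z'), ('w','T2','y'), ('r','T1','x'), ('r','T1','z'),
      ('w','T1','x')]
S2 = [('w','T3','x'), ('r','T1','x'), ('r','T3','y'), ('r','T2','y'),
      ('w','T3','z'), ('r','T2','z'), ('r','T1','z'), ('w','T2','y'),
      ('w','T1','x')]
print(conflict_equivalent(S1, S2))  # True: S2 is conflict serializable
```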

Example 2. Consider the transactions:

T1 = r1[x] w1[x] c1, T2 = w2[x] c2

and a schedule: S = r1[x] w2[x] c2 w1[x] c1

S is serializable (it's equivalent to T1, T2), but not conflict serializable.

It is useful to capture all potential conflicts between the transactions in a schedule in a precedence graph, also called a serializability graph.

The precedence graph for a schedule S contains:

  • A node for each committed transaction in S.
  • An arc from Ti to Tj if an action of Ti precedes and conflicts with one of Tj's actions.

A schedule S is conflict serializable if and only if its precedence graph is acyclic. An equivalent serial schedule in this case is given by any topological sort over the precedence graph.
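The cycle test can be sketched directly on top of Python's graphlib, using (action, txn, object) tuples for actions as in Example 1 (the helper names are ours):

```python
from graphlib import TopologicalSorter, CycleError

def precedence_graph(schedule):
    # Maps each transaction to the set of transactions that must precede it:
    # an arc Ti -> Tj exists whenever an action of Ti precedes and conflicts
    # with one of Tj's actions.
    preds = {}
    for i, a in enumerate(schedule):
        preds.setdefault(a[1], set())
        for b in schedule[i + 1:]:
            if a[2] == b[2] and a[1] != b[1] and 'w' in (a[0], b[0]):
                preds.setdefault(b[1], set()).add(a[1])
    return preds

def serial_order(schedule):
    # A topological sort of the precedence graph gives an equivalent serial
    # schedule; a cycle means the schedule is not conflict serializable.
    try:
        return list(TopologicalSorter(precedence_graph(schedule)).static_order())
    except CycleError:
        return None

# S2 from Example 1: w3[x] r1[x] r3[y] r2[y] w3[z] r2[z] r1[z] w2[y] w1[x]
S2 = [('w','T3','x'), ('r','T1','x'), ('r','T3','y'), ('r','T2','y'),
      ('w','T3','z'), ('r','T2','z'), ('r','T1','z'), ('w','T2','y'),
      ('w','T1','x')]
print(serial_order(S2))  # an equivalent serial order starting with T3
```

For S2 the graph has only the arcs T3 -> T1 and T3 -> T2, so any topological sort (e.g. T3, T2, T1) is an equivalent serial schedule.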

Conflict serializability is sufficient but not necessary for serializability. A more general sufficient condition is view serializability.

Intuitively, two schedules are equivalent if the executions that produced them leave the database in the same state. The formal definition of equivalence is:

Two schedules are view equivalent if and only if:

  • Both schedules have the same initial reads.
  • Each read operation reads from the same write operation in both schedules
  • Both schedules have the same final writes.

We say a read operation j reads from a write operation i if read operation j reads a value most recently written by write operation i. The final write of a schedule is the schedule's last write operation.

A schedule is view serializable if it is view equivalent to some serial schedule. Every conflict serializable schedule is view serializable, although the converse is not true.

It can be shown that any view serializable schedule that is not conflict serializable contains a blind write. A blind write is simply a write performed by a transaction without a prior read of the same object.
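View equivalence can be checked by comparing reads-from relations and final writes. A sketch under the assumption that each transaction reads each object at most once; the three-transaction blind-write example below is the standard textbook illustration, not taken from the examples above:

```python
def view_signature(schedule):
    # Returns (reads-from relation, final writer per object). Each read maps
    # to the txn whose write it reads from, or None for the initial value.
    last_writer, reads = {}, {}
    for act, txn, obj in schedule:
        if act == 'r':
            reads[(txn, obj)] = last_writer.get(obj)
        elif act == 'w':
            last_writer[obj] = txn
    return reads, last_writer

def view_equivalent(s1, s2):
    # Same initial reads, same reads-from pairs, same final writes.
    return view_signature(s1) == view_signature(s2)

# T1 = r1[x] w1[x], T2 = w2[x], T3 = w3[x]; the schedule S below is view
# serializable (view equivalent to serial T1, T2, T3) thanks to the blind
# writes w2[x] and w3[x], yet it is not conflict serializable.
S      = [('r','T1','x'), ('w','T2','x'), ('w','T1','x'), ('w','T3','x')]
serial = [('r','T1','x'), ('w','T1','x'), ('w','T2','x'), ('w','T3','x')]
print(view_equivalent(S, serial))  # True
```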

Enforcing or testing view serializability turns out to be much more expensive, and the concept therefore has little practical use.

Example 3. Consider the transaction:

T1=r1[x] w1[x], T2=r2[x] w2[x]

and a schedule: S = r2[x] r1[x] w2[x] w1[x]

S is not a serializable schedule, because it is equivalent neither to T1, T2 nor to T2, T1

Until now we have talked about schedules containing only committed transactions. We must also take into consideration schedules containing aborted transactions (an abort can also happen due to an internal error or a crash), which brings in the notion of recoverability.

Example 4. Consider two transactions

T1 = w1[x] a1, T2 = r2[x] w2[x] c2

and S = w1[x] r2[x] w2[x] c2 a1

Such a schedule is unrecoverable, because once T2 is committed it can't be undone on T1's abort.

Example 5. Another example:

T1 = w1[x] c1, T2 = w2[x] a2

and S = w2[x] w1[x] a2 c1

Here w1[x] will be undone by a2, causing a lost update for T1.

If in a schedule,

  • whenever a transaction performs a dirty read from an uncommitted transaction
  • its commit operation is delayed until that uncommitted transaction either commits or aborts

then such a schedule is called a recoverable schedule.
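A recoverability check over the same notation can be sketched as follows; schedules now include ('c', txn, None) and ('a', txn, None) actions, and the function name is ours:

```python
def is_recoverable(sched):
    last_writer, finished, dirty_reads = {}, {}, []
    for i, (act, txn, obj) in enumerate(sched):
        if act == 'w':
            last_writer[obj] = txn
        elif act == 'r':
            w = last_writer.get(obj)
            if w is not None and w != txn and w not in finished:
                dirty_reads.append((txn, w))   # txn read from uncommitted w
        else:  # 'c' or 'a'
            finished[txn] = i
    # Every committed reader must commit after the writer it read from finishes.
    return all(not (r in finished and sched[finished[r]][0] == 'c'
                    and (w not in finished or finished[w] > finished[r]))
               for r, w in dirty_reads)

# Example 4: S = w1[x] r2[x] w2[x] c2 a1 - unrecoverable, since T2 commits
# before T1 (which it read from) finishes.
S4 = [('w','T1','x'), ('r','T2','x'), ('w','T2','x'),
      ('c','T2',None), ('a','T1',None)]
print(is_recoverable(S4))  # False
```

Moving c2 after c1 (i.e. w1[x] r2[x] w2[x] c1 c2) makes the schedule recoverable.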

If a schedule only reads data written by already committed transactions, it is called a cascadeless schedule. Uncommitted writes are still possible. It avoids cascading aborts.

A strict schedule only reads or overwrites data written by already committed transactions. Strict schedules are recoverable, do not require cascading aborts, and actions of aborted transactions can be undone by restoring the original values of the modified objects.

Example 6:

T1 = r1[x] w1[x] c1, T2 = r2[x] w2[x] c2

S = r1[x] r2[x] w1[x] c1 w2[x] c2

So, S is a strict schedule which is not serializable.
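A strictness check can be sketched in the same style: no action may read or overwrite an object whose last writer is still active (the function name is ours):

```python
def is_strict(sched):
    last_writer, finished = {}, set()
    for act, txn, obj in sched:
        if act in ('c', 'a'):
            finished.add(txn)
            continue
        w = last_writer.get(obj)
        if w is not None and w != txn and w not in finished:
            return False   # reads or overwrites an uncommitted write
        if act == 'w':
            last_writer[obj] = txn
    return True

# Example 6: S = r1[x] r2[x] w1[x] c1 w2[x] c2 - strict, because w2[x]
# overwrites a value already committed by c1.
S6 = [('r','T1','x'), ('r','T2','x'), ('w','T1','x'), ('c','T1',None),
      ('w','T2','x'), ('c','T2',None)]
print(is_strict(S6))  # True
```

By contrast, Example 5's schedule w2[x] w1[x] a2 c1 fails the check: w1[x] overwrites T2's uncommitted write.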

This theoretical insight will come in handy when reasoning about the CC protocol.

CC protocol WIP

A CC protocol ensures that only schedules with desirable properties are generated.

We will look for a CC protocol producing serializable schedules and matching the aforementioned requirements. It is also desirable for it to be strict, for ease of recovery and absence of cascading aborts. We also set a goal of ease of implementation to speed up the development of the initial release. More complex protocols can be added later as optimizations.

MV2PL (S2PL vs non-S2PL)

MVOCC

MVTO

WRITE SNAPSHOT https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf

READ SNAPSHOT https://dl.acm.org/doi/10.1145/2168836.2168853

SGT

Timestamp generation. problems / HLC / bounded clock skew

MGL extension to S-X locks

Comparison to known protocols.

Description

// Provide the design of the solution.

Consistency model

// Describe the model

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Choosing MV2PL seems to be a well-rounded solution, but there are some risks... range locking may be too prohibitive

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

// Links to various reference documents, if applicable.

Tickets

// Links or report with relevant JIRA tickets.
