
IDIEP-91

Author: Alexey Scherbakov
Sponsor: Alexey Scherbakov
Created:
Status: DRAFT

If I have seen further it is by standing on ye sholders of Giants

Isaac Newton

Motivation

One of the major features of AI3, as a distributed database, is the ability to execute multiple table operations as a single atomic operation, known as a transaction. We need to design a modern and robust distributed transaction protocol, taking current best practices into account. Both the key-value and SQL database access methods will rely upon it. Compared to AI2, we aim to support transactional SQL from the beginning and to remove limitations such as the restriction on transaction size.

Definitions

In this section I'll give definitions for terms encountered throughout the text, for easier understanding.

Record (aka Row, Tuple) - a collection of attribute-value pairs.

Transaction - a sequence of logically related partially ordered actions (reads or writes) over the database objects.

Atomicity - a transaction property which declares: either all actions are carried out or none are.

Consistency - a transaction property which guarantees that a transaction moves a database from one consistent state to another. The meaning of a consistent state is defined by the user.

Isolation - a measure of mutual influence between interleaved transactions.

Durability - a transaction property which guarantees that the effects of a committed transaction survive any subsequent failures.

Schedule - a way of executing interleaved transactions.

Serial schedule - a schedule where all transactions are executed sequentially.

Serializable schedule - a schedule which is equivalent to some serial execution of interleaved transactions.

Concurrency control (CC) - a technique to preserve database consistency in case of interleaved committed transactions.

Multi-version concurrency control (MVCC) - a family of concurrency control techniques based on writing multiple record versions (copy-on-write).

Recoverable schedule - a schedule that is not affected by aborting some of the involved transactions. To achieve this, a transaction reads only committed values.

Interactive transaction - a transaction whose operation set is not known a priori. It can be aborted at any time before it is committed.

Cascading abort - a situation in which the abort of one transaction causes the abort of another dependent transaction to avoid inconsistency.

Design Goals

To define the key points of the protocol design, let's look at some features which can be provided by the product and rate them from 1 to 3, where 3 means maximum importance for product success.

  1. Strong isolation
  2. Support for interactive transactions
  3. Conflict resistance
  4. Read-only (long lived) transactions
  5. Consistent replica reads
  6. Optimized for fast path execution
  7. Geo-distribution aware
  8. Unlimited or very large transaction size
  9. Transactional DDL
  10. Data loss toleration

Let's take a look at each feature in detail and give it a value.

Strong transaction isolation

Here we consider the isolation property of a transaction. The strongest isolation level is known to be Serializable, which implies that all transactions appear to execute sequentially. This is very convenient for a user, because it prevents hidden data corruption https://pmg.csail.mit.edu/papers/adya-phd.pdf and security issues http://www.bailis.org/papers/acidrain-sigmod2017.pdf. The price for this can be reduced throughput/latency due to the increased overhead of the CC protocol. Another option is to allow a user to choose a weaker isolation level, like SNAPSHOT. The ultimate goal is to implement Serializability without sacrificing too much performance, with Serializable as the default isolation level. I measure it with 2.

Support for interactive transactions

This is the most intuitive way to use transactions. I measure it with 3.
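
For illustration, a sketch of what interactive usage could look like; all API names below (igniteTransactions, accounts, Tuple, Transaction) are hypothetical placeholders, not a proposed interface:

    // Hypothetical client API, for illustration only.
    Transaction tx = igniteTransactions.begin();

    Tuple account = accounts.get(tx, Tuple.create().set("id", 42));
    long balance = account.longValue("balance");

    // The next operation depends on a value read within the same transaction,
    // so the full read-write set cannot be declared in advance.
    if (balance >= 100) {
        accounts.put(tx, account.set("balance", balance - 100));
        tx.commit();
    } else {
        tx.rollback();
    }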

Conflict resistance

This is a general property of a transactional protocol, defining how many transactions are restarted on serialization conflicts, causing progress loss. For example, optimistic CC causes more frequent restarts under contention, because conflict checks are delayed until commit time. Avoiding cascading aborts also reduces the number of restarts. I measure it with 1.

Read-only long lived transactions

Such transactions can be used to build complex OLAP reports without affecting concurrent OLTP load. Any SQL read query naturally maps to this type of transaction. Such transactions can also read snapshot data in the past, at some timestamp. This is a must-have; I measure it with 3.

Consistent replica reads

A very useful feature for load balancing, especially in conjunction with the previous one. I measure it with 3.

Optimized for common scenarios

We can try to optimize the protocol to handle common scenarios better. For example, small transactions can be optimized by buffering writes until commit to reduce lock hold time. I measure it with 1.
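
A minimal sketch of the buffering idea (types and method names are placeholders, not a proposed API):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: accumulate writes locally and apply them at commit,
    // so locks are held for the flush duration instead of the whole transaction.
    class WriteBufferingTransaction {
        interface Storage {                     // placeholder partition storage
            void lockAndWrite(Object key, Object row);
            void commit();
        }

        private final Map<Object, Object> writeBuffer = new HashMap<>();
        private final Storage storage;

        WriteBufferingTransaction(Storage storage) {
            this.storage = storage;
        }

        void put(Object key, Object row) {
            writeBuffer.put(key, row);          // no lock is acquired here
        }

        void commit() {
            for (Map.Entry<Object, Object> e : writeBuffer.entrySet()) {
                storage.lockAndWrite(e.getKey(), e.getValue()); // locks taken only now
            }
            storage.commit();                   // then the transaction commits
        }
    }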

Geo-distribution awareness

Geo-distributed clusters are gaining popularity. While they suffer from network latency due to the speed-of-light limit, they are the best option for high availability. So, the protocol should minimize the number of messages sent between regions. I measure it with 2.

Unlimited or very large transaction size

Some databases limit the number and total size of records enlisted in a transaction. This is inconvenient for a user. I measure it with 3.

Transactional DDL

Nice to have; it can help with migration scenarios. I measure it with 1.

Data loss toleration

It's important to know how many node failures we can tolerate before declaring unavailability due to temporary data loss (or full data loss in the case of an in-memory deployment). More is better. I measure it with 2.

High level interpretation

Looking at the evaluation, it's easy to notice that our freshly-baked protocol design favors usability over performance. It doesn't mean we don't need performance: we need an acceptable level of performance and, more importantly, scalability. Optimizations can be postponed until later.

Let’s define key points of a design. It’s necessary to have:

  1. Interactive transactions 
  2. Long running report-like read-only queries, which are able to execute on replicas.
  3. Serializable isolation
  4. Optimized latency for geo-distributed setups, where replicas are spread across geographic regions for high availability (HA)
  5. Unlimited (or very large) txn size

The first requirement rules out deterministic protocols like Calvin, because they need to know the transaction read-write set in advance (or require an expensive reconnaissance step).

The second requirement is only achievable using MVCC, where interleaved txn reads don't block writes and vice versa. 

The third requirement implies a CC protocol (actually a kind of MVCC) which allows serializable schedules. We will look at the alternatives later.

The fourth requirement means we need to minimize the number of inter-data-center messages during transaction execution.

The fifth requirement implies not buffering a whole transaction on the coordinator node.

The system also has to be horizontally scalable. To achieve scalability, the data will be partitioned using a hash or range partitioning method. The exact partitioning method is not important for the purposes of this document; we treat a partition here as a synonym for a shard. Each partition is assigned to a cluster node and replicated to a predefined number of replicas to achieve high availability. Adding more nodes increases the number of partitions in the cluster (or reduces the number of partitions per node in the case of static partitioning), thus increasing scalability.
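
A minimal sketch of hash partitioning with an illustrative replica assignment strategy (not the actual affinity function):

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: map a key to a partition, then a partition to replica nodes.
    class Partitioning {
        static int partition(Object key, int partitions) {
            return Math.abs(key.hashCode() % partitions); // hash partitioning
        }

        // Assign `replicas` distinct nodes to a partition (placeholder strategy).
        static <N> List<N> assignReplicas(int partition, List<N> nodes, int replicas) {
            List<N> assignment = new ArrayList<>();
            for (int i = 0; i < replicas; i++) {
                assignment.add(nodes.get((partition + i) % nodes.size()));
            }
            return assignment;
        }
    }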

Note that correct data partitioning is a key factor in cluster efficiency. This is a topic for another IEP and will not be covered here.

A transaction can span multiple partitions, making it distributed. Providing atomicity on commit is an additional difficulty in a distributed environment. Typically this is achieved by using the two-phase commit protocol or its improved consensus-based version https://www.microsoft.com/en-us/research/uploads/prod/2004/01/twophase-revised.pdf.
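
For reference, a minimal sketch of the classic two-phase commit flow (the Participant interface is a placeholder for a partition's replication group, not a proposed API):

    import java.util.Collection;

    // Illustrative only: two-phase commit over the transaction's enlisted partitions.
    class TwoPhaseCommit {
        interface Participant {
            boolean prepare();   // durably prepares and votes YES/NO
            void commit();
            void abort();
        }

        static boolean commit(Collection<Participant> participants) {
            // Phase 1: collect votes; a single NO vote aborts the transaction.
            for (Participant p : participants) {
                if (!p.prepare()) {
                    participants.forEach(Participant::abort);
                    return false;
                }
            }
            // Phase 2: a real coordinator durably logs the COMMIT decision here,
            // making it final, and then notifies all participants.
            participants.forEach(Participant::commit);
            return true;
        }
    }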

It turns out we want a Google Spanner clone: it seems to have been designed with similar goals in mind. Other well-known clones are CockroachDB and YugabyteDB.

We aim to reuse the common replication infrastructure. This means data records will be durably replicated using a consensus-based protocol, like RAFT. This approach tolerates f failed nodes out of n total nodes, where n >= 2f + 1. Other products can do better; for example, FoundationDB tolerates f failed nodes out of n, where n >= f + 1 (but consensus is still required). The CC protocol is not tied to the underlying replication protocol, so we can change the replication protocol in the future if we want.
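
A quick worked example: solving n >= 2f + 1 for f gives f = floor((n - 1) / 2), so a 3-replica RAFT group tolerates one failed replica and a 5-replica group tolerates two, whereas a scheme with n >= f + 1 tolerates two failures with only 3 copies.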

Serializability

Before continuing towards the discussion of a CC protocol which provides serializable schedules, let's dive into serializability theory. Assume we have 3 transactions:

T1 = r1[x] r1[z] w1[x], T2 = r2[y] r2[z] w2[y], T3 = w3[x] r3[y] w3[z]

and two schedules:

S1 = w3[x] r3[y] w3[z] r2[y] r2[z] w2[y] r1[x] r1[z] w1[x]

S2 = w3[x] r1[x] r3[y] r2[y] w3[z] r2[z] r1[z] w2[y] w1[x]

What can we tell about the serializability of S2? Recall the definition of a serializable schedule: to be serializable, it must be equivalent to some serial execution order of the transactions T1, T2, T3.

Two actions on the same data object, performed by different transactions, conflict if at least one of them is a write. Three anomalous situations can be described in terms of when the actions of two transactions T1 and T2 conflict with each other: in a write-read (WR) conflict, T2 reads a data object previously written by T1; read-write (RW) and write-write (WW) conflicts are defined similarly. These conflicts cause anomalies like dirty reads, unrepeatable reads, lost updates, and others.

S1 is obviously serial: it corresponds to the execution sequence T3, T2, T1. For S2, it's not that obvious whether it is serializable or not. To prove that it is, we should find an equivalent serial schedule. We can attempt to swap adjacent non-conflicting operations (preserving the order of conflicting ones) until an equivalent serial schedule is produced:

w3[x] r1[x] r3[y] r2[y] w3[z] r2[z] r1[z] w2[y] w1[x]
→ w3[x] r1[x] r3[y] r2[y] w3[z] r2[z] r1[z] w1[x] w2[y]
→ w3[x] r3[y] r2[y] w3[z] r2[z] r1[x] r1[z] w1[x] w2[y]
→ w3[x] r3[y] w3[z] r2[y] r2[z] r1[x] r1[z] w1[x] w2[y]
→ w3[x] r3[y] w3[z] r2[y] r2[z] w2[y] r1[x] r1[z] w1[x]

So, S2 is serializable and equivalent to T3, T2, T1. Such schedules are called conflict serializable: they can be converted to a serial schedule by swapping adjacent non-conflicting operations, which has no effect on the execution outcome. Every conflict serializable schedule is serializable, but not vice versa.
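
This check can be mechanized by building a precedence graph: add an edge Ti → Tj for every pair of conflicting operations where Ti's operation comes first, and test the graph for cycles; the schedule is conflict serializable iff the graph is acyclic. A minimal sketch in Java (the schedule model is an assumption made for illustration):

    import java.util.*;

    class ConflictSerializability {
        // A schedule action: transaction id, read/write flag, data object.
        record Action(int tx, boolean write, String obj) {}

        // True if the schedule's precedence graph is acyclic.
        static boolean isConflictSerializable(List<Action> schedule) {
            Map<Integer, Set<Integer>> graph = new HashMap<>();
            for (int i = 0; i < schedule.size(); i++) {
                for (int j = i + 1; j < schedule.size(); j++) {
                    Action a = schedule.get(i), b = schedule.get(j);
                    // Conflicting pair: different txs, same object, at least one write.
                    if (a.tx() != b.tx() && a.obj().equals(b.obj()) && (a.write() || b.write())) {
                        graph.computeIfAbsent(a.tx(), k -> new HashSet<>()).add(b.tx());
                    }
                }
            }
            Set<Integer> visited = new HashSet<>(), path = new HashSet<>();
            for (int tx : graph.keySet()) {
                if (hasCycle(tx, graph, visited, path)) {
                    return false;
                }
            }
            return true;
        }

        static boolean hasCycle(int tx, Map<Integer, Set<Integer>> graph,
                                Set<Integer> visited, Set<Integer> path) {
            if (path.contains(tx)) return true;   // back edge: a cycle
            if (!visited.add(tx)) return false;   // already fully explored
            path.add(tx);
            for (int next : graph.getOrDefault(tx, Set.of())) {
                if (hasCycle(next, graph, visited, path)) return true;
            }
            path.remove(tx);
            return false;
        }
    }

For S2 this produces only the edges T3 → T1 and T3 → T2, so the graph is acyclic, confirming the swapping argument above.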

Consider the transactions:

T1 = r1[x] w1[x] c1, T2 = w2[x] c2, T3 = w3[x] c3

and a schedule: r1[x] w2[x] c2 w1[x] c1 w3[x] c3

This schedule is serializable but not conflict serializable. It is equivalent to the serial schedule T1, T2, T3: in both, r1[x] reads the initial value of x and w3[x] performs the final write. However, the conflicts r1[x] → w2[x] and w2[x] → w1[x] order T1 before T2 and T2 before T1 at the same time, so no sequence of swaps of non-conflicting operations can make it serial.


We must also take into account schedules containing aborted transactions. 

Consider two transactions

T1 = w1[x] a1, T2 = r2[x] w2[x] c2

and a schedule

S = w1[x] r2[x] w2[x] c2 a1

Such a schedule is unrecoverable, because T2 has committed a value read from the uncommitted T1 and can't be undone when T1 aborts.

Now consider:

T1 = w1[x] c1, T2 = w2[x] a2

S = w2[x] w1[x] a2 c1

Here the abort a2 restores the before-image of x, undoing w1[x] as well and thereby losing T1's committed update.

For a schedule to be recoverable, a transaction must commit only after every transaction it has read from has committed. If transactions read only committed data, cascading aborts are avoided as well. If they additionally overwrite only committed data (such schedules are called strict), aborts can be implemented by restoring before-images, which prevents the lost update shown above.
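
As an illustration (not part of the protocol), strictness can be checked mechanically; the schedule model below is an assumption made for this sketch:

    import java.util.*;

    // Illustrative only: check that a schedule never reads or overwrites dirty data
    // (data written by a transaction that has not yet committed or aborted).
    class StrictScheduleCheck {
        // op: 'r' (read), 'w' (write), 'c' (commit), 'a' (abort); obj ignored for c/a.
        record Action(int tx, char op, String obj) {}

        static boolean isStrict(List<Action> schedule) {
            Map<String, Integer> lastWriter = new HashMap<>(); // object -> last writing tx
            Set<Integer> finished = new HashSet<>();           // committed or aborted txs
            for (Action act : schedule) {
                if (act.op() == 'c' || act.op() == 'a') {
                    finished.add(act.tx());
                } else {
                    Integer writer = lastWriter.get(act.obj());
                    if (writer != null && writer != act.tx() && !finished.contains(writer)) {
                        return false; // reads or overwrites an uncommitted write
                    }
                    if (act.op() == 'w') {
                        lastWriter.put(act.obj(), act.tx());
                    }
                }
            }
            return true;
        }
    }

Both problem schedules above fail this check: in the first, r2[x] reads T1's uncommitted write; in the second, w1[x] overwrites T2's uncommitted write.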

Description

// Provide the design of the solution.

Consistency model

// Describe the model

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

// Links to various reference documents, if applicable.

Tickets

// Links or report with relevant JIRA tickets.
