...

Such transactions can be used to build complex OLAP reports without affecting the concurrent OLTP load. Any SQL read query maps naturally to this type of transaction. Such transactions can also read snapshot data from the past. Must have; I rate it 3.

...

We can try to optimize the protocol to handle common scenarios better. For example, small transactions can be optimized by buffering writes until commit to reduce lock hold time. I rate it 1.
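To make the buffering idea concrete, here is a minimal sketch in Python. The class name `BufferedTx`, the dict-based store, and the toy lock table are all illustrative assumptions, not part of the actual protocol: writes stay in a local buffer and locks are only taken for the short span in which the buffer is applied at commit.

```python
class BufferedTx:
    """Toy transaction that defers writes (and locks) until commit."""

    def __init__(self, store, locks):
        self.store = store    # shared key-value store (a plain dict here)
        self.locks = locks    # toy lock table: a set of currently locked keys
        self.buffer = {}      # writes staged locally until commit

    def write(self, key, value):
        # No lock is taken here; the write only lands in the local buffer.
        self.buffer[key] = value

    def read(self, key):
        # Read-your-own-writes: prefer the local buffer over the store.
        return self.buffer.get(key, self.store.get(key))

    def commit(self):
        acquired = []
        try:
            for key in self.buffer:
                if key in self.locks:
                    raise RuntimeError("write conflict on " + key)
                self.locks.add(key)
                acquired.append(key)
            # Locks are held only while the buffered writes are applied.
            self.store.update(self.buffer)
        finally:
            for key in acquired:
                self.locks.discard(key)
```

Compared to locking each key at write time, the lock hold time shrinks from the whole interactive transaction to just the commit step, at the cost of detecting conflicts late.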

Geo-distribution

...

awareness

Geo-distributed clusters are gaining popularity. While they suffer from network latency due to the speed-of-light limit, they are the best option for high availability. So the protocol should minimize the number of messages sent between regions. I rate it 2.

...

Looking at the evaluation, it's easy to notice that the protocol design favors usability over performance. This doesn't mean we don't need performance: we need an acceptable level of performance and, more importantly, scalability. Optimizations can be postponed until later.

Let’s define the key points of the design.

It’s necessary to have 

  1. Interactive transactions 
  2. Long running report-like read-only queries, which are able to execute on backups.
  3. Serializable isolation
  4. Optimized latency for geo-distributed setups, where replicas are spread across geographic regions for high availability (HA)
  5. Unlimited (or very large) txn size

The first requirement rules out deterministic protocols like Calvin, because they need to know the transaction's read-write set in advance (or require an expensive reconnaissance step).

The second requirement is only achievable using MVCC, where reads don't block writes and vice versa. 
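The MVCC idea above can be illustrated with a small sketch. The `MvccStore` class and its timestamp scheme are hypothetical, not the actual storage design: each key holds a chain of `(commit_ts, value)` versions, and a reader at snapshot timestamp `ts` sees the newest version committed at or before `ts`, so long report queries never block writers and can read data in the past.

```python
class MvccStore:
    """Toy multi-version store: key -> sorted list of (commit_ts, value)."""

    def __init__(self):
        self.versions = {}

    def write(self, key, value, commit_ts):
        chain = self.versions.setdefault(key, [])
        chain.append((commit_ts, value))
        chain.sort(key=lambda v: v[0])   # keep versions ordered by commit time

    def read(self, key, snapshot_ts):
        # Return the newest version with commit_ts <= snapshot_ts.
        # Writers only append new versions, so readers are never blocked.
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None                      # key did not exist at that snapshot
```

A reader pinned to snapshot 15 keeps seeing the version committed at 10 even after a writer commits a newer version at 20, which is exactly what a long-running report needs.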

The third requirement implies a CC protocol (an extension to MVCC) which allows for serialized schedules.

The fourth requirement means we need to minimize the number of inter-data-center round trips during transaction execution.

The fifth requirement implies not buffering the whole transaction on the coordinator node.

The system also has to be horizontally scalable. To achieve scalability, the data will be partitioned using a hash or range partitioning method. We treat a data partition here as a synonym for a data shard. Each partition is assigned to a cluster node and replicated to a predefined number of replicas to achieve high availability. A distributed transaction protocol makes cross-partition transactions preserve ACID guarantees.
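A minimal sketch of the hash-partitioning and replica-assignment idea, under assumptions of my own (CRC32 as the stable hash function and a simple "next nodes in the ring" replica placement; the real system may choose differently):

```python
import zlib

def partition_of(key, num_partitions):
    # A stable hash (zlib.crc32) keeps the key -> partition mapping the same
    # across restarts, unlike Python's randomized built-in hash().
    return zlib.crc32(key.encode()) % num_partitions

def replicas_of(partition, nodes, replica_factor):
    # Toy placement: the primary plus the next (replica_factor - 1) nodes.
    # Each partition thus lives on replica_factor distinct nodes.
    return [nodes[(partition + i) % len(nodes)] for i in range(replica_factor)]
```

A transaction touching keys that map to different partitions becomes a cross-partition transaction, which is where the distributed commit protocol below comes in.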

Providing atomicity on commit is an additional difficulty in a distributed environment. Typically this is achieved using the two-phase commit protocol or its improved consensus-based version https://www.microsoft.com/en-us/research/uploads/prod/2004/01/twophase-revised.pdf.
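For reference, a sketch of classic two-phase commit (not the consensus-based variant from the linked paper, which additionally replicates the coordinator's decision). The `Participant` class and its states are illustrative: the coordinator first asks every participant to prepare, and only if all vote yes does it broadcast commit.

```python
class Participant:
    """Toy 2PC participant with a can_commit flag standing in for real checks."""

    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: durably prepare and vote. A real participant would
        # flush its writes and locks to stable storage here.
        self.state = "prepared" if self.can_commit else "aborting"
        return self.can_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's decision.
        self.state = "committed" if decision == "commit" else "aborted"


def two_phase_commit(participants):
    # Phase 1: collect votes; any "no" aborts the whole transaction.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: broadcast the unanimous decision to everyone.
    for p in participants:
        p.finish(decision)
    return decision
```

The known weakness, which the consensus-based version addresses, is that participants block in the prepared state if the coordinator fails between the two phases.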

It turns out we want a Google Spanner clone. Spanner seems to have been designed with similar goals in mind. Other popular clones are CockroachDB and YugabyteDB.

We aim to reuse the common replication infrastructure. This means the records will be durably replicated using the Raft consensus protocol. On top of the replication layer we will have a distributed transaction coordination protocol. This approach tolerates f failed nodes out of n total nodes, where n >= 2f + 1. Other products can do better; for example, FoundationDB tolerates f failed nodes out of n, where n >= f + 1.

There are two main parts: concurrency control (CC) and atomic commitment.

...