Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

IDIEP-91
Author
Sponsor
Created

 

Status

Status
colourGreen
titleDRAFTACTIVE


Table of Contents

If I have seen further, it is by standing on ye shoulders of Giants

...

In this section, I'll give some definitions encountered through the text. It can be used to find a definition of a specific term quickly.

Record (aka Row, Tuple, Relation) -a collection of attribute-value pairs.

...

SG - serializability graph

RO - the abbreviation for "read-only"

RW - the abbreviation for "read-write"

TO - the abbreviation for "timestamp ordering"

Design Goals

To define key points of the protocol design, let's take a look at some features, which can be provided by the product, and value them from 1 to 3, where 3 means maximum importance for product success.

...

Let's evaluate each feature:

...

Strong isolation

Here we take into account the isolation property of a transaction. The strongest isolation is known to beSerializable, implying all transactions pretend to execute sequentially. This is convenient for users because it prevents hidden data corruption and security issues. This price may be reduced throughput/latency due to increased overhead from CC protocol. An additional option is to allow a user to choose a weaker isolation level, likeSNAPSHOT. The ultimate goal is to implement Serializability without sacrificing performance too much, having Serializable as the default isolation level. 

...

Such transactions can be used to build analytical reports, which can take minutes without affecting (and being affected by) concurrent OLTP load. Any SQL select for read query is naturally mapped to this type of aransaction. These transactions can run for several minutes.  transaction. Such transactions can also read snapshot data in the past at some timestamp. This is also known as HTAP.

...

Code Block
languagejava
titleIgniteTransactions
public interface IgniteTransactions {
	Transaction begin();
	CompletableFuture<Transaction> beginAsync();

    /**
     * Executes a closure within a transaction.
     *
     * <p>If the closure is executed normally (no exceptions) the transaction is automatically committed.
     * <p> In case of serialization conflict (or other retriable issue), the transaction will be automatically retried, so the closure must be a "pure function".

     * @param clo The closure.
     *
     * @throws TransactionException If a transaction can't be finished successfully.
     */     
     void runInTransaction(Consumer<Transaction> clo);

    <T> T runInTransaction(Function<Transaction, T> clo); 

    <T> CompletableFuture<T> runInTransactionAsync(Function<Transaction, CompletableFuture<T>> clo) 
}


Code Block
languagejava
titleTransaction


public interface Transaction {
    /**
     * Synchronously commits a transaction.
     *
     * @throws TransactionException If a transaction can't be committed.
     */
    void commit() throws TransactionException;

    /**
     * Asynchronously commits a transaction.
     *
     * @return The future.
     */
    CompletableFuture<Void> commitAsync();

    /**
     * Synchronously rolls back a transaction.
     *
     * @throws TransactionException If a transaction can't be rolled back.
     */
    void rollback() throws TransactionException;

    /**
     * Asynchronously rolls back a transaction.
     *
     * @return The future.
     */
    CompletableFuture<Void> rollbackAsync();
}

...

If the index has no explicit partition key, its data is partitioned implicitly, the same as the PK partition key.

Additional information can be found here: IEP-86: Colocation Key

Sub-partitioning

The data can be additionally locally partitioned (within a partition) to improve query performance and parallelism.

...

Execute a transaction's validation and write phase together as a critical section: while ti being in the val-write phase, no other tk can enter its val-write phase

...

An RO transaction is associated with a read timestamp at the beginning. It scans indexes to get access to table data via indexed rowIds. Because it passes normal locking, rowIds must be somehow filtered according to their visibility. This filtering process is known as a write intent resolution and is crucial for RO transactions processing. Write intent resolution is required then when a version chain corresponding to rowId has an uncommitted version.

...

  • If there is unresolved write intent (uncommitted version in the chain head), it initiates the intent resolution 
    • Checks a local txn state map for committed or aborted state - allow read if the state is committed and commitTs <= readTs.
    • If not possible, send a TxStateReq  to the txn coordinator - this initiates the coordinator path. Coordinator address is fetched from the txn state map.
    • If a coordinator path was not able to resolve the intent, one of the following has happened - the coordinator is dead or txn state is not available in the cache. Calculate a commit partition and send the TxStateReq to its primary replica - this initiates the commit partition path.
    • Retry commit partition path until a success or timeout.

...

  • If the local txn is finished, return the response with the outcome: commit or abort. The txn state is stored in a local cache, mentioned in the RW section.

    If the local txn is finishing (waiting for finish state replication), wait for outcome and return response with the outcomeresult: commit or abort

    If the outcome result is to commit, additional timestamp check is required: a commit timestamp must be <= readTs. If the condition is not held, the outcome result is changed to commit abort.

    If local txn is active, adjust the txn coordinator node HLC according to readTs to make sure the txn commit timestamp is above the read timestamp. The read timestamp must be installed before txn is started to commit, so commit timestamp is assigned after the read timestamp. This must be achieved by some sort of concurrency control, preferably non-blocking. In this case we must ignore the write intent, so the outcome is to abort.

    If txn state is not found in a local cache and txn is not active, return NULL.

...

Because read locks are not replicated, it's possible that another transaction is mapped to a new primary replica and invalidates a read for the current not yet committed transaction.

Assume a transactions

T1: a= r1(x), w1(y), c1

T2: r2(y), w2(x), c2

Due to PR1 failure and a lost read lock, a following history is possible.

H = r1(x) w2(x) c2 w1(y) c1

Assume the RO transaction T3 and the history

H2 = r1(xr2(y) w2(x) c2 r3(x) r3(y) w1(y) c1The RO transaction will return the inconsistent result to a user, not corresponding to any serial execution of T1, T2, T3

This is illustrated by the diagram:

Image Removed

Note that H2 is not allowed by SS2PL, instead it allows

H3 = r1(x) w1(y) c1 w2(x) c2history is not serializable.

We prevent this history by not committing TX1 if its commit timestamp lies outside lease ranges.

...

In the following diagram the lease refresh discovers that the lease is no longer valid, and the transaction is aborted.

Leases can must be refreshed periodically by TxnCoordinator, depending on the lease duration. Successful refreshing of a lease extends transaction deadline up to a lease upper bound.

The more frequent refreshing reduces the probability of txn abort due to reaching the deadline.The reasonable refresh rate can be calculated as a fraction of the lease duration D, for example  D/3is D/2, where D is a lease duration interval.

Commit reordering for dependent transactions

...

This picture shows the possible commit reordering between T1 and T2 due to delay in applying the commit for T1, despite the fact T2 has a larger commit timestamp. This leads to a situation when T2 becomes visible before T1. But it seems not to cause any consistency issues, because up-to-date data is correct and the possible issues can only be with RO transactions at a timestamp T2. Likely Luckily, we already have the intent written for T1 and the write intent resolution procedure will eventually wait for committed intent. So, for now I’m considering this reordering harmless.

...

https://github.com/ascherbakoff/ai3-txn-mvp MVP for RW based CC, with data and index versioning support.

Tickets

Jira
serverASF JIRA
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyIGNITE-15081
// Links or report with relevant JIRA tickets.