...

Here, we'd like to pause and revisit our choice of how to handle update conflicts. Our isolation requirements have not changed; a LOCK_FREE snapshot isolation remains to be explored as a possible alternative to OPTIMISTIC_CONCURRENCY. As discussed before, there are 2 ways to handle update conflicts: 1) conflict avoidance, done by systems such as those supporting serializable isolation, and 2) conflict resolution, performed either with locks or lock-free by readers. Let's discuss the LOCK FREE conflict resolution model.

CRDTs


To understand CRDTs, let's first understand the types of consistency models (ACID) that ensure correctness.

  1. Strong Consistency - In a strongly consistent system, a change applied to a set of data is ONLY applied if we can ensure that any failure will not affect the state of the system as seen by any readers of the system. In a single node system this is straightforward to understand, but it becomes extremely tricky in a distributed system. Here, usually a two-phase commit (2PC) or three-phase commit (3PC) protocol is used to ensure the system is strongly consistent. This requires a consensus protocol such as Paxos or Raft that can ascertain that all nodes in the distributed system agree on and accept the changes applied. Here, availability is traded for consistency in the CAP theorem.
  2. Eventual Consistency - In an eventually consistent system, a change applied to a set of data is ONLY applied if we can ensure that failures in parts of the system are acceptable and readers may see different (stale) states of the system for a period of time. Usually, systems supporting this kind of consistency model use a conflict resolution methodology, for example Latest Writer Wins (LWW) using timestamps as the conflict resolver, or a quorum-based methodology that compares data from a majority of the nodes. Such systems generally do NOT require any consensus-based protocol and trade consistency for availability in the CAP theorem.
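To make the LWW idea concrete, here is a minimal sketch of timestamp-based resolution. It is illustrative only: the class and method names are hypothetical, and it assumes writer clocks are synchronized, a caveat discussed further below.

    // Minimal sketch of Last-Writer-Wins resolution. Illustrative only:
    // LwwValue is a hypothetical name, and synchronized clocks are assumed.
    import java.util.Arrays;
    import java.util.Comparator;

    public class LwwValue {
        final long timestampMs; // assumes all writer clocks are in sync
        final String value;

        LwwValue(long timestampMs, String value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }

        // Keep the write with the highest timestamp. Note that equal
        // timestamps would need a deterministic tie-breaker in practice.
        static LwwValue resolve(LwwValue... conflicting) {
            return Arrays.stream(conflicting)
                    .max(Comparator.comparingLong(v -> v.timestampMs))
                    .orElseThrow();
        }

        public static void main(String[] args) {
            LwwValue a = new LwwValue(1_000L, "from-writer-1");
            LwwValue b = new LwwValue(2_000L, "from-writer-2");
            System.out.println(resolve(a, b).value); // from-writer-2
        }
    }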


Here, the problem of strong and eventual consistency can be viewed as analogous to update conflicts from multiple writers. To achieve LOCK FREE snapshot isolation, we need to have different updates applied to the same datum without any synchronization or consensus (here, LOCKS), while finally getting the correct, strongly consistent view of the datum.

...

Another option to achieve strong eventual consistency is to explore Conflict-Free Replicated Data Types (CRDTs). In a CRDT model, data can be updated independently & concurrently without the need for any co-ordination/consensus, but with a predefined set of rules that are applied during conflict resolution at a later stage.
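As an illustration, a grow-only counter (G-Counter) is one of the simplest CRDTs. The sketch below uses hypothetical names: each replica increments only its own slot, and merging takes a per-replica maximum, which makes merges safe in any order, any grouping, and any number of repetitions.

    // Minimal G-Counter CRDT sketch (hypothetical names, illustrative only).
    import java.util.HashMap;
    import java.util.Map;

    public class GCounter {
        private final Map<String, Long> counts = new HashMap<>();

        // Each replica increments only its own entry; no coordination needed.
        void increment(String replicaId) {
            counts.merge(replicaId, 1L, Long::sum);
        }

        // Merge by taking the per-replica maximum; commutative, associative
        // and idempotent, so replicas can exchange state in any order.
        void merge(GCounter other) {
            other.counts.forEach((replica, count) ->
                    counts.merge(replica, count, Long::max));
        }

        long value() {
            return counts.values().stream().mapToLong(Long::longValue).sum();
        }

        public static void main(String[] args) {
            GCounter r1 = new GCounter();
            GCounter r2 = new GCounter();
            r1.increment("r1");
            r2.increment("r2");
            r2.increment("r2");
            r1.merge(r2);
            r1.merge(r2); // idempotent: merging again changes nothing
            System.out.println(r1.value()); // prints 3
        }
    }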

...

  1. No need for background consensus
  2. No need to rollback
  3. Available, fault-tolerant, scalable

Note that this means that the system is NOT immediately strongly consistent, but is strongly consistent by the time readers access it.

A good example is the Last Writer Wins (LWW) methodology with timestamp-based resolution used to achieve eventual consistency. One of the problems with timestamp-based resolution is that all servers need to be clock-synchronized using NTP, with no scope for any clock drift. Ensuring this is extremely hard and requires special systems. For a Hudi based system, this gets even more complex. Note that the LWW methodology assumes that the latest update of a datum is a full snapshot: the timestamp of the write is associated with the lowest granularity of the datum, which in databases is generally a column of a row.


CRDTs

Hudi supports partial row updates, which means 2 different writes/updates could be updating 2 different columns in the same row. For a system that tracks data at the column level, this does not matter. But for Hudi, which tracks data at the row level, LWW would end up causing the lost-updates problem, bringing us back to the same problem of update conflicts. Since input data to Hudi generally comes from an external source such as Kafka, which may receive out-of-order writes, Hudi by itself cannot assign ordered timestamps to these updates. Hence, timestamp-based LWW does not directly work here.
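To make the lost-update problem concrete, here is a small hypothetical sketch: two writers read the same base row, each updates a different column, and row-level LWW keeps only the later full-row image, silently dropping the other change.

    // Illustrates why row-level LWW loses partial updates (hypothetical example).
    import java.util.HashMap;
    import java.util.Map;

    public class RowLevelLwwLostUpdate {
        public static void main(String[] args) {
            // Current row state read by both writers.
            Map<String, String> base = Map.of("name", "alice", "city", "SF");

            // Writer 1 (ts=100) updates only "name"; writer 2 (ts=200) updates
            // only "city". Each produces a full row image from its stale read.
            Map<String, String> w1 = new HashMap<>(base);
            w1.put("name", "alicia");
            Map<String, String> w2 = new HashMap<>(base);
            w2.put("city", "NYC");

            // Row-level LWW keeps only the row with the higher timestamp (w2):
            // w1's change to "name" is silently lost.
            Map<String, String> resolved = w2;
            System.out.println(resolved.get("name")); // alice (not alicia!)
            System.out.println(resolved.get("city")); // NYC
        }
    }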

Here, if along with a timestamp-based resolution we also had a deterministic merge function that allows for conflict resolution of these 2 updates, we could apply this function at a later stage and provide correct results. An implementation of such a merge function is a CRDT.

In a CRDT data structure, updates are:

  1. Commutative → The order of updates does not matter
  2. Idempotent → Applying the same update repeatedly does not matter
  3. Associative → The order of groups of updates does not matter

This merge function can be implemented by users to ensure a correct ordering of updates based on the internal semantics of the data, which may be timestamps or other such logic.
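For instance, a user-defined merge could resolve conflicts column by column using an ordering field from the payload (e.g. an event time). The sketch below uses hypothetical names; taking a per-column winner with a deterministic tie-break keeps the merge commutative, associative and idempotent.

    // Sketch of a user-defined, deterministic merge function (hypothetical
    // names; not a real Hudi API). Each column carries its own ordering value.
    import java.util.HashMap;
    import java.util.Map;

    public class ColumnLwwMerge {
        // A cell value paired with a user-supplied ordering value.
        record Cell(long version, String value) {}

        static Map<String, Cell> merge(Map<String, Cell> left, Map<String, Cell> right) {
            Map<String, Cell> out = new HashMap<>(left);
            right.forEach((col, cell) ->
                    out.merge(col, cell, (a, b) -> {
                        if (a.version() != b.version()) {
                            return a.version() > b.version() ? a : b;
                        }
                        // Deterministic tie-break so equal versions still merge
                        // to the same result everywhere.
                        return a.value().compareTo(b.value()) >= 0 ? a : b;
                    }));
            return out;
        }

        public static void main(String[] args) {
            Map<String, Cell> u1 = Map.of("name", new Cell(100, "alicia"));
            Map<String, Cell> u2 = Map.of("city", new Cell(200, "NYC"));
            // Any merge order yields the same result: commutativity in action.
            System.out.println(merge(u1, u2).equals(merge(u2, u1))); // true
        }
    }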

To summarize, using a CRDT we can allow multiple writers to commit concurrently, without the need for a lock to perform conflict resolution, while still providing ACID semantics. When readers construct the latest state of the table, a CRDT ensures the 2 different snapshots of the data get applied and a deterministic, serial order can be maintained on update conflicts.

NOTE: One of the problems of eventual consistency with CRDTs arises with replication across multiple nodes, which makes such a system not ACID compliant. Since Hudi relies on the underlying DFS for replication, this concern does not apply.


Total Ordering and Overwrite Latest Payload 

Although CRDTs possess the above qualities, without a time-based ordering data can be committed in any order and readers can view the data in any order. This violates Hudi's current single-writer serialized snapshot isolation, in which writes always happen in monotonically increasing timestamp order, and that is the order in which readers view the data. Additionally, Hudi already supports some out-of-the-box methodologies to merge existing data with newly written data. Since these cannot rely on the internal semantics of the data (as that is dependent on user payloads), we need a mechanism/rule (another CRDT) to ensure these merge methodologies generate the same correct, expected results during concurrent writes as they would during serial writes.

Here, Hudi's commit timestamp comes to the rescue. Hudi has a _hoodie_commit_time associated with every record. A default implementation of a CRDT would be to provide an ordered set of updates to a datum in order of commit times; applying the merge function in this order creates a CRDT. For Hudi's default implementation, the merge function available is the overwrite_latest_payload, which simply takes the latest record.
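A minimal sketch of this default behavior follows. It is illustrative only and not Hudi's actual payload API: updates are ordered by _hoodie_commit_time and folded with a merge that simply keeps the newer record.

    // Simplified sketch of overwrite-latest merging ordered by commit time.
    // Illustrative only; class and method names are hypothetical.
    import java.util.Comparator;
    import java.util.List;

    public class OverwriteLatestSketch {
        record Update(String commitTime, String rowValue) {}

        // Sort updates by _hoodie_commit_time, then fold them left-to-right
        // with a merge that simply keeps the newer record.
        static String resolve(List<Update> updates) {
            return updates.stream()
                    .sorted(Comparator.comparing(Update::commitTime))
                    .reduce((older, newer) -> newer) // overwrite with latest
                    .orElseThrow()
                    .rowValue();
        }

        public static void main(String[] args) {
            List<Update> updates = List.of(
                    new Update("20230101120000", "v1"),
                    new Update("20230101130000", "v2"));
            System.out.println(resolve(updates)); // v2
        }
    }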


START_COMMIT_TIMESTAMP vs END_COMMIT_TIMESTAMP

Let's take the example of the overlapping transactions shown below.


[Figure: four transactions T1-T4 overlapping in time]


Consider 4 transactions T1, T2, T3, T4 starting and ending at different instants of time. As with general database transactions, one would assume the following serial order for these transactions if they were to be executed serially.


[Figure: assumed serial order of T1-T4]

This works fine for general purpose database transactions where the input (updates to rows) is already prepared before the transaction starts. In Hudi, this is not true. Due to the nature of LAZY evaluation of data (Dataframes, Datasets etc.), the time at which the actual update is prepared can differ from the transaction start time. Hence, ordering transactions/updates on the start_commit_timestamp may not yield correct results.


An alternate option is to order them serially based on the transaction end time (end_commit_timestamp); the serial order would then look as follows.


[Figure: serial order of T1-T4 by end_commit_timestamp]


As you can see, T2 jumps from being the second transaction to the last one.
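The reordering can be reproduced with made-up start/end instants (the numbers below are purely illustrative, not taken from the figures).

    // Orders four hypothetical transactions by start vs. end commit timestamp
    // to show how T2 moves from second to last. Numbers are illustrative only.
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class TxnOrdering {
        record Txn(String name, int start, int end) {}

        public static void main(String[] args) {
            List<Txn> txns = List.of(
                    new Txn("T1", 1, 5),
                    new Txn("T2", 2, 10), // starts early, finishes last
                    new Txn("T3", 3, 6),
                    new Txn("T4", 4, 8));

            System.out.println("by start: " + names(txns, Comparator.comparingInt(Txn::start)));
            System.out.println("by end:   " + names(txns, Comparator.comparingInt(Txn::end)));
            // by start: [T1, T2, T3, T4]
            // by end:   [T1, T3, T4, T2]
        }

        static List<String> names(List<Txn> txns, Comparator<Txn> order) {
            return txns.stream().sorted(order).map(Txn::name).collect(Collectors.toList());
        }
    }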


Now it's hard to determine which order to choose, start or end commit timestamp, since we do not know when the input data is actually materialized. Assuming materialization of data is based on some kind of checkpoint, and that the latest snapshot read for a particular transaction is based on the transaction's start timestamp (since that can be decided at the beginning, in the AbstractHoodieWriteClient), we choose to order transactions based on START_COMMIT_TIMESTAMP.

Note that implementing this requires changes in Hudi to ensure that the latest snapshot of the table is sealed at the start of the transaction and does not change during the course of the transaction, all the way to the end. Without this change, the above assumptions are violated, which would leave the table in an inconsistent state. More specifically, this requires sealing the ActiveTimeline and not reloading it for the entirety of the transaction.


Table Services


Scheduling 


Versioning (Cleaning)


Executing

STRONG BUT NOT ACID - no guarantee of the exact same state across all actors


Periodic, all-replica compression by flattening and rebuilding the tree:

  • Requires consensus, but only rarely
  • Needs to be aborted if there are ongoing updates at any replica

...