...

Read more here to understand Hudi's concurrency control.

Guarantees

Hudi uses MVCC to ensure snapshot isolation and provide ACID semantics between a single writer and multiple readers. 

In this RFC, we propose to support a feature that allows concurrent writing to a Hudi table. This will enable users to start multiple writer jobs writing in parallel to non-overlapping files. If a concurrent write to a file is issued from 2 writers, the first one to commit will succeed. The following guarantees provided by Hudi's single writer model will NOT be upheld in multi writer mode (a configuration sketch follows the list below):

  1. Unique records across multiple writers during inserts
    1. If multiple writers are writing to different files in parallel, Hudi cannot guarantee uniqueness of keys across partitions, unlike the single writer model. Ensuring unique keys is left up to the users
  2. Read serializability across partitions
    1. Since different writers to different files can finish at varying times, commits can complete in any order, so Hudi cannot guarantee read serializability.
  3. Global index support (only HbaseIndex)
    1. Since a Global Index (e.g. HbaseIndex) requires a unique key across all partitions in the table, Hudi cannot support Global Index for tables requiring multi writing, due to constraint (1) above. Note that GlobalSimpleIndex works fine, since it first partitions based on HoodieKey and then checks the record key per partition.
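
As referenced above, here is a minimal sketch of how a writer job might opt into the proposed multi writer mode through the Spark datasource. The option keys (hoodie.write.concurrency.mode, hoodie.cleaner.policy.failed.writes, hoodie.write.lock.provider) and the ZooKeeper lock provider class are illustrative of what this proposal would introduce, not a finalized API; the lock provider would also need its own connection settings, omitted here.

Code Block
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MultiWriterJob {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-multi-writer-sketch")
        .master("local[2]")
        .getOrCreate();

    // Hypothetical input path; each concurrent writer ingests its own batch.
    Dataset<Row> batch = spark.read().json("/tmp/input/batch1");

    batch.write().format("hudi")
        .option("hoodie.table.name", "test_table")
        .option("hoodie.datasource.write.recordkey.field", "key")
        .option("hoodie.datasource.write.partitionpath.field", "partition")
        // Illustrative options for the proposed multi writer mode:
        .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
        // LAZY failed-write cleaning avoids writers rolling back each other's inflight data.
        .option("hoodie.cleaner.policy.failed.writes", "LAZY")
        .option("hoodie.write.lock.provider",
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/test_table");

    spark.stop();
  }
}

Each writer would run such a job independently (for example, against different partitions); if two writers touch the same file, the first one to commit wins and the other must retry.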

Implementation

The good news is that Hudi's MVCC-based reader/writer isolation has already laid a great foundation for multi-writer support. However, a few features currently prevent supporting multiple writers. The following section describes what those features are and how we plan to alter them to support multi writer.

...

The current Hudi writer job supports automatic inline rollback. The rollback feature is used to clean up any failed writes that may have happened in the immediate past runs of the job. The cleanup process can involve removing invalid data files and index information, as well as any metadata used to constitute the Hudi table. Hudi does this by checking whether a commit under the .hoodie directory is a completed commit or an inflight commit. If it is an inflight commit, Hudi performs the necessary rollback before progressing to ingest the next batch of data. When multiple different clients attempt to write to the table, this can result in jobs rolling back each other's inflight data, leading to chaos and corruption. To support rollback with multiple writers, we need a way to avoid this. To remove inline rollbacks, there are a few things that need to be done.
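
For illustration, below is a minimal sketch of the inline check described above, assuming a local filesystem layout and Hudi's ".inflight" instant suffix; a real writer goes through Hudi's timeline APIs rather than listing files directly.

Code Block
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class InlineRollbackCheck {
  public static void main(String[] args) throws IOException {
    // Hypothetical table path; .hoodie holds the timeline metadata.
    Path metaDir = Paths.get("/tmp/hudi/test_table/.hoodie");

    try (Stream<Path> files = Files.list(metaDir)) {
      // Instants still carrying the ".inflight" suffix never completed; the
      // single-writer model rolls these back before ingesting the next batch.
      List<Path> inflight = files
          .filter(p -> p.getFileName().toString().endsWith(".inflight"))
          .collect(Collectors.toList());

      for (Path instant : inflight) {
        System.out.println("Would roll back failed write: " + instant.getFileName());
      }
    }
  }
}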

...

We need to support a test-and-set kind of mechanism that provides unique instant times to multiple writers.
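
Below is a minimal sketch of such a test-and-set generator, assuming Hudi's yyyyMMddHHmmss instant-time format. In practice the compare-and-set would be backed by an external coordinator (e.g. ZooKeeper), since writers run in separate processes; a JVM atomic is used here only to illustrate the idea.

Code Block
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.atomic.AtomicReference;

public class InstantTimeGenerator {
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
  private final AtomicReference<String> lastIssued = new AtomicReference<>("");

  public String nextInstantTime() {
    while (true) {
      String candidate = LocalDateTime.now().format(FORMAT);
      String prev = lastIssued.get();
      // Test-and-set: hand out the candidate only if it is strictly newer than
      // the last issued instant; compareAndSet fails if another caller won the race.
      if (candidate.compareTo(prev) > 0 && lastIssued.compareAndSet(prev, candidate)) {
        return candidate;
      }
      // Same-second collision: wait for the clock to advance, then retry.
      try {
        Thread.sleep(100);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while generating instant time", ie);
      }
    }
  }
}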

Pros

  • No loopholes; guaranteed to provide unique commit times in all cases

...

Once the proposed solution is implemented, users will be able to run jobs in parallel by simply launching multiple writers. Upgrading to the latest Hudi release providing this feature will automatically upgrade your timeline layout.

Test Plan


Code Block
1) Test the following scenario with a long running job

// Add compaction, clustering
// Check archival kicking in
// Schema Evolution ? (older commit vs new commit)
For TableTypes in (COW, MOR)
	For ConcurrencyMode in (SingleWriter, OCC (with non-overlapping & overlapping file ids))  // see the OCC conflict-check sketch below the test plan
		Test 1:  EAGER cleaning, without metadata (validate no regression)
		Test 2:  EAGER cleaning, with metadata (validate no regression)
		Test 3:  EAGER cleaning, with metadata, with async cleaning (validate no regression)
		Test 4:  LAZY cleaning, without metadata
		Test 5:  LAZY cleaning, with metadata
		Test 6:  LAZY cleaning, with metadata, with async cleaning
		Test 7:  Incremental Pull (out of order execution of commits for OCC)


2) Add MultiWriter test cases

3) Enhance Hoodie Test Suite to test Spark Datasource with Locks
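
To sketch the OCC conflict check exercised by the overlapping-file-id case in the matrix above, the following self-contained simulation treats each commit as a set of touched file ids and fails any commit that overlaps with a concurrent, already-completed commit. The names used (tryCommit, completed, seenAtStart) are hypothetical, not Hudi APIs.

Code Block
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class OccConflictSimulation {
  // Completed commits: instant time -> file ids written by that commit.
  private static final Map<String, Set<String>> completed = new ConcurrentHashMap<>();

  // A commit fails if any commit that completed after this writer started
  // touches an overlapping file id.
  static synchronized boolean tryCommit(String instant, Set<String> fileIds,
                                        Set<String> seenAtStart) {
    for (Map.Entry<String, Set<String>> e : completed.entrySet()) {
      if (!seenAtStart.contains(e.getKey())
          && !Collections.disjoint(e.getValue(), fileIds)) {
        return false; // conflicting concurrent write: abort, caller retries
      }
    }
    completed.put(instant, fileIds);
    return true;
  }

  public static void main(String[] args) {
    // Both writers start from the same snapshot of the timeline.
    Set<String> base = new HashSet<>(completed.keySet());
    boolean w1 = tryCommit("001", Set.of("file-A", "file-B"), base);
    boolean w2 = tryCommit("002", Set.of("file-B", "file-C"), base); // overlaps on file-B
    System.out.println("writer1 committed: " + w1); // true
    System.out.println("writer2 committed: " + w2); // false -> retry required
  }
}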

...