Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. insert overwrite: overwrite partitions touched (default):.   Example: Say a table has 3 total partitions (p0, p1, p2). Client performs insert overwrite with 10 records. Lets say all 10 new records belong to p2.  Then overwrite is only performed on p2.  All previous records in p0, p1 will continue to exist as before.
  2. insert overwrite_table: overwrite all partitions. For the above example, p0 and p1 will have 0 records after the write operation completes successfully. p2 will only have new records

...

  • Need to add versioning related complexity to partitions. Example: clean process for older versioned partitions. 
  • Supporting incremental reads across different versions might be difficult.
  • Additional listing is required to figure out latest version (Or rely on consolidated metadata to query latest partition version)
  • HUDI uses user specified partition paths. This is a change of behavior to add a version and additional complexity is required to support all query engines.


API

Regardless of implementation approach chosen, we need to add/change existing high level API. At the moment, we incline towards creating two new operations on spark dataframe instead of adding flags on existing operation because the actions taken are equivalent to deleting all pre-existing data. But open to feedback.

...