...

We can use a typical tombstone approach by adding an entry with a null value to the bucket. During index lookup, we can still maintain the same parallelism and look up all HFiles within a given bucket in parallel, but choose the latest value if multiple values are returned. (For example, if HFile1 returns FileId1 for record1 and HFile3 returns null for record1, we take HFile3's value and consider record1 deleted.) For the commit timestamp, we should rely on either the filename or the commit metadata rather than embedding the commit timestamp in the value of every entry, which would noticeably bloat the index size.
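To make the resolution rule concrete, here is a minimal sketch (not Hudi's actual implementation). LookupResult, the per-HFile lookup results, and the commit-time ordering derived from file names are hypothetical stand-ins:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical result of looking up one key in a single HFile of a bucket.
// commitTime is derived from the HFile name (or commit metadata), not from the value.
record LookupResult(String hfileName, long commitTime, String fileId /* null => tombstone */) {}

class IndexLookup {
  /**
   * Resolve the per-HFile lookup results for one record key:
   * the result from the latest commit wins; a null fileId there means the key is deleted.
   */
  static Optional<String> resolve(List<LookupResult> results) {
    return results.stream()
        .max(Comparator.comparingLong(LookupResult::commitTime))
        // Latest entry wins; a tombstone (null fileId) maps to "not found".
        .flatMap(latest -> Optional.ofNullable(latest.fileId()));
  }

  public static void main(String[] args) {
    // HFile1 (commit 1) maps record1 -> FileId1; HFile3 (commit 3) carries a tombstone.
    List<LookupResult> results = List.of(
        new LookupResult("HFile1", 1L, "FileId1"),
        new LookupResult("HFile3", 3L, null));
    System.out.println(resolve(results)); // Optional.empty => record1 is considered deleted
  }
}
```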

Adding delete support might complicate compaction a bit. Compaction might have to read the entire content of all HFiles being compacted in order to find the resolved value for every entry before writing the new base HFile; hopefully this does not add too much overhead. Also, compaction will ignore deleted entries to avoid spending storage on them, so it may not be possible to tell whether a record was never inserted or was deleted after insertion. Let me know if this is an issue for Hoodie.
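A rough sketch of that resolution step during compaction, again with hypothetical types (entriesByCommitTimeAscending stands in for reading the HFiles being compacted in ascending commit order):

```java
import java.util.HashMap;
import java.util.Map;

class IndexCompaction {
  /**
   * Merge the key -> fileId entries from the HFiles being compacted, supplied in ascending
   * commit-time order, so that the latest value for each key wins. Tombstones (null values)
   * are applied and then dropped, so the new base HFile stores no delete entries.
   */
  static Map<String, String> mergeForBaseFile(Iterable<Map<String, String>> entriesByCommitTimeAscending) {
    Map<String, String> merged = new HashMap<>();
    for (Map<String, String> hfileEntries : entriesByCommitTimeAscending) {
      for (Map.Entry<String, String> e : hfileEntries.entrySet()) {
        if (e.getValue() == null) {
          merged.remove(e.getKey());            // tombstone: drop the key entirely
        } else {
          merged.put(e.getKey(), e.getValue()); // later commit overwrites earlier value
        }
      }
    }
    return merged; // contents of the new base HFile (no deleted entries)
  }
}
```

Because later commits simply overwrite earlier ones in this merge, a key that is deleted and later re-inserted resolves to the re-inserted value, which is what the note below relies on.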

Note: one point to remember is that we should also support re-insertion of deleted entries. The scheme described above works for this use case as well, without any modifications.

...