...

  1. Clustering is scheduled (files f1,f2,f3 -> g1)
  2. Clustering is inflight
  3. Ingestion writes scheduled
  4. Ingestion writes inflight
  5. Ingestion has updates for f1
  6. Clustering finishes: after taking a lock and checking that no other commit has succeeded, it writes the REPLACE file with the mapping f1,f2,f3 -> g1, or puts this information in the consolidated metadata
  7. Ingestion tries to finish and acquires a lock
  8. Ingestion sees that clustering has finished in the meantime, then reads the clustering file ids and intersects them with its own file ids
  9. f1 is overlapping
  10. Ingestion writes new REPLACE metadata to reverse the f1,f2,f3 -> g1 mapping done earlier, since that mapping contains the overlapping file ids.
    1. NOTE: the entire mapping needs to be reversed, since records can go from M file groups to N file groups
    2. The previous version (before clustering) must not be cleaned
  11. The merge of the consolidated metadata, or of the REPLACE timeline, then has to take this reversal into account
  12. Side effects
    1. Redundant clustering operation: the previous one's work is not used, and another one needs to be scheduled
    2. Queries will ping-pong between different file counts and layouts
  13. The idea is to cluster data that is not being updated, so this is acceptable in cases like this
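
The conflict check in steps 8 to 10 can be sketched roughly as follows. This is only an illustration, not the actual implementation: `ReplaceCommit`, `reverse`, and `resolve_ingestion_commit` are hypothetical names, file groups are modeled as plain strings, and locking is assumed to be handled by the caller.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ReplaceCommit:
    """Hypothetical stand-in for REPLACE metadata: source file groups -> target file groups."""
    sources: FrozenSet[str]
    targets: FrozenSet[str]

    def reverse(self) -> "ReplaceCommit":
        # Reverse the entire mapping, not just the overlapping ids,
        # because records can move from M file groups to N file groups.
        return ReplaceCommit(sources=self.targets, targets=self.sources)


def resolve_ingestion_commit(
    clustering: ReplaceCommit, updated_file_ids: FrozenSet[str]
) -> Optional[ReplaceCommit]:
    """Assumed to be called by ingestion while it holds the commit lock.

    Returns the reversing REPLACE metadata if the finished clustering
    overlaps with the file ids ingestion updated, otherwise None.
    """
    overlap = clustering.sources & updated_file_ids
    if not overlap:
        return None  # no conflict, the clustering result stays in place
    return clustering.reverse()


# Scenario from the steps above: clustering maps f1,f2,f3 -> g1, ingestion updated f1.
clustering = ReplaceCommit(frozenset({"f1", "f2", "f3"}), frozenset({"g1"}))
rollback = resolve_ingestion_commit(clustering, frozenset({"f1"}))
```

In this sketch a non-`None` result stands for the new REPLACE metadata written in step 10; keeping the pre-clustering file versions from being cleaned is outside its scope.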

...