...
- Clustering is scheduled (files f1,f2,f3 -> g1)
- Clustering is inflight
- Ingestion writes scheduled
- Ingestion writes inflight
- Ingestion has updates for f1
- Clustering finishes: after taking a lock and checking that no other commit has succeeded, it puts the REPLACE file with the mapping f1,f2,f3 -> g1, or puts this information in the consolidated metadata
- Ingestion tries to finish and acquires a lock
- Sees that clustering has finished in the meantime; reads the clustering file ids and intersects them with its own file ids
- f1 is overlapping
- Writes new REPLACE metadata to reverse the earlier mapping f1,f2,f3 -> g1, which contains the overlapping fileIds
- NOTE that the entire mapping needs to be reversed since records can go from M file groups to N file groups
- Need to ensure that the previous version (before clustering) is not cleaned
- The merge of the consolidated metadata or the REPLACE timeline then has to take care of this scenario
- Side effects
- Redundant clustering operation: the previous one’s work is not used, and another one needs to be scheduled
- Queries will ping-pong between layouts with different numbers of files
- The idea is to cluster data that is not being updated, so this is acceptable in cases like this
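
The commit-time check above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `ReplaceEntry`, `resolve_at_ingestion_commit`, and `visible_file_groups` are hypothetical names, the lock is assumed to be held by the caller, and the REPLACE timeline is modeled as a plain in-memory list.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class ReplaceEntry:
    """Hypothetical REPLACE record: `replaced` file groups are superseded by `replacement`."""
    replaced: FrozenSet[str]
    replacement: FrozenSet[str]


def resolve_at_ingestion_commit(
    clustering: Optional[ReplaceEntry],
    updated_file_ids: FrozenSet[str],
) -> Optional[ReplaceEntry]:
    """Run by ingestion while it holds the commit lock (locking itself is assumed).

    Returns a reversing REPLACE entry if the finished clustering touched any
    file group that ingestion updated, otherwise None.
    """
    if clustering is None:
        return None  # no clustering finished in the meantime
    overlap = clustering.replaced & updated_file_ids
    if not overlap:
        return None  # disjoint file groups, nothing to undo
    # Reverse the entire mapping, not just the overlapping ids, because
    # records can move from M file groups to N file groups.
    return ReplaceEntry(replaced=clustering.replacement,
                        replacement=clustering.replaced)


def visible_file_groups(all_groups: Iterable[str],
                        replace_timeline: Iterable[ReplaceEntry]) -> Set[str]:
    """Apply REPLACE entries in timeline order to get the queryable file groups."""
    visible = set(all_groups)
    for entry in replace_timeline:
        visible -= entry.replaced
        visible |= entry.replacement
    return visible
```

With the clustering entry f1,f2,f3 -> g1 and an ingestion that updated f1, `resolve_at_ingestion_commit` yields the entry g1 -> f1,f2,f3, and replaying both entries makes f1,f2,f3 visible again. This only works if the pre-clustering versions of those file groups have not been cleaned.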
...