
Discussion thread: https://lists.apache.org/thread/7qjzbcfzdshqb3h7ft31v9o3x43t8k6r
Vote thread:
Issue:
Release: TBD

Motivation

Position delete is a solution for implementing the Merge-On-Read (MOR) structure, and it has been adopted by other formats such as Iceberg[1] and Delta[2]. By combining it with Paimon's LSM tree, we can create a position deletion mode unique to Paimon.

Under this mode, extra overhead (lookup and writing the delete file) is introduced during writing, but during reading, data can be retrieved directly as "data + filter with position delete", avoiding the additional merge cost between different files. Furthermore, this mode can be easily integrated into native engine solutions such as Spark + Gluten[3] in the future, thereby significantly enhancing read performance.
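The read path described above can be sketched as follows. This is a minimal illustration, not Paimon's actual API: it assumes each data file carries a bitmap of deleted positions, and the reader simply skips those positions, so no merge between files is needed.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of "data + filter with position delete":
// stream the rows of one data file and drop any row whose position
// is marked in the file's deletion bitmap.
public class PositionDeleteRead {

    // Illustrative helper (not a Paimon API): filter one file's rows
    // by its position-delete bitmap.
    static List<String> readWithDeletes(List<String> rows, BitSet deleted) {
        List<String> result = new ArrayList<>();
        for (int pos = 0; pos < rows.size(); pos++) {
            if (!deleted.get(pos)) {      // position not deleted -> emit as-is
                result.add(rows.get(pos));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a", "b", "c", "d");
        BitSet deleted = new BitSet();
        deleted.set(1);                   // the row at position 1 was deleted
        System.out.println(readWithDeletes(rows, deleted)); // [a, c, d]
    }
}
```

Because the filter is a pure position lookup, each file can be read independently, which is what makes the mode friendly to vectorized/native readers.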

...

A delete file is used to mark deleted rows in the original data files. The following figure illustrates how data is updated and deleted under delete file mode:


(figure: data update and delete flow under delete file mode)

Currently, there are two ways to represent the deletion of records:

...

"data rate / max num" = 20% / 2,000,000, means randomly call "RoaringBitmap.add(x)", which "x" is randomly in the range of 0 to 2,000,000 for a total of 20% * 2,000,000 = 400,000 times to build the bitmap, then serialize it to file, next deserialize from file, finally call "RoaringBitmap.addcontains(x)" for 400,000 times to simulate filter.

Measured: the total time of calling "RoaringBitmap.add(x)", the time to serialize to file, the time to deserialize from file, the serialized file size, and the total time of calling "RoaringBitmap.contains(x)".

| data rate / max num | add (ms) | serialization (ms) | deserialization (ms) | file size (MB) | contains (ms) |
|---------------------|----------|--------------------|----------------------|----------------|---------------|
| 20% / 2,000,000 | 43 | 5 | 26 | 0.24 | 7 |
| 50% / 2,000,000 | 47 | 3 | 52 | 0.24 | 5 |
| 80% / 2,000,000 | 57 | 1 | 24 | 0.24 | 8 |
| 20% / 20,000,000 | 450 | 13 | 247 | 2.4 | 49 |
| 50% / 20,000,000 | 629 | 6 | 222 | 2.4 | 76 |
| 80% / 20,000,000 | 1040 | 5 | 222 | 2.4 | 121 |
| 20% / 200,000,000 | 5079 | 44 | 2262 | 24 | 442 |
| 50% / 200,000,000 | 9469 | 43 | 2773 | 24 | 1107 |
| 80% / 200,000,000 | 13625 | 38 | 2233 | 24 | 1799 |
| 20% / 2,000,000,000 | 93753 | 568 | 22290 | 239 | 5747 |
| 50% / 2,000,000,000 | 166070 | 679 | 22339 | 239 | 14735 |
| 80% / 2,000,000,000 | 218233 | 553 | 22684 | 239 | 26504 |
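The measurement procedure behind the table above can be sketched as follows. The reported numbers were collected with RoaringBitmap; the sketch substitutes java.util.BitSet purely so it compiles without extra dependencies (a real run would swap in org.roaringbitmap.RoaringBitmap, whose add/serialize/deserialize/contains calls map one-to-one onto the phases below).

```java
import java.util.BitSet;
import java.util.Random;

// Sketch of the benchmark: build a bitmap from random positions,
// serialize it, deserialize it, then probe it with the same positions.
public class BitmapBench {

    // Returns {addMs, serializeMs, deserializeMs, sizeBytes, containsMs, hits}.
    static long[] run(int maxNum, double dataRate) {
        int n = (int) (maxNum * dataRate);
        int[] keys = new Random(42).ints(n, 0, maxNum).toArray();

        long t0 = System.nanoTime();
        BitSet bitmap = new BitSet(maxNum);
        for (int k : keys) bitmap.set(k);             // "add" phase
        long addMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        byte[] bytes = bitmap.toByteArray();          // "serialize" phase
        long serMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        BitSet loaded = BitSet.valueOf(bytes);        // "deserialize" phase
        long deserMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        long hits = 0;
        for (int k : keys) if (loaded.get(k)) hits++; // "contains" phase
        long containsMs = (System.nanoTime() - t0) / 1_000_000;

        return new long[] {addMs, serMs, deserMs, bytes.length, containsMs, hits};
    }

    public static void main(String[] args) {
        long[] r = run(2_000_000, 0.2);
        System.out.printf("add=%dms ser=%dms deser=%dms size=%dB contains=%dms%n",
                r[0], r[1], r[2], r[3], r[4]);
    }
}
```

Note that every added position must be found again after deserialization, which the "hits" counter verifies; the absolute timings will of course differ from the table, which measured the RoaringBitmap implementation.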


To summarize the key points:

...

  • Currently, when 'changelog-producer' = 'lookup' is set, the data write is not atomic but split into two steps: first, data is written to create snapshot1; then lookup compaction generates snapshot2. We need to consider the atomicity of this.
  • In most cases, data is written to level-0 first and then rewritten. This write overhead is somewhat high, and perhaps some optimization can be done here.
  • If a changelog needs to be generated, in theory the changelog and the delete file can be produced simultaneously (without reading twice).
  • The merge engine remains available.

...

  1. Impact on file meta: Currently, the stats (min, max, null count) in file meta are already unreliable, so no special handling will be performed for this aspect.
  2. ...

Compatibility, Deprecation, and Migration Plan

Conversion between delete file mode (just temporarily call it) and original LSM

  1. LSM -> delete file mode: can be switched directly (add a parameter to control whether the delete file is enabled).
  2. delete file mode -> LSM: in theory, perform a full compaction, then clean up the old snapshots, and then disable delete file mode.


[1]: https://github.com/apache/iceberg

...