...
This means that MOR incremental pull in Spark, Presto and Hive has to implement the same update-ordering semantics as the RealtimeView and compaction.
NOTE that incremental pull based on END_COMMIT_TIME is still required for COPY_ON_WRITE tables using optimistic concurrency control management.
Overhead of MergeOnRead with CRDTs
COW vs MOR
Let's dive deeper into the merge cost of MergeOnRead tables.
Current Costs
COW read path
- Vectorized retrieval of latest value of columns
- Predicate pushdown to parquet reader
- No need to perform "merge" operation
- No materialization of columns to rows
- ONE cost for serialization and deserialization, depending on the engine:
  - Parquet to InternalRow for Spark
  - Parquet to ColumnRow for Presto
  - Parquet to ArrayWritable for Hive
MOR read path
- No vectorized retrieval of latest value of columns
- Need to perform "merge" operation, which requires memory and CPU
- Materialization of columns to rows (the whole row is required), which increases memory usage and the I/O cost of reading data from disk
- THREE costs for serialization and deserialization:
  - Parquet to GenericRecord
  - LogBlock Avro bytes to GenericRecord
  - GenericRecord to InternalRow, ColumnRow or ArrayWritable
- No predicate pushdown, even with a parquet base file, since the whole row is required to merge contents
As we see above, the MOR read path already performs an "in-memory" merge step. Since CRDTs have to be commutative, this step should ensure the correct result is generated after applying the rules, regardless of the order in which updates arrive.
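The commutativity requirement can be illustrated with a minimal sketch. It is not Hudi code: the `Update` type, its `ordering` field and the `merge` rule are hypothetical names chosen for illustration. The rule keeps the update with the larger `(ordering, value)` pair, so applying the same updates in any order should give the same result.

```python
from dataclasses import dataclass
from functools import reduce
from itertools import permutations


@dataclass(frozen=True)
class Update:
    # Hypothetical update record; not a Hudi class.
    key: str
    ordering: int  # assumed ordering field, e.g. an event timestamp
    value: str


def merge(a: Update, b: Update) -> Update:
    """Commutative merge rule: keep the update with the larger (ordering, value)."""
    return a if (a.ordering, a.value) >= (b.ordering, b.value) else b


updates = [Update("k1", 3, "c"), Update("k1", 1, "a"), Update("k1", 2, "b")]

# Apply the updates in every possible arrival order and collect the distinct results.
results = {reduce(merge, order) for order in permutations(updates)}
```

Because the rule is commutative and associative, `results` collapses to a single value no matter how the updates are permuted.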
OverwriteLatestPayload is NOT a CRDT. Since updates can happen out of order, this might require first sorting the LogBlocks in increasing order of START_COMMIT_TIME and then applying the OverwriteLatestPayload.
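A minimal sketch of the sort-then-apply idea follows. The `LogBlock` class, `overwrite_latest` and `merge_log_blocks` are hypothetical stand-ins, not Hudi's actual classes, and the sketch assumes commit times are fixed-width strings so that string comparison matches chronological order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Row = Dict[str, object]


@dataclass
class LogBlock:
    # Hypothetical stand-in for a log block; not Hudi's actual class.
    start_commit_time: str   # START_COMMIT_TIME, assumed fixed-width
    records: Dict[str, Row]  # record key -> full row


def overwrite_latest(current: Optional[Row], incoming: Row) -> Row:
    """Overwrite-latest rule: the incoming row always replaces the current one."""
    return incoming


def merge_log_blocks(base: Dict[str, Row], blocks: List[LogBlock]) -> Dict[str, Row]:
    """Sort blocks by START_COMMIT_TIME, then apply the overwrite rule in that order."""
    merged = dict(base)  # copy so the base records are left untouched
    for block in sorted(blocks, key=lambda b: b.start_commit_time):
        for key, row in block.records.items():
            merged[key] = overwrite_latest(merged.get(key), row)
    return merged


base = {"k1": {"v": 0}}
# Blocks arrive out of order: commit "003" is seen before commit "002".
blocks = [LogBlock("003", {"k1": {"v": 3}}), LogBlock("002", {"k1": {"v": 2}})]
merged = merge_log_blocks(base, blocks)
```

Without the sort, applying the blocks in arrival order would leave `k1` with the row from commit "002". With the sort, the result no longer depends on arrival order, at the cost of an extra ordering pass over the log blocks.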