...

While we could merely refer to this as stream processing, we call it incremental processing to distinguish it from purely stream-processing solutions such as pipelines built using Apache Flink, Apache Apex, or Apache Kafka Streams.

What is the difference between copy-on-write (COW) vs merge-on-read (MOR) storage types?

Copy On Write - This storage type enables clients to ingest data on columnar file formats, currently parquet. Any new data written to a Hudi dataset using the COW storage type results in new parquet files. Updating an existing set of rows will result in a rewrite of the entire parquet files that collectively contain the affected rows. Hence, all writes to such datasets are limited by parquet writing performance: the larger the parquet file, the longer it takes to ingest the data.
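
To make this concrete, here is a minimal sketch of a COW upsert through the Spark datasource. The table name, base path, and field names are hypothetical placeholders; recent releases use hoodie.datasource.write.table.type, while older ones spell the same setting hoodie.datasource.write.storage.type.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("hudi-cow-upsert").getOrCreate()
    import spark.implicits._

    // Hypothetical input: records keyed by `uuid`, de-duplicated by latest `ts`.
    val inputDF = Seq(
      ("id-1", 1L, "2019/01/01"),
      ("id-2", 2L, "2019/01/02")
    ).toDF("uuid", "ts", "partitionpath")

    inputDF.write
      .format("org.apache.hudi")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.table.name", "hudi_trips_cow")
      .mode(SaveMode.Append)
      .save("/tmp/hudi_trips_cow")

Every such upsert rewrites the parquet files containing the affected record keys, which is exactly what bounds COW write latency as described above.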

Merge On Read - This storage type enables clients to ingest data quickly onto row-based data formats such as avro. Any new data written to a Hudi dataset using the MOR storage type will write new log/delta files that internally store the data as avro-encoded bytes. A compaction process (configured as inline or asynchronous) converts the log file format to the columnar file format (parquet). Two different InputFormats expose two different views of this data: the Read Optimized view (HoodieInputFormat) exposes columnar parquet reading performance, while the Realtime view (HoodieRealTimeInputFormat) exposes columnar and/or log reading performance. Updating an existing set of rows will result in either a) a companion log/delta file for an existing base parquet file generated from a previous compaction, or b) an update written to a log/delta file if no compaction has ever happened for it. Hence, all writes to such datasets are limited by avro/log file writing performance, which is much faster than parquet. However, there is a higher cost to read log/delta files compared to columnar (parquet) files.
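
As a rough sketch of the write side (reusing the hypothetical inputDF and field names from the COW example above), switching to MOR and controlling when log files get folded back into parquet could look like the following. hoodie.compact.inline and hoodie.compact.inline.max.delta.commits are Hudi's inline-compaction knobs; the values shown are only illustrative.

    import org.apache.spark.sql.SaveMode

    // Sketch: upserts land in avro log/delta files instead of rewriting parquet.
    inputDF.write
      .format("org.apache.hudi")
      .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
      .option("hoodie.datasource.write.precombine.field", "ts")
      // Compact log files into parquet inline, every 5 delta commits (illustrative).
      .option("hoodie.compact.inline", "true")
      .option("hoodie.compact.inline.max.delta.commits", "5")
      .option("hoodie.table.name", "hudi_trips_mor")
      .mode(SaveMode.Append)
      .save("/tmp/hudi_trips_mor")

When a MOR dataset is registered with Hive, the two views typically appear as separate tables suffixed _ro (read optimized) and _rt (realtime), so a query picks its cost/freshness trade-off simply by choosing which table to hit.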

...

How do I choose a storage type for my workload?


A key goal of Hudi is to provide upsert functionality that is orders of magnitude faster than rewriting entire tables or partitions. 


Choose copy-on-write (COW) storage if:

  • You are looking for a simple alternative that replaces your existing parquet tables, without any need for real-time data.
  • Your current job is rewriting the entire table/partition to deal with updates, while only a few files actually change in each partition.
  • You are happy keeping things operationally simpler (no compaction etc.), with the ingestion/write performance bound by the parquet file size and the number of such files affected/dirtied by updates.
  • Your workload is fairly well understood and does not have sudden bursts of large amounts of updates or inserts to older partitions. COW absorbs all the merging cost on the writer side, so such sudden changes can clog up your ingestion and interfere with meeting normal-mode ingest latency targets.


Choose merge-on-read (MOR) storage if:

  • You want the data to be ingested and queryable as quickly as possible.
  • Your workload can have sudden spikes/changes in pattern (e.g., bulk updates to older transactions in an upstream database causing lots of updates to old partitions on DFS). Asynchronous compaction helps amortize the write amplification caused by such scenarios, while normal ingestion keeps up with the incoming stream of changes; see the configuration sketch below.
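
To sketch the last point on the write side (keys are Hudi compaction configs; values illustrative), the writer simply skips inline compaction so ingestion never blocks on the columnar rewrite, and compaction is left to an out-of-band process such as DeltaStreamer's continuous mode or the Hudi CLI:

    // Sketch only: MOR write options for bursty-update workloads. Turning off
    // inline compaction means the writer never blocks on the columnar rewrite;
    // a separate asynchronous job compacts the log files later.
    val asyncCompactionOpts = Map(
      "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
      "hoodie.compact.inline"              -> "false"
    )

These could be passed to the same datasource write as in the MOR sketch above via .options(asyncCompactionOpts).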

Regardless of which storage type you choose, Hudi provides:

  • Snapshot isolation and atomic writes of batches of records
  • Incremental pulls (see the read sketch below)
  • Ability to de-duplicate data
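
As a sketch of the incremental pull mentioned above (option keys are Hudi's incremental read configs; the begin instant and path are placeholders, and older releases use hoodie.datasource.view.type instead of hoodie.datasource.query.type):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hudi-incremental-pull").getOrCreate()

    // Sketch: read only records committed after the given instant time.
    val incrementalDF = spark.read
      .format("org.apache.hudi")
      .option("hoodie.datasource.query.type", "incremental")
      // Commits strictly after this instant are pulled; format is yyyyMMddHHmmss.
      .option("hoodie.datasource.read.begin.instanttime", "20190101000000")
      .load("/tmp/hudi_trips_cow")

    incrementalDF.createOrReplaceTempView("hudi_trips_incremental")

A downstream job can persist the last instant it processed and pass it as the begin instant on its next run, which is how incremental pipelines chain off a Hudi dataset.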


[draw.io diagram: TableTypeChoiceFlowDiagram]

Find more details on trade-offs between COW & MOR storage types here.

Is Hudi an analytical database? 

...