What is the difference between COW (copy on write) and MOR (merge on read) storage types?

Copy On Write - This table type enables clients to append/upsert data on columnar file formats, currently parquet. Any new data written to a Hudi dataset using the COW table type is written to new parquet files. Updating an existing set of rows rewrites, in full, every parquet file that contains any of the affected rows. Hence, all writes to such datasets are limited by parquet writing performance: the larger the parquet file, the longer it takes to ingest the data.
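To make the cost of an update concrete, here is a minimal sketch of the copy-on-write idea in plain Python. It is a toy model and not Hudi code: the CowTable class, its max_rows_per_file parameter, and the rows_written counter are hypothetical names invented for illustration, and each "file" is just an in-memory dict standing in for a parquet file.

```python
from dataclasses import dataclass, field


@dataclass
class CowTable:
    """Toy copy-on-write table (hypothetical, not a Hudi API).

    Each "file" is a dict of key -> value standing in for one parquet file.
    """
    max_rows_per_file: int = 4
    files: list = field(default_factory=list)
    rows_written: int = 0  # total rows physically written so far

    def upsert(self, records: dict) -> int:
        """Apply records and return how many files this call wrote."""
        pending = dict(records)
        new_files = []
        files_written = 0
        for f in self.files:
            # Pull out the incoming records whose keys already live in this file.
            hits = {k: pending.pop(k) for k in list(pending) if k in f}
            if hits:
                # A single updated row forces a rewrite of the whole file.
                f = {**f, **hits}
                self.rows_written += len(f)
                files_written += 1
            new_files.append(f)
        # Whatever is left is new data, which goes into brand-new files.
        items = list(pending.items())
        for i in range(0, len(items), self.max_rows_per_file):
            chunk = dict(items[i:i + self.max_rows_per_file])
            new_files.append(chunk)
            self.rows_written += len(chunk)
            files_written += 1
        self.files = new_files
        return files_written


table = CowTable(max_rows_per_file=4)
table.upsert({f"k{i}": "v0" for i in range(8)})  # inserts: two new files
table.upsert({"k0": "v1"})                       # update: rewrites one whole file
print(table.rows_written)
```

In this model, updating a single row costs as many written rows as the file that holds it, which is why larger files make updates slower under COW.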

...

More details can be found here.

How do I choose a storage type for my workload?

[Diagram: TableTypeChoiceFlowDiagram]

...

Find more details on the trade-offs between COW and MOR storage types here.

How do I achieve fast ingestion and upserts without creating a ton of small files?

HoodieWriteConfig exposes knobs to allow for such flexibility. 

...

HoodieDeltaStreamer users

HoodieWriteClient users

I see many versions of the same data. How can I control this?


How do I control partitioning for a Hudi table?