...

Storage Type | Type of workload | Performance | Tips

  • copy on write | bulk_insert
    Performance: Should match vanilla Spark writing, plus an additional sort to properly size files.
    Tips: Properly size the bulk insert parallelism to get the right number of files; use insert if you want this auto-tuned (see the write sketch after this table).

  • copy on write | insert
    Performance: Similar to bulk insert, except the file sizes are auto-tuned, which requires the input to be cached into memory and custom partitioned.
    Tips: Performance is bound by how parallel you can write the ingested data. Tune this limit up if you see that writes are happening from only a few executors.

  • copy on write | upsert / de-duplicate & insert
    Performance: Both of these involve an index lookup. Compared to naively using Spark's (or a similar framework's) JOIN to identify the affected records, Hudi indexing is often 7-10x faster as long as you have ordered keys (discussed below) or <50% updates. Compared to naively overwriting entire partitions, a Hudi write can be several orders of magnitude faster depending on how many files in a given partition are actually updated. For example, if a partition has 1000 files of which only 100 are dirtied on every ingestion run, Hudi reads/merges only those 100 files and is thus 10x faster than naively rewriting the entire partition. Ultimately, performance is bound by how quickly we can read and write a parquet file, which depends on the parquet file size (see the file sizing configs discussed below). Also be sure to properly tune your bloom filters; HUDI-56 will auto-tune this.

  • merge on read | bulk insert
    Performance: Currently new data only goes to parquet files, and thus performance here should be similar to copy_on_write bulk insert. This has the nice side effect of getting data into parquet directly for query performance. HUDI-86 will add support for logging inserts directly and should speed this up drastically.

  • merge on read | insert
    Performance: Similar to above.

  • merge on read | upsert / de-duplicate & insert
    Performance: Indexing performance remains the same as copy-on-write, while updates (the costliest I/O operation in copy_on_write) are written to log files; combined with asynchronous compaction, this provides very good ingest performance with low write amplification.
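
The parallelism and bloom filter tips above map to a handful of write configs. Below is a minimal sketch of a Spark DataSource write, assuming recent Hudi option key names; the table path, field names, and values are placeholders, not recommendations.

```scala
// Sketch of a Hudi write via the Spark DataSource API (option keys per recent
// Hudi releases; adjust for your version). Paths, fields, and values are placeholders.
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-write-sketch").getOrCreate()
val inputDF: DataFrame = spark.read.parquet("/tmp/input")      // hypothetical input

inputDF.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")  // or "insert" / "upsert"
  .option("hoodie.datasource.write.recordkey.field", "key")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // parallelism knobs referenced in the tips column
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")
  .option("hoodie.upsert.shuffle.parallelism", "200")
  // bloom filter tuning mentioned in the upsert row (values shown are the defaults)
  .option("hoodie.index.bloom.num_entries", "60000")
  .option("hoodie.index.bloom.fpp", "0.000000001")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/my_table")
```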


Like many typical systems that manage time-series data, Hudi performs much better if your keys have a timestamp prefix or are monotonically increasing/decreasing. You can almost always achieve this. Even if you have UUID keys, you can follow tricks like this to get keys that are ordered. See also the Tuning Guide for more tips on JVM and other configurations. 
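
As a small illustration of that point (this is not a Hudi API; the helper below is hypothetical): prefixing the record key with the event timestamp keeps keys roughly monotonic even when the unique part is a UUID, which lets key-range pruning during index lookup skip most files.

```scala
import java.util.UUID

// Hypothetical helper: zero-padded timestamp prefix + UUID suffix gives keys
// that are roughly monotonically increasing while staying globally unique.
def orderedKey(eventTimeMillis: Long): String =
  f"$eventTimeMillis%013d-${UUID.randomUUID()}"
```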

What performance can I expect for Hudi reading/queries? 


  • For ReadOptimized views, you can expect the same best in-class columnar query performance as a standard parquet table in Hive/Spark/Presto
  • For incremental views, you can expect speed up relative to how much data usually changes in a given time window and how much time your entire scan takes. For example, if only 100 files changed in the last hour in a partition of 1000 files, then you can expect a 10x speed up using incremental pull in Hudi compared to fully scanning the partition to find new data (see the read sketch after this list). 
  • For real time views, you can expect performance similar to the same avro backed table in Hive/Spark/Presto 
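
For reference, here is a rough sketch of reading these views from Spark. Option keys follow newer Hudi releases (older releases used `hoodie.datasource.view.type` instead), and the base path and instant time are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-read-sketch").getOrCreate()
val basePath = "/tmp/hudi/my_table"   // placeholder

// Read optimized / snapshot style scan: behaves like querying a regular columnar table.
val snapshotDF = spark.read.format("hudi").load(basePath)

// Incremental view: only records written after the given commit instant.
val incrementalDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20240101000000") // placeholder instant
  .load(basePath)
```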

How do I avoid creating tons of small files?

A key design decision in Hudi was to avoid creating small files and always write properly sized files, trading off more time on ingest/writing to keep queries always efficient. Common approaches that write very small files and later stitch them together only solve the system scalability issues posed by small files, while still slowing queries down by exposing small files to them in the meantime. 

For copy-on-write, this is as simple as configuring the maximum size for a base/parquet file and the soft limit below which a file should be considered a small file. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For example, with `compactionSmallFileSize=100MB` and `limitFileSize=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB. 
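
A sketch of those two copy-on-write knobs as writer options (key names as documented in Hudi's configuration reference; the 100MB/120MB values simply mirror the example above):

```scala
// Copy-on-write file sizing: treat base files under 100MB as "small" and grow
// them toward the 120MB maximum file size at write time. Sizes are in bytes.
val cowSizingOpts = Map(
  "hoodie.parquet.small.file.limit" -> (100L * 1024 * 1024).toString, // compactionSmallFileSize
  "hoodie.parquet.max.file.size"    -> (120L * 1024 * 1024).toString  // limitFileSize
)
// e.g. inputDF.write.format("hudi").options(cowSizingOpts)... (see the write sketch earlier)
```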

For merge-on-read, there are a few more configs to set. Specifically, you can configure the maximum log size and a factor that denotes the reduction in size when data moves from avro to parquet files. 
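
For merge-on-read, the corresponding knobs look roughly like this (key names from Hudi's configuration reference; the values are illustrative and close to the defaults):

```scala
// Merge-on-read sizing: cap individual log file size and tell Hudi how much the
// data shrinks when avro log records are compacted into parquet base files.
val morSizingOpts = Map(
  "hoodie.logfile.max.size"                     -> (1024L * 1024 * 1024).toString, // ~1GB logs
  "hoodie.logfile.to.parquet.compression.ratio" -> "0.35"
)
```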

HUDI-26 will take this to the next level, by even collapsing smaller file groups into larger ones.

Contributing to FAQ 

A good and usable FAQ should be community-driven, crowd-sourcing questions and thoughts from everyone. 

...