...

That said, to avoid blocking ingestion on compaction, you may want to run compaction asynchronously. This can be done either via a separate compaction job scheduled independently by your workflow scheduler/notebook, or, if you are using DeltaStreamer, by running it in continuous mode, where ingestion and compaction are both managed concurrently within a single Spark runtime.
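
For example, here is a minimal sketch (assuming a Spark job with the Hudi bundle on the classpath; the table name, base path, field names and input DataFrame are illustrative placeholders) of writing to a merge-on-read table with inline compaction disabled, so compaction can instead be driven by a separately scheduled job or by DeltaStreamer in continuous mode:

```scala
// Minimal sketch, not a full pipeline: upsert into a MERGE_ON_READ table while
// keeping compaction out of the ingestion path. `inputDf`, the table name, the
// base path and the key/precombine fields are illustrative placeholders.
import org.apache.spark.sql.SaveMode

inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "example_mor_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "key")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Do not compact inline with each write; let a separately scheduled
  // compaction job (or DeltaStreamer in continuous mode) handle it.
  .option("hoodie.compact.inline", "false")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/example_mor_table")
```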


Performance 

What performance/ingest latency can I expect for Hudi writing?

The speed at which you can write into Hudi depends on the write operation and on the trade-offs you make along the way, such as file sizing.


Storage Type  | Type of workload               | Performance                                                                                                           | Tips
copy on write | bulk_insert                    | Should match vanilla Spark writing, plus an additional sort to properly size files                                   | Size the bulk_insert parallelism to get the right number of files; use insert if you want this auto-tuned
copy on write | insert                         | Similar to bulk_insert, except file sizes are auto-tuned, which requires the input to be cached in memory and custom partitioned |
copy on write | upsert / de-duplicate & insert | Both of these involve an index lookup on top of vanilla Spark writing                                                |
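
As a rough illustration of the trade-offs in the table above, the following is a hedged sketch of selecting the write operation and sizing bulk_insert parallelism via the Spark datasource; the table name, base path, field names and the parallelism value are placeholders, not recommendations:

```scala
// Illustrative only: choose the write operation per the table above and size
// the bulk_insert shuffle parallelism to control how many files get written.
inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "example_cow_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")   // or "insert" / "upsert"
  .option("hoodie.datasource.write.recordkey.field", "key")     // assumed key field
  .option("hoodie.datasource.write.precombine.field", "ts")     // assumed ordering field
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")       // tune to get the desired file count/size
  .mode("append")
  .save("/tmp/hudi/example_cow_table")
```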


Like many typical systems that manage time-series data, Hudi performs much better if your keys have a timestamp prefix or are monotonically increasing/decreasing. You can almost always achieve this; even if you have UUID keys, you can follow tricks like this to get keys that are ordered. See also the Tuning Guide for more tips on JVM and other configurations.
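
For instance, a minimal sketch (assuming columns named `ts` and `uuid`, which are placeholders) of deriving an ordered record key by prefixing the event timestamp to an otherwise random UUID:

```scala
// Sketch with assumed column names: build a record key whose prefix is the
// event timestamp, so keys are roughly monotonically increasing even when the
// suffix is a random UUID.
import org.apache.spark.sql.functions.{col, concat, date_format, lit}

val keyedDf = inputDf.withColumn(
  "key",
  concat(date_format(col("ts"), "yyyyMMddHHmmssSSS"), lit("_"), col("uuid")))
```

The resulting "key" column could then be used as the record key (e.g. via hoodie.datasource.write.recordkey.field) so writes benefit from the ordering described above.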

What performance can I expect for Hudi reading/queries? 

...