...

That said, to avoid blocking ingestion on compaction, you may want to run compaction asynchronously. This can be done either via a separate compaction job scheduled independently by your workflow scheduler/notebook, or, if you are using DeltaStreamer, by running it in continuous mode, where ingestion and compaction are both managed concurrently within a single Spark runtime.
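
For example, here is a minimal sketch (assuming a Spark job with the Hudi bundle on the classpath; the table name, base path, field names and input DataFrame are illustrative placeholders) of writing to a merge-on-read table with inline compaction disabled, so compaction can instead be driven by a separately scheduled job or by DeltaStreamer in continuous mode:

```scala
// Minimal sketch, not a full pipeline: upsert into a MERGE_ON_READ table while
// keeping compaction out of the ingestion path. `inputDf`, the table name, the
// base path and the key/precombine fields are illustrative placeholders.
import org.apache.spark.sql.SaveMode

inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "example_mor_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "key")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Do not compact inline with each write; let a separately scheduled
  // compaction job (or DeltaStreamer in continuous mode) handle it.
  .option("hoodie.compact.inline", "false")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/example_mor_table")
```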


Performance 

What performance/ingest latency can I expect for Hudi writing?

The speed at which you can write into Hudi depends on the write operation and on the trade-offs you make along the way, such as file sizing.


Storage Type  | Type of workload               | Performance                                                                                                           | Tips
copy on write | bulk_insert                    | Should match vanilla Spark writing, plus an additional sort to properly size files                                   | Size the bulk_insert parallelism to get the right number of files; use insert if you want this auto-tuned
copy on write | insert                         | Similar to bulk_insert, except file sizes are auto-tuned, which requires the input to be cached in memory and custom partitioned |
copy on write | upsert / de-duplicate & insert | Both of these involve an index lookup on top of vanilla Spark writing                                                |
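
As a rough illustration of the trade-offs in the table above, the following is a hedged sketch of selecting the write operation and sizing bulk_insert parallelism via the Spark datasource; the table name, base path, field names and the parallelism value are placeholders, not recommendations:

```scala
// Illustrative only: choose the write operation per the table above and size
// the bulk_insert shuffle parallelism to control how many files get written.
inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "example_cow_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")   // or "insert" / "upsert"
  .option("hoodie.datasource.write.recordkey.field", "key")     // assumed key field
  .option("hoodie.datasource.write.precombine.field", "ts")     // assumed ordering field
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")       // tune to get the desired file count/size
  .mode("append")
  .save("/tmp/hudi/example_cow_table")
```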


Like many typical systems that manage time-series data, Hudi performs much better if your keys have a timestamp prefix or are monotonically increasing/decreasing. You can almost always achieve this; even if you have UUID keys, you can follow tricks like this to get keys that are ordered. See also the Tuning Guide for more tips on JVM and other configurations.
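
For instance, a minimal sketch (assuming columns named `ts` and `uuid`, which are placeholders) of deriving an ordered record key by prefixing the event timestamp to an otherwise random UUID:

```scala
// Sketch with assumed column names: build a record key whose prefix is the
// event timestamp, so keys are roughly monotonically increasing even when the
// suffix is a random UUID.
import org.apache.spark.sql.functions.{col, concat, date_format, lit}

val keyedDf = inputDf.withColumn(
  "key",
  concat(date_format(col("ts"), "yyyyMMddHHmmssSSS"), lit("_"), col("uuid")))
```

The resulting "key" column could then be used as the record key (e.g. via hoodie.datasource.write.recordkey.field) so writes benefit from the ordering described above.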

What performance can I expect for Hudi reading/queries? 

...