That said, to avoid blocking ingestion on compaction, you may want to run compaction asynchronously as well. This can be done via a separate compaction job that is scheduled independently by your workflow scheduler or notebook. If you are using delta streamer, you can instead run in continuous mode, where ingestion and compaction are both managed concurrently in a single Spark runtime.

Performance

What performance/ingest latency can I expect for Hudi writing?

The speed at which you can write into Hudi depends on the write operation and on some trade-offs you make along the way, such as file sizing. Just as databases incur overhead over direct/raw file I/O on disks, Hudi operations may have overhead from supporting database-like features compared to reading/writing raw DFS files. That said, Hudi implements advanced techniques from the database literature to keep this overhead minimal. Users are encouraged to keep this perspective in mind when reasoning about Hudi performance. As the saying goes: there is no free lunch (not yet, at least).

Storage type: copy on write
Type of workload: bulk_insert
Performance: Should match vanilla Spark writing, plus an additional sort to properly size files.
Tips: Properly size the bulk insert parallelism to get the right number of files. Use insert if you want this auto-tuned.

Storage type: copy on write
Type of workload: insert
Performance: Similar to bulk insert, except that the file sizes are auto-tuned, which requires the input to be cached in memory and custom partitioned.
Tips: Performance is bound by how parallel you can write the ingested data. Tune this limit up if you see that writes are happening from only a few executors.

Storage type: copy on write
Type of workload: upsert / de-duplicate & insert
Performance: Both of these involve an index lookup. Compared to naively using a JOIN in Spark (or a similar framework) to identify the affected records, Hudi indexing is often 7-10x faster, as long as you have ordered keys (discussed below) or <50% updates. Compared to naively overwriting entire partitions, a Hudi write can be several orders of magnitude faster, depending on how many files in a given partition are actually updated. For example, if a partition has 1000 files, of which only 100 are dirtied in every ingestion run, then Hudi reads/merges only those 100 files and is thus 10x faster than naively rewriting the entire partition.
Tips: Ultimately, performance is bound by how quickly we can read and write a parquet file, and that depends on the configured parquet file size. Also be sure to properly tune your bloom filters. HUDI-56 will auto-tune this.

Storage type: merge on read
Type of workload: bulk insert
Performance: Currently new data only goes to parquet files, and thus performance here should be similar to copy_on_write bulk insert. HUDI-86 will add support for logging inserts directly and speed this up drastically.

Storage type: merge on read
Type of workload: insert
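The partition-overwrite comparison in the upsert row can be worked through as simple arithmetic. The sketch below is only an illustration: the function name `merge_speedup` is hypothetical, and it assumes that all files are roughly the same size and that write cost is proportional to the number of files rewritten. Real speedups also depend on index lookup cost and file sizes.

```python
def merge_speedup(total_files: int, dirtied_files: int) -> float:
    """Estimate the speedup of merging only dirtied files over rewriting
    every file in the partition.

    Assumption (not from Hudi itself): files are equally sized and cost
    scales linearly with the number of files rewritten.
    """
    if not 0 < dirtied_files <= total_files:
        raise ValueError("dirtied_files must be between 1 and total_files")
    return total_files / dirtied_files


# The example from the table: 1000 files in the partition, 100 dirtied per run.
print(merge_speedup(1000, 100))  # 10.0
```

Under the same assumptions, the benefit shrinks as more of the partition is touched: when every file is dirtied, the ratio is 1 and there is no saving over a full rewrite.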


As with many typical systems that manage time-series data, Hudi performs much better if your keys have a timestamp prefix or are monotonically increasing/decreasing. You can almost always achieve this. Even if you have UUID keys, there are tricks for deriving keys that are ordered. See also the Tuning Guide for more tips on JVM and other configurations.
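One way to get ordered keys, sketched here in plain Python, is to prefix each key with a fixed-width timestamp so that lexicographic order follows event time. This is an illustrative assumption, not necessarily the trick the original page linked to: the helper name `ordered_key`, the millisecond resolution, and the `-` separator are all hypothetical choices.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def ordered_key(event_time: datetime, suffix: Optional[str] = None) -> str:
    """Build a record key whose string order follows event time.

    The prefix is the UTC epoch time in milliseconds, zero-padded to 13
    digits so that string comparison matches numeric comparison. The
    suffix (a random UUID by default) keeps keys unique within the same
    millisecond.
    """
    millis = int(event_time.timestamp() * 1000)
    if millis < 0:
        raise ValueError("event_time must not be before the Unix epoch")
    if suffix is None:
        suffix = uuid.uuid4().hex
    return f"{millis:013d}-{suffix}"


earlier = ordered_key(datetime(2020, 1, 1, tzinfo=timezone.utc))
later = ordered_key(datetime(2020, 1, 2, tzinfo=timezone.utc))
print(earlier < later)  # True
```

Because the random part comes last, keys written in the same time window share a prefix and cluster together, which is the property the paragraph above describes for timestamp-prefixed keys.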

What performance can I expect for Hudi reading/queries?

...

<Answer WIP>

How do I avoid creating tons of small files?

...
