...
Storage Type | Type of workload | Performance | Tips
---|---|---|---
copy on write | bulk_insert | Should match vanilla Spark writing, plus an additional sort to properly size files. | Size the bulk insert parallelism properly to get the right number of files. Use insert if you want this auto-tuned.
copy on write | insert | Similar to bulk_insert, except that file sizes are auto-tuned, which requires the input to be cached in memory and custom partitioned. | Performance is bound by how parallel the write of the ingested data can be. Tune the write parallelism up if you see that writes are happening from only a few executors.
copy on write | upsert / de-duplicate & insert | Both of these involve an index lookup. Compared to naively using a JOIN in Spark (or a similar framework) to identify the affected records, Hudi indexing is often 7-10x faster, as long as you have ordered keys (discussed below) or <50% updates. Compared to naively overwriting entire partitions, a Hudi write can be orders of magnitude faster, depending on how many files in a given partition are actually updated. For example, if a partition has 1000 files, of which only 100 are dirtied in every ingestion run, then Hudi reads and merges only those 100 files and is thus about 10x faster than naively rewriting the entire partition. | Ultimately, performance is bound by how quickly we can read and write a Parquet file, which depends on the size of the Parquet file (configurable). Also be sure to tune your bloom filters properly.
merge on read | bulk_insert | Currently, new data only goes to Parquet files, so performance here should be similar to copy_on_write bulk_insert. This has the nice side effect of getting data into Parquet directly, which helps query performance. |
merge on read | insert | Similar to above. |
merge on read | upsert / de-duplicate & insert | Indexing performance remains the same as for copy_on_write. Updates (the costliest I/O operation in copy_on_write) are sent to log files instead, so with asynchronous compaction this gives very good ingest performance with low write amplification. |
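
The arithmetic behind the 1000-file example in the copy on write upsert row can be sketched as follows. This is a simplified model, not Hudi code: it assumes all files in the partition are roughly the same size and that the cost of a run is proportional to the number of files read and rewritten. The function names `files_rewritten` and `estimated_speedup` are hypothetical and exist only for this illustration.

```python
def files_rewritten(total_files: int, dirtied_files: int, incremental: bool) -> int:
    """Files read and rewritten in one ingestion run (simplified model).

    incremental=True models merging only the dirtied files;
    incremental=False models naively rewriting the whole partition.
    """
    if not 0 <= dirtied_files <= total_files:
        raise ValueError("dirtied_files must be between 0 and total_files")
    return dirtied_files if incremental else total_files


def estimated_speedup(total_files: int, dirtied_files: int) -> float:
    """Ratio of files touched by a full rewrite to files touched by a merge."""
    merged = files_rewritten(total_files, dirtied_files, incremental=True)
    if merged == 0:
        # Nothing was dirtied, so the incremental run rewrites no files.
        return float("inf")
    full = files_rewritten(total_files, dirtied_files, incremental=False)
    return full / merged


# The example from the table: 1000 files, 100 dirtied per run.
print(estimated_speedup(1000, 100))  # 10.0
```

Under this model the speedup shrinks as the share of dirtied files grows, and reaches 1x when every file in the partition is touched. Real gains also depend on index lookup cost and file sizes, which the sketch ignores.
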
...