...

The Hudi cleaner process often runs right after a commit or deltacommit and deletes older versions of files that are no longer needed. If you are using the incremental pull feature, ensure the cleaner is configured to retain a sufficient number of the latest commits to rewind to. Another consideration is to give your long-running jobs sufficient time to finish; otherwise, the cleaner could delete a file that is being (or could be) read by such a job and cause it to fail. Typically, the default configuration of 24 retained commits, with ingestion running every 30 minutes, retains up to 12 hours worth of data.
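
As a minimal sketch using the Spark datasource writer (the table name, key fields, base path and retention value below are illustrative assumptions; defaults differ across Hudi versions), the retention can be set explicitly on the write:

    import org.apache.spark.sql.SaveMode

    // inputDF is assumed to be the DataFrame being ingested.
    // hoodie.cleaner.commits.retained controls how many commits the cleaner
    // keeps around; size it to cover your longest incremental pull or query.
    inputDF.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", "my_table")                   // hypothetical table name
      .option("hoodie.datasource.write.recordkey.field", "uuid") // hypothetical record key
      .option("hoodie.datasource.write.precombine.field", "ts")  // hypothetical precombine field
      .option("hoodie.cleaner.commits.retained", "24")
      .mode(SaveMode.Append)
      .save("/path/to/my_table")                                 // hypothetical base path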

How can I restore my dataset to a known good point in time?

<Answer WIP>


What's Hudi's schema evolution story?

Hudi uses Avro as the internal canonical representation for records, primarily due to its nice schema compatibility and evolution properties. This is a key aspect of having reliable ingestion or ETL pipelines. As long as the schema passed to Hudi (either explicitly via the DeltaStreamer schema provider configs or implicitly via the Spark Datasource's Dataset schema) is backwards compatible (e.g., no field deletions, only new fields appended to the schema), Hudi will seamlessly handle reads and writes of old and new data and also keep the Hive schema up to date.
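
As an illustration (a minimal sketch; the table name, key fields and the added column are hypothetical), a backwards-compatible change such as appending a new nullable field is just another Spark datasource write against the same table:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.lit

    // The incoming batch carries an extra nullable column ("notes") that
    // earlier commits did not have; append-only changes like this are
    // backwards compatible.
    val evolvedDF = inputDF.withColumn("notes", lit(null).cast("string"))

    evolvedDF.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode(SaveMode.Append)
      .save("/path/to/my_table")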

How do I run compaction for a MOR dataset?

The simplest way to run compaction on a MOR dataset is to run it inline, at the cost of spending more time ingesting. This can be particularly useful in the common case where a small amount of late-arriving data trickles into older partitions. In such a scenario, you may want to aggressively compact just the last N partitions while waiting for enough logs to accumulate for the older partitions. The net effect is that most of the recent data, which is the most likely to be queried, is converted to the optimized columnar format.
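
A minimal sketch of enabling inline compaction through the Spark datasource writer (table/field names and the delta-commit threshold are illustrative assumptions; option names can differ slightly between Hudi versions):

    import org.apache.spark.sql.SaveMode

    // MOR write with inline compaction: a compaction is scheduled and executed
    // inside the write path roughly every 5 delta commits.
    inputDF.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.compact.inline", "true")
      .option("hoodie.compact.inline.max.delta.commits", "5")
      .mode(SaveMode.Append)
      .save("/path/to/my_table")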


That said, for the obvious reason of not blocking ingestion on compaction, you may want to run it asynchronously instead. This can be done via a separate compaction job scheduled independently by your workflow scheduler or notebook. If you are using DeltaStreamer, you can run it in continuous mode, where ingestion and compaction are both managed concurrently within a single Spark runtime.
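
A hedged sketch of launching DeltaStreamer in continuous mode (the class name and flags vary across Hudi versions; the bundle jar, source class, ordering field, properties file and paths below are assumptions for illustration):

    spark-submit \
      --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      hudi-utilities-bundle.jar \
      --table-type MERGE_ON_READ \
      --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
      --source-ordering-field ts \
      --target-base-path /path/to/my_table \
      --target-table my_table \
      --props /path/to/kafka-source.properties \
      --continuous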


Performance 

What performance can I expect for Hudi writing?

...