...
Hudi provides built-in support for rewriting your entire dataset into Hudi one-time using the HDFSParquetImporter tool available from the hudi-cli. You could also do this via a simple read and write of the dataset using the Spark datasource APIs. Once migrated, writes can be performed using the normal means discussed here. This topic is discussed in detail here, including ways of doing partial migrations.
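The Spark datasource route can be sketched as below. This is a minimal, illustrative example: the paths, table name, and field names (`uuid`, `partition`, `ts`) are placeholders you would replace with your own schema.

```scala
// Sketch: one-time bulk migration of an existing parquet dataset into Hudi
// via the Spark datasource API. Paths, table name, and field names are
// illustrative placeholders -- substitute your own.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("parquet-to-hudi-migration")
  .getOrCreate()

spark.read
  .format("parquet")
  .load("/path/to/existing/parquet")                           // source dataset
  .write
  .format("hudi")
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "bulk_insert")  // suited to one-time loads
  .mode(SaveMode.Overwrite)
  .save("/path/to/hudi/table")
```

`bulk_insert` avoids the small-file handling and indexing overhead of a regular `upsert`, which makes it a good fit for an initial full load; subsequent incremental writes would typically switch back to `upsert`.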
...
Why does Hudi retain at least one previous commit even after setting `hoodie.cleaner.commits.retained` to 1?
Hudi runs the cleaner to remove old file versions as part of writing data, either inline or in asynchronous mode (0.6.0 onwards). The cleaner retains at least one previous commit when cleaning old file versions. This prevents concurrently running queries that are reading the latest file versions from suddenly seeing those files deleted by the cleaner because a new file version got added. In other words, retaining at least one previous commit is needed to ensure snapshot isolation for readers.
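As a sketch, the cleaner retention described above is configured through write options like the following; the table name and paths are placeholders, and `df` stands for any DataFrame being written.

```scala
// Illustrative: configuring cleaner retention on a Spark datasource write.
// Even with commits.retained = 1, Hudi keeps at least one previous commit's
// file versions so in-flight readers are not broken (snapshot isolation).
import org.apache.spark.sql.SaveMode

df.write
  .format("hudi")
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
  .option("hoodie.cleaner.commits.retained", "1")
  .option("hoodie.clean.async", "true")  // asynchronous cleaning (0.6.0 onwards)
  .mode(SaveMode.Append)
  .save("/path/to/hudi/table")
```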
How do I use DeltaStreamer or the Spark DataSource API to write to a non-partitioned Hudi dataset?
...