...
To contribute content to this FAQ, see here.
General
When is Hudi useful for me or my organization
...
For an insert or bulk_insert operation, no such pre-combining is performed. Thus, if your input contains duplicates, the dataset will also contain duplicates. If you don't want duplicate records, either issue an upsert or consider specifying the option to de-duplicate input in either the datasource or deltastreamer.
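The difference between the two write paths can be illustrated with a small, self-contained sketch. This is plain Python that models the semantics only and is not Hudi code: the helper names (`insert`, `precombine`, `upsert`) and the sample fields (`id`, `ts`, `v`) are hypothetical, and the merge rule shown (incoming record replaces the stored one) is a simplifying assumption. In the Spark datasource the de-duplication switch is, to the best of my knowledge, `hoodie.datasource.write.insert.drop.duplicates`, and in deltastreamer `--filter-dupes`; verify both names against your Hudi version.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def insert(table: List[Record], batch: Iterable[Record]) -> List[Record]:
    """Model of insert/bulk_insert: append every record, no pre-combining."""
    return table + list(batch)


def precombine(batch: Iterable[Record], key_field: str, ordering_field: str) -> List[Record]:
    """Keep one record per key: the one with the largest ordering value."""
    latest: Dict[Any, Record] = {}
    for rec in batch:
        key = rec[key_field]
        if key not in latest or rec[ordering_field] >= latest[key][ordering_field]:
            latest[key] = rec
    return list(latest.values())


def upsert(table: List[Record], batch: Iterable[Record],
           key_field: str, ordering_field: str) -> List[Record]:
    """Model of upsert: pre-combine the batch, then replace stored records by key."""
    merged = {rec[key_field]: rec for rec in table}
    for rec in precombine(batch, key_field, ordering_field):
        merged[rec[key_field]] = rec  # assumption: incoming record wins
    return list(merged.values())


# The input batch carries two records for key 1.
batch = [
    {"id": 1, "ts": 10, "v": "a"},
    {"id": 1, "ts": 20, "v": "b"},
    {"id": 2, "ts": 5, "v": "c"},
]

inserted = insert([], batch)              # duplicates for key 1 are kept
upserted = upsert([], batch, "id", "ts")  # one record per key survives
```

With this batch, the insert path stores all three records, while the upsert path stores one record per key and keeps the `ts = 20` version for key 1.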
Can I implement my own logic for how input records are merged with record on storage
...
Hudi provides built-in support for a one-time rewrite of your entire dataset into Hudi, using the HDFSParquetImporter tool available from the hudi-cli. You could also do this via a simple read and write of the dataset using the Spark datasource APIs. Once migrated, writes can be performed using the normal means discussed here. This topic is discussed in detail here, including ways of doing partial migrations.
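The "simple read and write" route might look roughly like the following PySpark sketch. It assumes a Spark session that already has the Hudi bundle on its classpath. The function names, paths, and column names are hypothetical, and the option keys are the commonly documented datasource write options, which should be checked against the Hudi version in use.

```python
from typing import Dict


def hudi_write_options(table_name: str, record_key: str,
                       partition_path: str, precombine: str) -> Dict[str, str]:
    """Build datasource write options for a one-time bulk load (assumed keys)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_path,
        "hoodie.datasource.write.precombine.field": precombine,
        # bulk_insert skips pre-combining, so the source should be duplicate-free
        "hoodie.datasource.write.operation": "bulk_insert",
    }


def migrate_parquet_to_hudi(spark, source_path: str, target_path: str,
                            options: Dict[str, str]) -> None:
    """Read an existing parquet dataset and rewrite it as a Hudi dataset."""
    df = spark.read.parquet(source_path)
    (df.write
       .format("org.apache.hudi")
       .options(**options)
       .mode("overwrite")  # one-time rewrite into an empty target path
       .save(target_path))


# Hypothetical table and column names, for illustration only.
options = hudi_write_options("trips", "uuid", "partition", "ts")

# Usage, given a SparkSession named `spark` (paths are placeholders):
# migrate_parquet_to_hudi(spark, "/data/trips_parquet", "/data/trips_hudi", options)
```

The options are built separately from the write call so that the same dictionary can be reused, with a different operation value, for later writes to the migrated dataset.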
...