How does Hudi handle duplicate record keys in an input?

When you issue an `upsert` operation on a dataset and the batch of records contains multiple entries for a given key, all of them are reduced to a single final value by repeatedly calling the payload class's preCombine() method. By default, we pick the record with the greatest value (determined by calling .compareTo()), giving latest-write-wins semantics.
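As a rough illustration of that reduction, here is a minimal sketch in plain Python. It is a simplified model, not Hudi's Java payload API: `Payload`, `ordering_value`, `pre_combine` and `dedupe_batch` are hypothetical names, and how ties are broken here is an assumption.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Dict, Iterable, List


@dataclass
class Payload:
    """Hypothetical stand-in for a payload class; not the real Hudi type."""
    key: str
    ordering_value: int  # e.g. an event timestamp, compared like .compareTo()
    data: Dict[str, Any]

    def pre_combine(self, other: "Payload") -> "Payload":
        # Keep whichever entry has the greater ordering value
        # (on a tie this sketch keeps the current one; that is an assumption).
        return other if other.ordering_value > self.ordering_value else self


def dedupe_batch(batch: Iterable[Payload]) -> List[Payload]:
    # Group the incoming batch by record key ...
    by_key: Dict[str, List[Payload]] = {}
    for record in batch:
        by_key.setdefault(record.key, []).append(record)
    # ... then fold each group down to one value via repeated pre_combine calls.
    return [reduce(lambda a, b: a.pre_combine(b), group) for group in by_key.values()]


batch = [
    Payload("id-1", 10, {"city": "SF"}),
    Payload("id-1", 30, {"city": "NYC"}),
    Payload("id-2", 5, {"city": "LA"}),
    Payload("id-1", 20, {"city": "SEA"}),
]
deduped = dedupe_batch(batch)
```

In this model the three entries for `id-1` collapse into the one with ordering value 30, regardless of their position in the batch.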

For an insert or bulk_insert operation, no such pre-combining is performed. Thus, if your input contains duplicates, the dataset will also contain duplicates. If you don't want duplicate records, either issue an upsert or consider specifying the option to de-duplicate the input in either the datasource or deltastreamer.
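A sketch of how the two choices might look as Spark datasource write options. The table and column names (`trips`, `trip_id`, `updated_at`) and the helper `hudi_write_options` are hypothetical. The option keys are the ones I believe the datasource uses, but please verify them against the configuration reference for your Hudi version.

```python
from typing import Dict


def hudi_write_options(table_name: str, operation: str,
                       drop_duplicates: bool = False) -> Dict[str, str]:
    """Build a dict of datasource write options (hypothetical helper)."""
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        # Hypothetical column names for the record key and the pre-combine field.
        "hoodie.datasource.write.recordkey.field": "trip_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
    }
    if operation == "insert" and drop_duplicates:
        # Assumed to be the datasource option for de-duplicating insert input.
        options["hoodie.datasource.write.insert.drop.duplicates"] = "true"
    return options


# Option 1: issue an upsert, so duplicates in the batch are pre-combined.
upsert_options = hudi_write_options("trips", "upsert")

# Option 2: keep using insert, but ask for the input to be de-duplicated.
insert_options = hudi_write_options("trips", "insert", drop_duplicates=True)
```

Either dict would then be passed to the writer, e.g. via `df.write.format("org.apache.hudi").options(**upsert_options)`.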

Can I implement my own logic for how input records are merged with the record on storage?

Yes. Similar to pre-combining, the payload class defines methods (combineAndGetUpdateValue(), getInsertValue()) that control how the record on storage is combined with the incoming update/insert to generate the final value to be written back to storage.
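To make the division of labour between the two methods concrete, here is a small Python model of a custom merge. It is only a sketch of the idea: the real payload class is written in Java against Hudi's interfaces, and `PartialUpdatePayload` with its "ignore unset fields" rule is a hypothetical example, not something Hudi ships under that name.

```python
from typing import Any, Dict

Record = Dict[str, Any]


class PartialUpdatePayload:
    """Hypothetical payload: an update only overwrites the fields it sets."""

    def __init__(self, incoming: Record):
        self.incoming = incoming

    def get_insert_value(self) -> Record:
        # No record exists on storage for this key: write the incoming
        # record, leaving out fields that were not set (None).
        return {k: v for k, v in self.incoming.items() if v is not None}

    def combine_and_get_update_value(self, stored: Record) -> Record:
        # A record exists on storage: start from it and overlay only the
        # fields the incoming update actually sets.
        merged = dict(stored)
        merged.update({k: v for k, v in self.incoming.items() if v is not None})
        return merged


stored = {"id": "id-1", "city": "SF", "fare": 12.5}
update = PartialUpdatePayload({"id": "id-1", "city": None, "fare": 15.0})

merged = update.combine_and_get_update_value(stored)  # key already on storage
inserted = update.get_insert_value()                  # key not on storage
```

Here the update changes `fare` but leaves the stored `city` untouched, which is different from the default behaviour of overwriting the whole record.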

How do I delete records in the dataset using Hudi?

GDPR has made deletes a must-have tool in everyone's data management toolbox. Hudi supports both soft and hard deletes. For details on how to actually perform them, see here.
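The linked details are not reproduced here, so the following is only an interpretation of the two terms, modelled on a plain Python dict rather than a real Hudi dataset. It assumes a soft delete keeps the record key but nulls out every other field, while a hard delete removes the record entirely. The function names and fields are hypothetical.

```python
from typing import Any, Dict

Table = Dict[str, Dict[str, Any]]  # record key -> record


def soft_delete(table: Table, key: str, key_field: str = "id") -> None:
    # Assumed meaning of "soft delete": keep the key, blank out all other fields.
    record = table[key]
    table[key] = {f: (v if f == key_field else None) for f, v in record.items()}


def hard_delete(table: Table, key: str) -> None:
    # Assumed meaning of "hard delete": physically remove the record.
    table.pop(key, None)


table: Table = {
    "id-1": {"id": "id-1", "city": "SF", "fare": 12.5},
    "id-2": {"id": "id-2", "city": "LA", "fare": 8.0},
}
soft_delete(table, "id-1")
hard_delete(table, "id-2")
```

After both calls, `id-1` is still present with only its key populated and `id-2` is gone.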

How do I migrate my data to Hudi?

Hudi provides built-in support for a one-time rewrite of your entire dataset into Hudi using the HDFSParquetImporter tool, available from the hudi-cli. You could also do this via a simple read and write of the dataset using the Spark datasource APIs. Once migrated, writes can be performed using the normal means discussed here. This topic is discussed in detail here.
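For the read-and-write route, a minimal PySpark sketch might look like the following. It assumes the source data is plain parquet and that the Hudi Spark bundle is on the classpath. The function name, paths, table name and column names are all hypothetical, and the option keys should be checked against your Hudi version.

```python
from typing import Dict


def migrate_parquet_to_hudi(spark, source_path: str, target_path: str,
                            options: Dict[str, str]) -> None:
    """One-time rewrite of a parquet dataset into a Hudi dataset (sketch)."""
    # Read the existing dataset with the regular Spark parquet reader ...
    df = spark.read.parquet(source_path)
    # ... and write it back out through the Hudi datasource.
    (df.write
       .format("org.apache.hudi")
       .options(**options)
       .mode("overwrite")
       .save(target_path))


# Hypothetical table and column names; adjust to your schema.
migration_options = {
    "hoodie.table.name": "trips_hudi",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.partitionpath.field": "trip_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
}
```

You would call it with an active `SparkSession`, for example `migrate_parquet_to_hudi(spark, "/data/trips_parquet", "/data/trips_hudi", migration_options)`, where both paths are placeholders.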

How can I pass Hudi configurations to my Spark job?

<Answer WIP> https://lists.apache.org/thread.html/ce35776e8899d620c095354c23933eb000eec63eaa70acfe60ae0b0c@<dev.hudi.apache.org>

Can I register my Hudi dataset with Apache Hive metastore?

How does Hudi indexing work and what are its benefits?

What does the Hudi cleaner do?

...

How can I restore my dataset to a known good point in time?

...

What's Hudi's schema evolution story?

What are the different ways of running compaction for a MOR dataset?

Simplest way to cio

Performance

What performance can I expect for Hudi writing?

...
