...

How does Hudi handle duplicate record keys in an input? 

When issuing an `upsert` operation on a dataset, if the batch of records provided contains multiple entries for a given key, all of them are reduced into a single final value by repeatedly calling the payload class's preCombine() method. By default, we pick the record with the greatest value (determined by calling .compareTo()), giving latest-write-wins style semantics.
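The reduction described above can be sketched in plain Python. This is only an illustration of the idea, not Hudi code: `Record`, `pre_combine`, and `dedupe_batch` are hypothetical names, the `ordering` field stands in for whatever value .compareTo() is called on, and the tie-breaking rule is an assumption.

```python
from typing import Any, Dict, Iterable, NamedTuple


class Record(NamedTuple):
    key: str       # record key
    ordering: int  # hypothetical ordering value, e.g. an event timestamp
    value: Any     # the rest of the record


def pre_combine(a: Record, b: Record) -> Record:
    """Pick the record with the greater ordering value.

    On a tie this sketch keeps the later-seen record (b); that choice is an
    assumption of the example, not a statement about Hudi's behavior.
    """
    return a if a.ordering > b.ordering else b


def dedupe_batch(batch: Iterable[Record]) -> Dict[str, Record]:
    """Reduce all entries sharing a key to one record by repeated pre_combine calls."""
    merged: Dict[str, Record] = {}
    for rec in batch:
        if rec.key in merged:
            merged[rec.key] = pre_combine(merged[rec.key], rec)
        else:
            merged[rec.key] = rec
    return merged


batch = [
    Record("k1", 1, "old"),
    Record("k2", 5, "only"),
    Record("k1", 3, "new"),
    Record("k1", 2, "middle"),
]
result = dedupe_batch(batch)
```

Here the three entries for `k1` collapse to the one with the greatest ordering value, while `k2` passes through unchanged.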

For an insert or bulk_insert operation, no such pre-combining is performed. Thus, if your input contains duplicates, the dataset will also contain duplicates. If you don't want duplicate records, either issue an upsert or consider specifying the option to de-duplicate input in either the datasource or the deltastreamer.
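As a rough sketch of the datasource route, the write options might be assembled like this. The helper name `insert_dedup_options` is hypothetical, and the option key `hoodie.datasource.write.insert.drop.duplicates` is given to the best of our knowledge; please verify it against the configuration reference for your Hudi version. For the deltastreamer, the equivalent is believed to be the `--filter-dupes` flag.

```python
def insert_dedup_options(drop_duplicates: bool = True) -> dict:
    """Build datasource write options for an insert that de-duplicates its input.

    The option keys below are assumptions to be checked against your Hudi version.
    """
    return {
        # Plain insert: no pre-combining happens unless de-duplication is requested.
        "hoodie.datasource.write.operation": "insert",
        # Ask the datasource to drop duplicate records from the incoming batch.
        "hoodie.datasource.write.insert.drop.duplicates": "true" if drop_duplicates else "false",
    }


options = insert_dedup_options()
```

These options would then be passed to the Spark writer alongside the usual table name and key settings.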

Can I implement my own logic for how input records are merged with the record on storage?

Similar to above, the payload class defines methods (combineAndGetUpdateValue(), getInsertValue()) that control how the record on storage is combined with the incoming update/insert to generate the final value to be written back to storage. 
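To make the roles of the two methods concrete, here is a minimal Python model of a custom payload. It is a sketch only: real payloads are Java classes, and `SumCountPayload`, `write`, and the `count` field are hypothetical names invented for this example.

```python
from typing import Any, Dict


class SumCountPayload:
    """Hypothetical payload: the incoming 'count' is added to the stored 'count'."""

    def __init__(self, record: Dict[str, Any]):
        self.record = record

    def get_insert_value(self) -> Dict[str, Any]:
        # No record on storage yet: write the incoming record as-is.
        return dict(self.record)

    def combine_and_get_update_value(self, stored: Dict[str, Any]) -> Dict[str, Any]:
        # A record exists on storage: incoming fields overwrite stored ones,
        # except 'count', which is accumulated.
        merged = dict(stored)
        merged.update(self.record)
        merged["count"] = stored.get("count", 0) + self.record.get("count", 0)
        return merged


def write(storage: Dict[str, Dict[str, Any]], key: str, payload: SumCountPayload) -> None:
    """Stand-in for the writer: choose the update or insert path per key."""
    if key in storage:
        storage[key] = payload.combine_and_get_update_value(storage[key])
    else:
        storage[key] = payload.get_insert_value()


storage: Dict[str, Dict[str, Any]] = {}
write(storage, "k1", SumCountPayload({"count": 2, "name": "a"}))
write(storage, "k1", SumCountPayload({"count": 3, "name": "b"}))
```

The first write takes the insert path; the second finds `k1` on storage and takes the update path, so the stored record ends up with the summed count and the latest name.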

How do I delete records in the dataset using Hudi?

GDPR has made deletes a must-have tool in everyone's data management toolbox. Hudi supports both soft and hard deletes. For details on how to actually perform them, see here.

How do I migrate my data to Hudi?

Hudi provides built-in support for a one-time rewrite of your entire dataset into Hudi, using the HDFSParquetImporter tool available from the hudi-cli. You could also do this via a simple read and write of the dataset using the Spark datasource APIs. Once migrated, writes can be performed using normal means discussed here. This topic is discussed in detail here.
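The read-and-write route through the Spark datasource might look roughly like the following PySpark sketch. The helper names, paths, and column names are hypothetical, and the format string and option keys should be checked against the Hudi version in use (older releases used a different package name).

```python
def hudi_write_options(table_name: str, record_key: str,
                       partition_path: str, precombine_field: str) -> dict:
    """Assemble datasource options for a one-time bulk load (keys assumed, verify per version)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_path,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "bulk_insert",
    }


def migrate(spark, source_path: str, target_path: str, options: dict) -> None:
    """Read an existing Parquet dataset and rewrite it as a Hudi dataset.

    `spark` is an existing SparkSession with the Hudi bundle on its classpath.
    """
    df = spark.read.parquet(source_path)
    (df.write.format("org.apache.hudi")
        .options(**options)
        .mode("overwrite")  # initial load: replace anything already at target_path
        .save(target_path))


# Hypothetical table and column names, for illustration only.
options = hudi_write_options("trips", "uuid", "date", "ts")
```

Calling `migrate(spark, "/data/trips_parquet", "/data/trips_hudi", options)` would then perform the rewrite; both paths are placeholders.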

How can I pass Hudi configurations to my Spark job?

<Answer WIP> https://lists.apache.org/thread.html/ce35776e8899d620c095354c23933eb000eec63eaa70acfe60ae0b0c@<dev.hudi.apache.org


Can I register my Hudi dataset with Apache Hive metastore?

<Answer WIP>

How does Hudi indexing work and what are its benefits?

<Answer WIP>

What does the Hudi cleaner do? 

...

How can I restore my dataset to a known good point in time?

<Answer WIP>

...

?

<Answer WIP>

What's Hudi's schema evolution story?

<Answer WIP>


What are different ways of running compaction for a MOR dataset?

Simplest way to cio

Performance 

What performance can I expect for Hudi writing?

...