...
When issuing an `upsert` operation on a dataset, if the batch of records provided contains multiple entries for a given key, all of them are reduced into a single final value by repeatedly calling the payload class's `preCombine()` method. By default, the record with the greatest value (as determined by calling `.compareTo()`) is picked, giving latest-write-wins semantics. See the `HoodieRecordPayload` interface if you are interested in the details.
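The reduction described above can be sketched in plain Java. This is an illustrative model of the pre-combine idea, not the actual `HoodieRecordPayload` API; the `Rec` record and its `orderingVal` field are hypothetical stand-ins for a Hudi record key and its ordering (pre-combine) field.

```java
import java.util.*;

public class PreCombineSketch {
    // Hypothetical record: a key plus an ordering value (e.g. an event timestamp).
    // Mirrors the idea behind preCombine(), not the real Hudi payload interface.
    record Rec(String key, long orderingVal, String data) {}

    // Reduce duplicate keys in a batch, keeping the record with the greatest
    // ordering value -- latest-write-wins, like the default payload behavior.
    static Map<String, Rec> dedupe(List<Rec> batch) {
        Map<String, Rec> out = new HashMap<>();
        for (Rec r : batch) {
            // merge() calls the combiner only when the key is already present
            out.merge(r.key(), r, (a, b) -> a.orderingVal() >= b.orderingVal() ? a : b);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> batch = List.of(
            new Rec("id-1", 1L, "v1"),
            new Rec("id-1", 3L, "v3"),   // duplicate key, newer ordering value
            new Rec("id-2", 2L, "v2"));
        Map<String, Rec> deduped = dedupe(batch);
        System.out.println(deduped.get("id-1").data()); // the later write, "v3"
        System.out.println(deduped.size());             // 2 distinct keys
    }
}
```

In real Hudi, the combiner step is your payload class's `preCombine()`, so a custom payload can implement a different merge policy than latest-write-wins.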
For an `insert` or `bulk_insert` operation, no such pre-combining is performed. Thus, if your input contains duplicates, the dataset will also contain duplicates. If you don't want duplicate records, either issue an `upsert` instead, or consider the de-duplication options for input: in the datasource (e.g. `hoodie.datasource.write.insert.drop.duplicates`) or in DeltaStreamer (e.g. `--filter-dupes`).
...