It's common for duplicates to be introduced into the Kafka events your microservices emit, for reasons such as:
- Client/Producer failures
- Network errors
- At-least-once delivery when moving data across Kafka clusters
This can lead to overcounting and accuracy problems: over-reported impressions, incorrectly billed ads, and low-fidelity data in dashboards.
Here's how Hudi can help solve this problem with a simple option (`--filter-dupes`) passed to the DeltaStreamer tool.
```sh
# DeltaStreamer command to ingest Kafka events, de-duplicate them, and write to the Hudi table
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-bundle-*.jar \
  --props s3://path/to/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
  --source-ordering-field time \
  --target-base-path s3:///hudi-deltastreamer/impressions \
  --target-table uber.impressions \
  --op BULK_INSERT \
  --filter-dupes
```
```properties
# kafka-source.properties
include=base.properties

# Key fields, for the Kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=datestr

# Schema provider configs
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest

# Kafka source
hoodie.deltastreamer.source.kafka.topic=impressions

# Kafka props
metadata.broker.list=localhost:9092
auto.offset.reset=smallest
schema.registry.url=http://localhost:8081
```
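To sanity-check the result, you can read the target table back with Spark and confirm that no record key repeats within a partition. Below is a minimal sketch for the spark-shell (with the Hudi Spark bundle on the classpath); the table path and the `id`/`datestr` field names simply mirror the configs above and should be adjusted to your setup.

```scala
import org.apache.spark.sql.functions._

// Read the Hudi table written by DeltaStreamer above (path is illustrative)
val impressions = spark.read.format("hudi")
  .load("s3:///hudi-deltastreamer/impressions")

// With --filter-dupes, each record key should appear at most once per partition path,
// so this query is expected to return an empty result.
impressions
  .groupBy("id", "datestr")
  .count()
  .filter(col("count") > 1)
  .show()
```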