...

To achieve this goal today, we can use the [deltastreamer](https://hudi.apache.org/docs/writing_data/#deltastreamer) tool provided with Hudi, which runs within the Spark engine to pull records from Kafka and ingest data into Hudi tables. Giving users the ability to ingest data via the Kafka Connect framework has a few advantages. Current Connect users can readily ingest their Kafka data into Hudi tables, leveraging the power of Hudi's platform without the overhead of deploying a Spark environment.

...

To appreciate the design proposed in this RFC, it is important to understand Kafka Connect. It is a framework to stream data into and out of Apache Kafka. The core components of the Connect framework that are relevant to this RFC are connectors, tasks, and workers. A connector instance is a logical job that manages the copying of data from Kafka to another system. A connector instance manages a set of tasks that actually copy the data; using multiple tasks allows for parallelism and scalable data copying. Connectors and tasks are logical execution units that are scheduled within workers. In distributed mode, workers run across a cluster to provide scalability and fault tolerance. All workers can be configured with the same `group.id`, and the Connect framework automatically manages the execution of tasks across all available workers. As shown in the figure below, tasks are distributed across workers, and each task manages one or more distinct partitions of the Kafka topic; a minimal connector-side sketch is shown after the figure.

...
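
For readers less familiar with the Connect programming model, the following is a minimal sketch of the connector side. The class name HudiSinkConnector, the reuse of a single flat config map, and the version string are illustrative assumptions rather than the final implementation; the point is that the connector itself copies no data, it only tells the framework which task class to run and how many task configurations to hand out, while the framework assigns partitions to tasks and distributes tasks across workers.

```java
// Illustrative connector sketch; names and config handling are assumptions,
// not the implementation proposed in this RFC.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.sink.SinkConnector;

public class HudiSinkConnector extends SinkConnector {

  private Map<String, String> configProps;

  @Override
  public void start(Map<String, String> props) {
    // Keep the user-supplied configuration; it is handed to every task.
    this.configProps = props;
  }

  @Override
  public Class<? extends Task> taskClass() {
    // The framework instantiates this class once per task.
    return HudiSinkTask.class; // the sink task described later in this RFC
  }

  @Override
  public List<Map<String, String>> taskConfigs(int maxTasks) {
    // One identical config per task; the Connect framework assigns distinct
    // Kafka partitions to each task and rebalances them across workers.
    List<Map<String, String>> taskConfigs = new ArrayList<>(maxTasks);
    for (int i = 0; i < maxTasks; i++) {
      taskConfigs.add(new HashMap<>(configProps));
    }
    return taskConfigs;
  }

  @Override
  public void stop() {
    // No connector-level resources to release in this sketch.
  }

  @Override
  public ConfigDef config() {
    return new ConfigDef();
  }

  @Override
  public String version() {
    return "0.1.0"; // illustrative
  }
}
```

The framework then calls `open()` on each task with its assigned partitions, which is where the Hudi-specific state described next is created.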

The HudiSinkTask pseudocode with key interface implementations is shown below. On initialization, or when a new partition is assigned to the task (via the `open()` API), a new instance of the TransactionParticipant is assigned to the partition. If the task manages partition 0, an instance of the TransactionCoordinator is also created, which runs continuously in a separate thread.
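
Before the full pseudocode, here is a minimal, illustrative sketch of just that partition-assignment path. TransactionParticipant and TransactionCoordinator are stubbed as placeholders so the snippet compiles; their real constructors and protocol logic are what the rest of this RFC describes.

```java
// Illustrative sketch of the partition-assignment path of the sink task.
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkTask;

public abstract class HudiSinkTaskSketch extends SinkTask {

  private final Map<TopicPartition, TransactionParticipant> participants = new HashMap<>();
  private TransactionCoordinator coordinator;

  @Override
  public void open(Collection<TopicPartition> partitions) {
    for (TopicPartition partition : partitions) {
      // One TransactionParticipant per assigned Kafka partition.
      participants.put(partition, new TransactionParticipant(partition));

      // The task that owns partition 0 also runs the TransactionCoordinator,
      // continuously, in its own thread.
      if (partition.partition() == 0 && coordinator == null) {
        coordinator = new TransactionCoordinator();
        new Thread(coordinator, "hudi-transaction-coordinator").start();
      }
    }
  }

  @Override
  public void close(Collection<TopicPartition> partitions) {
    // On rebalance, drop the participants for revoked partitions.
    partitions.forEach(participants::remove);
  }

  // Minimal placeholder types so the sketch compiles; the real classes carry
  // the coordinator/participant state machines described in this RFC.
  static class TransactionParticipant {
    TransactionParticipant(TopicPartition partition) { }
  }

  static class TransactionCoordinator implements Runnable {
    @Override public void run() { /* protocol loop elided */ }
  }
}
```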

...

Kafka Connect calls `preCommit` periodically (the interval can be configured by the user) to commit the latest Kafka offsets. The offsets are returned by the TransactionParticipant, which updates them based on the written records that have been committed to Hudi. Since Kafka offsets are committed only as the corresponding Hudi files are committed, we suggest setting the `preCommit` interval to a value similar to the transaction interval.
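
Continuing the illustrative task sketch above, the override might look roughly like the following; getLastKafkaCommittedOffset() is an assumed accessor name on the hypothetical TransactionParticipant, standing in for however the participant exposes the offset of the last record committed to Hudi.

```java
// Sketch of the preCommit hook inside the sink task (continuing the
// illustrative HudiSinkTaskSketch above). For each assigned partition,
// return only offsets that are already backed by completed Hudi commits,
// so the Connect framework never commits Kafka offsets ahead of Hudi.
@Override
public Map<TopicPartition, OffsetAndMetadata> preCommit(
    Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
  Map<TopicPartition, OffsetAndMetadata> committable = new HashMap<>();
  for (TopicPartition partition : currentOffsets.keySet()) {
    TransactionParticipant participant = participants.get(partition);
    if (participant != null) {
      // getLastKafkaCommittedOffset() is an illustrative accessor name.
      committable.put(partition,
          new OffsetAndMetadata(participant.getLastKafkaCommittedOffset()));
    }
  }
  return committable;
}
```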

...

Once the state machine is updated, the writeRecords method flushes all the records from the buffer and writes them to Hudi using the HudiJavaClient. A local variable storing the offset of the last written Kafka record is maintained as records are written. At the start of the transaction, this local variable is initialized to the committed Kafka offset to ensure we do not miss any records.
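
A rough sketch of that flush path is shown below, continuing the illustrative participant sketch. The buffer, the writeClient handle, and the convertToHoodieRecord helper are placeholders named only for illustration; they stand in for the actual buffering and HudiJavaClient usage.

```java
// Inside the illustrative TransactionParticipant: flush buffered Kafka
// records into Hudi and track how far we have written.
// `buffer` (a queue of SinkRecords), `writeClient`, and
// `convertToHoodieRecord` are placeholders, not actual Hudi APIs.
private long lastWrittenKafkaOffset;

void startTransaction(long committedKafkaOffset) {
  // Initialize from the committed Kafka offset so that no records are
  // missed if the task restarts or a previous transaction was aborted.
  this.lastWrittenKafkaOffset = committedKafkaOffset;
}

void writeRecords() {
  while (!buffer.isEmpty()) {
    SinkRecord record = buffer.poll();
    // Convert the Kafka record into a Hudi record and write it as part
    // of the ongoing transaction.
    writeClient.write(convertToHoodieRecord(record));
    // Track the last written offset; it becomes eligible for Kafka offset
    // commit only after the corresponding Hudi commit completes.
    lastWrittenKafkaOffset = record.kafkaOffset();
  }
}
```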



Configurations

...