...

About hudi-client: we can split the hudi-client module into two new modules, hudi-writer-common and hudi-spark. hudi-writer-common will hold the HoodieIndex and HoodieTable abstract classes along with the IO handle classes, metrics, and exceptions. The index implementations themselves can then move to hudi-spark; HoodieWriteClient and the table classes can also be put into the hudi-spark module. After this refactoring, we can introduce a new hudi-flink module to package the Flink-specific implementation of the index.
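
As a rough sketch, the resulting layout could look like the tree below (this only restates the split described above; the exact contents of each module are still open):

Code Block
hudi-client/
├── hudi-writer-common   (HoodieIndex/HoodieTable abstract classes, IO handles, metrics, exceptions)
├── hudi-spark           (index implementations, HoodieWriteClient, table classes)
└── hudi-flink           (new: Flink-specific index implementation)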

About hudi-utilities: we use some Spark-specific data sources there, so we can either split the core DeltaStreamer logic into a hudi-deltastreamer-core or hudi-utilities-core module and have the Sources themselves live in separate modules such as hudi-utilities-spark and hudi-utilities-flink:

[Image: proposed module layout]
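To make that split concrete, the engine-neutral core would depend only on an abstract source shape. The sketch below is purely illustrative: EngineNeutralSource and its type parameter B are hypothetical names, not existing Hudi classes.

Code Block
languagejava
/**
 * Hypothetical engine-neutral source for hudi-utilities-core: the batch type B
 * would be bound to RDD-based batches in hudi-utilities-spark and to
 * DataStream-based batches in hudi-utilities-flink.
 */
public interface EngineNeutralSource<B> {

  /** Fetch the next batch of data, starting after the given checkpoint. */
  B fetchNext(Option<String> lastCheckpoint, long sourceLimit);
}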

After step 1, we have decoupled Hudi from Spark. Now we need to provide Flink counterparts of the engine-specific functionality, e.g. the index.

The index implementation is one part of the Flink job DAG. Flink's stateful API provides state management, so we can store the index via Flink state. At a lower level of abstraction, in unbounded streaming a window is the mechanism that splits the unbounded stream into bounded ones; we can use Flink windows as the analogue of Spark's micro-batches (RDDs).
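A minimal sketch of that idea, assuming we key the record stream by record key and keep the last known file location in Flink keyed state (StateBackedIndexFn is a hypothetical name, not part of any proposed API):

Code Block
languagejava
public class StateBackedIndexFn<T extends HoodieRecordPayload>
    extends KeyedProcessFunction<String, HoodieRecord<T>, HoodieRecord<T>> {

  // Per-record-key state holding the last known (instant, fileId) location.
  private transient ValueState<HoodieRecordLocation> location;

  @Override
  public void open(Configuration parameters) {
    location = getRuntimeContext().getState(
        new ValueStateDescriptor<>("record-location", HoodieRecordLocation.class));
  }

  @Override
  public void processElement(HoodieRecord<T> record, Context ctx,
      Collector<HoodieRecord<T>> out) throws Exception {
    HoodieRecordLocation known = location.value();
    if (known != null) {
      // Seen before: tag the record so the writer treats it as an update.
      record.setCurrentLocation(known);
    }
    out.collect(record);
  }
}

Writing the new locations back into this state after a successful commit would then correspond to the updateLocation step of the index.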

Indexing is one of the steps in the write path; it runs inside the computation and is therefore closely tied to the computation engine. The existing indexes thus need corresponding implementations for each computation engine. The whole class diagram is shown below:

[Image: class diagram of the refactored index hierarchy]


HoodieIndex, as a public interface, should be refactored into engine-independent classes by generalizing the Spark-related types, like this:

...

Code Block
languagejava
/**
 * Flink binding of the engine-independent index: every Spark type in the
 * original HoodieIndex signatures is replaced by its DataStream counterpart.
 */
public class HoodieFlinkIndex<T extends HoodieRecordPayload>
  extends HoodieGeneralIndex<
    T,
    DataStream<HoodieRecord<T>>,
    DataStream<WriteStatus>,
    DataStream<HoodieKey>,
    DataStream<Tuple2<HoodieKey, Option<Tuple2<String, String>>>>> {

  public HoodieFlinkIndex(HoodieWriteConfig config) {
    super(config);
  }

  @Override
  public DataStream<Tuple2<HoodieKey, Option<Tuple2<String, String>>>> fetchRecordLocation(
      DataStream<HoodieKey> hoodieKeys, HoodieTable<T> hoodieTable) {
    return null; // stub: look up the (partitionPath, fileId) location for each key
  }

  @Override
  public DataStream<HoodieRecord<T>> tagLocation(DataStream<HoodieRecord<T>> records,
      HoodieTable<T> hoodieTable) throws HoodieIndexException {
    return null; // stub: tag each incoming record with its current location, if known
  }

  @Override
  public DataStream<WriteStatus> updateLocation(DataStream<WriteStatus> writeStatuses,
      HoodieTable<T> hoodieTable) throws HoodieIndexException {
    return null; // stub: persist the new locations produced by a completed write
  }

  //...
}
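
The base class itself is not spelled out here; one plausible shape, inferred from the type parameters bound by HoodieFlinkIndex above, would be:

Code Block
languagejava
/**
 * Sketch of the engine-independent base: I = input record set, W = write
 * statuses, K = keys, R = fetched record locations. A Spark subclass would
 * bind these to JavaRDD types, the Flink subclass to DataStream types.
 */
public abstract class HoodieGeneralIndex<T extends HoodieRecordPayload, I, W, K, R>
    implements Serializable {

  protected final HoodieWriteConfig config;

  protected HoodieGeneralIndex(HoodieWriteConfig config) {
    this.config = config;
  }

  public abstract R fetchRecordLocation(K keys, HoodieTable<T> table);

  public abstract I tagLocation(I records, HoodieTable<T> table) throws HoodieIndexException;

  public abstract W updateLocation(W writeStatuses, HoodieTable<T> table) throws HoodieIndexException;

  //...
}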


About HoodieTable, we can do the same refactoring via Java generics. For example, we can define a HoodieGenericTable like this:


Code Block
languagejava
/**
 * Abstract implementation of a HoodieTable.
 * @param <T> The specific type of the {@link HoodieRecordPayload}.
 * @param <EC> The specific context type of the computation engine.
 * @param <WSDS> The specific data set type of the {@link WriteStatus}.
 * @param <P> The specific partition type.
 */
public abstract class HoodieGenericTable<T extends HoodieRecordPayload, EC, WSDS, P> implements Serializable {
    //...
}

Then we will also introduce HoodieSparkTable and HoodieFlinkTable, just like HoodieSparkIndex and HoodieFlinkIndex.
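
For illustration, the two subclasses might bind the generic parameters roughly as follows. The concrete engine-context and partition types are open design points, so these bindings are only indicative:

Code Block
languagejava
// Spark binding: engine context = JavaSparkContext, write statuses as an RDD,
// org.apache.spark.Partition as the partition type.
public abstract class HoodieSparkTable<T extends HoodieRecordPayload>
    extends HoodieGenericTable<T, JavaSparkContext, JavaRDD<WriteStatus>, Partition> {
  //...
}

// Flink binding: engine context = StreamExecutionEnvironment, write statuses as
// a DataStream; Integer stands in as a placeholder partition id here.
public abstract class HoodieFlinkTable<T extends HoodieRecordPayload>
    extends HoodieGenericTable<T, StreamExecutionEnvironment, DataStream<WriteStatus>, Integer> {
  //...
}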





Rollout/Adoption Plan

None

...