...

About hudi-client: we can split the hudi-client module into two new modules, hudi-writer-common and hudi-spark. hudi-writer-common will hold the HoodieIndex and HoodieTable abstract classes along with the IO handle classes, metrics, and exceptions. The index implementations themselves can then move to hudi-spark; HoodieWriteClient and the table classes can also be put into the hudi-spark module. After this refactoring, we can introduce a new hudi-flink module to package the Flink-specific implementation of the index.
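
As a rough sketch, the resulting layout could look like the tree below (this only restates the split described above; the exact contents of each module are still open):

Code Block
hudi-client/
├── hudi-writer-common   (HoodieIndex/HoodieTable abstract classes, IO handles, metrics, exceptions)
├── hudi-spark           (index implementations, HoodieWriteClient, table classes)
└── hudi-flink           (new: Flink-specific index implementation)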

About hudi-utilities: we use some Spark-specific data sources there, so we can either split the core DeltaStreamer logic into a hudi-deltastreamer-core or hudi-utilities-core module and have the Sources themselves live in separate modules such as hudi-utilities-spark and hudi-utilities-flink:

[Image: proposed module layout]
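To make that split concrete, the engine-neutral core would depend only on an abstract source shape. The sketch below is purely illustrative: EngineNeutralSource and its type parameter B are hypothetical names, not existing Hudi classes.

Code Block
languagejava
/**
 * Hypothetical engine-neutral source for hudi-utilities-core: the batch type B
 * would be bound to RDD-based batches in hudi-utilities-spark and to
 * DataStream-based batches in hudi-utilities-flink.
 */
public interface EngineNeutralSource<B> {

  /** Fetch the next batch of data, starting after the given checkpoint. */
  B fetchNext(Option<String> lastCheckpoint, long sourceLimit);
}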

After step 1, we have decoupled Hudi from Spark. Now we need to provide Flink counterparts of the engine-specific functionality, e.g. the index.

The index implementation is one part of the Flink job DAG. Flink's stateful API provides state management, so we can store the index via Flink state. At a lower level of abstraction, in unbounded streaming a window is the mechanism that splits the unbounded stream into bounded ones; we can use Flink windows as the analogue of Spark's micro-batches (RDDs).
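A minimal sketch of that idea, assuming we key the record stream by record key and keep the last known file location in Flink keyed state (StateBackedIndexFn is a hypothetical name, not part of any proposed API):

Code Block
languagejava
public class StateBackedIndexFn<T extends HoodieRecordPayload>
    extends KeyedProcessFunction<String, HoodieRecord<T>, HoodieRecord<T>> {

  // Per-record-key state holding the last known (instant, fileId) location.
  private transient ValueState<HoodieRecordLocation> location;

  @Override
  public void open(Configuration parameters) {
    location = getRuntimeContext().getState(
        new ValueStateDescriptor<>("record-location", HoodieRecordLocation.class));
  }

  @Override
  public void processElement(HoodieRecord<T> record, Context ctx,
      Collector<HoodieRecord<T>> out) throws Exception {
    HoodieRecordLocation known = location.value();
    if (known != null) {
      // Seen before: tag the record so the writer treats it as an update.
      record.setCurrentLocation(known);
    }
    out.collect(record);
  }
}

Writing the new locations back into this state after a successful commit would then correspond to the updateLocation step of the index.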

Indexing is one of the steps in the write path; it runs inside the computation and is therefore closely tied to the computation engine. The existing indexes thus need corresponding implementations for each computation engine. The whole class diagram is shown below:

[Image: class diagram of the refactored index hierarchy]


HoodieIndex, as a public interface, should be refactored into engine-independent classes by generalizing the Spark-related types, like this:

...

Code Block
languagejava
/**
 * Flink binding of the engine-independent index: every Spark type in the
 * original HoodieIndex signatures is replaced by its DataStream counterpart.
 */
public class HoodieFlinkIndex<T extends HoodieRecordPayload>
  extends HoodieGeneralIndex<
    T,
    DataStream<HoodieRecord<T>>,
    DataStream<WriteStatus>,
    DataStream<HoodieKey>,
    DataStream<Tuple2<HoodieKey, Option<Tuple2<String, String>>>>> {

  public HoodieFlinkIndex(HoodieWriteConfig config) {
    super(config);
  }

  @Override
  public DataStream<Tuple2<HoodieKey, Option<Tuple2<String, String>>>> fetchRecordLocation(
      DataStream<HoodieKey> hoodieKeys, HoodieTable<T> hoodieTable) {
    return null; // stub: look up the (partitionPath, fileId) location for each key
  }

  @Override
  public DataStream<HoodieRecord<T>> tagLocation(DataStream<HoodieRecord<T>> records,
      HoodieTable<T> hoodieTable) throws HoodieIndexException {
    return null; // stub: tag each incoming record with its current location, if known
  }

  @Override
  public DataStream<WriteStatus> updateLocation(DataStream<WriteStatus> writeStatuses,
      HoodieTable<T> hoodieTable) throws HoodieIndexException {
    return null; // stub: persist the new locations produced by a completed write
  }

  //...
}
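
The base class itself is not spelled out here; one plausible shape, inferred from the type parameters bound by HoodieFlinkIndex above, would be:

Code Block
languagejava
/**
 * Sketch of the engine-independent base: I = input record set, W = write
 * statuses, K = keys, R = fetched record locations. A Spark subclass would
 * bind these to JavaRDD types, the Flink subclass to DataStream types.
 */
public abstract class HoodieGeneralIndex<T extends HoodieRecordPayload, I, W, K, R>
    implements Serializable {

  protected final HoodieWriteConfig config;

  protected HoodieGeneralIndex(HoodieWriteConfig config) {
    this.config = config;
  }

  public abstract R fetchRecordLocation(K keys, HoodieTable<T> table);

  public abstract I tagLocation(I records, HoodieTable<T> table) throws HoodieIndexException;

  public abstract W updateLocation(W writeStatuses, HoodieTable<T> table) throws HoodieIndexException;

  //...
}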


About HoodieTable, we can do the same refactoring via Java generics. For example, we can define a HoodieGenericTable like this:


Code Block
languagejava
/**
 * Abstract implementation of a HoodieTable.
 * @param <T> The specific type of the {@link HoodieRecordPayload}.
 * @param <EC> The specific context type of the computation engine.
 * @param <WSDS> The specific data set type of the {@link WriteStatus}.
 * @param <P> The specific partition type.
 */
public abstract class HoodieGenericTable<T extends HoodieRecordPayload, EC, WSDS, P> implements Serializable {
    //...
}

Then we will also introduce HoodieSparkTable and HoodieFlinkTable, just like HoodieSparkIndex and HoodieFlinkIndex.
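
For illustration, the two subclasses might bind the generic parameters roughly as follows. The concrete engine-context and partition types are open design points, so these bindings are only indicative:

Code Block
languagejava
// Spark binding: engine context = JavaSparkContext, write statuses as an RDD,
// org.apache.spark.Partition as the partition type.
public abstract class HoodieSparkTable<T extends HoodieRecordPayload>
    extends HoodieGenericTable<T, JavaSparkContext, JavaRDD<WriteStatus>, Partition> {
  //...
}

// Flink binding: engine context = StreamExecutionEnvironment, write statuses as
// a DataStream; Integer stands in as a placeholder partition id here.
public abstract class HoodieFlinkTable<T extends HoodieRecordPayload>
    extends HoodieGenericTable<T, StreamExecutionEnvironment, DataStream<WriteStatus>, Integer> {
  //...
}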





Rollout/Adoption Plan

None

...