
RFC-14: JDBC incremental puller

Proposers

Approvers

Status

Current state: UNDER DISCUSSION

Discussion thread: here

JIRA: here

Released: <Hudi Version>

Abstract

Provide an alternative, scalable ingestor that pulls data incrementally from databases through JDBC and handles reconciliation.

Background

This approach avoids the need to run a Kafka cluster just to stream data that is already at rest. The present JDBC connect also does not scale well for huge tables, because there is no distributed way to fetch partial data from a single table, so each table ends up being loaded by a single task. Sqoop is a scalable incremental puller, but unlike Sqoop we are going to avoid intermediate state and avoid adding extraneous data-lifting actions to the DAG.

Implementation

Motivation:

  1. Supporting data sources that do not support binlogs (SAP HANA, Tibero, Teradata) but do support SQL.
  2. Reducing resource wastage for batch-based sync: streaming data at rest through Kafka is overkill.
  3. Avoiding the maintenance of a Kafka ecosystem by bringing data directly from the sources.
  4. JDBC connect is sequential in nature: one table can only be loaded by a single task.


We have identified the major components of the incremental JDBC puller; illustrative sketches for the components follow the list.

  1. A component to determine the optimal number of partitions for fetching data from the source, with an upper limit on the number of tasks and a batch size (rows per file).
  2. A query builder based on the incremental logic configured by the user. Strategy types can be:
    1. Incrementing column
    2. Timestamp column
    3. Timestamp and incrementing columns
    4. Custom query-based
    5. Full refresh
  3. A component to execute partition operations independently, with a retry mechanism (Spark map).
  4. A component to handle schema evolution and database-specific type conversions.
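For component 1, a minimal sketch of how the partition count might be derived from the source row count, the configured batch size, and the task cap. The class and method names here are hypothetical, not existing Hudi APIs.

public final class PartitionPlanner {

  /**
   * @param totalRows    rows matched by the incremental predicate
   * @param rowsPerBatch desired rows per fetched file (batch size)
   * @param maxTasks     upper limit on concurrent fetch tasks
   * @return number of partitions to split the source query into
   */
  public static int numPartitions(long totalRows, long rowsPerBatch, int maxTasks) {
    long byBatchSize = (totalRows + rowsPerBatch - 1) / rowsPerBatch; // ceil(totalRows / rowsPerBatch)
    return (int) Math.max(1L, Math.min(byBatchSize, (long) maxTasks));
  }
}

For example, 10 million rows with a batch size of 500,000 yields 20 partitions, which a task cap of 8 would clamp down to 8.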
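For component 2, each strategy could map to a SQL predicate over the last checkpoint. The sketch below is illustrative only: the table and column parameters are assumptions, and a real implementation should use bind parameters instead of string formatting to avoid SQL injection.

enum IncrementalStrategy {
  INCREMENTING, TIMESTAMP, TIMESTAMP_AND_INCREMENTING, CUSTOM_QUERY, FULL_REFRESH
}

final class IncrementalQueryBuilder {

  // Builds the fetch query for one sync round; lastInc/lastTs come from the
  // previous checkpoint. CUSTOM_QUERY defers entirely to the user's SQL.
  static String build(IncrementalStrategy strategy, String table, String incCol,
                      String tsCol, long lastInc, String lastTs, String customQuery) {
    switch (strategy) {
      case INCREMENTING:
        return String.format("SELECT * FROM %s WHERE %s > %d", table, incCol, lastInc);
      case TIMESTAMP:
        return String.format("SELECT * FROM %s WHERE %s > '%s'", table, tsCol, lastTs);
      case TIMESTAMP_AND_INCREMENTING:
        return String.format("SELECT * FROM %s WHERE %s > '%s' OR (%s = '%s' AND %s > %d)",
            table, tsCol, lastTs, tsCol, lastTs, incCol, lastInc);
      case CUSTOM_QUERY:
        return customQuery;
      case FULL_REFRESH:
      default:
        return String.format("SELECT * FROM %s", table);
    }
  }
}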
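For components 1 and 3 together, Spark's built-in JDBC source already fetches partitions in parallel once a split column and bounds are known, and failed partition reads are re-executed through Spark's normal task retries (spark.task.maxFailures). A minimal sketch, assuming a hypothetical MySQL source and an id split column:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcPartitionedRead {

  public static Dataset<Row> fetch(SparkSession spark, String incrementalQuery,
                                   long lowerBound, long upperBound, int numPartitions) {
    return spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://source-host:3306/appdb")    // assumed source
        .option("dbtable", "(" + incrementalQuery + ") AS src")  // query from the builder above
        .option("partitionColumn", "id")                         // split column
        .option("lowerBound", String.valueOf(lowerBound))
        .option("upperBound", String.valueOf(upperBound))
        .option("numPartitions", String.valueOf(numPartitions))  // from the partition planner
        .load();
  }
}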
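For component 4, one possible starting point is a static mapping from java.sql.Types to the target record schema (Hudi records are Avro-backed). The subset below is illustrative; a full implementation must also handle precision, scale, logical types, nullability, and vendor-specific types.

import java.sql.Types;

final class JdbcTypeMapper {

  // Illustrative subset of a JDBC-to-Avro type mapping.
  static String avroType(int jdbcType) {
    switch (jdbcType) {
      case Types.TINYINT:
      case Types.SMALLINT:
      case Types.INTEGER:   return "int";
      case Types.BIGINT:    return "long";
      case Types.REAL:      return "float";
      case Types.FLOAT:
      case Types.DOUBLE:    return "double";
      case Types.BOOLEAN:   return "boolean";
      case Types.TIMESTAMP: return "long";   // paired with a timestamp-millis logical type
      default:              return "string"; // conservative fallback
    }
  }
}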

Rollout/Adoption Plan

Test Plan

