
RFC-14: JDBC incremental puller

Proposers

Approvers

Status

Current state: UNDER DISCUSSION

Discussion thread: here

JIRA: here

Released: <Hudi Version>

Abstract

Provide an alternative, scalable ingestor that pulls data incrementally from databases through JDBC and handles reconciliation.

Background

This approach avoids the need to run a Kafka cluster just to stream data that is already at rest. The present JDBC connect also does not scale well for huge tables, because there is no distributed way to fetch partial data from a single table, so each table ends up being loaded by a single task. Sqoop is a scalable incremental puller, but unlike Sqoop we are going to avoid intermediate state and avoid adding extraneous data-lifting actions to the DAG.

Implementation

Motivation:

  1. Supporting data sources that do not support binlogs (SAP HANA, Tibero, Teradata) but do support SQL.
  2. Reducing resource wastage for batch-based sync: streaming data at rest through Kafka is overkill.
  3. Avoiding the maintenance of a Kafka ecosystem by bringing data directly from the sources.
  4. JDBC connect is sequential in nature: one table can only be loaded by a single task.


We have identified the major components of the incremental JDBC puller; illustrative sketches for the components follow the list.

  1. A component to determine the optimal number of partitions for fetching data from the source, with an upper limit on the number of tasks and a batch size (rows per file).
  2. A query builder based on the incremental logic configured by the user. Strategy types can be:
    1. Incrementing column
    2. Timestamp column
    3. Timestamp and incrementing columns
    4. Custom query-based
    5. Full refresh
  3. A component to execute partition operations independently, with a retry mechanism (Spark map).
  4. A component to handle schema evolution and database-specific type conversions.
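For component 1, a minimal sketch of how the partition count might be derived from the source row count, the configured batch size, and the task cap. The class and method names here are hypothetical, not existing Hudi APIs.

public final class PartitionPlanner {

  /**
   * @param totalRows    rows matched by the incremental predicate
   * @param rowsPerBatch desired rows per fetched file (batch size)
   * @param maxTasks     upper limit on concurrent fetch tasks
   * @return number of partitions to split the source query into
   */
  public static int numPartitions(long totalRows, long rowsPerBatch, int maxTasks) {
    long byBatchSize = (totalRows + rowsPerBatch - 1) / rowsPerBatch; // ceil(totalRows / rowsPerBatch)
    return (int) Math.max(1L, Math.min(byBatchSize, (long) maxTasks));
  }
}

For example, 10 million rows with a batch size of 500,000 yields 20 partitions, which a task cap of 8 would clamp down to 8.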
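For component 2, each strategy could map to a SQL predicate over the last checkpoint. The sketch below is illustrative only: the table and column parameters are assumptions, and a real implementation should use bind parameters instead of string formatting to avoid SQL injection.

enum IncrementalStrategy {
  INCREMENTING, TIMESTAMP, TIMESTAMP_AND_INCREMENTING, CUSTOM_QUERY, FULL_REFRESH
}

final class IncrementalQueryBuilder {

  // Builds the fetch query for one sync round; lastInc/lastTs come from the
  // previous checkpoint. CUSTOM_QUERY defers entirely to the user's SQL.
  static String build(IncrementalStrategy strategy, String table, String incCol,
                      String tsCol, long lastInc, String lastTs, String customQuery) {
    switch (strategy) {
      case INCREMENTING:
        return String.format("SELECT * FROM %s WHERE %s > %d", table, incCol, lastInc);
      case TIMESTAMP:
        return String.format("SELECT * FROM %s WHERE %s > '%s'", table, tsCol, lastTs);
      case TIMESTAMP_AND_INCREMENTING:
        return String.format("SELECT * FROM %s WHERE %s > '%s' OR (%s = '%s' AND %s > %d)",
            table, tsCol, lastTs, tsCol, lastTs, incCol, lastInc);
      case CUSTOM_QUERY:
        return customQuery;
      case FULL_REFRESH:
      default:
        return String.format("SELECT * FROM %s", table);
    }
  }
}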
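For components 1 and 3 together, Spark's built-in JDBC source already fetches partitions in parallel once a split column and bounds are known, and failed partition reads are re-executed through Spark's normal task retries (spark.task.maxFailures). A minimal sketch, assuming a hypothetical MySQL source and an id split column:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcPartitionedRead {

  public static Dataset<Row> fetch(SparkSession spark, String incrementalQuery,
                                   long lowerBound, long upperBound, int numPartitions) {
    return spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://source-host:3306/appdb")    // assumed source
        .option("dbtable", "(" + incrementalQuery + ") AS src")  // query from the builder above
        .option("partitionColumn", "id")                         // split column
        .option("lowerBound", String.valueOf(lowerBound))
        .option("upperBound", String.valueOf(upperBound))
        .option("numPartitions", String.valueOf(numPartitions))  // from the partition planner
        .load();
  }
}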
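For component 4, one possible starting point is a static mapping from java.sql.Types to the target record schema (Hudi records are Avro-backed). The subset below is illustrative; a full implementation must also handle precision, scale, logical types, nullability, and vendor-specific types.

import java.sql.Types;

final class JdbcTypeMapper {

  // Illustrative subset of a JDBC-to-Avro type mapping.
  static String avroType(int jdbcType) {
    switch (jdbcType) {
      case Types.TINYINT:
      case Types.SMALLINT:
      case Types.INTEGER:   return "int";
      case Types.BIGINT:    return "long";
      case Types.REAL:      return "float";
      case Types.FLOAT:
      case Types.DOUBLE:    return "double";
      case Types.BOOLEAN:   return "boolean";
      case Types.TIMESTAMP: return "long";   // paired with a timestamp-millis logical type
      default:              return "string"; // conservative fallback
    }
  }
}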

Rollout/Adoption Plan

Test Plan

