Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Flink Doris Connector does not introduce any new interfaces or any existing interfaces that will be removed or changed.

Proposed Changes

Overall design

1. Source

There are two types of reading from DorisSource:

1.1. SCAN

Batch reading of Doris data is currently a bounded stream, usually used for data synchronization or joint analysis with other data sources.
1. First, the query will be spliced according to the query and sent to Doris to obtain the query plan.
2. The above Response will return the Tablet and BE node information where the query is located.
3. Use taskmanager to query specific tablet information concurrently


1.2. LOOKUP JOIN

For the scenario where the dimension table is in Doris, lookup join is performed, and JDBC is mainly used for querying.

2. Sink

Writing on the Doris side is mainly done through the Stream Load API , At the same time, Doris Sink will provide two writing methods

2.1. Streaming writing

When the Sink operator receives data, it will initiate a Stream Load request and maintain the http link until the Checkpoint ends to complete the data writing.

Exactly-Once

Stream Load provides two-phase commit api, refer to https://github.com/apache/doris/issues/7141
Combined with Stream Load's two-phase commit, end-to-end data consistency can be achieved based on Flink's two-phase commit.

2.2. Save batch writing

Streaming writing is submitted based on the checkpoint method and is strongly bound to the checkpoint, that is, the data visibility is the checkpoint interval. However, in some scenarios, the delay of user data needs to be decoupled from the checkpoint interval.

Batch writing is to cache data to the Sink, trigger writing based on thresholds such as the number of records, or periodically write the data in the cache to Doris.

Note:that batch writing provides at-least-once semantics and does not guarantee Exactly-Once semantics. However, it can be combined with Doris' primary key table to achieve Exactly-Once.


Compatibility, Deprecation, and Migration Plan

...