Jingsong Lee jingsonglee0@gmail.com

Discussion thread


Vote thread


ISSUE


Release

TBD

Motivation

Paimon primary key tables already use an LSM file structure, but Paimon does not yet provide a queryable service for lookups against them.

A distributed service can download Paimon files locally and serve lookups from them. It runs as a separate server and does not affect the write or read processes. It can be used for:

  1. Flink Lookup Join: one service can be reused by multiple Flink jobs.
  2. Online Service Lookup: this requires high stability, which the first version may not fully provide.

This PIP is a high-level design for the Paimon Query Service; it does not cover all details.

Overview

The Query Service needs to be independent of the normal read and write processes. This section describes how it is launched and registered, and how it is queried.

How to Launch

A Query Service can serve multiple tables. In the first version, the user needs to specify at service startup which tables it is responsible for.

The Query Service contains:

  1. Snapshot File Scanner: reads the latest snapshots of the tables and emits file changes to the downstream Executors.
  2. Executor: maintains the files for specific buckets and serves queries against them.
  3. Address Server: collects the addresses of all Executors and registers its own address in the Paimon table file system, so that clients can find it.
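How the three components could fit together is sketched below. This is only an illustration under stated assumptions: all class and method names are hypothetical (none are Paimon APIs), the modulo routing of buckets to Executors is assumed rather than specified by this PIP, and file handling is reduced to tracking file names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: these classes are hypothetical and not part of Paimon. */
public class QueryServiceSketch {

    /** A data file added to or removed from a bucket by a new snapshot. */
    static final class FileChange {
        final int bucket;
        final String fileName;
        final boolean added;

        FileChange(int bucket, String fileName, boolean added) {
            this.bucket = bucket;
            this.fileName = fileName;
            this.added = added;
        }
    }

    /** Keeps track of the live files of the buckets routed to it. */
    static final class Executor {
        final String address;
        final Map<Integer, List<String>> filesByBucket = new HashMap<>();

        Executor(String address) {
            this.address = address;
        }

        void apply(FileChange change) {
            List<String> files =
                    filesByBucket.computeIfAbsent(change.bucket, b -> new ArrayList<>());
            if (change.added) {
                files.add(change.fileName);
            } else {
                files.remove(change.fileName);
            }
        }
    }

    /** Assumed routing rule: bucket modulo the number of executors. */
    static Executor route(List<Executor> executors, int bucket) {
        return executors.get(Math.floorMod(bucket, executors.size()));
    }

    /** Scanner side: forward each file change to the executor that owns its bucket. */
    static void dispatch(List<Executor> executors, List<FileChange> changes) {
        for (FileChange change : changes) {
            route(executors, change.bucket).apply(change);
        }
    }

    /** Address server side: the bucket-to-address table that clients would fetch. */
    static Map<Integer, String> addressTable(List<Executor> executors, int numBuckets) {
        Map<Integer, String> table = new TreeMap<>();
        for (int bucket = 0; bucket < numBuckets; bucket++) {
            table.put(bucket, route(executors, bucket).address);
        }
        return table;
    }
}
```

In a real deployment the scanner, Executors, and Address Server would be separate operators or processes; here they share one class only to keep the sketch short.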

How to Query

Users only need to get the Paimon table from the Catalog (which requires the warehouse path) and create a TableQuery object. The TableQuery will:

  1. Find the address server through the Paimon table file system.
  2. Connect to the address server to get all executor addresses.
  3. Connect to the executors to look up by key.
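The three client steps can be sketched as follows. This is a minimal illustration, not the proposed API: the interfaces `AddressStore`, `AddressServer`, and `ExecutorClient` are hypothetical stand-ins for the file system read and the RPC calls, and both the bucket-to-address map and the `hashCode`-based bucket choice are assumptions made for the example.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Illustrative client flow; every type here is hypothetical, not a Paimon API. */
public class TableQuerySketch {

    /** Step 1: reads the address server location registered under the table path. */
    interface AddressStore {
        String readAddressServer(String tablePath);
    }

    /** Step 2: the address server reports which executor serves which bucket. */
    interface AddressServer {
        Map<Integer, String> executorAddresses();
    }

    /** Step 3: an executor answers a lookup for a key in one of its buckets. */
    interface ExecutorClient {
        Optional<String> lookup(int bucket, String key);
    }

    private final int numBuckets;
    private final Map<Integer, String> executors;
    private final Function<String, ExecutorClient> connectExecutor;

    TableQuerySketch(
            String tablePath,
            int numBuckets,
            AddressStore store,
            Function<String, AddressServer> connectAddressServer,
            Function<String, ExecutorClient> connectExecutor) {
        this.numBuckets = numBuckets;
        this.connectExecutor = connectExecutor;
        // Step 1: find the address server through the table file system.
        String addressServer = store.readAddressServer(tablePath);
        // Step 2: connect to it and fetch all executor addresses.
        this.executors = connectAddressServer.apply(addressServer).executorAddresses();
    }

    /** Step 3: pick the bucket for the key and ask the executor that owns it. */
    Optional<String> lookup(String key) {
        // Assumed bucket choice for the sketch; the real rule is up to the table.
        int bucket = Math.floorMod(key.hashCode(), numBuckets);
        String address = executors.get(bucket);
        if (address == null) {
            return Optional.empty();
        }
        return connectExecutor.apply(address).lookup(bucket, key);
    }
}
```

Passing the connection steps in as functions keeps the sketch independent of any particular RPC framework and makes it easy to exercise with in-memory fakes.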

Implementation

  1. Distributed: In the first version, we can launch this service as a separate Flink job. The topology should just be a DAG.
  2. RPC: The RPC for the Executors and the Address Server can use gRPC.
  3. TableQuery client: 
    1. Maintain the addresses of the Address Server and Executors. Fetch a new address and retry when an exception occurs.
    2. Maintain the connections to the Address Server and Executors. Open a new connection and retry when an exception occurs.
    3. Use the LookupLevels class to perform lookups; it already handles cache, IO, and disk management.
    4. Provide both single-key lookup and batch-key lookup.
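The retry and batch behaviour of the TableQuery client might look like the sketch below. It is a hedged illustration only: `lookupWithRetry`, `groupByBucket`, and the `Lookup` interface are hypothetical names, the address is simply re-resolved before every attempt, and the `hashCode`-based bucket choice is an assumption for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative retry and batching helpers; hypothetical, not Paimon APIs. */
public class LookupRetrySketch {

    /** A remote lookup against the executor at the given address. */
    interface Lookup<K, V> {
        V apply(String address, K key);
    }

    /**
     * Re-resolves the executor address before each attempt, so that a stale
     * address is replaced after a failure. Rethrows the last failure when
     * all attempts are exhausted.
     */
    static <K, V> V lookupWithRetry(
            Supplier<String> addressResolver, Lookup<K, V> lookup, K key, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String address = addressResolver.get();
            try {
                return lookup.apply(address, key);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again with a fresh address
            }
        }
        throw last;
    }

    /** Batch lookup helper: group keys by bucket so each executor gets one request. */
    static Map<Integer, List<String>> groupByBucket(List<String> keys, int numBuckets) {
        Map<Integer, List<String>> grouped = new HashMap<>();
        for (String key : keys) {
            // Assumed bucket choice for the sketch only.
            int bucket = Math.floorMod(key.hashCode(), numBuckets);
            grouped.computeIfAbsent(bucket, b -> new ArrayList<>()).add(key);
        }
        return grouped;
    }
}
```

A production client would likely add backoff between attempts and retry only on connection-level errors; both are left out to keep the sketch small.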

Compatibility, Deprecation, and Migration Plan

This is a new additional feature.
