...

To address the issues above, we propose a unified abstraction for the lookup source cache and its related metrics.

Proposed Changes

The proposal covers two kinds of caching strategies: LRU cache and all cache.

LRU cache

LRU is the most common caching strategy, dynamically evicting entries from the cache according to the given configuration. To support LRU caching in lookup tables, we propose several new interfaces that simplify the work for developers implementing lookup table functions and that enable caching as an optimization (a sketch of how these pieces fit together follows the list):

  • LookupFunction / AsyncLookupFunction, an extended version of TableFunction to clarify the semantics of lookup.
  • LookupCache / LookupCacheFactory, defining the cache and its factory used in lookup table.
  • DefaultLookupCacheFactory, a default implementation of an LRU cache that is suitable for most use cases.
  • LookupCacheMetricGroup, defining the metrics that should be reported by the lookup cache.
  • LookupFunctionProvider / AsyncLookupFunctionProvider, as the API interacting with table source to get LookupFunction and LookupCacheFactory.
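
To make the relationship between these interfaces concrete, the sketch below shows roughly what the lookup function and cache abstractions could look like. The method names and signatures here are illustrative assumptions, not the final API.

```java
import java.io.Serializable;
import java.util.Collection;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.functions.TableFunction;

// Sketch only; each type would live in its own file. Names and signatures are assumptions.

// Synchronous lookup with an explicit, cache-friendly entry point.
public abstract class LookupFunction extends TableFunction<RowData> {
    // Look up all rows matching the given key row.
    public abstract Collection<RowData> lookup(RowData keyRow) throws Exception;

    // eval() is what the runtime invokes; it delegates to lookup() and emits the results.
    public final void eval(Object... keys) throws Exception {
        lookup(GenericRowData.of(keys)).forEach(this::collect);
    }
}

// Cache abstraction shared by the caching strategies (sketch).
interface LookupCache {
    Collection<RowData> getIfPresent(RowData keyRow);

    Collection<RowData> put(RowData keyRow, Collection<RowData> lookupResult);
}

// Serializable factory so the planner can ship the cache to LookupJoinRunner,
// registering cache metrics against the given LookupCacheMetricGroup (sketch).
interface LookupCacheFactory extends Serializable {
    LookupCache createCache(LookupCacheMetricGroup metricGroup);
}
```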

The LRU cache serves as a component in LookupJoinRunner and is pluggable by specifying a LookupCacheFactory in the LookupFunctionProvider. The developer of a lookup table defines a LookupFunctionProvider / AsyncLookupFunctionProvider in their implementation of LookupTableSource to specify the LookupFunction and the factory of the cache; the planner then takes over the cache factory, passes it to the LookupJoinRunner, and the cache is instantiated during runtime execution.
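
As a usage illustration, a connector's LookupTableSource might wire the pieces together roughly as follows. MyLookupFunction, LookupFunctionProvider.of(...) and DefaultLookupCacheFactory.newBuilder() are hypothetical construction helpers used only to illustrate the flow described above.

```java
import java.time.Duration;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.LookupTableSource;

// Sketch of a connector wiring a lookup function and a cache factory together.
public class MyDimensionTableSource implements LookupTableSource {

    @Override
    public LookupRuntimeProvider getLookupRuntimeProvider(LookupContext context) {
        // Hypothetical builder API for the default LRU cache factory.
        LookupCacheFactory cacheFactory =
                DefaultLookupCacheFactory.newBuilder()
                        .maximumSize(10_000)                       // LRU eviction bound
                        .expireAfterWrite(Duration.ofMinutes(10))  // TTL for cached entries
                        .build();

        // The planner takes the factory from the provider, passes it to
        // LookupJoinRunner, and the cache is instantiated at runtime.
        return LookupFunctionProvider.of(new MyLookupFunction(), cacheFactory);
    }

    @Override
    public DynamicTableSource copy() {
        return new MyDimensionTableSource();
    }

    @Override
    public String asSummaryString() {
        return "MyDimensionTable";
    }
}
```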

All Cache

If the lookup table is small enough to fit into memory and does not change frequently, it is more efficient to load all entries of the lookup table into the cache and refresh the table periodically, reducing network I/O. We refer to this use case as "all cache". Logically the reload operation is a kind of scan, so we'd like to reuse ScanRuntimeProvider so that developers can reuse the scanning logic implemented in Source / SourceFunction / InputFormat. Considering the complexity of the Source API, we'd like to support the SourceFunction and InputFormat APIs first; supporting the Source API might require a new topology and will be discussed later in another FLIP.
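
Independent of the final Flink interfaces, the core of the "all cache" strategy can be sketched in plain Java as below: keep a full in-memory snapshot of the table and swap in a fresh one on a fixed schedule. The Supplier here merely stands in for the scan logic that would come from a Source / SourceFunction / InputFormat; the class and its names are illustrative only.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Plain-Java sketch of the "all cache" idea, not part of the proposed API.
public final class AllCacheSketch<K, V> implements AutoCloseable {

    private final AtomicReference<Map<K, Collection<V>>> snapshot =
            new AtomicReference<>(Map.of());
    private final ScheduledExecutorService reloader =
            Executors.newSingleThreadScheduledExecutor();

    public AllCacheSketch(Supplier<Map<K, Collection<V>>> fullScan, Duration reloadInterval) {
        // Load the whole table once eagerly, then refresh it on a fixed schedule.
        snapshot.set(fullScan.get());
        reloader.scheduleWithFixedDelay(
                () -> snapshot.set(fullScan.get()),
                reloadInterval.toMillis(),
                reloadInterval.toMillis(),
                TimeUnit.MILLISECONDS);
    }

    // Lookups never hit the external system; a miss simply yields an empty result.
    public Collection<V> lookup(K key) {
        return snapshot.get().getOrDefault(key, List.of());
    }

    @Override
    public void close() {
        reloader.shutdownNow();
    }
}
```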

Public Interfaces

Lookup Functions

...

We will use unit and integration tests to validate the functionality of the cache implementations.

Rejected Alternatives

...

Add cache in TableFunction implementations