Google Doc: https://docs.google.com/document/d/17CPMpMbPDjvM4selUVEfh_tqUK_oV0TODAUA9dfHakc/edit?usp=sharing
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Background and Motivation
When discussing how to support Hive built-in functions in the thread of “FLIP-57 Rework FunctionCatalog” (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html), a module approach was raised.
As we discussed and looked deeper, we think it’s a good opportunity to broaden the design and the corresponding problem it aims to solve. The motivation now is to expand Flink’s core table system and enable users to do customizations by writing pluggable modules.
There are two aspects of the motivation:
- Enable users to integrate Flink with cores and built-in objects of other systems, so users can reuse whatever they are familiar with in other SQL systems seamlessly as core and built-ins of Flink SQL and Table
- Empower users to write code and do customized development for the Flink table core
Modules define a set of metadata, including functions, user defined types, operators, rules, etc. Prebuilt modules will be provided, or users may choose to write their own. Flink will treat metadata from modules as extensions of its core built-in system that users can take advantage of. For example, users can define their own geo functions and geo data types and plug them into Flink table as built-in objects. As another example, users can use an off-the-shelf Hive module to use Hive built-in functions as part of Flink built-in functions.
Background - How Presto Supports Plugins
In Presto, users can write their own plugins by implementing the Plugin interface. In order for Presto to pick up the desired plugins at runtime, users have to drop all the plugin jars into a designated directory in the Presto installation.
Presto supports plugins via SPI. The class name of each plugin is provided to Presto via the standard Java ServiceLoader interface: the classpath contains a resource file named org.presto.spi.Plugin in the META-INF/services directory for discovery.
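The discovery mechanism described above can be sketched with plain java.util.ServiceLoader. This is a minimal illustration, not Presto's code: the Plugin interface below is a hypothetical stand-in with a single made-up name() method, and PluginDiscovery is a helper invented for this example. A provider is picked up only if its implementation class name is listed in a META-INF/services file named after the fully qualified interface name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-in for a plugin SPI; this is not Presto's actual interface.
interface Plugin {
    String name();
}

final class PluginDiscovery {

    // Instantiates every Plugin implementation that is registered through a
    // META-INF/services/<fully qualified name of Plugin> resource visible
    // to the given class loader.
    static List<Plugin> discover(ClassLoader classLoader) {
        List<Plugin> found = new ArrayList<>();
        for (Plugin plugin : ServiceLoader.load(Plugin.class, classLoader)) {
            found.add(plugin);
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints one line per registered provider; nothing if none is registered.
        for (Plugin plugin : discover(Thread.currentThread().getContextClassLoader())) {
            System.out.println("discovered plugin: " + plugin.name());
        }
    }
}
```

Dropping a jar that contains both an implementation and the matching services file onto the classpath is enough for discover to return it; no central registry has to be edited.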
Proposal
Scope
In this FLIP we’ll design and develop a generic mechanism for pluggable modules in Flink table core, with a focus on built-in functions.
We’ll specifically create two module implementations in this FLIP:
- CoreModule, with existing Flink built-in functions only
- HiveModule, supporting Hive built-in functions and numerous Hive versions
Overall Design
All modules will implement the Module interface. The Module interface defines a set of APIs to provide metadata such as functions, user defined types, operators, rules, etc. Each module can choose to provide all or only a subset of the metadata. All modules are managed by a ModuleManager, and all pluggable metadata are loaded on demand during object lookup.
Flink’s existing core metadata will also be a module, named “CoreModule”. Since we want to focus on supporting functions through modules, we’ll only migrate Flink’s existing built-in functions into the CoreModule as the first step.
All module metadata will be seen as a part of Flink table core, and won’t have namespaces.
Objects in modules are loaded on demand instead of eagerly, so there won't be inconsistency.
Users have to be fully aware of the consequences of resetting modules, as that might mean that some objects can no longer be referenced or that the resolution order of some objects changes. E.g. “CAST” and “AS” cannot be overridden in CoreModule, and users should be fully aware of that.
How to Load Modules
To load modules, users have to make sure the relevant classes are already on the classpath.
Java/Scala:
// new APIs to TableEnvironment
// unload a module instance from module list and other modules remain the same relative positions
// list all the modules' names according to order in module list
// note the following modules will be of the order they are specified
Yaml file:
modules: # note the following modules will be of the order they are specified
Based on the module type defined in the yaml file, the SQL CLI will invoke the factory service to find the factory class that provides the given module, and then set the created modules in TableEnvironment.
A few clarifications
- By default, the “modules” section of the yaml file is not in effect, and the core module will be loaded by default.
- If users specify the “modules” section in the yaml file, modules will be loaded strictly according to it. If CoreModule is not specified there, it won’t be loaded.
To guard against users forgetting to specify the core module, the “modules” section will be commented out in the yaml file as follows:
#modules: # note the following modules will be of the order they are specified
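The two clarifications above amount to a small decision rule, sketched below. ModuleConfig and modulesToLoad are hypothetical names used only for illustration, and a null argument is an assumed way to model a missing or commented-out “modules” section; the actual SQL CLI code may look quite different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ModuleConfig {

    // Returns the names of the modules to load, in load order.
    // 'configured' is the list from the yaml "modules" section, or null
    // when that section is absent or commented out (an assumption of this sketch).
    static List<String> modulesToLoad(List<String> configured) {
        if (configured == null) {
            // No "modules" section in effect: fall back to the core module.
            return Collections.singletonList("core");
        }
        // Section present: load exactly what is listed, in the listed order.
        // The core module is NOT added implicitly.
        return Collections.unmodifiableList(new ArrayList<>(configured));
    }
}
```

With this rule, a yaml file that lists only the hive module yields a session without Flink's own built-in functions, which is why the section ships commented out.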
SQL:
- SHOW MODULES: show module names in the existing module list in order
- LOAD MODULE 'name' [WITH ('type'='xxx', 'prop'='myProp', ...)] : load a module with given name and append to end of the module list
- UNLOAD MODULE 'name' : unload a module by name from module list and other modules remain the same relative positions
NOTE: the SQL syntax has been discussed again and received some modifications, see FLINK-21045 and the discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLINK-21045-Support-load-module-and-unload-module-SQL-syntax-td48398.html
Resolution Order
Objects will be resolved against modules in the order the modules are defined, either in the program or in yaml configs. When several objects share the same name, the resolution logic will go through the modules in order and return the first one found; matches in later modules will be ignored. E.g. if modules are set as “xxx, yyy” where the xxx and yyy modules both have a function named “f”, then “f” will always be resolved to the one in the xxx module.
This FLIP will not take into consideration how to enable users to use “f” in the yyy module. We may allow users to do so by using a whitelist/blacklist in the future, but that is not in the scope of this FLIP.
Besides, users may want to define different resolution orders for different kinds of metadata, e.g. “xxx, yyy” for functions, but “yyy, xxx” for data types. This is also out of scope for this FLIP. We can tackle that problem incrementally when there’s a real need from users.
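The first-match rule described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the proposed ModuleManager: SimpleModule and OrderedModules are hypothetical names, functions are represented by plain strings, and rejecting duplicate or unknown module names with an exception is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical simplified module: maps a function name to a description string.
interface SimpleModule {
    Optional<String> getFunction(String name);
}

final class OrderedModules {

    // LinkedHashMap iterates in insertion order, which doubles as resolution order.
    private final Map<String, SimpleModule> modules = new LinkedHashMap<>();

    // Appends the module to the end of the list.
    void load(String name, SimpleModule module) {
        if (modules.containsKey(name)) {
            throw new IllegalArgumentException("Module already loaded: " + name);
        }
        modules.put(name, module);
    }

    // Removes the module; the remaining modules keep their relative positions.
    void unload(String name) {
        if (modules.remove(name) == null) {
            throw new IllegalArgumentException("No such module: " + name);
        }
    }

    // Module names in resolution order.
    List<String> list() {
        return new ArrayList<>(modules.keySet());
    }

    // Returns the match from the first module that knows the name;
    // matches in later modules are never consulted.
    Optional<String> resolve(String functionName) {
        for (SimpleModule module : modules.values()) {
            Optional<String> result = module.getFunction(functionName);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

Loading “xxx” before “yyy” makes resolve("f") return the xxx version; the yyy version becomes visible only once xxx is unloaded, which is the kind of consequence users need to keep in mind when they reset modules.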
Classes
The following is a generic design with functions as a specific example.
Module Interface
The Module interface defines a set of metadata that a module can provide to Flink. It provides default implementations for all the APIs, so an implementation only needs to implement what it’s able to supply.
interface Module {
    default Optional<FunctionDefinition> getFunctionDefinition(String name) {
        return Optional.empty();
    }
}
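To illustrate why the default implementations matter, here is a self-contained sketch of two modules. FunctionDefinition is reduced to an empty stand-in interface so the example compiles without Flink, Module repeats only the single method shown above, and GeoModule, EmptyModule, and the function name geo_distance are hypothetical names invented for this example.

```java
import java.util.Optional;

// Empty stand-in for Flink's FunctionDefinition so the sketch compiles on its own.
interface FunctionDefinition {}

// Repeats the single method shown in the proposal.
interface Module {
    default Optional<FunctionDefinition> getFunctionDefinition(String name) {
        return Optional.empty();
    }
}

// Hypothetical module that supplies exactly one function.
final class GeoModule implements Module {

    static final FunctionDefinition DISTANCE = new FunctionDefinition() {};

    @Override
    public Optional<FunctionDefinition> getFunctionDefinition(String name) {
        // Exact-match lookup; how names are normalized is left open in this sketch.
        return "geo_distance".equals(name) ? Optional.of(DISTANCE) : Optional.empty();
    }
}

// A module with no functions to offer needs no overrides at all.
final class EmptyModule implements Module {}
```

Because lookups on a module that does not override the method simply return Optional.empty(), new kinds of metadata can later be added to the interface without breaking existing module implementations.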
ModuleFactory interface
ModuleFactory defines a factory that is used for descriptors to uniquely identify a module in service discovery, and create an instance of the module.
interface ModuleFactory extends TableFactory {
CoreModule and CoreModuleFactory
CoreModule is a pre-defined singleton module that should contain all built-in metadata of Flink core.
We currently only move built-in functions into CoreModule.
public class CoreModule implements Module {
        .collect(Collectors.toSet());
class CoreModuleFactory {
ModuleManager
ModuleManager is responsible for loading all the modules, managing their life cycles, and resolving module objects.
public class ModuleManager {
        modules.put("core", CoreModule.INSTANCE);
    public void loadModule(String name, Module module) { ... }
    public void unloadModule(String name) { ... }
    public Set<String> listFunctions() {
            .flatMap(e -> e.stream())
FunctionCatalog
FunctionCatalog will hold ModuleManager to resolve built-in functions.
class FunctionCatalog implements FunctionLookup {
    public Optional<FunctionLookup.Result> lookupFunction(String name) {
        // search built-in functions in ModuleManager, rather than BuiltInFunctionDefinitions
        // Resolution order depends on FLIP-57: Rework FunctionCatalog
    }
}
There were some proposals to merge FunctionCatalog with CatalogManager. They will not be considered in this FLIP.
How to Write and Use a Self-Defined Module - Using HiveModule as an Example
To support numerous Hive versions, we will use the shim approach, which is similar to that of the existing HiveCatalog. Letting users explicitly specify the Hive version is necessary since there are differences in Flink-Hive data conversions among different Hive versions.
public class HiveModule implements Module {
    private final String hiveVersion;
    public HiveModule(String hiveVersion) {
    @Override
    @Override
public abstract class HiveModuleFactory implements ModuleFactory {
Java/Scala:
tableEnv.loadModule("hive", new HiveModule("2.2.1"));
Yaml file:
modules:
  - type: core
    name: core
  - type: hive
    name: hive
    hive-version: 2.2.1
Limitations and Future Work
As mentioned above, though this FLIP provides a generic design and mechanism for all module object types we want to support, we will only implement functions. Other objects can be added incrementally later on.
Reference
- Presto SPI and Function Plugin