...
Page properties | ||||||||
---|---|---|---|---|---|---|---|---|
|
...
Concepts and Primitives Lack Organization
Currently, DataStream API provides primitives that corresponds to concepts of many different levels. This makes it difficult to maintain consistency across different interfaces and breaks our principle of make it as flexible as possible and easy to understand and use. The main reasons are as follows:
It contains the most fundamental concepts that defines a stateful stream processing application: data streams, transforms, states, etc.
It also allows defining arbitrary transforms via process functions and custom operators.
It provides high-level implementations and utilities for common data processing patterns: map, aggregate, join, window, etc.
There are also short-cuts for specific source / sink implementations, mainly for testing purpose.
...
We have been saying that DataStream API is stream-batch unified. But that doesn't mean everything in DataStream API is stream-batch unified. This breaks our principle of clear and unambiguous definition and easy to understand and use. For example, processing time only makes sense in real-time stream processing, while caching of intermediate results is only supported for batch processing. In addition, DataStream API does not support some batch specific operations (such as sort) because it is hard to define the behavior in unbounded stream scenarios. It's hard for people to understand which APIs are batch-stream unified and which are batch / stream dedicated.
Proposed Plan
...
To address the above issues, we want to introduce another set of API,DataStream API V2, and help users to migrate from the current DataStream API (hereinafter referred to as "V1")
...
smoothly. That is to say, once the DataStream API V2 is ready, the V1 will not be immediately removed, but will co-exist with V2 for a reasonable migration period.
Design Goals
Through this new API, we aim to achieve the following goals:
Re-organize the concepts:
Separate fundamental and high-level concepts: Define the most primitive parts required for the API clearly, and separate different sets of high-level supports like batch-stream unified functions, window... etc.
Improve the definitions and primitives.
User only depends on pure APIs: Providing a thin and pure APIs layer. Both the user's job and flink-runtime will depends on this API layer. This eliminates the need for user jobs to (indirectly) depend on runtime internals.
Eliminate internal-exposing apis / contents.
Scope and Principles
...
Introducing a new API is a huge topic, so let's first clarify the scope and principles of this proposed API.
- Only focusing on APIs: The Internal implementations can reused previous work(e.g. transformations and operations) as as much as possible as a first step, and refactored later in a compatible and user-not-aware way . This can make the discussion more focused on API semantics. Besides, some some refactoring work may be easier if done after removing DataStream V1 .
- Incremental additions: Datastream Datastream API V1 supports a wide range of features. Our plan is to incrementally add supports for these features in V2. Therefore, you may find many features are missing in the early sub-FLIPs. Eventually, it is required for V2 to have at least equivalent functionalities as V1 before deprecating and removing V1, except for functionalities that are determined unwanted.
- APIs are initially experimental: By convention, @Experimental APIs should be promoted to @PublicEvolving in 2 minor releases. But as we are developing incrementally, the whole set of API would take multiple release cycles to complete. Therefore, we proposed not to start counting the promoting period for the early-merged APIs, until the whole API set is completed. APIs are initially experimental: By convention, @Experimental APIs should be promoted to @PublicEvolving in 2 minor releases. But as we are developing incrementally, the whole set of API would take multiple release cycles to complete. Therefore, we proposed not to start counting the promoting period for the early-merged APIs, until the whole API set is completed.
Proposed Changes
Based on whether its functionality must be provided by the framework, we divide the relevant concepts into two categories: fundamental primitives and high-Level extensions.
...
The reason why we consider it as the fundamental primitives is that users themselves cannot perceive watermarks from different parallelisms, so they cannot align them. Similar requirements must be provided by the framework.
Processing Timer Service
Currently,
...
Flink uses a unified abstraction for both Processing Time and Event Time. This is no longer the case in the new API. The main reasons are as follows:
- Processing time only makes sense for real-time stream processing, while event time is meaningful for both real-time stream processing and historical batch processing. Even Even if a unified abstraction is adopted, users still need to understand and carefully deal with the differences between them. Otherwise, It is easy to produce unexpected behavior.
- The advantage of unified abstraction is that users can easily switch between the two different time semantics without significantly changing the time-based processing logics. However, in reality, we are barely aware of any cases where the same processing logics are reused for both time sementicssemantics.
Therefore,
...
in the new API, we only provide timer service for processing time. Event timer can be achieved by leveraging the watermark handling interfaces. The reason processing timer service is considered part of the fundamental primitives is that, it is required to work based on task's mailbox thread model, in order not to introduce concurrency issues.
High-Level Extensions
High-Level extensions are like short-cuts / sugars, without which users can probably still achieve the same behavior by working with the fundamental APIs, but would be a lot easier with the builtin supports. It includes Built-In Functions, Event Timer Service and Window.
...
A special type of generalized watermark with temporal semantics and an alignment algorithm for this watermark whose behavior remains consistent with before.
Some operators based on event time.
Some components to extract and handle event time: timestamp assigner, watermark generator...
Window
Window is a special processing mechanism for data on the stream: It split the stream into “buckets”, over which we can apply computations. Theoretically, window can be implemented on a process function by defining a series of states. Therefore, we consider it as a high-level extensionWindow is a special processing mechanism for data on the stream: It split the stream into “buckets”, over which we can apply computations. Theoretically, window can be implemented on a process function by defining a series of states. Therefore, we consider it as a high-level extension.
Move common dependencies into separate module
...
DataStream API V2's Support for Event Time
...
Event time is an important component of timely stream processing. It is worth discussing with a separate FLIP, especially since the new API no longer consider it as the fundamental semantics. This FLIP will focus on the following aspects:
How to Implement event time watermark via generalized watermark mechanism.
How to extract and process event time in DataStream API V2.
...