Status

Current state"Under Discussion"

...

JIRA: https://issues.apache.org/jira/browse/FLINK-XXXX


Motivation

As described in FLIP-131, we are aiming at deprecating the DataSet API in favour of the DataStream API and the Table API. After this work is done, users should be able to write a program using the DataStream API that will execute efficiently on both bounded and unbounded input data. But before we reach that point, it is worth discussing and agreeing on the semantics of some operations as we transition from the streaming world to the batch one.

...

This document discusses questions about semantics like the ones above and sketches a plan for “cleaning up” the current DataStream API by deprecating methods that were introduced in the past but have ended up being more of a maintenance burden.

Batch vs Streaming Scheduling: explicit or derived?

The user code of an application in Flink is translated into a graph of tasks similar to the one shown here. Flink’s runtime then schedules these tasks differently depending on whether they belong to a batch or a streaming application.

...

Rejected Alternative: An alternative could be to expose it through the executionConfig or directly in the Environments, but those options were rejected because we can achieve the same result through a configuration option.
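
As a rough sketch of how this could look for users, assuming the option ends up being named execution.runtime-mode with values STREAMING, BATCH, and AUTOMATIC (derived from the boundedness of the sources), and assuming an environment that accepts a Configuration at creation time; none of these names are final:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RuntimeModeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical option name: the proposal only mandates that the
            // scheduling mode is selectable via configuration, not via the API.
            Configuration conf = new Configuration();
            conf.setString("execution.runtime-mode", "BATCH"); // or STREAMING / AUTOMATIC

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);

            // The pipeline definition itself stays identical in both modes.
            env.fromElements(1, 2, 3).print();
            env.execute("runtime-mode sketch");
        }
    }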

Processing Time Support in Batch 

The notion of time is crucial in stream processing. Time in streaming comes in two flavours: Processing Time and Event Time. In a nutshell, Processing Time is the wall-clock time of the machine processing a record, at the instant it is processed, while Event Time is a timestamp usually embedded in the record itself, indicating, for example, when a transaction happened or when a measurement was taken by a sensor. For more details, please refer to the corresponding documentation page.
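
To make the distinction concrete, here is a minimal sketch of a function that observes both notions of time through the DataStream API (the class name is illustrative; the timer service and timestamp accessors are the existing ones):

    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Emits, for every record, the wall-clock (processing) time next to the
    // record's event-time timestamp, to show that the two are independent.
    public class TwoClocksFunction extends KeyedProcessFunction<String, String, String> {

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) {
            long processingTime = ctx.timerService().currentProcessingTime(); // wall clock
            Long eventTime = ctx.timestamp(); // embedded in the record; null if unassigned
            out.collect(value + " processed at " + processingTime + ", happened at " + eventTime);
        }
    }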

...

These options apply to BOTH batch and streaming, and they ensure that the same job written for streaming can also run in batch mode.

Event Time Support in Batch

Flink’s streaming runtime builds on the pessimistic assumption that there are no guarantees about the order of events. This means that events may arrive out of order, i.e. an event with timestamp t may come after an event with timestamp t+1. In addition, the system can never be sure that no more elements with timestamp t < T will come in the future. To mitigate the impact of this out-of-orderness, Flink, along with other frameworks in the streaming space, uses a heuristic called Watermarks. A watermark with timestamp T signals that no element with timestamp t < T will follow.
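
For reference, this is how an application declares its out-of-orderness tolerance with the WatermarkStrategy API introduced in FLIP-126; the Event type, its timestampMillis field, and the events input stream are illustrative:

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;

    // events: an existing DataStream<Event>. The watermark trails the highest
    // timestamp seen so far by 5 seconds, i.e. up to 5 seconds of out-of-orderness
    // are tolerated before an element is considered late.
    DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
            WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, previous) -> event.timestampMillis));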

...

Although the above does not directly imply any user-visible change, it has to be stressed that, in some cases, the same application executed in batch and in streaming mode may produce different results, because late elements are dropped in the streaming case.

Broadcast State

A powerful feature of the DataStream API is the Broadcast State pattern. This feature/pattern was introduced to allow users to implement use-cases where a “control” stream needs to be broadcast to all downstream tasks, and its elements, e.g. rules, need to be applied to all incoming elements from another stream. An example is a fraud-detection use-case where the rules evolve over time.
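
For context, this is the pattern in its current form; the Transaction, Rule, and Alert types as well as the two input streams are hypothetical, while the broadcast-state API itself is the existing one:

    import java.util.Map;

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.streaming.api.datastream.BroadcastStream;
    import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    MapStateDescriptor<String, Rule> rulesDescriptor =
            new MapStateDescriptor<>("rules", String.class, Rule.class);

    // The "control" stream is broadcast to all downstream tasks ...
    BroadcastStream<Rule> ruleBroadcast = ruleStream.broadcast(rulesDescriptor);

    // ... and connected with the keyed "data" stream.
    transactionStream
            .keyBy(tx -> tx.accountId)
            .connect(ruleBroadcast)
            .process(new KeyedBroadcastProcessFunction<String, Transaction, Rule, Alert>() {

                @Override
                public void processBroadcastElement(Rule rule, Context ctx, Collector<Alert> out)
                        throws Exception {
                    // Remember the latest version of each rule in broadcast state.
                    ctx.getBroadcastState(rulesDescriptor).put(rule.name, rule);
                }

                @Override
                public void processElement(Transaction tx, ReadOnlyContext ctx, Collector<Alert> out)
                        throws Exception {
                    // Apply all rules seen so far to each incoming element.
                    for (Map.Entry<String, Rule> entry :
                            ctx.getBroadcastState(rulesDescriptor).immutableEntries()) {
                        if (entry.getValue().matches(tx)) {
                            out.collect(new Alert(entry.getKey(), tx));
                        }
                    }
                }
            });

Note that the output depends on the order in which the two inputs are consumed, which is what motivates the proposal below to read the broadcast side first.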

...

Proposal (potentially not in 1.12): Build custom support for the broadcast state pattern, where the broadcast side is read first.

Relational methods in DataStream

As discussed in FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API), we see the Table API/SQL as the relational API, where we expect users to work with schemas and fields. Going forward, we envision the DataStream API to be a "slightly" lower-level API, with more explicit control over the execution graph, operations, and state. Having said that, we think it is worth deprecating, and in the future removing, all relational-style methods in DataStream, which often use reflection to access fields and are thus less performant than explicit extractors (see the sketch after the list below). These methods include:

  • DataStream#project
  • Windowed/KeyedStream#sum,min,max,minBy,maxBy
  • DataStream#keyBy where the key is specified with a field name or index
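
As a sketch of the difference, assuming a stream of Tuple2<String, Integer> word counts built from literal elements, the reflection-based style and its explicit replacement:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<Tuple2<String, Integer>> counts =
            env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 1));

    // Relational style: fields addressed by position and resolved via reflection.
    counts.keyBy(0).sum(1);

    // Explicit style: a key selector and a reduce function, no reflection involved.
    counts.keyBy(t -> t.f0)
          .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));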

Sinks

Current exactly-once sinks in the DataStream API rely heavily on Flink’s checkpointing mechanism and will not work with batch scheduling. Support for exactly-once sinks is outside the scope of this FLIP; a separate FLIP will follow soon.

Iterations

Iterations are outside the scope of this FLIP; they will be covered by a separate FLIP in the future.