Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

When users explicitly cache a table, the DAG is implicitly changed. This is because the optimizer may decide to change the cached node if it were not explicitly cached. As of now, when a node has more than one downstream node in the DAG, that node will be computed multiple times. This could be solved by introducing partial optimization for subgraphs, which is a feature available in Blink but not in Flink yet. This is an existing problem in Flink and this FLIP does not attempt to address that.

Future works

NOTE: There are a few phases to the default intermediate result storage. 

...

Goals

...

Prerequisites

...

Phase 1

...

Explicit cache, i.e. Table.cache()

...

Explicit cache removal, i.e. Table.invalidateCache()

...

RPC in TaskManager to remove result partitions.

...

Locality for default intermediate result storage.

...

Support of locality hint from ShuffleMaster

...

Phase 2

...

Pluggable external intermediate result storage.

...

Phase 3

...

Auto cache support, i.e cache at shuffle boundaries.

...

Old result partition eviction mechanism for ShuffleMaster / ShuffleService

...

Long Term

...

Locality in general for external intermediate result storage and external shuffle service.

...

Custom locality preference mechanism in Flink

...

Support of sophisticated optimization (use or not use cache)

...

Statistics on the intermediate results.

...

Cross-application intermediate result sharing.

...

External catalog service

This FLIP is focusing on phase 1. We will have new FLIPs for Phase 2 and Phase 3 in the future.

...