Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Discussion threadhere

JIRA

Jira
serverASF JIRA
columnskey,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyFLINK-1119913570

Released: TBD

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

NOTE: There are a few phases to the default intermediate result storage. 


Goals

Prerequisites

Phase 1

Explicit cache, i.e. Table.cache()


Explicit cache removal, i.e. Table.invalidateCache()

RPC in TaskManager to remove result partitions.

Locality for default intermediate result storage.

Support of locality hint from ShuffleMaster

Phase 2

Pluggable external intermediate result storage.


Phase 3

Auto cache support, i.e cache at shuffle boundaries.

Old result partition eviction mechanism for ShuffleMaster / ShuffleService

Long Term

Locality in general for external intermediate result storage and external shuffle service.

Custom locality preference mechanism in Flink

Support of sophisticated optimization (use or not use cache)

Statistics on the intermediate results.

Cross-application intermediate result sharing.

External catalog service

 The following section describes phase 1, which does not support pluggable intermediate result storage or auto caching.

...

To achieve auto caching, the cache service needs to be integrated with shuffle service. This is sort of a natural move. Inherently, external shuffle service and cache service share a lot of similarities:


Lifecycle

Access Pattern

Persistency

Mutability

PartitionedRequire SubPartitions

Shuffle Service

cross-stages

Scan

volatile

immutable

YesYes

Cache Service

cross-jobs

Scan

volatile

immutable

YesNo


Future works

...