Status

...

Page properties

Discussion thread

...

JIRA
	http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-1-Fine-grained-recovery-from-task-failures-td12510.html
Vote thread

...

JIRA	FLINK-4256 (ASF JIRA)

...

Release

1.9

...

Phase 1 (Released in 1.3)

...

With this we end up with the following pseudo-code for the core backtracking logic, which starts at a given task, backtracks upstream towards blocking result partitions, and from there proceeds downstream to all consumers:

// entry point for failover strategies
onTaskFailure(task):
    containingRegion = determineFailoverRegion(task)
    failoverRegion(containingRegion)

// alternatively return collection of vertices
private failoverRegion(containingRegion):
    if (!hasRegionBeenScheduled(containingRegion)) {
        // nothing to do
        return;
    }

    resultPartitions = determineNeededResultPartitions(containingRegion)
    for (resultPartition : resultPartitions) {
        if (isPartitionStillAvailable(resultPartition)) {
            // data still available, so in theory don't have to do anything
            // exact details depend on shuffle service implementation and
            // whether we can consume data from a TM without
            // a task being deployed on it
        } else {
            producerRegion = getProducerRegion(resultPartition)
            failoverRegion(producerRegion)
        }
    }

    reschedule(containingRegion)

    // restart all consumer regions that could be affected by this failover
    // make behavior configurable?
    consumerRegions = getConsumersForRegion(containingRegion)
    for (consumerRegion : consumerRegions) {
        failoverRegion(consumerRegion)
    }
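
To make the traversal order concrete, here is a minimal runnable sketch of the same backtracking in Python. It is not Flink code. The `Topology` class, its fields, and the region, task and partition names are hypothetical stand-ins for the helper functions in the pseudo-code. The sketch takes the "return collection" variant and returns the regions in restart order instead of rescheduling them. It also adds a `seen` set, which the pseudo-code does not have, on the assumption that a region should be visited at most once per failover so that the producer and consumer recursion cannot loop.

```python
from dataclasses import dataclass, field


@dataclass
class Topology:
    """Hypothetical, simplified view of the job's failover regions."""
    region_of: dict = field(default_factory=dict)   # task -> containing region
    inputs: dict = field(default_factory=dict)      # region -> consumed blocking partitions
    producer: dict = field(default_factory=dict)    # partition -> producing region
    consumers: dict = field(default_factory=dict)   # region -> consumer regions
    available: set = field(default_factory=set)     # partitions whose data is still readable
    scheduled: set = field(default_factory=set)     # regions that have been scheduled


def on_task_failure(topo, task):
    """Entry point: return the regions to restart, in restart order."""
    order, seen = [], set()
    _failover_region(topo, topo.region_of[task], seen, order)
    return order


def _failover_region(topo, region, seen, order):
    # Never-scheduled regions need no work; `seen` (an addition to the
    # pseudo-code) keeps the producer/consumer recursion from looping.
    if region not in topo.scheduled or region in seen:
        return
    seen.add(region)
    # Backtrack upstream: restart producers of partitions that were lost.
    for partition in topo.inputs.get(region, []):
        if partition not in topo.available:
            _failover_region(topo, topo.producer[partition], seen, order)
    order.append(region)  # stands in for reschedule(region)
    # Then go downstream to every consumer region.
    for consumer in topo.consumers.get(region, []):
        _failover_region(topo, consumer, seen, order)


# Three regions chained by blocking partitions: A -pA-> B -pB-> C.
topo = Topology(
    region_of={"task-b1": "B"},
    inputs={"B": ["pA"], "C": ["pB"]},
    producer={"pA": "A", "pB": "B"},
    consumers={"A": ["B"], "B": ["C"]},
    available={"pB"},            # pA was lost, so A must run again
    scheduled={"A", "B", "C"},
)
print(on_task_failure(topo, "task-b1"))  # ['A', 'B', 'C']
```

If `pA` were still available, only B and its consumer C would be restarted. If C had never been scheduled, the result would shrink to B alone.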

Partition life-cycle management

...
