ID	IEP-25
Author	Sergey Chugunov
Sponsor	Yakov Zhdanov
Created	20 Jun 2018
Status	DRAFT

Motivation

Partition Map Exchange is an internal process crucial for Ignite cluster maintaining consistent view of partitions distribution across all nodes in the cluster. Any event triggering change of partitions distribution triggers PME as well, e.g. server nodes joining and leaving, dynamic caches starts etc.

PME consists of two phases:

On the first phase coordinator notifies all nodes about starting new exchange process and waits for all nodes to reply with their local partition maps.
On the second phase coordinator processes all received local maps, prepares full partition map and sends it to all other nodes.

The important thing about PME is that when it is in progress no new transactions can be started as partition map is changing. So if PME hangs for any reason (bug in code, undelivered messages, slow nodes) the whole cluster freezes requiring manual intervention to bring it back to operational state (usually it means restarting nodes that prevents PME to finish).

Proposed solution is to stop nodes automatically in known situations in which PME hangs.

Description

The following scenarios should be covered:

1 Non-coordinator nodes not replying with local partition maps

When coordinator on phase 1 detects that particular nodes don't reply in time with their local partition maps, it may decide to forcibly stop these nodes to unblock exchange.

It should be possible for user to configure a minimal number of copies. If stopping these nodes would cause some partitions to have less copies, coordinator should print warning to logs and keep waiting.

2 Non-coordinator node not applying full partition map

When non-coordinator node (say nodeA) has sent local partition map successfully but doesn't receive full partition map in time it should check status of its exchange on coordinator.

If coordinator informs that exchange has already been finished nodeA should stop itself as its partition map is out-of-date with the rest of the cluster. If coordinator hasn't initiated exchange for a given topology version, non-coordinator nodes should kick it from cluster.

Risks and Assumptions

Proposal requires defining new policy in public API for scenario#1, new protocol should be developed for scenario#2 (non-coordinator node requests status of specific exchange from coordinator).

Tickets

key	summary	type	created	updated	due	assignee	reporter	priority	status	resolution
JQL and issue key arguments for this macro require at least one Jira application link to be configured

Page tree

IEP-25: Partition Map Exchange hangs resolving