ID	IEP-14
Author	Anton Vinogradov, Andrey Gura
Sponsor	Anton Vinogradov Dmitry
Created	Feb 20 2018
Status	DRAFT

Motivation

Apache Ignite needs a general approach to handling critical failures.

Description

The following failures should be treated as critical:

System critical errors (e.g. OutOfMemoryError);
Unintentional system worker termination (e.g. due to an unhandled exception);
Cluster node segmentation.

Users should be able to define node behavior in case of these failures.

A system critical error is an error that leaves the system inoperable.

The following system critical errors should be handled with the proposed approach:

File I/O errors. Usually these are IOExceptions thrown by read/write operations on the file system. The following subsystems should be considered critical:
- WAL
- Page store
- Meta store
- Binary meta store
IgniteOutOfMemoryException
OutOfMemoryError (some memory should be reserved at node startup to increase the chances of handling an OOM).
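The memory reserve mentioned for OutOfMemoryError could look roughly like the sketch below. The proposal does not specify a mechanism, so everything here is an assumption: the class name OomReserveSketch, the runGuarded method, and the idea of holding a plain byte array that is released before the error is passed on are all hypothetical.

```java
import java.util.function.Consumer;

// Hypothetical sketch: hold a block of heap from startup and release it when an
// OutOfMemoryError is caught, so the failure handling code has some headroom.
final class OomReserveSketch {
    private volatile byte[] reserve;

    OomReserveSketch(int reserveBytes) {
        // Allocated once, at "node startup" in terms of the proposal.
        reserve = new byte[reserveBytes];
    }

    /** Runs the task; returns true on success, false if an OOM was caught and reported. */
    boolean runGuarded(Runnable task, Consumer<Throwable> onCritical) {
        try {
            task.run();
            return true;
        }
        catch (OutOfMemoryError e) {
            reserve = null;       // Drop the reference so the reserve becomes collectable.
            onCritical.accept(e); // Stand-in for passing the error to a failure handler.
            return false;
        }
    }

    boolean reserveHeld() {
        return reserve != null;
    }
}
```

Releasing the reserve only raises the odds that the handler can run; it gives no guarantee, which matches the "increase the chances" wording above.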

The following system workers are critical; an Ignite node becomes inoperative if any one of them terminates:

disco-event-worker
tcp-disco-sock-reader
tcp-disco-srvr
tcp-disco-msg-worker
tcp-comm-worker
grid-nio-worker-tcp-comm
exchange-worker
sys-stripe
grid-timeout-worker
db-checkpoint-thread
wal-file-archiver
wal-write-worker
ttl-cleanup-worker
nio-acceptor
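One way to detect unintentional termination of such a worker is to wrap its body and report any exit that was not requested. The following is a minimal sketch under that assumption; CriticalWorkerSketch, its cancel() flag, and the onTermination callback are hypothetical names and not part of Ignite.

```java
import java.util.function.BiConsumer;

// Hypothetical wrapper around a critical worker body. Any exit that was not
// preceded by cancel() is treated as an unintentional termination.
final class CriticalWorkerSketch implements Runnable {
    private final String name;
    private final Runnable body;
    private final BiConsumer<String, Throwable> onTermination;
    private volatile boolean cancelled;

    CriticalWorkerSketch(String name, Runnable body, BiConsumer<String, Throwable> onTermination) {
        this.name = name;
        this.body = body;
        this.onTermination = onTermination;
    }

    /** Marks the upcoming exit as intentional (for example, during node stop). */
    void cancel() {
        cancelled = true;
    }

    @Override public void run() {
        Throwable err = null;

        try {
            body.run();
        }
        catch (Throwable t) {
            err = t; // Unhandled exception escaped the worker body.
        }

        // Report both abnormal exits and normal returns that nobody asked for.
        if (err != null || !cancelled)
            onTermination.accept(name, err);
    }
}
```

In this sketch the callback receives a null error when the body simply returned without being cancelled, so a handler would need to tolerate that case.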

Important notes:

More than one Ignite node can be started in one JVM process.
Different nodes in one JVM process can belong to different clusters.

Initial design

IgniteConfiguration should be extended with methods:

public IgniteConfiguration setFailureHandler(FailureHandler hnd);

public FailureHandler getFailureHandler();

Where:

interface FailureHandler {
   void onFailure(FailureContext failureCtx);
}

class FailureContext {
   FailureType type;
   Throwable error;
}

enum FailureType {
   SEGMENTATION,
   SYSTEM_WORKER_TERMINATION,
   CRITICAL_ERROR
}

A FailureHandler implementation handles Ignite critical failures according to the strategy provided by the user.
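As an illustration of a user-provided strategy, the sketch below implements the proposed interface against local stand-in types that mirror the definitions above (they are not the real Ignite classes). RecordingFailureHandler and its policy of requesting a stop for everything except segmentation are hypothetical and chosen only to show how a handler can branch on FailureType.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-ins mirroring the types proposed above; not the real Ignite classes.
enum FailureType { SEGMENTATION, SYSTEM_WORKER_TERMINATION, CRITICAL_ERROR }

final class FailureContext {
    final FailureType type;
    final Throwable error;

    FailureContext(FailureType type, Throwable error) {
        this.type = type;
        this.error = error;
    }
}

interface FailureHandler {
    void onFailure(FailureContext failureCtx);
}

// Hypothetical user strategy: record every failure and request a node stop
// for every failure type except segmentation.
final class RecordingFailureHandler implements FailureHandler {
    final List<FailureContext> seen = new ArrayList<>();
    boolean stopRequested;

    @Override public void onFailure(FailureContext failureCtx) {
        seen.add(failureCtx);

        if (failureCtx.type != FailureType.SEGMENTATION)
            stopRequested = true; // A real handler would stop the node here.
    }
}
```

With the proposed configuration methods, such a handler would presumably be registered through setFailureHandler on IgniteConfiguration before the node starts.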

The following implementations should be provided out of the box:

NoOpFailureHandler - Ignores any failure. It is useful for tests and debugging.
RestartProcessFailureHandler - A specific implementation that can be used only with ignite.(sh|bat). The process must be terminated using an Ignition.restart(true) call.
StopNodeFailureHandler - Stops the Ignite node in case of a critical error using an Ignition.stop(true) or Ignition.stop(nodeName, true) call.
StopNodeOrHaltFailureHandler(boolean tryStop, long timeout) - Tries to stop the node if tryStop is true. If the node cannot be stopped within the given timeout, or if tryStop is false, the JVM process is terminated forcibly (Runtime.halt()).

The default failure handler is StopNodeOrHaltFailureHandler with tryStop set to false.
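The stop-or-halt behavior described above can be sketched as follows. This is not the proposed implementation: StopOrHaltSketch is a hypothetical class, and the node stop and the JVM halt are passed in as Runnable stand-ins (for Ignition.stop(nodeName, true) and Runtime.getRuntime().halt(...)) so that the control flow can be exercised without terminating a process.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the "try to stop, otherwise halt" control flow.
final class StopOrHaltSketch {
    private final boolean tryStop;
    private final long timeoutMs;
    private final Runnable stopNode; // Stand-in for Ignition.stop(nodeName, true).
    private final Runnable halt;     // Stand-in for Runtime.getRuntime().halt(status).

    StopOrHaltSketch(boolean tryStop, long timeoutMs, Runnable stopNode, Runnable halt) {
        this.tryStop = tryStop;
        this.timeoutMs = timeoutMs;
        this.stopNode = stopNode;
        this.halt = halt;
    }

    void onFailure() {
        if (tryStop) {
            CountDownLatch stopped = new CountDownLatch(1);

            // Stop in a separate daemon thread so a hung stop cannot block the handler.
            Thread stopper = new Thread(() -> {
                stopNode.run();
                stopped.countDown();
            }, "node-stopper");

            stopper.setDaemon(true);
            stopper.start();

            try {
                if (stopped.await(timeoutMs, TimeUnit.MILLISECONDS))
                    return; // Node stopped within the timeout; no halt needed.
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Fall through to halt.
            }
        }

        halt.run(); // tryStop is false, or the stop did not complete in time.
    }
}
```

Because several nodes may share one JVM process, halting affects all of them, which is one reason a sketch like this attempts a node-level stop first when tryStop is true.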

Risks and Assumptions

Discussion Links

Reference Links

Tickets

