ID

IEP-14

Author

Andrey Gura

Sponsor

Anton Vinogradov

Andrey Gura

Dmitry

Created

Feb 20 2018

Status

DRAFT

Done

Motivation

...

System critical errors (e.g. OutOfMemoryError);
Unintentional system worker termination (e.g. due to an unhandled exception);
Cluster node segmentation.

...

The following system critical errors should be handled with the proposed approach:

File I/O errors. Usually these are IOExceptions thrown by read/write operations on the file system. The following subsystems should be considered critical:
- WAL
- Page store
- Meta store
- Binary meta store
IgniteOutOfMemoryException
OutOfMemoryError (some memory should be reserved at node startup to increase the chances of handling an OOM)
AssertionError (assertions should be handled as failures when the -ea flag is set; this should also be covered by the Throwable catch in every system worker).

The following system workers are critical. An Ignite node becomes inoperative if any one of them terminates:

disco-event-worker
tcp-disco-sock-reader
tcp-disco-srvr
tcp-disco-msg-worker
tcp-comm-worker
grid-nio-worker-tcp-comm
exchange-worker
sys-stripe
grid-timeout-worker
db-checkpoint-thread
wal-file-archiver
wal-write-worker
wal-file-decompressor
ttl-cleanup-worker
nio-acceptor

Important notes:

More than one Ignite node could be started in one JVM process.
Different nodes in one JVM process could belong to different clusters.

Initial design

IgniteConfiguration should be extended with methods:

...

Code Block

language	java

interface FailureHandler {
   FailureAction onFailure(Ignite ignite, FailureContext failureCtx);
}

class FailureContext {
   FailureType type;
   Throwable error;
}

enum FailureAction {
   TERMINATE_JVM,
   STOP_NODE,
   NO_OP;
}

enum FailureType {
   SEGMENTATION,
   SYSTEM_WORKER_TERMINATION,
   CRITICAL_ERROR
}

A FailureHandler implementation decides how to handle (see FailureAction) each registered failure (see FailureContext).

DefaultFailureHandler must be used by default unless the user provides a specific implementation. DefaultFailureHandler must return the STOP_NODE action for the SEGMENTATION failure type and TERMINATE_JVM for all other failures. Users can rely on inheritance or composition to reuse the default failure handler behavior.

FailureProcessor is responsible for carrying out the failure action according to the value returned by the FailureHandler implementation.
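The interaction between the handler and the processor described above can be sketched as follows. This is a minimal, self-contained illustration rather than Ignite code: the types are re-declared locally, the Ignite argument of onFailure is omitted for brevity, and the stopNode and haltJvm callbacks passed to the processor are hypothetical stand-ins for the real node stop and JVM termination logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: local re-declarations of the types from the design above. */
final class FailureDispatchSketch {
    enum FailureAction { TERMINATE_JVM, STOP_NODE, NO_OP }

    enum FailureType { SEGMENTATION, SYSTEM_WORKER_TERMINATION, CRITICAL_ERROR }

    static final class FailureContext {
        final FailureType type;
        final Throwable error;

        FailureContext(FailureType type, Throwable error) {
            this.type = type;
            this.error = error;
        }
    }

    /** Simplified handler: the Ignite argument from the design is omitted here. */
    interface FailureHandler {
        FailureAction onFailure(FailureContext failureCtx);
    }

    /** Default policy from the text: STOP_NODE for segmentation, TERMINATE_JVM otherwise. */
    static final class DefaultFailureHandler implements FailureHandler {
        @Override public FailureAction onFailure(FailureContext failureCtx) {
            return failureCtx.type == FailureType.SEGMENTATION
                ? FailureAction.STOP_NODE
                : FailureAction.TERMINATE_JVM;
        }
    }

    /** Hypothetical processor: maps the action returned by the handler to a callback. */
    static final class FailureProcessor {
        private final FailureHandler hnd;
        private final Runnable stopNode; // Assumption: stands in for stopping the node.
        private final Runnable haltJvm;  // Assumption: stands in for terminating the JVM.

        FailureProcessor(FailureHandler hnd, Runnable stopNode, Runnable haltJvm) {
            this.hnd = hnd;
            this.stopNode = stopNode;
            this.haltJvm = haltJvm;
        }

        FailureAction process(FailureContext failureCtx) {
            FailureAction action = hnd.onFailure(failureCtx);

            switch (action) {
                case STOP_NODE:
                    stopNode.run();
                    break;
                case TERMINATE_JVM:
                    haltJvm.run();
                    break;
                case NO_OP:
                    break;
            }

            return action;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();

        FailureProcessor proc = new FailureProcessor(
            new DefaultFailureHandler(), () -> log.add("stop"), () -> log.add("halt"));

        proc.process(new FailureContext(FailureType.SEGMENTATION, null));
        proc.process(new FailureContext(FailureType.CRITICAL_ERROR, new OutOfMemoryError()));

        System.out.println(log); // prints [stop, halt]
    }
}
```

Passing the stop and halt logic in as callbacks keeps the decision (the handler) separate from the execution (the processor), which is also what makes composition with the default handler straightforward.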

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html

Reference Links

...

FailureHandler implementations handle Ignite critical failures according to the strategy provided by the user.

The following implementations should be provided out of the box:

NoOpFailureHandler - Just ignores any failure. It is useful for tests and debugging.
RestartProcessFailureHandler - A specific implementation that can be used only with ignite.(sh|bat). The process must be terminated using an Ignition.restart(true) call.
StopNodeFailureHandler - This implementation stops the Ignite node in case of a critical error using an Ignition.stop(true) or Ignition.stop(nodeName, true) call.
StopNodeOrHaltFailureHandler(boolean tryStop, long timeout) - This implementation tries to stop the node if the tryStop value is true. If the node cannot be stopped within the provided timeout, or the tryStop value is false, the JVM process is terminated forcibly (Runtime.halt()).

The default failure handler is StopNodeOrHaltFailureHandler with the tryStop value set to false.
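The stop-or-halt behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the Ignite implementation: the class name StopOrHaltSketch is hypothetical, and the stopNode and halt callbacks stand in for Ignition.stop(nodeName, true) and Runtime.getRuntime().halt(...) so that the logic can be run without terminating a JVM.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Sketch only: try a graceful stop within a timeout, otherwise halt. */
final class StopOrHaltSketch {
    private final boolean tryStop;
    private final long timeoutMs;
    private final Runnable stopNode; // Assumption: stands in for Ignition.stop(nodeName, true).
    private final Runnable halt;     // Assumption: stands in for Runtime.getRuntime().halt(...).

    StopOrHaltSketch(boolean tryStop, long timeoutMs, Runnable stopNode, Runnable halt) {
        this.tryStop = tryStop;
        this.timeoutMs = timeoutMs;
        this.stopNode = stopNode;
        this.halt = halt;
    }

    /** Returns true if the node stopped within the timeout, false if halt was requested. */
    boolean onFailure() {
        if (tryStop) {
            CountDownLatch stopped = new CountDownLatch(1);

            // Stop in a separate thread so that a hanging stop cannot block the halt.
            Thread stopper = new Thread(() -> {
                stopNode.run();
                stopped.countDown();
            }, "node-stopper");

            stopper.setDaemon(true);
            stopper.start();

            try {
                if (stopped.await(timeoutMs, TimeUnit.MILLISECONDS))
                    return true;
            }
            catch (InterruptedException e) {
                // Restore the interrupt flag and fall through to the forced halt.
                Thread.currentThread().interrupt();
            }
        }

        halt.run();

        return false;
    }

    public static void main(String[] args) {
        Runnable halt = () -> System.out.println("halt requested");

        // A stop that completes immediately: no halt is needed.
        System.out.println(new StopOrHaltSketch(true, 1000, () -> { }, halt).onFailure());

        // tryStop == false: the stop callback is skipped and halt is requested directly.
        System.out.println(new StopOrHaltSketch(false, 1000, () -> { }, halt).onFailure());
    }
}
```

Running the stop in a daemon thread reflects the motivation for the timeout: a node that is already failing may hang during shutdown, and the forced halt must not depend on that shutdown ever returning.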

A critical system worker must catch all exceptions (Throwable and derived classes) in a top-level try-catch block and take into account that the thread could be terminated by a programming mistake that leads to unintentional worker termination. The basic template should look like the following code snippet:

Code Block

language	java

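@Override
public void run() {
    Throwable err = null;

    try {
        // Critical worker's code.
    }
    catch (Throwable e) {
        // Remember the cause so it can be reported to the failure handler.
        err = e;
    }
    finally {
        // The worker is never expected to exit, so any termination is reported;
        // err stays null if the worker body returned without throwing.
        FailureContext failureCtx = new FailureContext(FailureType.SYSTEM_WORKER_TERMINATION, err);

        ctx.failure().process(failureCtx); // Handle the failure; ctx is the kernal context.
    }
}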
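To show the template in action, the following self-contained sketch wraps an arbitrary worker body and reports its termination. It is not Ignite code: the RecordingFailureProcessor class and the criticalWorker helper are hypothetical, and the processor only records the contexts it receives instead of invoking a failure handler.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: the worker template applied to an arbitrary body. */
final class CriticalWorkerSketch {
    enum FailureType { SYSTEM_WORKER_TERMINATION }

    static final class FailureContext {
        final FailureType type;
        final Throwable error;

        FailureContext(FailureType type, Throwable error) {
            this.type = type;
            this.error = error;
        }
    }

    /** Hypothetical stand-in for ctx.failure(): records every reported failure. */
    static final class RecordingFailureProcessor {
        final List<FailureContext> seen = new ArrayList<>();

        void process(FailureContext failureCtx) {
            seen.add(failureCtx);
        }
    }

    /** Wraps a worker body so that any termination, exceptional or not, is reported. */
    static Runnable criticalWorker(Runnable body, RecordingFailureProcessor failure) {
        return () -> {
            Throwable err = null;

            try {
                body.run();
            }
            catch (Throwable e) {
                err = e;
            }
            finally {
                failure.process(new FailureContext(FailureType.SYSTEM_WORKER_TERMINATION, err));
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        RecordingFailureProcessor failure = new RecordingFailureProcessor();

        Runnable worker = criticalWorker(() -> {
            throw new IllegalStateException("unexpected bug in worker body");
        }, failure);

        Thread t = new Thread(worker, "critical-worker-sketch");

        t.start();
        t.join();

        // The exception did not escape the thread; it was captured and reported.
        System.out.println(failure.seen.get(0).type + ": " + failure.seen.get(0).error);
    }
}
```

Note that a body that simply returns is also reported, with a null error, because a critical worker is expected to run for the lifetime of the node.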

Example of using FailureHandler in IgniteConfiguration via Spring XML:

Code Block

language	xml

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureHandler">
        <bean class="org.apache.ignite.failure.StopNodeFailureHandler"/>
    </property>
</bean>

Tickets

ASF JIRA issues matching the query: project = Ignite AND labels IN (iep-14) ORDER BY status
