ID: IEP-14
Author: Anton Vinogradov, Andrey Gura
Sponsor: Anton Vinogradov, Dmitry
Created: Feb 20 2018
Status: DRAFT


Motivation

Apache Ignite should have a general approach to handling critical failures.

Description

The following failures should be treated as critical:

  • System critical errors (e.g. OutOfMemoryError);
  • Unintentional system worker termination due to an unhandled exception;
  • Cluster node segmentation.

Users should be able to define node behavior in case of these failures.

A system critical error is an error that leaves the system inoperable.

The following system critical errors should be handled with the proposed approach:

  • IO errors, usually IOExceptions thrown by read/write operations on the file system. The following subsystems should be considered critical:
    • WAL
    • Page store
    • Meta store
    • Binary meta store
  • IgniteOutOfMemoryException
  • OutOfMemoryError (some memory should be reserved at node startup to increase the chances of handling an OOM)
  • AssertionError (assertions should be treated as failures when the -ea flag is set; this should also be covered by the Throwable catch in every system worker).
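
The OutOfMemoryError item above suggests reserving memory at node startup. A minimal sketch of that idea follows. The class MemoryReserve, its method names, and the callback type are hypothetical and are not part of this proposal; whether freeing a reserve actually leaves enough headroom depends on the JVM and on where the error occurred.

```java
import java.util.function.Consumer;

/** Hypothetical helper: holds a block of heap that is released when an OutOfMemoryError is handled. */
final class MemoryReserve {
    /** Reserved block; set to null on release so the GC can reclaim it. */
    private volatile byte[] reserve;

    MemoryReserve(int bytes) {
        reserve = new byte[bytes];
    }

    /** Frees the reserved block. Returns true only on the first call. */
    synchronized boolean release() {
        if (reserve == null)
            return false;

        reserve = null;

        return true;
    }

    /** Tells whether the reserve has not been released yet. */
    boolean isHeld() {
        return reserve != null;
    }

    /** Runs the task; on OutOfMemoryError frees the reserve, then invokes the failure callback. */
    void runGuarded(Runnable task, Consumer<Throwable> onFailure) {
        try {
            task.run();
        }
        catch (OutOfMemoryError e) {
            release(); // Give the callback some headroom to log and shut down.

            onFailure.accept(e);
        }
    }
}
```

A node could allocate one such reserve at startup and route the callback to whatever component ends up processing failures.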

The following system workers are critical; an Ignite node becomes inoperative if any one of them terminates:

  • disco-event-worker
  • tcp-disco-sock-reader
  • tcp-disco-srvr
  • tcp-disco-msg-worker
  • tcp-comm-worker
  • grid-nio-worker-tcp-comm
  • exchange-worker
  • sys-stripe
  • grid-timeout-worker
  • db-checkpoint-thread
  • wal-file-archiver
  • wal-write-worker
  • ttl-cleanup-worker
  • nio-acceptor
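
One way to detect unintentional termination of the workers listed above is to wrap each worker body in a catch-all for Throwable. The sketch below is only an illustration: the GuardedWorker class and the plain Consumer callback that stands in for the failure handling entry point are hypothetical and not part of Ignite's API.

```java
import java.util.function.Consumer;

/** Hypothetical wrapper that reports any Throwable escaping a system worker's body. */
final class GuardedWorker implements Runnable {
    private final String name;
    private final Runnable body;
    private final Consumer<Throwable> onTermination;

    GuardedWorker(String name, Runnable body, Consumer<Throwable> onTermination) {
        this.name = name;
        this.body = body;
        this.onTermination = onTermination;
    }

    @Override public void run() {
        try {
            body.run();
        }
        catch (Throwable t) {
            // Covers Exceptions as well as Errors such as AssertionError or OutOfMemoryError.
            onTermination.accept(new IllegalStateException("System worker terminated: " + name, t));
        }
    }
}
```

Such a wrapper could be passed to a thread in place of the raw worker body, for example new Thread(new GuardedWorker("exchange-worker", body, callback), "exchange-worker").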

Initial design

IgniteConfiguration should be extended with methods:

public IgniteConfiguration setFailureHandler(FailureHandler hnd);

public FailureHandler getFailureHandler();

 

Where:

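/** Decides how the node reacts to a failure. */
interface FailureHandler {
   /** Invoked for each registered failure; returns the action to take. */
   FailureAction onFailure(FailureContext failureCtx);
}

/** Describes a failure: its type and the associated error. */
class FailureContext {
   FailureType type;
   Throwable error;
}

/** Actions a node can take in response to a failure. */
enum FailureAction {
   RESTART_JVM,
   STOP_NODE,
   NO_OP;
}

/** Kinds of critical failure. */
enum FailureType {
   SEGMENTATION,
   SYSTEM_WORKER_TERMINATION,
   CRITICAL_ERROR
}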

A FailureHandler implementation receives each registered failure as a FailureContext and returns the FailureAction to take.

DefaultFailureHandler must be used by default unless the user provides a specific implementation. DefaultFailureHandler must return the STOP_NODE action for the SEGMENTATION failure type and TERMINATE_JVM for all other failures. Users can use inheritance or composition to reuse the default failure handler's behavior.
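
The default policy and the composition option can be sketched as follows. This is an illustration under stated assumptions: the text names a TERMINATE_JVM action that the FailureAction listing does not include, so the sketch adds it to the enum; the FailureContext constructor and the RestartOnCriticalErrorHandler class are hypothetical.

```java
enum FailureType { SEGMENTATION, SYSTEM_WORKER_TERMINATION, CRITICAL_ERROR }

// Assumption: TERMINATE_JVM is named in the text but absent from the enum listing.
enum FailureAction { RESTART_JVM, TERMINATE_JVM, STOP_NODE, NO_OP }

final class FailureContext {
    final FailureType type;
    final Throwable error;

    // Hypothetical constructor; the proposal lists only the two fields.
    FailureContext(FailureType type, Throwable error) {
        this.type = type;
        this.error = error;
    }
}

interface FailureHandler {
    FailureAction onFailure(FailureContext failureCtx);
}

/** Default policy as described in the text: stop the node on segmentation, terminate the JVM otherwise. */
class DefaultFailureHandler implements FailureHandler {
    @Override public FailureAction onFailure(FailureContext failureCtx) {
        return failureCtx.type == FailureType.SEGMENTATION
            ? FailureAction.STOP_NODE
            : FailureAction.TERMINATE_JVM;
    }
}

/** Hypothetical user handler built by composition: overrides one case and delegates the rest. */
final class RestartOnCriticalErrorHandler implements FailureHandler {
    private final FailureHandler dflt = new DefaultFailureHandler();

    @Override public FailureAction onFailure(FailureContext failureCtx) {
        if (failureCtx.type == FailureType.CRITICAL_ERROR)
            return FailureAction.RESTART_JVM;

        return dflt.onFailure(failureCtx);
    }
}
```

Composition keeps the user's handler independent of the default implementation's internals; extending DefaultFailureHandler and calling super.onFailure would give the same result through inheritance.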

FailureProcessor is responsible for carrying out the failure action according to the value returned by the FailureHandler implementation.
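
A possible shape for this dispatch is sketched below, using FailureAction exactly as listed in the enum. The NodeControl interface, the FailureContext constructor, and the process method are hypothetical names introduced for illustration; the proposal does not define how the processor stops a node or restarts the JVM.

```java
enum FailureType { SEGMENTATION, SYSTEM_WORKER_TERMINATION, CRITICAL_ERROR }

enum FailureAction { RESTART_JVM, STOP_NODE, NO_OP }

final class FailureContext {
    final FailureType type;
    final Throwable error;

    // Hypothetical constructor; the proposal lists only the two fields.
    FailureContext(FailureType type, Throwable error) {
        this.type = type;
        this.error = error;
    }
}

interface FailureHandler {
    FailureAction onFailure(FailureContext failureCtx);
}

/** Hypothetical abstraction over the operations the processor can trigger. */
interface NodeControl {
    void stopNode();

    void restartJvm();
}

/** Asks the handler for an action and carries it out. */
final class FailureProcessor {
    private final FailureHandler hnd;
    private final NodeControl ctl;

    FailureProcessor(FailureHandler hnd, NodeControl ctl) {
        this.hnd = hnd;
        this.ctl = ctl;
    }

    /** Processes one failure and returns the action that was applied. */
    FailureAction process(FailureContext failureCtx) {
        FailureAction action = hnd.onFailure(failureCtx);

        switch (action) {
            case RESTART_JVM:
                ctl.restartJvm();
                break;

            case STOP_NODE:
                ctl.stopNode();
                break;

            case NO_OP:
                break; // Nothing to do.
        }

        return action;
    }
}
```

Keeping the node operations behind an interface lets the dispatch logic be exercised without actually stopping a node.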

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html

Reference Links

// Links to various reference documents, if applicable.

Tickets
