
ID: IEP-14
Author: Anton Vinogradov
Sponsor: Anton Vinogradov, Dmitry
Created: Feb 20 2018
Status: DRAFT


Motivation

Apache Ignite should have a general-purpose engine for handling critical failures.

Description

Failures that should be covered by this engine:

  • Critical Errors
  • Critical system worker crashes
  • Segmentation

System workers that should be covered by this engine:

  • disco-event-worker
  • tcp-disco-sock-reader
  • tcp-disco-srvr
  • tcp-disco-msg-worker
  • tcp-comm-worker
  • grid-nio-worker-tcp-comm
  • exchange-worker
  • sys-stripe
  • grid-timeout-worker
  • db-checkpoint-thread
  • wal-file-archiver
  • ttl-cleanup-worker
  • nio-acceptor

Errors to be handled:

  • Persistence errors
  • IOOM errors (part of persistence errors?)
  • IO errors (list to be provided)
  • OOM (some memory should be reserved at node startup to increase the chances of handling OOM)
  • Assertion errors (assertions should be handled as failures when the -ea flag is set; they should also be covered by the Throwable catch in every system worker)
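
The last two items can be illustrated with a minimal sketch: a buffer reserved at startup that is released when an OutOfMemoryError is caught, and a Throwable catch around a worker body that also picks up AssertionError when -ea is set. The names GuardedWorker, RESERVE_BYTES, and reserveReleased, the reserve size, and the use of a Consumer<Throwable> callback are all assumptions made for illustration; none of them is part of this proposal or of an existing Ignite API. Releasing the reserve only improves the odds that the callback can run, it does not guarantee it.

```java
import java.util.function.Consumer;

// Hypothetical wrapper around a system worker body; not an existing Ignite class.
final class GuardedWorker implements Runnable {
    // Assumed reserve size; the proposal does not specify one.
    private static final int RESERVE_BYTES = 1024 * 1024;

    // Memory reserved at startup and released on OOM so the failure callback has room to run.
    private volatile byte[] reserve = new byte[RESERVE_BYTES];

    private final Runnable body;

    // Hypothetical failure callback standing in for the failure handling engine.
    private final Consumer<Throwable> onFailure;

    GuardedWorker(Runnable body, Consumer<Throwable> onFailure) {
        this.body = body;
        this.onFailure = onFailure;
    }

    @Override public void run() {
        try {
            body.run();
        }
        catch (Throwable t) { // Also catches AssertionError raised when -ea is set.
            if (t instanceof OutOfMemoryError)
                reserve = null; // Drop the reference so the reserve becomes collectible.

            onFailure.accept(t);
        }
    }

    // Reports whether the reserve has been dropped after an OOM.
    boolean reserveReleased() {
        return reserve == null;
    }
}
```

A worker thread could then be started as new Thread(new GuardedWorker(body, callback)), where callback forwards the Throwable to whatever failure handling is configured.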

Initial design

IgniteConfiguration has to be extended with the following methods:

IgniteConfiguration setIgniteFailureHandler(IgniteFailureHandler igniteFailureHnd)
IgniteFailureHandler getIgniteFailureHandler()

where

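// Invoked on each registered failure; returns the action to take.
interface IgniteFailureHandler {
   IgniteFailureAction onFailure(IgniteFailureContext failureCtx);
}

// Describes a registered failure.
class IgniteFailureContext {
   IgniteFailureType type; // Kind of failure.
   Throwable cause;        // Underlying error.
}

// Actions a handler can choose from.
enum IgniteFailureAction {
   RESTART_JVM,
   STOP,
   NOOP
}

// Kinds of failures covered by the engine.
enum IgniteFailureType {
   SEGMENTATION,
   SYSTEM_WORKER_CRASHED,
   CRITICAL_ERROR
}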

Thus, a user-provided implementation of IgniteFailureHandler decides what to do (see IgniteFailureAction) on each registered failure (see IgniteFailureContext).
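
As a rough illustration of how such a handler might look, the sketch below restates the draft types in compilable form and adds one sample implementation. The constructor on IgniteFailureContext, the StopOnCriticalHandler class with its policy, and the FailureHandlerDemo class are hypothetical; the draft does not prescribe any of them, and this is not a released Ignite API.

```java
// Sketch only: the types mirror the draft in this proposal.
enum IgniteFailureType { SEGMENTATION, SYSTEM_WORKER_CRASHED, CRITICAL_ERROR }

enum IgniteFailureAction { RESTART_JVM, STOP, NOOP }

final class IgniteFailureContext {
    final IgniteFailureType type;
    final Throwable cause;

    // Hypothetical constructor; the draft only lists the two fields.
    IgniteFailureContext(IgniteFailureType type, Throwable cause) {
        this.type = type;
        this.cause = cause;
    }
}

interface IgniteFailureHandler {
    IgniteFailureAction onFailure(IgniteFailureContext failureCtx);
}

// Hypothetical user-provided handler; the policy below is purely illustrative.
final class StopOnCriticalHandler implements IgniteFailureHandler {
    @Override public IgniteFailureAction onFailure(IgniteFailureContext failureCtx) {
        switch (failureCtx.type) {
            case SEGMENTATION:
                return IgniteFailureAction.RESTART_JVM;

            case SYSTEM_WORKER_CRASHED:
            case CRITICAL_ERROR:
                return IgniteFailureAction.STOP;

            default:
                return IgniteFailureAction.NOOP;
        }
    }
}

// Hypothetical demo of how the engine might consult the handler.
final class FailureHandlerDemo {
    public static void main(String[] args) {
        IgniteFailureHandler hnd = new StopOnCriticalHandler();

        IgniteFailureContext ctx =
            new IgniteFailureContext(IgniteFailureType.CRITICAL_ERROR, new OutOfMemoryError("simulated"));

        System.out.println(hnd.onFailure(ctx)); // prints "STOP"
    }
}
```

Under this design the handler would be registered through the proposed IgniteConfiguration.setIgniteFailureHandler method, and the engine would carry out whichever action the handler returns.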

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html

Reference Links

// Links to various reference documents, if applicable.

Tickets
