...

Apache Ignite should have a general approach to handling critical failures.

Description

The following failures should be treated as critical:

  • System critical errors (e.g. OutOfMemoryError);
  • Unintentional system worker termination due to an unhandled exception;
  • Cluster node segmentation.

Users should be able to define node behavior in case of these failures.

A system critical error is an error that leads to the system's inoperability.

The following system critical errors should be handled with the proposed approach:

  • IO errors, usually IOExceptions thrown by read/write operations on the file system. The following subsystems should be considered critical:
    • WAL
    • Page store
    • Meta store
    • Binary meta store
  • IgniteOutOfMemoryException
  • OutOfMemoryError (some memory should be reserved at node startup to increase the chances of handling OOM)
  • AssertionError (assertions should be handled as failures when the -ea flag is set; this should also be covered by the Throwable catch in every system worker).
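
The memory reserve mentioned for OutOfMemoryError could look roughly like the sketch below. `OomReserve` and its methods are hypothetical names introduced here for illustration and are not part of the proposal or of Ignite. Releasing the buffer only improves the chances that failure handling can allocate what it needs; it guarantees nothing.

```java
// Hypothetical helper (an assumption, not an Ignite class): holds a buffer
// allocated at node startup so that it can be dropped when an OOM is caught.
class OomReserve {
    private byte[] reserve;

    OomReserve(int bytes) {
        reserve = new byte[bytes];
    }

    // Drops the reference so the GC may reclaim the buffer.
    // Returns true only for the call that actually released it.
    synchronized boolean release() {
        if (reserve == null)
            return false;

        reserve = null;

        return true;
    }

    synchronized boolean isHeld() {
        return reserve != null;
    }
}

class OomReserveDemo {
    public static void main(String[] args) {
        OomReserve reserve = new OomReserve(64 * 1024); // size chosen arbitrarily

        try {
            // Simulated error; a real one would come from a failed allocation.
            throw new OutOfMemoryError("simulated");
        }
        catch (OutOfMemoryError e) {
            reserve.release(); // free the reserve before doing any handling work

            System.out.println("Handling " + e.getMessage() + ", reserve held: " + reserve.isHeld());
        }
    }
}
```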

The following system workers are critical, and an Ignite node will be inoperative if any one of them terminates:

  • disco-event-worker
  • tcp-disco-sock-reader
  • tcp-disco-srvr
  • tcp-disco-msg-worker
  • tcp-comm-worker
  • grid-nio-worker-tcp-comm
  • exchange-worker
  • sys-stripe
  • grid-timeout-worker
  • db-checkpoint-thread
  • wal-file-archiver
  • wal-write-worker
  • ttl-cleanup-worker
  • nio-acceptor
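
One way to detect that such a worker has terminated because of an unhandled exception is to wrap its body in a catch for Throwable. The sketch below is illustrative only: `GuardedWorker` and the callback are hypothetical names, not Ignite classes, and the worker name is borrowed from the list above purely as a label.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical wrapper (an assumption, not an Ignite class): runs a worker body
// and reports any unhandled Throwable instead of letting the thread die silently.
class GuardedWorker implements Runnable {
    private final String name;
    private final Runnable body;
    private final Consumer<Throwable> onTermination;

    GuardedWorker(String name, Runnable body, Consumer<Throwable> onTermination) {
        this.name = name;
        this.body = body;
        this.onTermination = onTermination;
    }

    String name() {
        return name;
    }

    @Override public void run() {
        try {
            body.run();
        }
        catch (Throwable t) {
            // Covers RuntimeException as well as Error subclasses such as
            // AssertionError (thrown by assert only when the JVM runs with -ea).
            onTermination.accept(t);
        }
    }
}

class GuardedWorkerDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicReference<Throwable> reported = new AtomicReference<>();

        GuardedWorker worker = new GuardedWorker(
            "wal-write-worker",
            () -> { throw new IllegalStateException("simulated crash"); },
            reported::set);

        Thread thread = new Thread(worker, worker.name());

        thread.start();
        thread.join();

        System.out.println("Worker " + worker.name() + " terminated with: " + reported.get());
    }
}
```

In a full design the callback would presumably build a failure description and pass it to the configured handler; here it only records the Throwable.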

List of errors to be handled 

...

...

Initial design

IgniteConfiguration should be extended with the following methods:

```java
public IgniteConfiguration setFailureHandler(FailureHandler hnd);

public FailureHandler getFailureHandler();
```

Where:

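```java
// Callback invoked on each registered failure; returns the action to take.
interface FailureHandler {
   FailureAction onFailure(FailureContext failureCtx);
}

// Describes a failure: its type and the underlying cause.
class FailureContext {
   FailureType type;
   Throwable cause;
}

// Possible node reactions to a failure.
enum FailureAction {
   RESTART_JVM,
   STOP,
   NOOP
}

enum FailureType {
   SEGMENTATION,
   SYSTEM_WORKER_TERMINATION,
   CRITICAL_ERROR
}
```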

A user-provided FailureHandler implementation will thus be able to decide what to do (see FailureAction) on each registered failure (see FailureContext).
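
As an illustration, a user-supplied handler might map failure types to actions as in the sketch below. The proposed types are repeated so the example compiles on its own. The `FailureContext` constructor, the `ExamplePolicyHandler` class, and the particular mapping of types to actions are assumptions made for this example and are not prescribed by the design.

```java
// Types as proposed in this design, repeated so the sketch is self-contained.
enum FailureType { SEGMENTATION, SYSTEM_WORKER_TERMINATION, CRITICAL_ERROR }

enum FailureAction { RESTART_JVM, STOP, NOOP }

class FailureContext {
    final FailureType type;
    final Throwable cause;

    // Constructor is an assumption; the design only lists the two fields.
    FailureContext(FailureType type, Throwable cause) {
        this.type = type;
        this.cause = cause;
    }
}

interface FailureHandler {
    FailureAction onFailure(FailureContext failureCtx);
}

// Hypothetical user policy: restart on segmentation, stop on everything else.
class ExamplePolicyHandler implements FailureHandler {
    @Override public FailureAction onFailure(FailureContext failureCtx) {
        switch (failureCtx.type) {
            case SEGMENTATION:
                return FailureAction.RESTART_JVM;

            case SYSTEM_WORKER_TERMINATION:
            case CRITICAL_ERROR:
                return FailureAction.STOP;

            default:
                return FailureAction.NOOP;
        }
    }
}

class ExamplePolicyDemo {
    public static void main(String[] args) {
        FailureHandler hnd = new ExamplePolicyHandler();

        FailureContext ctx = new FailureContext(
            FailureType.CRITICAL_ERROR, new OutOfMemoryError("simulated"));

        System.out.println(ctx.type + " -> " + hnd.onFailure(ctx));
    }
}
```

Such a handler would then be registered through the proposed `setFailureHandler` method on the node configuration.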

...