...

List of system workers that should be covered by this engine:

  • disco-event-worker
  • tcp-disco-sock-reader
  • tcp-disco-srvr
  • tcp-disco-msg-worker
  • tcp-comm-worker
  • grid-nio-worker-tcp-comm
  • exchange-worker
  • sys-stripe
  • grid-timeout-worker
  • db-checkpoint-thread
  • wal-file-archiver
  • wal-write-worker
  • ttl-cleanup-worker
  • nio-acceptor

List of errors to be handled:

  • Persistence errors
  • IOOM errors (part of persistence errors?)
  • IO errors (list to be provided)
  • OOM (some memory should be reserved at node startup to increase the chances of handling OOM)
  • Assertion errors (assertions should be handled as failures when the -ea flag is set; they should also be covered by the Throwable catch in every system worker)
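The last two items can be sketched as a wrapper around a worker body. This is only an illustration under stated assumptions: GuardedWorker, the 64 KB reserve size, and the BiConsumer callback are hypothetical names and values chosen for the example, not part of the proposed API.

```java
import java.util.function.BiConsumer;

/** Hypothetical sketch of a guarded system worker; not part of the proposed API. */
class GuardedWorker implements Runnable {
    /** Memory reserved at startup (size is an arbitrary assumption); released on OOM. */
    private static volatile byte[] oomReserve = new byte[64 * 1024];

    private final String name;
    private final Runnable body;
    private final BiConsumer<String, Throwable> onFailure;

    GuardedWorker(String name, Runnable body, BiConsumer<String, Throwable> onFailure) {
        this.name = name;
        this.body = body;
        this.onFailure = onFailure;
    }

    @Override public void run() {
        try {
            body.run();
        }
        catch (Throwable t) {
            // Catching Throwable also covers Error subclasses such as
            // AssertionError (raised when -ea is set) and OutOfMemoryError.
            if (t instanceof OutOfMemoryError)
                oomReserve = null; // Drop the reserve so failure handling has some heap to work with.

            onFailure.accept(name, t);
        }
    }

    /** @return true once the reserve has been released after an OOM. */
    static boolean reserveReleased() {
        return oomReserve == null;
    }
}
```

Releasing the reserve does not guarantee that the failure can be handled; it only raises the odds that the callback can allocate what it needs.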

Initial design

IgniteConfiguration has to be extended with the following methods:

Code Block
languagejava
public IgniteConfiguration setIgniteFailureHandler(IgniteFailureHandler igniteFailureHnd);

public IgniteFailureHandler getIgniteFailureHandler();

Where:

Code Block
languagejava
interface IgniteFailureHandler {
   IgniteFailureAction onFailure(IgniteFailureContext failureCtx);
}

class IgniteFailureContext {
   IgniteFailureType type;
   Throwable cause;
}

enum IgniteFailureAction {
   RESTART_JVM,
   STOP,
   NOOP;
}

enum IgniteFailureType {
   SEGMENTATION,
   SYSTEM_WORKER_CRASHED,
   CRITICAL_ERROR
}

Thus, a user-provided implementation of IgniteFailureHandler decides what to do (see IgniteFailureAction) on each registered failure (see IgniteFailureContext).
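A minimal sketch of such a handler is shown here. The types are redeclared locally so the example compiles on its own; the IgniteFailureContext constructor, the ExamplePolicyHandler class, and the policy it applies (which action goes with which failure type) are assumptions made for illustration, not decisions taken by this design.

```java
enum IgniteFailureType { SEGMENTATION, SYSTEM_WORKER_CRASHED, CRITICAL_ERROR }

enum IgniteFailureAction { RESTART_JVM, STOP, NOOP }

class IgniteFailureContext {
    final IgniteFailureType type;
    final Throwable cause;

    // Constructor is an assumption; the design only lists the two fields.
    IgniteFailureContext(IgniteFailureType type, Throwable cause) {
        this.type = type;
        this.cause = cause;
    }
}

interface IgniteFailureHandler {
    IgniteFailureAction onFailure(IgniteFailureContext failureCtx);
}

/** Hypothetical user-provided handler with an example policy. */
class ExamplePolicyHandler implements IgniteFailureHandler {
    @Override public IgniteFailureAction onFailure(IgniteFailureContext failureCtx) {
        switch (failureCtx.type) {
            case SEGMENTATION:
                // Example choice: rejoin the cluster with a fresh JVM.
                return IgniteFailureAction.RESTART_JVM;

            case SYSTEM_WORKER_CRASHED:
                return IgniteFailureAction.STOP;

            case CRITICAL_ERROR:
                // Example choice: a restart may clear an OOM; other critical errors stop the node.
                return failureCtx.cause instanceof OutOfMemoryError
                    ? IgniteFailureAction.RESTART_JVM
                    : IgniteFailureAction.STOP;

            default:
                return IgniteFailureAction.NOOP;
        }
    }
}
```

With the setter proposed above, such a handler would presumably be registered through setIgniteFailureHandler(new ExamplePolicyHandler()) on the node's IgniteConfiguration.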

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

...