ID: IEP-14
Author

Anton Vinogradov

Andrey Gura

Sponsor

Anton Vinogradov

Andrey Gura
Created: Feb 20 2018
Status: Done


Motivation

...

The following system workers are critical; an Ignite node becomes inoperative if any one of them terminates:

  • disco-event-worker
  • tcp-disco-sock-reader
  • tcp-disco-srvr
  • tcp-disco-msg-worker
  • tcp-comm-worker
  • grid-nio-worker-tcp-comm
  • exchange-worker
  • sys-stripe
  • grid-timeout-worker
  • db-checkpoint-thread
  • wal-file-archiver
  • wal-write-worker
  • wal-file-decompressor
  • ttl-cleanup-worker
  • nio-acceptor

Important notes:

  • More than one Ignite node could be started in one JVM process.
  • Different nodes in one JVM process could belong to different clusters.

Initial design

IgniteConfiguration should be extended with methods:

...

Code Block
languagejava
interface FailureHandler {
   boolean onFailure(Ignite ignite, FailureContext failureCtx);
}

class FailureContext {
   FailureType type;
   Throwable error;
}

enum FailureType {
   SEGMENTATION,
   SYSTEM_WORKER_TERMINATION,
   CRITICAL_ERROR
}

A FailureHandler implementation will be able to handle Ignite critical failures according to the strategy provided by the user.
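As an illustration of a user-provided strategy, the sketch below builds a handler by composition. It assumes the boolean-returning onFailure signature. The Ignite interface is an empty stand-in so the example compiles without the library, and RecordingFailureHandler, the FailureContext constructor and its accessors are hypothetical names. This section does not define the meaning of the boolean result, so the sketch simply passes the delegate's decision through.

```java
import java.util.ArrayList;
import java.util.List;

// Empty stand-in for org.apache.ignite.Ignite (assumption, for compilation only).
interface Ignite {
}

enum FailureType {
    SEGMENTATION,
    SYSTEM_WORKER_TERMINATION,
    CRITICAL_ERROR
}

class FailureContext {
    private final FailureType type;
    private final Throwable error;

    // Hypothetical constructor; the proposal only lists the two fields.
    FailureContext(FailureType type, Throwable error) {
        this.type = type;
        this.error = error;
    }

    FailureType type() {
        return type;
    }

    Throwable error() {
        return error;
    }
}

interface FailureHandler {
    boolean onFailure(Ignite ignite, FailureContext failureCtx);
}

// Hypothetical handler: records every failure, then lets a delegate decide.
class RecordingFailureHandler implements FailureHandler {
    private final FailureHandler delegate;
    private final List<FailureContext> seen = new ArrayList<>();

    RecordingFailureHandler(FailureHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        seen.add(failureCtx);

        // The result is whatever the wrapped handler returns.
        return delegate.onFailure(ignite, failureCtx);
    }

    synchronized int recordedCount() {
        return seen.size();
    }
}
```

Wrapping rather than subclassing keeps the recording concern separate from the decision, so any of the out-of-the-box handlers could serve as the delegate.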

The following implementations should be provided out of the box:

  • NoOpFailureHandler - Just ignores any failure. It's useful for tests and debugging.
  • RestartProcessFailureHandler - Specific implementation that can be used only with ignite.(sh|bat). The process must be terminated using an Ignition.restart(true) call.
  • StopNodeFailureHandler - Stops the Ignite node in case of a critical error using an Ignition.stop(true) or Ignition.stop(nodeName, true) call.
  • StopNodeOrHaltFailureHandler(boolean tryStop, long timeout) - Tries to stop the node if tryStop is true. If the node can't be stopped within the provided timeout, or if tryStop is false, the JVM process is terminated forcibly (Runtime.halt()).

The default failure handler is StopNodeOrHaltFailureHandler with tryStop set to false.
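The "try to stop, otherwise halt" strategy can be sketched as follows. This is not the proposed implementation: StopOrHaltSketch is a hypothetical class, the timeout is assumed to be in milliseconds, and stopAction and haltAction are injected stand-ins for the Ignition.stop(...) and Runtime.halt() calls so that the logic can be exercised without terminating a JVM.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the stop-or-halt decision logic.
class StopOrHaltSketch {
    private final boolean tryStop;
    private final long timeoutMs;
    private final Runnable stopAction; // stand-in for Ignition.stop(nodeName, true)
    private final Runnable haltAction; // stand-in for Runtime.getRuntime().halt(...)

    StopOrHaltSketch(boolean tryStop, long timeoutMs, Runnable stopAction, Runnable haltAction) {
        this.tryStop = tryStop;
        this.timeoutMs = timeoutMs;
        this.stopAction = stopAction;
        this.haltAction = haltAction;
    }

    /** Returns true if the stop action finished in time, false if the halt action was used. */
    boolean handle() {
        if (tryStop) {
            CountDownLatch stopped = new CountDownLatch(1);

            // Run the stop in a separate daemon thread so a hanging stop cannot block the handler.
            Thread stopper = new Thread(() -> {
                stopAction.run();
                stopped.countDown();
            }, "node-stopper");

            stopper.setDaemon(true);
            stopper.start();

            try {
                if (stopped.await(timeoutMs, TimeUnit.MILLISECONDS))
                    return true;
            }
            catch (InterruptedException e) {
                // Restore the flag and fall through to the forced termination.
                Thread.currentThread().interrupt();
            }
        }

        haltAction.run();

        return false;
    }
}
```

With tryStop set to false the stop branch is skipped entirely, which corresponds to the default configuration described above.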

A critical system worker must catch all exceptions (Throwable and derived classes) in a top-level try-catch block and take into account that the thread could be terminated by a programming mistake that leads to unintentional worker termination. The basic template should look like the following code snippet:

 

Code Block
languagejava
@Override
public void run() {
    Throwable err = null;

    try {
        // Critical worker's code.
    }
    catch (Throwable e) {
        err = e;
    }
    finally {
        // The finally block runs on any exit, so the failure handler is also
        // invoked when the worker terminates without an exception (err == null).
        FailureContext failureCtx = new FailureContext(FailureType.SYSTEM_WORKER_TERMINATION, err);

        ctx.failure().process(failureCtx); // Handle failure; ctx is the kernal context.
    }
}
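To try the template without an Ignite node, the sketch below wires it to a stand-in failure processor. FailureProcessorStub, CriticalWorker and the FailureContext constructor are hypothetical names used only for illustration; in the template the processor is reached through the kernal context as ctx.failure().

```java
import java.util.ArrayList;
import java.util.List;

enum FailureType {
    SEGMENTATION,
    SYSTEM_WORKER_TERMINATION,
    CRITICAL_ERROR
}

class FailureContext {
    final FailureType type;
    final Throwable error; // null when the worker body returned without throwing

    // Hypothetical constructor; the proposal only lists the two fields.
    FailureContext(FailureType type, Throwable error) {
        this.type = type;
        this.error = error;
    }
}

// Stand-in for the processor reached via ctx.failure() in the template.
class FailureProcessorStub {
    final List<FailureContext> processed = new ArrayList<>();

    synchronized void process(FailureContext failureCtx) {
        processed.add(failureCtx);
    }
}

// Hypothetical worker that applies the template around an arbitrary body.
class CriticalWorker implements Runnable {
    private final Runnable body;
    private final FailureProcessorStub failure;

    CriticalWorker(Runnable body, FailureProcessorStub failure) {
        this.body = body;
        this.failure = failure;
    }

    @Override
    public void run() {
        Throwable err = null;

        try {
            body.run(); // Critical worker's code.
        }
        catch (Throwable e) {
            err = e;
        }
        finally {
            // Reached on every exit path, with or without an exception.
            failure.process(new FailureContext(FailureType.SYSTEM_WORKER_TERMINATION, err));
        }
    }
}
```

Because the report is made in the finally block, a worker body that simply returns is reported too, with a null error. That matches the template's intent of treating any termination of a critical worker as a failure.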

 

Example of using FailureHandler in IgniteConfiguration via Spring XML:

 

Code Block
languagexml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureHandler">
        <bean class="org.apache.ignite.failure.StopNodeFailureHandler"/>
    </property>
</bean>

 

Risks and Assumptions

 

Discussion Links

  1. Internal problems requiring graceful node shutdown, reboot, etc.
  2. IEP-14: Ignite failures handling (Discussion)

A FailureHandler implementation will be able to handle (see FailureAction) each registered failure (see FailureContext).

DefaultFailureHandler must be initialized by default unless the user provides a specific implementation. DefaultFailureHandler must return the STOP_NODE action for any failure type. Users can rely on inheritance or composition to reuse the default failure handler's behavior.

FailureProcessor is responsible for processing the different failure actions according to the value returned by the FailureHandler implementation.

Risks and Assumptions

It's possible that a node won't be stopped correctly in case of FailureAction.STOP_NODE due to bugs, which can lead to the process hanging. These bugs should be discovered and fixed in the future.

Reference Links

  1. Apache Ignite documentation: Ignite life cycle
  2. Apache Ignite documentation: Start from command line

Tickets

ASF JIRA query: project = Ignite AND labels IN (iep-14) ORDER BY status