ID: IEP-7
Author: Anton Vinogradov, Sergey Puchnin
Sponsor: Anton Vinogradov, Yakov Zhdanov
Created: Nov 14, 2017
Status: DRAFT


Motivation

// Define the problem to be solved.

Description

// Provide the design of the solution.

System processes heartbeat


The approach from [1] can be reused to monitor the activity of any crucial system process.

To do this, we can introduce two interfaces: SystemThread and SystemThreadRegestry.

SystemThread.

This interface should specify:

  • public long lastActivity;
  • public long lastActivity();

Every implementation of SystemThread should update its lastActivity field on each pass through its main loop. For simplicity, the field can hold a timestamp.
GridWorker and IgniteThread look like good candidates to implement this interface.
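
The heartbeat update described above can be sketched as follows. This is a minimal illustration, not Ignite code: HeartbeatWorker, its body callback, and stop() are hypothetical names introduced here. A Java interface cannot declare a mutable instance field, so the sketch assumes the lastActivity field lives in the implementation and is exposed through lastActivity().

```java
/** Proposed interface; only lastActivity() is taken from the proposal. */
interface SystemThread {
    /** @return Timestamp in milliseconds of the last completed main-loop pass. */
    long lastActivity();
}

/** Hypothetical worker used only for illustration; not an Ignite class. */
class HeartbeatWorker implements SystemThread, Runnable {
    /** One unit of work, executed on every pass through the main loop. */
    private final Runnable body;

    /** Volatile so that a monitoring thread sees the latest value. */
    private volatile long lastActivity = System.currentTimeMillis();

    private volatile boolean stopped;

    HeartbeatWorker(Runnable body) {
        this.body = body;
    }

    @Override public void run() {
        while (!stopped && !Thread.currentThread().isInterrupted()) {
            body.run();

            // Heartbeat: record that this pass has completed.
            lastActivity = System.currentTimeMillis();
        }
    }

    void stop() {
        stopped = true;
    }

    @Override public long lastActivity() {
        return lastActivity;
    }
}
```

Updating the timestamp after the work step means a pass that never returns stops refreshing the heartbeat, which is what the watchdog relies on.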

SystemThreadRegestry.

This interface should specify:

  • private static final long systemThreadTimeOut = 5_000;
  • public List<SystemThread> systemThreadsList;
  • public void register(SystemThread t); adds a system process to watchdog monitoring
  • public void unregister(SystemThread t); removes a system process from watchdog monitoring
  • public void checkSystemTread(); checks the state of the registered system processes

In GridKernalContext.add(GridComponent comp), the following system threads:

  • disco-event-worker
  • tcp-disco-sock-reader
  • tcp-disco-srvr
  • tcp-disco-msg-worker
  • tcp-comm-worker
  • grid-nio-worker-tcp-comm
  • exchange-worker
  • sys-stripe
  • grid-timeout-worker
  • db-checkpoint-thread
  • wal-file-archiver
  • ttl-cleanup-worker
  • nio-acceptor

should be registered in systemThreadsList. Each call to checkSystemTread should iterate over the registered system processes and check that current time - t.lastActivity() is less than systemThreadTimeOut. If a process's last activity is older than the timeout, the method should log a WARNING that includes t.toString().
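
The registration and timeout check can be sketched as below. Treat it as a sketch under stated assumptions rather than the intended implementation. The class and method names are spelled SystemThreadRegistry and checkSystemThreads here. The injected clock, the returned list of stalled threads (the proposal declares the check as void), and the use of System.err instead of the Ignite logger are all assumptions made so that the snippet is self-contained and testable. The SystemThread interface is repeated so that the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.LongSupplier;

/** Minimal stand-in for the proposed interface. */
interface SystemThread {
    long lastActivity();
}

/** Hypothetical registry; the proposal names it SystemThreadRegestry. */
class SystemThreadRegistry {
    /** Timeout from the proposal, in milliseconds. */
    private static final long SYSTEM_THREAD_TIMEOUT = 5_000;

    /** Copy-on-write list: safe to iterate while threads register or unregister. */
    private final List<SystemThread> systemThreads = new CopyOnWriteArrayList<>();

    /** Time source (assumption), e.g. System::currentTimeMillis. */
    private final LongSupplier clock;

    SystemThreadRegistry(LongSupplier clock) {
        this.clock = clock;
    }

    public void register(SystemThread t) {
        systemThreads.add(t);
    }

    public void unregister(SystemThread t) {
        systemThreads.remove(t);
    }

    /** @return Threads whose last activity is not newer than the timeout. */
    public List<SystemThread> checkSystemThreads() {
        long now = clock.getAsLong();
        List<SystemThread> stalled = new ArrayList<>();

        for (SystemThread t : systemThreads) {
            long idle = now - t.lastActivity();

            // Healthy means idle < timeout, as in the proposal.
            if (idle >= SYSTEM_THREAD_TIMEOUT) {
                System.err.println("WARNING: no activity from " + t + " for " + idle + " ms");
                stalled.add(t);
            }
        }

        return stalled;
    }
}
```

In a node, the check would presumably be invoked periodically, for example from a scheduled task running `new SystemThreadRegistry(System::currentTimeMillis).checkSystemThreads()` on a shared registry instance.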

 

Note that a heartbeat alone does not prove progress. If the processor jumps off and begins executing code out of order, or a task freezes and is no longer running, the heartbeat may still be generated: the code could be stuck in the heartbeat function itself and do nothing but generate the heartbeat.

 

[1] https://issues.apache.org/jira/browse/IGNITE-6171

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

Deadlock Detection And Cluster Protection.

Tickets
