ID	IEP-5
Author	Sergey Puchnin, Alexey Goncharuk
Sponsor	Yakov Zhdanov
Created	18.10.2017
Status	DRAFT

Motivation

Currently, Ignite doesn't have a single policy/approach for monitoring the status of critical internal threads and for timing out user operations. This may cause cluster instability, hangs and various usability issues. This IEP describes the possible problems and proposes solutions for them.

Description

To improve the situation, we need to consider the following:

Add the ability to monitor the current status of critical Ignite internal processes.
Elaborate a universal approach for using assertions/exceptions.
Suggest a universal approach for transaction and process timeouts.

1. Monitoring the current status of Ignite internal processes

(covered now by IEP-14 Ignite failures handling)

Every critical process runs an infinite loop to perform its main activity, and we can use this loop to monitor thread activity. For example, we can create an interface with a method that confirms the thread is alive and active, and make each critical thread/worker implement it. This solution does not require an external watchdog process. If any system-critical process is not alive or active, diagnostic information should be saved to the log file.
For this, we can introduce two interfaces, SystemThread and SystemThreadRegistry.

import java.util.List;

interface SystemThread {
	/** Gets the time of the last recorded activity of this thread. */
	public long lastActivity();
}

/** Should be a component available by kernal context. */
interface SystemThreadRegistry {
	/** Inactivity timeout for a system thread (assumed here to be in milliseconds). */
	long SYSTEM_THREAD_TIMEOUT = 5_000;

	/** Gets threads registered so far. */
	public List<SystemThread> systemThreads();

	/** Adds system process to monitoring. */
	public void register(SystemThread t);

	/** Removes system process from monitoring. */
	public void unregister(SystemThread t);

	/** Checks state of registered system processes and outputs warning or shuts down local node if necessary. */
	public void checkSystemThreads();
}

The registry should be available through kernel context and critical threads should be registered upon start.
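
The heartbeat idea described above can be illustrated with a minimal sketch. It is not the proposed implementation: HeartbeatWorker, SimpleSystemThreadRegistry, findStalled() and the injected clock are hypothetical names introduced only for this example, the timeout is assumed to be in milliseconds, and the check only reports stalled threads instead of shutting down the node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Same shape as the SystemThread interface proposed in this section. */
interface SystemThread {
    long lastActivity();
}

/** Hypothetical worker: its main loop is expected to call heartbeat() on every iteration. */
class HeartbeatWorker implements SystemThread {
    private final String name;
    private final LongSupplier clock;
    private volatile long lastActivity;

    HeartbeatWorker(String name, LongSupplier clock) {
        this.name = name;
        this.clock = clock;
        this.lastActivity = clock.getAsLong();
    }

    /** Records that the worker has made progress. */
    void heartbeat() {
        lastActivity = clock.getAsLong();
    }

    @Override public long lastActivity() {
        return lastActivity;
    }

    @Override public String toString() {
        return name;
    }
}

/** Hypothetical in-memory registry; it only finds stalled threads. */
class SimpleSystemThreadRegistry {
    /** Assumption: the timeout is expressed in milliseconds. */
    static final long SYSTEM_THREAD_TIMEOUT = 5_000;

    private final List<SystemThread> threads = new CopyOnWriteArrayList<>();
    private final LongSupplier clock;

    SimpleSystemThreadRegistry(LongSupplier clock) {
        this.clock = clock;
    }

    void register(SystemThread t) {
        threads.add(t);
    }

    void unregister(SystemThread t) {
        threads.remove(t);
    }

    /** Returns the threads whose last activity is older than the timeout. */
    List<SystemThread> findStalled() {
        long now = clock.getAsLong();
        List<SystemThread> stalled = new ArrayList<>();

        for (SystemThread t : threads) {
            if (now - t.lastActivity() > SYSTEM_THREAD_TIMEOUT)
                stalled.add(t);
        }

        return stalled;
    }
}

class SystemThreadRegistryDemo {
    public static void main(String[] args) {
        AtomicLong now = new AtomicLong(0); // fake clock, so the demo does not sleep

        SimpleSystemThreadRegistry reg = new SimpleSystemThreadRegistry(now::get);
        HeartbeatWorker exchange = new HeartbeatWorker("exchange-worker", now::get);
        HeartbeatWorker checkpoint = new HeartbeatWorker("db-checkpoint-thread", now::get);

        reg.register(exchange);
        reg.register(checkpoint);

        now.set(4_000);
        exchange.heartbeat(); // only this worker reports progress
        now.set(6_000);

        System.out.println("Stalled: " + reg.findStalled());
    }
}
```

Injecting the clock keeps the check deterministic in tests; a real registry would more likely read the system time and be driven by a periodic task.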

The following critical threads have been considered so far:

disco-event-worker
tcp-disco-sock-reader
tcp-disco-srvr
tcp-disco-msg-worker
tcp-comm-worker
grid-nio-worker-tcp-comm
exchange-worker
sys-stripe
grid-timeout-worker
db-checkpoint-thread
wal-file-archiver
ttl-cleanup-worker
nio-acceptor

2. Approach for using assertions/exceptions.

For now, some checks are made with assert statements and some by raising exceptions. For the system, this means we have two different sets of checks.
When assertions are disabled in the JVM options, part of the checks is not performed. And when an assertion does fail, the resulting AssertionError is not meant to be caught and handled.
It's necessary to review every use of assert statements and try to replace them with an "IF statement" or "IgniteErrorException".
Some uses of the assert statement are acceptable, but not in support modules (JMS statistics, for example) and not for checking method arguments.
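
The difference between the two kinds of checks can be shown with a small sketch. The class and method names are hypothetical, and the standard IllegalArgumentException is used as a stand-in for whatever exception type is finally chosen.

```java
class ArgumentChecks {
    /** Assertion-based check: skipped entirely unless the JVM is started with -ea. */
    static int partitionWithAssert(int key, int parts) {
        assert parts > 0 : "parts must be positive: " + parts;

        return Math.floorMod(key, parts);
    }

    /** Explicit check: always executed, and the caller can catch the exception. */
    static int partitionChecked(int key, int parts) {
        if (parts <= 0)
            throw new IllegalArgumentException("parts must be positive: " + parts);

        return Math.floorMod(key, parts);
    }
}

class AssertVsExceptionDemo {
    public static void main(String[] args) {
        System.out.println(ArgumentChecks.partitionChecked(7, 4));

        try {
            ArgumentChecks.partitionChecked(7, 0);
        }
        catch (IllegalArgumentException e) {
            // The failure is reported in the same way regardless of JVM options.
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With assertions disabled, partitionWithAssert(7, 0) gets past the check and fails later with an ArithmeticException from the modulo operation, which is harder to diagnose than an explicit argument check.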

3. Approach for transaction and process timeouts.

As for cache transactions, Ignite has a configuration property to set the tx timeout per transaction and at the node level (using TransactionConfiguration). Ignite also has the ability to set a timeout for compute tasks.

The following timeouts need to be considered and implemented:

atomic operations - to limit time of atomic updates
continuous query start
cache start
cache destroy
remote event listener installation
remote communication (message) listener installation
grid service start

We need to consider a withTimeout(long timeout) notation where applicable, and consider changing IgniteConfiguration to introduce new timeouts.
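
As a rough illustration of what a timeout on an operation means for the caller, the sketch below bounds the wait with plain java.util.concurrent primitives. It does not use any Ignite API: TimedOperations and callWithTimeout are hypothetical names, and the timeout is assumed to be in milliseconds.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedOperations {
    /**
     * Runs the operation on the given executor and waits at most timeoutMs for the result.
     * On timeout, the operation is cancelled so that its thread can be released.
     */
    static <T> T callWithTimeout(ExecutorService exec, Callable<T> op, long timeoutMs)
        throws ExecutionException, InterruptedException, TimeoutException {
        Future<T> fut = exec.submit(op);

        try {
            return fut.get(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (TimeoutException e) {
            fut.cancel(true); // interrupts the worker thread if the operation is still running

            throw e;
        }
    }
}

class TimedOperationsDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();

        try {
            // A fast operation completes within the timeout.
            System.out.println(TimedOperations.callWithTimeout(exec, () -> "started", 1_000));

            // A slow operation is abandoned and cancelled after 50 ms.
            TimedOperations.callWithTimeout(exec, () -> {
                Thread.sleep(10_000);

                return "never returned";
            }, 50);
        }
        catch (TimeoutException e) {
            System.out.println("Operation timed out");
        }
        finally {
            exec.shutdownNow();
        }
    }
}
```

Cancelling on timeout matters as much as failing the call: without it the operation would keep holding its thread after the caller has given up.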

Even if a user does not provide a timeout for an operation and the operation takes a long time (e.g. installing a system service takes a long time), the system needs to output a warning once a minute about the hanging process and the reason for it. Therefore, we need to think over an interface that every internal future of this kind should implement.

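interface TrackableFuture {
	/** Returns the time at which the operation started. */
	long startTime();
	/** Outputs the current status of the operation to the given log. */
	void reportStatus(IgniteLogger log);
}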

reportStatus() should output the status of the operation to the logs (possibly sending requests to other nodes involved). Each future should be checked for timing out from some system thread once per minute, and its status should be output to the logs every time, even if a user timeout is not set for the operation.
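
The periodic check described above might look roughly like the following sketch. It is self-contained, so a Consumer<String> stands in for IgniteLogger; isDone(), SimpleTrackableFuture, LongOperationTracker and the 60-second threshold constant are assumptions made for this example only.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Simplified stand-in for the proposed interface (Consumer<String> replaces IgniteLogger). */
interface TrackableFuture {
    long startTime();

    /** Hypothetical addition, so that finished futures can be dropped from tracking. */
    boolean isDone();

    void reportStatus(Consumer<String> log);
}

/** Hypothetical future used only for illustration. */
class SimpleTrackableFuture implements TrackableFuture {
    private final String name;
    private final long startTime;
    private volatile boolean done;

    SimpleTrackableFuture(String name, long startTime) {
        this.name = name;
        this.startTime = startTime;
    }

    void complete() {
        done = true;
    }

    @Override public long startTime() {
        return startTime;
    }

    @Override public boolean isDone() {
        return done;
    }

    @Override public void reportStatus(Consumer<String> log) {
        log.accept("Operation is still running [name=" + name + ", startTime=" + startTime + ']');
    }
}

class LongOperationTracker {
    /** Assumption: operations running for a minute or longer are reported. */
    static final long WARN_THRESHOLD_MS = 60_000;

    private final List<TrackableFuture> futures = new CopyOnWriteArrayList<>();

    void track(TrackableFuture fut) {
        futures.add(fut);
    }

    /** Meant to be called about once a minute; returns the number of futures reported. */
    int check(long now, Consumer<String> log) {
        futures.removeIf(TrackableFuture::isDone);

        int reported = 0;

        for (TrackableFuture fut : futures) {
            if (now - fut.startTime() >= WARN_THRESHOLD_MS) {
                fut.reportStatus(log);

                reported++;
            }
        }

        return reported;
    }
}

class LongOperationTrackerDemo {
    public static void main(String[] args) {
        LongOperationTracker tracker = new LongOperationTracker();

        tracker.track(new SimpleTrackableFuture("cache start", 1_000));

        tracker.check(30_000, System.out::println); // too early, nothing is logged
        tracker.check(61_000, System.out::println); // the operation is reported
    }
}
```

Passing the current time into check() keeps the sketch deterministic; in a real node the call would come from a system thread or timeout worker using the system clock.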

Every future should properly implement the cancel() method, and Ignite should provide the ability to cancel any future from an outer process (web console or control.sh), so that a user is able to unfreeze threads and clean up resources.

Risks and Assumptions

All the changes need to be thoroughly tested.

Discussion Links

TBD
