ID | IEP-7
Author | Anton Vinogradov
Sponsor | Anton Vinogradov
Created | Nov 14, 2017
Status |
Internal problems may cause unexpected cluster behavior.
We should define how the cluster behaves when any such internal problem occurs.
Internal problems can be split into the following groups:
1) OOM or any other cause of a node crash
2) Situations that require a graceful node shutdown with a custom notification (now covered by IEP-14 Ignite failures handling)
- IgniteOutOfMemoryException
- Persistence errors
- ExchangeWorker exits with an error
3) Performance issues that should be covered by metrics
- GC STW duration
- Timed-out tasks and jobs
- TX deadlocks
- Hung TX (waiting for some service)
- Java deadlocks
4) Situations that require an external monitoring implementation
- GC STW duration exceeds the maximum allowed length (the node should be stopped before the STW pause finishes)
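As an illustration of how one item from group 3 could be turned into a metric, the sketch below counts deadlocked Java threads using the standard `java.lang.management.ThreadMXBean` API. The class name `DeadlockProbe` and its method are hypothetical and are not part of Ignite; how the value would be published as a metric is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper (not an Ignite class): reports the number of deadlocked Java threads. */
final class DeadlockProbe {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    /** Returns how many threads are currently deadlocked, or 0 if none are. */
    int deadlockedThreadCount() {
        // findDeadlockedThreads() also covers java.util.concurrent locks, but it is
        // only available when the JVM supports synchronizer usage monitoring.
        long[] ids = threads.isSynchronizerUsageSupported()
            ? threads.findDeadlockedThreads()
            : threads.findMonitorDeadlockedThreads();

        // Both methods return null when no deadlock is detected.
        return ids == null ? 0 : ids.length;
    }
}
```

A monitoring component could poll `deadlockedThreadCount()` periodically and expose the result; a non-zero value would indicate a Java deadlock.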
For this purpose, we can introduce two interfaces: SystemThread and SystemThreadRegistry.
The SystemThread interface should specify:
Every implementation of SystemThread should update a lastActivity field on each pass through its main loop. For simplicity, this can be a timestamp field.
The GridWorker and IgniteThread classes look like good candidates to implement this interface.
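The idea of updating lastActivity on every pass through the main loop can be sketched as follows. This is a minimal illustration, not the proposed API: only the `lastActivity()` accessor comes from the text above, while `HeartbeatWorker`, `iteration()` and `cancel()` are hypothetical names introduced for the example.

```java
/** Assumed shape of the proposed interface; only lastActivity() is taken from the text. */
interface SystemThread {
    /** Timestamp in milliseconds of the most recent pass through the main loop. */
    long lastActivity();
}

/** Hypothetical worker that records a timestamp after each loop iteration. */
abstract class HeartbeatWorker implements Runnable, SystemThread {
    // volatile so that a monitoring thread always sees the latest value.
    private volatile long lastActivity = System.currentTimeMillis();
    private volatile boolean cancelled;

    @Override public long lastActivity() {
        return lastActivity;
    }

    /** Asks the main loop to stop after the current iteration. */
    void cancel() {
        cancelled = true;
    }

    /** One unit of work; a real worker would poll a queue, a selector, and so on. */
    protected abstract void iteration() throws InterruptedException;

    @Override public void run() {
        try {
            while (!cancelled && !Thread.currentThread().isInterrupted()) {
                iteration();

                // Updated only after the iteration completes, so a stuck
                // iteration leaves the timestamp stale.
                lastActivity = System.currentTimeMillis();
            }
        }
        catch (InterruptedException e) {
            // Restore the interrupt flag and leave the loop.
            Thread.currentThread().interrupt();
        }
    }
}
```

Updating the timestamp after `iteration()` rather than before it is a design choice of this sketch: the value then reflects completed work, not merely an attempt to start it.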
The SystemThreadRegistry interface should specify:
In GridKernalContext.add(GridComponent comp), the following system threads should be registered in systemThreadsList:
* disco-event-worker
* tcp-disco-sock-reader
* tcp-disco-srvr
* tcp-disco-msg-worker
* tcp-comm-worker
* grid-nio-worker-tcp-comm
* exchange-worker
* sys-stripe
* grid-timeout-worker
* db-checkpoint-thread
* wal-file-archiver
* ttl-cleanup-worker
* nio-acceptor
Every call of checkSystemThread should go through the list of registered system threads and check that current time - t.lastActivity() is less than systemThreadTimeOut. If the last activity was too long ago, the method should print a WARNING to the log that includes t.toString().
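The periodic check described above might look roughly like the sketch below. It is an assumption-laden illustration: the class name `SystemThreadRegistrySketch`, the `register` method, the explicit `now` parameter and the use of `java.util.logging` are all choices made for the example, and Ignite's own logging and registration code would differ. The field names systemThreadsList and systemThreadTimeOut follow the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.logging.Logger;

/** Hypothetical registry sketch; not an existing Ignite class. */
final class SystemThreadRegistrySketch {
    /** Minimal view of a monitored thread, restated so the sketch compiles on its own. */
    interface SystemThread {
        long lastActivity();
    }

    private static final Logger LOG =
        Logger.getLogger(SystemThreadRegistrySketch.class.getName());

    // Copy-on-write list: registration is rare, iteration by the checker is frequent.
    private final List<SystemThread> systemThreadsList = new CopyOnWriteArrayList<>();

    private final long systemThreadTimeOut;

    SystemThreadRegistrySketch(long systemThreadTimeOutMs) {
        this.systemThreadTimeOut = systemThreadTimeOutMs;
    }

    void register(SystemThread t) {
        systemThreadsList.add(t);
    }

    /**
     * Logs a warning for every registered thread whose last activity is at least
     * systemThreadTimeOut milliseconds older than {@code now}, and returns those threads.
     */
    List<SystemThread> checkSystemThread(long now) {
        List<SystemThread> stale = new ArrayList<>();

        for (SystemThread t : systemThreadsList) {
            long idle = now - t.lastActivity();

            if (idle >= systemThreadTimeOut) {
                LOG.warning("System thread inactive for " + idle + " ms: " + t);

                stale.add(t);
            }
        }

        return stale;
    }
}
```

Passing `now` explicitly instead of reading the clock inside the method keeps the check deterministic in tests; a production caller would pass `System.currentTimeMillis()`.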
A heartbeat alone does not prove that a thread is making progress. If the processor jumps off and begins executing code out of order, or a task freezes and is no longer running, the heartbeat could still be generated. The code could also get stuck in the heartbeat function itself and do nothing but generate heartbeats.
[1] https://issues.apache.org/jira/browse/IGNITE-6171