Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


IDIEP-45
Author
 
Sponsor
 
Created

  

Status

Status
colour

Grey

Yellow
title

DRAFT

IN PROGRESS


Table of Contents

Motivation

...

Node failure detection speed-up

Adjustable timeouts

Some Already found that some constants used at failure detection are hardcoded and large.

Simplification

The Also, code responsible for this feature performs a lot of re-checks and re-waits and you may have detection time close to failureDetectionTimeout x2 or even x3.

Hunt for the Zombies

GC Another problem is GC, and it may increase failure detection dramatically, so, watchdog started from another JVM or from native code can help here.

While node in STW it can't let cluster know it exceeds possible STW duration.

Also, the node may start operating after the STW exceeding, this may cause additional performance degradation.

A good case is to detect GC time locally (i.e. using JVMTI agent) and kill the node in case of exceeding.

Discovery messaging speed-up

...

Zookeeper or similar coordinator should be used.

Current implementation of ZookeeperDiscoverySpi should be rewritten, because Zookeeper and similar coordinators (i.e. etcd) are not intended to store large amount of data.

TCP transport for delivering messages should be implemented and leave Zookeeper only coordination, service discovery and similar tasks.

Integration testing framework

Local checks provide non-relevant results. Real cluster checks may be non-reproducible.

To develop, check and be able to recheck fixes related to "Crash recovery speed-up" we should write a framework that allows us to create reproducible tests that can be performed in a real environment. 

A possible solution is to use the Ducktape framework.

Forked to IEP-56: Distributed environment tests.

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

// Links to discussions on the devlist, if applicable.Active nodes aliveness WatchDog discussion

Reference Links

// Links to various reference documents, if applicable.

...

Jira
serverASF JIRA
columnskey,summary,type,created,updated,due,assignee,reporter,priority,status,resolutionfixversions
maximumIssues20
jqlQueryproject = Ignite AND labels IN (iep-45) ORDER BY status
serverId5aa69414-a9e9-3523-82ec-879b028fb15b

...