ID | IEP-45
Author |
Sponsor |
Created |
Status |
The main goal is to keep cache operation latency below 500 ms on node fail/left.
Currently, latency can grow to seconds or even tens of seconds.
The Switch is a process that performs cluster-wide operations leading to a new cluster state (e.g. cache creation/destroy, node join/left/fail, snapshot creation/restore, etc.).
Historically, the Switch performs PME (Partition Map Exchange) even when the partition distribution has not changed and will not change.
It is possible to avoid PME on node left if the partition distribution stays fixed (e.g. a baseline node leaves a fully rebalanced cluster).
This optimization allows operations not affected by the failed primary node to continue during or after the Switch.
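A minimal sketch of that precondition, assuming Ignite native persistence and explicit cluster activation (the class name and the exact configuration values are illustrative, not prescribed by this IEP): with persistence enabled and the baseline topology fixed, a baseline node leave does not change the partition distribution, which is exactly the case where PME can be skipped.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PmeFreeSwitchPrecondition {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Native persistence is required so that the baseline topology (and thus
        // the partition distribution) stays fixed when a baseline node leaves.
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        storageCfg.setDefaultDataRegionConfiguration(
            new DataRegionConfiguration().setPersistenceEnabled(true));

        cfg.setDataStorageConfiguration(storageCfg);

        Ignite ignite = Ignition.start(cfg);

        // Activation fixes the baseline topology. On a fully rebalanced cluster,
        // a baseline node leave then keeps the affinity distribution unchanged.
        ignite.cluster().state(ClusterState.ACTIVE);
    }
}
```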
If nodes are combined into virtual cells such that, for each partition, the backups are located in the same cell as the primary, it is possible to finish the Switch outside the affected cell before transaction recovery finishes.
This optimization allows new operations to start, and even finish, without waiting for the cluster-wide Switch to complete.
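One hedged way to express such cells is through the affinity configuration: a custom affinity backup filter that only accepts backups located in the same cell as the primary. The CELL node attribute and the cache name below are assumptions for illustration; each node would have to advertise its cell via IgniteConfiguration.setUserAttributes().

```java
import java.util.List;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.lang.IgniteBiPredicate;

public class CellularAffinityExample {
    /** Hypothetical node attribute marking the cell a node belongs to. */
    private static final String CELL_ATTR = "CELL";

    public static CacheConfiguration<Integer, String> cacheConfiguration() {
        RendezvousAffinityFunction aff = new RendezvousAffinityFunction();

        // Accept a candidate backup only if it resides in the same cell
        // as the primary node (the first element of the 'selected' list).
        aff.setAffinityBackupFilter(new IgniteBiPredicate<ClusterNode, List<ClusterNode>>() {
            @Override public boolean apply(ClusterNode candidate, List<ClusterNode> selected) {
                Object primaryCell = selected.get(0).attribute(CELL_ATTR);

                return primaryCell != null && primaryCell.equals(candidate.attribute(CELL_ATTR));
            }
        });

        return new CacheConfiguration<Integer, String>("tx-cache")
            .setBackups(1)
            .setAffinity(aff);
    }
}
```

With such a layout, a node failure only forces transaction recovery inside its own cell, so the Switch can complete on all other cells without waiting for it.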
The recovery code should be analyzed for useless sleeps, recovery start priority, possible optimizations, etc.
Some constants used in failure detection are hardcoded and large.
The code responsible for this feature performs many re-checks and re-waits, so the detection time may be close to failureDetectionTimeout x2 or even x3.
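For illustration, the existing user-facing knobs look like the sketch below; the timeout values are arbitrary examples, not recommendations, and lowering them only reduces the base value that the re-checks multiply.

```java
import org.apache.ignite.configuration.IgniteConfiguration;

public class FailureDetectionTuning {
    public static IgniteConfiguration tunedConfiguration() {
        // Lower the failure detection timeouts; because of the internal
        // re-checks and re-waits mentioned above, the effective detection
        // time may still be a small multiple of these values.
        return new IgniteConfiguration()
            .setFailureDetectionTimeout(3_000)         // Server-to-server, ms.
            .setClientFailureDetectionTimeout(10_000); // Client nodes, ms.
    }
}
```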
GC pauses may increase failure detection time dramatically.
While a node is in a stop-the-world (STW) pause, it cannot let the cluster know that it has exceeded the allowed STW duration.
Also, the node may resume operation after such an excessive STW pause, which may cause additional performance degradation.
A good approach is to detect GC pause time locally (e.g. using a JVMTI agent) and kill the node when the limit is exceeded.
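As a simpler, pure-Java alternative to a JVMTI agent, a watchdog thread can measure its own scheduling delay and terminate the process once the pause budget is exceeded; the thresholds and the halt-on-exceed policy below are assumptions, not a prescribed design.

```java
/**
 * Minimal stop-the-world watchdog sketch: measures how long a short sleep
 * actually took; a large overshoot means the whole JVM was paused (GC, swap,
 * etc.), so the node kills itself instead of rejoining in a degraded state.
 */
public class StwWatchdog implements Runnable {
    private static final long PROBE_INTERVAL_MS = 50;
    private static final long MAX_PAUSE_MS = 5_000;

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();

            try {
                Thread.sleep(PROBE_INTERVAL_MS);
            }
            catch (InterruptedException e) {
                return;
            }

            long actualMs = (System.nanoTime() - start) / 1_000_000;

            // The overshoot approximates the length of a JVM-wide pause.
            if (actualMs - PROBE_INTERVAL_MS > MAX_PAUSE_MS)
                Runtime.getRuntime().halt(1); // The node already exceeded the allowed pause.
        }
    }

    public static void start() {
        Thread t = new Thread(new StwWatchdog(), "stw-watchdog");

        t.setDaemon(true);
        t.start();
    }
}
```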
The ring-based topology does not allow performing a fast Switch.
Zookeeper or a similar coordinator should be used.
The current implementation of ZookeeperDiscoverySpi should be rewritten, because Zookeeper and similar coordinators (e.g. etcd) are not intended to store large amounts of data.
A TCP transport for delivering messages should be implemented, leaving to Zookeeper only coordination, service discovery, and similar tasks.
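For reference, a hedged sketch of how ZookeeperDiscoverySpi is configured today (connection string, root path, and timeouts are placeholder values); the rewrite discussed above would keep this coordination role and move message delivery to a dedicated TCP transport.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class ZkDiscoveryConfig {
    public static IgniteConfiguration zkConfiguration() {
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi()
            .setZkConnectionString("zk1:2181,zk2:2181,zk3:2181") // Placeholder hosts.
            .setZkRootPath("/ignite")
            .setSessionTimeout(10_000)
            .setJoinTimeout(30_000);

        return new IgniteConfiguration().setDiscoverySpi(zkSpi);
    }
}
```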
Local checks provide non-relevant results, while real cluster checks may be non-reproducible.
To develop, check, and recheck fixes related to "Crash recovery speed-up", we should write a framework that allows creating reproducible tests that can be run in a real environment.
A possible solution is to use the Ducktape framework.
Forked to IEP-56: Distributed environment tests.
Discussion link: Active nodes aliveness WatchDog discussion (dev list).