ID: IEP-45
Author:
Sponsor:
Created:
Status: DRAFT

Motivation

The main goal is to keep cache operation latency below 500 ms when a node fails or leaves the cluster.
Currently, latency in this case can grow to seconds or even tens of seconds.

Description

The task can be split into the following directions:

Switch speed-up

The Switch is a process that performs cluster-wide operations leading to a new cluster state (e.g. cache creation/destruction, node join/leave/failure, snapshot creation/restore, etc.).

PME-free switch

Historically, the Switch performs PME (Partition Map Exchange) even when the partition distribution has not changed and will not change.

It is possible to avoid PME on node left if the partition distribution is fixed (e.g. a baseline node left a fully rebalanced cluster).

This optimization allows operations that are not affected by the primary node failure to continue during and after the Switch.
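
The condition described above can be sketched as a simple predicate. This is only an illustration of the example case from the text (a baseline node leaving a fully rebalanced cluster); the class, method, and parameter names are hypothetical and are not part of the Ignite API.

```java
import java.util.Set;

/** Hypothetical sketch: decides whether a node-left Switch may skip PME. */
final class PmeFreeSwitchSketch {
    /**
     * @param leftNodeId      id of the node that left or failed (assumed representation)
     * @param baselineNodeIds ids of the nodes in the baseline topology
     * @param fullyRebalanced true if the cluster was fully rebalanced before the node left
     * @return true if the partition distribution is considered fixed, so PME can be skipped
     */
    static boolean canSkipPme(String leftNodeId, Set<String> baselineNodeIds, boolean fullyRebalanced) {
        // Conservative check: only the case named in the text is treated as PME-free.
        // Any other event falls back to the regular Switch with PME.
        return fullyRebalanced && baselineNodeIds.contains(leftNodeId);
    }

    public static void main(String[] args) {
        Set<String> baseline = Set.of("A", "B", "C");
        System.out.println(canSkipPme("B", baseline, true));  // baseline node, rebalanced cluster
        System.out.println(canSkipPme("B", baseline, false)); // rebalance still in progress
    }
}
```

A real implementation would have to derive both inputs from the cluster state; the sketch deliberately leaves that out.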

Cellular Affinity

If nodes are combined into virtual cells such that, for each partition, the backups are located in the same cell as the primary, the Switch can finish outside the affected cell before tx recovery finishes.

This optimization will allow us to start and even finish new operations without waiting for the cluster-wide Switch to finish.
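
As a rough illustration of the cell idea, the sketch below models cells as a map from a cell id to the set of node ids in that cell and asks whether a given node shares a cell with the failed node. All names and the data model are assumptions made for this example; they do not reflect how Ignite represents affinity.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: which nodes must wait for tx recovery before finishing the Switch. */
final class CellularSwitchSketch {
    /** Returns the id of the cell containing the node, or null if the node is in no known cell. */
    static String cellOf(String nodeId, Map<String, Set<String>> cells) {
        for (Map.Entry<String, Set<String>> e : cells.entrySet()) {
            if (e.getValue().contains(nodeId))
                return e.getKey();
        }
        return null;
    }

    /** True if localNode shares a cell with failedNode and so must wait for tx recovery. */
    static boolean mustWaitForRecovery(String localNode, String failedNode, Map<String, Set<String>> cells) {
        String failedCell = cellOf(failedNode, cells);
        String localCell = cellOf(localNode, cells);
        // If either cell is unknown, stay conservative and wait.
        if (failedCell == null || localCell == null)
            return true;
        return failedCell.equals(localCell);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> cells = Map.of(
            "cell-1", Set.of("A", "B"),
            "cell-2", Set.of("C", "D"));
        System.out.println(mustWaitForRecovery("B", "A", cells)); // same cell as the failed node
        System.out.println(mustWaitForRecovery("C", "A", cells)); // different cell
    }
}
```

The sketch relies on the property stated in the text: because backups live in the same cell as their primaries, transactions that need recovery can only involve nodes of the affected cell.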

Recovery speed-up

The code should be analyzed for unnecessary sleeps, the priority with which recovery starts, possible code optimizations, etc.

Node failure detection

It has already been found that some constants used in failure detection are hardcoded and large.

Also, the code responsible for this feature performs many re-checks and re-waits, so the actual detection time can be close to failureDetectionTimeout x2 or even x3.

Another problem is GC: pauses can increase failure detection time dramatically, so a watchdog started from another JVM or from native code could help here.
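
To make the x2/x3 effect concrete, the following sketch multiplies a timeout by the number of times it is effectively waited. The 10 000 ms value is only an assumed input for illustration, and the helper name is hypothetical.

```java
/** Hypothetical sketch: worst-case detection time when the timeout is waited several times. */
final class DetectionTimeSketch {
    /** Returns failureDetectionTimeoutMillis multiplied by the number of wait rounds. */
    static long worstCaseMillis(long failureDetectionTimeoutMillis, int rounds) {
        // multiplyExact throws ArithmeticException on overflow instead of wrapping silently.
        return Math.multiplyExact(failureDetectionTimeoutMillis, (long) rounds);
    }

    public static void main(String[] args) {
        long timeoutMillis = 10_000L; // assumed value, for illustration only
        for (int rounds = 1; rounds <= 3; rounds++)
            System.out.println(rounds + "x -> " + worstCaseMillis(timeoutMillis, rounds) + " ms");
    }
}
```

With such an input, three rounds already give 30 000 ms, far above the 500 ms latency goal stated in the Motivation.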
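
One possible shape of such an external watchdog is sketched below, assuming the monitored JVM periodically touches a heartbeat file and a separate process checks the file's age. The file-based heartbeat, the class, and all method names are assumptions for this example; the text does not prescribe a mechanism.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Hypothetical sketch: heartbeat-file watchdog that runs outside the monitored JVM. */
final class HeartbeatWatchdogSketch {
    /** True if the last heartbeat is older than the allowed silence interval. */
    static boolean isStale(long lastHeartbeatMillis, long nowMillis, long maxSilenceMillis) {
        return nowMillis - lastHeartbeatMillis > maxSilenceMillis;
    }

    /** Called by the monitored JVM: records a heartbeat as the file's modification time. */
    static void beat(Path heartbeatFile, long nowMillis) throws IOException {
        if (Files.notExists(heartbeatFile))
            Files.createFile(heartbeatFile);
        Files.setLastModifiedTime(heartbeatFile, FileTime.fromMillis(nowMillis));
    }

    /** Called by the watchdog process: a long GC pause in the node shows up as a stale file. */
    static boolean nodeLooksDead(Path heartbeatFile, long nowMillis, long maxSilenceMillis) throws IOException {
        long last = Files.getLastModifiedTime(heartbeatFile).toMillis();
        return isStale(last, nowMillis, maxSilenceMillis);
    }
}
```

Because the check runs in another process, it is not delayed by a GC pause in the monitored node. Note that file timestamp resolution depends on the file system, so very small silence intervals may need a different transport.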

Discovery messaging speed-up

A ring-based topology does not allow a fast Switch.

ZooKeeper or a similar coordinator should be used.

Risks and Assumptions


...

Tickets

ASF JIRA query: project = Ignite AND labels IN (iep-45) ORDER BY status