...

The main goal is to keep cache operation latency below 500 ms when a node fails or leaves the cluster.
Currently, latency can rise to seconds or even tens of seconds.

Description

The task can be split into the following parts:

Switch speed-up

The Switch is a process that performs cluster-wide operations leading to a new cluster state (e.g. cache creation/destruction, node join/leave/failure, snapshot creation/restore, etc.).

...

It's possible to avoid PME when a node leaves if the partition distribution is fixed (e.g. a baseline node leaves a fully rebalanced cluster).

This optimization will allow operations that are not affected by the primary node failure to continue during or after the Switch.
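The condition above can be sketched as follows. This is only an illustration: the class and method names are hypothetical and do not come from the code base. It assumes the example case from the text (a baseline node leaving a fully rebalanced cluster) as the test for a fixed partition distribution, and it treats an operation as affected only when one of its partitions had its primary on the departed node.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names come from the actual code base. */
final class PmeFreeSwitchSketch {
    /**
     * Assumption: the partition distribution is considered fixed when the
     * departed node belongs to the baseline and the cluster is fully rebalanced.
     */
    static boolean canSkipPme(String leftNode, Set<String> baseline, boolean fullyRebalanced) {
        return baseline.contains(leftNode) && fullyRebalanced;
    }

    /**
     * An operation is treated as affected only if at least one of its
     * partitions had its primary copy on the departed node.
     */
    static boolean isAffected(Set<Integer> opPartitions,
                              Map<Integer, String> primaryByPartition,
                              String leftNode) {
        for (int part : opPartitions) {
            if (leftNode.equals(primaryByPartition.get(part)))
                return true;
        }
        return false;
    }
}
```

Under these assumptions, an operation touching only partitions whose primaries are on surviving nodes would not need to wait for the Switch.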

...

Cellular switch

If nodes are combined into virtual cells where, for each partition, the backups are located in the same cell as the primary, it's possible to finish the Switch outside the affected cell before tx recovery finishes.

This optimization will allow us to start and even finish new operations without waiting for the cluster-wide Switch to finish.
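The cell idea can be illustrated with a small sketch. All names here are hypothetical, and the model is an assumption made for illustration: each node is mapped to a cell identifier, and each partition is mapped to its list of owners (primary and backups). The first method checks that every partition keeps all its owners inside one cell; the second lists the nodes outside the failed node's cell, which are the ones that could finish the Switch without waiting for tx recovery.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; none of these names come from the actual code base. */
final class CellularSwitchSketch {
    /** True if, for every partition, the primary and all backups share one cell. */
    static boolean isCellular(Map<Integer, List<String>> ownersByPartition,
                              Map<String, String> cellByNode) {
        for (List<String> owners : ownersByPartition.values()) {
            Set<String> cells = new HashSet<>();
            for (String node : owners)
                cells.add(cellByNode.get(node));
            if (cells.size() > 1)
                return false; // This partition spans more than one cell.
        }
        return true;
    }

    /** Nodes outside the failed node's cell, in sorted order. */
    static Set<String> nodesFinishingEarly(String failedNode, Map<String, String> cellByNode) {
        String affectedCell = cellByNode.get(failedNode);
        Set<String> res = new TreeSet<>();
        for (Map.Entry<String, String> e : cellByNode.entrySet()) {
            if (!e.getValue().equals(affectedCell))
                res.add(e.getKey());
        }
        return res;
    }
}
```

For example, with nodes n1 and n2 in one cell and n3 and n4 in another, a failure of n1 would leave n3 and n4 free to finish early in this model.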

...