ID	IEP-53
Author	Sergey Chugunov
Sponsor
Created	2020-08-27
Status	IMPLEMENTATION

Motivation

In several cases we need a special state in which an Ignite node accepts incoming commands via control scripts or the JMX API but does not join the cluster.

In other words, an Ignite node needs to be able to enter a Maintenance Mode in which maintenance actions can be applied.

Examples of use cases: IEP-47: Native persistence defragmentation, and cleaning up potentially corrupted PDS files (Use Case 1 below).

Description

Summary

In Maintenance Mode (MM for short) a node doesn't join the cluster but accepts user commands and/or executes other actions needed for maintenance.
MM is applicable only when persistence is enabled and storage is available.
To enter or leave MM, the node has to be restarted.

Suggested design

A node decides to switch to maintenance only on startup, and only if a special maintenance marker (or several markers) called a Maintenance Record (MR for short) is present on disk. A Maintenance Record consists of a unique ID, a user-readable description and (if necessary) the arguments needed to complete the maintenance action.
Maintenance Records can be created either automatically (when an erroneous situation is detected, as in Use Case 1 below) or by user request (e.g. when the user requests defragmentation of the node's data files). The Maintenance Registry is responsible for managing MRs and provides an API for such management.
When the maintenance action for a particular MR is completed, the record can be removed from the registry. If no records are left, the node returns to normal operation on the next restart. Otherwise (at least one MR is still present on disk) the node enters MM again.
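The lifecycle above can be illustrated with a minimal sketch. It is not the Ignite implementation: the class names, the tab-separated file format and the method names are all assumptions made for illustration. The point it shows is that the mode is decided once, from the records found on disk at startup, and that later changes to the registry only take effect on the next start.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a file-backed registry; not the Ignite API. */
final class MaintenanceRegistrySketch {
    /** One Maintenance Record: unique ID, readable description, optional arguments. */
    static final class Record {
        final String id;
        final String description;
        final String args; // empty string when no arguments are needed

        Record(String id, String description, String args) {
            this.id = id;
            this.description = description;
            this.args = args;
        }
    }

    private final Path file;
    private final Map<String, Record> records = new LinkedHashMap<>();
    private final boolean maintenanceMode;

    /** Models node startup: loads records from disk and fixes the mode. */
    MaintenanceRegistrySketch(Path file) {
        this.file = file;
        try {
            if (Files.exists(file)) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    if (line.isEmpty())
                        continue;
                    // Assumed format: id <TAB> description <TAB> args (no tabs inside fields).
                    String[] p = line.split("\t", -1);
                    records.put(p[0], new Record(p[0], p[1], p[2]));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        maintenanceMode = !records.isEmpty();
    }

    /** True if at least one record was on disk when this "node" started. */
    boolean isMaintenanceMode() { return maintenanceMode; }

    void register(Record r) { records.put(r.id, r); flush(); }

    void unregister(String id) { records.remove(id); flush(); }

    /** Rewrites the whole file so that the disk always mirrors the registry. */
    private void flush() {
        List<String> lines = new ArrayList<>();
        for (Record r : records.values())
            lines.add(r.id + "\t" + r.description + "\t" + r.args);
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, creating a new MaintenanceRegistrySketch over the same file stands in for a node restart: registering a record leaves the current instance in normal mode, while the next instance starts in MM.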

Use Case 1 - Cleaning up potentially corrupted PDS

This case is tracked in a Jira ticket where a draft implementation of MM is ready for review. It involves automatic creation of an MR but requires manual actions to complete the maintenance. The case works as follows:

The node fails in the middle of a checkpoint while WAL is disabled for one or several caches.
On the next restart the node detects that the data files of those caches could be corrupted, creates an MR and shuts down.
On the following restart the node enters MM and waits for the user to fix the problem (instead of failing again). In managed environments like Kubernetes this means that the node won't be restarted automatically, so the user is able to find the possibly corrupted files and remove them.
Once the files are removed (a manual action), the user removes the MR from the registry and restarts the node.
The node starts up in normal mode and joins the cluster in the regular way.
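The sequence of restarts above can be sketched as a small simulation. Everything here is hypothetical (the Disk stand-in, the Outcome enum, the record ID); it only illustrates the order of checks on startup under the assumption that existing records are examined before corruption detection runs.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical simulation of the corrupted-PDS flow; names are illustrative only. */
final class CorruptedPdsFlowSketch {
    enum Outcome { NORMAL, SHUT_DOWN_AFTER_CREATING_RECORD, MAINTENANCE }

    /** Stand-in for the state that survives node restarts. */
    static final class Disk {
        boolean possiblyCorruptedFiles;
        final Set<String> maintenanceRecords = new LinkedHashSet<>();
    }

    /** Hypothetical record ID for this maintenance action. */
    static final String RECORD_ID = "corrupted-pds-cleanup";

    /** Models a single node start against the given disk state. */
    static Outcome start(Disk disk) {
        // A record left on disk wins: wait for the user instead of failing again.
        if (!disk.maintenanceRecords.isEmpty())
            return Outcome.MAINTENANCE;

        // First start after the failed checkpoint: leave a record and stop.
        if (disk.possiblyCorruptedFiles) {
            disk.maintenanceRecords.add(RECORD_ID);
            return Outcome.SHUT_DOWN_AFTER_CREATING_RECORD;
        }

        // Nothing to maintain: join the cluster as usual.
        return Outcome.NORMAL;
    }
}
```

Calling start three times, with the user clearing the files and removing the record between the second and third call, yields SHUT_DOWN_AFTER_CREATING_RECORD, MAINTENANCE and NORMAL in that order.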

Use Case 2 - Native Persistence Defragmentation

IEP-47: Native persistence defragmentation is dedicated to implementing this rather large piece of functionality. In a nutshell, this case uses MM in the opposite way: the Maintenance Record is created manually but the action is completed automatically. The main steps are:

Using the control.{sh|bat} script or other APIs, the user requests creation of an MR for defragmenting either all native persistence on the node or particular caches. The MR is created and saved on disk.
The user restarts the node; it enters Maintenance Mode, finds the defragmentation MR and starts working on the task.
When defragmentation is done, the MR is deleted automatically. On the next restart the node, now with defragmented PDS, enters normal operation.
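The automatic variant can be sketched in the same spirit. The class below is an assumption made for illustration, not the design from IEP-47: records are kept in a map instead of on disk, and an action is any callback that receives the record's arguments. What it shows is that a record is removed only after its action has completed, so an interrupted action leaves the record in place for the next start.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch of automatically completed maintenance; not the Ignite API. */
final class DefragmentationFlowSketch {
    /** Record ID -> arguments; stands in for records stored on disk. */
    final Map<String, String> records = new LinkedHashMap<>();

    /** Record ID -> automatic action that consumes the record's arguments. */
    final Map<String, Consumer<String>> actions = new LinkedHashMap<>();

    /** Models the user request made through a control script or another API. */
    void requestMaintenance(String id, String args) {
        records.put(id, args);
    }

    /** Models a node start; returns true if the node started in Maintenance Mode. */
    boolean start() {
        if (records.isEmpty())
            return false; // normal operation

        // Iterate over a copy so records can be removed as actions complete.
        for (Map.Entry<String, String> e : new LinkedHashMap<>(records).entrySet()) {
            Consumer<String> action = actions.get(e.getKey());
            if (action == null)
                continue; // no automatic action: the record waits for the user

            action.accept(e.getValue());
            records.remove(e.getKey()); // reached only if the action did not throw
        }
        return true;
    }
}
```

With an action registered under the same ID as the requested record, the first start() runs the action and returns true, and the second start() returns false because no records remain.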

Risks and Assumptions

It is assumed that no major changes are needed in CommandHandler to enable connecting to a particular node (the one in MM).

Discussion Links

Discussion on dev-list

Tickets

// Links or report with relevant JIRA tickets.
