ID: IEP-53
Author
Sponsor
Created 2020-08-27
Status: completed


Motivation

...

  • A node decides to switch to maintenance only at startup, if a special maintenance marker (or several markers) called a Maintenance Task (MT for short) is present on disk. A Maintenance Task consists of a unique ID, a user-readable description and (if necessary) the arguments needed to complete the maintenance action.
  • Maintenance Tasks can be created either automatically (when an erroneous situation is detected, as in IGNITE-13366) or by user request (e.g. when the user requests defragmentation of the node's data files). The Maintenance Registry is responsible for managing MTs and provides an API for such management.
  • When the maintenance action for a particular MT is completed, the task can be removed from the registry. If no tasks are left, the node returns to normal operations on the next restart. Otherwise (at least one MT is still present on disk) the node enters MM again.
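
The lifecycle described above can be illustrated with a minimal sketch. All class and method names here (SketchMaintenanceTask, SketchMaintenanceRegistry, the tab-separated file format) are hypothetical and chosen for illustration only; they are not the Ignite API. The sketch assumes that the mode is decided once, when the registry is loaded at startup, so a task registered on a running node takes effect only after a restart.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical task: unique ID, user-readable description, optional arguments. */
final class SketchMaintenanceTask {
    final String id;
    final String description;
    final String args;

    SketchMaintenanceTask(String id, String description, String args) {
        this.id = id;
        this.description = description;
        this.args = args == null ? "" : args;
    }
}

/** Hypothetical registry that stores tasks in a tab-separated file so they survive a restart. */
final class SketchMaintenanceRegistry {
    private final Path file;
    private final Map<String, SketchMaintenanceTask> tasks = new LinkedHashMap<>();
    private final boolean maintenanceMode;

    /** Simulates node startup: reads stored tasks and decides the mode once. */
    SketchMaintenanceRegistry(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (line.isEmpty())
                    continue;
                // Simplification: fields must not contain tab characters.
                String[] parts = line.split("\t", -1);
                tasks.put(parts[0], new SketchMaintenanceTask(parts[0], parts[1], parts[2]));
            }
        }
        maintenanceMode = !tasks.isEmpty();
    }

    /** True if at least one task was found on disk at startup. */
    boolean isMaintenanceMode() {
        return maintenanceMode;
    }

    /** Stores a task; the node enters maintenance only on the next startup. */
    void register(SketchMaintenanceTask task) throws IOException {
        tasks.put(task.id, task);
        persist();
    }

    /** Removes a completed task; with no tasks left the next startup is a normal one. */
    void unregister(String id) throws IOException {
        tasks.remove(id);
        persist();
    }

    private void persist() throws IOException {
        List<String> lines = new ArrayList<>();
        for (SketchMaintenanceTask t : tasks.values())
            lines.add(t.id + "\t" + t.description + "\t" + t.args);
        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}
```

Creating a new SketchMaintenanceRegistry over the same file plays the role of a node restart in this sketch.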

...

This case is described in ticket IGNITE-13366, where a draft implementation of MM is ready for review. It involves automatic creation of an MT but requires manual actions to complete the maintenance. The case works as follows:

  1. Node fails in the middle of checkpoint when WAL is disabled for one or several caches.
  2. On the next restart the node detects that data files of those caches could be corrupted, creates an MT and shuts down.
  3. On the following restart the node enters MM and waits for the user to fix the problem (instead of failing again). In managed environments like Kubernetes this means that the node won't be restarted automatically and the user will be able to find the possibly corrupted files and remove them.
  4. When the files are removed (manual action), the user removes the MT from the registry and restarts the node.
  5. Node starts up in normal mode and joins the cluster in a regular way.

...

IEP-47: Native persistence defragmentation is dedicated to implementing this rather big piece of functionality. In a nutshell, this case again involves MM, but in the opposite form: the Maintenance Task is created manually and the action is completed automatically. The main steps of the case:

  1. The user requests, with the control.{sh|bat} script or via other APIs, creation of an MT for defragmenting all native persistence on the node or particular caches. The MT is created and saved on disk.
  2. The user restarts the node; it enters Maintenance Mode, finds the MT for defragmentation and starts working on the task.
  3. When defragmentation is done, the MT is deleted automatically. On the next restart the node with the defragmented PDS enters normal operations.

Maintenance Action and Maintenance Workflow

Although MM assumes manual user intervention to fix the reason for maintenance, the component that requested MM may also know how to fix the issues and be able to execute the necessary actions automatically.
The only thing it may need is a user command to execute these actions.

In the case of PDS defragmentation, which is also covered by MM, all actions are executed automatically from the very beginning.

To cover both cases an additional entity is suggested: MaintenanceAction. It is just an interface that can be called by the Maintenance component when the user requests its execution or when the Maintenance component decides it is time to start automatic actions.

Workflow with MaintenanceAction may look like this:

  1. Maintenance Registry starts among the first components and reads from disk the information about MaintenanceTasks registered earlier.
  2. Other components start after Maintenance and check whether the node is in MM and they should function differently. If the node is in MM, they register a special callback in Maintenance Registry that provides Maintenance Actions to the registry.
  3. After all components are started, Maintenance Registry prepares maintenance: it checks whether the user has already fixed the issues manually while the node was shut down, prints information about this to the log and modifies or deletes Maintenance Tasks if needed.
  4. When maintenance is prepared and there are still unresolved MaintenanceTasks, the Maintenance component either starts automatic actions like PDS defragmentation or exposes a list of user-triggered actions through CLI/JMX APIs and waits for user commands.
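
Steps 2 to 4 of this workflow could be sketched roughly as follows. Only the idea of a MaintenanceAction interface and a per-component callback comes from the text above; every name in the code (SketchAction, SketchNamedAction, SketchWorkflowCallback, SketchMaintenanceWorkflow and their methods) is a hypothetical stand-in, and logging and persistence are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical counterpart of MaintenanceAction. */
interface SketchAction {
    String name();
    void execute();
}

/** Small helper that wraps a Runnable as a named action. */
final class SketchNamedAction implements SketchAction {
    private final String name;
    private final Runnable body;

    SketchNamedAction(String name, Runnable body) {
        this.name = name;
        this.body = body;
    }

    @Override public String name() { return name; }
    @Override public void execute() { body.run(); }
}

/** Hypothetical callback a component registers for its task when the node is in MM (step 2). */
interface SketchWorkflowCallback {
    /** False if the user already fixed the issue while the node was down. */
    boolean stillNeeded();
    /** Action to run without user involvement, if the component has one. */
    Optional<SketchAction> automaticAction();
    /** Actions that wait for an explicit user command. */
    List<SketchAction> userActions();
}

final class SketchMaintenanceWorkflow {
    private final Map<String, SketchWorkflowCallback> callbacks = new LinkedHashMap<>();
    private final List<String> pending;

    /** Step 1: task IDs read from disk at startup. */
    SketchMaintenanceWorkflow(List<String> taskIdsFromDisk) {
        pending = new ArrayList<>(taskIdsFromDisk);
    }

    /** Step 2: a component registers a callback for the task it owns. */
    void registerCallback(String taskId, SketchWorkflowCallback cb) {
        callbacks.put(taskId, cb);
    }

    /** Steps 3 and 4: drop resolved tasks, run automatic actions, return actions awaiting a user command. */
    List<SketchAction> prepareAndStart() {
        List<SketchAction> exposed = new ArrayList<>();
        for (String id : new ArrayList<>(pending)) {
            SketchWorkflowCallback cb = callbacks.get(id);
            if (cb == null)
                continue; // No component claimed the task; it stays pending.
            if (!cb.stillNeeded()) {
                pending.remove(id); // Already fixed manually.
                continue;
            }
            Optional<SketchAction> auto = cb.automaticAction();
            if (auto.isPresent()) {
                auto.get().execute();
                pending.remove(id); // Automatic maintenance completed.
            }
            else
                exposed.addAll(cb.userActions());
        }
        return exposed;
    }

    List<String> pendingTaskIds() {
        return new ArrayList<>(pending);
    }
}
```

In this sketch a task with an automatic action (defragmentation) is removed as soon as the action finishes, while a task with only user-triggered actions (corrupted PDS cleanup) stays pending until the user acts.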

Implementation improvements

  1. With the fixed structure of the workflow (register tasks → restart → prepare maintenance → execute maintenance) it is not possible to request new actions while the node is in maintenance mode. This could be improved by using InternalSubscriptionProcessor and registering components as listeners for events like "Maintenance Task with ID registered..."
  2. The current implementation lacks validation of Maintenance Tasks, except for basic validation of the UUID structure. The problem is that validation is component-specific and should be implemented by the components themselves. With InternalSubscriptionProcessor, validation of MaintenanceTasks could potentially be implemented in an elegant way: when a component is notified about a new task that it knows how to validate, it validates the task and sets a specific flag on it. When MaintenanceProcessor sees that no component has validated a task, it skips the task as invalid.
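
The listener-based validation proposed in item 2 might look like the sketch below. It does not use InternalSubscriptionProcessor; SketchTaskListener and SketchValidatingRegistry are hypothetical names, and a boolean return value stands in for the "specific flag" set on the task.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical listener implemented by each component that owns some kind of task. */
interface SketchTaskListener {
    /** Returns true only if this component recognizes the task and considers it valid. */
    boolean onTaskRegistered(String taskId, String args);
}

/** Hypothetical registry that accepts a task only if some component validated it. */
final class SketchValidatingRegistry {
    private final List<SketchTaskListener> listeners = new ArrayList<>();
    private final List<String> accepted = new ArrayList<>();

    void subscribe(SketchTaskListener listener) {
        listeners.add(listener);
    }

    /** Notifies all components; a task nobody validated is skipped as invalid. */
    boolean register(String taskId, String args) {
        boolean validated = false;
        for (SketchTaskListener l : listeners)
            validated |= l.onTaskRegistered(taskId, args);
        if (validated)
            accepted.add(taskId);
        return validated;
    }

    List<String> acceptedTaskIds() {
        return new ArrayList<>(accepted);
    }
}
```

A defragmentation component, for instance, could subscribe with a listener that accepts only its own task ID and rejects empty arguments; tasks with unknown IDs would then be dropped without the registry knowing anything component-specific.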

Risks and Assumptions

  1. It is assumed that no major changes are needed in CommandHandler to enable connecting to a particular node (the one in MM).

...

Discussion on dev-list

Tickets

ASF JIRA issues matching the query: project = Ignite AND labels IN (IEP-53)