...

  • Removing destroyQueue and replacing it with a subscription to the lifecycle

  • Removing all snapshot-related dependencies such as nextSnapshot, snapshotMgr, and (partially) DbCheckpointListener.Context

  • Adding the ability to subscribe to the lifecycle in a configured order (this is important for removing the snapshotMgr usage)

  • Subscribing to the checkpoint lifecycle directly from CacheDataStore instead of GridCacheOffheapManager

  • Collecting CheckpointIterators that need no further preparation, instead of collecting raw pages and sorting them afterwards

  • Extracting the implementation of CheckpointIterator out of the checkpoint, since it should be specific to each source. In the current case, PageMemoryImpl can hold an iterator implementation that iterates over all pages in all segments in sorted order.

  • Reimplementing PageReplaced so that it can no longer write to disk directly but can instead ask, for example, CheckpointIterator to increase the priority of a certain page.

  • Moving all knowledge about the checkpoint buffer from the checkpoint to the page memory, because the checkpoint buffer is a feature specific to a particular page memory implementation.

  • (If possible) Removing stateFutures and using the lifecycle instead.
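
The idea of subscribing to the lifecycle in a configured order can be illustrated with a minimal sketch. All names here (LifecycleListener, OrderedLifecycle, the integer order value) are hypothetical and chosen for illustration only; they are not the actual Ignite classes, and the real change may look quite different.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical listener interface; not the actual Ignite API. */
interface LifecycleListener {
    /** Called when a checkpoint begins; appends a marker so the call order is visible. */
    void onCheckpointBegin(List<String> log);
}

/** Hypothetical lifecycle that notifies subscribers in an explicitly configured order. */
class OrderedLifecycle {
    /** A listener together with its configured position. */
    private static final class Entry {
        final int order;
        final LifecycleListener listener;

        Entry(int order, LifecycleListener listener) {
            this.order = order;
            this.listener = listener;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Lower order values are notified first; equal values keep their subscription order. */
    void subscribe(int order, LifecycleListener listener) {
        entries.add(new Entry(order, listener));
        // List.sort is stable, so listeners with the same order stay in insertion order.
        entries.sort(Comparator.comparingInt((Entry e) -> e.order));
    }

    /** Notifies every listener in the configured order and returns the resulting call log. */
    List<String> fireCheckpointBegin() {
        List<String> log = new ArrayList<>();
        for (Entry e : entries)
            e.listener.onCheckpointBegin(log);
        return log;
    }
}
```

With explicit ordering, a component that previously relied on snapshotMgr being invoked at a fixed point could instead subscribe with an order value that places it before or after other listeners.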

Maintenance mode

Restrictions (wishes) that we have:

  • While defragmentation is in progress, the node should not affect the rest of the cluster.
  • There should be a way to collect the current progress of defragmentation.
  • It should be possible to cancel defragmentation.
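
The progress and cancellation wishes above could be met by a task that works through partitions one at a time and checks a cancel flag between them. The sketch below is only an illustration under that assumption: DefragmentationTask, defragmentPartition and progressPercent are hypothetical names, not part of Ignite, and per-partition granularity is an assumed design choice.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical cancellable task with progress reporting; not the actual Ignite API. */
class DefragmentationTask {
    private final int totalPartitions;
    private final AtomicInteger processed = new AtomicInteger();
    private final AtomicBoolean cancelled = new AtomicBoolean();

    DefragmentationTask(int totalPartitions) {
        this.totalPartitions = totalPartitions;
    }

    /**
     * Processes partitions one by one and checks the cancel flag before each partition.
     * Returns true if all partitions were processed, false if the task was cancelled.
     */
    boolean run() {
        while (processed.get() < totalPartitions) {
            if (cancelled.get())
                return false;

            defragmentPartition(processed.get());
            processed.incrementAndGet();
        }
        return true;
    }

    /** Placeholder for the real per-partition work. */
    void defragmentPartition(int partId) {
        // Intentionally empty in this sketch.
    }

    /** Requests cancellation; takes effect before the next partition starts. */
    void cancel() {
        cancelled.set(true);
    }

    /** Share of processed partitions as an integer percentage; safe to call from another thread. */
    int progressPercent() {
        return totalPartitions == 0 ? 100 : processed.get() * 100 / totalPartitions;
    }
}
```

How an external tool would reach cancel() and progressPercent() on a node that has not joined the cluster is exactly the open communication question discussed in this section.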

There are no complete ideas yet on how this should look. It may make sense to have a started node that has not joined the cluster (the correct state is right after the recovery step). However, it is not yet clear how to communicate with such a node (REST API? control.sh? system properties?) or how to transition to and from this state.

A separate IEP has been created for Maintenance Mode development, as it is useful in other cases too: IEP-53: Maintenance Mode.

Risks and Assumptions

  • A corrupted page in a new partition cannot be restored from WAL. This can potentially lead to unrecoverable corruption that can only be fixed with a full rebalance. To be fair, a full rebalance has the same issue, because WAL is usually disabled while it runs.
  • In the worst case, defragmentation of a whole cache group requires as much additional storage space as the group itself occupies, or even more.
  • There has to be a way to cancel defragmentation. For example, if the storage runs out of free space, the process cannot finish no matter how many times the node restarts. It is important to prevent constant restart-fail loops, because they make the situation very hard to fix in environments like k8s.
  • The process might take a long time to complete. There has to be a way to track the progress.

...