...
Formulated by Ivan Bessonov, Anton Kalashnikov and Sergey Chugunov.
The first naive approach is to make this operation offline (without active data loading); this will simplify a lot of things.
So, the requirement is to have an option that will trigger the defragmentation process after a node restart. I see the following options:
defragment everything;
defragment specific cache groups;
defragment specific partitions of specific cache groups.
API approaches will be discussed later.
This IEP includes 3 big parts:
There are a number of approaches that could be implemented. I will describe the one that we consider a potential candidate for implementation. This is not a real-time approach, which means that any operation except defragmentation itself must be forbidden while it runs.
The proper moment to start the procedure is right after logical recovery. The latest checkpoint must be completed, so all the data is on disk and internally consistent.
First we'll look at regular partitions that are not related to SQL. There are two main structures containing data there: the cache data tree and the pending entries tree. We'll start with the data tree, for reasons described later.
...
it is assumed that the process will be single-threaded, which means that we don't need locks for insert operations into the new partition;
every insert into the tree will be strictly at the tail; this knowledge can be exploited:
we can use a specific operation that inserts at the tail without any comparisons;
or we can construct the tree by layers, because we know the amount and the order of the entries. This might be the most efficient way, but it will require a lot of new code with careful testing;
entries should not be deserialized to byte arrays where possible. Memory consumption should be restricted; for example, direct memory copying can be utilized to achieve this.
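The "construct the tree by layers" idea can be sketched as follows. This is a toy illustration, not Ignite's actual B+Tree code: a "page" here is just a list of at most FANOUT sorted keys, and each upper level is built bottom-up from the first key of every page below, so no per-key comparisons are ever needed.

```java
import java.util.ArrayList;
import java.util.List;

public class LayeredBuild {
    static final int FANOUT = 4; // keys per page; illustrative only, real pages fit far more

    /** Splits one sorted level into "pages" of at most FANOUT keys. */
    static <T> List<List<T>> paginate(List<T> level) {
        List<List<T>> pages = new ArrayList<>();
        for (int i = 0; i < level.size(); i += FANOUT)
            pages.add(new ArrayList<>(level.subList(i, Math.min(i + FANOUT, level.size()))));
        return pages;
    }

    /** Builds the tree bottom-up from already-sorted keys and returns its height. */
    public static int buildLevels(List<Integer> sortedKeys) {
        List<Integer> cur = sortedKeys;
        int levels = 0;
        while (true) {
            List<List<Integer>> pages = paginate(cur);
            levels++;
            if (pages.size() == 1)
                return levels; // single page left: this is the root
            List<Integer> next = new ArrayList<>();
            for (List<Integer> p : pages)
                next.add(p.get(0)); // first key of each page is promoted to the parent level
            cur = next;
        }
    }
}
```

With 100 keys and FANOUT = 4 this produces 25 leaf pages, then 7, 2 and finally 1 page: a 4-level tree built without a single key comparison.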
During the restructuring of the cache data tree we should store the mapping oldLink→newLink. The first reason is that the pending tree uses links to compare entries, so the order of elements in the pending tree will change after defragmentation. This mapping should be persisted to storage right next to the partition it represents, for reasons described later in the document.
As already mentioned, the pending entries tree cannot be copied using the same algorithm, because the order of its elements can't be preserved. So regular B+Tree inserts have to be used for the entries with updated links.
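A toy sketch of this re-insertion, with hypothetical names (PendingRow and the (expireTime, link) ordering stand in for the real pending-tree row format), assuming every old link is present in the mapping:

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeSet;

public class PendingTreeRemap {
    /** Hypothetical pending-tree row: entries are ordered by (expireTime, link). */
    public record PendingRow(long expireTime, long link) {}

    /** Rebuilds the pending tree with regular ordered inserts: once links are
     *  remapped, the (expireTime, link) order may differ from the old tree's,
     *  so a tail-append copy is not possible here. */
    public static TreeSet<PendingRow> remap(Collection<PendingRow> oldTree, Map<Long, Long> linkMap) {
        TreeSet<PendingRow> newTree = new TreeSet<>(
            Comparator.comparingLong(PendingRow::expireTime).thenComparingLong(PendingRow::link));
        for (PendingRow r : oldTree)
            newTree.add(new PendingRow(r.expireTime(), linkMap.get(r.link()))); // regular insert
        return newTree;
    }
}
```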
WAL should be disabled, of course. We should probably write some markers into it once the operation is completed; this might be helpful for debugging later.
...
There are two options here - a disk-based hashmap or a B+Tree. I think the latter is preferable for the following reasons:
...
Once we start deleting partitions, there's no way back, and the operation has to be completed after restart if the node failed for some reason. That's obvious, and a fault-tolerant batch-rename operation doesn't seem hard at all.
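A minimal sketch of one fault-tolerant step of such a rename, assuming illustrative file names (part.bin / part.bin.defrag are not Ignite's actual layout). The invariant is that the step is idempotent: re-running it after a crash always converges to the defragmented file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PartitionSwap {
    /** Replaces part.bin with the fully written part.bin.defrag.
     *  Safe to re-run after a crash: if the .defrag file is gone,
     *  the swap already completed and there is nothing to do. */
    public static void completeSwap(Path dir) throws IOException {
        Path defragmented = dir.resolve("part.bin.defrag");
        Path target = dir.resolve("part.bin");
        if (Files.exists(defragmented)) {
            // On POSIX file systems rename is atomic: a reader sees either
            // the old or the new file, never a torn intermediate state.
            Files.move(defragmented, target,
                StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A batch rename over many partitions is then just this step in a loop; a crash in the middle leaves each partition either fully old or fully new, and the loop is re-run on restart.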
In other scenarios, it would be convenient to use an already defragmented partition after a failure/restart if the operation was partially completed.
...
reuse regular checkpointing. The interval might be decreased for this operation, because it's expected to be relatively fast and data-intensive. Start/end markers are not required, since we don't actually affect current partitions and the recovery mechanism is much less complex;
lightweight checkpointing, by which I mean writing pages as soon as they become dirty. Sounds simple enough, but it may be much slower if there are a lot of small cache entries and the write operation is not buffered. The good thing is that writes will be mostly append-only. Obviously, the force operation is only required once per partition;
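The lightweight variant could look roughly like the sketch below (hypothetical class, not Ignite code): dirty pages are pushed through a write buffer as they appear, so the append-only stream is batched, and fsync happens exactly once per partition.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** "Lightweight checkpoint": write each page once it becomes dirty,
 *  buffered to avoid many tiny writes; force() once per partition. */
public class LightweightCheckpoint implements AutoCloseable {
    private final FileChannel ch;
    private final ByteBuffer buf;

    public LightweightCheckpoint(Path file, int bufPages, int pageSize) throws IOException {
        ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        buf = ByteBuffer.allocate(bufPages * pageSize);
    }

    /** Buffers a dirty page; flushes the buffer to disk when it fills up. */
    public void onPageDirty(byte[] page) throws IOException {
        if (buf.remaining() < page.length)
            flush();
        buf.put(page);
    }

    private void flush() throws IOException {
        buf.flip();
        ch.write(buf);
        buf.clear();
    }

    /** Called once per partition: flush the tail and fsync exactly once. */
    @Override public void close() throws IOException {
        flush();
        ch.force(true);
        ch.close();
    }
}
```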
index.bin can have an unlimited number of trees with possibly inlined data in them. What's even worse is that we have a single index file for all partitions. This means that defragmentation of a single partition affects the whole index - physical links in it are no longer valid. And the index must always be the last thing to defragment.
We already have the oldLink→newLink mapping for all partitions, so we should copy the index file page by page while replacing links in all of them according to the mapping. This can easily be done in parallel, which is important because indexes are huge. (Or we can reinsert everything into new trees, as we did with partitions, if we need to defragment the index file as well; the same code can be reused.)
In the first case a checkpoint is not needed, because we write the new file in append-only mode. We may not even need a data region for the same reason - we just read the whole file page by page.
A really simple option is to just delete index.bin, but that's way too desperate.
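The page-by-page link replacement could look roughly like the toy sketch below. This deliberately ignores real page layout (headers, items, inline data): a "page" here is modeled as a flat array of 8-byte links, just to show the copy-then-remap idea that can run independently, and thus in parallel, for each page.

```java
import java.nio.ByteBuffer;
import java.util.Map;

public class IndexPageRewrite {
    /** Rewrites every known link in a copied page according to oldLink -> newLink.
     *  Toy model: the page is a flat sequence of 8-byte links. */
    public static void remapLinks(ByteBuffer page, Map<Long, Long> linkMap) {
        for (int off = 0; off < page.capacity(); off += Long.BYTES) {
            long old = page.getLong(off);
            Long updated = linkMap.get(old);
            if (updated != null)
                page.putLong(off, updated); // in-place replacement on the copied page
        }
    }
}
```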
...
I think that the most appropriate place for it is the MetaStorage, because it survives node restarts. Parameters should be deleted once the operation is completed. But I don't think that progress information should be stored there as well - it's much easier to have WAL completely disabled rather than deciding which component can use it and which cannot. We should not leave any junk in the WAL or in the original partitions.
...
JMX looks like an appropriate option. It doesn't depend on the node being in the topology, which is a good thing. I believe that it should already be available before discovery is started (confirmation required). Also, JMX can in theory show the progress of the operation, but I'm not sure about the usability side of such a thing.
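A hedged sketch of what such a management bean might look like; the interface name, attribute and operation are all hypothetical, not Ignite's actual management API:

```java
import java.lang.management.ManagementFactory;
import javax.management.JMX;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class DefragJmx {
    /** Hypothetical MXBean; names are illustrative only. */
    public interface DefragmentationMXBean {
        /** Schedules defragmentation of the given cache groups on next restart. */
        void scheduleDefragmentation(String cacheGroups);

        /** @return Whether a defragmentation is currently scheduled. */
        boolean isScheduled();
    }

    public static class Defragmentation implements DefragmentationMXBean {
        private volatile String groups;
        @Override public void scheduleDefragmentation(String cacheGroups) { groups = cacheGroups; }
        @Override public boolean isScheduled() { return groups != null; }
    }

    /** Registers the bean on the platform MBean server and returns a client proxy,
     *  the same way jconsole or control.sh could reach it over JMX. */
    public static DefragmentationMXBean register() throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("org.example:type=Defragmentation");
        srv.registerMBean(new Defragmentation(), name);
        return JMX.newMXBeanProxy(srv, name, DefragmentationMXBean.class);
    }
}
```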
Canceling the operation
Again, it looks like there's no choice besides JMX here.
...
REST is technically started before the node joins the cluster. It should allow us to have a connection with control.sh, but it is essential to know the port.
...
...
Proposed by Anton Kalashnikov
As described before, we need to find some way to save pages to disk during defragmentation. One of the approaches is refactoring the current checkpoint to add more flexibility. It can also be helpful in other Ignite operations, such as recovery, where it is likewise impossible to use the current implementation of the checkpoint.
The idea is to simplify the checkpoint's responsibility: the checkpoint is responsible only for the consistent collection of data (pages changed under the read lock) from a 'Source' and for storing it to a 'Target'.
At a high level of abstraction, the main parts look like this:
...
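The Source/Target split described above can be sketched as a pair of interfaces (the names Source, Target, Page and the checkpoint method are illustrative, not actual Ignite classes). The point is that the checkpoint itself knows nothing about page memory, snapshots or defragmentation - it only pumps pages from one side to the other:

```java
import java.util.Iterator;

public class CheckpointParts {
    /** A page snapshot collected consistently under the checkpoint read lock. */
    public record Page(long pageId, byte[] data) {}

    /** 'Source': supplies an iterator over the collected dirty pages
     *  (e.g. PageMemoryImpl iterating over all segments in sorted order). */
    public interface Source {
        Iterator<Page> checkpointIterator();
    }

    /** 'Target': persists pages - a partition file, a defragmented copy, etc. */
    public interface Target {
        void write(Page page);
        void force();
    }

    /** The checkpoint itself is now only the pump between Source and Target. */
    public static int checkpoint(Source src, Target dst) {
        int written = 0;
        for (Iterator<Page> it = src.checkpointIterator(); it.hasNext(); ) {
            dst.write(it.next());
            written++;
        }
        dst.force(); // single fsync once the source is drained
        return written;
    }
}
```

Defragmentation then becomes just another Target (a new partition file), and regular checkpointing another, without the checkpoint code changing.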
Removing destroyQueue and replacing it with a subscription to the lifecycle
Removing all snapshot-related dependencies such as nextSnapshot, snapshotMgr, DbCheckpointListener.Context (partially)
Adding the possibility to subscribe to the lifecycle in a configured order (this is important for removing the snapshotMgr usage)
Subscribing to the checkpoint lifecycle directly from CacheDataStore instead of GridCacheOffheapManager
Collecting CheckpointIterators without further preparation, instead of collecting raw pages with subsequent sorting
Extracting the implementation of CheckpointIterator outside of the checkpoint - it should be specific to each source. In the current case, PageMemoryImpl can hold the implementation of an iterator which iterates over all pages in all segments in sorted order.
Reimplementing PageReplaced so that it can't write to disk directly anymore, but can ask, for example, the CheckpointIterator to increase the priority of a certain page
Moving all knowledge about the checkpoint buffer from the checkpoint to the page memory, because the checkpoint buffer is a specific feature of a certain page memory implementation
(If possible) Removing stateFutures and using the lifecycle instead
Restrictions (wishes) which we have:
...
There are no complete ideas yet on how this should look. But perhaps it makes sense to have a started node which is not joined to the cluster (the correct state is right after the recovery step). However, how to communicate with this node (REST API? control.sh? system properties?) and how to transition to/from this state are not clear right now.
...