ID | IEP-47 |
Author | |
Sponsor | |
Created | 2020-05-19 |
Status | IMPLEMENTATION |
Given the nature of the current persistent storage implementation, we never truly delete pages; we only mark them for reuse. This can lead to extreme situations where there is basically no data in a partition, yet it occupies many gigabytes of storage space. There is currently no solution for this other than deleting the whole partition and waiting for the rebalance procedure to finish, which is not acceptable for production scenarios.
So there is a need for partition defragmentation, by which we mean at least the removal of all empty pages.
Formulated by Ivan Bessonov, Anton Kalashnikov and Sergey Chugunov.
The first naive approach is to make this operation offline (without active data loading); this simplifies a lot of things.
So, the requirement is to have an option that triggers the defragmentation process after a node restart. I see the following options:
defragment everything;
defragment specific cache groups;
defragment specific partitions of specific cache groups.
API approaches will be discussed later.
This IEP includes 3 big parts:
There are a number of approaches that could be implemented. I will describe the one that we consider a potential candidate for implementation. This solution is not a real-time approach, which means that any operation except the defragmentation itself must be forbidden while it runs.
The proper moment to start the procedure is right after logical recovery. The latest checkpoint must be completed, so that all the data is on disk and internally consistent.
First, we'll look at regular partitions that are not related to SQL. Two main structures containing data lie there: the cache data tree and the pending entries tree. We'll start with the data tree, for reasons described later.
The idea is to iterate through the tree and insert everything into a new fake partition (named like part-xxx.bin.tmp, for example). The free list and reuse list can be kept in memory during this operation and persisted only in the final step (reuse lists should be empty anyway). There are several ways to optimize this process:
the process is assumed to be single-threaded, which means we don't need locks for insert operations into the new partition;
every insert into the tree will be strictly at the tail, and this knowledge can be exploited:
we can use a dedicated operation to insert at the tail without any comparisons;
or we can construct the tree layer by layer, because we know the number and the order of entries. This might be the most efficient way, but it will require a lot of new code and careful testing;
entries should not be deserialized to byte arrays where possible, and memory consumption should be restricted; for example, direct memory copying can be used to achieve this.
During the restructuring of the cache data tree we should store an oldLink→newLink mapping. The first reason is that the pending tree uses links to compare entries, so the order of elements in the pending tree changes after defragmentation. This mapping should be persisted to storage right next to the partition it represents, for reasons described later in the document.
As already mentioned, the pending entries tree cannot be copied using the same algorithm, because the order of its elements is not preserved. So regular B+Tree inserts have to be used for entries with updated links.
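The copy step and the link mapping can be sketched with a simplified in-memory model. List offsets stand in for page links here; none of these names are Ignite's actual API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of the copy step. Real Ignite partitions are page-based
// files; "links" here are just list offsets. All names are illustrative.
class PartitionCopySketch {
    /**
     * Copies entries in cache-data-tree order into a fresh "partition"
     * (append-only list) and records the oldLink -> newLink mapping that
     * the pending entries tree needs afterwards.
     */
    static Map<Long, Long> defragment(TreeMap<Long, byte[]> oldPartition,
                                      List<byte[]> newPartition) {
        Map<Long, Long> linkMapping = new HashMap<>();

        // Single-threaded, in-order iteration: every insert lands strictly
        // at the tail, so no locks or key comparisons are needed.
        for (Map.Entry<Long, byte[]> e : oldPartition.entrySet()) {
            long newLink = newPartition.size();
            newPartition.add(e.getValue()); // direct copy, no deserialization
            linkMapping.put(e.getKey(), newLink);
        }

        return linkMapping;
    }
}
```

The mapping produced here is exactly what the pending-tree reinsert and the later index pass consume.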
WAL should be disabled, of course. We should probably write some markers into it once the operation is completed, which might be helpful for debugging later.
Tracking pages must either be marked as corrupted or have dirty flags set for all pages.
There are two options here: some disk-based hash map or a B+Tree. I think the latter is preferable for the following reasons:
Once we start deleting partitions, there's no way back, and the operation has to be completed after restart if the node failed for some reason. That's obvious, and a fault-tolerant batch-rename operation doesn't seem hard at all.
In other scenarios, it would be convenient to use an already defragmented partition after a failure/restart if the operation was only partially completed.
To mark completed partitions, I propose using special file extensions or lock files. This way, all the information survives restarts without the need to write anything to the WAL or anywhere else.
If there are no defragmentation parameters at start, then all potential leftovers of previous runs should be cleaned up from the storage.
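A minimal sketch of the rename-based completion protocol, assuming the part-N.bin.tmp naming from above and POSIX rename semantics; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Restart-safe completion: the defragmented copy lives in part-N.bin.tmp
// until an atomic rename publishes it. Names are illustrative.
class DefragCompletion {
    /** Atomically replaces the old partition file with the defragmented copy. */
    static void publish(Path tmpPartition, Path partition) throws IOException {
        // On POSIX, ATOMIC_MOVE maps to rename(2): after a crash the
        // partition is either fully the old file or fully the new one.
        Files.move(tmpPartition, partition,
            StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
    }

    /** On start without defragmentation parameters: drop leftovers of prior runs. */
    static void cleanupLeftovers(Path storageDir) throws IOException {
        try (DirectoryStream<Path> tmp =
                 Files.newDirectoryStream(storageDir, "part-*.bin.tmp")) {
            for (Path p : tmp)
                Files.delete(p);
        }
    }
}
```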
I see two obvious options here:
reuse regular checkpointing. The interval might be decreased for this operation because it is expected to be relatively fast and data-intensive. Start/end markers are not required, since we don't actually affect current partitions and the recovery mechanism is much less complex;
lightweight checkpointing, by which I mean writing pages as soon as they become dirty. This sounds simple enough, but it may be much slower if there are many small cache entries and the write operation is not buffered. The good thing is that writes will mostly be append-only. Obviously, a force operation is only required once per partition.
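The lightweight option can be sketched as follows; the class and method names are hypothetical, and a real implementation would write Ignite's binary page format rather than raw buffers.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of "lightweight checkpointing": a page is written to the new
// partition file the moment it becomes dirty; since pages are produced in
// order, writes are mostly append-only, and fsync happens once per partition.
class LightweightCheckpoint implements AutoCloseable {
    private final FileChannel ch;
    private final int pageSize;

    LightweightCheckpoint(Path partitionFile, int pageSize) throws IOException {
        this.ch = FileChannel.open(partitionFile,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        this.pageSize = pageSize;
    }

    /** Called when a page becomes dirty: write it immediately at its offset. */
    void onPageDirty(long pageIdx, ByteBuffer page) throws IOException {
        ch.write(page, pageIdx * pageSize);
    }

    /** Force once, when the whole partition is done. */
    @Override public void close() throws IOException {
        ch.force(true);
        ch.close();
    }
}
```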
index.bin can contain an unlimited number of trees, possibly with inlined data in them. What's even worse, there is a single index file for all partitions. This means that defragmentation of a single partition affects the whole index: physical links in it are no longer valid. So the index must always be the last thing to defragment.
We already have the oldLink→newLink mapping for all partitions, so we can copy the index file page by page, replacing links in each page according to the mapping. This can easily be done in parallel, which is important because indexes are huge. (Alternatively, we can reinsert everything into new trees, as we did with partitions, if we need to defragment the index file itself; the same code can be reused.)
In the first case a checkpoint is not needed, because we write the new file in append-only mode. We may not even need a data region, for the same reason: we just read the whole file page by page.
The really simple option is to just delete index.bin, but that is way too drastic.
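The per-page link rewrite can be modeled like this, with an index page simplified to an array of links; the real layout is Ignite's binary page format, and the names are illustrative. Since each page is rewritten independently, pages can be processed in parallel.

```java
import java.util.Map;

// Simplified model of the index-defragmentation pass: copy the index page by
// page, rewriting every stored link through the oldLink -> newLink mapping.
class IndexRelink {
    /** Rewrites one index page in place using the link mapping. */
    static void relinkPage(long[] page, Map<Long, Long> linkMapping) {
        for (int i = 0; i < page.length; i++) {
            Long newLink = linkMapping.get(page[i]);

            if (newLink != null)
                page[i] = newLink;
            // Links into partitions that were not defragmented stay as-is.
        }
    }
}
```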
We should not forget that this feature exists. All new partitions must be encrypted with the same key and method. I believe this will require some additional refactoring, but nothing too complicated.
Where do we store parameters?
I think the most appropriate place is the MetaStorage, because it survives node restarts. The parameters should be deleted once the operation is completed. But I don't think progress information should be stored there as well: it is much easier to have the WAL completely disabled than to decide which component may use it and which may not. We should not leave any junk in the WAL or in the original partitions.
How to initiate operation
JMX looks like an appropriate option. It doesn't depend on the node being in the topology, which is a good thing. I believe it should already be available before discovery is started (confirmation required). Also, JMX can in theory show the progress of the operation, but I'm not sure about the usability side of such a thing.
Canceling the operation
Again, it looks like there's no choice besides JMX here.
About REST
REST is technically started before the node joins the cluster. It should allow a connection from control.sh, but it is essential to know the port.
Proposed by Anton Kalashnikov
As described before, we need some way to save pages to disk during defragmentation. One approach is to refactor the current checkpoint to add more flexibility. This can also be helpful in other Ignite operations, such as recovery, where it is likewise impossible to use the current implementation of the checkpoint.
The idea is to simplify the checkpoint's responsibility so that the checkpoint is responsible only for the consistent (i.e., changed under read lock) collection of data (pages) from a 'Source' and for storing it to a 'Target'.
At a high level of abstraction, the main parts look like this:
CheckpointWorker (same as Checkpointer) - executes checkpoints one by one. A checkpoint execution starts on a timeout or can be triggered from outside. Only one checkpoint can be executed at a time.
CheckpointLifecycle (almost the same as DbCheckpointListener) - a tool for observing the checkpoint process (before checkpoint, under write lock, after checkpoint, etc.)
CheckpointSource - a supplier of the data that should be consistently written to the 'Target'. It returns a CheckpointIterator which will provide the specific data later.
CheckpointIterator - an iterator over the data to write. As an optimization, it can fill a received buffer instead of returning a value.
CheckpointTarget - a writer to the target place. It has to be connected to the CheckpointSource; perhaps they should be one interface.
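A sketch of how these parts might fit together in Java. The type names come from the list above; every method signature is an assumption for illustration, not a final API.

```java
import java.nio.ByteBuffer;
import java.util.Iterator;

// Proposed decomposition of the checkpoint, sketched as interfaces.
interface CheckpointSource {
    /** Supplies the data that must be consistently written to the target. */
    CheckpointIterator iterator();
}

interface CheckpointIterator extends Iterator<ByteBuffer> {
    /** Optimization hook: fill a caller-provided buffer instead of allocating. */
    boolean nextTo(ByteBuffer dst);
}

interface CheckpointTarget {
    /** Writes one unit of data (e.g. a page) to the target place. */
    void write(ByteBuffer data);

    /** Makes everything written so far durable. */
    void sync();
}

interface CheckpointLifecycle {
    void beforeCheckpoint();

    void underWriteLock();

    void afterCheckpoint();
}

// Minimal worker: drains one source into one target; only one checkpoint
// runs at a time, matching the CheckpointWorker description above.
final class CheckpointWorker {
    static void runCheckpoint(CheckpointSource src, CheckpointTarget tgt) {
        CheckpointIterator it = src.iterator();

        while (it.hasNext())
            tgt.write(it.next());

        tgt.sync();
    }
}
```

With this split, defragmentation becomes just another Source/Target pair instead of a special case inside the checkpointer.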
Proposed changes relative to current implementation:
Removing destroyQueue and replacing it with a subscription to the lifecycle
Removing all snapshot-related dependencies such as nextSnapshot, snapshotMgr, and (partially) DbCheckpointListener.Context
Adding the possibility to subscribe to the lifecycle in a configured order (important for removing the snapshotMgr usage)
Subscribing to the checkpoint lifecycle directly from CacheDataStore instead of GridCacheOffheapManager
Collecting CheckpointIterators without further preparation, instead of collecting the raw pages with subsequent sorting
Extracting the implementation of CheckpointIterator outside of the checkpoint - it should be specific to each source. In the current case, PageMemoryImpl can hold the implementation of an iterator which iterates over all pages in all segments in sorted order
Reimplementing page replacement so that it can't write to disk directly anymore but can, for example, ask the CheckpointIterator to increase the priority of a certain page
Moving all knowledge about the checkpoint buffer from the checkpoint to the page memory, because the checkpoint buffer is a specific feature of a certain page memory implementation
(if possible) Removing stateFutures and using the lifecycle instead
Restrictions (wishes) that we have:
No complete ideas yet on how this should look. But perhaps it makes sense to have a started node which is not joined to the cluster (the correct state is right after the recovery step). How to communicate with such a node (REST API? control.sh? system properties?) and how to transition to/from this state is not clear right now.