ID | IEP-47 |
Author | |
Sponsor | |
Created | 2020-05-19 |
Status | IMPLEMENTATION |
Given the nature of the current persistent storage implementation, we never truly delete pages; we only mark them for reuse. This can lead to extreme situations where there is basically no data in a partition, yet it occupies many gigabytes of storage space. There is currently no solution for this other than deleting the whole partition and waiting for the rebalance procedure to finish, which is not acceptable for production scenarios.
So there is a need for partition defragmentation, by which we mean at least the removal of all empty pages.
Formulated by Ivan Bessonov, Anton Kalashnikov and Sergey Chugunov.
The first naive approach is to make this operation offline (without active data loading); this simplifies a lot of things.
So, the requirement is to have an option that triggers the defragmentation process after a node restart. I see the following options:
defragment everything;
defragment specific cache groups;
defragment specific partitions of specific cache groups.
API approaches will be discussed later.
This IEP includes 3 big parts:
There are a number of approaches that could be implemented. I will describe the one that we consider a potential candidate for implementation. This solution is not a real-time approach, which means that any operation except the defragmentation itself must be forbidden while it runs.
The proper moment to start the procedure is right after logical recovery. The latest checkpoint must be completed, so that all the data is on disk and internally consistent.
First, we'll look at regular partitions that are not related to SQL. Two main structures containing data lie there: the cache data tree and the pending entries tree. We'll start with the data tree, for reasons described later.
The idea is to iterate through the tree and insert everything into a new fake partition (named like part-xxx.bin.tmp, for example). The free list and reuse list can be kept in memory during this operation and persisted only in the final step (reuse lists should be empty anyway). There are several ways to optimize this process:
the process is assumed to be single-threaded, which means we don't need locks for insert operations into the new partition;
every insert into the tree will be strictly at the tail, and this knowledge can be exploited:
we can use a dedicated operation to insert at the tail without any comparisons;
or we can construct the tree layer by layer, because we know the number and the order of entries. This might be the most efficient way, but it will require a lot of new code and careful testing;
entries should not be deserialized to byte arrays where possible, and memory consumption should be restricted; for example, direct memory copying can be used to achieve this.
During the restructuring of the cache data tree we should store an oldLink→newLink mapping. The first reason is that the pending tree uses links to compare entries, so the order of elements in the pending tree changes after defragmentation. This mapping should be persisted to storage right next to the partition it represents, for reasons described later in the document.
As already mentioned, the pending entries tree cannot be copied using the same algorithm, because the order of its elements is not preserved. So regular B+Tree inserts have to be used for entries with updated links.
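The copy step and the link mapping can be sketched with a simplified in-memory model. List offsets stand in for page links here; none of these names are Ignite's actual API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of the copy step. Real Ignite partitions are page-based
// files; "links" here are just list offsets. All names are illustrative.
class PartitionCopySketch {
    /**
     * Copies entries in cache-data-tree order into a fresh "partition"
     * (append-only list) and records the oldLink -> newLink mapping that
     * the pending entries tree needs afterwards.
     */
    static Map<Long, Long> defragment(TreeMap<Long, byte[]> oldPartition,
                                      List<byte[]> newPartition) {
        Map<Long, Long> linkMapping = new HashMap<>();

        // Single-threaded, in-order iteration: every insert lands strictly
        // at the tail, so no locks or key comparisons are needed.
        for (Map.Entry<Long, byte[]> e : oldPartition.entrySet()) {
            long newLink = newPartition.size();
            newPartition.add(e.getValue()); // direct copy, no deserialization
            linkMapping.put(e.getKey(), newLink);
        }

        return linkMapping;
    }
}
```

The mapping produced here is exactly what the pending-tree reinsert and the later index pass consume.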
WAL should be disabled, of course. We should probably write some markers into it once the operation is completed, which might be helpful for debugging later.
Tracking pages must either be marked as corrupted or have dirty flags set for all pages.
There are two options here: some disk-based hash map or a B+Tree. I think the latter is preferable for the following reasons:
Once we start deleting partitions, there's no way back, and the operation has to be completed after restart if the node failed for some reason. That's obvious, and a fault-tolerant batch-rename operation doesn't seem hard at all.
In other scenarios, it would be convenient to use an already defragmented partition after a failure/restart if the operation was only partially completed.
To mark completed partitions, I propose using special file extensions or lock files. This way, all the information survives restarts without the need to write anything to the WAL or anywhere else.
If there are no defragmentation parameters at start, then all potential leftovers of previous runs should be cleaned up from the storage.
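A minimal sketch of the rename-based completion protocol, assuming the part-N.bin.tmp naming from above and POSIX rename semantics; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Restart-safe completion: the defragmented copy lives in part-N.bin.tmp
// until an atomic rename publishes it. Names are illustrative.
class DefragCompletion {
    /** Atomically replaces the old partition file with the defragmented copy. */
    static void publish(Path tmpPartition, Path partition) throws IOException {
        // On POSIX, ATOMIC_MOVE maps to rename(2): after a crash the
        // partition is either fully the old file or fully the new one.
        Files.move(tmpPartition, partition,
            StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
    }

    /** On start without defragmentation parameters: drop leftovers of prior runs. */
    static void cleanupLeftovers(Path storageDir) throws IOException {
        try (DirectoryStream<Path> tmp =
                 Files.newDirectoryStream(storageDir, "part-*.bin.tmp")) {
            for (Path p : tmp)
                Files.delete(p);
        }
    }
}
```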
I see two obvious options here:
reuse regular checkpointing. The interval might be decreased for this operation because it is expected to be relatively fast and data-intensive. Start/end markers are not required, since we don't actually affect current partitions and the recovery mechanism is much less complex;
lightweight checkpointing, by which I mean writing pages as soon as they become dirty. This sounds simple enough, but it may be much slower if there are many small cache entries and the write operation is not buffered. The good thing is that writes will mostly be append-only. Obviously, a force operation is only required once per partition.
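The lightweight option can be sketched as follows; the class and method names are hypothetical, and a real implementation would write Ignite's binary page format rather than raw buffers.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of "lightweight checkpointing": a page is written to the new
// partition file the moment it becomes dirty; since pages are produced in
// order, writes are mostly append-only, and fsync happens once per partition.
class LightweightCheckpoint implements AutoCloseable {
    private final FileChannel ch;
    private final int pageSize;

    LightweightCheckpoint(Path partitionFile, int pageSize) throws IOException {
        this.ch = FileChannel.open(partitionFile,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        this.pageSize = pageSize;
    }

    /** Called when a page becomes dirty: write it immediately at its offset. */
    void onPageDirty(long pageIdx, ByteBuffer page) throws IOException {
        ch.write(page, pageIdx * pageSize);
    }

    /** Force once, when the whole partition is done. */
    @Override public void close() throws IOException {
        ch.force(true);
        ch.close();
    }
}
```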
index.bin can contain an unlimited number of trees, possibly with inlined data in them. What's even worse, there is a single index file for all partitions. This means that defragmentation of a single partition affects the whole index: physical links in it are no longer valid. So the index must always be the last thing to defragment.
We already have the oldLink→newLink mapping for all partitions, so we can copy the index file page by page, replacing links in each page according to the mapping. This can easily be done in parallel, which is important because indexes are huge. (Alternatively, we can reinsert everything into new trees, as we did with partitions, if we need to defragment the index file itself; the same code can be reused.)
In the first case a checkpoint is not needed, because we write the new file in append-only mode. We may not even need a data region, for the same reason: we just read the whole file page by page.
The really simple option is to just delete index.bin, but that is way too drastic.
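The per-page link rewrite can be modeled like this, with an index page simplified to an array of links; the real layout is Ignite's binary page format, and the names are illustrative. Since each page is rewritten independently, pages can be processed in parallel.

```java
import java.util.Map;

// Simplified model of the index-defragmentation pass: copy the index page by
// page, rewriting every stored link through the oldLink -> newLink mapping.
class IndexRelink {
    /** Rewrites one index page in place using the link mapping. */
    static void relinkPage(long[] page, Map<Long, Long> linkMapping) {
        for (int i = 0; i < page.length; i++) {
            Long newLink = linkMapping.get(page[i]);

            if (newLink != null)
                page[i] = newLink;
            // Links into partitions that were not defragmented stay as-is.
        }
    }
}
```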
We should not forget that this feature exists. All new partitions must be encrypted with the same key and method. I believe this will require some additional refactoring, but nothing too complicated.
Where do we store parameters?
I think the most appropriate place is the MetaStorage, because it survives node restarts. The parameters should be deleted once the operation is completed. But I don't think progress information should be stored there as well: it is much easier to have the WAL completely disabled than to decide which component may use it and which may not. We should not leave any junk in the WAL or in the original partitions.
How to initiate operation
JMX looks like an appropriate option. It doesn't depend on the node being in the topology, which is a good thing. I believe it should already be available before discovery is started (confirmation required). Also, JMX can in theory show the progress of the operation, but I'm not sure about the usability side of such a thing.
Canceling the operation
Again, it looks like there's no choice besides JMX here.
About REST
REST is technically started before the node joins the cluster. It should allow a connection from control.sh, but it is essential to know the port.
Proposed by Anton Kalashnikov
As described before, we need some way to save pages to disk during defragmentation. One approach is to refactor the current checkpoint to add more flexibility. This can also be helpful in other Ignite operations, such as recovery, where it is likewise impossible to use the current implementation of the checkpoint.
The idea is to simplify the checkpoint's responsibility so that the checkpoint is responsible only for the consistent (i.e., changed under read lock) collection of data (pages) from a 'Source' and for storing it to a 'Target'.
At a high level of abstraction, the main parts look like this:
CheckpointWorker (same as Checkpointer) - executes checkpoints one by one. A checkpoint execution starts on a timeout or can be triggered from outside. Only one checkpoint can be executed at a time.
CheckpointLifecycle (almost the same as DbCheckpointListener) - a tool for observing the checkpoint process (before checkpoint, under write lock, after checkpoint, etc.)
CheckpointSource - a supplier of the data that should be consistently written to the 'Target'. It returns a CheckpointIterator which will provide the specific data later.
CheckpointIterator - an iterator over the data to write. As an optimization, it can fill a received buffer instead of returning a value.
CheckpointTarget - a writer to the target place. It has to be connected to the CheckpointSource; perhaps they should be one interface.
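A sketch of how these parts might fit together in Java. The type names come from the list above; every method signature is an assumption for illustration, not a final API.

```java
import java.nio.ByteBuffer;
import java.util.Iterator;

// Proposed decomposition of the checkpoint, sketched as interfaces.
interface CheckpointSource {
    /** Supplies the data that must be consistently written to the target. */
    CheckpointIterator iterator();
}

interface CheckpointIterator extends Iterator<ByteBuffer> {
    /** Optimization hook: fill a caller-provided buffer instead of allocating. */
    boolean nextTo(ByteBuffer dst);
}

interface CheckpointTarget {
    /** Writes one unit of data (e.g. a page) to the target place. */
    void write(ByteBuffer data);

    /** Makes everything written so far durable. */
    void sync();
}

interface CheckpointLifecycle {
    void beforeCheckpoint();

    void underWriteLock();

    void afterCheckpoint();
}

// Minimal worker: drains one source into one target; only one checkpoint
// runs at a time, matching the CheckpointWorker description above.
final class CheckpointWorker {
    static void runCheckpoint(CheckpointSource src, CheckpointTarget tgt) {
        CheckpointIterator it = src.iterator();

        while (it.hasNext())
            tgt.write(it.next());

        tgt.sync();
    }
}
```

With this split, defragmentation becomes just another Source/Target pair instead of a special case inside the checkpointer.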
Proposed changes relative to current implementation:
Removing destroyQueue and replacing it with a subscription to the lifecycle
Removing all snapshot-related dependencies such as nextSnapshot, snapshotMgr, and (partially) DbCheckpointListener.Context
Adding the possibility to subscribe to the lifecycle in a configured order (important for removing the snapshotMgr usage)
Subscribing to the checkpoint lifecycle directly from CacheDataStore instead of GridCacheOffheapManager
Collecting CheckpointIterators without further preparation, instead of collecting the raw pages with subsequent sorting
Extracting the implementation of CheckpointIterator outside of the checkpoint - it should be specific to each source. In the current case, PageMemoryImpl can hold the implementation of an iterator which iterates over all pages in all segments in sorted order
Reimplementing page replacement so that it can't write to disk directly anymore but can, for example, ask the CheckpointIterator to increase the priority of a certain page
Moving all knowledge about the checkpoint buffer from the checkpoint to the page memory, because the checkpoint buffer is a specific feature of a certain page memory implementation
(if possible) Removing stateFutures and using the lifecycle instead
Restrictions (wishes) that we have:
No complete ideas yet on how this should look. But perhaps it makes sense to have a started node which is not joined to the cluster (the correct state is right after the recovery step). How to communicate with such a node (REST API? control.sh? system properties?) and how to transition to/from this state is not clear right now.