...
Formulated by Ivan Bessonov, Anton Kalashnikov and Sergey Chugunov.
The first naive approach is to make this operation offline (without active data loading); this will simplify a lot of things.
So, the requirement is to have an option that will trigger the defragmentation process after a node restart. I see the following options:
defragment everything;
defragment specific cache groups;
defragment specific partitions of specific cache groups.
API approaches will be discussed later.
This IEP includes 3 big parts:
There are a number of approaches that could be implemented. I will describe the one that we consider a potential candidate for implementation. This is not a real-time approach, which means that any operation except defragmentation itself must be forbidden while it runs.
The proper moment to start the procedure is right after logical recovery. The latest checkpoint must be completed, so all the data is on disk and internally consistent.
First we'll look at regular partitions that are not related to SQL. There are two main structures containing data there: the cache data tree and the pending entries tree. We'll start with the data tree, for reasons described later.
...
it is assumed that the process will be single-threaded, which means that we don't need locks for insert operations into the new partition;
every insert into the tree will be strictly at the tail; this knowledge can be exploited:
we can use a specific operation that inserts at the tail without any comparisons;
or we can construct the tree by layers, because we know the amount and the order of the entries. This might be the most efficient way, but it will require a lot of new code with careful testing;
entries should not be deserialized to byte arrays where possible. Memory consumption should be restricted; for example, direct memory copying can be utilized to achieve this.
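The "construct the tree by layers" idea can be sketched as follows. This is a toy illustration, not Ignite's actual B+Tree code: a "page" here is just a list of at most FANOUT sorted keys, and each upper level is built bottom-up from the first key of every page below, so no per-key comparisons are ever needed.

```java
import java.util.ArrayList;
import java.util.List;

public class LayeredBuild {
    static final int FANOUT = 4; // keys per page; illustrative only, real pages fit far more

    /** Splits one sorted level into "pages" of at most FANOUT keys. */
    static <T> List<List<T>> paginate(List<T> level) {
        List<List<T>> pages = new ArrayList<>();
        for (int i = 0; i < level.size(); i += FANOUT)
            pages.add(new ArrayList<>(level.subList(i, Math.min(i + FANOUT, level.size()))));
        return pages;
    }

    /** Builds the tree bottom-up from already-sorted keys and returns its height. */
    public static int buildLevels(List<Integer> sortedKeys) {
        List<Integer> cur = sortedKeys;
        int levels = 0;
        while (true) {
            List<List<Integer>> pages = paginate(cur);
            levels++;
            if (pages.size() == 1)
                return levels; // single page left: this is the root
            List<Integer> next = new ArrayList<>();
            for (List<Integer> p : pages)
                next.add(p.get(0)); // first key of each page is promoted to the parent level
            cur = next;
        }
    }
}
```

With 100 keys and FANOUT = 4 this produces 25 leaf pages, then 7, 2 and finally 1 page: a 4-level tree built without a single key comparison.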
During the restructuring of the cache data tree we should store the mapping oldLink→newLink. The first reason is that the pending tree uses links to compare entries, so the order of elements in the pending tree will change after defragmentation. This mapping should be persisted to storage right next to the partition it represents, for reasons described later in the document.
As already mentioned, the pending entries tree cannot be copied using the same algorithm, because the order of its elements can't be preserved. So regular B+Tree inserts have to be used for the entries with updated links.
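A toy sketch of this re-insertion, with hypothetical names (PendingRow and the (expireTime, link) ordering stand in for the real pending-tree row format), assuming every old link is present in the mapping:

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeSet;

public class PendingTreeRemap {
    /** Hypothetical pending-tree row: entries are ordered by (expireTime, link). */
    public record PendingRow(long expireTime, long link) {}

    /** Rebuilds the pending tree with regular ordered inserts: once links are
     *  remapped, the (expireTime, link) order may differ from the old tree's,
     *  so a tail-append copy is not possible here. */
    public static TreeSet<PendingRow> remap(Collection<PendingRow> oldTree, Map<Long, Long> linkMap) {
        TreeSet<PendingRow> newTree = new TreeSet<>(
            Comparator.comparingLong(PendingRow::expireTime).thenComparingLong(PendingRow::link));
        for (PendingRow r : oldTree)
            newTree.add(new PendingRow(r.expireTime(), linkMap.get(r.link()))); // regular insert
        return newTree;
    }
}
```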
WAL should be disabled, of course. We should probably write some markers into it once the operation is completed; this might be helpful for debugging later.
...
There are two options here - a disk-based hashmap or a B+Tree. I think the latter is preferable for the following reasons:
...
Once we start deleting partitions, there's no way back, and the operation has to be completed after restart if the node failed for some reason. That's obvious, and a fault-tolerant batch-rename operation doesn't seem hard at all.
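A minimal sketch of one fault-tolerant step of such a rename, assuming illustrative file names (part.bin / part.bin.defrag are not Ignite's actual layout). The invariant is that the step is idempotent: re-running it after a crash always converges to the defragmented file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PartitionSwap {
    /** Replaces part.bin with the fully written part.bin.defrag.
     *  Safe to re-run after a crash: if the .defrag file is gone,
     *  the swap already completed and there is nothing to do. */
    public static void completeSwap(Path dir) throws IOException {
        Path defragmented = dir.resolve("part.bin.defrag");
        Path target = dir.resolve("part.bin");
        if (Files.exists(defragmented)) {
            // On POSIX file systems rename is atomic: a reader sees either
            // the old or the new file, never a torn intermediate state.
            Files.move(defragmented, target,
                StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A batch rename over many partitions is then just this step in a loop; a crash in the middle leaves each partition either fully old or fully new, and the loop is re-run on restart.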
In other scenarios, it would be convenient to use an already defragmented partition after a failure/restart if the operation was partially completed.
...
reuse regular checkpointing. The interval might be decreased for this operation, because it's expected to be relatively fast and data-intensive. Start/end markers are not required, since we don't actually affect current partitions and the recovery mechanism is much less complex;
lightweight checkpointing, by which I mean writing pages as soon as they become dirty. Sounds simple enough, but it may be much slower if there are a lot of small cache entries and the write operation is not buffered. The good thing is that writes will be mostly append-only. Obviously, the force operation is only required once per partition;
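The lightweight variant could look roughly like the sketch below (hypothetical class, not Ignite code): dirty pages are pushed through a write buffer as they appear, so the append-only stream is batched, and fsync happens exactly once per partition.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** "Lightweight checkpoint": write each page once it becomes dirty,
 *  buffered to avoid many tiny writes; force() once per partition. */
public class LightweightCheckpoint implements AutoCloseable {
    private final FileChannel ch;
    private final ByteBuffer buf;

    public LightweightCheckpoint(Path file, int bufPages, int pageSize) throws IOException {
        ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        buf = ByteBuffer.allocate(bufPages * pageSize);
    }

    /** Buffers a dirty page; flushes the buffer to disk when it fills up. */
    public void onPageDirty(byte[] page) throws IOException {
        if (buf.remaining() < page.length)
            flush();
        buf.put(page);
    }

    private void flush() throws IOException {
        buf.flip();
        ch.write(buf);
        buf.clear();
    }

    /** Called once per partition: flush the tail and fsync exactly once. */
    @Override public void close() throws IOException {
        flush();
        ch.force(true);
        ch.close();
    }
}
```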
index.bin can have an unlimited number of trees with possibly inlined data in them. What's even worse is that we have a single index file for all partitions. This means that defragmentation of a single partition affects the whole index - physical links in it are no longer valid. And the index must always be the last thing to defragment.
We already have the oldLink→newLink mapping for all partitions, so we should copy the index file page by page while replacing links in all of them according to the mapping. This can easily be done in parallel, which is important because indexes are huge. (Or we can reinsert everything into new trees, as we did with partitions, if we need to defragment the index file as well; the same code can be reused.)
In the first case a checkpoint is not needed, because we write the new file in append-only mode. We may not even need a data region for the same reason - we just read the whole file page by page.
A really simple option is to just delete index.bin, but that's way too desperate.
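The page-by-page link replacement could look roughly like the toy sketch below. This deliberately ignores real page layout (headers, items, inline data): a "page" here is modeled as a flat array of 8-byte links, just to show the copy-then-remap idea that can run independently, and thus in parallel, for each page.

```java
import java.nio.ByteBuffer;
import java.util.Map;

public class IndexPageRewrite {
    /** Rewrites every known link in a copied page according to oldLink -> newLink.
     *  Toy model: the page is a flat sequence of 8-byte links. */
    public static void remapLinks(ByteBuffer page, Map<Long, Long> linkMap) {
        for (int off = 0; off < page.capacity(); off += Long.BYTES) {
            long old = page.getLong(off);
            Long updated = linkMap.get(old);
            if (updated != null)
                page.putLong(off, updated); // in-place replacement on the copied page
        }
    }
}
```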
...
I think that the most appropriate place for it is the MetaStorage, because it survives node restarts. Parameters should be deleted once the operation is completed. But I don't think that progress information should be stored there as well - it's much easier to have WAL completely disabled rather than deciding which component can use it and which cannot. We should not leave any junk in the WAL or in the original partitions.
...
JMX looks like an appropriate option. It doesn't depend on the node being in the topology, which is a good thing. I believe that it should already be available before discovery is started (confirmation required). Also, JMX can in theory show the progress of the operation, but I'm not sure about the usability side of such a thing.
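A hedged sketch of what such a management bean might look like; the interface name, attribute and operation are all hypothetical, not Ignite's actual management API:

```java
import java.lang.management.ManagementFactory;
import javax.management.JMX;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class DefragJmx {
    /** Hypothetical MXBean; names are illustrative only. */
    public interface DefragmentationMXBean {
        /** Schedules defragmentation of the given cache groups on next restart. */
        void scheduleDefragmentation(String cacheGroups);

        /** @return Whether a defragmentation is currently scheduled. */
        boolean isScheduled();
    }

    public static class Defragmentation implements DefragmentationMXBean {
        private volatile String groups;
        @Override public void scheduleDefragmentation(String cacheGroups) { groups = cacheGroups; }
        @Override public boolean isScheduled() { return groups != null; }
    }

    /** Registers the bean on the platform MBean server and returns a client proxy,
     *  the same way jconsole or control.sh could reach it over JMX. */
    public static DefragmentationMXBean register() throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("org.example:type=Defragmentation");
        srv.registerMBean(new Defragmentation(), name);
        return JMX.newMXBeanProxy(srv, name, DefragmentationMXBean.class);
    }
}
```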
Canceling the operation
Again, it looks like there's no choice besides JMX here.
...
REST is technically started before the node joins the cluster. It should allow us to have a connection with control.sh, but it is essential to know the port.
...
...
Proposed by Anton Kalashnikov
As described before, we need to find some way to save pages to disk during defragmentation. One of the approaches is refactoring the current checkpoint to add more flexibility. It can also be helpful in other Ignite operations, such as recovery, where it is likewise impossible to use the current implementation of the checkpoint.
The idea is to simplify the checkpoint's responsibility: the checkpoint is responsible only for the consistent collection of data (pages changed under the read lock) from a 'Source' and for storing it to a 'Target'.
At a high level of abstraction, the main parts look like this:
...
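The Source/Target split described above can be sketched as a pair of interfaces (the names Source, Target, Page and the checkpoint method are illustrative, not actual Ignite classes). The point is that the checkpoint itself knows nothing about page memory, snapshots or defragmentation - it only pumps pages from one side to the other:

```java
import java.util.Iterator;

public class CheckpointParts {
    /** A page snapshot collected consistently under the checkpoint read lock. */
    public record Page(long pageId, byte[] data) {}

    /** 'Source': supplies an iterator over the collected dirty pages
     *  (e.g. PageMemoryImpl iterating over all segments in sorted order). */
    public interface Source {
        Iterator<Page> checkpointIterator();
    }

    /** 'Target': persists pages - a partition file, a defragmented copy, etc. */
    public interface Target {
        void write(Page page);
        void force();
    }

    /** The checkpoint itself is now only the pump between Source and Target. */
    public static int checkpoint(Source src, Target dst) {
        int written = 0;
        for (Iterator<Page> it = src.checkpointIterator(); it.hasNext(); ) {
            dst.write(it.next());
            written++;
        }
        dst.force(); // single fsync once the source is drained
        return written;
    }
}
```

Defragmentation then becomes just another Target (a new partition file), and regular checkpointing another, without the checkpoint code changing.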
Removing destroyQueue and replacing it with a subscription to the lifecycle
Removing all snapshot-related dependencies such as nextSnapshot, snapshotMgr, DbCheckpointListener.Context (partially)
Adding the possibility to subscribe to the lifecycle in a configured order (this is important for removing the snapshotMgr usage)
Subscribing to the checkpoint lifecycle directly from CacheDataStore instead of GridCacheOffheapManager
Collecting CheckpointIterators without further preparation, instead of collecting raw pages with subsequent sorting
Extracting the implementation of CheckpointIterator outside of the checkpoint - it should be specific to each source. In the current case, PageMemoryImpl can hold the implementation of an iterator which iterates over all pages in all segments in sorted order.
Reimplementing PageReplaced so that it can't write to disk directly anymore, but can ask, for example, the CheckpointIterator to increase the priority of a certain page
Moving all knowledge about the checkpoint buffer from the checkpoint to the page memory, because the checkpoint buffer is a specific feature of a certain page memory implementation
(If possible) Removing stateFutures and using the lifecycle instead
Restrictions (wishes) which we have:
...
There are no complete ideas yet on how this should look. But perhaps it makes sense to have a started node which is not joined to the cluster (the correct state is right after the recovery step). However, how to communicate with this node (REST API? control.sh? system properties?) and how to transition to/from this state are not clear right now.
...