...
ID | IEP-32 | ||||||
Author | |||||||
Sponsor |
| ||||||
Created | 06 Mar 2019 | ||||||
Status |
|
Table of Contents |
---|
Currently batch updates on page memory level are not implemented in Ignite. Internal structures, such as BPlusTree and FreeList, do not support them. The performance of batch operations (putAll, datastreamer, preloader) can be improved by implementing batch updates on page memory level.
Profiling the current rebalancing process shows that most of the time is spent on working with FreeList and BPlusTree B+ Tree (details).
Currently, when updating the batch of cache entries in off-heap, each of them is processed separately. Such update in PageMemory can be divided into consists of the following steps:
To prevent a redundant secondary search in B+ Tree the invoke operation was introduced.
Let's describe the B+ Tree in more detail to understand the need for invoke operationIn Ignite search and update of key in tree implemented through the "invoke" operation, to understand the needs for this operation, it is necessary to describe B+ tree in more detail.
The keys of the tree (hashes) are stored on the B+ tree Tree pages (index pages), the cache key-value (data row) itself is stored on data pages. Each item on the index page includes a link to the data page item (pageId + itemId). In general, a B-tree + Tree supports find, put and remove operations. For put/ andremove, you must firstly first find the point of insertion/update/removal. So, cache entry update without invoke operation should can look like this:
The invoke operation uses an in-place update and To prevent a redundant secondary search in the case described above the "invoke" operation was introduced. This operation has the following execution scheme of execution:
Code Block |
---|
OffHeapManager.invoke
|-> BPlusTree.invoke
| |
| |
| |
| |-> Invoke.invokeClosure()
| | |-> UpdateClosure.call(foundRow)
| | | |- compare entry version
| | | |- newRow = create newRow on data page
| | | |- treeOp = PUT/REMOVE/NOOP
| | |
| | \-> read treeOp -> create operation closure (Put(newRow)/Remove(row))
| |
| \-> Invoke.found()/notFound() -> exec opearation closure
|
|
\-> finishUpdate/finishRemove |
:
draw.io Diagram | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Saving a key-value Saving a cache entry on data page consists of the following operations:
The batch will improve:
draw.io Diagram
Divide the input data rows into 2 lists:
Sequentially write objects and fragments that occupy the whole page. The data page is taken from "reuse" bucket, if there is no page in reuse bucket - allocate a new one.
For remaining (regular) objects (including the remainders ("heads") of large objects) find page with enough space in FreeList (allocate new one if there is no such page) and fill it up to the end.
...
TBD: describe the implementation.
Overall changes to support batch updates in PageMemory can be divided into following phases.
Examine the performance difference between the following approaches and select the best:
A. single updates (current approach)
B. sort + BPlusTree.invokeAll() + FreeList.insertDataRow
C. sort + BPlusTree.findAll + FreeList.insertDataRows + BPlusTree.putAll
For testing purposes, a prototype was created with simplified Phase 1 implementation, which includes FreeList optimization (batch writing to data pages), but does not include optimization for B+ Tree (
Improves the time to remove/insert pages in FreeList, pages locking and update item counters.
Improves the time of searching and inserting a range of keys.
...
). The rebalancing process was chosen as the easiest and most suitable for testing batch inserts in PageMemory.
Microbenchmark prepares a supply message and measures the time spent by the demander to handle it.
Parameters: 1 node, 1 cache, 1 partition, 100 objects, message size is not limited, 4k page.
Entry size (bytes) | 44-104 | 140-340 | 340-740 | 740-1240 | 1240-3040 | 2000-3000 | 1040-8040 | 4040-16040 (fragmented) | 100-32000 (fragmented mostly) | ||
---|---|---|---|---|---|---|---|---|---|---|---|
Time improvement (%) | 43.4 | 37.1 | 33.9 | 28.2 | 0 | 5.4 | 10.1 | 8.6 | 1.1 |
Checked the total rebalancing time on the following configuration:
Cluster: 2 nodes
Cache: transactional, partitioned 1024 partitions, 1 backup
Data size: 40 GB
Page size: 4096 bytes
Rebalance message size: 512 KB
Count of prefetch messages: 10
The improvement in rebalancing time with batch insertion is mostly noticeable when writing small objects and decreases on larger objects.
Entry size (bytes) | 140-240 | 240-540 | 500-800 | 700-800 | 800-1200 |
---|---|---|---|---|---|
Rebalancing time improvement (%) | 22 | 19 | 9.5 | 8 | 2 |
// Links to discussions on the devlist, if applicable.
// Links to various reference documents, if applicable.
Jira | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
|