ID | IEP-32
Author |
Sponsor |
Created |
Status | DRAFT
Motivation
Currently, batch updates at the page memory level are not implemented in Ignite: internal structures such as BPlusTree and FreeList do not support them. The performance of batch operations (putAll, DataStreamer, preloader) can be improved by implementing batch updates at the page memory level.
Competitive analysis
Profiling
Profiling the current rebalancing process shows that most of the time is spent working with the FreeList and the B+ Tree (details).
Process overview
Currently, when a batch of cache entries is updated in off-heap memory, each entry is processed separately. Such an update in PageMemory consists of the following steps:
- Search for the key in the B+ Tree
- Store the key-value to a data page
- Insert/update the key in a B+ Tree page
To prevent a redundant secondary search in the B+ Tree, the invoke operation was introduced.
Invoke in B+ Tree
Let's describe the B+ Tree in more detail to understand the need for the invoke operation.
The keys of the tree (hashes) are stored on B+ Tree pages (index pages), while the cache key-value pairs themselves are stored on data pages. Each item on an index page includes a link to a data page item. In general, a B+ Tree supports find, put and remove operations. For put and remove, the point of insertion/update/removal must be found first. So, a cache entry update without the invoke operation can look like this:
- Search the B+ Tree for the link to the old key-value (find)
- If the new value has the same length as the old one, simply update the key-value in place on the data page
- If the length differs, the link to the key-value changes:
  - Store the new key-value on a data page
  - Put the B+ Tree key (with a "secondary" find) to update the link to the data page item
  - Remove the old key-value from the data page
The invoke operation has the following execution scheme:
[Diagram "invoke": execution scheme of the invoke operation]
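The benefit of invoke over a separate find and put can be sketched with a toy model, where a TreeMap stands in for the B+ Tree and a long "link" stands in for the reference to a data page item. All names and signatures here are illustrative assumptions, not Ignite's actual BPlusTree API:

```java
import java.util.TreeMap;
import java.util.function.LongUnaryOperator;

// Toy model: the "B+ tree" maps a key hash to a link (page id + item index).
// Illustrative only; Ignite's BPlusTree.invoke() takes an InvokeClosure instead.
class InvokeSketch {
    final TreeMap<Integer, Long> tree = new TreeMap<>();
    int searches; // counts tree descents, to show invoke saves the secondary find

    // Plain flow: find, then put -> two tree searches for the same key.
    void findThenPut(int hash, long newLink) {
        Long oldLink = tree.get(hash);  searches++; // find: read the old link
        tree.put(hash, newLink);        searches++; // put: descend to the key again
    }

    // invoke: a single search; the closure observes the old link and supplies the new one.
    void invoke(int hash, LongUnaryOperator closure) {
        searches++;
        Long oldLink = tree.get(hash);
        tree.put(hash, closure.applyAsLong(oldLink == null ? 0L : oldLink));
    }
}
```

The closure is what lets the caller decide, at the found position, whether to update the link in place or replace it, without paying for a second descent.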
Store key-value
Saving a key-value pair on a data page consists of the following operations:
- Take a page with enough free space from the FreeList (if there is no such page, allocate a new one)
- Lock the page
- Write the cache entry data
- Update the page counters
- Unlock the page
- Insert the page back into the FreeList based on the remaining free space
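The steps above can be sketched with a toy FreeList model: buckets of pages indexed by remaining free space, with locking and page counters reduced to a comment. All names are illustrative assumptions, and the sketch assumes a row fits on a single page:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy FreeList: bucket i holds pages whose free space falls into the i-th fraction
// of the page. Illustrative only; Ignite's FreeList is a PagesList over page memory.
class FreeListSketch {
    static final int PAGE_SIZE = 4096;
    static final int BUCKETS = 8;

    static class Page { int free = PAGE_SIZE; }

    @SuppressWarnings("unchecked")
    final Deque<Page>[] buckets = new Deque[BUCKETS];
    { for (int i = 0; i < BUCKETS; i++) buckets[i] = new ArrayDeque<>(); }

    // Store one row (rowSize <= PAGE_SIZE): take a page, write, re-link by remaining space.
    Page store(int rowSize) {
        Page p = takePage(rowSize);            // remove a suitable page from the FreeList
        p.free -= rowSize;                     // lock, write entry data, update counters
        buckets[bucketOf(p.free)].push(p);     // unlock, re-insert by remaining free space
        return p;
    }

    // "Most free page" heuristic: scan buckets from the most free downwards.
    Page takePage(int rowSize) {
        for (int i = BUCKETS - 1; i >= 0; i--) {
            if (!buckets[i].isEmpty()) {
                if (buckets[i].peek().free >= rowSize)
                    return buckets[i].pop();
                break; // even the most free page is too small
            }
        }
        return new Page(); // no suitable page: allocate a new one
    }

    static int bucketOf(int free) { return Math.min(BUCKETS - 1, free * BUCKETS / PAGE_SIZE); }
}
```

Per-row, this costs one FreeList removal and one re-insertion, which is exactly the overhead the batch proposal amortizes across many rows.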
Advantages
The batch approach will improve:
- Average insertion time, by reducing the number of FreeList operations per data row
- Average search/update time in the B+ Tree
Proposed changes
The implementation should consist of two major related improvements:
- Batch writing to data pages
- Batch updates in B+ Tree
Diagram overview
[Diagram "batch updates": overview of the proposed batch update flow]
Batch write to data pages
Divide the input into two lists:
- Objects whose size is equal to or greater than the size of a single data page.
- Other objects, and the remainders (heads) of large objects.
Sequentially write the objects and fragments that occupy a whole page. Each data page is taken from the reuse bucket; if the reuse bucket is empty, allocate a new page.
For the remaining (regular) objects, including the "heads" of large objects, find the most free page with enough space in the FreeList (allocating a new one if there is no such page) and fill it up to the end.
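A minimal sketch of this partitioning step, assuming sizes in bytes and illustrative names (the real implementation would operate on data rows, not sizes):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting a batch before writing: whole-page fragments go to pages from
// the reuse bucket, everything else (small rows and large-row heads) to FreeList pages.
class BatchWriteSketch {
    static final int PAGE_SIZE = 4096;

    // Returns {whole-page fragment count, regular object count}; fills the two lists.
    static int[] partition(List<Integer> rowSizes, List<Integer> wholePageFragments,
                           List<Integer> regular) {
        for (int size : rowSizes) {
            while (size >= PAGE_SIZE) {         // tail fragments each occupy a whole page
                wholePageFragments.add(PAGE_SIZE);
                size -= PAGE_SIZE;
            }
            if (size > 0)
                regular.add(size);              // head (or small row) is packed with others
        }
        return new int[] { wholePageFragments.size(), regular.size() };
    }
}
```

For example, rows of 100, 5000 and 8192 bytes yield three whole-page fragments and two regular pieces (the 100-byte row and the 904-byte head of the 5000-byte row).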
Batch update in B+ Tree
TBD: describe the implementation.
Proposed plan
Overall changes to support batch updates in PageMemory can be divided into the following phases.
Phase 1: Batch insert new keys
- Implement the insertDataRows operation in FreeList: insert several data rows at once.
- Implement the invokeAll operation in BPlusTree: support searching for and inserting a range of keys.
- Enable batch insertion in the preloader (guarded by a special system property).
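Hypothetical shapes for the two Phase 1 operations are sketched below. The operation names come from this IEP, but the signatures and the closure type are assumptions for illustration, not the actual Ignite interfaces:

```java
import java.util.Collection;
import java.util.List;

// Assumed API shapes only; real BPlusTree/FreeList methods differ.
interface BatchApiSketch<Row, Key> {
    // FreeList side: insert several data rows at once, packing partially filled pages.
    void insertDataRows(Collection<Row> rows);

    // BPlusTree side: one pass over a sorted range of keys, invoking a closure per key
    // instead of performing an independent search for each one.
    void invokeAll(List<Key> sortedKeys, InvokeClosure<Key> closure);

    interface InvokeClosure<Key> { void call(Key key, long oldLink); }
}
```

Sorting the keys up front is what lets invokeAll share tree descents between neighboring keys.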
Phase 2: Batch update existing keys
- The invokeAll operation should support batch updates of existing keys.
Phase 3: DataStreamer support
- Add support for batch updates in DataStreamer (isolated updater).
Phase 4: putAll support
- Add support for batch updates in IgniteCache putAll.
Phase 5: MVCC support
- Add support for MVCC (TRANSACTIONAL_SNAPSHOT) cache mode.
Risks and Assumptions
- A concurrent update of a cache entry is blocked until the batch operation with the same key completes; therefore the batch size must be strictly limited.
- Duplicate key processing (WAL rebalancing) is based on the assumption that duplicate keys are ordered by cache entry version (a newer version comes last).
- Heap usage/GC pressure.
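The duplicate-key assumption above can be illustrated with a small sketch (illustrative names, not Ignite's actual classes): if duplicates within a batch are ordered by entry version, applying the batch in order leaves the newest version visible.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of version-ordered duplicate handling during a batch apply.
class DuplicateKeySketch {
    static final class Entry {
        final int key; final int ver; final String val;
        Entry(int key, int ver, String val) { this.key = key; this.ver = ver; this.val = val; }
    }

    static Map<Integer, String> apply(List<Entry> batch) {
        // Order so that a newer version of the same key comes later (the stated assumption).
        List<Entry> sorted = new ArrayList<>(batch);
        sorted.sort(Comparator.comparingInt((Entry e) -> e.key).thenComparingInt(e -> e.ver));

        Map<Integer, String> store = new LinkedHashMap<>();
        for (Entry e : sorted)
            store.put(e.key, e.val); // the later (newer) duplicate overwrites the earlier one
        return store;
    }
}
```

If the ordering assumption is violated, an older version could overwrite a newer one, which is why it is listed as a risk.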
Prototype testing results
For testing purposes, a prototype was created with a simplified Phase 1 implementation: it includes the FreeList optimization (batch writing to data pages) but not the B+ Tree optimization (searching for and inserting a range of keys). The rebalancing process was chosen as the simplest and most suitable case for testing batch inserts in PageMemory.
Synthetic testing results.
The microbenchmark prepares a supply message and measures the time the demander spends handling it.
Parameters: 1 node, 1 cache, 1 partition, 500 objects, message size is not limited.
Entry size (bytes)   | 44-56 | 56-104 | 140-240 | 240-540 | 500-800 | 800-1200 | 2000-3000 | 1000-8000 | 4000-16000 | 100-32000
Time improvement (%) | 45.1  | 45.2   | 42.5    | 41.6    | 43.0    | 40.8     | 68.4      | 70.6      | 69.9       | 64.2
Testing on dedicated servers
Checked the total rebalancing time on the following configuration:
Cluster: 2 nodes
Cache: transactional, partitioned (1024 partitions), 1 backup
Data size: 40 GB
Page size: 4096 bytes
Rebalance message size: 512 KB
Count of prefetch messages: 10
The improvement in rebalancing time with batch insertion is most noticeable when writing small objects and decreases for larger objects.
Entry size (bytes)                 | 140-240 | 240-540 | 500-800 | 700-800 | 800-1200
Rebalancing time improvement (%)   | 22      | 19      | 9.5     | 8       | 2
Discussion Links
// Links to discussions on the devlist, if applicable.
Reference Links
// Links to various reference documents, if applicable.
Tickets
Note: the improvement in total rebalancing time is reduced in cases where the demander idles waiting for the next messages from the supplier.