...

ID

IEP-32

Author

Pavel Pereslegin

Sponsor

Created

06 Mar 2019

Status

colour	Grey
title	DRAFT

Table of Contents

Motivation

Currently batch updates on page memory level are not implemented in Ignite. Internal structures, such as BPlusTree and FreeList, do not support them. The performance of batch operations (putAll, datastreamer, preloader) can be improved by implementing batch updates on page memory level.

Competitive analysis

Profiling

Profiling the current rebalancing process shows that most of the time is spent on working with FreeList and BPlusTree B+ Tree (details).

Process overview

Currently, when updating the batch of cache entries in off-heap, each of them is processed separately. Such update in PageMemory can be divided into consists of the following steps:

Search for key in B+ tree.Tree
Store entry key-value (data row) into to data page.
Insert/update key in B+ tree page.

To prevent a redundant secondary search in B+ Tree the invoke operation was introduced.

Invoke in B+ Tree

Let's describe the B+ Tree in more detail to understand the need for invoke operationSearch and update of key in tree implemented through the invoke operation, to understand the needs for this operation, it is necessary to describe B+ tree in more detail.

The keys of the tree (hashes) are stored on the B+ tree Tree pages (index pages), the cache key-value (data row) itself is stored on data pages. Each item on the index page includes a link to the data page item. In general, a B+ tree Tree supports find, put and remove operations. For put andremove, you must first find the point of insertion/update/removal. So, cache entry update without invoke operation should can look like this:

Search B+ tree Tree for link to old data row old key-value (find).
The old value does not differ in length - a simple value update of data row of key-value on the data page.
The old value differs in length - the link to it changes:
Store a new data row key-value into data page.
Put B+ tree Tree key (with "secondary" find) to update link to data page item.
Remove old data row old key-value from data page.

The invoke operation uses an in-place update and To prevent a redundant secondary search in the case described above the invoke operation was introduced. This operation has the following execution scheme of execution:

Code Block

OffHeapManager.invoke
                |-> BPlusTree.invoke
                |               |
                |               |
                |               |
                |               |-> Invoke.invokeClosure()
                |               |           |-> UpdateClosure.call(foundRow)
                |               |           |                   |- compare entry version
                |               |           |                   |- newRow = create newRow on data page
                |               |           |                   |- treeOp = PUT/REMOVE/NOOP
                |               |           |                  
                |               |           \-> read treeOp -> create operation closure (Put(newRow)/Remove(row))
                |               |
                |               \-> Invoke.found()/notFound() -> exec operation closure
                |
                |
                \-> finishUpdate/finishRemove

:

draw.io Diagram

border	true
viewerToolbar	true

fitWindow	false
diagramName	invoke
simpleViewer	false
width	600
diagramWidth	800
revision	4

Store key-value

Saving a key-value Saving a data row on data page consists of the following operations:

Remove a page from FreeList, which has enough storage space (if there is no such page - allocate a new one).
Lock page.
Write cache entry data.
Update page counters.
Unlock page.
Based on the remaining free space insert page into FreeList.

Proposed changes

...

Advantages

The batch will improve:

Average insertion time by reducing count of FreeList actions on each data row
Average search/update time in B+ Tree

Proposed changes

...

Implementation should consists of two major related improvements:
Batch writing to data pages.
Batch updates in BPlusTree.B+ Tree

Batch writing to data pages

...

Divide the input data rows into 2 lists:
Objects whose size is equal to or greater than the size of a single data page.
Other objects and remainders (heads) of large objects.
Sequentially write objects and fragments that occupy the whole page. The data page is taken from "reuse" bucket, if there is no page in reuse bucket - allocate a new one.
For remaining (regular) objects (including the remainders ("heads") of large objects) find Find the most free page with enough space for data row in FreeList (allocate new one if there is no such page - allocate new one) and fill it up to the end.
Improves average insertion time by reducing count of FreeList actions on each data row (insert/remove pages, pages locking and page counters updates).

Batch updates in BPlusTree.

Batch update in B+ Tree

TBD: describe the implementation.

Proposed plan

Overall changes to support batch updates in PageMemory can be divided into following phases.

Phase 1: Batch insertion in FreeList to improve rebalancing

Implement insertDataRows operation in FreeList - insert several data rows at once.
Preloader should insert a batch of data rows before initializing cache entries. In the case when the cache entry is initialized incorrectly, preloader should rollback changes and remove pre-created data row.

Phase 2: DataStreamer support

Add support for batch inserts in FreeList in the isolated updater (similar to the preloader).

Phase 3: putAll support

Implement batch operations in B+ tree (findAll/putAll/removeALl/invokeAll).
Examine the performance difference between the following approaches and select the best:
A. single updates (current approach)
B. sort + BPlusTree.invokeAll() + FreeList.insertDataRow
C. sort + BPlusTree.findAll + FreeList.insertDataRows + BPlusTree.putAll

Phase 4: MVCC support

Add support for MVCC (TRANSACTIONAL_SNAPSHOT) cache mode.

Risks and Assumptions

For BPlusTree batch operations, ordered keys are required, moreover, an attempt to simultaneously lock the same keys in a different order lead to a deadlock, so batch insertion into the page memory must be performed on an unlocked entries. Alternatively, keys passed in batches from different components (preloader, datastreamer, putAll) should be locked in the same order.
Heap usage/GC pressure.

Prototype testing results

For testing purposes, a prototype was created with simplified Phase 1 implementation, which includes FreeList optimization (batch writing to data pages), but does not include optimization for B+ Tree (Improves the time of searching and inserting a range of keys.

Risks and Assumptions

...

). The rebalancing process was chosen as the easiest and most suitable for testing batch inserts in PageMemory.

Synthetic testing results.

Microbenchmark prepares a supply message and measures the time spent by the demander to handle it.

Parameters: 1 node, 1 cache, 1 partition, 100 objects, message size is not limited, 4k page.

Entry size (bytes)

44-104

140-340

340-740

740-1240

1240-3040

2000-3000

1040-8040

4040-16040

(fragmented)

100-32000

(fragmented mostly)

Time improvement (%)

43.4

37.1

33.9

28.2

0

5.4

10.1

8.6

1.1

Testing on dedicated servers

Checked the total rebalancing time on the following configuration:

Cluster: 2 nodes
Cache: transactional, partitioned 1024 partitions, 1 backup
Data size: 40 GB
Page size: 4096 bytes
Rebalance message size: 512 KB
Count of prefetch messages: 10

The improvement in rebalancing time with batch insertion is mostly noticeable when writing small objects and decreases on larger objects.

Entry size (bytes)	140-240	240-540	500-800	700-800	800-1200
Rebalancing time improvement (%)	22	19	9.5	8	2

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

// Links to various reference documents, if applicable.

Tickets

Jira

server	ASF JIRA
columns	key,summary,type,updated,assignee,reporter,priority,status,resolution
maximumIssues	20
jqlQuery	project = Ignite AND labels IN (iep-32) order by key
serverId	5aa69414-a9e9-3523-82ec-879b028fb15b

Page tree

Versions Compared

Old Version 10

New Version Current

Key

Motivation

Competitive analysis

Profiling

Process overview

Invoke in B+ Tree

Store key-value

Proposed changes

Advantages

Proposed changes

Implementation should consists of two major related improvements:
Batch writing to data pages.
Batch updates in BPlusTree.B+ Tree

Batch writing to data pages

Batch updates in BPlusTree.

Batch update in B+ Tree

Proposed plan

Phase 1: Batch insertion in FreeList to improve rebalancing

Phase 2: DataStreamer support

Phase 3: putAll support

Phase 4: MVCC support

Risks and Assumptions

Prototype testing results

Risks and Assumptions

Synthetic testing results.

Testing on dedicated servers

Discussion Links

Reference Links

Tickets

Page tree

Page History

Versions Compared

Old Version 10

New Version Current

Key

Motivation

Competitive analysis

Profiling

Process overview

Invoke in B+ Tree

Store key-value

Proposed changes

Advantages

Proposed changes

Implementation should consists of two major related improvements:Batch writing to data pages.Batch updates in BPlusTree.B+ Tree

Batch writing to data pages

Batch updates in BPlusTree.

Batch update in B+ Tree

Proposed plan

Phase 1: Batch insertion in FreeList to improve rebalancing

Phase 2: DataStreamer support

Phase 3: putAll support

Phase 4: MVCC support

Risks and Assumptions

Prototype testing results

Risks and Assumptions

Synthetic testing results.

Testing on dedicated servers

Discussion Links

Reference Links

Tickets

Implementation should consists of two major related improvements:
Batch writing to data pages.
Batch updates in BPlusTree.B+ Tree