...

IDIEP-32
Author
Sponsor

 


Created 06 Mar 2019
Status DRAFT

...

Let's describe the B+ Tree in more detail to understand the need for the invoke operation.

The keys of the tree (hashes) are stored on the B+ Tree pages (index pages), while the cache key-value itself is stored on data pages. Each item on the index page includes a link to the data page item. In general, a B+ Tree supports find, put and remove operations. For put and remove, you must first find the point of insertion/update/removal. So, a cache entry update without the invoke operation can look like this:

  • Search B+ Tree for link to old key-value (find)
  • The old value does not differ in length - a simple value update of key-value on the data page
  • The old value differs in length - the link to it changes:
    • Store a new key-value into data page
    • Put B+ Tree key (with "secondary" find) to update link to data page item
    • Remove old key-value from data page
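The steps above can be illustrated with a toy model. This is only a sketch and not Ignite code: the class `ToyStore`, its fields, and the `tree_searches` counter are hypothetical, a dictionary stands in for the B+ Tree, and the branch for a missing key is an added assumption that the list above does not cover. It shows that an update which changes the value length pays for a second tree search.

```python
class ToyStore:
    """Toy stand-in for a B+ Tree index plus data pages (hypothetical names)."""

    def __init__(self):
        self.index = {}         # stands in for the B+ Tree: key -> link
        self.data = {}          # stands in for data pages: link -> (key, value)
        self.next_link = 0
        self.tree_searches = 0  # counts traversals of the "tree"

    def _find(self, key):
        self.tree_searches += 1
        return self.index.get(key)

    def _tree_put(self, key, link):
        # put has to locate the insertion point again ("secondary" find)
        self.tree_searches += 1
        self.index[key] = link

    def _store(self, key, value):
        link = self.next_link
        self.next_link += 1
        self.data[link] = (key, value)
        return link

    def update(self, key, value):
        old_link = self._find(key)
        if old_link is None:
            # assumption: a missing key is simply inserted
            self._tree_put(key, self._store(key, value))
            return
        _, old_value = self.data[old_link]
        if len(old_value) == len(value):
            # same length: simple update on the data page, link unchanged
            self.data[old_link] = (key, value)
            return
        # different length: store new row, re-point the tree, drop the old row
        new_link = self._store(key, value)
        self._tree_put(key, new_link)
        del self.data[old_link]


store = ToyStore()
store.update("k", b"aaaa")  # insert: find + put
store.update("k", b"bbbb")  # same length: find only
store.update("k", b"cc")    # length changed: find + put
```

In this model the third call performs two tree searches, which is the overhead an in-place invoke is meant to avoid.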

The invoke operation uses an in-place update and has the following execution scheme:

...

  1. Batch writing to data pages
  2. Batch updates in B+ Tree

Diagram overview

[Diagram: batch updates]

Batch writing to data pages

Divide the input data rows into 2 lists:

  1. Objects whose size is equal to or greater than the size of a single data page.
  2. Other objects and remainders (heads) of large objects.

Sequentially write objects and fragments that occupy a whole page. The data page is taken from the "reuse" bucket; if there is no page in the reuse bucket, a new one is allocated.

For the remaining (regular) objects, including the remainders ("heads") of large objects, find the most free page with enough space for the data row in FreeList (if there is no such page, allocate a new one) and fill it up to the end.
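A minimal sketch of this two-step scheme follows. It is not the FreeList implementation: the functions `split_rows` and `place_remainders` are hypothetical, rows are represented only by their sizes, page headers are ignored, and the free list is modelled as a plain list of free bytes per page.

```python
PAGE_SIZE = 4096  # assumed usable bytes per data page (headers ignored)


def split_rows(row_sizes, page_size=PAGE_SIZE):
    """Split rows into whole-page fragments and smaller remainders.

    Returns (full_pages, remainders):
      full_pages -- list of (row_index, number_of_whole_pages)
      remainders -- list of (row_index, size) with 0 < size < page_size
    """
    full_pages, remainders = [], []
    for idx, size in enumerate(row_sizes):
        n_full, rest = divmod(size, page_size)
        if n_full:
            full_pages.append((idx, n_full))
        if rest:
            remainders.append((idx, rest))
    return full_pages, remainders


def place_remainders(remainders, free_space, page_size=PAGE_SIZE):
    """Place each remainder on the most free page that can hold it.

    free_space is a list of free bytes per existing page and is updated
    in place. A new page is appended when no existing page fits.
    Returns a list of (row_index, page_number).
    """
    placement = []
    for idx, size in remainders:
        candidates = [p for p, free in enumerate(free_space) if free >= size]
        if candidates:
            page = max(candidates, key=lambda p: free_space[p])
        else:
            free_space.append(page_size)  # allocate a new page
            page = len(free_space) - 1
        free_space[page] -= size
        placement.append((idx, page))
    return placement


full, rest = split_rows([100, 4096, 5000, 9000])
pages = [500, 1000]  # free bytes on two existing pages
placed = place_remainders(rest, pages)
```

With these inputs the 5000-byte and 9000-byte rows each contribute whole-page fragments plus a remainder, and the remainders are packed onto shared pages together with the small row.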

Batch update in B+ Tree

TBD: describe the implementation.

...

Overall changes to support batch updates in PageMemory can be divided into the following phases.

Phase 1:

...

Batch insertion in FreeList to improve rebalancing

  • Implement insertDataRows operation in FreeList - insert several data rows at once.
  • Implement invokeAll operation in BPlusTree: support searching and inserting range of keys.
  • Enable batch insertion in preloader (enabled by a special system property).

Phase 2: Batch update existing keys

  • InvokeAll operation should support batch update of existing keys.

...

  • Preloader should insert a batch of data rows before initializing cache entries. If a cache entry is initialized incorrectly, the preloader should roll back the changes and remove the pre-created data row.

Phase 2: DataStreamer support

  • Add support for batch updates in DataStreamer: batch inserts in FreeList in the isolated updater, similar to the preloader.

Phase

...

3: putAll support

  • Implement batch updates in IgniteCache putAll.

...

  • operations in B+ tree (findAll/putAll/removeAll/invokeAll).
  • Examine the performance difference between the following approaches and select the best:
    A.  single updates (current approach)
    B.  sort + BPlusTree.invokeAll() + FreeList.insertDataRow
    C.  sort + BPlusTree.findAll + FreeList.insertDataRows + BPlusTree.putAll

Phase 4: MVCC support

  • Add support for MVCC (TRANSACTIONAL_SNAPSHOT) cache mode.

Risks and Assumptions

  1. Memory fragmentation. BPlusTree batch operations require ordered keys. Moreover, an attempt to simultaneously lock the same keys in a different order leads to a deadlock, so batch insertion into the page memory must be performed on unlocked entries. Alternatively, keys passed in batches from different components (preloader, DataStreamer, putAll) should be locked in the same order.
  2. Heap usage/GC pressure.
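The lock-ordering alternative from the first risk can be sketched as follows. This is only an illustration under stated assumptions: the per-key `locks` table and the function `update_batch` are hypothetical and do not correspond to Ignite's entry locking. Each batch sorts its keys before locking, so two batches that share keys always acquire them in the same order and cannot deadlock on each other.

```python
import threading

# hypothetical per-key entry locks
locks = {key: threading.Lock() for key in range(5)}


def update_batch(keys, applied):
    """Lock all keys of the batch in sorted order, then apply the batch."""
    ordered = sorted(set(keys))  # same global order for every caller
    for key in ordered:
        locks[key].acquire()
    try:
        applied.extend(ordered)  # stands in for the actual batch update
    finally:
        for key in reversed(ordered):
            locks[key].release()


applied = []
# Two components submit overlapping batches in different key orders.
t1 = threading.Thread(target=update_batch, args=([3, 1, 2], applied))
t2 = threading.Thread(target=update_batch, args=([2, 3, 4, 1], applied))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)
```

Without the `sorted` call, one thread could hold key 3 while waiting for key 2 and the other could hold key 2 while waiting for key 3.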

Prototype testing results

For testing purposes, a prototype was created with a simplified Phase 1 implementation, which includes the FreeList optimization (batch writing to data pages) but does not include the optimization for B+ Tree (searching and inserting a range of keys). The rebalancing process was chosen as the easiest and most suitable for testing batch inserts in PageMemory.

Synthetic testing results

...

Microbenchmark prepares a supply message and measures the time spent by the demander to handle it.

Parameters: 1 node, 1 cache, 1 partition, 100 objects, message size is not limited, 4k page.

Entry size (bytes)                 Time improvement (%)
44-104                             43.4
140-340                            37.1
340-740                            33.9
740-1240                           28.2
1240-3040                          0
2000-3000                          5.4
1040-8040                          10.1
4040-16040 (fragmented)            8.6
100-32000 (fragmented mostly)      1.1

Testing on dedicated servers

Checked the total rebalancing time on the following configuration:

Cluster: 2 nodes
Cache: transactional, partitioned 1024 partitions, 1 backup
Data size: 40 GB
Page size: 4096 bytes
Rebalance message size: 512 KB
Count of prefetch messages: 10

The improvement in rebalancing time with batch insertion is most noticeable when writing small objects and decreases for larger objects.

Entry size (bytes)
100
140-
200
240
200
240-
500
540500-800700-800800-1200
Improvement
Rebalancing time improvement (%)22199.582

Discussion Links

// Links to discussions on the devlist, if applicable.

...