Ignite Durable Memory

This article covers the internal design of Durable Memory. It is intended for Ignite developers.

Motivation

Let us cover the reasons why Ignite uses durable memory.

1. When Ignite was backed by a 3rd party DB to store data, only the required data was loaded into Ignite. But running any full scan query (which requires the whole data set) caused all data to be loaded into the cache.

2. Before 2.0 there was a persistence solution, the Local File Store. Still, running a query that selects all cache data took too long with the old model. The reasons for this are device characteristics:

RAM - random access, byte addressed
HDD - block addressed, slow random access

A memory-mapped file on an HDD will be slow anyway. Data must be stored not in a “random” way (as it is in the Java heap), but in an organized manner, grouped by block.

3. Main requirement: when a node starts, it should handle cache requests (start operating) without a long delay to load all data from disk.

Solution

Let's divide memory into pages (synonyms: buffers, blocks, chunks). Let's consider a page the fundamental unit of all memory. Memory addressing becomes page based.

A query may require SQL index data to execute, so this index has to be built. With the old model, we would have to read all data to process the first query.

Instead, index data is also page based and, as part of durable memory, is stored on disk.

Let's introduce an integer number, the index of a block, idx (defined within the current node):

idx * blockSize = file offset

Different caches and their partitions bring more complex addressing, but it is still possible to map a page ID to a position in a particular file:

pageId => file + offset in this file
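The mapping above can be sketched in a few lines. This is only an illustration: the page size, the bit layout of the page ID (partition in bits 32-47, page index in the lower 32 bits) and the file naming are assumptions made for the example, not Ignite's actual layout.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch: resolves a page ID to a partition file and a byte offset in it. */
final class PageAddressing {
    /** Assumed page (block) size in bytes. */
    static final int PAGE_SIZE = 4096;

    /** Assumed layout: partition number in bits 32-47, page index in the lower 32 bits. */
    static long pageId(int partition, int pageIdx) {
        return ((long) (partition & 0xFFFF) << 32) | (pageIdx & 0xFFFFFFFFL);
    }

    static int partition(long pageId) {
        return (int) ((pageId >>> 32) & 0xFFFF);
    }

    static long pageIndex(long pageId) {
        return pageId & 0xFFFFFFFFL;
    }

    /** idx * blockSize = file offset. */
    static long fileOffset(long pageId) {
        return pageIndex(pageId) * PAGE_SIZE;
    }

    /** Hypothetical file naming: one file per cache partition. */
    static Path file(String cacheName, long pageId) {
        return Paths.get(cacheName, "part-" + partition(pageId) + ".bin");
    }
}
```

For example, page 10 of partition 3 resolves to offset 10 * 4096 in the file of partition 3.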

Let’s use the page ID for linking two pages, and then for resolving the real position in the file.

We are now able to organize pages into more complex structures, for example, a tree. Ignite uses a B+ Tree. A B+ Tree is self-balancing, which protects the tree from growing in the put-remove case.

As a result, to start operating with a B+ Tree we can just load the root (metadata page) and start reading results without loading the whole tree into RAM. Tree pages can be merged into one if less than 50% of their space is used.

Ignite also uses a special component that manages information about the pages currently available in memory.

An SQL query can now start running without the full data available in memory. On a recently started cluster, the first SQL query run will, of course, be slower.

If the amount of memory is less than the whole index size, Ignite can still operate. When free memory is insufficient to allocate a new page, some page is removed from RAM back to disk. The removal decision is based on the latest touch time.

The algorithms used are LRU-S/LRU 2, see also Variants on LRU

Page based eviction

Let's suppose RAM is completely filled with pages and a new page has to be allocated. Some page must be evicted from memory.

Ignite uses an eviction policy to determine which page to evict.

The simplest algorithm to select for eviction would be LRU, but it requires a doubly linked list, and such a structure is not simple to implement off heap.

The algorithm used instead is Random-LRU (the most recent access timestamp is stored for a data page).
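A minimal sketch of the Random-LRU idea follows. It keeps one timestamp per page in a flat array (no linked list), samples a few random pages and picks the least recently touched one. The class name, the sample size and the use of a plain long[] instead of off-heap memory are assumptions made for illustration.

```java
import java.util.Random;

/** Hypothetical sketch of Random-LRU: sample random pages, evict the least recently touched. */
final class RandomLruSketch {
    /** Assumed sample size; not taken from the article. */
    static final int SAMPLE_SIZE = 5;

    /** One timestamp per page slot; a flat array replaces the linked list of classic LRU. */
    private final long[] lastTouch;
    private final Random rnd;

    RandomLruSketch(int pages, Random rnd) {
        this.lastTouch = new long[pages];
        this.rnd = rnd;
    }

    /** Records an access to the page. */
    void touch(int page, long now) {
        lastTouch[page] = now;
    }

    /** Returns the index of the page to evict: the oldest among a random sample. */
    int selectVictim() {
        int victim = rnd.nextInt(lastTouch.length);
        for (int i = 1; i < SAMPLE_SIZE; i++) {
            int candidate = rnd.nextInt(lastTouch.length);
            if (lastTouch[candidate] < lastTouch[victim])
                victim = candidate;
        }
        return victim;
    }
}
```

The victim is not guaranteed to be the globally oldest page; the trade-off is constant-time selection with very little bookkeeping per page.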

Entry eviction

Eviction is used not only in Ignite Persistent Store mode. The same technique is required if Ignite is used as a fast-access cache with a 3rd party DB as persistence.

In that case we need to remove a cache entry, but removing an entry from the middle of a page causes page fragmentation.

Instead, we can evict an old random page: read all its entries and remove them one by one.

There is only one exception: an entry may currently be locked by a transaction. In this case the page is excluded from eviction.

This method allows cleaning up a big contiguous segment of memory (usually a whole page).

Random-2-LRU

There is a second option for eviction. In that algorithm the two most recent access timestamps are stored for every data page.

When a page is touched: the oldest timestamp is overwritten with the current time.

On eviction: the oldest timestamp is used for the eviction decision.

This policy solves the case of one-time data access, for example, one full scan query. Pages touched while running this query are not considered hot. See also documentation
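The two-timestamp rule can be illustrated with a small sketch. The class and method names are hypothetical, and timestamps are plain long values supplied by the caller.

```java
/** Hypothetical sketch of the per-page state used by Random-2-LRU. */
final class TwoTimestamps {
    /** The two most recent access timestamps (0 means "never touched"). */
    private long ts1;
    private long ts2;

    /** On touch, the oldest of the two timestamps is overwritten with the current time. */
    void touch(long now) {
        if (ts1 <= ts2)
            ts1 = now;
        else
            ts2 = now;
    }

    /** On eviction, the oldest of the two timestamps is compared between candidate pages. */
    long evictionKey() {
        return Math.min(ts1, ts2);
    }
}
```

A page touched once at time 100 by a full scan still has eviction key 0, while a page touched at times 50 and 60 has key 50, so the scanned page is evicted first even though its last access is more recent.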

Free lists

Ignite manages free lists to solve the issue of fragmentation in pages (pages that are not full).

Cache entry [Key, Value] pairs have different sizes, so after the first entries are placed, pages will have different amounts of free space.

Free list - a list of pages, structured by the amount of space remaining within a page.

When selecting a page to store a new key-value pair, Ignite does the following:

Consult the marshaller about the size in bytes of this key-value pair
Round this value up to be divisible by 8 bytes
Use the value from the previous step to get a page list from the free list
Select some page from the appropriate list of free pages. This page will have the required amount of free space
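The steps above can be sketched as a set of buckets indexed by free space in 8-byte steps. The class name, the page size and the fallback to larger buckets are assumptions of this sketch, not a description of the actual free list implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of a free list: pages are grouped into buckets by remaining free space. */
final class FreeListSketch {
    static final int PAGE_SIZE = 4096; // assumed page size
    static final int STEP = 8;         // bucket granularity in bytes

    /** Bucket b holds indexes of pages with free space in [b * 8, b * 8 + 7]. */
    private final List<Deque<Integer>> buckets = new ArrayList<>();

    FreeListSketch() {
        for (int b = 0; b <= PAGE_SIZE / STEP; b++)
            buckets.add(new ArrayDeque<>());
    }

    /** Rounds the size up to the nearest multiple of 8. */
    static int roundUp8(int size) {
        return (size + 7) & ~7;
    }

    /** Registers a page together with its current amount of free space. */
    void put(int pageIdx, int freeSpace) {
        buckets.get(freeSpace / STEP).push(pageIdx);
    }

    /** Takes a page with enough room for the entry, or returns -1 if none is registered. */
    int take(int entrySize) {
        for (int b = roundUp8(entrySize) / STEP; b < buckets.size(); b++) {
            Integer page = buckets.get(b).poll();
            if (page != null)
                return page;
        }
        return -1; // the caller would allocate a new page
    }
}
```

Because the requested size is rounded up before the bucket lookup, every page found this way has at least the required free space, and no scan over individual pages is needed.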

Long objects

If an object is longer than the page size, it requires several pages to store.

The object is saved from end to beginning.

This allows Ignite to touch each page only once. For each new page we already know the link to the previously allocated one. This reduces the number of locks on pages.

For an object part that is smaller than the page size, we can use a partially filled page from the free list.
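The end-to-beginning write order can be sketched as follows. Pages are modeled as a list of fragments and links as list indexes; the names and the tiny payload size are hypothetical. In this sketch the short remainder ends up in the head fragment, which is a choice of the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a long object is split into fragments written from end to beginning. */
final class LongObjectSketch {
    /** Assumed payload capacity of one page; tiny to keep the example readable. */
    static final int PAYLOAD = 4;

    /** One fragment per page; 'next' is the link to the following fragment, -1 for the last one. */
    static final class Fragment {
        final byte[] data;
        final int next;

        Fragment(byte[] data, int next) {
            this.data = data;
            this.next = next;
        }
    }

    private final List<Fragment> pages = new ArrayList<>();

    /** Writes the tail first, so each page is written once with its 'next' link already known. */
    int write(byte[] obj) {
        int next = -1;
        for (int end = obj.length; end > 0; end -= PAYLOAD) {
            int start = Math.max(0, end - PAYLOAD);
            pages.add(new Fragment(Arrays.copyOfRange(obj, start, end), next));
            next = pages.size() - 1;
        }
        return next; // link to the head fragment
    }

    /** Follows the links from the head fragment and reassembles the object. */
    byte[] read(int head) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int p = head; p != -1; p = pages.get(p).next)
            out.write(pages.get(p).data, 0, pages.get(p).data.length);
        return out.toByteArray();
    }

    int fragmentCount() {
        return pages.size();
    }
}
```

Writing in the opposite order would require returning to each page after the next one is allocated in order to fill in its link.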

Let's consider an object field update. If the marshaller reports that the updated field requires the same size, an optimization may be used: such an update does not require a page structure change or memory movement.

A page has a dirty flag. If a value stored in a page was changed but not yet flushed to disk, the page is marked as dirty.

In the previous case (the updated field value has the same length) only one page will be marked as dirty.

Page structure

Page types and headers

There is PageIO, a class for reading and writing pages. It has several implementations, for example BplusIO and DataPageIO.

---

Page header

Type - 2 bytes, determines class of page implementation
Version - 2 bytes
Crc - 4 bytes
pageId - for converting an unsafe memory offset back into a page ID (in the forward direction, the page ID resolves to an offset in unsafe memory).
Reserved - 3*8 bytes
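The header fields listed above can be laid out with fixed offsets. In the sketch below the 8-byte width of the pageId field is an assumption (the list does not state its size), and the class and method names are hypothetical rather than taken from PageIO.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of the common page header laid out at fixed offsets. */
final class PageHeaderSketch {
    static final int TYPE_OFF = 0;      // type, 2 bytes
    static final int VER_OFF = 2;       // version, 2 bytes
    static final int CRC_OFF = 4;       // CRC, 4 bytes
    static final int PAGE_ID_OFF = 8;   // page ID, assumed to be 8 bytes
    static final int RESERVED_OFF = 16; // reserved, 3 * 8 bytes
    static final int HEADER_SIZE = RESERVED_OFF + 3 * 8;

    /** Writes the header fields; the CRC is left as 0 in this sketch. */
    static void init(ByteBuffer page, short type, short ver, long pageId) {
        page.putShort(TYPE_OFF, type);
        page.putShort(VER_OFF, ver);
        page.putInt(CRC_OFF, 0);
        page.putLong(PAGE_ID_OFF, pageId);
    }

    /** Reads the page type, which selects the class that knows how to interpret the page. */
    static short type(ByteBuffer page) {
        return page.getShort(TYPE_OFF);
    }

    /** Reads the page ID back from a page found at some memory offset. */
    static long pageId(ByteBuffer page) {
        return page.getLong(PAGE_ID_OFF);
    }
}
```

Under these assumptions the common header occupies 40 bytes, and page-specific content starts right after it.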

A data page has its own header. It contains:

Free space - to avoid recalculation
Direct count
Indirect count

After the header, the page is filled with items.

Item - internal offset reference to payload in page, 2 bytes

Values are filled from the end to the beginning. Items are filled from the beginning to the end.

A link (page ID + order in page) to a KV pair identifies the exact item in the page.
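A simplified sketch of this layout is shown below: 2-byte items grow from the start of the page, values grow from the end, and the gap between them is the free space. The headers are omitted and the item count is kept in a Java field for brevity; all names are hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a data page: items grow from the start, values from the end. */
final class DataPageSketch {
    static final int ITEM_SIZE = 2; // each item is a 2-byte offset of its value

    private final ByteBuffer page;
    private int itemCnt;     // number of items written so far
    private int valuesStart; // offset of the lowest byte used by values

    /** The page size must fit into an unsigned 2-byte offset. */
    DataPageSketch(int size) {
        page = ByteBuffer.allocate(size);
        valuesStart = size;
    }

    /** The gap between the end of the item table and the start of the values. */
    int freeSpace() {
        return valuesStart - itemCnt * ITEM_SIZE;
    }

    /** Appends a value and returns its item index (the "order in page"), or -1 if it does not fit. */
    int add(byte[] value) {
        if (value.length + ITEM_SIZE > freeSpace())
            return -1;
        valuesStart -= value.length;
        for (int i = 0; i < value.length; i++)
            page.put(valuesStart + i, value[i]);
        page.putShort(itemCnt * ITEM_SIZE, (short) valuesStart);
        return itemCnt++;
    }

    /** Reads a value: it spans from its own offset to the offset of the previously added value. */
    byte[] get(int item) {
        int start = page.getShort(item * ITEM_SIZE) & 0xFFFF;
        int end = item == 0 ? page.capacity() : (page.getShort((item - 1) * ITEM_SIZE) & 0xFFFF);
        byte[] out = new byte[end - start];
        for (int i = 0; i < out.length; i++)
            out[i] = page.get(start + i);
        return out;
    }
}
```

The returned item index, combined with the page ID, is what a link needs in order to locate one particular value.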

Value delete from Data Page

Deletion of the last added item is simple: we can remove It3 and its K,V pair without any additional changes.

Another algorithm is activated when an item is deleted from the middle of a page.

In that case we move the data of the element with the higher number into the space that became free. But we also need to keep links consistent: a link may be referenced from outside, for example, by a B-Tree. We keep the item (2 bytes) in the same place, and an indirect pointer is written to this place instead of a pointer to the data.

There is also a background compaction process.

At insert we don’t need to iterate to find free space; we can still insert after the latest item.

During deletion of indirect items another process is activated.

The whole free size is tracked as free size, even if fragmentation occurred. Compaction may be required if an insert is not possible. Compaction changes all offsets within the page.

To get a link from a page ID, the offset is written to the highest bits of the page ID.
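Packing the item position into the highest bits of a page ID can be sketched with plain bit operations. The 8-bit width of the item field and the assumption that the page ID itself fits into the lower 56 bits are made up for this example.

```java
/** Hypothetical sketch: a link is a page ID with the item index stored in its highest bits. */
final class LinkSketch {
    static final int ITEM_BITS = 8;                  // assumed width of the item field
    static final int ITEM_SHIFT = 64 - ITEM_BITS;    // the item field occupies bits 56-63
    static final long PAGE_ID_MASK = (1L << ITEM_SHIFT) - 1;

    /** Builds a link from a page ID (assumed to fit into the lower 56 bits) and an item index. */
    static long link(long pageId, int itemId) {
        return (pageId & PAGE_ID_MASK) | ((long) (itemId & 0xFF) << ITEM_SHIFT);
    }

    /** Extracts the page ID part of the link. */
    static long pageId(long link) {
        return link & PAGE_ID_MASK;
    }

    /** Extracts the item index part of the link. */
    static int itemId(long link) {
        return (int) (link >>> ITEM_SHIFT);
    }
}
```

A single long value is then enough for a tree to reference an exact key-value pair: the lower bits locate the page and the upper bits select the item inside it.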

Defragmentation is performed in the values area; references to values are kept unmodified to keep the B-Tree consistent.

Adding an element after some were deleted: indirect-to-direct replacement

Probable implementation - needs to be verified against the DataPageIO implementation

BPlus Tree Structure

A link allows reading a KV pair or K only.

Binary search is used; log N data pages need to be read and checked to complete the search for a value. An optimization avoids extra page reads: if the indexed value requires fewer bytes than some threshold, the value is written into the tree.

Duplicate keys are not possible in a B-Tree.

The hash index is also a B-Tree (not a hash table): the key is the hash code and the value is the link.

Memory policy

Memory Policy is especially important when disk storage is configured.

In previous versions, several caches could allocate an uncontrolled amount of memory. The cache that performed allocations first won. There was no way to limit memory for a particular cache.

In the new version this is possible using a Memory Policy. One Memory Policy may include N caches.

Data may be separated in the end-user system, for example into archive data and operational data.

We can specify how much memory can be allocated for a cache or cache group.

Reference tables (dictionaries) are usually small and may be assigned to always stay in memory.
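As an illustration, the configuration fragment below defines two memory policies and assigns caches to them. It assumes the MemoryConfiguration / MemoryPolicyConfiguration API of Ignite 2.0-2.2 (later versions replaced it with data regions) and needs ignite-core on the classpath. The policy names, cache names and sizes are made up for the example.

```java
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataPageEvictionMode;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.MemoryConfiguration;
import org.apache.ignite.configuration.MemoryPolicyConfiguration;

/** Configuration sketch; names and sizes are illustrative only. */
final class MemoryPolicyExample {
    static IgniteConfiguration configure() {
        // Small policy for reference tables; no eviction mode is set for it.
        MemoryPolicyConfiguration dictionaries = new MemoryPolicyConfiguration();
        dictionaries.setName("dictionaries");
        dictionaries.setInitialSize(64L * 1024 * 1024);
        dictionaries.setMaxSize(128L * 1024 * 1024);

        // Larger policy for operational data with page-based eviction.
        MemoryPolicyConfiguration operational = new MemoryPolicyConfiguration();
        operational.setName("operational");
        operational.setMaxSize(2L * 1024 * 1024 * 1024);
        operational.setPageEvictionMode(DataPageEvictionMode.RANDOM_2_LRU);

        MemoryConfiguration memCfg = new MemoryConfiguration();
        memCfg.setMemoryPolicies(dictionaries, operational);

        // One memory policy may serve several caches; each cache refers to it by name.
        CacheConfiguration<Integer, String> countries = new CacheConfiguration<>("countries");
        countries.setMemoryPolicyName("dictionaries");

        CacheConfiguration<Long, String> orders = new CacheConfiguration<>("orders");
        orders.setMemoryPolicyName("operational");

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setMemoryConfiguration(memCfg);
        cfg.setCacheConfiguration(countries, orders);
        return cfg;
    }
}
```

With such a setup the "orders" cache cannot take memory away from the dictionaries, because each policy has its own size limit.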

Results of memory structure changes

Previous Ignite versions: caches were on heap, off-heap required configuration, and small regions were allocated for each usage.

A GC pause that is too long can look exactly like a failed node from a remote node's point of view.

A larger heap causes longer GC pauses, and long GC pauses make cluster failure more probable.

Page memory is used for all data.

Caches are placed off heap.

References

https://apacheignite.readme.io/v2.1/docs/durable-memory

https://cwiki.apache.org/confluence/display/IGNITE/Persistent+Store+Architecture

https://cwiki.apache.org/confluence/display/IGNITE/Persistent+Store+Overview
