
This article covers the internal design of Durable Memory. It is intended for Ignite developers.


Motivation

Let us cover the reasons why Ignite uses durable memory.

1. Consider the case where Ignite is backed by a 3rd party DB solution to store data, and only the required data is loaded into Ignite. Running any full scan query (which requires the whole data set to be in memory) causes all data to be loaded into the cache.

2. Consider the simplest implementation of persistence: one or several memory-mapped files. With this model, running a query to select a cache entry by field takes too long.

The reasons lie in device characteristics:

  • RAM - random access, byte-addressed
  • HDD - block-addressed, with long random access times

A memory-mapped file on an HDD will be slow in any case. Data must be stored not in a "random" way (as it is in the Java heap), but in an organized manner, grouped by blocks.

3. The main requirement: when a node starts, it should handle cache requests (start to operate) without a long delay to load all data from disk.

Solution

Ignite Durable Memory

Let's divide memory into pages (synonyms: buffers, blocks, chunks). Let's consider the page as the fundamental unit of all memory. Memory addressing becomes page-based.

A query may require SQL index data to execute, so this index has to be built. If we used a memory-mapped file, we would have to read all data to process the first query.

Instead, index data is also page-based and, as part of durable memory, is stored to disk.

Let's introduce an integer number - the index of a block, idx (defined within the current node):

idx * blockSize = file offset

Different caches and their partitions bring a more complex addressing scheme, but it is still possible to map a page ID to a position in a particular file:

pageId => file + offset in this file
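In the simplest, single-file case described above, the mapping can be sketched as follows (an illustrative sketch only, not Ignite's actual code; the page size and names are assumptions):

```java
// Illustrative sketch: resolving a page's position on disk.
// Assumes a fixed page size; names are not Ignite's actual API.
public class PageOffsets {
    static final int PAGE_SIZE = 4096; // assumed page size in bytes

    // Simplest case: one file per node, so the offset is idx * pageSize.
    static long fileOffset(long pageIdx) {
        return pageIdx * PAGE_SIZE;
    }
}
```

With multiple caches and partitions, the same idea applies per partition file: the page ID first selects a file, and the remaining index selects an offset inside it.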

Let's use the Page ID for linking two pages, and then for dereferencing the real position in the file.

We are now able to organize pages into more complex structures, for example a tree. Ignite uses a B+ tree. A B+ tree is self-balancing, which keeps the tree from degenerating under put-remove workloads.

As a result, to start operating with a B+ tree we can just load the root (metadata page) and start reading results without loading the whole tree into RAM. Tree pages can be merged into one if less than 50% of their space is used.

Ignite also uses a special component that manages information about the pages currently available in memory.

An SQL query can now start running without the full data set available in memory. For a recently started cluster, the first SQL query run will, of course, be slower.

If the amount of memory is less than the whole index size, Ignite can still operate. When the amount of free memory becomes insufficient to allocate a new page, some page is removed from RAM back to disk. The removal decision is based on the latest access time.

The algorithms used are LRU-S/LRU-2; see also Variants on LRU.

Page-based eviction

Let's suppose RAM is completely filled with pages and a new page needs to be allocated. Some page must be evicted from memory.

Ignite uses an Eviction Policy to determine which page to evict.

The simplest algorithm for selecting a page to evict is LRU, but it requires a doubly linked list, and such a structure is not simple to implement off-heap.

The algorithm used instead is Random-LRU (the most recent access timestamp is stored for each data page).
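The idea can be sketched as follows (a minimal sketch under stated assumptions; the sample size, field names, and in-memory arrays are illustrative, not Ignite's implementation): sample a few random pages and evict the least recently used one among the sample, avoiding any linked-list bookkeeping.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative Random-LRU sketch (not Ignite's code): each data page
// tracks its last access timestamp; to evict, sample a few random
// pages and pick the least recently used among the sample.
public class RandomLru {
    final long[] lastAccess; // per-page last-access timestamps

    RandomLru(int pages) { lastAccess = new long[pages]; }

    void touch(int page, long now) { lastAccess[page] = now; }

    // Sample `tries` random pages and return the one with the oldest stamp.
    int pickVictim(int tries) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int victim = rnd.nextInt(lastAccess.length);
        for (int i = 1; i < tries; i++) {
            int p = rnd.nextInt(lastAccess.length);
            if (lastAccess[p] < lastAccess[victim])
                victim = p;
        }
        return victim;
    }
}
```

Sampling trades eviction accuracy for O(1) touch cost and no shared list structure, which is the point of choosing it over classic LRU off-heap.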

Entry eviction

Eviction is used not only in Ignite Persistent Store mode. The same technique is required if Ignite is used as a fast-access cache with a 3rd party DB as persistence.

In that case we need to remove one cache entry, but removing an entry from the middle of a page would cause page fragmentation.

Instead, we can evict an old random page: read all of its entries and remove them one by one.

There is only one exception: an entry may currently be locked under a transaction. In this case such a page is excluded from eviction.

This method allows cleaning up a big contiguous segment of memory (usually a whole page).

 

Random-2-LRU

There is a second option for eviction. In this algorithm, the two most recent access timestamps are stored for every data page.

When a page is touched: the oldest timestamp is overwritten with the current time.

When a page is considered for eviction: the oldest timestamp is used for the eviction decision.

This policy handles the case of one-time data access, for example a single full scan query: pages touched while running such a query are not considered hot. See also the documentation.
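The two-timestamp rule can be sketched as follows (an illustrative sketch; Ignite stores these stamps in page memory, not in Java arrays, and the names here are assumptions):

```java
// Illustrative Random-2-LRU sketch (not Ignite's code): two access
// timestamps per page; a touch overwrites the older one, and the older
// of the two is used when comparing eviction candidates. A page read
// once by a full scan keeps one old stamp, so it is not considered hot.
public class Random2Lru {
    final long[] ts1, ts2;

    Random2Lru(int pages) { ts1 = new long[pages]; ts2 = new long[pages]; }

    void touch(int page, long now) {
        if (ts1[page] <= ts2[page]) ts1[page] = now;
        else ts2[page] = now;
    }

    // Value used when comparing eviction candidates: the older stamp.
    long evictionStamp(int page) {
        return Math.min(ts1[page], ts2[page]);
    }
}
```

A page touched only once by a scan keeps one stale timestamp, so its eviction stamp stays old and it is evicted before genuinely hot pages.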

Free lists

Ignite maintains free lists to address fragmentation within pages (pages that are not full).

Cache entry [Key, Value] pairs have different sizes, so after the first entries are placed, pages have different amounts of free space.

A free list is a list of pages, bucketed by the amount of space remaining within the page.

When selecting a page to store a new value pair, Ignite does the following:

  • Consults the marshaller about the size in bytes of the value pair
  • Rounds this value up to a multiple of 8 bytes
  • Uses the value from the previous step to select a page list from the free list
  • Selects some page from the appropriate list of free pages; this page will have the required amount of free space
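The sizing steps above can be sketched as follows (a minimal sketch, assuming one free-list bucket per 8-byte step; bucket granularity and names are assumptions, not Ignite's actual layout):

```java
// Illustrative sketch of free-list bucket selection (not Ignite's API):
// the entry size reported by the marshaller is rounded up to a multiple
// of 8 bytes, and that value selects the bucket of pages that have
// enough remaining free space.
public class FreeListBuckets {
    static final int ALIGN = 8;

    // Round the serialized size up to a multiple of 8 bytes.
    static int alignedSize(int rawSize) {
        return (rawSize + ALIGN - 1) / ALIGN * ALIGN;
    }

    // Bucket index: one bucket per 8-byte step of free space.
    static int bucket(int rawSize) {
        return alignedSize(rawSize) / ALIGN;
    }
}
```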

Long objects

If an object is larger than the page size, several pages are required to store it.

The object is saved from end to beginning.

This allows Ignite to touch each page only once: for each new page we already know the link to the previously allocated one. This reduces the number of locks on pages.

For the part of the object that is smaller than the page size, a partially filled page from the free list can be used.

Let's consider an object field update. If the marshaller reports that the updated field requires the same size, an optimization may be used: such an update does not require a page structure change or memory movement.

A page has a dirty flag. If a value stored in the page has changed but has not yet been flushed to disk, the page is marked as dirty.

In the previous case (the updated field value has the same length), only one page is marked as dirty.

Page structure

Page types and headers

There is a class PageIO - an abstract class for reading and writing pages. Several implementations exist, for example BPlusIO and DataPageIO.

Every page contains a page header. The page header includes:

  • Type - 2 bytes, determines the class of the page implementation
  • Version - 2 bytes
  • CRC - 4 bytes
  • Page ID - for backward conversion of an unsafe memory offset into a page ID (forward, the page ID lets us resolve the offset in unsafe memory)
  • Reserved - 3*8 bytes

A data page has its own header in addition to the general page header. It contains:

  • Free space - stored to avoid recalculation
  • Direct count
  • Indirect count

After the page header, the page contains items.

An item is an internal reference (offset) to the payload within the page. An item is 2 bytes long.

See the page structure in the picture below:

Values are inserted into a data page from the end towards the beginning. Items are filled from the beginning towards the end.

To address a particular key-value pair, Ignite uses a Link = Page ID + order within the page.

A link allows reading a K,V pair as the N-th item in the page.

Value deletion from a Data Page

Deletion of the most recently added item is trivial: we can just remove the item and the key-value pair without any additional changes (see It3, K,V3 in the picture).

A more complex algorithm is activated when an item is deleted from the middle of the page.

In that case:

  • we remove the unneeded key-value pair (K,V2, for example; see the picture below)
  • we move the data of the element with the higher number into the space that became free (K,V3, for example)

At the same time we need to keep the external link to the K,V3 pair consistent (Page ID + order=3). This link may be referenced from outside, for example by a B+ tree.

  • We keep the item (2 bytes) in the same place for the moved K,V3.
  • An indirect pointer is written into the item for K,V3, referring to the item with order 2. The item with order 2 refers to the real data. See the picture below.
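The direct/indirect item scheme can be sketched as follows (a simplified illustration; Ignite's actual DataPageIO packs items into page memory, not Java arrays, and the field names here are assumptions):

```java
// Illustrative sketch of direct vs. indirect item resolution on a data
// page (simplified relative to Ignite's DataPageIO): a direct item holds
// the payload offset; an indirect item redirects an externally visible
// item id to the direct item that holds the real offset.
public class DataPageItems {
    final int[] offsets;  // direct items: payload offset within the page
    final int[] redirect; // indirect items: itemId -> direct item index (-1 = direct)

    DataPageItems(int[] offsets, int[] redirect) {
        this.offsets = offsets;
        this.redirect = redirect;
    }

    // Resolve the payload offset for an externally referenced item id.
    int payloadOffset(int itemId) {
        int r = redirect[itemId];
        return r >= 0 ? offsets[r] : offsets[itemId];
    }
}
```

In the K,V3 example above, the external link still names order 3; item 3 becomes indirect and points at item 2, which holds the moved payload's real offset.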

 

There is also a background compaction process.

This technique brings an advantage: when inserting a new K,V pair we don't need to iterate over the page to find free space; we can insert after the latest item instead.

Another algorithm is activated during the deletion of indirect items.

The page's free size is tracked in the corresponding field, regardless of any fragmentation. Compaction may be required if an insert is not possible. Compaction changes the offsets of all K,V pairs within the page. Defragmentation is performed in the values area; references to values are kept unmodified to keep the B+ tree consistent.

To transform a Page ID into a Link, the item offset can be written into the highest bits of the Page ID.
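This packing can be sketched as follows (the bit layout and field width are assumptions for illustration, not Ignite's exact format):

```java
// Illustrative sketch (bit layout is an assumption, not Ignite's exact
// format): a Link packs the item order into the highest bits of the
// 64-bit Page ID, so a single long addresses a particular K,V pair.
public class Links {
    static final int ITEM_BITS = 8; // assumed width of the item-order field

    static long link(long pageId, int itemId) {
        return pageId | ((long) itemId << (64 - ITEM_BITS));
    }

    static long pageId(long link) {
        return link & ((1L << (64 - ITEM_BITS)) - 1);
    }

    static int itemId(long link) {
        return (int) (link >>> (64 - ITEM_BITS));
    }
}
```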

 

Example of re-insertion of a new K,V4 element after some other K,V2 was deleted. The following picture shows the replacement of an indirect item with a direct item.

 

BPlus Tree Structure

The B+ tree is searched much like a binary tree: the required value is compared with some value in the tree; greater values can be found using the right link, smaller ones using the left.

 

Binary search is used to find the required key. It is necessary to touch log N data pages to complete the search for a value.

A link in a B+ tree allows reading a K,V pair or the key only, depending on the object size. An optimization is done to avoid extra page reads: if the indexed value requires fewer bytes than some threshold, the value is written into the tree itself.

Duplicate keys are not possible in a B+ tree.

A hash index is also a B+ tree (not a hash table): the key is the hash code and the value is a link.

Memory policy

Memory Policy is especially important when the Ignite Persistent Store configuration is enabled.

In previous versions, several caches could allocate an uncontrolled amount of memory. With limited resources, the first cache that performed allocations won: there was no way to limit memory for a particular cache.

Since Ignite 2.0, this can be controlled using a Memory Policy. One Memory Policy may include several caches (a 1-to-N relation).

All data in an end-user system can now be separated into the following classes: archive data and operational data.

The user can now specify how much memory may be allocated for a cache or cache group.

Reference tables (dictionaries) are usually small and may be configured to always reside in memory.
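A configuration along these lines might look as follows (a sketch against the Ignite 2.1 API; the policy name and sizes are example values, and in later versions this API was replaced by DataStorageConfiguration, so check the javadoc for your version):

```java
// Sketch: limiting memory for a group of caches via a Memory Policy
// (Ignite 2.1-era API; names and sizes are illustrative).
MemoryPolicyConfiguration plc = new MemoryPolicyConfiguration();
plc.setName("operational");
plc.setInitialSize(256L * 1024 * 1024); // 256 MB
plc.setMaxSize(1024L * 1024 * 1024);    // 1 GB cap for this policy

MemoryConfiguration memCfg = new MemoryConfiguration();
memCfg.setMemoryPolicies(plc);

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setMemoryConfiguration(memCfg);

// Assign a cache to the policy:
CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("dictionary");
ccfg.setMemoryPolicyName("operational");
cfg.setCacheConfiguration(ccfg);
```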

Results of memory structure changes

In previous Ignite versions, caches were on-heap by default. Off-heap caches required configuration, and a number of small regions were allocated for each usage.

A sufficiently long GC pause can look exactly like a failed node from a remote node's point of view.

The more heap is used for data, the longer GC pauses become, and long GC pauses make cluster failure more probable.


In Ignite 2.1+, page memory is used for all data. Caches are placed in off-heap memory. Pages can now be saved to and loaded from disk (transparently to the user).

An Ignite node with persistent store enabled may now start operating without reading all page data from disk.

 

See also

https://apacheignite.readme.io/v2.1/docs/durable-memory

Persistent Store Architecture and Persistent Store Overview 
