
Excerpt

This article covers the internal design of the Ignite multi-tier storage architecture. Intended for Ignite developers.

The reasoning behind the storage architecture is described in the main documentation: https://ignite.apache.org/docs/latest/memory-architecture

Table of Contents

Ignite Multi-Tier Storage Architecture

Motivation

Let us cover the reasons why Ignite uses durable memory:


1. Ignite can be backed by a 3rd party DB solution to store data, with only the required data loaded into Ignite. But running any full scan query (which requires the whole data set) causes loading of all data into the cache.

2. Before 2.0 there was a persistence solution, the Local File Store. But running a query selecting all cache data using the old model still took too long. The reasons are device characteristics:

  • RAM - random access, byte-addressed
  • HDD - block-addressed, long random access time

A memory-mapped file on an HDD will be slow anyway. It is required to store data not in a "random" way (as in the Java heap), but in some organized manner, grouped by blocks.

3. Main requirement: when a node starts it should handle cache requests (start operations) without a long delay to load all data from disk.

Solution

Ignite Durable Memory

Let's split the memory into pages (synonyms: buffers, blocks, chunks). Let's consider the Page as the fundamental unit the whole memory is split into. This makes the memory addressing page-based.

Next, sometimes a query might require an SQL index for execution. The memory has to store not only data entries inside the pages but build and maintain indexes as well. If Ignite used a memory-mapped files approach, it would be required to read all the data to process the first query. Instead, index data is also kept in pages and can be stored in RAM and on disk if the latter is enabled.

Now let's introduce an integer number that defines the index of a page - idx (4 bytes, unique within a cache, partition and the current local node). In a continuous memory segment a page is located at a specific offset:

No Format
idx * pageSize = file offset

Different caches and their partitions bring more complex addressing, but it is still possible to map a page ID to a position in a particular file:

No Format
pageId => file + offset in this file
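The offset arithmetic above can be sketched as follows. The class and method names are illustrative (not Ignite code), and a 4 KiB page size is assumed; in Ignite the page size is configurable via DataStorageConfiguration#setPageSize:

```java
/** Illustrative sketch (hypothetical names): mapping a page index to a file offset. */
class PageOffsets {
    /** Page size in bytes; 4 KiB is assumed here (configurable in Ignite). */
    static final int PAGE_SIZE = 4096;

    /** A page's offset within its file is simply index * pageSize. */
    static long fileOffset(long pageIdx) {
        return pageIdx * PAGE_SIZE;
    }
}
```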

Let's also add a partition identifier (2 bytes); the composed identifier is the effectivePageId, see PageIdUtils#effectivePageId.

Using this page identifier it's possible to link pages together to build dependencies between them. For instance, this is how we can link 2 pages together by keeping the child page ID in the root page:

(image)
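As a sketch of the composition, a 2-byte partition ID and a 4-byte page index can be packed into a single long. The actual bit layout lives in PageIdUtils and may differ; this is only an illustration of the idea:

```java
/** Hypothetical sketch of composing an effective page ID from a 2-byte
 *  partition ID and a 4-byte page index (cf. PageIdUtils#effectivePageId). */
class EffectivePageIds {
    /** Pack partition ID (16 bits) above the page index (32 bits). */
    static long effectivePageId(int partId, long pageIdx) {
        return ((partId & 0xFFFFL) << 32) | (pageIdx & 0xFFFFFFFFL);
    }

    /** Extract the partition ID back out of the composed identifier. */
    static int partId(long effectivePageId) {
        return (int) ((effectivePageId >>> 32) & 0xFFFF);
    }

    /** Extract the page index back out of the composed identifier. */
    static long pageIndex(long effectivePageId) {
        return effectivePageId & 0xFFFFFFFFL;
    }
}
```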

The ability to link pages gives a way to organize pages into more complex structures, for example trees. Ignite taps into this by using a B+Tree for index and data page arrangements. The B+Tree is self-balancing, which protects the tree from degenerating under put-remove workloads.

If Ignite Persistence is used, then after a restart Ignite will load the tree's root node (metadata page) and be able to iterate over the tree layers, preloading each missing page into memory from disk on demand. So to start operating with the B+Tree we can just load the root and start reading results without loading the whole tree into RAM. Also, tree pages can be merged into one if less than 50% of the space within a page is used by payload data.

(image)

B+Trees are used for SQL indexes: the tree maps a field value to a link to the entry value.

Region and segment structure

Page content is stored in RAM in segments. The memory region allocated in RAM is actually not a continuous sequence of pages.

Ignite uses a special component that manages information about the pages currently available in a memory segment (LoadedPagesTable, formerly FullPageIdTable) and their mapping to region addresses. There is one such page ID table per memory segment chunk (UnsafeChunk, DirectMemoryRegion). For an unsafe segment chunk sun.misc.Unsafe is used internally, which provides memory for off-heap data storage.

LoadedPagesTable (PageIdTable) manages the mapping from a Page ID to a relative pointer (rowAddr) within the unsafe segment chunk.

An SQL query can now start running without the full data set available in memory. For a recently started cluster the first SQL query run will, of course, be slower.

Each segment manages its own lock. The lock protects segment content (the page table) from concurrent modification. By default the segment count is determined by the number of available CPUs, but it may be configured via DataStorageConfiguration.setConcurrencyLevel().

Splitting each data region (DataRegionConfiguration) into a number of segments allows minimising contention on a single region-wide lock (the striping technique).

The structure of segments and the Loaded Pages Map is shown in the figure:

Gliffy Diagram: segmentsAndPageTable

Since Ignite version 2.5 the Loaded Pages Id Table uses Robin Hood hashing with the backward shift deletion algorithm for maintaining a hash map of FullPageId (CacheId, PageId) -> (Address, PartGeneration).

To select the appropriate segment Ignite hashes the FullPageId; (hash % segmentCount) gives the segment number.

Within a segment's LoadedPagesTable Ignite also hashes the FullPageId to select the appropriate bucket in the table.

In case of a hash collision inside the LoadedPagesTable a linear probe is used to find the page by ID. If an empty bucket is found the search is finished: the page is not present in RAM.
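A simplified sketch of such a probing lookup follows. The layout and names are illustrative, not the actual LoadedPagesTable code; in particular, Robin Hood insertion with backward shift deletion is omitted, and 0 is reserved as the empty-bucket marker:

```java
/** Toy open-addressing page table with linear probing (illustrative only). */
class PageTableSketch {
    private final long[] ids;    // full page IDs; 0 marks an empty bucket
    private final long[] addrs;  // relative addresses within the segment chunk

    PageTableSketch(int capacity) {
        ids = new long[capacity];
        addrs = new long[capacity];
    }

    /** Insert or update a mapping; probes linearly past occupied buckets. */
    void put(long fullPageId, long relAddr) {
        int i = bucket(fullPageId);
        while (ids[i] != 0 && ids[i] != fullPageId)
            i = (i + 1) % ids.length; // linear probe
        ids[i] = fullPageId;
        addrs[i] = relAddr;
    }

    /** Returns the relative address, or -1 if the page is not loaded in RAM. */
    long get(long fullPageId) {
        int i = bucket(fullPageId);
        while (ids[i] != 0) {         // an empty bucket terminates the search
            if (ids[i] == fullPageId)
                return addrs[i];
            i = (i + 1) % ids.length;
        }
        return -1;
    }

    private int bucket(long fullPageId) {
        return (Long.hashCode(fullPageId) & 0x7FFFFFFF) % ids.length;
    }
}
```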

Segment and Page Locks

The segment lock is read-write:

  • A read lock is held on the segment to resolve the real unmanaged memory address from a page ID (page pin, acquire).
  • A read lock is held to release a previously acquired page.
  • A write lock is held if the set of segment pages is being modified (e.g. a page is rotated from RAM to disk, or a new page is allocated).

A page also has its own lock. This lock may be acquired only for a page available in RAM and only for a pinned (acquired) page.

Gliffy Diagram: segmentsAndPagesLocks

Since Ignite version 2.5 two optimizations have been introduced. The segment write lock

  1. is not held during the initial page loading from disk,
  2. and is not held during the actual page write to disk.

For the read case the segment lock is released, but the page write lock is acquired instead, so a user thread will wait on the page lock instead of the segment lock while the page is being loaded. This decreases contention on the segment lock during long-running IO operations. The segment lock is held for a short time to acquire the page and insert it into the table, but is released before IO.

In case of flushing a dirty page (see the page replacement section below) there is an additional synchronized map of pages being written to disk, so a page read can't be started for such a page until it is removed from this map. This map is usually small, and most likely the page is not there.

Eviction, rotation and expiration

This section describes possible page and entry operations related to rotation with disk or complete removal of data from the grid.

| Term | Activated when | Comments | Configuration | Level of operation | In-memory only mode | Persistence mode |
| Expiration (aka TTL) | Time | Sets the expire time of an entry after entry creation/access/update | ExpiryPolicy (Factory) | Entry | (+) | (+)/(-) a number of issues exist |
| Eviction | Region is full | Completely removes the entry from the grid. Reasonable with 3rd party persistence | PageEvictionMode | Entry (+ the page is used to find more entries to remove) | (+) | N/A |
| On-heap eviction | Depends on policy | Near caches and on-heap caches (1.x) | EvictionPolicy | Entry | (+) only for near/on-heap caches | |
| Page replacement | Region is full | Operated by Ignite automatically | Not configurable by user | Page | N/A | (+) Always enabled |
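For example, expiration (TTL) from the table above is configured per cache with an ExpiryPolicy factory. This is a configuration sketch using the standard JCache policy classes; the cache name is an example:

```java
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;

import org.apache.ignite.configuration.CacheConfiguration;

// Entries of this cache expire one minute after creation.
CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("myCache");
ccfg.setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.ONE_MINUTE));
ccfg.setEagerTtl(true); // expired entries are also purged by a background thread
```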


Page replacement (rotation with disk)

If you enable Ignite native persistence (covered in Ignite Persistent Store - under the hood), then paging is still handled automatically by Ignite.

The PageIdTable (FullPageIdTable) is used to check

  • whether a page is actually available in memory,
  • or has to be loaded from disk.

As a result, even SQL queries can be fully utilized from Ignite applications when only a subset of pages is in memory. In this case the required pages will be taken from disk during query execution.

If the memory amount is less than the whole size of the B+Tree for an SQL index, Ignite can still operate. When the free memory amount becomes insufficient to allocate a new page, some page will be removed from RAM back to disk. The removal decision is based on the latest touch time.

When Ignite runs out of RAM for new page allocation, some of the pages are purged from RAM (they can be loaded from disk again later). The algorithms used are LRU-S/LRU-2, see also Variants on LRU.

This process is named page rotation (page replacement, swap to disk; historical naming - page eviction).


Let's suppose RAM is completely filled with pages, and it is required to allocate a new one. Some page has to be rotated out of memory.

An example of rotating in a new page and clearing a page in RAM is shown in the picture.

(image)

The simplest algorithm for selecting a page to rotate would be LRU, but it requires a doubly linked list, and it is not simple to implement such a structure off-heap.

The algorithm used instead is Random-LRU (the most recent access timestamp is stored for each data page). This algorithm is always used if the Persistent Data store mode is enabled. Ignite will adaptively page-in and page-out memory to disk in a fashion very similar to how modern operating systems handle virtual memory.

Replacement (rotation) of pages is started only if the memory segment is full and it is impossible to allocate a page in the segment chunk.
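The Random-LRU idea can be sketched as follows: sample a handful of random pages and rotate out the one touched longest ago. The data layout and sampling count are made up for illustration and are not Ignite's actual implementation:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Toy sketch of Random-LRU victim selection (illustrative only). */
class RandomLru {
    /**
     * @param timestamps last-access timestamp per loaded page (index = page slot).
     * @param samples how many random candidates to compare.
     * @return the slot of the page to rotate out to disk.
     */
    static int selectVictim(long[] timestamps, int samples) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int victim = rnd.nextInt(timestamps.length);
        for (int i = 1; i < samples; i++) {
            int candidate = rnd.nextInt(timestamps.length);
            if (timestamps[candidate] < timestamps[victim])
                victim = candidate; // older access time => better eviction candidate
        }
        return victim;
    }
}
```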

During the first page rotation a warning message is printed to the logs:


No Format
Page replacements started, pages will be rotated with disk, this will affect storage performance (consider increasing DataRegionConfiguration#setMaxSize).

The message for versions earlier than 2.5 was:


No Format
Page evictions started, this will affect storage performance (consider increasing MemoryConfiguration#setPageCacheSize).



If this message appears in the log, it means the segment is full, and to allocate a new page some other page has to be purged.

Page replacement may negatively affect performance: in some cases Ignite may continuously evict a page from RAM and then require that page's data a few steps later, forcing a re-read of the data from disk.

Entry Eviction

If Native Persistence is not used, then an upcoming hit of the memory limit requires Ignite to clean up some data (otherwise an IgniteOutOfMemory error may occur). In this case users have to decide how to deal with out-of-memory situations by specifying their own eviction policy. Ignite will not transparently page to disk, so users need to configure eviction.

Ignite uses the eviction policy, configured by the user, to determine which page to evict. The same technique is required if Ignite is used as a fast-access cache with a 3rd party DB as persistence (such as write-behind to an RDBMS).

If eviction is configured, we need to remove a cache entry, but removing an entry from the middle of a page would cause page fragmentation. The eviction mechanism instead cleans up a big continuous segment of memory (usually a whole page).

(image)

Eviction policy only makes sense when native persistence is disabled, because what it does is actually free up memory when a user hits the memory limit. The only way to do this is to destroy inserted data, because there is no other way to free memory.

The eviction mechanism is both per-page and per-entry:

  • first, we choose the page which fits best for eviction (however, we cannot simply erase the page because quite a lot of data structures refer to that page);
  • second, we read the keys that the chosen page contains and then clear the individual cache entries, which in turn clears up all the necessary links.

If there are no concurrent updates, the page becomes empty and will be reused for other user data.

Random-2-LRU

There is a second option for eviction if the Persistent Data store is not enabled. In this algorithm the two most recent access timestamps are stored for every data page.

When evicting, the older of the two timestamps is used for the eviction decision.

(image)

This policy solves the case of one-time access of data, for example a single full scan query: pages touched while running such a query are not considered hot. See also the documentation.
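Random-2-LRU is selected per data region through the public configuration API. A configuration sketch (region name and size are examples):

```java
import org.apache.ignite.configuration.DataPageEvictionMode;
import org.apache.ignite.configuration.DataRegionConfiguration;

// Pure in-memory region (no persistence) with Random-2-LRU data page eviction.
DataRegionConfiguration region = new DataRegionConfiguration()
    .setName("inMemoryRegion")
    .setMaxSize(512L * 1024 * 1024) // 512 MB cap for this region
    .setPageEvictionMode(DataPageEvictionMode.RANDOM_2_LRU);
```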

Entries and pages

Location of objects

Free lists

Ignite manages free lists to solve the fragmentation issue with partially filled pages.

A free list is a list of pages, bucketed by the amount of space remaining within the page.

(image)

When selecting a page to store a new key-value pair, Ignite does the following:

  • Consult the marshaller about the size in bytes of this value pair
  • Round this value up to be divisible by 8 bytes
  • Use the value from the previous step to get a page list from the free list
  • Select some page from the appropriate list of free pages. This page will have the required amount of free space
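The rounding and bucket-selection steps above can be sketched as follows; the 8-byte bucket granularity mirrors the description, while the helper names are made up for illustration:

```java
/** Sketch of free-list bucket selection for a new key-value pair. */
class FreeListBuckets {
    /** Round the marshalled size up to the next multiple of 8 bytes. */
    static int roundUp8(int sizeBytes) {
        return (sizeBytes + 7) & ~7;
    }

    /** Bucket index: one free-list bucket per 8-byte step of required space. */
    static int bucket(int sizeBytes) {
        return roundUp8(sizeBytes) / 8;
    }
}
```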

Long objects

If an object is larger than the page size, it will require several pages to store.

For the part of the object which is less than the page size, a partially filled page from the free list can be used.

(image)

Let's consider an object field update. If the marshaller reports that the updated field requires the same size, an optimization may be used: such an update does not require a page structure change or memory movement.

In this case (the updated field value has the same length) only one page will be marked as dirty.

Page structure

Page types and headers

The class PageIO is the abstract class for reading and writing pages. Several implementations exist: BPlusIO, DataPageIO.

Every page contains a page header. The page header includes:

  • Type - 2 bytes, determines the class of the page implementation
  • Version - 2 bytes
  • Crc - 4 bytes
  • pageId - for backward conversion from an unsafe memory offset into a page ID (forward, by page ID, we can resolve the offset in unsafe memory)
  • Reserved - 3*8 bytes

A data page has its own header in addition to the general page header. It contains:

  • Free space - to avoid recalculation
  • Direct count
  • Indirect count

After the page header the page contains items.

An Item is an internal reference (offset) to the payload within the page. An item is 2 bytes in length.

See the page structure in the picture below:

(image)

Values are inserted into a data page from the end towards the beginning. Items are filled from the beginning to the end.

To address a particular key-value pair Ignite uses a Link = Page ID + order in page.

The Link allows reading the K,V pair as the N-th item in the page.
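The article later notes that the item order is written into the highest bits of the page ID to form a Link. A sketch of that packing (the real bit layout is Ignite-internal and may differ; here the top byte is used for the order):

```java
/** Illustrative sketch of a Link = page ID + item order packed into one long. */
class Links {
    /** Pack the item order into the highest byte of the 8-byte page ID. */
    static long link(long pageId, int itemOrder) {
        return pageId | (((long) itemOrder) << 56);
    }

    /** Item order within the page (the N in "read the N-th item"). */
    static int itemOrder(long link) {
        return (int) (link >>> 56);
    }

    /** The page ID with the order bits masked off. */
    static long pageId(long link) {
        return link & 0x00FFFFFFFFFFFFFFL;
    }
}
```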

Value delete from Data Page

Deletion of the last added item is trivial: we can just remove the Item and the key-value pair without any additional changes (see It3, K,V3 in the picture).

A more complex algorithm is activated for deletion of an item from the middle of a page.

In that case

  • we remove the unneeded key-value pair (K,V2 for example, see the picture below);
  • we move the data of the element with the higher number into the space that became free (K,V3 for example).

But we also need to keep the external Link to the K,V3 pair (Page ID + order=3) consistent, as this link may be referenced outside, for example by a B+Tree:

  • We keep the Item (2 bytes) for the moved K,V3 at the same place.
  • An indirect pointer is written into the Item for K,V3, referring to the Item with order 2. The Item with order 2 refers to the real data. See the picture below.

(image)

This technique brings an advantage: when inserting a new K,V pair we don't need to iterate over the page to find free space - we can insert after the latest item instead.

During deletion of indirect items another algorithm is activated. There is also a background compaction process.

The whole page free size is tracked in the corresponding field, regardless of the fragmentation that occurred. Compaction may be required if an insert is not possible. Compaction will change the offsets of all K,V pairs within the page. Defragmentation is performed in the values area; references to the values are kept unmodified to achieve consistency in the B+Tree.

To get a Link from a page ID, the offset (item order) is written into the highest bits of the page ID.

Probable implementation - needs to be verified against the DataPageIO implementation.

Example of re-insertion of a new K,V4 element after some other element K,V2 has been deleted. The following picture shows the indirect Item to direct Item replacement:

(image)

BPlus Tree Structure

The B+Tree structure is built mostly the same way as a binary search tree. The required value is compared with a value in the current tree node: greater values can be found by following the right link, lesser values the left one.

(image)

Binary search is used to find the required key. It is required to read and check log N data pages to complete a search for a value.

A Link in the B+Tree allows reading the K,V pair, or the key only, depending on the object size. There is an optimisation to avoid extra page reads: if the indexed value requires fewer bytes than some threshold, the value is written into the tree itself.

Duplicate keys are not possible in the B+Tree.

A Hash Index is also a B+Tree (not a hash table): the key is the hash code and the value is a link.
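The per-page step of the descent described above can be sketched as a binary search over the sorted keys of one tree page; the helper below is illustrative, not Ignite's BPlusIO code:

```java
import java.util.Arrays;

/** Sketch of the per-page step of a B+Tree lookup: binary search over the
 *  sorted keys of one page decides which child link to follow. */
class BTreePageSearch {
    /** Returns the child index to descend into for the given key. */
    static int childIndex(long[] sortedKeys, long key) {
        int pos = Arrays.binarySearch(sortedKeys, key);
        // Miss: binarySearch returns -(insertionPoint) - 1, so recover it.
        // Hit: descend to the right of the matching separator key.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```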

Memory policy

Data Region Configuration

DataRegionConfiguration (historical naming: Memory Policy) is especially important when the Ignite Persistent Store is enabled.

Several caches may allocate an uncontrolled amount of memory. In case of limited resources the first cache which performs allocations wins, and previously there was no way to limit this memory for a particular cache.

Since Ignite 2.0 it is possible to control this using DataRegionConfiguration (before 2.3 it was named Memory Policy). One data region may include several caches (the relation is 1-to-N).

Data may be separated in the end-user system into archive data and operational data:

(image)

The user can now specify how much memory may be allocated for a cache or cache group.

Reference tables (dictionaries) are usually small and may be assigned to a region that is always kept in memory.
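The split above can be expressed with the public configuration API. A sketch with two example regions (names and sizes are illustrative); each cache then picks its region via CacheConfiguration#setDataRegionName:

```java
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

// A small always-in-memory region for dictionaries and a larger
// persistence-enabled region for operational data.
DataStorageConfiguration storage = new DataStorageConfiguration();

storage.setDataRegionConfigurations(
    new DataRegionConfiguration()
        .setName("dictionaries")
        .setInitialSize(64L * 1024 * 1024)
        .setMaxSize(64L * 1024 * 1024),
    new DataRegionConfiguration()
        .setName("operational")
        .setMaxSize(4L * 1024 * 1024 * 1024)
        .setPersistenceEnabled(true));

IgniteConfiguration cfg = new IgniteConfiguration()
    .setDataStorageConfiguration(storage);
```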

 

Storage Architecture: Ignite 2.x vs Ignite 1.x

The following paragraph summarizes the results of the memory structure changes.

In previous Ignite versions caches were on-heap by default. Off-heap caches required configuration, and a number of small regions were allocated for each usage.

A quite long GC pause can look exactly the same as a failed node from a remote node's point of view. The more heap is used for data, the longer the GC pauses, and with long GC pauses cluster failure becomes more probable.


For 2.1+ Ignite versions: Page memory is used for all data. Caches are placed in off-heap memory.

 

References

https://apacheignite.readme.io/v2.1/docs/durable-memory

https://cwiki.apache.org/confluence/display/IGNITE/Persistent+Store+Architecture

These pages can now be saved to and loaded from disk (transparently to the user).

An Ignite node with persistent store enabled may now start operating without reading all page data from disk.


See also

Persistent Store Architecture and Persistent Store Overview https://cwiki.apache.org/confluence/display/IGNITE/Persistent+Store+Overview