Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


Excerpt

This article covers the internal design of

...

the Ignite multi-tier storage architecture. Intended for Ignite developers.

The reasoning behind the

...

storage architecture is described in the main documentation:

...

https://

...

ignite.

...

apache.

...

org/docs/latest/

...

memory-

...

architecture

Table of Contents



Ignite

...

Multi-Tier Storage Architecture

Let's split the memory into pages (synonyms: buffers, block, chunks). Let's consider the Page as a fundamental unit the whole memory is split into. This makes the memory addressing page-based.

Next, sometimes a query might require an SQL index for execution. The memory has to store not only data entries inside of the pages but builds and maintain maintains indexes as well. If Ignite used a memory-mapped files approach, it would be required to read all the data to process the first query. Instead of this, index data is also kept in pages and can store both in RAM on disk if the latter is enabled.

Now let's introduce an integer number that will define an index of Page - idx (4 bytes, unique within cache, partition and in current local node). In continious continuous memory segment page is located at specific offset:

...

Let's also add partition identifier (2 bytes), and composed indetifier identifier is effectivePageId, see PageIdUtils#effectivePageId

Using this page identifier it's possible to link pages together to build dependencies between them. For instance, this is how we can link 2 pages together by keeping child page ID in the root page:

Image RemovedImage Added

The ability to link the pages gives a way to organize pages into more complex structures such as trees. Ignite taps into this by using B+Tree for index and data pages arrangements. 

If Ignite Persistence is used then after a restart Ignite will load the tree's root node (metadata page) and be able to iterate over the tree layers preloading each new missing page in memory from disk on demand. Also, the tree pages can be merged into one if less of 50% of space within page is used by payload data.

Image RemovedImage Added

B+Tree are used for SQL indexes: tree maps field value to reference to entry value.

Region and segment structure

Page content is stored in RAM in segments. Actually memory region allocated in RAM is not continious sequence of pages. 

...

Gliffy Diagram
size600
namesegmentsAndPageTable
pagePin2

Since Ignite version 2.5 Loaded Pages Id Table uses Robin Hood hashing: backward shift deletion algorithm for maintatining HashMap of FullPageId (CacheId, PageId)->(Address, PartGeneration).

...

In case of hash collision inside LoadedPagesTable lineral probe is used to find page by ID. If empty bucket is found, search is finished: page is not present in RAM.

Segment and Page Locks

Segment lock is read-write:

...

Gliffy Diagram
namesegmentsAndPagesLocks
pagePin2

Since Ignite version 2.5 there are two optimizations introduced. Segment write lock

...

In case of flusing dirty page (see section page replacement below) there is additional synchronized map of pages being written to disk, so page read can't be started for such pages until removal from this map. This map is usually not big and most likely page is not there.

Eviction, rotation and expiration

This section describes possible pages and entries operations related to rotation with disk or completely removal data from grid.

TermActivatedCommentsConfigurationLevel of operationIn memory only modePersistency mode

Expiration (aka TTL)

Time

Sets expire time of entry after entry creation/access/update

ExpiryPolicy (Factory)

Entry

(plus)

(plus)/(minus) a number of issues exist

Eviction

Region is full

Completely removes entry from grid. Reasonable with 3rd party persistence

PageEvictionMode

Entry (+ page is used to find more entries to remove)

(plus)

N/A

On Heap eviction

Depends on policy

Near caches and On-Heap caches (1.x)

EvictionPolicy

Entry

(plus) only for near /on-heap caches

Page replacement

Region is full

Ignite operates

Not configurable by user

Page

N/A

(plus) Always enabled


Page replacement (rotation with disk)

If durable memory operates with disk (native peristence you enable Ignite native persistence (that is covered in Ignite Persistent Store - under the hood), then the paging is still handled automatically by Ignite.

...

During first page rotated warning message is printed to logs

 


No Format
Page replacements started, pages will be rotated with disk, this will affect storage performance (consider increasing DataRegionConfiguration#setMaxSize).

...

Expand


No Format
Page evictions started, this will affect storage performance (consider increasing MemoryConfiguration#setPageCacheSize).

...



If this message appeared in log it means segment is full and for allocation of new page it is required to purge some other page.

Page replacement may have negative influence to performance because in some cases it is possible that Ignite continiously evicts page from RAM and some steps later reqiures this page data. In this case it is required to re-read data from disc.

Entry Eviction

If Native Peristence is not used, then upcoming hit to memory limit requires Ignite to clean up some data (otherwise IgniteOutOfMemory may occur). In this case, users have to decide how to deal with out-of-memory situations by specifying their own eviction policy. Ignite will not magically page to disk and so users need to configure eviction. 

...

If there are no concurrent updates, the page becomes empty and will be reused for other user data. 

Random-2-LRU

There is second option for eviction if Persitent Data store is not enabled. In that algorithm two most recent access timestamps are stored for every data page.

...

This policy solves case of one-time access of data, for example, one full scan query. Pages touched during running this query is not considered hot. See also documentation

Entries and pages

...

Location of objects

...

Free lists

Ignite manages free lists to solve issue with fragmentation in pages (not full pages).

...

  • Consult marshaller about size in bytes of this value pair
  • Upper-round this value to be divisible by 8 bytes
  • Use value from previous step to get page list from free list
  • Select some page from appropriate list of free pages. This page will have required amount of free space

Long objects

If object is longer than page size, it will require several pages to store

...

In previous case (updated field value has same length) only one page will be marked as dirty.

Page structure

Page types and headers

There is class PageIO - it is abstract class for reading and writing pages. Several implementations also exist: BplusIODataPageIO

...

Link allows to read K,V pair as N-th item in page.

Value delete from Data Page

Deletion of lastest latest added item is trivial. We can just remove Item, and Key-Value pair without any additional changes (see It3, K,V3 at picture).

...

  • We keep Item (2 bytes) at same place for moved K,V3. 
  • Indirect pointer is written into Item for K,V3 referring to Item with order 2. Item with order 2 refers to real data. See picture below.

...



This techique brings advantage: at inserting new K,V pair we don’t need to iterate over page to find free space. We can insert after latest item instead.

...

Example of re-insertion new K,V4 element after some other K,V2 deleted. Following picture shows Indirect Item to direct Item replacement 

 


BPlus Tree Structure

B+Tree structure is build mostly the same as binary tree. If requied value was not found, it is compared with some value in the tree. Greater values can be found using rigth link, less - using left.

 


Binary search is used to find required K. It is required to touch log N of data pages to complete search of value.

...

Hash Index is also B+ Tree (not hash table), key is hashcode and value is link.

Data Region Configuration

DataRegionConfiguration (historical naming is Memory Policy) is especially important when Ignite Persistent Store configuration is enabled.

...

All data may be separated by user to following classes: archive data and operational data:

Image Modified

Image Modified

User now can specify how much memory it is possible to allocate for cache group.

Reference tables (dictionaries) are usually small, and may be assigned to be allocated to memory always.

...

Storage Architecture: Ignite 2.x vs Ignite 1.x

Following The following paragraph summarizes the results of memory structure changes

...

Ignite node with persistent store enabled may now start to operate without reading all pages data from disk.

 


See alsohttps://apacheignite.readme.io/v2.1/docs/durable-memory

Persistent Store Architecture and Persistent Store Overview