Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

 

Excerpt

This article covers internal design of Durable Memory. Intented to Ignite developers

 

Contents

Table of Contents

...

Let us cover reasons why Ignite uses durable memory

1. Consider case Ignite is backed by 3rd party DB solution to store data, only required data was loaded into Ignite. But running any full scan query (which requires whole data set to be in-memory) causes loading all data into cache.

2. Before 2.0 there was persistence solution, Local File Store. But still, running Query selecting all cache data using old Consider simplest implementation of persistence - one or several memory mapped files. In this case running Query to select some cache entry by field using this model required us too long time.

Reasons of this are device features:

...

3. Main requirement: when node starts it should handle cache requests (start operationsto operate) without long delay to load all data from disk.

...

Query may require SQL index data to execute. It is required to build this index. If we use old modelmemory mapped file, we will have to read all data to process first query.

...

Page structure

Page types and headers

There is class PageIO - it is abstract class for reading and writing pages. Several implementations implementations also exist: BplusIODataPageIO

---

Every page contain Page header.  Page header includes

  • Type - 2 bytes,  determines class of page implementation
  • Version - 2 bytes
  • Crc - 4 bytes
  • pageId - for backward converting unsafe memory offset info page id (forward by page ID we can resolve offset in unsafe memory).
  • Reserved - 3*8 bytes

Data page has its own header in addition to general pageIt contains:

  • Free space - to avoid recalculation
  • Direct count
  • Indirect count

After header page header page is filled with contains items.

Item - internal reference (offset reference to payload in within page, . Item is 2 bytes in length.

See page structure at picture below:

Values are filled inserted into data page from the end to beginning. Items are filled from beginning to end.

Link (pageidTo address particular Key-Value pair ignite uses Link = Page ID + order in page.

Link allows to read K,V pair as N-th ) to KV pair allows to know exact item in page.

Value delete from Data Page

Deletion of last lastest added item is simple - we can remove It3, and K,V trivial. We can just remove Item, and Key-Value pair without any additional changes (see It3, K,V3 at picture).

Other More complex algorithm is activated for case of deletion of item from middle of page.

 

In that case

  • we remove uneeded key-value pair (K,V2 for example, see picture below)
  • we move data of element with higher number into space, that become free. (K,V3 for example)

In the same But also we need keep consistent linkexternal Link to K,V3 pair consistent (Page ID + order=3). This link may be referenced outside, for example, by BTreeB+Tree.

  • We keep

...

  • Item (2 bytes) at same place for moved K,V3. 
  • Indirect pointer is written

...

  • into Item for K,V3 referring to Item with order 2. Item with order 2 refers to real data. See picture below.

 

 

There is also compaction background process.

At insert This techique brings advantage: at inserting new K,V pair we don’t need to iterate over page to find free space, we still . We can insert after latest item instead.

During deletion of indirect items there is another process algorithm is activated.

Whole Page free size is tracked as free size, even if at corresponsing field, regardless to fragmentation occurred. Compaction may be required if insert is not possible. Compaction will change all K,V pairs offsets within page.

To get link from page id offset is written to highest bits of page id.

Fragmentation  Defragmentation is performed in values area, references to values are kept unmodified to achieve consistency in bB-tree.

To transform Page ID into Link it is possible to write offset to highest bits of Page ID.

 

Element add Example of re-insertion new K,V4 element after some deleted: Indirect to direct replacementother K,V2 deleted. Following picture shows Indirect Item to direct Item replacement

Image Modified

 Probable implementation - need to be verified against DataPageIO implementation

BPlus Tree Structure

B+Tree structure is build mostly the same as binary tree. If requied value was not found, it is compared with some value in the tree. Greater values can be found using rigth link, less - using left.

 

Link allows to read KV pair or K only.

Binary search is used, need to read and check Binary search is used to find required K. It is required to touch log N of data pages to complete search of value. Optimisation is

Link in B+ Tree allows to read K,V pair or K only depending to object size. There is optimisation done to avoid odd page reads: If indexed value requires less bytes than some threshold, value is written into threetree.

Duplicate keys is not possible in B-+ Tree.

Hash Index is also B-+ Tree (not hash table), key is hashcode and value is link.

...

Memory Policy is especially important when disc Ignite Persistent Store configuration is enabled.

Several caches in previous version were allocating may allocate uncontrolled amount of memory. First cache In case of limited resources first cache which perfomed allocations wins. There was no way to limit this memory for particular cache.

In new 2.0 Ignite version it is possible to control it using Memory Policy. 1 Memory One Memory Policy may include N cachesseveral caches (relation is 1.to.N)

All data Data may be separated in now in the end-user system to following classes: archive data and operational data

We User now can specify how much memory it is possible to allocate for cache or cache group.

Reference tables (dictionaries) are usually small, and may be assigned to be allocated to memory always.

 

Results of memory structure changes

Previous For previous Ignite versions - caches were on heap , offheap by default. Offheap caches were required configuration, and a number of small regions were allocated for each usage.

Too Quite long GC pause can look totally the same as failed node from the remote node's point of view.

More heap size used for data causes longer GC Pause. Long GC causes cluster failure bacames more probable.


For 2.1+ Ignite versions: Page memory is used for all data. Caches are placed in off-heap memory. These pages can now be saved and loaded from disk (transparently to user).

Ignite node with persistent store enabled may now start to operate without reading all pages data from disk.

 References

See also

https://apacheignite.readme.io/v2.1/docs/durable-memoryhttps://cwiki.apache.org/confluence/display/IGNITE/

Persistent + Store +Architecturehttps://cwiki.apache.org/confluence/display/IGNITE/Persistent+Store+OverviewArchitecture and Persistent Store Overview