Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Table of Contents:

Table of Contents

Ignite

...

Persistent Store

File types

There are following file types used for persisting data: Cache pages or page store, Checkpoint markers, and WAL segments.

...

Ignite with enabled persistence uses following folder structure:

2.3+Older versions (2.1 & 2.2)

Image Modified

The Pst-subfolder name is same for all storage folders.

A name is selected on start, may be based on node consistentId.


Expand

Image Modified

Consistent ID may be configured using IgniteConfiguration or generated from local IPs set by default.


Subfolders Generation

The subfolder name is generated on start. By default new style naming is used, for example node00-e819f611-3fb9-4dbe-a3aa-1f6de4af5d02

  • where 'node' is a constant prefix,
  • '00' is node index, it is an incrementing counter of local nodes under same PST root folder,
  • remaining is string representation UUID, and this UUID became node's consistent ID.

...


PST subfolder naming options explained:

...

Expand

1) A starting node binds to a port and generates old-style compatible consistent ID (e.g. 127.0.0.1:47500) using DiscoverySpi.consistentId(). This method still returns ip:port-based identifier.

2) The node scans the work directory and checks if there is a folder matching the consistent ID. (e.g. work\db\127_0_0_1_49999). If such a folder exists, we start up with this ID (compatibility mode), and we get file lock to this folder. See PdsConsistentIdProcessor.prepareNewSettings.

3) If there are no matching folders, but the directory is not empty, scan it for old-style consistent IDs. If there are old-style db folders, print out a warning (see warning text above), then switch to new style folder generation (step 4).

4) If there are existing new style folders, pick up the one with the smallest sequence number and try to lock the directory. Repeat until we succeed or until the list of new-style consistent IDs is empty. (e.g. work\db\node00-uuid, node01-uuid, etc).

5) If there are no more available new-style folders, generate a new one with next sequence number and random UUID as consistent ID. (e.g. work\db\node00-uuid, uuid overrides uuid in GridDiscoveryManager).

6) Use this consistent ID for the node startup (using value from GridKernalContext.pdsFolderResolver() and from PdsFolderSettings.consistentId()).

There is a system property to disable new-style generation and using old-style consistent ID (IgniteSystemProperties.IGNITE_DATA_STORAGE_FOLDER_BY_CONSISTENT_ID).

 


Page store

Ignite Durable Memory is the basis for all data structures. There is no cache state saved on heap now. 

...

Collection of pages (GridCacheDatabaseSharedManager.Checkpoint#cpPages) is a snapshot of dirty pages at checkpoint start. This collection allows writing pages which were changed since the last checkpoint.

Info

When the checkpoint process starts, pages marked for checkpoint are no longer marked as dirty ones in metrics.

Checkpoint Pool

In parallel with the process of writing pages to disk, some thread may need to update data in the page being written (or scheduled to being written).

...

  1. exponential backoff (start with ultra-short park, every next park will be <factor> times longer)
  2. and speed-based (collect the history of disk write speed measurements, extrapolate it to calculate "ideal" speed, and bound threads that generate dirty pages with that "ideal" speed) 

...

  1. .

...

  1. There are three main approaches:

    - Exponential backoff is used if over 2/3 of the checkpoint buffer is used up. If it is enabled, other throttling strategies are not used.
    - Clean pages protection is used if there are 0 checkpoint pages. It is used to protect pages at the start of the end of the checkpointing process.
    - Throttling is based on a comparison of the speed of checkpointing and dirty pages creation. Uses the speed of checkpointing at +10% as the throttling limit. 

Ignite node chooses one of them adaptively. 

How to detect that throttling is applied

There are two ways to find out that Pages Write Throttling affects data update operations.

  1. Take a thread dump -

How to detect that throttling is applied

There are two ways to find out that Pages Write Throttling affects data update operations.

  1. Take a thread dump - some threads will be waiting at LockSupport#parkNanos with "throttle" classes in a trace. Example stacktrace:

    Code Block
    "data-streamer-stripe-4-#14%pagemem.PagesWriteThrottleSandboxTest0%@2035" prio=5 tid=0x1e nid=NA waiting
      java.lang.Thread.State: WAITING
    	  at sun.misc.Unsafe.park(Unsafe.java:-1)
    	  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
    	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.doPark(PagesWriteSpeedBasedThrottle.java:232)
    	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.onMarkDirty(PagesWriteSpeedBasedThrottle.java:220)
    	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.allocatePage(PageMemoryImpl.java:463)
    	  at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.allocateDataPage(AbstractFreeList.java:463)
    	  at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.insertDataRow(AbstractFreeList.java:501)
    	  at org.apache.ignite.internal.processors.cache.persistence.RowStore.addRow(RowStore.java:102)
    	  at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.createRow(IgniteCacheOffheapManagerImpl.java:1300)
    	  at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.createRow(GridCacheOffheapManager.java:1438)
    	  at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4338)
    	  at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4296)
    	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3051)
    	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$6200(BPlusTree.java:2945)
    	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1717)
    	  ...


  2. If throttling is applied, related statistics is dumped to log from time to time:

    Code Block
    [2018-03-29 21:36:28,581][INFO ][data-streamer-stripe-0-#10%pagemem.PagesWriteThrottleSandboxTest0%][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0,92, markDirty=9905 pages/sec, checkpointWrite=6983 pages/sec, estIdealMarkDirty=41447 pages/sec, curDirty=0,07, maxDirty=0,26, avgParkTime=741864 ns, pages: (total=169883, evicted=0, written=112286, synced=0, cpBufUsed=15936, cpBufTotal=241312)]
    
    

    The most relevant part of this message is percentOfPartTime metric. In the example it's 0.92 - writing threads are stuck in LockSupport.parkNanos() for 92% of the time, which means very heavy throttling.
    Each message appears in the log when percentOfPartTime reaches 20% border. 

...

Planned future optimisation - refer data modified from PageDeltaRecord to logical record. Will allow to not store byte updates twice. There is file WAL pointer, pointer to record from the beginning of time. This refreence may be used.

 

WAL structure

 WAL consist of segments (files). The part of segments creates a work directory and files there are cyclically overwritten. Another part is archive - it is sequentially enumerated files, old files are deleted.

WAL file segments and rotation structureand rotation structure is shown at the picture below:

 

 


A number of segments may be not needed anymore (depending on History Size setting). Old fashion WAL History size setting is set in checkpoints number (See also WAL history size section below), the new one is set in bytes.  History size setting is mentioned here https://apacheignite.readme.io/docs/write-ahead-log#section-wal-archive

Local Recovery Process

Let’s assume node start process is running with existent files.

...

There several levels of guarantees (WALMode)



 

 
Implementation
Warranties
FSYNCfsync() on each commitAny crashes (OS and process crash)
LOG_ONY

write() on commit

Synchronisation is responsibility of OS

Kill process, but no OS fail
BACKGROUND

do nothing on commit

(records are accumulated in memory)

write() on timeout

kill -9 may cause loss of several latest updates
NONE

WAL is disabled

data is persisted only in case
of graceful cluster shutdown
(Ignite.cluster().active(false))

 


But there is several nodes containing same data and there is possible to restore data from other nodes.

...

Partition update counter is saved with update recods in WAL.

Node Join (with data from

...

persistence)

Consider partition on joining node was is owning state, update counter = 50. Existing nodes has update counter = 150

Node join causes partition map exchange, update counter is sent with other partition data. (Joining node will have new ID and from the point of view of dicsovery this node is a new node.)

Image RemovedImage Added

Coordinator observes older partition state and forces partition to moving state. Moving force is required to setup uploading newer data.

...

Page size configuration for storage path [/work/db/node00-3a1415b8-aa54-4a63-a40a-c75ad48dd6b8]: 4096; Linux memory page size: 4096; Selected FS block size : 4096.
Selected FS block size : 4096
Direct IO is enabled for block IO operations on aligned memory structures. [block size = 4096, durable memory page size = 4096]
 

However, disabling plugin’s function is possible through system Property. To disable Direct IO set IgniteSystemProperties#IGNITE_DIRECT_IO_ENABLED to false.

...