Ignite

...

Persistent Store

File types

The following file types are used for persisting data: cache pages (the page store), checkpoint markers, and WAL segments.

...

Ignite with persistence enabled uses the following folder structure:

2.3+

Older versions (2.1 & 2.2)

The PST subfolder name is the same for all storage folders.

The name is selected on start and may be based on the node's consistentId.

The consistent ID may be configured using IgniteConfiguration; by default, it is generated from the set of local IP addresses.

Subfolders Generation

The subfolder name is generated on start. By default, new-style naming is used, for example node00-e819f611-3fb9-4dbe-a3aa-1f6de4af5d02,

where:
- 'node' is a constant prefix;
- '00' is the node index, an incrementing counter of local nodes under the same PST root folder;
- the remainder is the string representation of a UUID, and this UUID becomes the node's consistent ID.
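The naming scheme above can be illustrated with a small sketch that builds and parses new-style folder names. The class and method names (SubfolderNames, format, nodeIndex, consistentId) are hypothetical and not Ignite's API; the two-digit zero padding of the index is assumed from the node00 example.

```java
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the new-style subfolder naming scheme (not Ignite code).
final class SubfolderNames {
    static final String PREFIX = "node"; // constant prefix

    // node<index>-<uuid>; the index is assumed to be zero-padded to two digits.
    static final Pattern NEW_STYLE = Pattern.compile(
        "node(\\d+)-([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})");

    static String format(int nodeIdx, UUID consistentId) {
        return PREFIX + String.format("%02d", nodeIdx) + "-" + consistentId;
    }

    /** Returns the node index, or -1 if the name is not a new-style folder name. */
    static int nodeIndex(String folderName) {
        Matcher m = NEW_STYLE.matcher(folderName);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Returns the UUID that becomes the node's consistent ID, or null for other names. */
    static UUID consistentId(String folderName) {
        Matcher m = NEW_STYLE.matcher(folderName);
        return m.matches() ? UUID.fromString(m.group(2)) : null;
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("e819f611-3fb9-4dbe-a3aa-1f6de4af5d02");
        System.out.println(format(0, id));                 // node00-e819f611-3fb9-4dbe-a3aa-1f6de4af5d02
        System.out.println(nodeIndex("node01-" + id));     // 1
        System.out.println(nodeIndex("127_0_0_1_47500"));  // -1 (old-style name)
    }
}
```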

...

PST subfolder naming options explained:

...

1) A starting node binds to a port and generates an old-style compatible consistent ID (e.g. 127.0.0.1:47500) using DiscoverySpi.consistentId(). This method still returns an ip:port-based identifier.

2) The node scans the work directory and checks whether there is a folder matching the consistent ID (e.g. work\db\127_0_0_1_49999). If such a folder exists, we start up with this ID (compatibility mode) and acquire a file lock on this folder. See PdsConsistentIdProcessor.prepareNewSettings.

3) If there is no matching folder but the directory is not empty, scan it for old-style consistent IDs. If old-style db folders are found, print a warning (see warning text above), then switch to new-style folder generation (step 4).

4) If new-style folders exist, pick the one with the smallest sequence number and try to lock it. Repeat until the lock succeeds or the list of new-style consistent IDs is exhausted (e.g. work\db\node00-uuid, node01-uuid, etc.).

5) If no new-style folder is available, generate a new one with the next sequence number and a random UUID as the consistent ID (e.g. work\db\node00-uuid; this UUID overrides the UUID in GridDiscoveryManager).

6) Use this consistent ID for node startup (the value is obtained from GridKernalContext.pdsFolderResolver() and PdsFolderSettings.consistentId()).
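Steps 2 to 5 can be sketched as a single resolution function. This is a simplified model, not PdsConsistentIdProcessor: the names are hypothetical, the tryLock predicate stands in for real file locking, the ip:port to folder-name mapping (non-alphanumeric characters replaced with '_') is assumed from the 127_0_0_1_49999 example, every non-new-style folder is assumed to be an old-style db folder, and "next sequence number" is assumed to mean the highest existing index plus one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of the PST subfolder selection steps (not Ignite code).
final class PstFolderResolver {
    static final Pattern NEW_STYLE = Pattern.compile(
        "node(\\d+)-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}");

    /** Node index of a new-style folder, or -1 for any other name. */
    static int nodeIndex(String folder) {
        Matcher m = NEW_STYLE.matcher(folder);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    /**
     * @param oldStyleId ip:port-based ID from step 1, e.g. "127.0.0.1:47500"
     * @param existing   folder names found under the PST root
     * @param tryLock    attempts to lock a folder, returns true on success
     * @param randomId   UUID to use if a new folder has to be generated
     */
    static String resolve(String oldStyleId, List<String> existing,
                          Predicate<String> tryLock, UUID randomId) {
        // Step 2: compatibility mode if a folder matches the old-style ID.
        String oldFolder = oldStyleId.replaceAll("[^A-Za-z0-9]", "_");
        if (existing.contains(oldFolder) && tryLock.test(oldFolder))
            return oldFolder;

        // Step 3: warn about old-style folders, collect new-style ones.
        List<String> newStyle = new ArrayList<>();
        for (String f : existing) {
            if (nodeIndex(f) >= 0)
                newStyle.add(f);
            else
                System.err.println("WARNING: old-style persistence folder found: " + f);
        }

        // Step 4: try new-style folders, smallest sequence number first.
        newStyle.sort(Comparator.comparingInt(PstFolderResolver::nodeIndex));
        for (String f : newStyle)
            if (tryLock.test(f))
                return f;

        // Step 5: generate a new folder with the next index and a random UUID.
        int next = newStyle.isEmpty() ? 0 : nodeIndex(newStyle.get(newStyle.size() - 1)) + 1;
        return "node" + String.format("%02d", next) + "-" + randomId;
    }
}
```

For example, with folders node00-… and node01-… present and node00 locked by another local node, the sketch returns node01-…; if both are locked, it generates node02-<randomId>.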

A system property, IgniteSystemProperties.IGNITE_DATA_STORAGE_FOLDER_BY_CONSISTENT_ID, disables new-style generation and makes the node use the old-style consistent ID.

Page store

Ignite Durable Memory is the basis for all data structures; cache state is no longer stored on heap.

...

The collection of pages (GridCacheDatabaseSharedManager.Checkpoint#cpPages) is a snapshot of the pages that were dirty when the checkpoint started. It allows the checkpoint to write exactly the pages that were changed since the last checkpoint.

Info
When the checkpoint process starts, pages selected for the checkpoint are no longer counted as dirty in metrics.
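The snapshot idea and its effect on the dirty-pages metric can be modelled with an atomic swap of the dirty set. This is a hypothetical sketch (DirtyPageTracker is not an Ignite class); a real implementation must also coordinate writers with the checkpoint start (Ignite uses a checkpoint lock for this), which the sketch omits.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the dirty-page snapshot taken at checkpoint start.
final class DirtyPageTracker {
    // IDs of pages modified since the last checkpoint started.
    private final AtomicReference<Set<Long>> dirty =
        new AtomicReference<>(ConcurrentHashMap.newKeySet());

    /** Called by writers when they modify a page. */
    void markDirty(long pageId) {
        // Without a checkpoint lock, a call racing with beginCheckpoint() may land in the old set.
        dirty.get().add(pageId);
    }

    /** Value reported by the "dirty pages" metric. */
    int dirtyPagesMetric() {
        return dirty.get().size();
    }

    /** Swaps in an empty set; the returned snapshot plays the role of cpPages. */
    Set<Long> beginCheckpoint() {
        Set<Long> snapshot = dirty.getAndSet(ConcurrentHashMap.newKeySet());
        // Pages in the snapshot no longer count as dirty in the metric.
        return Collections.unmodifiableSet(snapshot);
    }
}
```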

Checkpoint Pool

While pages are being written to disk, a thread may need to update data in a page that is being written (or is scheduled to be written).

...

exponential backoff (start with an ultra-short park; each subsequent park is <factor> times longer)
and speed-based (collect the history of disk write speed measurements, extrapolate it to calculate the "ideal" speed, and limit threads that generate dirty pages to that "ideal" speed)
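The exponential backoff variant can be sketched as a park duration that grows by a constant factor on each consecutive throttled operation and resets once throttling is no longer needed. The class name, the parameters and the upper cap are hypothetical, chosen for illustration; they are not Ignite's actual constants.

```java
import java.util.concurrent.locks.LockSupport;

// Hypothetical exponential backoff throttle: each consecutive park is `factor` times longer.
final class ExponentialBackoff {
    private final long startNanos; // ultra-short first park
    private final double factor;   // growth factor between consecutive parks
    private final long maxNanos;   // cap so a writer is never parked indefinitely
    private long nextNanos;

    ExponentialBackoff(long startNanos, double factor, long maxNanos) {
        if (startNanos <= 0 || factor <= 1.0 || maxNanos < startNanos)
            throw new IllegalArgumentException("invalid backoff parameters");
        this.startNanos = startNanos;
        this.factor = factor;
        this.maxNanos = maxNanos;
        this.nextNanos = startNanos;
    }

    /** Returns the current park duration and grows the next one (capped at maxNanos). */
    long nextParkNanos() {
        long cur = nextNanos;
        // Math.max(cur + 1, ...) guarantees growth even if cur * factor rounds down to cur.
        nextNanos = Math.min(maxNanos, Math.max(cur + 1, (long) (cur * factor)));
        return cur;
    }

    /** Called when throttling is no longer needed: the next park is ultra-short again. */
    void reset() {
        nextNanos = startNanos;
    }

    /** Parks the calling writer thread for the current backoff duration. */
    void park() {
        LockSupport.parkNanos(nextParkNanos());
    }
}
```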

...

...

There are three main approaches:

- Exponential backoff is used if more than 2/3 of the checkpoint buffer is used up. When it is applied, the other throttling strategies are not used.
- Clean pages protection is used if there are 0 checkpoint pages. It is used to protect pages at the start or at the end of the checkpointing process.
- Speed-based throttling compares the checkpoint write speed with the rate at which pages become dirty, and uses the checkpoint speed plus 10% as the throttling limit.
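The three rules above can be expressed as a small decision function, plus the speed-based limit and the park time that would hold one writer thread at that limit. All names are hypothetical; the thresholds (2/3 of the checkpoint buffer, checkpoint speed + 10%) are taken from the list, while the park-time formula is an illustrative assumption for a single writer thread, not Ignite's exact computation.

```java
// Hypothetical sketch of the throttling rules listed above (not Ignite code).
final class ThrottlePolicy {
    enum Strategy { EXPONENTIAL_BACKOFF, CLEAN_PAGES_PROTECTION, SPEED_BASED }

    static Strategy choose(long cpBufUsed, long cpBufTotal, long checkpointPages) {
        // More than 2/3 of the checkpoint buffer used (integer form of used/total > 2/3).
        if (cpBufUsed * 3 > cpBufTotal * 2)
            return Strategy.EXPONENTIAL_BACKOFF;
        // No pages in the current checkpoint.
        if (checkpointPages == 0)
            return Strategy.CLEAN_PAGES_PROTECTION;
        return Strategy.SPEED_BASED;
    }

    /** Speed-based limit: checkpoint write speed plus 10%. */
    static double limitPagesPerSec(double cpWritePagesPerSec) {
        return cpWritePagesPerSec * 1.10;
    }

    static boolean shouldThrottle(double markDirtyPagesPerSec, double cpWritePagesPerSec) {
        return markDirtyPagesPerSec > limitPagesPerSec(cpWritePagesPerSec);
    }

    /**
     * Park time that keeps a single writer thread at or below the limit: each page
     * should take at least 1e9 / limit ns, so park for the rest of that interval.
     */
    static long parkNanos(double cpWritePagesPerSec, long nanosSinceLastPage) {
        long interval = (long) (1_000_000_000L / limitPagesPerSec(cpWritePagesPerSec));
        return Math.max(0L, interval - nanosSinceLastPage);
    }
}
```

Using the speeds from the sample log message further below (markDirty=9905, checkpointWrite=6983 pages/sec), shouldThrottle returns true, since 9905 exceeds the limit of about 7681 pages/sec.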

The Ignite node chooses among them adaptively.

How to detect that throttling is applied

There are two ways to find out that Pages Write Throttling affects data update operations.

Take a thread dump: some threads will be waiting in LockSupport#parkNanos with "throttle" classes in the stack trace. Example stack trace:

"data-streamer-stripe-4-#14%pagemem.PagesWriteThrottleSandboxTest0%@2035" prio=5 tid=0x1e nid=NA waiting
  java.lang.Thread.State: WAITING
	  at sun.misc.Unsafe.park(Unsafe.java:-1)
	  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.doPark(PagesWriteSpeedBasedThrottle.java:232)
	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.onMarkDirty(PagesWriteSpeedBasedThrottle.java:220)
	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.allocatePage(PageMemoryImpl.java:463)
	  at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.allocateDataPage(AbstractFreeList.java:463)
	  at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.insertDataRow(AbstractFreeList.java:501)
	  at org.apache.ignite.internal.processors.cache.persistence.RowStore.addRow(RowStore.java:102)
	  at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.createRow(IgniteCacheOffheapManagerImpl.java:1300)
	  at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.createRow(GridCacheOffheapManager.java:1438)
	  at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4338)
	  at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4296)
	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3051)
	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$6200(BPlusTree.java:2945)
	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1717)
	  ...
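The thread-dump check can also be automated. The sketch below flags threads whose stack contains both LockSupport.parkNanos and a class with "Throttle" in its name, as in the trace above. It assumes it runs inside the node's JVM (Thread.getAllStackTraces() only sees the current process); for a separate process, the same check would be applied to jstack output. ThrottleDetector is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical detector for threads parked by page write throttling.
final class ThrottleDetector {
    static boolean isThrottled(StackTraceElement[] stack) {
        boolean parked = false;
        boolean throttle = false;
        for (StackTraceElement e : stack) {
            if (e.getClassName().equals("java.util.concurrent.locks.LockSupport")
                && e.getMethodName().equals("parkNanos"))
                parked = true;
            // e.g. ...pagemem.PagesWriteSpeedBasedThrottle
            if (e.getClassName().contains("Throttle"))
                throttle = true;
        }
        return parked && throttle;
    }

    /** Names of threads (keyed by name) that are currently throttled. */
    static List<String> throttledThreads(Map<String, StackTraceElement[]> dump) {
        List<String> res = new ArrayList<>();
        dump.forEach((name, stack) -> {
            if (isThrottled(stack))
                res.add(name);
        });
        return res;
    }

    public static void main(String[] args) {
        Map<String, StackTraceElement[]> dump = new HashMap<>();
        Thread.getAllStackTraces().forEach((t, s) -> dump.put(t.getName(), s));
        System.out.println("Throttled threads: " + throttledThreads(dump));
    }
}
```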

Check the log: if throttling is applied, related statistics are written to the log from time to time:

[2018-03-29 21:36:28,581][INFO ][data-streamer-stripe-0-#10%pagemem.PagesWriteThrottleSandboxTest0%][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0,92, markDirty=9905 pages/sec, checkpointWrite=6983 pages/sec, estIdealMarkDirty=41447 pages/sec, curDirty=0,07, maxDirty=0,26, avgParkTime=741864 ns, pages: (total=169883, evicted=0, written=112286, synced=0, cpBufUsed=15936, cpBufTotal=241312)]

The most relevant part of this message is the percentOfPartTime metric. In the example it is 0.92: writing threads spend 92% of their time parked in LockSupport.parkNanos(), which means very heavy throttling.
A message appears in the log when percentOfPartTime reaches the 20% threshold.
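When scanning logs for this message, the metric can be extracted with a regular expression. Note that the sample line prints decimals with a comma (locale-dependent), so the sketch accepts both separators. ThrottleLogParser and the alert threshold passed to isHeavy are hypothetical, for illustration only.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the "Throttling is applied" log message.
final class ThrottleLogParser {
    // Accept both "0.92" and "0,92" (the sample log uses a comma).
    private static final Pattern PERCENT =
        Pattern.compile("percentOfPartTime=(\\d+(?:[.,]\\d+)?)");

    static OptionalDouble percentOfPartTime(String logLine) {
        Matcher m = PERCENT.matcher(logLine);
        if (!m.find())
            return OptionalDouble.empty();
        return OptionalDouble.of(Double.parseDouble(m.group(1).replace(',', '.')));
    }

    /** True if writers were parked for at least `threshold` of the time (e.g. 0.5). */
    static boolean isHeavy(String logLine, double threshold) {
        OptionalDouble v = percentOfPartTime(logLine);
        return v.isPresent() && v.getAsDouble() >= threshold;
    }
}
```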

...

Planned future optimisation: make PageDeltaRecord refer to the modified data in the logical record instead of storing it, so that byte updates are not stored twice. A file WAL pointer (a pointer to a record counted from the beginning of time) could be used as this reference.

WAL structure

The WAL consists of segments (files). Some of the segments form the work directory, where files are cyclically overwritten. The rest form the archive: sequentially numbered files, from which old files are deleted.
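The work/archive split can be modelled by mapping an absolute (ever-increasing) segment index to a cyclic slot in the work directory and to a sequentially numbered archive file. The class name and the archive file naming (index zero-padded to 16 digits with a .wal extension) are assumptions for illustration.

```java
// Hypothetical model of WAL segment placement (file naming is an assumption).
final class WalSegments {
    /** Work-directory file reused by absolute segment index absIdx (cyclic overwrite). */
    static int workSlot(long absIdx, int workSegments) {
        return (int) (absIdx % workSegments);
    }

    /** Archive file name: the absolute index, assumed zero-padded to 16 digits. */
    static String archiveFileName(long absIdx) {
        return String.format("%016d.wal", absIdx);
    }

    public static void main(String[] args) {
        // With 10 work segments, absolute segment 23 reuses work slot 3
        // and, once archived, becomes 0000000000000023.wal.
        System.out.println(workSlot(23, 10) + " " + archiveFileName(23));
    }
}
```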

The WAL file segment and rotation structure is shown in the picture below:

Some segments may no longer be needed (depending on the History Size setting). The old-fashioned WAL History Size setting is specified as a number of checkpoints (see also the WAL history size section below); the new one is specified in bytes. The history size setting is described here: https://apacheignite.readme.io/docs/write-ahead-log#section-wal-archive
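A byte-based history limit can be illustrated by counting how many of the oldest archived segments must go for the archive to fit. This is a simplified sketch with hypothetical names: it does not model the constraint that segments still needed for recovery (e.g. since the last checkpoint) must not be deleted, and keeping at least the newest segment is an illustrative choice.

```java
import java.util.List;

// Hypothetical sketch of trimming the WAL archive to a size limit in bytes.
final class WalArchiveTrimmer {
    /**
     * @param segmentSizes    sizes of archived segments in bytes, oldest first
     * @param maxArchiveBytes configured archive size limit in bytes
     * @return how many of the oldest segments can be deleted
     */
    static int segmentsToDelete(List<Long> segmentSizes, long maxArchiveBytes) {
        long total = 0;
        for (long s : segmentSizes)
            total += s;
        int n = 0;
        // Delete oldest first until the archive fits (always keep the newest segment).
        while (total > maxArchiveBytes && n < segmentSizes.size() - 1) {
            total -= segmentSizes.get(n);
            n++;
        }
        return n;
    }
}
```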

Local Recovery Process

Let’s assume the node is starting with existing persistence files.

...

There are several levels of guarantees (WALMode):

Mode	Implementation	Guarantees
FSYNC	fsync() on each commit	Survives any crash (OS or process)
LOG_ONLY	write() on commit; synchronization is the responsibility of the OS	Survives a process kill, but not an OS failure
BACKGROUND	Nothing on commit (records are accumulated in memory); write() on timeout	kill -9 may cause loss of the latest few updates
NONE	WAL is disabled	Data is persisted only on graceful cluster shutdown (Ignite.cluster().active(false))

However, several nodes contain the same data, so it is possible to restore data from other nodes.

...

The partition update counter is saved with update records in the WAL.

Node Join (with data from persistence)

...

Consider a partition on the joining node that was in owning state with update counter = 50, while the existing nodes have update counter = 150.

The node join causes a partition map exchange, and the update counter is sent along with other partition data. (The joining node will have a new ID, so from the discovery point of view it is a new node.)

The coordinator observes the older partition state and forces the partition into moving state. Forcing the moving state is required to set up loading of the newer data.
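The coordinator's decision in this example can be reduced to a comparison of update counters. The sketch below is hypothetical (names are illustrative) and only compares counters; it does not show how the missing updates are actually transferred to the joining node.

```java
// Hypothetical sketch of the coordinator-side check described above.
final class PartitionStateCheck {
    enum State { OWNING, MOVING }

    /** A partition stays owning only if its counter is not behind the newest one in the cluster. */
    static State stateAfterExchange(long localCounter, long maxClusterCounter) {
        return localCounter < maxClusterCounter ? State.MOVING : State.OWNING;
    }

    /** Number of updates the joining node is missing. */
    static long missingUpdates(long localCounter, long maxClusterCounter) {
        return Math.max(0, maxClusterCounter - localCounter);
    }

    public static void main(String[] args) {
        // Example from the text: joining node at 50, existing nodes at 150.
        System.out.println(stateAfterExchange(50, 150) + ", missing " + missingUpdates(50, 150));
    }
}
```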

...

Page size configuration for storage path [/work/db/node00-3a1415b8-aa54-4a63-a40a-c75ad48dd6b8]: 4096; Linux memory page size: 4096; Selected FS block size : 4096.
Direct IO is enabled for block IO operations on aligned memory structures. [block size = 4096, durable memory page size = 4096]

However, the plugin's functionality can be disabled through a system property: to disable Direct IO, set IgniteSystemProperties#IGNITE_DIRECT_IO_ENABLED to false.
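The conditions suggested by the log lines above (property not disabled, page size aligned with the file-system block size) can be sketched as a simple check. This is a hypothetical illustration, not the plugin's code: it assumes the property's string value equals its constant name, that an unset property means "enabled", and that the durable memory page size must be a multiple of the FS block size, since direct I/O generally requires block-aligned I/O sizes and offsets.

```java
// Hypothetical Direct IO applicability check (assumptions stated above).
final class DirectIoCheck {
    static final String PROP = "IGNITE_DIRECT_IO_ENABLED";

    /**
     * @param propValue   value of the system property, or null if it is not set
     * @param pageSize    durable memory page size in bytes
     * @param fsBlockSize selected file-system block size in bytes
     */
    static boolean useDirectIo(String propValue, int pageSize, int fsBlockSize) {
        boolean enabled = propValue == null || Boolean.parseBoolean(propValue);
        // Direct I/O generally needs I/O sizes aligned to the block size.
        return enabled && fsBlockSize > 0 && pageSize % fsBlockSize == 0;
    }

    public static void main(String[] args) {
        // e.g. run with -DIGNITE_DIRECT_IO_ENABLED=false to see the disabled case.
        System.out.println(useDirectIo(System.getProperty(PROP), 4096, 4096));
    }
}
```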

...
