...

Table of Contents

Ignite

...

Persistent Store

File types

The following file types are used for persisting data: cache pages (the page store), checkpoint markers, and WAL segments.

...

Ignite with persistence enabled uses the following folder structure:

2.3+ / Older versions (2.1 & 2.2)


The PST subfolder name is the same for all storage folders.

The name is selected on start and may be based on the node's consistentId.




The consistent ID may be configured using IgniteConfiguration, or it is generated from the set of local IP addresses by default.
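
For illustration, here is a minimal sketch of setting the consistent ID explicitly so that the PST subfolder name stays stable across restarts; the "node1" value and the rest of the configuration are illustrative, not taken from this page:

Code Block
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ConsistentIdExample {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Explicit consistent ID; if omitted, Ignite derives one (by default from local IPs and ports).
            cfg.setConsistentId("node1");

            // Enable persistence so the PST folder structure described above is created.
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();
            storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
            cfg.setDataStorageConfiguration(storageCfg);

            try (Ignite ignite = Ignition.start(cfg)) {
                ignite.cluster().active(true); // activate the cluster to allow updates with persistence
            }
        }
    }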


Subfolders Generation

The subfolder name is generated on start. By default, new-style naming is used, for example node00-e819f611-3fb9-4dbe-a3aa-1f6de4af5d02:

  • where 'node' is a constant prefix,
  • '00' is the node index, an incrementing counter of local nodes under the same PST root folder,
  • and the remainder is the string representation of a UUID; this UUID becomes the node's consistent ID.

...


PST subfolder naming options explained:

...


1) A starting node binds to a port and generates an old-style compatible consistent ID (e.g. 127.0.0.1:47500) using DiscoverySpi.consistentId(). This method still returns an ip:port-based identifier.

2) The node scans the work directory and checks if there is a folder matching the consistent ID (e.g. work\db\127_0_0_1_49999). If such a folder exists, we start up with this ID (compatibility mode) and acquire a file lock on this folder. See PdsConsistentIdProcessor.prepareNewSettings.

3) If there are no matching folders, but the directory is not empty, scan it for old-style consistent IDs. If there are old-style db folders, print out a warning (see the warning text above), then switch to new-style folder generation (step 4).

4) If there are existing new-style folders, pick the one with the smallest sequence number and try to lock its directory. Repeat until the lock succeeds or until the list of new-style consistent IDs is empty (e.g. work\db\node00-uuid, node01-uuid, etc.).

5) If there are no more available new-style folders, generate a new one with the next sequence number and a random UUID as the consistent ID (e.g. work\db\node00-uuid; this UUID overrides the UUID in GridDiscoveryManager).

6) Use this consistent ID for the node startup (using the value from GridKernalContext.pdsFolderResolver() and from PdsFolderSettings.consistentId()).

There is a system property to disable new-style generation and use the old-style consistent ID (IgniteSystemProperties.IGNITE_DATA_STORAGE_FOLDER_BY_CONSISTENT_ID).
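
For example, a sketch of switching back to old-style folder naming via this property (it can equally be passed as a -D JVM argument); it must be set before the node starts:

Code Block
    // Sketch: force old-style (consistent-ID-based) storage folder naming.
    System.setProperty(
        org.apache.ignite.IgniteSystemProperties.IGNITE_DATA_STORAGE_FOLDER_BY_CONSISTENT_ID,
        "true");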

 


Page store

Ignite Durable Memory is the basis for all data structures. There is no cache state saved on the heap now.

...

Usage of checkpointLock provides the following guarantees:

  • 1 checkpoint begin and 0 updates (no transaction commit()s) running, or
  • 0 checkpoint begins and N updates (transaction commit()s) running

Checkpoint begin does not wait for transactions to finish. That means a transaction may start before a checkpoint, but it will be committed after the checkpoint ends or during its run.

The durable memory maintains a set of dirty pages. When a page goes from non-dirty to dirty, it is added to this set.

The collection of pages (GridCacheDatabaseSharedManager.Checkpoint#cpPages) is a snapshot of dirty pages taken at checkpoint start. This collection allows writing only the pages which were changed since the last checkpoint.


Info

When the checkpoint process starts, pages marked for checkpoint are no longer marked as dirty ones in metrics.

Checkpoint Pool

In parallel with the process of writing pages to disk, some thread may need to update data in the page being written (or scheduled to be written).

For such cases, the checkpoint pool (or checkpoint buffer) is used for pages being updated in parallel with the checkpoint write. This pool has a limited size.

A copy-on-write technique is used. If a modification is required in a page which is under checkpoint, Ignite creates a temporary copy of this page in the checkpoint pool.


To perform a write to a dirty page scheduled to be checkpointed, the following is done (see the sketch after this list):

  • a write lock is acquired on the page to protect it from inconsistent changes,
  • then a full copy of the page is created in the checkpoint pool (the checkpoint pool is the latest chunk of each durable memory segment),
  • actual data is written to the regular segment chunk after the copy is created,
  • and later the copy of the page from the checkpoint pool is written to disk.
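
The following pseudocode sketch summarizes this path; names such as page, checkpoint and checkpointPool are illustrative and do not correspond to the actual Ignite classes:

Code Block
    // Illustrative pseudocode of the copy-on-write path for a page under checkpoint.
    void writeToPage(Page page, byte[] update) {
        page.writeLock();                      // protect the page from inconsistent changes
        try {
            // Full copy into the checkpoint pool if the page belongs to the running checkpoint
            // and has not been copied yet.
            if (checkpoint.contains(page) && !checkpointPool.contains(page))
                checkpointPool.copy(page);

            page.apply(update);                // actual data goes to the regular segment chunk
        }
        finally {
            page.writeUnlock();
        }
        // Later the copy from the checkpoint pool (not the live page) is written to disk.
    }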

If a page was not involved in the checkpoint initially, and it is updated concurrently with the checkpointing process, the following is done:

  • it is updated directly in memory, bypassing the checkpoint pool,
  • and the actual data will be stored to disk during the next checkpoint.

When a dirty page is flushed to disk, its dirty flag is cleared. Every future write to such a page (which was initially involved in the checkpoint, but was already flushed) does not require the checkpoint pool; it is written directly into a segment.

Checkpoint Triggers

The following events trigger checkpointing:

  • The percentage of dirty pages is a trigger for checkpointing (e.g. 75%).
  • Timeout is a trigger as well: the user may specify to do a checkpoint every N seconds (see the configuration sketch after this list).

...

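For illustration, the timeout trigger can be configured through the checkpoint frequency. A configuration sketch follows (the 75% dirty-pages threshold mentioned above is an internal default and is not set here):

Code Block
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    // Sketch: trigger a checkpoint by timeout every 30 seconds.
    IgniteConfiguration cfg = new IgniteConfiguration();
    DataStorageConfiguration storageCfg = new DataStorageConfiguration();
    storageCfg.setCheckpointFrequency(30_000L); // milliseconds between timeout-triggered checkpoints
    cfg.setDataStorageConfiguration(storageCfg);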

Pages Write Throttling

Sharp Checkpointing has side effects when the throughput of data updates is higher than the throughput of the physical storage device. Under a heavy write load, the operations per second rate periodically drops to zero:

...

When offheap memory accumulates too many dirty pages (pages with data not written to disk yet), the Ignite node initiates a checkpoint: a process of writing a consistent snapshot of all pages to disk storage. If a dirty page is changed during an ongoing checkpoint before being written to disk, its previous state is copied to a special data region, the checkpoint buffer:

...

Slow storage devices cause long-running checkpoints, and if the load is high while a checkpoint is slow, two bad things can happen:

  • Checkpoint buffer can overflow
  • Dirty pages threshold for next checkpoint can be reached during current checkpoint

Either of the two events above causes the Ignite node to freeze all updates until the end of the current checkpoint. That's why the operations/sec graph falls to zero.

Since Ignite 2.3, the data storage configuration has a writeThrottlingEnabled property. There are two possible situations that can trigger throttling (see the configuration sketch below):

...

Checkpoint buffer overflow protection is always enabled. 

  • If the checkpoint buffer is being filled too fast: a fill ratio of more than 66% triggers throttling.

If writeThrottlingEnabled is set, the following situation can also trigger throttling:

  • If the percentage of dirty pages increases too rapidly.
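
A configuration sketch for enabling write throttling (assuming the 2.3+ DataStorageConfiguration API):

Code Block
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    // Sketch: enable pages write throttling; checkpoint buffer overflow protection stays on regardless.
    IgniteConfiguration cfg = new IgniteConfiguration();
    DataStorageConfiguration storageCfg = new DataStorageConfiguration();
    storageCfg.setWriteThrottlingEnabled(true);
    cfg.setDataStorageConfiguration(storageCfg);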

If throttling is triggered, threads that generate dirty pages are slowed down with LockSupport.parkNanos(). Throttling stops when neither of the two conditions above is true (or when the checkpoint finishes). As a result, a node provides a constant operations/sec rate at the speed of the storage device instead of an initial burst followed by a long freeze.

There are two approaches to calculate the necessary time to park a thread:

  1. exponential backoff (start with an ultra-short park, every next park is <factor> times longer),
  2. and speed-based (collect the history of disk write speed measurements, extrapolate it to calculate the "ideal" speed, and bound threads that generate dirty pages to that "ideal" speed).

There are three main strategies:

    - Exponential backoff is used if over 2/3 of the checkpoint buffer is used up. If it kicks in, the other throttling strategies are not used.
    - Clean pages protection is used if there are 0 checkpoint pages; it protects pages at the start and at the end of the checkpointing process.
    - Otherwise, throttling is based on a comparison of the checkpointing speed and the dirty pages creation speed, using the checkpointing speed plus 10% as the throttling limit.

The Ignite node chooses one of the approaches adaptively (a sketch of the exponential backoff idea is shown below).
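
A minimal sketch of the exponential backoff idea; the starting park time and the growth factor below are made-up illustrative values, not the ones Ignite actually uses:

Code Block
    import java.util.concurrent.locks.LockSupport;

    // Illustrative exponential backoff: every consecutive park is <factor> times longer.
    class ExponentialBackoffSketch {
        private final long startParkNs = 4_000L;     // illustrative starting park time
        private final double backoffFactor = 1.05;   // illustrative growth factor
        private int consecutiveParks;                // counts parks since throttling started

        void throttleOnce() {
            long parkNs = (long) (startParkNs * Math.pow(backoffFactor, consecutiveParks++));
            LockSupport.parkNanos(parkNs);
        }

        void reset() {                               // called when throttling is no longer needed
            consecutiveParks = 0;
        }
    }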

How to detect that throttling is applied

...

  1. Take a thread dump - some threads will be waiting at LockSupport#parkNanos with "throttle" classes in a trace. Example stacktrace:

    Code Block
    "data-streamer-stripe-4-#14%pagemem.PagesWriteThrottleSandboxTest0%@2035" prio=5 tid=0x1e nid=NA waiting
      java.lang.Thread.State: WAITING
    	  at sun.misc.Unsafe.park(Unsafe.java:-1)
    	  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
    	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.doPark(PagesWriteSpeedBasedThrottle.java:232)
    	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.onMarkDirty(PagesWriteSpeedBasedThrottle.java:220)
    	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.allocatePage(PageMemoryImpl.java:463)
    	  at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.allocateDataPage(AbstractFreeList.java:463)
    	  at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.insertDataRow(AbstractFreeList.java:501)
    	  at org.apache.ignite.internal.processors.cache.persistence.RowStore.addRow(RowStore.java:102)
    	  at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.createRow(IgniteCacheOffheapManagerImpl.java:1300)
    	  at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.createRow(GridCacheOffheapManager.java:1438)
    	  at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4338)
    	  at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4296)
    	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3051)
    	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$6200(BPlusTree.java:2945)
    	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1717)
    	  ...


  2. If throttling is applied, related statistics are dumped to the log from time to time:

    Code Block
    [2018-03-29 21:36:28,581][INFO ][data-streamer-stripe-0-#10%pagemem.PagesWriteThrottleSandboxTest0%][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0,92, markDirty=9905 pages/sec, checkpointWrite=6983 pages/sec, estIdealMarkDirty=41447 pages/sec, curDirty=0,07, maxDirty=0,26, avgParkTime=741864 ns, pages: (total=169883, evicted=0, written=112286, synced=0, cpBufUsed=15936, cpBufTotal=241312)]
    
    

    The most relevant part of this message is the percentOfPartTime metric. In the example it's 0.92: writing threads are stuck in LockSupport.parkNanos() for 92% of the time, which means very heavy throttling.
    The message appears in the log when percentOfPartTime reaches the 20% threshold.

CRC validation

For each page, a control checksum (CRC32) is calculated. The CRC calculation is performed before each page is written to the page store. When a page is read from the page store, CRC is again calculated against the read data and validated against the CRC field also stored in the page.
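
For illustration only, the general idea can be expressed with java.util.zip.CRC32 (Ignite uses its own internal CRC implementation; this is not the actual page store code):

Code Block
    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    // Sketch: checksum a page buffer before writing it, then compare with the stored value after reading it back.
    static int pageCrc(ByteBuffer pageBuf) {
        CRC32 crc = new CRC32();
        crc.update(pageBuf.duplicate()); // duplicate() so the original buffer position is untouched
        return (int) crc.getValue();
    }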

If CRC validation fails, Ignite logs the failure and attempts to recover the page from the WAL. If recovery succeeds, the node keeps running. If recovery fails, the node shuts itself down.

WAL

We can't control the moment when a node crashes. Let's suppose we have saved the tree leaves but didn't save the tree root (during page allocation, writes may be reordered because allocation is multithreaded). In this case, all updates can be lost.

At the same time, we can't translate each memory page update into a disk write operation: it is too slow.

The technique to solve this is named write-ahead logging. Before doing an actual update, we append the planned change information into a cyclic file named the WAL log (the WAL write operation is named WAL append/WAL log).

After a crash, we can read and replay the WAL using the already saved page set. We can restore to the last committed state of the crashed process. The restore operation is based on both the page store and the WAL.

Practically, we can't replay the WAL from the beginning of time, since Volume(HDD) < Volume(full WAL). We need a procedure to throw out the oldest part of the changes in the WAL, and this is done during checkpointing.

Consistent state comes only from a pair of WAL records and page store data.

An operation is acknowledged after the operation and the page update(s) have been logged. A checkpoint will be started later by its triggers.

...

A planned future optimisation is to make a PageDeltaRecord refer to the logical record that contains the modified data; this will allow not storing byte updates twice. There is a file WAL pointer (a pointer to a record from the beginning of time), and this reference may be used.

 

WAL structure

The WAL consists of segments (files). One part of the segments forms the work directory, where files are cyclically overwritten. Another part is the archive: sequentially enumerated files, of which the oldest are eventually deleted.

The WAL file segments and rotation structure is shown in the picture below:

 

 


A number of segments may not be needed anymore (depending on the History Size setting). The old-fashioned WAL history size setting is specified as a number of checkpoints (see also the WAL history size section below); the new one is specified in bytes. The history size setting is mentioned here: https://apacheignite.readme.io/docs/write-ahead-log#section-wal-archive
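
A configuration sketch showing both styles of the setting (setWalHistorySize counts checkpoints; setMaxWalArchiveSize, available in newer Ignite versions, is specified in bytes):

Code Block
    import org.apache.ignite.configuration.DataStorageConfiguration;

    // Sketch: WAL history size, old style (number of checkpoints) vs. new style (bytes).
    DataStorageConfiguration storageCfg = new DataStorageConfiguration();
    storageCfg.setWalHistorySize(20);                         // old style: keep WAL for the last 20 checkpoints
    storageCfg.setMaxWalArchiveSize(4L * 1024 * 1024 * 1024); // new style: cap the WAL archive at 4 GB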

Local Recovery Process

Let's assume the node start process is running with existing files.

...

There are several levels of guarantees (WALMode):

WALMode    | Implementation                                                                | Guarantees
FSYNC      | fsync() on each commit                                                        | Survives any crash (OS and process crash)
LOG_ONLY   | write() on commit; synchronisation is the responsibility of the OS            | Survives a process kill, but not an OS failure
BACKGROUND | do nothing on commit (records are accumulated in memory), write() on timeout  | kill -9 may cause loss of several latest updates
NONE       | WAL is disabled                                                               | Data is persisted only in case of graceful cluster shutdown (Ignite.cluster().active(false))
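
A configuration sketch for selecting one of the WAL modes from the table above:

Code Block
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.WALMode;

    // Sketch: choose the WAL durability level (LOG_ONLY survives a process kill, but not an OS failure).
    DataStorageConfiguration storageCfg = new DataStorageConfiguration();
    storageCfg.setWalMode(WALMode.LOG_ONLY);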

 


But there are several nodes containing the same data, and it is possible to restore the data from other nodes.

...

The partition update counter is saved with update records in the WAL.

Node Join (with data from persistence)

Consider a partition on the joining node in OWNING state, update counter = 50. The existing nodes have update counter = 150.

Node join causes a partition map exchange; the update counter is sent with other partition data. (The joining node will have a new ID and, from the point of view of discovery, this node is a new node.)


The coordinator observes the outdated partition state and forces the partition to MOVING state. Forcing to MOVING is required to set up uploading of the newer data.

...

Page size configuration for storage path [/work/db/node00-3a1415b8-aa54-4a63-a40a-c75ad48dd6b8]: 4096; Linux memory page size: 4096; Selected FS block size : 4096.
Selected FS block size : 4096
Direct IO is enabled for block IO operations on aligned memory structures. [block size = 4096, durable memory page size = 4096]
 

However, disabling the plugin's function is possible through a system property. To disable Direct IO, set IgniteSystemProperties#IGNITE_DIRECT_IO_ENABLED to false.
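
A sketch of disabling Direct IO programmatically (the property can also be passed as a -D JVM argument); it must be set before the node starts:

Code Block
    // Sketch: disable the Direct IO plugin for this JVM before Ignition.start() is called.
    System.setProperty(
        org.apache.ignite.IgniteSystemProperties.IGNITE_DIRECT_IO_ENABLED,
        "false");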

...