
...

  • Percent of dirty pages is a trigger for checkpointing (e.g. 75%).
  • Timeout is also a trigger: do a checkpoint every N seconds (see the configuration sketch below).
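
A minimal configuration sketch for the timeout trigger, assuming the standard DataStorageConfiguration API (the dirty pages percentage threshold is managed internally and is only mentioned in a comment):

Code Block
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

// Minimal sketch: timeout-based checkpoint trigger.
// The dirty-pages percentage trigger is internal and not configured here.
public class CheckpointFrequencyExample {
    public static void main(String[] args) {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration();
        dsCfg.setCheckpointFrequency(30_000L);                                 // checkpoint at least every 30 seconds
        dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true); // enable native persistence

        IgniteConfiguration cfg = new IgniteConfiguration().setDataStorageConfiguration(dsCfg);

        Ignition.start(cfg).active(true);                                      // activation enables persistence
    }
}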

Pages Write Throttling

Sharp Checkpointing has side effects when the throughput of data updates is greater than the throughput of the physical storage device. Under a heavy write load, the operations per second rate periodically drops to zero:

 


When offheap memory accumulates too many dirty pages (pages with data not yet written to disk), the Ignite node initiates a checkpoint: the process of writing a consistent snapshot of all pages to disk storage. If a dirty page is changed during an ongoing checkpoint before being written to disk, its previous state is copied to a special data region, the checkpoint buffer:

 


 

Slow storage devices cause long-running checkpoints. If the load stays high while a checkpoint is slow, two bad things can happen:

  • The checkpoint buffer can overflow
  • The dirty pages threshold for the next checkpoint can be reached during the current checkpoint

Either of the two events above will cause the Ignite node to freeze all updates until the end of the current checkpoint. That's why the operations/sec graph falls to zero.

Since Ignite 2.3, the data storage configuration has the writeThrottlingEnabled property. If it is enabled, two situations can trigger throttling:

  • The checkpoint buffer is being filled too fast. A fill ratio of more than 66% triggers throttling.
  • The percentage of dirty pages increases too rapidly.

If throttling is triggered, threads that generate dirty pages are slowed down with LockSupport.parkNanos(). Throttling stops when neither of the two conditions above holds (or when the checkpoint finishes). As a result, the node provides a constant operations/sec rate at the speed of the storage device instead of an initial burst followed by a long freeze.
There are two approaches to calculate the necessary time to park a thread: exponential backoff (start with an ultra-short park; every next park is <factor> times longer) and speed-based (collect a history of disk write speed measurements, extrapolate it to calculate the "ideal" speed and bound the threads that generate dirty pages to that "ideal" speed). The Ignite node chooses between them adaptively.
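
For reference, a minimal sketch of enabling the property (assuming the standard DataStorageConfiguration API):

Code Block
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

// Minimal sketch: enable Pages Write Throttling (available since Ignite 2.3).
public class WriteThrottlingExample {
    public static IgniteConfiguration config() {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration();
        dsCfg.setWriteThrottlingEnabled(true);                                 // slow down threads that mark pages dirty too fast
        dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true); // throttling matters only with persistence

        return new IgniteConfiguration().setDataStorageConfiguration(dsCfg);
    }
}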

How to detect that throttling is applied

There are two ways to find out that Pages Write Throttling affects data update operations.

Take a thread dump: some threads will be waiting at LockSupport#parkNanos with "throttle" classes in the stack trace. Example stack trace:

Code Block
"data-streamer-stripe-4-#14%pagemem.PagesWriteThrottleSandboxTest0%@2035" prio=5 tid=0x1e nid=NA waiting
  java.lang.Thread.State: WAITING
	  at sun.misc.Unsafe.park(Unsafe.java:-1)
	  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.doPark(PagesWriteSpeedBasedThrottle.java:232)
	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.onMarkDirty(PagesWriteSpeedBasedThrottle.java:220)
	  at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.allocatePage(PageMemoryImpl.java:463)
	  at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.allocateDataPage(AbstractFreeList.java:463)
	  at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.insertDataRow(AbstractFreeList.java:501)
	  at org.apache.ignite.internal.processors.cache.persistence.RowStore.addRow(RowStore.java:102)
	  at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.createRow(IgniteCacheOffheapManagerImpl.java:1300)
	  at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.createRow(GridCacheOffheapManager.java:1438)
	  at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4338)
	  at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4296)
	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3051)
	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$6200(BPlusTree.java:2945)
	  at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1717)
	  ...

...

If throttling is applied, related statistics are periodically dumped to the log:

Code Block
[2018-03-29 21:36:28,581][INFO ][data-streamer-stripe-0-#10%pagemem.PagesWriteThrottleSandboxTest0%][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0,92, markDirty=9905 pages/sec, checkpointWrite=6983 pages/sec, estIdealMarkDirty=41447 pages/sec, curDirty=0,07, maxDirty=0,26, avgParkTime=741864 ns, pages: (total=169883, evicted=0, written=112286, synced=0, cpBufUsed=15936, cpBufTotal=241312)]

The most relevant part of this message is the percentOfPartTime metric. In the example it is 0.92: writing threads are stuck in LockSupport.parkNanos() for 92% of the time, which means very heavy throttling.
The message appears in the log once percentOfPartTime reaches the 20% threshold.

CRC validation

For each page a control checksum (CRC32) is calculated. The CRC calculation is performed before each page is written to the page store. When a page is read from the page store, the CRC is calculated again over the read data and validated against the CRC field also stored in the page.
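
Ignite's actual CRC code lives in its internal page I/O layer; a purely illustrative sketch of the same idea using the standard java.util.zip.CRC32 (the crcFieldOff parameter is a hypothetical offset of the checksum field inside the page) could look like this:

Code Block
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative sketch only: compute a CRC32 over the page contents
// and compare it with the checksum stored in the page itself.
public final class PageCrcSketch {
    /** Computes CRC32 of the page, skipping the checksum field itself. */
    static int calcCrc(ByteBuffer page, int crcFieldOff, int crcFieldLen) {
        byte[] data = new byte[page.remaining()];
        page.duplicate().get(data);

        CRC32 crc = new CRC32();
        crc.update(data, 0, crcFieldOff);                                  // bytes before the CRC field
        crc.update(data, crcFieldOff + crcFieldLen,
                   data.length - crcFieldOff - crcFieldLen);               // bytes after the CRC field

        return (int)crc.getValue();
    }

    /** Validates a page read from the page store against its stored CRC. */
    static boolean isValid(ByteBuffer page, int crcFieldOff) {
        int stored = page.getInt(crcFieldOff);
        return stored == calcCrc(page, crcFieldOff, Integer.BYTES);
    }
}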

If CRC validation fails, Ignite logs the failure and attempts to recover the page from the WAL. If recovery succeeds, the node keeps running. If recovery fails, the node shuts itself down.

WAL

We can't control the moment when a node crashes. Suppose we have saved the tree leaves but didn't save the tree root (during page allocation writes may be reordered because allocation is multithreaded). In this case all updates would be lost.

...

An operation is acknowledged only after the operation itself and the corresponding page update(s) have been logged. The checkpoint will be started later by its triggers.

 

Crash Recovery

Local Crash Recovery

Crash recovery can be

  • local (most databases are able to do this)
  • or distributed (the whole cluster state is restored).

WAL records for recovery

Crash recovery involves the following records written to the WAL. They are of 2 main types: logical and physical.

Logical records

    1. Operation description - which operation we want to perform. Contains the operation type (create, update, delete) and (Key, Value, Version) - DataRecord
    2. Transactional record - a marker of begin prepare, prepared, committed, or rolled back transactions - TxRecord
    3. Checkpoint record - a marker of checkpoint begin (CheckpointRecord)

Structure of data record:

...

Update and create records always contain the full value. At the same time, several updates of the same key within a transaction are merged into one latest update (see the sketch below).
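
A minimal sketch of this merge rule (illustrative only, not Ignite's actual WAL code; the types are simplified):

Code Block
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: several updates of the same key inside one transaction
// collapse into a single "latest" entry before the logical DataRecord is logged.
final class TxWriteSetSketch<K, V> {
    private final Map<K, V> writes = new LinkedHashMap<>();

    void onUpdate(K key, V fullValue) {
        writes.put(key, fullValue);     // last write wins; the full value is always kept
    }

    /** Entries that would end up in the logical record on commit. */
    Collection<Map.Entry<K, V>> entriesToLog() {
        return writes.entrySet();
    }
}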

Physical records

    1. Full page snapshot - the record is issued for the first page update after a successful checkpoint. The record is logged when the page state changes from 'clean' to 'dirty' (PageSnapshot)
    2. Delta record - describes a page change within a memory region. A subclass of PageDeltaRecord. Contains the bytes changed in the page, e.g. bytes 5-10 were changed to [...]. Relatively small records, e.g. for B+tree changes

Page snapshots and related deltas are combined during WAL replay.

For a particular cache entry update we log records in the following order (see the sketch after this list):

  1. logical record with the planned change - a DataRecord with several DataEntry (ies)
  2. page record:
    1. option: the page changed by this update was initially clean, the full page is logged - PageSnapshot,
    2. option: the page was already modified, a delta record is issued - PageDeltaRecord
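
A hedged pseudocode sketch of this ordering (names such as walLog, entry and page are hypothetical and do not correspond to Ignite's internal API):

Code Block
// Hypothetical pseudocode of the per-update logging order described above.
void logEntryUpdate(Wal walLog, DataEntry entry, Page page) {
    // 1. Logical record: what we are going to change.
    walLog.append(new DataRecord(entry));

    // 2. Physical record: how the page actually changed.
    if (page.isClean())
        walLog.append(new PageSnapshot(page));      // first change after a checkpoint: log the full page
    else
        walLog.append(new PageDeltaRecord(page));   // page already dirty: log only the changed bytes
}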

A planned future optimisation is to refer to the data modified in the PageDeltaRecord from the logical record. This would allow not storing byte updates twice. There is a file WAL pointer (a pointer to a record from the beginning of time), and this reference may be used.

 

WAL structure

 

WAL file segments and rotation structure

...

See also the WAL History Size section below.


Local Recovery Process

Let's assume the node start process is running with existing files.

  1. We need to check whether the page store is consistent.
  2. In other words, we need to find out whether the crash happened while a checkpoint (CP) was running.

Ignite manages 2 types of CP markers on disk (standalone files that include a timestamp and a WAL pointer):

...

If we observe only a CP begin marker and there is no CP end marker, that means the CP did not finish and the page store is not consistent (see the sketch below).
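
A hedged sketch of this check (the marker file names below are hypothetical; only the begin/end pairing logic matters):

Code Block
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: decide on startup whether the page store can be trusted.
// Real marker file naming differs; each marker is a standalone file with a timestamp and a WAL pointer.
final class CheckpointStatusSketch {
    static boolean pageStoreConsistent(Path cpDir, String cpId) {
        Path begin = cpDir.resolve(cpId + "-START.bin");   // hypothetical name
        Path end   = cpDir.resolve(cpId + "-END.bin");     // hypothetical name

        // A begin marker without an end marker => the checkpoint did not finish,
        // so the page store is inconsistent and WAL physical records must be replayed.
        return !Files.exists(begin) || Files.exists(end);
    }
}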

No checkpoint process 

For a crash at a moment when no checkpoint process was running, restore is trivial: only logical records are applied.

...

Physical records (page snapshots and delta records) are ignored because the page store is consistent.

Middle of checkpoint

Let's suppose the crash occurred in the middle of a checkpoint. In that case the restore process will discover start markers for CP 1 and CP 2, but an end marker only for CP 1.

...

If a transaction begin record has no corresponding end record, the tx change is not applied.

Summary, limitations and performance 

Limitations

Because CPs are consistent, we can't start the next CP until the previous one has completed.

...

See also the WAL History Size section below.

WAL mode

There are several levels of guarantees (WALMode):

...

But there are several nodes containing the same data, and it is possible to restore data from other nodes.
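
As a reference, a minimal sketch of selecting a WAL mode (assuming the standard DataStorageConfiguration API and the WALMode enum):

Code Block
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

// Minimal sketch: choose the durability guarantee level for the WAL.
public class WalModeExample {
    public static IgniteConfiguration config() {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration();
        dsCfg.setWalMode(WALMode.LOG_ONLY);     // survives a process crash; FSYNC is the strongest (and slowest) mode
        dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        return new IgniteConfiguration().setDataStorageConfiguration(dsCfg);
    }
}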

Distributed Recovery

Partition update counter. This mechanism was already used in continuous queries.

...

The partition update counter is saved with update records in the WAL.

Node Join (with data from persistence)

Consider a partition on the joining node in OWNING state with update counter = 50. Existing nodes have update counter = 150.

...

The order in which nodes join is not relevant: it is possible that the oldest node has an older partition state while the joining node has a higher partition counter. In this case rebalancing will be triggered by the coordinator. Rebalancing will be performed from the newly joined node to the existing one (note this behaviour may be changed under IEP-4 Baseline topology for caches).


Advanced Configuration

WAL History Size

In the corner case we need to store the WAL only for 1 checkpoint in the past for successful recovery (PersistentStoreConfiguration#walHistSize).

...

By default the WAL history size is 20 to increase the probability that rebalancing can be done using logical deltas from the WAL.
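
A minimal sketch of tuning this value (assuming DataStorageConfiguration#setWalHistorySize; in older versions the same property lived on PersistentStoreConfiguration):

Code Block
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

// Minimal sketch: keep WAL for more past checkpoints so that historical
// (delta-based) rebalancing stays possible for longer.
public class WalHistorySizeExample {
    public static IgniteConfiguration config() {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration();
        dsCfg.setWalHistorySize(40);    // default is 20 checkpoints
        dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        return new IgniteConfiguration().setDataStorageConfiguration(dsCfg);
    }
}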

Estimating disk space

WAL Work maximum used size: walSegmentSize * walSegments = 640MB (default).

...

WAL Work + WAL Archive maximum size may be estimated by

  1. average load, or
  2. maximum size.

The 1st way is applicable if checkpoints are triggered mostly by the timer.
WAL size = 2 * average load (bytes/sec) * trigger interval (sec) * walHistSize (number of checkpoints),
where the multiplier 2 comes from physical & logical WAL records.
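For example (hypothetical numbers): with an average load of 10 MB/sec, a checkpoint interval of 180 sec and walHistSize = 20, the estimate is 2 * 10 MB/sec * 180 sec * 20 = 72,000 MB, i.e. roughly 70 GB of WAL.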

2nd way: a checkpoint is triggered by the maximum dirty pages percentage in the segments. Use the maximum sizes of the persisted data regions:
sum(max configured DataRegionConfiguration.maxSize) * 75% = estimated maximum data volume to be written on 1 checkpoint.
Overall WAL size (before archiving) = 2 * estimated data volume * walHistSize = 1.5 * sum(DataRegionConfiguration.maxSize) * walHistSize
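For example (hypothetical numbers): with sum(DataRegionConfiguration.maxSize) = 8 GB and walHistSize = 20, the estimate is 1.5 * 8 GB * 20 = 240 GB of WAL before archiving.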

Note that applying the WAL compressor may significantly reduce the archive size.

Setting input-output

The I/O abstraction determines how the disk is accessed by native persistence.

Random Access File I/O

This type of I/O implementation operates on files using the standard Java interface: java.nio.channels.FileChannel is used.

...

This type of IO is always used for WAL files.

Async I/O

This option is the default since Ignite 2.4.

...

To select this implementation it is possible to set the factory in the configuration (DataStorageConfiguration#setFileIOFactory) or to change it using a system property: IgniteSystemProperties.IGNITE_USE_ASYNC_FILE_IO_FACTORY = "true".
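
A minimal sketch of forcing the asynchronous implementation via the system property mentioned above (it must be set before the node starts):

Code Block
import org.apache.ignite.IgniteSystemProperties;

// Minimal sketch: request the asynchronous file I/O factory (the default since Ignite 2.4).
// The property must be set before the Ignite node is started.
public class AsyncIoExample {
    public static void main(String[] args) {
        System.setProperty(IgniteSystemProperties.IGNITE_USE_ASYNC_FILE_IO_FACTORY, "true");

        // ... start the node as usual, e.g. Ignition.start(cfg);
    }
}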

Direct I/O

Introduction and Requirements

Since Ignite 2.4 there is a plugin for enabling Direct I/O mode.

...

The plugin works on Linux with a kernel version higher than 2.4.2. It switches the I/O for durable (page) memory to Direct I/O mode for files. If an incompatible OS or FS is used, the plugin has no effect and falls back to the regular I/O implementation. The durable memory page size should be not less than the physical device block size and the Linux system page size, and it should be divisible by the underlying OS and FS block sizes. Usually both sizes are 4 KB, so the default page size is usually sufficient to enable the plugin.

Configuration

There is no need for additional configuration of the plugin. It is sufficient to add ignite-direct-io.jar and the Java Native Access (JNA) library jar (jna-xxx.jar) to the classpath.

...

Enabling direct input-output mode allows Ignite to bypass the Linux page cache: Linux fully hands over the management of pages to Ignite.

WAL and Native IO

The Write-Ahead Log does not have blocks (chunks/pages) like the Durable Memory page store has, so Direct I/O currently can't be enabled for WAL logging. The WAL always goes through conventional system I/O calls.

...

This advice is only a recommendation to the Linux system: "do not store the file data in the page cache, as the data is not required". As a result, WAL data still goes to the page cache first, but according to this advice Linux will flush and remove it during the next page cache scan.

Direct I/O & Performance

Direct I/O can have a negative effect on the performance of reading pages: in Direct I/O mode all pages are read directly from the disk, bypassing the Linux page cache.

...