Persistent Store Architecture

Page Memory

PageMemory is an abstraction for working with pages in memory. Internally it interacts with a file store which is responsible for allocating page ids, writing and reading pages. One should distinguish a concept of a page and a page buffer. A page is a block of data of a fixed length with a unique identifier called FullPageId. A page buffer is a region in memory associated with some page. PageMemory fully handles the process of loading pages to the corresponding page buffers and evicting unused page buffers from memory. At any moment in time page memory may keep any subset of page buffers (fitting to the allocated RAM).

FullPageId

FullPageId consists of cache ID (32 bits) and page ID (64 bits). Page ID is effectively a virtual page identifier that may change during the page lifecycle (see PageRotation below). EffectivePageId, which is a part of Page ID (partition ID, page index) does not change during the page lifecycle.

PageId and EffectivePageId

The structure of page ID is defined by the following diagram:

+---------+-----------+------------+--------------------------+

| 8 bits  | 8 bits    |   16 bits  |         32 bits          |
+---------+-----------+------------+--------------------------+
| OFFSET  | FLAGS     |PARTITION ID|        PAGE INDEX        |
+---------+-----------+------------+--------------------------+

                      |            EFFECTIVE PAGE ID          |

                      +---------------------------------------+

Offset is a reserved field used either for page ID rotation or for referencing a record in a data page
Flags is a reserved field used for page ID rotation
Partition ID is either a partition identifier in [0, 65500] or a reserved value 0xFFFF used for index partition. Other values are reserved for future use.
Page Index is a monotonically growing number within each partition

Page State

At any moment in time page can be in the following states:

Unloaded. There is no a corresponding page buffer loaded in memory
Clean. Page buffer is loaded and page buffer contents is equals to the data written to disk
Dirty. Page buffer has been modified and it's content is different from the data written to disk
Dirty in checkpoint. Page buffer has been modified, checkpoint started and page buffer has been modified again before the first modification has been written to disk. In this state PageMemory keeps two page buffers for each page - first one is for the current checkpoint (in progress) and second one is for the next checkpoint.

PageMemory keeps track of dirty page IDs and is capable of supplying this list to the checkpointer thread. When checkpoint starts, the collection of dirty pages is atomically reset to a new empty collection.

Page Modification

Each page buffer has an associated read write lock paired with a tag. Page tag is a part of page ID used for page rotation. In order to read or write page buffer contents one must acquire the read or write lock. The page buffer read write lock requires a correct tag to be passed in order for the lock to be acquired. If the passed in tag differs from the actual page tag, it means that the page has been reused and the link to this page is no longer valid (see page ID rotation below).

Internal Data Structures

Depending on page type and the amount of free space in a page it can be optionally tracked in either FreeList or ReuseList. FreeList is a data structure used to track partially free pages (applicable for data pages). ReuseList tracks completely free pages that can be used by any other data structure in the database.

Each partition has a special dedicated page (a meta page) which holds the state of the partition and page IDs serving as roots for all the data structures associated with the partition.

Data partitions

For data partitions, the following data structures are used:

FreeList (working as a reuse list at the same time) to track free and empty pages.
Partition hash index used as a main storage for the cache.
RowStore to keep key-value pairs

Index partition

For index partition, the following data structures are used:

ReuseList (since BTrees fully acquire pages and do not need to track free space)
Metadata storage. This data structure keeps track of all allocated indexes and contains pairs (index name, root page ID)

Page ID Rotation

The structure of BPlusTree combined with the presence of ReuseList means that it is possible to observe a state when upper pages of BPlusTree refer to a page which has been returned to the reuse list by a concurrent remove, and this page has been taken for yet another place in the same BPlusTree (ABA problem). In such situation it is possible to get a deadlock because the order of page locking can be violated. In order to avoid such situations, we update page tag each time a page is returned to a reuse list. In this case if a reader finds that the lock cannot be acquired because of the page tag mismatch, it restarts an operation from scratch, reading an updated link.

Checkpointing

Checkpointing is a process of writing dirty pages from memory to persistent storage.

The following key components are needed for checkpoint

Checkpoint read-write lock. This lock protects PageMemory from capturing intermediate state when data structures are being updated. A thread performing data structures update (cache update, partition migration, etc) must acquire read lock, update data structures and then release lock. Checkpointer must acquire write lock, capture IDs of dirty pages, release write lock and persist pages to disk.
WAL checkpoint record. This record is inserted to WAL when checkpointer holds the checkpoint write-lock. If WAL is replayed from the previous checkpoint record to this checkpoint record, data structures will be in consistent state.
Database checkpoint markers. Each marker is a file written to a local FS when checkpoint starts or finishes. It is guaranteed that checkpoint start marker is written after the checkpoint WAL marker and checkpoint end marker is written after all page stores are synced. These markers contain the position of WAL checkpoint record in the WAL.

After a checkpoint begun, PageMemory is moved to a copy-on-write mode. This means that if a dirty page buffer is modified, the page will be moved to the Dirty In Checkpoint state and PageMemory will create a temp copy of the page buffer which is visible for the PageMemory users (the current contents of the page). At the same time the checkpoint page buffer is visible only for checkpointer which will eventually flush it to disk. When checkpointer finishes writing, it will flush all copied pages to the main memory.

Upon each page buffer modification (before the page write lock release) the corresponding changes must be logged to the WAL. More specifically, if the page being modified is clean, the whole modified page buffer must be logged. If the page being modified is dirty, only binary page delta should be modified to reduce WAL consumption and improve performance.

Upon start, a node can observe the following states of checkpoint markers:

The last checkpoint start marker has a corresponding checkpoint end marker. This means that the process stopped or crashed when no checkpoint was in progress, thus page store (disk) contains a consistent state of data structures (at the time of checkpoint). We only need to replay logical updates that were logged to the WAL since the last checkpoint marker.
The last checkpoint start marker does not have a corresponding checkpoint end marker. This means that the process crashed or stopped during a checkpoint and page store contains half-written pages state. In this case the recovery consists of two steps. First, we must replay all WAL records starting from last finished checkpoint marker and ending with the start checkpoint marker. This will bring data structures to a consistent state. After this we must replay WAL logical updates from record B to the end of the log (the same as above).

Page Store

Page store is an interface responsible for allocating, reading and writing page buffers to a physical durable storage. Page store is allocated on per-partition basis. Refer to org.apache.ignite.internal.pagemem.store.PageStore for the store API. Here we describe the meaning of the main store methods

allocatePage(). This method should return next unused page index. When the store starts, the corresponding counter must revert to the index of last written page. In other words, persistence of allocated page indexes is performed by a checkpointer.
pages(). This method should return the total number of allocated pages.
read(). This method reads the page with the given index to the given page buffer.
write(). This method writes a page with the given page index from the given page buffer.
pageOffset(). This method should return the offset of the page with the given ID in a file.
sync(). Ensures that all writes issued to the store reached the media.

All methods of the page store must be thread-safe.

Write-Ahead-Log

WAL is an append-only data structure that logs logical database modifications and page changes. Refer to org.apache.ignite.internal.pagemem.wal.IgniteWriteAheadLogManager for the API. Here we describe the meaning of the main WAL methods

log(). Append a WAL record to the log and return a pointer to this record in the log file. It is not required that the record physically reaches the disk upon this method completion.
fsync(). Make sure that all the records (or at least all up to the passed in pointer) are flushed to disk.
replay(). Return an iterator of all logged records starting from the given pointer and to the end of the log.

Decoupling

In current API of the WAL we rely on java objects rather than on byte arrays. As per discussion with Veritas, we need to decouple the java records serialization logic and writing the serialized values to disk. In order to achieve ideal integration, we must supply a block-based WAL implementation.

Current implementation of WAL works based on segment-based WAL structure. WAL is written to a files of fixed length called segments, each segment has a monotonically growing absolute index. The segment is first written to one of the preallocated work files in WAL work directory. The index of the work file is the absolute index of the segment modulo the number of preallocated segments. In each moment in time only one work file is opened for writing. After a work file has been written, it is copied to the WAL archive directory by a background WAL archiver, then this work file is cleaned (filled with zeros) and given back to WAL for further writes. WAL archiver and WAL are properly synchronized so that WAL does not overwrite unarchived work files.

Rebalancing

Rebalancing logic is based on per-partition update counters. During rebalancing, every node sends it's update counters for each partition to the coordinator. Then coordinator assigns partition states according to update counters: nodes that have the highest value for a partition are considered its owners.

A node that needs to obtain more recent data for a partition will include its current update counter in GridDhtPartitionDemandMessage. A supplier might return either full partition or changes since that counter, extracted from WAL.

Page tree