1. Objective

Apache Ignite needs to support running SQL queries against data on disk, without first loading it into memory, and, in general, be fully operational even when all the data resides on disk.

2. Terminology

  • Persistent Store: Disk-based (preferably SSD) file or block-level store for data stored in Apache Ignite.
  • Write-Ahead Log (WAL): Temporary disk-based journal that keeps track of transactional updates that have not yet been stored in the Persistent Store.
  • Sorted Index: Index capable of serving range-based SQL queries.
  • Hash Index: Index capable of direct equality-based lookups.
  • Snapshottable Index: Index capable of taking timestamp-based snapshots of the data. Useful when query results should not include updates that happened after query execution started.

3. Requirements 

  1. Need to be able to use SQL indexes both from disk and from memory
  2. Need to store both primary and backup copies on disk
  3. Persistent Store format should be split into a file per partition for easier migration of data
  4. Persistent Store needs to be globally transactional and handle scenarios in which some nodes committed successfully while others did not

4. Webinar: Persistent Store Overview

Watch the recording of the Apache Ignite community meetup to quickly grasp the benefits, specifics, architecture, and implementation details of the store.

5. Design

5.1 Persistent Store

The main purpose of the Persistent Store is to provide data persistence and fault tolerance, i.e. no data should ever be lost regardless of the type of failure.

Data partitioning details:

  • The Persistent Store will maintain a separate space (file) for every partition within a cache (the same approach used for in-memory storage).
  • Splitting data based on cache partitions is the optimal way to achieve fast rebalancing during topology changes.
  • Whenever a new node joins or an existing node leaves, the cluster protocol should select a grid node with up-to-date partition data and copy that data to another node.
  • If concurrent updates happen during the partition file migration, they will be applied to the partition file after the copy is complete (see the sketch after this list).
  • While a partition is being migrated, it will be considered MOVING, i.e. not yet eligible to become a primary partition.
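
A minimal sketch of the copy-then-catch-up step described above. All types here (Node, UpdateRecord, PartitionState) are hypothetical, introduced only to illustrate the protocol:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.function.Consumer;

    // All types below are hypothetical, for illustration only.
    enum PartitionState { MOVING, OWNING }

    interface UpdateRecord { }

    interface Node {
        void setPartitionState(int partId, PartitionState state);
        void onUpdate(int partId, Consumer<UpdateRecord> listener);
        void copyPartitionFile(int partId, Node target);
        void applyUpdate(int partId, UpdateRecord update);
    }

    class PartitionRebalancer {
        /** Copy a partition file to the target, then replay the updates that
         *  arrived during the copy; the partition stays MOVING until caught up. */
        void rebalance(int partId, Node source, Node target) {
            target.setPartitionState(partId, PartitionState.MOVING);

            Queue<UpdateRecord> buffered = new ConcurrentLinkedQueue<>();
            source.onUpdate(partId, buffered::add);   // buffer concurrent updates

            source.copyPartitionFile(partId, target); // bulk copy of the partition file

            UpdateRecord u;
            while ((u = buffered.poll()) != null)     // catch-up phase
                target.applyUpdate(partId, u);

            target.setPartitionState(partId, PartitionState.OWNING);
        }
    }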

We may consider adding the ability to create new files, either periodically or upon request. In this case, the process of "applying" logs would create new files instead of updating the old ones.
 

NOTE: The Apache Ignite Persistent Store already supports the following functionality, which we should be able to leverage:

  • file-based storage
  • in-memory hash-based indexes
  • fast background file compaction

5.2 Write-Ahead Log (WAL)

Apache Ignite will maintain a WAL file for all caches. The purpose of the WAL file is to provide a recovery mechanism for the transactional data stored on disk. 

WAL update details:

  • For fastest performance, data will always be appended to the tail of the WAL file.
  • Every transactional update will have transaction start/end demarcations and a timestamp.
  • A background process will periodically unwind the WAL data and move it to the Persistent Store (this will prevent unbounded growth of the WAL file).
  • In case of a failure, upon restart, Apache Ignite should be able to start operating immediately based on the latest WAL and Persistent Store data.
  • It should be possible to set the maximum size of a WAL file.
  • We should have several WAL files. Whether these files should be reused is an open question; we may decide to create new ones each time.
  • We should be able to set the lifetime of the WAL and the location to which older WAL files are moved (after their contents are transferred to the Persistent Store). For example, a node may have both SSDs and conventional disks: all logs required for failure handling are kept on both, while logs that are no longer needed are removed from the SSDs. This is largely analogous to Oracle's ARCHIVELOG mode.
  • At the end of a transaction, the log buffer should be forced to flush to disk (see the sketch after this list).
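
A minimal sketch of how WAL records could be appended and flushed, under the assumptions above; the record layout and the WalWriter class are hypothetical, not an existing Ignite API:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    /** Hypothetical append-only WAL writer; the record layout is illustrative. */
    class WalWriter implements AutoCloseable {
        // Record types providing the transaction start/end demarcations.
        static final byte TX_START = 1, DATA = 2, TX_END = 3;

        private final FileChannel ch;

        WalWriter(Path file) throws IOException {
            ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        /** Append one record to the tail: [type][timestamp][txId][len][payload]. */
        void append(byte type, long txId, byte[] payload) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(1 + 8 + 8 + 4 + payload.length);
            buf.put(type).putLong(System.currentTimeMillis()).putLong(txId)
               .putInt(payload.length).put(payload);
            buf.flip();
            ch.write(buf);                 // sequential append, no seeks
        }

        /** At the end of a transaction, force the log buffer to disk. */
        void commit(long txId) throws IOException {
            append(TX_END, txId, new byte[0]);
            ch.force(false);               // fsync so the commit survives a crash
        }

        @Override public void close() throws IOException { ch.close(); }
    }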

5.3 Hash Index

In addition to the Persistent Store, Apache Ignite will maintain a per-partition hash index for key-based data access, both in memory and on disk. This hash index will guarantee fast data lookups by primary key.

The on-disk hash index will always have the latest consistent state, up until the last WAL flush to the Persistent Store. In case of a failure or restart, Apache Ignite should be able to immediately recreate the on-disk state of the hash index by applying the portion of the WAL that has not yet been copied to the Persistent Store. Once the disk state is restored, Apache Ignite can start loading the hash index into memory in the background, without blocking any cache operations.
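
A sketch of the recovery step just described, assuming hypothetical WalTail, WalRecord, and HashIndex interfaces:

    import java.util.List;

    // Hypothetical types, for illustration only.
    interface WalRecord { boolean committed(); byte[] key(); long valuePointer(); }

    interface WalTail { List<WalRecord> recordsSince(long checkpointPointer); }

    interface HashIndex { void put(byte[] key, long valuePointer); }

    class HashIndexRecovery {
        /** Replay the WAL records written after the last checkpoint into the
         *  on-disk hash index, restoring its latest consistent state. */
        void recover(WalTail wal, HashIndex idx, long lastCheckpointPointer) {
            for (WalRecord rec : wal.recordsSince(lastCheckpointPointer)) {
                if (rec.committed())   // uncommitted records are resolved separately
                    idx.put(rec.key(), rec.valuePointer());
            }
            // With the on-disk state restored, the in-memory copy of the index
            // can be warmed up in the background without blocking cache operations.
        }
    }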

Based on configuration, the hash index should optionally be snapshottable (especially if we decide to use it for SQL queries).

5.4 Sorted Index

Apache Ignite will need to maintain on-disk and in-memory sorted indexes. These indexes will have the latest state up until the last WAL flush. It is important for performance that the indexes can be loaded into memory as-is, in the same format as on disk, without any additional processing. Just like the hash index, the sorted indexes will be split into a file per partition, which will make it easy to copy partitions between nodes in case of failures or topology changes.

The on-disk sorted index will follow the same failure and recovery process as described for the hash index above.


NOTE: for the in-memory representation of the sorted index, we should consider removing the per-partition split and keeping the whole index in a single in-memory data structure. This will help avoid the merge-sort overhead for range-based and ORDER BY queries and potentially provide better performance.
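
The merge overhead mentioned in the note comes from k-way merging the per-partition sorted cursors. A minimal illustration (all names hypothetical):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    /** With one sorted index per partition, a range or ORDER BY query must
     *  k-way merge the per-partition cursors, paying O(log k) per returned row;
     *  a single in-memory index would avoid this cost entirely. */
    class PartitionMerge {
        static <T extends Comparable<T>> List<T> mergeSorted(List<Iterator<T>> cursors) {
            PriorityQueue<PeekingCursor<T>> pq =
                new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
            for (Iterator<T> it : cursors)
                if (it.hasNext())
                    pq.add(new PeekingCursor<>(it));

            List<T> result = new ArrayList<>();
            while (!pq.isEmpty()) {
                PeekingCursor<T> c = pq.poll();
                result.add(c.head);          // smallest head across all partitions
                if (c.advance())
                    pq.add(c);               // re-insert: the per-row log k cost
            }
            return result;
        }

        private static class PeekingCursor<T> {
            final Iterator<T> it;
            T head;
            PeekingCursor(Iterator<T> it) { this.it = it; this.head = it.next(); }
            boolean advance() {
                if (!it.hasNext()) return false;
                head = it.next();
                return true;
            }
        }
    }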

5.5 Memory Format

All the indexes that are stored on disk will also be maintained in-memory. However, Apache Ignite will be able to perform all the cache operations, including SQL queries, based exclusively on the on-disk indexes, without having to load them in-memory. The in-memory indexes are only maintained for performance reasons and do not have any effect on the overall supported functionality.
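
A sketch of the resulting lookup path, assuming hypothetical Index interfaces; the in-memory index is a fast path only, never a prerequisite:

    // Hypothetical sketch: the Index interfaces below are illustrative only.
    interface Index { long get(byte[] key); }

    interface LoadableIndex extends Index { boolean isLoaded(); }

    class IndexLookup {
        private final LoadableIndex inMemory; // same page format as on disk
        private final Index onDisk;           // always complete and authoritative

        IndexLookup(LoadableIndex inMemory, Index onDisk) {
            this.inMemory = inMemory;
            this.onDisk = onDisk;
        }

        /** Use the in-memory index when it has been warmed up; otherwise fall
         *  back to the on-disk index. Functionality is identical either way. */
        long lookup(byte[] key) {
            return inMemory.isLoaded() ? inMemory.get(key) : onDisk.get(key);
        }
    }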

6. Fault Tolerance

6.1 Client Node Failures

Client node failures have no effect on the persisted state on the servers.

6.2 Data Node Failures

In the case when all primaries for some partitions have failed together with their backups, the application must detect that those partitions are lost and block all queries (as well as cache.get() and similar operations) until the failed nodes are restarted; otherwise, the results would be incorrect.

In the case when primaries or backups still exist, the failure of a node does not affect query correctness. Since backup entries, as well as primaries, will be stored in the Persistent Store, an affinity change to the new topology will not affect data consistency between the stores on different nodes.

On restart, Apache Ignite will determine whether another server contains a more up-to-date file for a given partition and, if so, will copy that file before resuming operation.

7. Transaction Consistency

It is possible that, after all nodes have agreed during the prepare phase, the commit phase fails on some node. To handle such cases, nodes should persist the changes to the WAL during the prepare phase, but mark them as "uncommitted". Then, whenever the commit happens, the "uncommitted" flag should be flipped.

If, after a crash, certain transactions were left in the "uncommitted" state, their persisted state should be ignored and reloaded from other cluster nodes, primary or backup, that have the consistent state.

If, after a restart, none of the nodes has the partially failed transaction in the committed state, the transaction should be rolled back and its "uncommitted" persisted state ignored.


NOTE: this protocol for transactional consistency is already implemented in Apache Ignite for in-memory transactions and will need to be expanded to support disk-based transactions.
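
A sketch of what the prepare/commit markers and the post-crash resolution could look like; the Wal and Cluster interfaces and the TxMark flag are hypothetical, for illustration only:

    // The Wal and Cluster interfaces and the TxMark flag are hypothetical.
    enum TxMark { PREPARED, COMMITTED }

    interface Wal { void append(long txId, TxMark mark, byte[] payload); void fsync(); }

    interface Cluster { boolean anyNodeCommitted(long txId); }

    class TxConsistency {
        private final Wal wal;
        private final Cluster cluster;

        TxConsistency(Wal wal, Cluster cluster) {
            this.wal = wal;
            this.cluster = cluster;
        }

        /** Prepare phase: persist the changes, still marked as uncommitted. */
        void onPrepare(long txId, byte[] changes) {
            wal.append(txId, TxMark.PREPARED, changes);
            wal.fsync();
        }

        /** Commit phase: a small marker record flips the "uncommitted" flag. */
        void onCommit(long txId) {
            wal.append(txId, TxMark.COMMITTED, new byte[0]);
            wal.fsync();
        }

        /** After a crash, a transaction found only in PREPARED state is resolved
         *  by asking the rest of the cluster whether any node committed it. */
        boolean resolve(long txId) {
            // true: reload the consistent state from primary/backup nodes that committed;
            // false: nobody committed, so roll back and ignore the prepared state.
            return cluster.anyNodeCommitted(txId);
        }
    }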

8. Partition Recovery

Whenever crashes or topology changes occur, an internal cluster protocol will need to determine which cluster node has the most up-to-date partition files for the Persistent Store, hash index, and sorted indexes, and copy these files to the nodes that either have outdated partition data or are new nodes with no data at all. For performance reasons, we should investigate copying only the missing delta instead of the whole partition files.

The copy functionality itself should be provided by a pluggable interface and potentially have several implementations. For example, we could copy partition files using the standard Java file APIs or the native Linux scp or cp commands.
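
A sketch of what such a pluggable contract could look like, with the two example implementations mentioned above; the PartitionFileTransfer interface and the class names are hypothetical:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    /** Hypothetical pluggable contract for moving partition files between nodes. */
    interface PartitionFileTransfer {
        void copy(Path source, Path target) throws IOException, InterruptedException;
    }

    /** Implementation based on the standard Java file API. */
    class NioTransfer implements PartitionFileTransfer {
        @Override public void copy(Path source, Path target) throws IOException {
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Implementation that shells out to the native cp (or scp) command. */
    class NativeCpTransfer implements PartitionFileTransfer {
        @Override public void copy(Path source, Path target)
                throws IOException, InterruptedException {
            Process p = new ProcessBuilder("cp", source.toString(), target.toString())
                .inheritIO().start();
            if (p.waitFor() != 0)
                throw new IOException("cp failed for " + source);
        }
    }

Keeping the transfer behind an interface lets a deployment trade the portability of the Java implementation for potentially faster native tooling.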
