Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Current implementation of WAL works based on segment-based WAL structure. WAL is written to a files of fixed length called segments, each segment has a monotonically growing absolute index. The segment is first written to one of the preallocated work files in WAL work directory. The index of the work file is the absolute index of the segment modulo the number of preallocated segments. In each moment in time only one work file is opened for writing. After a work file has been written, it is copied to the WAL archive directory by a background WAL archiver, then this work file is cleaned (filled with zeros) and given back to WAL for further writes. WAL archiver and WAL are properly synchronized so that WAL does not overwrite unarchived work files.

Snapshots

Full Snapshot

Creation sequence:

  • Originating node starts partition exchange process which waits for all ongoing transaction to finish and blocks all new transactions
  • When all ongoing transactions finished, each node starts a special kind of checkpoint - a snapshot checkpoint. It implies all the actions needed for a usual checkpoint - it acquires checkpoint write lock, captures the collection of dirty pages, etc., plus it captures the total number of allocated pages for each partition.
  • After the collection of dirty pages is acquired on all nodes, the partition exchange completes and new transactions are allowed again
  • New snapshot session is started using snapshot SPI
  • During this checkpoint all page buffers are written both to disk and snapshot session
  • After checkpoint is finished, the snapshot worker is waked up which copies the rest of the pages to the snapshot session
  • If a new checkpoint begins before the snapshot is finished, the previous page content is forcibly propagated to the snapshot session before a new dirty page is written to disk.

After a full snapshot is initiated, each next page modification is also tracked in special tracking pages used for an incremental snapshot.

Incremental Snapshot

Incremental snapshot is possible if there was at least one full snapshot completed. The procedure of incremental snapshot creation is similar to full snapshot creation with the exception that only pages changed since the last full or incremental snapshot are written.

Rebalancing

Rebalancing logic is based on per-partition update counters. During rebalancing, every node sends it's update counters for each partition to the coordinator. Then coordinator assigns partition states according to update counters: nodes that have the highest value for a partition are considered its owners.

...