...

  1. Check that all baseline nodes are online.
  2. Start distributed process CACHE_GROUP_KEY_CHANGE_PREPARE, each node
    1. verifies that re-encryption is not in progress (?)
    2. ensures that the new key identifier does not exist
    3. adds new key
  3. After successful completion of PREPARE, start distributed process CACHE_GROUP_KEY_CHANGE_FINISH, each node
    1. saves a logical WAL record (ENCRYPTION_STATUS_RECORD) with the current page count in partitions.
    2. stores the current page count as total pages for background re-encryption on partitions.
    3. sets the new key for writing
    4. adds the mapping "WAL segment -> *old* key identifier" (to safely clean up this key in the future)
    5. starts background re-encryption
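
The two-phase flow above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class `NodeGroupKeys`, the function `change_group_key`, and all field names are hypothetical and are not taken from the Ignite code base; the WAL is reduced to a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class NodeGroupKeys:
    """Hypothetical per-node key state for a single cache group."""
    keys: dict = field(default_factory=lambda: {0: b"key-0"})  # key id -> key bytes
    active_id: int = 0                                         # key id used for writing
    reencryption_running: bool = False
    wal_segment_to_key: dict = field(default_factory=dict)     # WAL segment -> old key id
    total_pages: dict = field(default_factory=dict)            # partition -> pages to re-encrypt
    wal: list = field(default_factory=list)                    # simplified logical WAL

    def prepare(self, new_id, new_key):
        """PREPARE phase: validate, then add the new key without using it yet."""
        if self.reencryption_running:
            raise RuntimeError("re-encryption is in progress")
        if new_id in self.keys:
            raise ValueError(f"key id {new_id} already exists")
        self.keys[new_id] = new_key

    def finish(self, new_id, page_counts, current_wal_segment):
        """FINISH phase: log state, switch writes to the new key, start re-encryption."""
        self.wal.append(("ENCRYPTION_STATUS_RECORD", dict(page_counts)))
        self.total_pages = dict(page_counts)
        old_id, self.active_id = self.active_id, new_id
        self.wal_segment_to_key[current_wal_segment] = old_id
        self.reencryption_running = True


def change_group_key(nodes, new_id, new_key, page_counts, wal_segment):
    """Coordinator-side sketch: FINISH runs only if PREPARE succeeded on every node."""
    for node in nodes:
        node.prepare(new_id, new_key)  # any exception aborts before FINISH
    for node in nodes:
        node.finish(new_id, page_counts, wal_segment)
```

In this model a second key change attempted while `reencryption_running` is still set fails in `prepare`, so no node ever reaches `finish` for it.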

Background re-encryption

The process applies to all existing partitions, including the index.

Scan all pages from the specified range (metaPageId + [offset -> total])

  1. acquire page
    1. if the checkpoint is finished (after the key change) and the page is dirty - skip this page.
    2. if the checkpoint is not finished or the page is not dirty
      1. lock page
      2. if the checkpoint is not finished and the page is dirty - save an additional page snapshot into the WAL
      3. unlock page (dirty=true; if the page was not dirty, its snapshot is logged into the WAL)
  2. release page
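
The per-page decision above can be sketched as follows. This is a simplified model, not the actual implementation: `Page`, `reencrypt_page`, and `scan` are hypothetical names, acquire/release are omitted, and the WAL is a plain list. The sketch assumes that marking a clean page dirty is what causes its snapshot to be logged.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_id: int
    dirty: bool = False
    locked: bool = False


def reencrypt_page(page, checkpoint_finished, wal):
    """Return True if the page was marked for re-encryption, False if skipped."""
    if checkpoint_finished and page.dirty:
        return False  # step 1.1: skip this page
    page.locked = True
    try:
        if not checkpoint_finished and page.dirty:
            # Already dirty, so marking it dirty again logs nothing:
            # save an additional snapshot explicitly.
            wal.append(("snapshot", page.page_id, "explicit"))
        elif not page.dirty:
            # Clean page: the snapshot is logged when it first becomes dirty.
            wal.append(("snapshot", page.page_id, "on-first-dirty"))
        page.dirty = True
    finally:
        page.locked = False
    return True


def scan(pages, offset, total, checkpoint_finished, wal):
    """Process pages in [offset, total) and return how many were marked."""
    marked = 0
    for page in pages[offset:total]:
        if reencrypt_page(page, checkpoint_finished, wal):
            marked += 1
    return marked
```

In every non-skipped case exactly one snapshot reaches the WAL, and the page is left dirty so that the next checkpoint writes it to disk.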

Re-encryption progress is stored in the metapage (int offset, int total) and is updated during checkpoint.

The process aborts only when the partition is destroyed.

Cleanup old key

Old group key will be removed when

  1. re-encryption is completed for the cache group (and at least one checkpoint has completed after that)
  2. the last WAL segment in which the key was used is removed (on startup, the node should check whether the WAL history has been cleared manually)
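
A minimal sketch of this cleanup check, assuming that both conditions must hold together. The function name and parameters are hypothetical, and WAL segments are represented by plain integer indexes.

```python
def can_remove_old_key(reencryption_done, checkpoints_after_done,
                       last_wal_segment_with_key, existing_wal_segments):
    """Return True when the old group key is no longer needed.

    reencryption_done          -- re-encryption finished for the cache group
    checkpoints_after_done     -- checkpoints completed since it finished
    last_wal_segment_with_key  -- index of the last WAL segment written with the old key
    existing_wal_segments      -- indexes of WAL segments still present on disk
    """
    if not reencryption_done or checkpoints_after_done < 1:
        return False
    return last_wal_segment_with_key not in existing_wal_segments
```

Running the same check on node startup also covers the case where the WAL history was cleared manually while the node was down.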

Fault tolerance

If CACHE_GROUP_KEY_CHANGE_PREPARE has not been successfully completed on all nodes, the process is interrupted and must be restarted.
When the process restarts, a new key identifier is generated (the unused key is removed during the finish phase, when the new key is set for writing).

...

When a non-baseline node joins a cluster (with baseline change), it cleans up all existing data, so this shouldn't be a problem case.

If the node stops or fails during re-encryption, after restarting it must continue re-encryption from the stored offset:

  1. If the checkpoint failed, it should restore physical records from the WAL, as usual.
  2. If the checkpoint was not invoked, re-encryption is started from the beginning using the saved logical WAL record.
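
One way to picture the restart decision is the sketch below. It assumes that the metapage holds an (offset, total) pair only if a checkpoint persisted it, and that the logical WAL record carries the total page count. The function `resume_range` and its parameters are hypothetical.

```python
def resume_range(metapage_progress, wal_total_pages):
    """Pick the page range to re-encrypt after a node restart.

    metapage_progress -- (offset, total) from the metapage, or None if no
                         checkpoint has stored progress yet
    wal_total_pages   -- total page count from the saved logical WAL record
    """
    if metapage_progress is not None:
        offset, total = metapage_progress
        return offset, total      # continue from the stored offset
    return 0, wal_total_pages     # no checkpoint yet: start from the beginning
```

For example, `resume_range((40, 100), 100)` continues from page 40, while `resume_range(None, 100)` restarts from page 0.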

// TBD

Risks and assumptions

...