Motivation

Cache encryption key rotation is required if the key is compromised or when its crypto period (key validity period) ends. In addition, this feature is required to support encrypting and decrypting existing caches in the future.

Overview

The local partition re-encryption strategy is similar to partition snapshotting: create a partition snapshot re-encrypted with the new key, then swap the original partition file with the new one.

The cluster-wide process consists of the following steps:

Prepare changing the encryption key - send the new key and start a re-encryption task on each affinity node.
Finish changing the encryption key - swap partitions and replace the cache encryption key in the metastore.

Prepare changing the encryption key

The initiator node generates new encryption key(s) for the cache group(s).
The distributed process starts a new cache encryption key change operation by sending an initial discovery message with the list of cache groups to re-encrypt and the encrypted keys.
The action configured for the distributed process starts a new local re-encryption task on each node.

Local re-encryption task

Start copying each partition file to the target directory, re-encrypting it along the way. The copied files will contain dirty data because the checkpoint thread writes to the originals concurrently.
Collect all dirty pages that belong to the ongoing checkpoint and the corresponding partition files, and apply them (with re-encryption) to the copied file right after the copy completes.
When local re-encryption of all required cache groups completes, send a message that this phase is finished on this node (in other words, the "prepare" distributed process is finished on the local node).
Continue to collect dirty pages and apply them, encrypted with the new key, to the copied partition until the "finish" phase starts.
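
The steps of the local task can be sketched as follows. This is a minimal in-memory illustration, not Ignite code: the class `ReencryptionTaskSketch`, the `onCheckpointWrite` hook and the `reencrypt` function (decrypt with the old key, encrypt with the new one) are hypothetical names, pages are modelled as byte arrays in a list, and the partition is assumed not to grow while the task runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of the local re-encryption task; not Ignite code. */
class ReencryptionTaskSketch {
    /** Pages written by the checkpoint thread since the copy started: page index -> page under the old key. */
    private final Map<Integer, byte[]> dirtyPages = new ConcurrentHashMap<>();

    /** Assumed helper: decrypts a page with the old key and encrypts it with the new one. */
    private final UnaryOperator<byte[]> reencrypt;

    ReencryptionTaskSketch(UnaryOperator<byte[]> reencrypt) {
        this.reencrypt = reencrypt;
    }

    /** Hook the checkpoint writer would call for every page it flushes to this partition. */
    void onCheckpointWrite(int pageIdx, byte[] page) {
        dirtyPages.put(pageIdx, page.clone());
    }

    /** Copies every page of the source partition into a new, re-encrypted list of pages. */
    List<byte[]> copy(List<byte[]> source) {
        List<byte[]> target = new ArrayList<>(source.size());

        for (byte[] page : source)
            target.add(reencrypt.apply(page));

        return target;
    }

    /** Applies (and forgets) the dirty pages collected so far; meant to be called repeatedly until "finish" starts. */
    void applyDirtyPages(List<byte[]> target) {
        for (Integer pageIdx : dirtyPages.keySet()) {
            byte[] page = dirtyPages.remove(pageIdx);

            if (page != null)
                target.set(pageIdx, reencrypt.apply(page));
        }
    }
}
```

In this sketch the copy may contain stale pages right after `copy` returns; each later call to `applyDirtyPages` overwrites them with the versions captured from the checkpoint writer.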

Finish changing the encryption key

After completion of the key change preparation process, a new distributed process is initiated to complete the key change.

The discovery event from the distributed process pushes a new exchange task to the exchange worker to start PME (partition map exchange). PME is required to prevent reordering of WAL records when the key is changed and to simplify the initial design; this could and will be changed in the future.

While updates are blocked, each local node:

1. Forces a checkpoint (required for WAL consistency?).
2. Swaps all partition files:
   a. Backs up the original file.
   b. Moves the re-encrypted file into the place of the original.
3. Changes the encryption key(s) in the metastore (updates the encryption key history).
4. Removes the partition backups (2a).
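
The file swap can be illustrated with plain `java.nio.file` calls. This is a sketch under assumptions, not the actual implementation: the class `PartitionSwapSketch`, its method names and the `.bak` suffix are hypothetical, and the checkpoint and metastore update are left out. The `restoreFromBackup` method shows one way a node could put the original file back on start if it crashed mid-swap; at which point a leftover backup must no longer be restored (for example, after the key has been changed in the metastore) is not specified here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of the partition file swap; not Ignite code. */
class PartitionSwapSketch {
    /** Backup file name is an assumption: the original name plus ".bak" in the same directory. */
    static Path backupPath(Path original) {
        return original.resolveSibling(original.getFileName() + ".bak");
    }

    /** Backs up the original file, then moves the re-encrypted file into its place. */
    static void swap(Path original, Path reencrypted) throws IOException {
        Files.move(original, backupPath(original), StandardCopyOption.REPLACE_EXISTING);
        Files.move(reencrypted, original, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Removes the backup once the new key has been stored. */
    static void removeBackup(Path original) throws IOException {
        Files.deleteIfExists(backupPath(original));
    }

    /** Puts the backed-up original back in place if a backup is still present. */
    static void restoreFromBackup(Path original) throws IOException {
        Path backup = backupPath(original);

        if (Files.exists(backup))
            Files.move(backup, original, StandardCopyOption.REPLACE_EXISTING);
    }
}
```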

WAL

After the cache encryption key is changed, the cache's entries in the WAL are encrypted with the new key. However, it must remain possible to read older WAL records (at least to support historical rebalance).

For each cache, instead of a single key, it is necessary to keep a history of keys in the form WALPointer -> key (the stored pointer is the maximum pointer for which the associated key is applicable).

When a WAL segment referenced by one or more WALPointers is removed, the corresponding key(s) should also be removed.
When the WAL is cleared, the key history must also be cleared (except for the last key).
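
A key history of this shape can be sketched with a sorted map. The example is illustrative only: `KeyHistorySketch` and its methods are hypothetical names, and a WAL pointer is simplified to a single `long`, whereas a real pointer carries more than one component.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a per-cache key history; not Ignite code. */
class KeyHistorySketch {
    /** Maximum WAL pointer for which a retired key applies -> that key. */
    private final TreeMap<Long, byte[]> retired = new TreeMap<>();

    /** Key used for all records after the last retired pointer. */
    private byte[] current;

    KeyHistorySketch(byte[] initialKey) {
        current = initialKey;
    }

    /** Retires the current key at the given pointer and switches to the new key. */
    void rotate(long lastPtrOfOldKey, byte[] newKey) {
        retired.put(lastPtrOfOldKey, current);
        current = newKey;
    }

    /** Finds the key for a record: the retired key with the smallest maximum pointer >= walPtr, else the current key. */
    byte[] keyFor(long walPtr) {
        Map.Entry<Long, byte[]> e = retired.ceilingEntry(walPtr);

        return e != null ? e.getValue() : current;
    }

    /** Drops keys whose whole range lies before the first retained WAL pointer. */
    void onWalTruncated(long firstRetainedPtr) {
        retired.headMap(firstRetainedPtr, false).clear();
    }

    /** Number of keys kept, including the current one. */
    int size() {
        return retired.size() + 1;
    }
}
```

Because the current key is held outside the map, truncating the whole WAL in this sketch always leaves the last key in place.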

Recovery

Canceling the re-encryption procedure means clearing all temporary data.

If a node crashes while partitions are being replaced, the original partitions are restored from their backup copies when the node starts.
If a major topology change occurs during key rotation, the whole procedure is canceled.
If a cache is stopped during re-encryption, the whole procedure is canceled. Other minor topology changes should not affect the re-encryption procedure.

(TBD) When a baseline node with data joins the cluster and the cache group has a different key:
1. If historical rebalancing is not applicable, the encryption key is changed when the node joins and the partitions are cleared.
2. If historical rebalancing is applicable, existing data should be re-encrypted with the new key before(?) the node joins the cluster.

Process management

TBD

Public API changes

TBD

Monitoring

Re-encryption process state

Key rotation is required if the key is compromised or when its crypto period (key validity period) ends.

Goal:

...

New processes:

...

Cache key rotation.

...

New administrator commands:

Current state of cache key rotation: node -> group name -> status -> encryption key hash.

Cache group keys rotation:

Process start:

...

Process description:

...

When the message is received, the following actions are executed:

...

Process state: IN PROGRESS.

...

Further WAL records are encrypted with the new key.

...

Thread pool configured in IgniteConfiguration.

...

For each partition file.

...

The file is read page by page.

...

Page is unlocked.

...

Completion of partition re-encryption is accompanied by adding a WAL entry.

...

Process state: FINISHED.

Motivation:

...

Memory footprint [Thread count]*[page size]

...

Minor effect on regular data operations.

...

To decrypt a page, we have to perform the following steps:

...

If the page is not reencrypted yet, we use the old key for decryption.

...

If the page is reported as reencrypted (the Bloom filter may give a false positive), we:

Try to decrypt the page with the new key.
If that fails, try the old key.
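
The fallback logic can be sketched as below. The sketch assumes that a failed decryption is detectable, which it models with AES-GCM from `javax.crypto` (a wrong key makes the authentication tag check fail with `AEADBadTagException`). The cipher mode, the class `PageDecryptSketch` and the `reportedReencrypted` flag (standing in for the Bloom filter answer) are assumptions made for illustration; the text does not say how the failure is detected.

```java
import java.security.GeneralSecurityException;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Hypothetical sketch of page decryption during key rotation; not Ignite code. */
class PageDecryptSketch {
    /** Encrypts a page with AES-GCM (assumed mode, 128-bit tag). */
    static byte[] encrypt(SecretKey key, byte[] iv, byte[] plain) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));

        return cipher.doFinal(plain);
    }

    /** Decrypts a page; throws AEADBadTagException if the key does not match. */
    static byte[] decrypt(SecretKey key, byte[] iv, byte[] encrypted) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));

        return cipher.doFinal(encrypted);
    }

    /** Uses the old key for pages not reported as reencrypted; otherwise tries the new key and falls back to the old one. */
    static byte[] decryptPage(boolean reportedReencrypted, SecretKey oldKey, SecretKey newKey,
        byte[] iv, byte[] encrypted) throws GeneralSecurityException {
        if (!reportedReencrypted)
            return decrypt(oldKey, iv, encrypted);

        try {
            return decrypt(newKey, iv, encrypted);
        }
        catch (AEADBadTagException falsePositive) {
            // The filter reported a page that is still encrypted with the old key.
            return decrypt(oldKey, iv, encrypted);
        }
    }
}
```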

...

Unblock the page for reencryption.

Process failover:

...

Scan the partition from the beginning to the last progress record point and add each page to the set of reencrypted pages.

...

After that, we have X pages that MAY have been reencrypted (or may not). We should find the first page that is not reencrypted:

Try to decrypt the page with the new key.
If decryption fails, the page is found.
The reencryption process should continue starting from that page.
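
The search for the restart point can be sketched as a linear scan over the uncertain range. The names `FailoverScanSketch`, `firstNotReencrypted` and the `decryptsWithNewKey` predicate are hypothetical; the predicate stands in for an attempt to decrypt the page at a given index with the new key.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch of locating the restart point after a failure; not Ignite code. */
class FailoverScanSketch {
    /**
     * Scans pages in [from, toExclusive) and returns the index of the first page that
     * does not decrypt with the new key, or toExclusive if every page decrypts.
     */
    static int firstNotReencrypted(int from, int toExclusive, IntPredicate decryptsWithNewKey) {
        for (int pageIdx = from; pageIdx < toExclusive; pageIdx++) {
            if (!decryptsWithNewKey.test(pageIdx))
                return pageIdx;
        }

        return toExclusive;
    }
}
```

Here `from` would be the last progress record point and `toExclusive` would be `from + X`, so a return value equal to `toExclusive` means all X uncertain pages had already been reencrypted.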

Process completion:

The administrator initiates process completion through an interface by using the "cache key removal" command.
The design assumes that the administrator checks that all nodes have successfully changed the cache key and reencrypted all pages, and that all required nodes are alive.

Cache group keys removal:

Process start:

The administrator initiates the process through some kind of user interface (CLI, Visor, WebConsole, JMX, etc.).

Process description:

The message is sent via discovery.

The message should contain:

the hash of the new cache key.

When a server node processes the message, the following actions are executed:

The received cache key hash is compared with the known cache key hash.
The previous cache key is removed from the MetaStore.
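
The handling of the removal message can be sketched as follows. The class and method names are hypothetical, the MetaStore is reduced to two fields, and the behaviour on a hash mismatch (the message is ignored and the previous key is kept) is an assumption, since the text does not say what happens in that case.

```java
import java.util.Arrays;

/** Hypothetical sketch of handling the "cache key removal" message; not Ignite code. */
class KeyRemovalSketch {
    /** Hash of the key this node currently uses for the cache group. */
    private final byte[] knownKeyHash;

    /** Previous key kept until removal is requested; null once removed. */
    private byte[] previousKey;

    KeyRemovalSketch(byte[] knownKeyHash, byte[] previousKey) {
        this.knownKeyHash = knownKeyHash.clone();
        this.previousKey = previousKey.clone();
    }

    /** Removes the previous key only if the received hash matches the known one. */
    boolean onKeyRemovalMessage(byte[] receivedKeyHash) {
        if (!Arrays.equals(receivedKeyHash, knownKeyHash))
            return false; // Assumed behaviour: keep the previous key on mismatch.

        previousKey = null;

        return true;
    }

    boolean hasPreviousKey() {
        return previousKey != null;
    }
}
```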

Monitoring command:

...

Input: cache ID.
Output:
- List of 6-tuples:
  - Node ID.
  - Reencryption process state.
  - Count of partitions to process.
  - Current partition index.
  - Current partition ID.
  - Count of processed pages in the current partition.
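
One element of the output list could be modelled as a small immutable value class. The class name, field names and field types below are assumptions made for illustration; only the six fields themselves and the IN PROGRESS and FINISHED states come from this page.

```java
import java.util.UUID;

/** Hypothetical value class for one row of the monitoring command output; not Ignite code. */
final class ReencryptionStatusSketch {
    /** States named on this page; others may exist. */
    enum State { IN_PROGRESS, FINISHED }

    final UUID nodeId;
    final State state;
    final int partitionsToProcess;
    final int currentPartitionIndex;
    final int currentPartitionId;
    final long processedPagesInCurrentPartition;

    ReencryptionStatusSketch(UUID nodeId, State state, int partitionsToProcess,
        int currentPartitionIndex, int currentPartitionId, long processedPagesInCurrentPartition) {
        this.nodeId = nodeId;
        this.state = state;
        this.partitionsToProcess = partitionsToProcess;
        this.currentPartitionIndex = currentPartitionIndex;
        this.currentPartitionId = currentPartitionId;
        this.processedPagesInCurrentPartition = processedPagesInCurrentPartition;
    }

    /** Partitions not yet started, excluding the one currently being processed. */
    int remainingPartitions() {
        return Math.max(0, partitionsToProcess - currentPartitionIndex - 1);
    }
}
```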

Risks and assumptions

TBD

Tickets

ASF JIRA: IGNITE-12843
