Motivation

Cache encryption key rotation is required when a key is compromised or at the end of its cryptoperiod (key validity period). In addition, this feature is a prerequisite for supporting encryption and decryption of existing caches in the future.

Security requirement

The Payment Card Industry Data Security Standard (PCI DSS) requires that key-management procedures include a defined cryptoperiod for each key type in use and define a process for key changes at the end of the defined cryptoperiod(s). An expired key should not be used to encrypt new data, but it can still be used for archived data; such keys should be strongly protected (sections 3.5 - 3.6) [1].
The maximum recommended key lifetime is 2 years [2], and in practice keys are typically expected to be changed every few months [3].

Key rotation in other systems

MS SQL Server provides rotation of the database encryption key with background re-encryption of existing data [4]. Oracle and MySQL do not provide an automatic procedure for rotating tablespace keys out of the box, although master key rotation is supported [5][6]. TDE is currently being developed for PostgreSQL, but support for tablespace key rotation is not planned [7].

Description

The overall process consists of the following steps:

  • Rotate the cache group key - add a new encryption key on each node and set it for writing.
  • Schedule background re-encryption of archived data and clean up the old key when it completes.

Process description

To support multiple keys for reading encrypted data, a key identifier must be stored on each encrypted page and in each encrypted WAL record (see more details). The key identifier is a sequential counter and must be the same on all nodes.

  1. Start distributed process CACHE_GROUP_KEY_CHANGE_PREPARE, in which each node
    1. verifies that re-encryption is not in progress for the specified cache group
    2. ensures that the new key identifier does not exist
  2. After successful completion of PREPARE, start distributed process CACHE_GROUP_KEY_CHANGE_FINISH, in which each node
    1. saves a logical WAL record (ENCRYPTION_STATUS_RECORD) with the current groups and key identifiers, to start re-encryption after logical recovery
    2. saves the new key in the metastore (as an inactive key)
    3. sets it for writing
    4. adds the mapping "WAL segment -> *old* key identifier" (to safely clean up the previous key in the future)
    5. starts background re-encryption of existing data

After the FINISH phase is complete, a new encryption key for writing is set on all nodes, i.e. the key change process is formally completed.
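The two phases above can be sketched for a single node as follows. This is a minimal, hypothetical model: the class `NodeKeyState` and all of its methods are illustrative names, not Ignite internals, and WAL logging, the metastore and the distributed coordination are deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-node model of the keys of one cache group (not an Ignite class). */
class NodeKeyState {
    /** Key identifier -> key bytes. */
    private final Map<Integer, byte[]> keys = new HashMap<>();

    /** WAL segment index -> old key identifier, kept so the old key can be cleaned up later. */
    private final Map<Long, Integer> walSegmentToOldKey = new HashMap<>();

    private int activeKeyId;
    private boolean reencryptionInProgress;

    NodeKeyState(int initialKeyId, byte[] initialKey) {
        keys.put(initialKeyId, initialKey);
        activeKeyId = initialKeyId;
    }

    /** PREPARE phase: validation only, no state is changed. */
    void prepare(int newKeyId) {
        if (reencryptionInProgress)
            throw new IllegalStateException("Re-encryption is in progress");
        if (keys.containsKey(newKeyId))
            throw new IllegalStateException("Key identifier already exists: " + newKeyId);
    }

    /** FINISH phase: store the new key, remember the old one, switch writes to the new key. */
    void finish(int newKeyId, byte[] newKey, long currentWalSegment) {
        keys.put(newKeyId, newKey);                              // stored, not yet used for writing
        walSegmentToOldKey.put(currentWalSegment, activeKeyId);  // "WAL segment -> old key identifier"
        activeKeyId = newKeyId;                                  // set for writing
        reencryptionInProgress = true;                           // background re-encryption starts
    }

    int activeKeyId() { return activeKeyId; }

    boolean reencryptionInProgress() { return reencryptionInProgress; }

    Integer oldKeyForSegment(long walSegment) { return walSegmentToOldKey.get(walSegment); }
}
```

In this model a second `prepare` call is rejected until re-encryption completes, which mirrors the first check of the PREPARE phase.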

Background re-encryption of existing data completes at some later point; the new "reencryptionFinished" cache group metric can be used to track re-encryption progress.

Background re-encryption

The process applies to all existing partitions, including the index partition.

Every time the cache group key changes, we store the current page count of the partition in the meta page (this value is used as the total page count to re-encrypt).

Scan all pages in the specified range (metaPageId + [offset -> total]):

  1. acquire page
    1. if the checkpoint is finished (after key change) and page is dirty - skip this page.
    2. if the checkpoint is not finished or page is not dirty
      1. lock page
      2. unlock page (dirty=true)
  2. release page
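The scan above can be illustrated with a small model in which each page is reduced to a single "dirty" flag. This is only a sketch under that assumption: `ReencryptionScan` and its parameters are hypothetical names, and page acquisition and locking are collapsed into setting the flag. The idea it shows is that marking a page dirty is enough, because the next checkpoint then writes the page with the current (new) key.

```java
/** Hypothetical model of the re-encryption page scan (not an Ignite class). */
class ReencryptionScan {
    /**
     * Scans pages in [offset, total) and marks them dirty so that the next
     * checkpoint rewrites them with the new key.
     *
     * @param dirty dirty flag per page, updated in place
     * @param checkpointFinishedAfterKeyChange whether a checkpoint completed after the key change
     * @return number of pages that were touched (locked and unlocked with dirty=true)
     */
    static int scan(boolean[] dirty, int offset, int total, boolean checkpointFinishedAfterKeyChange) {
        int touched = 0;

        for (int i = offset; i < total; i++) {
            // A page that is dirty after a post-rotation checkpoint will be
            // written with the new key anyway, so it is skipped.
            if (checkpointFinishedAfterKeyChange && dirty[i])
                continue;

            dirty[i] = true; // stands for: lock page, unlock page (dirty=true)
            touched++;
        }

        return touched;
    }
}
```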

Re-encryption progress is stored in the meta page (int offset, int total) and is updated during the checkpoint.

The process aborts only when a partition is destroyed.

At node startup, during partition initialization, if the total number of pages for re-encryption is greater than zero, the cache group is scheduled for re-encryption.

Cleanup old key

The old cache group encryption key will be removed when both of the following conditions are met:

  1. re-encryption has completed for the cache group (and at least one checkpoint has completed successfully after that)
  2. the last WAL segment in which the key was used has been removed
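The cleanup conditions above can be written as a single predicate. This sketch assumes that WAL segments have monotonically increasing indices and are removed oldest first, so "the last segment that used the key is removed" becomes an index comparison. That assumption, the class `OldKeyCleanup` and all parameter names are hypothetical and not taken from the Ignite code base.

```java
/** Hypothetical helper that decides whether an old key can be dropped (not an Ignite class). */
class OldKeyCleanup {
    /**
     * @param reencryptionCompleted     re-encryption of the cache group has finished
     * @param checkpointAfterCompletion at least one checkpoint completed after that
     * @param lastWalSegmentWithKey     index of the last WAL segment in which the old key was used
     * @param oldestRetainedWalSegment  index of the oldest WAL segment still present on disk
     */
    static boolean canRemoveOldKey(boolean reencryptionCompleted,
                                   boolean checkpointAfterCompletion,
                                   long lastWalSegmentWithKey,
                                   long oldestRetainedWalSegment) {
        // No page may still need the old key (conditions 1) and
        // no retained WAL segment may still need it (condition 2).
        return reencryptionCompleted
            && checkpointAfterCompletion
            && lastWalSegmentWithKey < oldestRetainedWalSegment;
    }
}
```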

Changes in memory page format

PageMetaIO and PagePartitionMetaIO

Re-encryption status requires an additional 8 bytes on the meta page of each partition.
Index partition uses PageMetaIO to read/write meta information.
Each other partition uses PagePartitionMetaIO to read/write meta information.

Partition meta starts just after the end of the page meta.

To store the additional 8 bytes, the partition meta is shifted by 8 bytes.

WAL delta records have also been modified to store re-encryption status.

Encrypted (persisted) page

Each encrypted page has reserved free space to store the CRC of the encrypted data.
The size of this free space depends on the size of the encryption block, but cannot be less than 8 bytes. The default Ignite encryption implementation (KeystoreEncryptionSpi) uses AES with a 16-byte block size.

One byte is added for the encryption key ID on each encrypted page (after the CRC).


(WAL records ENCRYPTED_RECORD and ENCRYPTED_DATA_RECORD have been changed accordingly)
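As an illustration of the page tail described above, the sketch below writes a CRC followed by a one-byte key ID after the encrypted payload and reads the key ID back. The 4-byte CRC32, the exact offsets, and the names `EncryptedPageTail`, `writeTail` and `readKeyId` are assumptions made for this example; the actual Ignite layout may differ.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical layout helper: [encrypted data][CRC][key ID] (not an Ignite class). */
class EncryptedPageTail {
    /** Assumption: the CRC is a 4-byte CRC32 value. */
    static final int CRC_SIZE = 4;

    /** One byte for the key identifier, as stated in the text. */
    static final int KEY_ID_SIZE = 1;

    /** Writes the CRC of the first encryptedLen bytes, then the key ID right after it. */
    static void writeTail(ByteBuffer page, int encryptedLen, int keyId) {
        CRC32 crc = new CRC32();

        // Requires a heap buffer created with ByteBuffer.allocate (array offset 0).
        crc.update(page.array(), 0, encryptedLen);

        page.putInt(encryptedLen, (int) crc.getValue());
        page.put(encryptedLen + CRC_SIZE, (byte) keyId);
    }

    /** Reads the key ID as an unsigned byte (0..255). */
    static int readKeyId(ByteBuffer page, int encryptedLen) {
        return page.get(encryptedLen + CRC_SIZE) & 0xFF;
    }
}
```

A single byte limits the identifier to 256 distinct values, so a reader has to treat it as unsigned, which is why `readKeyId` masks the value with `0xFF`.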

Fault tolerance

Distributed key rotation

Node join is rejected during the key rotation, but this limitation may be revised in the future.

When a node joins the cluster (before or after key rotation), it receives the current encryption keys used for writing for the cache groups. If a received encryption key is a new key, the node sets it for writing and starts the background re-encryption process (in other words, the node automatically "rotates" the encryption key when joining a cluster, if necessary).
Therefore, a node may leave the cluster during a key change, or may be absent and rejoin later (regardless of whether the baseline changes). In either case it will receive the new key and schedule re-encryption, if necessary.

Background re-encryption

If the node stops or fails during re-encryption, after restart it continues re-encryption from the stored offset:

  1. If the checkpoint failed, the node restores physical records from the WAL, as usual.
  2. If no checkpoint was invoked, re-encryption starts from the beginning using the saved logical WAL record (recorded during key rotation).

Risks and assumptions

  • Background re-encryption may affect performance. Performance impact can be managed using the following configuration options:
    1. reencryptionThreadCnt - number of threads used for re-encryption.
    2. reencryptionBatchSize - number of pages that are scanned during re-encryption under checkpoint lock.
    3. reencryptionRateLimit - page scanning speed limit in megabytes per second.
  • The WAL history may be insufficient to store all entries between checkpoints (this should be tuned carefully by setting the size of the WAL history appropriately and adjusting re-encryption performance).
  • The WAL history (for delta rebalancing) may be lost for all cache groups due to background re-encryption.
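To show how reencryptionBatchSize and reencryptionRateLimit interact, the following sketch computes the minimum time a batch has to take so that the scan stays within the rate limit. It is plain arithmetic, not Ignite's throttling code: the class `ReencryptionThrottle`, the treatment of a non-positive limit as "unlimited", and the reading of a megabyte as 1024 * 1024 bytes are all assumptions.

```java
/** Hypothetical rate-limit arithmetic for re-encryption batches (not an Ignite class). */
class ReencryptionThrottle {
    /**
     * @param batchSize         pages scanned per batch (reencryptionBatchSize)
     * @param pageSizeBytes     page size in bytes
     * @param rateLimitMbPerSec scan speed limit in megabytes per second (reencryptionRateLimit)
     * @return minimum duration of one batch in nanoseconds
     */
    static long minBatchNanos(int batchSize, int pageSizeBytes, double rateLimitMbPerSec) {
        // Assumption: a non-positive limit means throttling is disabled.
        if (rateLimitMbPerSec <= 0)
            return 0;

        double batchBytes = (double) batchSize * pageSizeBytes;
        double seconds = batchBytes / (rateLimitMbPerSec * 1024 * 1024);

        return (long) (seconds * 1_000_000_000L);
    }
}
```

For example, a batch of 256 pages of 4096 bytes is exactly 1 MB under these assumptions, so a limit of 1 MB per second allows at most one such batch per second per scanning thread.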

Process management

// TBD

Public API changes

IgniteEncryption

A new method will be introduced:

public IgniteFuture<Void> changeCacheGroupKey(Collection<String> cacheOrGroupNames)

Metrics

Re-encryption process state in CacheGroupMetrics

  • ReencryptionPagesLeft - (long) Total number of pages left for re-encryption.
  • ReencryptionFinished - (boolean) Indicates whether re-encryption is finished (it is set to true only when the checkpoint is finished).

Reference Links

  1. PCI DSS Requirements and Security Assessment Procedures
    https://www.pcisecuritystandards.org/documents/PCI_DSS_v3-2-1.pdf
  2. How Often Do I Need to Rotate Encryption Keys on My SQL Server?
    https://info.townsendsecurity.com/bid/49019/How-Often-Do-I-Need-to-Rotate-Encryption-Keys-on-My-SQL-Server
  3. PCI DSS and key rotations simplified
    https://www.crypteron.com/blog/pci-dss-key-rotations-simplified/
  4. Transparent Data Encryption in MS SQL Server
    https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/transparent-data-encryption?view=sql-server-ver15
  5. Oracle Transparent Data Encryption FAQ
    https://www.oracle.com/database/technologies/faq-tde.html
  6. InnoDB Data-at-Rest Encryption
    https://dev.mysql.com/doc/refman/8.0/en/innodb-data-encryption.html
  7. Transparent data encryption feature proposed in pgsql-hackers.
    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Key_Rotation
