Motivation

As the volume of stored information grows, stored data is becoming increasingly important to individuals, companies and governments. In addition to traditional storage criteria such as performance, capacity and reliability, security is becoming an important feature of storage systems, and some industries, such as banking and transportation, have mandatory legal requirements for it.

Using encrypted data has the following benefits:

  1. Confidentiality: Protection against unauthorized information leakage.
  2. Integrity: Protection against unauthorized modification of data.
  3. Authenticity: Assurance that data was written by an authorized party and has not been forged.

Other data lake frameworks, such as Hudi and Iceberg, already support data encryption:

  1. Hudi: Hudi already supports writing encrypted Parquet files using Spark. [1]
  2. Iceberg: Iceberg provides basic encryption interfaces so that users can add custom implementations.
  • EncryptionManager: Module for encrypting and decrypting table data files. [2]
  • KeyManagementClient: A minimal client interface for connecting to a key management service (KMS). [3]

As a data lake framework, Paimon needs to support data encryption in order to meet enterprise security standards. This document describes how to extend the current Paimon architecture to provide users with out-of-the-box encryption capabilities.


Goals

Engine independence

Encryption does not depend on the compute engine. When encryption is enabled, users can read and write Paimon tables with any engine (Flink, Spark, Java API).

Pluggable KMS

The key management service (KMS) is pluggable.

Pluggable Encryption Mechanism

The encryption mechanism is pluggable. The system provides envelope encryption and a plaintext mechanism by default, and users can extend it based on their needs.

High Performance

Minimize the interaction between Paimon and the KMS as much as possible to improve throughput.

Column level encryption

Use the native encryption features of Parquet and ORC.

Public Interfaces


KmsClient

Provide a KmsClient interface for connecting to a KMS.


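import java.io.Closeable;
import java.io.Serializable;

import org.apache.paimon.CoreOptions;

/** Client for a key management service (KMS). */
public interface KmsClient extends Serializable, Closeable {
    /** Configures the client from the table options. */
    void configure(CoreOptions options);

    /** Creates a new key in the KMS and returns its ID together with the key. */
    CreateKeyResult createKey();

    /** Returns the key stored in the KMS under the given ID. */
    byte[] getKey(String keyId);

    /** Encrypts a plaintext data key with the given key encryption key (KEK). */
    byte[] encryptDataKey(byte[] plaintext, byte[] kek);

    /** Decrypts an encrypted data key with the given key encryption key (KEK). */
    byte[] decryptDataKey(byte[] encryptedDataKey, byte[] kek);

    /** Returns the identifier used to select this KMS implementation. */
    String identifier();

    /** A newly created key and its ID. */
    class CreateKeyResult implements Serializable {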
        private final String keyId;
        private final byte[] key;

        public CreateKeyResult(String keyId, byte[] key) {
            this.keyId = keyId;
            this.key = key;
        }

        public String keyId() {
            return keyId;
        }

        public byte[] key() {
            return key;
        }
    }
}
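
To make the contract more concrete, the following is a minimal in-memory sketch of the operations that KmsClient exposes. It is not the proposed implementation. The class name InMemoryKms, the use of AES-GCM to wrap the data key, and the layout of the wrapped key (IV followed by ciphertext and tag) are assumptions made for illustration. To stay self-contained, the sketch does not implement the real interface: it omits configure, identifier and close, and createKey returns only the key ID.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical in-memory key store that mirrors the KmsClient operations. */
class InMemoryKms {
    private static final int IV_BYTES = 12;  // common nonce size for AES-GCM
    private static final int TAG_BITS = 128; // GCM authentication tag length

    private final Map<String, byte[]> keys = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    /** Creates a random 128-bit key, stores it, and returns its generated ID. */
    String createKey() {
        byte[] key = new byte[16];
        random.nextBytes(key);
        String keyId = UUID.randomUUID().toString();
        keys.put(keyId, key);
        return keyId;
    }

    /** Returns a copy of the key registered under keyId. */
    byte[] getKey(String keyId) {
        byte[] key = keys.get(keyId);
        if (key == null) {
            throw new IllegalArgumentException("Unknown key id: " + keyId);
        }
        return key.clone();
    }

    /** Wraps a data key with the KEK; the output is the IV followed by ciphertext and tag. */
    byte[] encryptDataKey(byte[] plaintext, byte[] kek) {
        try {
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(kek, "AES"),
                    new GCMParameterSpec(TAG_BITS, iv));
            byte[] sealed = cipher.doFinal(plaintext);
            return ByteBuffer.allocate(IV_BYTES + sealed.length).put(iv).put(sealed).array();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Failed to encrypt data key", e);
        }
    }

    /** Unwraps a data key produced by encryptDataKey using the same KEK. */
    byte[] decryptDataKey(byte[] encryptedDataKey, byte[] kek) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(kek, "AES"),
                    new GCMParameterSpec(TAG_BITS, encryptedDataKey, 0, IV_BYTES));
            return cipher.doFinal(encryptedDataKey, IV_BYTES, encryptedDataKey.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Failed to decrypt data key", e);
        }
    }
}
```

A store like this keeps key material in process memory, so it is only suitable for tests; a real client would delegate these calls to the external KMS.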


EncryptionManager

Provide an interface for encrypting and decrypting table data files.


/** Module for encrypting and decrypting table data files. */
public interface EncryptionManager extends Serializable {

    KeyMetadata encrypt(KmsClient.CreateKeyResult createKeyResult);

    byte[] decrypt(KeyMetadata keyMetadata);

    String identifier();

    void configure(CoreOptions options);

}


Proposed Changes

Basic concepts

1.  KMS
    Key Management Service (KMS) is a service that makes it easy for you to create and control the cryptographic keys that are used to protect your data.

    A KMS can be built with open-source software, such as Hadoop KMS and Ranger, or provided by a managed service from a public cloud, such as AWS and Aliyun.

2.  Envelope encryption
    Envelope encryption is the practice of encrypting plaintext data with a data key, and then encrypting the data key under another key.

3.  Symmetric-key algorithms
    Symmetric-key algorithms are algorithms for cryptography that use the same cryptographic keys for both the encryption of plaintext and the decryption of ciphertext.[4]

4.  Data Encryption Key (DEK)
    A key used to encrypt a data file. In this design, every data file has a unique DEK.

5.  Paimon Encryption Key (PEK)
    A key used to encrypt DEKs.

6.  Encryption algorithm
    *   AES GCM
        AES GCM is an authenticated encryption mode. Besides data confidentiality (encryption), it supports two levels of integrity verification (authentication): of the data alone (default), and of the data combined with an optional AAD (“additional authenticated data”). Authentication makes it possible to verify that the data has not been tampered with.
    *   AES CTR
        AES CTR is a regular (not authenticated) cipher. It is faster than the GCM cipher, since it doesn’t perform integrity verification.
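
The practical difference between the two modes can be illustrated with the standard javax.crypto API. The sketch below encrypts a message, flips one bit of the ciphertext, and reports whether decryption rejects the modified input. The class name CipherModes, the sample message and the AAD value are assumptions for illustration only; this is not code from Paimon, Parquet or ORC.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.spec.AlgorithmParameterSpec;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative comparison of AES-GCM and AES-CTR when the ciphertext is modified. */
class CipherModes {

    /** Encrypts a message, flips one ciphertext bit, and reports whether decryption rejects it. */
    static boolean detectsTampering(boolean useGcm) {
        try {
            SecureRandom random = new SecureRandom();
            byte[] key = new byte[16];
            random.nextBytes(key);
            // GCM conventionally uses a 12-byte nonce; CTR uses a 16-byte counter block.
            byte[] iv = new byte[useGcm ? 12 : 16];
            random.nextBytes(iv);
            byte[] aad = "file-0001".getBytes(StandardCharsets.UTF_8); // hypothetical AAD

            byte[] ciphertext = apply(Cipher.ENCRYPT_MODE, useGcm, key, iv, aad,
                    "sensitive row".getBytes(StandardCharsets.UTF_8));
            ciphertext[0] ^= 0x01; // simulate tampering with the stored bytes

            try {
                apply(Cipher.DECRYPT_MODE, useGcm, key, iv, aad, ciphertext);
                return false; // decryption succeeded, so the change went unnoticed
            } catch (AEADBadTagException e) {
                return true; // the authentication tag did not match
            }
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Runs one AES operation in the selected mode; the AAD is only used by GCM. */
    private static byte[] apply(int mode, boolean useGcm, byte[] key, byte[] iv,
            byte[] aad, byte[] input) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance(useGcm ? "AES/GCM/NoPadding" : "AES/CTR/NoPadding");
        AlgorithmParameterSpec spec =
                useGcm ? new GCMParameterSpec(128, iv) : new IvParameterSpec(iv);
        cipher.init(mode, new SecretKeySpec(key, "AES"), spec);
        if (useGcm) {
            cipher.updateAAD(aad);
        }
        return cipher.doFinal(input);
    }
}
```

With GCM the modified ciphertext is rejected, while with CTR decryption succeeds and silently returns altered plaintext. This is the trade-off behind CTR's higher speed.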

Design

Encryption level

The first version simplifies the design: only data files in Parquet and ORC formats are encrypted, metadata files and Avro data files are not encrypted, and a single key is used to encrypt the entire data file. In the future, we will continue to optimize and add new features, such as encrypting Avro data files, encrypting metadata, using separate keys to encrypt the footer and the columns of Parquet files, and using different algorithms for different columns.

Encryption Manager

Provide a basic module for encrypting and decrypting table data files.

Provide two basic implementation classes: one for standard envelope encryption and one for plaintext (no encryption). By default, plaintext mode is used.

Users can extend this module to implement more encryption mechanisms.
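
One way such pluggability could work is to resolve an implementation by the value returned from identifier(). The sketch below is hypothetical: the MechanismRegistry class, the nested Mechanism interface and the identifiers "plaintext" and "envelope" are assumptions, and the proposal does not specify how implementations are discovered.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical registry that resolves an encryption mechanism by its identifier. */
class MechanismRegistry {

    /** Stand-in for the identifier() part of the EncryptionManager interface. */
    interface Mechanism {
        String identifier();
    }

    private static final String DEFAULT_IDENTIFIER = "plaintext"; // assumed default name

    private final Map<String, Supplier<Mechanism>> factories = new HashMap<>();

    /** Registers a factory for the mechanism with the given identifier. */
    void register(String identifier, Supplier<Mechanism> factory) {
        factories.put(identifier, factory);
    }

    /** Returns the configured mechanism, or the plaintext one when nothing is configured. */
    Mechanism resolve(String configured) {
        String identifier = configured == null ? DEFAULT_IDENTIFIER : configured;
        Supplier<Mechanism> factory = factories.get(identifier);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown encryption mechanism: " + identifier);
        }
        return factory.get();
    }
}
```

Under this assumption, a user-defined mechanism only needs to be registered under a new identifier and selected through configuration.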

KMS client

For data security, all PEKs are stored in the KMS. The system provides clients for some commonly used KMSs, such as Ranger and AWS. Users can implement this interface to support additional KMSs.

We will also provide a mock KMS. It is intended for unit testing only and is not recommended for production environments.

Encryption format

*   At present, encryption is provided only for the Parquet and ORC formats. If users enable encryption, all fields are encrypted by default. Users who want to encrypt only some fields can specify the corresponding configuration.
*   ORC needs to be upgraded to the latest version (encryption is supported from version 1.6 onward).
*   The first version does not provide encryption for the Avro format or for metadata. This will be added in subsequent versions.

Algorithm

Provide the CTR and GCM encryption algorithms. To ensure read and write performance, the CTR algorithm is used by default. Because CTR is not authenticated, users who need integrity verification should select GCM.

Store

The PEK ID is stored in the snapshot file, and each table has a unique PEK. When a key is needed to encrypt or decrypt data, the PEK ID is used to obtain the corresponding key from the KMS.

Each data file has a corresponding KeyMetadata that stores the encryption algorithm, the encrypted data key, and so on. KeyMetadata is stored in the corresponding manifest file.
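
As an illustration of what such a per-file record could look like, here is a minimal sketch. The class name KeyMetadataSketch, the two fields, the algorithm names and the byte layout are assumptions; the actual KeyMetadata class and its serialized form are not defined in this document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical per-file key metadata; the field set and byte layout are assumptions. */
final class KeyMetadataSketch {
    final String algorithm;        // for example "AES_CTR" or "AES_GCM" (names assumed)
    final byte[] encryptedDataKey; // the DEK encrypted with the table's PEK

    KeyMetadataSketch(String algorithm, byte[] encryptedDataKey) {
        this.algorithm = algorithm;
        this.encryptedDataKey = encryptedDataKey.clone();
    }

    /** Serializes the record so that it can be stored next to the file entry. */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(algorithm);
            out.writeInt(encryptedDataKey.length);
            out.write(encryptedDataKey);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads a record previously produced by toBytes. */
    static KeyMetadataSketch fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String algorithm = in.readUTF();
            byte[] key = new byte[in.readInt()];
            in.readFully(key);
            return new KeyMetadataSketch(algorithm, key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only the encrypted form of the data key appears in the record, so reading a manifest file alone is not enough to decrypt a data file.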

Reading and writing process

Write

1.  Get a PEK through the KMS client.
2.  Generate a plaintext data key locally in the EncryptionManager, then encrypt it with the PEK to obtain the encrypted data key.
3.  Encrypt the data file with the plaintext data key.
4.  Write the encrypted data key to the manifest file.
5.  Write the PEK ID to the snapshot file.


Read

1.  Get the PEK ID from the snapshot file.
2.  Get the plaintext PEK from the KMS using the PEK ID.
3.  Get the encrypted data key from the manifest file.
4.  Decrypt the data key with the plaintext PEK.
5.  Decrypt the data file with the plaintext data key.
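
The write and read steps above can be sketched end to end as follows. This is a simplified illustration under several assumptions: the class name EnvelopeRoundTrip and the key ID "pek-1" are hypothetical, the KMS, the snapshot file and the manifest file are replaced by in-memory fields, and AES-GCM is used both for the data key and for the file content to keep the code short (the design above uses CTR for data by default and relies on the Parquet and ORC writers for file encryption).

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical end-to-end envelope encryption round trip with in-memory stand-ins. */
class EnvelopeRoundTrip {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final String PEK_ID = "pek-1"; // hypothetical key ID
    private static final int IV_BYTES = 12;

    private final Map<String, byte[]> kms = new HashMap<>(); // stand-in for the KMS
    private String snapshotPekId;                             // stand-in for the snapshot file
    private byte[] manifestEncryptedDek;                      // stand-in for the manifest file

    byte[] write(byte[] fileContent) {
        byte[] pek = kms.computeIfAbsent(PEK_ID, id -> randomKey()); // 1. get a PEK
        byte[] dek = randomKey();                                    // 2. generate the DEK locally
        byte[] encryptedDek = seal(pek, dek);                        //    and encrypt it with the PEK
        byte[] encryptedFile = seal(dek, fileContent);               // 3. encrypt the data file
        manifestEncryptedDek = encryptedDek;                         // 4. record the encrypted DEK
        snapshotPekId = PEK_ID;                                      // 5. record the PEK ID
        return encryptedFile;
    }

    byte[] read(byte[] encryptedFile) {
        String pekId = snapshotPekId;               // 1. PEK ID from the snapshot
        byte[] pek = kms.get(pekId);                // 2. plaintext PEK from the KMS
        byte[] encryptedDek = manifestEncryptedDek; // 3. encrypted DEK from the manifest
        byte[] dek = open(pek, encryptedDek);       // 4. decrypt the DEK
        return open(dek, encryptedFile);            // 5. decrypt the data file
    }

    private static byte[] randomKey() {
        byte[] key = new byte[16];
        RANDOM.nextBytes(key);
        return key;
    }

    /** Encrypts with AES-GCM and returns the IV followed by ciphertext and tag. */
    private static byte[] seal(byte[] key, byte[] plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new GCMParameterSpec(128, iv));
            byte[] sealed = cipher.doFinal(plaintext);
            return ByteBuffer.allocate(IV_BYTES + sealed.length).put(iv).put(sealed).array();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reverses seal using the same key. */
    private static byte[] open(byte[] key, byte[] sealed) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new GCMParameterSpec(128, sealed, 0, IV_BYTES));
            return cipher.doFinal(sealed, IV_BYTES, sealed.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this flow the KMS is contacted once per table to obtain the PEK, while data keys are generated and unwrapped locally, which matches the goal of minimizing interaction with the KMS.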


Documentation

This feature involves many concepts specific to the field of data encryption. We need to provide comprehensive documentation to guide users on how to operate it.

Compatibility, Deprecation, and Migration Plan

Compatibility

By default, plaintext mode is used, so users can continue to read and write existing tables as before. To enable encryption, users specify the corresponding encryption parameters when creating or altering the table.

Migration

If a user wants to migrate a previously unencrypted table to an encrypted table, they first need to modify the table's properties and add the corresponding encryption parameters. Then, as subsequent compactions rewrite data files, the system writes the new files as encrypted data files, which achieves a smooth migration.

Test Plan

Unit Tests

Encrypt and decrypt data using the built-in mock KMS in unit tests.

Benchmark Tests

Provide test cases in the paimon-micro-benchmarks module to measure read and write performance with encryption enabled.

Rejected Alternatives



[1] https://hudi.apache.org/docs/encryption/
[2] https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/api/src/main/java/org/apache/iceberg/encryption/EncryptionManager.java#L32
[3] https://github.com/apache/iceberg/blob/1e57760394583889f2cb7fb87d021471e8c46f0c/core/src/main/java/org/apache/iceberg/encryption/KeyManagementClient.java#L27
[4] https://en.wikipedia.org/wiki/Symmetric-key_algorithm
