Discussion thread
Vote thread
ISSUE
Release 0.6


Motivation

As enterprises grow, data security becomes increasingly important, and some industries, such as banking and transportation, are subject to mandatory legal requirements for it. As a data lake framework, Paimon needs to support data encryption in order to meet enterprise security standards. This document describes how to extend the current Paimon architecture to provide users with out-of-the-box encryption capabilities.

Public Interfaces


KmsClient

Provide a KmsClient interface to connect to KMS.


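import org.apache.paimon.CoreOptions;

import java.io.Closeable;
import java.io.Serializable;

/** Client used to connect to a Key Management Service (KMS). */
public interface KmsClient extends Serializable, Closeable {
    /** Configures the client from the table options. */
    void configure(CoreOptions options);

    /** Creates a new key and returns its id together with the key bytes. */
    CreateKeyResult createKey();

    /** Returns the key bytes for the given key id. */
    byte[] getKey(String keyId);

    /** Encrypts a plaintext data key under the given key encryption key (KEK). */
    byte[] encryptDataKey(byte[] plaintext, byte[] kek);

    /** Decrypts an encrypted data key with the given KEK. */
    byte[] decryptDataKey(byte[] encryptedDataKey, byte[] kek);

    /** Returns the identifier of this KMS client implementation. */
    String identifier();

    /** Result of {@link #createKey()}: the key id and the key bytes. */
    class CreateKeyResult implements Serializable {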
        private final String keyId;
        private final byte[] key;

        public CreateKeyResult(String keyId, byte[] key) {
            this.keyId = keyId;
            this.key = key;
        }

        public String keyId() {
            return keyId;
        }

        public byte[] key() {
            return key;
        }
    }
}


EncryptionManager

Provide an interface for encrypting and decrypting table data files.


/** Module for encrypting and decrypting table data files. */
public interface EncryptionManager extends Serializable {

    KeyMetadata encrypt(KmsClient.CreateKeyResult createKeyResult);

    byte[] decrypt(KeyMetadata keyMetadata);

    String identifier();

    void configure(CoreOptions options);

}


Proposed Changes

Basic concepts

1.  KMS
    A Key Management Service (KMS) is a service for creating and controlling the cryptographic keys that are used to protect your data.

    A KMS can be built with open-source software, such as Hadoop KMS or Apache Ranger KMS, or provided as a managed service by a public cloud, such as AWS KMS or Alibaba Cloud (Aliyun) KMS.

2.  Envelope encryption
    Envelope encryption is the practice of encrypting plaintext data with a data key, and then encrypting the data key under another key.

3.  Data Encryption Key (DEK)
    A key used to encrypt a data file. In this design, every data file has a unique DEK.

4.  Paimon Encryption Key (PEK)
    A key used to encrypt DEKs.

5.  Encryption algorithm
    *   AES GCM
        AES GCM is an authenticated encryption mode. Besides data confidentiality (encryption), it supports two levels of integrity verification (authentication): of the data alone (the default), and of the data combined with optional AAD (“additional authenticated data”). Authentication makes it possible to verify that the data has not been tampered with.
    *   AES CTR
        AES CTR is a regular (not authenticated) cipher. It is faster than the GCM cipher, since it doesn’t perform integrity verification.
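
The difference between the two algorithms can be illustrated with the JDK's built-in `javax.crypto` API. The following is a minimal sketch, not Paimon code: the class name `AesModeSketch`, the helper names, and the sample values are assumptions made for illustration. It flips one bit of the ciphertext and shows that GCM rejects the tampered data, while CTR decrypts it without any error.

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;

/** Hypothetical helper comparing AES GCM and AES CTR; not part of Paimon. */
final class AesModeSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    static byte[] randomBytes(int n) {
        byte[] bytes = new byte[n];
        RANDOM.nextBytes(bytes);
        return bytes;
    }

    /** AES GCM with a 96-bit nonce, a 128-bit tag, and optional AAD (may be null). */
    static byte[] gcm(int mode, byte[] key, byte[] nonce, byte[] aad, byte[] input)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(mode, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, nonce));
        if (aad != null) {
            cipher.updateAAD(aad);
        }
        return cipher.doFinal(input);
    }

    /** AES CTR with a 128-bit initial counter block; no authentication tag is produced. */
    static byte[] ctr(int mode, byte[] key, byte[] iv, byte[] input)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher.doFinal(input);
    }

    /** Returns true if GCM decryption fails the tag check for the given ciphertext. */
    static boolean gcmRejects(byte[] key, byte[] nonce, byte[] aad, byte[] ciphertext)
            throws GeneralSecurityException {
        try {
            gcm(Cipher.DECRYPT_MODE, key, nonce, aad, ciphertext);
            return false;
        } catch (AEADBadTagException e) {
            return true;
        }
    }

    public static void main(String[] args) throws GeneralSecurityException {
        byte[] key = randomBytes(16);
        byte[] data = "row-1,row-2".getBytes(StandardCharsets.UTF_8);

        // GCM: the tag covers both the ciphertext and the AAD.
        byte[] nonce = randomBytes(12);
        byte[] aad = "file-0001".getBytes(StandardCharsets.UTF_8);
        byte[] gcmCiphertext = gcm(Cipher.ENCRYPT_MODE, key, nonce, aad, data);
        gcmCiphertext[0] ^= 1; // flip one bit
        System.out.println("GCM rejects tampered data: "
                + gcmRejects(key, nonce, aad, gcmCiphertext));

        // CTR: the same bit flip goes unnoticed and changes the decrypted data.
        byte[] iv = randomBytes(16);
        byte[] ctrCiphertext = ctr(Cipher.ENCRYPT_MODE, key, iv, data);
        ctrCiphertext[0] ^= 1;
        byte[] altered = ctr(Cipher.DECRYPT_MODE, key, iv, ctrCiphertext);
        System.out.println("CTR decrypts altered data: " + !Arrays.equals(data, altered));
    }
}
```

The GCM ciphertext is 16 bytes longer than the plaintext because of the authentication tag, whereas the CTR ciphertext has the same length as the plaintext.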

Design

Encryption Manager

Provide a basic module for encrypting and decrypting table data files.

Provide two basic implementation classes: one for standard envelope encryption, and one for plaintext (not encrypted). By default, the plaintext mode is used.

Users can extend this module to implement more encryption mechanisms.

Kms client

For data security, all PEKs are currently stored in the KMS. The system provides clients for some commonly used KMS implementations, such as Ranger and AWS. Users can implement this interface to support additional KMS clients.

We will also provide a mock KMS, which is intended for unit testing only and is not recommended for production environments.
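
As an illustration of what such a mock could look like, here is a minimal in-memory sketch. It mirrors the method names of the `KmsClient` interface but deliberately does not implement it, so that the example stays self-contained. The class name `InMemoryKmsSketch`, the use of AES GCM for wrapping the data key, and the `nonce || ciphertext || tag` layout are all assumptions, not the proposed implementation.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory KMS for tests only; keys live in process memory. */
final class InMemoryKmsSketch {
    private static final int NONCE_LEN = 12;
    private static final int TAG_BITS = 128;

    private final Map<String, byte[]> keys = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    /** Creates a 128-bit AES key, stores it under a random id, and returns the id. */
    String createKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128, random);
        String keyId = UUID.randomUUID().toString();
        keys.put(keyId, generator.generateKey().getEncoded());
        return keyId;
    }

    /** Returns a copy of the key stored under the given id. */
    byte[] getKey(String keyId) {
        byte[] key = keys.get(keyId);
        if (key == null) {
            throw new IllegalArgumentException("Unknown key id: " + keyId);
        }
        return key.clone();
    }

    /** Wraps the data key with AES GCM; the result is nonce || ciphertext || tag. */
    byte[] encryptDataKey(byte[] plaintext, byte[] kek) throws GeneralSecurityException {
        byte[] nonce = new byte[NONCE_LEN];
        random.nextBytes(nonce);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(kek, "AES"),
                new GCMParameterSpec(TAG_BITS, nonce));
        byte[] ciphertext = cipher.doFinal(plaintext);
        return ByteBuffer.allocate(NONCE_LEN + ciphertext.length)
                .put(nonce).put(ciphertext).array();
    }

    /** Unwraps a data key produced by encryptDataKey. */
    byte[] decryptDataKey(byte[] encryptedDataKey, byte[] kek)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(kek, "AES"),
                new GCMParameterSpec(TAG_BITS, encryptedDataKey, 0, NONCE_LEN));
        return cipher.doFinal(
                encryptedDataKey, NONCE_LEN, encryptedDataKey.length - NONCE_LEN);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        InMemoryKmsSketch kms = new InMemoryKmsSketch();
        byte[] pek = kms.getKey(kms.createKey());
        byte[] dek = new byte[16];
        new SecureRandom().nextBytes(dek);
        byte[] wrapped = kms.encryptDataKey(dek, pek);
        System.out.println(Arrays.equals(dek, kms.decryptDataKey(wrapped, pek)));
    }
}
```

A real client would delegate key storage to the external KMS; keeping the keys in a map is acceptable only because the sketch targets unit tests.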

Encryption format

*   At present, encryption is provided only for the Parquet and ORC formats. If users enable encryption, all fields are encrypted by default. Users who want to encrypt only some of the fields can specify the corresponding configuration.
*   ORC needs to be upgraded to the latest version (encryption is supported from version 1.6 onwards).
*   The first version does not provide encryption for the Avro format or for metadata; this will be improved in subsequent versions.

Algorithm

Provide the CTR and GCM encryption algorithms. To ensure read and write performance, the CTR algorithm is used by default.

Store

The PEK ID is stored in the snapshot file, and each table has a unique PEK. When a key is needed to encrypt or decrypt data, the key ID is used to obtain the corresponding key from the KMS.

Each data file has a corresponding KeyMetadata that stores the encryption algorithm, the encrypted data key, and so on. The KeyMetadata is stored in the corresponding manifest file.
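
To make the idea concrete, the sketch below shows one possible shape for such a record and a simple binary encoding for it. This is an assumption for illustration only: the class name `KeyMetadataSketch`, the two fields, and the byte layout are hypothetical, and the actual `KeyMetadata` class and its manifest encoding are not specified in this document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical per-file key record: the algorithm name and the encrypted data key. */
final class KeyMetadataSketch {
    final String algorithm;
    final byte[] encryptedDataKey;

    KeyMetadataSketch(String algorithm, byte[] encryptedDataKey) {
        this.algorithm = algorithm;
        this.encryptedDataKey = encryptedDataKey.clone();
    }

    /** Encodes the record as: UTF string, 4-byte key length, key bytes. */
    byte[] serialize() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(algorithm);
            out.writeInt(encryptedDataKey.length);
            out.write(encryptedDataKey);
        }
        return bytes.toByteArray();
    }

    /** Decodes a record produced by serialize(). */
    static KeyMetadataSketch deserialize(byte[] serialized) throws IOException {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(serialized))) {
            String algorithm = in.readUTF();
            byte[] key = new byte[in.readInt()];
            in.readFully(key);
            return new KeyMetadataSketch(algorithm, key);
        }
    }

    public static void main(String[] args) throws IOException {
        KeyMetadataSketch original = new KeyMetadataSketch("AES_GCM", new byte[] {1, 2, 3});
        KeyMetadataSketch copy = deserialize(original.serialize());
        System.out.println(copy.algorithm.equals(original.algorithm)
                && Arrays.equals(copy.encryptedDataKey, original.encryptedDataKey));
    }
}
```

Only the encrypted data key is persisted; the plaintext data key never needs to be written to the manifest.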

Reading and writing process

Write

1.  Get a PEK through the KMS client.
2.  Generate a plaintext data key locally with the EncryptionManager, and encrypt it with the PEK to obtain the encrypted data key.
3.  Encrypt the data file with the plaintext data key.
4.  Write the encrypted data key to the manifest file.
5.  Write the PEK ID to the snapshot file.


Read

1.  Get the PEK ID from the snapshot file.
2.  Get the plaintext PEK from the KMS using the PEK ID.
3.  Get the encrypted data key from the manifest file.
4.  Decrypt the data key with the plaintext PEK.
5.  Decrypt the data file with the plaintext data key.
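
The write and read steps above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the proposed implementation: plain maps stand in for the KMS, the manifest, and the file store, a single field stands in for the snapshot, and AES GCM is used both for the data and for wrapping the data key to keep the example short. The class name `EnvelopeFlowSketch` and the helpers `seal` and `open` are hypothetical.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical envelope-encryption flow; maps stand in for the real storage. */
final class EnvelopeFlowSketch {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final int NONCE_LEN = 12;

    final Map<String, byte[]> kms = new HashMap<>();      // PEK id -> plaintext PEK
    final Map<String, byte[]> manifest = new HashMap<>(); // file name -> encrypted DEK
    final Map<String, byte[]> files = new HashMap<>();    // file name -> encrypted content
    String snapshotPekId;                                 // PEK id kept in the snapshot

    static byte[] randomBytes(int n) {
        byte[] bytes = new byte[n];
        RANDOM.nextBytes(bytes);
        return bytes;
    }

    /** Encrypts with AES GCM and returns nonce || ciphertext || tag. */
    static byte[] seal(byte[] key, byte[] plaintext) throws GeneralSecurityException {
        byte[] nonce = randomBytes(NONCE_LEN);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                new GCMParameterSpec(128, nonce));
        byte[] ciphertext = cipher.doFinal(plaintext);
        return ByteBuffer.allocate(NONCE_LEN + ciphertext.length)
                .put(nonce).put(ciphertext).array();
    }

    /** Decrypts a value produced by seal. */
    static byte[] open(byte[] key, byte[] sealed) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                new GCMParameterSpec(128, sealed, 0, NONCE_LEN));
        return cipher.doFinal(sealed, NONCE_LEN, sealed.length - NONCE_LEN);
    }

    void write(String pekId, String fileName, byte[] content)
            throws GeneralSecurityException {
        byte[] pek = kms.get(pekId);              // 1. get the PEK from the KMS
        byte[] dek = randomBytes(16);             // 2. generate a data key locally
        byte[] encryptedDek = seal(pek, dek);     //    and encrypt it with the PEK
        files.put(fileName, seal(dek, content));  // 3. encrypt the data file
        manifest.put(fileName, encryptedDek);     // 4. keep the encrypted data key
        snapshotPekId = pekId;                    // 5. keep the PEK id
    }

    byte[] read(String fileName) throws GeneralSecurityException {
        String pekId = snapshotPekId;                  // 1. PEK id from the snapshot
        byte[] pek = kms.get(pekId);                   // 2. plaintext PEK from the KMS
        byte[] encryptedDek = manifest.get(fileName);  // 3. encrypted data key
        byte[] dek = open(pek, encryptedDek);          // 4. decrypt the data key
        return open(dek, files.get(fileName));         // 5. decrypt the data file
    }

    public static void main(String[] args) throws GeneralSecurityException {
        EnvelopeFlowSketch table = new EnvelopeFlowSketch();
        table.kms.put("pek-1", randomBytes(16));
        table.write("pek-1", "data-0.parquet", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(table.read("data-0.parquet"), StandardCharsets.UTF_8));
    }
}
```

Because each file gets its own data key, only the small encrypted data key depends on the PEK, while the bulk data is encrypted locally without a KMS call per file.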


Document

This feature involves many concepts specific to the field of data encryption. We need to provide comprehensive documentation that guides users through operating it.

Compatibility, Deprecation, and Migration Plan

Compatibility

By default, plaintext mode is used, so users can continue to read and write existing tables as before. Users who want to enable encryption can specify the corresponding encryption parameters when creating or altering the table.

Migration

To migrate an existing unencrypted table to an encrypted table, users first need to modify the table's properties and add the corresponding encryption parameters. Then, as compaction subsequently rewrites data, the system writes the new data files in encrypted form, which achieves a smooth migration.

Test Plan

Unit Tests

Encrypt and decrypt data using the built-in mock KMS so that the feature can be covered by unit tests.

Benchmark Tests

Provide test cases in the paimon-micro-benchmarks module to verify read and write performance with encryption enabled.

Rejected Alternatives

.
