Proposers

Approvers

Status

Current state: IN PROGRESS

Discussion thread: TODO

JIRA: TODO

Released: <Hudi Version>


Abstract

We should allow the HoodieRecordKey to not be stored on disk, to help reduce Hudi's storage footprint.

Background

The HoodieRecordKey column often contains data that is already present in other columns - either in a single other column, or in a combination of columns. This inflates the storage footprint of Hudi tables. For example, if a table's record key is built as host_timestamp, every key value duplicates data already stored in the host and timestamp columns. We could therefore reduce the storage footprint of all Hudi tables by not storing `_hoodie_record_key` on disk, and instead adding configurations that tell Hudi how to construct the key for a given row.

Implementation

Defining a Row Key

We will provide the following configurations, taking inspiration from ComplexKeyGenerator in Hoodie (an example of setting them follows the list):

  • hoodie.recordkey.virtual.enabled
    • Boolean value controlling whether the recordKey is written to disk
    • Ex: true
  • hoodie.recordkey.virtual.columns
    • Comma separated list of column names
    • Ex: host,hadoop_timestamp,hadoop_row_key
  • hoodie.recordkey.virtual.concatenator
    • String to use as a separator between columns
    • Ex: “_”
  • hoodie.recordkey.virtual.class
    • Class that creates the recordKey given columns.
    • Ex: com.uber.marmaray.SimpleRecordKey
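
A minimal sketch of how a writer job might supply the proposed configurations, assuming they are passed as plain key/value properties. The property keys are the ones proposed above; the values and the surrounding class are hypothetical:

    import java.util.Properties;

    public class VirtualKeyWriteConfigExample {  // illustrative only
      public static Properties virtualKeyProps() {
        Properties props = new Properties();
        // Do not materialize _hoodie_record_key in the parquet files.
        props.setProperty("hoodie.recordkey.virtual.enabled", "true");
        // Columns whose values together identify a record.
        props.setProperty("hoodie.recordkey.virtual.columns", "host,hadoop_timestamp,hadoop_row_key");
        // Separator placed between the column values.
        props.setProperty("hoodie.recordkey.virtual.concatenator", "_");
        // Class that knows how to build the key from those columns.
        props.setProperty("hoodie.recordkey.virtual.class", "com.uber.marmaray.SimpleRecordKey");
        return props;
      }
    }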


These configurations will be used by an enhanced implementation of the KeyGenerator abstract class. Customers can also implement this class themselves, should they want custom logic for creating a RecordKey from data in other columns.
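
As a rough sketch of the key-construction logic such an implementation might carry, the snippet below rebuilds the key from an Avro GenericRecord. It is a standalone illustration rather than the actual KeyGenerator contract; the class and method names are made up for this example:

    import java.util.List;
    import org.apache.avro.generic.GenericRecord;

    public class SimpleVirtualRecordKeyBuilder {  // illustrative only
      private final List<String> keyColumns;  // from hoodie.recordkey.virtual.columns
      private final String separator;         // from hoodie.recordkey.virtual.concatenator

      public SimpleVirtualRecordKeyBuilder(List<String> keyColumns, String separator) {
        this.keyColumns = keyColumns;
        this.separator = separator;
      }

      // Reconstructs the record key by concatenating the configured data columns.
      public String buildRecordKey(GenericRecord record) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < keyColumns.size(); i++) {
          Object value = record.get(keyColumns.get(i));
          if (value == null) {
            // Per the expectations listed below, null key columns are an error.
            throw new IllegalStateException("Virtual key column is null: " + keyColumns.get(i));
          }
          if (i > 0) {
            key.append(separator);
          }
          key.append(value.toString());
        }
        return key.toString();
      }
    }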

Reader-Side Change

Readers currently rely on the _hoodie_record_key field in the parquet files. Since this field will no longer exist in the parquet files, the reader needs to know how to construct it, and we should avoid requiring the customer to set these configurations manually. To that end, we should store these configurations on disk, in the hoodie.properties file that exists for every Hudi table. Read clients would then pick up the configuration at the start of the query.
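
A rough sketch of how a read client could pick up these settings at query start, assuming the proposed keys are persisted in hoodie.properties under the table's .hoodie directory (the helper class name is illustrative):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VirtualKeyConfigLoader {  // illustrative only
      public static Properties loadTableProps(String basePath, Configuration hadoopConf) throws IOException {
        // hoodie.properties lives under the table's .hoodie metadata directory.
        Path propsPath = new Path(basePath, ".hoodie/hoodie.properties");
        FileSystem fs = propsPath.getFileSystem(hadoopConf);
        Properties tableProps = new Properties();
        try (InputStream in = fs.open(propsPath)) {
          tableProps.load(in);
        }
        return tableProps;
      }
    }

The reader would then check hoodie.recordkey.virtual.enabled and, if it is set, instantiate the configured key class before planning the query.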

Code Changes

Most of the changes for this design will happen around the following classes:

  • HoodieParquetWriter: Currently, HoodieParquetWriter calls HoodieAvroUtils.addHoodieKeyToRecord which adds `_hoodie_file_name`, `_hoodie_partition_path` and `_hoodie_record_key` to the record, right before it is written to disk
  • HoodieParquetReader: Since the metadata fields are already added by the writer, the reader currently just reads them. This will have to change so that the reader adds the _hoodie_record_key field to each record it reads (see the sketch after this list).
  • HoodieIndex
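
To make the HoodieParquetReader bullet concrete, the reader-side change amounts to populating the metadata field in memory after the record is read, instead of expecting it on disk. A hedged sketch over Avro records, reusing the illustrative key builder from the earlier sketch; it assumes the in-memory read schema still declares _hoodie_record_key even though the file schema no longer does:

    import org.apache.avro.generic.GenericRecord;

    public class VirtualKeyReadSupport {  // illustrative only
      // Reconstruct the record key from data columns and attach it to the in-memory record.
      public static GenericRecord attachRecordKey(GenericRecord readRecord,
                                                  SimpleVirtualRecordKeyBuilder keyBuilder) {
        readRecord.put("_hoodie_record_key", keyBuilder.buildRecordKey(readRecord));
        return readRecord;
      }
    }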

Defaults / Expectations / Exceptions

  • Default: _hoodie_record_key will be written to disk and behavior will be identical to what we see today.
  • [reader] If VirtualKey Confs are set and _hoodie_record_key is present in the parquet file, we will ignore the values in the parquet files and log a warning. 
  • [reader + writer] If VirtualKey Confs are set but some columns are not present in the schema, or some records have null values in those columns, we will throw an exception (see the validation sketch after this list).
  • [writer] If the VirtualKey confs are set, but are different from what is present in hoodie.properties, we will throw an exception. (This does not apply to the case where we store the configs in the parquet files)
  • We expect that hoodie.properties will be considered a system file and will not be mutated by users. The file should only be mutated through Hudi code.
  • What if the key relies on some column that wasn't present before some past date? This would imply that we are somehow evolving the primary key for the table, a complication that is not supported by most databases. We choose not to support this for HoodieRecordKey either.
  • If a table has configured _hoodie_record_key to be virtual, then the _hoodie_record_key column will no longer be queryable. Queries on this column will return NULL. This is consistent with the behaviors of other databases - compound keys are not materialized as columns, instead users must recreate the key in their query, if they want to select the key. 
  • NOTE: For MOR tables, the compaction job would need to have the KeyGenerator class and configs, because it needs to interpret and merge the base and log files.
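
A small sketch of the schema check implied by the [reader + writer] bullet above, assuming Avro schemas (class and method names are illustrative):

    import java.util.List;
    import org.apache.avro.Schema;

    public class VirtualKeyValidator {  // illustrative only
      // Fail fast if any configured virtual-key column is missing from the table schema.
      public static void validateColumnsPresent(Schema tableSchema, List<String> keyColumns) {
        for (String column : keyColumns) {
          if (tableSchema.getField(column) == null) {
            throw new IllegalArgumentException(
                "Virtual record key column '" + column + "' is not present in the table schema");
          }
        }
      }
    }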

Rollout/Adoption Plan

NOTE: This change preserves legacy behaviors, so no current users should be affected unless they enable this change. 


The initial part of the rollout is simple - we write the relevant configs to disk, and newly written parquet files will no longer contain the _hoodie_record_key field. The previously written files already have the record key field, so readers should be able to read all files.

The second part of the rollout is more challenging. This involves rewriting the old data. We need to do this because this is where we will realize the bulk of the storage savings. For this we could either:

  • Rebootstrap the table using the new writer and new configs. The advantage of this approach is that it is a well understood process.
  • Create a new rewrite tool in Hoodie. This tool would read existing Hudi tables and write a new version of each file_id in the new format. There is more effort involved here, however this tool might be useful for future Hudi changes!

Contingency - Rollback 

If we are forced to roll back, we may have a big problem: the newly written parquet files will no longer have the _hoodie_record_key field, and older clients may not be able to read them. To address this, I believe we should continue writing the _hoodie_record_key field to disk for some weeks, and add a config that tells the reader to ignore the _hoodie_record_key field and instead use the virtual key. Doing this will also allow independent rollout of reader and writer clients.

Test Plan

<Describe in few sentences how the RFC will be tested. How will we know that the implementation works as expected? How will we know nothing broke?>


4 Comments

  1. Abhishek Modi I have some concerns around the new `VirtualKey` interface. 

    Specifically 

    • Why do we need a new interface? We primarily need this interface to construct a HoodieKey from the existing record on storage, right? Could we just reuse the KeyGenerator interface?
    • Along the same lines, assume a user is writing using the Spark datasource path, using a key generator class already. Then, once the df is converted to an RDD[HoodieRecord], we lose the information about how the key was constructed. It may not be as simple as a compound key made from a list of fields and a separator; e.g. the timestamp-based key generators parse timestamps. I can imagine other ways where some complex code could be written to prepare the key. Again, I feel that if we reuse KeyGenerator we can just run through the same code again and reconstruct the key from a record on disk.
    • Questions on how the merge is going to work for different engines on the query side. 
    1. That makes sense. I'll update the RFC to use `KeyGenerator` instead. 

  2. I think we should also think about introducing synthetic keys down the road. hoodie_sequence_num is already a unique key, so we could leverage that as well.