Proposers

Approvers

Status

Current state: IN PROGRESS

Discussion thread: TODO

JIRA: TODO

Released: <Hudi Version>


Abstract

We should allow the HoodieRecordKey to not be stored on disk, to help reduce Hudi's storage footprint.

Background

The HoodieRecordKey column often contains data that is already present in other columns - either in a single other column, or in a combination of columns. This inflates the storage footprint of Hudi tables. For example, if a table's record key is built as host_timestamp, every key value duplicates data already stored in the host and timestamp columns. We could therefore reduce the storage footprint of all Hudi tables by not storing `_hoodie_record_key` on disk, and instead adding configurations that tell Hudi how to construct the key for a given row.

Implementation

Defining a Row Key

We will provide the following configurations, taking inspiration from ComplexKeyGenerator in Hoodie (an example of setting them follows the list):

  • hoodie.recordkey.virtual.enabled
    • Boolean value controlling whether the recordKey is written to disk
    • Ex: true
  • hoodie.recordkey.virtual.columns
    • Comma separated list of column names
    • Ex: host,hadoop_timestamp,hadoop_row_key
  • hoodie.recordkey.virtual.concatenator
    • String to use as a separator between columns
    • Ex: “_”
  • hoodie.recordkey.virtual.class
    • Class that creates the recordKey given columns.
    • Ex: com.uber.marmaray.SimpleRecordKey
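
A minimal sketch of how a writer job might supply the proposed configurations, assuming they are passed as plain key/value properties. The property keys are the ones proposed above; the values and the surrounding class are hypothetical:

    import java.util.Properties;

    public class VirtualKeyWriteConfigExample {  // illustrative only
      public static Properties virtualKeyProps() {
        Properties props = new Properties();
        // Do not materialize _hoodie_record_key in the parquet files.
        props.setProperty("hoodie.recordkey.virtual.enabled", "true");
        // Columns whose values together identify a record.
        props.setProperty("hoodie.recordkey.virtual.columns", "host,hadoop_timestamp,hadoop_row_key");
        // Separator placed between the column values.
        props.setProperty("hoodie.recordkey.virtual.concatenator", "_");
        // Class that knows how to build the key from those columns.
        props.setProperty("hoodie.recordkey.virtual.class", "com.uber.marmaray.SimpleRecordKey");
        return props;
      }
    }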


These configurations will be used by an enhanced implementation of the KeyGenerator abstract class. Customers can also implement this class themselves, should they want custom logic for creating a RecordKey from data in other columns.
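
As a rough sketch of the key-construction logic such an implementation might carry, the snippet below rebuilds the key from an Avro GenericRecord. It is a standalone illustration rather than the actual KeyGenerator contract; the class and method names are made up for this example:

    import java.util.List;
    import org.apache.avro.generic.GenericRecord;

    public class SimpleVirtualRecordKeyBuilder {  // illustrative only
      private final List<String> keyColumns;  // from hoodie.recordkey.virtual.columns
      private final String separator;         // from hoodie.recordkey.virtual.concatenator

      public SimpleVirtualRecordKeyBuilder(List<String> keyColumns, String separator) {
        this.keyColumns = keyColumns;
        this.separator = separator;
      }

      // Reconstructs the record key by concatenating the configured data columns.
      public String buildRecordKey(GenericRecord record) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < keyColumns.size(); i++) {
          Object value = record.get(keyColumns.get(i));
          if (value == null) {
            // Per the expectations listed below, null key columns are an error.
            throw new IllegalStateException("Virtual key column is null: " + keyColumns.get(i));
          }
          if (i > 0) {
            key.append(separator);
          }
          key.append(value.toString());
        }
        return key.toString();
      }
    }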

Reader-Side Change

Readers currently rely on the _hoodie_record_key field in the parquet files. Since this field will no longer exist in the parquet files, the reader needs to know how to construct it, and we should avoid requiring the customer to set these configurations manually. To that end, we should store these configurations on disk, in the hoodie.properties file that exists for every Hudi table. Read clients would then pick up the configuration at the start of the query.
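
A rough sketch of how a read client could pick up these settings at query start, assuming the proposed keys are persisted in hoodie.properties under the table's .hoodie directory (the helper class name is illustrative):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VirtualKeyConfigLoader {  // illustrative only
      public static Properties loadTableProps(String basePath, Configuration hadoopConf) throws IOException {
        // hoodie.properties lives under the table's .hoodie metadata directory.
        Path propsPath = new Path(basePath, ".hoodie/hoodie.properties");
        FileSystem fs = propsPath.getFileSystem(hadoopConf);
        Properties tableProps = new Properties();
        try (InputStream in = fs.open(propsPath)) {
          tableProps.load(in);
        }
        return tableProps;
      }
    }

The reader would then check hoodie.recordkey.virtual.enabled and, if it is set, instantiate the configured key class before planning the query.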

Code Changes

Most of the changes for this design will happen around the following classes:

  • HoodieParquetWriter: Currently, HoodieParquetWriter calls HoodieAvroUtils.addHoodieKeyToRecord which adds `_hoodie_file_name`, `_hoodie_partition_path` and `_hoodie_record_key` to the record, right before it is written to disk
  • HoodieParquetReader: Since the metadata fields are already added by the writer, the reader currently just reads them. This will have to change so that the reader adds the _hoodie_record_key field to each record it reads (see the sketch after this list).
  • HoodieIndex
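
To make the HoodieParquetReader bullet concrete, the reader-side change amounts to populating the metadata field in memory after the record is read, instead of expecting it on disk. A hedged sketch over Avro records, reusing the illustrative key builder from the earlier sketch; it assumes the in-memory read schema still declares _hoodie_record_key even though the file schema no longer does:

    import org.apache.avro.generic.GenericRecord;

    public class VirtualKeyReadSupport {  // illustrative only
      // Reconstruct the record key from data columns and attach it to the in-memory record.
      public static GenericRecord attachRecordKey(GenericRecord readRecord,
                                                  SimpleVirtualRecordKeyBuilder keyBuilder) {
        readRecord.put("_hoodie_record_key", keyBuilder.buildRecordKey(readRecord));
        return readRecord;
      }
    }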

Defaults / Expectations / Exceptions

  • Default: _hoodie_record_key will be written to disk and behavior will be identical to what we see today.
  • [reader] If VirtualKey Confs are set and _hoodie_record_key is present in the parquet file, we will ignore the values in the parquet files and log a warning. 
  • [reader + writer] If VirtualKey Confs are set but some columns are not present in the schema, or some records have null values in those columns, we will throw an exception (see the validation sketch after this list).
  • [writer] If the VirtualKey confs are set, but are different from what is present in hoodie.properties, we will throw an exception. (This does not apply to the case where we store the configs in the parquet files)
  • We expect that hoodie.properties will be considered a system file and will not be mutated by users. The file should only be mutated through Hudi code.
  • What if the key relies on some column that wasn't present before some past date? This would imply that we are somehow evolving the primary key for the table, a complication that is not supported by most databases. We choose not to support this for HoodieRecordKey either.
  • If a table has configured _hoodie_record_key to be virtual, then the _hoodie_record_key column will no longer be queryable. Queries on this column will return NULL. This is consistent with the behaviors of other databases - compound keys are not materialized as columns, instead users must recreate the key in their query, if they want to select the key. 
  • NOTE: For MOR tables, the compaction job would need to have the KeyGenerator class and configs, because it needs to interpret and merge the base and log files.
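
A small sketch of the schema check implied by the [reader + writer] bullet above, assuming Avro schemas (class and method names are illustrative):

    import java.util.List;
    import org.apache.avro.Schema;

    public class VirtualKeyValidator {  // illustrative only
      // Fail fast if any configured virtual-key column is missing from the table schema.
      public static void validateColumnsPresent(Schema tableSchema, List<String> keyColumns) {
        for (String column : keyColumns) {
          if (tableSchema.getField(column) == null) {
            throw new IllegalArgumentException(
                "Virtual record key column '" + column + "' is not present in the table schema");
          }
        }
      }
    }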

Rollout/Adoption Plan

NOTE: This change preserves legacy behaviors, so no current users should be affected unless they enable this change. 


The initial part of the rollout is simple - we write the relevant configs to disk, and newly written parquet files will no longer contain the _hoodie_record_key field. The previously written files already have the record key field, so readers should be able to read all files.

The second part of the rollout is more challenging. This involves rewriting the old data. We need to do this because this is where we will realize the bulk of the storage savings. For this we could either:

  • Rebootstrap the table using the new writer and new configs. The advantage of this approach is that it is a well understood process.
  • Create a new rewrite tool in Hoodie. This tool would read existing Hudi tables and write a new version of each file_id in the new format. There is more effort involved here, however this tool might be useful for future Hudi changes!

Contingency - Rollback 

If we are forced to roll back, we may have a big problem: the newly written parquet files will no longer have the _hoodie_record_key field, and older clients may not be able to read them. To address this, I believe we should continue writing the _hoodie_record_key field to disk for some weeks, and add a config that tells the reader to ignore the _hoodie_record_key field and instead use the virtual key. Doing this will also allow independent rollout of reader and writer clients.

Test Plan

<Describe in few sentences how the RFC will be tested. How will we know that the implementation works as expected? How will we know nothing broke?>


4 Comments

  1. Abhishek Modi I have some concerns around the new `VirtualKey` interface. 

    Specifically 

    • Why do we need a new interface? We primarily need this interface to construct a HoodieKey from the existing record on storage, right? Could we just reuse the KeyGenerator interface?
    • Along the same lines, assume a user is writing using the Spark datasource path, using a key generator class already. Then, once the df is converted to an RDD[HoodieRecord], we lose the information about how the key was constructed. It may not be as simple as a compound key made from a list of fields and a separator; e.g. the timestamp-based key generators parse timestamps. I can imagine other ways where some complex code could be written to prepare the key. Again, I feel that if we reuse KeyGenerator we can just run through the same code again and reconstruct the key from a record on disk.
    • Questions on how the merge is going to work for different engines on the query side. 
    1. That makes sense. I'll update the RFC to use `KeyGenerator` instead. 

  2. I think we should also think about introducing synthetic keys down the road. hoodie_sequence_num is already a unique key, so we could leverage that as well.