...

Approvers

Status

Current state

  • Under Discussion (tick)
  • In Progress (tick)
  • ABANDONED
  • Completed
  • INactive


Discussion thread: TODO

JIRA: TODO

...

We will provide two configurations, taking inspiration from ComplexKeyGenerator in Hoodie:

  • hoodie.recordkey.is_virtual.enabled
    • Boolean value controlling whether the recordKey is written to disk
    • Ex: true

...

  • hoodie.recordkey.virtual.concatenator
    • String to use as a separator between columns
    • Ex: “_”
  • hoodie.recordkey.virtual.class
    • Class that creates the recordKey given columns.
    • Ex: com.uber.marmaray.SimpleRecordKey
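
As a rough illustration of how the settings listed above might be supplied and read, the sketch below loads them into a plain `java.util.Properties` object. The class name `VirtualKeyConfigExample` is hypothetical, and the spelling of the separator key (`hoodie.recordkey.virtual.concatenator`) is an assumption. The `false` fallback mirrors the stated default of writing `_hoodie_record_key` to disk.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hudi: shows the proposed keys as plain properties.
final class VirtualKeyConfigExample {

    static Properties virtualKeyProps() {
        Properties props = new Properties();
        // Controls whether the recordKey is written to disk or derived virtually.
        props.setProperty("hoodie.recordkey.is_virtual.enabled", "true");
        // Separator placed between column values (key spelling is an assumption).
        props.setProperty("hoodie.recordkey.virtual.concatenator", "_");
        // Class that creates the recordKey from the given columns (example value from the text).
        props.setProperty("hoodie.recordkey.virtual.class", "com.uber.marmaray.SimpleRecordKey");
        return props;
    }

    static boolean isVirtualKeyEnabled(Properties props) {
        // Falls back to "false", i.e. the key stays materialized unless explicitly enabled.
        return Boolean.parseBoolean(props.getProperty("hoodie.recordkey.is_virtual.enabled", "false"));
    }

    public static void main(String[] args) {
        System.out.println("virtual key enabled: " + isVirtualKeyEnabled(virtualKeyProps()));
    }
}
```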

...

These configurations will be used by an enhanced implementation of the KeyGenerator abstract class. Customers can also implement this class themselves if they want custom logic for creating a RecordKey from the data in other columns.
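
To make the idea of a pluggable key-building class concrete, here is a minimal sketch. It does not use Hudi's actual KeyGenerator API. `RecordKeyGeneratorSketch` and `ConcatenatingKeyGeneratorSketch` are hypothetical names, a record is modelled as a simple `Map`, and rejecting missing or null columns is an assumption of the sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for the abstract class; custom implementations would extend it.
abstract class RecordKeyGeneratorSketch {
    // Builds the record key from the other columns of a single record.
    abstract String getRecordKey(Map<String, Object> record);
}

// Example implementation: joins the configured columns with the configured separator.
final class ConcatenatingKeyGeneratorSketch extends RecordKeyGeneratorSketch {
    private final List<String> columns;
    private final String separator;

    ConcatenatingKeyGeneratorSketch(List<String> columns, String separator) {
        this.columns = columns;
        this.separator = separator;
    }

    @Override
    String getRecordKey(Map<String, Object> record) {
        StringJoiner key = new StringJoiner(separator);
        for (String column : columns) {
            Object value = record.get(column);
            if (value == null) {
                // A missing column and a null value are treated alike in this sketch.
                throw new IllegalArgumentException("Missing or null key column: " + column);
            }
            key.add(String.valueOf(value));
        }
        return key.toString();
    }
}
```

With the hypothetical columns `city` and `trip_id`, a separator of `_`, and a record holding `city = "sf"` and `trip_id = 42`, `getRecordKey` returns `sf_42`.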

...

  • Default: _hoodie_record_key will be written to disk and behavior will be identical to what we see today.
  • [reader] If the VirtualKey confs are set and _hoodie_record_key is present in the parquet file, we will ignore the values in the parquet files and log a warning.
  • [reader + writer] If the VirtualKey confs are set but some columns are not present in the schema, or some records have null values in those columns, we will throw an exception.
  • [writer] If the VirtualKey confs are set but differ from what is present in hoodie.properties, we will throw an exception for “append” but not for “overwrite” write modes. (This does not apply to the case where we store the configs in the parquet files.)
  • We expect that hoodie.properties will be considered a system file and will not be mutated. The file should only be mutated through Hudi code.
  • What if the key relies on some column that wasn’t present before some past date? This would imply that we are somehow evolving the primary key for the table. This is a complication that most databases do not support. We choose not to support it for HoodieRecordKey either.
  • If a table has configured _hoodie_record_key to be virtual, then the _hoodie_record_key column will no longer be queryable. Queries on this column will return NULL. This is consistent with the behavior of other databases: compound keys are not materialized as columns; instead, users must recreate the key in their query if they want to select it.
  • NOTE: For MOR tables, the compaction job would need to have the KeyGenerator class and configs, because it needs to interpret and merge the base and log files.
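
The writer-side rule above (throw on “append”, allow on “overwrite” when the confs differ from hoodie.properties) could be checked along the following lines. This is only a sketch under stated assumptions: `VirtualKeyConfigCheckSketch` and `WriteMode` are hypothetical names, both configurations are modelled as plain maps, and comparing every key under the `hoodie.recordkey.` prefix is an illustrative choice rather than the proposal's definition of "different".

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical check, not Hudi code: compares writer confs with the stored table confs.
final class VirtualKeyConfigCheckSketch {
    enum WriteMode { APPEND, OVERWRITE }

    // Assumed scope of the comparison: every key under this prefix.
    private static final String PREFIX = "hoodie.recordkey.";

    static void validate(Map<String, String> writerConf,
                         Map<String, String> tableConf,
                         WriteMode mode) {
        if (mode == WriteMode.OVERWRITE) {
            return; // Overwrite replaces the table, so a mismatch is not an error.
        }
        // Collect the relevant keys from both sides so one-sided keys are also detected.
        Set<String> keys = new TreeSet<>();
        for (String k : writerConf.keySet()) {
            if (k.startsWith(PREFIX)) keys.add(k);
        }
        for (String k : tableConf.keySet()) {
            if (k.startsWith(PREFIX)) keys.add(k);
        }
        for (String key : keys) {
            if (!Objects.equals(writerConf.get(key), tableConf.get(key))) {
                throw new IllegalStateException(
                    "Virtual key config mismatch for " + key + " on append");
            }
        }
    }
}
```

In this sketch a key that is set on only one side counts as a mismatch, which is the stricter of the two plausible readings.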

Rollout/Adoption Plan

NOTE: This change preserves legacy behaviors, so no current users should be affected unless they enable this change. 

...