...
Approvers
- Nishith Agarwal [APPROVED]
- Vinoth Chandar
Status
Current state:
Discussion thread: TODO
JIRA: TODO
...
We will provide the following configurations, taking inspiration from ComplexKeyGenerator in Hoodie:
- hoodie.recordkey.is_virtual.enabled
- Boolean value controlling whether the recordKey is written to disk
- Ex: true
...
- hoodie.recordkey.virtual.concatenator
- String to use as a separator between columns
- Ex: “_”
- hoodie.recordkey.virtual.class
- Class that creates the recordKey given columns.
- Ex: com.uber.marmaray.SimpleRecordKey
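As a rough illustration of how these settings might be read, the sketch below loads them from a `java.util.Properties` object. `VirtualKeyConfig` is a hypothetical holder class, not an existing Hudi class, and the spelling of the separator key is an assumption. The `false` default for the enabled flag mirrors the default behavior described in this proposal, where _hoodie_record_key continues to be written to disk.

```java
import java.util.Properties;

// Hypothetical holder for the proposed settings; not an existing Hudi class.
class VirtualKeyConfig {
    static final String ENABLED_KEY = "hoodie.recordkey.is_virtual.enabled";
    // Assumption: exact spelling of the separator key.
    static final String SEPARATOR_KEY = "hoodie.recordkey.virtual.concatenator";
    static final String CLASS_KEY = "hoodie.recordkey.virtual.class";

    final boolean virtualKeyEnabled;
    final String separator;
    final String keyGeneratorClass;

    VirtualKeyConfig(Properties props) {
        // Defaults to false, so the record key is still written to disk.
        this.virtualKeyEnabled =
            Boolean.parseBoolean(props.getProperty(ENABLED_KEY, "false"));
        // Left as null when unset; this sketch does not invent defaults.
        this.separator = props.getProperty(SEPARATOR_KEY);
        this.keyGeneratorClass = props.getProperty(CLASS_KEY);
    }
}
```

A caller would populate the `Properties` object from the job configuration and pass it to the constructor before creating a reader or writer.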
...
These configurations will be used by an enhanced implementation of the KeyGenerator abstract class. Customers can also implement this class themselves if they want custom logic for creating a record key from the data in other columns.
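The idea of deriving the key from other columns can be sketched as follows. This is a minimal illustration under stated assumptions and does not reproduce Hudi's actual KeyGenerator API: `VirtualKeyGenerator`, `ConcatenatingKeyGenerator`, the `Map`-based record representation, and passing the key columns through the constructor are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction: derives a record key from the values of other columns.
abstract class VirtualKeyGenerator {
    abstract String getRecordKey(Map<String, Object> record);
}

// Hypothetical default: joins the configured columns with a separator.
class ConcatenatingKeyGenerator extends VirtualKeyGenerator {
    private final List<String> keyColumns;
    private final String separator;

    ConcatenatingKeyGenerator(List<String> keyColumns, String separator) {
        this.keyColumns = new ArrayList<>(keyColumns);
        this.separator = separator;
    }

    @Override
    String getRecordKey(Map<String, Object> record) {
        List<String> parts = new ArrayList<>();
        for (String column : keyColumns) {
            Object value = record.get(column);
            // A missing column and a null value are both treated as errors.
            if (value == null) {
                throw new IllegalArgumentException(
                    "Record key column is missing or null: " + column);
            }
            parts.add(value.toString());
        }
        return String.join(separator, parts);
    }
}
```

A custom implementation would extend the same abstract class and replace the concatenation with its own logic, for example hashing the column values.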
...
- Default: _hoodie_record_key will be written to disk and behavior will be identical to what we see today.
- [reader] If VirtualKey confs are set and _hoodie_record_key is present in the parquet file, we will ignore the values in the parquet file and log a warning.
- [reader + writer] If VirtualKey confs are set but some of the key columns are not present in the schema, or some records have null values in those columns, we will throw an exception.
- [writer] If the VirtualKey confs are set, but are different from what is present in hoodie.properties, we will throw an exception in “append” write mode but not in “overwrite” write mode. (This does not apply to the case where we store the configs in the parquet files.)
- We expect that hoodie.properties will be considered a system file and will not be mutated by users. The file should only be mutated through Hudi code.
- What if the key relies on some column that wasn’t present before some past date? This would imply that we are somehow evolving the primary key for the table. This is a complication that most databases do not support. We choose not to support it for HoodieRecordKey either.
- If a table has configured _hoodie_record_key to be virtual, then the _hoodie_record_key column will no longer be queryable. Queries on this column will return NULL. This is consistent with the behavior of other databases: compound keys are not materialized as columns, so users who want to select the key must recreate it in their query.
- NOTE: For MOR tables, the compaction job would need to have the KeyGenerator class and configs, because it needs to interpret and merge the base and log files.
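The writer-side rule about mismatched configs can be sketched as below. This is only an illustration: `VirtualKeyConfigValidator`, `WriteMode`, and the idea of passing the stored hoodie.properties contents as a `Properties` object are hypothetical, not existing Hudi APIs.

```java
import java.util.Objects;
import java.util.Properties;

// Hypothetical writer-side check; names are illustrative, not Hudi APIs.
class VirtualKeyConfigValidator {
    enum WriteMode { APPEND, OVERWRITE }

    // Compares the incoming confs with the stored ones for the given keys.
    static void validate(Properties incoming, Properties stored,
                         WriteMode mode, String... keys) {
        if (mode == WriteMode.OVERWRITE) {
            return; // a mismatch is not an error in overwrite mode
        }
        for (String key : keys) {
            String requested = incoming.getProperty(key);
            String existing = stored.getProperty(key);
            if (!Objects.equals(requested, existing)) {
                throw new IllegalStateException(
                    "Config " + key + " is '" + requested
                        + "' but the table was created with '" + existing + "'");
            }
        }
    }
}
```

In this sketch the check runs once before the write starts, so an append with conflicting key settings fails before any data is written.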
Rollout/Adoption Plan
NOTE: This change preserves legacy behaviors, so no current users should be affected unless they enable this change.
...