...

  • @<approver1 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
  • @<approver2 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
  • ...

Status

Current State

  • Under Discussion (tick)
  • In Progress
  • Abandoned
  • Completed
  • Inactive


Discussion thread: here

JIRA: here

...

  • Step 1: determine the query method (snapshot, incremental, or read-optimized) from the SQL statement.
  • Step 2: hand the query method to the scan factory module. The scan factory has two core functions: obtain the table files to be scanned for each query method (this logic is already implemented in the Hudi kernel), and build a query schema for each query method (see the query schema construction process for the detailed design).

Query method

Query schema construction

Snapshot query

Get the latest schema from the latest commit file.

Incremental query

Get the versioned schema for the commit time specified by the incremental query, then obtain the query schema from the corresponding commit file.

Read Optimized query

Get the schema version ID from the base file's name, then obtain the query schema from the corresponding commit file.

  • Step 3: the scan factory passes all files involved in this query to the scanner module; the scanner builds a scan task for each file group, and each scan task is responsible for reading the actual data.
  • Step 4: the scan task constructs the file schema for the file it is about to read. This is similar to the schema lookup in step 2: the file's commit time is used as the version ID to query the corresponding schema version from the commit file.
  • Step 5: the scan task hands the file schema from step 4 to the merge schema module, which merges it with the query schema from step 2 and generates the read schema for the query engine. Push-down filters are also built and pushed to the data source side to filter data.
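The five steps above can be sketched as follows. This is an illustrative Python sketch, not Hudi's actual API: the function names, the `SCHEMAS_BY_COMMIT` lookup, and the file-name layout are assumptions made for the example.

```python
# Illustrative sketch (not Hudi's real API) of the read path described above.
# Assumption: each commit file stores the id-schema current as of that commit,
# modeled here as a dict keyed by commit time.

SCHEMAS_BY_COMMIT = {
    "20210101": {"version": "20210101",
                 "fields": [(1, "id"), (2, "operationTime")]},
    "20210201": {"version": "20210201",
                 "fields": [(1, "id"), (2, "col1"), (3, "extra")]},
}

def resolve_query_schema(query_type, commit_time=None, base_file_name=None):
    """Step 2: build the query schema for each query method."""
    if query_type == "snapshot":        # latest schema from the latest commit file
        return SCHEMAS_BY_COMMIT[max(SCHEMAS_BY_COMMIT)]
    if query_type == "incremental":     # schema as of the requested commit time
        return SCHEMAS_BY_COMMIT[commit_time]
    if query_type == "read_optimized":  # version id parsed from the base file name
        version = base_file_name.split("_")[-1].split(".")[0]
        return SCHEMAS_BY_COMMIT[version]
    raise ValueError(f"unknown query type: {query_type}")

def file_schema_for(file_commit_time):
    """Step 4: a file's own commit time is its schema version id."""
    return SCHEMAS_BY_COMMIT[file_commit_time]
```

The point of the sketch is that steps 2 and 4 are the same lookup applied to different version IDs: the query picks one version, each file pins another, and step 5 reconciles the two.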

...

Some engines use the Hive metastore for metadata management, but the Hive metastore does not support multiple schema versions, so these engines cannot read data through a historical schema.

Old schema data compatibility

scene 1: the old Hudi table will never undergo a schema change

This scene is relatively simple: we can fall back to Hudi's original read/write logic.

scene 2: schema changes are made on an old Hudi table

schema evolution operation

When we make the first schema change on an old Hudi table, the first id-schema is created:

1) convert the old Hudi table's latest Avro schema to an id-schema; this becomes the first id-schema.

2) any subsequent schema change is applied directly to this first id-schema and saved with the commit file.
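Step 1) can be sketched minimally: assign a stable column ID to every field of the latest Avro schema. The function name and the schema shape are assumptions for illustration, not Hudi internals.

```python
def avro_to_first_id_schema(avro_fields):
    """Derive the first id-schema from the table's latest Avro schema.
    Each field gets a stable column id; later schema changes may alter
    names or types but never reuse or renumber ids."""
    return {"version": 0,
            "fields": [{"id": i, "name": name, "type": typ}
                       for i, (name, typ) in enumerate(avro_fields, start=1)]}
```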

Let's give an example:

[figure: the first id-schema converted from the table's latest Avro schema]

Now rename operationTime to col1:

[figure: the id-schema after renaming operationTime to col1]
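The rename only rewrites the field's name; its column ID stays stable, which is what lets files written before the rename still resolve the column. A sketch under the same assumed schema shape as above:

```python
def rename_column(id_schema, old_name, new_name):
    """Apply a rename change: only the name changes, the id is untouched,
    so data written under the old name remains addressable by id."""
    fields = [dict(f, name=new_name) if f["name"] == old_name else dict(f)
              for f in id_schema["fields"]]
    return {"version": id_schema["version"] + 1, "fields": fields}
```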

read operation:

Once we have made a schema change on an old Hudi table, the first id-schema is created and all the old files are bound to it. Since the old Hudi table only supported add-column and modify-column-type operations, and Avro/Parquet support those changes natively, using the first id-schema to represent the old data is completely fine.

Now all files in the Hudi table are bound to an id-schema, so the query path is exactly the one in chapter Data query process.

Let us follow the above example to give an explanation:

The old table now contains two files: an old file bound to the first id-schema, and a new file written after the add-column change and bound to the latest id-schema.

Follow the steps in chapter Data query process:

  1. When we read the old file, the first id-schema is used as the file schema and the latest id-schema as the query schema. The merge module then merges the file schema and the query schema into the final read schema; once the read schema is produced, the old file can be read correctly. For how the file schema and query schema are merged, see chapter Data query process.

[figure: reading the old file through the merged read schema]

  2. When we read the new file, the latest id-schema is used as both the file schema and the query schema; the remaining process is the same as in (1).

[figure: reading the new file]
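Putting the two cases together: the merge step matches query-schema columns to file-schema columns by ID, so a renamed column is found under its old on-disk name, and a column absent from the file is filled with nulls. A hedged sketch, not the real merge module:

```python
def merge_read_schema(file_schema, query_schema):
    """Build the read schema: for each column the query asks for, find the
    column with the same id in the file, whatever it is named there.
    Columns absent from the file are marked to be filled with nulls."""
    by_id = {f["id"]: f for f in file_schema["fields"]}
    read = []
    for q in query_schema["fields"]:
        f = by_id.get(q["id"])
        read.append({"id": q["id"],
                     "read_as": q["name"],                    # name the query sees
                     "from_file": f["name"] if f else None,   # None -> fill null
                     "type": q["type"]})
    return read
```

With the example above, reading the old file with a query for `col1` resolves to the file's `operationTime` column via id 2, while the added column has no match in the old file and reads as null.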

write operation:

Now that we already have an id-schema, just follow chapter Data Write process.

Rollout/Adoption Plan

  • <What impact (if any) will there be on existing users?>
  • <If we are changing behavior how will we phase out the older behavior?>
  • <If we need special migration tools, describe them here.>
  • <When will we remove the existing behavior?>

...