...
- @<approver1 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
- @<approver2 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
- ...
Status
Current state:
Discussion thread: here
JIRA: here
...
- Step 1: Determine the query method from the SQL statement.
- Step 2: Pass the query method to the scan factory module. The scan factory module has two core functions: it obtains the table files to be scanned according to the query method (this logic is already implemented in the Hudi kernel), and it builds a query schema for that query method (refer to the construction process of the query schema for the detailed design).
Query method | Query schema construction |
---|---|
Snapshot query | Get the latest schema from the latest commit file |
Incremental query | Get the schema version from the commit time specified by the incremental query, then get the query schema from the commit file |
Read Optimized query | Get the schema version ID from the file name of the base file, then get the query schema from the commit file |
- Step 3: The scan factory passes all the files involved in the query to the scanner module. The scanner builds a scan task for each file group, and each scan task is responsible for reading the actual data.
- Step 4: The scan task constructs the file schema for the file to be read. This is similar to the schema lookup in step 2: the commit time of the file is used as the version ID to look up the corresponding schema version in the commit file.
- Step 5: The scan task passes the file schema generated in step 4 to the merge schema module. The merge schema module merges the file schema with the query schema generated in step 2 and generates the read schema according to the query engine. It also builds the filters that are pushed down to the data source side to filter data.
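Steps 2, 4 and 5 can be illustrated with a minimal Python sketch. It is not the Hudi implementation. The `Field` type, the `schema_for_commit` and `merge_schema` helpers, the integer commit times, and matching columns by a stable column ID are all assumptions made for illustration. The sketch also assumes that the schema for a commit time is the latest version written at or before that time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Field:
    field_id: int  # hypothetical stable column ID, survives renames
    name: str
    dtype: str


Schema = List[Field]


def schema_for_commit(history: Dict[int, Schema], commit_time: int) -> Schema:
    """Return the schema whose version ID is the latest one <= commit_time."""
    versions = sorted(history)
    idx = bisect_right(versions, commit_time)
    if idx == 0:
        raise KeyError(f"no schema version at or before {commit_time}")
    return history[versions[idx - 1]]


def merge_schema(
    file_schema: Schema, query_schema: Schema
) -> List[Tuple[str, Optional[str], str]]:
    """For each query column, return (query name, name in file or None, query type).

    A None source name means the column does not exist in the file,
    so the reader would have to fill it with nulls.
    """
    by_id = {f.field_id: f for f in file_schema}
    plan = []
    for q in query_schema:
        src = by_id.get(q.field_id)
        plan.append((q.name, src.name if src else None, q.dtype))
    return plan


# Hypothetical schema history keyed by version ID (simplified commit time).
history = {
    20220101: [Field(1, "id", "int"), Field(2, "name", "string")],
    20220301: [
        Field(1, "id", "long"),
        Field(2, "full_name", "string"),
        Field(3, "age", "int"),
    ],
}

# Step 2: query schema (here the latest version, as for a snapshot query).
query_schema = schema_for_commit(history, 20220315)
# Step 4: file schema for a file committed under the older version.
file_schema = schema_for_commit(history, 20220120)
# Step 5: merge the two into a read plan.
read_plan = merge_schema(file_schema, query_schema)
```

In this sketch the renamed column `full_name` is read from the file column `name`, and the column `age`, which the older file does not contain, has no source column. A real reader would also need to handle type promotion and build the pushed-down filters against the file's column names, which the sketch leaves out.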
...
Some engines use the Hive Metastore for metadata management, but the Hive Metastore does not support multiple schema versions, so these engines cannot read data using a historical schema.
Old schema data compatibility
- See the chapter Data query process
Rollout/Adoption Plan
- <What impact (if any) will there be on existing users?>
- <If we are changing behavior how will we phase out the older behavior?>
- <If we need special migration tools, describe them here.>
- <When will we remove the existing behavior?>
...