...

  1. Parse the InputPaths from JobConf and classify them into three categories: paths for incremental mode queries, paths for non-incremental mode queries, and non-Hudi paths.
  2. While parsing the InputPaths, also create a HoodieTableMetadataClient for each Hudi table.
  3. Process each category sequentially:
    1. Incremental mode - HoodieTimeline provides the commits to check. Read each of those commits (HoodieCommitMetadata) and gather the partitions to list. Mutate JobConf to set the input paths to this new partition list. From here on, the implementation is the same as the current one: create a ReadOptimizedView of the table and fetch all latest data files that fall in the matching commit range.
    2. Non-Hudi mode - Mutate JobConf to set these paths as InputPaths and return the FileStatus[] as is.
    3. Non-incremental mode - Mutate JobConf to set the non-incremental input paths. List the files in these paths, group them by table (reusing the table metadata created in step 2 above), construct a ReadOptimizedView of each table, and fetch all latest data files.
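
The classification in step 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual HoodieInputFormat code: InputPathClassifier, TableInfo and Category are hypothetical names, the Hudi tables and their base paths are assumed to be already known, and the set of tables queried in incremental mode is passed in directly rather than read from JobConf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types are part of Hudi or Hadoop.
class InputPathClassifier {

    enum Category { INCREMENTAL, NON_INCREMENTAL, NON_HUDI }

    // Assumed stand-in for the per-table metadata created in step 2.
    static final class TableInfo {
        final String basePath;
        final String tableName;

        TableInfo(String basePath, String tableName) {
            this.basePath = basePath;
            this.tableName = tableName;
        }
    }

    // Splits input paths into the three categories described in step 1.
    static Map<Category, List<String>> classify(List<String> inputPaths,
                                                List<TableInfo> tables,
                                                Set<String> incrementalTables) {
        Map<Category, List<String>> result = new EnumMap<>(Category.class);
        for (Category c : Category.values()) {
            result.put(c, new ArrayList<>());
        }
        for (String path : inputPaths) {
            // A path belongs to a table if it is the base path or lies under it.
            TableInfo owner = null;
            for (TableInfo t : tables) {
                if (path.equals(t.basePath) || path.startsWith(t.basePath + "/")) {
                    owner = t;
                    break;
                }
            }
            if (owner == null) {
                result.get(Category.NON_HUDI).add(path);
            } else if (incrementalTables.contains(owner.tableName)) {
                result.get(Category.INCREMENTAL).add(path);
            } else {
                result.get(Category.NON_INCREMENTAL).add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<TableInfo> tables = Arrays.asList(
                new TableInfo("/data/trips", "trips"),
                new TableInfo("/data/users", "users"));
        Set<String> incremental = new HashSet<>(Arrays.asList("trips"));
        List<String> paths = Arrays.asList(
                "/data/trips/2020/01", "/data/users/2020/01", "/tmp/raw");
        System.out.println(classify(paths, tables, incremental));
    }
}
```

Each resulting list can then be set on JobConf and processed in turn, as described in step 3. Matching on the base path plus a trailing "/" avoids treating a sibling directory such as /data/trips_old as part of the trips table.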

Rollout/Adoption Plan

  • Roll out involves replacing the current HoodieInputFormat registered in Hive with this new version.
  • There won't be any user impact, since the behavior of listStatus will remain the same.

Test Plan

Unit tests to cover all operations

...