
 HIP-4 : Faster Hive incremental pull queries


Proposers

Approvers

Status

Current state: Under Discussion

...

JIRA: here

Released: <Hudi Version>

Abstract

The incremental pull tool for Hive (HiveIncrementalPuller and its Uber-internal equivalent) provides the ability to obtain change streams from ingested tables in Hive. Currently, this is achieved by listing all partitions and then evaluating them for incremental changes. For large Hive tables with a high number of partitions, this approach can be time-consuming. This proposal aims to speed up incremental queries by consulting the commit metadata that the Hudi timeline already holds before listing any partitions. This reduces the planning overhead to just the partitions affected within the time range for which data is incrementally pulled.

Background

Here are some related concepts that might be useful to know.

...

  1. Parse the InputPaths from JobConf and classify them into three categories: paths for incremental mode queries, paths for non-incremental mode queries, and non-Hudi paths.
  2. While parsing the InputPaths, also create a HoodieTableMetadataClient for each Hudi table.
  3. Process each category sequentially:
    1. Incremental mode - HoodieTimeline provides the commits to check. Read each of those commits (HoodieCommitMetadata) and gather the partitions to list. Mutate JobConf to set the input paths to this new partition list. From here on, the implementation is the same as the current one: create a ReadOptimizedView of the table and fetch all latest data files that fall within the matching commit range.
    2. Non-Hudi mode - Mutate JobConf to set these as InputPaths and return FileStatus[] as is.
    3. Non-incremental mode - Mutate JobConf to set the non-incremental input paths. List the files in these paths, group them by table (reusing the table metadata created in step 2 above), construct a ReadOptimizedView of each table, and fetch all latest data files.
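
The steps above can be illustrated with a minimal, self-contained sketch. It does not use the real Hudi or Hadoop classes: the class IncrementalPlanningSketch, the methods classify and affectedPartitions, the startCommit and maxCommits parameters, and the plain maps standing in for table metadata and for the timeline with its HoodieCommitMetadata are all hypothetical names chosen for illustration. It assumes commit times are strings that sort in commit order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class IncrementalPlanningSketch {

    /** The three input path categories from step 1. */
    public enum PathCategory { INCREMENTAL, NON_INCREMENTAL, NON_HUDI }

    /**
     * Hypothetical classifier. A path belongs to a Hudi table when it equals or sits
     * under a known table base path; the boolean says whether that table is being
     * queried in incremental mode. Anything else is treated as a non-Hudi path.
     */
    public static PathCategory classify(String path, Map<String, Boolean> basePathToIncremental) {
        for (Map.Entry<String, Boolean> table : basePathToIncremental.entrySet()) {
            String base = table.getKey();
            if (path.equals(base) || path.startsWith(base + "/")) {
                return table.getValue() ? PathCategory.INCREMENTAL : PathCategory.NON_INCREMENTAL;
            }
        }
        return PathCategory.NON_HUDI;
    }

    /**
     * Step 3.1 in miniature: gather the partitions written by commits strictly after
     * startCommit, visiting at most maxCommits commits. The map stands in for the
     * timeline plus per-commit metadata (commit time -> partitions touched).
     */
    public static Set<String> affectedPartitions(
            TreeMap<String, Set<String>> commitToPartitions, String startCommit, int maxCommits) {
        Set<String> partitions = new TreeSet<>();
        int visited = 0;
        for (Set<String> touched : commitToPartitions.tailMap(startCommit, false).values()) {
            if (visited++ >= maxCommits) {
                break;
            }
            partitions.addAll(touched);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Assumed example tables: one queried incrementally, one not.
        Map<String, Boolean> tables = new LinkedHashMap<>();
        tables.put("/data/trips", true);
        tables.put("/data/users", false);
        System.out.println(classify("/data/trips/2019/01/01", tables));
        System.out.println(classify("/data/users/2019/01/01", tables));
        System.out.println(classify("/tmp/other", tables));

        // Assumed example timeline: commit time -> partitions written by that commit.
        TreeMap<String, Set<String>> timeline = new TreeMap<>();
        timeline.put("001", new TreeSet<>(Arrays.asList("2019/01/01")));
        timeline.put("002", new TreeSet<>(Arrays.asList("2019/01/01", "2019/01/02")));
        timeline.put("003", new TreeSet<>(Arrays.asList("2019/01/03")));
        timeline.put("004", new TreeSet<>(Arrays.asList("2019/01/04")));

        // Only these partitions would need to be listed, not every partition of the table.
        System.out.println(affectedPartitions(timeline, "001", 2));
    }
}
```

The point of the sketch is the ordering: the set of partitions to list is derived from commit metadata first, so the cost of planning scales with the number of commits in the requested range rather than with the total number of partitions in the table.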

Rollout/Adoption Plan

  • Rollout involves replacing the current HoodieInputFormat registered in Hive with this new version.
  • There won't be any user impact, since the behavior of listStatus will remain the same.

Test Plan

Unit tests will cover all operations.

...