...

Released: Yet To Be Determined

Abstract

Apache Hudi supports efficient upserts to datasets by tracking the mapping from record key to fileId. Hudi also guarantees snapshot isolation through an MVCC model by carefully laying out data and tracking metadata. The current implementation of the Hudi writer at Uber does not use this rich metadata to build and manage file-system views. Instead, view building relies on file-system status calls, which can add overhead on the underlying file-system services. Today, each task that creates, merges, or appends files issues partition-listing calls, and so does index lookup. The Cleaner job follows the same pattern and adds further load on the underlying file system. This proposal aims to address these inefficiencies holistically by leveraging the rich metadata that Hudi already collects and stores as part of its timeline.
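To make the contrast concrete, here is a minimal toy sketch of the two ways a file-system view could be built. It does not use Hudi's actual API: `CommitMetadata`, `CountingFileSystem`, `view_from_listing`, and `view_from_timeline` are hypothetical names invented for illustration, and the sketch assumes each commit records which files it wrote per partition. It also ignores files removed by cleaning, which a real implementation would have to account for.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CommitMetadata:
    """Hypothetical stand-in for the per-commit metadata kept on the timeline."""
    instant_time: str
    # partition path -> file names written by this commit (assumed shape)
    files_written: dict = field(default_factory=dict)


class CountingFileSystem:
    """Toy file system that counts listing calls, the cost being avoided."""

    def __init__(self, layout):
        self.layout = layout  # partition path -> list of file names
        self.list_calls = 0

    def list_partition(self, partition):
        self.list_calls += 1
        return list(self.layout.get(partition, []))


def view_from_listing(fs, partitions):
    """Listing-based approach: one file-system call per partition, per caller."""
    return {p: sorted(fs.list_partition(p)) for p in partitions}


def view_from_timeline(commits):
    """Metadata-based approach: replay commit metadata, with no listing calls."""
    view = defaultdict(set)
    for commit in sorted(commits, key=lambda c: c.instant_time):
        for partition, files in commit.files_written.items():
            view[partition].update(files)
    return {p: sorted(files) for p, files in view.items()}
```

In this toy model both functions yield the same view, but the listing-based one costs one file-system call per partition for every task that invokes it, while the timeline-based one reads only metadata that has already been written.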

...