
RFC-06: Add indexing support to the log file format


Proposers

Approvers

Status

Current state: Inactive

Discussion thread: here

JIRA: HUDI-86

Released: TBD

Abstract

Currently, the hudi log is not indexable. Hence, we cannot send inserts into the logs and are forced to write out parquet files instead. If we can solve this, we can unlock truly near-real-time ingest on MOR storage and take our file-sizing story much further. Specifically, it will unlock a very common use case where users just want to log incoming data as quickly as possible and later compact it into larger files. Through this effort, we will also lay the groundwork for improving RT view performance, as explained in detail below.

Background

The current log format is depicted here. It's a simple yet flexible format consisting of a version, a type (avro data, deletes, command), headers, optional content, and footers. Headers and footers allow us to encode metadata, such as the commit instant time and the schema, needed to process a log block. On DFS implementations that support append, a single log file version will contain multiple such log blocks. However, on cloud/object stores without append support, one log file version contains exactly one log block.
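To make the layout concrete, here is a minimal, illustrative sketch of such a log block in Java; the class and field names are assumptions for the example, not the actual Hudi types:

    import java.util.Map;

    // Illustrative only: names below are assumptions for the example,
    // not the actual Hudi log classes.
    enum BlockType { AVRO_DATA, DELETE, COMMAND }

    class LogBlock {
        final int version;                    // log format version
        final BlockType type;                 // avro data, deletes, or a command
        final Map<String, String> headers;    // e.g. commit instant time, writer schema
        final byte[] content;                 // optional payload (records, keys to delete)
        final Map<String, String> footers;    // trailing metadata

        LogBlock(int version, BlockType type, Map<String, String> headers,
                 byte[] content, Map<String, String> footers) {
            this.version = version;
            this.type = type;
            this.headers = headers;
            this.content = content;
            this.footers = footers;
        }
    }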

Coming back to the issue of indexing, the block of specific interest here is the `DataBlock`. BloomIndex is currently able to index parquet/base files using the bloom filters and key range information stored inside the parquet footer. In copy-on-write storage there are no log files, and since the base file is fully indexable, we can send even inserts into the smallest base file and have them visible for index lookup in the next commit. When it comes to merge-on-read, however, the lack of indexing ability on log blocks means we cannot safely assign inserts to an existing small file group and write them to its log. Instead, we write out base files for inserts as well, incurring additional cost for near-real-time ingestion. The small file handling logic in MOR picks the file group with the smallest base file that has no log files (if any exists) and expands it with new inserts; if no such base file can be found, a new parquet file is written. Thus we can also end up creating more file groups than are really needed. This RFC aims to address all of these issues.
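For illustration, the small-file assignment restriction described above could be sketched as follows; the FileGroup shape and method names are hypothetical, not Hudi's actual code:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Illustrative sketch of today's MOR small-file handling; the FileGroup
    // shape and method names are assumptions, not Hudi's actual code.
    class FileGroup {
        long baseFileSize;
        int numLogFiles;
    }

    class SmallFileAssigner {
        // Inserts can currently only target a file group whose base file is small
        // AND has no log files, because log blocks cannot serve index lookups.
        static Optional<FileGroup> pickInsertTarget(List<FileGroup> groups, long smallFileLimit) {
            return groups.stream()
                    .filter(g -> g.numLogFiles == 0)              // restriction this RFC removes
                    .filter(g -> g.baseFileSize < smallFileLimit)
                    .min(Comparator.comparingLong(g -> g.baseFileSize));
            // If empty, the writer falls back to creating a new parquet base file.
        }
    }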

Implementation

At a high level, we will introduce the following changes to a data block. 

...
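As a rough sketch of what making a data block indexable could look like, the following computes a bloom filter and min/max record keys while writing a block and packs them into its footer. The footer key names and the use of Guava's BloomFilter here are assumptions for illustration, not the final design:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: compute a bloom filter plus min/max record keys while writing a
    // data block, and pack them into the block footer. The footer key names and
    // the use of Guava's BloomFilter are assumptions for illustration.
    class IndexedDataBlockWriter {
        static Map<String, String> buildIndexFooter(List<String> recordKeys) throws IOException {
            BloomFilter<String> bloom = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), recordKeys.size(), 0.0001);
            String min = null, max = null;
            for (String key : recordKeys) {
                bloom.put(key);
                if (min == null || key.compareTo(min) < 0) min = key;
                if (max == null || key.compareTo(max) > 0) max = key;
            }
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            bloom.writeTo(bytes);

            Map<String, String> footer = new HashMap<>();
            footer.put("BLOOM_FILTER", Base64.getEncoder().encodeToString(bytes.toByteArray()));
            footer.put("MIN_RECORD_KEY", min);
            footer.put("MAX_RECORD_KEY", max);
            return footer;
        }
    }

This mirrors how BloomIndex already consumes bloom filters and key ranges from parquet footers, so the index lookup path can treat indexed log blocks and base files uniformly.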

[Gliffy diagram: hudi-log-indexing]


Rollout/Adoption Plan

  • This change does introduce a change to the log format, but indexing will seamlessly handle older log blocks written without it, by computing the bloom filters and min/max keys from the actual data (see the sketch after this list).
  • One thing worth calling out: as users pick up this change, there may be a slowdown due to the approach above, since a lot of avro data has to be fully read out. But once sufficient compactions run, the overhead will approach zero, and index lookup should perform equivalently to base file index lookup.
  • Once log indexing is deemed stable over the next 1-2 releases, we will eventually remove all code paths that special-case based on index.canIndexLogFiles().
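A hedged sketch of that backward-compatibility path, reusing the illustrative IndexedDataBlockWriter from the Implementation section: if a block's footer already carries index metadata, use it; otherwise rebuild it by reading the block's records.

    import java.io.IOException;
    import java.util.List;
    import java.util.Map;

    // Hedged sketch of the backward-compatibility path, reusing the illustrative
    // IndexedDataBlockWriter from the Implementation section. Footer key names
    // are the same assumed constants, not actual Hudi metadata keys.
    class LogBlockIndexReader {
        static Map<String, String> indexMetadata(Map<String, String> footer,
                                                 List<String> recordKeysInBlock) throws IOException {
            if (footer.containsKey("BLOOM_FILTER")) {
                // New-style block: index info was written at ingest time.
                return footer;
            }
            // Old-style block: pay the one-time cost of reading the avro records
            // to rebuild the index metadata. Once compactions fold these blocks
            // into base files, this path stops being exercised.
            return IndexedDataBlockWriter.buildIndexFooter(recordKeysInBlock);
        }
    }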

Test Plan

  • Unit tests, including ones that cover a log with both old and new data blocks (a hypothetical test shape is sketched below).
  • Testing with delta streamer in continuous mode, by running a test pipeline for a few days: checking that the keys match, there are no duplicates, and so forth.
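A hypothetical unit-test shape for the mixed old/new block case, leaning on the illustrative helpers above rather than real Hudi test classes:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Map;

    import org.junit.jupiter.api.Test;

    // Hypothetical test shape only; real tests would go through Hudi's log
    // reader/writer classes rather than these illustrative helpers.
    class LogBlockIndexReaderTest {
        @Test
        void oldBlockWithoutFooterGetsIndexRebuilt() throws Exception {
            Map<String, String> rebuilt = LogBlockIndexReader.indexMetadata(
                    Collections.emptyMap(),                 // old block: no index footer
                    Arrays.asList("key3", "key1", "key2")); // keys read from avro content
            assertEquals("key1", rebuilt.get("MIN_RECORD_KEY"));
            assertEquals("key3", rebuilt.get("MAX_RECORD_KEY"));
        }
    }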

...