
RFC-6: Add indexing support to the log file format

Proposers

Approvers

  • Balaji Varadarajan  : [APPROVED/REQUESTED_INFO/REJECTED]
  • @nagarwal : [APPROVED/REQUESTED_INFO/REJECTED]

Status

Current state: "Under Discussion"

Discussion thread: here

JIRA:

Released: TBD

Abstract

Currently, the Hudi log is not indexable. Hence, we cannot send inserts into the logs and are forced to write out parquet files instead. If we can solve this, we can unlock truly near-real-time ingest on MOR storage, as well as take our file sizing story much further. Specifically, it will unlock a very common use case where users just want to log incoming data as quickly as possible and later compact it to larger files. Through this effort, we will also lay the groundwork for improving RT view performance, as explained in detail below.

Background

The current log format is depicted here. It is a simple, yet flexible format that consists of: version, type (avro data, deletes, command), header, optional content, and footers. Headers and footers allow us to encode various metadata, like the commit instant time and schema, needed for processing that log block. On DFS implementations that support append, multiple of these log blocks will be present in a single log file version. However, on cloud/object stores without append support, one log file version contains one log block.
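To make the layout above concrete, here is a minimal sketch of a log block as plain Java types. The class and field names are illustrative placeholders, not Hudi's actual implementation.

```java
import java.util.Map;

public class LogBlockSketch {

  // Block types named in the format description: avro data, deletes, command.
  enum BlockType { AVRO_DATA_BLOCK, DELETE_BLOCK, COMMAND_BLOCK }

  // Example header/footer metadata keys, e.g. commit instant time and schema.
  enum MetadataKey { INSTANT_TIME, SCHEMA }

  static class LogBlock {
    int version;                          // format version of the block
    BlockType type;                       // what the content represents
    Map<MetadataKey, String> header;      // metadata needed to process the block
    byte[] content;                       // optional serialized records
    Map<MetadataKey, String> footer;      // trailing metadata
  }
}
```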

Coming back to the issue of indexing, the block of specific interest to us here is the `DataBlock`. The BloomIndex is currently able to index the parquet/base files using bloom filters and range information stored inside the parquet footer. In copy-on-write storage there are no log files, and since the base file is fully indexable, we are able to send even inserts into the smallest base file and have them be visible for index lookup in the next commit. However, when it comes to merge-on-read, the lack of indexing ability on log blocks means we cannot safely assign inserts to an existing small file group and write them to its log. Instead, we write out base files for inserts as well, and thus incur additional cost for near real-time ingestion. Small file handling logic in MOR picks the file group with the smallest base file that has no log files (if any) and expands it with new inserts. If it cannot find such a base file, a new parquet file is written. Thus we could also create more file groups than are really needed in such cases. This RFC aims to address all of these issues.
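As a rough illustration of the bloom-filter-based pruning described above, the sketch below filters incoming keys against a min/max key range and a bloom filter (here a Guava BloomFilter standing in for whatever is stored in the footer). The method and parameter names are assumptions for illustration, not the actual BloomIndex code.

```java
import com.google.common.hash.BloomFilter;
import java.util.List;
import java.util.stream.Collectors;

public class BloomLookupSketch {

  /** Keys that fall inside the footer's key range and pass the bloom filter. */
  static List<String> candidateKeys(List<String> incomingKeys,
                                    String minKey, String maxKey,
                                    BloomFilter<CharSequence> bloom) {
    return incomingKeys.stream()
        // range check first: cheap and prunes keys outside this file's key span
        .filter(k -> k.compareTo(minKey) >= 0 && k.compareTo(maxKey) <= 0)
        // bloom filter may return false positives, but never false negatives
        .filter(bloom::mightContain)
        .collect(Collectors.toList());
  }
}
```

Keys that survive both checks still have to be verified against the actual record keys in the file, since the bloom filter only guarantees the absence of false negatives.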

Implementation

At a high level, we will introduce the following changes to a data block. 

  • Introduce a new version of the avro data block that also writes a sorted list of keys along with the avro data.
  • Introduce three new footers: a bloom filter and min/max key ranges for the avro data (all cumulative); a sketch of how these could be maintained follows this list.
    • The bloom filter is built with all keys encountered so far in the log, i.e. including all previously committed data blocks.
    • Similarly, the min/max ranges cover the entire set of committed data blocks in that log file.
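A hedged sketch of how the cumulative footers could be built as blocks are appended. The CumulativeBlockStats class, its sizing parameters, and the idea of folding each block's keys into running state are illustrative assumptions; only the Guava BloomFilter API is real.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.TreeSet;

public class DataBlockFooterSketch {

  static class CumulativeBlockStats {
    // Built with all keys seen so far in this log file, not just the current block.
    BloomFilter<CharSequence> bloom =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.001);
    String minKey;
    String maxKey;

    /** Fold a new block's keys into the running footer values. */
    void addBlock(List<String> recordKeys) {
      if (recordKeys.isEmpty()) {
        return;
      }
      TreeSet<String> sorted = new TreeSet<>(recordKeys); // keys are logged in sorted order
      sorted.forEach(bloom::put);
      if (minKey == null || sorted.first().compareTo(minKey) < 0) {
        minKey = sorted.first();
      }
      if (maxKey == null || sorted.last().compareTo(maxKey) > 0) {
        maxKey = sorted.last();
      }
    }
  }
}
```

Because the stats are cumulative, only the last data block's footer needs to be consulted at lookup time, which is what keeps the lookup cost comparable to reading a base file footer.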

The diagram below shows one file group with several log file versions and multiple data blocks inside them. Old data blocks simply contain records, whereas new blocks additionally contain the keys, bloom filter and ranges described above. Once a compaction runs, all logs will only contain new blocks. For index lookup, we will simply read the bloom filter and min/max ranges from the last data block (similar overhead/performance as reading base file footers today). If there are matches or false positives, then we will need to read all data blocks in the log. Thanks to logging keys separately, we don't pay the overhead of reading the entire avro records when checking against the actual keys; it can be done by simply reading the keys we wrote in the new data blocks. Old data blocks would still pay this penalty until compaction runs at least once for that file group.
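The lookup flow just described could look roughly like the sketch below: prune with the last block's cumulative footer, then confirm against actual keys, reading only the key lists for new-format blocks and falling back to a full record read for old ones. The LogDataBlock interface and all method names are hypothetical placeholders.

```java
import com.google.common.hash.BloomFilter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LogIndexLookupSketch {

  interface LogDataBlock {
    boolean hasKeyFooter();                       // written in the new format?
    List<String> readKeys();                      // cheap: keys only (new blocks)
    List<String> readRecordKeysFromRecords();     // expensive: full avro read (old blocks)
  }

  static List<String> keysPresentInLog(List<String> candidateKeys,
                                       List<LogDataBlock> blocksOldestFirst,
                                       BloomFilter<CharSequence> lastBlockBloom,
                                       String minKey, String maxKey) {
    // Step 1: prune with the last block's cumulative range and bloom filter.
    List<String> maybePresent = new ArrayList<>();
    for (String k : candidateKeys) {
      if (k.compareTo(minKey) >= 0 && k.compareTo(maxKey) <= 0
          && lastBlockBloom.mightContain(k)) {
        maybePresent.add(k);
      }
    }
    if (maybePresent.isEmpty()) {
      return maybePresent;                        // nothing can be in this log
    }
    // Step 2: confirm against actual keys, block by block.
    Set<String> keysInLog = new HashSet<>();
    for (LogDataBlock block : blocksOldestFirst) {
      keysInLog.addAll(block.hasKeyFooter() ? block.readKeys()
                                            : block.readRecordKeysFromRecords());
    }
    maybePresent.retainAll(keysInLog);
    return maybePresent;
  }
}
```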


[Diagram: hudi-log-indexing]


Rollout/Adoption Plan

  • This change does introduce a change to the log format, but indexing will seamlessly handle older log blocks written without it, by computing the bloom filters and min/max ranges from the actual data (a sketch of this fallback follows this list).
  • One thing worth calling out: as users pick up this change, there might be a slowdown due to the approach above, since a lot of avro data will be fully read out. But once sufficient compactions run, the overhead will approach zero and index lookup should perform equivalently to base file index lookup.
  • Once the log indexing is deemed stable in the next 1-2 releases, we will eventually get rid of all code paths that special case based on index.canIndexLogFiles() 
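A small sketch of the backward-compatibility path mentioned in the first bullet: when a block was written before this change (no key/bloom footers), derive the same stats by reading its records. The LogDataBlock interface mirrors the hypothetical one in the lookup sketch above; none of this is actual Hudi code, and the sizing parameters are placeholders.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class OldBlockFallbackSketch {

  interface LogDataBlock {
    boolean hasKeyFooter();
    BloomFilter<CharSequence> readBloomFooter();   // available on new blocks only
    List<String> readRecordKeysFromRecords();      // full avro read, for old blocks
  }

  /** Return a usable bloom filter for any block, old or new. */
  static BloomFilter<CharSequence> bloomFor(LogDataBlock block) {
    if (block.hasKeyFooter()) {
      return block.readBloomFooter();              // cheap footer read
    }
    // Fallback for pre-change blocks: rebuild the filter from the data itself.
    BloomFilter<CharSequence> bloom =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000, 0.001);
    block.readRecordKeysFromRecords().forEach(bloom::put);
    return bloom;
  }
}
```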

Test Plan

  • Unit tests including ones that cover a log with both old and new data blocks.
  • Testing with delta streamer in continuous mode, by running a test pipeline for a few days: checking that keys match, that there are no duplicates, and so forth.