...

An append + update, a.k.a. upsert, type of dataset presents a different challenge. Any entry written to such a dataset may depend on, or belong to, a previously written row. More concretely, each row in the table is not necessarily a new row and may overlap with a previously written row. In such a scenario, the system needs to determine which existing row the new update applies to, and hence which file id the update should be routed to.
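To make the routing requirement concrete, here is a minimal sketch (in Java; this is an illustration, not Hudi's actual API): an incoming record key either resolves to a previously written file id, making the record an update, or resolves to nothing, making it a plain insert.

    import java.util.Map;
    import java.util.Optional;

    final class UpsertRoutingSketch {

        // Hypothetical mapping of record key -> file id; in practice this is
        // exactly the mapping an index must provide.
        static Optional<String> routeUpdate(Map<String, String> keyToFileId, String recordKey) {
            return Optional.ofNullable(keyToFileId.get(recordKey));
        }

        public static void main(String[] args) {
            Map<String, String> index = Map.of("user_42", "file-0001");
            // Known key: the record is an update and must go to file-0001.
            System.out.println(routeUpdate(index, "user_42")); // Optional[file-0001]
            // Unknown key: no prior location, so the record is a plain insert.
            System.out.println(routeUpdate(index, "user_99")); // Optional.empty
        }
    }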

The three approaches used by consumers of Hudi are as follows:

  1. The data layout is divided into buckets called partitions, and each partition consists of N files. The client maintains a mapping of record key <-> file id for updatable tables and provides the partition information before passing the records off to Hudi for processing. The HoodieBloomIndex implementation scans the bloom filters of all files under the partition and, on a match, verifies that the record key actually lies in that file (a minimal sketch of this lookup follows the list). This way Hudi is able to “tag” the incoming records with the correct file ids.
  2. The data layout is a flat structure, where a single directory contains all the files for the updatable table. The GlobalHoodieBloomIndex implementation scans the bloom filters of all files under the base path and, on a match, verifies that the record key actually lies in that file. This way Hudi is able to “tag” the incoming records with the correct file ids. The difference from the first approach is that if the number of files under the base path is very large, the indexing time can blow up.
  3. For append-only datasets, which do not require updating, simply derive the partition from the current payload, e.g., the current timestamp.
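The bloom-filter lookup in approaches 1 and 2 can be sketched as follows. This is only an illustration, not HoodieBloomIndex's real internals: Guava's BloomFilter stands in for the filters Hudi keeps per data file, and a plain Set stands in for the exact key verification against the file's contents.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.Optional;
    import java.util.Set;

    final class BloomTaggingSketch {

        // Hypothetical per-file state; in Hudi the keys live in the data
        // file itself rather than in an in-memory Set.
        record DataFile(String fileId, BloomFilter<CharSequence> bloom, Set<String> keys) {}

        /** Returns the file id an incoming record key should be routed to, if any. */
        static Optional<String> tag(List<DataFile> filesInPartition, String recordKey) {
            for (DataFile f : filesInPartition) {
                // Cheap probabilistic check first; bloom filters can false-positive
                // but never false-negative, so a miss lets us skip the file entirely.
                if (f.bloom().mightContain(recordKey) && f.keys().contains(recordKey)) {
                    return Optional.of(f.fileId()); // verified match: this is an update
                }
            }
            return Optional.empty(); // no file claims the key: treat as an insert
        }

        public static void main(String[] args) {
            BloomFilter<CharSequence> bloom =
                    BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1000);
            bloom.put("user_42");
            DataFile file = new DataFile("file-0001", bloom, Set.of("user_42"));
            System.out.println(tag(List.of(file), "user_42")); // Optional[file-0001]
            System.out.println(tag(List.of(file), "user_99")); // Optional.empty
        }
    }

The global variant in approach 2 is the same loop run over all files under the base path instead of one partition, which is why its cost grows with the total file count.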

Irrespective of the type of dataset in use (append-only or append + update), index lookup plays a critical role in both read and write performance. If we can provide record-level indexing that maps each record key to its file id, and possibly its partition (depending on the use case), without adding too much latency, it will be a good improvement to Hudi's performance.
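To make the target capability concrete, below is one possible shape of such a record-level index. All names here are hypothetical, not a committed Hudi API; the essential operation is a point lookup from a record key to its location (file id, and optionally partition).

    import java.util.Optional;

    // Hypothetical interface for the record-level index this RFC proposes;
    // the names below are illustrative only.
    interface RecordLevelIndex {

        // Where a record currently lives: the partition (when tracked) and the file id.
        record RecordLocation(Optional<String> partitionPath, String fileId) {}

        /** Point lookup: record key -> location, or empty if the key was never written. */
        Optional<RecordLocation> getLocation(String recordKey);

        /** Called after a commit so newly written records become visible to lookups. */
        void update(String recordKey, RecordLocation location);
    }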

So this RFC aims at providing a record-level indexing capability to Hudi for faster lookups.

With a large amount of data being written to data warehouses, there is a need to better partition this data to improve query performance. Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. But quite often users need to filter the data on specific column values. Partitions help in efficient querying of the dataset, since they limit the amount of data that needs to be scanned. In non-partitioned tables, Hive/Presto/Spark would have to read all the files in a table's data directory and subsequently apply filters on them. This is slow and expensive, especially for large tables.

Implementation



...