Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

As long as Hudi primitives can be made to understand this new file format changes, bootstrapping an existing table would only require generating Hudi skeleton metadata only. From initial experiments on some production grade raw-data table, we find this bootstrap mechanism to be an order of magnitude faster than normal bootstrap. The new bootstrap process for a dataset containing around 3500 partitions, 250K files and around 60 billion rows with a single field as row-key took around 1 hour to bootstrap with 500 executors each containing 1 core and 4 GB memory. The old bootstrap process though required 4x the number of executors (2000) to finish bootstrapping in a day (~24 hrs). 

New Bootstrap Process:

The new bootstrap process would involve the following steps. Let us imagine a parquet dataset “fact_events” needs to be bootstrapped to hudi dataset.  Let us imagine that the root path of this dataset is “/user/hive/warehouse/fact_events” and there are several day based partitions within fact_events. Within each partition, there are several parquet files.  Please see below for a pictorial representation.

...