...
- User stops all write operations on the original dataset.
- User initiates this new bootstrap once, either through delta-streamer or through a standalone tool (see the configuration sketch after this list). Users provide the following information as part of the bootstrap:
  - Original (non-hudi) dataset base location
  - Columns to be used for generating hudi keys
  - Parallelism for running this bootstrap
  - New hudi dataset location
- Hudi bootstrap scans partitions and files in the original root location “/user/hive/warehouse/fact_events” and performs the following operations:
  - Creates similar hudi partitions in the new dataset location. In the above example, day-based partitions will be created under “/user/hive/warehouse/fact_events_hudi”
  - Using Spark parallelism, generates a unique file ID for each original parquet file and uses it to create a corresponding hudi skeleton parquet file. A special commit timestamp called “BOOTSTRAP_COMMIT” is used. For the remainder of this document, let us imagine BOOTSTRAP_COMMIT having the timestamp “000000000”. For example, if an original parquet file is at the path /user/hive/warehouse/fact_events/year=2015/month=12/day=31/file1.parquet and the unique file ID generated for it is h1, then the skeleton file corresponding to the above parquet file will be stored at the path /user/hive/warehouse/fact_events_hudi/year=2015/month=12/day=31/h1_1-0-1_000000000.parquet (see the naming sketch after this list).
  - Generates a special bootstrap index which maps each newly generated hudi skeleton file ID to its corresponding original parquet file (see the index sketch after this list).
  - Atomically commits the bootstrap operation using the same hudi timeline state transitions. Standard rollback will also be supported to roll back both inflight and committed bootstraps.
  - If hive syncing is enabled, creates a brand new hudi hive table pointing to the new location: “/user/hive/warehouse/fact_events_hudi”
- Subsequent write operations happen on the hudi dataset.
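To make the one-time invocation concrete, here is a minimal sketch of the four user-provided bootstrap inputs from the list above, collected as plain java.util.Properties. The property keys, the event_id key column, and the parallelism value are illustrative placeholders, not a committed configuration surface:

```java
import java.util.Properties;

public class BootstrapConfigSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Original (non-hudi) dataset base location
    props.setProperty("bootstrap.source.base.path", "/user/hive/warehouse/fact_events");
    // New hudi dataset location
    props.setProperty("bootstrap.target.base.path", "/user/hive/warehouse/fact_events_hudi");
    // Columns to be used for generating hudi keys (hypothetical column name)
    props.setProperty("bootstrap.record.key.columns", "event_id");
    // Parallelism for running this bootstrap (hypothetical value)
    props.setProperty("bootstrap.parallelism", "500");
    props.forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```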
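The skeleton file name in the example above follows the shape &lt;fileId&gt;_&lt;writeToken&gt;_&lt;commitTime&gt;.parquet. Below is a minimal sketch of that construction, assuming a hypothetical makeSkeletonFileName helper and treating the 1-0-1 write token from the example as an opaque string:

```java
public class SkeletonFileNaming {
  // The fixed BOOTSTRAP_COMMIT timestamp used throughout this document.
  static final String BOOTSTRAP_COMMIT = "000000000";

  // Assembles <fileId>_<writeToken>_<commitTime>.parquet.
  static String makeSkeletonFileName(String fileId, String writeToken, String commitTime) {
    return String.format("%s_%s_%s.parquet", fileId, writeToken, commitTime);
  }

  public static void main(String[] args) {
    String partition = "/user/hive/warehouse/fact_events_hudi/year=2015/month=12/day=31";
    // Prints .../h1_1-0-1_000000000.parquet, matching the example above.
    System.out.println(partition + "/" + makeSkeletonFileName("h1", "1-0-1", BOOTSTRAP_COMMIT));
  }
}
```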
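The bootstrap index itself can be pictured as a mapping from skeleton file ID to the original parquet file it shadows. This section does not fix the index's physical format, so the in-memory Map below is only an assumption used for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class BootstrapIndexSketch {
  public static void main(String[] args) {
    // Skeleton file ID -> original parquet file backing it (assumed representation).
    Map<String, String> bootstrapIndex = new HashMap<>();
    bootstrapIndex.put("h1",
        "/user/hive/warehouse/fact_events/year=2015/month=12/day=31/file1.parquet");
    System.out.println("h1 -> " + bootstrapIndex.get("h1"));
  }
}
```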
...