Summary

TL;DR:

Apache Hudi is currently in the incubating phase. Earlier versions of Hudi used the Maven coordinates com.uber.hoodie and package names starting with com.uber.hoodie. In the first Apache version of Hudi (0.5.0), both bundle and package names follow Apache conventions (org.apache.hudi). This document is intended for engineers who build and operate Hudi datasets and need to migrate from a pre-0.5.0 Hudi version to 0.5.0 (the first Apache release).

...

Hudi provides custom input format implementations for working with Hive tables. These classes are also affected by the change in package namespace. In addition, the input formats have been renamed to indicate that they work primarily on Parquet datasets.

The name changes are listed below:

View Type	Pre v0.5.0 Input Format Class	v0.5.0 Input Format Class
Read Optimized View	com.uber.hoodie.hadoop.HoodieInputFormat	org.apache.hudi.hadoop.HoodieParquetInputFormat
Realtime View	com.uber.hoodie.hadoop.HoodieRealtimeInputFormat	org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat

Changes in Spark DataSource Format Name:

With the package renaming, Hudi’s Spark DataSource is now accessed for reading and writing using the format name “org.apache.hudi”.

Data Source Type	Pre v0.5.0 Format (e.g. in Scala)	v0.5.0 Format (e.g. in Scala)
Read	spark.read.format("com.uber.hoodie").xxxx	spark.read.format("org.apache.hudi").xxxx
Write	spark.write.format("com.uber.hoodie").xxxx	spark.write.format("org.apache.hudi").xxxx
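As a minimal sketch of how a job might keep the format name in one place during the migration, the Python helpers below select the DataSource name depending on whether the job already runs on 0.5.0. The names `hudi_format`, `read_hudi`, `write_hudi` and the `upgraded` flag are hypothetical, `spark` is assumed to be a SparkSession and `df` a Spark DataFrame, and the append save mode is only an illustrative choice.

```python
# Format names taken from the table above.
OLD_FORMAT = "com.uber.hoodie"   # pre-0.5.0
NEW_FORMAT = "org.apache.hudi"   # 0.5.0


def hudi_format(upgraded: bool = True) -> str:
    """Return the Spark DataSource format name for the Hudi version in use."""
    return NEW_FORMAT if upgraded else OLD_FORMAT


def read_hudi(spark, path: str, upgraded: bool = True):
    """Load a Hudi dataset; `spark` is assumed to be a SparkSession."""
    return spark.read.format(hudi_format(upgraded)).load(path)


def write_hudi(df, path: str, options: dict, upgraded: bool = True):
    """Write to a Hudi dataset; `df` is assumed to be a Spark DataFrame and
    `options` holds whatever Hudi write options the job already uses."""
    (df.write.format(hudi_format(upgraded))
        .options(**options)
        .mode("append")   # assumption: the job appends to an existing dataset
        .save(path))
```

With this arrangement, switching a job from the old bundle to the new one only requires flipping the `upgraded` flag (or removing it once every job is on 0.5.0).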

Migrating Existing Hudi Datasets:

...

Recommended Migration Steps:

Upgrade Hudi to 0.4.7 first (recommended):

Using the local dockerized environment, we have manually tested the upgrade from com.uber.hoodie:hoodie-xxx-0.4.7 to org.apache.hudi:hudi-xxx-0.5.0. While an upgrade from a pre-0.4.7 release to hudi-0.5.0 should theoretically work, we have not tested those migration steps ourselves.

Upgrade Readers First:

Hudi 0.5.0 (org.apache.hudi:hudi-xxx) packages include special classes and implementations that allow reading datasets written by 0.4.7 and pre-0.4.7 versions. Upgrading writers first could cause queries from old readers to fail.

Upgrade Hudi Writer Next:

The upgraded writer will start writing metadata with the new namespace “org.apache.hudi”, and the query engines (which have already been upgraded) will be able to handle this change.

Register New HoodieInputFormat for Hive Tables: For existing Hive tables, change the table definition to use the new Hudi input format.

For Read Optimized Tables: ALTER TABLE table_name SET FILEFORMAT org.apache.hudi.hadoop.HoodieParquetInputFormat;
For Realtime Tables: ALTER TABLE table_name SET FILEFORMAT org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat;
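When many Hive tables need to be re-registered, the statements can be generated rather than typed by hand. The following is a small sketch under stated assumptions: `alter_statements` is a hypothetical helper, the table names are placeholders, and the class names are the v0.5.0 input format classes from the table earlier in this document. It only builds the SQL text; running it against Hive is left to whatever client you already use.

```python
# v0.5.0 input format classes, as listed in the table earlier in this document.
READ_OPTIMIZED_FORMAT = "org.apache.hudi.hadoop.HoodieParquetInputFormat"
REALTIME_FORMAT = "org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat"


def alter_statements(read_optimized_tables, realtime_tables):
    """Build one ALTER TABLE statement per Hive table, following the
    statement template given in this migration step."""
    statements = []
    for table in read_optimized_tables:
        statements.append(
            f"ALTER TABLE {table} SET FILEFORMAT {READ_OPTIMIZED_FORMAT};")
    for table in realtime_tables:
        statements.append(
            f"ALTER TABLE {table} SET FILEFORMAT {REALTIME_FORMAT};")
    return statements


if __name__ == "__main__":
    # Placeholder table names, for illustration only.
    for stmt in alter_statements(["trips_ro"], ["trips_rt"]):
        print(stmt)
```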

For MOR tables, update the hoodie.properties file so that the value of hoodie.compaction.payload.class uses “org.apache.hudi” instead of “com.uber.hoodie”. We have a utility script that takes a list of base-paths to be upgraded and performs the rename. In theory, there is a small time window in which queries and ingestion could see a partial hoodie.properties file while the utility script is overwriting it. To be completely safe, perform this operation with downtime, though in practice you will most likely be fine. See below for an example invocation:

java -cp $HUDI_UTILITIES_BUNDLE:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.8.4.jar org.apache.hudi.utilities.adhoc.UpgradePayloadFromUberToApache --help
Usage: <main class> [options]
  Options:
  * --datasets_list_path, -sp
       Local File containing list of base-paths for which migration needs to be performed
    --help, -h
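
To illustrate the kind of rewrite involved, here is a minimal Python sketch that moves the payload class value to the new namespace in the text of a hoodie.properties file. This is not the utility's implementation: `upgrade_payload_class` is a hypothetical function, it assumes simple `key=value` lines, and it works on a string so that it makes no assumption about the storage system holding the file.

```python
PAYLOAD_KEY = "hoodie.compaction.payload.class"
OLD_PREFIX = "com.uber.hoodie"
NEW_PREFIX = "org.apache.hudi"


def upgrade_payload_class(properties_text: str) -> str:
    """Return the properties text with the payload class moved to the
    org.apache.hudi namespace; every other line is returned unchanged."""
    upgraded = []
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        if (sep and key.strip() == PAYLOAD_KEY
                and value.strip().startswith(OLD_PREFIX)):
            # Swap only the namespace prefix and keep the rest of the class name.
            new_value = NEW_PREFIX + value.strip()[len(OLD_PREFIX):]
            line = f"{key.strip()}={new_value}"
        upgraded.append(line)
    # Keep a trailing newline if the input had one.
    trailing = "\n" if properties_text.endswith("\n") else ""
    return "\n".join(upgraded) + trailing
```

The function is idempotent: applying it to an already upgraded file returns the text unchanged, so a retried run does no harm.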

...
