...
Apache Hudi is currently in the incubating phase. Earlier versions of Hudi had the Maven group id com.uber.hoodie and package names starting with com.uber.hoodie. The first Apache release of Hudi (0.5.0) has both bundle and package names following Apache conventions (org.apache.hudi). This document is intended for engineers who develop, build, and operate Hudi datasets and need to migrate from a pre-0.5.0 Hudi version to 0.5.0 (the first Apache release).
...
- Custom Hudi hooks written by users (such as a custom payload class or partition-value extractor) need to be modified to use the new base classes/interfaces (see below).
- Hudi’s integration with query engines like Hive and Presto relies on Hudi input format classes (e.g. com.uber.hoodie.hadoop.HoodieInputFormat), which are registered in the Hive metastore. The input format is part of the Hive table definition. As the namespace for these input format classes has changed, the upgrade to 0.5.0 has to be done carefully, in a specific order, to avoid compatibility issues between query engines and Hudi writers.
- Some of Hudi’s metadata (compaction plans, actions like clean/rollback) is stored in Avro format with the “com.uber.hoodie” namespace. Also, the record payload class name (which could be com.uber.hoodie.xxx) is tracked in the hoodie.properties file. Again, the upgrade needs to be carefully planned to avoid any interoperability issues related to this.
...
In some cases, you may have written custom hooks for merging records, transforming upstream data sources, or other purposes. You need to be aware of the following changes to the base interfaces for these hooks.
| Hook | Old base class/interface | New base class/interface |
| --- | --- | --- |
| Custom record payload to perform custom merge semantics | com.uber.hoodie.common.model.HoodieRecordPayload | org.apache.hudi.common.model.HoodieRecordPayload |
| Custom partition value extractor for syncing to the Hive metastore | com.uber.hoodie.hive.PartitionValueExtractor | org.apache.hudi.hive.PartitionValueExtractor |
| Custom Hoodie key generator from a record | com.uber.hoodie.KeyGenerator | org.apache.hudi.KeyGenerator |
| Custom upstream source for HoodieDeltaStreamer | com.uber.hoodie.utilities.sources.Source | org.apache.hudi.utilities.sources.Source |
| Custom data source transformer for HoodieDeltaStreamer | com.uber.hoodie.utilities.transform.Transformer | org.apache.hudi.utilities.transform.Transformer |
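Because the relocation is a pure package rename, updating custom hook code is usually a mechanical find-and-replace over imports and fully qualified names. A minimal sketch (assuming GNU sed and a throwaway `src/` directory created here for demonstration; back up your real source tree first):

```shell
# Demo: rewrite the old com.uber.hoodie package prefix to org.apache.hudi
# across a Java source tree (here, a sample file created for illustration).
mkdir -p src
printf 'import com.uber.hoodie.common.model.HoodieRecordPayload;\n' > src/MyPayload.java

# Replace every occurrence of the old package prefix in all .java files.
find src -name '*.java' -exec sed -i 's/com\.uber\.hoodie/org.apache.hudi/g' {} +

cat src/MyPayload.java
```

After the rewrite, recompile against the new org.apache.hudi artifacts; any remaining references to removed or relocated classes will surface as compile errors.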
New Hudi Bundle Packages:
The table below maps the old bundle coordinates to the new bundle coordinates.
| S.No | Old Bundle Name | New Bundle Name |
| --- | --- | --- |
| 1 | com.uber.hoodie:hoodie-hadoop-mr-bundle | org.apache.hudi:hudi-hadoop-mr-bundle |
| 2 | com.uber.hoodie:hoodie-hive-bundle | org.apache.hudi:hudi-hive-bundle |
| 3 | com.uber.hoodie:hoodie-spark-bundle | org.apache.hudi:hudi-spark-bundle |
| 4 | com.uber.hoodie:hoodie-presto-bundle | org.apache.hudi:hudi-presto-bundle |
| 5 | com.uber.hoodie:hoodie-utilities-bundle | org.apache.hudi:hudi-utilities-bundle |
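As a concrete example, a Maven dependency on the old Spark bundle would change roughly as follows. The exact version string is an assumption here: the 0.5.0 artifacts were published under the Apache Incubator, so verify the precise version (e.g. an `-incubating` suffix) on Maven Central before using it.

```xml
<!-- Before: pre-0.5.0 coordinates -->
<dependency>
  <groupId>com.uber.hoodie</groupId>
  <artifactId>hoodie-spark-bundle</artifactId>
  <version>0.4.8</version>
</dependency>

<!-- After: Apache coordinates (version shown is an assumption; verify on Maven Central) -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-spark-bundle</artifactId>
  <version>0.5.0-incubating</version>
</dependency>
```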
Changes in Input Format classes for Hive Tables:
Hudi has custom input format implementations to work with Hive tables. These classes are also affected by the change in package namespace. The relocation details are below.
| View Type | Pre-v0.5.0 Input Format Class | v0.5.0 Input Format Class |
| --- | --- | --- |
| Read Optimized View | com.uber.hoodie.hadoop.HoodieInputFormat | org.apache.hudi.hadoop.HoodieInputFormat |
| Realtime View | com.uber.hoodie.hadoop.realtime.HoodieRealtimeInputFormat | org.apache.hudi.hadoop.realtime.HoodieRealtimeInputFormat |
Migrating Existing Hudi Datasets:
...
- Upgrade Hudi to 0.4.8 first (recommended):
- Using the local dockerized environment, we have manually tested the upgrade from com.uber.hoodie:hoodie-xxx 0.4.8 to org.apache.hudi:hudi-xxx 0.5.0. While upgrading from a pre-0.4.8 release directly to 0.5.0 should theoretically work, we have not tested those migration steps.
- Upgrade Readers First:
- Hudi 0.5.0 (org.apache.hudi:hudi-xxx) packages include special classes and implementations that allow reading datasets written by 0.4.8 and pre-0.4.8 versions. Upgrading writers first could cause queries from old readers to fail.
- Upgrade Hudi Writers Next:
- Writers will then start writing metadata with the new namespace “org.apache.hudi”, and the query engines (which have already been upgraded) will be able to handle this change.
- Register the New HoodieInputFormat for Hive Tables: for existing Hive tables, change the table definition to use the new Hudi input format.
- For Read Optimized tables: `ALTER TABLE table_name SET FILEFORMAT org.apache.hudi.hadoop.HoodieInputFormat;`
- For Realtime tables: `ALTER TABLE table_name SET FILEFORMAT org.apache.hudi.hadoop.realtime.HoodieRealtimeInputFormat;`
- For MOR tables, update the hoodie.properties file to rename the configuration value of hoodie.compaction.payload.class from “com.uber.hoodie” to “org.apache.hudi”. We have a utility script that takes a list of base paths to be upgraded and performs the rename. See below for an example invocation:
```
java -cp $HUDI_UTILITIES_BUNDLE:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.8.4.jar org.apache.hudi.utilities.adhoc.UpgradePayloadFromUberToApache --help
Usage: <main class> [options]
  Options:
  * --datasets_list_path, -sp
      Local file containing list of base-paths for which migration needs to be performed
    --help, -h
```
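For a one-off migration of a single dataset, the same rename can also be sketched by hand against the dataset's hoodie.properties (a manual alternative to the utility above). This is a demo against a throwaway local file; the property name follows this section, GNU sed is assumed, and the payload class value shown is just an example. For a real dataset the file lives under `<base-path>/.hoodie/`, possibly on HDFS/S3, where you would copy it down, edit, and copy it back:

```shell
# Demo: create a throwaway .hoodie/hoodie.properties with the old namespace.
mkdir -p .hoodie
printf 'hoodie.compaction.payload.class=com.uber.hoodie.common.model.HoodieAvroPayload\n' \
  > .hoodie/hoodie.properties

# Rename the payload class namespace from com.uber.hoodie to org.apache.hudi.
sed -i 's/com\.uber\.hoodie/org.apache.hudi/' .hoodie/hoodie.properties

cat .hoodie/hoodie.properties
```

The bundled utility is preferable when many datasets need upgrading, since it takes a list of base paths and handles the edit for each.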