...

Apache Hudi is currently in the incubating phase. Earlier versions of Hudi used the Maven group ID com.uber.hoodie and package names starting with com.uber.hoodie. The first Apache release of Hudi (0.5.0) moves both bundle and package names to Apache conventions (org.apache.hudi). This document is intended for engineers who build and operate Hudi datasets and need to migrate from a pre-0.5.0 Hudi version to 0.5.0 (the first Apache release).
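
As an illustration of the coordinate change, a build that depends on a pre-Apache artifact has to switch both the Maven group ID and the hoodie- artifact prefix. Below is a minimal sketch assuming the hudi-spark module and the 0.5.0-incubating release string; substitute the module and version your deployment actually uses:

Code Block
<!-- Before: pre-Apache coordinates -->
<dependency>
  <groupId>com.uber.hoodie</groupId>
  <artifactId>hoodie-spark</artifactId>
  <version>0.4.8</version>
</dependency>

<!-- After: Apache coordinates (hoodie-xxx becomes hudi-xxx) -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-spark</artifactId>
  <version>0.5.0-incubating</version>
</dependency>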

...

These are general change-management guidelines for any big data system, but they are especially important in the context of a Hudi migration.

As you are responsible for your organization's data in Hudi, we recommend you have:

  1. Staging Setup: Have a non-production testing environment (staging) with non-trivial datasets. Test any Hudi version upgrade in this environment before rolling it out to production.
  2. Continuous Testing in Staging: With big data systems, sufficient baking time and traffic have to pass to verify and harden new versions before they can be rolled out to production. The same is true for Hudi upgrades. Make sure due diligence is given to testing in staging.
  3. Backwards Compatibility Testing: Apache Hudi sits in a unique place in the data lake ecosystem. On one hand, it integrates with the data ingestion side (involving processing engines like Spark, upstream sources like Kafka, and storage like HDFS/S3/GCS); on the other hand, it has to work seamlessly with query engines (Spark/Hive/Presto). For large deployments, it may not be possible to stop the world and upgrade all services to a new version of Hudi at once. In those cases, make sure you perform backwards compatibility testing by upgrading readers first.
  4. Tiered Gradual Rollout: Once a release is properly vetted in staging, have production deployment strategies in place such that you can roll any Hudi version upgrade out to one dataset or a small subset of datasets first, in one datacenter (if multi-colo), and validate for some amount of time before rolling it out to the entire service.

...

  1. Upgrade Hudi to 0.4.8 first (recommended):
    1. Using the local dockerized environment, we have manually tested the upgrade from com.uber.hoodie:hoodie-xxx-0.4.8 to org.apache.hudi:hudi-xxx-0.5.0. While the upgrade from a pre-0.4.8 release to hudi-0.5.0 should theoretically work, we have not personally tested those migration steps.
  2. Upgrade Readers First:
    1. Hudi 0.5.0 (org.apache.hudi:hudi-xxx) packages contain special classes and implementations that allow reading datasets written by 0.4.8 and pre-0.4.8 versions. Upgrading writers first could cause queries from old readers to fail.
  3. Upgrade Hudi Writers Next:
    1. Writers will then start writing metadata with the new namespace “org.apache.hudi”, and the query engines (which have already been upgraded) will be able to handle this change.
  4. Register New HoodieInputFormat for Hive Tables: For existing Hive tables, change the table definition to use the new Hudi input format (see the DDL sketch after the utility example below).
    1. For Read Optimized Tables: ALTER TABLE table_name SET FILEFORMAT org.apache.hudi.hadoop.HoodieInputFormat;
    2. For Realtime Tables: ALTER TABLE table_name SET FILEFORMAT org.apache.hudi.hadoop.HoodieRealtimeInputFormat;
  5. For MOR tables, update the hoodie.properties file to rename the value of hoodie.compaction.payload.class from the “com.uber.hoodie” namespace to “org.apache.hudi”. We have a utility script that takes in a list of base-paths to be upgraded and performs the rename. See below for an example invocation:

    Code Block
    java -cp $HUDI_UTILITIES_BUNDLE:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.8.4.jar org.apache.hudi.utilities.adhoc.UpgradePayloadFromUberToApache --help
    Usage: <main class> [options]
      Options:
      * --datasets_list_path, -sp
           Local File containing list of base-paths for which migration needs to be performed
        --help, -h
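
    A sample end-to-end run might look like the following; the list-file path /tmp/hudi_datasets_to_migrate.txt is a hypothetical placeholder, and each line of that file is the base-path of one dataset to upgrade:

    Code Block
    # Hypothetical run: upgrade every dataset listed in the file (one base-path per line)
    echo "hdfs://namenode:8020/data/hudi/trips" > /tmp/hudi_datasets_to_migrate.txt
    java -cp $HUDI_UTILITIES_BUNDLE:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.8.4.jar \
      org.apache.hudi.utilities.adhoc.UpgradePayloadFromUberToApache \
      --datasets_list_path /tmp/hudi_datasets_to_migrate.txt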

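    The net effect of the utility on each dataset's hoodie.properties is a one-line namespace rename. Below is a before/after sketch assuming the default HoodieAvroPayload payload class; if your dataset is configured with a custom payload class, only the package prefix changes:

    Code Block
    # Before (pre-Apache namespace)
    hoodie.compaction.payload.class=com.uber.hoodie.common.model.HoodieAvroPayload
    # After (Apache namespace)
    hoodie.compaction.payload.class=org.apache.hudi.common.model.HoodieAvroPayload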

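For step 4, note that some Hive versions only accept the full INPUTFORMAT/OUTPUTFORMAT/SERDE form of SET FILEFORMAT. The sketch below uses a hypothetical table hudi_trips and assumes the stock Hive Parquet output format and serde; verify the exact DDL form against your Hive version:

Code Block
-- Read Optimized table: point Hive at the new Hudi input format
ALTER TABLE hudi_trips SET FILEFORMAT
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe';

-- Realtime table: same pattern with the realtime input format
ALTER TABLE hudi_trips_rt SET FILEFORMAT
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieRealtimeInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe';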

...