...

Unless Hive sync is enabled, the dataset written by Hudi using one of the methods above can simply be queried via the Spark datasource like any other source.

Code Block
// Read optimized view of the Hudi dataset
val hoodieROView = spark.read.format("org.apache.hudi").load(basePath + "/path/to/partitions/*")
// Incremental view, pulling records written after <beginInstantTime>
val hoodieIncViewDF = spark.read.format("org.apache.hudi")
     .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, <beginInstantTime>)
     .load(basePath)

...

Hudi provides built-in support for rewriting your entire dataset into Hudi, one time, using the HDFSParquetImporter tool available from the hudi-cli. You could also do this via a simple read and write of the dataset using the Spark datasource APIs, as sketched below. Once migrated, writes can be performed using normal means discussed here. This topic is discussed in detail here, including ways to do partial migrations.
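For illustration, here is a minimal sketch of the read-and-write approach via the Spark datasource. The paths, table name and field names (_row_key, partition, timestamp) are placeholder assumptions you would replace with your own:

Code Block
// Read the existing (non-Hudi) parquet dataset and bulk insert it into a new Hudi dataset.
// sourcePath, basePath, the table name and the field names below are illustrative placeholders.
val inputDF = spark.read.format("parquet").load(sourcePath)
inputDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.recordkey.field", "_row_key")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .option("hoodie.datasource.write.precombine.field", "timestamp")
  .mode(org.apache.spark.sql.SaveMode.Overwrite)
  .save(basePath)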

How can I pass Hudi configurations to my Spark job?

Hudi configuration options covering the datasource and the low-level Hudi write client (which both the deltastreamer & datasource internally call) are here. Invoking --help on any tool such as DeltaStreamer would print all the usage options. A lot of the options that control upsert and file sizing behavior are defined at the write client level, and below is how to pass them for each of the ways of writing data.

1. For Spark DataSource, you can use the "options" API of DataFrameWriter to pass in these configs. 

Code Block
inputDF.write.format("org.apache.hudi")
  .options(clientOpts) // any of the Hudi client opts can be passed in as well
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
  ...


 2. When using HoodieWriteClient directly, you can simply construct a HoodieWriteConfig object with the configs linked above and pass it to the client (see the sketch below).
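A minimal, illustrative sketch of this is below; the table name, schema string and parallelism values are placeholder assumptions, and exact package locations can vary across Hudi versions:

Code Block
import org.apache.hudi.HoodieWriteClient
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.api.java.JavaSparkContext

// Build a write config carrying any of the write client level configs.
// basePath, the table name, avroSchemaString and the parallelism values are illustrative placeholders.
val writeConfig = HoodieWriteConfig.newBuilder()
  .withPath(basePath)
  .forTable("my_hudi_table")
  .withSchema(avroSchemaString)
  .withParallelism(2, 2)
  .build()

// Hand the config to the write client, which then performs inserts/upserts directly.
val writeClient = new HoodieWriteClient(new JavaSparkContext(spark.sparkContext), writeConfig)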

 3. When using the HoodieDeltaStreamer tool to ingest, you can set the configs in a properties file and pass the file as the command-line argument "--props" (see the sketch below). See also this mailing list thread: https://lists.apache.org/thread.html/ce35776e8899d620c095354c23933eb000eec63eaa70acfe60ae0b0c@%3Cdev.hudi.apache.org%3E
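For illustration, a minimal properties file sketch; the field names and parallelism values are placeholder assumptions you would replace with your own:

Code Block
# hudi.properties (illustrative) -- pass it to DeltaStreamer via --props <path-to-this-file>
hoodie.datasource.write.recordkey.field=_row_key
hoodie.datasource.write.partitionpath.field=partition
hoodie.upsert.shuffle.parallelism=2
hoodie.insert.shuffle.parallelism=2
hoodie.bulkinsert.shuffle.parallelism=2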

Can I register my Hudi dataset with Apache Hive metastore?

...

How does the Hudi indexing work & what are its benefits?  

The indexing component is a key part of the Hudi write path, and it consistently maps a given recordKey to a fileGroup inside Hudi. This enables faster identification of the file groups that are affected/dirtied by a given write operation.

Hudi supports a few options for indexing as below 

  • HoodieBloomIndex (default) : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
  • HoodieGlobalBloomIndex : The default indexing only enforces uniqueness of a key inside a single partition, i.e. the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well even for very large datasets. However, in some cases it is necessary to de-dupe/enforce uniqueness across all partitions instead, and the global bloom index does exactly that. If this is used, incoming records are compared to files across the entire dataset, ensuring a recordKey is present in only one partition.
  • HBaseIndex : Apache HBase is a key-value store, typically found in close proximity to HDFS. You can also store the index inside HBase, which could be handy if you are already operating HBase.


You can implement your own index if you'd like, by subclassing the HoodieIndex class and configuring the index class name in configs (see the sketch below).
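For illustration, a sketch of selecting the index through write configs when using the Spark datasource; the GLOBAL_BLOOM value and the custom class name shown are assumptions for the example, not defaults:

Code Block
// Switch the index type on a regular datasource write (BLOOM is the default).
inputDF.write.format("org.apache.hudi")
  .option("hoodie.index.type", "GLOBAL_BLOOM") // or BLOOM, HBASE
  // To plug in your own HoodieIndex subclass instead, point hoodie.index.class at it:
  // .option("hoodie.index.class", "com.example.MyCustomIndex") // hypothetical class name
  ...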

What does the Hudi cleaner do? 

The Hudi cleaner process often runs right after a commit or deltacommit, and goes about deleting old files that are no longer needed. If you are using the incremental pull feature, then ensure you configure the cleaner to retain a sufficient number of the last commits to rewind to (see the sketch below).
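For illustration, a sketch of the relevant cleaner configs, passed like any other write config; the retention count is a placeholder assumption you would size to your incremental pull lag:

Code Block
// Keep enough of the latest commits around so incremental pulls can still rewind to them.
inputDF.write.format("org.apache.hudi")
  .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
  .option("hoodie.cleaner.commits.retained", "10") // placeholder value
  ...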

How can I restore my dataset to a known good point in time?

...