...

What are some ways to write a Hudi dataset? 

Typically, you obtain a set of partial updates/inserts from your source and issue write operations against a Hudi dataset. If you are ingesting data from any of the standard sources like Kafka, or tailing DFS, the delta streamer tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a Hudi datasource to write into Hudi, as sketched below.
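
For example, a minimal sketch of writing an upsert batch through the Hudi datasource could look like the following. Here inputDF, tableName and basePath are placeholders for your own DataFrame, table name and storage path, and the record key, partition path and precombine fields are assumed column names:

Code Block
import org.apache.spark.sql.SaveMode

inputDF.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.recordkey.field", "uuid")              // assumed key column
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath") // assumed partition column
  .option("hoodie.datasource.write.precombine.field", "ts")               // assumed ordering column for picking the latest record
  .option("hoodie.table.name", tableName)
  .mode(SaveMode.Append)
  .save(basePath)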

How is a Hudi job deployed? 

The nice thing about Hudi writing is that it runs just like any other Spark job would on a YARN/Mesos or even a K8s cluster. So you can simply use the Spark UI to get visibility into write operations.

How can I now query the Hudi dataset I just wrote?

Unless Hive sync is enabled, the dataset written by Hudi using one of the methods above can simply be queried via the Spark datasource like any other source.

Code Block
val hoodieROView = spark.read.format("org.apache.hudi").load(basePath + "/path/to/partitions/*")

If Hive sync is enabled in the deltastreamer tool or datasource, the dataset is available in Hive as a couple of tables that can now be read using HiveQL, Presto or SparkSQL. See here for more.
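
For instance, a minimal SparkSQL sketch against the synced table might look like this, where hudi_table is a placeholder for the actual table name registered by Hive sync and spark is assumed to be a Hive-enabled session:

Code Block
val hudiDF = spark.sql("SELECT * FROM hudi_table LIMIT 10")
hudiDF.show()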

How does Hudi handle duplicate record keys in an input? 

...

Can I implement my own logic for how input records are merged with records on storage?

<Answer WIP>
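
As a rough sketch of the general shape, assuming you have a class implementing Hudi's HoodieRecordPayload interface (com.example.CustomMergePayload is a hypothetical name), you can point the writer at it via the payload class option:

Code Block
inputDF.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.payload.class", "com.example.CustomMergePayload") // hypothetical custom payload implementation
  .option("hoodie.table.name", tableName)
  .mode(SaveMode.Append)
  .save(basePath)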

...

<Answer WIP>


What are different ways of running compaction for a MOR dataset?

...