Contributing

To contribute content to this FAQ, see here.

General

When is Hudi useful for me or my organization?

If you are looking to quickly ingest data onto HDFS or cloud storage, Hudi provides tools to help. Also, if you have ETL/Hive/Spark jobs that are slow or consume a lot of resources, Hudi can potentially help by providing an incremental approach to reading and writing data.

For an organization, Hudi can help build an efficient data lake, solving some of the most complex, low-level storage management problems, while putting data into the hands of your data analysts, engineers and scientists much more quickly.

What are some non-goals for Hudi?

Hudi is not designed for OLTP use cases, where you would typically use existing NoSQL/RDBMS data stores. Hudi cannot replace your in-memory analytical database (at least not yet!). Hudi supports near-real-time ingestion on the order of a few minutes, trading off latency for efficient batching. If you truly need sub-minute processing delays, then stick with your favorite stream processing solution.

What is incremental processing? Why do Hudi docs/talks keep talking about it?

Incremental processing was first introduced by Vinoth Chandar, in the O'Reilly blog post that set off most of this effort. In purely technical terms, incremental processing merely refers to writing mini-batch programs in a stream processing style. Typical batch jobs consume all input and recompute all output every few hours. Typical stream processing jobs consume some new input and recompute new output or changes to output, continuously or every few seconds. While recomputing all output in batch fashion can be simpler, it is wasteful and resource-expensive. Hudi brings the ability to author the same batch pipelines in streaming fashion, run every few minutes.

While we could simply refer to this as stream processing, we call it incremental processing to distinguish it from pure stream processing solutions like Apache Flink, Apache Apex or Apache Kafka Streams.
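The contrast between recomputing all output and applying only new input can be sketched with a toy example. This is a minimal in-memory illustration, not Hudi code: the function names, the `(commit_time, key, amount)` record shape and the integer checkpoint are all assumptions made for the sketch.

```python
def batch_recompute(all_events):
    """Batch style: scan every input record and rebuild the whole output."""
    totals = {}
    for _commit_time, key, amount in all_events:
        totals[key] = totals.get(key, 0) + amount
    return totals


def incremental_update(totals, all_events, checkpoint):
    """Incremental style: apply only records committed after `checkpoint`.

    Returns the updated output and the new checkpoint. The filter below
    stands in for a source that can hand back just the new records.
    """
    new_records = [e for e in all_events if e[0] > checkpoint]
    updated = dict(totals)
    for commit_time, key, amount in new_records:
        updated[key] = updated.get(key, 0) + amount
        checkpoint = max(checkpoint, commit_time)
    return updated, checkpoint


# (commit_time, key, amount) tuples; all values are made up for illustration.
events = [(1, "a", 10), (1, "b", 5), (2, "a", 3)]
totals, checkpoint = incremental_update({}, events, checkpoint=0)

events.append((3, "b", 7))  # a new mini-batch arrives
totals, checkpoint = incremental_update(totals, events, checkpoint)

# Both styles agree on the result; the second incremental run applied one record.
assert totals == batch_recompute(events)
```

The incremental job keeps a checkpoint between runs, so each run does work proportional to the new input rather than to the full history.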

What is the difference between COW (copy on write) and MOR (merge on read) storage types?

Copy On Write - This table type enables clients to ingest data in columnar file formats, currently parquet. Any new data written to a Hudi dataset using the COW table type results in new parquet files. Updating an existing set of rows results in a rewrite of every parquet file that contains any of the affected rows. Hence, all writes to such datasets are limited by parquet writing performance: the larger the parquet file, the longer it takes to ingest the data.
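The write amplification described above can be illustrated with a toy model. This is a sketch only: plain dicts stand in for parquet files, and `cow_upsert`, the file names and the row values are hypothetical, not part of Hudi's API.

```python
def cow_upsert(files, updates):
    """Apply `updates` (key -> row) copy-on-write style.

    `files` maps a file name to the rows it holds (key -> row). Every file
    containing an updated key is rewritten in full; untouched files are
    carried over unchanged; keys found nowhere go into a new file.
    Returns the new file listing and the number of rows written.
    """
    new_files = {}
    rows_written = 0
    remaining = dict(updates)
    for name, rows in files.items():
        hits = {k: remaining.pop(k) for k in list(remaining) if k in rows}
        if hits:
            merged = {**rows, **hits}  # whole file is copied, with changes
            new_files[name] = merged
            rows_written += len(merged)
        else:
            new_files[name] = rows     # not touched, so not rewritten
    if remaining:                      # brand new keys land in a new file
        new_files[f"new-{len(files) + 1}"] = remaining
        rows_written += len(remaining)
    return new_files, rows_written


files = {"f1": {1: "a", 2: "b", 3: "c"}, "f2": {4: "d", 5: "e"}}
new_files, rows_written = cow_upsert(files, {2: "B", 6: "f"})

# One updated row forced all three rows of f1 to be written again,
# plus one row for the insert; f2 was left alone.
assert rows_written == 4
```

In this model the cost of an update grows with the size of the file that holds the row, which mirrors the point that larger parquet files take longer to ingest into.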

Merge On Read - This table type enables clients to ingest data quickly into a row-based data format such as avro. Any new data written to a Hudi dataset using the MOR table type results in new log/delta files that internally store the data as avro-encoded bytes. A compaction process (configured as inline or asynchronous) converts the log file format to a columnar file format (parquet). Two different InputFormats expose two different views of this data: HoodieInputFormat exposes columnar parquet reading performance, while HoodieRealTimeInputFormat exposes columnar and/or log reading performance. Updating an existing set of rows results in either a) a companion log/delta file for an existing base parquet file generated by a previous compaction, or b) an update written to a log/delta file, if no compaction has ever happened for it. Hence, all writes to such datasets are limited by avro/log file writing performance, which is much faster than parquet. However, reading log/delta files costs more than reading columnar (parquet) files.
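The relationship between the base file, the log, compaction and the two views can be sketched with a toy in-memory model. Everything here is an assumption for illustration: `MorFileGroup` and its methods are hypothetical names, a dict stands in for the base parquet file and a list for the avro log; none of this is Hudi's actual implementation.

```python
class MorFileGroup:
    """Toy merge-on-read unit: one base 'file' plus an append-only log."""

    def __init__(self, base_rows=None):
        self.base = dict(base_rows or {})  # stands in for the parquet base file
        self.log = []                      # stands in for the log/delta file

    def upsert(self, key, row):
        # Writes are cheap: append to the log, never rewrite the base.
        self.log.append((key, row))

    def read_optimized(self):
        # Columnar-only view: fast, but misses anything not yet compacted.
        return dict(self.base)

    def read_realtime(self):
        # Merged view: replay the log over the base at read time.
        merged = dict(self.base)
        for key, row in self.log:
            merged[key] = row
        return merged

    def compact(self):
        # Fold the log into a new base and start an empty log.
        self.base = self.read_realtime()
        self.log = []


group = MorFileGroup({1: "a"})
group.upsert(1, "A")   # update to an existing row
group.upsert(2, "b")   # new row

before = group.read_optimized()   # still the old base
merged = group.read_realtime()    # sees both log entries
group.compact()
after = group.read_optimized()    # now matches the merged view

assert before != merged and after == merged
```

The sketch shows the trade-off in the text: writes only append, while the merged read pays for replaying the log until compaction catches up.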

More details can be found here.

How do I choose a storage type for my workload?

Find more details on the trade-offs between COW and MOR storage types here.

Is Hudi an analytical database?

How do I model the data stored in Hudi?

Does Hudi support cloud storage/object stores?

What versions of Hive/Spark/Hadoop are supported by Hudi?

How does Hudi actually store data inside a dataset?

Using Hudi

What are some ways to write a Hudi dataset?

How can I now query the Hudi dataset I just wrote?

How does Hudi handle duplicate record keys in an input?

Can I implement my own logic for how input records are merged with record on storage?

How is a Hudi job deployed?

What are different ways of running compaction for a MOR dataset?

How do I migrate my data to Hudi?

How can I pass Hudi configurations to my Spark job?

<Answer WIP> https://lists.apache.org/thread.html/ce35776e8899d620c095354c23933eb000eec63eaa70acfe60ae0b0c@<dev.hudi.apache.org>

How do I delete records in the dataset using Hudi?

Can I register my Hudi dataset with Apache Hive metastore?

What does the Hudi cleaner do?

How can I restore my dataset to a known good point in time?

How can I partition data stored in a Hudi dataset?

What's Hudi's schema evolution story?

Performance

What performance can I expect for Hudi writing?

What ingest latency can I expect out of Hudi?

What performance can I expect for Hudi reading/queries?

How do I improve the Hudi writing performance?

How do I avoid creating tons of small files?

HoodieWriteConfig exposes knobs to allow for such flexibility.

DataSource Spark API users

HoodieDeltaStreamer users

HoodieWriteClient users

Contributing to FAQ

A good and usable FAQ should be community-driven and crowdsource questions/thoughts from everyone.

You can improve the FAQ through the following processes:

Comment on the text to point out inaccuracies and typos, and to leave suggestions.
Propose new questions with answers under the comments section at the bottom of the page.
Lean towards making it very understandable and simple, and link heavily to parts of the documentation as needed.
A committer on the project will review new questions and incorporate them upon review.
