General 

When is Hudi useful for me or my organization?

<Answer WIP>

What are some non-goals for Hudi? 

<Answer WIP>

What is incremental processing? Why do Hudi docs/talks keep talking about it?

<Answer WIP>

What is the difference between COW (copy on write) and MOR (merge on read) storage types?

Copy On Write - This table type lets clients ingest data in a columnar file format, currently parquet. Any new data written to a Hudi dataset using the COW table type is written to new parquet files. Updating an existing set of rows rewrites every parquet file that contains any of the affected rows. Hence, all writes to such datasets are limited by parquet writing performance: the larger the parquet file, the longer it takes to ingest the data.

Merge On Read - This table type lets clients ingest data quickly into a row-based data format such as avro. Any new data written to a Hudi dataset using the MOR table type is written to new log/delta files that internally store the data as avro-encoded bytes. A compaction process (configured as inline or asynchronous) converts the log file format to a columnar file format (parquet). Two different InputFormats expose two different views of this data: HoodieInputFormat exposes columnar parquet reading performance, while HoodieRealTimeInputFormat exposes columnar and/or log reading performance. Updating an existing set of rows results in either a) a companion log/delta file for an existing base parquet file generated by a previous compaction, or b) an update written to a log/delta file if no compaction has ever happened for it. Hence, all writes to such datasets are limited by avro/log file writing performance, which is much faster than parquet writing. However, reading log/delta files costs more than reading columnar (parquet) files.

More details can be found here.
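The write and read trade-off described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not Hudi code: the class names `CopyOnWriteFile` and `MergeOnReadFile`, the `rows_written` counter, and the method names are all hypothetical, and a Python dict stands in for a parquet base file while a list of dicts stands in for avro log blocks.

```python
from dataclasses import dataclass, field


@dataclass
class CopyOnWriteFile:
    """Toy stand-in for one columnar base file (record key -> row)."""
    rows: dict = field(default_factory=dict)
    rows_written: int = 0  # rows physically written so far (a rough write-cost proxy)

    def upsert(self, updates: dict) -> None:
        # COW: merge the updates, then rewrite the entire file.
        merged = {**self.rows, **updates}
        self.rows = merged
        self.rows_written += len(merged)

    def read(self) -> dict:
        return dict(self.rows)


@dataclass
class MergeOnReadFile:
    """Toy stand-in for a base file plus its companion log/delta blocks."""
    base: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    rows_written: int = 0

    def upsert(self, updates: dict) -> None:
        # MOR: only append the changed rows to the log; the base is untouched.
        self.log.append(dict(updates))
        self.rows_written += len(updates)

    def read_optimized(self) -> dict:
        # Columnar-only view: sees data only up to the last compaction.
        return dict(self.base)

    def read_realtime(self) -> dict:
        # Merged view: base rows overlaid with log blocks, oldest first.
        merged = dict(self.base)
        for block in self.log:
            merged.update(block)
        return merged

    def compact(self) -> None:
        # Compaction: fold the log into a new base file and clear the log.
        self.base = self.read_realtime()
        self.log.clear()
        self.rows_written += len(self.base)


if __name__ == "__main__":
    cow, mor = CopyOnWriteFile(), MergeOnReadFile()
    for batch in ({1: "a", 2: "b", 3: "c"}, {2: "B"}):
        cow.upsert(batch)
        mor.upsert(batch)
    print("COW rows written:", cow.rows_written)
    print("MOR rows written before compaction:", mor.rows_written)
```

In this model a single-row update costs the COW file a full rewrite, while the MOR file writes only the changed row and defers the rewrite to `compact()`. The price is that `read_realtime()` has to merge log blocks on every read until compaction runs.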

How do I choose a storage type for my workload?


Find more details on the trade-offs between COW and MOR storage types here.

Is Hudi an analytical database? 

<Answer WIP>

How do I model the data stored in Hudi? 

<Answer WIP>

Does Hudi support cloud storage/object stores?

<Answer WIP>

What versions of Hive/Spark/Hadoop are supported by Hudi?

<Answer WIP>

Using Hudi

What are some ways to write a Hudi dataset? 

<Answer WIP>

How can I now query the Hudi dataset I just wrote?

<Answer WIP>

How does Hudi handle duplicate record keys in an input? 

<Answer WIP>

Can I implement my own logic for how input records are merged with records on storage?

<Answer WIP>

How is a Hudi job deployed? 

<Answer WIP>

What are the different ways of running compaction for a MOR dataset?

<Answer WIP>

How do I migrate my data to Hudi?

<Answer WIP>


How can I pass Hudi configurations to my Spark job?

<Answer WIP> https://lists.apache.org/thread.html/ce35776e8899d620c095354c23933eb000eec63eaa70acfe60ae0b0c@<dev.hudi.apache.org


Performance 

What performance can I expect for Hudi writing?

<Answer WIP>

What ingest latency can I expect out of Hudi? 

<Answer WIP>

What performance can I expect for Hudi reading/queries? 

<Answer WIP>

How do I improve the Hudi writing performance?

<Answer WIP>

How do I avoid creating tons of small files?

HoodieWriteConfig exposes knobs for controlling how files are sized on write.
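As a rough illustration, the sketch below builds a set of file-sizing options as plain key/value pairs. It assumes the option keys `hoodie.parquet.small.file.limit` and `hoodie.parquet.max.file.size`, which this page does not name, so check them against the configuration reference for your Hudi version. The helper `file_sizing_options` is hypothetical and not part of Hudi.

```python
MB = 1024 * 1024


def file_sizing_options(small_file_limit_mb: int, max_file_size_mb: int) -> dict:
    """Build file-sizing options as string key/value pairs.

    Assumed keys (verify against your Hudi version):
      hoodie.parquet.small.file.limit - files below this size count as small
                                        and are candidates to receive new inserts
      hoodie.parquet.max.file.size    - target upper bound for a parquet file
    """
    if small_file_limit_mb > max_file_size_mb:
        raise ValueError("small file limit should not exceed the max file size")
    return {
        "hoodie.parquet.small.file.limit": str(small_file_limit_mb * MB),
        "hoodie.parquet.max.file.size": str(max_file_size_mb * MB),
    }


if __name__ == "__main__":
    # The resulting dict can be handed to whichever writer is in use,
    # for example as extra options on a Spark DataFrame write.
    for key, value in file_sizing_options(100, 120).items():
        print(f"{key}={value}")
```

Keeping the small-file limit below the maximum file size leaves room for undersized files to grow toward the target instead of new files being created on every write.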

DataSource Spark API users

HoodieDeltaStreamer users

HoodieWriteClient users


