...
A key goal of Hudi is to provide upsert
functionality that is orders of magnitude faster than rewriting entire tables or partitions.
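To make this concrete, here is a hedged sketch of issuing an upsert through the Spark datasource (the table name `my_table`, record key field `uuid`, and precombine field `ts` are hypothetical example values; the option names follow the Hudi 0.5.x-era `DataSourceWriteOptions`/`HoodieWriteConfig` API):

```scala
// Sketch: upserting a batch of records into a Hudi dataset via the
// Spark datasource. Only the file groups containing the affected record
// keys are rewritten, not the entire table or partition.
inputDF.write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "uuid")      // example key field
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")       // example precombine field
  .option(HoodieWriteConfig.TABLE_NAME, "my_table")                    // example table name
  .mode(SaveMode.Append)
  .save(basePath)
```

Because only the affected file groups are rewritten, an upsert touching a small fraction of keys costs far less than rewriting whole partitions.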
...
```scala
// Read-optimized view: load partitions directly as a DataFrame
val hoodieROView = spark.read.format("org.apache.hudi")
  .load(basePath + "/path/to/partitions/*")

// Incremental view: pull only records changed after a given instant time
val hoodieIncViewDF = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, <beginInstantTime>)
  .load(basePath)
```
Note: reading the realtime view natively through the Spark datasource is not currently supported; please use the Hive path below.
If Hive Sync is enabled in the deltastreamer tool or the datasource, the dataset is registered in Hive as a couple of tables, which can then be queried using HiveQL, Presto, or SparkSQL. See here for more.
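For example, once Hive Sync has registered the dataset, it can be queried like any other Hive table from SparkSQL (a sketch; the table name `hudi_trips` and its columns are hypothetical):

```scala
// Sketch: querying a Hive-synced Hudi dataset through SparkSQL.
// "hudi_trips" stands in for whatever table name Hive Sync registered.
spark.sql("SELECT symbol, max(ts) FROM hudi_trips GROUP BY symbol").show()
```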
...
How does Hudi handle duplicate record keys in an input?
When an `upsert` operation is issued on a dataset and the provided batch of records contains multiple entries for a given key, the entries are first reduced into a single record: by default, the payload class's `preCombine()` method is applied repeatedly, keeping the record with the largest value for the configured precombine field.
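The reduction step can be pictured with this simplified, self-contained sketch (not the actual Hudi implementation, which operates on `HoodieRecord`s via the configured payload class; `Record` and its fields are hypothetical):

```scala
// Simplified model of precombine: among entries sharing a record key,
// keep only the one with the largest ordering (precombine) value.
case class Record(key: String, orderingVal: Long, data: String)

def precombine(batch: Seq[Record]): Seq[Record] =
  batch.groupBy(_.key).values.map(_.maxBy(_.orderingVal)).toSeq
```

For instance, two input entries for key `"a"` with ordering values 1 and 2 collapse to the single entry carrying ordering value 2.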
Can I implement my own logic for how input records are merged with records on storage?
...