...

In this page hierarchy, we explain the concepts, design and the overall architectural underpinnings of Apache Hudi. This content is intended to be the technical documentation of the project and will be kept up to date with

Info
title: def: "def~" annotations

In an effort to keep this page crisp for reading, any concepts that we need to explain are annotated with a def~ and hyperlinked off. You can contribute immensely to our docs by writing the missing pages for annotated terms. These are marked in brown. Please mention any PMC/Committers on these pages for review.

...

Together these primitives unlock stream/incremental processing capabilities directly on top of def~DFS-abstractions. If you are familiar with def~stream-processing, this is very similar to consuming events from a def~kafka-topic and then using a def~state-store to accumulate intermediate results incrementally.
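To make the analogy concrete, below is a minimal Spark (Scala) sketch of pulling only the records committed to a Hudi dataset after a known instant time, in the same way a Kafka consumer resumes from a stored offset. This is a sketch, not a definitive recipe: it assumes a Spark session with the Hudi Spark bundle on the classpath, and the dataset path, commit time and column names are hypothetical; the option keys are the standard Hudi DataSource incremental-query options.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-incremental-read")
  .getOrCreate()

// Hypothetical commit (instant) time recorded by a previous run; the next run
// resumes from it, much like a Kafka consumer resuming from a saved offset.
val beginInstant = "20230101000000"

val changed = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginInstant)
  .load("/data/hudi/trips") // hypothetical dataset path

// Aggregate over just the changed records, playing the role of the state store
// in the streaming analogy; results can then be upserted into a derived dataset.
changed.createOrReplaceTempView("changed_trips")
spark.sql("select driver, count(*) as trip_count from changed_trips group by driver").show()
```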

...

  • Increased Efficiency : Ingesting data often needs to deal with updates (resulting from def~database-change-capture), deletions (due to def~data-privacy-regulations) and enforcing def~unique-key-constraints (to ensure def~data-quality of event streams/analytics). However, without standardized support for such functionality from a system like Hudi, data engineers often resort to big batch jobs that reprocess an entire day's events or reload the entire upstream database every run, leading to massive waste of def~computational-resources. Since Hudi supports record-level updates, it brings an order of magnitude improvement to these operations, by only reprocessing changed records and rewriting only the part of the def~dataset that was updated/deleted, as opposed to rewriting entire def~dataset-partitions or even the entire def~dataset (see the upsert sketch after this list).
  • Faster ETL/Derived Pipelines : A ubiquitous next step, once the data has been ingested from external sources, is to build derived data pipelines using Apache Spark/Apache Hive or any other data processing framework to def~ETL the ingested data for a variety of use-cases like def~data-warehousing, def~machine-learning-feature-extraction, or even just def~analytics. Typically, such processes again rely on def~batch-processing jobs expressed in code or SQL that process all input data in bulk and recompute all the output results.
  • Access to fresh data : 
  • Unified Storage : 

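As a rough illustration of the record-level update path mentioned under "Increased Efficiency", the following Spark (Scala) sketch upserts a batch of changed records into a Hudi dataset, so that only the affected portions of the dataset are rewritten rather than whole partitions. It assumes the same Spark session as the earlier incremental-read sketch; the input path, table name and field names (trip_id, trip_date, updated_at) are hypothetical, while the option keys are standard Hudi write options.

```scala
import org.apache.spark.sql.SaveMode

// Batch of changed records, e.g. produced by database change capture (hypothetical input path).
val updates = spark.read.format("json").load("/data/cdc/trip_updates")

updates.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")       // key used to enforce uniqueness
  .option("hoodie.datasource.write.partitionpath.field", "trip_date") // dataset partition field
  .option("hoodie.datasource.write.precombine.field", "updated_at")   // keeps the latest version per key
  .option("hoodie.datasource.write.operation", "upsert")              // update existing records in place
  .mode(SaveMode.Append)
  .save("/data/hudi/trips")
```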
...