...
> Info: In an effort to keep this page crisp for reading, any concepts that we need to explain are annotated with a def~ annotation and hyperlinked off. You can contribute immensely to our docs by writing the missing pages for annotated terms. These are marked in purple. Please mention any PMC/Committers on these pages for review, and mark the links blue.
Introduction
Apache Hudi (Hudi for short, from here on) allows you to store vast amounts of data on top of existing def~hadoop-compatible-storage, while providing two primitives that enable def~stream-processing on def~data-lakes, in addition to typical def~batch-processing.
Specifically,
- Update/Delete Records : Hudi provides support for updating/deleting records, using fine-grained file/record-level indexes, while providing transactional guarantees for the write operation. Queries process the last such committed snapshot to produce results (see the sketch after this list).
- Change Streams : Hudi also provides first-class support for obtaining an incremental stream of all the records that were updated/inserted/deleted in a given dataset, from a given point-in-time, and unlocks a new category of incremental queries.
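As a minimal sketch of these two primitives, assuming a Spark shell with the Hudi bundle on the classpath, a hypothetical trips table stored at `basePath` and keyed by a `uuid` field with an ordering field `ts`, and an input DataFrame `updatesDf` of changed records (all of these names are illustrative, not part of this page):

```scala
import org.apache.spark.sql.SaveMode

// Illustrative only: `spark` is the usual SparkSession (e.g. from spark-shell),
// `updatesDf` holds a batch of changed records for the hypothetical trips table.
val basePath = "file:///tmp/hudi_trips"

// Primitive 1 - Update/Delete Records: upsert a batch of changed records.
// Hudi uses its index to locate the affected file groups and rewrites only those.
updatesDf.write.format("hudi").
  option("hoodie.table.name", "hudi_trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  mode(SaveMode.Append).
  save(basePath)

// Primitive 2 - Change Streams: incrementally pull only the records committed
// after a given instant time on the table's timeline.
val beginTime = "20240101000000" // an earlier commit instant
val changes = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", beginTime).
  load(basePath)
changes.createOrReplaceTempView("hudi_trips_changes")
```

The upsert rewrites only the file groups containing the given keys, and the incremental read returns only records committed after `beginTime`, which is what powers the incremental queries described above.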
Together, these primitives unlock stream/incremental processing capabilities on def~DFS-abstractions, which has several advantages.
...
directly on top of def~DFS-abstractions. If you are familiar with def~stream-processing, this is very similar to consuming events from a def~kafka-topic and then using def~state-stores to accumulate intermediate results incrementally.
It has several architectural advantages.
- Increased Efficiency : Ingesting data often needs to deal with updates (resulting from def~database-change-capture), deletions (due to def~data-privacy-regulations) and enforcing def~unique-key-constraints (to ensure def~data-quality of event streams/analytics). However, without a system like Hudi providing standardized support for such functionality, data engineers often resort to big batch jobs that reprocess an entire day's events or reload the entire upstream database every run, leading to massive waste of def~computational-resources. Since Hudi supports record-level updates, it brings an order of magnitude improvement to these operations, by only reprocessing changed records and rewriting only the part of the def~dataset that was updated/deleted, as opposed to rewriting entire def~dataset-partitions or even the entire def~dataset.
- Faster ETL/Derived Pipelines : A ubiquitous next step, once the data has been ingested from external sources, is to build derived data pipelines using Apache Spark/Apache Hive or any other data processing framework to def~ETL the ingested data for a variety of use-cases like def~data-warehousing, def~machine-learning-feature-extraction, or even just def~analytics. Typically, such processes again rely on def~batch-processing jobs expressed in code or SQL that process all input data in bulk and recompute all the output results (a sketch of an incremental alternative follows this list).
- Access to fresh data :
- Unified Storage :
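To make the efficiency and derived-pipeline points above concrete, here is a hedged sketch of one incremental ETL step, reusing the illustrative table and fields from the earlier sketch; `lastProcessedInstant`, `upstreamBasePath`, `derivedBasePath`, the `fare`/`driver_id` columns and the `trip_fares_usd` table are all hypothetical, not prescribed by Hudi:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// `lastProcessedInstant` is the commit time up to which this pipeline has
// already consumed the upstream table; it would be persisted between runs,
// playing the role of a consumer offset / state checkpoint.
val changes = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", lastProcessedInstant).
  load(upstreamBasePath)

// Transform only the changed records (a hypothetical enrichment), instead of
// re-reading and recomputing the entire upstream dataset.
val derived = changes.
  withColumn("fare_usd", col("fare") * lit(0.85)).
  select("uuid", "partitionpath", "ts", "driver_id", "fare_usd")

// Upsert the recomputed rows into the downstream/derived Hudi table, so the
// derived dataset stays current with only incremental work per run.
derived.write.format("hudi").
  option("hoodie.table.name", "trip_fares_usd").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  mode(SaveMode.Append).
  save(derivedBasePath)
```

Each run only pulls and rewrites what changed since the previous run, instead of recomputing all output results from the full input.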
Additionally,
...
- Unified, Optimized analytical storage
- GDPR, Data deletions, Compliance (see the sketch after this list).
- Building block for great data lakes!
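As one hedged example of the deletions/compliance point above, again reusing the illustrative `hudi_trips` table and field names from the earlier sketches, a record-level hard delete for specific keys could look like this:

```scala
import org.apache.spark.sql.SaveMode

// Look up the records that must be erased (e.g. for a GDPR request),
// keeping only the record key, partition path and ordering field.
val toDelete = spark.read.format("hudi").load(basePath).
  where("uuid in ('key-to-erase-1', 'key-to-erase-2')").
  select("uuid", "partitionpath", "ts")

// Issue a record-level hard delete; only the file groups holding these
// keys are rewritten, not the whole dataset or its partitions.
toDelete.write.format("hudi").
  option("hoodie.table.name", "hudi_trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "delete").
  mode(SaveMode.Append).
  save(basePath)
```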
System Overview
<WIP>
...
- Optimized storage/sizing
Concepts
Timeline
Storage/Writing
...