...

In this page hierarchy, we explain the concepts, design and the overall architectural underpinnings of Apache Hudi. This content is intended to be the technical documentation of the project and will be kept up to date with

Info
title: def: "def~" annotations

In an effort to keep this page crisp for reading, any concepts that we need to explain are annotated with a def~ and hyperlinked off. You can contribute immensely to our docs by writing the missing pages for annotated terms. These are marked in brown. Please mention any PMC/Committers on these pages for review.

...

Together these primitives unlock stream/incremental processing capabilities directly on top of def~DFS-abstractions. If you are familiar with def~stream-processing, this is very similar to consuming events from a def~kafka-topic and then using a def~state-store to accumulate intermediate results incrementally.
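To make the analogy concrete, below is a minimal Spark (Scala) sketch of pulling only the records committed to a Hudi dataset after a known instant time, in the same way a Kafka consumer resumes from a stored offset. This is a sketch, not a definitive recipe: it assumes a Spark session with the Hudi Spark bundle on the classpath, and the dataset path, commit time and column names are hypothetical; the option keys are the standard Hudi DataSource incremental-query options.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-incremental-read")
  .getOrCreate()

// Hypothetical commit (instant) time recorded by a previous run; the next run
// resumes from it, much like a Kafka consumer resuming from a saved offset.
val beginInstant = "20230101000000"

val changed = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginInstant)
  .load("/data/hudi/trips") // hypothetical dataset path

// Aggregate over just the changed records, playing the role of the state store
// in the streaming analogy; results can then be upserted into a derived dataset.
changed.createOrReplaceTempView("changed_trips")
spark.sql("select driver, count(*) as trip_count from changed_trips group by driver").show()
```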

...

  • Increased Efficiency : Ingesting data often needs to deal with updates (resulting from def~database-change-capture), deletions (due to def~data-privacy-regulations) and enforcing def~unique-key-constraints (to ensure def~data-quality of event streams/analytics). However, without standardized support for such functionality from a system like Hudi, data engineers often resort to big batch jobs that reprocess an entire day's events or reload the entire upstream database every run, leading to massive waste of def~computational-resources. Since Hudi supports record-level updates, it brings an order of magnitude improvement to these operations, by only reprocessing changed records and rewriting only the part of the def~dataset that was updated/deleted, as opposed to rewriting entire def~dataset-partitions or even the entire def~dataset (see the upsert sketch after this list).
  • Faster ETL/Derived Pipelines : A ubiquitous next step, once the data has been ingested from external sources, is to build derived data pipelines using Apache Spark/Apache Hive or any other data processing framework to def~ETL the ingested data for a variety of use-cases like def~data-warehousing, def~machine-learning-feature-extraction, or even just def~analytics. Typically, such processes again rely on def~batch-processing jobs expressed in code or SQL that process all input data in bulk and recompute all the output results.
  • Access to fresh data : 
  • Unified Storage : 

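As a rough illustration of the record-level update path mentioned under "Increased Efficiency", the following Spark (Scala) sketch upserts a batch of changed records into a Hudi dataset, so that only the affected portions of the dataset are rewritten rather than whole partitions. It assumes the same Spark session as the earlier incremental-read sketch; the input path, table name and field names (trip_id, trip_date, updated_at) are hypothetical, while the option keys are standard Hudi write options.

```scala
import org.apache.spark.sql.SaveMode

// Batch of changed records, e.g. produced by database change capture (hypothetical input path).
val updates = spark.read.format("json").load("/data/cdc/trip_updates")

updates.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")       // key used to enforce uniqueness
  .option("hoodie.datasource.write.partitionpath.field", "trip_date") // dataset partition field
  .option("hoodie.datasource.write.precombine.field", "updated_at")   // keeps the latest version per key
  .option("hoodie.datasource.write.operation", "upsert")              // update existing records in place
  .mode(SaveMode.Append)
  .save("/data/hudi/trips")
```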
...