...

  • Increased Efficiency: Ingesting data often needs to deal with updates (resulting from def~database-change-capture), deletions (due to def~data-privacy-regulations) and enforcing def~unique-key-constraints (to ensure def~data-quality of event streams/analytics). However, in the absence of standardized support for such functionality from a system like Hudi, data engineers often resort to big batch jobs that reprocess an entire day's events or reload the entire upstream database every run, leading to massive waste of def~computational-resources. Since Hudi supports record-level updates, it brings an order of magnitude improvement to these operations, by only reprocessing the changed records and rewriting only the part of the def~table that was updated/deleted, as opposed to rewriting entire def~dataset-partitions or even the entire def~table (see the upsert sketch after this list).
  • Faster ETL/Derived Pipelines: A ubiquitous next step, once the data has been ingested from external sources, is to build derived data pipelines using Apache Spark/Apache Hive or any other data processing framework to def~ETL the ingested data for a variety of use-cases like def~data-warehousing, def~machine-learning-feature-extraction, or even just def~analytics. Typically, such processes again rely on def~batch-processing jobs expressed in code or SQL that process all input data in bulk and recompute all the output results. Such data pipelines can be sped up dramatically by querying one or more input tables using an def~incremental-query instead of a regular def~snapshot-query, resulting once again in only processing the incremental changes from upstream tables and then applying def~upsert or delete operations to the target derived table, as above (see the incremental-query sketch after this list).
  • Access to fresh data: It's not every day that reduced resource usage also results in improved performance, since typically we add more resources (e.g. memory) to improve a performance metric (e.g. query latency). By fundamentally shifting away from how datasets have traditionally been managed, for perhaps the first time since the dawn of the big data era, Hudi realizes this rare combination. A sweet side-effect of incrementalizing def~batch-processing is that the pipelines also take a much smaller amount of time to run, putting data into the hands of organizations much more quickly than has been possible with def~data-lakes before.
  • Unified Storage: Building upon all three benefits above, faster and lighter processing right on top of existing def~data-lakes means less need for specialized storage or def~data-marts, simply for the purpose of gaining access to near real-time data.
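
To make the record-level update path above concrete, here is a minimal sketch of an upsert through the Hudi Spark datasource, in spark-shell style Scala (assuming a shell launched with the Hudi bundle on the classpath). The table name, storage paths and field names (trip_id, ts) are illustrative assumptions, not from the original text; the option keys follow the Hudi Spark datasource configuration and may vary across Hudi versions.

    import org.apache.spark.sql.SaveMode

    // Changed records captured from the upstream source (hypothetical path/schema).
    val updates = spark.read.json("s3://bucket/incoming/2020-01-01/")

    // Upsert only the changed records; Hudi's index routes each record to the
    // file that currently holds it, so only the affected files are rewritten.
    updates.write.format("hudi").
      option("hoodie.table.name", "trips").
      option("hoodie.datasource.write.recordkey.field", "trip_id").
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.datasource.write.operation", "upsert").
      mode(SaveMode.Append).
      save("s3://bucket/warehouse/trips")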
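Similarly, a derived pipeline can pull just the changes since its last run with an def~incremental-query. The begin instant time below is a placeholder, and these option names have changed across Hudi releases (older versions used hoodie.datasource.view.type), so treat this as a sketch under those assumptions.

    // Read only records committed after the given instant, instead of a full snapshot.
    val changes = spark.read.format("hudi").
      option("hoodie.datasource.query.type", "incremental").
      option("hoodie.datasource.read.begin.instanttime", "20200101000000").
      load("s3://bucket/warehouse/trips")

    // Downstream ETL can now recompute from `changes` alone and upsert its results.
    changes.createOrReplaceTempView("trips_changes")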

...

With an understanding of the key technical motivations for the project, let's now dive deeper into the design of the system itself. At a high level, components for writing Hudi tables are embedded into an Apache Spark job using one of the supported ways, and they produce a set of files on def~backing-dfs-storage that represents a Hudi def~table. Query engines like Apache Spark, Presto and Apache Hive can then query the table, with certain guarantees (that we will discuss below).
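
For instance, a regular def~snapshot-query from Spark can look like the following sketch; the path and column names are illustrative assumptions.

    // Snapshot query: read the latest committed state of the table.
    val trips = spark.read.format("hudi").load("s3://bucket/warehouse/trips")
    trips.createOrReplaceTempView("trips")
    spark.sql("select trip_id, ts from trips").show()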

There are three main components to a def~table (a conceptual sketch follows this list):

  1. Set of def~data-files that actually contain the records written to the table.
  2. An def~index (which could be implemented in many ways) that maps a given record to a subset of the data-files that contain the record.
  3. Ordered sequence of def~timeline-metadata about all the write operations done on the table, akin to a database transaction log.
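
To illustrate, the three components can be pictured roughly as below. This is a conceptual model grounded only in the description above, not Hudi's actual classes; all names and shapes are illustrative assumptions.

    // Conceptual sketch only; names and shapes are illustrative.
    case class DataFile(fileId: String, path: String)

    // One entry per write operation, ordered like a transaction log.
    case class Instant(timestamp: String, action: String)
    case class Timeline(instants: Vector[Instant])

    // Maps a record (by key) to the subset of data files that may contain it.
    trait Index {
      def lookup(recordKey: String): Seq[DataFile]
    }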

...

The implementation specifics of the two def~table-types are detailed below.
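
The table type itself is chosen when the table is first written. As a sketch (the option key follows the current Hudi Spark datasource; older releases used hoodie.datasource.write.storage.type, so verify against your version, and the paths/fields are again illustrative assumptions):

    import org.apache.spark.sql.SaveMode

    // Initial load of records into a new table (hypothetical path/schema).
    val inserts = spark.read.json("s3://bucket/incoming/first-load/")

    // Pick MERGE_ON_READ or COPY_ON_WRITE at table creation time.
    inserts.write.format("hudi").
      option("hoodie.table.name", "trips").
      option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
      option("hoodie.datasource.write.recordkey.field", "trip_id").
      option("hoodie.datasource.write.precombine.field", "ts").
      mode(SaveMode.Overwrite).
      save("s3://bucket/warehouse/trips")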


Copy On Write 


def~copy-on-write


Merge On Read

def~merge-on-read


...