
This wiki space hosts developer-facing content for Apache Hudi: technical documentation, how-to blogs, design documents (HIPs), and the project roadmap.

If you are looking for documentation on using Hudi, please visit the project site or engage with our community.

Technical documentation

How-to blogs

  1. How to manually register Hudi tables into Hive via Beeline? 
  2. Ingesting Database changes via Sqoop/Hudi
  3. De-Duping Kafka Events With Hudi DeltaStreamer

Design documents/HIPs

HIPs are the way to propose large changes to Hudi, and the Hudi Improvement Process details how to drive one from proposal to completion.

HIPs are listed below:

  1. HIP-1: CSV Source Support for Delta Streamer
  2. HIP-2: ORC Storage in Hudi
  3. HIP-3: Timeline Service with Incremental File System View Syncing
  4. HIP-4: Faster Hive incremental pull queries

Roadmap

This is a rough roadmap (non-exhaustive) of what's to come in each of these areas of Hudi.

Writing data & Indexing 

  • Improving indexing speed for time-ordered keys/small updates
    • Leverage Parquet record indexes
    • Serve bloom filters/ranges from the timeline server/consolidated metadata
    • Index the log files, moving closer to scalable 1-minute ingests
  • Improving indexing speed for UUID keys/large update spreads
    • Global/hash-based index for faster point-in-time lookups
  • Incrementalize & standardize all metadata operations, e.g. cleaning based on timeline metadata
  • Auto-tuning
    • Auto-tune bloom filter entries based on the number of records
    • Partitioning based on historical workload trends
    • Determination of compression ratio
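As a hedged sketch of the bloom-filter-based indexing idea above: to locate a record key, the index only needs to open files whose bloom filter might contain that key, so serving the filters from a central timeline server (instead of reading each file's footer) cuts per-lookup I/O. This is a toy illustration, not Hudi's actual index implementation; all names here are invented.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative bloom filter; not Hudi's implementation."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # integer used as a bitset

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # No false negatives: if the key was added, this is always True.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

# One bloom filter per data file, as a timeline server might cache them.
file_filters = {}
for file_id, keys in {"file-1": ["uuid-a", "uuid-b"], "file-2": ["uuid-c"]}.items():
    bf = BloomFilter()
    for k in keys:
        bf.add(k)
    file_filters[file_id] = bf

def candidate_files(key):
    """Only these files need to be opened to locate `key`."""
    return [f for f, bf in file_filters.items() if bf.might_contain(key)]
```

Because bloom filters admit false positives but never false negatives, every file that actually holds the key is always in the candidate list; tuning the filter size to the record count (the auto-tuning item above) controls how many extra files get opened.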

Reading data

  • Incremental pull natively via the Spark DataSource API
  • Real-time view support on Presto
  • Hardening incremental pull via the real-time view
  • Real-time view performance/memory-footprint reduction
  • Support for streaming-style batch programs via Beam/Structured Streaming integration
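The incremental-pull idea above can be sketched independently of Spark: each record carries the commit instant that last wrote it (Hudi stores this in the `_hoodie_commit_time` field), and an incremental reader returns only records committed after its last checkpoint. The field name is real; the rest is a minimal illustration, not the datasource implementation.

```python
# Records tagged with the commit instant that last wrote them.
records = [
    {"key": "a", "value": 1, "_hoodie_commit_time": "20190101000000"},
    {"key": "b", "value": 2, "_hoodie_commit_time": "20190102000000"},
    {"key": "a", "value": 3, "_hoodie_commit_time": "20190103000000"},
]

def incremental_pull(records, begin_instant):
    """Return only records written strictly after `begin_instant`."""
    return [r for r in records if r["_hoodie_commit_time"] > begin_instant]

# A downstream job that checkpointed at the first commit pulls only the delta.
changed = incremental_pull(records, "20190101000000")
```

Instant times are lexicographically ordered timestamps, so a plain string comparison against the checkpoint suffices to isolate the delta.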

Storage 

  • ORC Support
  • Support for collapsing and splitting file groups 
  • Custom strategies for data clustering
  • Columnar stats collection to power better query planning
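The columnar-stats item above is about file pruning: if per-file min/max statistics for a column are collected centrally, a query planner can skip every file whose value range cannot match the predicate. A minimal sketch under assumed names (the stats layout and function names are illustrative, not a Hudi API):

```python
# Per-file column statistics for a timestamp column, as a planner
# might consult them without opening any data files.
file_stats = {
    "file-1.parquet": {"ts_min": 100, "ts_max": 200},
    "file-2.parquet": {"ts_min": 201, "ts_max": 300},
    "file-3.parquet": {"ts_min": 301, "ts_max": 400},
}

def prune(stats, lo, hi):
    """Keep only files whose [ts_min, ts_max] range overlaps [lo, hi]."""
    return [f for f, s in stats.items()
            if s["ts_max"] >= lo and s["ts_min"] <= hi]

# A query for ts BETWEEN 250 AND 320 never needs to read file-1.
matching = prune(file_stats, 250, 320)
```

The same overlap test generalizes to any column with ordered values; the planning win comes from answering it from consolidated metadata rather than from each file's footer.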

Usability 

  • Painless migration of historical data, with safe experimentation
  • Hudi on Flink
  • Hudi for ML/Feature stores

Metadata Management

  • Standalone timeline server to handle DFS listings and timeline requests
  • Consolidated filesystem metadata for query planning
    • The Hudi timeline is a log; if we compact it, we get a snapshot of the table
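The "timeline is a log" observation can be sketched as follows: replaying the log of commit actions yields the current view of the table's files, and compacting that replayed state into a single snapshot lets queries plan without listing DFS or re-reading the whole timeline. Action names and layout here are invented for illustration.

```python
# A log of timeline actions: each commit adds files and may replace old ones.
timeline = [
    {"instant": "001", "added": ["f1_v1"], "removed": []},
    {"instant": "002", "added": ["f2_v1"], "removed": []},
    {"instant": "003", "added": ["f1_v2"], "removed": ["f1_v1"]},  # rewrite of f1
]

def replay(timeline):
    """Fold the log of actions into a snapshot of currently live files."""
    files = set()
    for action in timeline:
        files -= set(action["removed"])
        files |= set(action["added"])
    return files

snapshot = replay(timeline)  # the "compacted" view of the timeline
```

Once the snapshot is materialized, new instants can be applied to it incrementally, which is the same incremental-sync idea behind the timeline service above.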