...

This is a rough, non-exhaustive roadmap of what's to come in each of these areas for Hudi.

Writing data & Indexing 

  • Improving indexing speed for time-ordered keys/small updates
    • Leverage Parquet record indexes
    • Serve bloom filters/ranges from the timeline server; consolidate metadata
    • Support for indexing Parquet records to improve speed
    • Index the log files, moving closer to scalable 1-minute ingests
  • Overhaul of 
  • Improving indexing speed for UUID keys/large update spreads
    • Global/hash-based index for faster point-in-time lookups
  • Incrementalize & standardize all metadata operations, e.g., incremental cleaning based on timeline metadata
  • Auto-tuning
    • Auto-tune bloom filter entries based on the number of records
    • Partitioning based on historical workload trends
    • Determination of compression ratios
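Auto-tuning bloom filter entries amounts to sizing the filter from the expected record count per file and a target false-positive rate, rather than using a fixed default. A minimal sketch of that calculation, using the standard bloom-filter sizing formulas (the class and method names here are illustrative, not Hudi's actual API):

```java
// Hypothetical sketch of bloom-filter auto-tuning: given the expected number
// of record keys in a file and a target false-positive probability, compute
// the optimal filter size (in bits) and the optimal number of hash functions.
public class BloomFilterSizing {

    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    static long optimalNumBits(long numEntries, double fpp) {
        return (long) Math.ceil(-numEntries * Math.log(fpp)
                / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2
    static int optimalNumHashes(long numEntries, long numBits) {
        return Math.max(1, (int) Math.round(
                (double) numBits / numEntries * Math.log(2)));
    }

    public static void main(String[] args) {
        long entries = 60_000;   // records expected in one file (assumed figure)
        double fpp = 0.000001;   // target false-positive rate (assumed figure)
        long bits = optimalNumBits(entries, fpp);
        int hashes = optimalNumHashes(entries, bits);
        System.out.println(bits + " bits, " + hashes + " hash functions");
    }
}
```

The point of tuning is that a filter sized for a default record count wastes space on small files and produces excess false positives (and therefore wasted file reads during index lookup) on large ones.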

Reading data

  • Incremental Pull natively via Spark Datasource
  • Real-time view support on Presto
  • Hardening incremental pull via the real-time view
  • Real-time view performance/memory footprint reduction
  • Support for Streaming style batch programs via Beam/Structured Streaming integration
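The core of incremental pull is selecting only the commits on the timeline that are newer than the consumer's last-seen instant, so downstream jobs process just the changed records. A minimal sketch of that selection logic, independent of any query engine (the `Commit` type and method names are illustrative, not Hudi's actual API):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of incremental-pull semantics: given a table's commit
// timeline, return only the commits after a consumer's checkpoint instant.
public class IncrementalPull {

    // A commit on the timeline: an instant time plus the files it touched.
    record Commit(String instantTime, List<String> touchedFiles) {}

    // Instant times are timestamp strings (e.g. "20240101120000"), so
    // lexicographic comparison matches chronological order.
    static List<Commit> commitsSince(List<Commit> timeline, String beginInstant) {
        return timeline.stream()
                .filter(c -> c.instantTime().compareTo(beginInstant) > 0)
                .collect(Collectors.toList());
    }
}
```

Exposing this natively through the Spark Datasource would let a job pass its checkpoint instant as a read option instead of hand-filtering commit metadata.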

...

  • Standalone timeline server to handle DFS listings and timeline requests
  • Consolidated filesystem metadata for query planning 
    • The Hudi timeline is a log; compacting it yields a snapshot of the table's metadata
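The log-to-snapshot idea above can be sketched as a simple replay: fold the ordered file-level actions recorded by each commit into the table's current file listing, which is exactly what consolidated metadata would store for query planning. The action/field names below are illustrative, not Hudi's actual metadata format:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of "timeline as a log": replaying commit actions (file adds/removes)
// in order produces a snapshot of the files currently part of the table,
// avoiding a recursive DFS listing at query-planning time.
public class TimelineSnapshot {

    // One file-level action recorded on the timeline; type is "add" or "remove".
    record Action(String type, String file) {}

    static Set<String> compact(List<Action> timeline) {
        Set<String> files = new LinkedHashSet<>();
        for (Action a : timeline) {
            if (a.type().equals("add")) {
                files.add(a.file());
            } else {
                files.remove(a.file());
            }
        }
        return files; // the snapshot: files visible in the latest table state
    }
}
```

Because the snapshot is just a compaction of the log, it can be rebuilt (or incrementally advanced) from any point on the timeline.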