Roadmap

Under construction; early 2021 unveiling.

Below is a tentative roadmap for 2021 (in no particular order, since ordering is determined by the release management process).

Integrations 

  1. Spark SQL with MERGE/DELETE statement support (RFC-25: Spark SQL Extension For Hudi); see the sketch after this list
  2. Trino integration with support for querying/writing Hudi tables using SQL statements
  3. Kinesis/Pulsar integrations with DeltaStreamer
  4. Kafka Connect Sink for Hudi
  5. Dremio integration
  6. Interoperability with other table formats
  7. ORC Support
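
RFC-25 is still a proposal, but a minimal sketch of the kind of statements it would enable follows; the session extension class, table name (hudi_trips), and staging view (trip_updates) are illustrative, not settled API:

    import org.apache.spark.sql.SparkSession

    // Requires Hudi's Spark SQL extensions (per RFC-25) on the session.
    val spark = SparkSession.builder()
      .appName("hudi-merge-sketch")
      .config("spark.sql.extensions",
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .getOrCreate()

    // Upsert: update matched rows, insert the rest.
    spark.sql("""
      MERGE INTO hudi_trips AS t
      USING trip_updates AS s
      ON t.trip_id = s.trip_id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)

    // Delete rows matching a predicate.
    spark.sql("DELETE FROM hudi_trips WHERE trip_status = 'cancelled'")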

Writing data & Indexing

  • Indexing (a bloom-filter pruning sketch follows this section's list)
    • MetadataIndex implementation that serves bloom filters/key ranges from the metadata table, to speed up the bloom index on cloud storage
    • Addition of record-level indexes for fast CDC (RFC-08: Record level indexing mechanisms for Hudi datasets)
    • Range index to maintain column/field value ranges, to help file skipping for query performance
    • Addition of more auxiliary indexing structures: bitmaps, …

  • Improving indexing speed for time-ordered keys/small updates
    • Leverage Parquet record indexes
    • Serve bloom filters/ranges from the timeline server/consolidated metadata
    • Index the log file, moving closer to scalable 1-minute ingests
  • Improving indexing speed for UUID keys/large update spreads
    • Global/hash-based index for faster point-in-time lookups
  • Incrementalize & standardize all metadata operations, e.g., cleaning based on timeline metadata
  • Auto-tuning
    • Auto-tune bloom filter entries based on record counts
    • Partitioning based on historical workload trends
    • Determination of compression ratio
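
To see why serving bloom filters/key ranges from the metadata table speeds up the bloom index: an upsert only needs to check files whose key range contains the incoming key and whose bloom filter does not rule it out, so pruning happens before any file is opened. A self-contained sketch of that pruning step, using Guava's BloomFilter as a stand-in for Hudi's internal one (all names here are hypothetical):

    import com.google.common.hash.{BloomFilter, Funnels}
    import java.nio.charset.StandardCharsets

    // Per-file pruning metadata, as the metadata table would serve it.
    case class FileIndexEntry(fileName: String, minKey: String, maxKey: String,
                              bloom: BloomFilter[CharSequence])

    // A file can contain `key` only if the key falls inside its range AND the
    // bloom filter does not rule it out. Blooms can false-positive but never
    // false-negative, so survivors still need a real lookup in the file.
    def candidateFiles(files: Seq[FileIndexEntry], key: String): Seq[String] =
      files
        .filter(f => key >= f.minKey && key <= f.maxKey)
        .filter(_.bloom.mightContain(key))
        .map(_.fileName)

    def entryFor(name: String, keys: Seq[String]): FileIndexEntry = {
      val bloom = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 1000)
      keys.foreach(bloom.put(_))
      FileIndexEntry(name, keys.min, keys.max, bloom)
    }

    val files = Seq(entryFor("f1.parquet", Seq("a", "c")),
                    entryFor("f2.parquet", Seq("m", "z")))
    println(candidateFiles(files, "b")) // almost surely Seq("f1.parquet")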

Reading data

  • Concurrency Control (see the sketch after this list)
    • Addition of optimistic concurrency control, with pluggable locking services
    • Non-blocking clustering implementation w.r.t. updates
    • Multi-writer support with fully non-blocking, log-based concurrency control
    • Multi-table transactions
  • Performance
    • Integrate the row writer with all Hudi write operations
  • Self-Managing
    • Clustering based on historical workload trends
    • On-the-fly data locality during write time (HUDI-1628)
    • Auto-determination of compression ratio
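
A sketch of how optimistic concurrency control with a pluggable lock service typically works: do the heavy write work without any lock, then take a short lock only to validate that no overlapping commit landed in the meantime before publishing. Every name below is illustrative, not Hudi's actual interface:

    import java.util.concurrent.locks.ReentrantLock

    // Pluggable lock abstraction; real providers would be ZooKeeper,
    // Hive metastore, DynamoDB, and so on.
    trait LockProvider {
      def lock(): Unit
      def unlock(): Unit
    }

    class InProcessLock extends LockProvider {
      private val l = new ReentrantLock()
      def lock(): Unit = l.lock()
      def unlock(): Unit = l.unlock()
    }

    case class CommittedWrite(instant: Long, filesTouched: Set[String])

    class OptimisticWriter(locks: LockProvider) {
      @volatile private var committed = List.empty[CommittedWrite]

      // Phase 1 (no lock): write new file versions. Phase 2 (short lock):
      // re-check for conflicting commits since we started; publish or abort.
      def commit(startInstant: Long, filesTouched: Set[String]): Boolean = {
        locks.lock()
        try {
          val conflict = committed.exists(c =>
            c.instant > startInstant &&
              c.filesTouched.intersect(filesTouched).nonEmpty)
          if (conflict) false // caller retries against fresh table state
          else {
            committed ::= CommittedWrite(System.currentTimeMillis(), filesTouched)
            true
          }
        } finally locks.unlock()
      }
    }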

Querying

  • Performance
    • Complete integration with the metadata table
    • Real-time view performance/memory footprint reduction
  • PrestoDB
    • Incremental query support on Presto
    • Real-time view support on Presto
  • Hive
    • Storage handler to leverage the metadata table for partition pruning
  • Incremental pull natively via the Spark DataSource (see the sketch after this list)
  • Spark SQL
    • Hardening incremental pull via the real-time view
    • Support for streaming-style batch programs via Beam/Structured Streaming integration
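
Incremental pull already has a shape through the Spark DataSource; the roadmap item is about making it native and first-class. A sketch of the read side, with option keys as documented around this time and a hypothetical table path (verify both against the release in use):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hudi-incremental-sketch").getOrCreate()

    // Find recent commit instants from Hudi's meta column (assumes >= 2 commits).
    val commits = spark.read.format("hudi")
      .load("s3://bucket/hudi_trips")
      .select("_hoodie_commit_time").distinct()
      .orderBy("_hoodie_commit_time")
      .collect()
    val beginTime = commits.takeRight(2).head.getString(0)

    // Pull only records written after that instant.
    val incremental = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", beginTime)
      .load("s3://bucket/hudi_trips")
    incremental.show()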

Storage

  • ORC Support
  • Support for collapsing and splitting file groups
  • Custom strategies for data clustering (see the sketch after this list)
  • Columnar stats collection to power better query planning
  • Object storage
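
For the clustering item, a sketch of what enabling inline clustering on a write path looks like, so small files get periodically rewritten sorted by chosen columns for better skipping. Config keys follow the clustering work (RFC-19) as of this writing, and the table/field names are hypothetical; verify both against the release in use:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def writeWithClustering(df: DataFrame, path: String): Unit =
      df.write.format("hudi")
        .option("hoodie.table.name", "hudi_trips")
        .option("hoodie.datasource.write.recordkey.field", "trip_id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.clustering.inline", "true")
        .option("hoodie.clustering.inline.max.commits", "4") // cluster every 4 commits
        .option("hoodie.clustering.plan.strategy.sort.columns", "city,ts")
        .mode(SaveMode.Append)
        .save(path)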

Usability

  • Spark DataSource redesign around the metadata table
  • Streaming ETL via Structured Streaming
  • Flink
    • Support for end-to-end streaming ETL pipelines
    • Materialized view support via Flink/Calcite SQL
  • Mutable, columnar cache service
    • File-group-level caching to enable real-time analytics (backed by Arrow/AresDB)
  • Painless migration of historical data, with safe experimentation
  • Hudi on Flink
  • Hudi for ML/feature stores

Metadata Management

  • Standalone timeline server to handle:
    • Serving interactive query planning: schema, DFS listings, statistics, and timeline requests
    • Consolidated filesystem metadata for query planning
    • High availability/sharding
    • Pluggable backing stores, including RocksDB, Dynamo, and Spanner
  • The Hudi timeline is a log; compacting it yields a snapshot of the table (illustrated below)
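
The last point is the key mental model behind the timeline server: the timeline is an ordered log of actions, and compacting it means folding those actions into the current table state. A toy illustration of that fold (types invented for the example):

    // Toy model: fold (compact) the action log left-to-right to get a
    // snapshot of which file slices are live.
    sealed trait Action
    case class Commit(addedFiles: Set[String]) extends Action
    case class Clean(removedFiles: Set[String]) extends Action

    def snapshot(timeline: Seq[Action]): Set[String] =
      timeline.foldLeft(Set.empty[String]) {
        case (live, Commit(added))  => live ++ added
        case (live, Clean(removed)) => live -- removed
      }

    val timeline = Seq(
      Commit(Set("f1_v1")),
      Commit(Set("f1_v2", "f2_v1")),
      Clean(Set("f1_v1")) // older file version reclaimed
    )
    println(snapshot(timeline)) // Set(f1_v2, f2_v1)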