Roadmap

Under construction; early 2021 unveiling.

Below is a tentative roadmap for 2021 (in no particular order, since ordering is determined by the release management process).

Integrations 

  1. Spark SQL with MERGE/DELETE statement support (RFC-25: Spark SQL Extension For Hudi); see the sketch after this list
  2. Trino integration with support for querying/writing Hudi tables using SQL statements
  3. Kinesis/Pulsar integrations with DeltaStreamer
  4. Kafka Connect Sink for Hudi
  5. Dremio integration
  6. Interoperability with other table formats
  7. ORC Support
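
RFC-25 is still a proposal, but a minimal sketch of the kind of statements it would enable follows; the session extension class, table name (hudi_trips), and staging view (trip_updates) are illustrative, not settled API:

    import org.apache.spark.sql.SparkSession

    // Requires Hudi's Spark SQL extensions (per RFC-25) on the session.
    val spark = SparkSession.builder()
      .appName("hudi-merge-sketch")
      .config("spark.sql.extensions",
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .getOrCreate()

    // Upsert: update matched rows, insert the rest.
    spark.sql("""
      MERGE INTO hudi_trips AS t
      USING trip_updates AS s
      ON t.trip_id = s.trip_id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)

    // Delete rows matching a predicate.
    spark.sql("DELETE FROM hudi_trips WHERE trip_status = 'cancelled'")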

Writing data & Indexing

  • Indexing (a bloom-filter pruning sketch follows this section's list)
    • MetadataIndex implementation that serves bloom filters/key ranges from the metadata table, to speed up the bloom index on cloud storage
    • Addition of record-level indexes for fast CDC (RFC-08: Record level indexing mechanisms for Hudi datasets)
    • Range index to maintain column/field value ranges, to help file skipping for query performance
    • Addition of more auxiliary indexing structures: bitmaps, …

  • Improving indexing speed for time-ordered keys/small updates
    • Leverage Parquet record indexes
    • Serve bloom filters/ranges from the timeline server/consolidated metadata
    • Index the log file, moving closer to scalable 1-minute ingests
  • Improving indexing speed for UUID keys/large update spreads
    • Global/hash-based index for faster point-in-time lookups
  • Incrementalize & standardize all metadata operations, e.g., cleaning based on timeline metadata
  • Auto-tuning
    • Auto-tune bloom filter entries based on record counts
    • Partitioning based on historical workload trends
    • Determination of compression ratio
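
To see why serving bloom filters/key ranges from the metadata table speeds up the bloom index: an upsert only needs to check files whose key range contains the incoming key and whose bloom filter does not rule it out, so pruning happens before any file is opened. A self-contained sketch of that pruning step, using Guava's BloomFilter as a stand-in for Hudi's internal one (all names here are hypothetical):

    import com.google.common.hash.{BloomFilter, Funnels}
    import java.nio.charset.StandardCharsets

    // Per-file pruning metadata, as the metadata table would serve it.
    case class FileIndexEntry(fileName: String, minKey: String, maxKey: String,
                              bloom: BloomFilter[CharSequence])

    // A file can contain `key` only if the key falls inside its range AND the
    // bloom filter does not rule it out. Blooms can false-positive but never
    // false-negative, so survivors still need a real lookup in the file.
    def candidateFiles(files: Seq[FileIndexEntry], key: String): Seq[String] =
      files
        .filter(f => key >= f.minKey && key <= f.maxKey)
        .filter(_.bloom.mightContain(key))
        .map(_.fileName)

    def entryFor(name: String, keys: Seq[String]): FileIndexEntry = {
      val bloom = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 1000)
      keys.foreach(bloom.put(_))
      FileIndexEntry(name, keys.min, keys.max, bloom)
    }

    val files = Seq(entryFor("f1.parquet", Seq("a", "c")),
                    entryFor("f2.parquet", Seq("m", "z")))
    println(candidateFiles(files, "b")) // almost surely Seq("f1.parquet")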

Reading data

  • Concurrency Control (see the sketch after this list)
    • Addition of optimistic concurrency control, with pluggable locking services
    • Non-blocking clustering implementation w.r.t. updates
    • Multi-writer support with fully non-blocking, log-based concurrency control
    • Multi-table transactions
  • Performance
    • Integrate the row writer with all Hudi write operations
  • Self-Managing
    • Clustering based on historical workload trends
    • On-the-fly data locality during write time (HUDI-1628)
    • Auto-determination of compression ratio
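
A sketch of how optimistic concurrency control with a pluggable lock service typically works: do the heavy write work without any lock, then take a short lock only to validate that no overlapping commit landed in the meantime before publishing. Every name below is illustrative, not Hudi's actual interface:

    import java.util.concurrent.locks.ReentrantLock

    // Pluggable lock abstraction; real providers would be ZooKeeper,
    // Hive metastore, DynamoDB, and so on.
    trait LockProvider {
      def lock(): Unit
      def unlock(): Unit
    }

    class InProcessLock extends LockProvider {
      private val l = new ReentrantLock()
      def lock(): Unit = l.lock()
      def unlock(): Unit = l.unlock()
    }

    case class CommittedWrite(instant: Long, filesTouched: Set[String])

    class OptimisticWriter(locks: LockProvider) {
      @volatile private var committed = List.empty[CommittedWrite]

      // Phase 1 (no lock): write new file versions. Phase 2 (short lock):
      // re-check for conflicting commits since we started; publish or abort.
      def commit(startInstant: Long, filesTouched: Set[String]): Boolean = {
        locks.lock()
        try {
          val conflict = committed.exists(c =>
            c.instant > startInstant &&
              c.filesTouched.intersect(filesTouched).nonEmpty)
          if (conflict) false // caller retries against fresh table state
          else {
            committed ::= CommittedWrite(System.currentTimeMillis(), filesTouched)
            true
          }
        } finally locks.unlock()
      }
    }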

Querying

  • Performance
    • Complete integration with the metadata table
    • Real-time view performance/memory footprint reduction
  • PrestoDB
    • Incremental query support on Presto
    • Real-time view support on Presto
  • Hive
    • Storage handler to leverage the metadata table for partition pruning
  • Incremental pull natively via the Spark DataSource (see the sketch after this list)
  • Spark SQL
    • Hardening incremental pull via the real-time view
    • Support for streaming-style batch programs via Beam/Structured Streaming integration
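
Incremental pull already has a shape through the Spark DataSource; the roadmap item is about making it native and first-class. A sketch of the read side, with option keys as documented around this time and a hypothetical table path (verify both against the release in use):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hudi-incremental-sketch").getOrCreate()

    // Find recent commit instants from Hudi's meta column (assumes >= 2 commits).
    val commits = spark.read.format("hudi")
      .load("s3://bucket/hudi_trips")
      .select("_hoodie_commit_time").distinct()
      .orderBy("_hoodie_commit_time")
      .collect()
    val beginTime = commits.takeRight(2).head.getString(0)

    // Pull only records written after that instant.
    val incremental = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", beginTime)
      .load("s3://bucket/hudi_trips")
    incremental.show()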

Storage

  • ORC Support
  • Support for collapsing and splitting file groups
  • Custom strategies for data clustering (see the sketch after this list)
  • Columnar stats collection to power better query planning
  • Object storage
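
For the clustering item, a sketch of what enabling inline clustering on a write path looks like, so small files get periodically rewritten sorted by chosen columns for better skipping. Config keys follow the clustering work (RFC-19) as of this writing, and the table/field names are hypothetical; verify both against the release in use:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def writeWithClustering(df: DataFrame, path: String): Unit =
      df.write.format("hudi")
        .option("hoodie.table.name", "hudi_trips")
        .option("hoodie.datasource.write.recordkey.field", "trip_id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.clustering.inline", "true")
        .option("hoodie.clustering.inline.max.commits", "4") // cluster every 4 commits
        .option("hoodie.clustering.plan.strategy.sort.columns", "city,ts")
        .mode(SaveMode.Append)
        .save(path)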

Usability

  • Spark DataSource redesign around the metadata table
  • Streaming ETL via Structured Streaming
  • Flink
    • Support for end-to-end streaming ETL pipelines
    • Materialized view support via Flink/Calcite SQL
  • Mutable, columnar cache service
    • File-group-level caching to enable real-time analytics (backed by Arrow/AresDB)
  • Painless migration of historical data, with safe experimentation
  • Hudi on Flink
  • Hudi for ML/feature stores

Metadata Management

  • Standalone timeline server to handle:
    • Serving interactive query planning: schema, DFS listings, statistics, and timeline requests
    • Consolidated filesystem metadata for query planning
    • High availability/sharding
    • Pluggable backing stores, including RocksDB, Dynamo, and Spanner
  • The Hudi timeline is a log; compacting it yields a snapshot of the table (illustrated below)
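
The last point is the key mental model behind the timeline server: the timeline is an ordered log of actions, and compacting it means folding those actions into the current table state. A toy illustration of that fold (types invented for the example):

    // Toy model: fold (compact) the action log left-to-right to get a
    // snapshot of which file slices are live.
    sealed trait Action
    case class Commit(addedFiles: Set[String]) extends Action
    case class Clean(removedFiles: Set[String]) extends Action

    def snapshot(timeline: Seq[Action]): Set[String] =
      timeline.foldLeft(Set.empty[String]) {
        case (live, Commit(added))  => live ++ added
        case (live, Clean(removed)) => live -- removed
      }

    val timeline = Seq(
      Commit(Set("f1_v1")),
      Commit(Set("f1_v2", "f2_v1")),
      Clean(Set("f1_v1")) // older file version reclaimed
    )
    println(snapshot(timeline)) // Set(f1_v2, f2_v1)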