Info: Most of the Hudi content is now hosted on the project site or the GitHub repo. This wiki is not actively updated or maintained.
If you are looking for documentation on using Apache Hudi, please visit the project site or engage with our community.
This wiki space hosts:
Technical documentation
How-to blogs
- How to manually register Hudi tables into Hive via Beeline?
- Ingesting Database changes via Sqoop/Hudi
- De-Duping Kafka Events With Hudi DeltaStreamer
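The first how-to above covers manually registering Hudi tables into Hive via Beeline. As an illustrative sketch only (the table name, schema, HDFS path, and partition column below are hypothetical, and exact class names can vary across Hudi releases), registration boils down to creating an external Hive table wired to Hudi's input format:

```sql
-- Hedged sketch: register an existing Hudi copy-on-write dataset in Hive.
-- `hudi_trips`, its columns, the paths, and `datestr` are illustrative placeholders.
CREATE EXTERNAL TABLE hudi_trips (
  _hoodie_commit_time string,
  uuid string,
  rider string,
  fare double
)
PARTITIONED BY (datestr string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs:///data/hudi/trips';

-- Each partition must also be registered so Hive can see its files:
ALTER TABLE hudi_trips ADD IF NOT EXISTS PARTITION (datestr = '2019-01-01')
  LOCATION 'hdfs:///data/hudi/trips/2019-01-01';
```

Such statements would typically be run through Beeline against a HiveServer2 endpoint. Note that older Hudi releases shipped these classes under the `com.uber.hoodie` package, so consult the blog post and your release's documentation for the exact class names.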
Design documents
...
RFCs
RFCs (formerly known as HIPs) are the way to propose large changes to Hudi, and the RFC process details how to drive one from proposal to completion.
Existing HIPs are listed below:
- HIP-1: CSV Source Support for Delta Streamer
- HIP-2: ORC Storage in Hudi
- HIP-3: Timeline Service with Incremental File System View Syncing
- HIP-4: Faster Hive incremental pull queries
Roadmap
This is a rough, non-exhaustive roadmap of what's to come in each of these areas of Hudi.
Writing data & Indexing
- Support for indexing parquet records to improve speed
- Indexing the log file, moving closer to scalable 1-min ingests
- Overhaul of
- Incrementalizing cleaning based on timeline metadata
Reading data
- Incremental Pull natively via Spark Datasource
- Real-time view support on Presto
- Hardening incremental pull via Realtime view
- Support for Streaming style batch programs via Beam/Structured Streaming integration
Storage
- ORC Support
- Support for collapsing and splitting file groups
- Custom strategies for data clustering
- Columnar stats collection to power better query planning
Usability
- Painless migration of historical data, with safe experimentation
- Hudi on Flink
- Hudi for ML/Feature stores
Metadata Management
...
Anyone can initiate an RFC. Please note that if you are unsure whether a feature already exists, or whether there is already a plan to implement a similar one, always start a discussion thread on the dev mailing list before initiating an RFC. This will give everyone the right context and make the best use of everyone's time.