Quick Links 

Table of Contents

Issue Management

Actual issue tracking is in Apache JIRA! We use this page to ground ourselves.

...

  • (Vinoth) Identify & land all critical outstanding PRs (that solve critical issues, take us forward in our 1.0 path)
    •  Vinoth to identify.
    •  [Sagar] Move master to 1.0.0
  • (Ethan & Vinoth & Danny) Land storage format 1.0 (Complete)
    •  [Vinoth] Put up a 1.0 tech specs doc
    •  Standardization of serialization - log blocks, timeline meta files.
    •  Change Timeline/FileSystemView to support snapshot, incremental, CDC, time-travel queries correctly.
    •  Changes to make multiple base file formats within each file group.
    •  No Java classes
    •  [Danny] Introduce transition time into the active timeline
    •  [Danny] Land LSM Timeline in well-tested, performant shape (HUDI-309, HUDI-6626, this needs an epic ASAP???)
  • Design:
    •  [Sagar] Multi-table transactions? (VC: we have a strawman. but needs an RFC to validate correctness across phantom reads, self-joins, nested queries, and isolation levels)
    •  [Lin] Keys: UUIDs vs. what we do today.
    •  [Danny???] Time-Travel Read (+Write) (resolve HUDI-4500, HUDI-4677 and similar, address branch/merge use-cases)
    •  [Ethan???] Logical partitioning/Index Functions API (Java, Native) and its integration into Spark/Presto/Trino. (HUDI-512)
    •  [Sagar + ???] Schema Evolution and version tracking in MT.
    •  [Vinoth] Lance file format + storing blobs/images.
  • Implementation
    •  [Sagar] RFC-46/RecordMerger API, is this our final choice? cross-platform? only for hoodie.merge.mode=custom? (complete HUDI-3217)
    •  [Sagar] Async indexer is in final shape (complete HUDI-2488)
    •  [Lin] Land Parquet keyed lookup code (???)
    •  [???] Parquet Rewriting at Page Level for Spark Rows (Writer perf) (HUDI-4790)
    •  [Ethan] Implement MoR snapshot query (positional/key based updates, deletes), partial updates, custom merges on new File Format code path.
    •  [Ethan] Implement writers for positional updates, deletes, partial updates, ordering field based merging.
    •  Existing Optimistic Concurrency Control is in final shape (complete HUDI-1456)
  • (Sagar) Open/Risk Items:
    •  _hoodie_operation metafield. Spark/Flink interop.
    •  Are we happy with DT <> MT sync mechanism? does this need to be revisited? (HUDI-2461 + other issues)
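
The ordering-field based merging called out in the implementation items above (RecordMerger API, writer-side merging) can be sketched roughly as follows. This is an illustrative toy, not the actual HoodieRecordMerger interface: the record shape and method names here are hypothetical, and the real API deals in engine-specific record representations.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of ordering-field based merging: for each key, the record
// with the higher ordering value (e.g. an event timestamp) wins.
public class OrderingMergeSketch {
    // Hypothetical record: key, ordering value, payload.
    record Rec(String key, long ordering, String payload) {}

    // Merge the currently stored record with an incoming one.
    static Rec merge(Rec older, Rec newer) {
        return newer.ordering() >= older.ordering() ? newer : older;
    }

    public static void main(String[] args) {
        Map<String, Rec> store = new HashMap<>();
        Rec[] incoming = {
            new Rec("k1", 10, "v1"),
            new Rec("k1", 5,  "stale"),  // lower ordering value: ignored
            new Rec("k1", 20, "v2"),
        };
        for (Rec r : incoming) {
            // Map.merge invokes merge(oldValue, newValue) on key collision.
            store.merge(r.key(), r, OrderingMergeSketch::merge);
        }
        System.out.println(store.get("k1").payload()); // v2
    }
}
```

A custom merge mode would swap the `merge` function for user logic (e.g. partial field updates) while the surrounding key-collision plumbing stays the same.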

Execution Phase 2 (Sept 15-Oct 30)

...

    •  FileGroup APIs in Java

...

...

...

      •  Take HoodieData abstraction to completion and end-to-end row writing for Spark? All write operations work with rows end-to-end (HUDI-4857)

...

...

...

  •  Design

...

    •  General purpose, global timeline (no active vs archived distinction)

...

    •  Non-blocking concurrency control/clustering + updates, inserts + inserts for Spark + Flink.

...

    •  Spark SQL statements to complete DB vision. (vinoth has a list. ???)

...

  •  Implementation

...

    •  Multi-table transaction

...

    •  Implement non-blocking CC for Spark...

...

    •  Secondary indexes (Bloom, RLI, VectorIndex, ..) on Spark read/write path. (HUDI-3907, HUDI-4128)
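
The pruning idea behind a bloom-style secondary index on the read path can be sketched like this (illustrative only; Hudi's metadata-table bloom index is far more involved, and the class and method names here are made up). A small per-file filter answers "might this file contain the key?" so only candidate files get a full lookup.

```java
import java.util.*;

// Toy bloom-filter pruning: one small bit set per "file"; a key lookup
// scans only files whose filter reports a possible match.
public class BloomPruneSketch {
    static final int BITS = 1 << 12;

    // Build a filter by setting two hash-derived bits per key.
    static BitSet build(Collection<String> keys) {
        BitSet bits = new BitSet(BITS);
        for (String k : keys) {
            bits.set(Math.floorMod(k.hashCode(), BITS));
            bits.set(Math.floorMod((k + "#").hashCode(), BITS));
        }
        return bits;
    }

    // False positives are possible; false negatives are not.
    static boolean mightContain(BitSet bits, String key) {
        return bits.get(Math.floorMod(key.hashCode(), BITS))
            && bits.get(Math.floorMod((key + "#").hashCode(), BITS));
    }

    public static void main(String[] args) {
        Map<String, BitSet> fileFilters = new LinkedHashMap<>();
        fileFilters.put("file-1", build(List.of("a", "b")));
        fileFilters.put("file-2", build(List.of("c", "d")));

        // Only matching files remain candidates for the actual key lookup.
        List<String> candidates = new ArrayList<>();
        for (var e : fileFilters.entrySet()) {
            if (mightContain(e.getValue(), "c")) candidates.add(e.getKey());
        }
        System.out.println(candidates);
    }
}
```

A record-level index (RLI) replaces the probabilistic filter with an exact key-to-file mapping, trading index size for zero false positives.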

...

...

...

...

...

    •  Meta Sync to Glue/HMS with reduced storage/API overhead (HUDI-2519, HUDI-5108, HUDI-6488), seamless inc query, cdc query, ro/rt experience

...

    •  Broader Performance improvements (HUDI-3249)

...

    •  Encoding updates as deletes + inserts. (HUDI-6490)
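
The idea in HUDI-6490 of encoding an update as a delete of the old record plus an insert of the new one can be sketched as a simple log replay (record shape here is hypothetical): a delete tombstone cancels the prior version, and the trailing insert supplies the new value.

```java
import java.util.*;

// Toy log replay: updates appear in the log as delete + insert pairs.
public class UpdateAsDeleteInsert {
    record LogEntry(String key, String value, boolean delete) {}

    // Replay the log in order to materialize the current table state.
    static Map<String, String> replay(List<LogEntry> log) {
        Map<String, String> state = new LinkedHashMap<>();
        for (LogEntry e : log) {
            if (e.delete()) state.remove(e.key());
            else state.put(e.key(), e.value());
        }
        return state;
    }

    public static void main(String[] args) {
        List<LogEntry> log = List.of(
            new LogEntry("k1", "v1", false),  // original insert
            new LogEntry("k1", null, true),   // update step 1: delete old version
            new LogEntry("k1", "v2", false)   // update step 2: insert new version
        );
        System.out.println(replay(log)); // {k1=v2}
    }
}
```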

...

    •  SQL experience for timeline, metadata. (HUDI-6498)

...

    •  Introduce TrueTime API or equivalent, to explain the foundations more clearly. (reuse HUDI-3057)
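
A hedged sketch of what a TrueTime-style API gives the timeline (names are hypothetical, not an actual Hudi API): instead of a single instant, `now()` returns an uncertainty interval, and a commit timestamp chosen after waiting out that uncertainty is guaranteed to be in the past for every process, which yields strictly increasing, externally consistent instants.

```java
// Toy TrueTime-like clock with an assumed bound on cross-process clock skew.
public class TrueTimeSketch {
    static final long EPSILON_MS = 7; // assumed max clock error, in millis

    record Interval(long earliest, long latest) {}

    static Interval now() {
        long t = System.currentTimeMillis();
        return new Interval(t - EPSILON_MS, t + EPSILON_MS);
    }

    // Pick a commit timestamp, then "commit wait" until it has certainly elapsed.
    static long commitTime() throws InterruptedException {
        Interval i = now();
        long ts = i.latest();
        while (now().earliest() < ts) {
            Thread.sleep(1);
        }
        return ts;
    }

    public static void main(String[] args) throws InterruptedException {
        long t1 = commitTime();
        long t2 = commitTime();
        System.out.println(t2 > t1); // successive commit times strictly increase
    }
}
```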

...

    •  Introduce HudiStorage APIs to abstract out Hadoop FileSystem. (HUDI-6497)
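
A sketch of the abstraction direction in HUDI-6497 (interface and method names here are hypothetical, not the shipped API): engine code programs against a minimal storage contract instead of Hadoop's FileSystem, so backends (HDFS, object stores, local disk) can be swapped behind one interface.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Minimal storage contract the engine would code against.
interface StorageSketch {
    byte[] readAll(String path) throws IOException;
    void write(String path, byte[] data) throws IOException;
    List<String> list(String dir) throws IOException;
}

// One possible backend: java.nio local files, handy for tests.
class LocalStorageSketch implements StorageSketch {
    public byte[] readAll(String path) throws IOException {
        return Files.readAllBytes(Paths.get(path));
    }
    public void write(String path, byte[] data) throws IOException {
        Files.createDirectories(Paths.get(path).getParent());
        Files.write(Paths.get(path), data);
    }
    public List<String> list(String dir) throws IOException {
        try (Stream<Path> s = Files.list(Paths.get(dir))) {
            return s.map(Path::toString).sorted().collect(Collectors.toList());
        }
    }
}

public class StorageDemo {
    public static void main(String[] args) throws IOException {
        StorageSketch storage = new LocalStorageSketch();
        String dir = Files.createTempDirectory("hudi-sketch").toString();
        storage.write(dir + "/00001.commit", "metadata".getBytes());
        System.out.println(storage.list(dir).size()); // 1
        System.out.println(new String(storage.readAll(dir + "/00001.commit")));
    }
}
```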

Packaging Phase (Nov 1 - Nov 15) (Marked 1.1.0 for now)

  •  Release (if still pending!)

...

  •  Docs

...

  •  Examples

...

...

  •  Site updates

...

  •  Deprecate/Cleanup cWiki

Below the line (Marked 1.1.0 for now)

  •  Unstructured Hudi table.

...

  •  Native HFile reader/writer in Hudi. (VC: This was punted since we'd default to Parquet based MDT)

...

  •  Streaming Performance: optimize the current upsert DAG on MetadataIndex (hybrid of RLI, Bloom Index, ...)

...

  •  Column family use-case (sparse rows on wide tables??)

...

  •  Cool new indexes

...

    •  Spatial Index

...

    •  Search/Lucene Index

...

    •  Bitmap Index

...

  •  Hive Storage Handler

...

  •  Demos

...

...

  •  Dev Hygiene

...

...

  •  Tests

...

...

...

...