Proposers
Taher Koitawala <taherk77@gmail.com>
Approvers
@<approver1 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
@<approver2 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
...
Status
Current state: [One of "Under Discussion", "Accepted", "Rejected"]
Discussion thread: here
JIRA: here
Released: <Hudi Version>
Abstract
Apache Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing.
Hudi performs remarkably well when it replaces traditional batch processing with stream processing to keep datasets fresh. To achieve this, Hudi relies on many internal optimizations, such as filtering duplicates and making writes efficient.
However, Hudi depends on Spark being configured with a metrics reporting system (Graphite, Prometheus) to display performance numbers, index hits, and so on. Users want to see not only performance metrics but also crucial Hudi dataset/table metadata, which Hudi cannot display yet.
For these reasons, I propose HUI (Hudi WebUI).
Background
Almost all users of HBase find the HMaster and HRegionServer WebUIs really helpful for performance metrics, bloom metrics, MemStore size, and so on. Similarly, it will be helpful for Hudi users not only to see various Hudi metrics but also to perform some basic actions, such as creating Hudi jobs (DeltaStreamer one-click deploy) and seeing all started, running, failed, and completed Hudi jobs.
Proposed Views
Job View
- The Job View shows how many Hudi jobs were scheduled and, as YARN does, their statuses (Started, Running, Failed, Completed). Clicking on a job shows further metrics and options, which are explained in the Views Explained section.
- The Job View also allows the user to create a new DeltaStreamer job through the WebUI.
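As a rough illustration of the data behind the Job View, the sketch below models the four job statuses named above and counts jobs per status. The `JobStatus`, `HudiJob`, and `count_by_status` names, along with the sample job IDs and names, are hypothetical and not part of any existing Hudi API.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    # The four states named in the Job View description.
    STARTED = "Started"
    RUNNING = "Running"
    FAILED = "Failed"
    COMPLETED = "Completed"


@dataclass(frozen=True)
class HudiJob:
    # Hypothetical record; field names are illustrative only.
    job_id: str
    name: str
    status: JobStatus


def count_by_status(jobs):
    """Return a count for every status, including statuses with no jobs."""
    counts = Counter(job.status for job in jobs)
    return {status: counts.get(status, 0) for status in JobStatus}


# Made-up sample data for illustration.
jobs = [
    HudiJob("job-1", "trips-ingest", JobStatus.RUNNING),
    HudiJob("job-2", "users-ingest", JobStatus.FAILED),
    HudiJob("job-3", "orders-ingest", JobStatus.RUNNING),
]
summary = count_by_status(jobs)
```

Returning a zero for empty statuses keeps the summary shape stable, so a UI header could always render all four counters.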
Hudi Table View
- The Hudi Table view shows all information about a Hudi Dataset. Furthermore, under the Hudi Table View, we will have a Compaction View and a Metadata Lineage View.
Compaction View
- The Compaction View allows users to select a compaction strategy and see compaction-related metrics. This view is explained further on.
Metadata Lineage View
- A Metadata Lineage view should show users what data source was used to create a particular Hudi dataset/table.
- When running DeltaStreamer or a Spark job that extends Hudi, we can track the data source and the root.dir. By capturing these, we can build a lineage of the dataset in the WebUI.
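One possible way to capture lineage is sketched below, assuming that each write job emits a small record with its data source and root.dir, and that root.dir refers to the root directory of the Hudi dataset being written. `LineageEvent`, `build_lineage`, and the sample sources and paths are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEvent:
    # Hypothetical record emitted once per write job.
    job_name: str
    source: str    # where the data came from, e.g. a topic or a DFS path
    root_dir: str  # assumed: root directory of the Hudi dataset written


def build_lineage(events):
    """Map each dataset root directory to the sorted list of its sources."""
    sources_by_root = defaultdict(set)
    for event in events:
        sources_by_root[event.root_dir].add(event.source)
    return {root: sorted(sources) for root, sources in sources_by_root.items()}


# Made-up sample events for illustration.
events = [
    LineageEvent("trips-ingest", "kafka://trips-topic", "/data/hudi/trips"),
    LineageEvent("trips-backfill", "hdfs:///raw/trips", "/data/hudi/trips"),
    LineageEvent("users-ingest", "kafka://users-topic", "/data/hudi/users"),
]
lineage = build_lineage(events)
```

Keying the lineage on the dataset root directory means several jobs feeding the same table collapse into one entry, which is what a per-table lineage view would need.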
Views Explained
Jobs View
The major purpose of this view is to show Hudi jobs, job metrics and job-related actions.
Running Jobs (DeltaStreamer jobs and other Spark jobs that extend the Hudi library)
- This view shows all running Hudi jobs with the following statistics:
- Index Metrics
- Delta Metrics
- Commit Metrics
- Inflight Metrics
- Compaction Metrics
- Number of files written
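The statistics listed above could be carried per running job in a structure like the following sketch, which groups metrics by category and flattens them into a single row for display. The `RunningJobStats` class, the `to_row` helper, and every metric key shown (such as `hits` or `total`) are assumptions for illustration; the proposal does not define them.

```python
from dataclasses import dataclass, field


@dataclass
class RunningJobStats:
    # Hypothetical container: one metric group per statistic in the list.
    job_name: str
    index_metrics: dict = field(default_factory=dict)
    delta_metrics: dict = field(default_factory=dict)
    commit_metrics: dict = field(default_factory=dict)
    inflight_metrics: dict = field(default_factory=dict)
    compaction_metrics: dict = field(default_factory=dict)
    num_files_written: int = 0


def to_row(stats):
    """Flatten grouped metrics into one dict with 'group.key' column names."""
    row = {
        "job_name": stats.job_name,
        "num_files_written": stats.num_files_written,
    }
    groups = {
        "index": stats.index_metrics,
        "delta": stats.delta_metrics,
        "commit": stats.commit_metrics,
        "inflight": stats.inflight_metrics,
        "compaction": stats.compaction_metrics,
    }
    for group, metrics in groups.items():
        for key, value in metrics.items():
            row[f"{group}.{key}"] = value
    return row


# Made-up sample values for illustration.
stats = RunningJobStats(
    job_name="trips-ingest",
    index_metrics={"hits": 120},
    commit_metrics={"total": 4},
    num_files_written=7,
)
row = to_row(stats)
```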
Implementation
<Describe the new thing you want to do in appropriate detail and how it fits into the project architecture. Provide a detailed description of how you intend to implement this feature. This may be fairly extensive and have large subsections of its own. Or it may be a few sentences. Use judgement based on the scope of the change.>
Rollout/Adoption Plan
- <What impact (if any) will there be on existing users?>
- <If we are changing behavior how will we phase out the older behavior?>
- <If we need special migration tools, describe them here.>
- <When will we remove the existing behavior?>
Test Plan
<Describe in a few sentences how the HIP will be tested. How will we know that the implementation works as expected? How will we know nothing broke?>