
HIP-5: HUI (Hudi WebUI)


Proposers

Approvers

Status

Current state: [One of "Under Discussion", "Accepted", "Rejected"]

Discussion thread: here

JIRA: here

Released: <Hudi Version>

Abstract

Apache Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing.

Hudi delivers remarkable performance when replacing traditional batch processing with stream processing to keep datasets updated/fresh. To do this, Hudi applies many internal optimizations, such as filtering duplicates and making writes efficient.

However, Hudi currently depends on Spark being configured with a metrics reporting system (Graphite, Prometheus) to expose performance numbers, index hits, etc. Moreover, users want to see not only performance metrics but also crucial Hudi dataset/table metadata, which Hudi cannot display just yet.

For these reasons, I propose HUI (Hudi WebUI).


Background

Almost all HBase users find the HMaster and HRegionServer WebUIs very helpful for performance metrics, bloom filter metrics, MemStore size, etc. Similarly, it would be helpful for Hudi users to not only see various Hudi metrics but also perform basic actions such as creating Hudi jobs (one-click DeltaStreamer deploys) and viewing all started, running, failed, and completed Hudi jobs.

Implementation

The sections below describe the proposed views. Think of a view as a web page the user sees; each section explains how that view adds value.

Proposed Views

  1. Jobs View

    1. The Jobs View shows how many Hudi jobs were scheduled, with YARN-style statuses (Started, Running, Failed, Completed). Clicking on a job reveals further metrics and options, explained below; a minimal job model sketch also follows this list.
    2. The Jobs View also allows the user to create a new DeltaStreamer job through the WebUI.
  2. Hudi Table View

    1. The Hudi Table View shows all information about a Hudi dataset. Under the Hudi Table View, we will also have a Compaction View and a Metadata Lineage View.
      1. Compaction View
        1. The Compaction View allows users to select a compaction strategy and see compaction-related metrics; it is explained further below.
      2. Metadata Lineage View
        1. The Metadata Lineage View shows users which data source was used to create a particular Hudi dataset/table.
        2. When running DeltaStreamer or a Spark job that extends Hudi, we can track the data source and the root dir; capturing these lets us build a lineage of the dataset in the WebUI.
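
As a first cut, the Jobs View could be backed by a very small job model. Below is a minimal sketch, assuming a hypothetical HudiJob type that the WebUI renders per job; none of these names exist in Hudi today.

```java
import java.time.Instant;

// Hypothetical job model for the Jobs View; field names are
// illustrative assumptions, not an existing Hudi API.
enum JobStatus { STARTED, RUNNING, FAILED, COMPLETED }

final class HudiJob {
  final String jobId;          // e.g. the YARN application id
  final String name;           // e.g. "deltastreamer-orders"
  final JobStatus status;
  final Instant startedAt;
  final long recordsInserted;  // would come from Hudi commit metadata
  final long recordsUpdated;
  final long recordsDeleted;

  HudiJob(String jobId, String name, JobStatus status, Instant startedAt,
          long inserted, long updated, long deleted) {
    this.jobId = jobId;
    this.name = name;
    this.status = status;
    this.startedAt = startedAt;
    this.recordsInserted = inserted;
    this.recordsUpdated = updated;
    this.recordsDeleted = deleted;
  }
}
```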

Views Explained

Jobs View

The major purpose of this view is to show Hudi jobs, job metrics, and job-related actions. Giving Hudi its own jobs view isolates Hudi jobs from all other jobs on YARN/Mesos, making it easier for users to stay focused on their Hudi ETLs. A sketch of a possible HTTP endpoint backing this view follows the list below.

  1. Running Jobs (DeltaStreamer jobs, other Spark jobs that extend the Hudi library)
    1. This view shows all running Hudi jobs with the following statistics:
      1. Index Metrics
      2. Delta Metrics
      3. Commit Metrics
      4. Inflight Metrics
      5. Compaction Metrics
      6. Num files written
      7. Num records inserted
      8. Num records updated
      9. Num records deleted
      10. DFS writes etc.
  2. Completed Jobs
    1. Simply a list of all completed Hudi jobs with the above metrics in view.
  3. Failed Jobs
    1. Simply a list of all failed Hudi jobs and logs.
  4. Restart Job
    1. Explicitly allow a user to restart a running Hudi Job with the following options:
      1. Wipe and Restart (Delete the existing Hudi data and restart the entire Hudi job)
      2. Restart (Restart and append to the current dataset dir)
  5. Kill Job
    1. Simply kill a running Hudi job.
  6. Create a new job
    1. The appeal of DeltaStreamer is that with just a few source and dataset parameters you are ready to go.
      1. Hence, users should be able to describe their data source (RDBMS, Kafka, Hive, etc.), dataset properties (record key, partition key, sort key, etc.), and Spark properties (master, driver memory, executor memory), check a few boxes for dedup, hive-sync, etc., and launch DeltaStreamer right away.
      2. Clone an existing Job.
      3. Create DeltaStreamer Templates.
      4. Share DeltaStreamer Templates.
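
To make the actions above concrete, here is a minimal sketch of how HUI could serve Jobs View data over HTTP, using only the JDK's built-in `com.sun.net.httpserver` server. The endpoint path, port, and payload are assumptions for illustration, not a proposed final API.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HuiJobsEndpoint {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8090), 0);

    // GET /jobs -> list Hudi jobs. A real implementation would query
    // YARN/Mesos for Hudi-tagged applications and join them with
    // commit metadata; here we return a static placeholder.
    server.createContext("/jobs", exchange -> {
      String body = "[{\"jobId\":\"application_0001\",\"status\":\"RUNNING\"}]";
      byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
      exchange.getResponseHeaders().add("Content-Type", "application/json");
      exchange.sendResponseHeaders(200, bytes.length);
      try (OutputStream os = exchange.getResponseBody()) {
        os.write(bytes);
      }
    });

    server.start(); // then: curl http://localhost:8090/jobs
  }
}
```

Restart, kill, and create actions would follow the same pattern as POST endpoints that delegate to the cluster manager.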

Hudi Table View

The Hudi Table View displays metadata about tables (a sketch of reading some of these fields follows the list):

  1. Hudi table dir (s3://…….)
  2. Hudi record key
  3. Hudi sort key
  4. Hudi partition key
  5. Hudi records per partition (to check table partition skew)
  6. Hudi table size on DFS
  7. Hudi table source (was the table created by pulling RDBMS data, a Kafka topic, a Hive table, etc.)
  8. Hudi table type (COPY_ON_WRITE, MERGE_ON_READ)
  9. Hudi table views (Read Optimized View, Incremental View, Near-Real-Time View)
  10. Hudi table storage format (ORC, Parquet)
  11. Hudi table compression (Gzip, Snappy, Zlib, etc.)
  12. Hudi table schema (id int, name string, etc.)
  13. Hudi num compactions (since table inception)
  14. Hudi total records updated (since table inception)
  15. Hudi total records inserted (since table inception)
  16. Hudi total records deleted (since table inception)
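
Several of these fields are already persisted in the dataset's `.hoodie/hoodie.properties` file, so a first version of the Table View could read them directly. A minimal sketch follows, using the local filesystem for brevity (a real implementation would go through Hadoop's FileSystem API); table size and record counts would instead be aggregated from commit metadata.

```java
import java.io.FileInputStream;
import java.nio.file.Paths;
import java.util.Properties;

public class TableViewMetadata {
  public static void main(String[] args) throws Exception {
    String basePath = args[0]; // e.g. /data/hudi/orders

    // hoodie.properties lives under <basePath>/.hoodie/
    Properties props = new Properties();
    try (FileInputStream in = new FileInputStream(
        Paths.get(basePath, ".hoodie", "hoodie.properties").toFile())) {
      props.load(in);
    }

    System.out.println("table name: " + props.getProperty("hoodie.table.name"));
    System.out.println("table type: " + props.getProperty("hoodie.table.type"));
  }
}
```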

For reference, see the attached screenshot of the HBase table view:


Compaction View

By clicking on a table, the user should see something like the attached mockup.

When clicking `Compact`, the user should be able to schedule compactions through the WebUI.
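
As a sketch, the `Compact` button could translate into a small request payload that the backend turns into a Hudi compaction schedule (for example by invoking the existing hudi-cli compaction commands, or the write client's scheduling API). All names below are illustrative assumptions.

```java
// Hypothetical payload behind the `Compact` button.
public class CompactionRequest {
  String basePath;      // dataset root dir
  String strategyClass; // e.g. a size-based or day-based Hudi
                        // compaction strategy class

  // Stub: a real backend would validate the request, schedule a
  // compaction on the given dataset, and surface its status in the
  // Compaction View.
  void schedule() {
    // e.g. delegate to hudi-cli's compaction scheduling command or
    // the write client's compaction scheduling API.
  }
}
```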


Metadata Lineage View

When users start migrating their datasets from various sources, it becomes important to track which data source was used to create each table. Today, this tracking depends on correct Hudi dataset locations or table naming conventions, along with manual intervention such as asking around about tables.

To avoid re-creating a dataset that already exists in Hudi but has been forgotten, we need to maintain Hudi dataset lineages. This view will show users how a particular Hudi dataset emerged: What was the source? Which RDBMS or Hive table? Which Kafka or Pulsar topic? Which file on DFS?
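
Below is a minimal sketch of the lineage record HUI could persist per dataset, written at job start when DeltaStreamer (or a Hudi-extending Spark job) already knows its source configuration and target base path; all names are assumptions for illustration.

```java
import java.time.Instant;

// Hypothetical lineage record; not an existing Hudi structure.
final class LineageRecord {
  final String sourceType;     // "jdbc" | "hive" | "kafka" | "pulsar" | "dfs"
  final String sourceId;       // table name, topic name, or file path
  final String targetBasePath; // the Hudi dataset's root dir
  final Instant capturedAt;

  LineageRecord(String sourceType, String sourceId, String targetBasePath) {
    this.sourceType = sourceType;
    this.sourceId = sourceId;
    this.targetBasePath = targetBasePath;
    this.capturedAt = Instant.now();
  }
}
```

The Metadata Lineage View would then render the chain source → Hudi dataset from these records.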




Rollout/Adoption Plan

  • <What impact (if any) will there be on existing users?>
  • <If we are changing behavior how will we phase out the older behavior?>
  • <If we need special migration tools, describe them here.>
  • <When will we remove the existing behavior?>

Test Plan

<Describe in few sentences how the HIP will be tested. How will we know that the implementation works as expected? How will we know nothing broke?>
