...

Approvers

Status

Current state: [One of "Under Discussion", "Accepted", "Rejected"]

...

Almost all HBase users find the HMaster and HRegionServer WebUIs really helpful for viewing performance metrics, bloom filter metrics, Memstore size, etc. Similarly, it would be helpful for Hudi users to not only see various Hudi metrics but also perform basic actions such as creating Hudi jobs (DeltaStreamer one-click deploy) and viewing all started, running, failed, and completed Hudi jobs.

Implementation

The sections below describe a set of views. Think of a view as a web page that the user sees; each description explains how that view adds value for them.

Proposed Views

  1. Job View

    1. Job View shows how many Hudi jobs were scheduled. It shows job statuses like those in YARN (Started, Running, Failed, Completed). Clicking on a job shows further metrics and options, which are explained below.
    2. Job View also allows the user to create a new DeltaStreamer job through the WebUI.
  2. Hudi Table View

    1. The Hudi Table view shows all information about a Hudi Dataset. Furthermore, under the Hudi Table View, we will have a Compaction View and a Metadata Lineage View.
      1. Compaction View
        1. The Compaction View allows users to select a compaction strategy and see compaction-related metrics. This view is explained below.

...

The main purpose of this view is to show Hudi jobs, job metrics, and job-related actions. Hudi gets its own jobs view so that Hudi jobs are isolated from all other jobs on YARN/Mesos, which makes it easier for users to stay focused on their Hudi ETLs.
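
As a rough illustration of that isolation, the sketch below filters a mixed list of cluster applications down to Hudi jobs and groups them by state. The `ClusterApp` shape and the convention that Hudi jobs carry a `hudi` tag are assumptions made for this example; the proposal does not specify how Hudi jobs would be identified.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterApp:
    """One application as reported by the cluster manager (hypothetical shape)."""
    app_id: str
    name: str
    state: str  # e.g. "RUNNING", "FAILED", "COMPLETED"
    tags: frozenset = frozenset()


# Assumed convention: Hudi jobs are submitted with this tag.
HUDI_TAG = "hudi"


def hudi_jobs_by_state(apps):
    """Keep only applications tagged as Hudi jobs and group them by state."""
    grouped = defaultdict(list)
    for app in apps:
        if HUDI_TAG in app.tags:
            grouped[app.state].append(app)
    return dict(grouped)


apps = [
    ClusterApp("app_1", "orders-deltastreamer", "RUNNING", frozenset({"hudi"})),
    ClusterApp("app_2", "nightly-report", "RUNNING"),
    ClusterApp("app_3", "users-deltastreamer", "FAILED", frozenset({"hudi"})),
]

for state, jobs in hudi_jobs_by_state(apps).items():
    print(state, [job.name for job in jobs])
```

Each state bucket would then back one of the sub-views listed next.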

  1. Running Jobs (DeltaStreamer jobs and other Spark jobs that extend the Hudi library)
    1. This view shows all running Hudi jobs with the following statistics:
      1. Index Metrics
      2. Delta Metrics
      3. Commit Metrics
      4. Inflight Metrics
      5. Compaction Metrics
      6. Num files written
      7. Num records inserted
      8. Num records updated
      9. Num records deleted
      10. DFS writes etc.
  2. Completed Jobs
    1. A list of all completed Hudi jobs with the above metrics in view.
  3. Failed Jobs
    1. A list of all failed Hudi jobs and their logs.
  4. Restart Job
    1. Explicitly allow a user to restart a running Hudi job with the following options:
      1. Wipe and Restart (delete the existing Hudi data and restart the entire Hudi job)
      2. Restart (restart and append to the currently set directory)
  5. Kill Job
    1. Kill a running Hudi job.
  6. Create a new job
    1. The ease of using DeltaStreamer is that with just a few source and dataset parameters you are ready to go.
      1. Hence, users should be able to describe their data source (RDBMS, Kafka, Hive, etc.), dataset properties (record key, partition key, sort key, etc.), and Spark properties (master, driver memory, executor memory), check a few boxes for dedup, hive-sync, etc., and launch the DeltaStreamer right away.
      2. Clone an existing job.
      3. Create DeltaStreamer templates.
      4. Share DeltaStreamer templates.
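
The "create", "clone", and "share" actions above could be backed by a small template record. The following is a minimal sketch, not a proposed schema: the `JobTemplate` class, its field names, and its default values are all hypothetical and only mirror the form fields mentioned in the list.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class JobTemplate:
    """Form fields for one DeltaStreamer job (hypothetical schema)."""
    name: str
    source_type: str  # e.g. "kafka", "rdbms", "hive"
    record_key: str
    partition_key: str
    sort_key: str
    spark_master: str = "yarn"
    driver_memory: str = "1g"
    executor_memory: str = "2g"
    dedup: bool = False
    hive_sync: bool = False

    def clone(self, **overrides):
        """Copy this template, changing only the given fields."""
        return replace(self, **overrides)

    def to_json(self):
        """Serialize the template so it can be shared with other users."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Rebuild a template from its shared JSON form."""
        return cls(**json.loads(text))


orders = JobTemplate(
    name="orders",
    source_type="kafka",
    record_key="order_id",
    partition_key="order_date",
    sort_key="updated_at",
    hive_sync=True,
)

# Clone an existing job, overriding only what differs.
returns = orders.clone(name="returns", record_key="return_id")

# Share a template as JSON and load it back.
shared = orders.to_json()
print(JobTemplate.from_json(shared) == orders)
```

Keeping the template immutable means a clone never alters the job it was copied from, which seems a reasonable default when templates are shared between users.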

Hudi Table View

The Hudi Table View displays metadata about tables, such as:

  1. Hudi table Dir (s3://…….)
  2. Hudi record Key
  3. Hudi sort Key
  4. Hudi partition key
  5. Hudi records per partition (To check table partition skew)
  6. Hudi table size on DFS
  7. Hudi table source (Was the table created by pulling RDBMS data, Kafka topic, Hive table etc)
  8. Hudi table Type (COPY_ON_WRITE, MERGE_ON_READ)
  9. Hudi table View (Read Optimized View, Incremental View, Near-Real time Table)
  10. Hudi table Storage format (ORC, Parquet)
  11. Hudi table compression (Gzip, Snappy, Zlib etc)
  12. Hudi table schema  (Id int, name string etc)
  13. Hudi num compactions (Since table inception)
  14. Hudi total records updated (Since table inception)
  15. Hudi total records inserted (Since table inception)
  16. Hudi total records deleted (Since table inception)
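
Item 5 above (records per partition) is meant to expose partition skew. One simple way to turn those counts into a number is the ratio of the largest partition to the average partition; the sketch below assumes that definition, and the function names and the threshold are illustrative choices, not part of the proposal.

```python
def partition_skew(records_per_partition):
    """Largest partition divided by the average partition (1.0 means even)."""
    counts = list(records_per_partition.values())
    if not counts or sum(counts) == 0:
        return 0.0
    average = sum(counts) / len(counts)
    return max(counts) / average


def skewed_partitions(records_per_partition, threshold=1.5):
    """Partitions holding more than `threshold` times the average record count."""
    counts = list(records_per_partition.values())
    if not counts:
        return []
    average = sum(counts) / len(counts)
    return [
        partition
        for partition, count in records_per_partition.items()
        if count > threshold * average
    ]


# Hypothetical per-partition record counts for one table.
counts = {"2019/01/01": 100, "2019/01/02": 100, "2019/01/03": 400}

print(partition_skew(counts))
print(skewed_partitions(counts))
```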

For reference, this is what the HBase table view looks like:

Image Added

Image Added


Compaction View

On clicking a table, the user should see something like:

Image Added

By clicking on `Compact`, the user should be able to schedule compactions through the WebUI.
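
To make the "select a strategy, then schedule" flow concrete, here is a minimal sketch of what a `Compact` action might do behind the WebUI. Everything in it is an assumption for illustration: the `FileGroup` record, the two strategy names, and the IO budget parameter are hypothetical and are not Hudi's actual compaction classes or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileGroup:
    """A file group with log data waiting to be compacted (hypothetical)."""
    partition: str
    file_id: str
    log_size_mb: int


def by_log_size(groups):
    """Order file groups with the largest accumulated logs first."""
    return sorted(groups, key=lambda g: g.log_size_mb, reverse=True)


def by_latest_partition(groups):
    """Order file groups by partition path, latest first (date-like paths)."""
    return sorted(groups, key=lambda g: g.partition, reverse=True)


# Strategy names the user could pick from in the UI (illustrative only).
STRATEGIES = {
    "log-size": by_log_size,
    "latest-partition": by_latest_partition,
}


def schedule_compaction(groups, strategy, io_budget_mb):
    """Walk groups in strategy order, skipping any that would exceed the budget."""
    plan, used = [], 0
    for group in STRATEGIES[strategy](groups):
        if used + group.log_size_mb > io_budget_mb:
            continue
        plan.append(group)
        used += group.log_size_mb
    return plan


pending = [
    FileGroup("2019/01/01", "f1", 300),
    FileGroup("2019/01/02", "f2", 120),
    FileGroup("2019/01/02", "f3", 50),
]

for name in STRATEGIES:
    plan = schedule_compaction(pending, name, io_budget_mb=400)
    print(name, [g.file_id for g in plan])
```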


Metadata Lineage View

When users start migrating their datasets from various sources, it will be important to track which data source was used to create each table. For now, this tracking depends on users following the correct Hudi dataset location or table naming conventions, and it also requires the manual step of asking around about tables.

To avoid re-creating in Hudi a dataset that already exists but may have been forgotten, we need to maintain Hudi dataset lineage. This view will show users how a particular Hudi dataset came about. What was the source? Which RDBMS or Hive table? Which Kafka or Pulsar topic? Which file on DFS?
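
A lineage store that answers these questions can be very small. The sketch below is one possible shape, assuming each dataset has exactly one source; the `Source` and `LineageRegistry` names, the source kinds, and the example paths are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Where a Hudi dataset came from (hypothetical record)."""
    kind: str        # e.g. "rdbms", "hive", "kafka", "pulsar", "dfs"
    identifier: str  # table name, topic name, or file path


class LineageRegistry:
    """Maps sources to Hudi datasets and back (assumes one source per dataset)."""

    def __init__(self):
        self._by_source = {}
        self._by_dataset = {}

    def register(self, dataset_path, source):
        """Record lineage, refusing a second dataset built from the same source."""
        existing = self._by_source.get(source)
        if existing is not None and existing != dataset_path:
            raise ValueError(
                f"{source.kind}:{source.identifier} already feeds {existing}"
            )
        self._by_source[source] = dataset_path
        self._by_dataset[dataset_path] = source

    def source_of(self, dataset_path):
        """Answer 'what was the source of this dataset?'."""
        return self._by_dataset.get(dataset_path)

    def dataset_for(self, source):
        """Answer 'has this source already been ingested into Hudi?'."""
        return self._by_source.get(source)


registry = LineageRegistry()
orders_topic = Source("kafka", "orders")
registry.register("/data/hudi/orders", orders_topic)

print(registry.source_of("/data/hudi/orders"))
print(registry.dataset_for(orders_topic))
```

Rejecting a second registration for the same source is what prevents the "forgotten dataset" duplication described above; a real implementation might warn instead of failing.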


Image Added



Rollout/Adoption Plan

  • <What impact (if any) will there be on existing users?>
  • <If we are changing behavior how will we phase out the older behavior?>
  • <If we need special migration tools, describe them here.>
  • <When will we remove the existing behavior?>

...