Target release:
Epic:

Document status: DRAFT
Document owner: Designer


Goals

Create a lightweight internal analytics framework to support prediction of NiFi cluster behavior. This framework would:

  • Provide a flexible internal analytics engine and model API for NiFi metrics that supports adding or enhancing onboard models
  • Support both traditional and online (incremental) learning methods
  • Support integration of remote or cloud-based ML models
  • Provide support for model caching (with possible later inclusion in a model repository or registry)
  • Provide UI enhancements to display prediction information in existing summary data, in new data visualizations, or directly within the flow/canvas (where applicable)

Background and strategic fit

NiFi currently provides cluster-, flow-, and component-specific metrics that can be viewed in its UI or exported via several Reporting Task implementations. With this data, users can assess the real-time health and performance of a cluster and also predict cluster behavior such as back pressure occurrence, memory utilization, flow rates, and service anomalies. This information could then be used to act proactively, such as triggering alerts or notifications, or performing automated tasks such as scaling or configuration updates to maintain SLAs.

Today, users who collect metrics for these activities typically export them to other systems such as Prometheus, ELK stacks, or Ambari Metrics for analysis. These systems are efficient at capturing, analyzing, and visualizing metric data; however, they require additional customization and integration work, along with knowledge of NiFi operations, to provide meaningful analytics within a data flow context. They also require users to navigate and configure multiple applications to discover information on NiFi behavior and important trends. This highlights an opportunity to introduce an analytics framework that can give users reasonable predictions on key performance indicators for clusters and flows, helping administrators improve the operational management of NiFi.

Assumptions

  • Operational Analytics will be focused on internal metrics for NiFi (not analytics on data within a data flow)
  • Onboard models will be “lightweight,” working on smaller datasets in local repositories
  • Remote model support will be “heavier weight,” working on larger datasets in remote repositories

Proposed Phases

Phase 1: Framework Definition and Connection Analytics

Given the importance of back pressure as a key indicator of flow performance, the initial framework can be built to support models that predict back pressure occurrence on connections within the following contexts (a minimal estimation sketch follows the list):

  • Predicted time left until back pressure occurs
  • Predicted object/byte count in an upcoming time interval/window
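For the first context, one plausible “lightweight” onboard approach would be an ordinary least-squares fit of recent queue counts against time, extrapolated to the connection’s back pressure threshold. The sketch below is illustrative only; the class, method, and parameter names are hypothetical, and any real model would sit behind the Model API described later.

    // Hypothetical sketch: estimate time until back pressure by fitting a
    // least-squares line to queued-object counts and extrapolating to the
    // connection's configured threshold.
    public final class BackPressureEstimator {

        /**
         * @param times     observation timestamps in milliseconds
         * @param counts    queued object count at each timestamp
         * @param threshold back pressure object threshold for the connection
         * @param now       current time in milliseconds
         * @return millis from now until the fitted line crosses the threshold,
         *         or -1 if there is no upward trend to extrapolate
         */
        public static long millisToBackPressure(long[] times, double[] counts,
                                                double threshold, long now) {
            final int n = times.length;
            if (n < 2) {
                return -1; // not enough samples to fit a line
            }
            double meanT = 0, meanC = 0;
            for (int i = 0; i < n; i++) {
                meanT += times[i];
                meanC += counts[i];
            }
            meanT /= n;
            meanC /= n;

            double sumXY = 0, sumXX = 0;
            for (int i = 0; i < n; i++) {
                sumXY += (times[i] - meanT) * (counts[i] - meanC);
                sumXX += (times[i] - meanT) * (times[i] - meanT);
            }
            if (sumXX == 0 || sumXY <= 0) {
                return -1; // flat or shrinking queue: no crossing predicted
            }
            final double slope = sumXY / sumXX;   // objects per millisecond
            final double intercept = meanC - slope * meanT;
            final long crossing = (long) ((threshold - intercept) / slope);
            return Math.max(crossing - now, 0);
        }
    }

An online variant of the same idea would update the running sums incrementally as each new sample arrives, which aligns with the incremental-learning goal above.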

Requirements

# | Title / User Story | Importance | Notes
1 | Provide connection-specific predictions on the time remaining until back pressure occurs (objects/bytes) | MUST |
2 | Provide connection-specific predictions on the number of objects/bytes that will be queued within a given time frame | MUST |
3 | Ensure analytics can be set as optional | |
4 | Ensure that models and query times for analytics are configurable | |


User interaction and design

Users could have two primary methods for accessing predictions:

  1. Analytics REST endpoint - For a given component or cluster, users can access a specific endpoint to obtain one or more predictions available for that entity
  2. UI enhancements - Depending on the type of prediction, values can be surfaced either directly on a component or within a new analytics view on the canvas

The internal API could work as follows:

On startup, NiFi creates an Analytics Engine with access to repositories for pulling metric information. The engine would be responsible for instantiating an Analytics object that provides component-specific prediction capabilities (e.g. a Connection Analytics object). That object would be given an Analytics Model to use when running a prediction for a given component. The Model API would have no awareness of component types or of the specific predictions being made; it would only expose an API that accepts features and target values for prediction. Engines can have different implementations to support caching of Analytics objects where needed, especially for cases where objects use online learning models that require multiple samples before making predictions.
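A minimal sketch of what these interfaces might look like, assuming hypothetical names (AnalyticsModel, StatusAnalytics, StatusAnalyticsEngine) rather than a committed API:

    import java.util.Map;
    import java.util.stream.Stream;

    // Model API: has no awareness of component types or of which prediction
    // is being made; it only accepts feature vectors and target values.
    public interface AnalyticsModel {
        // Train the model, or incrementally update it for online models.
        void learn(Stream<Double[]> features, Stream<Double> targets);

        // Predict a target value for a single feature vector.
        Double predict(Double[] feature);
    }

    // Component-specific analytics object (e.g. a Connection Analytics
    // object) that maps repository metrics to model features and interprets
    // the model output as named predictions.
    public interface StatusAnalytics {
        // e.g. "timeToBackPressureMillis" -> 120000
        Map<String, Long> getPredictions();
    }

    // Engine created at startup with access to the metric repositories;
    // builds, and may cache, analytics objects for individual components.
    public interface StatusAnalyticsEngine {
        StatusAnalytics getStatusAnalytics(String componentId);
    }

Caching matters most for online learning models: a cached StatusAnalytics instance can keep accumulating samples across engine invocations.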

This would not only allow flexibility in choosing the model used to perform a prediction (e.g. a local vs. a remote model) but also provide an API for model execution that could be used throughout the NiFi ecosystem.

The engine could be invoked directly via a REST endpoint that provides predictions for a given component ID. Existing component status endpoints could also be enhanced to include prediction information on current status detail screens. In the longer term, new analytics endpoints could be added to the UI, along with analytics-specific views for components showing relevant metrics and visualizations of predictions where applicable.
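For illustration, a REST resource could delegate to the engine roughly as follows. This is a JAX-RS-style sketch building on the hypothetical interfaces above; the path and wiring are assumptions, not the actual NiFi endpoint design.

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.core.Response;

    // Hypothetical resource exposing predictions for a single connection.
    @Path("/connections/{id}/analytics")
    public class ConnectionAnalyticsResource {

        private final StatusAnalyticsEngine engine;

        public ConnectionAnalyticsResource(StatusAnalyticsEngine engine) {
            this.engine = engine; // engine is created once at startup
        }

        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public Response getPredictions(@PathParam("id") String connectionId) {
            // A caching engine may return a warm analytics object here so
            // that online models retain their accumulated samples.
            StatusAnalytics analytics = engine.getStatusAnalytics(connectionId);
            return Response.ok(analytics.getPredictions()).build();
        }
    }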

Future Enhancements


Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question | Outcome






Not Doing