Target release:
Epic:

Document status: DRAFT
Document owner: Designer


Goals

Create a lightweight internal analytics framework to support prediction of NiFi cluster behavior. This framework would:

  • Provide a flexible internal analytics engine and model API for NiFi metrics that supports adding or enhancing onboard models
  • Support both traditional and online (incremental) learning methods
  • Support integration of remote or cloud-based ML models
  • Provide support for model caching (with possible later inclusion in a model repository or registry)
  • Provide UI enhancements to display prediction information in existing summary data, in new data visualizations, or directly within the flow/canvas (where applicable)

Background and strategic fit

NiFi currently provides cluster-, flow-, and component-specific metrics that can be viewed in its UI or exported via several Reporting Task implementations. With this data, users can assess the real-time health and performance of a cluster and also predict cluster behavior such as back pressure occurrence, memory utilization, flow rates, and service anomalies. This information could then be used to act proactively, such as triggering alerts or notifications, or performing automated tasks such as scaling or configuration updates to maintain SLAs.

Today, users who collect metrics for these activities typically export them to other systems such as Prometheus, ELK stacks, or Ambari Metrics for analysis. These systems are efficient at capturing, analyzing, and visualizing metric data; however, they require additional customization and integration work, along with knowledge of NiFi operations, to provide meaningful analytics within a data flow context. They also require users to navigate and configure multiple applications to discover information on NiFi behavior and important trends. This highlights an opportunity to introduce an analytics framework that can give users reasonable predictions on key performance indicators for clusters and flows, helping administrators improve the operational management of NiFi.

Assumptions

  • Operational Analytics will be focused on internal metrics for NiFi (not analytics on data within a data flow)
  • Onboard models will be “lightweight,” working on smaller datasets in local repositories
  • Remote model support will be “heavier weight,” working on larger datasets in remote repositories

Proposed Phases

Phase 1: Framework Definition and Connection Analytics

Given the importance of back pressure as a key indicator of flow performance, the initial framework can be built to support models that predict back pressure occurrence on connections within the following contexts (a minimal estimation sketch follows the list):

  • Predicted time left until back pressure occurs
  • Predicted object/byte count in an upcoming time interval/window
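For the first context, one plausible “lightweight” onboard approach would be an ordinary least-squares fit of recent queue counts against time, extrapolated to the connection’s back pressure threshold. The sketch below is illustrative only; the class, method, and parameter names are hypothetical, and any real model would sit behind the Model API described later.

    // Hypothetical sketch: estimate time until back pressure by fitting a
    // least-squares line to queued-object counts and extrapolating to the
    // connection's configured threshold.
    public final class BackPressureEstimator {

        /**
         * @param times     observation timestamps in milliseconds
         * @param counts    queued object count at each timestamp
         * @param threshold back pressure object threshold for the connection
         * @param now       current time in milliseconds
         * @return millis from now until the fitted line crosses the threshold,
         *         or -1 if there is no upward trend to extrapolate
         */
        public static long millisToBackPressure(long[] times, double[] counts,
                                                double threshold, long now) {
            final int n = times.length;
            if (n < 2) {
                return -1; // not enough samples to fit a line
            }
            double meanT = 0, meanC = 0;
            for (int i = 0; i < n; i++) {
                meanT += times[i];
                meanC += counts[i];
            }
            meanT /= n;
            meanC /= n;

            double sumXY = 0, sumXX = 0;
            for (int i = 0; i < n; i++) {
                sumXY += (times[i] - meanT) * (counts[i] - meanC);
                sumXX += (times[i] - meanT) * (times[i] - meanT);
            }
            if (sumXX == 0 || sumXY <= 0) {
                return -1; // flat or shrinking queue: no crossing predicted
            }
            final double slope = sumXY / sumXX;   // objects per millisecond
            final double intercept = meanC - slope * meanT;
            final long crossing = (long) ((threshold - intercept) / slope);
            return Math.max(crossing - now, 0);
        }
    }

An online variant of the same idea would update the running sums incrementally as each new sample arrives, which aligns with the incremental-learning goal above.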

Requirements

# | Title / User Story | Importance | Notes
1 | Provide connection-specific predictions on the time remaining until back pressure occurs (objects/bytes) | MUST |
2 | Provide connection-specific predictions on the number of objects/bytes that will be queued within a given time frame | MUST |
3 | Ensure analytics can be set as optional | |
4 | Ensure that models and query times for analytics are configurable | |


User interaction and design

Users could have two primary methods for accessing predictions:

  1. Analytics REST endpoint - For a given component or cluster, users can access a specific endpoint to obtain one or more predictions available for that entity
  2. UI enhancements - Depending on the type of prediction, values can be surfaced either directly on a component or within a new analytics view on the canvas

The internal API could work as follows:

On startup, NiFi creates an Analytics Engine with access to repositories for pulling metric information. The engine would be responsible for instantiating an Analytics object that provides component-specific prediction capabilities (e.g. a Connection Analytics object). That object would be given an Analytics Model to use when running a prediction for a given component. The Model API would have no awareness of component types or of the specific predictions being made; it would only expose an API that accepts features and target values for prediction. Engines can have different implementations to support caching of Analytics objects where needed, especially for cases where objects use online learning models that require multiple samples before making predictions.
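A minimal sketch of what these interfaces might look like, assuming hypothetical names (AnalyticsModel, StatusAnalytics, StatusAnalyticsEngine) rather than a committed API:

    import java.util.Map;
    import java.util.stream.Stream;

    // Model API: has no awareness of component types or of which prediction
    // is being made; it only accepts feature vectors and target values.
    public interface AnalyticsModel {
        // Train the model, or incrementally update it for online models.
        void learn(Stream<Double[]> features, Stream<Double> targets);

        // Predict a target value for a single feature vector.
        Double predict(Double[] feature);
    }

    // Component-specific analytics object (e.g. a Connection Analytics
    // object) that maps repository metrics to model features and interprets
    // the model output as named predictions.
    public interface StatusAnalytics {
        // e.g. "timeToBackPressureMillis" -> 120000
        Map<String, Long> getPredictions();
    }

    // Engine created at startup with access to the metric repositories;
    // builds, and may cache, analytics objects for individual components.
    public interface StatusAnalyticsEngine {
        StatusAnalytics getStatusAnalytics(String componentId);
    }

Caching matters most for online learning models: a cached StatusAnalytics instance can keep accumulating samples across engine invocations.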

This would not only allow flexibility in choosing the model used to perform a prediction (e.g. a local vs. a remote model) but also provide an API for model execution that could be used throughout the NiFi ecosystem.

The engine could be invoked directly via a REST endpoint that provides predictions for a given component ID. Existing component status endpoints could also be enhanced to include prediction information on current status detail screens. In the longer term, new analytics endpoints could be added to the UI, along with analytics-specific views for components showing relevant metrics and visualizations of predictions where applicable.
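For illustration, a REST resource could delegate to the engine roughly as follows. This is a JAX-RS-style sketch building on the hypothetical interfaces above; the path and wiring are assumptions, not the actual NiFi endpoint design.

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.core.Response;

    // Hypothetical resource exposing predictions for a single connection.
    @Path("/connections/{id}/analytics")
    public class ConnectionAnalyticsResource {

        private final StatusAnalyticsEngine engine;

        public ConnectionAnalyticsResource(StatusAnalyticsEngine engine) {
            this.engine = engine; // engine is created once at startup
        }

        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public Response getPredictions(@PathParam("id") String connectionId) {
            // A caching engine may return a warm analytics object here so
            // that online models retain their accumulated samples.
            StatusAnalytics analytics = engine.getStatusAnalytics(connectionId);
            return Response.ok(analytics.getPredictions()).build();
        }
    }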

Future Enhancements


Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question | Outcome






Not Doing