Target release:
Epic:
Document status: DRAFT
Document owner: Mark Payne
Designer:
Developers:
QA:

Goals

  • Provide the ability to replicate FlowFile content and attributes in such a way that, if a node in a NiFi cluster is lost, one or more other nodes can quickly and automatically resume processing the data.

Background and strategic fit

It is increasingly important for many organizations that data processing be both timely and highly available, even in the face of failures. With the current design and implementation of the Content and FlowFile Repositories, if a NiFi node is lost, its data will not be processed until that node is brought back online. While this is acceptable for many use cases, there are many others in which it is not.

While a RAID configuration can mitigate the risk of data loss by providing high-performance local replication, we have heard from several organizations that, in order to make use of NiFi, they need the system to automatically fail over to a new node if one node is lost.

Assumptions

Requirements

| # | Title | User Story | Importance | Notes |
| --- | --- | --- | --- | --- |
| 1 | Replicate FlowFile Content | If a particular NiFi node is lost (due to machine failure, etc.), the data needs to be made available in such a way that other nodes in the cluster can retrieve the data and process it themselves. | Must Have | |
| 2 | Replicate FlowFile Attributes | If a particular NiFi node is lost (due to machine failure, etc.), the FlowFile information (attributes, current queue identifier, metadata, etc.) needs to be made available in such a way that other nodes in the cluster can retrieve the information and process the FlowFiles themselves. | Must Have | |
| 3 | Ability to Failover | At any time, all nodes in a NiFi cluster must be able to begin processing data that was previously "owned" by another node. This must be orchestrated in such a way that the data is processed only once and that, if the original "owner" returns to the cluster, it does not re-process the data itself. | Must Have | |
| 4 | Automatic Failover | In addition to the ability of a node to begin processing data from another node, this failover must happen quickly and in an automated fashion. It should not require human intervention. | Must Have | |
| 5 | Replicate "Swap Files" | If one of the NiFi queues reaches a certain (configurable) capacity, NiFi will "swap out" the FlowFiles, serializing the FlowFile information to disk and removing the FlowFiles from the Java heap in order to ensure that the JVM does not run out of memory. These swap files must also be replicated so that another node that resumes the work of a failed node is able to recover those FlowFiles without running out of heap space. | Must Have | |
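
To make requirement 3 more concrete, the following is a minimal sketch of one way ownership could be tracked so that only a single node takes over a failed node's data and a returning "owner" can tell that it has been superseded. Everything here is hypothetical and not part of NiFi: the class `OwnershipRegistry`, the notion of a "partition" of replicated FlowFiles, and the epoch counter are all assumptions made for illustration. The state is held in memory for simplicity; in a real cluster it would presumably live in a shared coordination service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not a NiFi API): tracks which node currently owns
 * each partition of replicated FlowFiles.
 */
final class OwnershipRegistry {

    /** Immutable ownership record: the owning node plus an epoch that grows on every takeover. */
    static final class Owner {
        final String nodeId;
        final long epoch;

        Owner(String nodeId, long epoch) {
            this.nodeId = nodeId;
            this.epoch = epoch;
        }
    }

    private final Map<String, Owner> owners = new ConcurrentHashMap<>();

    /** Registers the first owner of a partition; if one already exists, it is returned unchanged. */
    Owner register(String partitionId, String nodeId) {
        return owners.computeIfAbsent(partitionId, id -> new Owner(nodeId, 1L));
    }

    /**
     * Attempts to take over a partition from a failed owner. The call succeeds
     * only if the caller observed the current epoch, so when several nodes
     * compete for the same partition, at most one of them wins.
     */
    boolean tryTakeOver(String partitionId, String newNodeId, long observedEpoch) {
        Owner current = owners.get(partitionId);
        if (current == null || current.epoch != observedEpoch) {
            return false; // someone else already took over, or the partition is unknown
        }
        // Atomic compare-and-set on the exact record we observed.
        return owners.replace(partitionId, current, new Owner(newNodeId, observedEpoch + 1));
    }

    /** A node may process a partition only while it holds the current ownership record. */
    boolean mayProcess(String partitionId, String nodeId, long epoch) {
        Owner current = owners.get(partitionId);
        return current != null && current.nodeId.equals(nodeId) && current.epoch == epoch;
    }
}
```

In this sketch, a node that returns to the cluster would call `mayProcess` with the epoch it held before it was lost; because a takeover increments the epoch, the check fails and the node knows not to re-process the data. This only prevents two nodes from both believing they own the same data; it does not by itself guarantee that each FlowFile is processed exactly once.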

User interaction and design

Replicating FlowFile Content

...

A separate Feature Proposal exists for State Management. If that proposal is implemented first, it will greatly simplify the ZooKeeper communication required for this feature.

Questions

Below is a list of questions to be addressed as a result of this requirements document:

| Question | Outcome |
| --- | --- |

Not Doing