State Management improvements for Disaster Recovery scenarios

Target release	NiFi 2.0.0
Epic	Unable to render Jira issues macro, execution error. Unable to render Jira issues macro, execution error.
Document status	DRAFT
Document owner	Pierre Villard
Designer
Developers
QA

Goals

The goal is to provide options in NiFi to better support Disaster Recovery scenarios. This is a requirement for some users of NiFi running critical flows. In this case, we want to focus on the state of the components and how this state can be preserved from one deployment to another.

Background and strategic fit

In NiFi, when running extremely critical flows with aggressive SLAs, it may be required to consider Disaster Recovery scenarios to anticipate the complete loss of a NiFi deployment and have another deployment take over the execution of the flows (active/passive DR scenario).

Assuming a critical flow like:

ListS3 → FetchS3 → DoSomething → PutDatabaseRecord

The requirement a customer may have is that the loss of a NiFi deployment should not prevent me from having another NiFi deployment (in another region / data center for example) run the exact same flow and have limited downtime regardless of how long it would take for the down region to come back up. Note that the recently added feature of running a process group using the stateless engine helps with making sure that in-flight data would not be stuck in the down NiFi and the newly active NiFi could take things over.

The challenge with this scenario is that the ListS3 processor (like many others) has a state that is stored in the configured state manager. The state manager (Zookeeper, Redis, Kubernetes) is usually coupled with the deployment and we should assume that the state manager would also be down in case the NiFi deployment goes down (Redis does provide some replication options across multiple regions).

To help with this problem, we could imagine a couple of solutions:

Provide implementations of a State Manager that easily support cross regions replications. However, I don't think this is the direction we want to take considering more and more k8s based deployments where the k8s state manager (ConfigMap on top of etcd) is used.
Provide a composite sate manager where it'd be possible in a NiFi deployment to define multiple instances for the state manager. One instance would be the leader, one instance would be the follower. Any write would be done to both instances, a read would be made against the leader, and could fail over to the follower if required.

Having said that, this improvement would not be enough. Currently the state of a processor is strictly coupled with its UUID. Unless a user is moving the flow.json.gz file from one NiFi deployment to the other, it's unlikely that the processors with a state would have the same UUID. To help with this problem, we could provide the option for the user to specify a custom state identifier. When specified the state would be stored using this state ID, and this state ID would be preserved if deploying the flow using NiFi Registry for example.

Assumptions

The above approach is assuming that we're in an active/passive type of deployment. In case of a disaster on the active NiFi deployment, the passive NiFi deployment would be started and would take over the execution of the flows.

Requirements

#	Title	User Story	Importance	Notes
1	Composite State Manager	Being able to specify multiple state manager	Must have	Unable to render Jira issues macro, execution error.
2	Custom State Identifier	Being able to specify a custom state identifier	Must have	Unable to render Jira issues macro, execution error.

User interaction and design

Assumption is that in the passive region, NiFi flows are not running (NiFi could be up with flows stopped, or NiFi is stopped) and the state manager is up and running.

Under normal operations

NiFi R1 is configured with State Manager R1 as leader and State Manager R2 as a follower. When the processor is working and saving the state we first write the state in SMR1, and we try to write the state SMR2 (if it fails, this should be a WARNING but not impact operations). In case the SMR2 is unavailable for some time, it should not block operations on the active site and if the SMR2 comes back up we would write state as expected.

In case changes are made to the flow definition, we're assuming that the NiFi Registry is used, and when a flow is upgraded to a new version this is done on both regions. It is easy to have a NiFi Registry on each region with a database backend that would be replicated across the regions.

R1 goes down, R2 becomes active

Manual action is done to start the flows running in R2, the flows would using whatever state is available in SMR2 (NiFi R2 is configured with SMR2 as the leader and SMR1 as the follower). This way the flows are starting with a proper state according to whatever is the latest state committed. We do not guarantee that we would not pick up a given file twice (if R1 goes down before state has been committed into SMR2) but this would be an acceptable scenario and it's up to the user to account for potential duplicates.

When running the flows, NiFi would store state in SMR2 and fail to write into SMR1 until the state manager of the first region is back up.

R1 is back up, R2 goes back to passive

Once Region 1 is available again the NiFi R2 would be writing state in SMR1. At this point it's possible to stop the flows in NiFi R2 and start again the flows in NiFi R1 to get back into the normal operations state with R1 active and R2 passive.

---

Considerations for the custom state identifier

Providing the option to specify a custom state identifier comes with a set of requirements:

- The state identifier should be using a strict pattern to make sure this does not cause issue with state manager implementations
- The state identifier should be unique across the NiFi cluster:
  - The custom state identifier should default to the UUID of the component to preserve current behavior
  - If someone tries to set a custom state identifier that already exists, it should fail
  - If someone instantiates multiple instances of the same flow where a component has a custom state identifier, a warning should be raised and the custom state identifier should be set to the UUID of the component
- The state identifier needs to be saved as part of the flow definition (VersionedFlowSnapshot) for versioning with the NiFi Registry
- There should be a CLI command to set the name of the custom state identifier for a given component. The scenario is that some users are instantiating multiple instances of the same versioned flow but with a different set of parameter values. It should be possible to easily set a custom state identifier as part of the instantiation of a flow from the NiFi Registry

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question	Outcome

Not Doing

Space shortcuts

Child pages