Target release:
Document status: DRAFT
Document owner:

Goals

  • Improve users' ability to handle retry logic in NiFi

Background and strategic fit


Often, a Processor will route a FlowFile to a relationship where the user wants the FlowFile to be retried X number of times and then, after X failures, routed elsewhere.

The typical pattern in NiFi for handling retries is to route a 'failure' relationship back to the processor itself. This approach, however, cannot express the logic above (retry X times, then do something else); it simply retries forever.

This often results in users building complicated logic into the flow with multiple Processors, or using the RetryFlowFile processor. But this approach has its own downsides. It is tedious to set up. It is not very clean in the UI, making it less obvious what is happening in the flow. And most importantly, if we have a flow A -> B -> A and both connections fill up with backpressure, data can stop flowing entirely. Self-loops do not have this issue because the framework has special provisions to avoid that situation.

To address all of these issues, we should supply a mechanism that is driven by the framework in which any Processor can be easily configured to retry some specified number of times.

This is a complex task that involves updates to many parts of the framework, including UI, REST backend, data model, and processing/engine.

Tasks Required

1. UI: Provide ability to configure how many retries per Relationship
  • A user may want to retry 5 times for a 'failure' relationship. Or 10. Or 0. It should be up to the user to decide how many times a Relationship should be retried.
2. UI: Provide ability to configure backoff mechanism
  • When a FlowFile is to be retried, the user needs to be able to dictate the backoff policy. The user should be given the option of penalizing the FlowFile that is to be retried (this should be the default) or yielding the entire Processor.
    • There are cases where we want to maintain the ordering of FlowFiles, even during failures, and penalizing a FlowFile can result in the ordering being changed. As a result, we need the ability to configure the Processor to yield when retrying. In this case, no FlowFiles will be processed until the first FlowFile is ready to be retried.
3. UI: Provide ability to configure Max Backoff Period
  • When a FlowFile is penalized or a Processor is yielded, we should wait double the last retry period (i.e., use an exponential backoff) up to some configurable maximum amount of time. The user must be able to specify this maximum amount of time.
4. UI: Introduce new tab in Processor config dialog for Relationships

Currently, a user configures which Relationships should be auto-terminated in the Settings tab of the configuration dialog. We will now have more complex configuration for each Relationship and as such need a new tab in this dialog. The ability to configure which Relationships are auto-terminated should be removed from the Settings tab and moved to this new Relationships tab.

This tab should list all Relationships for the Processor and allow the user to configure whether or not to auto-terminate the Relationship, as it does now. It should also allow configuring whether or not FlowFiles routed to a given Relationship should be retried.

It will be valid to configure both a specific number of retries AND auto-terminating the Relationship. Or just one or the other.

This tab should also contain the Max Backoff Period and Backoff Mechanism outlined above.

5. Update DTO Data Model

The ProcessorConfigDTO data model must be updated to include the new configuration. The DTO will need the following members added:

Set<String> retriedRelationships; // The name of any Relationship that is to be retried.

String backoffMechanism; // Should be an enum, but the DTOs currently do not use enums; instead they use String objects with allowable values specified in the @ApiModelProperty annotation. Best to stay consistent.

String maxBackoffPeriod;


This means we will also need to ensure that we update the DtoFactory class to properly populate these values when creating the DTO.

6. Update ProcessorResource to validate arguments
  • The new retriedRelationships, backoffMechanism, and maxBackoffPeriod elements should have their values validated when attempting to perform a POST or a PUT to a Processor. The request should result in an IllegalArgumentException if one of the values is invalid: for example, if the max backoff period were set to "5" instead of "5 mins".
7. Update Flow Data Model
  • Update the serialization logic to persist these new configurations on the Processor and to handle the deserialization logic. The flow-configuration.xsd schema must also be updated.
8. Update Versioned Processor Config Data Model
  • Update the VersionedProcessor in much the same way that we update the DTO, so that changes to the processor are tracked in the Registry, exported versions, etc.
9. Update Flow Mapper
  • Update the FlowMapper to ensure that the new fields in the VersionedProcessor are populated when mapping a ProcessorNode to a VersionedProcessor.
10. Update Versioned Flow Comparator
  • Update the VersionedFlowComparator so that an update to the retry-logic configuration is recognized as a local change.
11. Update Flow Fingerprint
  • Update the flow fingerprint to ensure that nodes within the cluster have the same configuration values.
12. Update ProcessorNode
  • Update the ProcessorNode to hold the new retry configuration elements in much the same way as the DTO data model. For the ProcessorNode, however, it will be important that an enum be used for the backoff mechanism.
13. Update ProcessContext
  • Update the ProcessContext so that processor developers can determine whether or not a given Relationship is configured for retrying and how many times it is to be retried.
14. Update User Guide
  • A new section must be added to the User Guide explaining the new tab and how to configure the different options. Additionally, any screenshots that show the Processor configuration dialog are now outdated and should be updated.
15. Implement the Retry Logic
  • A description of the implementation logic is given below.
16. Update FileSystemSwapManager
  • Update the FileSystemSwapManager to ensure that the data written out includes the number of retries for a given FlowFile. Ensure that swap files written in the old format, which does not include this information, remain readable.
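The exponential backoff described in task 3 could be sketched as follows. This is an illustrative example only; the class and method names are hypothetical, not part of the NiFi codebase:

```java
// Hypothetical sketch of the backoff calculation from task 3: double the last
// retry period on each attempt, capped at a configurable maximum.
public class RetryBackoff {
    private final long maxBackoffMillis;

    public RetryBackoff(final long maxBackoffMillis) {
        this.maxBackoffMillis = maxBackoffMillis;
    }

    /**
     * Returns the penalty (or yield) duration for the given retry attempt,
     * starting from the initial period and doubling on each subsequent attempt,
     * never exceeding the configured maximum.
     */
    public long backoffMillis(final long initialPeriodMillis, final int retryAttempt) {
        long period = initialPeriodMillis;
        for (int i = 1; i < retryAttempt; i++) {
            period *= 2;
            if (period >= maxBackoffMillis) {
                return maxBackoffMillis;
            }
        }
        return Math.min(period, maxBackoffMillis);
    }
}
```

Whether this duration is applied by penalizing the FlowFile or by yielding the Processor depends on the configured backoff mechanism from task 2.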

Testing

This is a major feature, which touches critical parts of the framework. The following testing needs to be done, at a minimum, in order to consider the feature a success.

1. Update StandardProcessSessionIT
  • The majority of the logic updates will be in the StandardProcessSession. We need to ensure that the updates to this test are comprehensive.
2. Create System Test
  • A system-level test should be created to ensure that a Processor retries some number of times before routing to a configured relationship. This test should verify both that retries occur and that they stop after the configured number of attempts. It is easy to create a 'dummy processor' in the nifi-system-test-extensions-bundle module to make the behavior easy to verify.
3. Create Stateless System Test
  • A test similar to the NiFi System Test should also be created for Stateless NiFi to ensure that Stateless properly handles these cases.
4. Manual Testing
  • Everything needs to be manually tested in NiFi. This is perhaps most easily accomplished by configuring PutFile to fail if a duplicate file already exists and then causing that failure to occur. If the file is then manually deleted from the file system, the processor should succeed on the next attempt. Otherwise, after N tries, the FlowFile should be routed to the 'failure' relationship.

Implementation Logic

Most of the logic for implementing this feature will exist in the StandardProcessSession.

The StandardProcessSession's checkpoint() logic must be updated to take the following logic into consideration.

  • If a FlowFile is routed to a given Relationship, before considering if it is auto-terminated or if the relationship is cloned, etc., we must check if the Relationship is configured for Retry Logic.
    • If so, we must check how many times the FlowFile has been retried. This means we will have to add a new field to the FlowFileRecord object (and therefore the StandardFlowFileRecord object). This field can simply be an `int numRetries`. There is no need for a mapping of Relationship name to number of retries; this should instead be kept simple: after the FlowFile has been retried N times (based on the relationship it was routed to), it is done retrying. For example, if relationships 'ABC' and 'XYZ' are each to be retried 7 times, and the Processor routes FlowFile 1 to relationship 'ABC' 7 times and then to relationship 'XYZ', that is the 8th processing attempt, which is greater than 7, so the FlowFile should be routed to relationship 'XYZ' rather than retried again. Note that the number of retries is 'transient': it should not be serialized to the FlowFile Repository, and on restart of NiFi the count may reset to 0. However, it MUST be persisted to and restored from Swap Files; otherwise, once a queue reaches a certain size threshold, retries may no longer work.
    • If the FlowFile has not yet reached the threshold for retries (i.e., it must be retried again), the Process Session must:
      • Transfer the FlowFile back to its original queue (if there is one; otherwise, remove the FlowFile). The FlowFile should be penalized if the retry logic is configured to do so; otherwise, the Processor must be yielded.
      • Remove any FlowFile that was created as a child of this FlowFile. It will be very important here to ensure that the logic for cleaning up the Content Repository is correct!
      • Remove any Provenance Event for this FlowFile (or for any FlowFile that is a child of it).
    • We should still update the stats for the number of bytes Read/Written and the number of Tasks/Time. We should not update the number of FlowFiles/Bytes In/Out.
  • If the Relationship that the FlowFile is routed to is not configured for retry logic, OR if the configured number of retries has been reached, we should process the FlowFile as we normally would.
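The checkpoint-time decision above can be sketched as follows. This is a simplified illustration using stand-in types, not the actual framework classes; the names here are hypothetical:

```java
import java.util.Map;

// Hypothetical sketch of the retry decision made in checkpoint(): before
// auto-termination or cloning is considered, check whether the target
// relationship is configured for retry and whether the FlowFile's transient
// retry counter has reached the configured limit.
public class RetryDecision {
    public enum Outcome { RETRY, PROCESS_NORMALLY }

    /**
     * @param retriesPerRelationship configured retry counts, keyed by relationship name
     * @param relationship           the relationship the FlowFile was routed to
     * @param numRetriesSoFar        the FlowFile's single transient retry counter
     *                               (shared across relationships, per the design above)
     */
    public static Outcome decide(final Map<String, Integer> retriesPerRelationship,
                                 final String relationship,
                                 final int numRetriesSoFar) {
        final Integer maxRetries = retriesPerRelationship.get(relationship);
        if (maxRetries == null) {
            return Outcome.PROCESS_NORMALLY; // relationship not configured for retry
        }
        return numRetriesSoFar < maxRetries ? Outcome.RETRY : Outcome.PROCESS_NORMALLY;
    }
}
```

On a RETRY outcome, the session would requeue the FlowFile (penalizing it or yielding the Processor), remove child FlowFiles and their Provenance Events, and increment the counter; otherwise it proceeds with normal routing.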

It will also be important to take the performance impact of these changes into account. Specifically, there should be no noticeable impact on throughput when data is not retried. To verify this, we will need to run some simple flows, such as GenerateFlowFile → UpdateAttribute → UpdateAttribute, before and after the change with no retry logic configured. The performance should be nearly identical.
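The swap-file requirement from task 16 (the transient retry count must survive swapping, while old-format swap files without the field remain readable) could be handled with a simple version check. The sketch below is illustrative only; the version numbers and record layout are hypothetical simplifications of the real swap-file format:

```java
// Hypothetical sketch of versioned handling of the retry-count field in a
// (greatly simplified) swap record: new-format records carry the count,
// old-format records deserialize with the count defaulting to 0.
public class SwapRetryField {
    static final int VERSION_WITH_RETRIES = 2; // hypothetical version number

    /** Encode the retry count as part of a simplified swap record. */
    public static int[] encode(final int numRetries) {
        return new int[] { VERSION_WITH_RETRIES, numRetries };
    }

    /** Decode; records written before the field existed yield 0 retries. */
    public static int decode(final int[] record) {
        return record[0] >= VERSION_WITH_RETRIES ? record[1] : 0;
    }
}
```

The same default-to-zero behavior applies on NiFi restart, since the counter is deliberately not written to the FlowFile Repository.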


Future Considerations

The following are out of scope for this Feature Proposal but worth considering in the future:

  • Update Provenance for ROUTE events. Update the Provenance Events to always include which Relationship a FlowFile was routed to. If no Provenance Event is generated for a FlowFile, generate a ROUTE event in order to show where the FlowFile went. This is not done currently because, with a self-loop, it would generate massive numbers of Provenance Events for a given FlowFile. We should still account for the possibility of self-loops and not generate these events when a FlowFile is routed back to the same Connection.
  • Add a Retry Count property to Provenance Events. If we retry a FlowFile, indicate the number of times it was retried in its Provenance Events. This can help explain why a FlowFile took so long to process when looking at its lineage.