
Page properties

Target release:
Epic:
Document status: DRAFT
Document owner:
Designer:
Developers:
QA:

Goals

  • Allow the user to indicate that data in a Connection should be spread across the cluster instead of just queuing up on the node that received it

...

This Feature Proposal aims to simplify the configuration of the flow by allowing the user to configure a Connection and choose a Load Balancing Strategy: "Do Not Load Balance" (the default), "Round Robin" (which would automatically fail over if unable to communicate with a node), "Single Node" (all data goes to a single node in the cluster), or "Partition by Attribute" (more details below). By enabling the user to configure the flow this way, a user will be able to simply create a flow as ListSFTP → FetchSFTP and configure the connection to use the "Round Robin" Load Balancing Strategy. This would automatically spread the listing throughout the cluster and would require no Remote Process Groups or Ports to be created. The flow then becomes much more self-contained, clear, and concise.
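For illustration only, the four strategies described above could be modeled as a simple enumeration; the names below are taken from the option labels and are an assumption, not a committed API:

    // Illustrative sketch of the proposed Load Balancing Strategy options (names are assumptions).
    public enum LoadBalanceStrategy {
        DO_NOT_LOAD_BALANCE,    // default: data stays on the node where it was received
        ROUND_ROBIN,            // distribute evenly across nodes, failing over if a node is unreachable
        SINGLE_NODE,            // send all data in the connection to a single node
        PARTITION_BY_ATTRIBUTE  // route FlowFiles based on the value of a user-specified attribute
    }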

Assumptions

Requirements

# | Title | Importance | Notes
1 | Provide protocol for distributing data between nodes in the cluster | MUST |
2 | Support secure data transfer between nodes | MUST | Inter-node data transfer should be performed via SSL/TLS if a Keystore and Truststore are configured.
3 | Expose configuration of load balancing properties in nifi.properties | MUST |
4 | Expose configuration in the user interface for selecting a Load Balancing Strategy | MUST |
5 | Update User Guide images to depict the new Connection Configuration dialog | MUST |
6 | Update Admin Guide to explain the new nifi.properties properties | MUST |
7 | Update User Guide to explain how the feature works | MUST |
8 | Persist state about the nodes in the cluster and restore this state upon restart | SHOULD | When the cluster topology changes, data that is partitioned by attribute will have to be rebalanced. Upon a restart of the entire cluster, this can result in a lot of data being pushed around. To avoid this, we should persist the nodes in the cluster and restore this information upon restart (see the sketch following this table). If the cluster is shut down and one node is not intended to be brought back up, the user will be responsible for removing that node from the cluster via the "Cluster" screen in the UI.
9 | Provide a visual indicator in the UI as to how many FlowFiles are awaiting transfer to another node | SHOULD | If the UI does not easily provide an indicator of how many FlowFiles are waiting to be transferred, users may well be confused about why the connection shows X number of FlowFiles but the destination is not processing them.
10 | Allow the user to configure whether or not compression is used | SHOULD | For some data, such as data that is known to already be GZIP'd, it will not make sense to compress the content during transfer. For other data, compression can be very helpful. Additionally, even when the FlowFile content cannot be usefully compressed, the attributes likely can, so an option should be provided to compress Nothing, FlowFile Attributes Only, or FlowFile Attributes and Content. This may well be implemented at a later time.
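As a rough illustration of requirement 8, the known cluster node identities could be written to a local state file on shutdown and read back on startup; the file location, format, and class name below are assumptions, not the actual implementation:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Set;
    import java.util.TreeSet;

    // Sketch only: persist the known cluster node identities so that "Partition by Attribute"
    // data does not need to be rebalanced after a full cluster restart (requirement 8).
    class ClusterNodeStatePersistence {
        private final Path stateFile = Path.of("state/cluster-nodes.txt"); // hypothetical location

        void save(Set<String> nodeIdentities) throws IOException {
            Files.createDirectories(stateFile.getParent());
            Files.write(stateFile, new TreeSet<>(nodeIdentities)); // one node identity per line
        }

        Set<String> restore() throws IOException {
            return Files.exists(stateFile)
                    ? new TreeSet<>(Files.readAllLines(stateFile))
                    : new TreeSet<>();
        }
    }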

User interaction and design

...

This will necessitate adding the following properties to nifi.properties:

Proposed Property Name | Description | Default Value
nifi.cluster.load.balance.port | The port to listen on for incoming connections for load balancing data | 6342
nifi.cluster.load.balance.host | The hostname to listen on for incoming connections for load balancing data | The same as the `nifi.cluster.node.address` property, if specified; otherwise localhost
nifi.cluster.load.balance.connections.per.node | The number of TCP connections to make to each node in the cluster for load balancing | 4
nifi.cluster.load.balance.max.threads | The maximum size of the thread pool to use for load balancing data across the cluster | 8
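For illustration, a node accepting the proposed defaults would carry an excerpt like the following in nifi.properties (the values shown are simply the defaults from the table above, not required settings):

    # Load-balanced connections (values shown are the proposed defaults)
    # Leave the host blank to fall back to nifi.cluster.node.address, or localhost if that is also unset
    nifi.cluster.load.balance.host=
    nifi.cluster.load.balance.port=6342
    nifi.cluster.load.balance.connections.per.node=4
    nifi.cluster.load.balance.max.threads=8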


It will be important to avoid heap exhaustion now that we are maintaining multiple internal queues of FlowFiles. While the default is to swap out any FlowFiles beyond 20,000 in increments of 10,000, we cannot do this for each internal queue. If we did, then a cluster of 10 nodes could end up holding 10 * 20,000 = 200,000 FlowFiles in heap for a single connection, which would quickly exhaust the heap. As a result, we will swap out data more aggressively in the queues that are responsible for distributing to other nodes - likely after 1,000 FlowFiles have been reached, in increments of 1,000. The "local partition," however, can continue to swap out at the 20,000 threshold in increments of 10,000.
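As a rough sketch of that idea, a load-balanced connection could hold one partition per destination node alongside its local partition, with the per-node partitions swapping to disk much more aggressively; the class, record, and field names below are illustrative only and are not the actual NiFi types:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: per-node partitions use a much smaller swap threshold than the local partition
    // so that a large cluster cannot exhaust the heap with queued FlowFiles.
    class LoadBalancedQueueSketch {
        record Partition(int swapThreshold, int swapIncrement) { }

        private final Partition localPartition = new Partition(20_000, 10_000); // existing defaults
        private final Map<String, Partition> remotePartitions = new HashMap<>();

        void addNode(String nodeIdentifier) {
            remotePartitions.put(nodeIdentifier, new Partition(1_000, 1_000)); // more aggressive swapping
        }
    }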

...

Below is a list of questions to be addressed as a result of this requirements document:

Question: What permissions should be required/checked when receiving data via the load balancing protocol?

Outcome: Rather than relying on explicit permissions, we will ensure that data sent in a secure cluster comes from a node whose certificate is trusted by the configured Truststore and, moreover, that the certificate's Distinguished Name or one of its Subject Alternative Names maps to the Node Identity of one of the nodes in the cluster. In this way, we ensure that the data comes from a node that is in fact part of the cluster, and this approach scales nicely even as we move toward more elastic clustering. (A sketch of this check follows below.)

Question: How should Disconnected nodes be handled?

Outcome: A node that has been disconnected from the cluster should continue to load balance the data in its queue to all nodes in the cluster. However, the nodes that are connected should not send data to the disconnected node.

This approach mimics how Site-to-Site works today and lays the groundwork for later work to decommission a node in a cluster. If a node is disconnected, it will not receive any new notifications about cluster topology updates. Therefore, it is possible that it will send data to the wrong node when using the "Partition by Attribute" strategy. This needs to be acceptable, and the node that receives such data needs to be responsible for determining which node in the cluster should receive it and distributing it appropriately.
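As a sketch of the identity check described in the first outcome above (not the actual implementation; the class and method names are assumptions), the receiving node could compare the peer certificate's Distinguished Name and Subject Alternative Names against the known node identities:

    import java.security.cert.X509Certificate;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import javax.net.ssl.SSLSession;

    // Sketch only: accept load-balanced data only if the peer certificate's DN or one of its
    // Subject Alternative Names matches the Node Identity of a node known to be in the cluster.
    class PeerIdentityCheck {
        static boolean isTrustedClusterNode(SSLSession session, Collection<String> clusterNodeIdentities) throws Exception {
            X509Certificate peerCert = (X509Certificate) session.getPeerCertificates()[0];

            List<String> presentedIdentities = new ArrayList<>();
            presentedIdentities.add(peerCert.getSubjectX500Principal().getName()); // Distinguished Name

            Collection<List<?>> sans = peerCert.getSubjectAlternativeNames();
            if (sans != null) {
                for (List<?> san : sans) {
                    presentedIdentities.add(String.valueOf(san.get(1))); // index 1 holds the SAN value
                }
            }

            return presentedIdentities.stream().anyMatch(clusterNodeIdentities::contains);
        }
    }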



Not Doing