...

This Feature Proposal aims to simplify flow configuration by allowing the user to configure a Connection and choose a Load Balancing Strategy: "Do Not Load Balance" (the default), "Round Robin" (which automatically fails over if a node cannot be reached), "Single Node" (all data goes to a single node in the cluster), or "Partition by Attribute" (more details below). By configuring the flow this way, a user can simply create a flow such as ListSFTP → FetchSFTP and configure the connection to use the "Round Robin" Load Balancing Strategy. This automatically spreads the listing throughout the cluster and requires no Remote Process Groups or Ports to be created. The flow becomes much more self-contained, clear, and concise.
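The round-robin-with-failover behavior described above could be sketched as follows. This is an illustrative sketch only; the class name, node identifiers, and reachability check are placeholders, not the actual implementation:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

// Minimal sketch: round-robin node selection that fails over past
// nodes that cannot be communicated with.
public class RoundRobinSelector {
    private final List<String> nodes;
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinSelector(List<String> nodes) {
        this.nodes = nodes;
    }

    // Returns the next reachable node in round-robin order, skipping any
    // node that is unreachable; returns null if no node is reachable.
    public String next(Predicate<String> isReachable) {
        for (int attempts = 0; attempts < nodes.size(); attempts++) {
            String candidate = nodes.get((int) (counter.getAndIncrement() % nodes.size()));
            if (isReachable.test(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

In practice the framework would also have to requeue data destined for an unreachable node, but the selection logic itself is this simple.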

...

  • Do Not Load Balance: This will be the default value and will cause the connection to behave as all Connections do currently.
  • Round Robin: FlowFiles will be distributed across the cluster in a round-robin fashion. If a node is disconnected from the cluster, or if a node cannot be communicated with, the data queued for that node will be automatically redistributed to one or more other nodes.
  • Single Node: All FlowFiles will be distributed to a single node in the cluster. If that node is disconnected from the cluster, or cannot be communicated with, the data queued for it will remain queued until the node is available again. Which node the data goes to will not be configurable; all connections configured with this strategy will send data to the same node. There was some consideration given to allowing the user to specify which node the data should go to, so that data could be spread across the cluster at the user's discretion. However, this has significant downsides: as clusters become more elastic, nodes may be added or removed more arbitrarily, making the selected value invalid and requiring manual intervention. Additionally, storing this value in a Flow Registry or a template would make the configuration invalid in any other environment. Another approach that was considered was to send all data to the Primary Node; this has the downside, however, of requiring that all data be transferred any time the Primary Node changes. To avoid this, we will simply choose a node at the framework's discretion and send all data to that node.
  • Partition by Attribute: The user will provide the name of a FlowFile Attribute to partition the data by. All FlowFiles that have the same value for the specified attribute will be distributed to the same node in the cluster. If a FlowFile does not have a value for that attribute, the absence of the attribute (i.e., a value of `null`) will itself be treated as a value, so all FlowFiles that lack the attribute will be sent to the same node. If the destination node is disconnected from the cluster or otherwise cannot be communicated with, the data should not fail over to another node; it should remain queued until the node becomes available again. If the topology of the cluster changes, the data will be rebalanced. Consistent Hashing should be used in order to avoid redistributing all of the data when a node joins or leaves the cluster.
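The "Partition by Attribute" mapping described above can be sketched with a consistent-hash ring. This is an illustrative sketch only; the class name, the virtual-node count, and the use of CRC32 as the hash function are assumptions, not the actual implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Minimal consistent-hashing sketch: each node occupies many points
// ("virtual nodes") on a ring; an attribute value maps to the first
// node point at or after its hash. Adding or removing one node only
// remaps the partitions adjacent to that node's points.
public class ConsistentHashRing {
    private static final int VIRTUAL_NODES = 100; // illustrative choice
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String nodeId) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(nodeId + "#" + i), nodeId);
        }
    }

    public void removeNode(String nodeId) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(nodeId + "#" + i));
        }
    }

    // A missing attribute (null) is treated as its own partition value,
    // so all FlowFiles lacking the attribute map to the same node.
    public String nodeFor(String attributeValue) {
        if (ring.isEmpty()) {
            return null;
        }
        long h = hash(attributeValue == null ? "<null>" : attributeValue);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

The key property, which motivates the choice of Consistent Hashing in the proposal, is that removing a node only remaps the partitions that node owned; all other partitions keep their current owner, so most queued data never has to move.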

...

Question: What permissions should be required/checked when receiving data from a load-balancing protocol?

Outcome: Currently, there is a policy for "Modify the Data". We should require that the node sending data has been granted this permission. Connections themselves, however, do not currently have permissions of their own; this is driven by the Source and Destination components, so in order for a node to load balance data across the cluster, all nodes would need to be granted the "Modify the Data" permission for the source and destination processors (typically best managed at the Process Group level). Rather than relying on explicit permissions, we will ensure that data sent in a secure cluster comes from a node whose certificate is trusted by the configured TrustStore and, moreover, that the certificate's Distinguished Name or one of its Subject Alternative Names maps to the Node Identity of one of the nodes in the cluster. In this way, we ensure that the data comes from a node that is in fact part of the cluster, and this scales nicely even as clustering becomes more elastic.
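The identity check described above could be sketched as follows. This is an illustrative sketch only; the class name and the plain-string identity representation are assumptions (a real implementation would extract the DN and SANs from the peer's `X509Certificate` after TrustStore validation):

```java
import java.util.Collection;
import java.util.Set;

// Sketch of the trust check: a peer is accepted only if its certificate's
// Distinguished Name, or one of its Subject Alternative Names, matches the
// Node Identity of a node that is currently part of the cluster.
public class NodeIdentityVerifier {
    private final Set<String> knownNodeIdentities;

    public NodeIdentityVerifier(Set<String> knownNodeIdentities) {
        this.knownNodeIdentities = knownNodeIdentities;
    }

    public boolean isTrustedClusterNode(String distinguishedName,
                                        Collection<String> subjectAlternativeNames) {
        if (knownNodeIdentities.contains(distinguishedName)) {
            return true;
        }
        return subjectAlternativeNames.stream().anyMatch(knownNodeIdentities::contains);
    }
}
```

Because the set of known identities is derived from the current cluster topology rather than from statically configured policies, the check keeps working as nodes join and leave.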

Question: How should disconnected nodes be handled?

Outcome: A node that has been disconnected from the cluster should continue to load balance the data in its queue to all nodes in the cluster. However, the nodes that are still connected should not send data to the disconnected node.

This approach mimics how Site-to-Site works today and lays the groundwork for later work to decommission a node in a cluster. Because a disconnected node will not receive any new notifications about cluster topology updates, it may start sending data to the wrong node when using the "Partition by Attribute" strategy. This must be acceptable: the node that receives such data is responsible for determining which node in the cluster should receive it and distributing the data appropriately.
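The receive-side re-routing described above could be sketched as follows. This is an illustrative sketch only; the class name and the partition-owner lookup are placeholders, not the actual implementation:

```java
import java.util.function.Function;

// Sketch: a node that receives data from a peer with a stale view of the
// topology checks the current partition owner and forwards the data when
// it is not the owner itself.
public class PartitionedReceiver {
    private final String localNodeId;
    private final Function<String, String> partitionOwner; // attribute value -> owning node

    public PartitionedReceiver(String localNodeId, Function<String, String> partitionOwner) {
        this.localNodeId = localNodeId;
        this.partitionOwner = partitionOwner;
    }

    // Returns the node that should process the data: the local node if it
    // owns the partition, otherwise the current owner to forward to.
    public String routeFor(String attributeValue) {
        String owner = partitionOwner.apply(attributeValue);
        return localNodeId.equals(owner) ? localNodeId : owner;
    }
}
```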

Question: Do we need a "Static Partitioning Strategy"? This would allow the user to specify one node in the cluster, with all data distributed directly to that node.

Outcome: This could be accomplished by using "Partition by Attribute" and setting the attribute to a non-existent attribute such as "I Do Not Exist", but then all connections configured this way would go to the same node, and the user experience would be subpar. Exposing a specific strategy that allows the user to choose a node makes it possible for the user to choose, for each connection, which node the data should go to, and therefore to distribute data more explicitly.



Not Doing