Comment: Failover, new nodes, reducing number of messages, redeployment.

...

  • Error during service initialization on a node included in the assignment. In this case the problematic node sends failure details to the coordinator over the communication protocol. Once the coordinator receives the failure details, it recalculates assignments and, if needed, sends another discovery message with the updated assignments. All nodes should be retried in turn. If all nodes suitable for deployment fail to deploy a service, then the coordinator sends a discovery message containing this information to all nodes, so the deploying methods can throw a corresponding exception with information about the failure or a partial deployment.
  • Failure of a node included in the assignment. This situation triggers recalculation of service deployment assignments. The coordinator node sends another discovery message with a set of new assignments in it. If a node has already initialized a service and is not present in the new assignment set, then the service on that node should be cancelled.
  • Coordinator failure. This situation is processed in a similar way as the previous one. The only difference is that the nodes should resend deployment results to the new coordinator.
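
The node-failure case above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class AssignmentSketch, the methods recalculate and mustCancel, and the round-robin placement policy are all invented for this example. The IEP does not prescribe a placement algorithm, and treating maxPerNode == 0 as "no limit" is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: recalculating assignments after a node failure. */
final class AssignmentSketch {
    /**
     * Spreads {@code total} instances over the alive nodes round-robin.
     * {@code maxPerNode == 0} is treated as "no per-node limit" (an assumption).
     */
    static Map<String, Integer> recalculate(Set<String> aliveNodes, int total, int maxPerNode) {
        Map<String, Integer> res = new LinkedHashMap<>();
        // Sort node ids so that the result is deterministic.
        List<String> nodes = new ArrayList<>(new TreeSet<>(aliveNodes));
        int remaining = total;
        boolean placed = true;

        // Stop when everything is placed or no node can accept another instance.
        while (remaining > 0 && placed) {
            placed = false;
            for (String node : nodes) {
                if (remaining == 0)
                    break;
                int cur = res.getOrDefault(node, 0);
                if (maxPerNode == 0 || cur < maxPerNode) {
                    res.put(node, cur + 1);
                    remaining--;
                    placed = true;
                }
            }
        }
        return res;
    }

    /** A node that is absent from the new assignments must cancel its instances. */
    static boolean mustCancel(String localNode, Map<String, Integer> newAssignments) {
        return !newAssignments.containsKey(localNode);
    }

    public static void main(String[] args) {
        Set<String> alive = new TreeSet<>(List.of("A", "B", "C"));
        System.out.println(recalculate(alive, 2, 1));
        alive.remove("A"); // node A fails, the coordinator recalculates
        System.out.println(recalculate(alive, 2, 1));
    }
}
```

In a real implementation the new assignments would travel in a discovery message, and each node would compare them with its locally initialized services to decide what to start or cancel.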

Deployment

...

When a new node connects to the existing cluster, all needed services are deployed and initialised on it by the time it is accepted into the topology.

This is what happens when a new node joins the cluster:

...

Deployment results

There are three possible outcomes of IgniteServices#deploy* method execution:

  • All services have been deployed successfully. In this case method execution just returns normally.
  • None of the services have been deployed. In this case ServiceDeploymentException is thrown, containing information about deployment failure.
  • Some of the assigned services have been deployed. In this case PartialServiceDeploymentException is thrown, containing information about failed deployments and number of deployed service instances. There should be a policy for processing of such outcomes, or an easy way to cancel the partially deployed services.
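
The three outcomes could be distinguished along these lines. This is a sketch, not the proposed implementation: the nested exception classes are simplified stand-ins defined only for this example, their fields (failed, deployedCnt) are hypothetical, and making the partial exception a subclass of the general one is an assumption that the text does not state.

```java
import java.util.List;

/** Hypothetical sketch of mapping deployment results to the three outcomes. */
final class DeployOutcomeSketch {
    /** Stand-in for the exception thrown when none of the services were deployed. */
    static class ServiceDeploymentException extends RuntimeException {
        final List<String> failed;

        ServiceDeploymentException(String msg, List<String> failed) {
            super(msg);
            this.failed = failed;
        }
    }

    /** Stand-in for the exception describing a partial deployment. */
    static final class PartialServiceDeploymentException extends ServiceDeploymentException {
        final int deployedCnt;

        PartialServiceDeploymentException(List<String> failed, int deployedCnt) {
            super("Some services failed to deploy: " + failed, failed);
            this.deployedCnt = deployedCnt;
        }
    }

    /**
     * Completes a deploy call: returns normally on full success,
     * otherwise throws one of the two exceptions above.
     */
    static void completeDeploy(int requested, List<String> failed) {
        if (failed.isEmpty())
            return; // all services deployed

        int deployed = requested - failed.size();

        if (deployed <= 0)
            throw new ServiceDeploymentException("No services were deployed: " + failed, failed);

        throw new PartialServiceDeploymentException(failed, deployed);
    }

    public static void main(String[] args) {
        try {
            completeDeploy(3, List.of("svcB"));
        }
        catch (PartialServiceDeploymentException e) {
            // A caller could cancel the services that did start at this point.
            System.out.println(e.getMessage() + ", deployed=" + e.deployedCnt);
        }
    }
}
```

Catching the partial exception before the general one lets a caller apply a rollback policy only when some instances actually started.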

Also deployment of each service instance should trigger a system event, containing information about the service and a node, where it took place. The same should be done for deployment failures.

These events shouldn't be triggered, however, for services that are not yet considered deployed in the cluster. Events about initial deployments should be triggered only after the discovery message about successful deployment is sent.

Deployment on new nodes

When a new node joins, it receives information about existing service assignments in the data attached to the initial discovery message. Information about ongoing deployments should also be included in the discovery data.

If a new node is included into assignments, then it should start the deployment procedure asynchronously.

If assignment recalculation is needed, then coordinator performs it and sends a reassignment message.

If a node joins during service deployment and it is suitable for deployment, then the new node should start the deployment. The coordinator should detect such a node and wait for the result from it as well.

Reducing the number of messages

Not all discovery events or deployment failures require assignment recalculation.

Services with a configuration like (maxPerCluster = 0, maxPerNode > 0) should lead to the creation of only one assignment. It should look like (eachNode=N), or something similar.

When a new node comes to the topology, or initialization failure happens, assignment recalculation shouldn't be triggered for such services.
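
A rough sketch of this rule is shown below. The class RecalcSketch and its methods are hypothetical names for this example, and the string form of the compact assignment simply mirrors the (eachNode=N) notation used above. It is not a proposed wire format.

```java
/** Hypothetical sketch: when assignment recalculation can be skipped. */
final class RecalcSketch {
    /** True for configurations like (maxPerCluster = 0, maxPerNode > 0). */
    static boolean isEachNode(int maxPerCluster, int maxPerNode) {
        return maxPerCluster == 0 && maxPerNode > 0;
    }

    /** The single compact assignment for an "each node" service. */
    static String compactAssignment(int maxPerNode) {
        return "(eachNode=" + maxPerNode + ")";
    }

    /**
     * For "each node" services, a node join or an initialization failure
     * does not change the assignment, so no recalculation message is needed.
     */
    static boolean canSkipRecalculation(int maxPerCluster, int maxPerNode) {
        return isEachNode(maxPerCluster, maxPerNode);
    }

    public static void main(String[] args) {
        System.out.println(compactAssignment(2) + " skip=" + canSkipRecalculation(0, 2));
    }
}
```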

TODO: How should other nodes know, which nodes succeeded in service deployment, if assignment looks like (eachNode=N)? Should they listen to system events about service deployment?

Information about already deployed services should also be included in the discovery data. Otherwise a node that has just joined the cluster won't be able to tell which nodes have which services.

Service cancellation

The IgniteServices#cancel() method sends a discovery message containing information about the services that are being cancelled.

Each node should call Service#cancel() methods on its services and undeploy them. Also all ongoing deployments should be interrupted.

...

Hot redeployment

It should be possible to update service implementation without downtime. Employment of Deployment SPI should solve this problem.

Service processor should subscribe to class deployments and restart corresponding services, when their classes change.

The basic usage scenario involves enabling UriDeploymentSpi and updating the JAR files containing implementation classes. This will lead to cancellation and redeployment of existing services. It implies that services should be ready for sudden cancellations. Documentation should explain this fact with examples.

To make redeployment with an updated class possible, service's properties and its class should be separated. ServiceConfiguration should contain the following properties:

  • String serviceClassName – name of a service implementation class.
  • Map<String, Object> properties – properties, that a service can use during initialization and work. Properties should be included into the ServiceContext object.

It will also help the service processor distinguish between different service configurations when only the properties change.

ServiceConfiguration#service property should be removed.

Service classes should have an empty constructor, which will be used by deploying nodes.
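
The separation of class name and properties might look roughly like this. It is a self-contained sketch under assumptions: the nested Service interface and Config class are simplified stand-ins rather than the real Ignite types, the properties are passed to init directly instead of through a ServiceContext object, and EchoService with its "greeting" property is invented for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: creating a service from a class name and properties. */
final class ServiceConfigSketch {
    /** Simplified stand-in for a service interface. */
    interface Service {
        void init(Map<String, Object> props);
    }

    /** Stand-in configuration that keeps the class name and properties apart. */
    static final class Config {
        final String serviceClassName;
        final Map<String, Object> properties;

        Config(String serviceClassName, Map<String, Object> properties) {
            this.serviceClassName = serviceClassName;
            this.properties = Collections.unmodifiableMap(new HashMap<>(properties));
        }
    }

    /** Example implementation with the required empty constructor. */
    public static final class EchoService implements Service {
        Object greeting;

        public EchoService() {
            // No-op: state comes from the properties passed to init().
        }

        @Override public void init(Map<String, Object> props) {
            greeting = props.get("greeting");
        }
    }

    /** What a deploying node might do: load the class, call the empty constructor, initialize. */
    static Service instantiate(Config cfg) {
        try {
            Class<?> cls = Class.forName(cfg.serviceClassName);
            Service svc = (Service) cls.getDeclaredConstructor().newInstance();
            svc.init(cfg.properties);
            return svc;
        }
        catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Failed to create service: " + cfg.serviceClassName, e);
        }
    }

    public static void main(String[] args) {
        Config cfg = new Config(EchoService.class.getName(), Map.of("greeting", "hello"));
        EchoService svc = (EchoService) instantiate(cfg);
        System.out.println(svc.greeting);
    }
}
```

Because the configuration holds only a class name and plain properties, a node can re-run instantiate with the same Config after the class is reloaded, which is what hot redeployment needs.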

A possible point of improvement here is to start redeployment with a random delay to avoid denial of service on the whole cluster.
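
One way such a delay could be implemented is sketched below. The class RedeployDelaySketch, its method names, and the maxDelayMs parameter are assumptions made for this example. The text only suggests the general idea of a random delay.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: starting redeployment after a random delay. */
final class RedeployDelaySketch {
    /** Returns a uniformly random delay in [0, maxDelayMs), or 0 if maxDelayMs is not positive. */
    static long randomDelayMs(long maxDelayMs) {
        return maxDelayMs <= 0 ? 0 : ThreadLocalRandom.current().nextLong(maxDelayMs);
    }

    /** Schedules the redeployment task so that nodes do not all restart a service at once. */
    static ScheduledFuture<?> scheduleRedeploy(ScheduledExecutorService exec, Runnable redeploy, long maxDelayMs) {
        return exec.schedule(redeploy, randomDelayMs(maxDelayMs), TimeUnit.MILLISECONDS);
    }
}
```

Since each node picks its delay independently, restarts are spread over the window and some instances are likely to stay available while others are redeployed.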

Risks and Assumptions

These changes will break compatibility with previous versions of Apache Ignite completely.

Also, there will be no way to preserve services between cluster restarts, even though this is currently possible.

Further work

There are still some flaws in the current service grid design that are not covered in this IEP.

...

Jira (ASF JIRA): IGNITE-6069
Jira (ASF JIRA): IGNITE-5551