...
ID | IEP-23
---|---
Author |
Sponsor |
Created | 18 Jun 2018
Status |
Currently, thin clients suffer from increased latency of cache operations, which can also limit throughput when synchronous operations are used. The cause is an additional network hop from client to server: a thin client cannot determine the physical location of the data, so it sends every cache request to a random server node, which re-routes the request to the node that owns the data.
The proposed solution is to implement "Affinity Awareness" for our thin clients, so that they send cache requests directly to the server node that contains the data. With this change, the mean latency of thin client operations can improve dramatically.
Below is a description of how the solution can be implemented.
Connecting to all nodes helps to identify the available nodes, but it can introduce a significant delay when a thin client is used with a large cluster and the user provides a long list of IP addresses. To reduce this delay, connections can be established asynchronously.
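The asynchronous establishment can be sketched as follows. This is an illustrative Python sketch, not part of the proposal: `connect_all`, `try_connect`, and the connection objects are assumed names, and `try_connect` stands in for the client's real connect-and-handshake routine.

```python
from concurrent.futures import ThreadPoolExecutor

def connect_all(addresses, try_connect):
    """Dial every configured address in parallel, so one slow or dead host
    does not delay connections to the rest of the cluster.

    try_connect(addr) is a placeholder for the client's real connect +
    handshake routine; it returns (node_uuid, connection) or None on failure.
    """
    node_map = {}

    with ThreadPoolExecutor(max_workers=max(1, len(addresses))) as pool:
        for result in pool.map(try_connect, addresses):
            if result is None:
                continue  # Address unreachable or handshake failed.

            node_uuid, conn = result

            # The same node may be reachable via several IPs;
            # keep the fresher connection and drop the older one.
            old = node_map.get(node_uuid)
            if old is not None:
                old.close()

            node_map[node_uuid] = conn

    if not node_map:
        # If at least one connection was established, we can work with the cluster.
        raise ConnectionError("Can not establish connection to a cluster")

    return node_map
```

Note that, as in the synchronous pseudocode below, duplicate node UUIDs are collapsed so the client keeps exactly one connection per node.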
The format of the handshake messages can be found here. Only the success response message is changed.
...
Field type | Field description |
---|---|
int | Success message length. |
byte | Success flag, 1. |
UUID | UUID of the server node. |
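A client-side parser for this success response might look as follows. This is a sketch that assumes the little-endian encoding used elsewhere in the binary client protocol and a raw 16-byte UUID; the authoritative layout is the handshake message format referenced above.

```python
import struct
import uuid

def parse_handshake_success(buf):
    """Parse the proposed handshake success response:
    int length, byte success flag, 16-byte node UUID (assumed layout)."""
    length, flag = struct.unpack_from('<ib', buf, 0)

    # A success flag value of 1 marks the success response.
    assert flag == 1, "not a success response"

    node_uuid = uuid.UUID(bytes=bytes(buf[5:21]))
    return length, node_uuid
```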
```
Map<UUID, TcpConnection> connect(ClientConfig cfg) {
    Map<UUID, TcpConnection> nodeMap = new Map();

    // Synchronous case here, as it is easier to read, but the same operation
    // can be performed asynchronously.
    for (addr : cfg.GetAddresses()) {
        TcpConnection conn = establishConnection(addr);

        if (conn.isGood()) {
            HandshakeResult handshakeRes = handshake(conn);

            if (handshakeRes.isOk()) {
                UUID nodeUuid = handshakeRes.nodeUuid();

                if (nodeMap.contains(nodeUuid)) {
                    // This can happen if the same node has several IPs.
                    // It makes sense to keep the fresher connection alive.
                    nodeMap.get(nodeUuid).disconnect();
                }

                nodeMap.put(nodeUuid, conn);
            } else {
                conn.disconnect();
            }
        }
    }

    if (nodeMap.isEmpty()) {
        // This is the only case which leads to the failure of the whole operation.
        // If at least one connection has been established, we can work with the cluster.
        reportFailure("Can not establish connection to a cluster");
    }

    return nodeMap;
}
```
To route a request to the primary node for a key, the client needs to know the partition mapping of the cache. There are several cases in which the client may want to request the affinity mapping for one or more caches, so it makes sense to allow requesting affinity mappings for several caches in a single request. Also, partition mappings for several caches are often identical, so as an optimization the response includes, for each partition mapping, the list of caches it applies to. Thus the partitions request can be described by the following steps:
...
See proposed Cache Partitions Request and Response message format below.
Field type | Description |
---|---|
Header | Request header. Format details can be found here. |
int | Number of caches N to get partition mappings for. |
int | Cache ID #1 |
int | Cache ID #2 |
... | ... |
int | Cache ID #N |
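Serializing the request body above can be sketched as follows. This is an illustrative Python sketch: `header` is a placeholder for the request header, whose format is referenced above, and little-endian encoding is assumed as elsewhere in the protocol.

```python
import struct

def encode_partitions_request(header, cache_ids):
    """Encode the cache partitions request: the (opaque) request header,
    the number of caches N, then N cache IDs, each a little-endian int."""
    body = struct.pack('<i', len(cache_ids))

    for cache_id in cache_ids:
        body += struct.pack('<i', cache_id)

    return header + body
```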
Field type | Description |
---|---|
Header | Response header. Format details can be found here. |
long | Topology Affinity Version. |
int | Minor Topology Affinity Version. |
int | Number of cache mappings J that describe all the caches listed in the request. |
Partition Mapping | Partition mapping #1. [[cacheId] => [nodeUuid => partition]]. See format below. |
Partition Mapping | Partition mapping #2 |
... | ... |
Partition Mapping | Partition mapping #J |
Field type | Description |
---|---|
bool | Applicable. Flag that shows whether the standard affinity function is used for these caches. |
int | Number K of caches for which this mapping is applicable |
int | Cache ID #1 |
Cache key configuration | Key configuration for cache #1. Present only if Applicable is true. |
int | Cache ID #2 |
Cache key configuration | Key configuration for cache #2. Present only if Applicable is true. |
... | ... |
int | Cache ID #K |
Cache key configuration | Key configuration for cache #K. Present only if Applicable is true. |
int | Number L of nodes. Present only if Applicable is true. |
Node Partitions | Partitions of the node #1. Present only if Applicable is true. |
Node Partitions | Partitions of the node #2. Present only if Applicable is true. |
... | ... |
Node Partitions | Partitions of the node #L. Present only if Applicable is true. |
Field type | Description |
---|---|
int | Number R of key configurations |
int | Key type ID #1 |
int | Affinity Key Field ID #1 |
int | Key type ID #2 |
int | Affinity Key Field ID #2 |
... | ... |
int | Key type ID #R |
int | Affinity Key Field ID #R |
Field type | Description |
---|---|
UUID | UUID of the node |
int | Number of partitions M associated with node |
int | Partition #1 for node. |
int | Partition #2 for node. |
... | ... |
int | Partition #M for node. |
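On the client side, the per-node partition lists from the response are more useful inverted into a partition-to-node lookup, which is the map the routing pseudocode below consults. A minimal sketch (names are illustrative):

```python
def build_partition_map(node_partitions):
    """Invert the per-node partition lists from the response
    ([(node_uuid, [partition, ...]), ...]) into the
    partition -> node UUID lookup used for request routing."""
    partition_map = {}

    for node_uuid, partitions in node_partitions:
        for partition in partitions:
            partition_map[partition] = node_uuid

    return partition_map
```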
When the user performs a key-based cache operation, the thin client makes a best effort to send the request to the node that stores the data.
```
Response sendRequest(CacheKey key, Message msg) {
    UUID nodeUuid = null;
    Connection nodeConnection = null;

    if (!distributionMap.contains(cacheId))
        updateCachePartitions(cacheId); // See "Cache instance acquiring"

    PartitionMap partitionMap = distributionMap.get(cacheId);

    if (!partitionMap.empty()) {
        Object affinityKey = key;

        Map<int, int> keyAffinityMap = cacheKeyMap.get(cacheId);
        int affinityKeyId = keyAffinityMap.get(key.typeId());

        if (affinityKeyId != null)
            affinityKey = key.getFieldById(affinityKeyId);

        int partition = RendezvousAffinityFunction(affinityKey);

        nodeUuid = partitionMap.get(partition);
        nodeConnection = nodeMap.get(nodeUuid);
    }

    if (nodeConnection == null)
        nodeUuid, nodeConnection = nodeMap.getRandom();

    while (true) {
        try {
            Response rsp = nodeConnection.send(msg);

            return rsp;
        } catch (err) {
            logWarning(err);

            nodeConnection.disconnect();
            nodeMap.remove(nodeUuid);

            if (nodeMap.isEmpty())
                reportErrorToUser("Cluster is unavailable");

            nodeUuid, nodeConnection = nodeMap.getRandom();
        }
    }
}
```
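The `RendezvousAffinityFunction(affinityKey)` step above reduces, for the default affinity function, to mapping the key's hash code onto a partition. A sketch, assuming the default behavior of taking the absolute value of the 32-bit hash modulo the partition count (1024 by default):

```python
def calculate_partition(key_hash, parts=1024):
    """Map a key's 32-bit hash code to a partition the way the default
    RendezvousAffinityFunction is assumed to: abs(hash) mod partition count.
    The client only needs this mapping; the full rendezvous computation of
    partition-to-node assignments stays on the server side."""
    return abs(key_hash) % parts
```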
It is important for the client to keep its partition mappings up to date. To ensure this, the following changes are proposed:
...
The format of the standard response messages can be found here.
...
As shown above, it is proposed to add a new "Flags" field to reduce the size of the success response message (the most common case).
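Client-side handling of such a field could be sketched as follows. The flag values here are purely illustrative assumptions (the actual bit assignments belong to the response message format referenced above), and `distribution_map` stands for the client's cached partition mappings:

```python
# Illustrative flag values; the actual bit assignments are defined
# by the response message format.
FLAG_ERROR = 0x1
FLAG_AFFINITY_TOPOLOGY_CHANGED = 0x2

def on_response(flags, client):
    """Sketch of client-side flag handling: when the server signals that
    the affinity topology changed, drop the cached partition mappings so
    they are re-requested lazily on the next key-based operation.
    Returns True if the response carries no error."""
    if flags & FLAG_AFFINITY_TOPOLOGY_CHANGED:
        client.distribution_map.clear()

    return not (flags & FLAG_ERROR)
```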
Benchmarks performed with a prototype implementation of the C++ thin client on a 3-node cluster show significant performance improvements.
Initial proposal discussion: http://apache-ignite-developers.2346864.n4.nabble.com/Best-Effort-Affinity-for-thin-clients-td31574.html