ID: IEP-83
Author
Sponsor
Created

  

Status: ACTIVE


Motivation

A TCP connection can enter a half-open state: it appears to be alive, but any attempt to send data fails. Long-lived, mostly idle connections are especially susceptible to this behavior.

The retry mechanism (IEP-82 Thin Client Retry Policy) in thin client implementations partially mitigates the issue. However, not all operations are safe to retry, and reconnecting affects performance.

To improve connection stability and detect failures early, we can add a keep-alive mechanism.

Description

Why not TCP keepalive

TCP has a built-in keepalive mechanism, but it has some disadvantages:

  • Optional (may not be present in some TCP stacks)
  • May not be handled well by some routers (RFC 1122, section 4.2.3.6)
  • The default timeout is too long (2 hours), and it is difficult to adjust on the SDK versions used by Ignite (Java 8, .NET Standard 2.0) or hard to get right in some languages (Python, JS).

Because of that, some protocols implement keepalive logic at a higher level (SMB, TLS). More details: https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html
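To illustrate the adjustment problem on the Java side, here is a minimal sketch using only the standard java.net.Socket API. The class and method names (TcpKeepAliveSketch, enableKeepAlive) are hypothetical and exist only for this example. It shows that the portable API offers an on/off switch, while the probe timing is left to the operating system.

```java
import java.io.IOException;
import java.net.Socket;

/** Hypothetical helper for illustration only; not part of Ignite. */
final class TcpKeepAliveSketch {
    /**
     * Turns on SO_KEEPALIVE for the given socket and reports the resulting state.
     * The standard Socket API has no setter for the keepalive idle time or probe
     * interval, so those values come from OS-level settings.
     */
    static boolean enableKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true);    // on/off switch only
        return socket.getKeepAlive(); // read back what the stack accepted
    }
}
```

With this approach the application cannot choose how quickly a dead peer is detected, which is one reason a protocol-level heartbeat is easier to control.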

Proposal

  • Add OP_HEARTBEAT to the protocol with an empty payload. Clients can send heartbeats at a configurable interval and receive responses to ensure that the connection is active.
  • Add OP_GET_IDLE_TIMEOUT to the protocol that returns the server-side idle timeout (see ClientConnectorConfiguration.idleTimeout) as int64 milliseconds. When the server's idleTimeout is less than the client's heartbeatInterval, log a warning on the client, because the server may close the connection before the next heartbeat arrives.
  • Add ClientConfiguration.heartbeatInterval property. Defaults to 0 (heartbeats disabled).
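The client-side behavior described in this list could look roughly like the following sketch. All names here (HeartbeatSketch, shouldWarn, start, the sendHeartbeat callback) are hypothetical and not taken from the Ignite code base. The sketch assumes that the warning is meant for the case where the client heartbeat interval exceeds the server idle timeout, and that a non-positive idle timeout means the server never closes idle connections.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical client-side helper; names are illustrative, not the Ignite API. */
final class HeartbeatSketch {
    /**
     * Returns true when the server may drop the connection before the next heartbeat.
     * Assumption: idle timeout <= 0 means "no idle timeout" on the server.
     */
    static boolean shouldWarn(long heartbeatIntervalMs, long serverIdleTimeoutMs) {
        return heartbeatIntervalMs > 0
            && serverIdleTimeoutMs > 0
            && heartbeatIntervalMs > serverIdleTimeoutMs;
    }

    /**
     * Schedules the given callback (which would send OP_HEARTBEAT) at a fixed rate.
     * Returns null when the interval is 0 or negative, i.e. heartbeats are disabled.
     */
    static ScheduledExecutorService start(long heartbeatIntervalMs, Runnable sendHeartbeat) {
        if (heartbeatIntervalMs <= 0)
            return null;

        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "thin-client-heartbeat");
            t.setDaemon(true); // do not keep the JVM alive just for heartbeats
            return t;
        });

        exec.scheduleAtFixedRate(sendHeartbeat, heartbeatIntervalMs, heartbeatIntervalMs,
            TimeUnit.MILLISECONDS);

        return exec;
    }
}
```

A client would call shouldWarn with the value returned by OP_GET_IDLE_TIMEOUT right after the handshake, and shut the executor down when the connection is closed.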

As a result:

  • Client can identify a broken connection by sending OP_HEARTBEAT to the server.
  • Server can identify a broken connection with the existing idleTimeout functionality.


This applies to Ignite 2.x and 3.x.

Risks and Assumptions

  • New ProtocolBitmaskFeature will be added to maintain protocol compatibility.
  • When client-side heartbeats are enabled (not by default), server will no longer disconnect an otherwise idle client. This should be carefully documented and cross-linked.
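As a rough illustration of the compatibility point, a client could enable heartbeats only when both sides advertise the new feature bit. The sketch below uses java.util.BitSet; the class name, the method name, and the bit index are hypothetical, since the proposal does not specify the actual ProtocolBitmaskFeature value.

```java
import java.util.BitSet;

/** Hypothetical feature check; the bit index is an assumption for this example. */
final class FeatureNegotiationSketch {
    /** Placeholder index; the real value would be assigned by ProtocolBitmaskFeature. */
    static final int HEARTBEAT_FEATURE_BIT = 7;

    /**
     * Returns true when both the client and the server feature masks contain the
     * heartbeat bit. Masks are little-endian byte arrays, as read by BitSet.valueOf.
     */
    static boolean heartbeatsSupported(byte[] clientFeatures, byte[] serverFeatures) {
        BitSet common = BitSet.valueOf(clientFeatures);
        common.and(BitSet.valueOf(serverFeatures)); // keep only features known to both sides
        return common.get(HEARTBEAT_FEATURE_BIT);
    }
}
```

Under this assumption, a client connected to an older server would skip OP_HEARTBEAT and OP_GET_IDLE_TIMEOUT entirely, so existing deployments keep their current behavior.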

Discussion Links

Reference Links

Tickets
