Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. The default value of tcp_syn_retries may be large, such as 6 for Linux. It means the default timeout value is 127 seconds for finishing the three-way handshake. A shorter timeout at the transportation level will help clients detect dead nodes faster. The existing configuration “request.timeout.ms” sets an upper-bound of the time used by both the transportation and application layer whose complexity varies. It’s risky to lower “request.timeout.ms” for detecting dead nodes quicker because of the involvement of the application layer logic.
  2. In some scenarios, the existing configuration "request.timeout.ms" is not able to time out the connections properly. For example, currently, the leastLoadedNode() provides a cached node with the criteria below. 
    1. Provide the connected node with least number of inflight requests
    2. If no connected node exists, provide the connecting node with the largest index in the cached list of nodes.
    3. If no connected or connecting node exists, provide the disconnected node which respects the reconnect backoff with the largest index in the cached list of nodes.

...

  1. Use request.timeout.ms to time out the socket connection at the client level instead of the network client level
    1. request.timeout.ms is at the client/request level. We need one in the NetworkClient level to control the connection states.
    2. The transportation layer timeout should be relatively shorter than the request timeout. It's good to have a separate config.
    3. In some scenarios, request.timeout.ms is not able to time out the connections properly.
  2. Add a new connection state TIMEOUT besides DISCONNECTED, CONNECTING, CHECKING_API_VERSIONS, READY, and AUTHENTICATION_FAILED
    1. We don't necessarily need to differentiate the timeout and disconnected states.
  3. a lazy socket connection time out. That is, the NetworkClient will only check and disconnect timeout connections in leastLoadedNode(). 

    Pros:
    1. Usually, when clients send a request, they will ask the network client to send the request to a specific node. In these cases, the connection.setup.timeout won’t matter too much because the client doesn’t want to try other nodes for that specific request. The request level timeout would be enough. The metadata fetcher fetches the status of the nodes periodically so the clients will reassign the timeout request correspondingly to a different node.
    2. Consumer, producer, and AdminClient are all using leastLoadedNode() for metadata fetches, where the connection setup timeout can play an important role. Unlike other requests can refer to the metadata for node condition, the metadata requests can only blindly choose a node for retry in the worst scenario. We want to make sure the client can get the metadata smoothly and as soon as possible. As a result, we need this socket.connections.setup.timeout.ms.
    3. Implementing the timeout in NetworkClient.poll() or anywhere else might need an extra iteration of all nodes, which might downgrade the network client performance.
    4. Clients fetch metadata periodically which means leastLoadedNode() will get called frequently. So the timeout checking frequency is guaranteed.

...