Status
Current state: "Under Discussion"
Discussion thread: here
JIRA: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Creating a new connection adds CPU overhead to the broker. KIP-306 mitigated the issue of connection storms caused by unauthorized connections (e.g., misconfigured clients). However, connection storms may also come from (mostly) well-behaved clients. One example is the deployment of a new application, which may cause a temporary connection storm when a large number of clients start up and create connections to the cluster at the same time. Another example is clients that create a new connection for each produce/consume request, causing a high connection rate on the brokers. A very high connection creation rate may stop the broker from doing other useful work, causing high request latencies or even under-replicated partitions (URPs).
To address this issue, the KIP proposes to add the ability to set a limit on the rate at which the broker accepts new connections. To ensure that a few clients do not take over most of the connection creation rate quota, the KIP also adds the ability to set a limit on the connection creation rate per IP.
Public Interfaces
A new broker configuration option will be added to limit the total rate at which non-inter-broker connections will be accepted on the broker. Connections on the inter-broker listener will be permitted even if the configured broker-wide limit is reached. This will be a dynamic config that can be updated without restarting the broker.
Config option: Name: max.connection.creation.rate
Type: Int
Default value: Int.MaxValue
The config may be prefixed with a listener prefix to specify a different listener-specific limit: listener.name.{listenerName}.max.connection.creation.rate. Listener-specific limits will be applied in addition to the broker-wide limit. If a listener-specific limit is not specified, each listener can create connections at a rate up to the broker-wide limit, as long as the total rate is also within the broker-wide limit. If a broker has multiple listeners, connections on the inter-broker listener will always succeed as long as the connection creation rate is within that listener's rate limit. The behavior of the proposed broker-wide and per-listener configs is consistent with the max.connections broker configuration (KIP-402).
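The relationship between the broker-wide key and the listener-prefixed key can be sketched as follows. This is only an illustration: the class ConnectionRateConfig and its methods are hypothetical helpers, not part of Kafka, and the sketch assumes the configs are available as a flat java.util.Properties object.

```java
import java.util.Properties;

// Hypothetical helper (not Kafka code): reads the proposed limits from a
// flat set of broker properties.
final class ConnectionRateConfig {
    static final String RATE_PROP = "max.connection.creation.rate";

    // Builds the listener-prefixed key, e.g.
    // listener.name.external.max.connection.creation.rate
    static String listenerKey(String listenerName) {
        return "listener.name." + listenerName + "." + RATE_PROP;
    }

    // Broker-wide limit; an unset value means no limit (Int.MaxValue in the KIP).
    static int brokerLimit(Properties props) {
        return Integer.parseInt(
            props.getProperty(RATE_PROP, String.valueOf(Integer.MAX_VALUE)));
    }

    // Listener-specific limit; when unset, the listener is bounded only by
    // the broker-wide limit, so this sketch reports "no limit" for it.
    static int listenerLimit(Properties props, String listenerName) {
        return Integer.parseInt(
            props.getProperty(listenerKey(listenerName), String.valueOf(Integer.MAX_VALUE)));
    }
}
```

Both limits apply together: a connection on a listener has to fit within that listener's limit and within the broker-wide limit, except on the inter-broker listener, which is not held back by the broker-wide limit.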
A new broker configuration option will be added to limit the rate at which connections will be accepted for each IP address. New connections for the IP will be dropped once the limit is reached. This will also be a dynamic config that can be updated without restarting the broker:
Config option: Name: max.connection.creation.rate.per.ip
Type: Int
Default value: Int.MaxValue
The connection creation rate limits will use the same quota window configuration (quota.window.size.seconds, with a 1 second default) as the existing produce/fetch quotas and the request rate quota (KIP-124). Since a limit on the connection creation rate on the broker is also a type of quota, this approach keeps it consistent with the existing quota implementations on the broker. If the connection creation rate on the broker exceeds the broker-wide limit, the broker will delay accepting a new connection by an amount of time that brings the rate within the limit. If a listener-specific limit is specified, and the connection rate on that listener exceeds the limit, the broker will delay accepting new connections on that listener by an amount of time that brings the rate of the listener within the listener limit. The maximum delay applied will be the quota window size (which also means that the minimum connection rate limit is effectively 1 connection creation / second).
If the connection creation rate limit is reached for a specific IP address, the connection will be dropped. The broker will continue dropping connections from that IP until the rate for the IP is within the per-IP connection creation rate limit.
No new metrics will be added. The existing metric (kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listenerName}) that tracks the amount of time the Acceptor is blocked from accepting connections will now additionally include the amount of time the Acceptor is blocked due to hitting the connection creation rate limit (in addition to the time blocked due to hitting the maximum limit on currently active connections).
Proposed Changes
The broker will track connection acceptance rates (broker-wide, per listener, and per IP) via sensors that wrap the Rate metric with a MetricConfig. MetricConfig#quota will be set to the corresponding configured connection creation rate limit. When the Acceptor accepts a new connection, the broker-wide metric and the corresponding listener's metric will be incremented. When the actual connection creation rate exceeds either the broker-wide or the listener-specific quota, a quota violation exception will be thrown. On quota violation, the broker will calculate the delay needed to bring the metric within quota by using the same formula implemented in ClientQuotaManager.throttleTime. The Acceptor thread will wait for the delay duration before accepting new connections. The maximum delay applied will be the quota window size (1 second by default).
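The delay calculation can be sketched as below. This is a minimal illustration, not the broker's code: it assumes the formula has the form (observedRate - quota) / quota * window, which is the amount of extra time over which the already-recorded connections would have to be spread to bring the rate down to the quota, and it assumes the window is simply the configured quota window in milliseconds. The class ConnectionRateThrottle and its parameter names are hypothetical.

```java
// Hypothetical helper (not Kafka code): computes how long the Acceptor would
// wait after a quota violation, assuming the delay has the form
// (observedRate - quota) / quota * windowMs, capped at the quota window.
final class ConnectionRateThrottle {
    static long delayMs(double observedRate, double quota, long windowMs) {
        if (quota <= 0) {
            throw new IllegalArgumentException("quota must be positive");
        }
        if (observedRate <= quota) {
            return 0L; // within quota: accept immediately
        }
        double uncapped = (observedRate - quota) / quota * windowMs;
        // The KIP caps the delay at the quota window size.
        return Math.min(Math.round(uncapped), windowMs);
    }
}
```

For example, with a quota of 10 connections per second and a 1000 ms window, an observed rate of 15 gives a 500 ms delay, while an observed rate of 50 would give 4000 ms uncapped and is therefore capped at 1000 ms.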
When a quota violation happens due to reaching the limit for an IP address, the connection from that IP will be closed. No delay will be calculated. If another connection then arrives from the same IP, it will either be accepted (if there is no quota violation) or rejected again (if there is a quota violation).
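The per-IP behavior (reject rather than delay) can be illustrated with a simplified limiter. This sketch makes several assumptions that are not in the KIP: it uses a fixed-window counter instead of the sampled Rate metric, it does not count rejected attempts toward the rate, and the caller passes the current time explicitly. PerIpConnectionRateLimiter is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Kafka code): a fixed-window connection counter per IP.
final class PerIpConnectionRateLimiter {
    private static final class Window {
        long startMs;
        int count;

        Window(long startMs) {
            this.startMs = startMs;
        }
    }

    private final int maxPerWindow;
    private final long windowMs;
    private final Map<String, Window> windows = new HashMap<>();

    PerIpConnectionRateLimiter(int maxPerWindow, long windowMs) {
        this.maxPerWindow = maxPerWindow;
        this.windowMs = windowMs;
    }

    // Returns true if the connection is kept, false if it should be closed.
    // No delay is computed: a violating connection is rejected immediately.
    synchronized boolean tryAccept(String ip, long nowMs) {
        Window w = windows.computeIfAbsent(ip, k -> new Window(nowMs));
        if (nowMs - w.startMs >= windowMs) {
            // The previous window has expired: start a fresh one.
            w.startMs = nowMs;
            w.count = 0;
        }
        if (w.count >= maxPerWindow) {
            return false;
        }
        w.count++;
        return true;
    }
}
```

Each IP is tracked independently, so one IP hitting its limit does not affect connections from other addresses.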
Most of this logic will be added to the ConnectionQuotas class, which currently throttles the Acceptor thread to limit the number of active connections. The ConnectionQuotas class will be extended to enforce both the number of active connections and the connection creation rate. This proposal adds another condition under which the Acceptor thread waits, which will be implemented as delaying accepting a new connection based on whichever limit is reached first:
- If the number of active connections is below the limit, but the broker hits the connection rate limit, the Acceptor will wait for the calculated delay that brings the connection creation rate metric within quota.
- If there are no available active connection slots, the broker waits for a new slot regardless of whether the connection rate exceeds the quota.
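The two wait conditions above can be summarized in a small decision function. This is a sketch only: AcceptorWaitDecision, its Reason enum, and the parameter names are hypothetical, and rateDelayMs stands for a delay that has already been computed for the connection rate quota (0 when the rate is within quota).

```java
// Hypothetical sketch (not Kafka code): decides why, if at all, the Acceptor
// thread has to wait before accepting the next connection.
final class AcceptorWaitDecision {
    enum Reason { NONE, CONNECTION_SLOT, CONNECTION_RATE }

    static Reason waitReason(int activeConnections, int maxConnections, long rateDelayMs) {
        if (activeConnections >= maxConnections) {
            // No free slot: wait for one regardless of the connection rate.
            return Reason.CONNECTION_SLOT;
        }
        if (rateDelayMs > 0) {
            // Slots are available but the rate quota is violated: wait for the delay.
            return Reason.CONNECTION_RATE;
        }
        return Reason.NONE;
    }
}
```

Checking the slot condition first mirrors the second bullet: when no slot is available, the rate quota does not change what the Acceptor does.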
Compatibility, Deprecation, and Migration Plan
The feature is backward compatible because the default configuration preserves the legacy behavior: no connection creation rate limit.
There will be no impact on existing users.
Rejected Alternatives
None.