
...


Rejected Alternatives

Send metrics out-of-band directly to collector or to separate metric cluster

There are plenty of existing solutions that allow a client implementation to send metrics directly to a collector, but they fall short of the enabled-by-default requirements of this KIP:

  • It would require additional client-side configuration: endpoints, authentication, etc.
  • It may require additional network filtering and routing configuration to allow the client host to reach the collector endpoints. By using the Kafka protocol we already have a usable connection.
  • It adds another network protocol the client needs to handle, along with its runtime dependencies (libraries, etc.).
  • It makes correlation between client instances and connections on the broker harder, which in turn makes correlating client-side and broker-side metrics harder.
  • There are more points of failure: what if the collector is down but the cluster is up?
  • Zero configuration is an absolute must for KIP-714 to provide value: it is already possible today to send metrics out-of-band, but people don't, and they still won't if any extra configuration is needed.
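To make the zero-conf argument concrete, the sketch below contrasts the kind of extra client-side configuration a hypothetical out-of-band exporter would need with KIP-714's approach. All configuration keys and values here are illustrative placeholders, not real client options:

```python
# Hypothetical configuration burden of an out-of-band metrics exporter.
# Every key below is invented for illustration; none is a real client option.
out_of_band_config = {
    "metrics.exporter.endpoint": "https://collector.example.com:4317",  # extra endpoint to configure
    "metrics.exporter.auth.token": "<token>",                           # extra credentials to provision
    "metrics.exporter.protocol": "otlp/http",                           # extra protocol stack to support
}

# With KIP-714 the existing Kafka broker connection is reused, so no new
# client-side keys are required for metrics to flow by default.
kip714_config = {}

# Each extra key above is a reason metrics would remain disabled in practice.
print(len(out_of_band_config), len(kip714_config))
```

Each of those keys also implies operational work (firewall rules, credential rotation, dependency management), which is exactly the friction the enabled-by-default requirement is meant to eliminate.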


An alternative approach is to send metrics to a different Kafka cluster. The idea is that routing metrics through the same cluster means client metrics may be unavailable exactly when the original cluster is unavailable; sending metrics to a separate cluster that is not affected by the original cluster's problems would solve that.

This, however, has the same problems as described above, and it requires an additional client instance, connected to the metrics cluster, for each client instance connecting to the original cluster, which adds more complexity. It is also not clear how the metrics client itself would be monitored.

Produce metrics data directly to a topic

Instead of using a dedicated PushTelemetryRequest, the suggestion is to use the existing ProduceRequest to send telemetry to a predefined topic on the cluster. While this has the benefit of reusing an existing method for sending compressed data to the cluster, it raises several issues:

  • While an existing producer instance could also produce to this topic, a consumer would need to instantiate a new producer instance just to send these metrics. In an application with many consumers, which is not uncommon, this would double the number of connections to the cluster and increase the application's resource usage (memory, CPU, threads).
  • A separate producer instance makes the mapping between the original connection and the client instance id more complex.
  • What observes/collects metrics for the metrics producer instance itself?
  • Lack of abstraction. We don't do this for any other part of the protocol (e.g., OffsetCommitRequest could be a ProduceRequest, as could the transactional requests; FindCoordinator could be done through local hashing and metadata; etc.). It makes future improvements and changes a lot more problematic: how do we add functionality that is not covered by the produce API?
  • More points of failure, by introducing an additional client instance and additional connections.
  • Requires metrics to (at least temporarily) be stored in a topic. Operators may want to push metrics upstream directly.
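The connection-doubling concern in the first bullet can be sketched with a simple back-of-the-envelope model. This is an assumption-laden illustration (one broker connection per client instance, one metrics producer per consumer), not a measurement from any implementation:

```python
def total_connections(consumers: int, connections_per_client: int = 1) -> int:
    """Rough model of cluster connections for a consumer-only application
    if metrics were sent via ProduceRequest to a predefined topic.

    Each consumer holds its own broker connection(s); producing metrics
    would additionally require one producer instance per consumer, each
    with its own connection(s).
    """
    data_path = consumers * connections_per_client
    metrics_producers = consumers * connections_per_client  # the extra instances
    return data_path + metrics_producers

# An application with 50 consumers roughly doubles its connection count:
print(total_connections(50))  # -> 100, versus 50 without the metrics producers
```

In reality each client may hold connections to several brokers, which scales both terms up equally, so the "roughly double" conclusion holds regardless of cluster size.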

Dedicated Metrics coordinator based on client instance id

...