Current state: "Under Discussion"
Discussion thread: here
JIRA: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
KafkaProducer#send returns a java.util.concurrent.Future, but it also blocks when it can't get partition metadata, or when the queue is full. This is a surprising user experience, since methods that return a Future are typically asynchronous and do not block. It also clashes with other asynchronous systems, like Netty, Finagle, or gRPC, where the assumption is that you never block the IO loop, and where blocking can cause deadlocks for other requests.
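The surprise can be illustrated with a toy stand-in (this is not the real client; the sleep stands in for the metadata wait or a full buffer): the caller is blocked before it even receives the Future it intended to wait on asynchronously.

```java
import java.util.concurrent.*;

// Toy illustration (not the real client): a send() that returns a Future but
// still blocks the calling thread first, mirroring KafkaProducer#send when
// metadata is missing or the accumulator is full.
class BlockingSendDemo {
    static Future<String> send(String record) throws InterruptedException {
        Thread.sleep(500); // stands in for the metadata / buffer-space wait
        return CompletableFuture.completedFuture("acked:" + record);
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        Future<String> f = send("hello");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // The caller was blocked for ~500 ms before it even got the Future.
        System.out.println("blocked for ~" + elapsedMs + " ms; result=" + f.get());
    }
}
```

If `send` is called from an event-loop thread (Netty, Finagle, gRPC), that 500 ms stall holds up every other request multiplexed on the same loop, which is the deadlock risk described above.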
This section is a work in progress while we iterate on a good solution.
KafkaProducer#send (add an overload)
Producer#send (add an overload)
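The shape of the overload is still under discussion; the sketch below is one hypothetical possibility, with stand-in types so it compiles on its own. The method name `sendAsync` and the `CompletableFuture<Future<RecordMetadata>>` return type are assumptions for illustration, not the agreed API.

```java
import java.io.Closeable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Stand-in types so the sketch is self-contained; the real ones live in
// org.apache.kafka.clients.producer.
class ProducerRecord<K, V> {}
class RecordMetadata {}

// Hypothetical overload sketch. The outer future completes when the record
// has been queued (metadata resolved, buffer space found); the inner future
// completes when the broker acks.
interface AsyncProducer<K, V> extends Closeable {
    Future<RecordMetadata> send(ProducerRecord<K, V> record);                         // existing, may block
    CompletableFuture<Future<RecordMetadata>> sendAsync(ProducerRecord<K, V> record); // proposed, never blocks
}

// Trivial implementation, just to show the shape of the contract.
class NoopProducer<K, V> implements AsyncProducer<K, V> {
    public Future<RecordMetadata> send(ProducerRecord<K, V> record) {
        return CompletableFuture.completedFuture(new RecordMetadata());
    }
    public CompletableFuture<Future<RecordMetadata>> sendAsync(ProducerRecord<K, V> record) {
        return CompletableFuture.completedFuture(send(record));
    }
    public void close() {}
}
```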
There are two things that make this really tricky. First, KafkaProducer#send has an existing contract that we don't want to break, which is to preserve the order of the records that it sends. Second, we need to figure out which thread is going to be responsible for waiting for metadata, or for waiting for space to free up in the queue.
There are many plausible approaches here for each solution. We'll break them down by problem.
I've added an asterisk * next to the solutions that I think are the most pragmatic.
When we encounter a situation where we would need to block, we can instead add a continuation to a single intermediate queue, which must be processed in order.
Pros:
Cons:
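A minimal sketch of the single-queue idea, assuming (this is not the real implementation) that the continuation is a callable capturing the rest of the send, drained by one dedicated worker thread:

```java
import java.util.concurrent.*;

// Sketch: instead of blocking the caller, enqueue the remainder of the send
// as a continuation on a single FIFO queue drained by one worker thread. A
// single queue preserves the existing ordering guarantee: records reach the
// accumulator in the order send() was called.
class ContinuationQueue {
    private final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<>();

    ContinuationQueue() {
        Thread worker = new Thread(() -> {
            try {
                while (true) pending.take().run(); // strictly FIFO
            } catch (InterruptedException e) { /* shutdown */ }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Called from send(): returns immediately; the (possibly blocking) step
    // runs later on the worker thread.
    <T> CompletableFuture<T> submit(Callable<T> blockingStep) {
        CompletableFuture<T> result = new CompletableFuture<>();
        pending.add(() -> {
            try { result.complete(blockingStep.call()); }
            catch (Exception e) { result.completeExceptionally(e); }
        });
        return result;
    }
}
```

Note one consequence of a single shared queue: head-of-line blocking, since a stalled metadata fetch for one topic delays queued sends for every other topic.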
For the blocking problem related to metadata, the fetch only needs to happen once and can then be cached, so it would be nice to be able to do it at construction. However, we don't yet have all of the information we need to fetch the metadata at construction time.
Pros:
Cons:
When we encounter a situation where we would need to block, we can instead add a continuation to a per-topic queue, where each queue must be processed in order.
Pros:
Cons:
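The per-topic variant might look roughly like this sketch (an assumption for illustration), which chains each topic's continuations on a CompletableFuture so a stall on one topic doesn't delay the others, while submission order is preserved within a topic:

```java
import java.util.concurrent.*;

// Sketch: one continuation chain per topic. Chaining on the topic's current
// tail preserves per-topic submission order; independent topics proceed
// independently.
class PerTopicQueues {
    private final ConcurrentMap<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newCachedThreadPool();

    CompletableFuture<Void> submit(String topic, Runnable blockingStep) {
        // Atomically append this step to the topic's chain and return the new tail.
        return tails.compute(topic, (t, tail) ->
            (tail == null ? CompletableFuture.completedFuture((Void) null) : tail)
                .thenRunAsync(blockingStep, executor));
    }
}
```

Note that the real ordering contract is per partition rather than per topic, so a production version might key the queues more finely once metadata is known.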
When we need to write to a queue but we don't yet have the relevant metadata, we can pass a Future&lt;TopicPartition&gt; instead of a concrete TopicPartition, and delay the actual send until the TopicPartition is known.
Pros:
Cons:
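A sketch of the idea (the class and method names here are hypothetical): the append composes on the partition future, so the caller never blocks, and the enqueue-to-batch step fires whenever metadata arrives and completes the future.

```java
import java.util.concurrent.*;

// Sketch: carry a CompletableFuture<TopicPartition> through the accumulator
// instead of a concrete TopicPartition; the batch is only eligible to be
// sent once the future completes.
class FuturePartitionSketch {
    record TopicPartition(String topic, int partition) {}

    static CompletableFuture<String> append(CompletableFuture<TopicPartition> tp, String record) {
        // Non-blocking: runs when the metadata fetch completes the future.
        return tp.thenApply(p -> "queued " + record + " -> " + p.topic() + "-" + p.partition());
    }
}
```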
This simplifies client-side retries. It makes it possible to attempt a send with an immediate retry, and then to trigger a retry on a different thread with a longer timeout, without having to maintain multiple clients. The downside of this approach is that it forces the customer to change their API to Future&lt;Future&lt;RecordMetadata&gt;&gt; to do this safely, so it still creates work for the customer, but it is a much simpler retry story than the current "safe" one, which requires the customer to implement reasonable backoffs on 0-duration sends.
Pros:
Cons:
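The retry pattern the doubled future enables might look like this sketch (the helper and its parameters are hypothetical): the caller puts a timeout only on the outer "queued" stage, and retries the enqueue if it doesn't complete in time, without a second producer instance.

```java
import java.util.concurrent.*;

// Sketch: the outer future completes when the record is queued, the inner
// one when the broker acks. Timing out the outer stage lets the caller
// retry enqueueing; the (usually slower) ack wait is left unbounded here.
class DoubleFutureRetry {
    static <T> T sendWithRetry(Callable<CompletableFuture<Future<T>>> send,
                               long queueTimeoutMs) throws Exception {
        CompletableFuture<Future<T>> queued = send.call();
        try {
            // Wait only for the "queued" stage.
            return queued.get(queueTimeoutMs, TimeUnit.MILLISECONDS).get();
        } catch (TimeoutException firstAttemptTimedOut) {
            // Retry the enqueue (this could instead be scheduled on another
            // thread with a longer timeout).
            return send.call().get().get();
        }
    }
}
```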
This allows the API to better represent what's going on: there are two potentially blocking steps, and we want to signal when each of them is done.
Pros:
Cons:
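Instead of nesting futures, the two stages could be named explicitly in a small result type. This is a sketch with hypothetical field names, plus a stand-in for the real RecordMetadata:

```java
import java.util.concurrent.CompletableFuture;

// Stand-in for org.apache.kafka.clients.producer.RecordMetadata.
class RecordMetadata {}

// Sketch: one field per blocking stage, so callers can wait on exactly the
// stage they care about.
class SendResult {
    final CompletableFuture<Void> queued = new CompletableFuture<>();          // metadata resolved + buffer space acquired
    final CompletableFuture<RecordMetadata> acked = new CompletableFuture<>(); // broker acknowledged the record
}
```

Compared with Future&lt;Future&lt;RecordMetadata&gt;&gt;, this is the same information with more self-documenting names, at the cost of introducing a new public type.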
Keep the status quo.
Pros:
Cons:
By default, we will use a common threadpool, either one managed by Kafka or one provided by the JDK (like ForkJoinPool.commonPool()). However, it will be possible for a user to override the pool.
Pros:
Cons:
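The default-with-override option might look roughly like this sketch (the class and method names are assumptions; in practice the override would likely arrive via producer configuration):

```java
import java.util.concurrent.*;

// Sketch: run the potentially blocking step on ForkJoinPool.commonPool() by
// default, but let the user supply their own executor.
class BlockingPoolConfig {
    private final Executor blockingPool;

    BlockingPoolConfig(Executor override) {
        this.blockingPool = override != null ? override : ForkJoinPool.commonPool();
    }

    CompletableFuture<String> runBlockingStep(Callable<String> step) {
        return CompletableFuture.supplyAsync(() -> {
            try { return step.call(); }
            catch (Exception e) { throw new CompletionException(e); }
        }, blockingPool);
    }
}
```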
Update the API so that the user provides a thread pool in which we do the blocking.
Pros:
Cons:
Change it so that it's truly asynchronous, and the Kafka IO loop picks up the next chunk of work when it's ready.
Pros:
Cons:
Use a common threadpool that's provided by the JDK, like the ForkJoin common pool, to take care of the blocking.
Pros:
Cons:
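One caveat worth weighing for this option: the common pool's parallelism is sized for CPU-bound work, so plainly blocking in it can starve other tasks sharing the pool. The JDK's escape hatch is ForkJoinPool.ManagedBlocker, which tells the pool to compensate with extra threads while a task blocks. A sketch of wrapping a blocking wait (the class name is hypothetical):

```java
import java.util.concurrent.*;

// Sketch: perform a blocking wait inside the ForkJoin common pool via
// ManagedBlocker, so the pool can spawn a compensating thread instead of
// losing a worker for the duration of the block.
class Blockers {
    static void blockFor(CountDownLatch latch) throws InterruptedException {
        ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
            public boolean block() throws InterruptedException {
                latch.await();   // the actual blocking wait
                return true;     // done blocking
            }
            public boolean isReleasable() {
                return latch.getCount() == 0; // skip blocking if already released
            }
        });
    }
}
```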
Keep the status quo.
Pros:
Cons:
TBD. It's too early to discuss now.
Prior Art: KIP-286
Everything is on the table! We can add things here as we reject them.