ID: IEP-70
Author:
Sponsor:
Created:
Status: DRAFT


Motivation

Cache async operations invoke future listeners on Striped pool threads, which can cause deadlocks and/or reduce cache performance.

IgniteFuture<Void> fut = cache.putAsync(1, 1);
fut.listen(f -> {
    // The listener executes on a Striped pool thread, so this nested cache operation deadlocks.
    cache.replace(1, 2);
});

Users are expected to be aware of this and handle it manually. However:

  • This behavior is unexpected.
  • Users have to read the docs carefully to learn about it.
  • Handling it manually is verbose and error-prone.
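
The manual handling mentioned above amounts to moving the continuation onto a user-owned executor. The following is a stand-alone sketch of that idea using only JDK classes. It is an analogy, not Ignite code: `CompletableFuture` stands in for `IgniteFuture`, the single "stripe-0" thread stands in for a Striped pool thread, and all class and method names are hypothetical. In Ignite itself, `IgniteFuture#listenAsync`, which accepts an `Executor`, appears to serve the same purpose.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; no Ignite API is used here.
class ManualOffloadSketch {
    /** Returns the name of the thread that ran the continuation. */
    static String continuationThread(boolean offload) {
        // "stripe-0" plays the role of a Striped pool thread that completes the operation.
        ExecutorService stripe = Executors.newSingleThreadExecutor(r -> new Thread(r, "stripe-0"));
        // "user-0" plays the role of a user-owned pool for continuations.
        ExecutorService userPool = Executors.newSingleThreadExecutor(r -> new Thread(r, "user-0"));

        try {
            CompletableFuture<Void> op = new CompletableFuture<>();

            // Register the continuation before the operation completes.
            CompletableFuture<String> cont = offload
                ? op.thenApplyAsync(v -> Thread.currentThread().getName(), userPool)
                : op.thenApply(v -> Thread.currentThread().getName());

            // The "cache operation" completes on the stripe thread.
            stripe.execute(() -> op.complete(null));

            return cont.get(5, TimeUnit.SECONDS);
        }
        catch (Exception e) {
            throw new IllegalStateException(e);
        }
        finally {
            stripe.shutdown();
            userPool.shutdown();
        }
    }
}
```

Without offloading, the continuation runs on the thread that completed the operation ("stripe-0" here); with offloading, it runs on the user pool. Every call site has to remember to do this, which is the verbosity the list above refers to.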

The problem is more pronounced in Ignite.NET:

  • async/await has existed for a long time, and most code bases are async.
  • async/await syntactic sugar makes the thread switch less obvious.
  • Custom thread pools are less common.

await cache.PutAsync(1, 1);
// Now we are on a Striped pool thread!

// CPU-heavy method blocks the stripe and cache ops are stalled.
RunSomething();


A similar problem exists for Compute. Async operation continuations are executed on the Public pool, which can lead to starvation there when all threads are taken up by continuation logic.

Description

  • Add IgniteConfiguration#asyncContinuationExecutor (of type Executor).
  • Use ForkJoinPool#commonPool by default (when null / not set).
  • Use this executor for all Cache and Compute async continuations.
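
The default-selection rule in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Ignite implementation: the class `ContinuationExecutorSketch` and its methods are hypothetical, and only the fallback to `ForkJoinPool#commonPool` when nothing is configured comes from the proposal.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;

// Hypothetical helper; the real logic lives inside Ignite and may differ.
class ContinuationExecutorSketch {
    /** Picks the executor for async continuations: the configured one, or the common pool if null. */
    static Executor resolve(Executor configured) {
        return configured != null ? configured : ForkJoinPool.commonPool();
    }

    /** Hands a listener to the resolved executor instead of running it on the completing thread. */
    static void dispatch(Executor configured, Runnable listener) {
        resolve(configured).execute(listener);
    }
}
```

With this shape, a listener never runs on the thread that completed the operation unless the user explicitly configures an executor that does so.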

This fixes the issue in Java, .NET and C++, because thick integrations use direct JNI callbacks for Futures.

NOTE: This IEP is NOT related to scan query filters, cache entry processors, etc., which also run on the Striped pool.

Risks and Assumptions

  • Some users may already have custom code to deal with the problem.
  • Some users run simple continuations that work fine on the striped/public pool.

Those users can force the old behavior with `IgniteConfiguration.setAsyncContinuationExecutor(Runnable::run)`.
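
The reason `Runnable::run` restores the old behavior is that, used as an `Executor`, it runs each task synchronously on the thread that calls `execute`. The small JDK-only sketch below checks this; the class name is hypothetical and no Ignite API is involved.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of a same-thread ("direct") executor.
class SameThreadExecutorSketch {
    /** Returns true if a task submitted to Runnable::run executes on the calling thread. */
    static boolean runsOnCallerThread() {
        // Method reference adapts Runnable.run() to Executor.execute(Runnable).
        Executor direct = Runnable::run;

        AtomicReference<Thread> seen = new AtomicReference<>();
        direct.execute(() -> seen.set(Thread.currentThread()));

        // The task has already finished by the time execute() returns.
        return seen.get() == Thread.currentThread();
    }
}
```

With such an executor configured, a continuation runs on whichever thread completes the future, which is the striped or public pool thread as before.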

Performance

Executing a continuation on a different thread involves some overhead. A local benchmark with an integer key and value shows a ~6% throughput drop (see JmhCacheAsyncListenBenchmark in the PoC).

In a real-world workload the difference should be insignificant.

Benchmark                          Mode  Cnt      Score      Error  Units
JmhCacheAsyncListenBenchmark.put  thrpt   10  77859.584 ± 2071.196  ops/s (before)
JmhCacheAsyncListenBenchmark.put  thrpt   10  73393.986 ± 1336.420  ops/s (after)
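
As a quick check of the figures above, the relative drop can be computed from the two scores. The helper below is a hypothetical sketch that uses only the numbers from the benchmark table.

```java
// Hypothetical helper for checking the reported throughput drop.
class ThroughputDropSketch {
    /** Relative throughput drop in percent, given ops/s before and after the change. */
    static double dropPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Scores taken from the benchmark table above.
        System.out.printf("%.1f%%%n", dropPercent(77859.584, 73393.986));
    }
}
```

For the scores in the table this gives roughly 5.7%, which matches the approximate 6% figure quoted above.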


Discussion Links

IEP thread: http://apache-ignite-developers.2346864.n4.nabble.com/IEP-70-Async-Continuation-Executor-td51775.html

Original discussions: 


Reference Links

PoC: https://github.com/apache/ignite/pull/8870

Tickets

