To be Reviewed By: April 22nd, 2022

Authors: Alberto Gomez (alberto.gomez@est.tech)

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

Related: N/A

Problem

Threads that handle client requests can hang forever due to bugs in the code. This has been observed in systems in the field, with threads stuck for months. Such hangs reduce the capacity of the system and can lead to a complete service outage if all the threads assigned to client requests end up in that state.

These threads normally get stuck either waiting for a condition that is never fulfilled (for example, waiting for a CountDownLatch that is never counted down) or waiting indefinitely (and uninterruptibly) for an answer from another member of the system.
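
For illustration, a minimal, self-contained example of the first pattern (not taken from Geode): a request thread blocks on a CountDownLatch that, because of a bug elsewhere, is never counted down, so await() never returns and the thread stays parked until the JVM is restarted.

import java.util.concurrent.CountDownLatch;

// Minimal illustration (not Geode code): the request thread below waits on a latch
// that is never counted down, so it stays blocked in await() indefinitely and the
// pool slot it occupies is effectively lost.
public class StuckThreadExample {
  public static void main(String[] args) throws InterruptedException {
    CountDownLatch responseReceived = new CountDownLatch(1);

    Thread requestThread = new Thread(() -> {
      try {
        responseReceived.await(); // bug elsewhere: countDown() is never called
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "client-request-thread");

    requestThread.start();
    // The code path that was supposed to call responseReceived.countDown() never
    // runs, so requestThread remains in the WAITING state forever.
    Thread.sleep(1000);
    System.out.println(requestThread.getName() + " state: " + requestThread.getState());
  }
}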

Anti-Goals

This proposal does not intend to provide a mechanism to identify deadlocks in the system in a timely manner so that the root cause can be addressed and the system recovered from them. The aim is only to provide a defensive mechanism to release threads that have been stuck for such an unreasonable amount of time that stopping them is considered the best option.

Original Proposal (left just for the record but abandoned for the reasons shown in the comments)

In order to release client request threads that have been stuck for so long that it can be assumed the client has already lost interest in the response, it is proposed to add a mechanism to Geode that releases these threads after some configurable time.

The release of the threads would be done by two means:

  • For threads waiting uninterruptibly for a response from another member, the new configurable timeout will allow the thread to exit from this wait when the waiting time reaches the timeout value.
  • For threads waiting on a different condition than the one above, the mechanism implemented by the ThreadsMonitoringProcess and AbstractExecutor classes (used by Geode to detect threads that have been stuck for some time) could be enhanced to also interrupt threads that have been stuck for longer than the new configurable timeout.

A draft pull request has been created in order to show what the implementation of this mechanism could look like: https://github.com/apache/geode/pull/7555
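
As a rough illustration of the first bullet above (a sketch only, not the contents of the draft pull request), an unbounded wait for a reply could be replaced by a wait bounded by the proposed maxWaitTimeout; the class and method names below are placeholders:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: instead of blocking forever on a reply, the wait is
// bounded by the proposed maxWaitTimeout and the thread gives up with a
// TimeoutException once that limit is reached.
class BoundedReplyWait {
  private final CountDownLatch replyLatch = new CountDownLatch(1);

  void waitForReply(long maxWaitTimeoutMinutes) throws InterruptedException, TimeoutException {
    if (!replyLatch.await(maxWaitTimeoutMinutes, TimeUnit.MINUTES)) {
      throw new TimeoutException(
          "No reply received within " + maxWaitTimeoutMinutes + " minutes; releasing thread");
    }
  }

  void replyReceived() {
    replyLatch.countDown();
  }
}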

Changes and Additions to Public Interfaces

Two new configurable parameters are proposed to specify the new timeouts:

  • maxWaitTimeout
  • maxThreadStuckTime

Alternatively, just one parameter could be used to release threads waiting on any of the conditions described above.
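
For illustration only, the two parameters could be expressed as plain configuration values; neither name exists in Geode, and the units (minutes) are an assumption of this sketch:

import java.util.Properties;

// Hypothetical only: neither property exists in Geode; names and units (minutes)
// are placeholders from this (abandoned) proposal.
public class AbandonedTimeoutConfig {
  public static void main(String[] args) {
    Properties config = new Properties();
    config.setProperty("maxWaitTimeout", "60");        // bound on uninterruptible reply waits
    config.setProperty("maxThreadStuckTime", "1440");  // bound before a stuck thread is interrupted
    config.forEach((name, value) -> System.out.println(name + " = " + value + " minutes"));
  }
}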

Solution

When a thread is stuck in a Geode member, the only generic safe action to release it is to restart the member. Stopping the stuck thread selectively may lead to data inconsistencies or other types of problems and therefore it is not recommended.

In order to release stuck threads in a Geode server in a safe way, the following is proposed:

  • Enhance the current mechanism in Geode to detect stuck threads (based on ThreadsMonitoringProcess and AbstractExecutor) so that it also detects when a thread has been stuck for longer than a reasonable period (a new configurable parameter).
  • Send an alert when the above mechanism detects that a thread has been stuck for longer than the configured maximum (a rough sketch of such a check is shown after this list).
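
The sketch below shows the kind of check the enhanced monitor could perform; the map from each thread to the time it was first seen stuck stands in for the state that ThreadsMonitoringProcess/AbstractExecutor already track internally, and all names are illustrative rather than final:

import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Illustrative only: a periodic check that emits a fatal-level log statement once a
// thread has been stuck for longer than the proposed max-thread-stuck-time-minutes.
class LongStuckThreadCheck {
  private static final Logger logger = LogManager.getLogger(LongStuckThreadCheck.class);

  void check(Map<Thread, Long> firstSeenStuckAtMillis, long maxThreadStuckTimeMinutes) {
    long limitMillis = TimeUnit.MINUTES.toMillis(maxThreadStuckTimeMinutes);
    long now = System.currentTimeMillis();
    for (Map.Entry<Thread, Long> entry : firstSeenStuckAtMillis.entrySet()) {
      if (now - entry.getValue() > limitMillis) {
        // Geode forwards severe/fatal log statements as alerts to registered alert
        // listeners and JMX managers, so a statement like this could surface the
        // condition outside the member without extra plumbing.
        logger.fatal("Thread <{}> has been stuck for more than {} minutes",
            entry.getKey().getName(), maxThreadStuckTimeMinutes);
      }
    }
  }
}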

Systems external to Geode could receive the new alert and, at a convenient time, trigger a restart of the member with stuck threads.
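
Since Geode already forwards alert-level log messages to connected JMX managers as notifications on the DistributedSystemMXBean, an external watcher could consume the proposed alert that way. A minimal sketch, assuming JMX is reachable on localhost:1099 and that the alert text contains a recognizable phrase; both are assumptions of this example, not decisions of the proposal:

import javax.management.*;
import javax.management.remote.*;

// Sketch of an external watcher that listens for Geode system alerts over JMX and
// could trigger a member restart when a stuck-thread alert is seen. The JMX URL and
// the alert text matched below are assumptions for illustration only.
public class StuckThreadAlertWatcher {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url =
        new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    MBeanServerConnection mbs = connector.getMBeanServerConnection();
    ObjectName distributedSystem = new ObjectName("GemFire:service=System,type=Distributed");

    NotificationListener listener = (notification, handback) -> {
      // Geode delivers system alerts as notifications of type "system.alert".
      if ("system.alert".equals(notification.getType())
          && notification.getMessage().contains("stuck for longer than")) {
        // Here an operator or automation could schedule a restart of the affected member.
        System.out.println("Stuck-thread alert from " + notification.getSource()
            + ": " + notification.getMessage());
      }
    };

    mbs.addNotificationListener(distributedSystem, listener, null, null);
    Thread.sleep(Long.MAX_VALUE); // keep the process alive to receive notifications
  }
}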

Changes and Additions to Public Interfaces

A new configurable parameter is proposed in order to specify the maximum time a thread can be stuck before the member sends an alert:

  • max-thread-stuck-time-minutes
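
Geode already exposes the thread-monitor-enabled, thread-monitor-interval-ms and thread-monitor-time-limit-ms properties for the existing stuck-thread detection; the new parameter would sit alongside them. A minimal sketch of how it might be configured once implemented (the new property does not exist yet, and its final name and units could change):

import java.util.Properties;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

public class ServerWithStuckThreadAlert {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("thread-monitor-enabled", "true");        // existing Geode property
    props.setProperty("thread-monitor-interval-ms", "60000");   // existing Geode property
    props.setProperty("thread-monitor-time-limit-ms", "30000"); // existing Geode property
    // Proposed, not implemented yet: today Geode would reject this unknown property.
    props.setProperty("max-thread-stuck-time-minutes", "1440");
    Cache cache = new CacheFactory(props).create();
    System.out.println("Cache created as member " + cache.getDistributedSystem().getName());
  }
}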

Performance Impact

No impacts foreseen

Backwards Compatibility and Upgrade Path

No impacts foreseen

Prior Art

-

FAQ


Errata



13 Comments

  1. What thread pools are you proposing to interrupt?

    If a thread doing a distribution operation is interrupted, that will have implications for data consistency. How do you propose to deal with this?

    1. The idea would be to interrupt any thread, no matter what it is doing, assuming that after being stuck for so much time (let's say 24 hours), it makes no difference whether it is interrupted or not.

      Anthony Baker Do you think the system should be more selective and never interrupt certain threads?

  2. I think if we interrupt certain threads we will introduce a data consistency problem. If member A and member B have different values for a given key due to thread interruption, how do we resolve this?


    1. You have a point here. But again, if the thread in charge of replicating an entry has been stuck for many hours, then the inconsistency will already be there and I would not count on waiting for it to be fixed. At least, if you interrupt the thread you would be releasing wasted resources.

  3. Right. So the question is which is worse – data inconsistency or non-availability? In the case of data loss I think you would encounter downstream failures since callbacks / WAN / CQ events would all be impacted.

    I agree that non-availability is also a concern. Perhaps we can think of this as an administrative tool rather than a product property. "If your cluster is in a bad state, here's a tool to help you out". Have you thought about a gfsh command that could interrupt a given thread on a member? Of course this is dangerous and should require superuser permissions.

    1. I am not so sure that for the case described (a thread stuck for hours/days) it is a matter of data inconsistency vs non-availability. Again, if the stuck thread is the one in charge of replicating the data and we do not act, we would still have the data inconsistency problem and, eventually, the availability problem.

      I also considered the administrative tool approach, but it would not help in the cases where the thread is in an uninterruptible wait (for those cases we need to set a timeout as you can see in the draft PR). Also I do not see a clear advantage over a mechanism that automatically stops the threads after some configurable safe time (which should be quite high).

  4. I kinda agree with Anthony Baker that the proposed fix here is worse than the original problem. If a member stops waiting for a response to a message, that could just result in data inconsistency for a single key. But if that message is related to more critical metadata, like which bucket lives where, it could potentially result in removing the last copy of a bucket, or all sorts of other nasty issues.

    Interrupting threads at arbitrary places where they might be stuck also seems like it will have completely unpredictable results, potentially resulting in cascading failures to other members.

    I think the only safe thing to do if a member has stuck threads would be to shut down the entire member. Maybe that is a better action to take if we detect a thread or reply that is stuck for a long time.

    I wonder if this is another case where being able to plug in custom logic might be the better way to go - perhaps ThreadsMonitoringProcess could notify a callback that some threads have been stuck for X amount of time, and that callback could do whatever a user wants - shut down the member, ping an administrator, etc.

  5. Anthony Baker, Dan Smith , thanks for the valuable feedback.

    So, summarizing your comments, the only safe action to take when a thread is stuck for long is to restart the server.

    That being so, what if, instead of the original proposal, Geode sends an alert when a thread has been stuck for some configurable time? That way, a client registered for notifications could notice and decide what action to take and when to take it, if appropriate.

    1. Anthony Baker , Dan Smith , do you agree with moving forward with the proposal of sending an alert when a thread has been stuck for too long?

  6. Some concerns I have on interrupting stuck threads:

    1. Are you proposing some new thread that will monitor the state of threads and interrupt them? How often would it perform this process?
    2. 99% of Geode's codebase currently waits uninterruptibly by looping and resetting the thread's interrupt flag. Handling interrupts without causing further problems has historically been very problematic in this code base. Are you planning to change all of these spots in the code?
    3. At present, I believe that if you start interrupting Geode threads without bouncing the Server, this will very commonly put the cluster into a state of inconsistency or cause additional problems that I can't foresee.

    It might be better to monitor for stuck threads without interrupting them. This could generate a logging-based Alert as you mention in the comments. You could also define a callback (notification listener or observer) that is invoked with information about the thread being stuck. Then you could implement a callback notification listener that does something like:

    1. Restarts the affected server. You probably want to limit it to bouncing just one server and then waiting for some period before considering bouncing a 2nd or 3rd server containing stuck threads.
    2. If you feel lucky, try interrupting the stuck thread in the callback without bouncing the server, but again like in #1, have a longer period before considering interrupting any more threads. This way you can plug in this behavior (or any other reaction to stuck threads) without making it a "feature" in Geode.

    Any new "feature" should be something that most users will want to enable and use. If there's any doubt about how many users will want to run with this turned on, then it's much better to make some sort of SPI or plugin hook or listener mechanism for customizing specific deployments of Geode by a limited number of users.

    1. Kirk Lund Thanks for the thorough answer.

      You guys have convinced me that interrupting a stuck thread is dangerous and the only safe action is to bounce the server.

      For now, I think the alert option could be good enough, although I will give some thought to the callback notification listener alternative.