...

The diagram above shows the current threading model during inference, as I understand it.  The most important thing to note is that inference must be run, and results must be read, on a single main thread (often referred to as the dispatcher thread).  It is therefore very important that we block this thread as little as possible.  This is not a problem when we’re submitting predictions: we can call forward on a symbol as many times as we like, and each forward call quickly kicks off a computation graph that is eventually executed in the engine.
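To make the submit-side behavior concrete, here is a minimal plain-Python sketch, using concurrent.futures as a stand-in for the MXNet engine (the predict function and the request payloads are hypothetical, not real MXNet API):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the MXNet engine: runs submitted work on worker threads.
engine = ThreadPoolExecutor(max_workers=4)

def predict(request):
    # Hypothetical model invocation; in MXNet this would be forward()
    # kicking off an asynchronous computation graph.
    return request * 2

# The dispatcher thread can submit many predictions without blocking:
# submit() returns a future immediately, just as forward() returns
# before the computation has actually run.
futures = [engine.submit(predict, r) for r in range(4)]

# Only reading a result blocks (the analogue of asnumpy()/wait_to_read).
results = [f.result() for f in futures]
print(results)  # [0, 2, 4, 6]
```

The key property being modeled is that submission is cheap and non-blocking; all of the waiting happens at read time.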

We can eventually read the results of running forward in our output NDArray by calling asnumpy() (which in turn calls wait_to_read), but at this point we run into a potential problem.  If we submit, for example, four requests of various sizes, we are forced to choose one of them to read first, which can block our dispatcher thread for a significant amount of time.  We can’t submit any more work until the dispatcher thread is unblocked.
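The same plain-Python analogy illustrates the problem (again, concurrent.futures stands in for the engine, and the workload durations are made up): if the first-issued request happens to be slow, the dispatcher sits blocked on it even though the other results are already finished behind it.

```python
import time
from concurrent.futures import ThreadPoolExecutor

engine = ThreadPoolExecutor(max_workers=4)

def predict(request_id, duration):
    time.sleep(duration)        # hypothetical workloads of various sizes
    return request_id

# The first request is slow; the other three are fast.
futures = [engine.submit(predict, i, d)
           for i, d in enumerate([0.5, 0.01, 0.01, 0.01])]

# Reading in issue order: the dispatcher blocks on the slow request
# (the analogue of asnumpy() on the first output NDArray) even though
# the three fast results finished long before it.
first = futures[0].result()                    # blocks ~0.5 s
already_done = [f.done() for f in futures[1:]]
print(first, already_done)  # 0 [True, True, True]
```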

Another consideration is that for many models the order of results is not guaranteed.  This means that if we wait for requests in the order in which they were issued, we would likely see higher average latency than if we could read results in the order they complete.  A better solution would be to call asnumpy() only on outputs that we know are finished.  There are a few ways we could do this, but one relatively simple approach would be to expose a can_read property on NDArray that tells us when it is ready to read.  Services could then poll their output NDArrays and, as soon as one is ready, read the result and pass it back to a request thread.
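Here is a sketch of that polling idea in the same plain-Python analogy; future.done() plays the role of the proposed can_read property (both the property and this harness are hypothetical, not existing MXNet API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

engine = ThreadPoolExecutor(max_workers=4)

def predict(request_id, duration):
    # Hypothetical workload: requests of various sizes take different
    # amounts of time, so completion order differs from issue order.
    time.sleep(duration)
    return request_id

# Issue a slow request first, then three fast ones.
futures = [engine.submit(predict, i, d)
           for i, d in enumerate([0.5, 0.01, 0.01, 0.01])]

# Poll for finished outputs instead of reading in issue order.
completed = []
pending = list(futures)
while pending:
    for f in list(pending):
        if f.done():            # "can_read": ready, so reading won't block
            completed.append(f.result())
            pending.remove(f)
    time.sleep(0.001)

print(completed)  # the slow request (id 0) is read last
```

Because the dispatcher only ever reads outputs that are already finished, it is never blocked for longer than one cheap copy, and the fast requests are handed back to their request threads immediately instead of queueing behind the slow one.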

Another approach worth considering would be to call asnumpy() on the request threads as soon as a prediction is started.  This would block the request threads, but it would never block our dispatcher thread.  My understanding is that although asnumpy() only reads data, it still mutates internal state, so calling it directly from request threads would not be thread-safe.  (It would be great if a core dev could correct me if this is not the case.)
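Sketched in the same plain-Python analogy, that approach looks like the following; the lock models the serialization we would need if asnumpy() really is not safe to call from multiple threads (the lock, predict, and reader helpers are all hypothetical):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

engine = ThreadPoolExecutor(max_workers=4)
read_lock = threading.Lock()    # serializes reads, in case they mutate state
results = {}

def predict(request_id):
    return request_id * 10      # hypothetical model output

def reader(request_id, future):
    # Runs on a request thread: blocking here (the analogue of calling
    # asnumpy()) never blocks the dispatcher thread.
    with read_lock:             # only needed if reads are not thread-safe
        results[request_id] = future.result()

# The dispatcher thread starts each prediction (the analogue of forward())
# and immediately hands the pending result off to a request thread.
request_threads = []
for i in range(4):
    future = engine.submit(predict, i)
    t = threading.Thread(target=reader, args=(i, future))
    t.start()
    request_threads.append(t)

for t in request_threads:
    t.join()

print(results)  # all four results, e.g. {0: 0, 1: 10, 2: 20, 3: 30}
```

Note that if reads really do have to be serialized, the lock reintroduces a form of head-of-line blocking among the request threads, just no longer on the dispatcher.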

...