Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Right now, Samza relies on YARN to detect whether a container is alive or not. This has a few problems:1) with

  • With the effort to make standalone Samza (SAMZA-516) and make Samza more pluggable w/ other distributed cluster management system (like Mesos, Kubernetes), we need to make the container liveness detection independent.

...

  • YARN based liveness detection has also created problems w/ leaking containers when NM crashed. It creates a dilemma:
    • In the case that NM can be restarted quickly, we would like to keep the container alive w/o being affected by NM goes down since that saves ongoing work. yarn.nodemanager.recovery.enabled=true
    • However, when RM loses the heart beat from NM and determines that the container is "dead", we truly need to make sure to kill the container to avoid duplicate containers being launched, since AM has no other way to know whether the container is actually alive or not.

The proposal is to solve this by implementing a JobCoordinator HTTP endpoint for heart beat.

Motivation

If we implement a With the direct heart beat mechanism between Samza the JobCoordinator and SamzaContainer, we can be agnostic to whatever the YARN RM/NM/AM sync status is. It is also simple to implement and understand due to its synchronous flow. 

Proposed Changes

JobCoordinator side

...

  • On the container side we start a new thread that periodically polls this endpoint described above to check if the container is valid. If its not, we shutdown the run loop and raise an error (so that the exit code is non 0 so that YARN reschedules the container)
  • Calling the shutdown function on the run loop from the ContainerValidator thread would causes the finally block in SamzaContainer.run() to execute, we need all the shutdown functions to execute before we can exit with a non-zero code. Therefore we need a way to indicate to the LocalContainerRunner class that the SamzaContainer class exited with an exception.
    • This could involve setting
    Since raising an exception may not be the ideal way to shutdown the container (skips all the shutdowns in the finally block). It may be useful to set
    • a member variable in SamzaContainer to denote that an exception was raised. When shutting down
    in
    • the LocalContainerRunner we can check for this variable and exit with a non-zero exit code. This would mean that the container validator thread has a reference to the container instance (which seems wrong).
    • ???

Public Interfaces

This would introduce a few new configs:

...