...
JIRA: SAMZA-871
Released:
Problem
(taken from SAMZA-871)
Right now, Samza relies on YARN to detect whether a container is alive and valid. This has a few problems:
...
as the YARN-based liveness detection
...
fails when the NodeManager (NM) crashes: the container is rescheduled on a different host without the old container being killed, so messages are processed twice. We need a way to ensure that invalid containers are killed, so that launching a duplicate container does not leave two copies running.
The proposal is to solve this by implementing a JobCoordinator HTTP endpoint that serves a heartbeat between the containers and the JobCoordinator.
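The heartbeat check can be illustrated with a minimal sketch. It is not the Samza implementation: the class `HeartbeatSketch`, its nested `Coordinator`, and the notion of a per-launch "execution ID" are assumptions made for illustration, and the HTTP transport is replaced by a direct method call so that the decision logic stays visible. The idea shown is that the JobCoordinator tracks which container launches are currently valid, and a container that learns it is no longer valid stops itself.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names are Samza APIs.
public class HeartbeatSketch {

    // Coordinator side: tracks the container launches currently considered valid.
    public static final class Coordinator {
        private final Set<String> validIds = ConcurrentHashMap.newKeySet();

        // Record a newly launched container as valid.
        public void register(String executionId) {
            validIds.add(executionId);
        }

        // A container was rescheduled: the old launch becomes invalid.
        public void replace(String oldId, String newId) {
            validIds.remove(oldId);
            validIds.add(newId);
        }

        // The answer the heartbeat endpoint would return for this caller.
        public boolean isValid(String executionId) {
            return validIds.contains(executionId);
        }
    }

    // Container side: one heartbeat round.
    // Returns true if the container may keep running, false if it should shut down.
    public static boolean heartbeat(Coordinator coordinator, String executionId) {
        return coordinator.isValid(executionId);
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        coordinator.register("container-1");
        System.out.println(heartbeat(coordinator, "container-1"));

        // The container is rescheduled elsewhere; the old launch must now stop.
        coordinator.replace("container-1", "container-2");
        System.out.println(heartbeat(coordinator, "container-1"));
        System.out.println(heartbeat(coordinator, "container-2"));
    }
}
```

In this sketch the old container discovers that it is invalid on its next heartbeat and can exit on its own, without depending on YARN to kill it.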
Motivation
With a direct heartbeat between the JobCoordinator and the SamzaContainer, we no longer depend on the sync status of the YARN RM, NM, and AM. The mechanism is also simple to implement and understand because its flow is synchronous.
...