Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Discussion thread: <link to mailing list DISCUSS thread>

JIRA: SAMZA-TBD2373

Released:  TBD

Problem

Samza operates in a multi-tenant environment with cluster managers like Yarn and Mesos where a single host can run multiple Samza containers. Often due to soft limits configured for cluster managers like Yarn and no notion of dynamic workload balancing in Samza a host lands in a situation where it is underperforming and it is desired to move one or more containers from that host to other hosts. Today this is not possible without affecting other jobs on the hot host or restarting the affected job manually. In other use cases like resetting checkpoints of a single container or supporting canary or rolling bounces the ability to restart a single or a subset of containers without restarting the whole job is highly desirable. 

...

Solution 1. Write locality to Coordinator Stream (Metastore) and restart job [Rejected]

Pros

Cons

  • Simple to implement the current tool does that for host affinity enabled jobs (since they maintain locality mapping)
  • Needs a job restart and does a best effort to get preferred hosts for containers but has no guarantee on getting them
  • If a job has standby containers enabled, this method involves changing standby mapping in addition to active container mappings 
  • Job faces downtime when the job has hundreds of containers and only one of them needs to be restarted, if it is stateful there is a likelihood that containers might not get the new asked resource on the restart and start bootstrapping
  • This solution is not scalable to be used by Controllers who want to take multiple control actions on containers across several jobs, for example, auto-sizing controller
  • This method will not be work for building Canary / Cluster Balancer

Solution 2. Container Placement Service [Accepted]

...

1. Container Placement Actions (Move / Restart)

API

registerContainerPlacementAction

Description

Active Container: Stop container process on source-host and starts it for 

  1. Stateless Job on either
    1. Destination-host (destination host can be source as well)
    2. Any host (destination-host = ANY_HOST)
  2. Stateful Job on either 
    1. Destination-host (if specified, destination host can be source as well)
    2. Standby Container (destination-host = STANDBY)
    3. Any host (destination-host = ANY_HOST)

StandBy Container: Stop container process on source-host and starts it on:

    1. Destination-host (if specified & matches StandBy Constraints)
    2. Any host (otherwise which matches StandBy Constraints)

Parameters

uuid: unique identifier a request, populated by the client

applicationId: unique identifier of the deployed app for which the action is taken

processor-id: Samza resource id of container e.g 0, 1, 2 

destination-host: valid hostname / “ANY_HOST” / “STANDBY”

request-expiry-timeout: [optional]: timeout for any resource request to the cluster manager 

Status code

CREATED, BAD_REQUEST, ACCEPTED, IN_PROGRESS, SUCCEEDED, FAILED

Returns

Since this is an ASYNC API, the status can be queried by processorId using

Failure Scenarios

There are following cases under which a request to place container might fail:

  1. When an active container stop fails, in this case, we mark the request failed
  2. When requested resources cannot be obtained from the cluster manager, in this case, we mark the request failed
  3. When stopped active container fails to start on destination host in that case we mark the request failed and attempt to start on the source host, failure to do so results in starting the same on ANY_HOST


Note: For supporting canary above parameter list can be easily extended to support the following parameters

Parameters

user-version: user application version [optional]

samza-version: samza framework version [optional]

jvm-args: arbitrary string to be used as jvm arguments [optional]


2. Container Status

API

containerStatus

Description

Gives the status & info of the container placement request, for ex is it running, stopped what control commands are issued on it

Parameters

processor-id: Samza resource id of container e.g 0, 1, 2 

applicationId: unique identifier of the deployed app for which the action is taken

uuid: unique identifier a request

Status code

ACCEPTED, UNAUTHORIZED

Returns

Status of the Container placement action 


2. Enable & Disable StandBy [Stretch]

API

controlStandBy

Description

Starts or Stops a standBy container for the active container

Parameters

processor-id: Samza resource id of container e.g 0, 1, 2 

Status code

ACCEPTED, UNAUTHORIZED

Architecture

For implementing a scalable container placement control system, the proposed solution is divided into two parts:

...