...

1. Container Placement Actions (Move / Restart)

API

placeContainer

Description

Active Container: Stops the container process on the source-host and starts it for:

  1. Stateless Job on either
    1. Destination-host (destination host can be source as well)
    2. Any host (destination-host = ANY_HOST)
  2. Stateful Job on either 
    1. Destination-host (if specified, destination host can be source as well)
    2. Standby Container (destination-host = STANDBY)
    3. Any host (destination-host = ANY_HOST)

StandBy Container: Stops the container process on the source-host and starts it on:

    1. Destination-host (if specified & matches StandBy Constraints)
    2. Any host (otherwise which matches StandBy Constraints)

Parameters

uuid: unique identifier of a request, populated by the client

applicationId: unique identifier of the deployed app for which the action is taken

processor-id: Samza resource id of the container, e.g. 0, 1, 2

destination-host: valid hostname / “ANY_HOST” / “STANDBY”

request-expiry-timeout: [optional]: timeout for any resource request to the cluster manager 

Status code

CREATED, BAD_REQUEST, ACCEPTED, IN_PROGRESS, SUCCEEDED, FAILED

Returns

Since this is an async API, nothing is returned; the status of the request can be queried by processorId.
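The parameter list and status codes above could be modeled roughly as follows. This is a minimal sketch, not the actual Samza classes: `PlacementRequest`, `PlacementStatus`, `validate`, and the millisecond unit for the timeout are hypothetical names and choices made for illustration. The rule that STANDBY is only accepted for stateful jobs is an assumption drawn from the Description above, where STANDBY is listed only under Stateful Job.

```java
import java.util.UUID;

/** Status codes listed for placeContainer in this section. */
enum PlacementStatus { CREATED, BAD_REQUEST, ACCEPTED, IN_PROGRESS, SUCCEEDED, FAILED }

/** Hypothetical value class mirroring the placeContainer parameters. */
final class PlacementRequest {
  static final String ANY_HOST = "ANY_HOST";
  static final String STANDBY = "STANDBY";

  final UUID uuid;                    // populated by the client
  final String applicationId;         // deployed app the action targets
  final String processorId;           // Samza resource id, e.g. "0", "1", "2"
  final String destinationHost;       // hostname, ANY_HOST or STANDBY
  final Long requestExpiryTimeoutMs;  // optional; null when not supplied (unit is an assumption)

  PlacementRequest(UUID uuid, String applicationId, String processorId,
                   String destinationHost, Long requestExpiryTimeoutMs) {
    this.uuid = uuid;
    this.applicationId = applicationId;
    this.processorId = processorId;
    this.destinationHost = destinationHost;
    this.requestExpiryTimeoutMs = requestExpiryTimeoutMs;
  }

  private static boolean isBlank(String s) {
    return s == null || s.trim().isEmpty();
  }

  /** Returns BAD_REQUEST for a malformed request, ACCEPTED otherwise. */
  PlacementStatus validate(boolean statefulJob) {
    if (uuid == null || isBlank(applicationId) || isBlank(processorId) || isBlank(destinationHost)) {
      return PlacementStatus.BAD_REQUEST;
    }
    // Assumption: STANDBY is only a valid destination for stateful jobs.
    if (STANDBY.equals(destinationHost) && !statefulJob) {
      return PlacementStatus.BAD_REQUEST;
    }
    if (requestExpiryTimeoutMs != null && requestExpiryTimeoutMs <= 0) {
      return PlacementStatus.BAD_REQUEST;
    }
    return PlacementStatus.ACCEPTED;
  }
}
```

Validating up front keeps malformed requests from ever reaching the cluster manager; the remaining status codes would be set later as the asynchronous action progresses.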

Failure Scenarios

A request to place a container might fail in the following cases:

  1. When stopping the active container fails, the request is marked failed
  2. When the requested resources cannot be obtained from the cluster manager, the request is marked failed
  3. When the stopped active container fails to start on the destination host, the request is marked failed and a start is attempted on the source host; if that also fails, the container is started on ANY_HOST
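The fallback order in the third failure scenario can be sketched as below. `ContainerLauncher`, `PlacementFallback`, and `Outcome` are hypothetical names introduced only for this illustration. What happens when the final ANY_HOST attempt also fails is not specified above, so the sketch simply reports that no host is known in that case.

```java
/** Hypothetical abstraction over "start this container on that host". */
interface ContainerLauncher {
  boolean start(String processorId, String host);
}

final class PlacementFallback {
  static final String ANY_HOST = "ANY_HOST";

  /** Result of a placement attempt: request status plus where the container ended up. */
  static final class Outcome {
    final boolean requestSucceeded;
    final String runningOn; // null when no start attempt succeeded

    Outcome(boolean requestSucceeded, String runningOn) {
      this.requestSucceeded = requestSucceeded;
      this.runningOn = runningOn;
    }
  }

  static Outcome startWithFallback(ContainerLauncher launcher, String processorId,
                                   String sourceHost, String destinationHost) {
    if (launcher.start(processorId, destinationHost)) {
      return new Outcome(true, destinationHost);
    }
    // From here on the request itself is failed; the remaining attempts
    // only try to bring the stopped container back up somewhere.
    if (launcher.start(processorId, sourceHost)) {
      return new Outcome(false, sourceHost);
    }
    boolean started = launcher.start(processorId, ANY_HOST);
    // Behaviour when ANY_HOST also fails is not described in this document.
    return new Outcome(false, started ? ANY_HOST : null);
  }
}
```

Keeping the request status separate from the final host makes it explicit that a container can be running again even though the placement request is reported as failed.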

...

Note: To support canary, the parameter list above can be extended with the following parameters

Parameters

userapp-version: user application version [optional]

samza-version: samza framework version [optional]

jvm-args: arbitrary string to be used as jvm arguments [optional]

...

API

containerStatus

Description

Gives the status and info of the container placement request, for example whether the container is running or stopped and what control commands have been issued on it

Parameters

processor-id: Samza resource id of the container, e.g. 0, 1, 2

applicationId: unique identifier of the deployed app for which the action is taken

uuid: unique identifier of a request

Status code

ACCEPTED, BAD_REQUEST, UNAUTHORIZED

Returns

Status of the Container placement action 

...

  1. Control Plane: a plane above the job that allows control actions to be taken by multiple controllers, such as the Samza Dashboard or the Start points controller.
  2. ContainerPlacementHandler is a stateless handler registered with the control plane that dispatches placement actions by invoking Container Placement Service APIs

...

Samza Metastore will provide an API to write to the coordinator stream. One simple way to expose the Container Placement API is for the Container Placement Handler (CPH) to run a coordinator stream consumer that polls control messages from the coordinator stream and acts on them. The CPH maintains some in-memory state with the Container Placement Service so that it does not take the same action twice. In addition, since control actions are associated with a deployment id, they are automatically invalidated across restarts.
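The two safeguards in this paragraph, not acting twice on the same request and ignoring requests from an earlier deployment, could look roughly like this. It is a sketch under stated assumptions: `ControlMessage`, `DedupingPlacementHandler`, and the use of the request uuid as the deduplication key are hypothetical, and the call into the Container Placement Service is replaced by a counter.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical control message as polled from the coordinator stream. */
final class ControlMessage {
  final String uuid;
  final String deploymentId;
  final String processorId;
  final String destinationHost;

  ControlMessage(String uuid, String deploymentId, String processorId, String destinationHost) {
    this.uuid = uuid;
    this.deploymentId = deploymentId;
    this.processorId = processorId;
    this.destinationHost = destinationHost;
  }
}

final class DedupingPlacementHandler {
  private final String currentDeploymentId;
  private final Set<String> seenUuids = new HashSet<>(); // in-memory only, lost on restart
  private int dispatched = 0;

  DedupingPlacementHandler(String currentDeploymentId) {
    this.currentDeploymentId = currentDeploymentId;
  }

  /** Returns true if the message was dispatched, false if it was skipped. */
  boolean handle(ControlMessage message) {
    // Messages written for a previous deployment are treated as invalid.
    if (!currentDeploymentId.equals(message.deploymentId)) {
      return false;
    }
    // Set.add returns false when the uuid was already seen, so the same
    // request is never dispatched twice.
    if (!seenUuids.add(message.uuid)) {
      return false;
    }
    dispatched++; // stand-in for the call into the Container Placement Service
    return true;
  }

  int dispatchedCount() {
    return dispatched;
  }
}
```

Because the seen-uuid set lives only in memory, the deployment id check is what keeps requests left over from before a restart from being replayed.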

Pros

  • No need to build Authentication & Authorization; both are already handled by the Metadata auth service
  • No need to enable rate limiting: requests are queued, so the flow of requests can be regulated on the consumer side

Cons

  • If the AM dies there can still be queued requests in the coordinator stream; such requests have to be handled across AM restarts
  • The coordinator stream is log compacted, so control messages written to it need to be deleted to prevent it from growing large enough to affect job start times

...

Pros

  • Simple to extend the existing REST endpoint
  • If the AM dies all the outstanding requests are discarded (no additional handling needed)

Cons

  • Need to build authentication
  • Need to build an Authorization layer around these REST endpoints
  • Adding another service to the already heavily loaded Job Coordinator might increase memory usage
  • Need to build a service for discovery or rely on the Yarn embedded Servlet

...

  1. ContainerPlacementHandler is a stateless handler that dispatches ContainerPlacementRequestMessages from the metastore to the Container Placement Service, and writes ContainerPlacementResponseMessages from the Container Placement Service back to the metastore so that external controllers can query the status of an action. (PR).
  2. The metastore used by default in Samza today is Kafka (coordinator stream), which stores configs & container mappings and is log compacted
  3. ContainerPlacementRequestMessage & ContainerPlacementResponseMessage are maintained in individual namespaces using NamespaceAwareMetaStore
  4. The key for storing the ContainerPlacementRequestMessage & ContainerPlacementResponseMessage in the metastore is chosen to be processorId (logical container id 0, 1, 2, etc.) because in the worst case the metastore then holds 2n such messages, where n is the number of containers; this choice ensures that job boot-up times are not affected by these messages
  5. One way to delete stale ContainerPlacementMessages is to delete messages from the previous incarnation in the metastore on job restarts
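The 2n bound and the stale-message cleanup from the list above can be illustrated with an in-memory stand-in for the namespaced metastore. Everything here is hypothetical: `PlacementMessageStore`, the namespace names, and tagging each entry with a deployment id to recognize messages from a previous incarnation are assumptions made for the sketch, not the NamespaceAwareMetaStore API.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a namespaced metastore keyed by processorId. */
final class PlacementMessageStore {
  static final String REQUESTS = "requests";   // hypothetical namespace names
  static final String RESPONSES = "responses";

  private static final class Entry {
    final String deploymentId;
    final String payload;

    Entry(String deploymentId, String payload) {
      this.deploymentId = deploymentId;
      this.payload = payload;
    }
  }

  private final Map<String, Entry> entries = new HashMap<>();

  /** A later write for the same namespace and processorId overwrites the earlier one. */
  void put(String namespace, String processorId, String deploymentId, String payload) {
    entries.put(namespace + "/" + processorId, new Entry(deploymentId, payload));
  }

  int size() {
    return entries.size();
  }

  /** Removes entries written by a previous incarnation; returns how many were deleted. */
  int deleteStale(String currentDeploymentId) {
    int before = entries.size();
    entries.values().removeIf(e -> !e.deploymentId.equals(currentDeploymentId));
    return before - entries.size();
  }
}
```

With n containers and two namespaces, repeated requests for the same processorId overwrite each other, so the store never holds more than 2n placement messages regardless of how many actions were issued.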

...

  1. To expose this API, a metadata store writer tool can be used. The same tool is going to give open source users access to the Start points APIs.

KV for ContainerPlacementRequestMessages & ContainerPlacementResponseMessage

Key: processorId

Value:

uuid: unique identifier of a request, populated by client

applicationId: unique identifier of the deployed app for which the action is taken

destination-host: valid hostname / “ANY_HOST” / “STANDBY”

request-expiry-timeout: [optional]: timeout for any resource request to cluster manager

Part 2. Container Placement Service

...