...

The figure below describes the lifecycle of a Samza application running on Kubernetes.



Figure 2. Lifecycle of Samza applications running on Kubernetes


  • The run-app.sh script is started, providing the location of your application’s binaries and its config file. The script instantiates an ApplicationRunner, which is the main entry point responsible for running your application.

  • The ApplicationRunner parses your configs and writes them to a special Kafka topic, the Coordinator Stream, for distribution. It then submits a request to the Kubernetes API server to launch the Samza-Operator Pod.

  • The Samza-Operator Pod (the AM, in YARN’s parlance) is started. It is responsible for managing the overall application: it reads configs from the Coordinator Stream and computes work assignments for individual Pods.

  • It also determines the hosts each Pod should run on, taking data locality into account, and then sends Pod creation requests to the API server.

  • The Kubelet watches for these requests and starts the task Pods. If the application’s dependencies are hosted in remote artifact repositories like HDFS, they need to be downloaded to the nodes first. There are several ways to do this:

    • M1: the task Pod can leverage the Kubernetes Init Container functionality to download the dependencies.

    • M2: the regular container can download the dependencies itself before executing its core logic.

      • M1 vs M2: Init Containers are guaranteed to run before regular containers. In M1, if the regular container fails, the Init Container will not be re-run. In M2, if the regular container fails, it must handle that case itself and avoid re-running the download logic.

    • M3: another option is to pre-bake all the dependencies into the container image itself, but this is less flexible because it requires all code and configs to be available in the image. This method can always be used in combination with M1 or M2.

  • When a task Pod starts, it first queries the Samza Operator to determine its work assignments and configs, and then proceeds to execute its assigned tasks.

  • The Samza Operator follows the typical control-loop pattern, ensuring the current state matches the desired state. For example, it monitors how many task Pods are alive and creates new Pods to match the desired replica count.
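The control-loop step above can be sketched as a pure reconcile function. This is a minimal illustration with hypothetical names, not the actual Samza Operator code: given the observed task Pods and the desired replica count, it decides which Pods to create.

```python
# Minimal sketch of the operator's reconcile step (hypothetical names, not
# the actual Samza implementation): compare observed task Pods against the
# desired replica count and return the Pods that should be created.

def reconcile(live_pods, desired_replicas):
    """Return names of new task Pods to create so that the number of live
    Pods matches the desired replica count. Returns [] when at or above
    the desired state."""
    missing = desired_replicas - len(live_pods)
    if missing <= 0:
        return []  # current state already matches (or exceeds) desired state
    existing = set(live_pods)
    new_pods, i = [], 0
    while len(new_pods) < missing:
        name = f"samza-task-{i}"  # illustrative naming scheme
        if name not in existing:
            new_pods.append(name)
        i += 1
    return new_pods
```

In a real operator this function would run inside a watch loop, and the returned names would become Pod creation requests to the API server.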

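The download options above (M1/M2) share the same pitfall noted for M2: a restarted container must not repeat the download. A sketch of that guard, using a marker file (all names hypothetical; the marker only survives a container restart if it lives on a volume, such as an emptyDir, that persists across restarts within the Pod):

```python
# Hypothetical sketch of option M2: download dependencies once before running
# the task, and skip the download on container restarts via a marker file.
import os

def ensure_dependencies(dest_dir, download):
    """Run `download(dest_dir)` exactly once; return True if the download
    was performed, False if a previous run already completed it."""
    marker = os.path.join(dest_dir, ".deps-complete")
    if os.path.exists(marker):
        return False  # already downloaded by a previous container run
    os.makedirs(dest_dir, exist_ok=True)
    download(dest_dir)  # e.g. fetch the job tarball from HDFS (assumption)
    open(marker, "w").close()  # record completion only after success
    return True
```

With M1, the same idempotency comes for free because Kubernetes runs Init Containers once per Pod, not once per container restart.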
...

Host Affinity & Kubernetes


This document describes a mechanism to allow Samza to request containers from YARN on a specific machine. This locality-aware container assignment is particularly useful for containers that need to access their local state on that machine. The mechanism leverages a YARN feature that allows requesting containers by hostname.


A similar primitive is provided in Kubernetes to allow users to request Pods by hostname; this document describes the feature. In particular, the “preferredDuringSchedulingIgnoredDuringExecution” policy can be used to express “run my Pod on host X; if that cannot be satisfied, run it elsewhere.”

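The soft-affinity request described above can be expressed as the following Pod-spec fragment, shown here as the plain dict a container allocator might attach to a task Pod (the helper name and the hostname value are illustrative):

```python
# Sketch of a "soft" host-affinity request: prefer scheduling on `hostname`,
# but fall back to any other node if that host is unavailable.

def preferred_host_affinity(hostname, weight=100):
    """Build the Pod-spec `affinity` fragment for
    preferredDuringSchedulingIgnoredDuringExecution node affinity."""
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # 1-100; higher means stronger preference
                    "preference": {
                        "matchExpressions": [
                            {
                                # well-known node label carrying the hostname
                                "key": "kubernetes.io/hostname",
                                "operator": "In",
                                "values": [hostname],
                            }
                        ]
                    },
                }
            ]
        }
    }
```

Because the policy is “IgnoredDuringExecution,” a Pod already running is not evicted if the preference later becomes unsatisfiable, which matches Samza’s best-effort host-affinity semantics on YARN.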

Alternatively, if remote storage, instead of local storage, can be used for persisting Samza task state, the goal of container-state rebinding can be achieved by dynamically attaching the remote storage to the container, even when the container is restarted on a different host, by leveraging the Kubernetes PersistentVolume primitive. This is especially useful in cloud environments, where remote storage is typically accessible from any node.

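A sketch of that alternative, shown as the two manifests involved: a PersistentVolumeClaim for one task’s state store, and the Pod-spec volume that re-attaches the same claim wherever the task container is rescheduled. Names and the storage size are illustrative placeholders:

```python
# Sketch of remote state via PersistentVolumes: one claim per task, re-attached
# to the task container even if it restarts on a different host.

def task_state_claim(task_name, size="10Gi"):
    """PersistentVolumeClaim manifest for one task's state store
    (size is an illustrative default)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{task_name}-state"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }

def state_volume(task_name):
    """Pod-spec volume referencing the claim; mounting it at the container's
    state directory lets the store survive rescheduling to another host."""
    return {
        "name": "task-state",
        "persistentVolumeClaim": {"claimName": f"{task_name}-state"},
    }
```

The claim outlives any individual Pod, so a task restarted elsewhere rebinds to the same state without re-bootstrapping it from the changelog.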

Reference