Overview

A deployable Samza application currently consists of JARs for Samza infrastructure code (and dependent JARs) and JARs for application-specific code (and dependent JARs). The full deployable package is determined at build time. When deploying an application, the built package of JARs is placed on the necessary node(s), including the nodes that run the job coordinator and the processing containers. This build-time packaging simplifies the deployment responsibilities of Samza infrastructure: the package built by the application has everything needed to run a Samza application. Application owners (who may not be the same as the owners of the Samza infrastructure) choose the version of Samza to use and do the packaging.

One pain point of this model is dependency management. Since applications package the JARs, it is up to them to resolve dependency conflicts. If application-specific code builds against one version of a dependency and Samza infrastructure code builds against a different version of that same dependency, then only one of those versions will actually be used at runtime. As a result, either the application code or the infrastructure code may run against a library version it was not built for, causing failures such as ClassNotFoundException. Some parts of Samza infrastructure are relatively agnostic of application-specific code (e.g. the YARN application master), but they can still be affected by how an application packages its JARs (e.g. which dependencies are included). Samza infrastructure is validated against a certain set of dependencies, but applications can still change the dependencies that are actually used at runtime. These issues lower availability and cost time in debugging. It is also up to the application to fix the packaging.
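When debugging this kind of conflict, it can help to check which JAR a class was actually loaded from. The following is a minimal diagnostic sketch using only standard JDK calls; the class name `LoadedFrom` and the helper `locationOf` are hypothetical names chosen for illustration and are not part of Samza.

```java
import java.security.CodeSource;

public class LoadedFrom {

    /**
     * Returns the location (JAR or directory) that the given class was loaded from.
     * Classes loaded by the bootstrap class loader have no code source,
     * so a placeholder string is returned for them.
     */
    static String locationOf(Class<?> clazz) {
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name to inspect; defaults to this class.
        String className = args.length > 0 ? args[0] : LoadedFrom.class.getName();
        Class<?> clazz = Class.forName(className);
        System.out.println(className + " loaded from " + locationOf(clazz));
    }
}
```

Running this with the same classpath as the job and passing the name of a class from the conflicting library shows which packaged version won.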

It would be helpful to be able to isolate the dependencies of the Samza infrastructure from the dependencies of the application. This SEP covers how to achieve this for the cluster-based job coordinator, which is used when running Samza jobs in resource management systems like YARN.
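As a general illustration of what dependency isolation can look like on the JVM, the sketch below loads infrastructure classes through a separate class loader whose parent is the platform class loader, so classes on the application classpath are not visible to it. This is an assumption made for illustration only and is not necessarily the mechanism this SEP proposes; `IsolationSketch`, `newIsolatedLoader`, and `canSee` are hypothetical names, and `ClassLoader.getPlatformClassLoader()` requires Java 9 or later.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class IsolationSketch {

    /** Returns true if the given loader can load the named class. */
    static boolean canSee(ClassLoader loader, String className) {
        try {
            loader.loadClass(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /**
     * Builds a loader that sees only the given JARs plus JDK classes.
     * Using the platform class loader as the parent skips the application
     * classpath, which is where application dependencies would live.
     */
    static URLClassLoader newIsolatedLoader(URL[] infrastructureJars) {
        return new URLClassLoader(infrastructureJars, ClassLoader.getPlatformClassLoader());
    }

    public static void main(String[] args) throws Exception {
        // An empty JAR list keeps the sketch self-contained; a real setup
        // would pass the URLs of the infrastructure JARs here.
        try (URLClassLoader isolated = newIsolatedLoader(new URL[0])) {
            System.out.println("sees JDK class: " + canSee(isolated, "java.lang.String"));
            System.out.println("sees application class: "
                    + canSee(isolated, IsolationSketch.class.getName()));
        }
    }
}
```

The isolated loader can still resolve JDK classes, but it cannot resolve `IsolationSketch` itself, because that class lives on the application classpath.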

Terms

Term | Description
cluster-based job coordinator | process that is responsible for managing the processing containers of a Samza job (e.g. starting containers, keeping correct # of containers running) when running Samza with a resource management system
YARN | a resource management system which can be used to run Samza jobs
application master | a cluster-based job coordinator in the context of YARN
application runner | Samza component which is responsible for launching an application
application (or application-specific) | code and dependencies which are specific to a particular Samza application, as opposed to Samza infrastructure
pluggable (or plugin) class | class which is specified by an application through configuration (e.g. system factory, grouper)

Requirements

  • Application dependencies should not be able to impact the Samza cluster-based job coordinator
  • The solution should be reusable for the Samza logic running on processing containers