...

  • Application dependencies should not be able to impact the Samza cluster-based job coordinator
  • The solution should be reusable for the Samza logic running on processing containers

Design

New configs

Config key | Description
samza.cluster.based.job.coordinator.dependency.isolation.enabled | Set to "true" to enable cluster-based job coordinator dependency isolation
yarn.resources.__samzaFrameworkApi.path | Path to the Samza framework API resource
yarn.resources.__samzaFrameworkApi.* | Any other YARN resource configurations for the Samza framework API resource
yarn.resources.__samzaFrameworkInfrastructure.path | Path to the Samza framework infrastructure resource
yarn.resources.__samzaFrameworkInfrastructure.* | Any other YARN resource configurations for the Samza framework infrastructure resource
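
As a rough illustration of how these configs might be read, the following sketch checks the isolation flag and, when it is enabled, verifies that both resource paths are present. The class name, the validation rule (both paths required), and the example paths are assumptions made for illustration; only the config keys come from the table above.

```java
import java.util.HashMap;
import java.util.Map;

final class IsolationConfigSketch {
  // Config keys taken from the table above.
  static final String ISOLATION_ENABLED =
      "samza.cluster.based.job.coordinator.dependency.isolation.enabled";
  static final String FRAMEWORK_API_PATH =
      "yarn.resources.__samzaFrameworkApi.path";
  static final String FRAMEWORK_INFRASTRUCTURE_PATH =
      "yarn.resources.__samzaFrameworkInfrastructure.path";

  /** Returns true only when the flag is explicitly set to "true" (case-insensitive). */
  static boolean isIsolationEnabled(Map<String, String> config) {
    return Boolean.parseBoolean(config.getOrDefault(ISOLATION_ENABLED, "false"));
  }

  /**
   * Assumption for this sketch: when isolation is enabled, both framework
   * resource paths must be configured. The design may define this differently.
   */
  static void validate(Map<String, String> config) {
    if (!isIsolationEnabled(config)) {
      return;
    }
    for (String key : new String[] {FRAMEWORK_API_PATH, FRAMEWORK_INFRASTRUCTURE_PATH}) {
      String value = config.get(key);
      if (value == null || value.isEmpty()) {
        throw new IllegalArgumentException("Missing required config: " + key);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    config.put(ISOLATION_ENABLED, "true");
    // Hypothetical resource locations, for illustration only.
    config.put(FRAMEWORK_API_PATH, "hdfs://example/samza-framework-api.tgz");
    config.put(FRAMEWORK_INFRASTRUCTURE_PATH, "hdfs://example/samza-framework-infrastructure.tgz");
    validate(config);
    System.out.println("Isolation enabled: " + isIsolationEnabled(config));
  }
}
```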

Existing JAR management

Currently, Samza infrastructure code and dependencies are included in the tarball with the Samza application. This means that conflicting dependencies between the application and Samza are resolved at build time before the tarball is created, which can cause a certain version of a dependency to be excluded. All JARs in the tarball are installed into a single directory for classpath generation and execution.

...

Generating the Samza API whitelist

To load the Samza API classes from the API classloader, we need to tell cytodynamics which classes those are. We can do this by providing a whitelist of packages/classes when building the cytodynamics classloader. All public interfaces and classes inside samza-api should be considered API classes. One way to generate this whitelist is to use a Gradle task that finds all the classes in samza-api and writes that list to a file. Samza can then read that file when constructing the cytodynamics classloader. The Gradle task should also include classes from samza-kv.
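
The core of such a task can be sketched in plain Java, using only the JDK: list the class entries of one or more JARs and write their fully qualified names to a file. This is a minimal sketch, not the actual Gradle task. The class and method names are hypothetical, it does not use any cytodynamics API, and it lists every class in the JAR without filtering for public visibility, which a real implementation would probably need to do.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

final class ApiWhitelistSketch {
  /** Returns the sorted, fully qualified names of all classes packaged in the given JAR. */
  static List<String> classNamesInJar(Path jar) throws IOException {
    try (JarFile jarFile = new JarFile(jar.toFile())) {
      return jarFile.stream()
          .map(entry -> entry.getName())
          .filter(name -> name.endsWith(".class"))
          // Skip descriptors that are not loadable API classes.
          .filter(name -> !name.endsWith("module-info.class")
              && !name.endsWith("package-info.class"))
          // "org/example/Foo.class" becomes "org.example.Foo".
          .map(name -> name.substring(0, name.length() - ".class".length()).replace('/', '.'))
          .sorted()
          .collect(Collectors.toList());
    }
  }

  /** Writes one class name per line, de-duplicated and sorted, for all given JARs. */
  static void writeWhitelist(List<Path> jars, Path output) throws IOException {
    TreeSet<String> names = new TreeSet<>();
    for (Path jar : jars) {
      names.addAll(classNamesInJar(jar));
    }
    Files.write(output, names);
  }

  // Usage (hypothetical): java ApiWhitelistSketch <output-file> <jar> [<jar> ...]
  public static void main(String[] args) throws IOException {
    if (args.length < 2) {
      System.err.println("usage: ApiWhitelistSketch <output-file> <jar> [<jar> ...]");
      return;
    }
    List<Path> jars = new ArrayList<>();
    for (int i = 1; i < args.length; i++) {
      jars.add(Paths.get(args[i]));
    }
    writeWhitelist(jars, Paths.get(args[0]));
  }
}
```

Passing both the samza-api and samza-kv JARs to `writeWhitelist` would produce a single combined list, matching the requirement that the task cover both modules.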

...

If this feature is on, then there is some potential runtime impact:

...

Previously, the application packaged all Samza code and determined the dependencies, and that was what was used for the application runner, job coordinator, and processing containers. This meant that all runtime code was consistent across the Samza processes. With isolation, there may be an inconsistency between Samza and its dependencies used in the job coordinator when compared to the runner and processing containers. If there is any flow which requires the same set of dependencies to be used across all 3 pieces, then there would be a problem.

...

An example of an issue could be if Java serialization is used to serialize a class on the application runner, and then it is deserialized on the job coordinator, where the version of the class is different than the version on the runner.
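
The failure mode described above can be reproduced in a single process. Two versions of a class cannot coexist in one self-contained example, so this sketch simulates the version difference by altering the serialVersionUID recorded in the serialized bytes before reading them back. The `Payload` class and all helper names are hypothetical, and the byte patching is purely a demonstration device, not something Samza does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

final class SerialVersionMismatchSketch {
  /** Hypothetical class that is written on the runner and read on the job coordinator. */
  static final class Payload implements Serializable {
    private static final long serialVersionUID = 1L;
    final String value;

    Payload(String value) {
      this.value = value;
    }
  }

  static byte[] serialize(Object obj) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    }
    return bytes.toByteArray();
  }

  static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return in.readObject();
    }
  }

  /**
   * Simulates a writer that used a different class version. In the stream's
   * class descriptor, the 8-byte serialVersionUID directly follows the class
   * name, so flipping one bit there makes the stream disagree with the local class.
   */
  static byte[] withDifferentSerialVersionUid(byte[] data, Class<?> type) {
    byte[] name = type.getName().getBytes(StandardCharsets.UTF_8);
    int start = indexOf(data, name);
    if (start < 0) {
      throw new IllegalArgumentException("Class name not found in stream: " + type.getName());
    }
    byte[] copy = data.clone();
    copy[start + name.length + 7] ^= 0x01; // lowest byte of the serialVersionUID
    return copy;
  }

  private static int indexOf(byte[] haystack, byte[] needle) {
    for (int i = 0; i + needle.length <= haystack.length; i++) {
      int j = 0;
      while (j < needle.length && haystack[i + j] == needle[j]) {
        j++;
      }
      if (j == needle.length) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) throws Exception {
    byte[] data = serialize(new Payload("hello"));
    System.out.println("Same version reads: " + ((Payload) deserialize(data)).value);
    try {
      deserialize(withDifferentSerialVersionUid(data, Payload.class));
    } catch (InvalidClassException e) {
      System.out.println("Different version fails: " + e.getMessage());
    }
  }
}
```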

Samza serializes data into strings to pass it between processes. The serialized data falls into the following categories:

  • Plain strings (e.g. configs): Normal strings should be compatible across versions
  • JSON (e.g. checkpoints): JSON has reasonable compatibility concepts built-in, but they need to be considered when the data models are changed
  • Serializable objects (e.g. serdes in configs): Need to follow the versioning rules in the Java Object Serialization Specification (https://docs.oracle.com/javase/8/docs/platform/serialization/spec/version.html) when changing Serializable objects

Within Samza, the data categories can be controlled and compatibility rules can be followed.

However, it is difficult to strictly control compatibility across versions of dependencies. It is possible that a certain dependency version serializes some data, but then a different process is unable to deserialize it, because it is using a different version of the dependency. Practically, it is not expected that this will be a common issue, since dependencies should generally be relatively consistent and it is uncommon to use third-party serialization, but it is still possible.

Once we have general split deployment, this will no longer be a problem, because the version of Samza used across all parts of the application will be consistent.

Alternative solutions

Alternative solutions for SEP-24