...

Generating the Samza API whitelist

In order to load the Samza API classes from the API classloader, we need to tell cytodynamics what those classes are. We can do this by providing a whitelist of packages/classes when building the cytodynamics classloader. All public interfaces/classes inside samza-api should be considered API classes. One way to generate this whitelist is to use a Gradle task to find all the classes in samza-api and write that list to a file. Then, that file can be read by Samza when constructing the cytodynamics classloader. The Gradle task should also include classes from samza-kv.
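
As a rough illustration, here is a minimal sketch of the generation logic written as plain Java (rather than an actual Gradle task); the jar path and output file name are assumptions, not the real build layout:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Enumeration;
    import java.util.List;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;

    public class ApiWhitelistGenerator {
      public static void main(String[] args) throws IOException {
        List<String> classNames = new ArrayList<>();
        // Hypothetical jar path; a real Gradle task would take this from its inputs.
        try (JarFile jar = new JarFile("samza-api/build/libs/samza-api.jar")) {
          for (Enumeration<JarEntry> entries = jar.entries(); entries.hasMoreElements(); ) {
            String entryName = entries.nextElement().getName();
            if (entryName.endsWith(".class")) {
              classNames.add(entryName
                  .substring(0, entryName.length() - ".class".length())
                  .replace('/', '.'));
            }
          }
        }
        // Restricting this to public classes only would additionally require loading
        // each class and checking Modifier.isPublic(clazz.getModifiers()).
        // Samza reads this file when constructing the cytodynamics classloader.
        Files.write(Paths.get("api-whitelist.txt"), classNames);
      }
    }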

...

Classes                               Description
samza-api                             main API classes
samza-kv                              some classes from here are used by implementations of pluggable classes
org.apache.logging.log4j:log4j-api    see Logging below for more information
org.apache.logging.log4j:log4j-core   see Logging below for more information

...

The API and infrastructure classloaders each need a package of JARs which is isolated from the application. Those packages need to be built separately from an application. They need to include the core Samza components (e.g. samza-api, samza-core), and they can contain any pluggable components used across many applications (e.g. samza-kafka). The directory structure of the API and infrastructure packages should be the same as the structure for the application (e.g. scripts in the "bin" directory, libraries in the "lib" directory).
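
For illustration, a framework package might mirror the application layout like this (the package names and file placement are examples, not a prescribed layout):

    samza-framework-api/
      bin/    scripts (e.g. run-class.sh)
      lib/    JARs (e.g. samza-api, samza-kv)
    samza-framework-infrastructure/
      bin/    scripts
      lib/    JARs (e.g. samza-core, samza-kafka, samza-log4j2)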

The packaging is left to the group of Samza jobs that are using the same set of job coordinator JARs, since different components may be included by different jobs. Multiple tools exist for building the packages (e.g. Gradle, Maven).

An example of packaging will be included in the samza-hello-samza project.

Dependencies

API classloader dependencies

  • (required) samza:samza-api
  • (required) samza:samza-kv: includes KeyValueStorageEngine, which is a base class for StorageEngine
  • (optional; if using samza-log4j2 as infrastructure) log4j2 API/core

...

In the single classloader case, all classes could easily use logging through static access to the logging API (e.g. slf4j). Samza only had to do a little bit of management to choose between the log4j binding and the log4j2 binding for slf4j.
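
For reference, this is the conventional static usage (the class name is illustrative):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class MyStreamTask {
      // Resolved once, statically, through whichever slf4j binding is on the classpath.
      private static final Logger LOG = LoggerFactory.getLogger(MyStreamTask.class);

      public void process() {
        LOG.info("processing an envelope");
      }
    }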

With multiple classloaders, we have to be more careful. Log4j does certain things to make it easy to use; for example, it uses static contexts to aggregate logging across multiple classes. However, static contexts are not shareable if they get loaded by different classloaders, which can cause issues with conventional log4j usage. There are other areas to watch out for as well (see Notes regarding logging in a multiple-classloader scenario in the Appendix for more context).
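
The following standalone sketch shows why static state cannot be shared: when two classloaders each define the same class, the JVM treats them as two distinct classes with independent statics (the jar path is a placeholder):

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;

    public class StaticContextDemo {
      public static void main(String[] args) throws Exception {
        // Placeholder path to any jar containing the class loaded below.
        URL[] jars = { new File("lib/log4j-core.jar").toURI().toURL() };
        // A null parent means only the bootstrap classloader is consulted, so each
        // URLClassLoader defines the log4j classes itself.
        ClassLoader first = new URLClassLoader(jars, null);
        ClassLoader second = new URLClassLoader(jars, null);
        Class<?> fromFirst = first.loadClass("org.apache.logging.log4j.core.Appender");
        Class<?> fromSecond = second.loadClass("org.apache.logging.log4j.core.Appender");
        // Prints "false": two distinct Class objects, so static fields are not
        // shared and instances of one are not assignable to the other.
        System.out.println(fromFirst == fromSecond);
      }
    }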

Supporting isolation for Samza implementations of log4j pluggable components

It would be good if we can split deploy Samza implementations of log4j pluggable components (e.g. StreamAppender).

Changes

  1. To use the pluggable components from the infrastructure classloader, the context classloader needs to be set to the infrastructure classloader (see the sketch after this list).
  2. Framework API module will include slf4j and log4j2 dependencies (including log4j2 binding). Only log4j-api and log4j-core classes will be in the API whitelist.
    1. slf4j dependencies are just needed for the classes in the API module which use logging.
      1. We should not add the slf4j API nor any slf4j binding to the parent-preferred whitelist for the API classloader. If the application does not want to use the logging framework that is used by API/infrastructure, then that should be allowed. This does mean that the application will always need to include an slf4j binding on its classpath if it is using slf4j, even if it is the slf4j to log4j2 binding. If the slf4j to log4j2 binding is included by the application, then it will delegate to the API classloader for log4j-api classes.
    2. log4j-api is included in the API whitelist so that the log4j2 concrete classes which implement log4j-api classes (e.g. LoggerContextFactory) and are loaded by the context classloader are compatible with the application layer.
    3. log4j-core is included in the API whitelist since some log4j2 concrete classes which implement log4j-core interfaces (e.g. Appender) come from the application classloader, and those need to be compatible with the infrastructure layer.
  3. Infrastructure module will include slf4j and log4j2 dependencies (including log4j2 binding). It will also include samza-log4j2.
    1. slf4j-api and log4j-slf4j-impl are needed for the classes in the API module which use logging.
    2. log4j-api classes will end up getting loaded from the API classloader, so it's not necessary to include it, but it will be transitively pulled in and it is not necessary to exclude it.
    3. log4j-core is needed for base log4j2 functionality and for being able to use custom Samza log4j2 plugins
    4. samza-log4j2 is for including custom Samza log4j2 plugins
  4. When setting the log4j2 configuration file ("log4j.configurationFile" system property), we need to use the application's log4j2.xml if it exists. If the application does not provide that file, then we need to provide a default log4j2.xml in the infrastructure classpath.
    1. This can be done by passing an extra environment variable which is the "application lib directory" which may contain the application's log4j2.xml file to the job coordinator execution, and then reading that environment variable in the run-class.sh script when setting the log4j configuration system property.
  5. All classloaders (API, infrastructure, application) need to exclude "log4j:log4j" (i.e. log4j1) from the classpath and use "org.apache.logging.log4j:log4j-1.2-api" (i.e. bridge from log4j1 to log4j2). This means log4j1 will not be supported, and log4j2 must be used.
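
As a minimal sketch of change 1 above (the classloader variable and the method being wrapped are illustrative, not actual Samza APIs):

    // infrastructureClassLoader is assumed to be the cytodynamics classloader
    // that can see samza-log4j2 and log4j-core.
    Thread currentThread = Thread.currentThread();
    ClassLoader previous = currentThread.getContextClassLoader();
    currentThread.setContextClassLoader(infrastructureClassLoader);
    try {
      // Logging initialized within this scope finds log4j2 plugins (e.g.
      // StreamAppender) through the context classloader, i.e. from the
      // infrastructure classpath.
      runJobCoordinator(); // illustrative method
    } finally {
      currentThread.setContextClassLoader(previous);
    }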

Pros

  • Able to split deploy log4j2 pluggable components built by Samza
  • Can override Samza infrastructure logging configuration
  • Applications can choose their own logging API

Cons

  • Samza ends up controlling log4j2 API version
  • Need to figure out how to manage configuration files for log4j2 correctly
  • No support for log4j1, so existing apps would need to migrate to log4j2

External context

On the job coordinator, no ExternalContext is built, so there should be no conflict between Samza infrastructure and application. Therefore, we don't need to do anything for isolation for ExternalContext usage in the job coordinator.

We will need to consider the conflict on the application runners when ExternalContext is used in SamzaApplication.describe, and we need to ensure that the pattern we choose also works for general split deployment, where ExternalContext usage on processing containers must be considered. This will be discussed in other designs.

Beam

Some Beam infrastructure code runs on Samza RAIN hosts when a deployment is requested. This is needed for creating the SamzaApplication in order to call the Samza RemoteApplicationRunner. Beam can have its own container dependency which includes the Samza infrastructure JARs. This allows the Beam applications to not explicitly specify a Samza version.

No Beam-specific code runs on the application master, so we do not need to make additional changes for that part.

SQL

Currently, Samza SQL applications just consist of SQL statements (i.e. text in a file).

The functionality provided by this document should not currently be leveraged by Samza SQL, since Samza SQL requires general split deployment, and isolation is not needed because there are no application JARs. We still need to ensure that the new functionality does not break the existing Samza SQL functionality. One area to watch out for is that Samza SQL currently uses the SQL framework code as the main classpath, so that should not break.

In the future, UDFs should be able to be specified by applications. We should be able to leverage the separate classloader solution for this. Also, it is possible in the future that the job coordinator will need to run SQL-specific code. This would likely be a pluggable component, so we should be able to handle that by including it on the Samza infrastructure classpath.

Backward Compatibility

If this feature is off, this is backwards compatible, because we will use the old single-classpath model.

If this feature is on, then there is some potential runtime impact: Previously, the application packaged all Samza code and determined the dependencies, and that was what was used for the application runner, job coordinator, and processing containers. This meant that all runtime code was consistent across the Samza processes. With isolation, there may be an inconsistency between Samza and its dependencies used in the job coordinator when compared to the runner and processing containers. If there is any flow which requires the same set of dependencies to be used across all 3 pieces, then there would be a problem. An example of an issue could be if Java serialization is used to serialize a class on the application runner, and then it is deserialized on the job coordinator, where the version of the class is different than the version on the runner.

Samza serializes data into strings to pass them between processes. There are certain categories of data that are serialized into strings:

  • Plain strings (e.g. configs): Normal strings should be compatible across versions
  • JSON (e.g. checkpoints): JSON has reasonable compatibility concepts built-in, but they need to be considered when the data models are changed
  • Serializable objects (e.g. serdes in configs): Need to follow https://docs.oracle.com/javase/8/docs/platform/serialization/spec/version.html when changing Serializable objects
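
As a small illustration of the Serializable point above, pinning serialVersionUID keeps simple evolutions (such as adding a field) deserialization-compatible across processes running different versions of a class (the class here is hypothetical):

    import java.io.Serializable;

    public class ExampleSerdeConfig implements Serializable { // hypothetical class
      // Keep this value stable across releases so that older and newer processes
      // can exchange serialized instances.
      private static final long serialVersionUID = 1L;

      private String encoding;
      // Per the Java serialization spec, adding a new field later is a compatible
      // change; removing a field or changing its type is not.
    }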

Within Samza, the data categories can be controlled and compatibility rules can be followed.

However, it is difficult to strictly control compatibility across versions of dependencies. It is possible that a certain dependency version serializes some data, but then a different process is unable to deserialize it, because it is using a different version of the dependency. Practically, it is not expected that this will be a common issue, since dependencies should generally be relatively consistent and it is uncommon to use third-party serialization, but it is still possible.

Once we have general split deployment, this will no longer be a problem, because the version of Samza used across all parts of the application will be consistent.

Testing

Local testing

We can use samza-hello-samza to test this locally. It has scripts to set up Zookeeper, Kafka, and YARN locally. The local YARN deployment will give the process isolation necessary to test the AM.

  1. Locally build the framework tarballs for API and infrastructure. It would be useful to put an example somewhere for how to build those tarballs.
  2. Deploy Zookeeper, Kafka, and YARN locally (https://samza.apache.org/startup/hello-samza/latest/).
  3. Fill in certain configs (see New configs above). These will go into the properties file passed to the run-app.sh script.
  4. Create the tarball for the application (https://samza.apache.org/startup/hello-samza/latest/). For testing local changes, remember to run the "publishToMavenLocal" command.

Automated integration test

  • Build API and infrastructure framework artifacts
  • Build a simple test job with dependency isolation enabled
    • This will require multiple configs, including the location of the framework artifacts for YARN resources (see New configs above).
  • Use the integration test framework (which uses real YARN) to check that the job runs successfully

Alternative solutions

...

No isolation for Samza implementations of log4j pluggable components

It is not required to have isolation for log4j pluggable components when packaging the job coordinator JARs. Instead of needing to set up log4j in the framework packages, all of the logging dependencies (including the log4j pluggable components) can be included in the application package. This means that if Samza implementations of log4j pluggable components are to be used, they will all be on the same classpath as the application code, so the effectiveness of isolation is reduced.

The application classpath still needs to have an slf4j binding so that the Samza framework code can use it.

The log4j components do need to be excluded from the framework API and infrastructure packages.

Pros

  • Easier to do packaging for job coordinator JARs
  • Logging flow is less complex, since the application provides the concrete logging implementations
  • Application has more flexibility in choosing how to do logging

Cons

  • Samza implementations of log4j pluggable components are not isolated from infrastructure, so isolation is less effective

Alternative solutions for SEP-24

Appendix

Notes regarding logging in a multiple-classloader scenario

  • Log4j searches for a configuration file specified by the "log4j.configuration" system property (or "log4j.configurationFile" for log4j2). If that property is not specified, then log4j will try to find a log4j.xml (or log4j2.xml for log4j2) file on the classpath. Note that log4j2 will also look for a log4j2.xml if the file specified at "log4j.configurationFile" is not found. See LogManager for the log4j implementation and ConfigurationFactory for the log4j2 implementation.
    • Samza does specify the "log4j.configuration" property in run-class.sh.
    • If the "log4j.configuration" system property is an accessible file, then all classloaders will be able to load it.
    • The log4j.xml file will only be searched for through the current classloader.
  • When initializing a class that has a static slf4j Logger field, the LoggerFactory and some core log4j components/interfaces will be loaded from the "current" classloader. However, some pluggable log4j components (e.g. Appender) will be loaded through Thread.getContextClassLoader() and then passed back to the "current" classloader. If the context classloader loads core log4j components separately from the "current" classloader, then the appenders can't be shared, since the Appender interface would need to come from the same classloader.
    • A config "log4j.ignoreTCL" does exist to ignore the context classloader. Log4j will fall back to using the current classloader if the context class loader is not found or ignored (see org.apache.log4j.helpers.Loader). Samza doesn't currently set the context class loader, although it might be possible that the context class loader gets set by some system using Samza.
  • We should not instantiate multiple instances of RollingFileAppender which write to the same file at the same time due to concurrency issues. Usually, this isn't something to worry about since logging is initialized statically, but when there are multiple classloaders, it is possible to instantiate multiple appenders at the same time.
    • Some log appender implementations could work concurrently. For example, StreamAppender should work as long as the system is able to handle concurrent logging events.
  • Log4j2 does some special resource loading involving looking at the parent classloader of the context classloader (see ProviderUtil), so we need to be careful if log4j-core is on both the API and infrastructure classpaths, since it might lead to using the same class from both classloaders.
    • This can lead to error logs of the form "Unrecognized format specifier" and "Unrecognized conversion specifier", since plugins get loaded from one classloader and get sent to the other.
  • If a context classloader is set, then all log4j2 plugins are loaded from that classloader. Otherwise, it will load from the "current classloader".
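
As a debugging aid for the notes above, a small sketch (not part of the design) can print which classloader actually defined a log4j2 class, to detect when log4j-core has been loaded by more than one classloader:

    public class ClassLoaderProbe {
      public static void main(String[] args) throws ClassNotFoundException {
        // Resolve the class the same way log4j2 plugin loading would: through the
        // thread context classloader if one is set.
        ClassLoader tccl = Thread.currentThread().getContextClassLoader();
        Class<?> appender = Class.forName("org.apache.logging.log4j.core.Appender",
            false, tccl != null ? tccl : ClassLoaderProbe.class.getClassLoader());
        // If this prints different loaders when run from different components,
        // log4j-core is being defined by more than one classloader.
        System.out.println("Appender defined by: " + appender.getClassLoader());
      }
    }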