More tightly couple job coordinator JARs with application JARs so they are consistent

Instead of isolating the job coordinator JARs from the application JARs, we could force both to depend on the same versions of JARs. Dependency tooling could pin all of the application dependencies to match the Samza infrastructure dependencies.

Pros

  • Easier to implement, since we can continue to build and deploy everything from the application side

Cons

  • Applications are forced to use certain dependencies
    • Samza must be rebuilt with new dependencies whenever an application needs an upgraded dependency
  • Requires rebuilding all applications whenever Samza wants to upgrade a dependency

Decoupling job coordinator JARs from application JARs

  • Using cytodynamics with three classloaders (infrastructure, infrastructure plugins, application) and explicitly wiring classloader through Samza
    • Pros
      • More explicit about classloader being used for pluggable classes
      • Straightforward to specify classloader to use (e.g. no reflection needed)
    • Cons
      • Need to wire classloader through anywhere in Samza that uses pluggable classes
        • Might be hard to remember to do this everywhere when evolving Samza
      • Infrastructure and infrastructure plugins overlap significantly in their dependencies (e.g. samza-core)
  • Using cytodynamics with only two classloaders (infrastructure, application)
    • Pros
      • Simpler to maintain fewer classloaders
    • Cons
      • If infrastructure needs to use reflection to load a class on the application classpath, and it is unable to explicitly specify a classloader to use (e.g. 3rd party libraries like Avro), then it will be unable to find the necessary classes. The application classloader already delegates to the infrastructure classloader, so the infrastructure classloader can't also delegate to the application classloader.
  • Shading
    • This involves changing the namespace of classes when generating a JAR so that duplicate classes from different sources no longer conflict.
    • Pros
      • No code changes
      • Shaded fat JAR provides isolation and is self-contained
    • Cons
      • Can't shade anything which is part of the public API, including third party classes
        • Any unshaded classes can still cause runtime conflicts
        • Shading works best when everything is shaded
      • Special build time step to do shading
      • Need to be careful when using reflection and code generation
        • Samza uses a lot of reflection (application might need to use shaded class names)
  • Java Service Provider Interfaces
    • This is a pattern for loading different implementations of an interface for use by an application. Implementations are installed separately and declared through provider-configuration files.
    • Pros
      • Helps to ensure good isolation of user code and dependencies since they can be installed separately
      • Can help to keep structure of application installation clean
    • Cons
      • Still need custom classloaders to isolate service provider implementations (e.g. still need to load API classes from parent classloader)
      • Need to package META-INF files to specify provider implementations, and those would need to match with the configs
      • Need to filter out implementations that aren't needed (e.g. only need a single system implementation per stream)
      • The service provider pattern is better suited for optional extensions to some base functionality, but Samza requires certain components from the user (e.g. task)
  • OSGi
    • OSGi is a framework which uses isolated "bundles" to provide implementations of "services". Bundles communicate through the services. It also provides functionality around detecting when new bundles get registered and unregistered.
    • One open source implementation of this is Apache Felix.
    • There does not seem to be any serialization overhead for local services (https://stackoverflow.com/questions/11222933/overheads-involved-in-osgi). For remote services (which we wouldn't need in Samza), there are some restrictions (https://osgi.org/specification/osgi.cmpn/7.0.0/service.remoteservices.html, https://osgi.org/specification/osgi.core/7.0.0/framework.dto.html).
    • Pros
      • Good isolation of user code and dependencies since they can be installed separately
      • Can help to keep structure of application installation organized
      • Granular specification of dependencies between services by using package name
    • Cons
      • Samza would need to have a specific structure to work with OSGi
        • Would require refactoring Samza classloading to work with OSGi pattern of accessing services
          • Some cases like having multiple systems of the same type might not fit as well in this pattern
        • Requires some specific structure for each bundle
      • Need to specify service dependencies in each bundle (both input and output)
        • Maybe can automate some of this
      • Working with multiple classloaders is not obvious to developers, so certain assumptions become invalid (e.g. static variables are not shared across classloaders)
      • Has difficulty with case of third party infrastructure dependencies needing to classload something from the application classpath (e.g. Avro)
      • Extra dependency for Samza (seemingly a pretty heavy dependency since it provides a lot of functionality, much of which will be unused by Samza)
  • Use multiple java.net.URLClassLoader instances to isolate infrastructure from application
    • Pros
      • No additional external dependencies, since URLClassLoader is built-in to Java
    • Cons
      • Application URLClassLoader needs to load API classes from parent, but URLClassLoader can't differentiate between direct API classes and dependencies, so application code might end up using the dependencies of API classes
  • Custom classloader implementation
    • This involves implementing classloader logic specifically for Samza.
    • Pros
      • No additional external dependencies
    • Cons
      • Reimplements functionality that cytodynamics already provides
  • Associate Samza infrastructure with one process and associate the application with another, and have them communicate through interprocess API calls
    • Pros
      • Infrastructure is fully separated from the application since they run in different JVMs
    • Cons
      • Significant network traffic and serialization/deserialization impacts performance
      • Need to manage interprocess APIs; might be harder to evolve interprocess APIs
      • Significant changes to execution structure
  • Only install Samza-owned infrastructure JARs for running the job coordinator
    • Pros
      • Simple (e.g. no special classloading required)
    • Cons
      • Inconsistency with general split deployment since initialization of JVM will be different with other components (e.g. runner, processing containers)
        • This cannot be leveraged for general split deployment, and it would become obsolete once general split deployment is solved.
      • Still have some pluggable classes that could be specified by applications (e.g. SystemFactory, groupers)
  • Put application JARs (including dependencies) first in classpath, and then put Samza JARs (including dependencies)
    • Pros
      • Simple to implement
    • Cons
      • Does not provide actual isolation, since Samza infrastructure will use application dependencies
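
Several of the classloader options above hinge on the same delegation constraints. The following is a minimal sketch with plain URLClassLoaders (the class names are hypothetical stand-ins, not Samza classes) illustrating both points: standard parent-first delegation cannot distinguish API classes from their dependencies, and a parent classloader cannot delegate "down" to find classes that exist only on a child's classpath.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class DelegationDemo {
    // Returns true if the given loader can resolve the named class.
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // The loader that loaded this class plays the role of the infrastructure loader.
        ClassLoader infraLoader = DelegationDemo.class.getClassLoader();

        // "Application" loader with no JARs of its own: parent-first delegation sends
        // every request up to the parent, so there is no way to delegate only for API
        // classes while hiding the parent's dependencies.
        try (URLClassLoader appLoader = new URLClassLoader(new URL[0], infraLoader)) {
            System.out.println("app resolves from parent: "
                + (appLoader.loadClass("DelegationDemo").getClassLoader() == infraLoader));
        }

        // An isolated loader (no URLs, bootstrap parent) stands in for infrastructure
        // that cannot see the application classpath: reflective loads routed through it
        // (e.g. by a 3rd party library like Avro) fail, since delegation only flows upward.
        ClassLoader isolated = new URLClassLoader(new URL[0], null);
        System.out.println("isolated loader sees DelegationDemo: "
            + canLoad(isolated, "DelegationDemo"));
    }
}
```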
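
For the Java Service Provider Interfaces option, the discovery mechanism looks roughly like the sketch below (the `SystemFactory` interface here is a hypothetical placeholder, not the real Samza one). A provider JAR would declare its implementation class in a META-INF/services provider-configuration file; since none is packaged in this sketch, discovery yields nothing.

```java
import java.util.ServiceLoader;

public class SpiDemo {
    // Hypothetical plugin interface; real providers would list their implementation
    // class name in META-INF/services/SpiDemo$SystemFactory inside the provider JAR.
    public interface SystemFactory {
        String systemName();
    }

    static int countProviders() {
        int count = 0;
        // ServiceLoader scans the classpath for provider-configuration files and
        // instantiates each declared implementation.
        for (SystemFactory factory : ServiceLoader.load(SystemFactory.class)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // No META-INF/services entry is packaged here, so nothing is discovered.
        System.out.println("providers found: " + countProviders());
    }
}
```

Note that ServiceLoader discovers every installed implementation, which is why the cons above mention filtering out implementations that are not needed.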

Generating the Samza API whitelist

  • Use cytodynamics @Api annotations with Samza API classes
    • Pros
      • Close coupling for annotation and class is straightforward and self-descriptive
    • Cons
      • Need to add annotation to all Samza API classes
  • Manually maintain a file/config in Samza with the class whitelist
    • Pros
      • No need to change Samza API classes
      • Flexibility to add anything to whitelist
    • Cons
      • Need to remember to add classes to the whitelist
      • If stored as a config, the whitelist might grow too large for a single config entry
  • Whitelist org.apache.samza.*
    • Pros
      • Easy to implement and maintain
    • Cons
      • Pulls in non-API Samza classes as "API", so we can't make any backwards incompatible changes to any Samza classes
  • Rename all API classes to be org.apache.samza.api.*, and then whitelist that package name
    • This might be a feasible solution for general split deployment.
    • Pros
      • Easy to implement and maintain
    • Cons
      • Backwards incompatible
  • Separate classloader which only has Samza API classes, then whitelist org.apache.samza.*
    • Pros
      • Easy to implement and maintain
    • Cons
      • Need to keep track of an extra classloader
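
The package-prefix options above amount to a one-line predicate; the sketch below shows both why they are easy to maintain and why they over-match (class names chosen for illustration: StreamTask is public API, while SamzaContainer is an internal samza-core class).

```java
public class ApiWhitelist {
    // Package-prefix whitelist: trivial to maintain, but it cannot tell intended
    // API classes apart from internal Samza classes under the same prefix.
    static boolean isApi(String className) {
        return className.startsWith("org.apache.samza.");
    }

    public static void main(String[] args) {
        System.out.println(isApi("org.apache.samza.task.StreamTask"));              // intended API
        System.out.println(isApi("org.apache.samza.container.SamzaContainer"));     // internal, but matched anyway
        System.out.println(isApi("org.apache.kafka.clients.producer.KafkaProducer")); // dependency, excluded
    }
}
```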

Classloader wiring

  • Set "java.system.class.loader" property to be custom classloader
    • Pros
      • No special code needed for calling custom classloader
      • ClassLoader.getSystemClassLoader will return the custom classloader
    • Cons
      • Need to modify "java" command in launch script to set property, and that requires detecting in the launch script if split deployment is enabled
      • Need to follow a certain constructor structure for custom classloader
  • Inject classloader as a dependency to any class which loads pluggable classes
    • Pros
      • More explicit about classloader being used for pluggable classes
      • Straightforward to specify classloader to use (e.g. no reflection needed)
    • Cons
      • Need to wire classloader through anywhere in Samza that uses pluggable classes
        • Might be hard to remember to do this everywhere when evolving Samza
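
For the "java.system.class.loader" option, the JVM instantiates the named class through a specific constructor shape: a public constructor taking the default system classloader as the parent (this is the constructor structure referred to in the cons). A minimal sketch:

```java
public class CustomSystemLoader extends ClassLoader {
    // The JVM instantiates the class named by -Djava.system.class.loader via this
    // exact constructor, passing the default system classloader as the parent.
    public CustomSystemLoader(ClassLoader parent) {
        super(parent);
        // Custom delegation logic for split deployment would go here.
    }

    public static void main(String[] args) throws Exception {
        // Launched as: java -Djava.system.class.loader=CustomSystemLoader ...
        // Here we just instantiate it directly to show the constructor contract.
        CustomSystemLoader loader = new CustomSystemLoader(ClassLoader.getSystemClassLoader());
        System.out.println(loader.loadClass("java.lang.String") == String.class);
    }
}
```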

Packaging the job coordinator JARs

  • Add a new module to samza for packaging the infrastructure JARs
    • Pros
      • Samza users don't have to package the framework JARs on their own
      • Can help ensure that the Samza framework package contains the version of Samza that is directly depended on by the application
    • Cons
      • Might not be practical to do this, since Samza users might want to determine their own set of dependencies in the framework package

Localizing the job coordinator JARs

  • Include the Samza infrastructure code in the same artifact as the application so it gets all downloaded at once
    • Pros
      • Only need to download the application artifact
    • Cons
      • Depends on application to package everything correctly (e.g. need to put infrastructure JARs in a different place than application JARs)
      • Likely harder to switch infrastructure JARs out for full split deployment
        • Would require the application artifact to change when switching JARs out
        • Would need to re-localize infrastructure JARs only into a location that was created in the localization step
  • Job coordinator uses whatever Samza version is in the classpath (external system would need to localize the JARs)
    • Pros
      • No need to specify the version directly as part of the application
    • Cons
      • Depends on external system
      • Harder to keep track of which version is actually being used
  • Make Samza JARs available on all nodes, copy them to the application directory on deployment
    • Pros
      • JARs just need to be copied over to application directory (or can possibly read directly from the original install location)
    • Cons
      • Unable to support multiple versions of Samza
      • Operability challenges: need to know full set of hosts for installing Samza JARs; when cluster expands, need to remember to deploy to new hosts
  • Install all versions of Samza onto all nodes
    • Pros
      • JARs just need to be copied over to application directory (or can possibly read directly from the original install location)
    • Cons
      • Can end up being a lot of versions to package together
        • Need to support all released versions in case an application wants a specific version
      • Operability challenges: need to know full set of hosts for installing Samza JARs; when cluster expands, need to remember to deploy to new hosts

Generating classpaths for the JARs

  • Allow JAR directories to be passed as an environment variable
    • Pros
      • Relatively simple
    • Cons
      • Probably need to set environment variables in run-jc.sh, and that will be hard to evolve (e.g. if JARs aren't in the app workspace in the future)
  • Allow JAR directories to be passed as a config
    • Pros
      • Can specify any location for infrastructure JARs
    • Cons
      • Difficult to specify JAR locations if the JARs are in the app workspace, since the app workspace directory name is generated at deployment time
        • Could possibly specify a "relative" path, but might also need to support an "absolute" path in the future
  • Allow JAR directories to be passed as an environment variable and a config
    • Pros
      • Flexible for specifying JAR locations
      • Might be somewhat similar pattern as specifying coordinator stream configs or logged store directory
    • Cons
      • Might still need to put default environment variable values in run-jc.sh
      • Relatively more complex solution
  • Pass "base directory" from run script as an environment variable
    • Pros
      • Can construct JAR locations in code
      • Relatively simple
      • Might be useful for other cases in the future
    • Cons
      • Need to modify run scripts
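
Under the base-directory option, constructing the classpath in code could look like the sketch below (the SAMZA_BASE_DIR variable name and the "lib" layout are assumptions for illustration, not an existing Samza convention).

```java
import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class ClasspathBuilder {
    // Collect the URLs of all JARs directly under the given directory.
    static List<URL> jarUrls(File dir) throws Exception {
        List<URL> urls = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.getName().endsWith(".jar")) {
                    urls.add(f.toURI().toURL());
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical env var set by the run script; the "lib" subdirectory
        // layout is an assumption.
        String baseDir = System.getenv("SAMZA_BASE_DIR");
        if (baseDir != null) {
            System.out.println(jarUrls(new File(baseDir, "lib")));
        }
    }
}
```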

Logging

  • Support log4j2 in framework, but isolate log4j1 usage between framework and application
    • Pros
      • Infrastructure only has to worry about log4j2
    • Cons
      • log4j1 also uses the context classloader to load pluggable components, so application usage would still route to the infrastructure (infrastructure needs to include log4j1 classes since some dependencies still use log4j1)
        • This becomes "Support log4j1 in framework".
        • log4j has an "ignoreTCL" config to ignore the context classloader, but log4j1 and log4j2 use the same config, and we can't ignore the context classloader for log4j2
  • Support log4j1 in framework
    • Pros
      • Existing apps using log4j1 can continue to use it
    • Cons
      • Need to manage configurations for both log4j1 and log4j2 at the same time
  • Include slf4j API in API whitelist
    • Pros
      • Application doesn't have to manage binding for slf4j
    • Cons
      • Application has less flexibility in logging implementation
      • Still need to deal with direct usage of log4j1/log4j2 in application
  • Require log4j1 to be included in application classpath
    • Pros
      • Infrastructure only has to worry about log4j2, since it can delegate to application for log4j1
    • Cons
      • Would be an odd requirement for applications to include log4j1, since not all apps need it
  • Do not allow log4j/log4j2 configuration file to be specified through system property (only include configuration file through classpath as resource)
    • Pros
      • Isolation of framework and application logs
      • No need to manage system property
    • Cons
      • Unable to use configuration files from application
      • Current Samza flow automatically sets this system property, so the job coordinator isolation flow will need to change that to prevent it
  • Add StreamAppender (and other pluggable components) to the API classloader
    • Pros
      • Unnecessary to set context classloader
    • Cons
      • More dependencies for the API classloader (e.g. StreamAppender uses Kafka)
  • Each classloader handles its own logging without delegation to other classloaders
    • Pros
      • Each classloader stays isolated from others for logging
      • Do not need to worry about classloader delegation for logging components
    • Cons
      • Need to somehow avoid writing to same log file, so can't apply same log4j.xml
        • Would need to set system property to change configuration file location (or log file location) based on which classloader was being used
      • Can't split deploy Samza logging components
      • Applications need to include more runtime dependencies (e.g. samza-kafka for system for StreamAppender)
  • Use a 4th classloader just for logging, and delegate to it from all other classloaders
    • Pros
      • Isolate logging from the API classloader
    • Cons
      • Need to manage another classloader (including localizing the classpath)
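
Several of the cons above stem from libraries (e.g. log4j1) resolving pluggable classes through the thread context classloader rather than an explicitly supplied one. A common mitigation sketch is to swap the TCCL around calls into such libraries and restore it afterwards (a general pattern, not existing Samza code):

```java
public class ContextClassLoaderDemo {
    // Run an action with a specific context classloader, restoring the previous
    // one afterwards, since libraries like log4j1 resolve pluggable components
    // via Thread.currentThread().getContextClassLoader().
    static void runWithContextClassLoader(ClassLoader loader, Runnable action) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            action.run();
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) {
        ClassLoader infraLoader = ContextClassLoaderDemo.class.getClassLoader();
        runWithContextClassLoader(infraLoader, () ->
            System.out.println("TCCL is infra loader: "
                + (Thread.currentThread().getContextClassLoader() == infraLoader)));
    }
}
```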

Decoupling application runner JARs from application JARs

  • Many of the possible solutions for job coordinator isolation also apply to the application runners
    • Pros
      • Leveraging solution for the job coordinator
    • Cons
      • If we already don't have app logic in the app runners, then this is not necessary.

Handling SamzaApplication.describe

  • Add descriptors to API classloader
    • Application code directly calls Samza infrastructure (e.g. LiKafkaInputDescriptor), so the classloader structure proposed for job coordinator isolation will not work for SamzaApplication.describe. With that structure, the Samza-owned components would be loaded by the application classloader, and so would all of their dependencies. Those Samza-owned components should instead be loaded by a classloader associated with Samza infrastructure, so that the proper dependencies are used. We can add the concrete descriptors to the API classloader, whitelisting the concrete descriptors that we want loaded from it.

    • Pros
      • Descriptors and tables are part of the user API, so this organization is more intuitive
      • Classloader structure does not need to change
    • Cons
      • Table functions are passed to table descriptors in describe and are used in processing. The API classloader won't be able to delegate back to the application classloader for the processing case.
      • More dependencies in API classloader
  • Extract descriptors and table functions into separate thin modules which can be made part of the API classloader
    • This would require decoupling the table function creation from the table function processing logic, so that different classloaders can be used for instantiation and processing.
      • One way of doing this could be to change the table descriptor to accept a table function factory instead of the table function directly.
    • Pros
      • Descriptors and tables are part of the user API, so this organization is more intuitive
      • Results in more consistency between table descriptors and other descriptors (e.g. system descriptors)
    • Cons
      • Requires changing table API
  • Additional plugins classloader which just has plugins (e.g. descriptors, tables, system factories, etc.)
    • Pros
      • Descriptors and tables are part of the user API, so this organization is more intuitive
    • Cons
      • Still have the issue with how to delegate between plugins classloader and application classloader
        • In describe, the application should delegate to the plugins, but in regular processing, it should be the other way around
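
The factory-based decoupling proposed above could take roughly the following shape (all names here are hypothetical sketches, not the real Samza table API): the descriptor holds a thin factory at describe time, and instantiation is deferred until processing, so the two steps can happen under different classloaders.

```java
public class TableFactorySketch {
    // Hypothetical processing-time interface, instantiated under the
    // classloader appropriate for processing.
    public interface TableFunction {
        Object apply(Object key);
    }

    // Hypothetical factory passed to the table descriptor in describe; only this
    // thin interface needs to be visible to the API classloader.
    public interface TableFunctionFactory {
        TableFunction create();
    }

    // Sketch of a descriptor that holds the factory instead of the table function
    // itself, deferring instantiation until processing.
    public static class TableDescriptor {
        private final TableFunctionFactory factory;

        public TableDescriptor(TableFunctionFactory factory) {
            this.factory = factory;
        }

        public TableFunction instantiate() {
            return factory.create();
        }
    }

    public static void main(String[] args) {
        // describe time: only the factory is captured; processing time: create() runs.
        TableDescriptor descriptor = new TableDescriptor(() -> key -> "value-for-" + key);
        System.out.println(descriptor.instantiate().apply("k1"));
    }
}
```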

Existing solutions for other technologies

  • Hadoop: classpath isolation for jobs
    • "mapreduce.job.classloader" config: uses a custom classloader that loads user classes before infra classes (Hadoop open source config doc)
    • "mapreduce.job.user.classpath.first" config: Put application JARs ahead of infra JARs on the classpath (Stack Overflow link)
    • This still can have classpath conflict problems of infra using application dependencies or vice versa.
  • Azkaban: classpath isolation for plugins
    • Azkaban uses a separate URLClassLoader for plugin JARs, where the parent classloader is the Azkaban classloader. This means that the Azkaban classloader will always be checked first.
    • This means that the plugins might use Azkaban dependencies.
  • Presto: Java Service Provider Interface for plugins
    • Plugin owner packages their own JARs (including dependencies), and then Presto will load the plugin using those plugin JARs and a whitelist of Presto classes
    • Still need to have custom classloader logic
  • Jetty: application dependencies get priority over Jetty dependencies, except for certain "system" classes (e.g. java.lang.String, javax.servlet.Servlet) which get loaded from the parent and certain "server" classes which are hidden
    • See org.eclipse.jetty.webapp.WebAppClassLoader
    • Jetty's API doesn't return arbitrary objects, so it can fully control the classes that get shared with the application ("system" classes)
  • AWS Lambda: separate environment for each function
  • Flink