Status

Current state: UNDER DISCUSSION

Discussion thread: http://mail-archives.apache.org/mod_mbox/samza-dev/201807.mbox/%3CCAFvExu3_nmaSQTy=5SypzwmqGA7S9+Txa=QkyERQ+hT3JZ29ig@mail.gmail.com%3E

Released:

Problem

In the current implementation of ApplicationRunner, there are a few issues:

Instantiation of a specific ApplicationRunner implementation is exposed to the user, who must choose the implementation in source code depending on the deployment environment (i.e. YARN vs. standalone).
ApplicationRunner only supports the high-level API and does not fully support the low-level API:
1. In a standalone environment, a user program written with the StreamTask/AsyncStreamTask classes is only supported in LocalApplicationRunner with a run() method
2. In YARN, RemoteApplicationRunner only supports high-level API applications and falls back to JobRunner for low-level API applications.
There is no unified API that lets the user specify either a high-level or a low-level API application at initialization.
There is no defined standard lifecycle of a user application process in either YARN or standalone deployments. Hence, there is no consistent pattern for inserting user code into the application’s full lifecycle:
1. There is no standard method to insert a user-defined application initialization sequence
  1. In YARN, all application processes are initialized (i.e. config rewriter, stream processor initialization, etc.) by the built-in main functions in the Samza framework (i.e. ApplicationRunnerMain on the launch host and LocalContainerRunner on NodeManagers).
  2. In standalone, the user can put arbitrary code in the user main function to initialize the application process.
2. Beyond initialization, there is also no defined method for the user to inject customized logic before starting or stopping the user-defined processors (i.e. StreamProcessors defined by the user application)

Motivation

Our goal is to allow users to write a high-level or low-level API application once and deploy it in both YARN and standalone environments without code changes. The following requirements are necessary to achieve this goal:

Hide the choice of a specific ApplicationRunner implementation behind configuration, rather than in source code.
Define a unified API that lets the user describe the processing logic in the high- or low-level API in all environments (i.e. all ApplicationRunners)
Expand ApplicationRunner to run both low- and high-level API applications in YARN and standalone environments.
Define a standard processor-lifecycle-aware API that allows the user’s customized logic to be injected before and after starting/stopping the processors in both YARN and standalone environments.

Note that we need to define the following concepts clearly:

ApplicationRunner defines a set of standard execution methods to change the deployment status of an application at runtime (i.e. run/status/kill/waitForFinish)
The application’s processor-lifecycle-aware methods are user-defined functions that inject customized logic to be executed before or after the stream processing logic in the user application is started or stopped (i.e. beforeStart/afterStart/beforeStop/afterStop are called when the StreamProcessors are started/stopped on the local host)

Proposed Changes

The proposed changes are the following:

Define a unified API ApplicationBase as a single entry point for users to implement all user-customized logic, for both the high-level API and the low-level API
1. The user implements a single describe() method that defines all user processing logic before the runtime application instance is created
  1. Sub-classes StreamApplication and TaskApplication provide specific describe() methods for the high-level API and low-level API, respectively
Define a unified API class ApplicationDescriptor to contain
1. High- and low-level processing logic defined via ApplicationBase.describe(). Sub-classes StreamAppDescriptor and TaskAppDescriptor are used for the high- and low-level APIs, respectively.
2. A user-implemented ProcessorLifecycleListenerFactory that creates a ProcessorLifecycleListener, which holds the customized logic to run before and after starting/stopping the StreamProcessor(s) in the user application
  1. Methods are beforeStart/afterStart/beforeStop/afterStop
3. Other user-defined objects in an application (e.g. configuration and context)
Expand ApplicationRunner with a mandatory constructor that takes an ApplicationDescriptor object as its parameter
1. An ApplicationRunner is now constructed with an ApplicationDescriptor as the parameter
  1. ApplicationDescriptor contains all user-customized logic.
  2. ApplicationRunner deploys and runs the user code. It is instantiated from the configuration, and the choice of implementation is not exposed to the user.
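
The relationship between the three pieces can be sketched in a few lines of plain Java. This is only an illustration of the proposed flow, not Samza code: AppDescriptor, AppBase, TaskDesc, SketchRunner and ProgrammingModelSketch are hypothetical stand-ins, and the config is reduced to a plain Map. The point it tries to show is that describe() runs while the descriptor is being built, and that the runner only ever sees the finished descriptor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the proposed API; these are not the Samza classes.
interface AppDescriptor {
  Map<String, String> getConfig();
}

interface AppBase<T extends AppDescriptor> {
  void describe(T appDesc);
}

// Stand-in for a low-level descriptor: records the input streams the user declares.
class TaskDesc implements AppDescriptor {
  private final Map<String, String> config;
  final List<String> inputs = new ArrayList<>();

  TaskDesc(AppBase<TaskDesc> app, Map<String, String> config) {
    this.config = config;
    app.describe(this); // user logic is captured while the descriptor is built
  }

  @Override
  public Map<String, String> getConfig() {
    return config;
  }

  void addInputStream(String streamId) {
    inputs.add(streamId);
  }
}

// Stand-in runner: constructed with the descriptor, never instantiated by user code.
class SketchRunner {
  private final TaskDesc desc;

  SketchRunner(TaskDesc desc) {
    this.desc = desc;
  }

  String run() {
    return "running with inputs " + desc.inputs;
  }
}

class ProgrammingModelSketch {
  // Framework-side entry point: builds the descriptor, then hands it to the runner.
  static String launch(AppBase<TaskDesc> app, Map<String, String> config) {
    return new SketchRunner(new TaskDesc(app, config)).run();
  }
}
```

In this sketch the user only supplies the describe() body; everything else is done on the framework side, which is the separation the list above describes.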

A high-level overview of the proposed changes is illustrated below:

Figure-1: high-level user programming model

Figure-2: Interaction and lifecycle of runtime API objects (using StreamApplication in LocalApplicationRunner as an example).

The above design achieves the following goals:

Define a unified ApplicationBase interface for both high- and low-level APIs in different deployment environments. All user code is now implemented in one of the sub-classes (i.e. StreamApplication or TaskApplication).
1. All processing logic is implemented in the standard describe() method in either StreamApplication or TaskApplication.
2. All user-customized logic to start/stop contextual objects in the application process is in the standard lifecycle listener methods defined in ProcessorLifecycleListener.
Construction of the ApplicationRunner object is implemented by Samza framework code, which hides:
1. The choice of a specific ApplicationRunner for different environments, made via configuration
2. The association of a user application with the specific instance of ApplicationRunner, as the parameter to the constructor
3. The initialization of user processing logic, which happens when the ApplicationDescriptor object is constructed, before ApplicationRunner executes the application
4. The invocation of user-defined lifecycle listener methods when the application is run/killed via ApplicationRunner in the local process (i.e. LocalApplicationRunner*)
  1. Note that RemoteApplicationRunner only submits the application without launching the StreamProcessors. Hence, lifecycle listener methods are not invoked in RemoteApplicationRunner.
  2. Note that this also depends on one pending refactoring item: LocalContainerRunner needs to be refactored so that
    1. It implements run/kill/status with a proper async implementation
    2. It launches StreamProcessor instead of directly running SamzaContainer

Note that the application main() method in the cross-functional swimlane chart is marked with a different color, since the user can use either a user-defined main() or a Samza built-in main() function. We consider these two options in three different runtime environments:

Standalone environment: the user will use LocalApplicationRunner to launch the application in the same JVM process
YARN application launch host: the user will use RemoteApplicationRunner to submit the application to a remote cluster
YARN NodeManager: Samza will run a built-in runner to launch the container in the same JVM process

For the Samza built-in main method (as in ApplicationRunnerMain#main()), we require the user application class to have a default constructor without any parameters:

// Load the application class named in the configuration and create it via its no-arg constructor
Class<ApplicationBase> appClass = (Class<ApplicationBase>) Class.forName(appConfig.getAppClass());
if (StreamApplication.class.isAssignableFrom(appClass) || TaskApplication.class.isAssignableFrom(appClass)) {
  return appClass.newInstance();
}

The reason is that, when deploying via RemoteApplicationRunner in YARN, the NodeManager runs the managed main() method implemented by Samza, which cannot invoke a customized constructor for the user application. Hence, we expect every user application to implement a default constructor. This is the same behavior we expect today from any user implementing a high- or low-level application.
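
The constraint can be demonstrated with plain JDK reflection. The sketch below is not Samza code; NoArgApp, NeedsArgApp and ReflectiveLoader are hypothetical names used only to show why a framework-managed main(), which knows nothing but a class name, can only create classes that expose a no-arg constructor.

```java
// Hypothetical application classes used only for this illustration.
class NoArgApp {
  public NoArgApp() { }
}

class NeedsArgApp {
  public NeedsArgApp(String name) { }
}

class ReflectiveLoader {
  /** Returns true if the named class can be created through a no-arg constructor. */
  static boolean canInstantiate(String className) {
    try {
      Class<?> clazz = Class.forName(className);
      // A managed main() has no constructor arguments to pass, so only this lookup is possible.
      clazz.getDeclaredConstructor().newInstance();
      return true;
    } catch (ReflectiveOperationException e) {
      // Covers a missing class, a missing no-arg constructor, and a failing constructor.
      return false;
    }
  }
}
```

Calling canInstantiate with the name of NoArgApp succeeds, while NeedsArgApp is rejected because the lookup for a parameterless constructor fails.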

Applications with a user-defined main() can run in both standalone and YARN, as long as:

The user application class implements a default constructor without any parameters
The ApplicationRunner in main() is created using Samza-provided methods (i.e. ApplicationRunners.getApplicationRunner())

Simple code examples of high- and low-level API applications:

public class PageViewCounterExample implements StreamApplication {

  public static void main(String[] args) {
    CommandLine cmdLine = new CommandLine();
    Config config = cmdLine.loadConfig(cmdLine.parser().parse(args));
    ApplicationRunner runner = ApplicationRunners.getApplicationRunner(ApplicationClassUtils.fromConfig(config), config);
    runner.run();
    runner.waitForFinish();
  }

  @Override
  public void describe(StreamAppDescriptor appDesc) {
      // TODO: replace "pageViewEventStream" with pveStreamDescriptor when SEP-14 is implemented
      MessageStream<PageViewEvent> pageViewEvents =
          appDesc.getInputStream("pageViewEventStream", new JsonSerdeV2<>(PageViewEvent.class));
      OutputStream<KV<String, PageViewCount>> pageViewEventPerMemberStream =
          appDesc.getOutputStream("pageViewEventPerMemberStream",
              KVSerde.of(new StringSerde(), new JsonSerdeV2<>(PageViewCount.class)));

      SupplierFunction<Integer> initialValue = () -> 0;
      FoldLeftFunction<PageViewEvent, Integer> foldLeftFn = (m, c) -> c + 1;
      pageViewEvents
          .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofSeconds(10), initialValue, foldLeftFn,
              null, null)
              .setEarlyTrigger(Triggers.repeat(Triggers.count(5)))
              .setAccumulationMode(AccumulationMode.DISCARDING), "tumblingWindow")
          .map(windowPane -> KV.of(windowPane.getKey().getKey(), new PageViewCount(windowPane)))
          .sendTo(pageViewEventPerMemberStream);
  }
}

public class TaskApplicationExample implements TaskApplication {

  public static void main(String[] args) {
    CommandLine cmdLine = new CommandLine();
    Config config = cmdLine.loadConfig(cmdLine.parser().parse(args));
    ApplicationRunner runner = ApplicationRunners.getApplicationRunner(new TaskApplicationExample(), config);
    runner.run();
    runner.waitForFinish();
  }

  @Override
  public void describe(TaskAppDescriptor appDesc) {
    // add input and output streams
    // TODO: replace "myinput" with inputStreamDescriptor and "myoutput" with outputStreamDescriptor when SEP-14 is implemented
    appDesc.addInputStreams(Collections.singletonList("myinput"));
    appDesc.addOutputStreams(Collections.singletonList("myoutput"));
    TableDescriptor td = new RocksDbTableDescriptor("mytable");
    appDesc.addTables(Collections.singletonList(td));
    // create the task factory based on configuration
    appDesc.setTaskFactory(TaskFactoryUtil.createTaskFactory(appDesc.getConfig()));
  }

}

Public Interfaces

There are two types of public API classes that are exposed to the user: a) user-implemented interface classes that allow users to inject customized code; b) runtime objects implemented by the Samza framework that allow users to start/stop a runtime application.

A) user-implemented interface classes include the following:

ApplicationBase: defines the base interface, with a single describe() method, through which users supply the customized logic of an application

public interface ApplicationBase<T extends ApplicationDescriptor> { 
  void describe(T appDesc); 
}

StreamApplication: extends ApplicationBase with a typed describe() method for high-level user application

public interface StreamApplication extends ApplicationBase<StreamAppDescriptor> { 
}

TaskApplication: extends ApplicationBase with a typed describe() method to initialize the low-level user application

public interface TaskApplication extends ApplicationBase<TaskAppDescriptor> { 
}

ProcessorLifecycleListenerFactory: defines the factory interface to create ProcessorLifecycleListener in an application

public interface ProcessorLifecycleListenerFactory extends Serializable {
  /**
   * Create an instance of {@link ProcessorLifecycleListener} for the StreamProcessor
   *
   * @param pContext the context of the corresponding StreamProcessor
   * @param config the configuration of the corresponding StreamProcessor
   * @return the {@link ProcessorLifecycleListener} callback object for the StreamProcessor
   */
  ProcessorLifecycleListener createInstance(ProcessorContext pContext, Config config);
}


/**
 * The context for a StreamProcessor. This is a stub interface that only includes the method to identify
 * the current StreamProcessor.
 */
public interface ProcessorContext extends Serializable {
  String getProcessorId();
}

ProcessorLifecycleListener: defines the unified processor-lifecycle-aware methods that allow users to inject customized logic before and after starting/stopping the StreamProcessor(s) in an application

public interface ProcessorLifecycleListener {
  /**
   * User defined initialization before a StreamProcessor is started
   */
  default void beforeStart() {}

  /**
   * User defined callback after a StreamProcessor is started
   *
   */
  default void afterStart() {}

  /**
   * User defined callback before a StreamProcessor is stopped
   *
   */
  default void beforeStop() {}

  /**
   * User defined callback after a StreamProcessor is stopped
   *
   * @param t the error causing the stop of the StreamProcessor. null value of this parameter indicates a successful completion.
   */
  default void afterStop(Throwable t) {}
}
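
To make the callback order concrete, here is a small self-contained sketch. LifecycleListener is a local copy of the interface above so the example compiles on its own; RecordingListener and SketchProcessor are hypothetical. The ordering shown, and in particular calling beforeStop() even when the work fails, is an assumption of this sketch and not something the proposal specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Local copy of the proposed callback interface so the sketch compiles on its own.
interface LifecycleListener {
  default void beforeStart() { }
  default void afterStart() { }
  default void beforeStop() { }
  default void afterStop(Throwable t) { }
}

// Hypothetical listener that records each callback it receives.
class RecordingListener implements LifecycleListener {
  final List<String> events = new ArrayList<>();

  @Override public void beforeStart() { events.add("beforeStart"); }
  @Override public void afterStart() { events.add("afterStart"); }
  @Override public void beforeStop() { events.add("beforeStop"); }
  @Override public void afterStop(Throwable t) {
    // A null Throwable signals successful completion, as in the Javadoc above.
    events.add(t == null ? "afterStop:ok" : "afterStop:" + t.getMessage());
  }
}

// Hypothetical stand-in for a processor that drives the callbacks around its work.
class SketchProcessor {
  static void runOnce(LifecycleListener listener, Runnable work) {
    Throwable failure = null;
    listener.beforeStart();
    try {
      listener.afterStart();
      work.run();
    } catch (RuntimeException e) {
      failure = e; // remembered so afterStop can report the cause
    }
    listener.beforeStop(); // assumption: also invoked on the failure path
    listener.afterStop(failure);
  }
}
```

Because every method has a default body, a real listener would only override the callbacks it needs, for example opening a shared client in beforeStart() and closing it in afterStop().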

B) Samza framework implemented runtime objects

The Samza framework provides two sets of runtime classes that are directly exposed to the user. One set is the ApplicationDescriptor classes, which include all user-defined logic and configuration for an application; the other set is the ApplicationRunner classes.

ApplicationDescriptor classes

ApplicationDescriptor: this is a base interface for both high- and low-level applications.

public interface ApplicationDescriptor<T extends ApplicationBase> { 
  /** 
   * Get the global unique application ID in the runtime process 
   * @return globally unique application ID 
   */ 
  String getGlobalAppId(); 
 
  /** 
   * Get the user defined {@link Config} 
   * @return config object 
   */ 
  Config getConfig(); 
 
  /** 
   * TODO: this needs to be replaced with a proper SharedContextFactory when SAMZA-1714 is completed.
   * We have to keep it here to enable the current samza-sql implementation.
   *
   * Sets the {@link ContextManager} for this application.
   * <p>
   * The provided {@link ContextManager} can be used to set up shared context between the operator functions
   * within a task instance
   *
   * @param contextManager the {@link ContextManager} to use for the {@link ApplicationDescriptor}
   * @return the {@link ApplicationDescriptor} with {@code contextManager} set as its {@link ContextManager}
   */ 
  ApplicationDescriptor<T> withContextManager(ContextManager contextManager); 


  /** 
   * Sets the lifecycle listener factory for user customized logic before and after starting/stopping 
   * StreamProcessors in the application 
   */  
  ApplicationDescriptor<T> withProcessorLifecycleListenerFactory(ProcessorLifecycleListenerFactory listener); 
}

StreamAppDescriptor: this extends ApplicationDescriptor for a high-level application, including all methods to describe a high-level application in StreamGraph.

public interface StreamAppDescriptor extends ApplicationDescriptor<StreamApplication>, StreamGraph { 
}

TaskAppDescriptor: this extends ApplicationDescriptor for a low-level application, including the user-defined TaskFactory and the corresponding list of input and output streams and tables.

public interface TaskAppDescriptor extends ApplicationDescriptor<TaskApplication> {
 
  void setTaskFactory(TaskFactory factory); 
 
  // TODO: the following two interface methods depend on SEP-14
  void addInputStreams(List<InputStreamDescriptor> inputStreams);  
  void addOutputStreams(List<OutputStreamDescriptor> outputStreams); 
 
  void addTables(List<TableDescriptor> tables); 
 
}

ApplicationRunner classes

ApplicationRunner

This is an interface that defines the standard execution methods to deploy an application. It is used by users and is not intended to be implemented by users.

public interface ApplicationRunner {
  /**
   * Start a runtime instance of the application
   */
  void run();

  /**
   * Stop a runtime instance of the application
   */
  void kill();

  /**
   * Get the {@link ApplicationStatus} of a runtime instance of the application
   * @return the runtime status of the application
   */
  ApplicationStatus status();

  /**
   * Wait for the runtime instance of the application to complete.
   * This method will block until the application completes.
   */
  void waitForFinish();

  /**
   * Wait for the runtime instance of the application to complete, with a {@code timeout}
   *
   * @param timeout the time to block to wait for the application to complete
   * @return true if the application completes within timeout; false otherwise
   */
  boolean waitForFinish(Duration timeout);

  /**
   * Method to add a set of customized {@link MetricsReporter}s in the application runtime instance
   *
   * @param metricsReporters the map of customized {@link MetricsReporter}s objects to be used
   */
  void addMetricsReporters(Map<String, MetricsReporter> metricsReporters);

}
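
One way a local runner could back the timed waitForFinish(Duration) contract is with a CountDownLatch from the JDK. The sketch below is an assumption about a possible implementation, not the actual runner code; LatchBackedRunner and markFinished() are hypothetical names.

```java
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical runner stand-in; only the blocking behavior is illustrated.
class LatchBackedRunner {
  private final CountDownLatch finished = new CountDownLatch(1);

  /** Marks the application as complete, releasing any waiting callers. */
  void markFinished() {
    finished.countDown();
  }

  /** Blocks until completion or until the timeout elapses; true means the application completed. */
  boolean waitForFinish(Duration timeout) {
    try {
      return finished.await(timeout.toMillis(), TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }
}
```

The boolean return value lets a caller distinguish a timeout from completion, which is what the Javadoc for waitForFinish(Duration) above promises.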

ApplicationRunners:

A factory class provided by the Samza framework to instantiate an ApplicationRunner for user applications.

public class ApplicationRunners {

  private ApplicationRunners() {

  }

  public static final ApplicationRunner getApplicationRunner(ApplicationBase userApp, Config config) {
    if (userApp instanceof StreamApplication) {
      return getRunner(new StreamAppDescriptorImpl((StreamApplication) userApp, config));
    }
    if (userApp instanceof TaskApplication) {
      return getRunner(new TaskAppDescriptorImpl((TaskApplication) userApp, config));
    }
    throw new IllegalArgumentException(String.format("User application instance has to be either StreamApplication or TaskApplication. "
        + "Invalid userApp class %s.", userApp.getClass().getName()));
  }

...
  /**
   * Static method to get the {@link ApplicationRunner}
   *
   * @param appDesc  {@link AppDescriptorImpl} object that contains all user-customized application logic and configuration
   * @return  the configuration-driven {@link ApplicationRunner} to run the user-defined stream applications
   */
  public static ApplicationRunner getRunner(AppDescriptorImpl appDesc) {
    AppRunnerConfig appRunnerCfg = new AppRunnerConfig(appDesc.getConfig());
    try {
      Class<?> runnerClass = Class.forName(appRunnerCfg.getAppRunnerClass());
      if (ApplicationRunner.class.isAssignableFrom(runnerClass)) {
        // mandate AppDescriptorImpl as the parameter to the constructor
        Constructor<?> constructor = runnerClass.getConstructor(AppDescriptorImpl.class); // *sigh*
        return (ApplicationRunner) constructor.newInstance(appDesc);
      }
    } catch (Exception e) {
      throw new ConfigException(String.format("Problem in loading ApplicationRunner class %s",
          appRunnerCfg.getAppRunnerClass()), e);
    }
    throw new ConfigException(String.format(
        "Class %s does not extend ApplicationRunner properly",
        appRunnerCfg.getAppRunnerClass()));
  }
}
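
A practical consequence of the reflective lookup in getRunner() is that Class#getConstructor matches on the declared parameter type exactly, so a runner must declare a public constructor taking the base descriptor type rather than a sub-class. The sketch below shows that JDK behavior with hypothetical stand-in classes (BaseDesc, StreamDesc, ReflectedRunner, RunnerLookup); it does not use the Samza classes.

```java
import java.lang.reflect.Constructor;

// Hypothetical descriptor hierarchy and runner, for illustration only.
class BaseDesc { }

class StreamDesc extends BaseDesc { }

class ReflectedRunner {
  final BaseDesc desc;

  public ReflectedRunner(BaseDesc desc) {
    this.desc = desc;
  }
}

class RunnerLookup {
  /** Returns true if runnerClass declares a public constructor taking exactly paramType. */
  static boolean hasConstructor(Class<?> runnerClass, Class<?> paramType) {
    try {
      runnerClass.getConstructor(paramType);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  /** Looks up the constructor by the base type, then passes any sub-class instance to it. */
  static ReflectedRunner create(BaseDesc desc) {
    try {
      Constructor<ReflectedRunner> ctor = ReflectedRunner.class.getConstructor(BaseDesc.class);
      return ctor.newInstance(desc);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Problem in loading runner class", e);
    }
  }
}
```

Looking up the constructor with StreamDesc.class fails even though a StreamDesc instance is a valid argument, which is why the lookup uses the base type and the sub-class instance is only supplied at newInstance() time.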

Implementation and Test Plan

The implementation of the above API changes involves the following sections:

ApplicationBase/StreamApplication/TaskApplication interfaces and user code examples for high- and low-level APIs. The StreamApplication and TaskApplication interfaces don’t have a default implementation. The main effort is to write user code examples implementing those interfaces. We need to port all existing high-level user code examples in the samza-test module and also add low-level user code examples.
Implementation of runtime public API classes: AppDescriptorImpl/StreamAppDescriptorImpl/TaskAppDescriptorImpl/ApplicationRunners. Those classes are implemented by the Samza framework and directly used by users. Hence, they need both implementation and user code examples.
Internal implementation of ApplicationRunner classes: the implementations need to be refactored to support running different ApplicationDescriptors, based on whether the ApplicationDescriptor is a StreamAppDescriptor or a TaskAppDescriptor. All ApplicationRunner classes need to be refactored to support TaskAppDescriptor.
The implementation of local application runners needs to support invoking the ProcessorLifecycleListener API methods before and after starting/stopping the StreamProcessor(s)
1. This requires refactoring LocalContainerRunner to properly launch StreamProcessor instead of directly running SamzaContainer

Test plans:

Changes in all ApplicationRunners need to be covered by unit tests, including tests for TaskAppDescriptor.
Applications written in high-level API need to be included in AbstractIntegrationTestHarness for testing.
Applications written in low-level API also need to be included in AbstractIntegrationTestHarness for testing.
Applications using different runners via config change also need to be tested.

Compatibility, Deprecation, and Migration Plan

The proposed changes affect the existing API classes as follows:

Incompatible changes:

The StreamApplication.init() is replaced by StreamApplication.describe().
StreamAppDescriptor class replaces StreamGraph to describe the high-level API application
Use ApplicationRunners public classes to replace the user instantiation of a specific implementation of ApplicationRunner
Changed the mandatory parameter to construct an ApplicationRunner from Config to AppDescriptorImpl

Addition-only changes:

Added TaskApplication interface
Added TaskAppDescriptor interface
Added ProcessorLifecycleListenerFactory interface

There is no configuration change that is backward incompatible with the current API.

The main incompatible change affects high-level API applications.

Rejected Alternatives

The rejected alternative is to always run the user’s main() function for high- and low-level APIs in YARN and standalone. The reasons to reject this option are the following:

In legacy low-level API applications, the user doesn’t implement a main() function.
In applications launched via a standard lifecycle management framework like Spring, users don’t write a main() function either.
In the YARN environment, we want to manage the main() function that is launched in the NodeManager (to avoid launching arbitrary user code in the NodeManager).

SEP-13: unify high- and low-level user applications in YARN and standalone
