...

Some of the design is similar to what Flink currently uses for savepoint submissions, in an attempt to reuse the existing setup for long-running operations.

...

Just like the SavepointHandlers, the JarRunAsyncHandlers will be grouped under a single class, with each handler specializing the JarRunAsyncHandlerBase class.
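
As a rough illustration of that grouping, the outer class could mirror the structure of SavepointHandlers, with one nested handler per API call extending the shared base. The nested class names below are placeholders for illustration, not the final API:

Code Block
// Sketch only: apart from JarRunAsyncHandlerBase, all names are assumptions.
public class JarRunAsyncHandlers {

    /** Shared plumbing (trigger bookkeeping, JAR resolution, error mapping). */
    abstract static class JarRunAsyncHandlerBase {
    }

    /** Triggers an asynchronous job submission and returns a triggerId. */
    static class JarRunAsyncTriggerHandler extends JarRunAsyncHandlerBase {
    }

    /** Reports the state of the run request associated with a triggerId. */
    static class JarRunAsyncStatusHandler extends JarRunAsyncHandlerBase {
    }

    /** Lists the run requests currently tracked by the cluster. */
    static class JarRunAsyncListHandler extends JarRunAsyncHandlerBase {
    }

    /** Deletes a run request and, for ongoing submissions, its uploaded JAR. */
    static class JarRunAsyncDeleteHandler extends JarRunAsyncHandlerBase {
    }
}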

...

Proposed Changes

One update we need to make is to register the handlers for the asynchronous job submission API calls in the WebSubmissionExtension class.
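
For illustration, the registration could follow the pattern used for the existing jar handlers. This is only a sketch: the JarRunAsync* handler and header classes are the hypothetical names used above, and the surrounding variables (leaderRetriever, timeout, responseHeaders, jarDir, configuration, executor, webSubmissionHandlers) are assumed to match the existing WebSubmissionExtension constructor.

Code Block
// Fragment of WebSubmissionExtension's constructor; all JarRunAsync* classes are hypothetical.
final JarRunAsyncTriggerHandler jarRunAsyncTriggerHandler =
        new JarRunAsyncTriggerHandler(
                leaderRetriever,
                timeout,
                responseHeaders,
                JarRunAsyncTriggerHeaders.getInstance(),
                jarDir,
                configuration,
                executor);

final JarRunAsyncStatusHandler jarRunAsyncStatusHandler =
        new JarRunAsyncStatusHandler(
                leaderRetriever,
                timeout,
                responseHeaders,
                JarRunAsyncStatusHeaders.getInstance());

// Expose the new endpoints alongside the existing jar handlers; the listing and
// deletion handlers would be registered in the same way.
webSubmissionHandlers.add(
        Tuple2.of(JarRunAsyncTriggerHeaders.getInstance(), jarRunAsyncTriggerHandler));
webSubmissionHandlers.add(
        Tuple2.of(JarRunAsyncStatusHeaders.getInstance(), jarRunAsyncStatusHandler));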

To develop the proposed feature, we plan to reuse components of two existing features: job submissions and savepoint submissions. Nearly all of the background processing of job submissions should follow the existing approach, while the asynchronous workflow itself is modeled on the savepoint submissions.

To cover separate use cases, clients can either reference a previously uploaded JAR (through its JarId) or provide the JAR in the request itself. The first option should start the processing of the run request more quickly, since transferring the JAR is not part of the execution path. Updating a running application could follow this optimization, as the incurred downtime would then be independent of how long the JAR takes to upload. In contrast, clients starting their application for the first time can make a single API call with all the information needed to start it.

Given the asynchronous nature of the design, clients should call /run-async/:triggerid to inquire about the state of a previously submitted run request. This is essential to understand whether the run request completed successfully or whether there was an issue that needs to be addressed. Issues can be of various types, for instance: the request lacks information (e.g. neither a JarId nor a JAR was provided with the request), the provided JarId does not exist, the code in the JAR is faulty, or the job manager failed (possibly for reasons unrelated to the run request) and is unable to complete it. Clients are responsible for retrying the requests once the issues have been addressed.
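
To make the client-side workflow concrete, the following is a minimal, self-contained sketch using Java's built-in HTTP client that covers both the JarId-based submission and the subsequent polling. The base URL, the submission path and body schema, and the JSON-parsing helpers are assumptions for illustration; only the /run-async/:triggerid status call is fixed by the text above, and a complete client would also handle the failure cases just listed.

Code Block
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JarRunAsyncClientSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String baseUrl = "http://localhost:8081"; // cluster REST endpoint (assumed)

        // 1) Trigger the asynchronous run, referencing a previously uploaded JAR.
        //    The path and body schema are hypothetical; uploading the JAR within the
        //    same request would use a multipart body instead of the jarId field.
        HttpRequest trigger = HttpRequest.newBuilder(URI.create(baseUrl + "/run-async"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"jarId\": \"my-job.jar\"}"))
                .build();
        HttpResponse<String> triggerResponse =
                client.send(trigger, HttpResponse.BodyHandlers.ofString());
        String triggerId = extractTriggerId(triggerResponse.body());

        // 2) Poll the completion tracking call until the run request finishes,
        //    then inspect the reported state for success or failure details.
        while (true) {
            HttpRequest status = HttpRequest.newBuilder(
                            URI.create(baseUrl + "/run-async/" + triggerId))
                    .GET()
                    .build();
            HttpResponse<String> statusResponse =
                    client.send(status, HttpResponse.BodyHandlers.ofString());
            if (isComplete(statusResponse.body())) {
                break;
            }
            Thread.sleep(1_000);
        }
    }

    // Placeholder helpers: a real client would parse the JSON responses here.
    private static String extractTriggerId(String json) { return json; }

    private static boolean isComplete(String json) { return true; }
}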

Compatibility, Deprecation, and Migration Plan

Existing users will not be impacted by the change, because the synchronous job submission will be unaffected. They will, however, have the choice between simplicity and a higher level of control over their job submissions. This is why there are no plans to phase out or remove the synchronous job submission.

Test Plan

We plan to start by adapting the existing tests for the synchronous job submission.

For the positive test cases, the new workflow consists of a job submission followed by polling of the completion tracking call for the associated triggerId. Assertions over the correctness of the running job should be identical to those of the synchronous submission, given that the workflows will have been merged at this stage. Failures caused by problems in the provided inputs are raised by the job submission call, while failures that occur during the processing of a request are identified through the completion tracking calls.
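
A skeleton of such a positive test case might look as follows; the helpers and the RunStatus type are hypothetical stand-ins for the existing jar-handler test utilities that would actually be reused.

Code Block
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class JarRunAsyncHandlerTestSketch {

    @Test
    void asyncSubmissionEventuallyStartsTheJob() throws Exception {
        // Trigger the asynchronous submission with a previously uploaded JAR.
        String triggerId = submitAsync("uploaded-test-job.jar");

        // Poll the completion tracking call until the run request finishes.
        RunStatus status = pollUntilComplete(triggerId);

        // Assertions over the running job mirror the synchronous submission tests.
        assertTrue(status.isJobRunning());
    }

    // --- hypothetical helpers used only to illustrate the flow ---

    private String submitAsync(String jarId) {
        return "trigger-id";
    }

    private RunStatus pollUntilComplete(String triggerId) {
        return new RunStatus();
    }

    private static final class RunStatus {
        boolean isJobRunning() {
            return true;
        }
    }
}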

To complete the testing of the job submission call, we plan to focus on the upload of a JAR alongside the request and on the use of a triggerId for idempotency purposes. For the former, the priority is to guarantee that the JAR’s lifetime is limited to the duration of the job submission request (i.e. the JAR is disposed of after the job creation finishes, either successfully or with an error). For the latter, we need to guarantee that duplicate requests are processed only once.

The remaining tests are spread across the other three API calls. We intend to exercise both the completion tracking and listing API calls to externally assess the state of the system; assertions over the reported state allow us to understand whether the feature is behaving as expected. Regarding the deletion of job submissions, the main concerns come from deleting completed versus ongoing job submissions. Deleting a completed job submission removes the tracking information for the requested triggerId. For ongoing job submissions, there are additional concerns around halting the background computation and deleting the JAR that was uploaded alongside the asynchronous job submission.

Rejected Alternatives