Apache Airavata
This document defines the scope of the phase I Workflow implementation in Airavata 0.17.
The implementation was initially motivated mainly by the CIPRES requirement for workflow capabilities in Airavata. The current target is to enable SEAGrid to consume these workflow capabilities.
Execute multiple applications in sequence.
Control over the workflow is not required: no loop or condition nodes.
Use the previous working directory if the user specifies it; otherwise, create a separate working directory for each job.
Instead of staging input, reuse the same data (moved locally) when an application uses data that was produced or used by a previous application executed in the same workflow sequence and on the same machine.
Stage input data if a remote resource is involved, making sure to associate the Airavata Experiment with the local job.
An experiment only reaches its end state after all associated jobs have reached one of their end states.
A workflow can have different sets of applications that run on a set of compute resources.
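The working-directory, data-reuse, and end-state rules above can be sketched as follows. This is a minimal illustration, not Airavata code: the Job class, the state names, and the helper functions are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical terminal job states; illustrative only, not Airavata's actual enum.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


@dataclass
class Job:
    job_id: str
    host: str
    state: str = "SUBMITTED"
    reuse_previous_dir: bool = False  # set when the user asks to reuse the previous directory


def choose_working_dir(job: Job, previous_dir: Optional[str], base_dir: str) -> str:
    """Reuse the previous working directory only on user request; otherwise one directory per job."""
    if job.reuse_previous_dir and previous_dir is not None:
        return previous_dir
    return f"{base_dir}/{job.job_id}"


def data_action(producer_host: str, consumer_host: str) -> str:
    """Move data locally when both jobs share a host; stage input when a remote resource is involved."""
    return "LOCAL_MOVE" if producer_host == consumer_host else "STAGE_INPUT"


def experiment_finished(jobs: List[Job]) -> bool:
    """The experiment may reach an end state only once every associated job is terminal."""
    return all(job.state in TERMINAL_STATES for job in jobs)
```

In this sketch the experiment-level check is a pure function over job states, so it can be re-evaluated each time any job reports a state change.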
Support multiple applications running sequentially inside an experiment.
Change existing experiment model to support multiple applications.
Multiple applications can be defined as a DAG.
Applications will be executed sequentially, one after the other. Output will be available once the entire workflow completes.
The workflow node representation should contain the application id, inputs, working directories, host application (deployment id), and any other information needed to create processes and tasks internally.
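One possible shape for such a node, together with a way to derive a sequential execution order from the DAG, is sketched below. The WorkflowNode fields and the depends_on attribute are assumptions made for illustration; the actual experiment model may differ. Ordering uses the standard-library graphlib module (Python 3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Dict, List


@dataclass
class WorkflowNode:
    # All field names are hypothetical; they mirror the items listed in the text.
    node_id: str
    application_id: str
    deployment_id: str   # identifies the host application deployment
    working_dir: str
    inputs: Dict[str, str] = field(default_factory=dict)
    depends_on: List[str] = field(default_factory=list)  # ids of predecessor nodes


def execution_order(nodes: List[WorkflowNode]) -> List[str]:
    """Return node ids in an order where every node follows its predecessors.

    TopologicalSorter raises graphlib.CycleError if the graph is not a DAG.
    """
    graph = {node.node_id: set(node.depends_on) for node in nodes}
    return list(TopologicalSorter(graph).static_order())
```

Because Phase I has no loop or condition nodes, a plain topological order is enough to run the applications one after the other.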
Cases to support
When applications run on multiple hosts, how should output staging be handled?
Output staging (to the storage resource) should be executed automatically.
When applications run on the same host, do we use the same working directory, and do such applications need to be handled differently?
The working directory for a new job in the workflow should be different from that of the previous job. In some cases this is important so that data from the previous run remains reusable by a later, as yet unknown run, since a job may overwrite some data.
Orchestrator needs to know whether the experiment is a single application or contains multiple applications.
When defining inputs to applications, will an output of a previous application be an input for another application? (This will change input data handling models)
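If outputs of earlier applications are allowed to feed later inputs, one way to model it is a reference syntax that is resolved just before each application starts. The sketch below assumes a made-up "@<node_id>.<output_name>" convention and hypothetical function names; it also shows the simplest possible check the orchestrator could use to tell a single-application experiment from a workflow.

```python
from typing import Dict, List

# Hypothetical convention: an input value of the form "@<node_id>.<output_name>"
# refers to an output produced by an earlier application in the same workflow.
REF_PREFIX = "@"


def is_workflow(application_ids: List[str]) -> bool:
    """An experiment with more than one application is treated as a workflow."""
    return len(application_ids) > 1


def resolve_inputs(inputs: Dict[str, str],
                   produced: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Replace output references with the concrete values recorded for earlier nodes.

    Raises KeyError if a referenced node or output has not been produced yet.
    """
    resolved = {}
    for name, value in inputs.items():
        if value.startswith(REF_PREFIX):
            node_id, output_name = value[len(REF_PREFIX):].split(".", 1)
            resolved[name] = produced[node_id][output_name]
        else:
            resolved[name] = value
    return resolved
```

For example, with produced = {"n1": {"alignment": "/scratch/exp1/n1/out.aln"}}, an input of "@n1.alignment" resolves to that path, while literal values pass through unchanged.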