AIP-57 Refactored SLA Feature

Status

State	Draft
Discussion Thread	https://lists.apache.org/thread/kn1rb9l70jk16j83fycp68l2g70wnhzl https://lists.apache.org/thread/kj33fwd9otbbm52my9j3y39rp547g0n5 https://lists.apache.org/thread/l2ho7w18d11vcnon7dxoxghzxhtjw5hf (Multiple threads were created on the topic)
Vote Thread
Vote Result Thread
Progress Tacking (PR/GitHub Project/Issue Label)
Date Created	$action.dateFormatter.formatGivenString("yyyy-MM-dd", $content.getCreationDate())
Version Released
Authors	Sung Yun

Motivation

We often need to guarantee that a job is finished by a certain deadline. This concept is loosely defined as an ‘SLA’. In the scope of scheduled jobs, an SLA usually is referred to relative to the expected start time of the job. While the current SLA feature evaluates the event of missing an SLA correctly according to this definition, it is more of a misfeature due to the following list of problems.

The original discussion was carried out on the detailed Google Doc.

Problems in the Current State

Sla_miss_callback only executes when the task is in SUCCESS or SKIPPED state. This means that if the task has not succeeded in the first place, due to it being blocked, or is still running, or is yet to be scheduled, the sla_miss_callback is not executed. This is a detrimental misfeature of the existing SLA implementation, as users often want to use SLA for delay detection, to enable them to investigate any possible issues in their workflow pre-emptively.
SLA is defined as a timedelta relative to the dagrun.data_interval_end based on the timetable of the scheduled job. Some DAGs are triggered externally either manually or via a Dataset, meaning they don’t have a fixed schedule or start time preventing the evaluation of an SLA miss.
Sla_miss_callback is an attribute of the DAG, but the SLA’s are defined as attributes of the individual tasks. This is counterintuitive, and does not allow users to run callbacks that are customized at the task level if deadlines are missed.
The current implementation triggers the parsing of dag files when an sla_miss_callback is sent to the DagFileProcessor. This means that if an sla is defined within a dag, we may end up parsing the dag file more frequently than at the frequency defined by min_file_process_interval because the sla_miss_callbacks are sent to the DagFileProcessor at every time dag_run.update_state is called. This is especially bad for sla_miss_callbacks compared to the other dag level callbacks, because sla_miss_callback is requested everytime we examine and attempt to schedule the dagrun, even if there are no state changes within the dag.

Current Behavior 1: Delayed SLA detection since sla miss is not detected until sla task ends

Current Behavior 2: Delayed SLA detection when previous task runs long

Desired Behavior: Detected ASAP at data_interval_end + SLA

State Diagram and Challenges in Solving Above Problems for Task-level SLAs

In a single threaded process, managing the above sequence of actions is really simple. But in a distributed set of processes, we need to describe the above flow with a state diagram, so that the detection of missed SLA and the execution of an SLA callback can be managed by processes or parts of the scheduler loop that are best equipped to handle them. This is what the current architecture seeks to do, regrettably along with all of the issues outlined above. Although the current complicated workflow of detecting and alerting on SLA misses stretches across multiple parts of the scheduler - from the scheduler loop consistently sending poorly defined callbacks to the DagFileProcessor and the DagFileProcessorProcess constantly creating and updating these records in the SlaMiss table, we can describe this workflow in a set of state changes described below (referred to hereby as SLA state).

One important thing to note here is that although this is a simple acyclic and directed set of states, this is a new lineage of states that co-exists in parallel to the current set of states in Apache Airflow. Since a task can have any permutation of SLA state and Core Airflow State - we would need to evaluate whether the state would need to be updated for all of the task states, instead of just checking for tasks with unfinished states. This is obviously bad because it will incur much more work on the scheduler (or another process) and on the metadata database to have to evaluate the SLA state update for all tasks in any states.

One way to get around this would be to limit the evaluation of SLA state on just unfinished tasks. But this would mean that we would need to make the evaluation and update of the SLA state a first-class component of the task’s state change diagram, and make a strong guarantee that the SLA state of a finished task as defined by an executor/scheduler is up-to-date and final. This would require us to evaluate the state within the executor or the scheduler and overload _process_executor_events call in the scheduler.

In addition, users may want to track the SLA of tasks that have yet to be scheduled that are currently blocked by delays in upstream tasks. Evaluating SLAs for task instances that have yet to be scheduled will complicate the logic even more. (This is explored in depth in the Google Doc Addendum: Idea 1)

If overloading the critical path of the scheduler with task-level SLA evaluation and updates is not the direction we would like to take, I propose that we should limit the adoption of the refactored SLA feature to the DAG level, instead of supporting it for individual tasks. The sla_miss_callback is already defined at the dag level, and adopting the parallel state change diagram at the DAG level will make the implementation much simpler.

Considerations

What change do you propose to make?

There are three parts to this proposal.

First proposal is that we implement a DAG-level SLA feature, that is simpler to implement and manage, similar to the existing DAG-level on_success_callback and on_failure_callback.

Second proposal is that we mark the existing Task-level SLA misfeature for deprecation, and target the deprecation for Airflow 3.0.

Finally, I propose that we implement a new Task-level SLA feature, that is contained within the bounds of the task's lifetime, and is measured within the task itself.

Added:

DAG.sla: None | timedelta

DAG.on_sla_miss_callback: None | DagStateChangeCallback | list[DagStateChangeCallback]

BaseOperator.on_sla_miss_callback: None | TaskStateChangeCallback | list[TaskStateChangeCallback]

Updated:

BaseOperator.sla

Marked for Deprecation (3.0.0):

DAG.sla_miss_callback

SlaMiss

SlaMiss UI tab

Evaluation Logic added to

Dagrun.update_state

Dag.handle_callback

BaseOperator._execute_task

What problem does it solve?

Using DAG-level SLA feature, and a Task-level sla that is measured within the Task itself, solves all of the four problems listed under the Problems in the Current State section.

The SLA Miss will be detected as close to when it happens, and could actually be used to pre-emptively detect delays in their jobs, which is a very attractive feature increment for teams that are managing pipelines that have sensitive deadlines.
For Dag level SLAs, we will define the SLA from the dagrun.start_date for any dagrun that is not a DagRunType.BACKFILL type. This means that we will be able to extend the SLA feature to DagRunType.MANUAL and DagRunType.DATASET_TRIGGERED dagruns.
For Task level SLAs, we will define the SLA from the actual start time of the task.
sla and on_sla_miss_callback will be DAG or task level attributes, matching where the callback and sla will be measured for. Hence, the refactored feature will feel much more consistent for new users looking to adopt it
enabling sla feature will no longer overload the DagFileProcessingProcessor and the metadata database with sla_miss_callback requests and queries

.

Are there any downsides to this change?

The following types of SLAs are out of scope with the new DAG and Task SLA features.

New Task level SLA cannot be used within Deferrable Operators
SLAs of task relative to the scheduled start time of the DAG

A workaround to this is to use DateTimeTriggers and create helper tasks that can monitor specific tasks, and execute custom callbacks if they have not been completed within the defined SLA. These helper tasks could easily be customized to meet other diverse SLA measuring requirements that some users in the community has been requesting to date.

Below is an example of how these helper tasks could be implemented with the use of Deferrable Operators and Triggers.

Which users are affected by the change?

Users making use of the existing Task-level SLA misfeature will have to adopt the new DAG-level SLA feature. If task-level SLA monitoring is still desired, they will need to use a different workaround like above suggested Deferrable Operator that monitors the SLA of another task in the DAG.

How are users affected by the change? (e.g. DB upgrade required?)

DB upgrade will be required, on_sla_miss_callback will be repurposed and handled similarly to other dag callback requests. Primarily, the sla_miss table will be deprecated.

Other considerations?

Please find a detailed summary of all other considerations that have been explored in detail on the Google Doc

What defines this AIP as "done"?

This AIP is "done" when the proposed DAG-level SLA feature is implemented, and the existing Task-level SLA feature implementations have been marked for deprecation.

Space shortcuts

Page tree