Status

State: Draft

Discussion thread: https://lists.apache.org/thread.html/9b379c2583dc765fa1c6b6222f3cde6e505b0759f5e5098144d33949@%3Cdev.airflow.apache.org%3E

JIRA: https://issues.apache.org/jira/browse/AIRFLOW-3964


Motivation

Thanks to Fokko for sharing the sensor rescheduling PR by Seelmann (https://github.com/apache/airflow/pull/3596/files). It did a great job.

I am reopening this AIP after reviewing the sensor rescheduling PR. Reschedule mode does not reduce the number of worker processes used by sensors, so the batch sensor idea can supplement it for this purpose and should work well alongside reschedule mode.

Low efficiency in Airflow Sensor Implementation:

Sensors are a special kind of operator that will keep running until a certain criterion is met. Examples include a specific file landing in HDFS or S3, a partition appearing in Hive, or a specific time of the day. Sensors are derived from BaseSensorOperator and run a poke method at a specified poke_interval until it returns True.
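The poke loop described above can be sketched in plain Python. This is only an illustration of the pattern, not Airflow's actual BaseSensorOperator code: the function name run_sensor, the timeout parameter, and the injectable sleep and clock arguments are assumptions made so that the example runs on its own.

```python
import time
from typing import Callable


def run_sensor(
    poke: Callable[[], bool],
    poke_interval: float,
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call poke() every poke_interval seconds.

    Returns True as soon as poke() returns True, or False once
    `timeout` seconds have elapsed without success (hypothetical parameter).
    """
    started = clock()
    while True:
        if poke():
            return True
        if clock() - started >= timeout:
            return False
        # The process stays alive here between pokes; this idle waiting
        # is what occupies one worker per sensor.
        sleep(poke_interval)
```

Note that the process running this loop does nothing useful between pokes, which is the cost the rest of this proposal tries to reduce.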

Sensor tasks are inefficient because, in the current design, we spawn a separate worker process for each partition sensor. This worker may run for a long time, until the target partition is available. When many sensor tasks need to run within certain time limits, we have to allocate a lot of resources to have enough workers for them.

Idea

We propose two approaches that could address this issue: batch-sensor and smart-sensor.

Batch-sensor

The basic idea of batch-sensor is to process sensor tasks in batches to save resources. While running, a batch-sensor takes N partition sensor requests as input and pokes those N partitions periodically. When the batch-sensor finds that the criterion of a sensor task is met, it updates the database record for that sensor task.
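A minimal sketch of this batching idea follows, assuming each sensor request can be reduced to a callable that returns True once its criterion is met. The names run_batch_sensor, mark_success, and max_rounds are hypothetical, and the database update is stood in for by a callback.

```python
import time
from typing import Callable, Dict, Set


def run_batch_sensor(
    pokes: Dict[str, Callable[[], bool]],
    mark_success: Callable[[str], None],
    poke_interval: float,
    max_rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Poke N sensor requests from a single process.

    `pokes` maps a task id to its poke callable. `mark_success` stands in
    for the database update. Returns the ids that are still pending after
    `max_rounds` rounds (hypothetical limit so the sketch terminates).
    """
    pending = set(pokes)
    for _ in range(max_rounds):
        # sorted() returns a new list, so removing from `pending` is safe.
        for task_id in sorted(pending):
            if pokes[task_id]():
                mark_success(task_id)
                pending.discard(task_id)
        if not pending:
            break
        sleep(poke_interval)
    return pending
```

One process now waits on behalf of N sensors, so the number of idle workers no longer grows with the number of sensor tasks.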

To do this, we need to create a sensor base class called ‘batchable’ and make all sensors inherit from it. We also need to change how the scheduler handles batchable sensor tasks: the scheduler will find as many batchable sensor tasks as possible and run them in a batch.
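One way the base class and the scheduler-side grouping might look is sketched below. Everything here is an assumption for illustration: the Batchable class body, the PartitionSensor example, the group_into_batches helper, and the batch_size parameter are not part of Airflow or of this proposal's final design.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Set


class Batchable(ABC):
    """Hypothetical base class marking a sensor as safe to run in a batch."""

    @abstractmethod
    def poke(self) -> bool:
        """Return True once the sensor's criterion is met."""


class PartitionSensor(Batchable):
    """Toy sensor: succeeds when its partition is in `available`."""

    def __init__(self, partition: str, available: Set[str]) -> None:
        self.partition = partition
        self.available = available

    def poke(self) -> bool:
        return self.partition in self.available


def group_into_batches(tasks: Iterable[object], batch_size: int) -> List[List[Batchable]]:
    """Collect every batchable task and split them into fixed-size batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batchable = [t for t in tasks if isinstance(t, Batchable)]
    return [batchable[i:i + batch_size] for i in range(0, len(batchable), batch_size)]
```

Tasks that do not inherit from the base class are simply left out of the batches and would continue to be scheduled as they are today.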

Smart-sensor

Smart-sensor is an improvement on top of batch-sensor.

The idea of smart-sensor is that its worker runs like a service: it periodically queries the task-instance table to find sensor tasks, pokes the metastore, and updates the task-instance table when it detects that a certain partition or file has been created.
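A single pass of such a service could look roughly like the sketch below. It uses an in-memory SQLite table as a stand-in for Airflow's task-instance table; the simplified schema, the 'sensing' state name, the target column, and the function names are all assumptions made for illustration.

```python
import sqlite3
from typing import Callable


def make_demo_db() -> sqlite3.Connection:
    """Create a simplified, hypothetical task_instance table with two sensors."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, target TEXT, state TEXT)")
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?, ?)",
        [
            ("wait_a", "db.tbl/ds=2019-03-01", "sensing"),
            ("wait_b", "db.tbl/ds=2019-03-02", "sensing"),
        ],
    )
    conn.commit()
    return conn


def smart_sensor_pass(conn: sqlite3.Connection, poke: Callable[[str], bool]) -> int:
    """Poke every waiting sensor row once; mark satisfied rows as success.

    `poke` stands in for the metastore check. Returns the number of rows updated.
    """
    rows = conn.execute(
        "SELECT task_id, target FROM task_instance WHERE state = 'sensing'"
    ).fetchall()
    updated = 0
    for task_id, target in rows:
        if poke(target):
            conn.execute(
                "UPDATE task_instance SET state = 'success' WHERE task_id = ?",
                (task_id,),
            )
            updated += 1
    conn.commit()
    return updated
```

A long-running service would call smart_sensor_pass in a loop with a sleep between passes. Unlike the batch-sensor, it discovers its work from the table on each pass rather than receiving a fixed list of N requests up front.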

Considerations

Both approaches need to update the scheduler behavior and Airflow DB schema.
