
Userguide and reports

Review the contents of the component with a view to providing an up-to-date user guide, and write benchmarking code (JMH) for generating performance reports.

Difficulty: Minor
Potential mentors:
Gilles Sadowski, mail: erans (at) apache.org
Project Devs, mail: dev (at) commons.apache.org

Stream-based utilities

Since it is possible to release different modules with different language level requirements, we could consider creating a commons-numbers-complex-stream module to hold the utilities currently in class ComplexUtils.

From a management POV, it would avoid carrying the maintenance burden of an outdated API once the whole component switches to Java 8.

Release 1.0 should not ship with ComplexUtils.

Difficulty: Minor
Potential mentors:
Gilles Sadowski, mail: erans (at) apache.org
Project Devs, mail: dev (at) commons.apache.org

Beam

BeamSQL aggregation and analytics functions

BeamSQL has a long list of aggregation and aggregation-analytics functionalities to support.

To begin with, you will need to support this syntax:

analytic_function_name ( [ argument_list ] )
OVER (
  [ PARTITION BY partition_expression_list ]
  [ ORDER BY expression [ { ASC | DESC } ] [, ...] ]
  [ window_frame_clause ]
)

This will require touching the core components of BeamSQL:
1. The SQL parser, to support the syntax above.
2. The SQL core, to implement the physical relational operator.
3. Distributed algorithms, to implement a list of functions in a distributed manner.
4. Benchmarks, to measure the performance of your implementation.


Difficulty: Major
Potential mentors:
Rui Wang, mail: amaliujia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Add Daffodil IO for Apache Beam

From https://daffodil.apache.org/:

Daffodil is an open source implementation of the DFDL specification that uses these DFDL schemas to parse fixed format data into an infoset, which is most commonly represented as either XML or JSON. This allows the use of well-established XML or JSON technologies and libraries to consume, inspect, and manipulate fixed format data in existing solutions. Daffodil is also capable of the reverse by serializing or “unparsing” an XML or JSON infoset back to the original data format.

We should create a Beam IO that accepts a DFDL schema as an argument and can then produce and consume data in the specified format. I think it would be most natural for Beam users if this IO could produce Beam Rows, but an initial version that just operates with Infosets could be useful as well.

Difficulty: Major
Potential mentors:
Brian Hulette, mail: bhulette (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Implement Nexmark (benchmark suite) in Python and integrate it with Spark and Flink runners

Apache Beam [1] is a unified and portable programming model for data processing jobs (pipelines). The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

Nexmark [5] is a benchmark for streaming jobs, basically a set of jobs (queries) to test different use cases of the execution system. Beam implemented Nexmark for Java [6, 7] and it has been successfully used to improve the features of multiple Beam runners and discover performance regressions.

Thanks to the work on portability [8] we can now run Beam pipelines on top of open source systems like Apache Spark [9] and Apache Flink [10]. The goal of this issue/project is to implement the Nexmark queries in Python and configure them to run in our CI on top of these systems. The aim is for this to help the project track and improve the evolution of the portable open source runners and our Python implementation, as we do for Java (a small Python sketch follows after the references).

Because of the time constraints of GSoC we will adjust the goals (sub-tasks) depending on progress.

[1] https://beam.apache.org/
[2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
[5] https://web.archive.org/web/20100620010601/http://datalab.cs.pdx.edu/niagaraST/NEXMark/
[6] https://beam.apache.org/documentation/sdks/java/testing/nexmark/
[7] https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark
[8] https://beam.apache.org/roadmap/portability/
[9] https://spark.apache.org/
[10] https://flink.apache.org/
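
For orientation, a minimal sketch of what one Nexmark-style query could look like in the Beam Python SDK; the in-memory bid events and field names below are purely illustrative and not part of the existing Java suite:

            import apache_beam as beam

            # Illustrative bid events; the real benchmark uses a generated event stream.
            BIDS = [
                {"auction": 1, "price": 10},
                {"auction": 2, "price": 25},
                {"auction": 1, "price": 40},
            ]

            with beam.Pipeline() as pipeline:
                (pipeline
                 | "CreateBids" >> beam.Create(BIDS)
                 # Query-2 style: keep only bids for a chosen auction.
                 | "FilterAuctions" >> beam.Filter(lambda bid: bid["auction"] == 1)
                 | "Format" >> beam.Map(lambda bid: (bid["auction"], bid["price"]))
                 | "Print" >> beam.Map(print))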

Difficulty: Minor
Potential mentors:
Ismaël Mejía, mail: iemejia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Apache Airflow

One of the Celery executor tests is flaky

tests/executors/test_celery_executor.py::TestCeleryExecutor::test_celery_integration_0_amqp_guest_guest_rabbitmq_5672


Log attached.

Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

add GDriveToGcsOperator

There is a GcsToGDriveOperator, but there is no equivalent operator in the other direction.



Difficulty: Major
Potential mentors:
lovk korm, mail: lovk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

prevent autocomplete of username in login UI

The login page of the UI has autocomplete enabled for the username field. This should be disabled for security.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

list_dag_runs cli command should allow exec_date between start/end range and print start/end times

1. Accept arguments exec_date_from and exec_date_to to filter the execution_dates returned, i.e. show DAG runs with exec_date between 20190901 and 20190930 (see the sketch after this list).
2. Separately from that, print the start_date and end_date of each dagrun in the output (i.e. exec_date 20190907 had start_date 20190908 04:23 and end_date 20190908 04:38).
3. The dag_id argument should be optional.
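
A minimal sketch of the proposed filtering against the existing DagRun model; the function and argument names are illustrative, not the final CLI interface:

            from airflow.models import DagRun
            from airflow.utils.db import provide_session

            @provide_session
            def list_dag_runs(dag_id=None, exec_date_from=None, exec_date_to=None,
                              session=None):
                query = session.query(DagRun)
                if dag_id:  # dag_id is optional, per point 3
                    query = query.filter(DagRun.dag_id == dag_id)
                if exec_date_from:
                    query = query.filter(DagRun.execution_date >= exec_date_from)
                if exec_date_to:
                    query = query.filter(DagRun.execution_date <= exec_date_to)
                for run in query.all():
                    # Print start/end times alongside the execution date, per point 2.
                    print(run.dag_id, run.execution_date, run.start_date, run.end_date)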

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Mock Cassandra in tests

Cassandra consumes 1.173 GiB of memory. Travis does not have very efficient machines, so we should limit system/integration tests of components that do not require much attention, e.g. ones that are not changed often. Cassandra is a good candidate for this. This will free machine power for more needed work.

            CONTAINER ID   NAME                                  CPU %   MEM USAGE / LIMIT     MEM %    NET I/O          BLOCK I/O         PIDS
            8aa37ca50f7c   ci_airflow-testing_run_1f3aeb6d1052   0.00%   5.715MiB / 3.855GiB   0.14%    1.14kB / 0B      2.36MB / 0B       2
            f2b3be15558f   ci_cassandra_1                        0.69%   1.173GiB / 3.855GiB   30.42%   2.39kB / 0B      75.3MB / 9.95MB   50
            ef1de3981ca6   ci_krb5-kdc-server_1                  0.02%   12.15MiB / 3.855GiB   0.31%    2.46kB / 0B      18.9MB / 184kB    4
            be808233eb91   ci_mongo_1                            0.31%   36.71MiB / 3.855GiB   0.93%    2.39kB / 0B      43.2MB / 19.1MB   24
            667e047be097   ci_rabbitmq_1                         0.77%   69.95MiB / 3.855GiB   1.77%    2.39kB / 0B      43.2MB / 508kB    92
            2453dd6e7cca   ci_postgres_1                         0.00%   7.547MiB / 3.855GiB   0.19%    1.05MB / 889kB   35.4MB / 145MB    6
            78050c5c61cc   ci_redis_1                            0.29%   1.695MiB / 3.855GiB   0.04%    2.46kB / 0B      6.94MB / 0B       4
            c117eb0a0d43   ci_mysql_1                            0.13%   452MiB / 3.855GiB     11.45%   2.21kB / 0B      33.9MB / 548MB    21
            131427b19282   ci_openldap_1                         0.00%   45.68MiB / 3.855GiB   1.16%    2.64kB / 0B      32.8MB / 16.1MB   4
            8c2549c010b1   ci_docker_1                           0.59%   22.06MiB / 3.855GiB   0.56%    2.39kB / 0B      95.9MB / 291kB    30

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add SalesForce connection to UI

Airflow has a SalesForceHook, but it doesn't have a distinct connection type.

In order to create a Connection, one must expose its secret token as plain text:

https://stackoverflow.com/questions/53510980/salesforce-connection-using-apache-airflow-ui

Also it's not very intuitive that the Conn Type should remain blank.

It would be easier and more user-friendly if there were a Salesforce connection type in the UI with a security_token field that is encrypted.

Difficulty: Major
Potential mentors:
Elad, mail: eladk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

dag_processor_manager/webserver/scheduler logs should be created under date folder

DAG-level logs are written under separate date folders. This is great because old dates are not modified/accessed, so they can easily be purged by utilities like tmpwatch.

This JIRA is about making the other logs (such as dag_processor_manager/webserver/scheduler, etc.) go under separate date folders as well, to allow easy purging. The log from redirecting 'airflow scheduler' to stdout grows by over 100 MB a day in my environment.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Request for OktopostToGoogleStorageOperator

Difficulty: Major
Potential mentors:
HaloKu, mail: HaloKu (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

clear cli command needs a 'conf' option

Key-value pairs of conf can be passed into the trigger_dag command, e.g.

            --conf '{"ric":"amzn"}'

The clear command needs this feature too, i.e. in case exec_date is important and there was a failure halfway through the 1st dagrun due to a bad conf being sent on the trigger_dag command, and you want to run the same exec_date but with a new conf on a 2nd dagrun.

An alternative solution would be a new delete_dag_run cli command, so you never need to 'clear' but can do a 2nd DagRun for the same exec date.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Improve multiline output in admin gui

Multiline attributes, rendered templates, or Xcom variables are not well supported in the admin GUI at present. Any values are treated as native HTML text() blocks and as such all formatting is lost. When passing structured data such as YAML in these variables, it makes a real mess of them.

Ideally, these values should keep their line-breaks and indentation.

This should only require wrapping these code blocks in a <pre> block or setting `white-space: pre` on the class for the block.

Difficulty: Major
Potential mentors:
Paul Rhodes, mail: withnale (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Snowflake Connector cannot run more than one sql from a sql file

I am getting an error when passing a SQL file with multiple SQL statements to the snowflake operator:

            snowflake.connector.errors.ProgrammingError: 000006 (0A000): 01908236-01a3-b2c4-0000-f36100052686: Multiple SQL statements in a single API call are not supported; use one API call per statement instead.

It only fails if you pass a file with multiple statements. A file with just one statement, or a list of statements passed to the operator, works fine.

After looking at the current snowflake operator implementation, it seems a list of SQL statements works because the hook executes one statement at a time, whereas multiple statements in a SQL file fail because they are all read as one continuous string.


How can we fix this:

There is an API call in Snowflake python connector that supports multiple SQL statements.

https://docs.snowflake.net/manuals/user-guide/python-connector-api.html#execute_string

This can be fixed by overriding the run function in SnowflakeHook to support multiple SQL statements in a file. A sketch of such an override follows.
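
A minimal sketch of such an override, assuming the Airflow 1.10 contrib import path; execute_string() is the connector call documented above, the subclass itself is illustrative:

            from airflow.contrib.hooks.snowflake_hook import SnowflakeHook

            class MultiStatementSnowflakeHook(SnowflakeHook):
                def run(self, sql, autocommit=False, parameters=None):
                    if isinstance(sql, str):
                        # Let the connector split and execute the statements itself.
                        conn = self.get_conn()
                        try:
                            conn.autocommit(autocommit)
                            for cursor in conn.execute_string(sql):
                                self.log.info("Statement done, rowcount=%s", cursor.rowcount)
                        finally:
                            conn.close()
                    else:
                        # Lists of statements already work with the stock implementation.
                        super().run(sql, autocommit=autocommit, parameters=parameters)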

Difficulty: Major
Potential mentors:
Saad, mail: saadk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

SageMakerEndpointOperator is not idempotent

The SageMakerEndpointOperator currently takes an argument "operation" with value "create"/"update", which determines whether to create or update a SageMaker endpoint. However, this doesn't work in the following situation:

  • DAG run #1 create the endpoint (have to provide operation="create" here)
  • Following DAG runs will update the endpoint created by DAG run #1 (would have to edit DAG and set operation="update" here)

Which should be a very valid use case IMO.

The SageMakerEndpointOperator should check for itself whether an endpoint with name X already exists and overwrite it (configurable, as desired by the user). A sketch of the check follows.
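
A minimal sketch of that check using plain boto3 (the operator would go through its hook, but describe_endpoint/create_endpoint/update_endpoint are the relevant SageMaker API calls; the helper name is illustrative):

            import boto3
            from botocore.exceptions import ClientError

            def deploy_endpoint(endpoint_name, endpoint_config_name):
                client = boto3.client("sagemaker")
                try:
                    client.describe_endpoint(EndpointName=endpoint_name)
                    exists = True
                except ClientError:
                    exists = False
                if exists:
                    # Endpoint already there: update it instead of failing.
                    client.update_endpoint(EndpointName=endpoint_name,
                                           EndpointConfigName=endpoint_config_name)
                else:
                    client.create_endpoint(EndpointName=endpoint_name,
                                           EndpointConfigName=endpoint_config_name)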

Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Check and document that docker-compose >= 1.20 is needed to run breeze


Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Airflow UI should also display dag_concurrency reached

Currently, in the main view, the schedule column box is highlighted in red if the max number of DAG runs is reached. In this case no more DAG runs can be started until a DAG run completes.

I think it should also be displayed in red when dag_concurrency (i.e. max concurrent tasks) is reached. In this case too, no more tasks can be started until a task completes. However, there is currently nothing in the UI showing that (currently running 1.10.5).

Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add ability to specify a maximum modified time for objects in GoogleCloudStorageToGoogleCloudStorageOperator

The fact that I can specify a minimum modified time to filter objects on in GoogleCloudStorageToGoogleCloudStorageOperator but not a maximum seems rather arbitrary. Especially considering the typical usage scenario of running a copy on a schedule, I would like to be able to find objects created within a particular schedule interval for my execution, and not just copy all of the latest objects.

Difficulty: Major
Potential mentors:
Joel Croteau, mail: TV4Fun (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add ability to specify multiple objects to copy to GoogleCloudStorageToGoogleCloudStorageOperator

The restriction in GoogleCloudStorageToGoogleCloudStorageOperator that I am only allowed to specify a single object to list is rather arbitrary. If I specify a wildcard, all it does is split at the wildcard and use that to get a prefix and delimiter. Why not just let me do this search myself and return a list of objects?

Difficulty: Major
Potential mentors:
Joel Croteau, mail: TV4Fun (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

security - hide all password/secret/credentials/tokens from log

I am proposing a new config flag. It would enforce a generic override in all Airflow logging to suppress printing any line containing a case-insensitive match on any of: password|secret|credential|token (a sketch follows below).


If you do a

            grep -iE 'password|secret|credential|token' -R <airflow_logs_folder>

you may be surprised with what you find :O


Ideally it would replace only the sensitive value, but values appear in various formats, e.g.:

            key=value, key'=value, key value, key"=value, key = value, key"="value, key:value

etc.
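
A minimal sketch of what the flag could enable: a logging filter that drops any record matching the pattern (redacting only the value would be a refinement of the same idea):

            import logging
            import re

            SENSITIVE = re.compile(r"password|secret|credential|token", re.IGNORECASE)

            class SuppressSensitiveFilter(logging.Filter):
                def filter(self, record):
                    # Returning False drops the whole log line.
                    return not SENSITIVE.search(record.getMessage())

            logging.getLogger("airflow.task").addFilter(SuppressSensitiveFilter())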

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

AWS Batch Operator improvement to support batch job parameters

AWSBatchOperator does not currently support AWS Batch Job parameters.

When creating an AWS Batch Job Definition and when submitting a job to AWS Batch, it's possible to define and supply job parameters. Most of our AWS Batch jobs take parameters but we are not able to pass them using the AWSBatchOperator.

In order to support batch job parameters, a new argument called job_parameters could be added to __init__(self), saved to an instance variable and supplied to self.client.submit_job() in the execute() method:

            self.client.submit_job(
                jobName=self.job_name,
                jobQueue=self.job_queue,
                jobDefinition=self.job_definition,
                containerOverrides=self.overrides,
                parameters=self.job_parameters)


See https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch.html#Batch.Client.submit_job

Difficulty: Major
Potential mentors:
Tim Mottershead, mail: TimJim (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add docs how to integrate with grafana and prometheus

I'm not sure how doable this is, but one of the key components missing in Airflow is the ability to notify about detected anomalies, something like Grafana https://grafana.com/ does.

It would be great if Airflow could add support for such tools.


I'm talking here about Airflow itself. For example: if a DAG run normally takes 5 minutes but is now, for whatever reason, running over 30 minutes, we want an alert to be sent with a graph that shows the anomaly.

Difficulty: Major
Potential mentors:
lovk korm, mail: lovk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add support for dmypy (Mypy daemon) to Breeze environment

Per discussion in https://github.com/apache/airflow/pull/5664 we might use dmypy for local development speedups.

Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Allow GoogleCloudStorageToBigQueryOperator to accept source_objects as a string or otherwise take input from XCom

`GoogleCloudStorageToBigQueryOperator` should be able to have its `source_objects` dynamically determined by the results of a previous workflow. This is hard to do with it expecting a list, as any template expansion will render as a string. This could be implemented either as a check for whether `source_objects` is a string, and trying to parse it as a list if it is, or a separate argument for a string encoded as a list.

My particular use case for this is as follows:

  1. A daily DAG scans a GCS bucket for all objects created in the last day and loads them into BigQuery.
  2. To find these objects, a `PythonOperator` scans the bucket and returns a list of object names.
  3. A `GoogleCloudStorageToBigQueryOperator` is used to load these objects into BigQuery.

The operator should be able to have its list of objects provided by XCom, but there is no functionality to do this, and trying to do a template expansion along the lines of `source_objects='{{ task_instance.xcom_pull(key="KEY") }}'` doesn't work because this is rendered as a string, which `GoogleCloudStorageToBigQueryOperator` will try to treat as a list, with each character being a single item.
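
A minimal sketch of the string check described above; ast.literal_eval is one safe way to turn a rendered "['a.csv', 'b.csv']" back into a list, and the helper name is illustrative:

            import ast

            def normalize_source_objects(source_objects):
                if isinstance(source_objects, str):
                    try:
                        parsed = ast.literal_eval(source_objects)
                        if isinstance(parsed, (list, tuple)):
                            return list(parsed)
                    except (ValueError, SyntaxError):
                        pass
                    # Plain string: treat it as a single object name, not characters.
                    return [source_objects]
                return list(source_objects)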

Difficulty: Major
Potential mentors:
Joel Croteau, mail: TV4Fun (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Make AWS Operators Pylint compatible

Make AWS Operators Pylint compatible.

Difficulty: Major
Potential mentors:
Ishan Rastogi, mail: gto (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Google Search Ads 360 integration

Hi

This project lacks integration with the Google Search Ads 360 service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://developers.google.com/search-ads/
API Documentation: https://developers.google.com/resources/api-libraries/documentation/dfareporting/v3.3/python/latest/

Lots of love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Cloud AutoML Tables integration

Hi

This project lacks integration with the Cloud AutoML Tables service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://cloud.google.com/automl-tables/docs/
API Documentation: https://googleapis.github.io/google-cloud-python/latest/automl/index.html

Love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Cloud AutoML NL Sentiment integration

Hi

This project lacks integration with the Cloud AutoML NL Sentiment service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://cloud.google.com/natural-language/automl/sentiment/docs/
API Documentation: https://googleapis.github.io/google-cloud-python/latest/automl/index.html

Lots of love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Cloud AutoML NL Entity Extraction integration

Hi

This project lacks integration with the Cloud AutoML NL Entity Extraction service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://cloud.google.com/natural-language/automl/entity-analysis/docs/
API Documentation: https://googleapis.github.io/google-cloud-python/latest/automl/index.html

Love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Cloud AutoML NL Classification integration

Hi

This project lacks integration with the Cloud AutoML NL Classification service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://cloud.google.com/natural-language/automl/docs/
API Documentation: https://googleapis.github.io/google-cloud-python/latest/automl/index.html

Love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Gzip compression to S3_hook

Allow loading a compressed file in the load_file function.

We have similar logic in GoogleCloudStorageHook. A sketch of the idea follows.
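
A minimal sketch of the idea, mirroring the GCS hook's behaviour: compress the local file to a temporary .gz before handing it to boto3 (the gzip_compress flag is a proposal, not an existing load_file parameter):

            import gzip
            import shutil
            import tempfile

            import boto3

            def load_file(filename, key, bucket_name, gzip_compress=False):
                client = boto3.client("s3")
                if gzip_compress:
                    tmp = tempfile.NamedTemporaryFile(suffix=".gz", delete=False)
                    tmp.close()  # only the path is needed; re-open via gzip below
                    with open(filename, "rb") as src, gzip.open(tmp.name, "wb") as dst:
                        shutil.copyfileobj(src, dst)
                    filename = tmp.name  # upload the compressed copy instead
                client.upload_file(filename, bucket_name, key)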

Difficulty: Major
Potential mentors:
jack, mail: jackjack10 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add FacebookAdsHook

Add hook to interact with FacebookAds

Difficulty: Major
Potential mentors:
jack, mail: jackjack10 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

S3Hook load_file should support ACL policy parameter

We have a use case where we are uploading files to an S3 bucket in a different AWS account from the one Airflow is running in. AWS S3 supports this situation using pre-canned ACL policies, specifically bucket-owner-full-control.

However, the current implementations of the S3Hook.load_*() and S3Hook.copy_object() methods do not allow us to supply any ACL policy for the file being uploaded/copied to S3.

It would be good to add another optional parameter to the S3Hook methods called acl_policy, which would then be passed into the boto3 client method calls like so:


            # load_file
            ...
            if encrypt:
                extra_args['ServerSideEncryption'] = "AES256"
            if acl_policy:
                extra_args['ACL'] = acl_policy

            client.upload_file(filename, bucket_name, key, ExtraArgs=extra_args)


            # load_bytes
            ...
            if encrypt:
                extra_args['ServerSideEncryption'] = "AES256"
            if acl_policy:
                extra_args['ACL'] = acl_policy

            # upload_fileobj is the bytes counterpart of upload_file
            client.upload_fileobj(file_obj, bucket_name, key, ExtraArgs=extra_args)


            # copy_object
            self.get_conn().copy_object(Bucket=dest_bucket_name,
                                        Key=dest_bucket_key,
                                        CopySource=CopySource,
                                        ACL=acl_policy)

Difficulty: Major
Potential mentors:
Keith O'Brien, mail: kbobrien@gmail.com (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Allow filtering by all columns in Browse Logs view

The "Browse Logs" UI currently allows filtering by "DAG ID", "Task ID", "Execution Date", and "Extra".

For consistency and flexibility, it would be good to allow filtering by any of the available columns, specifically "Datetime", "Event", "Execution Date", and "Owner". 

Difficulty: Minor
Potential mentors:
Brylie Christopher Oxley, mail: brylie (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Support for emptyDir volume in KubernetesExecutor

Currently it seems that the Kubernetes Executor expects dags_volume_claim or git_repo to always be defined through airflow.cfg; otherwise it does not come up.
Though there is support for an "emptyDir" volume in worker_configuration.py, kubernetes_executor fails in the _validate function if these configs are not defined.
Our DAG files are stored in a remote location which can be synced to the worker pod through an init/side-car container. We are exploring whether it makes sense to allow the Kubernetes executor to come up in cases where dags_volume_claim and git_repo are not defined. In such cases the worker pod would look for the DAGs in emptyDir under the worker_airflow_dags path (like it does for git-sync). DAG files can be made available in the worker_airflow_dags path through an init/side-car container.


Difficulty: Major
Potential mentors:
raman, mail: ramandumcs (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Improve performance of cc1e65623dc7_add_max_tries_column_to_task_instance migration

The cc1e65623dc7_add_max_tries_column_to_task_instance migration creates a DagBag for the corresponding DAG for every single task instance. This is very redundant and not necessary.

Hence, there are discussions on Slack like these:

            murquizo [Jan 17th at 1:33 AM]
            Why does the airflow upgradedb command loop through all of the dags?

            ....

            murquizo [14 days ago]
            NICE, @BasPH! that is exactly the migration that I was referring to. We have about 600k task instances and have a several python files that generate multiple DAGs, so looping through all of the task_instances to update max_tries was too slow. It took 3 hours and didnt even complete! i pulled the plug and manually executed the migration. Thanks for your response.


An easy improvement to accomplish is to parse a DAG only once and then set the try_number for each of its task instances. I created a branch for it (https://github.com/BasPH/incubator-airflow/tree/bash-optimise-db-upgrade), am currently running tests and will make a PR when done. The core idea is sketched below.
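
The core idea, as a minimal illustrative sketch (not the actual migration code): build a DagBag at most once per DAG file and reuse it for every task instance of that DAG.

            from airflow.models import DagBag

            _dagbag_cache = {}

            def get_dag_cached(fileloc, dag_id):
                # Parse each DAG file only once instead of once per task instance.
                if fileloc not in _dagbag_cache:
                    _dagbag_cache[fileloc] = DagBag(fileloc)
                return _dagbag_cache[fileloc].get_dag(dag_id)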

Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

KubernetesPodOperator: Use secretKeyRef or configMapKeyRef in env_vars

The env_vars attribute of the KubernetesPodOperator allows passing environment variables as strings, but it doesn't allow passing a value from a ConfigMap or a Secret.

I'd like to be able to do

            modeling = KubernetesPodOperator(
                ...
                env_vars={
                    'MY_ENV_VAR': {
                        'valueFrom': {
                            'secretKeyRef': {
                                'name': 'an-already-existing-secret',
                                'key': 'key',
                            }
                        }
                    },
                },
                ...
            )


Right now if I do that, Airflow generates the following config

            - name: MY_ENV_VAR
              value:
                valueFrom:
                  configMapKeyRef:
                    name: an-already-existing-secret
                    key: key


instead of 

            - name: MY_ENV_VAR
              valueFrom:
                configMapKeyRef:
                  name: an-already-existing-secret
                  key: key


The extract_env_and_secrets method of the KubernetesRequestFactory could check whether the value is a dictionary and, if so, use it directly.
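
A minimal sketch of that check as a standalone helper (illustrative, not the actual KubernetesRequestFactory code):

            def build_env_entries(env_vars):
                entries = []
                for name, value in (env_vars or {}).items():
                    if isinstance(value, dict):
                        # Structured spec such as {'valueFrom': {'secretKeyRef': ...}}:
                        # merge it directly instead of wrapping it in 'value'.
                        entry = {'name': name}
                        entry.update(value)
                    else:
                        entry = {'name': name, 'value': value}
                    entries.append(entry)
                return entries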


Difficulty: Major
Potential mentors:
Arthur Brenaut, mail: abrenaut (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Support for Passing Custom Env variables while launching k8 Pod

Is there a way to provide env variables while launching a Kubernetes pod through the Kubernetes executor? We need to pass some env variables which are referred to inside our Airflow operator. So can we provide custom env variables to the docker run command while launching the task pod? Currently it seems that only predefined env variables are supported.

worker_configuration.py

            def get_environment(self):
                """Defines any necessary environment variables for the pod executor"""
                env = {
                    'AIRFLOW__CORE__DAGS_FOLDER': '/tmp/dags',
                    'AIRFLOW__CORE__EXECUTOR': 'LocalExecutor'
                }
                if self.kube_config.airflow_configmap:
                    env['AIRFLOW__CORE__AIRFLOW_HOME'] = self.worker_airflow_home
                return env


Possible solution

At the moment there is not a way to configure environment variables on a per-task basis, but it shouldn't be too hard to add that functionality. Extra config options can be passed through the `executor_config` on any operator (a hypothetical usage sketch follows below):

https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L2423-L2437

Which are eventually used here to construct the kubernetes pod for the task:

https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/kubernetes/worker_configuration.py#L186
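
A hypothetical sketch of how the per-task setting could look once added; the "env_vars" key under executor_config does not exist today and is shown purely to illustrate the proposal:

            from airflow.operators.python_operator import PythonOperator

            task = PythonOperator(
                task_id="example_task",
                python_callable=lambda: None,
                executor_config={
                    "KubernetesExecutor": {
                        "env_vars": {"MY_SETTING": "value"},  # hypothetical key
                    }
                },
                dag=dag,  # assumes an existing DAG object named dag
            )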


Difficulty: Major
Potential mentors:
raman, mail: ramandumcs (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Drop snakebite in favour of pyarrow

The current HdfsHook relies on the snakebite library, which is unfortunately not compatible with Python 3. Adding Python 3 support to the HdfsHook therefore requires switching to a different library for interacting with HDFS. The hdfs3 library is an attractive alternative, as it supports Python 3 and seems to be stable and relatively well supported.

Update: hdfs3 no longer receives updates. The best library right now seems to be pyarrow: https://arrow.apache.org/docs/python/filesystems.html
Therefore I would like to switch to pyarrow instead of hdfs3.
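
For reference, a minimal sketch of listing a directory with pyarrow's HDFS filesystem API from the linked documentation (host and port are placeholders):

            from pyarrow import fs

            hdfs = fs.HadoopFileSystem("namenode.example.com", port=8020)
            # List the immediate children of /data, similar to snakebite's ls.
            for info in hdfs.get_file_info(fs.FileSelector("/data")):
                print(info.path, info.type, info.size)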

Difficulty: Blocker
Potential mentors:
Julian de Ruiter, mail: jrderuiter (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Allow wild-cards in the search box in the UI

In the UI there is a search box.

If you search for a DAG name you will see results for the search as you type.

Please add support for wild-cards, mainly for: *


So if I have a DAG called abcd and I search for ab*, I will see it in the list.


This is very helpful for systems with 100+ DAGs.
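
A minimal sketch of the matching itself; fnmatch gives exactly the * semantics described (where the filtering hooks into the UI is a separate question):

            import fnmatch

            dag_ids = ["abcd", "abxy", "zzz_daily"]
            print(fnmatch.filter(dag_ids, "ab*"))  # ['abcd', 'abxy']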

Difficulty: Major
Potential mentors:
jack, mail: jackjack10 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Show lineage in visualization


Difficulty: Major
Potential mentors:
Bolke de Bruin, mail: bolke (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add additional quick start to INSTALL


Difficulty: Blocker
Potential mentors:
Bolke de Bruin, mail: bolke (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

HttpHook shall be configurable to non-status errors

When using HttpSensor, the underlying HttpHook performs the request. If the target service is down and refuses the connection, the task fails immediately.

It would be great if this behaviour were configurable, so the sensor would keep sensing until the service is up again. A sketch of a workaround follows after the traceback.

Traceback of the error:

            [2017-04-29 02:00:31,248] {base_task_runner.py:95} INFO - Subtask: requests.exceptions.ConnectionError: HTTPConnectionPool(host='xxxx', port=123): Max retries exceeded with url: /xxxx (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f94b64b44e0>: Failed to establish a new connection: [Errno 111] Connection refused',))
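
A minimal sketch of a workaround in the spirit of the request, assuming the Airflow 1.10 sensor import path: a sensor subclass that treats a refused connection as "not yet available" rather than a hard failure:

            import requests
            from airflow.sensors.http_sensor import HttpSensor

            class TolerantHttpSensor(HttpSensor):
                def poke(self, context):
                    try:
                        return super().poke(context)
                    except requests.exceptions.ConnectionError:
                        # Service refused the connection: report "not ready" and
                        # let the sensor poke again instead of failing the task.
                        self.log.info("Service not reachable yet, will retry")
                        return False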

Difficulty: Major
Potential mentors:
Deo, mail: jy00520336 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

`/health` endpoint on each component

Please provide a /health endpoint on each of the following components:

  • webserver (to avoid pinging the / root endpoint)
  • worker
  • scheduler

This would ease integration with the Mesos/Marathon framework.

If you agree, I volunteer to add this change.
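
For the webserver, a minimal sketch of such an endpoint as a Flask blueprint (the blueprint and the trivial check are illustrative; the worker and scheduler would need a different mechanism):

            from flask import Blueprint, jsonify

            health = Blueprint("health", __name__)

            @health.route("/health")
            def health_check():
                # A trivial liveness answer; deeper checks could be added here.
                return jsonify(status="OK")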

Difficulty: Major
Potential mentors:
gsemet, mail: gaetan@xeberon.net (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Custom parameters for DockerOperator

Add the ability to specify custom parameters to the docker cli, e.g. --volume-driver or --net="bridge" or any other.

Difficulty: Major
Potential mentors:
Alexandr Nikitin, mail: alexandrnikitin (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org