Status
Motivation
During the 4-year project development, the community made many decisions about the structure and principles of module creation. Some decisions have been made by Airbnb and they no longer apply. Other recommendations were rejected because of the low recognition of the community.
The first chapter discusses the problems that concern modules. This chapter is written from a general perspective. Imagine that there are only packages and modules in the project. Specific lines of code do not matter. The next chapter discusses the implementation details and ways of introducing proposed improvements from the perspective of the source code. Here is the important full content of the files. Chapter “Executive considerations” discusses how to introduce these rules into our codebase. You should look at this chapter from the perspective of the repository. At the end of this document, there are conclusions and a summary.
Chapters containing considerations are divided into specific cases. Each case has a possible solution discussed. Included examples serve to show specific cases and solutions In the real world, many problems overlap, so the solutions and examples also overlap. The examples try to be readable for this purpose are limited only to a given case.
This document assumes you are already familiar with Airflow codebase and may change over time based on feedback.
Every time I write resources, I mean operators, hooks, sensors or another piece of code that is specific to an integration.
All proposed solutions are backwards compatible.
Voting results (see voting below)
What to do with the "contrib" folder | Drop modules *_operator suffix | Grouping Cloud Providers operators/sensors/hooks | *Operator *Sensor *Hook in class name | Isolated cases | Deprecation method | |
---|---|---|---|---|---|---|
A. Move everything "contrib" → "airflow" | A. Drop the suffix. Example:
| D. Group operators/sensors/hooks in airflow/<PROVIDER>/operators(sensors, hooks). Non-cloud provider ones are moved to airflow/operators(sensors/hooks).
| B. No change - keep Operator/Sensor suffix in class name. | A. Make individual decisions of renames for operators that do not follow common conventions used for other operators. Consistency trumps compatibility. Examples: DataProcHadoopOperator renamed to: DataprocHadoopOperator | A. Native python method (with better IDE support and more flexible but a bit more verbose) |
Architectural considerations
It is based on widely accepted rules, and also shows cases when these rules are not followed. I will also show ideas for improving these principles.
Case #1 airflow.contrib.{resources}
There should be one-- and preferably only one --obvious way to do it.
Tim Peters, The Zen of Python
Currently, resources are located in two places:
airflow.{resource_type}
airflow.contrib.{resource_type}
In the first place are resources that were originally maintained by Airbnb. However, they have been transferred to Apache and Airbnb is not responsible for their maintenance. The community is responsible for maintaining them. In the second place are operators that are maintained by the community from the beginning until now. Currently, all new resources are added only to the second place. The changes and development of the first place are strictly limited.
There is currently no reason for this type of division. All resources should be in one place.
Solution A:
We should move all the resources from the first place to the second. All resources will be located in airflow.{hooks/operators/sensors/example_dags}.
Advantages | Disadvantages |
- resources are located in one place (and one place only). No need to check multiple locations for docs for example. - no confusion for new contributors whether their work needs to be managed differently. (New contributors shouldn’t wonder if there is a difference between their work and non-contrib work. Because there shouldn’t be one!) | - resources moved from contrib to core has to be tested before moved. Outdated hooks/operators need to be updated or removed. Unit tests for all need to be added if it doesn’t already exists. |
Solution B:
Move all the well-tested and maintained resources to the core for e.g GCP resources are well-tested with good documentation. All the new resources need to be first added to contrib folder and once they reach “maturity” they can be moved to core. We need to define what is that maturity. Contrib resources would be analogous to beta features in a product. We should also consider changing the words "contrib" to "incubator" in this situation.
|
|
|
|
Solution C:
No change.
Case #2 git *_{operator/sensor}{/s}.py
Currently, the import takes the following format:
airflow{.contrib/}.operators.*_operator |
There is information redundancy here. There is no need to use the word "operator" twice
It is worth mentioning that the word “operator” also appears in the class name
Solution A:
The import should take the following format:
airflow{.contrib/}.operators.* |
Suffix “_operator” should be dropped
Example:
File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp_bigtable.py.
Advantages | Disadvantages |
- Shorter name, but still focussing on the essential task of the class (no information loss) | - |
Solution B:
No change
Advantages and disadvantages are analogous to solution A.
Case #3 {aws/azure/gcp}_*
With the development of integration for the largest cloud providers, a large part of new files received a prefix, which is assigned to each of them. For example, for Google Cloud Platform it is "gcp". Google mentioned the practice even in official recommendations[1]. Not all files follow this rule. Ansible also uses similar structure.
Solution A:
The prefix can be completely dropped. Major provider can get a separate sub-module for each type of resource.
Operators that integrate with two services will not change.
Example:
File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp/bigtable_operator.py.
The package format will look like this:
airflow/{contrib/}{resource_type}/{supplier/}bigtable_operator.py
|
|
|
|
Solution B:
The prefix will be completed for incompatible files
Example:
File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow/contrib/operators/aws_sns_publish_operator.py
File /airflow/contrib/operators/dataproc_operator.py should be moved to /airflow/contrib/operators/gcp_dataproc_operator.py
Operators that integrate with two services will not change.
Solution C:
This solution has been reported by ashb
The prefix can be completely dropped. Major provider will get their own sub-module, which will contain all types of resources.
This change forces the adoption of a solution A from Case #1 airflow.contrib.{resources} at the same time.
The package format will look like this:.
airflow/integration/{supplier}/{resource_type}/bigtable_operator.py
Advantages | Disadvantages |
This way the integration package contains everything from a supplier and you won’t have multiple same supplier packages for hooks, operators, macros, etc. Moreover it would be simpler to move such an integration to a separate repository. (See AIP-8) | - |
Solution D:
The prefix can be completely dropped. Major provider will get their own sub-module, which will contain all types of resources. Other operators will be moved to one module (ex. core).
This change forces the adoption of a solution A from Case #1 airflow.contrib.{resources} at the same time.
The package format will look like this:.
airflow_integration/{resource_type}/gcp_bigtable_operator.py
Example:
File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow_integration/aws/operators/aws_sns_publish_operator.py
File /airflow/operators/bash_operator.py should be moved to /airflow_integration/core/bash_operator.py
Case #4 Separate namespace for resources
Namespaces are one honking great idea -- let's do more of those!
Tim Peters, The Zen of Python
Note - we do not move the namespaces out. It's mereWe can create a new namespace for all resources. We will not take advantage of all the possibilities that it offers, but in the future it will be possible to easily switch to a separate package for group of resource.
This solution should also be considered taking into account the acceptance of solution D from Case #3 {aws/azure/gcp}_*
Example of a project that uses a separate namespace: https://github.com/googleapis/google-cloud-python
Note: This change does not introduce separated packages for group of resources. It tries to retain only compatibility. Details are available: AIP-8 Split Hooks/Operators into Separate Packages by Tim Swast.
The package format will look like this:.
airflow_resources/{category}/{resource_type}/bigtable_operator.py
Solution #A:
We can introduce namespaces.
|
|
|
|
Solution #B:
We reject introduction namespaces.
Advantages and disadvantages are analogous to solution A
Note that grouping remains as if in namespaces (but this is merely a package not a separate namespace),
Case #5 *Operator
Class name does not need suffix "Operator"
Solution A:
We can delete the suffix “Operator” from the class name
Example:
Class GcpTransferServiceJobDeleteOperator should be renamed to GcpTransferServiceJobDelete.
|
|
|
|
Solution B:
No change
Advantages and disadvantages are analogous to solution A.
Case #6 Other isolated cases
There are other random cases of inconsistencies in the naming of classes. It is necessary to review the list of all classes and prepare a plan of change. Support from major cloud service providers will be useful.
For example:
Google Dataproc operators:
Current name | Proposition of new name |
DataProcHadoopOperator | DataprocHadoopOperator |
DataProcHiveOperator | DataprocHiveOperator |
DataProcPigOperator | DataprocPigOperator |
DataProcPySparkOperator | DataprocPySparkOperator |
DataProcSparkOperator | DataprocSparkOperator |
DataProcSparkSqlOperator | DataprocSparkSqlOperator |
DataprocClusterCreateOperator | No change |
DataprocClusterDeleteOperator | No change |
DataprocClusterScaleOperator | No change |
DataprocWorkflowTemplateBaseOperator | No change |
DataprocWorkflowTemplateInstantiateInlineOperator | No change |
DataprocWorkflowTemplateInstantiateOperator | No change |
GoogleCloudStorageToS3Operator | GcsToS3Operator |
This document does not analyze such cases. It can be one area of analysis by other groups of people ex. employees of the largest cloud service providers.
Any such change must be considered individually when accepting pull requests. Each change must be consistent with the recommendations made after voting on the changes in this document.
Implementation considerations
Case #7
Developer perspective - source code, and console view from both methods is available: https://imgur.com/a/mRaWpQm
Repository with samples: https://github.com/mik-laj/airflow-deprecation-sample
Solution #A native python
Advantages | Disadvantages |
Its supported by IDE. More flexible - we can add a link to the documentation | Files must exist in the project - temporary mess. More code in the project (226 characters. vs 78 character = +189%). |
Sample PR: https://github.com/apache/airflow/pull/4667
Solution #B zope.deprecation/sys.modules hack
Solution proposed by @ashb
|
|
|
|
Executive considerations
We can introduce the proposed changes in two ways:
- as one commit;
- as many commits for each group of operators;
The first method will be faster to perform, but finding one bug (if it would appear) in such a patch will be very difficult. The introduced change should, therefore, be made a series of corrections.
Each change should contain one commit. Each PR and commit should be described in the format: “[AIRFLOW-XXX]”
Reference
fenglu@google.com. 2018. GCP Service Airflow Integration Guide. [ONLINE] Available at: https://lists.apache.org/thread.html/e8534d82be611ae7bcb21ba371546a4278aad117d5e50361fd8f14fe@%3Cdev.airflow.apache.org%3E. [Accessed 8 February 2019].