Status
Motivation
During the project's four years of development, the community has made many decisions about the structure and principles of module creation. Some of those decisions were made by Airbnb and no longer apply. Other recommendations were rejected because they gained little recognition within the community.
The first chapter discusses the problems that concern modules. It is written from a general perspective: imagine that the project consists only of packages and modules, and that specific lines of code do not matter. The next chapter discusses the implementation details and ways of introducing the proposed improvements from the perspective of the source code; here, the full content of the files matters. The chapter “Executive considerations” discusses how to introduce these rules into our codebase, and should be read from the perspective of the repository. At the end of this document, there are conclusions and a summary.
The chapters containing considerations are divided into specific cases, and for each case possible solutions are discussed. The included examples serve to illustrate specific cases and solutions. In the real world many problems overlap, so the solutions and examples also overlap. To keep the examples readable, each one is limited to the case at hand.
This document assumes you are already familiar with the Airflow codebase. It may change over time based on feedback.
Whenever I write “resources”, I mean operators, hooks, sensors, or any other piece of code that is specific to an integration.
All proposed solutions are backwards compatible.
Architectural considerations
This chapter is based on widely accepted rules and shows cases in which these rules are not followed. I will also present ideas for improving these principles.
Case #1 airflow.contrib.{resources}
There should be one-- and preferably only one --obvious way to do it.
Tim Peters, The Zen of Python
Currently, resources are located in two places:
airflow.{resource_type}
airflow.contrib.{resource_type}
The first location contains resources that were originally maintained by Airbnb. However, the project has been transferred to Apache and Airbnb is no longer responsible for their maintenance; the community maintains them. The second location contains resources that have been maintained by the community from the beginning. Currently, all new resources are added only to the second location, and changes to and development of the first location are strictly limited.
There is currently no reason for this type of division. All resources should be in one place.
Solution A:
We should move all the resources from the second location (contrib) to the first. All resources will then be located in airflow.{hooks/operators/sensors/example_dags}.
Advantages | Disadvantages |
- resources are located in one place (and one place only). No need to check multiple locations for docs, for example. - no confusion for new contributors about whether their work needs to be managed differently. (New contributors shouldn’t wonder if there is a difference between their work and non-contrib work, because there shouldn’t be one!) | - resources moved from contrib to core have to be tested before being moved. Outdated hooks/operators need to be updated or removed. Unit tests need to be added for all of them if they don’t already exist. |
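As an illustration only, a DAG author's import would change roughly as follows under Solution A (SSHOperator is used here merely as an example of a resource that currently lives in contrib; the exact target module also depends on the other cases below):

# before: the resource lives in the contrib location
from airflow.contrib.operators.ssh_operator import SSHOperator

# after Solution A: the same resource lives in the single core location
from airflow.operators.ssh_operator import SSHOperator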
Solution B:
Move all the well-tested and maintained resources to the core; for example, the GCP resources are well tested and well documented. All new resources would first be added to the contrib folder and, once they reach “maturity”, they can be moved to the core. We need to define what that maturity means. Contrib resources would be analogous to beta features in a product. We should also consider renaming "contrib" to "incubator" in this situation.
Advantages | Disadvantages |
- resources in core can be trusted by users, and contributors take full responsibility for those resources. - resources in contrib are subject to change. | - resources need to be maintained in two places, which can be a source of confusion for new contributors. |
Solution C:
No change.
Case #2 *_{operator/sensor}{/s}.py
Currently, the import takes the following format:
airflow{.contrib/}.operators.*_operator
There is information redundancy here: there is no need to use the word "operator" twice.
It is worth mentioning that the word “operator” also appears in the class name.
Solution A:
The import should take the following format:
airflow{.contrib/}.operators.*
The suffix “_operator” should be dropped.
Example:
File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp_bigtable.py.
Advantages | Disadvantages |
- Shorter name, but still focusing on the essential task of the class (no information loss) | - |
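For illustration, the corresponding import would change as follows under Solution A (the BigtableInstanceCreateOperator class name is used here only as an example):

# before: the module name repeats the word "operator"
from airflow.contrib.operators.gcp_bigtable_operator import BigtableInstanceCreateOperator

# after Solution A: the "_operator" suffix is dropped from the module name
from airflow.contrib.operators.gcp_bigtable import BigtableInstanceCreateOperator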
Solution B:
No change
Advantages and disadvantages are analogous to solution A.
Case #3 {aws/azure/gcp}_*
With the development of integrations for the largest cloud providers, a large portion of new files received a prefix assigned to each provider. For example, for Google Cloud Platform it is "gcp". Google has even mentioned this practice in its official recommendations [1]. Not all files follow this rule. Ansible also uses a similar structure.
Solution A:
The prefix can be completely dropped. Each major provider can get a separate sub-module for each type of resource.
Operators that integrate with two services will not change.
Example:
File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp/bigtable_operator.py.
The package format will look like this:
airflow/{contrib/}{resource_type}/{supplier/}bigtable_operator.py
Advantages | Disadvantages |
- shorter name, but still focusing on the essential task of the class (no information loss) - users only need to look at their _supplier_ package instead of lots of other suppliers’ services at once (most users probably use only one supplier at a time); this could also speed up navigating the documentation, depending on how it is structured | - |
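For illustration, an import under this layout might look as follows (class name used only as an example):

# before: provider prefix encoded in the file name
from airflow.contrib.operators.gcp_bigtable_operator import BigtableInstanceCreateOperator

# after Solution A: the provider prefix becomes a sub-package of the resource type
from airflow.contrib.operators.gcp.bigtable_operator import BigtableInstanceCreateOperator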
Solution B:
The prefix will be added to files that do not yet follow this convention.
Example:
File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow/contrib/operators/aws_sns_publish_operator.py.
File /airflow/contrib/operators/dataproc_operator.py should be moved to /airflow/contrib/operators/gcp_dataproc_operator.py
Operators that integrate with two services will not change.
Solution C:
This solution was proposed by ashb.
The prefix can be completely dropped. Each major provider will get its own sub-module, which will contain all types of resources.
This change forces the adoption of Solution A from Case #1 airflow.contrib.{resources} at the same time.
The package format will look like this:
airflow/integration/{supplier}/{resource_type}/bigtable_operator.py
Advantages | Disadvantages |
This way the integration package contains everything from a supplier, and you won’t have multiple packages for the same supplier across hooks, operators, macros, etc. Moreover, it would be simpler to move such an integration to a separate repository (see AIP-8). | - |
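For illustration, an import under Solution C might look like this (the "integration" package name follows the format above; the class name is used only as an example):

# after Solution C: one sub-package per supplier, containing all resource types
from airflow.integration.gcp.operators.bigtable_operator import BigtableInstanceCreateOperator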
Solution D:
The prefix can be completely dropped. Each major provider will get its own sub-module, which will contain all types of resources. Other operators will be moved to one module (e.g. core).
This change forces the adoption of Solution A from Case #1 airflow.contrib.{resources} at the same time.
The package format will look like this:
airflow_integration/{resource_type}/gcp_bigtable_operator.py
Example:
File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow_integration/aws/operators/aws_sns_publish_operator.py
File /airflow/operators/bash_operator.py should be moved to /airflow_integration/core/bash_operator.py
Case #4 Separate namespace for resources
Namespaces are one honking great idea -- let's do more of those!
Tim Peters, The Zen of Python
We can create a new namespace for all resources. We will not take advantage of all the possibilities it offers, but in the future it will be possible to easily switch to a separate package per group of resources.
This solution should also be considered together with the acceptance of Solution D from Case #3 {aws/azure/gcp}_*.
Example of a project that uses a separate namespace: https://github.com/googleapis/google-cloud-python
Note: This change does not introduce separate packages for groups of resources. It only aims to retain compatibility with such a split. Details are available in AIP-8 Split Hooks/Operators into Separate Packages by Tim Swast.
The package format will look like this:
airflow_resources/{category}/{resource_type}/bigtable_operator.py
Solution A:
We can introduce namespaces.
Advantages | Disadvantages |
We will avoid changing import paths in the future | - |
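A minimal sketch of how such a namespace could be declared, assuming a pkgutil-style namespace package is used (whether to use pkgutil-style or PEP 420 implicit namespace packages would still need to be decided):

# airflow_resources/__init__.py
# Declare airflow_resources as a namespace package so that groups of
# resources can later be shipped as separate distributions sharing this prefix.
__path__ = __import__('pkgutil').extend_path(__path__, __name__)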
Solution B:
We reject the introduction of namespaces.
Advantages and disadvantages are analogous to solution A.
Case #5 *Operator
The class name does not need the suffix "Operator".
Solution A:
We can delete the suffix “Operator” from the class name.
Example:
Class GcpTransferServiceJobDeleteOperator should be renamed to GcpTransferServiceJobDelete.
Advantages | Disadvantages |
- Shorter name, but still focusing on the essential task of the class (no information loss) | - |
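Because all proposed solutions are meant to be backwards compatible, the rename could keep a deprecated alias of the old class name. A minimal sketch of one way to do this (the warning text and the subclass approach are assumptions, not part of the proposal; the deprecation mechanism itself is discussed in Case #7):

import warnings

from airflow.models import BaseOperator


class GcpTransferServiceJobDelete(BaseOperator):
    """New, suffix-free class name (body omitted)."""


class GcpTransferServiceJobDeleteOperator(GcpTransferServiceJobDelete):
    """Deprecated alias kept so existing DAGs keep working."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "GcpTransferServiceJobDeleteOperator has been renamed to "
            "GcpTransferServiceJobDelete.",
            DeprecationWarning, stacklevel=2,
        )
        super(GcpTransferServiceJobDeleteOperator, self).__init__(*args, **kwargs)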
Solution B:
No change
Advantages and disadvantages are analogous to solution A.
Case #6 Other isolated cases
There are other random cases of inconsistencies in the naming of classes. It is necessary to review the list of all classes and prepare a plan of change. Support from major cloud service providers will be useful.
For example:
Google Dataproc operators:
Current name | Proposed new name |
DataProcHadoopOperator | DataprocHadoopOperator |
DataProcHiveOperator | DataprocHiveOperator |
DataProcPigOperator | DataprocPigOperator |
DataProcPySparkOperator | DataprocPySparkOperator |
DataProcSparkOperator | DataprocSparkOperator |
DataProcSparkSqlOperator | DataprocSparkSqlOperator |
DataprocClusterCreateOperator | No change |
DataprocClusterDeleteOperator | No change |
DataprocClusterScaleOperator | No change |
DataprocWorkflowTemplateBaseOperator | No change |
DataprocWorkflowTemplateInstantiateInlineOperator | No change |
DataprocWorkflowTemplateInstantiateOperator | No change |
GoogleCloudStorageToS3Operator | GcsToS3Operator |
This document does not analyze such cases. They can be an area of analysis for other groups, e.g. employees of the largest cloud service providers.
Any such change must be considered individually when accepting pull requests. Each change must be consistent with the recommendations made after voting on the changes in this document.
Implementation considerations
Case #7 Deprecation method
A developer's perspective (source code and console view) for both methods is available at: https://imgur.com/a/mRaWpQm
Repository with samples: https://github.com/mik-laj/airflow-deprecation-sample
Solution A: native Python
Advantages | Disadvantages |
It is supported by IDEs. More flexible - we can add a link to the documentation. | Files must exist in the project - temporary mess. More code in the project (226 characters vs. 78 characters = +189%). |
Sample PR: https://github.com/apache/airflow/pull/4667
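A minimal sketch of what such a native-Python shim module might look like (module and class names are illustrative; see the sample PR above for the actual implementation):

# airflow/contrib/operators/gcp_bigtable_operator.py - deprecated shim module
"""This module is deprecated. Please use airflow.contrib.operators.gcp_bigtable."""
import warnings

# Re-export the moved classes so the old import path keeps working.
from airflow.contrib.operators.gcp_bigtable import BigtableInstanceCreateOperator  # noqa

warnings.warn(
    "This module is deprecated. Please use "
    "`airflow.contrib.operators.gcp_bigtable`.",
    DeprecationWarning, stacklevel=2,
)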
Solution B: zope.deprecation/sys.modules hack
Solution proposed by @ashb
Advantages | Disadvantages |
Less boilerplate code. | It is NOT supported by IDEs. Files must exist in the project - temporary mess. No configuration options. |
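A minimal sketch of the zope.deprecation variant, assuming zope.deprecation.moved is used to populate the old module from the new location (module names illustrative):

# airflow/contrib/operators/gcp_bigtable_operator.py - deprecated shim module
from zope.deprecation import moved

# Copy everything from the new module into this one and emit a
# DeprecationWarning when the old path is imported.
moved('airflow.contrib.operators.gcp_bigtable', 'Airflow 2.0')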
Executive considerations
We can introduce the proposed changes in two ways:
- as one commit;
- as many commits, one for each group of operators.
The first method will be faster to carry out, but finding a bug (if one appears) in such a large patch will be very difficult. The change should therefore be introduced as a series of smaller patches.
Each change should be contained in a single commit. Each PR and commit should be described in the format “[AIRFLOW-XXX]”.
Summary of the proposal
Choice | What to do with the "contrib" folder | Drop modules' *_operator suffix | Separate out modules' cloud provider prefixes (gcp/aws/azure) into packages | Introduce separate namespaces for different resources | *Operator / *Sensor / *Hook in class name | Isolated cases | Deprecation method |
---|---|---|---|---|---|---|---|
A | Move everything "contrib" → "airflow" | Drop the suffix. Example: gcp_bigtable_operator.py → gcp_bigtable.py | Drop the prefix; the cloud provider prefix becomes a Python package. Example: gcp_bigtable_operator.py → gcp/bigtable_operator.py | Group operators/hooks/sensors into related groups (for example per cloud provider). The idea is that each namespace can be maintained separately in the future as a separate project, as in AIP-8 Split Hooks/Operators into Separate Packages. This can be the same as Case #3 Solution D, assuming the resources are grouped per type. | Remove the Operator suffix from the class name. Example: GcpTransferServiceJobDeleteOperator → GcpTransferServiceJobDelete | Make individual decisions about renames for operators that do not follow the common conventions used for other operators. Consistency trumps compatibility. Example: DataProcHadoopOperator renamed to DataprocHadoopOperator | Native Python method (better IDE support and more flexible, but a bit more verbose) |
B | Move well-tested code "contrib" → "airflow"; rename "contrib" to "incubator" for less well-tested code | No change | Add prefixes where they are missing. Examples: sns_publish_operator.py → aws_sns_publish_operator.py, dataproc_operator.py → gcp_dataproc_operator.py | No namespace introduction - no plans to separate out groups of packages | No change - keep the Operator/Sensor suffix in the class name | Avoid renaming operators, for better backwards compatibility | Use zope.deprecation (less IDE support, less verbose, less flexible) |
C | No change | | Drop the prefix; move all the operators/sensors/hooks to per-provider packages. Example: gcp_bigtable_operator.py → gcp/operators/bigtable.py | | | | |
D | | | Drop the prefix; move all the integrations to either a cloud-provider-specific package or core | | | | |
Voting
Feel free to add your votes below:
Person | Binding | What to do with the "contrib" folder | Drop modules *_operator suffix | Separate out module's Cloud Provider prefixes (gcp/aws/azure) to packages | Introduce separate namespaces for different resources | *Operator *Sensor *Hook in class name | Isolated cases | Deprecation method |
---|---|---|---|---|---|---|---|---|
| Yes | A: Move everything "contrib" → "airflow" | A. gcp_bigtable_operator.py → gcp_bigtable.py | C. gcp_bigtable_operator.py → gcp/operators/bigtable.py | B. No namespace introduction. | B. No changes. Keep *Operator *Sensor *Hook in class name. | A. Rename isolated cases for consistency. | A. Native Python with better IDE integration. |
| Yes | A | A | No opinion | B - it's clearer at the call site (task = XOperator() vs task = X()) | No opinion | No strong opinion | |
| Yes | A | A | B | B | B | A | A |
| No | | | | | | | |
Daniel Standish | No | | | | | | | |
| Yes | A | A | C | A | B | A | A |
| No | | | | | | | |
If there are any strong "vetoes" on any of the answers, please record them here with justification:
Person | Binding | Case 1 What to do with the "contrib" folder | Case 2 Drop modules *_operator suffix | Case 3 Separate out module's Cloud Provider prefixes (gcp/aws/azure) to packages | Case 4 Introduce separate namespaces for different resources | Case 5 *Operator *Sensor *Hook in class name | Case 6 Isolated cases | Case 7 Deprecation method |
---|---|---|---|---|---|---|---|---|
Vetoing A with the airflow_resources prefix, but no objection to an airflow.gcp.operator.x grouping. |
The Voting mechanism:
Voting will take place until Tuesday 6 pm CEST (5 pm BST, 9 am PST).
After voting, a final consistent set of choices will be determined, taking into account the majority of binding votes, potential vetoes, and consistency between the choices. Non-binding votes will be taken into account in case of a draw. The final set of choices will be announced on the devlist thread after the voting completes.
Unless a veto is raised against the final proposal, it will be accepted by lazy consensus on Friday at 6 pm CEST (5 pm BST, 9 am PST).
Reference
fenglu@google.com. 2018. GCP Service Airflow Integration Guide. [ONLINE] Available at: https://lists.apache.org/thread.html/e8534d82be611ae7bcb21ba371546a4278aad117d5e50361fd8f14fe@%3Cdev.airflow.apache.org%3E. [Accessed 8 February 2019].
...