During the 4-year project development, the community made many decisions about the structure and principles of module creation. Some decisions have been made by Airbnb and they no longer apply. Other recommendations were rejected because of the low recognition of the community.
The first chapter discusses the problems that concern modules. This chapter is written from a general perspective. Imagine that there are only packages and modules in the project. Specific lines of code do not matter. The next chapter discusses the implementation details and ways of introducing proposed improvements from the perspective of the source code. Here is the important full content of the files. Chapter “Executive considerations” discusses how to introduce these rules into our codebase. You should look at this chapter from the perspective of the repository. At the end of this document, there are conclusions and a summary.
Chapters containing considerations are divided into specific cases. Each case has a possible solution discussed. Included examples serve to show specific cases and solutions In the real world, many problems overlap, so the solutions and examples also overlap. The examples try to be readable for this purpose are limited only to a given case.
This document assumes you are already familiar with Airflow codebase and may change over time based on feedback.
Every time I write resources, I mean operators, hooks, sensors or another piece of code that is specific to an integration.
This AIP has gone through many changes
This AIP has gone through many changes and it might be confusing trying to get the final conclusions from the sequence of events (it has original voting + set of subsequent updates to it).
D. Group operators/sensors/hooks inairflow/providers/<PROVIDER>/operators(sensors, hooks).
Each provider can define its own internal structure of that package. For example in case of "google" provider the packages will be further grouped by "gcp", "gsuite", "core" sub-packages.
In case of transfer operators where two providers are involved, the transfer operators will be moved to "source" of the transfer. When there is only one provider as target but source is a database or another non-provider source, the operator is put to the target provider.
Non-cloud provider ones are moved to airflow/operators(sensors/hooks). Drop the prefix.
During implementation of AIP-23 we found that the original decision about grouping operators was not the best and did not cover all the scenarios. Therefore we updated the rules as follows:
Grouping by cloud provider should be done in "airflow/providers" package (previously it was directly in "airflow"
Each provider can have different internal structure, potentially grouping the operators in sub-packages. For example in case of "google" provider the packages will be further grouped by "gcp", "gsuite", "core" sub-packages.
In case of transfer operators where two providers are involved, the transfer operators will be moved to "source" (NOTE it's been changed to "target" in subsequent Update) of the transfer. When there is only one provider as target but source is a database or another non-provider source, the operator is put to the target provider.
Update to include 1.10.* backportability and details about non-cloud-providers package (2019-11-16).
In the light of coming Airflow 2.0 release the community decided there is a need to make it easier to migrate from Airflow 1.10 to the upcoming 2.0 release. Airflow 2.0 is - by definition - not backwards-compatible with 1.10.* series. There are a number of incompatibilities that are introduced - in core, database, concepts but also in parameters of a number of operators integrating with external services/software. DAGs written for Airflow 1.10.* might not work out-of-the-box in Airflow 2.0. We have not yet figured out if we are going to provide some automated migration, but we can provide a mechanism to switch to Airflow 2.0 "provider" set of operators and hooks even when still running Airflow 1.10. That can make migration process easier as organisation doing the migration might do it in steps. There are some organisations that still use python 2 even though Airflow 2.0 supports only python 3.5+ (possibly 3.6+ in the final 2.0 release). We figured out that most of the new/updated operators to be released in airflow 2.0 which were not cherry-picked to Airflow 1.10 are still runnable in Airflow 1.10. and you should be able to start using those operators in Airflow 1.10 in parallel to old operators. We decided to move most of the non-core operators to new packages (all inside "providers" package and release them as separate packages that will be installable in Python 3.5+ Airflow 1.10* releases). POC for that is available here:
Switching to the new "providers" operators can be mostly automated and it can be done incrementally for the DAGs a company has (we can provide some scripts for that). Each of the steps can be done separately in it's own pace.
This will make it easier for companies to move to Airflow 2.0 as well as it might provide an early testing ground for all the operators/hooks/sensors which are only present in Airflow 2.0 and have incompatible changes.
It is based on widely accepted rules, and also shows cases when these rules are not followed. I will also show ideas for improving these principles.
Case #1 airflow.contrib.{resources}
There should be one-- and preferably only one --obvious way to do it.
Tim Peters, The Zen of Python
Currently, resources are located in two places:
airflow.{resource_type}
airflow.contrib.{resource_type}
In the first place are resources that were originally maintained by Airbnb. However, they have been transferred to Apache and Airbnb is not responsible for their maintenance. The community is responsible for maintaining them. In the second place are operators that are maintained by the community from the beginning until now. Currently, all new resources are added only to the second place. The changes and development of the first place are strictly limited.
There is currently no reason for this type of division. All resources should be in one place.
Solution A:
We should move all the resources from the first place to the second. All resources will be located in airflow.{hooks/operators/sensors/example_dags}.
Advantages
Disadvantages
- resources are located in one place (and one place only). No need to check multiple locations for docs for example.
- no confusion for new contributors whether their work needs to be managed differently. (New contributors shouldn’t wonder if there is a difference between their work and non-contrib work. Because there shouldn’t be one!)
- resources moved from contrib to core has to be tested before moved. Outdated hooks/operators need to be updated or removed. Unit tests for all need to be added if it doesn’t already exists.
Solution B:
Move all the well-tested and maintained resources to the core for e.g GCP resources are well-tested with good documentation. All the new resources need to be first added to contrib folder and once they reach “maturity” they can be moved to core. We need to define what is that maturity. Contrib resources would be analogous to beta features in a product. We should also consider changing the words "contrib" to "incubator" in this situation.
Advantages
Disadvantages
- resources in core can be trusted by people and contributors take full responsibility of those resources.
- resources in contrib are subject to change.
- resources needs to be maintained at 2 places and can be a source of confusion for new contributors.
Solution C:
No change.
Case #2 git *_{operator/sensor}{/s}.py
Currently, the import takes the following format:
airflow{.contrib/}.operators.*_operator
There is information redundancy here. There is no need to use the word "operator" twice
It is worth mentioning that the word “operator” also appears in the class name
Solution A:
The import should take the following format:
airflow{.contrib/}.operators.*
Suffix “_operator” should be dropped
Example: File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp_bigtable.py.
Advantages
Disadvantages
- Shorter name, but still focussing on the essential task of the class (no information loss)
-
Solution B:
No change
Advantages and disadvantages are analogous to solution A.
Case #3 {aws/azure/gcp}_*
With the development of integration for the largest cloud providers, a large part of new files received a prefix, which is assigned to each of them. For example, for Google Cloud Platform it is "gcp". Google mentioned the practice even in official recommendations[1]. Not all files follow this rule. Ansible also uses similar structure.
Solution A:
The prefix can be completely dropped. Major provider can get a separate sub-module for each type of resource.
Operators that integrate with two services will not change.
Example: File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp/bigtable_operator.py.
The package format will look like this: airflow/{contrib/}{resource_type}/{supplier/}bigtable_operator.py
Advantages
Disadvantages
- shorter name, but still focussing on the essential task of the class (no information loss)
- users only need to look at their _supplier_ package instead of lot of other _supplier_‘s services at once. (Most users probably use only one supplier at a time)
(This could also speed up navigating through the documentation for users depending on how the documentation is structured)
it’s a bit easier to find files when the file name contains relevant gcp_* for example in most IDE’s. This is however very weak argument as most of the IDEs will also support gcp/* as prefix when looking for a file
Solution B:
The prefix will be completed for incompatible files
Example: File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow/contrib/operators/aws_sns_publish_operator.py File /airflow/contrib/operators/dataproc_operator.py should be moved to /airflow/contrib/operators/gcp_dataproc_operator.py
Operators that integrate with two services will not change.
Solution C:
This solution has been reported by ashb
The prefix can be completely dropped. Major provider will get their own sub-module, which will contain all types of resources.
The package format will look like this:. airflow/integration/{supplier}/{resource_type}/bigtable_operator.py
Advantages
Disadvantages
This way the integration package contains everything from a supplier and you won’t have multiple same supplier packages for hooks, operators, macros, etc.
Moreover it would be simpler to move such an integration to a separate repository. (See AIP-8)
-
Solution D:
The prefix can be completely dropped. Major provider will get their own sub-module, which will contain all types of resources. Other operators will be moved to one module (ex. core).
The package format will look like this:. airflow_integration/{resource_type}/gcp_bigtable_operator.py
Example: File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow_integration/aws/operators/aws_sns_publish_operator.py File /airflow/operators/bash_operator.py should be moved to /airflow_integration/core/bash_operator.py
Case #4 Separate namespace for resources
Namespaces are one honking great idea -- let's do more of those!
Tim Peters, The Zen of Python
Note - we do not move the namespaces out. It's mereWe can create a new namespace for all resources. We will not take advantage of all the possibilities that it offers, but in the future it will be possible to easily switch to a separate package for group of resource.
This solution should also be considered taking into account the acceptance of solution D from Case #3 {aws/azure/gcp}_*
Note: This change does not introduce separated packages for group of resources. It tries to retain only compatibility. Details are available: AIP-8 Split Hooks/Operators into Separate Packages by Tim Swast.
The package format will look like this:. airflow_resources/{category}/{resource_type}/bigtable_operator.py
Solution #A:
We can introduce namespaces.
Advantages
Disadvantages
We will avoid changing import paths in the future
-
Solution #B:
We reject introduction namespaces.
Advantages and disadvantages are analogous to solution A
Note that grouping remains as if in namespaces (but this is merely a package not a separate namespace),
Case #5 *Operator
Class name does not need suffix "Operator"
Solution A:
We can delete the suffix “Operator” from the class name
Example: Class GcpTransferServiceJobDeleteOperator should be renamed to GcpTransferServiceJobDelete.
Advantages
Disadvantages
- Shorter name, but still focussing on the essential task of the class (no information loss)
-
Solution B:
No change
Advantages and disadvantages are analogous to solution A.
Case #6 Other isolated cases
There are other random cases of inconsistencies in the naming of classes. It is necessary to review the list of all classes and prepare a plan of change. Support from major cloud service providers will be useful.
For example: Google Dataproc operators:
Current name
Proposition of new name
DataProcHadoopOperator
DataprocHadoopOperator
DataProcHiveOperator
DataprocHiveOperator
DataProcPigOperator
DataprocPigOperator
DataProcPySparkOperator
DataprocPySparkOperator
DataProcSparkOperator
DataprocSparkOperator
DataProcSparkSqlOperator
DataprocSparkSqlOperator
DataprocClusterCreateOperator
No change
DataprocClusterDeleteOperator
No change
DataprocClusterScaleOperator
No change
DataprocWorkflowTemplateBaseOperator
No change
DataprocWorkflowTemplateInstantiateInlineOperator
No change
DataprocWorkflowTemplateInstantiateOperator
No change
GoogleCloudStorageToS3Operator
GcsToS3Operator
This document does not analyze such cases. It can be one area of analysis by other groups of people ex. employees of the largest cloud service providers.
Any such change must be considered individually when accepting pull requests. Each change must be consistent with the recommendations made after voting on the changes in this document.
Implementation considerations
Case #7
Developer perspective - source code, and console view from both methods is available: https://imgur.com/a/mRaWpQm
We can introduce the proposed changes in two ways:
as one commit;
as many commits for each group of operators;
The first method will be faster to perform, but finding one bug (if it would appear) in such a patch will be very difficult. The introduced change should, therefore, be made a series of corrections.
Each change should contain one commit. Each PR and commit should be described in the format: “[AIRFLOW-XXX]”
Group operators/sensors/hooks in airflow/operators(sensors, hooks)/<PROVIDER>.Non-cloud provider ones are moved to airflow/operators(sensors/hooks). Drop the prefix.
No change - keep Operator/Sensor suffix in class name.
Avoid renaming operators due to better backwards compatibility.
Use zope.deprecation (less IDE support, less verbose, less flexibility)
C
No change
Group operators/sensors/hooks in airflow/operators(sensors, hooks)/<PROVIDER>. Non-cloud provider ones are moved to airflow/operators(sensors/hooks). Keep the prefix.
Group operators/sensors/hooks inairflow/<PROVIDER>/operators(sensors, hooks). Non-cloud provider ones are moved to airflow/operators(sensors/hooks). Drop the prefix.
Group operators/sensors/hooks inairflow/<PROVIDER>/operators(sensors, hooks). Non-cloud provider ones are moved to airflow/operators(sensors/hooks).Keep the prefix.
It will make it more organised. Anyone willing to find out Airflow support for a particular cloud provider would just need to look into a single folder.
Voting will take place till Tuesday 6pm CEST (5 pm BST, 9am PST) .
After the choice, the final consistent set of choices will be announced (taking into account majority of binding vote, also including potential vetos and consistency between the choices. Non-binding votes will be taken into account in case there is a draw. The final set of choices will be announced at devlist thread after the voting completes.
Unless there is a veto raised to the final proposal, the final proposal is accepted by Lazy Consensus on Friday at 6pm CEST (5 pm BST, 9am PST).