You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 2 Next »

Status

StateDraft
Discussion Thread
c880ef89f8cb4a0240c404f9372615b998c4a4eeca342651927d596c
JIRA

4733

Motivation

During the 4-year project development, the community made many decisions about the structure and principles of module creation. Some decisions have been made by Airbnb and they no longer apply. Other recommendations were rejected because of the low recognition of the community.  

The first chapter discusses the problems that concern modules. This chapter is written from a general perspective. Imagine that there are only packages and modules in the project. Specific lines of code do not matter. The next chapter discusses the implementation details and ways of introducing proposed improvements from the perspective of the source code. Here is the important full content of the files. Chapter “Executive considerations” discusses how to introduce these rules into our codebase. You should look at this chapter from the perspective of the repository. At the end of this document, there are conclusions and a summary.

Chapters containing considerations are divided into specific cases. Each case has a possible solution discussed. Included examples serve to show specific cases and solutions In the real world, many problems overlap, so the solutions and examples also overlap. The examples try to be readable for this purpose are limited only to a given case.

This document assumes you are already familiar with Airflow codebase and may change over time based on feedback.

Every time I write resources, I mean operators, hooks, sensors or another piece of code that is specific to an integration.

All proposed solutions are backwards compatible.

Considerations

Architectural considerations

It is based on widely accepted rules, and also shows cases when these rules are not followed. I will also show ideas for improving these principles.

Case #1 airflow.contrib.{resources}

There should be one-- and preferably only one --obvious way to do it.

Tim Peters, The Zen of Python

Currently, resources are located in two places:

airflow.{resource_type}

airflow.contrib.{resource_type}

In the first place are resources that were originally maintained by Airbnb. However, they have been transferred to Apache and Airbnb is not responsible for their maintenance. The community is responsible for maintaining them. In the second place are operators that are maintained by the community from the beginning until now. Currently, all new resources are added only to the second place. The changes and development of the first place are strictly limited.

There is currently no reason for this type of division. All resources should be in one place.

Solution A:

We should move all the resources from the first place to the second. All resources will be located in airflow.{hooks/operators/sensors/example_dags}.

Advantages

Disadvantages

- resources are located in one place (and one place only). No need to check multiple locations for docs for example.

- no confusion for new contributors whether their work needs to be managed differently. (New contributors shouldn’t wonder if there is a difference between their work and non-contrib work. Because there shouldn’t be one!)

- resources moved from contrib to core has to be tested before moved. Outdated hooks/operators need to be updated or removed. Unit tests for all need to be added if it doesn’t already exists.


Solution B:

Move all the well-tested and maintained resources to the core for e.g GCP resources are well-tested with good documentation. All the new resources need to be first added to contrib folder and once they reach “maturity” they can be moved to core. We need to define what is that maturity. Contrib resources would be analogous to beta features in a product. We should also consider changing the words "contrib" to "incubator" in this situation.

Advantages

Disadvantages

- resources in core can be trusted by people and contributors take full responsibility of those resources.

- resources in contrib are subject to change.

- resources needs to be maintained at 2 places and can be a source of confusion for new contributors.


Solution C:

No change.

Case #2 git *_{operator/sensor}{/s}.py

Currently, the import takes the following format:

airflow{.contrib/}.operators.*_operator

There is information redundancy here. There is no need to use the word "operator" twice

It is worth mentioning that the word “operator” also appears in the class name

Solution A:

The import should take the following format:

airflow{.contrib/}.operators.*

Suffix “_operator” should be dropped

Example:
File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp_bigtable.py.

Advantages

Disadvantages

- Shorter name, but still focussing on the essential task of the class (no information loss)

-


Solution B:

No change

Advantages and disadvantages are analogous to solution A.

Case #3 {aws/azure/gcp}_*

With the development of integration for the largest cloud providers, a large part of new files received a prefix, which is assigned to each of them. For example, for Google Cloud Platform it is "gcp". Google mentioned the practice even in official recommendations[1]. Not all files follow this rule. Ansible also uses similar structure.

Solution A:

The prefix can be completely dropped. Major provider can get a separate sub-module for each type of resource.

Operators that integrate with two services will not change.

Example:
File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp/bigtable_operator.py.

The package format will look like this:
airflow/{contrib/}{resource_type}/{supplier/}bigtable_operator.py

Advantages

Disadvantages

- shorter name, but still focussing on the essential task of the class (no information loss)

- users only need to look at their _supplier_ package instead of lot of other _supplier_‘s services at once. (Most users probably use only one supplier at a time)

(This could also speed up navigating through the documentation for users depending on how the documentation is structured)

  • it’s a bit easier to find files when the file name contains relevant gcp_* for example in most IDE’s. This is however very weak argument as most of the IDEs will also support gcp/* as prefix when looking for a file


Solution B:

The prefix will be completed for incompatible files

Example:
File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow/contrib/operators/aws_sns_publish_operator.py
File /airflow/contrib/operators/dataproc_operator.py should be moved to /airflow/contrib/operators/gcp_dataproc_operator.py

Operators that integrate with two services will not change.

Solution C:

This solution has been reported by ashb

The prefix can be completely dropped. Major provider will get their own sub-module, which will contain all types of resources.

This change forces the adoption of a solution A from Case #1 airflow.contrib.{resources} at the same time.

The package format will look like this:.
airflow/integration/{supplier}/{resource_type}/bigtable_operator.py

Advantages

Disadvantages

This way the integration package contains everything from a supplier and you won’t have multiple same supplier packages for hooks, operators, macros, etc.

Moreover it would be simpler to move such an integration to a separate repository. (See AIP-8)

-


Solution D:

The prefix can be completely dropped. Major provider will get their own sub-module, which will contain all types of resources. Other operators will be moved to one module (ex. core).

This change forces the adoption of a solution A from Case #1 airflow.contrib.{resources} at the same time.

The package format will look like this:.
airflow_integration/{resource_type}/gcp_bigtable_operator.py

Example:
File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow_integration/aws/operators/aws_sns_publish_operator.py
File /airflow/operators/bash_operator.py should be moved to /airflow_integration/core/bash_operator.py

Case #4 Separate namespace for resources

Namespaces are one honking great idea -- let's do more of those!

Tim Peters, The Zen of Python

We can create a new namespace for all resources. We will not take advantage of all the possibilities that it offers, but in the future it will be possible to easily switch to a separate package for group of resource.

This solution should also be considered taking into account the acceptance of solution D from Case #3 {aws/azure/gcp}_*

Example of a project that uses a separate namespace: https://github.com/googleapis/google-cloud-python

Note: This change does not introduce separated packages for group of resources. It tries to retain only compatibility. Details are available: AIP-8 Split Hooks/Operators into Separate Packages by Tim Swast.

The package format will look like this:.
airflow_resources/{category}/{resource_type}/bigtable_operator.py

Solution #A:

We can introduce namespaces.

Advantages

Disadvantages

We will avoid changing import paths in the future

-


Solution #B:

We reject introduction namespaces.

Advantages and disadvantages are analogous to solution A.

Case #5 *Operator

Class name does not need suffix "Operator"

Solution A:

We can delete the suffix “Operator” from the class name

Example:
Class GcpTransferServiceJobDeleteOperator should be renamed to GcpTransferServiceJobDelete.

Advantages

Disadvantages

- Shorter name, but still focussing on the essential task of the class (no information loss)

-


Solution B:

No change

Advantages and disadvantages are analogous to solution A.

Case #6 Other isolated cases

There are other random cases of inconsistencies in the naming of classes. It is necessary to review the list of all classes and prepare a plan of change. Support from major cloud service providers will be useful.

For example:
Google Dataproc operators:

Current name

Proposition of new name

DataProcHadoopOperator

DataprocHadoopOperator

DataProcHiveOperator

DataprocHiveOperator

DataProcPigOperator

DataprocPigOperator

DataProcPySparkOperator

DataprocPySparkOperator

DataProcSparkOperator

DataprocSparkOperator

DataProcSparkSqlOperator

DataprocSparkSqlOperator

DataprocClusterCreateOperator

No change

DataprocClusterDeleteOperator

No change

DataprocClusterScaleOperator

No change

DataprocWorkflowTemplateBaseOperator

No change

DataprocWorkflowTemplateInstantiateInlineOperator

No change

DataprocWorkflowTemplateInstantiateOperator

No change

GoogleCloudStorageToS3Operator

GcsToS3Operator


This document does not analyze such cases. It can be one area of analysis by other groups of people ex. employees of the largest cloud service providers.

Any such change must be considered individually when accepting pull requests. Each change must be consistent with the recommendations made after voting on the changes in this document.

Implementation considerations

Case #7

Developer perspective - source code, and console view from both methods is available:  https://imgur.com/a/mRaWpQm

Repository with samples: https://github.com/mik-laj/airflow-deprecation-sample

Solution #A native python

Advantages

Disadvantages

Its supported by IDE.

More flexible - we can add a link to the documentation

Files must exist in the project - temporary mess.

More code in the project (226 characters. vs 78 character = +189%).


Sample PR: https://github.com/apache/airflow/pull/4667

Solution #B zope.deprecation/sys.modules hack

Solution proposed by @ashb

Advantages

Disadvantages

Less boilerplate code.

It is NOT supported by IDE.

Files must exist in the project - temporary mess.

No configuration options


Executive considerations

We can introduce the proposed changes in two ways:

  1. as one commit;
  2. as many commits for each group of operators;

The first method will be faster to perform, but finding one bug (if it would appear) in such a patch will be very difficult. The introduced change should, therefore, be made a series of corrections.

Each change should contain one commit. Each PR  and commit should be described in the format: “[AIRFLOW-XXX]”

Reference

fenglu@google.com. 2018. GCP Service Airflow Integration Guide. [ONLINE] Available at: https://lists.apache.org/thread.html/e8534d82be611ae7bcb21ba371546a4278aad117d5e50361fd8f14fe@%3Cdev.airflow.apache.org%3E. [Accessed 8 February 2019].


  • No labels