Status

State: Completed
Discussion Thread
JIRA

AIRFLOW-4733

Created


In Release: 2.0.0

Motivation

Over four years of project development, the community has made many decisions about the structure and principles of module creation. Some decisions were made by Airbnb and no longer apply. Other recommendations were rejected because they had little recognition in the community.

The first chapter discusses the problems that concern modules. This chapter is written from a general perspective: imagine that there are only packages and modules in the project; specific lines of code do not matter. The next chapter discusses the implementation details and ways of introducing the proposed improvements from the perspective of the source code; here the full content of the files matters. The chapter "Executive considerations" discusses how to introduce these rules into our codebase; you should look at that chapter from the perspective of the repository. At the end of this document there are conclusions and a summary.

Chapters containing considerations are divided into specific cases, and each case has a possible solution discussed. The included examples serve to show specific cases and solutions. In the real world many problems overlap, so the solutions and examples also overlap. To stay readable, the examples are limited to the given case.

This document assumes you are already familiar with the Airflow codebase, and it may change over time based on feedback.

Every time I write "resources", I mean operators, hooks, sensors, or any other piece of code that is specific to an integration.


This AIP has gone through many changes

It might be confusing to try to get the final conclusions from the sequence of events (there was an original vote plus a set of subsequent updates).

However, the final agreement on the proposal has been nicely captured in https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#naming-conventions-for-provider-packages and you should treat it as the current status.




All proposed solutions are backwards compatible.

Final results (see voting below)

Case 1: What to do with the "contrib" folder

A. Move everything "contrib" → "airflow".

Case 2: Drop modules' *_operator suffix

A. Drop the suffix.

Example:

  • airflow.operator.gcp_bigtable_operator.py 
    becomes airflow.operator.gcp_bigtable.py.

Case 3 + Case 4: Grouping cloud providers' operators/sensors/hooks

D. Group operators/sensors/hooks in airflow/providers/<PROVIDER>/operators (sensors, hooks).

Each provider can define its own internal structure for that package. For example, in the case of the "google" provider, the packages are further grouped into "gcp", "gsuite", and "core" sub-packages.

In the case of transfer operators, where two providers are involved, the transfer operator is moved to the "source" of the transfer. When there is only one provider as target, but the source is a database or another non-provider source, the operator is put in the target provider.

NOTE! The above decision was updated during AIP-8 Split Providers into Separate Packages for Airflow 2.0. The rule we apply is the "maintainability" rule - i.e. transfer operators are put in the provider whose stakeholders are more likely to have an interest in maintaining them.

Non-cloud-provider ones are moved to airflow/operators (sensors/hooks). Drop the prefix.

Examples:

AWS operator:

  • airflow/contrib/operators/sns_publish_operator.py 
    becomes airflow/providers/amazon/aws/operators/sns_publish.py

Google GCP operator:

  • airflow/contrib/operators/dataproc_operator.py 
    becomes airflow/providers/google/cloud/operators/dataproc.py

Previously GCP-prefixed operator:

  • airflow/contrib/operators/gcp_bigtable_operator.py 
    becomes airflow/providers/google/cloud/operators/bigtable.py

Transfer from GCP:

  • airflow/contrib/operators/gcs_to_s3_operator.py becomes airflow/providers/google/cloud/operators/gcs_to_s3.py

MySQL to GCS:

  • airflow/contrib/operators/mysql_to_gcs_operator.py becomes airflow/providers/google/cloud/operators/mysql_to_gcs.py

SSH operator:

  • airflow/contrib/operators/ssh_operator.py 
    becomes airflow/operators/ssh.py

Case 5: *Operator/*Sensor/*Hook in class name

B. No change: keep the Operator/Sensor suffix in the class name.

Case 6: Isolated cases

A. Make individual decisions about renames for operators that do not follow the common conventions used by other operators. Consistency trumps compatibility.

Example: DataProcHadoopOperator renamed to DataprocHadoopOperator.

Case 7: Deprecation method

A. Native Python method (better IDE support and more flexible, but a bit more verbose).

 Target groups in the providers packages





Service / Transfer

Fundamentals (no change)



airflow.hooks.base_hook
airflow.hooks.dbapi_hook
airflow.models.baseoperator
airflow.sensors.base_sensor_operator
airflow.operators.branch_operator
airflow.operators.check_operator
airflow.operators.dagrun_operator
airflow.operators.dummy_operator
airflow.operators.generic_transfer
airflow.operators.latest_only_operator
airflow.operators.subdag_operator
airflow.sensors.external_task_sensor
airflow.sensors.sql_sensor
airflow.sensors.time_delta_sensor
airflow.sensors.time_sensor
airflow.contrib.sensors.weekday_sensor → move to airflow.sensors
airflow.operators.bash_operator
airflow.contrib.sensors.bash_sensor
airflow.operators.python_operator
airflow.contrib.sensors.python_sensor

providers




google




cloud

airflow.gcp.hooks.automl airflow.gcp.operators.automl
airflow.gcp.hooks.bigquery airflow.gcp.operators.bigquery
airflow.gcp.hooks.bigquery_dts airflow.gcp.operators.bigquery_dts
airflow.gcp.hooks.bigtable airflow.gcp.operators.bigtable
airflow.gcp.hooks.cloud_build airflow.gcp.operators.cloud_build
airflow.gcp.hooks.compute airflow.gcp.operators.compute
airflow.gcp.hooks.dlp airflow.gcp.operators.dlp
airflow.gcp.hooks.dataflow airflow.gcp.operators.dataflow
airflow.gcp.hooks.dataproc airflow.gcp.operators.dataproc
airflow.gcp.hooks.datastore airflow.gcp.operators.datastore
airflow.gcp.hooks.functions airflow.gcp.operators.functions
airflow.gcp.hooks.kms
airflow.gcp.hooks.kubernetes_engine airflow.gcp.operators.kubernetes_engine
airflow.gcp.hooks.mlengine airflow.gcp.operators.mlengine
airflow.gcp.hooks.cloud_memorystore airflow.gcp.operators.cloud_memorystore
airflow.providers.google.cloud.hooks.natural_language airflow.providers.google.cloud.operators.natural_language
airflow.providers.google.cloud.hooks.pubsub airflow.providers.google.cloud.operators.pubsub
airflow.gcp.hooks.spanner airflow.gcp.operators.spanner
airflow.gcp.hooks.speech_to_text airflow.gcp.operators.speech_to_text
airflow.gcp.hooks.cloud_sql airflow.gcp.operators.cloud_sql
airflow.gcp.hooks.gcs airflow.gcp.operators.gcs
airflow.gcp.hooks.cloud_storage_transfer_service airflow.gcp.operators.cloud_storage_transfer_service
airflow.gcp.hooks.tasks airflow.gcp.operators.tasks
airflow.gcp.hooks.text_to_speech airflow.gcp.operators.text_to_speech
airflow.gcp.hooks.translate airflow.gcp.operators.translate
airflow.gcp.hooks.video_intelligence airflow.gcp.operators.video_intelligence
airflow.providers.google.cloud.hooks.vision
airflow.operators.adls_to_gcs
airflow.contrib.operators.s3_to_gcs_operator, airflow.gcp.operators.cloud_storage_transfer_service
airflow.operators.bigquery_to_bigquery
airflow.operators.bigquery_to_gcs
airflow.operators.bigquery_to_mysql
airflow.operators.cassandra_to_gcs
airflow.operators.gcs_to_bq
airflow.operators.gcs_to_gcs,airflow.gcp.operators.cloud_storage_transfer_service
airflow.operators.local_to_gcs
airflow.operators.mssql_to_gcs
airflow.operators.mysql_to_gcs
airflow.operators.postgres_to_gcs
airflow.operators.sql_to_gcs
airflow.operators.gcs_to_sftp


gsuite

airflow.contrib.hooks.gdrive_hook airflow.gcp.hooks.gsheets airflow.contrib.operators.gcs_to_gdrive_operator


marketing_platform

airflow.providers.google.marketing_platform.hooks.campaign_manager airflow.providers.google.marketing_platform.operators.campaign_manager airflow.providers.google.marketing_platform.sensors.campaign_manager
airflow.providers.google.marketing_platform.hooks.display_video airflow.providers.google.marketing_platform.operators.display_video airflow.providers.google.marketing_platform.sensors.display_video
airflow.providers.google.marketing_platform.hooks.search_ads airflow.providers.google.marketing_platform.operators.search_ads airflow.providers.google.marketing_platform.sensors.search_ads

amazon




aws

airflow.contrib.hooks.aws_dynamodb_hook
airflow.contrib.hooks.aws_glue_catalog_hook
airflow.contrib.hooks.aws_logs_hook
airflow.contrib.hooks.emr_hook
airflow.contrib.hooks.sagemaker_hook
airflow.contrib.operators.ecs_operator
airflow.contrib.operators.emr_add_steps_operator
airflow.contrib.operators.emr_create_job_flow_operator
airflow.contrib.operators.emr_terminate_job_flow_operator
airflow.contrib.operators.s3_copy_object_operator
airflow.contrib.operators.s3_delete_objects_operator
airflow.contrib.operators.s3_list_operator
airflow.contrib.operators.sagemaker_base_operator
airflow.contrib.operators.sagemaker_endpoint_config_operator
airflow.contrib.operators.sagemaker_endpoint_operator
airflow.contrib.operators.sagemaker_model_operator
airflow.contrib.operators.sagemaker_training_operator
airflow.contrib.operators.sagemaker_transform_operator
airflow.contrib.operators.sagemaker_tuning_operator
airflow.contrib.sensors.aws_glue_catalog_partition_sensor
airflow.contrib.sensors.emr_base_sensor
airflow.contrib.sensors.emr_job_flow_sensor
airflow.contrib.sensors.emr_step_sensor
airflow.contrib.sensors.sagemaker_base_sensor
airflow.contrib.sensors.sagemaker_endpoint_sensor
airflow.contrib.sensors.sagemaker_training_sensor
airflow.contrib.sensors.sagemaker_transform_sensor
airflow.contrib.sensors.sagemaker_tuning_sensor
airflow.operators.s3_file_transform_operator
airflow.providers.amazon.aws.hooks.athena
airflow.providers.amazon.aws.hooks.datasync
airflow.providers.amazon.aws.hooks.kinesis
airflow.providers.amazon.aws.hooks.lambda_function
airflow.providers.amazon.aws.hooks.redshift
airflow.providers.amazon.aws.hooks.s3
airflow.providers.amazon.aws.hooks.sns
airflow.providers.amazon.aws.hooks.sqs
airflow.providers.amazon.aws.operators.athena
airflow.providers.amazon.aws.operators.batch
airflow.providers.amazon.aws.operators.datasync
airflow.providers.amazon.aws.operators.sns
airflow.providers.amazon.aws.operators.sqs
airflow.providers.amazon.aws.sensors.athena
airflow.providers.amazon.aws.sensors.redshift
airflow.providers.amazon.aws.sensors.sqs
airflow.sensors.s3_key_sensor
airflow.sensors.s3_prefix_sensor
airflow.contrib.operators.hive_to_dynamodb, airflow.operators.google_api_to_s3_transfer, airflow.operators.redshift_to_s3_operator, airflow.operators.s3_to_hive_operator, airflow.operators.s3_to_redshift_operator, airflow.contrib.operators.dynamodb_to_s3, airflow.contrib.operators.s3_to_sftp_operator, airflow.contrib.operators.sftp_to_s3_operator, airflow.operators.gcs_to_s3, airflow.contrib.operators.imap_attachment_to_s3_operator, airflow.contrib.operators.mongo_to_s3

microsoft




azure

airflow.contrib.hooks.wasb_hook airflow.contrib.operators.wasb_delete_blob_operator airflow.contrib.sensors.wasb_sensor
airflow.contrib.hooks.azure_container_instance_hook, airflow.contrib.hooks.azure_container_registry_hook, airflow.contrib.hooks.azure_container_volume_hook airflow.contrib.operators.azure_container_instances_operator
airflow.contrib.hooks.azure_cosmos_hook airflow.contrib.operators.azure_cosmos_operator airflow.contrib.sensors.azure_cosmos_sensor
airflow.contrib.hooks.azure_data_lake_hook airflow.contrib.operators.adls_list_operator
airflow.contrib.hooks.azure_fileshare_hook, airflow.contrib.operators.file_to_wasb, airflow.contrib.operators.oracle_to_azure_data_lake_transfer

apache




cassandra

airflow.contrib.hooks.cassandra_hook airflow.contrib.sensors.cassandra_record_sensor, airflow.contrib.sensors.cassandra_table_sensor


druid

airflow.hooks.druid_hook airflow.contrib.operators.druid_operator, airflow.operators.druid_check_operator
airflow.operators.hive_to_druid


hadoop

airflow.hooks.hdfs_hook airflow.sensors.hdfs_sensor,
airflow.contrib.sensors.hdfs_sensor,

airflow.hooks.webhdfs_hook airflow.sensors.web_hdfs_sensor




hive

airflow.hooks.hive_hooks airflow.operators.hive_operator, airflow.operators.hive_stats_operator airflow.sensors.named_hive_partition_sensor, airflow.sensors.hive_partition_sensor, airflow.sensors.metastore_partition_sensor
airflow.operators.mssql_to_hive, airflow.operators.s3_to_hive_operator, airflow.contrib.operators.vertica_to_hive


pig

airflow.hooks.pig_hook airflow.operators.pig_operator


pinot

airflow.contrib.hooks.pinot_hook


spark

airflow.contrib.hooks.spark_jdbc_hook,
airflow.contrib.hooks.spark_jdbc_script,airflow.contrib.hooks.spark_sql_hook,
airflow.contrib.hooks.spark_submit_hook airflow.contrib.operators.spark_jdbc_operator,
airflow.contrib.operators.spark_sql_operator,airflow.contrib.operators.spark_submit_operator




sqoop

airflow.contrib.hooks.sqoop_hook airflow.contrib.operators.sqoop_operator

mysql

airflow.operators.hive_to_mysql, airflow.contrib.operators.presto_to_mysql

jira
airflow.contrib.hooks.jira_hook airflow.contrib.operators.jira_operator
airflow.contrib.sensors.jira_sensor


databricks
airflow.contrib.hooks.databricks_hook airflow.contrib.operators.databricks_operator

datadog
airflow.contrib.hooks.datadog_hook airflow.contrib.sensors.datadog_sensor

dingding
airflow.contrib.hooks.dingding_hook airflow.contrib.operators.dingding_operator

discord
airflow.contrib.hooks.discord_webhook_hook airflow.contrib.operators.discord_webhook_operator

cloudant
airflow.contrib.hooks.cloudant_hook

jenkins
airflow.contrib.hooks.jenkins_hook airflow.contrib.operators.jenkins_job_trigger_operator

opsgenie
airflow.contrib.hooks.opsgenie_alert_hook airflow.contrib.operators.opsgenie_alert_operator

qubole
airflow.contrib.hooks.qubole_hook,airflow.contrib.hooks.qubole_check_hook airflow.contrib.operators.qubole_operator,airflow.contrib.operators.qubole_check_operator airflow.contrib.sensors.qubole_sensor

salesforce
airflow.contrib.hooks.salesforce_hook

segment
airflow.contrib.hooks.segment_hook airflow.contrib.operators.segment_track_event_operator

slack
airflow.hooks.slack_hook,airflow.contrib.hooks.slack_webhook_hook airflow.operators.slack_operator,airflow.contrib.operators.slack_webhook_operator

snowflake
airflow.contrib.hooks.snowflake_hook airflow.contrib.operators.snowflake_operator

vertica
airflow.contrib.hooks.vertica_hook airflow.contrib.operators.vertica_operator
airflow.contrib.operators.vertica_to_mysql

zendesk
airflow.hooks.zendesk_hook

celery
airflow.contrib.sensors.celery_queue_sensor

docker
airflow.hooks.docker_hook airflow.operators.docker_operator,
airflow.contrib.operators.docker_swarm_operator


kubernetes
airflow.contrib.operators.kubernetes_pod_operator

mssql
airflow.hooks.mssql_hook airflow.operators.mssql_operator

mongodb
airflow.contrib.hooks.mongo_hook airflow.contrib.sensors.mongo_sensor

mysql
airflow.hooks.mysql_hook airflow.operators.mysql_operator

openfaas
airflow.contrib.hooks.openfaas_hook

oracle
airflow.hooks.oracle_hook airflow.operators.oracle_operator
airflow.contrib.operators.oracle_to_oracle_transfer

papermill
airflow.operators.papermill_operator

postgres
airflow.hooks.postgres_hook airflow.operators.postgres_operator

presto
airflow.hooks.presto_hook airflow.operators.presto_check_operator

redis
airflow.contrib.hooks.redis_hook airflow.contrib.operators.redis_publish_operator airflow.contrib.sensors.redis_pub_sub_sensor,airflow.contrib.sensors.redis_key_sensor

samba
airflow.hooks.samba_hook
airflow.operators.hive_to_samba_operator

sqlite
airflow.hooks.sqlite_hook airflow.operators.sqlite_operator

imap
airflow.contrib.hooks.imap_hook airflow.contrib.sensors.imap_attachment_sensor

ssh
airflow.contrib.hooks.ssh_hook airflow.contrib.operators.ssh_operator

filesystem
airflow.contrib.hooks.fs_hook airflow.contrib.sensors.file_sensor

sftp
airflow.contrib.hooks.sftp_hook airflow.contrib.operators.sftp_operator airflow.contrib.sensors.sftp_sensor

ftp
airflow.contrib.hooks.ftp_hook airflow.contrib.sensors.ftp_sensor

http
airflow.hooks.http_hook airflow.operators.http_operator airflow.sensors.http_sensor

grpc
airflow.contrib.hooks.grpc_hook airflow.contrib.operators.grpc_operator

smtp
airflow.operators.email_operator

jdbc
airflow.hooks.jdbc_hook airflow.operators.jdbc_operator

winrm
airflow.contrib.hooks.winrm_hook airflow.contrib.operators.winrm_operator


Update to the original point D. (2019-10-11).

During implementation of AIP-23 we found that the original decision about grouping operators was not the best and did not cover all the scenarios. Therefore we updated the rules as follows:

  • Grouping by cloud provider should be done in the "airflow/providers" package (previously it was directly in "airflow").
  • Each provider can have different internal structure, potentially grouping the operators in sub-packages. For example in case of "google" provider the packages will be further grouped by "gcp", "gsuite", "core" sub-packages.
  • In case of transfer operators where two providers are involved, the transfer operators will be moved to "source" (NOTE it's been changed to "target" in subsequent Update) of the transfer. When there is only one provider as target but source is a database or another non-provider source, the operator is put to the target provider.

Update to include 1.10.* backportability and details about non-cloud-providers package (2019-11-16).

In light of the coming Airflow 2.0 release, the community decided there is a need to make it easier to migrate from Airflow 1.10 to the upcoming 2.0 release. Airflow 2.0 is, by definition, not backwards-compatible with the 1.10.* series. A number of incompatibilities are introduced: in the core, the database, and the concepts, but also in the parameters of a number of operators integrating with external services/software. DAGs written for Airflow 1.10.* might not work out-of-the-box in Airflow 2.0.

We have not yet figured out whether we are going to provide some automated migration, but we can provide a mechanism to switch to the Airflow 2.0 "provider" set of operators and hooks even while still running Airflow 1.10. That can make the migration process easier, as an organisation doing the migration can do it in steps. Some organisations still use Python 2, even though Airflow 2.0 supports only Python 3.5+ (possibly 3.6+ in the final 2.0 release).

We figured out that most of the new/updated operators to be released in Airflow 2.0 which were not cherry-picked to Airflow 1.10 are still runnable on Airflow 1.10, so you should be able to start using those operators in Airflow 1.10 in parallel with the old operators. We decided to move most of the non-core operators to new packages (all inside the "providers" package) and release them as separate packages installable on Python 3.5+ Airflow 1.10.* releases. A POC for that is available here:

https://github.com/apache/airflow/pull/6507

Therefore the migration process might look as follows.

 (1) Python 2.7 + Airflow 1.10.* → (2) Python 3.6 + Airflow 1.10.* → (3) Python 3.6 + Airflow 1.10.* + switch to using "providers" operators → (4) Python 3.6 + Airflow 2.0

Switching to the new "providers" operators can be mostly automated and can be done incrementally for the DAGs a company has (we can provide some scripts for that). Each of the steps can be done separately, at its own pace.

This will make it easier for companies to move to Airflow 2.0, and it might provide an early testing ground for all the operators/hooks/sensors that are only present in Airflow 2.0 and have incompatible changes.
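The "mostly automated" switch to the new import paths could be sketched as a simple rewrite script over DAG sources. The mapping below is illustrative only: the two entries are taken from the examples in this document, and a real script would carry the full table of renames.

```python
import re

# Illustrative mapping of old Airflow 1.10 import paths to new locations.
# These two entries come from the examples in this document; a complete
# script would include the whole rename table.
RENAMES = {
    "airflow.contrib.operators.gcs_to_s3_operator":
        "airflow.providers.google.cloud.operators.gcs_to_s3",
    "airflow.contrib.operators.ssh_operator":
        "airflow.operators.ssh",
}


def rewrite_imports(dag_source: str) -> str:
    """Rewrite old-style module paths in DAG source to the new packages."""
    for old, new in RENAMES.items():
        # Word boundaries keep us from rewriting longer paths by accident.
        dag_source = re.sub(rf"\b{re.escape(old)}\b", new, dag_source)
    return dag_source
```

Run over each DAG file, this turns e.g. `from airflow.contrib.operators.ssh_operator import SSHOperator` into `from airflow.operators.ssh import SSHOperator`.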

The list of all Airflow operators/sensors/hooks is above, in "Target groups in the providers packages".

Architectural considerations

This chapter is based on widely accepted rules and also shows cases where these rules are not followed. I will also present ideas for improving these principles.

Case #1 airflow.contrib.{resources}

There should be one-- and preferably only one --obvious way to do it.

Tim Peters, The Zen of Python

Currently, resources are located in two places:

airflow.{resource_type}

airflow.contrib.{resource_type}

The first location holds resources that were originally maintained by Airbnb. However, the project has been transferred to Apache, and Airbnb is no longer responsible for their maintenance; the community is. The second location holds operators that have been maintained by the community from the beginning until now. Currently, all new resources are added only to the second location, and changes to and development of the first are strictly limited.

There is currently no reason for this type of division. All resources should be in one place.

Solution A:

We should move all the resources from the first place to the second. All resources will be located in airflow.{hooks/operators/sensors/example_dags}.

Advantages

Disadvantages

- resources are located in one place (and one place only). No need to check multiple locations for docs for example.

- no confusion for new contributors whether their work needs to be managed differently. (New contributors shouldn’t wonder if there is a difference between their work and non-contrib work. Because there shouldn’t be one!)

- resources moved from contrib to core have to be tested before the move. Outdated hooks/operators need to be updated or removed. Unit tests need to be added wherever they do not already exist.


Solution B:

Move all the well-tested and maintained resources to the core; e.g. the GCP resources are well-tested and have good documentation. All new resources first need to be added to the contrib folder, and once they reach "maturity" they can be moved to the core. We need to define what that maturity is. Contrib resources would be analogous to beta features in a product. We should also consider renaming "contrib" to "incubator" in this situation.

Advantages

Disadvantages

- resources in core can be trusted by users, and contributors take full responsibility for those resources.

- resources in contrib are subject to change.

- resources need to be maintained in two places, which can be a source of confusion for new contributors.


Solution C:

No change.

Case #2 *_{operator/sensor}{/s}.py

Currently, the import takes the following format:

airflow{.contrib/}.operators.*_operator

There is information redundancy here: there is no need to use the word "operator" twice.

It is worth mentioning that the word "operator" also appears in the class name.

Solution A:

The import should take the following format:

airflow{.contrib/}.operators.*

The suffix "_operator" should be dropped.

Example:
File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp_bigtable.py.

Advantages

Disadvantages

- Shorter name, but still focussing on the essential task of the class (no information loss)

-


Solution B:

No change

Advantages and disadvantages are analogous to solution A.

Case #3 {aws/azure/gcp}_*

With the development of integrations for the largest cloud providers, a large proportion of new files received the prefix assigned to each provider. For example, for Google Cloud Platform it is "gcp". Google even mentions the practice in its official recommendations[1]. Not all files follow this rule. Ansible also uses a similar structure.

Solution A:

The prefix can be completely dropped. Each major provider can get a separate sub-module for each type of resource.

Operators that integrate with two services will not change.

Example:
File airflow/contrib/operators/gcp_bigtable_operator.py should be moved to airflow/contrib/operators/gcp/bigtable_operator.py.

The package format will look like this:
airflow/{contrib/}{resource_type}/{supplier/}bigtable_operator.py

Advantages

Disadvantages

- shorter name, but still focussing on the essential task of the class (no information loss)

- users only need to look at their _supplier_ package instead of many other suppliers' services at once. (Most users probably use only one supplier at a time.)

(This could also speed up navigating the documentation, depending on how the documentation is structured.)

  • it's a bit easier to find files when the file name contains the relevant prefix, e.g. gcp_*, in most IDEs. This is, however, a very weak argument, as most IDEs will also match gcp/* as a prefix when looking for a file.


Solution B:

The prefix will be added to the files that are missing it.

Example:
File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow/contrib/operators/aws_sns_publish_operator.py
File /airflow/contrib/operators/dataproc_operator.py should be moved to /airflow/contrib/operators/gcp_dataproc_operator.py

Operators that integrate with two services will not change.

Solution C:

This solution has been reported by ashb

The prefix can be completely dropped. Each major provider will get its own sub-module, which will contain all types of resources.

This change forces the adoption of solution A from Case #1 airflow.contrib.{resources} at the same time.

The package format will look like this:
airflow/integration/{supplier}/{resource_type}/bigtable_operator.py

Advantages

Disadvantages

This way the integration package contains everything from a supplier, and you won't have multiple per-supplier packages for hooks, operators, macros, etc.

Moreover, it would be simpler to move such an integration to a separate repository. (See AIP-8.)

-


Solution D:

The prefix can be completely dropped. Each major provider will get its own sub-module, which will contain all types of resources. Other operators will be moved to one module (e.g. core).

This change forces the adoption of solution A from Case #1 airflow.contrib.{resources} at the same time.

The package format will look like this:
airflow_integration/{resource_type}/gcp_bigtable_operator.py

Example:
File /airflow/contrib/operators/sns_publish_operator.py should be moved to /airflow_integration/aws/operators/aws_sns_publish_operator.py
File /airflow/operators/bash_operator.py should be moved to /airflow_integration/core/bash_operator.py

Case #4 Separate namespace for resources

Namespaces are one honking great idea -- let's do more of those!

Tim Peters, The Zen of Python

Note: we do not move the namespaces out; it is merely a package. We can create a new namespace for all resources. We will not take advantage of all the possibilities that it offers, but in the future it will be possible to easily switch to a separate package for a group of resources.

This solution should also be considered taking into account the acceptance of solution D from Case #3 {aws/azure/gcp}_*

Example of a project that uses a separate namespace: https://github.com/googleapis/google-cloud-python

Note: this change does not introduce separate packages for groups of resources; it only tries to retain compatibility. Details are available in AIP-8 Split Hooks/Operators into Separate Packages by Tim Swast.

The package format will look like this:
airflow_resources/{category}/{resource_type}/bigtable_operator.py

Solution #A:

We can introduce namespaces.

Advantages

Disadvantages

We will avoid changing import paths in the future

-


Solution #B:

We reject introducing namespaces.

Advantages and disadvantages are analogous to solution A.

Note that the grouping remains as if in namespaces (but this is merely a package, not a separate namespace).

Case #5 *Operator

The class name does not need the suffix "Operator".

Solution A:

We can delete the suffix “Operator” from the class name

Example:
Class GcpTransferServiceJobDeleteOperator should be renamed to GcpTransferServiceJobDelete.

Advantages

Disadvantages

- Shorter name, but still focussing on the essential task of the class (no information loss)

-


Solution B:

No change

Advantages and disadvantages are analogous to solution A.

Case #6 Other isolated cases

There are other isolated cases of inconsistencies in class naming. It is necessary to review the list of all classes and prepare a plan of change. Support from the major cloud service providers will be useful.

For example:
Google Dataproc operators:

Current name → Proposition of new name

DataProcHadoopOperator → DataprocHadoopOperator
DataProcHiveOperator → DataprocHiveOperator
DataProcPigOperator → DataprocPigOperator
DataProcPySparkOperator → DataprocPySparkOperator
DataProcSparkOperator → DataprocSparkOperator
DataProcSparkSqlOperator → DataprocSparkSqlOperator
DataprocClusterCreateOperator → no change
DataprocClusterDeleteOperator → no change
DataprocClusterScaleOperator → no change
DataprocWorkflowTemplateBaseOperator → no change
DataprocWorkflowTemplateInstantiateInlineOperator → no change
DataprocWorkflowTemplateInstantiateOperator → no change
GoogleCloudStorageToS3Operator → GcsToS3Operator


This document does not analyze such cases. They can be an area of analysis for other groups of people, e.g. employees of the largest cloud service providers.

Any such change must be considered individually when accepting pull requests. Each change must be consistent with the recommendations made after voting on the changes in this document.
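When such a rename is accepted, backwards compatibility can still be preserved by leaving a deprecated subclass alias at the old name. A minimal sketch, using the names from the table above; the operator body here is a stand-in, not the real Airflow class:

```python
import warnings


class DataprocHadoopOperator:
    """Stand-in for the renamed operator (the real class lives in Airflow)."""

    def __init__(self, *args, **kwargs):
        self.args, self.kwargs = args, kwargs


class DataProcHadoopOperator(DataprocHadoopOperator):
    """Deprecated alias kept so existing DAGs keep working after the rename."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "DataProcHadoopOperator is deprecated; "
            "use DataprocHadoopOperator instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```

Old DAGs keep running unchanged, while users see a DeprecationWarning pointing at the consistent name.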

Implementation considerations

Case #7

Developer perspective: source code and console views for both methods are available at https://imgur.com/a/mRaWpQm

Repository with samples: https://github.com/mik-laj/airflow-deprecation-sample

Solution #A native python

Advantages

Disadvantages

It is supported by IDEs.

More flexible: we can add a link to the documentation.

Files must exist in the project: temporary mess.

More code in the project (226 characters vs 78 characters = +189%).


Sample PR: https://github.com/apache/airflow/pull/4667
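As an illustration, a "native Python" shim reduces to a module-level warning plus re-exports at the old path. This is a sketch, not the exact code from the sample PR; the re-export line is commented out because the provider path is only an example taken from this document:

```python
"""What the shim left at the old path might look like, e.g. the body of
airflow/contrib/operators/gcp_bigtable_operator.py after the real code
moves to its new home (illustrative only)."""
import warnings

# Emitted once, when the old module path is first imported.
warnings.warn(
    "This module is deprecated. Please use "
    "`airflow.providers.google.cloud.operators.bigtable`.",
    DeprecationWarning,
    stacklevel=2,
)

# Re-export from the new location so existing imports keep working:
# from airflow.providers.google.cloud.operators.bigtable import *  # noqa
```

Because the warning lives in ordinary Python code, IDEs can resolve the re-exported names, and the message can carry a link to the documentation.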

Solution #B zope.deprecation/sys.modules hack

Solution proposed by @ashb

Advantages

Disadvantages

Less boilerplate code.

It is NOT supported by IDEs.

Files must exist in the project - temporary mess.

No configuration options
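The sys.modules side of this approach can be sketched without zope.deprecation at all: the old module replaces itself in sys.modules with a wrapper that forwards attribute access to the new module and warns. This illustrates only the mechanism (the module and class names are made up), not the code proposed by @ashb; zope.deprecation packages the same idea with less boilerplate:

```python
import sys
import types
import warnings


class DeprecatedModule(types.ModuleType):
    """Module wrapper that warns whenever the old path is actually used."""

    def __init__(self, old_name, new_module):
        super().__init__(old_name)
        self._new_module = new_module

    def __getattr__(self, name):
        # Called only for attributes not set on the wrapper itself,
        # i.e. for every operator/hook looked up through the old path.
        warnings.warn(
            f"{self.__name__} is deprecated; import from "
            f"{self._new_module.__name__} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(self._new_module, name)


# Hypothetically, the last line of the old module would then read:
# sys.modules[__name__] = DeprecatedModule(__name__, new_module)
```

Because the attributes are resolved only at runtime, IDEs cannot see them, which is the main disadvantage noted above.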


Executive considerations

We can introduce the proposed changes in two ways:

  1. as one commit;
  2. as many commits, one for each group of operators.

The first method will be faster to perform, but finding a bug (if one appears) in such a patch would be very difficult. The change should therefore be introduced as a series of corrections.

Each change should contain one commit. Each PR and commit should be described in the format "[AIRFLOW-XXX]".

Summary of the proposal

The voted options are marked in green.


Choice

Case 1

What to do with the "contrib" folder

Case 2

Drop modules *_operator suffix

Case 3 + Case 4 

Grouping Cloud Providers operators/sensors/hooks 

Case 5

*Operator *Sensor *Hook in class name 

Case 6 

Isolated cases

Case 7

Deprecation method

A

Move everything "contrib" → "airflow"

Drop the suffix.

Example:

  • airflow.operator.gcp_bigtable_operator.py 
    becomes airflow.operator.gcp_bigtable.py.

Keep operators/sensors/hooks in airflow/operators(sensors, hooks) and keep/add prefixes in file names.

  • airflow/contrib/operators/sns_publish_operator.py 
    becomes airflow/operators/
    aws_sns_publish_operator.py
  • airflow/contrib/operators/dataproc_operator.py 
    becomes airflow/operators/gcp_dataproc_operator.py  


  • airflow/contrib/hooks/gcp_bigtable_hook.py 
    becomes airflow/hooks/gcp_bigtable_hook.py
  • airflow/contrib/operators/ssh_operator.py 
    becomes airflow/operators/ssh_operator.py

Remove the Operator suffix from class name.

Examples:

  • GcpTransferServiceJobDeleteOperator  rename to GcpTransferServiceJobDelete
  • BashOperator rename to Bash

Make individual decisions of renames for operators that do not follow common conventions used for other operators.

Consistency trumps compatibility.

Examples:

DataProcHadoopOperator

renamed to:

DataprocHadoopOperator

Native Python method (better IDE support and more flexible, but a bit more verbose)
B

Move well-tested code "contrib" → "airflow"

Rename "contrib" to "incubator" for less-well-tested code.

No change.

Example:

  •  gcp_bigtable_operator.py 
    stays gcp_bigtable_operator.py

Group operators/sensors/hooks in airflow/operators(sensors, hooks)/<PROVIDER>. Non-cloud provider ones are moved to airflow/operators(sensors/hooks). Drop the prefix.

  • airflow/contrib/operators/sns_publish_operator.py 
    becomes airflow/operators/aws/sns_publish_operator.py
  • airflow/contrib/operators/dataproc_operator.py 
    becomes airflow/operators/gcp/dataproc_operator.py  


  • airflow/contrib/operators/gcp_bigtable_operator.py 
    becomes airflow/operators/gcp/bigtable_operator.py

  • airflow/contrib/operators/ssh_operator.py 
    becomes airflow/operators/ssh_operator.py
No change - keep the Operator/Sensor suffix in class names.

Avoid renaming operators, for better backwards compatibility.

Use zope.deprecation (less IDE support, less verbose, less flexible).
C

No change.

Group operators/sensors/hooks in airflow/operators(sensors, hooks)/<PROVIDER>. Non-cloud provider ones are moved to airflow/operators(sensors/hooks). Keep the prefix.

  • airflow/contrib/operators/sns_publish_operator.py 
    becomes airflow/operators/aws/aws_sns_publish_operator.py
  • airflow/contrib/operators/dataproc_operator.py 
    becomes airflow/operators/gcp/gcp_dataproc_operator.py  


  • airflow/contrib/operators/gcp_bigtable_operator.py 
    becomes airflow/operators/gcp/gcp_bigtable_operator.py

  • airflow/contrib/operators/ssh_operator.py 
    becomes airflow/operators/ssh_operator.py



D

Group operators/sensors/hooks in airflow/<PROVIDER>/operators(sensors, hooks). Non-cloud provider ones are moved to airflow/operators(sensors/hooks). Drop the prefix.

  • airflow/contrib/operators/sns_publish_operator.py 
    becomes airflow/aws/operators/sns_publish_operator.py
  • airflow/contrib/operators/dataproc_operator.py 
    becomes airflow/gcp/operators/dataproc_operator.py  


  • airflow/contrib/operators/gcp_bigtable_operator.py 
    becomes airflow/gcp/operators/bigtable_operator.py

  • airflow/contrib/operators/ssh_operator.py 
    becomes airflow/operators/ssh_operator.py



E

Group operators/sensors/hooks in airflow/<PROVIDER>/operators(sensors, hooks). Non-cloud provider ones are moved to airflow/operators(sensors/hooks). Keep the prefix.

  • airflow/contrib/operators/sns_publish_operator.py 
    becomes airflow/aws/operators/aws_sns_publish_operator.py
  • airflow/contrib/operators/dataproc_operator.py 
    becomes airflow/gcp/operators/gcp_dataproc_operator.py  


  • airflow/contrib/operators/gcp_bigtable_operator.py 
    becomes airflow/gcp/operators/gcp_bigtable_operator.py

  • airflow/contrib/operators/ssh_operator.py 
    becomes airflow/operators/ssh_operator.py



Voting

Feel free to add your votes below:

Person | Binding

Case 1

What to do with the "contrib" folder

Case 2

Drop modules *_operator suffix

Case 3 + Case 4 

Grouping Cloud Providers operators/sensors/hooks 

Case 5

*Operator *Sensor *Hook in class name

Case 6 

Isolated cases

Case 7

Deprecation method

Yes

A: Move everything "contrib"→ "airflow"

A. gcp_bigtable_operator.py → gcp_bigtable.py

D. airflow/contrib/operators/gcp_bigtable_operator.py 
→  airflow/gcp/operators/bigtable_operator.py

B. No changes. Keep *Operator *Sensor *Hook in class name

A. Rename isolated cases for consistency.

A. Native python with better IDE integration.
Binding | Case 1 | Case 2 | Case 3+4 | Case 5 | Case 6 | Case 7
Yes | A | A | D | B (it's clearer at the call site: task = XOperator() vs task = X()) | No opinion | No strong opinion
Yes | A | A | A | B | A | A
No | A | A | D | B | A | No opinion
Yes | A | A | D | B | A | A
Yes | A | A | D | (blank) | A | A
Yes | A | A | A | B | A | A
No | A | A | D | B | A | A

If you have any strong vetoes on any of the answers, please record them here with justification:

Person | Binding

Case 1

What to do with the "contrib" folder

Case 2

Drop modules *_operator suffix

Case 3

Separate out module's Cloud Provider prefixes (gcp/aws/azure) to packages

Case 4 

Introduce separate namespaces for different resources

Case 5

*Operator *Sensor *Hook in class name 

Case 6 

Isolated cases

Case 7

Deprecation method





Vetoing A with the prefix of airflow_resources, but I don't object to airflow.gcp.operator.x grouping.



Original votes on Case 3/Case 4:


Case 3

Separate out module's Cloud Provider prefixes (gcp/aws/azure) to packages

Case 4 

Introduce separate namespaces for different resources

C. gcp_bigtable_operator.py → gcp/operators/bigtable.py

B. No namespaces introduction.

No opinion
B | B
C | B
C | B
Daniel Standish

C (airflow.gcp.operator.*)

It will make it more organised. Anyone wanting to find out Airflow's support for a particular cloud provider would just need to look into a single folder.



B | B

The Voting mechanism:

Voting will take place until Tuesday 6 pm CEST (5 pm BST, 9 am PST).

After the voting closes, the final consistent set of choices will be announced, taking into account the majority of binding votes, potential vetoes, and consistency between the choices. Non-binding votes will be taken into account in case of a draw. The final set of choices will be announced on the devlist thread after the voting completes.

Unless a veto is raised against the final proposal, it is accepted by Lazy Consensus on Friday at 6 pm CEST (5 pm BST, 9 am PST).


Reference

fenglu@google.com. 2018. GCP Service Airflow Integration Guide. [ONLINE] Available at: https://lists.apache.org/thread.html/e8534d82be611ae7bcb21ba371546a4278aad117d5e50361fd8f14fe@%3Cdev.airflow.apache.org%3E. [Accessed 8 February 2019].


8 Comments

  1. Very much in favour of reorganising the module paths and names.

    Could we also consider moving parts out of airflow/models? For example I think a more suitable location for the BaseOperator would be airflow/operators (I understand for historical reasons most DB models were in airflow/models.py, but at the moment half of them are stored in airflow/models, and the other half of DB models are located in some logical location - which is quite confusing).

    1. I am afraid that this AIP is already very large. I think this topic should be discussed in a separate AIP. Here, I would like us to focus on resources.

      "Every time I write resources, I mean operators, hooks, sensors or another piece of code that is specific to an integration."

      Do you think it is a good idea?

      1. Sure, one step at a time. I'll propose new locations for the models classes in a new AIP.

  2. Can we edit this page to reflect the new AWS package structure - /providers/amazon/aws - discussed in the mail:

    https://lists.apache.org/thread.html/5ed0ce4dfd77640d7464dc57d4fafc75e8aac44e016f87ca1be3a4a3@%3Cdev.airflow.apache.org%3E

    1. Yeah. I think we have a consensus there.

  3. We should drop the prefix in the class names for the Operators/Hooks (S3Hook, DynamoDBHook for example). Right?

  4. We only have detailed naming guidelines written for GCP operators (https://docs.google.com/document/d/1_rTdJSLCt0eyrAylmmgYc3yZr-_h51fVlnvMmWqhCkY/edit); there is no standardisation across the providers. Maybe you would like to take it as a base, copy/modify it accordingly, and propose one in the community for Amazon (smile)

    1. Oh sure! I'd like to contribute to proposing the naming guidelines for AWS operators. I'll propose it on the mailing list soon.