AIP-67 Multi-team deployment of Airflow components

Status

State	Draft
Discussion Thread	https://lists.apache.org/thread/6k3sxxvboko4fw7qdhr78y3qhf26vh3q
Vote Thread
Vote Result Thread
Progress Tacking (PR/GitHub Project/Issue Label)
Date Created	2024-03-06
Version Released
Authors	Jarek Potiuk

Summary

This document describes the proposal of adding a “--team” flag to Airflow components, so that it will be possible to create a multi-team deployment of Airflow.

The multi-team feature described here allows to have a single deployment of Airflow for several/multiple teams that will be isolated from each other in terms of:

access to team-specific configuration (variables, connections)
execute the code submitted by team-specific DAG authors in isolated environments (both parsing and execution)
allow different teams to use different set of dependencies/execution environment libraries
allow different teams to use different executors (including multiple executors per-team following AIP-61)
allows to link DAGs between different teams via a “dataset” feature. Datasets can be produced/consumed by different teams
allow the UI users to to see a subset of single team or multiple teams DAGs

The goal of this AIP/Document is to get feedback from the wider community of Airflow on the proposed multi-team architecture to better understand if the proposed architecture is - in fact - addressing the needs of the user.

Airflow Survey 2023 shows that multi-tenancy is one of the highly requested features of Airflow. Of course multi-tenancy can be understood differently, this document aims to propose a multi-team model chosen by Airflow maintainers, not a "customer multi-tenancy", where all resources are isolated between tenants, but rather propose a way how Airflow can be deployed for multiple teams within the same organization, which we identified as what many of our users understooda as "multi-tenancy". We chose to use "multi-team" name to avoid ambiguity of the "multi-tenancy". The ways how some levels of multi-tenancy can be achieved today is discussed in the “Multi-tenancy today” chapter and differences between this proposal and those current ways is described in the “Differences vs. current Airflow multi-team options”.

Motivation

Main motivation is the need for the users of Airflow to have a single deployment of Airflow, where separate teams in the company structure have access to only a subset of resources (e.g. DAGs and related tables referring to dag_ids) belonging to the team. This allows to share the UI/web server deployment and scheduler between different teams, while allowing the teams to have isolated DAG processing and configuration/sensitive information. It also allows a separate group of DAGs that
SHOULD be executed in a separate / high confidentiality environment, also allows to decrease the cost of deployment by avoiding having multiple schedulers and web servers deployed.

This covers the cases where currently multiple Airflow deployments are used in several departments/teams by the same organization and where maintaining (even if more complex) single instance of Airflow is preferable over maintaining multiple, independent instances.

This allows for partially centralized management of airflow while delegating the execution environment decisions to teams as well as makes it easier to isolate the workloads, while keeping the option of easier interaction between multiple teams via shared dataset feature of Airflow.

Wording/Phrasing

Note, that this is not a formal specification, but where emphasised by capital letters, The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in the RFC 2119

Considerations

Multi-tenancy today

There are today several different ways one can run multi-team Airflow:

A. Separate Airflow instance per-tenant

The simplest option is to implement and manage separate different Airflow instances, each with own database, web server, scheduler, workers and configuration, including execution environment (libraries, operating system, variables and connections)

B. Separate Airflow instance per-tenand with some shared resources

Slightly more complex option is to reuse some of the resources - to save cost. The Database MAY be shared (each Airflow environment can use its own schema in the same database), web server instances could be run in the same environment, and the same Kubernetes Clusters can be used to execute workloads of tasks.

In either of the solution A/B, where multiple Airflow instances are deployed, the AuthUI access of Airflow (especially with the new Auth Manager feature AIP-56) can be delegated to a single Authentication proxy (for example KeyCloak Auth Manager could be implemented that uses single KeyCloak authentication proxy to implement a unified way of accessing the various UI webserver instances of Airflow that can be exposed in a unified URL scheme).

C. Using a single instance of Airflow by different teams

In the Airflow as we have it today he execution and parsing environment “per-team” could be - potentially - separated for different teams.

You could have separate set of DAG file processors per each folder in Airflow DAG folder, and use different set of execution environments (libraries, system libraries, hardware) - as well as separate queues (Celery) or separate Kubernetes Pod Templates used to separate workload execution for different teams. Those are enforceable using Cluster Policies. UI access “per team” could also be configured via custom Auth Manager implementation integrating with organization-managed authentication and authorization proxy.

In this mode, however, the workloads still have access to the same database and can interfere with each other’s DB including Connections and Variables - wmanhich means that there is no “security perimeter” per team, that would disallow DAG authors of one team to interfere with the DAGs written by DAG authors of other teams. Lack of isolation between the team group is the main shortcoming of such a setup that compared to A. and B. this AIP aims to solve.

Also it is not at all "supported" by Airflow configuration in an easy way. Some users do it by putting limitations on their users (for example only using Kubernetes Pod Operator), some might implement custom cluster policies or code review rules to make sure DAG authors from different teams are not mixed with other teams, but there is no single, easy to use mechanism to enable it

Differences of the multi-team proposal vs. current Airflow multi-tenant options

How exactly the current proposal is different from is possible today:

The resource usage for Scheduling and execution can be slightly lower compared to A. or B. The hardware used for Scheduling and UI can be shared, while the workloads run “per-team” would be anyhow separated in the similar way as they are when either A. or B. is used. Single database and single schema is reused for all the teams, but the resource gains and isolation is not very different from the one achieved in B. by utilizing multiple, independent schemas in the same database. Decreasing resource utilization is a non-goal of the proposal.

The proposal - when it comes to maintenance - is a trade-off between complete isolation of the execution environment available in options A. and B. and being able to centralize part of the maintenance effort from option C. This has some benefits and some drawbacks - increased coupling between teams (same Airflow version) for example, but also better and complete workload isolation than option C.

The proposed solution allows to - easier and more complete than via cluster policies in option C. - manage a separation between teams. With this proposal you can trust that teams cannot interfere with each other’s code and execution. Assuming fairness in Scheduling algorithms of the scheduler also execution efficiency between the teams SHOULD be isolated. Security and isolation of workflows and inability to interference coming from DAG authors belonging to different teams is the primary difference that brings the properties of isolation and security available in options A. and B. to single instance deployment of Airflow.

The proposal allows to have a single unified web server entrypoint for all teams and allows to have a single administrative UI interface for management of the whole team “cluster”.

Utilizing the (currently enhanced and improved) dataset feature allowing teams to interface with each other via datasets. While this will be (as of Airflow 2.9) possible in options A. and B. using the new APIs that will allow sharing dataset events between different instances of Airflow, having a multi-team single instance Airflow to allow for dataset driven scheduling between teams without setting up authentication between different, independent Airflow instances.
Allowing users to analyse their Airflow usage and corellation/task analysis with single API/DB source.

Credits and standing on the shoulders of giants.

Prior AIPs that made it possible

It took quite a long time to finalize the concept, mostly because we had to build on other AIPs - funcationality that has been steadily added to Airflow and iterated on - and the complete design could only be based once the other AIPs are in the shape that we coud "stand on the shoulders of giants" and add multi-team layer on top of those AIPs.

The list of related AIPs:

AIP-12 Persist DAG into DB- initial decision to persist DAG structure to the DB (Serialization) - that allowed to have OPTIONAL DAG serialization in the DB.
AIP-24 DAG Persistence in DB using JSON for Airflow Webserver and (optional) Scheduler - that allowed to separate "Webserver" nd extract it to "common" infrastructure.
AIP-43 DAG Processor separation - it allowed to separate DAG Parsing and Execution environment from Scheduler and move Scheduler to "common" infrastructure
AIP-44 Airflow Internal API - (in progress) allows to separate DB access for components that can also execute code created by DAG Authors - introducing a complete Security Perimeter for DAG parsing and execution
AIP-48 Data Dependency Management and Data Driven Scheduling - introducing the concept of DataSets that could be used as "Interface" Between teams
AIP-51 Removing Executor Coupling from Core Airflow - that decoupled executors and introduced clear executor API that allowed to implement Hybrid executors
AIP-56 Extensible user management - allowing to have externally managed and flexible way to integrate with Organization's Identity services
AIP-60 Standard URI representation for Airflow Datasets - implementing "common language" that teams could use to communicate via Datasets
AIP-61 Hybrid Execution - (in-progress) allowing to have multiple executors - paving the way to have separate set executors per team

Design Goals

Structural/architectural changes

The goal of the implementation approach proposed is to minimize structural and architectural changes in Airflow to introduce multi-team, features and to allow for maximum backwards compatibility for DAG authors and minimum overhead of Organization Deployment Managers. All DAGs written for Airflow 2, providing that they are using only the Public Interface of Airflow SHOULD run unmodified in a multi-team environment.

Minimum database structure needs to be modified to support multi-team mode of deployment.

The whole Airlfow instance continues to use the same shared DAG folder containing all the DAGs it can handle from all teams, however each team user might have access to only subfolder(s) belonging to that team and their Standalone Dag File Processor(s) will only work on those sub-folders. This MAY be relaxed in the future, if we use different way of identifying source files coming from different teams than based on the location of the DAG in the shared DAG folder being the source of all DAGs.

Security considerations

The following assumptions have been made when it comes to security properties of multi-team Airflow:

Reliance on other AIPs: Multi-team mode can only be used in conjunction with Internal API (AIP-44) and Standalone DAG file processor.
Resource access security perimeter: The security perimeter for parsing and execution in this AIP is set at the team boundaries. The security model implemented in this AIP assumes that once your workload runs withing a team environment (in execution or parsing) it has full access (read/write) to all the resources (DAGs and related resources) belonging the same team and no access to any of those resources belonging to another team.
DAG authoring: DAGs for each team are stored in a team-specific folder (or several folders per team) - this allows per-folder permission management of DAG files. Having teams based on team-specific folders allows for common organization DAG code managed and accessed by a separate “infrastructure” team within a deployment of Airflow. When Airflow is deployed in multi-team deployment, all DAGs MUST belong to one of the teams.
Execution: Both parsing and execution of DAGs SHOULD be executed in isolated environment. Note that Organization Deployment Managers might choose different approach here and colocate some or all teams within the same environments/containers/machines if the separation based on process separation is good enough for them.
Dataset sharing: Datasets are shared between teams. DAG Authors can produce and consume datasets between teams. That allows us to share datasets between different organizations. In the future we MAY need to limit it - assumption is that the limit will at most disallow access to specific datasets but that it will not break general compatiblity - the APIs will remain the same.
DB access: In Phase 2 of the implementation Airflow is deployed with internal API / GRPC API component enabled (separate internal API component per team) - this Internal API component only allows access to DAGs belonging to the team. Implementing filtering of access "per-team" in the Internal API component is part of this AIP.
UI access: Filtering resources (related to DAGs) in the UI is done by the auth manager. For each type of resource, the auth manager is responsible for filtering these given the team(s) the user belongs to. We do not plan to implement such filtering in the FAB Auth Manager which we consider as legacy/non-multi-team capable.
Custom plugins: the definition of plugin in Airflow cover a number of ways Airflow can be extended: plugins can contribute parsing and execution "extension" (for example macros), UI components (custom views) or Scheduler extensions (timetables). This means that plugins installed in "team" parsing and execution environment will only contribute the "parsing and execution" extensions, plugins installed in "scheduler" environment will contribute "scheduler" extensions, plugins installed in "webserver" will contribute "webserver" extensions. Which plugins are installed where, depends on those who manage the deployment. There might be different plugins installed by the team Deployment Manager (contributing parsing and execution extensions) and different plugins installed by the organization admin - contributing scheduler and UI extensions.
UI plugins auth manager integration: Since webserve is shared, the custom UI plugins have to be implemented in multi-team compliant way, in order to be deployable in the multi-team environment. This means that they have to support AIP-56 based auth management and they have to utilize the feature of the Auth Manager exposed to it that it will allow to distinguish team users and their permissions. It MAY require to extend AuthManager API to support multi-team environments. A lot could be done with existing APIs and we do not technicallly have to support custom UI plugins at all for the multi-team setup. It MAY be easily implemented as a follow-up to this AIP add better support in the future.
UI controls and user/team management: with AIP-56, Airflow delegated all responsibility to Authentication and Authorisation management to Auth Manager. This continues in case of multi-team deployment. This means that Airlfow webserver completely abstracts-away from knowing and deciding which resources and which parts of the UI are accessible to the logged-in user. This also means for example that Airflow does not care nor manages which users has access to which team and whether the users can access more than one team or whether the user can switch between teams while using Airflow UI. All the features connected with this are not going to be implemented in this API, particular implementations of Auth Managers might choose different approaches there. It MAY be that some more advanced features (for example switching the team for the user logged in) might require new APIs in Auth Manager (for example a way how to contribute controls to Airflow UI to allow such switching) - but this is outside of the scope of this AIP and if needed should be designed and implemented as a follow up AIP.

Design Non Goals

It’s also important to explain the non-goals of this proposal. This aims to help to get more understanding of what the proposal really means for the users of Airflow and Organization Deployment Managers who would like to deploy a multi-team Airflow.

It’s not a primary goal of this proposal to significantly decrease resource consumption for Airflow installation compared to the current ways of achieving “multi-tenant” setyp. With security and isolation in mind, we deliberately propose a solution that MAY have small impact on the resource usage, but it’s not a goal to impact it significantly (especially compared to option B above where the same Database can be reused to host multiple, independent Airflow instances. However isolation trumps performance whenever we made design decision and we are willing to sacrifice performance gain in favour of isolation.

It’s not a goal of this proposal to increase the overall capacity of a single instance of Airflow. With the proposed changes, Airflow’s capacity in terms of total number of DAGs and tasks it can handle will not increase. That also means that any scalability limits of the Airflow instance apply as today and - for example - it’s not achievable to host multiple 100s or thousands of teams with a single Airflow instance and assume Airflow will scale it’s capacity with every new team

It’s not a goal of the proposal to provide a one-stop installation mechanism for “Multi-team” Airflow. The goal of this proposal is to make it possible to deploy Airflow in a multi-team way, but it has to be archite
cted, designed and deployed by the Organization Deployment Manager. It won't be a turn-key solution that you might simply enable in (for example) Airflow Helm Chart. However the documentation we provide MAY explain how to combine several instances of Airflow Helm Chart to achieve that effect - still this will not be a "turn-key", it will be more of a guideline on how to implement it.

It’s not a goal to decrease the overall maintenance effort involved in responding to needs of different teams, but it allows to delegate some of the responsibilities for doing it to teams, while allowing to maintain “central” Airflow instance - common for everyone. There will be different maintenance trade-offs to make compared to the multi-team options available today - for example, the Organization Deployment Manager will be able to upgrade Airflow once for all the teams, but each of the teams MAY have their own set of providers and libraries that can be managed and maintained separately. Each deployment will have to add their own rules on maintenance and upgrades to maintain the properties of the Multi-team environmet, where they all share the same Airflow version but each of the teams will have their own set of additional dependencies. Airflow will not provide any more tooling for that than those existing today - constraint files, reference container images and documentation on how to build, extend and customise container images based on Airflow reference images. This might mean that when single Airflow instance has 20 teams, there needs to be a proper build and customisation pipeline set-up outside of Airflow environment that will manage deployment, rebuilds, upgrades of 20 different container images and making sure they are properly used in 20 different teame-specific environments deployed as parts of the deployment.

It's not a goal to support or implement the case where different teams are used to support branching strategies and DEV/PROD/QA environments for the same team. The goal of this solution is to allow for isolation between different groups of people accessing same Airflow instance, not to support case where the same group of people would like to manage different variants of the same environment.

Architecture

The Multi team Airflow extends “Sepearate DAG processing” architecture described in the Overall Airflow Architecture.

Current "Separate DAG processing" architecture overview

The "Separate DAG processing" brings a few isolation features, but does not addresses a number of those. The features it brings is that execution of code provided by DAG authors can happen in a separate, isolated perimeter from scheduler and webserver. This means that today you can have a deployment of Airflow where code that DAG author submits is never executed in the same environment where Scheduler and Webserver are running. The isolation features it does not bring - are lack of Database access isolation and inability of DAG authors to isolate code execution from one another. Which means that DAG authors can currently write code in their DAGs that can modify directly Airflow's database, and allows to interact (including code injection, remote code execution etc.) with the code that other DAG authors submit. Also there are no straightforward mechanisms to limit access to the Operations / UI actions - currently the way how permissions are managed are "per-individual-DAG" and while it is possible to apply some tooling and permissions syncing to apply permissions to groups of DAGs, it's pretty clunky and poorly understood.

Proposed target architecture with multi-team setup

The target architecture is implemented in two stages

Stage 1: Without DB access isolation
Stage 2: With DB access isolation

There are two highly requested features of multi-team setup - workload isolation and DB isolation.

The first stage of implementation focuses on isolating the workload between teams. This is a relatively small change comparing to the current Aiflow, that mostly focuses on isolating Dag File Processing and Task/Triggerer execution so that the code from one tenant can have different dependencies for different teams, the teams can have separate executors configured and their code can execute in isolation from other teams. It does not provide database isolation that prevents user from one team to modify Airflow Database.

Multi-team setup without DB isolation

What this stage achieves:

allows separate set of dependencies for different teams
by having separate configuration and disabling MetaDB accesss to Connections/Variables it isolates credentials and secrets used by each team
provides full isolation of code workloads between teams - where code from one team is executed in a separate security perimeter
does not provide DB isolation - DAG authors from one team might still modify Airflow Metadata DB for the whole deployment

Multi-team setup with DB isolation

adds DB isolation (via GRPC API server) that only allows DAG authors from a team to modify DAG and related resources belonging to the team

Implementation proposal

Managing multiple teams at Deployment level

Multi-tenancy is a feature that is available to Organization Deployment Managers who manage the whole deployment environment and they SHOULD be able to apply configuration, networking features and deploy airflow components belonging to each team in an isolated security perimeter to make sure they “team” environment has no access to Airflow Database other than via Internal API / GRPC API component.

It’s up to the Organization Deployment manager, to create and prepare the deployment in a multi-team way. Airflow components will perform consistency checks for the configuration - verifying the presence of appropriate folders, per-team configuration etc., but it will not provide separate tools and mechanisms to manage (Add / remove / rename teams).

Executor support

The implementation utilizes AIP-61 (Hybrid Execution) support where each team can have their own set of executors defined (with separate configuration). While multi-team deployment will work with multiple Local Executors, it SHOULD only be used with Local Executor for testing purposes, because it does not provide execution isolation for DAGs.

Since each team has their own per-team configuration, in case of remote executors, credentials, brokers, k8s namespaces etc. can be separated and isolated from each other.

When scheduler starts, it instantiates all executors of all teams configured and passes each of them configuration defined in per-team configuration.

In the case of Celery Executor, several teams will use a separate broker and result backend. For the future we MAY consider using a single broker/result backend and use `team:` prefixed queues, but this is out of scope from this implementation.

Changes in configuration

Multi-tenancy of scheduler and webserver is controlled by a "core/multi-team" bool flag (default False).

Each team configuration SHOULD be a separate configuration file or separate set of environment variables, holding team-specific configuration needed by DAG file processor, Workers and Triggerer. This configuration might have some data duplicated from the global configuration defined for webserver and scheduler, but in general the "global" airflow configuration for the organization should not be shared with "teams".

Internal API / GRPC API components will have to have configuration allowing them to communicate with Airflow DB. This configuration should only be accessible by the internal API component and should not be shared with other components that are run inside the team. This should be part of the deployment configuration, the components should be deployed with isolation that does not allow the code run in DAG file processor, Worker or Triggerrer to be able to retrieve that configuration. Similarly none of the team components (including internal API component) should be able to retrieve "Global" configuration used by Scheduler and webserver.

The multi-executor configuration should be extended to allow for different sets of executors to be created and configured by different teams (and having separate configuration). That configuration is managed by the Organization Deployment Manager. This extends the configuration described in the AIP-61 Hybrid Execution. The proposal is to change the AIRFLOW__CORE__EXECUTOR env variable/configuration into a dictionary:

AIRFLOW__CORE__EXECUTOR = '{
   "team1": [
        {
           "LocalExecutor": {
               "name": "executor 1",
               "Configuration_key1" : "value",
               "Configuration_key1" : "value",
           },
           "CeleryExecutor": {
               "name": "executor 2", 
               "Configuration_key1" : "value",
               "Configuration_key1" : "value",
           },
        },
        ...
   ],
   "team2": [
        ...
   ]
}'

This also opens up the possibility of having more than one executor of the same type defined for the same team, extending capabilities of AIP-61 Hybrid Execution. In the follow up AIP we might also allowing a "global" executor entry for non-multi-team environments and we MAY extend it even further to allow combined "global" and "multi-team" deployment into a single deployment, but this is a non-goal of this AIP.

This configuration is not backwards-compatible with the single-team configuration. The "core"/ "multi-team" (bool) flag controls which configuration format is used and generally switching airflow in multi-team mode.. In the future, if we decide to bring the capability of having multiple executors of the same type to single-team deployment we might consider applying similar flag/change to allow multiple executors of the same type in "multi-team" = False mode.

Connections and Variables access control

In multi-team deployment, reading Connections and Variables from the meta-data DB is disabled, similarly Connection and Variable menu is disabled. Each team has their own configuration specifying the Secrets Manager they use and has access to their own specific connections and variables.

Dataset triggering access control

While any DAG can produce events that are related to any dataset, the DAGs consuming the datasets will - by default - only receive events that are triggered in the same team or by an API call from a user that is recognized to belong to the same team by Auth Manager. DAG authors could specify a list of teams that should additionally be allowed to trigger the DAG by adding a new parameter.

allow_triggering_by_teams = [ "team_1", "team_2" ]

Changes in the metadata database

The structure of the metadata database SHOULD remain unchanged for the team flag introduction. Instead of adding a separate field (team_id) - dag_id are namespaced with `<team>:` prefix. This allows for minimum changes in Airflow scheduler and UI code, while allowing for unique dag_ids to be used by scheduler and UI. This is also the way to avoid conflicts where DAGs with the same dag_id are created by different teams. The prefix is added by DAG File Processors when parsing dags. (when DAG file processor is run with `--team` flag. DAG File Processor only runs callbacks with the same team prefix as `--team` flag

Airflow scheduler runs scheduling unmodified - as today - with the difference that it will choose the right sets of executors to send the dags to, based on the team of the DAG retrieved from dag_id.

We MAY consider adding per-team concurrency control (later) and prioritization of scheduling per team but it is out of the scope of this change. It MAY even turn to be not needed, providing that we will address any potential scheduling fairness problem in the current scheduling algorithm, without the need of having a specific per-team prioritization and separation of scheduling per-team.

The current separation of scheduling and execution in Airflow internal architecture, SHOULD cope pretty well with fairnes of scheduling - currently Airflow copes pretty well with “fair” scheduling decisions involving hundreds and thousands of independent DAGs and there are very few issues - if any - connected with Airflow not being able to create DAG runs and mark tasks as ready for execution, and there seems to be not many cases where scheduling decisions are starved between DAGs. Therefore it seems reasonable to assume that this algorithm continue to be fair and efficient, when multiple teams are involved. It’s worth noting that scheduling decisions are made mostly upfront - DAG runs are created before tasks are eligible for execution, so bulk of the work necessary to prepare tasks for execution is not time-critical and Airflow can efficiently scale to tens of thousands of DAGs with those. Also the scheduling decision capacity can be increased by using multiple schedulers in Active-Active HA setup, which means that generally speaking it does not really matter how many teams there are configured - overall scheduling capacity depends on the number of DAGs and tasks handled by the Airlfow instance - and increasing that capacity is a non-goal of this proposal.

The more important aspect here which does impact the fairness and potential starvation problem between teams is addressed by utilizing multiple executors (and separate executors per team). It’s the executor that handles the actual workload of performing “task execution” once the task are ready to be executed, and properly isolating the executors is far more important in the context of isolation and starvation than isolating scheduling decisions .

Internal API / GRPC API component will only allow accessing dags/trigger tables with team: prefix when run with `--team` flag. This way only specific team components using internal API team components will have access to team-specific metadata entries.

UI modifications

Changes to the UI, filtering is done by AuthManager based on presence of team: in dag_id - the way how to determine which team prefixes are allowed for the user SHOULD be implemented separately by each AuthManager.

Team flag

Dag file processor, Workers Triggerers and Internal API will have a new `–team` flag which limits access to the team-specific data.

For the DAG file processor, the flag enables adding `team:` prefix to dag_id.

For the internal API component, the flag will only allow access to team-specific data (using prefixes in dag_id).

For worker component, the flag will only be used for diagnostics purposes, each team SHOULD have their own set of workers, broker, result backend, so there is no need to change worker implementation (until we implement capability of sharing broker/result backend between teams, which is out of scope of this change)

Separation of access to the internal-api is done on a deployment level - each team component will only be ab

le to connect to the corresponding team internal-api component, this can be done both by physical/network separation of security perimeters and by signing/verifying calls to internal-api using per-team key. It is important to stress from the security point of view - that security model assumed in this AIP gives full access to "own team" resources for any workload that belongs to the team, but it prevents to modify any "other" team resources. The security perimeter is not on the DAG or TASK level, but on the team level. AIP-44 does not make assumptions about Authorisation and Authentication in the Internal API, it was specifically left to follow-up AIPs. This AIP's goal is to implement authentication and authorisation on "team" level.

Per-team deployment

Since each team is deployed in its own security perimeter and with own configuration, the following properties of deployment can be defined per-team:

set of dependencies (possibly container images) used by each team (each component belonging to the team can have different set of dependencies)
credential/secrets manager configuration can be specified separately for each team separately in their configuration
team components MUST only have access their own team configuration, not to the configuration of other teams

This introduces a separate Parsing/Runtime environment that is specific "per-team" and not shared with other teams. While the environments are separate and isolated, there are some limits for the environment that introduce some coupling between them:

All the environments of all teams have to have the same Airflow version installed (exact patch-level version)
Dependencies installed by each team cannot conflict with that of Airflow core dependencies or other providers and libraries instaled in the same environment
No connections or variables are visible in UI of multi-team Airflow. All Connections and Variables must come from Secrets Manager

Roles of Deployment managers

There are two kinds of Deployment Managers in the multi-team Airflow Architecture: Organisation Deployment Managers and team Deployment Managers.

Organization Deployment Managers

Organization Deployment Managers are responsible for designing and implementing the whole deployment including defining teams and defining how security perimeters are implementing, deploying firewalls and physical isolation between teams and figure out how to connect the identity and authentication systemst of the organisation with Airflow deployment. They also manage common / shared Airflow configuration, Metadata DB, the Airflow Scheduler and Webserver runtime environment including appropriate packages and plugins (usually appropriate container images), Manage running Scheduler and Webserver. The design of their deployment has to provide appropriate isolation betweeen the security perimeters. Both physical isolation of the workloads run in different security perimeters but also implementation and deployment ot the appropriate connectivity rules between different team perimeters. The rules implemented has to isolate the components running in different perimeters, so that those components which need to communicate outside of their security perimeter can do it, and make sure the components cannot communicate with components outside of their security perimeters when it's not needed. This means for example that it's up to the Organisation Deployment manager to figure out the deployment setup where Internal APIs / GRPC API components

running inside of the team security perimeters are the only componets there that can commmunicate with the metadata Database - all other components should not be able to communicate with the Database directly - they should only be able to communicate with the Internal API / GRPC API component running in the same security perimeter (i.e. the team they run in).

When it comes to integration of organisation's identity and authentication systems with airflow such integration has to be performed by the Deployment Manager in two areas:

Implementation of the Auth Manager integrated with organization's identity system that applies appropriate rules that the operations users that should have permissions to access specific team resources, shoudl be able to do it. Airflow does not provide any specific implementation for that kind of Auth Manager (and this AIP will not change it) - it's entirely up to the Auth Manager implementation to assign appropriate permissiosn. The Auth Manager API of Airflow implemented as part of AIP-56 providers all the necessary information to take the decision by the implementation of such organisation-specific Auth Manager
Implementation of the permissions for the DAG authors. Airflow completely abstract away from the mechanisms used to limit permissions to folders belonging to each team. There are no mechanisms implemented by Airflow itself - this is no different than today, where Airflow abstracts away from the permissions needed to access the whole DAG folder. It's up to the Deployment Manager to define and implement appropriate rules, group access and integration of the access with the organisation's identity, authentication, authorisation system to make sure that only users who should access the team DAG files have appropriate access. Choosing and implementing mechanism to do that is outside of the scope of this AIP.
Configuration of per-team executors in "common" configuration - while Tenatn Deployment Managers can make decisions on runtime environment they use, only Organization Deployment Manager can change the executors configured for each team.

Team Deployment Managers

Team Deployment Manager is a role that might or might be given to other people, but it can also be performed by Organization Deployment Managers. The role of team Deployment Manager is to manage configuration and runtime environment for their team - that means set of pacakges and plugins that should be deployed in team runtime environment (where Airflow version should be the same as the one in the common environment) and to manage configuration specific "per team". They might also be involved in managing access to DAG subfolders that are assigned to the team, but the scope and way of doing it is outside of the scope of this AIP. team Deployment managers cannot on their own decide on the set of executors configured for their team. Such configuraiton and decisions should be implemented by the Organization Deployment Manager.

Team Deployment Managers can control resources used to handle tasks - by controlling the size and nodes of K8S clusters that are on the receiving end of K8S executor, or by controlling number of worker nodes handling their specific Celery queue or by controlling resources of any other receiving end of executors they use (AWS/Fargate for example). In the current state of the AIP and configuration team Deployment Manager must involve Organization Deployment Managers to change certain aspects of configuration for executors (for example Kubernetes Pod templates used for their K8S executor) - however nothing prevents future changes to the executors to be able to derive that configuration from remote team configuration (for example KPO template could be deployed as Custom Resources at teh K8S cluster used by the K8S executor and could be pulled by the executor.

Team Deployment Managers must agree the resources dedicated for their executors with the Organization Deployment Manager. While currently executors are run inside the process of Scheduler as sub-processes, there is little control over the resources they used, as a follow-up to this AIP we might implement more fine-grained resource control for teams and executors used.

Support for custom connections in Airflow UI

in multi-teeam setup, connections and variable menus are disabled. Connections and Variables must come from Secrets Managers - never from the Database.

Why is it needed?

The idea of multi-tenancy has been floating in the community for a long itme, however it was ill-defined. It was not clear who would be the target users and what purpose would multi-tenancy serve - however for many of our users this meant "multi-team"

This document aims to define multi-team as a way to streamline organizations who either manage themselves or use a managed Airflow to setup an instance of Airflow where they could manage logically separated teams of users - usually internal teams and departments within the organization. The main reason for having multi-team deploymnent of Airflow is achieving security and isolation between the teams, coupled with ability of the isolated teams to collaborate via shared Datasets. Those are main, and practically all needs that proposed multi-team serves. It's not needed to save resources or maintenance cost of Airflow installation(s) but rather streamlining and making it easy to both provide isolated environment for DAG authoring and execution "per team" but within the same environment which allows for deploying "organisnation" wide solutions affecting everyone and allowing everyone to easily connect dataset workflows coming from different teams.

Are there any downsides to this change?

Increased complexity of Airflow configuration, UI and Multiple Executors.
Complexity of deployment consisting of a number of separate isolated environment.
Increased responsibilty of maintainers with regards to isolation and taking care about setting security perimeter boundaries

Which users are affected by the change?

The change does not affect users who do not enable multi-team mode. There will be no database changes affecting such users and no functional changes to Airflow.
Users who want to deploy airflow with separate subsets of DAGs in isolated teams - particularly those who wish to provide a unified environment for separate teams within their organization - allowing them to work in isolated, but connected environment.
Database migration is REQUIRED as well as specific deployment managed by the Organization Deployment Manager has to be created

What defines this AIP as "done"?

All necessary Airflow components expose –team flags and all airflow components provide isolation between the team environment (DAGs, UI, secrets, executors, ....)
Documentation describes how to deploy multi-team environment
Implemented simple reference implementation “demo” authentication and deployment mechanism in multi-team deployment (development only)

What is excluded from the scope?

Sharing broker/backend for celery executors between teams. This MAY be covered by future AIPs
Implementation of FAB-based multi-team Auth Manager. This is unlikely to happen in the community as we are moving away from FAB as Airflow's main authentication and authorisation mechanism.
Implementation of generic, configurable, multi-team aware Auth Manager suitable for production. This is not likely to happen in the future, unless the community will implement and release a KeyCloak (or simillar) Auth Manager. It's quite possible however, that there might be 3rd-party Auth Managers deployed that will provide such features.
Per-team concurrency and prioritization of tasks. This is unlikely to happen in the future unless we find limitations in the current Airlflow scheduling implementation. Note that there are several phases in the process of task execution in Airflow. 1) Scheduling - where scheduler prepares dag runs and task instances in the DB so that they are ready for execution, 2) queueing - where scheduled tasks are picked by executors and made eligible for running 3) actual execution where (having sufficient resources) executors make sure that the tasks are picked from the queue and executed. We believe that 1) scheduling is suffificiently well implement in Airflow to avoid starvation betweeen teams and 2) and 3) are handled by separate executors and environments respectively. Since each team will have own set of executors, and own execution environment, there is no risk of starvation between teams in those phases as wel, and there is no need to implement separte prioritisation.
Resource allocation per-executor. In the current proposal, executors are run as sub-processes of Scheduler and we have very little control over their individual resource usage. This should not cause problems as generally resource needs for executors is limited, however in more complex cases and deployments it migh become necessary to limit those resources on a finer-grained level (per executor or per all executors used by the team). This is not part of this API and will likely be investigated and discussed as follow-up.
Turn-key multi-team Deployment of Airflow (for example via Helm chart). This is unlikely to happen. Usually multi-team deployments require a number of case-sepecific decisions and organization-specific integrations (for example integration with organization's identity services, per-team secrets managements etc. that make it unsuitable to have 'turn-key' solutions for such deployments.
team management tools (creation, removal, rename etc.). This is unlikely to happen in the future, but maybe we will be able to propose some tooling based on experience and expectations of our users after they start using multi-team setup. The idea behind multi-team proposed in this AIP is not to manage dynamic list of teams, but fairly static one, that changes rarely. Configuration, naming decisions in such case can be done with upfront deliberation and some of the steps there (creating configurations etc. ) can be easily semi-automated by the Organization Deployment Managers.
Combining "global" execution with "team" execution. While it should be possible in the proposed architecture to have a "team" execution and "global" execution in a single instance of Airflow, this has it's own unique set of challenges and assumption is that Airflow Deployment is either "global" (today) or "multi-team" (After this AIP is implemented) - but it cannot be combined (yet). This is possible to be implemented in the future.
Isolation of Code execution and DB for DAGs submitted by different authors belonging to the same team. In the proposed AIP security perimeter is at the boundaries of team, not at the boundaries of single DAG or what DAG author ownership is. Assumption is the all DAGs belonging to the team share the same access and permission mechanisms from the DAG author point of view. This MAY change in the future and more fine-grained security perimeters might be implemented but this is not a goal of this AIP.

Space shortcuts

Page tree