Improve and add more complete description in the architecture diagrams (#36513)

When it comes to access and management, plugins follow a different
pattern than DAG files. While DAG files can (and should) be modified by
DAG authors, the whole idea of plugins was to make it only possible to
modify the plugins folder (and install plugin-enabled packages) by the
Deployment Managers, not DAG authors.

The difference is quite important because even in the simplest
installation, the Airflow webserver never needs to access DAG files,
while it should be able to access plugins.

This is even more pronounced in environments (leading in the future to
multi-tenant deployments) where plugins are not "per-tenant" - they must
be installed and managed by the Deployment Manager, because those
plugins can be used by Airflow webservers.

In the future we might want to make a distinction between these two
different types of plugins, because theoretically it would be possible
to distinguish "scheduler, worker & triggerer" plugins from "webserver"
plugins - however we do not have such a distinction today, and whoever
manages the plugins folder impacts both the webserver and the workers.

This change also re-adds the "basic" architecture, which is targeted at
single-user and single-machine deployments, and presents it as the
first architecture that the user encounters - which might make it more
digestible. It also explains that this is a simplified architecture
and is followed by more complete and complex deployment scenarios
involving distributed architecture, different user roles and security
boundaries.
potiuk authored Jan 5, 2024
1 parent 16d16e2 commit c47dcc5
Showing 13 changed files with 330 additions and 60 deletions.
153 changes: 134 additions & 19 deletions docs/apache-airflow/core-concepts/overview.rst
@@ -18,49 +18,164 @@
Architecture Overview
=====================

Airflow is a platform that lets you build and run *workflows*. A workflow is represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces of work called :doc:`tasks`, arranged with dependencies and data flows taken into account.
Airflow is a platform that lets you build and run *workflows*. A workflow is represented as a
:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces of work called
:doc:`tasks`, arranged with dependencies and data flows taken into account.

.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph

A DAG specifies the dependencies between Tasks, and the order in which to execute them and run retries; the Tasks themselves describe what to do, be it fetching data, running analysis, triggering other systems, or more.
A DAG specifies the dependencies between tasks, which defines the order in which to execute the tasks.
Tasks describe what to do, be it fetching data, running analysis, triggering other systems, or more.
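
For illustration, here is a minimal sketch of such a DAG - two tasks where one must finish before the
other starts (the DAG and task ids are made up for this example):

.. code-block:: python

    # A minimal, illustrative DAG: "fetch" must finish before "analyze" starts.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(dag_id="example_fetch_and_analyze", start_date=datetime(2024, 1, 1), schedule=None):
        fetch = BashOperator(task_id="fetch", bash_command="echo fetching data")
        analyze = BashOperator(task_id="analyze", bash_command="echo running analysis")
        fetch >> analyze  # dependency: run "analyze" only after "fetch" succeeds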

An Airflow installation generally consists of the following components:
Airflow itself is agnostic to what you're running - it will happily orchestrate and run anything,
either with high-level support from one of our providers, or directly as a command using the shell
or Python :doc:`operators`.

* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which handles both triggering scheduled workflows, and submitting :doc:`tasks` to the executor to run.
Airflow components
------------------

* An :doc:`executor <executor/index>`, which handles running tasks. In the default Airflow installation, this runs everything *inside* the scheduler, but most production-suitable executors actually push task execution out to *workers*.
Airflow's architecture consists of multiple components. The following sections describe each component's
function and whether it is required for a bare-minimum Airflow installation, or is an optional component
that helps achieve better Airflow extensibility, performance, and scalability.

* A *triggerer*, which executes deferred tasks - executed in an async-io event loop.
Required components
...................

* A *webserver*, which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.
A minimal Airflow installation consists of the following components:

* A folder of *DAG files*, read by the scheduler and executor (and any workers the executor has)
* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which handles both triggering scheduled
workflows, and submitting :doc:`tasks` to the executor to run. The :doc:`executor <executor/index>` is
a configuration property of the *scheduler*, not a separate component, and runs within the scheduler
process. There are several executors available out of the box, and you can also write your own
(see the configuration sketch after this list).

* A *metadata database*, used by the scheduler, executor and webserver to store state.
* A *webserver*, which presents a handy user interface to inspect, trigger and debug the behaviour of
DAGs and tasks.

* A folder of *DAG files*, which is read by the *scheduler* to figure out what tasks to run and when
to run them.

Basic airflow architecture
--------------------------
* A *metadata database*, which Airflow components use to store the state of workflows and tasks.
Setting up a metadata database is described in :doc:`/howto/set-up-database` and is required for
Airflow to work.
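
As a quick sketch of how these required components are wired together by configuration, you can inspect
the relevant settings from Python. The option names below follow a typical ``airflow.cfg`` layout
(``[core]`` and ``[database]`` sections); verify them against your Airflow version:

.. code-block:: python

    # Minimal sketch: print the settings that tie the required components together.
    # Option names ([core] executor, [core] dags_folder, [database] sql_alchemy_conn)
    # are the usual ones in recent Airflow versions; adjust for your installation.
    from airflow.configuration import conf

    print(conf.get("core", "executor"))  # the executor runs within the scheduler process
    print(conf.get("core", "dags_folder"))  # the folder of DAG files read by the scheduler
    print(conf.get("database", "sql_alchemy_conn"))  # connection to the metadata database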

This is the basic architecture of Airflow that you'll see in simple installations:
Optional components
...................

Some Airflow components are optional and can enable better extensibility, scalability, and
performance in your Airflow:

* Optional *worker*, which executes the tasks given to it by the scheduler. In the basic installation
the worker might be part of the scheduler, not a separate component. It can be run as a long-running process
in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
:doc:`KubernetesExecutor <executor/kubernetes>`.

* Optional *triggerer*, which executes deferred tasks in an asyncio event loop. In a basic installation
where deferred tasks are not used, a triggerer is not necessary. More about deferring tasks can be
found in :doc:`/authoring-and-scheduling/deferring`.

* Optional *dag processor*, which parses DAG files and serializes them into the
*metadata database*. By default, the *dag processor* process is part of the scheduler, but it can
be run as a separate component for scalability and security reasons. If the *dag processor* is present,
the *scheduler* does not need to read the *DAG files* directly. More about
processing DAG files can be found in :doc:`/authoring-and-scheduling/dagfile-processing`.

* Optional folder of *plugins*. Plugins are a way to extend Airflow's functionality (similar to installed
packages). Plugins are read by the *scheduler*, *dag processor*, *triggerer* and *webserver*. More about
plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
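
For example, a minimal plugin dropped into the *plugins* folder could look like the sketch below
(the file and class names are purely illustrative; see :doc:`/authoring-and-scheduling/plugins` for
what plugins can actually register):

.. code-block:: python

    # plugins/my_company_plugin.py - an illustrative, minimal plugin skeleton.
    # Placed in the configured plugins folder, it is discovered by the scheduler,
    # dag processor, triggerer and webserver.
    from airflow.plugins_manager import AirflowPlugin


    class MyCompanyPlugin(AirflowPlugin):
        # The name under which the plugin is registered with Airflow.
        name = "my_company_plugin"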

Deploying Airflow components
----------------------------

All the components are Python applications that can be deployed using various deployment mechanisms.

They can have extra *installed packages* in their Python environment. This is useful, for example, to
install custom operators or sensors, or to extend Airflow functionality with custom plugins.

While Airflow can be run on a single machine with a simple installation where only the *scheduler* and
*webserver* are deployed, Airflow is designed to be scalable and secure, and is able to run in a distributed
environment - where various components can run on different machines, with different security perimeters,
and can be scaled by running multiple instances of the components above.

The separation of components also allows for increased security, by isolating the components from each other
and by allowing them to perform different tasks. For example, separating the *dag processor* from the
*scheduler* makes sure that the *scheduler* does not have access to the *DAG files* and cannot execute
code provided by the *DAG author*.

Also, while a single person can run and manage an Airflow installation, an Airflow deployment in a more
complex setup can involve various roles of users that interact with different parts of the system, which is
an important aspect of a secure Airflow deployment. The roles are described in detail in the
:doc:`/security/security_model` and generally speaking include:

* Deployment Manager - a person that installs and configures Airflow and manages the deployment
* DAG author - a person that writes DAGs and submits them to Airflow
* Operations User - a person that triggers DAGs and tasks and monitors their execution

Architecture Diagrams
---------------------

The diagrams below show different ways to deploy Airflow - gradually from the simple "one machine",
single-person deployment, to a more complex deployment with separate components, separate user roles and,
finally, more isolated security perimeters.

The meaning of the different connection types in the diagrams below is as follows:

* **brown solid lines** represent *DAG files* submission and synchronization
* **blue solid lines** represent deploying and accessing *installed packages* and *plugins*
* **black dashed lines** represent control flow of workers by the *scheduler* (via executor)
* **black solid lines** represent accessing the UI to manage execution of the workflows
* **red dashed lines** represent accessing the *metadata database* by all components

Basic Airflow deployment
........................

This is the simplest deployment of Airflow, usually operated and managed on a single
machine. Such a deployment usually uses the LocalExecutor, where the *scheduler* and the *workers* are in
the same Python process and the *DAG files* are read directly from the local filesystem by the *scheduler*.
The *webserver* runs on the same machine as the *scheduler*. There is no *triggerer* component, which
means that task deferral is not possible.

Such an installation typically does not separate user roles - deployment, configuration, operation, authoring
and maintenance are all done by the same person and there are no security perimeters between the components.

.. image:: ../img/diagram_basic_airflow_architecture.png

Most executors will generally also introduce other components to let them talk to their workers - like a task queue - but you can still think of the executor and its workers as a single logical component in Airflow overall, handling the actual task execution.
If you want to run Airflow in a simple single-machine setup, you can skip the
more complex diagrams below and go straight to the :ref:`overview:workloads` section.

Distributed Airflow architecture
................................

This is the architecture of Airflow where components of Airflow are distributed among multiple machines
and where various roles of users are introduced - **Deployment Manager**, **DAG author** and
**Operations User**. You can read more about those various roles in the :doc:`/security/security_model`.

In the case of a distributed deployment, it is important to consider the security aspects of the components.
The *webserver* does not have access to the *DAG files* directly. The code in the ``Code`` tab of the
UI is read from the *metadata database*. The *webserver* cannot execute any code submitted by the
**DAG author**. It can only execute code that is installed as an *installed package* or *plugin* by
the **Deployment Manager**. The **Operations User** only has access to the UI and can only trigger
DAGs and tasks, but cannot author DAGs.

The *DAG files* need to be synchronized between all the components that use them - the *scheduler*,
the *triggerer* and the *workers*. The *DAG files* can be synchronized by various mechanisms - typical
ways DAGs can be synchronized are described in :doc:`helm-chart:manage-dags-files` of our
Helm Chart documentation. The Helm chart is one of the ways to deploy Airflow in a K8S cluster.

Airflow itself is agnostic to what you're running - it will happily orchestrate and run anything, either with high-level support from one of our providers, or directly as a command using the shell or Python :doc:`operators`.
.. image:: ../img/diagram_distributed_airflow_architecture.png

Separate DAG processing architecture
------------------------------------
....................................

In a more complex installation where security and isolation are important, you'll also see the standalone **dag file processor** component that allows to separate scheduler from accessing DAG file. This is suitable if the
deployment focus is on isolation between parsed tasks. While Airflow does not yet support full multi-tenant features, it can be used to make sure that DAG-author provided code is never executed in the context of the scheduler.
In a more complex installation where security and isolation are important, you'll also see the
standalone *dag processor* component, which separates the *scheduler* from access to the *DAG files*.
This is suitable if the deployment focus is on isolation between parsed tasks. While Airflow does not yet
support full multi-tenant features, it can be used to make sure that code provided by the **DAG author** is
never executed in the context of the scheduler.

.. image:: ../img/diagram_dag_processor_airflow_architecture.png
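
Whether a deployment expects the standalone *dag processor* is driven by configuration. As a minimal
sketch - assuming the ``[scheduler] standalone_dag_processor`` option available in recent Airflow
versions - you can check it from Python:

.. code-block:: python

    # Minimal sketch: check whether this deployment runs a standalone dag processor.
    # Assumes the [scheduler] standalone_dag_processor option of recent Airflow versions.
    from airflow.configuration import conf

    if conf.getboolean("scheduler", "standalone_dag_processor", fallback=False):
        print("DAG files are parsed by a separate dag processor component")
    else:
        print("DAG files are parsed within the scheduler process")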

You can read more about the different types of users and how they interact with Airflow and how the
security model of Airflow access look like in the :doc:`/security/security_model`
.. _overview:workloads:

Workloads
---------
@@ -1 +1 @@
ac9bd11824e7faf5ed5232ff242c3157
cc2aca72cb388d28842e539f599d373c
Binary file modified docs/apache-airflow/img/diagram_basic_airflow_architecture.png
64 changes: 39 additions & 25 deletions docs/apache-airflow/img/diagram_basic_airflow_architecture.py
@@ -21,55 +21,69 @@
from diagrams import Cluster, Diagram, Edge
from diagrams.custom import Custom
from diagrams.onprem.client import User
from diagrams.onprem.database import PostgreSQL
from diagrams.programming.flowchart import MultipleDocuments
from diagrams.programming.language import Python
from rich.console import Console

MY_DIR = Path(__file__).parent
MY_FILENAME = Path(__file__).with_suffix("").name
PYTHON_MULTIPROCESS_LOGO = MY_DIR.parents[1] / "diagrams" / "python_multiprocess_logo.png"
PACKAGES_IMAGE = MY_DIR.parents[1] / "diagrams" / "packages.png"
DATABASE_IMAGE = MY_DIR.parents[1] / "diagrams" / "database.png"
MULTIPLE_FILES_IMAGE = MY_DIR.parents[1] / "diagrams" / "multiple_files.png"

console = Console(width=400, color_system="standard")

graph_attr = {
"concentrate": "false",
"splines": "splines",
}

edge_attr = {
"minlen": "2",
}


def generate_basic_airflow_diagram():
image_file = (MY_DIR / MY_FILENAME).with_suffix(".png")

console.print(f"[bright_blue]Generating architecture image {image_file}")
with Diagram(
name="", show=False, direction="LR", curvestyle="ortho", filename=MY_FILENAME, outformat="png"
name="",
show=False,
direction="LR",
filename=MY_FILENAME,
outformat="png",
graph_attr=graph_attr,
edge_attr=edge_attr,
):
with Cluster("Parsing & Scheduling"):
schedulers = Custom("Scheduler(s)", PYTHON_MULTIPROCESS_LOGO.as_posix())

metadata_db = PostgreSQL("Metadata DB")
user = User("Airflow User")

dag_author = User("DAG Author")
dag_files = MultipleDocuments("DAG files")
dag_files = Custom("DAG files", MULTIPLE_FILES_IMAGE.as_posix())
user >> Edge(color="brown", style="solid", reverse=False, label="author\n\n") >> dag_files

dag_author >> Edge(color="black", style="dashed", reverse=False) >> dag_files
with Cluster("Parsing, Scheduling & Executing"):
scheduler = Python("Scheduler")

with Cluster("Execution"):
workers = Custom("Worker(s)", PYTHON_MULTIPROCESS_LOGO.as_posix())
triggerer = Custom("Triggerer(s)", PYTHON_MULTIPROCESS_LOGO.as_posix())
metadata_db = Custom("Metadata DB", DATABASE_IMAGE.as_posix())
scheduler >> Edge(color="red", style="dotted", reverse=True) >> metadata_db

schedulers - Edge(color="blue", style="dashed", taillabel="Executor") - workers
plugins_and_packages = Custom(
"Plugin folder\n& installed packages", PACKAGES_IMAGE.as_posix(), color="transparent"
)

schedulers >> Edge(color="red", style="dotted", reverse=True) >> metadata_db
workers >> Edge(color="red", style="dotted", reverse=True) >> metadata_db
triggerer >> Edge(color="red", style="dotted", reverse=True) >> metadata_db
user >> Edge(color="blue", style="solid", reverse=False, label="install\n\n") >> plugins_and_packages

operations_user = User("Operations User")
with Cluster("UI"):
webservers = Custom("Webserver(s)", PYTHON_MULTIPROCESS_LOGO.as_posix())
webserver = Python("Webserver")

webserver >> Edge(color="black", style="solid", reverse=True, label="operate\n\n") >> user

metadata_db >> Edge(color="red", style="dotted", reverse=True) >> webserver

webservers >> Edge(color="black", style="dashed", reverse=True) >> operations_user
dag_files >> Edge(color="brown", style="solid", label="read\n\n") >> scheduler

metadata_db >> Edge(color="red", style="dotted", reverse=True) >> webservers
plugins_and_packages >> Edge(color="blue", style="solid", label="install\n\n") >> scheduler
plugins_and_packages >> Edge(color="blue", style="solid", label="install\n\n") >> webserver

dag_files >> Edge(color="brown", style="solid") >> workers
dag_files >> Edge(color="brown", style="solid") >> schedulers
dag_files >> Edge(color="brown", style="solid") >> triggerer
console.print(f"[green]Generating architecture image {image_file}")


@@ -1 +1 @@
e189c45f79a7a878802bde13be27a112
00f67a1e0cd073ba521da168dc80ccaa
