Improve and add more complete description in the architecture diagrams (#36513)

When it comes to access and management, plugins follow a different
pattern than DAG files. While DAG files can (and should) be modified by
DAG authors, the whole idea of plugins was to make it only possible to
modify the plugins folder (and install plugin-enabled packages) by the
Deployment Managers, not DAG authors.

The difference is quite important because even in the simplest
installation, the Airflow webserver never needs to access DAG files,
while it should be able to access plugins.

This is even more pronounced in environments (leading in the future to
multi-tenant deployments) where plugins are not "per-tenant" - they must
be installed and managed by the Deployment Manager, because those
plugins can be used by Airflow webservers.

In the future we might want to make a distinction between these two
different types of plugins, because theoretically it would be possible
to distinguish "scheduler, worker & triggerer" plugins from "webserver"
plugins - however we do not have such a distinction today, and whoever
manages the plugins folder impacts both the webserver and the workers.

This change also re-adds the "basic" architecture, which is targeted at
single-user and single-machine deployments, and presents it as the
first architecture that the user encounters - which might make it more
digestible. It also explains that this is a simplified architecture
and is followed by more complete and complex deployment scenarios
involving distributed architecture, different user roles and security
boundaries.
potiuk authored Jan 5, 2024
1 parent 16d16e2 commit c47dcc5
Showing 13 changed files with 330 additions and 60 deletions.
153 changes: 134 additions & 19 deletions docs/apache-airflow/core-concepts/overview.rst
@@ -18,49 +18,164 @@
Architecture Overview
=====================

Airflow is a platform that lets you build and run *workflows*. A workflow is represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces of work called :doc:`tasks`, arranged with dependencies and data flows taken into account.
Airflow is a platform that lets you build and run *workflows*. A workflow is represented as a
:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces of work called
:doc:`tasks`, arranged with dependencies and data flows taken into account.

.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph

A DAG specifies the dependencies between Tasks, and the order in which to execute them and run retries; the Tasks themselves describe what to do, be it fetching data, running analysis, triggering other systems, or more.
A DAG specifies the dependencies between tasks, which defines the order in which to execute the tasks.
Tasks describe what to do, be it fetching data, running analysis, triggering other systems, or more.
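
For illustration, here is a minimal sketch of such a DAG - two tasks where one must finish before the
other starts (the DAG and task ids are made up for this example):

.. code-block:: python

    # A minimal, illustrative DAG: "fetch" must finish before "analyze" starts.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(dag_id="example_fetch_and_analyze", start_date=datetime(2024, 1, 1), schedule=None):
        fetch = BashOperator(task_id="fetch", bash_command="echo fetching data")
        analyze = BashOperator(task_id="analyze", bash_command="echo running analysis")
        fetch >> analyze  # dependency: run "analyze" only after "fetch" succeeds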

An Airflow installation generally consists of the following components:
Airflow itself is agnostic to what you're running - it will happily orchestrate and run anything,
either with high-level support from one of our providers, or directly as a command using the shell
or Python :doc:`operators`.

* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which handles both triggering scheduled workflows, and submitting :doc:`tasks` to the executor to run.
Airflow components
------------------

* An :doc:`executor <executor/index>`, which handles running tasks. In the default Airflow installation, this runs everything *inside* the scheduler, but most production-suitable executors actually push task execution out to *workers*.
Airflow's architecture consists of multiple components. The following sections describe each component's
function and whether it is required for a bare-minimum Airflow installation, or is an optional component
that helps achieve better Airflow extensibility, performance, and scalability.

* A *triggerer*, which executes deferred tasks - executed in an async-io event loop.
Required components
...................

* A *webserver*, which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.
A minimal Airflow installation consists of the following components:

* A folder of *DAG files*, read by the scheduler and executor (and any workers the executor has)
* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which handles both triggering scheduled
workflows, and submitting :doc:`tasks` to the executor to run. The :doc:`executor <executor/index>` is
a configuration property of the *scheduler*, not a separate component, and runs within the scheduler
process. There are several executors available out of the box, and you can also write your own
(see the configuration sketch after this list).

* A *metadata database*, used by the scheduler, executor and webserver to store state.
* A *webserver*, which presents a handy user interface to inspect, trigger and debug the behaviour of
DAGs and tasks.

* A folder of *DAG files*, which is read by the *scheduler* to figure out what tasks to run and when
to run them.

Basic airflow architecture
--------------------------
* A *metadata database*, which Airflow components use to store the state of workflows and tasks.
Setting up a metadata database is described in :doc:`/howto/set-up-database` and is required for
Airflow to work.
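
As a quick sketch of how these required components are wired together by configuration, you can inspect
the relevant settings from Python. The option names below follow a typical ``airflow.cfg`` layout
(``[core]`` and ``[database]`` sections); verify them against your Airflow version:

.. code-block:: python

    # Minimal sketch: print the settings that tie the required components together.
    # Option names ([core] executor, [core] dags_folder, [database] sql_alchemy_conn)
    # are the usual ones in recent Airflow versions; adjust for your installation.
    from airflow.configuration import conf

    print(conf.get("core", "executor"))  # the executor runs within the scheduler process
    print(conf.get("core", "dags_folder"))  # the folder of DAG files read by the scheduler
    print(conf.get("database", "sql_alchemy_conn"))  # connection to the metadata database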

This is the basic architecture of Airflow that you'll see in simple installations:
Optional components
...................

Some Airflow components are optional and can enable better extensibility, scalability, and
performance in your Airflow:

* Optional *worker*, which executes the tasks given to it by the scheduler. In the basic installation
the worker might be part of the scheduler, not a separate component. It can be run as a long-running process
in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
:doc:`KubernetesExecutor <executor/kubernetes>`.

* Optional *triggerer*, which executes deferred tasks in an asyncio event loop. In a basic installation
where deferred tasks are not used, a triggerer is not necessary. More about deferring tasks can be
found in :doc:`/authoring-and-scheduling/deferring`.

* Optional *dag processor*, which parses DAG files and serializes them into the
*metadata database*. By default, the *dag processor* process is part of the scheduler, but it can
be run as a separate component for scalability and security reasons. If the *dag processor* is present,
the *scheduler* does not need to read the *DAG files* directly. More about
processing DAG files can be found in :doc:`/authoring-and-scheduling/dagfile-processing`.

* Optional folder of *plugins*. Plugins are a way to extend Airflow's functionality (similar to installed
packages). Plugins are read by the *scheduler*, *dag processor*, *triggerer* and *webserver*. More about
plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
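
For example, a minimal plugin dropped into the *plugins* folder could look like the sketch below
(the file and class names are purely illustrative; see :doc:`/authoring-and-scheduling/plugins` for
what plugins can actually register):

.. code-block:: python

    # plugins/my_company_plugin.py - an illustrative, minimal plugin skeleton.
    # Placed in the configured plugins folder, it is discovered by the scheduler,
    # dag processor, triggerer and webserver.
    from airflow.plugins_manager import AirflowPlugin


    class MyCompanyPlugin(AirflowPlugin):
        # The name under which the plugin is registered with Airflow.
        name = "my_company_plugin"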

Deploying Airflow components
----------------------------

All the components are Python applications that can be deployed using various deployment mechanisms.

They can have extra *installed packages* in their Python environment. This is useful, for example, to
install custom operators or sensors, or to extend Airflow functionality with custom plugins.

While Airflow can be run on a single machine with a simple installation where only the *scheduler* and
*webserver* are deployed, Airflow is designed to be scalable and secure, and is able to run in a distributed
environment - where various components can run on different machines, with different security perimeters,
and can be scaled by running multiple instances of the components above.

The separation of components also allows for increased security, by isolating the components from each other
and by allowing them to perform different tasks. For example, separating the *dag processor* from the
*scheduler* makes sure that the *scheduler* does not have access to the *DAG files* and cannot execute
code provided by the *DAG author*.

Also, while a single person can run and manage an Airflow installation, an Airflow deployment in a more
complex setup can involve various roles of users that interact with different parts of the system, which is
an important aspect of a secure Airflow deployment. The roles are described in detail in the
:doc:`/security/security_model` and generally speaking include:

* Deployment Manager - a person that installs and configures Airflow and manages the deployment
* DAG author - a person that writes DAGs and submits them to Airflow
* Operations User - a person that triggers DAGs and tasks and monitors their execution

Architecture Diagrams
---------------------

The diagrams below show different ways to deploy Airflow - gradually from the simple "one machine",
single-person deployment, to a more complex deployment with separate components, separate user roles and,
finally, more isolated security perimeters.

The meaning of the different connection types in the diagrams below is as follows:

* **brown solid lines** represent *DAG files* submission and synchronization
* **blue solid lines** represent deploying and accessing *installed packages* and *plugins*
* **black dashed lines** represent control flow of workers by the *scheduler* (via executor)
* **black solid lines** represent accessing the UI to manage execution of the workflows
* **red dashed lines** represent accessing the *metadata database* by all components

Basic Airflow deployment
........................

This is the simplest deployment of Airflow, usually operated and managed on a single
machine. Such a deployment usually uses the LocalExecutor, where the *scheduler* and the *workers* are in
the same Python process and the *DAG files* are read directly from the local filesystem by the *scheduler*.
The *webserver* runs on the same machine as the *scheduler*. There is no *triggerer* component, which
means that task deferral is not possible.

Such an installation typically does not separate user roles - deployment, configuration, operation, authoring
and maintenance are all done by the same person and there are no security perimeters between the components.

.. image:: ../img/diagram_basic_airflow_architecture.png

Most executors will generally also introduce other components to let them talk to their workers - like a task queue - but you can still think of the executor and its workers as a single logical component in Airflow overall, handling the actual task execution.
If you want to run Airflow in a simple single-machine setup, you can skip the
more complex diagrams below and go straight to the :ref:`overview:workloads` section.

Distributed Airflow architecture
................................

This is the architecture of Airflow where components of Airflow are distributed among multiple machines
and where various roles of users are introduced - **Deployment Manager**, **DAG author** and
**Operations User**. You can read more about those various roles in the :doc:`/security/security_model`.

In the case of a distributed deployment, it is important to consider the security aspects of the components.
The *webserver* does not have access to the *DAG files* directly. The code in the ``Code`` tab of the
UI is read from the *metadata database*. The *webserver* cannot execute any code submitted by the
**DAG author**. It can only execute code that is installed as an *installed package* or *plugin* by
the **Deployment Manager**. The **Operations User** only has access to the UI and can only trigger
DAGs and tasks, but cannot author DAGs.

The *DAG files* need to be synchronized between all the components that use them - the *scheduler*,
the *triggerer* and the *workers*. The *DAG files* can be synchronized by various mechanisms - typical
ways DAGs can be synchronized are described in :doc:`helm-chart:manage-dags-files` of our
Helm Chart documentation. The Helm chart is one of the ways to deploy Airflow in a K8S cluster.

Airflow itself is agnostic to what you're running - it will happily orchestrate and run anything, either with high-level support from one of our providers, or directly as a command using the shell or Python :doc:`operators`.
.. image:: ../img/diagram_distributed_airflow_architecture.png

Separate DAG processing architecture
------------------------------------
....................................

In a more complex installation where security and isolation are important, you'll also see the standalone **dag file processor** component that allows to separate scheduler from accessing DAG file. This is suitable if the
deployment focus is on isolation between parsed tasks. While Airflow does not yet support full multi-tenant features, it can be used to make sure that DAG-author provided code is never executed in the context of the scheduler.
In a more complex installation where security and isolation are important, you'll also see the
standalone *dag processor* component, which separates the *scheduler* from access to the *DAG files*.
This is suitable if the deployment focus is on isolation between parsed tasks. While Airflow does not yet
support full multi-tenant features, it can be used to make sure that code provided by the **DAG author** is
never executed in the context of the scheduler.

.. image:: ../img/diagram_dag_processor_airflow_architecture.png
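
Whether a deployment expects the standalone *dag processor* is driven by configuration. As a minimal
sketch - assuming the ``[scheduler] standalone_dag_processor`` option available in recent Airflow
versions - you can check it from Python:

.. code-block:: python

    # Minimal sketch: check whether this deployment runs a standalone dag processor.
    # Assumes the [scheduler] standalone_dag_processor option of recent Airflow versions.
    from airflow.configuration import conf

    if conf.getboolean("scheduler", "standalone_dag_processor", fallback=False):
        print("DAG files are parsed by a separate dag processor component")
    else:
        print("DAG files are parsed within the scheduler process")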

You can read more about the different types of users and how they interact with Airflow and how the
security model of Airflow access look like in the :doc:`/security/security_model`
.. _overview:workloads:

Workloads
---------
@@ -1 +1 @@
ac9bd11824e7faf5ed5232ff242c3157
cc2aca72cb388d28842e539f599d373c
Binary file modified docs/apache-airflow/img/diagram_basic_airflow_architecture.png
64 changes: 39 additions & 25 deletions docs/apache-airflow/img/diagram_basic_airflow_architecture.py
@@ -21,55 +21,69 @@
from diagrams import Cluster, Diagram, Edge
from diagrams.custom import Custom
from diagrams.onprem.client import User
from diagrams.onprem.database import PostgreSQL
from diagrams.programming.flowchart import MultipleDocuments
from diagrams.programming.language import Python
from rich.console import Console

MY_DIR = Path(__file__).parent
MY_FILENAME = Path(__file__).with_suffix("").name
PYTHON_MULTIPROCESS_LOGO = MY_DIR.parents[1] / "diagrams" / "python_multiprocess_logo.png"
PACKAGES_IMAGE = MY_DIR.parents[1] / "diagrams" / "packages.png"
DATABASE_IMAGE = MY_DIR.parents[1] / "diagrams" / "database.png"
MULTIPLE_FILES_IMAGE = MY_DIR.parents[1] / "diagrams" / "multiple_files.png"

console = Console(width=400, color_system="standard")

graph_attr = {
"concentrate": "false",
"splines": "splines",
}

edge_attr = {
"minlen": "2",
}


def generate_basic_airflow_diagram():
image_file = (MY_DIR / MY_FILENAME).with_suffix(".png")

console.print(f"[bright_blue]Generating architecture image {image_file}")
with Diagram(
name="", show=False, direction="LR", curvestyle="ortho", filename=MY_FILENAME, outformat="png"
name="",
show=False,
direction="LR",
filename=MY_FILENAME,
outformat="png",
graph_attr=graph_attr,
edge_attr=edge_attr,
):
with Cluster("Parsing & Scheduling"):
schedulers = Custom("Scheduler(s)", PYTHON_MULTIPROCESS_LOGO.as_posix())

metadata_db = PostgreSQL("Metadata DB")
user = User("Airflow User")

dag_author = User("DAG Author")
dag_files = MultipleDocuments("DAG files")
dag_files = Custom("DAG files", MULTIPLE_FILES_IMAGE.as_posix())
user >> Edge(color="brown", style="solid", reverse=False, label="author\n\n") >> dag_files

dag_author >> Edge(color="black", style="dashed", reverse=False) >> dag_files
with Cluster("Parsing, Scheduling & Executing"):
scheduler = Python("Scheduler")

with Cluster("Execution"):
workers = Custom("Worker(s)", PYTHON_MULTIPROCESS_LOGO.as_posix())
triggerer = Custom("Triggerer(s)", PYTHON_MULTIPROCESS_LOGO.as_posix())
metadata_db = Custom("Metadata DB", DATABASE_IMAGE.as_posix())
scheduler >> Edge(color="red", style="dotted", reverse=True) >> metadata_db

schedulers - Edge(color="blue", style="dashed", taillabel="Executor") - workers
plugins_and_packages = Custom(
"Plugin folder\n& installed packages", PACKAGES_IMAGE.as_posix(), color="transparent"
)

schedulers >> Edge(color="red", style="dotted", reverse=True) >> metadata_db
workers >> Edge(color="red", style="dotted", reverse=True) >> metadata_db
triggerer >> Edge(color="red", style="dotted", reverse=True) >> metadata_db
user >> Edge(color="blue", style="solid", reverse=False, label="install\n\n") >> plugins_and_packages

operations_user = User("Operations User")
with Cluster("UI"):
webservers = Custom("Webserver(s)", PYTHON_MULTIPROCESS_LOGO.as_posix())
webserver = Python("Webserver")

webserver >> Edge(color="black", style="solid", reverse=True, label="operate\n\n") >> user

metadata_db >> Edge(color="red", style="dotted", reverse=True) >> webserver

webservers >> Edge(color="black", style="dashed", reverse=True) >> operations_user
dag_files >> Edge(color="brown", style="solid", label="read\n\n") >> scheduler

metadata_db >> Edge(color="red", style="dotted", reverse=True) >> webservers
plugins_and_packages >> Edge(color="blue", style="solid", label="install\n\n") >> scheduler
plugins_and_packages >> Edge(color="blue", style="solid", label="install\n\n") >> webserver

dag_files >> Edge(color="brown", style="solid") >> workers
dag_files >> Edge(color="brown", style="solid") >> schedulers
dag_files >> Edge(color="brown", style="solid") >> triggerer
console.print(f"[green]Generating architecture image {image_file}")


@@ -1 +1 @@
e189c45f79a7a878802bde13be27a112
00f67a1e0cd073ba521da168dc80ccaa
