Allow configuration to be contributed by providers #32604

potiuk · 2023-07-14T13:50:39Z

The changes implemented:

provider.yaml files for providers can optionally contribute extra
configuration, the configuration is exposed via "get_provider_info"
entrypoint, thus allowing Airflow to discover the configuration
from both - sources (in Breeze and local development) and from
installed packages
Provider configuraitions are lazily loaded - only for commands that
actually need them
Documentation for configuration contributed by providers is
generated as part of Provider documentation. It is also discoverable
by having a "core-extension" page displaying all community providers
that contribute their own configuration.
Celery configuration (and in the future Kubernetes configuration) is
linked directly from the airflow documentation - the providers are
preinstalled, which means that celery (and Kubernetes in the future)
configuration is considered as important to be directly mentioned
and linked from the core. Similarly Celery and Kubernetes executor
documentation remains in the core documentation (still configuration
options are detailed only in the provider documentation and only
linked from the core.
configuration writing happens in "main" not in the configuration
initialization and we will always execute provider configuration
initialization. This will make sure that the generated configuration
will contain configuration for the providers as well.
Related documentation about custom and community providers have been
updated and somewhat refactored - I realized that some of it was quite
out-of-date and some of it was really "developer" not user docs.
The docs are restructured a bit, cleaned, missing information is
added and old/irrelevant parts removed.

potiuk · 2023-07-14T14:08:41Z

Some tests will still fail but I wanted to put that one up for review.

And if you think that this one is huge, then well, yes it is. But worry not - in case we agree it is too huge I have a plan to split that one into a few smaller ones. I just wanted to show what is the target we are heading to - because some of the smaller PRs are better explained seeing the target.

This is something I've attempted already 3 times I think, and I think this time I succeeded - refactoring the way how configuration is read and generated, and converting our config.yml to be the single source of truth - all that while making it possible for providers to contribute their own config.

I hope this PR once merged, will remove a LOT of confusion on how our configuration is read for "production" and "test" mode.

The main change is to allow to contribute configuration by provider (via config: section in provider.yaml) - but there are many side-effects

Among the side-effects of this change:

no more templated almost-looking-like-real-but-not-entirely-working default_airflow.cfg is removed and replaced with simplte airflow config list --defaults (which BTW will also include configuration from installed providers)
no more generated unittest.cfg (I actually delete one when I find it to get rid of the confusion) - the test configuration gets its new home (config_unit_tests.yml) in `config_templates"
running configuration tests locally and in CI or breeze will have much higher chance to yield the same results, no more confusion with wher test configuration comes from (hint: there were three places where it could come from)
provider-contributed configuration is lazy - loaded only after provider's manager is initialized, this way the config cli command and a number of other configuration options will work at the "airflow bootstraping" phase without the need of discovering providers - which is very important for performance of airflow startup and CLI responsivenes. That was quite tricky to figure out, but I think I nailed finally.
airflow config list became a swiss-army-knife of showing all things config, you can decide what you are going to see as an output, including the capability of writing default configuration to a file manually if you want (including providers). It also - similarly to our API shows sources for each configuration value (if you ask for it) and even shows which provider contributed the default value (that's really nice actually)

The code is - I think - much more readable, understandable. more type-hinted, intentions of the code are documented with docstrings and any "special cases" are also documented with comments - why we do things in some strange ways. I removed/replaced quite a bit of "strange" code that seemed to be implemented as band-aid-for-band-aid.

I think I got 100% backwards compatibility

potiuk · 2023-07-14T14:14:18Z

Looking forward for comments :). Especially - is that OK to merge that one as a single PR, or would reviewers prefer to split that one into smaller ones?

Small sneak-peek of the output generated:

airflow config list --include-examples --include-sources --include-variables --include-descriptions

TESTING.rst

airflow/cli/cli_config.py

airflow/config_templates/config.yml

docs/exts/operators_and_hooks_ref.py

scripts/in_container/airflow_ci.cfg

potiuk · 2023-07-14T14:31:48Z

Also some docs screenshots:

potiuk · 2023-07-20T20:54:54Z

Looks like we’re being hit by the Cython 3.0 release. I’ve heard a few instances of it breaking the build.

Very much that. We are hit rather small-ish way (pymssql on ARM) but this is a mayhem for the whole Python ecosystem it seems.

airflow/__main__.py

airflow/cli/commands/config_command.py

airflow/configuration.py

docs/apache-airflow/configurations-ref.rst

docs/providers-configurations-ref.rst

airflow/providers/celery/provider.yaml

airflow/configuration.py

The changes implemented: * provider.yaml files for providers can optionally contribute extra configuration, the configuration is exposed via "get_provider_info" entrypoint, thus allowing Airflow to discover the configuration from both - sources (in Breeze and local development) and from installed packages * Provider configuraitions are lazily loaded - only for commands that actually need them * Documentation for configuration contributed by providers is generated as part of Provider documentation. It is also discoverable by having a "core-extension" page displaying all community providers that contribute their own configuration. * Celery configuration (and in the future Kubernetes configuration) is linked directly from the airflow documentation - the providers are preinstalled, which means that celery (and Kubernetes in the future) configuration is considered as important to be directly mentioned and linked from the core. Similarly Celery and Kubernetes executor documentation remains in the core documentation (still configuration options are detailed only in the provider documentation and only linked from the core. * configuration writing happens in "main" not in the configuration initialization and we will always execute provider configuration initialization. This will make sure that the generated configuration will contain configuration for the providers as well. * Related documentation about custom and community providers have been updated and somewhat refactored - I realized that some of it was quite out-of-date and some of it was really "developer" not user docs. The docs are restructured a bit, cleaned, missing information is added and old/irrelevant parts removed. Co-authored-by: Jed Cunningham <[email protected]>

Co-authored-by: Jed Cunningham <[email protected]>

The apache#32604 moved initialization of airflow config to after config initialization but webserver config is still in initialization part. Previously when the AIRFLOW_HOME folder was missing, it was created during config writing but it needs to be created now before webserver config is written.

The #32604 moved initialization of airflow config to after config initialization but webserver config is still in initialization part. Previously when the AIRFLOW_HOME folder was missing, it was created during config writing but it needs to be created now before webserver config is written.

The Celery Executor tests in Helm started to fail after the configuration migration has been merged (apache#32604). The PR did not have "full tests needed" label and it skipped K8S tests because there was no change related to kubernetes (but some fundamental changes in how configuration were retrieved caused the Celery Executor failed on missing default configuration value. The change adds ProvidersManager configuration initialization when executors are started in order to fix the problem temporarily, however there is an ongoing effort to optimise the path of retrieving provider configuration without having to initialize all provider's configuration and those lines will be removed when it happens.

The Celery Executor tests in Helm started to fail after the configuration migration has been merged (#32604). The PR did not have "full tests needed" label and it skipped K8S tests because there was no change related to kubernetes (but some fundamental changes in how configuration were retrieved caused the Celery Executor failed on missing default configuration value. The change adds ProvidersManager configuration initialization when executors are started in order to fix the problem temporarily, however there is an ongoing effort to optimise the path of retrieving provider configuration without having to initialize all provider's configuration and those lines will be removed when it happens.

* Allow configuration to be contributed by providers The changes implemented: * provider.yaml files for providers can optionally contribute extra configuration, the configuration is exposed via "get_provider_info" entrypoint, thus allowing Airflow to discover the configuration from both - sources (in Breeze and local development) and from installed packages * Provider configuraitions are lazily loaded - only for commands that actually need them * Documentation for configuration contributed by providers is generated as part of Provider documentation. It is also discoverable by having a "core-extension" page displaying all community providers that contribute their own configuration. * Celery configuration (and in the future Kubernetes configuration) is linked directly from the airflow documentation - the providers are preinstalled, which means that celery (and Kubernetes in the future) configuration is considered as important to be directly mentioned and linked from the core. Similarly Celery and Kubernetes executor documentation remains in the core documentation (still configuration options are detailed only in the provider documentation and only linked from the core. * configuration writing happens in "main" not in the configuration initialization and we will always execute provider configuration initialization. This will make sure that the generated configuration will contain configuration for the providers as well. * Related documentation about custom and community providers have been updated and somewhat refactored - I realized that some of it was quite out-of-date and some of it was really "developer" not user docs. The docs are restructured a bit, cleaned, missing information is added and old/irrelevant parts removed. Co-authored-by: Jed Cunningham <[email protected]> * Update airflow/configuration.py Co-authored-by: Jed Cunningham <[email protected]> --------- Co-authored-by: Jed Cunningham <[email protected]>

potiuk requested review from ashb, jedcunningham and kaxil as code owners July 14, 2023 13:50

boring-cyborg bot added area:CLI area:dev-tools provider:cncf-kubernetes Kubernetes provider related issues area:providers area:Scheduler including HA (high availability) scheduler kind:documentation provider:celery labels Jul 14, 2023

potiuk requested review from dstandish, o-nikolas, hussein-awala, ferruzzi, ephraimbuddy, mobuchowski, eladkal, vincbeck, uranusjr, pierrejeambrun and amoghrajesh July 14, 2023 13:50