Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow configuration to be contributed by providers #32604

Merged
merged 2 commits into from
Jul 21, 2023

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Jul 14, 2023

The changes implemented:

  • provider.yaml files for providers can optionally contribute extra
    configuration, the configuration is exposed via "get_provider_info"
    entrypoint, thus allowing Airflow to discover the configuration
    from both - sources (in Breeze and local development) and from
    installed packages

  • Provider configuraitions are lazily loaded - only for commands that
    actually need them

  • Documentation for configuration contributed by providers is
    generated as part of Provider documentation. It is also discoverable
    by having a "core-extension" page displaying all community providers
    that contribute their own configuration.

  • Celery configuration (and in the future Kubernetes configuration) is
    linked directly from the airflow documentation - the providers are
    preinstalled, which means that celery (and Kubernetes in the future)
    configuration is considered as important to be directly mentioned
    and linked from the core. Similarly Celery and Kubernetes executor
    documentation remains in the core documentation (still configuration
    options are detailed only in the provider documentation and only
    linked from the core.

  • configuration writing happens in "main" not in the configuration
    initialization and we will always execute provider configuration
    initialization. This will make sure that the generated configuration
    will contain configuration for the providers as well.

  • Related documentation about custom and community providers have been
    updated and somewhat refactored - I realized that some of it was quite
    out-of-date and some of it was really "developer" not user docs.
    The docs are restructured a bit, cleaned, missing information is
    added and old/irrelevant parts removed.

@potiuk
Copy link
Member Author

potiuk commented Jul 14, 2023

Some tests will still fail but I wanted to put that one up for review.

And if you think that this one is huge, then well, yes it is. But worry not - in case we agree it is too huge I have a plan to split that one into a few smaller ones. I just wanted to show what is the target we are heading to - because some of the smaller PRs are better explained seeing the target.

This is something I've attempted already 3 times I think, and I think this time I succeeded - refactoring the way how configuration is read and generated, and converting our config.yml to be the single source of truth - all that while making it possible for providers to contribute their own config.

I hope this PR once merged, will remove a LOT of confusion on how our configuration is read for "production" and "test" mode.

The main change is to allow to contribute configuration by provider (via config: section in provider.yaml) - but there are many side-effects

Among the side-effects of this change:

  • no more templated almost-looking-like-real-but-not-entirely-working default_airflow.cfg is removed and replaced with simplte airflow config list --defaults (which BTW will also include configuration from installed providers)

  • no more generated unittest.cfg (I actually delete one when I find it to get rid of the confusion) - the test configuration gets its new home (config_unit_tests.yml) in `config_templates"

  • running configuration tests locally and in CI or breeze will have much higher chance to yield the same results, no more confusion with wher test configuration comes from (hint: there were three places where it could come from)

  • provider-contributed configuration is lazy - loaded only after provider's manager is initialized, this way the config cli command and a number of other configuration options will work at the "airflow bootstraping" phase without the need of discovering providers - which is very important for performance of airflow startup and CLI responsivenes. That was quite tricky to figure out, but I think I nailed finally.

  • airflow config list became a swiss-army-knife of showing all things config, you can decide what you are going to see as an output, including the capability of writing default configuration to a file manually if you want (including providers). It also - similarly to our API shows sources for each configuration value (if you ask for it) and even shows which provider contributed the default value (that's really nice actually)

The code is - I think - much more readable, understandable. more type-hinted, intentions of the code are documented with docstrings and any "special cases" are also documented with comments - why we do things in some strange ways. I removed/replaced quite a bit of "strange" code that seemed to be implemented as band-aid-for-band-aid.

I think I got 100% backwards compatibility

@potiuk
Copy link
Member Author

potiuk commented Jul 14, 2023

Looking forward for comments :). Especially - is that OK to merge that one as a single PR, or would reviewers prefer to split that one into smaller ones?

Small sneak-peek of the output generated:

airflow config list --include-examples --include-sources --include-variables --include-descriptions 

Screenshot from 2023-07-14 16-11-44

TESTING.rst Outdated Show resolved Hide resolved
airflow/cli/cli_config.py Outdated Show resolved Hide resolved
@potiuk
Copy link
Member Author

potiuk commented Jul 14, 2023

Also some docs screenshots:

Screenshot from 2023-07-14 16-27-17

image

@potiuk
Copy link
Member Author

potiuk commented Jul 20, 2023

Looks like we’re being hit by the Cython 3.0 release. I’ve heard a few instances of it breaking the build.

Very much that. We are hit rather small-ish way (pymssql on ARM) but this is a mayhem for the whole Python ecosystem it seems.

airflow/__main__.py Outdated Show resolved Hide resolved
airflow/cli/commands/config_command.py Show resolved Hide resolved
airflow/configuration.py Outdated Show resolved Hide resolved
docs/apache-airflow/configurations-ref.rst Outdated Show resolved Hide resolved
docs/providers-configurations-ref.rst Outdated Show resolved Hide resolved
docs/providers-configurations-ref.rst Outdated Show resolved Hide resolved
docs/providers-configurations-ref.rst Outdated Show resolved Hide resolved
@potiuk potiuk force-pushed the allow-configuration-in-providers branch 3 times, most recently from 551167e to d174fbf Compare July 21, 2023 16:31
airflow/configuration.py Outdated Show resolved Hide resolved
@potiuk potiuk force-pushed the allow-configuration-in-providers branch from d174fbf to ede1803 Compare July 21, 2023 17:17
potiuk and others added 2 commits July 21, 2023 19:28
The changes implemented:

* provider.yaml files for providers can optionally contribute extra
  configuration, the configuration is exposed via "get_provider_info"
  entrypoint, thus allowing Airflow to discover the configuration
  from both - sources (in Breeze and local development) and from
  installed packages

* Provider configuraitions are lazily loaded - only for commands that
  actually need them

* Documentation for configuration contributed by providers is
  generated as part of Provider documentation. It is also discoverable
  by having a "core-extension" page displaying all community providers
  that contribute their own configuration.

* Celery configuration (and in the future Kubernetes configuration) is
  linked directly from the airflow documentation - the providers are
  preinstalled, which means that celery (and Kubernetes in the future)
  configuration is considered as important to be directly mentioned
  and linked from the core. Similarly Celery and Kubernetes executor
  documentation remains in the core documentation (still configuration
  options are detailed only in the provider documentation and only
  linked from the core.

* configuration writing happens in "main" not in the configuration
  initialization and we will always execute provider configuration
  initialization. This will make sure that the generated configuration
  will contain configuration for the providers as well.

* Related documentation about custom and community providers have been
  updated and somewhat refactored - I realized that some of it was quite
  out-of-date and some of it was really "developer" not user docs.
  The docs are restructured a bit, cleaned, missing information is
  added and old/irrelevant parts removed.

Co-authored-by: Jed Cunningham <[email protected]>
Co-authored-by: Jed Cunningham <[email protected]>
@potiuk potiuk force-pushed the allow-configuration-in-providers branch from f98aacc to 9e6508f Compare July 21, 2023 17:28
@potiuk potiuk merged commit 73b90c4 into apache:main Jul 21, 2023
@potiuk potiuk deleted the allow-configuration-in-providers branch July 21, 2023 18:13
potiuk added a commit to potiuk/airflow that referenced this pull request Jul 21, 2023
The apache#32604 moved initialization of airflow config to after config
initialization but webserver config is still in initialization
part. Previously when the AIRFLOW_HOME folder was missing, it was
created during config writing but it needs to be created now
before webserver config is written.
potiuk added a commit that referenced this pull request Jul 21, 2023
The #32604 moved initialization of airflow config to after config
initialization but webserver config is still in initialization
part. Previously when the AIRFLOW_HOME folder was missing, it was
created during config writing but it needs to be created now
before webserver config is written.
potiuk added a commit to potiuk/airflow that referenced this pull request Jul 22, 2023
The Celery Executor tests in Helm started to fail after the
configuration migration has been merged (apache#32604). The PR did not
have "full tests needed" label and it skipped K8S tests because
there was no change related to kubernetes (but some fundamental
changes in how configuration were retrieved caused the Celery
Executor failed on missing default configuration value.

The change adds ProvidersManager configuration initialization when
executors are started in order to fix the problem temporarily,
however there is an ongoing effort to optimise the path of
retrieving provider configuration without having to initialize
all provider's configuration and those lines will be removed
when it happens.
potiuk added a commit that referenced this pull request Jul 22, 2023
The Celery Executor tests in Helm started to fail after the
configuration migration has been merged (#32604). The PR did not
have "full tests needed" label and it skipped K8S tests because
there was no change related to kubernetes (but some fundamental
changes in how configuration were retrieved caused the Celery
Executor failed on missing default configuration value.

The change adds ProvidersManager configuration initialization when
executors are started in order to fix the problem temporarily,
however there is an ongoing effort to optimise the path of
retrieving provider configuration without having to initialize
all provider's configuration and those lines will be removed
when it happens.
pateash pushed a commit to pateash/airflow that referenced this pull request Jul 23, 2023
* Allow configuration to be contributed by providers

The changes implemented:

* provider.yaml files for providers can optionally contribute extra
  configuration, the configuration is exposed via "get_provider_info"
  entrypoint, thus allowing Airflow to discover the configuration
  from both - sources (in Breeze and local development) and from
  installed packages

* Provider configuraitions are lazily loaded - only for commands that
  actually need them

* Documentation for configuration contributed by providers is
  generated as part of Provider documentation. It is also discoverable
  by having a "core-extension" page displaying all community providers
  that contribute their own configuration.

* Celery configuration (and in the future Kubernetes configuration) is
  linked directly from the airflow documentation - the providers are
  preinstalled, which means that celery (and Kubernetes in the future)
  configuration is considered as important to be directly mentioned
  and linked from the core. Similarly Celery and Kubernetes executor
  documentation remains in the core documentation (still configuration
  options are detailed only in the provider documentation and only
  linked from the core.

* configuration writing happens in "main" not in the configuration
  initialization and we will always execute provider configuration
  initialization. This will make sure that the generated configuration
  will contain configuration for the providers as well.

* Related documentation about custom and community providers have been
  updated and somewhat refactored - I realized that some of it was quite
  out-of-date and some of it was really "developer" not user docs.
  The docs are restructured a bit, cleaned, missing information is
  added and old/irrelevant parts removed.

Co-authored-by: Jed Cunningham <[email protected]>

* Update airflow/configuration.py

Co-authored-by: Jed Cunningham <[email protected]>

---------

Co-authored-by: Jed Cunningham <[email protected]>
@ephraimbuddy ephraimbuddy added the type:improvement Changelog: Improvements label Aug 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:CLI area:dev-tools area:providers area:Scheduler including HA (high availability) scheduler kind:documentation provider:celery provider:cncf-kubernetes Kubernetes provider related issues type:improvement Changelog: Improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants