
Webserver fails to pull secrets from Hashicorp Vault on start up #26185

Closed · 2 tasks done
Wilder60 opened this issue Sep 6, 2022 · 10 comments · Fixed by #26223
Labels
area:webserver · kind:bug

Comments

Wilder60 commented Sep 6, 2022

Apache Airflow version

2.3.4

What happened

Since upgrading to Airflow 2.3.4, our webserver fails on start-up to pull secrets from our Vault instance. Setting AIRFLOW__WEBSERVER__WORKERS = 1 allowed the webserver to start successfully, but reverting the change added in https://github.com/apache/airflow/pull/25556 was the only way we found to fix the issue without adjusting the webserver's worker count.

What you think should happen instead

The Airflow webserver should be able to successfully read from Vault with AIRFLOW__WEBSERVER__WORKERS > 1.

How to reproduce

Start a webserver instance configured to authenticate with Vault using the approle method, with AIRFLOW__DATABASE__SQL_ALCHEMY_CONN_SECRET and AIRFLOW__WEBSERVER__SECRET_KEY_SECRET set. The webserver will fail to initialize all of the gunicorn workers and exit (see the sketch below).
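
To make the failing configuration concrete, here is a minimal sketch of such a setup. The Vault URL, role_id, secret_id, and secret key names are placeholders, not values from this issue:

```python
import os
import subprocess

# Env vars follow Airflow's AIRFLOW__{SECTION}__{KEY} convention.
os.environ.update({
    # Vault secrets backend using approle auth (hashicorp provider).
    "AIRFLOW__SECRETS__BACKEND":
        "airflow.providers.hashicorp.secrets.vault.VaultBackend",
    "AIRFLOW__SECRETS__BACKEND_KWARGS": (
        '{"url": "https://vault.example.com", "auth_type": "approle",'
        ' "role_id": "<role-id>", "secret_id": "<secret-id>"}'
    ),
    # Both *_SECRET settings force a Vault read during start-up.
    "AIRFLOW__DATABASE__SQL_ALCHEMY_CONN_SECRET": "sql_alchemy_conn",
    "AIRFLOW__WEBSERVER__SECRET_KEY_SECRET": "webserver_secret_key",
    # More than one gunicorn worker is needed to trigger the failure.
    "AIRFLOW__WEBSERVER__WORKERS": "4",
})
subprocess.run(["airflow", "webserver"])
```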

Operating System

Fedora 29

Versions of Apache Airflow Providers

apache-airflow-providers-hashicorp==3.1.0

Deployment

Docker-Compose

Deployment details

Python 3.9.13
Vault 1.9.4

Anything else

None

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Wilder60 added the area:core and kind:bug labels Sep 6, 2022
boring-cyborg bot commented Sep 6, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

notatallshaw-gts (Contributor) commented Sep 7, 2022

@pdebelak @potiuk is there anything extra we can provide here?

It was a bit of a surprise to us that we needed to revert #25556 to run the webserver on 2.3.4, since it was one of the PRs we were looking forward to. But it's not 100% clear to us why this PR broke the webserver/Vault interaction.

pdebelak (Contributor) commented Sep 7, 2022

@notatallshaw-work @Wilder60 I don't use approle, but this is mysterious. Are there any other clues you can provide? Is all other configuration at default values? Do you get a stack trace when the workers crash?

notatallshaw-gts (Contributor) commented Sep 7, 2022

Yes, we do get a stack trace: it's an http.client.BadStatusLine. After staring at it for a while, I think the Vault clients in the different processes are all reading the same HTTP responses (in the order they came from the Vault server), rather than each reading its own individual responses.

Here is the full exception log (I've had to sanitize some of the values): https://gist.github.com/notatallshaw-work/6c1a77b4341caa653a231190bd64cdaa

Running just one worker or reverting the PR both fix this exception.
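
The suspected mechanism can be sketched outside Airflow. A minimal demonstration, assuming a Unix fork model like gunicorn's (example.com is a placeholder host):

```python
import http.client
import os

# One connection, and crucially one socket, created in the parent
# before forking (http.client connects lazily, so connect explicitly).
conn = http.client.HTTPConnection("example.com", 80)
conn.connect()

for _ in range(2):
    if os.fork() == 0:
        # Child: inherits the parent's socket file descriptor. The two
        # children's requests and responses interleave on one TCP
        # stream, so a child can consume bytes that belong to its
        # sibling and then raise http.client.BadStatusLine.
        conn.request("GET", "/")
        print(os.getpid(), conn.getresponse().status)
        os._exit(0)

for _ in range(2):
    os.wait()
```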

pdebelak (Contributor) commented Sep 7, 2022

Yeah, I don't really understand how or why that might be happening, but it's clearly a problem. It's possible that this change is too fundamental to be safe for all configurations, and we should just get rid of this caching and deal with the multiple logins.

@potiuk - what do you think? Should I just revert this caching change altogether?
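
For readers following along, a sketch of the pattern under discussion. This assumes #25556 cached the secrets backend roughly as below; the names are illustrative, not the exact Airflow internals:

```python
from functools import cache

import requests


class VaultLikeBackend:
    """Stand-in for a secrets backend that owns an HTTP session."""

    def __init__(self):
        self.session = requests.Session()  # holds sockets once used


@cache
def get_secret_backend_cached():
    # Cached: if first called in the gunicorn master before fork, every
    # worker inherits the same backend instance -- and the same
    # session/sockets, which is the hazard described above.
    return VaultLikeBackend()


def get_secret_backend():
    # Uncached: each call (and therefore each worker) constructs its
    # own backend and session, at the cost of multiple Vault logins.
    return VaultLikeBackend()
```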

potiuk (Member) commented Sep 7, 2022

I think so. It has proven to be more problematic than we both thought, and the benefit is limited. I think that using the external caching mechanisms of the respective secrets providers is a much more robust approach.

We should treat it as a failed experiment I think.

pdebelak (Contributor) commented Sep 7, 2022

Okay, just opened #26223 to remove this caching.

uranusjr added the area:webserver label and removed the area:core label Sep 7, 2022
potiuk (Member) commented Sep 8, 2022

Merged. A good learning experience for everyone, I think :)

notatallshaw-gts (Contributor) commented Oct 31, 2022

FYI I wonder if the behavior would change with this PR: #27297

With each gunicorn worker preloaded, perhaps the secrets would be sent back to the correct worker?
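
For reference, the standard gunicorn setting being speculated about, shown only to illustrate the difference; whether #27297 changes Airflow's behaviour here is the open question:

```python
# gunicorn.conf.py
# With preload_app the application is imported once in the master and
# then forked into workers; without it, each worker imports the app
# independently after the fork.
preload_app = True
workers = 4
```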

potiuk (Member) commented Oct 31, 2022

Worth checking.
