[BUG] FlytePropeller 1.13.0 is crashing Flyteadmin and making the overall service in a bad state #5606

Closed
RRap0so opened this issue Jul 30, 2024 · 12 comments
Labels
backlogged (For internal use. Reserved for contributor team workflow.), bug (Something isn't working)

Comments

@RRap0so
Contributor

RRap0so commented Jul 30, 2024

Describe the bug

We've tried upgrading to Flyte v1.13.0 several times and have found some serious issues blocking the upgrade. We don't know the cause yet, but overall it seems that FlytePropeller's calls to FlyteAdmin result in a large number of calls to two of FlyteAdmin's gRPC endpoints:

  • GetOauth2Metadata
  • GetPublicClientConfig

What we see in FlyteAdmin is a lot of pressure on the gRPC side and an increase in memory until it gets OOM-killed by Kubernetes. This in turn causes the FlytePropeller calls to fail with gRPC errors and eventually crash as well. We also noticed that this version added a retry mechanism which in our case is too aggressive, but we've tuned that down (see the sketch below).
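For reference, this is roughly where we tuned the retries down, in propeller's admin client section (a sketch only; the key names are our reading of the flyteidl admin client config and may differ in your version):

```yaml
admin:
  # Sketch of the retry tuning we applied; treat the keys and values as
  # assumptions and verify them against your flyteidl/flytepropeller version.
  maxRetries: 2          # fewer retries so a failing call doesn't multiply the load
  perRetryTimeout: 15s
  maxBackoffDelay: 8s
```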

Hard to say by looking at the code, but I suspect this PR introduced some kind of bug or a missing config we're unaware of, which makes FlyteAdmin constantly try to fetch a token (and fail).

#5407

When we roll back FlytePropeller to v1.11.0, things are fine and FlyteAdmin behaves nicely (still running v1.13.0).

Expected behavior

FlyteAdmin doesn't get massively impacted by gRPC calls.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@RRap0so RRap0so added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels Jul 30, 2024

runllm bot commented Jul 30, 2024

The issue you're describing seems to be related to the interaction between FlytePropeller and FlyteAdmin, particularly with the gRPC calls to GetOauth2Metadata and GetPublicClientConfig. This is causing FlyteAdmin to experience high memory usage and eventually get killed by Kubernetes due to OOM (Out of Memory) errors.

Based on the provided data sources, here are a few steps and considerations to address this issue:

  1. Check FlyteAdmin Configuration: Ensure that the FlyteAdmin configuration is optimized for your environment. According to this GitHub issue, you might need to adjust the resource limits and liveness probe settings for FlyteAdmin. For example, increasing the CPU and memory limits can help handle the increased load (see the sketch after this list).

  2. Adjust Retry Mechanism: You mentioned that the retry mechanism introduced in the newer version is too aggressive. Make sure you have tuned it down appropriately. This can help reduce the pressure on FlyteAdmin.

  3. Review Authentication Setup: Ensure that the authentication setup between FlytePropeller and FlyteAdmin is correctly configured. According to this discussion, issues with token retrieval can cause repeated authentication attempts, leading to high load.

  4. Monitor Resource Usage: Continuously monitor the resource usage of FlyteAdmin and FlytePropeller. Use Kubernetes metrics and logs to identify any spikes in memory or CPU usage.
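For item 1, a hedged example of what raising FlyteAdmin's limits could look like in a Helm values file (the `flyteadmin.resources` path is an assumption based on the flyte-core chart; adjust to your deployment):

```yaml
# Sketch only: raise flyteadmin's requests/limits to absorb the extra gRPC load.
# The exact values.yaml path depends on the chart you deploy with.
flyteadmin:
  resources:
    requests:
      cpu: "1"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 2Gi
```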

Would you like more detailed steps on how to implement these changes or further troubleshooting steps?


@davidmirror-ops davidmirror-ops added the backlogged (For internal use. Reserved for contributor team workflow.) label and removed the untriaged (This issue has not yet been looked at by the Maintainers) label Aug 1, 2024
@davidmirror-ops davidmirror-ops moved this from Backlog to Assigned in Flyte Issues/PRs maintenance Aug 1, 2024
@wild-endeavor
Contributor

I think what is happening here is that the auth interceptor provided by the NewAuthInterceptor function calls the MaterializeCredentials function on every invocation, even if there is already a token.

Inside MaterializeCredentials, the code does a couple of things that could probably be optimized:

  • it always creates a new authMetadataClient
  • it always calls NewTokenSourceProvider. Inside NewTokenSourceProvider, in the AuthTypeClientSecret case, if cfg.TokenURL is not defined, the code will always call Admin's GetOAuth2Metadata endpoint. A similar story applies to the GetPublicClientConfig endpoint; it just gets called every time.

@pmahindrakar-oss can you think of any quick fixes here? I think it should be okay to cache it on first read, but I don't know how often these get refreshed, etc. Maybe we refresh it whenever we do a token refresh? Something like the sketch below.
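Roughly the kind of thing I have in mind, purely as a hypothetical sketch (the names below are made up for illustration, not the actual flyteidl code):

```go
// Hypothetical sketch (not the actual flyteidl code): cache the OAuth2 discovery
// data after the first successful fetch so the interceptor doesn't hit Admin's
// GetOAuth2Metadata / GetPublicClientConfig on every call.
package adminauth

import (
	"context"
	"sync"
)

// discoveredAuthInfo stands in for the values those two endpoints return
// (token endpoint, scopes, ...). Field names are made up.
type discoveredAuthInfo struct {
	TokenURL string
	Scopes   []string
}

type authInfoCache struct {
	once  sync.Once
	info  discoveredAuthInfo
	err   error
	fetch func(ctx context.Context) (discoveredAuthInfo, error) // wraps the metadata service calls
}

// Get performs the discovery once and serves the cached result afterwards.
// A refresh hook could be wired into the token-refresh path if these values
// are ever expected to change.
func (c *authInfoCache) Get(ctx context.Context) (discoveredAuthInfo, error) {
	c.once.Do(func() {
		c.info, c.err = c.fetch(ctx)
	})
	return c.info, c.err
}
```

A real version would probably retry on a failed first fetch instead of caching the error, but the point is just to avoid the per-call round trips.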

On the Admin side, it does seem likely that the OOM is caused by the repeated calling of these endpoints. That's a side effect, though; calling these two endpoints thousands of times shouldn't be a problem. We're still looking into this.

This is what a pprof looks like right now, so it feels like something prometheus related (which would make sense given how simple those two endpoints are)
[pprof screenshot attached in the original issue]

but @eapolinario and I dug for a while and didn't see a culprit just yet.

@pvditt
Contributor

pvditt commented Aug 6, 2024

@wild-endeavor

I noticed this PR wasn't upstreamed. That PR fixed "propeller panic when talking to admin, due to client interceptor sending nil token cache".

I figured this could be a potential cause for the hammering of NewAuthInterceptor -> MaterializeCredentials -> GetOauth2Metadata + GetPublicClientConfig.

I'm unsure if this is the issue, however, as it seems like it would cause problems for most users on 1.13. Also, I don't hit the worker panics that PR fixed when running without it.

@pvditt
Contributor

pvditt commented Aug 6, 2024

@RRap0so

Would you have any propeller logs while it was crashing during your upgrade to 1.13? Also would you be able to share any logs of workflows that failed during the upgrade?

@wild-endeavor
Contributor

@RRap0so I think there are two issues happening here. Admin shouldn't be oom'ing, and the client that propeller uses shouldn't be making that many calls to Admin.

Wrt the second one, I asked internally and confirmed: unless you instantiate a new auth server, the details shouldn't change. To that end, until we add a caching layer or change the logic, you can short-circuit it by setting the config fields here and here (see the sketch below). This is not ideal, as you'll have these fields twice in your config: once where Admin currently grabs the values (to then serve them from these two endpoints) and once here. Make sure you set scopes as well.
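Roughly, in propeller's admin client config (a sketch; the key names are my reading of the flyteidl client config and the values are placeholders, so double-check both against your version):

```yaml
admin:
  # With tokenUrl and scopes set explicitly, the token source provider doesn't
  # need to discover them via GetOAuth2Metadata / GetPublicClientConfig.
  # Placeholder values; verify the key names against your flyteidl version.
  clientId: flytepropeller
  clientSecretLocation: /etc/secrets/client_secret
  tokenUrl: https://your-idp.example.com/oauth2/token
  scopes:
    - all
```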

The memory leak we're still investigating. It's not clear to me how they are related. Will post back here when we have more.

@RRap0so
Contributor Author

RRap0so commented Aug 7, 2024

Hey @pvditt and @wild-endeavor, thank you for all the help! I'll try to find some logs, but honestly I don't think there were any problems with the execution of the workflows. Things were still running; what I'm not sure about is whether propeller was able to publish events to admin.

@Sovietaced
Contributor

I feel that it is fairly normal for clients to cache oauth metadata. That is something I'd be happy to work on as I've done it before.

@Sovietaced
Contributor

Does anyone have an MVP flyte-binary values.yaml that can reproduce this issue?

@eapolinario
Contributor

@RRap0so , can you give 1.13.1 a try? #5686 implemented this idea of caching oauth data.

@RRap0so
Contributor Author

RRap0so commented Aug 27, 2024

@eapolinario we just gave it another go and saw the same behaviour: GetOauth2Metadata calls spike in flyteadmin, and flyteadmin OOMs and restarts.

@wild-endeavor
Contributor

@RRap0so if the change we talked about works, I think we should close out this issue. There may be a memory leak when Admin serves up 404s in perpetuity, but I think that's not the right thing to be optimizing for either. The memory issue identified in this issue with propeller was isolated to metrics, and is related to this issue here that @eapolinario found. I think we should cut a separate ticket to discuss metric life-cycle with @EngHabu, as that discussion is pretty orthogonal. What do you all think?

@RRap0so
Contributor Author

RRap0so commented Sep 2, 2024

@wild-endeavor so far so good. We've updated our code with the latest flyteidl libs! I'm going to go ahead and close this one.

@RRap0so RRap0so closed this as completed Sep 2, 2024
@github-project-automation github-project-automation bot moved this from Assigned to Done in Flyte Issues/PRs maintenance Sep 2, 2024