[BUG] FlytePropeller 1.13.0 is crashing Flyteadmin and making the overall service in a bad state #5606

Closed
RRap0so opened this issue Jul 30, 2024 · 12 comments
Labels
backlogged (For internal use. Reserved for contributor team workflow.), bug (Something isn't working)

Comments

@RRap0so
Contributor

RRap0so commented Jul 30, 2024

Describe the bug

We've tried upgrading to Flyte v1.13.0 several times and have found some serious issues blocking the upgrade. We don't know the cause yet, but overall it seems that FlytePropeller's calls to FlyteAdmin result in a large number of calls to two of FlyteAdmin's gRPC endpoints:

  • GetOauth2Metadata
  • GetPublicClientConfig

What we see in FlyteAdmin is a lot of pressure on the gRPC side and an increase in memory until it gets OOM-killed by Kubernetes. This in turn causes the FlytePropeller calls to fail with gRPC errors and eventually crash as well. We also noticed that this version added a retry mechanism which in our case is too aggressive, but we've tuned that down (see the sketch below).
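For reference, this is roughly where we tuned the retries down, in propeller's admin client section (a sketch only; the key names are our reading of the flyteidl admin client config and may differ in your version):

```yaml
admin:
  # Sketch of the retry tuning we applied; treat the keys and values as
  # assumptions and verify them against your flyteidl/flytepropeller version.
  maxRetries: 2          # fewer retries so a failing call doesn't multiply the load
  perRetryTimeout: 15s
  maxBackoffDelay: 8s
```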

Hard to say by looking at the code, but I suspect this PR introduced some kind of bug or a missing config we're unaware of, which makes FlyteAdmin constantly try to fetch a token (and fail).

#5407

When we roll back FlytePropeller to v1.11.0, things are fine and FlyteAdmin behaves nicely (still running v1.13.0).

Expected behavior

FlyteAdmin doesn't get massively impacted by gRPC calls.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@RRap0so RRap0so added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels Jul 30, 2024

runllm bot commented Jul 30, 2024

The issue you're describing seems to be related to the interaction between FlytePropeller and FlyteAdmin, particularly with the gRPC calls to GetOauth2Metadata and GetPublicClientConfig. This is causing FlyteAdmin to experience high memory usage and eventually get killed by Kubernetes due to OOM (Out of Memory) errors.

Based on the provided data sources, here are a few steps and considerations to address this issue:

  1. Check FlyteAdmin Configuration: Ensure that the FlyteAdmin configuration is optimized for your environment. According to this GitHub issue, you might need to adjust the resource limits and liveness probe settings for FlyteAdmin. For example, increasing the CPU and memory limits can help handle the increased load (see the sketch after this list).

  2. Adjust Retry Mechanism: You mentioned that the retry mechanism introduced in the newer version is too aggressive. Make sure you have tuned it down appropriately. This can help reduce the pressure on FlyteAdmin.

  3. Review Authentication Setup: Ensure that the authentication setup between FlytePropeller and FlyteAdmin is correctly configured. According to this discussion, issues with token retrieval can cause repeated authentication attempts, leading to high load.

  4. Monitor Resource Usage: Continuously monitor the resource usage of FlyteAdmin and FlytePropeller. Use Kubernetes metrics and logs to identify any spikes in memory or CPU usage.
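For item 1, a hedged example of what raising FlyteAdmin's limits could look like in a Helm values file (the `flyteadmin.resources` path is an assumption based on the flyte-core chart; adjust to your deployment):

```yaml
# Sketch only: raise flyteadmin's requests/limits to absorb the extra gRPC load.
# The exact values.yaml path depends on the chart you deploy with.
flyteadmin:
  resources:
    requests:
      cpu: "1"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 2Gi
```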

Would you like more detailed steps on how to implement these changes or further troubleshooting steps?


@davidmirror-ops davidmirror-ops added the backlogged (For internal use. Reserved for contributor team workflow.) label and removed the untriaged (This issue has not yet been looked at by the Maintainers) label Aug 1, 2024
@davidmirror-ops davidmirror-ops moved this from Backlog to Assigned in Flyte Issues/PRs maintenance Aug 1, 2024
@wild-endeavor
Contributor

I think what is happening here is that the auth interceptor provided by the NewAuthInterceptor function calls the MaterializeCredentials function on every invocation, even if there is already a token.

Inside MaterializeCredentials, the code does a couple of things that could probably be optimized:

  • it always creates a new authMetadataClient
  • it always calls NewTokenSourceProvider. Inside NewTokenSourceProvider, in the AuthTypeClientSecret case, if cfg.TokenURL is not defined, the code will always call Admin's GetOAuth2Metadata endpoint. A similar story applies to the GetPublicClientConfig endpoint; it just gets called every time.

@pmahindrakar-oss can you think of any quick fixes here? I think it should be okay to cache it on first read, but I don't know how often these get refreshed, etc. Maybe we refresh it whenever we do a token refresh? Something like the sketch below.
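Roughly the kind of thing I have in mind, purely as a hypothetical sketch (the names below are made up for illustration, not the actual flyteidl code):

```go
// Hypothetical sketch (not the actual flyteidl code): cache the OAuth2 discovery
// data after the first successful fetch so the interceptor doesn't hit Admin's
// GetOAuth2Metadata / GetPublicClientConfig on every call.
package adminauth

import (
	"context"
	"sync"
)

// discoveredAuthInfo stands in for the values those two endpoints return
// (token endpoint, scopes, ...). Field names are made up.
type discoveredAuthInfo struct {
	TokenURL string
	Scopes   []string
}

type authInfoCache struct {
	once  sync.Once
	info  discoveredAuthInfo
	err   error
	fetch func(ctx context.Context) (discoveredAuthInfo, error) // wraps the metadata service calls
}

// Get performs the discovery once and serves the cached result afterwards.
// A refresh hook could be wired into the token-refresh path if these values
// are ever expected to change.
func (c *authInfoCache) Get(ctx context.Context) (discoveredAuthInfo, error) {
	c.once.Do(func() {
		c.info, c.err = c.fetch(ctx)
	})
	return c.info, c.err
}
```

A real version would probably retry on a failed first fetch instead of caching the error, but the point is just to avoid the per-call round trips.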

On the Admin side, it does seem likely that the OOM is caused by the repeated calling of these endpoints. That's a side effect, though; calling these two endpoints thousands of times shouldn't be a problem. We're still looking into this.

This is what a pprof looks like right now, so it feels like something prometheus related (which would make sense given how simple those two endpoints are)
[pprof screenshot attached in the original issue]

but @eapolinario and I dug for a while and didn't see a culprit just yet.

@pvditt
Contributor

pvditt commented Aug 6, 2024

@wild-endeavor

I noticed this PR wasn't upstreamed. That PR fixed "propeller panic when talking to admin, due to client interceptor sending nil token cache".

I figured this could be a potential cause for the hammering of NewAuthInterceptor -> MaterializeCredentials -> GetOauth2Metadata + GetPublicClientConfig.

I'm unsure if this is the issue, however, as it seems like it would cause problems for most users on 1.13. Also, I don't hit the worker panics that PR fixed when running without it.

@pvditt
Contributor

pvditt commented Aug 6, 2024

@RRap0so

Would you have any propeller logs while it was crashing during your upgrade to 1.13? Also would you be able to share any logs of workflows that failed during the upgrade?

@wild-endeavor
Contributor

@RRap0so I think there are two issues happening here. Admin shouldn't be oom'ing, and the client that propeller uses shouldn't be making that many calls to Admin.

Wrt the second one, I asked internally and confirmed: unless you instantiate a new auth server, the details shouldn't change. To that end, until we add a caching layer or change the logic, you can short-circuit it by setting the config fields here and here (see the sketch below). This is not ideal, as you'll have these fields twice in your config: once where Admin currently grabs the values (to then serve them from these two endpoints) and once here. Make sure you set scopes as well.
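Roughly, in propeller's admin client config (a sketch; the key names are my reading of the flyteidl client config and the values are placeholders, so double-check both against your version):

```yaml
admin:
  # With tokenUrl and scopes set explicitly, the token source provider doesn't
  # need to discover them via GetOAuth2Metadata / GetPublicClientConfig.
  # Placeholder values; verify the key names against your flyteidl version.
  clientId: flytepropeller
  clientSecretLocation: /etc/secrets/client_secret
  tokenUrl: https://your-idp.example.com/oauth2/token
  scopes:
    - all
```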

The memory leak we're still investigating. It's not clear to me how they are related. Will post back here when we have more.

@RRap0so
Contributor Author

RRap0so commented Aug 7, 2024

Hey @pvditt and @wild-endeavor, thank you for all the help! I'll try to find some logs, but honestly I don't think there were any problems with the execution of the workflows. Things were still running; what I'm not sure about is whether propeller was able to publish events to admin.

@Sovietaced
Contributor

I feel that it is fairly normal for clients to cache oauth metadata. That is something I'd be happy to work on as I've done it before.

@Sovietaced
Contributor

Does anyone have an MVP flyte-binary values.yaml that can reproduce this issue?

@eapolinario
Contributor

@RRap0so , can you give 1.13.1 a try? #5686 implemented this idea of caching oauth data.

@RRap0so
Contributor Author

RRap0so commented Aug 27, 2024

@eapolinario we just gave it another go and saw the same behaviour: GetOauth2Metadata calls spike in flyteadmin, and flyteadmin OOMs and restarts.

@wild-endeavor
Contributor

@RRap0so if the change we talked about works, I think we should close out this issue. There may be a memory leak when Admin serves up 404s in perpetuity, but I think that's not the right thing to be optimizing for either. The memory issue identified in this issue with propeller was isolated to metrics, and is related to this issue here that @eapolinario found. I think we should cut a separate ticket to discuss metric life-cycle with @EngHabu, as that discussion is pretty orthogonal. What do you all think?

@RRap0so
Contributor Author

RRap0so commented Sep 2, 2024

@wild-endeavor so far so good. We've updated our code with the latest flyteidl libs! I'm going to go ahead and close this one.

@RRap0so RRap0so closed this as completed Sep 2, 2024
@github-project-automation github-project-automation bot moved this from Assigned to Done in Flyte Issues/PRs maintenance Sep 2, 2024