API throttle when using GCS as checkpoint backend #621

Closed
gbto opened this issue May 7, 2024 · 3 comments · Fixed by #634
Labels
bug Something isn't working

Comments

@gbto
Contributor

gbto commented May 7, 2024

We have observed some UI responsiveness issues on our Kubernetes deployments. After investigation, we discovered that the API calls made to fetch pipelines and jobs were pending and eventually timed out. Digging deeper, this was due to the controller calling the GCP metadata server too often:

object_store::gcp::credential::fetching token from metadata server

Which results in:

object_store::client::retry::Encountered transport error (error sending request for url (http://metadata/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken): error trying to connect: operation timed out) backing off for 0.1 seconds, retry 1 of 10
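To illustrate why a pending API call can end up timing out, here is a minimal sketch of a capped exponential backoff schedule. Only the first delay (0.1 seconds) and the retry limit (10) come from the log line above; the doubling factor, the cap, and the function name `backoff_delay` are assumptions made for illustration and are not taken from the object_store source.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (1-based).
/// Assumption: the delay doubles on every retry and is capped at `max`.
pub fn backoff_delay(attempt: u32, init: Duration, max: Duration) -> Duration {
    // Clamp the exponent so the shift below cannot overflow a u32.
    let exp = attempt.saturating_sub(1).min(31);
    let factor = 1u32 << exp;
    // Fall back to the cap if the multiplication overflows.
    init.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Sum of the delays over `retries` attempts, i.e. the time spent purely
/// waiting (connection timeouts not included) if every attempt fails.
pub fn total_backoff(retries: u32, init: Duration, max: Duration) -> Duration {
    (1..=retries).map(|n| backoff_delay(n, init, max)).sum()
}
```

Under these assumed parameters, a request that exhausts all 10 retries spends a noticeable amount of time waiting on top of each connection timeout, which would be consistent with the pending and timed-out API calls described above.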

So, in an attempt to reduce the number of object_store::gcp::credential calls to the metadata server, I switched from Kubernetes SA authentication to passing the serialised JSON of the SA credentials as the GOOGLE_SERVICE_ACCOUNT_KEY environment variable. This did reduce the number of calls but did not resolve the unresponsive Arroyo API issue. I also tried hacking object_store by forcing it to use the S3 client to authenticate to GCS with AWS_DEFAULT_REGION and AWS_ENDPOINT, but had no luck with the authentication.
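For context, the general way to cut down on calls to a credential endpoint is to cache the token and refetch it only shortly before it expires. The sketch below shows that idea using only the standard library. `Token`, `TokenCache`, `min_ttl`, and `fetch_count` are hypothetical names introduced here; this is not the object_store implementation, nor the change made in the linked pull requests.

```rust
use std::time::{Duration, Instant};

/// A bearer token together with the instant at which it stops being valid.
#[derive(Clone, Debug, PartialEq)]
pub struct Token {
    pub value: String,
    pub expires_at: Instant,
}

/// Caches a token and calls `fetch` only when the cached token is missing
/// or has `min_ttl` or less of validity left.
pub struct TokenCache<F: FnMut() -> Token> {
    fetch: F,
    cached: Option<Token>,
    min_ttl: Duration,
    /// Number of times `fetch` was invoked (useful to observe call volume).
    pub fetch_count: u64,
}

impl<F: FnMut() -> Token> TokenCache<F> {
    pub fn new(fetch: F, min_ttl: Duration) -> Self {
        Self { fetch, cached: None, min_ttl, fetch_count: 0 }
    }

    /// Returns a token that is valid for more than `min_ttl` after `now`,
    /// assuming `fetch` returns such a token.
    pub fn get(&mut self, now: Instant) -> Token {
        let fresh = match &self.cached {
            // `checked_duration_since` is None once the token has expired.
            Some(t) => t
                .expires_at
                .checked_duration_since(now)
                .map_or(false, |left| left > self.min_ttl),
            None => false,
        };
        if !fresh {
            self.cached = Some((self.fetch)());
            self.fetch_count += 1;
        }
        self.cached.clone().expect("cache was just populated")
    }
}
```

With a one-hour token and a `min_ttl` of 60 seconds, repeated `get` calls within the hour hit the cache and `fetch` runs again only in the last minute. One design caveat: the cache has to be shared across clients to help, since a cache per client instance would still multiply the calls.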

Here are the screenshots of what that looks like in practice:

Screenshot_2024-05-06_at_18 33 00 Screenshot 2024-05-06 at 18 34 45

Happy to help if there's anything I can do!

@qgab-flowdesk

Still facing weird UI behaviour with the cache mechanism in the v0.11.0 release, unfortunately :/ The workers run as expected, I think, but the checkpoint duration would suddenly go from a couple of seconds to several minutes, for all pipelines at the same moment. As for the UI, the pipelines page seems to return partial and varying results each time it is refreshed.

Screenshot 2024-07-04 at 16 38 13
Screenshot 2024-07-04 at 16 41 18
Screenshot 2024-07-04 at 16 41 41
Screenshot 2024-07-04 at 16 42 19

@mwylde mwylde reopened this Jul 8, 2024
@mwylde
Member

mwylde commented Oct 24, 2024

We believe we have found the root cause of this issue within the object_store crate, and have opened apache/arrow-rs#6625 there.

@mwylde mwylde added the bug Something isn't working label Oct 24, 2024
@mwylde
Member

mwylde commented Oct 29, 2024

Fixed in #770

@mwylde mwylde closed this as completed Oct 29, 2024
3 participants