Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: address multiquery long timeouts causing excessive queue growth … #3284

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

gvelez17
Copy link
Contributor

@gvelez17 gvelez17 commented Oct 10, 2024

During a period of heavy traffic and possible IPFS overload or failures, many errors of the following form occurred:

w8y6ppdchg8sfi0dnlcoixh4j5liiqhloal1dlz8yndht0nxxdafp at time undefined as part of a multiQuery request: Error: Timeout after 30000ms

Approximately 292K in one day.

We also have been seeing dramatic queue size growth and p-queue memory usage during these periods of intense load. This could be related to the large number of streams waiting for the 90 sec timeout, which still eventually fails. It would be better to fail sooner and not increase the load further.

Include relevant motivation, context, brief description and impact of the change(s). List follow-up tasks here.

How Has This Been Tested?

Describe the tests that you ran to verify your changes. Provide instructions for reproduction.

  • Test A (e.g. Test A - New test that ... ran in local, docker, and dev unstable.)
  • Test B

PR checklist

Before submitting this PR, please make sure:

  • I have tagged the relevant reviewers and interested parties
  • I have updated the READMEs of affected packages
  • I have made corresponding changes to the documentation

References:

Please list relevant documentation (e.g. tech specs, articles, related work etc.) relevant to this change, and note if the documentation has been updated.

@@ -67,7 +67,7 @@ import { LevelKVFactory } from './store/level-kv-factory.js'
const METRICS_CALLER_NAME = 'js-ceramic'
const DEFAULT_CACHE_LIMIT = 500 // number of streams stored in the cache
const DEFAULT_QPS_LIMIT = 10 // Max number of pubsub query messages that can be published per second without rate limiting
const DEFAULT_MULTIQUERY_TIMEOUT_MS = 30 * 1000
const DEFAULT_MULTIQUERY_TIMEOUT_MS = 3 * 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not positive, but I believe when this timeout fires it will only fail the multiquery request but not clean up the work happening in the background in the execution queue. I think probably you want to pair this with also turning down the IPFS timeouts in the dispatcher.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i had made a separate PR for those since i only saw the multiquery errors in the logs. we could do #3285 instead. I thought that might require more testing if its hitting more locations.

@gvelez17 gvelez17 mentioned this pull request Oct 10, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants