504 Gateway timeout when running GetWordGraph #464
On the latest QA, for some (admittedly large) projects I am receiving 504 Gateway Timeout errors. These have occurred both today and late last week.
For example, calling:
Returns:

Comments
Follow-up: I re-ran the calls to getWordGraph, and they returned the (very large) results quickly. Is there some warm-up time required for SMT?
Yes, there is some warm-up time (loading the model) when the endpoint is first called. The Cloudflare timeouts are probably too short for these endpoints.
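If warm-up latency alone were the cause, a client-side retry after the first 504 would be a reasonable workaround. A minimal sketch, assuming a plain HttpClient and an illustrative word-graph URL and payload (not the exact Serval route):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

var client = new HttpClient { Timeout = TimeSpan.FromSeconds(120) };

async Task<HttpResponseMessage> GetWordGraphWithRetryAsync(string engineId, string segment)
{
    // Illustrative route and payload; the real Serval endpoint may differ.
    string url = $"https://qa.example.org/api/v1/translation/engines/{engineId}/get-word-graph";

    HttpResponseMessage response = await client.PostAsJsonAsync(url, segment);
    if (response.StatusCode == HttpStatusCode.GatewayTimeout)
    {
        // The first call can hit the gateway timeout while the SMT model loads;
        // a single retry after a short delay usually returns quickly.
        await Task.Delay(TimeSpan.FromSeconds(10));
        response = await client.PostAsJsonAsync(url, segment);
    }
    return response;
}
```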
As per Damien's insight, this is likely an engine lock not being released. We need to determine the best way to fix this. Options include:
It appears that one fix could be the following: the HTTP timeout is 60 seconds, and I believe that at most there will be one timeout per call. Therefore, we could make it 55 seconds.
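For what "adjustable through options" might look like, here is a hypothetical ASP.NET Core options class (the type and section names are illustrative, not Serval's actual configuration):

```csharp
using System;

// Hypothetical options type; binding it from configuration makes the timeout
// adjustable without a redeploy. The 55-second default stays just under the
// 60-second HTTP timeout mentioned above.
public class DistributedLockOptions
{
    public const string Key = "DistributedLock";

    public TimeSpan AcquireTimeout { get; set; } = TimeSpan.FromSeconds(55);
}

// Registration in Program.cs (standard options pattern):
// builder.Services.Configure<DistributedLockOptions>(
//     builder.Configuration.GetSection(DistributedLockOptions.Key));
```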
The lifetime is just for trying to acquire the lock, not for cancelling an existing lock. If a writer grabs the lock and holds onto it, we must assume that either (1) the process has exited in some weird way without releasing the lock, or (2) the process is hanging - not on the scoped calls, but on one of the locks that aren't scoped to an HTTP call. Moreover, if resetting the servers fixed it, it is likely that the "finally" code that clears the lock actually ran, which means it is (2): the process is hanging forever.
The lifetime is the max duration of the acquired lock. Once the lock expires, other callers can acquire a lock even if it has never been released.
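A minimal sketch of that lease-based behaviour against MongoDB, assuming a generic lock collection with `holder` and `expiresAt` fields (illustrative names, not Serval's actual schema): acquisition succeeds when the lock is free or its lease has lapsed, so a holder that never releases it cannot block others past the lifetime.

```csharp
using System;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class LeaseLock
{
    // Try to take the lock for `lifetime`. The lock document (one per engine) is assumed
    // to already exist. Acquisition succeeds if the lock is unheld or its lease has
    // expired, even if the previous holder never called release.
    public static async Task<bool> TryAcquireAsync(
        IMongoCollection<BsonDocument> locks, string lockId, string holderId, TimeSpan lifetime)
    {
        DateTime now = DateTime.UtcNow;
        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.Eq("_id", lockId),
            Builders<BsonDocument>.Filter.Or(
                Builders<BsonDocument>.Filter.Exists("holder", false),
                Builders<BsonDocument>.Filter.Lt("expiresAt", now)));
        var update = Builders<BsonDocument>.Update
            .Set("holder", holderId)
            .Set("expiresAt", now + lifetime);
        UpdateResult result = await locks.UpdateOneAsync(filter, update);
        return result.ModifiedCount == 1;
    }

    // Release by clearing the holder; a no-op if the lease already expired and was taken over.
    public static Task ReleaseAsync(IMongoCollection<BsonDocument> locks, string lockId, string holderId)
    {
        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.Eq("_id", lockId),
            Builders<BsonDocument>.Filter.Eq("holder", holderId));
        return locks.UpdateOneAsync(filter, Builders<BsonDocument>.Update.Unset("holder"));
    }
}
```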
Clearing the lock didn't help - but it may be the ClearML monitor.
I got the issue again - there were no writer locks being held onto - and it was still crashing with the timeouts (just on the cancel/delete/add endpoints), and resetting everything fixed it.
All of those endpoints try to acquire a writer lock, so it could be a reader lock that hasn't been released.
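To illustrate why that matters: the cancel/delete/add endpoints each need the writer side of the lock, and a writer can only proceed once no reader lease is still live, so a leaked reader without a lifetime blocks them until the gateway times out. A sketch extending the lease idea above to the writer side (field names still illustrative):

```csharp
using System;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class ReaderWriterLeaseLock
{
    // Writer acquisition succeeds only when there is no live writer lease and no live
    // reader lease; with lifetimes on reader locks, a leaked reader can only block
    // writers until its lease expires instead of forever.
    public static async Task<bool> TryAcquireWriterAsync(
        IMongoCollection<BsonDocument> locks, string lockId, string holderId, TimeSpan lifetime)
    {
        DateTime now = DateTime.UtcNow;
        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.Eq("_id", lockId),
            Builders<BsonDocument>.Filter.Or(
                Builders<BsonDocument>.Filter.Exists("writer", false),
                Builders<BsonDocument>.Filter.Lt("writerExpiresAt", now)),
            // No reader lease may still be in the future.
            Builders<BsonDocument>.Filter.Not(
                Builders<BsonDocument>.Filter.ElemMatch("readers",
                    Builders<BsonDocument>.Filter.Gt("expiresAt", now))));
        var update = Builders<BsonDocument>.Update
            .Set("writer", holderId)
            .Set("writerExpiresAt", now + lifetime)
            .Set("readers", new BsonArray());
        UpdateResult result = await locks.UpdateOneAsync(filter, update);
        return result.ModifiedCount == 1;
    }
}
```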
* Fix #464 - add lock lifetime for all
* Add HTTP timeout
* Make adjustable through options
* Will need to delete all locks from MongoDB - otherwise will endlessly loop on startup
* Fix some ci-e2e issues
* Only use locking when accessing SMT model
* Fix unit tests
* Update to latest version of Machine
* Fix bug where wrong id is used when starting a build
* Remove reference to Serval.Shared in Serval.Machine.Shared
* Preserve fix for #468
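On the "delete all locks from MongoDB" point: a one-off cleanup before restarting the services could look like the sketch below, using the MongoDB C# driver (the connection string, database, and collection names are assumptions, not necessarily Serval's):

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

// One-off cleanup: drop every persisted lock document before restarting, so stale
// locks written by the old code cannot wedge startup under the new lock handling.
var client = new MongoClient("mongodb://localhost:27017");
var locks = client.GetDatabase("serval").GetCollection<BsonDocument>("locks");
DeleteResult result = await locks.DeleteManyAsync(FilterDefinition<BsonDocument>.Empty);
Console.WriteLine($"Deleted {result.DeletedCount} lock documents.");
```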