504 Gateway timeout when running GetWordGraph #464

Closed
pmachapman opened this issue Aug 25, 2024 · 11 comments · Fixed by #473
@pmachapman
Collaborator

On the latest QA, I am receiving 504 Gateway Timeout errors for some (admittedly large) projects. These have occurred both today and late last week.

For example, calling:

curl -X 'POST' \
  'https://qa.serval-api.org/api/v1/translation/engines/6667aab0db23836577801280/get-word-graph' \
  -H 'accept: application/json' \
  -H 'authorization: Bearer !!!!!!REDACTED!!!!!!!!' \
  -H 'Content-Type: application/json' \
  -d '"And the LORD God planted a garden toward the east in Eden, and there he put the man whom he had formed."'

Returns:

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>


<title>qa.serval-api.org | 504: Gateway time-out</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/main.css" />


</head>
<body>
<div id="cf-wrapper">
    <div id="cf-error-details" class="p-0">
        <header class="mx-auto pt-10 lg:pt-6 lg:px-8 w-240 lg:w-full mb-8">
            <h1 class="inline-block sm:block sm:mb-2 font-light text-60 lg:text-4xl text-black-dark leading-tight mr-2">
              <span class="inline-block">Gateway time-out</span>
              <span class="code-label">Error code 504</span>
            </h1>
            <div>
               Visit <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=errorcode_504&utm_campaign=qa.serval-api.org" target="_blank" rel="noopener noreferrer">cloudflare.com</a> for more information.
            </div>
            <div class="mt-3">2024-08-25 19:43:18 UTC</div>
        </header>
        <div class="my-8 bg-gradient-gray">
            <div class="w-240 lg:w-full mx-auto">
                <div class="clearfix md:px-8">
                  
<div id="cf-browser-status" class=" relative w-1/3 md:w-full py-15 md:p-0 md:py-8 md:text-left md:border-solid md:border-0 md:border-b md:border-gray-400 overflow-hidden float-left md:float-none text-center">
  <div class="relative mb-10 md:m-0">
    
    <span class="cf-icon-browser block md:hidden h-20 bg-center bg-no-repeat"></span>
    <span class="cf-icon-ok w-12 h-12 absolute left-1/2 md:left-auto md:right-0 md:top-0 -ml-6 -bottom-4"></span>
    
  </div>
  <span class="md:block w-full truncate">You</span>
  <h3 class="md:inline-block mt-3 md:mt-0 text-2xl text-gray-600 font-light leading-1.3">
    
    Browser
    
  </h3>
  <span class="leading-1.3 text-2xl text-green-success">Working</span>
</div>

<div id="cf-cloudflare-status" class=" relative w-1/3 md:w-full py-15 md:p-0 md:py-8 md:text-left md:border-solid md:border-0 md:border-b md:border-gray-400 overflow-hidden float-left md:float-none text-center">
  <div class="relative mb-10 md:m-0">
    <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=errorcode_504&utm_campaign=qa.serval-api.org" target="_blank" rel="noopener noreferrer">
    <span class="cf-icon-cloud block md:hidden h-20 bg-center bg-no-repeat"></span>
    <span class="cf-icon-ok w-12 h-12 absolute left-1/2 md:left-auto md:right-0 md:top-0 -ml-6 -bottom-4"></span>
    </a>
  </div>
  <span class="md:block w-full truncate">Auckland</span>
  <h3 class="md:inline-block mt-3 md:mt-0 text-2xl text-gray-600 font-light leading-1.3">
    <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=errorcode_504&utm_campaign=qa.serval-api.org" target="_blank" rel="noopener noreferrer">
    Cloudflare
    </a>
  </h3>
  <span class="leading-1.3 text-2xl text-green-success">Working</span>
</div>

<div id="cf-host-status" class="cf-error-source relative w-1/3 md:w-full py-15 md:p-0 md:py-8 md:text-left md:border-solid md:border-0 md:border-b md:border-gray-400 overflow-hidden float-left md:float-none text-center">
  <div class="relative mb-10 md:m-0">
    
    <span class="cf-icon-server block md:hidden h-20 bg-center bg-no-repeat"></span>
    <span class="cf-icon-error w-12 h-12 absolute left-1/2 md:left-auto md:right-0 md:top-0 -ml-6 -bottom-4"></span>
    
  </div>
  <span class="md:block w-full truncate">qa.serval-api.org</span>
  <h3 class="md:inline-block mt-3 md:mt-0 text-2xl text-gray-600 font-light leading-1.3">
    
    Host
    
  </h3>
  <span class="leading-1.3 text-2xl text-red-error">Error</span>
</div>

                </div>
            </div>
        </div>

        <div class="w-240 lg:w-full mx-auto mb-8 lg:px-8">
            <div class="clearfix">
                <div class="w-1/2 md:w-full float-left pr-6 md:pb-10 md:pr-0 leading-relaxed">
                    <h2 class="text-3xl font-normal leading-1.3 mb-4">What happened?</h2>
                    <p>The web server reported a gateway time-out error.</p>
                </div>
                <div class="w-1/2 md:w-full float-left leading-relaxed">
                    <h2 class="text-3xl font-normal leading-1.3 mb-4">What can I do?</h2>
                    <p class="mb-6">Please try again in a few minutes.</p>
                </div>
            </div>
        </div>

        <div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">
  <p class="text-13">
    <span class="cf-footer-item sm:block sm:mb-1">Cloudflare Ray ID: <strong class="font-semibold">8b8e10850a9c508c</strong></span>
    <span class="cf-footer-separator sm:hidden">&bull;</span>
    <span id="cf-footer-item-ip" class="cf-footer-item hidden sm:block sm:mb-1">
      Your IP:
      <button type="button" id="cf-footer-ip-reveal" class="cf-footer-ip-reveal-btn">Click to reveal</button>
      <span class="hidden" id="cf-footer-ip">203.211.73.132</span>
      <span class="cf-footer-separator sm:hidden">&bull;</span>
    </span>
    <span class="cf-footer-item sm:block sm:mb-1"><span>Performance &amp; security by</span> <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing?utm_source=errorcode_504&utm_campaign=qa.serval-api.org" id="brand_link" target="_blank">Cloudflare</a></span>
    
  </p>
  <script>(function(){function d(){var b=a.getElementById("cf-footer-item-ip"),c=a.getElementById("cf-footer-ip-reveal");b&&"classList"in b&&(b.classList.remove("hidden"),c.addEventListener("click",function(){c.classList.add("hidden");a.getElementById("cf-footer-ip").classList.remove("hidden")}))}var a=document;document.addEventListener&&a.addEventListener("DOMContentLoaded",d)})();</script>
</div><!-- /.error-footer -->


    </div>
</div>
</body>
</html>
@pmachapman pmachapman added the bug Something isn't working label Aug 25, 2024
@pmachapman
Collaborator Author

pmachapman commented Aug 25, 2024

Follow up: I re-ran the calls to getWordGraph, and they returned (the very large) results quickly.

Is there some warm-up time required for SMT?

@ddaspit
Contributor

ddaspit commented Aug 26, 2024

Yes, there is some warm-up time (loading the model) when first calling the endpoint. The Cloudflare timeouts are probably too short for these endpoints.
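
Since the first call can time out while the model loads and a re-run returns quickly, a caller could retry after a 504. The following is only a minimal client-side sketch in Python, not part of Serval: `call_with_retry` and the `send` callable (assumed to return a `(status, body)` tuple) are hypothetical names, and the approach works around the warm-up rather than fixing the timeout.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, str]


def call_with_retry(
    send: Callable[[], Response],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and retry while it returns HTTP 504, up to `attempts` calls."""
    status, body = send()
    for _ in range(attempts - 1):
        if status != 504:
            break
        sleep(delay_seconds)  # give the model time to finish loading
        status, body = send()
    return status, body


# Simulated endpoint (hypothetical): times out once, then answers.
calls = []


def fake_send() -> Response:
    calls.append(1)
    return (504, "Gateway time-out") if len(calls) == 1 else (200, "word graph")


# sleep is stubbed out so the demo does not actually wait.
result = call_with_retry(fake_send, attempts=3, delay_seconds=0.0, sleep=lambda s: None)
```

In a real client, `send` would wrap the HTTP request shown in the original report.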

@johnml1135
Collaborator

As per Damien's insight, this is likely an engine lock not being released. We need to determine the best way to fix this. Options include:

  • Setting an expiration time for the lockouts
  • Making sure that there is a "finally" for locks
  • Something else?
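
The "finally" option can be illustrated with a small Python sketch. This is not Serval code (Serval is written in C#); `engine_lock` and `held` are hypothetical stand-ins that only show the pattern of releasing a lock even when the guarded work fails.

```python
import threading
from contextlib import contextmanager

engine_lock = threading.Lock()  # hypothetical stand-in for a per-engine lock


@contextmanager
def held(lock: threading.Lock):
    """Hold `lock` for the duration of the with-block and always release it."""
    lock.acquire()
    try:
        yield
    finally:
        lock.release()  # runs even if the body raises


def risky_operation() -> None:
    with held(engine_lock):
        raise RuntimeError("simulated failure while holding the lock")


try:
    risky_operation()
except RuntimeError:
    pass

# The lock is free again despite the exception.
released = not engine_lock.locked()
```

A "finally" only helps when the holder fails or returns; it does nothing for a holder that hangs, which is where an expiration time comes in.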

@github-project-automation github-project-automation bot moved this to 🆕 New in Serval Aug 29, 2024
@johnml1135 johnml1135 moved this from 🆕 New to 🔖 Ready in Serval Aug 29, 2024
@johnml1135
Collaborator

One possible fix: in DistributedReaderWriterLock.WriterLockAsync, use a standard timeout for all lock acquisitions, and if the acquisition times out, log it as an error and throw an exception. This lock is used in 20+ places, all over the code. We should expect the lock never to time out, but if it does, the timeout will keep the engine from coming to a standstill and will give us some breadcrumbs as to what failed and prevented it from continuing.

The HTTP timeout is 60 seconds, and I believe that at most there will be one timeout per call. Therefore, we could make it 55 seconds.
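
A minimal Python sketch of that idea, assuming an in-process `asyncio.Lock` in place of the distributed lock. `acquire_with_timeout` and `LockTimeoutError` are hypothetical names, and the demo uses a very short timeout rather than the 55 seconds suggested above so that it finishes quickly.

```python
import asyncio
import logging

logger = logging.getLogger("lock-sketch")


class LockTimeoutError(Exception):
    """Raised when the lock cannot be acquired within the timeout."""


async def acquire_with_timeout(lock: asyncio.Lock, timeout: float) -> None:
    """Acquire `lock`, or log an error and raise if it takes longer than `timeout`."""
    try:
        await asyncio.wait_for(lock.acquire(), timeout)
    except asyncio.TimeoutError:
        logger.error("Could not acquire lock within %.2f seconds", timeout)
        raise LockTimeoutError(f"lock not acquired within {timeout} s") from None


async def demo() -> str:
    lock = asyncio.Lock()
    await lock.acquire()  # simulate a holder that never releases
    try:
        await acquire_with_timeout(lock, timeout=0.05)
    except LockTimeoutError:
        return "timed out"
    return "acquired"


result = asyncio.run(demo())
```

The caller fails fast with a logged error instead of waiting until the HTTP timeout is hit.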

@johnml1135
Collaborator

The lifetime is just for trying to acquire the lock, not for cancelling an existing lock. If a writer lock grabs it and holds onto it, we must assume that either (1) the process has exited in some weird way where it has not released the lock or (2) the process is hanging - not for scoped calls but for some locks that aren't scoped to an HTTP call. Moreover, if resetting the servers fixed it, it is likely that the "finally" code of clearing the lock actually happened, which means that it is (2), the process is hanging forever.
Here is a proposed way to address the surface issue (processes hang and are not terminated) and figure out which thing is actually hanging.

@ddaspit
Contributor

ddaspit commented Aug 30, 2024

The lifetime is the max duration of the acquired lock. Once the lock expires, other callers can acquire a lock even if it has never been released.
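
The lifetime semantics described here can be sketched as a lease. This is an in-memory Python illustration only, not the MongoDB-backed implementation; `LeaseLock` and its methods are hypothetical names, and the clock is injected so the example does not need to sleep.

```python
import time
from typing import Callable, Optional


class LeaseLock:
    """A lock whose ownership lapses once its lifetime has passed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._owner: Optional[str] = None
        self._expires_at = 0.0

    def try_acquire(self, owner: str, lifetime: float) -> bool:
        now = self._clock()
        # An unexpired lock held by someone blocks every new caller.
        if self._owner is not None and now < self._expires_at:
            return False
        # Free, or expired without ever being released: take it over.
        self._owner = owner
        self._expires_at = now + lifetime
        return True

    def release(self, owner: str) -> None:
        if self._owner == owner:
            self._owner = None


now = [0.0]  # fake clock value, advanced manually
lock = LeaseLock(clock=lambda: now[0])

first = lock.try_acquire("A", lifetime=10.0)   # A takes the lock
second = lock.try_acquire("B", lifetime=10.0)  # B is refused while the lease is live
now[0] = 11.0                                  # A never releases, but the lease lapses
third = lock.try_acquire("B", lifetime=10.0)   # B can now acquire
```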

@github-project-automation github-project-automation bot moved this from 🔖 Ready to ✅ Done in Serval Sep 3, 2024
@ddaspit ddaspit reopened this Sep 3, 2024
@ddaspit ddaspit moved this from ✅ Done to 🔖 Ready in Serval Sep 3, 2024
johnml1135 added a commit that referenced this issue Sep 3, 2024
@johnml1135
Collaborator

Confirmed it's a lock that doesn't die:
[image not preserved]

@johnml1135
Collaborator

Clearing the lock didn't help - but it may be the ClearML monitor.

@johnml1135
Collaborator

I got the issue again. There were no writer locks being held, and it was still crashing with timeouts (just on the cancel/delete/add endpoints). Resetting everything fixed it.

@ddaspit
Contributor

ddaspit commented Sep 4, 2024

All of those endpoints try to acquire a writer lock, so it could be a reader lock that hasn't been released.
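
To see why an unreleased reader lock is enough to stall those endpoints, here is a small in-process Python sketch. `ReaderWriterLock` is a hypothetical class written for this illustration, not Serval's DistributedReaderWriterLock.

```python
import threading


class ReaderWriterLock:
    """Many readers or one writer; a writer must wait for all readers to leave."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self) -> None:
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self) -> None:
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self, timeout: float) -> bool:
        """Return True if the write lock was obtained within `timeout` seconds."""
        with self._cond:
            acquired = self._cond.wait_for(
                lambda: not self._writer and self._readers == 0, timeout
            )
            if acquired:
                self._writer = True
            return acquired

    def release_write(self) -> None:
        with self._cond:
            self._writer = False
            self._cond.notify_all()


lock = ReaderWriterLock()
lock.acquire_read()                           # a reader that is never released
blocked = lock.acquire_write(timeout=0.05)    # writer gives up: reader still present
lock.release_read()                           # once the reader is gone...
unblocked = lock.acquire_write(timeout=0.05)  # ...the writer gets through
```

No writer lock is held at any point before the final call, yet the writer still cannot proceed until the reader goes away.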

johnml1135 added a commit that referenced this issue Sep 6, 2024
* Add HTTP timeout
* Make adjustable through options
* Will need to delete all locks from MongoDB - otherwise will endlessly loop for startup
johnml1135 added a commit that referenced this issue Sep 6, 2024
* Add HTTP timeout
* Make adjustable through options
* Will need to delete all locks from MongoDB - otherwise will endlessly loop for startup
* Fix some ci-e2e issues
johnml1135 added a commit that referenced this issue Sep 10, 2024
* Add HTTP timeout
* Make adjustable through options
* Will need to delete all locks from MongoDB - otherwise will endlessly loop for startup
* Fix some ci-e2e issues

Only use locking when accessing SMT model

Fix unit tests

Update to latest version of Machine

Fix bug where wrong id is used when starting a build

Remove reference to Serval.Shared in Serval.Machine.Shared
johnml1135 added a commit that referenced this issue Sep 10, 2024
* Fix #464 - add lock lifetime for all
* Add HTTP timeout
* Make adjustable through options
* Will need to delete all locks from MongoDB - otherwise will endlessly loop for startup
* Fix some ci-e2e issues

Only use locking when accessing SMT model

Fix unit tests

Update to latest version of Machine

Fix bug where wrong id is used when starting a build

Remove reference to Serval.Shared in Serval.Machine.Shared

* preserve fix for #468
@github-project-automation github-project-automation bot moved this from 🔖 Ready to ✅ Done in Serval Sep 10, 2024
3 participants