504 Gateway timeout when running GetWordGraph #464
On the latest QA, for some (admittedly large) projects I am receiving 504 Gateway Timeout errors. These have occurred both today and late last week.
For example, calling:
Returns:

Comments
Follow-up: I re-ran the calls to getWordGraph, and they returned the (very large) results quickly. Is there some warm-up time required for SMT?
Yes, there is some warm-up time (loading the model) when the endpoint is first called. The Cloudflare timeouts are probably too short for these endpoints.
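If warm-up latency alone were the cause, a client-side retry after the first 504 would be a reasonable workaround. A minimal sketch, assuming a plain HttpClient and an illustrative word-graph URL and payload (not the exact Serval route):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

var client = new HttpClient { Timeout = TimeSpan.FromSeconds(120) };

async Task<HttpResponseMessage> GetWordGraphWithRetryAsync(string engineId, string segment)
{
    // Illustrative route and payload; the real Serval endpoint may differ.
    string url = $"https://qa.example.org/api/v1/translation/engines/{engineId}/get-word-graph";

    HttpResponseMessage response = await client.PostAsJsonAsync(url, segment);
    if (response.StatusCode == HttpStatusCode.GatewayTimeout)
    {
        // The first call can hit the gateway timeout while the SMT model loads;
        // a single retry after a short delay usually returns quickly.
        await Task.Delay(TimeSpan.FromSeconds(10));
        response = await client.PostAsJsonAsync(url, segment);
    }
    return response;
}
```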
As per Damien's insight, this is likely an engine lock not being released. We need to determine the best way to fix this. Options include:
It appears that one fix could be the following: the HTTP timeout is 60 seconds, and I believe that at most there will be one timeout per call. Therefore, we could make it 55 seconds.
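For what "adjustable through options" might look like, here is a hypothetical ASP.NET Core options class (the type and section names are illustrative, not Serval's actual configuration):

```csharp
using System;

// Hypothetical options type; binding it from configuration makes the timeout
// adjustable without a redeploy. The 55-second default stays just under the
// 60-second HTTP timeout mentioned above.
public class DistributedLockOptions
{
    public const string Key = "DistributedLock";

    public TimeSpan AcquireTimeout { get; set; } = TimeSpan.FromSeconds(55);
}

// Registration in Program.cs (standard options pattern):
// builder.Services.Configure<DistributedLockOptions>(
//     builder.Configuration.GetSection(DistributedLockOptions.Key));
```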
The lifetime is just for trying to acquire the lock, not for cancelling an existing lock. If a writer grabs the lock and holds onto it, we must assume that either (1) the process has exited in some weird way without releasing the lock, or (2) the process is hanging - not on the scoped calls, but on one of the locks that aren't scoped to an HTTP call. Moreover, if resetting the servers fixed it, it is likely that the "finally" code that clears the lock actually ran, which means it is (2): the process is hanging forever.
The lifetime is the max duration of the acquired lock. Once the lock expires, other callers can acquire a lock even if it has never been released.
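A minimal sketch of that lease-based behaviour against MongoDB, assuming a generic lock collection with `holder` and `expiresAt` fields (illustrative names, not Serval's actual schema): acquisition succeeds when the lock is free or its lease has lapsed, so a holder that never releases it cannot block others past the lifetime.

```csharp
using System;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class LeaseLock
{
    // Try to take the lock for `lifetime`. The lock document (one per engine) is assumed
    // to already exist. Acquisition succeeds if the lock is unheld or its lease has
    // expired, even if the previous holder never called release.
    public static async Task<bool> TryAcquireAsync(
        IMongoCollection<BsonDocument> locks, string lockId, string holderId, TimeSpan lifetime)
    {
        DateTime now = DateTime.UtcNow;
        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.Eq("_id", lockId),
            Builders<BsonDocument>.Filter.Or(
                Builders<BsonDocument>.Filter.Exists("holder", false),
                Builders<BsonDocument>.Filter.Lt("expiresAt", now)));
        var update = Builders<BsonDocument>.Update
            .Set("holder", holderId)
            .Set("expiresAt", now + lifetime);
        UpdateResult result = await locks.UpdateOneAsync(filter, update);
        return result.ModifiedCount == 1;
    }

    // Release by clearing the holder; a no-op if the lease already expired and was taken over.
    public static Task ReleaseAsync(IMongoCollection<BsonDocument> locks, string lockId, string holderId)
    {
        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.Eq("_id", lockId),
            Builders<BsonDocument>.Filter.Eq("holder", holderId));
        return locks.UpdateOneAsync(filter, Builders<BsonDocument>.Update.Unset("holder"));
    }
}
```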
Clearing the lock didn't help - but it may be the ClearML monitor.
I got the issue again - there were no writer locks being held onto - and it was still crashing with the timeouts (just on the cancel/delete/add endpoints), and resetting everything fixed it.
All of those endpoints try to acquire a writer lock, so it could be a reader lock that hasn't been released.
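To illustrate why that matters: the cancel/delete/add endpoints each need the writer side of the lock, and a writer can only proceed once no reader lease is still live, so a leaked reader without a lifetime blocks them until the gateway times out. A sketch extending the lease idea above to the writer side (field names still illustrative):

```csharp
using System;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class ReaderWriterLeaseLock
{
    // Writer acquisition succeeds only when there is no live writer lease and no live
    // reader lease; with lifetimes on reader locks, a leaked reader can only block
    // writers until its lease expires instead of forever.
    public static async Task<bool> TryAcquireWriterAsync(
        IMongoCollection<BsonDocument> locks, string lockId, string holderId, TimeSpan lifetime)
    {
        DateTime now = DateTime.UtcNow;
        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.Eq("_id", lockId),
            Builders<BsonDocument>.Filter.Or(
                Builders<BsonDocument>.Filter.Exists("writer", false),
                Builders<BsonDocument>.Filter.Lt("writerExpiresAt", now)),
            // No reader lease may still be in the future.
            Builders<BsonDocument>.Filter.Not(
                Builders<BsonDocument>.Filter.ElemMatch("readers",
                    Builders<BsonDocument>.Filter.Gt("expiresAt", now))));
        var update = Builders<BsonDocument>.Update
            .Set("writer", holderId)
            .Set("writerExpiresAt", now + lifetime)
            .Set("readers", new BsonArray());
        UpdateResult result = await locks.UpdateOneAsync(filter, update);
        return result.ModifiedCount == 1;
    }
}
```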
* Fix #464 - add lock lifetime for all
* Add HTTP timeout
* Make adjustable through options
* Will need to delete all locks from MongoDB - otherwise will endlessly loop on startup
* Fix some ci-e2e issues
* Only use locking when accessing SMT model
* Fix unit tests
* Update to latest version of Machine
* Fix bug where wrong id is used when starting a build
* Remove reference to Serval.Shared in Serval.Machine.Shared
* Preserve fix for #468
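On the "delete all locks from MongoDB" point: a one-off cleanup before restarting the services could look like the sketch below, using the MongoDB C# driver (the connection string, database, and collection names are assumptions, not necessarily Serval's):

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

// One-off cleanup: drop every persisted lock document before restarting, so stale
// locks written by the old code cannot wedge startup under the new lock handling.
var client = new MongoClient("mongodb://localhost:27017");
var locks = client.GetDatabase("serval").GetCollection<BsonDocument>("locks");
DeleteResult result = await locks.DeleteManyAsync(FilterDefinition<BsonDocument>.Empty);
Console.WriteLine($"Deleted {result.DeletedCount} lock documents.");
```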