Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate cause of API outage on April 7 2020 #270

Open
Kilo59 opened this issue Apr 9, 2020 · 6 comments
Open

Investigate cause of API outage on April 7 2020 #270

Kilo59 opened this issue Apr 9, 2020 · 6 comments
Assignees
Labels
bug Something isn't working down Related to API availability help wanted Extra attention is needed question Further information is requested

Comments

@Kilo59
Copy link
Collaborator

Kilo59 commented Apr 9, 2020

Opening this issue to continue the discussion of the Outage on April 7th.
#264

Metrics From the time of the Outage and afterwords

I will update this comment with new information as needed.

Notes and resources

  1. https://help.heroku.com/PFSOIDTR/why-am-i-seeing-h12-request-timeouts-high-response-times-in-my-app
  2. https://devcenter.heroku.com/articles/python-gunicorn

Possible causes

  1. memory leak
  2. Async routes have a blocking call that is blocking the event loop.
  3. Gunicorn killing slow workers
@Kilo59 Kilo59 added bug Something isn't working question Further information is requested labels Apr 9, 2020
@Kilo59 Kilo59 mentioned this issue Apr 9, 2020
@Kilo59 Kilo59 added the help wanted Extra attention is needed label Apr 9, 2020
@toxyl
Copy link

toxyl commented Apr 9, 2020

Can you install monitoring software like Munin? Might help to spot issues. My nodes both have a Munin installed. Should a node become unavailable I'll check if I can spot something useful.

@toxyl
Copy link

toxyl commented Apr 9, 2020

@Kilo59 I was checking the FastAPI repo to see if there are any known issues that might match and I did find one that could be related: fastapi/fastapi#596 and this might be useful for debugging: https://stackoverflow.com/questions/15455048/releasing-memory-in-python

I've also done a load test on my mirror to see if I can provoke a failure or something else that might shed some light on the issue. On three servers I started in total 72 scripts curling the /all endpoint every second. Eventually one node did stop responding, but it wasn't enough to take either one out entirely.

When starting the load test the average request duration dropped a little (300ms to 220ms), then it started shooting up to 2270ms. After stopping the test it dropped down to 1360ms and stayed there for another 25 minutes and then gradually dropped down to 1310ms and seems to stay there.

IMHO that can have two possible causes:
a) The way DO measures the average request duration is flawed and does not reflect the real value
b) Something happens on the nodes (within the Docker containers, not the nodes themselves) that causes them to react slower, e.g. open processes / file handles / event loops that are not closed properly. Which I assumed already because I see the average request duration goes up every 1-1.5h hours - obviously that assumption only works if a) is not the case.

I've also checked the container logs but I didn't find anything suspicious.

Now both nodes seem to have cooled down and in my Munin stats I do see an interesting difference between the nodes. On both nodes the active memory consumption went up during the load test and seemed to stabilize after stopping the test. The second node experienced a sharper increase of active memory and now, more than an hour after stopping the test, I see it rising again. Let's see if that continues, if so then we might have a culprit.

@toxyl
Copy link

toxyl commented Apr 9, 2020

Looks like it has stabilized, here's the graphs:

Munin

cvtapi-munin
Ignore the drop in the middle, that was another test.

DigitalOcean

cvtapi-loadbalancer

@toxyl
Copy link

toxyl commented Apr 11, 2020

Not sure what to make of this. I've started another test with simply curling every minute for bout a day. From the moment I started the test the Average Request Duration steadily dropped, after 12 hours it was down to about 100ms and stayed there. But from 200ms down I saw an increase in 4xx responses. I just stopped the test, let's see if the Average Request Duration starts to increase again.

Memory wise it hadn't stabilized as I thought but slowly kept rising but eventually dropped down.

  1. https://help.heroku.com/PFSOIDTR/why-am-i-seeing-h12-request-timeouts-high-response-times-in-my-app
    Sometimes this can turn into a bigger problem during restarts. Process forking done by Unicorn can be long/expensive to start up. If these workers take too long to start up or take too long on their first request, Unicorn will kill the processes and start up a new worker. This can result in a seemingly unexplained death spiral of workers being killed on startup, and when there aren't enough workers to handle requests, requests timeout, and workers are once again killed.
    Since the workers are sent a SIGKILL (no SIGTERM) they aren't able to log any errors during this time, so most of the time you'll just see a SIGKILL in your logs when this happens and nothing else.

That sounds like a potential reason. My Munin does have a couple of stats that should stick out if that happens.

@Kilo59
Copy link
Collaborator Author

Kilo59 commented Apr 11, 2020

Duplicate Log messages

@toxyl @Bost
Here's another possible explanation for the duplicate log message.
Looks like it was fixed upstream yesterday. Not sure if it's been cut into a new release yet.
encode/uvicorn#614 (comment)

Memory Leak

https://devcenter.heroku.com/articles/python-gunicorn#max-request-recycling

If your application suffers from memory leaks, you can configure Gunicorn to gracefully restart a worker after it has processed a given number of requests. This can be a convenient way to help limit the effects of the memory leak.

@toxyl However seems like this seems like it could exacerbate the process killing issue (if that is what was happening).

Sometimes this can turn into a bigger problem during restarts. Process forking done by Unicorn can be long/expensive to start up. If these workers take too long to start up or take too long on their first request, Unicorn will kill the processes and start up a new worker.

The first time a worker restarts and a request is fired it has to build its data cache. I don't know if this is a long enough time to be "killed" but it could be. It might be a good idea to use a start_up/background task to build this cache before the first request of a new worker is processed.
https://fastapi.tiangolo.com/advanced/events/
https://fastapi.tiangolo.com/tutorial/background-tasks/

@Kilo59
Copy link
Collaborator Author

Kilo59 commented Apr 14, 2020

Just switched over to LogDNA and so far I like it.
Looks like we are having worker timeouts pretty frequently.

Apr 13 19:20:28 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:20:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:191)
Apr 13 19:25:15 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:15 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:281)
Apr 13 19:25:29 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:286)
Apr 13 19:25:48 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:47 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:291)
Apr 13 19:26:00 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:00 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:296)
Apr 13 19:26:19 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:19 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:301)
Apr 13 19:26:32 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:32 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:306)
Apr 13 19:26:50 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:50 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:311)
Apr 13 19:27:03 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:03 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:316)
Apr 13 19:27:21 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:21 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:321)
Apr 13 19:27:34 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:34 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:326)
Apr 13 19:27:53 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:52 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:331)
Apr 13 19:28:06 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:05 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:336)
Apr 13 19:28:24 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:23 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:341)
Apr 13 19:28:37 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:37 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:346)
Apr 13 19:28:55 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:55 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:351)
Apr 13 19:29:08 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:08 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:356)
Apr 13 19:29:26 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:26 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:361)
Apr 13 19:29:39 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:39 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:366)
Apr 13 19:29:57 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:57 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:371)
Apr 13 19:30:10 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:10 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:376)
Apr 13 19:30:29 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:381)
Apr 13 19:30:42 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:41 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:386)
Apr 13 19:31:01 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:01 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:391)
Apr 13 19:31:13 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:13 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:396)
Apr 13 19:31:32 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:32 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:401)
Apr 13 19:31:44 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:44 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:406)
Apr 13 19:32:04 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:04 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:411)
Apr 13 19:32:15 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:15 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:416)
Apr 13 19:32:47 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:46 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:426)
Apr 13 19:33:19 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:33:18 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:431)
Apr 13 19:39:02 coronavirus-tracker-api app[web] ERROR [2020-04-13 23:39:01 +0000] [436] [ERROR] Exception in ASGI application
Apr 13 20:09:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 00:09:01 +0000] [436] [ERROR] Exception in ASGI application
Apr 13 20:33:52 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 00:33:52 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:436)
Apr 13 20:39:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 00:39:01 +0000] [421] [ERROR] Exception in ASGI application
Apr 13 21:00:18 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 01:00:17 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:441)
Apr 13 21:00:49 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 01:00:48 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:446)
Apr 13 21:09:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 01:09:02 +0000] [421] [ERROR] Exception in ASGI application

@Kilo59 Kilo59 added the down Related to API availability label Apr 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working down Related to API availability help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants