Investigate cause of API outage on April 7 2020 #270
Comments
Can you install monitoring software like Munin? It might help to spot issues. Both of my nodes have Munin installed. Should a node become unavailable, I'll check whether I can spot something useful.
@Kilo59 I was checking the FastAPI repo to see if there are any known issues that might match, and I did find one that could be related: fastapi/fastapi#596. This might also be useful for debugging: https://stackoverflow.com/questions/15455048/releasing-memory-in-python. I've also done a load test on my mirror to see if I can provoke a failure or something else that might shed some light on the issue. On three servers I started 72 scripts in total, curling the […] endpoint. When starting the load test, the […]. IMHO that can have two possible causes: […]. I've also checked the container logs, but I didn't find anything suspicious. Now both nodes seem to have cooled down, and in my Munin stats I do see an interesting difference between the nodes. On both nodes the […].
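Regarding the Stack Overflow link on releasing memory in Python: a minimal `tracemalloc` sketch for narrowing down where memory is retained between snapshots. The workload here is a hypothetical stand-in for a request handler, not this project's actual code.

```python
import tracemalloc

# Take a baseline snapshot, run some allocation-heavy work,
# then compare snapshots to see which lines retain the most memory.
tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Simulated workload (hypothetical stand-in for a request handler
# that builds a data cache).
data = [str(i) * 100 for i in range(10_000)]

after = tracemalloc.take_snapshot()
top_stats = after.compare_to(baseline, "lineno")
for stat in top_stats[:3]:
    print(stat)
```

Running something like this inside a long-lived worker (e.g. dumping the top diffs on a debug endpoint) would show whether memory growth comes from the data cache or from per-request allocations that never get released.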
Not sure what to make of this. I've started another test, simply curling every minute for about a day. From the moment I started the test, the […]. Memory-wise it hadn't stabilized as I thought; it slowly kept rising but eventually dropped down.
That sounds like a potential reason. My Munin has a couple of stats that should stand out if that happens.
Duplicate log messages: @toxyl @Bost
Memory leak: https://devcenter.heroku.com/articles/python-gunicorn#max-request-recycling
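The Heroku article above suggests recycling gunicorn workers after a fixed number of requests so a slow leak can't grow unbounded. A sketch of what that could look like in a `gunicorn.conf.py`; the values are illustrative assumptions, not tuned for this project:

```python
# gunicorn.conf.py -- sketch based on Heroku's max-request-recycling advice.

# Restart each worker after it has handled this many requests,
# so a slow memory leak cannot accumulate indefinitely.
max_requests = 1000

# Add random jitter so all workers do not restart at the same moment
# (a simultaneous restart would briefly take the whole app down).
max_requests_jitter = 50

# How long a worker may stay unresponsive before the master kills it;
# the default of 30 s is what produces the WORKER TIMEOUT log lines.
timeout = 30
```

The jitter setting matters here: without it, every worker would hit its request limit around the same time and restart together, which would look a lot like the outage being investigated.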
@toxyl However, it seems like this could exacerbate the process-killing issue (if that is what was happening).
The first time a worker restarts and a request comes in, it has to build its data cache. I don't know whether that takes long enough for the worker to be killed, but it could. It might be a good idea to use a startup/background task to build this cache before a new worker processes its first request.
Just switched over to LogDNA and so far I like it.
Apr 13 19:20:28 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:20:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:191)
Apr 13 19:25:15 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:15 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:281)
Apr 13 19:25:29 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:286)
Apr 13 19:25:48 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:47 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:291)
Apr 13 19:26:00 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:00 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:296)
Apr 13 19:26:19 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:19 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:301)
Apr 13 19:26:32 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:32 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:306)
Apr 13 19:26:50 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:50 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:311)
Apr 13 19:27:03 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:03 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:316)
Apr 13 19:27:21 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:21 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:321)
Apr 13 19:27:34 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:34 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:326)
Apr 13 19:27:53 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:52 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:331)
Apr 13 19:28:06 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:05 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:336)
Apr 13 19:28:24 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:23 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:341)
Apr 13 19:28:37 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:37 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:346)
Apr 13 19:28:55 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:55 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:351)
Apr 13 19:29:08 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:08 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:356)
Apr 13 19:29:26 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:26 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:361)
Apr 13 19:29:39 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:39 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:366)
Apr 13 19:29:57 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:57 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:371)
Apr 13 19:30:10 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:10 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:376)
Apr 13 19:30:29 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:381)
Apr 13 19:30:42 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:41 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:386)
Apr 13 19:31:01 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:01 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:391)
Apr 13 19:31:13 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:13 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:396)
Apr 13 19:31:32 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:32 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:401)
Apr 13 19:31:44 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:44 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:406)
Apr 13 19:32:04 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:04 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:411)
Apr 13 19:32:15 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:15 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:416)
Apr 13 19:32:47 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:46 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:426)
Apr 13 19:33:19 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:33:18 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:431)
Apr 13 19:39:02 coronavirus-tracker-api app[web] ERROR [2020-04-13 23:39:01 +0000] [436] [ERROR] Exception in ASGI application
Apr 13 20:09:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 00:09:01 +0000] [436] [ERROR] Exception in ASGI application
Apr 13 20:33:52 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 00:33:52 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:436)
Apr 13 20:39:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 00:39:01 +0000] [421] [ERROR] Exception in ASGI application
Apr 13 21:00:18 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 01:00:17 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:441)
Apr 13 21:00:49 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 01:00:48 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:446)
Apr 13 21:09:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 01:09:02 +0000] [421] [ERROR] Exception in ASGI application
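The WORKER TIMEOUT lines above can be summarized programmatically, which makes it easier to see whether kills cluster in time. A sketch that counts timeouts per minute, run here against a few sample lines copied from the log:

```python
import re
from collections import Counter

# A few sample lines copied from the gunicorn log above.
sample = """\
Apr 13 19:25:15 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:15 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:281)
Apr 13 19:25:29 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:286)
Apr 13 19:26:00 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:00 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:296)
"""

# Capture the timestamp truncated to the minute, plus the worker pid.
pattern = re.compile(r"^(\w+ \d+ \d+:\d+):\d+ .* WORKER TIMEOUT \(pid:(\d+)\)")

per_minute = Counter()
for line in sample.splitlines():
    m = pattern.match(line)
    if m:
        per_minute[m.group(1)] += 1

print(dict(per_minute))
```

Against the full log this would show the roughly 15-second kill cadence between 19:25 and 19:33, which is consistent with workers being killed as fast as gunicorn can respawn them.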
Opening this issue to continue the discussion of the outage on April 7th.
#264
Metrics from the time of the outage and afterward
I will update this comment with new information as needed.
Notes and resources
Possible causes