
Sep 21 brief outage and heroku timeouts #1864

Closed
jrochkind opened this issue Sep 21, 2022 · 6 comments

jrochkind commented Sep 21, 2022

What happened

  • Last night we had some brief outages reported by our monitoring services
    • Sep 21 ~12:20 AM ET to 12:23, and then again ~12:53 to 12:56 (so ~6 minutes total)
    • Reported by both Honeybadger and UptimeRobot; Honeybadger monitors a search-results URL, not sure what UptimeRobot checks
  • Also last night we had a high number of errors.
    • ~5600 errors between roughly 12:37AM and 1:37AM.
    • They all look to be H12 Request Timeouts -- the Rails app did not respond within 30 seconds, so Heroku cancelled the client request.

Analysis

Did we have more traffic than usual? Was it bots?

It's hard to say; we don't currently have easy tooling to graph our logs, so we can't readily see how much traffic there was at a given time, what proportion of it was bots, or how that compares to other periods.

Ah, but the Heroku dashboard metrics do show bumps in Response Time, Dyno Load, and Throughput during these windows -- so that probably does indicate a burst of traffic. Seems consistent with the theory that throughput spiked, driving up dyno load and error responses: there weren't enough workers available to handle all the requests, and/or not enough CPU on the dyno for the available workers to keep up.

Memory usage stayed pretty flat throughout though.

[Screenshot of Heroku dashboard metrics, 2022-09-21 10:32 AM]

What does traffic look like?

Again, we don't have tools to give us good descriptive statistics or characterization, so this is just me visually scanning through logs and giving an anecdotal impression.

The actual H12 requests appear never to have made it to our app to be logged, so we don't have user-agent/bot info for them -- only the Heroku router log, which records just the requested path.

But looking at requests during that rough time period that did make it to the app...

It looks like there's actually quite a lot of traffic from one IP address specifically: ua=Chrome-87/Windows-10 ip=149.56.16.76. It doesn't have a bot User-Agent, but it's a lot of traffic -- maybe 20 requests a second, sustained for a while.

There's also plenty of identified bot traffic, like a few bot requests a second. Is this more than usual though? Hard to say; it looks like it might be. Bots include PetalBot and SemrushBot, but also at least a sprinkling of Bing, Apple, etc.

What is 149.56.16.76 IP?

Not really sure; it seems to be in Canada, hosted by OVH Cloud, "which offers VPS, dedicated servers and other web services" -- so a rented server. Someone running some kind of experiment? It seems to be doing a crawl of our site (it's accessing all sorts of things), but it isn't respecting robots.txt or advertising itself as a bot in the user-agent. #1008

It seems to have started accessing us last night at 23:38 and stopped at 00:58.

(Although last night wasn't the first time we saw it; it was here a few days ago too.)

We could file an abuse report to OVH at https://www.ovh.com/abuse?

What are identified bots doing?

Identified bot traffic is also present, and those bots are scraping a bunch of things they probably don't need to:

  • facet combinations, even though our robots.txt tries to keep them out
  • every combination of derivatives

Remediation?

  • We could (and maybe should) fine-tune our robots.txt to keep even more out (a possible tightening is sketched after this list).

    • However, that won't keep out the poorly-behaved 149.56.16.76 client, which does not seem to be respecting robots.txt.
  • We could also experiment with auto-scaling our web dyno via HireFire.

    • Unclear whether we could scale up enough to handle the poorly-behaved 149.56.16.76, though.
  • We could add more puma workers, as we seem to have enough RAM for them

    • Nah, the high dyno load suggests we don't have the CPU for more workers, even though we have the RAM. (Load generally shouldn't be over 1; ours got to 4-5.)
  • We could use rack-attack to rate-limit our clients at the app-level.

    • It's unclear whether this would actually keep 149.56.16.76 from DoS'ing us, though, since each request would still get to our app before being quickly rejected with an HTTP 429.
    • We could also use rack-attack to at least log very high-volume requesters, which might help us diagnose this kind of thing more quickly next time.
  • We could file an abuse report to OVH for 149.56.16.76

  • We could investigate fronting with CloudFlare, for their security/DoS-prevention features.
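
For the robots.txt idea above (the "fine-tune our robots.txt" bullet), here's a sketch of what a tightened file could look like. The paths are hypothetical placeholders for our real facet and derivative routes, wildcard matching in Disallow rules is an extension most major crawlers honor rather than part of the original standard, and of course none of this constrains a client that ignores robots.txt entirely:

```text
# Sketch only -- the paths below are hypothetical placeholders; the app's
# real facet and derivative routes would need to be substituted.
User-agent: *
# keep crawlers off facet-filtered search-result combinations
Disallow: /catalog?*f%5B
# keep crawlers off derivative download URLs
Disallow: /downloads/
# ask polite crawlers to slow down (not all honor this; Googlebot ignores it)
Crawl-delay: 10
```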

jrochkind (issue author) commented:

I did file an abuse report with OVH.

jrochkind (issue author) commented:

Implementing rack-attack would require us to configure a Rails.cache, which we haven't gotten around to. We'd probably have to purchase an additional Redis server from Heroku; it's bad practice to use the same Redis for ActiveJob queues and a cache, since they require different configuration and you risk losing queue data if the cache grows too large.
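
For reference, a minimal sketch of what pointing Rails.cache at a dedicated store could look like -- assuming here the MemCachier memcached add-on and its MEMCACHIER_* config vars (other add-ons expose different vars); a separate Redis via :redis_cache_store would be configured analogously:

```ruby
# config/environments/production.rb -- sketch only. Assumes the MemCachier
# add-on (MEMCACHIER_* config vars) and the dalli gem in the Gemfile.
config.cache_store = :mem_cache_store,
  (ENV["MEMCACHIER_SERVERS"] || "").split(","),
  {
    username: ENV["MEMCACHIER_USERNAME"],
    password: ENV["MEMCACHIER_PASSWORD"],
    failover: true,       # keep serving requests if a cache node blips
    socket_timeout: 1.5,  # don't let a slow cache stall web requests
    socket_failure_delay: 0.2
  }
```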

jrochkind (issue author) commented:

@CHF-IT-Staff says that for this fiscal year (2022-2023) it is no problem to add $15-$20 a month for a Redis or memcached Heroku add-on for rack-attack.

eddierubeiz (Contributor) commented:

A very similar incident occurred over the weekend, from the same IP address.
See the Heroku metrics and Papertrail logs starting at 2022-09-25 05:44:40 EST.

jrochkind (issue author) commented:

I did file an abuse report with OVH, but don't seem to have heard back.

OK, maybe we'll get rack-attack going with (a) a rate limit, and (b) the ability to block-list certain IPs manually.

  • I'm not totally sure if blocking them at the app level (after they've already taken up routing resources) will be sufficient, but seems like the first thing to try.
  • Will involve adding a ~$15-$20 memcached add-on to our Heroku account. (And document that on our heroku components wiki page.)
  • Initially let's just do the rate limit and not even block-list the known IP -- to find out if the rate limit is enough to keep the current known IP from harming our app!
  • Let's have a rate limit, along with lower limits set as log-only -- so if a problematic user does NOT trigger the rate limit, we'll still have a log of which lower limits would have applied to them.
  • For the initial blocking limit, maybe a fairly generous rate of, say, 4 requests per second averaged over 10 seconds -- so 40 requests over a 10-second period -- which I think will lock out the current pattern from that IP. (A sketch follows this list.)
  • Make sure all log lines are identified with a token related to rate limiting (say "rack-attack", the name of the gem), and set up Papertrail monitoring to send us a Slack notification on all rack-attack log lines, so we can stay aware of when they are being triggered.
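
For concreteness, here's roughly what that plan could look like as a rack-attack initializer -- a sketch only, with illustrative numbers, and BLOCKED_IPS is just a hypothetical name for however we'd feed in the manual block-list. Counters are stored in Rack::Attack.cache, which defaults to Rails.cache (the memcached store discussed above):

```ruby
# config/initializers/rack_attack.rb -- illustrative sketch, not final config.

# Manual block-list: IPs listed in a (hypothetical) BLOCKED_IPS config var,
# comma-separated, get rejected outright (403 by default).
Rack::Attack.blocklist("manually blocked IPs") do |req|
  ENV.fetch("BLOCKED_IPS", "").split(",").map(&:strip).include?(req.ip)
end

# The real limit: ~4 requests/second per IP, averaged over 10 seconds.
# Requests over this limit get a 429 response.
Rack::Attack.throttle("req/ip", limit: 40, period: 10) do |req|
  req.ip
end

# A lower, log-only threshold: blocks nothing, just records clients that
# exceed it so we can see what a stricter limit would have caught.
Rack::Attack.track("req/ip/minute", limit: 60, period: 60) do |req|
  req.ip
end
```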

jrochkind reopened this Oct 3, 2022

jrochkind commented Oct 3, 2022

OK, rack-attack caught one! It's a new IP.

3.95.223.35

It is getting both tracked for exceeding 60 requests per minute, and throttled when it moves on to exceeding 120 requests per minute.

We are getting more "track" log lines than I wanted. My attempt to limit the logging to once per period did work, but we still get a log line as often as once a minute, since this client just keeps requesting (even after we throttle it).

So we might need to tune the logging further if it gives us lots of noise. But for now we want to be aware of what's going on.
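
For the record, one way to get that kind of tagged, deduplicated logging is through rack-attack's ActiveSupport notifications. This is a sketch assuming rack-attack 6.x event names ("throttle.rack_attack", "track.rack_attack", etc.) and a payload carrying the request; the "first request over the limit" check is just one way to approximate once-per-period logging:

```ruby
# Logging sketch -- lives alongside the throttle/track rules in
# config/initializers/rack_attack.rb.
ActiveSupport::Notifications.subscribe(/\.rack_attack$/) do |name, _start, _finish, _id, payload|
  req        = payload[:request]
  match_data = req.env["rack.attack.match_data"] || {}

  # For throttle/track matches, match_data includes :count and :limit; logging
  # only the first request over the limit each period keeps the noise down.
  next if match_data[:count] && match_data[:count] != (match_data[:limit] + 1)

  # The "rack-attack" token makes these lines easy to match with a Papertrail
  # saved search wired to a Slack alert.
  Rails.logger.warn(
    "rack-attack: #{name} rule=#{req.env['rack.attack.matched']} " \
    "ip=#{req.ip} path=#{req.path}"
  )
end
```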

What can we learn about 3.95.223.35?

  • Looks like an AWS EC2. ec2-3-95-223-35.compute-1.amazonaws.com
  • In this case it DOES correctly identify itself as a bot -- specifically a Python script, ua=Python_Requests-2.
  • It looks like it's mostly downloading derivatives. Our robots.txt isn't disallowing access to any derivatives, so this one is quite possibly obeying robots.txt -- it's relatively well-behaved.
  • Are these limits too strict -- should we let it fetch those derivatives faster? (One possible adjustment is sketched below.)
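
If we did decide a well-behaved bot should be able to fetch derivatives faster, one option would be to carve derivative downloads out of the general per-IP limit and give them their own, more generous throttle -- sketch only, with a hypothetical path prefix standing in for the real derivative route:

```ruby
# Sketch: exclude derivative downloads from the general per-IP limit
# ("/downloads/" is a hypothetical stand-in for the real derivative route)...
Rack::Attack.throttle("req/ip", limit: 40, period: 10) do |req|
  req.ip unless req.path.start_with?("/downloads/")
end

# ...and give them a separate, more generous limit of their own.
Rack::Attack.throttle("derivatives/ip", limit: 120, period: 10) do |req|
  req.ip if req.path.start_with?("/downloads/")
end
```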
