
Sep 21 brief outage and heroku timeouts #1864

Closed
jrochkind opened this issue Sep 21, 2022 · 6 comments

jrochkind commented Sep 21, 2022

What happened

  • Last night we had some brief outages reported by our monitoring services
    • Sep 21 ~12:20 AM ET to 12:23, and then again ~12:53 to 12:56 (so ~6 minutes total)
    • Reported by both Honeybadger and UptimeRobot; Honeybadger monitors a search-results URL, not sure what UptimeRobot checks
  • Also last night we had a high number of errors.
    • ~5600 errors between roughly 12:37AM and 1:37AM.
    • They all look to be H12 Request Timeouts -- the Rails app did not respond within 30 seconds, so Heroku cancelled the client request.

Analysis

Did we have more traffic than usual? Was it bots?

It's hard to say; we don't currently have easy tooling to graph our logs, so we can't readily see how much traffic there was at a given time, what proportion of it was bots, or how that compares to other periods.

Ah, but the Heroku dashboard metrics do show bumps in Response Time, Dyno Load, and Throughput during these windows -- so that probably does indicate a burst of traffic. Seems consistent with the theory that throughput spiked, driving up dyno load and error responses: there weren't enough workers available to handle all the requests, and/or not enough CPU on the dyno for the available workers to keep up.

Memory usage stayed pretty flat throughout though.

[Screenshot of Heroku dashboard metrics, 2022-09-21 10:32 AM]

What does traffic look like?

Again, we don't have tools to give us good descriptive statistics or characterization, so this is just me visually scanning through logs and giving an anecdotal impression.

The actual H12 requests appear never to have made it to our app to be logged, so we don't have user-agent/bot info for them -- only the Heroku router log, which records just the requested path.

But looking at requests during that rough time period that did make it to the app...

It looks like there's actually quite a lot of traffic from one IP address specifically: ua=Chrome-87/Windows-10 ip=149.56.16.76. It doesn't have a bot User-Agent, but it's a lot of traffic -- maybe 20 requests a second, sustained for a while.

There's also plenty of identified bot traffic, like a few bot requests a second. Is this more than usual though? Hard to say; it looks like it might be. Bots include PetalBot and SemrushBot, but also at least a sprinkling of Bing, Apple, etc.

What is 149.56.16.76 IP?

Not really sure; it seems to be in Canada, hosted by OVH Cloud, "which offers VPS, dedicated servers and other web services" -- so a rented server. Someone running some kind of experiment? It seems to be doing a crawl of our site (it's accessing all sorts of things), but it isn't respecting robots.txt or advertising itself as a bot in the user-agent. #1008

It seems to have started accessing us last night at 23:38 and stopped at 00:58.

(Although last night wasn't the first time we saw it; it was here a few days ago too.)

We could file an abuse report to OVH at https://www.ovh.com/abuse?

What are identified bots doing?

Identified bot traffic is also present, and those bots are scraping a bunch of things they probably don't need to:

  • facet combinations, even though our robots.txt tries to keep them out
  • every combination of derivatives

Remediation?

  • We could (and maybe should) fine-tune our robots.txt to keep even more out (a possible tightening is sketched after this list).

    • However, that won't keep out the poorly-behaved 149.56.16.76 client, which does not seem to be respecting robots.txt.
  • We could also experiment with auto-scaling our web dyno via HireFire.

    • Unclear whether we could scale up enough to handle the poorly-behaved 149.56.16.76, though.
  • We could add more puma workers, as we seem to have enough RAM for them

    • Nah, the high dyno load suggests we don't have the CPU for more workers, even though we have the RAM. (Load generally shouldn't be over 1; ours got to 4-5.)
  • We could use rack-attack to rate-limit our clients at the app-level.

    • It's unclear whether this would actually keep 149.56.16.76 from DoS'ing us, though, since each request would still get to our app before being quickly rejected with an HTTP 429.
    • We could also use rack-attack to at least log very high-volume requesters, which might help us diagnose this kind of thing more quickly next time.
  • We could file an abuse report to OVH for 149.56.16.76

  • We could investigate fronting with CloudFlare, for their security/DoS-prevention features.
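
For the robots.txt idea above (the "fine-tune our robots.txt" bullet), here's a sketch of what a tightened file could look like. The paths are hypothetical placeholders for our real facet and derivative routes, wildcard matching in Disallow rules is an extension most major crawlers honor rather than part of the original standard, and of course none of this constrains a client that ignores robots.txt entirely:

```text
# Sketch only -- the paths below are hypothetical placeholders; the app's
# real facet and derivative routes would need to be substituted.
User-agent: *
# keep crawlers off facet-filtered search-result combinations
Disallow: /catalog?*f%5B
# keep crawlers off derivative download URLs
Disallow: /downloads/
# ask polite crawlers to slow down (not all honor this; Googlebot ignores it)
Crawl-delay: 10
```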

jrochkind (issue author) commented:

I did file an abuse report with OVH.

jrochkind (issue author) commented:

Implementing rack-attack would require us to configure a Rails.cache, which we haven't gotten around to. We'd probably have to purchase an additional Redis server from Heroku; it's bad practice to use the same Redis for ActiveJob queues and a cache, since they require different configuration and you risk losing queue data if the cache grows too large.
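
For reference, a minimal sketch of what pointing Rails.cache at a dedicated store could look like -- assuming here the MemCachier memcached add-on and its MEMCACHIER_* config vars (other add-ons expose different vars); a separate Redis via :redis_cache_store would be configured analogously:

```ruby
# config/environments/production.rb -- sketch only. Assumes the MemCachier
# add-on (MEMCACHIER_* config vars) and the dalli gem in the Gemfile.
config.cache_store = :mem_cache_store,
  (ENV["MEMCACHIER_SERVERS"] || "").split(","),
  {
    username: ENV["MEMCACHIER_USERNAME"],
    password: ENV["MEMCACHIER_PASSWORD"],
    failover: true,       # keep serving requests if a cache node blips
    socket_timeout: 1.5,  # don't let a slow cache stall web requests
    socket_failure_delay: 0.2
  }
```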

jrochkind (issue author) commented:

@CHF-IT-Staff says that for this fiscal year (2022-2023) it is no problem to add $15-$20 a month for a Redis or memcached Heroku add-on for rack-attack.

eddierubeiz (Contributor) commented:

A very similar incident occurred over the weekend, from the same IP address.
See the Heroku metrics and Papertrail logs starting at 2022-09-25 05:44:40 EST.

jrochkind (issue author) commented:

I did file an abuse report with OVH, but don't seem to have heard back.

OK, maybe we'll get rack-attack going with (a) a rate limit, and (b) the ability to block-list certain IPs manually.

  • I'm not totally sure if blocking them at the app level (after they've already taken up routing resources) will be sufficient, but seems like the first thing to try.
  • Will involve adding a ~$15-$20 memcached add-on to our Heroku account. (And document that on our heroku components wiki page.)
  • Initially let's just do the rate limit and not even block-list the known IP -- to find out if the rate limit is enough to keep the current known IP from harming our app!
  • Let's have a rate limit, along with lower limits set as log-only -- so if a problematic user does NOT trigger the rate limit, we'll still have a log of which lower limits would have applied to them.
  • For the initial blocking limit, maybe a fairly generous rate of, say, 4 requests per second averaged over 10 seconds -- so 40 requests over a 10-second period -- which I think will lock out the current pattern from that IP. (A sketch follows this list.)
  • Make sure all log lines are identified with a token related to rate limiting (say "rack-attack", the name of the gem), and set up Papertrail monitoring to send us a Slack notification on all rack-attack log lines, so we can stay aware of when they are being triggered.
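
For concreteness, here's roughly what that plan could look like as a rack-attack initializer -- a sketch only, with illustrative numbers, and BLOCKED_IPS is just a hypothetical name for however we'd feed in the manual block-list. Counters are stored in Rack::Attack.cache, which defaults to Rails.cache (the memcached store discussed above):

```ruby
# config/initializers/rack_attack.rb -- illustrative sketch, not final config.

# Manual block-list: IPs listed in a (hypothetical) BLOCKED_IPS config var,
# comma-separated, get rejected outright (403 by default).
Rack::Attack.blocklist("manually blocked IPs") do |req|
  ENV.fetch("BLOCKED_IPS", "").split(",").map(&:strip).include?(req.ip)
end

# The real limit: ~4 requests/second per IP, averaged over 10 seconds.
# Requests over this limit get a 429 response.
Rack::Attack.throttle("req/ip", limit: 40, period: 10) do |req|
  req.ip
end

# A lower, log-only threshold: blocks nothing, just records clients that
# exceed it so we can see what a stricter limit would have caught.
Rack::Attack.track("req/ip/minute", limit: 60, period: 60) do |req|
  req.ip
end
```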

jrochkind reopened this Oct 3, 2022

jrochkind commented Oct 3, 2022

OK, rack-attack caught one! It's a new IP.

3.95.223.35

It is getting both tracked for exceeding 60 requests per minute, and throttled when it moves on to exceeding 120 requests per minute.

We are getting more "track" log lines than I wanted. My attempt to limit the logging to once per period did work, but we still get a log line as often as once a minute, since this client just keeps requesting (even after we throttle it).

So we might need to tune the logging further if it gives us lots of noise. But for now we want to be aware of what's going on.
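
For the record, one way to get that kind of tagged, deduplicated logging is through rack-attack's ActiveSupport notifications. This is a sketch assuming rack-attack 6.x event names ("throttle.rack_attack", "track.rack_attack", etc.) and a payload carrying the request; the "first request over the limit" check is just one way to approximate once-per-period logging:

```ruby
# Logging sketch -- lives alongside the throttle/track rules in
# config/initializers/rack_attack.rb.
ActiveSupport::Notifications.subscribe(/\.rack_attack$/) do |name, _start, _finish, _id, payload|
  req        = payload[:request]
  match_data = req.env["rack.attack.match_data"] || {}

  # For throttle/track matches, match_data includes :count and :limit; logging
  # only the first request over the limit each period keeps the noise down.
  next if match_data[:count] && match_data[:count] != (match_data[:limit] + 1)

  # The "rack-attack" token makes these lines easy to match with a Papertrail
  # saved search wired to a Slack alert.
  Rails.logger.warn(
    "rack-attack: #{name} rule=#{req.env['rack.attack.matched']} " \
    "ip=#{req.ip} path=#{req.path}"
  )
end
```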

What can we learn about 3.95.223.35?

  • Looks like an AWS EC2. ec2-3-95-223-35.compute-1.amazonaws.com
  • In this case it DOES correctly identify itself as a bot -- specifically a Python script, ua=Python_Requests-2.
  • It looks like it's mostly downloading derivatives. Our robots.txt isn't disallowing access to any derivatives, so this one is quite possibly obeying robots.txt -- it's relatively well-behaved.
  • Are these limits too strict -- should we let it fetch those derivatives faster? (One possible adjustment is sketched below.)
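
If we did decide a well-behaved bot should be able to fetch derivatives faster, one option would be to carve derivative downloads out of the general per-IP limit and give them their own, more generous throttle -- sketch only, with a hypothetical path prefix standing in for the real derivative route:

```ruby
# Sketch: exclude derivative downloads from the general per-IP limit
# ("/downloads/" is a hypothetical stand-in for the real derivative route)...
Rack::Attack.throttle("req/ip", limit: 40, period: 10) do |req|
  req.ip unless req.path.start_with?("/downloads/")
end

# ...and give them a separate, more generous limit of their own.
Rack::Attack.throttle("derivatives/ip", limit: 120, period: 10) do |req|
  req.ip if req.path.start_with?("/downloads/")
end
```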
