Detect and log long running http(s) requests #17842
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,4 +33,22 @@ def start_rails_server(options) | |
server.start | ||
end | ||
end | ||
|
||
def do_heartbeat_work | ||
log_long_running_requests | ||
end | ||
|
||
CHECK_LONG_RUNNING_REQUESTS_INTERVAL = 30.seconds | ||
def log_long_running_requests | ||
@last_checked_hung_requests ||= Time.now.utc | ||
return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago | ||
|
||
RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread| | ||
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds" | ||
I debated having the middleware method return the PID:TID format and rounded duration but chose to keep raw data there. I can be convinced to change it. 😉

Does seem like this would be part of Rails itself, or part of New Relic / Scout / Skylight. This will be nice to proactively track stuff down.

Yes, I can see making this more generic (it assumes a threaded web server like puma). We had looked at an alternative, https://github.com/heroku/rack-timeout, which warns about the implications of raising a timeout like it does and how it's not for production... more for debugging... This PR was not meant to take action other than notify us of a long running request. We do not care if the threads are deadlocked or just doing too many things slowly, we just log the long running request. I honestly don't know that we ever want to take action other than log/notify people. So, yes, if this works out for us, I can see making this a small standalone rack middleware with a generic interface.

Would it make sense to log the

Maybe? I don't know if it would be too noisy though. I hear @kbrock thinks we log too much

😆

For the record: We can (and should only) do this in a followup, this was just something I thought was nice about this approach that we are able to do.
So my thought with this, and others, was:
So going back to the "stacktrace" portion, this could maybe be something that is only turned on when we are using a control signal, and the "light mode" is what is on all the time. This could even be scripted so that once the "bad url" is determined, it can be hit, and another thread waits 1 minute, and then starts hitting the running UI workers with this signal. The whole script would run until some form of a result is returned from the server (502 or otherwise). Anyway, just some ideas.

🤣 yeah, we can do the backtrace with some signal but we have to be careful of puma's signal handling like you mentioned. |
||
_log.warn(message) | ||
Rails.logger.warn(message) | ||
end | ||
|
||
@last_checked_hung_requests = Time.now.utc | ||
end | ||
end |
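The backtrace idea from the review thread above could start from Ruby's built-in `Thread#backtrace`. The following is only a sketch: the `format_thread_backtrace` helper and its `max_frames` parameter are hypothetical and not part of this PR, and how the dump would be triggered (a signal or otherwise) is deliberately left open.

```ruby
# Hypothetical helper, not part of the PR: formats the backtrace of a thread
# so it could be appended to the "Long running http(s) request" log line.
def format_thread_backtrace(thread, max_frames: 10)
  frames = thread.backtrace # nil when the thread is no longer alive
  return "(no backtrace available)" if frames.nil? || frames.empty?

  # Keep only the innermost frames to limit log noise.
  frames.first(max_frames).map { |frame| "  #{frame}" }.join("\n")
end
```

Capping the number of frames is one way to address the "too noisy" concern; the default of 10 is an arbitrary choice for illustration.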
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
# Add to config/application.rb: | ||
# | ||
# config.middleware.use 'RequestStartedOnMiddleware' | ||
# | ||
class RequestStartedOnMiddleware | ||
def initialize(app) | ||
@app = app | ||
end | ||
|
||
def call(env) | ||
start_request(env['PATH_INFO'], Time.now.utc) | ||
Pedantic, but can we extract this Honestly asking here though, because I am curious if you think there is an advantage or disadvantage to one or the other.

Good question. I was thinking you're right... maybe we should use the one in the headers but then I was looking at how newrelic pulls the times out of the header and it looks faster to do it the way I already have it. Check it out here

So, yeah, I think what I have is fine.

Huh, I was under the impression that That said, Rails does do this, which led me down a rabbit hole of investigation into "how", and I'm going to spike on something real quick as a POC just for you to let me know what you think. This "idea" probably won't be superior in features, and is digging into the Rails private API a bit, but might be less code for us to maintain overall. We'll see... |
||
@app.call(env) | ||
NickL, thanks for this simplification |
||
ensure | ||
complete_request | ||
end | ||
|
||
def start_request(path, started_on) | ||
Thread.current[:current_request] = path | ||
Thread.current[:current_request_started_on] = started_on | ||
I like the thread local vars |
||
end | ||
|
||
def complete_request | ||
Thread.current[:current_request] = nil | ||
Thread.current[:current_request_started_on] = nil | ||
end | ||
|
||
def self.long_running_requests | ||
requests = [] | ||
timed_out_request_started_on = request_timeout.ago.utc | ||
thanks Keenan for this idea ^

This variable name seems a bit wordy and confusing when read in the Maybe something like

    if ... && started_on < allowable_request_start_time
      # ...
    end |
||
|
||
relevant_thread_list.each do |thread| | ||
request = thread[:current_request] | ||
started_on = thread[:current_request_started_on] | ||
|
||
# There's a race condition where the complete_request method runs in another | ||
# thread after we set one or more of the above local variables. The fallout | ||
# of this is we return a false positive for a request that finished very close | ||
# to the 2 minute timeout. | ||
Just a note for the future: This becomes a bigger issue if we were to "react to a hung request" as you suggested here, but for now, this really isn't a big deal. Logging something that was a "close to a 2 min request" is going to anger just about no one (except @kbrock, who probably thinks we log too much as it is... which is fair, but not relevant to my point).

@NickLaMuro 😆 🤣 @kbrock |
||
if request.present? && started_on.kind_of?(Time) && timed_out_request_started_on > started_on | ||
duration = (Time.now.utc - started_on).to_f | ||
I think we can "cache"

Also something I noticed just now, is the Might be something that

👍 Yeah, since we're writing and reading the time, there's no need to store it as utc as long as it's always local

Note, we're only running |
||
requests << [request, duration, thread] | ||
end | ||
end | ||
|
||
requests | ||
end | ||
|
||
REQUEST_TIMEOUT = 2.minutes | ||
Two quick suggestions about this (which I didn't think to add into my last review):

I agree on both. I like the |
||
private_class_method def self.request_timeout | ||
REQUEST_TIMEOUT | ||
end | ||
|
||
# For testing: mocking Thread.list feels dangerous | ||
private_class_method def self.relevant_thread_list | ||
Thread.list | ||
Think this represents your race condition |
||
end | ||
end |
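For readers who want to try the detection logic outside Rails, here is a minimal dependency-free sketch of what `long_running_requests` does. The function name `find_long_running_requests`, its arguments, and the choice to take a single `now` value for both the cutoff and the duration are assumptions made for illustration, not the PR's implementation.

```ruby
# Hypothetical, dependency-free variant of long_running_requests.
# `threads` is anything that responds to [] with the two keys below,
# e.g. real Thread objects or plain hashes.
def find_long_running_requests(threads, timeout_seconds, now: Time.now)
  cutoff = now - timeout_seconds

  threads.each_with_object([]) do |thread, found|
    # Read each thread-local once; the request thread may clear them at any
    # moment, so a request finishing right at the cutoff can still be reported.
    request    = thread[:current_request]
    started_on = thread[:current_request_started_on]

    next if request.nil? || request.empty? || !started_on.is_a?(Time)
    next unless started_on < cutoff

    found << [request, (now - started_on).to_f, thread]
  end
end
```

Passing `now` in also makes the function easy to exercise with fixed times instead of mocking the clock.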
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
describe RequestStartedOnMiddleware do | ||
context ".long_running_requests" do | ||
before do | ||
allow(described_class).to receive(:relevant_thread_list) { fake_threads } | ||
allow(described_class).to receive(:request_timeout).and_return(2.minutes) | ||
end | ||
|
||
let(:fake_threads) { [@fake_thread] } | ||
|
||
it "returns request, duration and thread" do | ||
@fake_thread = {:current_request => "/api/ping", :current_request_started_on => 3.minutes.ago} | ||
long_requests = described_class.long_running_requests.first | ||
expect(long_requests[0]).to eql "/api/ping" | ||
expect(long_requests[1]).to be_within(0.1).of(Time.now.utc - 3.minutes.ago) | ||
expect(long_requests[2]).to eql @fake_thread | ||
end | ||
|
||
it "skips threads that haven't timed out yet" do | ||
@fake_thread = {:current_request => "/api/ping", :current_request_started_on => 30.seconds.ago} | ||
expect(described_class.long_running_requests).to be_empty | ||
end | ||
|
||
it "skips threads with no requests" do | ||
@fake_thread = {} | ||
expect(described_class.long_running_requests).to be_empty | ||
end | ||
end | ||
end |
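The spec above stands in plain hashes for threads, which works because the middleware only calls `[]` on them. A minimal sketch of the same lookup against a real `Thread` follows; the `simulate_request` helper and the two `Queue` objects are hypothetical and exist only to synchronise the example.

```ruby
# Hypothetical helper, not part of the PR: starts a thread that tags itself
# the way RequestStartedOnMiddleware#start_request does, then blocks until
# it is told to finish.
def simulate_request(path, started, finish)
  Thread.new do
    begin
      Thread.current[:current_request] = path
      Thread.current[:current_request_started_on] = Time.now
      started << :ok # tell the caller the thread-locals are set
      finish.pop     # pretend to be a slow request
    ensure
      # Mirrors complete_request: clear the markers when the request ends.
      Thread.current[:current_request] = nil
      Thread.current[:current_request_started_on] = nil
    end
  end
end

started = Queue.new
finish  = Queue.new
worker  = simulate_request("/api/ping", started, finish)
started.pop

# Another thread (here, the main thread) can read the worker's markers.
puts worker[:current_request] # prints "/api/ping"

finish << :done
worker.join
```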
who knows if every 30 seconds is too frequent or not
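Since the right interval is an open question, one option is to make it easy to tune. The sketch below mirrors the throttling in `log_long_running_requests` without Rails; the `ThrottledCheck` class and its injectable `clock` are hypothetical names introduced here, not part of the PR.

```ruby
# Hypothetical helper, not part of the PR: runs a block at most once per
# interval, using the same "remember the last check" approach as the diff.
class ThrottledCheck
  def initialize(interval_seconds, clock: -> { Time.now })
    @interval = interval_seconds
    @clock    = clock
    @last_run = nil
  end

  # Yields and returns true only when at least `interval_seconds` have
  # passed since the previous run (or since the very first call).
  def run
    now = @clock.call
    @last_run ||= now
    return false if now - @last_run < @interval

    yield
    @last_run = @clock.call
    true
  end
end
```

As in the diff, the first call only records a starting point, so the first real check happens one full interval later.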