Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect and log long running http(s) requests #17842

Merged
merged 3 commits into from
Sep 26, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions app/models/mixins/miq_web_server_runner_mixin.rb
Original file line number Diff line number Diff line change
Expand Up @@ -33,4 +33,22 @@ def start_rails_server(options)
server.start
end
end

def do_heartbeat_work
log_long_running_requests
end

CHECK_LONG_RUNNING_REQUESTS_INTERVAL = 30.seconds
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who knows if every 30 seconds is too frequent or not

def log_long_running_requests
@last_checked_hung_requests ||= Time.now.utc
return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago

RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated having the middleware method return the PID:TID format and rounded duration but chose to keep raw data there. I can be convinced to change it. 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does seem like this would be part of rails rails itself. Or part of new relic / scout / skylight

This will be nice to proactively track stuff down.
Is there a way to make this chartable rather than dumping into another log?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can see making this more generic (it assumes a threaded web server like puma).

We had looked at an alternative, https://github.com/heroku/rack-timeout, which warns about the implications of raising a timeout like it does and how it's not for production... more for debugging... This PR was not meant to take action other than notify us of a long running request. We do not care if the threads are deadlocked or just doing too many things slowly, we just log the long running request. I honestly don't know that we ever want to take action other than log/notify people.

So, yes, if this works out for us, I can see making this a small standalone rack middleware with a generic interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to log the thread.backtrace here as well? It would help debugging engineers find out where the bad parts of the codebase exist. If this is really a hung process, multiple checks on this on the 30 second intervals would let us know if we are looping on the same section of code on each check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? I don't know if it would be too noisy though. I hear @kbrock thinks we log too much

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: We can (and should only) do this in a followup, this was just something I thought was nice about this approach that we are able to do.

Maybe? I don't know if it would be too noisy though.

So my thought with this, and others, was:

  1. Ideally, no requests go over 10 seconds ever.
  2. When they do, and we have a reproducible environment, with zero changes (or maybe a settings tweak+restart) we can hit a slow request and git the UI workers on a regular interval with this signal (SIGHUP possibly or something unused by puma).

So going back to the "stacktrace" portion, this could maybe only be something that is only turned on when we are using a control signal, and the "light mode" is what is on all the time.

This could even be scripted so that once the "bad url" is determined, it can be hit, and another thread waits 1 minute, and then starts hitting the running UI workers with this signal. The whole script would run until some form of a result is returned from the server (502 or otherwise).

Anyway, just some ideas.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, no requests go over 10 seconds ever.

🤣

yeah, we can do the backtrace with some signal but we have to be careful of puma's signal handling like you mentioned.

_log.warn(message)
Rails.logger.warn(message)
end

@last_checked_hung_requests = Time.now.utc
end
end
2 changes: 2 additions & 0 deletions config/application.rb
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,8 @@ class Application < Rails::Application

config.autoload_once_paths << Rails.root.join("lib", "vmdb", "console_methods.rb").to_s

config.middleware.use 'RequestStartedOnMiddleware'

# config.eager_load_paths accepts an array of paths from which Rails will eager load on boot if cache classes is enabled.
# Defaults to every folder in the app directory of the application.

Expand Down
57 changes: 57 additions & 0 deletions lib/request_started_on_middleware.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
# Add to config/application.rb:
#
# config.middleware.use 'RequestStartedOnMiddleware'
#
class RequestStartedOnMiddleware
def initialize(app)
@app = app
end

def call(env)
start_request(env['PATH_INFO'], Time.zone.now)
@app.call(env)
ensure
complete_request
end

def start_request(path, started_on)
Thread.current[:current_request] = path
Thread.current[:current_request_started_on] = started_on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the thread local vars

end

def complete_request
Thread.current[:current_request] = nil
Thread.current[:current_request_started_on] = nil
end

def self.long_running_requests
requests = []
allowable_request_start_time = long_request.ago

relevant_thread_list.each do |thread|
request = thread[:current_request]
started_on = thread[:current_request_started_on]

# There's a race condition where the complete_request method runs in another
# thread after we set one or more of the above local variables. The fallout
# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
Copy link
Member

@NickLaMuro NickLaMuro Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note for the future: This becomes a bigger issue if we were "react to a hung request" as you suggested here, but for now, this really isn't a big deal.

Logging something that was a "close to a 2 min request" is going to anger just about no one (except @kbrock, who probably thinks we log too much as it is... which is fair, but not relevant to my point).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NickLaMuro 😆 🤣 @kbrock

if request.present? && started_on.kind_of?(Time) && started_on < allowable_request_start_time
duration = (Time.zone.now - started_on).to_f
requests << [request, duration, thread]
end
end

requests
end

LONG_REQUEST = 1.minute
private_class_method def self.long_request
LONG_REQUEST
end

# For testing: mocking Thread.list feels dangerous
private_class_method def self.relevant_thread_list
Thread.list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this represents your race condition

end
end
28 changes: 28 additions & 0 deletions spec/lib/request_started_on_middleware_spec.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
describe RequestStartedOnMiddleware do
context ".long_running_requests" do
before do
allow(described_class).to receive(:relevant_thread_list) { fake_threads }
allow(described_class).to receive(:request_timeout).and_return(2.minutes)
end

let(:fake_threads) { [@fake_thread] }

it "returns request, duration and thread" do
@fake_thread = {:current_request => "/api/ping", :current_request_started_on => 3.minutes.ago}
long_requests = described_class.long_running_requests.first
expect(long_requests[0]).to eql "/api/ping"
expect(long_requests[1]).to be_within(0.1).of(Time.now.utc - 3.minutes.ago)
expect(long_requests[2]).to eql @fake_thread
end

it "skips threads that haven't timed out yet" do
@fake_thread = {:current_request => "/api/ping", :current_request_started_on => 30.seconds.ago}
expect(described_class.long_running_requests).to be_empty
end

it "skips threads with no requests" do
@fake_thread = {}
expect(described_class.long_running_requests).to be_empty
end
end
end