Detect and log long running http(s) requests #17842

Merged (3 commits, Sep 26, 2018)

Changes from 2 commits
18 changes: 18 additions & 0 deletions app/models/mixins/miq_web_server_runner_mixin.rb
@@ -33,4 +33,22 @@ def start_rails_server(options)
server.start
end
end

def do_heartbeat_work
log_long_running_requests
end

CHECK_LONG_RUNNING_REQUESTS_INTERVAL = 30.seconds
Member Author:
who knows if every 30 seconds is too frequent or not

def log_long_running_requests
@last_checked_hung_requests ||= Time.now.utc
return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago

RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds"
Member Author:
I debated having the middleware method return the PID:TID format and rounded duration but chose to keep raw data there. I can be convinced to change it. 😉

Member:
Does seem like this would be part of Rails itself. Or part of New Relic / Scout / Skylight.

This will be nice to proactively track stuff down.
Is there a way to make this chartable rather than dumping into another log?

Member Author:
Yes, I can see making this more generic (it assumes a threaded web server like puma).

We had looked at an alternative, https://github.com/heroku/rack-timeout, which warns about the implications of raising a timeout the way it does and notes it's more for debugging than for production. This PR was not meant to take action other than notify us of a long running request. We don't care whether the threads are deadlocked or just doing too many things slowly; we just log the long running request. I honestly don't know that we ever want to take action other than log/notify people.

So, yes, if this works out for us, I can see making this a small standalone rack middleware with a generic interface.
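
A rough sketch, not part of this diff, of what a standalone generic version might look like; it also speaks to the "chartable" question above by instrumenting an event instead of writing straight to a log. The class name, event name, and LONG_REQUEST default are invented for illustration:

# Illustrative sketch only: class name, event name and threshold are made up.
class LongRequestMiddleware
  LONG_REQUEST = 60 # seconds

  def initialize(app)
    @app = app
  end

  def call(env)
    Thread.current[:request_path]       = env['PATH_INFO']
    Thread.current[:request_started_on] = Time.now
    @app.call(env)
  ensure
    Thread.current[:request_path]       = nil
    Thread.current[:request_started_on] = nil
  end

  # Call this from any watchdog/heartbeat thread. It emits one event per long
  # request so subscribers can log it, chart it, or ship it to a metrics backend.
  def self.report_long_requests
    Thread.list.each do |thread|
      path       = thread[:request_path]
      started_on = thread[:request_started_on]
      next unless path && started_on

      duration = Time.now - started_on
      next if duration < LONG_REQUEST

      ActiveSupport::Notifications.instrument("long_request.rack",
        :path => path, :duration => duration, :thread => thread)
    end
  end
end

A subscriber registered via ActiveSupport::Notifications.subscribe("long_request.rack") could then decide whether to log, graph, or alert on the payload.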

Member:
Would it make sense to log the thread.backtrace here as well? It would help debugging engineers find out where the bad parts of the codebase exist. If this is really a hung process, multiple checks on this on the 30 second intervals would let us know if we are looping on the same section of code on each check.
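
For illustration, appending it inside log_long_running_requests could be as small as the sketch below; Thread#backtrace can return nil (e.g. for a dead thread), hence the guard:

# Sketch: include the thread's current backtrace in the warning message.
if (backtrace = thread.backtrace)
  message = "#{message}\n#{backtrace.join("\n")}"
end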

Member Author:
Maybe? I don't know if it would be too noisy though. I hear @kbrock thinks we log too much

Member Author:
😆

Member:
For the record: we can (and should only) do this in a follow-up; it was just something I thought was nice that this approach lets us do.

Maybe? I don't know if it would be too noisy though.

So my thought with this, and others, was:

  1. Ideally, no requests go over 10 seconds ever.
  2. When they do, and we have a reproducible environment, with zero changes (or maybe a settings tweak + restart) we can hit a slow request and then hit the UI workers on a regular interval with this signal (SIGHUP possibly, or something unused by puma).

So going back to the "stacktrace" portion, this could be something that is only turned on when we are using a control signal, while the "light mode" is what is on all the time.

This could even be scripted so that once the "bad url" is determined, it can be hit, and another thread waits 1 minute, and then starts hitting the running UI workers with this signal. The whole script would run until some form of a result is returned from the server (502 or otherwise).

Anyway, just some ideas.

Member Author:
Ideally, no requests go over 10 seconds ever.

🤣

yeah, we can do the backtrace with some signal but we have to be careful of puma's signal handling like you mentioned.
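
A sketch of the "only when asked" variant: flip a flag from a signal handler and let the existing heartbeat logging pick it up. The signal name below is just a placeholder and would need to be one puma does not already trap; the handler only does an assignment because trap context cannot safely take mutexes, so no direct logging in there:

# Sketch only. "SIGWINCH" is a placeholder; pick a signal puma does not use.
Signal.trap("SIGWINCH") { @log_request_backtraces = !@log_request_backtraces }

# ...then, in log_long_running_requests:
if @log_request_backtraces && (backtrace = thread.backtrace)
  message = "#{message}\n#{backtrace.join("\n")}"
end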

_log.warn(message)
Rails.logger.warn(message)
end

@last_checked_hung_requests = Time.now.utc
end
end
2 changes: 2 additions & 0 deletions config/application.rb
@@ -109,6 +109,8 @@ class Application < Rails::Application

config.autoload_once_paths << Rails.root.join("lib", "vmdb", "console_methods.rb").to_s

config.middleware.use 'RequestStartedOnMiddleware'

# config.eager_load_paths accepts an array of paths from which Rails will eager load on boot if cache classes is enabled.
# Defaults to every folder in the app directory of the application.

57 changes: 57 additions & 0 deletions lib/request_started_on_middleware.rb
@@ -0,0 +1,57 @@
# Add to config/application.rb:
#
# config.middleware.use 'RequestStartedOnMiddleware'
#
class RequestStartedOnMiddleware
def initialize(app)
@app = app
end

def call(env)
start_request(env['PATH_INFO'], Time.now.utc)
Member:
Pedantic, but can we extract this Time.now value from the env instead of looking it up ourselves? Does it make sense to make another Time.now allocation, or should we use what is there?

Honestly asking here though, because I am curious if you think there is an advantage or disadvantage to one or the other.
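
For reference, pulling the start time from a proxy-supplied header (the queue-time style that New Relic reads) would look roughly like the sketch below. Rack/puma do not set these themselves; the header names and the unit guessing depend entirely on how the fronting apache/nginx is configured, and the method name here is invented:

# Sketch: X-Request-Start / X-Queue-Start only exist if the fronting proxy adds
# them; the value is typically "t=<epoch>" in seconds, milliseconds or microseconds.
def request_started_on(env)
  raw = env['HTTP_X_REQUEST_START'] || env['HTTP_X_QUEUE_START']
  return Time.now.utc unless raw

  epoch = raw.sub(/\At=/, '').to_f
  epoch /= 1_000_000.0 if epoch > 1e14 # microseconds -> seconds
  epoch /= 1_000.0     if epoch > 1e11 # milliseconds -> seconds
  Time.at(epoch).utc
end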

Member Author:
Good question. I was thinking you're right... maybe we should use the one in the headers, but then I looked at how newrelic pulls the times out of the header and it looks faster to do it the way I already have it. Check it out here

Member Author:
So, yeah, I think what I have is fine.

Member:
Huh, I was under the impression that rack/puma would have added this info for us, but I guess not.

That said, Rails does do this, which led me down a rabbit hole of investigation into "how", and I'm going to spike on something real quick as a POC just so you can let me know what you think. This "idea" probably won't be superior in features, and it digs into the Rails private API a bit, but it might be less code for us to maintain overall. We'll see...

@app.call(env)
Member Author:
NickL, thanks for this simplification

ensure
complete_request
end

def start_request(path, started_on)
Thread.current[:current_request] = path
Thread.current[:current_request_started_on] = started_on
Member:
I like the thread local vars

end

def complete_request
Thread.current[:current_request] = nil
Thread.current[:current_request_started_on] = nil
end

def self.long_running_requests
requests = []
timed_out_request_started_on = request_timeout.ago.utc
Member Author:
thanks Keenan for this idea ^

Member:
This variable name seems a bit wordy and confusing when read in the if below.

Maybe something like timeout_threshold or allowable_request_start_time, and then the if reads like:

if ... && started_on < allowable_request_start_time
  # ...
end


relevant_thread_list.each do |thread|
request = thread[:current_request]
started_on = thread[:current_request_started_on]

# There's a race condition where the complete_request method runs in another
# thread after we set one or more of the above local variables. The fallout
# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
Member (@NickLaMuro, Sep 5, 2018):
Just a note for the future: this becomes a bigger issue if we were to "react to a hung request" as you suggested here, but for now, this really isn't a big deal.

Logging something that was a "close to a 2 min request" is going to anger just about no one (except @kbrock, who probably thinks we log too much as it is... which is fair, but not relevant to my point).

Member Author:
@NickLaMuro 😆 🤣 @kbrock
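
(If it ever did matter, say once we start reacting to hung requests, one cheap way to narrow that window would be to publish both values as a single thread-local pair, so a reader sees either the whole request or nothing. Just an idea for later; sketch below with an invented :current_request_info key.)

# Sketch: one thread-local holding both values, so they cannot be half-cleared
# from a reader's point of view.
def start_request(path, started_on)
  Thread.current[:current_request_info] = [path, started_on].freeze
end

def complete_request
  Thread.current[:current_request_info] = nil
end

# reader side: destructuring nil simply yields two nils
request, started_on = thread[:current_request_info]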

if request.present? && started_on.kind_of?(Time) && timed_out_request_started_on > started_on
duration = (Time.now.utc - started_on).to_f
Member:
I think we can "cache" Time.now in a local variable above with request and started_on instead of looking it up every iteration of the loop.

Member:
Also, something I noticed just now: are the .utc calls done throughout these changes even necessary? We never log the raw time, only the difference between the start and current time when logging long requests. I only mention this because we are trying to keep it as minimal/lightweight as possible, and calls to .utc seem pointless when dealing with time diffs.

Might be something that rubocop complained about though, so just something I noticed.

Member Author:
👍 Yeah, since we're writing and reading the time, there's no need to store it as utc as long as it's always local

Member Author:
Note, we're only running Time.now once for each long request, which shouldn't be that frequent. The alternative is to do it once for EVERY request.
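
(For reference, folding in the review suggestions, a single Time.now call outside the loop, no .utc, and the allowable_request_start_time rename, the loop would read roughly like this sketch:)

# Sketch with the suggestions applied; behavior is unchanged.
now = Time.now
allowable_request_start_time = request_timeout.ago

relevant_thread_list.each do |thread|
  request    = thread[:current_request]
  started_on = thread[:current_request_started_on]

  if request.present? && started_on.kind_of?(Time) && started_on < allowable_request_start_time
    requests << [request, (now - started_on).to_f, thread]
  end
end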

requests << [request, duration, thread]
end
end

requests
end

REQUEST_TIMEOUT = 2.minutes
Member:
Two quick suggestions about this (which I didn't think to add into my last review):

  • Maybe rename this to LONG_REQUEST, or something, since we really aren't "timing out" anything here. It is based on the apache timeout, but this simply logs a long request, and this is the value for what constitutes that.
  • Possibly make this only 1.minute? On the performance team, we were told to try and make anything over 10 seconds in the UI faster, so waiting until 2.minutes to say something is "long" seems like it is a bit extreme. When paired with my suggestion to possibly include the .backtrace with it, it seems like this would allow us to more quickly see when something is long and causing issues, and if the issue is moving from one code path to the next. And most requests (ideally) shouldn't get to this point where it is a problem.

Member Author:
I agree on both. I like the LONG_REQUEST name... it makes more sense. I was thinking we could expose the 2.minutes value because I also see value in changing it. Maybe 1.minute across the board will be better, as it will pick up long requests that don't currently time out apache as well as ones that do. I'll change it to 1.minute and we'll go from there.
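
In other words, the follow-up would presumably look something like this sketch (spec mocks would need the matching rename):

# Sketch of the agreed change: clearer name, threshold lowered to 1 minute.
LONG_REQUEST = 1.minute

private_class_method def self.long_request
  LONG_REQUEST
end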

private_class_method def self.request_timeout
REQUEST_TIMEOUT
end

# For testing: mocking Thread.list feels dangerous
private_class_method def self.relevant_thread_list
Thread.list
Member:
Think this represents your race condition

end
end
28 changes: 28 additions & 0 deletions spec/lib/request_started_on_middleware_spec.rb
@@ -0,0 +1,28 @@
describe RequestStartedOnMiddleware do
context ".long_running_requests" do
before do
allow(described_class).to receive(:relevant_thread_list) { fake_threads }
allow(described_class).to receive(:request_timeout).and_return(2.minutes)
end

let(:fake_threads) { [@fake_thread] }

it "returns request, duration and thread" do
@fake_thread = {:current_request => "/api/ping", :current_request_started_on => 3.minutes.ago}
long_requests = described_class.long_running_requests.first
expect(long_requests[0]).to eql "/api/ping"
expect(long_requests[1]).to be_within(0.1).of(Time.now.utc - 3.minutes.ago)
expect(long_requests[2]).to eql @fake_thread
end

it "skips threads that haven't timed out yet" do
@fake_thread = {:current_request => "/api/ping", :current_request_started_on => 30.seconds.ago}
expect(described_class.long_running_requests).to be_empty
end

it "skips threads with no requests" do
@fake_thread = {}
expect(described_class.long_running_requests).to be_empty
end
end
end