[V2V] Modify active_tasks so that it always reloads #18860
Conversation
If you use .count instead of .size then it won't execute the query; we'll instead do a COUNT(*) every time, and thus there's no need for reload.
@Fryguy Ah, true. Well, unless you use a ...
Not sure what's happening with 2.5.x, but the failures appear to be unrelated.
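The caching behavior under discussion can be sketched with a small stand-in class (a hypothetical illustration, not ManageIQ or ActiveRecord code; `FakeRelation` and its store are invented names): once a relation-like object has loaded its records, .size reuses them, while .count goes back to the source every time.

```ruby
# Hypothetical stand-in for an ActiveRecord relation: once loaded, the
# records are cached, so .size can go stale while .count stays fresh.
class FakeRelation
  def initialize(store)
    @store = store            # stands in for the database table
  end

  def records
    @records ||= @store.dup   # first access "runs the query" and caches
  end

  def size
    records.size              # counts the cached array
  end

  def count
    @store.size               # re-queries the "table" every time
  end
end

store = [:task1, :task2]
rel = FakeRelation.new(store)
rel.size      # => 2 (loads and caches the records)
store << :task3               # another worker starts a task
rel.size      # => 2 (stale: still the cached copy)
rel.count     # => 3 (fresh, like SELECT COUNT(*))
```

This mirrors why swapping size for count removes the need for an explicit reload in the throttler.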
@@ -1,10 +1,10 @@
 class InfraConversionThrottler
   def self.start_conversions
     pending_conversion_jobs.each do |ems, jobs|
-      running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.size }
+      running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.count }
This feels like a terrible N+1 (query in a loop is bad)
cc @kbrock
There will be fewer than 20 conversion hosts.
I did come up with a single query for this, but since it is not in the primary loop, that can hold off for another day. This is only called once per EMS and is a lower concern.
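The single query hinted at here could group the counts in one round trip. Below is a hedged plain-Ruby mimic of a GROUP BY count (the task hashes are invented sample data; the real version would presumably be something along the lines of ActiveRecord's group(:conversion_host_id).count over the tasks table):

```ruby
# Stand-in rows for active migration tasks, keyed by conversion host.
tasks = [
  { conversion_host_id: 1 },
  { conversion_host_id: 1 },
  { conversion_host_id: 2 },
]

# Mimics: SELECT conversion_host_id, COUNT(*) ... GROUP BY conversion_host_id
counts = tasks.group_by { |t| t[:conversion_host_id] }
              .transform_values(&:size)
counts          # => {1 => 2, 2 => 1}

# One sum over the grouped counts replaces the per-host COUNT queries.
running = counts.values.sum   # => 3
```

One grouped query like this replaces N separate COUNTs, which is the N+1 being discussed.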
The newer changes introduce N+1s (which may have been there previously if the relations weren't cached). cc @kbrock
I don't see a way around it; we have to have up-to-date information every time. Plus, it looks to me like the database caches that explain plan (the plan, not the result), so subsequent runs are zippier than the first. Unless @kbrock has a suggestion, I'm afraid I don't know how to avoid it without losing accuracy.
Use count instead of reload.
Checked commit https://github.com/djberg96/manageiq/commit/9efc1a621415c60c3a2864bc1b7d740eabb2857d with ruby 2.3.3, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0 lib/infra_conversion_throttler.rb
I'm not fixing those current cops, they're dumb.
ok, was chatting with @djberg96. Defining a virtual total gets us 95% of the way there (wasn't able to ...):

class ConversionHost
  virtual_total :total_tasks, :active_tasks
end

class ExtManagementSystem
  def total_conversion_host_tasks
    vm_conversion_hosts.sum(:total_tasks) + host_conversion_hosts.sum(:total_tasks)
  end
end
Also of note:
While I'm not sure that cached counts are really our culprit, I can respect us wanting to get the current task count.

Just s/size/count will force a 2N+1 at the very least. It will slow down this method too much.

The time between downloading all these records (or counts) and acting upon them is a kind of race condition that will cause the imbalances that you are seeing. So getting this as fast as possible will shorten this window and give us better results.

Let's start off defining a virtual_total, to at least get the counts into the individual {vm,host}_conversion_host collections. If nothing else, we'll be able to use select(:total_tasks) to prefetch these total values in one fell swoop.

Then let's see if we can figure out how to get eligible into the db. That will allow us to pull back only 2 conversion_host records.

Running counts in the db may be quicker, but it will also be atomic, giving us a much better chance of picking the best host quickly.
 def check_concurrent_tasks
   max_tasks = max_concurrent_tasks || Settings.transformation.limits.max_concurrent_tasks_per_host
-  active_tasks.size < max_tasks
+  active_tasks.count < max_tasks
Please introduce total_tasks and reference that here. That way we can preload this value in this query and not cause an N+1.
I understand that you don't want to have a cached value from more than 10 seconds ago, but caching it within a single query/second seems prudent and non-wasteful.
Also of note: this count is checked within a loop that also runs counts, so a separate count here doesn't make sense.
(Also, using size and prefetching all active_tasks isn't much better.)
consensus: this is bad and should be changed
BUT
not today
==> Keep it with count
@@ -1,15 +1,24 @@
 class InfraConversionThrottler
   def self.start_conversions
     pending_conversion_jobs.each do |ems, jobs|
-      running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.size }
+      running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.count }
Looks like it may be tricky treating all conversion_hosts the same. I'll try and think of a way, but adding together vm_conversion_hosts values and host_conversion_hosts values may be the next best thing.
Encapsulating this and putting it into the ems may at least make this method look good.
       jobs.each do |job|
-        eligible_hosts = ems.conversion_hosts.select(&:eligible?).sort_by { |ch| ch.active_tasks.size }
+        eligible_hosts = ems.conversion_hosts.select(&:eligible?).sort_by { |ch| ch.active_tasks.count }
Not a big fan of bringing back every conversion host to then select out the eligible ones, and then hit the database for each of those.
Would like to find a way to get eligible into the query, and then possibly do the sum in the database too.
         if eligible_hosts.size > 0
           $log&.debug("The following conversion hosts are currently eligible: " + eligible_hosts.map(&:name).join(', '))
         end

         break if slots <= 0 || eligible_hosts.empty?
         job.migration_task.update_attributes!(:conversion_host => eligible_hosts.first)
I know this is not you but...
Do we really need to bring back every eligible host?
If we could get it into the query, could we just bring back the top host and vm and pick one of those?
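The "top host and vm" idea might look like this in miniature (a hedged sketch with invented sample data; the real version would push the ordering and limit into SQL rather than load every host):

```ruby
# Least-loaded candidate from each collection (invented sample data,
# standing in for one row pulled back per conversion-host collection).
top_vm_ch   = { name: 'vm_ch_1',   active_tasks: 2 }
top_host_ch = { name: 'host_ch_1', active_tasks: 1 }

# Compare just the two candidates instead of sorting every host in Ruby.
best = [top_vm_ch, top_host_ch].min_by { |ch| ch[:active_tasks] }
best[:name]   # => "host_ch_1"
```

Only two rows ever leave the database in this scheme, which is the point of the comment above.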
punting on this idea - looks like it may make sense to just cache the conversion hosts
         if eligible_hosts.size > 0
           $log&.debug("The following conversion hosts are currently eligible: " + eligible_hosts.map(&:name).join(', '))
         end

         break if slots <= 0 || eligible_hosts.empty?
Again, not you, but: if there are no slots, do we even need to do this jobs.each? What will that buy us?
We're going to throw all this work away anyway.
@kbrock thanks for all these comments. They show how bad our original code is :)
@miq-bot add-label transformation, bug, hammer/yes
@@ -1,15 +1,24 @@
 class InfraConversionThrottler
   def self.start_conversions
     pending_conversion_jobs.each do |ems, jobs|
-      running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.size }
+      running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.count }
+      $log&.debug("There are currently #{running} conversion hosts running.")
Why $log&.debug? $log shouldn't be nil. Also, you probably only want to log to debug if the logger is in debug mode, so something like this may be more appropriate:

$log.debug { "There are currently #{running} conversion hosts running." }
yes, please remove this from the PR
@kbrock I appreciate your insights here and there's no doubt this could use some refactoring. The problem is that I don't really understand the solutions you're proposing, e.g. I have no idea what ... . I wouldn't feel comfortable submitting a PR where I don't really understand the code, so I would like for you to submit a separate PR to refactor this at a later time. Given that we are pressed for time, I would ask that this be given a pass for now.
FUTURE: move the eligible (minus the max check) out of the jobs loop and up into the ems loop
NOW: ship it
       slots = (ems.miq_custom_get('Max Transformation Runners') || Settings.transformation.limits.max_concurrent_tasks_per_ems).to_i - running
       $log&.debug("The maximum number of concurrent tasks for the EMS is: #{slots}.")
can we remove this from the PR as well?
@kbrock I'll leave this one up to you to review and merge.
@kbrock, as @fdupont-redhat said above, I'd be OK with merging a less-than-optimal change if it fixes the code and we can do a followup to remove the N+1s or other optimizations later, but I'll defer to your judgement on this one.
[V2V] Modify active_tasks so that it always reloads (cherry picked from commit 660387c)

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1721117
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1721118
Hammer backport details:
At the moment the relation we created for active_tasks is getting cached, which is causing failures for v2v concurrency. By calling count instead of size it effectively forces a reload on active tasks so that we always get an up-to-date value when checking the number of concurrent tasks.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1716283
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1698761

Thanks to @Fryguy for the suggestion. :)