Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[V2V] Modify active_tasks so that it always reloads #18860

Merged
merged 1 commit into from
Jun 14, 2019
Merged

[V2V] Modify active_tasks so that it always reloads #18860

merged 1 commit into from
Jun 14, 2019

Conversation

djberg96
Copy link
Contributor

@djberg96 djberg96 commented Jun 13, 2019

At the moment the relation we created for active_tasks is getting cached, which is causing failures for v2v concurrency. By calling count instead of size it effectively forces a reload on active tasks so that we always get an up to date value when checking for the number of concurrent tasks.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1716283

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1698761

Thanks to @Fryguy for the suggestion. :)

@Fryguy
Copy link
Member

Fryguy commented Jun 13, 2019

If you use .count instead of .size then it wont execute the query, and we'll instead do a count * every time, and thus no need for reload.

https://mensfeld.pl/2014/09/activerecord-count-vs-length-vs-size-and-what-will-happen-if-you-use-it-the-way-you-shouldnt/

@djberg96
Copy link
Contributor Author

@Fryguy Ah, true. Well, unless you use a counter_cache, but we aren't. :)

@djberg96
Copy link
Contributor Author

Not sure what's happening with 2.5.x, but the failures appear to be unrelated.

@Fryguy
Copy link
Member

Fryguy commented Jun 13, 2019

That error looks related to the embedded ansible changes in #18687 . @djberg96 if this PR is not fully rebased, please try that first, otherwise just rerun that one travis build. That code is being replaced anyway, so if it's sporadic, I'll take care of it.

@@ -1,10 +1,10 @@
class InfraConversionThrottler
def self.start_conversions
pending_conversion_jobs.each do |ems, jobs|
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.size }
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.count }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a terrible N+1 (query in a loop is bad)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @kbrock

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there will be less than 20 conversion hosts.

I did come up with a single query for this, but since it is not in the primary loop, that can hold off for another day. This is only called 1 time per ems and is a lower concern

@Fryguy Fryguy self-requested a review June 13, 2019 13:59
Fryguy
Fryguy previously requested changes Jun 13, 2019
Copy link
Member

@Fryguy Fryguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newer changes introduce N+1s (which may have been there previously if the relations weren't cached). cc @kbrock

@djberg96
Copy link
Contributor Author

I don't see a way around it, we have to have up to date information every time. Plus, it looks to me like the database caches that explain plan (the plan, not the result), so subsequent runs are zippier than the first.

Unless @kbrock has a suggestion, I'm afraid I don't know how to avoid it without losing accuracy.

@djberg96
Copy link
Contributor Author

@Fryguy was forked off latest master, e90a007 was last commit (from today by Martin P) before I created branch.

Use count instead of reload.
@miq-bot
Copy link
Member

miq-bot commented Jun 13, 2019

Checked commit https://github.com/djberg96/manageiq/commit/9efc1a621415c60c3a2864bc1b7d740eabb2857d with ruby 2.3.3, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0
2 files checked, 2 offenses detected

lib/infra_conversion_throttler.rb

@djberg96
Copy link
Contributor Author

I'm not fixing those current cops, they're dumb.

@kbrock
Copy link
Member

kbrock commented Jun 13, 2019

ok, was chatting with @djberg96

defining a virtual total gets us 95% of the way there
unfortunately, Ems#conversion_hosts is not a simple relation but rather it is vm_conversion_hosts + host_conversion_hosts. So we'll need to define a manual sum method

wasn't able to

class ConversionHost
  virtual_total :total_tasks, :active_tasks
end

class ExtManagementSystem
  def total_conversion_host_tasks
    vm_conversion_hosts.sum(:total_tasks) + host_conversion_hosts.sum(:total_tasks)
  end
end

@kbrock
Copy link
Member

kbrock commented Jun 13, 2019

Use !empty? instead of size > 0
--rubocop

Also of note: size vs count vs empty? vs present? are not always easily substitutable. Some call queries and others do not. Please don't blindly listen to rubocop here. (I know you said "no", but just asking others to do the same when it comes to relations)

Copy link
Member

@kbrock kbrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I'm not sure that cached counts is really our culprit, I can respect us wanting to get the current task count.

just s/size/count will force an 2N+1 at the very least. It will slow down this method too much.

The time between downloading all these records (or counts) and acting upon them is a kind of race condition that will cause the imbalances that you are seeing. So getting this as fast as possible will shorten this window and give us better results.

lets start off defining a virtual_total, to at least get the counts into the individual {vm,host}_conversion_host collections. If nothing else, we'll be able to use select(:total_tasks) to prefetch these total values in one fell swoop.

then lets see if we can figure out how to get eligible into the db.
that will allow us to pull back only 2 conversion_host records.

running counts in the db may be quicker, but it will also be atomic, giving us a much better chance of picking the best host quickly.

def check_concurrent_tasks
max_tasks = max_concurrent_tasks || Settings.transformation.limits.max_concurrent_tasks_per_host
active_tasks.size < max_tasks
active_tasks.count < max_tasks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please introduce total_tasks and reference that here.

that way we can preload this value in this query and not cause an N+1

I understand that you don't want to have a cached value from more than 10 seconds ago, but caching it within a single query / second seems prudent and non-wasteful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, of note, this count is checked within a loop that also runs counts.
so a separate count here doesn't make sense.

(also using size and prefetching all active_tasks isn't much better)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consensus: this is bad and should be changed
BUT
not today

==> Keep it with count

@@ -1,15 +1,24 @@
class InfraConversionThrottler
def self.start_conversions
pending_conversion_jobs.each do |ems, jobs|
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.size }
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.count }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it may be tricky treating all conversion_hosts the same.

I'll try and think of a way, but adding together vm_conversion_hosts values and host_conversion_hosts values may be the next best thing.

encapsulating this and putting it into ems may at least make this method look good.

jobs.each do |job|
eligible_hosts = ems.conversion_hosts.select(&:eligible?).sort_by { |ch| ch.active_tasks.size }
eligible_hosts = ems.conversion_hosts.select(&:eligible?).sort_by { |ch| ch.active_tasks.count }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big fan of bringing back every conversion host to then select out the eligible ones.
to then hit the database for each of those.

would like to find a way to get eligible into the query and then possibly do the sum in the database too

if eligible_hosts.size > 0
$log&.debug("The following conversion hosts are currently eligible: " + eligible_hosts.map(&:name).join(', '))
end

break if slots <= 0 || eligible_hosts.empty?
job.migration_task.update_attributes!(:conversion_host => eligible_hosts.first)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not you but...

do we really need to bring back every eligible host?

if we could get it into the query, could we just bring back the top host and vm and pick one of those?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

punting on this idea - looks like it may make sense to just cache the conversion hosts

if eligible_hosts.size > 0
$log&.debug("The following conversion hosts are currently eligible: " + eligible_hosts.map(&:name).join(', '))
end

break if slots <= 0 || eligible_hosts.empty?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again not you but, if there are no slots, do we even need to do this jobs.each - what will that buy us?
we're going to throw all this work away anyway.

@ghost
Copy link

ghost commented Jun 14, 2019

@kbrock thanks for all these comments. They show how bad our original code is :)
However, this PR aims at fixing a regression and @Fryguy agreed that a small/good enough fix is ok for backport. I have created #18865 to track the effort to make that whole method better in master, where we can do bigger/riskier changes. Would you mind commenting on the issue, so that we don't lose all this great analysis?

@ghost
Copy link

ghost commented Jun 14, 2019

@miq-bot add-label transformation, bug, hammer/yes

@@ -1,15 +1,24 @@
class InfraConversionThrottler
def self.start_conversions
pending_conversion_jobs.each do |ems, jobs|
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.size }
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.count }
$log&.debug("There are currently #{running} conversion hosts running.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why $log&.debug? $log shouldn't be nil. Also, you probably only want to log to debug if the logger is in debug mode, so something like this may be more appropriate..

$log.debug { "There are currently #{running} conversion hosts running." }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please remove this from the PR

@djberg96
Copy link
Contributor Author

@kbrock I appreciate your insights here and there's no doubt this could use some refactoring. The problem is that I don't really understand the solutions you're proposing, e.g. I have no idea what virtual_total is actually doing behind the scenes.

I wouldn't feel comfortable submitting a PR where I don't really understand the code, so I would like for you to submit a separate PR to refactor this at a later time. Given that we are pressed for time, I would ask that this be given a pass for now.

Copy link
Member

@kbrock kbrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FUTURE: move the eligible (minus the max check) out of the jobs loop and up into the ems loop

NOW: ship it

@@ -1,10 +1,10 @@
class InfraConversionThrottler
def self.start_conversions
pending_conversion_jobs.each do |ems, jobs|
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.size }
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.count }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there will be less than 20 conversion hosts.

I did come up with a single query for this, but since it is not in the primary loop, that can hold off for another day. This is only called 1 time per ems and is a lower concern

def check_concurrent_tasks
max_tasks = max_concurrent_tasks || Settings.transformation.limits.max_concurrent_tasks_per_host
active_tasks.size < max_tasks
active_tasks.count < max_tasks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consensus: this is bad and should be changed
BUT
not today

==> Keep it with count

@@ -1,15 +1,24 @@
class InfraConversionThrottler
def self.start_conversions
pending_conversion_jobs.each do |ems, jobs|
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.size }
running = ems.conversion_hosts.inject(0) { |sum, ch| sum + ch.active_tasks.count }
$log&.debug("There are currently #{running} conversion hosts running.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please remove this from the PR

slots = (ems.miq_custom_get('Max Transformation Runners') || Settings.transformation.limits.max_concurrent_tasks_per_ems).to_i - running
$log&.debug("The maximum number of concurrent tasks for the EMS is: #{slots}.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this from the PR as well?

if eligible_hosts.size > 0
$log&.debug("The following conversion hosts are currently eligible: " + eligible_hosts.map(&:name).join(', '))
end

break if slots <= 0 || eligible_hosts.empty?
job.migration_task.update_attributes!(:conversion_host => eligible_hosts.first)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

punting on this idea - looks like it may make sense to just cache the conversion hosts

@Fryguy Fryguy self-requested a review June 14, 2019 20:22
@Fryguy Fryguy dismissed their stale review June 14, 2019 20:22

deferring to @kbrock

@Fryguy
Copy link
Member

Fryguy commented Jun 14, 2019

@kbrock I'll leave this one up to you to review and merge.

@Fryguy
Copy link
Member

Fryguy commented Jun 14, 2019

@kbrock , As @fdupont-redhat said above, I'd be ok with merging a less-than optimal change if it fixes the code and we can do a followup to remove the N+1s or other optimizations later, but I'll defer to your judgement on this one.

@kbrock kbrock added this to the Sprint 114 Ending Jun 24, 2019 milestone Jun 14, 2019
@kbrock kbrock merged commit 660387c into ManageIQ:master Jun 14, 2019
simaishi pushed a commit that referenced this pull request Jun 17, 2019
@simaishi
Copy link
Contributor

Hammer backport details:

$ git log -1
commit 59e5de4f03c21eaaa584ef5a2ab933bd4ab4a8c5
Author: Keenan Brock <[email protected]>
Date:   Fri Jun 14 18:03:37 2019 -0400

    Merge pull request #18860 from djberg96/conversion_host_throttling
    
    [V2V] Modify active_tasks so that it always reloads
    
    (cherry picked from commit 660387c242a7581405ad85280370cd77317aac36)
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1721117
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1721118

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants