Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups #14542
Conversation
cc @Fryguy
@chrisarcand hola, could it be that ems.association.push(new_records) has new issues in newer Rails? It doubles the records there, as described in the PR description.
@Fryguy not sure what the history of this complex casting is, but I am basically just doing:
```ruby
end

def build_index_from_record(keys, record)
  keys.map { |key| record.send(key).to_s }.join(index_joiner)
```
There's no need to use strings and joiners...Arrays are acceptable Hash keys
I know, though I've always preferred strings (smaller in memory :-) ). I suppose I can change it to an array if you think it's better.
Technically, since you are not creating a string, it's that much less in memory.
right, I meant more what I need to keep in memory, but this whole indexed hash will probably be GCed together with all the objects anyway. :-)
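For illustration, a minimal sketch of the two indexing styles being debated (an OpenStruct stands in for an ActiveRecord object, and the `"__"` joiner is hypothetical); Arrays hash by their contents, so they work as Hash keys without any joiner:

```ruby
require "ostruct"

record = OpenStruct.new(:ems_ref => "vm-1", :name => "web")
keys   = [:ems_ref, :name]

# String-based index key, as in this PR: values joined by a separator
string_key = keys.map { |k| record.send(k).to_s }.join("__") # => "vm-1__web"

# Array-based index key: no join, no intermediate string needed
array_key = keys.map { |k| record.send(k) }                  # => ["vm-1", "web"]

index = { array_key => record }
index[["vm-1", "web"]] # => record
```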
```ruby
  hashes_index[build_index_from_hash(keys, hash)] = hash
end

records.find_each do |record|
```
Why .find_each as opposed to .each?
the records are preloaded, but duplicated, so I'd rather fetch them again in batches; it's the issue described in the PR description
Ah, I didn't realize the issue about duplicated records was related to the .find_each here. Even so, find_each forces you to go back to the database. Can you get away with just doing records.uniq.each?
right, yeah, the uniq could be fast enough
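A sketch of the trade-off under discussion, assuming records is an already-loaded association containing duplicates (record_index and build_index_from_record as in this PR's diff):

```ruby
# find_each goes back to the database in batches (default 1000), which
# sidesteps the in-memory duplicates at the cost of extra queries.
records.find_each do |record|
  record_index[build_index_from_record(keys, record)] = record
end

# uniq deduplicates the already-loaded objects with no extra query.
records.uniq.each do |record|
  record_index[build_index_from_record(keys, record)] = record
end
```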
This is because a provider author could pass, say, a number into a string column. ActiveRecord does just fine in handling writing that number into this string column. However, when we try to do the comparison between the value in the ActiveRecord object and the value from the parser, they will be different, because the record holds the cast value (e.g. "0") while the parser still has the raw one (e.g. 0).
This seems like it might have problems, but offhand, I'm not sure how. First thought is that timestamps will definitely not match... I'd have to think about whether other datatypes would run into an issue.
@Fryguy right, 0 != "0", but they should be OK when both are .to_s. And it seems to me all datatypes that are candidates for an index should be OK; I could be wrong though. :-) But you know, the refresh specs are passing, so it must be all good. :-)
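For context, a hedged sketch of the cast being discussed; type_for_attribute is the ActiveRecord API used in the original code, while the Vm model and ems_ref column are illustrative:

```ruby
# A provider parser may hand us an Integer for a string column.
type = Vm.type_for_attribute("ems_ref") # an ActiveModel::Type::String

0 == "0"            # => false -- raw comparison fails across types
type.cast(0) == "0" # => true  -- casting normalizes the parser value

# The .to_s approach in this PR normalizes both sides the same way:
0.to_s == "0"       # => true
```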
Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups. These lookups can take hours when we go over 50-100k records being processed by store_ids_for_new_records.

And it seems like there might be a Rails issue, since by doing association.push(new_records) at
https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L69-L69
the association has the records doubled. The association is passed into store_ids_for_new_records as the 'records' parameter:
https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L114-L114

And here when we do:
records.count => 50k
records.uniq.count => 25k
records.reload.count => 25k

That was slowing down the store_ids_for_new_records method even more, it seems.

Partially fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1436176
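To make the before/after concrete, a simplified sketch of the change (the real diff uses build_index_from_record/build_index_from_hash helpers and type casting; this keeps just the shape):

```ruby
# Before: O(n^2) -- a linear records.detect for every hash.
hashes.each do |h|
  r = records.detect { |record| keys.all? { |k| record.send(k) == h[k] } }
  h[:id] = r.try(:id)
end

# After: O(n) -- index the records once, then O(1) hash lookups.
record_index = records.each_with_object({}) do |record, index|
  index[keys.map { |k| record.send(k).to_s }] = record
end

hashes.each do |h|
  h[:id] = record_index[keys.map { |k| h[k].to_s }].try(:id)
end
```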
How can the records be there twice if we are only adding the new_records?
Why not just leave it? This particular change is irrelevant to this PR and could be done in a follow-up PR if necessary, or am I missing something?
```ruby
@@ -113,10 +113,31 @@ def save_child_inventory(obj, hashes, child_keys, *args)

def store_ids_for_new_records(records, hashes, keys)
  keys = Array(keys)
  hashes.each do |h|
    r = records.detect { |r| keys.all? { |k| r.send(k) == r.class.type_for_attribute(k.to_s).cast(h[k]) } }
```
@Fryguy so I could keep the casting, I think, if I load those up front. I will need my morning brain for this though. :-)
yeah I was thinking the casted values would be loaded up front for the index, giving an array of values. Then you just look them up by that same Array from the record.
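A sketch of that suggestion, assuming a record_class is at hand (as introduced later in this PR): cast the parser-side values once while building the index, so each record lookup is a plain Array comparison:

```ruby
types = keys.map { |k| record_class.type_for_attribute(k.to_s) }

# Cast the hash values up front, once per hash rather than once per comparison.
hashes_index = hashes.each_with_object({}) do |hash, index|
  index[keys.zip(types).map { |k, type| type.cast(hash[k]) }] = hash
end

# Records already hold cast values, so their key Array matches directly.
records.each do |record|
  hash = hashes_index[keys.map { |k| record.send(k) }]
  hash[:id] = record.id if hash
end
```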
right, I did it closer to the original way, so indexing the records themselves.
But there is some bigger can of worms with the association.push: it's not actually creating some records (I suppose until we do the ems.save!); the duplication might be related to this.
Use the array as the index
Use uniq instead of find_each to get past association.push duplicating records. Uniq is not doing an extra query, so it should be quicker.
Change store_ids_for_new_records to support type cast again
force-pushed from 32e18b4 to 8289c25
```ruby
  record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
```
@Fryguy so if I want to keep the type cast, I need to get the class like this, so it's not exactly accurate. But that was the O(n^2): we compare each hash to half of the records on average.
```ruby
hashes.each do |hash|
  record = record_index[build_index_from_hash(keys, hash, record_class)]
  hash[:id] = record.try(:id)
```
ah, seems like the association.push is not saving the record, and the uniq is somehow filtering them out.
So it could be this is actually able to fill the refresh only in the second pass?
Do not use uniq, since it filters out the non-saved objects; but then calling .id results in nil anyway, so the crosslink will get filled only in the second refresh.
Correct, and if I recall, that was intentional. It allowed the actual saves to get pushed down to when the parent record was saved, but perhaps that no longer makes sense if we are always returning the ids.
```ruby
hashes.each do |hash|
  record = record_index[build_index_from_hash(keys, hash, record_class)]
  hash[:id] = record.id
```
@Fryguy The id being nil seems to be a Containers issue; I don't see that happening for AWS.
In containers, ContainerQuotaItem, ContainerLimitItem and CustomAttribute are not being saved, which is caused by the parent not being saved. I am not sure what I am looking at.
@Fryguy ah, it's because save_child_inventory is called before we call association.push, so in the case of child_inventory for a newly created record, we are sending in a record that was only built and not yet created. Therefore record.id results in nil, and this relation is only filled in a follow-up refresh, where we update the existing record.
So this seems like an older and unrelated issue. :-) And now I know why we do https://github.com/Ladas/manageiq/blob/1c203b0401e357e338bd6ddc2ed90b32e5d81299/app/models/ems_refresh/save_inventory_network.rb#L188-L188 :-D
But it should not really be an issue that the h[:id] is missing, unless we actually need to use it for a crosslink.
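A tiny sketch of why the id comes out nil there (models from the containers example above; the value 42 is illustrative): build instantiates the child without saving it, and the id only appears once the parent's save cascades down:

```ruby
quota = ContainerQuota.new
item  = quota.container_quota_items.build(:resource => "cpu")

item.id     # => nil -- only built, not yet created
quota.save! # saving the parent also persists the built children
item.id     # => 42 (assigned by the database on save)
```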
@Ladas are you able to reproduce that issue without this change?
I was able to reproduce this on master running the kubernetes refresher_spec:

```ruby
hashes.first[:container_quota_items].first
# => {:resource=>"cpu", :quota_desired=>"20", :quota_enforced=>"20", :quota_observed=>"100m", :id=>nil}
```

so nothing new from this PR 👍
Interesting, I'll look into it.
```ruby
  record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
```
- Can the class vary among records due to STI?
- Can type_for_attribute vary along STI?

There are certainly STI'd tables involved (e.g. PersistentVolume < ContainerVolume), but hopefully not passed to this function in the same association? E.g. each of these calls should see uniform types:

```ruby
store_ids_for_new_records(ems.persistent_volumes, hashes, :ems_ref)
# ...elsewhere...
store_ids_for_new_records(container_group.container_volumes, hashes, :name)
```
so the suggestion is to use base_class?
👍 to using base_class
Another question: can't records sometimes be empty? records.first.class would be nil...
I think type_for_attribute comes from the DB, so any class should be the same as the base class, but it might be better to use base_class.
I was thinking about whether the records can be empty; they should not be, but better to check it. :-)
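A sketch folding in both points (the guard's placement is an assumption): return early when there is nothing to index, and take base_class so STI subclasses resolve the same column types:

```ruby
def store_ids_for_new_records(records, hashes, keys)
  return if records.empty?

  # PersistentVolume < ContainerVolume share one table, so base_class
  # gives the same type_for_attribute results for every STI subclass.
  record_class = records.first.class.base_class
  # ... index the records and look up the hashes as in the diff above ...
end
```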
Looking good. Thanks for continuously squashing these O(N^2)s!
```ruby
# Lets first index the hashes based on keys, so we can do O(1) lookups
record_index = {}
records.each do |record|
  record_index[build_index_from_record(keys, record)] = record
```
Does Rails do this for us?

```ruby
record_index = records.index_by { |record| build_index_from_record(keys, record) }
```

(I may be totally wrong here)
yeah I think this should work
```ruby
  record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
```
This is great. I noticed the same O(n²) a couple of days ago, started to attack it today, then Mooli told me of this PR :)

There is something that confuses me about this function's purpose, and its name... If these records have already been saved, couldn't we jot down their ids at the moment of saving? If they're deliberately not saved yet (kind of buffering, as @Fryguy said), then it can't work? "_for_new_records" sounds a bit like "only the unsaved ones", which is precisely what it doesn't do :-) IIUC it also works equally well when updating long-existing records.

The only sane contract I can see is that after doing save_something then store_ids, you'll have the ids. => I.e. we should save all unsaved records?

(Not in this PR probably. Just trying to wrap my head around it.)
```ruby
h[:id] = r.id
# Lets first index the hashes based on keys, so we can do O(1) lookups
record_index = {}
records.each do |record|
```
Could records.select(:id, *keys) help further?
Also, we don't need record_index to store the records themselves; it can directly store the id.
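A sketch of that idea, assuming records can still be narrowed with select (i.e. the relation is re-queried): pull only the needed columns and index straight to ids, so the full ActiveRecord objects need not be retained:

```ruby
# Fetch just the id plus the index columns...
slim_records = records.select(:id, *keys)

# ...and map index keys directly to ids instead of whole objects.
record_index = slim_records.each_with_object({}) do |record, index|
  index[keys.map { |k| record.send(k).to_s }] = record.id
end

hashes.each do |hash|
  hash[:id] = record_index[keys.map { |k| hash[k].to_s }]
end
```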
@cben so the records are already in memory, so we should not fetch them again if possible. But then we should not even keep all the records in memory. That said, we should not be doing this post-indexing where we go through all of it again. :-)
The answer is: I rewrote it all for the graph refresh. :-) For the old refresh, I am just trying to eliminate O(n^2) and similar issues that break the refresh after reaching some amount of data. But to actually optimize memory and speed, it needs a total rewrite, like the graph refresh does. :-)
```ruby
  record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
```
Another question: can't records sometimes be empty? records.first.class would be nil...
@cben explained in #14542 (comment). @agrare @cben So it's not really an issue, unless you would have some actual foreign_keys using those; then you need to remember to use ^. But this should be visible in the refresh specs that are testing these foreign_keys relationships.
Reformat the record_index using index_by
@cben good point indeed, but in general it's hard to say if you need the store_ids_for_new_records. You would need to manually go through the parser and check what points to what; then you could delete the store_ids_for_new_records if it's not needed. It's easier for the containers, since the saving code is not shared by many different providers. In graph refresh, you can infer this automatically, because you can check if there is an edge that would need the :id. Though right now the :id assignment is basically a noop, there might be other strategies that will be more like it is now. E.g. for super large tables (like CustomAttribute, which quickly goes over >100k), I want to try to generate SQL queries that will update/create thousands of records at once, instead of 1 record at a time. Then we would need this post-indexing, but only if there is an edge that requires one. :-)
Originally, this was used to store only the ids in the hash that were needed by later save methods. Then they would have the id in place for foreign keys, so they wouldn't need to fetch the object... that is, the id gets placed in the hash, and due to hash reference sharing it's just automatically available later. Then people saw a pattern and just started putting store_ids_for_new_records everywhere, even when it's not needed. Conceptually, it should never be an expensive method call, so there was no harm in it. @Ladas do you have measurements on how much time this method contributed to the overall time? It should be trivial to add a Benchmark.realtime_block to add to the refresh timings.
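For reference, a sketch of that measurement using ManageIQ's Benchmark.realtime_block extension (the timing key and placement are illustrative):

```ruby
# realtime_block returns the block's result plus a timings hash,
# which can then be merged into the refresh timings.
result, timings = Benchmark.realtime_block(:store_ids_for_new_records) do
  store_ids_for_new_records(records, hashes, keys)
end

timings # => {:store_ids_for_new_records => 12.34} (seconds, illustrative)
```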
@Fryguy not an exact measurement. Doing the AWS perf tests, eliminating store_ids_for_new_records cut the refresh time by 15-30%. But the perf specs were probably hitting this bug, so it might be much less in the end.
LGTM 👍
This reduced refresh time on a customer system from 22min to 15min
Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups (cherry picked from commit 053c16e)
Fine backport details:
Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups (cherry picked from commit 053c16e) https://bugzilla.redhat.com/show_bug.cgi?id=1436176
Euwe backport details:
Turns out the new code here is a near duplicate of TypedIndex.
@cben yeah, it might be doing a similar thing in a lower number of lines. TypedIndex is a bit complex for what it does (it's not present in the graph refresh).
…ew_records Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups (cherry picked from commit 053c16e) https://bugzilla.redhat.com/show_bug.cgi?id=1441202