Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups #14542

Merged
merged 8 commits into from
Apr 1, 2017

Conversation

Ladas
Copy link
Contributor

@Ladas Ladas commented Mar 28, 2017

Optimize store_ids_for_new_records by getting rid of the O(n^2)
lookups. These lookups can take hours when we go over 50-100k records
being processed by store_ids_for_new_records.

And seems like there might be a Rails issue, since by doing
association.push(new_records) at
https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L69-L69

The association has the records doubled, the association is passed
into store_ids_for_new_records as 'records' parameter
https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L114-L114

And here when we do:
records.count => 50k
records.uniq.count => 25k
records.reload.count => 25k

That was slowing down the store_ids_for_new_records method even
more it seems.

Partially fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=1436176

@agrare
Copy link
Member

agrare commented Mar 28, 2017

cc @Fryguy

@Ladas
Copy link
Contributor Author

Ladas commented Mar 28, 2017

@chrisarcand hola, could it be that ems.association.push(new_records) has new issues in new rails? It doubles the records there as described in the PR description.

@Ladas
Copy link
Contributor Author

Ladas commented Mar 28, 2017

@Fryguy not sure what is the history of this complex casting
r.send(k) == r.class.type_for_attribute(k.to_s).cast(h[k])

but I am just basically doing

r.send(k).to_s == h[k].to_s

end

def build_index_from_record(keys, record)
keys.map { |key| record.send(key).to_s }.join(index_joiner)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to use strings and joiners...Arrays are acceptable Hash keys

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, though I always preferred string ( it's smaller in memory :-) ). I suppose I can change it to array if you think it's better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, since you are not creating a string, it's that much less in memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I meant more what I need to keep in memory, but this whole indexed hash will be probably GCed together with all the objects anyway. :-)

hashes_index[build_index_from_hash(keys, hash)] = hash
end

records.find_each do |record|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why .find_each as opposed to .each?

Copy link
Contributor Author

@Ladas Ladas Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the records are preloaded, but duplicated, so I rather fetch them again in batches, the issue described in description

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't realize the issue about duplicated records was related to the .find_each here. Even so, find_each forces you to go back to the database. Can you get away with just doing records.uniq.each ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, yeah, the uniq could be fast enough

@Fryguy
Copy link
Member

Fryguy commented Mar 28, 2017

@Fryguy not sure what is the history of this complex casting
r.send(k) == r.class.type_for_attribute(k.to_s).cast(h[k])

This is because a provider author could pass, say, a number into a string column. ActiveRecord does just fine in handling writing that number into this string column. However, when we try to do the comparison between the value in the ActiveRecord object and the value from the parser, they will be different because 0 != "0". Casting allows them to match based on the column type.

r.send(k).to_s == h[k].to_s

This seems like it might have problems, but offhand, I'm not sure how. First thought it that timestamps will definitely not match...I'd have to think if other datatypes would run into an issue.

@Ladas
Copy link
Contributor Author

Ladas commented Mar 28, 2017

@Fryguy right, 0 != "0" should be ok when both are .to_s. And seems to me all datatypes that are a candidate for an index should be ok, I could be wrong though. :-) But you know, the refresh specs are passing, so it must be all good. :-)

Optimize store_ids_for_new_records by getting rid of the O(n^2)
lookups. These lookups can take hours when we go over 50-100k records
being processed by store_ids_for_new_records.

And seems like there might be a Rails issue, since by doing
association.push(new_records) at
https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L69-L69

The association has the records doubled, the association is passed
into store_ids_for_new_records as 'records' parameter
https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L114-L114

And here when we do:
records.count => 50k
records.uniq.count => 25k
records.reload.count => 25k

That was slowing down the store_ids_for_new_records method even
more it seems.

Partially fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=1436176
@Fryguy
Copy link
Member

Fryguy commented Mar 28, 2017

And seems like there might be a Rails issue, since by doing
association.push(new_records) at
https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L69-L69

How can the records be there twice if we are only adding the new_records?

@Fryguy
Copy link
Member

Fryguy commented Mar 28, 2017

@Fryguy right, 0 != "0" should be ok when both are .to_s. And seems to me all datatypes that are a candidate for an index should be ok, I could be wrong though. :-) But you know, the refresh specs are passing, so it must be all good. :-)

Why not just leave it? This particular change is irrelevant to this PR, and could be done in a follow up PR if necessary, or am I missing something?

@@ -113,10 +113,31 @@ def save_child_inventory(obj, hashes, child_keys, *args)

def store_ids_for_new_records(records, hashes, keys)
keys = Array(keys)
hashes.each do |h|
r = records.detect { |r| keys.all? { |k| r.send(k) == r.class.type_for_attribute(k.to_s).cast(h[k]) } }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fryguy so I could keep the casting I think, if I load those upfront. I will need my morning brain for this though. :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I was thinking the casted values would be loaded up front for the index, giving an array of values. Then you just look them up by that same Array from the record.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I did it closer to the original way, so indexing the records themselves

but there is some bigger can of worms with the association.push, it's not actually creating some records (I suppose until we do the ems.save! ) the duplication might be related to this

Ladas added 3 commits March 28, 2017 23:36
Use the array as the index
…g records

Use uniq instead of find_each to get pass association.push
duplicating records. Uniq is not doing exra query, so it should be
quicker.
Change store_ids_for_new_records to support type cast again
@Ladas Ladas force-pushed the optimize_store_ids_for_new_records branch from 32e18b4 to 8289c25 Compare March 28, 2017 21:54
record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
Copy link
Contributor Author

@Ladas Ladas Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fryguy so if I want to keep the type cast, I need to get the class like this. So it's not exactly accurate. But that was the O(n^2), that we compare each hash to half of the records in average


hashes.each do |hash|
record = record_index[build_index_from_hash(keys, hash, record_class)]
hash[:id] = record.try(:id)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agrare @Fryguy I do get here record == nil now, so that is different from the previous code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, seems like the association.push is not saving the record and the uniq is somehow filtering them out

so it could be this is actually able to fill the refresh only in the second pass?

Do not use uniq since it filter out the non saved objects,
but then the calling .id result in the nil anyway. So the
crosslink will get filled only in the second refresh.
@Fryguy
Copy link
Member

Fryguy commented Mar 28, 2017

association.push is not saving the record

Correct, and if I recall that was intentional. It allowed the actual saves to get pushed down to when the parent record was saved, but perhaps that no longer makes sense if we are always returning the ids.


hashes.each do |hash|
record = record_index[build_index_from_hash(keys, hash, record_class)]
hash[:id] = record.id
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fryguy The id being nil seems to be a Containers issue, I don't see that happening for AWS

In containers ContainerQuotaItem, ContainerLimitItem and CustomAttribute are not being saved which is cause by the parent not being saved. I am not sure what I am looking at

Copy link
Contributor Author

@Ladas Ladas Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fryguy ah, it's because save_child_inventory is called before we call association.push, so in a case of child_inventory for a newly created record, we are sending there record that was only built and not yet created. Therefore the record.id result in nil and this relation is only filled in a followup refresh, where we update the existing record

So this seems like older and unrelated issue. :-) And now I know why we do https://github.com/Ladas/manageiq/blob/1c203b0401e357e338bd6ddc2ed90b32e5d81299/app/models/ems_refresh/save_inventory_network.rb#L188-L188 :-D

But, it should not be really an issue that the h[:id] is missing, unless we actually need to use that for a crosslink.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ladas are you able to reproduce that issue without this change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to reproduce this on master running the kubernetes refresher_spec

hashes.first[:container_quota_items].first
{:resource=>"cpu", :quota_desired=>"20", :quota_enforced=>"20", :quota_observed=>"100m", :id=>nil}

so nothing new from this PR 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I'll look into it.

record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can the class vary among records due to STI?
  2. Can type_for_attribute vary along STI?

There are certainly STI'd tables involved (e.g. PersistentVolume < ContainerVolume) — but hopefully not passed to this function in the same association? E.g. each of these calls should see uniform types:

    store_ids_for_new_records(ems.persistent_volumes, hashes, :ems_ref)
...elswhere...
    store_ids_for_new_records(container_group.container_volumes, hashes, :name)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the suggestion is to use base_class?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to using base_class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question: can't records be sometimes empty? records.first.class would be nil...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think type_for_attribute comes from the DB, so any class should be the same as base class, but might be better to use base class.

I was thinking if the records can be empty, it should not be. But better to check it. :-)

Copy link
Member

@kbrock kbrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good.

Thanks for continuously squashing these O(N^2)

# Lets first index the hashes based on keys, so we can do O(1) lookups
record_index = {}
records.each do |record|
record_index[build_index_from_record(keys, record)] = record
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does rails do this for us?

record_index = records.index_by { |record| build_index_from_record(keys, record) }

(I may be totally wrong here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think this should work

record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the suggestion is to use base_class?

Copy link
Contributor

@cben cben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. I noticed same O(n²) couple days ago, started to attack today then Mooli told of this PR :)

There is something that confuses me about this function's purpose, and its name...
If these records have already been saved, couldn't we jot down their ids at the moment of saving?
If they're deliberately not saved yet (kind of buffering as @Fryguy said), then it can't work? "_for_new_records" sounds a bit like "only the unsaved ones", which is precisely what it doesn't :-) IIUC it also works equally updating long-existing records.

The only sane contract I can see is that after doing save_something then store_ids, you'll have the ids.
=> I.e. we should save all unsaved records?
(Not in this PR probably. Just trying to wrap my head about it.)

h[:id] = r.id
# Lets first index the hashes based on keys, so we can do O(1) lookups
record_index = {}
records.each do |record|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could records.select(:id, *keys) help further?
Also, we don't need record_index to store the records themselves, can directly store the id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cben so the records are already in the memory, so we should not fetch them again if possible. But then we should not even keep all the records in memory. But saying that, we should not be doing this post indexing where we go through all of it again. :-)

The answer is, I rewrote it all for the graph refresh. :-) For the old refresh, I am just trying to eliminate O(n^2) and similar issues, that breaks the refresh after reaching some amount of data. But to actually optimize memory and speed, it needs a total rewrite, like the graph refresh does. :-)

record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question: can't records be sometimes empty? records.first.class would be nil...

@Ladas
Copy link
Contributor Author

Ladas commented Mar 30, 2017

@cben explained in #14542 (comment)

@agrare @cben
The issue with new records I described is only issue if you would need that id to be filled as a foreign key elsewhere. In that case we do the manual saving step as visible e.g. here https://github.com/Ladas/manageiq/blob/1c203b0401e357e338bd6ddc2ed90b32e5d81299/app/models/ems_refresh/save_inventory_network.rb#L188-L188

So it's not really an issue, unless you would have some actual foreign_keys using those, then you need to remember to use ^. But this should be visible in refresh specs, that are testing these foreign_keys relationshpis.

Reformat the record_index using index_by
@Ladas
Copy link
Contributor Author

Ladas commented Mar 31, 2017

@cben good point indeed, but in general it's hard to say if you need the store_ids_for_new_records. You would need to manually go through the parser and check what points to what, then you could delete the store_ids_for_new_records, if it's not needed. It's easier for the containers, since the saving code is not shared by many different providers.

In graph refresh, you can infer this automatically. Cause you check can if there is a edge, that would need the :id. Though right now, the :id assignment is basically a noop, there might be other strategies that will be more like it is now. E.g. for super large tables(like CustomAttribute, which is quickly going over >100k), I want to try to generate SQL queries, that will update/create thousands records at once, instead of 1 record at a time. Then we would need this post indexing, but only if there is an edge that would require one. :-)

@Fryguy
Copy link
Member

Fryguy commented Mar 31, 2017

it's hard to say if you need the store_ids_for_new_records

Originally, this was used to store only the ids in the hash that were needed by later save methods. Then they would have the id in place for foreign keys, so they wouldn't need to fetch the object...that is, the id gets placed in the hash, and due to hash reference sharing it's just automatically available later. Then, people saw a pattern and just started putting store_ids_for_new_records everywhere, even if it's not needed. Conceptually, it should never be an expensive method call, so there was no harm in it.

@Ladas do you have measurements on how much time this method contributed to the overall time? It should be trivial to add a Benchmark.realtime_block to add to the refresh timings.

@Fryguy
Copy link
Member

Fryguy commented Mar 31, 2017

@Ladas this is great! Awesome work!

@agrare once it goes green, I think this will be good to merge

@Ladas
Copy link
Contributor Author

Ladas commented Mar 31, 2017

@Fryguy not an exact measuring. Doing the AWS perf tests, eliminating store_ids_for_new_records had cut the refresh time by 15-30%. But the perf specs were probably hitting this bug. So it might be much less in the end.

@Ladas Ladas closed this Mar 31, 2017
@Ladas Ladas reopened this Mar 31, 2017
@agrare agrare closed this Mar 31, 2017
@agrare agrare reopened this Mar 31, 2017
@agrare agrare closed this Apr 1, 2017
@agrare agrare reopened this Apr 1, 2017
Copy link
Member

@agrare agrare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
This reduced refresh time on a customer system from 22min to 15min

@agrare agrare merged commit 053c16e into ManageIQ:master Apr 1, 2017
@agrare agrare added this to the Sprint 58 Ending Apr 10, 2017 milestone Apr 1, 2017
@agrare agrare added the fine/yes label Apr 5, 2017
simaishi pushed a commit that referenced this pull request Apr 5, 2017
Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups
(cherry picked from commit 053c16e)
@simaishi
Copy link
Contributor

simaishi commented Apr 5, 2017

Fine backport details:

$ git log -1
commit 708fd26720a82d09c40babbc1ddb0d2dcac5323e
Author: Adam Grare <[email protected]>
Date:   Sat Apr 1 14:22:13 2017 -0400

    Merge pull request #14542 from Ladas/optimize_store_ids_for_new_records
    
    Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups
    (cherry picked from commit 053c16e673d1ba73b98b6ebf73d98f46d2853c75)

@agrare agrare added the euwe/yes label Apr 8, 2017
simaishi pushed a commit that referenced this pull request Apr 11, 2017
Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups
(cherry picked from commit 053c16e)

https://bugzilla.redhat.com/show_bug.cgi?id=1436176
@simaishi
Copy link
Contributor

Euwe backport details:

$ git log -1
commit 23f6fbbaf8117d8af0a1af1fe179a06dc78e7bfa
Author: Adam Grare <[email protected]>
Date:   Sat Apr 1 14:22:13 2017 -0400

    Merge pull request #14542 from Ladas/optimize_store_ids_for_new_records
    
    Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups
    (cherry picked from commit 053c16e673d1ba73b98b6ebf73d98f46d2853c75)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1436176

@cben
Copy link
Contributor

cben commented Apr 19, 2017

@Ladas
Copy link
Contributor Author

Ladas commented Apr 19, 2017

@cben yeah, it might be doing similar thing on lower number of lines. TypedIndex is a bit complex for what it does (it's not present in Graph refresh)

agrare added a commit to agrare/manageiq that referenced this pull request Apr 19, 2017
…ew_records

Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups
(cherry picked from commit 053c16e)

https://bugzilla.redhat.com/show_bug.cgi?id=1441202
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants