Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups #14542
Conversation
cc @Fryguy
@chrisarcand hola, could it be that ems.association.push(new_records) has new issues in newer Rails? It doubles the records there, as described in the PR description.
@Fryguy not sure what the history of this complex casting is, but I am basically just doing:
```ruby
end

def build_index_from_record(keys, record)
  keys.map { |key| record.send(key).to_s }.join(index_joiner)
```
There's no need to use strings and joiners...Arrays are acceptable Hash keys
I know, though I've always preferred strings (smaller in memory :-) ). I suppose I can change it to an array if you think it's better.
Technically, since you are not creating a string, it's that much less in memory.
right, I meant more what I need to keep in memory, but this whole indexed hash will probably be GCed together with all the objects anyway. :-)
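For illustration, a minimal sketch of the two indexing styles being debated (an OpenStruct stands in for an ActiveRecord object, and the `"__"` joiner is hypothetical); Arrays hash by their contents, so they work as Hash keys without any joiner:

```ruby
require "ostruct"

record = OpenStruct.new(:ems_ref => "vm-1", :name => "web")
keys   = [:ems_ref, :name]

# String-based index key, as in this PR: values joined by a separator
string_key = keys.map { |k| record.send(k).to_s }.join("__") # => "vm-1__web"

# Array-based index key: no join, no intermediate string needed
array_key = keys.map { |k| record.send(k) }                  # => ["vm-1", "web"]

index = { array_key => record }
index[["vm-1", "web"]] # => record
```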
```ruby
  hashes_index[build_index_from_hash(keys, hash)] = hash
end

records.find_each do |record|
```
Why .find_each as opposed to .each?
the records are preloaded, but duplicated, so I'd rather fetch them again in batches; it's the issue described in the PR description
Ah, I didn't realize the issue about duplicated records was related to the .find_each here. Even so, find_each forces you to go back to the database. Can you get away with just doing records.uniq.each?
right, yeah, the uniq could be fast enough
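A sketch of the trade-off under discussion, assuming records is an already-loaded association containing duplicates (record_index and build_index_from_record as in this PR's diff):

```ruby
# find_each goes back to the database in batches (default 1000), which
# sidesteps the in-memory duplicates at the cost of extra queries.
records.find_each do |record|
  record_index[build_index_from_record(keys, record)] = record
end

# uniq deduplicates the already-loaded objects with no extra query.
records.uniq.each do |record|
  record_index[build_index_from_record(keys, record)] = record
end
```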
This is because a provider author could pass, say, a number into a string column. ActiveRecord does just fine in handling writing that number into this string column. However, when we try to do the comparison between the value in the ActiveRecord object and the value from the parser, they will be different, because the record holds the cast value (e.g. "0") while the parser still has the raw one (e.g. 0).
This seems like it might have problems, but offhand, I'm not sure how. First thought is that timestamps will definitely not match... I'd have to think about whether other datatypes would run into an issue.
@Fryguy right, 0 != "0", but they should be OK when both are .to_s. And it seems to me all datatypes that are candidates for an index should be OK; I could be wrong though. :-) But you know, the refresh specs are passing, so it must be all good. :-)
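For context, a hedged sketch of the cast being discussed; type_for_attribute is the ActiveRecord API used in the original code, while the Vm model and ems_ref column are illustrative:

```ruby
# A provider parser may hand us an Integer for a string column.
type = Vm.type_for_attribute("ems_ref") # an ActiveModel::Type::String

0 == "0"            # => false -- raw comparison fails across types
type.cast(0) == "0" # => true  -- casting normalizes the parser value

# The .to_s approach in this PR normalizes both sides the same way:
0.to_s == "0"       # => true
```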
Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups. These lookups can take hours when we go over 50-100k records being processed by store_ids_for_new_records.

And it seems like there might be a Rails issue, since by doing association.push(new_records) at
https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L69-L69
the association has the records doubled. The association is passed into store_ids_for_new_records as the 'records' parameter:
https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L114-L114

And here when we do:
records.count => 50k
records.uniq.count => 25k
records.reload.count => 25k

That was slowing down the store_ids_for_new_records method even more, it seems.

Partially fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1436176
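To make the before/after concrete, a simplified sketch of the change (the real diff uses build_index_from_record/build_index_from_hash helpers and type casting; this keeps just the shape):

```ruby
# Before: O(n^2) -- a linear records.detect for every hash.
hashes.each do |h|
  r = records.detect { |record| keys.all? { |k| record.send(k) == h[k] } }
  h[:id] = r.try(:id)
end

# After: O(n) -- index the records once, then O(1) hash lookups.
record_index = records.each_with_object({}) do |record, index|
  index[keys.map { |k| record.send(k).to_s }] = record
end

hashes.each do |h|
  h[:id] = record_index[keys.map { |k| h[k].to_s }].try(:id)
end
```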
How can the records be there twice if we are only adding the new_records?
Why not just leave it? This particular change is irrelevant to this PR and could be done in a follow-up PR if necessary, or am I missing something?
```ruby
@@ -113,10 +113,31 @@ def save_child_inventory(obj, hashes, child_keys, *args)

def store_ids_for_new_records(records, hashes, keys)
  keys = Array(keys)
  hashes.each do |h|
    r = records.detect { |r| keys.all? { |k| r.send(k) == r.class.type_for_attribute(k.to_s).cast(h[k]) } }
```
@Fryguy so I could keep the casting, I think, if I load those up front. I will need my morning brain for this though. :-)
yeah I was thinking the casted values would be loaded up front for the index, giving an array of values. Then you just look them up by that same Array from the record.
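A sketch of that suggestion, assuming a record_class is at hand (as introduced later in this PR): cast the parser-side values once while building the index, so each record lookup is a plain Array comparison:

```ruby
types = keys.map { |k| record_class.type_for_attribute(k.to_s) }

# Cast the hash values up front, once per hash rather than once per comparison.
hashes_index = hashes.each_with_object({}) do |hash, index|
  index[keys.zip(types).map { |k, type| type.cast(hash[k]) }] = hash
end

# Records already hold cast values, so their key Array matches directly.
records.each do |record|
  hash = hashes_index[keys.map { |k| record.send(k) }]
  hash[:id] = record.id if hash
end
```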
right, I did it closer to the original way, so indexing the records themselves.
But there is some bigger can of worms with the association.push: it's not actually creating some records (I suppose until we do the ems.save!); the duplication might be related to this.
Use the array as the index
Use uniq instead of find_each to get past association.push duplicating records. Uniq is not doing an extra query, so it should be quicker.
Change store_ids_for_new_records to support type cast again
force-pushed from 32e18b4 to 8289c25
```ruby
  record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
```
@Fryguy so if I want to keep the type cast, I need to get the class like this, so it's not exactly accurate. But that was the O(n^2): we compare each hash to half of the records on average.
```ruby
hashes.each do |hash|
  record = record_index[build_index_from_hash(keys, hash, record_class)]
  hash[:id] = record.try(:id)
```
ah, seems like the association.push is not saving the record, and the uniq is somehow filtering them out.
So it could be this is actually able to fill the refresh only in the second pass?
Do not use uniq, since it filters out the non-saved objects; but then calling .id results in nil anyway, so the crosslink will get filled only in the second refresh.
Correct, and if I recall, that was intentional. It allowed the actual saves to get pushed down to when the parent record was saved, but perhaps that no longer makes sense if we are always returning the ids.
```ruby
hashes.each do |hash|
  record = record_index[build_index_from_hash(keys, hash, record_class)]
  hash[:id] = record.id
```
@Fryguy The id being nil seems to be a Containers issue; I don't see that happening for AWS.
In containers, ContainerQuotaItem, ContainerLimitItem and CustomAttribute are not being saved, which is caused by the parent not being saved. I am not sure what I am looking at.
@Fryguy ah, it's because save_child_inventory is called before we call association.push, so in the case of child_inventory for a newly created record, we are sending in a record that was only built and not yet created. Therefore record.id results in nil, and this relation is only filled in a follow-up refresh, where we update the existing record.
So this seems like an older and unrelated issue. :-) And now I know why we do https://github.com/Ladas/manageiq/blob/1c203b0401e357e338bd6ddc2ed90b32e5d81299/app/models/ems_refresh/save_inventory_network.rb#L188-L188 :-D
But it should not really be an issue that the h[:id] is missing, unless we actually need to use it for a crosslink.
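A tiny sketch of why the id comes out nil there (models from the containers example above; the value 42 is illustrative): build instantiates the child without saving it, and the id only appears once the parent's save cascades down:

```ruby
quota = ContainerQuota.new
item  = quota.container_quota_items.build(:resource => "cpu")

item.id     # => nil -- only built, not yet created
quota.save! # saving the parent also persists the built children
item.id     # => 42 (assigned by the database on save)
```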
@Ladas are you able to reproduce that issue without this change?
I was able to reproduce this on master running the kubernetes refresher_spec:

```ruby
hashes.first[:container_quota_items].first
# => {:resource=>"cpu", :quota_desired=>"20", :quota_enforced=>"20", :quota_observed=>"100m", :id=>nil}
```

so nothing new from this PR 👍
Interesting, I'll look into it.
```ruby
  record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
```
- Can the class vary among records due to STI?
- Can type_for_attribute vary along STI?

There are certainly STI'd tables involved (e.g. PersistentVolume < ContainerVolume), but hopefully not passed to this function in the same association? E.g. each of these calls should see uniform types:

```ruby
store_ids_for_new_records(ems.persistent_volumes, hashes, :ems_ref)
# ...elsewhere...
store_ids_for_new_records(container_group.container_volumes, hashes, :name)
```
so the suggestion is to use base_class?
👍 to using base_class
Another question: can't records sometimes be empty? records.first.class would be nil...
I think type_for_attribute comes from the DB, so any class should be the same as the base class, but it might be better to use base_class.
I was thinking about whether the records can be empty; they should not be, but better to check it. :-)
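A sketch folding in both points (the guard's placement is an assumption): return early when there is nothing to index, and take base_class so STI subclasses resolve the same column types:

```ruby
def store_ids_for_new_records(records, hashes, keys)
  return if records.empty?

  # PersistentVolume < ContainerVolume share one table, so base_class
  # gives the same type_for_attribute results for every STI subclass.
  record_class = records.first.class.base_class
  # ... index the records and look up the hashes as in the diff above ...
end
```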
Looking good. Thanks for continuously squashing these O(N^2)s!
```ruby
# Lets first index the hashes based on keys, so we can do O(1) lookups
record_index = {}
records.each do |record|
  record_index[build_index_from_record(keys, record)] = record
```
Does Rails do this for us?

```ruby
record_index = records.index_by { |record| build_index_from_record(keys, record) }
```

(I may be totally wrong here)
yeah I think this should work
```ruby
  record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
```
This is great. I noticed the same O(n²) a couple of days ago, started to attack it today, then Mooli told me of this PR :)

There is something that confuses me about this function's purpose, and its name... If these records have already been saved, couldn't we jot down their ids at the moment of saving? If they're deliberately not saved yet (kind of buffering, as @Fryguy said), then it can't work? "_for_new_records" sounds a bit like "only the unsaved ones", which is precisely what it doesn't do :-) IIUC it also works equally well when updating long-existing records.

The only sane contract I can see is that after doing save_something then store_ids, you'll have the ids. => I.e. we should save all unsaved records?

(Not in this PR probably. Just trying to wrap my head around it.)
```ruby
h[:id] = r.id
# Lets first index the hashes based on keys, so we can do O(1) lookups
record_index = {}
records.each do |record|
```
Could records.select(:id, *keys) help further?
Also, we don't need record_index to store the records themselves; it can directly store the id.
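A sketch of that idea, assuming records can still be narrowed with select (i.e. the relation is re-queried): pull only the needed columns and index straight to ids, so the full ActiveRecord objects need not be retained:

```ruby
# Fetch just the id plus the index columns...
slim_records = records.select(:id, *keys)

# ...and map index keys directly to ids instead of whole objects.
record_index = slim_records.each_with_object({}) do |record, index|
  index[keys.map { |k| record.send(k).to_s }] = record.id
end

hashes.each do |hash|
  hash[:id] = record_index[keys.map { |k| hash[k].to_s }]
end
```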
@cben so the records are already in memory, so we should not fetch them again if possible. But then we should not even keep all the records in memory. That said, we should not be doing this post-indexing where we go through all of it again. :-)
The answer is: I rewrote it all for the graph refresh. :-) For the old refresh, I am just trying to eliminate O(n^2) and similar issues that break the refresh after reaching some amount of data. But to actually optimize memory and speed, it needs a total rewrite, like the graph refresh does. :-)
```ruby
  record_index[build_index_from_record(keys, record)] = record
end

record_class = records.first.class
```
Another question: can't records sometimes be empty? records.first.class would be nil...
@cben explained in #14542 (comment). @agrare @cben So it's not really an issue, unless you would have some actual foreign_keys using those; then you need to remember to use ^. But this should be visible in the refresh specs that are testing these foreign_keys relationships.
Reformat the record_index using index_by
@cben good point indeed, but in general it's hard to say if you need the store_ids_for_new_records. You would need to manually go through the parser and check what points to what; then you could delete the store_ids_for_new_records if it's not needed. It's easier for the containers, since the saving code is not shared by many different providers. In graph refresh, you can infer this automatically, because you can check if there is an edge that would need the :id. Though right now the :id assignment is basically a noop, there might be other strategies that will be more like it is now. E.g. for super large tables (like CustomAttribute, which quickly goes over >100k), I want to try to generate SQL queries that will update/create thousands of records at once, instead of 1 record at a time. Then we would need this post-indexing, but only if there is an edge that requires one. :-)
Originally, this was used to store only the ids in the hash that were needed by later save methods. Then they would have the id in place for foreign keys, so they wouldn't need to fetch the object... that is, the id gets placed in the hash, and due to hash reference sharing it's just automatically available later. Then people saw a pattern and just started putting store_ids_for_new_records everywhere, even when it's not needed. Conceptually, it should never be an expensive method call, so there was no harm in it. @Ladas do you have measurements on how much time this method contributed to the overall time? It should be trivial to add a Benchmark.realtime_block to add to the refresh timings.
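For reference, a sketch of that measurement using ManageIQ's Benchmark.realtime_block extension (the timing key and placement are illustrative):

```ruby
# realtime_block returns the block's result plus a timings hash,
# which can then be merged into the refresh timings.
result, timings = Benchmark.realtime_block(:store_ids_for_new_records) do
  store_ids_for_new_records(records, hashes, keys)
end

timings # => {:store_ids_for_new_records => 12.34} (seconds, illustrative)
```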
@Fryguy not an exact measurement. Doing the AWS perf tests, eliminating store_ids_for_new_records cut the refresh time by 15-30%. But the perf specs were probably hitting this bug, so it might be much less in the end.
LGTM 👍
This reduced refresh time on a customer system from 22min to 15min
Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups (cherry picked from commit 053c16e)
Fine backport details:
Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups (cherry picked from commit 053c16e) https://bugzilla.redhat.com/show_bug.cgi?id=1436176
Euwe backport details:
Turns out the new code here is a near duplicate of TypedIndex.
@cben yeah, it might be doing a similar thing in a lower number of lines. TypedIndex is a bit complex for what it does (it's not present in the graph refresh).
…ew_records Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups (cherry picked from commit 053c16e) https://bugzilla.redhat.com/show_bug.cgi?id=1441202