Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split metric collections into smaller intervals #14332

Merged
merged 6 commits into from
Mar 21, 2017

Conversation

blomquisg
Copy link
Member

@blomquisg blomquisg commented Mar 14, 2017

When a target's last_perf_capture_on is far into the past, the next attempt to capture metrics for that target will result in an attempt to gather a potentially huge amount of metrics at one time.

This change will break down the intervals between last_perf_capture_on and today into individual days. This is the same logic that's used when collecting gap metrics.

Links

https://bugzilla.redhat.com/show_bug.cgi?id=1429593
https://bugzilla.redhat.com/show_bug.cgi?id=1432719

Steps for Testing/QA

  1. Take any active VM and manually change the last_perf_capture_on field to some time in the past.
  2. Collect metrics for that VM

The expected result is the MiqQueue should receive N number of queue entries, 1 for each full or partial day since last_perf_capture_on.

/cc @simon3z

items << [interval_name]
items[0] << start_time << end_time unless start_time.nil?

start_time = last_perf_capture_on unless start_time
Copy link
Member Author

@blomquisg blomquisg Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The side effect of this is that it's still possible to call target.perf_capture("realtime") on a target with a very old last_perf_capture_on and end up with the bad behavior of attempting to grab many many days of metrics in one go.

This PR's change only helps if the caller uses the queue to queue up the perf capture.

This leads me to think that we need to refactor the metrics capture system past the perf_capture_queue method to always insist that the start_date and end_date are provided. That way, either you know what you're doing from the caller's standpoint, or, you queue up the call and let the queuing mechanism handle the break down of date intervals.

In fact, even trying to catch the problem at the perf_capture level leaves the individual provider implementations with the ability to interpret what it means to have a missing start_date or end_date. It feels very dirty allowing each layer of the metrics capture code make those assumptions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was sort of relying on the fact that on vmware, if you ask for anything older than 4 hours for realtime, you are just going to get the last 4 hours. I think it's ok to split it up, because then you should just get a bunch of no-op collections.

This change only helps if the caller uses the queue to queue up the perf capture.

Is there any place that doesn't do this?

So, are you saying that the provider-specific part of perf_capture should be responsible for determining what to do with the timestamps given to them? If so, I thought that's what the code already does (aside from the gap collection).

One other point to note is that for historical collection, we intentionally collected the hours in reverse (that's the .unshift in the original). This was because historical is done in a lower priority. So, it's possible data could come in more slowly and we wanted the charts to grow backwards instead of growing forward but from a time in the way past (if that makes sense). This code undoes that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change only helps if the caller uses the queue to queue up the perf capture.
Is there any place that doesn't do this?

I believe all of the important parts of the code queue today, but there's still some method name building that's hard to track down.

So, are you saying that the provider-specific part of perf_capture should be responsible for determining what to do with the timestamps given to them? If so, I thought that's what the code already does (aside from the gap collection).

No, I'm saying that there are provider-specific implementations that still expect that the caller can send in nil for start/end dates. And, I think the interface should be more strict to assume that the caller always provides a valid start/end date. Instead of allowing the assumption of what a nil start/end date means permeate throughout the metrics capture code. In the end, if no one is calling specific implementations with nil start/end date, then it's wasted logic. And, if people are calling specific implementations with nil start/end date, then it's a pattern of deviation that will show up as we consolidate more things into the provider platform arena.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other point to note is that for historical collection, we intentionally collected the hours in reverse (that's the .unshift in the original). This was because historical is done in a lower priority. So, it's possible data could come in more slowly and we wanted the charts to grow backwards instead of growing forward but from a time in the way past (if that makes sense). This code undoes that.

D'oh, I'll go over that one more time, then. I think I can just reverse what I've got now with the steps and each_cons bits.

@agrare agrare requested a review from jameswnl March 14, 2017 20:33
@Fryguy
Copy link
Member

Fryguy commented Mar 14, 2017

If this is the same logic as gap collection, can we then use the same code from gap collection to keep it DRY?

Nevermind...looks like you did. I just misread the code.

intervals = []
start_hour = start_time
while start_hour != end_time
#TODO: replace 1.day with a configurable threshold defined by provider type
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh ... follow-up PR if anything.

It's possible that some providers will gather so much data every hour that attempting to even get just one day would be too much data. Dunno.

intervals = []
start_hour = start_time
while start_hour != end_time
#TODO: replace 1.day with a configurable threshold defined by provider type
Copy link
Member

@Fryguy Fryguy Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make it a parameter?

@blomquisg
Copy link
Member Author

If this is the same logic as gap collection, can we then use the same code from gap collection to keep it DRY?

Yeah, it's the same logic ... and it's DRYish. The "historical" internal_name is equivalent to gap collection. And, this logic was ripped right from that and pulled into its own method.

end_hour = end_time if end_hour > end_time
intervals << [start_hour, end_hour]
start_hour = end_hour
end
Copy link
Member

@Fryguy Fryguy Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels very manual. I think you can do this with

(start_time..end_time).step_value(1.day)

If you definitely need them in start/end pairs, just tack on a each_cons(2)

(start_time..end_time).step_value(1.day).each_cons(2)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also feels very "copy'n'paste" 😄

I can update it to something prettier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL I didn't see the original, which was why I thought you were writing something new. Heh, yeah, now that we have step_value method, it's a LOT nicer :)

@@ -36,6 +36,16 @@ def expected_stats_period_end
parse_datetime('2013-08-28T12:41:40Z')
end

context "#perf_capture_queue" do
it "splits up long perf_capture durations for old last_perf_capture_on" do
# there shoud be 10 days + a partial day, or 11 queue items
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment typo shoud -> should

def split_capture_intervals(start_time, end_time)
intervals = []
start_hour = start_time
while start_hour != end_time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any validation on the end_time? Looks like this could loop infinitely if start_time and end_time aren't aligned

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this should definitely be a <. Or just use step_value as mentioned above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

step_value ftw

@blomquisg blomquisg force-pushed the splitting_gap_collection branch 2 times, most recently from 406d016 to 9e06dbe Compare March 14, 2017 21:46
# [[2017-01-03 12:00:00 UTC, 2017-01-04 12:00:00 UTC],
# [2017-01-02 12:00:00 UTC, 2017-01-03 12:00:00 UTC],
# [2017-01-01 12:00:00 UTC, 2017-01-02 12:00:00 UTC]]
(start_time..end_time).step_value(threshold).each_cons(2).collect { |s_time, e_time| [s_time, e_time] }.reverse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the collect bit necessary? .each_cons already returns that format, so curious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tried it, each_cons returned a single array:

> (1..20).step_value(2).each_cons(2)
=> #<Enumerator: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 20]:each_cons(2)>

Whereas the collect will group them into the tuples:

> (1..20).step_value(2).each_cons(2).collect {|s, e| [s, e]}
=> [[1, 3], [3, 5], [5, 7], [7, 9], [9, 11], [11, 13], [13, 15], [15, 17], [17, 19], [19, 20]]

And, reverse gets the ordering correct:

> (1..20).step_value(2).each_cons(2).collect {|s, e| [s, e]}.reverse
=> [[19, 20], [17, 19], [15, 17], [13, 15], [11, 13], [9, 11], [7, 9], [5, 7], [3, 5], [1, 3]]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I could do is pass in the interval_name as an argument and forego the extra call to map below.

Then, the collect call here really is manipulating the tuples.

Copy link
Member

@Fryguy Fryguy Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you read it wrong...The first param is returning an Enumerator over the set, but it hadn't yet applied the each_cons. Calling to_a shows that each_cons by itself works.

> (1..20).step_value(2).each_cons(2).to_a
=> [[1, 3], [3, 5], [5, 7], [7, 9], [9, 11], [11, 13], [13, 15], [15, 17], [17, 19], [19, 20]]

However, TIL you can't call .reverse on a Enumerator directly. Even so, you can call it on .to_a

> (1..20).step_value(2).each_cons(2).to_a.reverse
=> [[19, 20], [17, 19], [15, 17], [13, 15], [11, 13], [9, 11], [7, 9], [5, 7], [3, 5], [1, 3]]

Copy link
Member

@Fryguy Fryguy Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To your point though, if you pass in interval_name, then you just call map directly on each_cons.

> (1..20).step_value(2).each_cons(2).map { |t1, t2| [interval_name, t1, t2] }.reverse
=> [["realtime", 19, 20],
 ["realtime", 17, 19],
 ["realtime", 15, 17],
 ["realtime", 13, 15],
 ["realtime", 11, 13],
 ["realtime", 9, 11],
 ["realtime", 7, 9],
 ["realtime", 5, 7],
 ["realtime", 3, 5],
 ["realtime", 1, 3]]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly this! 😄

Done.

cb = {:class_name => self.class.name, :instance_id => id, :method_name => :perf_capture_callback, :args => [[task_id]]} if task_id
end

# Determine what items we should be queuing up
items = [interval_name]
items = split_capture_intervals(start_time, end_time).map do |start_and_end_time|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this will be cleaner if instead of a single start_and_end_time, you have two separate, st and et variables, maybe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done by virtue of removing this line

original_queue_count = MiqQueue.count
@vm.last_perf_capture_on = 10.days.ago - 5.hours - 23.minutes
@vm.perf_capture_queue("realtime")
expect(MiqQueue.count - original_queue_count).to eq 11
Copy link
Member

@Fryguy Fryguy Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to check some candidate time values and also to see if they are in reverse. You can use Timecop.freeze if you need to ensure specific time values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test added

When a target's `last_perf_capture_on` is far into the past, the next
attempt to capture metrics for that target will result in an attempt to
gather a potentially huge amount of metrics at one time.

This change will break down the intervals between `last_perf_capture_on`
and `today` into individual days.  This is the same logic that's used
when collecting gap metrics.

https://bugzilla.redhat.com/show_bug.cgi?id=1429593
Remove a redundant call to `map`, and add a test to make sure that the
dates in metrics collection are in a properly reversed order.

https://bugzilla.redhat.com/show_bug.cgi?id=1429593
@blomquisg blomquisg force-pushed the splitting_gap_collection branch from 9e06dbe to f874607 Compare March 17, 2017 15:26
items[0] << start_time << end_time unless start_time.nil?

start_time = last_perf_capture_on unless start_time
end_time = Time.now.utc unless end_time
cb = {:class_name => self.class.name, :instance_id => id, :method_name => :perf_capture_callback, :args => [[task_id]]} if task_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this cause a problem if we have multiple queue items calling the same callback now?

I can't really understand the code on line 79 below.

Copy link
Member Author

@blomquisg blomquisg Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


NM, I misread.

Lemme look through that callback to see how that's being handled.

context "#perf_capture_queue" do
it "splits up long perf_capture durations for old last_perf_capture_on" do
# there should be 10 days + a partial day, or 11 queue items
Timecop.freeze
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer Timecop.freeze with a block. This will auto-return at the end even in the case of an exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I saw that and changed it in another update ... testing that out locally and I can update in a few minutes

it "splits up long perf_capture durations for old last_perf_capture_on" do
# there should be 10 days + a partial day, or 11 queue items
Timecop.freeze
original_queue_count = MiqQueue.count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternately, just whack the queue before you start, though I'm very surprised to see queue items already in the queue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So was I. It's because we queue up stop_event_monitor when we create OpenStack EMSes. And, it just so happens that all this testing is relying on OpenStack mocks.

I thought about whacking it, but didn't wanna impact anything else. I'm sure no one is relying on those existing queue messages, but ¯\_(ツ)_/¯.

queue_items = MiqQueue.peek(:conditions => {:zone => @vm.my_zone,
:queue_name => @vm.queue_name_for_metrics_collection,
:role => "ems_metrics_collector"},
:limit => 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you whack the queue items from the beginning, then you don't need to use peek or anything complicated here. Just MiqQueue.order(:id) should be sufficient to get you the items in order so that you can traverse over them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

queue_items.each do |q_item|
q_start_time, q_end_time = q_item.args
expect(q_start_time).to eq start_time
expect(q_end_time).to eq end_time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to use be_same_time_as for the times here for start and end time. It's possible that since we are serializing that there could be a loss of precision which would given sporadic test failures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger

# [interval_name, 2017-01-01 12:00:00 UTC, 2017-01-02 12:00:00 UTC]]
start_time = start_time.utc
end_time = end_time.utc
(start_time..end_time).step_value(threshold).each_cons(2).collect do |s_time, e_time|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but I think you can collapse out the .utc calls into the range...

(start_time.utc..end_time.utc).step_value(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, wilco

Ensure that perf_capture_queue works on long intervals and single days.
@miq-bot
Copy link
Member

miq-bot commented Mar 20, 2017

Checked commits blomquisg/manageiq@5571320~...7bd3090 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks good. 🍰

@blomquisg
Copy link
Member Author

I can't reproduce the same spec failure locally with the same seed. I'll close/reopen and see if that helps. If not, I'll rebase on master and see what I'm up against there.

@blomquisg blomquisg closed this Mar 20, 2017
@blomquisg blomquisg reopened this Mar 20, 2017
@blomquisg
Copy link
Member Author

FWIW, I rebased locally and tried the same seed and couldn't get the same error on spec/models/metric_spec.rb:333.

@Fryguy Fryguy merged commit 660e3a3 into ManageIQ:master Mar 21, 2017
@Fryguy Fryguy added this to the Sprint 57 Ending Mar 27, 2017 milestone Mar 21, 2017
simaishi pushed a commit that referenced this pull request Mar 23, 2017
@simaishi
Copy link
Contributor

Euwe backport details:

$ git log -1
commit 41721b42d58bb5a50aee449bbda8d57e71cfe4eb
Author: Jason Frey <[email protected]>
Date:   Tue Mar 21 16:11:54 2017 -0400

    Merge pull request #14332 from blomquisg/splitting_gap_collection
    
    Split metric collections into smaller intervals
    (cherry picked from commit 660e3a36f6d5655e8173065ad14a34ab9ef31207)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1435278

blomquisg added a commit to blomquisg/manageiq that referenced this pull request Apr 4, 2017
Previously, if the `last_perf_capture_on` value was very very old, MIQ
would attempt to gather all of the metrics from `last_perf_capture_on`
until now.

ManageIQ#14332 introduced the concept
of grouping the historical metrics by days, so the collector worker
would not try to gather *ALL* of the metrics at once.

This change further restricts the metrics by favoring the more recent
date between:

 * `last_perf_capture_on`
 * Metric::Capture.historical_start_time

If `last_perf_capture_on` is too far in the past,
`Metric::Capture.historical_start_time` will be a forceful override.
agrare pushed a commit to agrare/manageiq that referenced this pull request Apr 19, 2017
…tion

Split metric collections into smaller intervals
(cherry picked from commit 660e3a3)

https://bugzilla.redhat.com/show_bug.cgi?id=1435278
@blomquisg blomquisg deleted the splitting_gap_collection branch November 28, 2017 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants