Improve metrics saving #15976

Merged 12 commits into ManageIQ:master on Sep 26, 2017

Conversation

@Ladas Ladas commented Sep 15, 2017

Let's optimize metrics saving.

@Ladas Ladas force-pushed the improve_metrics_saving branch from 89d5384 to 54ddd52 Compare September 15, 2017 12:25
@miq-bot miq-bot added the wip label Sep 15, 2017

Ladas commented Sep 15, 2017

@agrare so this is just a super WIP, but the numbers look good

Example of saving 25920 AWS samples (I am changing the generic saving code, so it applies to any metrics), which is 6 days of data, by doing vm.perf_capture('realtime', 6.days.ago.utc, Time.now.utc)

@kbrock @Fryguy ^ check the :process_perfs_db timing: the saving is 20x - 30x faster just by doing simple batching (batch INSERT and batch UPDATE), with more room to optimize it (batch upsert, rules instead of triggers, etc...)

# Create all metrics
[----] I, [2017-09-15T06:43:32.056193 #12860:f89120]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Amazon::CloudManager::Vm name: [ladas_test_40], id: [72], for range [2017-09-09T04:32:20Z - 2017-09-15T04:32:00Z]...Complete - Timings: {:process_counter_values=>0.5581705570220947, 
:db_find_prev_perfs=>0.01391744613647461, 
:process_perfs=>44.54817485809326,
:process_perfs_db=>304.4324731826782, 
:total_time=>363.9644160270691}

# Update all metrics
[----] I, [2017-09-15T06:36:34.663954 #12860:f89120]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Amazon::CloudManager::Vm name: [ladas_test_40], id: [72], for range [2017-09-09T04:27:20Z - 2017-09-15T04:22:00Z]...Complete - Timings: {:process_counter_values=>0.40729641914367676, 
:db_find_prev_perfs=>2.697371244430542, 
:process_perfs=>38.26711559295654, 
:process_perfs_db=>361.15534019470215, 
:total_time=>413.3819372653961}

# Optimized create all metrics
[----] I, [2017-09-15T14:37:51.457336 #15213:633120]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Amazon::CloudManager::Vm name: [ladas_test_40], id: [72], for range [2017-09-09T12:32:20Z - 2017-09-15T12:32:00Z]...Complete - Timings: {:process_counter_values=>0.34055447578430176, 
:process_perfs=>4.472509145736694, 
:process_perfs_db=>11.858701229095459,
:write_multiple=>19.626208066940308, 
:total_time=>26.49343514442444}

# Optimized update all metrics
[----] I, [2017-09-15T14:35:09.783703 #15213:633120]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Amazon::CloudManager::Vm name: [ladas_test_40], id: [72], for range [2017-09-09T12:32:20Z - 2017-09-15T12:27:00Z]...Complete - Timings: {:process_counter_values=>1.3960025310516357, 
:process_perfs=>2.927391529083252, 
:process_perfs_db=>23.891483783721924, 
:write_multiple=>31.282296657562256, 
:total_time=>37.81452202796936}
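
For context, a minimal hypothetical sketch (not the actual ManageIQ code) of the difference driving the :process_perfs_db numbers above: one multi-row INSERT instead of one INSERT plus a transaction per metric. The Sample model, in-memory schema, and column names below are made up for illustration and assume the sqlite3 and activerecord gems are available.

```ruby
require "active_record"

# Throwaway in-memory schema standing in for the metrics table.
ActiveRecord::Base.establish_connection(:adapter => "sqlite3", :database => ":memory:")
ActiveRecord::Schema.define do
  create_table :samples do |t|
    t.string   :counter_name
    t.datetime :timestamp
    t.float    :value
  end
end

class Sample < ActiveRecord::Base; end

# Old style: one INSERT (and one transaction) per metric row.
def save_one_by_one(rows)
  rows.each { |row| Sample.create!(row) }
end

# Batched style: a single multi-row INSERT carries every row;
# a batch UPDATE or upsert follows the same idea.
def save_batched(rows)
  return if rows.empty?
  conn    = Sample.connection
  columns = rows.first.keys
  values  = rows.map { |r| "(#{columns.map { |c| conn.quote(r[c]) }.join(', ')})" }
  conn.execute("INSERT INTO #{Sample.table_name} (#{columns.join(', ')}) VALUES #{values.join(', ')}")
end

# Usage: 1000 fake realtime samples, 20s apart.
rows = 1000.times.map do |i|
  { :counter_name => "cpu_usage_rate_average", :timestamp => Time.now.utc - i * 20, :value => rand * 100 }
end
save_batched(rows)
```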

agrare commented Sep 15, 2017

Hm I hadn't thought of using an InventoryCollection for metrics...

Ladas commented Sep 20, 2017

@agrare so I am thinking that I should add a postgre_batch adapter, so we can easily switch back to the old version?

@Ladas Ladas changed the title from "[WIP] ultra WIP -> Improve metrics saving" to "[WIP] Improve metrics saving" on Sep 20, 2017

agrare commented Sep 20, 2017

@Ladas is that something we could set with an options hash? Idk how much logic is in the adapter, but duplicating a whole adapter for one option sounds like a lot.
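
A hypothetical sketch of that suggestion (the class and option names are assumptions, not the real ActiveMetrics API): a single adapter whose write path is selected by a flag in the options hash instead of shipping a second adapter class.

```ruby
# Hypothetical example only -- not the actual ActiveMetrics adapter.
class PostgresMetricsAdapter
  def initialize(options = {})
    # :batch_writes is an assumed option name; false would restore the per-row behavior.
    @batch_writes = options.fetch(:batch_writes, true)
  end

  def write_multiple(*metrics)
    metrics = metrics.flatten
    @batch_writes ? batch_insert(metrics) : metrics.each { |m| write(m) }
  end

  private

  def batch_insert(metrics)
    # one multi-row INSERT for all metrics (see the batching sketch above)
  end

  def write(metric)
    # single-row INSERT, the legacy behavior
  end
end
```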

@@ -35,6 +35,32 @@ def write(metric)
def write_multiple(*_metrics)
raise NotImplementedError, "must be implemented by the adapter"
end

def transform_parameters(resource, interval_name, _start_time, _end_time, rt_rows)
Member

@Ladas can this stay in perf_process? It seems weird to call into ActiveMetrics then call back out to Metric::Helper.process_derived_columns

Contributor Author

Right, I am extracting it here since the PG adapter will have specific code. Right now we are transforming the data like 4 times from the original data, ending up back in the original format. :-)

perf_process(interval_name, start_range, end_range, counters, counter_values)

# Convert to a format that allows sending multiple resources at once
counters_data = {
Contributor Author

@Ladas Ladas Sep 21, 2017

@agrare @kbrock so something like this can easily go through a queue, carrying multiple resources inside
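
(A hypothetical illustration of a multi-resource payload of that kind; the exact keys and values below are assumptions for illustration, not the hash from the diff above.)

```ruby
# Hypothetical shape: one payload keyed by resource, so a single queued
# perf_process call can carry samples for several VMs at once.
counters_data = {
  ["ManageIQ::Providers::Amazon::CloudManager::Vm", 72] => {
    :counters       => {
      "cpu_usage_rate_average" => {:counter_key => "cpu_usage_rate_average", :unit_key => "percent"},
    },
    :counter_values => {
      "2017-09-15T04:32:00Z" => {"cpu_usage_rate_average" => 12.5},
    },
  },
  ["ManageIQ::Providers::Amazon::CloudManager::Vm", 73] => {
    :counters       => {},
    :counter_values => {},
  },
}
```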

Member

Agreed, this will be perfect for queueing perf_process for the metrics_processor_worker

Contributor Author

cool 👍

For 5.0, I would like to move to the generic Persistor format, so we can reuse inventory Persistors (and using grouping, we should be able to make the generic worker a dedicated worker?)

Member

We are moving over to using the queue before 5.0, so it would be nice to move to the generic Persistor format within the next week or so here

Contributor Author

@kbrock for 4.6, I think we will just keep this format. Generic Persistors are a 5.0 thing, right @agrare?

Member

I think this format is fine since we're going to be invoking perf_process directly in the metrics_processor_worker, not some generic persister.
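
(For reference, a hedged sketch of how that queueing could look. MiqQueue.put is the standard ManageIQ queue call, but the exact arguments, role, and perf_process argument list below are assumptions, not code from this PR.)

```ruby
# Assumed arguments: the method signature is a guess based on the diff context above.
MiqQueue.put(
  :class_name  => vm.class.name,
  :instance_id => vm.id,
  :method_name => "perf_process",
  :args        => [interval_name, start_range, end_range, counters_data],
  :role        => "ems_metrics_processor" # picked up by the metrics processor worker
)
```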

Ladas commented Sep 22, 2017

@agrare @Fryguy @kbrock these are the current time/memory benefits of the optimized code. In a follow-up PR, I am going to bring back the old code as a postgre_legacy adapter, and I'll try to create a spec verifying that the 2 postgres adapters produce exactly the same numbers (plus I'll try to expand the specs a bit; we do not test this as much as I would like).

Benchmark scenario:
Running vm.perf_capture('realtime', 6.days.ago.utc, Time.now.utc), which produces 25905 Metric rows; fetching takes 38s for 1 AWS VM (the perf_capture part is being optimized by @agrare).

| | DB persist time [s] | Total perf_process time [s] | Memory [GB] |
| --- | --- | --- | --- |
| old perf_process create all metrics | 300 | 354 | 1.83 |
| old perf_process update all metrics | 370 | 427 | 2.22 |
| new perf_process create all metrics | 14 | 22 | 0.69 |
| new perf_process update all metrics | 24 | 51 | 0.96 |

The most important metric is "create all metrics", as 95% of metrics are just created, not updated. But we can end up in the "update all metrics" situation if we need to rerun collection for some reason. The scenario uses 6 days of data for 1 VM, but it's the same as 1 day of data for 6 VMs, etc.

Also, it's worth noting that the old code issues an extra ~75k queries (1 per metric, plus a begin/end transaction for each) compared to the new code. On a ~1ms remote DB, that adds up to +75s, so the total time will easily be 20x faster for the new code. :-)

@Ladas Ladas changed the title from "[WIP] Improve metrics saving" to "Improve metrics saving" on Sep 22, 2017

Ladas commented Sep 22, 2017

@miq-bot remove_label wip

@miq-bot miq-bot removed the wip label Sep 22, 2017
@@ -175,6 +175,7 @@ def perf_capture(interval_name, start_time = nil, end_time = nil)

start_range = end_range = counters = counter_values = nil
_, t = Benchmark.realtime_block(:total_time) do
# TODO: why do we call capture here? We call the same in processing.
Member

You mean via vim_performance_state_for_ts? It looks like that's only in the event that we are missing the state:

```ruby
state = vim_performance_states.find_by(:timestamp => ts)
...
state ||= perf_capture_state
```

So this is the primary place it should be called from. Maybe we can get rid of the "backup" in perf_process?

Contributor Author

Yeah, possibly. My tests are a bit weird with this: since I run capture 6 days back, most of the vim_performance_states are missing. So then it does a lot of duplicate queries trying to fetch them, ending up assigning the last vim_performance_state to every ts.

Member

I'm not thrilled with the ts lookup caching. There may be a way to explicitly look them up and populate them. I suspect this code is causing a memory leak that @NickLaMuro is researching. But also, it would be nice to load the correct ts values for @Ladas.

(Sorry @NickLaMuro - I may have misunderstood and unnecessarily roped you in)

Contributor Author

Hmm, is it? I was thinking it could leak memory, but it just stores cached values in an instance variable? (Or is it doing something nasty inside?)

@kbrock right, so I am at least calling preload_vim_performance_state_for_ts_iso8601 so it doesn't do n+1 queries, though it still does n+1 queries if it can't find some hourly state. But yeah, more refactoring can be done here; perf-wise, this is not the bottleneck now.
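
(A rough sketch of the preloading idea, assuming states can be fetched by timestamp in one query; the helper name below is made up, and only vim_performance_states and perf_capture_state come from the code quoted earlier.)

```ruby
# Load every needed state row in one query and serve lookups from a hash,
# so the per-timestamp find_by (the n+1 pattern) disappears for the common case.
def preload_states(resource, timestamps)
  resource.vim_performance_states
          .where(:timestamp => timestamps)
          .index_by(&:timestamp)
end

states = preload_states(vm, rt_rows.keys)
rt_rows.each_key do |ts|
  # The fallback still hits the DB when an hourly state is missing, as noted above.
  state = states[ts] || vm.perf_capture_state
end
```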

kbrock commented Sep 25, 2017

@agrare @Ladas is this good to :shipit: ? How far back can we backport?

Ladas commented Sep 25, 2017

@kbrock for me this is good to go. (I just wanted to add YARD doc, haha)

For a backport: I am using the GraphRefresh batch saving to build the batched query, so we can't really backport unless we backport the whole graph refresh machinery (I'd just copy-paste the whole manager_refresh dir, so it's theoretically doable :-))

@@ -35,6 +35,33 @@ def write(metric)
def write_multiple(*_metrics)
raise NotImplementedError, "must be implemented by the adapter"
end

def transform_parameters(_resources, interval_name, _start_time, _end_time, rt_rows)
Member

@Ladas I still think this belongs in Metric::CiMixin::Processing, not in ActiveMetrics. It is very specific to the rt_rows that we return from Metric::CiMixin::Capture, and it already calls back out to Metric::Processing for process_derived_columns.

Contributor Author

Right, so would you rather do the parsing based on the Adapter class, in the Metric::CiMixin::Processing code? (There is a different transform_parameters for the new postgre adapter.)

Save metrics in batches

miq-bot commented Sep 26, 2017

This pull request is not mergeable. Please rebase and repush.

@Ladas Ladas force-pushed the improve_metrics_saving branch from f422a44 to 8aa20fd Compare September 26, 2017 13:40
Tweak graph refresh to allow timestamp as an index
More effective interface for passing data to pg adapter
Optimize the memory some more
Preload vim_performance_state_for_ts to avoid n+1 queries
Allow perf_process to receive data of multiple resources
Fetch all records for 1 model in 1 query, instead of doing
n+1 queries.
Move transforming parameters back to processing code. Based on
reviews, this doesn't belong in the ActiveMetric adapter code.
@Ladas Ladas force-pushed the improve_metrics_saving branch from 8aa20fd to 7523d62 Compare September 26, 2017 13:45
Fix rubocop issues
@Ladas Ladas force-pushed the improve_metrics_saving branch from 7523d62 to 47fc2e6 Compare September 26, 2017 13:50
Remove clarified TODO
Pack transform_resources! in a helper method

miq-bot commented Sep 26, 2017

Checked commits Ladas/manageiq@c3b649c~...ce5ebca with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
6 files checked, 2 offenses detected

app/models/metric/ci_mixin/capture.rb

app/models/metric/ci_mixin/processing.rb

@agrare agrare left a comment

👍 LGTM
Looking forward to getting rid of the legacy adapter.

agrare added a commit to agrare/manageiq that referenced this pull request Sep 26, 2017
@agrare agrare merged commit ce5ebca into ManageIQ:master Sep 26, 2017
@agrare agrare added this to the Sprint 70 Ending Oct 2, 2017 milestone Sep 26, 2017