Improve metrics saving #15976

Merged 12 commits into ManageIQ:master on Sep 26, 2017

Conversation

@Ladas Ladas commented Sep 15, 2017

Let's optimize metrics saving.

@Ladas Ladas force-pushed the improve_metrics_saving branch from 89d5384 to 54ddd52 Compare September 15, 2017 12:25
@miq-bot miq-bot added the wip label Sep 15, 2017

Ladas commented Sep 15, 2017

@agrare so this is just a super WIP, but the numbers look good

Example of saving 25920 AWS samples (I am changing the generic saving code, so it applies to any metrics), which is 6 days of data, by doing vm.perf_capture('realtime', 6.days.ago.utc, Time.now.utc)

@kbrock @Fryguy ^ check the :process_perfs_db timing: the saving is 20x - 30x faster just by doing simple batching (batch INSERT and batch UPDATE), with more room to optimize it (batch upsert, rules instead of triggers, etc...)

# Create all metrics
[----] I, [2017-09-15T06:43:32.056193 #12860:f89120]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Amazon::CloudManager::Vm name: [ladas_test_40], id: [72], for range [2017-09-09T04:32:20Z - 2017-09-15T04:32:00Z]...Complete - Timings: {:process_counter_values=>0.5581705570220947, 
:db_find_prev_perfs=>0.01391744613647461, 
:process_perfs=>44.54817485809326,
:process_perfs_db=>304.4324731826782, 
:total_time=>363.9644160270691}

# Update all metrics
[----] I, [2017-09-15T06:36:34.663954 #12860:f89120]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Amazon::CloudManager::Vm name: [ladas_test_40], id: [72], for range [2017-09-09T04:27:20Z - 2017-09-15T04:22:00Z]...Complete - Timings: {:process_counter_values=>0.40729641914367676, 
:db_find_prev_perfs=>2.697371244430542, 
:process_perfs=>38.26711559295654, 
:process_perfs_db=>361.15534019470215, 
:total_time=>413.3819372653961}

# Optimized create all metrics
[----] I, [2017-09-15T14:37:51.457336 #15213:633120]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Amazon::CloudManager::Vm name: [ladas_test_40], id: [72], for range [2017-09-09T12:32:20Z - 2017-09-15T12:32:00Z]...Complete - Timings: {:process_counter_values=>0.34055447578430176, 
:process_perfs=>4.472509145736694, 
:process_perfs_db=>11.858701229095459,
:write_multiple=>19.626208066940308, 
:total_time=>26.49343514442444}

# Optimized update all metrics
[----] I, [2017-09-15T14:35:09.783703 #15213:633120]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Amazon::CloudManager::Vm name: [ladas_test_40], id: [72], for range [2017-09-09T12:32:20Z - 2017-09-15T12:27:00Z]...Complete - Timings: {:process_counter_values=>1.3960025310516357, 
:process_perfs=>2.927391529083252, 
:process_perfs_db=>23.891483783721924, 
:write_multiple=>31.282296657562256, 
:total_time=>37.81452202796936}
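
For context, a minimal hypothetical sketch (not the actual ManageIQ code) of the difference driving the :process_perfs_db numbers above: one multi-row INSERT instead of one INSERT plus a transaction per metric. The Sample model, in-memory schema, and column names below are made up for illustration and assume the sqlite3 and activerecord gems are available.

```ruby
require "active_record"

# Throwaway in-memory schema standing in for the metrics table.
ActiveRecord::Base.establish_connection(:adapter => "sqlite3", :database => ":memory:")
ActiveRecord::Schema.define do
  create_table :samples do |t|
    t.string   :counter_name
    t.datetime :timestamp
    t.float    :value
  end
end

class Sample < ActiveRecord::Base; end

# Old style: one INSERT (and one transaction) per metric row.
def save_one_by_one(rows)
  rows.each { |row| Sample.create!(row) }
end

# Batched style: a single multi-row INSERT carries every row;
# a batch UPDATE or upsert follows the same idea.
def save_batched(rows)
  return if rows.empty?
  conn    = Sample.connection
  columns = rows.first.keys
  values  = rows.map { |r| "(#{columns.map { |c| conn.quote(r[c]) }.join(', ')})" }
  conn.execute("INSERT INTO #{Sample.table_name} (#{columns.join(', ')}) VALUES #{values.join(', ')}")
end

# Usage: 1000 fake realtime samples, 20s apart.
rows = 1000.times.map do |i|
  { :counter_name => "cpu_usage_rate_average", :timestamp => Time.now.utc - i * 20, :value => rand * 100 }
end
save_batched(rows)
```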

agrare commented Sep 15, 2017

Hm I hadn't thought of using an InventoryCollection for metrics...

Ladas commented Sep 20, 2017

@agrare so I am thinking that I should add a postgre_batch adapter, so we can easily switch back to the old version?

@Ladas Ladas changed the title from "[WIP] ultra WIP -> Improve metrics saving" to "[WIP] Improve metrics saving" on Sep 20, 2017

agrare commented Sep 20, 2017

@Ladas is that something we could set with an options hash? Idk how much logic is in the adapter, but duplicating a whole adapter for one option sounds like a lot.
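
A hypothetical sketch of that suggestion (the class and option names are assumptions, not the real ActiveMetrics API): a single adapter whose write path is selected by a flag in the options hash instead of shipping a second adapter class.

```ruby
# Hypothetical example only -- not the actual ActiveMetrics adapter.
class PostgresMetricsAdapter
  def initialize(options = {})
    # :batch_writes is an assumed option name; false would restore the per-row behavior.
    @batch_writes = options.fetch(:batch_writes, true)
  end

  def write_multiple(*metrics)
    metrics = metrics.flatten
    @batch_writes ? batch_insert(metrics) : metrics.each { |m| write(m) }
  end

  private

  def batch_insert(metrics)
    # one multi-row INSERT for all metrics (see the batching sketch above)
  end

  def write(metric)
    # single-row INSERT, the legacy behavior
  end
end
```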

@@ -35,6 +35,32 @@ def write(metric)
def write_multiple(*_metrics)
raise NotImplementedError, "must be implemented by the adapter"
end

def transform_parameters(resource, interval_name, _start_time, _end_time, rt_rows)
Member

@Ladas can this stay in perf_process? It seems weird to call into ActiveMetrics then call back out to Metric::Helper.process_derived_columns

Contributor Author

Right, I am extracting it here since the PG adapter will have specific code. Right now we are transforming the data like 4 times from the original data, ending up back in the original format. :-)

perf_process(interval_name, start_range, end_range, counters, counter_values)

# Convert to a format that allows sending multiple resources at once
counters_data = {
Contributor Author

@Ladas Ladas Sep 21, 2017

@agrare @kbrock so something like this can easily go through a queue, carrying multiple resources inside
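
(A hypothetical illustration of a multi-resource payload of that kind; the exact keys and values below are assumptions for illustration, not the hash from the diff above.)

```ruby
# Hypothetical shape: one payload keyed by resource, so a single queued
# perf_process call can carry samples for several VMs at once.
counters_data = {
  ["ManageIQ::Providers::Amazon::CloudManager::Vm", 72] => {
    :counters       => {
      "cpu_usage_rate_average" => {:counter_key => "cpu_usage_rate_average", :unit_key => "percent"},
    },
    :counter_values => {
      "2017-09-15T04:32:00Z" => {"cpu_usage_rate_average" => 12.5},
    },
  },
  ["ManageIQ::Providers::Amazon::CloudManager::Vm", 73] => {
    :counters       => {},
    :counter_values => {},
  },
}
```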

Member

Agreed, this will be perfect for queueing perf_process for the metrics_processor_worker

Contributor Author

cool 👍

For 5.0, I would like to move to the generic Persistor format, so we can reuse inventory Persistors (and using grouping, we should be able to make the generic worker a dedicated worker?)

Member

We are moving over to using the queue before 5.0, so it would be nice to move to the generic Persistor format within the next week or so here

Contributor Author

@kbrock for 4.6, I think we will just keep this format. Generic Persistors are a 5.0 thing, right @agrare?

Member

I think this format is fine since we're going to be invoking perf_process directly in the metrics_processor_worker, not some generic persister.
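
(For reference, a hedged sketch of how that queueing could look. MiqQueue.put is the standard ManageIQ queue call, but the exact arguments, role, and perf_process argument list below are assumptions, not code from this PR.)

```ruby
# Assumed arguments: the method signature is a guess based on the diff context above.
MiqQueue.put(
  :class_name  => vm.class.name,
  :instance_id => vm.id,
  :method_name => "perf_process",
  :args        => [interval_name, start_range, end_range, counters_data],
  :role        => "ems_metrics_processor" # picked up by the metrics processor worker
)
```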

Ladas commented Sep 22, 2017

@agrare @Fryguy @kbrock these are the current time/memory benefits of the optimized code. In a follow-up PR, I am going to bring back the old code as a postgre_legacy adapter, and I'll try to create a spec verifying that the 2 postgres adapters produce exactly the same numbers (plus I'll try to expand the specs a bit; we do not test this as much as I would like).

Benchmark scenario:
Running vm.perf_capture('realtime', 6.days.ago.utc, Time.now.utc), which produces 25905 Metric rows; fetching takes 38s for 1 AWS VM (the perf_capture part is being optimized by @agrare).

| | DB persist time [s] | Total perf_process time [s] | Memory [GB] |
| --- | --- | --- | --- |
| old perf_process create all metrics | 300 | 354 | 1.83 |
| old perf_process update all metrics | 370 | 427 | 2.22 |
| new perf_process create all metrics | 14 | 22 | 0.69 |
| new perf_process update all metrics | 24 | 51 | 0.96 |

The most important metric is "create all metrics", as 95% of metrics are just created, not updated. But we can end up in the "update all metrics" situation if we need to rerun collection for some reason. The scenario uses 6 days of data for 1 VM, but it's the same as 1 day of data for 6 VMs, etc.

Also, it's worth noting that the old code issues an extra ~75k queries (1 per metric, plus a begin/end transaction for each) compared to the new code. On a ~1ms remote DB, that adds up to +75s, so the total time will easily be 20x faster for the new code. :-)

@Ladas Ladas changed the title from "[WIP] Improve metrics saving" to "Improve metrics saving" on Sep 22, 2017

Ladas commented Sep 22, 2017

@miq-bot remove_label wip

@miq-bot miq-bot removed the wip label Sep 22, 2017
@@ -175,6 +175,7 @@ def perf_capture(interval_name, start_time = nil, end_time = nil)

start_range = end_range = counters = counter_values = nil
_, t = Benchmark.realtime_block(:total_time) do
# TODO: why do we call capture here? We call the same in processing.
Member

You mean via vim_performance_state_for_ts? It looks like that's only in the event that we are missing the state:

```ruby
state = vim_performance_states.find_by(:timestamp => ts)
...
state ||= perf_capture_state
```

So this is the primary place it should be called from. Maybe we can get rid of the "backup" in perf_process?

Contributor Author

Yeah, possibly. My tests are a bit weird with this: since I run capture 6 days back, most of the vim_performance_states are missing. So then it does a lot of duplicate queries trying to fetch them, ending up assigning the last vim_performance_state to every ts.

Member

I'm not thrilled with the ts lookup caching. There may be a way to explicitly look them up and populate them. I suspect this code is causing a memory leak that @NickLaMuro is researching. But also, it would be nice to load the correct ts values for @Ladas.

(Sorry @NickLaMuro - I may have misunderstood and unnecessarily roped you in)

Contributor Author

Hmm, is it? I was thinking it could leak memory, but it just stores cached values in an instance variable? (Or is it doing something nasty inside?)

@kbrock right, so I am at least calling preload_vim_performance_state_for_ts_iso8601 so it doesn't do n+1 queries, though it still does n+1 queries if it can't find some hourly state. But yeah, more refactoring can be done here; perf-wise, this is not the bottleneck now.
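
(A rough sketch of the preloading idea, assuming states can be fetched by timestamp in one query; the helper name below is made up, and only vim_performance_states and perf_capture_state come from the code quoted earlier.)

```ruby
# Load every needed state row in one query and serve lookups from a hash,
# so the per-timestamp find_by (the n+1 pattern) disappears for the common case.
def preload_states(resource, timestamps)
  resource.vim_performance_states
          .where(:timestamp => timestamps)
          .index_by(&:timestamp)
end

states = preload_states(vm, rt_rows.keys)
rt_rows.each_key do |ts|
  # The fallback still hits the DB when an hourly state is missing, as noted above.
  state = states[ts] || vm.perf_capture_state
end
```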

kbrock commented Sep 25, 2017

@agrare @Ladas is this good to :shipit: ? How far back can we backport?

Ladas commented Sep 25, 2017

@kbrock for me this is good to go. (I just wanted to add YARD doc, haha)

For a backport: I am using the GraphRefresh batch saving to build the batched query, so we can't really backport unless we backport the whole graph refresh machinery (I'd just copy-paste the whole manager_refresh dir, so it's theoretically doable :-))

@@ -35,6 +35,33 @@ def write(metric)
def write_multiple(*_metrics)
raise NotImplementedError, "must be implemented by the adapter"
end

def transform_parameters(_resources, interval_name, _start_time, _end_time, rt_rows)
Member

@Ladas I still think this belongs in Metric::CiMixin::Processing, not in ActiveMetrics. It is very specific to the rt_rows that we return from Metric::CiMixin::Capture, and it already calls back out to Metric::Processing for process_derived_columns.

Contributor Author

Right, so would you rather do the parsing based on the Adapter class, in the Metric::CiMixin::Processing code? (There is a different transform_parameters for the new postgre adapter.)

Save metrics in batches

miq-bot commented Sep 26, 2017

This pull request is not mergeable. Please rebase and repush.

@Ladas Ladas force-pushed the improve_metrics_saving branch from f422a44 to 8aa20fd Compare September 26, 2017 13:40
Tweak graph refresh to allow timestamp as an index
More effective interface for passing data to pg adapter
Optimize the memory some more
Preload vim_performance_state_for_ts to avoid n+1 queries
Allow perf_process to receive data of multiple resources
Fetch all records for 1 model in 1 query, instead of doing
n+1 queries.
Move transforming parameters back to processing code. Based on
reviews, this doesn't belong in the ActiveMetric adapter code.
@Ladas Ladas force-pushed the improve_metrics_saving branch from 8aa20fd to 7523d62 Compare September 26, 2017 13:45
Fix rubocop issues
@Ladas Ladas force-pushed the improve_metrics_saving branch from 7523d62 to 47fc2e6 Compare September 26, 2017 13:50
Remove clarified TODO
Pack transform_resources! in a helper method

miq-bot commented Sep 26, 2017

Checked commits Ladas/manageiq@c3b649c~...ce5ebca with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
6 files checked, 2 offenses detected

app/models/metric/ci_mixin/capture.rb

app/models/metric/ci_mixin/processing.rb

@agrare agrare left a comment

👍 LGTM
Looking forward to getting rid of the legacy adapter.

agrare added a commit to agrare/manageiq that referenced this pull request Sep 26, 2017
@agrare agrare merged commit ce5ebca into ManageIQ:master Sep 26, 2017
@agrare agrare added this to the Sprint 70 Ending Oct 2, 2017 milestone Sep 26, 2017