feature(counter_filter): Event procces to count events #6302

aleksbykov · 2023-06-27T10:42:11Z

Perf tests with operations need to collect statistics about events,
currently Reactor stall events, and sort them by stall duration. As a next step,
all reactor stalls need to be decoded per operation.
This PR introduces a new event process and a new context manager.
The context manager starts and stops event counting and
collects some stats.
The event process filters the collected events and
saves them to files in a specified directory.
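
A minimal sketch of the context-manager idea described above, not the code from this PR. The names `EventCounter` and `count_events` and the dict-shaped events are hypothetical, and everything runs in one process for brevity, whereas the PR counts in a separate event process.

```python
from collections import Counter
from contextlib import contextmanager


class EventCounter:
    """Hypothetical in-process counter (the PR uses a separate event process)."""

    def __init__(self):
        self._active = {}  # counter name -> Counter of event types

    def start(self, name):
        self._active[name] = Counter()

    def stop(self, name):
        # Return the collected stats and stop counting for this name.
        return dict(self._active.pop(name))

    def publish(self, event):
        # Count the event for every counter that is currently active.
        for stats in self._active.values():
            stats[event["type"]] += 1


@contextmanager
def count_events(counter, name):
    """Count events published while the with-block is running."""
    result = {}
    counter.start(name)
    try:
        yield result
    finally:
        result.update(counter.stop(name))


counter = EventCounter()
counter.publish({"type": "REACTOR_STALLED"})      # not counted: nothing is active yet
with count_events(counter, "decommission") as stats:
    counter.publish({"type": "REACTOR_STALLED"})
    counter.publish({"type": "REACTOR_STALLED"})
    counter.publish({"type": "KERNEL_CALLSTACK"})
counter.publish({"type": "REACTOR_STALLED"})      # not counted: the block has exited
print(stats)  # {'REACTOR_STALLED': 2, 'KERNEL_CALLSTACK': 1}
```

Only the events published inside the `with` block end up in `stats`, which is the start/stop behaviour the description asks for.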

PR pre-checks (self review)

I followed KISS principle and best practices
I didn't leave commented-out/debugging code
I added the relevant backport labels
New configuration options are added and documented (in sdcm/sct_config.py)
I have added tests to cover my changes (Infrastructure only - under unit-test/ folder)
All new and existing unit tests passed (CI)
I have updated the Readme/doc folder accordingly (if needed)

aleksbykov · 2023-06-27T10:45:02Z

Example of report:

aleksbykov · 2023-06-27T10:45:38Z

Staging job is running

aleksbykov · 2023-06-28T08:15:19Z

Staging tests passed: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/scylla-master-perf-regression-latency-650gb-with-nemesis/14, report was sent

soyacz

Why does it require a special event handler instead of just counting on event publish?
Can you also elaborate on why EventStatHandler saves events?

temichus · 2023-06-28T12:25:21Z

> Why does it require a special event handler instead of just counting on event publish?

There can be tons of events, and I do think that we need to count all of them. I also like that @aleksbykov implemented it as a separate module that does not touch the existing Events logic: that logic is already complicated, and changing it may lead to an even more complicated module and to bugs.

> Can you also elaborate on why EventStatHandler saves events?

According to the code, for logging purposes and for reporting an issue (this is the case for REACTOR_STALLED in particular).

fgelcer

for now, 2 little nitpick comments

fgelcer · 2023-06-28T12:12:02Z

sdcm/report_templates/results_latency_during_ops_short.html

@@ -163,6 +166,19 @@ <h2>{{ operation }}</h2>
                            {% endfor %}
                        {% endif %}
                    </table>
+                    {% for cycle in results['cycles'] %}


nitpick: I would move this whole new for-loop to after the current line 182, <span STYLE="font-size:12px" class="red">* All latency values are in ms. if latency has color red, check detailed HDR report</span>, so that the little legend stays close to the tables

fgelcer · 2023-06-28T12:13:04Z

sdcm/sct_events/event_counter.py

+#
+# See LICENSE for more details.
+#
+# Copyright (c) 2020 ScyllaDB


nitpick:

Suggested change

-# Copyright (c) 2020 ScyllaDB
+# Copyright (c) 2023 ScyllaDB

soyacz · 2023-06-28T12:38:37Z

sdcm/sct_events/setup.py

@@ -84,6 +86,7 @@ def stop_events_device(_registry: Optional[EventsProcessesRegistry] = None) -> N
        EVENTS_HANDLER_ID,
        EVENTS_ANALYZER_ID,
        EVENTS_MAIN_DEVICE_ID,
+        EVENTS_COUNTER_ID,


I think it should go before EVENTS_MAIN_DEVICE_ID, possibly before EVENTS_HANDLER_ID.

soyacz · 2023-06-28T13:02:39Z

sdcm/sct_events/event_counter.py

+        if counter_data := self._counter_device.get_counter(self._id):
+            self._statistics = counter_data.stats
+            self._counter_device.remove_counter(self._id)
+        self._counter_device.stop_counter()


What's the point of having different counters if on __exit__ we make EventsCounter stop counting?
Why do we need different counters? Maybe it would be enough to take the initial count value on enter and compute the diff on exit, instead of adding different counters?
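
The alternative suggested in this comment (snapshot on enter, diff on exit) could look roughly like the sketch below. `TotalEventCounter` and `CountDelta` are hypothetical names for illustration and are not part of SCT.

```python
from collections import Counter


class TotalEventCounter:
    """Hypothetical always-on counter keeping one running total per event type."""

    def __init__(self):
        self.totals = Counter()

    def publish(self, event_type):
        self.totals[event_type] += 1


class CountDelta:
    """Context manager that reports only the events seen inside the with-block."""

    def __init__(self, counter):
        self._counter = counter
        self._before = Counter()
        self.stats = {}

    def __enter__(self):
        # Snapshot the totals instead of registering a separate counter.
        self._before = self._counter.totals.copy()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Counter subtraction keeps only the positive differences.
        self.stats = dict(self._counter.totals - self._before)
        return False


events = TotalEventCounter()
events.publish("REACTOR_STALLED")          # happened before any block
with CountDelta(events) as outer:
    events.publish("REACTOR_STALLED")
    with CountDelta(events) as inner:      # nesting needs no shared start/stop state
        events.publish("REACTOR_STALLED")
print(outer.stats, inner.stats)  # {'REACTOR_STALLED': 2} {'REACTOR_STALLED': 1}
```

The trade-off is that the totals are always being updated, even when no context manager is open.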

The idea was to update stats on exit and, if an EventCounterContextManager instance was created outside a with statement, to allow getting stats periodically when needed:

I think start/stop counter is not only redundant in that case, but may also lead to errors: if someone opens another EventCounterContextManager with a different counter inside an EventCounterContextManager, then its exit will stop counting for all counters.

I think we could just drop the self._start_count Event idea and count only if the register contains counters; otherwise, skip any work.

> I think start/stop counter is not only redundant in that case, but may also lead to errors: if someone opens another EventCounterContextManager with a different counter inside an EventCounterContextManager, then its exit will stop counting for all counters.

Counting stops only when no registered context managers remain in _register. So if someone opens and closes a counter context manager, it does not affect the others; only the last counter context manager to close stops counting.

> I think we could just drop the self._start_count Event idea and count only if the register contains counters; otherwise, skip any work.

It is an interesting idea, will check it now.

@soyacz, removed the _event and added 2 more unit tests. Could you take a look?
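
For illustration, the approach the thread settles on (no separate start/stop flag, count only while the register holds counters) might be sketched as follows. `CounterDevice` and `EventCounterCM` are hypothetical stand-ins, not the classes merged in this PR.

```python
from collections import Counter


class CounterDevice:
    """Hypothetical stand-in for the counter process."""

    def __init__(self):
        self._register = {}   # counter id -> Counter of event types
        self.processed = 0    # number of events that were actually processed

    def add_counter(self, counter_id):
        self._register[counter_id] = Counter()

    def remove_counter(self, counter_id):
        return dict(self._register.pop(counter_id, {}))

    def on_event(self, event_type):
        if not self._register:
            return  # nobody is counting: skip any work
        self.processed += 1
        for stats in self._register.values():
            stats[event_type] += 1


class EventCounterCM:
    """Registers a counter on enter and collects its stats on exit."""

    def __init__(self, device, counter_id):
        self._device = device
        self._id = counter_id
        self.stats = {}

    def __enter__(self):
        self._device.add_counter(self._id)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Removing only our own counter leaves the other counters running.
        self.stats = self._device.remove_counter(self._id)
        return False


device = CounterDevice()
device.on_event("REACTOR_STALLED")               # skipped: register is empty
with EventCounterCM(device, "outer") as outer:
    device.on_event("REACTOR_STALLED")
    with EventCounterCM(device, "inner") as inner:
        device.on_event("REACTOR_STALLED")
    device.on_event("REACTOR_STALLED")           # still counted after inner closed
device.on_event("REACTOR_STALLED")               # skipped again
print(outer.stats, inner.stats, device.processed)
# {'REACTOR_STALLED': 3} {'REACTOR_STALLED': 1} 3
```

Closing the inner context manager does not stop counting for the outer one, which addresses the nested-use concern raised above.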

aleksbykov · 2023-06-30T08:36:14Z

> Why does it require a special event handler instead of just counting on event publish? Can you also elaborate on why EventStatHandler saves events?

Originally we needed to count only one event, Reactor stall, and only during a specified operation. The specified operation can run several times, one after another. This is needed for the performance latency test with operations. In the future the list of events could be extended, for example with Kernel Stack events. Once the operation has finished, we don't need to count events any more. Because the number of events could be large, I decided to run the counter in a separate process to avoid overloading the memory of the main process, so that if anything goes wrong it does not kill the test itself.
Reactor stall events will be decoded in batch for investigation. That is why they need to be saved to a file. EventsStatHandler doesn't require that, though; it was used for testing purposes. I will remove it.
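
As a rough illustration of saving filtered events for later batch decoding, here is a sketch under stated assumptions: the helper `save_events`, the one-file-per-event-type layout and the JSON-lines format are all hypothetical and are not taken from the PR.

```python
import json
import tempfile
from pathlib import Path


def save_events(events, event_types, directory):
    """Append matching events to <directory>/<event type>.log, one JSON object per line."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    saved = 0
    for event in events:
        if event["type"] not in event_types:
            continue  # filtered out
        log_file = directory / f"{event['type']}.log"
        with open(log_file, "a", encoding="utf-8") as fobj:
            fobj.write(json.dumps(event) + "\n")
        saved += 1
    return saved


events = [
    {"type": "REACTOR_STALLED", "line": "Reactor stalled for 32 ms on shard 1"},
    {"type": "DATABASE_ERROR", "line": "some unrelated error"},
    {"type": "REACTOR_STALLED", "line": "Reactor stalled for 7 ms on shard 4"},
]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "decommission"   # hypothetical per-operation directory
    saved = save_events(events, {"REACTOR_STALLED"}, target)
    lines = (target / "REACTOR_STALLED.log").read_text(encoding="utf-8").splitlines()
print(saved, len(lines))  # 2 2
```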

soyacz · 2023-06-30T15:17:48Z

> Because the number of events could be large, I decided to run the counter in a separate process to avoid overloading the memory of the main process, so that if anything goes wrong it does not kill the test itself.

How can the number of events overload memory? If we are just counting, memory does not grow over time...

aleksbykov · 2023-07-03T07:35:26Z

> Because the number of events could be large, I decided to run the counter in a separate process to avoid overloading the memory of the main process, so that if anything goes wrong it does not kill the test itself.

> How can the number of events overload memory? If we are just counting, memory does not grow over time...

That can happen if we count many events in parallel. Also, for some events we want additional operations: for Reactor stall, for example, we want to parse and collect additional info.
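
The extra per-event work mentioned here (parsing Reactor stall events and sorting them by stall duration) could be sketched like this. The log-line shape matched by `STALL_RE` and the helper `stall_durations` are assumptions for illustration; the parsing in the PR may differ.

```python
import re

# Assumed log-line shape, for illustration only.
STALL_RE = re.compile(r"Reactor stalled for (\d+) ms")


def stall_durations(lines):
    """Return stall durations in ms, longest first; non-matching lines are ignored."""
    durations = []
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            durations.append(int(match.group(1)))
    return sorted(durations, reverse=True)


log_lines = [
    "Reactor stalled for 12 ms on shard 3",
    "compaction finished",
    "Reactor stalled for 250 ms on shard 0",
    "Reactor stalled for 33 ms on shard 7",
]
durations = stall_durations(log_lines)
summary = {"count": len(durations), "max_ms": max(durations, default=0)}
print(durations, summary)  # [250, 33, 12] {'count': 3, 'max_ms': 250}
```

Keeping the parsed durations, and not only a count, is what makes memory use grow with the number of events.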

soyacz

LGTM

soyacz

@aleksbykov safe to merge or you want to test it on staging before?

aleksbykov · 2023-07-04T11:21:02Z

@soyacz, I am running 2 staging jobs and will update you after they finish

roydahan · 2023-07-05T13:05:39Z

@aleksbykov it's not cleanly backported to v14.
Can you please check it?

aleksbykov · 2023-07-06T07:40:44Z

> @soyacz, I am running 2 staging jobs and will update you after they finish

Both jobs passed: the regular longevity-4h (which makes no calls to the counter) and the latency 650 GB with nemesis.

aleksbykov · 2023-07-06T07:51:56Z

@roydahan, the problem is in the unit tests. It happens because branch-perf-v14 doesn't have this commit:

commit 9697300188dd5142428574624134048222648ddf
Author: Lukasz Sojka <[email protected]>
Date:   Fri Mar 10 11:32:05 2023 +0100

    feature(adaptive-timeouts): calculate timeouts based on node load

But if I try to backport it, another conflict appears in nemesis.py. If I then try to backport the commit that was not backported in order to resolve the issue in nemesis.py, this builds a long chain of commits that were not backported to perf-v14.

@roydahan, WDYT if I prepare a new PR explicitly for perf-v14 that resolves the unit-test conflicts?

roydahan · 2023-07-06T21:08:34Z

Ok, please send a PR directly to v14.

aleksbykov · 2023-07-11T07:24:03Z

@roydahan i created a PR: #6346 directly for perf-v14

aleksbykov added the backport/perf-v14 label Jun 27, 2023

aleksbykov requested review from fruch, temichus, soyacz, roydahan and fgelcer June 27, 2023 10:42

github-actions bot assigned aleksbykov Jun 27, 2023

aleksbykov changed the title ~~Add counter events~~ feature(counter_filter): Event procces to count events Jun 27, 2023

aleksbykov mentioned this pull request Jun 27, 2023

Count reactor stall #6281

Closed

7 tasks

aleksbykov force-pushed the add-counter-events branch from be6066d to 6225cb7 Compare June 28, 2023 08:37

soyacz reviewed Jun 28, 2023

temichus previously approved these changes Jun 28, 2023

fgelcer reviewed Jun 28, 2023

soyacz reviewed Jun 28, 2023

aleksbykov dismissed temichus’s stale review via 3e2c97d June 30, 2023 08:43

aleksbykov force-pushed the add-counter-events branch 2 times, most recently from 3e2c97d to 7f35338 Compare June 30, 2023 09:26

aleksbykov requested review from vponomaryov, soyacz, fgelcer and temichus June 30, 2023 09:26

roydahan previously approved these changes Jul 3, 2023

aleksbykov added 2 commits July 3, 2023 18:53

fix(calculate_latency): Add Event stats to report

f1845b4

Add event stats to report, collected by new event counter process

aleksbykov dismissed roydahan’s stale review via f1845b4 July 3, 2023 11:53

aleksbykov force-pushed the add-counter-events branch from 7f35338 to f1845b4 Compare July 3, 2023 11:53

roydahan approved these changes Jul 3, 2023

soyacz approved these changes Jul 3, 2023

soyacz reviewed Jul 3, 2023

roydahan merged commit ac86ae0 into scylladb:master Jul 5, 2023

soyacz mentioned this pull request Jul 11, 2023

SCT fails with OSError: [Errno 98] Address already in use #6345

Closed

2 tasks

aleksbykov mentioned this pull request Jul 11, 2023

Backport reactor stall counter #6346

Merged

7 tasks

fruch removed the backport/perf-v14 label Oct 17, 2023
