Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(counter_filter): Event procces to count events #6302

Merged
merged 2 commits into from
Jul 5, 2023

Conversation

aleksbykov
Copy link
Contributor

@aleksbykov aleksbykov commented Jun 27, 2023

For perf tests with operations, it is required to
collect statisitcs about events: Reactor stall and
sort them by stall duration. Also on next step,
it is required to decode all reactor stalls by operations
New event process is presented and new context manager.
Context manager allow to start/stop count events and
collect some stats
event process allow to filter collected events and
save events to files in specified directory

PR pre-checks (self review)

  • I followed KISS principle and best practices
  • I didn't leave commented-out/debugging code
  • I added the relevant backport labels
  • New configuration option are added and documented (in sdcm/sct_config.py)
  • I have added tests to cover my changes (Infrastructure only - under unit-test/ folder)
  • All new and existing unit tests passed (CI)
  • I have updated the Readme/doc folder accordingly (if needed)

@aleksbykov aleksbykov changed the title Add counter events feature(counter_filter): Event procces to count events Jun 27, 2023
@aleksbykov
Copy link
Contributor Author

Example of report:
Screenshot from 2023-06-27 17-42-42
Screenshot from 2023-06-27 17-42-29

@aleksbykov
Copy link
Contributor Author

Staging job is running

@aleksbykov aleksbykov mentioned this pull request Jun 27, 2023
7 tasks
@aleksbykov
Copy link
Contributor Author

@aleksbykov aleksbykov force-pushed the add-counter-events branch from be6066d to 6225cb7 Compare June 28, 2023 08:37
Copy link
Contributor

@soyacz soyacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it requires special event handler and not just counting on event publish?
Can you also elaborate why EventStatHandler saves events?

temichus
temichus previously approved these changes Jun 28, 2023
@temichus
Copy link
Contributor

Why it requires special event handler and not just counting on event publish?

It can be tons of events, and I do think that we need to count all of them + I like that @aleksbykov implemented it as a separate module and does not touch the existing Events logic, because it is already complicated, and changing it may lead us to more complicated module and bugs

Can you also elaborate why EventStatHandler saves events?

according to the code, for loging purpose + reporting an issue(this is case for REACTOR_STALLED particulary)

Copy link
Contributor

@fgelcer fgelcer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now, 2 little nitpick comments

@@ -163,6 +166,19 @@ <h2>{{ operation }}</h2>
{% endfor %}
{% endif %}
</table>
{% for cycle in results['cycles'] %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: i would add all of this new for-loop after the current line 182 <span STYLE="font-size:12px" class="red">* All latency values are in ms. if latency has color red, check detailed HDR report</span> so we have that little legend on the tables closer to the table

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

#
# See LICENSE for more details.
#
# Copyright (c) 2020 ScyllaDB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick:

Suggested change
# Copyright (c) 2020 ScyllaDB
# Copyright (c) 2023 ScyllaDB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -84,6 +86,7 @@ def stop_events_device(_registry: Optional[EventsProcessesRegistry] = None) -> N
EVENTS_HANDLER_ID,
EVENTS_ANALYZER_ID,
EVENTS_MAIN_DEVICE_ID,
EVENTS_COUNTER_ID,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should go before EVENTS_MAIN_DEVICE_ID, possibly before EVENTS_HANDLER_ID.

if counter_data := self._counter_device.get_counter(self._id):
self._statistics = counter_data.stats
self._counter_device.remove_counter(self._id)
self._counter_device.stop_counter()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of having different counters if on __exit__ we make EventsCounter not to count?
Why do we need different counters? Maybe it would be enough to get initial count value on enter and get diff on exit instead adding different counters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was, that update stats on exit, if EventCounterContextmanager instance was created not in with statement and if it is needed, to get stats periodically:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think start/stop counter is not only redundant in that case, but also may lead to errors when within EventCounterContextManager someone will open another EventCounterContextManager with different counter - then on exit will stop counting for all counters.

I think we could just drop self._start_count Event idea and count only if register contain counters, otherwise, skip any work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think start/stop counter is not only redundant in that case, but also may lead to errors when within EventCounterContextManager someone will open another EventCounterContextManager with different counter - then on exit will stop counting for all counters.

stop counting will happened , only if no any registered context managers will stay in _register. So if some on open and close counterevent_cm, it will not affect on others, and only latest closed counter_cm will stop counting

I think we could just drop self._start_count Event idea and count only if register contain counters, otherwise, skip any work.

It is interesting idea, will check it now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soyacz , removed the _event. Added 2 more unit tests could take a look

@aleksbykov
Copy link
Contributor Author

Why it requires special event handler and not just counting on event publish? Can you also elaborate why EventStatHandler saves events?

Originally we need count only one Reactor stall event and only during specified operation. Specified operation could run several times one by one. This need for performance latency test with operations. But in future list of event could be extended for example with Kernal Stack events. Once operation finished, we don't need to count any events any more. Because number of events could large, to avoid memory overloading by main process, i decided to run such counter in another process, so if anything go wrong not to kill test itself.
Reactor stalls events will be decoded in batch for investigation. That why, they need to be saved to file. But EventsStatHandler don't require that and was use in testing puposes. I will remove it

@aleksbykov aleksbykov force-pushed the add-counter-events branch 2 times, most recently from 3e2c97d to 7f35338 Compare June 30, 2023 09:26
@soyacz
Copy link
Contributor

soyacz commented Jun 30, 2023

Because number of events could large, to avoid memory overloading by main process, i decided to run such counter in another process, so if anything go wrong not to kill test itself.

How events number can overload memory? if we just counting we don't increase memory over time...

@aleksbykov
Copy link
Contributor Author

Because number of events could large, to avoid memory overloading by main process, i decided to run such counter in another process, so if anything go wrong not to kill test itself.

How events number can overload memory? if we just counting we don't increase memory over time...

If we count many events in parallel, also for some events we want additional operations as for Reactor stall we want to parse and collect additional info

roydahan
roydahan previously approved these changes Jul 3, 2023
For perf tests with operations, it is required to
collect statisitcs about events: Reactor stall and
sort them by stall duration. Also on next step,
it is required to decode all reactor stalls by operations

New event process is presented and new context manager.
Context manager allow to start/stop count events and
collect some stats
event process allow to filter collected events and
save events to files in specified directory.
Add event stats to report, collected by new
event counter process
Copy link
Contributor

@soyacz soyacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@soyacz soyacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aleksbykov safe to merge or you want to test it on staging before?

@aleksbykov
Copy link
Contributor Author

@soyacz , i am running 2 staging jobs, will update you after they finished

@roydahan roydahan merged commit ac86ae0 into scylladb:master Jul 5, 2023
@roydahan
Copy link
Contributor

roydahan commented Jul 5, 2023

@aleksbykov it's not cleanly backported to v14.
Can you please check it?

@aleksbykov
Copy link
Contributor Author

@soyacz , i am running 2 staging jobs, will update you after they finished

Bot jobs are passed. Regular longevity-4h(where no call to counter) and latency 650 GB with nemesis are passed

@aleksbykov
Copy link
Contributor Author

@roydahan , the problem happened for unit tests. it is happened, because branch-perf-v14 doesn't have this commit:

commit 9697300188dd5142428574624134048222648ddf
Author: Lukasz Sojka <[email protected]>
Date:   Fri Mar 10 11:32:05 2023 +0100

    feature(adaptive-timeouts): calculate timeouts based on node load

but if try to backport it , then another conflict happened with nemesis.py, if try to backport commit which was not backported for resolving issue in nemesis.py, this will build long chain of commits which were not backported to perf-v14

@roydahan , WDYT, if i prepare new pr explicitly for perf-v14 where resolve the unit-test conflicts?

@roydahan
Copy link
Contributor

roydahan commented Jul 6, 2023 via email

@aleksbykov
Copy link
Contributor Author

@roydahan i created a PR: #6346 directly for perf-v14

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants