feature(counter_filter): Event procces to count events #6302
Staging job is running
Staging tests passed: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/scylla-master-perf-regression-latency-650gb-with-nemesis/14, report was sent
be6066d to 6225cb7
Why does it require a special event handler instead of just counting on event publish?
Can you also elaborate on why EventStatHandler saves events?
There can be tons of events, and I do think that we need to count all of them. I also like that @aleksbykov implemented it as a separate module that does not touch the existing Events logic: that logic is already complicated, and changing it could lead to an even more complicated module and bugs.
According to the code, for logging purposes and for reporting an issue (this is the case for REACTOR_STALLED in particular).
for now, 2 little nitpick comments
@@ -163,6 +166,19 @@ <h2>{{ operation }}</h2>
{% endfor %}
{% endif %}
</table>
{% for cycle in results['cycles'] %}
nitpick: I would add this whole new for-loop after the current line 182 <span STYLE="font-size:12px" class="red">* All latency values are in ms. if latency has color red, check detailed HDR report</span>
so that the little legend for the tables stays closer to the table
Done
sdcm/sct_events/event_counter.py
Outdated
#
# See LICENSE for more details.
#
# Copyright (c) 2020 ScyllaDB
nitpick:
- # Copyright (c) 2020 ScyllaDB
+ # Copyright (c) 2023 ScyllaDB
done
sdcm/sct_events/setup.py
Outdated
@@ -84,6 +86,7 @@ def stop_events_device(_registry: Optional[EventsProcessesRegistry] = None) -> N
EVENTS_HANDLER_ID,
EVENTS_ANALYZER_ID,
EVENTS_MAIN_DEVICE_ID,
EVENTS_COUNTER_ID,
I think it should go before EVENTS_MAIN_DEVICE_ID, possibly before EVENTS_HANDLER_ID.
sdcm/sct_events/event_counter.py
Outdated
if counter_data := self._counter_device.get_counter(self._id):
    self._statistics = counter_data.stats
self._counter_device.remove_counter(self._id)
self._counter_device.stop_counter()
What's the point of having different counters if on __exit__ we make EventsCounter stop counting?
Why do we need different counters? Maybe it would be enough to get the initial count value on enter and take the diff on exit instead of adding different counters?
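The alternative raised in this comment (read the count on enter, take the difference on exit) could look roughly like the sketch below. It is only an illustration: `GlobalEventTally` and `CountDiff` are hypothetical names that do not exist in the PR, and the sketch assumes a single always-on tally that every published event increments.

```python
from collections import Counter


class GlobalEventTally:
    """Hypothetical always-on tally of published events, keyed by event name."""

    def __init__(self):
        self._counts = Counter()

    def publish(self, event_name: str) -> None:
        # Called once per published event.
        self._counts[event_name] += 1

    def snapshot(self) -> Counter:
        # Return a copy so later events do not change an earlier snapshot.
        return Counter(self._counts)


class CountDiff:
    """Hypothetical context manager: stats = tally on exit minus tally on enter."""

    def __init__(self, tally: GlobalEventTally):
        self._tally = tally
        self._start = Counter()
        self.stats = Counter()

    def __enter__(self):
        self._start = self._tally.snapshot()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Counter subtraction keeps only positive differences,
        # which is fine here because counts never decrease.
        self.stats = self._tally.snapshot() - self._start
        return False
```

With this shape, nested or overlapping `with` blocks do not interfere with each other, at the cost of counting every event all the time.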
The idea was to update the stats on exit and, if an EventCounterContextManager instance is created outside a with statement, to allow getting the stats periodically when needed:
I think start/stop counter is not only redundant in that case, but may also lead to errors: if someone opens another EventCounterContextManager with a different counter inside an EventCounterContextManager, then exiting it will stop counting for all counters.
I think we could just drop the self._start_count Event idea and count only if the register contains counters; otherwise, skip any work.
> I think start/stop counter is not only redundant in that case, but may also lead to errors: if someone opens another EventCounterContextManager with a different counter inside an EventCounterContextManager, then exiting it will stop counting for all counters.
Counting stops only when no registered context managers remain in _register. So if someone opens and closes a counter context manager, it does not affect the others; only the last counter context manager to close stops the counting.
> I think we could just drop the self._start_count Event idea and count only if the register contains counters; otherwise, skip any work.
It is an interesting idea, I will check it now.
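To make the register-based idea from this exchange concrete, here is a minimal single-process sketch. It is not the PR's implementation: `CounterRegistry`, `EventCounter`, and `on_event` are hypothetical names, and the real counter runs in a separate events process. The point is only that work is skipped while the register is empty and that closing one counter leaves the others running.

```python
class CounterRegistry:
    """Hypothetical stand-in for the counter process's register."""

    def __init__(self):
        self._register = {}  # counter id -> {"names": ..., "count": ...}

    def add_counter(self, counter_id, event_names):
        self._register[counter_id] = {"names": frozenset(event_names), "count": 0}

    def remove_counter(self, counter_id):
        # Unregister the counter and hand back its final count.
        return self._register.pop(counter_id)["count"]

    def on_event(self, event_name):
        if not self._register:
            return False  # no counters registered: skip any work
        for entry in self._register.values():
            if event_name in entry["names"]:
                entry["count"] += 1
        return True


class EventCounter:
    """Hypothetical context manager that registers itself on enter."""

    def __init__(self, registry, counter_id, event_names):
        self._registry = registry
        self._id = counter_id
        self._event_names = event_names
        self.count = 0

    def __enter__(self):
        self._registry.add_counter(self._id, self._event_names)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Only this counter is removed; other registered counters keep counting.
        self.count = self._registry.remove_counter(self._id)
        return False
```

No explicit start/stop flag is needed in this shape: the register being empty or non-empty plays that role.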
@soyacz, removed the _event. Added 2 more unit tests; could you take a look?
Originally we needed to count only one event, Reactor stall, and only during a specified operation. The specified operation can run several times, one after another. This is needed for the performance latency test with operations. In the future, though, the list of events could be extended, for example with Kernel Stack events. Once the operation has finished, we don't need to count any events any more. Because the number of events could be large, I decided to run the counter in a separate process to avoid overloading the memory of the main process, so that if anything goes wrong it does not kill the test itself.
3e2c97d to 7f35338
How can the number of events overload memory? If we are just counting, memory doesn't grow over time...
It can if we count many events in parallel. Also, for some events we want additional operations; for Reactor stall, for example, we want to parse and collect additional info.
For perf tests with operations, it is required to collect statistics about events: count Reactor stall events and sort them by stall duration. As a next step, it is also required to decode all reactor stalls by operation. A new event process and a new context manager are introduced. The context manager allows starting and stopping event counting and collecting some stats. The event process allows filtering collected events and saving them to files in a specified directory.
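As an illustration of "sort them by stall duration", the sketch below extracts durations from log lines and groups them into buckets. Everything here is an assumption made for the example: the line format ("Reactor stalled for N ms"), the bucket limits in `BUCKETS_MS`, and the helper names are hypothetical and are not taken from the PR.

```python
import re
from collections import Counter

# Assumed line format for the example; adjust to the real event text.
STALL_RE = re.compile(r"Reactor stalled for (\d+) ms")
BUCKETS_MS = (10, 100, 1000)  # hypothetical bucket upper limits, in ms


def stall_bucket(duration_ms: int) -> str:
    """Return the label of the first bucket whose limit covers the duration."""
    for limit in BUCKETS_MS:
        if duration_ms <= limit:
            return f"<={limit}ms"
    return f">{BUCKETS_MS[-1]}ms"


def summarize_stalls(lines):
    """Return (durations sorted longest first, count of stalls per bucket)."""
    durations = []
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            durations.append(int(match.group(1)))
    durations.sort(reverse=True)
    return durations, Counter(stall_bucket(d) for d in durations)
```

Keeping only the parsed duration per event, rather than the whole event, is one way to keep the memory cost of the stats small.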
Add event stats to report, collected by new event counter process
7f35338 to f1845b4
LGTM
@aleksbykov safe to merge, or do you want to test it on staging first?
@soyacz, I am running 2 staging jobs; I will update you after they finish.
@aleksbykov it's not cleanly backported to v14.
Both jobs passed: the regular longevity-4h (which makes no call to the counter) and latency 650 GB with nemesis.
@roydahan, the problem is in the unit tests. It happens because branch-perf-v14 doesn't have this commit:
commit 9697300
Author: Lukasz Sojka
Date: Fri Mar 10 11:32:05 2023 +0100
feature(adaptive-timeouts): calculate timeouts based on node load
If I try to backport it, another conflict appears in nemesis.py. Backporting the commit that would resolve the nemesis.py conflict would in turn build a long chain of commits that were not backported to perf-v14.
@roydahan, WDYT if I prepare a new PR explicitly for perf-v14 that resolves the unit-test conflicts?
Ok, please send a PR directly to v14.