I made a trace using the threadsig app in drmemtrace_samples.
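(The exact tracing command isn't captured here; below is a minimal sketch of the usual drmemtrace offline-tracing invocation, with the threadsig path and arguments as placeholders rather than the ones actually used:)
# Collect an offline drmemtrace trace of the threadsig sample app.
# The app path and arguments are illustrative only.
$ bin64/drrun -t drmemtrace -offline -- ./threadsig 16 10000
# This writes a directory named along the lines of drmemtrace.threadsig.<pid>.<id>.dir,
# which the analysis commands below point at via -indir.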
It is quite slow at the default parameters with the 1000x block time scale: it takes a minute and a half! It gets down to 9 seconds if we shrink that scale:
$ for i in 1000 100 10 1 0.1 0.01; do /usr/bin/time bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir drmemtrace.threadsig.5* -verbose 0 -sched_block_scale $i; done 2>&1 | grep elapsed
121.25user 50.43system 1:26.17elapsed 199%CPU (0avgtext+0avgdata 10348maxresident)k
51.87user 13.09system 0:27.53elapsed 235%CPU (0avgtext+0avgdata 10496maxresident)k
33.83user 4.41system 0:14.96elapsed 255%CPU (0avgtext+0avgdata 10476maxresident)k
26.34user 0.25system 0:08.88elapsed 299%CPU (0avgtext+0avgdata 10240maxresident)k
27.05user 0.26system 0:09.11elapsed 299%CPU (0avgtext+0avgdata 10624maxresident)k
26.97user 0.24system 0:09.08elapsed 299%CPU (0avgtext+0avgdata 10332maxresident)k
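(The STATS-* files grepped next hold the full output of each run; they were presumably captured with something like the following, where only the STATS-$i naming is an assumption chosen to match the greps:)
$ for i in 1000 100 10 1; do bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir drmemtrace.threadsig.5* -sched_block_scale $i > STATS-$i 2>&1; done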
$ for i in STATS-*; do echo $i; grep 'cpu busy' $i | head -3; done
STATS-1
92.60% cpu busy by record count
93.63% cpu busy by time
96.85% cpu busy by time, ignoring idle past last instr
STATS-10
90.59% cpu busy by record count
56.49% cpu busy by time
79.65% cpu busy by time, ignoring idle past last instr
STATS-100
81.27% cpu busy by record count
32.03% cpu busy by time
47.20% cpu busy by time, ignoring idle past last instr
STATS-1000
63.01% cpu busy by record count
13.78% cpu busy by time
15.32% cpu busy by time, ignoring idle past last instr
It's very fast on the checked-in threadsig trace. Sizes show the new trace is ~30x larger; I probably did a delayed start to shrink the checked-in one. But does the new trace really have a lot of blocking syscalls?
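(One way to check, as a hedged sketch: dump the records with the drmemtrace view tool and look at the syscall and timestamp markers; the grep pattern is a guess at the marker text and may need adjusting:)
$ bin64/drrun -t drmemtrace -tool view -indir drmemtrace.threadsig.5* 2>&1 | grep -iE 'syscall|system call' | head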
Looks like a ~20ms SYS_futex early in each thread. But why turn that into 25 seconds (the max) in an analyzer? I guess I did analyze the wall-clock time for analyzers, and 1000x did seem right on internal very large traces where it took about 1-2us of wall-clock time to process each record. But these small local analyses are at least 10x faster: so we should drop to 100x scale? Also drop the max from 25s down to 5s or something?
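(Back-of-the-envelope numbers for that, assuming the charged block time is simply the observed blocking latency multiplied by the scale and capped at the 25s max:)
$ for s in 1000 100 10 1; do awk -v s=$s 'BEGIN { block=0.020; max=25; t=block*s; if (t>max) t=max; printf "scale %4d: a 20ms futex becomes %gs of block time\n", s, t }'; done
scale 1000: a 20ms futex becomes 20s of block time
scale  100: a 20ms futex becomes 2s of block time
scale   10: a 20ms futex becomes 0.2s of block time
scale    1: a 20ms futex becomes 0.02s of block time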
Even at 10x scale it spends a while (~6 seconds according to the time, but it feels like more when watching it; probably more because of the time to print it out) just sitting there with things blocked:
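(The raw blocked/idle output isn't captured here; the clustering mentioned next was presumably done with a standard uniq pipeline along these lines, where the verbosity level and the shape of the per-record output are assumptions:)
$ bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir drmemtrace.threadsig.5* -verbose 3 2>&1 | uniq -c | sort -rn | head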
With uniq, the biggest blocked time is '12579930': wow. So dropping the max would help.
With a lower max this is better, but it still has ~150 heartbeats in a row in the biggest cluster: maybe they're too frequent.
Maybe the original run did have quite a bit of idle: but it seems better to err on the side of prompt running times and let the user tweak the parameters if they want to push for more representative idle times, especially since the idle times and parameters depend on wall-clock time and on analyzer or simulator speed, so there isn't one value that works for everything.
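For example, pushing the idle times back up is just a matter of passing a larger scale to the same command as above:
$ bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir drmemtrace.threadsig.5* -sched_block_scale 1000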
Reduces the scheduler and drmemtrace launcher default values for
block_time_scale down to 10 and block_time_max down to 2.5s. This
improves the scheduler behavior for small traces under fast analyzers.
It seems better to err on the side of faster and let more heavyweight
simulations tune the block times for more idle time; otherwise we can
end up with local runs and especially new users trying things out and
seeing the tool seem to just sit there doing nothing.
This reduces the threadsig core-sharded time from a minute and a half
down to 10 seconds in local runs (see #6945 for command lines); there
is still some idle time in there so it seems a reasonable compromise.
Fixes #6945