Default block time parameters have too much idle time for small local traces #6945

Closed
derekbruening opened this issue Aug 28, 2024 · 0 comments · Fixed by #6946

I made a trace like so using the threadsig app in drmemtrace_samples:

bin64/drrun -t drmemtrace -core_sharded -cores 8 -offline -- ~/dr/test/threadsig 4 200

It is quite slow at the default parameters with 1000x block time scale: it
takes a minute and a half! It gets down to 9 seconds if we shrink that scale:

$ for i in 1000 100 10 1 0.1 0.01; do /usr/bin/time bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir drmemtrace.threadsig.5* -verbose 0 -sched_block_scale $i; done 2>&1 | grep elapsed
121.25user 50.43system 1:26.17elapsed 199%CPU (0avgtext+0avgdata 10348maxresident)k
51.87user 13.09system 0:27.53elapsed 235%CPU (0avgtext+0avgdata 10496maxresident)k
33.83user 4.41system 0:14.96elapsed 255%CPU (0avgtext+0avgdata 10476maxresident)k
26.34user 0.25system 0:08.88elapsed 299%CPU (0avgtext+0avgdata 10240maxresident)k
27.05user 0.26system 0:09.11elapsed 299%CPU (0avgtext+0avgdata 10624maxresident)k
26.97user 0.24system 0:09.08elapsed 299%CPU (0avgtext+0avgdata 10332maxresident)k

$ for i in STATS-*; do echo $i; grep 'cpu busy' $i | head -3; done
STATS-1
       92.60% cpu busy by record count
       93.63% cpu busy by time
       96.85% cpu busy by time, ignoring idle past last instr
STATS-10
       90.59% cpu busy by record count
       56.49% cpu busy by time
       79.65% cpu busy by time, ignoring idle past last instr
STATS-100
       81.27% cpu busy by record count
       32.03% cpu busy by time
       47.20% cpu busy by time, ignoring idle past last instr
STATS-1000
       63.01% cpu busy by record count
       13.78% cpu busy by time
       15.32% cpu busy by time, ignoring idle past last instr
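
The STATS-* files grepped here are the schedule_stats output saved per
scale; the exact command isn't shown, but a sketch reusing the timing loop
above with per-scale redirection added would look like:

$ for i in 1000 100 10 1; do bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir drmemtrace.threadsig.5* -verbose 0 -sched_block_scale $i > STATS-$i 2>&1; done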

It's very fast on the checked-in threadsig trace:

$ /usr/bin/time bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -verbose 1
3.07user 0.10system 0:01.06elapsed 297%CPU (0avgtext+0avgdata 9216maxresident)k

Sizes:

$ bin64/drrun -t drmemtrace -tool basic_counts -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ 2>&1 | head -20
Basic counts tool results:
Total counts:
      638938 total (fetched) instructions
           8 total threads

$ bin64/drrun -t drmemtrace -tool basic_counts -indir drmemtrace.threadsig.5*/ 2>&1 | head -20
Basic counts tool results:
Total counts:
    19899692 total (fetched) instructions
           5 total threads

I probably did a delayed start to shrink the checked-in trace.
So the new trace is about 30x larger (~19.9M vs ~0.64M instructions): but does it really have a lot of blocking syscalls?

     58 [scheduler] input 2 maybe-blocking syscall latency 26224 * scale 1000.0 => blocked time 25000000
     74 [scheduler] input 3 maybe-blocking syscall latency 23383 * scale 1000.0 => blocked time 23383000
     83 [scheduler] input 1 maybe-blocking syscall latency 26234 * scale 1000.0 => blocked time 25000000
    236 [scheduler] input 4 maybe-blocking syscall latency 12956 * scale 1000.0 => blocked time 12956000

Looks like a ~20ms SYS_futex early in each thread. But why turn that into
25 seconds (the max) in an analyzer? I guess I did analyze the wall-clock
time for analyzers, and 1000x did seem right on internal very large traces
where it took about 1-2us of wall-clock time to process each record. But
these small local analyses are at least 10x faster, so should we drop to
100x scale? Also drop the max from 25s down to 5s or something?
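
Spelling out what those scheduler log lines compute: the blocked time looks
like latency * scale, capped at the max. A quick illustrative check in shell
arithmetic (not scheduler code; units assumed to be microseconds):

$ latency=26224; scale=1000; max=25000000
$ echo $(( latency * scale > max ? max : latency * scale ))
25000000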

Even at 10x scale it spends a while (~6 seconds according to the time
output, but it feels like more when watching it; it's probably more because
of the time to print it out) just sitting there with things blocked:

[scheduler] heartbeat[1] 3 in queue; 3 blocked; 0 unscheduled => -1 8
[scheduler] heartbeat[1] 3 in queue; 3 blocked; 0 unscheduled => -1 8
[scheduler] heartbeat[1] 3 in queue; 3 blocked; 0 unscheduled => -1 8
[scheduler] heartbeat[1] 3 in queue; 3 blocked; 0 unscheduled => -1 8
[scheduler] heartbeat[1] 3 in queue; 3 blocked; 0 unscheduled => -1 8
<...>
[scheduler] heartbeat[2] 1 in queue; 1 blocked; 0 unscheduled => -1 8
[scheduler] heartbeat[2] 1 in queue; 1 blocked; 0 unscheduled => -1 8
[scheduler] heartbeat[2] 1 in queue; 1 blocked; 0 unscheduled => -1 8
[scheduler] heartbeat[2] 1 in queue; 1 blocked; 0 unscheduled => -1 8

With uniq:

$ /usr/bin/time bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir drmemtrace.threadsig.5* -verbose 2 -sched_block_scale 10 > HEART-10 2>&1
$ sed -e 's/\[[0-9]\]/[x]/' < HEART-10 | uniq -c | less
      1 Core # 2 @     5426 refs,      3473 instrs: input    4 @        8 refs,         1 instrs, time 13368670870545289 == thread 532460
      1 [scheduler] next_record[x]: blocked time 262340
     55 [scheduler] heartbeat[x] 3 in queue; 3 blocked; 0 unscheduled => -1 8
      1 [scheduler] next_record[x]: blocked time 129560
     22 [scheduler] heartbeat[x] 4 in queue; 4 blocked; 0 unscheduled => -1 8
      1 Core # 2 @   136885 refs,     96152 instrs: input    3 @     2593 refs,      1681 instrs, time 13368670870606786 == thread 532459
      8 [scheduler] heartbeat[x] 3 in queue; 3 blocked; 0 unscheduled => -1 8
<...>
      1 Core # 1 @  8450072 refs,   5973286 instrs: input    2 @  6017396 refs,   4253465 instrs, time 13368670871602786 == thread 532458
    396 [scheduler] heartbeat[x] 1 in queue; 1 blocked; 0 unscheduled => -1 8
      1 [scheduler] next_record[x]: blocked time 115580
<...>
      1 Hit EOF in drmemtrace.threadsig.532433.8360.dir/trace/drmemtrace.threadsig.532458.4899.trace.zip
      1 [scheduler] input 2 at eof; 1 live inputs left
   2390 [scheduler] heartbeat[x] 1 in queue; 1 blocked; 0 unscheduled => -1 8
      1 Core # 0 @  8913597 refs,   6390745 instrs: input    0 @  2377636 refs,   1771008 instrs, time 13368670871863208 == thread 532433
      1 [file_reader] At thread EOF
      1 Hit EOF in drmemtrace.threadsig.532433.8360.dir/trace/drmemtrace.threadsig.532433.3888.trace.zip
      1 [scheduler] input 0 at eof; 0 live inputs left
      1 [analyzer] [analyzer] Worker 2 finished trace shard 
      1 Worker 1 finished trace shard 
      1 [analyzer] Worker 0 finished trace shard drmemtrace.threadsig.532433.3888.trace.zip
      1 Schedule stats tool results:

Biggest blocked time is '12579930': wow.
So dropping the max would help.
This is better:

$ /usr/bin/time bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir drmemtrace.threadsig.5* -verbose 2 -sched_block_scale 10 -sched_block_max_us 5000000 > HEART-10-max5s 2>&1
28.33user 0.40system 0:09.70elapsed 295%CPU (0avgtext+0avgdata 10624maxresident)k
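
For scale (assuming the same microsecond units as -sched_block_max_us), that
biggest blocked time of 12579930 is ~12.6 seconds, so the 5s cap applies:

$ echo $(( 12579930 > 5000000 ? 5000000 : 12579930 ))
5000000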

It still has ~150 heartbeats in a row in the biggest cluster: maybe they're too frequent.

Maybe the original run did have quite a bit of idle time, but it seems
better to err on the side of prompt running times and let the user tweak
the parameters if they want to push for more representative idle times,
especially since the idle times and parameters depend on wall-clock time
and on analyzer or simulator speed, so there isn't one value that works
for everything.
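
For anyone who does want the more representative idle times back (e.g. for
a heavyweight simulator), the knobs are already there; something along the
lines of the old values would be:

$ bin64/drrun -t drmemtrace -tool schedule_stats -core_sharded -cores 3 -indir drmemtrace.threadsig.5* -sched_block_scale 1000 -sched_block_max_us 25000000

with the right numbers depending on how long the analyzer or simulator takes per record.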

derekbruening self-assigned this on Aug 28, 2024
derekbruening added a commit that referenced this issue Aug 28, 2024
Reduces the scheduler and drmemtrace launcher default values for
block_time_scale down to 10 and block_time_max down to 2.5s.  This
improves the scheduler behavior for small traces under fast analyzers.
It seems better to err on the side of faster and let more heavyweight
simulations tune the block times for more idle time; otherwise we can
end up with local runs and especially new users trying things out and
seeing the tool seem to just sit there doing nothing.

This reduces the threadsig core-sharded time from a minute and a half
down to 10 seconds in local runs (see #6945 for command lines); there
is still some idle time in there so it seems a reasonable compromise.

Fixes #6945