Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT #6354

Open
prasun3 opened this issue Oct 10, 2023 · 13 comments · May be fixed by #6355
Open

seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT #6354

prasun3 opened this issue Oct 10, 2023 · 13 comments · May be fixed by #6355
Assignees

Comments

@prasun3
Copy link
Contributor

prasun3 commented Oct 10, 2023

Describe the bug

See https://groups.google.com/g/DynamoRIO-Users/c/Dsf4As64r7o

In some of my tests, I am seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT.

The value returned by stream->get_output_cpuid matches the initial CPU_ID marker values, but at some point on some streams, we see other CPU_ID values as shown below.

Example:

Output #5 processing CPU #7                   (from get_stream_cpuid)
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
...
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 5

To Reproduce
Steps to reproduce the behavior:

  1. Collect a trace with drrun -t drcachesim -offline
  2. Use the scheduler infrastructure with the MAP_TO_RECORDED_OUTPUT option
  3. We see multiple CPU IDs on the same output stream which is not expected

Please also answer these questions:

  • What happens when you run without any client?
    • Not applicable
  • What happens when you run with debug build ("-debug" flag to drrun/drconfig/drinject)?
    • No difference

Expected behavior
With the MAP_TO_RECORDED_OUTPUT option, each output stream maps to exactly one CPU

Screenshots or Pasted Text

Versions

  • What version of DynamoRIO are you using?
  • Does the latest build from https://github.com/DynamoRIO/dynamorio/releases solve the problem?
    • No
  • What operating system version are you running on? ("Windows 10" is not sufficient: give the release number.)
    • Ubuntu 18
  • Is your application 32-bit or 64-bit?
    • 64-bit

Additional context

prasun3 added a commit that referenced this issue Oct 10, 2023
Added check to verify that output stream's CPU ID matches the CPU ID
markers seen in the trace.

Issue: #6354
@derekbruening
Copy link
Contributor

It looks like this is the culprit, this thread with zero instructions:

$ bin64/drrun -t drcachesim -simulator_type view -only_thread 85300 -indir ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
           1           0:       85300 <marker: version 6>
           2           0:       85300 <marker: filetype 0xe40>
           3           0:       85300 <marker: cache line size 64>
           4           0:       85300 <marker: chunk instruction count 10000000>
           5           0:       85300 <marker: page size 4096>
           6           0:       85300 <marker: timestamp 13341391202394706>
           7           0:       85300 <marker: tid 85300 on core 1>
           8           0:       85300 <marker: function==syscall #202>
           9           0:       85300 <marker: function return value 0x0>
          10           0:       85300 <marker: timestamp 13341391214912820>
          11           0:       85300 <marker: tid 85300 on core 0>
          12           0:       85300 <thread 85300 exited>
View tool results:
              0 : total instructions

All the scheduling is instruction-count-based so this degenerate thread has 2 entries both at instruction count 0:

$ unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip 0 | od -t d8 -w32
0000000                85303    13341391204829725                    0                    0
0000040                85304    13341391205074867                    0                    0
0000100                85300    13341391214912820                    0                    0
0000140                85304    13341391214943240                    0               109071
0000200

$ unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip 1 | od -t d8 -w32
0000000                85300    13341391202394706                    1                    0
0000040                85303    13341391206798536                    1                29985
0000100

The first question is, how did this thread end up having no instructions and still being output: I thought we had a filter to remove empty threads. That filter is based bytes written: so maybe the futex syscall exit (must have attached mid-futex as there's no entry) is what made it past the filter?

derekbruening added a commit that referenced this issue Oct 10, 2023
Tightens the normal-mode output rule to discard a thread with no
instructions (keeping the no-data check for i-filtered modes).

Adds the scheduler_launcher CPU check from PR #6355.

It is not easy to create a test with a thread that reliably executes
zero instructions yet still outputs some data.  I added an invariant
check for a zero-instruction thread and confirmed it fires on the
trace in PR #6355:

```
$ bin64/drrun -t drcachesim -simulator_type invariant_checker -indir ~/tmp/drmemtrace.schedtest.x64
Trace invariant failure in T85300 at ref # 12 (0 instrs since timestamp 13341391214912820): An unfiltered thread should have at least 1 instruction
```

Issue: #6354
@derekbruening derekbruening self-assigned this Oct 10, 2023
@prasun3
Copy link
Contributor Author

prasun3 commented Oct 11, 2023

Thanks for the debug! I added a comment here: #6356 (comment)

prasun3 added a commit that referenced this issue Oct 12, 2023
prasun3 added a commit that referenced this issue Oct 12, 2023
@derekbruening
Copy link
Contributor

derekbruening commented Oct 12, 2023

The second case from #6355 (comment) is rep string:

Input    1 @ 13109305 refs,   3329113 instrs
ERROR: CPU marker 0 on core #1 differs from output stream CPU ID 1

$ for i in 0 1 12; do echo "Core #$i"; unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip $i | od -t d8 -w32; done
Core #0
0000000               180525    13341574296169946                    0                    0
0000040
Core #1
0000000               180525    13341574300611346                    1              3329113
0000040
Core #12
0000000               180524    13341574296003803                   12                    0
0000040

$ bin64/drrun -t drcachesim -simulator_type view -only_thread 180525 -indir ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64 -skip_instrs 3329110 -sim_refs 30
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
    13109282     3329110:      180525 <marker: timestamp 13341574300611269>
    13109283     3329110:      180525 <marker: tid 180525 on core 0>
    13109284     3329111:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f20 48 89 c7             mov    %rax -> %rdi
    13109285     3329112:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f23 48 89 d6             mov    %rdx -> %rsi
    13109286     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109287     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6ae0 by PC 0x00007f25505e5f26
    13109288     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6ce0 by PC 0x00007f25505e5f26
    13109289     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109290     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6ae8 by PC 0x00007f25505e5f26
    13109291     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6ce8 by PC 0x00007f25505e5f26
    13109292     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109293     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6af0 by PC 0x00007f25505e5f26
    13109294     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6cf0 by PC 0x00007f25505e5f26
    13109295     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109296     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6af8 by PC 0x00007f25505e5f26
    13109297     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6cf8 by PC 0x00007f25505e5f26
    13109298     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109299     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6b00 by PC 0x00007f25505e5f26
    13109300     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6d00 by PC 0x00007f25505e5f26
    13109301     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109302     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6b08 by PC 0x00007f25505e5f26
    13109303     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6d08 by PC 0x00007f25505e5f26
    13109304     3329113:      180525 <marker: timestamp 13341574300611280>
    13109305     3329113:      180525 <marker: tid 180525 on core 0>
    13109306     3329113:      180525 <marker: timestamp 13341574300611346>
    13109307     3329113:      180525 <marker: tid 180525 on core 1>
    13109308     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109309     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6b10 by PC 0x00007f25505e5f26
    13109310     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6d10 by PC 0x00007f25505e5f26
    13109311     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx

All the rep strings have the same instr count: making it not work well as a divider point! Will have to consider how to handle this.

@derekbruening
Copy link
Contributor

If we eliminated the ifetch record for rep string (#4948, #4915) it doesn't help: the load/store records themselves could fill the buffer up and trigger cpuid markers in the middle.

@derekbruening
Copy link
Contributor

derekbruening commented Oct 12, 2023

Want instr ord for fast skip. So have to store both instr and record count?
And have skip_instructions() take in both and no longer always align at the
start of an instruction?

This affects record-replay too: though the switch point would just always be the
start of the rep string loop.

@derekbruening
Copy link
Contributor

Theoretically the same thing could happen with a ton of markers in between two instructions.

@derekbruening
Copy link
Contributor

Since the rep string loop could be quite long, it doesn't seem like we can just ignore this case unfortunately (for the marker case I would say it's not worth more effort than throwing away the offending cpuid marker in the middle).

@derekbruening
Copy link
Contributor

@prasun3 do you have an opinion on how important this is: if real-world rep string loops are generally not very long maybe never switching in the middle of one is ok? (There are still other aspects to sort out like record-replay with time quanta if we went this route.)

derekbruening added a commit that referenced this issue Oct 12, 2023
Tightens the normal-mode output rule to discard a thread with no
instructions (keeping the no-data check for i-filtered modes) and
expands the empty-thread behavior to include regular exit runs and not
just detach.

Adds the scheduler_launcher CPU check from PR #6355.

It is not easy to create a test with a thread that reliably executes
zero instructions yet still outputs some data. I added an invariant
check for a zero-instruction thread and confirmed it fires on the trace
in PR #6355:

```
$ bin64/drrun -t drcachesim -simulator_type invariant_checker -indir ~/tmp/drmemtrace.schedtest.x64
Trace invariant failure in T85300 at ref # 12 (0 instrs since timestamp 13341391214912820): An unfiltered thread should have at least 1 instruction
```

Issue: #6354
@prasun3
Copy link
Contributor Author

prasun3 commented Oct 13, 2023

@prasun3 do you have an opinion on how important this is: if real-world rep string loops are generally not very long maybe never switching in the middle of one is ok? (There are still other aspects to sort out like record-replay with time quanta if we went this route.)

I think not switching in the middle of a rep should be okay.

@derekbruening
Copy link
Contributor

@prasun3 -- another question: what is your preference for whether each iteration of a rep string loop should have a regular instruction fetch record, or not? We had decided in the past that it should not; but an alternative solution here would be to revisit that, to have a separate instruction ordinal for each iteration: either with a regular fetch or even without.

@prasun3
Copy link
Contributor Author

prasun3 commented Oct 13, 2023

@prasun3 -- another question: what is your preference for whether each iteration of a rep string loop should have a regular instruction fetch record, or not? We had decided in the past that it should not; but an alternative solution here would be to revisit that, to have a separate instruction ordinal for each iteration: either with a regular fetch or even without.

We'd prefer if each iteration did not have an instr fetch record.

@derekbruening
Copy link
Contributor

For the rep string:

Currently: scheduler records and replays the start only. Problems:

  • Wrong cpuid in record-as-traced
  • Instr quantum won't switch mid-loop
  • Time quantum will switch mid-loop but on replay would switch at start so won't match

Proposal #A: Consider not a big deal to not switch mid-loop

  • Hide the wrong cpuid: raw2trace deletes??
  • Time quantum delays mid-loop switch??

Proposal #B: Count non-fetched instrs in all instr ordinals

  • Confusing with instr records not matching ordinal

Proposal #C: Reverse earlier decisions and insert instr record for each iter

Proposal #D: Store both an instruction ordinal and a record ordinal everywhere we store instruction ordinals today (as-traced and serial schedule files; record-replay files; range_t region-of-interest bounds; skip_instructions API) and update the skipping code to advance to mid-instruction points (any point?)

Proposal #E: Like #D but we store both an instruction ordinal and sub-instruction "load/store ordinal". This then does not rely on having unfetched instr records (which we have proposed removing) and has more structure than an asolute record ordinal.

Proposal #F: New counter of {fetched+unfetched} instructions used by the scheduler. But skipping and zipfile chunks are based on fetched only.

We've decided that Proposal #E is the winner. This involves bumping the version for a new field in all the schedule file records (the raw2trace-written format and the record-replay format).

@derekbruening
Copy link
Contributor

The rep string solution may also serve to support running instruction-filtered traces or portions of traces in the scheduler with instruction quanta, skips, and replayed schedules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants