Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing tests on KNL #1904

Closed
Keno opened this issue Nov 24, 2016 · 21 comments
Closed

Failing tests on KNL #1904

Keno opened this issue Nov 24, 2016 · 21 comments

Comments

@Keno
Copy link
Member

Keno commented Nov 24, 2016

Mostly for my own reference (though feel free to jump in and help out), these are the tests that are failing on KNL (after updating gdb to a reasonably recent version):

Fixed after #1907:

	753 - string_instructions (Failed)
	754 - string_instructions-no-syscallbuf (Failed)
	755 - string_instructions_async_signals (Timeout)
	759 - string_instructions_replay (Failed)
	760 - string_instructions_replay-no-syscallbuf (Failed)
	945 - string_instructions_break (Failed)
	946 - string_instructions_break-no-syscallbuf (Failed)
	947 - string_instructions_replay_quirk (Failed)
	948 - string_instructions_replay_quirk-no-syscallbuf (Failed)

Unrelated to CPU architecture but seen on the same machine:

Fixed by #1909:

	655 - dconf_mock (Timeout)
	656 - dconf_mock-no-syscallbuf (Timeout)

Will be fixed by #1911 (as long as #1907 is applied):

	637 - checksum_sanity (Failed)
	638 - checksum_sanity-no-syscallbuf (Failed)
	889 - record_replay (Failed)
	890 - record_replay-no-syscallbuf (Failed)
@Keno
Copy link
Member Author

Keno commented Nov 24, 2016

string_instructions fails with the following assert:

[FATAL /global/homes/k/kfischer/rr/src/fast_forward.cc:374:fast_forward_through_instruction() errno: SUCCESS]
 (task 239903 (rec:239781) at time 313)
 -> Assertion `t->ip() == limit_ip && decoded.modifies_flags' failed to hold.
   0x0000000000400e70 <+87>:    rep stos %al,%es:(%rdi)
=> 0x0000000000400e72 <+89>:    jmp    0x400eb7 <string_store+158>
(gdb) p ip
$7 = {ptr = 4198000}
(gdb) p limit_ip
$8 = {ptr = 4198002}
(gdb) p iterations
$9 = 62
(gdb) p t->ip()
$10 = {ptr = 4198002}

info registers in the emergency debugger:

rax            0x6161616161616161       7016996765293437281
rbx            0x0      0
rcx            0x0      0
rdx            0xa00000 10485760
rsi            0x2aaaab8ce010   46912510943248
rdi            0x2aaaac2ce010   46912521429008
rbp            0x7fffffff6f70   0x7fffffff6f70
rsp            0x7fffffff6f50   0x7fffffff6f50
r8             0x1      1
r9             0x20000  131072
r10            0x22     34
r11            0x9      9
r12            0x400b30 4197168
r13            0x7fffffff7090   140737488318608

@Keno
Copy link
Member Author

Keno commented Nov 24, 2016

Looking through KNL errata, I see:

KNL11: DR6 May be Zero After Data Breakpoint on Gather, Scatter or VRSQRT14 Instructions
If gather, scatter or VRSQRT14 instructions trigger a data breakpoint, the Debug Status Register (DR6) may be cleared.
Software will see a breakpoint trap but with no indication of which data breakpoint register was triggered.
None Identified No Fix

@Keno
Copy link
Member Author

Keno commented Nov 24, 2016

It seems like we should be able to synthesize DR6 by looking at the last instruction executed and seeing which memory it would have accessed. Might be a bit of work, but should be doable. @rocallahan thoughts?

@rocallahan
Copy link
Collaborator

So you're sure that the issue here is that a watchpoint was set and should have triggered but the CPU did not trigger it?

That erratum does not apply to this particular code, right? Are you assuming this is an undocumented hardware bug?

@Keno
Copy link
Member Author

Keno commented Nov 24, 2016

It looks that way (note the watch point is still triggered just not reported). I'll have to take a more careful look in the morning. I do suspect that the erratum applies, but is just too narrowly worded. I'm talking to Intel to get some more insight into this erratum.

@rocallahan
Copy link
Collaborator

I'm a little confused here. Our internal watchpoint has WRITE_BIT so AddressSpace::notify_watchpoint_fired should treat it as triggered regardless of debug_status, if the value changed which I presume it does? Or does the value not actually change in this case?

@rocallahan
Copy link
Collaborator

It seems like we should be able to synthesize DR6 by looking at the last instruction executed and seeing which memory it would have accessed. Might be a bit of work, but should be doable.

You mean for all instructions or just for the fast-forward string instructions?

The former would be a pretty large amount of work and maintenance burden. If we had to do it I think we'd want to import DynamoRio or something similar instead of rolling it from scratch. But I don't think we have to do it. I think just handling the string instructions would be OK.

At this stage, though, I'm not sure that the kind of bug in the erratum is what you're seeing.

@Keno
Copy link
Member Author

Keno commented Nov 24, 2016

You're right. I was tired and jumped to conclusions too quickly. The actual problem is shown by this test case:

#include <assert.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

static uintptr_t get_cx(pid_t child) {
  struct user_regs_struct regs;
  assert(0 == ptrace(PTRACE_GETREGS, child, NULL, &regs));
  return regs.rcx;
}

static void wait_stopped(pid_t child) {
  int status;
  /* Wait until the tracee stops */
  assert(child == waitpid(child, &status, 0));
  assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
}

char dest[1000000];

int main(void) {
  pid_t child;
  if (0 == (child = fork())) {
    assert(0 == ptrace(PTRACE_TRACEME, 0, 0, 0));
    uint8_t a = 0x22;
    size_t size = sizeof(dest);
    __asm__("int3; rep stosb\n\t" ::"a"(a), "c"(size), "D"(dest));
    return 0;
  }

  wait_stopped(child);
  assert(0 == ptrace(PTRACE_SINGLESTEP, child, NULL, 0));
  wait_stopped(child);
  uintptr_t cx = get_cx(child);
  assert(0 == ptrace(PTRACE_SINGLESTEP, child, NULL, 0));
  wait_stopped(child);
  uintptr_t new_cx = get_cx(child);
  assert(new_cx == cx - 1);
  return 0;
}

Passes on Xeon, fails on KNL with

$ ./a.out
a.out: string_test.c:40: main: Assertion `get_cx(child) == cx - 1' failed.
Aborted

In particular, it looks like single stepping now suffers from the same quirk watchpoints do:

(gdb) p cx-new_cx
$4 = 64

@rocallahan
Copy link
Collaborator

OK. That means we can't use hardware singlestepping to advance to the precise stopping point. We could emulate string-instruction singlestepping from rr instead, but I just thought of a crazy hack that should work and be easier: when we want to single-step the string instruction, temporarily adjust the IP to be after the REP prefix!

@rocallahan
Copy link
Collaborator

Hmm, there might be edge cases when the instruction has another prefix before the REP prefix...

@Keno
Copy link
Member Author

Keno commented Nov 24, 2016

Rewrite rcx to 1 instead?

@rocallahan
Copy link
Collaborator

Yeah, that works.

@Keno
Copy link
Member Author

Keno commented Nov 24, 2016

Will try and let you know how it goes.

@rocallahan
Copy link
Collaborator

Another issue is the very first singlestep performed by fast_forward_through_instruction. That'll trigger the same problem :-(.

@Keno
Copy link
Member Author

Keno commented Nov 24, 2016

Ah, true.

@rocallahan
Copy link
Collaborator

And really we should fix this for non-fast-forward singlestep too, since it would be bad if a normal singlestep could progress further than a fast-forward.

So it's a bit horrible but I think Task::resume_execution with RESUME_SINGLESTEP needs to examine the current instruction and temporarily hack RCX/ECX if necessary :-(.

@Keno
Copy link
Member Author

Keno commented Nov 24, 2016

It seems like there's a problem with doing this, because we won't be able to detect the early loop-exit as we do now. I guess we can look at flags and see if they changed.

@Keno
Copy link
Member Author

Keno commented Nov 25, 2016

The approach discussed above basically works. However, we have performance mysteries:

So here's a mystery:

$ time perf record -F 1750 --call-graph=fp ./bin/rr replay -a
real    1m10.845s
user    0m14.416s
sys     0m25.976s
$ time ./bin/rr replay -a
real    2m25.280s
user    0m12.192s
sys     1m59.512s

Not sure how to go about debugging this. Thoughts? Gonna play with processor affinity a bit to see if I can make something happen.

@Keno
Copy link
Member Author

Keno commented Nov 25, 2016

$ (time ./bin/rr replay -a &); (time ./bin/rr replay -a &);
EXIT-SUCCESS

real    1m31.547s
user    0m11.176s
sys     0m21.308s
EXIT-SUCCESS

real    1m32.472s
user    0m11.116s
sys     0m22.336s

And since this is the same recording, they are by necessity bound to the same cpu core. Yet they're still faster. Something is very wrong here.

@Keno
Copy link
Member Author

Keno commented Nov 25, 2016

I had a hunch. The following diff makes rr 5 times faster (in this particular test case):

diff --git a/src/PerfCounters.cc b/src/PerfCounters.cc
index b9ef901..67eabd7 100644
--- a/src/PerfCounters.cc
+++ b/src/PerfCounters.cc
@@ -284,6 +284,10 @@ void PerfCounters::reset(Ticks ticks_period) {
     struct perf_event_attr attr = rr::ticks_attr;
     attr.sample_period = ticks_period;
     fd_ticks_interrupt = start_counter(tid, -1, &attr);
+    if (!fd_ticks_just_cause.is_open()) {
+       struct perf_event_attr attr2 = rr::cycles_attr;
+       fd_ticks_just_cause = start_counter(tid, -1, &attr2);
+    }
     // Set up a separate counter for measuring ticks, which does not have
     // a sample period and does not count events during aborted transactions.
     // Note that the sample_period should *never* be set to zero. That should
diff --git a/src/PerfCounters.h b/src/PerfCounters.h
index 07dd56d..db215ad 100644
--- a/src/PerfCounters.h
+++ b/src/PerfCounters.h
@@ -105,6 +105,7 @@ private:
   ScopedFd fd_page_faults;
   ScopedFd fd_hw_interrupts;
   ScopedFd fd_instructions_retired;
+  ScopedFd fd_ticks_just_cause;
   bool started;
 };

@Keno
Copy link
Member Author

Keno commented Nov 27, 2016

Good news! Once all the pending pull requests are merged, the only remaining failure is flock, which is not KNL-specific (and I had opened #1908 for). I'll close this in favor of the more specific PRs and issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants