Failing tests on KNL #1904

Keno · 2016-11-24T02:06:23Z

Mostly for my own reference (though feel free to jump in and help out), these are the tests that are failing on KNL (after updating gdb to a reasonably recent version):

Fixed after #1907:

	753 - string_instructions (Failed)
	754 - string_instructions-no-syscallbuf (Failed)
	755 - string_instructions_async_signals (Timeout)
	759 - string_instructions_replay (Failed)
	760 - string_instructions_replay-no-syscallbuf (Failed)
	945 - string_instructions_break (Failed)
	946 - string_instructions_break-no-syscallbuf (Failed)
	947 - string_instructions_replay_quirk (Failed)
	948 - string_instructions_replay_quirk-no-syscallbuf (Failed)

Unrelated to CPU architecture but seen on the same machine:

flock (flock test fails on lustrefs #1908)
Various tests trying to unwind out of systemcalls, e.g. fork_exec_info_thr (Fixed by Various changes having to do with unwinding #1910)

Fixed by #1909:

	655 - dconf_mock (Timeout)
	656 - dconf_mock-no-syscallbuf (Timeout)

Will be fixed by #1911 (as long as #1907 is applied):

	637 - checksum_sanity (Failed)
	638 - checksum_sanity-no-syscallbuf (Failed)
	889 - record_replay (Failed)
	890 - record_replay-no-syscallbuf (Failed)

The text was updated successfully, but these errors were encountered:

Keno · 2016-11-24T02:10:25Z

string_instructions fails with the following assert:

[FATAL /global/homes/k/kfischer/rr/src/fast_forward.cc:374:fast_forward_through_instruction() errno: SUCCESS]
 (task 239903 (rec:239781) at time 313)
 -> Assertion `t->ip() == limit_ip && decoded.modifies_flags' failed to hold.

   0x0000000000400e70 <+87>:    rep stos %al,%es:(%rdi)
=> 0x0000000000400e72 <+89>:    jmp    0x400eb7 <string_store+158>

(gdb) p ip
$7 = {ptr = 4198000}
(gdb) p limit_ip
$8 = {ptr = 4198002}
(gdb) p iterations
$9 = 62
(gdb) p t->ip()
$10 = {ptr = 4198002}

info registers in the emergency debugger:

rax            0x6161616161616161       7016996765293437281
rbx            0x0      0
rcx            0x0      0
rdx            0xa00000 10485760
rsi            0x2aaaab8ce010   46912510943248
rdi            0x2aaaac2ce010   46912521429008
rbp            0x7fffffff6f70   0x7fffffff6f70
rsp            0x7fffffff6f50   0x7fffffff6f50
r8             0x1      1
r9             0x20000  131072
r10            0x22     34
r11            0x9      9
r12            0x400b30 4197168
r13            0x7fffffff7090   140737488318608

Keno · 2016-11-24T06:06:21Z

Looking through KNL errata, I see:

KNL11: DR6 May be Zero After Data Breakpoint on Gather, Scatter or VRSQRT14 Instructions
If gather, scatter or VRSQRT14 instructions trigger a data breakpoint, the Debug Status Register (DR6) may be cleared.
Software will see a breakpoint trap but with no indication of which data breakpoint register was triggered.
None Identified No Fix

Keno · 2016-11-24T06:54:07Z

It seems like we should be able to synthesize DR6 by looking at the last instruction executed and seeing which memory it would have accessed. Might be a bit of work, but should be doable. @rocallahan thoughts?

rocallahan · 2016-11-24T08:47:13Z

So you're sure that the issue here is that a watchpoint was set and should have triggered but the CPU did not trigger it?

That erratum does not apply to this particular code, right? Are you assuming this is an undocumented hardware bug?

Keno · 2016-11-24T08:53:42Z

It looks that way (note the watch point is still triggered just not reported). I'll have to take a more careful look in the morning. I do suspect that the erratum applies, but is just too narrowly worded. I'm talking to Intel to get some more insight into this erratum.

rocallahan · 2016-11-24T09:07:47Z

I'm a little confused here. Our internal watchpoint has WRITE_BIT so AddressSpace::notify_watchpoint_fired should treat it as triggered regardless of debug_status, if the value changed which I presume it does? Or does the value not actually change in this case?

rocallahan · 2016-11-24T10:17:01Z

It seems like we should be able to synthesize DR6 by looking at the last instruction executed and seeing which memory it would have accessed. Might be a bit of work, but should be doable.

You mean for all instructions or just for the fast-forward string instructions?

The former would be a pretty large amount of work and maintenance burden. If we had to do it I think we'd want to import DynamoRio or something similar instead of rolling it from scratch. But I don't think we have to do it. I think just handling the string instructions would be OK.

At this stage, though, I'm not sure that the kind of bug in the erratum is what you're seeing.

Keno · 2016-11-24T22:45:58Z

You're right. I was tired and jumped to conclusions too quickly. The actual problem is shown by this test case:

#include <assert.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

static uintptr_t get_cx(pid_t child) {
  struct user_regs_struct regs;
  assert(0 == ptrace(PTRACE_GETREGS, child, NULL, &regs));
  return regs.rcx;
}

static void wait_stopped(pid_t child) {
  int status;
  /* Wait until the tracee stops */
  assert(child == waitpid(child, &status, 0));
  assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
}

char dest[1000000];

int main(void) {
  pid_t child;
  if (0 == (child = fork())) {
    assert(0 == ptrace(PTRACE_TRACEME, 0, 0, 0));
    uint8_t a = 0x22;
    size_t size = sizeof(dest);
    __asm__("int3; rep stosb\n\t" ::"a"(a), "c"(size), "D"(dest));
    return 0;
  }

  wait_stopped(child);
  assert(0 == ptrace(PTRACE_SINGLESTEP, child, NULL, 0));
  wait_stopped(child);
  uintptr_t cx = get_cx(child);
  assert(0 == ptrace(PTRACE_SINGLESTEP, child, NULL, 0));
  wait_stopped(child);
  uintptr_t new_cx = get_cx(child);
  assert(new_cx == cx - 1);
  return 0;
}

Passes on Xeon, fails on KNL with

$ ./a.out
a.out: string_test.c:40: main: Assertion `get_cx(child) == cx - 1' failed.
Aborted

In particular, it looks like single stepping now suffers from the same quirk watchpoints do:

(gdb) p cx-new_cx
$4 = 64

rocallahan · 2016-11-24T22:59:53Z

OK. That means we can't use hardware singlestepping to advance to the precise stopping point. We could emulate string-instruction singlestepping from rr instead, but I just thought of a crazy hack that should work and be easier: when we want to single-step the string instruction, temporarily adjust the IP to be after the REP prefix!

rocallahan · 2016-11-24T23:00:48Z

Hmm, there might be edge cases when the instruction has another prefix before the REP prefix...

Keno · 2016-11-24T23:01:18Z

Rewrite rcx to 1 instead?

rocallahan · 2016-11-24T23:01:34Z

Yeah, that works.

Keno · 2016-11-24T23:02:07Z

Will try and let you know how it goes.

rocallahan · 2016-11-24T23:03:12Z

Another issue is the very first singlestep performed by fast_forward_through_instruction. That'll trigger the same problem :-(.

Keno · 2016-11-24T23:03:59Z

Ah, true.

rocallahan · 2016-11-24T23:07:25Z

And really we should fix this for non-fast-forward singlestep too, since it would be bad if a normal singlestep could progress further than a fast-forward.

So it's a bit horrible but I think Task::resume_execution with RESUME_SINGLESTEP needs to examine the current instruction and temporarily hack RCX/ECX if necessary :-(.

Keno · 2016-11-24T23:09:12Z

It seems like there's a problem with doing this, because we won't be able to detect the early loop-exit as we do now. I guess we can look at flags and see if they changed.

Keno · 2016-11-25T07:38:49Z

The approach discussed above basically works. However, we have performance mysteries:

So here's a mystery:

$ time perf record -F 1750 --call-graph=fp ./bin/rr replay -a
real    1m10.845s
user    0m14.416s
sys     0m25.976s
$ time ./bin/rr replay -a
real    2m25.280s
user    0m12.192s
sys     1m59.512s

Not sure how to go about debugging this. Thoughts? Gonna play with processor affinity a bit to see if I can make something happen.

Keno · 2016-11-25T07:55:28Z

$ (time ./bin/rr replay -a &); (time ./bin/rr replay -a &);
EXIT-SUCCESS

real    1m31.547s
user    0m11.176s
sys     0m21.308s
EXIT-SUCCESS

real    1m32.472s
user    0m11.116s
sys     0m22.336s

And since this is the same recording, they are by necessity bound to the same cpu core. Yet they're still faster. Something is very wrong here.

Keno · 2016-11-25T08:13:12Z

I had a hunch. The following diff makes rr 5 times faster (in this particular test case):

diff --git a/src/PerfCounters.cc b/src/PerfCounters.cc
index b9ef901..67eabd7 100644
--- a/src/PerfCounters.cc
+++ b/src/PerfCounters.cc
@@ -284,6 +284,10 @@ void PerfCounters::reset(Ticks ticks_period) {
     struct perf_event_attr attr = rr::ticks_attr;
     attr.sample_period = ticks_period;
     fd_ticks_interrupt = start_counter(tid, -1, &attr);
+    if (!fd_ticks_just_cause.is_open()) {
+       struct perf_event_attr attr2 = rr::cycles_attr;
+       fd_ticks_just_cause = start_counter(tid, -1, &attr2);
+    }
     // Set up a separate counter for measuring ticks, which does not have
     // a sample period and does not count events during aborted transactions.
     // Note that the sample_period should *never* be set to zero. That should
diff --git a/src/PerfCounters.h b/src/PerfCounters.h
index 07dd56d..db215ad 100644
--- a/src/PerfCounters.h
+++ b/src/PerfCounters.h
@@ -105,6 +105,7 @@ private:
   ScopedFd fd_page_faults;
   ScopedFd fd_hw_interrupts;
   ScopedFd fd_instructions_retired;
+  ScopedFd fd_ticks_just_cause;
   bool started;
 };

Keno · 2016-11-27T07:14:03Z

Good news! Once all the pending pull requests are merged, the only remaining failure is flock, which is not KNL-specific (and I had opened #1908 for). I'll close this in favor of the more specific PRs and issues.

Keno closed this as completed Nov 27, 2016

Keno mentioned this issue Nov 28, 2016

KNL planning jeff-regier/Celeste.jl#391

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failing tests on KNL #1904

Failing tests on KNL #1904

Keno commented Nov 24, 2016 •

edited

Loading

Keno commented Nov 24, 2016 •

edited

Loading

Keno commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

Keno commented Nov 25, 2016

Keno commented Nov 25, 2016

Keno commented Nov 25, 2016

Keno commented Nov 27, 2016

Failing tests on KNL #1904

Failing tests on KNL #1904

Comments

Keno commented Nov 24, 2016 • edited Loading

Keno commented Nov 24, 2016 • edited Loading

Keno commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

rocallahan commented Nov 24, 2016

Keno commented Nov 24, 2016

Keno commented Nov 25, 2016

Keno commented Nov 25, 2016

Keno commented Nov 25, 2016

Keno commented Nov 27, 2016

Keno commented Nov 24, 2016 •

edited

Loading

Keno commented Nov 24, 2016 •

edited

Loading