flock test fails on lustrefs #1908

Open
Keno opened this issue Nov 26, 2016 · 53 comments

@Keno
Member

Keno commented Nov 26, 2016

> $HOME/rr-build/bin/rr record $HOME/rr-build/bin/flock
rr: Saving execution to trace directory `/global/homes/k/kfischer/.local/share/rr/flock-7'.
parent pid is 61556
sizeof(flock) = 32
before lock: type: 2, pid: 0
  after lock: type: 0, pid: 61556
sizeof(flock64) = 32
before lock: type: 2, pid: 0
  after GETLK: type: 1, pid: 61556
P: forcing child to block on LK, sleeping ...
FAILED: errno=4 (Interrupted system call)
flock: /global/homes/k/kfischer/rr/src/test/flock.c:98: main: Assertion `"FAILED: !" && check_cond(0 == err)' failed.
P: ... awake, releasing lock
FAILED: errno=0 (Success)
flock: /global/homes/k/kfischer/rr/src/test/flock.c:120: main: Assertion `"FAILED: !" && check_cond(((((__extension__ (((union { __typeof(status) __in; int __i; }) { .__in = (status) }).__i))) & 0x7f) == 0) && 0 == ((((__extension__ (((union { __typeof(status) __in; int __i; }) { .__in = (status) }).__i))) & 0xff00) >> 8))' failed.
Aborted
> stat --file-system --format=%T .
lustre

Works fine without rr as well as with -n.
Mount options are rw,flock,lazystatfs in case it makes a difference.

@Keno
Member Author

Keno commented Nov 27, 2016

lfs 2.7.1.11 in case it matters.

@rocallahan
Collaborator

I’m looking into the flock test failure on lustrefs. My best guess is that lustre doesn’t like us interrupting it with the desched signal. How do we usually handle that? Do we explicitly restart the system call somewhere?

We don't handle anything like that currently. The closest I've seen was when trying to add syscall buffering for epoll_wait, which I think hit the kernel bug referenced here: https://www.varnish-cache.org/lists/pipermail/varnish-commit/2016-January/014928.html. I ended up just abandoning that.

@rocallahan
Collaborator

The immediate problem that makes this hard to fix is knowing when the flock call would really complete if it exits with EINTR whenever we get descheduled. You could modify the syscallbuf to automatically retry with a traced flock after EINTR but of course that might be incorrect if the EINTR occurred legitimately. A hack that would work OK is to disable flock syscall buffering if lustrefs is mounted.

Probably should report a lustrefs bug in any case.
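For what it's worth, a minimal sketch of what that filesystem check could look like, assuming the usual fstatfs() route and Lustre's statfs f_type magic (0x0BD00BD0); this is not actual rr code, just an illustration:

#include <stdbool.h>
#include <sys/vfs.h>

/* Lustre's statfs f_type value (LL_SUPER_MAGIC). */
#define LUSTRE_SUPER_MAGIC 0x0BD00BD0

/* Return true if the file behind fd lives on Lustre, in which case
   flock/F_SETLKW buffering would be disabled and the syscall traced. */
static bool file_is_on_lustre(int fd) {
  struct statfs sfs;
  if (fstatfs(fd, &sfs) != 0) {
    return false; /* on error, keep the normal (buffered) path */
  }
  return sfs.f_type == LUSTRE_SUPER_MAGIC;
}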

@brianjmurrell

Does it make any difference if you change the flock mount option to localflock?

@Keno
Member Author

Keno commented Nov 27, 2016

Unfortunately, I cannot change any mount options on this system, so it's not easy for me to try that.

@Keno
Member Author

Keno commented Nov 27, 2016

I'm gonna try the KVM setup guide at http://wiki.lustre.org/KVM_Quick_Start_Guide, and see if I can get a dev setup running on one of my machines.

@verygreen

What's the Lustre version in use? Is there a way to get the test binary without having to pull in a lot of dependencies for building it myself?

@Keno
Member Author

Keno commented Nov 28, 2016

As far as I know the version mentioned above is the relevant version here (if not, please let me know what command to run). rr has no dependencies other than cmake, so just cloning this repository and doing a standard cmake build should be sufficient. I'd also be happy to do a build and zip up the result, but building from source should be very straightforward. The relevant command to try is in the original post (adjust the build location accordingly, of course). In my tests it deterministically failed when run in a lustre directory, but passed on all other file systems I tried (ext4, btrfs, gpfs, tmpfs). Also, I was unable to get a lustre dev setup running, so I did not try the suggestion above.

@verygreen

Ah, 2.7.1.11, hm, that's a strange version number; I guess that just means 2.7.1 or something close to it. (cat /proc/fs/lustre/version should be good enough.)

The filesystems you tried are all local. Did you try any other network filesystems, like, say, nfs4 (should be the easiest to set up)?

I am going to try this on my local lustre setup and see what happens.
What's the distro (I only care due to the kernel), rhel6? rhel7? ubuntu of some sort?

@Keno
Member Author

Keno commented Nov 28, 2016

As far as I'm aware gpfs is distributed. Distribution is Cray Linux, with the kernel based on 3.12 as far as I can tell:

kfischer@cori11:~> cat /proc/fs/lustre/version
lustre: 2.7.1.11
kernel: patchless_client
build:  2.7.1.11-trunk-1.0600.f2563c6.3.2-abuild-lustre-filesystem.git@f2563c6-2016-10-26-20:44
kfischer@cori11:~> uname -r
3.12.60-52.57.1.11767.0.PTF.996988-default

@verygreen

Ah, sles12 for Cray, I think.
Yes, gpfs is distributed, I just missed it in the list.
Thanks. I'll try it shortly to see what's going on here.

@verygreen

Also, Cray explains why the version is so strange. They roll in a bunch of their own patches, and their tree sometimes has significant departures from mainline.

@verygreen

hm, it does seem to be pulling in a bunch of stuff that I don't have on my test nodes.
Can I have just the 64-bit binary that would run on rhel7, please?

@paf-49

paf-49 commented Nov 28, 2016

Just sticking my oar in so I can get the binary as well... I'm on the Cray Lustre side. (Green is right about the version info ;) )

Also, if you wanted to report this to the Cray site staff so they can open a bug, that would be helpful too... (Even if Green or I figure out the problem and create a patch without you going through Cray, you'll need that bug so we can get the fix installed on Cori.)

@Keno
Member Author

Keno commented Nov 28, 2016

Ok, here's the build directory (built on Cori, so it's the same binary I've been using for tests): http://anubis.juliacomputing.io:8844/rr-build.tar.gz. Please let me know if it doesn't work (e.g. due to libc version, etc), in which case I'll spin up a CentOS machine and build one there.

Also, I may have been wrong about gpfs. It appears the compute nodes on Cori use Cray DataWarp for the home directory, so while the backing file system is gpfs, there's some Cray magic in there as well. I also tried running the test on the login node's home directory, which does appear to be pure gpfs (which is why I thought it would be the same on the compute nodes), and I was able to reproduce the same behavior I saw with lustre, so this may be a more general problem. My apologies for the incorrect information.

@paf-49

paf-49 commented Nov 28, 2016

Just out of curiosity, did the problem happen with GPFS-via-DataWarp?

Trying to execute it:
[root@cent7c01 bin]# ./rr record flock
rr: Saving execution to trace directory `/root/.local/share/rr/flock-2'.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:261:start_counter() errno: ENOENT] Unable to open performance counter with 'perf_event_open'; are perf events enabled? Try 'perf record'.

Perf record works on this system. Don't have referenced source, so can't easily dig further.

@Keno
Member Author

Keno commented Nov 28, 2016

Just out of curiosity, did the problem happen with GPFS-via-DataWarp?

No, that was fine, plain gpfs did appear to have the problem however.

Perf record works on this system. Don't have referenced source, so can't easily dig further.

This was built from unmodified master, so the sources are the same as those in this repository. What's in /proc/cpuinfo? Is any virtualization technique in use (some don't preserve the performance counters we need)? Also, can you check which counter failed by attaching gdb and getting me a backtrace?

@paf-49

paf-49 commented Nov 28, 2016

CentOS 7, VMWare ESXi.

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
stepping : 7
microcode : 0x70d
cpu MHz : 2194.711
cache size : 16384 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat epb pln pts dtherm xsaveopt
bogomips : 4389.42
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

I'll get a backtrace momentarily.

@Keno
Member Author

Keno commented Nov 28, 2016

Hmm, I thought VMWare was usually fine. Try perf list | grep "Hardware event"?

@paf-49

paf-49 commented Nov 28, 2016

[root@cent7c01 ~]# perf list | grep "Hardware event"
ref-cycles [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]

@paf-49

paf-49 commented Nov 28, 2016

GDB:

Starting program: /shared/paf/rr-build/bin/./rr record ./flock
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff66b9700 (LWP 2591)]
[New Thread 0x7ffff5bb7700 (LWP 2592)]
[New Thread 0x7ffff4db5700 (LWP 2593)]
[New Thread 0x7fffeffff700 (LWP 2594)]
[New Thread 0x7fffef7fe700 (LWP 2595)]
[New Thread 0x7fffeeffd700 (LWP 2596)]
[New Thread 0x7fffee7fc700 (LWP 2597)]
[New Thread 0x7fffedffb700 (LWP 2598)]
[New Thread 0x7fffed7fa700 (LWP 2599)]
rr: Saving execution to trace directory `/root/.local/share/rr/flock-8'.
Detaching after fork from child process 2600.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:261:start_counter() errno: ENOENT] Unable to open performance counter with 'perf_event_open'; are perf events enabled? Try 'perf record'.

Program received signal SIGABRT, Aborted.
0x00007ffff69f05f7 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x00007ffff69f05f7 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff69f1ce8 in __GI_abort () at abort.c:90
#2 0x00000000006a50a4 in rr::FatalOstream::~FatalOstream (this=0x7fffffffde59,
__in_chrg=) at /global/homes/k/kfischer/rr/src/log.cc:264
#3 0x00000000006b41da in rr::start_counter (tid=2600, group_fd=-1,
attr=0xa4d460 rr::cycles_attr)
at /global/homes/k/kfischer/rr/src/PerfCounters.cc:262
#4 0x00000000006b4844 in rr::PerfCounters::reset (this=0xa537d8, ticks_period=60253)
at /global/homes/k/kfischer/rr/src/PerfCounters.cc:346
#5 0x000000000076533b in rr::Task::resume_execution (this=0xa537b0,
how=rr::RESUME_SYSCALL, wait_how=rr::RESUME_NONBLOCKING, tick_period=60253, sig=0)
at /global/homes/k/kfischer/rr/src/Task.cc:938
#6 0x00000000006bef28 in rr::RecordSession::task_continue (this=0xa51560,
step_state=...) at /global/homes/k/kfischer/rr/src/RecordSession.cc:581
#7 0x00000000006c4da9 in rr::RecordSession::record_step (this=0xa51560)
at /global/homes/k/kfischer/rr/src/RecordSession.cc:1951
#8 0x00000000006bbd26 in rr::record (
args=std::vector of length 1, capacity 2 = {...}, flags=...)
at /global/homes/k/kfischer/rr/src/RecordCommand.cc:314
#9 0x00000000006bc179 in rr::RecordCommand::run (
this=0xa4d610 rr::RecordCommand::singleton,
args=std::vector of length 1, capacity 2 = {...})
at /global/homes/k/kfischer/rr/src/RecordCommand.cc:381
#10 0x000000000078545e in main (argc=3, argv=0x7fffffffe458)
at /global/homes/k/kfischer/rr/src/main.cc:270

@Keno
Member Author

Keno commented Nov 28, 2016

Thanks. The perf list output explains the problem. We need the retired branch counter, which does not seem to be available. Odd. I thought usually perf counters were pretty much an all-or-nothing deal when it comes to virtualization. Perhaps an old VMWare version? Any chance you could try on bare metal?
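If it helps narrow it down, here's a small standalone probe (not part of rr) that asks perf_event_open for the counter directly. The raw config 0x5101c4 is my understanding of the BR_INST_RETIRED.CONDITIONAL encoding rr uses on recent Intel parts, so treat that value as an assumption:

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_RAW;
  attr.config = 0x5101c4; /* retired conditional branches (assumed encoding) */
  attr.exclude_kernel = 1;
  /* this thread, any CPU, no group, no flags */
  int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd < 0) {
    perror("perf_event_open");
    return 1;
  }
  printf("retired-conditional-branches counter opened fine (fd=%d)\n", fd);
  close(fd);
  return 0;
}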

@paf-49

paf-49 commented Nov 28, 2016

grumble, grumble
Yes, it's a version issue. I'll spare you the boring details (just spent a few minutes trying to enable the feature).

I can try it on real hardware, but might have glibc issues again...

@paf-49

paf-49 commented Nov 28, 2016

On hardware (Cray SLES 12, probably not too different from Cori):
./rr record ./flock
rr: Saving execution to trace directory `/root/.local/share/rr/flock-0'.
[FATAL /global/homes/k/kfischer/rr/src/AddressSpace.cc:286:map_rr_page() errno: SUCCESS] (task 23650 (rec:23650) at time 14)
 -> Assertion `child_fd == -EACCES' failed to hold. Unexpected error mapping rr_page
Launch gdb with
gdb '-l' '10000' '-ex' 'target extended-remote :23650' /cray/css/u18/paf/shared/rr/flock

Sits there...

Tried gdb as suggested:
Reading symbols from /cray/css/u18/paf/shared/rr/flock...done.
Remote debugging using :32227
warning: limiting remote suggested packet size (17073526 bytes) to 16384
Remote connection closed
(gdb) quit

Output of rr:
rr: /global/homes/k/kfischer/rr/src/GdbConnection.cc:540: std::string rr::read_target_desc(const char*): Assertion `f' failed.
Aborted

@paf-49

paf-49 commented Nov 28, 2016

Note that I am able to execute the 'flock' binary correctly by itself.

@rocallahan
Collaborator

@Keno might it be easier for you to create a standalone testcase? It shouldn't be that hard; it probably doesn't even need ptrace. Just have one process call flock while another process sends it an ignored signal?
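Roughly this shape, maybe (just a sketch of the idea, not the actual rr test; the lock file name, signal choice and sleeps are arbitrary):

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  int fd = open("lockfile", O_RDWR | O_CREAT, 0600);
  if (fd < 0) { perror("open"); return 1; }

  /* Parent takes a write lock so the child's F_SETLKW has to block. */
  struct flock fl;
  memset(&fl, 0, sizeof(fl));
  fl.l_type = F_WRLCK;
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 4096;
  if (fcntl(fd, F_SETLK, &fl) != 0) { perror("F_SETLK"); return 1; }

  pid_t child = fork();
  if (child == 0) {
    /* Child: ignore the signal, then block waiting for the lock. */
    signal(SIGPWR, SIG_IGN);
    fl.l_type = F_RDLCK;
    if (fcntl(fd, F_SETLKW, &fl) != 0) {
      /* On the problematic filesystems this comes back with EINTR. */
      fprintf(stderr, "child: F_SETLKW failed: %s\n", strerror(errno));
      _exit(1);
    }
    _exit(0);
  }

  sleep(1);            /* let the child block in F_SETLKW */
  kill(child, SIGPWR); /* ignored signal; should not abort the wait */
  sleep(1);

  fl.l_type = F_UNLCK; /* release the lock so the child can finish */
  fcntl(fd, F_SETLK, &fl);

  int status = 0;
  waitpid(child, &status, 0);
  printf("child exit status: %d\n", WEXITSTATUS(status));
  return WEXITSTATUS(status);
}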

@verygreen

I am getting:

rr: Saving execution to trace directory `/root/.local/share/rr/flock-0'.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:156:get_cpu_microarch() errno: ENOTTY] CPU 0x620 unknown.
Aborted

@Keno
Member Author

Keno commented Nov 28, 2016

@rocallahan Yes, I had started on that, but I didn't quite manage to reproduce it. I was hoping that at this point we'd be robust enough to just have people run rr ;) - wishful thinking, I guess. I'll try again to make a standalone test case.

@paf-49

paf-49 commented Nov 28, 2016

When you say "ignored signal", do you mean "blocked in the signal mask", or ...?

@rocallahan
Collaborator

Just a signal with no handler and SIG_IGN disposition.
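To spell out the distinction (a snippet only, with SIGPWR chosen arbitrarily): ignoring sets a disposition, blocking sets a mask, and they behave differently for a task sleeping in a syscall:

#include <signal.h>
#include <stddef.h>

void ignore_vs_block(void) {
  /* Ignored: the kernel discards the signal; if it does interrupt an
     interruptible wait, the syscall is expected to be restarted
     transparently, so user space never sees EINTR. */
  struct sigaction sa;
  sa.sa_handler = SIG_IGN;
  sa.sa_flags = 0;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGPWR, &sa, NULL);

  /* Blocked: the signal stays pending and is not delivered at all until
     it is unblocked, so it cannot interrupt the syscall in the first place. */
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGPWR);
  sigprocmask(SIG_BLOCK, &set, NULL);
}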

@rocallahan
Collaborator

But apparently the problem is more complicated than that. Maybe you need ptrace as well.

@paf-49

paf-49 commented Nov 28, 2016

Huh. I wasn't familiar with SIG_IGN... Lustre (and perhaps GPFS too) does interruptible waits in some places, and (usually) returns -ERESTARTSYS, making the interrupted call restartable. Do you have SA_RESTART set? If not, then you'll get -EINTR on interrupts. (Can you even set that when you're not writing your own handler? I would think so)

@paf-49

paf-49 commented Nov 28, 2016

To clarify: When SA_RESTART is set, returning -ERESTARTSYS from Lustre causes the kernel to restart the syscall in question. If SA_RESTART is not set, then the kernel translates that to -EINTR and returns it back to userspace. (In certain cases Lustre returns -EINTR directly, causing SA_RESTART to be ignored)

@verygreen

Yes, I would think that if you get any sort of signal during a blocking system call, you might get EINTR, and you need to handle it one way or another.
Though flock is kind of a bad case, since it restarts the whole thing, and with RPCs potentially taking a long time it might never complete, spinning in the retry loop forever (this shouldn't even be lustre-specific, as long as the wait is interruptible).

@rocallahan
Collaborator

SA_RESTART only applies when you set a handler. Ignored signals should not cause a syscall to return EINTR.

@verygreen

Yes, I guess ignored signals really shouldn't, unless they couldn't be ignored or something.
I wonder what signal it is that's actually getting through.

@paf-49

paf-49 commented Nov 28, 2016

Hm. @rocallahan I'm digging through kernel source to look at the restart handling wrt SIG_IGN. I mean, if you interrupt a syscall, which needs to be possible, then it has to be restarted. We can't just wait uninterruptibly for network things which may not complete. (That's probably a difference between distributed file systems and local ones. I'm betting most local ones don't wait interruptibly.)

So, ignored is not the same thing as blocked. (Why not just block...?)

@rocallahan
Collaborator

rr arranges for an ignored SIGSTKFLT to be delivered pretty much every time a syscall blocks in the kernel. This does not cause any other syscalls to return EINTR (if it did, we'd be hitting this here bug all the time) and it doesn't cause EINTR for flock in common filesystems like ext4 or btrfs (or tests would be failing on those filesystems, which are tested often). That's why we suspect filesystem-specific kernel issues here.

We can't block the signal because rr needs to get a blocking notification (via ptrace) that the signal is being delivered. rr responds to that notification by resuming execution of the tracee, letting it complete the syscall (and meanwhile scheduling another tracee).
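(For the curious, the arrangement is roughly the following. This is only an illustrative standalone program using the generic perf/fasync plumbing, not rr's actual code, and the exact event and fcntl choices here are assumptions; it likely needs perf_event_paranoid <= 1 or root.)

#define _GNU_SOURCE /* for F_SETSIG */
#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
  signal(SIGSTKFLT, SIG_IGN); /* ignored, like the desched signal */

  /* Count context switches of this thread, "overflowing" on every one. */
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_SOFTWARE;
  attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
  attr.sample_period = 1;
  int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd < 0) { perror("perf_event_open"); return 1; }

  /* Ask for a signal on every overflow, i.e. whenever we get descheduled. */
  fcntl(fd, F_SETFL, O_ASYNC);
  fcntl(fd, F_SETSIG, SIGSTKFLT);
  fcntl(fd, F_SETOWN, getpid());

  sleep(1); /* blocking here deschedules us and raises the (ignored) signal */
  printf("slept through the ignored SIGSTKFLT\n");
  close(fd);
  return 0;
}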

@rocallahan
Collaborator

This does not cause any other syscalls to return EINTR (if it did, we'd be hitting this here bug all the time)

Correction, it doesn't cause EINTR in the set of syscalls that we have fast-paths for. And we do know that epoll_wait causes unwanted EINTRs.

@paf-49

paf-49 commented Nov 28, 2016

I think the "filesystem-specific" issue here is that we're interruptible, so we're getting interrupted. The syscall is getting interrupted. I suspect that's not true of btrfs or ext4, since they wait uninterruptibly, basically working on the assumption that nothing they're waiting for will fail to return in a reasonable time frame.

Being network file systems, Lustre and GPFS can't do this realistically. (There's a bit more to it, but that's the basic thing.)

Still poking around in the kernel to try to understand stuff here better...

@rocallahan
Collaborator

I suspect that's not true of btrfs or ext4, since they wait uninterruptibly, basically working on the assumption that nothing they're waiting for will fail to return in a reasonable time frame.

That may be true, but other interruptible syscalls (e.g. read on a pipe or socket) wait in such a way that they don't return EINTR for ignored signals.

@paf-49

paf-49 commented Nov 28, 2016

Hm, all right.

My digging around strongly suggests that in the case of SIG_IGN, the signal shouldn't truly be delivered at all. ptrace_signal is called to let it look at the attempted delivery, but after that, we don't call the actual handling code, which is where ERESTARTSYS is handled. (I think... It gets very hairy in here.)

(do_signal-->get_signal-->get_signal_to_deliver and then handle_signal)

Since we can't reproduce the problem yet, would @Keno be able to try (just for our information) setting sa_flags to SA_RESTART? (Since there is a struct sigaction when you're setting SIG_IGN.)
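I.e. something like this in the child before the blocking fcntl (sketch only; the signal number is whatever the test sends):

#include <signal.h>
#include <string.h>

static void ignore_with_restart(int sig) {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = SIG_IGN;
  sa.sa_flags = SA_RESTART; /* the flag in question */
  sigemptyset(&sa.sa_mask);
  sigaction(sig, &sa, NULL);
}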

@Keno
Member Author

Keno commented Nov 28, 2016

Ok try the following maybe: https://gist.github.com/Keno/f257142b20d212b94182058ecee363af
Results for me are as follows:
tmpfs/ext4/btrfs: ok
GPFS-via-DataWarp: ok
GPFS: Reproduces failure reported here
Lustre: Hangs

So not quite the same behavior as in the original test case on lustre, but close enough to reproduce the problem?

@paf-49

paf-49 commented Nov 29, 2016

Well, on Lustre, I'd call that hang "expected". When waiting, Lustre sometimes blocks everything and otherwise it blocks all signals it considers non-fatal. (The hang is the parent signalling the child (while the child is waiting for an fcntl call that will not complete, because the parent holds a conflicting lock) and then waiting for the child to change state. This state change doesn't happen because the signal is blocked.)

The signal state on the child process can be seen in /proc/<pid>/status (relevant lines pointed out below):
paf@crystal:/proc/12839> cat status
Name: a.out
State: S (sleeping)
Tgid: 12839
Ngid: 0
Pid: 12839
PPid: 12838
TracerPid: 12838
Uid: 10213 10213 10213 10213
Gid: 12790 12790 12790 12790
FDSize: 64
Groups: 1007 1013 1113 8001 12534 12584 12790
VmPeak: 4088 kB
VmSize: 4088 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 112 kB
VmRSS: 112 kB
VmData: 44 kB
VmStk: 164 kB
VmExe: 8 kB
VmLib: 1788 kB
VmPTE: 28 kB
VmSwap: 0 kB
Threads: 1
SigQ: 1/127523
SigPnd: 0000000000000000
ShdPnd: 0000000020000000 <--- PENDING
SigBlk: ffffffffffff9ef9 <--- BLOCKED
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
Seccomp: 0
Cpus_allowed: ffff
Cpus_allowed_list: 0-15
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 4
nonvoluntary_ctxt_switches: 2

If you switch to a signal that Lustre considers fatal and isn't blocking, the child will wake up (and die, since there's no handling set up).

So, a different bit of weirdness in signals and Lustre. Not the same one we see with rr.
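(In case it's useful for reading that dump, a throwaway decoder for the hex masks; this isn't part of the test:)

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "usage: %s <hex mask from /proc/<pid>/status>\n", argv[0]);
    return 1;
  }
  unsigned long long mask = strtoull(argv[1], NULL, 16);
  for (int sig = 1; sig <= 64; sig++) {
    if (mask & (1ULL << (sig - 1))) {
      printf("%d (%s)\n", sig, strsignal(sig));
    }
  }
  return 0;
}

The ShdPnd value above (20000000) decodes to signal 30, which matches the signal pending in the hang described, and the SigBlk mask shows that bit is among the ones Lustre has blocked for the wait.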

@paf-49

paf-49 commented Nov 29, 2016

Ah, after parsing through the various asserts, etc, it looks like sending SIGINT to the child will get the -1 and errno=EINTR behavior you mentioned.

Setting up a sigaction with SIG_IGN doesn't appear to do anything. So I think I've reproduced the problem(?)...

@paf-49

paf-49 commented Nov 29, 2016

OK, yes. When we're not ptraced, the child process exits when sent SIGINT, as I'd expect, and hangs when sent '30', also more or less as expected (since it's blocked).

When I set SIG_IGN as the action for SIGINT (on the child), it hangs. No -1 and EINTR returned, syscall doesn't exit. This is when not ptraced.

When ptraced, I get -1 and EINTR (on SIGINT), whether or not I've set up the sigaction. I.e., it seems like the sigaction is getting ignored when we're ptraced.

@paf-49

paf-49 commented Nov 29, 2016

So the question then, I suppose, is what's different about EXT4 and friends? I suspect it's something in the handling of waiting in TASK_UNINTERRUPTIBLE...

Because they must trigger the state change the parent is waiting for after sending the signal, without actually exiting the syscall. I'm guessing it's "successful delivery" when a signal is given to a process waiting in TASK_UNINTERRUPTIBLE (which doesn't care that you signalled it).

So... Lustre is interruptible. That's by design. It's getting interrupted...

And when ptraced, the sigaction seems to be ignored. That doesn't feel like a Lustre bug.

Thoughts?

@rocallahan
Collaborator

Getting EINTR in this testcase is a bug, because the ptracer never delivers the signal to the tracee. I don't know which part of the kernel to blame.

hangs when sent '30', also more or less as expected (since it's blocked).

Blocked by what? User-space hasn't blocked it.

I don't think we should invest too much in trying to determine whether this is a kernel bug or not. We generally work around kernel bugs anyway, and I think it would be effective to simply detect if certain filesystems are mounted and disable buffering of flocks in those cases, which should have no real downsides given those syscalls are already expensive on these filesystems.

@paf-49

paf-49 commented Nov 29, 2016

Blocked by what? User-space hasn't blocked it.

Blocked by Lustre.

And I'm certainly happy to let it go with a workaround, if that works for rr. (My area is Lustre and the kernel, so I'm inclined to dig deep.) I may keep investigating in the interests of fixing the ptrace/Lustre interaction, but I certainly don't have to do it here.

@Keno
Member Author

Keno commented Nov 29, 2016

If you're willing to keep looking into the lustre/kernel side of things, I'd certainly be happy to keep helping out any way I can. We should of course still investigate the possibility of a workaround, to fix this for the immediate future (and even if we find a kernel fix to fix it for older versions).

@rocallahan
Collaborator

Blocked by Lustre.

OK. Blocking signal delivery indefinitely when the signal wasn't blocked by user-space seems like a bug to me. If it's infeasible to fix, or just not worth fixing, then OK. Obviously for rr it's not important since we only care about the ptrace case.

I may keep investigating in the interests of fixing the ptrace/Lustre interaction, but I certainly don't have to do it here.

Feel free to do it here :-). Generally I'm keen to get upstream bugs reported and fixed. I just meant this is not an issue where we need to struggle mightily to reach a consensus, if we're having difficulty reaching one.

@paf-49

paf-49 commented Nov 29, 2016

OK. Blocking signal delivery indefinitely when the signal wasn't blocked by user-space seems like a bug to me. If it's infeasible to fix, or just not worth fixing, then OK.

Well, Oleg (who's one of the core Lustre developers; I'm on the periphery) might have an opinion here, but basically: Lustre must wait, and it shouldn't do so uninterruptibly (bad form, since some of our network timeouts are measured in minutes, and people should be able to kill programs stuck in those waits), but it also doesn't want to stop waiting for signals that aren't intended to be fatal. So they came up with a mask of signals that they think people will consider critical/fatal, the sort of signals people expect to interrupt and stop a program, and they block everything else. But, SIGKILL aside, that's inherently arbitrary.

It's ugly, but I don't see a better solution.

That's kind of neither here nor there, though. FWIW, when you have SIG_IGN set and are not ptracing, this code hangs on non-Lustre filesystems. The SIG_IGN seems to mean we don't get the expected state transition that the parent is waiting for, so it never unlocks. When ptracing, SIG_IGN is, well, ignored.

@rocallahan
Collaborator

FWIW, when you have SIG_IGN set and are not ptracing, this code hangs on non-Lustre filesystems.

I don't see this. Testcase:
https://gist.github.com/rocallahan/6b557e6f0a63bf00b586f58187af8927
Completes normally for me with btrfs.

Lustre must wait, it shouldn't do it uninterruptibly (bad form, since some of our network timeouts are in minutes, and people should be able to kill programs that are in those waits), but it doesn't want to stop waiting for signals that aren't intended to be fatal.

I believe the correct behavior here would be for signals to interrupt the system call, leaving it in a restartable state, so that if the signal is ignored or there is a signal handler which returns, the system call restarts and can complete normally. If you strace -f my testcase then you should see that's exactly what happens:

[pid 13994] fcntl(3, F_SETLKW, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=0, l_len=4096} <unfinished ...>
[pid 13993] <... nanosleep resumed> NULL) = 0
[pid 13993] kill(13994, SIGPWR)         = 0
[pid 13994] <... fcntl resumed> )       = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 13994] --- SIGPWR {si_signo=SIGPWR, si_code=SI_USER, si_pid=13993, si_uid=1000} ---
[pid 13993] nanosleep({0, 500000000},  <unfinished ...>
[pid 13994] fcntl(3, F_SETLKW, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=0, l_len=4096} <unfinished ...>

I understand that might be difficult to implement. Though, for flock it seems like all you'd have to do is abandon taking the lock and prepare the syscall for restart, so the restarted syscall just tries again from scratch to take the lock.
