flock test fails on lustrefs #1908

Open
Keno opened this issue Nov 26, 2016 · 53 comments

@Keno
Member

Keno commented Nov 26, 2016

> $HOME/rr-build/bin/rr record $HOME/rr-build/bin/flock
rr: Saving execution to trace directory `/global/homes/k/kfischer/.local/share/rr/flock-7'.
parent pid is 61556
sizeof(flock) = 32
before lock: type: 2, pid: 0
  after lock: type: 0, pid: 61556
sizeof(flock64) = 32
before lock: type: 2, pid: 0
  after GETLK: type: 1, pid: 61556
P: forcing child to block on LK, sleeping ...
FAILED: errno=4 (Interrupted system call)
flock: /global/homes/k/kfischer/rr/src/test/flock.c:98: main: Assertion `"FAILED: !" && check_cond(0 == err)' failed.
P: ... awake, releasing lock
FAILED: errno=0 (Success)
flock: /global/homes/k/kfischer/rr/src/test/flock.c:120: main: Assertion `"FAILED: !" && check_cond(((((__extension__ (((union { __typeof(status) __in; int __i; }) { .__in = (status) }).__i))) & 0x7f) == 0) && 0 == ((((__extension__ (((union { __typeof(status) __in; int __i; }) { .__in = (status) }).__i))) & 0xff00) >> 8))' failed.
Aborted
> stat --file-system --format=%T .
lustre

Works fine without rr as well as with -n.
Mount options are rw,flock,lazystatfs in case it makes a difference.

@Keno
Member Author

Keno commented Nov 27, 2016

lfs 2.7.1.11 in case it matters.

@rocallahan
Collaborator

I’m looking into the flock test failure on lustrefs. My best guess is that lustre doesn’t like us interrupting it with the desched signal. How do we usually handle that? Do we explicitly restart the system call somewhere?

We don't handle anything like that currently. The closest I've seen was when trying to add syscall buffering for epoll_wait, which I think hit the kernel bug referenced here: https://www.varnish-cache.org/lists/pipermail/varnish-commit/2016-January/014928.html. I ended up just abandoning that.

@rocallahan
Collaborator

The immediate problem that makes this hard to fix is knowing when the flock call would really complete if it exits with EINTR whenever we get descheduled. You could modify the syscallbuf to automatically retry with a traced flock after EINTR but of course that might be incorrect if the EINTR occurred legitimately. A hack that would work OK is to disable flock syscall buffering if lustrefs is mounted.

Probably should report a lustrefs bug in any case.
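For what it's worth, a minimal sketch of what that filesystem check could look like, assuming the usual fstatfs() route and Lustre's statfs f_type magic (0x0BD00BD0); this is not actual rr code, just an illustration:

#include <stdbool.h>
#include <sys/vfs.h>

/* Lustre's statfs f_type value (LL_SUPER_MAGIC). */
#define LUSTRE_SUPER_MAGIC 0x0BD00BD0

/* Return true if the file behind fd lives on Lustre, in which case
   flock/F_SETLKW buffering would be disabled and the syscall traced. */
static bool file_is_on_lustre(int fd) {
  struct statfs sfs;
  if (fstatfs(fd, &sfs) != 0) {
    return false; /* on error, keep the normal (buffered) path */
  }
  return sfs.f_type == LUSTRE_SUPER_MAGIC;
}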

@brianjmurrell

Does it make any difference if you change the flock mount option to localflock?

@Keno
Member Author

Keno commented Nov 27, 2016

Unfortunately, I cannot change any mount options on this system, so it's not easy for me to try that.

@Keno
Member Author

Keno commented Nov 27, 2016

I'm gonna try the KVM setup guide at http://wiki.lustre.org/KVM_Quick_Start_Guide, and see if I can get a dev setup running on one of my machines.

@verygreen

What's the Lustre version in use? Is there a way to get the test binary without having to pull in a lot of dependencies for building it myself?

@Keno
Member Author

Keno commented Nov 28, 2016

As far as I know the version mentioned above is the relevant version here (if not, please let me know what command to run). rr has no dependencies other than cmake, so just cloning this repository and doing a standard cmake build should be sufficient. I'd also be happy to do a build and zip up the result, but building from source should be very straightforward. The relevant command to try is in the original post (adjust the build location accordingly, of course). In my tests it deterministically failed when run in a lustre directory, but passed on all other file systems I tried (ext4, btrfs, gpfs, tmpfs). Also, I was unable to get a lustre dev setup running, so I did not try the suggestion above.

@verygreen

Ah, 2.7.1.11, hm, that's a strange version number; I guess that just means 2.7.1 or something close to it. (cat /proc/fs/lustre/version should be good enough.)

The filesystems you tried are all local. Did you try any other network filesystems, like, say, nfs4 (should be the easiest to set up)?

I am going to try this on my local lustre setup and see what happens.
What's the distro (I only care due to the kernel), rhel6? rhel7? ubuntu of some sort?

@Keno
Member Author

Keno commented Nov 28, 2016

As far as I'm aware gpfs is distributed. Distribution is Cray Linux, with the kernel based on 3.12 as far as I can tell:

kfischer@cori11:~> cat /proc/fs/lustre/version
lustre: 2.7.1.11
kernel: patchless_client
build:  2.7.1.11-trunk-1.0600.f2563c6.3.2-abuild-lustre-filesystem.git@f2563c6-2016-10-26-20:44
kfischer@cori11:~> uname -r
3.12.60-52.57.1.11767.0.PTF.996988-default

@verygreen

Ah, sles12 for Cray, I think.
Yes, gpfs is distributed, I just missed it in the list.
Thanks. I'll try it shortly to see what's going on here.

@verygreen

Also, Cray explains why the version is so strange. They roll in a bunch of their own patches, and their tree sometimes has significant departures from mainline.

@verygreen

hm, it does seem to be pulling in a bunch of stuff that I don't have on my test nodes.
Can I have just the 64-bit binary that would run on rhel7, please?

@paf-49

paf-49 commented Nov 28, 2016

Just sticking my oar in so I can get the binary as well... I'm on the Cray Lustre side. (Green is right about the version info ;) )

Also, if you wanted to report this to the Cray site staff so they can open a bug, that would be helpful too... (Even if Green or I figure out the problem and create a patch without you going through Cray, you'll need that bug so we can get the fix installed on Cori.)

@Keno
Member Author

Keno commented Nov 28, 2016

Ok, here's the build directory (built on Cori, so it's the same binary I've been using for tests): http://anubis.juliacomputing.io:8844/rr-build.tar.gz. Please let me know if it doesn't work (e.g. due to libc version, etc), in which case I'll spin up a CentOS machine and build one there.

Also, I may have been wrong about gpfs. It appears the compute nodes on Cori use Cray DataWarp for the home directory, so while the backing file system is gpfs, there's some Cray magic in there as well. I also tried running the test on the login node's home directory, which does appear to be pure gpfs (which is why I thought it would be the same on the compute nodes), and I was able to reproduce the same behavior I saw with lustre, so this may be a more general problem. My apologies for the incorrect information.

@paf-49

paf-49 commented Nov 28, 2016

Just out of curiosity, did the problem happen with GPFS-via-DataWarp?

Trying to execute it:
[root@cent7c01 bin]# ./rr record flock
rr: Saving execution to trace directory `/root/.local/share/rr/flock-2'.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:261:start_counter() errno: ENOENT] Unable to open performance counter with 'perf_event_open'; are perf events enabled? Try 'perf record'.

Perf record works on this system. Don't have referenced source, so can't easily dig further.

@Keno
Member Author

Keno commented Nov 28, 2016

Just out of curiosity, did the problem happen with GPFS-via-DataWarp?

No, that was fine, plain gpfs did appear to have the problem however.

Perf record works on this system. Don't have referenced source, so can't easily dig further.

This was built from unmodified master, so the sources are the same as those in this repository. What's in /proc/cpuinfo? Is any virtualization technique in use (some don't preserve the performance counters we need)? Also, can you check which counter failed by attaching gdb and getting me a backtrace?

@paf-49

paf-49 commented Nov 28, 2016

CentOS 7, VMWare ESXi.

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
stepping : 7
microcode : 0x70d
cpu MHz : 2194.711
cache size : 16384 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat epb pln pts dtherm xsaveopt
bogomips : 4389.42
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

I'll get a backtrace momentarily.

@Keno
Member Author

Keno commented Nov 28, 2016

Hmm, I thought VMWare was usually fine. Try perf list | grep "Hardware event"?

@paf-49

paf-49 commented Nov 28, 2016

[root@cent7c01 ~]# perf list | grep "Hardware event"
ref-cycles [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]

@paf-49

paf-49 commented Nov 28, 2016

GDB:

Starting program: /shared/paf/rr-build/bin/./rr record ./flock
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff66b9700 (LWP 2591)]
[New Thread 0x7ffff5bb7700 (LWP 2592)]
[New Thread 0x7ffff4db5700 (LWP 2593)]
[New Thread 0x7fffeffff700 (LWP 2594)]
[New Thread 0x7fffef7fe700 (LWP 2595)]
[New Thread 0x7fffeeffd700 (LWP 2596)]
[New Thread 0x7fffee7fc700 (LWP 2597)]
[New Thread 0x7fffedffb700 (LWP 2598)]
[New Thread 0x7fffed7fa700 (LWP 2599)]
rr: Saving execution to trace directory `/root/.local/share/rr/flock-8'.
Detaching after fork from child process 2600.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:261:start_counter() errno: ENOENT] Unable to open performance counter with 'perf_event_open'; are perf events enabled? Try 'perf record'.

Program received signal SIGABRT, Aborted.
0x00007ffff69f05f7 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x00007ffff69f05f7 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff69f1ce8 in __GI_abort () at abort.c:90
#2 0x00000000006a50a4 in rr::FatalOstream::~FatalOstream (this=0x7fffffffde59,
__in_chrg=) at /global/homes/k/kfischer/rr/src/log.cc:264
#3 0x00000000006b41da in rr::start_counter (tid=2600, group_fd=-1,
attr=0xa4d460 rr::cycles_attr)
at /global/homes/k/kfischer/rr/src/PerfCounters.cc:262
#4 0x00000000006b4844 in rr::PerfCounters::reset (this=0xa537d8, ticks_period=60253)
at /global/homes/k/kfischer/rr/src/PerfCounters.cc:346
#5 0x000000000076533b in rr::Task::resume_execution (this=0xa537b0,
how=rr::RESUME_SYSCALL, wait_how=rr::RESUME_NONBLOCKING, tick_period=60253, sig=0)
at /global/homes/k/kfischer/rr/src/Task.cc:938
#6 0x00000000006bef28 in rr::RecordSession::task_continue (this=0xa51560,
step_state=...) at /global/homes/k/kfischer/rr/src/RecordSession.cc:581
#7 0x00000000006c4da9 in rr::RecordSession::record_step (this=0xa51560)
at /global/homes/k/kfischer/rr/src/RecordSession.cc:1951
#8 0x00000000006bbd26 in rr::record (
args=std::vector of length 1, capacity 2 = {...}, flags=...)
at /global/homes/k/kfischer/rr/src/RecordCommand.cc:314
#9 0x00000000006bc179 in rr::RecordCommand::run (
this=0xa4d610 rr::RecordCommand::singleton,
args=std::vector of length 1, capacity 2 = {...})
at /global/homes/k/kfischer/rr/src/RecordCommand.cc:381
#10 0x000000000078545e in main (argc=3, argv=0x7fffffffe458)
at /global/homes/k/kfischer/rr/src/main.cc:270

@Keno
Member Author

Keno commented Nov 28, 2016

Thanks. The perf list output explains the problem. We need the retired branch counter, which does not seem to be available. Odd. I thought usually perf counters were pretty much an all-or-nothing deal when it comes to virtualization. Perhaps an old VMWare version? Any chance you could try on bare metal?
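If it helps narrow it down, here's a small standalone probe (not part of rr) that asks perf_event_open for the counter directly. The raw config 0x5101c4 is my understanding of the BR_INST_RETIRED.CONDITIONAL encoding rr uses on recent Intel parts, so treat that value as an assumption:

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_RAW;
  attr.config = 0x5101c4; /* retired conditional branches (assumed encoding) */
  attr.exclude_kernel = 1;
  /* this thread, any CPU, no group, no flags */
  int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd < 0) {
    perror("perf_event_open");
    return 1;
  }
  printf("retired-conditional-branches counter opened fine (fd=%d)\n", fd);
  close(fd);
  return 0;
}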

@paf-49

paf-49 commented Nov 28, 2016

grumble, grumble
Yes, it's a version issue. I'll spare you the boring details (just spent a few minutes trying to enable the feature).

I can try it on real hardware, but might have glibc issues again...

@paf-49

paf-49 commented Nov 28, 2016

On hardware (Cray SLES 12, probably not too different from Cori):
./rr record ./flock
rr: Saving execution to trace directory `/root/.local/share/rr/flock-0'.
[FATAL /global/homes/k/kfischer/rr/src/AddressSpace.cc:286:map_rr_page() errno: SUCCESS] (task 23650 (rec:23650) at time 14)
 -> Assertion `child_fd == -EACCES' failed to hold. Unexpected error mapping rr_page
Launch gdb with
gdb '-l' '10000' '-ex' 'target extended-remote :23650' /cray/css/u18/paf/shared/rr/flock

Sits there...

Tried gdb as suggested:
Reading symbols from /cray/css/u18/paf/shared/rr/flock...done.
Remote debugging using :32227
warning: limiting remote suggested packet size (17073526 bytes) to 16384
Remote connection closed
(gdb) quit

Output of rr:
rr: /global/homes/k/kfischer/rr/src/GdbConnection.cc:540: std::string rr::read_target_desc(const char*): Assertion `f' failed.
Aborted

@paf-49

paf-49 commented Nov 28, 2016

Note that I am able to execute the 'flock' binary correctly by itself.

@rocallahan
Collaborator

@Keno might it be easier for you to create a standalone testcase? It shouldn't be that hard; it probably doesn't even need ptrace. Just have one process call flock while another process sends it an ignored signal?
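Roughly this shape, maybe (just a sketch of the idea, not the actual rr test; the lock file name, signal choice and sleeps are arbitrary):

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  int fd = open("lockfile", O_RDWR | O_CREAT, 0600);
  if (fd < 0) { perror("open"); return 1; }

  /* Parent takes a write lock so the child's F_SETLKW has to block. */
  struct flock fl;
  memset(&fl, 0, sizeof(fl));
  fl.l_type = F_WRLCK;
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 4096;
  if (fcntl(fd, F_SETLK, &fl) != 0) { perror("F_SETLK"); return 1; }

  pid_t child = fork();
  if (child == 0) {
    /* Child: ignore the signal, then block waiting for the lock. */
    signal(SIGPWR, SIG_IGN);
    fl.l_type = F_RDLCK;
    if (fcntl(fd, F_SETLKW, &fl) != 0) {
      /* On the problematic filesystems this comes back with EINTR. */
      fprintf(stderr, "child: F_SETLKW failed: %s\n", strerror(errno));
      _exit(1);
    }
    _exit(0);
  }

  sleep(1);            /* let the child block in F_SETLKW */
  kill(child, SIGPWR); /* ignored signal; should not abort the wait */
  sleep(1);

  fl.l_type = F_UNLCK; /* release the lock so the child can finish */
  fcntl(fd, F_SETLK, &fl);

  int status = 0;
  waitpid(child, &status, 0);
  printf("child exit status: %d\n", WEXITSTATUS(status));
  return WEXITSTATUS(status);
}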

@verygreen

I am getting:

rr: Saving execution to trace directory `/root/.local/share/rr/flock-0'.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:156:get_cpu_microarch() errno: ENOTTY] CPU 0x620 unknown.
Aborted

@Keno
Member Author

Keno commented Nov 28, 2016

@rocallahan Yes, I had started on that, but I didn't quite manage to reproduce it. I was hoping that at this point we'd be robust enough to just have people run rr ;) - wishful thinking, I guess. I'll try again to make a standalone test case.

@paf-49

paf-49 commented Nov 28, 2016

When you say "ignored signal", do you mean "blocked in the signal mask", or ...?

@rocallahan
Collaborator

Just a signal with no handler and SIG_IGN disposition.
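To spell out the distinction (a snippet only, with SIGPWR chosen arbitrarily): ignoring sets a disposition, blocking sets a mask, and they behave differently for a task sleeping in a syscall:

#include <signal.h>
#include <stddef.h>

void ignore_vs_block(void) {
  /* Ignored: the kernel discards the signal; if it does interrupt an
     interruptible wait, the syscall is expected to be restarted
     transparently, so user space never sees EINTR. */
  struct sigaction sa;
  sa.sa_handler = SIG_IGN;
  sa.sa_flags = 0;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGPWR, &sa, NULL);

  /* Blocked: the signal stays pending and is not delivered at all until
     it is unblocked, so it cannot interrupt the syscall in the first place. */
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGPWR);
  sigprocmask(SIG_BLOCK, &set, NULL);
}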

@rocallahan
Collaborator

But apparently the problem is more complicated than that. Maybe you need ptrace as well.

@paf-49

paf-49 commented Nov 28, 2016

Huh. I wasn't familiar with SIG_IGN... Lustre (and perhaps GPFS too) does interruptible waits in some places, and (usually) returns -ERESTARTSYS, making the interrupted call restartable. Do you have SA_RESTART set? If not, then you'll get -EINTR on interrupts. (Can you even set that when you're not writing your own handler? I would think so)

@paf-49

paf-49 commented Nov 28, 2016

To clarify: When SA_RESTART is set, returning -ERESTARTSYS from Lustre causes the kernel to restart the syscall in question. If SA_RESTART is not set, then the kernel translates that to -EINTR and returns it back to userspace. (In certain cases Lustre returns -EINTR directly, causing SA_RESTART to be ignored)

@verygreen

Yes, I would think that if you get any sort of signal during a blocking system call, you might get EINTR, and you need to handle it one way or another.
Though flock is kind of a bad case, since it restarts the whole thing, and with RPCs potentially taking a long time it might never complete, spinning in the retry loop forever (this shouldn't even be lustre-specific, as long as the wait is interruptible).

@rocallahan
Collaborator

SA_RESTART only applies when you set a handler. Ignored signals should not cause a syscall to return EINTR.

@verygreen

Yes, I guess ignored signals really shouldn't, unless they couldn't be ignored or something.
I wonder what signal it is that's actually getting through.

@paf-49

paf-49 commented Nov 28, 2016

Hm. @rocallahan I'm digging through kernel source to look at the restart handling wrt SIG_IGN. I mean, if you interrupt a syscall, which needs to be possible, then it has to be restarted. We can't just wait uninterruptibly for network things which may not complete. (That's probably a difference between distributed file systems and local ones. I'm betting most local ones don't wait interruptibly.)

So, ignored is not the same thing as blocked. (Why not just block...?)

@rocallahan
Collaborator

rr arranges for an ignored SIGSTKFLT to be delivered pretty much every time a syscall blocks in the kernel. This does not cause any other syscalls to return EINTR (if it did, we'd be hitting this here bug all the time) and it doesn't cause EINTR for flock in common filesystems like ext4 or btrfs (or tests would be failing on those filesystems, which are tested often). That's why we suspect filesystem-specific kernel issues here.

We can't block the signal because rr needs to get a blocking notification (via ptrace) that the signal is being delivered. rr responds to that notification by resuming execution of the tracee, letting it complete the syscall (and meanwhile scheduling another tracee).
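(For the curious, the arrangement is roughly the following. This is only an illustrative standalone program using the generic perf/fasync plumbing, not rr's actual code, and the exact event and fcntl choices here are assumptions; it likely needs perf_event_paranoid <= 1 or root.)

#define _GNU_SOURCE /* for F_SETSIG */
#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
  signal(SIGSTKFLT, SIG_IGN); /* ignored, like the desched signal */

  /* Count context switches of this thread, "overflowing" on every one. */
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_SOFTWARE;
  attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
  attr.sample_period = 1;
  int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd < 0) { perror("perf_event_open"); return 1; }

  /* Ask for a signal on every overflow, i.e. whenever we get descheduled. */
  fcntl(fd, F_SETFL, O_ASYNC);
  fcntl(fd, F_SETSIG, SIGSTKFLT);
  fcntl(fd, F_SETOWN, getpid());

  sleep(1); /* blocking here deschedules us and raises the (ignored) signal */
  printf("slept through the ignored SIGSTKFLT\n");
  close(fd);
  return 0;
}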

@rocallahan
Collaborator

This does not cause any other syscalls to return EINTR (if it did, we'd be hitting this here bug all the time)

Correction, it doesn't cause EINTR in the set of syscalls that we have fast-paths for. And we do know that epoll_wait causes unwanted EINTRs.

@paf-49

paf-49 commented Nov 28, 2016

I think the "filesystem-specific" issue here is that we're interruptible, so we're getting interrupted. The syscall is getting interrupted. I suspect that's not true of btrfs or ext4, since they wait uninterruptibly, basically working on the assumption that nothing they're waiting for will fail to return in a reasonable time frame.

Being network file systems, Lustre and GPFS can't do this realistically. (There's a bit more to it, but that's the basic thing.)

Still poking around in the kernel to try to understand stuff here better...

@rocallahan
Collaborator

I suspect that's not true of btrfs or ext4, since they wait uninterruptibly, basically working on the assumption that nothing they're waiting for will fail to return in a reasonable time frame.

That may be true, but other interruptible syscalls (e.g. read on a pipe or socket) wait in such a way that they don't return EINTR for ignored signals.

@paf-49

paf-49 commented Nov 28, 2016

Hm, all right.

My digging around strongly suggests that in the case of SIG_IGN, the signal shouldn't truly be delivered at all. ptrace_signal is called to let it look at the attempted delivery, but after that, we don't call the actual handling code, which is where ERESTARTSYS is handled. (I think... It gets very hairy in here.)

(do_signal-->get_signal-->get_signal_to_deliver and then handle_signal)

Since we can't reproduce the problem yet, would @Keno be able to try (just for our information) setting sa_flags to SA_RESTART? (Since there is a struct sigaction when you're setting SIG_IGN.)
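I.e. something like this in the child before the blocking fcntl (sketch only; the signal number is whatever the test sends):

#include <signal.h>
#include <string.h>

static void ignore_with_restart(int sig) {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = SIG_IGN;
  sa.sa_flags = SA_RESTART; /* the flag in question */
  sigemptyset(&sa.sa_mask);
  sigaction(sig, &sa, NULL);
}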

@Keno
Member Author

Keno commented Nov 28, 2016

Ok try the following maybe: https://gist.github.com/Keno/f257142b20d212b94182058ecee363af
Results for me are as follows:
tmpfs/ext4/btrfs: ok
GPFS-via-DataWarp: ok
GPFS: Reproduces failure reported here
Lustre: Hangs

So not quite the same behavior as in the original test case on lustre, but close enough to reproduce the problem?

@paf-49

paf-49 commented Nov 29, 2016

Well, on Lustre, I'd call that hang "expected". When waiting, Lustre sometimes blocks everything and otherwise it blocks all signals it considers non-fatal. (The hang is the parent signalling the child (while the child is waiting for an fcntl call that will not complete, because the parent holds a conflicting lock) and then waiting for the child to change state. This state change doesn't happen because the signal is blocked.)

The signal state on the child process can be seen in /proc/<pid>/status (relevant lines pointed out below):
paf@crystal:/proc/12839> cat status
Name: a.out
State: S (sleeping)
Tgid: 12839
Ngid: 0
Pid: 12839
PPid: 12838
TracerPid: 12838
Uid: 10213 10213 10213 10213
Gid: 12790 12790 12790 12790
FDSize: 64
Groups: 1007 1013 1113 8001 12534 12584 12790
VmPeak: 4088 kB
VmSize: 4088 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 112 kB
VmRSS: 112 kB
VmData: 44 kB
VmStk: 164 kB
VmExe: 8 kB
VmLib: 1788 kB
VmPTE: 28 kB
VmSwap: 0 kB
Threads: 1
SigQ: 1/127523
SigPnd: 0000000000000000
ShdPnd: 0000000020000000 <--- PENDING
SigBlk: ffffffffffff9ef9 <--- BLOCKED
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
Seccomp: 0
Cpus_allowed: ffff
Cpus_allowed_list: 0-15
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 4
nonvoluntary_ctxt_switches: 2

If you switch to a signal that Lustre considers fatal and isn't blocking, the child will wake up (and die, since there's no handling set up).

So, a different bit of weirdness in signals and Lustre. Not the same one we see with rr.
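(In case it's useful for reading that dump, a throwaway decoder for the hex masks; this isn't part of the test:)

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "usage: %s <hex mask from /proc/<pid>/status>\n", argv[0]);
    return 1;
  }
  unsigned long long mask = strtoull(argv[1], NULL, 16);
  for (int sig = 1; sig <= 64; sig++) {
    if (mask & (1ULL << (sig - 1))) {
      printf("%d (%s)\n", sig, strsignal(sig));
    }
  }
  return 0;
}

The ShdPnd value above (20000000) decodes to signal 30, which matches the signal pending in the hang described, and the SigBlk mask shows that bit is among the ones Lustre has blocked for the wait.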

@paf-49

paf-49 commented Nov 29, 2016

Ah, after parsing through the various asserts, etc, it looks like sending SIGINT to the child will get the -1 and errno=EINTR behavior you mentioned.

Setting up a sigaction with SIG_IGN doesn't appear to do anything. So I think I've reproduced the problem(?)...

@paf-49

paf-49 commented Nov 29, 2016

OK, yes. When we're not ptraced, the child process exits when sent SIGINT, as I'd expect, and hangs when sent '30', also more or less as expected (since it's blocked).

When I set SIG_IGN as the action for SIGINT (on the child), it hangs. No -1 and EINTR returned, syscall doesn't exit. This is when not ptraced.

When ptraced, I get -1 and EINTR (on SIGINT), whether or not I've set up the sigaction. I.e., it seems like the sigaction is getting ignored when we're ptraced.

@paf-49

paf-49 commented Nov 29, 2016

So the question then, I suppose, is what's different about EXT4 and friends? I suspect it's something in the handling of waiting in TASK_UNINTERRUPTIBLE...

Because they must trigger the state change the parent is waiting for after sending the signal, without actually exiting the syscall. I'm guessing it's "successful delivery" when a signal is given to a process waiting in TASK_UNINTERRUPTIBLE (which doesn't care that you signalled it).

So... Lustre is interruptible. That's by design. It's getting interrupted...

And when ptraced, the sigaction seems to be ignored. That doesn't feel like a Lustre bug.

Thoughts?

@rocallahan
Collaborator

Getting EINTR in this testcase is a bug, because the ptracer never delivers the signal to the tracee. I don't know which part of the kernel to blame.

hangs when sent '30', also more or less as expected (since it's blocked).

Blocked by what? User-space hasn't blocked it.

I don't think we should invest too much in trying to determine whether this is a kernel bug or not. We generally work around kernel bugs anyway, and I think it would be effective to simply detect if certain filesystems are mounted and disable buffering of flocks in those cases, which should have no real downsides given those syscalls are already expensive on these filesystems.

@paf-49

paf-49 commented Nov 29, 2016

Blocked by what? User-space hasn't blocked it.

Blocked by Lustre.

And I'm certainly happy to let it go with a workaround, if that works for rr. (My area is Lustre and the kernel, so I'm inclined to dig deep.) I may keep investigating in the interests of fixing the ptrace/Lustre interaction, but I certainly don't have to do it here.

@Keno
Member Author

Keno commented Nov 29, 2016

If you're willing to keep looking into the lustre/kernel side of things, I'd certainly be happy to keep helping out any way I can. We should of course still investigate the possibility of a workaround, to fix this for the immediate future (and even if we find a kernel fix to fix it for older versions).

@rocallahan
Collaborator

Blocked by Lustre.

OK. Blocking signal delivery indefinitely when the signal wasn't blocked by user-space seems like a bug to me. If it's infeasible to fix, or just not worth fixing, then OK. Obviously for rr it's not important since we only care about the ptrace case.

I may keep investigating in the interests of fixing the ptrace/Lustre interaction, but I certainly don't have to do it here.

Feel free to do it here :-). Generally I'm keen to get upstream bugs reported and fixed. I just meant this is not an issue where we need to struggle mightily to reach a consensus, if we're having difficulty reaching one.

@paf-49

paf-49 commented Nov 29, 2016

OK. Blocking signal delivery indefinitely when the signal wasn't blocked by user-space seems like a bug to me. If it's infeasible to fix, or just not worth fixing, then OK.

Well, Oleg (who's one of the core Lustre developers; I'm on the periphery) might have an opinion here, but basically: Lustre must wait, and it shouldn't do so uninterruptibly (bad form, since some of our network timeouts are measured in minutes, and people should be able to kill programs stuck in those waits), but it also doesn't want to stop waiting for signals that aren't intended to be fatal. So they came up with a mask of signals that they think people will consider critical/fatal, the sort of signals people expect to interrupt and stop a program, and they block everything else. But, SIGKILL aside, that's inherently arbitrary.

It's ugly, but I don't see a better solution.

That's kind of neither here nor there, though. FWIW, when you have SIG_IGN set and are not ptracing, this code hangs on non-Lustre filesystems. The SIG_IGN seems to mean we don't get the expected state transition that the parent is waiting for, so it never unlocks. When ptracing, SIG_IGN is, well, ignored.

@rocallahan
Collaborator

FWIW, when you have SIG_IGN set and are not ptracing, this code hangs on non-Lustre filesystems.

I don't see this. Testcase:
https://gist.github.com/rocallahan/6b557e6f0a63bf00b586f58187af8927
Completes normally for me with btrfs.

Lustre must wait, it shouldn't do it uninterruptibly (bad form, since some of our network timeouts are in minutes, and people should be able to kill programs that are in those waits), but it doesn't want to stop waiting for signals that aren't intended to be fatal.

I believe the correct behavior here would be for signals to interrupt the system call, leaving it in a restartable state, so that if the signal is ignored or there is a signal handler which returns, the system call restarts and can complete normally. If you strace -f my testcase then you should see that's exactly what happens:

[pid 13994] fcntl(3, F_SETLKW, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=0, l_len=4096} <unfinished ...>
[pid 13993] <... nanosleep resumed> NULL) = 0
[pid 13993] kill(13994, SIGPWR)         = 0
[pid 13994] <... fcntl resumed> )       = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 13994] --- SIGPWR {si_signo=SIGPWR, si_code=SI_USER, si_pid=13993, si_uid=1000} ---
[pid 13993] nanosleep({0, 500000000},  <unfinished ...>
[pid 13994] fcntl(3, F_SETLKW, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=0, l_len=4096} <unfinished ...>

I understand that might be difficult to implement. Though, for flock it seems like all you'd have to do is abandon taking the lock and prepare the syscall for restart, so the restarted syscall just tries again from scratch to take the lock.
