flock test fails on lustrefs #1908
Comments
We don't handle anything like that currently. The closest I've seen was when trying to add syscall buffering for
The immediate problem that makes this hard to fix is knowing when the
Probably should report a lustrefs bug in any case.
Does it make any difference if you change the flock mount option to localflock?
Unfortunately, I cannot change any mount options on this system, so it's not easy for me to try that.
I'm gonna try the KVM setup guide at http://wiki.lustre.org/KVM_Quick_Start_Guide, and see if I can get a dev setup running on one of my machines.
What's the Lustre version in use? Is there a way to get the test binary without having to pull in a lot of dependencies for building it myself?
As far as I know the version mentioned above is the relevant version here (if not, please let me know what command to run). rr has no dependencies other than cmake, so just cloning this repository and doing a standard cmake build should be sufficient. I'd also be happy to do a build and zip up the result, but building from source should be very straightforward. The relevant command to try is in the original post (adjust the build location accordingly, of course). In my tests it deterministically failed when run in a lustre directory, but passed on all other file systems I've tried (ext4, btrfs, gpfs, tmpfs). Also, I was unable to get a lustre dev setup running, so I did not try the suggestion above.
Ah, 2.7.1.11, hm, that's a strange version number; I guess that just means 2.7.1 or something close to it. (cat /proc/fs/lustre/version should be good enough.) The filesystems you tried are all local. Did you try any other network filesystems, like say nfs4 (should be the easiest to set up)? I am going to try this on my local lustre setup and see what happens.
As far as I'm aware gpfs is distributed. The distribution is Cray Linux, with the kernel based on 3.12 as far as I can tell:
Ah, sles12 for Cray, I think.
Also, Cray explains why the version is so strange. They roll in a bunch of their own patches, and their tree sometimes has significant departures from mainline.
Hm, it does seem to be pulling in a bunch of stuff that I don't have on my test nodes.
Just sticking my oar in so I can get the binary as well... I'm on the Cray Lustre side. (Green is right about the version info ;) ) Also, if you wanted to report this to the Cray site staff so they can open a bug, that would be helpful too... (Even if Green or I figure out the problem and create a patch without you going through Cray, you'll need that bug so we can get the fix installed on Cori.)
Ok, here's the build directory (built on Cori, so it's the same binary I've been using for tests): http://anubis.juliacomputing.io:8844/rr-build.tar.gz. Please let me know if it doesn't work (e.g. due to libc version, etc.), in which case I'll spin up a CentOS machine and build one there. Also, I may have been wrong about gpfs. It appears the compute nodes on Cori use Cray DataWarp for the home directory, so while the backing file system is gpfs, there's some Cray magic in there as well. I also tried running the test on the login node's home directory, which does appear to be pure gpfs (which is why I thought it would be so on the compute nodes as well), and I was able to reproduce the same behavior I saw with lustre, so this may be a more general problem. My apologies for the incorrect information.
Just out of curiosity, did the problem happen with GPFS-via-DataWarp? Trying to execute it:
perf record works on this system. Don't have the referenced source, so I can't easily dig further.
No, that was fine; plain gpfs did appear to have the problem, however.
This was built from unmodified master, so the sources are the same as those in this repository. What's
CentOS 7, VMWare ESXi.
Hmm, VMWare is usually fine, I thought. Try
[root@cent7c01 ~]# perf list | grep "Hardware event"
GDB: Starting program: /shared/paf/rr-build/bin/./rr record ./flock
Program received signal SIGABRT, Aborted.
Thanks. The
Grumble, grumble. I can try it on real hardware, but I might have glibc issues again...
On hardware (Cray SLES 12, probably not too different from Cori): Sits there...
Tried gdb as suggested:
Output of rr:
Note that I am able to execute the 'flock' binary correctly by itself.
@Keno might it be easier for you to create a standalone testcase? Shouldn't be that hard, probably doesn't even need ptrace, just have one process call flock while another process sends it an ignored signal?
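A minimal sketch of that kind of standalone reproducer (to be clear, this is not the gist posted later in the thread; the file name, the choice of SIGUSR1, and the sleeps are arbitrary): the parent takes an exclusive flock, the child blocks on a conflicting flock through its own open file description, and the parent sends the child a signal whose disposition is SIG_IGN. On ext4/btrfs the child's flock() is expected to just complete once the parent unlocks; the question is whether it comes back with EINTR on Lustre/GPFS instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *path = "flock-eintr-test";
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0 || flock(fd, LOCK_EX) != 0)        /* parent takes the lock */
        return 1;

    pid_t child = fork();
    if (child == 0) {
        /* Re-open to get a separate open file description; otherwise the
         * inherited fd already "owns" the lock and flock() returns at once. */
        int cfd = open(path, O_RDWR);
        signal(SIGUSR1, SIG_IGN);                 /* ignored, no handler */
        int rc = flock(cfd, LOCK_EX);             /* should block, then succeed */
        printf("child flock: rc=%d errno=%s\n",
               rc, rc ? strerror(errno) : "0");
        _exit(rc == 0 ? 0 : 1);
    }

    sleep(1);                  /* give the child time to block in flock() */
    kill(child, SIGUSR1);      /* deliver the ignored signal */
    sleep(1);
    flock(fd, LOCK_UN);        /* release so the child can finish */

    int status;
    waitpid(child, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
```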
I am getting:
@rocallahan Yes, I had started on that, but I didn't quite manage to reproduce it. I was hoping at this point we'd be robust enough to be able to have people run it ;) - wishful thinking I guess. I'll try again to do a standalone test case.
When you say "ignored signal", do you mean "blocked in the signal mask", or ...?
Just a signal with no handler and
But apparently the problem is more complicated than that. Maybe you need ptrace as well.
Huh. I wasn't familiar with SIG_IGN... Lustre (and perhaps GPFS too) does interruptible waits in some places, and (usually) returns -ERESTARTSYS, making the interrupted call restartable. Do you have SA_RESTART set? If not, then you'll get -EINTR on interrupts. (Can you even set that when you're not writing your own handler? I would think so.)
To clarify: when SA_RESTART is set, returning -ERESTARTSYS from Lustre causes the kernel to restart the syscall in question. If SA_RESTART is not set, then the kernel translates that to -EINTR and returns it back to userspace. (In certain cases Lustre returns -EINTR directly, causing SA_RESTART to be ignored.)
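To make the userspace-visible half of that concrete, a small sketch (assuming no SA_RESTART, or a filesystem that returns -EINTR outright): the caller sees -1/EINTR and has to retry the flock itself.

```c
#include <errno.h>
#include <sys/file.h>

/* If the wait inside flock() is interrupted and the kernel does not
 * auto-restart the call (no SA_RESTART, or the filesystem returned
 * -EINTR outright), userspace gets -1/EINTR and must retry by hand. */
static int flock_eintr_retry(int fd, int op)
{
    int rc;
    do {
        rc = flock(fd, op);
    } while (rc == -1 && errno == EINTR);
    return rc;
}
```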
Yes, I would think that if you get any sort of signal during a blocking system call, you might get EINTR and you need to handle it one way or another.
SA_RESTART only applies when you set a handler. Ignored signals should not cause a syscall to return EINTR.
Yes, I guess ignored signals really should not, unless they could not be ignored or something.
Hm. @rocallahan I'm digging through kernel source to look at the restart handling wrt SIG_IGN. I mean, if you interrupt a syscall, which needs to be possible, then it has to be restarted. We can't just wait uninterruptibly for network things which may not complete. (That's probably a difference between distributed file systems and local ones. I'm betting most local ones don't wait interruptibly.) So, ignored is not the same thing as blocked. (Why not just block...?)
rr arranges for an ignored SIGSTKFLT to be delivered pretty much every time a syscall blocks in the kernel. This does not cause any other syscalls to return EINTR (if it did, we'd be hitting this here bug all the time) and it doesn't cause EINTR for flock in common filesystems like ext4 or btrfs (or tests would be failing on those filesystems, which are tested often). That's why we suspect filesystem-specific kernel issues here. We can't block the signal because rr needs to get a blocking notification (via ptrace) that the signal is being delivered. rr responds to that notification by resuming execution of the tracee, letting it complete the syscall (and meanwhile scheduling another tracee).
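For readers following along, a very rough sketch of that tracer-side pattern (this is not rr's actual code; the helper name is made up): on a signal-delivery-stop for the ignored SIGSTKFLT, the tracer resumes the tracee without injecting the signal, so the interrupted syscall is free to complete.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Rough sketch of the pattern described above, not rr's real code:
 * swallow the ignored SIGSTKFLT at its signal-delivery-stop and let
 * the tracee resume; pass every other stopping signal through. */
static void pump_tracee(pid_t tracee)
{
    int status;
    while (waitpid(tracee, &status, 0) == tracee) {
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTKFLT)
            ptrace(PTRACE_CONT, tracee, 0, 0);                 /* don't inject it */
        else if (WIFSTOPPED(status))
            ptrace(PTRACE_CONT, tracee, 0, WSTOPSIG(status));  /* deliver as-is */
    }
}
```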
Correction, it doesn't cause EINTR in the set of syscalls that we have fast-paths for. And we do know that
I think the "filesystem-specific" issue here is that we're interruptible, so we're getting interrupted. The syscall is getting interrupted. I suspect that's not true of btrfs or ext4, since they wait uninterruptibly, basically working on the assumption that nothing they're waiting for will fail to return in a reasonable time frame. Being network file systems, Lustre and GPFS can't do this realistically. (There's a bit more to it, but that's the basic thing.) Still poking around in the kernel to try to understand stuff here better...
That may be true, but other interruptible syscalls (e.g.
Hm, all right. My digging around strongly suggests that in the case of SIG_IGN, the signal shouldn't truly be delivered at all. ptrace_signal is called to let it look at the attempted delivery, but after that, we don't call the actual handling code, which is where ERESTARTSYS is handled. (I think... It gets very hairy in here.) (do_signal-->get_signal-->get_signal_to_deliver and then handle_signal) Since we can't reproduce the problem yet, would @Keno be able to try (just for our information) setting sa_flags to SA_RESTART? (Since there is a struct sigaction when you're setting SIG_IGN.)
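For reference, the experiment being asked for is only a few lines; a sketch (the signal number would be whichever one the testcase uses):

```c
#include <signal.h>
#include <string.h>

/* The experiment suggested above: keep the SIG_IGN disposition but also
 * set SA_RESTART, to see whether the kernel then restarts the
 * interrupted flock/fcntl instead of failing it with EINTR. */
static void ignore_with_restart(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(signo, &sa, NULL);
}
```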
OK, try the following maybe: https://gist.github.com/Keno/f257142b20d212b94182058ecee363af
So not quite the same behavior as in the original test case on lustre, but close enough to reproduce the problem?
Well, on Lustre, I'd call that hang "expected". When waiting, Lustre sometimes blocks everything and otherwise it blocks all signals it considers non-fatal. (The hang is the parent signalling the child (while the child is waiting for an fcntl call that will not complete, because the parent holds a conflicting lock) and then waiting for the child to change state. This state change doesn't happen because the signal is blocked.) The signal state on the child process can be seen in /proc/<pid>/status (relevant lines pointed out below):
Ah, after parsing through the various asserts, etc., it looks like sending SIGINT to the child will get the -1 and errno=EINTR behavior you mentioned. Setting up a sigaction with SIG_IGN doesn't appear to do anything. So I think I've reproduced the problem(?)...
OK, yes. When we're not ptraced, the child process exits when sent SIGINT, as I'd expect, and hangs when sent '30', also more or less as expected (since it's blocked). When I set SIG_IGN as the action for SIGINT (on the child), it hangs. No -1 and EINTR returned; the syscall doesn't exit. This is when not ptraced. When ptraced, I get -1 and EINTR (on SIGINT), whether or not I've set up the sigaction. I.e., it seems like the sigaction is getting ignored when we're ptraced.
So the question then, I suppose, is what's different about EXT4 and friends? I suspect it's something in the handling of waiting in TASK_UNINTERRUPTIBLE... Because they must trigger the state change the parent is waiting for after sending the signal, without actually exiting the syscall. I'm guessing it's "successful delivery" when a signal is given to a process waiting in TASK_UNINTERRUPTIBLE (which doesn't care that you signalled it). So... Lustre is interruptible. That's by design. It's getting interrupted... And when ptraced, the sigaction seems to be ignored. That doesn't feel like a Lustre bug. Thoughts?
Getting EINTR in this testcase is a bug, because the ptracer never delivers the signal to the tracee. I don't know which part of the kernel to blame.
Blocked by what? User-space hasn't blocked it. I don't think we should invest too much in trying to determine whether this is a kernel bug or not. We generally work around kernel bugs anyway, and I think it would be effective to simply detect if certain filesystems are mounted and disable buffering of flocks in those cases, which should have no real downsides given those syscalls are already expensive on these filesystems.
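A sketch of what that detection could look like, with a big caveat: the statfs magic numbers below are assumptions (a value believed to be Lustre's LL_SUPER_MAGIC and a value believed to be GPFS's f_type), not something confirmed in this thread, and would need to be checked against the real headers before use.

```c
#include <sys/vfs.h>

/* Possible shape of the workaround discussed above: look at the
 * filesystem behind a file descriptor and skip flock buffering there.
 * The magic constants are assumptions, not taken from linux/magic.h. */
#define ASSUMED_LUSTRE_SUPER_MAGIC 0x0BD00BD0  /* believed to be LL_SUPER_MAGIC */
#define ASSUMED_GPFS_SUPER_MAGIC   0x47504653  /* believed to be "GPFS" in ASCII */

static int flock_buffering_unsafe(int fd)
{
    struct statfs sf;
    if (fstatfs(fd, &sf) != 0)
        return 1;  /* can't tell, so be conservative */
    return sf.f_type == ASSUMED_LUSTRE_SUPER_MAGIC ||
           sf.f_type == ASSUMED_GPFS_SUPER_MAGIC;
}
```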
Blocked by Lustre. And I'm certainly happy to let it go with a workaround, if that works for rr. (My area is Lustre and the kernel, so I'm inclined to dig deep.) I may keep investigating in the interests of fixing the ptrace/Lustre interaction, but I certainly don't have to do it here.
If you're willing to keep looking into the lustre/kernel side of things, I'd certainly be happy to keep helping out any way I can. We should of course still investigate the possibility of a workaround, to fix this for the immediate future (and even if we find a kernel fix, to fix it for older versions).
OK. Blocking signal delivery indefinitely when the signal wasn't blocked by user-space seems like a bug to me. If it's infeasible to fix, or just not worth fixing, then OK. Obviously for rr it's not important since we only care about the ptrace case.
Feel free to do it here :-). Generally I'm keen to get upstream bugs reported and fixed. I just meant this is not an issue where we need to struggle mightily to reach a consensus, if we're having difficulty reaching one.
Well, Oleg (who's one of the core Lustre developers, I'm on the periphery) might have an opinion here, but basically, Lustre must wait; it shouldn't do it uninterruptibly (bad form, since some of our network timeouts are in minutes, and people should be able to kill programs that are in those waits), but it doesn't want to stop waiting for signals that aren't intended to be fatal. So they came up with a mask of signals that they think people will consider critical/fatal: the sort of signals people expect to interrupt and stop a program. And they block everything else. But... SIGKILL aside, that's inherently arbitrary. It's ugly, but I don't see a better solution. That's kind of neither here nor there, though. FWIW, when you have SIG_IGN set and are not ptracing, this code hangs on non-Lustre filesystems. The SIG_IGN seems to mean we don't get the expected state transition that the parent is waiting for, so it never unlocks. When ptracing, SIG_IGN is, well, ignored.
I don't see this. Testcase:
I believe the correct behavior here would be for signals to interrupt the system call, leaving it in a restartable state, so that if the signal is ignored or there is a signal handler which returns, the system call restarts and can complete normally. If you
I understand that might be difficult to implement. Though, for flock it seems like all you'd have to do is abandon taking the lock and prepare the syscall for restart, so the restarted syscall just tries again from scratch to take the lock.
Works fine without rr as well as with -n. Mount options are rw,flock,lazystatfs in case it makes a difference.