Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linux v5.13 broke zvol_first_open lock retry logic, so open() on a zvol dev node now sometimes fails with errno=ERESTARTSYS in userspace (particularly noticeable in udev) #12301

Closed
jgottula opened this issue Jun 30, 2021 · 25 comments
Assignees
Labels
Component: ZVOL ZFS Volumes Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@jgottula
Copy link
Contributor

jgottula commented Jun 30, 2021

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version rolling release / 2021 Jun 28
Linux Kernel 5.13
Architecture x86_64
ZFS Version 2.0.5-1 and 2.1.99-335_g4694131a0
SPL Version 2.0.5-1 and 2.1.99-335_g4694131a0

The problem

At system startup right after pool import, or at any other time when a large number of zvol dev nodes pop into existence (e.g. when setting snapdev=visible on a volume dataset with many snapshots), a small random smattering of the /dev/zvol/* symlinks that are expected to be created, are missing. This is just the immediately user-noticeable symptom, though.

Adding the following udev rule to heighten the log level for zvol devices helps diagnose what's going on. You can throw this in as /etc/udev/rules.d/00-debug.rules and load it up at runtime via udevadm control -R if desired.

SUBSYSTEM=="block", KERNEL=="zd*", OPTIONS="log_level=debug"

The rule in /usr/lib/udev/rules.d/60-zvol.rules for creating the /dev/zvol/* symlinks includes a PROGRAM token which causes udev (systemd-udevd) to spawn an instance of /usr/lib/udev/zvol_id per /dev/zd* dev node. Many of these are run in parallel at once.

The debug logging rule shown above reveals that what's happening is that, occasionally and somewhat randomly, one of these zvol_id processes will not be able to successfully perform the open("/dev/zdN", O_RDONLY) syscall to open the dev node it was asked to identify. The syscall fails, and the value in errno after the failure is 512.

So... 512 isn't a valid userspace error value at all. I was fortunate to find that 512 corresponds to ERESTARTSYS in Linux kernel land. As far as I know, this is a Thing That Should Never Leak To Userspace. Yet, it has, for whatever reason. (Maybe something about the signal configuration in udev processes is unusual and allows for this? No idea.)

Anyway. My naive understanding is that if a syscall handler (such as zvol_open) returns -ERESTARTSYS, then the Linux kernel is supposed to automatically make things "just work" such that the thread doesn't exit back to userspace but instead makes a U-turn and auto-retries the syscall again. (Which, in this case, would serve the purpose of effecting a retry on some heavily contended locks.) And so in principle, glibc shouldn't have to do any magic; and the program itself shouldn't have to do some sort of EINTR loop. And, at least if things work as intended, no ERESTARTSYS 512 values should be seen on the userspace side. (Yet, here we are.)

It should also be noted that other udev-spawned programs (e.g. udisks-part-id) also occasionally encountered the exact same syscall error with the same ERESTARTSYS error; often well-correlated with the failure in zvol_id, presumably due to both programs running at nearly the same time and both facing similar lock contention.

I did hack around with the zvol_id program, and experimented with putting the open syscall into a loop that would retry on failure if (errno == 512). I put number-of-tries printouts to stderr in that loop, and I found that the open call would eventually succeed; but generally after something on the order of 20-400 retries. (Probably would fare better with some kind of sleep call in between, but ehhh.)

I do half-wonder if the reorganization that happened in 5df7e9d has anything to do with this issue: prior to that commit, zvol_first_open would directly return (-SET_ERROR(ERESTARTSYS));, whereas after the commit, zvol_first_open (in one file) does return (SET_ERROR(EINTR)); while zvol_open (in a different file) does some -EINTR to -ERESTARTSYS stuff. Doesn't seem like the whole story by any means, but possibly a complicating factor.

I also wonder if there was some internal change to how the Linux kernel does its ERESTARTSYS magic, perhaps in 5.12. I haven't been able to actually track down where that logic is located, at the moment; I think it used to be in one particular file, but has since been moved to some other file; and my on-the-fly kretprobes knowledge apparently isn't good enough to figure out how to just print out a damn stack trace whenever zvol_first_open returns -EINTR.

Other issues and PRs of similarity or relevance

Information that might kinda help reproduce the problem maybe perhaps

I don't know if I have any specific reliable way to reproduce the problem.

The best I can say is that my pool has a lot of zvols (dozens to hundreds, particularly when snapdev=visible), and that I see this pretty consistently whenever a bunch of /dev/zd* nodes come into existence at the same time and make systemd-udevd spin up a crapton of workers that all attempt to access some large number of those dev nodes at the same time (undoubtedly running into some degree of lock contention).

I don't do any exotic things like attempting to build nested zfs contraptions involving using zvols as vdevs. (I generally regard myself as a sane person.)

I use a custom kernel that has, among other things, CONFIG_PREEMPT=y and CONFIG_HZ=1000. The involuntary preemption thing could be relevant; not entirely sure, to be honest.

This has been happening on an SMP system with multiple cores (AMD FX-8350).

If it's of any interest, I'm currently running systemd/udev version 249rc2. I believe this was happening on 248.3 as well, and possibly prior to that. I'm 95% sure I was also seeing this behavior back on kernel 5.12.2 (at which time I was using zfs 2babd20).

Oh yeah and I guess that little udev rule file I posted earlier that enables debug log_level for /dev/zd* dev nodes is probably a handy thing to have around if you're trying to reproduce this.

Logs of interest

Nothing of particular note shows up in /proc/spl/kstat/zfs/dbgmsg (with zfs_flags=1).

What I've posted below, however, is some interesting udev log_level=debug output from the journal, isolated out for one particular zvol device that had difficulty when I switched snapdev=hidden to snapdev=visible on a particular volume.

In this case I was using an unmodified zvol_id binary that didn't have any printouts added to indicate e.g. what errno was; but you can see it from the udisks-part-id invocation that also failed on the open syscall.

systemd-udevd[3765278]: zd736: The log level is changed to 'debug' while processing device (SEQNUM=8400, ACTION=add)
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/50-udev-default.rules:69 GROUP 6
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 Running PROGRAM '/usr/lib/udev/zvol_id /dev/zd736'
systemd-udevd[3765278]: zd736: Starting '/usr/lib/udev/zvol_id /dev/zd736'
systemd-udevd[3765278]: Successfully forked off '(spawn)' as PID 3765289.
systemd-udevd[3765278]: zd736: '/usr/lib/udev/zvol_id /dev/zd736'(out) 'Unable to open device file: /dev/zd736'
systemd-udevd[3765278]: zd736: Process '/usr/lib/udev/zvol_id /dev/zd736' succeeded.
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'zvol/Unable'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'to'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'open'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'device'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'file:'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK '/dev/zd736'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/80-udisks.rules:84 Importing properties from results of 'udisks-part-id /dev/zd736'
systemd-udevd[3765278]: zd736: Starting 'udisks-part-id /dev/zd736'
systemd-udevd[3765278]: Successfully forked off '(spawn)' as PID 3765294.
systemd-udevd[3765278]: zd736: 'udisks-part-id /dev/zd736'(err) 'using device_file=/dev/zd736 syspath=/sys/devices/virtual/block/zd736, offset=0 ao=0 and number=0 for /dev/zd736'
systemd-udevd[3765278]: zd736: 'udisks-part-id /dev/zd736'(err) 'Error opening /dev/zd736: Unknown error 512'
systemd-udevd[3765278]: zd736: Process 'udisks-part-id /dev/zd736' succeeded.
systemd-udevd[3765278]: zd736: Handling device node '/dev/zd736', devnum=b230:736
systemd-udevd[3765278]: zd736: Setting permissions /dev/zd736, uid=0, gid=6, mode=0660
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/file:'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/file:' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/open'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/open' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/zvol\x2fUnable'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/zvol/Unable' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/to'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/to' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/dev\x2fzd736'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/dev/zd736' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/device'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/device' to 'zd736'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/block/230:736' to '../zd736'
systemd-udevd[3765278]: zd736: sd-device: Created db file '/run/udev/data/b230:736' for '/devices/virtual/block/zd736'
systemd-udevd[3765278]: zd736: Handling device node '/dev/zd736', devnum=b230:736
systemd-udevd[3765278]: zd736: Preserve permissions of /dev/zd736, uid=0, gid=6, mode=0660
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/file:'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/file:' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/open'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/open' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/zvol\x2fUnable'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/zvol/Unable' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/to'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/to' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/dev\x2fzd736'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/dev/zd736' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/device'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/device' to 'zd736'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/block/230:736' to '../zd736'
systemd-udevd[3765278]: zd736: Adding watch on '/dev/zd736'
systemd-udevd[3765278]: zd736: Device processed (SEQNUM=8400, ACTION=add)
systemd-udevd[3765278]: zd736: sd-device-monitor: Passed 757 byte to netlink monitor

In case you read the log above and are now thinking "wtf are those symlinks?!"

In the log sequence posted above, it's hard not to notice that another idiotic thing is going on. Because there's some goofs in the udev rules file and in zvol_id, the nonzero exit status of 512 from zvol_id is modulo'd into 0, which of course is then treated as a "success", meaning that the contents of stdout are not ignored, but are consumed normally (as if they were the name of the /dev/zvol/* symlink to create). And then zvol_id, for reasons I don't particularly understand, only ever prints stuff (including its error messages) to stdout; combine that with the wonders of space-separated list syntax, and you wind up with udev deciding that the rules are requesting that the following symlinks be created:

/dev/zvol/Unable -> /dev/zd736
/dev/to          -> /dev/zd736
/dev/open        -> /dev/zd736
/dev/device      -> /dev/zd736
/dev/file:       -> /dev/zd736
/dev//dev/zd736  -> /dev/zd736

(I plan to address this in a separate PR that I have pretty much all lined up and ready to go already. The 60-zvol.rules file is extremely ancient; like, it's literally a decade old. And it uses a fairly impressive number of deprecated syntax quirks for a mere one-liner file. Most of the code for zvol_id is about a decade old as well, and does dumb stuff like printing error messages to stdout. So yeah, I'll be taking care of all of this particular dumbness myself shortly. That PR will fix the bizarre symlinks symptom that's been going on; but it doesn't do anything to address the core mystery of "why the hell are these zvol_open calls failing, and in such a way that errno=ERESTARTSYS is getting to userspace?")

(This is now up as PR #12302.)

Stuff I can do to help

If this issue turns out to be a giant pain to reproduce, I'm happy to do stuff like flipping on zfs_flags bits, dumping various logs, adding extra debug printouts in particular places, creating/building/running custom one-off builds of the zfs module that I use on my personal system with my actual data with reckless abandon; basically whatever helps track this down. 😄

@jgottula jgottula added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Jun 30, 2021
@jgottula
Copy link
Contributor Author

jgottula commented Jul 1, 2021

I've now also tried the latest zfs git master version (2.1.99-335_g4694131a0), still on kernel 5.13, and the errno=ERESTARTSYS situation still happens.

@jgottula
Copy link
Contributor Author

jgottula commented Jul 1, 2021

I've created a branch that patches zvol_id so that if the open() syscall fails with errno=ERESTARTSYS, it will continually retry until it does succeed. Might be useful for debugging/testing.

It's based upon some other cleanup patches I submitted in PR #12302.

https://github.com/jgottula/zfs/tree/zvol_id-retry-on-ERESTARTSYS

@jgottula
Copy link
Contributor Author

jgottula commented Jul 1, 2021

I was just looking back over the internal Linux kernel errno.h header, and figured it would be a good idea to highlight some of what's written in the comments there.

https://github.com/torvalds/linux/blob/v5.13/include/linux/errno.h#L8-L14

/*
 * These should never be seen by user programs.  To return one of ERESTART*
 * codes, signal_pending() MUST be set.  Note that ptrace can observe these
 * at syscall exit tracing, but they will never be left for the debugged user
 * process to see.
 */
#define ERESTARTSYS	512

After re-reading this, I'm thinking that I don't remember anything like signal_pending() being in the zvol_open/zvol_first_open code.

Perhaps in Linux 5.12 (or thereabouts) something about how the ERESTART* return codes are handled was modified, and now it's stricter about the signal aspect (whereas it may have been overly lenient before)...?

@jgottula
Copy link
Contributor Author

jgottula commented Jul 1, 2021

@behlendorf I think I've found the smoking gun here.

Back in one of the issues I referenced in the top post (#612, from way back in 2012), discussion surrounded how to solve a lock inversion deadlock that manifested in cases like trying to use zvols as vdevs and other interesting situations. (One also imagines that that very same lock contention situation could arise in an occasional pseudorandom manner, when multiple processes try to do approximately a million zvol-related system calls all at the same time... such as when systemd-udevd spawns an army of workers at the exact moment that a zvol dev node pops into existence, for example.)

The solution to the problem at the time, which was implemented in 65d5608, was to take advantage of a mechanism provided by __blkdev_get that allowed callees (such as zvol_open) to return -ERESTARTSYS. This would let such a function bail out of its fops->open() call; release bdev->bd_mutex for hopefully long enough to allow for a task switch or parallel execution to resolve the lock ordering situation; and then reacquire the lock and call back into fops->open() for another opportunity to attempt the correct locking order.

Here's an excerpt from that commit message:

Luckily, the __blkdev_get() function allows us to return -ERESTARTSYS which will result in bdev->bd_mutex being dropped, reacquired, and fops->open() being called again. This process can be repeated safely until both locks are acquired.


Okay so here's the problem: the -ERESTARTSYS functionality in __blkdev_get that zvol_open started relying on with that commit, doesn't actually exist anymore in recent kernels!

From what I can gather from code comments and commit messages, it seems that the only reason __blkdev_get ever even had a handy -ERESTARTSYS do-over mechanism in the first place was because of some problem with the md driver that couldn't be solved easily any other way. (I don't really know, or necessarily care much, about the specifics; but it seems to have been a relatively specific workaround just to make md work properly.)

So, with v5.9-rc1, we start seeing some refactoring of the restart logic in __blkdev_get with this commit.

Next, with v5.11-rc1, we see much more substantial refactoring of blkdev_get/__blkdev_get in this commit. The restart logic is still intact, albeit moved around somewhat. But I haven't actually scrutinized it closely enough to know if it necessarily still worked the way that the zvol_open code originally assumed; and so I'm suspicious that kernel v5.11 might have been the actual point where things stopped working as intended. (2021-07-05: Nope, kernel 5.12 is fine.)

Finally, with v5.13-rc1, we arrive at this commit, literally with title "block: remove the -ERESTARTSYS handling in blkdev_get_by_dev" and description "Now that md has been cleaned up we can get rid of this hack", whose code changes consist solely of just ripping the -ERESTARTSYS retry logic out of the block device layer entirely once and for all.


So this clearly indicates that for Linux v5.13 and above, the -ERESTARTSYS logic originally utilized by 65d5608 to fix #612 is no longer actually there.

And that actually fits nicely with the symptoms of zvol_open just bailing out immediately at the first sign of lock contention, not actually doing any of the internal-retry stuff it previously did, and instead simply passing the -ERESTARTSYS return value all the way back out to userspace (as it's no longer intercepted and handled by the block layer).

I took a brief look at the Linux stable tree, and it doesn't look like that third patch has been backported to anything pre-v5.13 at all. So that's nice at least.

However, I'm >90% sure that I was seeing symptoms of this problem when I was still running v5.12.2. So I'm suspicious that the second patch may have effectively made the -ERESTARTSYS mechanism not work for zvol_open as early as perhaps v5.11. (2021-07-05: Nope.)


I'm going to go back and look through some of my system logs and see if I can confirm whether I was in fact seeing udev errors symptomatic of this problem back when I was on kernel v5.12.2.

I may even rollback my kernel and zfs packages to v5.12.2 and do a quick boot with that, to see if I can reproduce the behavior with those existing builds that I still have laying around.

(2021-07-05: I had a convenient chance to try this today, as I managed to OOM and hardlock my kernel this afternoon by being an idiot, giving me an "opportunity" to "reboot". Reverted kernel back to 5.12.2; cannot reproduce. So it's pretty much 100% guaranteed to be a v5.13+ only thing. Which frankly fits what the code was showing pretty well anyway. I did check back on pre-5.13 boots and confirmed that I was having udev symlink issues at that time; but they must have been due to something unrelated.)


In any case, the finding here appears to be that #612 has effectively been un-fixed by recent kernels (definitely v5.13+, and maybe even v5.11+, depending on what I find); and that some new way of resolving that lock ordering situation needs to be found.

@behlendorf
Copy link
Contributor

@jgottula good find, yes that would entirely explain it. We'll have to do some investigation and see if there's some other way we can restructure the locking to avoid the original deadlock.

@behlendorf behlendorf removed the Status: Triage Needed New issue which needs to be triaged label Jul 2, 2021
@jgottula jgottula changed the title open() on a zvol dev node may fail with errno=ERESTARTSYS in userspace (observed with udev) Linux v5.13+ (maybe v5.11+?) broke zvol_first_open lock retry logic, so open() on a zvol dev node now sometimes fails with errno=ERESTARTSYS in userspace (particularly noticeable in udev) Jul 2, 2021
@jgottula
Copy link
Contributor Author

jgottula commented Jul 2, 2021

Given some of the original scenarios involved with the lock order involved zpool-on-zvol setups, I wonder if this means that those have been rendered non-working once again. (On either an "always" or perhaps "sporadic" basis... I haven't bothered to look in-depth at the logic there to know where it would land on the never~sometimes~always spectrum.)

I should also check if there are any existing tests in the test suite that do the using-zvol-as-a-vdev thing, and (if such do exist) whether they still pass. And if there aren't any tests for that, then there probably ought to be.

@jgottula
Copy link
Contributor Author

jgottula commented Jul 2, 2021

I've created a branch that patches zvol_id so that if the open() syscall fails with errno=ERESTARTSYS, it will continually retry until it does succeed. Might be useful for debugging/testing.

https://github.com/jgottula/zfs/tree/zvol_id-retry-on-ERESTARTSYS

This branch is still around in case anyone wants to mess around with it. Rebased onto master just now.

@jgottula
Copy link
Contributor Author

jgottula commented Jul 2, 2021

Here are some more commit references, for convenience's sake.

This is the commit from way back in Jan 2009 that fixed the original md race condition / softlock situation, by making some modifications in the md code as well as adding the -ERESTARTSYS restart "hack" to __blkdev_get.

And then these commits (one, two) make up the v5.13-era fix to md that allowed the -ERESTARTSYS restart logic to be removed from __blkdev_get (well I guess it made its way into blkdev_get_by_dev instead eventually, but whatever).

@behlendorf
Copy link
Contributor

Given some of the original scenarios involved with the lock order involved zpool-on-zvol setups, I wonder if this means that those have been rendered non-working once again.

This configuration has actually always been a little bit dodgy on all the platforms and really is discouraged. The test suite does use this kind of configuration in a few places (since it was convenient), but the newest kernel we've been testing with is 5.12.13 so we may just not have seen this yet. I'd expect this to manifest itself as a test case failure now, and not a deadlock, since ZFS will be returning -ERESTARTSYS so I'll keep any eye out for that.

Thanks for the handy references, if you're so inclined to dig in this more let me know. Unfortunately, it might be a while before anyone has the time to really dig in to this.

@jgottula
Copy link
Contributor Author

jgottula commented Jul 2, 2021

I found another thing that's likely to be rather relevant here.

The commit that ripped the -ERESTARTSYS behavior out in v5.13 was just one part of a larger patch set: a patch set that (unfortunately) seems to have ended up being split between v5.13 and v5.14; and whose v5.14 portion makes some changes that include removing bd_mutex entirely. (So that may complicate the situation a bit.)


Here's the original 15-part patch set: move bd_mutex to the gendisk.

Patch 01 seems to have been dropped, as its changes never showed up in Linus's master branch at any point. (And that patch didn't make it into v2 of the patch set either; see below.)

Patches 02 thru 07 were merged to Linus's tree during the v5.13 merge window and made it into the v5.13 release. Patches 08 thru 15 were held back for later, however.

The patches that were merged here were all in md code... except for the one block patch that removed the -ERESTARTSYS logic. I guess on the grounds that, ostensibly, the only thing that relied on that behavior was md. 😕


So anyway, later on, we have a v2 patch set basically just containing not-merged-in-v5.13 patches that were previously numbered 08 thru 15, rebased for v5.14: move bd_mutex to the gendisk v2.

All eight of the patches in this set were merged into Linus's tree during the (still ongoing) v5.14 merge window. And among other tidbits, it does implement the titular replacement of block_device->bd_mutex with gendisk->open_mutex.

(There was also sort of a "pre-v2" patch set, move bd_mutex to the gendisk (resend), that was posted to the linux-block list between v5.13-rc1 and v5.13-rc2. But that one didn't go anywhere itself.)


I don't know the precise reasoning that led to the md patches, plus the one block patch that removed the -ERESTARTSYS logic, being merged for v5.13 but the remainder of the block patches being held back for v5.14.

I imagine, though, that things would be less complicated for us if only they'd been nice and left the -ERESTARTSYS mechanism in for v5.13, and waited for the rest of the block patches to arrive in v5.14 to remove it. (Or just held everything for v5.14.) If that had been the case, then we'd presumably be in a position of not needing to do anything for v5.13; and then needing a single fix to address v5.14.

Instead, we now (potentially) need one fix for v5.13 block_device->bd_mutex; and then a different fix for v5.14 gendisk->open_mutex.

(But: I haven't yet looked into the minutiae to figure out the exact implications of the change from bd_mutex to open_mutex. Not sure if it's basically the same semantics as before; or if it's potentially the sort of thing that kinda fundamentally gets rid of the problem we were dealing with before. Still need to look into that...)

@jgottula
Copy link
Contributor Author

jgottula commented Jul 5, 2021

Easy reproduction method

Wow. Turns out it's really easy to make this bug manifest even in a non-udev scenario. Basically just trying to open any number of not-already-open zvols at "essentially the same time" (without any particularly amazing precision) will trigger a lot of open() failures. (I encountered this while doing a parallel md5sum job on a bunch of zvol snapshots.)

This command is a super easy way to get a system with a bunch of zvols to vomit a load of errors every single time. At least on a multi-core machine with full preempt and 1000hz tick, in any case. (This uses GNU parallel, with the job count set to 10x as many jobs as there are CPU threads, to spin up lots of instances of dd that all attempt to open() the system's /dev/zd* nodes at roughly the same time.)

# parallel -j1000% nice -n19 dd if=\{\} of=/dev/null status=none bs=4K count=1 ::: /dev/zd*
dd: failed to open '/dev/zd1024': Unknown error 512
dd: failed to open '/dev/zd1040': Unknown error 512
[[ snipped: 76 additional error lines ]]
dd: failed to open '/dev/zd96': Unknown error 512
dd: failed to open '/dev/zd976': Unknown error 512

Somewhat useless statistics

Here are some failure statistics for the dd process open() calls in the command scenario depicted above, for varying parallel job number values, on my FX-8350 ("4C/8T"-ish) machine with 217 total /dev/zd* nodes (86 of which are /dev/zd*p* partition sub-devices):

parallel jobs test runs min median max min% median% max%
-j1 n=5 0 0 0 0% 0% 0%
-j2 n=20 20 31 37 9% 14% 17%
-j3 n=20 34 46 56 16% 21% 26%
-j4 n=20 55 66 75 23% 30% 35%
-j8 n=20 66 76 87 30% 35% 40%
-j217 n=20 73 85 94 34% 39% 43%

@jgottula jgottula changed the title Linux v5.13+ (maybe v5.11+?) broke zvol_first_open lock retry logic, so open() on a zvol dev node now sometimes fails with errno=ERESTARTSYS in userspace (particularly noticeable in udev) Linux v5.13 broke zvol_first_open lock retry logic, so open() on a zvol dev node now sometimes fails with errno=ERESTARTSYS in userspace (particularly noticeable in udev) Jul 6, 2021
@jgottula
Copy link
Contributor Author

jgottula commented Jul 6, 2021

Got an easy opportunity to double-check kernel v5.12 today; I'd had vague suspicions that this problem had possibly reached back into v5.12 or v5.11 one way or another.

I've updated the comments and title here to reflect that this is definitely a v5.13+ problem only.

(2021-07-05: I had a convenient chance to try this today, as I managed to OOM and hardlock my kernel this afternoon by being an idiot, giving me an "opportunity" to "reboot". Reverted kernel back to 5.12.2; cannot reproduce. So it's pretty much 100% guaranteed to be a v5.13+ only thing. Which frankly fits what the code was showing pretty well anyway. I did check back on pre-5.13 boots and confirmed that I was having udev symlink issues at that time; but they must have been due to something unrelated.)

@hiroshimatsuo
Copy link

hiroshimatsuo commented Aug 31, 2021

In my case ``multiple zpools'' cause the wrong /dev/ simlinks during bootup udev action.

My Fedorora34 box has two zpools:
1st contains only one zvol (with no snapshots)
2nd contains no zvols.

After booting buggy symlinks
/dev/file:
/dev/open
/dev/Unable
...
are created as same as explained here.

On the other hand everything goes well if the 2nd zpool is removed.

@behlendorf behlendorf added the Component: ZVOL ZFS Volumes label Aug 31, 2021
@behlendorf behlendorf self-assigned this Aug 31, 2021
@peter-held
Copy link

Any updates on this ?

It seems that the problem exists also in 5.14.8.
The latest kernel version that worked for me was 5.12.15.
I'm using archlinux.

Thanks.

@behlendorf
Copy link
Contributor

We've got a good handle on what the core issue here is as explained above in this #12301 (comment). I'm still investigating how best to deal with it.

@jgottula
Copy link
Contributor Author

jgottula commented Oct 4, 2021

Since I compile my own kernels, I've personally been working around the problem by applying a small kernel patch that reverts a8ed1a0607 (the commit that removed the -ERESTARTSYS try-again logic from the Linux block layer). It does """fix""" things in a fairly airtight way.

Obviously it's not an actual solution per se; and it's not helpful to anyone who isn't compiling their own kernels; but in any case it's been a workable stopgap measure for me while this issue has been pending.

(I'd noticed that some of my "blast SSD contents to a zvol for backup" scripts were occasionally hitting the errno == 512 case when open()ing the zvol dev node, and at some point I got annoyed at writing in extra special-case errno code on the userspace side to fix the issue; so I just went the kernel patch route. 😝)

@jblachly
Copy link

jblachly commented Oct 26, 2021

Transitioning from freebsd, I found myself with /dev/zvol/Unable, which is an ext4 containing zvol.

It sounds like @behlendorf is on it; let me know if any additional information is needed; simply wanted to provide datapoint with versions below.

$ uname -a && sudo zfs version
Linux ubuntu 5.13.0-20-generic #20-Ubuntu SMP Fri Oct 15 14:21:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
zfs-2.0.6-1ubuntu2
zfs-kmod-2.0.6-1ubuntu2

and interestingly, after a bit of poking around, the Unable symlink disappeared (whereas the underlying block device /dev/zd192 in this case, stuck around and remains usable)

behlendorf added a commit to behlendorf/zfs that referenced this issue Nov 9, 2021
This test case may fail on 5.13 and newer Linux kernels if the
/dev/zvol/ device is not created by udev.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
behlendorf added a commit to behlendorf/zfs that referenced this issue Nov 13, 2021
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
behlendorf added a commit to behlendorf/zfs that referenced this issue Nov 13, 2021
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
behlendorf added a commit to behlendorf/zfs that referenced this issue Nov 13, 2021
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Nov 13, 2021
This test case may fail on 5.13 and newer Linux kernels if the
/dev/zvol/ device is not created by udev.

Reviewed-by: Rich Ercolani <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
Closes  openzfs#12738
behlendorf added a commit to behlendorf/zfs that referenced this issue Nov 13, 2021
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
@behlendorf
Copy link
Contributor

behlendorf commented Nov 13, 2021

@tonyhutter I've been mulling this over for a while now, and unfortunately I don't really see a significantly cleaner way to handle this. I've gone ahead and opened PR #12759 with a variant of the change you suggested above.

@Bronek
Copy link

Bronek commented Nov 23, 2021

I think I am affected by this as well; the /dev/zvol symlinks are very unreliable (i.e. some of them randomly do not appear after boot) on my system, which is running kernel 5.14. The problem does not appear under kernel 5.10.

When trying to reset /dev/zvol symlinks, I can see errors like below:

root@gdansk ~ # partprobe -s
/dev/sda: gpt partitions 1 2
/dev/sdb: gpt partitions 1 2
/dev/sdc: gpt partitions 1 9
/dev/sdd: gpt partitions 1 9
/dev/sde: gpt partitions 1 9
/dev/sdf: gpt partitions 1 9
/dev/nvme0n1: gpt partitions 1 2 3 4 5 6 7 8 9 10 11 12 13 14
/dev/zd192: gpt partitions 1 2 3 4 5
/dev/zd0: msdos partitions 1
/dev/zd96: gpt partitions 1 2 3
/dev/sdi: gpt partitions 1 9
/dev/zd48: gpt partitions 1 2
/dev/sdg: gpt partitions 1 9
/dev/zd288: gpt partitions 1
/dev/zd256: gpt partitions 1
/dev/zd32: gpt partitions 1 2
Error: Error opening /dev/zd208: Unknown error 512
Error: Error opening /dev/zd272: Unknown error 512
/dev/sdj: gpt partitions 1 9
/dev/zd224: gpt partitions 1
/dev/zd176: gpt partitions 1
/dev/sdh: gpt partitions 1 9
/dev/zd128: gpt partitions 1 2

@sempervictus
Copy link
Contributor

sempervictus commented Nov 28, 2021

@behlendorf & @tonyhutter: Won't this get worse when we have namespace delegation to a ton of containers brute forcing their way through the acquisition of that spa_namespace_lock?
Also, couldn't we be a bit more specific in the retry logic such as to match on ERESTARTSYS for the retry and actually fail hard on other errors?

@behlendorf
Copy link
Contributor

@sempervictus it's definitely not the ideal solution, that is going to require more extensive changes to how the spa_namespace_lock() is used (and probably converting it to a r/w lock). But this should at least largely address the immediate problem. Unfortunately, we can't make the retry logic any more targeted since those changes would need to be made in the Linux VFS code which we can't change.

@Bronek
Copy link

Bronek commented Nov 29, 2021

My problem was fixed with PR #12759 , applied on top of 2.1.1

behlendorf added a commit to behlendorf/zfs that referenced this issue Nov 30, 2021
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
tonynguien pushed a commit that referenced this issue Dec 2, 2021
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #12301
Closes #12759
@jgottula
Copy link
Contributor Author

jgottula commented Dec 2, 2021

This should be fixed, now that #12759 is merged. Thanks to @tonyhutter and @behlendorf for coming up with the fix and to those who helped test it.

I still have some additional testing of that PR pending; specifically to make sure that anyone who applied the "undo the commit in 5.13 that removed the -ERESTARTSYS retry path from the block layer" patch, and then neglects to unapply it when updating zfs to a version incorporating #12759, won't run into any particular weirdness related to the combination of old block-layer retry logic and new zfs retry logic. (Brief discussion on that over here.)


@behlendorf & @tonyhutter: Won't this get worse when we have namespace delegation to a ton of containers brute forcing their way through the acquisition of that spa_namespace_lock? Also, couldn't we be a bit more specific in the retry logic such as to match on ERESTARTSYS for the retry and actually fail hard on other errors?

@sempervictus it's definitely not the ideal solution, that is going to require more extensive changes to how the spa_namespace_lock() is used (and probably converting it to a r/w lock). But this should at least largely address the immediate problem. Unfortunately, we can't make the retry logic any more targeted since those changes would need to be made in the Linux VFS code which we can't change.

Shall I close this issue and leave it to you folks to create new issues for these additional/follow-on concerns/ideas/future-improvements? Or should I just keep this one open?

@behlendorf
Copy link
Contributor

Let's go ahead and close this out now that the fix has been merged. We can open new issues for any remaining followup work.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Dec 6, 2021
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
Closes openzfs#12759
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Dec 6, 2021
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
Closes openzfs#12759
nicman23 pushed a commit to nicman23/zfs that referenced this issue Aug 22, 2022
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
Closes openzfs#12759
nicman23 pushed a commit to nicman23/zfs that referenced this issue Aug 22, 2022
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
Closes openzfs#12759
ionutnechita pushed a commit to sunlightlinux/linux-sunlight that referenced this issue Oct 7, 2022
Description:
 - This reverts commit a8ed1a0.

 - It breaks ZFS sometimes:
   openzfs/zfs#12301 (comment)

Bug: N/A
Change-Id: I1a17247218d65a1cd1d6822c37075de922301bfd
Signed-off-by: Thomas Lamprecht <[email protected]>
Signed-off-by: Ionut Nechita <[email protected]>
ionutnechita pushed a commit to sunlightlinux/linux-sunlight that referenced this issue Oct 9, 2022
Description:
 - This reverts commit a8ed1a0.

 - It breaks ZFS sometimes:
   openzfs/zfs#12301 (comment)

Bug: N/A
Change-Id: I1a17247218d65a1cd1d6822c37075de922301bfd
Signed-off-by: Thomas Lamprecht <[email protected]>
Signed-off-by: Ionut Nechita <[email protected]>
ionutnechita pushed a commit to sunlightlinux/linux-sunlight that referenced this issue Oct 11, 2022
Description:
 - This reverts commit a8ed1a0.

 - It breaks ZFS sometimes:
   openzfs/zfs#12301 (comment)

Bug: N/A
Change-Id: I1a17247218d65a1cd1d6822c37075de922301bfd
Signed-off-by: Thomas Lamprecht <[email protected]>
Signed-off-by: Ionut Nechita <[email protected]>
ionutnechita pushed a commit to sunlightlinux/linux-sunlight that referenced this issue Oct 15, 2022
Description:
 - This reverts commit a8ed1a0.

 - It breaks ZFS sometimes:
   openzfs/zfs#12301 (comment)

Bug: N/A
Change-Id: I1a17247218d65a1cd1d6822c37075de922301bfd
Signed-off-by: Thomas Lamprecht <[email protected]>
Signed-off-by: Ionut Nechita <[email protected]>
ionutnechita pushed a commit to sunlightlinux/linux-sunlight that referenced this issue Oct 22, 2022
Description:
 - This reverts commit a8ed1a0.

 - It breaks ZFS sometimes:
   openzfs/zfs#12301 (comment)

Bug: N/A
Change-Id: I1a17247218d65a1cd1d6822c37075de922301bfd
Signed-off-by: Thomas Lamprecht <[email protected]>
Signed-off-by: Ionut Nechita <[email protected]>
ionutnechita pushed a commit to sunlightlinux/linux-sunlight that referenced this issue Oct 29, 2022
Description:
 - This reverts commit a8ed1a0.

 - It breaks ZFS sometimes:
   openzfs/zfs#12301 (comment)

Bug: N/A
Change-Id: I1a17247218d65a1cd1d6822c37075de922301bfd
Signed-off-by: Thomas Lamprecht <[email protected]>
Signed-off-by: Ionut Nechita <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: ZVOL ZFS Volumes Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

8 participants