
zfs-mount-generator: canmount=noauto for backups causing mount problems at boot #10530

Closed

aerusso opened this issue Jul 3, 2020 · 6 comments

aerusso commented Jul 3, 2020

This is a request for a discussion, particularly with @rlaager and @InsanePrawn who, with me, worked on #9649, which is responsible for the behavior in Debian bug report 962424.

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Debian |
| Distribution Version | stable (backported zfs) |
| Linux Kernel | 5.5.0-0.bpo.2-amd64 (irrelevant: issue is in zfs-mount-generator) |
| Architecture | Linux |
| ZFS Version | 0.8.4-1~bpo10+1 |

Describe the problem you're observing

The user complains that their system breaks when the mount generator is enabled and recovers when it is disabled. They have determined that

> the generator was trying to mount multiple datasets to the same mountpoints (/, /usr/, ...) which obviously breaks... everything.

Describe how to reproduce the problem

The user has provided a copy of the responsible zfs-list.cache file for rpool.

Indeed, running zfs-mount-generator on this (modifying FSLIST so it can be run as an unprivileged user) produces a systemd unit usr.mount; a reproduction sketch follows below. Per @InsanePrawn's comment (and warning):

> Now if anything else creates a mount unit B for /home/a/b, the activation of B will also pull in our A and relevant keyload units.

This means that an attempt is made to mount the user's rpool/backups/support/rpool/usr at /usr. In their case, this is, I think, an empty dataset used only as a parent for their /usr/local/ dataset.
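
For reference, a minimal sketch of reproducing this as an unprivileged user (paths are illustrative; the generator's install location varies by distribution):

```sh
# Stage the user's cache file where the modified generator will find it.
mkdir -p /tmp/fslist /tmp/units
cp rpool.txt /tmp/fslist/rpool

# FSLIST is hard-coded near the top of the generator script, so edit a local copy.
cp /lib/systemd/system-generators/zfs-mount-generator /tmp/
sed -i 's|^FSLIST=.*|FSLIST="/tmp/fslist"|' /tmp/zfs-mount-generator

# systemd generators receive three output directories (normal, early, late).
/tmp/zfs-mount-generator /tmp/units /tmp/units /tmp/units
ls /tmp/units    # usr.mount should appear among the generated units
```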

Include any warning/errors/backtraces from the system logs

No access to the system, but presumably the whole system falls apart trying to mount the backup's /usr, probably failing because the mountpoint isn't empty.
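
If that guess is right, the failure would look something like this when attempted by hand (hypothetical output, since we have no logs from the system):

```sh
zfs mount rpool/backups/support/rpool/usr
# cannot mount '/usr': directory is not empty
```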

Discussion

I think this is a user configuration error: the user (apparently) went through and ran

```sh
zfs set canmount=noauto $dataset
```

on all their backup datasets. This promoted canmount=off datasets to canmount=noauto, which is why this is happening. Moreover, the user definitely does not want their backup mounted at /usr, ever. They should presumably have set mountpoint=/backups (or similar) on their backup parent dataset rpool/backups.
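
A sketch of one possible fix along those lines, assuming the backup tree lives under rpool/backups (the property choices are illustrative):

```sh
# mountpoint is inherited, so this moves the whole tree out from under /;
# the dataset above would then map to /backups/support/rpool/usr.
zfs set mountpoint=/backups rpool/backups

# The parent itself need not be mounted at all (canmount is not inherited,
# so this affects only rpool/backups).
zfs set canmount=off rpool/backups
```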

I don't think we have an obligation to try to force sane behavior when the explicitly stated configuration is not sane.

But, still, this is new behavior: in the past, canmount=noauto meant a dataset was never mounted unless there was an explicit call to zfs mount. There will be no solution to this problem unless we roll back creating units for canmount=noauto, and I don't think we want to do that. I maintain that a global configuration flag to disable the canmount behavior is inappropriate; it introduces too many configuration branches to think about as a user, a developer, and a tester.

We should decide on an official position so that I can address the Debian bug report and possibly update the documentation. This is still relatively new behavior, hybridizing ZFS and systemd mount-point logic, and it is apparently confusing users.

InsanePrawn commented

> This is a request for a discussion, particularly with @rlaager and @InsanePrawn who, with me, worked on #9406, which is responsible for the behavior in Debian bug report 962424.

wrong pr :)

> I think this is a user configuration error: the user (apparently) went through and ran
>
> `zfs set canmount=noauto $dataset`
>
> on all their backup datasets. This promoted canmount=off datasets to canmount=noauto, which is why this is happening. Moreover, the user definitely does not want their backup mounted at /usr, ever. They should presumably have set mountpoint=/backups (or similar) on their backup parent dataset rpool/backups.
>
> I don't think we have an obligation to try to force sane behavior when the explicitly stated configuration is not sane.

I 100% agree; everything is working as expected, given the caveat of how systemd 'recursively' activates the mounts. org.openzfs.systemd:ignore exists for this reason; the user just needs to apply it to all [previously] canmount=off datasets (or just the whole backup tree, since user properties are inherited).
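
For example (a sketch; rpool/backups is the user's backup parent):

```sh
# User properties are inherited, so one command covers the whole tree;
# the generator then skips these datasets entirely.
zfs set org.openzfs.systemd:ignore=on rpool/backups
```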

Actually, while looking at rpool.txt and the generated units in detail, we can see that the real culprit is canmount=on:

```
rpool/backups/support/rpool/usr /usr on on on off on off on off - none - - - - - - - -
```

> But, still, this is new behavior: in the past, canmount=noauto meant a dataset was never mounted unless there was an explicit call to zfs mount. There will be no solution to this problem unless we roll back creating units for canmount=noauto, and I don't think we want to do that. I maintain that a global configuration flag to disable the canmount behavior is inappropriate; it introduces too many configuration branches to think about as a user, a developer, and a tester.
>
> We should decide on an official position so that I can address the Debian bug report and possibly update the documentation. This is still relatively new behavior, hybridizing ZFS and systemd mount-point logic, and it is apparently confusing users.

Since, during design, everyone seemed to think noauto in combination with the systemd mounting behaviour was not a problem (it can also hit you with canmount=on datasets, as seen above, since we have no way to prioritize between two canmount=on datasets for the same mountpoint without user intervention), we just need to communicate the changes. I put a paragraph about it in the 'UNIT ORDERING AND DEPENDENCIES' section of the zfs-mount-generator(8) man page.

I think the biggest mistake here was slipping the "new" generator into 0.8.4 without giving it proper publicity, so people didn't expect the behaviour change. After development, I thought this would reach the stable releases with OpenZFS 2.0 and would be mentioned/explained in a huge list of new features.

We can work on improving the existing documentation, but I think the real problem is that people didn't know to revisit the documentation.

Someone should submit a talk for the dev summit this year :)


aerusso commented Jul 3, 2020

@InsanePrawn Somehow a modified version of the rpool cache file, with canmount=noauto, wound up being uploaded; I've updated the upload.


aerusso commented Jul 3, 2020

Sorry, I don't know what I was thinking; I've replaced the GitHub upload with a direct link to the user's upload in the Debian BTS.


rlaager commented Jul 5, 2020

I've only been skimming this, but I still don't have a good understanding of what is going on. I only see these two as having canmount=on:

rpool/backups/kirby/backups
rpool/backups/local

I get how those two (or even just the first one, since they share a parent) would pull in /backups/kirby and /backups. But what is pulling in a mount for /usr?


aerusso commented Jul 5, 2020

Yeah, it's tricky: when the canmount=on unit for the dataset rpool/usr/local is mounted, systemd also attempts to mount the canmount=noauto unit created for the dataset rpool/backups/support/rpool/usr.

This is the way systemd understands dependencies between mountpoints: it assumes that you would not want to mount a filesystem deeper in the hierarchy unless you also mount the containing mountpoint.
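
One way to see this on a running system, assuming default systemd behavior (the unit names follow systemd's path escaping and are illustrative):

```sh
# systemd implicitly adds Requires= and After= from a mount unit to the
# mount unit of its parent path, so /usr/local pulls in /usr.
systemctl show -p Requires,After usr-local.mount
# Requires=... usr.mount ...
# After=... usr.mount ...
```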

To be fair to systemd, that's not an unreasonable assumption. The user does not want their rpool/backups/support/rpool/usr dataset mounted, and certainly never at /usr. I can, however, imagine other situations where the assumption might be too aggressive. I maintain that you just shouldn't have systemd manage those mountpoints, though.


rlaager commented Jul 5, 2020

I see now. Thanks! Right, I agree. The user in this case should choose one of: A) not managing backup mountpoints with systemd, or B) mounting them at another location.

This is something that hasn't affected me personally yet, but easily could. So I'm not unsympathetic to the problem. But I don't see any better solution. systemd's behavior isn't unreasonable here (and we can't change it anyway). That leaves us with the choice of either creating units for canmount=noauto, which allows them to be mounted manually but creates this issue, or not creating them, which has the opposite trade-offs. Given that trade-off, I think we should stick with the option that allows more use cases.

rlaager closed this as completed Jul 5, 2020