Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support idmapped mount #13671

Merged
merged 3 commits into from
Oct 19, 2022
Merged

Support idmapped mount #13671

merged 3 commits into from
Oct 19, 2022

Conversation

youzhongyang
Copy link
Contributor

Signed-off-by: Youzhong Yang [email protected]

This pull request adds support for idmapped mount. Although I have tried other approach, changing function signatures to pass 'struct user_namespace' pointer around seems to be the way to go.

Motivation and Context

Idmapped mount is a very useful functionality for some use cases. For example, a zfs dataset is owned by some role account, if a developer clones the dataset to create a sandbox and works on it, the ownership of the files/directories has to be changed to the developer, these changes can be very time-consuming, use memory and disk space too in case of a huge directory tree.

Description

How Has This Been Tested?

Manual testing is done. Automation requires the use of source code such as the one on https://github.com/brauner/mount-idmapped, which may lead to some sort of license issue.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@ghost
Copy link

ghost commented Jul 19, 2022

It's great to see this missing functionality implemented!

Rather than using void * to pass the namespace around, it would be safer to make some zuserns_t or such typedef that can be struct user_namespace on Linux and void on FreeBSD.

@ghost
Copy link

ghost commented Jul 19, 2022

Fixes #12923

@ghost ghost added the Status: Code Review Needed Ready for review and testing label Jul 22, 2022
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great to see implemented! I think the approach you went with is pretty reasonable, even though it does mean modifying a lot of interfaces. One possible alternative would have been to attach the mount namespace to the credential, but I suspect in the end that would have been just as disruptive and probably more confusing.

On the testing front it looks like xfstests has incorporated idmapped mount test coverage for filesystems which support it. If we can manually verify those new tests pass as expected then this looks good to me.

config/kernel-idmap_mnt_api.m4 Outdated Show resolved Hide resolved
@lukts30
Copy link

lukts30 commented Aug 11, 2022

The interaction of the feature with #12263 should be checked carefully.

I think ZFS would end and up being the first filesystem that supports both FS_USERNS_MOUNT and FS_ALLOW_IDMAP. Upsteam Linux file system currently only support either one of these features.

There were no restrictions when the IDMAP feature was initially added to the kernel 5.12 but they were added post merge in 5.13 and are now present in
[5.12;5.16] [1].

	/* Don't yet support filesystem mountable in user namespaces. */
		if (m->mnt_sb->s_user_ns != &init_user_ns)
	 return -EINVAL;
	 
		/* We're not controlling the superblock. */
		if (!capable(CAP_SYS_ADMIN))
	 return -EPERM;

Theses added restrictions where then again removed in 5.17 [2].

	// struct user_namespace *fs_userns = m->mnt_sb->s_user_ns;
	/* We're not controlling the superblock. */
		if (!ns_capable(fs_userns, CAP_SYS_ADMIN))
	 return -EPERM;

@youzhongyang
Copy link
Contributor Author

All the following tests passed:

generic/633
generic/644
generic/645
generic/656
generic/689

root@ubuntu-dev:~/xfstests-dev# src/vfs/vfstest --test-core --device /dev/loop10 --mount /vmgr/idmapped --fstype zfs ; echo $?
0
root@ubuntu-dev:~/xfstests-dev# src/vfs/vfstest --test-fscaps-regression --device /dev/loop10 --mount /vmgr/idmapped --fstype zfs ; echo $?
0
root@ubuntu-dev:~/xfstests-dev# src/vfs/vfstest --test-nested-userns --device /dev/loop10 --mount /vmgr/idmapped --fstype zfs ; echo $?
0
root@ubuntu-dev:~/xfstests-dev# src/vfs/vfstest --test-setattr-fix-968219708108 --device /dev/loop10 --mount /vmgr/idmapped --fstype zfs ; echo $?
0
root@ubuntu-dev:~/xfstests-dev# src/vfs/vfstest --test-setxattr-fix-705191b03d50 --device /dev/loop10 --mount /vmgr/idmapped --fstype zfs ; echo $?
0

@youzhongyang youzhongyang requested review from behlendorf and a user and removed request for behlendorf and a user August 26, 2022 16:43
@youzhongyang youzhongyang requested review from behlendorf and removed request for a user September 9, 2022 13:20
@youzhongyang youzhongyang force-pushed the idmap branch 2 times, most recently from e42bcde to 2f99a88 Compare September 15, 2022 01:55
@behlendorf behlendorf added the Type: Feature Feature request or new feature label Sep 16, 2022
@youzhongyang
Copy link
Contributor Author

The interaction of the feature with #12263 should be checked carefully.

I think ZFS would end and up being the first filesystem that supports both FS_USERNS_MOUNT and FS_ALLOW_IDMAP. Upsteam Linux file system currently only support either one of these features.

There were no restrictions when the IDMAP feature was initially added to the kernel 5.12 but they were added post merge in 5.13 and are now present in [5.12;5.16] [1].

	/* Don't yet support filesystem mountable in user namespaces. */
		if (m->mnt_sb->s_user_ns != &init_user_ns)
	 return -EINVAL;
	 
		/* We're not controlling the superblock. */
		if (!capable(CAP_SYS_ADMIN))
	 return -EPERM;

Theses added restrictions where then again removed in 5.17 [2].

	// struct user_namespace *fs_userns = m->mnt_sb->s_user_ns;
	/* We're not controlling the superblock. */
		if (!ns_capable(fs_userns, CAP_SYS_ADMIN))
	 return -EPERM;

@lukts30 - Thanks for the feedback. You are right, I verified that idmapped mount is not allowed in user namespace on Ubuntu 22.04 (which runs kernel 5.15.0), but there is nothing we can do about it, isn't it? In last few days I also got a chance to review the PR #12263, FS_USERNS_MOUNT and FS_ALLOW_IDMAP are totally two different features, FS_ALLOW_IDMAP is for idmapping on the dataset/mount point, while FS_USERNS_MOUNT is for the visibility and accessibility of the dataset/mount point, yes they are both related to the user namespace, but in a completely different way.

Does this address your concern?

@youzhongyang
Copy link
Contributor Author

I've rebased this PR and CI results look good. It's ready for review. Please let me know if anything else needs to be done.

I understand the approach of modifying a lot of interfaces looks ugly, but so far it is the only reliable one. My attempt of messing with cred_t has been unsuccessful (it looks elegant, passes all manual testing but fails the zfs-tests suites.)

@lukts30
Copy link

lukts30 commented Sep 18, 2022

Thanks for the feedback. You are right, I verified that idmapped mount is not allowed in user namespace on Ubuntu 22.04 (which runs kernel 5.15.0), but there is nothing we can do about it, isn't it?

My concerns are more applicable when the precautious kernel checks are missing (kernel 5.17 or newer).

In last few days I also got a chance to review the PR #12263, FS_USERNS_MOUNT and FS_ALLOW_IDMAP are totally two different features, FS_ALLOW_IDMAP is for idmapping on the dataset/mount point, while FS_USERNS_MOUNT is for the visibility and accessibility of the dataset/mount point, yes they are both related to the user namespace, but in a completely different way.

I am aware that FS_USERNS_MOUNT and FS_ALLOW_IDMAP serve different purposes but they can be used at the same time with kernel 5.17 or newer.

Therefore I want to be sure that a idmapped mount inside of userns with a delegated dataset works as expected.
It should only be possible to idmap already accessible uid/gid ranges with a idmapped mount. No unexpected uid/gid range elevation should be possible.

That might already be guaranteed by the kernel idmapping functions so my concerns might be unwarranted.

@youzhongyang
Copy link
Contributor Author

@lukts30 - I've run the idmapped tests available in xfs-tests suite, on kernel 5.17 and 5.19 in both global zone and zone, here are the results:

Kernel vfstest --test-core vfstest --test-fscaps-regression vfstest --test-nested-userns vfstest --test-setattr-fix-968219708108 vfstest --test-setxattr-fix-705191b03d50 Notes
5.15.0-46-lowlatency PASS PASS PASS PASS PASS Not Applicable to zone (non-default user namespace)
5.17.0-1003-oem PASS in global zone; FAIL in zone PASS PASS PASS FAIL
5.19.0-17-generic PASS in global zone; FAIL in zone PASS PASS PASS PASS

Here 'global zone' refers to default user namespace; 'zone' refers to using non-default user namespace.

There seem to be changes in kernel > 5.15, I will figure out.

@youzhongyang
Copy link
Contributor Author

I've investigated the vfstest --test-core failure when running in a 'zone' (non-default user namespace) on kernel 5.19.0.

Bottom Line on Top: it is expected and nothing we can do about it.

The script I used to run xfstests cases is as follows:

# cat run-idmap-tests
#!/bin/bash

ZPOOL=rpool
# make sure xfstests-dev has already been built
VFSTEST=/root/xfstests-dev/src/vfs/vfstest
# make sure /dev/loop0 is not in-use, run 'losetup' to find out;
# specifying --device is just to make vfstest program happy
DEVICE=/dev/loop0

echo "Run tests in the global 'zone'"
zfs create -o zoned=off -o mountpoint=/userns -o acltype=posix ${ZPOOL}/userns

${VFSTEST} --test-core --device ${DEVICE} --mount /userns --fstype zfs
${VFSTEST} --test-fscaps-regression --device ${DEVICE} --mount /userns --fstype zfs
${VFSTEST} --test-nested-userns --device ${DEVICE} --mount /userns --fstype zfs
${VFSTEST} --test-setattr-fix-968219708108 --device ${DEVICE} --mount /userns --fstype zfs
${VFSTEST} --test-setxattr-fix-705191b03d50 --device ${DEVICE} --mount /userns --fstype zfs

zfs destroy ${ZPOOL}/userns

echo "Run tests in a 'zone'"
zfs create -o zoned=on -o mountpoint=/userns -o acltype=posix ${ZPOOL}/userns
unshare -Um --keep-caps sleep 2h &
SLEEPPID=$!
sleep 1
echo '0 0 4294967295' > /proc/${SLEEPPID}/uid_map
echo '0 0 4294967295' > /proc/${SLEEPPID}/gid_map
zfs zone /proc/${SLEEPPID}/ns/user ${ZPOOL}/userns
nsenter -t ${SLEEPPID} --all zfs mount ${ZPOOL}/userns

nsenter -t ${SLEEPPID} --all ${VFSTEST} --test-core --device ${DEVICE} --mount /userns --fstype zfs
nsenter -t ${SLEEPPID} --all ${VFSTEST} --test-fscaps-regression --device ${DEVICE} --mount /userns --fstype zfs
nsenter -t ${SLEEPPID} --all ${VFSTEST} --test-nested-userns --device ${DEVICE} --mount /userns --fstype zfs
nsenter -t ${SLEEPPID} --all ${VFSTEST} --test-setattr-fix-968219708108 --device ${DEVICE} --mount /userns --fstype zfs
nsenter -t ${SLEEPPID} --all ${VFSTEST} --test-setxattr-fix-705191b03d50 --device ${DEVICE} --mount /userns --fstype zfs

nsenter -t ${SLEEPPID} --all zfs unmount ${ZPOOL}/userns
zfs unzone /proc/${SLEEPPID}/ns/user ${ZPOOL}/userns
kill -TERM ${SLEEPPID}
zfs destroy ${ZPOOL}/userns

The output of running it:

root@ubuntu-dev:~# ./run-idmap-tests
Run tests in the global 'zone'
Run tests in a 'zone'
vfstest.c: 1569: setgid_create - Operation not permitted - failure: mknodat
vfstest.c: 1882: run_test - Success - failure: create operations in directories with setgid bit set

The vfstest --test-core fails here in xfstests-dev/src/vfs/vfstest.c:

/* create a character device via mknodat() vfs_mknod */
if (mknodat(info->t_dir1_fd, CHRDEV1, S_IFCHR | S_ISGID | S_IXGRP, makedev(5, 1)))
        die("failure: mknodat");

The call sequence of mknoddat() is as follows:

mknodat() -> do_mknodat() -> vfs_mknod() -> capable(CAP_MKNOD)

capable(CAP_MKNOD) returns false!

Why?

Again its call sequence is as follows:

capable(int cap) -> ns_capable(&init_user_ns, cap) ->
ns_capable_common(ns, cap, CAP_OPT_NONE) -> 
security_capable(current_cred(), ns, cap, opts) -> 
cap_capable(cred, ns, cap, opts)

I ran the following bpftrace script to find out why:

# cat capable.bt
#include <linux/user_namespace.h>
#include <linux/cred.h>

kretprobe:zpl_mknod,kretprobe:vfs_mknod,kretprobe:security_inode_mknod
/comm == "vfstest"/
{
   printf("%s %d %s ret %d\n", comm, pid, probe, retval);
}

kprobe:capable /comm == "vfstest"/
{
   printf("%s %d %s cap %d\n", comm, pid, probe, arg0);
   @[tid] = 1;
}

kretprobe:capable /comm == "vfstest"/
{
   printf("%s %d %s ret %d\n", comm, pid, probe, retval);
   @[tid] = 0;
   delete(@[tid]);
}

kprobe:cap_capable /comm == "vfstest" && @[tid]/
{
   $cred = (struct cred *) arg0;
   $ns = (struct user_namespace *) arg1;
   printf("%s %d %s cap %d\n", comm, pid, probe, arg2);
   printf("targ_ns %p parent %p owner %d level %d\n", $ns, $ns->parent,
      $ns->owner.val, $ns->level);
   printf("cred euid %d user_ns %p parent %p level %d cap_effective %x %x\n",
      $cred->euid.val, $cred->user_ns, $cred->user_ns->parent,
      $cred->user_ns->level,
      $cred->cap_effective.cap[0], $cred->cap_effective.cap[1]);
}

kretprobe:cap_capable /comm == "vfstest" && @[tid]/
{
   printf("%s %d %s ret %d\n", comm, pid, probe, retval);
}
root@ubuntu-dev:~# nsenter -t ${SLEEPPID} --all ${VFSTEST} --test-core --device ${DEVICE} --mount /userns --fstype zfs
vfstest.c: 1569: setgid_create - Operation not permitted - failure: mknodat
vfstest.c: 1882: run_test - Success - failure: create operations in directories with setgid bit set
root@ubuntu-dev:~# bpftrace capable.bt
Attaching 7 probes...
vfstest 70281 kprobe:capable cap 14
vfstest 70281 kprobe:cap_capable cap 14
targ_ns 0xffffffff9d0812a0 parent (nil) owner 0 level 0
cred euid 0 user_ns 0xffff89e7d21b04e0 parent 0xffffffff9d0812a0 level 1 cap_effective ffffffff 1ff
vfstest 70281 kretprobe:cap_capable ret -1
vfstest 70281 kretprobe:capable ret 0
vfstest 70288 kprobe:capable cap 14
vfstest 70288 kprobe:cap_capable cap 14
targ_ns 0xffffffff9d0812a0 parent (nil) owner 0 level 0
cred euid 0 user_ns 0xffff89e7d21b04e0 parent 0xffffffff9d0812a0 level 1 cap_effective ffffffff 1ff
vfstest 70288 kretprobe:cap_capable ret -1
vfstest 70288 kretprobe:capable ret 0
vfstest 70300 kprobe:capable cap 4
vfstest 70300 kprobe:cap_capable cap 4
targ_ns 0xffffffff9d0812a0 parent (nil) owner 0 level 0
cred euid 0 user_ns 0xffff89e7d21b04e0 parent 0xffffffff9d0812a0 level 1 cap_effective ffffffff 1ff
vfstest 70300 kretprobe:cap_capable ret -1
vfstest 70300 kretprobe:capable ret 0
vfstest 70302 kprobe:capable cap 27
vfstest 70302 kprobe:cap_capable cap 27
targ_ns 0xffffffff9d0812a0 parent (nil) owner 0 level 0
cred euid 0 user_ns 0xffff89e7d21b04e0 parent 0xffffffff9d0812a0 level 1 cap_effective ffffffff 1ff
vfstest 70302 kretprobe:cap_capable ret -1
vfstest 70302 kretprobe:capable ret 0
vfstest 70302 kretprobe:vfs_mknod ret -1

By the way, CAP_MKNOD = 27.

Observing the code of cap_capable() and the above bpftrace output, it's not difficult to understand why cap_capable() returns -EPERM:

  • targ_ns (which is &init_user_ns) is the parent of cred->user_ns!

@lukts30
Copy link

lukts30 commented Sep 20, 2022

Creating device nodes is still a privileged operation and a known limitation of containers not running in the initial user namespace.

Looking at xfstest source code it looks like it only uses mknodat to create a /dev/console (char 5:1) like file. Additionally, normal file creation is done with mknodat which should be an unprivileged operation.

Therefore it might be possible to run unmodified xfstest inside a lxd container with security.syscalls.intercept.mknod true set.
https://linuxcontainers.org/lxd/docs/master/syscall-interception/

@youzhongyang
Copy link
Contributor Author

Creating device nodes is still a privileged operation and a known limitation of containers not running in the initial user namespace.

Looking at xfstest source code it looks like it only uses mknodat to create a /dev/console (char 5:1) like file. Additionally, normal file creation is done with mknodat which should be an unprivileged operation.

Therefore it might be possible to run unmodified xfstest inside a lxd container with security.syscalls.intercept.mknod true set. https://linuxcontainers.org/lxd/docs/master/syscall-interception/

@lukts30 - It does not work. Did I do anything wrong?

zfs create -o zoned=on -o mountpoint=/userns -o acltype=posix rpool/userns
lxc init images:ubuntu/jammy mylxc -c security.syscalls.intercept.mknod=true
lxc config device add mylxc zfs unix-char path=/dev/zfs source=/dev/zfs
lxc config set mylxc raw.idmap 'both 0-999998 0-999998'
lxc start mylxc
CONTAINERS_PID=$(ps -efww | grep 'containers mylxc' | grep -v grep | awk '{print $2}')
LXCINIT_PID=$(ps -efww | awk -v ppid=$CONTAINERS_PID '$3 == ppid' | awk '{print $2}')
zfs zone /proc/$LXCINIT_PID/ns/user rpool/userns
lxc file push -r -p /root/zfs-idmap mylxc/root
lxc exec mylxc -- bash
cd /root/zfs-idmap/
dpkg -i libnvpair3_*.deb libuutil3_*.deb libzfs5_*.deb libzpool5_*.deb zfs_*.deb
zfs mount rpool/userns

#
# build xfstests-dev in container
# ...
#

ZPOOL=rpool
VFSTEST=/root/xfstests-dev/src/vfs/vfstest
DEVICE=/dev/loop0

${VFSTEST} --test-core --device ${DEVICE} --mount /userns --fstype zfs

	vfstest.c: 1569: setgid_create - Operation not permitted - failure: mknodat
	vfstest.c: 1882: run_test - Success - failure: create operations in directories with setgid bit set

@youzhongyang
Copy link
Contributor Author

Commenting out mknodat() for character device in setgid_create(), now it fails at linkat():

                /* create tmpfile via filesystem tmpfile api */
                if (supported) {
                        tmpfile_fd = openat(info->t_dir1_fd, ".", O_TMPFILE | O_RDWR, S_IXGRP | S_ISGID);
                        if (tmpfile_fd < 0)
                                die("failure: create");
                        /* link the temporary file into the filesystem, making it permanent */
                        if (linkat(tmpfile_fd, "", info->t_dir1_fd, FILE3, AT_EMPTY_PATH))
                                die("failure: linkat");

which is due to capable(CAP_DAC_READ_SEARCH) returning false, very much similar to the above capable(CAP_MKNOD) case.

	 * To use null names we require CAP_DAC_READ_SEARCH
	 * This ensures that not everyone will be able to create
	 * handlink using the passed filedescriptor.
	 */
	if (flags & AT_EMPTY_PATH && !capable(CAP_DAC_READ_SEARCH)) {
		error = -ENOENT;
		goto out_putnames;
	}

Conclusion: I don't think xfstests idmapped test cases are designed to run in a non-default user namespace.

Youzhong Yang added 2 commits October 13, 2022 14:17
Signed-off-by: Youzhong Yang <[email protected]>
@youzhongyang
Copy link
Contributor Author

Updates:

  • Created idmap_util.c which can check if idmapped mount is supported or not, remount a folder with the given idmap information
  • A few test cases created
  • Rebased against master 5 days ago
  • Ran xfstests cases, all good
--test-core
--test-fscaps-regression
--test-nested-userns
--test-setattr-fix-968219708108
--test-setxattr-fix-705191b03d50
--test-setgid-create-umask
--test-setgid-create-acl

It's ready for review. Please take a look. Without any feedback, I have no idea what to do next.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be is good shape and it's great to see this implemented. Thanks for sticking with it, and adding the new test cases, and doing the manual testing. This is ready to merge.

@behlendorf behlendorf requested a review from a user October 18, 2022 18:04
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Oct 19, 2022
@behlendorf behlendorf merged commit 2a068a1 into openzfs:master Oct 19, 2022
@behlendorf
Copy link
Contributor

Merged. While we aren't aware of any overlooked corner cases, and everything passes xfstests, any additional manual testing would be welcome. This should be a bit easier now that it is merged. Thanks!

andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Oct 20, 2022
Adds support for idmapped mounts.  Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes openzfs#12923
Closes openzfs#13671
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Oct 21, 2022
Adds support for idmapped mounts.  Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes openzfs#12923
Closes openzfs#13671
andrewc12 pushed a commit to openzfsonwindows/openzfs that referenced this pull request Oct 21, 2022
Adds support for idmapped mounts.  Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes openzfs#12923
Closes openzfs#13671
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Oct 21, 2022
Adds support for idmapped mounts.  Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes openzfs#12923
Closes openzfs#13671
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Oct 21, 2022
Adds support for idmapped mounts.  Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes openzfs#12923
Closes openzfs#13671
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Nov 9, 2022
Adds support for idmapped mounts.  Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes openzfs#12923
Closes openzfs#13671
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Nov 9, 2022
Adds support for idmapped mounts.  Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes openzfs#12923
Closes openzfs#13671
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Nov 9, 2022
Adds support for idmapped mounts.  Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes openzfs#12923
Closes openzfs#13671
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this pull request Sep 26, 2023
* etc: mask zfs-load-key.service

Otherwise, systemd-sysv-generator will generate a service equivalent
that breaks the boot: under systemd this is covered by
zfs-mount-generator

We already do this for zfs-import.service, and other init scripts are
suppressed automatically by the "actual" .service files

Fixes: commit f04b976 ("Add init script
 to load keys")
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes openzfs#14010
Closes openzfs#14019

* Linux: Remove ZFS_AC_KERNEL_SRC_MODULE_PARAM_CALL_CONST autotools check

On older kernels, the definition for `module_param_call()` typecasts
function pointers to `(void *)`, which triggers -Werror, causing the
check to return false when it should return true.

Fixing this breaks the build process on some older kernels because they
define a `__check_old_set_param()` function in their headers that checks
for a non-constified `->set()`. We workaround that through the c
preprocessor by defining `__check_old_set_param(set)` to `(set)`, which
prevents the build failures.

However, it is now apparent that all kernels that we support have
adopted the GRSecurity change, so there is no need to have an explicit
autotools check for it anymore. We therefore remove the autotools check,
while adding the workaround to our headers for the build time
non-constified `->set()` check done by older kernel headers.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#13984
Closes openzfs#14004

* Cleanup: 64-bit kernel module parameters should use fixed width types

Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.

Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.

After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:

Parameters that were originally 64-bit on Illumos:

 * dbuf_cache_max_bytes
 * dbuf_metadata_cache_max_bytes
 * l2arc_feed_min_ms
 * l2arc_feed_secs
 * l2arc_headroom
 * l2arc_headroom_boost
 * l2arc_write_boost
 * l2arc_write_max
 * metaslab_aliquot
 * metaslab_force_ganging
 * zfetch_array_rd_sz
 * zfs_arc_max
 * zfs_arc_meta_limit
 * zfs_arc_meta_min
 * zfs_arc_min
 * zfs_async_block_max_blocks
 * zfs_condense_max_obsolete_bytes
 * zfs_condense_min_mapping_bytes
 * zfs_deadman_checktime_ms
 * zfs_deadman_synctime_ms
 * zfs_initialize_chunk_size
 * zfs_initialize_value
 * zfs_lua_max_instrlimit
 * zfs_lua_max_memlimit
 * zil_slog_bulk

Parameters that were originally 32-bit on Illumos:

 * zfs_per_txg_dirty_frees_percent

Parameters that were originally `ssize_t` on Illumos:

 * zfs_immediate_write_sz

Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.

Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:

 * l2arc_rebuild_blocks_min_l2size
 * zfs_key_max_salt_uses
 * zfs_max_log_walking
 * zfs_max_logsm_summary_length
 * zfs_metaslab_max_size_cache_sec
 * zfs_min_metaslabs_to_flush
 * zfs_multihost_interval
 * zfs_unflushed_log_block_max
 * zfs_unflushed_log_block_min
 * zfs_unflushed_log_block_pct
 * zfs_unflushed_max_mem_amt
 * zfs_unflushed_max_mem_ppm

New parameters that do not exist in Illumos:

 * l2arc_trim_ahead
 * vdev_file_logical_ashift
 * vdev_file_physical_ashift
 * zfs_arc_dnode_limit
 * zfs_arc_dnode_limit_percent
 * zfs_arc_dnode_reduce_percent
 * zfs_arc_meta_limit_percent
 * zfs_arc_sys_free
 * zfs_deadman_ziotime_ms
 * zfs_delete_blocks
 * zfs_history_output_max
 * zfs_livelist_max_entries
 * zfs_max_async_dedup_frees
 * zfs_max_nvlist_src_size
 * zfs_rebuild_max_segment
 * zfs_rebuild_vdev_limit
 * zfs_unflushed_log_txg_max
 * zfs_vdev_max_auto_ashift
 * zfs_vdev_min_auto_ashift
 * zfs_vnops_read_chunk_size
 * zvol_max_discard_blocks

Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.

A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:

 * zfs_delete_blocks
 * zfs_key_max_salt_uses
 * zvol_max_discard_blocks

The documentation for a few parameters was found to be incorrect:

 * zfs_deadman_checktime_ms - incorrectly documented as int
 * zfs_delete_blocks - not documented as Linux only
 * zfs_history_output_max - incorrectly documented as int
 * zfs_vnops_read_chunk_size - incorrectly documented as long
 * zvol_max_discard_blocks - incorrectly documented as ulong

The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.

In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:

 * vdev_file_logical_ashift
 * vdev_file_physical_ashift
 * zfs_arc_dnode_limit_percent
 * zfs_arc_dnode_reduce_percent
 * zfs_arc_meta_limit_percent
 * zfs_per_txg_dirty_frees_percent
 * zfs_unflushed_log_block_pct
 * zfs_vdev_max_auto_ashift
 * zfs_vdev_min_auto_ashift

Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.

Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Original-patch-by: Andrew Innes <[email protected]>
Original-patch-by: Jorgen Lundman <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#13984
Closes openzfs#14004

* cstyle: Allow URLs in C++ comments

If a C++ comment contained a URL, the `://` part of the URL would
trigger an error because there was no trailing blank, but trailing
blanks make for an invalid URL.  Modify the check to ignore text
within the C++ comment.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Lindee <[email protected]>
Closes openzfs#13987

* zfs_domount: fix double-disown of dataset / double-free of zfsvfs_t

Before this patch, in zfs_domount, if zfs_root or d_make_root fails, we
leave zfsvfs != NULL. This will lead to execution of the error handling
`if` statement at the `out` label, and hence to a call to
dmu_objset_disown and zfsvfs_free.

However, zfs_umount, which we call upon failure of zfs_root and
d_make_root already does dmu_objset_disown and zfsvfs_free.

I suppose this patch rather adds to the brittleness of this part of the
code base, but I don't want to invest more time in this right now.
To add a regression test, we'd need some kind of fault injection
facility for zfs_root or d_make_root, which doesn't exist right now.
And even then, I think that regression test would be too closely tied
to the implementation.

To repro the double-disown / double-free, do the following:
1. patch zfs_root to always return an error
2. mount a ZFS filesystem

Here's the stack trace you would see then:

  VERIFY3(ds->ds_owner == tag) failed (0000000000000000 == ffff9142361e8000)
  PANIC at dsl_dataset.c:1003:dsl_dataset_disown()
  Showing stack for process 28332
  CPU: 2 PID: 28332 Comm: zpool Tainted: G           O      5.10.103-1.nutanix.el7.x86_64 #1
  Call Trace:
   dump_stack+0x74/0x92
   spl_dumpstack+0x29/0x2b [spl]
   spl_panic+0xd4/0xfc [spl]
   dsl_dataset_disown+0xe9/0x150 [zfs]
   dmu_objset_disown+0xd6/0x150 [zfs]
   zfs_domount+0x17b/0x4b0 [zfs]
   zpl_mount+0x174/0x220 [zfs]
   legacy_get_tree+0x2b/0x50
   vfs_get_tree+0x2a/0xc0
   path_mount+0x2fa/0xa70
   do_mount+0x7c/0xa0
   __x64_sys_mount+0x8b/0xe0
   do_syscall_64+0x38/0x50
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
Signed-off-by: Christian Schwarz <[email protected]>
Closes openzfs#14025

* Fix potential NULL pointer dereference in lzc_ioctl()

Users are allowed to pass NULL to resultp, but we unconditionally assume
that they never do. When an external user does pass NULL to resultp, we
dereference a NULL pointer.

Clang's static analyzer complained about this.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14008

* Cleanup: Address Clang's static analyzer's unused code complaints

These were categorized as the following:

 * Dead assignment		23
 * Dead increment		4
 * Dead initialization		6
 * Dead nested assignment	18

Most of these are harmless, but since actual issues can hide among them,
we correct them.

That said, there were a few return values that were being ignored that
appeared to merit some correction:

 * `destroy_callback()` in `cmd/zfs/zfs_main.c` ignored the error from
   `destroy_batched()`. We handle it by returning -1 if there is an
   error.

 * `zfs_do_upgrade()` in `cmd/zfs/zfs_main.c` ignored the error from
   `zfs_for_each()`. We handle it by doing a binary OR of the error
   value from the subsequent `zfs_for_each()` call to the existing
   value. This is how errors are mostly handled inside `zfs_for_each()`.
   The error value here is passed to exit from the zfs command, so doing
   a binary or on it is better than what we did previously.

 * `get_zap_prop()` in `module/zfs/zcp_get.c` ignored the error from
   `dsl_prop_get_ds()` when the property is not of type string. We
   return an error when it does. There is a small concern that the
   `zfs_get_temporary_prop()` call would handle things, but in the case
   that it does not, we would be pushing an uninitialized numval onto
   the lua stack. It is expected that `dsl_prop_get_ds()` will succeed
   anytime that `zfs_get_temporary_prop()` does, so that not giving it a
   chance to fix things is not a problem.

 * `draid_merge_impl()` in `tests/zfs-tests/cmd/draid.c` used
   `nvlist_add_nvlist()` twice in ways in which errors are expected to
   be impossible, so we switch to `fnvlist_add_nvlist()`.

A few notable ones did not merit use of the return value, so we
suppressed it with `(void)`:

 * `write_free_diffs()` in `lib/libzfs/libzfs_diff.c` ignored the error
   value from `describe_free()`. A look through the commit history
   revealed that this was intentional.

 * `arc_evict_hdr()` in `module/zfs/arc.c` did not need to use the
   returned handle from `arc_hdr_realloc()` because it is already
   referenced in lists.

 * `spa_vdev_detach()` in `module/zfs/spa.c` has a comment explicitly
   saying not to use the error from `vdev_label_init()` because whatever
   causes the error could be the reason why a detach is being done.

Unfortunately, I am not presently able to analyze the kernel modules
with Clang's static analyzer, so I could have missed some cases of this.
In cases where reports were present in code that is duplicated between
Linux and FreeBSD, I made a conscious effort to fix the FreeBSD version
too.

After this commit is merged, regressions like dee8934 should become
extremely obvious with Clang's static analyzer since a regression would
appear in the results as the only instance of unused code. That assumes
that Coverity does not catch the issue first.

My local branch with fixes from all of my outstanding non-draft pull
requests shows 118 reports from Clang's static anlayzer after this
patch. That is down by 51 from 169.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Cedric Berger <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#13986

* zstream: allow decompress to fix metadata for uncompressed records

If a record is uncompressed on-disk but the block pointer insists
otherwise, reading it will return EIO.  This commit adds an "off" type
to the "zstream decompress" command.  Using it will set the compression
field in a zfs stream to "off" without changing the record's data.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by:	Alan Somers <[email protected]>
Sponsored by:	Axcient
Closes openzfs#13997

* Fix theoretical array overflow in lua_typename()

Out of the 12 defects in lua that coverity reports, 5 of them involve
`lua_typename()` and out of the dozens of defects in ZFS that lua
reports, 3 of them involve `lua_typename()` due to the ZCP code. Given
all of the uses of `lua_typename()` in the ZCP code, I was surprised
that there were not more. It appears that only 2 were reported because
only 3 called `lua_type()`, which does a defective sanity check that
allows invalid types to be passed.

lua/lua@d4fb848 addressed this in
upstream lua 5.3. Unfortunately, we did not get that fix since we use
lua 5.2 and we do not have assertions enabled in lua, so the upstream
solution would not do anything.

While we could adopt the upstream solution and enable assertions, a
simpler solution is to fix the issue by making `lua_typename()` return
`internal_type_error` whenever it is called with an invalid type. This
avoids the array overflow and if we ever see it appear somewhere, we
will know there is a problem with the lua interpreter.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#13947

* Linux compat: fix DECLARE_EVENT_CLASS() test when ZFS is built-in

ZFS_LINUX_TRY_COMPILE_HEADER macro doesn't take CONFIG_ZFS=y into
account. As a result, on several latest Linux versions, configure
script marks DECLARE_EVENT_CLASS() available for non-GPL when ZFS
is being built as a module, but marks it unavailable when ZFS is
built-in.
Follow the logic of the neighbor macros and adjust
ZFS_LINUX_TRY_COMPILE_HEADER accordingly, so that it doesn't try
to look for a .ko when ZFS is built-in.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
Closes openzfs#14006

* Fix declarations of non-global variables

This patch inserts the `static` keyword to non-global variables,
which where found by the analysis tool smatch.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes openzfs#13970

* Coverity model file update

Upon review, it was found that the model for malloc() was incorrect.

In addition, several general purpose memory allocation functions were
missing models:

 * kmem_vasprintf()
 * kmem_asprintf()
 * kmem_strdup()
 * kmem_strfree()
 * spl_vmem_alloc()
 * spl_vmem_zalloc()
 * spl_vmem_free()
 * calloc()

As an experiment to try to find more bugs, some less than general
purpose memory allocation functions were also given models:

 * zfsvfs_create()
 * zfsvfs_free()
 * nvlist_alloc()
 * nvlist_dup()
 * nvlist_free()
 * nvlist_pack()
 * nvlist_unpack()

Finally, the models were improved using additional coverity primitives:

 * __coverity_negative_sink__()
 * __coverity_writeall0__()
 * __coverity_mark_as_uninitialized_buffer__()
 * __coverity_mark_as_afm_allocated__()

In addition, an attempt to inform coverity that certain modelled
functions read entire buffers was used by adding the following to
certain models:

int first = buf[0];
int last = buf[buflen-1];

It was inspired by the QEMU model file.

No additional false positives were found by this, but it is believed
that the more accurate model file will help to catch false positives in
the future.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14048

* Linux 6.1 compat: change order of sys/mutex.h includes

After Linux 6.1-rc1 came out, the build started failing to build a
couple of the files in the linux spl code due to the mutex_init
redefinition. Moving the sys/mutex.h include to a lower position within
these two files appears to fix the problem.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Coleman Kane <[email protected]>
Closes openzfs#14040

* ZED: Fix uninitialized value reads

Coverity complained about a couple of uninitialized value reads in ZED.

 * zfs_deliver_dle() can pass an uninitialized string to zed_log_msg()
 * An uninitialized sev.sigev_signo is passed to timer_create()

The former would log garbage while the latter is not a real issue, but
we might as well suppress it by initializing the field to 0 for
consistency's sake.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14047

* Fix NULL pointer dereference in zdb

Clang's static analyzer complained that we dereference a NULL pointer in
dump_path() if we return 0 when there is an error.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14044

* fm_fmri_hc_create() must call va_end() before returning

clang-tidy caught this.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14044

* Fix NULL pointer passed to strlcpy from zap_lookup_impl()

Clang's static analyzer pointed out that whenever zap_lookup_by_dnode()
is called, we have the following stack where strlcpy() is passed a NULL
pointer for realname from zap_lookup_by_dnode():

strlcpy()
zap_lookup_impl()
zap_lookup_norm_by_dnode()
zap_lookup_by_dnode()

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14044

* Fix NULL pointer dereference in spa_open_common()

Calling spa_open() will pass a NULL pointer to spa_open_common()'s
config parameter. Under the right circumstances, we will dereference the
config parameter without doing a NULL check.

Clang's static analyzer found this.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14044

* set_global_var() should not pass NULL pointers to dlclose()

Both Coverity and Clang's static analyzer caught this.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14044

* Fix possible NULL pointer dereference in sha2_mac_init()

If mechanism->cm_param is NULL, passing mechanism to
PROV_SHA2_GET_DIGEST_LEN() will dereference a NULL pointer.

Coverity reported this.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14044

* Cleanup: Simplify userspace abd_free_chunks()

Clang's static analyzer complained that we could use after free here if
the inner loop ever iterated. That is a false positive, but upon
inspection, the userland abd_alloc_chunks() function never will put
multiple consecutive pages into a `struct scatterlist`, so there is no
need to loop. We delete the inner loop.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14042

* Cleanup: Delete unnecessary pointer check from vdev_to_nvlist_iter()

This confused Clang's static analyzer, making it think there was a
possible NULL pointer dereference. There is no NULL pointer dereference.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14042

* Cleanup: metaslab_alloc_dva() should not NULL check mg->mg_next

This is a circularly linked list. mg->mg_next can never be NULL.

This caused 3 defect reports in Coverity.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14042

* Cleanup: zvol_add_clones() should not NULL check dp

It is never NULL because we return early if dsl_pool_hold() fails.

This caused Coverity to complain.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14042

* Cleanup: Delete dead code from send_merge_thread()

range is always deferenced before it reaches this check, such that the
kmem_zalloc() call is never executed.

There is also no need to set `range->eos_marker = B_TRUE` because it is
already set.

Coverity incorrectly complained about a potential NULL pointer
dereference because of this.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14042

* Cleanup: Remove NULL pointer check from dmu_send_impl()

The pointer is to a structure member, so it is never NULL.

Coverity complained about this.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14042

* Fix memory leaks in dmu_send()/dmu_send_obj()

If we encounter an EXDEV error when using the redacted snapshots
feature, the memory used by dspp.fromredactsnaps is leaked.

Clang's static analyzer caught this during an experiment in which I had
annotated various headers in an attempt to improve the results of static
analysis.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#13973

* Support idmapped mount

Adds support for idmapped mounts.  Supported as of Linux 5.12 this 
functionality allows user and group IDs to be remapped without changing 
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes openzfs#12923
Closes openzfs#13671

* Fix sequential resilver drive failure race condition

This patch handles the race condition on simultaneous failure of
2 drives, which misses the vdev_rebuild_reset_wanted signal in
vdev_rebuild_thread. We retry to catch this inside the
vdev_rebuild_complete_sync function.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Dipak Ghosh <[email protected]>
Reviewed-by: Akash B <[email protected]>
Signed-off-by: Samuel Wycliffe J <[email protected]>
Closes openzfs#14041
Closes openzfs#14050

* Add options to zfs redundant_metadata property

Currently, additional/extra copies are created for metadata in
addition to the redundancy provided by the pool(mirror/raidz/draid),
due to this 2 times more space is utilized per inode and this decreases
the total number of inodes that can be created in the filesystem. By
setting redundant_metadata to none, no additional copies of metadata
are created, hence can reduce the space consumed by the additional
metadata copies and increase the total number of inodes that can be
created in the filesystem.  Additionally, this can improve file create
performance due to the reduced amount of metadata which needs
to be written.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Dipak Ghosh <[email protected]>
Signed-off-by: Akash B <[email protected]>
Closes openzfs#13680

* Fix userland memory leak in zfs_do_send()

Clang 15's static analyzer caught this.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14045

* Fix theoretical use of uninitialized values

Clang's static analyzer complains about this.

In get_configs(), if we have an invalid configuration that has no top
level vdevs, we can read a couple of uninitialized variables. Aborting
upon seeing this would break the userland tools for healthy pools, so we
instead initialize the two variables to 0 to allow the userland tools to
continue functioning for the pools with valid configurations.

In zfs_do_wait(), if no wait activities are enabled, we read an
uninitialized error variable.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14043

* Silence static analyzer warnings about spa_sync_props()

Both Coverity and Clang's static analyzer complain about reading an
uninitialized intval if the property is not passed as DATA_TYPE_UINT64
in the nvlist. This is impossible becuase spa_prop_validate() already
checked this, but they are unlikely to be the last static analyzers to
complain about this, so lets just refactor the code to suppress the
warnings.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14043

* crypto_get_ptrs() should always write to *out_data_2

Callers will check if it has been set to NULL before trying to access
it, but never initialize it themselves. Whenever "one block spans two
iovecs", `crypto_get_ptrs()` will return, without ever setting
`*out_data_2 = NULL`. The caller will then do a NULL check against the
uninitailized pointer and if it is not zero, pass it to `memcpy()`.

The only reason this has not caused horrible runtime issues is because
`memcpy()` should be told to copy zero bytes when this happens. That
said, this is technically undefined behavior, so we should correct it so
that future changes to the code cannot trigger it.

Clang's static analyzer found this with the help of CodeChecker's CTU
analysis.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14043

* abd_return_buf() should call zfs_refcount_remove_many() early

Calling zfs_refcount_remove_many() after freeing memory means we pass a
reference to freed memory as the holder. This is not believed to be able
to cause a problem, but there is a bit of a tradition of fixing these
issues when they appear so that they do not obscure more serious issues
in static analyzer output, so we fix this one too.

Clang's static analyzer found this with the help of CodeChecker's CTU
analysis.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14043

* Add defensive assertion to vdev_queue_aggregate()

a6ccb36 had been intended to include
this to silence Coverity reports, but this one was missed by mistake.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14043

* Fix build failures

Co-authored-by: наб <[email protected]>
Co-authored-by: Richard Yao <[email protected]>
Co-authored-by: ColMelvin <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
Co-authored-by: Alan Somers <[email protected]>
Co-authored-by: Alexander <[email protected]>
Co-authored-by: Tino Reichardt <[email protected]>
Co-authored-by: Coleman Kane <[email protected]>
Co-authored-by: youzhongyang <[email protected]>
Co-authored-by: samwyc <[email protected]>
Co-authored-by: Akash B <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested) Type: Feature Feature request or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants