Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase Linux pipe buffer on zfs recv size to the maximum system size #3171

Closed
wants to merge 1 commit into from

Conversation

ryao
Copy link
Contributor

@ryao ryao commented Mar 11, 2015

I noticed when reviewing documentation that it is possible for userspace
to use fctnl(fd, F_SETPIPE_SZ, (unsigned long) size) to change the
kernel pipe buffer size on Linux to increase the pipe size up to the
value specified in /proc/sys/fs/pipe-max-size. There are users using
mbuffer to improve zfs recv performance when piping over the network, so
it seems advantageous to integrate such functionality directly into the
zfs recv tool. This avoids the addition of two buffers and two copies
(one for the buffer mbuffer adds and another for the additional pipe),
so it should be more efficient. This could have been made configurable
and/or this could have changed the value back to the original (had we
read it) after we were done with the file descriptor, but I do not see a
strong case for doing either, so I went with a simple implementation.

Closes #1161

Signed-off-by: Richard Yao [email protected]

@ryao ryao force-pushed the recv-buffer branch 5 times, most recently from 2eeff38 to 3ea7a69 Compare March 11, 2015 21:03
@behlendorf
Copy link
Contributor

Do we have any performance data on how much (if any) this helps? Also what is the normal default pipe size and what is the maximum (1048576 on my system). It looks like this functionality was added in 2.6.35 so we'll probably also need an autoconf check for this to avoid breaking the build.

@ryao
Copy link
Contributor Author

ryao commented Mar 11, 2015

@behlendorf We have performance data for mbuffer from @pyavdr in #1161, but we don't have any performance data for this specific patch. We might be able to get a volunteer from #1161 to get data for us.

If it is alright with you, I would prefer to avoid an autotools check by exploiting Linux's stable userland/kernel API boundary by checking for ENOSYS and doing some preprocessor definitons such as the following instead of an autotools check:

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ (F_SETLEASE + 7)
#endif

#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ (F_GETLEASE + 7)
#endif

This would allow us to build the binaries on a system whose userland predates the introduction of F_GETPIPE_SZ and F_SETPIPE_SZ such as CentOS 6 and utilize the new functionality should they be installed on a system where the kernel supports this. We also avoid making the build process 1-2 seconds longer.

Would this approach be alright?

@behlendorf
Copy link
Contributor

I'd be very cautious about defining F_SETPIPE_SZ/F_GETPIPE_SZ ourselves. We really can't safely assume those ids aren't already being used on a specific platform. They may already be defined for another purpose, in which case we can't assume we'll get ENOSYS.

Since this is just an optimization I think we should just wrap the entire block in an #ifdef F_SETPIPE_SZ chunk. That avoids the need for an autoconf check, but it does mean you'll need to build the binary on a platform which supports this functionality.

Also just for reference these values are definitely slightly differently on my platform. The end result in the same but the header is a little different.

/*
 * Set and get of pipe page size array
 */
#define F_SETPIPE_SZ    (F_LINUX_SPECIFIC_BASE + 7)
#define F_GETPIPE_SZ    (F_LINUX_SPECIFIC_BASE + 8)

@ryao
Copy link
Contributor Author

ryao commented Mar 11, 2015

@behlendorf The code sample that you copied is from the kernel headers, which define F_LINUX_SPECIFIC_BASE as an offset whose value is equal to F_SETLEASE. The values should be identical on your platform. The method of computation that I picked just happened to be different because I liked the idea of (ab)usng F_GETLEASE for symmetry. We should always be able to obtain the correct values in this way. If you look at glibc's headers, you will find that glibc hard codes these values on all Linux platforms:

# define F_SETPIPE_SZ   1031    /* Set pipe page size array.  */
# define F_GETPIPE_SZ   1032    /* Set pipe page size array.  */

It is possible that systems with these options in the headers won't have them in the kernel and vice versa. Consequently, there should be plenty of binaries in the wild that assume these values and if written properly, will assume an EINVAL to indicate that the fcntl operation is not supported. My initial guess was that it would be ENOSYS (and probably should have been ENOTSUP), but the kernel's do_fcntl function is written otherwise. It seems safe to just try the syscall and quit if it receives an error. This is done in both glibc and udev for various things. I used this technique a couple years ago for a patch in eudev that added a fallback for a missing dup3() implementation.

Is your concern that there exists a Linux architecture on which these values are different (glibc would contradict this) or that there is a distribution in which these values have been changed? If it is the former, that does not appear to be an issue. If it is the latter, it should be safe to (ab)use F_SETLEASE + 7 and F_GETLEASE + 7 to get the values unless a distribution decided to be truly creative in which case any guarantees about binary compatibility are broken while compiling against its headers would produce working binaries.

Are you certain that we should not do runtime detection here?

@ryao ryao force-pushed the recv-buffer branch 2 times, most recently from de7ef90 to 3530cf2 Compare March 11, 2015 23:39
@ryao
Copy link
Contributor Author

ryao commented Mar 11, 2015

I just pushed one last try at runtime detection to demonstrate what I have in mind. This one should pass the builders. I know that some people run newer kernels with older userlands, so it will be unfortunate if we cannot do things this way.

@ryao
Copy link
Contributor Author

ryao commented Mar 12, 2015

@behlendorf I overlooked your earlier question. The normal pipe size is 64KB on Linux.

@behlendorf
Copy link
Contributor

Is your concern that there exists a Linux architecture on which these values are different

Re-reading my comment I wasn't very clear. No, I'm actually not concerned about different Linux architectures defining this differently. I was worried about non-Linux, non-GNU platforms which might define this differently.

I'm happy to see you and @lundman came to basically the solution I was going to suggest. Although I'd suggest wrapping this in _GNU_SOURCE rather than __linux__ since these are more properly GNU extensions. Aside from that small tweak I think the approach the patch takes is reasonable.

From fcntl(2):

       F_GETOWN_EX, F_SETOWN_EX, F_SETPIPE_SZ, F_GETPIPE_SZ, F_GETSIG,  F_SET-
       SIG,  F_NOTIFY, F_GETLEASE, and F_SETLEASE are Linux-specific.  (Define
       the _GNU_SOURCE macro to obtain these definitions.)

@behlendorf
Copy link
Contributor

Or maybe more correctly _GNU_SOURCE && __linux__ since it's grubbing around in /proc/ which is Linux specific.

@ryao
Copy link
Contributor Author

ryao commented Mar 12, 2015

@behlendorf Changing this to _GNU_SOURCE && __linux__ should not buy us anything because we always define _GNU_SOURCE in config/Rules.am via AM_CPPFLAGS and non-linux platforms should not define __linux__. Building this code on a Linux system without _GNU_SOURCE should be valid because it uses the POSIX API + a couple of Linux kernel-specific extensions.

While glibc hides non-standard extensions behind _GNU_SOURCE in library headers, there should be no need to do that here because this is not a library header. Using _GNU_SOURCE && __linux__ in this instance does not seem like a correct usage of _GNU_SOURCE, but I can repush this with that change if you think we should use it here anyway.

@behlendorf
Copy link
Contributor

@ryao I'm not set on adding _GNU_SOURCE, for all practical purposes __linux__ is almost certainly sufficient. I've just added some minor patch review comments.

I noticed when reviewing documentation that it is possible for userspace
to use fctnl(fd, F_SETPIPE_SZ, (unsigned long) size) to change the
kernel pipe buffer size on Linux to increase the pipe size up to the
value specified in /proc/sys/fs/pipe-max-size. There are users using
mbuffer to improve zfs recv performance when piping over the network, so
it seems advantageous to integrate such functionality directly into the
zfs recv tool. This avoids the addition of two buffers and two copies
(one for the buffer mbuffer adds and another for the additional pipe),
so it should be more efficient. This could have been made configurable
and/or this could have changed the value back to the original after we
were done with the file descriptor, but I do not see a strong case for
doing either, so I went with a simple implementation.

Closes openzfs#1161

Signed-off-by: Richard Yao <[email protected]>
@ryao
Copy link
Contributor Author

ryao commented Mar 13, 2015

The failures appear to be pre-existing issues. One is issue #2302. While the other two are the txg_quiesce and txg_sync threads each blocking for more than 2 minutes. I suspect #2302 might be fixed by #3172, which was just merged to master. I have rebased and repushed.

@thegreatgazoo
Copy link

From fcntl(2):

An unprivileged process can adjust the pipe capacity to any value
between  the  system page size and the limit defined in /proc/sys/fs/pipe-max-size.
A privileged process (CAP_SYS_RESOURCE) can override the limit.

Maybe it'd be a good idea to allow privileged users to set it to a higher value if they wish so.

@behlendorf
Copy link
Contributor

@thegreatgazoo that's not a bad thought but let's avoid complicating this further until we have a real use case for that. I've merged the refreshed patch to master, we'll see in practice if this optimization helps.

5c3f61e Increase Linux pipe buffer size on 'zfs receive'

@behlendorf behlendorf closed this Mar 20, 2015
@behlendorf behlendorf added this to the 0.6.4 milestone Mar 20, 2015
@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Mar 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Performance Performance improvement or performance problem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement zfs recv buffer
3 participants