Implement zfs recv buffer #1161
I'm working on this.
@edillmann I'm not convinced this is strictly needed (in fact, as it stands, I want less buffering in a sense). That said, I know there are use cases where people do `zfs send | somethingslow` and feel pain (I argue the solution is: don't do that). Please consider making the change optional, perhaps defaulting to off... at which point you could just have a wrapper around zfs send (called zfssend?) that did the buffering for you.
A couple comments.
I use zfs send / recv regularly and have some performance values for about 8 GB of data with compression on both sides:

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupfastip "mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/zvol"
gives real 1m42s, user 0m42s, sys 0m22s.

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupfastip "zfs recv tank/zvol"
time zfs send -R tank/zvol@now | ssh zfsbackupfastip "zfs recv tank/zvol"

Using mbuffer on both sides shows a clear performance gain:

time zfs send -R tank/zvol@now | mbuffer -s 128k -m 50M 2>/dev/null | ssh -c aes128-cbc zfsbackupfastip "mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/zvol"
gives real 1m03s, user 0m41s, sys 0m24s.

The IP connection is 10 Gbit/s, ZoL rc14, kernel 3.4.28 on both sides, with mirrored zpools. The biggest brake is the ssh encryption, which can be accelerated with an AES-NI-supported cipher like aes128-cbc, which clearly shows better performance.

Doing the same zfs send/recv on a 1 Gbit/s link:

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupslowip "zfs recv tank/zvol"
gives real 2m04s, user 0m46s, sys 0m18s.

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupslowip "mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/zvol"
gives real 2m02s, user 0m46s, sys 0m18s.

Doing the same on a larger dataset of 50 GB of uncompressed data, with compression on both sides, on the 10 Gbit/s link:

time zfs send -R tank/largezvol@now | ssh -c aes128-cbc zfsbackupfastip "mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/largezvol"
gives real 5m50s, user 3m12s, sys 1m42s.

time zfs send -R tank/largezvol@now | ssh -c aes128-cbc zfsbackupfastip "zfs recv tank/largezvol"

Using mbuffer on both sides shows a clear performance gain:

time zfs send -R tank/largezvol@now | mbuffer -s 128k -m 50M 2>/dev/null | ssh -c aes128-cbc zfsbackupfastip "mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/largezvol"
gives real 4m40s, user 3m17s, sys 1m42s.

The send curve shows some normal ripples with and without mbuffer. It really bursts.
@edillmann I suggest implementing an argument to zfs recv that permits the buffer size to be specified at runtime. Ideally, it would accept a number optionally followed by either K or M to signify that the number be multiplied by 2^10 or 2^20 respectively. A value of 0 would disable this behavior. An assertion should be included to ensure that the resulting buffer size is non-negative. An adequate default value would need to be determined empirically. However, mbuffer's default value is 2M, which seems reasonable.

@cwedgwood If you want less buffering, you should use GNU stdbuf. The zfs send/recv command would be something like the sketch below.

@behlendorf ZFS send/recv exists because Matthew Ahrens observed that intercontinental latencies had a significant effect on rsync performance. In particular, rsync functions by doing checksum comparisons on 64KB blocks (if I recall correctly). ZFS send/recv was intended to eliminate this crosstalk with a fully unidirectional stream. Unfortunately, the use of zfs send/recv in place of rsync appears to have replaced the crosstalk of "send me the next chunk" with the crosstalk of "send me the next transaction group" when the transaction group size exceeds the size of the UNIX pipe's buffer. This is arguably a much better situation because not only are far fewer crosstalks required, but the use of incremental send/recv enables us to eliminate many of them altogether. Adding an adequately sized buffer to `zfs recv` should help here.

@ahrens What do you think of this?

@pyavdr mbuffer should only benefit the recv end. Using it on the send end should only be unnecessary overhead.
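The stdbuf example was not preserved in this thread, so the following is only an illustrative guess at what was meant; the host and dataset names are placeholders:

```sh
# stdbuf -o0 requests unbuffered stdout on the sending side. Note that
# stdbuf only adjusts C stdio buffering, so how much effect it has depends
# on how zfs send actually writes its stream.
stdbuf -o0 zfs send tank/fs@snap | ssh somehost "zfs recv tank/fs"
```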
send/recv traffic is still unidirectional, even when incremental transfers are in use. The issue is that when doing incremental transfers, and even large full snapshot transfers, the receiving end may need to do disk reads alongside its transaction commits, and those are always synchronous. These block it from pulling data from the incoming source. For network transmissions the TCP buffer usually becomes the only substantial buffering. The cross-talk then becomes just TCP ACK packets, but it's still necessary cross-talk. Personally I agree with the need for buffering for some kinds of high-speed transfers, but I am only about 60% sold that it should be implemented in ZFS itself.
@ryao There is no zfs send layer "crosstalk"; it's a unidirectional protocol as @DeHackEd points out -- there's no "send me the next transaction group". Buffering (on both ends) helps because zfs send produces data burstily, compared with the size of existing buffers (just a few KB in the TCP stack); and because zfs receive is not always ready to read data from the socket (writing the data may take a nontrivial amount of time).

p.s. My motivation for implementing send/receive was an ancient source code management system (TeamWare) over NFS. But I would imagine that rsync has similar issues, and then some -- e.g. files with just a few blocks modified, which are handled very efficiently by zfs send.
@ahrens P.S. I will cite TeamWare when I talk about this in the future. Thanks for the correction.
@ryao I still don't know what you mean by "transaction group" in this context. Do you mean record (dmu_replay_record_t)? Can you point me to the code you have in mind?
@ahrens My understanding here is limited, and I probably should let the people actually working on this talk. I do not have dtrace at my disposal, so the best that I can do is think of what could be wrong, write patches to fix it, and iterate until the patches have the desired effect. Looking into send/recv performance is a low priority for me, so I have not done anything beyond forming an initial hypothesis about what is happening.
I ran several tests on gigabit networks with the setups described below.
As I found, @ryao, @edillmann: the custom buffer-size / mbuffer option would be nice, but I believe it is not worth your development time.
Related to #1112.
There are other cases where buffering (even on the send side) has benefits. I have an active system that sends an incremental replication stream (with a large number of file systems and automated snapshots) to a backup, but I need to be nice to the network connecting them when this is done during the work day. Sticking a buffer between the zfs send and ssh lets the zfs operation finish and potentially exit quickly (when the actual user data changes are small enough) on the send side, which minimizes the duration of user-facing IO impact; the stream can then be slowly doled out of the buffer over the network, throttled with either mbuffer or pv. There are lots of knobs to mbuffer (buffer size, % empty to start filling, % full to start draining, use a temp file for the buffer, etc.) that I don't think zfs needs/wants to replicate, but they can be useful for tuning to a specific use case. With that in mind, this is handled better by documentation than by modifying the ZoL send/recv code. It is part of the UNIX philosophy to let each tool do its own thing well, and then chain them together as appropriate. If someone is concerned enough about zfs send/recv performance to dig into buffering issues, they are certainly able to add an additional item to their (likely scripted) command line. Sticking lz4c into the chain (wrapping any buffering) would also be a good fit for this type of documentation.
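For example, a buffered, throttled replication along the lines described above might look roughly like this (host names, sizes, and the 10 MB/s rate are arbitrary; mbuffer's own rate-limiting options could be used instead of pv):

```sh
# zfs send drains into a large in-memory buffer and can exit as soon as the
# (small) incremental stream fits into it; pv then doles the buffered data
# out over the network at a gentle 10 MB/s during the work day.
zfs send -R -i tank/data@yesterday tank/data@today \
  | mbuffer -q -m 1G \
  | pv -q -L 10m \
  | ssh backuphost "zfs recv -d tank/backup"
```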
I agree with the previous statement. There are myriad perfectly good tools out there already, so why reinvent the wheel? After reading discussions on "how to accelerate zfs send/recv" on a number of websites, we tinkered with various command lines (netcat, ssh with different options, lz4 compression, mbuffer) before arriving at an "optimum" for our particular setup. There is no one-size-fits-all answer. If anything, perhaps this could / should be addressed as a documentation issue?
@olw2005 Could you share what your "optimum" was for your setup?
We tinkered with lz4demo (http://code.google.com/p/lz4/), locally compiled, along with a modified version of the "zfs-replicate" shell script from here:

Author: kattunga
Date: August 11, 2012
Version: 2.0
http://linuxzfs.blogspot.com/2012/08/zfs-replication-script.html
https://github.com/kattunga/zfs-scripts.git
Credits: Mike La Spina for the original concept and script, http://blog.laspina.ca/
Function: Provides a snapshot and send process which replicates a ZFS dataset from a source to a target server. Maintains a running snapshot archive for X time.

The modified zfs send [and ssh -> zfs receive on the other end] looked like this:

zfs send $VERBOSE $DEDUP -R $last_snap_source | lz4demo stdin stdout 2> /dev/null | mbuffer -q -m 512M 2> /dev/null | ssh -c aes128-cbc $TGT_HOST $TGT_PORT "lz4demo -d stdin stdout 2> /dev/null | zfs recv $VERBOSE -F $TGT_PATH" 2> $0.err

But in the end, the above did not significantly outperform straight ssh (with aes128-cbc encryption). YMMV.
FWIW, lz4 (or lz4c) is available on many distros in some form, so you likely don't need to roll your own anymore. We've been happy with something along those lines; a sketch is below. If you aren't rate limiting, mbuffer may still allow the send to finish faster if you are sending small incrementals, especially with multiple small (hourly, for example) snapshots and a higher-performance source than destination. You can also set your ssh cipher to arcfour to lower ssh's CPU load if you don't need military-grade encryption...
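The original command was not preserved in this thread, but a sketch of that kind of pipeline might look like the following (host and dataset names are placeholders, and lz4's CLI flags vary a bit between versions):

```sh
# Incremental send, lz4-compressed on the wire, the cheap arcfour cipher
# for ssh, and mbuffer on the receiving side to smooth out bursts.
zfs send -I tank/fs@old tank/fs@new \
  | lz4 \
  | ssh -c arcfour backuphost "lz4 -d | mbuffer -q -m 1G | zfs recv tank/fs"
```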
@eborisch @lintonv If you have AES-NI instruction sets (i.e., a newer CPU), the speed for aes128-cbc is pretty decent. Fast enough for my use case, anyway. RH has a web page with test commands for ssh. Good luck!
And as I've mentioned elsewhere before (#3010), I would suggest avoiding '-F' on the recv if at all possible.
I tried mbuffer and lz4c compression, in the ways you both suggested above. But I still see bad send performance initially: I expect 120 MB/sec but only get 12 MB/sec. Repeating the same send is much faster the second time. This shows me that there is some caching (probably ZIL?), which is why the second send is much faster, while the initial send (with no cache?) is extremely slow.
@lintonv You might try directing the send into /dev/null to eliminate other variables. It sounds like your disk may be the bottleneck, in which case buffering / compressing won't help.
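One way to do that, with pv included only to report throughput (the snapshot name is a placeholder):

```sh
# Measures what the send side alone can produce, with the network, ssh,
# and the receiver taken completely out of the picture.
zfs send -R tank/vol@snap | pv > /dev/null
```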
@olw2005 It does not appear to be the disks. I use enterprise-grade SSDs whose bandwidth and speed are very high. That is not the bottleneck. I did some additional tests, and what I discovered was a 'queue depth' of 1. ZFS send appears to be a highly serial operation. I am going to look at the code and see if parallelism is possible. Any other insights on how this can be done in a parallel fashion?
@lintonv I'll leave the code questions for others to answer. I'm a sysadmin, not a programmer, Jim. =) However, I will note that in our usage, an unconstrained (redirected to /dev/null, for example) "full" zfs send of an un-cached but relatively un-fragmented filesystem easily pegs the 6 Gbps SAS controller. We typically net around 500-600 MB/s at around 4k-5k IOPS, which is about right given (raw speed) * compression / (raidz2 overhead). In practice we get around 250-300 MB/s dumping a zfs send out across the 10 Gbit LAN to LTO5 tape on a backup server. (I believe in that case it's largely constrained by the tape speed.) Bottom line, I don't think there is anything "wrong" with the zfs send code.
I did not mean to hijack this thread. I apologize. As my issue is on the zfs send side and not on the zfs recv side, I will stop here. Just FYI, I saw some improvement by using a larger record size (I was using 4K) and turning primarycache to 'all', but still nothing significant.
@lintonv On that note, I was going to mention in my last post but forgot: you should redirect the performance questions to the zfs discussion list (see the zfsonlinux.org website). You'll get more advice there. (In fact, if you search it you'll probably find the question of zfs send/recv performance has come up before. Repeatedly.) As for block size, this may be out of date, but at the time we implemented (circa v0.6.1) the 128k block size worked a lot better for our use case (zvols shared with iSCSI to VMware). I tested a range of block sizes and there was noticeable performance degradation at the smallest block sizes (4k and 8k in particular). Again, take it with a grain of salt, as that was about 3 years ago and the zfs code has changed a lot since then.
I noticed when reviewing documentation that it is possible for userspace to use fcntl(fd, F_SETPIPE_SZ, (unsigned long) size) to change the kernel pipe buffer size on Linux to increase the pipe size up to the value specified in /proc/sys/fs/pipe-max-size. We have people using mbuffer to improve zfs recv performance when piping over the network, so it seems advantageous to integrate such functionality directly into the zfs recv tool. This could have been configurable or we could have changed the value back to the original (had we read it) after we were done with the file descriptor, but I do not see a strong case for doing either, so I went with a simple implementation.

Closes openzfs#1161
Signed-off-by: Richard Yao <[email protected]>
I noticed when reviewing documentation that it is possible for userspace to use fcntl(fd, F_SETPIPE_SZ, (unsigned long) size) to change the kernel pipe buffer size on Linux to increase the pipe size up to the value specified in /proc/sys/fs/pipe-max-size. There are users using mbuffer to improve zfs recv performance when piping over the network, so it seems advantageous to integrate such functionality directly into the zfs recv tool. This could have been configurable or we could have changed the value back to the original (had we read it) after we were done with the file descriptor, but I do not see a strong case for doing either, so I went with a simple implementation.

Closes openzfs#1161
Signed-off-by: Richard Yao <[email protected]>
Those additional pushes were just for changes to the commit message. Anyway, the triviality of this piqued my interest, so I implemented it, compiled it, and verified with strace that the right syscalls were being made on a simple test case. Someone else will need to verify that it actually provides a benefit.
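A rough sketch of that kind of check (the dataset names are made up; with the patch applied, an F_SETPIPE_SZ call should appear in the fcntl trace):

```sh
# Trace only the fcntl calls made by zfs recv and look for the pipe resize.
zfs send tank/fs@snap | strace -f -e trace=fcntl zfs recv tank/copy 2>&1 \
  | grep F_SETPIPE_SZ
```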
I noticed when reviewing documentation that it is possible for userspace to use fcntl(fd, F_SETPIPE_SZ, (unsigned long) size) to change the kernel pipe buffer size on Linux to increase the pipe size up to the value specified in /proc/sys/fs/pipe-max-size. There are users using mbuffer to improve zfs recv performance when piping over the network, so it seems advantageous to integrate such functionality directly into the zfs recv tool. This avoids the addition of two buffers and two copies (one for the buffer mbuffer adds and another for the additional pipe), so it should be more efficient. This could have been made configurable and/or this could have changed the value back to the original (had we read it) after we were done with the file descriptor, but I do not see a strong case for doing either, so I went with a simple implementation.

Closes openzfs#1161
Signed-off-by: Richard Yao <[email protected]>
I noticed when reviewing documentation that it is possible for userspace to use fcntl(fd, F_SETPIPE_SZ, (unsigned long) size) to change the kernel pipe buffer size on Linux to increase the pipe size up to the value specified in /proc/sys/fs/pipe-max-size. There are users using mbuffer to improve zfs recv performance when piping over the network, so it seems advantageous to integrate such functionality directly into the zfs recv tool. This avoids the addition of two buffers and two copies (one for the buffer mbuffer adds and another for the additional pipe), so it should be more efficient. This could have been made configurable and/or this could have changed the value back to the original after we were done with the file descriptor, but I do not see a strong case for doing either, so I went with a simple implementation.

Closes openzfs#1161
Signed-off-by: Richard Yao <[email protected]>
Would someone who benefits from mbuffer mind doing a benchmark of ryao/zfs@3530cf2 against the unpatched userland binaries with and without mbuffer? It should outperform mbuffer in situations where a single core cannot keep up with the two additional copies that using mbuffer requires, while eliminating the need for it.
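For anyone running that benchmark, note that F_SETPIPE_SZ can normally only grow a pipe up to the kernel-wide cap, so it may be worth checking (and, if needed, raising) that cap first, for example:

```sh
# Show the current cap (the kernel default is 1 MB), then raise it to 16 MB.
cat /proc/sys/fs/pipe-max-size
sudo sysctl -w fs.pipe-max-size=16777216
```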
I noticed when reviewing documentation that it is possible for user space to use fcntl(fd, F_SETPIPE_SZ, (unsigned long) size) to change the kernel pipe buffer size on Linux to increase the pipe size up to the value specified in /proc/sys/fs/pipe-max-size. There are users using mbuffer to improve zfs recv performance when piping over the network, so it seems advantageous to integrate such functionality directly into the zfs recv tool. This avoids the addition of two buffers and two copies (one for the buffer mbuffer adds and another for the additional pipe), so it should be more efficient. This could have been made configurable and/or this could have changed the value back to the original after we were done with the file descriptor, but I do not see a strong case for doing either, so I went with a simple implementation.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1161
5c3f61e closed this.
UNIX pipes usually have a 64KB buffer size, which is too small to buffer a ZFS transaction. The consequence is that a `zfs send | ... | zfs recv` operation will typically alternate between sending and receiving, which is suboptimal. A program called `mbuffer` had been suggested by various ZFS users as a workaround for this. `mbuffer` provides a user-adjustable buffer that is 2MB in size by default, which is generally sufficient to avoid the suboptimal behavior in practice.

I encountered an issue where a zfs send stopped prematurely when I was using `mbuffer`, which has caused me to question its reliability. It would be ideal to integrate this functionality into the `zfs recv` command to ensure that buffering is done in a consistent manner. This would have the additional benefit of ensuring that users do not accidentally place mbuffer on the `zfs send` side of an SSH tunnel, which would reduce the benefit of a buffer.