
Implement zfs recv buffer #1161

Closed
ryao opened this issue Dec 23, 2012 · 30 comments
Labels
Type: Feature Feature request or new feature

Comments

@ryao
Contributor

ryao commented Dec 23, 2012

UNIX pipes usually have a 64 KB buffer, which is too small to hold a ZFS transaction. The consequence is that a zfs send | ... | zfs recv operation will typically alternate between sending and receiving, which is suboptimal. A program called mbuffer has been suggested by various ZFS users as a workaround for this. mbuffer provides a user-adjustable buffer that is 2 MB by default, which is generally sufficient to avoid this behavior in practice.

I encountered an issue where a zfs send stopped prematurely when I was using mbuffer, which has caused me to question its reliability. It would be ideal to integrate this functionality into the zfs recv command to ensure that buffering is done in a consistent manner. This would have the additional benefit of ensuring that users do not accidentally place mbuffer on the zfs send side of an SSH tunnel, which would reduce the benefit of a buffer.
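For reference, the default pipe capacity is easy to check with a few lines of stand-alone C (illustrative only, not part of zfs) using the Linux-specific F_GETPIPE_SZ fcntl; on a typical system this prints 65536:

```c
/* Print the default capacity of a freshly created pipe (Linux-specific). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }
    /* F_GETPIPE_SZ reports the pipe buffer capacity in bytes. */
    printf("default pipe capacity: %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));
    close(fds[0]);
    close(fds[1]);
    return 0;
}
```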

@edillmann
Contributor

I'm working on this

@cwedgwood
Contributor

@edillmann I'm not convinced this is strictly needed.

(In fact, as it stands I want less buffering, in a sense.)

That said, I know there are use cases where people do 'zfs send | somethingslow' and feel pain (I argue the solution is: don't do that).

Please consider making the change optional, perhaps defaulting to off...

At which point you could just have a wrapper around zfs send (called zfssend?) that did the buffering for you.

@behlendorf
Contributor

A couple of comments.

  • Let's make sure we have a good test case so any performance improvements can be characterized. There's no point in adding more code here unless we can clearly show an improvement for at least one common use case.
  • I haven't profiled any of this myself, but it seems to me the most straightforward thing to do would be to add a zfs recv -b <size> ... option. This would just cause the command to internally allocate a buffer of a given size to smooth out the transfer. Defaulting to size=0 would address @cwedgwood's concerns until we have enough real testing to suggest a better default. (A rough sketch of such a buffer follows below.)
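To make the second point concrete, here is a rough, hypothetical sketch (not the actual zfs code; the 2 MB size, the names, and the stdin/stdout plumbing are all illustrative) of the kind of internal relay buffer a zfs recv -b <size> option could allocate: a reader thread fills a ring buffer from its input while the main thread drains it, so a temporarily slow consumer does not immediately stall the producer.

```c
/*
 * Hypothetical sketch of an internal buffer for a "zfs recv -b <size>"
 * option (NOT the actual zfs code).  A reader thread fills a ring buffer
 * from stdin while the main thread drains it to stdout, standing in for
 * the receive path.
 */
#include <pthread.h>
#include <unistd.h>

#define BUF_SIZE (2 * 1024 * 1024)      /* 2 MB, mbuffer's default */

static char buf[BUF_SIZE];
static size_t head, tail, count;        /* ring-buffer state */
static int done;                        /* producer hit EOF */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

/* Producer: read from stdin into the free region of the ring buffer. */
static void *fill(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == BUF_SIZE)
            pthread_cond_wait(&not_full, &lock);
        size_t chunk = BUF_SIZE - count;        /* free space */
        if (chunk > BUF_SIZE - head)
            chunk = BUF_SIZE - head;            /* do not wrap in one read */
        pthread_mutex_unlock(&lock);

        ssize_t n = read(STDIN_FILENO, buf + head, chunk);

        pthread_mutex_lock(&lock);
        if (n <= 0)
            done = 1;                           /* EOF or error */
        else {
            head = (head + (size_t)n) % BUF_SIZE;
            count += (size_t)n;
        }
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
        if (n <= 0)
            return NULL;
    }
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, fill, NULL);

    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0 && !done)
            pthread_cond_wait(&not_empty, &lock);
        if (count == 0) {                       /* done and fully drained */
            pthread_mutex_unlock(&lock);
            break;
        }
        size_t chunk = count;
        if (chunk > BUF_SIZE - tail)
            chunk = BUF_SIZE - tail;            /* do not wrap in one write */
        pthread_mutex_unlock(&lock);

        ssize_t n = write(STDOUT_FILENO, buf + tail, chunk);
        if (n <= 0)
            return 1;                           /* exiting also ends the reader */

        pthread_mutex_lock(&lock);
        tail = (tail + (size_t)n) % BUF_SIZE;
        count -= (size_t)n;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
    }
    pthread_join(t, NULL);
    return 0;
}
```

Compiled with cc -pthread, a relay like this could be dropped into a pipeline (e.g. ... | ./relay | zfs recv ...) for a rough comparison against mbuffer.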

@pyavdr
Contributor

pyavdr commented Mar 8, 2013

I use zfs send/recv regularly and have some performance values for about 8 GB of data with compression on both sides:

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupfastip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/zvol" gives real 1m42s, user 0m42s, sys 0m22s.

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupfastip "zfs recv tank/zvol"
gives real 1m24s, user 0m42s, sys 0m22s.

time zfs send -R tank/zvol@now | ssh zfsbackupfastip "zfs recv tank/zvol"
gives real 3m22s, user 2m24s, sys 0m24s.

Using mbuffer on both sides shows a clear performance gain:

time zfs send -R tank/zvol@now| mbuffer -s 128k -m 50M 2>/dev/null | ssh -c aes128-cbc zfsbackupfastip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/zvol" gives real 1m03s, user 0m41s, sys 0m24s.

The IP connection is 10 Gbit/s, ZOL rc14, kernel 3.4.28 on both sides, with mirrored zpools. The biggest brake is the ssh encryption, which can be accelerated with an AES-NI-supported cipher like aes128-cbc; that clearly shows better performance.

Doing the same zfs send/recv on a 1 Gbit/s link:

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupslowip "zfs recv tank/zvol" gives real 2m04s, user 0m46s, sys 0m18s.

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupslowip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/zvol" gives real 2m02s, user 0m46s, sys 0m18s.

Doing the same on a larger dataset of 50 GB of uncompressed data, with compression on both sides, on the 10 Gbit/s link:

time zfs send -R tank/largezvol@now | ssh -c aes128-cbc zfsbackupfastip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/largezvol" gives real 5m50s, user 3m12s, sys 1m42s.

time zfs send -R tank/largezvol@now | ssh -c aes128-cbc zfsbackupfastip "zfs recv tank/largezvol"
gives real 5m44s, user 3m12s, sys 1m42s.

Using mbuffer on both sides shows a clear performance gain:

time zfs send -R tank/largezvol@now | mbuffer -s 128k -m 50M 2>/dev/null | ssh -c aes128-cbc zfsbackupfastip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/largezvol" gives real 4m40s, user 3m17s, sys 1m42s

The send curve shows some normal ripples with and without mbuffer. Real bursts in transmission would show much bigger ripples, which is not the case. So from these real-world values,
I can't see any performance gain from using mbuffer on the receive side only. An integrated buffer for zfs send/recv would be in the same situation, so the outcome depends on how such a buffer is implemented.

@ryao
Contributor Author

ryao commented Mar 21, 2013

@edillmann I suggest implementing an argument to zfs recv that permits the buffer size to be specified at runtime. Ideally, it would accept a number optionally followed by either K or M to signify that the number be multiplied by 2^10 or 2^20 respectively. A value of 0 would disable this behavior. An assertion should be included to ensure that the resulting buffer size is non-negative. An adequate default value would need to be determined empirically; however, mbuffer's default of 2M seems reasonable.
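To illustrate the suffix handling proposed here, a minimal sketch follows; parse_buffer_size is a hypothetical helper for this discussion, not an existing zfs function, and overflow checking is omitted for brevity:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical parser for a "zfs recv -b <size>" argument: a non-negative
 * number optionally followed by K (x 2^10) or M (x 2^20).  Returns the size
 * in bytes, or -1 if the value is malformed.
 */
static long long parse_buffer_size(const char *arg)
{
    char *end;
    long long val = strtoll(arg, &end, 10);

    if (end == arg || val < 0)
        return -1;                      /* no digits, or negative */
    if (*end != '\0' && end[1] != '\0')
        return -1;                      /* at most one suffix character */

    switch (toupper((unsigned char)*end)) {
    case '\0':
        return val;                     /* plain byte count; 0 disables */
    case 'K':
        return val << 10;
    case 'M':
        return val << 20;
    default:
        return -1;                      /* unrecognized suffix */
    }
}

int main(int argc, char **argv)
{
    /* Demo: print the parsed byte count for each argument. */
    for (int i = 1; i < argc; i++)
        printf("%s -> %lld bytes\n", argv[i], parse_buffer_size(argv[i]));
    return 0;
}
```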

@cwedgwood If you want less buffering, you should use GNU stdbuf. The zfs send/recv command would be something like zfs send ... | stdbuf -i0 -o0 zfs recv .... With that said, I would be surprised if any buffering (including excessive buffering) had a measurable, negative effect on performance.

@behlendorf ZFS send/recv exists because Matthew Ahrens observed that intercontinental latencies had a significant effect on rsync performance. In particular, rsync functions by doing checksum comparisons on 64KB blocks (if I recall correctly). ZFS send/recv was intended to eliminate this crosstalk with a fully unidirectional stream. Unfortunately, the use of zfs send/recv in place of rsync appears to have replaced the crosstalk of "send me the next chunk" with the crosstalk of "send me the next transaction group" whenever the transaction group size exceeds the size of the UNIX pipe's buffer. This is arguably a much better situation, because not only are far fewer round trips required, but the use of incremental send/recv lets us eliminate many of them altogether. Adding an adequately sized buffer to zfs recv is a potential improvement.

@ahrens What do you think of this?

@pyavdr mbuffer should only benefit the recv end. Using it on the send end should just add unnecessary overhead.

@DeHackEd
Contributor

send/recv traffic is still unidirectional, even when incremental transfers are in use. The issue is that when doing incremental transfers, and even large full snapshot transfers, the receiving end may need to do disk reads alongside its transaction commits, and those are always synchronous. These block it from pulling data from the incoming source. For network transmissions the TCP buffer usually becomes the only substantial buffering. The cross-talk then becomes just TCP ACK packets, but it is still necessary cross-talk.

Personally I agree with the need for buffering for some kinds of high speed transfers but am only about 60% sold that it should be implemented in ZFS itself.

@ahrens
Member

ahrens commented Mar 22, 2013

@ryao There is no zfs send layer "crosstalk"; it's a unidirectional protocol, as @DeHackEd points out -- there's no "send me the next transaction group". Buffering (on both ends) helps because zfs send produces data burstily compared with the size of existing buffers (just a few KB in the TCP stack), and because zfs receive is not always ready to read data from the socket (writing the data may take a nontrivial amount of time).

p.s. My motivation for implementing send/receive was an ancient source code management system (TeamWare) over NFS. But I would imagine that rsync has similar issues, and then some -- e.g. files with just a few blocks modified, which are handled very efficiently by zfs send.

@ryao
Contributor Author

ryao commented Mar 22, 2013

@ahrens zfs recv will block until it has read the next transaction group. If the receiving end does not have an entire transaction group in its buffer, it will block on network traffic. That amounts to "send me more data" at the TCP level, which is effectively "send me the next transaction group".

P.S. I will cite TeamWare when I talk about this in the future. Thanks for the correction.

@ahrens
Member

ahrens commented Mar 22, 2013

@ryao I still don't know what you mean by "transaction group" in this context. Do you mean record (dmu_replay_record_t)? Can you point me to the code you have in mind?

@ryao
Contributor Author

ryao commented Mar 23, 2013

@ahrens My current understanding of zfs send is that some kind of record is sent (probably dmu_replay_record_t) that needs to be received entirely by zfs recv before it can do anything.

With that said, I probably should let people actually working on this talk. I do not have dtrace at my disposal, so the best that I can do is think of what could be wrong, write patches to fix them and iterate until the patches have the desired effect. Looking into send/recv performance is a low priority for me, so I have not done anything beyond form an initial hypothesis about what is happening.

@bassu
Contributor

bassu commented Nov 4, 2013

Using mbuffer over multiple pipes is useless; running it in listening mode might be helpful, but of course at the expense of security!

I ran several tests on gigabit networks with the ssh alias below, and I did not see any significant improvement from mbuffer over plain ssh.

# which ssh
 alias ssh='ssh -T -c arcfour -o Compression=no -x'

As I found mbuffer slower in many cases, I am not sure why people keep recommending it. The only performance gain was 1-2%, with mbuffer in listening mode. I got an average of around 60 MB/s for large transfers with ssh.

@ryao, @edillmann: The custom buffer size / mbuffer option would be nice, but I believe it is not worth your development time provided ssh is tuned with the aforementioned tweaks. Also, there are more important issues than this.
@behlendorf: Spot on. Preliminary tests show no significant performance gains over plain ssh with arcfour encryption and no compression, combined with UNIX pipes. Perhaps more people should test it so we can add it to the FAQ.

@FransUrbo
Contributor

Related to #1112.

@eborisch

There are other cases where buffering (even on the send side) has benefits. I have an active system that sends an incremental replication stream (with a large number of file systems and automated snapshots) to a backup, but I need to be nice to the network connecting them when this is done during the work day.

Sticking a buffer between zfs send and ssh lets the zfs operation finish and potentially exit quickly on the send side (when the actual user data changes are small enough), which minimizes the duration of user-facing I/O impact; the stream can then be doled out slowly from the buffer over the network (throttled with either mbuffer or pv). There are lots of knobs on mbuffer (buffer size, % empty to start filling, % full to start draining, use a temp file for the buffer, etc.) that I don't think zfs needs/wants to replicate, but they can be useful for tuning to a specific use case.

With that in mind, this is handled better by documentation than by modifying the ZOL send/recv code. It is part of the UNIX philosophy to let each tool do its own thing well, and then chain them together as appropriate. If someone is concerned enough about zfs send/recv performance to dig into buffering issues, they are certainly able to add an additional item to their (likely scripted) command line. Sticking lz4c into the chain (wrapping any buffering) would also be a good fit for this type of documentation.

@olw2005

olw2005 commented Aug 27, 2014

I agree with the previous statement. There are a myriad of perfectly good tools out there already, so why reinvent the wheel? After reading discussions on "how to accelerate zfs send/recv" on a number of websites, we tinkered with various command lines (netcat, ssh with different options, lz4 compression, mbuffer) before arriving at an "optimum" for our particular setup.

There is no one-size-fits-all answer. If anything perhaps this could / should be addressed as a documentation issue?

@behlendorf behlendorf removed this from the 0.7.0 milestone Oct 7, 2014
@lintonv

lintonv commented Feb 11, 2015

@olw2005 could you share what your "optimum" was for your setup?

@olw2005

olw2005 commented Feb 11, 2015

@lintonv

We tinkered with lz4demo: http://code.google.com/p/lz4/
and mbuffer: http://www.maier-komor.de/mbuffer.html

locally compiled along with a modified version of the “zfs-replicate” shell script from here:

Author: kattunga

Date: August 11, 2012

Version: 2.0

http://linuxzfs.blogspot.com/2012/08/zfs-replication-script.html

https://github.com/kattunga/zfs-scripts.git

Credits:

Mike La Spina for the original concept and script http://blog.laspina.ca/

Function:

Provides a snapshot-and-send process which replicates a ZFS dataset from a source to a target server.

Maintains a running snapshot archive for X time

The modified zfs send [and ssh -> zfs receive on the other end] looked like this:

zfs send $VERBOSE $DEDUP -R $last_snap_source | lz4demo stdin stdout 2> /dev/null | mbuffer -q -m 512M 2> /dev/null | ssh -c aes128-cbc $TGT_HOST $TGT_PORT "lz4demo -d stdin stdout 2> /dev/null | zfs recv $VERBOSE -F $TGT_PATH" 2> $0.err

But in the end, the above did not significantly outperform straight ssh (with aes128-cbc encryption).

YMMV.


@eborisch

FWIW, lz4 (or lz4c) is available on many distros in some form, so you likely don't need to roll your own anymore.

We've been happy with something like this:
zfs send [args ...] | lz4c | ssh remote_host "mbuffer [args to rate limit / buffer] | lz4c -d | zfs recv [args]"

If you aren't rate limiting, mbuffer may still allow the send to finish faster if you are sending small incrementals, especially with multiple small (hourly, for example) snapshots and a higher-performance source than destination. You can also set your ssh cipher to arcfour to lower ssh's CPU load if you don't need military-grade encryption...

@lintonv

lintonv commented Feb 16, 2015

@olw2005 @eborisch Thank you both. I'll do some testing and post what I find.

@olw2005

olw2005 commented Feb 16, 2015

@eborisch @lintonv If you have the AES-NI instruction set (i.e., a newer CPU), the speed of aes128-cbc is pretty decent. Fast enough for my use case, anyway. RH has a web page with test commands for ssh here:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security_Guide/sect-Security_Guide-Encryption-OpenSSL_Intel_AES-NI_Engine.html

Good luck!



@eborisch

And as I've mentioned elsewhere before (#3010) I would suggest avoiding '-F' on the recv if at all possible.

@lintonv

lintonv commented Feb 17, 2015

@eborisch @olw2005

I tried mbuffer and using the lz4c compression, in the ways you both suggested above.

But I still see poor send performance initially: I expect 120 MB/s but only get 12 MB/s.

Let me explain:

  1. The initial send of a 1 GB FS goes at 12 MB/s.
  2. After that transfer, I delete the FS on the RECEIVING end and re-transmit. At this point, I get the full bandwidth of 120 MB/s.

This shows me that there is some caching (probably ZIL?), which is why the second send is much faster. But the initial send (with no cache?) is extremely slow.

@olw2005

olw2005 commented Feb 17, 2015

@lintonv You might try directing the send into /dev/null to eliminate other variables. It sounds like your disk may be the bottleneck, in which case buffering / compressing won’t help.


@lintonv

lintonv commented Feb 18, 2015

@olw2005 It does not appear to be the disks. I use enterprise-grade SSDs whose bandwidth and speed are very high. That is not the bottleneck.

I did some additional tests and what I discovered was a 'queue depth' of 1. ZFS send appears to be a highly serial operation. I am going to look at the code and see if parallelism is possible.

Any other insights on how this can be done in a parallel fashion?

@olw2005

olw2005 commented Feb 18, 2015

@lintonv I’ll leave the code questions for others to answer. I’m a sysadmin not a programmer, Jim. =)

However, I will note that in our usage, an unconstrained (redirected to /dev/null, for example) "full" zfs send of an un-cached but relatively un-fragmented filesystem easily pegs the 6 Gbps SAS controller. We typically net around 500-600 MB/s at around 4k-5k IOPS, which is about right given (raw speed) * compression / (raidz2 overhead). In practice we get around 250-300 MB/s dumping a zfs send out across the 10 Gbit LAN to LTO-5 tape on a backup server. (I believe in that case it's largely constrained by the tape speed.)

Bottom line, I don’t think there is anything “wrong” with the zfs send code.


@lintonv

lintonv commented Feb 19, 2015

I did not mean to hijack this thread. I apologize. As my issue is on the ZFS send side and not on the zfs recv side, I will stop here.

Just FYI, I got some improvement by using a larger record size (I was using 4K) and setting primarycache to 'all', but still nothing significant.

@olw2005

olw2005 commented Feb 19, 2015

@lintonv On that note, I meant to mention this in my last post but forgot. You should redirect the performance questions to the zfs discussion list (see the zfsonlinux.org website). You'll get more advice there. (In fact, if you search it you'll probably find the question of zfs send/recv performance has come up before. Repeatedly.)

As for block size, this may be out of date, but at the time we implemented this (circa v0.6.1) a 128k block size worked a lot better for our use case (zvols shared via iSCSI to VMware). I tested a range of block sizes and there was noticeable performance degradation at the smallest block sizes (4k and 8k in particular). Again, take it with a grain of salt, as that was about 3 years ago and the zfs code has changed a lot since then.


@ryao
Contributor Author

ryao commented Mar 11, 2015

I noticed when reviewing documentation that userspace can use fcntl(fd, F_SETPIPE_SZ, size) to increase the kernel pipe buffer size on Linux, up to the value specified in /proc/sys/fs/pipe-max-size. We can use fstat to check whether the fd is of type S_IFIFO so that we only do this on actual pipes. I thought of it while working on something else, so I am making a note here in case someone else wants to do it before I find time. This should be trivial to achieve.
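A minimal sketch of that idea follows (illustrative only; maybe_grow_pipe is a hypothetical name, not the actual zfs recv code, and the call is deliberately best-effort):

```c
/*
 * Hypothetical sketch: if fd is a pipe, grow its buffer to the system-wide
 * maximum advertised in /proc/sys/fs/pipe-max-size.  Failures are ignored;
 * the pipe simply keeps its default size.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static void maybe_grow_pipe(int fd)
{
    struct stat st;
    FILE *fp;
    unsigned long max = 0;

    /* Only act on actual pipes (S_IFIFO), as suggested above. */
    if (fstat(fd, &st) != 0 || !S_ISFIFO(st.st_mode))
        return;

    fp = fopen("/proc/sys/fs/pipe-max-size", "r");
    if (fp == NULL)
        return;
    if (fscanf(fp, "%lu", &max) != 1)
        max = 0;
    fclose(fp);

    /* Best effort: if this fails, the default pipe size stays in place. */
    if (max > 0)
        (void)fcntl(fd, F_SETPIPE_SZ, max);
}

int main(void)
{
    /* In a receive tool, this would run before reading the stream from fd 0. */
    maybe_grow_pipe(STDIN_FILENO);
    return 0;
}
```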

ryao added a commit to ryao/zfs that referenced this issue Mar 11, 2015 (pushed several times with revisions to the commit message):

I noticed when reviewing documentation that it is possible for userspace
to use fcntl(fd, F_SETPIPE_SZ, (unsigned long) size) to increase the
kernel pipe buffer size on Linux up to the value specified in
/proc/sys/fs/pipe-max-size. There are users using mbuffer to improve
zfs recv performance when piping over the network, so it seems
advantageous to integrate such functionality directly into the
zfs recv tool. This could have been configurable or we could have
changed the value back to the original (had we read it) after we were
done with the file descriptor, but I do not see a strong case for doing
either, so I went with a simple implementation.

Closes openzfs#1161

Signed-off-by: Richard Yao <[email protected]>
@ryao
Contributor Author

ryao commented Mar 11, 2015

Those additional pushes were just for changes to the commit message. Anyway, the triviality of this piqued my interest, so I implemented it, compiled it, and verified with strace that the right syscalls were being made in a simple test case. Someone else will need to verify that it actually provides a benefit.

ryao added a commit to ryao/zfs that referenced this issue Mar 11, 2015 (again pushed several times; the final commit message follows):

I noticed when reviewing documentation that it is possible for userspace
to use fcntl(fd, F_SETPIPE_SZ, (unsigned long) size) to increase the
kernel pipe buffer size on Linux up to the value specified in
/proc/sys/fs/pipe-max-size. There are users using mbuffer to improve
zfs recv performance when piping over the network, so it seems
advantageous to integrate such functionality directly into the zfs recv
tool. This avoids the addition of two buffers and two copies (one for
the buffer mbuffer adds and another for the additional pipe), so it
should be more efficient. This could have been made configurable and/or
this could have changed the value back to the original after we were
done with the file descriptor, but I do not see a strong case for doing
either, so I went with a simple implementation.

Closes openzfs#1161

Signed-off-by: Richard Yao <[email protected]>
@ryao
Contributor Author

ryao commented Mar 11, 2015

Would someone who benefits from mbuffer mind benchmarking ryao/zfs@3530cf2 against the unpatched userland binaries, with and without mbuffer? It should outperform mbuffer in situations where a single core cannot keep up with the two additional copies that using mbuffer requires, while eliminating the need for it.

ryao added a commit to ryao/zfs that referenced this issue Mar 12, 2015
ryao added a commit to ryao/zfs that referenced this issue Mar 13, 2015
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Mar 20, 2015
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Mar 21, 2015
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Mar 22, 2015
@ryao
Contributor Author

ryao commented May 7, 2015

5c3f61e closed this.

@ryao ryao closed this as completed May 7, 2015