Reduce system calls of write for client->reply by introducing writev #9934
Conversation
Commits: Refine code, Fix a test failure, Improve code logic, Refine code, Update code, Add a check.
@panjf2000 thank you. Regarding the tests, the failures seem very consistent; I suppose it's some side effect of your change.
Hi @oranagra, could you share some details about these two failed tests with me? I have no idea what might be causing these failures or how to fix them, thanks!
@panjf2000 When use
@oranagra I have a question, whether using
I don't think that would be a problem because it will stop sending data when it reaches the NET_MAX_WRITES_PER_EVENT limit (1024*64). Furthermore, the
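For illustration, here is a minimal C sketch of the cap being described; replyChunk and writeRepliesCapped are hypothetical names for this example, not the actual Redis code. The loop stops once the per-event byte budget is spent, so one client can't monopolize the event loop.

```c
#include <stddef.h>
#include <unistd.h>

#define NET_MAX_WRITES_PER_EVENT (1024*64)

/* Hypothetical pending reply chunk for a client. */
typedef struct { const char *buf; size_t len; } replyChunk;

/* Write chunks until the socket would block or the per-event byte
 * budget is exhausted (partial-write resumption omitted for brevity). */
static ssize_t writeRepliesCapped(int fd, const replyChunk *chunks, int nchunks) {
    ssize_t totwritten = 0;
    for (int i = 0; i < nchunks; i++) {
        ssize_t n = write(fd, chunks[i].buf, chunks[i].len);
        if (n <= 0) break;                                 /* EAGAIN or error */
        totwritten += n;
        if (totwritten > NET_MAX_WRITES_PER_EVENT) break;  /* budget spent, yield */
    }
    return totwritten;
}
```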
Great initiative :)
Good point, maybe we should put
@oranagra @sundb @moticless
Actually, there's no wholesale copy of all the data behind the iov pointers into the kernel's memory; see https://elixir.bootlin.com/linux/v5.0/source/net/ipv4/tcp.c#L1174 and https://lwn.net/Articles/604287/. Therefore, even if we don't limit the number of bytes to NET_MAX_WRITES_PER_EVENT, the kernel will only write bytes up to the maximum size of the socket send buffer (defined by /proc/sys/net/ipv4/tcp_wmem) instead of all the bytes from user space (e.g. 100MB).
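To make that behavior concrete, here is a hedged sketch of the caller-side pattern it implies; writevAll is a hypothetical helper, not something from this PR. On a non-blocking socket, writev() returns a short count once the send buffer fills, and the caller skips the fully-sent iovecs and resumes from the partially-sent one.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/uio.h>

/* Drain an iovec array with repeated writev() calls; mutates iov/iovcnt
 * as entries are consumed. Returns bytes sent, or -1 on a hard error. */
static ssize_t writevAll(int fd, struct iovec *iov, int iovcnt) {
    ssize_t total = 0;
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n < 0)
            return (errno == EAGAIN || errno == EWOULDBLOCK) ? total : -1;
        total += n;
        /* Skip the iovecs that were sent in full... */
        while (iovcnt > 0 && (size_t)n >= iov->iov_len) {
            n -= (ssize_t)iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* ...and resume mid-buffer in the partially-sent one. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + n;
            iov->iov_len -= (size_t)n;
        }
    }
    return total;
}
```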
I'm sorry guys, I'm really busy elsewhere and didn't review the code or the correspondence here.
Another aspect of this matter: we might degrade performance in the case of big replies, especially with TLS.
I think
Thank you
I still failed to reproduce the regression results on my own machine; the performance numbers for unstable and use-writev looked close to each other. Is there any chance I can run a profile on your AWS VMs, since the regression seems to be easy to reproduce on those VMs? @filipecosta90
Maybe the difference is in the networking of the TCP/IP stack.
@oranagra / @panjf2000, WRT:
The benchmarks use 2 VMs (1 DB and 1 client virtual machine, KVM, on AWS). WRT benchmark automation, I noticed some variance across multiple runs of the same benchmark for both unstable and the comparison branch.
@oranagra and @panjf2000, still about the impact of commands with deferred len: I've tested a RedisTimeSeries module use-case that uses deferred replies (cc @gkorland @OfirMos @dann). I was expecting a higher impact on the numbers. Even though we've got a slight improvement over unstable, the change is not as meaningful as I expected.
To conclude (TL;DR): I was expecting a larger impact, but in the "common case" this is not happening. Nonetheless, there are indeed some "not-so-usual" use-cases that are improved. I see reason to merge it :)
@filipecosta90 I'm not sure how RedisTimeSeries uses deferred replies: whether it's just one per command, or inside a loop (like the COMMAND command and CLUSTER SLOTS used to do, see #10056, #7123).
Did you reproduce this on the latest code, or on the old copy of this branch and its merge-base?
So do we now conclude that this PR doesn't do any damage in the common use cases (even over a real network)? Just to point out again, this PR does get some 300% performance improvement for the old code of the COMMAND command, see:
@yossigo please ack.
Any updates on this?
We discussed and approved this today in a core-team meeting. Does anyone have any other concerns, or something I forgot, before I merge it?
The root cause is that in 6.2.x we started using deferred replies (#7844).
Is there any chance that those deferred replies are way larger than IOV_MAX=1024, which might explain why
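For context, a small hedged sketch of what clamping a batch to IOV_MAX looks like; replyNode and fillIov are illustrative names for this example, not the PR's code. A reply list longer than IOV_MAX still drains, but needs one writev() call per batch rather than a single call.

```c
#include <limits.h>
#include <stddef.h>
#include <sys/uio.h>

#ifndef IOV_MAX
#define IOV_MAX 1024  /* common Linux value; POSIX only guarantees a minimum */
#endif

/* Illustrative reply-list node. */
typedef struct replyNode {
    const char *buf;
    size_t len;
    struct replyNode *next;
} replyNode;

/* Fill at most IOV_MAX iovec entries from the list; the caller loops,
 * issuing one writev() per returned batch. */
static int fillIov(const replyNode *head, struct iovec *iov, int max) {
    int cnt = 0;
    if (max > IOV_MAX) max = IOV_MAX;
    for (const replyNode *n = head; n != NULL && cnt < max; n = n->next) {
        iov[cnt].iov_base = (void *)n->buf;
        iov[cnt].iov_len = n->len;
        cnt++;
    }
    return cnt;
}
```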
Filipe reproduced two cases (I assume the sizes of the elements in the zset are small). I suppose we can merge this PR and keep discussing this in the other issue.
@panjf2000 I edited the top comment (to be used for the squash-merge commit message). Let me know, or just fix it, if you see any issues or missing info.
It looks good to me.
There are scenarios where it results in many small objects in the reply list, such as commands heavily using deferred array replies (addReplyDeferredLen), e.g. what the COMMAND command and CLUSTER SLOTS used to do (see #10056, #7123), but also in the case of a transaction or a pipeline of commands where each uses just one deferred array reply.
With the old code, we had to run multiple loops along with multiple calls to write() to send data back to the peer. By means of writev(), we can gather those scattered objects in the reply list, include the static reply buffer as well, and send it all with one system call, which ought to achieve higher performance.
In the case of TLS, we simply check and concatenate the buffers into one big buffer and send it away with one call to connTLSWrite(); if the total size of all buffers exceeds NET_MAX_WRITES_PER_EVENT, we invoke connTLSWrite() multiple times to avoid one massive memory copy.
Note that aside from reducing system calls, this change will also reduce the number of small TCP packets sent.
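As a rough illustration of the gathering described above, here is a hedged sketch; the client and replyBlock structs and the flushClientReplies function are made-up names for this example, not the actual Redis internals.

```c
#include <limits.h>
#include <stddef.h>
#include <sys/uio.h>

#ifndef IOV_MAX
#define IOV_MAX 1024  /* fallback if <limits.h> doesn't expose it */
#endif

/* Made-up stand-ins for the client's static buffer and reply list. */
typedef struct replyBlock {
    char *buf;
    size_t used;
    struct replyBlock *next;
} replyBlock;

typedef struct client {
    int fd;
    char buf[16 * 1024];   /* static reply buffer */
    size_t bufpos;         /* bytes used in the static buffer */
    replyBlock *reply;     /* scattered reply objects */
} client;

/* Gather the static buffer plus the reply-list nodes into one iovec
 * batch (clamped to IOV_MAX) and flush them with a single writev(). */
static ssize_t flushClientReplies(client *c) {
    struct iovec iov[IOV_MAX];
    int cnt = 0;

    if (c->bufpos > 0) {                  /* the static buffer goes first */
        iov[cnt].iov_base = c->buf;
        iov[cnt].iov_len = c->bufpos;
        cnt++;
    }
    for (replyBlock *n = c->reply; n != NULL && cnt < IOV_MAX; n = n->next) {
        if (n->used == 0) continue;       /* skip empty placeholder nodes */
        iov[cnt].iov_base = n->buf;
        iov[cnt].iov_len = n->used;
        cnt++;
    }
    if (cnt == 0) return 0;
    return writev(c->fd, iov, cnt);       /* one syscall for the whole batch */
}
```

The TLS path can't take this shape, since connTLSWrite() sends a single contiguous buffer rather than an iovec array; there, as described above, the scattered buffers are concatenated first and flushed in pieces no larger than NET_MAX_WRITES_PER_EVENT.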