
Reduce system calls of write for client->reply by introducing writev #9934

Merged: 21 commits, merged into redis:unstable on Feb 22, 2022

Conversation

@panjf2000 (Contributor) commented Dec 12, 2021:

There are scenarios that result in many small objects in the reply list,
such as commands that heavily use deferred array replies (addReplyDeferredLen),
e.g. what the COMMAND command and CLUSTER SLOTS used to do (see #10056, #7123),
but also a transaction or a pipeline of commands that each use just one deferred array reply.

With the current code we have to run multiple loop iterations and make multiple calls to write() to send data back to the peer. By means of writev(), we can instead gather those scattered objects in the reply list, include the static reply buffer as well, and send everything with a single system call, which ought to achieve higher performance.

In the case of TLS, we simply check and concatenate the buffers into one big buffer and send it with a single call to connTLSWrite(); if the total size of all buffers exceeds NET_MAX_WRITES_PER_EVENT, we invoke connTLSWrite() multiple times to avoid a massive amount of memory copying.

Note that aside from reducing system calls, this change will also reduce the number of small TCP packets sent.
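
To illustrate the idea (a minimal sketch only, not the PR's actual _writeToClient() code; the reply_block struct, its field names, and the send_replies helper are simplified assumptions), the plain TCP path boils down to building an iovec that covers the static reply buffer plus the reply-list blocks and flushing them with one writev():

```c
#include <sys/types.h>
#include <sys/uio.h>    /* struct iovec, writev */
#include <limits.h>     /* IOV_MAX */
#include <stddef.h>

/* Illustrative stand-in for Redis' clientReplyBlock; the real structure
 * and field names differ. */
struct reply_block { char *buf; size_t used; };

/* Gather the static reply buffer plus the reply-list blocks into one
 * iovec and flush them with a single writev() call. */
ssize_t send_replies(int fd, char *static_buf, size_t static_len,
                     struct reply_block *blocks, int nblocks) {
    struct iovec iov[16];               /* small fixed batch for this sketch */
    int iovcnt = 0;

    if (static_len > 0) {               /* static reply buffer goes first */
        iov[iovcnt].iov_base = static_buf;
        iov[iovcnt].iov_len = static_len;
        iovcnt++;
    }
    for (int i = 0; i < nblocks && iovcnt < 16 && iovcnt < IOV_MAX; i++) {
        if (blocks[i].used == 0) continue;
        iov[iovcnt].iov_base = blocks[i].buf;
        iov[iovcnt].iov_len = blocks[i].used;
        iovcnt++;
    }
    if (iovcnt == 0) return 0;

    /* One system call instead of one write() per block; the socket is
     * non-blocking, so this may return a short count or -1 with EAGAIN. */
    return writev(fd, iov, iovcnt);
}
```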

Commits: Refine code, Fix a test failure, Improve code logic, Refine code, Update code, Add a check
@panjf2000 (Contributor, Author) commented:
This code change passed all tests on my macOS machine but now fails on Linux:

[screenshot: CI test failures]

I fail to see the connection between this code change and tests/unit/scripting.tcl. Any clues?

@oranagra (Member) commented:

@panjf2000 thank you.
Note that this isn't only saving system calls; there's also a good chance it saves TCP packets, specifically when the reply list is composed of many small objects, which is the case with many calls to addReplyDeferredLen, see #7123.

Regarding the tests, the failures seem very consistent; I suppose it's some side effect of your change.

@panjf2000 (Contributor, Author) commented Dec 13, 2021:

Hi @oranagra, could you share some details about these two failed tests with me? I have no idea what might cause these failures or how to fix them, thanks!

@sundb (Collaborator) commented Dec 13, 2021:

@panjf2000 When you start the server with the loglevel verbose config and run eval "local a = {}; local b = {a}; a[1] = b; return a" 0, you can see an "Error writing to client: Invalid argument" log.
connWritev always returns -1 in the "Script return recursive object" test.

@sundb (Collaborator) commented Dec 13, 2021:

@oranagra I have a question: could using writev cause other connections to be blocked?
For example, if connection A has 100MB of reply data (10 clientReplyBlocks), the old code would send it over 10 event cycles, but now all 10 clientReplyBlocks are sent in one cycle.

@panjf2000 (Contributor, Author) commented Dec 13, 2021:

> @oranagra I have a question: could using writev cause other connections to be blocked? For example, if connection A has 100MB of reply data (10 clientReplyBlocks), the old code would send it over 10 event cycles, but now all 10 clientReplyBlocks are sent in one cycle.

I don't think that would be a problem because it will stop sending data when it reaches the NET_MAX_WRITES_PER_EVENT limit (1024*64).

Furthermore, the writev() call ought to return immediately since the socket is non-blocking, so calling writev() will not block even if there is a large number of bytes in the reply list.
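
For context on the non-blocking behavior mentioned above: like write(), writev() on a non-blocking socket may accept only part of the data or fail with EAGAIN/EWOULDBLOCK, and the caller just keeps the rest queued for the next writable event. A hedged sketch (try_writev is a hypothetical helper, not Redis code):

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>

/* Sketch: attempt a gathered write on a non-blocking socket. Returns the
 * number of bytes the kernel accepted, 0 if the send buffer is currently
 * full (EAGAIN/EWOULDBLOCK), or -1 on a genuine error. The caller keeps
 * any unsent data queued for the next writable event. */
ssize_t try_writev(int fd, struct iovec *iov, int iovcnt) {
    ssize_t nwritten = writev(fd, iov, iovcnt);
    if (nwritten == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;   /* socket not writable right now; retry later */
        return -1;      /* real error, e.g. EPIPE or EINVAL */
    }
    return nwritten;    /* may be a short write; advance buffers by this much */
}
```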

@moticless (Collaborator) left a comment:


Great initiative :)

@sundb (Collaborator) commented Dec 14, 2021:

> @oranagra I have a question: could using writev cause other connections to be blocked? For example, if connection A has 100MB of reply data (10 clientReplyBlocks), the old code would send it over 10 event cycles, but now all 10 clientReplyBlocks are sent in one cycle.

> I don't think that would be a problem because it will stop sending data when it reaches the NET_MAX_WRITES_PER_EVENT limit (1024*64).

The replBufBlock size is 16k, which means we can easily exceed the NET_MAX_WRITES_PER_EVENT limit.

> Furthermore, the writev() call ought to return immediately since the socket is non-blocking, so calling writev() will not block even if there is a large number of bytes in the reply list.

Yes, but writev still needs to copy memory from user space to the kernel anyway,
which is probably the reason for the NET_MAX_WRITES_PER_EVENT limit.

@panjf2000 (Contributor, Author) commented:

> @oranagra I have a question: could using writev cause other connections to be blocked? For example, if connection A has 100MB of reply data (10 clientReplyBlocks), the old code would send it over 10 event cycles, but now all 10 clientReplyBlocks are sent in one cycle.
>
> I don't think that would be a problem because it will stop sending data when it reaches the NET_MAX_WRITES_PER_EVENT limit (1024*64).
>
> The replBufBlock size is 16k, which means we can easily exceed the NET_MAX_WRITES_PER_EVENT limit.
>
> Furthermore, the writev() call ought to return immediately since the socket is non-blocking, so calling writev() will not block even if there is a large number of bytes in the reply list.
>
> Yes, but writev still needs to copy memory from user space to the kernel anyway, which is probably the reason for the NET_MAX_WRITES_PER_EVENT limit.

Good point, maybe we should put the NET_MAX_WRITES_PER_EVENT check into _writeToClient() for writev().
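
One possible shape for that check (a sketch under the assumption that the limit is the 1024*64 value quoted above; fill_iov_capped and reply_block are hypothetical names, not the PR's code) is to stop adding reply blocks to the iovec once the accumulated byte count would exceed the per-event limit:

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <stddef.h>

/* Value quoted elsewhere in this thread (1024*64); purely for illustration. */
#define NET_MAX_WRITES_PER_EVENT (1024*64)

/* Illustrative stand-in for a reply-list node. */
struct reply_block { char *buf; size_t used; };

/* Fill 'iov' from the reply blocks, but stop adding blocks once the total
 * would exceed the per-event write limit, so one client cannot monopolize
 * a single event-loop iteration. Returns the number of iovec entries used. */
int fill_iov_capped(struct reply_block *blocks, int nblocks,
                    struct iovec *iov, int maxiov) {
    size_t total = 0;
    int iovcnt = 0;
    for (int i = 0; i < nblocks && iovcnt < maxiov; i++) {
        if (iovcnt > 0 && total + blocks[i].used > NET_MAX_WRITES_PER_EVENT)
            break;      /* flush what we have; the rest waits for the next event */
        iov[iovcnt].iov_base = blocks[i].buf;
        iov[iovcnt].iov_len = blocks[i].used;
        total += blocks[i].used;
        iovcnt++;
    }
    return iovcnt;
}
```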

@panjf2000 requested a review from sundb on December 14, 2021.
@panjf2000 (Contributor, Author) commented:

@oranagra @sundb @moticless
PTAL

@panjf2000 (Contributor, Author) commented Dec 14, 2021:

> Yes, but writev still needs to copy memory from user space to the kernel anyway, which is probably the reason for the NET_MAX_WRITES_PER_EVENT limit.

Actually, there's no explicit copying of the data behind the iov pointers into the kernel's memory, see https://elixir.bootlin.com/linux/v5.0/source/net/ipv4/tcp.c#L1174 and https://lwn.net/Articles/604287/. Therefore, even if we don't limit the number of bytes to NET_MAX_WRITES_PER_EVENT, the kernel will only write bytes up to the maximum of the socket send buffer (defined by /proc/sys/net/ipv4/tcp_wmem) rather than all the bytes from user space (e.g. 100MB).
@sundb

@oranagra (Member) commented:

I'm sorry guys, I'm really busy elsewhere and haven't reviewed the code or the correspondence here yet.
Keep up the good work, I'll get to it some day.

@moticless (Collaborator) commented:

> whether using writev may cause blocking of other connections. For example, if connection A has 100MB of reply data (10 clientReplyBlocks), in the old code…

Another aspect of this matter: we might degrade performance in the case of big replies, especially with TLS.
I think we had better always verify that the total aggregated bytes are no more than NET_MAX_WRITES_PER_EVENT, as the caller did before. This includes refining the first entry condition (listLength(c->reply) > 1) into a condition that verifies the sum of the first two replies is no more than NET_MAX_WRITES_PER_EVENT. We also let the higher-level logic in writeToClient() check its own conditions for whether to continue the loop or break.

@panjf2000 (Contributor, Author) commented:

> whether using writev may cause blocking of other connections. For example, if connection A has 100MB of reply data (10 clientReplyBlocks), in the old code…
>
> Another aspect of this matter: we might degrade performance in the case of big replies, especially with TLS. I think we had better always verify that the total aggregated bytes are no more than NET_MAX_WRITES_PER_EVENT, as the caller did before. This includes refining the first entry condition (listLength(c->reply) > 1) into a condition that verifies the sum of the first two replies is no more than NET_MAX_WRITES_PER_EVENT. We also let the higher-level logic in writeToClient() check its own conditions for whether to continue the loop or break.

I think MAX_IOV_SIZE_PER_EVENT (latest commit in this PR) can achieve the same goal as using NET_MAX_WRITES_PER_EVENT, or do you have some other comments on the latest commit?
@moticless

@moticless (Collaborator) left a comment:


Thank you

@filipecosta90 added and then removed the action:run-benchmark label (Triggers the benchmark suite for this Pull Request) on Feb 16, 2022.
@panjf2000 (Contributor, Author) commented Feb 18, 2022:

I still failed to reproduce the regression results on my own machine. The performance of unstable and use-writev seems almost the same across generic scenarios, and the profiling results of the two branches are basically the same. Besides, the writev() system call didn't even show up in the flame graph of use-writev, which could mean that most of the time it used write() just like unstable did, so in theory there shouldn't be a performance gap between the two branches.

unstable

[flame graph screenshot]

use-writev

[flame graph screenshot]

Is there any chance I could run profiling on your AWS VMs, since the regression results seem to reproduce easily on those VMs? @filipecosta90

@oranagra

@oranagra (Member) commented:

Maybe the difference is in the TCP/IP networking stack.
I.e. the difference is not in CPU time, so it will not be visible in a flame graph, but rather some difference or side effect of using writev on TCP/network behavior.
I'm not sure how Filipe's benchmarks are run, but I assume Andy's are local.
It could also be some issue with the AWS hypervisor, but maybe the first step is to try comparing them on a real network (commonly the actual use case for Redis).

@filipecosta90 added and then removed the action:run-benchmark label (Triggers the benchmark suite for this Pull Request) on Feb 19, 2022.
@filipecosta90 (Contributor) commented:

@oranagra / @panjf2000, WRT:

> I'm not sure how Filipe's benchmarks are run, but I assume Andy's are local.

The benchmarks use 2 VMs (1 DB and 1 client virtual machine (KVM) on AWS).

WRT benchmark automation, I noticed some variance across multiple runs of the same benchmark for unstable and the comparison branch.
Apart from 4 unstable benchmarks (let's discard them if you agree there is no change that could impact them), we have 9 benchmarks with no change.

redisbench-admin compare --triggering_env ci   --baseline-branch unstable --comparison-branch panjf2000:use-writev  --github_repo redis --github_org redis --use_metric_context_path true --metric_name Ops/sec --testname_regex .+Totals
Effective log level set to INFO
2022-02-19 21:52:52,199 INFO Using: redisbench-admin 0.6.23
2022-02-19 21:52:52,793 INFO Using deployment_type=oss-standalone and deployment_name=oss-standalone for the analysis
2022-02-19 21:52:52,793 INFO Using a time-delta from 7 days ago to now
2022-02-19 21:52:52,984 WARNING Based on test-cases set (key=ci.benchmarks.redislabs/ci/redis/redis:testcases_AND_metric_context_path) we have 14 comparison points. 
2022-02-19 21:53:02,096 INFO Printing differential analysis between branches
# Comparison between unstable and panjf2000:use-writev for metric: Ops/sec. Time Period from 7 days ago to now
|                                   Test Case                                    |   Baseline (median obs. +- std.dev)    |  Comparison (median obs. +- std.dev)   |% change (higher-better)|            Note             |
|--------------------------------------------------------------------------------|----------------------------------------|----------------------------------------|------------------------|-----------------------------|
|memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10:Totals| 312220 +- 20.6% UNSTABLE (5 datapoints)| 315862 +- 1.6%  (2 datapoints)         |1.2%                    |UNSTABLE (very high variance)|
|memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values:Totals            | 134161 +- 0.5%  (5 datapoints)         | 117642 +- 18.6% UNSTABLE (2 datapoints)|-12.3%                  |UNSTABLE (very high variance)|
|memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10:Totals | 364426 +- 4.0%  (5 datapoints)         | 371266 +- 0.2%  (2 datapoints)         |1.9%                    |-- no change --              |
|memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values:Totals             | 143372 +- 4.3%  (5 datapoints)         | 141731 +- 2.3%  (2 datapoints)         |-1.1%                   |-- no change --              |
|memtier_benchmark-1Mkeys-100B-expire-use-case:Totals                            | 219014 +- 2.3%  (5 datapoints)         | 219491 +- 0.3%  (2 datapoints)         |0.2%                    |-- no change --              |
|memtier_benchmark-1Mkeys-10B-expire-use-case:Totals                             | 217413 +- 0.4%  (5 datapoints)         | 219192 +- 0.1%  (2 datapoints)         |0.8%                    |-- no change --              |
|memtier_benchmark-1Mkeys-1KiB-expire-use-case:Totals                            | 217806 +- 0.4%  (5 datapoints)         | 219764 +- 0.8%  (2 datapoints)         |0.9%                    |-- no change --              |
|memtier_benchmark-1Mkeys-4KiB-expire-use-case:Totals                            | 217860 +- 2.0%  (5 datapoints)         | 218195 +- 0.7%  (2 datapoints)         |0.2%                    |-- no change --              |
|memtier_benchmark-1Mkeys-hash-hget-hgetall-hkeys-hvals-with-100B-values:Totals  | 169042 +- 1.9%  (5 datapoints)         | 167368 +- 1.3%  (2 datapoints)         |-1.0%                   |-- no change --              |
|memtier_benchmark-1Mkeys-list-lpop-rpop-with-100B-values:Totals                 | 165841 +- 4.1%  (5 datapoints)         | 183398 +- 10.4% UNSTABLE (2 datapoints)|10.6%                   |UNSTABLE (very high variance)|
|memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10:Totals| 140570 +- 0.3%  (5 datapoints)         | 139839 +- 0.7%  (2 datapoints)         |-0.5%                   |-- no change --              |
|memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values:Totals            | 99571 +- 14.4% UNSTABLE (5 datapoints) | 100428 +- 0.9%  (2 datapoints)         |0.9%                    |UNSTABLE (very high variance)|
|memtier_benchmark-1Mkeys-load-list-with-100B-values:Totals                      | 150928 +- 1.8%  (5 datapoints)         | 152043 +- 0.4%  (2 datapoints)         |0.7%                    |-- no change --              |
2022-02-19 21:53:02,162 INFO Detected a total of 9 stable tests between versions.
2022-02-19 21:53:02,162 WARNING Detected a total of 4 highly unstable benchmarks.

@oranagra and @panjf2000, still about the impact of commands with deferred len: I've tested a RedisTimeSeries module use case that uses deferred replies (cc @gkorland @OfirMos @dann). I was expecting a higher impact on the numbers. Even though we got a slight improvement over unstable, the change is not as meaningful as on the COMMAND numbers shared earlier (which I can indeed reproduce). TLDR, the positive impact of deferred replies is only measurable/meaningful with a really large number of deferred writes per reply (otherwise the impact fades away, like the one below).

unstable

memtier_benchmark --command="TS.RANGE ts - + COUNT 1000"  --hide-histogram --test-time 600 -s 10.3.0.175 -p 6380 
(...)
ALL STATS
====================================================================================================
Type           Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------
Ts.ranges      5218.29        38.29828        38.39900        42.23900        43.77500    139793.09 
Totals         5218.29        38.29828        38.39900        42.23900        43.77500    139793.09 

panjf2000/redis

ALL STATS
====================================================================================================
Type           Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------
Ts.ranges      5315.26        37.60366        37.63100        43.00700        45.82300    142390.83 
Totals         5315.26        37.60366        37.63100        43.00700        45.82300    142390.83 

To conclude

TLDR: I was expecting a larger impact, but in the "common case" this is not happening. Nonetheless, there are indeed some not-so-usual use cases that are improved. I see reason to merge it :)

@oranagra (Member) commented Feb 20, 2022:

> I've tested a RedisTimeSeries module use case that uses deferred replies. I was expecting a higher impact on the numbers.

@filipecosta90 I'm not sure how RedisTimeSeries uses deferred replies, whether it is just one per command, or inside a loop (like the COMMAND command and CLUSTER SLOTS used to do, see #10056, #7123).
If it's just one per command, the same as the COMMAND command still does, it should not have a high impact.

> the change is not as meaningful as on the COMMAND numbers shared (which I can indeed reproduce)

Did you reproduce this on the latest, or on the old copy of this branch and its merge-base?

> Apart from 4 unstable benchmarks (let's discard them if you agree there is no change that could impact them), we have 9 benchmarks with no change.

So can we now conclude that this PR doesn't do any damage in the common use cases (even over a real network)?

Just to point out again, this PR does get some 300% performance improvement for the old code of the COMMAND command, see:
#9934 (comment)
and about a 10% performance improvement with TLS, see:
#9934 (comment)

@yossigo please ack.

@panjf2000 (Contributor, Author) commented:

Any updates on this?

@oranagra (Member) commented:

We discussed and approved this today in a core-team meeting.
Maybe the last possible pending concern is to work out why it didn't have the expected impact on #10310, but maybe none of that is going to affect this PR.

So, does anyone have any other concerns, or anything I forgot, before I merge it?

@panjf2000 (Contributor, Author) commented Feb 22, 2022:

I just took a quick look at #10310, and I wonder whether the root cause of the regression of v6.2.6 compared to v5.0.2 has been found.
@oranagra

@oranagra (Member) commented:

The root cause is that in 6.2.x we started using deferred replies (#7844).
Filipe is supposed to try to "revert" this on unstable and see if it explains the whole regression or just part of it (which would then explain why writev didn't completely undo the regression).

@panjf2000 (Contributor, Author) commented:

Is there any chance that the number of those deferred replies is way larger than IOV_MAX (1024), which might explain why writev() only alleviates rather than completely cures the regression?
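
For reference, POSIX caps the number of iovec entries a single writev() accepts at IOV_MAX (1024 on Linux), so a reply list with more nodes than that has to be flushed across several writev() calls anyway. A rough sketch of that chunking (writev_in_chunks is a hypothetical helper, not the PR's code):

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <limits.h>     /* IOV_MAX */

/* Sketch: a reply list longer than IOV_MAX entries cannot be flushed by a
 * single writev(), so it has to be sent in chunks of at most IOV_MAX
 * entries. Partial writes are ignored here for brevity; real code would
 * track how many bytes of the current chunk were actually written. */
ssize_t writev_in_chunks(int fd, struct iovec *iov, int total) {
    ssize_t sent = 0;
    int off = 0;
    while (off < total) {
        int cnt = total - off;
        if (cnt > IOV_MAX) cnt = IOV_MAX;       /* one writev() per chunk */
        ssize_t n = writev(fd, iov + off, cnt);
        if (n == -1) return sent > 0 ? sent : -1;
        sent += n;
        off += cnt;
    }
    return sent;
}
```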

@oranagra (Member) commented:

Filipe reproduced two cases (I assume the sizes of the elements in the zset are small):
One without pipelining, in which case writev didn't help; I'm not sure whether this regression comes from the fact that there are two writes instead of one, or from another change.
The second with a pipeline of 16, so there should be 32 nodes in the reply list (I assume they're all small); writev improved this case, but not nearly back to what it was in the past.

I suppose we can merge this PR and keep discussing this in the other issue.
Either we'll realize there's an additional regression, or we'll figure out why writev can't solve it entirely.
Either way, I don't see how the conclusion could affect this PR.

@oranagra (Member) commented:

@panjf2000 I edited the top comment (to be used for the squash-merge commit comment). Let me know, or just fix it, if you see any issues or missing info.

@panjf2000 (Contributor, Author) commented Feb 22, 2022:

> @panjf2000 I edited the top comment (to be used for the squash-merge commit comment). Let me know, or just fix it, if you see any issues or missing info.

It looks good to me.
@oranagra

@oranagra merged commit 496375f into redis:unstable on Feb 22, 2022.
@oranagra added the release-notes label (indication that this issue needs to be mentioned in the release notes) on Feb 22, 2022.
@panjf2000 deleted the use-writev branch on September 15, 2022.
Labels: action:run-benchmark (Triggers the benchmark suite for this Pull Request), release-notes (indication that this issue needs to be mentioned in the release notes)

7 participants