
Reduce system calls of write for client->reply by introducing writev #9934

Merged: 21 commits, merged into redis:unstable on Feb 22, 2022

Conversation

@panjf2000 (Contributor) commented Dec 12, 2021:

There are scenarios that result in many small objects in the reply list,
such as commands that heavily use deferred array replies (addReplyDeferredLen),
e.g. what the COMMAND command and CLUSTER SLOTS used to do (see #10056, #7123),
but also a transaction or a pipeline of commands that each use just one deferred array reply.

With the current code we have to run multiple loop iterations and make multiple calls to write() to send data back to the peer. By means of writev(), we can instead gather those scattered objects in the reply list, include the static reply buffer as well, and send everything with a single system call, which ought to achieve higher performance.

In the case of TLS, we simply check and concatenate the buffers into one big buffer and send it with a single call to connTLSWrite(); if the total size of all buffers exceeds NET_MAX_WRITES_PER_EVENT, we invoke connTLSWrite() multiple times to avoid a massive amount of memory copying.

Note that aside from reducing system calls, this change will also reduce the number of small TCP packets sent.
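
To illustrate the idea (a minimal sketch only, not the PR's actual _writeToClient() code; the reply_block struct, its field names, and the send_replies helper are simplified assumptions), the plain TCP path boils down to building an iovec that covers the static reply buffer plus the reply-list blocks and flushing them with one writev():

```c
#include <sys/types.h>
#include <sys/uio.h>    /* struct iovec, writev */
#include <limits.h>     /* IOV_MAX */
#include <stddef.h>

/* Illustrative stand-in for Redis' clientReplyBlock; the real structure
 * and field names differ. */
struct reply_block { char *buf; size_t used; };

/* Gather the static reply buffer plus the reply-list blocks into one
 * iovec and flush them with a single writev() call. */
ssize_t send_replies(int fd, char *static_buf, size_t static_len,
                     struct reply_block *blocks, int nblocks) {
    struct iovec iov[16];               /* small fixed batch for this sketch */
    int iovcnt = 0;

    if (static_len > 0) {               /* static reply buffer goes first */
        iov[iovcnt].iov_base = static_buf;
        iov[iovcnt].iov_len = static_len;
        iovcnt++;
    }
    for (int i = 0; i < nblocks && iovcnt < 16 && iovcnt < IOV_MAX; i++) {
        if (blocks[i].used == 0) continue;
        iov[iovcnt].iov_base = blocks[i].buf;
        iov[iovcnt].iov_len = blocks[i].used;
        iovcnt++;
    }
    if (iovcnt == 0) return 0;

    /* One system call instead of one write() per block; the socket is
     * non-blocking, so this may return a short count or -1 with EAGAIN. */
    return writev(fd, iov, iovcnt);
}
```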

Commits: Refine code, Fix a test failure, Improve code logic, Refine code, Update code, Add a check
@panjf2000 (Contributor, Author) commented:
This code change passed all tests on my macOS machine but now fails on Linux:

[screenshot: CI test failures]

I fail to see the connection between this code change and tests/unit/scripting.tcl. Any clues?

@oranagra (Member) commented:

@panjf2000 thank you.
Note that this isn't only saving system calls; there's also a good chance it saves TCP packets, specifically when the reply list is composed of many small objects, which is the case with many calls to addReplyDeferredLen, see #7123.

Regarding the tests, the failures seem very consistent; I suppose it's some side effect of your change.

@panjf2000 (Contributor, Author) commented Dec 13, 2021:

Hi @oranagra, could you share some details about these two failed tests with me? I have no idea what might cause these failures or how to fix them, thanks!

@sundb (Collaborator) commented Dec 13, 2021:

@panjf2000 When you start the server with the loglevel verbose config and run eval "local a = {}; local b = {a}; a[1] = b; return a" 0, you can see an "Error writing to client: Invalid argument" log.
connWritev always returns -1 in the "Script return recursive object" test.

@sundb (Collaborator) commented Dec 13, 2021:

@oranagra I have a question: could using writev cause other connections to be blocked?
For example, if connection A has 100MB of reply data (10 clientReplyBlocks), the old code would send it over 10 event cycles, but now all 10 clientReplyBlocks are sent in one cycle.

@panjf2000 (Contributor, Author) commented Dec 13, 2021:

> @oranagra I have a question: could using writev cause other connections to be blocked? For example, if connection A has 100MB of reply data (10 clientReplyBlocks), the old code would send it over 10 event cycles, but now all 10 clientReplyBlocks are sent in one cycle.

I don't think that would be a problem because it will stop sending data when it reaches the NET_MAX_WRITES_PER_EVENT limit (1024*64).

Furthermore, the writev() call ought to return immediately since the socket is non-blocking, so calling writev() will not block even if there is a large number of bytes in the reply list.
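
For context on the non-blocking behavior mentioned above: like write(), writev() on a non-blocking socket may accept only part of the data or fail with EAGAIN/EWOULDBLOCK, and the caller just keeps the rest queued for the next writable event. A hedged sketch (try_writev is a hypothetical helper, not Redis code):

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>

/* Sketch: attempt a gathered write on a non-blocking socket. Returns the
 * number of bytes the kernel accepted, 0 if the send buffer is currently
 * full (EAGAIN/EWOULDBLOCK), or -1 on a genuine error. The caller keeps
 * any unsent data queued for the next writable event. */
ssize_t try_writev(int fd, struct iovec *iov, int iovcnt) {
    ssize_t nwritten = writev(fd, iov, iovcnt);
    if (nwritten == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;   /* socket not writable right now; retry later */
        return -1;      /* real error, e.g. EPIPE or EINVAL */
    }
    return nwritten;    /* may be a short write; advance buffers by this much */
}
```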

@moticless (Collaborator) left a comment:


Great initiative :)

@sundb (Collaborator) commented Dec 14, 2021:

> @oranagra I have a question: could using writev cause other connections to be blocked? For example, if connection A has 100MB of reply data (10 clientReplyBlocks), the old code would send it over 10 event cycles, but now all 10 clientReplyBlocks are sent in one cycle.

> I don't think that would be a problem because it will stop sending data when it reaches the NET_MAX_WRITES_PER_EVENT limit (1024*64).

The replBufBlock size is 16k, which means we can easily exceed the NET_MAX_WRITES_PER_EVENT limit.

> Furthermore, the writev() call ought to return immediately since the socket is non-blocking, so calling writev() will not block even if there is a large number of bytes in the reply list.

Yes, but writev still needs to copy memory from user space to the kernel anyway,
which is probably the reason for the NET_MAX_WRITES_PER_EVENT limit.

@panjf2000 (Contributor, Author) commented:

> @oranagra I have a question: could using writev cause other connections to be blocked? For example, if connection A has 100MB of reply data (10 clientReplyBlocks), the old code would send it over 10 event cycles, but now all 10 clientReplyBlocks are sent in one cycle.
>
> I don't think that would be a problem because it will stop sending data when it reaches the NET_MAX_WRITES_PER_EVENT limit (1024*64).
>
> The replBufBlock size is 16k, which means we can easily exceed the NET_MAX_WRITES_PER_EVENT limit.
>
> Furthermore, the writev() call ought to return immediately since the socket is non-blocking, so calling writev() will not block even if there is a large number of bytes in the reply list.
>
> Yes, but writev still needs to copy memory from user space to the kernel anyway, which is probably the reason for the NET_MAX_WRITES_PER_EVENT limit.

Good point, maybe we should put the NET_MAX_WRITES_PER_EVENT check into _writeToClient() for writev().
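
One possible shape for that check (a sketch under the assumption that the limit is the 1024*64 value quoted above; fill_iov_capped and reply_block are hypothetical names, not the PR's code) is to stop adding reply blocks to the iovec once the accumulated byte count would exceed the per-event limit:

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <stddef.h>

/* Value quoted elsewhere in this thread (1024*64); purely for illustration. */
#define NET_MAX_WRITES_PER_EVENT (1024*64)

/* Illustrative stand-in for a reply-list node. */
struct reply_block { char *buf; size_t used; };

/* Fill 'iov' from the reply blocks, but stop adding blocks once the total
 * would exceed the per-event write limit, so one client cannot monopolize
 * a single event-loop iteration. Returns the number of iovec entries used. */
int fill_iov_capped(struct reply_block *blocks, int nblocks,
                    struct iovec *iov, int maxiov) {
    size_t total = 0;
    int iovcnt = 0;
    for (int i = 0; i < nblocks && iovcnt < maxiov; i++) {
        if (iovcnt > 0 && total + blocks[i].used > NET_MAX_WRITES_PER_EVENT)
            break;      /* flush what we have; the rest waits for the next event */
        iov[iovcnt].iov_base = blocks[i].buf;
        iov[iovcnt].iov_len = blocks[i].used;
        total += blocks[i].used;
        iovcnt++;
    }
    return iovcnt;
}
```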

@panjf2000 requested a review from sundb on December 14, 2021.
@panjf2000 (Contributor, Author) commented:

@oranagra @sundb @moticless
PTAL

@panjf2000 (Contributor, Author) commented Dec 14, 2021:

> Yes, but writev still needs to copy memory from user space to the kernel anyway, which is probably the reason for the NET_MAX_WRITES_PER_EVENT limit.

Actually, there's no explicit copying of the data behind the iov pointers into the kernel's memory, see https://elixir.bootlin.com/linux/v5.0/source/net/ipv4/tcp.c#L1174 and https://lwn.net/Articles/604287/. Therefore, even if we don't limit the number of bytes to NET_MAX_WRITES_PER_EVENT, the kernel will only write bytes up to the maximum of the socket send buffer (defined by /proc/sys/net/ipv4/tcp_wmem) rather than all the bytes from user space (e.g. 100MB).
@sundb

@oranagra (Member) commented:

I'm sorry guys, I'm really busy elsewhere and haven't reviewed the code or the correspondence here yet.
Keep up the good work, I'll get to it some day.

@moticless (Collaborator) commented:

> whether using writev may cause blocking of other connections. For example, if connection A has 100MB of reply data (10 clientReplyBlocks), in the old code…

Another aspect of this matter: we might degrade performance in the case of big replies, especially with TLS.
I think we had better always verify that the total aggregated bytes are no more than NET_MAX_WRITES_PER_EVENT, as the caller did before. This includes refining the first entry condition (listLength(c->reply) > 1) into a condition that verifies the sum of the first two replies is no more than NET_MAX_WRITES_PER_EVENT. We also let the higher-level logic in writeToClient() check its own conditions for whether to continue the loop or break.

@panjf2000 (Contributor, Author) commented:

> whether using writev may cause blocking of other connections. For example, if connection A has 100MB of reply data (10 clientReplyBlocks), in the old code…
>
> Another aspect of this matter: we might degrade performance in the case of big replies, especially with TLS. I think we had better always verify that the total aggregated bytes are no more than NET_MAX_WRITES_PER_EVENT, as the caller did before. This includes refining the first entry condition (listLength(c->reply) > 1) into a condition that verifies the sum of the first two replies is no more than NET_MAX_WRITES_PER_EVENT. We also let the higher-level logic in writeToClient() check its own conditions for whether to continue the loop or break.

I think MAX_IOV_SIZE_PER_EVENT (latest commit in this PR) can achieve the same goal as using NET_MAX_WRITES_PER_EVENT, or do you have some other comments on the latest commit?
@moticless

@moticless (Collaborator) left a comment:


Thank you

@filipecosta90 added and then removed the action:run-benchmark label (Triggers the benchmark suite for this Pull Request) on Feb 16, 2022.
@panjf2000 (Contributor, Author) commented Feb 18, 2022:

I still failed to reproduce the regression results on my own machine. The performance of unstable and use-writev seems almost the same across generic scenarios, and the profiling results of the two branches are basically the same. Besides, the writev() system call didn't even show up in the flame graph of use-writev, which could mean that most of the time it used write() just like unstable did, so in theory there shouldn't be a performance gap between the two branches.

unstable

[flame graph screenshot]

use-writev

[flame graph screenshot]

Is there any chance I could run profiling on your AWS VMs, since the regression results seem to reproduce easily on those VMs? @filipecosta90

@oranagra

@oranagra (Member) commented:

Maybe the difference is in the TCP/IP networking stack.
I.e. the difference is not in CPU time, so it will not be visible in a flame graph, but rather some difference or side effect of using writev on TCP/network behavior.
I'm not sure how Filipe's benchmarks are run, but I assume Andy's are local.
It could also be some issue with the AWS hypervisor, but maybe the first step is to try comparing them on a real network (commonly the actual use case for Redis).

@filipecosta90 added and then removed the action:run-benchmark label (Triggers the benchmark suite for this Pull Request) on Feb 19, 2022.
@filipecosta90 (Contributor) commented:

@oranagra / @panjf2000, WRT:

> I'm not sure how Filipe's benchmarks are run, but I assume Andy's are local.

The benchmarks use 2 VMs (1 DB and 1 client virtual machine (KVM) on AWS).

WRT benchmark automation, I noticed some variance across multiple runs of the same benchmark for unstable and the comparison branch.
Apart from 4 unstable benchmarks (let's discard them if you agree there is no change that could impact them), we have 9 benchmarks with no change.

redisbench-admin compare --triggering_env ci   --baseline-branch unstable --comparison-branch panjf2000:use-writev  --github_repo redis --github_org redis --use_metric_context_path true --metric_name Ops/sec --testname_regex .+Totals
Effective log level set to INFO
2022-02-19 21:52:52,199 INFO Using: redisbench-admin 0.6.23
2022-02-19 21:52:52,793 INFO Using deployment_type=oss-standalone and deployment_name=oss-standalone for the analysis
2022-02-19 21:52:52,793 INFO Using a time-delta from 7 days ago to now
2022-02-19 21:52:52,984 WARNING Based on test-cases set (key=ci.benchmarks.redislabs/ci/redis/redis:testcases_AND_metric_context_path) we have 14 comparison points. 
2022-02-19 21:53:02,096 INFO Printing differential analysis between branches
# Comparison between unstable and panjf2000:use-writev for metric: Ops/sec. Time Period from 7 days ago to now
|                                   Test Case                                    |   Baseline (median obs. +- std.dev)    |  Comparison (median obs. +- std.dev)   |% change (higher-better)|            Note             |
|--------------------------------------------------------------------------------|----------------------------------------|----------------------------------------|------------------------|-----------------------------|
|memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10:Totals| 312220 +- 20.6% UNSTABLE (5 datapoints)| 315862 +- 1.6%  (2 datapoints)         |1.2%                    |UNSTABLE (very high variance)|
|memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values:Totals            | 134161 +- 0.5%  (5 datapoints)         | 117642 +- 18.6% UNSTABLE (2 datapoints)|-12.3%                  |UNSTABLE (very high variance)|
|memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10:Totals | 364426 +- 4.0%  (5 datapoints)         | 371266 +- 0.2%  (2 datapoints)         |1.9%                    |-- no change --              |
|memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values:Totals             | 143372 +- 4.3%  (5 datapoints)         | 141731 +- 2.3%  (2 datapoints)         |-1.1%                   |-- no change --              |
|memtier_benchmark-1Mkeys-100B-expire-use-case:Totals                            | 219014 +- 2.3%  (5 datapoints)         | 219491 +- 0.3%  (2 datapoints)         |0.2%                    |-- no change --              |
|memtier_benchmark-1Mkeys-10B-expire-use-case:Totals                             | 217413 +- 0.4%  (5 datapoints)         | 219192 +- 0.1%  (2 datapoints)         |0.8%                    |-- no change --              |
|memtier_benchmark-1Mkeys-1KiB-expire-use-case:Totals                            | 217806 +- 0.4%  (5 datapoints)         | 219764 +- 0.8%  (2 datapoints)         |0.9%                    |-- no change --              |
|memtier_benchmark-1Mkeys-4KiB-expire-use-case:Totals                            | 217860 +- 2.0%  (5 datapoints)         | 218195 +- 0.7%  (2 datapoints)         |0.2%                    |-- no change --              |
|memtier_benchmark-1Mkeys-hash-hget-hgetall-hkeys-hvals-with-100B-values:Totals  | 169042 +- 1.9%  (5 datapoints)         | 167368 +- 1.3%  (2 datapoints)         |-1.0%                   |-- no change --              |
|memtier_benchmark-1Mkeys-list-lpop-rpop-with-100B-values:Totals                 | 165841 +- 4.1%  (5 datapoints)         | 183398 +- 10.4% UNSTABLE (2 datapoints)|10.6%                   |UNSTABLE (very high variance)|
|memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10:Totals| 140570 +- 0.3%  (5 datapoints)         | 139839 +- 0.7%  (2 datapoints)         |-0.5%                   |-- no change --              |
|memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values:Totals            | 99571 +- 14.4% UNSTABLE (5 datapoints) | 100428 +- 0.9%  (2 datapoints)         |0.9%                    |UNSTABLE (very high variance)|
|memtier_benchmark-1Mkeys-load-list-with-100B-values:Totals                      | 150928 +- 1.8%  (5 datapoints)         | 152043 +- 0.4%  (2 datapoints)         |0.7%                    |-- no change --              |
2022-02-19 21:53:02,162 INFO Detected a total of 9 stable tests between versions.
2022-02-19 21:53:02,162 WARNING Detected a total of 4 highly unstable benchmarks.

@oranagra and @panjf2000, still about the impact of commands with deferred len: I've tested a RedisTimeSeries module use case that uses deferred replies (cc @gkorland @OfirMos @dann). I was expecting a higher impact on the numbers. Even though we got a slight improvement over unstable, the change is not as meaningful as on the COMMAND numbers shared earlier (which I can indeed reproduce). TLDR, the positive impact of deferred replies is only measurable/meaningful with a really large number of deferred writes per reply (otherwise the impact fades away, like the one below).

unstable

memtier_benchmark --command="TS.RANGE ts - + COUNT 1000"  --hide-histogram --test-time 600 -s 10.3.0.175 -p 6380 
(...)
ALL STATS
====================================================================================================
Type           Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------
Ts.ranges      5218.29        38.29828        38.39900        42.23900        43.77500    139793.09 
Totals         5218.29        38.29828        38.39900        42.23900        43.77500    139793.09 

panjf2000/redis

ALL STATS
====================================================================================================
Type           Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------
Ts.ranges      5315.26        37.60366        37.63100        43.00700        45.82300    142390.83 
Totals         5315.26        37.60366        37.63100        43.00700        45.82300    142390.83 

To conclude

TLDR: I was expecting a larger impact, but in the "common case" this is not happening. Nonetheless, there are indeed some not-so-usual use cases that are improved. I see reason to merge it :)

@oranagra (Member) commented Feb 20, 2022:

> I've tested a RedisTimeSeries module use case that uses deferred replies. I was expecting a higher impact on the numbers.

@filipecosta90 I'm not sure how RedisTimeSeries uses deferred replies, whether it is just one per command, or inside a loop (like the COMMAND command and CLUSTER SLOTS used to do, see #10056, #7123).
If it's just one per command, the same as the COMMAND command still does, it should not have a high impact.

> the change is not as meaningful as on the COMMAND numbers shared (which I can indeed reproduce)

Did you reproduce this on the latest, or on the old copy of this branch and its merge-base?

> Apart from 4 unstable benchmarks (let's discard them if you agree there is no change that could impact them), we have 9 benchmarks with no change.

So can we now conclude that this PR doesn't do any damage in the common use cases (even over a real network)?

Just to point out again, this PR does get some 300% performance improvement for the old code of the COMMAND command, see:
#9934 (comment)
and about a 10% performance improvement with TLS, see:
#9934 (comment)

@yossigo please ack.

@panjf2000 (Contributor, Author) commented:

Any updates on this?

@oranagra (Member) commented:

We discussed and approved this today in a core-team meeting.
Maybe the last possible pending concern is to work out why it didn't have the expected impact on #10310, but maybe none of that is going to affect this PR.

So, does anyone have any other concerns, or anything I forgot, before I merge it?

@panjf2000 (Contributor, Author) commented Feb 22, 2022:

I just took a quick look at #10310, and I wonder whether the root cause of the regression of v6.2.6 compared to v5.0.2 has been found.
@oranagra

@oranagra (Member) commented:

The root cause is that in 6.2.x we started using deferred replies (#7844).
Filipe is supposed to try to "revert" this on unstable and see if it explains the whole regression or just part of it (which would then explain why writev didn't completely undo the regression).

@panjf2000 (Contributor, Author) commented:

Is there any chance that the number of those deferred replies is way larger than IOV_MAX (1024), which might explain why writev() only alleviates rather than completely cures the regression?
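
For reference, POSIX caps the number of iovec entries a single writev() accepts at IOV_MAX (1024 on Linux), so a reply list with more nodes than that has to be flushed across several writev() calls anyway. A rough sketch of that chunking (writev_in_chunks is a hypothetical helper, not the PR's code):

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <limits.h>     /* IOV_MAX */

/* Sketch: a reply list longer than IOV_MAX entries cannot be flushed by a
 * single writev(), so it has to be sent in chunks of at most IOV_MAX
 * entries. Partial writes are ignored here for brevity; real code would
 * track how many bytes of the current chunk were actually written. */
ssize_t writev_in_chunks(int fd, struct iovec *iov, int total) {
    ssize_t sent = 0;
    int off = 0;
    while (off < total) {
        int cnt = total - off;
        if (cnt > IOV_MAX) cnt = IOV_MAX;       /* one writev() per chunk */
        ssize_t n = writev(fd, iov + off, cnt);
        if (n == -1) return sent > 0 ? sent : -1;
        sent += n;
        off += cnt;
    }
    return sent;
}
```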

@oranagra (Member) commented:

Filipe reproduced two cases (I assume the sizes of the elements in the zset are small):
One without pipelining, in which case writev didn't help; I'm not sure whether this regression comes from the fact that there are two writes instead of one, or from another change.
The second with a pipeline of 16, so there should be 32 nodes in the reply list (I assume they're all small); writev improved this case, but not nearly back to what it was in the past.

I suppose we can merge this PR and keep discussing this in the other issue.
Either we'll realize there's an additional regression, or we'll figure out why writev can't solve it entirely.
Either way, I don't see how the conclusion could affect this PR.

@oranagra (Member) commented:

@panjf2000 I edited the top comment (to be used for the squash-merge commit comment). Let me know, or just fix it, if you see any issues or missing info.

@panjf2000 (Contributor, Author) commented Feb 22, 2022:

> @panjf2000 I edited the top comment (to be used for the squash-merge commit comment). Let me know, or just fix it, if you see any issues or missing info.

It looks good to me.
@oranagra

@oranagra merged commit 496375f into redis:unstable on Feb 22, 2022.
@oranagra added the release-notes label (indication that this issue needs to be mentioned in the release notes) on Feb 22, 2022.
@panjf2000 deleted the use-writev branch on September 15, 2022.
Labels: action:run-benchmark (Triggers the benchmark suite for this Pull Request), release-notes (indication that this issue needs to be mentioned in the release notes)

7 participants