Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

high number of unneeded retransmissions in bulk transfer #253

Closed
pabeni opened this issue Jan 13, 2022 · 7 comments
Closed

high number of unneeded retransmissions in bulk transfer #253

pabeni opened this issue Jan 13, 2022 · 7 comments
Assignees
Labels

Comments

@pabeni
Copy link

pabeni commented Jan 13, 2022

comparing MPTCP single subflow bulk transfer performances to plain TCP, a large number of TCP-level retransmissions happen in the MPTCP case:

mptcpize run taskset 0x1 iperf3 -t 60 -i 60 -c 192.168.255.1
Connecting to host 192.168.255.1, port 5201
[  5] local 192.168.255.2 port 48820 connected to 192.168.255.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-60.00  sec   133 GBytes  19.1 Gbits/sec  36519   1.02 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec   133 GBytes  19.1 Gbits/sec  36519             sender
[  5]   0.00-60.00  sec   133 GBytes  19.1 Gbits/sec                  receiver

That is sever orders of magnitude more than plain TCP in the same scenario:

taskset 0x1 iperf3 -t 60 -i 60 -c 192.168.255.1
Connecting to host 192.168.255.1, port 5201
[  5] local 192.168.255.2 port 48824 connected to 192.168.255.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-4.05   sec  10.8 GBytes  22.9 Gbits/sec   68    971 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-4.05   sec  10.8 GBytes  22.9 Gbits/sec   68             sender
[  5]   0.00-4.05   sec  0.00 Bytes  0.00 bits/sec                  receiver

The retransmissions are fast retransmit ones:

taskset 0x1 iperf3 -t 60 -i 60 -c 192.168.255.1 & sleep 1; nstat >/dev/null ; sleep 1 ; nstat Tcp* MPTcp*
[1] 8321
Connecting to host 192.168.255.1, port 5201
[  5] local 192.168.255.2 port 48828 connected to 192.168.255.1 port 5201
#kernel
TcpInSegs                       21415              0.0
TcpOutSegs                      1990105            0.0
TcpExtTCPPureAcks               13718              0.0
TcpExtTCPHPAcks                 7694               0.0
TcpExtTCPBacklogCoalesce        6                  0.0
TcpExtTCPAutoCorking            4589               0.0
TcpExtTCPOrigDataSent           1990487            0.0
TcpExtTCPDelivered              1990487            0.0

perf probes show that the code triggering the fast retransmit is:

        if (tcp_ack_is_dubious(sk, flag)) {
                if (!(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP))) {
                        num_dupack = 1;
                        /* Consider if pure acks were aggregated in tcp_add_backlog() */
                        if (!(flag & FLAG_DATA))
                                num_dupack = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
                }
                tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
                                      &rexmit);

in tcp_ack(), due to 'flag' being 0x4100. That is possibly caused by MPTCP-level ack generated on the other side by mptcp_cleanup_rbuf(), with unmodified TCP-level ack seq, unmodified window size, and increased MPTCP-level ack seq.

From plain TCP PoV, such acks are dups and trigger the fast retransmit logic.

@pabeni pabeni self-assigned this Jan 13, 2022
@pabeni pabeni added the bug label Jan 13, 2022
@vandit86
Copy link

Don't know if my comment is relevant, but I observed some similar behaviour with iperf3 when the bandwidth is not limited ( param -b is not set). Many retransmissions in the end of the connection.
However, with the iperf, the MPTCP works as expected.
kernel 5.15-rc7

@pabeni
Copy link
Author

pabeni commented Jan 14, 2022

That is possibly caused by MPTCP-level ack generated on the other side by mptcp_cleanup_rbuf(), with unmodified TCP-level ack seq, unmodified window size, and increased MPTCP-level ack seq.

From plain TCP PoV, such acks are dups and trigger the fast retransmit logic.

Double checking with pcap traces, the dup acks are have unmodified MPTCP-level ack seq, too. In all the inspected samples they contains a new sack block

@pabeni
Copy link
Author

pabeni commented Jan 17, 2022

sack (and thus retransmission) are caused by actual drop at the NIX RX queue level:

# ip -s link show dev enp37s0f0np0
6: enp37s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:5d:9f:ac brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped missed  mcast   
    102430689615983 67667224255 0       18753856 0       25      

the following workaround is effective in my setup:

ethtool -G enp37s0f0np0 rx 4096

The default RX ring size is 1024 on such NIC. Still the burstiness introduced by MPTCP looks excessive ?!?

@matttbe
Copy link
Member

matttbe commented Jan 17, 2022

Very interesting! Thank you for looking at that!

Why MPTCP subflow would be more bursty than plain TCP ones?

@pabeni
Copy link
Author

pabeni commented Jan 18, 2022

Why MPTCP subflow would be more bursty than plain TCP ones?

I don't have a clear picture. The main differences is that with MPTCP, subflow processing - that is, the plain TCP stack - does not use the socket backlog and always happens in BH processing. Since the host is under heavy load, BH processing happens in ksoftirqd context, and there is some latency between the ksoftirqd scheduling and the moment ksoftirqd actually runs that depends on the process scheduler decisions (and settings).

There is also some dependency on the kernel setting, as switching the 'tuned' profile - with the default rx ring setting - causes drops with plain TCP, too.

@pabeni
Copy link
Author

pabeni commented Jan 21, 2022

as per above comments, this is not really an MPTCP issue - possibly something to be aware of for high speed setup

@pabeni pabeni closed this as completed Jan 21, 2022
@matttbe
Copy link
Member

matttbe commented Jan 21, 2022

Thank you for closing the issue!
I just added a note about that at the end of https://github.com/multipath-tcp/mptcp_net-next/wiki

jenkins-tessares pushed a commit that referenced this issue Sep 9, 2022
This reverts commit ef5b570.

Zhouyi reported that commit is causing crashes when running rcutorture
with KASAN enabled:

  BUG: using smp_processor_id() in preemptible [00000000] code: rcu_torture_rea/100
  caller is rcu_preempt_deferred_qs_irqrestore+0x74/0xed0
  CPU: 4 PID: 100 Comm: rcu_torture_rea Tainted: G        W          5.19.0-rc5-next-20220708-dirty #253
  Call Trace:
    dump_stack_lvl+0xbc/0x108 (unreliable)
    check_preemption_disabled+0x154/0x160
    rcu_preempt_deferred_qs_irqrestore+0x74/0xed0
    __rcu_read_unlock+0x290/0x3b0
    rcu_torture_read_unlock+0x30/0xb0
    rcutorture_one_extend+0x198/0x810
    rcu_torture_one_read+0x58c/0xc90
    rcu_torture_reader+0x12c/0x360
    kthread+0x1e8/0x220
    ret_from_kernel_thread+0x5c/0x64

KASAN will generate instrumentation instructions around the
WRITE_ONCE(local_paca->irq_soft_mask, mask):

   0xc000000000295cb0 <+0>:	addis   r2,r12,774
   0xc000000000295cb4 <+4>:	addi    r2,r2,16464
   0xc000000000295cb8 <+8>:	mflr    r0
   0xc000000000295cbc <+12>:	bl      0xc00000000008bb4c <mcount>
   0xc000000000295cc0 <+16>:	mflr    r0
   0xc000000000295cc4 <+20>:	std     r31,-8(r1)
   0xc000000000295cc8 <+24>:	addi    r3,r13,2354
   0xc000000000295ccc <+28>:	mr      r31,r13
   0xc000000000295cd0 <+32>:	std     r0,16(r1)
   0xc000000000295cd4 <+36>:	stdu    r1,-48(r1)
   0xc000000000295cd8 <+40>:	bl      0xc000000000609b98 <__asan_store1+8>
   0xc000000000295cdc <+44>:	nop
   0xc000000000295ce0 <+48>:	li      r9,1
   0xc000000000295ce4 <+52>:	stb     r9,2354(r31)
   0xc000000000295ce8 <+56>:	addi    r1,r1,48
   0xc000000000295cec <+60>:	ld      r0,16(r1)
   0xc000000000295cf0 <+64>:	ld      r31,-8(r1)
   0xc000000000295cf4 <+68>:	mtlr    r0

If there is a context switch before "stb     r9,2354(r31)", r31 may
not equal to r13, in such case, irq soft mask will not work.

The usual solution of marking the code ineligible for instrumentation
forces the code out-of-line, which we would prefer to avoid. Christophe
proposed a partial revert, but Nick raised some concerns with that. So
for now do a full revert.

Reported-by: Zhouyi Zhou <[email protected]>
[mpe: Construct change log based on Zhouyi's original report]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
jenkins-tessares pushed a commit that referenced this issue May 4, 2023
Toggle deleted anonymous sets as inactive in the next generation, so
users cannot perform any update on it. Clear the generation bitmask
in case the transaction is aborted.

The following KASAN splat shows a set element deletion for a bound
anonymous set that has been already removed in the same transaction.

[   64.921510] ==================================================================
[   64.923123] BUG: KASAN: wild-memory-access in nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.924745] Write of size 8 at addr dead000000000122 by task test/890
[   64.927903] CPU: 3 PID: 890 Comm: test Not tainted 6.3.0+ #253
[   64.931120] Call Trace:
[   64.932699]  <TASK>
[   64.934292]  dump_stack_lvl+0x33/0x50
[   64.935908]  ? nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.937551]  kasan_report+0xda/0x120
[   64.939186]  ? nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.940814]  nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.942452]  ? __kasan_slab_alloc+0x2d/0x60
[   64.944070]  ? nf_tables_setelem_notify+0x190/0x190 [nf_tables]
[   64.945710]  ? kasan_set_track+0x21/0x30
[   64.947323]  nfnetlink_rcv_batch+0x709/0xd90 [nfnetlink]
[   64.948898]  ? nfnetlink_rcv_msg+0x480/0x480 [nfnetlink]

Signed-off-by: Pablo Neira Ayuso <[email protected]>
jenkins-tessares pushed a commit that referenced this issue Aug 11, 2023
Add a detachment test case with miniq present to assert that with and
without the miniq we get the same error.

  # ./test_progs -t tc_opts
  #244     tc_opts_after:OK
  #245     tc_opts_append:OK
  #246     tc_opts_basic:OK
  #247     tc_opts_before:OK
  #248     tc_opts_chain_classic:OK
  #249     tc_opts_delete_empty:OK
  #250     tc_opts_demixed:OK
  #251     tc_opts_detach:OK
  #252     tc_opts_detach_after:OK
  #253     tc_opts_detach_before:OK
  #254     tc_opts_dev_cleanup:OK
  #255     tc_opts_invalid:OK
  #256     tc_opts_mixed:OK
  #257     tc_opts_prepend:OK
  #258     tc_opts_replace:OK
  #259     tc_opts_revision:OK
  Summary: 16/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
matttbe pushed a commit that referenced this issue Aug 17, 2023
Add several new tcx test cases to improve test coverage. This also includes
a few new tests with ingress instead of clsact qdisc, to cover the fix from
commit dc644b5 ("tcx: Fix splat in ingress_destroy upon tcx_entry_free").

  # ./test_progs -t tc
  [...]
  #234     tc_links_after:OK
  #235     tc_links_append:OK
  #236     tc_links_basic:OK
  #237     tc_links_before:OK
  #238     tc_links_chain_classic:OK
  #239     tc_links_chain_mixed:OK
  #240     tc_links_dev_cleanup:OK
  #241     tc_links_dev_mixed:OK
  #242     tc_links_ingress:OK
  #243     tc_links_invalid:OK
  #244     tc_links_prepend:OK
  #245     tc_links_replace:OK
  #246     tc_links_revision:OK
  #247     tc_opts_after:OK
  #248     tc_opts_append:OK
  #249     tc_opts_basic:OK
  #250     tc_opts_before:OK
  #251     tc_opts_chain_classic:OK
  #252     tc_opts_chain_mixed:OK
  #253     tc_opts_delete_empty:OK
  #254     tc_opts_demixed:OK
  #255     tc_opts_detach:OK
  #256     tc_opts_detach_after:OK
  #257     tc_opts_detach_before:OK
  #258     tc_opts_dev_cleanup:OK
  #259     tc_opts_invalid:OK
  #260     tc_opts_mixed:OK
  #261     tc_opts_prepend:OK
  #262     tc_opts_replace:OK
  #263     tc_opts_revision:OK
  [...]
  Summary: 44/38 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/8699efc284b75ccdc51ddf7062fa2370330dc6c0.1692029283.git.daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <[email protected]>
jenkins-tessares pushed a commit that referenced this issue Oct 6, 2023
Add various tests to check maximum number of supported programs
being attached:

  # ./vmtest.sh -- ./test_progs -t tc_opts
  [...]
  ./test_progs -t tc_opts
  [    1.185325] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.186826] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  [    1.270123] tsc: Refined TSC clocksource calibration: 3407.988 MHz
  [    1.272428] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fc932722, max_idle_ns: 440795381586 ns
  [    1.276408] clocksource: Switched to clocksource tsc
  #252     tc_opts_after:OK
  #253     tc_opts_append:OK
  #254     tc_opts_basic:OK
  #255     tc_opts_before:OK
  #256     tc_opts_chain_classic:OK
  #257     tc_opts_chain_mixed:OK
  #258     tc_opts_delete_empty:OK
  #259     tc_opts_demixed:OK
  #260     tc_opts_detach:OK
  #261     tc_opts_detach_after:OK
  #262     tc_opts_detach_before:OK
  #263     tc_opts_dev_cleanup:OK
  #264     tc_opts_invalid:OK
  #265     tc_opts_max:OK              <--- (new test)
  #266     tc_opts_mixed:OK
  #267     tc_opts_prepend:OK
  #268     tc_opts_replace:OK
  #269     tc_opts_revision:OK
  Summary: 18/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
jenkins-tessares pushed a commit that referenced this issue Oct 13, 2023
Add a new test case which performs double query of the bpf_mprog through
libbpf API, but also via raw bpf(2) syscall. This is testing to gather
first the count and then in a subsequent probe the full information with
the program array without clearing passed structs in between.

  # ./vmtest.sh -- ./test_progs -t tc_opts
  [...]
  ./test_progs -t tc_opts
  [    1.398818] tsc: Refined TSC clocksource calibration: 3407.999 MHz
  [    1.400263] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fd336761, max_idle_ns: 440795243819 ns
  [    1.402734] clocksource: Switched to clocksource tsc
  [    1.426639] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.428112] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #252     tc_opts_after:OK
  #253     tc_opts_append:OK
  #254     tc_opts_basic:OK
  #255     tc_opts_before:OK
  #256     tc_opts_chain_classic:OK
  #257     tc_opts_chain_mixed:OK
  #258     tc_opts_delete_empty:OK
  #259     tc_opts_demixed:OK
  #260     tc_opts_detach:OK
  #261     tc_opts_detach_after:OK
  #262     tc_opts_detach_before:OK
  #263     tc_opts_dev_cleanup:OK
  #264     tc_opts_invalid:OK
  #265     tc_opts_max:OK
  #266     tc_opts_mixed:OK
  #267     tc_opts_prepend:OK
  #268     tc_opts_query:OK            <--- (new test)
  #269     tc_opts_replace:OK
  #270     tc_opts_revision:OK
  Summary: 19/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
jenkins-tessares pushed a commit that referenced this issue Oct 13, 2023
Add a new test case to query on an empty bpf_mprog and pass the revision
directly into expected_revision for attachment to assert that this does
succeed.

  ./test_progs -t tc_opts
  [    1.406778] tsc: Refined TSC clocksource calibration: 3407.990 MHz
  [    1.408863] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fcaf6eb0, max_idle_ns: 440795321766 ns
  [    1.412419] clocksource: Switched to clocksource tsc
  [    1.428671] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.430260] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #252     tc_opts_after:OK
  #253     tc_opts_append:OK
  #254     tc_opts_basic:OK
  #255     tc_opts_before:OK
  #256     tc_opts_chain_classic:OK
  #257     tc_opts_chain_mixed:OK
  #258     tc_opts_delete_empty:OK
  #259     tc_opts_demixed:OK
  #260     tc_opts_detach:OK
  #261     tc_opts_detach_after:OK
  #262     tc_opts_detach_before:OK
  #263     tc_opts_dev_cleanup:OK
  #264     tc_opts_invalid:OK
  #265     tc_opts_max:OK
  #266     tc_opts_mixed:OK
  #267     tc_opts_prepend:OK
  #268     tc_opts_query:OK
  #269     tc_opts_query_attach:OK     <--- (new test)
  #270     tc_opts_replace:OK
  #271     tc_opts_revision:OK
  Summary: 20/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
matttbe pushed a commit that referenced this issue Oct 27, 2023
Add several new test cases which assert corner cases on the mprog query
mechanism, for example, around passing in a too small or a larger array
than the current count.

  ./test_progs -t tc_opts
  #252     tc_opts_after:OK
  #253     tc_opts_append:OK
  #254     tc_opts_basic:OK
  #255     tc_opts_before:OK
  #256     tc_opts_chain_classic:OK
  #257     tc_opts_chain_mixed:OK
  #258     tc_opts_delete_empty:OK
  #259     tc_opts_demixed:OK
  #260     tc_opts_detach:OK
  #261     tc_opts_detach_after:OK
  #262     tc_opts_detach_before:OK
  #263     tc_opts_dev_cleanup:OK
  #264     tc_opts_invalid:OK
  #265     tc_opts_max:OK
  #266     tc_opts_mixed:OK
  #267     tc_opts_prepend:OK
  #268     tc_opts_query:OK
  #269     tc_opts_query_attach:OK
  #270     tc_opts_replace:OK
  #271     tc_opts_revision:OK
  Summary: 20/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Reviewed-by: Alan Maguire <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants