Very low throughput when using L7NetworkPolicies with an external host #6806

Open
antoninbas opened this issue Nov 13, 2024 · 10 comments · May be fixed by #6843
Labels
kind/bug Categorizes issue or PR as related to a bug. reported-by/end-user Issues reported by end users.

@antoninbas
Contributor

Describe the bug
A user reported low throughput when using the following L7 NP:

apiVersion: crd.antrea.io/v1beta1
kind: NetworkPolicy
metadata:
  name: testl7
spec:
  egress:
  - action: Allow
    enableLogging: true
    appliedTo:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 443
    l7Protocols:
    - tls:
        sni: 'ash-speed.hetzner.com'
  priority: 1
  tier: baseline

Running curl https://ash-speed.hetzner.com works fine, but running curl https://ash-speed.hetzner.com/100MB.bin -o /dev/null (essentially a speed test) shows very low throughput, less than 10Kbps.
When removing the policy, the speed is much better (a few tens of Mbps in my case).
The same issue can be observed with non-TLS HTTP traffic (using http://ash-speed.hetzner.com as the host).

When capturing the traffic on antrea-gw0, I observed some large packets (larger than the 1500 MTU) with an incorrect checksum:

root@kind-worker:/# tcpdump -vvvv -n -i antrea-gw0 port 443
tcpdump: listening on antrea-gw0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
00:05:56.295410 IP (tos 0x0, ttl 64, id 32016, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.1.2.54872 > 5.161.7.195.443: Flags [S], cksum 0x1f53 (correct), seq 767025568, win 64860, options [mss 1410,sackOK,TS val 3718729215 ecr 0,nop,wscale 7], length 0
00:05:56.380776 IP (tos 0x0, ttl 62, id 63701, offset 0, flags [none], proto TCP (6), length 60)
    5.161.7.195.443 > 10.10.1.2.54872: Flags [S.], cksum 0x41f6 (correct), seq 3817098490, ack 767025569, win 29184, options [mss 1460,nop,nop,TS val 4288494426 ecr 3718729215,nop,wscale 7], length 0
00:05:56.384369 IP (tos 0x0, ttl 64, id 32017, offset 0, flags [DF], proto TCP (6), length 52)
    10.10.1.2.54872 > 5.161.7.195.443: Flags [.], cksum 0xdd6a (correct), seq 1, ack 1, win 507, options [nop,nop,TS val 3718729307 ecr 4288494426], length 0
00:05:56.389085 IP (tos 0x0, ttl 64, id 32018, offset 0, flags [DF], proto TCP (6), length 569)
    10.10.1.2.54872 > 5.161.7.195.443: Flags [P.], cksum 0xacb3 (correct), seq 1:518, ack 1, win 507, options [nop,nop,TS val 3718729312 ecr 4288494426], length 517
00:05:56.390122 IP (tos 0x0, ttl 62, id 63702, offset 0, flags [none], proto TCP (6), length 52)
    5.161.7.195.443 > 10.10.1.2.54872: Flags [.], cksum 0xdc73 (correct), seq 1, ack 518, win 223, options [nop,nop,TS val 4288494435 ecr 3718729312], length 0
00:05:56.470800 IP (tos 0x0, ttl 62, id 63703, offset 0, flags [none], proto TCP (6), length 2979)
    5.161.7.195.443 > 10.10.1.2.54872: Flags [P.], cksum 0x2405 (incorrect -> 0xa8f2), seq 1:2928, ack 518, win 4096, options [nop,nop,TS val 4288494515 ecr 3718729312], length 2927
00:05:56.672110 IP (tos 0x0, ttl 62, id 63706, offset 0, flags [none], proto TCP (6), length 1450)
    5.161.7.195.443 > 10.10.1.2.54872: Flags [.], cksum 0xbec2 (correct), seq 1:1399, ack 518, win 4096, options [nop,nop,TS val 4288494715 ecr 3718729312], length 1398
00:05:56.672803 IP (tos 0x0, ttl 64, id 32019, offset 0, flags [DF], proto TCP (6), length 52)
    10.10.1.2.54872 > 5.161.7.195.443: Flags [.], cksum 0xd3b2 (correct), seq 518, ack 1399, win 502, options [nop,nop,TS val 3718729596 ecr 4288494715], length 0
00:05:56.673378 IP (tos 0x0, ttl 62, id 63707, offset 0, flags [none], proto TCP (6), length 1581)
    5.161.7.195.443 > 10.10.1.2.54872: Flags [P.], cksum 0x1e8f (incorrect -> 0xaf0d), seq 1399:2928, ack 518, win 4096, options [nop,nop,TS val 4288494718 ecr 3718729596], length 1529
00:05:56.873772 IP (tos 0x0, ttl 62, id 63709, offset 0, flags [none], proto TCP (6), length 1450)

I was able to resolve the issue by disabling transmit checksum offload on antrea-gw0 (ethtool -K antrea-gw0 tx-checksumming off).
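
The two symptoms in the capture above (IP length above the MTU, and a TCP checksum reported as incorrect) can be picked out mechanically. The following is only a sketch: the function name find_suspect_packets and the regular expressions are my own, written against the tcpdump -vvvv output format shown in this issue, and the sample text is copied from three packets of that capture.

```python
import re

# Patterns written against the two-line records printed by `tcpdump -vvvv -n`
# in this issue; they are assumptions, not an official tcpdump grammar.
HEADER_RE = re.compile(r"proto TCP \(6\), length (\d+)\)")
CKSUM_RE = re.compile(r"cksum (0x[0-9a-f]+) \((correct|incorrect -> (0x[0-9a-f]+))\)")


def find_suspect_packets(capture: str, mtu: int = 1500):
    """Return (ip_length, bad_checksum, oversized) for each packet that has
    an incorrect TCP checksum or an IP length above the MTU."""
    suspects = []
    ip_length = None
    for line in capture.splitlines():
        header = HEADER_RE.search(line)
        if header:
            # First line of a record: remember the IP total length.
            ip_length = int(header.group(1))
            continue
        cksum = CKSUM_RE.search(line)
        if cksum and ip_length is not None:
            bad = cksum.group(2).startswith("incorrect")
            oversized = ip_length > mtu
            if bad or oversized:
                suspects.append((ip_length, bad, oversized))
            ip_length = None
    return suspects


SAMPLE = """\
00:05:56.390122 IP (tos 0x0, ttl 62, id 63702, offset 0, flags [none], proto TCP (6), length 52)
    5.161.7.195.443 > 10.10.1.2.54872: Flags [.], cksum 0xdc73 (correct), seq 1, ack 518, win 223, options [nop,nop,TS val 4288494435 ecr 3718729312], length 0
00:05:56.470800 IP (tos 0x0, ttl 62, id 63703, offset 0, flags [none], proto TCP (6), length 2979)
    5.161.7.195.443 > 10.10.1.2.54872: Flags [P.], cksum 0x2405 (incorrect -> 0xa8f2), seq 1:2928, ack 518, win 4096, options [nop,nop,TS val 4288494515 ecr 3718729312], length 2927
00:05:56.672110 IP (tos 0x0, ttl 62, id 63706, offset 0, flags [none], proto TCP (6), length 1450)
    5.161.7.195.443 > 10.10.1.2.54872: Flags [.], cksum 0xbec2 (correct), seq 1:1399, ack 518, win 4096, options [nop,nop,TS val 4288494715 ecr 3718729312], length 1398
"""

suspects = find_suspect_packets(SAMPLE)
for ip_length, bad, oversized in suspects:
    print(f"length={ip_length} bad_checksum={bad} oversized={oversized}")
```

In this sample only the 2979-byte packet is flagged: it is both larger than the 1500 MTU and carries an incorrect checksum, which matches what I saw by eye.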

Versions:
Antrea v2.2.0

Additional information

The packets captured on eth0 on the receive path are also larger than the MTU, which surprises me, as GRO is disabled on eth0. But maybe I am misunderstanding something and GRO doesn't apply to this traffic, which is forwarded by the Linux kernel from eth0 to antrea-gw0.

@antoninbas
Contributor Author

@tnqn @hongliangl it seems that this PR was aimed at solving this issue: #3957. Any idea why it was abandoned?

On a side note, we should also find a way to catch this condition with our L7NP e2e tests.

@hongliangl
Contributor

hongliangl commented Nov 13, 2024

@antoninbas IIRC, we never settled on how to reconcile the default TX checksum offload state of antrea-gw0 with the disableTXChecksumOffload option. There are four cases:

  • Offload is on by default and disableTXChecksumOffload is false (the default): do nothing.
  • Offload is on by default and disableTXChecksumOffload is true: turn checksum offload off.
  • Offload is off by default and disableTXChecksumOffload is false (the default): turn checksum offload on (I am not sure if this is a good idea).
  • Offload is off by default and disableTXChecksumOffload is true: do nothing.
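
The four cases in the list above can be written as a small decision function. This is only an illustration of the list, not Antrea code; the function name gateway_checksum_action and the returned strings are hypothetical.

```python
def gateway_checksum_action(offload_on_by_default: bool,
                            disable_tx_checksum_offload: bool) -> str:
    """Map (default offload state of antrea-gw0, config option) to an action.

    Hypothetical helper that mirrors the four cases listed above.
    """
    if offload_on_by_default and disable_tx_checksum_offload:
        return "set off"
    if not offload_on_by_default and not disable_tx_checksum_offload:
        # The case flagged as questionable in the list above.
        return "set on"
    # The current state already matches what the option asks for.
    return "do nothing"


for default_on in (True, False):
    for option in (False, True):
        action = gateway_checksum_action(default_on, option)
        print(f"default_on={default_on} option={option} -> {action}")
```

Written this way, the open question is visible in a single branch: whether the agent should ever turn offload back on when the option is left at false.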

cc @tnqn

@jsalatiel

jsalatiel commented Nov 13, 2024

@antoninbas Interesting that for you the download is just too slow. For me it fails entirely:

root@client-dd5c7d849-k9d4c:/# curl https://ash-speed.hetzner.com/100MB.bin -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:31 --:--:--     0
curl: (52) Empty reply from server

Even the workaround ethtool -K antrea-gw0 tx-checksumming off has no effect (I recreated the Pod just in case, and I still cannot download).

I can also see incorrect checksums in tcpdump:

08:53:02.473850 IP (tos 0x0, ttl 64, id 48262, offset 0, flags [DF], proto TCP (6), length 60)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [S], cksum 0xb7d6 (correct), seq 283223420, win 64860, options [mss 1410,sackOK,TS val 1326860703 ecr 0,nop,wscale 7], length 0
08:53:02.558762 IP (tos 0x0, ttl 49, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    5.161.7.195.443 > 10.238.66.40.59318: Flags [S.], cksum 0xba1a (correct), seq 1005820690, ack 283223421, win 65160, options [mss 1452,sackOK,TS val 2055711425 ecr 1326860703,nop,wscale 13], length 0
08:53:02.559098 IP (tos 0x0, ttl 64, id 48263, offset 0, flags [DF], proto TCP (6), length 52)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [.], cksum 0x5aa0 (incorrect -> 0xe51d), seq 1, ack 1, win 507, options [nop,nop,TS val 1326860788 ecr 2055711425], length 0
08:53:02.564205 IP (tos 0x0, ttl 64, id 48264, offset 0, flags [DF], proto TCP (6), length 569)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [P.], cksum 0x5ca5 (incorrect -> 0xb5e3), seq 1:518, ack 1, win 507, options [nop,nop,TS val 1326860793 ecr 2055711425], length 517
08:53:02.851406 IP (tos 0x0, ttl 64, id 48265, offset 0, flags [DF], proto TCP (6), length 569)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [P.], cksum 0x5ca5 (incorrect -> 0xb4c3), seq 1:518, ack 1, win 507, options [nop,nop,TS val 1326861081 ecr 2055711425], length 517
08:53:03.139462 IP (tos 0x0, ttl 64, id 48266, offset 0, flags [DF], proto TCP (6), length 569)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [P.], cksum 0x5ca5 (incorrect -> 0xb3a3), seq 1:518, ack 1, win 507, options [nop,nop,TS val 1326861369 ecr 2055711425], length 517
08:53:03.570682 IP (tos 0x0, ttl 49, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    5.161.7.195.443 > 10.238.66.40.59318: Flags [S.], cksum 0xb626 (correct), seq 1005820690, ack 283223421, win 65160, options [mss 1452,sackOK,TS val 2055712437 ecr 1326860703,nop,wscale 13], length 0
08:53:03.570855 IP (tos 0x0, ttl 64, id 48267, offset 0, flags [DF], proto TCP (6), length 52)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [.], cksum 0x5aa0 (incorrect -> 0xdf24), seq 518, ack 1, win 507, options [nop,nop,TS val 1326861800 ecr 2055711425], length 0
08:53:03.763402 IP (tos 0x0, ttl 64, id 48268, offset 0, flags [DF], proto TCP (6), length 569)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [P.], cksum 0x5ca5 (incorrect -> 0xb133), seq 1:518, ack 1, win 507, options [nop,nop,TS val 1326861993 ecr 2055711425], length 517
08:53:04.915375 IP (tos 0x0, ttl 64, id 48269, offset 0, flags [DF], proto TCP (6), length 569)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [P.], cksum 0x5ca5 (incorrect -> 0xacb3), seq 1:518, ack 1, win 507, options [nop,nop,TS val 1326863145 ecr 2055711425], length 517
08:53:05.590637 IP (tos 0x0, ttl 49, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    5.161.7.195.443 > 10.238.66.40.59318: Flags [S.], cksum 0xae42 (correct), seq 1005820690, ack 283223421, win 65160, options [mss 1452,sackOK,TS val 2055714457 ecr 1326860703,nop,wscale 13], length 0
08:53:05.590842 IP (tos 0x0, ttl 64, id 48270, offset 0, flags [DF], proto TCP (6), length 52)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [.], cksum 0x5aa0 (incorrect -> 0xd740), seq 518, ack 1, win 507, options [nop,nop,TS val 1326863820 ecr 2055711425], length 0

@antoninbas
Contributor Author

@jsalatiel
This is not expected:

08:53:02.559098 IP (tos 0x0, ttl 64, id 48263, offset 0, flags [DF], proto TCP (6), length 52)
    10.238.66.40.59318 > 5.161.7.195.443: Flags [.], cksum 0x5aa0 (incorrect -> 0xe51d), seq 1, ack 1, win 507, options [nop,nop,TS val 1326860788 ecr 2055711425], length 0

This is traffic from the Pod to the external server. If disableTXChecksumOffload is set to true in the Antrea configuration, this should not happen. The kernel will compute the checksum when transmitting the packet on the eth0 interface of the Pod (no offload), and when you capture it on antrea-gw0, the checksum should still be correct.
This is why the request is failing for you (instead of just very low throughput). The first data packet (HTTP request) is never received by the server, and you can see that the server keeps trying to re-transmit the SYN-ACK packet (which has a correct checksum):

08:53:05.590637 IP (tos 0x0, ttl 49, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    5.161.7.195.443 > 10.238.66.40.59318: Flags [S.], cksum 0xae42 (correct), seq 1005820690, ack 283223421, win 65160, options [mss 1452,sackOK,TS val 2055714457 ecr 1326860703,nop,wscale 13], length 0

The only explanation I can think of (assuming disableTXChecksumOffload is set to true) is that the client Pod (with IP 10.238.66.40) was created before setting disableTXChecksumOffload to true in the Antrea config, and not recreated thereafter. Setting disableTXChecksumOffload to true and then restarting Antrea Agents will not affect existing workload Pods. This is consistent with how we handle MTU settings as well. Such datapath changes require a restart of existing workload (non-hostNetwork) Pods.

@antoninbas
Contributor Author

@hongliangl I don't have a great idea. Maybe we could change disableTXChecksumOffload to a pointer (*bool), which should not be a breaking change as we are just making the field optional. By default the field would be nil / omitted, which gives us 3 possible values:

  • nil: do not apply any setting
  • false: try to force checksum offloading on
  • true: try to force checksum offloading off
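
To make the three-valued semantics concrete, here is a sketch in Python, where Optional[bool] plays the role of the Go *bool and None stands for nil. The function name desired_tx_checksum_offload is hypothetical and this is not how the Antrea Agent is implemented.

```python
from typing import Optional


def desired_tx_checksum_offload(
    disable_tx_checksum_offload: Optional[bool],
) -> Optional[bool]:
    """Return the offload state to enforce on antrea-gw0.

    True means force offload on, False means force offload off, and None
    means leave the interface untouched. Hypothetical helper illustrating
    the proposal above.
    """
    if disable_tx_checksum_offload is None:
        # Option omitted from the config: do not apply any setting.
        return None
    return not disable_tx_checksum_offload


# A config that omits the key behaves differently from one that sets it to false.
omitted = desired_tx_checksum_offload({}.get("disableTXChecksumOffload"))
explicit_false = desired_tx_checksum_offload(False)
explicit_true = desired_tx_checksum_offload(True)
print(omitted, explicit_false, explicit_true)
```

The point of the proposal is the difference between the first two results: an omitted field leaves the system alone, while an explicit false actively turns offload on.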

We do need to address this however, as the L7NP feature is currently broken because of this issue.

@jsalatiel

You are correct. Restarting the client Pod (with IP 10.238.66.40) fixed the problem, after setting ethtool -K antrea-gw0 tx-checksumming off.

@jsalatiel

Until this is fixed, the following systemd unit may help as a workaround:

[Unit]
BindsTo=sys-subsystem-net-devices-antrea\x2dgw0.device
After=sys-subsystem-net-devices-antrea\x2dgw0.device

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K antrea-gw0 tx-checksumming off
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

@hongliangl
Contributor

I investigated the root cause of the low throughput and identified that Suricata is unable to send oversized packets back to OVS through antrea-l7-tap1.

For a connection between a Pod and an external network governed by an L7 NetworkPolicy, reply packets traverse the following network adapters:

| Name | Type | TX Checksum Offload | TSO | GSO |
| --- | --- | --- | --- | --- |
| ens224 | Physical network adapter on Node | Enabled | Enabled | Enabled |
| antrea-gw0 | OVS internal on Node | Enabled | Enabled | Enabled |
| antrea-l7-tap0 | OVS internal connected to Suricata | Enabled | Disabled (set by Suricata) | Disabled (set by Suricata) |
| antrea-l7-tap1 | OVS internal connected to Suricata | Enabled | Disabled (set by Suricata) | Disabled (set by Suricata) |
| eth0 | Veth pair in Pod | Disabled (set by Antrea) | Disabled (set by Antrea) | Enabled (set by Antrea) |

Packet Flow Analysis:

  1. ens224:
    Oversized reply packets (larger than the MTU) are successfully received and forwarded to antrea-gw0 without fragmentation because antrea-gw0 supports TSO.
    Reference: 1-ens224.pcap in l7np-antrea-gw0-tx-checksum-on.tar.gz

  2. antrea-gw0:
    Similar to ens224, oversized packets are forwarded without fragmentation.
    Reference: 2-antrea-gw0.pcap in l7np-antrea-gw0-tx-checksum-on.tar.gz

  3. antrea-l7-tap0:
    Oversized packets are successfully received by Suricata for processing.
    Reference: 3-antrea-l7-tap0.pcap in l7np-antrea-gw0-tx-checksum-on.tar.gz

  4. antrea-l7-tap1:
    When Suricata attempts to send permitted packets back to OVS through antrea-l7-tap1, the transmission fails with the following error:

    [350 - W#01-antr..tap0] 2024-11-29 10:06:05 Warning: af-packet: antrea-l7-tap1: sending packet failed on socket 17: Message too long  
    

    This happens because both TSO and GSO are disabled on antrea-l7-tap1.
    Reference: 4-antrea-l7-tap1.pcap in l7np-antrea-gw0-tx-checksum-on.tar.gz

  5. eth0:
    The oversized packets never reach this adapter. Instead, TCP retransmissions are triggered until the remote server resends normal-sized packets. This is the primary cause of the low throughput.
    Reference: 5-eth0.pcap in l7np-antrea-gw0-tx-checksum-on.tar.gz
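
The failure at step 4 and the recovery at step 5 can be modeled with simple arithmetic. The sketch below is not Suricata or kernel code: send_without_gso and split_payload are hypothetical names, and the numbers (a 2979-byte packet with a 2927-byte payload, later resent in 1398-byte segments, against the 1500 MTU) are taken from the tcpdump capture in the issue description.

```python
import errno


def send_without_gso(ip_length: int, mtu: int) -> int:
    """Model a send on an interface with TSO and GSO disabled.

    Returns 0 on success, or errno.EMSGSIZE (reported as "Message too long"
    in the Suricata warning above) when the packet exceeds the MTU,
    because nothing is left to segment it.
    """
    return errno.EMSGSIZE if ip_length > mtu else 0


def split_payload(payload_len: int, max_segment: int) -> list:
    """Split a TCP payload into segments of at most max_segment bytes."""
    sizes = []
    remaining = payload_len
    while remaining > 0:
        size = min(remaining, max_segment)
        sizes.append(size)
        remaining -= size
    return sizes


MTU = 1500
oversized_result = send_without_gso(2979, MTU)  # aggregated packet from the capture
normal_result = send_without_gso(1450, MTU)     # retransmitted normal-sized packet
segments = split_payload(2927, 1398)
print(oversized_result == errno.EMSGSIZE, normal_result, segments)
```

Each oversized packet therefore costs at least one retransmission timeout before the payload arrives in normal-sized segments, which is consistent with the gaps of roughly 200 ms between packets in the capture.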

Solution:

To resolve the issue, the following changes must be applied to antrea-gw0:

  1. Disable TSO:
    This ensures that oversized packets are split into fragments that fit within the MTU before being forwarded to antrea-l7-tap1.
  2. Disable TX Checksum Offload:
    After processing by Suricata, checksum offload metadata is lost. If TX checksum offloading remains enabled, the Pod will reject packets due to checksum errors. See capture 6-eth0.pcap in l7np-antrea-gw0-tx-checksum-on-tso-off.tar.gz.

After disabling TX checksum offload on antrea-gw0, throughput is restored. See files in l7np-antrea-gw0-tx-checksum-off.tar.gz.
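
To check whether a Node already has the settings described above, the output of ethtool -k antrea-gw0 can be parsed. The sketch below assumes the usual "feature-name: on|off" line format of ethtool -k. The sample text is hand-written for illustration, not captured from a real Node, and the helper names are hypothetical.

```python
def parse_offload_features(ethtool_output: str) -> dict:
    """Parse `ethtool -k <dev>` style output into {feature_name: enabled}."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep or not value.strip():
            # Skip the "Features for <dev>:" header and blank lines.
            continue
        state = value.split()[0]  # tolerates suffixes such as "[fixed]"
        if state in ("on", "off"):
            features[name] = state == "on"
    return features


def gateway_ready_for_l7np(features: dict) -> bool:
    """True if both TX checksum offload and TSO are disabled."""
    return (not features.get("tx-checksumming", True)
            and not features.get("tcp-segmentation-offload", True))


# Illustrative sample only (assumed values, not a real capture).
SAMPLE_OUTPUT = """\
Features for antrea-gw0:
tx-checksumming: off
tcp-segmentation-offload: off
generic-segmentation-offload: on
"""

features = parse_offload_features(SAMPLE_OUTPUT)
print(features, gateway_ready_for_l7np(features))
```

A missing feature is treated as enabled, so the check fails closed if the output does not contain the expected lines.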

l7np-antrea-gw0-tx-checksum-on.tar.gz

l7np-antrea-gw0-tx-checksum-on-tso-off.tar.gz

l7np-antrea-gw0-tx-checksum-off.tar.gz

@hongliangl
Contributor

@antoninbas, I discussed the *bool proposal with @tnqn, and we agreed it’s better to keep the current bool type. This ensures compatibility with released versions while allowing continued use of the existing option. Proper documentation is essential to clearly inform users about how the option works (if we disable TX checksum offload with this option), what they need to be aware of, and how to restore the default behavior if necessary.

@tnqn
Member

tnqn commented Dec 5, 2024

@antoninbas the problem with the *bool proposal is that the option is already set to false in the config file of existing releases, where it currently means doing nothing. Redefining false to mean "force checksum offloading on" would change the system setting for an upgraded cluster.
