Poor performance on Debian #25
Comments
That's correct; most probably you don't need to disable these explicitly, you can do that through ethtool. Overall, your issue is very strange. From the driver's perspective I can't think of anything that could lead to a 10G link being slower than a 1G link. First, you may try using the out-of-box driver from this repo; maybe some fixes are missing in your kernel tree. I suspect this is something about packet corruption or retransmits. You may try using iperf TCP to measure the line rate; it'll show you if retransmits are happening. You may also inspect "netstat -s" diffs to check for some suspects. |
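For reference, a rough sketch of the checks suggested above (the host name and file are placeholders, not taken from this thread):

```sh
# TCP line-rate check; the "Retr" column in iperf3 output counts retransmits
iperf3 -c <server-ip> -t 30

# Snapshot protocol counters, run a transfer, then diff to spot suspects
netstat -s > /tmp/netstat.before
scp bigfile user@<server-ip>:/tmp/
netstat -s > /tmp/netstat.after
diff /tmp/netstat.before /tmp/netstat.after
```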
Many thanks for the helpful reply, very interesting. I did some checks and found no evidence of retransmits or retries. I also swapped the cables between the good 1G card and the slow 10G card, and the performance on both remained the same. I tried de-rating the 10G switch port to 1G and the Aquantia card renegotiated to 1G, but it still would only get about 30 MByte/sec throughput.
So now it gets stranger. I was trying various things, and at some point the problem disappeared. The thing I did that seemed to fix it was to enable GRO on the Aquantia card. I read elsewhere that GRO is safe for routing, but LRO is not. It's odd though, because disabling GRO (i.e. reverting to how it was before) does not make the problem come back. I'm now getting about 300 MByte/s throughput on ssh, which is more like what I had in mind! And I see no significant ksoftirqd load while it's doing that.
So I'm wondering now what caused it to go into the slow / high-ksoftirqd state. It was in that state for several weeks at least, including one system reboot, so it doesn't seem to be a completely freak occurrence. But also I currently can't reproduce it. I'll report back if I find anything further on this. I'm happy either way if you want to close this out or leave it open for a short while in case I can figure out what causes this. Thanks again for the help. |
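A minimal sketch of the GRO toggle mentioned above (the interface name is a placeholder):

```sh
# Check the current offload state on the Aquantia interface
ethtool -k <iface> | grep -E 'generic-receive-offload|large-receive-offload'

# Enable GRO, which is what appeared to clear the slow state here
ethtool -K <iface> gro on
```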
Let's keep it, I'm really interested in finding the reason for that deadly perf drop. It could be related to LRO/LSO settings in the driver. That's a tricky offload which in theory may corrupt the TCP stream somewhere. |
Aha, I also noticed a performance drop on my AQC107 (ASUS XG-C100C) adapter, using the 2.4.10 driver on my Ubuntu 18.04 PC. Rolling back to driver version 2.4.3 (from the Marvell website) restored the performance. I had downloaded the current master from this repository, made the "usual" change in the /* LRO */ section of aq_cfg.h (which I had done in 2.4.3 as well), built and installed... and the performance was noticeably worse. I didn't bother to investigate further, thinking that's just a "glitch" in the developer preview and the next update would fix it. But as there has been no update since, it's probably worth investigating after all... |
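For context, a sketch of the kind of compile-time change and rebuild meant by "the usual change"; the AQ_CFG_IS_LRO_DEF symbol and the make targets are my assumptions, so check aq_cfg.h and the repo README rather than taking this verbatim:

```sh
# Disable LRO at compile time in the driver config header (symbol name assumed)
sed -i 's/AQ_CFG_IS_LRO_DEF.*1/AQ_CFG_IS_LRO_DEF 0/' aq_cfg.h

# Rebuild and reload the out-of-tree module (typical targets; adjust to the README)
make -j"$(nproc)"
sudo make install
sudo modprobe -r atlantic && sudo modprobe atlantic
```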
Hi Robert, could you provide some numbers? What do you mean by worse? |
Hi cail, trying to come up with numbers I remembered why I did not report this before: The symptoms are somewhat "fuzzy". I reverted from 2.4.10 to 2.4.3 when I experienced glitches during video conferencing calls (using Slack or Microsoft Teams). As these are rather vital to my home office work, I did not have time to investigate or write down any notes. Unfortunately, now my Internet provider has issues, so it is even harder to tell whether any symptoms I am seeing are caused by the driver or my provider :-/ And my provider requires me to use their router while the issue is ongoing, so I cannot use my router with the 2.5G port... For now, I'll try building the 2.4.7 commit and using that. If I don't see any issues with that, I'll try 2.4.10 again and see if I can isolate an issue. Or if I see issues with 2.4.7, I'll go back to 2.4.3 again. Maybe that'll allow me to tell whether there really is an issue, and if it was introduced between 2.4.3 and 2.4.7 or between 2.4.7 and 2.4.10... |
Thanks, please let us know if you find out something. |
Mine's not a Debian box, but this seems related. Yesterday, I had a similar issue. I was transferring a large file between two machines, and I also got perf with scp of around 30 MBytes/s. I did not see any RX/TX errors with ifconfig, nor anything else exceptional in the kernel log. However, I do get these notes:
To fix the slow mode, all I had to do was:
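(The exact commands aren't shown in this excerpt; going by the next paragraph, it was presumably a toggle along these lines, with the interface name as a placeholder:)

```sh
# Toggle rx-vlan-offload off and back on to force the hw/driver re-init
ethtool -K <iface> rx-vlan-offload off
ethtool -K <iface> rx-vlan-offload on
```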
The slow mode only affected receive perf on this machine. Sending was just fine at around 260 MBytes/s, which is the CPU-bound limit for scp between my boxes. I think rx-vlan-offload itself had nothing to do with this; I don't do VLANs at the interface level. I think that all that was needed was whatever reinitialization at the hw/driver level gets done when toggling this. Two more notes:
Some more info below.
lspci reports:
|
Hi Samy, thanks for the observation. Couple of questions/suggestions:
|
The driver is the in-tree kernel driver that comes with the upstream kernel. Loaded as a module. The only out-of-tree driver I have is the nvidia.ko GPU driver.
This was the first time I noticed the slow mode. But I've had the 10G setup only for a month or so.
Sure. |
Hello. Firstly: I was running a full copper solution and I had to ship back all the equipment because of this problem. Secondly: run a speed test with iPerf3 to measure the real speed of your network. Hope this helps. |
Ok, it happened again. This time I was able to make a couple of measurements. Main observations:
Unplugging and re-plugging the cable made the problem go away. Previously, I had also noticed that using ethtool to change settings, or doing ifdown/ifup, makes the problem go away. Measurement data below: IPERF3 server inbound / UDP: (server on this machine)
IPERF3 server inbound / TCP: (server on this machine)
IPERF3 server outbound / UDP (server on another machine):
IPERF3 server outbound / TCP (server on another machine):
Inbound pings: (i.e., pinged from another machine)
$ ping boombox -s 2000
$ ping boombox -s 10000
I also made a Wireshark capture during ping -s 10000 on boombox (the affected machine). Nothing unusual there; I just see that some IP packets are missing. |
Additional note. I also dual-boot this machine occasionally on Windows. I don't recall ever observing the slow mode there, although that wouldn't mean it doesn't happen. I also use the other machine almost daily with the same NIC model and no slow mode ever observed. |
I am experiencing a similar issue. The transfer speed from my Ubuntu machine, which has an Aquantia 10GbE card, to my NAS is good at around EDIT: |
Such low performance may be a sign of packet corruption or extensive packet drops. |
Same issue seen frequently in an overheated server room: many servers with Aquantia network interfaces had a slowdown on the RX side only, from a normal 400 to 800 MiB/s down to around 100 MiB/s and sometimes as low as 32 MiB/s! At first, all affected network adapters were Aquantia/atlantic
until the same defect hit a server with only Intel network interfaces/cards! Same issue, same RX slowdown, but now on a server without the Aquantia card and atlantic driver!?! I've made a shell script that detects the problem by reading a 4 GiB file with dd and the cache-off switch (iflag=direct). The test case creates a file with random values (zero-filled files can be problematic on file systems with sparse-file handling), which is then read by dd to get a measure of the network speed. Since this script runs nightly on my servers (with options -q and -w), a random wait/sleep period is introduced so that multiple servers do not run at exactly the same time, and the test is repeated a number of times before being marked as failed. Running without options just tests for the network RX defect; a condensed sketch is at the end of this comment. Now, the defect began this spring, when the server room started to get very hot, but I have not proven in any way that the RX defect is actually related to this temperature issue. Manually running 'sensors' reports Aquantia temperatures up to 83 deg C, but these have not been monitored continuously.
But it could also be a general kernel defect or a firmware issue in Ubuntu (running Linux 5.15.0-94-generic, Ubuntu 22.04.3 LTS). This defect is the 'reverse' of the comment:
since it resets to 'normal' (fast) speed at boot, and the degradation only sets in after boot. Looked into packet errors and drops but found no issues (ping reports `15 packets transmitted, 15 received, 0% packet loss, time 14334ms rtt min/avg/max/mdev = 0.300/0.342/0.388/0.021 ms'). Tried disabling all offload settings; that did not help either....
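A condensed sketch of the kind of nightly check script described above (the mount path, threshold, and retry handling are assumptions; the real script has more options such as -q and -w):

```sh
#!/bin/bash
# Read a 4 GiB random-data file over the network mount with the page cache
# bypassed (iflag=direct) and flag the host if the RX rate looks degraded.
TESTFILE=/mnt/netshare/rx-test.bin   # assumed path on a network share
THRESHOLD=200                        # assumed MiB/s cutoff for "slow mode"
RETRIES=3

# Create the test file once from random data (zero-filled files can be
# short-circuited by sparse-file handling on some file systems)
[ -f "$TESTFILE" ] || dd if=/dev/urandom of="$TESTFILE" bs=1M count=4096

# Random wait so several servers do not all test at the same moment
sleep $(( RANDOM % 600 ))

for i in $(seq "$RETRIES"); do
    start=$(date +%s.%N)
    dd if="$TESTFILE" of=/dev/null bs=1M iflag=direct status=none
    end=$(date +%s.%N)
    rate=$(echo "4096 / ($end - $start)" | bc -l)
    if [ "$(echo "$rate >= $THRESHOLD" | bc -l)" -eq 1 ]; then
        exit 0   # throughput looks normal
    fi
done
echo "RX throughput below ${THRESHOLD} MiB/s after ${RETRIES} attempts" >&2
exit 1
```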
|
Hello,
I've got a Debian machine with an AQC107 10GigE card and also some Realtek 1GigE boards.
scp of a large file from this machine to a client has throughput of about 100 MByte/sec on the 1GigE Realtek interfaces (i.e. saturating the gigabit) but only about 30 MByte/sec on the AQC107. The link is up at 10GigE.
The machine is acting as a router. LRO and GRO are off (by ethtool) on all interfaces.
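For reference, this is roughly how they're turned off at runtime (a sketch; interface enumeration simplified):

```sh
# Disable LRO and GRO on every non-loopback interface, then show the state
for dev in /sys/class/net/*; do
    dev=$(basename "$dev")
    [ "$dev" = "lo" ] && continue
    ethtool -K "$dev" lro off gro off 2>/dev/null
    ethtool -k "$dev" | grep -E 'large-receive-offload|generic-receive-offload'
done
```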
I do notice that during the scp using the atlantic card, ksoftirqd is using 100% of a CPU core. During the scp using a Realtek card, ksoftirqd uses about 1% of a CPU core.
OK, I'm not expecting scp to necessarily saturate the 10GigE connection, but having it run a factor of 3 slower on the 10GigE card than on a 1GigE card is disappointing to say the least, and the ksoftirqd behavior suggests to me that something about the way interrupts and the driver are configured is to blame.
I've googled around a lot and not found anything helpful. The driver README notes suggest that the driver should be compiled with LRO disabled when used in a router, but they also provide instructions for disabling it with ethtool. Am I correct in understanding that disabling LRO and GRO using ethtool is equivalent, or do I really need to rebuild the driver with it disabled at compile time?
Any other hints or suggestions on getting at least GigE performance out of this would be much appreciated. I'm running the stock Debian kernel and the bundled driver:
Many thanks for any suggestions or pointers,
Paul