Strange Performance on Raspberry Pi 4 #5
Comments
Update: After flashing my RPi 4 directly with OpenWRT 23.05.2 (Linux v5.15.137, compiled by OpenWRT), I got 1.01 Gbits/sec!
| Raspberry Pi 4 / BCM2711* | OpenWRT 23.05.2 / 5.15.137 | 1.01 Gbits/sec |
One interesting finding: Use
Connecting to host 169.254.200.2, port 5201
[ 5] local 169.254.200.1 port 47296 connected to 169.254.200.2 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 78.2 MBytes 656 Mbits/sec 0 402 KBytes
[ 5] 1.00-2.00 sec 80.2 MBytes 672 Mbits/sec 0 441 KBytes
[ 5] 2.00-3.00 sec 79.6 MBytes 668 Mbits/sec 0 441 KBytes
[ 5] 3.00-4.00 sec 80.3 MBytes 674 Mbits/sec 0 441 KBytes
[ 5] 4.00-5.00 sec 80.8 MBytes 678 Mbits/sec 0 441 KBytes
[ 5] 5.00-6.00 sec 81.0 MBytes 679 Mbits/sec 0 441 KBytes
[ 5] 6.00-7.00 sec 79.5 MBytes 667 Mbits/sec 0 441 KBytes
[ 5] 7.00-8.00 sec 80.1 MBytes 672 Mbits/sec 0 441 KBytes
[ 5] 8.00-9.00 sec 80.1 MBytes 672 Mbits/sec 0 441 KBytes
[ 5] 9.00-10.00 sec 79.7 MBytes 668 Mbits/sec 0 441 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 799 MBytes 671 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 798 MBytes 669 Mbits/sec receiver
iperf Done.
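As a sanity check on the summary lines above, the bitrate can be recomputed from the transfer size and the interval. This is a minimal sketch, assuming iperf3's convention that `MBytes`/`GBytes` are binary (powers of 1024) while `Mbits/sec` is decimal (10^6 bits); the numbers in the output are consistent with that. The function name `bitrate_mbps` is hypothetical, not part of iperf3.

```python
import re

# Assumption: iperf3 reports transfer sizes in binary units.
UNIT_BYTES = {"KBytes": 1024, "MBytes": 1024**2, "GBytes": 1024**3}

def bitrate_mbps(line):
    """Recompute the bitrate (decimal Mbits/sec) from one iperf3 result line."""
    m = re.search(r"(\d+\.\d+)-(\d+\.\d+)\s+sec\s+([\d.]+)\s+([KMG]Bytes)", line)
    if m is None:
        raise ValueError("not an iperf3 interval line: %r" % line)
    start, end = float(m.group(1)), float(m.group(2))
    total_bits = float(m.group(3)) * UNIT_BYTES[m.group(4)] * 8
    return total_bits / (end - start) / 1e6

sender = "[ 5] 0.00-10.00 sec 799 MBytes 671 Mbits/sec 0 sender"
print(round(bitrate_mbps(sender), 1))  # about 670 Mbits/sec
```

The small gap to the reported 671 Mbits/sec comes from the transfer size itself being rounded to 799 MBytes in the printed output.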
Another interesting finding: Turn off
Connecting to host 169.254.200.2, port 5201
[ 5] local 169.254.200.1 port 37182 connected to 169.254.200.2 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 135 MBytes 1.13 Gbits/sec 0 818 KBytes
[ 5] 1.00-2.00 sec 130 MBytes 1.09 Gbits/sec 0 860 KBytes
[ 5] 2.00-3.00 sec 126 MBytes 1.05 Gbits/sec 0 975 KBytes
[ 5] 3.00-4.00 sec 130 MBytes 1.09 Gbits/sec 0 1022 KBytes
[ 5] 4.00-5.00 sec 130 MBytes 1.09 Gbits/sec 0 1.07 MBytes
[ 5] 5.00-6.00 sec 132 MBytes 1.11 Gbits/sec 0 1.07 MBytes
[ 5] 6.00-7.00 sec 132 MBytes 1.11 Gbits/sec 0 1.14 MBytes
[ 5] 7.00-8.00 sec 132 MBytes 1.11 Gbits/sec 0 1.26 MBytes
[ 5] 8.00-9.00 sec 129 MBytes 1.08 Gbits/sec 0 1.26 MBytes
[ 5] 9.00-10.01 sec 130 MBytes 1.08 Gbits/sec 0 1.48 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.01 sec 1.28 GBytes 1.09 Gbits/sec 0 sender
[ 5] 0.00-10.01 sec 1.27 GBytes 1.09 Gbits/sec receiver
iperf Done.
However, if we turn off
104d103
< # CONFIG_BPF_LSM is not set
139d137
< CONFIG_TASKS_RUDE_RCU=y
260d257
< CONFIG_TRACEPOINTS=y
1603d1599
< # CONFIG_BATMAN_ADV_TRACING is not set
1637d1632
< # CONFIG_NET_DROP_MONITOR is not set
2965d2959
< # CONFIG_ATH6KL_TRACING is not set
8041d8034
< # CONFIG_PSTORE_FTRACE is not set
8492d8484
< # CONFIG_TRACE_MMIO_ACCESS is not set
8726d8717
< # CONFIG_DEBUG_PAGE_REF is not set
8803,8804d8793
< CONFIG_TRACE_IRQFLAGS=y
< CONFIG_TRACE_IRQFLAGS_NMI=y
8837d8825
< CONFIG_NOP_TRACER=y
8845d8832
< CONFIG_TRACER_MAX_TRACE=y
8847,8853d8833
< CONFIG_RING_BUFFER=y
< CONFIG_EVENT_TRACING=y
< CONFIG_CONTEXT_SWITCH_TRACER=y
< CONFIG_RING_BUFFER_ALLOW_SWAP=y
< CONFIG_PREEMPTIRQ_TRACEPOINTS=y
< CONFIG_TRACING=y
< CONFIG_GENERIC_TRACER=y
8855,8895c8835
< CONFIG_FTRACE=y
< # CONFIG_BOOTTIME_TRACING is not set
< CONFIG_FUNCTION_TRACER=y
< CONFIG_FUNCTION_GRAPH_TRACER=y
< CONFIG_DYNAMIC_FTRACE=y
< CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
< CONFIG_FUNCTION_PROFILER=y
< CONFIG_STACK_TRACER=y
< CONFIG_IRQSOFF_TRACER=y
< CONFIG_SCHED_TRACER=y
< # CONFIG_HWLAT_TRACER is not set
< # CONFIG_OSNOISE_TRACER is not set
< # CONFIG_TIMERLAT_TRACER is not set
< # CONFIG_FTRACE_SYSCALLS is not set
< CONFIG_TRACER_SNAPSHOT=y
< CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP=y
< CONFIG_BRANCH_PROFILE_NONE=y
< # CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
< # CONFIG_PROFILE_ALL_BRANCHES is not set
< CONFIG_BLK_DEV_IO_TRACE=y
< CONFIG_KPROBE_EVENTS=y
< # CONFIG_KPROBE_EVENTS_ON_NOTRACE is not set
< # CONFIG_UPROBE_EVENTS is not set
< CONFIG_BPF_EVENTS=y
< CONFIG_DYNAMIC_EVENTS=y
< CONFIG_PROBE_EVENTS=y
< CONFIG_FTRACE_MCOUNT_RECORD=y
< CONFIG_FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY=y
< # CONFIG_SYNTH_EVENTS is not set
< # CONFIG_HIST_TRIGGERS is not set
< # CONFIG_TRACE_EVENT_INJECT is not set
< # CONFIG_TRACEPOINT_BENCHMARK is not set
< # CONFIG_RING_BUFFER_BENCHMARK is not set
< # CONFIG_TRACE_EVAL_MAP_FILE is not set
< # CONFIG_FTRACE_RECORD_RECURSION is not set
< # CONFIG_FTRACE_STARTUP_TEST is not set
< # CONFIG_RING_BUFFER_STARTUP_TEST is not set
< # CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is not set
< # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
< # CONFIG_KPROBE_EVENT_GEN_TEST is not set
< # CONFIG_RV is not set
---
> # CONFIG_FTRACE is not set
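A raw `diff` of two kernel `.config` files like the one above mixes real changes (`=y` becoming unset) with noise (a `# ... is not set` line simply disappearing). The following is a rough sketch of a helper that reports only options whose effective value changed. It assumes the usual `.config` line formats `CONFIG_X=value` and `# CONFIG_X is not set`, and treats a missing option the same as "not set". The function names are hypothetical.

```python
import re

def parse_kconfig(text):
    """Map option name -> value; 'n' for '# CONFIG_X is not set' lines."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if m:
            opts[m.group(1)] = "n"
            continue
        m = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if m:
            opts[m.group(1)] = m.group(2)
    return opts

def changed_options(old_text, new_text):
    """Return {name: (old, new)} for options whose effective value differs."""
    old, new = parse_kconfig(old_text), parse_kconfig(new_text)
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        # Assumption: an absent option is equivalent to "not set".
        if (before or "n") != (after or "n"):
            changes[name] = (before, after)
    return changes

old_cfg = "CONFIG_FTRACE=y\nCONFIG_TRACING=y\n# CONFIG_RV is not set\n"
new_cfg = "# CONFIG_FTRACE is not set\n"
for name, (before, after) in changed_options(old_cfg, new_cfg).items():
    print(name, before, "->", after)
```

On the two tiny example configs, `CONFIG_RV` is not reported because it is unset in both, while `CONFIG_FTRACE` and `CONFIG_TRACING` are.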
Yet another interesting finding: turn off
Turn on
8803a8804,8805
> CONFIG_TRACE_IRQFLAGS=y
> CONFIG_TRACE_IRQFLAGS_NMI=y
8849a8852
> CONFIG_PREEMPTIRQ_TRACEPOINTS=y
8861c8864
< # CONFIG_IRQSOFF_TRACER is not set
---
> CONFIG_IRQSOFF_TRACER=y
On my RPi 4B, running OpenWrt 23.05.2 (64-bit), the measured result was 881 Mbps.
BTW, I believe 32-bit vs. 64-bit should show some difference; perhaps we should indicate this?
For an out-of-order CPU, it is normal for 32-bit and 64-bit to show the same performance; sometimes 64-bit may even be slower, because its fatter pointers consume more cache capacity. The intuition that 64-bit is faster comes from the doubled register width, which lets something like a 64-bit arithmetic operation finish in a single instruction. But keep in mind that 64-bit operations also have longer latency in the CPU's physical circuitry, which may require a lower clock frequency or more cycles per operation. The same holds for SIMD. The crypto algorithms in WireGuard, ChaCha20 and Poly1305, also use SIMD (i.e. Arm NEON) for their computation; if the microarchitecture does not provide wide enough SIMD processing in a single cycle, we will get the same performance in both 32-bit and 64-bit modes.
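To make the register-width point concrete: ChaCha20 as specified in RFC 8439 is built entirely from 32-bit additions, XORs and rotations, so wider general-purpose registers do not by themselves shorten it. Below is a minimal Python sketch of the quarter round, checked against the test vector in RFC 8439 section 2.1.1. It only illustrates the arithmetic; it is not the kernel's NEON implementation, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # every ChaCha20 word is 32 bits wide

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))

def quarter_round(a, b, c, d):
    """One ChaCha20 quarter round on four 32-bit words."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d

# Test vector from RFC 8439, section 2.1.1
out = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
assert out == (0xEA2A92F4, 0xCB1CF8CE, 0x4581472E, 0x5881C4BB)
print([hex(w) for w in out])
```

Since every step fits in a 32-bit lane, throughput depends mostly on how many such lanes the SIMD unit can process per cycle, not on whether the kernel is built for armv7l or aarch64.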
Based on the results, even the 2 GHz quad-core A53 in the TP-Link XDR 6088 can achieve 818 Mbits/sec. I doubt that the Raspberry Pi 4's result of only 394 Mbits/sec is accurate, since it has a quad-core A72 @ 1.5 GHz. So I switched back to the archlinuxarm-based PiKVM distro that my Raspberry Pi 4 usually runs, which uses an armv7l kernel rather than the aarch64 kernel of Raspberry Pi OS, and ran the benchmark. The result astonished me.
With the armv7l kernel, the result is about 69% faster. WHY?
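For reference, the arithmetic behind a "percent faster" figure can be sketched as below. This assumes the 69% is relative to the 394 Mbits/sec aarch64 result quoted above. The measured armv7l figure is not stated in the text, so the example only derives what 69% over that baseline would imply. The helper names are hypothetical.

```python
def speedup_percent(baseline, measured):
    """How much faster `measured` is than `baseline`, in percent."""
    return (measured / baseline - 1.0) * 100.0

def implied_throughput(baseline, percent_faster):
    """Throughput implied by being `percent_faster` over `baseline`."""
    return baseline * (1.0 + percent_faster / 100.0)

aarch64_mbps = 394.0  # aarch64 result quoted in the issue
print(round(implied_throughput(aarch64_mbps, 69.0)))  # about 666 Mbits/sec
```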
I searched the web and found a thread describing the same confusion, but about AES rather than the ChaCha20 used by wg [1]. It might be that the ChaCha20 implementation in the kernel is not optimized for aarch64. I want to leave this issue here to record any further investigation of this performance problem.
[1] https://forums.raspberrypi.com/viewtopic.php?t=317075