AmpereOne A192-32X (Supermicro) #52

Open
geerlingguy opened this issue Oct 16, 2024 · 37 comments


@geerlingguy
Owner

geerlingguy commented Oct 16, 2024


Basic information

Linux/system information

# output of `screenfetch`
ubuntu@ubuntu:~$ screenfetch 
                          ./+o+-       ubuntu@ubuntu
                  yyyyy- -yyyyyy+      OS: Ubuntu 24.04 noble
               ://+//////-yyyyyyo      Kernel: aarch64 Linux 6.8.0-39-generic-64k
           .++ .:/++++++/-.+sss/`      Uptime: 23m
         .:++o:  /++++++++/:--:/-      Packages: 810
        o:+o+:++.`..```.-/oo+++++/     Shell: bash 5.2.21
       .:+o:+o/.          `+sssoo+/    Disk: 19G / 101G (20%)
  .++/+:+oo+o:`             /sssooo.   CPU: Ampere Ampere-1a @ 192x 3.2GHz
 /+++//+:`oo+o               /::--:.   GPU: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
 \+/+o+++`o++o               ++////.   RAM: 31390MiB / 522867MiB
  .++.o+++oo+:`             /dddhhh.  
       .+.o+oo:.          `oddhhhh+   
        \+.++o+o``-````.:ohdhhhhh+    
         `:o+++ `ohhhhhhhhyo++os:     
           .o:`.syhhhhhhh/.oo++o`     
               /osyyyyyyo++ooo+++/    
                   ````` +oo+++o\:    
                          `oo++.     

# output of `uname -a`
Linux ubuntu 6.8.0-39-generic-64k #39-Ubuntu SMP PREEMPT_DYNAMIC Sat Jul  6 11:08:16 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Benchmark results

CPU

Power

  • Idle power draw (at wall): 199 W (sensors report 108 W SoC package power: 30 W CPU + 78 W IO)
  • Maximum simulated power draw (stress-ng --matrix 0): 500 W
  • During Geekbench multicore benchmark: 300-600 W (depending on Geekbench version)
  • During top500 HPL benchmark: 692 W

Disk

Samsung NVMe SSD - 983 DCT M.2 960GB

| Benchmark | Result |
| --- | --- |
| iozone 4K random read | 50.35 MB/s |
| iozone 4K random write | 216.04 MB/s |
| iozone 1M random read | 2067.82 MB/s |
| iozone 1M random write | 1295.13 MB/s |
| iozone 1M sequential read | 2098.31 MB/s |
| iozone 1M sequential write | 1291.07 MB/s |

wget https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh
chmod +x disk-benchmark.sh
sudo MOUNT_PATH=/ TEST_SIZE=1g ./disk-benchmark.sh

Samsung NVMe SSD - MZQL21T9HCJR-00A07

Specs: https://semiconductor.samsung.com/ssd/datacenter-ssd/pm9a3/mzql21t9hcjr-00a07/

Single disk

| Benchmark | Result |
| --- | --- |
| iozone 4K random read | 60.19 MB/s |
| iozone 4K random write | 284.72 MB/s |
| iozone 1M random read | 3777.29 MB/s |
| iozone 1M random write | 2686.80 MB/s |
| iozone 1M sequential read | 3773.44 MB/s |
| iozone 1M sequential write | 2680.90 MB/s |

RAID 0 (mdadm)

| Benchmark | Result |
| --- | --- |
| iozone 4K random read | 58.05 MB/s |
| iozone 4K random write | 250.06 MB/s |
| iozone 1M random read | 5444.03 MB/s |
| iozone 1M random write | 4411.07 MB/s |
| iozone 1M sequential read | 7120.75 MB/s |
| iozone 1M sequential write | 4458.30 MB/s |
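
As a rough way to read the single-disk and RAID 0 numbers above, the 4K figure can be converted to approximate IOPS and the array compared against one drive. This is only a sketch: it assumes the benchmark script reports decimal megabytes (1 MB = 1,000,000 bytes) and 4 KiB records, which the thread does not confirm, and the function names are made up for illustration.

```python
# Values copied from the single-disk and RAID 0 results above (MB/s).
SINGLE = {"4K random read": 60.19, "1M seq read": 3773.44, "1M seq write": 2680.90}
RAID0 = {"4K random read": 58.05, "1M seq read": 7120.75, "1M seq write": 4458.30}


def approx_iops(mb_per_s, record_bytes=4096, bytes_per_mb=1_000_000):
    """Convert throughput to operations per second (assumes decimal MB)."""
    return mb_per_s * bytes_per_mb / record_bytes


def speedup(single, array):
    """Ratio of array throughput to single-disk throughput."""
    return array / single


if __name__ == "__main__":
    print(f"single-disk 4K random read: ~{approx_iops(SINGLE['4K random read']):.0f} IOPS")
    for name in SINGLE:
        print(f"{name}: RAID 0 is {speedup(SINGLE[name], RAID0[name]):.2f}x single disk")
```

Large sequential transfers benefit from striping, while the 4K random read figure is essentially unchanged, which would be consistent with small random I/O being limited by per-request latency rather than by bandwidth.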

Network

iperf3 results:

  • iperf3 -c $SERVER_IP: 21.4 Gbps
  • iperf3 -c $SERVER_IP --reverse: 18.8 Gbps
  • iperf3 -c $SERVER_IP --bidir: 8.08 Gbps up, 22.2 Gbps down

Tested on one of the two built-in Broadcom BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller interfaces, to my HL15 Arm NAS (see: geerlingguy/arm-nas#16), routed through a Mikrotik 25G Cloud Router.

GPU

Did not test - this server doesn't have a GPU, just the ASPEED integrated BMC VGA graphics, which are not suitable for much GPU-accelerated gaming or LLMs, lol. Just render it on CPU!

Memory

tinymembench results:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :  14199.7 MB/s (0.3%)
 C copy backwards (32 byte blocks)                    :  13871.7 MB/s
 C copy backwards (64 byte blocks)                    :  13879.6 MB/s (0.2%)
 C copy                                               :  13890.6 MB/s (0.2%)
 C copy prefetched (32 bytes step)                    :  14581.4 MB/s
 C copy prefetched (64 bytes step)                    :  14613.8 MB/s
 C 2-pass copy                                        :  10819.4 MB/s
 C 2-pass copy prefetched (32 bytes step)             :  11313.6 MB/s
 C 2-pass copy prefetched (64 bytes step)             :  11417.4 MB/s
 C fill                                               :  31260.2 MB/s
 C fill (shuffle within 16 byte blocks)               :  31257.1 MB/s
 C fill (shuffle within 32 byte blocks)               :  31263.1 MB/s
 C fill (shuffle within 64 byte blocks)               :  31260.9 MB/s
 NEON 64x2 COPY                                       :  14464.3 MB/s (0.9%)
 NEON 64x2x4 COPY                                     :  13694.9 MB/s
 NEON 64x1x4_x2 COPY                                  :  12444.6 MB/s
 NEON 64x2 COPY prefetch x2                           :  14886.9 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  14954.4 MB/s
 NEON 64x2 COPY prefetch x1                           :  14892.3 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  14955.5 MB/s
 ---
 standard memcpy                                      :  14141.9 MB/s
 standard memset                                      :  31268.0 MB/s
 ---
 NEON LDP/STP copy                                    :  13775.1 MB/s (0.7%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  14267.3 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  14340.9 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  14670.0 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  14644.7 MB/s
 NEON LD1/ST1 copy                                    :  13756.1 MB/s
 NEON STP fill                                        :  31262.2 MB/s
 NEON STNP fill                                       :  31265.7 MB/s
 ARM LDP/STP copy                                     :  14454.0 MB/s (0.6%)
 ARM STP fill                                         :  31265.6 MB/s
 ARM STNP fill                                        :  31266.0 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.1 ns          /     1.6 ns 
    262144 :    1.7 ns          /     2.0 ns 
    524288 :    1.9 ns          /     2.2 ns 
   1048576 :    2.1 ns          /     2.2 ns 
   2097152 :    3.0 ns          /     3.3 ns 
   4194304 :   22.6 ns          /    33.9 ns 
   8388608 :   33.7 ns          /    44.3 ns 
  16777216 :   39.3 ns          /    48.0 ns 
  33554432 :   42.1 ns          /    49.4 ns 
  67108864 :   49.0 ns          /    60.2 ns 

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.1 ns          /     1.6 ns 
    262144 :    1.7 ns          /     2.0 ns 
    524288 :    1.9 ns          /     2.2 ns 
   1048576 :    2.1 ns          /     2.2 ns 
   2097152 :    3.0 ns          /     3.3 ns 
   4194304 :   22.6 ns          /    33.9 ns 
   8388608 :   33.7 ns          /    44.3 ns 
  16777216 :   39.3 ns          /    47.9 ns 
  33554432 :   42.1 ns          /    49.4 ns 
  67108864 :   49.9 ns          /    61.9 ns 

sbc-bench results

Run sbc-bench and paste a link to the results here: https://0x0.st/X0gc.bin

See: ThomasKaiser/sbc-bench#105

Phoronix Test Suite

Results from pi-general-benchmark.sh:

  • pts/encode-mp3: 11.248 sec
  • pts/x264 4K: 69.49 fps
  • pts/x264 1080p: 160.75 fps
  • pts/phpbench: 567108
  • pts/build-linux-kernel (defconfig): 50.101 sec

Additional benchmarks

QEMU Coremark

The Ampere team have suggested running this, as it will emulate running tons of virtual instances with coremark inside, a good proxy of the type of performance you can get with VMs/containers on this system: https://github.com/AmpereComputing/qemu-coremark

ubuntu@ubuntu:~/qemu-coremark$ ./run_pts.sh 2
47 instances of pts/coremark running in parallel in arm64 VMs!
Round 1 - Total CoreMark Score is: 4697344
Round 2 - Total CoreMark Score is: 4684524

llama.cpp (Ampere-optimized)

See: https://github.com/AmpereComputingAI/llama.cpp (I also have an email from Ampere with some testing notes).

Ollama (generic LLMs)

See: https://github.com/geerlingguy/ollama-benchmark?tab=readme-ov-file#findings

| System | CPU/GPU | Model | Eval Rate |
| --- | --- | --- | --- |
| AmpereOne A192-32X (192 core - 512GB) | CPU | llama3.2:3b | 23.52 Tokens/s |
| AmpereOne A192-32X (192 core - 512GB) | CPU | llama3.1:8b | 17.47 Tokens/s |
| AmpereOne A192-32X (192 core - 512GB) | CPU | llama3.1:70b | 3.86 Tokens/s |
| AmpereOne A192-32X (192 core - 512GB) | CPU | llama3.1:405b | 0.90 Tokens/s |

yolo-v5

See: https://github.com/AmpereComputingAI/yolov5-demo (maybe test it on a 4K60 video, see how it fares).

@geerlingguy
Owner Author

Getting full 25 Gbps Ethernet on the 2nd interface:

ubuntu@ubuntu:~$ ethtool eno2np1
Settings for eno2np1:
	Supported ports: [ FIBRE ]
	Supported link modes:   25000baseCR/Full
	                        1000baseX/Full
	                        10000baseCR/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Supported FEC modes: RS	 BASER
	Advertised link modes:  25000baseCR/Full
	                        1000baseX/Full
	                        10000baseCR/Full
	Advertised pause frame use: No
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Speed: 25000Mb/s
	Lanes: 1
	Duplex: Full
	Auto-negotiation: on
	Port: Direct Attach Copper
	PHYAD: 1
	Transceiver: internal
netlink error: Operation not permitted
        Current message level: 0x00002081 (8321)
                               drv tx_err hw
	Link detected: yes

If I try running Geekbench 6 I get a core dump, lol:

ubuntu@ubuntu:~/Geekbench-6.3.0-LinuxARMPreview$ ./geekbench6
<jemalloc>: Unsupported system page size
<jemalloc>: Unsupported system page size
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

I opened up a support issue for that: Can't run Geekbench 6 Arm Preview on AmpereOne 192-core system
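
The jemalloc message refers to the kernel's memory page size, and the kernel booted here is the `-64k` flavor shown in the `uname -a` output. A small sketch for confirming the page size a process actually sees is below; that the bundled jemalloc only accepts smaller pages is an assumption drawn from the error text, not something Geekbench states.

```python
import os


def page_size_bytes():
    """Return the memory page size the kernel exposes to this process."""
    return os.sysconf("SC_PAGE_SIZE")


if __name__ == "__main__":
    size = page_size_bytes()
    # A 64K-page kernel reports 65536 here; a 4K-page kernel reports 4096.
    print(f"page size: {size} bytes ({size // 1024} KiB)")
```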

@geerlingguy
Owner Author

And yes, I know this system is not really an SBC. I still want to test it against Arm SBCs, though ;)

@geerlingguy
Owner Author

To get btop to show the CPU SoC temps instead of apm_xgene/IO Power, I opened the options menu (o), tabbed to the CPU tab, and changed 'Cpu sensor' to apm_xgene/SoC Temperature.


@ThomasKaiser

Jeff, if time permits could you please check this:

grep CONFIG_ARM64_MTE /boot/config-6.8.0*

Background: the CPU cores should be capable of MTE but your machine doesn't expose the feature via /proc/cpuinfo.
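
For the `/proc/cpuinfo` side of that question, a minimal sketch of the check is below. It assumes the kernel lists the capability as the `mte` flag on the `Features` line, which is how arm64 kernels report it when MTE is exposed to userspace; the helper names are hypothetical.

```python
from pathlib import Path


def cpu_features(cpuinfo_text):
    """Collect the flags listed on every 'Features' line of /proc/cpuinfo."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            features.update(value.split())
    return features


def has_mte(cpuinfo_text):
    """True if the 'mte' flag appears among the reported CPU features."""
    return "mte" in cpu_features(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("MTE exposed:", has_mte(cpuinfo.read_text()))
```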

@hrw

hrw commented Oct 26, 2024

No GPU in it but can you check it with some AMD/NVIDIA graphic cards?

@geerlingguy
Owner Author

@hrw - I'd love to find a way to get a test sample of one of AMD or Nvidia's enterprise server cards—right now the best fit I have is an older Quadro RTX card, but it won't fit in this chassis.

@ThomasKaiser I'll try to run that next time I have the server booted (remind me if I forget next week); I shut it down over the weekend and a boot cycle takes 5-10 minutes, so I'm too lazy to sit and wait today for one command!

@hrw

hrw commented Oct 27, 2024

@geerlingguy "add pcie x16 riser cable to your shopping list" was my first idea but then I realized that server case would lack power cables for gpu as well.

@geerlingguy
Owner Author

@hrw - The server actually includes two 8-pin PCIe power connectors; it's designed for up to one fanless GPU (which needs high CFM to keep cool).

@geerlingguy
Owner Author

It looks like one stick of RAM was spewing errors, see geerlingguy/top500-benchmark#43 (comment)

I've re-seated that RAM module (DIMMF1), and am going to re-run all benchmarks so far. It is not erroring out now.

@geerlingguy
Owner Author

@ThomasKaiser:

ubuntu@ubuntu:$ grep CONFIG_ARM64_MTE /boot/config-6.8.0*
/boot/config-6.8.0-39-generic-64k:CONFIG_ARM64_MTE=y
/boot/config-6.8.0-47-generic:CONFIG_ARM64_MTE=y

@geerlingguy
Owner Author

Attempting qemu-coremark, during setup I'm getting an error: meson setup fails with 'Dependency "glib-2.0" not found'

@geerlingguy
Owner Author

Had to install libglib2.0-dev manually, then add myself to the kvm group, but now the benchmark runs.

@geerlingguy
Owner Author

geerlingguy commented Oct 30, 2024

I noticed when I run sudo shutdown now, I get logged out of ubuntu and SSH goes away, but then the server won't actually power off (and go into BMC-only mode) for many minutes.

Watching the SOL Console today, I saw tons of errors like:

[ 5261.993963] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x3ee0} len:0
[ 5270.120534] {1788}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 5270.129045] {1788}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 5270.137729] {1788}[Hardware Error]: event severity: corrected
[ 5270.143461] {1788}[Hardware Error]:  Error 0, type: corrected
[ 5270.149193] {1788}[Hardware Error]:   section_type: memory error
[ 5270.155186] {1788}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[ 5270.164478] {1788}[Hardware Error]:   node:0 card:5 module:16 device:7 
[ 5270.171078] {1788}[Hardware Error]:   error_type: 13, scrub corrected error
[ 5270.178026] EDAC MC0: 1 CE scrub corrected error on unknown memory (node:0 card:5 module:16 device:7 page:0x0 offset:0x0 grain:1 syndrome:0x0 - APEI location: node:0 card:5 module:16 device:7 status(0x0000000000000400): Storage error in DRAM memory)
[ 5271.187341] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x3ee4} len:0
[ 5280.388425] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x3ef0} len:0

So it looks like that DIMM is throwing a bunch of errors, maybe causing the Ethernet driver to throw other errors?

[ 5372.462135] bnxt_en 0003:02:00.1 eno2np1: Resp cmpl intr err msg: 0x51
[ 5372.468653] bnxt_en 0003:02:00.1 eno2np1: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
[ 5381.671651] bnxt_en 0003:02:00.1 eno2np1: Resp cmpl intr err msg: 0x51
[ 5381.678169] bnxt_en 0003:02:00.1 eno2np1: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
...
[ 5417.936638] INFO: task kworker/72:1:1300 blocked for more than 122 seconds.
[ 5417.943594]       Tainted: G        W          6.8.0-39-generic-64k #39-Ubuntu
[ 5417.950804] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...
[ 5603.138033] EDAC MC0: 1 CE single-symbol chipkill ECC on P0_Node0_Channel5_Dimm0 DIMMF1 (node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:1479 column:1216 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 page:0x2e3b7 offset:0x3800 grain:1 syndrome:0x0 - APEI location: node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:1479 column:1216 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 status(0x0000000000000400): Storage error in DRAM memory)
... [finally a long time later] ...
[ 5900.617885] reboot: Power down

It's still always DIMMF1 :)
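
To check that claim against a longer console capture, the corrected-error lines can be tallied per reported location. The sketch below is an illustration only: the regular expression is written against the two `EDAC MC0` line shapes seen in this log and may not match other EDAC message formats.

```python
import re
from collections import Counter

# Matches e.g. "EDAC MC0: 1 CE single-symbol chipkill ECC on <location> (node:0 ..."
EDAC_CE = re.compile(r"EDAC MC\d+: (\d+) CE .*? on (.+?) \(")


def corrected_errors_by_location(log_text):
    """Sum corrected-error (CE) counts per memory location named in the log."""
    totals = Counter()
    for line in log_text.splitlines():
        match = EDAC_CE.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


if __name__ == "__main__":
    sample = (
        "[ 5270.178026] EDAC MC0: 1 CE scrub corrected error on unknown memory (node:0 card:5)\n"
        "[ 5603.138033] EDAC MC0: 1 CE single-symbol chipkill ECC on "
        "P0_Node0_Channel5_Dimm0 DIMMF1 (node:0 card:5)\n"
    )
    for location, count in corrected_errors_by_location(sample).items():
        print(f"{count:4d}  {location}")
```

Lines reported as "unknown memory" still carry the same `card:5 module:16` coordinates as the DIMMF1 lines in the log above, so they likely belong to the same module.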

@bexcran

bexcran commented Oct 30, 2024

I saw the shutdown of an AmpereOne machine I was testing take a really long time too due to the Broadcom Ethernet driver. But I didn’t see any of the DRAM or APEI issues, so I’m not sure they’re related.

@geerlingguy
Owner Author

> I saw the shutdown of an AmpereOne machine I was testing take a really long time too due to the Broadcom Ethernet driver.

Hmm, maybe that's it then — those messages kept popping in amidst all the DIMM messages. Might be nice to figure out how to fix the bnxt_en driver!

@geerlingguy
Owner Author

geerlingguy commented Oct 30, 2024

Testing a RAID 0 array of all the NVMe drives following my guide:

ubuntu@ubuntu:~$ sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=6 /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1 /dev/nvme3n1p1 /dev/nvme5n1p1 /dev/nvme6n1p1

ubuntu@ubuntu:~$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Wed Oct 30 16:37:22 2024
        Raid Level : raid0
        Array Size : 11251445760 (10.48 TiB 11.52 TB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent

       Update Time : Wed Oct 30 16:37:22 2024
             State : clean 
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

            Layout : original
        Chunk Size : 512K

Consistency Policy : none

              Name : ubuntu:0  (local to host ubuntu)
              UUID : 6dd22af6:0fd54fa0:9463f73f:636afb4e
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259       11        0      active sync   /dev/nvme0n1p1
       1     259       13        1      active sync   /dev/nvme1n1p1
       2     259       12        2      active sync   /dev/nvme2n1p1
       3     259       14        3      active sync   /dev/nvme3n1p1
       4     259       15        4      active sync   /dev/nvme5n1p1
       5     259       16        5      active sync   /dev/nvme6n1p1

ubuntu@ubuntu:~$ sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0 /dev/md0
ubuntu@ubuntu:~$ sudo mkdir /mnt/raid0
ubuntu@ubuntu:~$ sudo mount /dev/md0 /mnt/raid0
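
The `Array Size` line in the `mdadm --detail` output above is given in 1 KiB blocks, and the two figures in parentheses follow directly from it. A short worked conversion, with an even per-drive split added as an assumption (RAID 0 members of equal size):

```python
ARRAY_SIZE_KIB = 11251445760  # "Array Size" from mdadm --detail, in 1 KiB blocks
RAID_DEVICES = 6              # "Raid Devices" from the same output


def kib_to_tib(kib):
    """Binary terabytes: 1 TiB = 2**40 bytes."""
    return kib * 1024 / 2**40


def kib_to_tb(kib):
    """Decimal terabytes: 1 TB = 10**12 bytes."""
    return kib * 1024 / 10**12


if __name__ == "__main__":
    print(f"{kib_to_tib(ARRAY_SIZE_KIB):.2f} TiB, {kib_to_tb(ARRAY_SIZE_KIB):.2f} TB")
    # Assumes the six members contribute equally to the stripe.
    print(f"~{kib_to_tb(ARRAY_SIZE_KIB) / RAID_DEVICES:.2f} TB per device")
```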

Running my disk benchmark on the array...

| Benchmark | Result |
| --- | --- |
| iozone 4K random read | 58.05 MB/s |
| iozone 4K random write | 250.06 MB/s |
| iozone 1M random read | 5444.03 MB/s |
| iozone 1M random write | 4411.07 MB/s |
| iozone 1M sequential read | 7120.75 MB/s |
| iozone 1M sequential write | 4458.30 MB/s |

@geerlingguy geerlingguy changed the title AmpereOne A192-32X (SuperMicro) AmpereOne A192-32X (Supermicro) Oct 30, 2024
@geerlingguy
Owner Author

geerlingguy commented Oct 31, 2024

Ampere sent over a replacement DIMM, and it seems to have corrected all the memory issues.

However, shutdown is still excruciating — timing this shutdown cycle, it took 15+ minutes, and I just see tons of Ethernet NIC errors (see below for a snippet), maybe a bug in the bnxt_en driver on arm64?

[  224.516490] infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
[  224.523180] infiniband bnxt_re0: Couldn't start port
[  224.528173] bnxt_en 0003:02:00.0 bnxt_re0: Failed to destroy HW QP
[  224.534384] ------------[ cut here ]------------
[  224.538988] WARNING: CPU: 97 PID: 2721 at drivers/infiniband/core/cq.c:322 ib_free_cq+0x13c/0x1d8 [ib_core]
[  224.548759] Modules linked in: tls xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables qrtr overlay nls_iso8859_1 bnxt_re(+) ampere_cspmu cfg80211 dax_hmem acpi_ipmi ib_uverbs cxl_acpi ast cxl_core ipmi_ssif arm_cspmu_module arm_spe_pmu i2c_algo_bit ib_core onboard_usb_hub acpi_tad arm_cmn ipmi_msghandler xgene_hwmon cppc_cpufreq sch_fq_codel binfmt_misc dm_multipath nvme_fabrics efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 rndis_host cdc_ether usbnet btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha2_ce nvme sha256_arm64 sha1_ce nvme_core bnxt_en xhci_pci xhci_pci_renesas nvme_auth aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: ipmi_devintf]
[  224.637726] CPU: 97 PID: 2721 Comm: (udev-worker) Not tainted 6.8.0-39-generic-64k #39-Ubuntu
[  224.646237] Hardware name: Supermicro Super Server/R13SPD, BIOS T20241001152934 10/01/2024
[  224.654487] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[  224.661437] pc : ib_free_cq+0x13c/0x1d8 [ib_core]
[  224.666152] lr : ib_mad_port_open+0x220/0x450 [ib_core]
[  224.671388] sp : ffff80010920f520
[  224.674690] x29: ffff80010920f520 x28: 0000000000000000 x27: ffffb4c746059120
[  224.681813] x26: 0000000000000000 x25: ffff0002527e8870 x24: ffff0002527e88f8
[  224.688936] x23: ffffb4c7465f3e90 x22: 00000000ffffff92 x21: ffffb4c7465fc550
[  224.696060] x20: ffff000246000000 x19: ffff00015794bc00 x18: ffff8000e8d400f0
[  224.703182] x17: 0000000000000000 x16: 0000000000000000 x15: 6c6c6174735f7766
[  224.710305] x14: 0000000000000000 x13: 505120574820796f x12: 7274736564206f74
[  224.717429] x11: 2064656c69614620 x10: 0000000000000000 x9 : ffffb4c7465c58b0
[  224.724552] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[  224.731675] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[  224.738798] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000002
[  224.745921] Call trace:
[  224.748355]  ib_free_cq+0x13c/0x1d8 [ib_core]
[  224.752723]  ib_mad_port_open+0x220/0x450 [ib_core]
[  224.757609]  ib_mad_init_device+0x78/0x228 [ib_core]
[  224.762582]  add_client_context+0xfc/0x208 [ib_core]
[  224.767556]  enable_device_and_get+0xe0/0x1e0 [ib_core]
[  224.772790]  ib_register_device.part.0+0x130/0x218 [ib_core]
[  224.778459]  ib_register_device+0x38/0x68 [ib_core]
[  224.783345]  bnxt_re_ib_init+0x120/0x238 [bnxt_re]
[  224.788135]  bnxt_re_probe+0x14c/0x268 [bnxt_re]
[  224.792746]  auxiliary_bus_probe+0x50/0x108
[  224.796920]  really_probe+0x1c0/0x420
[  224.800575]  __driver_probe_device+0x94/0x1d8
[  224.804920]  driver_probe_device+0x48/0x188
[  224.809091]  __driver_attach+0x14c/0x2c8
[  224.813002]  bus_for_each_dev+0x88/0x110
[  224.816913]  driver_attach+0x30/0x60
[  224.820476]  bus_add_driver+0x17c/0x2d0
[  224.824300]  driver_register+0x68/0x178
[  224.828125]  __auxiliary_driver_register+0x78/0x148
[  224.832990]  bnxt_re_mod_init+0x54/0xfff8 [bnxt_re]
[  224.837861]  do_one_initcall+0x64/0x3b8
[  224.841687]  do_init_module+0xa0/0x280
[  224.845425]  load_module+0x7b8/0x8f0
[  224.848988]  init_module_from_file+0x98/0x118
[  224.853332]  idempotent_init_module+0x1a4/0x2c8
[  224.857850]  __arm64_sys_finit_module+0x70/0xf8
[  224.862368]  invoke_syscall.constprop.0+0x84/0x100
[  224.867147]  do_el0_svc+0xe4/0x100
[  224.870536]  el0_svc+0x48/0x1c8
[  224.873673]  el0t_64_sync_handler+0x148/0x158
[  224.878019]  el0t_64_sync+0x1b0/0x1b8
[  224.881670] ---[ end trace 0000000000000000 ]---
[  224.886282] bnxt_en 0003:02:00.0 bnxt_re0: Free MW failed: 0xffffff92
[  224.892720] infiniband bnxt_re0: Couldn't open port 1
[  257.266860] INFO: task (udev-worker):2732 blocked for more than 122 seconds.
[  257.273911]       Tainted: G        W          6.8.0-39-generic-64k #39-Ubuntu
[  257.281123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  284.760123] systemd-shutdown[1]: Waiting for process: 2721 ((udev-worker)), 2732 ((udev-worker))
[  326.899586] bnxt_en 0003:02:00.1: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (101989 > 100000) msec active 1 
[  326.911840] bnxt_en 0003:02:00.1 bnxt_re1: Failed to modify HW QP
[  326.917924] infiniband bnxt_re1: Couldn't change QP1 state to INIT: -110
[  326.924614] infiniband bnxt_re1: Couldn't start port
[  326.929635] bnxt_en 0003:02:00.1 bnxt_re1: Failed to destroy HW QP
[  326.935847] bnxt_en 0003:02:00.1 bnxt_re1: Free MW failed: 0xffffff92
[  326.942289] infiniband bnxt_re1: Couldn't open port 1
[  327.166856] bnxt_en 0003:02:00.1 bnxt_re1: Failed to deinitialize RCFW: 0xffffff92
[  327.184299] bnxt_en 0003:02:00.0 bnxt_re0: Failed to remove GID: 0xffffff92
[  327.192669] bnxt_en 0003:02:00.0 bnxt_re0: Failed to deinitialize RCFW: 0xffffff92
[  338.125977] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x44 0x69b} len:0
[  338.134153] bnxt_en 0003:02:00.1 eno2np1: hwrm vnic set tpa failure rc for vnic 2: fffffff0
[  340.376637] bnxt_en 0003:02:00.0 eno1np0: Error (timeout: 5000015) msg {0xb4 0x5eb} len:0
[  347.935864] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x6a4} len:0
[  350.701222] bnxt_en 0003:02:00.0 eno1np0: Error (timeout: 5000015) msg {0xb4 0x5f0} len:0
[  357.708225] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x6ae} len:0
[  360.760027] bnxt_en 0003:02:00.0 eno1np0: Error (timeout: 5000015) msg {0x23 0x5f1} len:0
[  367.482017] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x6b4} len:0
[  370.821975] bnxt_en 0003:02:00.0 eno1np0: Error (timeout: 5000015) msg {0xb4 0x5f5} len:0
[  377.268121] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x6b7} len:0
[  381.507812] bnxt_en 0003:02:00.0 eno1np0: Error (timeout: 5000015) msg {0xb4 0x5fa} len:0
[  387.048016] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x6b9} len:0
[  391.749581] bnxt_en 0003:02:00.0 eno1np0: Error (timeout: 5000015) msg {0xb4 0x5ff} len:0
[  396.814118] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x6c6} len:0
[  403.018680] bnxt_en 0003:02:00.0 eno1np0: Error (timeout: 5000015) msg {0x23 0x606} len:0
[  406.578705] bnxt_en 0003:02:00.1 eno2np1: Error (timeout: 5000015) msg {0x41 0x6cf} len:0
...
[  621.657292] bnxt_en 0003:02:00.1 eno2np1: Resp cmpl intr err msg: 0x51
[  621.663810] bnxt_en 0003:02:00.1 eno2np1: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
[  625.054417] bnxt_en 0003:02:00.0 eno1np0: Error (timeout: 5000015) msg {0x37 0x690} len:0
[  625.911529] INFO: task kworker/34:0:223 blocked for more than 245 seconds.
[  625.918395]       Tainted: G        W          6.8.0-39-generic-64k #39-Ubuntu
[  625.925604] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  631.417059] bnxt_en 0003:02:00.1 eno2np1: Resp cmpl intr err msg: 0x51
[  631.423577] bnxt_en 0003:02:00.1 eno2np1: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
[  635.129886] bnxt_en 0003:02:00.0 eno1np0: Error (timeout: 5000015) msg {0x37 0x693} len:0
...

@bexcran

bexcran commented Oct 31, 2024

Could you try blacklisting the bnxt_re module? To do so, edit /etc/modprobe.d/blacklist.conf and add:

blacklist bnxt_re

From https://utcc.utoronto.ca/~cks/space/blog/linux/BroadcomNetworkDriverAndRDMA?showcomments :

The driver stalls during boot and spits out kernel messages like:

    bnxt_en 0000:ab:00.0: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xf]=0x3 waited (102721 > 100000) msec active 1
    bnxt_en 0000:ab:00.0 bnxt_re0: Failed to modify HW QP
    infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
    infiniband bnxt_re0: Couldn't start port
    bnxt_en 0000:ab:00.0 bnxt_re0: Failed to destroy HW QP
    [... more fun ensues ...]

This causes systemd-udev-settle.service to fail:

udevadm[1212]: Timed out for waiting the udev queue being empty.
systemd[1]: systemd-udev-settle.service: Main process exited, code=exited, status=1/FAILURE

This then causes Ubuntu 24.04's ZFS services to fail to completely start, which is a bad thing on hardware that we want to use for our ZFS fileservers.

We aren't the only people with this problem, so I was able to find various threads on the Internet, for example. These gave me the solution, which is to blacklist the bnxt_re kernel module, but at the time left me with the mystery of how and why the bnxt_re module was even being loaded in the first place.

@geerlingguy
Owner Author

geerlingguy commented Oct 31, 2024

@bexcran - Will try that, after waiting 15 minutes I just pushed an immediate shutdown so I could finally power cycle.

(On the plus side, power on is much faster now, with a DIMM not spewing out errors all the time.)

@bexcran

bexcran commented Oct 31, 2024

@geerlingguy That's what I'd do too! If you want to be a bit nicer to your system/filesystem and you have "Magic SysRq Keys" enabled you can do:

ALT+PrintScreen+s,u,b

That is, press and hold ALT and SysRq (will probably be labeled PrintScr on your keyboard instead of SysRq) while pressing 's', then 'u' then 'b' without letting go of ALT and SysRq.

That'll sync data to disk, attempt to remount filesystems read-only, and then reboot.

https://docs.kernel.org/admin-guide/sysrq.html
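
Whether those three keys do anything depends on the `kernel.sysrq` setting. The sketch below decodes that value for `s`, `u`, and `b` using the bitmask described in the kernel documentation linked above (16 for sync, 32 for remount read-only, 128 for reboot/poweroff, and 1 meaning everything is enabled); the function name is hypothetical.

```python
from pathlib import Path

# Bits of kernel.sysrq that gate each key, per the kernel's sysrq documentation.
SYSRQ_BITS = {"s": 0x10, "u": 0x20, "b": 0x80}  # sync, remount read-only, reboot


def sysrq_allows(setting, key):
    """Return True if the kernel.sysrq value `setting` permits `key`."""
    if setting == 1:  # 1 enables every SysRq function
        return True
    return bool(setting & SYSRQ_BITS[key])


if __name__ == "__main__":
    path = Path("/proc/sys/kernel/sysrq")
    if path.exists():
        value = int(path.read_text().strip())
        for key in "sub":
            print(f"SysRq+{key}: {'allowed' if sysrq_allows(value, key) else 'blocked'}")
```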

@geerlingguy
Owner Author

I added /etc/modprobe.d/blacklist-bnxt.conf with blacklist bnxt_re inside, and rebooted.

Now, it reaches poweroff.target within 3 seconds, and Power down state after about 12. SOOOOO much nicer lol.

I guess if I ever need RDMA over Ethernet (RoCE), I can figure out that bnxt_re module; otherwise, I'm not sure why it would load by default!
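
To confirm after a reboot that the blacklist took effect, the loaded-module list can be checked for `bnxt_re` while `bnxt_en` (the regular Ethernet driver) stays loaded. A minimal sketch, assuming the usual `/proc/modules` layout where the module name is the first field on each line; the helper names are hypothetical.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules content."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def is_loaded(name, proc_modules_text):
    """True if kernel module `name` appears in the given /proc/modules text."""
    return name in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    modules = Path("/proc/modules")
    if modules.exists():
        text = modules.read_text()
        for name in ("bnxt_en", "bnxt_re"):
            print(f"{name}: {'loaded' if is_loaded(name, text) else 'not loaded'}")
```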

@geerlingguy
Owner Author

@bexcran - Is there any simple way of switching the kernel I'm booting on here? I would like to try the 4K kernel just to see if Geekbench will complete a run, but the default kernel that it's running right now (for performance reasons) is 64K.

@bexcran

bexcran commented Oct 31, 2024

@geerlingguy Sorry, I don't know.

@geerlingguy
Owner Author

I may do a reinstall of the OS on a separate drive just to do that test then.

@geerlingguy
Owner Author

geerlingguy commented Nov 1, 2024

Also, now that I have my Ampere Altra 32-core NAS server upgraded to 25 Gbps Ethernet: geerlingguy/arm-nas#16

I can finally run the iperf3 test between these two machines!

ubuntu@ubuntu:~$ iperf3 -c 10.0.2.51
Connecting to host 10.0.2.51, port 5201
[  5] local 10.0.2.21 port 41304 connected to 10.0.2.51 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.73 GBytes  23.4 Gbits/sec    9   1.37 MBytes       
[  5]   1.00-2.00   sec  2.69 GBytes  23.1 Gbits/sec  423    998 KBytes       
[  5]   2.00-3.00   sec  2.22 GBytes  19.1 Gbits/sec   80    737 KBytes       
[  5]   3.00-4.00   sec  1.82 GBytes  15.6 Gbits/sec   93    928 KBytes       
[  5]   4.00-5.00   sec  2.56 GBytes  22.0 Gbits/sec  211    997 KBytes       
[  5]   5.00-6.00   sec  2.57 GBytes  22.1 Gbits/sec  250    765 KBytes       
[  5]   6.00-7.00   sec  2.56 GBytes  22.0 Gbits/sec  128    952 KBytes       
[  5]   7.00-8.00   sec  2.57 GBytes  22.1 Gbits/sec  198    846 KBytes       
[  5]   8.00-9.00   sec  2.58 GBytes  22.2 Gbits/sec  113   1.07 MBytes       
[  5]   9.00-10.00  sec  2.59 GBytes  22.2 Gbits/sec  141    718 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  24.9 GBytes  21.4 Gbits/sec  1646             sender
[  5]   0.00-10.04  sec  24.9 GBytes  21.3 Gbits/sec                  receiver

I've noticed some variances—hard to tell if it's on the NAS side, the AmpereOne side, or my cloud router. None of them are showing 100% CPU utilization, and watching on atop, I don't see any interrupt issues or any other bottleneck.
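To put a number on that variance, the per-second lines can be parsed and summarized. This is just a sketch in Python over the output pasted above; the regex and function name are mine and only handle this particular Gbits/sec layout.

```python
import re
from statistics import mean, pstdev

IPERF_OUTPUT = """\
[  5]   0.00-1.00   sec  2.73 GBytes  23.4 Gbits/sec    9   1.37 MBytes
[  5]   1.00-2.00   sec  2.69 GBytes  23.1 Gbits/sec  423    998 KBytes
[  5]   2.00-3.00   sec  2.22 GBytes  19.1 Gbits/sec   80    737 KBytes
[  5]   3.00-4.00   sec  1.82 GBytes  15.6 Gbits/sec   93    928 KBytes
[  5]   4.00-5.00   sec  2.56 GBytes  22.0 Gbits/sec  211    997 KBytes
[  5]   5.00-6.00   sec  2.57 GBytes  22.1 Gbits/sec  250    765 KBytes
[  5]   6.00-7.00   sec  2.56 GBytes  22.0 Gbits/sec  128    952 KBytes
[  5]   7.00-8.00   sec  2.57 GBytes  22.1 Gbits/sec  198    846 KBytes
[  5]   8.00-9.00   sec  2.58 GBytes  22.2 Gbits/sec  113   1.07 MBytes
[  5]   9.00-10.00  sec  2.59 GBytes  22.2 Gbits/sec  141    718 KBytes
[  5]   0.00-10.00  sec  24.9 GBytes  21.4 Gbits/sec  1646             sender
"""

# Matches only interval lines: bitrate, retransmits, then a congestion window.
INTERVAL_RE = re.compile(r"([\d.]+) Gbits/sec\s+(\d+)\s+[\d.]+ [KM]Bytes")


def parse_intervals(text):
    """Return a list of (bitrate_gbps, retransmits) per interval line."""
    return [(float(m.group(1)), int(m.group(2)))
            for m in INTERVAL_RE.finditer(text)]


if __name__ == "__main__":
    intervals = parse_intervals(IPERF_OUTPUT)
    rates = [rate for rate, _ in intervals]
    print(f"mean {mean(rates):.1f} Gbit/s, min {min(rates)}, max {max(rates)}")
    print(f"std dev {pstdev(rates):.2f} Gbit/s")
    print("total retransmits:", sum(retr for _, retr in intervals))
```

The dip in seconds 2 through 4 accounts for most of the spread; the other seven intervals sit within about 1.5 Gbit/s of each other.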

@geerlingguy
Owner Author

geerlingguy commented Nov 1, 2024

Testing a copy over SMB from one of the NVMe drives on this system to the NVMe on the HL15:

$ sudo apt install cifs-utils
$ sudo mkdir /mnt/mercury
$ sudo mount -t cifs -o user=jgeerling,uid=$(id -u),gid=$(id -g) //nas01.mmoffice.net/mercury /mnt/mercury

# Inside /mnt/nvme/test, create a large file
$ fallocate -l 100G largefile

# Benchmark file copy over SMB *to* NAS01
ubuntu@ubuntu:/mnt/nvme/test$ rsync --info=progress2 -a largefile /mnt/mercury/test/largefile
107,374,182,400 100%    1.14GB/s    0:01:27 (xfr#1, to-chk=0/1)

# Benchmark file copy over SMB *from* NAS01
ubuntu@ubuntu:/mnt/nvme/test$ rsync --info=progress2 -a /mnt/mercury/test/largefile largefile
107,374,182,400 100%    1.03GB/s    0:01:37 (xfr#1, to-chk=0/1)

Not quite as fast as I was hoping, but this is dealing with SMB + Ethernet + rsync overhead, and I saw it going between 8-15 Gbps on the NAS. Interesting that the copy back was noticeably slower (about 1 Gbps slower).
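For comparing those rsync rates against the Gbps figures, here is a quick conversion sketch in Python. It uses the byte count and elapsed times from the output above; since rsync rounds elapsed time to whole seconds, the results are approximate. The numbers come out close to rsync's own figures only when dividing by 2^30, which suggests its "GB/s" is 1024-based.

```python
def throughput(bytes_moved: int, seconds: float):
    """Return (GiB/s, Gbit/s) for a transfer of bytes_moved in seconds."""
    bytes_per_sec = bytes_moved / seconds
    return bytes_per_sec / 2**30, bytes_per_sec * 8 / 1e9


SIZE = 107_374_182_400           # bytes, from the rsync output (100 GiB)
to_nas = throughput(SIZE, 87)    # elapsed 0:01:27
from_nas = throughput(SIZE, 97)  # elapsed 0:01:37

if __name__ == "__main__":
    print(f"to NAS:   {to_nas[0]:.2f} GiB/s = {to_nas[1]:.2f} Gbit/s")
    print(f"from NAS: {from_nas[0]:.2f} GiB/s = {from_nas[1]:.2f} Gbit/s")
    print(f"difference: {to_nas[1] - from_nas[1]:.2f} Gbit/s")
```

By this arithmetic both directions land a little under 10 Gbit/s of payload, well short of the 21 to 23 Gbit/s iperf3 measured on the same link.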

Testing with fio:

$ fio --name=job-w --rw=write --size=2G --ioengine=libaio --iodepth=4 --bs=128k --direct=1 --filename=bench.file
WRITE: bw=864MiB/s (906MB/s), 864MiB/s-864MiB/s (906MB/s-906MB/s), io=2048MiB (2147MB), run=2370-2370msec

$ fio --name=job-r --rw=read --size=2G --ioengine=libaio --iodepth=4 --bs=128K --direct=1 --filename=bench.file
READ: bw=1267MiB/s (1328MB/s), 1267MiB/s-1267MiB/s (1328MB/s-1328MB/s), io=2048MiB (2147MB), run=1617-1617msec

$ fio --name=job-randw --rw=randwrite --size=2G --ioengine=libaio --iodepth=32 --bs=4k --direct=1 --filename=bench.file
write: IOPS=15.0k, BW=59.4MiB/s (62.3MB/s)(2048MiB/34486msec)
WRITE: bw=59.4MiB/s (62.3MB/s), 59.4MiB/s-59.4MiB/s (62.3MB/s-62.3MB/s), io=2048MiB (2147MB), run=34486-34486msec

$ fio --name=job-randr --rw=randread --size=2G --ioengine=libaio --iodepth=32 --bs=4K --direct=1 --filename=bench.file
read: IOPS=36.4k, BW=142MiB/s (149MB/s)(2048MiB/14398msec)
READ: bw=142MiB/s (149MB/s), 142MiB/s-142MiB/s (149MB/s-149MB/s), io=2048MiB (2147MB), run=14398-14398msec
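The fio summary numbers can be sanity-checked from the job size, block size, and runtime. A minimal Python sketch, with helper names of my own choosing, using the values from the runs above:

```python
GIB = 2**30
MIB = 2**20


def bandwidth_mib_s(total_bytes: int, runtime_ms: int) -> float:
    """Average bandwidth in MiB/s."""
    return total_bytes / MIB / (runtime_ms / 1000)


def iops(total_bytes: int, block_bytes: int, runtime_ms: int) -> float:
    """Average I/O operations per second."""
    return (total_bytes / block_bytes) / (runtime_ms / 1000)


if __name__ == "__main__":
    size = 2 * GIB  # --size=2G
    print(f"seq write: {bandwidth_mib_s(size, 2370):.0f} MiB/s")
    print(f"seq read:  {bandwidth_mib_s(size, 1617):.0f} MiB/s")
    print(f"randwrite: {iops(size, 4096, 34486):.0f} IOPS, "
          f"{bandwidth_mib_s(size, 34486):.1f} MiB/s")
    print(f"randread:  {iops(size, 4096, 14398):.0f} IOPS, "
          f"{bandwidth_mib_s(size, 14398):.1f} MiB/s")
```

With 4 KiB blocks, bandwidth in MiB/s is simply IOPS divided by 256, which is why the random tests look so slow next to the 128 KiB sequential ones.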

@geerlingguy
Owner Author

geerlingguy commented Nov 1, 2024

To switch kernels on Ubuntu, I did the following:

Get a listing of all the installed kernels:

ubuntu@ubuntu:~$ sudo grub-mkconfig | grep -iE "menuentry 'Ubuntu, with Linux" | awk '{print i++ " : "$1, $2, $3, $4, $5, $6, $7}'
Sourcing file `/etc/default/grub'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.0-47-generic
Found initrd image: /boot/initrd.img-6.8.0-47-generic
Found linux image: /boot/vmlinuz-6.8.0-39-generic-64k
Found initrd image: /boot/initrd.img-6.8.0-39-generic-64k
Warning: os-prober will not be executed to detect other bootable partitions.
Systems on them will not be added to the GRUB boot configuration.
Check GRUB_DISABLE_OS_PROBER documentation entry.
Adding boot menu entry for UEFI Firmware Settings ...
done
0 : menuentry 'Ubuntu, with Linux 6.8.0-47-generic' --class ubuntu
1 : menuentry 'Ubuntu, with Linux 6.8.0-47-generic (recovery mode)'
2 : menuentry 'Ubuntu, with Linux 6.8.0-39-generic-64k' --class ubuntu
3 : menuentry 'Ubuntu, with Linux 6.8.0-39-generic-64k (recovery mode)'

Edit the Grub configuration.

$ sudoedit /etc/default/grub

# Set `GRUB_DEFAULT` to `0` to pick the first option / default.
#GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-39-generic-64k"
GRUB_DEFAULT=0

# Comment out the `GRUB_TIMEOUT_STYLE=hidden` line so it looks like:
#GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0

# After saving the file, run
$ sudo update-grub
$ sudo reboot

Technically I could hit Esc (I think? Maybe Shift?) during boot, but the timing for that is pretty narrow, so it's nicer to just have the menu appear during boot.

After reboot:

ubuntu@ubuntu:~$ uname -a
Linux ubuntu 6.8.0-47-generic #47-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 27 22:03:50 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
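The `uname` output only shows that the `-64k` suffix is gone. To confirm the page size directly, something like this Python snippet should work (`getconf PAGESIZE` in a shell reports the same value); the helper name is mine:

```python
import os
import platform


def page_size_kib() -> int:
    """Return the running kernel's memory page size in KiB."""
    return os.sysconf("SC_PAGE_SIZE") // 1024


if __name__ == "__main__":
    print(f"{platform.release()}: {page_size_kib()}K pages")
```

On the 64K kernel I would expect this to report 64, and 4 on the generic one.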

@geerlingguy
Owner Author

geerlingguy commented Nov 1, 2024

Now that I have the kernel back to 4k page size, I am running a Geekbench 6 test. I noticed someone else ran one on the same motherboard/CPU in May: https://browser.geekbench.com/v6/cpu/6131970

14435 multicore vs my 15160. Single core spot on at 1309.

Geekbench 6 is horrible for this many cores—it didn't seem to even get halfway up to full multicore performance... Geekbench 5 at least pegs all the cores and hits 600W for some of the tests.

| Geekbench Version | Single Core | Multi Core | Peak Power Consumption |
| --- | --- | --- | --- |
| 6.0.3 Arm preview | 1309 | 15160 | 279W |
| 5.4.0 Arm preview | 958 | 80639 | 586W |
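One rough way to see how differently the two versions scale is the multi-core to single-core ratio. A Python sketch using the scores above; treating the ratio as "effective cores" is my own simplification, not something Geekbench defines:

```python
CORES = 192

# (single-core score, multi-core score) from the results above
SCORES = {
    "Geekbench 6.0.3": (1309, 15160),
    "Geekbench 5.4.0": (958, 80639),
}


def scaling_ratio(single: int, multi: int) -> float:
    """Multi-core score expressed as a multiple of the single-core score."""
    return multi / single


if __name__ == "__main__":
    for name, (single, multi) in SCORES.items():
        ratio = scaling_ratio(single, multi)
        print(f"{name}: {ratio:.1f}x single core "
              f"({ratio / CORES:.0%} of {CORES} cores)")
```

Geekbench 6 ends up around 12x its single-core score and Geekbench 5 around 84x, which lines up with the much lower peak power draw during the Geekbench 6 run.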

Geekbench 6 is on the left:

Screenshot 2024-11-01 at 12 16 27 PM

@geerlingguy
Owner Author

geerlingguy commented Nov 21, 2024

I wanted to get more numbers for the qemu-coremark test Ampere suggested (which runs Arm64 VMs in parallel running coremark).

From Wendell / L1Techs, running on Sapphire Rapids (60 core / 120 thread / 1.90 GHz):

59 instances of pts/coremark running in parallel in arm64 VMs!
Round 1 - Total CoreMark Score is: 491,008
w@prod-baremetal:~/qemu-coremark$ cat /proc/cpuinfo |grep Xeon |head -n 1
model name      : Intel(R) Xeon(R) Platinum 8490H

And on Granite Rapids (2x 128-core Intel Xeon 6, 256 core / 512 thread / 2-3.9 GHz):

127 instances of pts/coremark running in parallel in arm64 VMs!
Round 1 - Total CoreMark Score is: 1,255,430
root@tecpress7:~/qemu-coremark# cat /proc/cpuinfo |grep Xeon |head -n 1
model name      : Intel(R) Xeon(R) 6980P

(Compared to the AmpereOne A192-32X at 4,697,344, and Epyc Genoa 9654 (96 core/192 thread) at 512,244.)
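For a quick side-by-side, the totals can be expressed relative to the AmpereOne result. A small Python sketch using only the scores quoted in this comment; the dictionary labels are my own shorthand:

```python
# Total CoreMark scores from the qemu-coremark runs quoted above
TOTALS = {
    "AmpereOne A192-32X": 4_697_344,
    "2x Xeon 6980P": 1_255_430,
    "Epyc Genoa 9654": 512_244,
    "Xeon Platinum 8490H": 491_008,
}


def relative_to(baseline: str, totals: dict) -> dict:
    """Return how many times larger the baseline score is than each other score."""
    base = totals[baseline]
    return {name: base / score for name, score in totals.items() if name != baseline}


if __name__ == "__main__":
    for name, factor in relative_to("AmpereOne A192-32X", TOTALS).items():
        print(f"AmpereOne vs {name}: {factor:.1f}x")
```

Keep in mind these are totals across different numbers of VM instances per host, so the factors say nothing about per-instance speed.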

@ThomasKaiser

> I wanted to get more numbers for the qemu-coremark test Ampere suggested

Why exactly? What is the purpose of running the arm64 version of coremark emulated on x86?

Why is your 192-core system executing 47 parallel instances, the 120-thread Platinum 8490H executing 59 and the 512-thread Xeon 6980P 127? The Ampere and the 6980P seem to fire up ($count-of-threads / 4 ) - 1 instances while the 8490H fires up ($count-of-threads / 2 ) - 1... just why? Why not $count-of-threads instead if it's all about running fully parallel?
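The pattern described above can be checked mechanically. This Python sketch only tests which divisor reproduces each observed instance count; it is not taken from the qemu-coremark scripts, and the function name and candidate list are assumptions of mine:

```python
# (hardware threads, observed parallel instances) as quoted in this thread
OBSERVED = {
    "AmpereOne (192 cores)": (192, 47),
    "Xeon Platinum 8490H": (120, 59),
    "Xeon 6980P": (512, 127),
}


def matching_divisors(threads: int, instances: int, candidates=(1, 2, 4, 8)):
    """Return every divisor d for which threads // d - 1 == instances."""
    return [d for d in candidates if threads // d - 1 == instances]


if __name__ == "__main__":
    for name, (threads, instances) in OBSERVED.items():
        print(f"{name}: {instances} instances -> "
              f"threads / {matching_divisors(threads, instances)} - 1")
```

Only a divisor of 4 fits the AmpereOne and the 6980P, and only 2 fits the 8490H, so the three runs do not seem to have used the same sizing rule.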

@geerlingguy
Owner Author

@ThomasKaiser those are questions best directed at Ampere. I know coremark is one of the few benchmarks where their CPU really shines in general (compared to AMD/Intel), and running emulated code further puts their machine ahead of the others in this specific instance (Hyperthreading always muddies things too).

I think marketing-wise, Ampere would like to take some W's somewhere, and this is how they want to do it. It's a very niche case—where you have arm64 code you want to run, and you're running it on x86 servers, but I do get it. Some people run Ampere machines for CI for their arm stuff (targeting embedded), so this is the one case where it could give an indication that doing that is better native on Arm than emulated on your existing x86 servers.

Is it contrived? Yes. Is it useful? Well, maybe a little. I'm not going to give it much weight in my ultimate review of this hardware. (In fact, for full benchmarking, I am pointing people towards Phoronix and ServeTheHome, who have more extensive testing already).

@joespeed

joespeed commented Nov 21, 2024

@ThomasKaiser the purpose of qemu-coremark is simply to educate automakers and other developers of arm software of something that is pretty obvious to you and Jeff. If you are developing and testing arm software then it is better done on arm. Automakers do a ton of arm CI testing in arm64 emulators on x86. But emulation is slow and imperfect. You get better quality software faster when doing arm64 software development and testing on .... arm64. Here is a talk that @bexcran and I gave to the SOAFEE TSC about this.

So what qemu-coremark does is run Phoronix Test Suite CoreMark in as many 4-core arm64 SoCs as the host will support, emulated on x86 and virtualized on arm64. Automakers, Tier 1s and automotive ISVs do CI testing in large (no, larger) numbers of emulated or virtualized arm64 SoCs, actually automotive ECUs, so that they can run their 100,000s to millions of CI tests in parallel. Vehicles have 100s of millions of lines of code these days, so they must be able to massively parallelize the CI testing or it'll take a week. Apex.ai takes CI testing that consumes days on a physical automotive ECU and cranks through it in hours using arm64 runners with virtual automotive ECUs with QNX RTOS on racked Ampere arm64 servers. And because it is virtualized, not emulated, the quality or correctness of the test environment is much better.

Automakers already use coremark to understand the performance of their automotive ECUs (microprocessors) and MCUs (microcontrollers). So it is a reasonable way to help them get their head around the relative performance or throughput they could expect of CI testing in such environments. The absolute number is not important, what matters is the relative performance, e.g. that a System76 Thelio Astra arm64 developer desktop has around 5x the performance of a much more expensive EPYC server for the rather specific job of arm64 CI testing.

@geerlingguy
Owner Author

geerlingguy commented Nov 21, 2024

@joespeed - Thanks for the clarification!

It's similar to the whole 'Snapdragon has trouble running x86 games in Windows 11's emulation mode', but in reverse. Like I said in my previous comment, I think right now it's a bit of a niche use case, but not an invalid use case.

These high core count servers probably deserve some new benchmarking setups though — I've been considering setting up my Drupal site into a set of containers (already done, built for x86/arm64) and building a script that runs as many site instances as possible, then uses ab/wrk to hammer them for a long period of time. It's a little complicated to get it working reliably (and more importantly, equally across architectures), but hopefully I can get it working to the point I have a real-world test point that is useful for comparison (and something that I dealt with a lot estimating AWS/Azure/Google pricing for web projects).

@ThomasKaiser

ThomasKaiser commented Nov 22, 2024

> arm64 CI testing

Thanks, that explains this niche use case and is important for Ampere's target audience (not so much Jeff's ;)

> These high core count servers probably deserve some new benchmarking setups though

I personally use 7-zip's internal benchmark as a rough representation of these 'server workloads' (integer / memory latency). Over the last decade, when comparing against 'real workloads' on different machines, the scores were a pretty good representation of how those tasks ran on different hardware.

The Ampere machine here with its 192 cores shows a 1:156 ratio between single-threaded and multi score: 4783 vs. 745720 7-ZIP MIPS... which is simply excellent. Would be interesting how that compares to the Drupal setup :)
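The 1:156 figure and what it implies per core can be reproduced with a couple of lines of Python. The "efficiency" here is just the ratio divided by the core count, which is my own shorthand rather than anything 7-zip reports:

```python
CORES = 192
SINGLE_MIPS = 4783    # single-threaded 7-zip score quoted above
MULTI_MIPS = 745720   # multi-threaded 7-zip score quoted above

ratio = MULTI_MIPS / SINGLE_MIPS  # speedup over one thread
efficiency = ratio / CORES        # fraction of ideal linear scaling

if __name__ == "__main__":
    print(f"multi/single ratio: 1:{ratio:.0f}")
    print(f"scaling efficiency across {CORES} cores: {efficiency:.0%}")
```

That works out to roughly 81% of perfect linear scaling across all 192 cores.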

@geerlingguy
Owner Author

@ThomasKaiser - Indeed! At some point I'll get time to work on it again. Right now in crunch mode for some... other projects too.

@hrw

hrw commented Nov 24, 2024

@geerlingguy can you run my ArmCpuInfo in EFI shell and share results?

I have some idea and need such data from some Arm servers.

@geerlingguy
Owner Author

Blog post: https://www.jeffgeerling.com/blog/2024/ampereone-cores-are-new-mhz

Video: https://www.youtube.com/watch?v=t05OZAruyYY

@hrw I will try at some point, a bit busy right now but will try to remember later this month.
