AmpereOne A192-32X (Supermicro) #52
Getting full 25 Gbps Ethernet on the 2nd interface:
If I try running Geekbench 6 I get a core dump, lol:
I opened up a support issue for that: Can't run Geekbench 6 Arm Preview on AmpereOne 192-core system
And yes, I know this system is not really an SBC. I still want to test it against Arm SBCs, though ;)
Jeff, if time permits could you please check this:
Background: the CPU cores should be capable of MTE, but your machine doesn't expose the feature via
No GPU in it, but can you check it with some AMD/NVIDIA graphics cards?
@hrw - I'd love to find a way to get a test sample of one of AMD's or Nvidia's enterprise server cards; right now the best fit I have is an older Quadro RTX card, but it won't fit in this chassis. @ThomasKaiser I'll try to run that next time I have the server booted (remind me if I forget next week). I shut it down over the weekend, and a boot cycle takes 5-10 minutes, so I'm too lazy to sit and wait today for one command!
@geerlingguy "Add a PCIe x16 riser cable to your shopping list" was my first idea, but then I realized that the server case would lack power cables for a GPU as well.
@hrw - The server actually includes 2x 8-pin PCIe power connectors; it's designed for up to one fanless GPU (which needs high CFM to keep cool).
It looks like one stick of RAM was spewing errors; see geerlingguy/top500-benchmark#43 (comment). I've re-seated that RAM module (
While attempting qemu-coremark, I'm getting an error during setup: meson setup fails with 'Dependency "glib-2.0" not found'
Had to install
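Meson dependency lookups like the glib-2.0 one above normally go through pkg-config, so one quick way to confirm the dependency is visible before re-running meson setup is to ask pkg-config directly. A minimal Python sketch, assuming pkg-config is the lookup method in use (meson can also try other methods); the helper name `pkg_config_has` is hypothetical:

```python
import shutil
import subprocess


def pkg_config_has(module: str) -> bool:
    """Return True if pkg-config can locate `module` (e.g. "glib-2.0")."""
    if shutil.which("pkg-config") is None:
        # Without pkg-config, a pkg-config-based meson lookup cannot succeed either.
        return False
    # `pkg-config --exists` exits 0 when the module's .pc file is found.
    result = subprocess.run(["pkg-config", "--exists", module], capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    print("glib-2.0 visible to pkg-config:", pkg_config_has("glib-2.0"))
```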
I noticed when I run
Watching the SOL Console today, I saw tons of errors like:
So it looks like that DIMM is throwing a bunch of errors, maybe causing the Ethernet driver to throw other errors?
It's still always
I saw the shutdown of an AmpereOne machine I was testing take a really long time too, due to the Broadcom Ethernet driver. But I didn’t see any of the DRAM or APEI issues, so I’m not sure they’re related.
Hmm, maybe that's it then; those messages kept popping up amidst all the DIMM messages. Might be nice to figure out how to fix the
Testing a RAID 0 array of all the NVMe drives following my guide:
Running my disk benchmark on the array...
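For reference, a RAID 0 array like this one is typically created with mdadm's `--create` mode. This Python sketch only builds the command line; the device names are hypothetical placeholders (check `lsblk` for the real ones), and it is not necessarily the exact procedure from the guide mentioned above:

```python
import shlex


def mdadm_raid0_command(devices: list[str], md_device: str = "/dev/md0") -> list[str]:
    """Build the argv for creating a striped (RAID 0) md array from whole devices.

    Running the resulting command (as root) destroys existing data on every device.
    """
    if len(devices) < 2:
        raise ValueError("RAID 0 needs at least two devices")
    return [
        "mdadm", "--create", "--verbose", md_device,
        "--level=0", f"--raid-devices={len(devices)}", *devices,
    ]


if __name__ == "__main__":
    # Hypothetical device names; list the real ones with `lsblk` first.
    nvme = [f"/dev/nvme{i}n1" for i in range(4)]
    print(shlex.join(mdadm_raid0_command(nvme)))
```

The array still needs a filesystem and a mount point afterwards before a disk benchmark can run against it.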
Ampere sent over a replacement DIMM, and it seems to have corrected all the memory issues. However, shutdown is still excruciating: timing this shutdown cycle, it took 15+ minutes, and I just see tons of Ethernet NIC errors (see below for a snippet), maybe a bug in the
Could you try blacklisting the
From https://utcc.utoronto.ca/~cks/space/blog/linux/BroadcomNetworkDriverAndRDMA?showcomments :
@bexcran - Will try that. After waiting 15 minutes, I just forced an immediate shutdown so I could finally power cycle. (On the plus side, power-on is much faster now that a DIMM isn't spewing out errors all the time.)
@geerlingguy That's what I'd do too! If you want to be a bit nicer to your system/filesystem and you have "Magic SysRq Keys" enabled, you can do: ALT+PrintScreen+s,u,b. That is, press and hold ALT and SysRq (probably labeled PrintScr on your keyboard instead of SysRq) while pressing 's', then 'u', then 'b', without letting go of ALT and SysRq. That'll sync data to disk, attempt to remount filesystems read-only, and then reboot.
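The same s, u, b sequence can be sent without a keyboard by writing the command keys to /proc/sysrq-trigger as root; per the kernel's SysRq documentation, that file works even when the kernel.sysrq sysctl restricts the keyboard combination. A minimal Python sketch, where the function name and the pause between steps are assumptions (the emergency sync runs asynchronously, so waiting a moment before remounting is a cautious choice). It defaults to a dry run, because the final 'b' reboots immediately:

```python
import time
from pathlib import Path

# Same order as ALT+SysRq+s, u, b: emergency sync, remount read-only, reboot.
SYSRQ_SEQUENCE = ("s", "u", "b")


def sysrq_sync_remount_reboot(trigger: str = "/proc/sysrq-trigger",
                              dry_run: bool = True, pause: float = 2.0) -> list[str]:
    """Send s, u, b to the kernel's SysRq trigger file (requires root).

    With dry_run=True (the default) nothing is written; the planned
    shell-equivalent commands are returned instead.
    """
    planned = [f"echo {key} > {trigger}" for key in SYSRQ_SEQUENCE]
    if dry_run:
        return planned
    for key in SYSRQ_SEQUENCE:
        Path(trigger).write_text(key)  # the kernel acts on each key as soon as it is written
        if key != "b":
            time.sleep(pause)  # give the asynchronous sync/remount a moment to finish
    return planned  # not normally reached: 'b' reboots without a clean shutdown


if __name__ == "__main__":
    print("\n".join(sysrq_sync_remount_reboot()))
```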
I added
Now, it reaches
I guess if I ever need InfiniBand over Ethernet, I can figure out that
@bexcran - Is there any simple way of switching the kernel I'm booting here? I would like to try the 4K-page kernel just to see if Geekbench will complete a run, but the default kernel it's running right now (for performance reasons) uses 64K pages.
@geerlingguy Sorry, I don't know. |
I may do a reinstall of the OS on a separate drive just to do that test then. |
Also, now that I have my Ampere Altra 32-core NAS server upgraded to 25 Gbps Ethernet (geerlingguy/arm-nas#16), I can finally run the
I've noticed some variance, and it's hard to tell whether it's coming from the NAS side, the AmpereOne side, or my Cloud Router. None of them are showing 100% CPU utilization, and watching on
Testing a copy over SMB from one of the NVMe drives on this system to the NVMe storage on the HL15:
Not quite as fast as I was hoping, but this is dealing with SMB + Ethernet + rsync overhead, and I saw it going between 8-15 Gbps on the NAS. Interestingly, the copy back was noticeably slower (by about 1 Gbps). Testing with
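To put the 8-15 Gbps SMB figures in perspective, converting them to bytes per second and to a fraction of the 25 Gbps link rate is simple arithmetic (decimal units, ignoring protocol overhead):

```python
LINK_GBPS = 25  # nominal rate of the 25 Gbps links between the two machines


def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (decimal units, 8 bits per byte)."""
    return gbps / 8


for observed in (8, 15):  # range seen on the NAS during the SMB + rsync copy
    print(f"{observed} Gbps = {gbps_to_gbytes_per_s(observed):.2f} GB/s "
          f"({observed / LINK_GBPS:.0%} of line rate)")
```

So the copy moved roughly 1-1.9 GB/s, about a third to three fifths of what the link can carry.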
To switch kernels on Ubuntu, I did the following:
Get a listing of all the installed kernels:
Edit the GRUB configuration.
Technically I could hit
After reboot:
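The kernel-switching steps can be sketched in Python, assuming Ubuntu's default GRUB menu titles (`Advanced options for Ubuntu>Ubuntu, with Linux <version>` is the submenu>entry form that GRUB_DEFAULT accepts) and kernel images named /boot/vmlinuz-<version>. The helper names are hypothetical, and the sketch only transforms text; writing /etc/default/grub and running `sudo update-grub` remain manual steps:

```python
import os
import re
from pathlib import Path


def installed_kernels(boot_dir: str = "/boot") -> list[str]:
    """Kernel versions that have a vmlinuz-<version> image in boot_dir."""
    prefix = "vmlinuz-"
    return sorted(p.name[len(prefix):] for p in Path(boot_dir).glob(prefix + "*"))


def grub_default_entry(version: str, distro: str = "Ubuntu") -> str:
    """GRUB_DEFAULT value for a kernel inside the 'Advanced options' submenu.

    Assumes Ubuntu's default menu titles; check /boot/grub/grub.cfg if yours differ.
    """
    return f"Advanced options for {distro}>{distro}, with Linux {version}"


def set_grub_default(config_text: str, entry: str) -> str:
    """Return /etc/default/grub contents with GRUB_DEFAULT pointing at `entry`."""
    new_line = f'GRUB_DEFAULT="{entry}"'
    updated, count = re.subn(r"^GRUB_DEFAULT=.*$", lambda _m: new_line,
                             config_text, flags=re.MULTILINE)
    if count:
        return updated
    return config_text.rstrip("\n") + "\n" + new_line + "\n"  # append if missing


if __name__ == "__main__":
    print("installed:", installed_kernels())
    # After writing the new config: `sudo update-grub`, reboot, then confirm the page size
    # (4096 bytes on a 4K-page kernel, 65536 on a 64K-page kernel).
    print("page size:", os.sysconf("SC_PAGE_SIZE"))
```

Using a lambda as the replacement keeps any backslashes in the entry title from being treated as regex escapes.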
Now that I have the kernel back to a 4K page size, I ran the Geekbench 6 test. I noticed someone else ran one on the same motherboard/CPU in May: https://browser.geekbench.com/v6/cpu/6131970 (14435 multicore vs. my 15160; single core is spot on at 1309). Geekbench 6 is horrible with this many cores; it didn't seem to get even halfway to full multicore performance. Geekbench 5 at least pegs all the cores and hits 600 W for some of the tests.
Geekbench 6 is on the left:
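For the Geekbench 6 comparison, the gap between the two multicore results works out to about 5%, while single-core is identical (both runs reported 1309):

```python
# Geekbench 6 scores quoted above: the May run vs. this run.
may_run = {"single": 1309, "multi": 14435}
this_run = {"single": 1309, "multi": 15160}


def percent_diff(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old` (negative if it is lower)."""
    return (new - old) / old * 100.0


for test in ("single", "multi"):
    print(f"{test}-core: {percent_diff(this_run[test], may_run[test]):+.1f}%")
```

This prints +0.0% for single-core and +5.0% for multi-core.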
I wanted to get more numbers for the
From Wendell / L1Techs, running on Sapphire Rapids (60 cores / 120 threads / 1.90 GHz):
And on Granite Rapids (2x 128-core Intel Xeon 6: 256 cores / 512 threads / 2-3.9 GHz):
(Compared to the AmpereOne A192-32X at
Why exactly? What is the purpose of running the arm64 version of CoreMark emulated on x86? Why is your 192-core system executing 47 parallel instances, the 120-thread Platinum 8490H executing 59, and the 512-thread Xeon 6980P 127? The Ampere and the 6980P seem to fire up
@ThomasKaiser those are questions best directed at Ampere. I know CoreMark is one of the few benchmarks where their CPU really shines in general (compared to AMD/Intel), and running emulated code puts their machine even further ahead of the others in this specific instance (Hyperthreading always muddies things too). I think marketing-wise, Ampere would like to take some W's somewhere, and this is how they want to do it. It's a very niche case, where you have arm64 code you want to run and you're running it on x86 servers, but I do get it. Some people run Ampere machines as CI for their Arm stuff (targeting embedded), so this is the one case where it could indicate that running that code natively on Arm beats emulating it on your existing x86 servers. Is it contrived? Yes. Is it useful? Well, maybe a little. I'm not going to give it much weight in my ultimate review of this hardware. (In fact, for full benchmarking, I am pointing people towards Phoronix and ServeTheHome, who already have more extensive testing.)
@ThomasKaiser the purpose of qemu-coremark is simply to educate automakers and other developers of Arm software about something that is pretty obvious to you and Jeff: if you are developing and testing Arm software, then it is better done on Arm. Automakers do a ton of Arm CI testing in arm64 emulators on x86, but emulation is slow and imperfect. You get better-quality software faster when doing arm64 software development and testing on... arm64. Here is a talk that @bexcran and I gave to the SOAFEE TSC about this.

So what qemu-coremark does is run Phoronix Test Suite CoreMark in as many 4-core arm64 SoCs as the host will support, emulated on x86 and virtualized on arm64. Automakers, Tier 1s, and automotive ISVs do CI testing in large (no, larger) numbers of emulated or virtualized arm64 SoCs (actually automotive ECUs) so that they can run their hundreds of thousands to millions of CI tests in parallel. Vehicles have hundreds of millions of lines of code these days, so CI testing must be massively parallelized or it'll take a week. Apex.ai takes CI testing that consumes days on a physical automotive ECU and cranks through it in hours using arm64 runners with virtual automotive ECUs running QNX RTOS on racked Ampere arm64 servers. And because it is virtualized, not emulated, the quality or correctness of the test environment is much better.

Automakers already use CoreMark to understand the performance of their automotive ECUs (microprocessors) and MCUs (microcontrollers), so it is a reasonable way to help them get their heads around the relative performance or throughput they could expect of CI testing in such environments. The absolute number is not important; what matters is the relative performance, e.g. that a System76 Thelio Astra arm64 developer desktop has around 5x the performance of a much more expensive EPYC server for the rather specific job of arm64 CI testing.
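The relative-throughput idea can be written as a small calculation: pack as many 4-core guests as the host allows, multiply by the per-guest CoreMark score, and compare hosts by ratio. This is a simplified model with hypothetical helper names, not qemu-coremark's actual logic; the instance counts quoted earlier in this thread (47 on the 192-core machine rather than 48) show the tool leaves some headroom, and per-guest scores have to come from real runs:

```python
def max_instances(host_cpus: int, cores_per_instance: int = 4) -> int:
    """Upper bound on 4-core guests that fit without oversubscribing host CPUs."""
    return host_cpus // cores_per_instance


def aggregate_score(instances: int, per_instance_score: float) -> float:
    """Total CoreMark throughput when all guests run in parallel."""
    return instances * per_instance_score


def relative_throughput(host_a_total: float, host_b_total: float) -> float:
    """How many times more parallel CI work host A gets through than host B."""
    return host_a_total / host_b_total
```

For example, `max_instances(192)` is 48, one more than the 47 instances actually launched on this system.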
@joespeed - Thanks for the clarification! It's similar to the whole 'Snapdragon has trouble running x86 games in Windows 11's emulation mode' situation, but in reverse. Like I said in my previous comment, I think right now it's a bit of a niche use case, but not an invalid one. These high-core-count servers probably deserve some new benchmarking setups, though. I've been considering setting up my Drupal site as a set of containers (already done, built for x86/arm64) and building a script that runs as many site instances as possible, then uses ab/wrk to hammer them for a long period of time. It's a little complicated to get it working reliably (and, more importantly, equally across architectures), but hopefully I can get it to the point where I have a real-world test that is useful for comparison (and something I dealt with a lot when estimating AWS/Azure/Google pricing for web projects).
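The multi-instance load test described in that plan could start with something like this Python sketch, which builds one wrk command per site instance and runs them concurrently. It assumes (hypothetically) that instance i listens on localhost at base_port + i; the thread, connection, and duration values are placeholders passed through wrk's -t, -c, and -d options:

```python
import subprocess


def wrk_commands(instances: int, base_port: int = 8080, threads: int = 4,
                 connections: int = 64, duration: str = "5m") -> list[list[str]]:
    """One wrk command per site instance; instance i is assumed to listen on base_port + i."""
    return [
        ["wrk", f"-t{threads}", f"-c{connections}", f"-d{duration}",
         f"http://127.0.0.1:{base_port + i}/"]
        for i in range(instances)
    ]


def run_all(commands: list[list[str]]) -> list[str]:
    """Start every wrk process at once, then collect each one's text report."""
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) for cmd in commands]
    return [proc.communicate()[0] for proc in procs]


if __name__ == "__main__":
    for report in run_all(wrk_commands(instances=8)):
        print(report)
```

Starting all processes before waiting on any of them keeps the load concurrent across instances rather than hitting them one after another.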
Thanks, that explains this niche use case and is important for Ampere's target audience (not so much Jeff's ;)
I personally use 7-Zip's internal benchmark as a rough representation of these 'server workloads' (integer / memory latency), and so far, over the last decade, its scores have been a pretty good representation of 'real workloads' when comparing tasks that could run on different machines. The Ampere machine here with its 192 cores shows a 1:156 ratio between single-threaded and multi-threaded scores: 4783 vs. 745720 7-Zip MIPS... which is simply excellent. Would be interesting to see how that compares to the Drupal setup :)
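The 1:156 figure follows directly from the two 7-Zip scores; dividing by the core count also gives a rough scaling efficiency, assuming the multi-threaded run used all 192 cores:

```python
# 7-Zip benchmark results (MIPS) quoted in the comment above.
single_thread_mips = 4783
multi_thread_mips = 745720
cores = 192  # assumes the multi-threaded run used every core

speedup = multi_thread_mips / single_thread_mips  # ~155.9, i.e. the "1:156" ratio
efficiency = speedup / cores  # fraction of perfect linear scaling

print(f"speedup: {speedup:.1f}x, scaling efficiency: {efficiency:.1%}")
```

This prints a speedup of about 155.9x and roughly 81% scaling efficiency.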
@ThomasKaiser - Indeed! At some point I'll get time to work on it again. Right now in crunch mode for some... other projects too. |
@geerlingguy can you run my ArmCpuInfo in the EFI shell and share the results? I have an idea and need such data from some Arm servers.
Blog post: https://www.jeffgeerling.com/blog/2024/ampereone-cores-are-new-mhz
Video: https://www.youtube.com/watch?v=t05OZAruyYY
@hrw I will try at some point; I'm a bit busy right now, but I'll try to remember later this month.
Basic information
Linux/system information
Benchmark results
CPU
Power
sensors
stress-ng --matrix 0: 500 W
top500 HPL benchmark: 692 W
Disk
Samsung NVMe SSD - 983 DCT M.2 960GB
Samsung NVMe SSD - MZQL21T9HCJR-00A07
Specs: https://semiconductor.samsung.com/ssd/datacenter-ssd/pm9a3/mzql21t9hcjr-00a07/
Single disk
RAID 0 (mdadm)
Network
iperf3 results:
iperf3 -c $SERVER_IP: 21.4 Gbps
iperf3 -c $SERVER_IP --reverse: 18.8 Gbps
iperf3 -c $SERVER_IP --bidir: 8.08 Gbps up, 22.2 Gbps down
Tested on one of the two built-in Broadcom BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller interfaces, to my HL15 Arm NAS (see: geerlingguy/arm-nas#16), routed through a MikroTik 25G Cloud Router.
GPU
Did not test - this server doesn't have a GPU, just the ASPEED integrated BMC VGA graphics, which are not suitable for much GPU-accelerated gaming or LLMs, lol. Just render it on CPU!
Memory
tinymembench results:
sbc-bench results: https://0x0.st/X0gc.bin
See: ThomasKaiser/sbc-bench#105
Phoronix Test Suite
Results from pi-general-benchmark.sh:
Additional benchmarks
QEMU Coremark
The Ampere team suggested running this, as it runs tons of virtual instances with CoreMark inside, a good proxy for the type of performance you can get with VMs/containers on this system: https://github.com/AmpereComputing/qemu-coremark
llama.cpp (Ampere-optimized)
See: https://github.com/AmpereComputingAI/llama.cpp (I also have an email from Ampere with some testing notes).
Ollama (generic LLMs)
See: https://github.com/geerlingguy/ollama-benchmark?tab=readme-ov-file#findings
yolo-v5
See: https://github.com/AmpereComputingAI/yolov5-demo (maybe test it on a 4K60 video, see how it fares).