
Upgrade to 25 Gbps Ethernet #16

Closed
geerlingguy opened this issue Oct 29, 2024 · 9 comments

geerlingguy commented Oct 29, 2024

I purchased a PCIe Gen 4 SFP28 NIC with Intel E810-XXVAM2 on Amazon, and would like to install it in the server to get dual 25 Gbps Ethernet on the NAS.

Some of my other gear is starting to come online at 25G, and it would be nice to have a storage target capable of saturating the network!

Intel has a driver download page here: Intel® Network Adapter Driver for E810 Series Devices under Linux*

geerlingguy commented:

Interestingly, the SOL console on the BMC is spitting out errors like:

[   11.232990] Unable to handle kernel paging request at virtual address 0021a817ce8721ad
[   11.240897] Mem abort info:
[   11.243680]   ESR = 0x96000004
[   11.246733]   EC = 0x25: DABT (current EL), IL = 32 bits
[   11.252039]   SET = 0, FnV = 0
[   11.255082]   EA = 0, S1PTW = 0
[   11.258212] Data abort info:
[   11.261080]   ISV = 0, ISS = 0x00000004
[   11.264905]   CM = 0, WnR = 0
[   11.267862] [0021a817ce8721ad] address between user and kernel address ranges
[   11.274986] Internal error: Oops: 96000004 [#1] SMP
[   11.279852] Modules linked in: ast(+) drm_vram_helper ttm drm_kms_helper crct10dif_ce syscopyarea ghash_ce sysfillrect sysimgblt sha2_ce fb_sys_fops sha256_arm64 sha1_ce mpt3sas(+) drm nvme(+) ixgbe(+) raid_class igb(+) ice(+) nvme_core xfrm_algo scsi_transport_sas mdio i2c_algo_bit aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
[   11.310860] CPU: 0 PID: 205 Comm: kworker/0:2 Not tainted 5.4.0-198-generic #218-Ubuntu
[   11.318850] Hardware name: To Be Filled By O.E.M. ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 1.21 11/15/2023
...
[   11.427088] Call trace:
[   11.429522]  __kmalloc+0xac/0x2d0
[   11.432825]  rh_call_control+0x210/0x938
[   11.436735]  usb_hcd_submit_urb+0x14c/0x3e8
[   11.440906]  usb_submit_urb+0x198/0x590
[   11.444730]  usb_start_wait_urb+0x70/0x160
[   11.448814]  usb_control_msg+0xc4/0x140

This seems to happen after the ASPEED USB port tries initializing?

Another trace:

[   11.979019] ice 0004:01:00.0: The DDP package was successfully loaded: ICE OS Default Package version 1.3.4.0
[   11.989174] Unable to handle kernel paging request at virtual address 0021a817ce8721ad
[   11.997080] Mem abort info:
[   11.999862]   ESR = 0x96000004
[   12.002906]   EC = 0x25: DABT (current EL), IL = 32 bits
[   12.008206]   SET = 0, FnV = 0
[   12.011250]   EA = 0, S1PTW = 0
[   12.014379] Data abort info:
[   12.017247]   ISV = 0, ISS = 0x00000004
[   12.021072]   CM = 0, WnR = 0
[   12.024028] [0021a817ce8721ad] address between user and kernel address ranges
[   12.031152] Internal error: Oops: 96000004 [#2] SMP
[   12.036016] Modules linked in: hid_generic usbhid hid ast(+) drm_vram_helper ttm drm_kms_helper crct10dif_ce syscopyarea ghash_ce sysfillrect sysimgblt sha2_ce fb_sys_fops sha256_arm64 sha1_ce mpt3sas(+) drm nvme(+) ixgbe(+) raid_class igb(+) ice(+) nvme_core xfrm_algo scsi_transport_sas mdio i2c_algo_bit aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
[   12.069019] CPU: 0 PID: 13 Comm: kworker/0:1 Tainted: G      D           5.4.0-198-generic #218-Ubuntu
[   12.078311] Hardware name: To Be Filled By O.E.M. ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 1.21 11/15/2023
[   12.087518] Workqueue: events work_for_cpu_fn
[   12.091862] pstate: a0c00009 (NzCv daif +PAN +UAO)
[   12.096641] pc : kmem_cache_alloc_trace+0x94/0x278
[   12.101418] lr : kmem_cache_alloc_trace+0x6c/0x278
[   12.106195] sp : ffff800010253b20
[   12.109497] x29: ffff800010253b20 x28: 0000000000000000 
[   12.114796] x27: ffffaebc2bed51cc x26: 0000000000000068 
[   12.120094] x25: ffff680e28007c00 x24: ffffaebc2bed51cc 
[   12.125393] x23: 0000000000028a97 x22: 0000000000000dc0 
[   12.130691] x21: 0000000000000000 x20: ae21a817ce8721ad 
[   12.135990] x19: ffff680e28007c00 x18: ffffaebc2d108538 
[   12.141288] x17: 0000000088e09f7b x16: ffffaebc2c49baf0 
[   12.146586] x15: ffff680e28688530 x14: ffff800010c9f000 
[   12.151885] x13: ffff680e28a0fe00 x12: ffff800010bb5000 
[   12.157183] x11: ffffaebc2d8f43a0 x10: ffff800010bb0000 
[   12.162482] x9 : 0000000000000041 x8 : 0000000000004000 
[   12.167780] x7 : ffffaebc2ddf2818 x6 : ffff680e2815b428 
[   12.173079] x5 : ffffaebc2c460670 x4 : ffff680e2f9f91e0 
[   12.178377] x3 : 0000000000100070 x2 : ae21a817ce8721ad 
[   12.183676] x1 : 0000000000000000 x0 : 5197a916cf8535d2 
[   12.188974] Call trace:
[   12.191408]  kmem_cache_alloc_trace+0x94/0x278
[   12.195840]  alloc_msi_entry+0x3c/0x98
[   12.199578]  __pci_enable_msix_range.part.0+0x3a4/0x5b0
[   12.204790]  __pci_enable_msix_range+0x64/0x90
[   12.209221]  pci_enable_msix_range+0x48/0x58
[   12.213487]  ice_probe+0x6a4/0xc68 [ice]
[   12.217398]  local_pci_probe+0x48/0xa0
[   12.221135]  work_for_cpu_fn+0x24/0x38
[   12.224871]  process_one_work+0x1d0/0x498
[   12.228868]  worker_thread+0x238/0x528
[   12.232604]  kthread+0xf0/0x118
[   12.235733]  ret_from_fork+0x10/0x18
[   12.239296] Code: 54000e20 b9402261 f940ba60 8b010282 (f8616a81) 
[   12.245377] ---[ end trace 4029d97195803760 ]---

And then the system won't continue booting.
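If the ice probe is what's killing boot, one workaround (a sketch I haven't verified on this board) would be to blacklist the in-tree ice module so the system can finish booting, then load it manually for testing:

```shell
# Untested workaround sketch: stop the in-tree ice driver from probing at boot.
echo "blacklist ice" | sudo tee /etc/modprobe.d/blacklist-ice.conf
# Rebuild the initramfs so the blacklist also applies during early boot.
sudo update-initramfs -u
# After rebooting, the module can still be loaded by hand for testing:
# sudo modprobe ice
```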

geerlingguy commented:

I think I'm running Ubuntu 20.04 on the HL15... it might be worth attempting an upgrade to 24.04 :O

Otherwise, maybe I can manually install a newer Intel driver?
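For reference, Intel's out-of-tree ice driver builds roughly like this (a sketch from memory of Intel's README; the version number is a placeholder, and it needs kernel headers installed):

```shell
# Rough sketch of Intel's out-of-tree ice driver build.
# Prerequisite: sudo apt install linux-headers-$(uname -r) build-essential
tar xzf ice-1.x.y.tar.gz     # version is a placeholder
cd ice-1.x.y/src
make
sudo make install
sudo rmmod ice               # unload the in-tree driver first (fails harmlessly if not loaded)
sudo modprobe ice
```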

geerlingguy commented:

On Ampere's recommendation, I'm going to try a ConnectX-5 Mellanox card, the MCX512A-ACAT, instead.

Now I have a spare E810, ready to go into one of my Windows PCs :)

geerlingguy commented:

I have the X-5 installed, and it seems to enumerate correctly:

jgeerling@nas01:~$ dmesg | grep mlx5
[   10.642917] mlx5_core 0004:01:00.0: Adding to iommu group 29
[   10.643196] mlx5_core 0004:01:00.0: enabling device (0100 -> 0102)
[   10.643316] mlx5_core 0004:01:00.0: firmware version: 16.27.2048
[   10.643346] mlx5_core 0004:01:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[   11.071754] mlx5_core 0004:01:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[   11.085231] mlx5_core 0004:01:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[   11.097374] mlx5_core 0004:01:00.0: Port module event: module 0, Cable plugged
[   11.097626] mlx5_core 0004:01:00.0: mlx5_pcie_event:294:(pid 542): PCIe slot advertised sufficient power (75W).
[   11.108877] mlx5_core 0004:01:00.1: Adding to iommu group 31
[   11.121246] mlx5_core 0004:01:00.1: enabling device (0100 -> 0102)
[   11.138453] mlx5_core 0004:01:00.1: firmware version: 16.27.2048
[   11.144495] mlx5_core 0004:01:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[   11.451304] mlx5_core 0004:01:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[   11.460281] mlx5_core 0004:01:00.1: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[   11.484789] mlx5_core 0004:01:00.1: Port module event: module 1, Cable unplugged
[   11.492484] mlx5_core 0004:01:00.1: mlx5_pcie_event:294:(pid 545): PCIe slot advertised sufficient power (75W).
[   11.516633] mlx5_core 0004:01:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[   11.787574] mlx5_core 0004:01:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[   54.125946] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
[   54.145352] mlx5_core 0004:01:00.0 enP4p1s0f0: renamed from eth0
[   54.234030] mlx5_core 0004:01:00.1 enP4p1s0f1: renamed from eth1

It's not getting an IP address automatically, though. Not sure why.
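A couple of quick checks to narrow down whether this is a link problem or a DHCP problem (interface names taken from the dmesg output above; tcpdump assumed installed):

```shell
# Is the link physically up on either port?
ip -br link show enP4p1s0f0 enP4p1s0f1
# What does the NIC itself report?
sudo ethtool enP4p1s0f1 | grep -E 'Speed|Link detected'
# Is any DHCP traffic flowing at all?
sudo tcpdump -c 5 -ni enP4p1s0f1 'port 67 or port 68'
```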


geerlingguy commented Nov 1, 2024

Not detecting a link...

jgeerling@nas01:~$ ethtool enP4p1s0f1
Settings for enP4p1s0f1:
	Supported ports: [ Backplane ]
	Supported link modes:   1000baseKX/Full 
	                        10000baseKR/Full 
	                        25000baseCR/Full 
	                        25000baseKR/Full 
	                        25000baseSR/Full 
	Supported pause frame use: Symmetric
	Supports auto-negotiation: Yes
	Supported FEC modes: None BaseR RS
	Advertised link modes:  1000baseKX/Full 
	                        10000baseKR/Full 
	                        25000baseCR/Full 
	                        25000baseKR/Full 
	                        25000baseSR/Full 
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: None
	Speed: Unknown!
	Duplex: Unknown! (255)
	Port: Direct Attach Copper
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
Cannot get wake-on-lan settings: Operation not permitted
	Current message level: 0x00000004 (4)
			       link
	Link detected: no

I'm using a 10Gtek 25G SFP28 DAC - 3m, 30AWG, Passive... I wonder if this DAC isn't able to work with the card? Weird.

When I plug in the DAC, I see these changes:

Supported FEC modes: None BaseR RS  # was 'Not Reported'
Advertised FEC modes: None  # was 'Not Reported'
Port: Direct Attach Copper  # was 'Other'

But it still says Link detected: no.

On the switch (Mikrotik 25G), I'm seeing the link as negotiated at 25G:

[Screenshot, 2024-11-01 9:52 AM: Mikrotik switch UI showing the port negotiated at 25G]
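One thing worth trying (assuming the ethtool shipped with 20.04 is new enough to support FEC options): 25G DACs often need the FEC mode to match on both ends, and the card is advertising FEC 'None' above, while a 3m 25GBASE-CR DAC typically wants RS-FEC. A sketch of forcing it on the NIC side:

```shell
# Check what FEC mode the NIC is currently using/advertising.
sudo ethtool --show-fec enP4p1s0f1
# Try forcing RS-FEC (common requirement for longer 25G DACs);
# 'baser' and 'off' are the other encodings to cycle through.
sudo ethtool --set-fec enP4p1s0f1 encoding rs
```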

geerlingguy commented:

Strangely, at some point this morning, it looks like the Intel interfaces were giving a bunch of errors:

[49658.032378] pcieport 0003:00:03.0: AER: Corrected error message received from 0003:03:00.0
[49658.032388] ixgbe 0003:03:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[49658.042484] ixgbe 0003:03:00.0: AER:   device [8086:1563] error status/mask=00001000/00002000
[49658.051003] ixgbe 0003:03:00.0: AER:    [12] Timeout 

And the Mellanox driver is detecting cable hotplugs:

[58773.924177] mlx5_core 0004:01:00.0: Port module event: module 0, Cable unplugged
[58783.083281] mlx5_core 0004:01:00.1: Port module event: module 1, Cable plugged

Since this is Ubuntu 20.04 without NetworkManager (so no nmcli), I ran:

sudo ip link set enP4p1s0f1 down
sudo ip link set enP4p1s0f1 up

And dmesg shows:

[59410.830394] mlx5_core 0004:01:00.1 enP4p1s0f1: Link up
[59410.833804] IPv6: ADDRCONF(NETDEV_CHANGE): enP4p1s0f1: link becomes ready

While ip a shows:

4: enP4p1s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 6c:b3:11:29:4d:43 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::6eb3:11ff:fe29:4d43/64 scope link 
       valid_lft forever preferred_lft forever

So now it's getting IPv6, but not IPv4...

geerlingguy commented:

Also grabbing hardware details with sudo lshw -C network:

  *-network:0 DISABLED
       description: Ethernet interface
       product: MT27800 Family [ConnectX-5]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0004:01:00.0
       logical name: enP4p1s0f0
       version: 00
       serial: 6c:b3:11:29:4d:42
       capacity: 25Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list ethernet physical 1000bt-fd 10000bt-fd 25000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core firmware=16.27.2048 (MT_0000000080) latency=0 link=no multicast=yes
       resources: iomemory:28000-27fff irq:89 memory:280000000000-280001ffffff memory:280004000000-2800047fffff
  *-network:1
       description: Ethernet interface
       product: MT27800 Family [ConnectX-5]
       vendor: Mellanox Technologies
       physical id: 0.1
       bus info: pci@0004:01:00.1
       logical name: enP4p1s0f1
       version: 00
       serial: 6c:b3:11:29:4d:43
       capacity: 25Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list ethernet physical 1000bt-fd 10000bt-fd 25000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core duplex=full firmware=16.27.2048 (MT_0000000080) latency=0 link=yes multicast=yes
       resources: iomemory:28000-27fff irq:260 memory:280002000000-280003ffffff memory:280004800000-280004ffffff


geerlingguy commented Nov 1, 2024

Huh. Forcing a release/renew grabbed an IP for the interface:

sudo dhclient -r enP4p1s0f1
sudo dhclient enP4p1s0f1

jgeerling@nas01:~$ ip a
...
4: enP4p1s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 6c:b3:11:29:4d:43 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.236/24 brd 10.0.2.255 scope global dynamic enP4p1s0f1
       valid_lft 7199sec preferred_lft 7199sec
    inet6 fe80::6eb3:11ff:fe29:4d43/64 scope link 
       valid_lft forever preferred_lft forever

Now the question is, will the configuration persist across a reboot?

geerlingguy commented:

Nope. But following this Stack Exchange answer, I did the following to make the new card's Ethernet interfaces persist with IPv4 DHCP across reboots:

$ sudo nano /etc/netplan/00-installer-config.yaml

# Add in the interfaces among the others and save:
    enP4p1s0f0:
      dhcp4: true
    enP4p1s0f1:
      dhcp4: true

$ sudo netplan apply
$ sudo dhclient -r enP4p1s0f1
$ sudo dhclient enP4p1s0f1

And now even after a reboot, I'm getting full 25 Gbps bandwidth, yay!

jgeerling@nas01:~$ sudo ethtool enP4p1s0f1
Settings for enP4p1s0f1:
	Supported ports: [ Backplane ]
	Supported link modes:   1000baseKX/Full 
	                        10000baseKR/Full 
	                        25000baseCR/Full 
	                        25000baseKR/Full 
	                        25000baseSR/Full 
	Supported pause frame use: Symmetric
	Supports auto-negotiation: Yes
	Supported FEC modes: None BaseR RS
	Advertised link modes:  1000baseKX/Full 
	                        10000baseKR/Full 
	                        25000baseCR/Full 
	                        25000baseKR/Full 
	                        25000baseSR/Full 
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: None
	Speed: 25000Mb/s
	Duplex: Full
	Port: Direct Attach Copper
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000004 (4)
			       link
	Link detected: yes

Full docs on Ubuntu's docs site: Configuring networks

I guess 00-installer-config.yaml is created at system install time, and since this card wasn't present then, its interfaces don't show up there. Ah well. I could create a 99-mellanox.yaml and tack the interfaces on that way, but as this hardware change is likely permanent(ish), I'm happy just throwing the config in the installer file.
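For reference, that drop-in alternative would look something like this (filename and contents are my own sketch, not something I've applied; netplan merges all files in /etc/netplan/ in lexical order):

```yaml
# /etc/netplan/99-mellanox.yaml -- hypothetical drop-in, not applied here
network:
  version: 2
  ethernets:
    enP4p1s0f0:
      dhcp4: true
    enP4p1s0f1:
      dhcp4: true
```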
