"cannot allocate memory" issue when using HBM on Sapphire Rapids #9397

Open
tonycurtis opened this issue Oct 2, 2023 · 4 comments

@tonycurtis
Contributor

tonycurtis commented Oct 2, 2023

Describe the bug

See the error message below. NUMA node 8 (like nodes 9-15) is HBM-only (no CPUs).
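For anyone wanting to confirm which NUMA nodes on a machine have no CPUs, here is a minimal sketch. It assumes the usual Linux sysfs layout under `/sys/devices/system/node`, where each `nodeN/cpulist` file is empty for a CPU-less node. The function name `cpuless_nodes` is hypothetical, and the check cannot tell whether a node's memory is actually HBM.

```python
import os
import re


def cpuless_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the sorted IDs of NUMA nodes whose cpulist file is empty."""
    nodes = []
    for entry in os.listdir(sysfs_root):
        # Only directories named node<N> describe NUMA nodes.
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue
        cpulist_path = os.path.join(sysfs_root, entry, "cpulist")
        try:
            with open(cpulist_path) as fh:
                cpus = fh.read().strip()
        except OSError:
            # Skip nodes whose cpulist cannot be read.
            continue
        if not cpus:
            nodes.append(int(match.group(1)))
    return sorted(nodes)
```

On a system like the one described here, calling `cpuless_nodes()` should include node 8 in its result if the layout assumption holds.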

Steps to Reproduce

  • Command line
$ time mpirun -n 1 --map-by node numactl --membind=8 ./a.out 
[1696273494.180540] [xm035:578959:0]         ib_mlx5.h:900  UCX  ERROR mlx5dv_devx_umem_reg() failed: Cannot allocate memory
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)

Open MPI 4.1.5 built with UCX 1.16.0 from GitHub (checked out about 2 weeks ago, commit a07e5e1)
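When comparing many runs, it can help to pull the fields out of the UCX error line quoted above. The sketch below assumes the layout seen in that single line: timestamp, `host:pid:index`, `source_file:line`, the literal `UCX`, a level, then the message. The function name `parse_ucx_log_line` and the field names are hypothetical, and the meaning of the third bracketed number (called `tid` here) is an assumption.

```python
import re
from typing import Optional

# Pattern inferred from the one log line in this report, not from a UCX spec.
UCX_LOG_RE = re.compile(
    r"^\[(?P<timestamp>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<tid>\d+)\]\s+"
    r"(?P<file>\S+):(?P<line>\d+)\s+"
    r"UCX\s+(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)$"
)


def parse_ucx_log_line(text: str) -> Optional[dict]:
    """Split a UCX log line into its fields, or return None if it does not match."""
    match = UCX_LOG_RE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = float(fields["timestamp"])
    for key in ("pid", "tid", "line"):
        fields[key] = int(fields[key])
    return fields


sample = (
    "[1696273494.180540] [xm035:578959:0]         ib_mlx5.h:900  UCX  ERROR "
    "mlx5dv_devx_umem_reg() failed: Cannot allocate memory"
)
parsed = parse_ucx_log_line(sample)
```

Here `parsed["file"]` and `parsed["line"]` identify the UCX source location that reported the failure, which is handy for grouping errors across ranks.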

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • cat /etc/issue or cat /etc/redhat-release + uname -a
$ cat /etc/redhat-release
Rocky Linux release 9.2 (Blue Onyx)
$ uname -a
Linux xm036 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Sep 16 09:55:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rpm -q rdma-core or rpm -q libibverbs
      • or: MLNX_OFED version ofed_info -s
$ ofed_info -s
MLNX_OFED_LINUX-23.07-0.5.1.2:

@tonycurtis tonycurtis added the Bug label Oct 2, 2023
@yosefe
Contributor

yosefe commented Oct 3, 2023

@tonycurtis I assume it works without membind?

@tonycurtis
Contributor Author

Yes, fine without membind

@yosefe
Contributor

yosefe commented Oct 3, 2023

@tonycurtis we didn't consider the HBM-only case in UCX; I guess umem cannot be allocated on it.
BTW, does ib_send_bw (for example) work with a similar membind?

@tonycurtis
Contributor Author

ib_send_bw works
