UCP_AM_SEND_FLAG_COPY_HEADER not working as expected with protov2 #10424

Open
pentschev opened this issue Jan 16, 2025 · 1 comment · May be fixed by #10452
@pentschev
Contributor

Describe the bug

UCP_AM_SEND_FLAG_COPY_HEADER does not seem to work as expected with protov2: the header data arrives corrupted in the receiver callback, although the header length is correct. The issue is hard to reproduce. I've only managed to reproduce it with 8 or more processes that exchange peer information via AM upon connection, potentially generating up to 16 simultaneous send requests per process and 128 simultaneous requests across the whole cluster, with no message larger than 44 bytes. However, this requires a somewhat heavy stack, for which a standalone reproducer is not easy to write.

In a call with @ofirfarjun7 we determined that persisting the header on the sender side and removing UCP_AM_SEND_FLAG_COPY_HEADER resolves the issue, which suggests a bug in that feature in protov2. With protov1 the issue is not reproducible, either with or without UCP_AM_SEND_FLAG_COPY_HEADER.
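
For readers who want to apply the same workaround, here is a minimal sketch of what "persisting the header" can look like. It is not the UCXX code and it does not call UCX: `am_send_ctx_t`, `am_send_ctx_create`, `am_send_complete` and `am_send_ctx_destroy` are hypothetical names, and the status argument is a plain `int` standing in for `ucs_status_t`. In a real sender, `ctx->header` would be passed to `ucp_am_send_nbx()` without UCP_AM_SEND_FLAG_COPY_HEADER, and the callback and context would be registered through the `cb.send` and `user_data` fields of `ucp_request_param_t`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-request state: owns a private copy of the AM header. */
typedef struct {
    void   *header;         /* heap copy that outlives the caller's buffer */
    size_t  header_length;  /* number of bytes in the copy */
    int     completed;      /* set once the completion callback has run */
} am_send_ctx_t;

/* Copy the caller's header so it stays valid until the send completes.
 * Returns NULL on allocation failure. */
static am_send_ctx_t *am_send_ctx_create(const void *header, size_t length)
{
    am_send_ctx_t *ctx;

    assert(header != NULL || length == 0);

    ctx = calloc(1, sizeof(*ctx));
    if (ctx == NULL) {
        return NULL;
    }
    if (length > 0) {
        ctx->header = malloc(length);
        if (ctx->header == NULL) {
            free(ctx);
            return NULL;
        }
        memcpy(ctx->header, header, length);
    }
    ctx->header_length = length;
    return ctx;
}

/* Shaped like a send-completion callback (request, status, user_data).
 * Only at this point is it safe to release the retained header. */
static void am_send_complete(void *request, int status, void *user_data)
{
    am_send_ctx_t *ctx = user_data;

    (void)request;
    (void)status;
    free(ctx->header);
    ctx->header    = NULL;
    ctx->completed = 1;
}

/* Release the context itself, and the header if it is still held. */
static void am_send_ctx_destroy(am_send_ctx_t *ctx)
{
    if (ctx != NULL) {
        free(ctx->header);
        free(ctx);
    }
}
```

The point of the sketch is only the lifetime rule: the header copy is owned by the request context and freed no earlier than the completion callback. If the send completes immediately and no callback is invoked, the caller would have to release the context itself.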

Steps to Reproduce

  • Reproducing is difficult without a heavy stack. I can work on a Docker container if needed.
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
  • Environment variables used:
    • UCX_PROTO_ENABLE=n: always works
    • UCX_PROTO_ENABLE=y + UCX_TLS=rc_verbs,self,sm,cuda: runs into corrupted header data for 8+ processes
    • UCX_PROTO_ENABLE=y + UCX_TLS=rc_verbs,self,sm,cuda,tcp: runs into corrupted header data for 8+ processes
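
When switching between these configurations from a test harness rather than the shell, the variables can also be set in-process, as long as that happens before UCX reads its configuration (`ucp_config_read()` / `ucp_init()`). The helper below is a sketch under that assumption; `set_ucx_test_env` is a hypothetical name and only uses POSIX `setenv()`.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: select one of the configurations listed above.
 * enable_protov2 != 0 sets UCX_PROTO_ENABLE=y, otherwise UCX_PROTO_ENABLE=n.
 * tls is the transport list (e.g. "rc_verbs,self,sm,cuda"); pass NULL to
 * leave UCX_TLS untouched. Returns 0 on success, -1 if setenv() fails. */
static int set_ucx_test_env(int enable_protov2, const char *tls)
{
    /* An empty transport list is treated as a caller error. */
    assert(tls == NULL || tls[0] != '\0');

    if (setenv("UCX_PROTO_ENABLE", enable_protov2 ? "y" : "n", 1) != 0) {
        return -1;
    }
    if (tls != NULL && setenv("UCX_TLS", tls, 1) != 0) {
        return -1;
    }
    return 0;
}
```

For example, `set_ucx_test_env(1, "rc_verbs,self,sm,cuda")` corresponds to the first failing configuration above, and `set_ucx_test_env(0, NULL)` to the one that always works.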

Setup and versions

  • DGX-1 with 8 x NVIDIA V100
  • Linux dgx13 5.4.0-182-generic #202-Ubuntu SMP Fri Apr 26 12:29:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • MLNX_OFED_LINUX-23.10-2.1.3.1
  • NVIDIA driver: 535.161.08
  • CUDA 12.2
  • Built without gdrcopy support
  • nv_peer_mem module loaded

Additional information (depending on the issue)

  • OpenMPI version 5.0.6
pentschev added the Bug label Jan 16, 2025
pentschev added a commit to pentschev/ucxx that referenced this issue Jan 16, 2025
Retain a copy of headers for AM send requests as a workaround for possible
UCX bug openucx/ucx#10424.
rapids-bot bot pushed a commit to rapidsai/ucxx that referenced this issue Jan 16, 2025
Retain a copy of headers for AM send requests as a workaround for possible UCX bug openucx/ucx#10424.

Unfortunately, reproducing this is not straightforward and it wasn't observed in a stack that can be made into UCXX tests currently, so testing this is not possible at the moment.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #349
@tvegas1
Contributor

tvegas1 commented Jan 21, 2025

Could you please provide the output with UCX_PROTO_INFO=y? Could you also provide the UCX version used?
