UCP_AM_SEND_FLAG_COPY_HEADER not working as expected with protov2 #10424
Describe the bug
UCP_AM_SEND_FLAG_COPY_HEADER does not seem to work as expected with protov2: the header data seen in the receiver callback is corrupted, although the header length is correct. The issue is hard to reproduce. I have only managed to reproduce it with 8 or more processes that exchange peer information via AM upon connection, potentially generating up to 16 simultaneous send requests per process and 128 simultaneous requests across the whole cluster, with all messages no larger than 44 bytes. However, this requires a somewhat heavy stack, which makes it hard to write a standalone reproducer.

In a call with @ofirfarjun7 we determined that persisting the header and removing UCP_AM_SEND_FLAG_COPY_HEADER does resolve the issue, which suggests a bug in that feature with protov2. With protov1 the issue is not reproducible, either with or without UCP_AM_SEND_FLAG_COPY_HEADER.

Steps to Reproduce
ucx_info -v
UCX_PROTO_ENABLE=n: always works
UCX_PROTO_ENABLE=y + UCX_TLS=rc_verbs,self,sm,cuda: runs into corrupted header data for 8+ processes
UCX_PROTO_ENABLE=y + UCX_TLS=rc_verbs,self,sm,cuda,tcp: runs into corrupted header data for 8+ processes

Setup and versions
Additional information (depending on the issue)
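To make the header lifetime question in the description concrete, here is a minimal toy model in C. It is not UCX code and does not attempt to reproduce or explain the bug. Every name in it (toy_send_req_t, toy_send_start, toy_send_complete, toy_header_intact) is hypothetical. It only illustrates the contract I would expect: with a copy-header flag the sender may reuse its header buffer right after the send call returns, while without the flag the header has to stay untouched until the send completes, which is the workaround described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an in-flight AM send request. NOT a UCX type. */
typedef struct {
    const void *header;     /* bytes that will be read at completion time */
    size_t      header_len;
    void       *owned_copy; /* non-NULL when the "library" copied the header */
} toy_send_req_t;

/* Start a send. copy_header != 0 mimics the intent of
 * UCP_AM_SEND_FLAG_COPY_HEADER (an assumption about the intended contract,
 * not UCX's implementation). Returns 0 on success, -1 on allocation failure. */
int toy_send_start(toy_send_req_t *req, const void *header, size_t len,
                   int copy_header)
{
    assert(req != NULL && header != NULL);
    req->header_len = len;
    req->owned_copy = NULL;
    if (copy_header) {
        req->owned_copy = malloc(len);
        if (req->owned_copy == NULL) {
            return -1;
        }
        memcpy(req->owned_copy, header, len);
        req->header = req->owned_copy; /* caller's buffer is no longer needed */
    } else {
        req->header = header;          /* caller must keep the buffer intact */
    }
    return 0;
}

/* Complete the send: hand the header bytes to the "receiver callback" (out)
 * and release the private copy, if any. */
void toy_send_complete(toy_send_req_t *req, void *out)
{
    memcpy(out, req->header, req->header_len);
    free(req->owned_copy);
    req->owned_copy = NULL;
}

/* Returns 1 if the receiver sees the original header, 0 otherwise.
 * reuse_early != 0 means the sender overwrites its header buffer before
 * the send has completed. */
int toy_header_intact(int copy_header, int reuse_early)
{
    char hdr[8] = "peer-01"; /* 7 characters plus the terminating NUL */
    char seen[8];
    toy_send_req_t req;

    if (toy_send_start(&req, hdr, sizeof(hdr), copy_header) != 0) {
        return 0;
    }
    if (reuse_early) {
        memcpy(hdr, "XXXXXXX", sizeof(hdr)); /* sender reuses its buffer */
    }
    toy_send_complete(&req, seen);
    return memcmp(seen, "peer-01", sizeof(seen)) == 0;
}
```

In this model, toy_header_intact(1, 1) corresponds to using the flag and reusing the buffer early, toy_header_intact(0, 0) corresponds to the workaround of persisting the header without the flag, and toy_header_intact(0, 1) corresponds to breaking the contract. The corruption reported here looks like the last case even though the flag is set, but that is only an analogy, not a diagnosis.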