[BUG] UCX error Message truncated observed with UCX 1.11 RC in Q77 NDS #2892

Closed
abellina opened this issue Jul 9, 2021 · 5 comments
Labels
bug (Something isn't working), P0 (Must have for release), shuffle (things that impact the shuffle plugin)

Comments

@abellina
Collaborator

abellina commented Jul 9, 2021

We are seeing an error in Q77 NDS at 3TB:

 jucx_common_def.cc:257  UCX  ERROR JUCX: request error: Message truncated

I used these UCX settings:

spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy
spark.executorEnv.UCX_ERROR_SIGNALS=
spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA=yes
spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024
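To reproduce with the same environment, the four settings above can be passed to `spark-submit` as `--conf` arguments. This is a minimal sketch: the keys and values are copied from this report, while the helper name `ucx_conf_args` is hypothetical and only there for illustration. The empty value of `UCX_ERROR_SIGNALS` is kept as-is.

```python
# Settings copied verbatim from this report; the helper below is illustrative only.
UCX_SETTINGS = {
    "spark.executorEnv.UCX_RNDV_SCHEME": "put_zcopy",
    "spark.executorEnv.UCX_ERROR_SIGNALS": "",  # intentionally empty, as in the report
    "spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA": "yes",
    "spark.executorEnv.UCX_IB_RX_QUEUE_LEN": "1024",
}


def ucx_conf_args(settings):
    """Return a flat list of spark-submit arguments: ['--conf', 'key=value', ...]."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the arguments so they can be pasted into a spark-submit command line.
    print(" ".join(ucx_conf_args(UCX_SETTINGS)))
```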

This error only happens with the latest 21.08 nightly (rapids-4-spark_2.12-21.08.0-20210708.152651-40.jar) build and UCX 1.11 RC (https://github.com/openucx/ucx/releases/tag/v1.11.0-rc3).

Reverting UCX to 1.10.1 (https://github.com/openucx/ucx/releases/tag/v1.10.1) works. So this appears to be a regression in UCX 1.11, or at least a bad interplay between JUCX 1.10.1 (as opposed to JUCX 1.11.0-RC3) and the native bits.

UCX 1.11 allows logging, and the logs are just showing:

ucp_request.c:524  UCX  DEBUG message truncated: recv_length 8230 offset 0 buffer_size 4480
 jucx_common_def.cc:257  UCX  ERROR JUCX: request error: Message truncated
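When scanning executor logs for more occurrences, it can help to pull the three numbers out of the debug line and see by how much the incoming message exceeds the posted buffer. The sketch below assumes the log format is exactly the one quoted above; the names `TRUNCATED_RE` and `parse_truncation` are hypothetical.

```python
import re

# Matches the UCX debug line quoted in this report (format assumed from that one sample).
TRUNCATED_RE = re.compile(
    r"message truncated: recv_length (\d+) offset (\d+) buffer_size (\d+)"
)


def parse_truncation(line):
    """Return the three logged fields plus the overrun, or None if the line does not match."""
    match = TRUNCATED_RE.search(line)
    if match is None:
        return None
    recv_length, offset, buffer_size = (int(group) for group in match.groups())
    return {
        "recv_length": recv_length,
        "offset": offset,
        "buffer_size": buffer_size,
        # How many bytes the incoming message exceeds the receive buffer by.
        "overrun": recv_length - buffer_size,
    }
```

For the line in this report, the incoming message is 8230 bytes against a 4480-byte buffer, an overrun of 3750 bytes.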

I am checking the other side of the connection, and I don't see a send of 8230 bytes, so I think this is a fragment size (i.e. we may be falling back to fragment-based copies for such tiny buffers).

abellina added the bug, "? - Needs Triage", shuffle, and P0 labels Jul 9, 2021
abellina added this to the July 5 - July 16 milestone Jul 9, 2021
@abellina
Collaborator Author

abellina commented Jul 9, 2021

@petro-rudenko @yosefe, are you aware of any behavior change in UCX 1.11 that could explain this?

@abellina
Collaborator Author

abellina commented Jul 12, 2021

With JUCX 1.11.0-rc3 the truncation issue goes away for Q77. There's a separate issue that @petro-rudenko and I are trying to debug, which prevents me from simply updating to the JUCX 1.11.0-rc3 jar.

Before closing this ticket, I'd like to update to a new jar and understand why 1.10.1 works and why 1.11.0-rc3 resolves the truncation issue.

Salonijain27 removed the "? - Needs Triage" label Jul 12, 2021
@abellina
Collaborator Author

The issue can also be seen with JUCX 1.11.0-rc3. We now have verbose logs and are looking into the root cause.

@abellina
Collaborator Author

With GPU active messages I am not able to reproduce this. The error above is specific to UCX tag-based messages, which are in use in the currently merged code.
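As background on why this error belongs to the tag-based path: in tag matching, the receiver posts a buffer of a fixed size for a tag ahead of time, and the request fails with "Message truncated" if the matched message is larger than that buffer. The toy model below only illustrates that condition. It is not UCX or JUCX code, it does not identify the root cause of this issue, and the names `ToyTagReceiver` and `MessageTruncated` are hypothetical.

```python
class MessageTruncated(Exception):
    """Raised when a matched message is larger than the posted receive buffer."""


class ToyTagReceiver:
    """Toy model of tag matching; for illustration only, not UCX or JUCX code."""

    def __init__(self):
        # Maps each tag to the sizes of the receive buffers posted for it, oldest first.
        self.posted = {}

    def post_recv(self, tag, buffer_size):
        """Post a receive buffer of buffer_size bytes for the given tag."""
        self.posted.setdefault(tag, []).append(buffer_size)

    def deliver(self, tag, recv_length):
        """Match an incoming message to the oldest receive posted for its tag."""
        buffer_size = self.posted[tag].pop(0)
        if recv_length > buffer_size:
            raise MessageTruncated(
                f"recv_length {recv_length} buffer_size {buffer_size}"
            )
        return recv_length
```

In this model, posting a 4480-byte receive and then delivering an 8230-byte message with the same tag raises `MessageTruncated`, which mirrors the sizes in the debug log from this report.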

I am going to move forward towards active messages after re-checking all of NDS.

FYI @petro-rudenko

@abellina
Collaborator Author

We have run NDS a few times with active messages and are not seeing the truncation issue. I am closing this issue as it no longer affects us.
