
UCT/CUDA_IPC: fix peer-access-map init #6360

Conversation

Akshay-Venkatesh
Contributor

What

The peer-accessibility map is a 2-D matrix, and it was initialized incorrectly following the recent restructuring. This PR fixes that.
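
For context, here is a minimal sketch of the kind of structure being fixed, assuming the CUDA driver API with cuInit() already called; build_peer_access_map() and its row-major layout are hypothetical illustrations, not the actual UCX code:

#include <cuda.h>
#include <stdlib.h>

/* A peer-access map for N devices is an N x N matrix: entry (i, j)
 * records whether device i can access memory resident on device j,
 * so every ordered pair must be probed during initialization. */
static int *build_peer_access_map(int num_devices)
{
    int *map = calloc((size_t)num_devices * num_devices, sizeof(*map));
    CUdevice dev_i, dev_j;
    int can_access;

    if (map == NULL) {
        return NULL;
    }

    for (int i = 0; i < num_devices; i++) {
        for (int j = 0; j < num_devices; j++) {
            can_access = 0;
            if ((cuDeviceGet(&dev_i, i) == CUDA_SUCCESS) &&
                (cuDeviceGet(&dev_j, j) == CUDA_SUCCESS)) {
                /* entry (i, j): can device i access memory on device j? */
                cuDeviceCanAccessPeer(&can_access, dev_i, dev_j);
            }
            map[(i * num_devices) + j] = can_access;
        }
    }
    return map;
}

Initializing only part of the matrix, or indexing it as if it were one-dimensional, can silently mark non-peer pairs as accessible, which is the kind of regression this PR repairs.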

@Akshay-Venkatesh
Contributor Author

@yosefe This fixes the issue brought up by @pentschev with #5815

@swx-jenkins3
Collaborator

Can one of the admins verify this patch?

@yosefe
Contributor

yosefe commented Feb 17, 2021

ok to test

@pentschev
Contributor

The performance issue is fixed with this PR; however, errors such as those below should probably be demoted to debug/trace.

[1613602600.236562] [dgx13:23725:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602600.372408] [dgx13:23729:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602600.506529] [dgx13:23983:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602600.625597] [dgx13:23740:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602600.885944] [dgx13:23729:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602600.903756] [dgx13:23729:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602601.622274] [dgx13:23733:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602601.960989] [dgx13:23740:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602602.011343] [dgx13:23737:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602602.021268] [dgx13:23725:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602602.123655] [dgx13:23901:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602602.421565] [dgx13:23733:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
[1613602602.514896] [dgx13:23901:0]  cuda_ipc_cache.c:114  UCX  ERROR cuIpcOpenMemHandle() failed: peer access is not supported between these two devices
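
Since a failed cuIpcOpenMemHandle() between devices without peer access is a condition the transport can recover from, the suggested demotion would look roughly like the sketch below. It assumes UCX's ucs_debug() logging macro; the wrapper name is hypothetical and is not the actual cuda_ipc_cache.c code:

#include <cuda.h>
#include <ucs/debug/log.h> /* assumed header for UCX's ucs_debug() */

/* Hypothetical wrapper: open a remote IPC handle, reporting the
 * "peer access is not supported" case at debug level because the
 * caller can fall back to a staged (non-IPC) copy path. */
static CUresult open_ipc_handle_quietly(CUipcMemHandle handle,
                                        CUdeviceptr *mapped_addr)
{
    CUresult status = cuIpcOpenMemHandle(mapped_addr, handle,
                                         CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
    if (status != CUDA_SUCCESS) {
        /* expected between non-peer devices; not a fatal error */
        ucs_debug("cuIpcOpenMemHandle() failed with status %d", (int)status);
    }
    return status;
}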

@pentschev
Contributor

With the latest changes performance is still good, but I think the log below was accidentally promoted to error in https://github.com/openucx/ucx/pull/6360/files#diff-561a7e4fa208245423bb161bc7a1cc05fbce0d65a10ef13c77e366f38c541826L308-R308. Could you confirm this is accidental, @Akshay-Venkatesh? I now see these errors in my workflow:

[1613641872.186619] [dgx13:77636:0]  cuda_ipc_cache.c:309  UCX  ERROR dest:77624:0: failed to open ipc mem handle. addr:0x7fc656000000 len:17044865024
[1613641872.326805] [dgx13:77803:0]  cuda_ipc_cache.c:309  UCX  ERROR dest:77628:0: failed to open ipc mem handle. addr:0x7fbed4000000 len:17044865024
[1613641872.327381] [dgx13:77803:0]  cuda_ipc_cache.c:309  UCX  ERROR dest:77800:0: failed to open ipc mem handle. addr:0x7f19b0000000 len:17044865024
...

@@ -305,7 +305,7 @@ UCS_PROFILE_FUNC(ucs_status_t, uct_cuda_ipc_map_memhandle, (key, mapped_addr),
             }
         }
     } else {
-        ucs_debug("%s: failed to open ipc mem handle. addr:%p len:%lu",
+        ucs_error("%s: failed to open ipc mem handle. addr:%p len:%lu",
Contributor


looks like the log level was "swapped" with line 114..

Contributor Author


Should be fixed now. cc @pentschev

Contributor

@pentschev pentschev left a comment

This PR addresses both issues I observed and LGTM now. Backporting #5815 along with this one to 1.10 is a +1 from me.

Thanks @Akshay-Venkatesh for fixing this so quickly!

Contributor

@yosefe yosefe left a comment

pls squash

@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/cuda-ipc-fix-peer-access-map-init branch from da7c142 to 04e9f90 on February 18, 2021 14:48
@Akshay-Venkatesh
Copy link
Contributor Author

@yosefe is it ok to prepare a backport?

@yosefe
Copy link
Contributor

yosefe commented Feb 18, 2021

> @yosefe is it ok to prepare a backport?

Yes, let's have it along with the fix

@yosefe yosefe merged commit 4b94414 into openucx:master Feb 21, 2021
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request May 21, 2021
UCX-Py endpoint reuse is no longer necessary, so we also disable it for UCX 1.11+. The primary reason it was introduced was to circumvent an issue with CUDA IPC that was resolved by openucx/ucx#6360. Using the endpoint reuse class has also proven to be very slow, taking a long time to initialize for clusters with just a few dozen workers and being pretty much unusable for clusters on the order of 100 workers.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #620