UCT/CUDA: Make cuda_ipc cache global #5815
Conversation
Can one of the admins verify this patch?
cc @bureddy @yosefe Can you provide a review when possible? cc @petro-rudenko @pentschev Can you try this PR and see if your issues with multiple UCP endpoints get resolved?
Errors don't look related.
khash_t(cuda_ipc_rem_cache) hash;
ucs_recursive_spinlock_t    lock;
Maybe the name can be on the same line?
I did this to align the members in the same column. If I move everything to the same line, I would need to move all the other members, which is probably fine, but I wanted to keep the changes small. For now, I'll go ahead and make the change.
    *cache = kh_val(&iface->remote_cache.hash, khiter);
} else {
    /* if UCS_KH_PUT_BUCKET_EMPTY or UCS_KH_PUT_BUCKET_CLEAR */
Empty line.
Please remove the empty line.
status = uct_cuda_ipc_create_cache(cache, target_name);
if (status != UCS_OK) {
    ucs_error("could not create create cuda ipc cache: %s for pid %d",
              ucs_status_string(status), pid);
There is no unlock before the return. Maybe you can add an `unlock:` label at the end and use `goto`.
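For reference, a minimal self-contained sketch of the suggested pattern, using a plain pthread mutex instead of the UCX recursive spinlock; `lookup_or_create` and its types are illustrative, not the PR's actual code:

```c
#include <pthread.h>
#include <stdio.h>

/* Stand-in for the lock protecting the shared cache table. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

static int lookup_or_create(int key, int *value)
{
    int status = 0;

    pthread_mutex_lock(&table_lock);

    if (key < 0) {
        fprintf(stderr, "invalid key %d\n", key);
        status = -1;
        goto out_unlock;   /* every error path funnels through the unlock */
    }

    *value = key * 2;      /* stand-in for the real cache creation */

out_unlock:
    pthread_mutex_unlock(&table_lock);
    return status;
}

int main(void)
{
    int v = 0;
    printf("status=%d value=%d\n", lookup_or_create(3, &v), v);
    printf("status=%d\n", lookup_or_create(-1, &v));
    return 0;
}
```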
ucs_error("unable to use cuda_ipc remote_cache hash"); | ||
ucs_recursive_spin_unlock(&iface->remote_cache.lock); | ||
return UCS_ERR_NO_RESOURCE; | ||
} else if (khret == UCS_KH_PUT_KEY_PRESENT) { |
Move this most frequent `if` condition above.
ucs_recursive_spin_lock(&iface->remote_cache.lock);

khiter = kh_put(cuda_ipc_rem_cache, &iface->remote_cache.hash, pid,
Is kh_put better than kh_get for checking whether the key is present or not?
Using kh_put simplifies the flow here, as we don't have to explicitly handle the different cases. kh_put can involve a resize operation, but that's a cost we have to incur and handle at some point even if we just use kh_get. kh_put additionally checks whether an element was deleted, which isn't particularly useful for us other than knowing that the key isn't present. With these tradeoffs, I don't know if there is a big difference between the two. @yosefe, any suggestions?
kh_get is better, but not by much; since we intend to create a key if it doesn't exist, we could use kh_put for simplicity.
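To make the tradeoff concrete, here is a small self-contained sketch of a kh_put()-based lookup-or-insert flow; the hash name, value type, and malloc stand-in are illustrative, and khash's raw return codes (-1/0/1/2) correspond to UCS_KH_PUT_FAILED / KEY_PRESENT / BUCKET_EMPTY / BUCKET_CLEAR:

```c
#include <stdio.h>
#include <stdlib.h>
#include "khash.h"   /* the khash.h single-header hash table bundled with UCX */

/* pid -> cache pointer; int key and void* value are illustrative */
KHASH_MAP_INIT_INT(pid_cache, void *)

static void *lookup_or_insert(khash_t(pid_cache) *h, int pid)
{
    int khret;
    khiter_t iter = kh_put(pid_cache, h, pid, &khret);

    if (khret == -1) {          /* allocation failure inside kh_put() */
        return NULL;
    }
    if (khret == 0) {           /* key already present: reuse its value */
        return kh_val(h, iter);
    }

    /* khret == 1 (empty bucket) or 2 (cleared bucket): create the value */
    kh_val(h, iter) = malloc(16);   /* stand-in for uct_cuda_ipc_create_cache() */
    return kh_val(h, iter);
}

int main(void)
{
    khash_t(pid_cache) *h = kh_init(pid_cache);
    void *a = lookup_or_insert(h, 42);
    void *b = lookup_or_insert(h, 42);
    printf("same cache: %d\n", a == b);   /* prints 1: second call hits the KEY_PRESENT case */
    free(a);
    kh_destroy(pid_cache, h);
    return 0;
}
```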
I can confirm this resolves the issue we have with UCX-Py+Dask, tested with CUDA 10.2. Thanks a lot for the fix @Akshay-Venkatesh! cc @quasiben for awareness.
ok to test
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
if (status != UCS_OK) {
    ucs_error("could not create create cuda ipc cache: %s for pid %d",
              ucs_status_string(status), pid);
    goto err;
Error handling seems to be incorrect: if you print the error and return here, the following call to uct_cuda_ipc_get_rem_cache with the same pid will find the uninitialized value in the hash. You probably need to delete the element here, or use kh_get instead.
Thanks @brminich
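A sketch of the fix the comment suggests, reusing the illustrative pid_cache hash from the earlier sketch: if creating the cache fails after kh_put(), delete the freshly inserted key so a later lookup with the same pid does not see an uninitialized value. The helper name and malloc stand-in are hypothetical:

```c
static void *lookup_or_insert_safe(khash_t(pid_cache) *h, int pid)
{
    int khret;
    khiter_t iter = kh_put(pid_cache, h, pid, &khret);
    void *cache;

    if (khret == -1) {
        return NULL;
    }
    if (khret == 0) {
        return kh_val(h, iter);        /* already created earlier */
    }

    cache = malloc(16);                /* stand-in for uct_cuda_ipc_create_cache() */
    if (cache == NULL) {
        kh_del(pid_cache, h, iter);    /* roll back so the key is not left dangling */
        return NULL;
    }

    kh_val(h, iter) = cache;
    return cache;
}
```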
src/uct/cuda/cuda_ipc/cuda_ipc_md.h
Outdated
@@ -46,6 +46,7 @@ typedef struct uct_cuda_ipc_md_config {
 */
typedef struct uct_cuda_ipc_key {
    CUipcMemHandle ph;   /* Memory handle of GPU memory */
    pid_t          pid;  /* PID as key to resolve peer_map hash*/
Please add a space after "hash" (before the closing */).
Still got:
Peter,
Can you provide the repro instructions? If there's a standalone, that works best.
…On Mon, Oct 26, 2020, 8:36 AM Peter Rudenko ***@***.***> wrote:
Still got [swx-dgx02:76439:0:78030] cuda_ipc_cache.c:263 Fatal: dest:76435: failed to open ipc mem handle. addr:0x7f41ca000000 len:30680756736 (Element already exists):
#0  0x00007ff26726eefd in pause () from /lib64/libpthread.so.0
#1  0x00007fe500c9b94d in ucs_debug_freeze () at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/debug.c:820
#2  0x00007fe500c9bd34 in ucs_error_freeze (message=0x7fe501bf8040 "Fatal: dest:76435: failed to open ipc mem handle. addr:0x7f41ca000000 len:30680756736 (Element already exists)") at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/debug.c:915
#3  0x00007fe500c9c34e in ucs_handle_error (message=0x7fe501bf8040 "Fatal: dest:76435: failed to open ipc mem handle. addr:0x7f41ca000000 len:30680756736 (Element already exists)") at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/debug.c:1078
#4  0x00007fe500c98b5c in ucs_fatal_error_message (file=0x7fe50008aad0 "/hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c", line=263, function=0x7fe50008b0e0 <__FUNCTION__.15748> "uct_cuda_ipc_map_memhandle_inner", message_buf=0x7fe501bf8040 "Fatal: dest:76435: failed to open ipc mem handle. addr:0x7f41ca000000 len:30680756736 (Element already exists)") at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/assert.c:37
#5  0x00007fe500c98cdc in ucs_fatal_error_format (file=0x7fe50008aad0 "/hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c", line=263, function=0x7fe50008b0e0 <__FUNCTION__.15748> "uct_cuda_ipc_map_memhandle_inner", format=0x7fe50008adf8 "Fatal: %s: failed to open ipc mem handle. addr:%p len:%lu (%s)") at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/assert.c:53
#6  0x00007fe50008817c in uct_cuda_ipc_map_memhandle_inner (mapped_addr=0x7fe501bf86c0, key=0x7fd9cc6066e0, arg=0x7fd9cc742af0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c:260
#7  uct_cuda_ipc_map_memhandle (arg=0x7fd9cc742af0, key=0x7fd9cc6066e0, mapped_addr=0x7fe501bf86c0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c:193
#8  0x00007fe500085169 in uct_cuda_ipc_post_cuda_async_copy (direction=1, comp=0x7fd9cc2b1528, rkey=140573413500640, iov=0x7fe501bf8850, remote_addr=139926485898496, tl_ep=0x7fd9cd011ed0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_ep.c:68
#9  uct_cuda_ipc_ep_get_zcopy_inner (comp=0x7fd9cc2b1528, rkey=140573413500640, remote_addr=139926485898496, iovcnt=1, iov=0x7fe501bf8850, tl_ep=0x7fd9cd011ed0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_ep.c:135
#10 uct_cuda_ipc_ep_get_zcopy (tl_ep=0x7fd9cd011ed0, iov=0x7fe501bf8850, iovcnt=1, remote_addr=139926485898496, rkey=140573413500640, comp=0x7fd9cc2b1528) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_ep.c:127
#11 0x00007fe5005156ac in uct_ep_get_zcopy (comp=0x7fd9cc2b1528, rkey=140573413500640, remote_addr=139926485898496, iovcnt=1, iov=0x7fe501bf8850, ep=0x7fd9cd011ed0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/api/uct.h:2598
#12 ucp_rndv_progress_rma_get_zcopy_inner (self=0x7fd9cc2b1540) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:537
#13 ucp_rndv_progress_rma_get_zcopy (self=0x7fd9cc2b1540) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:452
#14 0x00007fe5005169df in ucp_request_try_send (pending_flags=0, req=0x7fd9cc2b1440) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/core/ucp_request.inl:239
#15 ucp_request_send (pending_flags=0, req=0x7fd9cc2b1440) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/core/ucp_request.inl:264
#16 ucp_rndv_req_send_rma_get (rndv_req=0x7fd9cc2b1440, rreq=0x7fbc795d09c0, rndv_rts_hdr=0x7fe4d133b490, rkey_buf=0x7fe4d133b4ba) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:714
#17 0x00007fe50051b472 in ucp_rndv_receive_inner (rkey_buf=0x7fe4d133b4ba, rndv_rts_hdr=0x7fe4d133b490, rreq=0x7fbc795d09c0, worker=0x7fd9cc735410) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:1227
#18 ucp_rndv_receive (worker=0x7fd9cc735410, rreq=0x7fbc795d09c0, rndv_rts_hdr=0x7fe4d133b490, rkey_buf=0x7fe4d133b4ba) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:1163
#19 0x00007fe50055c05d in ucp_tag_rndv_matched (worker=0x7fd9cc735410, rreq=0x7fbc795d09c0, rts_hdr=0x7fe4d133b490) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/tag/tag_rndv.c:24
#20 0x00007fe50055c4b4 in ucp_tag_rndv_process_rts (worker=0x7fd9cc735410, common_rts_hdr=0x7fe4d133b490, length=205, tl_flags=1) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/tag/tag_rndv.c:42
#21 0x00007fe50051b88d in ucp_rndv_rts_handler_inner (tl_flags=1, length=205, data=0x7fe4d133b490, arg=0x7fd9cc735410) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:1293
#22 ucp_rndv_rts_handler (arg=0x7fd9cc735410, data=0x7fe4d133b490, length=205, tl_flags=1) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:1285
#23 0x00007fe5007f8062 in uct_iface_invoke_am (iface=0x7fd9cc740340, id=9 '\t', data=0x7fe4d133b490, length=205, flags=1) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/base/uct_iface.h:654
#24 0x00007fe5007f8afe in uct_mm_iface_invoke_am (flags=1, length=205, data=0x7fe4d133b490, am_id=9 '\t', iface=0x7fd9cc740340) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/sm/mm/base/mm_iface.h:245
#25 uct_mm_iface_process_recv (elem=0x7fe4e816ed40, iface=0x7fd9cc740340) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/sm/mm/base/mm_iface.c:250
#26 uct_mm_iface_poll_fifo (iface=0x7fd9cc740340) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/sm/mm/base/mm_iface.c:282
#27 uct_mm_iface_progress (tl_iface=0x7fd9cc740340) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/sm/mm/base/mm_iface.c:335
#28 0x00007fe5004df027 in ucs_callbackq_dispatch (cbq=0x7fd9cceb1dd0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/datastruct/callbackq.h:211
#29 0x00007fe5004e8f18 in uct_worker_progress (worker=0x7fd9cceb1dd0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/api/uct.h:2406
#30 ucp_worker_progress (worker=0x7fd9cc735410) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/core/ucp_worker.c:2308
The app is SparkUCX, but you can reproduce it with a multithreaded perftest:
@petro-rudenko Can you try again? @pentschev Can you confirm that your case still works with the new changes?
@Akshay-Venkatesh thanks, working now.
@Akshay-Venkatesh thanks for the ping, I confirm it still works for us on 9dbfb20.
This PR resolves the issue.
status = UCS_ERR_ALREADY_EXISTS;
goto out;
Need to flatten these ifs:
if (cuerr == SUCCESS || cuerr == ALREADY_EXISTS) {
    status = UCS_OK;
} else {
    ucs_error();
    status = INVALID_PARAM;
}
unhandled
@yosefe status is UCS_OK by default in this function; we cannot return UCS_OK when cuerr == ALREADY_EXISTS, because the functions that call uct_cuda_ipc_open_memhandle handle UCS_ERR_ALREADY_EXISTS differently from UCS_INVALID_PARAM (a fatal error), which in turn is handled differently from UCS_OK. So I'm not sure how we can flatten to:
if (cuerr == SUCCESS || cuerr == ALREADY_EXISTS) {
    status = UCS_OK;
} else {
    status = UCS_INVALID_PARAM;
}
Can we flatten it in a different way?
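One way the mapping could be flattened while still preserving all three outcomes the callers distinguish — a sketch only, with an assumed helper name and assumed include paths:

```c
#include <cuda.h>
#include <ucs/type/status.h>

/* Hypothetical helper: map the cuIpcOpenMemHandle() result to a UCS status
 * without nesting, keeping ALREADY_EXISTS distinct from OK and INVALID_PARAM. */
static ucs_status_t open_memhandle_status(CUresult cuerr)
{
    if (cuerr == CUDA_SUCCESS) {
        return UCS_OK;
    } else if (cuerr == CUDA_ERROR_ALREADY_MAPPED) {
        return UCS_ERR_ALREADY_EXISTS;   /* callers handle this case specially */
    } else {
        return UCS_ERR_INVALID_PARAM;    /* treated as fatal by callers */
    }
}
```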
ucs_pgt_region_t            *pgt_region;
uct_cuda_ipc_cache_region_t *region;

status = uct_cuda_ipc_get_rem_cache(pid, &cache);
Why do we remove the remote cache entry?
Sorry. rem = remote here. I've changed all instances to remove ambiguity.
} else if (khret == UCS_KH_PUT_KEY_PRESENT) {
    *cache = kh_val(&uct_cuda_ipc_remote_cache.hash, khiter);
} else {
    /* if (khret == UCS_KH_PUT_FAILED) */
Should we remove it, modify it, or make it an assert, instead of commenting out the `if` condition?
Asserts get compiled out if it's not a debug build, right? If UCS_KH_PUT_FAILED occurs, one option is a ucs_fatal error, but maybe we don't want to handle the failure here but rather at a higher layer. If we don't fail here, the ep_put/get_zcopy that requested the memory mapping would fail, and UCP is maybe in a better position to handle the failure (say, by switching to another protocol).
What happens today if ep_put/get_zcopy fails during rndv? Do we still try the am_bcopy protocol as a fallback?
I only meant to change the comment here to use wording instead of a code-like comment.
Missed that. I'll remove the comment now.
@yosefe Can you let me know whether your comments have been addressed, when you get a chance?
ucs_recursive_spinlock_init(&uct_cuda_ipc_map_lock, 0);
ucs_recursive_spinlock_init(&remote_cache.lock, 0);
kh_init_inplace(cuda_ipc_rem_cache, &remote_cache.hash);
ucs_recursive_spinlock_init(&uct_cuda_ipc_remote_cache.lock, 0);
What is the difference between uct_cuda_ipc_map_lock and uct_cuda_ipc_remote_cache.lock?
uct_cuda_ipc_map_lock provides mutual exclusion for threads trying to open an IPC memory handle, whereas uct_cuda_ipc_remote_cache.lock ensures that no two threads access the hashtable of cuda_ipc cache structures (one per remote process per local CUDA device) at the same time.
Why do we need uct_cuda_ipc_map_lock? Isn't the CUDA API already thread-safe?
@Akshay-Venkatesh we don't need uct_cuda_ipc_map_lock, right? The driver already seems to take a CUDA context lock.
@bureddy @yosefe It's safe for multiple threads to call cuIpcOpenMemHandle. We still want to avoid the case where two threads both check reachability and end up causing a "memory handle already opened" error. For this we need mutual exclusion over the segment of code that 1. opens the remote memory handle, 2. updates the peer_accessibility table, and 3. closes the memory handle, so we need a lock around that segment. But the same lock does not seem to be needed here, because by this point the thread is guaranteed to have already checked reachability. I've removed the lock use in uct_cuda_ipc_open_memhandle for this reason.
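A condensed sketch of the critical section described above — open the handle, record reachability, close the handle — behind one lock; the function name, the pthread mutex, and the accessible_entry slot are illustrative, not the PR's exact code:

```c
#include <pthread.h>
#include <cuda.h>

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;

static int check_peer_reachable(CUipcMemHandle handle, char *accessible_entry)
{
    CUdeviceptr d_mapped;
    CUresult cuerr;

    pthread_mutex_lock(&map_lock);

    /* 1. open the remote memory handle */
    cuerr = cuIpcOpenMemHandle(&d_mapped, handle,
                               CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);

    /* 2. update the peer-accessibility entry; an already-mapped handle
     *    still means the peer is reachable */
    *accessible_entry = (cuerr == CUDA_SUCCESS) ||
                        (cuerr == CUDA_ERROR_ALREADY_MAPPED);

    /* 3. close the handle only if this thread opened it */
    if (cuerr == CUDA_SUCCESS) {
        cuIpcCloseMemHandle(d_mapped);
    }

    pthread_mutex_unlock(&map_lock);
    return *accessible_entry;
}
```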
@yosefe Have I addressed all your comments?
status = UCS_ERR_ALREADY_EXISTS;
goto out;
Can we flatten it in a different way?
ucs_recursive_spin_lock(&uct_cuda_ipc_remote_cache.lock);

key.pid = pid;
Please remove the blank line.
ucs_recursive_spinlock_init(&uct_cuda_ipc_map_lock, 0);
ucs_recursive_spinlock_init(&remote_cache.lock, 0);
kh_init_inplace(cuda_ipc_rem_cache, &remote_cache.hash);
ucs_recursive_spinlock_init(&uct_cuda_ipc_remote_cache.lock, 0);
Why do we need uct_cuda_ipc_map_lock? Isn't the CUDA API already thread-safe?
err:
    ucs_recursive_spin_unlock(&uct_cuda_ipc_remote_cache.lock);
Please remove the blank line.
Please remove the blank line.
@@ -37,15 +67,21 @@ struct uct_cuda_ipc_cache {
};

typedef struct uct_cuda_ipc_remote_cache {
    khash_t(cuda_ipc_rem_cache) hash;
    ucs_recursive_spinlock_t    lock;
Do we need a lock both in struct uct_cuda_ipc_cache and also here? Maybe we can remove the lock from uct_cuda_ipc_cache?
The attempt is to ensure that:
- no two threads try to update a single rcache instance (using the lock in uct_cuda_ipc_cache), and
- no two threads try to update the hashtable that stores the different rcache instances, one per GPU per remote process (using remote_cache.lock).
These two operations are independent, so grabbing the locks for each should be independent too. I.e., if one thread grabs rcache x for peer GPU x, another thread should still be allowed to grab rcache y for peer GPU y, but two threads shouldn't be allowed to access rcache x simultaneously (at least for now we don't distinguish read-only access to an rcache from reads/writes).
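Schematically, the two locking domains described above; the names and pthread mutexes are illustrative, not the PR's types:

```c
#include <pthread.h>

/* One cache instance per (remote process, remote GPU); its own lock protects
 * the region mappings stored inside it. */
typedef struct remote_cache_entry {
    pthread_mutex_t lock;
    /* ... per-peer region mappings ... */
} remote_cache_entry_t;

/* The global table of cache instances; its lock protects only the hash that
 * maps (pid, device) to an entry, not the entries themselves. */
typedef struct remote_cache_table {
    pthread_mutex_t lock;
    /* ... hash: (pid, device) -> remote_cache_entry_t* ... */
} remote_cache_table_t;

/* Thread A holds table.lock briefly to find entry X, releases it, then holds
 * X->lock while mapping; thread B can concurrently do the same with entry Y.
 * Only threads touching the same entry serialize on that entry's lock. */
```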
@yosefe when possible, can we revisit this? I've addressed the concern around ipc_map_lock here: #5815 (comment)
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
@@ -153,7 +158,13 @@ static ucs_status_t uct_cuda_ipc_is_peer_accessible(uct_cuda_ipc_component_t *md
        }
    }

    pthread_mutex_unlock(&uct_cuda_ipc_map_lock);
Please remove the blank line.
@yosefe For this #5815 (comment): the CUDA calls are thread-safe, but I'm using ipc_map_lock to avoid the case where one thread doing the reachability check accidentally closes a memory handle that was opened by another thread for the purposes of a transfer.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
err:
    ucs_recursive_spin_unlock(&uct_cuda_ipc_remote_cache.lock);
Please remove the blank line.
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
@@ -127,18 +128,22 @@ static ucs_status_t uct_cuda_ipc_is_peer_accessible(uct_cuda_ipc_component_t *md
    char* accessible;
    CUdeviceptr d_mapped;

    pthread_mutex_lock(&uct_cuda_ipc_map_lock);
It looks like this lock is protecting mdc->md->peer_accessible_cache:
- it does not seem related to this PR
- it should be on the MD object
That's correct. It's not related to this PR, but it is still needed; I hope it's OK to have it as part of this PR. I've moved the lock into the MD object.
Maybe make it a spinlock and reduce the lock scope to only updating the array (move the CUDA API calls outside the locked section).
Switched to a spinlock and reduced the scope. If the array element is in the init state (0xFF), we want a thread to query the CUDA API and update it; this write needs to be inside the locked section. Also, we want at most one thread to call cuIpcOpenMemHandle and update the array element from init to reachable or unreachable; we can't have multiple threads calling cuIpcOpenMemHandle because that can lead to an "already opened" error. With these two constraints, I couldn't think of a way to move the CUDA API calls out of the locked section.
Since we probably don't care if more than one thread updates "accessible" at the same time, we don't even need a lock. Since a single-value memory read/write is atomic (i.e. it returns one of the values previously written, not garbage), we can do it like this:
ipc_md {
    ucs_ternary_value_t accessible[];
};
...
uct_cuda_ipc_is_peer_accessible()
{
    ucs_ternary_value_t is_accessible = *accessible;
    if (is_accessible == UCS_TRY) {
        cuIpc(...);
        is_accessible = (result == ...) ? UCS_YES : UCS_NO;
        *accessible   = is_accessible;
    }
    return is_accessible == UCS_YES;
}
> since we probably don't care if more than one thread would update "accessible" at the same time, we don't even need a lock.

Agreed. We don't need a lock on the accessible array. We still need no more than one thread to call cuIpcOpenMemHandle, and the following code doesn't prevent that:
uct_cuda_ipc_is_peer_accessible() {
    ...
    ucs_ternary_value_t is_accessible = *accessible;
    if (is_accessible == UCS_TRY) {
        ...
    }
}
Two threads could end up calling OpenMemHandle on the same memory handle if they both see is_accessible as UCS_TRY, right? We may need all of the following executed atomically to disallow that:
if (is_accessible == UCS_TRY) {
    cuIpc(...);
    is_accessible = (result == ...) ? UCS_YES : UCS_NO;
    *accessible   = is_accessible;
}
Don't we need a lock for this? Let me know if I've misunderstood something here.
Why is it a problem for two threads to open the same handle? It seems we treat CUDA_ERROR_ALREADY_MAPPED status as success.
@yosefe you're right; I ignored that aspect. This reminded me of another motivation for adding the lock, which was to avoid each thread opening the memory handle multiple times, since each open can take on the order of milliseconds. Even though opening the memory handle is expensive, it's probably still OK not to have a lock, because the accessible element would be set to UCS_YES soon enough that not many threads would end up opening the handle anyway. I'll make changes based on your code suggestion above.
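Putting the agreed approach together, a self-contained sketch of the lock-free check; the local ternary_t enum is a stand-in for UCX's ucs_ternary_value_t (UCS_NO/UCS_YES/UCS_TRY), and the function and parameter names are illustrative:

```c
#include <cuda.h>

/* local stand-in for ucs_ternary_value_t so the sketch stands alone */
typedef enum { UCS_NO = 0, UCS_YES = 1, UCS_TRY = 2 } ternary_t;

/* Worst case, a few threads race past the UCS_TRY check and each open the
 * handle; CUDA_ERROR_ALREADY_MAPPED is treated as success, and they all store
 * the same final value, so no lock is required for correctness. */
static int is_peer_accessible(ternary_t *slot, CUipcMemHandle handle)
{
    ternary_t is_accessible = *slot;      /* single-word read */
    CUdeviceptr d_mapped;
    CUresult cuerr;

    if (is_accessible == UCS_TRY) {
        cuerr = cuIpcOpenMemHandle(&d_mapped, handle,
                                   CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
        is_accessible = ((cuerr == CUDA_SUCCESS) ||
                         (cuerr == CUDA_ERROR_ALREADY_MAPPED)) ? UCS_YES : UCS_NO;
        if (cuerr == CUDA_SUCCESS) {
            cuIpcCloseMemHandle(d_mapped);    /* reachability probe only */
        }
        *slot = is_accessible;                /* single-word write */
    }

    return is_accessible == UCS_YES;
}
```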
    *cache = kh_val(&iface->remote_cache.hash, khiter);
} else {
    /* if UCS_KH_PUT_BUCKET_EMPTY or UCS_KH_PUT_BUCKET_CLEAR */
Please remove the empty line.
ucs_error("unable to use cuda_ipc remote_cache hash"); | ||
status = UCS_ERR_NO_RESOURCE; | ||
} | ||
err: |
err -> err_unlock
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
uct_cuda_ipc_memset(md->peer_accessible_cache, UCS_TRY,
                    num_devices * md->uuid_map_capacity,
                    sizeof(ucs_ternary_value_t));
Pass a ucs_ternary_value_t* and a length to this function, and rename it to uct_cuda_ipc_accessible_cache_init.
BTW, maybe we can initialize this cache to NULL/capacity=0 and let uct_cuda_ipc_get_unique_index_for_uuid allocate it on first use?
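A sketch of the lazy growth being suggested, reusing the ternary_t stand-in from the sketch above: start from capacity 0 and initialize only the newly added slots to the unknown state. The helper name is hypothetical:

```c
#include <stdlib.h>

/* Grow the accessibility cache on first use and mark only the new slots as
 * "not checked yet" (UCS_TRY). Returns 0 on success, -1 on allocation failure. */
static int grow_accessible_cache(ternary_t **cache, size_t *capacity,
                                 size_t new_capacity)
{
    ternary_t *p;
    size_t i;

    if (new_capacity <= *capacity) {
        return 0;                       /* nothing to do */
    }

    p = realloc(*cache, new_capacity * sizeof(*p));   /* NULL on first use is fine */
    if (p == NULL) {
        return -1;
    }

    for (i = *capacity; i < new_capacity; i++) {
        p[i] = UCS_TRY;                 /* only the newly added range */
    }

    *cache    = p;
    *capacity = new_capacity;
    return 0;
}
```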
@yosefe I've addressed the above two suggestions. I also ended up doing the cuda_ipc cache insertion as part of the reachability check; this saves one OpenMemHandle call and also prevents a multi-threaded issue that I've documented. Let me know if the changes look OK.
Also, I'm not sure why I'm seeing these errors on CI but not locally.
In file included from /scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:11:0:
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.h:23:5: error: unknown type name 'ucs_ternary_value_t'
ucs_ternary_value_t *peer_accessible_cache;
^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c: In function 'uct_cuda_ipc_accessible_cache_init':
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:77:28: error: 'ucs_ternary_value_t' undeclared (first use in this function)
size_t offset = sizeof(ucs_ternary_value_t);
^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:77:28: note: each undeclared identifier is reported only once for each function it appears in
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:78:26: error: 'p' undeclared (first use in this function)
ucs_ternary_value_t *p;
^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:79:5: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
size_t i;
^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c: In function 'uct_cuda_ipc_get_unique_index_for_uuid':
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:110:40: error: 'ucs_ternary_value_t' undeclared (first use in this function)
sizeof(ucs_ternary_value_t);
^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c: In function 'uct_cuda_ipc_is_peer_accessible':
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:146:5: error: unknown type name 'ucs_ternary_value_t'
ucs_ternary_value_t *accessible;
^
cc1: all warnings being treated as errors
@yosefe Remaining errors seem unrelated. Can you review the new changes when possible?
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@yosefe I see that the failing tests are complaining about ucs_ternary_value_t again. Any suggestions? Also, do the latest commits look good and address #5815 (comment)?
/scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.h:22:5: error: unknown type name 'ucs_ternary_value_t'
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
uct_cuda_ipc_accessible_cache_init(md->peer_accessible_cache + original_cache_size,
                                   new_capacity - original_capacity);
I think we don't need this as a separate function; we can just expand the code here.
@yosefe I had config/types.h included earlier, but it didn't matter, so I removed the header. Now I've added it back, but I still see the same error. I don't see any local build failures; maybe there's something else missing.
pls use
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
buffer    = md->peer_accessible_cache + original_cache_size;
num_elems = new_capacity - original_capacity;

for (elem = 0; elem < num_elems; elem++) {
    p  = UCS_PTR_BYTE_OFFSET(buffer, elem * offset);
    *p = UCS_TRY;
}
for (i = original_cache_size; i < new_capacity; ++i) {
    md->peer_accessible_cache[i] = UCS_TRY;
}
Please squash.
Force-pushed from 0126e33 to d689192 (Compare).
What/Why
How?