
UCT/CUDA: Make cuda_ipc cache global #5815

Merged (1 commit) on Feb 12, 2021

Conversation

Akshay-Venkatesh
Contributor

What/Why

  • Move IPC cache from being a per-endpoint entity to per-iface entity
  • This allows multiple endpoints in the same process to reuse peer-mapped memory and avoid reopening already-opened memory handles
    • Needed for CUDA < 11.1 to avoid a fatal error when trying to re-open an already-opened memory handle

How?

  • By using a cache per remote process
  • By using a hashtable of remote caches identified by remote pid

@swx-jenkins3
Collaborator

Can one of the admins verify this patch?

@Akshay-Venkatesh
Contributor Author

cc @bureddy @yosefe Can you provide a review when possible?

cc @petro-rudenko @pentschev Can you try this PR and see if your issues with multiple UCP endpoints get resolved?

@Akshay-Venkatesh
Contributor Author

Errors don't look related.

khash_t(cuda_ipc_rem_cache)
hash;
ucs_recursive_spinlock_t
lock;
Contributor

Maybe the name can be on the same line?

Contributor Author

I did this to align members in the same column. If I move everything to one line, I would need to move all the other members, which is probably fine, but I wanted to keep the changes small. For now, I'll go ahead and make the change.

*cache = kh_val(&iface->remote_cache.hash, khiter);
} else {
/* if UCS_KH_PUT_BUCKET_EMPTY or UCS_KH_PUT_BUCKET_CLEAR */

Contributor

empty line

Contributor

pls remove empty line

status = uct_cuda_ipc_create_cache(cache, target_name);
if (status != UCS_OK) {
ucs_error("could not create cuda ipc cache: %s for pid %d",
ucs_status_string(status), pid);
Contributor

No unlock before return. Maybe you can have an unlock: label at the end and use goto.

ucs_error("unable to use cuda_ipc remote_cache hash");
ucs_recursive_spin_unlock(&iface->remote_cache.lock);
return UCS_ERR_NO_RESOURCE;
} else if (khret == UCS_KH_PUT_KEY_PRESENT) {
Contributor

Move this most frequent "if" condition above.


ucs_recursive_spin_lock(&iface->remote_cache.lock);

khiter = kh_put(cuda_ipc_rem_cache, &iface->remote_cache.hash, pid,
Contributor

is kh_put better than kh_get to check if the key is present or not?

Contributor Author

Using kh_put simplifies the flow here, as we don't have to explicitly handle different cases. kh_put can involve a resize operation, but that's a cost we have to incur and handle at some point even if we are just using kh_get. kh_put additionally checks whether an element was deleted, which is not particularly useful for us beyond knowing that the key isn't present. With these tradeoffs I don't know if there is a big difference between the two. @yosefe any suggestions?

Contributor

kh_get is better, but not by much; since we intend to create the key if it doesn't exist, we could use kh_put for simplicity.

@pentschev
Contributor

I can confirm this resolves the issue on the case we have with UCX-Py+Dask, tested with CUDA 10.2. Thanks a lot for the fix @Akshay-Venkatesh !

cc @quasiben for awareness.

@brminich
Contributor

ok to test

@brminich
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

if (status != UCS_OK) {
ucs_error("could not create cuda ipc cache: %s for pid %d",
ucs_status_string(status), pid);
goto err;
Contributor

The error handling seems to be incorrect: if you print the error and return here, the following call to uct_cuda_ipc_get_rem_cache with the same pid will find the uninitialized value in the hash.
You probably need to delete the element here, or use kh_get instead.

Contributor Author

Thanks @brminich

@@ -46,6 +46,7 @@ typedef struct uct_cuda_ipc_md_config {
*/
typedef struct uct_cuda_ipc_key {
CUipcMemHandle ph; /* Memory handle of GPU memory */
pid_t pid; /* PID as key to resolve peer_map hash*/
Contributor

please add space after hash

@petro-rudenko
Member

Still got [swx-dgx02:76439:0:78030] cuda_ipc_cache.c:263 Fatal: dest:76435: failed to open ipc mem handle. addr:0x7f41ca000000 len:30680756736 (Element already exists):

#0  0x00007ff26726eefd in pause () from /lib64/libpthread.so.0
#1  0x00007fe500c9b94d in ucs_debug_freeze () at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/debug.c:820
#2  0x00007fe500c9bd34 in ucs_error_freeze (message=0x7fe501bf8040 "Fatal: dest:76435: failed to open ipc mem handle. addr:0x7f41ca000000 len:30680756736 (Element already exists)") at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/debug.c:915
#3  0x00007fe500c9c34e in ucs_handle_error (message=0x7fe501bf8040 "Fatal: dest:76435: failed to open ipc mem handle. addr:0x7f41ca000000 len:30680756736 (Element already exists)") at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/debug.c:1078
#4  0x00007fe500c98b5c in ucs_fatal_error_message (file=0x7fe50008aad0 "/hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c", line=263, function=0x7fe50008b0e0 <__FUNCTION__.15748> "uct_cuda_ipc_map_memhandle_inner", message_buf=0x7fe501bf8040 "Fatal: dest:76435: failed to open ipc mem handle. addr:0x7f41ca000000 len:30680756736 (Element already exists)") at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/assert.c:37
#5  0x00007fe500c98cdc in ucs_fatal_error_format (file=0x7fe50008aad0 "/hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c", line=263, function=0x7fe50008b0e0 <__FUNCTION__.15748> "uct_cuda_ipc_map_memhandle_inner", format=0x7fe50008adf8 "Fatal: %s: failed to open ipc mem handle. addr:%p len:%lu (%s)") at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/debug/assert.c:53
#6  0x00007fe50008817c in uct_cuda_ipc_map_memhandle_inner (mapped_addr=0x7fe501bf86c0, key=0x7fd9cc6066e0, arg=0x7fd9cc742af0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c:260
#7  uct_cuda_ipc_map_memhandle (arg=0x7fd9cc742af0, key=0x7fd9cc6066e0, mapped_addr=0x7fe501bf86c0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c:193
#8  0x00007fe500085169 in uct_cuda_ipc_post_cuda_async_copy (direction=1, comp=0x7fd9cc2b1528, rkey=140573413500640, iov=0x7fe501bf8850, remote_addr=139926485898496, tl_ep=0x7fd9cd011ed0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_ep.c:68
#9  uct_cuda_ipc_ep_get_zcopy_inner (comp=0x7fd9cc2b1528, rkey=140573413500640, remote_addr=139926485898496, iovcnt=1, iov=0x7fe501bf8850, tl_ep=0x7fd9cd011ed0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_ep.c:135
#10 uct_cuda_ipc_ep_get_zcopy (tl_ep=0x7fd9cd011ed0, iov=0x7fe501bf8850, iovcnt=1, remote_addr=139926485898496, rkey=140573413500640, comp=0x7fd9cc2b1528) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_ep.c:127
#11 0x00007fe5005156ac in uct_ep_get_zcopy (comp=0x7fd9cc2b1528, rkey=140573413500640, remote_addr=139926485898496, iovcnt=1, iov=0x7fe501bf8850, ep=0x7fd9cd011ed0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/api/uct.h:2598
#12 ucp_rndv_progress_rma_get_zcopy_inner (self=0x7fd9cc2b1540) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:537
#13 ucp_rndv_progress_rma_get_zcopy (self=0x7fd9cc2b1540) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:452
#14 0x00007fe5005169df in ucp_request_try_send (pending_flags=0, req=0x7fd9cc2b1440) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/core/ucp_request.inl:239
#15 ucp_request_send (pending_flags=0, req=0x7fd9cc2b1440) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/core/ucp_request.inl:264
#16 ucp_rndv_req_send_rma_get (rndv_req=0x7fd9cc2b1440, rreq=0x7fbc795d09c0, rndv_rts_hdr=0x7fe4d133b490, rkey_buf=0x7fe4d133b4ba) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:714
#17 0x00007fe50051b472 in ucp_rndv_receive_inner (rkey_buf=0x7fe4d133b4ba, rndv_rts_hdr=0x7fe4d133b490, rreq=0x7fbc795d09c0, worker=0x7fd9cc735410) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:1227
#18 ucp_rndv_receive (worker=0x7fd9cc735410, rreq=0x7fbc795d09c0, rndv_rts_hdr=0x7fe4d133b490, rkey_buf=0x7fe4d133b4ba) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:1163
#19 0x00007fe50055c05d in ucp_tag_rndv_matched (worker=0x7fd9cc735410, rreq=0x7fbc795d09c0, rts_hdr=0x7fe4d133b490) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/tag/tag_rndv.c:24
#20 0x00007fe50055c4b4 in ucp_tag_rndv_process_rts (worker=0x7fd9cc735410, common_rts_hdr=0x7fe4d133b490, length=205, tl_flags=1) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/tag/tag_rndv.c:42
#21 0x00007fe50051b88d in ucp_rndv_rts_handler_inner (tl_flags=1, length=205, data=0x7fe4d133b490, arg=0x7fd9cc735410) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:1293
#22 ucp_rndv_rts_handler (arg=0x7fd9cc735410, data=0x7fe4d133b490, length=205, tl_flags=1) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/rndv/rndv.c:1285
#23 0x00007fe5007f8062 in uct_iface_invoke_am (iface=0x7fd9cc740340, id=9 '\t', data=0x7fe4d133b490, length=205, flags=1) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/base/uct_iface.h:654
#24 0x00007fe5007f8afe in uct_mm_iface_invoke_am (flags=1, length=205, data=0x7fe4d133b490, am_id=9 '\t', iface=0x7fd9cc740340) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/sm/mm/base/mm_iface.h:245
#25 uct_mm_iface_process_recv (elem=0x7fe4e816ed40, iface=0x7fd9cc740340) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/sm/mm/base/mm_iface.c:250
#26 uct_mm_iface_poll_fifo (iface=0x7fd9cc740340) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/sm/mm/base/mm_iface.c:282
#27 uct_mm_iface_progress (tl_iface=0x7fd9cc740340) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/sm/mm/base/mm_iface.c:335
#28 0x00007fe5004df027 in ucs_callbackq_dispatch (cbq=0x7fd9cceb1dd0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucs/datastruct/callbackq.h:211
#29 0x00007fe5004e8f18 in uct_worker_progress (worker=0x7fd9cceb1dd0) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/uct/api/uct.h:2406
#30 ucp_worker_progress (worker=0x7fd9cc735410) at /hpc/mtr_scrap/users/peterr/rapids/ucx-cuda-ipc/contrib/../src/ucp/core/ucp_worker.c:2308

@Akshay-Venkatesh
Contributor Author

Akshay-Venkatesh commented Oct 26, 2020 via email

@petro-rudenko
Member

petro-rudenko commented Oct 26, 2020

The app is SparkUCX, but you can reproduce it with the multithreaded perftest:
CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 ./ucx-cuda-ipc/build/bin/ucx_perftest -T 10 -m cuda -s 2000000 -t tag_bw swx-dgx01

@Akshay-Venkatesh
Contributor Author

@petro-rudenko Can you try again?

@pentschev Can you confirm that your case still works with the new changes?

@petro-rudenko
Member

@Akshay-Venkatesh thanks, working now

@Akshay-Venkatesh changed the title from "UCT/CUDA: move cuda_ipc cache from ep to iface" to "UCT/CUDA: Make cuda_ipc cache global" on Oct 27, 2020
@pentschev
Contributor

@Akshay-Venkatesh thanks for the ping, I confirm it still works for us on 9dbfb20 .

@hoopoepg
Contributor

hoopoepg commented Nov 5, 2020

This PR resolves the issue:

[1604562797.006386] [prm-dgx-35:26200:0] cuda_ipc_cache.c:57   UCX  ERROR cuIpcCloseMemHandle((CUdeviceptr)region->mapped_addr)() failed: context is destroyed
[1604562797.007714] [prm-dgx-35:26199:0] cuda_ipc_cache.c:57   UCX  ERROR cuIpcCloseMemHandle((CUdeviceptr)region->mapped_addr)() failed: context is destroyed

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c: 2 resolved review threads (outdated)
Comment on lines 84 to 85
status = UCS_ERR_ALREADY_EXISTS;
goto out;
Contributor

Need to flatten these ifs:

if (cuerr == SUCCESS || cuerr == ALREADY_EXISTS) {
    status = UCS_OK;
} else {
    ucs_error();
    status = INVALID_PARAM;
}

Contributor

unhandled

Contributor Author

@yosefe status is UCS_OK by default in this function; we cannot return UCS_OK when cuerr == ALREADY_EXISTS, because functions which call uct_cuda_ipc_open_memhandle handle UCS_ERR_ALREADY_EXISTS differently from UCS_INVALID_PARAM (fatal error), which in turn is different from handling UCS_OK. So I'm not sure how we can flatten it to:

if (cuerr== SUCCESS || cuerr == ALREADY_EXISTS) {
    status == UCS_OK;
} else {
    status == UCS_INVALID_PARAM;
}

Contributor

can we flatten in a different way?

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c: 3 resolved review threads (outdated)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.h: 3 resolved review threads (outdated)
ucs_pgt_region_t *pgt_region;
uct_cuda_ipc_cache_region_t *region;

status = uct_cuda_ipc_get_rem_cache(pid, &cache);
Contributor

why do we remove remote cache entry?

Contributor Author

Sorry. rem = remote here. I've changed all instances to remove ambiguity.

} else if (khret == UCS_KH_PUT_KEY_PRESENT) {
*cache = kh_val(&uct_cuda_ipc_remote_cache.hash, khiter);
} else {
/* if (khret == UCS_KH_PUT_FAILED) */
Contributor

Should we remove or modify this comment, or make it an assert, instead of the "if" condition?

Contributor Author

Asserts get compiled out if it's not a debug build, right? If KH_PUT_FAILED occurs, one option is to raise a ucs_fatal error, but maybe we don't want to handle the failure here but at a higher layer. If we don't fail here, then ep_put/get_zcopy, which requests the memory mapping, would fail, and maybe UCP is in a better position to handle the failure (say, by switching to another protocol).

What happens today if ep_put/get_zcopy fails during rndv? Do we still try the am_bcopy protocol as a fallback?

Contributor

I meant to just change the comment here to plain wording instead of a code-like comment.

Contributor Author

Missed that. I'll remove the comment now.

@Akshay-Venkatesh
Contributor Author

@yosefe Can you lmk if your comments have been addressed when you get a chance?

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c: 2 resolved review threads (one outdated)
ucs_recursive_spinlock_init(&uct_cuda_ipc_map_lock, 0);
ucs_recursive_spinlock_init(&remote_cache.lock, 0);
kh_init_inplace(cuda_ipc_rem_cache, &remote_cache.hash);
ucs_recursive_spinlock_init(&uct_cuda_ipc_remote_cache.lock, 0);
Contributor

what is the difference between uct_cuda_ipc_map_lock and uct_cuda_ipc_remote_cache.lock ?

Contributor Author

uct_cuda_ipc_map_lock provides mutual exclusion for threads trying to open an IPC memory handle, whereas uct_cuda_ipc_remote_cache.lock ensures that no two threads access the hashtable of cuda_ipc cache structures (created per remote process per local cuda device) at the same time.

Contributor

why do we need uct_cuda_ipc_map_lock ? isn't cuda API already thread safe?

Contributor

@Akshay-Venkatesh we don't need uct_cuda_ipc_map_lock, right? The driver seems to already take a cuda context lock.

Contributor Author

@bureddy @yosefe It's safe for multiple threads to call cuIpcOpenMemHandle. We still want to avoid the case where two threads end up checking for reachability and cause a "memory handle already opened" error. For this we need mutual exclusion on the segment of code that 1. opens the remote memory handle, 2. updates the peer_accessibility table, and 3. closes the memory handle, so we need to use a lock around that segment. But the same lock does not seem to be needed here, because by that time the thread is guaranteed to have checked for reachability. I've removed the lock use in uct_cuda_ipc_open_memhandle for this reason.

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c: resolved review thread (outdated)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.h: resolved review thread (outdated)
@Akshay-Venkatesh
Contributor Author

@yosefe Have I addressed all your comments?


ucs_recursive_spin_lock(&uct_cuda_ipc_remote_cache.lock);

key.pid = pid;

Contributor

remove space line



err:
ucs_recursive_spin_unlock(&uct_cuda_ipc_remote_cache.lock);

Contributor

remove space line

Contributor

pls remove space line

src/uct/cuda/cuda_ipc/cuda_ipc_cache.h: resolved review thread (outdated)
@@ -37,15 +67,21 @@ struct uct_cuda_ipc_cache {
};


typedef struct uct_cuda_ipc_remote_cache {
khash_t(cuda_ipc_rem_cache) hash;
ucs_recursive_spinlock_t lock;
Contributor

Do we need a lock both in "struct uct_cuda_ipc_cache" and also here?
Maybe we can remove the lock from uct_cuda_ipc_cache?

Contributor Author

The attempt is to ensure that

  1. no two threads try to update a single rcache instance, using uct_cuda_ipc_cache
  2. no two threads try to update the hashtable storing the different rcache instances (one per GPU per remote process), using remote_cache.lock.

These two operations are independent, so grabbing the locks for each should be independent too. I.e., if one thread grabs rcache x for peer gpu x, another thread should be allowed to grab rcache y for peer gpu y, but two threads shouldn't be allowed to access rcache x simultaneously (at least for now we don't distinguish between read-only and read/write access to an rcache).

Contributor Author

@yosefe when possible, can we revisit this? I've addressed the concern around ipc_map_lock here: #5815 (comment)

@@ -153,7 +158,13 @@ static ucs_status_t uct_cuda_ipc_is_peer_accessible(uct_cuda_ipc_component_t *md
}
}

pthread_mutex_unlock(&uct_cuda_ipc_map_lock);

Contributor

remove space line

@Akshay-Venkatesh
Contributor Author

@yosefe For this #5815 (comment): the cuda calls are, but I'm using the ipc_map_lock to avoid the case where one thread that does a reachability check accidentally closes a memory handle that was opened by another thread for the purposes of a transfer.

@yosefe
Contributor

yosefe commented Dec 15, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@yosefe left a comment

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c: resolved review thread


@@ -127,18 +128,22 @@ static ucs_status_t uct_cuda_ipc_is_peer_accessible(uct_cuda_ipc_component_t *md
char* accessible;
CUdeviceptr d_mapped;

pthread_mutex_lock(&uct_cuda_ipc_map_lock);
Contributor

looks like this lock is protecting mdc->md->peer_accessible_cache

  1. it does not seem related to this PR
  2. it should be on the MD object

Contributor Author

That's correct. It's not related to this PR but it is still needed. Hope it's ok to have this as part of this PR.

I've moved the lock into the MD object.

Contributor

Maybe make it a spinlock and reduce the lock scope to only updating the array (move the cuda API calls outside the lock section)?

Contributor Author

Switched to a spinlock and reduced the scope. If the array element is in the init state (0xFF), we want a thread to query the CUDA API and update it; this write needs to be inside the lock section. Also, we want at most one thread to call cuIpcOpenMemHandle and update the array element from init to reachable or unreachable. We can't have multiple threads calling cuIpcOpenMemHandle because that can lead to an "already opened" error. With these two constraints, I couldn't think of a way to move the cuda API out of the lock section.

Contributor

since we probably don't care if more than one thread would update "accessible" at the same time, we don't even need a lock.
since a single-value memory read/write is atomic (i.e. it will return one of the values previously written, not garbage), it can be done like this:

ipc_md {
    ucs_ternary_value_t accessible[];
};
...
uct_cuda_ipc_is_peer_accessible()
{
    ucs_ternary_value_t is_accessible = *accessible;
    if (is_accessible == UCS_TRY) {
        cuIpc(...);
        is_accessible = (result == ...) ? UCS_YES : UCS_NO;
        *accessible = is_accessible;
    }
    return is_accessible == UCS_YES;
}

Contributor Author

since we probably don't care if more than one thread would update "accessible" at the same time, we don't even need a lock.

Agreed. We don't need a lock on the accessible array.

We still need to ensure that no more than one thread calls cuIpcOpenMemHandle. The following code doesn't prevent that:

uct_cuda_ipc_is_peer_accessible()
{
    ...
    ucs_ternary_value_t is_accessible = *accessible;
    if (is_accessible == UCS_TRY) {
        ...
    }
}

Two threads could end up calling OpenMemHandle on the same memory handle if they both see is_accessible as UCS_TRY, right? We may need all of the following executed atomically to disallow that:

if (is_accessible == UCS_TRY) {
    cuIpc(...);
    is_accessible = (result == ...) ? UCS_YES : UCS_NO;
    *accessible = is_accessible;
}

Don't we need a lock for this? Lmk if I've misunderstood something here.

Contributor

Why is 2 threads opening the same handle a problem?
It seems we treat the CUDA_ERROR_ALREADY_MAPPED status as success.

Contributor Author

@yosefe you're right, I ignored that aspect. This reminded me of another motivation for adding the lock, which was to avoid each thread opening the memory handle multiple times, since opening can take on the order of ms. Even though opening the memory handle is expensive, it's probably still ok not to have a lock, because the accessible element would be set to UCS_YES soon enough that not many threads will end up opening the handle anyway. I'll make changes based on your code suggestion above.


ucs_error("unable to use cuda_ipc remote_cache hash");
status = UCS_ERR_NO_RESOURCE;
}
err:
Contributor

err -> err_unlock

src/uct/cuda/cuda_ipc/cuda_ipc_md.c: resolved review thread (outdated)
Comment on lines 339 to 341
uct_cuda_ipc_memset(md->peer_accessible_cache, UCS_TRY,
num_devices * md->uuid_map_capacity,
sizeof(ucs_ternary_value_t));
Contributor

Pass a ucs_ternary_value_t* and a length to this function, and rename it to uct_cuda_ipc_accessible_cache_init.
BTW, maybe this cache can be initialized to NULL/capacity=0, and let the uct_cuda_ipc_get_unique_index_for_uuid function allocate it on first use?

Contributor Author

@yosefe I've addressed the above two suggestions. I also ended up calling the cuda_ipc cache insertion as part of the reachability check. This saves one OpenMemHandle call and also prevents a multi-threaded issue that I've documented. Lmk if the changes look ok.

Contributor Author

Also, I'm not sure why I'm seeing these errors on CI but not locally.

In file included from /scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:11:0:
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.h:23:5: error: unknown type name 'ucs_ternary_value_t'
     ucs_ternary_value_t *peer_accessible_cache;
     ^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c: In function 'uct_cuda_ipc_accessible_cache_init':
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:77:28: error: 'ucs_ternary_value_t' undeclared (first use in this function)
     size_t offset = sizeof(ucs_ternary_value_t);
                            ^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:77:28: note: each undeclared identifier is reported only once for each function it appears in
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:78:26: error: 'p' undeclared (first use in this function)
     ucs_ternary_value_t *p;
                          ^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:79:5: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
     size_t i;
     ^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c: In function 'uct_cuda_ipc_get_unique_index_for_uuid':
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:110:40: error: 'ucs_ternary_value_t' undeclared (first use in this function)
                                 sizeof(ucs_ternary_value_t);
                                        ^
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c: In function 'uct_cuda_ipc_is_peer_accessible':
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.c:146:5: error: unknown type name 'ucs_ternary_value_t'
     ucs_ternary_value_t *accessible;
     ^
cc1: all warnings being treated as errors

Contributor Author

@yosefe Remaining errors seem unrelated. Can you review the new changes when possible?

@yosefe
Contributor

yosefe commented Feb 2, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Akshay-Venkatesh
Contributor Author

@yosefe I see that failing tests are complaining about ucs_ternary_value_t again. Any suggestions on this? Also, do the latest commits look good and address #5815 (comment)?

@yosefe
Contributor

yosefe commented Feb 8, 2021

/scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.h:22:5: error: unknown type name 'ucs_ternary_value_t'

Looks like some include is missing.

Comment on lines 126 to 127
uct_cuda_ipc_accessible_cache_init(md->peer_accessible_cache + original_cache_size,
new_capacity - original_capacity);
Contributor

I think we don't need this as a separate function; we can just expand the code here.

@Akshay-Venkatesh
Contributor Author

/scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.h:22:5: error: unknown type name 'ucs_ternary_value_t'
looks like some include is missing

@yosefe I had config/types.h included earlier but it didn't matter so I removed the header. Now I added it back but I still see the same error. I don't see any local build failures. Maybe there's something else missing.

@yosefe
Contributor

yosefe commented Feb 8, 2021

@yosefe I had config/types.h included earlier but it didn't matter so I removed the header. Now I added it back but I still see the same error. I don't see any local build failures. Maybe there's something else missing.

pls use ucs_ternary_auto_value_t

Comment on lines 119 to 113
buffer = md->peer_accessible_cache + original_cache_size;
num_elems = new_capacity - original_capacity;

for (elem = 0; elem < num_elems; elem++) {
p = UCS_PTR_BYTE_OFFSET(buffer, elem * offset);
*p = UCS_TRY;
}
Contributor

for (i = original_cache_size; i < new_capacity; ++i) {
    md->peer_accessible_cache[i] = UCS_TRY;
}

Contributor

@yosefe left a comment

pls squash

8 participants