-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/UGNI: Start using spinlocks to protect critical structures #1494
UCT/UGNI: Start using spinlocks to protect critical structures #1494
Conversation
Test PASSed. |
Test PASSed. |
@shamisp Could you review? |
v1.3, right ? |
@shamisp Yes, this a bit much of a change for 1.2. |
@MattBBaker it is actually up to you. I think these changes can be considered as a thread safety bugfix. If you think it is important for ORNL to have it in v1.2 we can push. |
@shamisp This is part of a larger set of patches coming down the pipe , with some more rearrangements and potential performance impacts. I think that saying it's 1.3 is safer. Though I think that we should also talk about what kind of release time frame there is for 1.3. |
@shamisp Also, don't merge yet, I just found a bug. |
55f8146
to
e88b6c6
Compare
@shamisp Fixed now. I guess gcc doesn't check the parameter list for empty macros. |
it does not. Ready for review ?
…On Thu, May 11, 2017 at 10:34 AM, MattBBaker ***@***.***> wrote:
@shamisp <https://github.com/shamisp> Fixed now. I guess gcc doesn't
check the parameter list for empty macros.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1494 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ACIe2N_8Xbs86PvMCbNkHOL7VFl4Jf1Xks5r4ypqgaJpZM4NXMNw>
.
|
Yes. |
Test PASSed. |
Test PASSed. |
@@ -53,7 +53,7 @@ static ucs_status_t uct_ugni_smsg_mbox_reg(uct_ugni_smsg_iface_t *iface, uct_ugn | |||
} | |||
pthread_mutex_lock(&uct_ugni_global_lock); | |||
|
|||
ugni_rc = GNI_MemRegister(iface->super.nic_handle, (uint64_t)address, | |||
ugni_rc = GNI_MemRegister(uct_ugni_iface_nic_handle(&iface->super), (uint64_t)address, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious - what is the motivation behind uct_ugni_iface_nic_handle() ? type more :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's most useful in UDT, where changes to the structure (like when I was deciding where that field should land in the final structure layout) required changes in a lot of place. This way, change it once and everything compiles fine. I can condense the name. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay... you are the one that will be typing this
@@ -479,8 +466,6 @@ static UCS_CLASS_INIT_FUNC(uct_ugni_udt_iface_t, uct_md_h md, uct_worker_h worke | |||
ucs_warn("GNI_EpDestroy failed, Error status: %s %d", | |||
gni_err_str[ugni_rc], ugni_rc); | |||
} | |||
clean_iface_activate: | |||
ugni_deactivate_iface(&self->super); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did the cleanup go ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a different commit that apparently I failed to include. Oops, will rectify that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I noticed that the clean up flow disappeared from a few places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Story is that there are multiple related commits, together they are massive so I broke them up. The clean up path was fixed in the next PR, but I need to fix it for this one.
src/uct/ugni/base/ugni_iface.h
Outdated
static inline int uct_ugni_check_device_type(uct_ugni_iface_t *iface, gni_nic_device_t type) | ||
{ | ||
uct_ugni_device_t *dev = uct_ugni_iface_device(iface); | ||
return dev->type == type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool instead of int ?
Also I can be a single line.
GNI_CQ_NOBLOCK, | ||
NULL, NULL, &self->local_cq); | ||
if (GNI_RC_SUCCESS != ugni_rc) { | ||
ucs_error("GNI_CqCreate failed, Error status: %s %d", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somebody has to clean CDM ?
Test PASSed. |
Test PASSed. |
Test FAILed. |
src/uct/ugni/udt/ugni_udt_iface.h
Outdated
@@ -101,4 +101,8 @@ static inline int uct_ugni_udt_ep_any_post(uct_ugni_udt_iface_t *iface) | |||
return UCS_OK; | |||
} | |||
|
|||
static inline gni_nic_handle_t uct_ugni_udt_iface_nic_handle(uct_ugni_udt_iface_t *iface) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a macro to me..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like an inline function. The one just pushed one looks like a macro.
@hppritcha if you have cycles, please take a look. |
Test PASSed. |
38e5ea0
to
6a82d5c
Compare
Test PASSed. |
ORNL jenkins failure is an mlx failure that is already known. Not relevant to this PR. |
Test PASSed. |
bot:ornl:retest |
src/uct/ugni/base/ugni_device.c
Outdated
if (GNI_RC_SUCCESS != ugni_rc) { | ||
ucs_error("GNI_CdmAttach failed (domain id %d, %d), Error status: %s %d", | ||
cdm->domain_id, ugni_domain_counter, gni_err_str[ugni_rc], ugni_rc); | ||
GNI_CdmDestroy(cdm->cdm_handle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No status check for the return code
@shamisp Updated. Should be good to merge now yes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please review the rest of the code for similar issues
} | ||
return status; | ||
clean_cq: | ||
GNI_CqDestroy(self->local_cq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log error code
ucs_mpool_cleanup(&self->free_desc, 1); | ||
ucs_mpool_cleanup(&self->free_mbox, 1); | ||
GNI_CqDestroy(self->remote_cq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please log the error
Test PASSed. |
{ | ||
ucs_arbiter_cleanup(&iface->arbiter); | ||
ucs_mpool_cleanup(&iface->flush_pool, 1); | ||
GNI_CqDestroy(iface->local_cq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shamisp Thing is, the next PR pulls out all of the GNI_Cq* calls and has common spin locked code paths with error logging. I'd like to fix these problems in the next PR instead.
@@ -180,7 +180,7 @@ ucs_status_t uct_ugni_smsg_ep_connect_to_ep(uct_ep_h tl_ep, | |||
} | |||
|
|||
ep_hash = (uint32_t)iface_addr->ep_hash; | |||
gni_rc = GNI_EpSetEventData(ep->super.ep, iface->domain_id, ep_hash); | |||
gni_rc = GNI_EpSetEventData(ep->super.ep, iface->cdm.domain_id, ep_hash); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems in some places we use gni_rc and another ugni_rc. IMHO should be consistent. (separate PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a plan.
Test FAILed. |
Opened Issue #1529 |
This PR is deceptively big. A lot of it is shuffling a few big functions and rearranging some data structures to allow a shared thread safe code path. I couldn't find a useful way of splitting it up more.
What it does is add a spin lock to the device, and that is what will get used to do locking in UGNI. This lock is then used to attach the CDM to the device. The cdm device fields are then merged together into a shared struct.