prov/shm,rxm,efa, fabtests: fix min multi recv size setting #10618
shm uses the util srx and sets the minimum multi receive size through the srx. However, the srx is not initialized until the endpoint is enabled, so if the application calls setopt before FI_ENABLE, the call segfaults. Instead, save the multi recv size in the shm endpoint so that it is valid during setopt, then pass it into the util_srx creation to set the multi recv size.
Signed-off-by: Alexia Ingerson <[email protected]>
rxm uses the util srx and sets the minimum multi receive size through the srx. However, the srx is not initialized until the endpoint is enabled, so if the application calls setopt before FI_ENABLE, the call segfaults. Instead, save the multi recv size in the rxm endpoint so that it is valid during setopt, then pass it into the util_srx creation to set the multi recv size.
Signed-off-by: Alexia Ingerson <[email protected]>
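The pattern described in these commit messages can be sketched in isolation. This is a minimal model under stated assumptions, not the provider code: `mock_ep`, `mock_srx`, and the three functions are hypothetical stand-ins for the provider endpoint, the util srx, and the setopt/enable paths. The point is only that setopt writes to a field that always exists, and the srx receives the value when it is created.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the libfabric structures. */
struct mock_srx {
	size_t min_multi_recv_size;
};

struct mock_ep {
	size_t min_multi_recv_size;	/* saved at setopt time */
	struct mock_srx *srx;		/* NULL until mock_ep_enable() */
};

/* Store the option in the endpoint. The buggy variant dereferenced
 * ep->srx here, which is still NULL before enable. */
static int mock_ep_setopt(struct mock_ep *ep, const void *optval)
{
	assert(ep && optval);
	ep->min_multi_recv_size = *(const size_t *) optval;
	return 0;
}

/* Create the srx and hand it the value saved by setopt. */
static int mock_ep_enable(struct mock_ep *ep)
{
	assert(ep);
	if (ep->srx)
		return 0;	/* already enabled */

	ep->srx = malloc(sizeof(*ep->srx));
	if (!ep->srx)
		return -1;

	ep->srx->min_multi_recv_size = ep->min_multi_recv_size;
	return 0;
}

static void mock_ep_close(struct mock_ep *ep)
{
	assert(ep);
	free(ep->srx);
	ep->srx = NULL;
}
```

Calling `mock_ep_setopt()` on an endpoint whose `srx` is still NULL is safe in this model, and the saved size shows up in the srx only after `mock_ep_enable()`.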
    if (level != FI_OPT_ENDPOINT)
        return -FI_ENOPROTOOPT;

    if (optname == FI_OPT_MIN_MULTI_RECV) {
-       srx = smr_ep->srx->ep_fid.fid.context;
-       srx->min_multi_recv_size = *(size_t *)optval;
+       smr_ep->min_multi_recv_size = *(size_t *)optval;
I think the EFA provider has the same bug: I copied it from shm :(
https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/rdm/efa_rdm_ep_fiops.c#L1666-L1667
@shijin-aws Oops! Sorry!! I will add the fix to efa as well
Reading the EFA code, it seems enough to remove L1666 and L1667
@aingerson Would you mind replacing your efa commit with shijin-aws@9eb3c79? I fixed a bug in our unit test that kept this bug from being caught earlier. Thanks!
Intel CI failed with fi_multi_recv. The reason is that the test calls fi_setopt() after enabling the EP, which won't be effective with the new code.
Yikes, that explains why the bug was not discovered earlier.
AWS CI failure should be fixed after ingesting shijin-aws@9eb3c79
efa uses the util srx and sets the minimum multi receive size through the srx. However, the srx is not initialized until the endpoint is enabled, so if the application calls setopt before FI_ENABLE, the call segfaults. Instead, save the multi recv size in the efa endpoint so that it is valid during setopt, then pass it into the util_srx creation to set the multi recv size.
Signed-off-by: Alexia Ingerson <[email protected]>
Signed-off-by: Shi Jin <[email protected]>
… enable
fi_setopt has to be called before enabling an endpoint. This adds an opt arg so that the option can be set in the common code like the other EP options.
Signed-off-by: Alexia Ingerson <[email protected]>
Man, that was just one issue after another. This new version should fix it now: I picked your patch @shijin-aws and I also added a patch in fabtests to call setopt before enable instead of after.
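For readers following the CI failure, the ordering effect can be illustrated with a small model. This is a sketch under assumptions, not fabtests or provider code: `model_ep`, `model_setopt`, `model_enable`, `model_run`, and `MODEL_DEFAULT_SIZE` are hypothetical names, with the two calls standing in for fi_setopt() and enabling the endpoint. It assumes, as reported above, that the size in effect is taken from the saved value at enable time.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_DEFAULT_SIZE 64	/* hypothetical default, for illustration only */

struct model_ep {
	size_t saved_size;	/* written by setopt */
	size_t active_size;	/* used by the receive path; fixed at enable */
	int enabled;
};

static void model_setopt(struct model_ep *ep, size_t size)
{
	assert(ep);
	ep->saved_size = size;
}

static void model_enable(struct model_ep *ep)
{
	assert(ep);
	if (ep->enabled)
		return;
	ep->active_size = ep->saved_size;
	ep->enabled = 1;
}

/* Run the two calls in the requested order and report the size in effect. */
static size_t model_run(int setopt_first, size_t size)
{
	struct model_ep ep = { MODEL_DEFAULT_SIZE, 0, 0 };

	if (setopt_first) {
		model_setopt(&ep, size);
		model_enable(&ep);
	} else {
		model_enable(&ep);
		model_setopt(&ep, size);	/* too late: only saved_size changes */
	}
	return ep.active_size;
}
```

In this model, setting the option first yields the requested size, while setting it after enable leaves the default in effect, which mirrors why the test had to move its setopt call ahead of enable.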
Fixes #10591