Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setting the rx capabilites and tx capabilites to 0 #1080

Merged
merged 3 commits into from
Feb 24, 2023

Conversation

tmh97
Copy link
Collaborator

@tmh97 tmh97 commented Jan 20, 2023

Signed-off-by: Thomas Huber [email protected]

In transport_ofi.c there are two instances of behavior that might be a small violation of the libfabric man pages. Both instances are related to setting the capabilites (caps) for tx_attr->caps and rx_attr->caps

  1. In function shmem_trasnport_ofi_target_ep_init(void)
  • The general caps are set to info->p_info->caps = FI_RMA | FI_ATOMIC | FI_REMOTE_READ | FI_REMOTE_WRITE which overwrites whatever what returned in call to get_info()
  • Meanwhile, the info->p_info->tx_attr->caps are still set to whatever was returned by get_info()
  • Following this, calls to fi_endpoint(), fi_cq_open_fi_ep_bind(), and fi_enable() are made
  • In the manpages for fi_endpoint() https://ofiwg.github.io/libfabric/v1.12.0/man/fi_endpoint.3.html under the section Transmit Context Attributes is the following: The capabilities must be a subset of those requested of the associated endpoint
  • Also stated in the manpages is: "The following capabilities apply to the transmit attributes: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND, FI_HMEM, FI_TRIGGER, FI_FENCE, FI_MULTICAST, FI_RMA_PMEM, FI_NAMED_RX_CTX, and FI_COLLECTIVE."
  • I believe the proper way to resolve this manpage violation is to set the tx_attr->caps to 0, resulting in tx_attr->caps being a subset of the general caps
  • Alternatively, another option would be to bitwise OR the capabilities that apply to transmit attributes (FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND, FI_HMEM, FI_TRIGGER, FI_FENCE, FI_MULTICAST, FI_RMA_PMEM, FI_NAMED_RX_CTX, and FI_COLLECTIVE) with the capabilities that SOS sets in this function ( FI_RMA | FI_ATOMIC | FI_REMOTE_READ | FI_REMOTE_WRITE) which would result in tx_attr->caps = FI_RMA | FI_ATOMIC
  1. In function shmem_transport_ofi_ctx_init(shmem_transport_ctx_t *ctx, int id) the (almost) exact same situation exists
  • The general caps are set to info->p_info->caps = FI_RMA | FI_WRITE | FI_READ | FI_ATOMIC; which is slightly different than the behavior of the above function
  • The same violation occurs
  • I believe the proper way to resolve this manpage violation is to set the tx_attr->caps to 0, resulting intx_attr->capsbeing a subset of the general caps
  • Alternatively, another option would be to bitwise OR the the capabilities that apply to transmit attributes (FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND, FI_HMEM, FI_TRIGGER, FI_FENCE, FI_MULTICAST, FI_RMA_PMEM, FI_NAMED_RX_CTX, and FI_COLLECTIVE) with the capabilities that SOS sets in this function ( FI_RMA | FI_WRITE | FI_READ | FI_ATOMIC) which would result in tx_attr->caps = FI_RMA | FI_ATOMIC | FI_READ | FI_WRITE slightly different from behavior of the above function

@davidozog davidozog added this to the 1.5.2 milestone Feb 7, 2023
Copy link
Member

@davidozog davidozog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tmh97!

I noticed a problem on a certain provider we need to support and think your alternative suggestion to request the relevant subset of caps on the endpoint might be better, at least for now until we can assure providers of interest handle the 0 initialization as expected (the docs say the caps subset is inherited for fi_getinfo... but perhaps not explicitly for fi_endpoint). Anyway, let me know what you think.

src/transport_ofi.c Outdated Show resolved Hide resolved
src/transport_ofi.c Outdated Show resolved Hide resolved
@davidozog davidozog requested a review from wrrobin February 9, 2023 15:43
@davidozog
Copy link
Member

BTW, I think issue #148 is related to this PR... I'm a bit nervous to play with it now, but I think we should try to fill the tx_attrs in fi_getinfo soon, possibly in a v1.5.3 release.

@tmh97
Copy link
Collaborator Author

tmh97 commented Feb 9, 2023

BTW, I think issue #148 is related to this PR... I'm a bit nervous to play with it now, but I think we should try to fill the tx_attrs in fi_getinfo soon, possibly in a v1.5.3 release.

Sounds like a good idea :) will keep this in mind for the future.

@tmh97
Copy link
Collaborator Author

tmh97 commented Feb 16, 2023

@davidozog Hey Dave, just finished testing (ran every test bucket twice with 3 different combinations of ppn/rank). Everything looks good to go on Cornelis's end.

@davidozog
Copy link
Member

Thanks @tmh97 - For completeness, we just need to check another provider or two, but I'm more confident in the (non-zero) flags now. Please feel free to make those changes on this PR and we'll be ready to merge shortly.

tmh97 and others added 2 commits February 16, 2023 16:45
Instead of zeroing caps, we will set them to a relevant subset of the p_info->caps.

Co-authored-by: David Ozog <[email protected]>
Instead of zeroing caps, we will set them to a relevant subset of the p_info->caps.

Co-authored-by: David Ozog <[email protected]>
@tmh97
Copy link
Collaborator Author

tmh97 commented Feb 16, 2023

Thanks @tmh97 - For completeness, we just need to check another provider or two, but I'm more confident in the (non-zero) flags now. Please feel free to make those changes on this PR and we'll be ready to merge shortly.

Changes made :)

Copy link
Member

@davidozog davidozog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you ready to merge this today @tmh97?

@tmh97
Copy link
Collaborator Author

tmh97 commented Feb 24, 2023

Are you ready to merge this today @tmh97?

Definitely! Let's do it.

@tmh97
Copy link
Collaborator Author

tmh97 commented Feb 24, 2023

FYI i'm not authorized so will need your help to merge :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants