core/logging: Fix race in writing to vrb_ep_ops #10588
'vrb_ep_ops' is a static object whose 'ops_open' field gets overridden in 'vrb_open_ep'. If multiple threads perform this operation at once, which is entirely possible, we end up with a data race. Switching it to thread-local storage fixes it. Signed-off-by: Dariusz Sciebura <[email protected]>
There are a bunch of static variables spread all over the place. IMHO they should all be marked as _Thread_local: even if they are not overridden anywhere now, they could be in the future, leading to another race. If you agree, just let me know and I can modify my PR. Otherwise, let's stick to this version.
It is, in the sense that it generally solves the same issue. However, what I am promoting here is a general rule of marking static globals with _Thread_local, as they are probably still used concurrently in several places and will probably be used that way in the future. IMO the best solution would be to rethink the architecture and refactor the existing code, but as a quick fix, adding thread locality to variables like the one mentioned may help fix a lot of UB pretty much for free.
I accidentally found another spot with an identical bug, this time in xnet_ep.c. The line:
@shefty - counting on your voice!
The function pointers here must be set per endpoint that's created, not per thread. If the same thread creates 2 endpoints, the second endpoint should not change the function pointers for the first endpoint. The original change that modified ops_open is what's wrong and needs to be changed. We should just need this:
The xnet change should be similar. If a provider wants to customize a function pointer for an individual object, they need to dynamically allocate the struct fi_xxx_ops structure, which is almost certainly a bad idea and shouldn't be done.
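To make the per-endpoint versus per-thread point concrete, here is a minimal sketch. It is not the snippet omitted above and not libfabric code: every name in it (`demo_ep`, `demo_ops`, `create_patching`, `create_selecting`) is hypothetical. It assumes a provider that needs two variants of `ops_open`. It contrasts patching a single `_Thread_local` table during creation with selecting one of two immutable tables.

```c
#include <stddef.h>

/* Hypothetical types for illustration only; not the libfabric API. */
struct demo_ep;
typedef int (*demo_open_fn)(struct demo_ep *ep);

struct demo_ops { demo_open_fn ops_open; };
struct demo_ep  { const struct demo_ops *ops; };

static int open_a(struct demo_ep *ep) { (void)ep; return 1; }
static int open_b(struct demo_ep *ep) { (void)ep; return 2; }

/* Problematic pattern: one mutable table, patched at creation time.
 * _Thread_local removes the cross-thread race, but every endpoint
 * created on the same thread still shares this one table. */
static _Thread_local struct demo_ops shared_ops;

void create_patching(struct demo_ep *ep, int use_b)
{
    shared_ops.ops_open = use_b ? open_b : open_a;
    ep->ops = &shared_ops;
}

/* Alternative: one immutable table per variant, chosen at creation.
 * Nothing is written after initialization, so there is no race and
 * no endpoint can change another endpoint's function pointers. */
static const struct demo_ops ops_a = { .ops_open = open_a };
static const struct demo_ops ops_b = { .ops_open = open_b };

void create_selecting(struct demo_ep *ep, int use_b)
{
    ep->ops = use_b ? &ops_b : &ops_a;
}
```

With `create_patching`, creating a second endpoint with `use_b = 1` silently switches the first endpoint to `open_b` as well, even on a single thread. With `create_selecting`, each endpoint keeps the table it was created with.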
The correct fix is #10571. Having static structs hold the function pointers is desirable. Consider an application which allocates 10k endpoints which are nearly identical. Each endpoint has a single pointer to a static struct, where the function pointers are stored. When an app calls fi_send(), fi_write(), etc., those functions are inline wrappers which access the correct underlying function. The benefit is that the size of each endpoint only increases by 1 pointer, rather than by the size of the entire struct. It also allows adding new functions to the end of the struct while maintaining backwards compatibility. That wouldn't be possible if the structs were embedded directly into the user-visible object. The cost is that 2 pointers must be dereferenced when making a function call, rather than only 1.
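The layout described above can be sketched roughly as follows. This is a simplified illustration, not the actual libfabric definitions: `demo_ep`, `demo_ops`, `demo_send`, and the `size` field are hypothetical names chosen for the example.

```c
#include <stddef.h>

/* Hypothetical names throughout; not the libfabric API. */
struct demo_ep;

struct demo_ops {
    size_t size;                                 /* size of this table */
    int (*send)(struct demo_ep *ep, int value);
    /* New entries can be appended here without changing
     * sizeof(struct demo_ep). */
};

struct demo_ep {
    const struct demo_ops *ops;                  /* one pointer per endpoint */
    int last_sent;
};

static int demo_send_impl(struct demo_ep *ep, int value)
{
    ep->last_sent = value;
    return 0;
}

/* One read-only table shared by every endpoint of this kind. */
static const struct demo_ops demo_ep_ops = {
    .size = sizeof(struct demo_ops),
    .send = demo_send_impl,
};

void demo_ep_init(struct demo_ep *ep)
{
    ep->ops = &demo_ep_ops;
    ep->last_sent = -1;
}

/* Inline wrapper: two dereferences (ep->ops, then ops->send). */
static inline int demo_send(struct demo_ep *ep, int value)
{
    return ep->ops->send(ep, value);
}
```

Because the table is `const` and shared, 10k endpoints cost 10k pointers rather than 10k copies of the table, and no endpoint can modify the entries another endpoint sees.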