DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 #15477
Merged
Can you explain how an RPC can be validly completed at this point?
From my view, to complete, we would need to send it via rc = crt_req_send_internal(rpc_priv); at line 1515, but that call should either return rc on error, or send the RPC and invoke a completion callback. Your check here implies that somehow both can happen.
One note to make here: until now we only used crt_rpc_completed() to detect bugs in the ref counting logic or elsewhere. I don't think we should start using this function to make ref counting decisions; we should fix the underlying ref count problem instead.
Here are the call steps according to my test logs:
crt_req_send() => crt_corpc_req_hdlr() => crt_tree_get_children()
That hit a failure and jumped to forward_done via goto. Then:
crt_corpc_complete() => crt_rpc_complete_and_unlock() => RPC_DECREF()
So when crt_corpc_req_hdlr() returns to its caller crt_req_send(), the RPC has already been completed with a single reference left. If we still call another two RPC_DECREF() at the end of crt_req_send() for cleanup, the last RPC_DECREF() will trigger an assertion.
Hmm. Does this issue only happen on the issuer of the corpc, or on the intermediate nodes that are part of the corpc tree as well? I recall there is some special handling for 'am_root' in corpc completion, and it sounds like the ref count might be going wrong in that place as a result (in crt_corpc_complete()).
If you have a good reproducer, are you able to instrument the ADDREF/DECREF macros and add prints to track what is happening to the ref counts in your failure case?
As to the usage of 'crt_rpc_completed(rpc_priv)': it's not safe. If the RPC is completed, then access to rpc_priv is no longer valid, as rpc_priv should already be freed once the RPC is completed and the ref count is 0. As mentioned before, it was previously only used to detect bugs in the ref count logic.
Currently, I have only seen the failure happen on the CORPC tree root. I am not sure whether there are other corner cases. I traced the RPC reference count: after crt_rpc_complete_and_unlock() returned to crt_corpc_complete(), the reference count was 2. Then crt_corpc_complete() called another RPC_DECREF(), so the reference count became 1. At that time, the RPC was still valid, so it is safe to check crt_rpc_completed() in crt_req_send().
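The reference counts traced above can be modeled with a small stand-alone sketch. This is not CaRT code: struct toy_rpc and the toy_* functions are hypothetical stand-ins, the starting count of 3 is inferred from the counts reported in this thread, and the "guard" only approximates the idea of skipping one cleanup release when the RPC has already been completed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an RPC handle; not the real struct crt_rpc_priv. */
struct toy_rpc {
	int  refs;       /* outstanding references */
	bool completed;  /* set once the completion path has run */
	int  underflows; /* releases attempted with no reference left */
};

/* Drop one reference; count an underflow instead of asserting. */
static void toy_decref(struct toy_rpc *rpc)
{
	if (rpc->refs == 0) {
		rpc->underflows++;
		return;
	}
	rpc->refs--;
}

/* Models the failing collective path on the tree root. */
static void toy_corpc_fail(struct toy_rpc *rpc)
{
	rpc->completed = true;
	toy_decref(rpc); /* release during completion: 3 -> 2 */
	toy_decref(rpc); /* extra release in the corpc completion path: 2 -> 1 */
}

/* Models the error cleanup at the end of the send path. */
static void toy_send(struct toy_rpc *rpc, bool guard)
{
	toy_corpc_fail(rpc);
	/* Assumption: with the guard, one release is skipped if already completed. */
	if (!(guard && rpc->completed))
		toy_decref(rpc);
	toy_decref(rpc);
}

/* Returns how many releases were attempted without a reference. */
static int toy_run(bool guard)
{
	struct toy_rpc rpc = { .refs = 3, .completed = false, .underflows = 0 };

	toy_send(&rpc, guard);
	return rpc.underflows;
}
```

In this model, toy_run(false) reports one release with no reference left, which corresponds to the assertion described above, while toy_run(true) releases exactly the three references. The completed flag is read while one reference is still held, which is the condition the safety argument in this thread relies on.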
I think it is fine to land it as is to fix the problem first. Later I'll look more closely to see whether some details can be refined. Thanks.
@frostedcmos, what do you think? Currently, this issue often causes CR20 to fail during master/b26 daily tests. I prefer to land a simple fix first. There may be more work needed in the related CRT logic, but we can improve that later. Thanks!
I still don't like the fact that crt_rpc_completed() also sets the completed bit on its own, as that can later lead to unintended behavior if the function is used in other places. However, with the current usages it's OK to land this as a fix.
I think the better solution would be to refine completion logic for the error case when am_root==true, but we can address it later.
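To illustrate the concern about a check that also sets the bit, here is a minimal sketch contrasting a test-and-set style helper with a pure query. The names struct toy_flag, toy_completed_test_and_set() and toy_is_completed() are hypothetical, locking is omitted, and the behavior is assumed only from the description in this thread, not copied from the CaRT source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical completion flag; the real code would also need locking. */
struct toy_flag {
	bool completed;
};

/* Test-and-set: reports the previous state and marks the flag completed. */
static bool toy_completed_test_and_set(struct toy_flag *f)
{
	bool was_completed = f->completed;

	f->completed = true;
	return was_completed;
}

/* Pure query: reports the state without changing it. */
static bool toy_is_completed(const struct toy_flag *f)
{
	return f->completed;
}
```

With a fresh flag, the first call to toy_completed_test_and_set() returns false but leaves the flag set, so any later check sees "completed" even though no completion path ran. A caller that only wants to ask the question would need something like the pure query instead.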
I am not against adjusting the crt_rpc_completed() related logic. That is not directly related to the current CR test failure in DAOS-16179, so we can do it in another independent patch. As I understand it, a network expert will do that, right?
BTW, please help review the master version (#15476) to follow our landing process. Thanks!
We can consider merging this with #15572 later.