DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 #15477
Ticket title is 'recovery/cat_recov_core.py:CatRecovCoreTest.test_daos_cat_recov_core - server was not found in its expected state - 17 TEST(S) FAILED'
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/1/testReport/
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15477/1/execution/node/1349/log
6548ecc
to
3af43c9
Compare
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/2/testReport/
3af43c9
to
766d811
Compare
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/3/testReport/
52ba27a
to
bb8a0c2
Compare
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/5/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/6/testReport/
For a collective RPC, when handling failure cases during crt_req_send(), its reference may already have been released via crt_rpc_complete_and_unlock(), which is triggered by crt_corpc_complete(). In that case, we should check whether the RPC has completed before calling RPC_DECREF(), to avoid releasing the RPC reference repeatedly.
The patch also initializes some local variables for the CHK RPC to avoid accessing invalid DRAM when handling a failed collective CHK RPC.
It also includes some enhancements to the CR test logic.
Test-tag: test_daos_cat_recov_core
Allow-unstable-test: true
Signed-off-by: Fan Yong <[email protected]>
bb8a0c2
to
0df56ec
Compare
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/7/testReport/
With the patch, the CR tests passed on CI. But it seems only the CR tests ran this cycle, so let's trigger another CI run for the others.
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/8/testReport/
Known NLT failure because of DAOS-16787, not related to the patch.
@@ -1537,7 +1537,7 @@ crt_req_send(crt_rpc_t *req, crt_cb_t complete_cb, void *arg)
 			/* failure already reported through complete cb */
 			if (complete_cb != NULL)
 				rc = 0;
-		} else {
+		} else if (!crt_rpc_completed(rpc_priv)) {
Can you explain how an rpc can be validly completed at this point?
From my view, to complete, we would need to send it via line 1515, rc = crt_req_send_internal(rpc_priv);, but that call should either return rc on error, or send the RPC and invoke a completion cb. Your check here implies that somehow both can happen.
One note to make here: until now we have only used crt_rpc_completed() to detect bugs in the ref counting logic or elsewhere. I don't think we should start using this function to decide ref counting; we should fix the underlying ref count problem instead.
From my view, to complete, we would need to send it via line 1515, rc = crt_req_send_internal(rpc_priv);, but that call should either return rc on error, or send the RPC and invoke a completion cb. Your check here implies that somehow both can happen.
Here are the call steps according to my test logs:
crt_req_send() => crt_corpc_req_hdlr() => crt_tree_get_children(), which hits a failure and jumps to forward_done:
int
crt_corpc_req_hdlr(struct crt_rpc_priv *rpc_priv)
{
	...
	rc = crt_tree_get_children(co_info->co_grp_priv, co_info->co_grp_ver,
				   rpc_priv->crp_flags &
				   CRT_RPC_FLAG_FILTER_INVERT,
				   co_info->co_filter_ranks,
				   co_info->co_tree_topo, co_info->co_root,
				   co_info->co_grp_priv->gp_self,
				   &children_rank_list, &ver_match);
	if (rc != 0) {
		RPC_CERROR(crt_quiet_error(rc), DB_NET, rpc_priv,
			   "crt_tree_get_children(group %s) failed: "DF_RC"\n",
			   co_info->co_grp_priv->gp_pub.cg_grpid, DP_RC(rc));
		crt_corpc_fail_parent_rpc(rpc_priv, rc);
		D_GOTO(forward_done, rc);
	}
	...
forward_done:
	if (rc != 0 && rpc_priv->crp_flags & CRT_RPC_FLAG_CO_FAILOUT) {
		crt_corpc_complete(rpc_priv);
		goto out;
	}
	...
}
Then crt_corpc_complete() => crt_rpc_complete_and_unlock() => RPC_DECREF().
So when crt_corpc_req_hdlr() returns to its caller crt_req_send(), the RPC has already been completed with a single reference left. If we still call another two RPC_DECREF() at the end of crt_req_send() for cleanup, the last RPC_DECREF() will trigger the assertion:
#define RPC_DECREF(RPC) \
do { \
int __ref; \
__ref = atomic_fetch_sub(&(RPC)->crp_refcount, 1); \
D_ASSERTF(__ref != 0, "%p decref from zero\n", (RPC)); \
Hmm. Does this issue only happen on the issuer of the corpc, or on the intermediate nodes that are part of the corpc tree as well? I recall there is some special handling for 'am_root' in corpc completion, and it sounds like the ref count might be going wrong in that place as a result (in crt_corpc_complete()).
If you have a good reproducer, are you able to instrument the ADDREF/DECREF macros and add prints to track what is happening to the ref counts in your failure case?
As to the usage of 'crt_rpc_completed(rpc_priv)': it's not safe. If the RPC is completed, then access to rpc_priv is no longer valid, as rpc_priv should already be freed once the RPC is completed and the ref count is 0. As mentioned before, it was previously only used to detect bugs in the ref count logic.
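The instrumentation suggested above could look roughly like the following. This is a minimal sketch and not DAOS code: struct traced_obj, TRACED_ADDREF, TRACED_DECREF and traced_demo are hypothetical names, and the real macros operate on crp_refcount and would use DAOS logging rather than fprintf.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical stand-in for an RPC object; only the refcount is modeled. */
struct traced_obj {
    atomic_int refcount;
};

/* Log every increment with the call site and the old/new count. */
#define TRACED_ADDREF(obj) \
    do { \
        int old_ = atomic_fetch_add(&(obj)->refcount, 1); \
        fprintf(stderr, "%s:%d %p addref %d -> %d\n", __func__, \
                __LINE__, (void *)(obj), old_, old_ + 1); \
    } while (0)

/* Log every decrement, then assert the count was not already zero. */
#define TRACED_DECREF(obj) \
    do { \
        int old_ = atomic_fetch_sub(&(obj)->refcount, 1); \
        fprintf(stderr, "%s:%d %p decref %d -> %d\n", __func__, \
                __LINE__, (void *)(obj), old_, old_ - 1); \
        assert(old_ != 0); \
    } while (0)

/* Small balanced scenario; returns the final reference count. */
static int
traced_demo(void)
{
    struct traced_obj obj;

    atomic_init(&obj.refcount, 1); /* creator's reference */
    TRACED_ADDREF(&obj);           /* e.g. taken for an in-flight send */
    TRACED_DECREF(&obj);           /* send path drops its reference */
    TRACED_DECREF(&obj);           /* creator drops the last reference */
    return atomic_load(&obj.refcount);
}
```

Because the message is printed before the assertion fires, the log still shows the call site of the offending decrement when the count underflows.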
So far, I have only seen the failure happen on the CORPC tree root. I am not sure whether there are other corner cases. I traced the RPC reference count: after crt_rpc_complete_and_unlock() returned to crt_corpc_complete(), the reference count was 2. Then crt_corpc_complete() called another RPC_DECREF(), so the reference count became 1. At that time, the RPC was still valid. So it is safe to check crt_rpc_completed() in crt_req_send().
int
crt_req_send(crt_rpc_t *req, crt_cb_t complete_cb, void *arg)
{
	struct crt_rpc_priv	*rpc_priv = NULL;
	bool			 locked = false;
	int			 rc = 0;
	...
out:
	/* internally destroy the req when failed */
	if (rc != 0) {
		if (!rpc_priv->crp_coll) {
			crt_rpc_complete_and_unlock(rpc_priv, rc);
			locked = false;
			/* failure already reported through complete cb */
			if (complete_cb != NULL)
				rc = 0;
		} else if (!crt_rpc_completed(rpc_priv)) {
			RPC_DECREF(rpc_priv);
		}
	}
	...
}
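The reference-count trace described above can be replayed with a toy model. This is only a sketch under stated assumptions, not the CaRT implementation: struct toy_rpc, toy_decref and toy_failed_corpc_send are hypothetical names; the starting count of 3 is inferred from the reported trace (2 after crt_rpc_complete_and_unlock() returns); and the final unconditional decrement is an assumption based on the earlier remark that crt_req_send() performs two RPC_DECREF() calls during cleanup. Any side effect of crt_rpc_completed() itself is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct crt_rpc_priv. */
struct toy_rpc {
    atomic_int refcount;
    bool       completed;
    int        underflows; /* counts "decref from zero" events */
};

/* Like RPC_DECREF, but records an underflow instead of asserting. */
static void
toy_decref(struct toy_rpc *rpc)
{
    if (atomic_fetch_sub(&rpc->refcount, 1) == 0)
        rpc->underflows++;
}

/* Replay the failed collective send on the tree root.
 * patched == true models the "else if (!crt_rpc_completed(...))" guard. */
static int
toy_failed_corpc_send(bool patched)
{
    struct toy_rpc rpc;

    atomic_init(&rpc.refcount, 3); /* assumed count before completion */
    rpc.completed  = false;
    rpc.underflows = 0;

    /* crt_corpc_complete() => crt_rpc_complete_and_unlock() */
    rpc.completed = true;
    toy_decref(&rpc); /* 3 -> 2 */
    /* crt_corpc_complete() drops another reference */
    toy_decref(&rpc); /* 2 -> 1 */
    assert(atomic_load(&rpc.refcount) == 1); /* matches the trace */

    /* error branch at the end of crt_req_send() */
    if (!patched || !rpc.completed)
        toy_decref(&rpc);
    /* assumed final cleanup decrement in crt_req_send() */
    toy_decref(&rpc);

    return rpc.underflows;
}
```

In this model the unpatched path decrements from zero once, while the guarded path ends with the count at exactly zero.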
I think it is fine to land it as is to fix the problem first. Later I'll look at it more closely to see if some details can be refined. Thx
I think it is fine to land it as is to fix the problem first. Later I'll look at it more closely to see if some details can be refined. Thx
@frostedcmos, what do you think? Currently, this issue often causes CR20 to fail during master/b26 daily tests. I prefer to make a simple fix first. There may be more work needed on the related CRT logic, but we can improve that later. Thanks!
I still don't like the fact that crt_rpc_completed() also sets the completed bit on its own, as it can later lead to unintended wrong behavior if the function is used in other places. However, with the current usages it's OK to land this as a fix.
I think the better solution would be to refine the completion logic for the error case when am_root==true, but we can address that later.
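The concern about a check that also sets the bit can be illustrated with a small model. The actual body of crt_rpc_completed() is not shown in this thread, so everything below is a hypothetical sketch: struct toy_state, toy_completed_test_and_set, toy_completed_peek and toy_second_query are made-up names that contrast a test-and-set style check with a side-effect-free read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical completion flag; not the real crt_rpc_priv layout. */
struct toy_state {
    atomic_bool completed;
};

/* Test-and-set style: reports the old value AND marks it completed. */
static bool
toy_completed_test_and_set(struct toy_state *s)
{
    return atomic_exchange(&s->completed, true);
}

/* Read-only style: reports the value without changing it. */
static bool
toy_completed_peek(struct toy_state *s)
{
    return atomic_load(&s->completed);
}

/* Query a fresh, never-completed object twice; return the second answer. */
static bool
toy_second_query(bool use_test_and_set)
{
    struct toy_state s;
    bool             first;

    atomic_init(&s.completed, false);
    first = use_test_and_set ? toy_completed_test_and_set(&s)
                             : toy_completed_peek(&s);
    assert(!first); /* nothing has completed it yet */

    return use_test_and_set ? toy_completed_test_and_set(&s)
                            : toy_completed_peek(&s);
}
```

With the test-and-set variant, the second query reports "completed" even though nothing but the first query touched the object, which is the kind of unintended behavior a new caller could run into.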
I am not against adjusting the crt_rpc_completed() related logic. That is not directly related to the current CR test failure in DAOS-16179, so we can do it in a separate, independent patch. As I understand it, some network expert will do that, right?
BTW, please help review the master version (#15476) to follow our landing process. Thanks!
Consider merging this with #15572 later.
Ticket does not have merge approval
Got the merge approval finally.
FYI the test tag should have been in the latest commit too:
For a collective RPC, when handling failure cases during crt_req_send(),
its reference may already have been released via crt_rpc_complete_and_unlock(),
which is triggered by crt_corpc_complete(). In that case, we should
check whether the RPC has completed before calling RPC_DECREF(),
to avoid releasing the RPC reference repeatedly.
The patch also initializes some local variables for the CHK RPC to avoid
accessing invalid DRAM when handling a failed collective CHK RPC.
It also includes some enhancements to the CR test logic.
Test-tag: test_daos_cat_recov_core
Allow-unstable-test: true
Signed-off-by: Fan Yong [email protected]
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: