DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 #15477

Nasf-Fan · 2024-11-08T16:56:18Z

For collective RPC, when handle failure cases during crt_req_send(),
its reference may has been released via crt_rpc_complete_and_unlock()
that is triggered by crt_corpc_complete(). Under such case, we should
check whether the RPC is completed or not before calling RPC_DECREF()
to avoid releasing the RPC reference repeatedly.

The patch also initializes some local variable for CHK RPC to avoid
accessing invalid DRAM when handle failed collective CHK RPC.

Some enhancement for CR test logic.

Test-tag: test_daos_cat_recov_core
Allow-unstable-test: true

Signed-off-by: Fan Yong [email protected]

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follows the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2024-11-08T16:56:36Z

Ticket title is 'recovery/cat_recov_core.py:CatRecovCoreTest.test_daos_cat_recov_core - server was not found in its expected state - 17 TEST(S) FAILED'
Status is 'In Progress'
Labels: 'ci_2.6_daily,ci_master_daily,daily_test'
https://daosio.atlassian.net/browse/DAOS-16170

daosbuild1 · 2024-11-08T17:39:22Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/1/testReport/

daosbuild1 · 2024-11-08T23:12:48Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15477/1/execution/node/1349/log

daosbuild1 · 2024-11-09T17:43:37Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/2/testReport/

daosbuild1 · 2024-11-10T05:38:03Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/3/testReport/

daosbuild1 · 2024-11-11T05:11:13Z

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

daosbuild1 · 2024-11-11T05:11:14Z

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

daosbuild1 · 2024-11-11T05:11:15Z

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

daosbuild1 · 2024-11-11T05:11:16Z

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

daosbuild1 · 2024-11-11T05:11:17Z

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

daosbuild1 · 2024-11-11T05:50:54Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/5/testReport/

daosbuild1 · 2024-11-12T15:49:24Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/6/testReport/

For collective RPC, when handle failure cases during crt_req_send(), its reference may has been released via crt_rpc_complete_and_unlock() that is triggered by crt_corpc_complete(). Under such case, we should check whether the RPC is completed or not before calling RPC_DECREF() to avoid releasing the RPC reference repeatedly. The patch also initializes some local variable for CHK RPC to avoid accessing invalid DRAM when handle failed collective CHK RPC. Some enhancement for CR test logic. Test-tag: test_daos_cat_recov_core Allow-unstable-test: true Signed-off-by: Fan Yong <[email protected]>

daosbuild1 · 2024-11-18T07:26:41Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/7/testReport/

Nasf-Fan · 2024-11-21T13:41:26Z

With the patch, CR tests passed on CI:
https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/7/artifact/Functional%20Hardware%20Medium%20Verbs%20Provider/recovery/cat_recov_core.py/

But seems only CR was tests this cycle, let's tigger another CI tests for others.

daosbuild1 · 2024-11-21T14:32:09Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/8/testReport/

Nasf-Fan · 2024-11-25T01:03:50Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/8/testReport/

Known NLT failure because of DAOS-16787, not related with the patch.

frostedcmos · 2024-11-25T02:45:50Z

src/cart/crt_rpc.c

@@ -1537,7 +1537,7 @@ crt_req_send(crt_rpc_t *req, crt_cb_t complete_cb, void *arg)
 			/* failure already reported through complete cb */
 			if (complete_cb != NULL)
 				rc = 0;
-		} else {
+		} else if (!crt_rpc_completed(rpc_priv)) {


Can you explain how an rpc can be validly completed at this point?

From my view, to complete, we would need to send it via 1515 rc = crt_req_send_internal(rpc_priv); , but that call should either a return rc on error, or send the rpc and invoke a completion cb. Your check here implies that somehow both can happen.

One note to make here, we only used crt_rpc_completed() to detect bugs in the ref counting logic or elsewhere before. I don't think we should start using this function to decide ref counting, and instead should fix underlying ref count problem instead

From my view, to complete, we would need to send it via 1515 rc = crt_req_send_internal(rpc_priv); , but that call should either a return rc on error, or send the rpc and invoke a completion cb. Your check here implies that somehow both can happen.

Here is the call steps according to my test logs:

crt_req_send() => crt_corpc_req_hdlr() => crt_tree_get_children() that hit failure and goto forward_done

int crt_corpc_req_hdlr(struct crt_rpc_priv *rpc_priv) { ... rc = crt_tree_get_children(co_info->co_grp_priv, co_info->co_grp_ver, rpc_priv->crp_flags & CRT_RPC_FLAG_FILTER_INVERT, co_info->co_filter_ranks, co_info->co_tree_topo, co_info->co_root, co_info->co_grp_priv->gp_self, &children_rank_list, &ver_match); if (rc != 0) { RPC_CERROR(crt_quiet_error(rc), DB_NET, rpc_priv, "crt_tree_get_children(group %s) failed: "DF_RC"\n", co_info->co_grp_priv->gp_pub.cg_grpid, DP_RC(rc)); crt_corpc_fail_parent_rpc(rpc_priv, rc); D_GOTO(forward_done, rc); } ... forward_done: if (rc != 0 && rpc_priv->crp_flags & CRT_RPC_FLAG_CO_FAILOUT) { crt_corpc_complete(rpc_priv); goto out; } ... }

Then crt_corpc_complete() => crt_rpc_complete_and_unlock() => RPC_DECREF()

So when crt_corpc_req_hdlr() returns back to its caller crt_req_send(), the PRC has already been completed with single reference left. If we still call another two RPC_DECREF() at the end of crt_req_send() for cleanup, then it will trigger assertion in the last RPC_DECREF():

#define RPC_DECREF(RPC) \ do { \ int __ref; \ __ref = atomic_fetch_sub(&(RPC)->crp_refcount, 1); \ D_ASSERTF(__ref != 0, "%p decref from zero\n", (RPC)); \

hmm. does this issue only happen on the issuer or the corpc or on the intermediate nodes as well that are part of corpc tree? I recall there is some special handling for 'am_root' in corpc completion and sounds like there might be ref count going wrong in that place as a result (in crt_corpc_complete()).

If you have a good reproducer are you able to instrument ADDREF/DECREF macros and add prints to track what is happening to ref counts in your failure case?

as to usage of 'crt_rpc_completed(rpc_priv)' -- it's not safe. if rpc is completed, then the access to rpc_priv is no longer valid, as the rpc_priv should be free-ed already once rpc is completed and ref count is 0; as mentioned before it was only used to previously to detect bugs in ref count logic.

Currently, I only saw the failure happened on the CORPC tree root. Not sure whether other corner cases. I traced the RPC reference count: after crt_rpc_complete_and_unlock() returned to crt_corpc_complete(), the reference was 2. And then crt_corpc_complete() called another RPC_DECREF(), so the reference became 1. At that time, the RPC was still valid. So it is safe to check crt_rpc_completed() in crt_req_send().

int crt_req_send(crt_rpc_t *req, crt_cb_t complete_cb, void *arg) { struct crt_rpc_priv *rpc_priv = NULL; bool locked = false; int rc = 0; ... out: /* internally destroy the req when failed */ if (rc != 0) { if (!rpc_priv->crp_coll) { crt_rpc_complete_and_unlock(rpc_priv, rc); locked = false; /* failure already reported through complete cb */ if (complete_cb != NULL) rc = 0; } else if (!crt_rpc_completed(rpc_priv)) { RPC_DECREF(rpc_priv); } } ... }

I think it is fine to land it as it to fix the problem first, later I'll check more details to see if some details can be refined. Thx

I think it is fine to land it as it to fix the problem first, later I'll check more details to see if some details can be refined. Thx

@frostedcmos , how do you think that? Currently, this issue often causes CR20 to be failed during master/b26 daily tests. I prefer to make a simple fix firstly. There may be more work for related CRT logic, but we can enhance that sometime later. Thanks!

I still dont like the fact that crt_rpc_completed() also sets completed bit on its own, as it can lead later to unintended wrong behaviors if the function is used in other places. However in the current usages it's ok to land this as a fix.

I think the better solution would be to refine completion logic for the error case when am_root==true, but we can address it later.

I am not against the adjustment for crt_rpc_completed() related logic. That is not directly related with current CR test failure in DAOS-16179. So we can do that in another independent patch. As my understand, some network expert will do that, right?

BTW, please help to review the master version (#15476) to follow our land process. Thanks!

can consider to merge with #15572 later

daltonbohning · 2024-12-11T18:53:24Z

Ticket does not have merge approval

Nasf-Fan · 2024-12-16T15:25:11Z

Ticket does not have merge approval

Got the merge approve finally.

daltonbohning · 2024-12-16T15:34:25Z

FYI the test tag should have been in the latest commit too: Test-tag: test_daos_cat_recov_core. But since that test passed previously, and then there was just a clean merge, and this is a backport, I think this is safe to land.

Nasf-Fan force-pushed the Nasf-Fan/DAOS-16170_2_b26 branch from 6548ecc to 3af43c9 Compare November 9, 2024 17:00

Nasf-Fan changed the title ~~DAOS-16170 chk: initialize local variable for CHK RPC - b26~~ DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 Nov 9, 2024

Nasf-Fan force-pushed the Nasf-Fan/DAOS-16170_2_b26 branch from 3af43c9 to 766d811 Compare November 10, 2024 04:55

Nasf-Fan force-pushed the Nasf-Fan/DAOS-16170_2_b26 branch 2 times, most recently from 52ba27a to bb8a0c2 Compare November 11, 2024 05:10

Nasf-Fan force-pushed the Nasf-Fan/DAOS-16170_2_b26 branch from bb8a0c2 to 0df56ec Compare November 18, 2024 06:44

Merge branch 'release/2.6' into Nasf-Fan/DAOS-16170_2_b26

90fa52b

Nasf-Fan marked this pull request as ready for review November 25, 2024 01:03

Nasf-Fan requested review from a team as code owners November 25, 2024 01:03

frostedcmos requested changes Nov 25, 2024

View reviewed changes

Nasf-Fan requested a review from liuxuezhao December 2, 2024 15:17

liuxuezhao approved these changes Dec 4, 2024

View reviewed changes

frostedcmos approved these changes Dec 5, 2024

View reviewed changes

Nasf-Fan requested a review from a team December 6, 2024 13:11

daltonbohning added the release-2.6.3 Targeted for 2.6.3 label Dec 10, 2024

daltonbohning removed the request for review from a team December 11, 2024 18:53

Nasf-Fan requested a review from a team December 16, 2024 15:25

daltonbohning added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Dec 16, 2024

daltonbohning merged commit da9001e into release/2.6 Dec 16, 2024
49 of 54 checks passed

daltonbohning deleted the Nasf-Fan/DAOS-16170_2_b26 branch December 16, 2024 15:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 #15477

DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 #15477

Nasf-Fan commented Nov 8, 2024 •

edited

Loading

github-actions bot commented Nov 8, 2024 •

edited

Loading

daosbuild1 commented Nov 8, 2024

daosbuild1 commented Nov 8, 2024

daosbuild1 commented Nov 9, 2024

daosbuild1 commented Nov 10, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 12, 2024

daosbuild1 commented Nov 18, 2024

Nasf-Fan commented Nov 21, 2024

daosbuild1 commented Nov 21, 2024

Nasf-Fan commented Nov 25, 2024

frostedcmos Nov 25, 2024 •

edited

Loading

frostedcmos Nov 25, 2024

Nasf-Fan Nov 26, 2024

frostedcmos Nov 26, 2024

Nasf-Fan Nov 27, 2024 •

edited

Loading

liuxuezhao Dec 4, 2024

Nasf-Fan Dec 5, 2024

frostedcmos Dec 5, 2024

Nasf-Fan Dec 6, 2024

liuxuezhao Dec 7, 2024

daltonbohning commented Dec 11, 2024

Nasf-Fan commented Dec 16, 2024

daltonbohning commented Dec 16, 2024

DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 #15477

DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 #15477

Conversation

Nasf-Fan commented Nov 8, 2024 • edited Loading

Before requesting gatekeeper:

Gatekeeper:

github-actions bot commented Nov 8, 2024 • edited Loading

daosbuild1 commented Nov 8, 2024

daosbuild1 commented Nov 8, 2024

daosbuild1 commented Nov 9, 2024

daosbuild1 commented Nov 10, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 11, 2024

daosbuild1 commented Nov 12, 2024

daosbuild1 commented Nov 18, 2024

Nasf-Fan commented Nov 21, 2024

daosbuild1 commented Nov 21, 2024

Nasf-Fan commented Nov 25, 2024

frostedcmos Nov 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Nasf-Fan Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daltonbohning commented Dec 11, 2024

Nasf-Fan commented Dec 16, 2024

daltonbohning commented Dec 16, 2024

Nasf-Fan commented Nov 8, 2024 •

edited

Loading

github-actions bot commented Nov 8, 2024 •

edited

Loading

frostedcmos Nov 25, 2024 •

edited

Loading

Nasf-Fan Nov 27, 2024 •

edited

Loading