Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 #15477

Merged
merged 2 commits into from
Dec 16, 2024

Conversation

Nasf-Fan
Copy link
Contributor

@Nasf-Fan Nasf-Fan commented Nov 8, 2024

For collective RPC, when handle failure cases during crt_req_send(),
its reference may has been released via crt_rpc_complete_and_unlock()
that is triggered by crt_corpc_complete(). Under such case, we should
check whether the RPC is completed or not before calling RPC_DECREF()
to avoid releasing the RPC reference repeatedly.

The patch also initializes some local variable for CHK RPC to avoid
accessing invalid DRAM when handle failed collective CHK RPC.

Some enhancement for CR test logic.

Test-tag: test_daos_cat_recov_core
Allow-unstable-test: true

Signed-off-by: Fan Yong [email protected]

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Copy link

github-actions bot commented Nov 8, 2024

Ticket title is 'recovery/cat_recov_core.py:CatRecovCoreTest.test_daos_cat_recov_core - server was not found in its expected state - 17 TEST(S) FAILED'
Status is 'In Progress'
Labels: 'ci_2.6_daily,ci_master_daily,daily_test'
https://daosio.atlassian.net/browse/DAOS-16170

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/1/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15477/1/execution/node/1349/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-16170_2_b26 branch from 6548ecc to 3af43c9 Compare November 9, 2024 17:00
@Nasf-Fan Nasf-Fan changed the title DAOS-16170 chk: initialize local variable for CHK RPC - b26 DAOS-16170 cart: do not release completed RPC reference repeatedly - b26 Nov 9, 2024
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/2/testReport/

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-16170_2_b26 branch from 3af43c9 to 766d811 Compare November 10, 2024 04:55
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/3/testReport/

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-16170_2_b26 branch 2 times, most recently from 52ba27a to bb8a0c2 Compare November 11, 2024 05:10
@daosbuild1
Copy link
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/4/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/5/testReport/

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/6/testReport/

For collective RPC, when handle failure cases during crt_req_send(),
its reference may has been released via crt_rpc_complete_and_unlock()
that is triggered by crt_corpc_complete(). Under such case, we should
check whether the RPC is completed or not before calling RPC_DECREF()
to avoid releasing the RPC reference repeatedly.

The patch also initializes some local variable for CHK RPC to avoid
accessing invalid DRAM when handle failed collective CHK RPC.

Some enhancement for CR test logic.

Test-tag: test_daos_cat_recov_core
Allow-unstable-test: true

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-16170_2_b26 branch from bb8a0c2 to 0df56ec Compare November 18, 2024 06:44
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/7/testReport/

@Nasf-Fan
Copy link
Contributor Author

With the patch, CR tests passed on CI:
https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15477/7/artifact/Functional%20Hardware%20Medium%20Verbs%20Provider/recovery/cat_recov_core.py/

But seems only CR was tests this cycle, let's tigger another CI tests for others.

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/8/testReport/

@Nasf-Fan
Copy link
Contributor Author

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15477/8/testReport/

Known NLT failure because of DAOS-16787, not related with the patch.

@Nasf-Fan Nasf-Fan marked this pull request as ready for review November 25, 2024 01:03
@Nasf-Fan Nasf-Fan requested review from a team as code owners November 25, 2024 01:03
@@ -1537,7 +1537,7 @@ crt_req_send(crt_rpc_t *req, crt_cb_t complete_cb, void *arg)
/* failure already reported through complete cb */
if (complete_cb != NULL)
rc = 0;
} else {
} else if (!crt_rpc_completed(rpc_priv)) {
Copy link
Contributor

@frostedcmos frostedcmos Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how an rpc can be validly completed at this point?

From my view, to complete, we would need to send it via 1515 rc = crt_req_send_internal(rpc_priv); , but that call should either a return rc on error, or send the rpc and invoke a completion cb. Your check here implies that somehow both can happen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One note to make here, we only used crt_rpc_completed() to detect bugs in the ref counting logic or elsewhere before. I don't think we should start using this function to decide ref counting, and instead should fix underlying ref count problem instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my view, to complete, we would need to send it via 1515 rc = crt_req_send_internal(rpc_priv); , but that call should either a return rc on error, or send the rpc and invoke a completion cb. Your check here implies that somehow both can happen.

Here is the call steps according to my test logs:

crt_req_send() => crt_corpc_req_hdlr() => crt_tree_get_children() that hit failure and goto forward_done

int
crt_corpc_req_hdlr(struct crt_rpc_priv *rpc_priv)
{
...
        rc = crt_tree_get_children(co_info->co_grp_priv, co_info->co_grp_ver,
                                   rpc_priv->crp_flags &
                                   CRT_RPC_FLAG_FILTER_INVERT,
                                   co_info->co_filter_ranks,
                                   co_info->co_tree_topo, co_info->co_root,
                                   co_info->co_grp_priv->gp_self,
                                   &children_rank_list, &ver_match);

        if (rc != 0) {
                RPC_CERROR(crt_quiet_error(rc), DB_NET, rpc_priv,
                           "crt_tree_get_children(group %s) failed: "DF_RC"\n",
                           co_info->co_grp_priv->gp_pub.cg_grpid, DP_RC(rc));
                crt_corpc_fail_parent_rpc(rpc_priv, rc);
                D_GOTO(forward_done, rc);
        }
...
forward_done:
        if (rc != 0 && rpc_priv->crp_flags & CRT_RPC_FLAG_CO_FAILOUT) {
                crt_corpc_complete(rpc_priv);
                goto out;
        }
...
}

Then crt_corpc_complete() => crt_rpc_complete_and_unlock() => RPC_DECREF()

So when crt_corpc_req_hdlr() returns back to its caller crt_req_send(), the PRC has already been completed with single reference left. If we still call another two RPC_DECREF() at the end of crt_req_send() for cleanup, then it will trigger assertion in the last RPC_DECREF():


#define RPC_DECREF(RPC)                                                                            \
        do {                                                                                       \
                int __ref;                                                                         \
                __ref = atomic_fetch_sub(&(RPC)->crp_refcount, 1);                                 \
                D_ASSERTF(__ref != 0, "%p decref from zero\n", (RPC));                             \

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. does this issue only happen on the issuer or the corpc or on the intermediate nodes as well that are part of corpc tree? I recall there is some special handling for 'am_root' in corpc completion and sounds like there might be ref count going wrong in that place as a result (in crt_corpc_complete()).

If you have a good reproducer are you able to instrument ADDREF/DECREF macros and add prints to track what is happening to ref counts in your failure case?

as to usage of 'crt_rpc_completed(rpc_priv)' -- it's not safe. if rpc is completed, then the access to rpc_priv is no longer valid, as the rpc_priv should be free-ed already once rpc is completed and ref count is 0; as mentioned before it was only used to previously to detect bugs in ref count logic.

Copy link
Contributor Author

@Nasf-Fan Nasf-Fan Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I only saw the failure happened on the CORPC tree root. Not sure whether other corner cases. I traced the RPC reference count: after crt_rpc_complete_and_unlock() returned to crt_corpc_complete(), the reference was 2. And then crt_corpc_complete() called another RPC_DECREF(), so the reference became 1. At that time, the RPC was still valid. So it is safe to check crt_rpc_completed() in crt_req_send().

int
crt_req_send(crt_rpc_t *req, crt_cb_t complete_cb, void *arg)
{
        struct crt_rpc_priv     *rpc_priv = NULL;
        bool                     locked = false;
        int                      rc = 0;
...
out:
        /* internally destroy the req when failed */
        if (rc != 0) {
                if (!rpc_priv->crp_coll) {
                        crt_rpc_complete_and_unlock(rpc_priv, rc);
                        locked = false;
                        /* failure already reported through complete cb */
                        if (complete_cb != NULL)
                                rc = 0;
                } else if (!crt_rpc_completed(rpc_priv)) {
                        RPC_DECREF(rpc_priv);
                }
        }
...
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to land it as it to fix the problem first, later I'll check more details to see if some details can be refined. Thx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to land it as it to fix the problem first, later I'll check more details to see if some details can be refined. Thx

@frostedcmos , how do you think that? Currently, this issue often causes CR20 to be failed during master/b26 daily tests. I prefer to make a simple fix firstly. There may be more work for related CRT logic, but we can enhance that sometime later. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still dont like the fact that crt_rpc_completed() also sets completed bit on its own, as it can lead later to unintended wrong behaviors if the function is used in other places. However in the current usages it's ok to land this as a fix.

I think the better solution would be to refine completion logic for the error case when am_root==true, but we can address it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not against the adjustment for crt_rpc_completed() related logic. That is not directly related with current CR test failure in DAOS-16179. So we can do that in another independent patch. As my understand, some network expert will do that, right?

BTW, please help to review the master version (#15476) to follow our land process. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can consider to merge with #15572 later

@Nasf-Fan Nasf-Fan requested a review from liuxuezhao December 2, 2024 15:17
@Nasf-Fan Nasf-Fan requested a review from a team December 6, 2024 13:11
@daltonbohning daltonbohning added the release-2.6.3 Targeted for 2.6.3 label Dec 10, 2024
@daltonbohning
Copy link
Contributor

Ticket does not have merge approval

@daltonbohning daltonbohning removed the request for review from a team December 11, 2024 18:53
@Nasf-Fan
Copy link
Contributor Author

Ticket does not have merge approval

Got the merge approve finally.

@Nasf-Fan Nasf-Fan requested a review from a team December 16, 2024 15:25
@daltonbohning daltonbohning added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Dec 16, 2024
@daltonbohning
Copy link
Contributor

FYI the test tag should have been in the latest commit too: Test-tag: test_daos_cat_recov_core. But since that test passed previously, and then there was just a clean merge, and this is a backport, I think this is safe to land.

@daltonbohning daltonbohning merged commit da9001e into release/2.6 Dec 16, 2024
49 of 54 checks passed
@daltonbohning daltonbohning deleted the Nasf-Fan/DAOS-16170_2_b26 branch December 16, 2024 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. release-2.6.3 Targeted for 2.6.3
Development

Successfully merging this pull request may close these issues.

5 participants