Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16824 cart: lower error messages to warn level #15529

Merged
merged 8 commits into from
Dec 3, 2024

Conversation

jxiong
Copy link
Contributor

@jxiong jxiong commented Nov 21, 2024

Lower some chatty messages from error level to warn level.

Ideally the error messages should be printed by callers because they should know how serious the error is.

This PR also removes some error messages because they are being printed repeatedly due to the same error.

Signed-off-by: Jinshan Xiong [email protected]

Allow-unstable-test: true
Change-Id: I71966a235c068abbd3c74c422458a68e3a86154a

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Change-Id: I71966a235c068abbd3c74c422458a68e3a86154a
@jxiong jxiong requested review from a team as code owners November 21, 2024 21:37
Copy link

github-actions bot commented Nov 21, 2024

Ticket title is 'Remove chatty error messages in cloud environment '
Status is 'Open'
Labels: 'GCP'
https://daosio.atlassian.net/browse/DAOS-16824

Copy link
Contributor

@frostedcmos frostedcmos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments inline

src/cart/crt_context.c Show resolved Hide resolved
Comment on lines 1157 to 1159
RPC_WARN(rpc_priv,
"failed to invoke RPC handler, rc: "DF_RC"\n",
DP_RC(rc));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i expect either crp_rpc_common_hdlr() or crt_corpc_common_hdlr() to print reason for failure if they return rc != 0, so probably we can just nuke message here altogether

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I kept the error message over there. How about we change this to INFO just in case it will miss thing?

Comment on lines 1367 to 1368
RPC_CERROR(crt_quiet_error(crt_cbinfo.cci_rc), DB_NET, rpc_priv,
"RPC failed; rc: " DF_RC "\n", DP_RC(crt_cbinfo.cci_rc));
RPC_WARN(rpc_priv,
"RPC failed; rc: " DF_RC "\n", DP_RC(crt_cbinfo.cci_rc));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need something like RPC_CWARN here that takes the same condition. you dont want to print warnings for all group version mismatches either, and those are very frequent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure.

Comment on lines 715 to 717
RPC_WARN(rpc_priv,
"RPC failed to execute on target. "
"error code: "DF_RC"\n", DP_RC(rc2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, you dont really want to print these as warns for DER_GRPVER either

Change-Id: I94190d2b4617a50cfbc8fad4f1a16f6973bedc1b
Copy link
Contributor

@frostedcmos frostedcmos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. 1 more place to change, noted inline

@@ -534,8 +534,8 @@ crt_rpc_complete_and_unlock(struct crt_rpc_priv *rpc_priv, int rc)
cbinfo.cci_rc = rpc_priv->crp_reply_hdr.cch_rc;

if (cbinfo.cci_rc != 0)
RPC_CERROR(crt_quiet_error(cbinfo.cci_rc), DB_NET, rpc_priv,
"failed, " DF_RC "\n", DP_RC(cbinfo.cci_rc));
RPC_WARN(rpc_priv,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be RPC_CWARN as well

frostedcmos
frostedcmos previously approved these changes Nov 22, 2024
Copy link
Contributor

@frostedcmos frostedcmos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, thanks:)

jolivier23
jolivier23 previously approved these changes Nov 22, 2024
@jxiong
Copy link
Contributor Author

jxiong commented Nov 22, 2024

NLT is broken - I have seen this from multiple of my PRs.

@mchaarawi
Copy link
Contributor

i don't have a strong opinion one way or the other for this change.
we do have a lot of chatty error logging in several areas, especially timeouts.
lowering to info does not really do much, because 99% of users actually do use info logging, not ERR or WARN.

my main concern on this is what if there are actual errors on the fabric (not just timeouts). are these also downgraded to warn / info messages?

@jxiong
Copy link
Contributor Author

jxiong commented Nov 22, 2024

i don't have a strong opinion one way or the other for this change. we do have a lot of chatty error logging in several areas, especially timeouts. lowering to info does not really do much, because 99% of users actually do use info logging, not ERR or WARN.

Well, at least the users would have a choice to adjust the log level now. Right now, the rate error message is at ~400 messages per second at error level.

On the server side, one of the major problem is that the same error now is printed thrice, in crt_rpc_handler_common(), crt_rpc_common_hdlr(), and crt_proc_out_common() respectively. This PR lowers the message level in crt_rpc_handler_common() and crt_proc_out_common() but keeps the error message on crt_rpc_common_hdlr().

my main concern on this is what if there are actual errors on the fabric (not just timeouts). are these also downgraded to warn / info messages?

This PR is not intended to change any internal errors in fabric. I tend to think the framework should only mind its own internal errors. The users will know how serious an error is then decide to print out something. As my example shows in the ticket, evetually the error will be printed by CLI, why would the framework bother printing those errors?

Another issue is the same as I mentioned previously - the same error is being printed over and over again in the stack. This PR does its best to remove those duplicated messages.

@jxiong jxiong requested a review from a team November 22, 2024 23:28
@jolivier23
Copy link
Contributor

i don't have a strong opinion one way or the other for this change. we do have a lot of chatty error logging in several areas, especially timeouts. lowering to info does not really do much, because 99% of users actually do use info logging, not ERR or WARN.

my main concern on this is what if there are actual errors on the fabric (not just timeouts). are these also downgraded to warn / info messages?

@mchaarawi our logs explorer allows us to filter messages based on the log level during or after the logs have been produced. Also, ERROR messages are marked "red" in the timeline and it would be useful to only see red when it's an actual error to make it easier to spot the root cause of real problems without having to sift through noise. Additionally, there is a psychological issue for customer who sees a bunch of errors vs warnings or informational messages. If we flag everything as an error, especially when it's not fatal, it gives a worse impression, IMO

soumagne
soumagne previously approved these changes Nov 25, 2024
Change-Id: I05409a9ab6187566de9e201bdaeac5ea02079136
@jxiong jxiong dismissed stale reviews from soumagne, jolivier23, and frostedcmos via b965d36 November 26, 2024 21:12
jolivier23
jolivier23 previously approved these changes Nov 27, 2024
frostedcmos
frostedcmos previously approved these changes Nov 27, 2024
@jxiong jxiong requested a review from soumagne November 27, 2024 23:22
@jxiong jxiong requested a review from jolivier23 November 27, 2024 23:22
@jolivier23
Copy link
Contributor

Looks like it hit some NLT issues. Try merging latest and running with

Allow-unstable-test: true in final commit message to make sure it runs full tests regardless.

@@ -48,8 +48,16 @@ main(int argc, char **argv)
rc = dfs_init();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes in this file are unrelated to the PR. a wrong merge perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops

@jxiong jxiong force-pushed the jxiong/lower_error_msgs branch from 3546683 to ea28d57 Compare December 3, 2024 01:02
@daosbuild1
Copy link
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15529/7/execution/node/334/log

@jolivier23 jolivier23 merged commit 9a4bec3 into master Dec 3, 2024
53 of 55 checks passed
@jolivier23 jolivier23 deleted the jxiong/lower_error_msgs branch December 3, 2024 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants