DAOS-16824 cart: lower error messages to warn level #15529
Change-Id: I71966a235c068abbd3c74c422458a68e3a86154a
Ticket title is 'Remove chatty error messages in cloud environment'
comments inline
src/cart/crt_hg.c
Outdated
RPC_WARN(rpc_priv,
"failed to invoke RPC handler, rc: "DF_RC"\n",
DP_RC(rc));
I expect either crt_rpc_common_hdlr() or crt_corpc_common_hdlr() to print the reason for failure if they return rc != 0, so we can probably just remove the message here altogether.
Yes, I kept the error message over there. How about we change this one to INFO, just in case we would otherwise miss something?
src/cart/crt_hg.c
Outdated
-RPC_CERROR(crt_quiet_error(crt_cbinfo.cci_rc), DB_NET, rpc_priv,
-"RPC failed; rc: " DF_RC "\n", DP_RC(crt_cbinfo.cci_rc));
+RPC_WARN(rpc_priv,
+"RPC failed; rc: " DF_RC "\n", DP_RC(crt_cbinfo.cci_rc));
We need something like RPC_CWARN here that takes the same condition. You don't want to print warnings for all group version mismatches either, and those are very frequent.
For sure.
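For illustration, a conditional warning along the lines discussed above could be sketched as follows. This is a minimal, self-contained sketch and not the DAOS implementation: every `sk_`/`SK_` name, the error codes, and the choice to route "quiet" errors to a debug level are assumptions made for the example. Only the idea of a macro that takes the same quiet-error condition as RPC_CERROR comes from the thread.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical log levels and error codes; these are not DAOS values. */
enum sk_level { SK_DEBUG, SK_WARN };
#define SK_ERR_GRPVER   (-1) /* stand-in for a frequent, expected error */
#define SK_ERR_TIMEDOUT (-2) /* stand-in for an error worth a warning */

static int sk_warn_count;  /* number of messages logged at warn level */
static int sk_debug_count; /* number of messages logged at debug level */

/* Hypothetical stand-in for a quiet-error predicate such as crt_quiet_error(). */
static bool sk_quiet_error(int rc)
{
	return rc == SK_ERR_GRPVER;
}

/* Minimal logger: counts messages per level and prints them to stderr. */
static void sk_log(enum sk_level lvl, const char *fmt, ...)
{
	va_list ap;

	assert(fmt != NULL);
	if (lvl == SK_WARN)
		sk_warn_count++;
	else
		sk_debug_count++;

	fprintf(stderr, "%s ", lvl == SK_WARN ? "WARN" : "DBUG");
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
}

/* Conditional warning: quiet errors drop to debug, everything else warns. */
#define SK_CWARN(quiet, fmt, ...) \
	sk_log((quiet) ? SK_DEBUG : SK_WARN, fmt, __VA_ARGS__)

/* Completion path: report a failed RPC at a level chosen by the condition. */
static void sk_complete(int rc)
{
	if (rc != 0)
		SK_CWARN(sk_quiet_error(rc), "RPC failed; rc: %d\n", rc);
}
```

With this shape, calling `sk_complete(SK_ERR_GRPVER)` logs at the debug level only, while `sk_complete(SK_ERR_TIMEDOUT)` produces a warning.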
src/cart/crt_hg_proc.c
Outdated
RPC_WARN(rpc_priv,
"RPC failed to execute on target. "
"error code: "DF_RC"\n", DP_RC(rc2));
Same as above: you don't really want to print these as warnings for DER_GRPVER either.
Change-Id: I94190d2b4617a50cfbc8fad4f1a16f6973bedc1b
looks good. 1 more place to change, noted inline
src/cart/crt_context.c
Outdated
@@ -534,8 +534,8 @@ crt_rpc_complete_and_unlock(struct crt_rpc_priv *rpc_priv, int rc)
 cbinfo.cci_rc = rpc_priv->crp_reply_hdr.cch_rc;

 if (cbinfo.cci_rc != 0)
-RPC_CERROR(crt_quiet_error(cbinfo.cci_rc), DB_NET, rpc_priv,
-"failed, " DF_RC "\n", DP_RC(cbinfo.cci_rc));
+RPC_WARN(rpc_priv,
this needs to be RPC_CWARN as well
looks good, thanks:)
NLT is broken - I have seen this on several of my PRs.
I don't have a strong opinion one way or the other on this change. My main concern is what happens if there are actual errors on the fabric (not just timeouts). Are these also downgraded to warn / info messages?
Well, at least users would now have a choice to adjust the log level. Right now, these messages are logged at error level at a rate of ~400 per second. On the server side, one of the major problems is that the same error is printed three times: in crt_rpc_handler_common(), crt_rpc_common_hdlr(), and crt_proc_out_common(). This PR lowers the message level in crt_rpc_handler_common() and crt_proc_out_common() but keeps the error message in crt_rpc_common_hdlr().
This PR is not intended to change any internal errors in the fabric. I tend to think the framework should only mind its own internal errors. The users know how serious an error is and can then decide whether to print something. As my example in the ticket shows, the error will eventually be printed by the CLI, so why would the framework bother printing it? Another issue, as I mentioned previously, is that the same error is printed over and over again in the stack. This PR does its best to remove those duplicated messages.
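The effect described above (one layer keeps the error, the others drop to warn, and users filter by level) can be sketched in a few lines. This is an illustrative sketch only: the three `sk2_` layers, the threshold variable, and the levels are hypothetical stand-ins, not the CaRT functions or the DAOS logging facility.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical log levels, ordered from least to most severe. */
enum sk2_level { SK2_INFO, SK2_WARN, SK2_ERR };

static enum sk2_level sk2_threshold = SK2_ERR; /* user-selected minimum level */
static int            sk2_emitted;             /* messages that passed the filter */

/* Emit a message only if its level is at or above the threshold. */
static void sk2_log(enum sk2_level lvl, const char *where, int rc)
{
	assert(where != NULL);
	if (lvl < sk2_threshold)
		return;
	sk2_emitted++;
	fprintf(stderr, "[%d] %s: rc=%d\n", (int)lvl, where, rc);
}

/* Innermost layer: keeps the error-level message for a failure. */
static int sk2_inner(int rc)
{
	if (rc != 0)
		sk2_log(SK2_ERR, "inner", rc);
	return rc;
}

/* Middle layer: reports the same failure, but only at warn level. */
static int sk2_middle(int rc)
{
	rc = sk2_inner(rc);
	if (rc != 0)
		sk2_log(SK2_WARN, "middle", rc);
	return rc;
}

/* Outer layer: also reports the same failure at warn level. */
static int sk2_outer(int rc)
{
	rc = sk2_middle(rc);
	if (rc != 0)
		sk2_log(SK2_WARN, "outer", rc);
	return rc;
}
```

With the threshold at the error level, a failure passing through all three layers yields a single message. Lowering the threshold to warn brings back all three, so the duplicates remain available when someone needs them.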
@mchaarawi our logs explorer allows us to filter messages by log level during or after log production. Also, ERROR messages are marked "red" in the timeline, and it would be useful to only see red when there is an actual error, making it easier to spot the root cause of real problems without having to sift through noise. Additionally, there is a psychological issue for customers who see a bunch of errors rather than warnings or informational messages. If we flag everything as an error, especially when it's not fatal, it gives a worse impression, IMO.
Change-Id: I05409a9ab6187566de9e201bdaeac5ea02079136
b965d36
Looks like it hit some NLT issues. Try merging latest and running with Allow-unstable-test: true in the final commit message to make sure it runs the full tests regardless.
src/tests/simple_dfs.c
Outdated
@@ -48,8 +48,16 @@ main(int argc, char **argv)
 rc = dfs_init();
changes in this file are unrelated to the PR. a wrong merge perhaps?
oops
3546683 to ea28d57
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15529/7/execution/node/334/log
Lower some chatty messages from error level to warn level.
Ideally the error messages should be printed by callers because they should know how serious the error is.
This PR also removes some error messages because they are being printed repeatedly due to the same error.
Signed-off-by: Jinshan Xiong [email protected]
Allow-unstable-test: true
Change-Id: I71966a235c068abbd3c74c422458a68e3a86154a
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: