Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable logging in tests and fix error log flooding from mock chain runtime #1017

Merged
merged 1 commit into from
Jun 3, 2021

Conversation

soareschen
Copy link
Contributor

@soareschen soareschen commented May 31, 2021

Closes: #1018

Description

This PR should cause a large number of error logs to be printed when running cargo tests on foreign_client::tests.


For contributor use:

  • Updated the Unreleased section of CHANGELOG.md with the issue.
  • If applicable: Unit tests written, added test to CI.
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments.
  • Re-reviewed Files changed in the Github PR explorer.

@soareschen soareschen changed the title Enable logging in foreign_client tests Enable logging in tests and fix error log flooding from mock chain runtime May 31, 2021
@soareschen
Copy link
Contributor Author

The error log flooding has now been fixed. There are 2 causes for this:

  • The MockChain returns an EventReceiver with the sender end being dropped immediately.
  • The ChainRuntime's run() method runs in a loop that keeps polling the receivers. Because the sender end is dropped, the receiver end errors out immediately. However the chain runtime merely prints the error message and continue looping. So the number of error messages printed is affected by how long the test itself has run.

The fix is as follows:

  • Add the event sender end to the MockChain struct, so that it never gets dropped.
  • Make ChainRuntime return from run() if event_receiver is disconnected. This can potentially break the runtime, as subsequent messages for request_receiver are not handled. However in the match arm for request_receiver also returns from the loop when any error happens. So the loop is already not fault tolerant anyway. The error should instead be propagated to the caller through the JoinHandle, and have higher level logic to handle the failure such as respawning the runtime.

@soareschen soareschen marked this pull request as ready for review May 31, 2021 20:06
@soareschen soareschen force-pushed the soares/enable-test-logging branch from bd5f5d9 to d5ea005 Compare May 31, 2021 20:10
@@ -209,6 +209,7 @@ impl From<IdentifiedAnyClientState> for IdentifiedClientState {
#[cfg(test)]
mod tests {
use std::convert::TryFrom;
use test_env_log::test;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should keep this around. Do you think it would be valuable to have logging enabled in tests going forward?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think they help especially when the tests are failing. Not having them enabled would only make it more inconvenient for contributors to write new tests and fix failing tests. The logging setup for Rust is not very straightforward, and just figuring out how to enable that can waste minutes to hour of a contributor's time.

Err(e) => error!("received error via event bus: {}", e),
Err(e) => {
error!("received error via event bus: {}", e);
return Err(Kind::Channel.into());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think it is by design that this match arm does not return Err(..) here but instead permits the loop to continue. For context, please see #895 (and follow-up #903). The idea is that the chain runtime would continue polling via select! here even if the monitor's connection to the full node is down (or some other error appears in the event_receiver), and it's the job of the monitor to retry and eventually heal if the connection to the node is re-established.

It would be good if you can verify that your changes do not introduce a regression. Could you please just double-check that? The recipe for testing is in the description of #903 (see "Tested with").

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I was trying earlier to test but was not sure how to exercise the Err path here. Any tips?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the source code of crossbeam, and the only possible error is when the sender end of the channel is dropped, and so it is unrecoverable. We probably never encounter this in production, because if the channel is disconnected, it will also result in the flooding of error messages in the relayer log.

Also, if you check the other branches, each of them are called with ? at the end. This means any error happening in those places will break the loop and return error. We should have another issue to address this by having some way to recover from any error inside the loop.

Copy link
Member

@adizere adizere Jun 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the findings & explanation!

I agree with you that the loop is not fault-tolerant anyway. I originally thought that returning in the Err path here could make the runtime (and the Hermes process in general) exit early (instead of waiting on the monitor to retry). But it seems there is a regression in master (#1026) and that has nothing to do with the changes in this branch.

We should have another issue to address this by having some way to recover from any error inside the loop.

I agree. If you know how to capture this in an issue, could you please handle that? We didn't notice these problems (yet), but it would be good to track this issue and add a general method for handling recovery.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, filed #1025 for that.

@soareschen soareschen merged commit 2647051 into master Jun 3, 2021
@soareschen soareschen deleted the soares/enable-test-logging branch June 3, 2021 11:44
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error logs are shown in tests when logging is enabled
3 participants