Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: deflake: use 2 miners for flaky tests #10764

Merged
merged 1 commit into from
Apr 27, 2023
Merged

fix: deflake: use 2 miners for flaky tests #10764

merged 1 commit into from
Apr 27, 2023

Conversation

arajasek
Copy link
Contributor

Related Issues

These tests fail quite often, and it's always because the PoSt process doesn't kick off, causing the network to halt eventually (because no blocks can be produced).

Proposed Changes

Run the network with two miners, in the hope that the second miner (that isn't being actively tested) will keep the blockchain going. If that doesn't work, we can try using the BeginMiningMustPoSt mode.

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

@arajasek arajasek requested a review from a team as a code owner April 26, 2023 14:34
@arajasek
Copy link
Contributor Author

The two affected tests passed once, rerunning!

Copy link
Contributor

@ZenGround0 ZenGround0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love it because it makes flakes less likely but still possible, unless I'm missing something.

I won't stop you giving it a try to see if we get lots of practical improvement. It might be a quick way to make things manageable in the short term.

@arajasek
Copy link
Contributor Author

@ZenGround0 My theory is that the miner being actively tested has a lot going on (sector imports, commits, etc.) that it needs to do with little hardware while also PoSting. I think it's reasonable say "we'll give you a pass on needing to PoSt for this test".

@arajasek
Copy link
Contributor Author

Attempt 2 was good too, let's do a third!

@ZenGround0
Copy link
Contributor

@ZenGround0 My theory is that the miner being actively tested has a lot going on (sector imports, commits, etc.) that it needs to do with little hardware while also PoSting. I think it's reasonable say "we'll give you a pass on needing to PoSt for this test".

I agree with all this but it doesn't mean that the second miner can't fail for similar reasons even if they are less likely to happen. This miner is usually in the same process as the actively tested miner as far as I understand and can also be impacted by tests on the first miner. It's possible that in some tests it actually is impossible for second miner to fail post too but there is no good general reason.

To truly remove flakiness we need rigorous guarantees on failure.

@arajasek
Copy link
Contributor Author

Attempt #3 also good. I'd like to merge this, and continue to monitor.

Separately, batch deals failed every single time, so that's going to be my next area of focus :P

@Stebalien
Copy link
Member

It seems unlikely to be a resource issue given the shared process. Maybe a message pool issue? If adding a second miner helps, it sounds like we have some kind of heavily contended lock somewhere.

However, I think "less flaky" is better than nothing. If this doesn't help, we can always revert.

I would add some code comments explaining why we're starting two (in case this code gets refactored and git history gets muddy).

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a "can't hurt to try it" kind of fix.

@arajasek arajasek merged commit 727a711 into master Apr 27, 2023
@arajasek arajasek deleted the asr/flaky-tests branch April 27, 2023 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants